Our research group investigates the intersection of Large Language Models (LLMs), multimodal learning (with a focus on vision), and graph theory. We aim to build more intelligent, robust, and explainable AI systems capable of understanding and interacting with the world in a meaningful way.

Potential topics:

Visual Knowledge Graphs & LLMs
Our research explores enhancing LLMs with structured visual knowledge. We automatically construct Knowledge Graphs from images, representing objects and their relationships, to improve tasks like image understanding and question answering. We focus on integrating these graphs into LLM reasoning processes for more accurate and explainable AI; a minimal sketch of this integration follows.
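The sketch below illustrates the idea under stated assumptions: object and relation detections are produced by an upstream vision model, and the graph is serialized as triples in the LLM prompt. The data format and function names are hypothetical, not an existing implementation.

```python
# Minimal sketch: turn hypothetical vision detections into graph triples
# and serialize them for an LLM prompt (names and formats are illustrative).
from dataclasses import dataclass

@dataclass(frozen=True)
class Triple:
    subject: str
    relation: str
    obj: str

def build_scene_triples(detections: list[dict]) -> list[Triple]:
    """Convert (subject, relation, object) detections into graph triples."""
    return [Triple(d["subject"], d["relation"], d["object"]) for d in detections]

def triples_to_prompt(triples: list[Triple], question: str) -> str:
    """Linearize the graph as text so an LLM can reason over it."""
    facts = "\n".join(f"({t.subject}, {t.relation}, {t.obj})" for t in triples)
    return f"Scene facts:\n{facts}\n\nQuestion: {question}\nAnswer:"

# Example usage with hypothetical detections from a single image:
detections = [
    {"subject": "person", "relation": "holding", "object": "umbrella"},
    {"subject": "person", "relation": "standing_on", "object": "sidewalk"},
]
print(triples_to_prompt(build_scene_triples(detections), "Is it likely raining?"))
```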

Embodied AI & Dynamic Scene Graphs
We develop intelligent agents that perceive their environment visually and interact with it using natural language. Our approach uses LLMs as the core ‘brain,’ combined with dynamic visual scene graphs to represent changing surroundings. This enables robots and simulated characters to navigate, plan actions, and respond to complex instructions.
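As a rough illustration of the "dynamic" aspect, the sketch below maintains a scene graph that is overwritten as new observations arrive and linearized for an LLM planner. The observation format and the class interface are assumptions for illustration only.

```python
# Minimal sketch of a dynamic scene graph (illustrative only; the
# observation format and the downstream LLM planner are assumptions).
class DynamicSceneGraph:
    def __init__(self):
        self.edges = set()  # (subject, relation, object) triples

    def update(self, observed: set[tuple[str, str, str]]) -> None:
        """Replace stale relations about re-observed subjects."""
        changed_subjects = {s for s, _, _ in observed}
        self.edges = {e for e in self.edges if e[0] not in changed_subjects}
        self.edges |= observed

    def describe(self) -> str:
        """Linearize the current graph so an LLM planner can condition on it."""
        return "; ".join(f"{s} {r} {o}" for s, r, o in sorted(self.edges))

graph = DynamicSceneGraph()
graph.update({("cup", "on", "table"), ("robot", "near", "table")})
graph.update({("cup", "in", "gripper")})  # state changes after a grasp
print(graph.describe())  # passed to the LLM alongside the user instruction
```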

Hallucination Detection via Graph Consistency
Image generation models powered by LLMs can produce impressive visuals, but often suffer from inconsistencies or factual errors (hallucinations). Our research addresses this challenge by building graph representations of expected visual scenes and using them to verify the generated images. This allows us to detect and mitigate hallucinations for more reliable AI.
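A minimal sketch of such a consistency check is shown below, assuming an expected scene graph parsed from the prompt and a detected graph extracted from the generated image; both parsers, the triple format, and the scoring rule are illustrative assumptions.

```python
# Minimal sketch of a graph-consistency check (illustrative; both graphs
# are assumed to come from upstream parsers of the prompt and the image).
def consistency_report(expected: set[tuple[str, str, str]],
                       detected: set[tuple[str, str, str]]) -> dict:
    """Flag expected relations that are missing from the generated image."""
    missing = expected - detected
    extra = detected - expected
    score = 1.0 - len(missing) / max(len(expected), 1)
    return {"missing": missing, "unexpected": extra, "consistency": score}

expected = {("dog", "left_of", "cat"), ("cat", "on", "sofa")}
detected = {("dog", "left_of", "cat"), ("cat", "under", "sofa")}
print(consistency_report(expected, detected))
# A low consistency score flags the image as a likely hallucination.
```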

Cross-Modal Retrieval with Enhanced Graph Embeddings
We aim to revolutionize image search by enabling users to find images based on their meaning, not just keywords. Our research uses LLMs to enhance graph embeddings of both images and text, allowing for more accurate cross-modal retrieval and richer visual discovery; a minimal retrieval sketch follows.
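The sketch below shows only the retrieval step, assuming image and text embeddings are produced elsewhere by a graph-aware encoder; the random vectors stand in for those embeddings and the function is hypothetical.

```python
# Minimal cross-modal retrieval sketch: rank images by cosine similarity
# to a text query embedding (embeddings here are random stand-ins).
import numpy as np

def top_k(query: np.ndarray, gallery: np.ndarray, k: int = 3) -> np.ndarray:
    """Return indices of the k gallery items most similar to the query."""
    q = query / np.linalg.norm(query)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    return np.argsort(g @ q)[::-1][:k]

rng = np.random.default_rng(0)
text_embedding = rng.normal(size=128)            # e.g. "a dog chasing a frisbee"
image_embeddings = rng.normal(size=(1000, 128))  # pre-indexed image gallery
print(top_k(text_embedding, image_embeddings))   # indices of the best matches
```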

Involved Researchers:

  • Valerio Venanzi