Graph-driven RAG: Approaches Compared (Part 1)
Earlier this year, Microsoft announced their GraphRAG project and ignited a storm of interest in how Knowledge Graphs can improve results from Generative AI. Their approach is novel, but certainly not the only pattern for augmenting LLM queries with graph data. In this series, we will take a high-level look at the characteristics of several graph-enabled RAG approaches (including Microsoft-style GraphRAG) and compare them to non-graph approaches on a simple use case.
This post is one in a series exploring the Graph + RAG + LLM equation.
- Motivation & Game Plan (This Post)
- Basic RAG & RAG with Context Compression
- Microsoft-style Graph RAG
- Knowledge Graph RAG
- Summary & Key Take-aways
Recap: What’s RAG again?
Retrieval-Augmented Generation (RAG) is a process to aid Large Language Models (LLMs) in answering questions about data they haven’t seen before. This could be because the data is newer than their training data, or because the data is private to an organization that didn’t train the model.
In general, the goal is to find “just the right data” to give the LLM alongside the query itself. To pick up on context clues from the query text, we must take both the query and the source data into the realm of LLM “embeddings”: a mathematical trick that represents the semantic meaning of data as a very long list of numbers (a vector), which is easy to compare. As long as some element of data or metadata is represented as embedding vectors, we can grab that data quickly at query time by embedding the query and finding the closest matches.
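To make that concrete, here is a rough sketch of the comparison step. The `embed` function and the example chunks below are placeholders for illustration only; they are not the embedding model or corpus we will actually use later in the series.

```python
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder: swap in a call to a real embedding model."""
    ...

# Pre-computed embeddings for each chunk of source data (or metadata about it)
chunks = [
    "Penguins live almost exclusively in the Southern Hemisphere...",
    "Penguin Books is a British publishing house...",
    "Tux the penguin is the official mascot of the Linux kernel...",
]
chunk_vectors = np.array([embed(c) for c in chunks])

# At query time: embed the query, then rank chunks by cosine similarity
query_vector = embed("What is the natural habitat of the penguin?")
scores = chunk_vectors @ query_vector / (
    np.linalg.norm(chunk_vectors, axis=1) * np.linalg.norm(query_vector)
)
top_chunks = [chunks[i] for i in np.argsort(scores)[::-1][:2]]
# top_chunks get handed to the LLM alongside the original question
```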
But what are we embedding and comparing? Raw data? Curated data? Metadata about the data? Summarized data?
Whatever we’re embedding, it’s all RAG
First, we should acknowledge that Microsoft-style GraphRAG doesn’t cover the entire Graph + RAG + LLM space. Their approach is aimed at one use case in particular: the summarization of large documents too bulky for a typical LLM context window. The novelty in their approach is in using a Knowledge Graph to create multiple levels of increasingly broad data summaries, until finally the LLM has baby-stepped its way to a summary of the entire corpus. This is potentially ground-breaking when the need is: “Tell me what this book is about.”
What about Knowledge Retrieval? Many of the most exciting AI use cases involve getting the right answer at the right time. If my question requires a single fact buried deep in a corpus of documents, I need my AI to find that needle in the haystack, not describe the shape of the haystack. To do this, RAG is still critical, but the nature of what’s in the RAG pipeline, and how we can use a graph to improve it, is very different.
The Field
To enable us to observe different approaches in action, we will experiment with simple examples of query answering using four different strategies:
- “Basic” RAG
- RAG with Context Compression
- Community Summarization Graph RAG (i.e. Microsoft-style Graph RAG)
- Knowledge Graph RAG
Rules of the Road
We aren’t going to performance-tune any of these examples, but scale is definitely a consideration. There are a lot of great-looking tutorials online for getting insights out of a single PDF document, but I am primarily interested in techniques that scale to span multiple domains of knowledge and many data sources across complex organizations.
One more restriction: There are a variety of Graph RAG solutions described online that assume as a precondition that an organization’s data has already been ingested or mapped into a Knowledge Graph. That is, it’s structured data, specifically structured as a graph. That is a VERY powerful place to be as an organization, but it is also a non-trivial undertaking. As such, comparing these approaches to RAG that starts with unstructured, raw text is like comparing apples and oranges. So, for our purposes we will consider pre-existing Knowledge Graphs to be “cheating”.
The Setup
Our implementation of each RAG strategy will leave experienced readers with obvious suggestions for improvement. That’s ok; keeping it simple should be enough to draw some interesting observations. Each implementation will share some basic ingredients:
- Document Corpus: Simplified English Wikipedia
- Embedding Model: Databricks hosted “databricks-bge-large-en”
- Vector Search: Databricks Vector Search
- Chat Model: Databricks hosted “databricks-dbrx-instruct”
- Orchestration: LangChain in a notebook
- Query Prompt: Simple “rlm/rag-prompt” from LangChainHub
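Wired together with LangChain, those shared ingredients might look something like the sketch below. The embedding model, chat model, and prompt are the ones listed above; the Vector Search endpoint and index names are made up for illustration, and the exact wiring will vary from post to post.

```python
from databricks.vector_search.client import VectorSearchClient
from langchain import hub
from langchain_community.chat_models import ChatDatabricks
from langchain_community.embeddings import DatabricksEmbeddings
from langchain_community.vectorstores import DatabricksVectorSearch

# Databricks-hosted embedding and chat models
embeddings = DatabricksEmbeddings(endpoint="databricks-bge-large-en")
chat_model = ChatDatabricks(endpoint="databricks-dbrx-instruct", max_tokens=500)

# Vector Search index over the Simple English Wikipedia corpus
# (the endpoint and index names below are hypothetical)
vsc = VectorSearchClient()
index = vsc.get_index(
    endpoint_name="wikipedia_vs_endpoint",
    index_name="main.rag_demo.simple_wikipedia_index",
)
vector_store = DatabricksVectorSearch(index, text_column="text", embedding=embeddings)
retriever = vector_store.as_retriever()

# Simple RAG prompt pulled from the LangChain Hub
prompt = hub.pull("rlm/rag-prompt")
```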
We will also use the same silly question across the board: “What is the natural habitat of the penguin?” Why is this question an interesting challenge? The term “penguin” occurs in the Wikipedia corpus 361 times in a surprising number of contexts:
- Linux mascot
- Euphemism for a Catholic nun
- Name of a popular publishing company
- Batman villain
- Silly-looking Antarctic animal
On the other hand, “penguin” is a word that is unlikely to have many synonyms. Synonyms pose no problem for an LLM, but their scarcity is a handy characteristic for a human trying to validate answers by keyword-searching the source text.
```python
user_query = "what is the natural habitat of the penguin?"
```
In the remaining posts in this series, we will see how these four approaches to RAG stack up, and how graph data might make an impact.