Graph-driven RAG: Approaches Compared (Part 5)
This post is one in a series exploring the Graph + RAG + LLM equation.
- Motivation & Game Plan
- Basic RAG & RAG with Context Compression
- Microsoft-style Graph RAG
- Knowledge Graph RAG
- Summary & Key Take-aways (This Post)
Quick Recap
Earlier this year, Microsoft announced their GraphRAG project and ignited a storm of interest in how Knowledge Graphs can improve results from Generative AI. Their approach is novel, but certainly not the only pattern for augmenting LLM queries with graph data. In this series, we took a high-level look at the characteristics of several graph-enabled RAG approaches (including Microsoft-style GraphRAG) and compared them against non-graph approaches on a simple use case. More information on the setup can be found in the first post.
Along the way, we were reminded that while some implementation details vary between approaches, all RAG applications use a semantic (embedded vector) comparison between the user query and some private data or metadata that will help the LLM answer the question at hand. The key difference is what gets embedded and retrieved.
- Basic RAG: Chunks of raw data
- RAG with Contextual Compression: Chunks of raw data, but condensed
- Microsoft-style Graph RAG: Summaries of data communities
- Knowledge Graph RAG: Graph concepts and relationships
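The shared mechanic behind all four variants can be sketched in a few lines. The snippet below is a toy illustration, not code from the series: the three-dimensional "embeddings" and the store entries are made-up stand-ins for a real embedding model's output, and only the `text` payloads differ between approaches, exactly as the list above describes.

```python
import math

def cosine_similarity(a, b):
    """Similarity between two embedding vectors (pure math, deterministic)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve(query_vec, store, k=2):
    """Return the k stored texts most similar to the query vector."""
    ranked = sorted(store, key=lambda item: cosine_similarity(query_vec, item["vec"]),
                    reverse=True)
    return [item["text"] for item in ranked[:k]]

# Hypothetical store: what changes per approach is WHAT was embedded,
# not the retrieval step itself.
store = [
    {"text": "raw chunk about penguin habitat",             "vec": [0.9, 0.1, 0.0]},  # Basic RAG
    {"text": "community summary of bird entities",          "vec": [0.2, 0.8, 0.1]},  # Graph RAG
    {"text": "graph fact: Penguin -LIVES_IN-> Antarctica",  "vec": [0.8, 0.3, 0.2]},  # KG RAG
]

query_vec = [1.0, 0.2, 0.1]  # stand-in for the embedded user query
top_docs = retrieve(query_vec, store, k=2)
```

Whatever lands in `top_docs` is then handed to the LLM as context; the approaches in this series differ almost entirely in how the `text` entries were produced.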
Reviewing the Results
As we pointed out along the way, the implementation details of these four RAG approaches have an impact on operational efficiency and the ability to scale. The graph-based approaches, for example, don't require the whole corpus of source data to be represented in an (expensive) vector store.
Those factors aside, most AI initiatives are ultimately judged by end users, who care only about the quality of the output. Let's see how we did against our test question.
user_query = "what is the natural habitat of the penguin?"
Basic RAG:
‘A penguin is a flightless bird that lives in the Southern Hemisphere, mostly in Antarctica. They are adapted to live in the water and on land. Penguins have a streamlined body, wings that have become flippers, and a scale-like feathers that help them swim. They eat krill, fish, and squid. Penguins are social animals and live in colonies. They lay one or two eggs at a time and take turns keeping them warm. Penguins are known for their waddling walk and their ability to slide on their bellies across the ice. There are 18 species of penguins, and they vary in size from the little penguin, which is about 16 inches tall, to the emperor penguin, which can be up to 3 feet tall. Penguins are popular animals and are often kept in zoos and aquariums. They are also studied by scientists to learn about their behavior, ecology, and evolution.’
My high school English teacher would have marked this harshly, as it meanders a bit into off-topic information. In terms of helpful details, we get Antarctica, Southern Hemisphere, and I suppose that they live in both water and on land.
RAG with Context Compression:
‘Penguins typically nest on the ground, often pushing rocks around the nest to protect it from flooding. Some birds, like puffins, dig tunnels in the ground or in cliffs for their nests, but penguins do not typically engage in this behavior.’
On each of these tests, I ran the final chain invocation a few times. RAG with Context Compression stood out as exhibiting the most variability between runs. Here's a second example:
‘The natural habitat of penguins is the Antarctic region, specifically on the ice and in the surrounding oceans. They feed on krill, fish, and cephalopods. Penguins are well-adapted to the cold climate, with waterproof feathers and a layer of blubber to keep them warm. They nest together in large groups, called colonies, on the ice to lay their eggs and raise their chicks. The Emperor penguin is the largest and tallest penguin species, and they are the only birds that can lay their eggs on ice. Other penguin species found in Antarctica include the King Penguin, Chinstrap Penguin, Macaroni Penguin, Gentoo Penguin, and Adelie Penguin.’
Why did we get this variability? Once source data has been embedded, the similarity search for documents against the query text is deterministic: comparing vectors is pure math and will return the same "best documents" every time. With context compression, however, we take a large group of documents and ask an LLM to compress or filter them — a decidedly non-deterministic task. The other approaches send a consistent list of documents to the LLM for a single summarization pass, and their outputs are correspondingly more stable.
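The distinction can be made concrete with a toy sketch. The `mock_compress` function below is a stand-in for a real LLM compression call (nothing from the series' actual chain); the explicit seeds simulate the token sampling a temperature > 0 model performs, which is what makes the compression step vary between runs while the retrieved list never changes.

```python
import random

# A frozen retrieval result, as deterministic similarity search over a
# fixed vector store would return it: same query in, same documents out.
retrieved = [
    "Penguins nest on the ground and push rocks around the nest.",
    "Penguins live in Antarctica and the surrounding Southern Ocean.",
    "Puffins dig tunnels in cliffs; penguins do not.",
]

def mock_compress(docs, seed=None):
    """Stand-in for an LLM compression/filter call. Real models sample
    tokens, so the subset they keep varies run to run; the seed here
    only simulates that sampling."""
    rng = random.Random(seed)
    return [d for d in docs if rng.random() > 0.4]

# Two "runs" of the same compression step keep different documents,
# while the retrieval list above is identical every time.
run_a = mock_compress(retrieved, seed=1)
run_b = mock_compress(retrieved, seed=2)
```

The variability we observed lives entirely in the compression step: any approach that inserts an extra LLM call between retrieval and the final answer inherits that call's non-determinism.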
Microsoft-style Graph RAG:
‘Penguins are found in a variety of habitats, including coastal areas, islands, and open ocean. They are adapted to both cold and temperate climates, with different species found in Antarctica, South America, Africa, and other regions.’
This is pretty much what I expected from an approach based on summarization. It's correct, but in the broadest possible sense. (Though I can't find any mention of Africa in the community summaries.) As mentioned in Part 3, the level of detail might be improved with better community selection, or an LLM with a larger context window. At minimum, this result highlights the risk of losing critical facts in approaches that rely on summarizing the source data.
It’s also worth noting that this “broad” result isn’t really any worse than the Basic RAG result. While it’s peppered with details like penguin height and diet, those details aren’t actually useful in answering the question at hand.
Knowledge Graph RAG:
‘Penguins are found in the Scotia Sea, which is partly in the Southern Ocean and mostly in the South Atlantic Ocean. This region is known for its stormy and cold weather, and the islands in this area are rocky and partly covered in ice and snow year-round. The Scotia Sea is an important habitat for various penguin species, including large numbers of King Penguins, as well as Chinstrap Penguin, Macaroni Penguin, Gentoo Penguin, Adelie Penguin, and Rockhopper Penguin.’
We’re evaluating these responses subjectively, but I think it’s fair to say this approach to RAG achieved the maximum level of on-topic detail available (both in comparison to the other methods and to my own skimming of the source docs).
It is, however, missing a few higher-level statements found in the other results. Specifically, it doesn’t tell me that penguins live mostly in Antarctica, which feels like a useful piece of context.
In Summary
When we think about LLMs communicating with humans in natural language, and then consider the semantic nature of a knowledge graph, it certainly seems intuitive that the two technologies would be complementary. I wrote this series to get beyond the hand-waving that happens on this topic and provide a concrete (albeit brief) primer for the larger Generative AI community on what Graph + RAG + LLM might look like in real terms.
Hopefully this quick overview helps newcomers to the knowledge graph world navigate the flurry of Graph RAG information coming at them.
What’s Next
In all Graph RAG approaches, there are key areas for improvement. Kobai is focused on bringing solutions to these, notably:
- Enhanced Entity Resolution for node classification, powered by our scalable Databricks graph backend.
- The examples above all started with unstructured, textual data as a source. In reality, organizations have significant structured data that can be used to maximize the utility of unstructured data. Kobai is developing unique features for hybrid structured/unstructured/semi-structured GenAI querying which will be discussed in future posts.
- A key area often overlooked in extracting graphs from unstructured data is including a user-generated semantic model as part of the process. Kobai Studio allows users to quickly and graphically create such a model, and use it with our SDK to enhance the quality of graph data extraction.