NashTech Blog

Introduction

In the ever-expanding world of AI, we’re always on the hunt for the next big thing. Retrieval-Augmented Generation (RAG) has been making waves, and now we’ve got a new player in town: Microsoft’s GraphRAG! 🌊

But wait, what about our old friend, LlamaIndex? In a previous post, Implementing Graph RAG with LlamaIndex for Enhanced Document Understanding, we took LlamaIndex for a spin and discovered how it could help overcome the limitations of traditional RAG approaches. Now, it’s time to get the party started with Microsoft’s latest release.

Picture this: it’s July 2024, and Microsoft drops a bombshell with GraphRAG, a data pipeline and transformation suite for extracting meaningful, structured data from unstructured text using Large Language Models (LLMs). In this blog, we’ll unravel the mysteries of GraphRAG and show you how to get it up and running with local LLMs from Ollama or Groq, plus a local embedding model. So buckle up, because it’s going to be a fun ride!

Understanding Microsoft’s GraphRAG

Microsoft’s GraphRAG is like the cool, new kid on the block who’s got all the latest tech tricks up its sleeve. It’s designed to turn chaotic text into beautifully structured data using the magical powers of LLMs. Although it loves OpenAI models, you can tweak it to play nicely with others too. Let’s dive into some of its shiny features:

Key Features of Microsoft’s GraphRAG:

  • Text Units: Chops up your input corpus into TextUnits, which are like the building blocks for the rest of the process. Think of them as the LEGO bricks of data analysis.
  • Entity Extraction: Unleashes the power of LLMs to extract entities, relationships, and key claims from TextUnits. It’s like having a super-sleuth at your service.
  • Hierarchical Clustering: Uses the fancy Leiden technique to cluster data hierarchically, offering a visual representation of entities and their relationships. Picture an intricate dance of data points.
  • Community Summarization: Summarizes each community and its constituents, providing a holistic view of the dataset. It’s like getting a bird’s eye view of a city map.
Sources: the Microsoft GraphRAG announcement blog and the Microsoft GraphRAG GitHub repository.
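To get a feel for the TextUnit step, here is a deliberately naive word-window chunker. This is illustrative only: GraphRAG’s real splitter is token-based and configurable, and `make_text_units` is a hypothetical helper, not part of the library.

```python
def make_text_units(text: str, size: int = 300, overlap: int = 50) -> list[str]:
    """Split text into overlapping word windows (a toy stand-in for TextUnits)."""
    words = text.split()
    step = size - overlap
    units = []
    for start in range(0, max(len(words), 1), step):
        units.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break
    return units

# A 700-word corpus yields three overlapping chunks: 0-299, 250-549, 500-699.
units = make_text_units("word " * 700, size=300, overlap=50)
print(len(units))  # 3
```

Each chunk overlaps its neighbor so entities mentioned near a boundary still appear whole in at least one unit.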

Querying Capabilities:

Microsoft’s GraphRAG isn’t just about crunching data; it’s got some serious querying chops too:

  • Global Search: Perfect for when you need to answer those big-picture questions. It leverages community summaries to deliver insights like a pro.
  • Local Search: Zooms in on specific entities, fanning out to their neighbors and associated concepts. It’s like being a detective on a hot trail.

Setting Up Microsoft’s GraphRAG Project

Ready to roll up your sleeves and dive in? Let’s set up Microsoft’s GraphRAG and see some magic happen!

Requirements:

  • Python Version: 3.10-3.12

Step 1: Prepare Your Environment

  • Create a Virtual Environment:
First, let’s set up a virtual environment to keep things neat and tidy. Run these commands to create and activate it:

```shell
python3 -m venv myenv
source myenv/bin/activate
```

This will create a new virtual environment named myenv and activate it, ensuring that any packages we install don’t interfere with other projects.

  • Install Required Packages:

Now, install the GraphRAG package using the following command:

```shell
pip install graphrag
```
  • Create the Input Directory:
Set up a directory to hold the documents you want to use for knowledge graph creation:

```shell
mkdir -p ./ragtest/input
```
  • Add Your Documents:
    Move any documents you want to use into the input directory you just created.

Step 2: Initialize Your Workspace Variables

Initialize your workspace by running the following command:

```shell
python -m graphrag.index --init --root ./ragtest
```

This command creates two files in the ./ragtest directory:

  • .env: Contains environment variables required for running the GraphRAG pipeline.
  • settings.yaml: Contains pipeline settings, allowing you to customize your configuration.
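As a rough sketch, the `llm` section of settings.yaml can be pointed at Groq’s OpenAI-compatible endpoint, while the embeddings section targets the local model we’ll wire up in Step 4. The model name here is a placeholder, and the exact schema varies between GraphRAG versions, so check the generated settings.yaml for the fields your version expects:

```yaml
llm:
  api_key: ${GRAPHRAG_API_KEY}   # read from the .env file
  type: openai_chat
  model: llama3-70b-8192         # placeholder; any Groq-hosted chat model
  api_base: https://api.groq.com/openai/v1

embeddings:
  llm:
    type: openai_embedding
    model: nomic-embed-text      # served locally via Ollama (see Step 4)
```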

Step 3: Configure Your Environment

Obtain your Groq API key and save it in the .env file. The .env file is crucial for setting up the necessary environment variables for the project.
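The generated .env contains a single variable that the default settings.yaml reads; replace the placeholder with your own key:

```shell
# ./ragtest/.env
GRAPHRAG_API_KEY=<your-groq-api-key>
```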

Using Local Embedding Models

Microsoft’s GraphRAG is designed to support OpenAI-compatible APIs. However, by making a few adjustments, we can use local embedding models, such as the Nomic-Embed-Text model, hosted using Ollama.

Step 4: Modify Embedding Configuration

To use a local embedding model, modify the openai_embeddings_llm.py file in the GraphRAG library. Follow these steps:

 

  • Find the Location of the openai_embeddings_llm.py File:
Run the following command to locate the file:

```shell
sudo find / -name openai_embeddings_llm.py
```
  • Navigate to the File:
Open the file in a code editor and replace the existing code with the following:

```python
# Copyright (c) 2024 Microsoft Corporation.
# Licensed under the MIT License

"""The EmbeddingsLLM class."""

from typing_extensions import Unpack

from graphrag.llm.base import BaseLLM
from graphrag.llm.types import (
    EmbeddingInput,
    EmbeddingOutput,
    LLMInput,
)

from .openai_configuration import OpenAIConfiguration
from .types import OpenAIClientTypes
import ollama


class OpenAIEmbeddingsLLM(BaseLLM[EmbeddingInput, EmbeddingOutput]):
    """An embeddings wrapper that routes requests to a local Ollama model."""

    _client: OpenAIClientTypes
    _configuration: OpenAIConfiguration

    def __init__(self, client: OpenAIClientTypes, configuration: OpenAIConfiguration):
        self._client = client
        self._configuration = configuration

    async def _execute_llm(
        self, input: EmbeddingInput, **kwargs: Unpack[LLMInput]
    ) -> EmbeddingOutput | None:
        # Embed each input text with the local nomic-embed-text model via
        # Ollama instead of calling the OpenAI embeddings API.
        embedding_list = []
        for inp in input:
            embedding = ollama.embeddings(model="nomic-embed-text", prompt=inp)
            embedding_list.append(embedding["embedding"])
        return embedding_list
```

In short, we import ollama and route every embedding request to the local nomic-embed-text model instead of the OpenAI API.
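The core of the patch is the per-input loop in `_execute_llm`. The same pattern, with the Ollama call stubbed out so it runs without a server (`fake_embed` and `embed_batch` are hypothetical stand-ins, not library code), looks like this:

```python
def fake_embed(model: str, prompt: str) -> dict:
    # Stand-in for ollama.embeddings(); returns a fixed-size fake vector.
    return {"embedding": [float(len(prompt))] * 4}

def embed_batch(inputs: list[str], model: str = "nomic-embed-text") -> list[list[float]]:
    # Mirrors the patched _execute_llm: one embedding call per input text.
    return [fake_embed(model, text)["embedding"] for text in inputs]

vectors = embed_batch(["hello", "graph rag"])
print(len(vectors), len(vectors[0]))  # 2 4
```

One request per text is simple but slow for large corpora; batching on the Ollama side is a natural optimization if your setup supports it.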

Running the GraphRAG Indexing Pipeline

Step 5: Execute the Indexing Pipeline

Run the indexing pipeline to process the input data and generate the knowledge graph:

```shell
python -m graphrag.index --root ./ragtest
```

The indexing process will execute all necessary steps to transform the input data into a structured knowledge graph.

Querying the Generated Knowledge Graph

Step 6: Query the Graph

You can query the generated knowledge graph using global or local search methods. Here’s an example of a global query for the document’s top five themes:

```shell
python -m graphrag.query --root ./ragtest --method global "What are the top 5 themes of the document?"
```
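For entity-centric questions, swap in the local search method; the question below is just an illustrative placeholder:

```shell
python -m graphrag.query --root ./ragtest --method local "Who is mentioned in the document, and how are they related?"
```

Global search draws on the community summaries, while local search fans out from the entities matched in your question.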

Step 7: Interpreting Query Results

The GraphRAG system will provide results based on the configured queries. These results offer insights into the structure and relationships within your data, leveraging the power of the knowledge graph and LLMs.

Conclusion

Microsoft’s GraphRAG provides a powerful framework for extracting structured data from unstructured text, offering a significant improvement over traditional RAG approaches. By integrating local LLMs and embedding models, we can customize GraphRAG to suit our specific needs, leveraging its robust querying capabilities to gain meaningful insights from complex datasets.
