NashTech Blog

Leveraging BigQuery as a Vector Store

In recent years, the use of BigQuery, Google’s fully managed, highly scalable data warehouse, has expanded beyond traditional data analytics. One intriguing application is using BigQuery as a vector store for machine learning and data science tasks.

This approach takes advantage of BigQuery's strengths in handling massive datasets and performing SQL queries at scale, allowing for efficient management and querying of vector data.

What is a Vector Store?

A vector store is a database built to store and index high-dimensional vectors, the numeric embeddings that machine learning models produce for text, images, and other data. It efficiently performs similarity searches, enabling quick retrieval of items based on their vector embeddings, and is commonly used in applications such as recommendation systems and semantic search.
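At its core, a similarity search ranks stored vectors by how close they are to a query vector. As a self-contained sketch, using made-up three-dimensional vectors rather than the much longer embeddings a real model would produce, cosine similarity in plain Python looks like this:

```python
import numpy as np

def cosine_similarity(a, b):
    # Cosine similarity: dot product of the vectors divided by the
    # product of their lengths; 1.0 means identical direction.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Hypothetical customer-preference vectors (real embeddings are far longer)
customers = {
    "customer1": np.array([1.0, 0.0, 0.0]),
    "customer2": np.array([0.9, 0.1, 0.0]),
    "customer3": np.array([0.0, 1.0, 0.0]),
}

def most_similar(query_vec, store):
    # Return the id of the stored vector closest to the query vector
    return max(store, key=lambda cid: cosine_similarity(query_vec, store[cid]))

print(most_similar(np.array([0.9, 0.12, 0.0]), customers))  # customer2
```

A vector store does exactly this, but over millions of vectors, with indexing so that it does not have to compare the query against every row.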

Why Use BigQuery as a Vector Store?

BigQuery provides several advantages for storing and querying vectors:

  1. Scalability: BigQuery handles petabytes of data, making it ideal for storing large collections of vectors.

  2. SQL Interface: It offers a familiar SQL interface for querying vectors, allowing you to easily integrate it with existing data pipelines and analysis tools.

  3. Performance: BigQuery optimizes query execution for fast performance, crucial for working with large-scale vector data.

  4. Managed Service: BigQuery manages infrastructure, scaling, and maintenance automatically, reducing your operational overhead.

Using BigQuery as a Vector Store

Let's explore a practical example of how you can leverage BigQuery as a vector store using Python and the google-cloud-bigquery library.

Example: Storing and Querying Vectors:

In this example, we’ll store vectors representing customer preferences in BigQuery and demonstrate how to query them.

Setup:

Ensure you have the google-cloud-bigquery library installed. If not, install it with pip:

pip install google-cloud-bigquery

Creating a Vector Table:

Assume we have vectors representing customer preferences stored as numpy arrays. We can create a table in BigQuery to store these vectors. Here’s a Python script to create and populate the table:

from google.cloud import bigquery
import numpy as np

# Initialize BigQuery client
client = bigquery.Client()

# Define schema for the table
schema = [
    bigquery.SchemaField("customer_id", "STRING", mode="REQUIRED"),
    bigquery.SchemaField("vector", "FLOAT64", mode="REPEATED"),
]

# Create a new table
table_id = "your_project.your_dataset.vector_store"
table = bigquery.Table(table_id, schema=schema)
table = client.create_table(table)

# Sample data: customer vectors (replace with your actual data)
customers = [
    {"customer_id": "customer1", "vector": np.random.rand(10).tolist()},
    {"customer_id": "customer2", "vector": np.random.rand(10).tolist()},
    # Add more customers as needed
]

# Insert data into BigQuery
errors = client.insert_rows_json(table_id, customers)
if errors:
    print(f"Errors: {errors}")
else:
    print("Data successfully inserted into BigQuery.")

Querying Vectors

Once the vectors are stored in BigQuery, you can query them using SQL. Here’s how to retrieve the vector for a specific customer; the function below fetches and prints the vector associated with "customer1".

def query_vectors(customer_id):
    query = f"""
        SELECT vector
        FROM `{table_id}`
        WHERE customer_id = @customer_id
    """
    # Pass customer_id as a query parameter rather than interpolating
    # it into the SQL string, which avoids SQL injection
    job_config = bigquery.QueryJobConfig(
        query_parameters=[
            bigquery.ScalarQueryParameter("customer_id", "STRING", customer_id)
        ]
    )
    query_job = client.query(query, job_config=job_config)
    results = query_job.result()

    for row in results:
        print(f"Vector for {customer_id}: {row.vector}")

# Example usage
query_vectors("customer1")
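Looking up a vector by key is only half the story; the real payoff of a vector store is similarity search. As a hedged sketch (the helper name `build_similarity_query` is ours, and we assume BigQuery's `ML.DISTANCE` function is available in your project), a cosine-distance ranking over the `vector_store` table could be expressed like this:

```python
def build_similarity_query(table_id: str) -> str:
    # Rank stored vectors by cosine distance to a query vector supplied
    # as an array parameter (@query_vector); smallest distance first.
    return f"""
        SELECT
            customer_id,
            ML.DISTANCE(vector, @query_vector, 'COSINE') AS distance
        FROM `{table_id}`
        ORDER BY distance
        LIMIT 5
    """

query = build_similarity_query("your_project.your_dataset.vector_store")
```

To run it, pass the query vector as a parameter, e.g. `client.query(query, job_config=bigquery.QueryJobConfig(query_parameters=[bigquery.ArrayQueryParameter("query_vector", "FLOAT64", query_vector)]))`.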


The previous example provided a basic illustration of the process. Now, we will delve into how to integrate this functionality using the Retrieval-Augmented Generation (RAG) approach, leveraging the power of large language models (LLMs).

RAG on BigQuery with LangChain

Picture needing quick information about the 2024 football matches, or guidance on your car’s maintenance schedule or roadside assistance. Traditionally, you’d sift through the owner’s manual for answers.

Could a model answer all of these questions on its own? It depends on the model’s knowledge base: facts outside its training data may be missing or simply made up. To ensure the model provides accurate answers without fabricating information, let’s see how we can overcome this issue.

We start by loading our data as documents into memory using LangChain’s PyPDFLoader:

from langchain_community.document_loaders import PyPDFLoader

loader = PyPDFLoader("example_data.pdf")
documents = loader.load_and_split()

Now, we will split our data into chunks:

from langchain.text_splitter import RecursiveCharacterTextSplitter

# Split the documents into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=50,
    separators=["\n\n", "\n", ".", "!", "?", ",", " ", ""],
)
doc_splits = text_splitter.split_documents(documents)

# Add chunk number to metadata
for idx, split in enumerate(doc_splits):
    split.metadata["chunk"] = idx

With text chunks in `doc_splits`, we’ll generate embeddings and store them in BigQuery. This involves setting up a LangChain Vector Store with `BigQueryVectorSearch`, configuring Google Cloud and BigQuery, and using the `textembedding-gecko` model from VertexAI.

Finally, we use the vector store to convert document chunks into embeddings and upload them to BigQuery using LangChain:

from langchain_google_vertexai import VertexAIEmbeddings
from langchain_community.vectorstores import BigQueryVectorSearch

embedding_model = VertexAIEmbeddings(
    model_name="textembedding-gecko@latest", project=PROJECT_ID
)

bq_vector_cars_manual = BigQueryVectorSearch(
    project_id=PROJECT_ID,
    dataset_name=DATASET,
    table_name=TABLE,
    location=REGION,
    embedding=embedding_model,
)

bq_vector_cars_manual.add_documents(doc_splits)

We can inspect the BigQuery table and confirm that it contains the document metadata, content, and text embedding.
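One quick way to do that inspection (a hedged sketch; `preview_query` is our own helper, and the actual column names are whatever `BigQueryVectorSearch` created) is to select a handful of rows:

```python
def preview_query(table_id: str, limit: int = 5) -> str:
    # Peek at a few rows to confirm the content, metadata, and
    # embedding columns were written.
    return f"SELECT * FROM `{table_id}` LIMIT {limit}"

query = preview_query("your_project.your_dataset.your_table")
```

Run it with `client.query(query).result()` and print each row.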

Query and Retrieval

With the text embeddings stored in BigQuery, we can search for relevant chunks to ground the model’s answers, the retrieval step of RAG. Set up a Vertex AI LLM and a LangChain retriever that fetches documents using BigQuery Vector Search.

from langchain_google_vertexai import VertexAI
from langchain.chains import RetrievalQA

llm = VertexAI(model_name="gemini-pro")

retriever = bq_vector_cars_manual.as_retriever()

For Q&A chains, the retriever integrates directly and requires no additional setup. When a question is asked:

  1. Convert it into a text embedding.

  2. Conduct a vector search in BigQuery to retrieve relevant document chunks.

  3. Use these chunks in the LLM’s prompt to generate a concise answer.

search_query = "How do I activate the rearview camera in the 2024 Google Starlight?"

retrieval_qa = RetrievalQA.from_chain_type(
    llm=llm, chain_type="stuff", retriever=retriever
)
retrieval_qa.invoke(search_query)

{'query': 'How do I activate the rearview camera in the 2024 Google Starlight?',
 'result': 'To activate the rearview camera in the 2024 Google Starlight, press the “Camera” button on the dashboard or shift into reverse gear.'}

Conclusion

Using BigQuery as a vector store allows efficient retrieval of relevant information by leveraging text embeddings. Integrate this with RAG techniques and LangChain to ensure precise and contextually accurate responses from your language model, enhancing overall data accessibility and relevance.
