NashTech Blog

RAG Enhancement: Finetuning Your Reranker Model


Key Takeaways

In RAG systems, a Reranker (Cross-Encoder) is essential for precision. While initial retrieval is optimized for speed and recall, the Reranker analyzes the deep interaction between query and context to ensure the most relevant data reaches the LLM. The core of effective finetuning is Hard Negative Mining: training the model on “distractors” that look relevant but are not. This sharpens the model’s ability to distinguish subtle differences. By automating this with tools like mine_hard_negatives, you transform a standard dataset into a specialized one, drastically reducing hallucinations and boosting overall search accuracy.

Why Reranking is crucial for Search (Especially in RAG)

In modern AI applications, particularly those leveraging Retrieval-Augmented Generation (RAG), finding the right source material is paramount to producing accurate and reliable answers. A poor retrieval step leads directly to a “garbage in, garbage out” scenario.

To ensure the best context is fed to the Large Language Model (LLM), the process of finding the right document for a user’s query is often split into two powerful stages: Retrieval and Reranking.

  • Retrieval (Recall): A fast, scalable model (typically a Bi-Encoder like BGE-M3 or an embedding model) quickly filters through your entire document corpus (the knowledge base). This stage is about speed and maximizing recall—making sure the correct answer is captured within the top 50–100 initial candidates.
  • Reranking (Precision): This is where the slower, more powerful Cross-Encoder Reranker, such as BAAI/bge-reranker-v2-m3, steps in. It analyzes the semantic relationship between the Query and each candidate passage before those passages are sent to the LLM. By assigning a precise relevance score to each pair, the Reranker pushes the most relevant source documents to the top, boosting precision and ensuring the LLM works with high-quality context.
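The two-stage flow can be sketched in a few lines. The scoring functions below are deliberately toy stand-ins (unigram and bigram overlap) for the real bi-encoder and cross-encoder; only the pipeline shape is the point.

```python
import re

def tokens(text):
    return re.findall(r"\w+", text.lower())

def retrieve(query, corpus, top_k=2):
    # Stage 1 (recall): cheap unigram-overlap score, a stand-in for a fast bi-encoder.
    q = set(tokens(query))
    return sorted(corpus, key=lambda doc: len(q & set(tokens(doc))), reverse=True)[:top_k]

def rerank(query, candidates):
    # Stage 2 (precision): bigram overlap over the joint (query, passage) pair,
    # a stand-in for a cross-encoder's relevance score.
    def bigrams(text):
        t = tokens(text)
        return set(zip(t, t[1:]))
    q = bigrams(query)
    return sorted(candidates, key=lambda doc: len(q & bigrams(doc)), reverse=True)

corpus = [
    "A transformer model is a neural network based on self-attention.",
    "A transformer is a device that transfers electrical energy between circuits.",
    "Recurrent networks process sequences one step at a time.",
]
query = "What is a transformer model?"
candidates = retrieve(query, corpus, top_k=2)  # recall stage: fetch likely matches
best_first = rerank(query, candidates)         # precision stage: reorder them
```

In a real system, `retrieve` would be an embedding search over the whole corpus and `rerank` would call the cross-encoder on each surviving (query, passage) pair.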

Dataset Preparation: The Crucial Role of Hard Negatives

The performance of your Reranker depends almost entirely on the quality of your training data. For Cross-Encoders, data must be structured as labeled pairs, and the key to success is leveraging Hard Negatives.

The Training Data Format

Your dataset must consist of triples: (Query, Passage, Label).

  • Query: The user’s search intent (e.g., “What is a transformer model?”).
  • Passage: The candidate text/document.
  • Label: A binary score (1 for relevant, 0 for irrelevant).
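A toy example of this format (the rows are made up for illustration):

```python
# Illustrative (Query, Passage, Label) triples.
train_triples = [
    {"query": "What is a transformer model?",
     "passage": "A transformer is a neural network architecture built on self-attention.",
     "label": 1},
    {"query": "What is a transformer model?",
     "passage": "Electrical transformers transfer energy between circuits via induction.",
     "label": 0},  # a hard negative: topically close, but not the answer
]
```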

Creating Hard Negatives

A “hard negative” is a non-relevant passage that is still semantically similar to the query, making it difficult for the model to distinguish from a positive answer. Training on these difficult cases forces the model to learn subtle differences, resulting in robust performance.

Here is how you can generate them automatically using the sentence_transformers library:

  1. Start with Positives: Use your existing dataset of correct pairs: (Query A, Positive Passage A).
  2. Select an Efficient Embedding Model: Use a fast Bi-Encoder (like a smaller BGE variant) to quickly encode all your queries and passages.
  3. Mine Hard Negatives: Use the mine_hard_negatives utility function. This function uses the fast embeddings to:
    • Compare Query A with all other Positive Passages (B, C, D…) in your corpus.
    • Identify passages (like Positive B or C) that have a suspiciously high similarity score with Query A, even though they are not Query A’s correct answer.
    • These are your Hard Negatives.
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import mine_hard_negatives

# A fast bi-encoder used only for mining (any small BGE variant works).
embedding_model = SentenceTransformer("BAAI/bge-small-en-v1.5")

# The function turns your (query, positive passage) dataset into
# (query, passage, label) rows, with mined negatives labeled 0.
hard_train_dataset = mine_hard_negatives(
    train_dataset,
    embedding_model,
    num_negatives=5,   # how many hard negatives per positive pair
    max_score=0.8,     # skip candidates scoring above 0.8 (likely true positives)
    use_faiss=True,    # recommended for large datasets
    output_format="labeled-pair",  # emit labeled pairs (sentence-transformers >= 3.1)
)

The resulting dataset will be a mix of Positive Pairs (Label=1) and Hard Negative Pairs (Label=0), ready for finetuning.

Finetuning the Cross-Encoder Model

The BGE Reranker is a standard Transformer model configured for Sequence Classification with a single output (the relevance score).

Step 3.1: Load Model and Tokenizer

We use AutoModelForSequenceClassification and set num_labels=1 for the single relevance score output.

from transformers import AutoModelForSequenceClassification, AutoTokenizer

model_name = "BAAI/bge-reranker-v2-m3"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=1)

Step 3.2: Prepare Data for Cross-Encoding

The key difference from Bi-Encoders is that the Query and the Passage are fed to the tokenizer together as a single pair, so the model attends over their full interaction in one sequence.

def preprocess_function(examples):
    # Cross-Encoder format: [CLS] Query [SEP] Passage [SEP]
    return tokenizer(
        examples["query"],
        examples["passage"],
        truncation=True,
        max_length=512,
        padding="max_length"
    )
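Conceptually, each pair is packed into one joint sequence. A pure-Python illustration of the `[CLS] Query [SEP] Passage [SEP]` layout (the marker strings here follow BERT-style conventions; the actual model may use different special tokens):

```python
def cross_encoder_input(query, passage, cls="[CLS]", sep="[SEP]"):
    # One joint sequence lets the attention layers model query-passage interaction.
    return f"{cls} {query} {sep} {passage} {sep}"

example = cross_encoder_input(
    "What is a transformer model?",
    "A transformer is a neural network.",
)
```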

Step 3.3: Define Training Parameters

The training process uses standard techniques, but pay attention to the batch size, as Cross-Encoders consume significantly more GPU memory.

from transformers import TrainingArguments, Trainer

training_args = TrainingArguments(
    output_dir="./bge_reranker_finetuned",
    num_train_epochs=3,                  
    per_device_train_batch_size=8,       # Keep this low (e.g., 4, 8, or 16)
    gradient_accumulation_steps=4,       # Use this to simulate a larger batch size
    learning_rate=2e-5,                  # Typical Finetuning LR
    evaluation_strategy="epoch",
)
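Gradient accumulation trades optimizer steps for memory: the optimizer only steps after several small batches have accumulated, so the effective batch size is the product below (assuming a single GPU):

```python
# Effective batch size under the settings above (assumption: one GPU).
per_device_train_batch_size = 8
gradient_accumulation_steps = 4
num_devices = 1
effective_batch_size = (per_device_train_batch_size
                        * gradient_accumulation_steps
                        * num_devices)
print(effective_batch_size)  # → 32
```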

Step 3.4: Train the Model

The Trainer handles the details. Note that with num_labels=1 the model defaults to a regression-style head trained with mean-squared-error loss, so cast the binary 0/1 labels to floats before training; if you specifically want a binary cross-entropy objective, you can subclass Trainer and override compute_loss with BCEWithLogitsLoss.

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_train_data,
    # eval_dataset=tokenized_eval_data,
    tokenizer=tokenizer,
)

trainer.train()
trainer.save_model("./my_custom_reranker")
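At inference time, the finetuned model emits one logit per (query, passage) pair, and those scores order the candidates before they reach the LLM. A minimal, model-free sketch of that final selection step (the scores below are made-up logits):

```python
def top_k_context(passages, scores, k=3):
    # Keep the k highest-scoring passages as LLM context, best first.
    ranked = sorted(zip(scores, passages), key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in ranked[:k]]

# Hypothetical reranker logits for five candidate passages:
passages = ["p1", "p2", "p3", "p4", "p5"]
scores = [0.2, 2.7, -1.1, 1.9, 0.4]
context = top_k_context(passages, scores, k=2)  # → ["p2", "p4"]
```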

By following this process, you effectively specialize the BGE Reranker on your specific domain, ensuring that your search application provides the highest possible precision when ranking the final set of results.


Hung Nguyen Dinh

I am an AI Tech Lead at NashTech Vietnam. I have been with the company for over 10 years. I am passionate about exploring new technologies and knowledge in software development and the AI field.


