NashTech Blog

Build Your Own Image Classifier with SigLIP2: Pre-computed Embeddings + PostgreSQL Vector Search for Keywords & Shoot-Type Detection


Why Build Your Own Instead of Using Cloud APIs?

When you need to classify images by keywords or detect “shoot types” (wedding, portrait, product photography, etc.), the easiest route is often cloud services like Google Vision API, AWS Rekognition, or Azure Computer Vision. But there are compelling reasons to build your own:

  • Cost Control: No per-request fees that scale with usage
  • Privacy: Keep your images on your infrastructure
  • Customization: Train on your specific domain and terminology
  • Speed: No network latency for API calls
  • Independence: No vendor lock-in or API changes

In this tutorial, we’ll build a production-ready image classification system using SigLIP2, an open-source vision-language model, combined with PostgreSQL’s pgvector for lightning-fast similarity search.

The Architecture

Our system has three main components:

  1. Pre-computed Embeddings: Generate and store vector embeddings for all your keywords and shoot types
  2. PostgreSQL + pgvector: Store embeddings and perform cosine similarity search
  3. Python FastAPI Service: Process images and find best matches using similarity search

The key insight: instead of computing embeddings on-the-fly for every possible label, we pre-generate them once and use similarity search to find the best matches in milliseconds.
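To make that insight concrete, here is a minimal, dependency-free sketch of the runtime lookup: label embeddings are computed once up front, and each incoming image embedding is ranked against them with a dot product (which equals cosine similarity for unit-length vectors). The 3-dimensional vectors and labels below are toy stand-ins for SigLIP2's 768-dimensional embeddings.

```python
# Illustrative sketch: pre-computed label vectors + top-k lookup.
# The 3-dim vectors below are toy stand-ins for 768-dim SigLIP2 embeddings.

def top_k_labels(image_vec, label_vecs, k=2):
    """Rank labels by dot product (cosine similarity for unit vectors)."""
    scored = [
        (sum(a * b for a, b in zip(image_vec, vec)), name)
        for name, vec in label_vecs.items()
    ]
    scored.sort(reverse=True)
    return [name for _, name in scored[:k]]

# Pre-computed once (setup phase):
labels = {
    "wedding":  [0.9, 0.1, 0.1],
    "portrait": [0.1, 0.9, 0.1],
    "product":  [0.1, 0.1, 0.9],
}

# Per request (runtime phase):
print(top_k_labels([0.8, 0.3, 0.1], labels))  # ['wedding', 'portrait']
```

In the real system, pgvector performs this ranking inside PostgreSQL so the label vectors never have to be loaded into application memory.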

How It Works

┌─────────────────┐
│  Setup Phase    │
│  (Run Once)     │
└────────┬────────┘
         │
         ▼
┌─────────────────────────────────────┐
│ 1. Define Keywords & Shoot Types    │
│    - "wedding", "bride", "groom"    │
│    - "portrait", "studio lighting"  │
│    - "product photography"          │
└────────┬────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────┐
│ 2. Generate Text Embeddings         │
│    SigLIP2: text → 768-dim vector   │
└────────┬────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────┐
│ 3. Store in PostgreSQL (pgvector)   │
│    CREATE INDEX for fast search     │
└─────────────────────────────────────┘

┌─────────────────┐
│ Runtime Phase   │
│ (Per Request)   │
└────────┬────────┘
         │
         ▼
┌─────────────────────────────────────┐
│ 1. Upload Image                     │
└────────┬────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────┐
│ 2. Generate Image Embedding         │
│    SigLIP2: image → 768-dim vector  │
└────────┬────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────┐
│ 3. Similarity Search (cosine)       │
│    Find top-K closest embeddings    │
└────────┬────────────────────────────┘
         │
         ▼
┌─────────────────────────────────────┐
│ 4. Return Matched Keywords          │
│    ["wedding", "outdoor", ...]      │
└─────────────────────────────────────┘

Step 1: Database Setup with pgvector

First, set up PostgreSQL with the pgvector extension for efficient vector similarity search:

-- Enable extensions
CREATE EXTENSION IF NOT EXISTS vector;
CREATE EXTENSION IF NOT EXISTS pgcrypto;

-- Create table for embeddings
CREATE TABLE shoot_type_embeddings (
    id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
    name TEXT NOT NULL,
    prompt TEXT NOT NULL,
    embedding vector(768),  -- SigLIP2 outputs 768-dimensional vectors
    created_at TIMESTAMPTZ DEFAULT now()
);

-- Create index for fast similarity search
CREATE INDEX idx_shoot_type_embeddings
ON shoot_type_embeddings
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);

The ivfflat index enables fast approximate nearest-neighbor search, keeping queries quick even with thousands of embeddings (at the cost of slightly reduced recall compared with an exact scan).
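Conceptually, ivfflat partitions the stored vectors into `lists` buckets around centroids and, at query time, scans only the bucket(s) whose centroids are nearest the query. The following is a toy, pure-Python sketch of that idea only — not the actual pgvector implementation, which builds its centroids with k-means and stores everything on disk. The two hand-picked centroids here are illustrative.

```python
# Toy sketch of the IVF idea behind ivfflat: partition vectors into lists
# (buckets around centroids), then search only the nearest bucket(s).

def dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

centroids = {0: [0.0, 0.0], 1: [10.0, 10.0]}   # one centroid per "list"
buckets = {0: [], 1: []}

def insert(vec):
    # Assign each vector to the bucket of its nearest centroid.
    bucket = min(centroids, key=lambda c: dist(vec, centroids[c]))
    buckets[bucket].append(vec)

def search(query, probes=1):
    # Scan only the `probes` buckets nearest the query (sub-linear work),
    # then do an exact scan within those candidates.
    nearest = sorted(centroids, key=lambda c: dist(query, centroids[c]))[:probes]
    candidates = [v for b in nearest for v in buckets[b]]
    return min(candidates, key=lambda v: dist(query, v))

for v in [[0.1, 0.2], [9.8, 10.1], [0.3, 0.1], [10.2, 9.9]]:
    insert(v)

print(search([9.9, 10.1]))  # nearest stored vector in the probed bucket
```

In pgvector, the number of buckets scanned is controlled with `SET ivfflat.probes = N;` — raising it trades speed for recall.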

Step 2: Python Service with FastAPI

Install the required dependencies:

pip install fastapi uvicorn pillow torch torchvision transformers psycopg2-binary numpy

Here’s the complete FastAPI service:

from fastapi import FastAPI, File, UploadFile, HTTPException
from pydantic import BaseModel
from transformers import AutoModel, AutoProcessor
from PIL import Image
import torch
import io
import psycopg2
import numpy as np
import os

# ========== CONFIG ==========
DATABASE_URL = os.getenv("DATABASE_URL", "postgresql://user:pass@localhost:5432/db")
DEVICE = "cuda" if torch.cuda.is_available() else "cpu"
MODEL_NAME = "google/siglip2-base-patch16-256"
VECTOR_DIM = 768

# ========== LOAD MODEL ==========
model = AutoModel.from_pretrained(MODEL_NAME).to(DEVICE)
processor = AutoProcessor.from_pretrained(MODEL_NAME)

app = FastAPI()

class TextEmbedRequest(BaseModel):
    name: str
    prompt: str

def get_conn():
    return psycopg2.connect(DATABASE_URL)

def normalize(vec):
    """Normalize vector for cosine similarity"""
    arr = vec.cpu().numpy().astype("float32")
    norm = np.linalg.norm(arr)
    return (arr / norm).tolist() if norm > 0 else arr.tolist()

# ---------- TEXT EMBEDDING ----------
@app.post("/embed/text")
async def embed_text(req: TextEmbedRequest):
    """Pre-generate embeddings for keywords/shoot types"""
    try:
        # SigLIP text towers expect fixed-length padded input
        inputs = processor(
            text=req.prompt, padding="max_length", return_tensors="pt"
        ).to(DEVICE)
        with torch.no_grad():
            features = model.get_text_features(**inputs)

        emb = normalize(features[0])
        # Serialize as a pgvector literal, e.g. "[0.1,0.2,...]"
        emb_literal = "[" + ",".join(str(x) for x in emb) + "]"

        conn = get_conn()
        cur = conn.cursor()
        cur.execute(
            "INSERT INTO shoot_type_embeddings (name, prompt, embedding) "
            "VALUES (%s, %s, %s::vector) RETURNING id",
            (req.name, req.prompt, emb_literal)
        )
        row = cur.fetchone()
        conn.commit()
        cur.close()
        conn.close()

        return {"id": str(row[0]), "name": req.name}
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

# ---------- IMAGE CLASSIFICATION ----------
@app.post("/classify/image")
async def classify_image(file: UploadFile = File(...), top_k: int = 5):
    """Classify image by finding similar embeddings"""
    data = await file.read()
    image = Image.open(io.BytesIO(data)).convert("RGB")

    # Generate image embedding
    inputs = processor(images=image, return_tensors="pt").to(DEVICE)
    with torch.no_grad():
        features = model.get_image_features(**inputs)

    emb = normalize(features[0])
    # Serialize as a pgvector literal, e.g. "[0.1,0.2,...]"
    emb_literal = "[" + ",".join(str(x) for x in emb) + "]"

    # Similarity search in PostgreSQL (<=> is pgvector's cosine-distance operator)
    conn = get_conn()
    cur = conn.cursor()

    cur.execute(
        """
        SELECT id, name, prompt, embedding <=> %s::vector AS distance
        FROM shoot_type_embeddings
        ORDER BY embedding <=> %s::vector
        LIMIT %s
        """,
        (emb_literal, emb_literal, top_k)
    )

    rows = cur.fetchall()
    cur.close()
    conn.close()

    results = []
    for r in rows:
        results.append({
            "id": str(r[0]),
            "name": r[1],
            "prompt": r[2],
            "distance": float(r[3]),
            "similarity": 1 - float(r[3])  # Convert distance to similarity score
        })

    return {"matches": results}

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Step 3: Pre-populate Your Keywords

Before classifying images, you need to populate the database with your keywords and shoot types:

import requests

keywords = [
    {"name": "wedding", "prompt": "wedding ceremony with bride and groom"},
    {"name": "portrait", "prompt": "portrait photography of a person"},
    {"name": "studio_portrait", "prompt": "studio portrait with professional lighting"},
    {"name": "outdoor", "prompt": "outdoor photography in natural setting"},
    {"name": "product", "prompt": "product photography on white background"},
    {"name": "fashion", "prompt": "fashion photography and modeling"},
    {"name": "food", "prompt": "food photography of meal or dish"},
    {"name": "sports", "prompt": "sports action photography"},
    {"name": "newborn", "prompt": "newborn baby photography"},
    {"name": "corporate", "prompt": "corporate headshot professional photo"},
]

for kw in keywords:
    response = requests.post("http://localhost:8000/embed/text", json=kw)
    print(f"Added: {kw['name']} -> {response.json()}")

Step 4: Classify Images

Now you can classify any image:

import requests

# Upload and classify
with open("wedding_photo.jpg", "rb") as f:
    response = requests.post(
        "http://localhost:8000/classify/image",
        files={"file": f},
        params={"top_k": 3}
    )

matches = response.json()["matches"]
for match in matches:
    print(f"{match['name']}: {match['similarity']:.2%} confidence")

Output:

wedding: 94.3% confidence
outdoor: 78.2% confidence
portrait: 71.5% confidence

Why SigLIP2?

SigLIP2 is the second generation of Google's SigLIP (Sigmoid Loss for Language-Image Pre-training) family of vision-language models, and it improves on CLIP with:

  • Better accuracy: More precise image-text alignment
  • Efficient training: Uses sigmoid loss instead of softmax
  • Flexible: Works well with both short and long text descriptions
  • Open-source: Available on Hugging Face
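The sigmoid-versus-softmax distinction is easy to see on a toy example: CLIP-style softmax loss normalizes each image's scores across all texts in the batch, while sigmoid loss treats every image-text pair as an independent binary classification. Below is a stdlib-only illustration on a hand-made 2x2 similarity matrix — an illustration of the loss formulas, not the actual SigLIP2 training code.

```python
import math

# Toy 2x2 image-text similarity logits: rows = images, cols = texts.
# Diagonal entries are the matching pairs.
logits = [[4.0, 0.5],
          [0.3, 3.5]]

def softmax_loss(logits):
    """CLIP-style: per image, cross-entropy over all texts in the batch."""
    total = 0.0
    for i, row in enumerate(logits):
        denom = sum(math.exp(x) for x in row)
        total += -math.log(math.exp(row[i]) / denom)
    return total / len(logits)

def sigmoid_loss(logits):
    """SigLIP-style: each (image, text) pair is an independent binary label."""
    total, n = 0.0, 0
    for i, row in enumerate(logits):
        for j, x in enumerate(row):
            z = 1.0 if i == j else -1.0   # +1 for matching pairs, -1 otherwise
            total += math.log(1 + math.exp(-z * x))   # log-sigmoid loss
            n += 1
    return total / n

print(f"softmax: {softmax_loss(logits):.4f}, sigmoid: {sigmoid_loss(logits):.4f}")
```

Because the sigmoid loss has no batch-wide normalization term, it avoids the all-pairs softmax denominator, which is part of the training-efficiency gain the SigLIP papers report.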

C# Integration with EF Core

If you’re building a .NET application, here’s how to integrate similarity search:

using System;
using System.Collections.Generic;
using System.Globalization;
using System.Linq;
using System.Threading.Tasks;
using Microsoft.EntityFrameworkCore;

public class ShootTypeRepository
{
    private readonly AppDbContext _db;
    public ShootTypeRepository(AppDbContext db) => _db = db;

    public async Task<List<(Guid Id, string Name, string Prompt, double Distance)>> SearchAsync(
        float[] queryEmbedding,
        int topK = 5)
    {
        var vector = string.Join(
            ",",
            queryEmbedding.Select(v => v.ToString(CultureInfo.InvariantCulture))
        );

        // <=> is pgvector's cosine-distance operator; the embedding is
        // interpolated as a vector literal like '[0.1,0.2,...]'.
        var sql = $@"
            SELECT id, name, prompt, embedding <=> '[{vector}]'::vector AS distance
            FROM shoot_type_embeddings
            ORDER BY embedding <=> '[{vector}]'::vector
            LIMIT {topK};
        ";

        var results = new List<(Guid, string, string, double)>();

        // The connection is owned by the DbContext, so don't dispose it here.
        var conn = _db.Database.GetDbConnection();
        if (conn.State != System.Data.ConnectionState.Open)
            await conn.OpenAsync();

        using var cmd = conn.CreateCommand();
        cmd.CommandText = sql;

        using var reader = await cmd.ExecuteReaderAsync();
        while (await reader.ReadAsync())
        {
            results.Add((
                reader.GetGuid(0),
                reader.GetString(1),
                reader.GetString(2),
                reader.GetDouble(3)
            ));
        }

        return results;
    }
}

Performance Considerations

Vector Normalization

Always normalize your vectors before storing or comparing them. With unit-length vectors, the dot product equals the cosine similarity, so distance scores stay consistent across queries:

def normalize(vec):
    norm = np.linalg.norm(vec)
    return vec / norm if norm > 0 else vec
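As a sanity check, here is a stdlib-only demonstration (with toy 3-dimensional vectors) that the dot product of normalized vectors equals their cosine similarity:

```python
import math

def normalize(vec):
    """Scale a vector to unit length (no-op on the zero vector)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else vec

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    return dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))

a, b = [3.0, 4.0, 0.0], [1.0, 2.0, 2.0]

# For normalized vectors, the dot product IS the cosine similarity:
assert abs(dot(normalize(a), normalize(b)) - cosine_similarity(a, b)) < 1e-12
```

This is also why pgvector's cosine distance `<=>` can be converted to a similarity score with `1 - distance`, as done in the FastAPI service above.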

Indexing Strategy

The ivfflat index provides a good balance between speed and accuracy. Adjust the lists parameter based on your dataset size:

  • Small datasets (< 100K vectors): lists = total_vectors / 1000
  • Large datasets (> 100K vectors): lists = sqrt(total_vectors)

Similarity Threshold

Not all matches are good matches. Filter results by distance:

SIMILARITY_THRESHOLD = 0.75  # Adjust based on your needs

matches = [m for m in matches if (1 - m['distance']) > SIMILARITY_THRESHOLD]

Production Best Practices

  1. Separate Tables: Use different tables for keywords vs. shoot types for better organization
  2. Batch Processing: Generate embeddings in batches for efficiency
  3. Caching: Cache model in memory (don’t reload per request)
  4. GPU Acceleration: Use CUDA for faster inference
  5. Monitoring: Track inference time and similarity scores
  6. Versioning: Store model version with embeddings for reproducibility

Advantages Over Cloud APIs

Feature        | Cloud APIs          | Your System
---------------|---------------------|----------------------
Cost           | Pay per request     | Fixed infrastructure
Latency        | 100-500 ms          | 10-50 ms
Privacy        | Data sent to cloud  | Stays on premise
Customization  | Limited             | Full control
Offline        | Not supported       | Fully supported
Scale          | Auto                | Manual but cheaper

Real-World Use Cases

  1. Photography Management: Auto-tag photo libraries by shoot type
  2. E-commerce: Classify product images for better search
  3. Social Media: Content moderation and categorization
  4. Digital Asset Management: Organize media libraries automatically
  5. Photo Studio Workflow: Route images to appropriate editors

Extending the System

Multi-label Classification

Return multiple keywords instead of just one:

# Get all matches above threshold
good_matches = [m for m in matches if m['similarity'] > 0.7]
keywords = [m['name'] for m in good_matches]

Hierarchical Categories

Create parent-child relationships:

ALTER TABLE shoot_type_embeddings
ADD COLUMN parent_id UUID REFERENCES shoot_type_embeddings(id);
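Once `parent_id` links exist, a match can be expanded to its full category path in application code. A hypothetical stdlib sketch, with the table represented as an in-memory dict and integer ids standing in for the UUIDs:

```python
# Hypothetical illustration: rows of shoot_type_embeddings as
# {id: (name, parent_id)}, with integer ids standing in for UUIDs.
rows = {
    1: ("photography", None),
    2: ("portrait", 1),
    3: ("studio_portrait", 2),
}

def category_path(row_id, rows):
    """Walk parent_id links from a match up to its root category."""
    path = []
    while row_id is not None:
        name, parent = rows[row_id]
        path.append(name)
        row_id = parent
    return list(reversed(path))

print(category_path(3, rows))  # root-to-leaf category path
```

In SQL the same walk can be done with a recursive CTE, but resolving short parent chains in application code keeps the hot-path query simple.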

Confidence Scoring

Normalize similarity scores to [0, 1] range:

def confidence_score(distance):
    """Convert cosine distance to confidence percentage"""
    return max(0, min(1, 1 - distance))

Conclusion

Building your own image classifier with SigLIP2 and PostgreSQL gives you:

  • Full control over your classification pipeline
  • Better performance with pre-computed embeddings
  • Lower costs at scale
  • Privacy by keeping data on your infrastructure
  • Customization for your specific domain

The combination of a powerful vision-language model, efficient vector search, and pre-generated embeddings creates a fast, accurate, and cost-effective solution that rivals cloud APIs while giving you complete control.

Get Started

  1. Set up PostgreSQL with pgvector
  2. Clone the code and install dependencies
  3. Pre-generate embeddings for your keywords
  4. Start classifying images!

The complete code is production-ready and, with appropriate infrastructure (GPU inference, connection pooling, and horizontal replicas), can scale to high request volumes.


Have questions or improvements? Share your experience building custom image classifiers in the comments below!


Ngan Mai Thanh
