NashTech Blog

Building an AI-powered application with LangChain and Google Vision

Introduction

In the rapidly evolving landscape of AI-powered applications, the need for intelligent agents that can process both text and visual information has become increasingly important. Today, I’ll walk you through the architecture and implementation of a sophisticated AI Agent Service that combines the power of LangChain, OpenAI, and Google Vision API to create a unified conversational AI experience.

In this blog, we will build a microservice called the Agent Service that leverages LangChain and Google Vision to power a concrete use case in an AI-powered application.

What is the Agent Service?

The Agent Service is a Node.js/TypeScript microservice that provides a unified API for AI-powered conversations with support for both text and image processing. It’s designed to be part of a larger microservices architecture, specifically built for e-commerce applications where product recommendations and visual analysis are crucial.

Key Features:

  • Unified Text & Image Processing: Single endpoint handles both text queries and image uploads
  • LangChain Integration: Leverages LangChain framework for LLM orchestration
  • Google Vision API: Advanced image analysis including object detection, OCR, and safety assessment
  • Product Recommendations: AI-powered product search and recommendations

Architecture Overview

Technology Stack

  • Runtime: Node.js with TypeScript
  • Framework: Express.js
  • AI Framework: LangChain (@langchain/openai, @langchain/core)
  • Image Processing: Google Cloud Vision API
  • File Upload: Multer

Core Components

Main Server

The entry point sets up the Express server with essential middleware. Express.js (commonly called Express) is a fast, minimalist, and flexible web framework for Node.js. It provides a simple way to build web applications, APIs, and backend services.

import express from 'express';
import cors from 'cors';
import { config } from './configs/environment';
import agentRoutes from './routes/agentRoutes';

const app = express();

// Middleware
app.use(cors());
app.use(express.json());
app.use(express.urlencoded({ extended: true }));

// Routes
app.use('/api/agents', agentRoutes);

// Start the server (the port is assumed to come from the environment config)
app.listen(config.port, () => {
  console.log(`Agent Service listening on port ${config.port}`);
});

Unified Chat Endpoint 

The /chat endpoint is the core API of the Agent Service that enables users to send text input along with an optional image upload. It processes the request by validating inputs, analyzing images using a vision service, and enhancing the user query with contextual details. The enriched prompt is then sent to an LLM for reasoning, and product recommendations are retrieved from the Product DB. Finally, it returns a structured response that combines AI insights, image analysis, and product suggestions.

import { Router } from 'express';
import multer from 'multer';
// The vision, product, validation, and LLM helpers are imported from the
// service's own modules (omitted here for brevity).

const router = Router();
const upload = multer({ dest: 'uploads/' }); // temporary storage for uploaded images

router.post('/chat', (req: any, res: any, next: any) => {
  upload.single('image')(req, res, async (err: any) => {
    // Handle file upload errors
    if (err) {
      return res.status(400).json({
        success: false,
        error: err.message || 'File upload failed'
      });
    }

    try {
      const { input } = req.body;
      const imageFile = req.file;

      // Validate input and optional image
      const validation = imageFile
        ? validateAgentRequestWithImage(req.body, imageFile)
        : validateAgentRequest(req.body);

      // Reject invalid requests (assuming the validators return { isValid, error })
      if (!validation.isValid) {
        return res.status(400).json({ success: false, error: validation.error });
      }

      // Process image if present
      let imageAnalysis = undefined;
      if (imageFile) {
        imageAnalysis = await visionService.analyzeImage(imageFile.path);
        await visionService.cleanup(imageFile.path); // remove the temp file once analyzed
      }

      // Create enhanced prompt with image context
      let enhancedInput = input;
      if (imageAnalysis) {
        enhancedInput = `${input}

[Image Analysis Context]
- Image labels detected: ${imageAnalysis.labels.join(', ')}
- Text in image: ${imageAnalysis.text}
- Objects detected: ${imageAnalysis.objects.join(', ')}
- Faces detected: ${imageAnalysis.faces}

Please provide a response that takes into account both the user's text input and the image analysis context above.`;
      }

      // Call LLM chain with enhanced context
      const response = await runLLMChain(enhancedInput);

      // Fetch product recommendations
      const productRecommendations = await productService.searchProducts(
        response.openai,
        imageAnalysis,
        input
      );

      return res.status(200).json({
        success: true,
        data: {
          ...response,
          imageAnalysis,
          hasImage: !!imageFile,
          ...productRecommendations
        },
        timestamp: new Date().toISOString()
      });
    } catch (error) {
      next(error); // delegate unexpected errors to the Express error handler
    }
  });
});
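The prompt-enrichment step in the handler can also be factored into a pure function, which keeps the route thin and makes the formatting testable in isolation. A minimal sketch, following the handler's template (the function and type names are my own, not from the service):

```typescript
// Minimal shape of the vision result, as consumed by the /chat handler
interface VisionContext {
  labels: string[];
  text: string;
  objects: string[];
  faces: number;
}

// Append the image-analysis context block to the user's text input,
// using the same template as the /chat handler above
export const buildEnhancedInput = (
  input: string,
  analysis?: VisionContext,
): string => {
  if (!analysis) return input;
  return `${input}

[Image Analysis Context]
- Image labels detected: ${analysis.labels.join(', ')}
- Text in image: ${analysis.text}
- Objects detected: ${analysis.objects.join(', ')}
- Faces detected: ${analysis.faces}

Please provide a response that takes into account both the user's text input and the image analysis context above.`;
};
```

With a helper like this, the handler reduces to `const enhancedInput = buildEnhancedInput(input, imageAnalysis);`, and the template can be unit-tested without spinning up Express.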

LangChain Integration

Integrates LangChain with OpenAI to power the product recommendation logic. LangChain is a framework that simplifies working with Large Language Models (LLMs) by handling prompts, chaining tasks, and connecting external services. Here, it acts as a wrapper around the OpenAI API, making it easier to configure the model, set parameters like temperature, and invoke prompts.

The OpenAI API provides the actual LLM capabilities (e.g., GPT-4) that generate natural, conversational responses. Together, they allow the system to transform user input into helpful product recommendations.

The service uses LangChain to orchestrate AI model interactions:

import { OpenAI } from '@langchain/openai';
import { config } from '../configs/environment';

// Shape of the combined response returned to the route layer
interface LLMResponse {
  openai: string;
  huggingface: string;
}

export const runLLMChain = async (input: string): Promise<LLMResponse> => {
  // Check if OpenAI API key is available
  if (!process.env.OPENAI_API_KEY) {
    console.warn('⚠️  OPENAI_API_KEY not found - using mock response');
    return {
      openai: `I understand you're looking for product recommendations. Based on your query about "${input}", I can help you find suitable options.`,
      huggingface: 'HuggingFace response placeholder',
    };
  }

  try {
    // Initialize OpenAI with LangChain
    const openai = new OpenAI({
      model: config.openaiModel,
      temperature: 0.7,
      openAIApiKey: config.openaiApiKey,
    });

    // Enhanced prompt for better product recommendations
    const prompt = `You are a helpful product recommendation assistant. Analyze the user's request and provide relevant information about products they might be interested in.

User request: ${input}

Please respond in a conversational way, mentioning specific product categories, brands, or features that would be relevant to their needs. Be specific about product names when possible.`;

    const response = await openai.invoke(prompt);
    
    return {
      openai: response,
      huggingface: 'HuggingFace response placeholder',
    };
  } catch (error) {
    console.error('❌ Error calling OpenAI:', error);
    
    // Fallback response in case of API error
    return {
      openai: `I understand you're asking about "${input}". While I'm having connectivity issues with my full AI capabilities, I can still help you find relevant products based on your query.`,
      huggingface: 'HuggingFace response placeholder',
    };
  }
};
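The no-API-key fallback above can be pulled out into a small pure helper so it can be reused by both the missing-key and error paths, and tested without network access. A sketch (the helper name is my own, mirroring the service's fallback text):

```typescript
// Hypothetical helper producing the service's mock fallback response
// when no OpenAI API key is configured
export const mockLLMResponse = (input: string) => ({
  openai: `I understand you're looking for product recommendations. Based on your query about "${input}", I can help you find suitable options.`,
  huggingface: 'HuggingFace response placeholder',
});
```

Keeping the fallback text in one place avoids the two slightly different hard-coded strings drifting apart.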

Google Vision Integration 

The GoogleVisionService class integrates with the Google Cloud Vision API, a powerful image analysis service from Google. It lets applications understand the content of images by detecting labels, objects, text (OCR), and faces, and by evaluating safe-search categories (e.g., adult, medical, or violent content).

Advanced image analysis using Google Cloud Vision API:

import { ImageAnnotatorClient } from '@google-cloud/vision';
import { promises as fs } from 'fs';
import { Logger } from '../utils/logger'; // the service's own logger module

export class GoogleVisionService {
  private client: ImageAnnotatorClient;

  constructor() {
    // If GOOGLE_APPLICATION_CREDENTIALS is set, pass it explicitly as the
    // key file; otherwise fall back to the client's default credential lookup
    this.client = new ImageAnnotatorClient(
      process.env.GOOGLE_APPLICATION_CREDENTIALS
        ? { keyFilename: process.env.GOOGLE_APPLICATION_CREDENTIALS }
        : {}
    );
  }

  async analyzeImage(imagePath: string): Promise<VisionAnalysisResult> {
    try {
      if (!process.env.GOOGLE_APPLICATION_CREDENTIALS) {
        Logger.warn('⚠️  GOOGLE_APPLICATION_CREDENTIALS not found - using mock analysis');
        return this.getMockAnalysis(imagePath);
      }

      Logger.info(`🔍 Analyzing image with Google Vision: ${imagePath}`);

      // Label detection
      const [labelResult] = await this.client.labelDetection(imagePath);
      const labels = labelResult.labelAnnotations?.map(label => 
        `${label.description} (${Math.round((label.score || 0) * 100)}%)`
      ) || [];

      return {
        labels,
        text: 'Text detection available in full implementation',
        objects: ['Object detection available in full implementation'],
        safeSearch: {
          adult: 'VERY_UNLIKELY',
          spoof: 'VERY_UNLIKELY', 
          medical: 'VERY_UNLIKELY',
          violence: 'VERY_UNLIKELY',
          racy: 'VERY_UNLIKELY'
        },
        faces: 0
      };

    } catch (error) {
      Logger.error('❌ Error analyzing image with Google Vision:', error as Error);
      return this.getMockAnalysis(imagePath);
    }
  }

  async cleanup(imagePath: string): Promise<void> {
    try {
      await fs.unlink(imagePath);
      Logger.info(`🗑️  Cleaned up temporary image: ${imagePath}`);
    } catch (error) {
      Logger.warn(`⚠️  Could not cleanup image ${imagePath}: ${(error as Error).message}`);
    }
  }
}
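The `text` and `objects` fields above are placeholders; a full implementation would populate them from `textDetection` and `objectLocalization` calls on the same client, formatted the same way as the labels. That formatting is worth keeping in a small pure helper; a sketch (the helper name is my own), with inputs mirroring the optional fields on the client's annotation objects:

```typescript
// Format a Vision annotation as "Name (confidence%)", matching the label
// strings the service produces (e.g. "Laptop (95%)")
export const formatAnnotation = (
  description: string | null | undefined,
  score: number | null | undefined,
): string => `${description ?? 'Unknown'} (${Math.round((score ?? 0) * 100)}%)`;
```

With this in place, `labelResult.labelAnnotations?.map(l => formatAnnotation(l.description, l.score))` produces the same strings as the inline mapping in `analyzeImage`, and the object and text paths can reuse it.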

API Usage Examples

After setting up the server, we can test the /chat endpoint using cURL. The example below sends a POST request to http://localhost:3000/api/agents/chat with a JSON body containing the user’s input; in this case, “I need a laptop for gaming and video editing”. The Agent Service receives the request, processes it through the LLM chain, and returns a structured JSON response with product recommendations and insights.

Text-Only Request

curl -X POST http://localhost:3000/api/agents/chat \
  -H "Content-Type: application/json" \
  -d '{"input": "I need a laptop for gaming and video editing"}'

Text + Image Request

curl -F "image=@laptop.jpg" \
     -F "input=What kind of laptop is this and what would you recommend?" \
     http://localhost:3000/api/agents/chat

JavaScript/Fetch Example

const formData = new FormData();
formData.append('input', 'Analyze this product image for me');
formData.append('image', fileInput.files[0]);

const response = await fetch('http://localhost:3000/api/agents/chat', {
  method: 'POST',
  body: formData
});

const data = await response.json();
console.log(data);

Response Format

The service returns structured responses with comprehensive information:

{
  "success": true,
  "data": {
    "openai": "Based on your image, I can see this is a gaming laptop with RGB lighting...",
    "huggingface": "HuggingFace response placeholder",
    "imageAnalysis": {
      "labels": ["Laptop (95%)", "Gaming (87%)", "Computer (92%)"],
      "text": "ASUS ROG Strix G15",
      "objects": ["Laptop (90%)", "Keyboard (85%)"],
      "safeSearch": {
        "adult": "VERY_UNLIKELY",
        "violence": "VERY_UNLIKELY",
        "racy": "VERY_UNLIKELY"
      },
      "faces": 0
    },
    "hasImage": true,
    "products": [
      {
        "id": "laptop-001",
        "name": "ASUS ROG Strix G15",
        "category": "Gaming Laptops",
        "price": 1299.99
      }
    ],
    "totalCount": 1
  },
  "timestamp": "2024-01-15T10:30:00.000Z"
}
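On the client side, it helps to narrow the parsed JSON before reading fields off it. A sketch of a type guard for the envelope above (the interface and function names are my own, inferred from the sample response):

```typescript
// Response envelope inferred from the sample JSON above
interface AgentChatEnvelope {
  success: boolean;
  data?: {
    openai: string;
    huggingface: string;
    hasImage: boolean;
    [key: string]: unknown; // imageAnalysis, products, totalCount, ...
  };
  error?: string;
  timestamp?: string;
}

// Narrow an unknown parsed JSON value to a successful agent response
export const isAgentSuccess = (
  value: unknown,
): value is AgentChatEnvelope & { data: NonNullable<AgentChatEnvelope['data']> } => {
  if (typeof value !== 'object' || value === null) return false;
  const v = value as Record<string, unknown>;
  return v.success === true && typeof v.data === 'object' && v.data !== null;
};
```

In the fetch example above, `if (isAgentSuccess(data)) { console.log(data.data.openai); }` then gives type-safe access to the AI response.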

Use Cases & Applications

E-commerce Product Recommendations

  • Users upload product images and ask for recommendations
  • AI analyzes the image and suggests similar or complementary products
  • Text queries about product features get intelligent responses

Customer Support

  • Visual troubleshooting with image uploads
  • Product identification from photos
  • Automated responses to common questions

Conclusion

The Agent Service represents a modern approach to AI-powered applications, combining the best of multiple technologies to create a comprehensive conversational AI experience. Its architecture demonstrates best practices in microservice development, with robust error handling, comprehensive validation, and scalable design.

The integration of LangChain, OpenAI, and Google Vision creates a powerful foundation for building intelligent applications that can understand both text and visual information. Whether you’re building an e-commerce platform, customer support system, or content moderation tool, this service provides the building blocks for sophisticated AI interactions.

The service’s modular design and comprehensive documentation make it easy to extend and customize for specific use cases, while its production-ready features ensure reliability and performance in real-world applications.

Quang Truong

Line Manager at NashTech, I am a curious and motivated software engineer with a passion for creating applications that make life easier and more enjoyable.
