NashTech Insights

Databricks – Connecting to Databricks via FastAPI

Toan Dang

Key takeaways:
By employing FastAPI to establish a connection with Databricks, users can effortlessly access Delta Lake tables and expose them through APIs. This integration facilitates real-time data consumption and interaction, empowering developers to build efficient, scalable applications that leverage Databricks’ capabilities for Big Data processing and Delta Lake.

Introduction

FastAPI serves as a powerful framework for API backends and can connect to Databricks, a robust analytics and data processing platform. This integration provides seamless access to Delta Lake tables, leveraging Databricks’ advanced capabilities for data management and analysis. By combining FastAPI’s streamlined API development with Databricks’ extensive feature set, developers can build scalable, efficient, and data-driven applications.

Advantages of Databricks

Databricks, a leading Lakehouse platform built on Apache Spark, offers numerous benefits in the world of data, analytics, and AI.

  • Unified Platform: Databricks unifies data, analytics, and AI on a single platform, simplifying workflows and enabling seamless collaboration.
  • Built-in Libraries: With built-in libraries for data processing and machine learning, Databricks provides a comprehensive toolkit for data scientists and engineers.
  • Scalability: Databricks offers a scalable infrastructure that can handle large volumes of data and accommodate growing workloads. It leverages cloud-based resources to scale up or down based on demand, ensuring optimal performance and cost-effectiveness.
  • Streamlined Data Management: Databricks offers automated and reliable ETL processes, fast data ingestion, and open and secure data sharing, ensuring efficient data management and enabling faster insights and decision-making.
  • Lakehouse Architecture: Databricks promotes the concept of a “lakehouse” architecture, which combines the best features of data lakes and data warehouses. It allows for efficient storage, management, and analysis of structured and unstructured data, offering the benefits of both worlds.

High-level architecture

The following diagram outlines the architecture used to establish a direct connection between FastAPI and a Delta Lake table through Databricks, exposing the data through APIs.

The sample presented here uses the data (dim_category) from the previous post as a demonstration.

Steps to set up APIs to access Databricks

Environment Setup

  • Python >=3.9
  • Python packages (a sample requirements.txt follows this list):
    • fastapi
    • pydantic
    • SQLAlchemy
    • sqlalchemy-databricks
  • File “.env” to store the environment variables used to connect to Databricks (or store these values in a Key Vault for better security). Get the first two values from the cluster’s “Advanced Options” in the compute settings, and generate a personal access token (PAT) for ACCESS_TOKEN under “User Settings”:
    • SERVER_HOSTNAME=adb-xxxxxxxxxx.azuredatabricks.net
    • HTTP_PATH=sql/protocolv1/o/xxxxxxx/xxxx-xxxxxx-xxxxxxxx
    • ACCESS_TOKEN=xxxxxxxxxxxxxxxxxxxxxxxxx
  • File config.py to read the variables and file fastapi_databricks.py to run the FastAPI app.
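
For reference, a minimal requirements.txt matching the list above might look like this. The version pins are illustrative assumptions: pydantic v1 and SQLAlchemy 1.x match the APIs used in this post, and uvicorn is assumed for serving the app:

fastapi
uvicorn
pydantic<2
SQLAlchemy<2
sqlalchemy-databricks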

Read config variables

This sample reads the Databricks configuration from environment variables using pydantic:

from typing import Optional, Dict, Any
from pydantic import BaseSettings, validator

class Settings(BaseSettings):
    SERVER_HOSTNAME: str
    HTTP_PATH: str
    ACCESS_TOKEN: str
    DATABRICKS_DB: str = "dw_tickit"
    DATABRICKS_DATABASE_URI: Optional[str] = None

    # Assemble the SQLAlchemy connection URI from the individual Databricks
    # settings unless a full URI is supplied explicitly.
    @validator("DATABRICKS_DATABASE_URI", pre=True)
    def assemble_db_connection(cls, v: Optional[str], values: Dict[str, Any]) -> Any:
        if isinstance(v, str):
            return v
        return "databricks+connector://token:{}@{}:443/{}".format(
            values.get("ACCESS_TOKEN"),
            values.get("SERVER_HOSTNAME"),
            values.get("DATABRICKS_DB"),
        )

    class Config:
        case_sensitive = True
        env_file = ".env"  # read variables from the .env file


settings = Settings()
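
Before wiring up the API, it can help to verify the connection settings in isolation. Below is a minimal sketch, assuming the .env values above are valid and the cluster is running; check_connection.py is a hypothetical helper name:

# check_connection.py - smoke test for the Databricks connection
from sqlalchemy import create_engine, text

from config import settings

engine = create_engine(
    settings.DATABRICKS_DATABASE_URI,
    connect_args={"http_path": settings.HTTP_PATH},
)

# Prints 1 if the endpoint is reachable and the token is valid.
with engine.connect() as conn:
    print(conn.execute(text("SELECT 1")).scalar())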

Expose API via FastAPI

Create APIs that perform the following actions on the dim_category table:

  • Create a new category (POST)
  • Read a category by a specific catid (GET)
  • Update an existing category (PUT)
  • Delete a category (DELETE)

The code below, in the fastapi_databricks.py file, contains the full setup for all the API routes:

# How to run:
# uvicorn fastapi_databricks:app --reload

from typing import Optional

from fastapi import FastAPI, Depends, Query, status
from fastapi.responses import JSONResponse
from fastapi.encoders import jsonable_encoder

from sqlalchemy import create_engine, Column, Integer, String, DateTime, Date
from sqlalchemy.orm import sessionmaker
from sqlalchemy.ext.declarative import declarative_base

from pydantic import BaseModel, Field
from random import randrange
from config import settings

app = FastAPI()

# Engine connected to the Databricks SQL endpoint; http_path identifies
# the target cluster compute.
engine = create_engine(
    settings.DATABRICKS_DATABASE_URI,
    connect_args={
        "http_path": settings.HTTP_PATH,
    },
)

Session = sessionmaker(autocommit=False, autoflush=False, bind=engine)

# FastAPI dependency that opens a session per request and always closes it.
def session():
    db = Session()
    try:
        yield db
    finally:
        db.close()


# Declarative base bound to the Databricks engine (SQLAlchemy 1.x style)
Base = declarative_base(bind=engine)

# ORM entity mapped to the dim_category Delta table
class Category(Base):
    __tablename__ = "dim_category"

    cat_key = Column(Integer, primary_key=True, nullable=False)
    catid = Column(Integer, nullable=False)
    catgroup = Column(String, nullable=True)
    catname = Column(String, nullable=True)
    catdesc = Column(String, nullable=True)
    update_date = Column(DateTime, nullable=True)
    run_date = Column(Date, nullable=True)
    batch_id = Column(Integer, nullable=True)
    hash = Column(String, nullable=True)


# Request Body
class CategoryRequest(BaseModel):
    catname: str = Field(..., max_length=50)
    catgroup: str = Field(..., max_length=50)


@app.get("/")
def root():
    return {"message": "Hi Databricks!"}


@app.get("/category")
def get_category(
    catid: Optional[int] = None,
    catname: Optional[str] = Query(None, max_length=50),
    db: Session = Depends(session),
):
    if catid is not None:
        result_set = db.query(Category).filter(Category.catid == catid).all()
    elif catname is not None:
        result_set = db.query(Category).filter(Category.catname == catname).all()
    else:
        result_set = db.query(Category).all()
    response_body = jsonable_encoder(result_set)
    return JSONResponse(status_code=status.HTTP_200_OK, content=response_body)


@app.post("/category")
def create_category(request: CategoryRequest, db: Session = Depends(session)):
    category = Category(
        cat_key=randrange(1_000_000, 10_000_000),
        catid=randrange(1_000_000, 10_000_000),
        catname=request.catname,
        catgroup=request.catname,
    )
    db.add(category)
    db.commit()
    response_body = jsonable_encoder({"category_id": category.catid})
    return JSONResponse(status_code=status.HTTP_200_OK, content=response_body)


@app.put("/category/{id}")
def update_category(catid: int, request: CategoryRequest, db: Session = Depends(session)):
    category = db.query(Category).filter(Category.catid == catid).first()
    if category is None:
        return JSONResponse(status_code=status.HTTP_404_NOT_FOUND)
    category.catname = request.catname
    category.catgroup = request.catgroup
    db.commit()
    response_body = jsonable_encoder({"category_id": category.catid})
    return JSONResponse(status_code=status.HTTP_200_OK, content=response_body)


@app.delete("/category/{id}")
def delete_category(catid: int, db: Session = Depends(session)):
    db.query(Category).filter(Category.catid == catid).delete()
    db.commit()
    response_body = jsonable_encoder({"category_id": catid, "msg": "record deleted"})
    return JSONResponse(status_code=status.HTTP_200_OK, content=response_body)
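
Once the app is running (uvicorn fastapi_databricks:app --reload), any HTTP client can exercise the endpoints. Here is a quick sketch using the requests package; the category values below are made up for illustration:

import requests

BASE_URL = "http://127.0.0.1:8000"

# Create a new category, then look it up by name.
created = requests.post(
    f"{BASE_URL}/category",
    json={"catname": "Rock", "catgroup": "Concerts"},
).json()
print(created)  # e.g. {"category_id": 4829301}

rows = requests.get(f"{BASE_URL}/category", params={"catname": "Rock"}).json()
print(rows)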

An example result when calling the API to get all the categories (http://127.0.0.1:8000/category):

[{"cat_key":1,"catdesc":"Major League Baseball","catname":"MLB","run_date":"2023-05-25","hash":"f09fe6199bed3a4392707993335930b169ed02ffbfee6348ae094a9473488dfa","catid":1,"catgroup":"Sports","update_date":"2023-05-25T00:00:00+00:00","batch_id":279},{"cat_key":2,"catdesc":"All jazz singers and bands","catname":"Jazz","run_date":"2023-05-25","hash":"8b14858c4df400eca29cba1b444c85ba380176ceb88dc498f6703f2d761b9830","catid":10,"catgroup":"Concerts","update_date":"2023-05-25T00:00:00+00:00","batch_id":279},{"cat_key":3,"catdesc":"All symphony, concerto, and choir concerts","catname":"Classical","run_date":"2023-05-25","hash":"dfe572c7b2936e94fe2744f95816e8fd5bd9ae51ce3ce10d2a4272d403322506","catid":11,"catgroup":"Concerts","update_date":"2023-05-25T00:00:00+00:00","batch_id":279},{"cat_key":4,"catdesc":"Harry Potter","catname":"Harry Potter","run_date":"2023-05-25","hash":"723e3f4d5b866a4963165c428534836ce3ce3c874f38e312aa2740d469887439","catid":12,"catgroup":"FILMS","update_date":"2023-05-25T00:00:00+00:00","batch_id":279},{"cat_key":5,"catdesc":"Premier League","catname":"Premier League","run_date":"2023-05-25","hash":"3c5273fa027b953f86e6fc9836f061aff6596b7141eae714bbce612b7bec363e","catid":13,"catgroup":"SPORT","update_date":"2023-05-25T00:00:00+00:00","batch_id":279},{"cat_key":6,"catdesc":"National Hockey League","catname":"NHL","run_date":"2023-05-25","hash":"4425fdaccb7ec3698cedf4115812ecb9c265d0d62597628fca3219a32ca0697f","catid":2,"catgroup":"Sports","update_date":"2023-05-25T00:00:00+00:00","batch_id":279},{"cat_key":7,"catdesc":"National Football League","catname":"NFL","run_date":"2023-05-25","hash":"3f8f8256a74cde0605bf6082818c2b7bf56c9f6d2f35f36c7fd52e7473249eb7","catid":3,"catgroup":"Sports","update_date":"2023-05-25T00:00:00+00:00","batch_id":279},{"cat_key":8,"catdesc":"National Basketball Association","catname":"NBA","run_date":"2023-05-25","hash":"62455b22140dc43457ae444a1ddd19b629132220ed13894fe9d9abb858a2f61f","catid":4,"catgroup":"Sports","update_date":"2023-05-25T00:00:00+00:00","batch_id":279},{"cat_key":9,"catdesc":"Major League Soccer","catname":"MLS","run_date":"2023-05-25","hash":"4833dfd72fadb64a870f2430e2b908f53bf88e7a1b4d88fb6602e6a6d913d49a","catid":5,"catgroup":"Sports","update_date":"2023-05-25T00:00:00+00:00","batch_id":279},{"cat_key":10,"catdesc":"Musical theatre","catname":"Musicals","run_date":"2023-05-25","hash":"b21ca2f65ecd0d4e3b102fe6cfa996890194db0832e9811f060519e89911c5b7","catid":6,"catgroup":"Shows","update_date":"2023-05-25T00:00:00+00:00","batch_id":279},{"cat_key":11,"catdesc":"All non-musical theatre","catname":"Plays","run_date":"2023-05-25","hash":"0fb62bac736aae191015d9b5f262b841b75ea9ee6e59f8c88af669c07cde575f","catid":7,"catgroup":"Shows","update_date":"2023-05-25T00:00:00+00:00","batch_id":279},{"cat_key":12,"catdesc":"All opera and light opera","catname":"Opera","run_date":"2023-05-25","hash":"7400e00c9768d4c1fa6628181c99302caf5992da72b94f2ead821b377f0f4178","catid":8,"catgroup":"Shows","update_date":"2023-05-25T00:00:00+00:00","batch_id":279},{"cat_key":13,"catdesc":"All rock and pop music concerts","catname":"Pop","run_date":"2023-05-25","hash":"84ff7dc87509731ff46073bd0c5ac07ee170ca520701b7dc0537bd415e80dd4e","catid":9,"catgroup":"Concerts","update_date":"2023-05-25T00:00:00+00:00","batch_id":279},{"cat_key":-1,"catdesc":"unknown","catname":"unknown","run_date":"1970-01-01","hash":null,"catid":-1,"catgroup":"unknown","update_date":"1970-01-01T00:00:00+00:00","batch_id":0}]

Notes

  • If the Databricks cluster is off when the API makes a request, it can take 3–10 minutes for the cluster to spin up before you get a response.
  • Be aware of the data size, the cost of querying it, and the overall cluster costs.

Summary

With this setup, you can quickly create a full CRUD API that leverages data from the Lakehouse architecture on Databricks and provides the ability to process data in real time. This baseline example also opens up further API possibilities, such as integrations with ML models deployed in Databricks.

Toan Dang

Toàn is a Data Solutions Architect with over 10 years of experience at NashTech, specializing in cloud-based data solutions on Azure, AWS, and GCP. His skills encompass database design and architecture, AI, machine learning, natural language processing, computer vision, and performance tuning and optimization. He excels at designing efficient database systems, developing intelligent solutions, and optimizing data operations for maximum performance, delivering robust, tailored data solutions that leverage cutting-edge technologies to solve complex business problems.
