In the first part of our Databricks series for testers, ‘Introducing Databricks Lakeflow’, we explored the fundamentals of Lakeflow. Now, let’s dive into the next chapter: Lakeflow Connect, the data ingestion layer of Databricks Lakeflow.
Lakeflow Connect is the first gate in the data pipeline. If data is ingested incorrectly, every downstream transformation (Silver/Gold tables, reports, ML features) is at risk. Ensuring data quality requires more than output checks. It demands end-to-end trust, beginning with Lakeflow Connect.
What is Lakeflow Connect?
Lakeflow Connect is a fully managed data ingestion solution from Databricks, providing simple, efficient connectors that ingest data from a wide range of sources into the Databricks Lakehouse without building or maintaining custom connectors.
Key Features:
- Broad Connector Support: Databases (SQL Server, PostgreSQL…), SaaS applications (Salesforce, Workday, Google Analytics…), cloud storage, streaming platforms.
- Change Data Capture (CDC) Support: Changes in source systems (inserts, updates, deletes) are automatically captured and propagated.
- Declarative Setup: Specify what to pull, from where, and at what frequency; Lakeflow handles the rest, simplifying operations.
- Flexible Ingestion Models:
  - Fully Managed Connectors: Provide out-of-the-box ingestion with minimal setup, including automated schema evolution, retries, and CDC.
  - Declarative Pipelines: Define ingestion rules without writing custom code, simplifying the setup process.
  - Structured Streaming: Offers full customisation for advanced use cases.
- Serverless Compute: Runs on serverless infrastructure across major cloud providers (AWS, Azure, GCP), which enhances scalability and cost-efficiency.
- Governance with Unity Catalog: Integrates with Unity Catalog for secure data ingestion, lineage tracking, and access control.
- No-Code Interface: A simple UI allows both technical and non-technical users to build and manage data ingestion pipelines.
Lakeflow Connector Types
Lakeflow Connect provides a versatile range of connector types to support flexible and scalable data ingestion into the Databricks Lakehouse.
These connectors are designed to accommodate diverse data sources and ingestion needs, from fully managed solutions to highly customisable frameworks.
This tiered approach ensures organisations can efficiently and reliably move data regardless of its origin or complexity.
Manual File Upload
Allows users to upload local files directly to Databricks:
- Upload a file to a volume
- Create a table from a local file
Standard Connectors
Support data ingestion from various sources:
- Cloud object storage: AWS S3, Google Cloud Storage, and Azure Data Lake Storage
- Streaming Platforms: Apache Kafka, Amazon Kinesis, Google Pub/Sub, Apache Pulsar
Managed Connectors
Provide out-of-the-box ingestion with minimal setup and maintenance, including built-in support for:
- Enterprise SaaS Applications:
  - Salesforce
  - Workday
  - ServiceNow
  - SharePoint
  - Google Analytics
- Databases:
  - SQL Server
  - Azure SQL Database
  - Amazon RDS for SQL Server
Key Checks for Testers: Connectors
1. Relational Database Connectors (Batch / CDC)
- Validate row counts between source tables and Bronze.
- Ensure inserts, updates, and deletes propagate correctly.
- Test schema drift handling (new columns, renamed fields).
- Pay attention to incremental loads vs. full loads.
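A minimal PySpark sketch of the row-count and delete-propagation checks above. The JDBC connection details, the table names, and the customer_id key are placeholders for illustration, not Lakeflow APIs:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # pre-defined in Databricks notebooks

# Read the source table directly (placeholder connection details).
source_df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:sqlserver://<host>:1433;databaseName=sales")
    .option("dbtable", "dbo.customers")
    .option("user", "<user>")
    .option("password", "<password>")
    .load()
)
bronze_df = spark.table("bronze.customers")  # placeholder Bronze table

# Row-count parity between source and Bronze.
source_count, bronze_count = source_df.count(), bronze_df.count()
assert source_count == bronze_count, f"Row count mismatch: {source_count} vs {bronze_count}"

# Delete propagation: keys still in Bronze but gone from the source indicate missed deletes.
orphans = bronze_df.join(source_df, on="customer_id", how="left_anti")
assert orphans.count() == 0, "Deleted source rows are still present in Bronze"
```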
2. Cloud Storage Connectors (Files / Batch)
- Confirm that all files have been ingested (no missing or skipped files).
- Validate that file formats (CSV, Parquet, JSON) are parsed correctly.
- Ensure partitioning is respected (e.g., by date/hour).
- Test ingestion of malformed or delayed files: no silent row drops, no partial ingestion, proper lineage for late arrivals, and clear logs and alerts.
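A rough completeness probe for file-based ingestion, assuming it runs in a Databricks notebook (so dbutils and spark are available) and that the Bronze table records the originating file in a source_file column captured at ingestion time; all paths and names are placeholders:

```python
landing_path = "s3://my-bucket/landing/orders/"  # placeholder landing location

# Files that landed in cloud storage.
expected_files = {f.path for f in dbutils.fs.ls(landing_path) if f.path.endswith(".parquet")}

# Files the pipeline actually recorded while ingesting into Bronze.
ingested_files = {
    row["source_file"]
    for row in spark.table("bronze.orders").select("source_file").distinct().collect()
}

missing = expected_files - ingested_files
assert not missing, f"Files landed but were never ingested: {sorted(missing)}"
```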
3. API / External Service Connectors (Batch / Micro-batch)
- Verify data completeness across paginated or chunked API responses.
- Verify that throttling, retries, and error codes are handled gracefully.
- Check for field-level accuracy (no truncation, no data type mismatches).
- Validate refresh schedules (daily/hourly pulls).
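One way to exercise the completeness and throttling checks against an API-backed source; the endpoint URL, the total_count response field, and the Bronze table name are hypothetical, and spark is the notebook's SparkSession:

```python
import time
import requests

API_URL = "https://api.example.com/v1/records"  # hypothetical endpoint

def fetch_reported_total(max_retries: int = 5) -> int:
    """Ask the API how many records it holds, backing off when throttled (HTTP 429)."""
    for attempt in range(max_retries):
        resp = requests.get(API_URL, params={"page_size": 1})
        if resp.status_code == 429:
            time.sleep(2 ** attempt)  # exponential backoff before retrying
            continue
        resp.raise_for_status()
        return resp.json()["total_count"]  # response field name is an assumption

    raise RuntimeError("API kept throttling after retries")

reported_total = fetch_reported_total()
ingested_total = spark.table("bronze.api_records").count()
assert reported_total == ingested_total, f"{reported_total} reported vs {ingested_total} ingested"
```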
4. Streaming Connectors (Kafka, Event Hubs, Kinesis, etc.)
- Ensure event order is preserved when required.
- Monitor latency: data should arrive within the agreed SLA.
- Validate throughput under spikes (no dropped events).
- Confirm replay/recovery works after failures.
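A sketch of the latency and ordering checks, assuming the Bronze table keeps the producer's event_time plus an ingest_time set on write, with device_id as the ordering key; the 5-minute SLA and all names are illustrative:

```python
from pyspark.sql import functions as F
from pyspark.sql.window import Window

events = spark.table("bronze.events")  # placeholder streaming Bronze table

# Ingestion lag per event, in seconds.
lagged = events.withColumn(
    "lag_seconds", F.col("ingest_time").cast("long") - F.col("event_time").cast("long")
)

# Latency: the 95th percentile lag should stay inside the SLA.
p95 = lagged.approxQuantile("lag_seconds", [0.95], 0.01)[0]
assert p95 <= 300, f"95th percentile ingestion lag of {p95}s exceeds the 5-minute SLA"

# Ordering: within a device, event_time should never go backwards in ingestion order.
w = Window.partitionBy("device_id").orderBy("ingest_time")
out_of_order = (
    lagged.withColumn("prev_event_time", F.lag("event_time").over(w))
    .filter(F.col("prev_event_time") > F.col("event_time"))
)
assert out_of_order.count() == 0, "Events arrived out of order for at least one device"
```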
Ingestion Methods
Lakeflow Connect supports multiple ingestion methods to meet diverse operational and analytical needs, ranging from fully automated pipelines to highly customisable workflows.
Traditional Batch
- Loads data in batches of rows, processing all records on every run.
- Best for one-time loads, ad hoc ingestion, or scheduled jobs that consistently read and process the entire dataset.
- Common techniques:
  - SQL: CREATE TABLE AS SELECT
  - Python: spark.read.load()
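A rough PySpark equivalent of a full batch load (the SQL counterpart being CREATE TABLE AS SELECT); as in the earlier sketches, spark is the notebook's SparkSession, and paths and table names are placeholders:

```python
# Read the entire dataset and overwrite the target on every run.
raw = spark.read.format("parquet").load("s3://my-bucket/exports/customers/")  # placeholder path

(
    raw.write
    .mode("overwrite")                     # each run reprocesses all records
    .saveAsTable("bronze.customers_full")  # placeholder target table
)
```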
Incremental batch
- Processes only new records, automatically skipping records that were already loaded.
- Great for scheduled jobs or pipelines; simple and repeatable for incremental file ingestion.
- Common techniques:
  - SQL: COPY INTO
  - Python: spark.readStream (Auto Loader with a timed trigger)
  - Declarative Pipelines: CREATE OR REFRESH STREAMING TABLE
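A minimal Auto Loader sketch of the incremental pattern: the checkpoint tracks which files were already loaded, and the availableNow trigger processes only the new ones before stopping. Paths and table names are placeholders:

```python
checkpoint = "s3://my-bucket/_checkpoints/orders_bronze"  # remembers which files were loaded

incoming = (
    spark.readStream.format("cloudFiles")             # Auto Loader
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", checkpoint)  # where the inferred schema is tracked
    .load("s3://my-bucket/landing/orders/")           # placeholder landing path
)

(
    incoming.writeStream
    .option("checkpointLocation", checkpoint)
    .trigger(availableNow=True)                       # process new files once, then stop
    .toTable("bronze.orders")                         # placeholder Bronze table
)
```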
Streaming
- Continuously loads rows, or micro-batches of rows, as data is generated, allowing near real-time querying.
- Best for near real-time streaming or incremental ingestion with high automation and scalability.
- Common techniques:
  - Python: spark.readStream (Auto Loader running continuously)
  - Declarative Pipelines: continuous trigger mode
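The same Auto Loader read becomes a continuously running stream by swapping the trigger; here a short processing-time trigger keeps micro-batches flowing (placeholder paths and table names again):

```python
(
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", "s3://my-bucket/_checkpoints/events_bronze")
    .load("s3://my-bucket/landing/events/")
    .writeStream
    .option("checkpointLocation", "s3://my-bucket/_checkpoints/events_bronze")
    .trigger(processingTime="30 seconds")  # keep the stream alive, micro-batching every 30s
    .toTable("bronze.events")
)
```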
Key Checks for Testers: Ingestion Methods
1. Batch Ingestion
- Completeness: Validate that all rows from sources are loaded.
- Schedule reliability: Ensure jobs run on time and recover from failures.
- Schema validation: Check that changes in the source schema don’t silently break loads.
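A quick schema-drift probe for the batch checks above; source_df stands for the same JDBC read shown in the relational-connector sketch earlier, and bronze.customers is a placeholder target:

```python
source_fields = {f.name: f.dataType.simpleString() for f in source_df.schema.fields}
bronze_fields = {
    f.name: f.dataType.simpleString() for f in spark.table("bronze.customers").schema.fields
}

# Columns that exist in the source but never made it to Bronze, and columns whose type changed.
missing = set(source_fields) - set(bronze_fields)
changed = {c for c in source_fields.keys() & bronze_fields.keys() if source_fields[c] != bronze_fields[c]}

assert not missing and not changed, f"Schema drift detected: missing={missing}, changed={changed}"
```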
2. Streaming Ingestion
- Event timeliness: Verify events arrive within expected latency.
- Ordering: Confirm event order is maintained where required.
- Throughput: Test under high-volume scenarios to ensure no data loss.
- Error handling: Ensure bad records are quarantined (captured and stored separately) rather than dropped silently.
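To make the error-handling check concrete, assume newline-delimited JSON lands in cloud storage and the pipeline is expected to route parseable rows to bronze.events and malformed ones to bronze.events_quarantine (all names are placeholders); a reconciliation then looks roughly like this:

```python
# What actually arrived: one JSON document per line in the landing files.
raw_count = spark.read.text("s3://my-bucket/landing/events/").count()  # placeholder path

# What the pipeline kept, split between the main table and the quarantine table.
good_count = spark.table("bronze.events").count()
bad_count = spark.table("bronze.events_quarantine").count()

# Every raw record must be accounted for: either ingested or quarantined, never dropped.
assert good_count + bad_count == raw_count, (
    f"{raw_count} records arrived but only {good_count + bad_count} are accounted for"
)
```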
3. Change Data Capture (CDC)
- Integrity: Verify inserts, updates, and deletes are correctly reflected in the Lakehouse.
- Completeness: Ensure no missing or delayed CDC events.
- Duplicates: Check that updates/deletes aren’t applied more than once.
- Schema drift: Validate that schema changes are detected and handled properly.
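A sketch of the duplicate and delete-integrity checks for a CDC flow, assuming customer_id is the business key, the CDC target lives in silver.customers, and the raw change feed sits in bronze.customers_changes with op_type and op_timestamp columns (all of these are placeholder names):

```python
from pyspark.sql import functions as F

target = spark.table("silver.customers")

# Duplicates: each business key should appear exactly once after updates/deletes are applied.
dupes = target.groupBy("customer_id").count().filter(F.col("count") > 1)
assert dupes.count() == 0, "CDC apply produced duplicate keys"

# Integrity: the most recent change for a key should be reflected in the target.
changes = spark.table("bronze.customers_changes")
latest = changes.orderBy(F.col("op_timestamp").desc()).limit(1).collect()[0]
matching = target.filter(F.col("customer_id") == latest["customer_id"]).collect()
if latest["op_type"] == "DELETE":
    assert len(matching) == 0, "Deleted key is still present in the target table"
else:
    assert len(matching) == 1, "Latest upserted key is missing or duplicated in the target"
```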
Conclusion
Lakeflow Connect is not just a technical component; it is the foundation of trust in the data pipeline. When testers understand their role and actively validate ingestion behaviour, they become key contributors to data reliability. By detecting silent failures early, tracing issues to their source, and collaborating closely with Data Engineers, testers help prevent costly rework and ensure that downstream systems operate on clean, trustworthy data. In a world where data drives decisions, quality must start at the very first gate.