Databricks open-sources declarative ETL framework powering 90% faster pipeline builds



Today, at its annual Data + AI Summit, Databricks announced that it is open-sourcing its core declarative ETL framework as Apache Spark Declarative Pipelines, making it available to the entire Apache Spark community as part of the upcoming 4.1 release. 

Databricks launched the framework as Delta Live Tables (DLT) in 2022 and has since expanded it to help teams build and operate reliable, scalable data pipelines end-to-end. The move to open-source it reinforces the company’s commitment to open ecosystems while marking an effort to one-up rival Snowflake, which recently launched its own Openflow service for data integration—a crucial component of data engineering. 

Snowflake’s offering taps Apache NiFi to centralize any data from any source into its platform, while Databricks is making its in-house pipeline engineering technology open, allowing users to run it anywhere Apache Spark is supported — and not just on its own platform.

Declare pipelines, let Spark handle the rest

Traditionally, data engineering has been associated with three main pain points: complex pipeline authoring, manual operations overhead and the need to maintain separate systems for batch and streaming workloads. 

With Spark Declarative Pipelines, engineers describe what their pipeline should do using SQL or Python, and Apache Spark handles the execution. The framework automatically tracks dependencies between tables, manages table creation and evolution and handles operational tasks like parallel execution, checkpoints, and retries in production.
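To make the idea concrete, here is a minimal sketch of that declarative style in Python, written against the Delta Live Tables API the framework grew out of; the module and decorator names in the open-source Spark 4.1 release may differ, and the table names and source path are purely illustrative.

```python
# Sketch only: DLT-style Python, the API Spark Declarative Pipelines evolved from.
# Names in the open-source release may differ; paths and columns are hypothetical.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Raw orders loaded from object storage")
def raw_orders():
    # `spark` is provided by the pipeline runtime; the bucket is a placeholder.
    return spark.read.json("s3://example-bucket/orders/")

@dlt.table(comment="Order totals per customer")
def orders_by_customer():
    # Reading raw_orders declares the dependency; the framework derives the
    # execution order and handles table creation, checkpoints and retries.
    return (dlt.read("raw_orders")
            .groupBy("customer_id")
            .agg(F.sum("amount").alias("total_spend")))
```

Notice that the code never specifies when or in what order the tables run; the engine infers that from the declared reads.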

“You declare a series of datasets and data flows, and Apache Spark figures out the right execution plan,” Michael Armbrust, distinguished software engineer at Databricks, said in an interview with VentureBeat. 

The framework supports batch, streaming and semi-structured data, including files from object storage systems like Amazon S3, ADLS or GCS, out of the box. Engineers define both real-time and periodic processing through a single API, with pipeline definitions validated before execution to catch issues early — no need to maintain separate systems.
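The single-API point is easiest to see in code. The sketch below (again DLT-style Python, with assumed paths and schema) uses the same table decorator for a streaming file ingest and a continuously maintained aggregate that in a traditional stack might live in two different systems.

```python
# Sketch only: one declarative API covering streaming ingest and aggregation.
# Decorator names follow DLT conventions; schema and bucket are assumptions.
import dlt
from pyspark.sql import functions as F

@dlt.table(comment="Click events streamed in as files land in object storage")
def clicks_raw():
    return (spark.readStream
            .schema("user_id STRING, ts TIMESTAMP, url STRING")
            .json("s3://example-bucket/clicks/"))

@dlt.table(comment="Hourly click counts, kept up to date by the framework")
def clicks_per_hour():
    # Late data is bounded with a watermark; the framework manages the
    # checkpointing needed to keep this aggregate incrementally updated.
    return (dlt.read_stream("clicks_raw")
            .withWatermark("ts", "10 minutes")
            .groupBy(F.window("ts", "1 hour"), "url")
            .count())
```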

“It’s designed for the realities of modern data like change data feeds, message buses, and real-time analytics that power AI systems. If Apache Spark can process it (the data), these pipelines can handle it,” Armbrust explained. He added that the declarative approach marks the latest effort from Databricks to simplify Apache Spark.

“First, we made distributed computing functional with RDDs (Resilient Distributed Datasets). Then we made query execution declarative with Spark SQL. We brought that same model to streaming with Structured Streaming and made cloud storage transactional with Delta Lake. Now, we’re taking the next leap of making end-to-end pipelines declarative,” he said.

Proven at scale 

While the declarative pipeline framework is set to be committed to the Spark codebase, its prowess is already known to thousands of enterprises that have used it as part of Databricks’ Lakeflow solution to handle workloads ranging from daily batch reporting to sub-second streaming applications.

The benefits are similar across the board: teams spend far less time building and maintaining pipelines, and they see gains in performance, latency or cost, depending on what they choose to optimize for.

Financial services company Block used the framework to cut development time by over 90%, while Navy Federal Credit Union reduced pipeline maintenance time by 99%. The Spark Structured Streaming engine, on which declarative pipelines are built, enables teams to tailor the pipelines for their specific latencies, down to real-time streaming.
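That latency dial comes from Structured Streaming's trigger modes. The standalone sketch below uses only standard PySpark APIs, with the built-in rate source and console sink as placeholders, to show the range from near-real-time micro-batches to batch-style runs.

```python
# Sketch only: tuning latency with standard Structured Streaming triggers.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("trigger-demo").getOrCreate()
events = spark.readStream.format("rate").load()  # built-in test source

query = (events.writeStream
         .format("console")
         # Near-real-time: emit a micro-batch roughly every second.
         .trigger(processingTime="1 second")
         # Batch-style alternative: .trigger(availableNow=True) processes
         # whatever has arrived and then stops, e.g. for a nightly run.
         .start())
query.awaitTermination()
```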

“As an engineering manager, I love the fact that my engineers can focus on what matters most to the business,” said Jian Zhou, senior engineering manager at Navy Federal Credit Union. “It’s exciting to see this level of innovation now being open-sourced, making it accessible to even more teams.”

Brad Turnbaugh, senior data engineer at 84.51°, noted the framework has “made it easier to support both batch and streaming without stitching together separate systems” while reducing the amount of code his team needs to manage.

Different approach from Snowflake

Snowflake, one of Databricks’ biggest rivals, also took steps at its recent conference to address data challenges, debuting an ingestion service called Openflow. However, its approach differs from Databricks’ in scope.

Openflow, built on Apache NiFi, focuses primarily on data integration and movement into Snowflake’s platform. Users still need to clean, transform and aggregate data once it arrives in Snowflake. Spark Declarative Pipelines, on the other hand, covers the full path from source to usable data.

“Spark Declarative Pipelines is built to empower users to spin up end-to-end data pipelines — focusing on the simplification of data transformation and the complex pipeline operations that underpin those transformations,” Armbrust said.

The open-source nature of Spark Declarative Pipelines also differentiates it from proprietary solutions. Users don’t need to be Databricks customers to leverage the technology, aligning with the company’s history of contributing major projects like Delta Lake, MLflow and Unity Catalog to the open-source community.

Availability timeline

Apache Spark Declarative Pipelines will be committed to the Apache Spark codebase in an upcoming release as part of version 4.1. The exact timeline, however, remains unclear.

“We’ve been excited about the prospect of open-sourcing our declarative pipeline framework since we launched it,” Armbrust said. “Over the last 3+ years, we’ve learned a lot about the patterns that work best and fixed the ones that needed some fine-tuning. Now it’s proven and ready to thrive in the open.”

The open source rollout also coincides with the general availability of Databricks Lakeflow Declarative Pipelines, the commercial version of the technology that includes additional enterprise features and support.

Databricks’ Data + AI Summit runs from June 9 to 12, 2025.


