When first adopting a data + AI platform, many organizations take a Lift + Shift approach to migrating pipelines and workloads. Unfortunately, this can result in costly data pipelines with lengthy run times. Fortunately, Databricks provides a myriad of options to optimize existing Databricks pipelines, as well as to enhance pipelines migrated from other sources.

Data Pipeline Optimization Options

The robustness of the Databricks platform allows for a variety of options to optimize your pipelines; some will be a better fit for your pipelines than others. Here, in no particular order, are five of Entrada’s favorite Databricks methods for pipeline optimization:

  • Liquid Clustering: A feature that automatically determines the best data layout for your pipelines, reducing run times and cost
  • Cluster Settings: The UI that enables updates to compute configuration that can result in more performant pipelines
  • Photon: A vectorized engine that runs workloads faster to reduce cost per workload
  • Partition Strategy: Partitioning data on the columns queries filter by, reducing the number of files that need to be read and resulting in faster processing times
  • Z-Ordering: A technique that colocates related data within files to enable data skipping, reducing the amount of data that needs to be read
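As a brief sketch of what two of these options look like in practice (the table and column names below are hypothetical, and the right clustering keys depend on your query patterns), Liquid Clustering is declared when a Delta table is created, while Z-Ordering is applied to an existing table via OPTIMIZE. Note that the two are alternatives: a liquid-clustered table should not also be partitioned or Z-Ordered.

```sql
-- Liquid Clustering: Databricks manages the data layout on the chosen keys.
CREATE TABLE sales.orders_clustered (
  order_id    BIGINT,
  customer_id BIGINT,
  order_date  DATE,
  amount      DECIMAL(10, 2)
) CLUSTER BY (customer_id, order_date);

-- Z-Ordering: colocate related values in files so queries filtering on
-- customer_id can skip files that cannot contain matching rows.
OPTIMIZE sales.orders_legacy
ZORDER BY (customer_id);
```

Liquid Clustering is generally the simpler choice for new tables, since the clustering keys can be changed later without rewriting the table, whereas partitioning and Z-Ordering remain useful for existing Delta tables.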

Customer Success Story 

The Challenge:

A large sales automation software platform partnered with Entrada to support their objectives of maximizing the performance of legacy Databricks pipelines and evaluating the case for migrating EMR streaming jobs.

The Solution: Entrada delivered a detailed set of performance findings and ROI recommendations covering the client’s infrastructure of over 300 data pipelines. The findings were prioritized, and Entrada implemented the remediation:

  • Migrated the highest cost and lowest performing EMR pipelines to Databricks
  • Optimized high cost and low performing Databricks pipelines
  • Leveraged a combination of Databricks capabilities (e.g., Liquid Clustering, cluster settings, and advanced partitioning) to improve the performance of the client’s lakehouse pipelines
  • Migrated pipelines to Unity Catalog to enable future AI initiatives
  • Optimized pipelines to be DLT-compliant

The Results: 

Through strategic collaboration with Entrada, the client successfully enhanced the performance and cost-efficiency of its data infrastructure and experienced: 

  • 72% reduction in run cost for the largest pipeline
  • >82% reduction in initial load time
  • >50% reduction in run time on the largest tables

The breadth of Databricks tools for enhancing and optimizing data pipeline performance can be intimidating, but leveraging these tools correctly is paramount to cost containment and efficient management of your data estate. Entrada’s experts can help your business enhance and optimize existing Databricks workloads, as well as migrate poorly performing workloads to Databricks for a healthier data ecosystem.

About Entrada

Entrada is a Databricks-focused consulting and implementation partner backed by Databricks Ventures. Entrada harnesses the power of Databricks to help customers accelerate their AI + data initiatives. Our expertise in AI/ML, Databricks, and analytics is centered around industry-centric solutions. Our mission is to simplify complex data + AI challenges and support end-to-end transformations, delivering future-ready solutions fast.
