The History Of Big Data Platforms

April 11, 2024

Here at Entrada, we have bet our company on Databricks, and our customers have chosen Databricks as their modern data platform. There are many reasons for this, some unique to each client and many shared across them. Databricks makes it easy to gain business value quickly, and skilled expertise helps organizations use it to its fullest extent. This is the mission of Entrada: to maximize the value organizations get out of Databricks.

Part I of Databricks vs Microsoft Fabric, A Multi-Part Blog Exploration of the Platforms and their Benefits

Introduction

For a little under a year now, some of our clients have been bringing up Microsoft Fabric. Fabric positions itself as a unified data platform built on the lakehouse architecture, much like Databricks, which pioneered the lakehouse. This is the first in a series of blogs that will explore different data engineering and analytics topics and how they relate to Databricks and Fabric in particular.

The topics include popular data engineering subjects such as business analytics/visualization (particularly with Power BI), data platform openness and vendor lock-in, differing cost models, data sharing across platforms, advanced data engineering techniques and tools, and no-code/low-code tools and how useful they actually are in their current states.

This first part of the series covers the history of Databricks. It starts in the early 2000s, when data volumes were growing too large for traditional analytical methods that relied on vertical scaling, and follows the story through Databricks’ creation of many now widely adopted data paradigms and its positioning as the unquestioned leader in the field. With this background in place, we can understand the philosophy behind the creation of Fabric and why Microsoft decided to launch it at the end of 2023.

Origins of Big Data and Spark

2008 marked a critical turning point in the realm of big data, the first of its kind since 2004, when Google engineers published their now famous MapReduce white paper, effectively ending the era of large mainframe computers for big data processing tasks. UC Berkeley established its renowned ‘Algorithms, Machines, and People Lab’, now colloquially known as AMPLab, to focus on big data and cloud computing research. The lab was co-founded by Ion Stoica, an already respected researcher in distributed systems and networking, and one of its aims was to overcome the limitations of MapReduce’s disk-based processing.

This idea became more concrete once Matei Zaharia, a distributed systems PhD student with an excellent academic background, joined AMPLab under the mentorship of Stoica. It was Zaharia who began exploring how to overcome the shortcomings of MapReduce, coming up with the concept of the Resilient Distributed Dataset (RDD). The RDD kept distributed partitions of data cached in memory and was paired with a new DAG (directed acyclic graph) execution model that enabled more complex workflows. Together, these ideas laid the foundation of what was to become Spark and effectively ended Hadoop’s dominion over big data processing.
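To make the RDD idea a little more tangible, here is a toy, single-machine sketch in plain Python. It is not Spark's API or implementation: the class name `ToyRDD` and all of its methods are hypothetical and exist only for illustration. It assumes just two of the properties described above: transformations are recorded lazily as a chain of parent datasets (a very simple DAG), and a dataset marked with `cache()` is kept in memory so later actions reuse it instead of recomputing it.

```python
class ToyRDD:
    """Toy stand-in for an RDD (hypothetical, not Spark's actual API)."""

    def __init__(self, compute, parent=None):
        self._compute = compute        # function that produces this dataset's elements
        self.parent = parent           # lineage: the dataset this one was derived from
        self._should_cache = False
        self._cached = None

    @classmethod
    def from_list(cls, items):
        return cls(lambda: list(items))

    # Transformations: only record how to compute the result; nothing runs yet.
    def map(self, fn):
        return ToyRDD(lambda: [fn(x) for x in self._materialize()], parent=self)

    def filter(self, pred):
        return ToyRDD(lambda: [x for x in self._materialize() if pred(x)], parent=self)

    def cache(self):
        # Mark this dataset to be kept in memory once it has been computed.
        self._should_cache = True
        return self

    def _materialize(self):
        if self._cached is not None:
            return self._cached        # reuse the in-memory copy
        result = self._compute()       # walks back up the lineage as needed
        if self._should_cache:
            self._cached = result
        return result

    # Actions: trigger the actual computation.
    def collect(self):
        return list(self._materialize())

    def count(self):
        return len(self._materialize())


calls = {"n": 0}

def expensive_square(x):
    calls["n"] += 1                    # count how often the "expensive" step runs
    return x * x

squares = ToyRDD.from_list(range(5)).map(expensive_square).cache()
calls_before_action = calls["n"]       # still 0: transformations are lazy

evens = squares.filter(lambda v: v % 2 == 0).collect()
total = squares.count()                # served from the cached squares
calls_after_actions = calls["n"]
```

In this sketch the squaring step runs once per element even though two actions consume `squares`, which is the kind of in-memory reuse that a disk-based MapReduce pipeline would not get between jobs. Real Spark adds partitioning across machines, fault recovery from lineage, and a scheduler, none of which is modeled here.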

Today, on the cusp of Spark 4.0, the RDD feels like a distant memory. Spark’s architecture allowed many new APIs to extend it, such as Spark Structured Streaming and MLlib, which would have been unthinkable in the MapReduce era. Spark fundamentally changed the landscape of big data processing, and almost immediately attained massive adoption.

Founding of Databricks and Lakehouse

Spark was developed as an open source project and was donated to the Apache Software Foundation in 2013, later becoming a top-level project. Not coincidentally, Databricks was founded that same year by none other than Zaharia and Stoica, along with a few other AMPLab distributed systems PhDs and early Spark developers, as a cloud-based platform centered around Spark. Since launching the platform for commercial consumption in 2015, they have continued to usher in major data innovations, such as the Delta Lake format to enable more reliable and performant data lakes, and the now ubiquitous Lakehouse concept in 2019 to unify the best of data warehouses and data lakes.

In recent years, Databricks has been the unquestioned go-to for varied and complex big data analytics needs, and as of 2023, MIT Technology Review Insights reported that 74% of global enterprises have adopted a lakehouse architecture. For the past three years, Gartner has recognized Databricks as a leader in Cloud DBMS. More recently, Databricks, with its Data Intelligence Platform, has positioned itself as a leader in the field of AI, pioneering work on some of the most advanced open source LLMs on the market and offering full end-to-end AI lifecycle management on the platform (including real-time model serving) that is kept in the same place, and managed the same way, as your other data assets.

Entrada and our customers have also bet on Databricks, benefiting from having all of our data and AI assets on a single, unified, fully governed, and truly open platform with an optimized Apache Spark runtime. We love that we can use DBSQL for our data warehousing needs, Delta Sharing for our data transfer requirements, and Databricks Marketplace for our data monetization efforts. Databricks has a vast partner ecosystem with many seamless connectors if your organization is using any third-party software, and since Databricks is built on top of open source technology, it is simple to read the data ingested by Databricks from anywhere you need to. Because of innovations like these, many large organizations have reported productivity gains and cost reductions after migrating to Databricks from its competitors.

With all the talk surrounding the lakehouse architecture and its game-changing features, other platforms have begun to adopt this approach. One new entry into the domain that our clients have been asking about is Microsoft Fabric. We will focus specifically on how Microsoft has decided to use the lakehouse on its platform and how that relates to what Databricks is doing.

Conclusion

Fabric is the latest platform to adopt now open source Databricks creations such as Delta Lake and Spark in order to position itself as a reliable and performant lakehouse. Databricks has led most of the innovation in this sector of the market for many years now.

Does Fabric represent a serious threat to the dominion of Databricks? How do the two integrate? Does the Databricks Data Intelligence Platform, with its proven track record and mature features, remain the go-to choice for your data engineering needs? The next part of the series will explore one of the most advertised features of Fabric: its integration with Power BI, whose team is behind the development of Fabric. We will look at its usefulness and limitations, and at how this same integration currently works on Databricks.

Interested in learning more? Check out the next blog in this series, Part II: Power BI Implementations.

Entrada