As a dedicated Data Engineer and Databricks Solutions Architect, my mission is to build the architecture that turns raw data into a competitive advantage. I establish the robust pipelines and engineering patterns that transform legacy systems into reliable, high-performance cloud environments.

While the Lakehouse architecture offers flexibility, the fundamental principles of Data Modeling are more critical than ever for performance and cost efficiency. We must dispel the notion that data modeling is now irrelevant.

Redefining Data Modeling for the Lakehouse

Data modeling is the art of translating business requirements into a clear and structured blueprint of the data. It separates what is essential from what is noise to create a stable foundation.

I always advise clients that this is not a one-time task, because a good model evolves over time. To succeed, we must move through three specific levels of abstraction:

  1. Conceptual: We define the high level objects such as customers or orders without worrying about technical types.
  2. Logical: We define the structure and business rules in a technology agnostic way.
  3. Physical: We build the blueprint tailored to the Databricks platform to optimize performance and storage.
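To make the physical level concrete, here is a minimal sketch of how the conceptual "customers" object could land as a physical Delta table (the column names are illustrative, not prescriptive):

SQL:
-- Physical model of the conceptual "customers" object
CREATE TABLE IF NOT EXISTS prod.gold.dim_customers (
    customer_sk BIGINT COMMENT 'Surrogate key',
    customer_name STRING COMMENT 'Full customer name',
    email STRING,
    created_at TIMESTAMP
);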

The Databricks Advantage: Modeling with Unity Catalog

Databricks offers best practices to turn the physical layer into functional data structures through Unity Catalog (UC). I use UC as the technical framework to implement the physical data model while maintaining visibility into the conceptual business rules.

1. Naming Convention and Governance

I use the UC structure to enforce a clean architecture. You should use one catalog per environment and schemas to define your medallion architecture structure.

Here is how I set up the initial structure to keep the environment organized:

SQL:
-- Creating a catalog for the production environment
CREATE CATALOG IF NOT EXISTS prod;

-- Creating schemas for the Medallion Architecture
CREATE SCHEMA IF NOT EXISTS prod.bronze;
CREATE SCHEMA IF NOT EXISTS prod.silver;
CREATE SCHEMA IF NOT EXISTS prod.gold;

This structure allows business analysts to browse the catalog and see relationships, which makes the model transparent to the whole team. UC also assists with documentation by using AI to suggest descriptions for your tables and columns. This helps turn a cryptic physical column like cust_v2_id back into a clear business term.
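Unity Catalog also lets you persist those descriptions as metadata so they travel with the table. A minimal sketch, assuming an illustrative silver table with the cryptic cust_v2_id column:

SQL:
-- Documenting the table and a cryptic column with their business meaning
COMMENT ON TABLE prod.silver.customers IS 'One row per active customer';

ALTER TABLE prod.silver.customers
ALTER COLUMN cust_v2_id COMMENT 'Customer identifier (v2 of the CRM key)';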

2. Implementing the Relational Model

It is critical to understand that, unlike a traditional RDBMS, Spark does not enforce Primary and Foreign Key constraints; they serve primarily as informational metadata. However, they remain a vital part of the Lakehouse architecture for optimization purposes.

I use the RELY keyword to activate specific optimizations. It acts as a performance hint rather than a hard constraint, telling the Databricks optimizer to trust your data integrity to achieve improvements like JOIN elimination. You must first validate that your data has no duplicates or orphaned keys (SQL and Python snippets are great for this!); if the data violates these assumptions, your queries may produce incorrect results. I recommend using Delta Live Tables (DLT) Expectations to explicitly validate data quality during ingestion. By filtering out invalid records with @dlt.expect_or_drop, you ensure the level of data integrity that the RELY constraint assumes.
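Before trusting RELY, a pair of quick SQL checks like these can confirm there are no duplicate keys or orphaned references (table names follow the examples in this post):

SQL:
-- Both queries should return zero rows before you add RELY

-- 1. Duplicate primary keys in the dimension
SELECT customer_sk, COUNT(*) AS cnt
FROM prod.gold.dim_customers
GROUP BY customer_sk
HAVING COUNT(*) > 1;

-- 2. Orphaned foreign keys in the fact table
SELECT f.customer_sk
FROM prod.gold.fact_sales f
LEFT ANTI JOIN prod.gold.dim_customers d
  ON f.customer_sk = d.customer_sk;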

IMPORTANT: The query optimizations associated with the RELY command specifically require queries to run on Photon-enabled compute. Without Photon, these hints may be ignored.
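Here is a minimal sketch of the DLT Expectations approach (the table and expectation names are assumptions, not from a real pipeline):

Python:
import dlt

# Drop any record whose surrogate key is missing before it reaches silver
@dlt.table(name="silver_customers", comment="Customers with validated keys")
@dlt.expect_or_drop("valid_customer_sk", "customer_sk IS NOT NULL")
def silver_customers():
    return spark.read.table("prod.bronze.customers")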

Here is an example of creating a Fact table that trusts the dimension table:

SQL:
-- Fact Sales Table Creation
CREATE TABLE prod.gold.fact_sales (
    order_id BIGINT,
    customer_sk BIGINT,
    amount DECIMAL(10,2),
    -- Using RELY to tell the Databricks Optimizer it can trust this relationship
    CONSTRAINT sales_customer_fk 
    FOREIGN KEY (customer_sk) REFERENCES prod.gold.dim_customers (customer_sk) RELY
);

If you already have existing tables without constraints, you can add them later to optimize performance without recreating the table:

SQL:
-- Adding a constraint with RELY after table creation
ALTER TABLE prod.gold.dim_customers
ADD CONSTRAINT customer_pk PRIMARY KEY (customer_sk) RELY;
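With a trusted primary key in place, the optimizer can eliminate joins it can prove are redundant. In this illustrative query, only fact columns are selected, so the join to the dimension may be skipped entirely:

SQL:
-- With RELY constraints and Photon, the optimizer can drop this join
SELECT f.order_id, f.amount
FROM prod.gold.fact_sales f
JOIN prod.gold.dim_customers d
  ON f.customer_sk = d.customer_sk;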

3. Physical Layout with Liquid Clustering

Modeling in the Lakehouse also means defining how data sits on disk. I now use Liquid Clustering, a recently introduced Databricks feature that replaces both the rigid folder-based hierarchy of traditional partitioning and ZORDER BY.

Databricks automatically clusters data by the chosen columns, which supports high-cardinality keys and multi-dimensional filtering. This effectively future-proofs your table, because you can update clustering keys as business needs change.

Here is how I create a high performance table using this feature:

SQL:
CREATE TABLE prod.gold.orders (
    order_id BIGINT,
    customer_id STRING,
    transaction_date TIMESTAMP,
    total_amount DOUBLE,
    store_id INT
)
-- We cluster by the columns we filter or join on most often
CLUSTER BY (customer_id, transaction_date);

When the business logic evolves I can simply alter the table to change the physical layout without rewriting the data:

SQL:
-- If business needs change you can update the keys without a rewrite
ALTER TABLE prod.gold.orders
CLUSTER BY (customer_id, store_id);

-- To trigger the clustering optimization manually
OPTIMIZE prod.gold.orders;

-- In Databricks Runtime 16.0 and above, you can force reclustering of all records in a table with the following syntax: 
OPTIMIZE prod.gold.orders FULL;

IMPORTANT: Per the Databricks documentation, large tables with millions of records that haven't been previously optimized can take hours to complete an OPTIMIZE FULL operation. This feature is currently in Public Preview for managed Apache Iceberg tables.

Final Words of Advice

I have often seen data projects fail for two main reasons: not properly defining the what, why, and how, and not treating data as a product. Storage prices may be the cheapest they have ever been, but compute is not, and with the AI boom this resource will only become more expensive down the road.

I always recommend that clients take a step back and reevaluate their priorities when building or migrating a data warehouse from the ground up. Skipping the Data Modeling, architecture, and planning phases will only lead to unreliable data integrity across your entities, increased compute costs, and, at best, a project built on a shaky foundation.

Investing the time up front to understand your business processes, define clear use cases, and design a robust data model pays off in the long run. It not only reduces technical debt and surprise costs later, but also enables teams to move faster and make decisions with confidence.

Is your Data Lake turning into a Data Swamp?

Contact us at Entrada to review your current Lakehouse architecture. Let us help you build a foundation that is ready for the future.
