As a dedicated Data Engineer and Databricks Solutions Architect, my mission is to build the architecture that turns raw data into a competitive advantage. I establish the robust pipelines and engineering patterns that transform legacy systems into reliable, high-performance cloud environments.

While the Lakehouse architecture offers flexibility, the fundamental principles of Data Modeling are more critical than ever for performance and cost efficiency. We must dispel the notion that data modeling is now irrelevant.

Redefining Data Modeling for the Lakehouse

Data modeling is the art of translating business requirements into a clear and structured blueprint of the data. It separates what is essential from what is noise to create a stable foundation.

I always advise clients that this is not a one-time task, because a good model evolves over time. To succeed, we must move through three specific levels of abstraction:

  1. Conceptual: We define the high level objects such as customers or orders without worrying about technical types.
  2. Logical: We define the structure and business rules in a technology agnostic way.
  3. Physical: We build the blueprint tailored to the Databricks platform to optimize performance and storage (see the sketch after this list).
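
To make that progression concrete, here is a minimal sketch of how a conceptual "customer" entity might land as a physical Delta table. The column names and comments are illustrative assumptions, not a prescribed standard, and the prod catalog and gold schema are created in the next section:

SQL:
-- Physical blueprint for the conceptual "customer" entity (illustrative column names)
CREATE TABLE IF NOT EXISTS prod.gold.dim_customers (
    customer_sk   BIGINT NOT NULL COMMENT 'Surrogate key generated during the load',
    customer_name STRING          COMMENT 'Business-friendly customer name',
    country_code  STRING          COMMENT 'ISO 3166-1 alpha-2 country code',
    created_at    TIMESTAMP       COMMENT 'Timestamp of record creation'
);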

The Databricks Advantage: Modeling with Unity Catalog

Databricks offers best practices for turning the physical layer into functional data structures through Unity Catalog (UC). I use UC as the technical framework to implement the physical data model while maintaining visibility into the conceptual business rules.

1. Naming Convention and Governance

I use the UC structure to enforce a clean architecture. Use one catalog per environment and schemas to define the layers of your Medallion Architecture.

Here is how I set up the initial structure to keep the environment organized:

SQL:
-- Creating a catalog for the production environment
CREATE CATALOG IF NOT EXISTS prod;

-- Creating schemas for the Medallion Architecture
CREATE SCHEMA IF NOT EXISTS prod.bronze;
CREATE SCHEMA IF NOT EXISTS prod.silver;
CREATE SCHEMA IF NOT EXISTS prod.gold;

This structure allows business analysts to browse the catalog and see relationships, which makes the model transparent to the whole team. UC also helps with documentation by using AI to suggest descriptions for your tables and columns, which helps turn a cryptic physical column like cust_v2_id back into a clear business term.
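
I also like to persist those descriptions directly on the objects so the business meaning travels with the physical model. A minimal sketch, assuming a hypothetical prod.silver.customers table that carries the cryptic cust_v2_id column:

SQL:
-- Document the table so its purpose is visible when browsing Unity Catalog
COMMENT ON TABLE prod.silver.customers IS 'Cleaned customer records loaded from the legacy CRM';

-- Attach the business term to a cryptic physical column
ALTER TABLE prod.silver.customers
ALTER COLUMN cust_v2_id COMMENT 'Customer ID (identifier carried over from CRM version 2)';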

2. Implementing the Relational Model

It is critical to understand that, unlike a traditional RDBMS, Spark does not enforce Primary and Foreign Key constraints; they serve primarily as informational metadata. However, they remain a vital part of the Lakehouse architecture for optimization purposes.

I use the RELY keyword to unlock specific optimizations. It acts as a performance hint rather than a hard constraint, telling the Databricks optimizer to trust your data integrity so it can apply improvements such as JOIN elimination. You must first validate that your data has no duplicates or orphaned keys (SQL and Python snippets are great for this; see the sketch below). If you declare RELY on data that violates the constraint, your queries may produce incorrect results. I recommend using Delta Live Tables (DLT) Expectations to explicitly validate data quality during ingestion. By filtering out invalid records with @dlt.expect_or_drop, you ensure the level of data integrity that the RELY constraint assumes.
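
Here is the kind of quick check I run before declaring RELY, a sketch that assumes the prod.gold.dim_customers and prod.gold.fact_sales tables used in this article:

SQL:
-- Duplicate surrogate keys in the dimension (expect zero rows)
SELECT customer_sk, COUNT(*) AS duplicate_count
FROM prod.gold.dim_customers
GROUP BY customer_sk
HAVING COUNT(*) > 1;

-- Orphaned keys in the fact table with no matching dimension row (expect zero rows)
SELECT f.customer_sk
FROM prod.gold.fact_sales AS f
LEFT ANTI JOIN prod.gold.dim_customers AS d
  ON f.customer_sk = d.customer_sk;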

IMPORTANT: The query optimizations associated with the RELY command specifically require queries to run on Photon-enabled compute. Without Photon, these hints may be ignored.

Here is an example of creating a Fact table that trusts the dimension table:

SQL:
-- Fact Sales Table Creation
CREATE TABLE prod.gold.fact_sales (
    order_id BIGINT,
    customer_sk BIGINT,
    amount DECIMAL(10,2),
    -- Using RELY to tell the Databricks Optimizer it can trust this relationship
    CONSTRAINT sales_customer_fk 
    FOREIGN KEY (customer_sk) REFERENCES prod.gold.dim_customers (customer_sk) RELY
);

If you already have existing tables without constraints, you can add them later to optimize performance without recreating the table:

SQL:
-- Adding a constraint with RELY after table creation
ALTER TABLE prod.gold.dim_customers
ADD CONSTRAINT customer_pk PRIMARY KEY (customer_sk) RELY;

3. Physical Layout with Liquid Clustering

Modeling in the Lakehouse also means defining how data sits on disk. I now use Liquid Clustering, a relatively new Databricks feature that replaces the rigid, folder-based hierarchy of traditional partitioning as well as ZORDER.

Databricks automatically clusters data by the chosen columns, which supports high-cardinality keys and multi-dimensional filtering. This effectively future-proofs your table, because you can update the clustering keys as business needs change.

Here is how I create a high performance table using this feature:

SQL:
CREATE TABLE prod.gold.orders (
    order_id BIGINT,
    customer_id STRING,
    transaction_date TIMESTAMP,
    total_amount DOUBLE,
    store_id INT
)
-- We cluster by the columns we filter or join on most often
CLUSTER BY (customer_id, transaction_date);

When the business logic evolves, I can simply alter the table to change the physical layout without rewriting the data:

SQL:
-- If business needs change you can update the keys without a rewrite
ALTER TABLE prod.gold.orders
CLUSTER BY (customer_id, store_id);

-- To trigger the clustering optimization manually
OPTIMIZE prod.gold.orders;

-- In Databricks Runtime 16.0 and above, you can force reclustering of all records in a table with the following syntax: 
OPTIMIZE prod.gold.orders FULL;

IMPORTANT: Per Databricks documentation, large tables with millions of records that haven’t been previously optimized can take hours to complete the OPTIMIZE FULL operation. This feature is currently in Public Preview for managed Apache Iceberg tables.
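
As a quick sanity check after changing keys, I like to confirm which clustering columns are active. A small sketch, assuming the prod.gold.orders Delta table above:

SQL:
-- Table metadata for a Delta table; the clusteringColumns field lists the active clustering keys
DESCRIBE DETAIL prod.gold.orders;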

Final Words of Advice

I have often seen data projects fail for two main reasons: first, not properly defining the what, why, and how; and second, not treating data as a product. Although storage is the cheapest it has ever been, compute is not, and with the AI boom this resource will only become more expensive down the road.

I always recommend that clients take a step back and reevaluate their priorities when building or migrating a data warehouse from the ground up. Skipping the Data Modeling, architecture, and planning phase will only lead to unreliable data integrity across your entities, increased compute costs, and, at best, a project built on a shaky foundation.

Investing the time up front to understand your business processes, define clear use cases, and design a robust data model pays off in the long run. It not only reduces technical debt and surprise costs later, but also enables teams to move faster and make decisions with confidence.

Is your Data Lake turning into a Data Swamp?

Contact us at Entrada to review your current Lakehouse architecture. Let us help you build a foundation that is ready for the future.
