CI/CD For The Lakehouse: Terraform & Databricks CI/CD

March 26, 2026

True CI/CD for the Lakehouse: Infrastructure as Code (IaC) & DABs

There is a conversation I have had more times than I can count. A client tells me their team “already has CI/CD.” When I ask them to walk me through it, the answer usually sounds like this: a developer runs a notebook to completion, exports it, uploads it to a shared folder, and notifies the production team via Slack to “pull the latest version.” That is not CI/CD. That is a deployment ceremony wrapped in good intentions.

Author

William Guzmán-Daugherty

The gap I often see isn’t in understanding the theory but in its implementation, especially for the Lakehouse. This comes down to two layers that need to work together: Terraform for the platform foundation and Declarative Automation Bundles (formerly known as Databricks Asset Bundles) for the workload layer. Get both right, and you’ll have a truly automated, auditable, and reproducible data platform.

The Two-Layer Architecture

Before writing a single line of YAML or HCL, it is worth understanding why this is a two-layer problem.

Most teams that start CI/CD for Databricks automate notebook deployments, wire their pipelines to Git, and consider the job done. What they often overlook is that the environment itself – workspace, Unity Catalog hierarchy, storage credentials, cluster policies, service principals – was assembled by hand in the UI, often by someone who has since left the team.

The result is an environment that is impossible to replicate. You cannot spin up a clean staging workspace. You cannot audit what changed and when. And you cannot recover quickly if something goes wrong at the infrastructure level.

The correct separation is:

Terraform handles the infrastructure layer: cloud workspaces, Unity Catalog metastores, catalogs, schemas, storage credentials, and identity.

DABs handle the workload layer: jobs, pipelines, notebooks, libraries, and environment-specific parameters.

Both layers need to live in Git. The one exception is the Terraform state file, which belongs in a remote backend rather than in the repository.

Layer 1: Terraform for the Platform Foundation

Databricks has an officially supported Terraform provider that operates at two levels – account and workspace – and you need both. The account-level provider manages metastores and workspace provisioning. The workspace-level provider manages Unity Catalog objects, cluster policies, and permissions.

As I covered in the Data Modeling article and the Multi-Cloud Federation article, your Unity Catalog hierarchy is the foundation of your entire Lakehouse. Terraform is what keeps that foundation consistent across environments and developers.

Here is how you provision the full Unity Catalog structure as code, not clicks:

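# Declare the provider source; without this block Terraform looks for hashicorp/databricks
terraform {
  required_providers {
    databricks = {
      source = "databricks/databricks"
    }
  }
}

# Always authenticate with a Service Principal - never a personal token
provider "databricks" {
  alias         = "workspace"
  host          = var.workspace_url
  client_id     = var.sp_client_id
  client_secret = var.sp_client_secret  # declare this variable with sensitive = true
}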

# Create the environment-specific catalog
resource "databricks_catalog" "env_catalog" {
  provider     = databricks.workspace
  metastore_id = var.metastore_id
  name         = var.environment  # "dev", "staging", or "prod"
  comment      = "Managed by Terraform — do not modify manually"
}

# Medallion Architecture schemas as code
resource "databricks_schema" "bronze" {
  provider     = databricks.workspace
  catalog_name = databricks_catalog.env_catalog.name
  name         = "bronze"
}

resource "databricks_schema" "silver" {
  provider     = databricks.workspace
  catalog_name = databricks_catalog.env_catalog.name
  name         = "silver"
}

resource "databricks_schema" "gold" {
  provider     = databricks.workspace
  catalog_name = databricks_catalog.env_catalog.name
  name         = "gold"
}

# Permissions declared as code — no ambiguity about who granted what
resource "databricks_grants" "catalog_grants" {
  provider = databricks.workspace
  catalog  = databricks_catalog.env_catalog.name

  grant {
    principal  = "data-engineers"
    privileges = ["USE_CATALOG", "CREATE_SCHEMA", "CREATE_TABLE"]
  }

  grant {
    principal  = "data-analysts"
    privileges = ["USE_CATALOG"]
  }
}
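
The provider block above reads its credentials from Terraform variables. One way to keep those values out of the repository is to supply them from CI secrets as TF_VAR_-prefixed environment variables, which Terraform maps onto input variables of the same name. The sketch below is a hypothetical pre-flight check (the function name missing_tf_vars is my own, not part of any tool) that lists whichever of those variables are still unset before a plan or apply runs.

```shell
#!/bin/sh
# Hypothetical pre-flight helper: prints the name of every listed variable
# that is unset or empty. Returns 0 when all are present, 1 otherwise.
missing_tf_vars() {
  rc=0
  for name in "$@"; do
    # Indirect lookup of the variable whose name is stored in $name.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      printf '%s\n' "$name"
      rc=1
    fi
  done
  return "$rc"
}

# Example CI usage (names mirror the variables used in the provider block):
# missing_tf_vars TF_VAR_workspace_url TF_VAR_sp_client_id TF_VAR_sp_client_secret \
#   || { echo "missing Terraform inputs" >&2; exit 1; }
```

Failing early like this tends to give a clearer error than letting the provider fail on authentication halfway through a plan.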

IMPORTANT: Always store Terraform state remotely – in an S3 bucket, Azure Blob container, or GCS bucket. Never commit state files to Git. A lost or corrupted local state file in a production environment is a serious incident waiting to happen.
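
The "never commit state" rule is easy to state and easy to break by accident. As a minimal sketch, assuming your CI job can list the files tracked in Git, the hypothetical helper below (tracked_state_files is a name I made up) flags any path that looks like a Terraform state file or one of its backups.

```shell
#!/bin/sh
# Hypothetical guard: reads file paths on stdin (one per line) and prints
# those that look like Terraform state files or their backups.
# Returns 0 if at least one state file was found, 1 if none were.
tracked_state_files() {
  found=1
  while IFS= read -r path || [ -n "$path" ]; do
    case "$path" in
      *.tfstate|*.tfstate.*)
        printf '%s\n' "$path"
        found=0
        ;;
    esac
  done
  return "$found"
}

# Example CI usage: fail the build when state is tracked in Git.
# if git ls-files | tracked_state_files; then
#   echo "Terraform state must not be committed" >&2
#   exit 1
# fi
```

Pairing a check like this with *.tfstate and *.tfstate.* entries in .gitignore covers both the prevention and the detection side.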

Layer 2: Databricks Asset Bundles for Workload Deployment

Once your infrastructure is stable and managed through code, you need a plan for deploying the workloads that run on it. This is where Declarative Automation Bundles (DABs) come into play.

A bundle is an end-to-end definition of a project – it packages your source code together with declarative YAML definitions of jobs, pipelines, and cluster configurations, and deploys them as a single unit to a target environment. Under the hood, DABs use Terraform while abstracting away the state management complexity so your data engineering team can focus on what matters.

Every bundle has a databricks.yml file as its root. Here is a production-ready configuration:

bundle:
  name: lakehouse_pipelines

include:
  - resources/jobs/*.yml
  - resources/pipelines/*.yml

variables:
  catalog_name:
    description: "Unity Catalog target catalog"
    default: dev

targets:
  dev:
    mode: development
    default: true
    workspace:
      # Literal URL placeholder: bundles do not resolve variables in authentication fields such as host
      host: https://<dev-workspace-host>
      # Each developer gets their own isolated path, so nobody overwrites anyone else
      root_path: /Workspace/Users/${workspace.current_user.userName}/.bundle/${bundle.name}/${bundle.target}
    variables:
      catalog_name: dev

  prod:
    mode: production
    workspace:
      host: https://<prod-workspace-host>
      root_path: /Shared/.bundle/prod/${bundle.name}
    # Production always runs as a Service Principal, never a human identity
    run_as:
      # Expects the service principal's application ID rather than its display name
      service_principal_name: sp-lakehouse-prod
    permissions:
      - level: CAN_MANAGE
        group_name: data-platform-admins
    variables:
      catalog_name: prod

Your jobs are defined in separate YAML files, which the bundle picks up through the include patterns in databricks.yml. Once defined, the bundle can also be deployed directly from the Databricks UI:

# resources/jobs/transformation_job.yml

resources:
  jobs:
    silver_transformation:
      name: "Silver Layer Transformation — ${bundle.target}"

      email_notifications:
        on_failure:
          - data-platform-alerts@company.com

      tasks:
        - task_key: transform_customers
          notebook_task:
            # Relative to this YAML file; ../../ assumes src/ sits at the bundle root
            notebook_path: ../../src/transformation/transform_silver.py
            base_parameters:
              catalog: ${var.catalog_name}
              source_schema: bronze
              target_schema: silver
          new_cluster:
            spark_version: "15.4.x-scala2.12"
            node_type_id: Standard_DS3_v2  # Azure VM size; pick an equivalent node type on AWS or GCP
            num_workers: 2

Pro Tip: Always include ${bundle.target} in your job names. When dev and staging are deployed side by side, this suffix prevents name collisions in the Workflows UI and makes it immediately clear which environment a job belongs to.

Once your bundle is configured, the CLI commands are straightforward:

# Validate your bundle - catches schema errors before they hit an environment
databricks bundle validate -t dev

# Deploy and run
databricks bundle deploy -t dev
databricks bundle run -t dev silver_transformation

# Promote to production
databricks bundle deploy -t prod

From here, wrapping these CLI commands in a GitHub Actions or Azure DevOps pipeline closes the loop: a pull request triggers validation and unit tests, a merge to main deploys to staging, and a release branch deploys to production behind a manual approval gate. The staging step assumes a staging target defined in databricks.yml alongside dev and prod.
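
The branch-to-environment mapping described above can be sketched as a small dispatcher that your CI tool calls. This is a hypothetical helper, not a Databricks feature: the function name bundle_steps, the release/* branch naming, and the staging target are all assumptions. It only prints the bundle commands for a given event, so unit tests and the approval gate stay in the CI tool where they belong.

```shell
#!/bin/sh
# Hypothetical dispatcher: prints the bundle commands for a CI event.
#   $1 = event name ("pull_request" or "push")
#   $2 = full Git ref, e.g. refs/heads/main
# Unknown combinations print nothing, so feature-branch pushes deploy nothing.
bundle_steps() {
  event="$1"
  ref="$2"
  case "$event:$ref" in
    pull_request:*)
      echo "databricks bundle validate -t dev"
      ;;
    push:refs/heads/main)
      # Assumes a "staging" target exists in databricks.yml.
      echo "databricks bundle validate -t staging"
      echo "databricks bundle deploy -t staging"
      ;;
    push:refs/heads/release/*)
      # The manual approval gate belongs in the CI tool, not in this script.
      echo "databricks bundle validate -t prod"
      echo "databricks bundle deploy -t prod"
      ;;
  esac
}
```

In GitHub Actions, for example, a step could run `bundle_steps "$GITHUB_EVENT_NAME" "$GITHUB_REF" | sh -e` to execute the printed commands and stop at the first failure. Printing before executing also makes the mapping easy to review in a pull request.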

Final Words of Advice

Implementing true CI/CD for the Lakehouse isn’t a weekend project, but the benefits are immediate. Teams that adopt this architecture ship faster, experience fewer production issues, and spend less time debugging environment-specific problems that should never occur in the first place.

There is one principle I always return to: if you can’t recreate your entire Databricks environment from a git clone and two pipeline runs, then you lack true CI/CD.

Begin with the infrastructure. Define your Unity Catalog hierarchy in Terraform. Add your bundle configuration. Connect it to your preferred CI/CD tool. Each step builds on the previous one, and you can continue shipping without interruption.

Is your team ready to move from manual Databricks deployments to a true engineering platform? To discuss a personalized CI/CD strategy for your Lakehouse, contact Entrada today.
