When Databricks released Databricks Apps, Lakebase (managed Postgres), and Unity Catalog, Entrada saw an opportunity to change the paradigm. We needed far more than the ability to tag columns in UC: we needed a full-featured, hierarchical Business Glossary that rivals IKC in functionality, lives inside the Data Intelligence Platform, and even addresses some of IKC's limitations.

This is the story of how we at Entrada built a custom Databricks Business Glossary: an application that merges the user-friendly experience of a traditional catalog with the raw power of Unity Catalog, effectively retiring legacy technical debt while supercharging data governance.

Databricks Apps UI showing the running status, deployment logs, and active resources for the uat-business-glossary application.
Databricks App Deployment
Deploying the custom Databricks App interface. This serverless frontend provides a secure, SSO-authenticated portal for stewards to manage the hierarchical Business Glossary.

The Gap: IBM Knowledge Catalog vs. Unity Catalog

To understand what we built, we must first understand the gap we bridged.

IBM Knowledge Catalog (IKC) is excellent at defining hierarchical taxonomy. It handles “Categories,” “Secondary Categories,” “Stewards,” and complex relationships (e.g. “Part of,” “Synonym”). However, syncing these definitions to the physical tables in your lakehouse is often a fragile, ETL-heavy process.

Databricks Unity Catalog excels at physical metadata management (lineage, schemas, tables). Out of the box, however, it is flat: you can add comments and tags to columns, but it lacks a UI that lets a non-technical data steward manage a complex, multi-level domain hierarchy or define reference data sets without writing SQL.

Our Solution: We built a custom Databricks App that provides the rich, hierarchical UI of IKC but uses Unity Catalog and Lakebase as the engine. It creates a “Single Pane of Glass” where business definition meets physical data.

For those unfamiliar with IKC, here are a few screenshots of that product so we can compare it to what we built at Entrada in Databricks (keep in mind that we are comparing an enterprise-grade product with something two primary developers built in a few weeks):

Interface of IBM Knowledge Catalog showing business term details, "Is a type of" relationships, and "Is a part of" relationships for data governance.
IBM Knowledge Catalog (Legacy) Interface
Visualizing the benchmark: IBM Knowledge Catalog (IKC) handles complex hierarchical taxonomy but remains siloed from the physical data layer.
A pop-up window in IBM Knowledge Catalog titled "Send for approval?" with a comment field and a due date selector for a business term.
Governance Workflow in IBM Knowledge Catalog
Legacy governance in action: The approval workflow for business terms in IBM Knowledge Catalog, requiring manual intervention and comments.
A dataset preview interface in IBM Knowledge Catalog showing 10 columns of data in a grid, including technical metadata like format (CSV), file size (93 KB), and last refresh time.
Dataset View in IBM Knowledge Catalog
Physical data management in IBM Knowledge Catalog. While providing detailed schema previews and profiling, this view is often siloed from the high-level business definitions managed by stewards.

Compare these to the screenshots of our App in the sections below.

The Architecture: A Full-Stack Data App

We leveraged the full Databricks ecosystem to build a robust, three-tier application.

1. The Frontend: Dash on Databricks Apps

We chose Plotly Dash for the frontend because of its ability to handle complex data visualizations and interactivity within a Python ecosystem. Hosted on Databricks Apps, the frontend is serverless, secure, and automatically authenticated via Databricks SSO.

  • Domain Hierarchy Sidebar: Just as in IKC, a recursive tree view (domain_hierarchy.py) lets users drill down from “Global Domains” to specific “Sub-domains” (a simplified sketch follows this list).
  • Term Grid: A high-performance AG Grid (term_grid.py) allows stewards to filter, sort, and edit thousands of terms instantly.
  • Graph Visualization: Using dash-cytoscape, we built a lineage view (graph_view.py) that visualizes the relationship between a Business Term, its Stewards, and the physical Unity Catalog Columns it is linked to.
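To make the hierarchy sidebar concrete, here is a minimal sketch of the kind of recursive tree builder a module like domain_hierarchy.py could use. The domain rows, the build_tree helper, and the use of html.Details are illustrative assumptions, not the actual implementation:

from dash import Dash, html

# Example domain rows as they might be read from Lakebase (illustrative only).
DOMAINS = [
    {"id": 1, "parent_id": None, "name": "Finance"},
    {"id": 2, "parent_id": 1, "name": "Revenue"},
    {"id": 3, "parent_id": 1, "name": "Expenses"},
    {"id": 4, "parent_id": None, "name": "Customer"},
]

def build_tree(domains, parent_id=None):
    """Recursively nest child domains under their parents as collapsible nodes."""
    children = [d for d in domains if d["parent_id"] == parent_id]
    return [
        html.Details(
            [html.Summary(d["name"])] + build_tree(domains, d["id"]),
            open=True,
        )
        for d in children
    ]

app = Dash(__name__)
app.layout = html.Div(build_tree(DOMAINS), id="domain-sidebar")

if __name__ == "__main__":
    app.run(debug=True)

html.Details and html.Summary give a dependency-free collapsible tree; a production version would likely drive this from a Lakebase query and wire Dash callbacks to filter the term grid by the selected domain.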
Interface of the Databricks Business Glossary app showing a searchable grid of business terms, definitions, assigned domains, and classifications with a navigation sidebar.
All Business Terms Grid View
The Single Pane of Glass in action: A high-performance AG Grid within the Databricks App that allows stewards to search, filter, and manage thousands of Business Terms alongside their definitions and domains.

2. The Control Plane: Lakebase (PostgreSQL)

While Unity Catalog stores the physical metadata, we needed a place to store the semantic relationships: the ‘tissue’ that connects definitions. We used Databricks Lakebase, a fully managed Postgres instance, for these transactional workloads.

We designed a normalized relational schema to handle the complexity that flat tags cannot (a simplified sketch follows the list):

  • glossary_domains: Stores the recursive parent-child hierarchy.
  • glossary_terms: The core definitions, tied to domains.
  • glossary_term_relations: A polymorphic table handling term-to-term, term-to-reference-set, and term-to-value relationships.
  • link_states: A sophisticated state machine that tracks the synchronization status between our glossary and the physical columns in Unity Catalog.
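To ground the table list above, here is a simplified sketch of what the DDL could look like when applied over a standard Postgres connection to Lakebase. The column names, types, and psycopg2 connection string are illustrative assumptions; the real definitions live in the versioned migration scripts described later:

import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS glossary_domains (
    domain_id  SERIAL PRIMARY KEY,
    parent_id  INT REFERENCES glossary_domains (domain_id),  -- recursive parent-child hierarchy
    name       TEXT NOT NULL
);

CREATE TABLE IF NOT EXISTS glossary_terms (
    term_id     SERIAL PRIMARY KEY,
    domain_id   INT REFERENCES glossary_domains (domain_id),
    name        TEXT NOT NULL,
    definition  TEXT
);

CREATE TABLE IF NOT EXISTS glossary_term_relations (
    relation_id   SERIAL PRIMARY KEY,
    term_id       INT REFERENCES glossary_terms (term_id),
    related_type  TEXT NOT NULL,  -- 'term', 'reference_set', or 'value' (polymorphic)
    related_id    INT NOT NULL,
    relation      TEXT NOT NULL   -- e.g. 'synonym', 'part_of'
);

CREATE TABLE IF NOT EXISTS link_states (
    link_id     SERIAL PRIMARY KEY,
    term_id     INT REFERENCES glossary_terms (term_id),
    column_fqn  TEXT NOT NULL,    -- e.g. 'finance.sales.revenue_adj'
    state       TEXT NOT NULL     -- e.g. 'PENDING', 'SYNCED', 'INVALID'
);
"""

# Connection string is a placeholder for the Lakebase endpoint.
with psycopg2.connect("postgresql://user:password@lakebase-host:5432/glossary") as conn:
    with conn.cursor() as cur:
        cur.execute(DDL)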
Project dashboard in Databricks Lakebase showing compute status, RAM usage metrics, and branch settings for the production environment.
Databricks Lakebase Monitoring Dashboard
The Control Plane: Monitoring the Lakebase (managed Postgres) instance, which stores the complex semantic relationships and hierarchical “tissue” that connects business definitions.

3. The Integration Layer: Unity Catalog & System Tables

This is where the magic happens. We wrote an AssetLinksExecutor that acts as the bridge. When a Steward links a Business Term (e.g. “Adjusted Gross Revenue”) to a physical column (finance.sales.revenue_adj), the app doesn’t just store that link; it pushes the definition back to Unity Catalog via ALTER COLUMN COMMENT commands.
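As a rough illustration of that push (not the actual AssetLinksExecutor), writing a definition onto a column can be done with a single statement through the Databricks SQL connector. The hostname, HTTP path, token, and definition text below are placeholders:

from databricks import sql  # databricks-sql-connector

def push_definition(column_fqn: str, definition: str) -> None:
    """Write a business definition onto a physical Unity Catalog column as its comment."""
    # Connection details are placeholders; in the app they come from configuration.
    with sql.connect(
        server_hostname="<workspace-host>",
        http_path="/sql/1.0/warehouses/<warehouse-id>",
        access_token="<token>",
    ) as conn:
        with conn.cursor() as cursor:
            # DDL does not accept parameter markers, so escape single quotes manually.
            safe_definition = definition.replace("'", "''")
            cursor.execute(f"COMMENT ON COLUMN {column_fqn} IS '{safe_definition}'")

push_definition(
    "finance.sales.revenue_adj",
    "Adjusted Gross Revenue: definition text maintained by the steward in the glossary.",
)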

Key Technical Features

The “Asset Linker” Engine

One of the hardest challenges in governance is keeping the glossary in sync with the database. We built a robust engine (asset_links_executor.py) that performs a “State Diff” (a simplified version is sketched after the steps below).

When a steward updates a definition in the App:

  1. The app identifies all physical columns linked to that term.
  2. It calculates a “Diff” (Add, Update, Delete).
  3. It executes COMMENT ON COLUMN statements against Unity Catalog to ensure that a Data Analyst querying the table via SQL sees the exact same definition as the Steward in the App.
  4. It logs the synchronization state, handling schema drift (e.g. if a physical column is dropped, the link is marked INVALID rather than crashing the app).
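A stripped-down version of that diff logic could look like the following; the input shapes and state names are simplified assumptions rather than the real asset_links_executor.py:

def diff_links(desired: dict[str, str], actual: dict[str, str], live_columns: set[str]):
    """
    desired:      column FQN -> definition the glossary wants on that column
    actual:       column FQN -> comment previously synced to Unity Catalog
    live_columns: column FQNs that still physically exist (schema-drift check)
    Returns lists of column FQNs to add, update, delete, or mark INVALID.
    """
    adds    = [c for c in desired if c not in actual and c in live_columns]
    updates = [c for c in desired if c in actual and desired[c] != actual[c] and c in live_columns]
    deletes = [c for c in actual  if c not in desired and c in live_columns]
    # Dropped physical columns: mark the link INVALID instead of crashing the sync.
    invalid = [c for c in desired if c not in live_columns]
    return adds, updates, deletes, invalid

adds, updates, deletes, invalid = diff_links(
    desired={"finance.sales.revenue_adj": "Adjusted Gross Revenue ..."},
    actual={},
    live_columns={"finance.sales.revenue_adj"},
)
# -> adds=['finance.sales.revenue_adj'], updates=[], deletes=[], invalid=[]

Each resulting bucket then maps onto the COMMENT ON COLUMN statements and link_states updates described in the steps above.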

IKC Parity + Migration

To replace IKC, we had to support its data format. We built a custom parser (ikc.py) that handles the complex, multi-row export format of IBM Knowledge Catalog.

  • Recursive Category Parsing: It reads Category >> SubCategory >> Leaf strings and auto-generates the domain IDs in Lakebase (a simplified sketch follows this list).
  • Stewardship Mapping: It resolves email addresses from IKC exports to Databricks Principals.
  • Reference Data: We built a dedicated module to manage Reference Data Sets (e.g. “State Codes,” “Industry Classifications”) and link them to terms, a feature often missing in standard catalogs.
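Here is a minimal sketch of the recursive category parsing, based only on the Category >> SubCategory >> Leaf format described above; the function name and return shape are assumptions, not the actual ikc.py parser:

def parse_category_paths(paths: list[str]) -> list[dict]:
    """Turn '>>'-delimited IKC category strings into unique (name, parent) domain rows."""
    domains: dict[tuple, dict] = {}
    for path in paths:
        parts = [p.strip() for p in path.split(">>") if p.strip()]
        for depth in range(len(parts)):
            key = tuple(parts[: depth + 1])
            if key not in domains:
                domains[key] = {
                    "name": parts[depth],
                    # Root categories have no parent; deeper levels point one level up.
                    "parent": parts[depth - 1] if depth > 0 else None,
                }
    # Domain IDs would be assigned when these rows are inserted into Lakebase.
    return list(domains.values())

rows = parse_category_paths([
    "Finance >> Revenue >> Recognized Revenue",
    "Finance >> Expenses",
])
# -> Finance (root), Revenue and Expenses (children of Finance), Recognized Revenue (child of Revenue)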
Interface of the Reference Data Management module in the Databricks App showing data sets like Classification and NAICS Code with their corresponding linked values.
Reference Data Management Module
Centralizing governance with the Reference Data Management module, allowing stewards to define and link standardized data sets like classifications directly to business terms.

Interactive Graph Lineage

We wanted to visualize the impact of a term. Using graph_builder.py, we construct a network graph dynamically.

  • Nodes: Terms, Reference Values, and Physical Assets (Tables/Columns).
  • Edges: Relationships (e.g. “Term X classifies Column Y”).
This allows a user to click a term and instantly see every physical table in the Lakehouse that relies on that definition; a simplified element builder is sketched below.
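For reference, dash-cytoscape takes a flat list of node and edge dictionaries; a simplified element builder in the spirit of graph_builder.py might look like this (the input record shapes are assumptions):

import dash_cytoscape as cyto
from dash import Dash, html

def build_elements(terms, links):
    """terms: [{'id', 'name'}]; links: [{'term_id', 'target_id', 'target_label', 'relation'}]."""
    elements = [{"data": {"id": f"term:{t['id']}", "label": t["name"]}} for t in terms]
    seen = {e["data"]["id"] for e in elements}
    for link in links:
        if link["target_id"] not in seen:  # reference values or physical columns
            elements.append({"data": {"id": link["target_id"], "label": link["target_label"]}})
            seen.add(link["target_id"])
        elements.append({"data": {
            "source": f"term:{link['term_id']}",
            "target": link["target_id"],
            "label": link["relation"],  # e.g. "classifies"
        }})
    return elements

app = Dash(__name__)
app.layout = html.Div(
    cyto.Cytoscape(
        id="relationship-graph",
        elements=build_elements(
            terms=[{"id": 1, "name": "Adjusted Gross Revenue"}],
            links=[{
                "term_id": 1,
                "target_id": "col:finance.sales.revenue_adj",
                "target_label": "finance.sales.revenue_adj",
                "relation": "classifies",
            }],
        ),
        layout={"name": "cose"},
        style={"width": "100%", "height": "500px"},
    )
)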
A dynamic network graph in the Databricks App showing nodes for business terms and reference sets, with a details panel displaying term IDs and definitions.
Interactive Relationship Graph
Visualizing the semantic impact: The Relationship Graph uses dash-cytoscape to map connections between business terms, reference values, and physical Unity Catalog assets.

Deployment: Databricks Asset Bundles (DABs)

We used Databricks Asset Bundles to professionalize the SDLC.

  • Configuration: databricks.yml defines targets for lab, uat, and prod.
  • Isolation: Each environment gets its own Schema and Lakebase connection string.
  • Automation: We included a specialized init_schema job that detects the schema version and applies SQL migration scripts (00.config.sql to 09.term_stewards.sql) automatically upon deployment (a simplified runner is sketched below).
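As noted in the last bullet, the migration step is essentially a small, ordered runner. The sketch below is a simplified stand-in for the init_schema job; the schema_version bookkeeping table and connection string are assumptions:

import pathlib
import psycopg2

def run_migrations(conn_str: str, migrations_dir: str = "migrations") -> None:
    """Apply numbered SQL scripts (00.config.sql ... 09.term_stewards.sql) that have not run yet."""
    with psycopg2.connect(conn_str) as conn, conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS schema_version (filename TEXT PRIMARY KEY)")
        cur.execute("SELECT filename FROM schema_version")
        applied = {row[0] for row in cur.fetchall()}

        # Lexicographic order matches the numeric prefixes (00., 01., ..., 09.).
        for script in sorted(pathlib.Path(migrations_dir).glob("*.sql")):
            if script.name in applied:
                continue  # this version has already been applied
            cur.execute(script.read_text())
            cur.execute("INSERT INTO schema_version (filename) VALUES (%s)", (script.name,))

# The environment-specific connection string would come from the DAB target configuration.
run_migrations("postgresql://user:password@lakebase-host:5432/glossary")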

Why This Wins

By building this in Databricks, we achieved three things that IKC never could:

  1. Immediacy: The moment a definition is updated in the Glossary, it is visible to a Data Scientist in a Notebook or a SQL Analyst in the Query Editor. There is no sync lag.
  2. TCO Reduction: We eliminated the licensing cost of a standalone governance tool and the operational overhead of maintaining a separate server.
  3. Extensibility: Because the backend is standard SQL and the frontend is Python, we can add features (like GenAI-based definition suggestions) in an afternoon, not a six-month vendor roadmap cycle.
YAML configuration in Databricks Asset Bundles defining app resources, SQL warehouses, and Lakebase connection strings.
Databricks Asset Bundle (DABs) Setup
Professionalizing the SDLC using Databricks Asset Bundles (DABs) to automate Business Glossary deployments and resource isolation.

Conclusion

This project proves that Databricks Apps can be extended well beyond simple dashboards; it is a capable platform for building complex, enterprise-grade data management software. By combining the semantic richness of a dedicated glossary with the physical reality of Unity Catalog, we did not merely replace IKC, we built something better: a glossary that actually lives where the data is, with all of the added capabilities that come with that.

Ready to retire your legacy technical debt and supercharge your data governance? Contact the Entrada Team today to learn how we can help you build a native, bona fide Business Glossary directly within your Databricks environment.
