You draft the SOW with the only instruction being to migrate all 1,000 tables to the cloud using Azure and Databricks. When you ask the client, “Why?”, they reply, “We don’t know what’s inside all the tables, so we want to import everything regardless of the cost, time, and effort.” What would be your next move, given that their team consists of one person juggler, an analyst, and roles for a data engineer, manager, and junior data analyst, with a 6-month timeline? In the back of your head, you know this is almost impossible.

The initial plan was a brute-force approach: connect via JDBC and ingest everything using SELECT * FROM on_premise_db…`. The client quickly realized the flaw in this logic. A single query ran for days, but managed to pull in only about 100 tables. Luckily, auto-scaling was disabled; otherwise, they would have faced a massive compute bill for data they hadn’t even validated yet. There had to be a more intelligent way to handle this volume. Enter Databricks Lakehouse Federation. This capability allows you to map and query external data sources (such as PostgreSQL, MySQL, or Snowflake) directly from Databricks without moving a single byte of data.

For on-premise-to-cloud migrations, this is a game-changer. Instead of blindly ingesting 1,000 tables, we federated them instantly. We could then profile the data in place, identify the critical tables for migration, and leave the legacy “noise” behind.

To be clear, federation isn’t a novel concept – PostgreSQL has used Foreign Data Wrappers (FDW) for years, and other platforms have similar virtualization features. However, applying this strategy *within* the Unity Catalog governance layer brings a modern, scalable twist to a classic pattern. Now, let’s shift from theory to practice. Here is how you architect this solution for your own migration challenges.

Before writing any code, it’s important to differentiate between the two “flavors” of Federation. Although they appear the same in the Unity Catalog UI, since both show up as Foreign Catalogs, their underlying engine implementations are quite different.

There exist two types of federation in Databricks: 

  • Query Federation: This is the platform we use for our On-Premise Data Warehouse and for sources such as PostgreSQL, MySQL, SQL Server, and Snowflake.
  • Catalog Federation: This is used for other Metastores (like AWS Glue, Hive Metastore, or another Unity Catalog).

The 3-Step Federation Framework

Establishing the Lakehouse Federation is remarkably straightforward. It adheres to a logical hierarchy that reflects the Unity Catalog object model. Below is an overview of how we transformed what was previously a 1,000-table challenge into a manageable, zero-copy discovery process.

Step 1: The Secure Handshake (The Connection)

First, you establish a secure bridge between Databricks and your on-premises system using your standard JDBC driver. You can also do this using the Databricks UI.

SQL:
-- Creating the connection to the on-premise PostgreSQL database
CREATE CONNECTION on_prem_postgres 
TYPE POSTGRESQL 
OPTIONS (
  host 'internal-db.client.com',
  port '5432',
  user secret('db_secrets', 'pg_user'),
  password secret('db_secrets', 'pg_password')
);

Pro Tip: Once created, you can use GRANT USE CONNECTION ON CONNECTION on_prem_postgres TO to control exactly which engineers are allowed to leverage this bridge.

Step 2: The Metadata Mirror (The Foreign Catalog)

This is where the magic happens. You create a Foreign Catalog that uses the connection you just established to map the external database into your Lakehouse.

SQL:
-- Mapping the external database into Unity Catalog
CREATE FOREIGN CATALOG legacy_warehouse 
USING CONNECTION on_prem_postgres 
OPTIONS (database 'prod_reporting');

The Result: Hit execute, and within seconds, all 1,000 tables populate in your Unity Catalog Data Explorer. They are fully visible and fully governable, but no actual data has been moved or duplicated.

Step 3: Profiling & The “Polite Guest” Protocol

Now that the data is visible, you can query it using standard Spark SQL to figure out what actually matters. However, because we are using Query Federation, we must be “polite guests.” Every query we write pushes the compute down to the client’s on-premise server. Instead of running heavy, unoptimized joins, we run targeted profiling queries using pushdown predicates (filters) to minimize the strain on their system:

SQL:
-- Profiling the data to see if a table is actually used/populated
SELECT 
  COUNT(*) as total_rows,
  MAX(last_updated_date) as latest_record
FROM legacy_warehouse.sales.big_table_500_cols
WHERE last_updated_date >= '2024-01-01';

Phase 4: Strategic Ingestion (The Final Move)

By running a few basic profiling queries across the federated catalog, we mathematically proved what anyone who has spent time in legacy systems already knows: most of it is a graveyard. We showed that most of those ‘critical’ tables were either completely empty, abandoned backups (`_v2_final`), or hadn’t been updated in years. Just like that, we cut the migration scope by 30% to 50% without writing a single ETL pipeline or moving large amounts of data to the cloud.
For the remaining critical tables, we didn’t just leave them federated (which would incur ongoing network latency). We used Unity Catalog to strategically ingest them as Materialized Views, converting them to high-performance Delta tables for downstream analytics.

SQL:
-- Strategically bringing only the necessary, active data into the Lakehouse
CREATE MATERIALIZED VIEW main_catalog.gold.active_customer_profiles
AS SELECT * FROM legacy_warehouse.crm.customers 
WHERE account_status = 'ACTIVE';

Conclusion: Stop Moving Dark Data

The traditional principle of data engineering was “Extract, Load, Transform” – accumulate all data into the data swamp and organize it subsequently. Today, that approach is too slow, too expensive, and carries too much liability.

Unity Catalog Lakehouse Federation fundamentally changes the traditional approach to a new standard: “Federate, Profile, Ingest.” By maintaining data in its original location until its value is demonstrated, this method turns what used to be lengthy six-month processes into swift, focused migration sprints.

Ultimately, federation isn’t just about saving on compute costs; it is a strategic approach to moving your workloads from the notebook to production faster, safer, and with absolute clarity.

Over to You, I’m sure I’m not the only one who has stared down the barrel of a multi-thousand-table “blind” migration mandate. Have you faced the “migrate everything just in case” scenario in your own projects? I’d love to hear how your team handled the discovery phase, or if you are currently battling the JDBC brute-force method.

Ready to modernize your data architecture without the overhead of traditional migration methods?

Entrada helps organizations implement advanced Unity Catalog strategies that deliver immediate value. Contact us today to discuss your specific migration needs and start your transition to a modern Data Intelligence Platform.

Other blog posts
dais26 travel e1782399186323

Entering the Agent Era: Data + AI Summit 2026 Reflection

Data + AI Summit 2026 brought the Entrada team back to San Francisco alongside more than 31,000 members of the data and AI community. As a pure-play Databricks partner, being there felt like standing at the epicenter of the next chapter of enterprise AI. A Few Proud Moments for the Entrada Team The Data + […]

Read more
Abstract data visualization showing a businessman interacting with a holographic stock chart, candlestick graphs, and financial KPI icons emerging from a tablet — symbolizing modern data architecture and AI-driven analytics on the Databricks Lakehouse.

The “Agent-Ready” Lakehouse: Bridging Data Modeling and Agentic AI

For most of the last decade, the goal of a data platform was simple: make the data available. Land it, govern it, and let the humans take it from there. That goal is no longer enough. In 2026, the consumer of your enterprise data is increasingly likely to be something other than a human. It […]

Read more
Digital data house representing the Mortgage Intelligence Platform by Entrada, with Cotality, Genie, and Lakebase

Mortgage Intelligence Platform: Building a Databricks-Native Lead Engine with Cotality, Genie, and Lakebase

Mortgage lenders sit on rich data across CRM, LOS, and servicing systems, yet still struggle to identify which borrowers are about to transact. Entrada’s Mortgage Intelligence Platform addresses that gap with a Databricks-native architecture: Cotality property intelligence delivered through Delta Sharing and Unity Catalog, deterministic scoring as governed SQL primitives, Genie grounded in a curated semantic layer, and Lakebase Postgres recording every approval and audit event. The result is a governed lead generation layer that tells growth teams who to contact, why now, and with what offer – and proves it afterward.

Read more
Show all posts
GET IN TOUCH

Millions of users worldwide trust Entrada

For all inquiries including new business or to hear more about our services, please get in touch. We’d love to help you maximize your Databricks experience.