You draft the SOW with the only instruction being to migrate all 1,000 tables to the cloud using Azure and Databricks. When you ask the client, “Why?”, they reply, “We don’t know what’s inside all the tables, so we want to import everything regardless of the cost, time, and effort.” What would be your next move, given that their team consists of a single person juggling the roles of data engineer, analyst, manager, and junior data analyst, with a 6-month timeline? In the back of your head, you know this is almost impossible.

The initial plan was a brute-force approach: connect via JDBC and ingest everything using `SELECT * FROM on_premise_db…`. The client quickly realized the flaw in this logic. A single query ran for days, but managed to pull in only about 100 tables. Luckily, auto-scaling was disabled; otherwise, they would have faced a massive compute bill for data they hadn’t even validated yet. There had to be a more intelligent way to handle this volume. Enter Databricks Lakehouse Federation. This capability allows you to map and query external data sources (such as PostgreSQL, MySQL, or Snowflake) directly from Databricks without moving a single byte of data.

For on-premise-to-cloud migrations, this is a game-changer. Instead of blindly ingesting 1,000 tables, we federated them instantly. We could then profile the data in place, identify the critical tables for migration, and leave the legacy “noise” behind.

To be clear, federation isn’t a novel concept – PostgreSQL has used Foreign Data Wrappers (FDW) for years, and other platforms have similar virtualization features. However, applying this strategy *within* the Unity Catalog governance layer brings a modern, scalable twist to a classic pattern. Now, let’s shift from theory to practice. Here is how you architect this solution for your own migration challenges.

Before writing any code, it’s important to differentiate between the two “flavors” of Federation. Although both appear as Foreign Catalogs in the Unity Catalog UI, their underlying engine implementations are quite different.

There are two types of federation in Databricks:

  • Query Federation: This is the flavor we used for our on-premise Data Warehouse, and it covers sources such as PostgreSQL, MySQL, SQL Server, and Snowflake.
  • Catalog Federation: This is used for other Metastores (like AWS Glue, Hive Metastore, or another Unity Catalog).

The 3-Step Federation Framework

Establishing the Lakehouse Federation is remarkably straightforward. It adheres to a logical hierarchy that reflects the Unity Catalog object model. Below is an overview of how we transformed what was previously a 1,000-table challenge into a manageable, zero-copy discovery process.

Step 1: The Secure Handshake (The Connection)

First, you establish a secure bridge between Databricks and your on-premises system using your standard JDBC driver. You can also do this using the Databricks UI.

SQL:
-- Creating the connection to the on-premise PostgreSQL database
CREATE CONNECTION on_prem_postgres 
TYPE POSTGRESQL 
OPTIONS (
  host 'internal-db.client.com',
  port '5432',
  user secret('db_secrets', 'pg_user'),
  password secret('db_secrets', 'pg_password')
);

Pro Tip: Once created, you can use `GRANT USE CONNECTION ON CONNECTION on_prem_postgres TO <principal>` to control exactly which engineers are allowed to leverage this bridge.
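For example, to restrict the bridge to a single engineering group (the group name `data_engineers` here is illustrative, not from the original project):

SQL:
-- Granting only the data engineering group the right to use this connection
GRANT USE CONNECTION ON CONNECTION on_prem_postgres TO `data_engineers`;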

Step 2: The Metadata Mirror (The Foreign Catalog)

This is where the magic happens. You create a Foreign Catalog that uses the connection you just established to map the external database into your Lakehouse.

SQL:
-- Mapping the external database into Unity Catalog
CREATE FOREIGN CATALOG legacy_warehouse 
USING CONNECTION on_prem_postgres 
OPTIONS (database 'prod_reporting');

The Result: Hit execute, and within seconds, all 1,000 tables populate in your Unity Catalog Data Explorer. They are fully visible and fully governable, but no actual data has been moved or duplicated.
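As a quick sanity check, you can browse the mirrored catalog like any other Unity Catalog object. A minimal sketch, assuming an illustrative `sales` schema on the source side:

SQL:
-- Listing the mirrored schemas and tables; no data leaves the source system
SHOW SCHEMAS IN legacy_warehouse;
SHOW TABLES IN legacy_warehouse.sales;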

Step 3: Profiling & The “Polite Guest” Protocol

Now that the data is visible, you can query it using standard Spark SQL to figure out what actually matters. However, because we are using Query Federation, we must be “polite guests.” Every query we write pushes the compute down to the client’s on-premise server. Instead of running heavy, unoptimized joins, we run targeted profiling queries using pushdown predicates (filters) to minimize the strain on their system:

SQL:
-- Profiling the data to see if a table is actually used/populated
SELECT 
  COUNT(*) as total_rows,
  MAX(last_updated_date) as latest_record
FROM legacy_warehouse.sales.big_table_500_cols
WHERE last_updated_date >= '2024-01-01';

Step 4: Strategic Ingestion (The Final Move)

By running a few basic profiling queries across the federated catalog, we confirmed with hard numbers what anyone who has spent time in legacy systems already suspects: most of it is a graveyard. Most of those “critical” tables were either completely empty, abandoned backups (`_v2_final`), or hadn’t been updated in years. Just like that, we cut the migration scope by 30% to 50% without writing a single ETL pipeline or moving large amounts of data to the cloud.
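A cheap first pass, before any row-count profiling, is to flag the obvious backup copies by name alone. A sketch, assuming Spark SQL’s glob syntax for SHOW TABLES and an illustrative `sales` schema:

SQL:
-- Flagging likely-abandoned backup copies by naming convention
-- ('|' separates alternative glob patterns in Spark SQL)
SHOW TABLES IN legacy_warehouse.sales LIKE '*_v2_final|*_backup|*_old';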
For the remaining critical tables, we didn’t just leave them federated (which would incur ongoing network latency). We used Unity Catalog to strategically ingest them as Materialized Views, converting them to high-performance Delta tables for downstream analytics.

SQL:
-- Strategically bringing only the necessary, active data into the Lakehouse
CREATE MATERIALIZED VIEW main_catalog.gold.active_customer_profiles
AS SELECT * FROM legacy_warehouse.crm.customers 
WHERE account_status = 'ACTIVE';
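
Because the materialized view snapshots the source at creation time, it can be refreshed on demand to pick up newly active customers without re-ingesting everything:

SQL:
-- Re-running the strategic ingestion to capture new upstream rows
REFRESH MATERIALIZED VIEW main_catalog.gold.active_customer_profiles;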

Conclusion: Stop Moving Dark Data

The traditional principle of data engineering was “Extract, Load, Transform” – accumulate all data into the data swamp and organize it subsequently. Today, that approach is too slow, too expensive, and carries too much liability.

Unity Catalog Lakehouse Federation fundamentally changes the traditional approach to a new standard: “Federate, Profile, Ingest.” By maintaining data in its original location until its value is demonstrated, this method turns what used to be lengthy six-month processes into swift, focused migration sprints.

Ultimately, federation isn’t just about saving on compute costs; it is a strategic approach to moving your workloads from the notebook to production faster, safer, and with absolute clarity.

Over to You

I’m sure I’m not the only one who has stared down the barrel of a multi-thousand-table “blind” migration mandate. Have you faced the “migrate everything just in case” scenario in your own projects? I’d love to hear how your team handled the discovery phase, or if you are currently battling the JDBC brute-force method.

Ready to modernize your data architecture without the overhead of traditional migration methods?

Entrada helps organizations implement advanced Unity Catalog strategies that deliver immediate value. Contact us today to discuss your specific migration needs and start your transition to a modern Data Intelligence Platform.
