Over several years of building AI systems and Databricks implementations for Entrada, I have arrived at an uncomfortable truth:
The model is rarely the bottleneck.
Your data is.
Foundation models are becoming a commodity. The data you own, and the quality of that data, is the only competitive advantage that really matters. For most organizations, the real bottleneck is this question:
Is the data you have discoverable, credible, and well-governed enough for any model?
Many organizations are test-driving Ferraris fueled through a garden hose. In the environments I have examined, perhaps 10-15% of organizations have anything approaching the “high-performance fuel system” that GenAI really needs.
The rest operate with an AI team asking the data team, “Can you send me a fresh export?” every time they want to retrain a model.
That’s not a pipeline.
That’s a bottleneck.
The “Garbage In, Hallucination Out” Reality Check
We all know the phrase Garbage In, Garbage Out. In the classic BI world, it meant a weird chart, a misleading KPI, or a dashboard that didn’t pass the sniff test.
With GenAI, the failure mode is much more dangerous because:
Poor data → fluent, confident, and beautiful lies.
No alarm lights up red when the LLM hallucinates – certainly not a universal, push-button one. The defenses therefore need to be architectural:
- Observability – be very specific about which chunks of the data were actually retrieved, from which documents, and when those documents were last updated. If you cannot display “what the model actually saw,” you’re flying blind.
- Citation Enforcement – require the model to cite its sources, which can then be checked programmatically for existence and relevance (see the sketch after this list).
- Semantic Drift Detection – track when users ask questions that fall outside the coverage of your knowledge base.
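To make citation enforcement concrete, here is a minimal sketch of a programmatic citation check. The answer/citation schema and the `validate_citations` helper are hypothetical conventions, not a library API; the point is that every citation must map back to a chunk the model was actually shown.

```python
# Minimal sketch of programmatic citation enforcement. The schema
# (an answer dict with a "citations" list, retrieved chunks keyed by
# doc_id) is a hypothetical convention, not a Databricks API.

def validate_citations(answer: dict, retrieved_chunks: dict) -> list[str]:
    """Return a list of problems; an empty list means citations check out."""
    problems = []
    citations = answer.get("citations", [])
    if not citations:
        problems.append("answer has no citations")
    for cite in citations:
        chunk = retrieved_chunks.get(cite["doc_id"])
        if chunk is None:
            # The model cited a document it was never shown: a red flag.
            problems.append(f"cited doc {cite['doc_id']} was not retrieved")
        elif cite.get("quote") and cite["quote"] not in chunk["text"]:
            # Quoted text must actually appear in the source chunk.
            problems.append(f"quote not found in doc {cite['doc_id']}")
    return problems

# Usage: block (or flag for review) any answer that fails the check.
issues = validate_citations(
    {"text": "...", "citations": [{"doc_id": "kb-042", "quote": "30 days"}]},
    {"kb-042": {"text": "Refunds are issued within 30 days.", "updated": "2024-11-02"}},
)
if issues:
    print("Do not surface this answer:", issues)
```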
This is where data engineering becomes existential for GenAI accuracy.
Architecture First: The Medallion Architecture for AI
When organizations purchase the latest model without data maturity, Day 1 typically consists of “hero work”: a handful of people hacking together CSVs for a proof of concept. That does not scale.
To scale, you need architectural rigor.
Unstructured Data: The Challenge, and the Importance of Lakehouse
The traditional data warehouse was developed for structured analysis. GenAI needs:
- Structured data
- Semi-structured logs
- Unstructured text, images, and PDFs
- Vector embeddings
- Metadata and lineage
- Unified governance for all of the above
The Databricks Data Intelligence Platform, developed using the Lakehouse architecture, is among the soundest platforms for this because it offers:
- A single storage layer (Delta Lake) with warehouse-grade reliability
- A single governance layer (Unity Catalog) for data, models, features, and vector indexes
- Native vector search and AI tooling (Mosaic AI)
- The ability to work with both unstructured and structured data simultaneously without “ETL spaghetti”
It is this convergence of data, computation, embeddings, and governance that allows GenAI systems to run safely and at scale.
Applying Bronze, Silver, and Gold to RAG

RAG is not a product; it’s a pipeline.
To figure out whether you actually have a pipeline or simply “random stuff in a bucket,” ask these three questions of your system:
- What document did this answer come from?
- When was the document last updated?
- Who is authorized to view it?
If your team cannot answer these three questions within a minute, you are not ready for production.
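As an illustration, here is a minimal sketch of answering all three questions with one query. It assumes a Databricks notebook (where `spark` is predefined) and a governed chunk table with provenance columns; the table and column names (`main.rag_silver.doc_chunks`, `source_uri`, `source_updated_at`, `allowed_group`) are hypothetical, matching the Silver layer described below.

```python
# Hypothetical provenance lookup: one query answers all three questions.
chunk_id = "chunk-8812"  # the chunk a given answer was grounded on

spark.sql(f"""
    SELECT source_uri,          -- which document did this come from?
           source_updated_at,   -- when was it last updated?
           allowed_group        -- who is authorized to view it?
    FROM   main.rag_silver.doc_chunks
    WHERE  chunk_id = '{chunk_id}'
""").show(truncate=False)
```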
Here’s the mapping of the Medallion Architecture onto GenAI:
Bronze – Raw Documents
Land raw PDFs, Confluence pages, emails, and file metadata into Delta tables or Volumes, unchanged. Preserve provenance.
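A minimal sketch of the Bronze landing step, using Databricks Auto Loader so new files flow in incrementally; the paths and table name are illustrative placeholders.

```python
# Bronze: land raw files as-is. Auto Loader's binaryFile format keeps
# the bytes plus path and modification time, so provenance stays intact.
raw = (
    spark.readStream.format("cloudFiles")
    .option("cloudFiles.format", "binaryFile")
    .load("/Volumes/main/raw/docs")
)

(raw.writeStream
    .option("checkpointLocation", "/Volumes/main/raw/_checkpoints/docs")
    .trigger(availableNow=True)   # incremental batch: process new files, stop
    .toTable("main.rag_bronze.raw_documents"))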
Silver – Refined, Chunked, Embedded
- Normalize the data, removing any duplicates
- Chunk text
- Create embeddings
- Store them in structured Delta tables along with semantic metadata (a sketch of this step follows)
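Here is a minimal sketch of the Silver step under stated assumptions: the table names are illustrative, the fixed-size chunker is deliberately naive, and the embedding call goes through MLflow Deployments to a Databricks foundation-model embedding endpoint (`databricks-bge-large-en` here; substitute your own). A production job would batch the embedding calls rather than loop on the driver.

```python
# Silver sketch: dedupe, chunk, embed, store. Table names are illustrative.
from mlflow.deployments import get_deploy_client

def chunk(text: str, size: int = 1000, overlap: int = 100) -> list[str]:
    # Naive fixed-size chunker; use a semantic splitter in production.
    return [text[i : i + size] for i in range(0, len(text), size - overlap)]

client = get_deploy_client("databricks")
docs = (
    spark.table("main.rag_bronze.raw_documents_text")
    .dropDuplicates(["doc_id"])        # normalize: remove duplicates
    .collect()                         # driver-side loop: sketch only
)

rows = []
for doc in docs:
    for i, piece in enumerate(chunk(doc["text"])):
        resp = client.predict(endpoint="databricks-bge-large-en",
                              inputs={"input": [piece]})
        rows.append((doc["doc_id"], i, piece,
                     resp["data"][0]["embedding"], doc["source_uri"]))

schema = ("doc_id string, chunk_id int, text string, "
          "embedding array<float>, source_uri string")
spark.createDataFrame(rows, schema).write.mode("overwrite") \
    .saveAsTable("main.rag_silver.doc_chunks")
```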
Gold – Curated and Governed Corpora
- Develop use-case-specific governed views (Customer Support KB, HR Policies, Engineering Handbook)
- Apply Unity Catalog security so that your RAG pipeline retrieves only the data each user is entitled to see
This provides you with lineage, observability, reproducibility, and enforceable access control.
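As a minimal sketch, a Gold corpus can be a governed view over the Silver chunks, with access granted centrally; the catalog, schema, and group names below are illustrative.

```python
# Gold: a use-case-scoped, governed corpus in Unity Catalog.
spark.sql("""
    CREATE OR REPLACE VIEW main.rag_gold.support_kb AS
    SELECT doc_id, chunk_id, text, embedding, source_uri
    FROM   main.rag_silver.doc_chunks
    WHERE  source_uri LIKE '/Volumes/main/raw/docs/support/%'
""")

# Central, auditable permissions: retrieval can only touch what the
# consuming group is allowed to see.
spark.sql("GRANT SELECT ON VIEW main.rag_gold.support_kb TO `support-agents`")
```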
Silent Killer: Governance and Security in the Age of AI

The number one GenAI fear among CISOs and business leaders is:
“Will the chatbot leak confidential data?”
Unfortunately, carelessly built GenAI systems create exactly that risk, especially when documents are copied into external vector stores that sit outside enterprise governance.
The Significance of Unity Catalog for GenAI
Unity Catalog is the governance “circuit breaker” for your AI environment. It provides:
- Centralized Policies – Permissions are defined once and apply across SQL, notebooks, dashboards, machine learning, and RAG retrieval.
- Scoped Access – AI systems access defined Gold tables/views, not unrestricted corpora.
- Auditability – Every access, whether a data read, an embedding lookup, or a model call, is recorded for audit.
When used together with Mosaic AI Vector Search, vector indexes live inside Unity Catalog and inherit the row/column security and masking rules of the underlying data. This closes the loophole where an intern could accidentally query the very same semantic index the CHRO uses.
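A minimal sketch of creating such a governed index with the Mosaic AI Vector Search SDK. Assumptions to flag: a Vector Search endpoint named `rag-endpoint` already exists, the Gold corpus is materialized as a Delta table `main.rag_gold.support_kb_chunks` with Change Data Feed enabled (Delta Sync indexes read from tables, not views), and embeddings were computed in the Silver layer.

```python
from databricks.vector_search.client import VectorSearchClient

vsc = VectorSearchClient()  # picks up notebook credentials on Databricks

index = vsc.create_delta_sync_index(
    endpoint_name="rag-endpoint",
    index_name="main.rag_gold.support_kb_index",
    source_table_name="main.rag_gold.support_kb_chunks",
    pipeline_type="TRIGGERED",               # sync on demand
    primary_key="chunk_id",
    embedding_dimension=1024,                # bge-large embeddings
    embedding_vector_column="embedding",     # self-managed embeddings
)

# Retrieval stays inside governance: the index lives in Unity Catalog,
# so callers see only what their permissions allow.
results = index.similarity_search(
    query_vector=[0.0] * 1024,   # placeholder; use a real query embedding
    columns=["doc_id", "text", "source_uri"],
    num_results=5,
)
```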
This is what security-aware retrieval looks like today.
Accelerating Time-to-Value with the Data Intelligence Platform

Reducing the “Data Gravity Tax”
In many organizations, the data warehouse, data lake, ML environment, and vector storage are separate systems. That separation carries a measurable cost:
- Weeks of engineering time to sync data
- Constant breakage because of schema changes
- Duplicated permissions
- Drift between environments
In my field experience, teams lose 30% to 50% of their velocity to this cross-system coordination.
With the Databricks Data Intelligence Platform:
- Data lives once
- Governance is unified
- Models and agents operate where the data already resides
- New Gold tables are immediately searchable and queryable under the same permissions and lineage
This speeds up the iterative process for AI development teams considerably.
Champion’s Audit: Evaluating the Maturity of GenAI
After auditing numerous enterprise setups, I can spot the following three red flags almost immediately:
- No Single Source of Truth – If the data resides in shadow systems numbering in the dozens with no enterprise catalog, you’re not ready.
- Ad-Hoc Pipelines Run on Cron Jobs – These guarantee inconsistent model inputs and inconsistent results.
- Governance Bolted On – If every tool enforces its own authorization, GenAI becomes a security problem waiting to happen.

The First Step Executives Should Take Today
It is premature to hire AI scientists before you have built your data foundation. They will spend 60-80% of their time on incidental data engineering – cleaning CSVs, reverse-engineering schemas, wiring exports. That’s a $300K expert doing $80K work, and they will burn out.
The Executive Mandate
Before adding new personnel to the AI staff:
- Assign a leader to implement a single Lakehouse platform and catalog.
- Run this test: pick your top three AI use cases and ask your data team, “Can you give me a governed, documented, automatically refreshed dataset for these, today?”
If the answer is no, then you know where the next investment should go.
Conclusion
Excellent AI demands excellent data engineering. The truth is that without a firm foundation – designed with the Medallion Architecture, governed by Unity Catalog, and powered by Mosaic AI – your GenAI strategy will struggle to get past proof of concept.
You do not need a better model.
You need better data.
And the maturity to deliver it – data engineering maturity.