It’s 4:47 PM on a Friday. A finance analyst at a large enterprise notices that one of their top accounts’ monthly bill has more than doubled. They Slack the data team. The data team is heading into the weekend. They open a Jira ticket. The ticket goes behind forty-seven others. By the time someone runs the SQL, exports a spreadsheet, and emails it back, it’s Tuesday afternoon, and the analyst’s customer call is in twenty minutes.

This loop, the one between noticing an anomaly and understanding it, is where most of the real cost of fragmented billing data lives. It isn’t the storage. It isn’t even the compute. It’s the days of human back-and-forth between someone who has a question and someone who has SQL.

The Customer Billing Accelerator is a production-grade reference implementation we built, in partnership with Databricks, to collapse that loop. It’s an agentic AI system that runs natively on the Databricks Lakehouse, continuously monitoring billing data, detecting anomalies on its own, letting business users investigate in natural language, and writing back to operational systems through governed, auditable workflows. Unity Catalog enforces who can do what at every step. There are no external dependencies, no shadow IT, and no SQL bottleneck.

This post walks through what the system does, who it’s for, how the pieces fit together, and what makes the underlying Databricks platform load-bearing rather than incidental. It starts at the business level and gradually descends into the architecture; by the end, you’ll have a clear picture of how an agentic AI stack is actually built and deployed in production on Databricks.

What’s Broken in Enterprise Billing Analytics

Three things, mostly.

First, billing data is everywhere. ERP holds order-to-cash. A separate billing engine computes charges. Operational systems carry the usage events that drive those charges. A reporting warehouse contains the curated facts. Finance can’t get a single view across them without a ticket, and that latency is exactly where revenue leakage hides.

Second, investigation is manual. When an analyst flags a customer whose bill spiked, what actually happens? Someone files a ticket. Someone else writes a query. A spreadsheet comes back. Someone manually cross-references against plan pricing, then maybe pulls in ERP to check credit and dispute history. That’s a multi-day cycle for a single data point, repeated across hundreds of anomalies a quarter.

Third, nobody monitors the platform itself. Finance teams need to know whether the pipelines are running, whether anomaly detection succeeded last night, whether compute costs are trending toward a quarterly overrun. Without that telemetry, you’re flying blind on the system that’s supposed to give you visibility into everything else.

These aren’t novel observations. What’s novel is that the platform underneath has caught up: Unity Catalog, Agent Bricks, Genie, and Foundation Models on Databricks now make it possible to collapse all three problems into one governed, agentic stack, without writing a custom orchestration layer.

What the Accelerator Does

The system delivers four capabilities, each mapped deliberately to a different audience.

1. Continuous anomaly detection

A scheduled PySpark job computes per-customer baselines from billing history and flags statistical outliers across four anomaly types: total-charge spikes (z-score above 2.0 relative to the customer’s own mean), roaming spikes, international charges, and data-overage spikes (each flagged when charges exceed 3× the customer’s rolling average). The thresholds are adaptive per customer; there is no global “$200 = anomaly” rule, because what’s normal for one account is a red flag for another.
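Here’s a minimal sketch of the baseline logic, assuming a Databricks notebook (where spark is in scope) and a Gold table with illustrative column names; the real job covers all four anomaly types:

from pyspark.sql import functions as F
from pyspark.sql.window import Window

# Per-customer history: every month strictly before the current row.
hist = (Window.partitionBy("customer_id")
              .orderBy("event_month")
              .rowsBetween(Window.unboundedPreceding, -1))

monthly = spark.read.table("main.billing.billing_monthly")  # illustrative name

anomalies = (
    monthly
    .withColumn("mean_charges", F.mean("total_charges").over(hist))
    .withColumn("std_charges", F.stddev("total_charges").over(hist))
    .withColumn("roaming_avg", F.mean("roaming_charges").over(hist))
    .withColumn("anomaly_type", F.when(
        (F.col("total_charges") - F.col("mean_charges")) / F.col("std_charges") > 2.0,
        "total_charge_spike"                 # z-score threshold from above
    ).when(
        F.col("roaming_charges") > 3.0 * F.col("roaming_avg"),
        "roaming_spike"                      # 3x the customer's rolling average
    ))
    .filter(F.col("anomaly_type").isNotNull())
    .withColumn("pipeline_run_at", F.current_timestamp())
)

anomalies.write.mode("append").saveAsTable("main.billing.billing_anomalies")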

Anomalies land in a governed Delta table (billing_anomalies) and become immediately queryable through a Unity Catalog function (lookup_billing_anomalies) that the agent can call as a tool. A separate alerter task runs on a daily workflow, identifying unacknowledged anomalies and routing them to operational owners.

For a CFO, this surfaces aggregate exposure: what’s our total anomaly dollar amount this quarter, and is it trending up? For Finance Ops, it surfaces operational triage: which anomalies haven’t been acknowledged, and who owns them?

2. Natural-language investigation

When someone wants to dig deeper, they ask the agent in plain English. The agent doesn’t retrieve documents and improvise an answer; it executes structured SQL against governed Gold tables through a fixed set of Unity Catalog functions. It can traverse the full billing chain: customer → plan → charge breakdown → ERP revenue recognition. Seconds, not days.

The agent has fourteen UC function tools at its disposal, plus a vector-search tool for FAQ retrieval and a Genie-delegation tool for ad-hoc analytics. All of them are governed at the catalog level, the LLM never sees raw SQL, and the agent service principal never has direct read access to PII tables.

3. Self-service conversational BI via Genie

For business users who want to explore on their own, Genie offers a conversational SQL interface. Finance asks “what’s the average monthly charge by plan tier?”; Genie translates that to SQL, runs it on Serverless, and returns the result. Users refine: “now break that down by region, last six months only.” Genie maintains conversational context across the thread.

A critical design decision: the Genie Space is scoped to a PII-safe view, not the underlying tables. Customer names, emails, and phone numbers are excluded by design. Genie literally cannot generate a query that surfaces PII because the view it sees doesn’t contain those columns.
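A hypothetical sketch of what that scoping looks like; table and column names here are illustrative, not the accelerator’s actual schema:

# The view Genie is registered against projects only non-PII columns.
spark.sql("""
    CREATE OR REPLACE VIEW main.billing.customers_genie_safe AS
    SELECT customer_id, plan_id, region, signup_date, device_id
    FROM main.billing.customers
    -- name, email, and phone_number are deliberately never projected
""")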

4. Platform self-monitoring

The system ingests Databricks system tables (DBU billing, job-run timelines, query history) into a governed telemetry layer (Bronze → Silver → Gold). The agent can then answer questions like “is the platform healthy?”, “what’s our seven-day DBU trend?”, or “which jobs are failing most often?” This is the layer that the Head of Data Engineering and the FinOps team actually care about: the system that watches the system.
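The seven-day DBU question, for instance, reduces to a query over the Databricks system billing table. A sketch (in practice the accelerator serves this from its Gold telemetry tables):

# Seven-day DBU trend straight from Databricks system tables.
dbu_trend = spark.sql("""
    SELECT usage_date, SUM(usage_quantity) AS total_dbus
    FROM system.billing.usage
    WHERE usage_date >= date_sub(current_date(), 7)
    GROUP BY usage_date
    ORDER BY usage_date
""")
dbu_trend.show()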

For us, this is also the layer that makes the accelerator a credible production artifact rather than a demo. Without observability, there’s no operational basis for trust.

Walking Through One Real Flow

Here’s what an end-to-end anomaly investigation looks like once everything is wired up.

02:00 AM. The DLT streaming pipeline is processing billing events as they land. A separate PySpark job has already run and detected that customer 4401’s October roaming charges are 4.2× their rolling average, flagged as roaming_spike and written to billing_anomalies. An alert task has notified the Finance Ops queue.

08:30 AM. A finance ops analyst opens the billing-intelligence app, authenticates via SSO, and selects the Finance Ops persona. Behind the scenes, the app pulls their identity via SCIM, builds a signed RequestContext, and propagates it to the agent in custom_inputs.

08:31 AM. The analyst asks: “Why did customer 4401’s bill spike in October?” The agent, running as a service principal on a Model Serving endpoint, calls lookup_billing_anomalies('4401'), then lookup_billing_items('4401'), then lookup_customer_erp_profile('4401'). The ReAct loop runs three tool calls and synthesizes a coherent answer: international roaming charges in Brazil for two weeks, no prior pattern, ERP shows the customer is in good standing.

08:33 AM. The analyst asks: “Create a dispute for customer 4401, reason: unrecognized roaming charges, customer reported they were in domestic territory.” The agent stages the write, generates an 8-character confirmation token, and asks: “Write staged (token: a3f9c2d1). Reply CONFIRM or CANCEL.”

08:34 AM. The analyst replies “CONFIRM.” The agent verifies the token, verifies the analyst’s identity context is still valid, and executes a single transaction in Lakebase that writes both the dispute record and an immutable audit row. The audit row records the human who initiated it, the service principal that executed it, the session ID, and the persona.

The whole loop (detection, investigation, dispute creation) took four minutes. The same flow at the start of this post stretched from Friday afternoon to the following Tuesday.

Under the Hood: How It Actually Works

Here’s where things get specific: the accelerator’s architecture has six components worth understanding in detail.

1. Medallion + DLT streaming as the data backbone

All data lands in a Bronze → Silver → Gold pipeline governed by Unity Catalog. Bronze is raw ingestion. Silver is cleansed, joined, and quality-checked. Gold is business-ready: anomaly scores, revenue attribution, telemetry KPIs.

The streaming side is a Delta Live Tables pipeline (06_dlt_streaming_pipeline) that does a stream-static join: a streaming read of billing_items joined to static customer and plan dimensions, producing billing_events_streaming (Silver) and billing_monthly_running (Gold). DLT manages checkpointing, schema evolution, and data-quality expectations (@dlt.expect_or_drop on customer_id and device_id). Change Data Feed is enabled on billing_items, billing_disputes, and billing_write_audit, so audit consumers can read incrementally without full-table scans.
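A condensed sketch of the pipeline’s shape, assuming illustrative column names (event_ts, charge_amount); the stream side reads billing_items incrementally while the dimensions stay batch reads:

import dlt
from pyspark.sql import functions as F

@dlt.table(name="billing_events_streaming",
           comment="Silver: billing items enriched with customer and plan dims")
@dlt.expect_or_drop("valid_customer", "customer_id IS NOT NULL")
@dlt.expect_or_drop("valid_device", "device_id IS NOT NULL")
def billing_events_streaming():
    items = spark.readStream.table("main.billing.billing_items")  # streaming side
    customers = spark.read.table("main.billing.customers")        # static dimension
    plans = spark.read.table("main.billing.plans")                # static dimension
    return items.join(customers, "customer_id").join(plans, "plan_id")

@dlt.table(name="billing_monthly_running",
           comment="Gold: monthly charge totals per customer")
def billing_monthly_running():
    return (dlt.read("billing_events_streaming")
            .groupBy("customer_id",
                     F.date_trunc("month", "event_ts").alias("event_month"))
            .agg(F.sum("charge_amount").alias("total_charges")))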

This isn’t novel architecture. It is, however, a precondition for everything else: agentic AI on top of ungoverned data is a liability, not an asset.

2. LangGraph ReAct agent on MLflow ChatAgent

The agent itself is a LangGraph StateGraph compiled into an MLflow ChatAgent, deployed to a Databricks Model Serving endpoint. It runs as a service principal: no SparkSession, no dbutils. The graph is a two-node ReAct loop:

agent → should_continue → tools → agent
                       └→ END

Compiled with recursion_limit=30 (about 15 tool round trips), this is enough headroom for the kind of multi-hop investigations the system needs without unbounded loops. Persona-specific subgraphs are built lazily on first use and cached.
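In code, the loop is compact. A sketch, assuming the databricks-langchain and langgraph packages; the endpoint and function names are illustrative, and in current langgraph releases the recursion limit is supplied per run via the config:

from databricks_langchain import ChatDatabricks, UCFunctionToolkit
from langgraph.graph import StateGraph, MessagesState, END
from langgraph.prebuilt import ToolNode

tools = UCFunctionToolkit(
    function_names=["main.billing.lookup_billing_anomalies"]  # illustrative subset
).tools
llm = ChatDatabricks(endpoint="databricks-claude-3-7-sonnet").bind_tools(tools)

def call_model(state: MessagesState) -> dict:
    # One LLM step; the response may contain tool calls.
    return {"messages": [llm.invoke(state["messages"])]}

def should_continue(state: MessagesState) -> str:
    last = state["messages"][-1]
    # Route to the tools node if the model asked for tools; otherwise finish.
    return "tools" if getattr(last, "tool_calls", None) else END

graph = StateGraph(MessagesState)
graph.add_node("agent", call_model)
graph.add_node("tools", ToolNode(tools))
graph.set_entry_point("agent")
graph.add_conditional_edges("agent", should_continue, {"tools": "tools", END: END})
graph.add_edge("tools", "agent")
app = graph.compile()

answer = app.invoke(
    {"messages": [("user", "Why did customer 4401's bill spike in October?")]},
    config={"recursion_limit": 30},
)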

Implementing the MLflow ChatAgent protocol matters more than it might seem. It’s the contract Databricks Model Serving expects for agent endpoints, and it gets you one-line deployment via agents.deploy(), plus built-in request/response inference tables, A/B traffic split, and the AI Playground review app, all without writing custom infrastructure.

3. Unity Catalog functions as the tool interface

This is the design choice that makes the system genuinely production-safe.

The agent’s fourteen tools are not Python functions defined inside agent.py. They are SQL functions defined in Unity Catalog (lookup_customer, lookup_billing_anomalies, lookup_customer_erp_profile, and so on), loaded dynamically via UCFunctionToolkit(function_names=[...]). The function signature and COMMENT string serve as the contract between the data layer and the LLM. The LLM sees the schema; it never sees the underlying SQL.

A representative function definition:

CREATE OR REPLACE FUNCTION main.billing.lookup_billing_anomalies(
  input_customer STRING COMMENT 'Customer ID. Empty string for recent
                                  anomalies across all customers.'
)
RETURNS TABLE (customer_id BIGINT, event_month STRING, plan_name STRING,
               total_charges DOUBLE, anomaly_type STRING,
               anomaly_detail STRING, pipeline_run_at TIMESTAMP)
COMMENT 'Returns billing anomalies for a customer (charge spikes, …)'
RETURN (
  SELECT customer_id, event_month, plan_name, total_charges,
         anomaly_type, anomaly_detail, pipeline_run_at
  FROM main.billing.billing_anomalies
  WHERE (input_customer = ''
         OR customer_id = TRY_CAST(input_customer AS BIGINT))
  ORDER BY total_charges DESC
  LIMIT 50
);
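On the agent side, pulling those definitions in as tools is a few lines. A sketch, assuming the databricks-langchain package and fully qualified function names:

from databricks_langchain import UCFunctionToolkit

# The catalog is the single source of truth: the toolkit reads each
# function's signature and COMMENT and converts them into tool schemas.
toolkit = UCFunctionToolkit(function_names=[
    "main.billing.lookup_customer",
    "main.billing.lookup_billing_anomalies",
    "main.billing.lookup_customer_erp_profile",
    # ...the remaining governed lookups
])
tools = toolkit.tools  # ready to bind to the LLM

for tool in tools:
    print(tool.name, "-", tool.description)  # descriptions come from the COMMENTs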

Three concrete benefits fall out of this pattern:

  • Bounded result sets. Every UC function has either a LIMIT, a primary-key filter that returns 0–1 rows, or a vector-search num_results cap. The LLM cannot accidentally trigger a full-table scan.
  • PII isolation by privilege. The “safe” lookup_customer returns only customer_id, device_id, and plan: no names, emails, or phone numbers. The PII-revealing function (lookup_customer_pii) lives in a separate _internal schema on which USE SCHEMA has been revoked from the agent service principal. UC’s definer-rights model lets the safe function read the underlying customers table even though the SP can’t SELECT from it directly (see the privilege sketch after this list).
  • Schema-evolution safety. Adding or removing a column flows from the catalog through UCFunctionToolkit to the agent on the next cold start. Tool definitions live in one place, not three.
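A sketch of the privilege layout behind the PII-isolation bullet; the schema and principal names are illustrative:

# Illustrative privilege layout; `agent-sp` is the agent's service principal.
for stmt in [
    # The SP may execute the safe lookup...
    "GRANT EXECUTE ON FUNCTION main.billing.lookup_customer TO `agent-sp`",
    # ...but has no direct access to the underlying PII table,
    "REVOKE ALL PRIVILEGES ON TABLE main.billing.customers FROM `agent-sp`",
    # and cannot even enter the schema holding the PII-revealing function.
    "REVOKE USE SCHEMA ON SCHEMA main.billing_internal FROM `agent-sp`",
]:
    spark.sql(stmt)

Because UC SQL functions run with definer’s rights, lookup_customer can still read the customers table on the SP’s behalf: the SP gets the projection, never the table.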

4. Genie for conversational analytics, architecturally separate

The agent is not Genie, and Genie is not the agent. They serve different needs.

The agent does structured, deterministic lookups against governed UC functions. Genie does arbitrary, exploratory SQL against tables registered to its space. The agent calls Genie via a custom @tool (ask_billing_analytics) when a user’s question is open-ended enough that no fixed UC function fits: “what’s the distribution of monthly charges across our top 100 customers?” That tool delegates to _ws_client.genie.start_conversation_and_wait(), parses the result, and returns it to the agent’s reasoning loop.

The Genie Space is registered against twenty tables, all of which are PII-safe views or already-projected analytics tables. Six PII guardrail instructions are injected into the Space’s configuration. There is one operational reality worth knowing: Genie has a 45–90 second cold start after inactivity. We document that honestly rather than pretending otherwise.

5. Identity propagation, Pattern C (Hybrid)

This is one of the harder problems in agentic AI on a shared platform: who is acting when the agent calls a tool? The naive answer, “the service principal”, leaves you with no way to authorize per-user behavior or audit who did what.

The accelerator implements what we call Pattern C (Hybrid): user identity is propagated for authorization and audit, but SQL execution stays under the SP. The flow:

  • The Databricks App receives x-forwarded-access-token from the platform proxy.
  • The App calls /api/2.0/preview/scim/v2/Me to resolve email and groups.
  • The App builds a RequestContext (email, groups, persona, session, request, expiry) and signs it with HMAC-SHA256 using a key from a Databricks Secret Scope (see the sketch after this list).
  • The App sends custom_inputs.request_context to Model Serving alongside the chat history.
  • The agent validates the signature (timing-safe), checks the 15-minute TTL, and stores the context in a contextvars.ContextVar for the request.
  • Tool authorization, persona-group binding, and write authorization all consult that context.
  • Audit records both the human (initiating_user) and the SP (executing_principal).
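A minimal sketch of the sign/verify pair, assuming the HMAC key has already been fetched from the Secret Scope; field names are illustrative:

import hashlib, hmac, json, time

TTL_SECONDS = 15 * 60  # matches the 15-minute TTL above

def sign_context(key: bytes, email: str, groups: list[str], persona: str) -> dict:
    ctx = {
        "email": email,
        "groups": groups,
        "persona": persona,
        "expires_at": int(time.time()) + TTL_SECONDS,
    }
    payload = json.dumps(ctx, sort_keys=True).encode()
    return {**ctx, "signature": hmac.new(key, payload, hashlib.sha256).hexdigest()}

def verify_context(key: bytes, signed: dict) -> dict:
    ctx = {k: v for k, v in signed.items() if k != "signature"}
    payload = json.dumps(ctx, sort_keys=True).encode()
    expected = hmac.new(key, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signed["signature"], expected):  # timing-safe
        raise PermissionError("RequestContext signature mismatch")
    if time.time() > ctx["expires_at"]:
        raise PermissionError("RequestContext expired")
    return ctx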

The choice of contextvars over threading.local matters because Model Serving reuses threads from a pool and streaming responses outlive the originating call. copy_context() snapshots the identity at predict_stream() and runs each generator step inside snapshot.run() so the streaming generator sees its original identity even if the thread later handles a different request.
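A sketch of that pattern; _graph_stream, _render_chunk, and _hmac_key stand in for the real streaming internals:

import contextvars
from typing import Any, Iterator

# Context-local identity slot; unlike threading.local, it travels with the
# context snapshot rather than with the worker thread.
request_context: contextvars.ContextVar[dict] = contextvars.ContextVar("request_context")

def predict_stream(self, messages, custom_inputs=None) -> Iterator[Any]:
    ctx = verify_context(self._hmac_key, custom_inputs["request_context"])
    request_context.set(ctx)
    snapshot = contextvars.copy_context()  # freeze identity for this stream
    for event in self._graph_stream(messages):
        # Each step runs inside the snapshot, so tools invoked during the
        # step see the original identity via request_context.get(), even if
        # the worker thread has since served another request.
        yield snapshot.run(self._render_chunk, event)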

6. Token-gated write-back

Writes pose a specific Model Serving problem: there’s no SparkSession. So the agent uses a token-gated mediation pattern. When a user requests a write, the agent stages it, generates an 8-character token, stores the operation in a thread-safe in-memory dict with a 10-minute TTL, and asks the user to reply CONFIRM or CANCEL. On CONFIRM, the agent enforces two hard guards inside a single lock before invoking _execute_write: the token must exist, and the RequestContext must be present and valid.
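A minimal sketch of the staging and confirmation path, reusing the request_context variable from the identity section; _execute_write stands in for the real write path:

import secrets
import threading
import time

_staged: dict[str, dict] = {}   # token -> staged operation
_lock = threading.Lock()
STAGE_TTL = 10 * 60             # seconds

def stage_write(operation: dict) -> str:
    token = secrets.token_hex(4)  # 8 hex characters, e.g. 'a3f9c2d1'
    with _lock:
        _staged[token] = {"op": operation, "expires_at": time.time() + STAGE_TTL}
    return token  # surfaced to the user: "Reply CONFIRM or CANCEL."

def confirm_write(token: str) -> None:
    with _lock:
        entry = _staged.pop(token, None)
        # Guard 1: the token must exist and be unexpired.
        if entry is None or time.time() > entry["expires_at"]:
            raise ValueError("No valid staged write for that token")
        # Guard 2: the identity context must still be present and valid.
        ctx = request_context.get(None)
        if ctx is None or time.time() > ctx["expires_at"]:
            raise PermissionError("RequestContext missing or expired")
        _execute_write(entry["op"], ctx)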

When Lakebase (managed PostgreSQL inside Databricks) is provisioned, the dispute record and the audit row commit atomically inside a single PostgreSQL transaction. When Lakebase isn’t available, the agent falls back to the Statement Execution API: an audit INSERT (PENDING) first, then the business write, then a final audit INSERT (SUCCESS|FAILED). The two-INSERT pattern means an audit record exists even if the process crashes between steps. UPDATE-based audit trails don’t give you that.
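A sketch of the fallback path over the Statement Execution API; the table layout is illustrative, and real code would use bound parameters rather than f-strings:

from databricks.sdk import WorkspaceClient

w = WorkspaceClient()

def execute_write_with_audit(warehouse_id: str, business_sql: str,
                             audit_id: str, ctx: dict) -> None:
    def run(sql: str) -> None:
        w.statement_execution.execute_statement(statement=sql,
                                                warehouse_id=warehouse_id)

    # 1. Audit first: if the process dies after this point, PENDING survives.
    run(f"INSERT INTO main.billing.billing_write_audit VALUES "
        f"('{audit_id}', '{ctx['email']}', 'agent-sp', 'PENDING', current_timestamp())")
    status = "FAILED"
    try:
        run(business_sql)  # 2. the business write itself
        status = "SUCCESS"
    finally:
        # 3. Terminal audit row; an UPDATE here would erase the PENDING record.
        run(f"INSERT INTO main.billing.billing_write_audit VALUES "
            f"('{audit_id}', '{ctx['email']}', 'agent-sp', '{status}', current_timestamp())")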

Why This Stack: What Makes Databricks Load-Bearing

A handful of platform capabilities here are doing real work: not glue work, not branding work, but actual structural load. It’s worth being explicit about which ones.

Unity Catalog functions as a governance contract. This is the linchpin. The function schema is the boundary between governed data and the LLM. Tools are discoverable via information_schema, lifecycle-managed by the catalog, and inherit UC’s privilege model. The agent doesn’t implement governance; it participates in governance defined elsewhere.

Definer-rights privilege escalation. The pattern of giving the agent SP zero direct table access and routing reads through definer-rights UC functions is what lets us be aggressive about least-privilege without sacrificing capability. The agent can query customer data through a function but can’t SELECT from the table; the Genie SQL warehouse inherits the same restriction.

MLflow ChatAgent on Model Serving. One-line deployment, request/response inference tables, the AI Playground review app, and built-in tracing. We didn’t write any of that infrastructure, and rebuilding it is a quiet, significant cost most teams underestimate when they reach for an open-source-only stack.

Lakebase for transactional writes. The Statement Execution API works as a fallback, but Lakebase is the right tool for high-frequency operational writes: sub-100ms point writes, ACID transactions across business + audit rows, no SQL warehouse dependency. Synced tables bridge Lakebase back to Delta so Genie and downstream analytics still see the operational data.

Foundation Model API + model-agnosticism. The default LLM is Claude 3.7 Sonnet via Databricks Model Serving, but the LangChain/LangGraph tool-calling abstraction means swapping in DBRX, Llama, or another available model is a config change, not a rewrite. That portability matters when LLM economics shift quarterly.

The system is, deliberately, not portable to a non-Databricks environment. That’s the point. Every load-bearing capability above is something you’d otherwise have to build, integrate, and operate yourself, and the operational cost of that, over a multi-year deployment, dwarfs any portability benefit.

What This Has Already Delivered

We deployed an earlier version of this architecture at PepsiCo during their migration of hundreds of SQL warehouses to Databricks Serverless. The accelerator rode alongside the migration; we didn’t wait for it to finish.

A few outcomes worth knowing:

  • Up to 90% faster billing-anomaly resolution. What used to be a multi-day investigation cycle compressed into a single conversational session, driven by the combination of pre-computed anomaly tables, agent-mediated investigation, and write-back disputes.
  • Up to 70% reduction in ad-hoc finance reporting requests. Genie absorbs the steady-state queries that previously generated tickets to the data team. Finance teams went from depending on a queue to operating self-service.
  • 85% serverless migration success rate. The accelerator’s telemetry monitoring gave the migration team before-and-after performance confidence, which is what made the higher success rate achievable. You can’t migrate confidently without observability.
  • Genie went from executive demo to daily production use. This is harder to quantify, but it’s the qualitative result that mattered most. Self-service conversational BI stopped being a curiosity and became the default investigation surface for finance.

These are “up to” figures derived from PepsiCo deployment data; your mileage will depend on data maturity, persona structure, and how aggressively you adopt the agentic write-back layer. But the directional case is clear, and the implementation pattern is repeatable.

What It Takes to Deploy

Four prerequisites:

  • Unity Catalog. Billing data and telemetry registered in UC. Non-negotiable; it’s how all of the above works.
  • Lakehouse Federation or sync. External ERP/finance systems either federated through Lakehouse Federation or synced to UC tables.
  • Serverless compute. Billing workloads are bursty. You need auto-scaling, not fixed clusters.
  • Agent Bricks + Genie + Foundation Model API. Premium tier or above.

If a customer’s billing data is already in UC, core deployment is 2–4 weeks. Adding ERP federation and the full observability pipeline brings it to 4–7 weeks total. PepsiCo-scale deployment with concurrent migration co-delivery was three to four months end-to-end.

The accelerator is parameterized by industry (telco, SaaS, utility, CPG) and by persona (Customer Care, Finance Ops, Executive, Technical). The underlying architecture is industry-agnostic. Swapping domains changes vocabulary and charge categories, not the pipeline.

Closing

If your finance team is still filing tickets to investigate billing anomalies, you don’t have a billing problem. You have an interface problem: the gap between people who know the questions and people who know the SQL.

The Customer Billing Accelerator closes that gap with an agentic AI layer that’s fully governed, fully auditable, and operationally honest about what it can and can’t do. It’s not magic. It’s good architectural choices stacked on a platform that finally has the right primitives (Unity Catalog functions, Agent Bricks, Genie, Lakebase, Model Serving) to make agentic AI a production discipline rather than a demo genre.

We’re happy to walk through the architecture in more depth, share the codebase, or run a live demo. Get in touch if you want to dig in.


The Customer Billing Accelerator is a joint offering between Entrada and Databricks. Entrada is a Databricks consulting partner specializing in Data & AI practice leadership and accelerator-driven delivery.
