How Large Language Models combined with data engineering and machine learning can automatically discover, validate, and organize KPIs into dynamic, explainable systems
"Most dashboards today measure everything — except what really matters."
Organizations track hundreds of metrics, yet struggle to answer: "Which KPIs truly move our business outcomes, and how are they connected?"
The answer lies in transforming static dashboards into LLM-driven KPI systems that can discover, validate, and continuously refine metrics based on real data.
In this post, we'll show how Large Language Models (LLMs), combined with data engineering and machine learning, can automatically discover, validate, and organize KPIs into dynamic, explainable metric systems.
An LLM can parse a business goal stated in plain language, read table and column metadata to infer what the data means, propose candidate KPI formulas, and explain in plain terms why each metric matters.
This human-like reasoning ability — when grounded in data — allows AI systems to act like digital management consultants, building metric systems that mirror how executives think.
flowchart TD
    A["Business Intent (text)"] --> B["LLM Goal Interpreter: maps intent → KPI concepts"]
    B --> C["Schema Analyzer: LLM reads tables & columns"]
    C --> D["KPI Hypothesis Generator: LLM suggests candidate KPIs + SQL"]
    D --> E["Quantitative Validator: tests predictiveness & causality"]
    E --> F["KPI Tree Builder: builds weighted DAG"]
    F --> G["Registry & Governance: versioned definitions"]
    G --> H["Continuous Monitor: drift, decay, re-learning"]
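Read end to end, the loop can be wired together as a simple pipeline. The sketch below is only a skeleton: each stage function is a hypothetical placeholder for the concrete code shown in the rest of this post.

# Skeleton of the end-to-end loop in the flowchart above.
# Each stage is a placeholder; concrete implementations appear in the sections below.

def interpret_goal(goal_text, tables):
    """LLM Goal Interpreter: map business intent to KPI concepts."""
    ...

def map_schema(columns):
    """Schema Analyzer: map raw column names to semantic tags."""
    ...

def generate_kpis(intent, schema_map):
    """KPI Hypothesis Generator: candidate KPIs with SQL formulas."""
    ...

def validate_kpis(candidates, df):
    """Quantitative Validator: test predictiveness against the outcome."""
    ...

def run_kpi_pipeline(goal_text, tables, df):
    intent = interpret_goal(goal_text, tables)
    columns_map = map_schema(list(df.columns))
    candidates = generate_kpis(intent, columns_map)
    validated = validate_kpis(candidates, df)
    return validated  # feeds the KPI tree builder, registry, and monitor stages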
We start with a user prompt in plain English:
Goal: Improve customer satisfaction in our e-commerce business.
Data: We have order tables, shipment logs, and support tickets.
Using a small prompt template, the LLM translates this to structured KPI intent.
from openai import OpenAI
client = OpenAI()
intent_prompt = """
You are an analytics strategist. The business goal is: "Improve customer satisfaction".
Given available data tables: orders, shipments, feedback, support.
List 5 candidate KPIs that could measure or influence this goal.
For each, explain: purpose, formula (pseudo-SQL), and data dependencies.
Return JSON.
"""
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": intent_prompt}],
)
print(response.choices[0].message.content)
Example LLM output:
[
  {"kpi": "On_Time_Delivery_Rate",
   "purpose": "Measures delivery reliability",
   "formula": "AVG(CASE WHEN delivered_date <= promised_date THEN 1 ELSE 0 END)",
   "tables": ["shipments", "orders"]},
  {"kpi": "Support_Tickets_per_Order",
   "purpose": "Captures friction in post-purchase experience",
   "formula": "COUNT(ticket_id)/COUNT(order_id)",
   "tables": ["support", "orders"]},
  {"kpi": "Stockout_Rate",
   "purpose": "Supply reliability",
   "formula": "AVG(stockout_flag)"},
  {"kpi": "Average_Cycle_Time",
   "purpose": "Operational speed",
   "formula": "AVG(delivered_date - order_date)"},
  {"kpi": "CSAT",
   "purpose": "Outcome KPI",
   "formula": "AVG(rating)"}
]
👉 The LLM has reasoned from both business intent and schema context, something deterministic code alone cannot do.
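Because downstream stages consume this output programmatically, it helps to parse and sanity-check the reply rather than trusting raw text. A minimal sketch, assuming the model returned the JSON list shown above:

import json

# Parse the model's reply into Python objects.
# In practice you may also need to strip markdown fences the model sometimes adds.
candidates = json.loads(response.choices[0].message.content)

# Keep only well-formed entries before passing them downstream
required = {"kpi", "purpose", "formula"}
candidates = [c for c in candidates if required.issubset(c)]
for c in candidates:
    print(c["kpi"], "->", c["formula"])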
In a real enterprise, column names are messy: del_date, ord_prom_dt, cust_satis_score.
An LLM can read metadata or sample data and infer meaning.
schema_prompt = """
You are a data engineer. Given these column names:
['ord_dt','del_dt','prom_days','stk_flag','csat_score','sup_tkts']
Map each to a semantic tag (e.g., order_date, delivered_date, promised_days, stockout_flag, csat, support_tickets)
Return a JSON map.
"""
schema_map = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": schema_prompt}],
)
print(schema_map.choices[0].message.content)
The LLM returns:
{"ord_dt":"order_date","del_dt":"delivered_date","prom_days":"promised_days","stk_flag":"stockout_flag","csat_score":"csat","sup_tkts":"support_tickets"}
This automated semantic labeling becomes the foundation for dynamic KPI discovery.
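To put that mapping to work, one simple step is to parse the returned JSON and rename the raw columns so every downstream formula can use the semantic names. A hypothetical sketch, assuming the mapping comes back as the plain JSON shown above:

import json
import pandas as pd

# Parse the mapping returned by the LLM
column_map = json.loads(schema_map.choices[0].message.content)

# Example raw extract with the messy enterprise column names (illustrative data)
raw_df = pd.DataFrame({
    "ord_dt": ["2024-01-02"], "del_dt": ["2024-01-05"], "prom_days": [4],
    "stk_flag": [0], "csat_score": [5], "sup_tkts": [1],
})

# Rename to semantic tags so KPI formulas stay readable
df = raw_df.rename(columns=column_map)
print(df.columns.tolist())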
With the goal and schema understood, the LLM suggests not only which KPIs to track but how to compute them.
kpi_gen_prompt = """
Given the goal "Improve customer satisfaction"
and these mapped columns: order_date, delivered_date, promised_days, stockout_flag, support_tickets, csat.
Suggest 5 KPI formulas (in SQL) that can be computed to evaluate or drive this goal.
Return JSON list of {kpi,sql,lower_is_better}.
"""
print(client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": kpi_gen_prompt}],
).choices[0].message.content)
Typical LLM-generated output:
[
{"kpi":"Order_Cycle_Time_Days","sql":"AVG(julianday(delivered_date)-julianday(order_date))","lower_is_better":true},
{"kpi":"On_Time_Delivery_Rate","sql":"AVG(CASE WHEN (julianday(delivered_date)-julianday(order_date)) <= promised_days THEN 1 ELSE 0 END)","lower_is_better":false},
{"kpi":"Stockout_Rate","sql":"AVG(stockout_flag)","lower_is_better":true},
{"kpi":"Support_Tickets_per_Order","sql":"AVG(support_tickets)","lower_is_better":true},
{"kpi":"CSAT","sql":"AVG(csat)","lower_is_better":false}
]
These now feed into the quantitative validation layer.
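Before the validator can run, the generated formulas have to be materialized on the data. A minimal pandas sketch, assuming a DataFrame df with one row per order and the semantic columns from the schema step, derives the per-order driver columns the validator below expects:

import pandas as pd

# Assume df holds one row per order with the semantic columns from the schema step
df["order_date"] = pd.to_datetime(df["order_date"])
df["delivered_date"] = pd.to_datetime(df["delivered_date"])

# Per-order driver columns, mirroring the SQL formulas above
df["cycle_time_days"] = (df["delivered_date"] - df["order_date"]).dt.days
df["on_time"] = (df["cycle_time_days"] <= df["promised_days"]).astype(int)

# Aggregate KPI values (what the SQL would return at the table level)
print({
    "Order_Cycle_Time_Days": df["cycle_time_days"].mean(),
    "On_Time_Delivery_Rate": df["on_time"].mean(),
    "Stockout_Rate": df["stockout_flag"].mean(),
    "Support_Tickets_per_Order": df["support_tickets"].mean(),
    "CSAT": df["csat"].mean(),
})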
Here we ensure the proposed KPIs actually track the target outcome.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# assume df = dataset with the driver columns and csat
y = (df["csat"] >= 4).astype(int)  # binary target: satisfied vs. not

features = {
    "On_Time_Delivery_Rate": ["on_time"],
    "Stockout_Rate": ["stockout_flag"],
    "Support_Tickets_per_Order": ["support_tickets"],
    "Order_Cycle_Time_Days": ["cycle_time_days"],
}

def validate_kpi(k):
    """Cross-validated AUC of a KPI's driver columns against the target."""
    X = df[features[k]]
    model = RandomForestClassifier(n_estimators=200, random_state=42)
    return cross_val_score(model, X, y, cv=5, scoring="roc_auc").mean()

validated = {k: validate_kpi(k) for k in features}
print(validated)
We keep KPIs with AUC ≥ 0.6 and no strong multicollinearity. This ensures the language-suggested metrics are grounded in evidence.
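A small sketch of that filtering step, keeping KPIs above the AUC threshold and flagging strongly correlated driver pairs (the 0.8 cutoff is an illustrative assumption; names follow the validation code above):

# Keep KPIs whose cross-validated AUC clears the threshold
kept = {k: auc for k, auc in validated.items() if auc >= 0.6}
print("Retained KPIs:", kept)

# Flag strongly collinear driver pairs among the retained KPIs
driver_cols = [features[k][0] for k in kept]
corr = df[driver_cols].corr().abs()
for i, a in enumerate(driver_cols):
    for b in driver_cols[i + 1:]:
        if corr.loc[a, b] > 0.8:  # illustrative collinearity threshold
            print(f"High collinearity between {a} and {b}: {corr.loc[a, b]:.2f}")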
Once validated, the LLM helps craft the narrative explaining why these KPIs matter — turning dry numbers into human-readable insight.
insight_prompt = f"""
We found these KPI predictive scores (cross-validated AUC) against customer satisfaction:
{validated}
Explain in 3 sentences what they mean for an operations manager.
"""

print(client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": insight_prompt}],
).choices[0].message.content)
Example response:
"Timely delivery and fewer stockouts have the highest positive impact on customer satisfaction. Support interactions show a strong negative correlation, suggesting friction in post-delivery experience. Focusing on logistics reliability and proactive support will likely yield the greatest NPS improvement."
This interpretive layer is where LLMs excel — contextualizing quantitative outputs into managerial action.
Now we connect the validated KPIs to the business goal, weighting each by its influence score.
import networkx as nx

# Rescale AUC (0.5 to 1.0) into an influence weight in (0, 1]
weights = {k: (v - 0.5) * 2 for k, v in validated.items()}

G = nx.DiGraph()
G.add_node("Customer_Satisfaction", kind="goal")
for k, w in weights.items():
    G.add_node(k, kind="driver")
    G.add_edge(k, "Customer_Satisfaction", weight=round(w, 2))

nx.nx_pydot.write_dot(G, "kpi_tree.dot")  # requires pydot for Graphviz export
Visualized, it forms:
Customer_Satisfaction
├── On_Time_Delivery_Rate (↑ strong)
├── Stockout_Rate (↓ medium)
├── Support_Tickets_per_Order (↓ strong)
└── Order_Cycle_Time_Days (↓ weak)
Every edge weight is earned through statistical validation, not intuition.
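The qualitative labels in the tree can be derived directly from the edge weights. A small sketch, reusing the weights computed above; the strength thresholds are illustrative assumptions:

def describe_edge(weight, lower_is_better=False):
    """Turn a numeric edge weight into the arrow/strength label shown above."""
    direction = "↓" if lower_is_better else "↑"
    strength = "strong" if abs(weight) >= 0.5 else "medium" if abs(weight) >= 0.3 else "weak"
    return f"({direction} {strength})"

items = list(weights.items())
print("Customer_Satisfaction")
for i, (kpi, w) in enumerate(items):
    lower = kpi in {"Stockout_Rate", "Support_Tickets_per_Order", "Order_Cycle_Time_Days"}
    branch = "└──" if i == len(items) - 1 else "├──"
    print(f"  {branch} {kpi} {describe_edge(w, lower_is_better=lower)}")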
Once live, this system can re-run periodically, re-validating KPI predictiveness, checking for drift or decay in the relationships, and refreshing the tree's edge weights. Example periodic summary prompt:
summary_prompt = """
Compare last quarter vs previous quarter KPI correlations with CSAT:
On_Time_Delivery 0.78→0.62
Support_Tickets -0.72→-0.45
Summarize insights and possible causes.
"""
print(client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": summary_prompt}],
).choices[0].message.content)
LLMs thus enable self-commentary on metrics, bridging analytics and decision-making.
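In practice, the numbers in that prompt would come from a scheduled job rather than being hand-typed. A minimal sketch, assuming two quarterly snapshots df_prev and df_curr with the same driver columns as before:

# Assume df_prev and df_curr are quarterly snapshots with the same driver columns
drivers = ["on_time", "stockout_flag", "support_tickets", "cycle_time_days"]

def kpi_correlations(frame):
    """Pearson correlation of each driver column with CSAT for one period."""
    return {d: frame[d].corr(frame["csat"]) for d in drivers}

prev, curr = kpi_correlations(df_prev), kpi_correlations(df_curr)
drift_lines = [f"{d} {prev[d]:.2f}→{curr[d]:.2f}" for d in drivers]

summary_prompt = (
    "Compare last quarter vs previous quarter KPI correlations with CSAT:\n"
    + "\n".join(drift_lines)
    + "\nSummarize insights and possible causes."
)
print(summary_prompt)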
Each KPI's metadata can be automatically written by the LLM:
registry_prompt = """
Draft a registry entry for the KPI "On_Time_Delivery_Rate"
including definition, formula, owner, refresh cycle, and interpretation.
"""
print(client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": registry_prompt}],
).choices[0].message.content)
Result:
On_Time_Delivery_Rate
Definition: Percentage of orders delivered within promised time.
Owner: Supply Chain Analytics
Formula: AVG(CASE WHEN delivered_date <= promised_date THEN 1 ELSE 0 END)
Refresh: Daily
Interpretation: Indicates reliability of fulfilment; directly influences customer satisfaction and NPS.
Such entries form the basis of a governed AI-generated metric catalog, ensuring consistency and auditability.
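To make such entries machine-enforceable rather than free text, each one can be stored as versioned, structured metadata. A hypothetical minimal schema, reusing the validation score from earlier:

from dataclasses import dataclass, asdict
import json

@dataclass
class KpiRegistryEntry:
    name: str
    definition: str
    formula_sql: str
    owner: str
    refresh: str
    version: str              # bump on any change to the formula or definition
    last_validated_auc: float

entry = KpiRegistryEntry(
    name="On_Time_Delivery_Rate",
    definition="Percentage of orders delivered within promised time.",
    formula_sql="AVG(CASE WHEN delivered_date <= promised_date THEN 1 ELSE 0 END)",
    owner="Supply Chain Analytics",
    refresh="daily",
    version="1.0.0",
    last_validated_auc=round(validated["On_Time_Delivery_Rate"], 2),  # from the validation step
)
print(json.dumps(asdict(entry), indent=2))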
| Step | What the LLM does | What classic ML does |
|---|---|---|
| Intent Understanding | Parses goal text | — |
| Schema Reasoning | Maps column names to business meaning | — |
| KPI Generation | Creates candidate formulas | — |
| Validation | — | Tests correlation, causality, drift |
| Explanation | Generates human-readable insights | — |
| Tree Building | Structures relationships semantically | Computes edge weights |
| Continuous Learning | Comments on trend shifts | Re-trains metrics periodically |
Together they create a closed loop: Language understanding → Data validation → Narrative insight → Governance.
Traditional KPI frameworks are static and human-authored. An LLM-driven system can discover candidate metrics directly from goals and schemas, validate them against real data, explain them in plain language, and keep refreshing them as conditions change.
At Finarb Analytics, we are applying this framework across healthcare, BFSI, retail, and manufacturing — using enterprise-grade data governance, privacy-compliant LLM integration, and cloud-native deployment. The result is not just faster insight, but intelligent decision systems that think like your best analysts, at scale.
LLMs don't replace analysts — they amplify them. By blending semantic understanding (language) with statistical validation (data), we can finally build KPI systems that learn, explain, and evolve with the organization.
In short: KPIs no longer have to be defined by humans. They can now be discovered, tested, and narrated by AI — grounded in your own data.