At a glance
Pipeline Lag (since last success) is the elapsed time since your data pipeline last completed a successful run. It answers the question every data team dreads being asked: “how stale is the data?” If a pipeline that should refresh every 15 minutes last succeeded 90 minutes ago, every dashboard, Delta table, and downstream consumer fed by that pipeline is showing data that is up to 90 minutes old. The card turns red when the lag exceeds the pipeline’s own expected interval, so a fast pipeline and a slow daily pipeline are each judged against their own cadence.
| Field | Detail |
|---|---|
| What it tracks | Seconds elapsed since the most recent successful pipeline run, for the selected period. The card surfaces the worst (most-lagging) pipeline in scope. |
| Data source | Databricks pipeline run history: Delta Live Tables pipeline updates (/pipelines and pipeline update events) and/or scheduled Jobs run history (/jobs/runs/list), comparing now to the last COMPLETED/SUCCESS finish time. |
| Time window | RT (real-time, continuously evaluated) |
| Alert trigger | > expected_interval. When lag exceeds the pipeline’s configured/expected refresh interval, the on-call data engineer is notified. A 15-minute pipeline alerts at 15 minutes of lag; a daily pipeline alerts at roughly 24 hours. |
| Roles | owner, engineering, operations |
| Card class | Hero and Sensitivity card: it drives the Pipelines health signal and the expected-interval threshold is configurable in the Sensitivity tab. |
Calculation
For each pipeline in scope, Vortex IQ records the finish time of the most recent run whose result state was successful (COMPLETED for a Delta Live Tables update, or SUCCESS for a scheduled Job run). The lag is current_time - last_successful_finish_time, expressed in seconds and rendered in human-friendly units on the card (minutes/hours).
Crucially, only successful runs reset the clock. A pipeline that is currently retrying, failing repeatedly, or stuck mid-update does not reset the lag, so the number keeps climbing even while the pipeline looks “busy”. This is deliberate: a pipeline that has been failing for an hour has produced no fresh data for an hour, regardless of how many attempts it made.
The alert compares the live lag against the pipeline’s expected interval, derived from its schedule (a cron trigger, a continuous-pipeline cadence, or the Sensitivity-tab override). Because the threshold is relative to each pipeline’s own cadence, the same card sensibly governs both a 5-minute streaming pipeline and a nightly batch pipeline without false alarms on the slow one.
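The rule described above can be sketched in a few lines of Python. This is a minimal illustration, not Vortex IQ's implementation: the `Run` record, the `lag_seconds` and `is_alerting` helpers, and the sample timestamps are all hypothetical names and values chosen for the example. Returning `None` when no successful run exists is likewise an assumption, since the text does not say how that case is handled.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Sequence

# States that count as a successful run, per the Calculation text.
SUCCESS_STATES = {"COMPLETED", "SUCCESS"}


@dataclass
class Run:
    """Hypothetical run record: when the run finished and how it ended."""
    finished_at: datetime
    state: str


def lag_seconds(runs: Sequence[Run], now: datetime) -> Optional[float]:
    """Seconds since the most recent successful finish; None if there is none."""
    successes = [r.finished_at for r in runs if r.state in SUCCESS_STATES]
    if not successes:
        return None  # assumption: no success on record means lag is undefined
    return (now - max(successes)).total_seconds()


def is_alerting(lag: float, expected_interval: timedelta) -> bool:
    """Alert only when lag exceeds the pipeline's own expected interval."""
    return lag > expected_interval.total_seconds()


now = datetime(2026, 1, 1, 12, 0, tzinfo=timezone.utc)

# A 5-minute pipeline: last success 20 minutes ago, failures since then.
fast_runs = [
    Run(now - timedelta(minutes=20), "COMPLETED"),
    Run(now - timedelta(minutes=10), "FAILED"),
    Run(now - timedelta(minutes=5), "FAILED"),
]
# A nightly pipeline: last success 10 hours ago.
nightly_runs = [Run(now - timedelta(hours=10), "SUCCESS")]

fast_lag = lag_seconds(fast_runs, now)
nightly_lag = lag_seconds(nightly_runs, now)

fast_alert = is_alerting(fast_lag, timedelta(minutes=5))
nightly_alert = is_alerting(nightly_lag, timedelta(hours=24))
```

The failed runs at 10 and 5 minutes ago do not move the clock, so the fast pipeline's lag stays at 20 minutes and it alerts. The nightly pipeline has a larger absolute lag but remains inside its own interval.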
Worked example
A homewares retailer runs a Delta Live Tables pipeline, orders_to_lakehouse, every 15 minutes. It ingests order events from the storefront, applies dedup and currency normalisation, and writes a silver.orders Delta table that feeds the finance dashboard and the inventory-replenishment job. Snapshot taken on 19 May 26 at 11:42 BST.
| Reading | Value |
|---|---|
| Expected interval | 15 minutes |
| Last successful run | 10:08 BST |
| Pipeline Lag | 94 minutes (alert: lag above expected interval) |
| Last 4 run states | FAILED, FAILED, FAILED, COMPLETED (10:08) |
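The figures in the table can be checked with a short calculation. The sketch below assumes the snapshot date “19 May 26” means 19 May 2026 and models BST as a fixed UTC+1 offset. The variable names are illustrative only.

```python
from datetime import datetime, timedelta, timezone

# Assumption: BST modelled as a fixed UTC+1 offset for this snapshot.
BST = timezone(timedelta(hours=1), "BST")

snapshot = datetime(2026, 5, 19, 11, 42, tzinfo=BST)      # year assumed to be 2026
last_success = datetime(2026, 5, 19, 10, 8, tzinfo=BST)
expected_interval = timedelta(minutes=15)

lag = snapshot - last_success
lag_minutes = int(lag.total_seconds() // 60)

# Whole expected intervals that have elapsed without a successful run.
elapsed_intervals = lag // expected_interval

# The alert condition from the table: lag above the expected interval.
over_threshold = lag > expected_interval

print(f"lag={lag_minutes} min, intervals elapsed={elapsed_intervals}, alert={over_threshold}")
```

At the snapshot, the pipeline is six full refresh intervals behind, which is why the card shows the alert state rather than a marginal breach.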
- The data is stale by 94 minutes and getting worse. Everything downstream of silver.orders, the finance dashboard and the replenishment job, is operating on a snapshot from 10:08. The longer the lag, the larger the gap the eventual successful run must backfill.
- The failures explain the lag. Drilling into the pipeline update events shows the dedup step failing on a schema-evolution error: an upstream change added a gift_message column the pipeline’s expectations did not allow. Each 15-minute attempt fails the same way, so lag grows by ~15 minutes per cycle.
- There is a real downstream risk. The replenishment job reads silver.orders to decide reorder quantities. Running on 94-minute-old order data, it under-counts recent sales of a flash-sale item and risks ordering too little stock. This is where a pipeline-lag problem becomes a business problem.
- Lag, not failure count, is the metric that matters to the business. A pipeline can fail ten times and still be fine if a recovery run lands inside the expected interval. What hurts is uninterrupted staleness, which is exactly what this card measures.
- Always pair lag with the run-state history. A growing lag with recent failures is a broken pipeline (fix the code). A growing lag with no failures is a stuck or unscheduled pipeline (check the trigger / cluster availability).
- The threshold must match the cadence. If you change a pipeline’s schedule, update its expected interval in the Sensitivity tab, otherwise a now-hourly pipeline will keep alerting against an old 15-minute expectation.
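The pairing of lag with run-state history described in the points above can be expressed as a small triage function. This is a sketch under stated assumptions: the `Diagnosis` enum, the `diagnose` helper, and treating only `FAILED` as a failure state are hypothetical choices for illustration, not part of the card.

```python
from enum import Enum
from typing import Sequence


class Diagnosis(Enum):
    WITHIN_INTERVAL = "within expected interval"
    BROKEN = "broken pipeline: fix the code"
    STUCK_OR_UNSCHEDULED = "stuck or unscheduled: check the trigger / cluster availability"


# Assumption: only FAILED counts as a failure for this illustration.
FAILURE_STATES = {"FAILED"}


def diagnose(
    lag_seconds: float,
    expected_interval_seconds: float,
    states_since_last_success: Sequence[str],
) -> Diagnosis:
    """Triage a pipeline from its lag and the run states seen since the last success."""
    if lag_seconds <= expected_interval_seconds:
        # Failures are tolerable as long as a success landed inside the interval.
        return Diagnosis.WITHIN_INTERVAL
    if any(state in FAILURE_STATES for state in states_since_last_success):
        return Diagnosis.BROKEN
    return Diagnosis.STUCK_OR_UNSCHEDULED


# 94 minutes of lag on a 15-minute pipeline, with three failures since 10:08.
with_failures = diagnose(94 * 60, 15 * 60, ["FAILED", "FAILED", "FAILED"])

# Same lag, but no runs at all since the last success.
no_runs = diagnose(94 * 60, 15 * 60, [])

# A failure followed by a recovery that keeps lag inside the interval.
recovered = diagnose(10 * 60, 15 * 60, ["FAILED"])
```

The third case reflects the first point in the list: a failure on its own is not a problem if lag stays inside the expected interval.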
Sibling cards
| Card | Why pair it with Pipeline Lag | What the combination tells you |
|---|---|---|
| DLT Pipeline Status Distribution | The live status of every pipeline. | A FAILED pipeline in the donut explains a growing lag; an IDLE one suggests a missing trigger. |
| Failed Jobs (24h) | The job-level failures that cause lag. | Repeated job failures map directly to a pipeline whose lag clock is not resetting. |
| Job Success Rate (24h) | The success-ratio peer. | A falling success rate is the leading indicator that lag is about to grow. |
| Failed Job Burst (>5 failures in 1h) | The cascade alert. | A burst of failures across dependent jobs often shows up as lag on the pipeline they feed. |
| Top 10 Failing Workflows (7d) | The chronic-offender view. | A pipeline that lags often will appear here; chronic lag deserves a structural fix. |
| Long-Running Jobs (>1h) | The stuck-run view. | Lag with no failures and a job still running points to a stuck or over-running pipeline, not a crash. |
| Pipeline Lag vs Ecom Order Flow | The revenue cross-channel view. | Tells you whether the pipeline stalled while live orders were still flowing, the worst case. |
| Databricks Health Score | The composite. | Sustained pipeline lag pulls the overall health score down. |
Reconciling against the source
Where to look in Databricks:
- Delta Live Tables in the workspace: open the pipeline, the “Update history” panel shows the timestamp and result of every update, so you can read the last successful finish directly.
- Workflows → Jobs → Runs for scheduled-job pipelines: the run list shows the last SUCCESS finish time used to compute lag.
- system.lakeflow / pipeline event log (or the event_log for the DLT pipeline) gives the machine-readable update events to reproduce the calculation.
To reproduce lag for a scheduled-job pipeline you can compute it from run history: take current_timestamp() minus the latest end_time where result_state = 'SUCCESS'. For a DLT pipeline, read the most recent update with state COMPLETED from the pipeline event log.
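For the scheduled-job case, the same reproduction can be done offline against a saved run-history response. The sketch below assumes a payload shaped like a Jobs `runs/list` response, with a `runs` array whose entries carry `end_time` in epoch milliseconds and a nested `state.result_state`. Check those field names against the API version your workspace uses. The helper name and the sample payload are hypothetical.

```python
from typing import Optional


def lag_from_runs_payload(payload: dict, now_ms: int) -> Optional[float]:
    """Lag in seconds from a runs/list-style payload; None if no SUCCESS run exists.

    Assumes each run has 'end_time' (epoch ms) and 'state.result_state'.
    """
    success_end_times = [
        run["end_time"]
        for run in payload.get("runs", [])
        if run.get("state", {}).get("result_state") == "SUCCESS"
        and run.get("end_time")  # skip runs that have not finished
    ]
    if not success_end_times:
        return None
    return (now_ms - max(success_end_times)) / 1000.0


now_ms = 1_700_000_000_000  # illustrative "current time" in epoch milliseconds

# Illustrative payload: two failures after the last successful run.
sample_payload = {
    "runs": [
        {"run_id": 3, "end_time": now_ms - 600_000, "state": {"result_state": "FAILED"}},
        {"run_id": 2, "end_time": now_ms - 1_500_000, "state": {"result_state": "FAILED"}},
        {"run_id": 1, "end_time": now_ms - 5_640_000, "state": {"result_state": "SUCCESS"}},
    ]
}

lag = lag_from_runs_payload(sample_payload, now_ms)
print(f"lag = {lag:.0f} s ({lag / 60:.0f} min)")
```

If the figure computed this way differs from the card, the table below lists the usual reasons.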
Why our number may legitimately differ from the Databricks UI:
| Reason | Direction | Why |
|---|---|---|
| “Success” definition | Vortex IQ may read higher lag | We reset the clock only on a fully successful run; a partially-successful or warning-state update that the UI shows as “completed with issues” may not reset our clock. |
| Continuous pipelines | Different basis | For continuous DLT pipelines there is no discrete “run”; lag is derived from the last successful flow checkpoint, which can differ from the UI’s notion of “last updated”. |
| API/event-log latency | Brief lag | Pipeline event logs can take a few seconds to record a completion, so a just-finished run may not yet reset the clock. |
| Time zone | Display only | Vortex IQ renders the last-success timestamp in your reporting time zone; the workspace UI uses workspace time. |
| Expected-interval source | Threshold differs | The alert uses the Sensitivity-tab expected interval if set, which may not match the pipeline’s literal schedule. |
| Card | Expected relationship | What causes divergence |
|---|---|---|
| shopify.total_revenue / bigcommerce.total_revenue | If a revenue pipeline lags, lakehouse revenue figures fall behind the storefront’s own order count. | A gap between storefront orders and lakehouse-reported revenue is the downstream symptom of pipeline lag. |
| google_analytics | Independent measurement source. | If GA4 shows live traffic while the lakehouse table is stale, that confirms the lag is on the ingestion side. |