At a glance
The age, in hours, of the most recent Redis backup that was successfully shipped offsite. This is not “did Redis save to local disk”; it is “do we have a durable copy somewhere we could restore from if this node burned down right now”. For a DBA, this is the single most important durability number on the board: every hour this value grows is an hour of writes you cannot get back if the instance is lost. Redis is often run as a cache and treated as disposable, but the moment it holds sessions, rate-limit counters, queues, or any source-of-truth data, a stale backup is a silent data-loss incident waiting to happen.
| Field | Detail |
|---|---|
| What it tracks | Hours elapsed since the last RDB snapshot or AOF copy was successfully shipped offsite (object storage, snapshot vault, or managed-service backup). It tracks the offsite copy, not the local dump.rdb on the data node. |
| Data source | For self-managed Redis: the timestamp of the last backup artefact landed in the offsite target (for example S3 / GCS object LastModified), reconciled against rdb_last_save_time and aof_last_bgrewrite_status from INFO persistence. For ElastiCache / MemoryDB: CloudWatch backup events and the SnapshotComplete event timestamp. |
| Time window | RT (real-time, re-evaluated on every Nerve Centre poll, typically every 60 seconds). |
| Alert trigger | > 72h. If the newest offsite backup is older than 72 hours, the card turns red and pages the on-call DBA. |
| Units | Hours (integer, rounded down). A reading of 0 means a backup completed within the last hour. |
| What counts as “successful” | The backup process exited cleanly AND the artefact is present and non-zero-byte at the offsite destination. A BGSAVE that finished but never uploaded does NOT reset this clock. |
| What does NOT count | (1) A local SAVE / BGSAVE that wrote dump.rdb to the node’s own disk but was never copied offsite; (2) a failed or partial upload; (3) an AOF file that exists locally but is not part of the offsite backup set; (4) a snapshot still in progress. |
| Roles | owner, dba, platform, sre |
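The “successful” rule in the table above can be sketched as a filter over an object listing. This is a minimal illustration, not the card's actual implementation: it assumes listing entries shaped like the `Contents` items returned by boto3's `list_objects_v2` (`Key`, `Size`, `LastModified`), and the function name and sample keys are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, List, Optional


def newest_valid_backup(objects: List[Dict]) -> Optional[datetime]:
    """Return the LastModified of the newest non-zero-byte object, or None."""
    # Zero-byte objects are treated as failed uploads and ignored.
    valid = [obj for obj in objects if obj.get("Size", 0) > 0]
    if not valid:
        return None
    return max(obj["LastModified"] for obj in valid)


# Hypothetical listing: the most recent object is an empty, failed upload.
listing = [
    {"Key": "redis/dump-2026-04-12.rdb", "Size": 52_428_800,
     "LastModified": datetime(2026, 4, 12, 2, 10, tzinfo=timezone.utc)},
    {"Key": "redis/dump-2026-04-14.rdb", "Size": 0,
     "LastModified": datetime(2026, 4, 14, 8, 5, tzinfo=timezone.utc)},
]

# The empty object does not reset the clock, so the 12 Apr copy wins.
print(newest_valid_backup(listing))
```

Returning `None` when nothing valid exists keeps “no backup at all” distinct from “very old backup”, which a caller would probably want to surface differently.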
Calculation
The card resolves the timestamp of the newest valid offsite backup artefact, subtracts it from “now”, and rounds the result down to whole hours. How last_offsite_backup_timestamp_utc is resolved depends on deployment:
- Self-managed Redis with a shipping job. The engine reads the LastModified timestamp of the newest object in the configured backup bucket / prefix. It cross-checks rdb_last_save_time (Unix epoch of the last successful local save, from INFO persistence) so that a local save with no upload is visibly distinguished from a healthy offsite copy. If the local save is fresh but the offsite object is stale, the card uses the offsite timestamp (the durable one) and flags the gap.
- AOF-based durability. Where Append Only File is the durability mechanism, the engine treats the last successful AOF rewrite (aof_last_bgrewrite_status = ok plus the rewrite completion time) plus the offsite copy of the AOF as the backup point. A pure local AOF with appendfsync everysec is durable on the node but is not an offsite backup; only the shipped copy resets this card.
- ElastiCache / MemoryDB. The engine reads the most recent automatic or manual snapshot’s completion time from the CloudWatch SnapshotComplete backup event (or the snapshot list via the managed-service API). Self-managed local-disk saves are irrelevant here because AWS manages the snapshot lifecycle.
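The calculation described above can be sketched in a few lines. This is an illustration under stated assumptions rather than the engine's real code: the function names are hypothetical, the timestamps are made up, and the 72-hour threshold is the default alert trigger from the At a glance table.

```python
from datetime import datetime, timedelta, timezone

ALERT_THRESHOLD = timedelta(hours=72)  # default alert trigger for this card


def backup_age_hours(last_offsite_backup_timestamp_utc: datetime,
                     now_utc: datetime) -> int:
    """Whole hours since the last offsite backup, rounded down."""
    elapsed = now_utc - last_offsite_backup_timestamp_utc
    # Clamp at zero so minor clock skew never yields a negative age.
    return max(0, int(elapsed.total_seconds() // 3600))


def is_alerting(last_offsite_backup_timestamp_utc: datetime,
                now_utc: datetime,
                threshold: timedelta = ALERT_THRESHOLD) -> bool:
    """True once the newest offsite backup is older than the threshold."""
    return now_utc - last_offsite_backup_timestamp_utc > threshold


# Hypothetical timestamps, both in UTC.
last_backup = datetime(2026, 1, 10, 6, 30, tzinfo=timezone.utc)
now = datetime(2026, 1, 11, 12, 0, tzinfo=timezone.utc)

print(backup_age_hours(last_backup, now))  # 29 (29 h 30 min, rounded down)
print(is_alerting(last_backup, now))       # False
```

The alert check compares the unrounded gap against the threshold, so a backup that is 72 hours and one minute old already counts as stale even though the headline still reads 72.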
Worked example
A platform team runs a self-managed Redis 7.2 primary on a VM, holding user sessions and a job queue for an order-processing service. Their cron-driven shipping job is meant to run BGSAVE, then upload the resulting dump.rdb to an S3 bucket every 6 hours. The readings below were captured on 14 Apr 26 at 09:00 UTC.
| Signal | Value | Source |
|---|---|---|
| rdb_last_save_time | 14 Apr 26 08:02 UTC | INFO persistence |
| Newest object in S3 backup prefix | 12 Apr 26 02:10 UTC | S3 LastModified |
| aof_last_bgrewrite_status | ok | INFO persistence |
| Card headline | 54 hours ago | offsite timestamp |
rdb_last_save_time is under an hour old, so Redis is clearly saving. But the card reads 54 hours (54 h 50 min since the newest offsite object, rounded down), derived from the offsite copy, not the local save. The story it tells:
- Local saving is healthy, offsite shipping is broken. Redis has been writing dump.rdb to its own disk every 6 hours as designed, but the upload step has not landed a new object since 12 Apr. Something downstream of BGSAVE failed: an expired IAM credential, a full local disk, a changed bucket policy, or a cron job that silently errored.
- The durability window is nearly 55 hours wide. If the VM is lost right now (host failure, region issue, accidental termination), the team can only restore to 12 Apr 02:10. Every session, every queued job, and every write since then is gone. For a session store that is mass logout; for the order queue that is lost or duplicated work.
- The 72h alert has not fired yet, but it is 17 hours away. This is the value of a Hero card: the team can see the slow bleed and fix the shipping job before the alert ever pages them at 3am.
Three takeaways:
- A fresh rdb_last_save_time is reassuring but not sufficient. Local saves protect against a Redis process restart; only offsite copies protect against losing the whole node. This card deliberately measures the offsite copy because that is the one that survives a disaster.
- The alert threshold should match your RPO. 72 hours is a safe default for cache-like workloads. If Redis holds source-of-truth data, lower the alert in the Sensitivity tab to match your recovery-point objective: a 6-hour backup cadence usually wants a 12h to 24h alert so a single missed run is visible before the gap compounds.
- Test the restore, not just the backup. A backup that exists but cannot be restored (corrupt RDB, wrong Redis version, missing AOF segment) is worse than no backup because it creates false confidence. Pair this card with a periodic restore drill into a throwaway instance.
Sibling cards DBAs should reference together
| Card | Why pair it with Last Successful Backup | What the combination tells you |
|---|---|---|
| Last RDB Save (minutes ago) | The local-save companion. This card is offsite; that one is on-node. | Fresh local save plus stale offsite backup equals a broken shipping job, exactly the worked example above. |
| Last AOF Rewrite Status | AOF is the other durability mechanism. | An err rewrite status means your AOF-based recovery point is unreliable, narrowing your durability options to RDB only. |
| Redis Health Score | The composite that folds backup age into overall health. | A stale backup drags the health score even when memory and latency look perfect. |
| Connected Replicas | Replicas are availability, backups are durability. Different risks. | Zero replicas plus stale backup equals no failover AND no recovery point: the worst durability posture. |
| Memory Used vs Maxmemory % | A near-full instance makes BGSAVE riskier (fork copy-on-write needs headroom). | High memory plus failing saves often equals fork failing for lack of RAM, a common root cause of a stale backup. |
| Replica Lag (seconds) | If you back up from a replica, lag defines how stale that backup’s data is. | High replica lag means a replica-sourced backup is already behind the primary before it even ships. |
Reconciling against the source
Where to look in Redis’s own tooling:
- INFO persistence on the data node. Read rdb_last_save_time (Unix epoch of the last local save), rdb_last_bgsave_status, aof_last_bgrewrite_status, and aof_last_write_status. These tell you whether Redis itself is saving cleanly; they do NOT tell you whether the copy reached offsite.
- redis-cli LASTSAVE returns the Unix timestamp of the last successful local save, the quickest one-liner check.
- Your offsite store’s object listing. For S3: aws s3 ls s3://your-backup-bucket/prefix/ --recursive and read the newest LastModified. This is the number this card actually reports.
- ElastiCache / MemoryDB console → Backups tab, or aws elasticache describe-snapshots, for the managed snapshot completion time and the CloudWatch SnapshotComplete event.
Why our number may legitimately differ from what you see:
| Reason | Direction | Why |
|---|---|---|
| Local save vs offsite copy | Vortex IQ shows older | LASTSAVE / rdb_last_save_time reflect the on-node save; this card reflects the offsite artefact, which is the durable one. A gap is a finding, not a bug. |
| Time zone | Timestamps shift | Redis epoch values and CloudWatch are UTC; Vortex IQ renders the age in your profile time zone for chart axes but computes the gap in UTC. |
| In-flight upload | Vortex IQ shows older briefly | An upload that is mid-flight has not landed; the clock resets only when the object is complete. |
| Backup from a replica | Data is staler than it looks | If you snapshot a replica, the data point is the replica’s state, which may lag the primary. Pair with Replica Lag. |
| Managed snapshot retention | Object disappears | If a managed service rotates out the snapshot you measured against, the next newest one defines the age. |

| Card | Expected relationship | What causes divergence |
|---|---|---|
| redis.last-rdb-save-minutes-ago | Local save should be much fresher than offsite backup. | If both are stale, Redis itself stopped saving (fork failure, disk full). If only offsite is stale, the shipping job is broken. |
| CloudWatch SnapshotComplete events | For ElastiCache, 1:1 with this card’s reset. | A gap means the automatic backup window failed or was disabled. |
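The divergence rules in the table above lend themselves to a small reconciliation script. The sketch below parses the plain-text key:value output of INFO persistence and classifies the gap between the local save and the offsite copy. The function names, the returned labels, and the two freshness limits (6 and 12 hours) are assumptions chosen for illustration, and the timestamps are taken from the worked example.

```python
from datetime import datetime, timedelta, timezone
from typing import Dict


def parse_info(raw: str) -> Dict[str, str]:
    """Parse the key:value lines of raw INFO output into a dict."""
    fields = {}
    for line in raw.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks and section headers
            continue
        key, _, value = line.partition(":")
        fields[key] = value
    return fields


def diagnose(local_save: datetime, offsite_copy: datetime, now: datetime,
             max_local_age: timedelta, max_offsite_age: timedelta) -> str:
    """Classify the divergence between the local save and the offsite copy."""
    local_stale = now - local_save > max_local_age
    offsite_stale = now - offsite_copy > max_offsite_age
    if local_stale and offsite_stale:
        return "redis-not-saving"   # e.g. fork failure or disk full
    if offsite_stale:
        return "shipping-broken"    # local save fine, upload not landing
    if local_stale:
        return "local-save-stale"   # offsite fresh but this node's save is old
    return "healthy"


now = datetime(2026, 4, 14, 9, 0, tzinfo=timezone.utc)
local = datetime(2026, 4, 14, 8, 2, tzinfo=timezone.utc)
# Hypothetical INFO persistence excerpt built from the worked example.
raw_info = (
    "# Persistence\r\n"
    f"rdb_last_save_time:{int(local.timestamp())}\r\n"
    "rdb_last_bgsave_status:ok\r\n"
)
info = parse_info(raw_info)
local_save = datetime.fromtimestamp(int(info["rdb_last_save_time"]), tz=timezone.utc)
offsite = datetime(2026, 4, 12, 2, 10, tzinfo=timezone.utc)

print(diagnose(local_save, offsite, now,
               max_local_age=timedelta(hours=6),
               max_offsite_age=timedelta(hours=12)))  # shipping-broken
```

In practice the raw text would come from redis-cli INFO persistence and the offsite timestamp from the object listing; both are hard-coded here so the sketch runs without a Redis server or cloud credentials.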
Known limitations / FAQs
Redis says it saved 10 minutes ago, but this card reads 50 hours. Which is right?
Both are right; they measure different things. LASTSAVE / rdb_last_save_time report the last local save to the node’s own disk. This card reports the last copy that reached offsite storage. A local save protects you from a Redis process crash; only an offsite copy protects you from losing the whole node or region. A 50-hour reading with a fresh local save almost always means your upload / shipping step is broken while Redis itself is healthy. Fix the shipping job.
We run AOF with appendfsync everysec. Isn’t that already durable?
AOF makes the node durable against a process crash because writes are flushed to the local append-only file roughly every second. It does NOT make you durable against losing the node, the disk, or the availability zone. This card measures the offsite copy precisely because AOF alone does not survive a host failure. Keep AOF for fast local recovery and still ship a periodic copy offsite.
Why 72 hours as the default alert? That seems generous.
72 hours is a conservative default chosen so that a single missed daily backup, or a weekend with a stuck job, surfaces before it becomes a multi-day gap. It is intentionally generous so it does not page teams for transient blips. If Redis holds source-of-truth data with a tighter recovery-point objective, lower the alert in the Sensitivity tab. A common pattern is to set the alert to roughly 2x your backup cadence.
Does a backup that exists but is corrupt reset the clock?
The card cannot validate the internal integrity of a remote artefact; it trusts that a complete, non-zero-byte object at the offsite destination is a backup. This is a known limitation. The mitigation is a periodic restore drill: load the newest backup into a throwaway instance and confirm it starts and serves keys. A backup you have never restored is a hypothesis, not a guarantee.
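A cheap pre-check can weed out obviously wrong RDB artefacts before a restore drill. The sketch below relies on one property of the RDB format: a file starts with the ASCII bytes REDIS followed by a four-digit version. The function name and sample bytes are hypothetical, and passing this check says nothing about whether the rest of the file is intact, so it complements the drill rather than replacing it.

```python
def looks_like_rdb(header: bytes) -> bool:
    """True if the bytes begin with the RDB magic string and a 4-digit version."""
    return header[:5] == b"REDIS" and len(header) >= 9 and header[5:9].isdigit()


# Hypothetical first bytes of two downloaded artefacts.
print(looks_like_rdb(b"REDIS0011\xfa\x09redis-ver"))    # True
print(looks_like_rdb(b"<?xml version='1.0'?><Error>"))  # False
```

Only the first nine bytes are needed, so a caller could read just that prefix of the downloaded object instead of the whole file.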
We back up from a read replica to avoid load on the primary. Does that affect this card?
It affects the data point, not the card logic. The card still measures the age of the offsite artefact, but that artefact reflects the replica’s state at save time, which may lag the primary by the current replication delay. Pair this card with Replica Lag (seconds): a replica-sourced backup is effectively “backup age plus replica lag” behind the primary’s true state.
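The “backup age plus replica lag” rule is simple arithmetic, sketched here with made-up readings. The function name is hypothetical, and the sketch assumes the lag value is the one observed when the backup was saved.

```python
from datetime import timedelta


def effective_staleness(backup_age: timedelta, replica_lag: timedelta) -> timedelta:
    """How far a replica-sourced backup is behind the primary's current state."""
    return backup_age + replica_lag


# Hypothetical readings: a 5-hour-old backup taken from a replica lagging 90 s.
behind = effective_staleness(timedelta(hours=5), timedelta(seconds=90))
print(behind)  # 5:01:30
```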
Our ElastiCache cluster shows automatic backups in the console but this card reads stale. Why?
Two common causes. First, automatic snapshots may be disabled or have a zero-day retention on this node group (check the backup retention setting). Second, the snapshot window may overlap a high-write period and AWS skipped or delayed it. Confirm via aws elasticache describe-snapshots and the CloudWatch SnapshotComplete event timestamp; if the newest snapshot genuinely is old, the card is correct and you have a real durability gap.
Is this card relevant if we use Redis purely as a disposable cache?
Less so, but do not assume “purely a cache” is permanent. Many teams start with a cache and quietly add session storage, rate-limit counters, or a queue without revisiting durability. If Redis genuinely holds only recomputable cache data, you can raise the alert threshold or disable it for that instance in the Sensitivity tab. Revisit that decision whenever the workload changes.