Incident Case Studies

Real operational narratives from signal conflict to safe action.

These are not marketing stories. Each case documents environment, contradiction state, confidence movement, safety gate behavior, replay chain, and final operational outcome.

Green health checks but degraded Kafka throughput

1. Environment: Kubernetes + Kafka consumers in local-connected workspace mode.

2. Operational problem: Consumer lag rose while API probes remained healthy.

3. What operators saw: Green endpoint checks, but rising lag and reduced consumer throughput.

4. Traditional tooling showed: Probe dashboard stayed healthy; lag charts were isolated and not action-ranked.

5. CTK detected: Cross-signal contradiction between healthy probe results and collapsing runtime behavior.

6. Contradiction state: Active contradiction linked to queue pressure and runtime instability.

7. Confidence evolution: Dropped from 0.84 to 0.52 and stayed low until fresh logs and throughput recovery evidence arrived.

8. Safety gate behavior: Restart blocked; the safer scale-consumers action was allowed under a higher-trust path.

9. Replay chain: Lag trend + throughput drop + healthy probe -> contradiction node -> confidence penalty -> split action branch.

10. Final operational outcome: Team scaled consumers first, restored throughput, then replayed the chain before considering any disruptive restart.
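
The cross-signal check in this case can be sketched in a few lines of Python. This is a minimal illustration, not CTK's actual detection logic: the Signals type, the field names, and the simple first-versus-last trend test are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Signals:
    """Hypothetical snapshot of the three signals from this case."""
    probe_healthy: bool
    lag_samples: List[int]            # consumer lag, oldest sample first
    throughput_samples: List[float]   # messages per second, oldest first


def is_rising(samples) -> bool:
    # Assumption: a crude trend test comparing the last sample to the first.
    return len(samples) >= 2 and samples[-1] > samples[0]


def is_falling(samples) -> bool:
    return len(samples) >= 2 and samples[-1] < samples[0]


def detect_contradiction(s: Signals) -> bool:
    """True when probes report healthy while lag rises and throughput falls."""
    return bool(
        s.probe_healthy
        and is_rising(s.lag_samples)
        and is_falling(s.throughput_samples)
    )


# Illustrative values only; they are not taken from the incident.
snapshot = Signals(
    probe_healthy=True,
    lag_samples=[100, 400, 900],
    throughput_samples=[500.0, 320.0, 150.0],
)
contradiction = detect_contradiction(snapshot)
```

A production detector would likely use a windowed slope or rate of change instead of two endpoints, but the shape of the rule stays the same: a contradiction needs one signal that says "fine" and another that says "degrading" in the same scope.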

Configured but inactive consumer

1. Environment: Docker runtime with mapped service profile and local evidence ingestion.

2. Operational problem: Consumer service existed in config but process was not actively consuming.

3. What operators saw: Service looked configured; incident pressure still rose with no clear single fault.

4. Traditional tooling showed: Configuration looked complete, hiding runtime inactivity behind static setup assumptions.

5. CTK detected: Configuration-runtime mismatch contradiction with low evidence freshness.

6. Contradiction state: Intent says active, runtime evidence says inactive.

7. Confidence evolution: 0.66 -> 0.44 while no active consumption evidence was observed.

8. Safety gate behavior: Scale-down and restart recommendations held in a review or blocked state.

9. Replay chain: Config map evidence + inactive runtime status + lag pressure -> contradiction -> trust downgrade -> action gated.

10. Final operational outcome: Team reactivated consumer, validated fresh evidence, then safely resumed queued actions.
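
The "intent says active, runtime says inactive" mismatch can be sketched as a small predicate. The ConsumerState fields and the max_idle_s threshold below are hypothetical names chosen for illustration; the source does not describe how CTK represents this state.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ConsumerState:
    """Hypothetical view of one consumer: declared intent plus runtime evidence."""
    configured_active: bool                # what the config says
    process_running: bool                  # what the runtime reports
    last_consumed_age_s: Optional[float]   # seconds since last message; None if never seen


def config_runtime_mismatch(state: ConsumerState, max_idle_s: float = 300.0) -> bool:
    """True when config declares the consumer active but runtime evidence disagrees.

    max_idle_s is an assumed threshold, not a documented CTK setting.
    """
    if not state.configured_active:
        return False  # nothing was promised, so nothing can contradict it
    if not state.process_running:
        return True
    if state.last_consumed_age_s is None:
        return True   # running, but no consumption has ever been observed
    return state.last_consumed_age_s > max_idle_s


# Configured and running, yet no consumption evidence: flagged as a mismatch.
idle = ConsumerState(configured_active=True, process_running=True, last_consumed_age_s=None)
mismatch = config_runtime_mismatch(idle)
```

Treating "never observed" as a mismatch is a deliberate choice in this sketch: missing evidence counts against the configured intent instead of being assumed healthy.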

Topology drift after deploy

1. Environment: Multi-service cloud runtime (AWS + Kubernetes connectors) in one workspace scope.

2. Operational problem: Post-deploy failures increased despite stable baseline health metrics.

3. What operators saw: Intermittent timeout spikes with no obvious top-level alarm in static maps.

4. Traditional tooling showed: Dependency drift was not tied to runtime impact in one explainable chain.

5. CTK detected: Topology delta on critical edge correlated with incident onset and runtime degradation.

6. Contradiction state: Expected dependency route and observed runtime route diverged.

7. Confidence evolution: 0.78 -> 0.53 immediately after edge removal; partial recovery after rollback evidence.

8. Safety gate behavior: Broad rollback recommendation gated until relation confidence and runtime evidence aligned.

9. Replay chain: Deploy event -> edge removal delta -> timeout pressure -> contradiction -> confidence collapse -> gated rollback path.

10. Final operational outcome: Team fixed dependency route, verified confidence recovery, then replayed final adjudication for postmortem.
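
A topology delta of the kind described here can be sketched as a set difference over dependency edges. The edge representation and the service names below are assumptions made for illustration; they do not come from the incident.

```python
from typing import Set, Tuple

Edge = Tuple[str, str]  # (caller, callee); a hypothetical edge representation


def topology_delta(expected: Set[Edge], observed: Set[Edge]):
    """Return (removed, added) edges between the expected and observed graphs."""
    removed = expected - observed
    added = observed - expected
    return removed, added


def critical_edges_removed(expected: Set[Edge], observed: Set[Edge],
                           critical: Set[Edge]) -> Set[Edge]:
    """Edges marked critical that no longer appear in the observed runtime graph."""
    removed, _ = topology_delta(expected, observed)
    return removed & critical


# Hypothetical service names, for illustration only.
expected = {("checkout", "payments"), ("checkout", "inventory")}
observed = {("checkout", "inventory"), ("checkout", "payments-v2")}
critical = {("checkout", "payments")}

removed, added = topology_delta(expected, observed)
lost_critical = critical_edges_removed(expected, observed, critical)
```

In this example the critical edge ("checkout", "payments") shows up in lost_critical, which is the kind of delta that would then be correlated with incident onset time.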

Stale evidence blocked risky restart

1. Environment: Local-connected provider mix with active incidents and partially stale diagnostics.

2. Operational problem: On-call wanted immediate restart during pressure, but critical evidence streams were stale.

3. What operators saw: Incident urgency high; latest logs absent for key service mapping.

4. Traditional tooling showed: No enforced data quality checks, so urgency could override them and increase the risk of panic actions.

5. CTK detected: Freshness violation + active contradiction in same decision scope.

6. Contradiction state: Stale diagnostics conflicted with confidence required for disruptive action.

7. Confidence evolution: 0.69 -> 0.42 as freshness penalties compounded unresolved contradiction.

8. Safety gate behavior: Restart blocked until fresh evidence and contradiction resolution conditions were met.

9. Replay chain: Freshness decay + contradiction pressure -> confidence breach -> policy block with required evidence refresh steps.

10. Final operational outcome: After evidence refresh, CTK lifted block and team executed safer remediation order.
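
The blocking behavior in this case can be sketched as a gate that returns both a decision and the steps needed to lift it. The function name, the thresholds (min_confidence, max_evidence_age_s), and the reason strings are hypothetical; the source gives confidence values but not the policy thresholds CTK applies.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class GateDecision:
    allowed: bool
    required_steps: List[str]  # what must happen before the block lifts


def gate_disruptive_action(confidence: float,
                           evidence_age_s: float,
                           contradiction_active: bool,
                           min_confidence: float = 0.6,
                           max_evidence_age_s: float = 120.0) -> GateDecision:
    """Block a disruptive action (such as a restart) unless every condition holds.

    Both thresholds are assumed values for this sketch, not CTK defaults.
    """
    steps: List[str] = []
    if evidence_age_s > max_evidence_age_s:
        steps.append("refresh stale evidence")
    if contradiction_active:
        steps.append("resolve active contradiction")
    if confidence < min_confidence:
        steps.append("recover confidence above threshold")
    return GateDecision(allowed=not steps, required_steps=steps)


# Confidence 0.42 is the low point from the case; the evidence age is illustrative.
blocked = gate_disruptive_action(confidence=0.42, evidence_age_s=600.0,
                                 contradiction_active=True)
cleared = gate_disruptive_action(confidence=0.80, evidence_age_s=10.0,
                                 contradiction_active=False)
```

Returning the unmet conditions alongside the decision mirrors the "policy block with required evidence refresh steps" in the replay chain: the operator sees what to fix, not just a refusal.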
