Skialith gives every agent a complete, ordered history of its state transitions — enabling crash recovery, replay, drift detection, and forensic audit that no framework can replicate with a library update.
Per-event write latency · p50
Measured locally against NATS + MySQL.
Reproduce with cargo run --bin benchmark.
The problem
A multi-step agent that crashes mid-run loses all progress and replays every LLM call from the beginning. Existing frameworks offer no crash recovery, no audit trail, and no visibility into what the agent actually did — making debugging, compliance, and reliable orchestration structurally impossible.
Synchronous database writes add roughly 1 ms per event. Across hundreds of steps, that latency compounds.
Without a durable log, a process crash anywhere in the pipeline means restarting from zero.
At-least-once delivery without idempotency creates duplicate rows and corrupted state.
How it works
Every save_event call is acknowledged by NATS JetStream before returning to your agent.
A background writer batches those events into your database — keeping the hot path fast and the data durable.
Your agent publishes an event. Skialith serialises it and sends it to NATS JetStream.
JetStream confirms the write in ~133 us (p50). Your agent unblocks without waiting on the database.
A background task collects events and flushes in efficient batches with automatic retry.
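The three steps above can be sketched in a few lines of Python. This is a toy model, not Skialith's implementation: an in-memory `asyncio.Queue` stands in for NATS JetStream, a plain list stands in for the database, and the names `AckThenBatchWriter`, `run_writer`, `batch_size`, and `flush_interval` are hypothetical. Retry handling is left out for brevity.

```python
import asyncio


class AckThenBatchWriter:
    """Toy model of the write path: acknowledge fast, persist in batches."""

    def __init__(self, batch_size=3, flush_interval=0.05):
        self.queue = asyncio.Queue()  # stands in for the JetStream stream
        self.db_rows = []             # stands in for the database table
        self.flushes = []             # size of each batch written
        self.batch_size = batch_size
        self.flush_interval = flush_interval

    async def save_event(self, event):
        # Hot path: hand the event to the queue and return.
        await self.queue.put(event)
        return True  # plays the role of the PubAck

    async def run_writer(self):
        # Background path: flush when the batch is full or the timer expires.
        while True:
            batch = [await self.queue.get()]
            try:
                while len(batch) < self.batch_size:
                    event = await asyncio.wait_for(
                        self.queue.get(), timeout=self.flush_interval
                    )
                    batch.append(event)
            except asyncio.TimeoutError:
                pass  # timer expired: flush a partial batch
            self.db_rows.extend(batch)  # one bulk write instead of N inserts
            self.flushes.append(len(batch))
            for _ in batch:
                self.queue.task_done()


async def demo():
    writer = AckThenBatchWriter()
    for i in range(7):
        await writer.save_event({"step": i})  # returns before any DB write
    task = asyncio.create_task(writer.run_writer())
    await writer.queue.join()  # wait until every event has been flushed
    task.cancel()
    return writer.db_rows, writer.flushes


if __name__ == "__main__":
    rows, flushes = asyncio.run(demo())
    print(len(rows), flushes)
```

The point of the split is that `save_event` never touches the database: the caller pays only for the enqueue, and the cost of the database write is amortised across each batch.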
Agent
  |  save_event / checkpoint
  v
Skialith sidecar
  |-- NATS JetStream  <-- PubAck ~133us returned to caller
  |     |
  |     +-- Background batch writer
  |           +-- MySQL / TiDB (async, retried, idempotent)
  |
  +-- trace_ingest consumer --> agent_traces table

Performance
Run the benchmarks yourself against a local NATS + MySQL stack.
| Scenario | p50 | p95 | p99 |
|---|---|---|---|
| save_event (NATS PubAck) | 133 us | 265 us | 386 us |
| Baseline MySQL INSERT | 986 us | 1.5 ms | 2.6 ms |
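As a rough worked example of how these figures compound, the sketch below multiplies the p50 values from the table by a step count. The step counts are hypothetical, and using p50 as a per-event cost ignores tail latency, so treat the result as an order-of-magnitude estimate rather than a measurement.

```python
# p50 latencies taken from the table, in microseconds.
P50_US = {
    "save_event (NATS PubAck)": 133,
    "Baseline MySQL INSERT": 986,
}


def blocking_time_ms(p50_us, steps):
    """Total time the agent spends blocked on writes, in milliseconds."""
    return p50_us * steps / 1000


if __name__ == "__main__":
    for steps in (100, 500):  # hypothetical run lengths
        for scenario, p50 in P50_US.items():
            print(f"{steps} steps, {scenario}: {blocking_time_ms(p50, steps)} ms")
```

At 500 steps this comes to about 493 ms of blocking for synchronous inserts against about 66.5 ms for the acknowledged publish, a ratio of roughly 7.4 at the median.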
cargo run --bin benchmark

Ecosystem
KV-cache optimizations like TurboQuant compress volatile memory to make inference cheaper. Skialith provides what compressed volatile memory structurally cannot — durable, crash-recoverable state across the full agent lifecycle.
Techniques like TurboQuant shrink attention caches by 6×, reducing the per-token cost of long-context inference. The cache is volatile by design: it lives in accelerator memory for the duration of a generation and does not survive the process.
Skialith persists agent checkpoints, tool call results, and step events across process boundaries. That state survives crashes, restarts, and preemptions, providing the long-horizon memory that volatile KV caches cannot.
Integrations
Agents are plain async functions. SDKs are thin HTTP clients — no Rust required.
# Python SDK
from skialith import SkialithAgent

async with SkialithAgent(agent_id="my-agent") as agent:
    # Reconstruct the last persisted state for this agent.
    state = await agent.resume()
    # Persist the working state at the current step.
    await agent.checkpoint(
        step=state.step_index,
        data={"messages": messages}
    )
    # Append a structured event to the log.
    await agent.save_event("step-1", {
        "kind": "thought", "text": "..."
    })

// TypeScript SDK
import { SkialithAgent } from "@skialith/agent-core";

const agent = new SkialithAgent({ agentId: "my-agent" });
const state = await agent.resume();
await agent.checkpoint(state.stepIndex, { messages });
await agent.saveEvent("step-1", {
  kind: "thought", text: "..."
});

# LangChain integration (Python); `graph` is your existing graph definition
from skialith.langchain import SkialithCheckpointer

checkpointer = SkialithCheckpointer()
app = graph.compile(checkpointer=checkpointer)
# No other changes needed
result = await app.ainvoke(
    {"messages": [...]},
    config={"configurable": {"thread_id": "agent-1"}}
)

Reference Implementation · RFC
Skialith is both a working implementation and a living specification — the Lithic State Specification — for how AI agents should persist, recover, and resume state across distributed infrastructure. The goal is a community-ratified standard that any runtime, framework, or cloud provider can implement.
Every agent action emits a structured, ordered event. The log is the source of truth — the database is a derived read model.
Checkpoints are keyed by (agent_id, step_index), so a retried write resolves to the same record instead of creating a duplicate. Retries are safe by construction, with no deduplication logic required in application code.
Any agent process can call resume() and reconstruct its exact prior state from the log, regardless of how it terminated.
The specification defines semantics, not wire protocol. The reference implementation uses NATS JetStream; the spec is open to Kafka, Redpanda, or custom transports.
Every state transition is structured, queryable, and exportable. Operators can introspect, diff, and audit the full lifecycle of any agent run without modifying application code.
The write path from agent call to durability acknowledgement must complete in sub-millisecond time under normal operating conditions, regardless of database backpressure.
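To make the idempotency and resume principles above concrete, here is a minimal sketch using SQLite in place of MySQL / TiDB. The table layout and the `open_store`, `checkpoint`, and `resume` functions are assumptions for illustration, not the Skialith API or schema; in particular, the source does not say whether a redelivered checkpoint is ignored or overwritten, and this sketch chooses to ignore it.

```python
import json
import sqlite3


def open_store():
    """Create an in-memory store keyed by (agent_id, step_index)."""
    db = sqlite3.connect(":memory:")
    db.execute(
        """CREATE TABLE checkpoints (
               agent_id   TEXT    NOT NULL,
               step_index INTEGER NOT NULL,
               data       TEXT    NOT NULL,
               PRIMARY KEY (agent_id, step_index)
           )"""
    )
    return db


def checkpoint(db, agent_id, step_index, data):
    """Write a checkpoint; a redelivery of the same key is a no-op.

    Returns True if a new row was written, False if the key already existed.
    """
    cur = db.execute(
        "INSERT OR IGNORE INTO checkpoints VALUES (?, ?, ?)",
        (agent_id, step_index, json.dumps(data)),
    )
    db.commit()
    return cur.rowcount == 1


def resume(db, agent_id):
    """Return (step_index, data) for the latest checkpoint, or None."""
    row = db.execute(
        "SELECT step_index, data FROM checkpoints "
        "WHERE agent_id = ? ORDER BY step_index DESC LIMIT 1",
        (agent_id,),
    ).fetchone()
    if row is None:
        return None
    return row[0], json.loads(row[1])


if __name__ == "__main__":
    db = open_store()
    checkpoint(db, "my-agent", 0, {"messages": ["hi"]})
    checkpoint(db, "my-agent", 0, {"messages": ["hi"]})  # at-least-once redelivery
    checkpoint(db, "my-agent", 1, {"messages": ["hi", "there"]})
    print(resume(db, "my-agent"))
```

Because the primary key carries the idempotency guarantee, the caller can retry blindly after a timeout without first checking whether the earlier attempt landed.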
Contribute to the specification → github.com/leanerrk-star/skialith
Managed Control Plane · Roadmap
The open-source core provides the durable event log. The managed control plane turns that log into actionable visibility — features that cannot be retrofitted by any framework that does not own the state layer.
Compare expected vs. actual agent state at any step. Detect regressions before they reach production.
Rewind any agent run to an exact step and replay forward. Reproduce bugs deterministically without re-running LLM calls.
Trace how state flows across agent boundaries in multi-agent pipelines. Identify the exact handoff that caused a downstream failure.
Replicate the event log across regions for geo-redundant recovery. Agents survive regional outages without losing a single step.
Quantify the cost of failed runs — token spend, wall-clock time, and downstream impact — to make the case for durability budgets.
Deploy the control plane into your own cloud account. Agent state never leaves your infrastructure — required for regulated industries.
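The replay feature described above rests on one idea: if every model output is recorded in the log, a run can be rebuilt from the log alone. The sketch below illustrates that idea under simplifying assumptions. `CountingLLM`, `run_agent`, and `replay` are hypothetical names, the "model" is a stub that echoes its prompt, and the log is a plain list rather than a Skialith event stream.

```python
class CountingLLM:
    """Stub model that counts how many times it is called."""

    def __init__(self):
        self.calls = 0

    def __call__(self, prompt):
        self.calls += 1
        return f"answer to: {prompt}"


def run_agent(prompts, llm, log):
    """Live run: call the model for each step and record the result."""
    state = []
    for step, prompt in enumerate(prompts):
        output = llm(prompt)
        log.append({"step": step, "prompt": prompt, "output": output})
        state.append(output)
    return state


def replay(log, upto_step):
    """Rebuild state through upto_step from recorded outputs only."""
    ordered = sorted(log, key=lambda event: event["step"])
    return [event["output"] for event in ordered if event["step"] <= upto_step]


if __name__ == "__main__":
    llm, log = CountingLLM(), []
    final_state = run_agent(["plan", "search", "summarise"], llm, log)
    rewound = replay(log, upto_step=1)  # no model calls happen here
    print(llm.calls, rewound, replay(log, upto_step=2) == final_state)
```

Comparing a replayed state against the state a new code version produces at the same step is one plausible basis for the drift detection mentioned above, though the source does not describe how the control plane implements it.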
Collaborative Technical Validation
We are working with a small number of engineering teams to stress-test the Lithic State Specification against real distributed agent workloads. This is a technical collaboration, not a sales process.
Participants get early access to the reference implementation, direct input on the RFC, and co-authorship credit on the specification where contributions are substantial.