Source available · BUSL 1.1 · Free to self-host

The durability layer
production AI agents are missing.

Skialith gives every agent a complete, ordered history of its state transitions — enabling crash recovery, replay, drift detection, and forensic audit that no framework can replicate with a library update.


Per-event write latency · p50

skialith (NATS PubAck) 133 us

Database-first (MySQL INSERT) 986 us

Measured locally against NATS + MySQL. Reproduce with cargo run --bin benchmark.

The problem

95% of AI agents fail in production.

A multi-step agent that crashes mid-run loses all progress and replays every LLM call from the beginning. Existing frameworks offer no crash recovery, no audit trail, and no visibility into what the agent actually did — making debugging, compliance, and reliable orchestration structurally impossible.

⚡

Hot path is slow

Synchronous DB writes add ~1 ms per event. Across hundreds of steps, that adds up to hundreds of milliseconds of blocked agent time.

💥

Crashes lose work

Without a durable log, a process crash anywhere in the pipeline means restarting from zero.

🔄

Retries cause duplicates

At-least-once delivery without idempotency creates duplicate rows and corrupted state.

How it works

Write-ahead log first. Database second.

Every save_event call is acknowledged by NATS JetStream before returning to your agent. A background writer batches those events into your database — keeping the hot path fast and the data durable.

Agent calls save_event

Your agent publishes an event. Skialith serialises it and sends it to NATS JetStream.

NATS PubAck returned

JetStream confirms the write in ~133 us (p50). Your agent unblocks without waiting on the database.

Background batch to DB

A background task collects events and flushes in efficient batches with automatic retry.

Agent
  |  save_event / checkpoint
  v
Skialith sidecar
  |-- NATS JetStream  <-- PubAck ~133us returned to caller
  |       |
  |       +-- Background batch writer
  |                 +-- MySQL / TiDB  (async, retried, idempotent)
  |
  +-- trace_ingest consumer  -->  agent_traces table
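
To make the log-first ordering concrete, here is a minimal in-memory sketch of the pattern, not the Skialith sidecar itself. InMemoryLog, Sidecar, and flush_once are hypothetical names invented for illustration; a real deployment would use a NATS JetStream client and a MySQL / TiDB connection in their place.

```python
import asyncio


class InMemoryLog:
    """Hypothetical stand-in for a durable log such as a JetStream stream."""

    def __init__(self):
        self.entries = []

    async def append(self, event):
        self.entries.append(event)
        return len(self.entries)  # the sequence number plays the role of the ack


class Sidecar:
    """Hypothetical sidecar: acknowledge from the log first, write to the DB later."""

    def __init__(self, log, batch_size=3):
        self.log = log
        self.batch_size = batch_size
        self.queue = asyncio.Queue()
        self.db_rows = {}  # stand-in for the database table

    async def save_event(self, agent_id, event_id, payload):
        event = {"agent_id": agent_id, "event_id": event_id, "payload": payload}
        seq = await self.log.append(event)  # the caller only waits for this step
        self.queue.put_nowait(event)        # the database write is deferred
        return seq

    async def flush_once(self):
        """Move up to batch_size queued events into the 'database'."""
        batch = []
        while len(batch) < self.batch_size and not self.queue.empty():
            batch.append(self.queue.get_nowait())
        for event in batch:
            # Keyed write: replaying the same event overwrites instead of duplicating.
            self.db_rows[(event["agent_id"], event["event_id"])] = event["payload"]
        return len(batch)


async def demo():
    sidecar = Sidecar(InMemoryLog())
    for i in range(4):
        await sidecar.save_event("my-agent", f"step-{i}", {"kind": "thought"})
    acked = len(sidecar.log.entries)
    rows_before_flush = len(sidecar.db_rows)
    flushed = await sidecar.flush_once()
    return acked, rows_before_flush, flushed
```

In this sketch all four events are acknowledged before any row reaches the database, and a single flush moves one batch of three.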

Performance

Numbers you can reproduce.

Run the benchmarks yourself against a local NATS + MySQL stack.

Scenario	p50	p95	p99
save_event (NATS PubAck)	133 us	265 us	386 us
Baseline MySQL INSERT	986 us	1.5 ms	2.6 ms

cargo run --bin benchmark
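
If you collect your own latency samples, the percentile columns can be computed with a few lines of Python. This is a hedged sketch using the nearest-rank method; it is not the project's benchmark binary, and the percentile and summarize helpers are illustrative names.

```python
def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, clamped so the rank is at least 1.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize(samples_us):
    """Return the p50 / p95 / p99 latencies for a list of samples in microseconds."""
    return {f"p{p}": percentile(samples_us, p) for p in (50, 95, 99)}
```

Applied to the published p50 figures, the ratio is 986 / 133, or roughly 7.4× in favour of the log-first path.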

Ecosystem

The missing persistence layer.

KV-cache optimizations like TurboQuant compress volatile memory to make inference cheaper. Skialith provides what compressed volatile memory structurally cannot — durable, crash-recoverable state across the full agent lifecycle.

Compute efficiency

KV-Cache Compression

Techniques like TurboQuant shrink attention caches 6×, reducing the per-token cost of long-context inference. Volatile by design: the cache lives only in memory for the duration of an inference session.

Durable persistence

Skialith State Plane

Persists agent checkpoints, tool call results, and step events across process boundaries. Survives crashes, restarts, and preemptions — providing the long-horizon memory that volatile KV caches cannot.

As inference costs fall, agents run longer and attempt more complex tasks — raising the cost of a failed run and the value of a durable state layer.

Integrations

Drop in. No rewrites.

Agents are plain async functions. SDKs are thin HTTP clients — no Rust required.

Python

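import asyncio

from skialith import SkialithAgent

async def main():
    messages = []  # conversation history accumulated by your agent
    async with SkialithAgent(agent_id="my-agent") as agent:
        # Load the last persisted state for this agent
        state = await agent.resume()
        # Persist a checkpoint at the current step
        await agent.checkpoint(
            step=state.step_index,
            data={"messages": messages}
        )
        # Record a single event on the durable log
        await agent.save_event("step-1", {
            "kind": "thought", "text": "..."
        })

asyncio.run(main())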

TypeScript

import { SkialithAgent } from "@skialith/agent-core";

// Conversation history accumulated by your agent
const messages: unknown[] = [];

const agent = new SkialithAgent({ agentId: "my-agent" });
// Load the last persisted state for this agent
const state = await agent.resume();

await agent.checkpoint(state.stepIndex, { messages });
await agent.saveEvent("step-1", {
  kind: "thought", text: "..."
});

LangGraph

from skialith.langchain import SkialithCheckpointer

checkpointer = SkialithCheckpointer()
# `graph` is your existing LangGraph StateGraph
app = graph.compile(checkpointer=checkpointer)

# No other changes needed (run this inside an async function)
result = await app.ainvoke(
    {"messages": [...]},
    config={"configurable": {"thread_id": "agent-1"}}
)

Reference Implementation · RFC

Defining the standard for
distributed agentic state.

Skialith is both a working implementation and a living specification — the Lithic State Specification — for how AI agents should persist, recover, and resume state across distributed infrastructure. The goal is a community-ratified standard that any runtime, framework, or cloud provider can implement.

LSS-01

Durable Event Log

Every agent action emits a structured, ordered event. The log is the source of truth — the database is a derived read model.

LSS-02

Idempotent Checkpoints

Checkpoints are keyed by (agent_id, step_index). Retries are safe by construction, so application code needs no deduplication logic.
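
One way to get this property is a composite primary key plus an upsert. The sketch below uses SQLite from the Python standard library as a stand-in for MySQL / TiDB; the checkpoints table layout and the helper names are assumptions for illustration, not the reference schema.

```python
import json
import sqlite3


def open_store():
    """Create an in-memory table keyed by (agent_id, step_index)."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE checkpoints ("
        " agent_id TEXT NOT NULL,"
        " step_index INTEGER NOT NULL,"
        " data TEXT NOT NULL,"
        " PRIMARY KEY (agent_id, step_index))"
    )
    return conn


def write_checkpoint(conn, agent_id, step_index, data):
    """Upsert: a retried write replaces the existing row instead of adding one."""
    conn.execute(
        "INSERT OR REPLACE INTO checkpoints (agent_id, step_index, data)"
        " VALUES (?, ?, ?)",
        (agent_id, step_index, json.dumps(data)),
    )
    conn.commit()


def count_checkpoints(conn, agent_id):
    row = conn.execute(
        "SELECT COUNT(*) FROM checkpoints WHERE agent_id = ?", (agent_id,)
    ).fetchone()
    return row[0]
```

Writing the same (agent_id, step_index) twice leaves exactly one row. In MySQL the comparable statement is INSERT ... ON DUPLICATE KEY UPDATE.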

LSS-03

Resumable Execution

Any agent process can call resume() and reconstruct its exact prior state from the log, regardless of how it terminated.
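
Conceptually, resuming is a fold over the ordered log. The following is a simplified sketch under assumed entry shapes (the "type", "step_index", and "event_id" fields and the AgentState class are hypothetical); the actual resume() in the SDK may reconstruct state differently.

```python
from dataclasses import dataclass, field


@dataclass
class AgentState:
    step_index: int = 0
    data: dict = field(default_factory=dict)
    pending_events: list = field(default_factory=list)


def resume(log, agent_id):
    """Fold the ordered log into the latest checkpoint plus the events after it."""
    state = AgentState()
    for entry in log:
        if entry["agent_id"] != agent_id:
            continue
        if entry["type"] == "checkpoint":
            state.step_index = entry["step_index"]
            state.data = entry["data"]
            state.pending_events = []  # earlier events are covered by this checkpoint
        elif entry["type"] == "event":
            state.pending_events.append(entry["event_id"])
    return state


# Example log with entries from two agents interleaved.
LOG = [
    {"agent_id": "my-agent", "type": "checkpoint", "step_index": 0, "data": {"messages": []}},
    {"agent_id": "my-agent", "type": "event", "event_id": "step-1"},
    {"agent_id": "my-agent", "type": "checkpoint", "step_index": 1, "data": {"messages": ["hi"]}},
    {"agent_id": "other", "type": "event", "event_id": "step-9"},
    {"agent_id": "my-agent", "type": "event", "event_id": "step-2"},
]
```

Because the result depends only on the log contents, it does not matter whether the previous process exited cleanly or was killed.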

LSS-04

Transport Agnostic

The specification defines semantics, not wire protocol. The reference implementation uses NATS JetStream; the spec is open to Kafka, Redpanda, or custom transports.
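
A transport-agnostic design can be expressed as a small interface that any backend satisfies. The sketch below is an assumption about what such an interface might look like; LogTransport, InMemoryTransport, and the agents.<id>.events subject naming are illustrative, not part of the specification.

```python
from typing import Protocol


class LogTransport(Protocol):
    """Hypothetical minimal contract: durably append and return a sequence number."""

    def publish(self, subject: str, payload: bytes) -> int:
        ...


class InMemoryTransport:
    """Test double; a JetStream or Kafka adapter would expose the same method."""

    def __init__(self):
        self._streams = {}

    def publish(self, subject: str, payload: bytes) -> int:
        stream = self._streams.setdefault(subject, [])
        stream.append(payload)
        return len(stream)


def save_event(transport: LogTransport, agent_id: str, payload: bytes) -> int:
    """Caller code depends only on the LogTransport contract, not on the backend."""
    return transport.publish(f"agents.{agent_id}.events", payload)
```

Swapping backends then means supplying a different object with the same publish method.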

LSS-05

Observable State

Every state transition is structured, queryable, and exportable. Operators can introspect, diff, and audit the full lifecycle of any agent run without modifying application code.

LSS-06

Bounded Hot Path

The write path from agent call to durability acknowledgement must complete in sub-millisecond time under normal operating conditions, regardless of database backpressure.

Contribute to the specification → github.com/leanerrk-star/skialith

Managed Control Plane · Roadmap

From durable log to full agent observability.

The source-available core provides the durable event log. The managed control plane turns that log into actionable visibility, with features that cannot be retrofitted by any framework that does not own the state layer.

📈

State Drift Visualization

Compare expected vs. actual agent state at any step. Detect regressions before they reach production.

▶

Agent Replay & Time-Travel Debug

Rewind any agent run to an exact step and replay forward. Reproduce bugs deterministically without re-running LLM calls.
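
The general idea behind replay without new LLM calls is record-and-replay: outputs are stored per step on the first run and served from the record afterwards. This is a generic sketch of that technique, not the control plane's implementation; Recorder, fake_llm, and run_agent are hypothetical names.

```python
class Recorder:
    """Serve a recorded result for a step if one exists, otherwise call through."""

    def __init__(self, recorded=None):
        self.recorded = dict(recorded or {})
        self.live_calls = 0

    def call(self, step, fn, *args):
        if step in self.recorded:
            return self.recorded[step]  # replay path: no live call is made
        self.live_calls += 1
        result = fn(*args)
        self.recorded[step] = result
        return result


def fake_llm(prompt):
    """Stand-in for a model call so the sketch runs without network access."""
    return prompt.upper()


def run_agent(recorder):
    plan = recorder.call(0, fake_llm, "plan the task")
    answer = recorder.call(1, fake_llm, plan + " -> answer")
    return answer
```

A first run with an empty Recorder makes two live calls; a second run seeded with the first run's record returns the same answer with zero live calls.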

🔗

Cross-Agent State Lineage

Trace how state flows across agent boundaries in multi-agent pipelines. Identify the exact handoff that caused a downstream failure.

🌎

Multi-Region Durability

Replicate the event log across regions for geo-redundant recovery. Agents survive regional outages without losing a single step.

🛡

Resume Insurance Analytics

Quantify the cost of failed runs — token spend, wall-clock time, and downstream impact — to make the case for durability budgets.

🏢

Sovereign VPC Deployment

Deploy the control plane into your own cloud account. Agent state never leaves your infrastructure — required for regulated industries.

Collaborative Technical Validation

Help shape the specification.

We are working with a small number of engineering teams to stress-test the Lithic State Specification against real distributed agent workloads. This is a technical collaboration, not a sales process.

Participants get early access to the reference implementation, direct input on the RFC, and co-authorship credit on the specification where contributions are substantial.
