Architecture
Layer Overview
graph TB
App[Your Application<br/>Rust, Python, or future bindings]
Bindings[Language Bindings<br/>thin layer — workflow definition only]
Rust[Rust Bindings<br/>native]
Python[Python Bindings<br/>PyO3]
Node[Node.js<br/>planned]
Runtime[Sayiir Runtime Rust]
CheckpointRunner[CheckpointingRunner<br/>single process]
PooledWorker[PooledWorker<br/>distributed, multi-machine]
Persistence[Persistence Layer]
InMem[InMemory Backend]
Postgres[Postgres Backend]
Custom[Custom Backend]
Enterprise[Enterprise gRPC Server]
App --> Bindings
Bindings --> Rust
Bindings --> Python
Bindings --> Node
Rust --> Runtime
Python --> Runtime
Node --> Runtime
Runtime --> CheckpointRunner
Runtime --> PooledWorker
CheckpointRunner --> Persistence
PooledWorker --> Persistence
Persistence --> InMem
Persistence --> Postgres
Persistence --> Custom
Persistence --> Enterprise
Core Concepts
Workflow Definition vs Workflow Instance
A workflow definition is the blueprint — the chain of tasks, forks, delays, signals, and retries you build with Flow or WorkflowBuilder. It has a definition hash (auto-computed from its structure) that uniquely identifies the shape of the workflow.
A workflow instance is a single execution of that definition. When you call engine.run(workflow, "order-123", ...), "order-123" is the instance ID — a user-provided string that uniquely identifies this run. The same definition can have thousands of concurrent instances, each with its own instance ID and independent state.
Definition: fetch_user → send_email (definition_hash = "a1b2c3")
Instance 1: instance_id = "order-123" (InProgress, at send_email)
Instance 2: instance_id = "order-456" (Completed)
Instance 3: instance_id = "order-789" (Failed)
The instance_id is the primary key for all persistence — snapshots, signals, task claims. You choose it, so it can be meaningful to your domain (order IDs, user IDs, idempotency keys).
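As a rough illustration, assuming a Flow-style builder and the engine.run(definition, instance_id, input) shape shown above (the builder methods and argument order here are assumptions, not the exact API):

// One definition (the blueprint); its definition_hash is derived from this structure.
let workflow = Flow::new("order-processing")
    .task("fetch_user", fetch_user)
    .task("send_email", send_email)
    .build()?;

// Many independent instances of that one definition, each keyed by a domain-meaningful ID.
engine.run(&workflow, "order-123", input_a).await?;
engine.run(&workflow, "order-456", input_b).await?;
engine.run(&workflow, "order-789", input_c).await?;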
What is a Run?
A run is a single invocation of engine.run() or engine.resume(). It executes tasks sequentially from the current position until the workflow completes, fails, pauses, parks at a delay, or parks at a signal.
In single-process mode (CheckpointingRunner / DurableEngine), a run executes all tasks in one process. After each task, the snapshot is checkpointed. If the process crashes, call resume() to continue from the last checkpoint.
In distributed mode (PooledWorker), a run is split across multiple workers. Each worker polls for available tasks, claims one, executes it, checkpoints the result, and releases the claim. The next available worker picks up the following task. No single process owns the full execution — workers collaborate through the persistence layer.
Both modes use the same workflow definition, the same persistence traits, and the same snapshot format. The only difference is who drives the execution loop.
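A sketch of that difference in code, with constructor and method names assumed rather than taken from the actual API:

// Single-process: one engine drives every task, checkpointing after each.
let engine = CheckpointingRunner::new(backend.clone(), codec.clone());
engine.run(&workflow, "order-123", input).await?;
// After a crash, continue from the last checkpoint instead of starting over.
engine.resume(&workflow, "order-123").await?;

// Distributed: each machine runs a worker that polls, claims, executes, and releases.
let worker = PooledWorker::new("worker-a", backend, codec);
worker.run().await?; // hypothetical entry point for the worker's polling loop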
The Checkpoint-Resume Lifecycle
graph LR
Start([Start]) --> Execute[Execute Task]
Execute --> Checkpoint[Checkpoint State]
Checkpoint --> More{More tasks?}
More -->|Yes| Execute
More -->|No| Complete([Complete])
Crash([Process Crash]) -.-> Load[Load Last Checkpoint]
Load --> Resume[Resume from Last Task]
Resume --> Execute
When a workflow runs, Sayiir executes tasks sequentially. After each task completes, the workflow state is checkpointed to the persistence backend. If the process crashes, the workflow can be resumed from the last checkpoint — no replay, no re-execution of completed tasks.
This checkpoint-resume model is what eliminates the need for deterministic code. Your tasks can have side effects, call external APIs, generate random values, or read the system time — because Sayiir never re-executes completed work.
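For instance, a task like the following (the function body and crate usage are purely illustrative) can freely read the clock, generate randomness, or call external services, because once its result is checkpointed it is never run again:

// Non-deterministic work is safe: a completed task is checkpointed, and resume()
// continues from the next task instead of replaying this one.
async fn create_invoice(order_id: &str) -> Result<String, String> {
    let nonce: u64 = rand::random();                     // random value, recorded once
    let issued_at = std::time::SystemTime::now()         // wall-clock read, not replayed
        .duration_since(std::time::UNIX_EPOCH)
        .map_err(|e| e.to_string())?
        .as_secs();
    // An external API call (a side effect) here would likewise run at most once per completion.
    Ok(format!("invoice-{order_id}-{nonce}-{issued_at}"))
}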
Hexagonal Design
Sayiir’s internals follow hexagonal (ports & adapters) architecture. The core domain (sayiir-core) has zero infrastructure dependencies — pure business logic. All dependencies flow inward:
core ← persistence ← runtime ← language bindings
Every integration point is a trait-based port with swappable adapters:
- Codec — rkyv, JSON, or your own serializer
- PersistentBackend — InMemory, PostgreSQL, or your own storage
- CoreTask — closures, registry lookups, or your own execution model
- WorkflowRunner — single-process, distributed, or your own topology
This isn’t accidental. It means you can swap any layer without touching the others. Test with InMemory, deploy with PostgreSQL. Prototype with JSON, optimize with rkyv, or plug in any custom Codec of your choice (Protobuf, Avro, etc.). Run single-process locally, distribute across machines in production. Same workflow code, different adapters.
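A small sketch of that split, assuming the runner is generic over the two ports (the generic parameters and constructor are assumptions about the actual types):

// Application code is written against the ports (traits), not the adapters.
fn build_engine<B: PersistentBackend, C: Codec>(backend: B, codec: C) -> CheckpointingRunner<B, C> {
    CheckpointingRunner::new(backend, codec)
}

// Tests: in-memory storage with a human-readable codec.
let dev = build_engine(InMemoryBackend::new(), JsonCodec::new());

// Production: PostgreSQL storage with zero-copy rkyv. Workflow code is unchanged.
let prod = build_engine(PostgresBackend::new(pool), RkyvCodec::new());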
Pluggable Codecs
Serialization is pluggable — bring your own format or use the built-in options:
// Zero-copy for maximum performance (default)
let codec = RkyvCodec::new();

// Human-readable for debugging
let codec = JsonCodec::new();

// Custom format (implement Codec trait)
let codec = MyCustomCodec::new();

- rkyv (default) — Zero-copy deserialization for maximum performance
- JSON — Human-readable, enable with --features json
- Custom — Implement the Codec trait for any format (Protobuf, MessagePack, etc.); a sketch follows below
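For example, a MessagePack codec might look roughly like this, assuming the Codec trait is a byte-level encode/decode pair; the actual trait signatures, bounds, and error type may differ:

// Hypothetical custom codec; consult the real Codec trait for the exact method signatures.
struct MessagePackCodec;

impl Codec for MessagePackCodec {
    fn encode<T: serde::Serialize>(&self, value: &T) -> Result<Vec<u8>, CodecError> {
        rmp_serde::to_vec(value).map_err(|e| CodecError::from(e.to_string()))
    }

    fn decode<T: serde::de::DeserializeOwned>(&self, bytes: &[u8]) -> Result<T, CodecError> {
        rmp_serde::from_slice(bytes).map_err(|e| CodecError::from(e.to_string()))
    }
}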
Pluggable Storage Backends
// In-memory (testing)
let backend = InMemoryBackend::new();

// PostgreSQL (production)
let backend = PostgresBackend::new(pool);

// Custom (bring your own)
impl PersistentBackend for MyBackend { ... }

- InMemory — For testing
- PostgreSQL — For production
- Custom — Implement the PersistentBackend trait for anything else (Redis, DynamoDB, SQLite, Cloudflare Durable Objects)
Why Rust?
The core runtime is written in Rust for safety, performance, and correctness — the properties that matter most for infrastructure that runs your critical business processes. But Sayiir is not a Rust-only tool. Python bindings are available today, with TypeScript and Go planned — so you get Rust’s reliability without leaving your ecosystem. The binding is a thin layer: you write task functions in your language, Rust handles all orchestration, checkpointing, and execution.
Performance
Designed to scale to hundreds of thousands of concurrent activities:
- Zero-copy deserialization with rkyv codec
- Minimal coordination — workers claim tasks independently
- Per-task checkpointing — fine-grained durability
- No global locks — optimistic concurrency
Distributed Retry Resilience
When a task fails in distributed mode, Sayiir uses soft worker affinity to prefer retrying on a different worker. This improves resilience against worker-local failures — corrupted caches, unhealthy dependencies, resource exhaustion, or environment-specific bugs.
How it works
- A worker executes a task and it fails (error, timeout, or panic)
- The worker records the retry in the snapshot, tagging itself as the last_failed_worker
- The task claim is released, making the task available for any worker to pick up
- When workers poll for available tasks, the backend sorts results so that tasks which did not fail on the requesting worker come first
This is a soft bias, not a hard exclusion. If the failed worker is the only one available, it will still pick up the task — no work is left stranded. But when multiple workers are polling, tasks naturally migrate away from the worker that failed them.
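In effect the backend applies a stable sort at poll time. A minimal sketch of that ordering, with illustrative struct and field names rather than the actual schema:

// Soft affinity: tasks that last failed on the requesting worker sort to the back,
// but they are never filtered out, so a lone worker can still claim them.
struct ClaimableTask {
    instance_id: String,
    task_name: String,
    last_failed_worker: Option<String>,
}

fn order_for_worker(mut tasks: Vec<ClaimableTask>, worker_id: &str) -> Vec<ClaimableTask> {
    // false ("did not fail here") sorts before true ("failed here last time").
    tasks.sort_by_key(|t| t.last_failed_worker.as_deref() == Some(worker_id));
    tasks
}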
Why soft affinity over hard exclusion
- No starvation — A single-worker deployment still retries normally
- Self-healing — Transient worker issues resolve without manual intervention; the task moves to a healthy worker while the original recovers
- No configuration — The bias is automatic; no retry routing rules to maintain
- Distributed fault isolation — If Worker A has a bad network path to an external service, retries on Worker B bypass the issue entirely without any operator awareness