# Storage
LiteJoin uses SQLite as its primary data store, with optional tiered storage for long-term retention.
## SQLite Sharding
Data is distributed across multiple SQLite databases using FNV hashing on the message key:
```yaml
storage:
  shard_count: 8        # Number of SQLite shards
  data_dir: "./data"    # Directory for shard files
  reader_pool_size: 4   # Reader connections per shard
```
Each shard is a separate `.db` file in WAL mode, with a single writer connection and a configurable pool of reader connections.
| Setting | Default | Description |
|---|---|---|
| `shard_count` | `8` | Number of SQLite database shards. More shards give better write parallelism. |
| `data_dir` | `./data` | Directory for all data files. |
| `reader_pool_size` | `4` | Reader connections per shard for concurrent queries. |
## Retention
Data older than the retention duration is periodically deleted:
```yaml
retention:
  duration: 24h        # Keep data for 24 hours
  clean_interval: 1m   # Check every minute
```
When tiered storage is disabled, deleted data is permanently lost. Set retention based on your downstream query needs.
## Tiered Storage (Optional)
When enabled, LiteJoin compacts expired data into Parquet files before deleting from SQLite. These files are queryable via an embedded DuckDB instance and can optionally be uploaded to cloud storage.
### How It Works
```
SQLite (hot) → Compactor → Parquet (warm) → Uploader → Cloud Storage (cold)
```
1. Retention fires: rows older than the TTL become eligible for compaction.
2. The compactor reads those rows from SQLite and writes them to Parquet files with Snappy compression.
3. The rows are deleted from SQLite, reclaiming space.
4. DuckDB serves historical queries over the Parquet files.
5. The uploader (optional) copies Parquet files to S3, GCS, or Azure Blob Storage.
### Configuration
```yaml
storage:
  archive:
    enabled: true
    compaction_interval: 1m
    target_file_size: 128MB
    compression: snappy          # snappy | zstd | none
    duckdb_memory_limit: 256MB
    local_retention: 168h        # Keep local Parquet for 7 days
    cloud:
      enabled: false
      provider: s3               # s3 | gcs | azure
      bucket: my-litejoin-archive
      prefix: litejoin/
      region: us-east-1
      upload_concurrency: 4
```
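Size values such as `target_file_size: 128MB` use binary units (128 MB ≈ 134 MB decimal). A parsing sketch — the accepted unit set and binary interpretation are assumptions here, not LiteJoin's documented grammar:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSize converts values like "128MB" or "256MB" (as used for
// target_file_size and duckdb_memory_limit) into a byte count.
// Units are interpreted as binary (MB = 2^20); this is an assumption.
func parseSize(s string) (int64, error) {
	units := []struct {
		suffix string
		factor int64
	}{
		{"GB", 1 << 30},
		{"MB", 1 << 20},
		{"KB", 1 << 10},
		{"B", 1},
	}
	for _, u := range units {
		if strings.HasSuffix(s, u.suffix) {
			n, err := strconv.ParseInt(strings.TrimSuffix(s, u.suffix), 10, 64)
			if err != nil {
				return 0, err
			}
			return n * u.factor, nil
		}
	}
	return 0, fmt.Errorf("unknown size: %q", s)
}

func main() {
	n, _ := parseSize("128MB")
	fmt.Println(n) // 134217728
}
```

Note the unit list is checked longest-suffix-first so that `"128MB"` matches `MB` rather than the bare `B` suffix.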
### Archive Config Reference
| Field | Type | Default | Description |
|---|---|---|---|
| `enabled` | bool | `false` | Enable tiered storage. |
| `compaction_interval` | duration | `1m` | How often compaction runs. |
| `target_file_size` | string | `128MB` | Target Parquet file size. |
| `compression` | string | `snappy` | Parquet compression codec. |
| `duckdb_memory_limit` | string | `256MB` | Maximum memory for DuckDB queries. |
| `duckdb_threads` | int | `0` | DuckDB threads; `0` matches `GOMAXPROCS`. |
| `local_retention` | duration | `168h` | How long to keep local Parquet files. |
### Cloud Config Reference
| Field | Type | Default | Description |
|---|---|---|---|
| `cloud.enabled` | bool | `false` | Enable cloud upload. |
| `cloud.provider` | string | — | `s3`, `gcs`, or `azure`. |
| `cloud.bucket` | string | — | Bucket name. |
| `cloud.prefix` | string | — | Key prefix within the bucket. |
| `cloud.region` | string | — | Cloud region. |
| `cloud.upload_concurrency` | int | `4` | Parallel upload workers. |
| `cloud.upload_timeout` | duration | `5m` | Per-file upload timeout. |
## Data Lifecycle Example
Given `retention: 1h` and `archive.local_retention: 168h`:
| Time | Tier | State |
|---|---|---|
| t=0 | SQLite | Written, available for real-time joins |
| t=1h | Parquet (local) | Compacted from SQLite, queryable via DuckDB |
| t=1h+30s | Parquet + Cloud | Uploaded to S3 (if enabled) |
| t=7d | Cloud only | Local Parquet evicted |
| t=∞ | Cloud | Retained indefinitely |
## Querying Historical Data
Historical data is queryable via the Snapshot API. When a `from` parameter extends beyond the retention window, the snapshot handler automatically queries Parquet files via DuckDB.
The hot path (real-time joins) incurs no overhead from tiered storage; DuckDB is used only for historical queries.