Architecture
WARNING
Treat this page as a narrative companion for developers who enjoy reading about low-level engineering, not as operational documentation you would rely on for debugging or performance tuning. If something here contradicts the code, the code wins.
System Overview
┌───────────────────────────────────────────────────┐
│ Client │
├───────────────────────────────────────────────────┤
│ HTTP (port 3000) │
├───────────────────────────────────────────────────┤
│ Axum Router (src/routes.rs) │
│ /collections /search /suggest /items │
├───────────────────────────────────────────────────┤
│ Store Engine (src/store.rs) │
│ Inverted Index · Tokenization · ID Strategy │
├───────────────────────────────────────────────────┤
│ fjall LSM-tree Database │
│ Keyspaces: _collections, _index_queue, │
│ {col}.inverted, {col}.docs │
└───────────────────────────────────────────────────┘
The server has three layers:
- HTTP Layer — Axum router exposing REST endpoints.
- Store Engine — Core logic: tokenization, inverted index management, search/insert.
- Persistence Layer — fjall LSM-tree database for on-disk storage.
HTTP Layer (src/routes.rs)
An Axum Router maps endpoints to handler functions that delegate to the Store. All state is shared via Arc<Store>.
| Method | Path | Handler |
|---|---|---|
| GET | /status | Health check |
| GET | /collections | List collections |
| POST | /collections | Create collection |
| GET | /collections/{name} | Collection metadata |
| DELETE | /collections/{name} | Delete collection |
| POST | /collections/{name}/items | Upsert document |
| DELETE | /collections/{name}/items/{id} | Delete document |
| GET | /collections/{name}/search?q=... | Search documents |
| GET | /collections/{name}/suggest?q=... | Autocomplete |
| POST | /backup/export | Export database snapshot to a file in the dumps folder |
| POST | /backup/import | Import a snapshot from the dumps folder |
Store Engine (src/store.rs)
The Store struct is the heart of Aperio. It holds:
- `db: fjall::Database`: the underlying database handle.
- `config: StoreConfig`: tunable parameters (shard sizes, token length, compression, index interval).
- `collections: RwLock<HashMap<String, CollectionMeta>>`: in-memory registry of known collections, their ID type and searchable fields.
- `lock: Mutex<()>`: serializes write operations (upsert/delete) for index consistency.
- `next_seq: AtomicU64`: monotonic sequence counter for the indexing queue.
- `background_active: AtomicBool`: whether the background indexer is running.
Tokenization
Document content is tokenized using charabia:
content → tokenize() → filter(is_word) → lemma() → filter(min_token_length)
Tokens are deduplicated into a HashSet<String> before indexing.
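The shape of that pipeline can be sketched without charabia. The version below is only an approximation: it swaps charabia's segmentation and lemmatization for a plain split on non-alphanumeric characters plus lowercasing, and the names `tokenize_simple` and the character-based length check are assumptions made for illustration.

```rust
use std::collections::HashSet;

/// Stand-in for the charabia pipeline (NOT the real tokenizer):
/// split on non-alphanumeric characters, lowercase each piece as a crude
/// substitute for lemma(), drop tokens shorter than `min_token_length`
/// (counted in characters, an assumption), and deduplicate into a set.
fn tokenize_simple(content: &str, min_token_length: usize) -> HashSet<String> {
    content
        .split(|c: char| !c.is_alphanumeric())
        .filter(|piece| !piece.is_empty())
        .map(|piece| piece.to_lowercase())
        .filter(|token| token.chars().count() >= min_token_length)
        .collect()
}
```

Because the result is a set, a word that appears several times in a document contributes a single posting-list entry.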
Inverted Index
Each collection has an inverted index stored in a dedicated fjall keyspace ({name}.inverted). Each unique token (word) maps to posting lists of document IDs.
Word markers: a key with an empty value (word → empty bytes) signals that a word exists in the index, enabling fast prefix scans for autocomplete.
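To illustrate why marker entries make autocomplete cheap, here is a small sketch that uses a `BTreeMap` in place of the ordered fjall keyspace. The key layout (the word's bytes as the key) and the helper names `mark_word` and `suggest` are assumptions for the example, not the actual on-disk format.

```rust
use std::collections::BTreeMap;

/// Ordered map standing in for the {name}.inverted keyspace.
type Keyspace = BTreeMap<Vec<u8>, Vec<u8>>;

/// Record that `word` exists: the key is the word, the value is empty.
fn mark_word(ks: &mut Keyspace, word: &str) {
    ks.insert(word.as_bytes().to_vec(), Vec::new());
}

/// Collect up to `limit` words starting with `prefix` by scanning the
/// ordered keys from the prefix onward and stopping at the first mismatch.
fn suggest(ks: &Keyspace, prefix: &str, limit: usize) -> Vec<String> {
    ks.range(prefix.as_bytes().to_vec()..)
        .take_while(|(key, _)| key.starts_with(prefix.as_bytes()))
        .filter(|(_, value)| value.is_empty()) // keep marker entries only
        .take(limit)
        .filter_map(|(key, _)| String::from_utf8(key.clone()).ok())
        .collect()
}
```

Since keys are sorted, all words sharing a prefix are adjacent, so the scan touches only the matching range rather than the whole index.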
Two ID Strategies
Collections are created with an id_type that determines the posting list format:
| id_type | Storage format | Data structure |
|---|---|---|
| string | rkyv-archived shards | PostingShard { first, last, ids: Vec<String> } |
| number | Serialized bitmap shards | RoaringTreemap per shard |
String IDs
Posting lists are split into shards of configurable max_shard_size (default 1000). Each shard stores sorted Vec<String> archived via rkyv. A binary search across shards locates the correct shard for insertion.
A Vec<u64> would be faster for posting-list operations, but u64 can't represent arbitrary string IDs like UUIDs, so Vec<String> is used as the general-purpose format.
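A simplified, in-memory sketch of sharded insertion follows. It omits rkyv archiving entirely, and the split-in-half policy for an over-full shard, the meaning of `first` and `last` as shard bounds, and the helper names are assumptions rather than the engine's confirmed behavior.

```rust
/// Mirrors the shape named in the table; `first`/`last` are assumed to be
/// the smallest and largest ID held by the shard.
#[derive(Debug, Clone, PartialEq)]
struct PostingShard {
    first: String,
    last: String,
    ids: Vec<String>, // sorted, unique
}

/// Build a shard from sorted IDs, deriving its bounds.
fn shard_from(ids: Vec<String>) -> PostingShard {
    PostingShard {
        first: ids.first().cloned().unwrap_or_default(),
        last: ids.last().cloned().unwrap_or_default(),
        ids,
    }
}

/// Insert `id`, keeping shards sorted and at most `max_shard_size` long.
fn insert_id(shards: &mut Vec<PostingShard>, id: &str, max_shard_size: usize) {
    if shards.is_empty() {
        shards.push(shard_from(vec![id.to_string()]));
        return;
    }
    // Binary search across shards: the first shard whose `last` is >= id,
    // or the final shard when `id` is larger than everything stored.
    let idx = shards
        .partition_point(|s| s.last.as_str() < id)
        .min(shards.len() - 1);
    let mut ids = std::mem::take(&mut shards[idx].ids);
    // Binary search inside the shard; an already-present ID is a no-op.
    if let Err(pos) = ids.binary_search_by(|probe| probe.as_str().cmp(id)) {
        ids.insert(pos, id.to_string());
    }
    if ids.len() > max_shard_size {
        // Assumed policy: split an over-full shard into two halves.
        let tail = ids.split_off(ids.len() / 2);
        shards[idx] = shard_from(ids);
        shards.insert(idx + 1, shard_from(tail));
    } else {
        shards[idx] = shard_from(ids);
    }
}
```

Keeping `first` and `last` on each shard lets a lookup pick the right shard without deserializing the ID vectors of the others.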
Number IDs
Posting lists use RoaringTreemap bitmaps, sharded at max_roaring_shard_size (default 100,000). Bitmaps offer compact storage and fast bitwise intersection for multi-term queries.
Search Execution
- Tokenize the query string.
- List shard indices for each token in parallel (via std::thread::scope).
- Sort tokens by shard count (rarest-first optimization).
- Load posting lists: for string IDs, merge shards in a sorted iterative merge; for number IDs, union shard bitmaps per word, then compute the intersection.
- Apply sort and pagination: sort by ID ascending or descending, apply optional after cursor, cap at take.
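The parallel listing and rarest-first ordering from the steps above might look roughly like this. A `HashMap` of shard counts stands in for the real keyspace lookup, and `shard_count` and `order_rarest_first` are hypothetical names.

```rust
use std::collections::HashMap;
use std::thread;

/// Hypothetical lookup: number of shards stored for `word` (0 if absent).
fn shard_count(index: &HashMap<String, usize>, word: &str) -> usize {
    index.get(word).copied().unwrap_or(0)
}

/// Look up every token on its own scoped thread, then order the tokens so
/// the one with the fewest shards comes first (ties broken alphabetically).
fn order_rarest_first(
    index: &HashMap<String, usize>,
    tokens: &[String],
) -> Vec<(String, usize)> {
    let mut counted: Vec<(String, usize)> = thread::scope(|scope| {
        // Scoped threads may borrow `index` and `tokens` directly.
        let handles: Vec<_> = tokens
            .iter()
            .map(|token| scope.spawn(move || (token.clone(), shard_count(index, token))))
            .collect();
        handles
            .into_iter()
            .map(|handle| handle.join().expect("lookup thread panicked"))
            .collect()
    });
    counted.sort_by(|a, b| a.1.cmp(&b.1).then_with(|| a.0.cmp(&b.0)));
    counted
}
```

Starting from the rarest token keeps the candidate set small, since every later token can only narrow an intersection.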
Search: String IDs
For string-ID collections, each shard is an rkyv-archived PostingShard. The engine loads all shards for the rarest word, then iterates through its sorted IDs, checking membership in other words' shards via binary search.
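As a rough model of that membership check, the sketch below treats each word's posting list as a vector of sorted, non-empty, non-overlapping shards held in plain `Vec<String>` values instead of rkyv-archived data. The type alias and function names are assumptions.

```rust
/// Sorted, non-overlapping, non-empty shards of sorted IDs for one word.
type WordShards = Vec<Vec<String>>;

/// Two-level binary search: pick the only shard that could hold `id`,
/// then search inside it.
fn contains_id(shards: &WordShards, id: &str) -> bool {
    // First shard whose last element is >= id.
    let idx = shards
        .partition_point(|shard| shard.last().map_or(true, |last| last.as_str() < id));
    shards.get(idx).map_or(false, |shard| {
        shard.binary_search_by(|probe| probe.as_str().cmp(id)).is_ok()
    })
}

/// Walk the rarest word's IDs in order and keep those present in every
/// other word's shards. The output stays sorted because the input is.
fn intersect(rarest: &WordShards, others: &[WordShards]) -> Vec<String> {
    rarest
        .iter()
        .flatten()
        .filter(|id| others.iter().all(|word| contains_id(word, id.as_str())))
        .cloned()
        .collect()
}
```

The cost is driven by the size of the rarest word's list, with only a logarithmic probe into each of the other words.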
Search: Number IDs
For number-ID collections, each shard is a RoaringTreemap. Per word, all shards are merged with bitwise OR. Words are then intersected with bitwise AND. The resulting bitmap is iterated in ascending or descending order.
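The union-then-intersect flow can be sketched with `BTreeSet<u64>` standing in for `RoaringTreemap`, a substitution made only to keep the example dependency-free. The exclusive `after` cursor semantics and the function names are assumptions.

```rust
use std::collections::BTreeSet;

/// Stand-in for a RoaringTreemap shard.
type Bitmap = BTreeSet<u64>;

/// Merge all shards of one word (set union, the OR step).
fn union_shards(shards: &[Bitmap]) -> Bitmap {
    shards.iter().fold(Bitmap::new(), |acc, shard| &acc | shard)
}

/// Intersect the merged sets of every word (the AND step).
fn intersect_words(words: &[Vec<Bitmap>]) -> Bitmap {
    let mut merged = words.iter().map(|shards| union_shards(shards));
    let first = match merged.next() {
        Some(bitmap) => bitmap,
        None => return Bitmap::new(),
    };
    merged.fold(first, |acc, bitmap| &acc & &bitmap)
}

/// Iterate in the requested order, skip everything up to and including the
/// `after` cursor (assumed exclusive), and cap the result at `take`.
fn page(ids: &Bitmap, descending: bool, after: Option<u64>, take: usize) -> Vec<u64> {
    if descending {
        ids.iter().rev().copied()
            .filter(|id| after.map_or(true, |cursor| *id < cursor))
            .take(take)
            .collect()
    } else {
        ids.iter().copied()
            .filter(|id| after.map_or(true, |cursor| *id > cursor))
            .take(take)
            .collect()
    }
}
```

With real bitmaps the union and intersection run over compressed containers rather than individual IDs, which is where the speed advantage over the string path comes from.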
Background Indexing (spawn_background)
When the background indexer is active, upsert() writes to a FIFO queue (_index_queue keyspace) instead of directly updating the index. A tokio::spawn task polls the queue at index_interval (default 900ms) and calls process_pending_queue() to drain entries through upsert_internal().
This batches write operations and reduces lock contention. When the background indexer is not active (e.g., in tests), upsert() calls upsert_internal() synchronously.
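The routing between the queued and synchronous paths can be modeled in a few lines. This sketch is a toy: `MiniStore` is a hypothetical type, a `BTreeMap` keyed by sequence number stands in for the _index_queue keyspace, and the timed tokio task is left out so that only the queue-or-index decision and the ordered drain are shown.

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Mutex;

struct MiniStore {
    queue: Mutex<BTreeMap<u64, String>>, // stands in for _index_queue
    indexed: Mutex<Vec<String>>,         // stands in for index + docs
    next_seq: AtomicU64,
    background_active: AtomicBool,
}

impl MiniStore {
    fn new(background: bool) -> Self {
        MiniStore {
            queue: Mutex::new(BTreeMap::new()),
            indexed: Mutex::new(Vec::new()),
            next_seq: AtomicU64::new(0),
            background_active: AtomicBool::new(background),
        }
    }

    /// Queue the document when the indexer is active, else index it now.
    fn upsert(&self, doc: &str) {
        if self.background_active.load(Ordering::Acquire) {
            let seq = self.next_seq.fetch_add(1, Ordering::Relaxed);
            self.queue.lock().unwrap().insert(seq, doc.to_string());
        } else {
            self.upsert_internal(doc);
        }
    }

    /// Placeholder for the real tokenization and posting-list update.
    fn upsert_internal(&self, doc: &str) {
        self.indexed.lock().unwrap().push(doc.to_string());
    }

    /// Drain queued entries in sequence (FIFO) order; returns how many ran.
    fn process_pending_queue(&self) -> usize {
        let pending = std::mem::take(&mut *self.queue.lock().unwrap());
        let processed = pending.len();
        for (_seq, doc) in pending {
            self.upsert_internal(&doc);
        }
        processed
    }
}
```

In this model a queued document is not searchable until the next drain, which matches the delay implied by a polling interval.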
Persistence Layer (fjall)
fjall is an embedded LSM-tree storage engine (a RocksDB/Sled alternative). Aperio uses these fjall keyspaces:
| Keyspace | Purpose |
|---|---|
| _collections | Collection name → CollectionMeta (ID type + searchable fields) |
| _index_queue | Pending index operations (background indexing) |
| {name}.inverted | Inverted index per collection (word → posting lists) |
| {name}.docs | Full JSON documents per collection (id → JSON bytes) |
Configurable fjall options exposed via StoreConfig:
- `write_buffer_size`: memtable size.
- `compression`: "none" or "lz4" for data block compression.
- `block_cache_size`: global block cache for the database.
Configuration (src/config.rs)
Aperio reads an optional TOML config file (located via the CONFIG_FILE environment variable). Parsing is lenient: on a parse error, Aperio logs a warning and falls back to defaults. The AppConfig struct maps one-to-one with StoreConfig fields plus server-level options (block_cache_size, maintenance_threads, log_level, dumps_folder).
The dumps_folder config option sets the directory for backup snapshots. It defaults to None (unset) — if missing, POST /backup/export and POST /backup/import return 400 Bad Request. This prevents accidental file writes when the operator hasn't explicitly configured a dump location.
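A minimal sketch of that guard, assuming the option is carried as an `Option` of a path: `BackupError` and `resolve_dump_path` are hypothetical names, and the real handlers may validate more than this.

```rust
use std::path::{Path, PathBuf};

/// Hypothetical error type; only the 400-style case is modeled here.
#[derive(Debug, PartialEq)]
enum BackupError {
    BadRequest(String),
}

/// Refuse to build a snapshot path unless a dumps folder was configured.
fn resolve_dump_path(
    dumps_folder: Option<&Path>,
    file_name: &str,
) -> Result<PathBuf, BackupError> {
    let folder = dumps_folder
        .ok_or_else(|| BackupError::BadRequest("dumps_folder is not configured".to_string()))?;
    Ok(folder.join(file_name))
}
```

Modeling the setting as `Option` makes the unconfigured case impossible to overlook at the call site, since no default directory exists to fall back on.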
Error Handling (src/error.rs)
All operations return Result<T, AppError>, an enum that maps to appropriate HTTP status codes:
| Error variant | HTTP status |
|---|---|
| NotFound | 404 |
| BadRequest | 400 |
| Internal | 500 |
AppError implements Axum's IntoResponse trait, which renders errors as JSON: {"error": "message"}.
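The variant-to-status mapping can be sketched without Axum. The `String` payloads on each variant, the `status_code` and `to_json` method names, and the hand-rolled escaping are assumptions for this example; the real implementation goes through Axum's response types.

```rust
/// Variant names come from the table; the String payloads are assumed.
#[derive(Debug)]
enum AppError {
    NotFound(String),
    BadRequest(String),
    Internal(String),
}

/// Escape the characters that would break a JSON string literal.
fn escape_json(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        match c {
            '"' => out.push_str("\\\""),
            '\\' => out.push_str("\\\\"),
            '\n' => out.push_str("\\n"),
            c if (c as u32) < 0x20 => out.push_str(&format!("\\u{:04x}", c as u32)),
            c => out.push(c),
        }
    }
    out
}

impl AppError {
    /// HTTP status for each variant, as listed in the table.
    fn status_code(&self) -> u16 {
        match self {
            AppError::NotFound(_) => 404,
            AppError::BadRequest(_) => 400,
            AppError::Internal(_) => 500,
        }
    }

    /// Render the error body in the {"error": "message"} shape.
    fn to_json(&self) -> String {
        let message = match self {
            AppError::NotFound(m) | AppError::BadRequest(m) | AppError::Internal(m) => m,
        };
        format!("{{\"error\": \"{}\"}}", escape_json(message))
    }
}
```

Centralizing the mapping in one type means handlers can use `?` freely and still produce a consistent status and body.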
Data Flow: Document Insertion
Client → POST /collections/{name}/items
→ routes::upsert_item()
→ store.upsert(name, doc)
→ [background active?]
→ Yes: write to _index_queue → return
→ No: lock() → upsert_internal()
→ extract `id` from JSON doc
→ extract searchable field values from JSON doc
→ tokenize combined searchable content (charabia)
→ load old JSON from {name}.docs
→ compute old tokens from old searchable fields
→ remove stale posting list entries
→ add/update posting list entries
→ store full JSON doc in {name}.docs
→ unlock()
Data Flow: Search
Client → GET /collections/{name}/search?q=...
→ routes::search()
→ store.search(name, query, sort, take, after)
→ validate collection exists
→ tokenize query
→ parallel: list shard indices per word
→ sort by rarest word first
→ parallel: load posting lists
→ [string IDs]: sorted merge + membership check
→ [number IDs]: bitmap union + intersection
→ apply after-cursor, sort, limit
→ look up full JSON docs from {name}.docs
→ return Vec<serde_json::Value>