Architecture

System Overview

┌───────────────────────────────────────────────────┐
│                    Client                         │
├───────────────────────────────────────────────────┤
│                   HTTP (port 3000)                │
├───────────────────────────────────────────────────┤
│          Axum Router (src/routes.rs)              │
│     /collections  /search  /suggest  /items       │
├───────────────────────────────────────────────────┤
│            Store Engine (src/store/)              │
│  Inverted Index  ·  FST Vocabulary  ·  ID Strategy│
├───────────────────────────────────────────────────┤
│    Auth (src/auth.rs)  ·  Backup (src/backup.rs)  │
├───────────────────────────────────────────────────┤
│               LMDB Database (heed)                │
│        Databases: meta, queue, docs, inverted     │
│      (collection-prefixed keys, e.g. "col\0key")  │
└───────────────────────────────────────────────────┘

The server has four layers:

HTTP Layer — Axum router exposing REST endpoints.
Auth & Backup — API key authentication (src/auth.rs) and snapshot export/import (src/backup.rs).
Store Engine — Core logic: tokenization, inverted index, FST vocabulary index, search/insert (src/store/ sub-modules).
Persistence Layer — LMDB memory-mapped database for on-disk storage.

HTTP Layer (`src/routes.rs`)

An Axum Router maps endpoints to handler functions that delegate to the Store. All state is shared via Arc<AppState> (the store plus an optional dumps_folder). Every endpoint except /status requires an API key in the Authorization header, enforced by src/auth.rs. There are two tiers of access: the main key has full access, while the search key is restricted to GET …/search.

Method	Path	Handler
`GET`	`/status`	Health check
`GET`	`/collections`	List collections
`POST`	`/collections`	Create collection
`GET`	`/collections/{collection}`	Collection metadata
`DELETE`	`/collections/{collection}`	Delete collection
`POST`	`/collections/{collection}/items`	Upsert document
`DELETE`	`/collections/{collection}/items/{id}`	Delete document
`GET`	`/collections/{collection}/search?q=...`	Search documents
`GET`	`/collections/{collection}/suggest?q=...`	Suggest indexed terms matching a prefix
`POST`	`/backup/export`	Export database snapshot to a file in the dumps folder
`POST`	`/backup/import`	Import a snapshot from the dumps folder
`GET`	`/queue`	Pending index queue depth
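
The two-tier access rule can be sketched as a pair of pure functions. This is an illustration only, not the code in src/auth.rs: the names `Caller`, `classify` and `is_allowed` are hypothetical, the key is assumed to have already been extracted from the Authorization header, and the path is assumed to arrive without its query string.

```rust
/// Which configured key (if any) the caller presented.
/// Hypothetical type for illustration; not taken from src/auth.rs.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Caller {
    MainKey,
    SearchKey,
    Unknown,
}

/// Classify a presented key against the two configured keys.
pub fn classify(presented: Option<&str>, main_key: &str, search_key: &str) -> Caller {
    match presented {
        Some(k) if k == main_key => Caller::MainKey,
        Some(k) if k == search_key => Caller::SearchKey,
        _ => Caller::Unknown,
    }
}

/// Decide whether a request may proceed.
pub fn is_allowed(caller: Caller, method: &str, path: &str) -> bool {
    // /status is the only endpoint that needs no key.
    if path == "/status" {
        return true;
    }
    match caller {
        Caller::MainKey => true,
        // Assumption: a search route is a GET on /collections/{collection}/search.
        Caller::SearchKey => {
            method == "GET" && path.starts_with("/collections/") && path.ends_with("/search")
        }
        Caller::Unknown => false,
    }
}
```

A production check would likely compare keys in constant time; plain equality keeps the sketch short.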

Store Engine (`src/store/`)

The Store struct (in src/store/mod.rs) is the heart of Aperio. It holds:

env: heed::Env — the underlying LMDB environment handle.
db_meta: DbBytes — LMDB database handle for collection metadata (meta).
db_queue: DbBytes — LMDB database handle for the index queue (queue).
db_docs: DbBytes — LMDB database handle for stored JSON documents (docs).
db_inverted: DbBytes — LMDB database handle for the inverted index (inverted).
config: StoreConfig — tunable parameters (shard sizes, token length, index interval).
collections: RwLock<HashMap<String, CollectionMeta>> — in-memory registry of known collections, their ID type and searchable fields.
lock: Mutex<()> — serializes write operations (delete/collection mutation) for index consistency.
next_seq: AtomicU64 — monotonic sequence counter for the indexing queue.

The store logic is split across sub-modules:

src/store/config.rs — StoreConfig, FSTConfig, IdType, CollectionMeta, PostingShard, QueuedIndex.
src/store/fst.rs — per-collection FST vocabulary index (push/pop, consolidation, prefix/fuzzy search).
src/store/posting_list.rs — shard-based posting list operations for both ID strategies.
src/store/search.rs — search execution (intersection, cursor pagination) for string and number IDs.

Tokenization

Document content is tokenized using charabia:

content → tokenize() → filter(is_word) → lemma() → filter(min_token_length)

Tokens are deduplicated into a HashSet<String> before indexing.
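
The shape of this pipeline can be approximated with the standard library alone. The sketch below is a stand-in, not the charabia integration: splitting on non-alphanumeric characters replaces real segmentation, lowercasing replaces lemma(), and it assumes that min_token_length means "drop tokens shorter than this many characters".

```rust
use std::collections::HashSet;

/// Std-only stand-in for the charabia pipeline: split into words,
/// normalize, drop short tokens, deduplicate.
pub fn tokenize(content: &str, min_token_length: usize) -> HashSet<String> {
    content
        // Crude word segmentation (charabia handles scripts and languages properly).
        .split(|c: char| !c.is_alphanumeric())
        // Stand-in for filter(is_word): skip empty fragments between separators.
        .filter(|w| !w.is_empty())
        // Stand-in for lemma(): lowercase only.
        .map(|w| w.to_lowercase())
        // Assumed meaning of min_token_length: minimum length in characters.
        .filter(|t| t.chars().count() >= min_token_length)
        // Collecting into a HashSet removes duplicates.
        .collect()
}
```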

Inverted Index

All collections share a single LMDB database (inverted) for the inverted index. Keys are prefixed with the collection name and a null byte (e.g. "mycol\0hello\0000042" for shard 42 of word "hello" in collection "mycol").
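
A minimal sketch of how such keys could be assembled follows. The function names are hypothetical, and the six-digit zero-padded decimal shard index is an assumption read off the "000042" in the example above; the real encoding may differ.

```rust
/// Prefix shared by every shard of one word: {collection}\0{word}\0.
/// Useful as the start of a range scan over all shards of that word.
pub fn word_prefix(collection: &str, word: &str) -> Vec<u8> {
    let mut prefix = Vec::with_capacity(collection.len() + word.len() + 2);
    prefix.extend_from_slice(collection.as_bytes());
    prefix.push(0);
    prefix.extend_from_slice(word.as_bytes());
    prefix.push(0);
    prefix
}

/// Full inverted-index key: {collection}\0{word}\0{shard_index}.
/// Assumption: the shard index is zero-padded to six decimal digits,
/// so byte-wise key order matches numeric shard order.
pub fn inverted_key(collection: &str, word: &str, shard: u32) -> Vec<u8> {
    let mut key = word_prefix(collection, word);
    key.extend_from_slice(format!("{:06}", shard).as_bytes());
    key
}
```

With a fixed-width index, LMDB's lexicographic key order keeps the shards of a word contiguous and in shard order, which is what makes a prefix scan sufficient to list them.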

Two ID Strategies

Collections are created with an id_type that determines the posting list format:

`id_type`	Storage format	Data structure
`string`	rkyv-archived shards	`PostingShard { ids: Vec<String> }`
`number`	Serialized bitmap shards	`RoaringTreemap` per shard

String IDs

Posting lists are split into shards of at most max_string_shard_size IDs (configurable, default 1000). Each shard stores a sorted Vec<String> archived via rkyv. A binary search across shards locates the correct shard for insertion.

A Vec<u64> would be faster for posting-list operations, but u64 can't represent arbitrary string IDs like UUIDs, so Vec<String> is used as the general-purpose format.
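
The insertion path described above can be sketched with in-memory vectors standing in for the rkyv-archived shards. `ShardedPostings` is a hypothetical type, and splitting an overfull shard in half is an assumed overflow policy that the text does not specify. The sketch expects max_shard_size to be at least 1.

```rust
/// Sorted, sharded posting list for one word (std-only stand-in for the
/// rkyv-archived PostingShard values stored in LMDB).
pub struct ShardedPostings {
    pub shards: Vec<Vec<String>>,
    pub max_shard_size: usize,
}

impl ShardedPostings {
    pub fn new(max_shard_size: usize) -> Self {
        Self { shards: Vec::new(), max_shard_size }
    }

    /// Insert `id`, keeping every shard sorted and the shards in ascending
    /// order. Returns false if the ID was already present.
    pub fn insert(&mut self, id: &str) -> bool {
        if self.shards.is_empty() {
            self.shards.push(vec![id.to_string()]);
            return true;
        }
        // Binary search across shards: first shard whose last ID is >= id.
        let mut s = self
            .shards
            .partition_point(|sh| sh.last().map_or(true, |last| last.as_str() < id));
        if s == self.shards.len() {
            s -= 1; // id sorts after everything: use the final shard
        }
        // Binary search inside the chosen shard.
        match self.shards[s].binary_search_by(|probe| probe.as_str().cmp(id)) {
            Ok(_) => false,
            Err(pos) => {
                self.shards[s].insert(pos, id.to_string());
                if self.shards[s].len() > self.max_shard_size {
                    // Assumed overflow policy: split the shard in half.
                    let mid = self.shards[s].len() / 2;
                    let tail = self.shards[s].split_off(mid);
                    self.shards.insert(s + 1, tail);
                }
                true
            }
        }
    }
}
```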

Number IDs

Posting lists use RoaringTreemap bitmaps, sharded at max_roaring_shard_size (default 100,000). Bitmaps offer compact storage and fast bitwise intersection for multi-term queries.

FST Vocabulary Index (`src/store/fst.rs`)

Each collection has an on-disk Finite State Transducer (FST) that stores the set of all indexed terms. The FST is built with the fst crate and enables term suggestion (GET /collections/{collection}/suggest?q=...).

Because an FST is immutable once built, updates follow the same pending push/pop plus consolidation pattern that Sonic uses for its FST:

When process_pending_queue() indexes a batch of documents, new terms are pushed to an in-memory pending set and removed terms are added to a pop set.
A background task periodically consolidates: it reads the old FST from disk, merges it with pending changes (sorted merge of old FST stream + pending pushes, minus popped words), writes a new FST to a .tmp file, then atomically renames it to the final path.
On restart, the FST is memory-mapped from its .fst file on disk.
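
The merge at the core of consolidation can be sketched without the fst crate. Here a sorted slice stands in for the old FST stream and `BTreeSet`s stand in for the pending sets; `consolidate` is a hypothetical name. The sketch also assumes that a term present in both pending sets is dropped, which the text does not state.

```rust
use std::cmp::Ordering;
use std::collections::BTreeSet;

/// Merge the old (sorted, unique) term list with pending pushes and drop
/// popped terms. The output is sorted and unique, which is the insertion
/// order an FST builder requires.
pub fn consolidate(
    old_terms: &[String],
    pending_push: &BTreeSet<String>,
    pending_pop: &BTreeSet<String>,
) -> Vec<String> {
    let mut out = Vec::with_capacity(old_terms.len() + pending_push.len());
    let mut old = old_terms.iter().peekable();
    let mut new = pending_push.iter().peekable();
    loop {
        // Decide which stream holds the smaller head.
        let ord = match (old.peek(), new.peek()) {
            (None, None) => break,
            (Some(_), None) => Ordering::Less,
            (None, Some(_)) => Ordering::Greater,
            (Some(a), Some(b)) => a.cmp(b),
        };
        // Advance that stream; on a tie advance both to deduplicate.
        let next = match ord {
            Ordering::Less => old.next(),
            Ordering::Greater => new.next(),
            Ordering::Equal => {
                new.next();
                old.next()
            }
        };
        if let Some(term) = next {
            // Assumption: a pop always wins, even over a pending push.
            if !pending_pop.contains(term) {
                out.push(term.clone());
            }
        }
    }
    out
}
```

The result would then be written to the .tmp file and renamed over the old FST, so readers see either the old file or the new one, never a partial write.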

Term suggestion uses fst::automaton::StartsWith (prefix match) for autocomplete-style results. The FST is separate from LMDB and lives at {DATA_DIR}/fst/{collection_name}.fst.

Search Execution

Tokenize the query string.
List shard indices for each token (rarest-first optimization).
Load posting lists: for string IDs, merge shards in a sorted iterative merge; for number IDs, union shard bitmaps per word, then compute the intersection.
Apply sort and pagination: sort by ID ascending or descending, apply optional after cursor, cap at take.
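
The final sort-and-paginate step can be sketched as follows for numeric IDs. `Sort` and `paginate` are hypothetical names, and the sketch assumes the after cursor is exclusive, meaning results start strictly past it in the chosen direction.

```rust
/// Sort direction for result IDs.
#[derive(Clone, Copy)]
pub enum Sort {
    Asc,
    Desc,
}

/// Apply direction, an optional `after` cursor and a `take` cap to
/// matching IDs that are supplied in ascending order.
/// Assumption: the cursor is exclusive.
pub fn paginate(ids_asc: &[u64], sort: Sort, after: Option<u64>, take: usize) -> Vec<u64> {
    match sort {
        Sort::Asc => ids_asc
            .iter()
            .copied()
            .filter(|id| after.map_or(true, |cursor| *id > cursor))
            .take(take)
            .collect(),
        Sort::Desc => ids_asc
            .iter()
            .rev()
            .copied()
            .filter(|id| after.map_or(true, |cursor| *id < cursor))
            .take(take)
            .collect(),
    }
}
```

A client would pass the last ID of one page as the after value of the next request.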

Search: String IDs

For string-ID collections, each shard is an rkyv-archived PostingShard. The engine loads all shards for the rarest word, then iterates through its sorted IDs, checking membership in other words' shards via binary search.
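
A minimal sketch of that intersection, with in-memory vectors standing in for the archived shards. The function names are hypothetical, and "rarest" is approximated here by total ID count, whereas the engine ranks words by their shard indices.

```rust
/// True if `id` occurs in any shard of one word. Shards are sorted and
/// ordered, so two binary searches suffice: one across shards, one inside.
fn contains(shards: &[Vec<String>], id: &str) -> bool {
    let s = shards.partition_point(|sh| sh.last().map_or(true, |last| last.as_str() < id));
    s < shards.len() && shards[s].binary_search_by(|probe| probe.as_str().cmp(id)).is_ok()
}

/// Intersect the posting lists of several words. Each word is a list of
/// shards; the result keeps the ascending order of the rarest word's IDs.
pub fn intersect(mut words: Vec<Vec<Vec<String>>>) -> Vec<String> {
    // Rarest first: the word with the fewest IDs drives the outer loop.
    words.sort_by_key(|shards| shards.iter().map(Vec::len).sum::<usize>());
    let (rarest, rest) = match words.split_first() {
        Some(parts) => parts,
        None => return Vec::new(),
    };
    rarest
        .iter()
        .flatten()
        // Keep an ID only if every other word also contains it.
        .filter(|id| rest.iter().all(|shards| contains(shards, id)))
        .cloned()
        .collect()
}
```

Driving the loop from the rarest word bounds the number of membership checks by the size of the smallest posting list.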

Search: Number IDs

For number-ID collections, each shard is a RoaringTreemap. Per word, all shards are merged with bitwise OR. Words are then intersected with bitwise AND. The resulting bitmap is iterated in ascending or descending order.
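
The same flow can be illustrated with `BTreeSet<u64>` standing in for RoaringTreemap: set union plays the role of bitwise OR and set intersection the role of bitwise AND. The function names are hypothetical, and a real bitmap performs these operations far more compactly.

```rust
use std::collections::BTreeSet;

/// Union all shards of one word (bitwise OR in the bitmap format).
fn union_shards(shards: &[BTreeSet<u64>]) -> BTreeSet<u64> {
    shards.iter().flatten().copied().collect()
}

/// Intersect the per-word unions (bitwise AND in the bitmap format),
/// then emit the hits in the requested order.
pub fn search_numbers(words: &[Vec<BTreeSet<u64>>], descending: bool) -> Vec<u64> {
    let mut unions = words.iter().map(|shards| union_shards(shards));
    let first = match unions.next() {
        Some(u) => u,
        None => return Vec::new(),
    };
    let hits: BTreeSet<u64> =
        unions.fold(first, |acc, u| acc.intersection(&u).copied().collect());
    if descending {
        hits.into_iter().rev().collect()
    } else {
        hits.into_iter().collect()
    }
}
```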

Background Indexing (`spawn_background`)

upsert() always writes to a FIFO queue (queue database) and returns immediately. A tokio::spawn task polls the queue at index_interval (default 900ms) and dispatches process_pending_queue() on Tokio's blocking thread pool via spawn_blocking. Within each batch, items are processed sequentially: tokenization via charabia, stale token removal, posting list updates, and doc storage — all within a single LMDB write transaction.

Routing every write through the queue batches write operations and reduces lock contention. In tests, flush() can be called to drain the queue synchronously.
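
The enqueue-then-batch pattern can be sketched in memory. `IndexQueue` is a hypothetical type: a `BTreeMap` keyed by sequence number stands in for the queue database, the `index` callback stands in for the per-item indexing work, and no Tokio task or LMDB transaction is involved.

```rust
use std::collections::BTreeMap;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

/// In-memory stand-in for the `queue` database: sequence number to payload.
pub struct IndexQueue {
    next_seq: AtomicU64,
    pending: Mutex<BTreeMap<u64, String>>,
}

impl IndexQueue {
    pub fn new() -> Self {
        Self { next_seq: AtomicU64::new(0), pending: Mutex::new(BTreeMap::new()) }
    }

    /// Enqueue and return immediately, as upsert() does.
    pub fn enqueue(&self, doc_json: String) -> u64 {
        let seq = self.next_seq.fetch_add(1, Ordering::SeqCst);
        self.pending.lock().unwrap().insert(seq, doc_json);
        seq
    }

    /// One background tick: take up to `max_batch` entries in FIFO order,
    /// hand each to `index`, and remove it from the queue.
    /// Returns the number of entries processed.
    pub fn process_pending(&self, max_batch: usize, mut index: impl FnMut(u64, &str)) -> usize {
        // Holding the lock for the whole batch loosely mirrors the single
        // write transaction described above.
        let mut pending = self.pending.lock().unwrap();
        let seqs: Vec<u64> = pending.keys().copied().take(max_batch).collect();
        for seq in &seqs {
            if let Some(doc) = pending.remove(seq) {
                index(*seq, &doc);
            }
        }
        seqs.len()
    }

    /// Drain everything synchronously, in the spirit of flush() in tests.
    pub fn flush(&self, max_batch: usize, mut index: impl FnMut(u64, &str)) {
        while self.process_pending(max_batch, &mut index) > 0 {}
    }
}
```

Because sequence numbers are monotonic and the map is ordered, iterating the keys yields FIFO order without a separate list.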

Persistence Layer (LMDB)

LMDB is an embedded memory-mapped database (a key-value store). Aperio uses these LMDB databases via heed:

Database	Purpose
`meta`	Collection name → CollectionMeta (ID type + searchable fields)
`queue`	Pending index operations (background indexing), keyed by sequence number
`docs`	Full JSON documents, keyed by {collection}\0{doc_id}
`inverted`	Inverted index, keyed by {collection}\0{word}\0{shard_index}

Configurable queue options exposed via StoreConfig:

index_interval — interval between background index queue flushes.
max_queue_batch_size — items processed per background tick.

Configuration (`src/config.rs`)

Aperio reads an optional TOML config file (CONFIG_FILE env var). Parsing is strict: on any read or parse error the process panics with a clear message. The AppConfig struct mirrors StoreConfig fields (with index_interval_ms converted to a Duration) plus server-level options (log_level, main_api_key, search_api_key, dumps_folder).

The dumps_folder config option sets the directory for backup snapshots. It defaults to None (unset); while it is unset, POST /backup/export and POST /backup/import return 400 Bad Request. This prevents accidental file writes when the operator hasn't explicitly configured a dump location.

Error Handling (`src/error.rs`)

All operations return Result<T, AppError>, where AppError is an enum whose variants map to HTTP status codes:

Error variant	HTTP status
`NotFound`	404
`BadRequest`	400
`Internal`	500

AppError implements Axum's IntoResponse trait, which renders errors as JSON: {"error": "message"}.
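
The mapping can be sketched without Axum. The variant payloads (a single String message), the `status` and `to_json` helpers, and the `require_dumps_folder` guard are assumptions made for illustration; the real type implements IntoResponse and would normally serialize with a JSON library rather than by hand.

```rust
/// Std-only stand-in for the AppError enum; the String payloads are assumed.
#[derive(Debug)]
pub enum AppError {
    NotFound(String),
    BadRequest(String),
    Internal(String),
}

impl AppError {
    /// HTTP status code for each variant, as listed in the table above.
    pub fn status(&self) -> u16 {
        match self {
            AppError::NotFound(_) => 404,
            AppError::BadRequest(_) => 400,
            AppError::Internal(_) => 500,
        }
    }

    /// Render the body as {"error": "message"}.
    pub fn to_json(&self) -> String {
        let msg = match self {
            AppError::NotFound(m) | AppError::BadRequest(m) | AppError::Internal(m) => m,
        };
        // Minimal escaping (backslash and quote only); control characters
        // are not handled in this sketch.
        let escaped = msg.replace('\\', "\\\\").replace('"', "\\\"");
        format!("{{\"error\": \"{}\"}}", escaped)
    }
}

/// Hypothetical guard for the backup endpoints: an unset dumps_folder
/// becomes a 400 instead of a file write.
pub fn require_dumps_folder(dumps_folder: Option<&str>) -> Result<&str, AppError> {
    dumps_folder.ok_or_else(|| AppError::BadRequest("dumps_folder is not configured".to_string()))
}
```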

Data Flow: Document Insertion

Client → POST /collections/{collection}/items
  → routes::upsert_item()
    → store.upsert(collection, doc)
      → validate collection exists
      → extract `id` from JSON doc
      → allocate sequence number
      → serialize doc to JSON bytes
      → write QueuedIndex entry to `queue` database
      → return immediately

  (background ticker)
    → store.process_pending_queue()
      → read up to max_queue_batch_size entries from `queue`
      → for each entry:
        → deserialize queued document
        → extract searchable content from JSON
        → tokenize (charabia)
        → load old doc from `docs` database
        → compute old tokens for diff
        → remove stale posting list entries from `inverted`
        → store new doc in `docs` database
        → add new posting list entries to `inverted`
        → delete queue entry
      → commit single LMDB write transaction
      → push new terms to FST pending set (per collection)
      → add removed terms to FST pop set (per collection)

  (separate consolidation ticker)
    → fst_pool.consolidate_dirty()
      → for each dirty collection FST:
        → merge old FST + pending pushes − pending pops
        → write to .fst.tmp, atomically rename to .fst

Data Flow: Search

Client → GET /collections/{collection}/search?q=...
  → routes::search()
    → store.search(collection, query, sort, take, after)
      → validate collection exists
      → tokenize query
      → list shard indices per word, sort by rarest first
      → load posting lists (sequential)
      → [string IDs]: sorted merge + membership check
      → [number IDs]: bitmap union + intersection
      → apply after-cursor, sort, limit
      → look up full JSON docs from `docs` database
      → return Vec<serde_json::Value>

Data Flow: Suggest

Client → GET /collections/{collection}/suggest?q=app
  → routes::suggest()
    → store.suggest(collection, prefix, take)
      → validate collection exists
      → fst_pool.suggest_prefix(collection, prefix, take)
        → Str::new(prefix).starts_with() automaton
        → stream results from FST, cap at take
      → return Vec<String>
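
The prefix lookup at the end of this flow can be approximated with a sorted set. This sketch does not use the fst crate: a `BTreeSet` range scan stands in for streaming the FST through a starts-with automaton, and the signature of `suggest_prefix` is assumed.

```rust
use std::collections::BTreeSet;
use std::ops::Bound;

/// Return up to `take` terms that start with `prefix`, in sorted order.
pub fn suggest_prefix(terms: &BTreeSet<String>, prefix: &str, take: usize) -> Vec<String> {
    terms
        // Seek to the first term >= prefix instead of scanning from the start.
        .range::<str, _>((Bound::Included(prefix), Bound::Unbounded))
        // Terms sharing the prefix are contiguous in sorted order.
        .take_while(|term| term.starts_with(prefix))
        .take(take)
        .cloned()
        .collect()
}
```

For a query such as q=app, the scan begins at the first term not less than "app" and stops at the first term that no longer shares the prefix.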

Treat this page as a narrative companion for developers who enjoy reading about low-level engineering, not as operational documentation you would rely on for debugging or performance tuning. If something here contradicts the code, the code wins.
