Database Schema¶

Primary store: PostgreSQL with pgvector extension for similarity search.

Tables (Conceptual)¶

docs_files¶

Tracks file-level metadata for idempotency. | Column | Type | Notes | |--------|------|-------| | doc_id | text | Provider-scoped unique ID | | filename | text | Display/file name | | provider_type | text | e.g. onedrive | | provider_name | text | Instance label | | etag | text | Change token | | last_modified | timestamptz | Source timestamp | | relative_path | text | For local provider | | updated_utc | timestamptz | Tracking |

Unique key often (doc_id, provider_type, provider_name).

docs_chunks¶

Stores chunk embeddings. | Column | Type | Notes | |--------|------|-------| | doc_id | text | FK to docs_files.doc_id (logical) | | chunk_num | int | Sequential chunk index | | filename | text | Original filename | | provider_type | text | Provider type | | provider_name | text | Provider name | | embedding | vector(1536) | OpenAI embedding | | text | text | Chunk content | | metadata | jsonb | Additional attributes | | updated_utc | timestamptz | Upsert time |

Index: IVFFlat (or HNSW future) on embedding with cosine distance.

providers¶

(If present) Tracks enabled providers and last sync time.

ai_provider_settings¶

Stores AI provider configurations (chat models and embedding models). Each provider gets its own row.

Column	Type	Notes
provider_id	text	Primary key - unique model identifier
provider_type	text	'global', 'chat', or 'embedding'
settings	jsonb	Provider-specific configuration
test_status	text	'Untested', 'Passed', or 'Failed'
last_tested_at	timestamptz	Timestamp of last connectivity test
last_test_message	text	Result message from last test
updated_at	timestamptz	Last configuration update

Row types: - provider_type = 'global': Global AI settings (tier assignments, defaults) - provider_type = 'chat': Individual chat model configurations - provider_type = 'embedding': Individual embedding model configurations

The API preloads only assigned providers (those referenced in tier assignments or active embedding).

Sample Vector Index Creation¶

CREATE INDEX IF NOT EXISTS docs_chunks_embedding_idx ON docs_chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists=100);

Tune lists based on dataset size.

Similarity Query (Example)¶

SELECT doc_id, filename, provider_type, provider_name, text, embedding <=> $1 AS distance
FROM docs_chunks
WHERE provider_type = COALESCE($2, provider_type)
ORDER BY embedding <=> $1
LIMIT $3;

Maintenance¶

Task	Command
Analyze	`ANALYZE docs_chunks;`
Vacuum	`VACUUM (VERBOSE, ANALYZE) docs_chunks;`
Reindex vector	Recreate index after large data shifts

Size Estimates¶

Factor	Impact
Chunk size smaller	More rows → more embeddings cost
Higher overlap	More redundancy
Larger models	More vector bytes

Next¶

Vector search logic: Search & RAG
Pipeline: Pipeline