Database Schema¶
Primary store: PostgreSQL with pgvector extension for similarity search.
Tables (Conceptual)¶
docs_files¶
Tracks file-level metadata for idempotency.
| Column | Type | Notes |
|--------|------|-------|
| doc_id | text | Provider-scoped unique ID |
| filename | text | Display/file name |
| provider_type | text | e.g. onedrive |
| provider_name | text | Instance label |
| etag | text | Change token |
| last_modified | timestamptz | Source timestamp |
| relative_path | text | For local provider |
| updated_utc | timestamptz | Tracking |
Unique key often (doc_id, provider_type, provider_name).
docs_chunks¶
Stores chunk embeddings. | Column | Type | Notes | |--------|------|-------| | doc_id | text | FK to docs_files.doc_id (logical) | | chunk_num | int | Sequential chunk index | | filename | text | Original filename | | provider_type | text | Provider type | | provider_name | text | Provider name | | embedding | vector(1536) | OpenAI embedding | | text | text | Chunk content | | metadata | jsonb | Additional attributes | | updated_utc | timestamptz | Upsert time |
Index: IVFFlat (or HNSW future) on embedding with cosine distance.
providers¶
(If present) Tracks enabled providers and last sync time.
ai_provider_settings¶
Stores AI provider configurations (chat models and embedding models). Each provider gets its own row.
| Column | Type | Notes |
|---|---|---|
| provider_id | text | Primary key - unique model identifier |
| provider_type | text | 'global', 'chat', or 'embedding' |
| settings | jsonb | Provider-specific configuration |
| test_status | text | 'Untested', 'Passed', or 'Failed' |
| last_tested_at | timestamptz | Timestamp of last connectivity test |
| last_test_message | text | Result message from last test |
| updated_at | timestamptz | Last configuration update |
Row types:
- provider_type = 'global': Global AI settings (tier assignments, defaults)
- provider_type = 'chat': Individual chat model configurations
- provider_type = 'embedding': Individual embedding model configurations
The API preloads only assigned providers (those referenced in tier assignments or active embedding).
Sample Vector Index Creation¶
CREATE INDEX IF NOT EXISTS docs_chunks_embedding_idx ON docs_chunks USING ivfflat (embedding vector_cosine_ops) WITH (lists=100);
Tune lists based on dataset size.
Similarity Query (Example)¶
SELECT doc_id, filename, provider_type, provider_name, text, embedding <=> $1 AS distance
FROM docs_chunks
WHERE provider_type = COALESCE($2, provider_type)
ORDER BY embedding <=> $1
LIMIT $3;
Maintenance¶
| Task | Command |
|---|---|
| Analyze | ANALYZE docs_chunks; |
| Vacuum | VACUUM (VERBOSE, ANALYZE) docs_chunks; |
| Reindex vector | Recreate index after large data shifts |
Size Estimates¶
| Factor | Impact |
|---|---|
| Chunk size smaller | More rows → more embeddings cost |
| Higher overlap | More redundancy |
| Larger models | More vector bytes |
Next¶
- Vector search logic: Search & RAG
- Pipeline: Pipeline