Architecture

DocDuck comprises two primary runtime components plus shared libraries:

+----------------+        +-----------------+        +-----------------------+
|  Providers     |  --->  | Indexer Service |  --->  | PostgreSQL + pgvector |
+----------------+        +-----------------+        +-----------------------+
                                                            ^        |
                                                            |        v
                                                       +------------------+
                                                       |   Query API      |
                                                       +------------------+

Components

| Component | Responsibility |
| --- | --- |
| Indexer | Periodic document ingestion, embedding generation, DB upsert |
| Query API | Semantic search and answer/chat generation |
| Providers Shared | Common configuration and provider abstractions |
| PostgreSQL + pgvector | Durable storage and similarity search |

Indexer Internals

- `MultiProviderIndexerService` orchestrates the full run
- `IDocumentProvider` implementations supply documents
- `TextExtractionService` selects an extractor by file extension
- `TextChunker` splits text into chunks
- `ModelAgnosticAiService` handles embedding generation
- `VectorRepository` handles persistence and idempotency
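To make the chunking step concrete, here is a minimal sketch of fixed-size chunking with overlap. It is written in Python for brevity and is not DocDuck's `TextChunker`: the function name `chunk_text`, the character-based sizing, and the default values are all assumptions made for illustration.

```python
def chunk_text(text: str, size: int = 1000, overlap: int = 200) -> list[str]:
    """Split text into windows of at most `size` characters.

    Consecutive windows share `overlap` characters, so text near a chunk
    boundary appears in both neighbouring chunks.
    """
    if size <= 0:
        raise ValueError("size must be positive")
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")

    step = size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # this window already reaches the end of the text
    return chunks
```

A real chunker may well count tokens rather than characters, or prefer to break on sentence or paragraph boundaries; the overlap idea stays the same.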

Query API Internals

- Minimal ASP.NET Core host (`src/Api/Program.cs`)
- `VectorSearchService` performs vector similarity queries
- `ModelAgnosticAiService` provides multi-tier chat completion and embeddings
- `ChatService` manages conversation state and streaming updates with intelligent refinement
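For orientation, a vector similarity query against pgvector generally has the shape sketched below. This is not taken from `VectorSearchService`: the table name `chunks`, the columns `content`, `source`, and `embedding`, and the choice of the cosine-distance operator `<=>` are assumptions. pgvector also offers `<->` (L2 distance) and `<#>` (negative inner product), and the source does not say which one DocDuck uses.

```python
def to_vector_literal(embedding: list[float]) -> str:
    """Format a list of floats as a pgvector text literal, e.g. '[0.1,2.0]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


# Hypothetical schema: chunks(content text, source text, embedding vector(N)).
# `%s` placeholders follow the psycopg parameter style.
TOP_K_SQL = """
SELECT content, source
FROM chunks
ORDER BY embedding <=> %s::vector
LIMIT %s
"""


def build_top_k_query(embedding: list[float], k: int) -> tuple[str, tuple]:
    """Return the SQL text and its parameters for a top-K nearest-chunk query."""
    return TOP_K_SQL, (to_vector_literal(embedding), k)
```

Ordering by the distance expression together with a `LIMIT` is the query form that pgvector's approximate indexes are designed to accelerate.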

Data Flow (Detailed)

Indexing:

1. Provider enumerates documents and yields their metadata
2. Repository checks whether the document is already indexed (ETag comparison)
3. Download and extract text
4. Chunk text (size, overlap)
5. Embed each chunk (batched)
6. Upsert chunks and update file tracking
7. Clean up orphaned records
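The indexing steps above can be sketched as a single loop. This is an in-memory illustration in Python, not the project's implementation: `DocMeta`, `InMemoryRepository`, `run_index`, and the callables passed in are hypothetical stand-ins for the provider, extractor, chunker, embedding service, and `VectorRepository`.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class DocMeta:
    doc_id: str
    etag: str


@dataclass
class InMemoryRepository:
    etags: dict = field(default_factory=dict)   # doc_id -> last indexed ETag
    chunks: dict = field(default_factory=dict)  # doc_id -> [(text, vector), ...]

    def is_current(self, meta: DocMeta) -> bool:
        return self.etags.get(meta.doc_id) == meta.etag

    def upsert(self, meta: DocMeta, rows: list) -> None:
        # Replacing all rows for the document keeps re-runs idempotent.
        self.chunks[meta.doc_id] = rows
        self.etags[meta.doc_id] = meta.etag

    def remove_orphans(self, live_ids) -> set:
        orphans = set(self.etags) - set(live_ids)
        for doc_id in orphans:
            del self.etags[doc_id]
            self.chunks.pop(doc_id, None)
        return orphans


def run_index(docs, fetch_text, chunk, embed_batch, repo) -> list:
    """Index every new or changed document; return the ids that were (re)indexed."""
    indexed, live_ids = [], []
    for meta in docs:                      # 1. enumerate metadata
        live_ids.append(meta.doc_id)
        if repo.is_current(meta):          # 2. unchanged ETag: skip
            continue
        text = fetch_text(meta)            # 3. download + extract
        pieces = chunk(text)               # 4. chunk
        vectors = embed_batch(pieces)      # 5. embed in one batch
        repo.upsert(meta, list(zip(pieces, vectors)))  # 6. upsert + tracking
        indexed.append(meta.doc_id)
    repo.remove_orphans(live_ids)          # 7. drop documents no longer provided
    return indexed
```

Because unchanged documents are skipped by ETag, running the loop twice over the same input does no embedding work the second time.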

Query:

1. Embed question
2. Vector similarity search (top-K)
3. Compose context
4. Generate answer (OpenAI)
5. Return answer + source citations
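The query path can be illustrated in the same spirit. The sketch below ranks chunks in memory by cosine similarity instead of querying pgvector, and takes the embedding and generation functions as parameters; the names `top_k`, `compose_prompt`, and `answer`, the prompt wording, and the returned dictionary shape are assumptions, not the Query API's actual contract.

```python
import math


def cosine_similarity(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, rows, k):
    """rows: list of (text, source, vector); return the k most similar rows."""
    ranked = sorted(rows, key=lambda r: cosine_similarity(query_vec, r[2]), reverse=True)
    return ranked[:k]


def compose_prompt(question, hits) -> str:
    # Number each chunk so the answer can cite it.
    context = "\n\n".join(
        f"[{i}] {text}" for i, (text, _source, _vec) in enumerate(hits, start=1)
    )
    return f"Answer using only the context below.\n\n{context}\n\nQuestion: {question}"


def answer(question, embed, rows, generate, k=3) -> dict:
    hits = top_k(embed(question), rows, k)            # steps 1-2
    reply = generate(compose_prompt(question, hits))  # steps 3-4
    return {"answer": reply, "sources": [src for _t, src, _v in hits]}  # step 5
```

Passing `embed` and `generate` in as callables keeps the flow independent of any particular model provider.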

Extensibility Points

| Area | Mechanism |
| --- | --- |
| New Provider | Implement `IDocumentProvider` |
| New Extractor | Implement `ITextExtractor` and register it with DI |
| Embedding Model | Add a client service; adjust the vector dimension and schema |
| Search Strategy | Modify `VectorSearchService` (reranking / filters) |
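As an illustration of the extractor extension point, the sketch below shows one way a service could pick an extractor by file extension from a set of registered implementations. It is a Python analogy, not the project's `ITextExtractor` contract or its DI wiring: `TextExtractor`, `PlainTextExtractor`, `ExtractorRegistry`, and the `extensions` attribute are hypothetical.

```python
from pathlib import PurePath
from typing import Optional, Protocol


class TextExtractor(Protocol):
    extensions: tuple  # file extensions this extractor handles, e.g. (".txt",)

    def extract(self, data: bytes) -> str: ...


class PlainTextExtractor:
    extensions = (".txt", ".md")

    def extract(self, data: bytes) -> str:
        return data.decode("utf-8", errors="replace")


class ExtractorRegistry:
    """Maps lower-cased file extensions to the extractor registered for them."""

    def __init__(self, extractors):
        self._by_ext = {}
        for extractor in extractors:
            for ext in extractor.extensions:
                self._by_ext[ext.lower()] = extractor

    def for_file(self, filename: str) -> Optional[TextExtractor]:
        ext = PurePath(filename).suffix.lower()
        return self._by_ext.get(ext)  # None when no extractor is registered
```

With this shape, supporting a new format means writing one class and adding it to the list handed to the registry, which is the role DI registration plays in the table above.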

Non-Goals (Current Version)

- Multi-tenant database isolation
- Complex ACL enforcement
- Built-in auth for public query endpoints (planned)

Diagrams

Future enhancement: richer UML sequence diagrams.

Next

- Pipeline deep dive: Pipeline
- Provider details: Provider Framework