Architecture
DocDuck comprises two primary runtime components plus shared libraries:

    +----------------+      +-----------------+      +-----------------------+
    |   Providers    | ---> | Indexer Service | ---> | PostgreSQL + pgvector |
    +----------------+      +-----------------+      +-----------------------+
                                                             ^        |
                                                             |        v
                                                        +------------------+
                                                        |    Query API     |
                                                        +------------------+

Components
| Component | Responsibility |
|---|---|
| Indexer | Periodic ingestion of documents, embedding generation, DB upsert |
| Query API | Semantic search & answer/chat generation |
| Providers Shared | Common configuration / provider abstractions |
| PostgreSQL + pgvector | Durable storage + similarity search |
Indexer Internals
- `MultiProviderIndexerService` orchestrates the full run
- `IDocumentProvider` implementations supply documents
- `TextExtractionService` selects extractor by extension
- `TextChunker` splits text
- `ModelAgnosticAiService` handles embedding generation
- `VectorRepository` handles persistence & idempotency
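
To make "selects extractor by extension" concrete, here is a minimal sketch. It is written in Python for brevity rather than as DocDuck's own .NET code, and every name in it (`TextExtractor`, `PlainTextExtractor`, `MarkdownExtractor`, `select_extractor`) is hypothetical, chosen only for illustration.

```python
from __future__ import annotations

from pathlib import PurePosixPath
from typing import List, Optional, Protocol, Tuple


class TextExtractor(Protocol):
    """Hypothetical extractor shape; not the project's actual interface."""

    extensions: Tuple[str, ...]

    def extract(self, data: bytes) -> str: ...


class PlainTextExtractor:
    extensions = (".txt", ".log")

    def extract(self, data: bytes) -> str:
        return data.decode("utf-8", errors="replace")


class MarkdownExtractor:
    extensions = (".md",)

    def extract(self, data: bytes) -> str:
        # A real extractor might strip markup; this sketch only decodes.
        return data.decode("utf-8", errors="replace")


def select_extractor(
    filename: str, extractors: List[TextExtractor]
) -> Optional[TextExtractor]:
    """Return the first extractor registered for the file's extension, if any."""
    ext = PurePosixPath(filename).suffix.lower()
    for extractor in extractors:
        if ext in extractor.extensions:
            return extractor
    return None  # unsupported file type; the caller decides whether to skip
```

Matching on the lower-cased suffix keeps the lookup case-insensitive, so `README.MD` and `readme.md` resolve to the same extractor.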
Query API Internals
- Minimal ASP.NET Core host (`src/Api/Program.cs`)
- `VectorSearchService` performs vector similarity queries
- `ModelAgnosticAiService` provides multi-tier chat completion and embeddings
- `ChatService` manages conversation state/streaming updates with intelligent refinement
Data Flow (Detailed)
- Provider enumerates docs → doc metadata
- Repository checks if doc already indexed (ETag)
- Download + extract text
- Chunk text (size, overlap)
- Embed each chunk (batched)
- Upsert chunks + update file tracking
- Cleanup orphaned records
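
The indexing steps above can be sketched as one pass over the provider's documents. This is an illustrative Python sketch, not the DocDuck implementation. The in-memory `InMemoryStore` stands in for the repository, `fake_embed` is a placeholder for a real embedding call, and chunking by characters is an assumption (the source gives only "size, overlap"). All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    doc_id: str
    etag: str
    text: str  # stands in for the "download + extract text" step


@dataclass
class InMemoryStore:
    """Hypothetical stand-in for the repository and file-tracking table."""
    etags: dict = field(default_factory=dict)   # doc_id -> etag
    chunks: dict = field(default_factory=dict)  # doc_id -> [(chunk, vector)]


def chunk_text(text: str, size: int, overlap: int) -> list:
    """Split text into windows of `size` chars that share `overlap` chars."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    if not text:
        return []
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]


def fake_embed(chunks: list) -> list:
    """Placeholder embedding: one tiny vector per chunk (NOT a real model)."""
    return [[float(len(c))] for c in chunks]


def index_documents(docs: list, store: InMemoryStore,
                    size: int = 200, overlap: int = 20) -> list:
    """Run one indexing pass and return the ids that were (re)indexed."""
    indexed = []
    for doc in docs:
        if store.etags.get(doc.doc_id) == doc.etag:
            continue  # ETag unchanged since the last run: skip
        chunks = chunk_text(doc.text, size, overlap)
        vectors = fake_embed(chunks)  # one batched call per document
        store.chunks[doc.doc_id] = list(zip(chunks, vectors))  # upsert
        store.etags[doc.doc_id] = doc.etag  # update file tracking
        indexed.append(doc.doc_id)
    # Cleanup: drop records for documents the provider no longer lists.
    live = {d.doc_id for d in docs}
    for orphan in set(store.etags) - live:
        store.etags.pop(orphan)
        store.chunks.pop(orphan, None)
    return indexed
```

Running the same pass twice with unchanged ETags indexes nothing the second time, which is the idempotency property the ETag check is there to provide.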
Query:
1. Embed question
2. Vector similarity search (top-K)
3. Compose context
4. Generate answer (OpenAI)
5. Return answer + source citations
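
The retrieval part of the query path (similarity search, context composition, and citation collection) can be sketched in memory as follows. This is a Python illustration under stated assumptions: cosine similarity is assumed as the metric, the database search is replaced by a plain sort, and `top_k` and `compose_context` are hypothetical names. Embedding the question and generating the answer are left out because they depend on the model client.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec, rows, k=3):
    """rows: list of (source, chunk_text, vector). Return the k most similar rows."""
    ranked = sorted(
        rows, key=lambda r: cosine_similarity(query_vec, r[2]), reverse=True
    )
    return ranked[:k]


def compose_context(hits):
    """Number the retrieved chunks for the prompt and collect their sources."""
    context = "\n\n".join(
        f"[{i + 1}] {text}" for i, (_, text, _) in enumerate(hits)
    )
    sources = [source for source, _, _ in hits]
    return context, sources
```

Numbering the chunks in the context gives the answer-generation step a stable way to refer back to the sources returned alongside the answer.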
Extensibility Points
| Area | Mechanism |
|---|---|
| New Provider | Implement `IDocumentProvider` |
| New Extractor | Implement `ITextExtractor` and register it with DI |
| Embedding Model | Add client service + adjust vector dimension & schema |
| Search Strategy | Modify `VectorSearchService` (reranking / filters) |
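
As an illustration of the "New Provider" row, the sketch below shows what a provider abstraction and a toy implementation might look like. It is written in Python, not the project's .NET code, and the member names (`list_documents`, `download`, `name`) are assumptions: the actual `IDocumentProvider` members are not listed on this page and may differ.

```python
import hashlib
from dataclasses import dataclass
from typing import Iterable, Protocol


@dataclass(frozen=True)
class DocumentInfo:
    doc_id: str
    etag: str


class DocumentProvider(Protocol):
    """Hypothetical shape; the real IDocumentProvider members may differ."""

    name: str

    def list_documents(self) -> Iterable[DocumentInfo]: ...

    def download(self, doc_id: str) -> bytes: ...


class InMemoryProvider:
    """Toy provider backed by a dict of id -> bytes."""

    name = "memory"

    def __init__(self, files: dict):
        self._files = dict(files)

    def list_documents(self):
        for doc_id, data in sorted(self._files.items()):
            # A content hash serves as the ETag so unchanged files can be skipped.
            yield DocumentInfo(doc_id, hashlib.sha256(data).hexdigest())

    def download(self, doc_id: str) -> bytes:
        return self._files[doc_id]
```

Keeping enumeration (metadata plus ETag) separate from download lets an indexer skip unchanged documents without fetching their contents.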
Non-Goals (Current Version)
- Multi-tenant database isolation
- Complex ACL enforcement
- Built-in auth for public query endpoints (planned)
Diagrams
Future enhancement: richer UML sequence diagrams.
Next
- Pipeline deep dive: Pipeline
- Provider details: Provider Framework