AI Agent Context
This document is a concise reference for LLM-based automation or code-assist tools interacting with DocDuck.
Core Components
| Concern |
Location |
Key Types |
| Index Orchestration |
src/Indexer/MultiProviderIndexerService.cs |
MultiProviderIndexerService |
| Provider Abstraction |
src/Indexer/Providers/IDocumentProvider.cs |
IDocumentProvider, ProviderDocument |
| Text Extraction |
src/Indexer/Services/TextExtraction |
TextExtractionService, ITextExtractor |
| Embeddings & AI |
src/Providers.Shared/Ai/ModelAgnosticAiService.cs |
ModelAgnosticAiService |
| Storage |
src/Indexer/Services/VectorRepository.cs |
VectorRepository |
| Query API |
src/Api/Program.cs |
Minimal API endpoints |
| Search Logic |
src/Api/Services/VectorSearchService.cs |
VectorSearchService |
| Chat Orchestration |
src/Api/Services/ChatService.cs |
ChatService |
Endpoint Contract Summary
| Endpoint |
Method |
Purpose |
/health |
GET |
Health & counts |
/providers |
GET |
Active providers list |
/query |
POST |
Q&A over indexed chunks |
/docsearch |
POST |
Document-level search |
/chat |
POST |
Conversational interface (SSE optional) |
Query Request Fields
question (string) – required for /query
topK (int?) – optional override
providerType / providerName – optional filters (AND semantics)
Chat Request Fields
message (required)
history (role/content pairs)
streamSteps (bool) to enable SSE streaming
Indexing Flow Hooks
- Provider enumeration (
ListDocumentsAsync)
- Skip check (
IsDocumentIndexedAsync via VectorRepository)
- Extraction chooses extractor by file extension
- Chunking yields
(chunk_num, char_start, char_end)
- Batch embedding returned as vector(float[1536])
- Upsert with metadata JSON (includes provider, etag)
Configuration Hotspots
- Environment variable ingestion in
src/Api/Program.cs and src/Indexer/Program.cs
- Provider enabling:
PROVIDER_<TYPE>_ENABLED
- Chunk tuning:
CHUNK_SIZE, CHUNK_OVERLAP
- Force reindex:
FORCE_FULL_REINDEX
Adding a Provider (Minimal Outline)
- Implement
IDocumentProvider
- Provide constructor DI registration
- Provide env parsing for enable flag & options
- Ensure
ProviderType stable string
- Supply ETag logic for incremental indexing
Safety / Idempotency Assumptions
- ETag stable across identical content versions
- Chunk numbering deterministic with same text & chunk settings
- Upserts overwrite embeddings of changed chunks only
Potential Extension Points
| Area |
Strategy |
| Alternative Embeddings |
Add second client & model field in metadata |
| Reranking |
Post-process search results before answer synthesis |
| Access Control |
Add provider-level ACL filter in search query |
| Caching |
Cache question embedding & top-K hits by hash |
Common Pitfalls
- Missing DB or OpenAI key aborts startup (fail-fast)
- Unsupported file extension triggers
NotSupportedException in extraction
- Large PDF may require memory/duration safeguards
Non-Goals (Current)
- Fine-grained auth per query
- Multi-tenant isolation across databases
- Real-time file change push (polling-only indexing)
Reference Next