# Pipeline

A detailed look at the indexing pipeline driven by MultiProviderIndexerService.

## Steps
| Step | Purpose | Key Methods |
|---|---|---|
| Enumerate Providers | Gather enabled providers | ProviderCatalog.GetProvidersAsync |
| Register Provider | Track presence & sync time | VectorRepository.RegisterProviderAsync |
| List Documents | Get candidate docs | IDocumentProvider.ListDocumentsAsync |
| Skip Unchanged | Avoid reprocessing | VectorRepository.IsDocumentIndexedAsync |
| Download | Stream file contents | IDocumentProvider.DownloadDocumentAsync |
| Extract | Produce plain text | TextExtractionService.ExtractTextAsync |
| Chunk | Segment into overlapping units | TextChunker.Chunk |
| Embed | Generate embedding vectors in batches | OpenAiEmbeddingsClient.EmbedBatchedAsync |
| Upsert | Persist chunks & metadata | VectorRepository.InsertChunksAsync |
| Track File | Store ETag & timestamps | VectorRepository.UpdateFileTrackingAsync |
| Cleanup Orphans | Remove chunks for documents no longer present at the provider | VectorRepository.CleanupOrphanedDocumentsAsync |
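
The steps above compose roughly as follows. This is a minimal sketch: only the method names come from the table; parameter lists, return types, and property names (provider.Id, doc.ETag, and so on) are assumptions for illustration.

```csharp
// Sketch only: signatures and property names are assumed, not copied from the codebase.
public async Task RunAsync(CancellationToken ct)
{
    foreach (var provider in await catalog.GetProvidersAsync(ct))
    {
        // Register Provider: record presence and last sync time.
        await repository.RegisterProviderAsync(provider.Id, DateTimeOffset.UtcNow, ct);

        foreach (var doc in await provider.ListDocumentsAsync(ct))
        {
            // Skip Unchanged: an identical stored ETag means the chunks are still current.
            if (await repository.IsDocumentIndexedAsync(provider.Id, doc.Id, doc.ETag, ct))
                continue;

            // Download -> Extract -> Chunk -> Embed -> Upsert -> Track File.
            await using var stream = await provider.DownloadDocumentAsync(doc.Id, ct);
            var text    = await textExtraction.ExtractTextAsync(stream, doc.Name, ct);
            var chunks  = chunker.Chunk(text);                      // overlapping segments
            var vectors = await embeddings.EmbedBatchedAsync(chunks, ct);

            await repository.InsertChunksAsync(provider.Id, doc.Id, chunks, vectors, ct);
            await repository.UpdateFileTrackingAsync(provider.Id, doc.Id, doc.ETag, DateTimeOffset.UtcNow, ct);
        }

        // Cleanup Orphans: drop chunks whose source documents no longer exist at the provider.
        await repository.CleanupOrphanedDocumentsAsync(provider.Id, ct);
    }
}
```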
## Idempotency

- If a document's current ETag matches the stored one, it is skipped (see the sketch below).
- A forced full reindex deletes the provider's entire scope before reprocessing.
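
As a sketch of those two rules (the forceReindex flag and DeleteProviderScopeAsync are hypothetical names used only to illustrate the order of operations):

```csharp
// Hypothetical sketch; the flag and the delete method name are illustrative.
if (forceReindex)
{
    // Force full reindex: wipe everything stored for this provider before re-ingesting.
    await repository.DeleteProviderScopeAsync(provider.Id, ct);
}

// Otherwise, per document: a matching stored ETag means nothing changed, so skip it.
bool unchanged = await repository.IsDocumentIndexedAsync(provider.Id, doc.Id, doc.ETag, ct);
if (unchanged)
    continue;
```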
## Metadata JSON

Each chunk stores JSON metadata (document id, provider, ETag, path, chunk position), useful for future filtering.
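
For illustration, the metadata could be produced roughly like this with System.Text.Json; the property names are assumptions, not the exact keys the service writes.

```csharp
using System.Text.Json;

// Illustrative shape only; real key names and values may differ.
var metadata = new
{
    documentId = doc.Id,
    provider   = provider.Id,
    etag       = doc.ETag,
    path       = doc.Path,
    chunkIndex = index          // position of this chunk within the document
};

string json = JsonSerializer.Serialize(metadata);
// Stored alongside the vector so later queries can filter by provider, path, and so on.
```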
## Error Handling

- A per-file try/catch logs the failure and continues with the next document.
- A global catch returns a non-zero exit code (see the sketch after this list).
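
A sketch of that structure (IndexDocumentAsync is a hypothetical helper wrapping the per-document steps):

```csharp
try
{
    foreach (var doc in await provider.ListDocumentsAsync(ct))
    {
        try
        {
            await IndexDocumentAsync(provider, doc, ct); // hypothetical helper
        }
        catch (Exception ex)
        {
            // Per-file failure: log and continue so one bad document cannot stall the run.
            logger.LogError(ex, "Failed to index document {DocumentId}", doc.Id);
        }
    }
}
catch (Exception ex)
{
    // Unrecoverable failure: report a non-zero exit code so the job run is marked failed.
    logger.LogCritical(ex, "Indexing run aborted");
    Environment.ExitCode = 1;
}
```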
## Cancellation

- Cooperative via CancellationToken
- SIGTERM maps to graceful cancellation (Kubernetes jobs); see the sketch below
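
One way to wire this up, assuming .NET 6+ and a top-level entry point; `indexer` stands in for the service instance:

```csharp
using System.Runtime.InteropServices;

// Translate SIGTERM (sent by Kubernetes on pod shutdown) into cooperative cancellation.
using var cts = new CancellationTokenSource();
using var sigterm = PosixSignalRegistration.Register(PosixSignal.SIGTERM, context =>
{
    context.Cancel = true;   // suppress the default immediate termination
    cts.Cancel();            // let the pipeline unwind via its CancellationToken
});

await indexer.RunAsync(cts.Token);
```

Under the Generic Host, the host's built-in shutdown handling would typically supply the same token.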
## Performance Considerations

| Lever | Impact | Tradeoff |
|---|---|---|
| Batch size | Larger batches mean fewer HTTP calls | Memory use & rate-limit risk |
| Chunk size | Larger chunks mean fewer embeddings | Less granular retrieval |
| Overlap | More overlap gives better context continuity | More tokens to embed (higher cost) |
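
These levers typically surface as configuration. A hypothetical options type, with names and defaults chosen only for illustration:

```csharp
// Illustrative only; the real option names and defaults may differ.
public sealed record IndexingOptions
{
    public int EmbeddingBatchSize { get; init; } = 64;   // larger: fewer HTTP calls, more memory and rate-limit pressure
    public int ChunkSizeTokens    { get; init; } = 512;  // larger: fewer embeddings, coarser retrieval
    public int ChunkOverlapTokens { get; init; } = 64;   // larger: better continuity, more tokens to embed
}
```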
## Future Enhancements

- Parallel provider processing
- Adaptive chunk sizing per file type
- Retries with exponential backoff on transient embedding failures (sketched below)
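
The retry idea could look roughly like the following plain loop with no extra dependencies; the transient-error filter and the return type of EmbedBatchedAsync are assumptions.

```csharp
// Sketch of the proposed exponential backoff around the embedding call.
async Task<IReadOnlyList<float[]>> EmbedWithRetryAsync(
    IReadOnlyList<string> batch, CancellationToken ct, int maxAttempts = 4)
{
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await embeddings.EmbedBatchedAsync(batch, ct);
        }
        catch (HttpRequestException) when (attempt < maxAttempts)
        {
            // Back off 1s, 2s, 4s, ... before retrying what is assumed to be a transient failure.
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)), ct);
        }
    }
}
```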
## Next
- RAG internals: Search & RAG