Indexer Operation
The indexer aggregates documents from all enabled providers and maintains an up-to-date vector store.
Flow
List docs → Filter unchanged (ETag) → Download → Extract → Chunk → Embed → Upsert → Clean up orphaned entries
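The filter and cleanup ends of this flow can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `RemoteDoc`, `plan_run`, and the `stored_etags` mapping are hypothetical names, not part of the indexer's actual code, and the real implementation may track state differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RemoteDoc:
    """Hypothetical listing entry returned by a provider."""
    doc_id: str
    etag: str


def plan_run(remote, stored_etags, force=False):
    """Split a provider listing into docs to process and stored ids to clean up.

    remote       -- list of RemoteDoc from the "List docs" step
    stored_etags -- assumed mapping of doc_id -> ETag recorded at the last run
    force        -- when True, process everything regardless of ETag
    """
    # A doc is processed if it is new, its ETag changed, or a full reindex is forced.
    to_process = [
        d for d in remote
        if force or stored_etags.get(d.doc_id) != d.etag
    ]
    # Anything stored but no longer listed by the provider is an orphan.
    remote_ids = {d.doc_id for d in remote}
    orphaned = sorted(set(stored_etags) - remote_ids)
    return to_process, orphaned


remote = [RemoteDoc("a", "v1"), RemoteDoc("b", "v2"), RemoteDoc("c", "v1")]
stored = {"a": "v1", "b": "v1", "d": "v9"}

to_process, orphaned = plan_run(remote, stored)
print([d.doc_id for d in to_process])  # ['b', 'c']
print(orphaned)                        # ['d']
```

Here `a` is skipped because its ETag matches, `b` changed, `c` is new, and `d` no longer exists at the source.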
Exit Codes
| Code | Meaning |
|---|---|
| 0 | Success (≥1 file processed) |
| 1 | Error or nothing processed |
| 130 | Cancelled (SIGTERM/SIGINT) |
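As a rough sketch, the table above could be expressed as a small Python function. The name `exit_code` and its parameters are hypothetical. The source does not say how per-file failures affect the exit code, so this sketch only distinguishes a run-level error (`fatal_error`) from the number of files processed.

```python
def exit_code(processed: int, fatal_error: bool = False, cancelled: bool = False) -> int:
    """Map a run outcome to the documented exit codes (illustrative only)."""
    if cancelled:
        # SIGTERM/SIGINT received during the run.
        return 130
    if fatal_error or processed < 1:
        # Run-level error, or nothing was processed.
        return 1
    # At least one file was processed.
    return 0


print(exit_code(processed=3))                  # 0
print(exit_code(processed=0))                  # 1
print(exit_code(processed=3, cancelled=True))  # 130
```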
Key Behaviors
- Idempotent: Skips documents whose ETag is unchanged since the last run
- Force Reindex: Set FORCE_FULL_REINDEX=true
- Orphan Cleanup: Removes DB entries for documents that no longer exist at the source, when enabled
- Batch Embeddings: Batch size is controlled by EMBED_BATCH_SIZE
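To illustrate how the two environment variables might be consumed, here is a minimal Python sketch. The helper names (`force_full_reindex`, `embed_batch_size`, `batches`) and the default batch size of 16 are assumptions made for the example. The indexer's real parsing rules and defaults are not stated in this document.

```python
import os


def force_full_reindex() -> bool:
    """Read FORCE_FULL_REINDEX; only the literal 'true' (any case) enables it here."""
    return os.environ.get("FORCE_FULL_REINDEX", "").strip().lower() == "true"


def embed_batch_size(default: int = 16) -> int:
    """Read EMBED_BATCH_SIZE; the default of 16 is an assumption for this sketch."""
    return max(1, int(os.environ.get("EMBED_BATCH_SIZE", default)))


def batches(items, size):
    """Yield consecutive slices of at most `size` items."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


chunks = ["chunk-0", "chunk-1", "chunk-2", "chunk-3", "chunk-4"]
for batch in batches(chunks, 2):
    # Each batch would be sent to the embedding provider in a single request.
    print(batch)
```

A smaller batch size means more requests but less memory held per request, which is the trade-off behind the memory-pressure tip further down.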
Scheduling
Kubernetes CronJob example:
    apiVersion: batch/v1
    kind: CronJob
    metadata:
      name: docduck-indexer
    spec:
      schedule: "0 */6 * * *"
      jobTemplate:
        spec:
          template:
            spec:
              containers:
                - name: indexer
                  image: your-registry/docduck-indexer:latest
                  env: # supply provider & DB vars
              restartPolicy: OnFailure
Operational Tips
| Scenario | Recommendation |
|---|---|
| High churn folder | Run indexer more frequently |
| Large initial ingest | Temporarily raise batch size & CPU limits |
| Memory pressure | Reduce EMBED_BATCH_SIZE |
Logs
Structured info-level logs record the provider, filename, chunk counts, and durations.
Failure Handling
Per-file exceptions are logged, and the pipeline continues with the next file.
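The per-file isolation described here follows a common pattern, sketched below in Python. `index_all` and `process_file` are hypothetical names, and the logger name and message format are assumptions rather than the indexer's actual log output.

```python
import logging

logger = logging.getLogger("indexer")  # logger name is an assumption


def index_all(files, process_file):
    """Process each file independently; one failure never aborts the run."""
    processed, failed = 0, 0
    for name in files:
        try:
            process_file(name)
            processed += 1
        except Exception:
            # Log the full traceback, then move on to the next file.
            logger.exception("indexing failed for %s", name)
            failed += 1
    return processed, failed


def demo_process(name):
    # Stand-in for download/extract/chunk/embed/upsert.
    if name == "bad.pdf":
        raise ValueError("cannot extract text")


print(index_all(["a.md", "bad.pdf", "c.docx"], demo_process))  # (2, 1)
```

Returning both counts lets the caller decide the final exit code after the loop finishes.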
Next
- Pipeline internals: Pipeline
- Tuning: Performance & Scaling