Embeddings & AI Layer

Responsible for generating embeddings and synthesizing answers via a model-agnostic architecture.

Model-Agnostic Architecture

The system uses a fully flexible, JSON-based configuration that supports any AI provider (OpenAI, Anthropic, Azure OpenAI, local models, etc.):

  • No hardcoded parameters: All model settings (temperature, max_tokens, etc.) stored as JSON
  • Configurable endpoints: Full URL, headers, request/response templates per model
  • Template system: Request bodies use placeholders like {MODEL_ID}, {MESSAGES}, {INPUT}
  • Dynamic parameter merging: DefaultParams JSON merged into requests at runtime
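
At runtime the service substitutes the placeholders, then merges the model's DefaultParams into the resulting body. A sketch of the effect, with an illustrative model ID and parameter values (the merge semantics shown here are assumed):

Substituted template:

{
  "model": "gpt-4o-mini",
  "messages": [{ "role": "user", "content": "Hello" }]
}

DefaultParams:

{ "temperature": 0.7, "top_p": 0.9 }

Request body sent to the endpoint:

{
  "model": "gpt-4o-mini",
  "messages": [{ "role": "user", "content": "Hello" }],
  "temperature": 0.7,
  "top_p": 0.9
}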

Configuration Structure

Each model (chat or embedding) has:

  • Url: Full API endpoint URL
  • Headers: Dictionary (e.g., {"Authorization": "Bearer sk-..."})
  • RequestTemplate: JSON template with placeholders
  • ResponseMapping: JSONPath expressions to extract response fields
  • DefaultParams: Model-specific parameters (e.g., {"temperature": 0.7})
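
Put together, a complete chat model entry might look like the following sketch; the endpoint, header, and JSONPath expressions assume OpenAI's chat completions API, and the field casing follows the configuration examples later on this page:

{
  "url": "https://api.openai.com/v1/chat/completions",
  "headers": {
    "Authorization": "Bearer sk-..."
  },
  "requestTemplate": {
    "model": "{MODEL_ID}",
    "messages": "{MESSAGES}"
  },
  "responseMapping": {
    "content": "$.choices[0].message.content",
    "role": "$.choices[0].message.role"
  },
  "defaultParams": {
    "temperature": 0.7
  }
}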

Embeddings

  • Configurable embedding model via admin UI or seeding from environment
  • Default: text-embedding-3-small (1536 dims) when OPENAI_API_KEY is set
  • Supports any embedding API via flexible configuration
  • Batched embedding generation with configurable batch size
  • Request template example:

{
  "model": "{MODEL_ID}",
  "input": "{INPUT}",
  "encoding_format": "float"
}
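
Paired with a response mapping, a complete embedding model entry might look like this sketch (the URL and header are OpenAI's public embeddings endpoint, which returns the vector at $.data[0].embedding):

{
  "url": "https://api.openai.com/v1/embeddings",
  "headers": {
    "Authorization": "Bearer sk-..."
  },
  "requestTemplate": {
    "model": "{MODEL_ID}",
    "input": "{INPUT}",
    "encoding_format": "float"
  },
  "responseMapping": {
    "embedding": "$.data[0].embedding"
  }
}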

Answer Generation

  • ModelAgnosticAiService provides multi-tier chat completion (Micro/Mini/Full)
  • Strategy selection: Eco (cost), Standard (balanced), Turbo (quality)
  • Automatic fallback between tiers when model unavailable
  • Supports any chat completion API via flexible configuration
  • Uses max_completion_tokens for newer OpenAI models (GPT-4o, GPT-5)
  • Request template example:

{
  "model": "{MODEL_ID}",
  "messages": {MESSAGES}
}
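
For models that expect max_completion_tokens instead of max_tokens, the limit can be supplied through that model's DefaultParams so the shared request template stays provider-neutral (the limit shown is illustrative):

{ "max_completion_tokens": 2048 }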

Chat

  • ChatService pipeline: embed the last user message → retrieve context → build an incremental answer
  • Streaming: server-sent events emit step updates and the final answer
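
A stream following this sequence might look roughly like the following; the event payload shapes here are hypothetical, not the service's actual wire format:

data: {"step": "embed", "status": "done"}

data: {"step": "retrieve", "status": "done", "chunks": 5}

data: {"answer": "Indexing works by splitting documents into chunks ..."}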

Extending Models

Goal                   Strategy
New embedding model    Add config in admin UI with URL, headers, template, dimensions; reindex
Multi-model            Configure multiple models in registry; system selects by tier/strategy
Local model proxy      Set URL to local endpoint (e.g., http://localhost:8080/v1/embeddings)
Different AI provider  Configure custom URL, headers, request/response templates
Custom parameters      Add to DefaultParams JSON (e.g., {"temperature": 0.5, "top_p": 0.9})

Configuration Examples

Adding Anthropic Claude

{
  "url": "https://api.anthropic.com/v1/messages",
  "headers": {
    "x-api-key": "sk-ant-...",
    "anthropic-version": "2023-06-01"
  },
  "requestTemplate": {
    "model": "{MODEL_ID}",
    "messages": "{MESSAGES}",
    "max_tokens": 4096
  },
  "responseMapping": {
    "content": "$.content[0].text",
    "role": "$.role"
  }
}
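
Note: Anthropic's Messages API requires max_tokens on every request (hence its presence in the template above) and takes the system prompt as a top-level system field rather than as a message, so a system-style instruction could be merged in via DefaultParams (e.g., {"system": "..."}).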

Adding Local Ollama Model

{
  "url": "http://localhost:11434/api/embeddings",
  "headers": {},
  "requestTemplate": {
    "model": "{MODEL_ID}",
    "prompt": "{INPUT}"
  },
  "responseMapping": {
    "embedding": "$.embedding"
  }
}
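
A chat model pointed at the same Ollama instance can be configured the same way. This sketch assumes Ollama's /api/chat endpoint with "stream": false, which returns the reply at $.message.content:

{
  "url": "http://localhost:11434/api/chat",
  "headers": {},
  "requestTemplate": {
    "model": "{MODEL_ID}",
    "messages": "{MESSAGES}",
    "stream": false
  },
  "responseMapping": {
    "content": "$.message.content",
    "role": "$.message.role"
  }
}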

Prompt Strategy (Simplified)

  • System-style instruction (implicit)
  • Context concatenation (ordered by similarity)
  • User question appended
  • Model asked to answer citing sources implicitly (source chunk mapping done externally)
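
Concretely, the assembled messages might look like this (the instruction wording and context markers are illustrative, not the exact prompt used):

[
  {
    "role": "system",
    "content": "Answer using only the provided context."
  },
  {
    "role": "user",
    "content": "Context:\n<chunk 1>\n<chunk 2>\n\nQuestion: How does indexing work?"
  }
]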

Considerations

Concern            Mitigation
Context overflow   Limit chunk sizes / reduce top-K
Hallucination      Provide direct chunk content; consider answer validation
Cost               Batch embeddings; right-size chunk length; use Eco strategy
API compatibility  Use RequestTemplate and ResponseMapping to adapt any API

Future Enhancements

  • Per-provider model routing
  • Reranking stage (e.g. cross encoder)
  • Source citation markers referencing chunk IDs
  • Auto-detection of response structure
