Ollama in OpenRAG
Ollama provides local LLM inference for generating natural language responses in OpenRAG.
Configuration
Image: ollama/ollama:latest
Port: 11434 (internal, not exposed to host)
Volume: ollama_data:/root/.ollama
Compute Mode: CPU (GPU disabled for compatibility)
Environment:
Model: llama3.1:8b
Size: 4.9 GB
Context Window: 128K tokens
Parameters: 8 billion
Capabilities:
- Text generation
- Question answering
- Summarization
- Multi-turn conversation
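Multi-turn conversation goes through the chat endpoint, which replays prior turns as message history. A minimal sketch, assuming the service is reachable as ollama:11434 inside the Docker network (the port is not exposed to the host):

```python
# Sketch: multi-turn conversation via POST /api/chat.
import requests

OLLAMA_URL = "http://ollama:11434"  # assumed in-network address; port not exposed to the host

messages = [{"role": "user", "content": "What is OpenRAG used for?"}]
resp = requests.post(
    f"{OLLAMA_URL}/api/chat",
    json={"model": "llama3.1:8b", "messages": messages, "stream": False},
    timeout=120,
)
resp.raise_for_status()
answer = resp.json()["message"]["content"]

# Append the reply and the next user turn to continue the conversation
messages += [
    {"role": "assistant", "content": answer},
    {"role": "user", "content": "Summarize that in one sentence."},
]
```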
Model Download
The model is automatically pulled on first use; it can also be pulled manually ahead of time, as sketched below.
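A manual pull sketch using the /api/pull endpoint; the ollama:11434 address is an assumption about the Docker network, since the port is not published to the host:

```python
# Sketch: pull a model through Ollama's /api/pull endpoint.
import requests

OLLAMA_URL = "http://ollama:11434"  # assumed in-network address

def pull_model(name: str = "llama3.1:8b") -> None:
    # With a streaming request, Ollama sends newline-delimited JSON progress updates
    with requests.post(f"{OLLAMA_URL}/api/pull", json={"model": name}, stream=True, timeout=None) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if line:
                print(line.decode())  # e.g. {"status": "pulling manifest"} ... {"status": "success"}

if __name__ == "__main__":
    pull_model()
```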
API Usage
Generate Completion:
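A minimal non-streaming completion request; the base URL is again the assumed in-network address:

```python
# Sketch: single completion via POST /api/generate.
import requests

OLLAMA_URL = "http://ollama:11434"  # assumed in-network address

payload = {
    "model": "llama3.1:8b",
    "prompt": "Summarize the benefits of local LLM inference in two sentences.",
    "stream": False,                  # return one JSON object instead of a token stream
    "options": {"temperature": 0.2},  # low temperature for factual output
}

resp = requests.post(f"{OLLAMA_URL}/api/generate", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["response"])
```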
OpenRAG Integration
OpenRAG uses Ollama via the LLM Service:
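The sketch below illustrates the shape of such a wrapper; the class and method names (OllamaLLMService, generate_answer) and the prompt wording are hypothetical stand-ins, not the actual OpenRAG code:

```python
# Illustrative LLM-service wrapper; names and prompt text are hypothetical.
import requests

class OllamaLLMService:
    def __init__(self, base_url: str = "http://ollama:11434", model: str = "llama3.1:8b"):
        self.base_url = base_url  # assumed in-network address
        self.model = model

    def generate_answer(self, question: str, context: str, max_tokens: int = 4096) -> str:
        # RAG-style prompt: retrieved context first, then the user question
        prompt = f"Answer using only the context below.\n\nContext:\n{context}\n\nQuestion: {question}"
        resp = requests.post(
            f"{self.base_url}/api/generate",
            json={
                "model": self.model,
                "prompt": prompt,
                "stream": False,
                "options": {"temperature": 0.2, "num_predict": max_tokens},
            },
            timeout=120,  # matches the client timeout mentioned under Troubleshooting
        )
        resp.raise_for_status()
        return resp.json()["response"]
```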
Performance Characteristics
First Query (cold start):
- Model loading: 10-20 seconds
- Generation: 30-40 seconds
- Total: 50-75 seconds
Subsequent Queries:
- Generation: 5-15 seconds (model already in memory)
Generation speed depends on the following factors (a timing sketch follows this list):
- CPU cores and speed
- Available RAM (model requires ~5GB)
- Prompt length
- Response length (max_tokens)
- Temperature (higher = slower but more creative)
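The cold/warm difference is easy to measure by sending the same request twice; this sketch assumes in-network access and the default model:

```python
# Sketch: compare cold-start vs. warm latency with two identical requests.
import time
import requests

OLLAMA_URL = "http://ollama:11434"  # assumed in-network address

def timed_generate(prompt: str) -> float:
    start = time.perf_counter()
    requests.post(
        f"{OLLAMA_URL}/api/generate",
        json={"model": "llama3.1:8b", "prompt": prompt, "stream": False},
        timeout=300,
    ).raise_for_status()
    return time.perf_counter() - start

print(f"cold: {timed_generate('What is retrieval-augmented generation?'):.1f}s")  # includes model load
print(f"warm: {timed_generate('What is retrieval-augmented generation?'):.1f}s")  # model already in memory
```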
Model Management
List Downloaded Models:
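A sketch that lists downloaded models through GET /api/tags (the same information ollama list shows inside the container):

```python
# Sketch: enumerate downloaded models via GET /api/tags.
import requests

OLLAMA_URL = "http://ollama:11434"  # assumed in-network address

resp = requests.get(f"{OLLAMA_URL}/api/tags", timeout=10)
resp.raise_for_status()
for model in resp.json().get("models", []):
    size_gb = model["size"] / 1e9  # size is reported in bytes
    print(f"{model['name']}: {size_gb:.1f} GB")
```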
Alternative Models
OpenRAG supports multiple models via configuration:
Small/Fast (recommended for development):
- phi3:mini (2.3 GB) - Fast, decent quality
- mistral:7b (4.1 GB) - Good balance
Larger models:
- llama3.1:8b (4.9 GB) - Current default, excellent quality
- mixtral:8x7b (26 GB) - High quality, slower
- llama3.1:70b (40 GB) - Excellent quality
- mixtral:8x22b (94 GB) - Best quality available
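Switching models is a configuration change plus a pull of the new model. The OLLAMA_MODEL environment variable below is hypothetical and stands in for whatever setting OpenRAG actually reads:

```python
# Hypothetical sketch: OLLAMA_MODEL is an assumed configuration variable,
# not necessarily the name OpenRAG uses.
import os
import requests

OLLAMA_URL = "http://ollama:11434"  # assumed in-network address
MODEL = os.getenv("OLLAMA_MODEL", "llama3.1:8b")  # fall back to the current default

# Make sure the configured model is present before serving queries
requests.post(
    f"{OLLAMA_URL}/api/pull",
    json={"model": MODEL, "stream": False},
    timeout=None,
).raise_for_status()
```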
Configuration Parameters
Temperature (0.0 - 1.0):
- 0.0-0.3: Deterministic, factual (recommended for RAG)
- 0.4-0.7: Balanced
- 0.8-1.0: Creative, varied
Max Tokens:
- 1024: Short responses
- 2048: Medium responses (previous default)
- 4096: Long, detailed responses (current default)
- 8192+: Very long responses
Top P:
- Controls diversity via nucleus sampling
- Lower = more focused
- Default: 0.9
Top K:
- Number of top tokens to consider
- Lower = more focused
- Default: 40
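When calling Ollama directly, these parameters are passed in the options object of a generate request (num_predict is Ollama's name for max tokens). A minimal sketch using the defaults described above:

```python
# Sketch: passing generation parameters via the "options" field of /api/generate.
import requests

OLLAMA_URL = "http://ollama:11434"  # assumed in-network address

payload = {
    "model": "llama3.1:8b",
    "prompt": "Explain nucleus sampling in one paragraph.",
    "stream": False,
    "options": {
        "temperature": 0.2,   # deterministic, factual range recommended for RAG
        "num_predict": 4096,  # max tokens to generate (current default)
        "top_p": 0.9,         # nucleus sampling threshold
        "top_k": 40,          # number of top tokens to consider
    },
}
resp = requests.post(f"{OLLAMA_URL}/api/generate", json=payload, timeout=300)
resp.raise_for_status()
print(resp.json()["response"])
```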
System Prompts
OpenRAG uses specialized system prompts:
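The exact prompt text lives in the backend; the wording below is an illustrative stand-in showing the grounded-answering style such prompts enforce:

```python
# Illustrative RAG system prompt; the actual OpenRAG wording may differ.
import requests

OLLAMA_URL = "http://ollama:11434"  # assumed in-network address

SYSTEM_PROMPT = (
    "You are a helpful assistant. Answer strictly from the provided context. "
    "If the context does not contain the answer, say you do not know."
)

resp = requests.post(
    f"{OLLAMA_URL}/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": "Context:\n...\n\nQuestion: What does the volume ollama_data store?"},
        ],
        "stream": False,
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["message"]["content"])
```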
Health Check
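A basic liveness probe, assuming in-network access; a 200 from the root endpoint and from /api/tags indicates the server and its API are up:

```python
# Sketch: basic health check against the Ollama container.
import requests

OLLAMA_URL = "http://ollama:11434"  # assumed in-network address; not exposed to the host

def ollama_healthy() -> bool:
    try:
        # The root endpoint answers with a short status message when the server is up
        if requests.get(OLLAMA_URL, timeout=5).status_code != 200:
            return False
        # /api/tags confirms the API itself is responsive
        return requests.get(f"{OLLAMA_URL}/api/tags", timeout=5).status_code == 200
    except requests.RequestException:
        return False

print("ollama healthy:", ollama_healthy())
```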
Resource Usage
Memory:
- Base: ~500 MB
- With llama3.1:8b loaded: ~5.5 GB
- Peak during generation: ~6 GB
CPU:
- Idle: <1%
- During generation: 80-100% (all cores)
Disk:
- Per model: 2-50 GB depending on size
- Cache: additional 1-2 GB
Monitoring
Check Model Status:
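GET /api/ps reports which models are currently loaded in memory, which is useful for telling cold from warm state:

```python
# Sketch: show currently loaded models via GET /api/ps.
import requests

OLLAMA_URL = "http://ollama:11434"  # assumed in-network address

resp = requests.get(f"{OLLAMA_URL}/api/ps", timeout=10)
resp.raise_for_status()
loaded = resp.json().get("models", [])
if not loaded:
    print("No models loaded (next query pays the cold-start cost)")
for model in loaded:
    print(f"{model['name']}: {model['size'] / 1e9:.1f} GB in memory")
```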
Troubleshooting
Model Not Loading:
- Reduce max_tokens
- Use smaller model (phi3:mini)
- Increase CPU allocation
- Lower temperature
Out of Memory:
- Reduce model size
- Increase Docker memory limit
- Use quantized models (Q4, Q5)
Timeouts:
- Increase client timeout (currently 120s)
- Check API timeout in backend/api/main.py (currently 300s)
- Verify Ollama is running: docker ps | grep ollama
Streaming Responses
For real-time streaming (future enhancement):
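With "stream": true, /api/generate returns newline-delimited JSON chunks that can be forwarded to the client as they arrive; a minimal consumer sketch, again assuming in-network access:

```python
# Sketch: consume a streaming generation response chunk by chunk.
import json
import requests

OLLAMA_URL = "http://ollama:11434"  # assumed in-network address

payload = {"model": "llama3.1:8b", "prompt": "Explain vector search briefly.", "stream": True}
with requests.post(f"{OLLAMA_URL}/api/generate", json=payload, stream=True, timeout=300) as resp:
    resp.raise_for_status()
    for line in resp.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)                      # each line is one JSON object
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):                         # final chunk carries "done": true
            print()
            break
```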
GPU Acceleration (Optional)
To enable GPU support (if an NVIDIA GPU is available):
API Reference
Full Ollama API documentation: https://github.com/ollama/ollama/blob/main/docs/api.md
Key endpoints:
- POST /api/generate: Generate completion
- POST /api/chat: Chat completion
- GET /api/tags: List models
- POST /api/pull: Download model
- DELETE /api/delete: Remove model
- GET /api/ps: Show loaded models