Ollama in OpenRAG

Ollama provides local LLM inference for generating natural language responses in OpenRAG.

Configuration

Image: ollama/ollama:latest
Port: 11434 (internal, not exposed to host)
Volume: ollama_data:/root/.ollama
Compute Mode: CPU (GPU disabled for compatibility)
Environment:
  OLLAMA_HOST: http://ollama:11434
  LLM_MODEL: llama3.1:8b
  LLM_TEMPERATURE: 0.3
  LLM_MAX_TOKENS: 4096
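These settings are read from the environment by the backend. A minimal illustrative sketch of how they map to Python values (the variable names come from the configuration above; the snippet itself is not OpenRAG code):
import os

# Illustrative only: load the Ollama settings listed above from the environment.
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")
LLM_MODEL = os.getenv("LLM_MODEL", "llama3.1:8b")
LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0.3"))
LLM_MAX_TOKENS = int(os.getenv("LLM_MAX_TOKENS", "4096"))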

Model: llama3.1:8b

Size: 4.9 GB
Context Window: 128K tokens
Parameters: 8 billion
Capabilities:
  • Text generation
  • Question answering
  • Summarization
  • Multi-turn conversation
Languages: Multilingual (optimized for English, supports French)

Model Download

The model is pulled automatically on first use:
# View download progress
sudo docker logs openrag-ollama -f
Output:
pulling manifest
pulling model weights... (4.9 GB)
downloading: 45% [===========>       ]
...
pulling complete
Manual Download (optional):
sudo docker exec openrag-ollama ollama pull llama3.1:8b
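The pull can also be triggered through the HTTP API (POST /api/pull, listed in the API Reference below). A sketch that streams the progress objects, assuming the port is reachable from where the script runs and using the field name from the current Ollama API docs:
import httpx

# Illustrative: pull a model via the Ollama HTTP API and print progress lines.
with httpx.stream("POST", "http://localhost:11434/api/pull",
                  json={"model": "llama3.1:8b"}, timeout=None) as r:
    for line in r.iter_lines():
        if line:
            print(line)  # one JSON status object per line (status, total, completed)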

API Usage

Generate Completion:
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "prompt": "What is RAG?",
    "stream": false
  }'
Response:
{
  "model": "llama3.1:8b",
  "created_at": "2026-02-18T10:30:00.123456Z",
  "response": "RAG stands for Retrieval-Augmented Generation...",
  "done": true,
  "context": [...],
  "total_duration": 5234567890,
  "load_duration": 1234567,
  "prompt_eval_duration": 234567890,
  "eval_duration": 4999999999
}
Chat Completion (with context):
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [
      {"role": "system", "content": "You are a helpful technical assistant."},
      {"role": "user", "content": "How does vector search work?"}
    ]
  }'
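The equivalent call from Python, as a minimal sketch mirroring the curl example (httpx is already used by the backend):
import httpx

# Mirror of the curl chat example above; "stream": False returns one JSON object.
resp = httpx.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [
            {"role": "system", "content": "You are a helpful technical assistant."},
            {"role": "user", "content": "How does vector search work?"},
        ],
        "stream": False,
    },
    timeout=120.0,
)
print(resp.json()["message"]["content"])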

OpenRAG Integration

OpenRAG uses Ollama via the LLM Service:
# File: backend/services/orchestrator/services/llm_service.py

async def _generate_with_ollama(self, system_prompt: str, user_prompt: str) -> str:
    # 120s client timeout leaves room for cold-start model loading
    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post(
            f"{self.base_url}/api/generate",
            json={
                "model": self.model,
                # System and user prompts are concatenated into a single prompt
                "prompt": f"{system_prompt}\n\n{user_prompt}",
                "stream": False,  # wait for the complete response
                "options": {
                    "temperature": self.temperature,
                    "num_predict": self.max_tokens
                }
            }
        )
        result = response.json()
        return result["response"].strip()

Performance Characteristics

First Query (cold start):
  • Model loading: 10-20 seconds
  • Generation: 30-40 seconds
  • Total: 40-60 seconds
Subsequent Queries (warm):
  • Generation: 5-15 seconds (model already in memory)
Factors Affecting Speed:
  • CPU cores and speed
  • Available RAM (model requires ~5GB)
  • Prompt length
  • Response length (max_tokens)
  • Temperature (higher = more varied output; affects speed only indirectly, via response length)
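These timings can be verified directly: /api/generate reports nanosecond-precision durations (the fields shown in the response above). A minimal sketch that prints them in seconds:
import httpx

# Convert the nanosecond duration fields from /api/generate into seconds.
r = httpx.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": "What is RAG?", "stream": False},
    timeout=300.0,
).json()
for field in ("load_duration", "prompt_eval_duration", "eval_duration", "total_duration"):
    print(f"{field}: {r[field] / 1e9:.1f}s")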

Model Management

List Downloaded Models:
curl http://localhost:11434/api/tags | jq
Response:
{
  "models": [
    {
      "name": "llama3.1:8b",
      "modified_at": "2026-02-18T08:15:00Z",
      "size": 4946823168,
      "digest": "sha256:...",
      "details": {
        "format": "gguf",
        "family": "llama",
       "parameter_size": "8B",
        "quantization_level": "Q4_0"
      }
    }
  ]
}
Delete Model:
sudo docker exec openrag-ollama ollama rm llama3.1:8b
Pull Different Model:
# Smaller model for faster responses
sudo docker exec openrag-ollama ollama pull phi3:mini

# Larger model for better quality
sudo docker exec openrag-ollama ollama pull llama3.1:70b
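The model inventory can also be queried programmatically; a short sketch that prints each downloaded model and its size, based on the /api/tags response shown above:
import httpx

# List downloaded models and their on-disk size (bytes -> GB) via /api/tags.
for m in httpx.get("http://localhost:11434/api/tags", timeout=10.0).json()["models"]:
    print(f"{m['name']:<20} {m['size'] / 1e9:.1f} GB")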

Alternative Models

OpenRAG supports multiple models via configuration.
Small/Fast (recommended for development):
  • phi3:mini (2.3 GB) - Fast, decent quality
  • mistral:7b (4.1 GB) - Good balance
Medium (recommended for production):
  • llama3.1:8b (4.9 GB) - Current default, excellent quality
  • mixtral:8x7b (26 GB) - High quality, slower
Large (best quality, requires powerful hardware):
  • llama3.1:70b (40 GB) - Excellent quality
  • mixtral:8x22b (94 GB) - Best quality available

Configuration Parameters

Temperature (0.0 - 1.0):
  • 0.0-0.3: Deterministic, factual (recommended for RAG)
  • 0.4-0.7: Balanced
  • 0.8-1.0: Creative, varied
Max Tokens:
  • 1024: Short responses
  • 2048: Medium responses (previous default)
  • 4096: Long, detailed responses (current default)
  • 8192+: Very long responses
Top P (0.0 - 1.0):
  • Controls diversity via nucleus sampling
  • Lower = more focused
  • Default: 0.9
Top K:
  • Number of top tokens to consider
  • Lower = more focused
  • Default: 40
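All four parameters are passed to Ollama through the request's options object; a hedged sketch using the values above (the option names temperature, num_predict, top_p and top_k follow the Ollama API):
import httpx

# Sketch: tuning parameters passed as Ollama "options" on a generate call.
payload = {
    "model": "llama3.1:8b",
    "prompt": "Summarize the benefits of RAG.",
    "stream": False,
    "options": {
        "temperature": 0.3,   # 0.0-0.3: deterministic, factual (recommended for RAG)
        "num_predict": 4096,  # maximum tokens to generate
        "top_p": 0.9,         # nucleus sampling threshold
        "top_k": 40,          # candidate tokens considered per step
    },
}
response = httpx.post("http://localhost:11434/api/generate", json=payload, timeout=300.0)
print(response.json()["response"])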

System Prompts

OpenRAG uses a specialized system prompt. The default prompt (in French, since answers are delivered in French) casts the model as an expert in enterprise telephony, Cisco solutions and Orange's WTE platform, and constrains it to answer only from the retrieved context, in a structured, professional style, without ever mentioning its sources:
system_prompt = """Vous êtes un assistant technique expert spécialisé dans la téléphonie d'entreprise, les solutions Cisco et la plateforme WTE (Webex Teams Edition) d'Orange.

Règles strictes :
1. Répondez UNIQUEMENT en vous basant sur les informations fournies dans le contexte
2. Fournissez des réponses détaillées, précises et complètes avec tous les détails techniques disponibles
3. Ne mentionnez JAMAIS les numéros de documents, les sources ou que vous vous basez sur des documents
4. Répondez comme si vous connaissiez ces informations de manière naturelle
5. Utilisez un format structuré (listes, étapes, sections) pour une meilleure lisibilité
6. Si l'information n'est pas disponible dans le contexte, indiquez simplement que vous n'avez pas cette information
7. Soyez technique mais compréhensible
8. Répondez en français de manière professionnelle et directe"""

Health Check

curl http://localhost:11434/api/tags
If Ollama is running, this returns the list of downloaded models; if it is not, the connection is refused.
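The same check can be scripted; a minimal sketch that treats any connection error as "down":
import httpx

# Minimal health check: Ollama is considered up if /api/tags answers with 200.
def ollama_healthy(base_url: str = "http://localhost:11434") -> bool:
    try:
        return httpx.get(f"{base_url}/api/tags", timeout=5.0).status_code == 200
    except httpx.HTTPError:
        return False

print("ollama is up" if ollama_healthy() else "ollama is down")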

Resource Usage

Memory:
  • Base: ~500 MB
  • With llama3.1:8b loaded: ~5.5 GB
  • Peak during generation: ~6 GB
CPU:
  • Idle: <1%
  • During generation: 80-100% (all cores)
Disk:
  • Per model: 2-50 GB depending on size
  • Cache: Additional 1-2 GB

Monitoring

Check Model Status:
curl http://localhost:11434/api/ps | jq
View Logs:
sudo docker logs openrag-ollama --tail=100 -f
Resource Usage:
docker stats openrag-ollama

Troubleshooting

Model Not Loading:
# Check disk space
df -h

# Re-pull model
sudo docker exec openrag-ollama ollama pull llama3.1:8b
Slow Response:
  • Reduce max_tokens
  • Use smaller model (phi3:mini)
  • Increase CPU allocation
  • Lower temperature
Out of Memory:
  • Reduce model size
  • Increase Docker memory limit
  • Use quantized models (Q4, Q5)
Connection Timeout:
  • Increase client timeout (currently 120s)
  • Check API timeout in backend/api/main.py (currently 300s)
  • Verify Ollama is running: docker ps | grep ollama
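If timeouts persist during cold starts, a simple retry with a generous timeout can help; an illustrative sketch (the wrapper below is not part of the OpenRAG codebase):
import time
import httpx

# Illustrative retry wrapper for slow cold starts; not part of the OpenRAG codebase.
def generate_with_retry(prompt: str, attempts: int = 3) -> str:
    for attempt in range(attempts):
        try:
            resp = httpx.post(
                "http://localhost:11434/api/generate",
                json={"model": "llama3.1:8b", "prompt": prompt, "stream": False},
                timeout=300.0,  # generous timeout covers first-query model loading
            )
            return resp.json()["response"]
        except httpx.TimeoutException:
            time.sleep(2 ** attempt)  # back off before retrying
    raise RuntimeError("Ollama did not respond after retries")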

Streaming Responses

For real-time streaming (future enhancement):
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "prompt": "Explain RAG",
    "stream": true
  }'
Returns a stream of newline-delimited JSON objects as the response is generated.
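Consuming the stream from Python, as a sketch (each line is a JSON object carrying a "response" fragment, with "done": true on the final object):
import json
import httpx

# Sketch: read the newline-delimited JSON stream and print tokens as they arrive.
with httpx.stream("POST", "http://localhost:11434/api/generate",
                  json={"model": "llama3.1:8b", "prompt": "Explain RAG", "stream": True},
                  timeout=None) as r:
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break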

GPU Acceleration (Optional)

To enable GPU support (if NVIDIA GPU available):
# docker-compose.yml
ollama:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
Performance Improvement: 5-10x faster generation

API Reference

Full Ollama API documentation: https://github.com/ollama/ollama/blob/main/docs/api.md
Key endpoints:
  • POST /api/generate: Generate completion
  • POST /api/chat: Chat completion
  • GET /api/tags: List models
  • POST /api/pull: Download model
  • DELETE /api/delete: Remove model
  • GET /api/ps: Show loaded models