Ollama in OpenRAG

Ollama provides local LLM inference for generating natural language responses in OpenRAG.

Configuration

Image: ollama/ollama:latest
Port: 11434 (internal, not exposed to host)
Volume: ollama_data:/root/.ollama
Compute Mode: CPU (GPU disabled for compatibility)
Environment:
  OLLAMA_HOST: http://ollama:11434
  LLM_MODEL: llama3.1:8b
  LLM_TEMPERATURE: 0.3
  LLM_MAX_TOKENS: 4096
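These settings are read from the environment by the backend. A minimal illustrative sketch of how they map to Python values (the variable names come from the configuration above; the snippet itself is not OpenRAG code):
import os

# Illustrative only: load the Ollama settings listed above from the environment.
OLLAMA_HOST = os.getenv("OLLAMA_HOST", "http://ollama:11434")
LLM_MODEL = os.getenv("LLM_MODEL", "llama3.1:8b")
LLM_TEMPERATURE = float(os.getenv("LLM_TEMPERATURE", "0.3"))
LLM_MAX_TOKENS = int(os.getenv("LLM_MAX_TOKENS", "4096"))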

Model: llama3.1:8b

Size: 4.9 GB
Context Window: 128K tokens
Parameters: 8 billion
Capabilities:
  • Text generation
  • Question answering
  • Summarization
  • Multi-turn conversation
Languages: Multilingual (optimized for English, supports French)

Model Download

The model is pulled automatically on first use:
# View download progress
sudo docker logs openrag-ollama -f
Output:
pulling manifest
pulling model weights... (4.9 GB)
downloading: 45% [===========>       ]
...
pulling complete
Manual Download (optional):
sudo docker exec openrag-ollama ollama pull llama3.1:8b
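The pull can also be triggered through the HTTP API (POST /api/pull, listed in the API Reference below). A sketch that streams the progress objects, assuming the port is reachable from where the script runs and using the field name from the current Ollama API docs:
import httpx

# Illustrative: pull a model via the Ollama HTTP API and print progress lines.
with httpx.stream("POST", "http://localhost:11434/api/pull",
                  json={"model": "llama3.1:8b"}, timeout=None) as r:
    for line in r.iter_lines():
        if line:
            print(line)  # one JSON status object per line (status, total, completed)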

API Usage

Generate Completion:
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "prompt": "What is RAG?",
    "stream": false
  }'
Response:
{
  "model": "llama3.1:8b",
  "created_at": "2026-02-18T10:30:00.123456Z",
  "response": "RAG stands for Retrieval-Augmented Generation...",
  "done": true,
  "context": [...],
  "total_duration": 5234567890,
  "load_duration": 1234567,
  "prompt_eval_duration": 234567890,
  "eval_duration": 4999999999
}
Chat Completion (with context):
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "messages": [
      {"role": "system", "content": "You are a helpful technical assistant."},
      {"role": "user", "content": "How does vector search work?"}
    ]
  }'
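The equivalent call from Python, as a minimal sketch mirroring the curl example (httpx is already used by the backend):
import httpx

# Mirror of the curl chat example above; "stream": False returns one JSON object.
resp = httpx.post(
    "http://localhost:11434/api/chat",
    json={
        "model": "llama3.1:8b",
        "messages": [
            {"role": "system", "content": "You are a helpful technical assistant."},
            {"role": "user", "content": "How does vector search work?"},
        ],
        "stream": False,
    },
    timeout=120.0,
)
print(resp.json()["message"]["content"])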

OpenRAG Integration

OpenRAG uses Ollama via the LLM Service:
# File: backend/services/orchestrator/services/llm_service.py

async def _generate_with_ollama(self, system_prompt: str, user_prompt: str) -> str:
    # 120s client timeout leaves room for cold-start model loading
    async with httpx.AsyncClient(timeout=120.0) as client:
        response = await client.post(
            f"{self.base_url}/api/generate",
            json={
                "model": self.model,
                # System and user prompts are concatenated into a single prompt
                "prompt": f"{system_prompt}\n\n{user_prompt}",
                "stream": False,  # wait for the complete response
                "options": {
                    "temperature": self.temperature,
                    "num_predict": self.max_tokens
                }
            }
        )
        result = response.json()
        return result["response"].strip()

Performance Characteristics

First Query (cold start):
  • Model loading: 10-20 seconds
  • Generation: 30-40 seconds
  • Total: 40-60 seconds
Subsequent Queries (warm):
  • Generation: 5-15 seconds (model already in memory)
Factors Affecting Speed:
  • CPU cores and speed
  • Available RAM (model requires ~5GB)
  • Prompt length
  • Response length (max_tokens)
  • Temperature (higher = more varied output; affects speed only indirectly, via response length)
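These timings can be verified directly: /api/generate reports nanosecond-precision durations (the fields shown in the response above). A minimal sketch that prints them in seconds:
import httpx

# Convert the nanosecond duration fields from /api/generate into seconds.
r = httpx.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1:8b", "prompt": "What is RAG?", "stream": False},
    timeout=300.0,
).json()
for field in ("load_duration", "prompt_eval_duration", "eval_duration", "total_duration"):
    print(f"{field}: {r[field] / 1e9:.1f}s")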

Model Management

List Downloaded Models:
curl http://localhost:11434/api/tags | jq
Response:
{
  "models": [
    {
      "name": "llama3.1:8b",
      "modified_at": "2026-02-18T08:15:00Z",
      "size": 4946823168,
      "digest": "sha256:...",
      "details": {
        "format": "gguf",
        "family": "llama",
       "parameter_size": "8B",
        "quantization_level": "Q4_0"
      }
    }
  ]
}
Delete Model:
sudo docker exec openrag-ollama ollama rm llama3.1:8b
Pull Different Model:
# Smaller model for faster responses
sudo docker exec openrag-ollama ollama pull phi3:mini

# Larger model for better quality
sudo docker exec openrag-ollama ollama pull llama3.1:70b
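The model inventory can also be queried programmatically; a short sketch that prints each downloaded model and its size, based on the /api/tags response shown above:
import httpx

# List downloaded models and their on-disk size (bytes -> GB) via /api/tags.
for m in httpx.get("http://localhost:11434/api/tags", timeout=10.0).json()["models"]:
    print(f"{m['name']:<20} {m['size'] / 1e9:.1f} GB")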

Alternative Models

OpenRAG supports multiple models via configuration.
Small/Fast (recommended for development):
  • phi3:mini (2.3 GB) - Fast, decent quality
  • mistral:7b (4.1 GB) - Good balance
Medium (recommended for production):
  • llama3.1:8b (4.9 GB) - Current default, excellent quality
  • mixtral:8x7b (26 GB) - High quality, slower
Large (best quality, requires powerful hardware):
  • llama3.1:70b (40 GB) - Excellent quality
  • mixtral:8x22b (94 GB) - Best quality available

Configuration Parameters

Temperature (0.0 - 1.0):
  • 0.0-0.3: Deterministic, factual (recommended for RAG)
  • 0.4-0.7: Balanced
  • 0.8-1.0: Creative, varied
Max Tokens:
  • 1024: Short responses
  • 2048: Medium responses (previous default)
  • 4096: Long, detailed responses (current default)
  • 8192+: Very long responses
Top P (0.0 - 1.0):
  • Controls diversity via nucleus sampling
  • Lower = more focused
  • Default: 0.9
Top K:
  • Number of top tokens to consider
  • Lower = more focused
  • Default: 40
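All four parameters are passed to Ollama through the request's options object; a hedged sketch using the values above (the option names temperature, num_predict, top_p and top_k follow the Ollama API):
import httpx

# Sketch: tuning parameters passed as Ollama "options" on a generate call.
payload = {
    "model": "llama3.1:8b",
    "prompt": "Summarize the benefits of RAG.",
    "stream": False,
    "options": {
        "temperature": 0.3,   # 0.0-0.3: deterministic, factual (recommended for RAG)
        "num_predict": 4096,  # maximum tokens to generate
        "top_p": 0.9,         # nucleus sampling threshold
        "top_k": 40,          # candidate tokens considered per step
    },
}
response = httpx.post("http://localhost:11434/api/generate", json=payload, timeout=300.0)
print(response.json()["response"])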

System Prompts

OpenRAG uses a specialized system prompt. The default prompt (in French, since answers are delivered in French) casts the model as an expert in enterprise telephony, Cisco solutions and Orange's WTE platform, and constrains it to answer only from the retrieved context, in a structured, professional style, without ever mentioning its sources:
system_prompt = """Vous êtes un assistant technique expert spécialisé dans la téléphonie d'entreprise, les solutions Cisco et la plateforme WTE (Webex Teams Edition) d'Orange.

Règles strictes :
1. Répondez UNIQUEMENT en vous basant sur les informations fournies dans le contexte
2. Fournissez des réponses détaillées, précises et complètes avec tous les détails techniques disponibles
3. Ne mentionnez JAMAIS les numéros de documents, les sources ou que vous vous basez sur des documents
4. Répondez comme si vous connaissiez ces informations de manière naturelle
5. Utilisez un format structuré (listes, étapes, sections) pour une meilleure lisibilité
6. Si l'information n'est pas disponible dans le contexte, indiquez simplement que vous n'avez pas cette information
7. Soyez technique mais compréhensible
8. Répondez en français de manière professionnelle et directe"""

Health Check

curl http://localhost:11434/api/tags
If Ollama is running, this returns the list of downloaded models; if it is not, the connection is refused.
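The same check can be scripted; a minimal sketch that treats any connection error as "down":
import httpx

# Minimal health check: Ollama is considered up if /api/tags answers with 200.
def ollama_healthy(base_url: str = "http://localhost:11434") -> bool:
    try:
        return httpx.get(f"{base_url}/api/tags", timeout=5.0).status_code == 200
    except httpx.HTTPError:
        return False

print("ollama is up" if ollama_healthy() else "ollama is down")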

Resource Usage

Memory:
  • Base: ~500 MB
  • With llama3.1:8b loaded: ~5.5 GB
  • Peak during generation: ~6 GB
CPU:
  • Idle: <1%
  • During generation: 80-100% (all cores)
Disk:
  • Per model: 2-50 GB depending on size
  • Cache: Additional 1-2 GB

Monitoring

Check Model Status:
curl http://localhost:11434/api/ps | jq
View Logs:
sudo docker logs openrag-ollama --tail=100 -f
Resource Usage:
docker stats openrag-ollama

Troubleshooting

Model Not Loading:
# Check disk space
df -h

# Re-pull model
sudo docker exec openrag-ollama ollama pull llama3.1:8b
Slow Response:
  • Reduce max_tokens
  • Use smaller model (phi3:mini)
  • Increase CPU allocation
  • Lower temperature
Out of Memory:
  • Reduce model size
  • Increase Docker memory limit
  • Use quantized models (Q4, Q5)
Connection Timeout:
  • Increase client timeout (currently 120s)
  • Check API timeout in backend/api/main.py (currently 300s)
  • Verify Ollama is running: docker ps | grep ollama
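If timeouts persist during cold starts, a simple retry with a generous timeout can help; an illustrative sketch (the wrapper below is not part of the OpenRAG codebase):
import time
import httpx

# Illustrative retry wrapper for slow cold starts; not part of the OpenRAG codebase.
def generate_with_retry(prompt: str, attempts: int = 3) -> str:
    for attempt in range(attempts):
        try:
            resp = httpx.post(
                "http://localhost:11434/api/generate",
                json={"model": "llama3.1:8b", "prompt": prompt, "stream": False},
                timeout=300.0,  # generous timeout covers first-query model loading
            )
            return resp.json()["response"]
        except httpx.TimeoutException:
            time.sleep(2 ** attempt)  # back off before retrying
    raise RuntimeError("Ollama did not respond after retries")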

Streaming Responses

For real-time streaming (future enhancement):
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama3.1:8b",
    "prompt": "Explain RAG",
    "stream": true
  }'
Returns a stream of newline-delimited JSON objects as the response is generated.
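Consuming the stream from Python, as a sketch (each line is a JSON object carrying a "response" fragment, with "done": true on the final object):
import json
import httpx

# Sketch: read the newline-delimited JSON stream and print tokens as they arrive.
with httpx.stream("POST", "http://localhost:11434/api/generate",
                  json={"model": "llama3.1:8b", "prompt": "Explain RAG", "stream": True},
                  timeout=None) as r:
    for line in r.iter_lines():
        if not line:
            continue
        chunk = json.loads(line)
        print(chunk.get("response", ""), end="", flush=True)
        if chunk.get("done"):
            break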

GPU Acceleration (Optional)

To enable GPU support (if NVIDIA GPU available):
# docker-compose.yml
ollama:
  deploy:
    resources:
      reservations:
        devices:
          - driver: nvidia
            count: 1
            capabilities: [gpu]
Performance Improvement: 5-10x faster generation

API Reference

Full Ollama API documentation: https://github.com/ollama/ollama/blob/main/docs/api.md
Key endpoints:
  • POST /api/generate: Generate completion
  • POST /api/chat: Chat completion
  • GET /api/tags: List models
  • POST /api/pull: Download model
  • DELETE /api/delete: Remove model
  • GET /api/ps: Show loaded models