OpenRAG Architecture

OpenRAG implements a modular and scalable microservices architecture based on the RAG (Retrieval-Augmented Generation) pattern.

Overview

Main Components

1. API Gateway (Port 8000)

REST entry point for all user interactions. Responsibilities:
  • Authentication and authorization (coming soon)
  • Request validation
  • Routing to orchestrator
  • Rate limiting
  • API documentation (Swagger)
Technologies:
  • FastAPI
  • Uvicorn (ASGI server)
  • Pydantic (validation)
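A minimal gateway route could look like the following sketch (the route path, request model, and orchestrator URL are illustrative, not taken from the OpenRAG source):

from fastapi import FastAPI
from pydantic import BaseModel
import httpx

app = FastAPI(title="OpenRAG API Gateway")

# Hypothetical orchestrator address; adjust to your deployment.
ORCHESTRATOR_URL = "http://orchestrator:8001"

class QueryRequest(BaseModel):
    question: str
    top_k: int = 5

@app.post("/query")
async def query(request: QueryRequest):
    # Pydantic validates the payload; the gateway then forwards it.
    async with httpx.AsyncClient() as client:
        response = await client.post(f"{ORCHESTRATOR_URL}/query", json=request.model_dump())
    return response.json()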

2. Orchestrator (Port 8001)

System core that coordinates the complete RAG workflow. Responsibilities:
  • Document ingestion pipeline coordination
  • Query workflow management
  • Inter-service communication
  • Asynchronous job management
  • Process monitoring
The document ingestion workflow and the RAG query workflow coordinated by this service are described step by step in the Data Flow section below.
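As a simplified sketch of this coordination (the helper functions are stubs standing in for calls to the real services; none of these names come from the OpenRAG codebase):

import asyncio
import uuid

# Stubs representing calls to MinIO, the Embedding Service, Qdrant, and PostgreSQL.
async def store_in_minio(data: bytes, name: str) -> str: return f"documents/{name}"
async def extract_text(key: str) -> str: return "extracted text"
def split_into_chunks(text: str, max_tokens: int = 512) -> list[str]: return [text]
async def embed_chunks(chunks: list[str]) -> list[list[float]]: return [[0.0] * 384 for _ in chunks]
async def index_vectors(doc_id: str, chunks, vectors) -> None: pass
async def record_metadata(doc_id: str, name: str, key: str) -> None: pass

async def ingest_document(data: bytes, filename: str) -> str:
    # Runs the full ingestion pipeline (see Data Flow below) for one document.
    doc_id = str(uuid.uuid4())
    key = await store_in_minio(data, filename)
    chunks = split_into_chunks(await extract_text(key))
    vectors = await embed_chunks(chunks)
    await index_vectors(doc_id, chunks, vectors)
    await record_metadata(doc_id, filename, key)
    return doc_id

print(asyncio.run(ingest_document(b"...", "report.pdf")))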

3. Embedding Service (Port 8002)

Specialized service for generating vector embeddings. Responsibilities:
  • Text embeddings generation
  • Batch processing support
  • Performance optimization (GPU if available)
Supported models:
  • sentence-transformers/all-MiniLM-L6-v2 (default, 384 dimensions)
  • sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (multilingual)
  • sentence-transformers/all-mpnet-base-v2 (better quality)
  • Custom sentence-transformers compatible models
Configuration:
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2
EMBEDDING_DEVICE=cpu  # or cuda
EMBEDDING_BATCH_SIZE=32
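For illustration, batched embedding generation with the default model might look like this (a sketch using the sentence-transformers library; the package and model download are assumed):

from sentence_transformers import SentenceTransformer

# Load the default model on the configured device (EMBEDDING_DEVICE).
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2", device="cpu")

chunks = ["First chunk of text...", "Second chunk of text..."]
# encode() batches internally; batch_size mirrors EMBEDDING_BATCH_SIZE.
embeddings = model.encode(chunks, batch_size=32)
print(embeddings.shape)  # (2, 384): one 384-dimensional vector per chunk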

4. MinIO (Ports 9000, 9001)

S3-compatible object storage for original documents. Responsibilities:
  • Persistent storage of uploaded files
  • Document versioning
  • Bucket management
Storage structure:
documents/
├── {document_id_1}/
│   └── filename.pdf
├── {document_id_2}/
│   └── report.docx
└── ...
Access:
  • S3 API: http://localhost:9000
  • Web Console: http://localhost:9001
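A minimal upload following the layout above, using the minio Python client (the credentials are MinIO's defaults and only suit local development):

from minio import Minio

# Default MinIO credentials; use your own in any real deployment.
client = Minio("localhost:9000", access_key="minioadmin", secret_key="minioadmin", secure=False)

if not client.bucket_exists("documents"):
    client.make_bucket("documents")

# Store a file under the {document_id}/filename layout shown above.
client.fput_object("documents", "123e4567-e89b-12d3-a456-426614174000/report.docx", "report.docx")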

5. Qdrant (Ports 6333, 6334)

Vector database for semantic search. Responsibilities:
  • Embedding vector indexing
  • Similarity search (HNSW algorithm)
  • Metadata filtering
  • Clustering and optimization
Collections:
  • documents_embeddings: Default collection
  • Custom collections per use case
Configuration:
vectors:
  size: 384  # according to embedding model
  distance: Cosine  # or Dot, Euclid
Payload structure:
{
  "document_id": "uuid",
  "chunk_index": 0,
  "content": "chunk text",
  "metadata": {
    "source_file": "document.pdf",
    "page": 1
  }
}
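Creating the collection and indexing a chunk with this payload can be sketched with the qdrant-client Python package (IDs and vector values below are dummy data):

from qdrant_client import QdrantClient
from qdrant_client.models import Distance, VectorParams, PointStruct

client = QdrantClient(host="localhost", port=6333)

# Vector size must match the embedding model (384 for all-MiniLM-L6-v2).
client.recreate_collection(
    collection_name="documents_embeddings",
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Index one chunk using the payload structure shown above.
client.upsert(
    collection_name="documents_embeddings",
    points=[PointStruct(id=1, vector=[0.1] * 384, payload={
        "document_id": "uuid", "chunk_index": 0, "content": "chunk text",
        "metadata": {"source_file": "document.pdf", "page": 1},
    })],
)

# Top-K similarity search against the collection.
hits = client.search(collection_name="documents_embeddings", query_vector=[0.1] * 384, limit=5)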

6. PostgreSQL (Port 5432)

Relational database for metadata. Main schema:
CREATE TABLE documents (
    id UUID PRIMARY KEY,
    filename VARCHAR(255),
    file_type VARCHAR(50),
    file_size BIGINT,
    minio_object_key VARCHAR(500),
    status VARCHAR(50),
    upload_date TIMESTAMP,
    processed_date TIMESTAMP,
    metadata JSONB,
    created_at TIMESTAMP,
    updated_at TIMESTAMP
);
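Recording a document's metadata could look like this sketch with psycopg2 (connection parameters and database name are assumptions, not OpenRAG defaults):

import uuid
import psycopg2
from psycopg2.extras import Json

conn = psycopg2.connect(host="localhost", port=5432, dbname="openrag",
                        user="postgres", password="postgres")

with conn, conn.cursor() as cur:
    # Json() adapts a Python dict to the JSONB metadata column.
    cur.execute(
        """INSERT INTO documents (id, filename, file_type, file_size,
                                  minio_object_key, status, metadata)
           VALUES (%s, %s, %s, %s, %s, %s, %s)""",
        (str(uuid.uuid4()), "report.docx", "docx", 12345,
         "123e4567-e89b-12d3-a456-426614174000/report.docx", "uploaded",
         Json({"pages": 10})),
    )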

7. Ollama (Port 11434)

Local LLM server for response generation. Responsibilities:
  • Language model execution
  • Contextualized response generation
  • Model cache management
Recommended models:
  • Llama 3.1 8B: Best quality/performance ratio
  • Phi-3 Mini: Lightweight and fast model
  • Gemma 7B: Excellent for analytical tasks
  • Mistral 7B: Very good in French
Alternative with Cloud API:
LLM_PROVIDER=openai
LLM_MODEL=gpt-4-turbo
OPENAI_API_KEY=sk-...
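Calling the local Ollama server over its REST API can be sketched as follows (the model name assumes it was pulled beforehand with: ollama pull llama3.1):

import requests

# Ollama's generate endpoint returns the completion in the "response" field.
response = requests.post(
    "http://localhost:11434/api/generate",
    json={"model": "llama3.1", "prompt": "Summarize the retrieved context...", "stream": False},
)
print(response.json()["response"])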

8. Redis (Port 6379)

Distributed cache and message queue. Uses:
  • Recent embeddings cache
  • Asynchronous task queue (with Celery)
  • Session management
  • Rate limiting
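A sketch of the embeddings cache (the key scheme and one-hour TTL are illustrative choices, not OpenRAG settings):

import hashlib
import json
import redis

r = redis.Redis(host="localhost", port=6379)

def cached_embedding(text: str, compute) -> list[float]:
    # Key recent embeddings by a hash of the input text.
    key = "emb:" + hashlib.sha256(text.encode()).hexdigest()
    hit = r.get(key)
    if hit is not None:
        return json.loads(hit)
    vector = compute(text)
    r.setex(key, 3600, json.dumps(vector))  # expire after one hour
    return vector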

Data Flow

Document Ingestion

  1. Upload: Document sent via API
  2. Storage: Saved in MinIO
  3. Extraction: Text extraction according to format
  4. Chunking: Split into chunks of ~512 tokens (see the sketch after this list)
  5. Embedding: Vector generation for each chunk
  6. Indexing: Vector storage in Qdrant
  7. Metadata: Recording in PostgreSQL
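A naive version of the chunking step, using a word count as a stand-in for real tokenization (a production pipeline would use the embedding model's tokenizer, and the overlap value is an illustrative choice):

def split_into_chunks(text: str, max_tokens: int = 512, overlap: int = 50) -> list[str]:
    # Approximate tokens by whitespace-separated words for illustration.
    words = text.split()
    step = max_tokens - overlap  # overlapping windows preserve context at boundaries
    return [" ".join(words[i:i + max_tokens]) for i in range(0, len(words), step)]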

Query Processing

  1. Query: User’s question
  2. Embedding: Query vectorization
  3. Search: Top-K search in Qdrant (cosine similarity)
  4. Retrieval: Getting chunk contents
  5. Context: Context assembly for LLM
  6. Generation: LLM generates response
  7. Logging: Query and response recording
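Step 5 (context assembly) can be sketched as a simple prompt builder; the template below is illustrative, not the prompt OpenRAG actually uses:

def build_prompt(question: str, chunks: list[str]) -> str:
    # Number the retrieved chunks so the LLM can reference its sources.
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )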

Scalability

Horizontal Scaling

Each service can be scaled independently, for example in docker-compose.yml:
services:
  orchestrator:
    deploy:
      replicas: 3

Optimizations

Embedding Service:
  • Use GPU for embeddings (10-50x faster)
  • Increase batch size
  • Cache frequent embeddings in Redis
Qdrant:
  • Enable HNSW index optimization
  • Sharding for large collections (>10M vectors)
  • Quantization to reduce memory footprint
Ollama:
  • Use multiple GPUs
  • Enable query parallelism
  • Optimize model parameters
PostgreSQL:
  • Index frequently queried columns
  • Partitioning for large tables
  • Connection pooling (PgBouncer)

Security

OpenRAG ships with a basic security configuration. Strengthen it before any production deployment!

To implement for production

  • Authentication: JWT tokens, OAuth2
  • Authorization: RBAC (Role-Based Access Control)
  • HTTPS/TLS: Communication encryption
  • Secrets Management: Vault, AWS Secrets Manager
  • Network Policies: Service isolation
  • Input Validation: Injection protection
  • Rate Limiting: DDoS protection
  • Audit Logging: Operation traceability
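As one example of the authentication item, validating a JWT bearer token in the FastAPI gateway might look like this (a sketch using the PyJWT package; the secret, algorithm, and route are placeholders):

import jwt  # PyJWT
from fastapi import Depends, FastAPI, HTTPException
from fastapi.security import HTTPAuthorizationCredentials, HTTPBearer

app = FastAPI()
bearer = HTTPBearer()
SECRET_KEY = "change-me"  # load from a secrets manager in production

def current_user(creds: HTTPAuthorizationCredentials = Depends(bearer)) -> dict:
    # Reject requests whose bearer token fails signature or expiry checks.
    try:
        return jwt.decode(creds.credentials, SECRET_KEY, algorithms=["HS256"])
    except jwt.PyJWTError:
        raise HTTPException(status_code=401, detail="Invalid token")

@app.get("/documents")
def list_documents(user: dict = Depends(current_user)):
    return {"user": user.get("sub"), "documents": []}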

Monitoring

Important Metrics

  • API: Requests/sec, latency, error rate
  • Orchestrator: Jobs in progress, average duration
  • Qdrant: Search time, collection size
  • PostgreSQL: Connections, slow queries
  • Ollama: Tokens/sec, GPU/CPU usage

Monitoring Stack (optional)

docker-compose --profile monitoring up -d
  • Prometheus: Metrics collection
  • Grafana: Visualization (pre-configured dashboards)
  • Loki: Log aggregation

Next Steps