OpenRAG Architecture
OpenRAG implements a modular and scalable microservices architecture based on the RAG (Retrieval-Augmented Generation) pattern.
Overview
Main Components
1. API Gateway (Port 8000)
REST entry point for all user interactions.
Responsibilities:
- Authentication and authorization (coming soon)
- Request validation
- Routing to the orchestrator
- Rate limiting
- API documentation (Swagger)
Technologies:
- FastAPI
- Uvicorn (ASGI server)
- Pydantic (validation)
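Gateway rate limiting is commonly implemented as a token bucket: each client gets a bucket that refills at a fixed rate, and every request spends one token. The sketch below is a minimal in-process version. The class and function names and the 5 requests/s with bursts of 10 are hypothetical values. With several gateway replicas, the bucket state would live in a shared store such as Redis rather than in process memory.

```python
import time
from typing import Dict, Optional


class TokenBucket:
    """Allows `rate` requests/second on average, with bursts of up to `capacity`."""

    def __init__(self, rate: float, capacity: float) -> None:
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity          # start with a full bucket
        self.last = time.monotonic()    # time of the previous refill

    def allow(self, now: Optional[float] = None) -> bool:
        """Refill for the elapsed time, then spend one token if available."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


# One bucket per client key (API key, user id or IP address).
_buckets: Dict[str, TokenBucket] = {}


def is_allowed(client_id: str, now: Optional[float] = None) -> bool:
    """Hypothetical gateway check: 5 requests/s sustained, bursts of 10."""
    bucket = _buckets.get(client_id)
    if bucket is None:
        bucket = _buckets[client_id] = TokenBucket(rate=5.0, capacity=10.0)
    return bucket.allow(now)
```

A rejected request would typically be answered with HTTP 429 (Too Many Requests).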
2. Orchestrator (Port 8001)
System core that coordinates the complete RAG workflow.
Responsibilities:
- Document ingestion pipeline coordination
- Query workflow management
- Inter-service communication
- Asynchronous job management
- Process monitoring
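Asynchronous job management means an ingestion request returns a job handle immediately while the pipeline runs in the background, and the client polls its status. The sketch below shows that pattern with `asyncio` only. `JobManager`, `Job` and the step names are hypothetical; with the Celery queue mentioned in the Redis section, the same state would be kept outside the orchestrator process.

```python
import asyncio
import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Awaitable, Callable, Dict, List, Sequence, Tuple

# A pipeline step receives a document id and performs one stage (store, extract, ...).
Step = Callable[[str], Awaitable[None]]


class JobStatus(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"
    FAILED = "failed"


@dataclass
class Job:
    id: str
    document_id: str
    status: JobStatus = JobStatus.PENDING
    completed_steps: List[str] = field(default_factory=list)
    error: str = ""


class JobManager:
    """Hypothetical in-memory tracker: each job runs its steps in order as an asyncio task."""

    def __init__(self) -> None:
        self.jobs: Dict[str, Job] = {}
        self.tasks: Dict[str, "asyncio.Task[None]"] = {}

    def submit(self, document_id: str, steps: Sequence[Tuple[str, Step]]) -> Job:
        """Register a job and schedule it; returns at once (call inside a running event loop)."""
        job = Job(id=str(uuid.uuid4()), document_id=document_id)
        self.jobs[job.id] = job
        self.tasks[job.id] = asyncio.create_task(self._run(job, steps))
        return job

    async def _run(self, job: Job, steps: Sequence[Tuple[str, Step]]) -> None:
        job.status = JobStatus.RUNNING
        try:
            for name, step in steps:
                await step(job.document_id)
                job.completed_steps.append(name)
            job.status = JobStatus.DONE
        except Exception as exc:  # record the failure so the client can see it when polling
            job.status = JobStatus.FAILED
            job.error = f"{type(exc).__name__}: {exc}"
```

Recording `completed_steps` makes a failed job easy to diagnose, since it shows which stage of the ingestion pipeline broke.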
3. Embedding Service (Port 8002)
Specialized service for generating vector embeddings.
Responsibilities:
- Text embedding generation
- Batch processing support
- Performance optimization (GPU if available)
Supported models:
- sentence-transformers/all-MiniLM-L6-v2 (default, 384 dimensions)
- sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2 (multilingual)
- sentence-transformers/all-mpnet-base-v2 (better quality)
- Custom sentence-transformers-compatible models
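Batch processing can be sketched as splitting the input into fixed-size batches, encoding each batch, and checking that every vector has the dimension the target Qdrant collection expects. In the real service the encoder would presumably be `SentenceTransformer(...).encode` from sentence-transformers, which also batches internally through its `batch_size` argument. Here a hypothetical `fake_encode` stands in so the example has no dependencies. The dimensions in `MODEL_DIMS` are those of the listed models (all-mpnet-base-v2 produces 768-dimensional vectors).

```python
from typing import Callable, Iterator, List, Sequence

Vector = List[float]
Encoder = Callable[[List[str]], List[Vector]]

# Output dimensions of the models listed above.
MODEL_DIMS = {
    "sentence-transformers/all-MiniLM-L6-v2": 384,
    "sentence-transformers/paraphrase-multilingual-MiniLM-L12-v2": 384,
    "sentence-transformers/all-mpnet-base-v2": 768,
}


def batched(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` items."""
    if size <= 0:
        raise ValueError("batch size must be positive")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def embed_all(texts: Sequence[str], encode: Encoder, model: str, batch_size: int = 32) -> List[Vector]:
    """Encode `texts` batch by batch, rejecting vectors of the wrong dimension."""
    expected = MODEL_DIMS[model]
    vectors: List[Vector] = []
    for batch in batched(texts, batch_size):
        out = encode(batch)
        for vec in out:
            if len(vec) != expected:
                raise ValueError(f"expected {expected} dimensions, got {len(vec)}")
        vectors.extend(out)
    return vectors


def fake_encode(batch: List[str]) -> List[Vector]:
    """Hypothetical stand-in for a real model: one constant 384-dim vector per text."""
    return [[float(len(text))] * 384 for text in batch]
```

The dimension check matters when switching models: vectors from all-mpnet-base-v2 cannot be written to a collection created for the 384-dimensional default model.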
4. MinIO (Ports 9000, 9001)
S3-compatible object storage for original documents.
Responsibilities:
- Persistent storage of uploaded files
- Document versioning
- Bucket management
Access:
- S3 API: http://localhost:9000
- Web Console: http://localhost:9001
5. Qdrant (Ports 6333, 6334)
Vector database for semantic search.
Responsibilities:
- Embedding vector indexing
- Similarity search (HNSW algorithm)
- Metadata filtering
- Clustering and optimization
Collections:
- documents_embeddings: default collection
- Custom collections per use case
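What a cosine-similarity search with a metadata filter returns can be illustrated with an exact, brute-force version. This is only a sketch of the semantics: HNSW is an approximate index that avoids scoring every stored vector, and the `must` dictionary is a simplified, hypothetical stand-in for Qdrant's payload filters.

```python
import heapq
import math
from typing import Dict, Iterable, List, Optional, Sequence, Tuple

# (point id, vector, payload): a simplified stand-in for a Qdrant point
Point = Tuple[str, Sequence[float], Dict[str, str]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(
    query: Sequence[float],
    points: Iterable[Point],
    top_k: int = 5,
    must: Optional[Dict[str, str]] = None,
) -> List[Tuple[str, float]]:
    """Exact top-K search: score every point whose payload matches all `must` pairs."""
    conditions = must or {}
    scored = (
        (point_id, cosine(query, vector))
        for point_id, vector, payload in points
        if all(payload.get(key) == value for key, value in conditions.items())
    )
    return heapq.nlargest(top_k, scored, key=lambda item: item[1])
```

Brute force costs one comparison per stored vector per query, which is what an HNSW index avoids on large collections at the price of occasionally missing an exact neighbor.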
6. PostgreSQL (Port 5432)
Relational database for metadata. Main schema:
7. Ollama (Port 11434)
Local LLM server for response generation.
Responsibilities:
- Language model execution
- Contextualized response generation
- Model cache management
Recommended models:
- Llama 3.1 8B: best quality/performance ratio
- Phi-3 Mini: lightweight and fast
- Gemma 7B: excellent for analytical tasks
- Mistral 7B: very good in French
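Response generation goes through Ollama's HTTP API. The sketch below builds a non-streaming request to the `/api/generate` endpoint using only the standard library. With `"stream": false`, Ollama returns a single JSON object whose `response` field holds the generated text. The helper names are hypothetical, and the model tag (`llama3.1:8b` by default here) is assumed to have been pulled beforehand with `ollama pull`.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def build_generate_request(prompt: str, model: str = "llama3.1:8b") -> urllib.request.Request:
    """Build a non-streaming POST request for Ollama's /api/generate endpoint."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False}).encode("utf-8")
    return urllib.request.Request(
        OLLAMA_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(prompt: str, model: str = "llama3.1:8b", timeout: float = 120.0) -> str:
    """Send the request and return the generated text (the "response" field)."""
    with urllib.request.urlopen(build_generate_request(prompt, model), timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))["response"]
```

Switching to a lighter model is then just a different tag, e.g. `generate(prompt, model="phi3:mini")`.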
8. Redis (Port 6379)
Distributed cache and message queue.
Uses:
- Recent embeddings cache
- Asynchronous task queue (with Celery)
- Session management
- Rate limiting
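The embeddings cache can be sketched as a lookup keyed by a hash of the model name and the text, so the same chunk or query is never encoded twice while its entry is alive. In production the store would be a Redis client (redis-py exposes `get(name)` and `set(name, value, ex=seconds)`). Here a small in-memory class with the same two methods stands in. The `emb:` key prefix and the one-hour TTL are assumptions.

```python
import hashlib
import json
import time
from typing import Callable, Dict, List, Optional, Tuple

Vector = List[float]


class InMemoryStore:
    """Stand-in for a Redis client: only get() and set(..., ex=seconds)."""

    def __init__(self) -> None:
        self._data: Dict[str, Tuple[str, float]] = {}

    def get(self, key: str) -> Optional[str]:
        item = self._data.get(key)
        if item is None or item[1] < time.monotonic():
            self._data.pop(key, None)  # drop expired entries lazily
            return None
        return item[0]

    def set(self, key: str, value: str, ex: int) -> None:
        self._data[key] = (value, time.monotonic() + ex)


def cache_key(model: str, text: str) -> str:
    """Include the model name so vectors from different models never collide."""
    digest = hashlib.sha256(f"{model}\x00{text}".encode("utf-8")).hexdigest()
    return f"emb:{model}:{digest}"


def cached_embed(store, model: str, text: str, encode: Callable[[str], Vector], ttl: int = 3600) -> Vector:
    """Return the cached vector if present; otherwise compute, cache and return it."""
    key = cache_key(model, text)
    hit = store.get(key)
    if hit is not None:
        return json.loads(hit)  # json.loads also accepts the bytes redis-py returns
    vector = encode(text)
    store.set(key, json.dumps(vector), ex=ttl)
    return vector
```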
Data Flow
Document Ingestion
- Upload: Document sent via API
- Storage: Saved in MinIO
- Extraction: Text extracted according to the file format
- Chunking: Text split into chunks of ~512 tokens
- Embedding: A vector generated for each chunk
- Indexing: Vectors stored in Qdrant
- Metadata: Recorded in PostgreSQL
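The chunking step can be sketched as a sliding window over tokens. This sketch approximates tokens by whitespace-separated words, which is an assumption: the service would more likely count tokens with the embedding model's tokenizer. The `overlap` parameter is also a hypothetical addition; with its default of 0, chunks do not overlap.

```python
from typing import List


def chunk_text(text: str, max_tokens: int = 512, overlap: int = 0) -> List[str]:
    """Split text into chunks of at most `max_tokens` whitespace-separated words.

    Consecutive chunks share `overlap` words (0 means no overlap).
    """
    if max_tokens <= 0 or not 0 <= overlap < max_tokens:
        raise ValueError("need max_tokens > 0 and 0 <= overlap < max_tokens")
    words = text.split()
    step = max_tokens - overlap
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_tokens]))
        if start + max_tokens >= len(words):
            break  # this window already reached the end of the text
    return chunks
```

For a 1,200-word document the default settings give three chunks of 512, 512 and 176 words.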
Query Processing
- Query: User’s question
- Embedding: Query vectorization
- Search: Top-K search in Qdrant (cosine similarity)
- Retrieval: Contents of the matching chunks fetched
- Context: Retrieved chunks assembled into the LLM context
- Generation: LLM generates the response
- Logging: Query and response recorded
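The context-assembly step can be sketched as concatenating the best-scoring chunks until a size budget is reached and inserting them into a prompt. The `Chunk` type, the prompt template and the 1,500-word budget are hypothetical. The chunks are assumed to arrive sorted by similarity score, as returned by the Top-K search.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Chunk:
    document: str   # source document name (hypothetical payload field)
    text: str
    score: float    # cosine similarity from the vector search


PROMPT_TEMPLATE = (
    "Answer the question using only the context below. "
    "If the context is insufficient, say so.\n\n"
    "Context:\n{context}\n\nQuestion: {question}\nAnswer:"
)


def build_prompt(question: str, chunks: Sequence[Chunk], max_words: int = 1500) -> str:
    """Add chunks in ranking order until the next one would exceed the word budget."""
    parts: List[str] = []
    used = 0
    for i, chunk in enumerate(chunks, start=1):
        n = len(chunk.text.split())
        if used + n > max_words:
            break
        parts.append(f"[{i}] ({chunk.document}) {chunk.text}")
        used += n
    return PROMPT_TEMPLATE.format(context="\n".join(parts), question=question)
```

Numbering the chunks and keeping the document name lets the model cite its sources and keeps the prompt within the model's context window.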
Scalability
Horizontal Scaling
Each service can be scaled independently.
Optimizations
Embedding Service
- Use GPU for embeddings (10-50x faster)
- Increase batch size
- Cache frequent embeddings in Redis
Qdrant
- Enable HNSW index optimization
- Sharding for large collections (>10M vectors)
- Quantization to reduce memory footprint
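A rough estimate of raw vector memory shows why quantization pays off at scale. Assuming float32 vectors (4 bytes per dimension) and Qdrant's int8 scalar quantization (1 byte per dimension), 10 million 384-dimensional vectors take 15.36 GB unquantized versus 3.84 GB quantized, a 4x reduction. These figures cover the vectors alone. They ignore the HNSW graph, payloads, storage overhead, and whether the original vectors are also kept (for example on disk).

```python
def vector_memory_bytes(num_vectors: int, dim: int, bytes_per_component: int) -> int:
    """Raw storage for the vectors alone (no index, payload or overhead)."""
    return num_vectors * dim * bytes_per_component


FLOAT32, INT8 = 4, 1

for label, width in (("float32", FLOAT32), ("int8 (scalar quantization)", INT8)):
    gigabytes = vector_memory_bytes(10_000_000, 384, width) / 1e9
    print(f"{label}: {gigabytes:.2f} GB")
```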
Ollama
- Use multiple GPUs
- Enable parallel request handling (e.g. the OLLAMA_NUM_PARALLEL environment variable)
- Tune model parameters (e.g. context length, quantization level)
PostgreSQL
- Index frequently queried columns
- Partitioning for large tables
- Connection pooling (PgBouncer)
Security
To implement before going to production:
- ✅ Authentication: JWT tokens, OAuth2
- ✅ Authorization: RBAC (Role-Based Access Control)
- ✅ HTTPS/TLS: Communication encryption
- ✅ Secrets Management: Vault, AWS Secrets Manager
- ✅ Network Policies: Service isolation
- ✅ Input Validation: Injection protection
- ✅ Rate Limiting: Abuse and DoS mitigation
- ✅ Audit Logging: Operation traceability
Monitoring
Important Metrics
- API: Requests/sec, latency, error rate
- Orchestrator: Jobs in progress, average duration
- Qdrant: Search time, collection size
- PostgreSQL: Connections, slow queries
- Ollama: Tokens/sec, GPU/CPU usage
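Several of these metrics (requests/sec, latency, error rate) are simple aggregates over request records. The sketch below computes them from an in-memory list over one time window. The `RequestRecord` shape and the choice to count only 5xx responses as errors are assumptions. With the optional Prometheus stack, these values would instead be exported as counters and histograms and aggregated by Prometheus queries.

```python
import math
from dataclasses import dataclass
from typing import Dict, Sequence


@dataclass
class RequestRecord:
    latency_ms: float
    status: int  # HTTP status code returned by the API


def percentile(values: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile, 0 < pct <= 100."""
    if not values or not 0 < pct <= 100:
        raise ValueError("need at least one value and 0 < pct <= 100")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[rank - 1]


def summarize(records: Sequence[RequestRecord], window_s: float) -> Dict[str, float]:
    """Requests/sec, error rate and latency percentiles over one time window.

    Assumption: only 5xx responses count as errors.
    """
    if not records or window_s <= 0:
        raise ValueError("need records and a positive window")
    latencies = [r.latency_ms for r in records]
    errors = sum(1 for r in records if r.status >= 500)
    return {
        "requests_per_sec": len(records) / window_s,
        "error_rate": errors / len(records),
        "p50_latency_ms": percentile(latencies, 50),
        "p95_latency_ms": percentile(latencies, 95),
    }
```

Percentiles are preferred over the mean for latency because a few slow LLM generations can hide behind a healthy average.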
Monitoring Stack (optional)
- Prometheus: Metrics collection
- Grafana: Visualization (pre-configured dashboards)
- Loki: Log aggregation