# Embedding & Chunking Service
The Embedding service is responsible for transforming document text into vector representations that can be searched semantically. It handles two critical tasks: chunking (splitting documents into manageable pieces) and embedding (converting text to vectors).
## What is Chunking?

Chunking is the process of splitting large documents into smaller, overlapping segments of text. This is essential for RAG systems because:

- **LLM Context Limits**: Language models have token limits (~8K-32K tokens)
- **Semantic Precision**: Smaller chunks provide more precise context
- **Better Retrieval**: Focused chunks improve search relevance
## Chunking Strategy in OpenRAG

```python
# Current configuration (configurable)
CHUNK_SIZE = 1000      # characters per chunk
CHUNK_OVERLAP = 200    # overlap between consecutive chunks
MIN_CHUNK_SIZE = 100   # minimum chunk size
```
**Example:**

```text
Original Document (5000 characters)
        ↓
Chunk 1: characters 0-1000     (1000 chars)
Chunk 2: characters 800-1800   (1000 chars, 200 overlap)
Chunk 3: characters 1600-2600  (1000 chars, 200 overlap)
Chunk 4: characters 2400-3400  (1000 chars, 200 overlap)
Chunk 5: characters 3200-4200  (1000 chars, 200 overlap)
Chunk 6: characters 4000-5000  (1000 chars, 200 overlap)
```
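The boundary arithmetic above follows from a fixed stride of `chunk_size - chunk_overlap`. A minimal sketch (illustrative only, not the actual OpenRAG implementation, which is sentence-aware):

```python
CHUNK_SIZE = 1000
CHUNK_OVERLAP = 200
STRIDE = CHUNK_SIZE - CHUNK_OVERLAP  # each chunk starts 800 chars after the last

def chunk_positions(doc_length):
    """Return (start, end) character offsets for fixed-size overlapping chunks."""
    positions = []
    start = 0
    while start < doc_length:
        end = min(start + CHUNK_SIZE, doc_length)
        positions.append((start, end))
        if end == doc_length:
            break
        start += STRIDE
    return positions

print(chunk_positions(5000))
# [(0, 1000), (800, 1800), (1600, 2600), (2400, 3400), (3200, 4200), (4000, 5000)]
```

Note how the six ranges reproduce the example table exactly: each chunk re-reads the last 200 characters of its predecessor.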
### Why Overlap?

- Ensures important information isn't split mid-sentence
- Provides context continuity between chunks
- Improves retrieval accuracy by avoiding boundary effects
### Chunking Method

OpenRAG uses recursive character splitting with sentence awareness:

1. **Sentence Detection**: Uses spaCy to detect sentence boundaries
2. **Smart Splitting**: Tries to break at sentence boundaries, not mid-sentence
3. **Metadata Preservation**: Each chunk retains document metadata (filename, page, position)
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)
```
## What is Embedding?

Embedding is the process of converting text into a numerical vector representation. Similar texts produce similar vectors, enabling semantic search.

**Characteristics of the default model** (`sentence-transformers/all-MiniLM-L6-v2`):

- **Dimensions**: 384 (each text becomes a 384-dimensional vector)
- **Model Size**: 80 MB
- **Speed**: ~1000 sentences/second on CPU
- **Quality**: Good balance between speed and accuracy
- **Language**: Optimized for English; works reasonably well on other languages
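Semantic search works by comparing these vectors, typically with cosine similarity. A toy sketch with 3-dimensional vectors (real model output is 384-dimensional; the example vectors are made up for illustration):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 = identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

query     = [0.1, 0.9, 0.0]   # embedding of the user's question
similar   = [0.2, 0.8, 0.1]   # a chunk on the same topic
unrelated = [0.9, -0.1, 0.3]  # a chunk on a different topic

# The topically similar chunk scores higher than the unrelated one:
print(cosine_similarity(query, similar) > cosine_similarity(query, unrelated))  # True
```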
**Alternatives** (configurable in settings):

| Model | Dimensions | Size | Speed | Quality |
|-------|------------|------|-------|---------|
| all-MiniLM-L6-v2 | 384 | 80 MB | Fast | Good |
| all-mpnet-base-v2 | 768 | 420 MB | Medium | Excellent |
| multilingual-e5-base | 768 | 1.1 GB | Medium | Multilingual |
| instructor-large | 768 | 1.3 GB | Slow | Excellent |
## Embedding Process

**Example vector output:**

```python
text = "How to configure an automated attendant?"

# After embedding:
vector = [
    0.0234, -0.1234, 0.5678, ..., 0.0912  # 384 dimensions
]

# Vector properties:
len(vector)    # 384
type(vector)   # numpy.ndarray
vector.shape   # (384,)
```
## Service Architecture

### Docker Service Configuration

```yaml
# docker-compose.yml
embedding:
  build: ./embedding-service
  container_name: openrag-embedding
  ports:
    - "8002:8002"
  environment:
    - MODEL_NAME=sentence-transformers/all-MiniLM-L6-v2
    - BATCH_SIZE=32
    - MAX_LENGTH=512
  volumes:
    - ./models:/models  # Cache for model weights
  networks:
    - openrag-network
```
### API Endpoints

The embedding service exposes a FastAPI server on port 8002 (internal only).
#### POST /embed

```bash
curl -X POST http://localhost:8002/embed \
  -H "Content-Type: application/json" \
  -d '{
    "texts": [
      "First chunk of text",
      "Second chunk of text"
    ]
  }'
```
Response:

```json
{
  "embeddings": [
    [0.0234, -0.1234, ...],  // 384 dimensions
    [0.0567, -0.0891, ...]   // 384 dimensions
  ],
  "model": "all-MiniLM-L6-v2",
  "dimensions": 384
}
```
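The same call can be made from Python. A stdlib-only client sketch (assumes the service above is reachable on `localhost:8002`; the helper names are illustrative):

```python
import json
import urllib.request

def build_embed_request(texts, url="http://localhost:8002/embed"):
    """Construct the POST /embed request with a JSON body, as in the curl example."""
    payload = json.dumps({"texts": texts}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
    )

def embed(texts):
    """Send the request and return the list of embedding vectors."""
    with urllib.request.urlopen(build_embed_request(texts)) as resp:
        return json.loads(resp.read())["embeddings"]

# With the service running:
# vectors = embed(["First chunk of text", "Second chunk of text"])
# len(vectors[0])  # 384
```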
#### POST /chunk

```bash
curl -X POST http://localhost:8002/chunk \
  -H "Content-Type: application/json" \
  -d '{
    "text": "Long document text here...",
    "chunk_size": 1000,
    "chunk_overlap": 200
  }'
```
Response:

```json
{
  "chunks": [
    {
      "text": "First chunk...",
      "start_pos": 0,
      "end_pos": 1000
    },
    {
      "text": "Second chunk...",
      "start_pos": 800,
      "end_pos": 1800
    }
  ],
  "total_chunks": 25
}
```
#### GET /health

```bash
curl http://localhost:8002/health
```
## Performance

**Test setup**: 31 PDF documents, 456 pages total, ~2.3 MB of text content

### Chunking

| Metric | Value |
|--------|-------|
| Average chunks per document | 30-50 chunks |
| Chunking speed | ~5000 characters/second |
| Time per document | 2-5 seconds |

### Embedding

**Hardware**: CPU-only mode (no GPU)

| Metric | Value |
|--------|-------|
| Embeddings/second | ~1000 chunks/second |
| Time per document (avg. 40 chunks) | ~0.04 seconds |
| Batch processing (32 chunks) | ~0.03 seconds |
| Total time for 31 PDFs (928 chunks) | ~1 second |

**With GPU (NVIDIA RTX 3060):**

| Metric | Value |
|--------|-------|
| Embeddings/second | ~5000 chunks/second |
| Time per document | <0.01 seconds |
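The CPU numbers are internally consistent, as a quick back-of-envelope check shows (using the figures from the tables above):

```python
# Back-of-envelope check of the CPU benchmark figures.
total_chunks = 928     # from the 31-PDF test set
batch_size = 32
time_per_batch = 0.03  # seconds per batch of 32 chunks

batches = -(-total_chunks // batch_size)  # ceiling division
total_time = batches * time_per_batch

print(batches)                 # 29
print(round(total_time, 2))    # 0.87 -> matches the "~1 second" total above
```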
## Configuration Options

### Environment Variables

```bash
# In docker-compose.yml or .env

# Model selection
EMBEDDING_MODEL=sentence-transformers/all-MiniLM-L6-v2

# Chunking parameters
CHUNK_SIZE=1000
CHUNK_OVERLAP=200
MIN_CHUNK_SIZE=100

# Performance tuning
EMBEDDING_BATCH_SIZE=32
MAX_SEQUENCE_LENGTH=512
DEVICE=cpu  # or cuda for GPU
```
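At startup, the service reads these variables from its environment. A sketch of what that parsing might look like (illustrative; variable names are from the block above, the parsing code itself is an assumption about the implementation):

```python
import os

# Fall back to the documented defaults when a variable is unset.
CHUNK_SIZE = int(os.environ.get("CHUNK_SIZE", 1000))
CHUNK_OVERLAP = int(os.environ.get("CHUNK_OVERLAP", 200))
MIN_CHUNK_SIZE = int(os.environ.get("MIN_CHUNK_SIZE", 100))
DEVICE = os.environ.get("DEVICE", "cpu")

# Sanity checks: an overlap >= chunk size would never advance the window.
assert 0 <= CHUNK_OVERLAP < CHUNK_SIZE
assert MIN_CHUNK_SIZE <= CHUNK_SIZE
```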
### Changing the Embedding Model

To use a different model:

1. Update `EMBEDDING_MODEL` in Docker Compose.
2. Restart the embedding service:

   ```bash
   sudo docker-compose restart embedding
   ```

3. Re-index existing documents (all documents must be embedded with the same model):

   ```bash
   # Delete old collection
   curl -X DELETE http://localhost:6333/collections/default

   # Re-upload documents; they will be re-chunked and re-embedded
   ```

> **Warning**: Changing models requires re-indexing all documents. Vectors from different models are not compatible.
## Advanced: Custom Chunking Strategies

For specialized use cases, you can implement custom chunking.

### Semantic Chunking

Split based on semantic similarity rather than fixed size:

```python
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    embeddings=embedding_model,
    breakpoint_threshold_type="percentile"
)
```
### Hierarchical Chunking

Create parent-child chunk relationships:

```python
# Parent chunks (large context)
parent_size = 2000

# Child chunks (precise retrieval)
child_size = 400
```
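Those two sizes can be wired together as follows. This is a minimal sketch of the parent/child idea (fixed-size splitting for brevity; a real implementation would also carry character offsets and document metadata on each chunk):

```python
def split_fixed(text, size):
    """Split text into consecutive fixed-size pieces (last piece may be shorter)."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def hierarchical_chunks(text, parent_size=2000, child_size=400):
    """Small child chunks for precise retrieval, each linked to its large parent."""
    chunks = []
    for parent_id, parent in enumerate(split_fixed(text, parent_size)):
        for child in split_fixed(parent, child_size):
            chunks.append({"parent_id": parent_id, "text": child})
    return chunks

doc = "x" * 4500
result = hierarchical_chunks(doc)
# 3 parents (2000 + 2000 + 500 chars); parents 0-1 yield 5 children each,
# parent 2 yields 2 (400 + 100 chars) -> 12 child chunks in total
print(len(result))  # 12
```

At query time, the child chunk is matched against the query, but its parent is what gets passed to the LLM for fuller context.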
## Monitoring

View embedding service logs:

```bash
sudo docker-compose logs -f embedding
```

Expected output:

```text
INFO: Model loaded: all-MiniLM-L6-v2 (384 dimensions)
INFO: Embedding service ready on port 8002
INFO: Processed batch of 32 chunks in 0.03s
```
## Troubleshooting

### Slow Embedding

**Symptom**: Embedding takes >10 seconds per document

**Solutions**:

- Increase batch size: `EMBEDDING_BATCH_SIZE=64`
- Use a smaller model: `all-MiniLM-L6-v2` instead of `all-mpnet-base-v2`
- Add GPU support for a ~5x speedup

### Out of Memory

**Symptom**: Service crashes with `CUDA out of memory` or `Killed`

**Solutions**:

- Reduce batch size: `EMBEDDING_BATCH_SIZE=16`
- Reduce max sequence length: `MAX_SEQUENCE_LENGTH=256`
- Use a smaller model
- Increase Docker RAM allocation
## Next Steps

- **Qdrant Vector DB**: Learn how embeddings are stored and searched
- **Ollama LLM**: Understand LLM response generation