OpenRAG Experiment — WTE Corpus (February 18, 2026)

Context

Test of the OpenRAG system with a limited corpus of 33 WTE (Workplace Together Essentials) documents in French, primarily Cisco technical guides and contractual documentation.

Objective

Evaluate the RAG system’s ability to answer specific questions about Cisco DECT devices mentioned in the documentation.

Tested Configurations

Version 1: Initial Configuration (Failed)

  • Embedding Model: sentence-transformers/all-MiniLM-L6-v2
  • Dimensions: 384D
  • Optimization: English only
  • Chunking: 512 characters, overlap 50
  • Score Threshold: 0.25
  • Number of documents: 33 (WTE guides, contracts, tutorials)
Identified problem:
  • Query: “What are the DECT models on WTE?”
  • Target documents: Cisco IP DECT 6823.pdf, Guide Cisco IP DECT 6825.pdf
  • Result: DECT documents absent from top 10
  • Cause: Embedding model optimized for English only, poor semantic understanding of French content

Version 2: Improved Chunking (Partial failure)

  • Embedding Model: sentence-transformers/all-MiniLM-L6-v2 (unchanged)
  • Dimensions: 384D
  • Chunking: 2000 characters, overlap 200 ✅ IMPROVEMENT
  • Score Threshold: 0.25
  • Result: Still no DECT documents in results
Conclusion: The problem is not the chunking strategy, but the embedding model.
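For reference, the character chunking tested here (fixed-size windows with overlap) can be sketched as below; this is a minimal illustration, and the actual OpenRAG splitter may differ in details such as sentence-boundary handling.

```python
def chunk_text(text, size=2000, overlap=200):
    """Split text into fixed-size character windows sharing `overlap` characters."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap  # each new chunk starts 1800 characters after the previous one
    return [text[i:i + size] for i in range(0, len(text), step)]

# A 5000-character document yields 3 chunks at these settings.
doc = "".join(str(i % 10) for i in range(5000))
chunks = chunk_text(doc)
```

Each chunk's last 200 characters are repeated at the start of the next chunk, which is what preserves context across chunk boundaries.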

Version 3: Multilingual Model (Limited improvement)

  • Embedding Model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  • Dimensions: 768D
  • Optimization: 50+ languages, better semantic understanding
  • Chunking: 2000 characters, overlap 200
  • Score Threshold: 0.20 (lowered for better recall)
  • Max Results: 20 (increased from 10)
Migration performed:
  1. Deleted Qdrant collection (384D)
  2. Updated .env and docker-compose.yml
  3. Full reprocessing of 33 documents → 237 vectors at 768D
  4. Restarted all services

Test Results

Test 1: DECT Query (v3 — multilingual model 768D)

Query: "What are the DECT models on WTE?"
Top 20 results (by relevance score):

| Position | Document | Score | Relevance |
|----------|----------|-------|-----------|
| 1 | contrats-next-obs_ds_4765.pdf | 0.619 | ❌ False positive |
| 2 | contrats-next-obs_ds_4765.pdf | 0.609 | ❌ False positive |
| 3 | WTE - Tuto Collecte donnees - Orange Install.pdf | 0.524 | ⚠️ Mentions generic “DECT” |
| 4 | contrats-next-obs_ann_4762.pdf | 0.515 | ❌ False positive |
| 5 | WTE - Tuto Mon parcours en vie de solution_Vdiff.pdf | 0.472 | ❌ Not relevant |
| 6–11 | contrats-next-obs (various) | 0.468–0.420 | ❌ False positives |
| 12 | Guide Cisco IP DECT 6825.pdf | 0.414 | ✅ TARGET |
| 13 | Cisco IP DECT 6823.pdf | 0.414 | ✅ TARGET |
| 14–20 | WTE - Tuto (various) | 0.412–0.379 | ❌ Not relevant |
Analysis:
  • ✅ DECT documents found but ranked at positions 12–13
  • ❌ Outside the default top 10 limit
  • ❌ Score 0.414 lower than contractual documents (0.619)
  • ⚠️ The “Orange Install” document (0.524) mentions “IP or DECT phones” in the WTE context, explaining its higher score
Content of DECT chunks:
Cisco IP DECT 6823.pdf: "sur Enregistrer -> «Sauvegarder» pour enregistrer le numéro (Facultatif) Mettez en surbrillance un champ pour ajouter ou remplacer d'autres informations..."

Guide Cisco IP DECT 6825.pdf: "sieurs utilisateurs ou bornes via User Hub, effectuez toutes les actions en une seule opération. Ensuite, attendez environ 90 secondes..."
Identified problem:
  • The technical content of the chunks does not explicitly mention “model 6823” or “model 6825”
  • This information is in the filename, not in the indexed content
  • The RAG system cannot link the question “what models” to the filenames

Test 2: Generic DECT Query

Query: "cisco DECT configuration"
Similar results: contractual documents ranked first, DECT guides at positions 10–15.

WTE Corpus Statistics

  • Total documents: 33 documents
  • Status: 33 uploaded, 0 failed
  • Total indexed vectors: 237 (768D)
  • Chunks per document: 1–49 chunks (average ~7 chunks)
  • Chunk size: 2000 characters, overlap 200
  • Languages: Primarily French
  • Types: PDF technical guides, contracts, tutorials
DECT documents:
  • Cisco IP DECT 6823.pdf: 5 chunks
  • Guide Cisco IP DECT 6825.pdf: 5 chunks
  • Total: 10 DECT chunks / 237 total chunks (4.2%)
Dominant documents (by chunk count):
  1. contrats-next-obs_ds_4765.pdf: 49 chunks (20.7%)
  2. contrats-next-obs_ann_4762.pdf: 29 chunks (12.2%)
  3. WTE - Formation WTE Hub: 14 chunks (5.9%)

Identified Problems

1. Insufficient Corpus

  • ❌ Only 33 documents, 237 vectors
  • ❌ 10 DECT chunks (4.2%) buried among 227 non-DECT chunks
  • ❌ Contractual documents (70+ chunks) dominate the results

2. Content Quality

  • ❌ DECT chunks are highly technical (configuration, procedures)
  • ❌ No description of the device models themselves
  • ❌ “Model 6823/6825” information only exists in the filenames, not in indexed content
  • ❌ Content cannot answer “what are the models?”

3. Architectural Limitations

  • ❌ Filename not included in the vector search context
  • ❌ Qdrant payload: metadata.source_file exists but is not exploited for ranking
  • ⚠️ Score threshold 0.20 too restrictive for a small corpus
  • ⚠️ Max results 10 insufficient to surface documents at positions 12–13
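The interaction between the score threshold and the result limit can be illustrated with the scores from Test 1. This is a sketch only: the individual scores for positions 6–11 are interpolated from the reported 0.468–0.420 range, and `retrieve` is a hypothetical stand-in for the actual search endpoint.

```python
# Ranked hits approximating Test 1: 11 non-DECT chunks outrank the two DECT guides.
hits = (
    [("contrats-next-obs_ds_4765.pdf", 0.619),
     ("contrats-next-obs_ds_4765.pdf", 0.609),
     ("WTE - Tuto Collecte donnees - Orange Install.pdf", 0.524),
     ("contrats-next-obs_ann_4762.pdf", 0.515),
     ("WTE - Tuto Mon parcours en vie de solution_Vdiff.pdf", 0.472)]
    + [("contrats-next-obs (various)", round(0.468 - 0.01 * i, 3)) for i in range(6)]
    + [("Guide Cisco IP DECT 6825.pdf", 0.414),
       ("Cisco IP DECT 6823.pdf", 0.414)]
)

def retrieve(hits, score_threshold=0.20, max_results=10):
    """Apply the score threshold, then truncate to max_results (highest first)."""
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    return [(doc, s) for doc, s in ranked if s >= score_threshold][:max_results]

top10 = retrieve(hits, max_results=10)
top20 = retrieve(hits, max_results=20)
```

With `max_results=10` the two DECT guides (score 0.414, positions 12–13) are cut off by the limit before the 0.20 threshold ever comes into play, matching the observed behavior.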

4. Model Performance

  • ⚠️ Multilingual model better than English-only, but insufficient for this task
  • ⚠️ Scores: 0.414 for DECT documents vs 0.619 for contracts
  • ❌ The model does not understand that “6825” in a filename means “model 6825”

Attempted Solutions

✅ Successful

  1. Migration to multilingual model (768D) — DECT documents now found
  2. Increased chunk size (512 → 2000) — better context
  3. Lowered threshold (0.25 → 0.20) — more results returned
  4. Increased max_results (10 → 20) — DECT documents now visible

❌ Insufficient

  1. Multilingual model does not solve the limited corpus problem
  2. Improved chunking does not change the fact that the information is not in chunks
  3. Lower threshold does not change the relative ranking

🔄 Not Tested (Future Directions)

  1. Filename enrichment: include the filename in indexed chunk content
  2. Hybrid search (keyword + semantic) to match “6823”, “6825”
  3. Reranking with filename matching
  4. Weighted metadata (boost if filename matches query terms)
  5. Entity extraction (identify “6823” as a product reference)
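Directions 1 and 4 could be prototyped in a few lines, as sketched below; `enrich_chunk` and `keyword_boost` are hypothetical helper names for illustration, not part of OpenRAG.

```python
import re

def enrich_chunk(source_file, chunk_text):
    """Direction 1 (hypothetical): prepend the filename to the indexed content
    so product numbers like 6823/6825 become searchable."""
    return f"[Document: {source_file}]\n{chunk_text}"

def keyword_boost(query, source_file, base_score, weight=0.1):
    """Direction 4 (hypothetical): boost the semantic score when query tokens
    also appear in the filename."""
    q_tokens = set(re.findall(r"\w+", query.lower()))
    f_tokens = set(re.findall(r"\w+", source_file.lower()))
    return base_score + weight * len(q_tokens & f_tokens)

# "DECT" matches in the filename, lifting the guide from 0.414 to 0.514.
boosted = keyword_boost("What are the DECT models on WTE?",
                        "Guide Cisco IP DECT 6825.pdf", 0.414)
```

Contract documents get no boost from this query (no token overlap with their filenames), so even this naive scheme would narrow the 0.619 vs 0.414 gap.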

Conclusions

Key Finding

Validated hypothesis: ✅
“With an insufficient corpus (33 documents, 237 vectors), a RAG system cannot provide relevant results, even with a high-quality multilingual embedding model.”
Evidence:
  1. DECT documents exist in the database (10 indexed chunks)
  2. The multilingual 768D model retrieves them (positions 12–13, score 0.414)
  3. But they are buried by the more voluminous contractual documents
  4. DECT chunk content does not contain the searched information (“model 6823/6825”)
  5. The system therefore cannot correctly answer “What are the DECT models available?”
Test limitations:
  • 📉 Corpus too small (33 docs) for statistically significant results
  • 📉 Imbalance: 70 contract chunks vs 10 DECT chunks
  • 📉 Document quality: technical procedure guides vs product descriptions
  • 📉 No document describing the “list of available DECT models”

Recommendations

To properly test OpenRAG:
  1. Minimum corpus: 500–1000 documents
  2. Well-documented domain: science, medicine, engineering
  3. Public dataset: Wikipedia, arXiv, PubMed
  4. Structured content: descriptions, lists, tables
  5. Varied queries: factual, comparative, exploratory
Suggested datasets:
  • Wikipedia (sciences): 100K+ articles
  • arXiv (computer science / physics): 2M+ papers
  • PubMed (medicine): 35M+ abstracts
  • Gutenberg (literature): 70K+ books

Logs and Commands Executed

Migration 384D → 768D

# 1. Update configuration
.env (old → new):
  QDRANT_VECTOR_SIZE=384 → 768
  EMBEDDING_MODEL=all-MiniLM-L6-v2 → paraphrase-multilingual-mpnet-base-v2
  CHUNK_SIZE=512 → 2000
  CHUNK_OVERLAP=50 → 200

# 2. Delete old collection
curl -X DELETE http://localhost:6333/collections/documents_embeddings
# {"result":true,"status":"ok","time":0.001186893}

# 3. Reset database
docker-compose exec postgres psql -U openrag_user -d openrag_db
DELETE FROM document_chunks; -- 935 rows deleted
UPDATE documents SET status = 'uploaded' WHERE status = 'processed'; -- 33 rows updated

# 4. Restart services
docker-compose down
docker-compose up -d

# 5. Verify configuration
docker-compose exec orchestrator printenv | grep -E "QDRANT|EMBEDDING|CHUNK"
# QDRANT_VECTOR_SIZE=768
# EMBEDDING_MODEL=paraphrase-multilingual-mpnet-base-v2
# CHUNK_SIZE=2000
# CHUNK_OVERLAP=200

# 6. Reprocess documents
docker cp scripts/reprocess_documents.py openrag-orchestrator:/app/
docker-compose exec orchestrator python reprocess_documents.py
# [33/33] Processing complete! ✨
# Total: 237 vectors indexed

# 7. Verify Qdrant
curl -s http://localhost:6333/collections/documents_embeddings | python3 -m json.tool
# "points_count": 237
# "vectors": {"size": 768}

Search Tests

# Test 1: Without LLM, top 10
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the DECT models on WTE?", "use_llm": false, "max_results": 10}'
# Result: 10 contractual documents, no DECT

# Test 2: Without LLM, top 20
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the DECT models on WTE?", "use_llm": false, "max_results": 20}'
# Result: DECT documents at positions 12-13

# Test 3: Verify DECT indexation
curl -s -X POST http://localhost:6333/collections/documents_embeddings/points/scroll \
  -H "Content-Type: application/json" \
  -d '{"limit": 300, "with_payload": true, "with_vector": false}' | \
  python3 -c "import sys, json; d=json.load(sys.stdin); \
  dect=[p for p in d['result']['points'] if 'DECT' in p['payload']['metadata'].get('source_file','')]; \
  print(f'DECT chunks: {len(dect)}')"
# DECT chunks: 10 ✅

Processing Times

  • Reprocessing 33 documents: ~600 seconds (10 minutes)
  • Throughput: ~3–4 documents/minute
  • 768D embedding: ~100–150ms per chunk
  • Qdrant storage: ~30–50ms per vector
  • Total vectors: 237 in ~10 minutes

Technical Metadata

Infrastructure:
  • Docker Compose: 10 services
  • Qdrant: 1.13.0, collection “documents_embeddings”
  • PostgreSQL: documents, document_chunks tables
  • MinIO: 33 documents stored
  • Ollama: llama3.1:8b (60–90s per response)
Models:
  • Embedding: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  • Dimensions: 768
  • Distance: Cosine similarity
  • LLM: llama3.1:8b
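Qdrant ranks results by cosine similarity over these 768-D vectors; for reference, a minimal definition of the metric (higher = more similar, 1.0 for identical directions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Because the metric ignores vector magnitude, the scores in the tables above (0.619, 0.414, …) reflect directional closeness of query and chunk embeddings only.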
Date: February 18, 2026