OpenRAG Experiment — WTE Corpus (February 18, 2026)
Context
Test of the OpenRAG system with a limited corpus of 33 WTE (Workplace Together Essentials) documents in French, primarily Cisco technical guides and contractual documentation.
Objective
Evaluate the RAG system’s ability to answer specific questions about Cisco DECT devices mentioned in the documentation.
Tested Configurations
Version 1: Initial Configuration (Failed)
- Embedding Model: sentence-transformers/all-MiniLM-L6-v2
- Dimensions: 384D
- Optimization: English only
- Chunking: 512 characters, overlap 50
- Score Threshold: 0.25
- Number of documents: 33 (WTE guides, contracts, tutorials)
- Query: “What are the DECT models on WTE?”
- Target documents: Cisco IP DECT 6823.pdf, Guide Cisco IP DECT 6825.pdf
- Result: DECT documents absent from top 10
- Cause: Embedding model optimized for English only, poor semantic understanding of French content
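The fixed-size chunking used above (character windows with overlap) can be sketched in a few lines; `chunk_text` is a hypothetical helper for illustration, not OpenRAG's actual implementation:

```python
def chunk_text(text: str, size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into fixed-size character windows that overlap.

    v1 settings: size=512, overlap=50; v2 raises these to 2000/200.
    """
    assert size > overlap >= 0
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break  # last window already reached the end of the text
        start += size - overlap
    return chunks
```

Each window shares its last `overlap` characters with the start of the next one, so a sentence cut by one boundary is still intact in the neighboring chunk.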
Version 2: Improved Chunking (Partial failure)
- Embedding Model: sentence-transformers/all-MiniLM-L6-v2 (unchanged)
- Dimensions: 384D
- Chunking: 2000 characters, overlap 200 ✅ IMPROVEMENT
- Score Threshold: 0.25
- Result: Still no DECT documents in results
Version 3: Multilingual Model (Limited improvement)
- Embedding Model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2 ✅
- Dimensions: 768D
- Optimization: 50+ languages, better semantic understanding
- Chunking: 2000 characters, overlap 200
- Score Threshold: 0.20 (lowered for better recall)
- Max Results: 20 (increased from 10)
- Deleted Qdrant collection (384D)
- Updated .env and docker-compose.yml
- Full reprocessing of 33 documents → 237 vectors at 768D
- Restarted all services
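How the lowered score threshold and the raised result limit interact with cosine ranking can be sketched as follows; `search` is an illustrative stand-in for the Qdrant query, not OpenRAG code:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity, the distance metric configured on the collection."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search(query_vec, indexed, score_threshold=0.20, max_results=20):
    """Rank (id, vector) pairs; drop low scores, then keep the top k."""
    scored = [(doc_id, cosine(query_vec, vec)) for doc_id, vec in indexed]
    scored = [(d, s) for d, s in scored if s >= score_threshold]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:max_results]
```

With the earlier defaults (max_results=10), anything ranked 11th or lower is silently dropped, which is exactly what hid the DECT guides at positions 12-13 in the results below.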
Test Results
Test 1: DECT Query (v3 — multilingual model 768D)
Query: "What are the DECT models on WTE?"
Top 20 results (by relevance score):
| Position | Document | Score | Relevance |
|---|---|---|---|
| 1 | contrats-next-obs_ds_4765.pdf | 0.619 | ❌ False positive |
| 2 | contrats-next-obs_ds_4765.pdf | 0.609 | ❌ False positive |
| 3 | WTE - Tuto Collecte donnees - Orange Install.pdf | 0.524 | ⚠️ Mentions generic “DECT” |
| 4 | contrats-next-obs_ann_4762.pdf | 0.515 | ❌ False positive |
| 5 | WTE - Tuto Mon parcours en vie de solution_Vdiff.pdf | 0.472 | ❌ Not relevant |
| 6–11 | contrats-next-obs (various) | 0.468–0.420 | ❌ False positives |
| 12 | Guide Cisco IP DECT 6825.pdf | 0.414 | ✅ TARGET |
| 13 | Cisco IP DECT 6823.pdf | 0.414 | ✅ TARGET |
| 14–20 | WTE - Tuto (various) | 0.412–0.379 | ❌ Not relevant |
- ✅ DECT documents found but ranked at positions 12–13
- ❌ Outside the default top 10 limit
- ❌ Score 0.414 lower than contractual documents (0.619)
- ⚠️ The “Orange Install” document (0.524) mentions “IP or DECT phones” in the WTE context, explaining its higher score
- The technical content of the chunks does not explicitly mention “model 6823” or “model 6825”
- This information is in the filename, not in the indexed content
- The RAG system cannot link the question “what models” to the filenames
Test 2: Generic DECT Query
Query: "cisco DECT configuration"
Similar results — contractual documents ranked first, DECT guides at positions 10–15.
WTE Corpus Statistics
- Total documents: 33 documents
- Status: 33 uploaded, 0 failed
- Total indexed vectors: 237 (768D)
- Chunks per document: 1–49 chunks (average ~7 chunks)
- Chunk size: 2000 characters, overlap 200
- Languages: Primarily French
- Types: PDF technical guides, contracts, tutorials
- Cisco IP DECT 6823.pdf: 5 chunks
- Guide Cisco IP DECT 6825.pdf: 5 chunks
- Total: 10 DECT chunks / 237 total chunks (4.2%)
- contrats-next-obs_ds_4765.pdf: 49 chunks (20.7%)
- contrats-next-obs_ann_4762.pdf: 29 chunks (12.2%)
- WTE - Formation WTE Hub: 14 chunks (5.9%)
Identified Problems
1. Insufficient Corpus
- ❌ Only 33 documents, 237 vectors
- ❌ 10 DECT chunks (4.2%) buried among 227 non-DECT chunks
- ❌ Contractual documents (70+ chunks) dominate the results
2. Content Quality
- ❌ DECT chunks are highly technical (configuration, procedures)
- ❌ No description of the device models themselves
- ❌ “Model 6823/6825” information only exists in the filenames, not in indexed content
- ❌ Content cannot answer “what are the models?”
3. Architectural Limitations
- ❌ Filename not included in the vector search context
- ❌ Qdrant payload: metadata.source_file exists but is not exploited for ranking
- ⚠️ Score threshold 0.20 too restrictive for a small corpus
- ⚠️ Max results 10 insufficient to surface documents at positions 12–13
4. Model Performance
- ⚠️ Multilingual model better than English-only, but insufficient for this task
- ⚠️ Scores: 0.414 for DECT documents vs 0.619 for contracts
- ❌ The model does not understand that “6825” in a filename means “model 6825”
Attempted Solutions
✅ Successful
- Migration to multilingual model (768D) — DECT documents now found
- Increased chunk size (512 → 2000) — better context
- Lowered threshold (0.25 → 0.20) — more results returned
- Increased max_results (10 → 20) — DECT documents now visible
❌ Insufficient
- Multilingual model does not solve the limited corpus problem
- Improved chunking does not change the fact that the information is not in chunks
- Lower threshold does not change the relative ranking
🔄 Not Tested (Future Directions)
- Filename enrichment: include the filename in indexed chunk content
- Hybrid search (keyword + semantic) to match “6823”, “6825”
- Reranking with filename matching
- Weighted metadata (boost if filename matches query terms)
- Entity extraction (identify “6823” as a product reference)
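The first two untested directions can be sketched together: prepend the filename to each chunk before embedding, and boost results whose source filename shares tokens with the query. All names here are hypothetical illustrations, not OpenRAG internals, and the boost weight is an arbitrary assumption:

```python
import re

def enrich_chunk(filename: str, chunk: str) -> str:
    """Filename enrichment: makes 'Cisco IP DECT 6825' searchable even
    when the chunk body never mentions the model number."""
    return f"[Source: {filename}]\n{chunk}"

def filename_boost(query: str, filename: str,
                   score: float, weight: float = 0.1) -> float:
    """Keyword-style reranking: add a bonus for each query token that
    also appears in the source filename (e.g. '6825', 'dect')."""
    q_tokens = set(re.findall(r"\w+", query.lower()))
    f_tokens = set(re.findall(r"\w+", filename.lower()))
    return score + weight * len(q_tokens & f_tokens)
```

With a weight of 0.1, the query "cisco DECT 6825" shares three tokens with "Guide Cisco IP DECT 6825.pdf", lifting its score from 0.414 to 0.714, above the 0.619 of the contract chunks, whose filenames share no tokens with the query.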
Conclusions
Key Finding
Validated hypothesis ✅: “With an insufficient corpus (33 documents, 237 vectors), a RAG system cannot provide relevant results, even with a high-quality multilingual embedding model.”
Evidence:
- DECT documents exist in the database (10 indexed chunks)
- The multilingual 768D model retrieves them (positions 12–13, score 0.414)
- But they are buried by the more voluminous contractual documents
- DECT chunk content does not contain the searched information (“model 6823/6825”)
- Cannot correctly answer “What are the DECT models available?”
- 📉 Corpus too small (33 docs) for statistically significant results
- 📉 Imbalance: 78 contract chunks vs 10 DECT chunks
- 📉 Document quality: technical procedure guides vs product descriptions
- 📉 No document describing the “list of available DECT models”
Recommendations
To properly test OpenRAG:
- ✅ Minimum corpus: 500–1000 documents
- ✅ Well-documented domain: science, medicine, engineering
- ✅ Public dataset: Wikipedia, arXiv, PubMed
- ✅ Structured content: descriptions, lists, tables
- ✅ Varied queries: factual, comparative, exploratory
Candidate public datasets:
- Wikipedia (sciences): 100K+ articles
- arXiv (computer science / physics): 2M+ papers
- PubMed (medicine): 35M+ abstracts
- Gutenberg (literature): 70K+ books
Logs and Commands Executed
Migration 384D → 768D
Search Tests
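The search test commands were likewise not logged. The Test 1 query maps onto Qdrant's search endpoint roughly as follows; the embedding step is stubbed out, and the payload shape assumes the Qdrant 1.x points-search API:

```python
import json

def search_request(query_vector: list[float],
                   limit: int = 20,
                   score_threshold: float = 0.20) -> str:
    """Payload for POST /collections/documents_embeddings/points/search."""
    return json.dumps({
        "vector": query_vector,              # 768D embedding of the query text
        "limit": limit,                      # raised from 10 to 20 in v3
        "score_threshold": score_threshold,  # lowered from 0.25 in v3
        "with_payload": True,                # return metadata.source_file etc.
    })
```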
Processing Times
- Reprocessing 33 documents: ~600 seconds (10 minutes)
- Throughput: ~3–4 documents/minute
- 768D embedding: ~100–150ms per chunk
- Qdrant storage: ~30–50ms per vector
- Total vectors: 237 in ~10 minutes
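A quick back-of-the-envelope check using the per-chunk figures above suggests embedding and storage account for only a small fraction of the ~600 seconds, with the remainder presumably spent on PDF parsing and chunking:

```python
# Back-of-the-envelope check of the per-stage timings reported above.
vectors = 237
embed_ms = (100, 150)  # per chunk, 768D embedding
store_ms = (30, 50)    # per vector, Qdrant storage
low = vectors * (embed_ms[0] + store_ms[0]) / 1000   # seconds
high = vectors * (embed_ms[1] + store_ms[1]) / 1000
print(f"embedding+storage: {low:.0f}-{high:.0f}s of ~600s total")
```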
Technical Metadata
Infrastructure:
- Docker Compose: 10 services
- Qdrant: 1.13.0, collection “documents_embeddings”
- PostgreSQL: documents, document_chunks tables
- MinIO: 33 documents stored
- Ollama: llama3.1:8b (60–90s per response)
- Embedding: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
- Dimensions: 768
- Distance: Cosine similarity
- LLM: llama3.1:8b

