OpenRAG Experiment — WTE Corpus (February 18, 2026)

Context

Test of the OpenRAG system with a limited corpus of 33 WTE (Workplace Together Essentials) documents in French, primarily Cisco technical guides and contractual documentation.

Objective

Evaluate the RAG system’s ability to answer specific questions about Cisco DECT devices mentioned in the documentation.

Tested Configurations

Version 1: Initial Configuration (Failed)

  • Embedding Model: sentence-transformers/all-MiniLM-L6-v2
  • Dimensions: 384D
  • Optimization: English only
  • Chunking: 512 characters, overlap 50
  • Score Threshold: 0.25
  • Number of documents: 33 (WTE guides, contracts, tutorials)
Identified problem:
  • Query: “What are the DECT models on WTE?”
  • Target documents: Cisco IP DECT 6823.pdf, Guide Cisco IP DECT 6825.pdf
  • Result: DECT documents absent from top 10
  • Cause: Embedding model optimized for English only, poor semantic understanding of French content

Version 2: Improved Chunking (Partial failure)

  • Embedding Model: sentence-transformers/all-MiniLM-L6-v2 (unchanged)
  • Dimensions: 384D
  • Chunking: 2000 characters, overlap 200 ✅ IMPROVEMENT
  • Score Threshold: 0.25
  • Result: Still no DECT documents in results
Conclusion: The problem is not the chunking strategy, but the embedding model.
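For reference, the character chunking tested here (fixed-size windows with overlap) can be sketched as below; this is a minimal illustration, and the actual OpenRAG splitter may differ in details such as sentence-boundary handling.

```python
def chunk_text(text, size=2000, overlap=200):
    """Split text into fixed-size character windows sharing `overlap` characters."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    step = size - overlap  # each new chunk starts 1800 characters after the previous one
    return [text[i:i + size] for i in range(0, len(text), step)]

# A 5000-character document yields 3 chunks at these settings.
doc = "".join(str(i % 10) for i in range(5000))
chunks = chunk_text(doc)
```

Each chunk's last 200 characters are repeated at the start of the next chunk, which is what preserves context across chunk boundaries.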

Version 3: Multilingual Model (Limited improvement)

  • Embedding Model: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  • Dimensions: 768D
  • Optimization: 50+ languages, better semantic understanding
  • Chunking: 2000 characters, overlap 200
  • Score Threshold: 0.20 (lowered for better recall)
  • Max Results: 20 (increased from 10)
Migration performed:
  1. Deleted Qdrant collection (384D)
  2. Updated .env and docker-compose.yml
  3. Full reprocessing of 33 documents → 237 vectors at 768D
  4. Restarted all services

Test Results

Test 1: DECT Query (v3 — multilingual model 768D)

Query: "What are the DECT models on WTE?"
Top 20 results (by relevance score):

| Position | Document | Score | Relevance |
|----------|----------|-------|-----------|
| 1 | contrats-next-obs_ds_4765.pdf | 0.619 | ❌ False positive |
| 2 | contrats-next-obs_ds_4765.pdf | 0.609 | ❌ False positive |
| 3 | WTE - Tuto Collecte donnees - Orange Install.pdf | 0.524 | ⚠️ Mentions generic “DECT” |
| 4 | contrats-next-obs_ann_4762.pdf | 0.515 | ❌ False positive |
| 5 | WTE - Tuto Mon parcours en vie de solution_Vdiff.pdf | 0.472 | ❌ Not relevant |
| 6–11 | contrats-next-obs (various) | 0.468–0.420 | ❌ False positives |
| 12 | Guide Cisco IP DECT 6825.pdf | 0.414 | ✅ TARGET |
| 13 | Cisco IP DECT 6823.pdf | 0.414 | ✅ TARGET |
| 14–20 | WTE - Tuto (various) | 0.412–0.379 | ❌ Not relevant |
Analysis:
  • ✅ DECT documents found but ranked at positions 12–13
  • ❌ Outside the default top 10 limit
  • ❌ Score 0.414 lower than contractual documents (0.619)
  • ⚠️ The “Orange Install” document (0.524) mentions “IP or DECT phones” in the WTE context, explaining its higher score
Content of DECT chunks:
Cisco IP DECT 6823.pdf: "sur Enregistrer -> «Sauvegarder» pour enregistrer le numéro (Facultatif) Mettez en surbrillance un champ pour ajouter ou remplacer d'autres informations..."

Guide Cisco IP DECT 6825.pdf: "sieurs utilisateurs ou bornes via User Hub, effectuez toutes les actions en une seule opération. Ensuite, attendez environ 90 secondes..."
Identified problem:
  • The technical content of the chunks does not explicitly mention “model 6823” or “model 6825”
  • This information is in the filename, not in the indexed content
  • The RAG system cannot link the question “what models” to the filenames

Test 2: Generic DECT Query

Query: "cisco DECT configuration"
Similar results: contractual documents ranked first, DECT guides at positions 10–15.

WTE Corpus Statistics

  • Total documents: 33 documents
  • Status: 33 uploaded, 0 failed
  • Total indexed vectors: 237 (768D)
  • Chunks per document: 1–49 chunks (average ~7 chunks)
  • Chunk size: 2000 characters, overlap 200
  • Languages: Primarily French
  • Types: PDF technical guides, contracts, tutorials
DECT documents:
  • Cisco IP DECT 6823.pdf: 5 chunks
  • Guide Cisco IP DECT 6825.pdf: 5 chunks
  • Total: 10 DECT chunks / 237 total chunks (4.2%)
Dominant documents (by chunk count):
  1. contrats-next-obs_ds_4765.pdf: 49 chunks (20.7%)
  2. contrats-next-obs_ann_4762.pdf: 29 chunks (12.2%)
  3. WTE - Formation WTE Hub: 14 chunks (5.9%)

Identified Problems

1. Insufficient Corpus

  • ❌ Only 33 documents, 237 vectors
  • ❌ 10 DECT chunks (4.2%) buried among 227 non-DECT chunks
  • ❌ Contractual documents (70+ chunks) dominate the results

2. Content Quality

  • ❌ DECT chunks are highly technical (configuration, procedures)
  • ❌ No description of the device models themselves
  • ❌ “Model 6823/6825” information only exists in the filenames, not in indexed content
  • ❌ Content cannot answer “what are the models?”

3. Architectural Limitations

  • ❌ Filename not included in the vector search context
  • ❌ Qdrant payload: metadata.source_file exists but is not exploited for ranking
  • ⚠️ Score threshold 0.20 too restrictive for a small corpus
  • ⚠️ Max results 10 insufficient to surface documents at positions 12–13
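The interaction between the score threshold and the result limit can be illustrated with the scores from Test 1. This is a sketch only: the individual scores for positions 6–11 are interpolated from the reported 0.468–0.420 range, and `retrieve` is a hypothetical stand-in for the actual search endpoint.

```python
# Ranked hits approximating Test 1: 11 non-DECT chunks outrank the two DECT guides.
hits = (
    [("contrats-next-obs_ds_4765.pdf", 0.619),
     ("contrats-next-obs_ds_4765.pdf", 0.609),
     ("WTE - Tuto Collecte donnees - Orange Install.pdf", 0.524),
     ("contrats-next-obs_ann_4762.pdf", 0.515),
     ("WTE - Tuto Mon parcours en vie de solution_Vdiff.pdf", 0.472)]
    + [("contrats-next-obs (various)", round(0.468 - 0.01 * i, 3)) for i in range(6)]
    + [("Guide Cisco IP DECT 6825.pdf", 0.414),
       ("Cisco IP DECT 6823.pdf", 0.414)]
)

def retrieve(hits, score_threshold=0.20, max_results=10):
    """Apply the score threshold, then truncate to max_results (highest first)."""
    ranked = sorted(hits, key=lambda h: h[1], reverse=True)
    return [(doc, s) for doc, s in ranked if s >= score_threshold][:max_results]

top10 = retrieve(hits, max_results=10)
top20 = retrieve(hits, max_results=20)
```

With `max_results=10` the two DECT guides (score 0.414, positions 12–13) are cut off by the limit before the 0.20 threshold ever comes into play, matching the observed behavior.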

4. Model Performance

  • ⚠️ Multilingual model better than English-only, but insufficient for this task
  • ⚠️ Scores: 0.414 for DECT documents vs 0.619 for contracts
  • ❌ The model does not understand that “6825” in a filename means “model 6825”

Attempted Solutions

✅ Successful

  1. Migration to multilingual model (768D) — DECT documents now found
  2. Increased chunk size (512 → 2000) — better context
  3. Lowered threshold (0.25 → 0.20) — more results returned
  4. Increased max_results (10 → 20) — DECT documents now visible

❌ Insufficient

  1. Multilingual model does not solve the limited corpus problem
  2. Improved chunking does not change the fact that the information is not in chunks
  3. Lower threshold does not change the relative ranking

🔄 Not Tested (Future Directions)

  1. Filename enrichment: include the filename in indexed chunk content
  2. Hybrid search (keyword + semantic) to match “6823”, “6825”
  3. Reranking with filename matching
  4. Weighted metadata (boost if filename matches query terms)
  5. Entity extraction (identify “6823” as a product reference)
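Directions 1 and 4 could be prototyped in a few lines, as sketched below; `enrich_chunk` and `keyword_boost` are hypothetical helper names for illustration, not part of OpenRAG.

```python
import re

def enrich_chunk(source_file, chunk_text):
    """Direction 1 (hypothetical): prepend the filename to the indexed content
    so product numbers like 6823/6825 become searchable."""
    return f"[Document: {source_file}]\n{chunk_text}"

def keyword_boost(query, source_file, base_score, weight=0.1):
    """Direction 4 (hypothetical): boost the semantic score when query tokens
    also appear in the filename."""
    q_tokens = set(re.findall(r"\w+", query.lower()))
    f_tokens = set(re.findall(r"\w+", source_file.lower()))
    return base_score + weight * len(q_tokens & f_tokens)

# "DECT" matches in the filename, lifting the guide from 0.414 to 0.514.
boosted = keyword_boost("What are the DECT models on WTE?",
                        "Guide Cisco IP DECT 6825.pdf", 0.414)
```

Contract documents get no boost from this query (no token overlap with their filenames), so even this naive scheme would narrow the 0.619 vs 0.414 gap.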

Conclusions

Key Finding

Validated hypothesis: ✅
“With an insufficient corpus (33 documents, 237 vectors), a RAG system cannot provide relevant results, even with a high-quality multilingual embedding model.”
Evidence:
  1. DECT documents exist in the database (10 indexed chunks)
  2. The multilingual 768D model retrieves them (positions 12–13, score 0.414)
  3. But they are buried by the more voluminous contractual documents
  4. DECT chunk content does not contain the searched information (“model 6823/6825”)
  5. The system therefore cannot correctly answer “What are the DECT models available?”
Test limitations:
  • 📉 Corpus too small (33 docs) for statistically significant results
  • 📉 Imbalance: 70 contract chunks vs 10 DECT chunks
  • 📉 Document quality: technical procedure guides vs product descriptions
  • 📉 No document describing the “list of available DECT models”

Recommendations

To properly test OpenRAG:
  1. Minimum corpus: 500–1000 documents
  2. Well-documented domain: science, medicine, engineering
  3. Public dataset: Wikipedia, arXiv, PubMed
  4. Structured content: descriptions, lists, tables
  5. Varied queries: factual, comparative, exploratory
Suggested datasets:
  • Wikipedia (sciences): 100K+ articles
  • arXiv (computer science / physics): 2M+ papers
  • PubMed (medicine): 35M+ abstracts
  • Gutenberg (literature): 70K+ books

Logs and Commands Executed

Migration 384D → 768D

# 1. Update configuration
.env (old → new):
  QDRANT_VECTOR_SIZE=384 → 768
  EMBEDDING_MODEL=all-MiniLM-L6-v2 → paraphrase-multilingual-mpnet-base-v2
  CHUNK_SIZE=512 → 2000
  CHUNK_OVERLAP=50 → 200

# 2. Delete old collection
curl -X DELETE http://localhost:6333/collections/documents_embeddings
# {"result":true,"status":"ok","time":0.001186893}

# 3. Reset database
docker-compose exec postgres psql -U openrag_user -d openrag_db
DELETE FROM document_chunks; -- 935 rows deleted
UPDATE documents SET status = 'uploaded' WHERE status = 'processed'; -- 33 rows updated

# 4. Restart services
docker-compose down
docker-compose up -d

# 5. Verify configuration
docker-compose exec orchestrator printenv | grep -E "QDRANT|EMBEDDING|CHUNK"
# QDRANT_VECTOR_SIZE=768
# EMBEDDING_MODEL=paraphrase-multilingual-mpnet-base-v2
# CHUNK_SIZE=2000
# CHUNK_OVERLAP=200

# 6. Reprocess documents
docker cp scripts/reprocess_documents.py openrag-orchestrator:/app/
docker-compose exec orchestrator python reprocess_documents.py
# [33/33] Processing complete! ✨
# Total: 237 vectors indexed

# 7. Verify Qdrant
curl -s http://localhost:6333/collections/documents_embeddings | python3 -m json.tool
# "points_count": 237
# "vectors": {"size": 768}

Search Tests

# Test 1: Without LLM, top 10
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the DECT models on WTE?", "use_llm": false, "max_results": 10}'
# Result: 10 contractual documents, no DECT

# Test 2: Without LLM, top 20
curl -X POST http://localhost:8000/query \
  -H "Content-Type: application/json" \
  -d '{"query": "What are the DECT models on WTE?", "use_llm": false, "max_results": 20}'
# Result: DECT documents at positions 12-13

# Test 3: Verify DECT indexation
curl -s -X POST http://localhost:6333/collections/documents_embeddings/points/scroll \
  -H "Content-Type: application/json" \
  -d '{"limit": 300, "with_payload": true, "with_vector": false}' | \
  python3 -c "import sys, json; d=json.load(sys.stdin); \
  dect=[p for p in d['result']['points'] if 'DECT' in p['payload']['metadata'].get('source_file','')]; \
  print(f'DECT chunks: {len(dect)}')"
# DECT chunks: 10 ✅

Processing Times

  • Reprocessing 33 documents: ~600 seconds (10 minutes)
  • Throughput: ~3–4 documents/minute
  • 768D embedding: ~100–150ms per chunk
  • Qdrant storage: ~30–50ms per vector
  • Total vectors: 237 in ~10 minutes

Technical Metadata

Infrastructure:
  • Docker Compose: 10 services
  • Qdrant: 1.13.0, collection “documents_embeddings”
  • PostgreSQL: documents, document_chunks tables
  • MinIO: 33 documents stored
  • Ollama: llama3.1:8b (60–90s per response)
Models:
  • Embedding: sentence-transformers/paraphrase-multilingual-mpnet-base-v2
  • Dimensions: 768
  • Distance: Cosine similarity
  • LLM: llama3.1:8b
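Qdrant ranks results by cosine similarity over these 768-D vectors; for reference, a minimal definition of the metric (higher = more similar, 1.0 for identical directions):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: dot product over norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)
```

Because the metric ignores vector magnitude, the scores in the tables above (0.619, 0.414, …) reflect directional closeness of query and chunk embeddings only.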
Date: February 18, 2026