Extract Text from Report
Low-level pipeline that turns PDF, HL7, RTF, and DOCX pathology reports into plain text for downstream NLP
https://github.com/sbalci/gleason_extraction
https://github.com/sbalci/report-management-system
https://github.com/sbalci/pathology-reports-text-analysis
Objectives
- One extractor that handles PDF (scanned + digital), HL7 v2, RTF, DOCX, and plain text.
- Deterministic output: same input → same text, byte-for-byte.
- OCR fallback for scanned PDFs, but flagged so downstream code knows confidence is lower.
- Structured JSON sidecar with page numbers, headings, and source format.
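As a rough illustration of the first, second, and fourth objectives, the sketch below guesses the source format from the leading bytes and serializes a small sidecar in a reproducible way. The function names `sniff_format` and `build_sidecar` and the sidecar keys are hypothetical and do not reflect the v0.3 schema; only the file signatures (`%PDF-`, `{\rtf`, the ZIP header used by DOCX, and the HL7 v2 `MSH` segment) are standard.

```python
import hashlib
import json


def sniff_format(data: bytes) -> str:
    """Guess the source format from leading bytes (hypothetical helper)."""
    head = data[:8]
    if head.startswith(b"%PDF-"):
        return "pdf"
    if head.startswith(b"{\\rtf"):
        return "rtf"
    if head.startswith(b"PK\x03\x04"):
        # DOCX is a ZIP container; other ZIP-based formats share this signature,
        # so a real extractor would need to inspect the archive contents too.
        return "docx"
    if head.startswith(b"MSH|"):
        # HL7 v2 messages start with the MSH segment, usually with "|" as the
        # field separator. MLLP framing bytes are assumed to be stripped already.
        return "hl7"
    return "text"


def build_sidecar(text: str, source_format: str, ocr_used: bool) -> str:
    """Serialize an illustrative sidecar; key names are assumptions."""
    sidecar = {
        "source_format": source_format,
        "ocr_used": ocr_used,  # lets downstream code lower its confidence
        "sha256": hashlib.sha256(text.encode("utf-8")).hexdigest(),
    }
    # sort_keys makes the serialized form independent of insertion order,
    # which helps keep the output byte-for-byte reproducible.
    return json.dumps(sidecar, sort_keys=True, ensure_ascii=False)
```

Hashing the extracted text gives a cheap way to confirm that two runs over the same input produced identical output.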
Status
| Component | Status |
|---|---|
| PDF (digital) | Stable |
| PDF (scanned, OCR via Tesseract) | Working; Turkish + English language packs |
| HL7 v2 | Stable |
| RTF / DOCX | Stable |
| JSON sidecar schema | v0.3 — see repo README |
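The scanned-PDF row depends on deciding when a page's embedded text layer is unusable. One possible heuristic, sketched here under assumptions, treats a page as needing OCR when the text layer is nearly empty or dominated by replacement and private-use characters. The function name `needs_ocr` and both thresholds are hypothetical; `tur` and `eng` are the standard Tesseract codes for the Turkish and English language packs.

```python
import unicodedata

# Tesseract language codes for Turkish and English, joined with "+" so both
# packs are applied to the same page.
TESSERACT_LANGS = "tur+eng"


def needs_ocr(page_text: str, min_chars: int = 20, max_bad_ratio: float = 0.2) -> bool:
    """Return True when the text layer looks empty or garbled.

    Thresholds are illustrative assumptions, not tuned values.
    """
    stripped = page_text.strip()
    if len(stripped) < min_chars:
        # Scanned pages typically carry little or no text layer.
        return True
    bad = sum(
        1
        for ch in stripped
        # U+FFFD is the replacement character; "Co" is private use and
        # "Cn" is unassigned, both common with custom glyph mappings.
        if ch == "\ufffd" or unicodedata.category(ch) in ("Co", "Cn")
    )
    return bad / len(stripped) > max_bad_ratio
```

Ordinary Turkish letters such as ğ, ş, and ı are regular assigned characters, so they do not count toward the garbled ratio.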
Input assumptions
- All PHI removal is done before files reach this pipeline (upstream of the repo).
- Extractor assumes files are already in a reviewed, de-identified staging folder.
- Language is detected automatically; mixed Turkish/English documents are handled.
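Because the extractor trusts that its inputs are already de-identified, a caller may want to refuse any path that resolves outside the staging folder. A minimal sketch of such a guard follows; the function name `is_in_staging` and the example paths are hypothetical, and nothing here is taken from the repository.

```python
from pathlib import Path


def is_in_staging(path: str, staging_root: str) -> bool:
    """Return True only if `path` resolves to a location under `staging_root`.

    Resolving first collapses ".." segments and symlinks, so a path that
    merely starts with the staging prefix cannot escape it.
    """
    resolved = Path(path).resolve()
    root = Path(staging_root).resolve()
    return root in resolved.parents
```

For example, `is_in_staging("/srv/staging/../raw/report.pdf", "/srv/staging")` is false, because the path resolves to a location outside the staging root.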
Tools and repositories
- `extract-report-text`: main extractor (Python).
- `markitdown`: used for Office and HTML inputs.
- `pathology-reports-text-analysis`: primary consumer; read this to understand what the downstream expects.
How to contribute
- Clone `extract-report-text` and run `pytest` against the bundled fixtures.
- Add a failing fixture for any real-world report format that breaks the extractor.
- Fix the extractor, then re-run tests.
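The workflow above can be sketched as a small fixture check. The layout assumed here, an input file such as `case1.hl7` next to a `case1.expected.txt` file holding the expected text, is hypothetical and may differ from the bundled fixtures, as are the helper names. The extractor is passed in as a callable because its real entry point is not described here.

```python
from pathlib import Path
from typing import Callable, List, Tuple

EXPECTED_SUFFIX = ".expected.txt"  # assumed naming convention


def find_fixture_pairs(fixture_dir: Path) -> List[Tuple[Path, Path]]:
    """Pair each expected-text file with the input file sharing its stem."""
    pairs = []
    for expected in sorted(fixture_dir.glob("*" + EXPECTED_SUFFIX)):
        stem = expected.name[: -len(EXPECTED_SUFFIX)]
        inputs = sorted(
            p for p in fixture_dir.glob(stem + ".*")
            if not p.name.endswith(EXPECTED_SUFFIX)
        )
        if inputs:
            pairs.append((inputs[0], expected))
    return pairs


def check_fixture(extract: Callable[[bytes], str], source: Path, expected: Path) -> bool:
    """Run the extractor twice and compare both runs with the expected text."""
    first = extract(source.read_bytes())
    second = extract(source.read_bytes())
    # Comparing two runs also guards the "same input, same text" objective.
    return first == second == expected.read_text(encoding="utf-8")
```

Inside a `pytest` test function, each pair returned by `find_fixture_pairs` could be asserted with `check_fixture`, so a newly added fixture fails until the extractor handles its format.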
Known pitfalls
- Some scanner output uses embedded fonts with custom glyph mappings; the OCR fallback kicks in automatically, but extraction is slower.
- HL7 messages sometimes contain CDATA-wrapped narrative; the parser unwraps it, but check the JSON sidecar `source: hl7.cdata` flag before analyzing.
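To make the CDATA pitfall concrete, the sketch below unwraps CDATA sections from a narrative string and reports whether any were found, then shows a downstream check on the sidecar. It assumes the flag is stored as a top-level `source` key with the value `hl7.cdata`, which is only one reading of the note above; the helper names are hypothetical and this is not the repository's parser.

```python
import re
from typing import Tuple

# Non-greedy match so adjacent CDATA sections are unwrapped separately;
# DOTALL lets the narrative span multiple lines.
_CDATA = re.compile(r"<!\[CDATA\[(.*?)\]\]>", re.DOTALL)


def unwrap_cdata(narrative: str) -> Tuple[str, bool]:
    """Strip CDATA wrappers and report whether any were present."""
    # A function replacement avoids backslash escapes in the narrative
    # being interpreted as group references.
    unwrapped, count = _CDATA.subn(lambda m: m.group(1), narrative)
    return unwrapped, count > 0


def came_from_cdata(sidecar: dict) -> bool:
    """Check the sidecar flag (assumed layout: {"source": "hl7.cdata"})."""
    return sidecar.get("source") == "hl7.cdata"
```

A consumer might route reports where `came_from_cdata` is true through an extra sanity check before analysis.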