Serdar Balcı • Research

Extract Text from Report

Low-level pipeline that turns PDF, HL7, RTF, and DOCX pathology reports into plain text for downstream NLP

https://github.com/sbalci/gleason_extraction

https://github.com/sbalci/report-management-system

https://github.com/sbalci/pathology-reports-text-analysis

Objectives

  • One extractor that handles PDF (scanned + digital), HL7 v2, RTF, DOCX, and plain text.
  • Deterministic output: same input → same text, byte-for-byte.
  • OCR fallback for scanned PDFs, flagged in the output so downstream code knows the text is lower-confidence.
  • Structured JSON sidecar with page numbers, headings, and source format.
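
The determinism and sidecar objectives above can be illustrated with a minimal sketch. The field names (source, ocr_used, pages, headings) and the helper names are assumptions made for illustration; the actual v0.3 schema is defined in the repo README and may differ.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Sidecar:
    # Hypothetical field names; the real schema lives in the repo README.
    source: str                                    # e.g. "pdf", "hl7", "rtf", "docx"
    ocr_used: bool                                 # True when text came from the OCR fallback
    pages: list = field(default_factory=list)      # one entry per page
    headings: list = field(default_factory=list)   # detected headings


def text_digest(text: str) -> str:
    """SHA-256 of the UTF-8 bytes; equal digests mean byte-identical text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def sidecar_json(sidecar: Sidecar) -> str:
    """Serialize with sorted keys so the sidecar itself is reproducible."""
    return json.dumps(asdict(sidecar), sort_keys=True, ensure_ascii=False)


def is_deterministic(extract, raw: bytes, runs: int = 3) -> bool:
    """Run an extractor several times on the same bytes and compare digests."""
    digests = {text_digest(extract(raw)) for _ in range(runs)}
    return len(digests) == 1
```

Passing any callable that maps bytes to text into is_deterministic gives a quick regression check: a single digest across runs means the output was identical byte for byte.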

Status

Component                         Status
PDF (digital)                     Stable
PDF (scanned, OCR via Tesseract)  Working; Turkish + English language packs
HL7 v2                            Stable
RTF / DOCX                        Stable
JSON sidecar                      Schema v0.3; see repo README

Input assumptions

  • All PHI removal is done before files reach this pipeline (upstream of the repo).
  • Extractor assumes files are already in a reviewed, de-identified staging folder.
  • Language is detected automatically; mixed Turkish/English documents are handled.

Tools and repositories

  • extract-report-text — main extractor (Python).
  • markitdown — used for Office and HTML inputs.
  • pathology-reports-text-analysis — primary consumer; read this to understand what the downstream expects.

How to contribute

  1. Clone extract-report-text and run pytest against the bundled fixtures.
  2. Add a failing fixture for any real-world report format that breaks the extractor.
  3. Fix the extractor, then re-run tests.
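
The fixture workflow above can be sketched as a small regression test. This is only an illustration: extract_text is a stand-in for the real extractor entry point, and the tests/fixtures directory and the ".expected.txt" naming convention are assumptions, not the layout confirmed by the repo.

```python
from pathlib import Path

EXPECTED_SUFFIX = ".expected.txt"  # assumed naming convention


def extract_text(path: Path) -> str:
    """Stand-in for the real extractor entry point (hypothetical name)."""
    return path.read_text(encoding="utf-8")


def fixture_pairs(root: Path):
    """Yield (input, expected) pairs, e.g. report.rtf and report.rtf.expected.txt."""
    for expected in sorted(root.glob("*" + EXPECTED_SUFFIX)):
        source = expected.with_name(expected.name[: -len(EXPECTED_SUFFIX)])
        if source.exists():
            yield source, expected


def check_fixtures(root: Path) -> list:
    """Return the names of fixtures whose extracted text differs from the expected text."""
    failures = []
    for source, expected in fixture_pairs(root):
        if extract_text(source) != expected.read_text(encoding="utf-8"):
            failures.append(source.name)
    return failures


def test_bundled_fixtures():
    # pytest collects this function by name; the directory is an assumption.
    assert check_fixtures(Path("tests/fixtures")) == []
```

Adding a new failing fixture then means dropping the problematic report and its expected text into the fixture directory; the test fails until the extractor produces the expected output.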

Known pitfalls

  • Some scanner output uses embedded fonts with custom glyph mappings; the OCR fallback kicks in automatically, but extraction takes longer.
  • HL7 messages sometimes contain CDATA-wrapped narrative; the parser unwraps it, but check the JSON sidecar for the "source: hl7.cdata" flag before analyzing the text.
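
A downstream consumer might guard against both pitfalls with a check like the one below. It is a sketch under stated assumptions: the sidecar is taken to carry a top-level "source" string and a boolean "ocr_used" flag, which may not match the real schema in the repo README.

```python
import json


def review_reasons(sidecar: dict) -> list:
    """Return the reasons a report's extracted text deserves extra caution."""
    reasons = []
    # Assumed layout: a top-level "source" string such as "hl7.cdata".
    if sidecar.get("source") == "hl7.cdata":
        reasons.append("narrative was unwrapped from CDATA")
    # Assumed layout: a boolean "ocr_used" flag set by the OCR fallback.
    if sidecar.get("ocr_used", False):
        reasons.append("text came from OCR; confidence is lower")
    return reasons


# Example sidecar content, invented for illustration only.
example = json.loads('{"source": "hl7.cdata", "ocr_used": false}')
print(review_reasons(example))
```

Returning a list of reasons rather than a single boolean lets the analysis step log why a report was set aside.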

© 2024-2026 Serdar Balcı

 
