Document Structure Extraction for Claude Code
Extract structured data from PDFs, Word documents, and images using Claude's native vision capabilities and parallel Task agents.
```bash
npx structurecc
```
This installs the plugin to ~/.claude/plugins/structurecc/.
```bash
/structure document.pdf
/structure lab_image.png
/structure report.docx
```
```bash
/structure:batch ./documents/
/structure:batch ./patient_files/ --output ./extracted/
```
| Format | Extension | Notes |
|--------|-----------|-------|
| PDF | .pdf | Multi-page supported, chunked for large documents |
| Word | .docx, .doc | Text and embedded images extracted |
| Images | .png, .jpg, .jpeg, .tiff, .bmp | Single-page extraction |
For each document, structurecc generates:
```
document_extracted/
├── chunks/           # Individual chunk extractions (for debugging)
├── structure.json    # Complete structured extraction
└── STRUCTURE.md      # Human-readable markdown summary
```
```json
{
  "source": "/path/to/document.pdf",
  "extracted": "2026-01-30T14:30:22Z",
  "pages": [
    {
      "page": 1,
      "elements": [
        {
          "id": "element_1",
          "type": "table",
          "title": "Table 1. Lab Results",
          "data": {
            "headers": ["Test", "Result", "Units", "Reference"],
            "rows": [
              ["Glucose", "126", "mg/dL", "70-100"]
            ]
          },
          "confidence": 0.98
        }
      ]
    }
  ],
  "summary": {
    "total_pages": 5,
    "tables": 3,
    "figures": 4,
    "equations": 1,
    "average_confidence": 0.94
  }
}
```
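Because structure.json follows the schema above, downstream tooling can consume it directly. Here is a minimal Node/TypeScript sketch that lists every element with its confidence; the interface names are illustrative (not part of the plugin's API), and the path reuses the document_extracted/ layout shown earlier.
```typescript
// read-structure.ts: minimal sketch of reading structure.json (schema as shown above).
// The interface names here are illustrative, not part of the plugin's API.
import { readFileSync } from "node:fs";

interface ExtractedElement {
  id: string;
  type: string;          // "table", "figure", "equation", "text"
  title?: string;
  data?: unknown;
  confidence: number;
}

interface Extraction {
  source: string;
  extracted: string;
  pages: { page: number; elements: ExtractedElement[] }[];
  summary: {
    total_pages: number;
    tables: number;
    figures: number;
    equations: number;
    average_confidence: number;
  };
}

const doc: Extraction = JSON.parse(
  readFileSync("document_extracted/structure.json", "utf8")
);

// Print one line per element: page, type, confidence, title
for (const page of doc.pages) {
  for (const el of page.elements) {
    console.log(
      `p${page.page}  ${el.type.padEnd(8)}  ${el.confidence.toFixed(2)}  ${el.title ?? ""}`
    );
  }
}
console.log(`average confidence: ${doc.summary.average_confidence}`);
```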
structurecc uses a chunk-based parallel processing approach:
1. Document Analysis - Determine page count and split into chunks (5 pages each)
2. Parallel Extraction - Launch one Task agent per chunk for parallel processing
3. Chunk Merge - Combine chunk results with page offset correction
4. Output Generation - Create JSON and Markdown outputs
```
Document (20 pages)
│
├── Chunk 1 (Pages 1-5)   → Agent 1
├── Chunk 2 (Pages 6-10)  → Agent 2
├── Chunk 3 (Pages 11-15) → Agent 3
└── Chunk 4 (Pages 16-20) → Agent 4
│
▼
Merged Output
```
This approach:
- Maximizes throughput via parallel processing
- Preserves context within chunks (figures and captions stay together)
- Uses Claude's native vision (no external APIs)
- Gives each agent a full 200K-token context window for thorough extraction
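The merge step (step 3 above) can be pictured with a short sketch. The chunk file layout assumed here is for illustration only, not the plugin's documented format: one JSON file per chunk in chunks/, each shaped like structure.json, with pages numbered locally from 1.
```typescript
// merge-chunks.ts: sketch of chunk merging with page offset correction.
// Assumptions (not guaranteed by the plugin): each file in chunks/ is shaped like
// structure.json, pages inside a chunk are numbered 1..CHUNK_SIZE locally, and file
// names contain the chunk index (e.g. chunk_1.json, chunk_2.json, ...).
import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

const CHUNK_SIZE = 5; // pages per chunk, as described above
const chunkDir = "document_extracted/chunks";

// Sort chunk files by the number embedded in their name
const chunkIndex = (name: string) => parseInt(name.replace(/\D/g, ""), 10);
const files = readdirSync(chunkDir)
  .filter((f) => f.endsWith(".json"))
  .sort((a, b) => chunkIndex(a) - chunkIndex(b));

const mergedPages: { page: number; elements: unknown[] }[] = [];

files.forEach((file, i) => {
  const chunk = JSON.parse(readFileSync(join(chunkDir, file), "utf8"));
  for (const page of chunk.pages) {
    // Page offset correction: shift local page numbers into document coordinates
    mergedPages.push({ page: page.page + i * CHUNK_SIZE, elements: page.elements });
  }
});

mergedPages.sort((a, b) => a.page - b.page);
console.log(`merged ${mergedPages.length} pages from ${files.length} chunks`);
```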
Tables are extracted with:
- Headers and all rows
- Cell values with exact formatting
- Flags (H, L, *, †)
- Footnotes
- Merged cell information
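Since a table element carries its headers and rows (see the data shape in the schema above), converting one to CSV is straightforward. A small sketch follows, reusing the example element from earlier; the tableToCsv helper is illustrative, not part of the plugin.
```typescript
// table-to-csv.ts: sketch converting an extracted table element's data to CSV.
interface TableData {
  headers: string[];
  rows: string[][];
}

function tableToCsv(data: TableData): string {
  // Quote cells containing commas, quotes, or newlines
  const escape = (cell: string) =>
    /[",\n]/.test(cell) ? `"${cell.replace(/"/g, '""')}"` : cell;
  return [data.headers, ...data.rows]
    .map((row) => row.map(escape).join(","))
    .join("\n");
}

// Using the "Table 1. Lab Results" example from the schema above:
const lab: TableData = {
  headers: ["Test", "Result", "Units", "Reference"],
  rows: [["Glucose", "126", "mg/dL", "70-100"]],
};
console.log(tableToCsv(lab));
// Test,Result,Units,Reference
// Glucose,126,mg/dL,70-100
```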
Figure extraction supports various types:
- Charts/Graphs: Line, bar, scatter, pie with data series and axes
- Scientific Images: Western blots, gels, micrographs
- Diagrams: Flowcharts, illustrations, photographs
Each figure includes:
- Title and caption
- Data points (when visible)
- Axis labels and ranges
- Annotations and legends
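The exact JSON shape of figure elements is not shown in the schema above, so post-processing has to be written against the fields you actually find in structure.json. As a loose sketch (the title and caption field names here are assumptions), building a figure index might look like:
```typescript
// figure-index.ts: sketch listing figure titles and captions from structure.json.
// The "caption" field is an assumption; check your own output for the real field names.
import { readFileSync } from "node:fs";

const doc = JSON.parse(
  readFileSync("document_extracted/structure.json", "utf8")
);

const figures = doc.pages.flatMap((page: any) =>
  page.elements
    .filter((el: any) => el.type === "figure")
    .map((el: any) => `p${page.page}: ${el.title ?? "(untitled)"} ${el.caption ?? ""}`)
);

console.log(figures.join("\n"));
```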
Equations are extracted as:
- LaTeX representation
- Plain text fallback
- Variable definitions
Text blocks are captured with:
- Full content
- Type (header, paragraph, caption, footnote)
- Formatting information
Every element includes a confidence score (0.0-1.0):
| Score | Meaning |
|-------|---------|
| 0.95-1.00 | Crystal clear extraction |
| 0.85-0.94 | Clear with minor uncertainty |
| 0.70-0.84 | Readable but some ambiguity |
| < 0.70 | Needs manual verification |
Low confidence items are flagged in the output for review.
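One practical use of the confidence field is building a review queue. A minimal sketch, reusing the structure.json path from above and the < 0.70 threshold from the table:
```typescript
// review-queue.ts: sketch collecting elements below the manual-verification threshold.
import { readFileSync } from "node:fs";

const THRESHOLD = 0.7; // "needs manual verification" band from the table above
const doc = JSON.parse(
  readFileSync("document_extracted/structure.json", "utf8")
);

const needsReview = doc.pages.flatMap((page: any) =>
  page.elements
    .filter((el: any) => el.confidence < THRESHOLD)
    .map((el: any) => ({
      page: page.page,
      id: el.id,
      type: el.type,
      confidence: el.confidence,
    }))
);

console.table(needsReview);
```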
- Medical Lab Results: Extract patient data from PDF reports
- Research Papers: Structure tables and figures from publications
- Scientific Images: Transcribe gel/blot data for documentation
- Patient Records: Batch process document folders
- Data Digitization: Convert scanned documents to structured data
- Claude Code CLI
- No external dependencies (uses Claude's native capabilities)
structurecc leverages Claude's multimodal capabilities:
1. Claude Vision: Reads PDFs and images natively without OCR
2. Parallel Agents: Task tool spawns chunk agents for parallel processing
3. Structured Output: JSON schema ensures consistent, parseable output
4. Markdown Summary: Human-readable format for quick review
No web searches, no external APIs, no Python dependencies. Just Claude + document = structured data.
- Very large documents (100+ pages) may require multiple runs
- Handwritten content has lower accuracy than printed text
- Low-resolution images may have reduced confidence scores
- Complex nested tables may require manual verification
MIT