LR XMLParser — Modular, robust and safe XML parser

[npm](https://www.npmjs.com/package/@luciformresearch/xmlparser)
[npm downloads](https://www.npmjs.com/package/@luciformresearch/xmlparser)
[types](./dist/types/index.d.ts)

[TypeScript](https://www.typescriptlang.org/)
[Benchmarks](./docs/BENCHMARKS.md)
[Status](#key-use-cases)

High‑performance XML parser designed for modern AI pipelines. LR XMLParser is optimized for LLM‑generated XML (permissive mode with error recovery) while remaining strict, traceable, and secure for production workloads.

Project by LuciformResearch (Lucie Defraiteur).

French version: see README.fr.md

Key Features

- Namespaces: xmlns/xmlns:prefix mapping with ns‑aware queries (findByNS, findAllByNS).
- Streaming (SAX): lightweight event API via LuciformSAX for large inputs.
- Robust recovery: permissive modes with diagnostics; maxRecoveries cap + recoveryReport { attempts, capped, codes?, notes? }.
- Precise diagnostics: structured codes, messages, suggestions, and locations.
- Secure defaults: limits for depth, text/PI/comment length; entity expansion guard.
- Dual build: ESM/CJS with exports map; types included.
- Text coalescing: coalesceTextNodes (default true) merges adjacent text nodes to reduce fragmentation.

Getting started (npm)

- Install:
  - `npm install @luciformresearch/xmlparser`
  - `pnpm add @luciformresearch/xmlparser`

- Examples (ESM and CommonJS):

```ts
// ESM
import { LuciformXMLParser } from '@luciformresearch/xmlparser';

const result = new LuciformXMLParser(xml, { mode: 'luciform-permissive' }).parse();
```

```js
// CommonJS
const { LuciformXMLParser } = require('@luciformresearch/xmlparser');

const result = new LuciformXMLParser(xml, { mode: 'luciform-permissive' }).parse();
```

- Streaming (SAX) quickstart:

```ts
import { LuciformSAX } from '@luciformresearch/xmlparser/sax';

new LuciformSAX(xml, {
  onStartElement: (name, attrs) => {},
  onEndElement: (name) => {},
  onText: (t) => {},
}).run();
```

- Subpath exports (optional): @luciformresearch/xmlparser/document, .../scanner, .../diagnostics, .../types, .../migration.

License

MIT with reinforced attribution. See LICENSE for terms, attribution obligations, and allowed uses.

Overview

LR XMLParser follows a modular architecture (scanner → parser → models → diagnostics) focused on clarity, testability, and performance.

What's New

- 0.2.3: All diagnostics/messages in English, `coalesceTextNodes` option (default true), benchmarks now include memory metrics.
- 0.2.2: `recoveryReport` enriched with `codes` and `notes`, SAX unit tests, benchmarks scaffold (`npm run bench`).
- 0.2.1: Recovery cap behavior: stop scanning when `maxRecoveries` is exceeded; summary diagnostics added; partial document returned.

LLM Structured Responses

- Safe permissive parse recipe:

```ts
import { LuciformXMLParser } from '@luciformresearch/xmlparser';

const parser = new LuciformXMLParser(xml, {
  mode: 'luciform-permissive',
  maxRecoveries: 20,
  maxDepth: 100,
  maxTextLength: 200_000,
  maxPILength: 2_000,
  maxCommentLength: 20_000,
  coalesceTextNodes: true,
});
const res = parser.parse();
if (!res.success) {
  // In permissive mode, you may still have a usable partial document
  console.warn('Diagnostics:', res.diagnostics);
}
const value = res.document?.findElement('answer')?.getTextContent();
```

- Namespace-aware extraction (LLM tags with prefixes):

```ts
// e.g. for an `item` element bound to the urn:slots namespace whose text is 42
const item = res.document?.findByNS('urn:slots', 'item')?.getTextContent();
```

Production Security Posture

- Limits: enforce `maxDepth`, `maxTextLength`, `maxPILength`, `maxCommentLength`, attribute count/value length.
- Recovery guard: cap automatic fixes with `maxRecoveries`; adds summary diagnostics and stops scanning beyond the cap.
- Namespaces: reserved prefix checks (xmlns, xml URI), unbound prefix diagnostics; default namespace does not apply to attributes.
- DOCTYPE: extracts root/public/system; no external fetching; DTD processing disabled by default.
- Entities: expansion guard (configurable limit); no network I/O.
- Diagnostics: structured codes/messages/suggestions with locations for auditing.
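To illustrate what an entity expansion guard does in principle, here is a minimal, self-contained sketch that caps the number of substitutions in a single pass. The function `expandEntities` and its parameters are hypothetical and written only for this illustration; they are not the library's implementation or API.

```ts
// Illustrative only: a single-pass expander with a hard cap on substitutions.
// The function name and signature are hypothetical, not part of the library.
function expandEntities(
  text: string,
  entities: Map<string, string>,
  maxExpansions: number,
): string {
  let expansions = 0;
  return text.replace(/&([A-Za-z_][\w.-]*);/g, (match: string, name: string) => {
    const value = entities.get(name);
    if (value === undefined) return match; // unknown reference: leave untouched
    expansions += 1;
    if (expansions > maxExpansions) {
      throw new Error(`Entity expansion limit (${maxExpansions}) exceeded`);
    }
    return value;
  });
}

// Three references are expanded here, well under the cap of 10.
const predefined = new Map([['lt', '<'], ['gt', '>'], ['amp', '&']]);
const expanded = expandEntities('a &lt; b &amp;&amp; c', predefined, 10);
```

Failing loudly once the cap is hit, rather than silently truncating, keeps the behaviour auditable, which matches the diagnostics-first posture described above.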

Key use cases

- Structured LLM responses ("luciform‑permissive" mode to tolerate and recover from common LLM formatting issues).
- General XML parsing with precise diagnostics (line/column) and configurable limits.
- Integration in AI pipelines (LR HMM) and larger systems (LR Hub).

Example within a hierarchical memory engine:

```ts
const parser = new LuciformXMLParser(xml, {
  mode: 'luciform-permissive',
  maxTextLength: 100_000,
});
const result = parser.parse();
if (result.success) {
  const summary = result.document?.findElement('summary')?.getText();
}
```

Code structure

```
lr_xmlparser/
├── index.ts             # Main parser (public API)
├── scanner.ts           # Stateful tokenizer
├── document.ts          # XML models (Document/Element/Node)
├── diagnostics.ts       # Diagnostics (codes, messages, suggestions)
├── migration.ts         # Compatibility layer (legacy → new)
├── types.ts             # Shared types and interfaces
└── test-integration.ts
```

Why LR XMLParser

- Performance: fast on practical workloads (see test-integration.ts).
- Maintainability: focused modules with clear separation of concerns.
- Testability: isolated components, validated integration, easier debugging.
- Reusability: standalone scanner, extensible diagnostics, independent models.
- LLM‑oriented: permissive mode, error recovery, CDATA handling, format tolerance.

Edge cases covered

- Attributes and self-closing tags
- Unclosed comments/CDATA: permissive mode recovers and logs diagnostics
- Mismatched tags: errors with precise codes and locations
- Limits: `maxDepth`, `maxTextLength`, `maxPILength`
- Processing instructions and DOCTYPE handling
- BOM + whitespace tolerance
- Namespaces: `xmlns`/`xmlns:prefix` mapping, unbound prefix diagnostics

Express API

```ts
export class LuciformXMLParser {
  constructor(content: string, options?: ParserOptions);
  parse(): ParseResult;
}
```

Options include security and performance limits (depth, text length, entity expansion), plus `mode`: `strict` | `permissive` | `luciform-permissive`.

Additional option:

- `coalesceTextNodes?: boolean` (default true): merges adjacent text nodes under the same parent to reduce node fragmentation without changing text content.
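To make the effect of `coalesceTextNodes` concrete, the following minimal sketch merges runs of adjacent text nodes in a child list. The `SketchNode` type and the `coalesce` function are hypothetical stand-ins written for this illustration; they are not the library's model classes or internals.

```ts
// Hypothetical node shape for illustration only; not the library's model types.
type SketchNode =
  | { kind: 'text'; value: string }
  | { kind: 'element'; name: string };

// Merge runs of adjacent text nodes into one node, leaving other nodes in place.
function coalesce(children: SketchNode[]): SketchNode[] {
  const out: SketchNode[] = [];
  for (const child of children) {
    const last = out[out.length - 1];
    if (child.kind === 'text' && last !== undefined && last.kind === 'text') {
      // Concatenate into the previous text node; the total text is unchanged.
      out[out.length - 1] = { kind: 'text', value: last.value + child.value };
    } else {
      out.push(child);
    }
  }
  return out;
}

// Two adjacent text nodes collapse into one; the text after <br> stays separate.
const merged = coalesce([
  { kind: 'text', value: 'Hello, ' },
  { kind: 'text', value: 'world' },
  { kind: 'element', name: 'br' },
  { kind: 'text', value: '!' },
]);
```

Text separated by an element is deliberately not merged, so the concatenated text content of the parent stays the same either way.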

Namespace-aware queries:

```ts
// Given a document whose elements are bound to the urn:foo namespace
const item = result.document?.findByNS('urn:foo', 'item');
const items = result.document?.findAllByNS('urn:foo', 'item');
```

SAX/streaming (large inputs):

```ts
import { LuciformSAX } from '@luciformresearch/xmlparser/sax';

new LuciformSAX(xml, {
  onStartElement: (name, attrs) => { /* ... */ },
  onEndElement: (name) => {},
  onText: (text) => {},
}).run();
```

SAX API handlers:

- `onStartElement(name, attrs)`
- `onEndElement(name)`
- `onText(text)`
- `onComment(text, closed)`
- `onCDATA(text, closed)`
- `onPI(content, closed)`
- `onDoctype(content)`
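As a sketch of how such handlers can accumulate state without building a tree, the example below counts elements, tracks maximum depth, and sums text length. The handler names follow the list above, but the `SketchSaxHandlers` interface and its parameter types (in particular `attrs` as a plain string record) are assumptions made for this illustration, and the events are driven by hand instead of by the real parser.

```ts
// Handler names follow the list above; the parameter types are assumptions.
interface SketchSaxHandlers {
  onStartElement?: (name: string, attrs: Record<string, string>) => void;
  onEndElement?: (name: string) => void;
  onText?: (text: string) => void;
}

// Build handlers that record element count, maximum nesting depth, and text size.
function makeStatsHandlers() {
  const stats = { elements: 0, maxDepth: 0, textChars: 0 };
  let depth = 0;
  const handlers: SketchSaxHandlers = {
    onStartElement: () => {
      stats.elements += 1;
      depth += 1;
      stats.maxDepth = Math.max(stats.maxDepth, depth);
    },
    onEndElement: () => {
      depth -= 1;
    },
    onText: (text) => {
      stats.textChars += text.length;
    },
  };
  return { handlers, stats };
}

// Drive the handlers manually, standing in for the events of <a><b>hi</b></a>.
const { handlers, stats } = makeStatsHandlers();
handlers.onStartElement?.('a', {});
handlers.onStartElement?.('b', {});
handlers.onText?.('hi');
handlers.onEndElement?.('b');
handlers.onEndElement?.('a');
```

With the real package, an object of this shape would presumably be passed as the second argument to `new LuciformSAX(xml, ...)`, as in the quickstart snippet.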

Namespaces

- Default namespace applies to elements, not attributes.
- Prefixed names (e.g., `foo:bar`) require a bound `xmlns:foo` in scope.
- Reserved: `xmlns` prefix/name; `xml` must map to http://www.w3.org/XML/1998/namespace.
- Use `findByNS(nsUri, local)`/`findAllByNS` for ns-aware traversal.

| Case | Element resolution | Attribute resolution | Example |
| --- | --- | --- | --- |
| Default namespace declared (`xmlns="urn:d"`) | Applies to element names | Does not apply to attributes | `item` resolves to `urn:d:item`; `a` has no namespace |
| Prefixed element (`foo:bar`) | Requires `xmlns:foo="…"` in scope | n/a | Element resolves to `urn:f:bar` |
| Prefixed attribute (`foo:a`) | n/a | Requires `xmlns:foo="…"` in scope | Attribute `a` in `urn:f` |
| Unbound prefix | Diagnostic `UNDEFINED_PREFIX` | Diagnostic `UNDEFINED_PREFIX` | Prefix `a` used without `xmlns:a` |
| Reserved names | Diagnostic (`XMLNS_PREFIX_RESERVED`, `XML_PREFIX_URI`) | Diagnostic | `xmlns:test`; `xml` bound to wrong URI |
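The resolution rules in the table can be sketched as a small lookup function. This is a simplified illustration, assuming a flat prefix-to-URI map for the current scope; `NsScope`, `Resolved`, and `resolveName` are hypothetical names, and throwing an error stands in for the `UNDEFINED_PREFIX` diagnostic the parser reports.

```ts
// Hypothetical scope: prefix -> URI, with '' holding the default namespace.
type NsScope = Map<string, string>;

interface Resolved {
  ns: string | null;
  local: string;
}

// Resolve a qualified name; `isAttribute` switches off the default namespace.
function resolveName(qname: string, scope: NsScope, isAttribute: boolean): Resolved {
  const colon = qname.indexOf(':');
  if (colon === -1) {
    // Unprefixed: elements take the default namespace, attributes take none.
    const ns = isAttribute ? null : (scope.get('') ?? null);
    return { ns, local: qname };
  }
  const prefix = qname.slice(0, colon);
  const local = qname.slice(colon + 1);
  const uri = scope.get(prefix);
  if (uri === undefined) {
    // Stands in for the UNDEFINED_PREFIX diagnostic named in the table.
    throw new Error(`UNDEFINED_PREFIX: ${prefix}`);
  }
  return { ns: uri, local };
}

const scope: NsScope = new Map([['', 'urn:d'], ['foo', 'urn:f']]);
const el = resolveName('item', scope, false);      // element in the default namespace
const attr = resolveName('a', scope, true);        // unprefixed attribute: no namespace
const prefixed = resolveName('foo:bar', scope, false); // bound prefix
```

A real implementation would also need a stack of scopes so that declarations on an element go out of scope when it closes; the flat map here only models a single point in the document.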

Error handling

- Inspect `result.diagnostics` for structured issues (code, message, suggestion, location).
- `result.success` is false when errors are present; permissive mode may still return a usable document.
- Typical codes: `UNCLOSED_TAG`, `MISMATCHED_TAG`, `INVALID_COMMENT`, `INVALID_CDATA`, `MAX_DEPTH_EXCEEDED`, `MAX_TEXT_LENGTH_EXCEEDED`.
- Recovery cap: set `maxRecoveries` to cap automatic fixes in permissive modes. When the cap is exceeded, the parser stops further scanning, adds `RECOVERY_ATTEMPTED` and `PARTIAL_PARSE` info diagnostics, and returns a partial document. See `result.recoveryReport` for `{ attempts, capped, codes?, notes? }`.
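One possible way to act on these fields is a small triage helper that decides whether a result is safe to use. The `Sketch*` interfaces below are structural types inferred from this README (including the `level` field used in the integration examples), so the real exported types may differ; `triage` and its policy are hypothetical.

```ts
// Structural types inferred from this README; the real exported types may differ.
interface SketchDiagnostic {
  level: string;
  code: string;
  message: string;
}
interface SketchRecoveryReport {
  attempts: number;
  capped: boolean;
  codes?: string[];
  notes?: string[];
}
interface SketchResult {
  success: boolean;
  diagnostics: SketchDiagnostic[];
  recoveryReport?: SketchRecoveryReport;
}

type Verdict = 'ok' | 'usable-with-warnings' | 'reject';

// Example policy: reject when recovery was capped (the document is partial)
// or when any error-level diagnostic is present; otherwise accept.
function triage(res: SketchResult): Verdict {
  if (res.recoveryReport?.capped) return 'reject';
  if (res.diagnostics.some((d) => d.level === 'error')) return 'reject';
  return res.diagnostics.length > 0 ? 'usable-with-warnings' : 'ok';
}

// Mock result: one recovery happened, but the cap was not reached.
const recovered: SketchResult = {
  success: true,
  diagnostics: [{ level: 'info', code: 'RECOVERY_ATTEMPTED', message: 'fixed 1 issue' }],
  recoveryReport: { attempts: 1, capped: false },
};
const verdict = triage(recovered);
```

A stricter or looser policy is equally valid; for LLM output it can be reasonable to accept a capped, partial document if the elements you need are present.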

Testing and validation

```bash
npx tsx test-integration.ts
```

Validated internally on:

- Valid simple XML
- Malformed XML (permissive mode)
- Complex XML with CDATA and comments
- Performance and limits
- Compatibility wrapper available

Benchmarks

- Quick start:
  - Build: `npm run build`
  - Run: `npm run bench`
- Outputs throughput and average latency for several corpora, plus memory deltas when GC is available.
- Writes JSON reports to `Reports/Benchmarks/` for later comparison. See docs/BENCHMARKS.md for details.

Tip: run once with `node --expose-gc` to enable memory delta instrumentation.

Links and integrations

- GitLab (source): https://gitlab.com/luciformresearch/lr_xmlparser
- GitHub mirror: https://github.com/LuciformResearch/LR_XMLParser
- Used by:
  - LR HMM (L1/L2 memory compression, "xmlEngine")
    - GitLab: https://gitlab.com/luciformresearch/lr_hmm
    - GitHub: https://github.com/LuciformResearch/LR_HMM
  - LR Hub (origin/base): https://gitlab.com/luciformresearch/lr_chat

Integration Examples

- Strict parsing (fail-fast):

```ts
const strict = new LuciformXMLParser(xml, { mode: 'strict' }).parse();
if (!strict.success) throw new Error('Invalid XML');
```

- Permissive with diagnostics filtering:

```ts
const res = new LuciformXMLParser(xml, { mode: 'luciform-permissive', maxRecoveries: 10 }).parse();
const fatal = res.diagnostics.filter(d => d.level === 'error');
const nonFatal = res.diagnostics.filter(d => d.level !== 'error');
```

- Subpath import for models:

```ts
import { XMLDocument, XMLElement } from '@luciformresearch/xmlparser/document';
```

Contributing

PRs welcome.

- Fork → feature branch → MR/PR
- Keep modules focused; avoid unnecessary deps
- Add tests for affected modules

Support

- Issues: open on GitLab
- Questions: GitLab discussions or direct contact
- Contact: luciedefraiteur@luciformresearch.com
