Document processing tools for majk chat - PDF, Excel, Word, PowerPoint parsing and analysis
Comprehensive document processing package for majk chat that adds support for parsing and analyzing PDF, Excel, Word, PowerPoint, and CSV files.
- Universal Document Analyzer: Automatically detects file types and routes to appropriate parsers
- PDF Parser: Extract text content and metadata from PDF files
- Excel Parser: Parse XLSX/XLS files with sheet analysis and data extraction
- Word Parser: Extract text from DOCX files with multiple output formats
- PowerPoint Parser: Extract slide content, notes, and presentation structure
- CSV Parser: Intelligent CSV parsing with type detection and column analysis
| Format | Extensions | Features |
|--------|------------|----------|
| PDF | .pdf | Text extraction, metadata, page-specific parsing |
| Excel | .xlsx, .xls | Sheet parsing, data analysis, multiple output formats |
| Word | .docx | Text/HTML/Markdown output, style extraction |
| PowerPoint | .pptx | Slide content, speaker notes, presentation metadata |
| CSV | .csv, .tsv | Auto-delimiter detection, type inference, column analysis |
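
The table above implies a mapping from file extension to parser. The following is a minimal sketch of how such routing could look, assuming detection is purely extension-based. The names `ParserKind`, `EXTENSION_TO_PARSER`, and `detectParser` are hypothetical and are not part of the package's documented API; the real analyzer may use other signals.

```typescript
// Hypothetical helper: illustrates extension-based routing only.
type ParserKind = 'pdf' | 'excel' | 'word' | 'powerpoint' | 'csv';

// Extensions taken from the supported-formats table.
const EXTENSION_TO_PARSER: Record<string, ParserKind> = {
  '.pdf': 'pdf',
  '.xlsx': 'excel',
  '.xls': 'excel',
  '.docx': 'word',
  '.pptx': 'powerpoint',
  '.csv': 'csv',
  '.tsv': 'csv',
};

function detectParser(filePath: string): ParserKind | undefined {
  // Use the final path segment so dots in directory names are ignored.
  const fileName = filePath.split(/[\\/]/).pop() ?? '';
  const dot = fileName.lastIndexOf('.');
  // No extension, or a dotfile such as ".csv": nothing to route on.
  if (dot <= 0) return undefined;
  return EXTENSION_TO_PARSER[fileName.slice(dot).toLowerCase()];
}
```

Returning `undefined` for unknown extensions leaves it to the caller to report an unsupported format.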
```bash
npm install @majkapp/majk-chat-document-tools
```

```typescript
import { DocumentAnalyzerTool } from '@majkapp/majk-chat-document-tools';

const analyzer = new DocumentAnalyzerTool();

// Automatically detect and parse any supported document.
// `context` is the tool execution context supplied by majk-chat (not defined in this snippet).
const result = await analyzer.execute({
  file_path: './document.pdf',
  analysis_type: 'auto',
  include_metadata: true
}, context);
```
```typescript
import {
  PdfParserTool,
  ExcelParserTool,
  WordParserTool,
  PowerPointParserV2Tool,
  CsvParserTool
} from '@majkapp/majk-chat-document-tools';

// `context` is the tool execution context supplied by majk-chat (not defined in this snippet).

// PDF parsing: extract pages 1-5 together with the document metadata
const pdfParser = new PdfParserTool();
const pdfResult = await pdfParser.execute({
  file_path: './report.pdf',
  page_range: { start: 1, end: 5 },
  extract_metadata: true
}, context);

// Excel parsing: read up to 1000 rows of one sheet as JSON
const excelParser = new ExcelParserTool();
const excelResult = await excelParser.execute({
  file_path: './data.xlsx',
  sheet_name: 'Sales Data',
  output_format: 'json',
  max_rows: 1000
}, context);
```
```typescript
import { MajkChatBuilder } from '@majkapp/majk-chat-core';
import { registerDocumentTools } from '@majkapp/majk-chat-document-tools';

const builder = new MajkChatBuilder()
  .withProvider('anthropic')
  .withModel('claude-3-5-sonnet-20241022');

// Register all document tools
registerDocumentTools(builder.getToolRegistry());

const chat = builder.build();
```
The universal document analyzer (`DocumentAnalyzerTool`) automatically detects the file type and applies the appropriate parser.
Parameters:
- `file_path` (required): Path to document file
- `analysis_type`: auto | text_only | structured | metadata
- `max_text_length`: Maximum text extraction length (default: 50000)
- `include_metadata`: Extract document metadata (default: true)
- `output_format`: json | summary | detailed
Parameters (`PdfParserTool`):
- `file_path` (required): Path to PDF file
- `page_range`: { start?: number, end?: number }
- `extract_metadata`: Extract PDF metadata (default: true)
- `max_text_length`: Text length limit (default: 50000)
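
Because both fields of `page_range` are optional, a parser has to fill in the missing bounds. The sketch below shows one way to do that, assuming pages are numbered from 1 (as in the usage example) and that out-of-range values are clamped to the document. `resolvePageRange` is a hypothetical helper, not a function exported by the package.

```typescript
// Mirrors the documented shape of the `page_range` parameter.
interface PageRange {
  start?: number;
  end?: number;
}

// Hypothetical helper: fills in missing bounds and clamps to the document.
function resolvePageRange(
  range: PageRange | undefined,
  pageCount: number
): { start: number; end: number } {
  // A missing start means "from the first page"; a missing end means "to the last page".
  const start = Math.max(1, range?.start ?? 1);
  const end = Math.min(pageCount, range?.end ?? pageCount);
  if (start > end) {
    throw new RangeError(`Empty page range ${start}-${end} for a ${pageCount}-page document`);
  }
  return { start, end };
}
```

For example, `resolvePageRange({ start: 1, end: 5 }, 3)` yields `{ start: 1, end: 3 }` under these assumptions.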
Parameters (`ExcelParserTool`):
- `file_path` (required): Path to Excel file
- `sheet_name`: Specific sheet to parse
- `range`: Excel range (e.g., "A1:D10")
- `header_row`: Header row number (default: 1)
- `max_rows`: Maximum rows to parse (default: 1000)
- `output_format`: json | csv | table
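
To make the `range` parameter concrete, here is a minimal sketch of how an A1-style range such as "A1:D10" translates into row and column bounds. It uses 1-based indices and does not handle sheet prefixes or absolute references (`$A$1`). `columnToIndex` and `parseA1Range` are hypothetical helpers written for illustration; the package most likely delegates this to its spreadsheet dependency.

```typescript
interface CellRange {
  startRow: number;
  startCol: number;
  endRow: number;
  endCol: number;
}

// Converts column letters to a 1-based index: A -> 1, Z -> 26, AA -> 27.
function columnToIndex(letters: string): number {
  const upper = letters.toUpperCase();
  let index = 0;
  for (let i = 0; i < upper.length; i++) {
    index = index * 26 + (upper.charCodeAt(i) - 64);
  }
  return index;
}

// Hypothetical helper: parses "A1:D10" into 1-based row/column bounds.
function parseA1Range(range: string): CellRange {
  const match = /^([A-Za-z]+)(\d+):([A-Za-z]+)(\d+)$/.exec(range.trim());
  if (!match) {
    throw new Error(`Invalid range: ${range}`);
  }
  return {
    startCol: columnToIndex(match[1]),
    startRow: Number(match[2]),
    endCol: columnToIndex(match[3]),
    endRow: Number(match[4]),
  };
}
```

With this sketch, "A1:D10" covers columns 1 through 4 and rows 1 through 10.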
Parameters (`WordParserTool`):
- `file_path` (required): Path to Word file
- `output_format`: plain | html | markdown
- `include_images`: Process image references (default: false)
- `max_text_length`: Text length limit (default: 50000)
- `extract_styles`: Extract style information (default: false)
Parameters (`PowerPointParserV2Tool`):
- `file_path` (required): Path to PowerPoint file
- `include_slide_notes`: Extract speaker notes (default: true)
- `slide_numbers`: Array of specific slides to extract
- `max_text_length`: Text length limit (default: 50000)
- `extract_slide_titles`: Extract slide titles (default: true)
- `include_shapes`: Include shape details (default: false)
Parameters (`CsvParserTool`):
- `file_path` (required): Path to CSV file
- `delimiter`: Column delimiter (auto-detected if not provided)
- `has_headers`: Whether first row contains headers (auto-detected)
- `encoding`: File encoding (utf8 | ascii | latin1)
- `max_rows`: Maximum rows to parse (default: 5000)
- `output_format`: json | table | summary
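
As an illustration of what delimiter auto-detection can involve, the sketch below picks the candidate that appears the same non-zero number of times on each of the first few lines. This is a common heuristic, not necessarily the one the package uses, and it ignores quoted fields. `detectDelimiter` is a hypothetical name.

```typescript
// Hypothetical helper: guesses the delimiter from a text sample.
function detectDelimiter(
  sample: string,
  candidates: string[] = [',', '\t', ';', '|']
): string {
  // Inspect at most the first 10 non-empty lines.
  const lines = sample
    .split(/\r?\n/)
    .filter((line) => line.length > 0)
    .slice(0, 10);

  let best = candidates[0]; // fall back to the first candidate (comma by default)
  let bestScore = 0;
  for (const candidate of candidates) {
    const counts = lines.map((line) => line.split(candidate).length - 1);
    const first = counts[0] ?? 0;
    // A plausible delimiter occurs at least once and equally often on every line.
    const consistent = first > 0 && counts.every((count) => count === first);
    if (consistent && first > bestScore) {
      best = candidate;
      bestScore = first;
    }
  }
  return best;
}
```

Passing `delimiter` explicitly remains the safer choice for files with quoted fields or ragged rows, where a heuristic like this can guess wrong.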
All parsers are designed to work seamlessly with majk-chat's context management system:
- Smart Truncation: Automatically truncates large documents while preserving structure
- Incremental Reading: Supports offset/limit reading for large files via read_tool_result
- Memory Efficient: Processes documents in chunks to avoid memory issues
- Token Optimization: Formats output to minimize token usage while preserving information
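
The offset/limit idea behind incremental reading can be sketched as follows. This is not the implementation of `read_tool_result`; it assumes offsets and limits are counted in characters and that the full extracted text is available as a string. `TextSlice` and `readSlice` are hypothetical names.

```typescript
interface TextSlice {
  text: string;              // the requested window of the document text
  offset: number;            // where this window starts
  nextOffset: number | null; // where to continue, or null when the end was reached
  totalLength: number;       // length of the full text
}

// Hypothetical helper: returns one window of a large text.
function readSlice(fullText: string, offset = 0, limit = 50000): TextSlice {
  // Clamp the window to the bounds of the text.
  const start = Math.min(Math.max(offset, 0), fullText.length);
  const end = Math.min(start + Math.max(limit, 0), fullText.length);
  return {
    text: fullText.slice(start, end),
    offset: start,
    nextOffset: end < fullText.length ? end : null,
    totalLength: fullText.length,
  };
}
```

A caller can keep requesting slices with `offset` set to the previous `nextOffset` until it is `null`, so only one window is placed in the chat context at a time.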
All tools provide comprehensive error handling:
- File Not Found: Clear error messages with resolved paths
- Permission Denied: Specific permission error reporting
- Invalid Format: Format validation with supported format guidance
- Parsing Errors: Detailed parsing error information with context
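
The categories above map naturally onto Node.js system error codes. The sketch below shows one possible mapping, assuming the underlying error carries a `code` property such as `ENOENT` or `EACCES`. `describeFileError` and its message wording are hypothetical; the actual messages produced by the tools may differ.

```typescript
// Hypothetical helper: turns a caught error into a user-facing message.
function describeFileError(error: unknown, resolvedPath: string): string {
  // Node.js file system errors expose a string `code` property.
  const code =
    typeof error === 'object' && error !== null && 'code' in error
      ? String((error as { code?: unknown }).code)
      : undefined;

  switch (code) {
    case 'ENOENT':
      return `File not found: ${resolvedPath}`;
    case 'EACCES':
    case 'EPERM':
      return `Permission denied: ${resolvedPath}`;
    default: {
      // Anything else is reported as a parsing failure with the original message.
      const message = error instanceof Error ? error.message : String(error);
      return `Failed to parse ${resolvedPath}: ${message}`;
    }
  }
}
```

Including the resolved path in every message makes it easier to spot mistakes in relative paths.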
- `pdf-parse`: PDF text extraction
- `xlsx`: Excel/XLSX parsing
- `mammoth`: Word document processing
- `node-pptx-parser`: PowerPoint parsing
- `csv-parser`: CSV parsing and analysis
MIT