Chunkr is an open-source document intelligence API developed by Lumina AI for parsing complex documents into structured data suitable for RAG and LLM applications. It accepts PDF, DOCX, PPT, Excel, PNG, and JPEG inputs and runs them through layout analysis, OCR, and intelligent chunking to produce HTML, Markdown, JSON, or plain-text output. The tool uses vision language models (VLMs) for segmentation, attaches bounding boxes so each extracted element can be mapped back to its position on the page, and tracks citations for traceability.
Key functionality includes adaptive chunking strategies that range from fixed-size splits to semantic grouping based on content meaning. Users configure pipelines via YAML files, selecting models such as GPT-4o or local options served through Ollama. The API is task-based: an upload returns a task ID, and the client polls that ID for results, so large files are handled asynchronously. Integration is through a Python SDK that supports both synchronous and asynchronous calls, with API keys and endpoints supplied via environment variables.
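The upload-then-poll pattern described above can be sketched generically. This is a minimal illustration and not the Chunkr SDK: `poll_task`, the `fetch_status` callable, and the status strings `"Succeeded"` and `"Failed"` are all hypothetical names chosen for the example. In practice `fetch_status` would wrap whatever call the SDK or HTTP API exposes for retrieving a task by ID.

```python
import time
from typing import Any, Callable, Dict

# Hypothetical terminal states; the real API's status names may differ.
TERMINAL_STATES = ("Succeeded", "Failed")


def poll_task(
    fetch_status: Callable[[str], Dict[str, Any]],
    task_id: str,
    interval: float = 2.0,
    timeout: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> Dict[str, Any]:
    """Call fetch_status(task_id) until the task reaches a terminal state.

    fetch_status is an assumed callable that returns a dict containing a
    "status" key. Raises TimeoutError if no terminal state is seen in time.
    """
    deadline = time.monotonic() + timeout
    while True:
        task = fetch_status(task_id)
        if task["status"] in TERMINAL_STATES:
            return task
        if time.monotonic() >= deadline:
            raise TimeoutError(f"task {task_id} did not finish within {timeout}s")
        sleep(interval)


if __name__ == "__main__":
    # Fake backend standing in for a real task endpoint.
    responses = iter(
        [
            {"status": "Processing"},
            {"status": "Processing"},
            {"status": "Succeeded", "output": "<chunks>"},
        ]
    )
    result = poll_task(lambda _id: next(responses), "task-123", interval=0.0)
    print(result["status"])
```

Passing the fetch function in as a parameter keeps the loop independent of any particular client, so the same sketch works with a synchronous SDK call or a plain HTTP request.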
In comparisons, Chunkr outperforms basic LlamaIndex parsers at handling visual elements and multi-page structures, though it requires more configuration. Against Unstructured.io, it offers greater modularity for custom VLM processing but may demand additional setup from non-developers. Azure Document Intelligence is a cloud alternative with similar OCR capabilities, while Chunkr offers open-source flexibility and a lower barrier to entry through its free tier.
User feedback highlights high text-extraction accuracy for printed materials, with processing times of around two minutes for 50-page documents. Limitations appear in handwriting recognition, where handwritten regions are returned as images rather than text, and in complex tables with shading, which may not fully parse as structured data. The enterprise edition supports on-premises deployment for regulated sectors and maintains auditability through preserved metadata.
For implementation, begin with the Docker quickstart to test local processing, then migrate to the cloud API for production scale. Tune chunk overlap to about 20 percent for better context retention in retrieval tasks, and make sure the outputs align with the downstream embedding model.
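To make the 20 percent overlap concrete, here is a minimal sketch of fixed-size chunking with overlap. It is a generic illustration and not Chunkr's own chunker: the function name `chunk_with_overlap` and its parameters are assumptions, and the "tokens" here are just list items standing in for whatever unit the embedding model counts.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def chunk_with_overlap(
    tokens: Sequence[T], chunk_size: int, overlap_ratio: float = 0.2
) -> List[List[T]]:
    """Split tokens into windows of chunk_size that share overlap_ratio of
    their length with the previous window (hypothetical helper)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap_ratio < 1:
        raise ValueError("overlap_ratio must be in [0, 1)")

    overlap = round(chunk_size * overlap_ratio)
    # Always advance by at least one token so the loop terminates.
    step = max(1, chunk_size - overlap)

    chunks: List[List[T]] = []
    for start in range(0, len(tokens), step):
        chunks.append(list(tokens[start:start + chunk_size]))
        # Stop once a window has reached the end of the input.
        if start + chunk_size >= len(tokens):
            break
    return chunks


if __name__ == "__main__":
    for chunk in chunk_with_overlap(list(range(25)), chunk_size=10):
        print(chunk)
```

With a chunk size of 10 and a 20 percent overlap, each window repeats the last two tokens of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when it is shorter than the overlap.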