RepoPilot

lumina-ai-inc/chunkr

Vision infrastructure to turn complex documents into RAG/LLM-ready data

Healthy

Healthy across the board

ConcernsDependency

copyleft license (AGPL-3.0) — review compatibility

HealthyFork & modify

Has a license, tests, and CI — clean foundation to fork and modify.

HealthyLearn from

Documented and popular — useful reference codebase to read through.

HealthyDeploy as-is

No critical CVEs, sane security posture — runnable as-is.

  • AGPL-3.0 is copyleft — check downstream compatibility
  • Scorecard: default branch unprotected (0/10)
  • Last commit 5w ago
  • 9 active contributors
  • Distributed ownership (top contributor 36% of recent commits)
  • AGPL-3.0 licensed
  • CI configured
  • Tests present

What would improve this?

  • Use as dependency ConcernsMixed if: relicense under MIT/Apache-2.0 (rare for established libs)

Computed from maintenance signals — commit recency, contributor breadth, bus factor, license, CI, tests, cross-checked against OpenSSF Scorecard

Informational only. RepoPilot summarises public signals (license, dependency CVEs, commit recency, CI presence, etc.) at the time of analysis. Signals can be incomplete or stale. Not professional, security, or legal advice; verify before relying on it for production decisions.

Embed the "Healthy" badge

Paste into your README — live-updates from the latest cached analysis.

Variant:
RepoPilot: Healthy
[![RepoPilot: Healthy](https://repopilot.app/api/badge/lumina-ai-inc/chunkr)](https://repopilot.app/r/lumina-ai-inc/chunkr)

Paste at the top of your README.md — renders inline like a shields.io badge.

Preview social card

This card auto-renders when someone shares https://repopilot.app/r/lumina-ai-inc/chunkr on X, Slack, or LinkedIn.

Ask AI about lumina-ai-inc/chunkr

Grounded in the actual source code. Pick a starter question or write your own.

Or write your own question →

Onboarding doc

Onboarding: lumina-ai-inc/chunkr

Generated by RepoPilot · 2026-06-24 · Source

🎯Verdict

GO — Healthy across the board

  • Last commit 5w ago
  • 9 active contributors
  • Distributed ownership (top contributor 36% of recent commits)
  • AGPL-3.0 licensed
  • CI configured
  • Tests present
  • ⚠ AGPL-3.0 is copyleft — check downstream compatibility
  • ⚠ Scorecard: default branch unprotected (0/10)

<sub>Computed from maintenance signals — commit recency, contributor breadth, bus factor, license, CI, tests, cross-checked against OpenSSF Scorecard</sub>

TL;DR

Chunkr is a production-ready open-source document intelligence API that converts complex documents (PDFs, PPTs, Word docs, images) into RAG/LLM-ready chunks through layout analysis, OCR with bounding boxes, and semantic chunking. It extracts structured HTML/Markdown output with vision-language model processing to prepare unstructured documents for retrieval-augmented generation and LLM consumption. Monorepo with apps/web/ containing the TypeScript/React frontend (Vite build, MUI components, Keycloak auth, Stripe billing), while the core Rust backend (likely in a separate workspace root) handles vision/OCR processing. Frontend uses Redux for state, React Query for server sync, and Tailwind/PostCSS for styling. Development infrastructure includes Docker Compose orchestration (.dev/otel-collector/), OpenTelemetry observability, and npm/pnpm workspace management.

👥Who it's for

ML/data engineers building RAG pipelines who need to ingest and parse complex multi-page documents at scale; companies integrating document processing into LLM applications; developers wanting an open-source alternative to proprietary document intelligence APIs like Docling or Unstructured.

🌱Maturity & risk

Actively maintained with version 2.2.0, comprehensive Docker setup, GitHub Actions CI (codespell, Rust linting, TypeScript tests), and structured releases via release-please. The monorepo spans 1M+ LOC across Rust (core vision engine), TypeScript (web UI), and Python (model integration), suggesting production-grade infrastructure. Verdict: actively developed and production-ready for self-hosted deployments.

The codebase is dual-licensed (AGPL open-source vs proprietary Cloud API) which may create confusion around contribution scope; heavy Rust dependency (519K LOC) means build complexity and compiler brittleness. The single web app under apps/web/ with substantial MUI/Radix/Tremor dependencies (50+ packages) creates JS fatigue risk. No visibility into test coverage or issue resolution velocity from the provided metadata.

Active areas of work

Release automation is active (.release-please-config.json present), TypeScript and Rust linting are enforced via GitHub Actions, and content migration is a tracked concern (CONTENT-MIGRATION-GUIDE.md). The changelog exists but recent commit data unavailable; however, the explicit versioning and CI setup indicate regular deployments.

🚀Get running

Check README for instructions.

Daily commands:

pnpm dev                    # Dev server on default Vite port (usually 5173)
pnpm build                  # TypeScript + Vite production build
pnpm build:deploy           # Build with .env interpolation for deployment
pnpm preview:deploy         # Preview production build locally
pnpm lint                   # ESLint checks
pnpm copy-pdf-worker        # Copy PDF.js worker for browser PDF handling

For the backend: inferred Docker Compose setup via .dev/otel-collector/compose.yaml.

🗺️Map of the codebase

  • apps/web/package.json — Defines all frontend dependencies and build scripts; essential for understanding the React/Vite setup and Material-UI ecosystem this UI relies on.
  • apps/web/src/auth/Auth.tsx — Core authentication logic that gates access to the document processing application; required to understand user session management.
  • apps/web/index.html — Vite entry point and HTML root for the SPA; critical for understanding the application bootstrap and main component mount.
  • .env.example — Template for required environment variables; must be consulted to configure API endpoints, keys, and deployment settings.
  • apps/web/src/components — Central directory containing all reusable UI components (CodeBlock, ApiDialog, Dropdown, etc.); foundational to the component architecture.
  • .release-please-config.json — Configuration for automated versioning and release generation; critical for understanding the repo's release workflow and version management.
  • apps/web/vite.config.ts — Vite bundler and build configuration; essential for understanding build optimization, asset handling, and dev server setup.

🛠️How to make changes

Add a New UI Component

  1. Create a new directory under apps/web/src/components/{ComponentName}/ with .tsx and .css files. (apps/web/src/components/BetterButton/BetterButton.tsx)
  2. Follow the Material-UI / Emotion styled pattern and export the component as a named export. (apps/web/src/components/CodeBlock/CodeBlock.tsx)
  3. Import and use the component in parent pages or other components. (apps/web/src/components/Header/Header.tsx)
  4. Run linting to ensure code quality: pnpm lint (apps/web/eslint.config.js)

Add API Integration for Document Processing

  1. Define your API endpoint and authentication token in .env.example and add documentation. (.env.example)
  2. Create a fetch wrapper service in apps/web/src/ (pattern not visible but typical for React apps). (apps/web/src/auth/Auth.tsx)
  3. Use the API client in UI components (e.g., in CodeBlock example scripts or ApiKeyDialog). (apps/web/src/components/CodeBlock/exampleScripts.ts)
  4. Ensure authentication is attached to requests; see AuthGuard for session validation. (apps/web/src/auth/AuthGuard.tsx)

Add a New Feature Card / Animation

  1. Create a new Lottie JSON animation file in apps/web/src/assets/animations/. (apps/web/src/assets/animations/chunking.json)
  2. Create a corresponding feature card image (WebP) in apps/web/src/assets/cards/. (apps/web/src/assets/cards/chunking.webp)
  3. Import the animation and card image in a landing page component (e.g., Header or hero section). (apps/web/src/assets/animations)
  4. Render the card with Lottie player component (likely from @lottiefiles/react-lottie-player or similar). (apps/web/package.json)

🔧Why these technologies

  • React + Vite — Fast development cycles and modern ESM bundling for a responsive, SPA-based document management UI.
  • Material-UI (@mui/material) — Professional, accessible component library with built-in theming and chart support (@mui/x-charts) for visualizing document processing metrics.
  • Emotion (@emotion/react, @emotion/styled) — CSS-in-JS for scoped, performant styling alongside Material-UI theming and custom CSS modules.
  • Keycloak (@keycloakify/login-ui) — Enterprise-grade identity and access management with federated login support for secure document API authentication.
  • Contentful Rich Text — Parse and render richly formatted content (likely for blog/documentation pages showcasing document processing features).
  • TypeScript + ESLint — Type safety and consistent code quality across the frontend codebase.

⚖️Trade-offs already made

  • Open-source AGPL version vs. proprietary Cloud API

    • Why: Balances community adoption with commercial offering; allows developers to self-host while the company monetizes managed services.
    • Consequence: Open-source users must handle their own model deployment (community/open models); Cloud API customers get higher accuracy, speed, and enterprise SLAs.
  • Vision-Language Model (VLM) processing in backend, not browser

    • Why: VLM inference is compute-intensive and requires GPU acceleration; centralizing in backend improves latency and reduces client burden.
    • Consequence: All document uploads go to API backend, introducing network latency and server cost; enables better resource pooling but requires trust in API provider.
  • Monorepo structure with separate apps/web frontend

    • Why: Allows coordinated versioning and shared infrastructure for future services (backend, CLI, etc.).
    • Consequence: Slightly more complex setup and CI/CD; benefits larger organizations but may feel over-engineered for pure frontend projects.

🚫Non-goals (don't propose these)

  • Real-time collaborative document editing (this is an upload-and-process service, not a collaborative editor).
  • Client-side document processing (all heavy compute happens on backend API).
  • Database persistence (frontend is stateless; backend handles storage).
  • Mobile app (appears to be web-only at this time based on responsive web assets).

🪤Traps & gotchas

Dual-license ambiguity: Open-source AGPL vs. proprietary Cloud API — verify which model you're contributing to. Environment variables are mandatory: .env.example lists Keycloak, Stripe, and backend URLs; missing these breaks the app silently. PDF worker setup: pnpm copy-pdf-worker must run before browser-based PDF rendering works (non-obvious side effect). Monorepo pnpm workspaces: yarn or npm may not resolve dependencies correctly; use pnpm. Backend not in this repo: The Rust vision/OCR engine is elsewhere or private; the TypeScript app is a frontend-only scaffold. Docker Compose orchestration required for local testing: standalone pnpm dev will fail without a running API backend.

🏗️Architecture

💡Concepts to learn

  • Layout Analysis (Document Understanding) — Core capability that distinguishes Chunkr; understanding 2D spatial structure of documents (headers, tables, multi-column text) is essential for accurate chunking and downstream LLM processing.
  • Optical Character Recognition (OCR) with Bounding Boxes — Enables text extraction from images and scanned documents while preserving spatial coordinates for reconstruction of document structure in structured output formats.
  • Semantic Chunking — Breaks documents into chunks based on meaning/topic rather than fixed size; critical for RAG quality since semantically coherent chunks improve retrieval and LLM context.
  • Vision-Language Models (VLM) — Chunkr uses VLMs (likely CLIP or similar) to understand document images and extract semantic meaning; understanding their role helps debug layout/OCR failures.
  • Retrieval-Augmented Generation (RAG) — The end-use case for Chunkr's output; RAG pipelines rely on high-quality document chunks to improve LLM accuracy, so Chunkr's parsing quality directly impacts downstream AI application performance.
  • Monorepo with pnpm Workspaces — This repo uses pnpm monorepo structure (not npm/yarn); understanding workspace resolution and hoisting is essential to avoid dependency conflicts across apps/web/ and backend.
  • OpenTelemetry Observability — Production infrastructure relies on OTEL tracing and metrics (see .dev/otel-collector/); understanding distributed tracing patterns helps debug multi-service document processing pipelines.
  • Unstructured-IO/unstructured — Direct competitor for document parsing/chunking; similar problem space but Python-first with different model backend choices.
  • corpusops/lovely-pdf — Open-source PDF extraction and layout analysis; complementary low-level building block that Chunkr likely uses or competes with.
  • langchain-ai/langchain — Canonical LLM orchestration framework; Chunkr's document chunks are designed to feed into LangChain RAG pipelines, making it an ecosystem dependency.
  • getzep/zep — Companion project for semantic memory and RAG; typical downstream consumer of Chunkr's structured document output.
  • lumina-ai-inc/chunkr-cloud — Likely the proprietary Cloud API mentioned in the README; related closed-source companion for fully managed document processing.

🪄PR ideas

To work on one of these in Claude Code or Cursor, paste: Implement the "<title>" PR idea from CLAUDE.md, working through the checklist as the task list.

Add TypeScript tests for PDF worker copy utility (copyPdfWorker.js)

The build script references a 'copy-pdf-worker' task in package.json, but there's no corresponding test coverage visible in the file structure. This is a critical build utility that could silently fail. Adding Jest tests would ensure the PDF worker is correctly copied during build/deployment, preventing runtime failures in production.

  • [ ] Create tests/copyPdfWorker.test.ts to verify copyPdfWorker.js correctly copies PDF worker files to build output
  • [ ] Test both success and failure paths (missing source files, permission errors)
  • [ ] Integrate test into the 'build' script or create a pre-build validation step
  • [ ] Document the PDF worker copy process in apps/web/README.md

Add missing GitHub Actions workflow for Web app TypeScript linting and type checking

The repo has rust-lint.yml and typescript-tests.yml workflows, but no dedicated workflow for type checking the Web app (apps/web). The build script includes 'tsc -b' for TypeScript compilation, but this isn't validated in CI. This creates risk of type errors reaching main branch.

  • [ ] Create .github/workflows/web-typescript-check.yml to run 'tsc -b' in apps/web directory
  • [ ] Add eslint validation (apps/web has eslint.config.js but no CI enforcement visible)
  • [ ] Configure to run on PRs targeting main branch and commits to main
  • [ ] Add status check requirement to branch protection rules in repo settings

Add integration tests for Keycloak authentication flow in Web app

The dependencies show @keycloakify/login-ui and keycloak-js are integrated, but there's no visible test coverage in the file structure for auth flows. This is critical for security and user access. Adding tests ensures login/logout/token refresh work across browser environments.

  • [ ] Create tests/integration/auth.test.tsx for Keycloak login flow using Vitest or React Testing Library
  • [ ] Test token refresh logic, logout cleanup, and redirect behavior
  • [ ] Mock Keycloak JS library and verify correct initialization with environment variables
  • [ ] Document auth setup in apps/web/README.md with expected .env variables for Keycloak
  • [ ] Add test execution to typescript-tests.yml workflow

🌿Good first issues

  • Add unit tests to apps/web/src/ — currently no test files visible in the directory structure; start by adding Jest/Vitest setup and tests for utility functions or Redux slices.
  • Improve .env.example documentation — add comments explaining each variable's purpose, valid values, and whether it's required for dev vs. production (e.g., clarify Keycloak, Stripe, OpenTelemetry settings).
  • Create a quick-start guide for the TypeScript app in apps/web/README.md — document the exact steps to get from clone to running pnpm dev, including which backend services are needed and how to mock them for frontend-only work.

Top contributors

Click to expand

📝Recent commits

Click to expand
  • 1bde59b — Update README.md (ishaan99k)
  • f30540f — Update README.md (#578) (ishaan99k)
  • 6b1f62c — Update README.md (m-chadda)
  • c3fc804 — chore: release main (#572) (github-actions[bot])
  • ff4b906 — fix: replaced vgt with yolo model as it's more practical for consumer hardware (akhileshsharma99)
  • ff73099 — chore: updated build context for cpu (akhileshsharma99)
  • 9b110cb — chore: release main (#569) (github-actions[bot])
  • 4986633 — Update README.md (m-chadda)
  • e2b9b8a — chore: removed unused config for release please (akhileshsharma99)
  • 7a444ea — feat: created a stable and simple version by removing all extra/unused components (#570) (akhileshsharma99)

🔒Security observations

  • High · Hardcoded Default Credentials in .env.example — .env.example. The .env.example file contains hardcoded credentials for MinIO (AWS__ACCESS_KEY=minioadmin, AWS__SECRET_KEY=minioadmin) and PostgreSQL (postgres:postgres). While this is an example file, it demonstrates a pattern that could be replicated in production and indicates default credentials are used in development environments. Fix: 1) Never commit actual .env files with real credentials to version control. 2) Use secret management tools (AWS Secrets Manager, HashiCorp Vault, etc.). 3) Enforce strong credentials in production. 4) Add .env to .gitignore if not already present. 5) Document that example credentials must be changed before deployment.
  • High · Unencrypted HTTP Communication in Default Configuration — .env.example. The .env.example file configures services to communicate over HTTP (e.g., VITE_KEYCLOAK_URL=http://localhost:8080, AWS__ENDPOINT=http://minio:9000, WORKER__GENERAL_OCR_URL=http://ocr:8000). While acceptable for local development, this suggests the default setup may not enforce HTTPS in production, exposing credentials and sensitive data in transit. Fix: 1) Enforce HTTPS/TLS for all production environments. 2) Update documentation to require HTTPS configuration. 3) Use environment-specific .env files with validated HTTPS URLs. 4) Implement certificate validation and pinning for critical connections.
  • High · Missing Authentication/Authorization on Keycloak and Service Endpoints — .env.example, infrastructure configuration. The configuration exposes Keycloak and internal worker service URLs without documented authentication requirements. If these services are accessible without proper network segmentation or authentication, they could be exploited for unauthorized access or information disclosure. Fix: 1) Implement network policies (e.g., VPC, service mesh) to restrict access to internal services. 2) Require mutual TLS (mTLS) between services. 3) Document mandatory authentication requirements. 4) Use API keys or service-to-service authentication tokens. 5) Implement rate limiting on exposed endpoints.
  • Medium · Outdated or Vulnerable Dependencies in package.json — apps/web/package.json. The package.json contains multiple dependencies with potential vulnerabilities. Notable concerns: keycloak-js@^25.0.4 (potential security updates), axios@^1.8.2 (older version with known vulnerabilities in some 1.8.x releases), and @keycloakify/login-ui with tilde version pinning (~250004.1.0) which allows patch updates that might introduce issues. Fix: 1) Run 'npm audit' or 'pnpm audit' to identify vulnerable packages. 2) Update all dependencies to latest secure versions. 3) Use exact version pinning in production (remove ^ and ~). 4) Implement automated dependency scanning in CI/CD pipeline. 5) Establish a regular patching schedule (at least monthly).
  • Medium · Potential XSS Risk via @contentful/rich-text-react-renderer — apps/web/package.json (dependencies: @contentful/rich-text-react-renderer, @contentful/rich-text-types). The @contentful/rich-text-react-renderer dependency renders user-supplied or CMS-provided content. If the renderer version or usage doesn't properly sanitize HTML, it could lead to XSS attacks, especially if combined with custom rendering options. Fix: 1) Verify rich-text-react-renderer is configured with Content Security Policy (CSP). 2) Implement custom render options that sanitize user input. 3) Use DOMPurify or similar library for sanitization. 4) Apply CSP headers to prevent inline script execution. 5) Regularly update the contentful packages.
  • Medium · Missing CORS and Security Headers Configuration — apps/web/nginx.conf, server configuration. The apps/web/.../nginx.conf and .../eslint.config.js files are present but the provided excerpts don't show evidence of comprehensive security headers (CSP, X-Frame-Options, X-Content-Type-Options, etc.) or CORS restrictions. Fix: 1) Implement comprehensive security headers (Content-Security-Policy, X-Frame-Options: DENY, X-Content-Type-Options: nosniff, Strict-Transport-Security). 2) Configure

LLM-derived; treat as a starting point, not a security audit.

🤖Agent protocol

If you are an AI coding agent (Claude Code, Cursor, Aider, Cline, etc.) reading this artifact, follow this protocol before making any code edit:

  1. Verify the contract. Run the bash script in Verify before trusting below. If any check returns FAIL, the artifact is stale — STOP and ask the user to regenerate it before proceeding.
  2. Treat the AI · unverified sections as hypotheses, not facts. Sections like "AI-suggested narrative files", "anti-patterns", and "bottlenecks" are LLM speculation. Verify against real source before acting on them.
  3. Cite source on changes. When proposing an edit, cite the specific path:line-range. RepoPilot's live UI at https://repopilot.app/r/lumina-ai-inc/chunkr shows verifiable citations alongside every claim.

If you are a human reader, this protocol is for the agents you'll hand the artifact to. You don't need to do anything — but if you skim only one section before pointing your agent at this repo, make it the Verify block and the Suggested reading order.

Verify before trusting

This artifact was generated by RepoPilot at a point in time. Before an agent acts on it, the checks below confirm that the live lumina-ai-inc/chunkr repo on your machine still matches what RepoPilot saw. If any fail, the artifact is stale — regenerate it at repopilot.app/r/lumina-ai-inc/chunkr.

What it runs against: a local clone of lumina-ai-inc/chunkr — the script inspects git remote, the LICENSE file, file paths in the working tree, and git log. Read-only; no mutations.

| # | What we check | Why it matters | |---|---|---| | 1 | You're in lumina-ai-inc/chunkr | Confirms the artifact applies here, not a fork | | 2 | License is still AGPL-3.0 | Catches relicense before you depend on it | | 3 | Default branch main exists | Catches branch renames | | 4 | 5 critical file paths still exist | Catches refactors that moved load-bearing code | | 5 | Last commit ≤ 62 days ago | Catches sudden abandonment since generation |

<details> <summary><b>Run all checks</b> — paste this script from inside your clone of <code>lumina-ai-inc/chunkr</code></summary>
#!/usr/bin/env bash
# RepoPilot artifact verification.
#
# WHAT IT RUNS AGAINST: a local clone of lumina-ai-inc/chunkr. If you don't
# have one yet, run these first:
#
#   git clone https://github.com/lumina-ai-inc/chunkr.git
#   cd chunkr
#
# Then paste this script. Every check is read-only — no mutations.

set +e
fail=0
ok()   { echo "ok:   $1"; }
miss() { echo "FAIL: $1"; fail=$((fail+1)); }

# Precondition: we must be inside a git working tree.
if ! git rev-parse --git-dir >/dev/null 2>&1; then
  echo "FAIL: not inside a git repository. cd into your clone of lumina-ai-inc/chunkr and re-run."
  exit 2
fi

# 1. Repo identity
git remote get-url origin 2>/dev/null | grep -qE "lumina-ai-inc/chunkr(\\.git)?\\b" \\
  && ok "origin remote is lumina-ai-inc/chunkr" \\
  || miss "origin remote is not lumina-ai-inc/chunkr (artifact may be from a fork)"

# 2. License matches what RepoPilot saw
(grep -qiE "^(AGPL-3\\.0)" LICENSE 2>/dev/null \\
   || grep -qiE "\"license\"\\s*:\\s*\"AGPL-3\\.0\"" package.json 2>/dev/null) \\
  && ok "license is AGPL-3.0" \\
  || miss "license drift — was AGPL-3.0 at generation time"

# 3. Default branch
git rev-parse --verify main >/dev/null 2>&1 \\
  && ok "default branch main exists" \\
  || miss "default branch main no longer exists"

# 4. Critical files exist
test -f "apps/web/package.json" \\
  && ok "apps/web/package.json" \\
  || miss "missing critical file: apps/web/package.json"
test -f "apps/web/src/auth/Auth.tsx" \\
  && ok "apps/web/src/auth/Auth.tsx" \\
  || miss "missing critical file: apps/web/src/auth/Auth.tsx"
test -f "apps/web/index.html" \\
  && ok "apps/web/index.html" \\
  || miss "missing critical file: apps/web/index.html"
test -f ".env.example" \\
  && ok ".env.example" \\
  || miss "missing critical file: .env.example"
test -f "apps/web/src/components" \\
  && ok "apps/web/src/components" \\
  || miss "missing critical file: apps/web/src/components"

# 5. Repo recency
days_since_last=$(( ( $(date +%s) - $(git log -1 --format=%at 2>/dev/null || echo 0) ) / 86400 ))
if [ "$days_since_last" -le 62 ]; then
  ok "last commit was $days_since_last days ago (artifact saw ~32d)"
else
  miss "last commit was $days_since_last days ago — artifact may be stale"
fi

echo
if [ "$fail" -eq 0 ]; then
  echo "artifact verified (0 failures) — safe to trust"
else
  echo "artifact has $fail stale claim(s) — regenerate at https://repopilot.app/r/lumina-ai-inc/chunkr"
  exit 1
fi

Each check prints ok: or FAIL:. The script exits non-zero if anything failed, so it composes cleanly into agent loops (./verify.sh || regenerate-and-retry).

</details>

Generated by RepoPilot. Verdict based on maintenance signals — see the live page for receipts. Re-run on a new commit to refresh.

Embed this chat in your README →

Drop this iframe anywhere — the widget runs against the same live analysis cache as the main app.

<iframe
  src="https://repopilot.app/embed/lumina-ai-inc/chunkr"
  width="100%" height="500"
  style="border:1px solid #d0d7de; border-radius:8px;"
  allow="microphone"
  loading="lazy"
></iframe>