---
configs:
- config_name: default
  data_files:
  - split: train
    path: data/*.jsonl.gz
tags:
- pdf
- ocr
- chandra
- chandra-ocr-2
- markdown
- html
- hf-jobs
- uv-script
---

# PDF OCR with Chandra OCR 2

This output bundle stores OCR results produced by [datalab-to/chandra-ocr-2](https://huggingface.co/datalab-to/chandra-ocr-2) for the PDFs listed in a supplied URL list.

## Summary

- Output bucket: `hf://buckets/sroecker/pdf-chandra-ocr`
- Source PDF URLs in input list: 111
- Processed inputs recorded in `state/processed_inputs.txt`: 111
- Successes: 111
- Partial successes: 0
- Errors: 0
- Next shard index: 12
- Updated at: 2026-04-15T18:56:12.558178+00:00

## Files

- `data/part-*.jsonl.gz`: OCR result shards, one JSON object per PDF
- `state/processed_inputs.txt`: URLs of completed PDFs, used to resume an interrupted run
- `state/summary.json`: aggregate counters and bookkeeping
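
The layout above can be read without the `datasets` library. The following is a minimal sketch, assuming each shard is gzip-compressed JSON Lines and that `state/processed_inputs.txt` holds one URL per line. The helper names `iter_shard_records` and `pending_urls` are hypothetical and are not taken from the job script.

```python
import gzip
import json
from pathlib import Path
from typing import Iterator


def iter_shard_records(shard_path: Path) -> Iterator[dict]:
    """Yield one dict per non-empty line of a gzip-compressed JSONL shard."""
    with gzip.open(shard_path, mode="rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if line:
                yield json.loads(line)


def pending_urls(all_urls: list[str], processed_file: Path) -> list[str]:
    """Return the URLs that are not yet listed in the processed-inputs file.

    Assumption: the file contains one completed PDF URL per line.
    """
    if not processed_file.exists():
        return list(all_urls)
    lines = processed_file.read_text(encoding="utf-8").splitlines()
    done = {line.strip() for line in lines if line.strip()}
    # Preserve the input order so a resumed run visits URLs predictably.
    return [url for url in all_urls if url not in done]
```

A resumed run could then skip finished work by passing the full URL list and a local copy of `state/processed_inputs.txt` to `pending_urls`. The actual resume logic in the job script may differ.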

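Each record includes the following page-accounting fields, alongside fields such as `source_id`, `pdf_url`, and `markdown`: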

- `num_pages`: total number of pages in the source PDF
- `num_pages_processed`: number of pages actually sent to OCR
- `pdf_exceeds_page_limit`: whether the PDF had more pages than the configured OCR cap
- `max_pages_per_paper`: configured OCR page cap for the run
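
The four page-accounting fields relate to one another in a simple way. The sketch below assumes the cap only truncates, so that a fully processed PDF has `num_pages_processed == min(num_pages, max_pages_per_paper)`. That relationship is an assumption, not something this card states, and a partially successful record may process fewer pages. The function names and the example record are made up for illustration.

```python
def expected_pages_processed(record: dict) -> int:
    """Pages that should reach OCR if the cap simply truncates (assumption)."""
    return min(record["num_pages"], record["max_pages_per_paper"])


def pages_skipped(record: dict) -> int:
    """Number of source pages that were not sent to OCR."""
    return record["num_pages"] - record["num_pages_processed"]


def is_consistent(record: dict) -> bool:
    """Check the page fields against the assumed truncation rule."""
    exceeds = record["num_pages"] > record["max_pages_per_paper"]
    return (
        record["pdf_exceeds_page_limit"] == exceeds
        and record["num_pages_processed"] == expected_pages_processed(record)
    )


# Illustrative record with made-up values, not taken from the dataset.
example = {
    "num_pages": 42,
    "num_pages_processed": 30,
    "pdf_exceeds_page_limit": True,
    "max_pages_per_paper": 30,
}
print(pages_skipped(example), is_consistent(example))
```

Records where `is_consistent` returns `False` are worth inspecting by hand, since they may indicate a partial failure or a different capping rule than the one assumed here.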

## Load the results

```python
from datasets import load_dataset

# Replace <dataset-id> with the repository ID that hosts these shards.
dataset = load_dataset("<dataset-id>", data_files="data/*.jsonl.gz", split="train")
print(dataset[0]["source_id"])
print(dataset[0]["pdf_url"])
print(dataset[0]["markdown"][:1000])
```

## Job config

- Prompt type: `ocr_layout`
- Page batch size: 28
- Max output tokens: 12384
- Max model length: 18000
- GPU memory utilization: 0.85
- Minimum download request interval: 0.0 seconds
- Max pages per paper sent to OCR: 30
- Bucket backend: hf-cli
- Paginate output: False
- Include headers/footers: False
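
Some of these settings interact. The sketch below works through the arithmetic under two assumptions that this card does not confirm: that the max model length is the total context shared by prompt and output (as with vLLM's `max_model_len`), and that the pages of one PDF are grouped into batches of at most the page batch size. `JobConfig` and both helper functions are hypothetical names.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class JobConfig:
    """Values copied from the job config list above."""
    page_batch_size: int = 28
    max_output_tokens: int = 12384
    max_model_length: int = 18000
    max_pages_per_paper: int = 30


def input_token_budget(cfg: JobConfig) -> int:
    """Tokens left for prompt and page image if the output uses its full budget.

    Assumption: max_model_length covers prompt plus generated tokens.
    """
    return cfg.max_model_length - cfg.max_output_tokens


def batches_for(num_pages: int, cfg: JobConfig) -> int:
    """Batches needed for one PDF, assuming per-PDF batching of capped pages."""
    pages = min(num_pages, cfg.max_pages_per_paper)
    return math.ceil(pages / cfg.page_batch_size)


cfg = JobConfig()
print(input_token_budget(cfg))  # 18000 - 12384 = 5616
print(batches_for(100, cfg))    # capped at 30 pages, so ceil(30 / 28) = 2
```

Under these assumptions, a PDF that hits the 30-page cap needs two batches, the second holding only two pages.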

## Reproduction

```bash
hf jobs uv run --flavor l4x1 --image vllm/vllm-openai:v0.17.0 \
  -s HF_TOKEN --timeout 2d \
  ./chandra2-arxiv-ocr.py --output-dataset hf://buckets/sroecker/pdf-chandra-ocr \
  --output-bucket hf://buckets/sroecker/pdf-chandra-ocr \
  --pdf-urls-url https://.../pdf_urls.txt
```