PARSeq — Scene Text Recognition (GGUF)

GGUF conversions of PARSeq (ECCV 2022) for use with CrispEmbed.

PARSeq is a scene text recognition model that reads text from natural images (signs, labels, documents). It recognizes 94 printable ASCII characters (digits, letters, punctuation).
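
As a rough illustration, the 94-character set can be rebuilt from Python's standard `string` constants. This is a sketch: the ordering used here (digits, letters, punctuation) is an assumption and may not match the model's class indices, and `is_recognizable` is a hypothetical helper, not part of CrispEmbed.

```python
import string

# Assumed ordering: digits, letters, punctuation. Only the character count
# (94) comes from the description above; the real index order may differ.
CHARSET = string.digits + string.ascii_letters + string.punctuation

def is_recognizable(text: str) -> bool:
    """Return True if every character of `text` is in the 94-character set."""
    return all(ch in CHARSET for ch in text)

print(len(CHARSET))                 # 10 digits + 52 letters + 32 punctuation = 94
print(is_recognizable("Exit-12!"))  # True
print(is_recognizable("EXIT 12"))   # False: the space is not one of the 94 characters
```

Because whitespace is outside the set, a multi-word sign is typically read one cropped word at a time.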

Architecture

Encoder: 12-layer pre-LN ViT (patch 4×8, input 32×128 RGB, 128 tokens, GELU FFN)
Decoder: 1-layer two-stream Transformer (XLNet-style position queries + context self-attention, then cross-attention to encoder memory)
Head: Linear → 95 classes (94 printable ASCII chars + EOS)
Inference: Autoregressive greedy decode (max 25 characters)
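
The token count and the decode loop listed above can be sketched in plain Python. `step_fn` is a hypothetical stand-in for one decoder pass that returns 95 logits for the next position; the EOS index (0) and the charset order are assumptions made for illustration, and `toy_step` exists only to exercise the loop.

```python
import string
from typing import Callable, List, Sequence

CHARSET = string.digits + string.ascii_letters + string.punctuation  # 94 chars, assumed order
EOS_ID = 0                      # assumption: EOS is class 0, characters are classes 1..94
NUM_CLASSES = len(CHARSET) + 1  # 95
MAX_LEN = 25

def num_patch_tokens(height: int = 32, width: int = 128,
                     patch_h: int = 4, patch_w: int = 8) -> int:
    """Number of ViT tokens for a grid of non-overlapping patches."""
    return (height // patch_h) * (width // patch_w)

def greedy_decode(step_fn: Callable[[Sequence[int]], Sequence[float]]) -> str:
    """Take the highest-scoring class at each step until EOS or MAX_LEN."""
    token_ids: List[int] = []
    for _ in range(MAX_LEN):
        logits = step_fn(token_ids)
        best = max(range(NUM_CLASSES), key=lambda i: logits[i])
        if best == EOS_ID:
            break
        token_ids.append(best)
    return "".join(CHARSET[i - 1] for i in token_ids)

def toy_step(prefix: Sequence[int], target: str = "STOP") -> List[float]:
    """Toy stand-in for the decoder: spells `target`, then emits EOS."""
    logits = [0.0] * NUM_CLASSES
    if len(prefix) < len(target):
        logits[CHARSET.index(target[len(prefix)]) + 1] = 1.0
    else:
        logits[EOS_ID] = 1.0
    return logits

print(num_patch_tokens())       # (32 / 4) * (128 / 8) = 128
print(greedy_decode(toy_step))  # STOP
```

In a real pipeline `step_fn` would run the decoder against the encoder memory; the loop structure stays the same.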

Variants

File	Variant	Params	Size	Notes
`parseq-f32.gguf`	Base	24M	91 MB	Full precision
`parseq-q8_0.gguf`	Base	24M	24 MB	Best quantized
`parseq-q4_k.gguf`	Base	24M	13 MB	Smallest base
`parseq-tiny-f16.gguf`	Tiny	6M	12 MB	Half precision
`parseq-tiny-q8_0.gguf`	Tiny	6M	6 MB	Smallest overall
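
The base-model sizes are close to what the parameter count and nominal bits per weight predict. The sketch below assumes about 23.8M parameters for the base model, nominal GGML block layouts (8.5 bits per weight for Q8_0, 4.5 for Q4_K), sizes measured in MiB, and every tensor stored in the stated type. Real files also carry metadata and may keep some tensors at higher precision, so treat the result as an estimate.

```python
# Nominal bits per weight, including per-block scale overhead.
BITS_PER_WEIGHT = {"f32": 32.0, "f16": 16.0, "q8_0": 8.5, "q4_k": 4.5}

BASE_PARAMS = 23.8e6  # assumption: base-model parameter count

def estimate_mib(n_params: float, dtype: str) -> float:
    """Estimated tensor payload in MiB for a uniformly typed model."""
    return n_params * BITS_PER_WEIGHT[dtype] / 8 / 2**20

for dtype in ("f32", "q8_0", "q4_k"):
    print(f"parseq-{dtype}.gguf ~ {estimate_mib(BASE_PARAMS, dtype):.0f} MiB")
# parseq-f32.gguf ~ 91 MiB
# parseq-q8_0.gguf ~ 24 MiB
# parseq-q4_k.gguf ~ 13 MiB
```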

All quantization levels produce identical output on test images.

Usage

# CLI
crispembed -m parseq-q8_0.gguf --ocr image.png

# Auto-download
crispembed -m parseq --auto-download --ocr image.png

# Python
from crispembed import CrispMathOcr
ocr = CrispMathOcr("parseq-q8_0.gguf")
text = ocr.recognize("sign.png")

Benchmark (94-char, PARSeq-base)

Dataset	Accuracy
IIIT5k	99.1%
SVT	97.9%
IC13-1015	98.1%
IC15-2077	89.2%
SVTP	96.9%
CUTE80	98.6%
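
To summarize the table in one number, the per-dataset accuracies can be averaged either uniformly or weighted by test-set size. This is a sketch: the image counts below are assumptions based on the commonly used test splits (two of them are encoded in the dataset names), not values given in this document.

```python
# dataset -> (accuracy in %, assumed number of test images)
RESULTS = {
    "IIIT5k":    (99.1, 3000),
    "SVT":       (97.9, 647),
    "IC13-1015": (98.1, 1015),
    "IC15-2077": (89.2, 2077),
    "SVTP":      (96.9, 645),
    "CUTE80":    (98.6, 288),
}

def mean_accuracy(results: dict) -> float:
    """Unweighted mean over datasets."""
    return sum(acc for acc, _ in results.values()) / len(results)

def weighted_accuracy(results: dict) -> float:
    """Mean weighted by the number of test images per dataset."""
    total = sum(n for _, n in results.values())
    return sum(acc * n for acc, n in results.values()) / total

print(f"unweighted: {mean_accuracy(RESULTS):.2f}%")
print(f"weighted:   {weighted_accuracy(RESULTS):.2f}%")
```

The weighted figure is pulled down by IC15-2077, which is both the hardest and the second-largest set.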

Source

Paper: Scene Text Recognition with Permuted Autoregressive Sequence Models (ECCV 2022)
Code: baudm/parseq (Apache-2.0)
Converted with models/convert-parseq-to-gguf.py from CrispEmbed

arXiv: 2207.06966 (published Jul 14, 2022)