PARSeq: Scene Text Recognition (GGUF)

GGUF conversions of PARSeq (ECCV 2022) for use with CrispEmbed.

PARSeq is a scene text recognition model that reads text from natural images (signs, labels, documents). It recognizes 94 printable ASCII characters (digits, letters, punctuation).
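The 94-character set can be reproduced from Python's standard library, assuming the grouping in parentheses maps onto `string.digits`, `string.ascii_letters`, and `string.punctuation`. The ordering below is illustrative and may differ from the vocabulary order stored in the GGUF file; `is_representable` is a hypothetical helper that checks whether a label could be produced by the model at all.

```python
import string

# Assumption: digits (10) + lower/uppercase letters (52) + punctuation (32).
# The order inside the model's vocabulary may differ.
CHARSET = string.digits + string.ascii_letters + string.punctuation


def is_representable(label: str) -> bool:
    """Return True if every character of `label` is in the 94-char set."""
    return all(ch in CHARSET for ch in label)


print(len(CHARSET))                  # 94
print(is_representable("EXIT->"))    # True
print(is_representable("Cafe 24h"))  # False: the space is outside the set
```

Note that whitespace is not part of the set, so multi-word signs would need to be cropped to single words before recognition.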

Architecture

  • Encoder: 12-layer pre-LN ViT (patch 4×8, input 32×128 RGB, 128 tokens, GELU FFN)
  • Decoder: 1-layer two-stream Transformer (XLNet-style position queries + context self-attention, then cross-attention to encoder memory)
  • Head: Linear → 95 classes (94 printable ASCII chars + EOS)
  • Inference: Autoregressive greedy decode (max 25 characters)
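The numbers in the list above can be tied together in a small sketch: the 128 encoder tokens follow from the input and patch sizes, and the inference step is a plain arg-max loop that stops at EOS or after 25 characters. This is not the CrispEmbed implementation. `step_fn` is a hypothetical stand-in for one decoder pass, and the class layout (EOS at index 0, characters at 1..94) is an assumption.

```python
import string
from typing import Callable, List, Sequence

# Encoder token count: a 32x128 input cut into 4x8 patches.
IMG_H, IMG_W = 32, 128
PATCH_H, PATCH_W = 4, 8
NUM_TOKENS = (IMG_H // PATCH_H) * (IMG_W // PATCH_W)  # 8 rows * 16 columns

# Assumption: class 0 is EOS and classes 1..94 are the characters, in this order.
EOS_ID = 0
CHARSET = string.digits + string.ascii_letters + string.punctuation
MAX_LEN = 25


def greedy_decode(step_fn: Callable[[Sequence[int]], Sequence[float]]) -> str:
    """Pick the arg-max class at each step until EOS or MAX_LEN characters.

    `step_fn` is a hypothetical stand-in for one decoder pass: it receives the
    class ids emitted so far and returns 95 scores for the next position.
    """
    emitted: List[int] = []
    for _ in range(MAX_LEN):
        scores = step_fn(emitted)
        best = max(range(len(scores)), key=scores.__getitem__)
        if best == EOS_ID:
            break
        emitted.append(best)
    return "".join(CHARSET[i - 1] for i in emitted)


def fake_step(prefix: Sequence[int]) -> List[float]:
    """Toy scorer that spells "STOP" and then emits EOS."""
    target = [CHARSET.index(c) + 1 for c in "STOP"] + [EOS_ID]
    scores = [0.0] * (len(CHARSET) + 1)
    scores[target[len(prefix)]] = 1.0
    return scores


print(NUM_TOKENS)                # 128
print(greedy_decode(fake_step))  # STOP
```

In a real decoder each step would also attend to the 128 encoder tokens; the toy scorer ignores the image entirely and only demonstrates the stopping rules.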

Variants

File                    Variant  Params  Size   Notes
parseq-f32.gguf         Base     24M     91 MB  Full precision
parseq-q8_0.gguf        Base     24M     24 MB  Best quantized
parseq-q4_k.gguf        Base     24M     13 MB  Smallest base
parseq-tiny-f16.gguf    Tiny     6M      12 MB  Half precision
parseq-tiny-q8_0.gguf   Tiny     6M      6 MB   Smallest overall

All quantization levels produce identical output on test images.
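The file sizes in the table follow roughly from the parameter count and the bits stored per weight. The sketch below is a back-of-the-envelope estimate, assuming every tensor is stored in the named type and ignoring metadata. Real GGUF files usually keep some small tensors (norms, biases) at higher precision, so the figures are approximate.

```python
# Bits per weight for each ggml storage type.
# q8_0: blocks of 32 int8 values plus one fp16 scale   -> 34 bytes / 32 weights.
# q4_k: super-blocks of 256 weights stored in 144 bytes -> 4.5 bits / weight.
BITS_PER_WEIGHT = {"f32": 32.0, "f16": 16.0, "q8_0": 8.5, "q4_k": 4.5}


def estimate_mib(n_params: float, dtype: str) -> float:
    """Rough file size in MiB, assuming every tensor uses `dtype`."""
    return n_params * BITS_PER_WEIGHT[dtype] / 8 / 2**20


# Parameter counts are the rounded values from the table (24M base, 6M tiny).
for name, params, dtype in [
    ("parseq-f32", 24e6, "f32"),
    ("parseq-q8_0", 24e6, "q8_0"),
    ("parseq-q4_k", 24e6, "q4_k"),
    ("parseq-tiny-f16", 6e6, "f16"),
    ("parseq-tiny-q8_0", 6e6, "q8_0"),
]:
    print(f"{name}: ~{estimate_mib(params, dtype):.1f} MiB")
```

With these rounded inputs, each estimate lands within about one megabyte of the size listed in the table.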

Usage

# CLI
crispembed -m parseq-q8_0.gguf --ocr image.png

# Auto-download
crispembed -m parseq --auto-download --ocr image.png

# Python
from crispembed import CrispMathOcr
ocr = CrispMathOcr("parseq-q8_0.gguf")
text = ocr.recognize("sign.png")
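For a folder of images, one option is to wrap the CLI call shown above in a short script. This is a minimal sketch, not part of CrispEmbed: `build_command` and `recognize_folder` are hypothetical helpers, only the `-m` and `--ocr` flags from the example above are used, and it assumes the recognized text is written to stdout.

```python
import subprocess
from pathlib import Path
from typing import Dict, List


def build_command(model: str, image: Path) -> List[str]:
    """Assemble the same argv as the CLI example: crispembed -m MODEL --ocr IMAGE."""
    return ["crispembed", "-m", model, "--ocr", str(image)]


def recognize_folder(model: str, folder: Path) -> Dict[str, str]:
    """Run the CLI once per PNG and collect whatever it prints to stdout.

    Assumption: the recognized text goes to stdout. The exact output format
    is not documented here, so it is returned unparsed.
    """
    results: Dict[str, str] = {}
    for image in sorted(folder.glob("*.png")):
        proc = subprocess.run(
            build_command(model, image), capture_output=True, text=True, check=True
        )
        results[image.name] = proc.stdout.strip()
    return results
```

Calling `recognize_folder("parseq-q8_0.gguf", Path("images"))` would return a mapping from file name to raw CLI output. Because the model is reloaded for every image, the Python binding is likely the better choice for large batches.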

Benchmark (94-char, PARSeq-base)

Dataset     Accuracy
IIIT5k      99.1%
SVT         97.9%
IC13-1015   98.1%
IC15-2077   89.2%
SVTP        96.9%
CUTE80      98.6%
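To summarize the table in a single figure, the per-dataset accuracies can be averaged with or without weighting by test-set size. The sketch below assumes that accuracy means exact-match word accuracy, as is customary for these benchmarks, and that the test sets have their commonly used sizes. IC13 and IC15 carry theirs in the name; the other four counts are assumptions not stated in this card.

```python
# (accuracy in %, number of test images)
# Assumption: sizes for IIIT5k, SVT, SVTP and CUTE80 are the commonly used
# test-split counts; they are not stated in this model card.
RESULTS = {
    "IIIT5k": (99.1, 3000),
    "SVT": (97.9, 647),
    "IC13-1015": (98.1, 1015),
    "IC15-2077": (89.2, 2077),
    "SVTP": (96.9, 645),
    "CUTE80": (98.6, 288),
}


def word_accuracy(predictions, labels) -> float:
    """Percentage of predictions that match their label exactly."""
    hits = sum(p == t for p, t in zip(predictions, labels))
    return 100.0 * hits / len(labels)


total = sum(n for _, n in RESULTS.values())
weighted = sum(acc * n for acc, n in RESULTS.values()) / total
unweighted = sum(acc for acc, _ in RESULTS.values()) / len(RESULTS)

print(f"images: {total}")
print(f"weighted mean: {weighted:.2f}%")
print(f"unweighted mean: {unweighted:.2f}%")
```

The weighted mean comes out lower than the unweighted one because IC15-2077, the hardest set in the table, is also one of the largest.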

Source
