Best open-source model for parsing messy PDFs on 16GB RAM (CPU only)

Probably the repo_id format is wrong… Also, since many GGUF models aren’t gated, it’s faster to search for them directly.


A) What’s available right now

  • Mistral 7B: public.

Use mistralai/Mistral-7B-v0.3 (base) or mistralai/Mistral-7B-Instruct-v0.3 (chat). (Hugging Face)

  • “LLaMA 3B”: this is Llama-3.2-3B from Meta. It is gated.

Use meta-llama/Llama-3.2-3B or meta-llama/Llama-3.2-3B-Instruct. (Hugging Face)

Why you saw 404s:

  • Wrong or incomplete repo ID (IDs are owner/repo, case-sensitive). (Hugging Face Forums)

  • Repo is private or gated and you’re not approved yet. HF returns 404 in that case. (Hugging Face)
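If you want to check a repo ID programmatically before downloading, here is a minimal sketch using huggingface_hub (the repo IDs are just examples; swap in your own):

from huggingface_hub import HfApi
from huggingface_hub.utils import GatedRepoError, RepositoryNotFoundError

api = HfApi()  # picks up the token stored by `hf auth login`, if any

for repo_id in ["mistralai/Mistral-7B-Instruct-v0.3", "meta-llama/Llama-3.2-3B-Instruct"]:
    try:
        info = api.model_info(repo_id)
        print(f"{repo_id}: visible, {len(info.siblings or [])} files")
    except GatedRepoError:
        print(f"{repo_id}: gated - request access on the model page")  # authenticated but not approved
    except RepositoryNotFoundError:
        print(f"{repo_id}: not found - wrong ID, private, or you lack access")

Note the order of the except clauses: GatedRepoError is a subclass of RepositoryNotFoundError, so it has to be caught first.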

B) How to request access for gated models (e.g., Llama 3.2)

  1. Open the model page while logged in and click Request/Accept access. (Hugging Face)

  2. Gated access is controlled on the Hub; owners can require acceptance and review. (Hugging Face)

C) Minimal commands that work

Authenticate and verify:


# login

hf auth login # or: huggingface-cli login

# verify

hf auth whoami # or: huggingface-cli whoami

(Hugging Face)

Download models:


# Public Mistral 7B Instruct

hf download mistralai/Mistral-7B-Instruct-v0.3 --local-dir ./mistral-7b-instruct

# Public Mistral 7B base

hf download mistralai/Mistral-7B-v0.3 --local-dir ./mistral-7b-base

# Gated Llama 3.2 3B (accept license on the repo page first)

hf download meta-llama/Llama-3.2-3B-Instruct --local-dir ./llama-3.2-3b-instruct

(Hugging Face)

Python alternative:


from huggingface_hub import snapshot_download

snapshot_download("mistralai/Mistral-7B-Instruct-v0.3", local_dir="./mistral-7b-instruct")

(Hugging Face)

D) CPU-only? Use GGUF + llama.cpp

This avoids big PyTorch installs and runs fully on CPU.

Install llama.cpp and run a GGUF directly from HF:


# build

git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp

cmake -B build && cmake --build build --config Release

# run a GGUF in one line (example: Llama-3.2-3B Instruct, quantized)

./build/bin/llama-cli -hf bartowski/Llama-3.2-3B-Instruct-GGUF:Q4_K_M -p "Hello"

You can replace the -hf target with any GGUF repo:file. (GitHub)
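If you'd rather stay in Python, the llama-cpp-python bindings can pull the same GGUF files from the Hub; a rough sketch, using the same community quant as above (adjust repo/filename to taste):

# pip install llama-cpp-python huggingface_hub
from llama_cpp import Llama

# Downloads the GGUF on first use, then runs it on CPU
llm = Llama.from_pretrained(
    repo_id="bartowski/Llama-3.2-3B-Instruct-GGUF",
    filename="*Q4_K_M.gguf",  # glob pattern; selects the Q4_K_M quant
    n_ctx=4096,
    verbose=False,
)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Hello"}],
    max_tokens=200,
)
print(out["choices"][0]["message"]["content"])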

E) Good, small, public CPU models (ready today)

  • Phi-3 Mini 4K Instruct (3.8B). Also has an official GGUF repo. (Hugging Face)

  • Qwen2.5-3B-Instruct (3B). Strong small model. (Hugging Face)

  • SmolLM3-3B. Fully open 3B; ONNX and GGUF variants exist from the community. (Hugging Face)

  • TinyLlama-1.1B-Chat GGUF, for very low-resource CPU. (Hugging Face)

  • Zephyr-7B-beta GGUF, if you want a 7B chat model and can tolerate slower CPU speed. (Hugging Face)

F) Quick 404 checklist

  • Always use the exact owner/repo ID, correct casing. (Hugging Face Forums)

  • Open the repo page in your browser to confirm visibility or gating. (Hugging Face)

  • Stay authenticated before downloading (hf auth whoami). (Hugging Face)

  • If it’s truly private, owners must grant you access; otherwise you’ll keep seeing 404. (Hugging Face Forums)

G) Note on Python 3.13

3.13 support is improving, but some stacks still lag. PyTorch now lists 3.13 support on its install page, but some dependencies have had issues; if you hit install errors, fall back to Python 3.12. (PyTorch)


Short “do this now” script (Ubuntu, CPU)


# 1) Auth

pipx install huggingface_hub || pip install -U huggingface_hub

hf auth login && hf auth whoami

# 2) Download public Mistral 7B Instruct

hf download mistralai/Mistral-7B-Instruct-v0.3 --local-dir ./mistral-7b-instruct

# 3) If you want Llama-3.2-3B, accept access in browser, then:

hf download meta-llama/Llama-3.2-3B-Instruct --local-dir ./llama-3.2-3b-instruct

# 4) CPU inference via llama.cpp + GGUF

git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp

cmake -B build && cmake --build build --config Release

./build/bin/llama-cli -hf microsoft/Phi-3-mini-4k-instruct-gguf:Q4_K_M -p "Hello"

(Hugging Face)


Curated references

  • Model cards: Mistral-7B v0.3, Mistral-7B-Instruct-v0.3, Llama-3.2-3B. Useful to verify availability and licenses. (Hugging Face)

  • HF docs: hf download, downloads API, gated-model rules. Practical CLI and why 404 appears for private/gated. (Hugging Face)

  • llama.cpp: repo and GGUF-on-HF usage page. CPU-first, simple runs with -hf. (GitHub)

  • Alternative small models: Phi-3 Mini 4K, Qwen2.5-3B, SmolLM3-3B, TinyLlama, Zephyr-7B. Good CPU options. (Hugging Face)


Hi everyone

I wanted to share an update regarding my local testing of several models.

Models tested locally:

  1. unsloth/gemma-3-4b-it-GGUF (gemma3_4b_it)
  2. ibm-granite/granite-3.3-2b-instruct-GGUF
  3. Qwen3-4B-Instruct-2507-GGUF
  4. mistralai/Mistral-7B-Instruct-v0.3

All of these models were downloaded locally and tested using the transformers library in Python.

System info:

OS: Linux (CPU only)

RAM: 32 GB

Torch dtype: float32 (CPU)

Observations:

Gemma-3-4B is very fast and responsive on my system.

Granite 3.3-2B and Qwen3-4B show moderate speed.

Mistral-7B-Instruct is significantly slower on CPU, as expected for a 7B model.

Code snippet used for testing (example for Mistral-7B):

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Local CPU setup: load in float32 (no GPU available)
model_path = "/home/ahmad/Documents/financeAgent/prune_models/mistral-7b-instruct"
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32, device_map="auto")

def chat(prompt):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200)
    reply = tokenizer.decode(outputs[0], skip_special_tokens=True)
    print(reply)

chat("Hello! How are you?")
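To put numbers on "fast" vs "slow", a rough tokens-per-second measurement can be wrapped around generate(); a minimal sketch reusing the tokenizer and model from the snippet above:

import time

def timed_chat(prompt, max_new_tokens=200):
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    start = time.perf_counter()
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    elapsed = time.perf_counter() - start
    # Count only the newly generated tokens, not the prompt
    new_tokens = outputs.shape[-1] - inputs["input_ids"].shape[-1]
    print(tokenizer.decode(outputs[0], skip_special_tokens=True))
    print(f"{new_tokens} tokens in {elapsed:.1f}s -> {new_tokens / elapsed:.2f} tok/s")

timed_chat("Hello! How are you?")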


It’s good that Gemma 3 4B runs fast on CPU. Gemma 3 offers strong overall performance, including multilingual capabilities. If text-only support is sufficient, there’s also the 1B model.


Hi everyone,

I’m experimenting with pruning and quantizing LLMs (like unsloth/gemma-2-2b-it) and running into several issues. I’m hoping to get advice or best practices. Here’s what I observed:


1) Model size increases after pruning

  • After structured pruning (~30%), the model size doubled instead of decreasing.

  • I suspect this is due to the PyTorch mask tensors added during pruning (see the prune.remove sketch at the end of this post).


2) Accuracy and inference time unchanged

  • After pruning, accuracy and response time remain almost identical.

  • Only a portion of weights were pruned; CPU inference doesn’t get faster automatically.


3) 4-bit quantization on CPU

  • Attempting 4-bit quantization fails on CPU.

  • bitsandbytes library is GPU-optimized, so CPU-only systems aren’t supported.


4) INT8 quantization issues

  • INT8 quantization sometimes crashes when saving with save_pretrained().

  • Seems Transformers’ serialization does not fully support int8 tensors on CPU.


5) Package / environment issues

  • Missing bitsandbytes → 4-bit quantization fails.

  • Missing sentencepiece → tokenizer fails.

  • Missing langchain.text_splitter → ingestion fails.


6) Saving pruned + quantized model

  • Pruned + quantized model sometimes fails to save or size doubles.

7) GPU vs CPU differences

  • On CPU, cannot benefit from 4-bit quantization or speed-up.

  • GPU-only optimized kernels are needed for memory and inference improvements.


Questions:

  1. Is there a recommended way to prune and quantize models on CPU without increasing size?

  2. How do people typically handle saving pruned + quantized models?

  3. Any tips to get speed/memory benefits on CPU?

  4. Are there alternative approaches for CPU-only systems to reduce memory while maintaining accuracy?

Thanks in advance for any guidance!
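Regarding point 1: torch.nn.utils.prune keeps both the original weight (weight_orig) and a mask buffer, which is why the checkpoint grows. A minimal sketch of making the pruning permanent before saving, assuming ln_structured was applied to the Linear layers; note this removes the masks but keeps the zeroed weights dense, so file size and CPU speed won't improve without a sparse or quantized runtime:

import torch
import torch.nn.utils.prune as prune
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("unsloth/gemma-2-2b-it", torch_dtype=torch.float32)

# Prune 30% of output channels of every Linear layer by L2 norm
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.ln_structured(module, name="weight", amount=0.3, n=2, dim=0)

# Make pruning permanent: drops weight_orig + weight_mask, leaves a single dense weight
for module in model.modules():
    if isinstance(module, torch.nn.Linear):
        prune.remove(module, "weight")

model.save_pretrained("./gemma-2-2b-it-pruned")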


Hi everyone,

I’m working with Gemma2 2B locally on CPU and facing several issues. I would really appreciate your guidance.

  1. Pruning:

    • Applied structured pruning using PyTorch (ln_structured), but after pruning, the model size doubled instead of reducing.

    • Accuracy and inference time remain almost the same. Is this expected? How can I effectively reduce model size?

  2. Quantization:

    • Tried dynamic INT8 quantization on CPU, but saving the model gives:

      AttributeError: 'torch.dtype' object has no attribute 'data_ptr'
      
      
    • Is INT8 quantization supported on CPU for Gemma2 2B? How can I do it properly? (see the sketch at the end of this post)

  3. Fine-tuning:

    • Using Hugging Face Trainer on CPU causes the process to get killed due to memory limits.

    • Also faced prepare_model_for_int8_training not found in PEFT.

    • How can I fine-tune Gemma2 2B locally on CPU with a small dataset efficiently?

  4. Tokenizer:

    • GemmaTokenizer requires SentencePiece. Installed it, but unsure if additional dependencies are needed.
  5. General advice:

    • Is running Gemma2 2B for pruning, quantization, or fine-tuning on CPU feasible?

    • Tips for reducing memory usage and speeding up training/inference?

I would really appreciate it if anyone could reply with guidance or best practices for CPU setups.

Thank you!
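On issue 2: PyTorch dynamic INT8 quantization does run on CPU, but in my experience the resulting model often can't go through save_pretrained() because Transformers doesn't recognize the quantized module types; saving the state dict with torch.save tends to be the more reliable route. A minimal sketch (the model path is a placeholder for your local copy):

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "./gemma-2-2b-it"  # placeholder: your local model directory
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32)
model.eval()

# Swap nn.Linear layers for dynamically quantized INT8 versions (CPU-oriented)
qmodel = torch.quantization.quantize_dynamic(model, {torch.nn.Linear}, dtype=torch.qint8)

# Save the quantized weights with torch.save instead of save_pretrained()
torch.save(qmodel.state_dict(), "gemma-2-2b-it-int8-state_dict.pt")

inputs = tokenizer("Hello!", return_tensors="pt")
with torch.no_grad():
    out = qmodel.generate(**inputs, max_new_tokens=50)
print(tokenizer.decode(out[0], skip_special_tokens=True))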


Hi community,
I am pruning the Gemma 2B model using PyTorch’s nn.utils.prune. After pruning ~20–30% of the weights, I notice some performance degradation.

  • What are the best practices to minimize accuracy loss during pruning?

  • Is structured pruning recommended over unstructured pruning for LLMs like Gemma?

  • After pruning, is LoRA fine-tuning sufficient to recover performance?


Title: Pruning + Fine-Tuning Workflow for Gemma 2B
Body:
Hello,
I want to prune Gemma 2B and then fine-tune it on a small dataset.

  • Should I prune first and then fine-tune with full parameters, or is LoRA/PEFT tuning enough?

  • Any advice on learning rate, batch size, or number of epochs for fine-tuning after pruning?

  • Has anyone successfully pruned + fine-tuned Gemma 2B? Any tips?


2) Quantization Questions

Title: CPU Inference for Gemma 2B After Pruning + Fine-Tuning
Body:
Hi,
I plan to perform pruning + LoRA fine-tuning + quantization for CPU inference on Gemma 2B.

  • Which quantization method works best for CPU: 8-bit or 4-bit?

  • Can I combine pruning + LoRA fine-tuning + 4-bit quantization without losing significant accuracy?

  • Does BitsAndBytes fully support CPU-only quantization for a 2B parameter model?


3) Hardware / Workflow Questions

Title: Minimum GPU Requirements for Gemma 2B Workflow
Body:
Hello Hugging Face community,
I am planning a workflow for Gemma 2B:

  1. Download the model

  2. Prune ~20–30% weights

  3. Fine-tune with LoRA

  4. Quantize for CPU inference

  • What is the minimum GPU VRAM required for this workflow?

  • Can pruning be done entirely on CPU, or is GPU strongly recommended?

  • Are there any example scripts for pruning + LoRA fine-tuning + quantization for 2B+ LLMs? (a minimal LoRA sketch follows below)
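For reference, the LoRA step on its own is short with PEFT; a minimal sketch, assuming a local (optionally pruned) Gemma 2B checkpoint, with paths and hyperparameters as placeholders:

# pip install peft transformers
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_path = "./gemma-2b"  # placeholder: your local (optionally pruned) checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(model_path, torch_dtype=torch.float32)

# LoRA adapters on the attention projections; only a small fraction of parameters train
lora = LoraConfig(
    r=8,
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora)
model.print_trainable_parameters()

# From here, train with Trainer / SFTTrainer as usual; on CPU keep the
# per-device batch size small (1-2) and sequences short.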


Tips before posting:

  • Include your environment (PyTorch version, GPU/CPU, RAM)

  • Include code snippet if possible, e.g., how you’re pruning or fine-tuning
