Can't use any model to generate text

from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint
from dotenv import load_dotenv

load_dotenv()

llm = HuggingFaceEndpoint(
    repo_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    task="text-generation"
)

model = ChatHuggingFace(llm=llm)

result = model.invoke(“What is the capital of India”)
print(result.content)

The error is provided below. No matter what I do the code doesn’t work; I have also tried InferenceClient.
Traceback (most recent call last):
  File "E:\huggingface checking\test2.py", line 13, in <module>
    result = model.invoke("What is the capital of India")
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\huggingface checking\env\Lib\site-packages\langchain_core\language_models\chat_models.py", line 382, in invoke
    self.generate_prompt(
  File "E:\huggingface checking\env\Lib\site-packages\langchain_core\language_models\chat_models.py", line 1101, in generate_prompt
    return self.generate(prompt_messages, stop=stop, callbacks=callbacks, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\huggingface checking\env\Lib\site-packages\langchain_core\language_models\chat_models.py", line 911, in generate
    self._generate_with_cache(
  File "E:\huggingface checking\env\Lib\site-packages\langchain_core\language_models\chat_models.py", line 1205, in _generate_with_cache
    result = self._generate(
             ^^^^^^^^^^^^^^^
  File "E:\huggingface checking\env\Lib\site-packages\langchain_huggingface\chat_models\huggingface.py", line 587, in _generate
    answer = self.llm.client.chat_completion(messages=message_dicts, **params)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "E:\huggingface checking\env\Lib\site-packages\huggingface_hub\inference\_client.py", line 878, in chat_completion
    provider_helper = get_provider_helper(
                      ^^^^^^^^^^^^^^^^^^^^
  File "E:\huggingface checking\env\Lib\site-packages\huggingface_hub\inference\_providers\__init__.py", line 217, in get_provider_helper
    provider = next(iter(provider_mapping)).provider
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
StopIteration


Due to significant changes on the HF API side, the currently available models and usage methods differ considerably from before. Upgrading LangChain and modifying the code are also essentially required.


Your error is caused by routing. ChatHuggingFace calls the Hugging Face chat-completion router. That router only works for models that are deployed by an Inference Provider and for tokens with Inference Providers permission. TinyLlama/TinyLlama-1.1B-Chat-v1.0 is not deployed by any provider, so provider mapping is empty and the wrapper crashes. The old api-inference.huggingface.co path is shut down, so even “classic” text-gen URLs now 404. (Hugging Face)

Background you need

  • Inference moved behind the Providers router (router.huggingface.co). Chat uses an OpenAI-compatible API. Tokens must include “Inference Providers.” (Hugging Face)
  • “HF Inference” is the serverless option that replaced the legacy “Inference API (serverless).” It serves only a catalog of supported models. Not all text-gen models are included. (Hugging Face)
  • Requests to the legacy api-inference.huggingface.co now return 404. HF staff direct users to the router or to the “hf-inference” provider. (Hugging Face Forums)
  • TinyLlama’s model card explicitly shows: “This model isn’t deployed by any Inference Provider.” That blocks both chat-completion and serverless text-gen via HF routing; you can confirm the empty provider mapping yourself (see the sketch after this list). (Hugging Face)
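
If you want to see that empty mapping with your own eyes, query the Hub API for the model’s provider list. A minimal sketch with requests; the expand[]=inferenceProviderMapping parameter and the response field name are assumptions based on what huggingface_hub fetches internally, so the exact shape may differ:

# deps: pip install -U requests
import requests

def providers_for(repo_id: str):
    # Ask the Hub which Inference Providers serve this repo (assumed API shape).
    r = requests.get(
        f"https://huggingface.co/api/models/{repo_id}",
        params={"expand[]": "inferenceProviderMapping"},
        timeout=30,
    )
    r.raise_for_status()
    return r.json().get("inferenceProviderMapping", {})

print(providers_for("TinyLlama/TinyLlama-1.1B-Chat-v1.0"))  # empty -> nothing to route to -> StopIteration
print(providers_for("meta-llama/Llama-3.1-8B-Instruct"))    # non-empty -> usable with the router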

What works today

Choose one path and apply it exactly.

1) Use a provider-backed chat model with ChatHuggingFace

# deps: pip install -U langchain-huggingface huggingface_hub python-dotenv
from dotenv import load_dotenv; load_dotenv()
from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint
from langchain_core.messages import SystemMessage, HumanMessage

llm = HuggingFaceEndpoint(
    repo_id="meta-llama/Llama-3.1-8B-Instruct",  # provider-backed
    task="text-generation",
    max_new_tokens=64,
    temperature=0.2,
)

chat = ChatHuggingFace(llm=llm)
resp = chat.invoke([SystemMessage(content="You are helpful."),
                    HumanMessage(content="What is the capital of India?")])
print(resp.content)

Why this works: Chat goes through the router to a real provider backend. The official docs show the router base URL and token requirement. (Hugging Face)

2) Use the OpenAI-compatible router directly

# deps: pip install -U openai
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://huggingface.co/proxy/router.huggingface.co/v1",
    api_key=os.environ["HF_TOKEN"],  # token with “Inference Providers”
)
r = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct:cerebras",  # pick a listed provider
    messages=[{"role":"user","content":"What is the capital of India?"}],
    max_tokens=64,
)
print(r.choices[0].message.content)

This is the exact pattern in HF’s Chat Completion docs. (Hugging Face)
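
You mentioned trying InferenceClient; the same provider-backed model works with the huggingface_hub SDK too. A minimal sketch, assuming a recent huggingface_hub (the provider argument is newer; older versions don’t accept it):

# deps: pip install -U huggingface_hub
import os
from huggingface_hub import InferenceClient

client = InferenceClient(provider="cerebras", api_key=os.environ["HF_TOKEN"])  # or omit provider to let HF pick one

out = client.chat_completion(
    messages=[{"role": "user", "content": "What is the capital of India?"}],
    model="meta-llama/Llama-3.1-8B-Instruct",
    max_tokens=64,
)
print(out.choices[0].message.content)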

3) If you must use TinyLlama, self-host with TGI or a dedicated endpoint

Run Text Generation Inference for TinyLlama, then call its OpenAI-style Messages API.

docker run --gpus all -p 8080:80 \
  -e HF_TOKEN=$HF_TOKEN \
  ghcr.io/huggingface/text-generation-inference:latest \
  --model-id TinyLlama/TinyLlama-1.1B-Chat-v1.0

from openai import OpenAI
client = OpenAI(base_url="http://localhost:8080/v1", api_key="dummy")
r = client.chat.completions.create(
    model="TinyLlama/TinyLlama-1.1B-Chat-v1.0",
    messages=[{"role":"user","content":"What is the capital of India?"}],
    max_tokens=64,
)
print(r.choices[0].message.content)

Self-hosting bypasses provider routing entirely. (HF positions TGI and Inference Endpoints for this case.) (Hugging Face)
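
If you’d rather keep your original LangChain code, point HuggingFaceEndpoint at the local TGI server instead of a repo_id. A minimal sketch; endpoint_url and model_id are documented parameters, but whether the ChatHuggingFace chat-completion path works against a plain URL depends on your langchain-huggingface / huggingface_hub versions, so treat it as a starting point:

from langchain_huggingface import ChatHuggingFace, HuggingFaceEndpoint

llm = HuggingFaceEndpoint(
    endpoint_url="http://localhost:8080",  # the TGI container started above
    task="text-generation",
    max_new_tokens=64,
)
# model_id hints which chat template/metadata to use for the hosted model.
chat = ChatHuggingFace(llm=llm, model_id="TinyLlama/TinyLlama-1.1B-Chat-v1.0")
print(chat.invoke("What is the capital of India?").content)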

Why your snippet fails, line by line

  • You wrap HuggingFaceEndpoint with ChatHuggingFace. That wrapper sends requests via chat completion.
  • The router checks the model’s provider list. TinyLlama has none. Routing map is empty.
  • The call fails before generation. HF docs require a provider-backed model and a token with the right scope; the TinyLlama card confirms no providers. (Hugging Face)

Quick checks

  • Pick a model card that shows Inference Providers → Chat Completion and a specific provider (e.g., Cerebras, Together). Then use path 1 or 2. (Hugging Face)
  • If you want serverless text-generation without chat, choose a model that lists HF Inference → Text Generation on its card, then call the SDK with that model (see the sketch after this list). The HF Inference page explains scope and catalog. (Hugging Face)
  • Do not call api-inference.huggingface.co. It returns 404 per current HF guidance. (Hugging Face Forums)
  • Replace curly quotes with ASCII quotes in Python.
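
For the serverless, non-chat path mentioned above, the call looks like this. A minimal sketch; the model name is only a placeholder, so substitute one whose card actually lists HF Inference → Text Generation:

# deps: pip install -U huggingface_hub
import os
from huggingface_hub import InferenceClient

# provider="hf-inference" targets the serverless catalog that replaced the legacy Inference API.
client = InferenceClient(provider="hf-inference", api_key=os.environ["HF_TOKEN"])

out = client.text_generation(
    "The capital of India is",
    model="openai-community/gpt2",  # placeholder: check the HF Inference catalog first
    max_new_tokens=16,
)
print(out)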

Short, curated references

  • HF Chat Completion docs: router base URL, token permission, code example. Useful to validate your client wiring. (Hugging Face)
  • HF Inference provider page: what replaced the legacy serverless API, and model support caveat. (Hugging Face)
  • Inference Providers overview: concept and model availability via providers. (Hugging Face)
  • TinyLlama model card: “This model isn’t deployed by any Inference Provider.” Confirms why routing fails. (Hugging Face)
  • HF forum threads: 404s from api-inference.huggingface.co and migration to router/hf-inference. Confirms deprecation in practice. (Hugging Face Forums)

To find provider-backed chat models, use the Hugging Face Inference Providers catalog and the Chat Completion docs/playground. Then confirm on the model card. Here’s the shortest reliable path.

How to find provider-backed chat models

  1. Open the Chat Completion docs → Playground
    The task page lists “Recommended models” and links to the Playground where the model dropdown shows only chat-capable, provider-backed models. Pick one there. (Hugging Face)

  2. Browse the Providers catalog
    Go to the Inference Providers “Supported Models” table. It lists each model and which provider serves it (Cerebras, Nebius, Together, etc.). Example: meta-llama/Llama-3.1-8B-Instruct is served by Nebius, Cerebras, SambaNova, nScale, Fireworks, Scaleway, etc. Click through to the model card from here. (Hugging Face)

  3. Check the model card widget
    On the model page, the “Inference Providers” widget shows which tasks are available and which providers back them. The Hub docs state you can run models in the widget and filter models by provider from search (a programmatic filter is sketched after this list). If the card shows providers for Chat Completion, it’s valid for the router. (Hugging Face)

  4. Confirm by API format
    The Chat Completion docs show the OpenAI-compatible call and the org/model:provider syntax. If a model works in the Playground and the doc example format works for it, it’s provider-backed. (Hugging Face)
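
To do step 3’s filtering programmatically instead of in the web UI, you can hit the Hub’s model-listing API. A minimal sketch; the inference_provider query parameter is an assumption based on the “filter by inference provider” search option the docs describe, so verify the name against the Hub API reference:

# deps: pip install -U requests
import requests

params = {
    "inference_provider": "cerebras",   # assumed query parameter name
    "pipeline_tag": "text-generation",
    "limit": 10,
}
r = requests.get("https://huggingface.co/api/models", params=params, timeout=30)
r.raise_for_status()
for m in r.json():
    print(m["id"])  # candidate repo_ids to confirm on their model cards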

Fast picks you can trust (today)

  • Llama 3.1 8B Instruct → meta-llama/Llama-3.1-8B-Instruct:cerebras or :nebius (others available). These appear in the Providers table with multiple providers. (Hugging Face)
  • Qwen3/Qwen2.5 Instruct and DeepSeek-R1 also appear in the Chat Completion “Recommended models” section and the Providers catalog. Use the provider shown beside them. (Hugging Face)

Minimal code to validate a candidate

OpenAI SDK against the HF router:

# Docs: https://huggingface.co/docs/inference-providers/en/tasks/chat-completion
from openai import OpenAI
import os
client = OpenAI(base_url="https://huggingface.co/proxy/router.huggingface.co/v1",
                api_key=os.environ["HF_TOKEN"])  # token with “Inference Providers”

resp = client.chat.completions.create(
    model="meta-llama/Llama-3.1-8B-Instruct:cerebras",
    messages=[{"role":"user","content":"Test message"}],
    max_tokens=64,
)
print(resp.choices[0].message.content)

This mirrors the official snippet and the org/model:provider format. (Hugging Face)

Practical tips

  • If a model’s card does not show providers for Chat Completion, it will not work with ChatHuggingFace or the router. Use the Providers table or Playground instead of guessing. (Hugging Face)
  • You can also filter models by provider in Hub search, per the docs (“Search – Filter models by inference provider”). (Hugging Face)
  • For serverless non-chat text-gen, check the HF Inference provider page and pick from its supported catalog only. (Hugging Face)

Why this works

  • The Chat Completion page lists recommended, supported models and exposes a Playground that only shows models currently available via Inference Providers. (Hugging Face)
  • The Supported Models catalog is the canonical list mapping models ↔ providers with current availability. (Hugging Face)