Why does using `TextIteratorStreamer` result in so many empty outputs?

The simplest cause is that you did not call apply_chat_template before generating.


You’re printing many blanks because most early “tokens” decode to nothing after your filters. You’re feeding raw chat markup and tags, and the streamer removes the prompt and special tokens during decode, so several iterations yield "". Your loop still prints the markers around each "", so you see ###$$$ many times.

Why this happens

  1. Chat markup and special tokens get stripped
    Your text contains ChatML-style control tokens and tags:
<|im_start|>user
hello
<|im_end|>
<|im_start|>assistant
<think>

</think>

With skip_special_tokens=True, the tokenizer drops control tokens on decode, and skip_prompt=True makes the streamer ignore the prompt part. Early decode steps therefore often become empty strings, which you still print, rendering as repeated ###$$$. This is expected when feeding chat models raw markup while also skipping specials; see the short decode sketch after this list. (Hugging Face)

  2. The streamer buffers until it has “displayable words”
    TextIteratorStreamer accumulates token pieces and only emits text when decoding forms complete, displayable spans. This can delay or suppress output for subword fragments. Combined with your skip filters, several iterations produce "". (Hugging Face)

  3. Subword and whitespace tokens don’t always produce visible characters
    BPE/SentencePiece commonly produce leading spaces or fragments. Until a boundary is closed, decode can be empty or only whitespace. Your print makes empties visible as ###$$$. The streamer’s documented behavior is to emit when words materialize, not at every raw token. (Hugging Face)

  4. Multiple EOS and chat tags at the boundary
    Modern chat models often use more than one stop token (for example, <|end_of_text|> and <|im_end|> or <|eot_id|>). If you don’t stop at all relevant EOS tokens, the model may output extra headers or newlines that get stripped to "". Transformers supports a list for eos_token_id. Use it so generation ends cleanly at the first relevant stop. (Hugging Face)

  5. You hand-wrote the conversation instead of using the chat template
    Most chat models expect a template. Hand-rolled markup can make the model emit scaffolding tokens first, which your decode then strips. apply_chat_template(..., add_generation_prompt=True) produces the exact format expected by that model’s tokenizer and cleanly marks where assistant output should begin. That reduces spurious blanks. (Hugging Face)
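
A quick way to see point 1 concretely: decoding a ChatML control token with skip_special_tokens=True yields an empty string, which your loop then wraps in markers. A minimal sketch, assuming a ChatML-style chat model such as Qwen2.5-Instruct (swap in whatever model you actually load):

from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")  # assumption: any ChatML-style chat model

im_end_id = tok.convert_tokens_to_ids("<|im_end|>")
print(repr(tok.decode([im_end_id], skip_special_tokens=True)))   # ''  -> your loop prints ###$$$
print(repr(tok.decode([im_end_id], skip_special_tokens=False)))  # '<|im_end|>'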

What to change

A. Use chat templates instead of manual <|im_start|> strings

# Docs:
# - https://huggingface.co/docs/transformers/en/chat_templating
# - https://huggingface.co/docs/transformers/en/main_classes/text_generation
# Assumes `model` and `tokenizer` for a chat model are already loaded.

messages = [{"role": "user", "content": "hello"}]

input_ids = tokenizer.apply_chat_template(
    messages,
    add_generation_prompt=True,     # assistant turn starts here
    return_tensors="pt"
).to(model.device)                  # keep inputs on the same device as the model

from transformers import TextIteratorStreamer
streamer = TextIteratorStreamer(
    tokenizer,
    skip_prompt=True,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)

# Provide every relevant stop token so generation ends cleanly on the first one hit.
stop_tokens = ["<|im_end|>", "<|eot_id|>", "<|end_of_text|>"]
eos_list = []
for tok in stop_tokens:
    tok_id = tokenizer.convert_tokens_to_ids(tok)
    # tokens missing from this vocab come back as None (fast tokenizers) or the unk id
    if tok_id is not None and tok_id != tokenizer.unk_token_id:
        eos_list.append(tok_id)

gen_kwargs = dict(
    input_ids=input_ids,
    max_new_tokens=256,
    do_sample=True, temperature=0.7, top_p=0.8, top_k=20,
    eos_token_id=eos_list or tokenizer.eos_token_id,
    streamer=streamer,
)

from threading import Thread
Thread(target=model.generate, kwargs=gen_kwargs).start()

for chunk in streamer:
    if not chunk:                   # ignore empty deltas
        continue
    print("###" + chunk + "$$$", end="", flush=True)  # no extra newline; flush so each chunk shows immediately

Rationale: the template matches training-time formatting and sets the assistant start boundary. That avoids leading control tokens and reduces empty emissions. EOS as a list handles multi-stop models. (Hugging Face)

B. Don’t print empty or whitespace-only chunks

Minimal and effective:

for chunk in streamer:
    if not chunk or chunk.isspace():
        continue
    print("###" + chunk + "$$$", end="")

This keeps your markers readable and removes ###$$$ lines created by "" and pure "\n" deltas. Behavior aligns with the streamer’s “emit words” design. (Hugging Face)
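
If you also need the complete reply once streaming finishes (for logging or a non-streaming fallback), a common pattern is to accumulate the filtered chunks as you go. A minimal sketch reusing the same loop (variable names are illustrative):

pieces = []
for chunk in streamer:
    if not chunk or chunk.isspace():
        continue
    pieces.append(chunk)
    print("###" + chunk + "$$$", end="", flush=True)  # flush so chunks show up immediately

full_reply = "".join(pieces)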

C. Debug once with specials enabled

To confirm the root cause in your environment, run a single test with:

streamer = TextIteratorStreamer(tokenizer, skip_prompt=True, skip_special_tokens=False)
# run model.generate(..., streamer=streamer) in a background thread, as in section A
for chunk in streamer:
    print(repr(chunk), end="")  # repr shows \n and special tokens explicitly

You’ll see leading control tokens and newlines that were previously erased by skip_special_tokens=True. This verifies why "" showed up. (Hugging Face)

D. If your model emits tags like <think>...</think>

Add stop strings or stop IDs for those tags. Many users combine eos_token_id=[...] with substring stoppers to prevent internal tags reaching the UI:

# Example: custom stopper for substrings. It decodes only the newly generated
# tokens, so tags already present in the prompt don't trigger an immediate stop.
from transformers import StoppingCriteria, StoppingCriteriaList

class StopOnStrings(StoppingCriteria):
    def __init__(self, tokenizer, stop_strings, prompt_len):
        self.tokenizer = tokenizer
        self.stop_strings = stop_strings
        self.prompt_len = prompt_len    # number of prompt tokens to ignore
    def __call__(self, input_ids, scores, **kwargs):
        new_text = self.tokenizer.decode(input_ids[0, self.prompt_len:], skip_special_tokens=True)
        return any(s in new_text for s in self.stop_strings)

stops = StoppingCriteriaList([StopOnStrings(tokenizer, ["<think>", "</think>"], input_ids.shape[1])])
model.generate(**gen_kwargs, stopping_criteria=stops)   # gen_kwargs from section A already includes the streamer

Reason: some chat variants print scaffolding before content. Stopping criteria cut generation as soon as one of the tags appears, so scaffolding doesn’t keep streaming into your UI. (Patterns vary by model; templates help most.) (Hugging Face)
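
If your transformers version is recent enough (roughly v4.39+), generate() also accepts a stop_strings argument that does the same thing without a custom class; it needs the tokenizer in order to match the strings. A sketch under that assumption, reusing gen_kwargs from section A:

model.generate(
    **gen_kwargs,                           # already contains input_ids, sampling params, and the streamer
    stop_strings=["<think>", "</think>"],   # stop as soon as one of these substrings is generated
    tokenizer=tokenizer,                    # required so generate() can match strings against decoded text
)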

Mental model

  • Streamer receives token IDs.
  • It decodes incrementally and emits only when there is displayable text.
  • You asked it to skip prompt and specials, which turn many early iterations into "".
  • Your print shows each "" as an empty payload between markers.
  • Fix by providing proper chat formatting, giving all stop IDs, and skipping "" in your UI loop. (Hugging Face)

Quick checklist

  • Use apply_chat_template(..., add_generation_prompt=True). (Hugging Face)
  • Keep skip_special_tokens=True, but ignore "" or whitespace chunks. (Hugging Face)
  • Provide all relevant EOS IDs. Many chat models need more than one. (Hugging Face)
  • Optional: clean_up_tokenization_spaces=False for exact spacing during streaming. (Hugging Face)
  • If needed, add substring stoppers for tags like <think>. Use minimally.

Curated references and similar cases

Official docs

  • Transformers streaming utilities. Describes word-boundary emission and streamer parameters like skip_prompt and skip_special_tokens. Useful to understand why empties happen. (Hugging Face)
  • Generation API. Confirms eos_token_id accepts a list for multiple stops. Key for chat models. (Hugging Face)
  • Chat templates guide. Shows apply_chat_template(..., add_generation_prompt=True) and why templates prevent formatting mismatches. (Hugging Face)
  • HF blog on chat templates. Explains training-time formats and why hand-rolled prompts degrade behavior. (Hugging Face)
  • Tokenizer docs on skip_special_tokens. Confirms that specials are omitted from decode. (Hugging Face)

Issues and threads with analogous symptoms

  • Beginners thread: delays and chunking behavior when streaming. Reinforces word-level emission and the need to run generate in a background thread with a streamer. (Hugging Face Forums)
  • Discussion on multi-EOS not stopping cleanly for some chat models. Motivation for passing all stop IDs. (GitHub)
  • FastAPI discussion showing iterator streaming patterns and why to skip empty chunks in the loop. Useful for server implementations. (GitHub)

Model-specific chat template notes

  • Qwen chat docs show apply_chat_template usage and streaming patterns consistent with the above. Good cross-check if you test Qwen-style templates. (Qwen)