Dataset format standards for chat-based, fine-tuned Llama models

I want to use a Llama-based model for a text-generation/chat-bot application. I have my own data and was curious how to format it so as to get the best results out of fine-tuning. Currently, I use "[PCP] " and "[SR] " to mark who is talking. Here’s a snippet of my code that does this, followed by an example of two conversations I might have in my dataset.

    # Build one training string per conversation, prefixing each turn
    # with a speaker tag ("[PCP] " or "[SR] ").
    conversation = ""
    for _, row in group.iterrows():
        pcp = row.get("PCP_MESSAGE", "")
        if isinstance(pcp, str) and pcp.strip():  # skip empty/NaN cells
            conversation += "[PCP] " + textFromHtml(pcp) + " "
        sr = row.get("SR_MESSAGE", "")
        if isinstance(sr, str) and sr.strip():
            conversation += "[SR] " + textFromHtml(sr) + " "
    
    return conversation

[PCP] Some text some text some text [SR] Dear Lorem, ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco laboris nisi ut aliquip ex ea commodo consequat. Duis aute irure dolor in reprehenderit in voluptate velit esse cillum dolore eu fugiat nulla pariatur. Excepteur sint occaecat cupidatat non proident, sunt in culpa qui officia deserunt mollit anim id est

laborum. [PCP] Hello Lorem,

ipsum dolor sit amet, consectetur adipiscing elit, sed do eiusmod tempor incididunt ut labore et dolore magna aliqua. Ut enim ad minim veniam, quis nostrud exercitation ullamco. [SR] Hi __,

&

Good morning,

Thanks for getting back to GI.

Given the clinical symptom along with patient’s age and documented BRBPR, GI will evaluate patient in GI clinic to consider scheduling diagnostic EGD/colonoscopy evaluation.
&
Best [PCP] Great!

Thank you.

and here’s another (none of this data is real)…

[PCP] 45 yo male already on Pantoprazole , has recurrence , worsening GERD with esophagitis.

addendum: update:01/01/1970. patient tested negative for H.pylori in the stool but still having a lot of abdominal gas and eructation [SR] Dr Smith

what are the pts GERD and “esophagitis” symptoms.

MMK

This is the only place I’ve seen something with a little documentation on how to format data depending on which model and which task you are using. A question was asked there but never answered.

It says to use this format for chat-based models:

[INST]<<SYS>>
You are a friendly chatbot that gives helpful answers
<</SYS>>

Hello[/INST]Hello, how are you?</s><s>[INST]Good, please tell me what 1+1 is.[/INST]1+1=2. Please let me know if you need anything else!</s>
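For reference, the quoted format can be assembled mechanically. This is just a sketch that hard-codes the Llama-2 chat markers shown above (`build_llama2_prompt` is an illustrative name, not a library function); in practice you would rely on the tokenizer’s chat template instead of string-building by hand:

```python
def build_llama2_prompt(system, turns):
    """Assemble the Llama-2 chat layout quoted above from a system
    message and a list of (user, assistant) turn pairs."""
    parts = []
    for i, (user, assistant) in enumerate(turns):
        if i == 0:
            # The system block is folded into the first user turn.
            user = f"<<SYS>>\n{system}\n<</SYS>>\n\n{user}"
            parts.append(f"[INST]{user}[/INST]{assistant}</s>")
        else:
            parts.append(f"<s>[INST]{user}[/INST]{assistant}</s>")
    return "".join(parts)

print(build_llama2_prompt(
    "You are a friendly chatbot that gives helpful answers",
    [("Hello", "Hello, how are you?"),
     ("Good, please tell me what 1+1 is.",
      "1+1=2. Please let me know if you need anything else!")]))
```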

But I am curious if there is any other documentation out there that details how and why data should be formatted a certain way for a certain type of model and task.

maybe ChatML, as proposed in How to Fine-Tune LLMs in 2024 with Hugging Face?

I FOUND IT

Utilities for Tokenizers (huggingface.co)

Here is an example from here Templates for Chat Models (huggingface.co)

from transformers import AutoModelForCausalLM, AutoTokenizer

checkpoint = "HuggingFaceH4/zephyr-7b-beta"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForCausalLM.from_pretrained(checkpoint)  # You may want to use bfloat16 and/or move to GPU here

messages = [
    {
        "role": "system",
        "content": "You are a friendly chatbot who always responds in the style of a pirate",
    },
    {"role": "user", "content": "How many helicopters can a human eat in one sitting?"},
 ]
tokenized_chat = tokenizer.apply_chat_template(messages, tokenize=True, add_generation_prompt=True, return_tensors="pt")
print(tokenizer.decode(tokenized_chat[0]))

Set add_generation_prompt=True if you are running inference; set it to False if you are training/fine-tuning.

This will automatically format your messages into the template that model expects :)


ccruttjr Hi. What do you think of this case?

I’m trying to do LoRA fine-tuning on 332 KB of JSONL conversational data (including a system instruction). I want my model to learn an individual style of conversation and to predict the delay with which to respond. During inference it is supposed to return both the text and a delay value. For that I introduced another key, delay. I also have a category key and a chat_id (which is actually irrelevant). So my data structure doesn’t fully match the one in the documentation.

Do you think the difference in the dataset structure will impede fine-tuning success?

{"category": "acquaintances", "chat_id": "24129172583342694.html", "conversation": [{"role": "system", "content": "You act as target user etc...."}, {"role": "target", "content": "Hi. blebleblebleblebleble"}, {"role": "other", "content": "oh really? blebleble."}, {"role": "target", "content": "blebleblebleblebleble", "delay": 159}]}

It says “expected” dataset format…


I think the data should be usable without issues if you manually normalize it before fine-tuning. You could do it on the fly right before fine-tuning, or you could prepare a pre-normalized dataset beforehand…


It will not block fine-tuning, but you must do two things:

  1. Map your custom JSON into the format TRL actually reads (text or messages).
  2. Turn delay into tokens in the assistant output if you want the model to learn it.

The “expected format” in the docs is about that one training field, not your whole JSON structure.

I’ll walk through why, and what I’d do with your exact schema.


1. What SFTTrainer really expects

TRL’s SFTTrainer supports “standard” and “conversational” language-modeling datasets (Hugging Face):

  • Standard: each row has a text column with the full sequence.

  • Conversational: each row has a messages column that is a list of {role, content} messages, e.g.:

    {
      "messages": [
        {"role": "system", "content": "You are helpful"},
        {"role": "user", "content": "What's the capital of France?"},
        {"role": "assistant", "content": "It is Paris."}
      ]
    }
    

The docs also say:

  • Columns “vary depending on the task”; only text or messages is special.(Hugging Face)
  • Conversational datasets must be converted via a chat template into a standard text sequence before training.(Hugging Face)

So SFTTrainer does not care about your whole JSON schema. It cares that:

  • There is a text or messages field with the right structure.
  • Everything else is either ignored or used by your own preprocessing.

Your current row:

{
  "category": "acquaintances",
  "chat_id": "24129172583342694.html",
  "conversation": [
    {"role": "system", "content": "You act as target user etc...."},
    {"role": "target", "content": "Hi. blebleblebleblebleble"},
    {"role": "other",  "content": "oh really? blebleble."},
    {"role": "target", "content": "blebleblebleblebleble", "delay": 159}
  ]
}

is fine as raw data. You just need to transform it into the expected training field.


2. Will the “different structure” impede success?

Short version: no, if you preprocess correctly.

From the TRL dataset-format guide:

A language modeling dataset consists of a column "text" (or "messages" for conversational datasets) containing a full sequence of text.(Hugging Face)

It also shows that conversational datasets can have other columns; only messages is required.(Hugging Face)

On top of that:

  • TRL’s own examples and blogs say that if your dataset uses a different structure, you simply preprocess into the messages shape and then apply the chat template.(Medium)
  • There is even a GitHub issue about ShareGPT-style datasets where messages live under conversations and roles are human/gpt; the proposed solution is to map that into the standard messages / user / assistant format.(GitHub)

Your case is the same:

  • conversation → messages
  • target / other → assistant / user
  • Keep system as system
  • category, chat_id, and raw delay are just extra columns, which TRL is happy to ignore unless you use them.

So the schema difference itself does not impede fine-tuning. The only real danger is if you pass the raw JSON straight to SFTTrainer without mapping to a proper text/messages field.
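Concretely, the remapping above is only a few lines of Python. This is a sketch (`to_trl_row` and `ROLE_MAP` are illustrative names, not TRL API); note that it drops the extra columns along with the raw `delay`, which section 3 deals with separately:

```python
import json

# Illustrative mapping from the custom roles to TRL's standard ones.
ROLE_MAP = {"system": "system", "target": "assistant", "other": "user"}

def to_trl_row(row):
    """Map one raw chat row onto TRL's conversational schema:
    rename `conversation` -> `messages` and remap the roles.
    Extra keys (`category`, `chat_id`, raw `delay`) are dropped."""
    return {"messages": [
        {"role": ROLE_MAP[m["role"]], "content": m["content"]}
        for m in row["conversation"]
    ]}

raw = {
    "category": "acquaintances",
    "chat_id": "24129172583342694.html",
    "conversation": [
        {"role": "system", "content": "You act as target user etc...."},
        {"role": "target", "content": "Hi. blebleblebleblebleble"},
        {"role": "other", "content": "oh really? blebleble."},
        {"role": "target", "content": "blebleblebleblebleble", "delay": 159},
    ],
}

print(json.dumps(to_trl_row(raw), indent=2))
```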


3. Critical point: how to make the model learn delay

Right now, delay lives as a separate field on the last message:

{"role": "target", "content": "ble...", "delay": 159}

By default:

  • SFTTrainer only trains on text built from text or messages.(Hugging Face)
  • Extra columns like label, score, or here delay are ignored unless you explicitly convert them into text. This is exactly what shows up in GitHub issues where people expect SFTTrainer to use a label column for classification; it doesn’t.(Hugging Face)

So if you leave delay as a separate numeric field:

  • The model never sees it.
  • It cannot learn to predict it.

To train the model to output both text and delay, you must encode delay into the assistant output text, for example:

Option A: JSON output

Turn the assistant’s final message into something like:

{
  "reply": "blebleblebleblebleble",
  "delay": 159
}

and make that the content of the assistant message.

This follows the pattern used in JSON-generation SFT tutorials: the model is trained to output a fixed JSON schema.(Hugging Face)

Option B: Tag header

Use a simple tagged convention:

<delay>159</delay>
blebleblebleblebleble

or

DELAY_MS=159
blebleblebleblebleble

Then, during inference:

  1. Ask the model to answer in that format.
  2. Parse the delay with a regex or JSON parser.
  3. Use it to schedule your visible reply.

Either way, delay becomes part of the token sequence, so the LM can learn it.
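As a sketch of Option A (`inject_delay` is a hypothetical helper, not part of TRL), the encoding and the inference-time parsing are both trivial with the standard library:

```python
import json

def inject_delay(message):
    """Fold a numeric `delay` field into the assistant's content as a
    small JSON object, so the value becomes part of the token stream."""
    if "delay" in message:
        payload = {"reply": message["content"], "delay": message["delay"]}
        return {"role": message["role"], "content": json.dumps(payload)}
    return dict(message)

msg = {"role": "assistant", "content": "blebleblebleblebleble", "delay": 159}
encoded = inject_delay(msg)
print(encoded["content"])  # {"reply": "blebleblebleblebleble", "delay": 159}

# At inference time, parse the model's reply back out:
parsed = json.loads(encoded["content"])
print(parsed["delay"])  # 159
```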


4. How to map your schema into TRL’s conversational format

Based on TRL docs and examples:(Hugging Face)

  1. Normalize roles:

    • system → system
    • target → assistant
    • other → user
  2. Rename conversation to messages:

    • SFTTrainer and many examples expect the field to be called messages.(Hugging Face)
  3. Inject delay into assistant content:

    • For the last target message, if it has "delay": 159, rewrite:

      {"role": "assistant", "content": "{\"reply\": \"ble...\", \"delay\": 159}"}
      

      or use the tagged form.

  4. Use a chat template:

    • Call tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=False) to convert each messages list into a single training string in the exact format your base model expects.(Hugging Face)

    • You can either:

      • Let SFTTrainer handle this automatically for a messages column, or
      • Use a formatting_func that does it explicitly.

TRL’s own tutorials and blogs recommend this “messages → chat_template → text” path for multi-turn chat data.(Google AI for Developers)
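Putting the pieces together, here is a sketch of the messages → chat-template → text step. The stub tokenizer exists only so the plumbing can run offline; with a real model you would load `AutoTokenizer.from_pretrained(checkpoint)` and let its `apply_chat_template` (or SFTTrainer itself) produce the real template:

```python
class StubTokenizer:
    """Offline stand-in for a real tokenizer; the <|role|> markers here
    are invented. Swap in AutoTokenizer.from_pretrained(...) for real runs."""
    def apply_chat_template(self, messages, tokenize=False,
                            add_generation_prompt=False):
        text = "".join(f"<|{m['role']}|>\n{m['content']}\n" for m in messages)
        if add_generation_prompt:
            text += "<|assistant|>\n"  # prompt the model to answer
        return text

def to_training_text(example, tokenizer):
    # One normalized row ({"messages": [...]}) -> one training string.
    # add_generation_prompt stays False for training data.
    return tokenizer.apply_chat_template(
        example["messages"], tokenize=False, add_generation_prompt=False)

tok = StubTokenizer()
example = {"messages": [
    {"role": "system", "content": "You act as target user etc...."},
    {"role": "user", "content": "oh really? blebleble."},
    {"role": "assistant",
     "content": '{"reply": "blebleblebleblebleble", "delay": 159}'},
]}
print(to_training_text(example, tok))
```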

Once you do that:

  • SFTTrainer sees a perfectly standard conversational dataset.
  • Your extra fields are gone from the training text, except delay, which is now encoded as tokens.
  • The non-standard original schema no longer matters.

5. About the small data size

332 KB of JSONL is on the order of tens of thousands of tokens, depending on how verbose the content is.

In typical LoRA fine-tuning practice (AI Engineering Academy):

  • That is enough to:

    • Impose a noticeable conversational style.
    • Teach simple regularities about when to choose longer/shorter delays.
  • It is not enough to:

    • Learn a precise, robust regression from arbitrary context to a very accurate delay value.

So:

  • The dataset structure, once mapped, is not your bottleneck.

  • The limiting factors are:

    • How you encode delay.
    • Data size and variety.
    • How tightly you regularize LoRA (ranks, LR, epochs) to avoid overfitting.

6. Direct answer to your question

Do you think the difference in the dataset structure will impede fine-tuning success?

No, not by itself.

If you:

  • Convert conversation → messages with standard roles (system / user / assistant).(Hugging Face)
  • Use a chat template to turn messages into a single sequence of tokens.(Hugging Face)
  • Encode delay inside the assistant’s text (JSON or tag), instead of keeping it as a separate field.

then your non-standard JSON schema will not impede fine-tuning. It becomes just an internal data format that you normalize before training.

If you skip that and pass the raw schema directly, then yes, things will break or delay will be silently ignored.