# Model Card for Qwen3-0.6B-MNLP_IF_v2_text_mcqa_rl
This model is a fine-tuned version of [andresnowak/Qwen3-0.6B-MNLP_mcqa_model_text_2](https://huggingface.co/andresnowak/Qwen3-0.6B-MNLP_mcqa_model_text_2). It has been trained using [TRL](https://github.com/huggingface/trl).
## Quick start

```python
from transformers import pipeline

question = "If you had a time machine, but could only go to the past or the future once and never return, which would you choose and why?"
generator = pipeline("text-generation", model="andresnowak/Qwen3-0.6B-MNLP_IF_v2_text_mcqa_rl", device="cuda")
output = generator([{"role": "user", "content": question}], max_new_tokens=128, return_full_text=False)[0]
print(output["generated_text"])
```
## Training procedure
This model was trained with GRPO, a method introduced in [DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models](https://huggingface.co/papers/2402.03300).
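GRPO dispenses with a learned value function: for each prompt it samples a group of $G$ completions (here `num_generations: 4`), scores each with the reward, and standardizes the rewards within the group to obtain per-completion advantages:

$$
\hat{A}_i = \frac{r_i - \operatorname{mean}(\{r_1, \dots, r_G\})}{\operatorname{std}(\{r_1, \dots, r_G\})}
$$

A KL penalty with coefficient $\beta$ (the `beta: 0.1` in the config below) keeps the policy close to the reference model.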
The methodology: starting from andresnowak/Qwen3-0.6B-MNLP_mcqa_model_text, which had been trained to output `[Letter]. [Answer]`, the model was trained for 2 more epochs on the same dataset, this time with RLVR. The verifiable reward is deliberately simple: if we find `[Letter]. [Text of the answer]` in the output the reward is $2.0$, if we find only `[Letter].` the reward is $1.0$, and otherwise it is $-1.0$.
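A minimal sketch of such a reward function, assuming TRL's standard reward-function interface and hypothetical dataset columns `answer_letter` and `answer_text` (the actual column names and matching logic are not given in this card):

```python
import re

def mcqa_format_reward(completions, answer_letter, answer_text, **kwargs):
    """Score each completion: 2.0 for '[Letter]. [Answer text]',
    1.0 for the correct '[Letter].' alone, -1.0 otherwise."""
    rewards = []
    for completion, letter, text in zip(completions, answer_letter, answer_text):
        if f"{letter}. {text}" in completion:
            rewards.append(2.0)   # full "[Letter]. [Text of the answer]" found
        elif re.search(rf"\b{re.escape(letter)}\.", completion):
            rewards.append(1.0)   # only "[Letter]." found
        else:
            rewards.append(-1.0)  # neither format found
    return rewards
```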
The training arguments used were:
```yaml
defaults:
  - override hydra/job_logging: disabled

environment:
  seed: 42

model:
  # name: andresnowak/Qwen3-0.6B-instruction-finetuned
  # name: Qwen/Qwen3-0.6B-Base
  name: andresnowak/Qwen3-0.6B-instruction-finetuned_v2
  hub_model_id: andresnowak/Qwen3-0.6B-MNLP_IF_v2_mcqa_rl

dataset_train:
  - name: andresnowak/MNLP_MCQA_dataset
    config: train
    subset_name: math_qa
  - name: andresnowak/MNLP_MCQA_dataset
    subset_name: ScienceQA
    config: train
  - name: andresnowak/MNLP_MCQA_dataset
    subset_name: mmlu_auxiliary_train_stem_10_choices
    config: train
  - name: andresnowak/MNLP_MCQA_dataset
    subset_name: ai2_arc_challenge
    config: train
  - name: andresnowak/MNLP_MCQA_dataset
    subset_name: ai2_arc_easy
    config: train
  - name: andresnowak/MNLP_MCQA_dataset
    subset_name: medmcqa
    config: train
  - name: andresnowak/MNLP_MCQA_dataset
    subset_name: openbookqa
    config: train
  - name: andresnowak/MNLP_MCQA_dataset
    subset_name: sciq
    config: train

dataset_validation:
  - name: andresnowak/MNLP_MCQA_dataset
    config: validation
    subset_name: math_qa
  - name: andresnowak/MNLP_MCQA_dataset
    subset_name: ScienceQA
    config: validation
  - name: andresnowak/MNLP_MCQA_dataset
    subset_name: mmlu
    config: validation
  - name: andresnowak/MNLP_MCQA_dataset
    subset_name: ai2_arc_challenge
    config: validation
  - name: andresnowak/MNLP_MCQA_dataset
    subset_name: ai2_arc_easy
    config: validation
  - name: andresnowak/MNLP_MCQA_dataset
    subset_name: medmcqa
    config: validation
  - name: andresnowak/MNLP_MCQA_dataset
    subset_name: openbookqa
    config: validation
  - name: andresnowak/MNLP_MCQA_dataset
    subset_name: sciq
    config: validation

dataset_mmlu:
  - name: cais/mmlu
    config: validation
    subjects: ["abstract_algebra", "anatomy", "astronomy", "college_biology", "college_chemistry", "college_computer_science", "college_mathematics", "college_physics", "computer_security", "conceptual_physics", "electrical_engineering", "elementary_mathematics", "high_school_biology", "high_school_chemistry", "high_school_computer_science", "high_school_mathematics", "high_school_physics", "high_school_statistics", "machine_learning"]

training:
  output_dir: ./output
  logging_dir: ./logs
  resume_dir: None
  report_to: wandb
  learning_rate: 1e-5
  per_device_train_batch_size: 8
  per_device_eval_batch_size: 8
  gradient_accumulation_steps: 8 # to get an effective batch size of 64
  num_train_epochs: 1
  weight_decay: 0.00
  warmup_ratio: 0.05
  max_grad_norm: 0.5
  num_generations: 4
  completion_length: 512
  beta: 0.1

wandb:
  project: MNLP-qwen-instruction-finetuning
  name: qwen-instruction-finetuning-v2-MCQA-RL
```
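For orientation, here is a minimal sketch of how these hyperparameters map onto TRL's `GRPOConfig`/`GRPOTrainer`. This is my reading of the config above, not the actual training script; dataset concatenation and prompt formatting are elided, and `mcqa_format_reward` is the reward sketch from earlier:

```python
from datasets import load_dataset
from trl import GRPOConfig, GRPOTrainer

# One of the training subsets listed above; the real run concatenates all of them.
train_dataset = load_dataset("andresnowak/MNLP_MCQA_dataset", "sciq", split="train")

training_args = GRPOConfig(
    output_dir="./output",
    learning_rate=1e-5,
    per_device_train_batch_size=8,
    gradient_accumulation_steps=8,  # effective batch size of 64
    num_train_epochs=1,
    weight_decay=0.0,
    warmup_ratio=0.05,
    max_grad_norm=0.5,
    num_generations=4,              # group size G for GRPO
    max_completion_length=512,      # "completion_length" in the config above
    beta=0.1,                       # KL penalty toward the reference model
    report_to="wandb",
    seed=42,
)

trainer = GRPOTrainer(
    model="andresnowak/Qwen3-0.6B-MNLP_mcqa_model_text_2",
    args=training_args,
    reward_funcs=mcqa_format_reward,
    train_dataset=train_dataset,
)
trainer.train()
```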
## Evaluation

First evaluation (type 0), using the following prompt template:
```
The following are multiple choice questions (with answers) about knowledge and skills in advanced master-level STEM courses.

[Insert Question Here]

[Insert Choices Here, e.g.:
A. Option 1
B. Option 2
C. Option 3
D. Option 4]

Answer:
```
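For concreteness, a question can be rendered into this template with plain string formatting (a sketch; the function and field names are illustrative, not taken from the evaluation code):

```python
PROMPT_HEADER = (
    "The following are multiple choice questions (with answers) about "
    "knowledge and skills in advanced master-level STEM courses.\n\n"
)

def build_prompt(question, choices):
    # choices is an ordered list of option strings; letters are assigned A, B, C, ...
    options = "\n".join(f"{chr(65 + i)}. {choice}" for i, choice in enumerate(choices))
    return f"{PROMPT_HEADER}{question}\n\n{options}\n\nAnswer:"
```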
Testing was done against the `[Letter]. [Text answer]` output format.
| Benchmark | Accuracy (Acc) | Normalized Accuracy (Acc Norm) |
|---|---|---|
| ARC Challenge | 64.58% | 64.99% |
| ARC Easy | 82.59% | 82.32% |
| GPQA | 25.45% | 24.55% |
| Math QA | 33.16% | 32.89% |
| MCQA Evals | 42.21% | 42.73% |
| MMLU | 47.89% | 47.89% |
| MMLU Pro | 15.71% | 15.63% |
| MuSR | 48.68% | 47.62% |
| NLP4Education | 48.08% | 45.11% |
| Overall | 45.37% | 44.86% |
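For reference, assuming the usual lm-evaluation-harness definitions (the card does not state which harness was used): accuracy picks the choice with the highest log-likelihood, while normalized accuracy divides each choice's log-likelihood by its byte length before taking the argmax, reducing the bias toward short options:

$$
\hat{y}_{\text{acc}} = \arg\max_i \log p(c_i \mid x),
\qquad
\hat{y}_{\text{acc\_norm}} = \arg\max_i \frac{\log p(c_i \mid x)}{\operatorname{len}_{\text{bytes}}(c_i)}
$$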
## Framework versions
- TRL: 0.18.1
- Transformers: 4.52.4
- Pytorch: 2.7.0
- Datasets: 3.6.0
- Tokenizers: 0.21.0
## Citations
Cite GRPO as:
```bibtex
@article{zhihong2024deepseekmath,
    title   = {{DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models}},
    author  = {Zhihong Shao and Peiyi Wang and Qihao Zhu and Runxin Xu and Junxiao Song and Mingchuan Zhang and Y. K. Li and Y. Wu and Daya Guo},
    year    = 2024,
    eprint  = {arXiv:2402.03300},
}
```
Cite TRL as:
```bibtex
@misc{vonwerra2022trl,
    title        = {{TRL: Transformer Reinforcement Learning}},
    author       = {Leandro von Werra and Younes Belkada and Lewis Tunstall and Edward Beeching and Tristan Thrush and Nathan Lambert and Shengyi Huang and Kashif Rasul and Quentin Gallou{\'e}dec},
    year         = 2020,
    journal      = {GitHub repository},
    publisher    = {GitHub},
    howpublished = {\url{https://github.com/huggingface/trl}}
}
```