I'm afraid the HellaSwag accuracy reported for DCLM-1.4B is largely incorrect (the same number also appears in the arXiv paper). Below are my evaluations of mllmTeam/PhoneLM-1.5B and TRI-ML/DCLM-1B using the EleutherAI lm-evaluation-harness (lm_eval):
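
For reference, results in this format come from the harness's Python entry point; here is a minimal sketch of the invocation (assuming lm_eval v0.4+, with the batch size illustrative rather than the exact setting I used):

```python
# Minimal reproduction sketch using the EleutherAI lm-evaluation-harness (lm_eval v0.4+).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=TRI-ML/DCLM-1B",  # or "pretrained=mllmTeam/PhoneLM-1.5B"
    tasks=["arc_challenge", "arc_easy", "boolq", "hellaswag", "lambada_openai",
           "piqa", "social_iqa", "wikitext", "winogrande"],
    batch_size="auto",  # illustrative; any workable batch size gives the same metrics
)
print(results["results"])  # per-task metrics, as pasted below
```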

For mllmTeam/PhoneLM-1.5B:

```python
{'arc_challenge': {'alias': 'arc_challenge', 'acc,none': 0.3779863481228669, 'acc_stderr,none': 0.0141696645203031, 'acc_norm,none': 0.39761092150170646, 'acc_norm_stderr,none': 0.01430175222327954},
 'arc_easy': {'alias': 'arc_easy', 'acc,none': 0.7243265993265994, 'acc_stderr,none': 0.009169229476542569, 'acc_norm,none': 0.6994949494949495, 'acc_norm_stderr,none': 0.009407763090599316},
 'boolq': {'alias': 'boolq', 'acc,none': 0.6599388379204894, 'acc_stderr,none': 0.008285579731379784},
 'hellaswag': {'alias': 'hellaswag', 'acc,none': 0.5044811790479984, 'acc_stderr,none': 0.004989581008163221, 'acc_norm,none': 0.6687910774746066, 'acc_norm_stderr,none': 0.004696861625496948},
 'lambada_openai': {'alias': 'lambada_openai', 'perplexity,none': 4.6818882905778265, 'perplexity_stderr,none': 0.10240384705826118, 'acc,none': 0.6551523384436251, 'acc_stderr,none': 0.006622117207603226},
 'piqa': {'alias': 'piqa', 'acc,none': 0.7540805223068553, 'acc_stderr,none': 0.010047331865625213, 'acc_norm,none': 0.7693144722524483, 'acc_norm_stderr,none': 0.009828959550983089},
 'social_iqa': {'alias': 'social_iqa', 'acc,none': 0.43244626407369496, 'acc_stderr,none': 0.011210331273967561},
 'wikitext': {'alias': 'wikitext', 'word_perplexity,none': 13.414900237284861, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.6250416226082716, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.700476670732793, 'bits_per_byte_stderr,none': 'N/A'},
 'winogrande': {'alias': 'winogrande', 'acc,none': 0.6345698500394633, 'acc_stderr,none': 0.013533965097638798}}
```

For TRI-ML/DCLM-1B:

```python
{'arc_challenge': {'alias': 'arc_challenge', 'acc,none': 0.4129692832764505, 'acc_stderr,none': 0.014388344935398326, 'acc_norm,none': 0.43430034129692835, 'acc_norm_stderr,none': 0.01448470304885736},
 'arc_easy': {'alias': 'arc_easy', 'acc,none': 0.7491582491582491, 'acc_stderr,none': 0.00889518301048739, 'acc_norm,none': 0.7138047138047138, 'acc_norm_stderr,none': 0.009274470774627726},
 'boolq': {'alias': 'boolq', 'acc,none': 0.709480122324159, 'acc_stderr,none': 0.007940549952156444},
 'hellaswag': {'alias': 'hellaswag', 'acc,none': 0.5361481776538538, 'acc_stderr,none': 0.004976724124850563, 'acc_norm,none': 0.7165903206532563, 'acc_norm_stderr,none': 0.0044973255339596455},
 'lambada_openai': {'alias': 'lambada_openai', 'perplexity,none': 4.016529108680619, 'perplexity_stderr,none': 0.08517294414865222, 'acc,none': 0.6941587424801087, 'acc_stderr,none': 0.0064193271158925974},
 'piqa': {'alias': 'piqa', 'acc,none': 0.7720348204570185, 'acc_stderr,none': 0.00978809383232491, 'acc_norm,none': 0.7747551686615887, 'acc_norm_stderr,none': 0.009746643471032147},
 'social_iqa': {'alias': 'social_iqa', 'acc,none': 0.44114636642784033, 'acc_stderr,none': 0.011235418947344599},
 'wikitext': {'alias': 'wikitext', 'word_perplexity,none': 16.647271516522334, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.6919879195293779, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.7587192679275578, 'bits_per_byte_stderr,none': 'N/A'},
 'winogrande': {'alias': 'winogrande', 'acc,none': 0.6669297553275454, 'acc_stderr,none': 0.013246194028070653}}
```

My belief is that the authors may have mistakenly reported 'acc,none' for DCLM-1B but 'acc_norm,none' for the other models. After all, DCLM-1B was trained on 4T tokens of high-quality data, so a high accuracy is to be expected.
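
The gap between the two HellaSwag metrics is large, so mixing them up across models would materially skew the comparison. A minimal check, reusing the hypothetical `results` object from the sketch above:

```python
# The two HellaSwag metrics differ by ~18 points for DCLM-1B, so acc vs.
# acc_norm must be compared consistently across all models in a table.
hs = results["results"]["hellaswag"]
print(hs["acc,none"], hs["acc_norm,none"])  # ~0.536 vs. ~0.717 for DCLM-1B
```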
