I'm afraid the HellaSwag accuracy reported for DCLM-1.4B is largely incorrect (the same number also appears in the arXiv paper). Below are my evaluations of mllmTeam/PhoneLM-1.5B and TRI-ML/DCLM-1B using the EleutherAI lm-evaluation-harness (lm_eval):
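
For reference, results in this format come from the harness's Python entry point; here is a minimal sketch of the invocation (assuming lm_eval v0.4+, with the batch size illustrative rather than the exact setting I used):

```python
# Minimal reproduction sketch using the EleutherAI lm-evaluation-harness (lm_eval v0.4+).
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=TRI-ML/DCLM-1B",  # or "pretrained=mllmTeam/PhoneLM-1.5B"
    tasks=["arc_challenge", "arc_easy", "boolq", "hellaswag", "lambada_openai",
           "piqa", "social_iqa", "wikitext", "winogrande"],
    batch_size="auto",  # illustrative; any workable batch size gives the same metrics
)
print(results["results"])  # per-task metrics, as pasted below
```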

For mllmTeam/PhoneLM-1.5B:

```python
{'arc_challenge': {'alias': 'arc_challenge', 'acc,none': 0.3779863481228669, 'acc_stderr,none': 0.0141696645203031, 'acc_norm,none': 0.39761092150170646, 'acc_norm_stderr,none': 0.01430175222327954},
 'arc_easy': {'alias': 'arc_easy', 'acc,none': 0.7243265993265994, 'acc_stderr,none': 0.009169229476542569, 'acc_norm,none': 0.6994949494949495, 'acc_norm_stderr,none': 0.009407763090599316},
 'boolq': {'alias': 'boolq', 'acc,none': 0.6599388379204894, 'acc_stderr,none': 0.008285579731379784},
 'hellaswag': {'alias': 'hellaswag', 'acc,none': 0.5044811790479984, 'acc_stderr,none': 0.004989581008163221, 'acc_norm,none': 0.6687910774746066, 'acc_norm_stderr,none': 0.004696861625496948},
 'lambada_openai': {'alias': 'lambada_openai', 'perplexity,none': 4.6818882905778265, 'perplexity_stderr,none': 0.10240384705826118, 'acc,none': 0.6551523384436251, 'acc_stderr,none': 0.006622117207603226},
 'piqa': {'alias': 'piqa', 'acc,none': 0.7540805223068553, 'acc_stderr,none': 0.010047331865625213, 'acc_norm,none': 0.7693144722524483, 'acc_norm_stderr,none': 0.009828959550983089},
 'social_iqa': {'alias': 'social_iqa', 'acc,none': 0.43244626407369496, 'acc_stderr,none': 0.011210331273967561},
 'wikitext': {'alias': 'wikitext', 'word_perplexity,none': 13.414900237284861, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.6250416226082716, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.700476670732793, 'bits_per_byte_stderr,none': 'N/A'},
 'winogrande': {'alias': 'winogrande', 'acc,none': 0.6345698500394633, 'acc_stderr,none': 0.013533965097638798}}
```

For TRI-ML/DCLM-1B:

```python
{'arc_challenge': {'alias': 'arc_challenge', 'acc,none': 0.4129692832764505, 'acc_stderr,none': 0.014388344935398326, 'acc_norm,none': 0.43430034129692835, 'acc_norm_stderr,none': 0.01448470304885736},
 'arc_easy': {'alias': 'arc_easy', 'acc,none': 0.7491582491582491, 'acc_stderr,none': 0.00889518301048739, 'acc_norm,none': 0.7138047138047138, 'acc_norm_stderr,none': 0.009274470774627726},
 'boolq': {'alias': 'boolq', 'acc,none': 0.709480122324159, 'acc_stderr,none': 0.007940549952156444},
 'hellaswag': {'alias': 'hellaswag', 'acc,none': 0.5361481776538538, 'acc_stderr,none': 0.004976724124850563, 'acc_norm,none': 0.7165903206532563, 'acc_norm_stderr,none': 0.0044973255339596455},
 'lambada_openai': {'alias': 'lambada_openai', 'perplexity,none': 4.016529108680619, 'perplexity_stderr,none': 0.08517294414865222, 'acc,none': 0.6941587424801087, 'acc_stderr,none': 0.0064193271158925974},
 'piqa': {'alias': 'piqa', 'acc,none': 0.7720348204570185, 'acc_stderr,none': 0.00978809383232491, 'acc_norm,none': 0.7747551686615887, 'acc_norm_stderr,none': 0.009746643471032147},
 'social_iqa': {'alias': 'social_iqa', 'acc,none': 0.44114636642784033, 'acc_stderr,none': 0.011235418947344599},
 'wikitext': {'alias': 'wikitext', 'word_perplexity,none': 16.647271516522334, 'word_perplexity_stderr,none': 'N/A', 'byte_perplexity,none': 1.6919879195293779, 'byte_perplexity_stderr,none': 'N/A', 'bits_per_byte,none': 0.7587192679275578, 'bits_per_byte_stderr,none': 'N/A'},
 'winogrande': {'alias': 'winogrande', 'acc,none': 0.6669297553275454, 'acc_stderr,none': 0.013246194028070653}}
```

My belief is that the authors may have mistakenly reported 'acc,none' for DCLM-1B but 'acc_norm,none' for the other models. After all, DCLM-1B was trained on 4T tokens of high-quality data, so a high accuracy is to be expected.
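
The gap between the two HellaSwag metrics is large, so mixing them up across models would materially skew the comparison. A minimal check, reusing the hypothetical `results` object from the sketch above:

```python
# The two HellaSwag metrics differ by ~18 points for DCLM-1B, so acc vs.
# acc_norm must be compared consistently across all models in a table.
hs = results["results"]["hellaswag"]
print(hs["acc,none"], hs["acc_norm,none"])  # ~0.536 vs. ~0.717 for DCLM-1B
```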
