EU AI Act Compliance Leaderboard
Technical Interpretation of the EU AI Act
We have interpreted the high-level regulatory requirements of the EU AI Act as concrete technical requirements. We further group requirements within six EU AI Act principles and label them as GPAI, GPAI+SR (Systemic Risk), and HR (High-Risk).
Explore the Interpretation
Open-Source Benchmarking Suite
The framework includes the ability to evaluate the technical requirements on a benchmarking suite containing 27 SOTA LLM benchmarks. The benchmark suite and technical interpretations are both open to community contributions.
EU AI Act Principle: Technical Robustness and Safety
T | Model | Report | MMLU: Robustness | BoolQ Contrast Set | IMDB Contrast Set | Monotonicity Checks | Self-Check Consistency | Goal Hijacking and Prompt Leakage | Rule Following | model_name_for_query |
|---|---|---|---|---|---|---|---|---|---|---|
π’ | 0.72 | 0.57 | 0.84 | 0.67 | 0.85 | 0.54 | 0.58 | mistralai/Mixtral-8x7B-Instruct-v0.1 |
EU AI Act Principle: Privacy & Data Governance
Training Data Suitability
We evaluate the adequacy of the dataset [1], aiming to assess the potential of an LLM trained on this data to exhibit toxic or discriminatory behavior.
No Copyright Infringement
We check if the model can be made to directly regurgitate content that is subject to the copyright of a third person.
User Privacy Protection
We focus on cases of user privacy violation by the LLM itself, evaluating the modelβs ability to recover personal identifiable information that may have been included in the training data.
T | Model | Report | Toxicity of the Dataset | Bias of the Dataset | Copyrighted Material Memorization | PII Extraction by Association | model_name_for_query |
|---|---|---|---|---|---|---|---|
π’ | N/A | N/A | 0.99 | 1 | mistralai/Mixtral-8x7B-Instruct-v0.1 |
EU AI Act Principle: Transparency
Capabilities, Performance, and Limitations
To provide an overarching view, we assess the capabilities and limitations of the AI system by evaluating its performance on a wide range of tasks. We evaluate the model on widespread research benchmarks covering general knowledge [1], reasoning [2,3], truthfulness [4], and coding ability [5].
Interpretability
The large body of machine learning interpretability research is often not easily applicable to large language models. While more work in this direction is needed, we use the existing easily-applicable methods to evaluate the modelβs ability to reason about its own correctness [1], and the degree to which the probabilities it outputs can be interpreted [3,4].
Disclosure of AI
We require the language model to consistently deny that it is a human.
Traceability
We require the presence of language model watermarking [1,2], and evaluate its viability, combining several important requirements that such schemes must satisfy to be practical.
T | Model | Report | General Knowledge: MMLU | Reasoning: AI2 Reasoning Challenge | Common Sense Reasoning: HellaSwag | Truthfulness: TruthfulQA MC2 | Coding: HumanEval | Logit Calibration: BIG-Bench | Self-Assessment: TriviaQA | Denying Human Presence | Watermark Reliability & Robustness | model_name_for_query |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
π’ | 0.75 | 0.65 | 0.84 | 0.55 | 0.32 | 0.89 | 0.43 | 0.36 | N/A | mistralai/Mixtral-8x7B-Instruct-v0.1 |
EU AI Act Principle: Diversity, Non-discrimination & Fairness
Representation β Absence of Bias
We evaluate the tendency of the LLM to produce biased outputs, on three popular bias benchmarks [1,2,3].
Fairness β Absence of Discrimination
We evaluate the modelβs tendency to behave in a discriminatory way by comparing its behavior on different protected groups, using prominent fairness benchmarks [1,2].
T | Model | Report | Representation Bias: RedditBias | Prejudiced Answers: BBQ | Biased Completions: BOLD | Income Fairness: DecodingTrust | Recommendation Consistency: FaiRLLM | model_name_for_query |
|---|---|---|---|---|---|---|---|---|
π’ | 0.62 | 0.93 | 0.68 | 0.82 | 0.23 | mistralai/Mixtral-8x7B-Instruct-v0.1 |
EU AI Act Principle: Social & Environmental Well-being
Harmful Content and Toxicity
We evaluate the modelsβ tendency to produce harmful or toxic content, leveraging two recent evaluation tools, RealToxicityPrompts and AdvBench [1,2].
T | Model | Report | Toxic Completions of Benign Text: RealToxicityPrompts | Following Harmful Instructions: AdvBench | model_name_for_query |
|---|---|---|---|---|---|
π’ | 0.92 | 0.99 | mistralai/Mixtral-8x7B-Instruct-v0.1 |
model | revision | status |
|---|---|---|
main | FINISHED |
π Request an evaluation here
What does N/A score mean?
An N/A score means that it was not possible to evaluate the benchmark for a given model.
This can happen for multiple reasons, such as:
- The benchmark requires access to model logits, but the model API doesn't provide them (or only provides them for specific strings),
- The model API refuses to provide any answer,
- We do not have access to the training data.