Evaluating GenAI Evaluations
Epistemics, Stakeholders, Technical Infrastructure and Policy
Background
Generative AI (GenAI) has become widely adopted, and its attributed capabilities are advancing at a sensational pace. At the same time, debates about short- and long-term risks have raised questions about how these risks can be identified through benchmarks and mitigated by technical means. However, current approaches to evaluation are criticised both for shaping what counts as capability and risk and for various conceptual and methodological problems. Current ML evaluation practices routinely skip rigorous conceptual operationalisation, mapping high-level constructs onto narrow metrics without adequate validation. Evaluations therefore do not merely measure the results and consequences of AI but actively construct what counts as performance and harm. Resolving this requires a socio-technical perspective on who evaluates what, how, and for whom, as the needs of affected groups and users remain underserved by existing benchmarks. Finally, while some access mechanisms exist (APIs, open-weight models), the infrastructure for credible evaluation remains unevenly distributed: independent researchers, NGOs, and regulators often lack cost-effective and reproducible tools for systematic assessment.
Against the backdrop of this GenAI evaluation crisis, this interdisciplinary Weizenbaum short project investigates the methodological and theoretical foundations of GenAI evaluation, which shape not only how GenAI is evaluated, and by whom and for whom, but also narratives around AI capabilities and risks in general. Building on our recent work on AI evaluations and narratives, the project focuses on the epistemics, stakeholders, and technical infrastructure of GenAI evaluations from an interdisciplinary perspective, spanning six research groups of the Weizenbaum Institute.
The project is organized into three work packages:
- WP1 – Epistemics (led by Anne Krüger and Rainer Rehak, with contributions from David Hartmann, LK Seiling, and Angelie Kraft): Develops an epistemic theory of GenAI to critically examine the design choices underlying current evaluation methods and their implications for the validity and interpretability of claims about model performance.
- WP2 – Stakeholders and Governance (led by David Hartmann, in collaboration with Arman Noroozian and Maria Eriksson from HUMAINT-JRC, Bogdan Lungu (ApTI), Annette Zimmermann (WI Fellow) and Tina Lassiter (WI Fellow)): Maps the stakeholder landscape of GenAI evaluations through qualitative interviews, funding analysis, and document analysis, exploring how affected communities, independent and internal auditors, companies, and regulators are included in evaluation infrastructure and measurement — and what factors and incentives may shape evaluation priorities.
- WP3 – Technical Infrastructure (led by Jan Batzner, in collaboration with David Hartmann, Angelie Kraft, and the EvalEval Coalition): Advances a shared schema and crowdsourced database to create a common language for reporting and comparing evaluation results across frameworks (an illustrative sketch follows below).