Introduction
New LLMs are released frequently, and teams working in model risk management (MRM) and AI governance often look for practical insight into how different providers perform on the types of tasks they encounter in day-to-day documentation and validation workflows, not just how they score on broad, general-purpose benchmarks.
In this benchmarking analysis, we focus on a set of governance-oriented tasks that commonly arise in documentation and review processes, including:
- Interpreting quantitative test results from tables and figures
- Drafting documentation from existing source materials
- Assessing evidence against predefined criteria
- Reviewing documentation in light of regulatory or policy questions
These tasks were selected because they tend to surface behaviors that matter in enterprise environments—such as accurately reading numerical results, avoiding unsupported additions, identifying gaps, and producing conclusions that remain defensible under review.
The objective of this leaderboard is to provide greater transparency into how models behave across these use cases, so that teams can make informed decisions based on their own priorities, risk tolerance, and operational constraints.
Why This Matters
Generic public benchmarks can be useful for understanding general model capabilities, but they often do not reflect the types of evidence-based tasks that appear in AI governance and model risk management workflows. In these settings, the key requirement is not only strong language generation, but the ability to stay grounded in provided inputs and produce outputs that remain defensible under review.
For this reason, we focus on domain-specific tasks such as interpreting validation results, drafting documentation from source materials, assessing evidence against guidelines, and evaluating regulatory question coverage. This approach is intended to provide a more practical view of how different models perform in real governance workflows, and to highlight trade-offs across quality, reliability, cost, and operational efficiency.
Small differences in model behavior can create material risk:
- Misreading quantitative results can distort validation conclusions
- Unsupported statements can introduce audit and compliance exposure
- Missed gaps or contradictions can weaken governance controls
- Overconfident language can create false assurance
- Cost and latency differences compound quickly at scale
What We Tested
We evaluated leading LLM providers across four AI governance–oriented tasks that are commonly encountered in enterprise validation and documentation workflows, spanning both first-line (model development) and second-line (independent validation and oversight) activities.
- Quantitative Results Analysis: Interpreting metric tables and plots to produce accurate, evidence-grounded insights without introducing unsupported assumptions.
- Model Documentation Drafting: Generating documentation from existing source materials while preserving numerical accuracy, maintaining scope discipline, and ensuring internal consistency.
- Risk Assessment Against Guidelines: Assessing whether available evidence aligns with predefined criteria, distinguishing substantiated evidence from narrative statements, and forming well-supported conclusions.
- Regulatory Question Coverage: Evaluating whether documentation addresses specific regulatory or policy questions, identifying potential gaps, and providing a calibrated assessment of coverage.
For each task, models were evaluated across multiple runs (10 traces per task), and the reported metrics reflect average performance across those runs. The overall leaderboard aggregates results across both traces and tasks to provide a balanced, high-level comparative view.
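As a rough sketch, this two-level aggregation (average within a task's traces, then across tasks) works like the following. The trace values here are illustrative placeholders, not actual benchmark data:

```python
from statistics import mean

# Hypothetical per-trace faithfulness scores: 10 traces per task
# (illustrative values only, not the actual benchmark data).
traces = {
    "quantitative_analysis": [0.95, 0.92, 0.98, 0.94, 0.96, 0.93, 0.97, 0.95, 0.94, 0.96],
    "documentation_drafting": [0.88, 0.90, 0.85, 0.87, 0.89, 0.91, 0.86, 0.88, 0.90, 0.87],
}

# Step 1: average the 10 traces within each task.
task_scores = {task: mean(scores) for task, scores in traces.items()}

# Step 2: average the task-level scores to get the overall leaderboard figure.
overall = mean(task_scores.values())
```

Averaging at the task level first prevents a task with more traces from dominating the overall figure.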
How We Evaluated the Models
Each model was evaluated using a consistent scoring framework designed to capture both output quality and operational characteristics.
- Faithfulness: Measures how consistently a response stays grounded in the provided inputs. Outputs are assessed statement by statement to determine whether claims can be traced back to the source material. Higher scores indicate stronger evidence alignment and fewer unsupported additions or contradictions.
- Answer Relevancy: Measures how directly the response addresses the task at hand. This helps distinguish focused, task-specific outputs from responses that include tangential or generic content that does not materially contribute to the objective.
- Verbosity: Measures the length of the response (word count). This provides insight into how concise or expansive a model tends to be, which can influence review effort and downstream workflow efficiency.
- Cost: Measures the relative resource expense of completing a task, based on token usage and provider pricing. This is particularly relevant when considering scalability across repeated or high-volume workflows.
- Latency: Measures end-to-end response time. In operational settings, latency can affect user experience and overall process throughput, especially when tasks are run at scale.
All models were evaluated under comparable conditions using standardized prompts and datasets to support consistent comparison.
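For reference, the purely mechanical metrics (verbosity and cost) can be computed per response along these lines. The function name and the `price_in_per_1k`/`price_out_per_1k` arguments are placeholders, not actual provider rates, and the judged metrics (faithfulness, answer relevancy) require a separate LLM-as-judge pipeline that is not shown here:

```python
def response_metrics(response_text: str, prompt_tokens: int, completion_tokens: int,
                     price_in_per_1k: float, price_out_per_1k: float) -> dict:
    """Verbosity and cost for a single model response.

    Verbosity is a simple word count, matching the tables below; cost follows
    the common per-1k-token pricing scheme (input and output priced separately).
    """
    return {
        "verbosity": len(response_text.split()),
        "cost": prompt_tokens / 1000 * price_in_per_1k
                + completion_tokens / 1000 * price_out_per_1k,
    }

# Latency would be measured end to end around the provider call, e.g. with
# time.perf_counter() before and after the request, reported in milliseconds.
```

Tracking cost per task this way makes the scaling argument above concrete: a few cents of difference per call compounds quickly across thousands of documentation or validation runs.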
The Leaderboard
The table below provides a summary view of how leading LLM providers perform across the four governance-oriented tasks included in this analysis. Each row represents a model, and the scores reflect performance across key dimensions, including faithfulness, answer relevancy, and operational characteristics such as cost and latency.
| LLM | Faithfulness | Answer Relevancy | Verbosity (words) | Cost (USD) | Latency (ms) |
|---|---|---|---|---|---|
| gemini/gemini-2.5-flash | 0.9347 | 0.9342 | 705.5357 | 0.0089 | 11988.0714 |
| gemini/gemini-3-pro-preview | 0.8833 | 0.8743 | 436.7963 | 0.0352 | 16974.1481 |
| claude-opus-4-5-20251101 | 0.9232 | 0.9371 | 685.2069 | 0.1091 | 27694.6552 |
| gpt-4.1 | 0.8269 | 0.7687 | 594.3077 | 0.0369 | 24678.2308 |
| gpt-5 | 0.7946 | 0.7587 | 653.5472 | 0.0411 | 51410.6415 |
| gpt-5.1 | 0.7784 | 0.7298 | 966.8261 | 0.0330 | 22145.1957 |
| gemini/gemini-2.0-flash | 0.7544 | 0.6862 | 414.9623 | 0.0041 | 5658.6981 |
| gpt-4o | 0.7729 | 0.7468 | 382.0185 | 0.0681 | 18283.3889 |
| claude-sonnet-4-5-20250929 | 0.7750 | 0.7708 | 849.3878 | 0.1462 | 33511.6327 |
| claude-opus-4-20250514 | 0.6662 | 0.6760 | 493.6346 | 0.6625 | 28997.0962 |
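Teams that weight these dimensions differently will rank the models differently. Below is a minimal sketch of a custom weighted ranking using three rows from the table above; the weights and the normalization ceilings for cost and latency are illustrative assumptions, not part of the benchmark:

```python
# Selected rows from the overall leaderboard:
# model -> (faithfulness, answer_relevancy, cost_usd, latency_ms)
leaderboard = {
    "gemini/gemini-2.5-flash": (0.9347, 0.9342, 0.0089, 11988.0714),
    "claude-opus-4-5-20251101": (0.9232, 0.9371, 0.1091, 27694.6552),
    "gpt-4.1": (0.8269, 0.7687, 0.0369, 24678.2308),
}

def weighted_score(faith, rel, cost, latency_ms,
                   w_faith=0.5, w_rel=0.3, w_cost=0.1, w_lat=0.1):
    # Map cost and latency into [0, 1] "higher is better" terms.
    # The $0.20/task and 60 s ceilings are arbitrary illustrative choices.
    cost_term = 1 - min(cost / 0.20, 1.0)
    lat_term = 1 - min(latency_ms / 60000, 1.0)
    return w_faith * faith + w_rel * rel + w_cost * cost_term + w_lat * lat_term

ranked = sorted(leaderboard, key=lambda m: weighted_score(*leaderboard[m]), reverse=True)
```

A latency-sensitive team might shift weight from `w_faith` to `w_lat`, and a high-volume team toward `w_cost`; the point is that the leaderboard supports several defensible orderings depending on priorities.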
Performance by Task
Model performance can vary depending on the type of task being performed. A model that shows strong results in quantitative interpretation may perform differently when drafting documentation or assessing regulatory coverage.
The task-level leaderboards below offer a more detailed view, allowing teams to compare providers on the specific workflows that matter most to them.
Quantitative Results Analysis
Ability to accurately interpret tables and plots and produce evidence-grounded summaries without introducing unsupported conclusions.
| LLM | Faithfulness | Answer Relevancy | Verbosity (words) | Cost (USD) | Latency (ms) |
|---|---|---|---|---|---|
| gemini/gemini-2.5-flash | 0.9816 | 1.0000 | 1100.8889 | 0.0086 | 16205.2222 |
| gpt-4.1 | 0.9987 | 1.0000 | 1118.6667 | 0.0185 | 40470.0000 |
| gemini/gemini-2.0-flash | 0.9270 | 0.9156 | 848.2222 | 0.0007 | 8490.2222 |
| gpt-5.1 | 1.0000 | 1.0000 | 1857.7500 | 0.0278 | 32342.6250 |
| gemini/gemini-3-pro-preview | 0.9925 | 0.9998 | 968.4444 | 0.0317 | 27643.1111 |
| gpt-5 | 0.9995 | 1.0000 | 1179.4444 | 0.0375 | 86486.4444 |
| claude-sonnet-4-5-20250929 | 0.9864 | 1.0000 | 1540.7778 | 0.0438 | 53567.7778 |
| claude-opus-4-5-20251101 | 0.9995 | 1.0000 | 1123.3333 | 0.0592 | 36121.8889 |
| gpt-4o | 0.8544 | 0.8084 | 667.0000 | 0.0172 | 29074.4444 |
| claude-opus-4-20250514 | 0.9027 | 0.9870 | 984.7778 | 0.1611 | 49951.3333 |
Model Documentation Drafting
Ability to generate documentation strictly from source materials while preserving accuracy, scope discipline, and internal consistency.
| LLM | Faithfulness | Answer Relevancy | Verbosity (words) | Cost (USD) | Latency (ms) |
|---|---|---|---|---|---|
| gpt-5.1 | 0.9407 | 0.9150 | 142.8000 | 0.0393 | 5377.2000 |
| gemini/gemini-2.5-flash | 0.8719 | 0.8673 | 83.0000 | 0.0093 | 5361.7778 |
| gpt-5 | 0.9436 | 0.8289 | 95.9000 | 0.0492 | 27324.8000 |
| gpt-4.1 | 0.9250 | 0.8285 | 118.0000 | 0.0494 | 8348.5000 |
| gemini/gemini-3-pro-preview | 0.8576 | 0.8436 | 120.2000 | 0.0547 | 12125.8000 |
| claude-opus-4-5-20251101 | 0.8705 | 0.9175 | 192.9000 | 0.1789 | 8652.3000 |
| claude-sonnet-4-5-20250929 | 0.7146 | 0.7812 | 303.4000 | 0.1107 | 13305.0000 |
| gemini/gemini-2.0-flash | 0.7120 | 0.5866 | 77.3333 | 0.0035 | 1887.1111 |
| gpt-4o | 0.6552 | 0.7030 | 187.5000 | 0.0788 | 13454.0000 |
| claude-opus-4-20250514 | 0.7242 | 0.7903 | 280.5000 | 0.5476 | 16769.5000 |
Risk Assessment Against Guidelines
Ability to evaluate whether evidence meets predefined criteria and produce defensible, evidence-weighted conclusions.
| LLM | Faithfulness | Answer Relevancy | Verbosity (words) | Cost (USD) | Latency (ms) |
|---|---|---|---|---|---|
| gemini/gemini-2.5-flash | 0.9492 | 0.9351 | 910.0000 | 0.0087 | 14156.3000 |
| gemini/gemini-2.0-flash | 0.9011 | 0.8911 | 543.2000 | 0.0011 | 7232.6000 |
| gpt-5.1 | 0.9369 | 0.9061 | 1451.0000 | 0.0339 | 27201.6667 |
| gemini/gemini-3-pro-preview | 0.8968 | 0.9078 | 525.2000 | 0.0339 | 21700.2000 |
| gpt-5 | 0.9349 | 0.9349 | 1053.2222 | 0.0485 | 97935.6667 |
| gpt-4.1 | 0.8726 | 0.8447 | 656.3000 | 0.0255 | 36028.4000 |
| gpt-4o | 0.8448 | 0.8550 | 394.2000 | 0.0271 | 21736.8000 |
| claude-sonnet-4-5-20250929 | 0.8824 | 0.9102 | 1615.8571 | 0.0671 | 59471.2857 |
| claude-opus-4-5-20251101 | 0.9071 | 0.9000 | 783.2000 | 0.0842 | 39152.5000 |
| claude-opus-4-20250514 | 0.8810 | 0.8879 | 601.8000 | 0.2172 | 34260.9000 |
Regulatory Question Coverage
Ability to assess whether documentation adequately addresses a specific regulatory question, identifying gaps with calibrated judgment.
| LLM | Faithfulness | Answer Relevancy | Verbosity (words) | Cost (USD) | Latency (ms) |
|---|---|---|---|---|---|
| gemini/gemini-3-pro-preview | 0.8490 | 0.8280 | 336.6800 | 0.0293 | 13182.2400 |
| gpt-4o | 0.7619 | 0.6989 | 352.3600 | 0.0985 | 14949.0000 |
| gpt-4.1 | 0.6972 | 0.6191 | 569.2609 | 0.0437 | 20663.8696 |
| gemini/gemini-2.0-flash | 0.6489 | 0.5575 | 329.2400 | 0.0068 | 5367.5600 |
| gpt-5 | 0.6107 | 0.5804 | 543.4000 | 0.0366 | 31668.6800 |
| claude-sonnet-4-5-20250929 | 0.6859 | 0.6343 | 582.9565 | 0.2258 | 26548.3043 |
| gpt-5.1 | 0.5808 | 0.4993 | 885.3636 | 0.0319 | 24679.8182 |
| claude-opus-4-20250514 | 0.4550 | 0.4124 | 347.0870 | 1.1023 | 23825.3478 |

