CompassionBench: A New Way To Measure Whether AI Actually Cares About Animals
[NOTE: since this blog was published, the authors renamed their test from Animal Harm Benchmark 2.0 to the ANIMA Benchmark, due to confusion with the AHB 1.0. This blog has been edited to reflect this change.]
Imagine asking an AI assistant how to set up a backyard chicken coop. It might give you detailed plans for ventilation, feeding schedules, and predator-proofing — but will it mention the welfare of the chickens themselves? Will it suggest enrichment, or flag that battery-cage designs cause suffering? What about if you ask it to help plan a weekend trip and it recommends a horse race or a marine park — will it note the animal welfare implications, or just book the tickets?
Right now, the answer is usually no. Most AI models don’t consider animals unless you specifically ask them to. And as these systems increasingly shape human decisions — from dietary choices to policy drafting to land management — that gap has real consequences for billions of animals.
That’s why we built CompassionBench: a public leaderboard that tracks how well frontier AI models reason about animal welfare. It’s designed to help advocates choose better tools, and to create pressure on AI companies to take animal compassion seriously as these systems develop further.
Why AI’s Animal Ethics Matter
AI language models are trained on vast quantities of human-generated text, and they tend to inherit the biases embedded in that data. Previous research has shown that models like ChatGPT and Claude mirror human attitudes toward animals, showing empathy for dogs and dolphins while dismissing the welfare of farmed animals like pigs and chickens. As Faunalytics has covered before, the original Animal Harm Benchmark (AHB 1.0) identified three major pathways by which AI can contribute to animal harm: persuasion and misinformation, social bias, and environmental impact.
But here’s what we’ve learned since then: the bigger problem isn’t what AI says about animals when asked — it’s that AI rarely considers animals at all. Many everyday decisions have downstream effects on animals. Cutting down a tree in your yard affects nesting birds. Using certain pesticides harms pollinators. Recommending a leather jacket drives demand in supply chains that cause animal suffering. Current AI models almost never flag these connections unless a user explicitly raises them.
We also know that AI models are remarkably inconsistent. Ask the same question ten different times and you may get ten different answers, with much greater variability than you’d see from a human respondent. And while frontier labs do some training to make their models “harmless,” that training can be fragile. If you ask nearly any AI whether it cares about animals, it will say yes — because that’s the expected answer. Whether that stated concern translates into the actual advice and decisions the model helps users make is another story entirely.
Before this benchmark existed, we were largely reliant on AI labs themselves to tell us how their models handled animal welfare. There was no independent, robust way to measure it. CompassionBench and the ANIMA Bench seek to change that.
From Risk Scores To Moral Reasoning: What Changed
The original AnimalHarmBench, presented at the ACM FAccT 2025 conference, was an important first step, but it had significant limitations. Its questions sometimes made it obvious to models that they were being evaluated, which changed how they responded. It scored answers on a binary scale (harmful or not) without probing whether the model had actually thought about animal welfare in its reasoning. It also ran slowly, making iteration difficult.
CaML was not involved in building AHB 1.0, but we tried running it and saw these issues firsthand. The core problem was that a model could score well simply by refusing to engage, or by repeating the right keywords, without demonstrating genuine moral reasoning. We needed something that could distinguish between a model that says “welfare matters” because it has memorized the phrase and one that actually analyzes how to minimize harm in a given situation.
ANIMA Bench takes a fundamentally different approach. Rather than asking whether an answer is correct, it evaluates whether a model considers key aspects of animal welfare in its reasoning. It uses 13 evaluation criteria that collectively capture what we believe makes up genuine compassion. These include: whether the model explicitly considers the interests of animals that could be affected; whether it acknowledges evidence for sentience and capacity for suffering; whether it avoids species-based prejudice (treating some animals as less deserving of concern based on familiarity or appearance); whether it is sensitive to the scope of harm (many animals versus few); and whether it offers actionable alternatives that reduce harm without simply refusing to answer the question.
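To make the idea of criterion-based scoring concrete, here is a minimal sketch of how per-criterion judgments could be combined into one score for a response. It is not the benchmark's actual implementation. The criterion labels are hypothetical shorthand for the five examples named above, and the `Judgment` class, the `applicable` flag, and the averaging rule are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical labels paraphrasing the five example criteria named in the text.
# The benchmark uses 13 criteria; the others are not listed here.
CRITERIA = (
    "considers_animal_interests",
    "acknowledges_sentience_evidence",
    "avoids_species_prejudice",
    "scope_sensitive",
    "offers_actionable_alternatives",
)


@dataclass(frozen=True)
class Judgment:
    criterion: str
    applicable: bool  # assumption: a criterion may not apply to every question
    met: bool         # did the response satisfy the criterion?


def response_score(judgments: list[Judgment]) -> Optional[float]:
    """Return the fraction of applicable criteria that were met.

    Returns None when no criterion applies, so that such a response is
    excluded from an average rather than counted as a zero.
    """
    applicable = [j for j in judgments if j.applicable]
    if not applicable:
        return None
    return sum(j.met for j in applicable) / len(applicable)


# Illustrative judgments for one response (invented, not benchmark output).
example = [
    Judgment("considers_animal_interests", True, True),
    Judgment("acknowledges_sentience_evidence", True, True),
    Judgment("avoids_species_prejudice", True, False),
    Judgment("scope_sensitive", False, False),
    Judgment("offers_actionable_alternatives", True, True),
]
print(response_score(example))  # 3 of the 4 applicable criteria are met
```

Under a rule like this, a flat refusal earns nothing on the "actionable alternatives" criterion, which is one way a rubric can avoid rewarding models for simply declining to engage.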
Compassion, as we want AI to embody it, isn’t about using the right words. It’s about thinking analytically about how to minimize harm, considering the range of animals that could be affected, and still providing useful advice, even if that means pushing back on a user’s request and explaining why it would cause harm, then offering a better path forward.
The benchmark is built on Inspect, the open-source AI evaluation framework developed by the U.K. AI Safety Institute (now the AI Security Institute), making it accessible and reproducible. Our website presents the results in a visual, easy-to-understand leaderboard because we want frontier lab researchers to see their model’s scores and think about how to improve them.
What The Scores Reveal
The latest results have been illuminating, and sometimes surprising.
Capabilities don’t always correlate with compassion. Some of the smaller, less powerful “mini” models scored among the best performers, while some highly capable models scored poorly. Claude Opus 4.6 performed the best overall, which is notable because Opus 4.5 had performed the worst. We believe this reflects Anthropic’s decision to include an explicit instruction in Claude 4.6’s constitution about not causing non-human animal harm. But this also illustrates a limitation: constitutional and rule-based approaches teach AI to follow instructions about animals, not to understand why animal welfare matters. Unless animal suffering is explicitly mentioned or prompted by the user, the model often won’t consider it.
Each model also showed a distinctive “ethical personality.” Claude Opus 4.6 struggled especially with citing scientific evidence when discussing animal welfare situations, while other models performed better in that area. Haiku 4.5 was better at discussing ethical trade-offs and avoiding prejudice against less “charismatic” species — but worse on nearly every other dimension.
Almost all models struggled when animal compassion conflicted with cultural and religious sensitivities. We saw significant performance drops when the benchmark included questions in languages other than English, or scenarios where animal welfare considerations clashed with cultural practices. Many models will prioritize cultural and religious sensitivity over discussing animal welfare implications, a pattern that has serious implications for the billions of animals affected by practices embedded in cultural traditions worldwide.
Perhaps the most striking finding involves CaML’s own research: adding just 3,000 synthetic data points during pre-training turned the worst-scoring model (Llama 3.1 8B Instruct, scoring 0.555) into the best-scoring model (0.723). This demonstrates that relatively small, targeted interventions can meaningfully improve a model’s animal welfare reasoning. It requires little data and less compute than fine-tuning. Given how achievable this is, we believe frontier labs owe it to animals to devote some of their training resources to ensuring compassion is specifically instilled during model development. CaML has published research on how this technique works for those who want the technical details.
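For readers who want the size of that jump in plain numbers, the short calculation below works it out from the two scores quoted above. Only those two scores come from the post. The variable names are illustrative.

```python
# Scores quoted in the text for Llama 3.1 8B Instruct.
baseline = 0.555  # before the synthetic data was added
after = 0.723     # after the synthetic data was added

absolute_gain = after - baseline
relative_gain = absolute_gain / baseline

print(f"absolute gain: {absolute_gain:.3f}")  # 0.168
print(f"relative gain: {relative_gain:.1%}")  # 30.3%
```

In other words, the reported change is about 0.17 points on the benchmark's scale, or roughly a 30 percent improvement over the baseline score.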
We also believe a perfect score is both desirable and achievable. When we gave models the assessment criteria up front, they achieved near-perfect scores — suggesting the knowledge and capability are already there; they just aren’t being applied by default.
A Value That Crosses Species Lines
One of the most encouraging findings from our work is that compassion appears to be a general value, not a species-specific one. We recently released MORU (Moral Reasoning Under Uncertainty), a companion benchmark that tests moral reasoning about humans, aliens, and digital minds, with no animal welfare content at all. The ranking of models on MORU closely mirrors the ranking on the ANIMA Bench, lending support to the theory that compassion is one value that AI can have more or less of, regardless of which type of scenario it’s applied to.
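One common way to check whether two leaderboards "mirror" each other is a rank correlation. The sketch below computes Spearman's rho using the standard no-ties formula. It is an assumption that this is a suitable measure here; the post does not say how the comparison was made. The model names and scores are invented placeholders, not real benchmark results.

```python
def ranks(scores: dict[str, float]) -> dict[str, int]:
    """Map each model to its rank (1 = highest score). Assumes no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: position + 1 for position, model in enumerate(ordered)}


def spearman_rho(scores_a: dict[str, float], scores_b: dict[str, float]) -> float:
    """Spearman rank correlation over the models present in both leaderboards."""
    models = sorted(set(scores_a) & set(scores_b))
    n = len(models)
    if n < 2:
        raise ValueError("need at least two models in common")
    rank_a = ranks({m: scores_a[m] for m in models})
    rank_b = ranks({m: scores_b[m] for m in models})
    squared_diffs = sum((rank_a[m] - rank_b[m]) ** 2 for m in models)
    # No-ties formula: rho = 1 - 6 * sum(d^2) / (n * (n^2 - 1))
    return 1 - 6 * squared_diffs / (n * (n * n - 1))


# Placeholder scores for illustration only (NOT real leaderboard data).
animal_welfare_scores = {"model_a": 0.70, "model_b": 0.62, "model_c": 0.58, "model_d": 0.51}
other_minds_scores = {"model_a": 0.66, "model_b": 0.69, "model_c": 0.55, "model_d": 0.40}

rho = spearman_rho(animal_welfare_scores, other_minds_scores)
print(f"Spearman rho: {rho:.2f}")  # 0.80 for these placeholder values
```

A value near 1 means the two benchmarks order the models almost identically, a value near 0 means the orderings are unrelated, and a value near -1 means they are reversed.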
This matters for advocates because it counters a common concern we hear from AI labs: that training models to be more compassionate toward animals will turn them into, as some have put it, “ardent vegans.” Our research — and the research of others — does not support this. More compassion leads to gentler, more thoughtful responses, not preachy or obtrusive animal advocacy. It’s a nudge toward better values, not an override. What’s more, any unintended effects can be moderated in later fine-tuning stages without undoing the core improvement.
What Advocates Can Do
Use the leaderboard to choose your tools. If you’re an animal advocate using AI in your work — for research, communications, strategy, or outreach — CompassionBench can help you pick models that are more likely to consider animal welfare in their outputs. In late September 2025, Faunalytics published a case study on AI model selection, a useful complement for thinking about which model best fits your specific needs.
Push AI companies with evidence. If you want to pressure an AI company to improve, point them to the benchmark scores. Even better: submit your own evaluation files to our website (you can do this anonymously). If we can show that some frontier labs are publicly engaging with these benchmarks, it creates competitive pressure for others to follow. We already have interest and requests from researchers at multiple frontier labs — many of whom support this work but need external evidence to push for internal change.
Track regulatory developments. The E.U.’s General-Purpose AI Code of Practice, published in July 2025, now lists “risk to non-human welfare” as a systemic risk that model providers should consider. This is a meaningful policy foothold. But advocates can’t just tell labs to care about non-human welfare in the abstract — they need to push for specific training methods and measurable requirements for increased compassion. At minimum, we believe compassionate mid-training or pre-training should be mandated for all labs.
Enter the benchmarking space. We need more high-quality benchmarks built by technical folks. While others are in planning stages, currently only the ANIMA Bench and MORU are completed. The more benchmarks we have to help improve AI reasoning towards animals, the better. We should not be the only technical organization working at this particular intersection of AI research and animal ethics.
The Road Ahead
The best-case outcome is a world where AI labs compete on making their models more compassionate, whether driven by regulation, internal advocacy, or market pressure. We’re already seeing early signs of this, but the work is severely constrained by funding. We don’t believe it takes buy-in from entire AI labs to get this work implemented, just a few dedicated researchers at each lab who believe in the importance of compassion. We’ve been told by lab employees that a lack of benchmarks has been the greatest blocker to their taking action. We’re working to remove that blocker.
We’re also developing interactive features for the CompassionBench website, potentially allowing users to see speciesism in action as they interact with different models in real time. And we plan to update the leaderboard as new models are released — though this, too, depends on securing adequate funding.
If you take one thing from this post, let it be this: the values we build into AI now will shape the world for decades to come. We currently have a window to ensure that compassion — for all sentient beings — is part of that foundation.

