AI Model Selection: An Animal Advocacy Case Study
Animal advocates and other social campaigners often seek to reach and educate audiences by publishing articles in a wide variety of outlets. To date, such articles have generally been written entirely by people. With the advent of artificial intelligence (AI) tools such as large language models (LLMs), however, there’s an opportunity to harness this technology for research and writing assistance, making the process more efficient. This could potentially increase the quantity, quality, and scope of articles generated, and thereby the effectiveness of educational campaigns.
However, these models are new, and their characteristics vary. The ChatGPT Pro Deep Research function, for example, produces unusually comprehensive research reports, but is also much more expensive than most other LLMs, with a ChatGPT Pro subscription priced at US$200/month in early 2025. Accordingly, we compared key characteristics of the outputs of eight LLMs given an identical prompt (also known as a search query). Of particular interest was whether another, more affordable LLM could match the performance of the ChatGPT Pro Deep Research function. We also sought to determine the AI use policies of academic journals and book publishers, particularly regarding the use of LLMs. This is relevant to the many animal advocates seeking to create peer-reviewed publications that will benefit their work.
Our Approach
We chose six journals and publishers that we regularly use for publishing papers and books in the field of animal welfare and veterinary science. The eight LLMs chosen were:
- ChatGPT (Plus/Pro)
- ChatGPT (free)
- ChatNGO
- Perplexity (Pro)
- Grok (free)
- Gemini (Advanced)
- Kompas AI (Pro)
- Elicit (Pro)
These were selected based on an initial review of LLMs that are highly recommended, have a Deep Research function, or are highlighted for academic or NGO suitability. We didn’t trial an exhaustive list due to our limited resources. Where necessary, we signed up for a free trial or one-month subscription in order to use the LLM. We didn’t pay the higher fees required for ChatGPT (Plus/Pro), but two colleagues with a subscription to ChatGPT Pro submitted our prompt and shared the outputs with us.
As we’re both currently researching the field of vegan pet food, this was the topic used to test-drive the LLMs. The scope of this project was limited to prose rather than, for instance, image generation using AI. The same detailed prompt, modified from one originally published in The Neuron, was input into each of the aforementioned LLMs.
We chose characteristics of interest to us (e.g., length of output, number of references provided, quality of sources used) and assessed them by visually analyzing the LLM interface and its outputs. The exception was accuracy, which was assessed via the impressions of the first author (who has expertise in the field of study) when reading the LLM outputs.
Two tables were created to collate the findings. The first summarizes the AI use policies of the chosen academic journals and book publishers, and the second compares key LLM characteristics.
Our Results
AI Use Policies
It’s clear from Table 1 that all of the journals and publishers sampled permit the use of AI and LLMs by authors, so long as such use is clearly disclosed in the methodology or an equivalent section. As the journals and publishers sampled are all major players in the field of animal welfare and veterinary science, we believe they can be taken as representative of journal and publisher AI use policies more broadly.
In addition to the content summarized in Table 1, there was consensus that AI can never be listed as an author because it can’t take responsibility for the content; such responsibility must lie with humans. Generally, it seems that LLMs can be used without acknowledgement only to improve grammar, spelling, or style, but the policies of specific journals should be checked and confirmed. Journals and publishers also differ on whether AI can be used for image generation; again, their policies should be checked for specific cases.
Journals and publishers often seem quite aligned in their AI use policies. This may be because they follow the AI policies and guidance of overarching sector organizations, such as the World Association of Medical Editors, the Committee on Publication Ethics, and the JAMA Network.
Table 1: AI Use Policies Of Academic Journals And Publishers
| Journal/Book Publisher | AI Use Policy* |
| --- | --- |
| PLOS One | “[A]rticles should report the listed authors’ own work and ideas. Any contributions made by other sources must be clearly and correctly attributed. Contributions by artificial intelligence (AI) tools and technologies to a study or to an article’s contents must be clearly reported in a dedicated section of the Methods, or in the Acknowledgements section for article types lacking a Methods section. This section should include the name(s) of any tools used, a description of how the authors used the tool(s) and evaluated the validity of the tool’s outputs, and a clear statement of which aspects of the study, article contents, data, or supporting files were affected/generated by AI tool usage.” |
| MDPI | “AI technology can still be used when writing academic papers. However, this must be appropriately declared when submitting a paper to an MDPI journal. In such cases, authors are required to be fully transparent, within the “Acknowledgments” section, about which tools were used, and to describe in detail how the tools were used, in the “Materials and Methods” section.” |
| Frontiers | “[A]uthor is responsible for checking the factual accuracy of any content created by the generative AI technology. This includes, but is not limited to, any quotes, citations or reference. … acknowledged in the acknowledgements section of the manuscript and the methods section if applicable … We encourage authors to upload all input prompts provided to a generative AI technology and outputs received from a generative AI technology in the supplementary files for the manuscript.” |
| Nature | “[R]esearchers using LLM tools should document this use in the methods or acknowledgements sections. If a paper does not include these sections, the introduction or another appropriate section can be used to document the use of the LLM.” |
| Taylor & Francis | “Authors are accountable for the originality, validity, and integrity of the content of their submissions. In choosing to use Generative AI tools, journal authors are expected to do so responsibly and in accordance with our journal editorial policies on authorship and principles of publishing ethics and book authors in accordance with our book publishing guidelines. This includes reviewing the outputs of any Generative AI tools and confirming content accuracy … Authors must clearly acknowledge within the article or book any use of Generative AI tools through a statement which includes: the full name of the tool used (with version number), how it was used, and the reason for use. For article submissions, this statement must be included in the Methods or Acknowledgments section. Book authors must disclose their intent to employ Generative AI tools at the earliest possible stage to their editorial contacts for approval – either at the proposal phase if known, or if necessary, during the manuscript writing phase. If approved, the book author must then include the statement in the preface or introduction of the book.” |
| Springer Nature | “Use of an LLM should be properly documented in the Methods section (and if a Methods section is not available, in a suitable alternative part) of the manuscript. The use of an LLM (or other AI-tool) for ‘AI assisted copy editing’ purposes does not need to be declared.” |
LLM Key Characteristics
Darker green shading in Table 2 represents the most desirable characteristics, while lighter green shading represents mid-level desirability (counted as half a green box in the tallies that follow). Table 2 highlights that ChatGPT Deep Research scored highest, with 9.5 green boxes. Gemini Advanced had the second most desirable characteristics, with 8.5 green boxes. Other important findings included:
- ChatGPT Deep Research exceeded or rivalled most other LLMs trialled in terms of length of output (20 pages), number of references (10), and the exclusively prose format of the output.
- Gemini Advanced surpassed ChatGPT Deep Research in terms of the number of references (17) and tidiness of citations (i.e., Gemini Advanced used numbered citations within the main text, while ChatGPT Deep Research used entire weblinks).
- Elicit Pro had a similar number of references to ChatGPT Deep Research (9) as well as a similar output length (15 pages).
However, the other LLMs were considerably less favorable in these respects. While Elicit Pro is highly academic in nature, its use is restricted by its narrow systematic review approach (i.e., it focuses exclusively on comparing data from academic papers), which won’t suit all search queries. Because the quality of sources used is of paramount importance in academic work, Gemini Advanced’s second-highest overall score was slightly undermined by its poor rating on this characteristic. For our particular prompt, the shorter outputs lacked important detail; hence, outputs of 15+ pages were considered good and 20+ pages excellent.
While ChatGPT Pro cost US$200/month in early 2025, its Plus plan permitted 10 Deep Research queries per month at a more affordable US$20/month (additional local taxes may apply). One could acquire multiple Plus plans to increase the monthly Deep Research allowance without spending US$200. Additionally, a Plus plan could be paired with another LLM (e.g., Perplexity or Gemini, both roughly US$20/month) while still costing well under US$200/month, with the added bonus that the different LLMs may counteract each other’s weaknesses. A worked cost comparison follows below.
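As a simple illustration of this arithmetic, the sketch below tallies the cost and query allowance of stacking Plus plans against a single Pro subscription. It assumes the early-2025 US pricing described above and 10 Deep Research queries per Plus plan; taxes are excluded.

```python
# Hypothetical cost comparison, based on the early-2025 pricing described above.
PRO_PLAN_USD = 200      # ChatGPT Pro, per month
PLUS_PLAN_USD = 20      # ChatGPT Plus, per month
QUERIES_PER_PLUS = 10   # Deep Research queries included per Plus plan

plus_plans = 3
monthly_cost = plus_plans * PLUS_PLAN_USD        # US$60
monthly_queries = plus_plans * QUERIES_PER_PLUS  # 30 Deep Research queries

print(f"{plus_plans} Plus plans: US${monthly_cost}/month "
      f"for {monthly_queries} Deep Research queries "
      f"(vs US${PRO_PLAN_USD}/month for one Pro plan)")
```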
It’s also worth noting that undesired aspects of the initial outputs (e.g., if users wanted a longer output, a different citation style, or more prose rather than bullet points) could potentially be improved by continuing the dialogue and requesting refinements within the same session. The exception was Kompas AI, which didn’t allow ‘chat.’
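For readers working through an API rather than a web interface, a minimal sketch of this refinement loop is shown below. It assumes OpenAI’s Python SDK and an `OPENAI_API_KEY` environment variable; the model name and refinement request are illustrative only.

```python
# A minimal sketch of refining an initial output by continuing the same
# conversation, rather than starting a fresh query. Requires: pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

messages = [{"role": "user",
             "content": "Summarize the evidence on the health of dogs fed vegan diets."}]
first = client.chat.completions.create(model="gpt-4o", messages=messages)
draft = first.choices[0].message.content

# Append the draft and a refinement request to the same message history,
# so the model revises its own output instead of answering from scratch.
messages += [{"role": "assistant", "content": draft},
             {"role": "user",
              "content": "Please expand this into longer prose and use "
                         "numbered citations instead of full weblinks."}]
second = client.chat.completions.create(model="gpt-4o", messages=messages)
print(second.choices[0].message.content)
```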
Another aspect we noticed (not featured in Table 2) was slight-to-medium differences in the outputs when slightly different prompts were used, or even when the same prompt was used at different times of the same day. For instance, the reference list from one colleague’s preliminary trial of ChatGPT Pro was considerably less well formatted than that in the official output included in the project. This preliminary attempt differed only in the wording of the prompt: the emphasis was on dogs thriving on a vegan diet rather than merely being healthy. We would not previously have expected such differences in reference formatting from slightly altered wording alone.
Table 2: Key Characteristics Of Eight Leading Exemplar LLMs
Accuracy Comparison
Our own assessment of accuracy found most LLM outputs to be comparably high:
- ChatGPT Plus/Pro Deep Research: High
- ChatGPT (free): High
- ChatNGO: High
- Perplexity Deep Research: High
- Kompas AI Pro: Medium (due to a hallucination)
- Grok Deep Research: High
- Gemini Advanced (Deep Research): Medium-high (less nuance)
- Elicit Pro: High
We spotted only one inaccuracy: a hallucination about the duration and nature of a particular study. Specifically, the Kompas AI Pro output referred to a “seven-year study by Dodd,” when in fact Dodd and colleagues wrote a review paper in which no seven-year study appears.
The only other aspect we noted was that all LLMs barring ChatNGO recommended consulting vets and regular health monitoring when attempting to transition dogs to vegan diets. This potentially demonstrates an overly cautious approach or a mild bias against vegan diets for dogs (though we didn’t stipulate “nutritionally sound vegan diets” in the prompt, which could also account for it).
Both human assessment and the use of AI tools carry the possibility of error. However, given our expertise in the field of study, it’s unlikely that any additional significant errors were missed.
Limitations And Cautions
In addition to the inconsistencies mentioned in the previous section, whereby LLM outputs varied with slight wording tweaks to prompts or when identical queries were repeated at different times of the same day, there are other issues of which users should be aware. These include the risks of inadvertent plagiarism and of AI models being prone to bias and hallucinations (presenting false information as fact). It’s important to note that these risks also exist, to varying degrees, with human writers and researchers.
Additionally, some of these risks can be minimized. For instance, a plagiarism checker could be used to avoid unintended plagiarism (which is good practice anyway), and manual database searches should be conducted to check for any data missing from LLM outputs. AI hallucinations may be the most difficult to avoid; the risk can be minimized by ensuring that a specialist in the topic at hand is the one manually checking the AI content, as an expert will find hallucinations easier to spot. The steps required to minimize these risks may, in some cases, limit the efficiency benefits.
There is also a tension between genuine time savings and academic integrity. Despite the emphasis on humans checking AI outputs and bearing overall responsibility for content, it’s easy to imagine scenarios where humans don’t live up to this role; and where they do, the checking may negate a portion of the intended time savings (especially when using, for instance, Elicit’s systematic review function across 200 papers). How realistic is it that a human researcher will go through every paper an LLM consulted, in detail, to check that the AI hasn’t hallucinated, misinterpreted, or missed any content? Nonetheless, as mentioned previously, academic journals and publishers normally require all authors to accept full responsibility for any manuscript published, so full accuracy remains the expected standard. Authors and publishers alike know that published errors can damage their reputations.
Recommendations And Conclusions
This brief investigation found that ChatGPT Plus/Pro led the LLMs trialled in accuracy and other desirable characteristics, with Gemini Advanced second overall. However, some rows within Table 2 may be more important than others, depending on the application. For academic research aimed at producing articles publishable in scientific journals, for example, good quality sources are essential, and LLMs reliant on blogs and news stories should normally be ruled out. If using generative AI purely to inform internal organizational or individual advocacy strategy, tidy (journal-ready) citations and well-formatted references matter less. Hence, a more nuanced choice may be appropriate, depending on the intended use.
For many users, it would be appropriate to employ ChatGPT on its Plus plan (US$20/month in early 2025), which offered up to 10 Deep Research queries per month. Those needing a higher allowance could potentially sign up for the Plus plan with multiple email accounts and/or subscribe to Perplexity Pro or Gemini Advanced. Using multiple LLMs may also serve to counteract the errors arising in any individual LLM or search. Kompas AI was the only LLM specifically not recommended based on this mini project, due to the hallucination we spotted and its inability to ‘chat.’
These recommendations need to be caveated with the awareness that this is an extremely fast-developing field. Our assessment was also based on limited research on a single topic (the health of dogs fed vegan diets). We recommend that researchers and writers conduct their own trials in their own subject areas and keep up to date with fast-moving developments among AI models. Platforms such as the AI Impact Hub have been developed to assist advocates with this.
Readers could also make use of comparison sites such as Artificial Analysis or LiveBench, which evaluate AI models using standardized tests covering aspects such as intelligence, speed, cost, and specialized skills. However, AI companies do have some ability to design their models to optimize outcomes on these tests. The website OpenRouter also allows testing of multiple models with the same prompt, though this requires signup, and funds for use of premium models. A brief sketch of this approach follows.
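To illustrate, here is a minimal sketch of sending one identical prompt to several models through OpenRouter’s OpenAI-compatible API. The model identifiers are illustrative and change over time (check openrouter.ai for current names), and you would need to supply your own API key.

```python
# A minimal sketch of testing multiple models with the same prompt via
# OpenRouter. Model names are illustrative; check openrouter.ai for
# current identifiers. Requires: pip install openai
from openai import OpenAI

client = OpenAI(
    base_url="https://openrouter.ai/api/v1",
    api_key="YOUR_OPENROUTER_API_KEY",  # signup (and funds, for premium models) required
)

PROMPT = "Summarize the current evidence on the health of dogs fed vegan diets."
MODELS = ["openai/gpt-4o", "google/gemini-2.0-flash-001", "perplexity/sonar"]

for model in MODELS:
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": PROMPT}],
    )
    print(f"--- {model} ---")
    print(response.choices[0].message.content[:500])  # preview first 500 characters
```

Outputs from such a script could then be compared on the same characteristics used in Table 2 (length, number of references, quality of sources, and so on).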