Bias In, Bias Out: The responsibility of researchers to use AI with caution to avoid promoting discrimination or damaging the public's faith in science

Thoughts and views written in this blog post reflect those of the author only, and not necessarily those of every SNAP member or the SNAP coalition as a whole.

Figure 2 from the now-retracted paper “Cellular functions of spermatogonial stem cells in relation to JAK/STAT signaling pathway”, edited to include the word “RETRACTED”. https://www.frontiersin.org/journals/cell-and-developmental-biology/articles/10.3389/fcell.2024.1386861/full

The use of AI in scientific research

In 2024, the Nobel Prize in Chemistry was awarded in part to the developers of AlphaFold, an AI system that predicts the 3D structure of proteins from their amino acid sequences [1]. Clearly, AI tools have the potential to assist research and medicine in many ways. AI has been used to detect clinically relevant abnormalities in medical imaging [2,3], predict the effect of genetic variants [4], and detect bacterial outbreaks in hospitals [5,6]. These are just a few of the many possible uses of AI in research and medicine. When used thoughtfully, these tools have the potential to improve public health, make research more efficient, and enable types of research that previously may not have been possible. However, despite AI’s potential to streamline research, its use comes with important ethical considerations and limitations. When used carelessly or unethically, AI has the potential to exacerbate existing health disparities, promote prejudiced beliefs, lead to the misdiagnosis or mistreatment of patients, or damage public trust in science.

AI models may be trained on unrepresentative data

One of the most pressing concerns with the use of AI in research is its tendency to propagate the biases present in training data. There are multiple points in the training process at which bias can be introduced [7]. To begin with, many databases of genetic or healthcare data used for training AI systems overrepresent healthy, middle-to-high income, and European-ancestry individuals, likely due to differences in access to healthcare and socio-economic resources as well as structural racism [7–10]. AI models trained on unrepresentative data produce outputs that may not be applicable to all groups, leading to inequitable treatment and research findings. For example, an analysis of three skin-cancer-detecting AI models (all trained primarily on images of pale-skinned individuals) found that these models were significantly worse at detecting cancer from images of dark-skinned individuals [11,12]. If these models were used in clinical practice, they could disastrously delay cancer diagnosis among dark-skinned individuals. There is a clear parallel in a research setting: the use of AI tools that are trained on sources with insufficient data from certain groups might result in research that is less applicable to those groups.
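One practical way to catch this kind of gap is to report a model's performance separately for each group rather than as a single overall number. The Python sketch below is only an illustration of that idea: the function name, the group labels, and the records are invented for this example and are not data from the studies cited above.

```python
from collections import defaultdict


def sensitivity_by_group(records):
    """Return {group: sensitivity} from (group, has_cancer, flagged) records.

    Sensitivity is the share of true cancer cases that the model flagged.
    """
    true_positives = defaultdict(int)
    cancer_cases = defaultdict(int)
    for group, has_cancer, flagged in records:
        if has_cancer:
            cancer_cases[group] += 1
            if flagged:
                true_positives[group] += 1
    return {g: true_positives[g] / cancer_cases[g] for g in cancer_cases}


# Invented toy records (hypothetical, NOT taken from any cited study).
toy_records = (
    [("lighter", True, True)] * 9      # cancers the model caught
    + [("lighter", True, False)] * 1   # cancers the model missed
    + [("darker", True, True)] * 6
    + [("darker", True, False)] * 4
    + [("lighter", False, False)] * 10  # healthy cases, correctly not flagged
    + [("darker", False, False)] * 10
)

rates = sensitivity_by_group(toy_records)

# The pooled figure hides the gap between the two groups.
caught = sum(1 for _, has_cancer, flagged in toy_records if has_cancer and flagged)
total_cancers = sum(1 for _, has_cancer, _ in toy_records if has_cancer)
overall = caught / total_cancers

print("overall sensitivity:", overall)
print("sensitivity by group:", rates)
```

In this toy data the pooled sensitivity is 0.75, which might look acceptable, while the per-group breakdown shows the model catching 9 of 10 cancers in one group and only 6 of 10 in the other.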

Biased Input = Biased Output

Training data itself may also reflect the results of systemic discrimination. One study found that an algorithm designed to assess patients’ health risk underestimated disease burden in individuals who identified as Black compared to individuals who identified as white¹ [13,14]. The researchers determined that this was due to the algorithm’s use of healthcare spending as a proxy for disease severity. However, less money is often spent on Black patients with a given disease burden than on white patients, which may reflect barriers to healthcare access or physician bias, such as the tendency to underestimate and undertreat pain in Black patients [13,15]. The algorithm is therefore flawed: it doesn’t account for the effects of systemic racism on healthcare spending, and as a result it leads to insufficient treatment for Black patients [13]. Researchers relying on this algorithm’s biased and inaccurate summary of disease status could conceivably conclude that Black individuals with high cholesterol, or high levels of other biomarkers, are healthier than white individuals with the same levels.
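The mechanism is easy to reproduce in a toy example. The Python sketch below assumes two hypothetical groups, A and B, with identical levels of illness, and assumes that 30% less is spent on group B at every level of illness. The function name, the field names, and the 30% gap are all invented for illustration; this is not the algorithm or the data from the study described above.

```python
def select_for_extra_care(patients, score_key, n_slots):
    """Return the n_slots patients with the highest value of score_key."""
    ranked = sorted(patients, key=lambda p: p[score_key], reverse=True)
    return ranked[:n_slots]


def count_group(selected, group):
    """Count how many selected patients belong to the given group."""
    return sum(1 for p in selected if p["group"] == group)


# Hypothetical patients: both groups have the same spread of illness
# (1 to 5 chronic conditions), but spending on group B is assumed to be
# 30% lower at every level of illness. All numbers are invented.
patients = []
for conditions in range(1, 6):
    patients.append(
        {"group": "A", "conditions": conditions, "spending": conditions * 1000}
    )
    patients.append(
        {"group": "B", "conditions": conditions, "spending": conditions * 700}
    )

# Rank by the proxy (spending) versus by actual illness (conditions).
by_spending = select_for_extra_care(patients, "spending", n_slots=4)
by_need = select_for_extra_care(patients, "conditions", n_slots=4)

print("group B patients selected using spending:", count_group(by_spending, "B"))
print("group B patients selected using illness: ", count_group(by_need, "B"))
```

With four slots available, ranking by illness selects two patients from each group, while ranking by spending selects three from group A and only one from group B, even though the two groups are equally sick by construction.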

The larger problem with Large Language Models

Large language models (LLMs) may be especially susceptible to misinformation and bias because they are trained largely on internet data, which is often not verified by subject matter experts [16,17]. One such LLM, ChatGPT, was primarily trained on data from the CommonCrawl database, which contains information from billions of websites [16,18]. The creators of ChatGPT made an effort to pre-filter this database to exclude information that a classifier flagged as low quality [16], but the efficacy of this method in excluding biased or unreliable data remains unclear. Indeed, one study found that four large language models (ChatGPT, Claude, GPT-4, and Bard) parrot unfounded racist pseudoscientific beliefs, such as the false idea that Black individuals have higher pain thresholds than white individuals [19,20]. Additionally, a report from UNESCO found that some LLMs (GPT-2 and Llama-2) make sexist or homophobic statements [21]. Another study found that four LLMs (ChatGPT, Claude, Gemini, and NewMes-15) proposed inferior treatments for mental health conditions such as schizophrenia or depression when they were informed that a patient was African American [20]. This raises concerns that the use of these models in patient care could result in racially biased treatment plans. Unlike traditional algorithms used in medicine and research, for which users typically know all the inputs and covariates, AI algorithms often act as a “black box” in which it is not clear exactly what information is used to determine the output. Since users may not know exactly what the algorithm is doing, it is difficult to recognize the effects of bias, let alone correct for them. Outside of clinical practice, these models could also offer researchers incorrect and biased information that misguides study design or interpretation. Additionally, if these models are used for writing purposes, they might present information in biased ways.
The inclusion of biased language or misinformation in scientific papers has the potential to cause very real harm, such as being used to justify discriminatory beliefs or policies, regardless of the author’s intentions.

AI use can damage the public’s faith in science

Beyond the potential harms caused by bias or by the misdiagnosis or mistreatment of patients, careless use of AI tools has the potential to harm research and public health by damaging the public’s faith in science. For example, a now-retracted paper was published in Frontiers in Cell and Developmental Biology with bizarre, inaccurate, and obviously AI-generated figures [22,23], calling the quality of the paper and the peer review process into question. In this case, it may be fair to suggest that the reviewers and editors are partly at fault for failing to flag this paper. In other cases, inappropriate AI use may be more difficult to detect but equally damaging to the quality of published work. A recent study found that approximately 1 in every 458 papers published in 2025 contained at least one reference to a paper that doesn’t actually exist, likely due to the use of AI tools [24,25]. Since these cited papers are not real, the information attributed to them may not be correct, and even when it is, these hallucinated citations will cause issues for future researchers looking to understand the source of an idea. They also call into question the integrity of the papers in which they appear, since the authors seemingly weren’t engaging in careful, source-based writing. These publications hinder scientific progress by clogging up journals with possibly incorrect information that could lead future researchers astray. Furthermore, when questionable or inaccurate information like this is published in reputable peer-reviewed journals, it sends the message that scientists don’t care about the quality of work that gets published, and that published studies are inaccurate or irrelevant. This is especially concerning at a time when distrust of scientists and the spread of pseudoscientific misinformation by public figures are becoming more widespread [26–28].
Now more than ever, we as scientists need to be extremely careful in the work we do and in how we convey it to the public, in order to preserve the reputation of science and promote evidence-based policy.

Current policies surrounding the use of AI in research

Unfortunately, the rush to incorporate AI tools into every aspect of grant writing, research design, code development, analysis, and editing appears to be outpacing institutional and governmental policies regulating their use. Laws regulating the use of AI differ by state [29] and policies differ by journal [30]. Many journals agree that the use of generative AI should be disclosed in papers [30], but there are points of disagreement as well. For instance, the publisher Sage Journals permits the use of AI to produce images such as illustrations [31], whereas Nature Journals prohibits the inclusion of AI-generated images with a few limited exceptions [32]. The NIH also has policies regulating AI use, but these policies are fairly limited. The NIH prohibits entering confidential patient or subject information into LLMs, since doing so violates pre-existing subject privacy policies, and it bans the use of AI in peer review [33]. However, when it comes to the use of AI in producing research applications, its policies seem somewhat nebulous, saying only that it will not consider applications that are “substantially developed by AI” [34].

Conclusion: What should researchers do to mitigate issues?

Hopefully, in time, the NIH, journals, and academic institutions will develop more detailed policies outlining acceptable and unacceptable uses of AI. In the meantime, researchers have to hold themselves accountable. In my opinion, the following practices should be considered the bare minimum of ethical AI use:

Any work done by AI that a human author would receive credit for should be clearly cited. For example, any text or figure generated by AI should be labeled as such. Using AI to write without citing it is similar to plagiarism because it involves falsely passing off work as one’s own.
Authors should remember that they are responsible for any errors or biased information due to AI use, and should triple-check any AI-generated content, including code, figures, and text. Researchers should be aware that many AI models can produce output that is racist, sexist, or homophobic [19,21], and should be wary of subtle ways this bias may creep into AI models’ language, recommendations, or interpretation of results.

AI tools have the potential to benefit scientific research and public health, but these tools also have their limitations. Researchers should ensure that they are using these tools conscientiously, for the good of all, while policymakers work towards developing robust regulation surrounding the use of AI.

Recognition:

Sol Taylor-Brill is a PhD candidate in Molecular, Cellular, Developmental Biology, & Genetics at the University of Minnesota with a focus on statistical and population genetics.

Special thanks to fellow SNAP members who provided feedback on this article: Mikayla Smith-Craven, who received her doctoral degree in Pharmaceutical Chemistry from the University of Kansas, where she focused on drug development and delivery; Andrew Mattson, a physics PhD student developing quantum technologies for dark matter detection, gravitational wave observation, and life science/medical applications, who also serves as President of the Science Policy and Diplomacy Group at Johns Hopkins; and Liam Russell, a Molecular and Cellular Biophysics PhD candidate at the University of Denver studying epithelial remodeling during embryonic development, who is also a current science communication intern for the Society for Developmental Biology and a STEM writing consultant in the DU writing center.

Footnotes:

¹ I have capitalized Black but not white in alignment with guidance from the Associated Press.