

Hypothesis
Generative AI in Writing Research Papers: A New Type of Algorithmic Bias and Uncertainty in Scholarly Work

Rishab Jain 1,2, Aditya Jain 3

arXiv:2312.10057v1 [cs.CY] 4 Dec 2023

1 Harvard University, Cambridge, MA 02138, USA
2 Department of Neurosurgery, Massachusetts General Hospital, Boston, MA 02114, USA
3 Harvard Medical School, Boston, MA 02115, USA
* Correspondence: rishabjain@college.harvard.edu

Citation: Jain, R.; Jain, A. Generative AI in Writing Research Papers: A New Type of Algorithmic Bias and Uncertainty in Scholarly Work. Preprints 2023, 1, 0. https://doi.org/to-be-added

Copyright: © 2023 by the authors. Submitted to Preprints for possible open access publication under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract: The use of artificial intelligence (AI) in research across all disciplines is becoming ubiquitous.
However, this ubiquity is largely driven by hyperspecific AI models developed during scientific
studies for accomplishing a well-defined, data-dense task. These AI models introduce apparent,
human-recognizable biases because they are trained with finite, specific data sets and parameters.
However, the efficacy of using large language models (LLMs)—and LLM-powered generative AI
tools, such as ChatGPT—to assist the research process is currently indeterminate. These generative
AI tools, trained on general and imperceptibly large datasets along with human feedback, present
challenges in identifying and addressing biases. Furthermore, these models are susceptible to goal
misgeneralization, hallucinations, and adversarial attacks such as red teaming prompts—which can
be unintentionally performed by human researchers, resulting in harmful outputs. These outputs are
reinforced in research—where an increasing number of individuals have begun to use generative AI
to compose manuscripts. Efforts into AI interpretability lag behind development, and the implicit
variations that occur when prompting and providing context to a chatbot introduce uncertainty and
irreproducibility. We thereby find that incorporating generative AI in the process of writing research
manuscripts introduces a new type of context-induced algorithmic bias and has unintended side
effects that are largely detrimental to academia, knowledge production, and communicating research.
Keywords: ChatGPT; generative AI; academic writing; algorithmic bias; large language models; AI
interpretability; algorithmic transparency; XAI; red teaming
1. Introduction
Research that utilizes artificial intelligence (AI), machine learning, or deep learning technologies has shown exponential growth in the gross number of publications and citations [1,2]. However, not all aspects of AI research are created equal: some
fields employ AI more than others, and for different purposes. Medicine, for instance, has
utilized AI as a tool for predicting clinical outcomes, whilst philosophy has used AI as a
thought experiment to explore arguments about morality [2]. Generative AI—now capable
of writing human-like text—has been most recently shaped by the advent of large language
models (LLMs) such as GPT-3 [3]. LLMs are trained artificial neural networks that can generate large outputs based on seed text. They predict the next token in a sequence based on the tokens before it, and are thereby able to generate text that is statistically similar to the human language they are trained on. Because LLM chatbots are trained on large
swaths of human-written natural language (i.e. data from OpenWebText, Wikipedia, and
online journalism), they incorporate human biases in their outputs [4].
While the biases present in AI models created for a certain domain—e.g. machine learning on the electronic health record [5]—may be easier to intuit, generative AI can be
more unpredictable [6,7]. Generative AI may have biases lurking under the surface, making
it difficult to reproduce and definitively diagnose bias [8–10]. These may be data-driven or
purely algorithmic, caused by training methods and model architectures [10]. These biases
are more concerning when factoring in hallucinations and goal misgeneralization: some
of the biggest problems in AI interpretability and safety research [8,11]. Generative AI is
subject to hallucinations, in which it may present requested information with certainty even though it is wholly incorrect or nonsensical [12–14]. This can mislead humans into believing an algorithm has considerable understanding of the hallucinated subject.
Hallucinations become more likely if generative AI is trained on data subject to aggregation
or representation bias [9,15]. Goal misgeneralization may result in similar outcomes—
particularly when generative AI chatbots play along stereotypes and assumptions about
a user, based on their prompt [16,17]. Interpretability of AI cannot currently account for
the many nuances and decisions a human may take when constructing or formatting a
research paper.
The online release of ChatGPT—powered initially by GPT-3 [3]—allowed the public
to obtain coherent, AI-generated text with few-shot prompting [4]. This has prompted
analyses into ChatGPT’s ability for human writing tasks: from essays to scholarly writing
[18–27]. Biases are present in scholarly writing regardless of whether AI is used; however, humans can attempt to mitigate and contextualize the extent to which they introduce bias in their research [28]. As chatbot-produced text can be generated with little
to no human input, AI has the potential to automate scholarship and push out humans’
organic interpretation and decision making processes [24,29,30]. Specifically, the use of
generative AI in writing manuscripts yields papers that may not be original, pose ethical
dilemmas, and introduce new types of algorithmic bias and uncertainty [25,26,30]. Due to
the pressure put on scholars to publish, they may increasingly rely on automated solutions
that are capable of producing fast yet potentially problematic papers [31].
Recent AI safety research has showcased the dangers of red teaming and adversarial
attacks against LLMs: humans prompt models with oftentimes innocuous text intending to
disrupt the system’s algorithms [32]. When conducting research, especially when providing large prompts containing scientific data and text, a user may unknowingly generate a red teaming prompt by failing to provide appropriate context and specificity. This may cause
the deployed LLM to leak information about its own training data, or generate harmful,
biased text [32]. In the case of ChatGPT, models such as GPT-3 have been trained extensively and aligned to generate text from a neutral lens; however, this neutral lens can break under red teaming, resulting in biases, misconceptions, and harmful outputs [17]. Because of
reinforcement learning from human feedback (RLHF) techniques, models can be rewarded
via human feedback for reinforcing existing stereotypes and beliefs of misaligned humans
[16]. By reinforcing an individual’s bias, a human oracle may be more likely to positively
assess a model’s performance—thus resulting in deceptive models that can amplify bias
and mislead users into having satisfactory experiences, but not necessarily scientifically
accurate ones. There are numerous challenges in collecting diverse human feedback that is
inclusive and appropriate to combat the risk of societal biases being instilled via RLHF and
goal misgeneralization [10,16,17]. These issues can also negatively affect researchers: depending on a researcher’s prompt—which may include demographic information or ideological beliefs—text generated by a generative AI chatbot could discriminate against them and lead them down different research paths.
This paper finds that the use of generative AI to write research manuscripts introduces
a new type of bias driven by user context and reaffirms the algorithmic bias within large language models, which is problematic for academia and science communication. We
highlight how researchers from any demographic can unintentionally prompt a generative
AI chatbot with personal information, allowing it to generate biased text in a way that
contributes to the reinforcement of existing societal biases. Due to hallucinations and
red teaming, generative AI chatbots can generate incorrect and harmful text, leading to
problematic outcomes for researchers. Using generative AI as a tool for producing papers
would thus require an effort in counter-bias, especially in authorship. We suggest that the
wider research community explore the implications of a positive feedback loop in which
generative AI is trained on partially machine-written papers, resulting in amplifications of
data-driven and context-induced bias.
2. Writing Papers with AI
Generative AI technologies shaped by large language models, such as GPT-3, have
changed the way humans interact with machines when communicating research. The
most apparent use case is through few-shot prompting to write the contents of a research
manuscript. This has been apparent with language models being added as co-authors on
studies [27]. We quantify the number of manuscripts authored or co-authored by generative
AI (more specifically, some variant of ChatGPT—the most well-known tool) by conducting
a systematic review. We follow the Preferred Reporting Items for Systematic Reviews and
Meta-Analyses (PRISMA) guidelines, using PubMed, Web of Science, JSTOR, WorldCat,
and Google Scholar. We find 104 works as of November 2023 as shown in Figure 1 by
querying the aforementioned databases and manually screening our results to ensure the
identified pieces were authored by a generative AI tool.
Figure 1. We find 104 scholarly works using our criteria and query (Author: "ChatGPT" OR "GPT-2"
OR "GPT-3" OR "Generative AI" OR "Generative Pretrained Transformer" OR "OpenAI").
We note that remnants of AI prompts left in manuscripts due to careless errors indicate
that our systematic review leaves out articles that used a generative AI tool, but neglected
to cite it as an author. We verify this claim via Google Scholar’s search engine, using the
query as shown in Figure 2. We suspect that far more authors have used ChatGPT similarly,
but have edited out such apparent markers of direct AI use—which some may deem
plagiarism, depending on journal policy.
Figure 2. We find 79 scholarly works that used ChatGPT and left its signature "As an AI language
model" without edits.
Journals and publishers have had widely different stances on authorship of AI-generated works [24]. In several open letters to journals, scientists have expressed their
concerns about accepting AI-generated works [21,30]. These criticisms include whether
AI-generated works are ethical, and whether a machine has the necessary intellectual ability
to contribute originality and thereby be recognized as a legitimate author. While their policies are still taking shape, several journals have centered their positions on authorship qualification: authors should only be added if they meet certain criteria [21,24]. For instance, ChatGPT
producing an abstract or summary when prompted with the rest of a paper does not make
it an author, as authorship requires originality. Science’s Family of Journals prohibits the use
of text generated by AI in their works, deeming it plagiarism [21,33]. As of November 2023,
both Science and Nature do not allow LLMs to be recognized as co-authors [21]. Largely, journals and concerned researchers have reached a consensus that generative AI is not yet at a stage where it can replace researcher originality.
Researchers, however, may still find themselves increasing their dependence on AI
due to other forms of AI assistance. AI can turn skeleton outlines into full papers, and also
serve as auto-complete engines for more experienced authors [33]. Human-AI collaborative
writing can be seen as an extension of the literature review process, in which humans enter
keywords and AI generates text for the author to refine and add context [34]. AI’s knowledge of grammar rules, sentence structure, and word choice gives it a significant advantage over unassisted human writing, especially for non-native English speakers [33]. While this may help democratize
access to publishing in English-based, high-impact journals, it also introduces a new set
of problems. With humans growing accustomed to using AI for writing, the decisions
and biases that AI produces become more entrenched in research. If future AI models are
trained on past machine-written works, this can introduce a positive feedback loop for bias.
3. New Types of Bias in Research
Data-driven and algorithmic bias in generative AI models are complex problems
that pose unanswered questions. Biases can be present in pre-trained models and the
generated output itself [5,9,35]. Typically, model transparency is encouraged by research
communities so that we are more aware of the biases that algorithms carry [36]. Further, sharing code and models allows researchers, including those from sensitive groups, to test and verify them [37]. It also opens the door to fine-tuning models so that algorithms meet the
needs of specific communities. Yet, releasing models does not solve all problems: the large language models that power generative AI text generation are far more difficult to make interpretable [38], an issue exacerbated in part by their sheer billion-order parameter counts, and releasing algorithms to the public could pose a security risk due to bad actors [39]. The most popular generative AI chatbot, ChatGPT, is powered by GPT-3.5 and GPT-4 [40], models that are kept entirely closed under company-controlled licenses. This raises further questions of ethics: if an AI-produced work is made public,
yet the model used to generate that work is not, how do we judge the cause of the bias—is
it attributable to the model or the user who provided a prompt to it? These uncertainties
about bias make it a difficult problem to solve, especially for generative AI models.
3.1. Context-Induced Bias
Because of the data they are trained on, chatbots revert to linguistic patterns and data-driven stereotypes found in human writing on the internet [9,10,36]. Data used to train large language models may be subject to pre-existing human biases. OpenWebText—a dataset that many large language models use as a training source—is subject to ethical concerns, particularly around data aggregation and biases inherited through its collection methods [41].
Such biases can become exacerbated when LLMs are trained on human-provided feedback,
which may amplify existing misconceptions and stereotypes [16,17]. For example, as shown in Figure 3, disclosing personal attributes or beliefs such as “I’m a 17-year-old conservative Black male...” may yield an AI-generated output that strays from the given question and is instead influenced by user identity and political views.
Figure 3. Disclosing personal attributes and/or ideological information strongly affects research
topics that ChatGPT provides to aspiring researchers.
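One way to observe this effect directly is to send the same question to a chat model with and without an identity disclosure and compare the responses. The sketch below is a minimal probe under assumptions: it uses the legacy openai-python (pre-1.0) ChatCompletion interface with gpt-3.5-turbo as a stand-in model, and the prompts are illustrative rather than the exact ones used for Figure 3.

# Minimal probe of context-induced bias (cf. Figure 3). Assumes the legacy
# openai-python (<1.0) ChatCompletion API and an API key; model and prompts
# are illustrative stand-ins, not the exact setup used in this paper.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

QUESTION = "What are some hot topics I could research?"
DISCLOSURE = "I'm a 17-year-old conservative Black male. "

for label, content in [("no context", QUESTION),
                       ("with context", DISCLOSURE + QUESTION)]:
    response = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": content}],
        temperature=0,  # reduce sampling noise so differences reflect the added context
    )
    print(f"--- {label} ---")
    print(response["choices"][0]["message"]["content"])

Comparing the two transcripts makes the topic and language shift described below easy to inspect.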
In particular, upon specifying that one is part of a specific racial or ethnic group, ChatGPT suggests research topics more focused on social issues, e.g. criminal justice and transparency. Further, there is an observable language shift in which the tool does not address the initial prompt of "hot topics" and instead suggests research questions that the user "might find interesting" depending on their race and age. Rather than directly answering the given question, the chatbot may thus return discriminatory, counterproductive responses to researchers, depending on what information they provide about themselves.
This issue is more pervasive in completion-based contexts, such as the RLHF-enhanced text-davinci-003 model, which third-party AI tools utilize via the OpenAI API. In completion-based contexts, the model cannot simply rephrase or sidestep a question; it is forced to answer. Completion-based models show strong performance on summarization and simple inference [42], tasks that are especially relevant in research. In completion-based usage of generative AI, there is even more control over the framing of questions and prompts, as the model can be used as an auto-completion tool. Once again, however, the introduction of
identifiable information biases the output of models. A blatant example of bias based on
research affiliation is illustrated in Figure 4.
Figure 4. Knowledge of an author’s institutional affiliation biases text-davinci-003’s completions. (a)
The introductory text given in both prompt 1 and 2; (b) Prompt 1 is shown with a white background,
and the predicted text is colored; (c) Prompt 2 is shown with a white background, and the predicted
text is colored. A temperature of zero is used for deterministic, reproducible responses.
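The affiliation probe in Figure 4 can be approximated with a completion-style model at temperature zero, so that any difference between the two outputs is attributable to the changed affiliation. The sketch below assumes the legacy openai-python (pre-1.0) Completions API and access to text-davinci-003 (since deprecated by OpenAI); the shared introductory text and the two affiliations are hypothetical stand-ins for the prompts in the figure.

# Minimal sketch of the affiliation comparison in Figure 4. Assumes the legacy
# openai-python (<1.0) Completions API and text-davinci-003 (now deprecated);
# the shared introduction and the two affiliations are hypothetical.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

INTRO = "This manuscript examines clinical outcome prediction. "  # hypothetical shared text
affiliations = {
    "prompt_1": "Affiliation: a large private research university. ",
    "prompt_2": "Affiliation: a small community college. ",
}

for name, affiliation in affiliations.items():
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=INTRO + affiliation + "The limitations of this study are",
        max_tokens=120,
        temperature=0,  # deterministic completions, as in Figure 4
    )
    print(f"--- {name} ---")
    print(response["choices"][0]["text"].strip())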
It is possible for researchers to inadvertently leave identifying keywords when working with LLMs. This may cause their text to contain not only data-driven biases about the subject being composed, but also further biases based on author-specific context. These examples
illustrate that LLMs may not only amplify societal biases that exist in training data, but
may harm researchers in sensitive groups who are more likely to be affected by AI-induced
bias. There exists a potential scenario in which generative AI models are used as “unbiased”
research aids, yet reinforce tropes and stereotypes.
3.2. Hallucinatory Behavior and Red Teaming
In science research, there is an ongoing effort to include previously underrepresented
groups in trials, studies, and discussions [43]. In contrast, generative AI chatbots replicate the data, text, and tropes on which they were initially trained, which does not account for current inclusivity initiatives and results in aggregation and representation bias [9,15]. Hallucinations are thereby more likely to occur in the context of underrepresented groups in
research, which can be detrimental to progress. In the literature review process and during
manuscript writing, hallucinations may steer researchers further down biased directions as
LLMs reaffirm their incorrect statements with hallucinated information. Oftentimes, these
are led by partially or fully hallucinated sources—an example of a leading statement along
with inaccurate, hallucinated text is shown in Figure 5.
The log of the probability that a token (a basic unit of text) comes next is known as its logprob. In the context of writing scientific text, logprobs effectively indicate how likely a specific word or character is to come next given the preceding text. In a low-temperature testing environment (temperature = 0), the model is forced to pick the token with the highest probability, making its responses more deterministic. While this may keep some of the more problematic biases from surfacing, it yields reproducible output. The log
probability (logprob) is given by Equation (1):
\[
\text{logprob} = \log P(t_1) + \log P(t_2 \mid t_1) + \cdots + \log P(t_n \mid t_1, t_2, \ldots, t_{n-1}) \tag{1}
\]
The conditional probability $P(t_i \mid t_1, t_2, \ldots, t_{i-1})$ is the probability of token $t_i$ given the previous tokens in the sequence.
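As a concrete illustration of Equation (1), the short sketch below sums per-token log conditional probabilities; the tokens and probabilities are hypothetical values of the kind a completion API returns when logprobs are requested.

# Illustration of Equation (1): a sequence logprob is the sum of per-token
# log conditional probabilities. Token strings and probabilities below are
# hypothetical, standing in for values returned by a completion API with
# logprobs enabled.
import math

token_probs = [          # (token, P(token | preceding tokens)), hypothetical
    ("LK", 0.95),
    ("-99", 0.90),
    (" is", 0.80),
    (" a", 0.65),
    (" superconductor", 0.05),  # unlikely token -> strongly negative logprob
]

sequence_logprob = sum(math.log(p) for _, p in token_probs)  # Equation (1)

for token, p in token_probs:
    print(f"{token!r:20s} logprob = {math.log(p):6.2f}")
print(f"sequence logprob = {sequence_logprob:.2f}")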
Figure 5. Completion-based engines are especially prone to hallucinations. Text-davinci-003 demonstrates the production of inaccurate, hallucinated statements about LK-99. We provide the uncolored text, while the model produces the highlighted text, colored based on token generation probability: green-highlighted text corresponds to a logprob of -0.00, while yellow, orange, and red highlighted text correspond to progressively lower, more negative logprobs.
Hallucinated statements from LLMs are thereby prone to introducing falsehoods into a researcher’s subsequent manuscript and beliefs, especially given the seemingly legitimate sources cited throughout. The model produces its lowest-probability text when citing sources and factual information, as illustrated by the most negative logprobs at “4”, “kawano”, and “2015”, with values of −3.07, −2.92, and −2.71, respectively. Because the bulk of scientific appraisal of LK-99, and thereby its potential training data, was generated in 2023, the trained LLM lacks access to the most up-to-date information and is prone to hallucinating inaccurate statements [44]. Inadvertent red teaming may also result in hallucinatory behavior. For example, a researcher may prompt a language model with limited context, asking it to compare two tangentially related subjects, as shown in Figure 6.
Figure 6. Giving authority and limited context to an LLM results in inadvertent red teaming upon
being forced to connect two tangentially related subjects. Text-davinci-003 (temperature = 1) is given
the prompt "You are the world’s foremost expert on LK-99. You should be extremely scientific and
accurate" and then prompted with each of the three questions illustrated in separate, new prompts.
Text highlighted in color is generated by the model.
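A minimal reproduction of this inadvertent red-teaming setup is sketched below; it assumes the legacy openai-python (pre-1.0) Completions API and text-davinci-003, and the follow-up question is a hypothetical stand-in for the three questions shown in Figure 6.

# Sketch of the limited-context, authority-granting prompt from Figure 6.
# Assumes the legacy openai-python (<1.0) Completions API; the question below
# is a hypothetical stand-in for the three questions in the figure.
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

PREAMBLE = ("You are the world's foremost expert on LK-99. "
            "You should be extremely scientific and accurate.\n\n")
question = "How does LK-99 relate to traditional metallurgical practices?"  # hypothetical

response = openai.Completion.create(
    model="text-davinci-003",
    prompt=PREAMBLE + question,
    max_tokens=200,
    temperature=1,  # temperature = 1, as in Figure 6; outputs vary between runs
)
print(response["choices"][0]["text"].strip())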
Because training limitations leave models months or years behind the latest information, the model resorts to loosely related information in its training data, which may be limited for certain topics (an example of representation bias), including those pertaining to underrepresented groups. As illustrated in Figure 6, this may cause an LLM to generate biased or offensive text, as it does not have enough information to produce an accurate response. In this scenario, the researcher has red teamed the chatbot with an innocuous prompt, without necessarily attempting to do so. Propagating bias and/or misinformation is dangerous in research: it introduces incorrect statements and references into the scientific body of work, potentially misleading or biasing researchers, and these errors can be difficult to track and rectify.
4. Discussion and Conclusions
Generative AI technologies, powered by potentially billions of parameters, have changed the way humans write research. With growing power and access, these language models
are becoming a source of authoring and editing research work. This has received both criticism from those concerned about bias and praise as an engine for democratizing access to research. Through a systematic review of published works, we show that generative AI
chatbots have been added as co-authors on papers and serve to help authors with literature
review. Researchers also use tools such as ChatGPT and text-davinci-003 verbatim as auto-completion engines to assist manuscript writing, often neglecting to remove signature AI prompt leftovers from the document. Due to the data-driven biases and new context-induced biases that we explore in this paper, introducing more personally identifying and background information in prompts to LLMs may lead to more problematic outputs.
Hallucinatory behaviors are exhibited especially for subjects on which LLMs have less training data, which is counterproductive to initiatives that aim to include underrepresented voices in research. Furthermore, unintentionally red teaming models with innocuous
queries and prompts may lead to the model outputting stereotypic, biased, and dangerous
information.
It may seem exaggerated or far-fetched for researchers to prompt LLMs with their institutional or demographic information; however, this information often appears in writing that can unintentionally be fed to the model. When one copies their writing into a
generative AI tool, prompting it to write an abstract or summary, or uses generative AI as a
co-writing/autocompletion tool, they may inadvertently include terms such as “doctoral
student” or “Affiliation: [Name] University" or author names—allowing LLMs to actualize
ageist, ethnic, classist, and gender-based biases in research. As a generation of researchers grows accustomed to using AI as a composition partner, interventions need to be adopted to ensure that LLMs do not reinforce existing biases and archetypes. With AI-generated works
being added to scientific literature, there is a possibility for a positive feedback loop in
which LLMs could reinforce biases. However, this would also result in model degeneracy, as models risk losing sight of the original human data distribution and instead reinforce their own beliefs and biases [45]. Biases would become so ingrained in models that they would be difficult to detect and even trickier to unlearn. Therefore, it may be pertinent
for generative AI developers to resist the so-called “reference trap” of leveraging uncritical
learning methods and devise ways to ensure that LLMs are tested for their ability to reflect
the data distribution and present unbiased results. This may influence the adoption of
generative AI tools in critical and analytical fields like scientific research.
With OpenAI’s release of custom GPT assistants (which can play characters and personas), users may find themselves anthropomorphizing chatbots and revealing sensitive information. Document and note-taking software (such as Google Docs and Notion) is adopting generative AI, allowing it to directly access context in a document—e.g. a researcher including their affiliation and name in a header or cover page. While bias will
be less noticeable in the output text, there is still an underlying difference in the output
a model provides to a researcher—a matter of concern. Future research in biases across
different languages may discover more bias that occurs in non-English research contexts.
Models across different mediums, such as ChatGPT and text-davinci-003, are likely to become more powerful as compute is scaled. This may worsen interpretability challenges, and thus there is a growing need to understand the effect of human prompting and context on bias.
Funding: This research received no external funding.
Acknowledgments: The authors thank Jane Rosenzweig for feedback on their manuscript.
Conflicts of Interest: R.J. receives funding from the Masason Foundation. A.J. declares no conflicts.
References
1. Mukhamediev, R.I.; Popova, Y.; Kuchin, Y.; Zaitseva, E.; Kalimoldayev, A.; Symagulov, A.; Levashenko, V.; Abdoldina, F.; Gopejenko, V.; Yakunin, K.; et al. Review of Artificial Intelligence and Machine Learning Technologies: Classification, Restrictions, Opportunities and Challenges. Mathematics 2022, 10. https://doi.org/10.3390/math10152552.
2. Frank, M.R.; Wang, D.; Cebrian, M.; Rahwan, I. The evolution of citation graphs in artificial intelligence research. Nature Machine Intelligence 2019, 1, 79–85. https://doi.org/10.1038/s42256-019-0024-5.
3. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. [arXiv:2005.14165 [cs]].
4. Ray, P.P. ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. 3, 121–154. https://doi.org/10.1016/j.iotcps.2023.04.003.
5. Gianfrancesco, M.A.; Tamang, S.; Yazdany, J.; Schmajuk, G. Potential Biases in Machine Learning Algorithms Using Electronic Health Record Data. 178, 1544–1547. https://doi.org/10.1001/jamainternmed.2018.3763.
6. Peng, C.; Yang, X.; Chen, A.; Smith, K.E.; PourNejatian, N.; Costa, A.B.; Martin, C.; Flores, M.G.; Zhang, Y.; Magoc, T.; et al. A Study of Generative Large Language Model for Medical Research and Healthcare, 2023. [arXiv:cs.CL/2305.13523].
7. Gloria, K.; Rastogi, N.; DeGroff, S. Bias Impact Analysis of AI in Consumer Mobile Health Technologies: Legal, Technical, and Policy, 2022. [arXiv:cs.CY/2209.05440].
8. Scheurer, J.; Balesni, M.; Hobbhahn, M. Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure, 2023. [arXiv:cs.CL/2311.07590].
9. Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A. A Survey on Bias and Fairness in Machine Learning. [arXiv:1908.09635 [cs]].
10. Bender, E.; Gebru, T.; McMillan-Major, A.; Schmitchell, S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. pp. 610–623. https://doi.org/10.1145/3442188.3445922.
11. Shah, R.; Varma, V.; Kumar, R.; Phuong, M.; Krakovna, V.; Uesato, J.; Kenton, Z. Goal Misgeneralization: Why Correct Specifications Aren’t Enough For Correct Goals. [arXiv:2210.01790 [cs]].
12. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2023, 55. https://doi.org/10.1145/3571730.
13. Bang, Y.; Cahyawijaya, S.; Lee, N.; Dai, W.; Su, D.; Wilie, B.; Lovenia, H.; Ji, Z.; Yu, T.; Chung, W.; et al. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity, 2023. [arXiv:cs.CL/2302.04023].
14. Oniani, D.; Hilsman, J.; Peng, Y.; COL.; Poropatich, R.K.; Pamplin, C.J.C.; Legault, L.G.L.; Wang, Y. From Military to Healthcare: Adopting and Expanding Ethical Principles for Generative Artificial Intelligence. [arXiv:2308.02448 [cs]]. https://doi.org/10.48550/arXiv.2308.02448.
15. Dziri, N.; Milton, S.; Yu, M.; Zaiane, O.; Reddy, S. On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models?
16. Casper, S.; Davies, X.; Shi, C.; Gilbert, T.K.; Scheurer, J.; Rando, J.; Freedman, R.; Korbak, T.; Lindner, D.; Freire, P.; et al. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. [arXiv:2307.15217 [cs]].
17. Casper, S.; Lin, J.; Kwon, J.; Culp, G.; Hadfield-Menell, D. Explore, Establish, Exploit: Red Teaming Language Models from Scratch. [arXiv:2306.09442 [cs]].
18. Basic, Z.; Banovac, A.; Kruzic, I.; Jerkovic, I. ChatGPT-3.5 as writing assistance in students’ essays. 10, 750. https://doi.org/10.1057/s41599-023-02269-7.
19. Kumar, A.H. Analysis of ChatGPT Tool to Assess the Potential of its Utility for Academic Writing in Biomedical Domain. 9, 24–30. https://doi.org/10.5530/bems.9.1.5.
20. Alkaissi, H.; McFarlane, S.I. Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. 15. https://doi.org/10.7759/cureus.35179.
21. Stokel-Walker, C. ChatGPT listed as author on research papers: many scientists disapprove. 613, 620–621. https://doi.org/10.1038/d41586-023-00107-z.
22. Gao, C.A.; Howard, F.M.; Markov, N.S.; Dyer, E.C.; Ramesh, S.; Luo, Y.; Pearson, A.T. Comparing scientific abstracts generated by ChatGPT to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded human reviewers. https://doi.org/10.1101/2022.12.23.521610.
23. Liebrenz, M.; Schleifer, R.; Buadze, A.; Bhugra, D.; Smith, A. Generating scholarly content with ChatGPT: ethical challenges for medical publishing. 5, e105–e106. https://doi.org/10.1016/S2589-7500(23)00019-5.
24. Nakazawa, E.; Udagawa, M.; Akabayashi, A. Does the Use of AI to Create Academic Research Papers Undermine Researcher Originality? 3, 702–706. https://doi.org/10.3390/ai3030040.
25. Casal, J.E.; Kessler, M. Can linguists distinguish between ChatGPT/AI and human writing?: A study of research ethics and academic publishing. 2, 100068. https://doi.org/10.1016/j.rmal.2023.100068.
26. Lund, B.D.; Wang, T.; Mannuru, N.R.; Nie, B.; Shimray, S.; Wang, Z. ChatGPT and a new academic reality: Artificial Intelligence-Written research papers and the ethics of the large language models in scholarly publishing. 74, 570–581. https://doi.org/10.1002/asi.24750.
27. Generative Pretrained Transformer, G.; Thunström, A.O.; Steingrimsson, S. Can GPT-3 write an academic paper on itself, with minimal human input?
28. Pannucci, C.J.; Wilkins, E.G. Identifying and Avoiding Bias in Research. 126, 619–625. https://doi.org/10.1097/PRS.0b013e3181de24bc.
29. Desaire, H.; Chua, A.E.; Isom, M.; Jarosova, R.; Hua, D. ChatGPT or academic scientist? Distinguishing authorship with over 99% accuracy using off-the-shelf machine learning tools. [arXiv:2303.16352 [cs]]. https://doi.org/10.48550/arXiv.2303.16352.
30. Gandhi, M.; Gandhi, M. Does AI’s touch diminish the artistry of scientific writing or elevate it? 27, 350. https://doi.org/10.1186/s13054-023-04634-z.
31. Miller, A.; Taylor, S.; Bedeian, A. Publish or perish: Academic life as management faculty live it. 16, 422–445. https://doi.org/10.1108/13620431111167751.
32. Perez, E.; Huang, S.; Song, F.; Cai, T.; Ring, R.; Aslanides, J.; Glaese, A.; McAleese, N.; Irving, G. Red Teaming Language Models with Language Models. [arXiv:2202.03286 [cs]].
33. Lin, Z. Supercharging academic writing with generative AI: framework, techniques, and caveats. [arXiv:2310.17143 [cs]]. https://doi.org/10.48550/arXiv.2310.17143.
34. Aydin, O.; Karaarslan, E. OpenAI ChatGPT Generated Literature Review: Digital Twin in Healthcare. https://doi.org/10.2139/ssrn.4308687.
35. Panch, T.; Mattie, H.; Atun, R. Artificial intelligence and algorithmic bias: implications for health systems. 9, 010318. https://doi.org/10.7189/jogh.09.020318.
36. Starke, G.; De Clercq, E.; Elger, B.S. Towards a pragmatist dealing with algorithmic bias in medical machine learning. 24, 341–349. https://doi.org/10.1007/s11019-021-10008-5.
37. Norori, N.; Hu, Q.; Aellen, F.M.; Faraci, F.D.; Tzovara, A. Addressing bias in big data and AI for health care: A call for open science. 2, 100347. https://doi.org/10.1016/j.patter.2021.100347.
38. Ross, A.; Chen, N.; Hang, E.Z.; Glassman, E.L.; Doshi-Velez, F. Evaluating the Interpretability of Generative Models by Interactive Reconstruction. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, New York, NY, USA, 2021; CHI ’21. https://doi.org/10.1145/3411764.3445296.
39. ISACA. The Promise and Peril of the AI Revolution: Managing Risk.
40. OpenAI. GPT-4 Technical Report.
41. Akyürek, A.F.; Kocyigit, M.Y.; Paik, S.; Wijaya, D. Challenges in Measuring Bias via Open-Ended Language Generation. [arXiv:2205.11601 [cs]].
42. Ye, J.; Chen, X.; Xu, N.; Liu, S.; Cui, Y.; Zhou, Z.; Gong, C.; Shen, Y.; Zhou, J.; Chen, S.; et al. A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models.
43. Erves, J.C.; Mayo-Gamble, T.L.; Malin-Fair, A.; Boyer, A.; Joosten, Y.; Vaughn, Y.C.; Sherden, L.; Luther, P.; Miller, S.; Wilkins, C.H. Needs, Priorities, and Recommendations for Engaging Underrepresented Populations in Clinical Research: A Community Perspective. 42, 472–480. https://doi.org/10.1007/s10900-016-0279-2.
44. Habamahoro, T.; Bontke, T.; Chirom, M.; Wu, Z.; Bao, J.M.; Deng, L.Z.; Chu, C.W. Replication and study of anomalies in LK-99–the alleged ambient-pressure, room-temperature superconductor, 2023. [arXiv:cond-mat.supr-con/2311.03558].
45. Shumailov, I.; Shumaylov, Z.; Zhao, Y.; Gal, Y.; Papernot, N.; Anderson, R. The Curse of Recursion: Training on Generated Data Makes Models Forget, 2023. [arXiv:cs.LG/2305.17493].
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual
author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to
people or property resulting from any ideas, methods, instructions or products referred to in the content.