Generative AI in Writing Research Papers: A New Type of Algorithmic Bias and Uncertainty in Scholarly Work

Rishab Jain 1,2, Aditya Jain 3

1 Harvard University, Cambridge, MA 02138, USA
2 Department of Neurosurgery, Massachusetts General Hospital, Boston, MA 02114, USA
3 Harvard Medical School, Boston, MA 02115, USA
Correspondence: rishabjain@college.harvard.edu

arXiv:2312.10057v1 [cs.CY] 4 Dec 2023

Abstract: The use of artificial intelligence (AI) in research across all disciplines is becoming ubiquitous. However, this ubiquity is largely driven by hyperspecific AI models developed during scientific studies to accomplish well-defined, data-dense tasks. These AI models introduce apparent, human-recognizable biases because they are trained with finite, specific data sets and parameters. However, the efficacy of using large language models (LLMs)—and LLM-powered generative AI tools, such as ChatGPT—to assist the research process is currently indeterminate. These generative AI tools, trained on general and imperceptibly large datasets along with human feedback, present challenges in identifying and addressing biases. Furthermore, these models are susceptible to goal misgeneralization, hallucinations, and adversarial attacks such as red teaming prompts—which can be performed unintentionally by human researchers, resulting in harmful outputs. These outputs are reinforced in research, where an increasing number of individuals have begun to use generative AI to compose manuscripts. Efforts into AI interpretability lag behind development, and the implicit variations that occur when prompting and providing context to a chatbot introduce uncertainty and irreproducibility. We thereby find that incorporating generative AI in the process of writing research manuscripts introduces a new type of context-induced algorithmic bias and has unintended side effects that are largely detrimental to academia, knowledge production, and communicating research.

Keywords: ChatGPT; generative AI; academic writing; algorithmic bias; large language models; AI interpretability; algorithmic transparency; XAI; red teaming

1. Introduction

Research that utilizes artificial intelligence (AI), machine learning, or deep learning technologies has shown exponential growth in the gross number of publications and citations [1,2]. However, not all aspects of AI research are created equal: some fields employ AI more than others, and for different purposes. Medicine, for instance, has utilized AI as a tool for predicting clinical outcomes, whilst philosophy has used AI as a thought experiment to explore arguments about morality [2]. Generative AI—now capable of writing human-like text—has been most recently shaped by the advent of large language models (LLMs) such as GPT-3 [3]. LLMs are trained artificial neural networks that can generate large outputs based on seed text. They predict the next token in a sequence based on the tokens before it, and are thereby able to generate text that is statistically similar to the human language they are trained on.
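To make the next-token mechanism concrete, the following is a minimal, illustrative sketch rather than a description of any real model: a toy table of hand-picked conditional probabilities (all values hypothetical) is decoded greedily, one token at a time, conditioned on the token before it.

```python
import math

# Toy conditional distributions P(next_token | previous_token).
# These probabilities are invented for illustration only; a real LLM
# learns them over a large vocabulary and conditions on the full
# preceding context, not just the last token.
TOY_MODEL = {
    "the":     {"cell": 0.5, "study": 0.3, "results": 0.2},
    "cell":    {"divides": 0.6, "membrane": 0.4},
    "study":   {"shows": 0.7, "suggests": 0.3},
    "divides": {".": 1.0},
    "shows":   {"significance": 1.0},
}

def next_token(prev: str) -> str:
    """Greedy decoding: pick the most probable next token."""
    dist = TOY_MODEL.get(prev, {".": 1.0})
    return max(dist, key=dist.get)

def generate(seed: str, max_tokens: int = 5) -> list[str]:
    """Autoregressively extend the seed, one token at a time."""
    tokens = seed.split()
    for _ in range(max_tokens):
        tok = next_token(tokens[-1])
        tokens.append(tok)
        if tok == ".":
            break
    return tokens

print(" ".join(generate("the")))  # e.g. "the cell divides ."
```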
Because LLM chatbots are trained on large swaths of human-written natural language (i.e., data from OpenWebText, Wikipedia, and online journalism), they incorporate human biases into their outputs [4]. While the biases present in AI models created for a certain domain—e.g., machine learning on the electronic health record [5]—may be easier to intuit, generative AI can be more unpredictable [6,7]. Generative AI may have biases lurking under the surface, making it difficult to reproduce and definitively diagnose bias [8–10]. These may be data-driven or purely algorithmic, caused by training methods and model architectures [10]. These biases are more concerning when factoring in hallucinations and goal misgeneralization: some of the biggest open problems in AI interpretability and safety research [8,11]. Generative AI is subject to hallucinations, in which it may present information with apparent certainty even though that information is wholly incorrect or nonsensical [12–14]. This can mislead humans into believing an algorithm has considerable understanding of the hallucinated subject. Hallucinations become more likely if generative AI is trained on data subject to aggregation or representation bias [9,15]. Goal misgeneralization may result in similar outcomes—particularly when generative AI chatbots play along with stereotypes and assumptions about a user based on their prompt [16,17]. Interpretability of AI cannot currently account for the many nuances and decisions a human may make when constructing or formatting a research paper.

The online release of ChatGPT—powered initially by GPT-3 [3]—allowed the public to obtain coherent, AI-generated text with few-shot prompting [4]. This has prompted analyses of ChatGPT's ability to perform human writing tasks, from essays to scholarly writing [18–27]. Biases in scholarly writing are present regardless of whether AI is used; however, humans can attempt to mitigate and contextualize the extent to which they introduce bias into their research [28]. As chatbot-produced text can be generated with little to no human input, AI has the potential to automate scholarship and push out humans' organic interpretation and decision-making processes [24,29,30]. Specifically, the use of generative AI in writing manuscripts yields papers that may not be original, pose ethical dilemmas, and introduce new types of algorithmic bias and uncertainty [25,26,30]. Due to the pressure put on scholars to publish, they may increasingly rely on automated solutions that are capable of producing fast yet potentially problematic papers [31].

Recent AI safety research has showcased the dangers of red teaming and adversarial attacks against LLMs: humans prompt models with oftentimes innocuous-looking text intended to disrupt the system's behavior [32]. When conducting research, especially when providing large prompts with scientific data and text, a user may unknowingly generate a red teaming prompt by failing to provide appropriate context and specificity. This may cause the deployed LLM to leak information about its own training data, or to generate harmful, biased text [32]. In regard to ChatGPT, models such as GPT-3 have been trained extensively and aligned to generate text from a neutral lens; however, this neutral lens can break via red teaming, resulting in biases, misconceptions, and harmful outputs [17].
Because of reinforcement learning from human feedback (RLHF) techniques, models can be rewarded via human feedback for reinforcing the existing stereotypes and beliefs of misaligned humans [16]. By reinforcing an individual's bias, a human oracle may be more likely to positively assess a model's performance—resulting in deceptive models that can amplify bias and mislead users into having satisfactory, but not necessarily scientifically accurate, experiences. There are numerous challenges in collecting diverse human feedback that is inclusive and appropriate enough to combat the risk of societal biases being instilled via RLHF and goal misgeneralization [10,16,17]. These issues can also negatively affect researchers: depending on a researcher's provided prompt—which may include demographic information or ideological beliefs—text generated by a generative AI chatbot could discriminate against researchers and lead them down different paths.

This paper finds that the use of generative AI to write research manuscripts introduces a new type of bias driven by user context, and reaffirms the algorithmic bias that already exists within large language models, both of which are problematic to academia and science communication. We highlight how researchers from any demographic can unintentionally prompt a generative AI chatbot with personal information, allowing it to generate biased text in a way that contributes to the reinforcement of existing societal biases. Due to hallucinations and red teaming, generative AI chatbots can generate incorrect and harmful text, leading to problematic outcomes for researchers. Using generative AI as a tool for producing papers would thus require a deliberate effort to counter bias, especially in authorship. We suggest that the wider research community explore the implications of a positive feedback loop in which generative AI is trained on partially machine-written papers, resulting in amplifications of data-driven and context-induced bias.

2. Writing Papers with AI

Generative AI technologies shaped by large language models, such as GPT-3, have changed the way humans interact with machines when communicating research. The most apparent use case is few-shot prompting to write the contents of a research manuscript. This has been apparent with language models being added as co-authors on studies [27]. We quantify the number of manuscripts authored or co-authored by generative AI (more specifically, some variant of ChatGPT—the most well-known tool) by conducting a systematic review. We follow the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines, using PubMed, Web of Science, JSTOR, WorldCat, and Google Scholar. As shown in Figure 1, we find 104 works as of November 2023 by querying the aforementioned databases and manually screening our results to ensure the identified pieces were authored by a generative AI tool.

Figure 1. We find 104 scholarly works using our criteria and query (Author: "ChatGPT" OR "GPT-2" OR "GPT-3" OR "Generative AI" OR "Generative Pretrained Transformer" OR "OpenAI").

We note that remnants of AI prompts left in manuscripts due to careless errors indicate that our systematic review leaves out articles that used a generative AI tool but neglected to cite it as an author. We verify this claim via Google Scholar's search engine, using the query shown in Figure 2 (a sketch of how such queries can be assembled programmatically follows below).
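As an illustration of the screening protocol, the following sketch assembles the Boolean author query from Figure 1 and encodes it into a Google Scholar search URL. The helper names are hypothetical, and the exact field syntax accepted by each database differs; in practice, the query was entered into each database's own advanced search interface.

```python
from urllib.parse import urlencode

# Author-field terms taken verbatim from the Figure 1 query.
AI_AUTHOR_TERMS = [
    "ChatGPT", "GPT-2", "GPT-3",
    "Generative AI", "Generative Pretrained Transformer", "OpenAI",
]

def build_author_query(terms: list[str]) -> str:
    """Join quoted terms with OR, matching the Boolean query used in the review."""
    return " OR ".join(f'"{t}"' for t in terms)

def scholar_url(query: str) -> str:
    """Encode the query into a Google Scholar search URL (illustrative only)."""
    return "https://scholar.google.com/scholar?" + urlencode({"q": query})

query = build_author_query(AI_AUTHOR_TERMS)
print(query)               # "ChatGPT" OR "GPT-2" OR "GPT-3" OR ...
print(scholar_url(query))  # URL to paste into a browser for manual screening
```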
We suspect that far more authors have used ChatGPT similarly, but will have edited out such apparent markers of direct AI use—which some may deem plagiarism, depending on journal policy.

Figure 2. We find 79 scholarly works that used ChatGPT and left its signature "As an AI language model" without edits.

Journals and publishers have had widely different stances on the authorship of AI-generated works [24]. In several open letters to journals, scientists have expressed their concerns about accepting AI-generated works [21,30]. These criticisms include whether AI-generated works are ethical, and whether a machine has the necessary intellectual ability to contribute originality and thereby be recognized as a legitimate author. While positions are still taking shape, several journals have centered them around authorship qualification: authors should only be added if they meet certain criteria [21,24]. For instance, ChatGPT producing an abstract or summary when prompted with the rest of a paper does not make it an author, as authorship requires originality. Science's family of journals prohibits the use of AI-generated text in its works, deeming it plagiarism [21,33]. As of November 2023, both Science and Nature do not allow LLMs to be recognized as co-authors [21]. Largely, journals and concerned researchers have reached a consensus that generative AI is not yet at a stage where it can replace researcher originality.

Researchers, however, may still find themselves increasing their dependence on AI due to other forms of AI assistance. AI can turn skeleton outlines into full papers, and can also serve as an auto-complete engine for more experienced authors [33]. Human-AI collaborative writing can be seen as an extension of the literature review process, in which humans enter keywords and AI generates text for the author to refine and contextualize [34]. AI's command of grammar rules, sentence structure, and word choice gives it a significant advantage over human writing, especially for non-native English speakers [33]. While this may help democratize access to publishing in English-based, high-impact journals, it also introduces a new set of problems. As humans grow accustomed to using AI for writing, the decisions and biases that AI produces become more entrenched in research. If future AI models are trained on past machine-written works, this can introduce a positive feedback loop for bias.

3. New Types of Bias in Research

Data-driven and algorithmic bias in generative AI models are complex problems that pose unanswered questions. Biases can be present in pre-trained models and in the generated output itself [5,9,35]. Typically, model transparency is encouraged by research communities so that we are more aware of the biases that algorithms have [36]. Further, sharing code and models allows researchers, including those of sensitive groups, to test and verify [37]. It also opens the door to model fine-tuning, so that algorithms can meet the needs of specific communities. Yet, releasing models does not solve all problems—the large language models that power generative AI text generation are far more difficult to make interpretable [38], an issue exacerbated in part by the sheer billion-order parameter size of these models—and releasing algorithms to the public could pose a security risk due to bad actors [39]. The most popular generative AI chatbot, ChatGPT, is powered by GPT-3.5 and GPT-4 [40]—models that have been completely gatekept under company-controlled licenses.
This raises further questions of ethicality: if an AI-produced work is made public, yet the model used to generate that work is not, how do we judge the cause of the bias—is it attributable to the model or to the user who provided the prompt? These uncertainties make bias a difficult problem to solve, especially for generative AI models.

3.1. Context-Induced Bias

Because of the data they are trained on, chatbots revert to the linguistic patterns and data-driven stereotypes that humans have been writing about on the internet [9,10,36]. Data used to train large language models may be subject to pre-existing human biases. OpenWebText—a dataset that many large language models use as a training source—is subject to ethical concerns, particularly around aggregated data and biases inherited through collection methods [41]. Such biases can become exacerbated when LLMs are trained on human-provided feedback, which may amplify existing misconceptions and stereotypes [16,17]. For example, as shown in Figure 3, disclosing personal attributes or beliefs such as "I'm a 17-year-old conservative Black male..." may yield an AI-generated output that strays away from the given question and is influenced by user identity and political views.

Figure 3. Disclosing personal attributes and/or ideological information strongly affects the research topics that ChatGPT provides to aspiring researchers. In particular, upon specifying that one is part of a specific racial or ethnic group, ChatGPT suggests research topics more focused on social issues, i.e., criminal justice and transparency. Further, there is an observable language shift in which the tool doesn't address the initial prompt of "hot topics" and instead suggests research questions that the user "might find interesting" depending on their race and age.

This may lead to discriminatory, counterproductive responses to researchers, depending on what information they provide to a chatbot about themselves, rather than a direct answer to the given question. This issue is more pervasive in completion-based contexts, such as the RLHF-enhanced text-davinci-003 model, which third-party AI tools utilize via the OpenAI API. In completion-based contexts, the model cannot simply rephrase or sidestep a question; it is forced to answer. Completion-based models show strong performance on summarization and simple interface tasks [42], which are especially relevant in research. In completion-based usage of generative AI, there is even more control over the framing of questions and prompts, as the model can be used as an auto-completion tool. Once again, however, the introduction of identifiable information biases the output of models. A blatant example of bias based on research affiliation is illustrated in Figure 4.

Figure 4. Knowledge of an author's institutional affiliation biases text-davinci-003's completions. (a) The introductory text given in both prompt 1 and prompt 2; (b) prompt 1 is shown with a white background, and the predicted text is colored; (c) prompt 2 is shown with a white background, and the predicted text is colored. A temperature of zero is used for deterministic, reproducible responses.

It is possible for researchers to inadvertently leave identifying keywords in their prompts when working with LLMs. This may cause their text to carry not only data-driven biases about the subject being composed, but also further biases based on author-specific context.
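The following is a minimal sketch of the affiliation-probing setup behind Figure 4, assuming an OpenAI-style completions endpoint: two prompts that differ only in the stated affiliation are completed at temperature 0, and the outputs are compared. The client construction and parameter names reflect the openai Python library and may differ across versions; the prompts themselves are hypothetical stand-ins, not the exact prompts used in the figure.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

INTRO = "Write the opening sentence of a research statement by the author below.\n"

# Hypothetical prompts differing only in the stated institutional affiliation.
prompts = {
    "prompt_1": INTRO + "Affiliation: Harvard University\nResearch statement:",
    "prompt_2": INTRO + "Affiliation: a small community college\nResearch statement:",
}

completions = {}
for label, prompt in prompts.items():
    # temperature=0 makes decoding greedy, so reruns are reproducible;
    # logprobs returns per-token log probabilities for inspection.
    response = client.completions.create(
        model="text-davinci-003",  # legacy completion model referenced in the paper; may no longer be served
        prompt=prompt,
        temperature=0,
        max_tokens=60,
        logprobs=5,
    )
    completions[label] = response.choices[0].text

# Any systematic difference between the two completions reflects
# context-induced bias from the affiliation alone.
for label, text in completions.items():
    print(label, "->", text.strip())
```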
These examples illustrate that LLMs may not only amplify societal biases that exist in training data, but may also harm researchers in sensitive groups who are more likely to be affected by AI-induced bias. There exists a potential scenario in which generative AI models are used as "unbiased" research aids, yet reinforce tropes and stereotypes.

3.2. Hallucinatory Behavior and Red Teaming

In science research, there is an ongoing effort to include previously underrepresented groups in trials, studies, and discussions [43]. In contrast, generative AI chatbots replicate the data, text, and tropes on which they were initially trained—a process that does not weigh current inclusivity initiatives and results in aggregation and representation bias [9,15]. Hallucinations are thereby more likely to occur in the context of underrepresented groups in research, which can be detrimental to progress. In the literature review process and during manuscript writing, hallucinations may steer researchers further down biased directions as LLMs reaffirm their incorrect statements with hallucinated information. Oftentimes, these are backed by partially or fully hallucinated sources; an example of a leading statement along with inaccurate, hallucinated text is shown in Figure 5.

The log of the probability that a token (the basic unit of text) comes next is known as a logprob. In the context of writing scientific text, logprobs effectively indicate how likely a specific word or character is to come next given the preceding text. In a low-temperature testing environment (temperature = 0), the model is forced to pick the token with the highest probability, making its responses deterministic. While this potentially prevents more problematic biases from surfacing, it yields reproducible output. The log probability (logprob) of a sequence is given by Equation (1):

logprob = log P(t_1) + log P(t_2 | t_1) + ... + log P(t_n | t_1, t_2, ..., t_{n-1})    (1)

The conditional probability P(t_i | t_1, t_2, ..., t_{i-1}) is the probability of token t_i given the previous tokens in the sequence; a short computational sketch of this quantity follows.
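As a minimal, self-contained sketch of Equation (1), with invented token probabilities rather than values from any real model, the sequence logprob is simply the sum of the per-token conditional log probabilities; low-probability tokens (such as fabricated citation tokens) drag the total down.

```python
import math

# Hypothetical per-token conditional probabilities P(t_i | t_1, ..., t_{i-1})
# for one generated continuation; a real model reports these as logprobs.
step_probs = {
    " LK":    0.92,
    "-99":    0.97,
    " (":     0.31,
    "Kawano": 0.05,  # analogous to the low-probability citation tokens in Figure 5
    " 2015":  0.07,
}

# Equation (1): the sequence logprob is the sum of per-token log probabilities.
token_logprobs = {tok: math.log(p) for tok, p in step_probs.items()}
sequence_logprob = sum(token_logprobs.values())

for tok, lp in token_logprobs.items():
    print(f"{tok!r:10} logprob = {lp:6.2f}")
print(f"sequence logprob = {sequence_logprob:.2f}")
```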
Figure 5. Completion-based engines are especially prone to hallucinations. Text-davinci-003 demonstrates the production of inaccurate, hallucinated statements about LK-99. We provide the uncolored text, while the model produces the highlighted text, colored based on token generation probability: green-highlighted text corresponds to a logprob near 0.00, while yellow-, orange-, and red-highlighted text corresponds to progressively lower, more negative logprobs.

Hallucinated statements from LLMs are thereby prone to seeding falsehoods in a researcher's subsequent manuscript and beliefs, especially due to the seemingly legitimate sources cited throughout. The model produces its lowest-probability text when citing sources and factual information, as illustrated by the most negative logprobs at the tokens "4", "kawano", and "2015", with values of −3.07, −2.92, and −2.71, respectively. Because the bulk of scientific appraisal of LK-99—and thereby of potential training data—was generated in 2023, the trained LLM lacks access to the most up-to-date information and is prone to hallucinating inaccurate statements [44].

Inadvertent red teaming may also result in hallucinatory behavior. For example, a researcher may prompt a language model with limited context while asking it to compare two tangentially related subjects, as shown in Figure 6.

Figure 6. Giving authority and limited context to an LLM results in inadvertent red teaming when it is forced to connect two tangentially related subjects. Text-davinci-003 (temperature = 1) is given the prompt "You are the world's foremost expert on LK-99. You should be extremely scientific and accurate" and is then prompted with each of the three illustrated questions in separate, new prompts. Text highlighted in color is generated by the model.

Because training limitations leave models months or years behind the latest information, the model resorts to loosely related information in its training data—which may be limited for certain topics, an example of representation bias—including information pertaining to underrepresented groups. As illustrated in Figure 6, this may cause LLMs to generate biased or offensive text, as they do not have enough information to generate an accurate response. In this scenario, the researcher has red teamed the chatbot with an innocuous prompt, without necessarily attempting to do so. Propagating bias and/or misinformation is dangerous in research, introducing incorrect statements and references into the scientific body of work—and potentially misleading or biasing researchers—in ways that could be difficult to track and rectify.

4. Discussion and Conclusions

Generative AI technologies, powered by models with potentially billions of parameters, have changed the way humans write research. With growing power and access, these language models are becoming a source of authoring and editing research work. This has received both criticism from those concerned over bias, and praise as an engine for democratizing access to research. Through a systematic review of published works, we show that generative AI chatbots have been added as co-authors on papers and serve to help authors with literature review. Researchers also use tools such as ChatGPT and text-davinci-003 verbatim as auto-completion engines to assist manuscript writing, often neglecting to remove signature AI prompt leftovers from the document. Due to the data-driven bias and the new, context-induced biases we explore in this paper, introducing more personally identifying and historical information in prompts to LLMs may lead to more problematic outputs. Hallucinatory behaviors are exhibited especially for subjects on which LLMs have less training data—which is counterproductive to initiatives that aim to amplify underrepresented voices in research. Furthermore, unintentionally red teaming models with innocuous queries and prompts may lead to the model outputting stereotyped, biased, and dangerous information.

It may seem exaggerated or far-fetched for researchers to be prompting LLMs with their institutional or demographic information; however, this information often appears in writing that can unintentionally be fed to the model. When one copies their writing into a generative AI tool, prompting it to write an abstract or summary, or uses generative AI as a co-writing/autocompletion tool, they may inadvertently include terms such as "doctoral student" or "Affiliation: [Name] University" or author names—allowing LLMs to actualize ageist, ethnic, classist, and gender-based biases in research. As a generation of researchers grows accustomed to using AI as a composition partner, interventions need to be adopted to ensure that LLMs do not reinforce existing biases and archetypes.
With AI-generated works being added to the scientific literature, there is a possibility of a positive feedback loop in which LLMs reinforce biases. However, this would result in degeneracy of models, as they risk losing sight of the original, human data distribution and instead reinforce model beliefs and biases [45]. Biases would become so ingrained in models that it would be difficult to detect them and even trickier to un-learn them. Therefore, it may be pertinent for generative AI developers to resist the so-called "reference trap" of leveraging uncritical learning methods and to devise ways to ensure that LLMs are tested for their ability to reflect the data distribution and present unbiased results. This may influence the adoption of generative AI tools in critical and analytical fields like scientific research.

With OpenAI's release of custom GPT assistants (which can play characters and personas), users may find themselves anthropomorphizing chatbots and revealing sensitive information. Document and notepad software (such as Google Docs and Notion) is adopting generative AI, allowing it to directly access context in a document—e.g., a researcher including their affiliation and name in a header or cover page. While bias may be less noticeable in the output text, there is still an underlying difference in the output a model provides to a researcher—a matter of concern. Future research into biases across different languages may uncover more bias in non-English research contexts. Models across different mediums, such as ChatGPT and text-davinci-003, are likely to become more powerful as compute is scaled. This may worsen interpretability challenges, and thus there is a growing need to understand the effect of human prompting and context on bias.

Funding: This research received no external funding.

Acknowledgments: The authors thank Jane Rosenzweig for feedback on their manuscript.

Conflicts of Interest: R.J. receives funding from the Masason Foundation. A.J. declares no conflicts.

References
1. Mukhamediev, R.I.; Popova, Y.; Kuchin, Y.; Zaitseva, E.; Kalimoldayev, A.; Symagulov, A.; Levashenko, V.; Abdoldina, F.; Gopejenko, V.; Yakunin, K.; et al. Review of Artificial Intelligence and Machine Learning Technologies: Classification, Restrictions, Opportunities and Challenges. Mathematics 2022, 10. https://doi.org/10.3390/math10152552.
2. Frank, M.R.; Wang, D.; Cebrian, M.; Rahwan, I. The evolution of citation graphs in artificial intelligence research. Nature Machine Intelligence 2019, 1, 79–85. https://doi.org/10.1038/s42256-019-0024-5.
3. Brown, T.B.; Mann, B.; Ryder, N.; Subbiah, M.; Kaplan, J.; Dhariwal, P.; Neelakantan, A.; Shyam, P.; Sastry, G.; Askell, A.; et al. Language Models are Few-Shot Learners. arXiv:2005.14165 [cs].
4. Ray, P.P. ChatGPT: A comprehensive review on background, applications, key challenges, bias, ethics, limitations and future scope. 3, 121–154. https://doi.org/10.1016/j.iotcps.2023.04.003.
5. Gianfrancesco, M.A.; Tamang, S.; Yazdany, J.; Schmajuk, G. Potential Biases in Machine Learning Algorithms Using Electronic Health Record Data. 178, 1544–1547. https://doi.org/10.1001/jamainternmed.2018.3763.
6. Peng, C.; Yang, X.; Chen, A.; Smith, K.E.; PourNejatian, N.; Costa, A.B.; Martin, C.; Flores, M.G.; Zhang, Y.; Magoc, T.; et al. A Study of Generative Large Language Model for Medical Research and Healthcare, 2023. arXiv:2305.13523 [cs.CL].
7. Gloria, K.; Rastogi, N.; DeGroff, S. Bias Impact Analysis of AI in Consumer Mobile Health Technologies: Legal, Technical, and Policy, 2022. arXiv:2209.05440 [cs.CY].
8. Scheurer, J.; Balesni, M.; Hobbhahn, M. Technical Report: Large Language Models can Strategically Deceive their Users when Put Under Pressure, 2023. arXiv:2311.07590 [cs.CL].
9. Mehrabi, N.; Morstatter, F.; Saxena, N.; Lerman, K.; Galstyan, A. A Survey on Bias and Fairness in Machine Learning. arXiv:1908.09635 [cs].
10. Bender, E.; Gebru, T.; McMillan-Major, A.; Schmitchell, S. On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? 🦜. pp. 610–623. https://doi.org/10.1145/3442188.3445922.
11. Shah, R.; Varma, V.; Kumar, R.; Phuong, M.; Krakovna, V.; Uesato, J.; Kenton, Z. Goal Misgeneralization: Why Correct Specifications Aren't Enough For Correct Goals. arXiv:2210.01790 [cs].
12. Ji, Z.; Lee, N.; Frieske, R.; Yu, T.; Su, D.; Xu, Y.; Ishii, E.; Bang, Y.J.; Madotto, A.; Fung, P. Survey of Hallucination in Natural Language Generation. ACM Comput. Surv. 2023, 55. https://doi.org/10.1145/3571730.
13. Bang, Y.; Cahyawijaya, S.; Lee, N.; Dai, W.; Su, D.; Wilie, B.; Lovenia, H.; Ji, Z.; Yu, T.; Chung, W.; et al. A Multitask, Multilingual, Multimodal Evaluation of ChatGPT on Reasoning, Hallucination, and Interactivity, 2023. arXiv:2302.04023 [cs.CL].
14. Oniani, D.; Hilsman, J.; Peng, Y.; Poropatich, R.K.; Pamplin, C.J.C.; Legault, L.G.L.; Wang, Y. From Military to Healthcare: Adopting and Expanding Ethical Principles for Generative Artificial Intelligence. arXiv:2308.02448 [cs]. https://doi.org/10.48550/arXiv.2308.02448.
15. Dziri, N.; Milton, S.; Yu, M.; Zaiane, O.; Reddy, S. On the Origin of Hallucinations in Conversational Models: Is it the Datasets or the Models?
16. Casper, S.; Davies, X.; Shi, C.; Gilbert, T.K.; Scheurer, J.; Rando, J.; Freedman, R.; Korbak, T.; Lindner, D.; Freire, P.; et al. Open Problems and Fundamental Limitations of Reinforcement Learning from Human Feedback. arXiv:2307.15217 [cs].
17. Casper, S.; Lin, J.; Kwon, J.; Culp, G.; Hadfield-Menell, D. Explore, Establish, Exploit: Red Teaming Language Models from Scratch. arXiv:2306.09442 [cs].
18. Basic, Z.; Banovac, A.; Kruzic, I.; Jerkovic, I. ChatGPT-3.5 as writing assistance in students' essays. 10, 750. https://doi.org/10.1057/s41599-023-02269-7.
19. Kumar, A.H. Analysis of ChatGPT Tool to Assess the Potential of its Utility for Academic Writing in Biomedical Domain. 9, 24–30. https://doi.org/10.5530/bems.9.1.5.
20. Alkaissi, H.; McFarlane, S.I. Artificial Hallucinations in ChatGPT: Implications in Scientific Writing. 15. Publisher: Cureus. https://doi.org/10.7759/cureus.35179.
21. Stokel-Walker, C. ChatGPT listed as author on research papers: many scientists disapprove. 613, 620–621. https://doi.org/10.1038/d41586-023-00107-z.
22. Gao, C.A.; Howard, F.M.; Markov, N.S.; Dyer, E.C.; Ramesh, S.; Luo, Y.; Pearson, A.T. Comparing scientific abstracts generated by ChatGPT to original abstracts using an artificial intelligence output detector, plagiarism detector, and blinded human reviewers. https://doi.org/10.1101/2022.12.23.521610.
23. Liebrenz, M.; Schleifer, R.; Buadze, A.; Bhugra, D.; Smith, A. Generating scholarly content with ChatGPT: ethical challenges for medical publishing. 5, e105–e106. https://doi.org/10.1016/S2589-7500(23)00019-5.
24. Nakazawa, E.; Udagawa, M.; Akabayashi, A. Does the Use of AI to Create Academic Research Papers Undermine Researcher Originality? 3, 702–706. https://doi.org/10.3390/ai3030040.
25. Casal, J.E.; Kessler, M. Can linguists distinguish between ChatGPT/AI and human writing?: A study of research ethics and academic publishing. 2, 100068. https://doi.org/10.1016/j.rmal.2023.100068.
26. Lund, B.D.; Wang, T.; Mannuru, N.R.; Nie, B.; Shimray, S.; Wang, Z. ChatGPT and a new academic reality: Artificial Intelligence-written research papers and the ethics of the large language models in scholarly publishing. 74, 570–581. https://doi.org/10.1002/asi.24750.
27. Generative Pretrained Transformer, G.; Thunström, A.O.; Steingrimsson, S. Can GPT-3 write an academic paper on itself, with minimal human input?
28. Pannucci, C.J.; Wilkins, E.G. Identifying and Avoiding Bias in Research. 126, 619–625. https://doi.org/10.1097/PRS.0b013e3181de24bc.
29. Desaire, H.; Chua, A.E.; Isom, M.; Jarosova, R.; Hua, D. ChatGPT or academic scientist? Distinguishing authorship with over 99% accuracy using off-the-shelf machine learning tools. arXiv:2303.16352 [cs]. https://doi.org/10.48550/arXiv.2303.16352.
30. Gandhi, M.; Gandhi, M. Does AI's touch diminish the artistry of scientific writing or elevate it? 27, 350. https://doi.org/10.1186/s13054-023-04634-z.
31. Miller, A.; Taylor, S.; Bedeian, A. Publish or perish: Academic life as management faculty live it. 16, 422–445. https://doi.org/10.1108/13620431111167751.
32. Perez, E.; Huang, S.; Song, F.; Cai, T.; Ring, R.; Aslanides, J.; Glaese, A.; McAleese, N.; Irving, G. Red Teaming Language Models with Language Models. arXiv:2202.03286 [cs].
33. Lin, Z. Supercharging academic writing with generative AI: framework, techniques, and caveats. arXiv:2310.17143 [cs]. https://doi.org/10.48550/arXiv.2310.17143.
34. Aydin, O.; Karaarslan, E. OpenAI ChatGPT Generated Literature Review: Digital Twin in Healthcare. https://doi.org/10.2139/ssrn.4308687.
35. Panch, T.; Mattie, H.; Atun, R. Artificial intelligence and algorithmic bias: implications for health systems. 9, 010318. https://doi.org/10.7189/jogh.09.020318.
36. Starke, G.; De Clercq, E.; Elger, B.S. Towards a pragmatist dealing with algorithmic bias in medical machine learning. 24, 341–349. https://doi.org/10.1007/s11019-021-10008-5.
37. Norori, N.; Hu, Q.; Aellen, F.M.; Faraci, F.D.; Tzovara, A. Addressing bias in big data and AI for health care: A call for open science. 2, 100347. https://doi.org/10.1016/j.patter.2021.100347.
38. Ross, A.; Chen, N.; Hang, E.Z.; Glassman, E.L.; Doshi-Velez, F. Evaluating the Interpretability of Generative Models by Interactive Reconstruction. In Proceedings of the 2021 CHI Conference on Human Factors in Computing Systems, New York, NY, USA, 2021; CHI '21. https://doi.org/10.1145/3411764.3445296.
39. ISACA. The Promise and Peril of the AI Revolution: Managing Risk.
40. OpenAI. GPT-4 Technical Report.
41. Akyürek, A.F.; Kocyigit, M.Y.; Paik, S.; Wijaya, D. Challenges in Measuring Bias via Open-Ended Language Generation. arXiv:2205.11601 [cs].
42. Ye, J.; Chen, X.; Xu, N.; Liu, S.; Cui, Y.; Zhou, Z.; Gong, C.; Shen, Y.; Zhou, J.; Chen, S.; et al. A Comprehensive Capability Analysis of GPT-3 and GPT-3.5 Series Models.
43. Erves, J.C.; Mayo-Gamble, T.L.; Malin-Fair, A.; Boyer, A.; Joosten, Y.; Vaughn, Y.C.; Sherden, L.; Luther, P.; Miller, S.; Wilkins, C.H. Needs, Priorities, and Recommendations for Engaging Underrepresented Populations in Clinical Research: A Community Perspective. 42, 472–480. https://doi.org/10.1007/s10900-016-0279-2.
44. Habamahoro, T.; Bontke, T.; Chirom, M.; Wu, Z.; Bao, J.M.; Deng, L.Z.; Chu, C.W. Replication and study of anomalies in LK-99—the alleged ambient-pressure, room-temperature superconductor, 2023. arXiv:2311.03558 [cond-mat.supr-con].
45. Shumailov, I.; Shumaylov, Z.; Zhao, Y.; Gal, Y.; Papernot, N.; Anderson, R. The Curse of Recursion: Training on Generated Data Makes Models Forget, 2023. arXiv:2305.17493 [cs.LG].