Journal of Applied Psychology, 2023, Vol. 108, No. 8, 1277–1299
© 2023 American Psychological Association. ISSN: 0021-9010
https://doi.org/10.1037/apl0001082

How Well Can an AI Chatbot Infer Personality? Examining Psychometric Properties of Machine-Inferred Personality Scores

Jinyan Fan1, Tianjun Sun2, Jiayi Liu1, Teng Zhao1, Bo Zhang3, 4, Zheng Chen5, Melissa Glorioso1, and Elissa Hack6
1 Department of Psychological Sciences, Auburn University
2 Department of Psychological Sciences, Kansas State University
3 School of Labor and Employment Relations, University of Illinois Urbana-Champaign
4 Department of Psychology, University of Illinois Urbana-Champaign
5 School of Information Systems and Management, Muma College of Business, University of South Florida–St. Petersburg
6 Department of Behavioral Sciences and Leadership, United States Air Force Academy

The present study explores the plausibility of measuring personality indirectly through an artificial intelligence (AI) chatbot. This chatbot mines various textual features from users' free text responses collected during an online conversation/interview and then uses machine learning algorithms to infer personality scores. We comprehensively examine the psychometric properties of the machine-inferred personality scores, including reliability (internal consistency, split-half, and test–retest), factorial validity, convergent and discriminant validity, and criterion-related validity. Participants were undergraduate students (n = 1,444) enrolled in a large southeastern public university in the United States who completed a self-report Big Five personality measure (IPIP-300) and engaged with an AI chatbot for approximately 20–30 min. In a subsample (n = 407), we obtained participants' cumulative grade point averages from the University Registrar and had their peers rate their college adjustment. In an additional sample (n = 61), we obtained test–retest data. Results indicated that machine-inferred personality scores (a) had overall acceptable reliability at both the domain and facet levels, (b) yielded a comparable factor structure to self-reported questionnaire-derived personality scores, (c) displayed good convergent validity but relatively poor discriminant validity (averaged convergent correlations = .48 vs. averaged machine-score correlations = .35 in the test sample), (d) showed low criterion-related validity, and (e) exhibited incremental validity over self-reported questionnaire-derived personality scores in some analyses. In addition, there was strong evidence for cross-sample generalizability of psychometric properties of machine scores. Theoretical implications, future research directions, and practical considerations are discussed.

Keywords: chatbot, personality, artificial intelligence, machine learning, psychometric properties

Supplemental materials: https://doi.org/10.1037/apl0001082.supp

This article was published Online First February 6, 2023.
Jinyan Fan https://orcid.org/0000-0001-9599-0737
Tianjun Sun https://orcid.org/0000-0002-3655-0042
Jiayi Liu https://orcid.org/0000-0002-4077-8234
Teng Zhao https://orcid.org/0000-0002-5588-8647
Bo Zhang https://orcid.org/0000-0002-6730-7336
Zheng Chen https://orcid.org/0000-0003-3918-6976
Melissa Glorioso is now at Army Research Institute for the Behavioral and Social Sciences. The authors thank Michelle Zhou, Huahai Yang, and Wenxi Chen of Juji, Inc., for their assistance with machine-learning-based model building. They also thank Andrew Speer, Louis Hickman, Filip Lievens, Emily Campion, Peter Chen, Alan Walker, and Jesse Michel for providing their valuable feedback on an earlier version of the article. The authors declare no financial conflict of interest or advisory board affiliations with Juji, Inc. Jinyan Fan and Tianjun Sun contributed equally to this article. Earlier versions of part of the article were presented at the Society for Industrial and Organizational Psychology 2018 and 2022 conferences. The views expressed are those of the authors and do not reflect the official policy or position of the U.S. Air Force, Department of Defense, or the U.S. Government. Correspondence concerning this article should be addressed to Jinyan Fan, Department of Psychological Sciences, Auburn University, 225 Thach Hall, Auburn, AL 36849, United States, or Jiayi Liu, Department of Psychological Sciences, Auburn University, 102A Thach Hall, Auburn, AL 36849, United States. Email: Jinyan.Fan@auburn.edu or jzl0217@auburn.edu

During the last 3 decades, personality measures have been established as a useful talent assessment tool due to the findings that (a) personality scores are predictive of important organizational outcomes (e.g., Hurtz & Donovan, 2000; Judge & Bono, 2001) and (b) personality scores typically do not result in racial adverse impact (e.g., Foldes et al., 2008).
While scholars and practitioners alike appreciate the utility of understanding individuals' behavioral characteristics in organizational settings, debates have revolved around how to measure personality more effectively and efficiently. Self-report personality measures, often used in talent assessment practice, have been criticized for (a) modest criterion-related validity (Morgeson et al., 2007); (b) susceptibility to faking or response distortion, particularly within selection contexts (Ziegler et al., 2012); (c) idiosyncratic interpretation of items due to individual differences in cross-situational behavioral consistency (Hauenstein et al., 2017); and (d) the tedious testing experience, in which test-takers have to respond to many items in one sitting.

Recently, an innovative approach to personality assessment has emerged. This approach was originally developed by computer scientists and has now made its way into the applied psychology field. It is generally referred to as artificial intelligence (AI)-based personality assessment. This new form of assessment can be distinguished from traditional assessment in three ways: (a) technologies, (b) types of data, and (c) algorithms (Tippins et al., 2021). Data collected via diverse technological platforms (e.g., social media and video interviews) have been used to obtain an assortment of personality-relevant data (digital footprints) such as facial expression (Suen et al., 2019), smartphone data (Chittaranjan et al., 2013), interview responses (Hickman et al., 2022), and online chat scripts (Li et al., 2017). The third area in which AI-based personality assessments are unique is their use of more complex algorithms.
AI is a broad term that refers to the science and engineering of making intelligent systems or machines (e.g., especially computer programs) that mimic human intelligence to perform tasks and can iteratively improve themselves based on the information they collect (McCarthy & Wright, 2004). Machine learning (ML) is a subset of AI, which focuses on building computer algorithms that automatically learn or improve performance based on the data they consume (Mitchell, 1997). In some more complex work, the term deep learning (DL) may also be referenced. DL is a subset of ML, referring to neural-network-based ML algorithms that are composed of multiple processing layers to learn the representations of data with multiple levels of abstraction and mimic how a biological brain works (LeCun et al., 2015). The present article refers to personality assessment tools that purport to predict personality traits using digital footprints as the ML approach.

The ML approach to personality assessment typically entails two stages: (a) model training and (b) model application. In the model training stage, researchers attempt to build predictive models using a large sample of individuals. The predictors are potentially trait-relevant features extracted through analyzing digital footprints generated by individuals, such as a corpus of texts, social media "likes," sound/voice memos, and micro facial expressions. The criteria (ground truth) are the same individuals' questionnaire-derived personality scores, either self-reported (e.g., Golbeck et al., 2011; Gou et al., 2014), other-rated (e.g., Chen et al., 2017; Harrison et al., 2019), or both (e.g., Hickman et al., 2022). Next, researchers try to establish empirical links between the predictors (features) and the criteria, often via linear regressions, support vector machines, tree-based analyses, or neural networks, resulting in estimated model parameters (e.g., regression coefficients).1 To avoid model overfitting, within-sample k-fold cross-validation is routinely conducted (Bleidorn & Hopwood, 2019). In addition, an independent test sample is often arranged for cross-sample validation and model testing. In the model application stage, the trained model is applied to automatically predict the personality of new individuals who do not have the questionnaire-derived personality data. Specifically, the computer algorithm first analyzes new individuals' digital footprints, extracts features, obtains feature scores, and then uses feature scores and established model parameters to calculate predicted personality scores.

The ML approach can be thought of as an indirect measurement of personality using a large number of features with empirically derived model parameters to score personality automatically (Hickman et al., 2022; Park et al., 2015). These features can be based on lingual or other behaviors, such as interaction logs or facial expressions. Model parameters indicate the influence of features on the personality "ground truth" as the prediction targets. This approach boasts two advantages over traditional assessment methods, particularly self-report questionnaires (Mulfinger et al., 2020). The first advantage lies in its efficiency. For instance, it is possible to use the same set of digital footprints to train a series of models to infer scores on numerous individual difference variables such as personality traits, cognitive ability, values, and career interests.
It is resource intensive to train these different models; however, once trained, various individual differences can be automatically and simultaneously inferred with a single set of digital footprint inputs. This would shorten the assessment time in general, which should be appealing to both test-takers and sponsoring organizations. Second, the testing experience tends to be less tedious (Kim et al., 2019). If individuals' social media content is utilized to infer personality, individuals do not need to go through the assessment process at all. If a video interview or online chat is used, individuals may feel they have more opportunities to perform and thus should enjoy the assessment more than, for instance, completing a self-report personality inventory (McCarthy et al., 2017).

Despite the potential advantages, the ML approach to personality assessment faces several challenges and needs to sufficiently address many issues before it can be used in practice for talent assessment. For instance, earlier computer algorithms required users (e.g., job applicants) to share their social media content, which did not fare well due to apparent privacy concerns and potential legal ramifications (Oswald et al., 2020). In response, organizations have begun to use automated video interviews (AVIs; e.g., Hickman et al., 2022; Leutner et al., 2021) or text-based interviews (e.g., Völkel et al., 2020; Zhou et al., 2019) to extract trait-relevant features. The present study uses the text-based interview method (also known as the AI chatbot) to collect textual information from users. Strategies such as AVIs and AI chatbots may gain wider acceptance in applied settings, as job applicants are less likely to refuse an interview request from a hiring organization from which they expect a job offer.

Another critical issue that remains largely unaddressed is a striking lack of extensive examinations of the psychometric properties of machine-inferred personality scores (Bleidorn & Hopwood, 2019; Hickman et al., 2022). Although numerous computer algorithms have been developed to infer personality scores, few validation efforts have gone beyond demonstrating the convergence between questionnaire-derived and machine-inferred personality scores.

1 Support vector machine has the objective of finding a hyperplane (which can be multidimensional) in a high-dimensional space (e.g., a regression or classification model with many features or predictors) that distinctly classifies the observations, such that the plane has the maximum margin (i.e., the distance between data points and classes), where the maximization helps future observations to be more confidently classified and their values predicted. Neural networks (or artificial neural networks, as often referred to in computational sciences to be distinguished from neural networks in the biological brain) can be considered sets of algorithms that are designed—loosely after the information processing of the human brain—to recognize patterns. All patterns recognized—be it from sounds, images, text, or others—are numerical and contained in vectors to be stored and managed in an information layer (or multiple processing layers), and various follow-up tasks (e.g., clustering) can take place on another layer on top of the information layer(s).
The purpose of the present study is to explore the plausibility of measuring personality through an AI chatbot. More importantly, we extensively examine psychometric properties of machine-inferred personality scores at both the facet and domain levels, including reliability (internal consistency, split-half, and test–retest), factorial validity, convergent and discriminant validity, and criterion-related validity. Such a comprehensive examination has been extremely rare in the ML literature but is sorely needed. Although an unpublished doctoral dissertation (Sun, 2021) provided initial promising evidence for the utility of the AI chatbot method for personality inference, more empirical research is warranted. In the present study, we have chosen to use self-reported questionnaire-derived personality scores as ground truth when building predictive models. Self-report personality measures are the most widely used method of personality assessment in practice, with an impressive body of validity evidence. Although self-report personality measures may be prone to social desirability or faking, our study context is research-focused instead of selection-focused, and thus faking may not be a serious concern. In what follows, we first introduce the AI-powered, text-based interview system (AI chatbot) used in our research. We then briefly discuss how we examine the psychometric properties of machine-inferred personality scores and then present a large-scale empirical study.

AI Chatbot and Personality Inference

An AI chatbot is an artificial intelligence system that often utilizes a combination of technologies, such as deep learning for natural language processing (NLP), symbolic machine learning for pattern recognition, and predictive analytics for user insights inference, to enable the personalization of conversation experiences and improve chatbot performance as it is exposed to more human interactions (International Business Machines, n.d.). Unlike automated video interview systems (e.g., Hickman et al., 2022; Leutner et al., 2021), which mostly entail one-way communications, an AI chatbot engages with users through two-way communications (e.g., Zhou et al., 2019).

While there are many chatbot platforms commercially available (e.g., International Business Machines Watson Assistant, Google Dialogflow, and Microsoft Power Virtual Agents), we opted to use Juji's AI chatbot platform (https://juji.io) as our study platform for three reasons. First, unlike many other platforms that require writing computer programs (e.g., via Application Programming Interfaces) to train and customize advanced chatbot functionalities, such as dialog management, Juji's platform enables non-IT professionals, such as applied psychologists and HR professionals, to create, customize, and manage an AI chatbot to conduct virtual interviews without writing code. Second, Juji's platform is publicly accessible (currently with no- or low-cost academic use options), which enables other scholars to conduct similar research studies and/or replicate our study. Third, scholars have successfully used the Juji chatbot platform in conducting various scientific studies, including team creativity (Hwang & Won, 2021), personality assessment (Völkel et al., 2020), and public opinion elicitation (Jiang et al., 2023).
To enable readers to better understand what is behind the scenes of the Juji AI platform and facilitate others in selecting comparable chatbot platforms for conducting studies like ours, in the next section we provide a high-level, nontechnical explanation of the Juji AI chatbot platform. Note that the key structure and functions of this specific chatbot platform also outline the required key components for supporting any capable conversational AI agent (e.g., Jayaratne & Jayatilleke, 2020), thus allowing our methods and results to be generalized.

[Figure 1. Overview of AI Chatbot Platform for Building an Effective Chatbot and Predicting Personality Using Juji's Virtual Conversation System as a Prototype. Note. AI = artificial intelligence. See the online article for the color version of this figure.]

As shown in Figure 1, at the bottom level, several machine learning models are used, including data-driven machine learning models and symbolic AI models, to support the entire chatbot system. At the middle level, specialty engines are built to facilitate two-way conversations and user insights inferences. Specifically, an NLP engine for conversation interprets highly diverse and complex user natural language inputs during a conversation. Based on the interpretation results, an active listening conversation engine, which is powered by the social–emotional intelligence needed to carry out an empathetic and effective conversation, decides how best to respond to users and guide the conversation forward. For example, it may decide to engage users in small talk; provide empathetic comments such as paraphrasing, verbalizing emotions, and summarizing user input; or handle diverse user interruptions, such as digressing from a conversation and trying to dodge a question (Xiao et al., 2020). The personality inference engine takes in the conversation script from a user and performs NLP on the script to extract textual features, which are used as predictors, with the same user's questionnaire-based personality scores being used as the criteria. ML models can then be built to automatically infer personality using statistical methods such as linear regressions.

To facilitate the customization of an AI chatbot, a set of reusable AI components is prebuilt, from conversation topics to AI assistant templates, which enables the automated generation of an AI assistant/chatbot (see the top part of Figure 1). Specifically, a chatbot designer (a researcher or an HR manager) uses a graphical user interface to specify what the AI chatbot should do, such as its main conversation flow or the Q&As to be supported, and the AI generator automatically generates a draft AI chatbot based on these specifications. The generated AI chatbot is then compiled by the AI compiler to become a live chatbot that can engage with users in a conversation, which is managed by the AI runtime component to ensure a smooth conversation.

Examining Psychometric Properties of Machine-Inferred Personality Scores

Bleidorn and Hopwood (2019) suggested three general classes of evidence for the construct validity of machine-inferred personality scores based on Loevinger's (1957) original framework. We rely on this framework to organize the present examination of psychometric properties of machine scores.
Table 1 summarizes relevant ML research in this area.

Substantive Validity

According to Bleidorn and Hopwood (2019), the first general class of evidence for construct validity is substantive validity, which has often been operationalized as content validity, defined as the extent to which test items sufficiently sample the conceptual domain of the construct but do not tap into other constructs. Establishing the content validity of machine-inferred personality scores proves quite challenging. This is because the ML approach is based on features identified empirically in a data-driven manner (Hickman et al., 2022; Park et al., 2015). As such, we typically do not know a priori which features ("items") should predict which personality traits and why; furthermore, these features are diverse, heterogeneous, and in large quantity (Bleidorn & Hopwood, 2019). Although several previous ML studies have partially established the content validity of machine-inferred personality scores (e.g., Hickman et al., 2022; Kosinski et al., 2013; Park et al., 2015; Yarkoni, 2010), overall, the evidence for content validity has been very limited. Because Juji's AI chatbot system uses a DL model that mines textual features purely based on mathematics, we could not examine content validity directly. However, we looked at content validity indirectly (see the Supplemental Analysis section).

Table 1
Summary of Empirical Research on Psychometric Properties of Machine-Inferred Personality Scores

Substantive validity
Content validity. Prototypical examples: Hickman et al. (2022); Kosinski et al. (2013); Park et al. (2015); Yarkoni (2010). Major findings/current status: barring limited exceptions, significant relationships between digital features and questionnaire personality scores bear no substantive meanings.

Structural validity
Reliability
Test–retest reliability. Prototypical examples: Harrison et al. (2019); Hickman et al. (2022); Li et al. (2020); Park et al. (2015). Major findings/current status: machine-inferred personality scores have comparable or slightly lower test–retest reliability than questionnaire personality scores.
Split-half reliability. Prototypical examples: Hoppe et al. (2018); Wang and Chen (2020); Youyou et al. (2015). Major findings/current status: split-half reliability of machine-inferred personality scores ranges from .40s to .60s.
Internal consistency. Prototypical examples: none. Major findings/current status: none.
Generalizability. Prototypical examples: Hickman et al. (2022). Major findings/current status: models trained on self-reports tend to exhibit poorer generalizability than models trained on interview reports.
Factorial validity. Prototypical examples: none. Major findings/current status: none.

External validity
Convergent validity. Prototypical examples: three meta-analyses (Azucar et al., 2018; Sun, 2021; Tay et al., 2020). Major findings/current status: correlations between machine-inferred and questionnaire scores of the same personality traits range from .20s to .40s.
Discriminant validity. Prototypical examples: Harrison et al. (2019); Hickman et al. (2022); Marinucci et al. (2018); Park et al. (2015). Major findings/current status: correlations among machine-inferred scores tend to be similar to correlations between machine-inferred and questionnaire scores of the same traits.
Criterion-related validity (a). Prototypical examples: Gow et al. (2016); Harrison et al. (2019, 2020); Wang and Chen (2020). Major findings/current status: machine-inferred chief executive officer personality trait scores predict various objective indicators of firm performance.
Incremental validity (a). Prototypical examples: none. Major findings/current status: none.

Note. (a) We only considered studies using non-self-report performance criteria.
Structural Validity

The second general class of evidence for construct validity is structural validity, which focuses on the internal characteristics of test scores (Bleidorn & Hopwood, 2019). There are three major categories of structural validity: reliability, generalizability, and factorial validity.

Reliability

Internal consistency (Cronbach's α) typically does not apply to machine-inferred personality scores because mined features (often in the hundreds) are empirically derived in a purely data-driven manner and are unlikely to be homogeneous "items," thus having very low item-total correlations (Hickman et al., 2022). However, Cronbach's α can be estimated at the personality domain level, treating facet scores as "items." Test–retest reliability, on the other hand, can be readily calculated at both domain and facet levels. We located several empirical studies (e.g., Gow et al., 2016; Harrison et al., 2019; Hickman et al., 2022; Li et al., 2020; Park et al., 2015) reporting reasonable test–retest reliability of machine-inferred personality scores. A third type of reliability index, split-half reliability, cannot be calculated directly, since there is no "item" in machine-inferred personality assessment. However, ML scholars have found a way to overcome this difficulty. Specifically, scholars first randomly split the corpus of text provided by participants into halves with roughly the same number of words or sentences, then apply the trained model to predict personality scores based on the respective segments of words separately. Split-half reliability is calculated as the correlation between these two sets of machine scores with the Spearman–Brown correction. A few studies (e.g., Hoppe et al., 2018; Wang & Chen, 2020; Youyou et al., 2015) reported reasonable split-half reliability of machine-inferred scores, with averaged split-half reliability across Big Five domains ranging from .59 to .71. In the present study, we estimated test–retest and split-half reliability for machine-inferred personality facet scores. We also estimated Cronbach's αs of machine-inferred personality domain scores, treating facet scores under respective domains as "items."
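To make that domain-level computation concrete, the following minimal Python sketch (ours, not the authors' code) shows how Cronbach's α can be obtained when the facet scores within one Big Five domain are treated as "items"; the toy scores are random placeholders rather than study data.

import numpy as np

def cronbach_alpha(facet_scores):
    """Cronbach's alpha for one Big Five domain, treating its facet scores as "items".

    facet_scores: (n_participants, n_facets) array, e.g., the six machine-inferred
    facet scores belonging to a single domain.
    """
    facet_scores = np.asarray(facet_scores, dtype=float)
    k = facet_scores.shape[1]
    item_variances = facet_scores.var(axis=0, ddof=1)      # variance of each facet "item"
    total_variance = facet_scores.sum(axis=1).var(ddof=1)  # variance of the summed domain score
    return (k / (k - 1)) * (1 - item_variances.sum() / total_variance)

# Illustrative use with placeholder scores (a shared component makes the facets correlate):
rng = np.random.default_rng(0)
toy_facets = rng.normal(size=(200, 6)) + rng.normal(size=(200, 1))
print(round(cronbach_alpha(toy_facets), 2))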
Generalizability

Generalizability refers to the extent to which the trained model may be applied to different contexts (e.g., different samples, different models trained on different sets of digital footprints) and still yield comparable personality scores (Bleidorn & Hopwood, 2019). Our review of the literature reveals that very few ML studies have examined the issue of generalizability. One important exception is a study by Hickman et al. (2022), who obtained four different samples, trained predictive models on Samples 1–3 individually, and then applied the trained models to separate samples. Hickman et al. reported mixed findings regarding the generalizability of machine-inferred personality scores. In the present study, we focused on cross-sample generalizability, looking at many aspects of cross-sample generalizability, including reliability (internal consistency at the domain level and split-half at the facet level), factor structure, and convergence and discrimination relations at both the latent and manifest variable levels (to be discussed subsequently).

Factorial Validity

Factorial validity is established if machine-inferred personality facet scores recover the Big Five factor structure (Costa & McCrae, 1992; Goldberg, 1993) as rendered by self-reported questionnaire-derived personality facet scores (i.e., the same factor loading patterns and similar magnitudes of factor loadings). To our best knowledge, no empirical studies have examined the factorial validity of machine-inferred personality scores, primarily because in almost all empirical studies, researchers have trained models to predict personality domain scores rather than facet scores.2 In the present study, we overcome this limitation by building predictive models at the facet level.

2 Speer (2021) examined the factor structure of machine-inferred dimension-level performance scores, but we are interested in the factor structure of machine-inferred personality scores.

External Validity

The third general class of evidence for construct validity is external validity, which focuses on the correlation patterns between test scores and external, theoretically relevant variables (Bleidorn & Hopwood, 2019). Within this class of evidence, researchers typically look at convergent validity, discriminant validity, criterion-related validity, and incremental validity.

Convergent Validity

Within the ML approach, convergent validity refers to the magnitude of correlations between machine-inferred and questionnaire-derived personality scores of the same personality traits. Because most computer algorithms treat the latter as ground truth and aim to maximize its prediction, it is not surprising that convergent validity of machine-inferred personality scores has been routinely examined, which has yielded several meta-analyses (e.g., Azucar et al., 2018; Sun, 2021; Tay et al., 2020). These meta-analyses reported modest-to-moderate convergent validity across Big Five domains, ranging from .20s to .40s.

Discriminant Validity

In contrast to the heavy attention given to convergent validity, very few empirical studies have examined the discriminant validity of machine-inferred personality scores (Sun, 2021). A measure with good discriminant validity should demonstrate that correlations between different measures of the same constructs (convergent relations) are much stronger than correlations between measures of different constructs using the same methods (discriminant relations; Cronbach & Meehl, 1955). Researchers usually rely on the multitrait–multimethod (MTMM) matrix to investigate convergent and discriminant validity. We identified four empirical studies that examined the discriminant validity of machine-inferred personality scores (Harrison et al., 2019; Hickman et al., 2022; Marinucci et al., 2018; Park et al., 2015), with findings suggesting relatively poor discriminant validity. For instance, Park et al. (2015) showed that the average correlations among Big Five domain scores were significantly higher when measured by the ML method than by self-report questionnaires (r̄ = .29 vs. r̄ = .19). One possible reason for relatively poor discriminant validity is that there are typically many features (easily in the hundreds) in predictive models inferring personality.
As a result, it is common that models predicting different personality traits share many common features, which would inflate correlations between machine-inferred personality scores. In the present study, since we built predictive models at the facet level, we were able to examine the convergent and discriminant validity of personality domain scores at both the manifest and latent variable levels. Analyzing the MTMM structure of latent scores allowed for disentangling trait, method, person-specific, and measurement error influences on personality scores. One advantage of the latent variable approach is that it separates the method variance from the measurement error variance. As a result, correlations among personality domain scores should be more accurate at the latent variable level than at the manifest variable level.

Criterion-Related and Incremental Validity

Given the modest-to-moderate correlation between machine-inferred and self-reported questionnaire-based scores (typically in the .20–.40 range, as reported in several meta-analyses; e.g., Sun, 2021; Tay et al., 2020) and the similarly modest correlations between self-reported questionnaire-derived personality scores and job performance (r = .18 for personality scores overall meta-analytically; Sackett et al., 2022), one could expect that machine-inferred personality scores would exhibit some criterion-related validity—operationalized as the cross product of the above two coefficients (e.g., .30 × .18 = .054)—if treating the portion of machine-inferred scores that converges with self-reported scores as predictive of performance criteria. However, it may be argued that criterion-related validity, in this case, would be too small to be practically useful.

Yet, we want to offer another reasoning approach: the criterion-related validity of machine-inferred personality scores does not have to come only from the portion converging with self-reported questionnaire-derived personality scores. In other words, the part of the variance in machine-inferred personality scores that is unshared by self-reported questionnaire-derived personality scores might still be criterion relevant. Conceptually, it is plausible that these two types of personality scores capture related but distinct aspects of personality. For instance, self-reported questionnaire-derived personality scores represent one's self-reflection on typical and situation-consistent behaviors (Pervin, 1994), with much of the nuances in actual behaviors perhaps lost. In contrast, the ML approach extracts many linguistic cues that may capture the nuances in behaviors above and beyond the typical behaviors captured by self-reported questionnaire-derived personality scores. Although the exact nature of the unique aspects of personality captured by the ML approach is unclear, their criterion relevance is not only a theoretical question but also an empirical one. Thus, if our above reasoning is correct, we would expect machine-inferred personality scores to exhibit not only criterion-related validity but also incremental validity over self-reported questionnaire-derived personality scores.
Although a few empirical studies (e.g., Kulkarni et al., 2018; Park et al., 2015; Youyou et al., 2015) have reported that machine-inferred personality scores predicted some demographic variables (e.g., network size, field of study, political party affiliation) and self-reported outcomes (e.g., life satisfaction, physical health, and depression), there has been a lack of evidence that machine-inferred personality scores can predict organizationally relevant non-self-report criteria, such as performance and turnover. The reasons self-report criteria are less desirable include (a) they are rarely used in talent assessment practice and (b) machine-inferred personality scores based on self-report models and self-report criteria share the same source, which might inflate criterion-related validity (Park et al., 2015).

Interestingly, we located a handful of studies in the field of strategic management that documented criterion-related validity of machine-inferred chief executive officer (CEO) personality scores (e.g., Gow et al., 2016; Harrison et al., 2019, 2020; Wang & Chen, 2020). For example, Harrison et al. (2019) obtained 207 CEOs' spoken or written texts (e.g., transcripts of earnings calls), extracted textual features, had experts (trained psychology doctoral students) rate these CEOs' personalities based on their texts, and then built predictive models accordingly. Harrison et al. then applied the trained model to estimate Big Five domain scores for 3,449 CEOs of 2,366 S&P 1500 firms between 1996 and 2014 and then linked them to firm strategy changes (e.g., advertising intensity, R&D intensity, inventory levels) and firm performance (return on assets). These authors reported that machine-inferred CEO Openness scores were positively related to firm strategic changes and that the relationship was stronger when the firm's performance in the previous year was low versus high. Unfortunately, studies in this area typically obtained expert-rated personality scores only for small samples of CEOs (n < 300) during model building and thus were unable to examine incremental validity in much larger holdout samples. Thus, to our best knowledge, even in the broad ML literature, there has not been any empirical evidence for incremental validity of machine-inferred personality scores over questionnaire-derived personality scores beyond self-report criteria. The present study addressed this major limitation by investigating the criterion-related and incremental validity of machine-inferred personality scores using two non-self-report performance criteria: (a) objective cumulative grade point average (GPA) and (b) peer-rated college adjustment.

Method

Transparency and Openness

We describe our sampling plan, all data exclusions, manipulations, and all measures in the study, and we adhered to the Journal of Applied Psychology methodological checklist. Raw data and computer code for model building are not available due to their proprietary nature, but processed data (self-reported questionnaire-derived and machine-inferred personality scores) are available at https://osf.io/w73qn/?view_only=165d652ef809442fbab46f815c57f467. Other associated research materials are included in the online Supplemental Materials. Data were analyzed using Python 3 (Van Rossum & Drake, 2009); the scikit-learn package (Pedregosa et al., 2011); R (Version 4.0.0); the package psych, Version 2.2.5 (Revelle, 2022); IBM SPSS Statistics (Version 27); and Mplus 8.5 (Muthén & Muthén, 1998/2017).
The study design and its analysis were not preregistered.

Sample and Procedure

Participants were 1,957 undergraduate students enrolled in various psychology courses, recruited from the subject pool operated by the Department of Psychological Sciences at a large southeastern public university via Sona Systems (https://www.sona-systems.com/; Auburn University institutional review board Protocol 16-354 MR 1609, Examining the Relationships among Daily Word Usage through Online Media, Personalities, and College Performance; Auburn University institutional review board Protocol 18-410 EP 1902, Linking Online Interview/Chat Response Scripts to Self-reported Personality Scores). To ensure a large enough sample size, we collected data over several semesters. We obtained a training sample (n = 1,477) and a separate test sample (n = 480). The training sample contains data collected in fall 2018 (n = 531 in the lab), spring 2019 (n = 202 in the lab and n = 396 online), and spring 2020 (n = 348 online), and it was used to build predictive models. The separate test sample contained data collected in spring 2017 (n = 480 in the lab), including the criteria data. The test sample was used for cross-sample validation and model testing purposes.3

During spring 2017 data collection, when participants arrived at the lab, they first completed an online Big Five personality measure (IPIP-300) on Qualtrics and then were directed to the AI firm's virtual conversation platform, where they engaged with the AI chatbot for approximately 20–30 min.4 The chatbot asked participants several open-ended questions organized around a series of topical areas, which were generic in nature (see the online Supplemental Materials A for a list of interview questions used in the present study) and thus probably should be considered unstructured interview questions. In other words, although the same questions were given to all participants, they were not aimed at measuring anything specific. The questions were displayed in a chat box on the computer screen, and participants typed their responses into the chat box. Participants were allowed to ask the chatbot questions. After the virtual conversation, participants completed an enrollment certification request form, which gave us permission to obtain their SAT and/or American College Testing (ACT) scores and cumulative GPA from the University Registrar. Next, participants were asked to provide the names and email addresses of three peers who knew their college life well. Immediately after the lab session, the experimenter randomly selected one of the three peers and sent out an invitation email for a peer evaluation with a survey link. One week later, a reminder email was sent to the peer. If the first peer failed to respond, the experimenter randomly selected and invited another peer. The process continued until all three peers were exhausted. It turned out that eight participants received multiple peer ratings, and we used the first peer's ratings in subsequent analyses. Peers were sent a $5 check for their time (10–15 min).
We assigned a participant ID to each participant, which was used as the identifier to link his/her self-reported questionnaire-derived personality data, virtual conversation scripts, peer ratings, and cumulative GPAs. Out of 480 participants in the spring 2017 sample, 73 participants either failed to enter or mistyped their participant IDs, or chose to withdraw from the study, and were excluded. Thus, we obtained matched data for 407 participants, among whom 75.2% were female, and the mean age was 19.38 years. Out of the 407 participants, we were able to obtain 379 participants' Scholastic Aptitude Test (SAT)/ACT scores and cumulative GPA from the University Registrar and 301 participants' peer ratings. Two hundred eighty-nine participants had both SAT/ACT scores and peer ratings.

For data collected after spring 2017, 733 participants came to our lab and went through the same study procedure as those in spring 2017 (the test sample) but without collecting the criteria data. In addition, 744 completed the study online via a Qualtrics link for the IPIP-300 and the AI firm's virtual conversation system for the online chat. No criteria data were collected for the online participants, either. Based on the same matching procedure as the test sample, we were able to match the personality data and the virtual conversation scripts for 1,037 out of 1,477 participants in the training sample, among whom 76.5% were female, with an average age of 19.90 years. With respect to race, 80% were White, 5% were Black, 3% were Hispanic, 2% were Asian, and 10% did not disclose their race. These participants' data were used subsequently to build predictive models. For the entire sample, the median number of sentences provided by participants through the chatbot was 40 sentences, ranging from 26 to 112 sentences.

To examine test–retest reliability, we obtained another independent sample. Participants were 74 undergraduate students enrolled in one of two sections of a psychological statistics course at the same university in fall 2021 (Auburn University Protocol 21-351 EX 2111, Examining Test–Retest Reliability of Machine-Inferred Personality Scores). Participants were invited to engage with the chatbot twice with the same set of interview questions as the main study in exchange for extra course credit. Sixty-one participants completed both online chats, with time lapses ranging from 3 to 35 days and an average time lapse of 22 days.

Measures

Self-Reported Questionnaire-Derived Personalities

The 300-item IPIP personality inventory (Goldberg, 1999) was used. The IPIP-300 was designed to measure 30 facet scales in the Big Five framework, modeled after the NEO Personality Inventory (NEO PI-R; Costa & McCrae, 1992). Items were rated on a 5-point scale ranging from 1 (strongly disagree) to 5 (strongly agree). These 30 facet scales, the Big Five domain scales, and their reliabilities in the present samples (both the training and test samples) are presented in Tables 2 and 3. There was no missing data at the item level.

Machine-Inferred Personalities

The machine learning algorithm the AI firm helped develop was used to estimate the 30 personality facet scores based on mined textual features of participants' virtual conversation scripts. See the Analytic Strategies section for NLP and model-building details. Reliabilities of the machine-inferred 30 facet scores and Big Five domain scores5 in the training and test samples are presented in Tables 2 and 3, respectively.

3 Juji developed an algorithm to infer personality scores prior to our study.
We initially applied their original algorithm to spring 2017 data to calculate machine-inferred personality scores; however, results showed that machine-inferred scores did not significantly predict the criteria (GPA and peer-rated college adjustment). After a discussion with the AI firm's developers, we identified potential flaws in their original model-building strategy, which did not use self-reported personality scores as ground truth. Thus, we decided to collect online chat data and self-reported personality data during the subsequent semesters to enable the AI firm to train new predictive models using self-reported personality scores as ground truth.

4 The 25th, 50th, and 75th percentiles of time spent on a virtual conversation were 18, 22, and 30 min, respectively. We include two examples of chatbot conversation scripts across five participants in the online Supplemental Materials A. Out of professional courtesy, the AI firm (Juji, Inc.) provides a free chatbot demo link: https://juji.ai/pre-chat/62f571ec-11d7-4d1e-9336-37066dfa0f48, with these instructions: (a) use a computer (not a cell phone) for the online chat; (b) use Google Chrome as the browser; (c) responses should be in sentences (rather than words); (d) finish the entire chat in one sitting; and (e) if enough inputs are provided, the algorithm will infer and show your personality scores on the screen toward the end of the online chat.
Table 2
Reliabilities of Self-Reported Questionnaire-Derived and Machine-Inferred Personality Facet Scores

Columns (1)–(3) are from the training sample (n = 1,037): (1) coefficient α of self-reported questionnaire-derived scores, (2) split-half reliability of self-reported scores, (3) split-half reliability of machine-inferred scores. Columns (4)–(6) are from the test sample (n = 407): (4) coefficient α of self-reported scores, (5) split-half reliability of self-reported scores, (6) split-half reliability of machine-inferred scores. Column (7) is the test–retest reliability of machine-inferred scores (n = 61).

Personality facet: (1) (2) (3) (4) (5) (6) (7)
Openness (average): .79 .80 .68 .80 .79 .63 .67
Imagination: .81 .82 .65 .82 .82 .58 .67
Art interest: .79 .84 .72 .82 .84 .71 .70
Feelings: .78 .73 .71 .77 .70 .68 .72
Adventure: .75 .74 .59 .75 .73 .40 .49
Intellectual: .81 .78 .69 .83 .78 .70 .69
Liberalism: .82 .88 .70 .80 .86 .73 .76
Conscientiousness (average): .83 .86 .67 .81 .85 .68 .59
Self-efficacy: .80 .85 .65 .72 .80 .63 .58
Orderliness: .86 .88 .68 .88 .89 .67 .63
Dutifulness: .76 .82 .83 .74 .80 .83 .63
Achievement: .84 .85 .65 .82 .84 .67 .59
Self-discipline: .87 .89 .59 .87 .90 .59 .62
Cautiousness: .82 .86 .62 .85 .88 .66 .48
Extraversion (average): .81 .82 .64 .80 .80 .64 .66
Friendliness: .88 .92 .70 .87 .91 .70 .68
Gregarious: .88 .88 .70 .85 .82 .73 .77
Assertiveness: .84 .84 .65 .84 .84 .59 .72
Activity level: .64 .61 .60 .65 .66 .57 .61
Excite seek: .81 .80 .53 .80 .77 .54 .59
Cheerfulness: .80 .84 .66 .77 .83 .70 .58
Agreeableness (average): .75 .82 .73 .79 .78 .68 .63
Trust: .82 .86 .71 .84 .84 .67 .57
Straightforward: .73 .84 .77 .79 .79 .75 .61
Altruism: .76 .84 .75 .81 .78 .75 .69
Cooperation: .71 .80 .76 .77 .75 .68 .61
Modesty: .77 .76 .62 .77 .79 .49 .63
Sympathy: .70 .80 .76 .77 .72 .71 .67
Neuroticism (average): .81 .84 .60 .84 .81 .57 .58
Anxiety: .77 .84 .61 .84 .77 .54 .68
Anger: .88 .88 .62 .89 .88 .62 .45
Depression: .86 .90 .60 .89 .86 .58 .66
Self-conscious: .75 .85 .61 .83 .79 .55 .68
Impulsiveness: .77 .79 .54 .71 .81 .57 .43
Vulnerability: .80 .77 .61 .85 .72 .54 .59

Note. Average = averaged reliabilities of facet scales within a Big Five domain. Test–retest reliabilities of machine-inferred personality scores were based on an independent sample (n = 61) that was not part of the test sample (n = 407).

Objective College Performance

Participants' cumulative GPAs were obtained from the University Registrar.

Peer-Rated College Adjustment

Scholars have advocated expanding the conceptual space of college performance beyond GPA to include alternative dimensions such as social responsibility (Borman & Motowidlo, 1993) and adaptability and life skills (Pulakos et al., 2000). Oswald et al. (2004) proposed a 12-dimension college adjustment model that covers intellectual, interpersonal, and intrapersonal behaviors. Oswald et al. developed a 12-item Behaviorally Anchored Rating Scale to assess college students' adjustment in these 12 dimensions, for instance, (a) knowledge, learning, and mastery of general principles; (b) continuous learning, intellectual interest, and curiosity; and (c) artistic and cultural appreciation and curiosity, and so forth. For each dimension, peers were presented with its name, definition, and two brief examples and were then asked to rate their friend's adjustment on this dimension using a 7-point scale ranging from 1 (strongly disagree) to 7 (strongly agree). There was no missing data at the item level. Cronbach's α was .87 in the test sample. Following Oswald et al., overall peer-rating scores (the sum of scores on the 12 dimensions) were used in subsequent analyses. In the present test sample, GPA and peer-rated college adjustment were modestly correlated (r = .21), suggesting that the two criteria represent distinct, yet somewhat overlapping conceptual domains.

5 We used two methods to calculate machine-inferred personality domain scores. For the first method, we calculated domain scores as averaged machine-inferred facet scores in respective domains.
For the second method, we first calculated self-reported domain scores based on self-reported facet scores and then built five predictive models with self-reported domain scores as ground truth. Next, we applied the trained models to predict machine-inferred domain scores in both the training and test samples. It turned out that both methods resulted in similar results and identical statistical conclusions. We therefore present the results based on the first method in the present article and refer readers to results based on the second method in the online Supplemental Material A (Supplemental Tables S1–S3).

Table 3
Reliabilities (Cronbach's α) of Self-Reported Questionnaire-Derived and Machine-Inferred Personality Domain Scores

Columns (1)–(2) are from the training sample (n = 1,037): (1) self-reported questionnaire-derived scores, (2) machine-inferred scores. Columns (3)–(4) are from the test sample (n = 407): (3) self-reported questionnaire-derived scores, (4) machine-inferred scores.

Personality domain: (1) (2) (3) (4)
Openness: .67 .76 .72 .76
Conscientiousness: .84 .92 .84 .93
Extroversion: .84 .90 .81 .90
Agreeableness: .79 .92 .76 .91
Neuroticism: .86 .90 .83 .88

Note. Cronbach's αs were calculated by treating facet scores under respective personality traits as "items."

Control Variables

We obtained participants' ACT and/or SAT scores from the registrar along with their cumulative GPA. We converted SAT scores into ACT scores using the 2018 ACT–SAT concordance table (American College Testing, 2018). When examining the criterion-related validity of personality scores, we controlled for ACT scores, which served as proxies for cognitive ability.

Analytic Strategies

Natural Language Processing

The conversation text from each participant was first segmented into single sentences. Then, sentence embedding (encoding) was performed on each sentence using the Universal Sentence Encoder (USE; Cer et al., 2018), which is a DL model that was trained and optimized for greater-than-word-length text (e.g., sentences, phrases, and paragraphs). A sentence embedding yields a list of values, usually in an ordered vector, that numerically represents the meaning of a sentence as understood by the machine. The USE takes in a sentence string and outputs a 512-dimension vector. The USE model adopted in this study was pretrained with a deep averaging network (DAN) encoder on a variety of large data sources and a variety of tasks by the Google USE team to capture semantic textual information such that sentences with similar meanings should have embeddings close together in the embedding space. The resultant model (a DL network) is then fine-tuned by adjusting parameters for the training data. Given the advantage of capturing contextual meanings, the USE is commonly used for text classification, semantic similarity, clustering, and other natural language tasks. To obtain text features for predictive models, we averaged each participant's sentence embeddings across sentences, resulting in 512 feature scores for each participant. The same average sentence vector was used as the predictor in all subsequent model building. The practice of averaging embeddings is common in NLP research and has been shown to yield excellent model performance in various language tasks with much higher model training efficiency than other types of models, including some more sophisticated ones (Joulin et al., 2016; Yang et al., 2019).
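As a concrete illustration of this feature-extraction step, the sketch below uses the publicly released DAN-based USE module on TensorFlow Hub to embed one participant's sentences and average them into a single 512-dimensional feature vector. The module URL and the example sentences are illustrative assumptions; the article does not specify which pretrained USE release the AI firm used.

import numpy as np
import tensorflow_hub as hub

# Load a publicly released DAN-based Universal Sentence Encoder from TensorFlow Hub
# (assumed module; the authors' exact pretrained version is not reported).
use_model = hub.load("https://tfhub.dev/google/universal-sentence-encoder/4")

# One participant's segmented conversation sentences (illustrative placeholders).
sentences = [
    "I usually plan out my week before Monday morning.",
    "Meeting new people at campus events gives me a lot of energy.",
    "I tend to double-check my work before turning it in.",
]

sentence_embeddings = np.asarray(use_model(sentences))   # shape: (n_sentences, 512)
participant_features = sentence_embeddings.mean(axis=0)  # the 512-dimensional averaged vector
print(participant_features.shape)                        # (512,)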
Model Building

When multicollinearity among predictors is high and/or when there are many predictors relative to sample size—which is a typical situation ML scholars face—ordinary least squares regression estimators are still unbiased but tend to yield large variances (Zou & Hastie, 2005). Regularized regression methods are used to help address the bias–variance trade-off. For instance, ridge regression penalizes large β's by imposing the same amount of shrinkage across β's, referred to as the L2 penalty (Hoerl & Kennard, 1988). Least absolute shrinkage and selection operator (LASSO) regression shrinks some β's to zero with varying amounts of shrinkage across β's, referred to as the L1 penalty (Tibshirani, 1996). Elastic net regression combines ridge regression and LASSO regression using two hyperparameters: alpha (α) and lambda (λ; Zou & Hastie, 2005).6 The α parameter determines the relative weights given to the L1 versus L2 penalty. When α ranges between 0 and .5, elastic net behaves more like ridge regression (L2 penalty; at α = 0, it becomes completely ridge regression). When α ranges between .5 and 1, elastic net behaves more like LASSO regression (L1 penalty; at α = 1, it becomes completely LASSO regression). The λ parameter determines how severely regression weights are penalized.

We built predictive models on the training sample (n = 1,037) using elastic net regression via the scikit-learn package in Python 3 (Pedregosa et al., 2011). Fivefold cross-validation was used to help tune the model's hyperparameters (α and λ).7 Once hyperparameters were tuned, the model was then fitted to the entire training sample with the hyperparameters fixed to their optimal values to obtain the final model parameters for the training sample. The elastic net has been considered the optimal modeling solution for linear relationships (Putka et al., 2018). Next, we applied the trained models to predict personality scores for the 1,037 and 407 participants in the training and test samples, respectively. We trained 30 predictive models, one for each personality facet.

6 The loss function in elastic net regression is as follows: $\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda\left[\left(\frac{1-\alpha}{2}\right)\sum_{j=1}^{p}\beta_j^2 + \alpha\sum_{j=1}^{p}\lvert\beta_j\rvert\right]$, in which the first term is the ordinary least squares loss function (sum of squared residuals), the second term is the L2 penalty (ridge regression), and the third term is the L1 penalty (LASSO regression), with α indicating the relative weights given to the L1 versus L2 penalty and λ indicating the amount of penalty. Please refer to the online Supplemental Materials A for a nontechnical explanation of how elastic net regression works.

7 Specifically, the training sample was split into five equally sized partitions to build different combinations of training and validation data sets for better estimation of the model's out-of-sample performance. A model was fit using all subsets except the first fold, and then the model was applied to the first fold to examine model performance (i.e., prediction error in the validation data set in the current case). Then the first subset was returned to the training sample, and the second subset was used as the hold-out sample in the second round of cross-validation. The procedure was repeated until the kth round of cross-validation was completed. In each round of cross-validation, a series of possible values of hyperparameters (α and λ) were explored and corresponding model performance indices were obtained. Then, the model performance indices associated with the same set of hyperparameter values were averaged across the five cross-validation trials, and the hyperparameter values associated with the best overall model performance were chosen as the optimal hyperparameters, thus completing the hyperparameter tuning process.
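The following minimal scikit-learn sketch illustrates the tuning-then-refitting procedure described above for a single facet model. The random arrays stand in for the averaged USE features and for one facet's questionnaire scores, and the candidate hyperparameter values are illustrative, not the authors' grid; note that scikit-learn's alpha argument corresponds to the article's λ, while l1_ratio corresponds to the article's α.

import numpy as np
from sklearn.linear_model import ElasticNet
from sklearn.model_selection import GridSearchCV

# Placeholder data standing in for the 512-dimensional averaged USE vectors and
# one IPIP-300 facet score per participant (random values, for illustration only).
rng = np.random.default_rng(42)
X_train, y_train = rng.normal(size=(1037, 512)), rng.normal(size=1037)
X_test = rng.normal(size=(407, 512))

param_grid = {
    "alpha": [0.01, 0.1, 1.0, 10.0],       # overall penalty strength (the article's lambda)
    "l1_ratio": [0.1, 0.3, 0.5, 0.7, 0.9], # L1-versus-L2 mixing weight (the article's alpha)
}

search = GridSearchCV(
    ElasticNet(max_iter=10_000),
    param_grid,
    cv=5,                                  # fivefold cross-validation for hyperparameter tuning
    scoring="neg_mean_squared_error",
)
search.fit(X_train, y_train)               # refits the best model on the full training sample
machine_scores = search.best_estimator_.predict(X_test)  # machine-inferred facet scores
print(search.best_params_, machine_scores.shape)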
Reliability

Internal consistency and test–retest reliability are straightforward to estimate. For split-half reliability, we randomly divided participants' conversation scripts into two halves with an equal number of sentences in each, applied the trained model to obtain two sets of machine scores, and then calculated their correlations with the Spearman–Brown correction. To obtain more consistent results, we shuffled the sentence order for each participant before each split-half trial and reported split-half reliabilities averaged over 20 trials.
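A minimal sketch of this split-half procedure for one facet is shown below, assuming a trained facet model and a sentence encoder are available; the function name and arguments are ours, not the authors' code.

import numpy as np

def split_half_reliability(participant_sentences, facet_model, encode, n_trials=20, seed=0):
    """Spearman-Brown corrected split-half reliability of machine scores for one facet.

    participant_sentences: list of per-participant sentence lists;
    facet_model: a trained regression model exposing .predict();
    encode: callable mapping a list of sentences to an (n_sentences, 512) array (e.g., the USE).
    """
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_trials):
        half_a, half_b = [], []
        for sents in participant_sentences:
            order = rng.permutation(len(sents))          # shuffle sentence order each trial
            cut = len(sents) // 2
            half_a.append(np.asarray(encode([sents[i] for i in order[:cut]])).mean(axis=0))
            half_b.append(np.asarray(encode([sents[i] for i in order[cut:]])).mean(axis=0))
        scores_a = facet_model.predict(np.vstack(half_a))  # machine scores from each half
        scores_b = facet_model.predict(np.vstack(half_b))
        r = np.corrcoef(scores_a, scores_b)[0, 1]
        estimates.append(2 * r / (1 + r))                # Spearman-Brown correction
    return float(np.mean(estimates))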
Factorial Validity

As independent-cluster structures with zero cross-loadings are too ideal to be true for personality data (Hopwood & Donnellan, 2010), and forcing nonzero cross-loadings to zero is detrimental in multiple ways (Zhang et al., 2021), a method that allows cross-loadings is more appropriate. In addition, because we have two types (sets) of personality scores, it is more informative to model both types of scores within the same model so that correlations among latent factors derived from self-reported questionnaire-derived and machine-inferred scores can be directly estimated. Therefore, set exploratory structural equation modeling (set-ESEM; Marsh et al., 2020) is an excellent option. In set-ESEM, two (or more) sets of constructs are modeled within a single model such that cross-loadings are allowed for factors within the same set but are constrained to be zero for constructs in different sets. Set-ESEM overcomes the limitations of full ESEM (i.e., lack of parsimony and potential confounding of constructs) and represents a middle ground between the flexibility of exploratory factor analysis or full ESEM and the parsimony of confirmatory factor analysis (CFA) or conventional structural equation modeling: set-ESEM aims to achieve a balance between CFA and full ESEM in terms of goodness of fit, parsimony, and factor structure definability (i.e., specifications of empirical item-factor mappings corresponding to a priori theories; Marsh et al., 2020). Target rotation was used because we have some prior knowledge about which personality domain (factor) each facet should belong to. The set-ESEM analyses were conducted using Mplus 8.5 (Muthén & Muthén, 1998/2017).

Tucker's congruence coefficients (TCCs) were used to quantify the similarity between factors assessed with the self-report questionnaire and factors inferred by the machine. The TCC has been shown to be a useful index for representing the similarity of factor loading patterns across comparison groups (Lorenzo-Seva & ten Berge, 2006): A TCC value in the range of .85–.94 corresponds to fair similarity, whereas a TCC value at or above .95 can be seen as evidence that the two factors under comparison are identical. TCCs were calculated using the psych package, Version 2.2.5 (Revelle, 2022), in R, Version 4.0.0 (R Core Team, 2020). Given that the TCC focuses primarily on the overall pattern (profile similarity), we also examined absolute agreement in magnitude. Specifically, we plotted factor loadings from machine-inferred facet scores against those from self-reported questionnaire-derived facet scores. We also calculated the root-mean-square error (RMSE) for each domain in both the training and test samples. We further added two lines to the plots to mark the range within which the difference in factor loadings is less than .10. If most points fall within the range formed by the two lines, factor loadings are also similar in magnitude across the two measurement approaches.

Convergent and Discriminant Validity

We used Woehr et al.'s (2012) convergence, discrimination, and method variance indices, calculated from the MTMM matrix, to examine convergent and discriminant validity. The convergence index (C1) was calculated as the average of the monotrait–heteromethod correlations. Conceptually, C1 indicates the proportion of expected observed variance in trait-method units attributable to person main effects and shared variance specific to traits. A positive and large C1 indicates strong convergent validity and serves as the benchmark against which the discrimination indices are examined (Woehr et al., 2012). The first discrimination index (D1) was calculated by subtracting the average of the absolute heterotrait–heteromethod correlations from C1, where the subtracted term is conceptualized as the proportion of expected observed variance attributable to person main effects. Thus, a positive and large D1 indicates that a much higher proportion of expected observed variance can be attributed to traits than to person main effects, and thus high discriminant validity. The second discrimination index (D2) was calculated by subtracting the average of the absolute heterotrait–monomethod correlations from C1. Conceptually, D2 compares the proportion of expected observed variance attributable to traits versus methods. A positive and large D2 indicates high discriminant validity, in that trait-specific variance dominates method-specific variance (Woehr et al., 2012). We also calculated D2a, a variant of D2 computed using only the machine monomethod correlations (i.e., C1 minus the average of the absolute heterotrait–machine monomethod correlations). We did so because previous empirical studies have shown that the machine method tends to pose a major threat to the discriminant validity of machine-inferred personality scores (Park et al., 2015; Tay et al., 2020).
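A minimal sketch of these indices is shown below, assuming a 10 × 10 MTMM correlation matrix R whose first five variables are the self-reported Big Five domain scores (E, A, C, N, O) and whose last five are the corresponding machine-inferred domain scores, in the same trait order. The function names and the input layout are illustrative assumptions, not part of the original analysis code.

```python
import numpy as np

def mtmm_indices(R):
    """Convergence/discrimination indices from a 10 x 10 MTMM correlation matrix."""
    R = np.asarray(R, dtype=float)
    sr, ml = np.arange(5), np.arange(5, 10)
    # Monotrait-heteromethod (convergent) correlations: same trait, different method.
    convergent = np.array([R[i, i + 5] for i in range(5)])
    # Heterotrait-heteromethod correlations: different trait, different method.
    hthm = np.array([abs(R[i, j + 5]) for i in range(5) for j in range(5) if i != j])
    # Heterotrait-monomethod correlations within each method.
    htmm_sr = np.array([abs(R[i, j]) for i in sr for j in sr if i < j])
    htmm_ml = np.array([abs(R[i, j]) for i in ml for j in ml if i < j])
    c1 = convergent.mean()
    d1 = c1 - hthm.mean()                                   # C1 minus mean |HTHM|
    d2 = c1 - np.concatenate([htmm_sr, htmm_ml]).mean()     # C1 minus mean |HTMM|
    d2a = c1 - htmm_ml.mean()                               # machine-method variant
    return {"C1": c1, "D1": d1, "D2": d2, "D2a": d2a,
            "mean_abs_r_self": htmm_sr.mean(), "mean_abs_r_machine": htmm_ml.mean()}
```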
We also calculated the average of the absolute heterotrait–monomethod correlations for the machine method and for the self-report method. If the former is substantially larger than the latter, machine-inferred scores tend to have relatively poorer discriminant validity than self-reported questionnaire-derived scores.

Criterion-Related Validity

To examine criterion-related validity, we first examined the partial correlations between machine-inferred personality domain scores and the two external, non-self-report criteria (GPA and peer-rated college adjustment), controlling for ACT scores. We then ran 10 sets of regression analyses with the two external criteria as separate outcomes and each of the Big Five domain scores as a separate personality predictor. Specifically, the criterion was first regressed on ACT scores (Step 1), then on the self-reported questionnaire-derived personality domain score (Step 2), and then on the respective machine-inferred personality domain score (Step 3). We were aware that the typical analytic strategy for examining criterion-related validity entails entering all five personality domain scores simultaneously in the respective steps of the regression models. However, we were concerned that, due to the poor discriminant validity of machine scores (i.e., high correlations among machine scores), such a strategy might mask the effects of machine scores on the criteria.

Results

Reliability

Tables 2 and 3 present facet- and domain-level reliabilities of self-reported questionnaire-derived and machine-inferred personality scores, respectively. Several observations are noteworthy. First, at the facet level, self-reported questionnaire-derived personality scores showed good internal consistency and comparable split-half reliabilities, which is not surprising, as the IPIP-300 is a well-established personality inventory. The second observation is that, at the facet level, split-half reliabilities of machine-inferred personality scores, albeit somewhat lower than those of self-reported questionnaire-derived personality scores, were overall in the acceptable range. Averaged split-half reliabilities for facet scores in the training and test samples were as follows: r̄ = .68 and .63 for Openness facets; r̄ = .67 and .68 for Conscientiousness facets; r̄ = .64 and .64 for Extroversion facets; r̄ = .73 and .68 for Agreeableness facets; and r̄ = .60 and .57 for Neuroticism facets. These results are comparable to those reported in similar ML studies (e.g., Hoppe et al., 2018; Wang & Chen, 2020; Youyou et al., 2015). Further, split-half reliabilities were comparable between the training and test samples (averaged split-half reliabilities of .66 and .64, respectively), suggesting good cross-sample generalizability. The third observation is that, at the facet level, test–retest reliabilities of machine-inferred personality scores were comparable to the split-half reliabilities. Averaged test–retest reliabilities in the test sample were as follows: r̄tt = .67 for Openness facets, r̄tt = .59 for Conscientiousness facets, r̄tt = .66 for Extroversion facets, r̄tt = .63 for Agreeableness facets, and r̄tt = .58 for Neuroticism facets, with an average r̄tt of .63 across all facet scales.
The modest retest sample size (n = 61) rendered wide 95% confidence intervals (CIs) for the test–retest reliabilities, with CI widths ranging from .21 to .41 (average = .31). Thus, the above findings should be interpreted with caution. We also note that the test–retest reliabilities of machine-inferred personality scores were lower than those of self-reported questionnaire-derived personality scores, which average around .80 according to a meta-analysis (Gnambs, 2014).

The fourth observation is that, at the domain level, machine-inferred personality scores demonstrated somewhat higher internal consistency reliabilities than self-reported questionnaire-derived personality domain scores when facet scores were treated as "items." In the training sample, averaged Cronbach's αs across all Big Five domains for self-reported questionnaire-derived and machine-inferred personality scores were .80 and .88, respectively. In the test sample, the averaged Cronbach's αs were .79 and .88, respectively. These result patterns were somewhat unexpected. One possible explanation is that many significant features shared by the predictive models under the same Big Five domain might have inflated the domain-level Cronbach's αs of the machine scores. Note that the Cronbach's αs of machine-inferred domain scores were identical between the training and test samples (averaged αs = .88 and .88, respectively), indicating excellent cross-sample generalizability.

Based on the above findings, there is promising evidence that machine-inferred personality domain scores demonstrated excellent internal consistency. Further, machine-inferred personality facet scores exhibited overall acceptable split-half and test–retest reliabilities, though these were lower than those of self-reported questionnaire-derived facet scores. In addition, both domain-level internal consistency and facet-level split-half reliabilities of machine-inferred personality scores showed strong cross-sample generalizability.

Factorial Validity

To assess set-ESEM model fit, the chi-square goodness-of-fit statistic, the maximum-likelihood-based Tucker–Lewis index (TLI), the comparative fit index (CFI), the root-mean-square error of approximation (RMSEA), and the standardized root-mean-square residual (SRMR) were calculated: for the training sample, χ2 = 11,370.56, df = 1,435, p < .01, CFI = .87, TLI = .84, RMSEA = .08, SRMR = .03; for the test sample, χ2 = 5,795.88, df = 1,435, p < .01, CFI = .85, TLI = .81, RMSEA = .09, SRMR = .04. Although most of these model fit indices did not meet the commonly used rules of thumb (Hu & Bentler, 1999), we consider them adequate given the complexity of personality structure (Hopwood & Donnellan, 2010) and because they resemble what has been reported in the literature (e.g., Booth & Hughes, 2014; Zhang et al., 2020).

Table 4 presents the set-ESEM factor loadings for the self-reported questionnaire-derived personality scores and the machine-inferred personality scores in the test sample. (The factor loadings for the training sample can be found in Supplemental Table S4 in the online Supplemental Materials A.) The overall patterns of facets loading onto their corresponding Big Five domains are clear.
For the self-reported questionnaire-derived scores, all facets loaded highest on their designated factors, with overall low cross-loadings on the others, except for the activity level facet of the Extroversion factor (which had its highest loading on Conscientiousness) and the feelings facet of the Openness factor (which had its highest loading on Neuroticism). For the machine-inferred personality scores, the overall Big Five patterns were recovered clearly as well, with the exceptions, once again, of the activity level and feelings facets bearing their highest loadings on nontarget factors (though their loadings on the target factors were moderately high). In general, the machine-inferred personality scores largely replicated the patterns and structure observed for the self-reported questionnaire-derived personality scores.

Results further indicated that the TCCs for Extroversion, Agreeableness, Conscientiousness, Neuroticism, and Openness across the training and test samples were .98 and .97, .98 and .97, .98 and .98, .98 and .98, and .95 and .94, respectively. As such, factor profile similarity between the self-reported questionnaire-derived and machine-inferred facet scores was confirmed in both the training and test samples. Moreover, the plots in Figure 2 clearly show that factor loadings from the two measurement approaches are indeed similar to one another in terms of both rank order and magnitude, as most points fall close to the line Y = X. In addition, RMSEs were small in general (.08–.11 in the training sample and .09–.14 in the test sample). Thus, there is promising evidence that the machine-inferred personality facet scores recovered the underlying Big Five structure. Further, factor loading patterns and magnitudes were similar across the two measurement approaches. Thus, the factorial validity of machine-inferred personality scores was established in our samples. In addition, given the similar model fit indices, similar factor loading patterns and magnitudes, and similarly high TCCs between the training and test samples, cross-sample generalizability for factorial validity seems promising.
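Tucker's congruence coefficient used above is a simple normalized inner product of two loading vectors. The article computed it with the psych package in R; the sketch below gives an equivalent computation in Python for two columns of factor loadings (e.g., the loadings of the 30 facets on the self-report Extroversion factor and on the machine-inferred Extroversion factor). The function name is illustrative.

```python
import numpy as np

def tucker_congruence(x, y):
    """Tucker's congruence coefficient between two factor-loading vectors."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(x * y) / np.sqrt(np.sum(x**2) * np.sum(y**2)))
```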
Table 4
Rotated Factor Loading Matrices Based on Set-Exploratory Structural Equation Model for the Test Sample

                         Self-reported questionnaire-derived scores      Machine-inferred personality scores
Personality facets        E      A      C      N      O                   E      A      C      N      O
Friendliness (E1)        0.81   0.21   0.03  −0.13  −0.03                0.68   0.43   0.10  −0.17  −0.20
Gregarious (E2)          0.82   0.13  −0.16  −0.07  −0.07                0.78   0.25  −0.01  −0.07  −0.30
Assertiveness (E3)       0.66  −0.32   0.25  −0.07   0.13                0.77  −0.23   0.37  −0.12   0.01
Activity level (E4)      0.27  −0.18   0.53  −0.04   0.03                0.34  −0.17   0.70  −0.20   0.11
Excitement seeking (E5)  0.55  −0.11  −0.32  −0.08   0.21                0.75  −0.34  −0.21  −0.10  −0.04
Cheerfulness (E6)        0.64   0.27  −0.09  −0.18   0.10                0.75   0.49  −0.05  −0.08   0.05
Trust (A1)               0.34   0.57  −0.05  −0.16  −0.18                0.38   0.71   0.06  −0.19  −0.16
Straightforward (A2)    −0.12   0.56   0.42  −0.01   0.02               −0.09   0.73   0.41  −0.02   0.09
Altruism (A3)            0.38   0.58   0.22   0.01   0.28                0.35   0.71   0.25  −0.02   0.10
Cooperation (A4)        −0.10   0.79   0.05  −0.14  −0.06                0.06   0.91   0.09  −0.04  −0.04
Modesty (A5)            −0.48   0.44  −0.12   0.05  −0.11               −0.64   0.49   0.01   0.22  −0.10
Sympathy (A6)            0.14   0.59   0.11   0.13   0.31                0.24   0.75   0.11   0.21   0.25
Self-efficacy (C1)       0.08  −0.05   0.69  −0.30   0.20                0.25   0.04   0.63  −0.36   0.21
Orderliness (C2)        −0.01   0.12   0.64   0.15  −0.27                0.11   0.05   0.90   0.17  −0.22
Dutifulness (C3)        −0.13   0.38   0.57  −0.15   0.13                0.02   0.50   0.54  −0.20   0.22
Achievement (C4)         0.24   0.00   0.71  −0.07   0.09                0.27   0.01   0.78  −0.13   0.10
Self-discipline (C5)     0.09   0.07   0.66  −0.16  −0.18                0.22  −0.09   0.73  −0.15  −0.24
Cautiousness (C6)       −0.42   0.22   0.61  −0.11  −0.04               −0.55   0.31   0.70  −0.23   0.11
Anxiety (N1)            −0.01   0.05   0.17   0.93   0.03                0.06   0.19   0.18   0.95   0.05
Anger (N2)               0.08  −0.45   0.07   0.68   0.02                0.05  −0.55  −0.01   0.82  −0.11
Depression (N3)         −0.30  −0.01  −0.23   0.58   0.13               −0.35  −0.14  −0.24   0.56   0.22
Self-conscious (N4)     −0.56   0.19  −0.13   0.46  −0.01               −0.52   0.30  −0.11   0.49   0.22
Impulsiveness (N5)       0.27  −0.07  −0.46   0.35   0.03                0.23  −0.17  −0.46   0.53   0.09
Vulnerability (N6)       0.05   0.13  −0.14   0.81  −0.14               −0.02   0.29  −0.15   0.84  −0.13
Imagination (O1)        −0.03  −0.03  −0.24   0.00   0.68                0.05  −0.02  −0.26   0.11   0.84
Art_Interest (O2)        0.07   0.25   0.02   0.14   0.64                0.15   0.36   0.13   0.27   0.68
Feelings (O3)            0.32   0.24   0.15   0.50   0.43                0.37   0.43   0.25   0.54   0.37
Adventure (O4)           0.17   0.04  −0.17  −0.27   0.49                0.20  −0.07  −0.19  −0.45   0.55
Intellectual (O5)       −0.24  −0.18   0.23  −0.20   0.76               −0.21  −0.35   0.31  −0.22   0.82
Liberalism (O6)         −0.23   0.03  −0.20   0.02   0.37               −0.37  −0.13  −0.23   0.09   0.58

Note. n = 407. E = Extroversion; A = Agreeableness; C = Conscientiousness; N = Neuroticism; O = Openness.

Convergent and Discriminant Validity

Tables 5 and 6 present the MTMM matrices of latent and manifest Big Five domain scores, respectively. As can be seen in Table 7, at the latent variable level, the convergence indices (C1s) were relatively large (.59 and .48), meaning that 59% and 48% of the observed variance can be attributed to person main effects and trait-specific variance in the training and test samples, respectively. The first discrimination indices (D1s: .48 and .38) indicate that 48% and 38% of the observed variance can be attributed to trait-specific variance in the training and test samples, respectively. Contrasting these values with the C1s suggests that most of the convergence is contributed by trait-specific variance.
The second discrimination indices (D2s: .43 and .33) were positive and moderate, indicating that the percentage of shared variance specific to traits was 43 and 33 percentage points higher than the percentage of shared variance specific to methods in the training and test samples, respectively. D2a, calculated using only the machine monomethod correlations, was .39 and .28 in the training and test samples, respectively, suggesting that the machine method tended to yield somewhat heightened levels of method variance. In addition, the average absolute intercorrelations of the self-reported questionnaire-derived domain scores were .12 and .11 in the training and test samples, respectively, whereas the average absolute intercorrelations of the machine-inferred domain scores were .20 and .20. Thus, taken together, machine-inferred latent personality domain scores demonstrated excellent convergent validity but somewhat weaker discriminant validity.

Table 7 indicates that, at the manifest variable level, the convergence index (C1) was .57 and .46 in the training and test samples, respectively, indicating good convergent validity. D1 was .40 and .31 in the training and test samples, respectively, suggesting that most of the convergence is contributed by trait-specific variance. D2 was .30 and .19 in the training and test samples, respectively, showing that the percentage of shared variance specific to traits was substantially higher than the percentage of shared variance specific to methods. D2a was .24 and .11 in the training and test samples, respectively. The magnitude of D2a in the test sample suggests that the percentage of shared variance specific to traits was only slightly higher than the percentage of shared variance specific to the machine method. Indeed, Table 6 shows that four of the 10 heterotrait–machine monomethod correlations exceeded C1 (.46). In addition, the average absolute intercorrelations of the self-reported questionnaire-derived domain scores were .20 and .19 in the training and test samples, respectively, whereas the average absolute intercorrelations of the machine-inferred domain scores were .33 and .35, respectively (Δs = .13 and .16). Thus, machine-inferred manifest domain scores demonstrated excellent convergent validity but weaker discriminant validity. The poor discriminant validity in the test sample is particularly concerning.

Figure 2
Plots of Factor Loadings of Facet Scores of the Two Measurement Approaches and Root-Mean-Square Errors Across Big Five Domains
Note. RMSE = root-mean-square error; Train = training sample; Test = test sample; SR = self-reported questionnaire-derived facet scores; ML = machine-inferred facet scores.

Table 5
Latent Factor Correlations of Self-Reported Questionnaire-Derived and Machine-Inferred Personality Domain Scores in the Training and Test Samples
Personality domains: 1. Extroversion (S), 2. Agreeableness (S), 3. Conscientiousness (S), 4. Neuroticism (S), 5. Openness (S), 6. Extroversion (M), 7. Agreeableness (M), 8. Conscientiousness (M), 9. Neuroticism (M), 10. Openness (M)

1 −.02 .06 −.15 .22 .50 .07 .08 −.06 −.18 2 −.03 .19 −.03 .07 .09 .45 .19 .01 .01 3 .15 .20 −.26 .04 .16 .08 .50 −.24 .05 4 5 6 7 8 9 10 −.26 .05 −.22 .16 .08 .01 −.03 .60 .08 .20 −.12 −.03 .10 .61 .13 .16 .01 .16 .18 .30 −.23 −.22 .18 .18 .56 −.10 .04 .33 .32 .35 .15 .09 −.17 .13 −.23 .55 .05 −.28 .20 −.30 −.14 .07 .00 .04 .64 −.17 .09 .04 .07 −.06 −.10 .23 −.12 .38 .09 −.14 .03 −.04 .08 .57 −.37 .05 .05

Note. (S) = self-reported questionnaire-derived scores; (M) = machine-inferred scores. Correlations above the diagonal are based on the training sample; correlations below the diagonal are based on the test sample. For the training sample, n = 1,037; for the test sample, n = 407. Bold values are given to highlight convergent correlations.

Table 7 also indicates that D2a was .39 versus .24 at the latent versus manifest variable level in the training sample (ΔD2a = .15) and .28 versus .11 in the test sample (ΔD2a = .17). These result patterns suggest that the discriminant validity of machine-inferred personality domain scores was somewhat higher at the latent variable level than at the manifest variable level.

Based on the above findings, there is evidence that machine-inferred latent personality domain scores displayed excellent convergent validity and somewhat weaker discriminant validity. Although machine-inferred manifest personality domain scores also showed good convergent validity, their discriminant validity was less impressive, particularly in the test sample. Nevertheless, overall, our results show improvements in differentiating among traits in the machine-inferred personality scores relative to existing machine learning applications (e.g., Hickman et al., 2022; Marinucci et al., 2018; Park et al., 2015). Table 7 also shows that C1, D1, and D2 (D2a) dropped by around .10 from the training to the test sample at both the latent and manifest variable levels. Thus, there is evidence for reasonable levels of cross-sample generalizability of convergent and discriminant relations.

Criterion-Related Validity

Table 6 reports the bivariate correlations between manifest personality domain scores and the two criteria of cumulative GPA and peer-rated college adjustment. We also calculated partial correlations controlling for ACT scores. Specifically, controlling for ACT scores, GPA was significantly correlated with four machine-inferred domain scores: Openness (r = −.11), Conscientiousness (r = .13), Extroversion (r = .12), and Neuroticism (r = −.12); peer-rated college adjustment was significantly correlated with four machine-inferred domain scores: Conscientiousness (r = .17), Extroversion (r = .18), Agreeableness (r = .12), and Neuroticism (r = −.11). Thus, machine-inferred domain scores demonstrated some initial evidence of low levels of criterion-related validity. The correlations between self-reported questionnaire-derived domain scores and the two criteria were largely consistent with previous research (e.g., McAbee & Oswald, 2013; Oswald et al., 2004).
Next, we conducted 10 sets of hierarchical regression analyses (Five Domains × Two Criteria) to examine the incremental validity of machine-inferred domain scores.8 Table 8 presents the results of these regression analyses. Several observations are noteworthy. First, whereas ACT scores were a significant predictor of cumulative GPA, they did not predict peer-rated college adjustment at all, suggesting that the former criterion has a strong cognitive connotation, whereas the latter does not. Second, after controlling for ACT scores, self-reported questionnaire-derived personality domain scores exhibited modest incremental validity for the two criteria, with four domain scores being significant predictors of cumulative GPA and three domain scores being significant predictors of peer-rated college adjustment.

Third, after controlling for ACT scores and self-reported questionnaire-derived personality domain scores, machine-inferred personality domain scores, overall, failed to explain additional variance in the two criteria; however, there were three important exceptions. Specifically, in one set of regression analyses involving Extroversion scores as the predictor and GPA as the criterion, the machine-inferred Extroversion scores explained an additional 3% of the variance in cumulative GPA (β = .18, p < .001). Interestingly, the regression coefficient of the self-reported questionnaire-derived Extroversion scores became more negative from Step 2 to Step 3 (with β moving from −.09 [p = .038] to −.17 [p < .001]), suggesting a potential suppression effect (Paulhus et al., 2004). In another set of regression analyses, which involved Extroversion scores as the predictor and peer-rated college adjustment as the criterion, machine-inferred Extroversion scores explained an additional 3% of the variance in the criterion (β = .18, p < .001), with self-reported questionnaire-derived Extroversion scores being a nonsignificant predictor in Step 2 (β = .07, p = .217) and Step 3 (β = −.002, p = .980). In still another set of regression analyses, which involved Neuroticism scores as the predictor and cumulative GPA as the criterion, machine-inferred Neuroticism scores explained an additional 1% of the variance in the criterion (β = −.13, p < .001), with self-reported questionnaire-derived Neuroticism scores being a nonsignificant predictor in Step 2 (β = .03, p = .549) and Step 3 (β = .08, p = .099).

8 We also ran 60 sets of regression analyses (30 Facets × Two Criteria) to examine the incremental validity of machine-inferred personality facet scores. The results are presented in the online Supplemental Materials A (Supplemental Table S5).
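The three-step structure of these analyses can be sketched as follows. This is a minimal illustration of how R² and ΔR² accumulate across steps using ordinary least squares; the variable names (act, sr_domain, ml_domain, criterion) are placeholders, and it does not reproduce the standardized coefficients or significance tests reported in Table 8.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

def r_squared(predictors, criterion):
    """R-squared from an OLS regression of the criterion on the given predictors."""
    X = np.column_stack(predictors)
    return LinearRegression().fit(X, criterion).score(X, criterion)

def hierarchical_steps(act, sr_domain, ml_domain, criterion):
    r2_1 = r_squared([act], criterion)                        # Step 1: ACT only
    r2_2 = r_squared([act, sr_domain], criterion)             # Step 2: + self-report score
    r2_3 = r_squared([act, sr_domain, ml_domain], criterion)  # Step 3: + machine score
    return {"R2_step1": r2_1,
            "R2_step2": r2_2, "dR2_step2": r2_2 - r2_1,
            "R2_step3": r2_3, "dR2_step3": r2_3 - r2_2}
```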
Table 6
Means, Standard Deviations, and Correlations Among Study Variables in the Training and Test Samples at Manifest Variable Level

Study variables: 1. Openness (S), 2. Conscientiousness (S), 3. Extroversion (S), 4. Agreeableness (S), 5. Neuroticism (S), 6. Openness (M), 7. Conscientiousness (M), 8. Extroversion (M), 9. Agreeableness (M), 10. Neuroticism (M), 11. PRCA, 12. Cumulative GPA, 13. ACT.

.50** .21** −.01 −.13* −.15** −.08 −.05 .12* .04 −.03 .42** −.53** .18** .08 −.13* .56** .57** −.61** .17** .14* .04 .07 .40** −.44** .00 .46** .22** .15** −.31** .25** .33** .05 .05 −.35** −.13** .14** .45** .09 −.18** .08 −.13* −.13* −.17** .04 .28** .17** .42** −.07 .14* .14* .05 .11* −.21** −.19** .10* .40** −.16** .02 −.003 .03 −.30** .17** .24** −.02 −.03 .19** .00 .53** .28** .27** −.28** .01 .05 −.43** −.50** −.12** .00 −.04 .24** .15** −.02 .58** −.07 −.21** .06 .18** −.01 −.14* .08 .38 .43 .42 .34 .46 .13 .13 .16 .13 .14 .90 .59 3.67 3.44 3.59 3.49 3.61 2.77 3.42 3.60 3.55 3.67 2.75 5.28 3.18 26.28
.36 .44 .48 .39 .53 .16 .16 .19 .17 .18 N/A N/A N/A 3.43 3.60 3.47 3.65 2.85 3.44 3.60 3.47 3.65 2.85 N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A .15** −.31** −.32** .04 .55** .25** −.55** −.54** .07* N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A N/A .10** .23** .14** .57** .03 .17** .54** .36** −.12** .26** .59** .15** −.28** −.26** .57** .63** −.02 −.13** .11** .12** .20** .34** .04 .11** .20**

Note. n = 1,037 for the training sample; n = 289–407 for the test sample. (S) = self-reported questionnaire-derived scores; (M) = machine-inferred scores; PRCA = peer-rated college adjustment; GPA = grade point average; N/A = not available; ACT = American College Testing. Statistics above the diagonal are for the training sample; statistics below the diagonal are for the test sample. Bold values are given to highlight convergent correlations.
* p < .05. ** p < .01.

Based on the above findings, there is some evidence that machine-inferred personality domain scores had overall comparable, low criterion-related validity relative to self-reported questionnaire-derived personality domain scores. The only exception was that machine-inferred Conscientiousness domain scores had noticeably lower criterion-related validity than self-reported questionnaire-derived Conscientiousness domain scores. There is also preliminary evidence that machine-inferred personality domain scores had incremental validity over ACT scores and self-reported questionnaire-derived domain scores in some analyses.

Supplemental Analyses

Robustness Checking

During the review process, an issue was raised about whether the reliability and validity of machine-inferred personality scores might be compromised among participants who provided fewer inputs than others during the virtual conversation. We thus conducted additional analyses to examine the robustness of our findings. Based on the number of sentences participants in the test sample provided during the online chat, we divided the test sample into thirds and formed three test sets: (a) the bottom 1/3 of participants (n = 165), (b) the bottom 2/3 of participants (n = 285), and (c) all participants (n = 407). We then reran all analyses on the two smaller test sets. Results indicated that, barring a couple of exceptions, the reliability and validity of machine-inferred personality scores were very similar across the test sets, thus providing strong evidence for the robustness of our findings with respect to the volume of input participants provided.9

Exploring Content Validity of Machine Scores

We also conducted supplemental analyses that allowed for an indirect and partial examination of the content validity issue. Following Park et al.'s (2015) approach, we used the scikit-learn package in Python (Pedregosa et al., 2011) to count one-, two-, and three-word phrases (i.e., n-grams with n = 1, 2, or 3) in the text.
Words and phrases that occurred in less than 1% of the conversation scripts were removed from the analysis. This created a document-term matrix populated by the counts of the remaining phrases. After identifying the n-grams and their frequency scores for each participant, we calculated the correlations between the machine-inferred personality facet scores derived from the chatbot and the frequencies of the language features in the test sample. If machine-inferred personality scores have content validity, they should correlate significantly with language features that are known to reflect a specific personality facet. For each personality facet, we selected the 100 most positively and 100 most negatively correlated phrases. Comprehensive lists of all language features and correlations can be found in the online Supplemental Material B.

9 We thank Associate Editor Fred Oswald for encouraging us to examine the robustness of our findings. Interested readers are referred to the online Supplemental Materials A (Supplemental Tables S6–S15) for detailed analysis results.

Table 7
Multitrait–Multimethod Statistics for Machine-Inferred Personality Domain Scores

Present and previous study samples (rows): latent personality domain scores (self-report models), training sample (n = 1,037) and test sample (n = 407); manifest personality domain scores (self-report models), training sample (n = 1,037) and test sample (n = 407); Park et al. (2015; self-report models, test sample); Hickman et al. (2022), self-report models (in the training sample), interviewer-report models (in the training sample), and interviewer-report models (in the test sample); Harrison et al. (2019; other-report models, split-half validation)a; Marinucci et al. (2018; self-report models, in the training sample).
Indices (columns): C1, D1, D2, D2a, MV, MVa.
C1 D1: .59 .48 .48 .38 .57 .46 .38 .12 .40 .37 .65 .21
D2 D2a MV MVa: .43 .33 .39 .28 .05 .05 .09 .10 .40 .31 .27 .30 .19 .15 .24 .11 .10 .10 .12 .11 .16 .20 .16 .04 .20 .17 −.05 .09 .06 .24 −.03 .01 .09 .07 .24 −.07 .09 .11 .10 .04 .12 .10

Note. C1 = convergence index (average of monotrait–heteromethod correlations); r̄HTHM = average of heterotrait–heteromethod correlations; D1 = Discrimination Index 1 (C1 minus the average of heterotrait–heteromethod correlations); D2 = Discrimination Index 2 (C1 minus the average of heterotrait–monomethod correlations); D2a = Discrimination Index 2 calculated using only machine method heterotrait–monomethod correlations; MV = method variance (average of heterotrait–monomethod correlations minus average of heterotrait–heteromethod correlations); MVa = method variance due to the machine method; CEO = chief executive officer.
a Unlike many other machine learning studies, Harrison et al. (2019) split CEOs' text into two halves. They built predictive models based on the first half of the text and then tested them on the other half of the text.

The results show that the most strongly correlated phrases with the predictions of each personality facet were largely consistent with the characteristics of that facet. For example, a high level of machine score on the artistic interest facet was associated with phrases reflecting music and art (e.g., music, art, poetry) and exploration (e.g., explore, creative, reading). In contrast, a low level of machine score on the artistic interest facet was associated with phrases showing enjoyment of sports and outdoor activities (e.g., football, game, sports). Based on the above supplemental analyses, it appears that our predictive models captured some aspects of language features that can predict specific personality facets and, therefore, showed partial evidence for the content validity of machine-inferred personality scores.
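A minimal sketch of this content-validity check is shown below, matching the described settings: one- to three-word phrases, removal of phrases appearing in fewer than 1% of the conversation scripts, and correlation of each phrase's frequency with a machine-inferred facet score. The inputs (scripts, facet_scores) and the function name are placeholders, and the authors' exact text preprocessing may differ.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

def top_correlated_phrases(scripts, facet_scores, k=100):
    """scripts: one transcript string per participant; facet_scores: matching 1-D array."""
    vectorizer = CountVectorizer(ngram_range=(1, 3), min_df=0.01)  # keep phrases in >= 1% of scripts
    dtm = vectorizer.fit_transform(scripts).toarray()              # document-term matrix of counts
    phrases = vectorizer.get_feature_names_out()
    scores = np.asarray(facet_scores, dtype=float)
    # Pearson correlation between each phrase's counts and the machine-inferred facet scores.
    r = np.array([np.corrcoef(dtm[:, j], scores)[0, 1] for j in range(dtm.shape[1])])
    order = np.argsort(r)
    most_positive = phrases[order[-k:]][::-1]
    most_negative = phrases[order[:k]]
    return most_positive, most_negative
```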
Table 8
Regression of Cumulative GPA and Peer-Rated College Adjustment on ACT Scores and Self-Reported Questionnaire-Derived and Machine-Inferred Personality Scores

Criteria (columns): Cumulative GPA (n = 379) and peer-rated college adjustment (n = 289), each with Step 1, Step 2, and Step 3.
Predictors (rows, in five blocks): ACT, Openness (S), Openness (M), R2, ΔR2; ACT, Conscientiousness (S), Conscientiousness (M), R2, ΔR2; ACT, Extroversion (S), Extroversion (M), R2, ΔR2; ACT, Agreeableness (S), Agreeableness (M), R2, ΔR2; ACT, Neuroticism (S), Neuroticism (M), R2, ΔR2.

.50** (.04) .51** (.04) −.19** (.04) .51** (.04) −.20** (.05) .03 (.06) .28** .00 .48** (.04) .33** (.05) −.04 (.05) .34** .00 .50** (.04) −.17** (.05) .18** (.05) .28** .03** .49** (.04) .11* (.05) .01 (.05) .26** .00 .48** (.04) .08 (.05) −.13** (.05) .26** .01** −.01 (.09) −.004 (.06) −.02 (.06) −.002 (.03) −.01 (.08) −.01 (.08) .00 .00 −.02 (.06) .24** (.06) .06 (.06) .08** .00 .02 (.06) −.002 (.07) .18** (.06) .03* .03** −.01 (.06) .13* (.06) .07 (.06) .03* .01 −.01 (.06) −.13 (.06) −.07 (.07) .03* .00 .24** .50** (.04) .24** .50** (.04) .24** .50** (.04) .24** .50** (.04) .24** .28** .04** .48** (.04) .31** (.04) .34** .10** .48** (.04) −.09* (.05) .25** .01* .49** (.04) .11* (.04) .26** .02* .50** (.04) .03 (.05) .25** .01 .00 −.01 (.09) .00 −.01 (.09) .00 −.01 (.09) .00 −.01 (.09) .00 .00 .00 −.02 (.06) .27** (.06) .07** .07** .01 (.09) .07 (.06) .01 .01 −.01 (.07) .15** (.06) .02* .02** −.01 (.06) −.16** (.06) .02* .02**

Note. (S) = self-reported questionnaire-derived scores; (M) = machine-inferred scores; GPA = grade point average; ACT = American College Testing.
* p < .05. ** p < .01.

Discussion

The purpose of the present study was to explore the feasibility of measuring personality indirectly through an AI chatbot, with a particular focus on the psychometric properties of machine-inferred personality scores. Our AI chatbot approach differs from (a) earlier approaches that relied on individuals' willingness to share their social media content (Youyou et al., 2015), which is not a given in talent management practice, and (b) automated video interview systems (e.g., Hickman et al., 2022), in that our approach allows for two-way communication and thus more closely resembles natural conversation. Results based on an ambitious study that involved approximately 1,500 participants, adopted a design that allowed for examining cross-sample generalizability, built predictive models at the personality facet level, and used non-self-report criteria showed some promise for the AI chatbot approach to personality assessment.
Specifically, we found that machine-inferred personality scores (a) had overall acceptable reliability at both the domain and facet levels, (b) yielded a comparable factor structure to self-reported questionnaire-derived personality scores, (c) displayed good convergent validity but relatively poor discriminant validity, (d) showed low criterion-related validity, and (e) exhibited incremental validity in some analyses. In addition, there is strong evidence for cross-sample generalizability of various aspects of the psychometric properties of machine scores.

Important Findings and Implications

Several important findings and their implications warrant further discussion. Regarding reliability, our finding that the average test–retest reliability of all facet machine scores was .63 in a small independent sample (with an average time lapse of 22 days) compared favorably to the .50 average test–retest reliability (with an average time lapse of 15.6 days) reported by Hickman et al. (2022), but unfavorably to both Harrison et al.'s (2019) average of .81 based on other-report models (with a 1-year time lapse) and Park et al.'s (2015) average of .70 based on self-report models (with a 6-month time lapse). Given that both Harrison et al. and Park et al. relied on a much larger body of text obtained from participants' social media, it seems that it is not the sample size, nor the length of the time lapse, but the size of the text that determines the magnitude of test–retest reliability of machine-inferred personality scores. Another possible explanation is that, in Hickman et al.'s study and ours, participants were asked to engage in the same task (online chat or video interviews) twice with relatively short time lapses, which might have evoked a Practice × Participant interaction effect. That is, participants might understand the questions better and might provide better quality responses during the second interview or conversation; however, such a practice effect is unlikely to be uniform across participants, thus resulting in lowered test–retest reliability. In contrast, Harrison et al.'s and Park et al.'s studies relied on participants' social media content over time and thus were immune from the practice effect. However, one may counter that there might be some similar (repetitive) content on social media, which might have resulted in overestimated test–retest reliability of machine scores.

Regarding discriminant validity, one interesting finding is that discriminant validity seemed higher at the latent variable level than at the manifest variable level in both the training and test samples. One possible explanation is that set-ESEM accounted for potential cross-loadings, whereas the manifest variables did not. There is evidence that omitted cross-loadings inflate correlations among latent factors (e.g., Asparouhov & Muthén, 2009).

Regarding criterion-related validity, despite the low criterion-related validity of machine-inferred personality scores in predicting GPA and peer-rated college adjustment, our findings are comparable to the criterion-related validities of self-reported questionnaire-derived personality domain scores reported in an influential meta-analysis (Hurtz & Donovan, 2000); for instance, for Conscientiousness, r = .14 and .17 versus .14, and for Neuroticism, r = −.13 and −.15 versus −.09.
Further, eight of the 10 criterion-related validities of machine scores were in the 40th (r = .12) to 60th (r = .20) percentile range based on an empirically derived effect size distribution for psychological characteristics–performance correlations (cf. Bosco et al., 2015).

The finding that, in three regression analyses, machine-inferred personality scores exhibited incremental validity suggests that the part of the variance in machine scores that is not shared by self-reported questionnaire-derived personality scores can be criterion relevant. However, we note that we know very little about the exact nature of this criterion-relevant variance; for instance, does it capture personality-relevant information or some sort of biasing factor? In the former case, we speculate that the unshared variance might have captured the reputation component of personality (as our daily conversations clearly influence how we are viewed by others), which has consistently been shown to contribute substantially to the prediction of performance criteria (e.g., Connelly & Ones, 2010; Connelly et al., 2022). However, no empirical studies have tested this speculation. This lack of understanding represents a general issue in the ML literature: predictive utility is often established first by practitioners, with theoretical understanding lagging behind and awaiting scientific attention. We thus call for future research to close this "scientist–practitioner gap."

Another important finding is that the present study, which was based on self-report models, yielded better psychometric properties (e.g., substantially higher convergent validity and cross-sample generalizability) of machine-inferred personality scores than many similar ML studies that were also based on self-report models (e.g., Hickman et al., 2022; Marinucci et al., 2018). We offer several tentative explanations that might help reconcile this inconsistency. First, we would like to rule out data leakage as a contributor to our better findings. There are two main types of data leakage (Cook, 2021): (a) target leakage and (b) train-test contamination. Target leakage happens when the training model contains predictors that are updated or created after the target value is realized, for instance, if the algorithm for inferring personality scores is constantly being updated based on new data. However, because this AI firm's algorithm for inferring personality scores is static rather than constantly updated on the fly, target leakage is ruled out. The second type of data leakage, train-test contamination, happens when researchers do not carefully distinguish the training data from the test data; for instance, researchers conduct preprocessing and/or train the model using both the training and test data. This would result in overfitting. However, in our study, the training and test samples were kept separate, and hence the test data were excluded from any model-building activities, including the fitting of preprocessing steps. Therefore, we are confident that data leakage cannot explain our superior findings on the psychometric properties of machine scores.

We attribute our success to three factors. The first factor is the sample size.
The sample we used to train our predictive models (n = 1,037) was larger than the samples used in similar ML studies. Larger samples may help detect more subtle trait-relevant features and support the modeling of complex relationships during model training, making the trained models more accurate (Hickman et al., 2022). The second factor is the data collection method used. The AI firm's chatbot system allows for two-way communication, engaging users in small talk, providing empathetic comments, and managing user digressions, all of which should lead to higher quality data. The third factor concerns the NLP method used. The present study used a sentence embedding technique, the USE, which is a DL-based NLP method. The USE goes beyond simple count-based representations, such as bag-of-words and Linguistic Inquiry and Word Count (Pennebaker et al., 2015), used in previous ML studies and retains contextual information in language and relations among whole sentences (Cer et al., 2018). There is consistent empirical evidence showing that DL-based NLP techniques tend to outperform n-gram and lexical methods (e.g., Guo et al., 2021; Mikolov, 2012).

Contributions and Strengths

The present study makes several important empirical contributions to personality science and the ML literature. First, in the most general sense, the present study represents the most comprehensive examination of the psychometric properties of machine-inferred personality scores. Our findings, taken as a whole, greatly enhance confidence in the ML approach to personality assessment. Second, the present study demonstrates, for the first time, that machine-inferred personality facet scores are structurally equivalent to self-reported questionnaire-derived personality facet scores. Third, to the best of our knowledge, the present study is the first in the broad ML literature to show the incremental validity of machine-inferred personality scores over questionnaire-derived personality scores with non-self-report criteria. Admittedly, scholars in other fields, such as strategic management (e.g., Harrison et al., 2019; Wang & Chen, 2020) and marketing (e.g., Liu et al., 2021; Shumanov & Johnson, 2021), have reported that machine-inferred personality scores predicted non-self-report criteria. Further, there have been trends in using ML in organizational research to predict non-self-report criteria (e.g., Putka et al., 2022; Sajjadiani et al., 2019; Spisak et al., 2019). However, none of these studies have reported incremental validity of machine-inferred personality scores beyond self-reported questionnaire-derived scores. This is significant because establishing the incremental validity of machine-inferred scores is a precondition for the ML approach to personality assessment, and any other new talent signal, to gain legitimacy in talent management practice (Chamorro-Premuzic et al., 2016).

Two methodological strengths of the present study should also be noted. First, building predictive models at the personality facet level opened opportunities to examine several important research questions, such as internal consistency at the domain level, factorial validity, and convergent and discriminant validity at the latent variable level. These questions cannot be investigated when predictive models are built at the domain level, which is the case for most ML studies.
Second, the present research design, with a training sample and an independent test sample, allowed us to examine numerous aspects of cross-sample generalizability, including reliabilities, factorial validity, and convergent and discriminant validity.

Study Limitations

Several study limitations should be kept in mind when interpreting our results. The first limitation concerns the generalizability of our models and findings. Our samples consisted of young, predominantly female, college-educated people; as a result, our models and findings might not generalize to working adults. In the present study, we used the AI firm's default interview questions to build and test the predictive models. Given that the AI firm's chatbot system allows for tailor-made conversation topics, interview questions, and their temporal order, we do not know to what extent predictive models built on different sets of interview questions would yield similar machine scores. Further, the present study context was a nonselection, research context, and it is unclear whether our findings would generalize to selection contexts, where applicants are motivated to fake and thus might provide quite different responses during virtual conversations. In addition, for both the USE and the elastic net analyses, it would be difficult to replicate our models in exactly the same form. For instance, using any pretrained model other than the USE (e.g., Bidirectional Encoder Representations from Transformers; Devlin et al., 2019) would produce a different dimensional arrangement of vector representations. Therefore, we call for future research to examine the cross-model, cross-method, cross-population, and cross-context generalizability of machine-inferred personality scores.

The second limitation is that the quality of the predictive models we built might have been hampered by several factors: for instance, some participants might not have responded truthfully to the self-report personality inventory (IPIP-300); the USE might have inappropriately encoded regional dialects not well represented in its training data; some participants were much less verbally expressive than others, yielding fewer data; and some participants were less able or less interested in contributing to the virtual conversations, to name just a few. In addition, models built in high-stakes versus low-stakes situations might yield different parameters. Future model-building efforts should take these factors into account.

The third limitation is that we were not able to examine the content validity of machine scores directly. A major advantage of DL models lies in their accuracy and generalizability. However, as of now, these DL models, including the USE used in the present study, are not very interpretable and have weak theoretical commitments, as the DL-derived features do not have substantive meanings. We thus encourage computer scientists and psychologists to work together to figure out the substantive meanings of the high-dimensional vectors extracted from various DL models, which would allow for a fuller and more direct investigation of the content validity of machine-inferred personality scores. This is aligned with the current trend in which explainable AI has become a growing area of research in ML (Tippins et al., 2021).

Future Research Directions

Despite some promising findings, we recommend several future research directions to further advance the ML approach to personality assessment.
First, the default interview questions in the AI firm's chatbot system should probably be considered semistructured in nature: although all participants went through the same interview questions, the questions aimed to engage users, with no explicit intention to solicit personality-related information. It thus remains to be seen whether more structured interview questions, designed to systematically tap into various personality traits, and predictive models built accordingly would yield more accurate and valid machine-inferred personality scores. For instance, it is possible to develop 30 interview questions, each targeting one of the 30 personality facets. We are optimistic about such an approach for the following reasons. First, when interview questions are built to inquire about respondents' thoughts, feelings, and behaviors associated with specific personality facets and domains, the language data will be contextually connected and trait driven. As a result, NLP algorithms should be able to capture not only the linguistic cues but also the trait-relevant content of respondents' narratives. For instance, when assessing Conscientiousness, a question that asks about personal work style should prompt a respondent to type an answer using more work-relevant words and phrases and to depict their work style in a way that allows algorithms to extract relevant features more accurately. This should improve the predictive accuracy of the models, resulting in better convergent validity. At the same time, when different questions are asked that tap into different personality facets or domains, text content unique to specific facets or domains is expected to be solicited. Questions representing different personality facets should contribute uniquely to the numerical representations of the texts, as the semantics of the texts would cluster differently due to facet or domain differences. As a result, in the prediction algorithms, clusters of language features that are more relevant to a given personality facet or domain would carry more weight (i.e., predictive power) for that specific trait and less for others. In other words, language features mined in this way are less likely to overlap and should result in improved discriminant validity. However, one may counter that structured interview questions might create a stronger situation, which may lead to restricted variance in responses during the online chat and thus to worse results than unstructured interview questions. In any event, future research is needed to address this important research question.

Second, given recent advancements in personality psychology showing that self-reported and other-reported questionnaire-derived personality scores seem to capture distinct aspects of personality, with self-reports tapping into identity and other-reports tapping into reputation (e.g., McAbee & Connelly, 2016), it may be profitable to develop two parallel sets of predictive models based on self-reports and other-reports, respectively. In a recent study, Connelly et al.
(2022) empirically showed that the reputation component (assessed through other-reports) of Conscientiousness and Agreeableness dominated the prediction of several performance criteria, whereas the identity component (assessed through self-reports) generally did not predict the performance criteria. These findings suggest that machine-inferred personality scores based on other-reports may have a high potential to demonstrate incremental validity over self-reported questionnaire-derived scores and over machine scores based on self-reports. Future research should empirically investigate this possibility.

Third, future research is also needed to examine whether the ML approach to personality assessment is resistant to faking. There are two different views. The first view argues that machine-inferred personality scores cannot be faked, mainly because the ML approach uses a large quantity (usually hundreds or thousands, or even more) of empirically derived features that bear no substantive meaning, making it practically impossible to memorize and then fake these features. The counterargument, however, is that in high-stakes selection situations, job applicants may still engage in impression management in their responses when engaging with an AI chatbot. Some of the sentiments, positive emotions, and word usage faked by job applicants are likely to be captured by the NLP techniques and then factored into machine-inferred personality scores. In other words, faking is still highly possible within the ML approach. Which of these two views is correct is ultimately an empirical question. We thus call for future empirical research to examine the fakability of machine-inferred personality scores.

Fourth, future research should also examine the criterion-related and incremental validity of machine-inferred personality scores in selection contexts. We speculate that in selection contexts, self-reported questionnaire-derived personality scores should have weaker criterion-related validity due to applicant faking, which introduces irrelevant variance (e.g., Lanyon et al., 2014), whereas the criterion-related validity of machine-inferred scores should probably be maintained due to their resistance to faking. As such, machine scores may be more likely to demonstrate incremental validity in selection contexts than in nonselection contexts such as the present one. We also encourage future researchers to use more organizationally relevant criteria, such as task performance, organizational citizenship behaviors, counterproductive work behaviors, and turnover, when examining the criterion-related validity of machine scores.

Fifth, future research should address the low discriminant validity of machine scores, as high intercorrelations among machine-inferred domain scores tend to reduce the validity of a composite. We suggest that the relatively poor discriminant validity of machine scores might be attributable to the fact that existing ML algorithms largely emphasize single-target optimization. Improvement may be possible through a multitask framework. For instance, for predicting multidimensional constructs (e.g., personality facet or domain scores), an ideal approach may be the simultaneous optimization of multiple targets. A potential solution might be to integrate target matrices (multidimensional facet/domain scores as the ground-truth vectors) into a multitask learning framework (i.e., a variation of transfer learning in which models are built simultaneously to perform a set of related tasks; e.g., Ruder, 2017).
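One concrete instantiation of this multitask idea, offered only as an illustration and not as the procedure used in this article, is scikit-learn's multitask elastic net, which fits all targets jointly with a shared sparsity pattern across outcomes. X and Y below are placeholders for the averaged sentence embeddings and a participants × 30 matrix of self-reported facet scores.

```python
import numpy as np
from sklearn.linear_model import MultiTaskElasticNetCV

rng = np.random.default_rng(0)
X = rng.normal(size=(1037, 512))    # placeholder for averaged sentence embeddings
Y = rng.normal(size=(1037, 30))     # placeholder for 30 self-reported facet scores

multitask_model = MultiTaskElasticNetCV(
    l1_ratio=[0.1, 0.5, 0.9],       # mixing weights explored by cross-validation
    cv=5,
    max_iter=10000,
)
multitask_model.fit(X, Y)           # all 30 facet targets are optimized jointly
predicted_facets = multitask_model.predict(X[:5])   # 5 x 30 matrix of machine scores
```

Whether joint optimization of this kind actually improves the discriminant validity of machine-inferred facet scores is, as noted above, an open empirical question.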
Finally, as mentioned earlier, one major advantage of the AI chatbot approach to personality assessment is a less tedious testing experience. However, this advantage is assumed rather than empirically verified. Thus, future research is needed to compare applicant perceptions of traditional personality assessment and the AI chatbot approach.

Practical Considerations

Despite the initial promise of the AI chatbot approach to personality assessment, we believe that a series of practical issues needs to be sufficiently addressed before we could recommend its implementation in applied settings. First, the AI chatbot system used in this study allows users to tailor-make the conversation agenda. This makes sense, as organizations need to design interview questions according to specific positions. However, there is little empirical evidence within the chatbot approach that different sets of interview questions would yield similar machine scores for the same individuals. In this regard, we agree with Hickman et al. (2022) that vendors need to provide potential client organizations with enough information about the training data, such as sample demographics and the list of interview questions on which the models were built.
Sixth, considering recent research showing that other-reported questionnaire-derived personality scores have unique predictive power for job performance (e.g., Connelly et al., 2022), it is worthwhile to supplement the present predictive models based on self-reports with parallel models based on other-reports. Once other-report models are built, the AI chatbot system may provide more useful personality information for test-takers, thus further realizing the efficacy of the ML approach to personality assessment.

Finally, the ML-based chatbot approach to personality assessment comes with potential challenges related to professional ethics and ethical AI (e.g., concerns involving the fair and responsible use of AI). For instance, is it ethical or legally defensible to have job applicants go through a chatbot interview without telling them that their textual data will be mined for selection-relevant information? Organizations might also decide to transcribe recorded in-person interviews into text, apply predictive models such as the one used in the present study, and obtain machine-inferred personality scores for talent management purposes. AI chatbots may soon become an enormously popular tool for initial recruiting and screening, and corporate actors may not hesitate to repurpose harvested data for new applications.10

10 We thank an anonymous reviewer for raising this point.

References

American College Testing. (2018). Guide to the 2018 ACT/SAT concordance. https://www.act.org/content/dam/act/unsecured/documents/ACT-SATConcordance-Information.pdf
Asparouhov, T., & Muthén, B. (2009). Exploratory structural equation modeling. Structural Equation Modeling, 16(3), 397–438. https://doi.org/10.1080/10705510903008204
Azucar, D., Marengo, D., & Settanni, M. (2018). Predicting the Big 5 personality traits from digital footprints on social media: A meta-analysis. Personality and Individual Differences, 124, 150–159. https://doi.org/10.1016/j.paid.2017.12.018
Bleidorn, W., & Hopwood, C. J. (2019). Using machine learning to advance personality assessment and theory. Personality and Social Psychology Review, 23(2), 190–203. https://doi.org/10.1177/1088868318772990
Booth, T., & Hughes, D. J. (2014). Exploratory structural equation modeling of personality data. Assessment, 21(3), 260–271. https://doi.org/10.1177/1073191114528029
Borman, W. C., & Motowidlo, S. J. (1993). Expanding the criterion domain to include elements of contextual performance. In N. Schmitt & W. C. Borman (Eds.), Personnel selection in organizations (pp. 71–98). Jossey-Bass.
Bosco, F. A., Aguinis, H., Singh, K., Field, J. G., & Pierce, C. A. (2015). Correlational effect size benchmarks. Journal of Applied Psychology, 100(2), 431–449. https://doi.org/10.1037/a0038047
Cer, D., Yang, Y., Kong, S.-Y., Hua, N., Limtiaco, N., John, R., Constant, N., Guajardo-Cespedes, M., Yuan, S., Tar, C., Sung, Y.-H., Strope, B., & Kurzweil, R. (2018). Universal sentence encoder. arXiv. https://arxiv.org/abs/1803.11175
Chamorro-Premuzic, T., Winsborough, D., Sherman, R. A., & Hogan, R. (2016). New talent signals: Shiny new objects or a brave new world? Industrial and Organizational Psychology: Perspectives on Science and Practice, 9(3), 621–640. https://doi.org/10.1017/iop.2016.6
Chen, L., Zhao, R., Leong, C. W., Lehman, B., Feng, G., & Hoque, M. (2017, December 23–26). Automated video interview judgment on a large-sized corpus collected online [Conference session]. 2017 7th International Conference on Affective Computing and Intelligent Interaction (ACII 2017), San Antonio, TX, United States. https://doi.org/10.1109/ACII.2017.8273646
Chittaranjan, G., Blom, J., & Gatica-Perez, D. (2013). Mining large-scale smartphone data for personality studies. Personal and Ubiquitous Computing, 17(3), 433–450. https://doi.org/10.1007/s00779-011-0490-1
Connelly, B. S., McAbee, S. T., Oh, I.-S., Jung, Y., & Jung, C.-W. (2022). A multirater perspective on personality and performance: An empirical examination of the trait-reputation-identity model. Journal of Applied Psychology, 107(8), 1352–1368. https://doi.org/10.1037/apl0000732
Connelly, B. S., & Ones, D. S. (2010). An other perspective on personality: Meta-analytic integration of observers' accuracy and predictive validity. Psychological Bulletin, 136(6), 1092–1122. https://doi.org/10.1037/a0021212
Cook, A. (2021, November 9). Data leakage. Kaggle. Retrieved March 14, 2022, from https://www.kaggle.com/alexisbcook/data-leakage
Costa, P. T., Jr., & McCrae, R. R. (1992). Revised NEO Personality Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Psychological Assessment Resources.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological tests. Psychological Bulletin, 52(4), 281–302. https://doi.org/10.1037/h0040957
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019, June). BERT: Pre-training of deep bidirectional transformers for language understanding [Conference session]. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Minneapolis, MN, United States.
Foldes, H., Duehr, E. E., & Ones, D. S. (2008). Group differences in personality: Meta-analyses comparing five U.S. racial groups. Personnel Psychology, 61(3), 579–616. https://doi.org/10.1111/j.1744-6570.2008.00123.x
Gnambs, T. (2014). A meta-analysis of dependability coefficients (test–retest reliabilities) for measures of the Big Five. Journal of Research in Personality, 52, 20–28. https://doi.org/10.1016/j.jrp.2014.06.003
Golbeck, J., Robles, C., & Turner, K. (2011, May). Predicting personality with social media [Conference session]. Proceedings of the 2011 Annual Conference on Human Factors in Computing Systems—CHI'11, Vancouver, BC, Canada. https://doi.org/10.1145/1979742.1979614
Goldberg, L. R. (1993). The structure of phenotypic personality traits. American Psychologist, 48(1), 26–34. https://doi.org/10.1037/0003-066X.48.1.26
Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality inventory measuring the lower-level facets of several five-factor models. In I. Mervielde, I. Deary, F. De Fruyt, & F. Ostendorf (Eds.), Personality psychology in Europe (Vol. 7, pp. 7–28). Tilburg University Press.
Gou, L., Zhou, M. X., & Yang, H. (2014, April). KnowMe and ShareMe: Understanding automatically discovered personality traits from social media and user sharing preference [Conference session]. Proceedings of the SIGCHI Conference on Human Factors in Computing Systems—CHI'14, Toronto, ON, Canada. https://doi.org/10.1145/2556288.2557398
Gow, I. D., Kaplan, S. N., Larcker, D. F., & Zakolyukina, A. A. (2016). CEO personality and firm policies [Working paper 22435]. National Bureau of Economic Research. https://doi.org/10.3386/w22435
Guo, F., Gallagher, C. M., Sun, T., Tavoosi, S., & Min, H. (2021). Smarter people analytics with organizational text data: Demonstrations using classic and advanced NLP models. Human Resource Management Journal, 2021, 1–16. https://doi.org/10.1111/1748-8583.12426
Harrison, J. S., Thurgood, G. R., Boivie, S., & Pfarrer, M. D. (2019). Measuring CEO personality: Developing, validating, and testing a linguistic tool. Strategic Management Journal, 40, 1316–1330. https://doi.org/10.1002/smj.3023
Harrison, J. S., Thurgood, G. R., Boivie, S., & Pfarrer, M. D. (2020). Perception is reality: How CEOs' observed personality influences market perceptions of firm risk and shareholder returns. Academy of Management Journal, 63(4), 1166–1195. https://doi.org/10.5465/amj.2018.0626
Hauenstein, N. M. A., Bradley, K. M., O'Shea, P. G., Shah, Y. J., & Magill, D. P. (2017). Interactions between motivation to fake and personality item characteristics: Clarifying the process. Organizational Behavior and Human Decision Processes, 138, 74–92. https://doi.org/10.1016/j.obhdp.2016.11.002
Hickman, L., Bosch, N., Ng, V., Saef, R., Tay, L., & Woo, S. E. (2022). Automated video interview personality assessments: Reliability, validity, and generalizability investigations. Journal of Applied Psychology, 107(8), 1323–1351. https://doi.org/10.1037/apl0000695
Hoerl, A., & Kennard, R. (1988). Ridge regression. In S. Kotz, C. B. Read, N. Balakrishnan, B. Vidakovic, & N. L. Johnson (Eds.), Encyclopedia of statistical sciences (Vol. 8, pp. 129–136). Wiley.
Hoppe, S., Loetscher, T., Morey, S. A., & Bulling, A. (2018). Eye movements during everyday behavior predict personality traits. Frontiers in Human Neuroscience, 12, Article 105. https://doi.org/10.3389/fnhum.2018.00105
Hopwood, C. J., & Donnellan, M. B. (2010). How should the internal structure of personality inventories be evaluated? Personality and Social Psychology Review, 14(3), 332–346. https://doi.org/10.1177/1088868310361240
Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6(1), 1–55. https://doi.org/10.1080/10705519909540118
Hurtz, G. M., & Donovan, J. J. (2000). Personality and job performance: The Big Five revisited. Journal of Applied Psychology, 85(6), 869–879. https://doi.org/10.1037/0021-9010.85.6.869
Hwang, A. H. C., & Won, A. S. (2021, May). IdeaBot: Investigating social facilitation in human-machine team creativity. In Y. Kitamura & A. Quigley (Eds.), Proceedings of the 2021 CHI conference on human factors in computing systems (pp. 1–16). ACM. https://doi.org/10.1145/3411764.3445270
International Business Machines. (n.d.). What is a chatbot? Retrieved December 28, 2022, from https://www.ibm.com/topics/chatbots
Jayaratne, M., & Jayatilleke, B. (2020). Predicting personality using answers to open-ended interview questions. IEEE Access: Practical Innovations, Open Solutions, 8, 115345–115355. https://doi.org/10.1109/ACCESS.2020.3004002
Jiang, Z., Rashik, M., Panchal, K., Jasim, M., Sarvghad, A., Riahi, P., DeWitt, E., Thurber, F., & Mahyar, N. (2023). CommunityBots: Creating and evaluating a multi-agent chatbot platform for public input elicitation [Conference session]. Accepted to the 26th ACM Conference on Computer-Supported Cooperative Work and Social Computing, Minneapolis, MN, United States.
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for efficient text classification. arXiv. https://doi.org/10.48550/arXiv.1607.01759
Judge, T. A., & Bono, J. E. (2001). Relationship of core self-evaluations traits—Self-esteem, generalized self-efficacy, locus of control, and emotional stability—With job satisfaction and job performance: A meta-analysis. Journal of Applied Psychology, 86(1), 80–92. https://doi.org/10.1037/0021-9010.86.1.80
Kim, S., Lee, J., & Gweon, G. (2019, May). Comparing data from chatbot and web surveys: Effects of platform and conversational style on survey response quality. In S. Brewster & G. Fitzpatrick (Eds.), Proceedings of the 2019 CHI conference on human factors in computing systems (pp. 1–12). ACM. https://doi.org/10.1145/3290605.3300316
Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes are predictable from digital records of human behavior. Proceedings of the National Academy of Sciences of the United States of America, 110(15), 5802–5805. https://doi.org/10.1073/pnas.1218772110
Kulkarni, V., Kern, M. L., Stillwell, D., Kosinski, M., Matz, S., Ungar, L., Skiena, S., & Schwartz, H. A. (2018). Latent human traits in the language of social media: An open-vocabulary approach. PLOS ONE, 13(11), Article e0201703. https://doi.org/10.1371/journal.pone.0201703
Lanyon, R. I., Goodstein, L. D., & Wershba, R. (2014). 'Good Impression' as a moderator in employment-related assessment. International Journal of Selection and Assessment, 22(1), 52–61. https://doi.org/10.1111/ijsa.12056
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature, 521(7553), 436–444. https://doi.org/10.1038/nature14539
Leutner, K., Liff, J., Zuloaga, L., & Mondragon, N. (2021). Hirevue's assessment science [White paper]. https://Hirevue.com. https://webapi.hirevue.com/wp-content/uploads/2021/03/HireVue-Assessment-Sciencewhitepaper-2021.pdf
Li, J., Zhou, M. X., Yang, H., & Mark, G. (2017, March). Confiding in and listening to virtual agents: The effect of personality [Conference session]. Paper presented at the 22nd annual meeting of the intelligent user interfaces community, Limassol, Cyprus. https://doi.org/10.1145/3025171.3025206
Li, W., Wu, C., Hu, X., Chen, J., Fu, S., Wang, F., & Zhang, D. (2020). Quantitative personality predictions from a brief EEG recording. IEEE Transactions on Affective Computing. Advance online publication. https://doi.org/10.1109/TAFFC.2020.3008775
Liu, A. X., Li, Y., & Xu, S. X. (2021). Assessing the unacquainted: Inferred reviewer personality and review helpfulness. Management Information Systems Quarterly, 45(3), 1113–1148. https://doi.org/10.25300/MISQ/2021/14375
Loevinger, J. (1957). Objective tests as instruments of psychological theory. Psychological Reports, 3(3), 635–694. https://doi.org/10.2466/pr0.1957.3.3.635
Lorenzo-Seva, U., & ten Berge, J. M. (2006). Tucker's congruence coefficient as a meaningful index of factor similarity. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 2(2), 57–64. https://doi.org/10.1027/1614-2241.2.2.57
Marinucci, A., Kraska, J., & Costello, S. (2018). Recreating the relationship between subjective wellbeing and personality using machine learning: An investigation into Facebook online behaviours. Big Data and Cognitive Computing, 2(3), Article 29. https://doi.org/10.3390/bdcc2030029
Marsh, H. W., Guo, J., Dicke, T., Parker, P. D., & Craven, R. G. (2020). Confirmatory factor analysis (CFA), exploratory structural equation modeling (ESEM), and Set-ESEM: Optimal balance between goodness of fit and parsimony. Multivariate Behavioral Research, 55(1), 102–119. https://doi.org/10.1080/00273171.2019.1602503
McAbee, S. T., & Connelly, B. S. (2016). A multi-rater framework for studying personality: The trait-reputation-identity model. Psychological Review, 123(5), 569–591. https://doi.org/10.1037/rev0000035
McAbee, S. T., & Oswald, F. L. (2013). The criterion-related validity of personality measures for predicting GPA: A meta-analytic validity competition. Psychological Assessment, 25(2), 532–544. https://doi.org/10.1037/a0031748
McCarthy, J., & Wright, P. (2004). Technology as experience. Interactions, 11(5), 42–43. https://doi.org/10.1145/1015530.1015549
McCarthy, J. M., Bauer, T. N., Truxillo, D. M., Anderson, N. R., Costa, A. C., & Ahmed, S. M. (2017). Applicant perspectives during selection: A review addressing "So What?" "What's New?" and "Where to Next?" Journal of Management, 43(6), 1693–1725. https://doi.org/10.1177/0149206316681846
Mikolov, T. (2012). Statistical language models based on neural networks. Presentation at Google.
Mitchell, T. M. (1997). Machine learning. McGraw-Hill.
Morgeson, F. P., Campion, M. A., Dipboye, R. L., Hollenbeck, J. R., Murphy, K., & Schmitt, N. (2007). Reconsidering the use of personality tests in personnel selection contexts. Personnel Psychology, 60(3), 683–729. https://doi.org/10.1111/j.1744-6570.2007.00089.x
Mulfinger, E., Wu, F., Alexander, L., III, & Oswald, F. L. (2020, February). AI technologies in talent management systems: It glitters but is it gold? [Poster presentation]. Work in the 21st Century: Automation, Workers, and Society, Houston, TX, United States.
Muthén, L. K., & Muthén, B. O. (2017). Mplus user's guide (8th ed.). (Original work published 1998)
Oswald, F. L., Behrend, T. S., Putka, D. J., & Sinar, E. (2020). Big data in industrial-organizational psychology and human resource management: Forward progress for organizational research and practice. Annual Review of Organizational Psychology and Organizational Behavior, 7(1), 505–533. https://doi.org/10.1146/annurev-orgpsych-032117-104553
Oswald, F. L., Schmitt, N., Kim, B. H., Ramsay, L. J., & Gillespie, M. A. (2004). Developing a biodata measure and situational judgment inventory as predictors of college student performance. Journal of Applied Psychology, 89(2), 187–207. https://doi.org/10.1037/0021-9010.89.2.187
Park, G., Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Kosinski, M., Stillwell, D. J., Ungar, L. H., & Seligman, M. E. (2015). Automatic personality assessment through social media language. Journal of Personality and Social Psychology, 108(6), 934–952. https://doi.org/10.1037/pspp0000020
Paulhus, D. L., Robins, R. W., Trzesniewski, K. H., & Tracy, J. L. (2004). Two replicable suppressor situations in personality research. Multivariate Behavioral Research, 39(2), 303–328. https://doi.org/10.1207/s15327906mbr3902_7
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830.
Pennebaker, J. W., Boyd, R. L., Jordan, K., & Blackburn, K. (2015). The development and psychometric properties of LIWC2015. https://LIWC.net
Pervin, L. A. (1994). Further reflections on current trait theory. Psychological Inquiry, 5(2), 169–178. https://doi.org/10.1207/s15327965pli0502_19
Pulakos, E. D., Arad, S., Donovan, M. A., & Plamondon, K. E. (2000). Adaptability in the workplace: Development of a taxonomy of adaptive performance. Journal of Applied Psychology, 85(4), 612–624. https://doi.org/10.1037/0021-9010.85.4.612
Putka, D. J., Beatty, A. S., & Reeder, M. C. (2018). Modern prediction methods: New perspectives on a common problem. Organizational Research Methods, 21(3), 689–732. https://doi.org/10.1177/1094428117697041
Putka, D. J., Oswald, F. L., Landers, R. N., Beatty, A. S., McCloy, R. A., & Yu, M. C. (2022). Evaluating a natural language processing approach to estimating KSA and interest job analysis ratings. Journal of Business and Psychology. Advance online publication. https://doi.org/10.1007/s10869-022-09824-0
R Core Team. (2020). R: A language and environment for statistical computing. R Foundation for Statistical Computing. http://www.R-project.org/
Revelle, W. (2022). Psych: Procedures for psychological, psychometric, and personality research (Version 2.2.5). Northwestern University. http://CRAN.R-project.org/package=psych
Ruder, S. (2017). An overview of multi-task learning in deep neural networks. arXiv. https://arxiv.org/abs/1706.05098
Sackett, P. R., Zhang, C., Berry, C. M., & Lievens, F. (2022). Revisiting meta-analytic estimates of validity in personnel selection: Addressing systematic overcorrection for restriction of range. Journal of Applied Psychology, 107(11), 2040–2068. https://doi.org/10.1037/apl0000994
Sajjadiani, S., Sojourner, A. J., Kammeyer-Mueller, J. D., & Mykerezi, E. (2019). Using machine learning to translate applicant work history into predictors of performance and turnover. Journal of Applied Psychology, 104(10), 1207–1225. https://doi.org/10.1037/apl0000405
Shumanov, M., & Johnson, L. (2021). Making conversations with chatbots more personalized. Computers in Human Behavior, 117, Article 106627. https://doi.org/10.1016/j.chb.2020.106627
Speer, A. B. (2021). Scoring dimension-level job performance from narrative comments: Validity and generalizability when using natural language processing. Organizational Research Methods, 24(3), 572–594. https://doi.org/10.1177/1094428120930815
Spisak, B. R., van der Laken, P. A., & Doornenbal, B. M. (2019). Finding the right fuel for the analytical engine: Expanding the leader trait paradigm through machine learning? The Leadership Quarterly, 30(4), 417–426. https://doi.org/10.1016/j.leaqua.2019.05.005
Suen, H.-Y., Huang, K.-E., & Lin, C.-L. (2019). TensorFlow-based automatic personality recognition used in asynchronous video interviews. IEEE Access: Practical Innovations, Open Solutions, 7, 61018–61023. https://doi.org/10.1109/ACCESS.2019.2902863
Sun, T. (2021). Artificial intelligence powered personality assessment: A multidimensional psychometric natural language processing perspective [Doctoral dissertation]. University of Illinois Urbana-Champaign. https://hdl.handle.net/2142/113136
Tay, L., Woo, S. E., Hickman, L., & Saef, R. M. (2020). Psychometric and validity issues in machine learning approaches to personality assessment: A focus on social media text mining. European Journal of Personality, 34(5), 826–844. https://doi.org/10.1002/per.2290
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society. Series B. Methodological, 58(1), 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
Tippins, N. T., Oswald, F. L., & McPhail, S. M. (2021). Scientific, legal, and ethical concerns about AI-based personnel selection tools: A call to action. Personnel Assessment and Decisions, 7(2), 1–22. https://doi.org/10.25035/pad.2021.02.001
Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual. CreateSpace.
Völkel, S. T., Haeuslschmid, R., Werner, A., Hussmann, H., & Butz, A. (2020). How to trick AI: Users' strategies for protecting themselves from automatic personality assessment. In R. Bernhaupt & F. F. Mueller (Eds.), Proceedings of the 2020 CHI conference on human factors in computing systems (pp. 1–15). ACM. https://doi.org/10.1145/3313831.3376877
Wang, S., & Chen, X. (2020). Recognizing CEO personality and its impact on business performance: Mining linguistic cues from social media. Information & Management, 57(5), Article 103173. https://doi.org/10.1016/j.im.2019.103173
Woehr, D. J., Putka, D. J., & Bowler, M. C. (2012). An examination of g-theory methods for modeling multitrait–multimethod data: Clarifying links to construct validity and confirmatory factor analysis. Organizational Research Methods, 15(1), 134–161. https://doi.org/10.1177/1094428111408616
Xiao, Z., Zhou, M. X., Liao, Q. V., Mark, G., Chi, C., Chen, W., & Yang, H. (2020). Tell me about yourself: Using an AI-powered chatbot to conduct conversational surveys with open-ended questions. ACM Transactions on Computer-Human Interaction, 27(3), 1–37. https://doi.org/10.1145/3381804
Yang, Y., Abrego, G. H., Yuan, S., Guo, M., Shen, Q., Cer, D., Sung, Y.-H., Strope, B., & Kurzweil, R. (2019). Improving multilingual sentence embedding using bi-directional dual encoder with additive margin softmax. arXiv. https://doi.org/10.24963/ijcai.2019/746
Yarkoni, T. (2010). Personality in 100,000 words: A large-scale analysis of personality and word use among bloggers. Journal of Research in Personality, 44(3), 363–373. https://doi.org/10.1016/j.jrp.2010.04.001
Youyou, W., Kosinski, M., & Stillwell, D. (2015). Computer-based personality judgments are more accurate than those made by humans. Proceedings of the National Academy of Sciences of the United States of America, 112(4), 1036–1040. https://doi.org/10.1073/pnas.1418680112
Zhang, B., Luo, J., Chen, Y., Roberts, B., & Drasgow, F. (2020). The road less traveled: A cross-cultural study of the negative wording factor in multidimensional scales. PsyArXiv. https://doi.org/10.31234/osf.io/2psyq
Zhang, B., Luo, J., Sun, T., Cao, M., & Drasgow, F. (2021). Small but nontrivial: A comparison of six strategies to handle cross-loadings in bifactor predictive models. Multivariate Behavioral Research. Advance online publication. https://doi.org/10.1080/00273171.2021.1957664
Zhou, M. X., Chen, W., Xiao, Z., Yang, H., Chi, T., & Williams, R. (2019, March 17–20). Getting virtually personal: Chatbots who actively listen to you and infer your personality [Conference session]. Paper presented at the 24th International Conference on Intelligent User Interfaces (IUI'19 Companion), Marina Del Rey, CA, United States. https://doi.org/10.1145/3308557.3308667
Ziegler, M., MacCann, C., & Roberts, R. D. (Eds.). (2012). New perspectives on faking in personality assessment. Oxford University Press.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society. Series B, Statistical Methodology, 67(2), 301–320. https://doi.org/10.1111/j.1467-9868.2005.00503.x

Received May 13, 2022
Revision received January 3, 2023
Accepted January 4, 2023 ▪