AI Chatbot Personality Assessment: Psychometric Properties

Journal of Applied Psychology
© 2023 American Psychological Association
ISSN: 0021-9010
2023, Vol. 108, No. 8, 1277–1299
How Well Can an AI Chatbot Infer Personality? Examining
Psychometric Properties of Machine-Inferred Personality Scores
Jinyan Fan¹, Tianjun Sun², Jiayi Liu¹, Teng Zhao¹, Bo Zhang³,⁴,
Zheng Chen⁵, Melissa Glorioso¹, and Elissa Hack⁶
¹ Department of Psychological Sciences, Auburn University
² Department of Psychological Sciences, Kansas State University
³ School of Labor and Employment Relations, University of Illinois Urbana-Champaign
⁴ Department of Psychology, University of Illinois Urbana-Champaign
⁵ School of Information Systems and Management, Muma College of Business, University of South Florida–St. Petersburg
⁶ Department of Behavioral Sciences and Leadership, United States Air Force Academy
The present study explores the plausibility of measuring personality indirectly through an artificial
intelligence (AI) chatbot. This chatbot mines various textual features from users’ free text responses
collected during an online conversation/interview and then uses machine learning algorithms to infer
personality scores. We comprehensively examine the psychometric properties of the machine-inferred
personality scores, including reliability (internal consistency, split-half, and test–retest), factorial validity,
convergent and discriminant validity, and criterion-related validity. Participants were undergraduate
students (n = 1,444) enrolled in a large southeastern public university in the United States who completed
a self-report Big Five personality measure (IPIP-300) and engaged with an AI chatbot for approximately 20–30
min. In a subsample (n = 407), we obtained participants’ cumulative grade point averages from the University
Registrar and had their peers rate their college adjustment. In an additional sample (n = 61), we obtained test–
retest data. Results indicated that machine-inferred personality scores (a) had overall acceptable reliability at
both the domain and facet levels, (b) yielded a comparable factor structure to self-reported questionnaire-derived personality scores, (c) displayed good convergent validity but relatively poor discriminant validity
(averaged convergent correlations = .48 vs. averaged machine-score correlations = .35 in the test sample), (d)
showed low criterion-related validity, and (e) exhibited incremental validity over self-reported questionnaire-derived personality scores in some analyses. In addition, there was strong evidence for cross-sample
generalizability of psychometric properties of machine scores. Theoretical implications, future research
directions, and practical considerations are discussed.
Keywords: chatbot, personality, artificial intelligence, machine learning, psychometric properties
Supplemental materials: https://doi.org/10.1037/apl0001082.supp
During the last 3 decades, personality measures have been
established as a useful talent assessment tool due to the findings
that (a) personality scores are predictive of important organizational
outcomes (e.g., Hurtz & Donovan, 2000; Judge & Bono, 2001) and
(b) personality scores typically do not result in racial adverse impact
(e.g., Foldes et al., 2008). While scholars and practitioners alike
appreciate the utility of understanding individuals’ behavioral
characteristics in organizational settings, debates have revolved
around how to measure personality more effectively and efficiently.
Self-report personality measures, often used in talent assessment
practice, have been criticized for (a) modest criterion-related validity
(Morgeson et al., 2007); (b) susceptibility to faking or response
distortion, particularly within selection contexts (Ziegler et al.,
2012); (c) idiosyncratic interpretation of items due to individual
differences in cross-situational behavioral consistency (Hauenstein
et al., 2017); and (d) the tedious testing experience where test-takers
have to respond to many items in one sitting.
This article was published Online First February 6, 2023.
Melissa Glorioso is now at Army Research Institute for the Behavioral and
Social Sciences.
The authors thank Michelle Zhou, Huahai Yang, and Wenxi Chen of Juji,
Inc., for their assistance with machine-learning-based model building. They
also thank Andrew Speer, Louis Hickman, Filip Lievens, Emily Campion,
Peter Chen, Alan Walker, and Jesse Michel for providing their valuable
feedback on an earlier version of the article.
The authors declare no financial conflict of interest nor advisory board
affiliations, and so forth, with Juji, Inc.
Jinyan Fan and Tianjun Sun contributed equally to this article.
Earlier versions of part of the article were presented at the Society for
Industrial and Organizational Psychology 2018 and 2022 conferences.
The views expressed are those of the authors and do not reflect the official
policy or position of the U.S. Air Force, Department of Defense, or the U.S.
Correspondence concerning this article should be addressed to Jinyan
Fan, Department of Psychological Sciences, Auburn University, 225
Thach Hall, Auburn, AL 36849, United States or Jiayi Liu,
Department of Psychological Sciences, Auburn University, 102A Thach
Hall, Auburn, AL 36849, United States. Email: Jinyan.Fan@auburn.edu or
Recently, an innovative approach to personality assessment has
emerged. This approach was originally developed by computer scientists and has now made its way into the applied psychology field. It is
generally referred to as artificial intelligence (AI)-based personality
assessment. This new form of assessment can be distinguished from
traditional assessment in three ways: (a) technologies, (b) types of data,
and (c) algorithms (Tippins et al., 2021). Data collected via diverse
technological platforms (e.g., social media and video interviews) have
been used to obtain an assortment of personality-relevant data (digital
footprints) such as facial expression (Suen et al., 2019), smartphone
data (Chittaranjan et al., 2013), interview responses (Hickman et al.,
2022), and online chat scripts (Li et al., 2017).
The third area in which AI-based personality assessments are
unique is in their use of more complex algorithms. AI is a broad term
that refers to the science and engineering of making intelligent
systems or machines (e.g., especially computer programs) that
mimic human intelligence to perform tasks and can iteratively
improve themselves based on the information they collect
(McCarthy & Wright, 2004). Machine learning (ML) is a subset
of AI, which focuses on building computer algorithms that automatically learn or improve performance based on the data they
consume (Mitchell, 1997). In some more complex work, the term
deep learning (DL) may also be referenced. DL is a subset of ML,
referring to neural-network-based ML algorithms that are composed
of multiple processing layers to learn the representations of data with
multiple levels of abstraction and mimic how a biological brain
works (LeCun et al., 2015). The present article refers to personality
assessment tools that purport to predict personality traits using
digital footprints as the ML approach.
The ML approach to personality assessment typically entails two
stages: (a) model training and (b) model application. In the model
training stage, researchers attempt to build predictive models using a
large sample of individuals. The predictors are potentially trait-relevant features extracted through analyzing digital footprints
generated by individuals, such as a corpus of texts, social media
“likes,” sound/voice memos, and micro facial expressions. The
criteria (ground truth) are the same individuals’ questionnaire-derived personality scores, either self-reported (e.g., Golbeck et
al., 2011; Gou et al., 2014), other-rated (e.g., Chen et al., 2017;
Harrison et al., 2019), or both (e.g., Hickman et al., 2022). Next,
researchers try to establish empirical links between the predictors
(features) and the criteria, often via linear regressions, support vector
machines, tree-based analyses, or neural networks, resulting in
estimated model parameters (e.g., regression coefficients).¹ To
avoid model overfitting, within-sample k-fold cross-validation is
routinely conducted (Bleidorn & Hopwood, 2019). In addition, an
independent test sample is often arranged for cross-sample validation and model testing. In the model application stage, the trained
model is applied to automatically predict the personality of new
individuals who do not have the questionnaire-derived personality
data. Specifically, the computer algorithm first analyzes new individuals’ digital footprints, extracts features, obtains feature scores,
and then uses feature scores and established model parameters to
calculate predicted personality scores.
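The two stages described above can be illustrated with a minimal sketch. It is not the chatbot's actual pipeline: the three numeric "features," the simulated "self-report" criterion, the helper names (`fit_ridge`, `kfold_predictions`, `simulate`), and the choice of ridge regression as a stand-in for the regression-type models mentioned above are all assumptions made for illustration.

```python
import random

def fit_ridge(X, y, lam=1.0):
    """Ridge regression with an unpenalized intercept, solved by Gauss-Jordan elimination."""
    rows = [[1.0] + list(x) for x in X]            # prepend an intercept column
    p = len(rows[0])
    M = []                                         # augmented system (X'X + lam*I | X'y)
    for i in range(p):
        row = [sum(r[i] * r[j] for r in rows) for j in range(p)]
        if i > 0:
            row[i] += lam                          # shrink slopes, not the intercept
        row.append(sum(r[i] * t for r, t in zip(rows, y)))
        M.append(row)
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [v / pv for v in M[col]]
        for r in range(p):
            if r != col:
                f = M[r][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [M[i][p] for i in range(p)]             # [intercept, slope_1, ...]

def predict(beta, X):
    return [beta[0] + sum(b * v for b, v in zip(beta[1:], x)) for x in X]

def pearson(a, b):
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    sab = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    saa = sum((u - ma) ** 2 for u in a)
    sbb = sum((v - mb) ** 2 for v in b)
    return sab / (saa * sbb) ** 0.5

def kfold_predictions(X, y, k=5, lam=1.0, seed=0):
    """Stage 1 check: every person is scored by a model that never saw that person."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    out = [0.0] * len(y)
    for fold in (idx[i::k] for i in range(k)):
        held = set(fold)
        train = [i for i in idx if i not in held]
        beta = fit_ridge([X[i] for i in train], [y[i] for i in train], lam)
        for i, yhat in zip(fold, predict(beta, [X[i] for i in fold])):
            out[i] = yhat
    return out

rng = random.Random(42)

def simulate(n):
    """Hypothetical data: 3 text features per person and a questionnaire score."""
    X, y = [], []
    for _ in range(n):
        x = [rng.gauss(0, 1) for _ in range(3)]
        X.append(x)
        y.append(3.0 + 0.5 * x[0] - 0.3 * x[1] + rng.gauss(0, 0.5))
    return X, y

# Stage 1: model training with within-sample k-fold cross-validation.
X_train, y_train = simulate(200)
cv_pred = kfold_predictions(X_train, y_train, k=5)
cv_r = pearson(cv_pred, y_train)                   # cross-validated convergence

# Stage 2: model application to new individuals without questionnaire data.
beta = fit_ridge(X_train, y_train)
X_new, _ = simulate(10)
new_scores = predict(beta, X_new)
```

The cross-validated correlation `cv_r` plays the role of the within-sample convergence estimate, whereas `new_scores` corresponds to the automatically inferred scores produced in the application stage.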
The ML approach can be thought of as an indirect measurement
of personality using a large number of features with empirically
derived model parameters to score personality automatically
(Hickman et al., 2022; Park et al., 2015). These features can be
based on lingual or other behaviors, such as interaction logs or facial
expressions. Model parameters indicate the influence of features on
personality “ground truth” as the prediction targets. This approach
boasts two advantages over traditional assessment methods, particularly self-report questionnaires (Mulfinger et al., 2020). The first
advantage lies in its efficiency. For instance, it is possible to use the
same set of digital footprints to train a series of models to infer scores
on numerous individual difference variables such as personality traits,
cognitive ability, values, and career interests. It is resource intensive
to train these different models; however, once trained, various
individual differences can be automatically and simultaneously
inferred with a single set of digital footprint inputs. This would
shorten the assessment time in general, which should be appealing to
both test-takers and sponsoring organizations. Second, the testing
experience tends to be less tedious (Kim et al., 2019). If individuals’
social media content is utilized to infer personality, individuals do not
need to go through the assessment process at all. If video interview or
online chat is used, individuals may feel they have more opportunities
to perform and thus should enjoy the assessment more than, for
instance, completing a self-report personality inventory (McCarthy
et al., 2017).
Despite the potential advantages, the ML approach to personality
assessment faces several challenges and needs to sufficiently address
many issues before it can be used in practice for talent assessment. For
instance, earlier computer algorithms required users (e.g., job applicants) to share their social media content, which did not fare well due
to apparent privacy concerns and potential legal ramifications (Oswald
et al., 2020). In response, organizations have begun to use automated
video interviews (AVIs; e.g., Hickman et al., 2022; Leutner et al.,
2021) or text-based interviews (e.g., Völkel et al., 2020; Zhou et al.,
2019) to extract trait-relevant features. The present study uses the text-based interview method (also known as the AI chatbot) to collect
textual information from users. Strategies such as AVIs and AI
chatbots may gain wider acceptance in applied settings as job
applicants are less likely to refuse an interview request by the hiring
organization where they expect a job offer.
Another critical issue that remains largely unaddressed is a
striking lack of extensive examinations of the psychometric
properties of machine-inferred personality scores (Bleidorn &
Hopwood, 2019; Hickman et al., 2022). Although numerous
computer algorithms have been developed to infer personality
scores, few validation efforts have gone beyond demonstrating
the convergence between questionnaire-derived and machine-inferred personality scores.
¹ Support vector machine has the objective of finding a hyperplane (i.e.,
can be multidimensional) in a high-dimensional space (e.g., a regression or
classification model with many features or predictors) that distinctly classifies the observations, such that the plane has the maximum margin (i.e., the
distance between data points and classes), where the maximization would
reinforce future observations to be more confidently classified and values
predicted. Neural networks (or artificial neural networks, as often referred to
in computational sciences to be distinguished from neural networks in the
biological brain) can be considered sets of algorithms that are designed,
loosely after the information processing of the human brain, to recognize
patterns. All patterns recognized, be it from sounds, images, text, or
others, are numerical and contained in vectors to be stored and managed
in an information layer (or multiple processing layers), and various follow-up
tasks (e.g., clustering) can take place on another layer on top of the
information layer(s).
The purpose of the present study is to explore the plausibility of
measuring personality through an AI chatbot. More importantly, we
extensively examine psychometric properties of machine-inferred
personality scores at both facet and domain levels including reliability (internal consistency, split-half, and test–retest), factorial
validity, convergent and discriminant validity, and criterion-related
validity. Such a comprehensive examination has been extremely
rare in the ML literature but is sorely needed. Although an unpublished doctoral dissertation (Sun, 2021) provided initial promising
evidence for the utility of the AI chatbot method for personality
inference, more empirical research is warranted.
In the present study, we have chosen to use self-reported
questionnaire-derived personality scores as ground truth when
building predictive models. Self-report personality measures are
the most widely used method of personality assessment in practice,
with an impressive body of validity evidence. Although self-report
personality measures may be prone to social desirability or faking,
our study context is research-focused instead of selection-focused,
and thus faking may not be a serious concern. In what follows,
we first introduce the AI-powered, text-based interview system
(AI chatbot) used in our research. We then briefly discuss how
we examine the psychometric properties of machine-inferred personality scores and then present a large-scale empirical study.
AI Chatbot and Personality Inference
An AI chatbot is an artificial intelligence system that often utilizes a
combination of technologies, such as deep learning for natural language processing (NLP), symbolic machine learning for pattern
recognition, and predictive analytics for user insights inference to
enable the personalization of conversation experiences and improve
chatbot performance as it is exposed to more human interactions
(International Business Machines, n.d.). Unlike automated video
interview systems (e.g., Hickman et al., 2022; Leutner et al., 2021),
which mostly entail one-way communications, an AI chatbot engages
with users through two-way communications (e.g., Zhou et al., 2019).
While there are many chatbot platforms commercially available
(e.g., International Business Machines Watson Assistant, Google
Dialogflow, and Microsoft Power Virtual Agents), we opted to use
Juji’s AI chatbot platform (https://juji.io) as our study platform for
three reasons. First, unlike many other platforms that require writing
computer programs (e.g., via Application Programming Interfaces) to
train and customize advanced chatbot functionalities, such as dialog
management, Juji’s platform enables non-IT professionals, such as
applied psychologists and HR professionals, to create, customize, and
manage an AI chatbot to conduct virtual interviews without writing
code. Second, Juji’s platform is publicly accessible (currently with
no- or low-cost academic use options), which enables other scholars
to conduct similar research studies and/or replicate our study. Third,
scholars have successfully used the Juji chatbot platform in conducting various scientific studies, including team creativity (Hwang &
Won, 2021), personality assessment (Völkel et al., 2020), and public
opinion elicitation (Jiang et al., 2023).
To enable readers to better understand what is behind the scenes
of the Juji AI platform and facilitate others in selecting comparable
chatbot platforms for conducting studies like ours, in the next
section, we provide a high-level, nontechnical explanation of the
Juji AI chatbot platform. Note that the key structure and functions of
this specific chatbot platform also outline the required key components for supporting any capable conversational AI agents (e.g.,
Jayaratne & Jayatilleke, 2020), thus allowing our methods and
results to be generalized.
Figure 1
Overview of AI Chatbot Platform for Building an Effective Chatbot and Predicting Personality Using Juji’s
Virtual Conversation System as a Prototype
Note. AI = artificial intelligence. See the online article for the color version of this figure.
As shown in Figure 1, at the bottom level, several machine
learning models are used, including data-driven machine learning
models and symbolic AI models, to support the entire chatbot
system. At the middle level, specialty engines are built to facilitate
two-way conversations and user insights inferences. Specifically,
an NLP engine for conversation interprets highly diverse and
complex user natural language inputs during a conversation. Based
on the interpretation results, an active listening conversation
engine, which is powered by the social–emotional intelligence needed to
carry out an empathetic and effective conversation, decides
how to best respond to users and guide the conversation forward.
For example, it may decide to engage users in small talk, provide
empathetic comments such as paraphrasing, verbalizing emotions,
and summarizing user input, as well as handle diverse user interruptions, such as digressing from a conversation and trying to
dodge a question (Xiao et al., 2020). The personality inference
engine takes in the conversation script from a user and performs
NLP on the script to extract textual features, which are used as
predictors, with the same user’s questionnaire-based personality
scores being used as the criteria. ML models can then be built to
automatically infer personality using statistical methods such as
linear regressions.
To facilitate the customization of an AI chatbot, a set of reusable
AI components is prebuilt, from conversation topics to AI assistant
templates, which enables the automated generation of an AI assistant/chatbot (see the top part of Figure 1). Specifically, a chatbot
designer (a researcher or an HR manager) uses a graphical user
interface to specify what the AI chatbot should do, such as its main
conversation flow or Q&As to be supported, and the AI generator then
automatically generates a draft AI chatbot based on the specifications. The generated AI chatbot is then compiled by the AI compiler
to become alive—a live chatbot that can engage with users in a
conversation, which is managed by the AI runtime component to
ensure a smooth conversation.
Examining Psychometric Properties of
Machine-Inferred Personality Scores
Bleidorn and Hopwood (2019) suggested three general classes of
evidence for the construct validity of machine-inferred personality
scores based on Loevinger’s (1957) original framework. We rely on
this framework to organize the present examination of psychometric
properties of machine scores. Table 1 summarizes relevant ML
research in this area.
Substantive Validity
According to Bleidorn and Hopwood (2019), the first general
class of evidence for construct validity is substantive validity, which
has often been operationalized as content validity, defined as the
extent to which test items sufficiently sample the conceptual domain
of the construct but do not tap into other constructs. Establishing the
content validity of machine-inferred personality scores proves quite
challenging. This is because the ML approach is based on features
identified empirically in a data-driven manner (Hickman et al., 2022;
Park et al., 2015). As such, we typically do not know a priori which
features (“items”) should predict which personality traits and why;
furthermore, these features are diverse, heterogeneous, and in large
quantity (Bleidorn & Hopwood, 2019).
Although several previous ML studies have partially established
the content validity of machine-inferred personality scores (e.g.,
Hickman et al., 2022; Kosinski et al., 2013; Park et al., 2015;
Yarkoni, 2010), overall, the evidence for content validity has been
very limited. Because Juji’s AI chatbot system uses a DL model that
mines textual features purely based on mathematics, we could not
examine content validity directly. However, we looked at content
validity indirectly (see the Supplemental Analysis section).
Table 1
Summary of Empirical Research on Psychometric Properties of Machine-Inferred Personality Scores

| Aspects of construct validity | Prototypical examples | Major findings/current status |
|---|---|---|
| Substantive validity | | |
| Content validity | Hickman et al. (2022); Kosinski et al. (2013); Park et al. (2015); Yarkoni (2010) | Barring limited exceptions, significant relationships between digital features and questionnaire personality scores bear no substantive meanings. |
| Structural validity | | |
| Test–retest reliability | Harrison et al. (2019); Hickman et al. (2022); Li et al. (2020); Park et al. (2015) | Machine-inferred personality scores have comparable or slightly lower test–retest reliability than questionnaire personality scores. |
| Split-half reliability | Hoppe et al. (2018); Wang and Chen (2020); Youyou et al. (2015) | Split-half reliability of machine-inferred personality scores ranges from .40s to .60s. |
| Internal consistency | | None. |
| Generalizability | Hickman et al. (2022) | Models trained on self-reports tend to exhibit poorer generalizability than models trained on interview reports. |
| Factorial validity | | None. |
| External validity | | |
| Convergent validity | Three meta-analyses: Azucar et al. (2018); Sun (2021); Tay et al. (2020) | Correlations between machine-inferred and questionnaire scores of same personality traits range from .20s to .40s. |
| Discriminant validity | Harrison et al. (2019); Hickman et al. (2022); Marinucci et al. | Correlations among machine-inferred scores tend to be similar to correlations between machine-inferred and questionnaire scores of same traits. |
| Criterion-related validityᵃ | Gow et al. (2016); Harrison et al. (2019, 2020); Wang and Chen (2020) | Machine-inferred chief executive officer personality trait scores predict various objective indicators of firm performance. |
| Incremental validityᵃ | | None. |

ᵃ We only considered studies using non-self-report performance criteria.
Structural Validity
The second general class of evidence for construct validity is
structural validity, which focuses on the internal characteristics of
test scores (Bleidorn & Hopwood, 2019). There are three major
categories of structural validity: reliability, generalizability, and
factorial validity.
Factorial Validity
Factorial validity is established if machine-inferred personality
facet scores may recover the Big Five factor structure (Costa &
McCrae, 1992; Goldberg, 1993) as rendered by self-reported
questionnaire-derived personality facet scores (i.e., same factor loading patterns and similar magnitude of factor loadings). To the best of our
knowledge, no empirical studies have examined the factorial validity
of machine-inferred personality scores, primarily because in almost
all empirical studies, researchers have trained models to predict
personality domain scores rather than facet scores.² In the present
study, we overcome this limitation by building predictive models at
the facet level.
Reliability
Internal consistency (Cronbach’s α) typically does not apply to
machine-inferred personality scores because mined features (often
in the hundreds) are empirically derived in a purely data-driven
manner and are unlikely to be homogenous “items,” thus having
very low item-total correlations (Hickman et al., 2022). However,
Cronbach’s α can be estimated at the personality domain level,
treating facet scores as “items.” Test–retest reliability, on the other
hand, can be readily calculated at both domain and facet levels. We
located several empirical studies (e.g., Gow et al., 2016; Harrison et
al., 2019; Hickman et al., 2022; Li et al., 2020; Park et al., 2015)
reporting reasonable test–retest reliability of machine-inferred personality scores. A third type of reliability index, split-half reliability,
cannot be calculated directly, since there is no “item” in machine-inferred personality assessment. However, ML scholars have found
a way to overcome this difficulty. Specifically, scholars first randomly split the corpus of text provided by participants into halves
with roughly the same number of words or sentences, then apply the
trained model to predict personality scores based on the respective
segments of words separately. Split-half reliability is calculated as
the correlations between these two sets of machine scores with the
Spearman–Brown correction. A few studies (e.g., Hoppe et al.,
2018; Wang & Chen, 2020; Youyou et al., 2015) reported reasonable split-half reliability of machine-inferred scores, with averaged
split-half reliability across Big Five domains ranging from .59 to .71.
In the present study, we estimated test–retest and split-half reliability
for machine-inferred personality facet scores. We also estimated
Cronbach’s αs of machine-inferred personality domain scores,
treating facet scores under respective domains as “items.”
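The two reliability indices described above reduce to short formulas, sketched below under stated assumptions. The function names and the toy numbers are hypothetical; `half_a` and `half_b` stand for machine scores that a trained model would produce from the two random halves of each participant's text, and `cronbach_alpha` treats each inner list as one facet score vector across the same participants.

```python
import random
from statistics import mean, variance

def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def random_halves(sentences, seed=0):
    """Randomly split one participant's sentences into two halves of near-equal size."""
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return shuffled[0::2], shuffled[1::2]

def spearman_brown(r_half):
    """Step the half-length correlation up to full-length reliability."""
    return 2 * r_half / (1 + r_half)

def split_half_reliability(half_a, half_b):
    """half_a / half_b: machine scores inferred from each text half, one per person."""
    return spearman_brown(pearson(half_a, half_b))

def cronbach_alpha(facets):
    """facets: list of k lists; each inner list holds one facet's scores per person."""
    k = len(facets)
    totals = [sum(person) for person in zip(*facets)]   # domain score per person
    return k / (k - 1) * (1 - sum(variance(f) for f in facets) / variance(totals))

# Toy illustration with made-up scores for five participants.
half_a = [1, 2, 3, 4, 5]
half_b = [2, 1, 4, 3, 5]
rel = split_half_reliability(half_a, half_b)

# Two perfectly parallel "facets" give the upper bound of alpha.
alpha = cronbach_alpha([[1, 2, 3, 4], [1, 2, 3, 4]])

first, second = random_halves(["s1", "s2", "s3", "s4", "s5", "s6", "s7"])
```

In practice, the scores fed to `split_half_reliability` would come from applying the same trained model to each half of the text, so the resulting coefficient reflects the consistency of the inference rather than of any questionnaire items.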
Generalizability
Generalizability refers to the extent to which the trained model
may be applied to different contexts (e.g., different samples, different models trained on different sets of digital footprints) and still
yield comparable personality scores (Bleidorn & Hopwood, 2019).
Our review of the literature reveals that very few ML studies have
examined the issue of generalizability. One important exception is a
study done by Hickman et al. (2022) who obtained four different
samples, trained predictive models on Samples 1–3 individually,
and then applied trained models to separate samples. Hickman et al.
reported mixed findings regarding the generalizability of machine-inferred personality scores.
In the present study, we focused on cross-sample generalizability,
looking at many aspects of cross-sample generalizability, including
reliability (internal consistency at the domain level and split-half at
the facet level), factor structure, and convergence and discrimination
relations at both the latent and manifest variable levels (to be
discussed subsequently).
External Validity
The third general class of evidence for construct validity is
external validity, which focuses on the correlation patterns between
test scores and external, theoretically relevant variables (Bleidorn &
Hopwood, 2019). Within this class of evidence, researchers typically look at convergent validity, discriminant validity, criterion-related validity, and incremental validity.
Convergent Validity
Within the ML approach, convergent validity refers to the magnitude of correlations between machine-inferred and questionnaire-derived personality scores of the same personality traits. Because
most computer algorithms treat the latter as ground truth and aim to
maximize its prediction, it is not surprising that convergent validity of
machine-inferred personality scores has been routinely examined,
which has yielded several meta-analyses (e.g., Azucar et al., 2018;
Sun, 2021; Tay et al., 2020). These meta-analyses reported modest-to-moderate convergent validity across Big Five domains, ranging
from .20s to .40s.
Discriminant Validity
In contrast to the heavy attention given to convergent validity, very
few empirical studies have examined the discriminant validity of
machine-inferred personality scores (Sun, 2021). A measure with
good discriminant validity should demonstrate that correlations
between different measures of the same constructs (convergent
relations) are much stronger than correlations between measures of
different constructs using the same methods (discriminant relations;
Cronbach & Meehl, 1955). Researchers usually rely on the multitrait–multimethod
(MTMM) matrix to investigate convergent and discriminant validity.
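As an illustration of the comparison just described, the following sketch summarizes a manifest-level MTMM matrix by averaging the convergent correlations (same trait, different methods) and the discriminant correlations (different traits, same method). The data are simulated and purely hypothetical: the `shared` term is an assumed source of variance common to all machine-scored traits, included only to show how overlapping features could inflate correlations among machine scores.

```python
import random
from itertools import combinations
from statistics import mean

def pearson(x, y):
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def mtmm_summary(scores):
    """scores[method][trait] -> per-person scores, same person order throughout.

    Returns the average convergent correlation and, for each method,
    the average correlation among different traits measured by that method.
    """
    methods = list(scores)
    traits = list(scores[methods[0]])
    convergent = [pearson(scores[m1][t], scores[m2][t])
                  for m1, m2 in combinations(methods, 2) for t in traits]
    discriminant = {
        m: mean(pearson(scores[m][t1], scores[m][t2])
                for t1, t2 in combinations(traits, 2))
        for m in methods
    }
    return mean(convergent), discriminant

# Simulated example with three traits and two methods (all values hypothetical).
rng = random.Random(7)
traits = ["O", "C", "E"]
scores = {"self": {t: [] for t in traits}, "machine": {t: [] for t in traits}}
for _ in range(1000):
    shared = rng.gauss(0, 1)          # variance common to all machine-scored traits
    for t in traits:
        latent = rng.gauss(0, 1)      # independent true trait standing
        scores["self"][t].append(latent + rng.gauss(0, 0.5))
        scores["machine"][t].append(latent + shared + rng.gauss(0, 1))

avg_convergent, avg_discriminant = mtmm_summary(scores)
```

Discriminant validity is supported to the extent that `avg_convergent` clearly exceeds the entries of `avg_discriminant`; a large gap between the machine and self-report entries points to method-specific inflation.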
We identified four empirical studies that examined the discriminant
validity of machine-inferred personality scores (Harrison et al., 2019;
Hickman et al., 2022; Marinucci et al., 2018; Park et al., 2015), with
findings suggesting relatively poor discriminant validity. For instance,
Park et al. (2015) showed that the average correlations among Big
Five domain scores were significantly higher when measured by the
ML method than by self-report questionnaires (r̄ = .29 vs. r̄ = .19).
One possible reason for relatively poor discriminant validity is that
there are typically many features (easily in the hundreds) in predictive
models inferring personality. As a result, it is common that models
predicting different personality traits share many common features,
which would inflate correlations between machine-inferred personality scores.
² Speer (2021) examined the factor structure of machine-inferred
dimension-level performance scores, but we are interested in the factor
structure of machine-inferred personality scores.
In the present study, since we built predictive models at the facet
level, we were able to examine the convergent and discriminant
validity of personality domain scores at both the manifest and latent
variable levels. Analyzing the MTMM structure of latent scores
allowed for disentangling trait, method, people-specific, and measurement error influences on personality scores. One advantage of
the latent variable approach is that it separates the method variance
from the measurement error variance. As a result, correlations
among personality domain scores should be more accurate at the
latent variable level than at the manifest variable level.
Criterion-Related and Incremental Validity
Given the modest-to-moderate correlation between machine-inferred and self-reported questionnaire-based scores (typically in
the .20–.40 range as reported in several meta-analyses; e.g., Sun,
2021; Tay et al., 2020) and the similarly modest correlations between
self-reported questionnaire-derived personality scores and job performance (r = .18 for personality scores overall meta-analytically;
Sackett et al., 2022), one could expect that machine-inferred personality scores would exhibit some criterion-related validity—
operationalized as the cross product of the above two coefficients
(e.g., .30 × .18 = .054)—if treating the portion of machine-inferred
scores that converges with self-reported scores as predictive of
performance criteria. However, it may be argued that criterion-related
validity, in this case, would be too small to be practically useful.
Yet, we want to offer another reasoning approach: the criterion-related validity of machine-inferred personality scores does not
have to only come from the portion converging with self-reported
questionnaire-derived personality scores. In other words, the part of the
variance in machine-inferred personality scores that is unshared by
self-reported questionnaire-derived personality scores might still be
criterion relevant. Conceptually, it is plausible that these two types of
personality scores capture related but distinct aspects of personality.
For instance, self-reported questionnaire-derived personality scores
represent one’s self-reflection on typical and situation-consistent
behaviors (Pervin, 1994), with much of the nuances in actual behaviors
perhaps lost. In contrast, the ML approach extracts many linguistic
cues that may capture the nuances in behaviors above and beyond
typical behaviors captured by self-reported questionnaire-derived
personality scores. Although the exact nature of the unique aspects
of personalities captured by the ML approach is unclear, their criterion
relevance is not only a theoretical question but also an empirical
one. Thus, if our above reasoning is correct, we would expect machine-inferred personality scores to exhibit not only criterion-related validity
but also incremental validity over self-reported questionnaire-derived
personality scores.
Although a few empirical studies (e.g., Kulkarni et al., 2018; Park
et al., 2015; Youyou et al., 2015) have reported that machine-inferred
personality scores predicted some demographic variables (e.g., network size, field of study, political party affiliation) and self-reported
outcomes (e.g., life satisfaction, physical health, and depression),
there has been a lack of evidence that machine-inferred personality
scores can predict organizationally relevant non-self-report criteria,
such as performance and turnover. The reasons self-report criteria are
less desirable include: (a) they are rarely used in talent assessment
practice and (b) machine-inferred personality scores based on self-report models and self-report criteria share the same source, which
might inflate criterion-related validity (Park et al., 2015).
Interestingly, we located a handful of studies in the field of
strategic management that documented criterion-related validity of
machine-inferred chief executive officer (CEO) personality scores
(e.g., Gow et al., 2016; Harrison et al., 2019, 2020; Wang & Chen,
2020). For example, Harrison et al. (2019) obtained 207 CEOs’
spoken or written texts (e.g., transcripts of earnings calls), extracted
textual features, had experts (trained psychology doctoral students)
rate these CEOs’ personalities based on their texts, and then built
predictive models accordingly. Harrison et al. then applied the trained
model to estimate Big Five domain scores for 3,449 CEOs of 2,366
S&P 1,500 firms between 1996 and 2014 and then linked them to firm
strategy changes (e.g., advertising intensity, R&D intensity, inventory
levels) and firm performance (return on assets). These authors
reported that machine-inferred CEO Openness scores were positively
related to firm strategic changes and that the relationship was stronger
when the firm’s performance of the previous year was low versus
high. Unfortunately, studies in this area typically obtained expert-rated personality scores only for small samples of CEOs (n < 300)
during model building and thus were unable to examine incremental
validity in much larger holdout samples.
Thus, to the best of our knowledge, even in the broad ML literature, there
has not been any empirical evidence for incremental validity of
machine-inferred personality scores over questionnaire-derived personality scores beyond self-report criteria. The present study addressed this major limitation by investigating the criterion-related and
incremental validity of machine-inferred personality scores using two
non-self-report performance criteria: (a) objective cumulative grade
point average (GPA) and (b) peer-rated college adjustment.
Method
Transparency and Openness
We describe our sampling plan, all data exclusions, manipulations,
and all measures in the study, and we adhered to the Journal of
Applied Psychology methodological checklist. Raw data and computer code for model building are not available due to their proprietary
nature, but processed data (self-reported questionnaire-derived and
machine-inferred personality scores) are available at https://osf.io/w73qn/?view_only=165d652ef809442fbab46f815c57f467. Other
associated research materials are included in the online Supplemental
Materials. Data were analyzed using Python 3 (Van Rossum & Drake,
2009); the scikit-learn package (Pedregosa et al., 2011); R (Version
4.0.0); the package psych, Version 2.2.5 (Revelle, 2022); IBM SPSS
Statistics (Version 27); and Mplus 8.5 (Muthén & Muthén, 1998/
2017). The study design and its analysis were not preregistered.
Sample and Procedure
Participants were 1,957 undergraduate students enrolled in
various psychology courses, recruited from the subject pool
operated by the Department of Psychological Sciences at a large
Southeastern public university via Sona Systems (https://www.sona-systems.com/; Auburn University institutional review
board Protocol 16-354 MR 1609, Examining the Relationships
among Daily Word Usage through Online Media, Personalities,
and College Performance; Auburn University institutional review
board Protocol 18-410 EP 1902, Linking Online Interview/Chat
Response Scripts to Self-reported Personality Scores). To ensure a
large enough sample size, we collected data over several semesters.
We obtained a training sample (n = 1,477) and a separate test sample
(n = 480). The training sample contains data collected in fall 2018
(n = 531 in the lab), spring 2019 (n = 202 in the lab and n = 396
online), and spring 2020 (n = 348 online), and it was used to build
predictive models. The separate test sample contained data collected
in spring 2017 (n = 480 in the lab) including the criteria data. The test
sample was used for cross-sample validation and model testing.
During spring 2017 data collection, when participants arrived at
the lab, they first completed an online Big Five personality measure
(IPIP-300) on Qualtrics and then were directed to the AI firm’s
virtual conversation platform, where they engaged with the AI
chatbot for approximately 20–30 min.⁴ The chatbot asked participants several open-ended questions organized around a series of
topical areas, which were generic in nature (see the online Supplemental Materials A for a list of interview questions used in the
present study) and thus should probably be considered unstructured interview questions. In other words, although the same questions were given to all participants, they were not designed to measure
anything specific. The questions were displayed in a chat box on the
computer screen, and participants typed their responses into the chat
box. Participants were allowed to ask the chatbot questions.
After the virtual conversation, participants completed an enrollment certification request form, which gave us permission to obtain
their SAT and/or American College Testing (ACT) scores and
cumulative GPA from the University Registrar. Next, participants
were asked to provide the names and email addresses of three peers
who knew their college life well. Immediately after the lab session,
the experimenter randomly selected one of the three peers and sent out
an invitation email for a peer evaluation with a survey link. One week
later, a reminder email was sent to the peer. If the first peer failed to
respond, the experimenter randomly selected and invited another
peer. The process continued until all three peers were exhausted. It
turned out that eight participants received multiple peer ratings, and
we used the first peer’s ratings in subsequent analyses. Peers were
sent a $5 check for their time (10–15 min). We assigned a participant
ID to each participant, which was used as the identifier to link his/her
self-reported questionnaire-derived personality data, virtual conversation scripts, peer ratings, and cumulative GPAs.
Out of 480 participants in the spring 2017 sample, 73 participants
failed to enter their participant IDs, mistyped them, or chose to
withdraw from the study and were excluded. Thus, we obtained
matched data for 407 participants, among whom 75.2% were female,
and the mean age was 19.38 years. Out of 407 participants, we were
able to obtain 379 participants’ Scholastic Aptitude Test (SAT)/ACT
scores and cumulative GPA from the University Registrar and 301
participants’ peer ratings. Two hundred eighty-nine participants had
both SAT/ACT scores and peer ratings.
For data collected after spring 2017, 733 participants came to our
lab and went through the same study procedure as those in spring
2017 (the test sample) but without collecting the criteria data. In
addition, 744 completed the study online via a Qualtrics link for the
IPIP-300 and the AI firm’s virtual conversation system for the online
chat. No criteria data were collected for the online participants,
either. Based on the same matching procedure as the test sample, we
were able to match the personality data and the virtual conversation
scripts for 1,037 out of 1,477 participants in the training sample,
among whom 76.5% were female with an average age of 19.90 years.
With respect to race, 80% were White, 5% were Black, 3% were
Hispanic, 2% were Asian, and 10% did not disclose their race. These
participants’ data were used subsequently to build predictive models. For the entire sample, the median number of sentences provided
by participants through the chatbot was 40 sentences, ranging from
26 to 112 sentences.
To examine test–retest reliability, we obtained another independent sample. Participants were 74 undergraduate students enrolled
in one of the two sections of a psychological statistics course at the
same university in fall 2021 (Auburn University Protocol 21-351
EX 2111, Examining Test–Retest Reliability of Machine-Inferred
Personality Scores). Participants were invited to engage with the
chatbot twice with the same set of interview questions as the main
study in exchange for extra course credit. Sixty-one participants
completed both online chats, with time lapses ranging from 3 to
35 days with an average time lapse of 22 days.
Measures
Self-Reported Questionnaire-Derived Personalities
The 300-item IPIP personality inventory (Goldberg, 1999) was
used. The IPIP-300 was designed to measure 30 facet scales in the
Big Five framework, modeled after the NEO Personality Inventory
(NEO PI-R; Costa & McCrae, 1992). Items were rated on a 5-point
scale ranging from 1 (strongly disagree) to 5 (strongly agree). These
30 facet scales, Big Five domain scales, and their reliabilities in the
present samples (both the training and test samples) are presented in
Tables 2 and 3. There was no missing data at the item level.
Machine-Inferred Personalities
The machine learning algorithm the AI firm helped develop was
used to estimate the 30 personality facet scores based on mined
textual features of participants’ virtual conversation scripts. See the
Analytic Strategies section for NLP and model-building details.
Reliabilities of machine-inferred 30 facet scores and Big Five
domain scores⁵ in the training and test samples are presented in
Tables 2 and 3, respectively.
³ Juji developed an algorithm to infer personality scores prior to our study.
We initially applied their original algorithm to spring 2017 data to calculate
machine-inferred personality scores; however, results showed that machine-inferred scores did not significantly predict the criteria (GPA and peer-rated
college adjustment). After a discussion with the AI firm’s developers, we
identified potential flaws in their original model-building strategy, which did
not use self-reported personality scores as ground truth. Thus, we decided to
collect online chat data and self-reported personality data during the subsequent semesters to enable the AI firm to train new predictive models using
self-reported personality scores as ground truth.
⁴ The 25th, 50th, and 75th percentiles of time spent on a virtual conversation
were 18, 22, and 30 min, respectively. We include two examples of chatbot
conversation scripts across five participants in the online Supplemental
Materials A. Out of professional courtesy, the AI firm (Juji, Inc.) provides
a free chatbot demo link: https://juji.ai/pre-chat/62f571ec-11d7-4d1e-9336-37066dfa0f48, with these instructions: (a) use a computer (not a cell phone) for
the online chat; (b) use Google Chrome as the browser; (c) responses should be
in sentences (rather than words); (d) finish the entire chat in one sitting; and (e)
if enough inputs are provided, the algorithm will infer and show your
personality scores on the screen toward the end of the online chat.
Table 2
Reliabilities of Self-Reported Questionnaire-Derived and Machine-Inferred Personality Facet Scores
[Table values not recoverable from the extracted text. Rows: personality facets, with averages for Openness, Conscientiousness, Extraversion, Agreeableness, and Neuroticism. Columns: questionnaire-derived scores (coefficient α) and machine-inferred scores for the training sample (n = 1,037) and the test sample (n = 407), plus test–retest reliability (n = 61).]
Note. Average = averaged reliabilities of facet scales within a Big Five domain. Test–retest reliabilities of machine-inferred personality scores were based
on an independent sample (n = 61) that was not part of the test sample (n = 407).
Objective College Performance
Participants’ cumulative GPAs were obtained from the University Registrar.
Peer-Rated College Adjustment
Scholars have advocated expanding the conceptual space of college
performance beyond GPA to include alternative dimensions such as
social responsibility (Borman & Motowidlo, 1993) and adaptability
and life skills (Pulakos et al., 2000). Oswald et al. (2004) proposed
a 12-dimension college adjustment model that covers intellectual,
interpersonal, and intrapersonal behaviors. Oswald et al. developed a
12-item Behaviorally Anchored Rating Scale to assess college students’ adjustment on these 12 dimensions, for instance, (a) knowledge,
learning, and mastery of general principles; (b) continuous learning,
intellectual interest, and curiosity; and (c) artistic cultural appreciation
and curiosity. For each dimension, peers were presented with
its name, definition, and two brief examples and were then asked to
rate their friend’s adjustment on this dimension using a 7-point scale
ranging from 1 (strongly disagree) to 7 (strongly agree). There were no missing
data at the item level. Cronbach’s α was .87 in the test sample.
⁵ We used two methods to calculate machine-inferred personality domain
scores. For the first method, we calculated domain scores as averaged
machine-inferred facet scores in respective domains. For the second method,
we first calculated self-reported domain scores based on self-reported facet
scores and then built five predictive models with self-reported domain scores
as ground truth. Next, we applied the trained model to predict machine-inferred domain scores in both training and test samples. It turned out that
both methods resulted in similar results and identical statistical conclusions.
We therefore present the results based on the first method in the present
article and refer readers to results based on the second method in the online
Supplemental Material A (Supplemental Tables S1–S3).
Table 3
Reliabilities (Cronbach’s α) of Self-Reported Questionnaire-Derived and Machine-Inferred Personality Domain Scores
[Table values not recoverable from the extracted text. Rows: personality domains. Columns: questionnaire-derived and machine-inferred scores for the training sample (n = 1,037) and the test sample (n = 407).]
Note. Cronbach’s αs were calculated by treating facet scores under respective personality traits as “items.”
Following Oswald et al., overall peer-rating scores (the sum of scores
on the 12 dimensions) were used in subsequent analyses. In the present
test sample, GPA and peer-rated college adjustment were modestly
correlated (r = .21), suggesting that the two criteria represent distinct,
yet somewhat overlapping conceptual domains.
Control Variables
We obtained participants’ ACT and/or SAT scores from the
registrar along with their cumulative GPA. We converted SAT
scores into ACT scores using the 2018 ACT-SAT concordance
table (American College Testing, 2018). When examining the
criterion-related validity of personality scores, we controlled for
ACT scores, which served as proxies for cognitive ability.
Analytic Strategies
Natural Language Processing
The conversation text from each participant was first segmented
into single sentences. Then, sentence embedding (encoding) was
performed on each sentence using the Universal Sentence Encoder
(USE; Cer et al., 2018), which is a DL model that was trained and
optimized for greater-than-word length text (e.g., sentences, phrases,
and paragraphs). A sentence embedding yields a list of values,
usually in an ordered vector, that numerically represent the meanings of a sentence by machine understanding. The USE takes in a
sentence string and outputs a 512-dimension vector. The USE model
adopted in this study was pretrained with a deep averaging network
(DAN) encoder on a variety of large data sources and a variety of
tasks by the Google USE team to capture semantic textual information such that sentences with similar meanings should have embeddings close together in the embedding space. The resultant model (a
DL network) is then fine-tuned by adjusting parameters for the
training data. Given the advantage of capturing contextual meanings, the USE is commonly used for text classification, semantic
similarity, clustering, and other natural language tasks. To obtain
text features for predictive models, we averaged each participant’s
sentence embeddings across sentences, resulting in 512 feature
scores for each participant. The same average sentence vector
was used as the predictor in all subsequent model building. The
practice of averaging embeddings is common in NLP research and
has been shown to yield excellent model performance in various
language tasks with much higher model training efficiency than
other types of models, including some more sophisticated ones
(Joulin et al., 2016; Yang et al., 2019).
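As a minimal sketch of the feature-construction step described above, the following Python code averages sentence-level vectors into one 512-dimensional feature vector per participant. The function `toy_sentence_embedding` is a hypothetical stand-in that returns pseudo-random vectors; it is not the Universal Sentence Encoder, and the example sentences are invented for illustration.

```python
import numpy as np

EMBED_DIM = 512  # output dimensionality of the USE, as stated in the text


def toy_sentence_embedding(sentence: str, dim: int = EMBED_DIM) -> np.ndarray:
    """Hypothetical stand-in for a sentence encoder (NOT the real USE).

    Returns a deterministic pseudo-random vector seeded by the characters.
    """
    seed = sum(ord(ch) for ch in sentence)
    return np.random.default_rng(seed).standard_normal(dim)


def participant_features(sentences: list[str]) -> np.ndarray:
    """Average the sentence embeddings of one participant's script."""
    embeddings = np.vstack([toy_sentence_embedding(s) for s in sentences])
    return embeddings.mean(axis=0)


# Invented example script, already segmented into single sentences.
script = [
    "I enjoy working in teams.",
    "My favorite class is statistics.",
    "I like hiking on weekends.",
]
features = participant_features(script)
print(features.shape)
```

In an actual pipeline the stand-in function would be replaced by calls to a pretrained sentence encoder; the averaging step itself is unchanged.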
Model Building
When multicollinearity among predictors is high and/or when there
are many predictors relative to sample size—which is a typical situation
ML scholars face—ordinary least squares regression estimators are
still unbiased but tend to have large variances (Zou & Hastie, 2005).
Regularized regression methods are used to help address the bias–
variance trade-off. For instance, ridge regression penalizes large β’s by
imposing the same amount of shrinkage across β’s, referred to as the L2
penalty (Hoerl & Kennard, 1988). Least absolute shrinkage and
selection operator (LASSO) regression shrinks some β’s to zero
with varying amounts of shrinkage across β’s, referred to as the L1
penalty (Tibshirani, 1996). Elastic net regression combines ridge
regression and LASSO regression using two hyperparameters: alpha
(α) and lambda (λ; Zou & Hastie, 2005).⁶ The α parameter determines
the relative weights given to the L1 versus L2 penalty. When α ranges
between 0 and .5, elastic net behaves more like ridge regression
(L2 penalty; at α = 0, it becomes completely ridge regression).
When α ranges between .5 and 1, elastic net behaves more like
LASSO regression (L1 penalty; at α = 1, it becomes completely
LASSO regression). The λ parameter determines how severely regression weights are penalized. We built predictive models on the training
sample (n = 1,037) using elastic net regression via the scikit-learn
package in Python 3 (Pedregosa et al., 2011). Fivefold cross-validation
was used to help tune the model’s hyperparameters (α and λ).⁷ Once
hyperparameters were tuned, the model was then fitted to the entire
training sample with the hyperparameters fixed to their optimal values to
obtain the final model parameters for the training sample. The elastic net
has been considered the optimal modeling solution for linear relationships (Putka et al., 2018). Next, we applied the trained models to predict
personality scores for 1,037 and 407 participants in the training and test
samples, respectively. We trained 30 predictive models.
⁶ The loss function in elastic net regression is as follows:
$\sum_{i=1}^{n}\left(y_i - \sum_{j=1}^{p} x_{ij}\beta_j\right)^2 + \lambda\left[\frac{1-\alpha}{2}\sum_{j=1}^{p}\beta_j^2 + \alpha\sum_{j=1}^{p}\lvert\beta_j\rvert\right]$, in which the first
term is the ordinary least squared loss function (sum of squared residuals),
the second term is L2 penalty (ridge regression), and the third term is an L1
penalty (LASSO regression), with α indicating relative weights given to L1
versus L2 penalty and λ indicating the amount of penalty. Please refer to the
online Supplemental Materials A, for a nontechnical explanation of how
elastic net regression works.
⁷ Specifically, the training sample was split into five equally sized partitions
to build different combinations of training and validation data sets for better
estimation of the model’s out-of-sample performance. A model was fit using all
subsets except the first fold, and then the model was applied to the first fold to
examine model performance (i.e., prediction error in the validation data set in
the current case). Then the first subset was returned to the training sample, and
the second subset was used as the hold-out sample in the second round of cross-validation. The procedure was repeated until the kth round of cross-validation
was completed. In each round of cross-validation, a series of possible values of
hyperparameters (α and λ) were explored and corresponding model performance indices were obtained. Then, the model performance indices associated
with the same set of hyperparameter values were averaged across five cross-validation trials, and hyperparameter values associated with the best overall
model performance were chosen as the optimal hyperparameters, thus completing the hyperparameter tuning process.
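The tuning-and-refit procedure described in this section can be sketched with scikit-learn's `ElasticNetCV`. This is an illustration under stated assumptions, not the authors' proprietary code: the data are synthetic, the hyperparameter grids are arbitrary, and only one outcome is modeled rather than 30 facets. Note the naming difference: scikit-learn's `l1_ratio` corresponds to α in the text, and scikit-learn's `alpha` corresponds to λ (scikit-learn also scales the squared-error term by 1/(2n)).

```python
import numpy as np
from sklearn.linear_model import ElasticNetCV

# Synthetic stand-in data (assumption): 150 "participants", 50 text features.
rng = np.random.default_rng(0)
n_participants, n_features = 150, 50
X = rng.standard_normal((n_participants, n_features))
true_weights = np.zeros(n_features)
true_weights[:5] = 0.5
y = X @ true_weights + rng.standard_normal(n_participants)

# Illustrative grids (assumptions, not the values used in the study).
l1_ratio_grid = [0.1, 0.5, 0.9, 1.0]   # the text's alpha (L1 vs. L2 mix)
lambda_grid = np.logspace(-3, 0, 10)   # the text's lambda (penalty size)

# Fivefold cross-validation over the grid; after tuning, ElasticNetCV
# refits on the full training data with the selected hyperparameters.
model = ElasticNetCV(l1_ratio=l1_ratio_grid, alphas=lambda_grid,
                     cv=5, max_iter=10000)
model.fit(X, y)

print("selected l1_ratio (alpha in the text):", model.l1_ratio_)
print("selected alpha (lambda in the text):", model.alpha_)

# Apply the trained model to obtain predicted (machine-inferred) scores.
predicted_scores = model.predict(X)
```

Repeating this fit once per facet, each time with the self-reported facet score as the outcome, would yield one model per facet, in line with the 30 models mentioned in the text.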
Reliability
Internal consistency and test–retest reliability are straightforward
to estimate. For split-half reliability, we randomly divided participants’ conversation scripts into two halves with an equal number of
sentences in each, applied the trained model to obtain two sets of
machine scores, and then calculated their correlations with the
Spearman–Brown correction. To obtain more consistent results,
we shuffled the sentence order for each participant before each split-half trial and reported split-half reliabilities averaged over 20 trials.
Factorial Validity
As independent-cluster structures with zero cross-loadings are too
ideal to be true for personality data (Hopwood & Donnellan, 2010),
and forcing nonzero cross-loadings to zero is detrimental in multiple
ways (Zhang et al., 2021), a method that allows cross-loadings
would be more appropriate. In addition, since we have two types
(sets) of personality scores, it would be more informative if both
types of scores are modeled within the same model so that correlations among latent factors derived from self-reported questionnaire-derived and machine-inferred scores can be directly estimated.
Therefore, the set exploratory structural equation modeling (set-ESEM; Marsh et al., 2020) is an excellent option. In set-ESEM, two
(or more) sets of constructs are modeled within a single model such
that cross-loadings are allowed for factors within the same set but are
constrained to be zero for constructs in different sets.
Set-ESEM overcomes the limitations of ESEM (i.e., lack of
parsimony and potential for confounding constructs) and represents
a middle ground between the flexibility of exploratory factor analysis
or full ESEM and the parsimony of confirmatory factor analysis
(CFA) or structural equation modeling, as set-ESEM aims to achieve
a balance between CFA and full ESEM in
terms of goodness-of-fit, parsimony, and factor structure definability
(i.e., specifications of empirical item-factor mappings corresponding
to a priori theories; Marsh et al., 2020). Target rotation was used
because we have some prior knowledge about which personality
domain (factor) each facet should belong to. The set-ESEM analyses
were conducted using Mplus 8.5 (Muthén & Muthén, 1998/2017).
Tucker’s congruence coefficients (TCCs) were used to quantify
the similarity between factors assessed with a self-report questionnaire and factors inferred by the machine. TCC has been shown to be a
useful index for representing the similarity of factor loading patterns
of comparison groups (Lorenzo-Seva & ten Berge, 2006): A TCC
value in the range of .85–.94 corresponds to a fair similarity,
whereas a TCC value at or higher than .95 can be seen as evidence
that the two factors under comparison are identical. TCCs were
calculated using the package psych, Version 2.2.5 (Revelle, 2022) in
R, Version 4.0.0 (R Core Team, 2020). Given that TCC focuses
primarily on the overall pattern (profile similarity), we also examined absolute agreement of magnitude. Specifically, we plotted factor
loadings from machine-inferred facet scores against those from self-reported questionnaire-derived facet scores. We also calculated the
root-mean-square error (RMSE) for each domain in both the training
and testing samples. We further added two lines in the plots to show
the range where the difference in factor loadings is less than .10. If
most dots fall within the range formed by the two lines, it means that
factor loadings were also similar in magnitude across the two
measurement approaches.
Convergent and Discriminant Validity
We used Woehr et al.’s (2012) convergence, discrimination, and
method variance indices calculated based on the MTMM matrix
to examine convergent and discriminant validity. The convergence
index (C1) was calculated as the average of the monotrait–
heteromethod correlations. Conceptually, C1 indicates the proportion
of expected observed variance in trait-method units attributable to the
person main effects and shared variance specific to traits. A positive
and large C1 indicates strong convergent validity and serves as the
benchmark to examine discriminant indices (Woehr et al., 2012).
The first discriminant index (D1) was calculated by subtracting
the average of absolute heterotrait–heteromethod correlations from
C1, where the former is conceptualized as the proportion of expected
observed variance attributable to the person main effects. Thus, a
positive and large D1 indicates that a much higher proportion of
expected observed variance can be attributed to traits versus person,
thus high discriminant validity. The second discriminant index (D2)
is calculated by subtracting the average of absolute heterotrait–
monomethod correlations from C1. Conceptually, D2 compares the
proportion of expected observed variance attributable to traits versus
methods. A positive and large D2 indicates high discriminant
validity, in that trait-specific variance dominates method-specific
variance (Woehr et al., 2012). We also calculated D2a, a variant of
D2 calculated using only machine monomethod correlations (C1 minus
the average of absolute heterotrait–machine method correlations).
This is done considering previous empirical studies showing that the
machine method tends to pose a major threat to the discriminant
validity of machine-inferred personality scores (Park et al., 2015;
Tay et al., 2020). We also calculated the average of absolute
heterotrait–machine method correlations and the average of absolute
heterotrait–self-report method correlations. If the former is substantially larger than the latter, it suggests machine-inferred scores tend
to have relatively poorer discriminant validity than self-reported
questionnaire-derived scores.
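The indices defined above can be computed directly from an MTMM correlation matrix. The sketch below assumes the matrix is ordered with self-report traits first and machine traits second, in the same trait order; the function name `woehr_indices` and the three-trait example matrix are invented for illustration and are not results from this study.

```python
import numpy as np


def woehr_indices(r: np.ndarray, n_traits: int) -> dict[str, float]:
    """Convergence and discrimination indices from an MTMM matrix.

    r is a (2 * n_traits) x (2 * n_traits) correlation matrix ordered as
    [self-report traits, machine traits], with traits in the same order.
    """
    self_block = r[:n_traits, :n_traits]      # heterotrait-self-report
    machine_block = r[n_traits:, n_traits:]   # heterotrait-machine
    cross_block = r[:n_traits, n_traits:]     # heteromethod block

    off_diag = ~np.eye(n_traits, dtype=bool)
    upper = np.triu(np.ones((n_traits, n_traits), dtype=bool), k=1)

    c1 = np.diag(cross_block).mean()                # monotrait-heteromethod
    hthm = np.abs(cross_block[off_diag]).mean()     # heterotrait-heteromethod
    ht_self = np.abs(self_block[upper]).mean()
    ht_machine = np.abs(machine_block[upper]).mean()
    htmm = np.abs(
        np.concatenate([self_block[upper], machine_block[upper]])
    ).mean()                                        # heterotrait-monomethod

    return {
        "C1": c1,
        "D1": c1 - hthm,
        "D2": c1 - htmm,
        "D2a": c1 - ht_machine,
        "heterotrait_self": ht_self,
        "heterotrait_machine": ht_machine,
    }


# Invented three-trait example (not data from the study).
n = 3
self_block = np.full((n, n), 0.2)
np.fill_diagonal(self_block, 1.0)
machine_block = np.full((n, n), 0.6)
np.fill_diagonal(machine_block, 1.0)
cross_block = np.full((n, n), 0.1)
np.fill_diagonal(cross_block, 0.4)
mtmm = np.block([[self_block, cross_block], [cross_block.T, machine_block]])

indices = woehr_indices(mtmm, n_traits=n)
print(indices)
```

In this invented matrix the machine scores intercorrelate more strongly than the self-report scores, so D2a falls below D2, which is the pattern the comparison of the two heterotrait averages is meant to detect.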
Criterion-Related Validity
To examine criterion-related validity, we first looked at the partial
bivariate correlations between machine-inferred personality domain
scores and the two external non-self-report criteria (GPA and peer-rated college adjustment) controlling for ACT scores. We then ran
10 sets of regression analyses with the two external criteria as
separate outcomes and each of the Big Five domain scores as
separate personality predictors. Specifically, the criterion was first
regressed on ACT scores (Step 1), then on the self-reported
questionnaire-derived personality domain scores (Step 2), and
then on the respective machine-inferred personality domain scores
(Step 3). We were aware that the typical analytic strategy examining
criterion-related validity entails entering all five personality domain
scores simultaneously in respective steps in regression models.
However, we were concerned that due to poor discriminant validity
of machine scores (i.e., high correlations among machine scores),
such a strategy might mask the effect of machine scores on criteria.
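The three-step regression described above amounts to comparing R² across nested models. The following sketch uses synthetic data and ordinary least squares via NumPy; the variable names and effect sizes are invented for illustration, and the change in R² at Step 3 is the quantity that would indicate incremental validity of a machine-inferred domain score.

```python
import numpy as np


def r_squared(y: np.ndarray, predictors: list[np.ndarray]) -> float:
    """R-squared of an OLS regression of y on the predictors plus intercept."""
    X = np.column_stack([np.ones(len(y)), *predictors])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    residuals = y - X @ beta
    return 1.0 - residuals.var() / y.var()


# Synthetic stand-in data (assumption): one criterion, one Big Five domain.
rng = np.random.default_rng(42)
n = 300
act = rng.standard_normal(n)                               # control variable
self_report = rng.standard_normal(n)                       # questionnaire score
machine = 0.3 * self_report + rng.standard_normal(n)       # machine score
gpa = 0.4 * act + 0.2 * self_report + 0.15 * machine + rng.standard_normal(n)

r2_step1 = r_squared(gpa, [act])                           # Step 1: ACT
r2_step2 = r_squared(gpa, [act, self_report])              # Step 2: + self-report
r2_step3 = r_squared(gpa, [act, self_report, machine])     # Step 3: + machine

delta_r2_self = r2_step2 - r2_step1
delta_r2_machine = r2_step3 - r2_step2   # incremental validity of machine score
print(round(delta_r2_self, 3), round(delta_r2_machine, 3))
```

Running the same three steps separately for each of the five domains and each of the two criteria gives the 10 sets of analyses mentioned in the text.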
Results
Reliability
Tables 2 and 3 present facet- and domain-level reliabilities of self-reported questionnaire-derived and machine-inferred personality
scores, respectively. Several observations are noteworthy. First,
at the facet level, self-reported questionnaire-derived personality
scores showed good internal consistency and comparable split-half
reliabilities, which is not surprising, as the IPIP-300 is a well-established personality inventory.
The second observation is that, at the facet level, split-half
reliabilities of machine-inferred personality scores, albeit somewhat
lower than those of self-reported questionnaire-derived personality
scores, were overall in the acceptable range. Averaged split-half
reliabilities for facet scores in the training and test samples are as
follows: r̄ = .68 and .63 for Openness facets; r̄ = .67 and .68
for Conscientiousness facets; r̄ = .64 and .64 for Extraversion
facets; r̄ = .73 and .68 for Agreeableness facets; and r̄ = .60 and
.57 for Neuroticism facets. These results were comparable to those
reported in similar ML studies (e.g., Hoppe et al., 2018; Wang &
Chen, 2020; Youyou et al., 2015). Further, split-half reliabilities
are comparable between the training and test samples (averaged
split-half reliabilities are .66 and .64, respectively), suggesting good
cross-sample generalizability.
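The split-half procedure behind these estimates (shuffle each participant's sentences, score each half, correlate, apply the Spearman–Brown correction, and average over 20 trials) can be sketched as follows. The embeddings and the linear `weights` vector are synthetic stand-ins for the real sentence embeddings and trained model, and dropping one sentence when a script has an odd number of sentences is an assumption of this sketch.

```python
import numpy as np


def split_half_reliability(embeddings: np.ndarray, weights: np.ndarray,
                           n_trials: int = 20, seed: int = 0) -> float:
    """Average Spearman-Brown-corrected split-half reliability.

    embeddings: (participants, sentences, dim) array of sentence vectors.
    weights: linear model weights standing in for a trained model.
    """
    rng = np.random.default_rng(seed)
    n_sentences = embeddings.shape[1]
    half = n_sentences // 2
    corrected = []
    for _ in range(n_trials):
        scores_a, scores_b = [], []
        for person in embeddings:
            order = rng.permutation(n_sentences)   # shuffle sentence order
            first = person[order[:half]].mean(axis=0)
            second = person[order[half:2 * half]].mean(axis=0)
            scores_a.append(first @ weights)       # machine score, half A
            scores_b.append(second @ weights)      # machine score, half B
        r = np.corrcoef(scores_a, scores_b)[0, 1]
        corrected.append(2 * r / (1 + r))          # Spearman-Brown correction
    return float(np.mean(corrected))


# Synthetic stand-in data (assumption): 100 participants, 40 sentences each.
rng = np.random.default_rng(1)
n_participants, n_sentences, dim = 100, 40, 16
weights = rng.standard_normal(dim)
person_means = rng.standard_normal((n_participants, 1, dim))
noise = 2.0 * rng.standard_normal((n_participants, n_sentences, dim))
embeddings = person_means + noise

reliability = split_half_reliability(embeddings, weights)
print(round(reliability, 2))
```

Averaging over repeated random splits reduces the dependence of the estimate on any single partition of the sentences.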
The third observation is that, at the facet level, test–retest
reliabilities of machine-inferred personality scores were comparable
to split-half reliabilities. Averaged test–retest reliabilities in the test
sample are as follows: r̄ tt = .67 for Openness facets, r̄ tt = .59 for
Conscientiousness facets, r̄ tt = .66 for Extraversion facets, r̄ tt = .63
for Agreeableness facets, and r̄ tt = .58 for Neuroticism facets, with
an average r̄ tt for all facet scales of .63. The modest retest sample
size (n = 61) rendered wide 95% confidence intervals (CIs) for test–
retest reliabilities, with CI width ranging from .21 to .41, with an
average of .31. Thus, the above findings should be interpreted with
caution. We also note that the test–retest reliabilities of machine-inferred personality scores were lower than those of self-reported
questionnaire-derived personality scores, which are estimated to
average around .80 according to a meta-analysis (Gnambs, 2014).
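To see why a retest sample of 61 yields wide intervals, the following sketch computes a 95% CI for a correlation using the Fisher z transformation. Whether the authors used this particular method is an assumption; the function name `fisher_ci` is invented for illustration.

```python
import math


def fisher_ci(r: float, n: int, z_crit: float = 1.96) -> tuple[float, float]:
    """Approximate CI for a correlation via the Fisher z transformation."""
    z = math.atanh(r)                  # transform r to the z scale
    se = 1.0 / math.sqrt(n - 3)        # standard error on the z scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Average facet-level test-retest reliability and retest sample size
# reported in the text.
lower, upper = fisher_ci(0.63, 61)
width = upper - lower
print(round(lower, 2), round(upper, 2), round(width, 2))
```

For r = .63 and n = 61, the interval under this method is about .31 wide, which is in line with the average CI width reported above.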
The fourth observation is that at the domain level, machine-inferred personality scores demonstrated somewhat higher internal
consistency reliabilities than self-reported questionnaire-derived
personality domain scores when facet scores were treated as “items.”
In the training sample, averaged Cronbach’s αs across all Big Five
domains for self-reported questionnaire-derived and machine-inferred personality scores were .80 and .88, respectively. In the
test sample, averaged Cronbach’s αs were .79 and .88, respectively.
These result patterns were somewhat unexpected. One possible
explanation is that many significant features shared by predictive
models underneath the same Big Five domain might have inflated
domain-level Cronbach’s αs of machine scores. Note that Cronbach’s αs of machine-inferred domain scores were identical between
the training and test samples (averaged αs = .88 and .88, respectively), indicating excellent cross-sample generalizability.
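The domain-level computation, with the six facet scores of a domain treated as "items," can be sketched as below. The data are synthetic, and the two simulated conditions are only meant to illustrate the explanation offered above: when facet scores share more common variance (as might happen if facet models draw on the same features), Cronbach's α rises.

```python
import numpy as np


def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (observations x items) score matrix."""
    k = items.shape[1]
    item_variances = items.var(axis=0, ddof=1)
    total_variance = items.sum(axis=1).var(ddof=1)
    return float(k / (k - 1) * (1.0 - item_variances.sum() / total_variance))


# Synthetic stand-in data (assumption): 500 people, 6 facets in one domain.
rng = np.random.default_rng(7)
n_people, n_facets = 500, 6
common = rng.standard_normal((n_people, 1))   # shared domain-level signal

# Facets that share little versus much common variance.
weakly_shared = 0.5 * common + rng.standard_normal((n_people, n_facets))
strongly_shared = 1.5 * common + rng.standard_normal((n_people, n_facets))

alpha_weak = cronbach_alpha(weakly_shared)
alpha_strong = cronbach_alpha(strongly_shared)
print(round(alpha_weak, 2), round(alpha_strong, 2))
```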
Based on the above findings, there is promising evidence that
machine-inferred personality domain scores demonstrated excellent
internal consistency. Further, machine-inferred personality facet
scores exhibited overall acceptable split-half and test–retest reliabilities but were lower than those of self-reported questionnaire-derived facet scores. In addition, both domain-level internal consistency and facet-level split-half reliabilities of machine-inferred
personality scores showed strong cross-sample generalizability.
Factorial Validity
To assess the set-ESEM model fit, chi-square goodness-of-fit,
maximum-likelihood-based Tucker–Lewis index (TLI), comparative fit index (CFI), root-mean-squared error of approximation
(RMSEA), and standardized root-mean-squared residual (SRMR)
were calculated (for the training sample: χ2 = 11370.56, df = 1,435,
p < .01, CFI = .87, TLI = .84, RMSEA = .08, SRMR = .03; for the
test sample: χ2 = 5795.88, df = 1,435, p < .01, CFI = .85, TLI = .81,
RMSEA = .09, SRMR = .04). Although most values of these model
fit indices did not meet the commonly used rule of thumb (Hu &
Bentler, 1999), we consider them as adequate given the complexity
of personality structure (Hopwood & Donnellan, 2010) and because
they also resemble what was reported in the literature (e.g., Booth &
Hughes, 2014; Zhang et al., 2020).
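The reported RMSEA values can be approximately recovered from the chi-square statistics, degrees of freedom, and sample sizes given above. The sketch below uses the common point-estimate formula with n − 1 in the denominator; some software uses n instead, so this choice is an assumption. CFI and TLI also require the baseline-model chi-square, which is not reported in this section, so they are not reproduced here.

```python
import math


def rmsea(chi2, df, n):
    """Point estimate of RMSEA from a model chi-square, its df, and sample size."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))


# Values as reported in the text for the training and test samples.
train_rmsea = rmsea(11370.56, 1435, 1037)
test_rmsea = rmsea(5795.88, 1435, 407)
print(f"{train_rmsea:.2f} {test_rmsea:.2f}")
```

Rounded to two decimals, both estimates agree with the RMSEA values of .08 and .09 reported in the text.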
Table 4 presents the set-ESEM factor loadings from the self-reported questionnaire-derived personality scores and the machine-inferred personality scores in the test sample. (The factor loadings
on the training sample can be found in Supplemental Table S4 in the
online Supplemental Materials A.) The overall patterns of facets
loading onto their corresponding Big Five domains are clear. For the
self-reported questionnaire-derived scores, all facets load onto their
designated factors the highest with overall low cross-loadings on
others except for the activity level facet in the Extroversion factor
(which had the highest loading on Conscientiousness) and the
feelings facet in the Openness factor (which had the highest loading
on Neuroticism). For the machine-inferred personality scores, the
overall Big Five patterns are recovered clearly as well, with
the exceptions—once again—of the activity level and feelings facets
bearing the highest loadings on nontarget factors (though the
loadings on the target factors are moderately high). In general,
the machine-inferred personality scores largely replicated the patterns and structure observed from the self-reported questionnaire-derived personality scores.
Results further indicated that TCCs for Extroversion, Agreeableness, Conscientiousness, Neuroticism, and Openness across the
training and test samples were .98 and .97, .98 and .97, .98 and
.98, .98 and .98, and .95 and .94, respectively. As such, factor profile
similarity between the self-reported questionnaire-derived and
machine-inferred facet scores has been confirmed in both the training
and test samples. Moreover, the plots in Figure 2 clearly show that
factor loadings from the two measurement approaches are indeed
similar to one another in terms of both rank order and magnitude, as
most dots fall close to the line Y = X. In addition, RMSEs were also
small in general (.08∼.11 in the training sample and .09∼.14 in the
test sample).
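For readers unfamiliar with the two similarity statistics used here, the sketch below computes Tucker's congruence coefficient (TCC) and the root-mean-squared difference between two vectors of factor loadings. The loadings are made-up illustrative numbers, not values from Table 4, and the function names are hypothetical.

```python
import math


def tucker_congruence(x, y):
    """Tucker's congruence coefficient between two columns of factor loadings."""
    num = sum(a * b for a, b in zip(x, y))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))
    return num / den


def loading_rmse(x, y):
    """Root-mean-squared difference between two loading vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)) / len(x))


# Hypothetical loadings of six facets on one factor under each approach.
self_report = [0.70, 0.65, 0.55, 0.30, 0.50, 0.60]
machine = [0.75, 0.60, 0.45, 0.35, 0.55, 0.70]

tcc = tucker_congruence(self_report, machine)
rmse = loading_rmse(self_report, machine)
print(f"TCC = {tcc:.2f}, RMSE = {rmse:.2f}")
```

TCC is sensitive mainly to the pattern of loadings, whereas RMSE reflects differences in magnitude, which is why the two statistics are reported together.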
Thus, there is promising evidence that the machine-inferred
personality facet scores recovered the underlying Big Five structure.
Further, factor loading patterns and magnitude are similar across
the two measurement approaches. Thus, the factorial validity of
machine-inferred personality scores is established in our samples. In
addition, given the similar model fit indices, similar factor loading
patterns and magnitude, and similarly high TCCs between the
training and test samples, cross-sample generalizability for factorial
validity seems promising.
Table 4
Rotated Factor Loading Matrices Based on Set-Exploratory Structural Equation Model for the Test Sample
Self-reported questionnaire-derived personality scores
Machine-inferred personality scores
Personality facets
Friendliness (E1)
Gregarious (E2)
Assertiveness (E3)
Activity level (E4)
Excitement seeking (E5)
Cheerfulness (E6)
Trust (A1)
Straightforward (A2)
Altruism (A3)
Cooperation (A4)
Modesty (A5)
Sympathy (A6)
Self-efficacy (C1)
Orderliness (C2)
Dutifulness (C3)
Achievement (C4)
Self-discipline (C5)
Cautiousness (C6)
Anxiety (N1)
Anger (N2)
Depression (N3)
Self-conscious (N4)
Impulsiveness (N5)
Vulnerability (N6)
Imagination (O1)
Art_Interest (O2)
Feelings (O3)
Adventure (O4)
Intellectual (O5)
Liberalism (O6)
Note. n = 407. E = Extroversion; A = Agreeableness; C = Conscientiousness; N = Neuroticism; O = Openness.
Convergent and Discriminant Validity
Tables 5 and 6 present MTMM matrices of latent and manifest
Big Five domain scores, respectively. As can be seen in Table 7, at
the latent variable level, the convergence indices (C1s) were relatively large (.59 and .48), meaning that 59% and 48% of the
observed variance can be attributed to person main effects and
trait-specific variance in the training and test samples, respectively.
The first discrimination indices (D1s: .48 and .38) indicate that 48%
and 38% of the observed variance can be attributed to trait-specific
variance in the training and test samples, respectively. Contrasting
these values with C1s suggests that most of the convergence is
contributed by trait-specific variance. The second discrimination
indices (D2s: .43 and .33) were positive and moderate, indicating
that the percentage of shared variance specific to traits is 43 and 33
percentage points higher than the percentage of shared variance
specific to methods in the training and test samples, respectively.
D2a, calculated using only machine method correlations, was .39
and .28 in the training and test samples, respectively, suggesting that
the machine method tended to yield somewhat heightened percentage levels of method variance. In addition, the average absolute
intercorrelations of self-reported questionnaire-derived domain
scores were .12 and .11 in the training and test samples, respectively.
Meanwhile, the average absolute intercorrelations of machine-inferred domain scores were .20 and .20 in the training and test
samples, respectively. Thus, together, machine-inferred latent personality domain scores demonstrated excellent convergent validity
but somewhat weaker discriminant validity.
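The convergence and discrimination indices can be sketched directly from their definitions in the note to Table 7. The code below assumes a correlation matrix ordered as all self-report traits followed by the same traits scored by the machine method, and it averages correlations exactly as supplied; whether signed or absolute correlations were averaged is not stated in this section, so that choice is left to the caller. The function name and toy matrix are hypothetical.

```python
from statistics import fmean


def mtmm_indices(r, n_traits):
    """C1, D1, D2, and D2a from a (2 * n_traits)-square correlation matrix.

    Rows/columns 0..n_traits-1 are self-report traits; the remaining ones are
    the same traits, in the same order, scored by the machine method.
    """
    t = n_traits
    pairs = [(i, j) for i in range(t) for j in range(i + 1, t)]
    # Monotrait-heteromethod: same trait, different method.
    c1 = fmean(r[i][t + i] for i in range(t))
    # Heterotrait-heteromethod: different trait, different method.
    hthm = fmean(r[i][t + j] for i in range(t) for j in range(t) if i != j)
    # Heterotrait-monomethod: different trait, same method.
    htmm_self = [r[i][j] for i, j in pairs]
    htmm_machine = [r[t + i][t + j] for i, j in pairs]
    return {
        "C1": c1,
        "D1": c1 - hthm,
        "D2": c1 - fmean(htmm_self + htmm_machine),
        "D2a": c1 - fmean(htmm_machine),
    }


# Toy 2-trait example (hypothetical values): order is T1(S), T2(S), T1(M), T2(M).
toy = [
    [1.00, 0.10, 0.60, 0.20],
    [0.10, 1.00, 0.15, 0.50],
    [0.60, 0.15, 1.00, 0.30],
    [0.20, 0.50, 0.30, 1.00],
]
indices = mtmm_indices(toy, 2)
print(indices)
```

Applying the same definitions to the latent training-sample values reported above, C1 = .59 and D1 = .48 imply an average heterotrait–heteromethod correlation of about .11, and D2 = .43 implies an average heterotrait–monomethod correlation of about .16, subject to rounding.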
Table 7 indicates that, at the manifest variable level, the convergence index (C1) was .57 and .46 in the training and test samples,
respectively, indicating good convergent validity. D1 was .40 and .31
in the training and test samples, respectively, suggesting that most of
the convergence is contributed by trait-specific variance. D2 was .30
and .19 in the training and test samples, respectively, showing that the
percentage of shared variance specific to traits is substantially higher
than the percentage of shared variance specific to methods. D2a
was .24 and .11 in the training and test samples, respectively. The
magnitude of D2a in the test sample suggests that the percentage of
shared variance specific to traits is only slightly higher than the
percentage of shared variance specific to the machine method. Indeed,
Table 6 shows that four of 10 heterotrait–machine monomethod
correlations exceeded the C1 (.46). In addition, the average absolute
intercorrelations of self-reported questionnaire-derived domain
scores were .20 and .19 in the training and test samples, respectively,
whereas the average absolute intercorrelations of machine-inferred
domain scores were .33 and .35, respectively (Δs = .13 and .16).
Thus, machine-inferred manifest domain scores demonstrated excellent convergent validity but weaker discriminant validity. The poor
discriminant validity in the test sample is particularly concerning.
Table 7 also indicates that D2a was .39 versus .24 at the latent
versus manifest variable level in the training sample (ΔD2a = .15)
and was .28 versus .11 in the test sample (ΔD2a = .17). These result
patterns suggested that discriminant validity of machine-inferred
personality domain scores was somewhat higher at the latent
variable level than at the manifest variable level.
Figure 2
Plots of Factor Loadings of Facet Scores of the Two Measurement Approaches and Root Mean Squared Errors Across Big Five Domains
Note. RMSE = root-mean-squared errors; Train = training sample; Test = testing sample; SR = self-reported questionnaire-derived facet scores; ML = machine-inferred facet scores.
Table 5
Latent Factor Correlations of Self-Reported Questionnaire-Derived and Machine-Inferred Personality Domain Scores in the Training and
Test Samples
Personality domains
Extroversion (S)
Agreeableness (S)
Conscientiousness (S)
Neuroticism (S)
Openness (S)
Extroversion (M)
Agreeableness (M)
Conscientiousness (M)
Neuroticism (M)
Openness (M)
Note. (S) = self-reported questionnaire-derived scores; (M) = machine-inferred score. Correlations above the diagonal are based on the training sample.
Correlations below the diagonal are based on the test sample. For the training sample, n = 1,037. For the test sample, n = 407. Bold values are given to
highlight convergent correlations.
Based on the above findings, there is evidence that machineinferred personality latent domain scores displayed excellent convergent and somewhat weaker discriminant validity. Although
machine-inferred personality manifest domain scores also showed
good convergent validity, discriminant validity was less impressive,
particularly in the test sample. Nevertheless, overall, our results
show improvements in differentiating among different traits in the
machine-inferred personalities as compared to existing machine
learning applications (e.g., Hickman et al., 2022; Marinucci et al.,
2018; Park et al., 2015). Table 7 also shows that C1, D1, and D2
(D2a) dropped around .10 from the training to testing sample at both
the latent and manifest variable levels. Thus, there is evidence for
reasonable levels of cross-sample generalizability of convergent and
discriminant relations.
Criterion-Related Validity
Table 6 reports the bivariate correlations between manifest
personality domain scores and the two criteria of cumulative
GPA and peer-rated college adjustment. We also calculated partial
correlations controlling for ACT scores. Specifically, controlling for
ACT scores, GPA was significantly correlated with four machine-inferred domain scores: Openness (r = −.11), Conscientiousness
(r = .13), Extroversion (r = .12), and Neuroticism (r = −.12); peer-rated college adjustment was significantly correlated with four
machine-inferred domain scores: Conscientiousness (r = .17),
Extroversion (r = .18), Agreeableness (r = .12), and Neuroticism
(r = −.11). Thus, machine-inferred domain scores demonstrated
some initial evidence for low levels of criterion-related validity. The
correlations between self-reported questionnaire-derived domain
scores and the two criteria were largely consistent with previous
research (e.g., McAbee & Oswald, 2013; Oswald et al., 2004).
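The partial correlations controlling for ACT scores can be illustrated with the standard first-order partial correlation formula. This is a minimal sketch; the variable names and toy values are hypothetical and were constructed so that the two variables are related only through the control variable.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def partial_corr(x, y, z):
    """First-order partial correlation of x and y, controlling for z."""
    rxy, rxz, ryz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (rxy - rxz * ryz) / math.sqrt((1 - rxz ** 2) * (1 - ryz ** 2))


# Toy data (hypothetical): both variables depend on the control, plus
# residual parts that are unrelated to each other.
act = [1.0, 2.0, 3.0, 4.0]
machine_score = [2.0, 1.0, 2.0, 5.0]
gpa = [0.0, 5.0, 0.0, 5.0]

zero_order = pearson(machine_score, gpa)
partial = partial_corr(machine_score, gpa, act)
print(f"zero-order r = {zero_order:.2f}, partial r = {partial:.2f}")
```

In this toy case the zero-order correlation is positive, but it vanishes once the control variable is partialed out, which is the kind of confounding the ACT control is meant to address.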
Next, we conducted 10 sets of hierarchical regression analyses
(Five Domains × Two Criteria) to examine the incremental validity
of machine-inferred domain scores.⁸ Table 8 presents the results of
these regression analyses. Several observations are noteworthy.
First, whereas ACT scores were a significant predictor of cumulative
GPA, they did not predict peer-rated college adjustment at all,
suggesting that the former criterion has a strong cognitive connotation, whereas the latter criterion does not. Second, after controlling
for ACT scores, self-reported questionnaire-derived personality
domain scores exhibited modest incremental validity on the two
criteria, with four domain scores being significant predictors of
cumulative GPA and three domain scores being significant predictors of peer-rated college adjustment.
Third, after controlling for ACT and self-reported questionnaire-derived personality domain scores, machine-inferred personality
domain scores, overall, failed to explain additional variance in the
two criteria; however, there are three important exceptions. Specifically, in one set of regression analyses involving Extroversion scores
as the predictor and GPA as the criterion, the machine-inferred
Extroversion scores explained an additional 3% of the variance in
cumulative GPA (β = .18, p < .001). Interestingly, the regression
coefficient of self-reported questionnaire-derived Extroversion scores
became more negative from Step 2 to Step 3 (with β changing from
−.09 [p = .038] to −.17 [p < .001]), suggesting a potential suppression effect (Paulhus et al., 2004). In another set of regression analyses,
which involved Extroversion scores as a predictor and peer-rated
college adjustment as the criterion, machine-inferred Extroversion
scores explained an additional 3% of the variance in the criterion (β =
.18, p < .001), with self-reported questionnaire-derived Extroversion
scores being a nonsignificant predictor in Step 2 (β = .07, p = .217)
and Step 3 (β = −.002, p = .980). In still another set of regression
analyses, which involved Neuroticism scores as a predictor and
cumulative GPA as the criterion, machine-inferred Neuroticism
scores explained an additional 1% of the variance in the criterion
(β = −.13, p < .001), with self-reported questionnaire-derived Neuroticism scores being a nonsignificant predictor in Step 2 (β = .03, p =
.549) and Step 3 (β = .08, p = .099).
⁸ We also ran 60 sets of regression analyses (30 Facets × Two Criteria) to
examine incremental validity of machine-inferred personality facet scores.
The results are presented in online Supplemental Material A (Supplemental
Table S5).
Table 6
Means, Standard Deviations, and Correlations Among Study Variables in the Training and Test Samples at Manifest Variable Level
Personality domains
Openness (S)
Conscientiousness (S)
Extroversion (S)
Agreeableness (S)
Neuroticism (S)
Openness (M)
Conscientiousness (M)
Extroversion (M)
Agreeableness (M)
Neuroticism (M)
Cumulative GPA
Note. n = 1,037 for the training sample. n = 289–407 for the test sample. (S) = self-reported questionnaire-derived scores; (M) = machine-inferred score; PRCP = peer-rated college adjustment;
GPA = grade point average; N/A = Not Available; ACT = American College Testing. Statistics above the diagonal are for the training sample. Statistics below the diagonal are for the test sample. Bold
values are given to highlight convergent correlations.
* p < .05. ** p < .01.
Based on the above findings, there is some evidence that machine-inferred personality domain scores had overall comparable, low
criterion-related validity to self-reported questionnaire-derived personality domain scores. The only exception was that machine-inferred
Conscientiousness domain scores had noticeably lower criterion-related validity than self-reported questionnaire-derived Conscientiousness domain scores. There is also preliminary evidence that
machine-inferred personality domain scores had incremental validity
over ACT scores and self-reported questionnaire-derived domain
scores in some analyses.
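The three-step hierarchical regressions summarized above (ACT scores, then self-reported scores, then machine-inferred scores) amount to comparing R² across nested models. The sketch below illustrates that logic with ordinary least squares in plain Python. All data values and names are hypothetical, and the significance tests reported in Table 8 are not reproduced.

```python
def r_squared(y, predictors):
    """R^2 from an OLS regression of y on the predictor columns plus an intercept."""
    n = len(y)
    cols = [[1.0] * n] + [[float(v) for v in p] for p in predictors]
    k = len(cols)
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    a = [[sum(ci[t] * cj[t] for t in range(n)) for cj in cols]
         + [sum(ci[t] * y[t] for t in range(n))] for ci in cols]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        p = max(range(c, k), key=lambda row: abs(a[row][c]))
        a[c], a[p] = a[p], a[c]
        for row in range(k):
            if row != c:
                f = a[row][c] / a[c][c]
                a[row] = [v - f * w for v, w in zip(a[row], a[c])]
    b = [a[i][k] / a[i][i] for i in range(k)]
    fitted = [sum(b[j] * cols[j][t] for j in range(k)) for t in range(n)]
    mean_y = sum(y) / n
    ss_res = sum((yt - ft) ** 2 for yt, ft in zip(y, fitted))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y)
    return 1.0 - ss_res / ss_tot


def incremental_r2(y, steps):
    """R^2 after each cumulative step and the gain contributed by each step."""
    r2s, used = [], []
    for block in steps:
        used = used + block
        r2s.append(r_squared(y, used))
    gains = [r2s[0]] + [later - earlier for earlier, later in zip(r2s, r2s[1:])]
    return r2s, gains


# Hypothetical data for eight students (not from the study).
act = [20, 22, 24, 26, 28, 30, 32, 34]
self_extroversion = [3, 1, 4, 1, 5, 9, 2, 6]
machine_extroversion = [2, 7, 1, 8, 2, 8, 1, 8]
gpa = [2.1, 2.9, 2.6, 3.4, 3.0, 3.9, 3.1, 4.0]

r2s, gains = incremental_r2(
    gpa, [[act], [self_extroversion], [machine_extroversion]]
)
print([round(v, 3) for v in r2s], [round(v, 3) for v in gains])
```

The last entry of `gains` corresponds to the additional variance attributed to machine-inferred scores in Step 3; in practice an F test on that change would accompany it.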
Supplemental Analyses
Robustness Checking
During the review process, an issue was raised about whether the
reliability and validity of machine-inferred personality scores might
be compromised among participants who provided fewer inputs than
others during the virtual conversation. We thus conducted additional
analyses to examine the robustness of our findings. Based on the
number of sentences participants in the test sample provided during
the online chat, we divided the test sample into three equal parts,
yielding three test sets: (a) bottom 1/3 of participants (n = 165), (b)
bottom 2/3 of participants (n = 285), and (c) all participants (n = 407).
We then reran all analyses on the two smaller test sets. Results
indicated that barring a couple of exceptions, the reliability and
validity of machine-inferred personality scores were very similar
across the test sets, thus providing strong evidence for the robustness
of our findings concerning the volume of input participants provided.⁹
Exploring Content Validity of Machine Scores
We also conducted supplemental analyses that allowed for an
indirect and partial examination of the content validity issue. Following Park et al.’s (2015) approach, we used the scikit-learn package in
Python (Pedregosa et al., 2011) to count one-, two-, and three-word
phrases (i.e., n-grams with n = 1, 2, or 3) in the text. Words and
phrases that occurred in less than 1% of the conversation scripts were
removed from the analysis. This created a document-term matrix that
was populated by the counts of the remaining phrases. After identifying n-grams and their frequency scores for each participant, we
calculated the correlations between machine-inferred personality
facet scores derived from the chatbot and frequencies of language
features in the test sample. If machine-inferred personality scores have
content validity, it is expected that they should have significant
correlations with language features that are known to reflect a specific
personality facet. For each personality facet, we selected the 100 most
positively and 100 most negatively correlated phrases.
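The phrase-level analysis described above can be sketched as follows. The study used scikit-learn for n-gram counting; this illustration instead uses only the Python standard library as a stand-in, so the tokenization (lowercasing and whitespace splitting), the function names, and the toy scripts are all assumptions. In scikit-learn, `CountVectorizer` with `ngram_range=(1, 3)` and `min_df=0.01` would be a plausible counterpart, although the exact settings are not reported in this section.

```python
import math
from collections import Counter


def ngrams(text, max_n=3):
    """All 1- to max_n-word phrases in a lowercased, whitespace-tokenized text."""
    tokens = text.lower().split()
    return [" ".join(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def document_term_matrix(docs, min_doc_share=0.01, max_n=3):
    """Count n-grams per document, dropping phrases found in too few documents."""
    counts = [Counter(ngrams(d, max_n)) for d in docs]
    doc_freq = Counter(term for c in counts for term in c)
    vocab = sorted(t for t, df in doc_freq.items()
                   if df / len(docs) >= min_doc_share)
    return vocab, [[c[t] for t in vocab] for c in counts]


def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def rank_phrases(vocab, matrix, scores, top_k=100):
    """Most positively and most negatively correlated phrases for one facet."""
    cors = []
    for j, term in enumerate(vocab):
        col = [row[j] for row in matrix]
        if len(set(col)) > 1:          # a constant column has no correlation
            cors.append((pearson(col, scores), term))
    cors.sort()
    return cors[-top_k:][::-1], cors[:top_k]


# Toy conversation scripts and facet scores (hypothetical, not study data).
scripts = [
    "I love music and art",
    "music and poetry every day",
    "football game tonight",
    "I watch football and sports",
]
artistic_interest = [4.5, 4.8, 2.0, 1.5]

# A 50% document threshold is used only because the toy corpus is tiny.
vocab, dtm = document_term_matrix(scripts, min_doc_share=0.5)
positive, negative = rank_phrases(vocab, dtm, artistic_interest)
print(positive[0][1], "|", negative[0][1])
```

With a realistic corpus, the 1% document threshold from the text would replace the toy value, and the two returned lists would hold the 100 most positively and 100 most negatively correlated phrases per facet.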
Comprehensive lists of all language features and correlations can
be found in the online Supplemental Material B. The results show
that the most strongly correlated phrases with predictions of each
personality facet were largely consistent with the characteristics of
that facet. For example, a high level of machine score in the artistic
interest facet was associated with phrases reflecting music and art
(e.g., music, art, poetry) and exploration (e.g., explore, creative,
reading). In contrast, a low level of machine score in the artistic
interest facet was associated with phrases showing enjoyment of
sports and outdoor activities (e.g., football, game, sports). Based on
the above supplemental analyses, it appears that our predictive models
captured some aspects of language features that can predict specific
personality facets and, therefore, showed partial evidence for content validity of machine-inferred personality scores.
⁹ We thank Associate Editor Fred Oswald for encouraging us to examine
the robustness of our findings. Interested readers are referred to the online
Supplemental Materials A (Supplemental Tables S6–S15) for detailed
analysis results.
Table 7
Multitrait–Multimethod Statistics for Machine-Inferred Personality Domain Scores
Present and previous study samples
Latent personality domain scores (self-report models)
Training sample (n = 1,037)
Test sample (n = 407)
Manifest personality domain scores (self-report models)
Training sample (n = 1,037)
Test sample (n = 407)
Park et al. (2015; self-report models, test sample)
Hickman et al. (2022)
Self-report models (in the training sample)
Interviewer-report models (in the training sample)
Interviewer-report models (in the test sample)
Harrison et al. (2019; other-report models, split-half validation)a
Marinucci et al. (2018; self-report models; in the training sample)
Note. C1 = convergence index (average of monotrait–heteromethod correlations); r̄HTHM = average of heterotrait–heteromethod correlations; D1 =
Discrimination Index 1 (C1 minus the average of heterotrait–heteromethod correlations); D2 = Discrimination Index 2 (C1 minus the average of heterotrait–monomethod
correlations); D2a = Discrimination Index 2 (calculated using only machine method heterotrait–monomethod correlations); MV = method variance
(average of heterotrait–monomethod correlations minus the average of heterotrait–heteromethod correlations); MVa = method variance due to the machine method;
CEO = chief executive officer.
a Unlike many other machine learning studies, Harrison et al. (2019) split CEOs’ text into two halves. They built predictive models based on the first half
of the text and then tested them based on the other half of the text.
Table 8
Regression of Cumulative GPA and Peer-Rated College Adjustment on ACT Scores and Self-Reported Questionnaire-Derived and Machine-Inferred Personality Score
Cumulative GPA (n = 379)
Peer-rated college adjustment (n = 289)
Step 1
Step 2
Step 3
Step 1
Step 2
Step 3
Openness (S)
Openness (M)
Conscientiousness (S)
Conscientiousness (M)
Extroversion (S)
Extroversion (M)
Agreeableness (S)
Agreeableness (M)
Neuroticism (S)
Neuroticism (M)
.50** (.04)
.51** (.04)
−.19** (.04)
.51** (.04)
−.20** (.05)
.03 (.06)
.48** (.04)
.33** (.05)
−.04 (.05)
.50** (.04)
−.17** (.05)
.18** (.05)
.49** (.04)
.11* (.05)
.01 (.05)
.48** (.04)
.08 (.05)
−.13** (.05)
−.004 (.06)
−.02 (.06)
−.002 (.03)
−.01 (.08)
−.01 (.08)
−.02 (.06)
.24** (.06)
.06 (.06)
.02 (.06)
−.002 (.07)
.18** (.06)
−.01 (.06)
.13* (.06)
.07 (.06)
−.01 (.06)
−.13 (.06)
−.07 (.07)
.50** (.04)
.50** (.04)
.50** (.04)
.50** (.04)
.48** (.04)
.31** (.04)
.48** (.04)
−.09* (.05)
.49** (.04)
.11* (.04)
.50** (.04)
.03 (.05)
−.01 (.09)
−.01 (.09)
−.01 (.09)
−.01 (.09)
−.02 (.06)
.27** (.06)
.01 (.09)
.07 (.06)
−.01 (.07)
.15** (.06)
−.01 (.06)
−.16** (.06)
Note. (S) = self-reported questionnaire-derived scores; (M) = machine-inferred score; GPA = grade point average; ACT = American
College Testing.
* p < .05. ** p < .01.
The purpose of the present study was to explore the feasibility
of measuring personality indirectly through an AI chatbot, with a
particular focus on the psychometric properties of machine-inferred
personality scores. Our AI chatbot approach is different from (a)
earlier approaches that relied on individuals’ willingness to share their
social media content (Youyou et al., 2015), which is not a given in
talent management practice, and (b) automated video interview
systems (e.g., Hickman et al., 2022), in that our approach allows
for two-way communications and thus resembles more natural conversations. Results based on an ambitious study that involved
approximately 1,500 participants, adopted a design that allowed
for examining cross-sample generalizability, built predictive models
at the personality facet level, and used non-self-report criteria showed
some promise for the AI chatbot approach to personality assessment.
Specifically, we found that machine-inferred personality scores
(a) had overall acceptable reliability at both the domain and facet levels,
(b) yielded a comparable factor structure to self-reported questionnaire-derived personality scores, (c) displayed good convergent validity but
relatively poor discriminant validity, (d) showed low criterion-related
validity, and (e) exhibited incremental validity in some analyses. In
addition, there is strong evidence for cross-sample generalizability of
various aspects of psychometric properties of machine scores.
Important Findings and Implications
Several important findings and their implications warrant further
discussion. Regarding reliability, our finding that the average test–
retest reliability of all facet machine scores was .63 in a small
independent sample (with an average time lapse of 22 days) compared favorably to the .50 average test–retest reliability (with an
average 15.6 days of time lapse) reported by Hickman et al.
(2022), but unfavorably to both Harrison et al.’s (2019) .81 average
based on other-report models (with 1-year time lapse) and Park et al.’s
(2015) .70 average based on self-report models (with a 6-month time
lapse). Given that both Harrison et al. and Park et al. relied on a much
larger body of text obtained from participants’ social media, it seems
that it is not the sample size, nor the length of time lapse, but the size
of the text that would determine the magnitude of test–retest
reliability of machine-inferred personality scores. Another possible
explanation is that, in Hickman et al. and our studies, participants
were asked to engage in the same task (online chat or video interviews) twice with relatively short time lapses, which might have
evoked a Practice × Participant interaction effect. That is, participants might understand the questions better and might provide better
quality responses during the second interview or conversation;
however, such a practice effect is unlikely to be uniform across
participants, thus resulting in lowered test–retest reliability. In
contrast, Harrison et al. and Park et al.’s studies relied on participants’ social media content over time and thus were immune from
the practice effect. However, one may counter that there might be
some similar (repetitive) content on social media, which might have
resulted in overestimated test–retest reliability of machine scores.
Regarding discriminant validity, one interesting finding is that
discriminant validity seemed higher at the latent variable level than
at the manifest variable level in both the training and test samples.
One possible explanation is that set-ESEM accounted for potential
cross-loadings, whereas manifest variables did not. There is evidence that omitted cross-loadings inflate correlations among latent
factors (e.g., Asparouhov & Muthén, 2009).
Regarding criterion-related validity, despite the low criterion-related validity of machine-inferred personality scores in predicting
GPA and peer-rated college adjustment, our findings are comparable
to criterion-related validities of self-reported questionnaire-derived
personality domain scores reported in an influential meta-analysis
(Hurtz & Donovan, 2000); for instance, for Conscientiousness: r = .14
and .17 versus .14, and for Neuroticism, r = −.13 and −.15 versus
−.09. Further, eight of 10 criterion-related validities of machine scores
were in the 40th (r = .12)–60th (r = .20) percentile range based on
empirically derived effect size distribution for the psychological
characteristics and performance correlations (cf. Bosco et al., 2015).
The findings that, in three regression analyses, machine-inferred
personality scores exhibited incremental validity suggest that the
part of the variance in machine scores that is not shared by self-reported questionnaire-derived personality scores can be criterion
relevant. However, we note that we know very little about the exact
nature of this part of the criterion-relevant variance; for instance,
does it capture personality-relevant information or some sort of
biasing factors? In the former case, we speculate that the unshared
variance might have captured the reputation component of personality (as our daily conversations clearly influence how we are
viewed by others), which has consistently been shown to contribute
substantially to the prediction of performance criteria (e.g., Connelly
& Ones, 2010; Connelly et al., 2022). However, no empirical studies
have tested this speculation. This lack of understanding represents
a general issue in ML literature, that is, the predictive utility is
established first by practitioners, with the theoretical understandings
lagging, awaiting scientists to address. We thus call for future
research to close the “scientist–practitioner gap.”
Another important finding is that the present study, which was
based on self-report models, yielded better psychometric properties
(e.g., substantially higher convergent validity and cross-sample generalizability) of machine-inferred personality scores than many similar ML studies that were also based on self-report models (e.g.,
Hickman et al., 2022; Marinucci et al., 2018). We offer several
tentative explanations that might help reconcile this inconsistency.
First, we would like to rule out data leakage that might have
contributed to our better findings. There are two main types of
data leakage (Cook, 2021): (a) target leakage and (b) train-test
contamination. Target leakage happens when the training model
contains predictors that are updated/created after the target value is
realized, for instance, if the algorithm for inferring personality scores
is constantly being updated based on new data. However, since this
AI firm’s algorithm for inferring personality scores is static rather than
constantly updated on the fly, target leakage is ruled out. The second
type of data leakage, train-test contamination, happens when researchers don’t carefully distinguish the training data from the test
data; for instance, researchers conduct preprocessing and/or train the
model using both the training and test data. This would result in
overfitting. However, in our study, training and test samples were kept
separate, hence the test data were excluded from any model-building
activities, including the fitting of preprocessing steps. Therefore, we
are confident that data leakage cannot explain our superior findings on
the psychometric properties of machine scores.
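The safeguard against train-test contamination described above can be illustrated with a small preprocessing example in which standardization parameters are learned from the training sample only and then reused for the test sample. This is a generic sketch with hypothetical feature values and function names, not the AI firm's actual pipeline.

```python
from statistics import mean, pstdev


def fit_scaler(train_column):
    """Learn standardization parameters from the training data only."""
    return mean(train_column), pstdev(train_column)


def apply_scaler(column, params):
    """Standardize any sample using parameters learned elsewhere."""
    center, spread = params
    return [(v - center) / spread for v in column]


# Hypothetical values of one text feature in each sample.
train_feature = [1.0, 2.0, 3.0, 4.0, 5.0]
test_feature = [6.0, 7.0]

params = fit_scaler(train_feature)            # the test sample is never seen here
train_z = apply_scaler(train_feature, params)
test_z = apply_scaler(test_feature, params)   # reuses the training parameters
print([round(v, 2) for v in test_z])
```

Fitting the scaler on the pooled training and test data would instead let information about the test sample shape the features used for model building, which is the contamination the paragraph above rules out.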
We attribute our success to three factors. The first factor is the
sample size. The sample we used to train our predictive models (n =
1,037) was relatively larger than the samples used in similar ML
studies. Larger samples may help detect more subtle trait-relevant
features and facilitate complex relationships during model training,
making the trained models more accurate (Hickman et al., 2022).
The second factor is related to the data collection method used. The
AI firm’s chatbot system allows for two-way communications,
engaging users in small talk, providing empathetic comments,
and managing user digressions, all of which should lead to higher
quality data. The third factor concerns the NLP method used. The
present study used the sentence embedding technique, the USE,
which is a DL-based NLP method. USE goes beyond simple count-based representations such as bag-of-words and Linguistic Inquiry
and Word Count (Pennebaker et al., 2015) used in previous ML
studies and retains the information of context in language and
relations of whole sentences (Cer et al., 2018). There has been
consistent empirical evidence showing that DL-based NLP techniques tend to outperform n-gram and lexical methods (e.g., Guo et al.,
2021; Mikolov, 2012).
Contributions and Strengths
The present study makes several important empirical contributions
to personality science and ML literature. First, in the most general
sense, the present study represents the most comprehensive examination of the psychometric properties of machine-inferred personality
scores. Our findings, taken as a whole, greatly enhance confidence in
the ML approach to personality assessment. Second, the present study
demonstrates, for the first time, that machine-inferred personality
facet scores are structurally equivalent to self-reported questionnaire-derived personality facet scores. Third, to our best knowledge, the
present study is the first in the broad ML literature that shows the
incremental validity of machine-inferred personality scores over
questionnaire-derived personality scores with non-self-report criteria.
Admittedly, scholars in other fields such as strategic management
(e.g., Harrison et al., 2019; Wang & Chen, 2020) and marketing (e.g.,
Liu et al., 2021; Shumanov & Johnson, 2021) have reported that
machine-inferred personality scores predicted non-self-report criteria.
Further, there have been trends in using ML in organizational research
to predict non-self-report criteria (e.g., Putka et al., 2022; Sajjadiani
et al., 2019; Spisak et al., 2019). However, none of these studies have
reported incremental validity of machine-inferred personality scores
beyond self-report criteria. This is significant because establishing
incremental validity of machine-inferred scores is a precondition for
the ML approach to personality assessment and any other new talent
signals to gain legitimacy in talent management practice (Chamorro-Premuzic et al., 2016).
Two methodological strengths of the present study should also be
noted. First, building predictive models at the personality facet level
opened opportunities to examine a few important research questions
such as internal consistency at the domain level, factorial validity,
and convergent and discriminant validity at the latent variable level.
These questions cannot be investigated when predictive models are
built at the domain level, which is the case for most ML studies.
Second, the present research design with a training sample and an
independent test sample allowed us to examine numerous aspects of
cross-sample generalizability including reliabilities, factorial validity, and convergent and discriminant validity.
Study Limitations
Several study limitations should be kept in mind when interpreting our results. The first limitation concerns the generalizability
of our models and findings. Our samples consisted of young,
predominantly female, college-educated people, and as a result,
our models and findings might not generalize to working adults. In
the present study, we used the AI firm’s default interview questions
to build and test predictive models. Given that the AI firm’s chatbot
system allows for tailor-making conversation topics, interview
questions, and their temporal order, we do not know to what extent
predictive models built based on different sets of interview questions
would yield similar machine scores. Further, the present study
context was a nonselection, research context, and it is unclear
whether our findings might generalize to selection contexts where
applicants are motivated to fake and thus might provide quite
different responses during virtual conversations. In addition,
both the USE and the elastic net analyses would be difficult to
replicate in exact form. For instance, using any pretrained model
other than the USE (e.g., the Bidirectional Encoder Representations
from Transformers; Devlin et al., 2019) would produce a different
dimensional arrangement of vector representations. Therefore, we
call for future research to examine the cross-model, cross-method,
cross-population, and cross-context generalizability of machine-inferred personality scores.
The second limitation is that the quality of the predictive models
we built might have been hampered by several factors; for instance,
some participants might not respond truthfully to the self-report
personality inventory (IPIP-300); the USE might have inappropriately encoded regional dialects not well represented in the training
data; some participants were much less verbally expressive than
others, yielding less text; and some participants were less able or
less willing to contribute to the virtual conversations, to name just a
few. In addition, models built in a high-stakes versus low-stakes
situation might yield different parameters. Future model-building
efforts should take these factors into account.
The third limitation is that we were not able to examine the
content validity of machine scores directly. A major advantage of
DL models lies in their accuracy and generalizability. However, as
of now, these DL models including the USE used in the present
study are not very interpretable and have weak theoretical commitments, as the DL-derived features do not have substantive meanings.
We thus encourage computer scientists and psychologists to work
together to figure out substantive meanings of high-dimension
vectors extracted from various DL models, which shall allow for
a fuller and more direct investigation of the content validity of
machine-inferred personality scores. This is aligned with the current
trend that explainable AI has been a growing area of research in ML
(Tippins et al., 2021).
Future Research Directions
Despite some promising findings, we recommend several future
research directions to further advance the ML approach to personality assessment. First, the default interview questions in the AI firm’s
chatbot system probably should be considered semistructured in
nature. This is because although all participants went through the
same interview questions, they aimed to engage with users with no
explicit intention to solicit personality-related information. It thus
remains to be seen whether more structured interview questions
aimed to systematically tap into various personality traits and the
predictive models built accordingly might yield more accurate and
valid machine-inferred personality scores. For instance, it is possible
to develop 30 interview questions with each targeting one of the 30
personality facets. We are optimistic about such an approach for the
following reasons. First, when interview questions are built to inquire
about respondents’ thoughts, feelings, and behaviors associated with
specific personality facets and domains, the language data will be
contextually connected and trait driven. As a result, NLP algorithms
shall be able to capture not only the linguistic cues but also the trait-relevant content of the narratives from the respondents. For instance,
when assessing Conscientiousness, a question that asks about personal work style should prompt a respondent to type an answer using
more work-relevant words/phrases and depict their work style in a
way that should allow algorithms to extract relevant features more
accurately. This should improve the predictive accuracy of the
predictive models, resulting in better convergent validity.
At the same time, when different questions are asked that tap
into different personality facets or domains, text content unique to
specific facets or domains is expected to be solicited. Questions
representing different personality facets should contribute uniquely
to the numerical representations of texts, as the semantics of the texts
would cluster differently due to facet or domain differences. As a
result, in the prediction algorithms, clusters involving language
features that are more relevant to a personality facet or domain
would carry more weight (i.e., predictive power) for that specific
trait and less so for others. In other words, language features mined
this way are less likely to overlap and should result in improved
discriminant validity. However, one may counter that
structured interview questions might create a stronger situation that
may lead to restricted variance in responses during the online chat,
resulting in worse results than unstructured interview questions.
In any event, future research is needed to address this important
research question.
Second, given the recent advancements in personality psychology showing that self-reported and other-reported questionnaire-derived personality scores seem to capture distinct aspects of
personality, with self-reports tapping into identity and other-reports tapping into reputation (e.g., McAbee & Connelly,
2016), it may be profitable to develop two parallel predictive
models with self-report and other-report, respectively. In a recent
study, Connelly et al. (2022) empirically showed that the reputation component (assessed through other-report) of Conscientiousness and Agreeableness dominated the prediction of several
performance criteria, whereas the identity component (assessed
through self-report) generally did not predict performance criteria.
These findings suggest that machine-inferred personality scores
based on the other-report might have a high potential to demonstrate incremental validity over self-reported questionnaire-derived
scores and machine scores based on self-reports. Future research
should empirically investigate this possibility.
Third, future research is also needed to examine whether the ML
approach to personality assessment is resistant to faking. There are
two different views. The first view argues that machine-inferred
personality scores cannot be faked. This is mainly because the ML
approach uses a large quantity (usually in the hundreds or thousands
or even more) of empirically derived features that bear no substantive meaning, making it practically impossible to memorize and then
fake these features. The counterargument, however, is that in high-stakes selection situations, job applicants may still engage in
impression management in their responses when engaging with
an AI chatbot. Some of the sentiments, positive emotions, and word
usage faked by job applicants are likely to be captured by the NLP
techniques and then factored into machine-inferred personality
scores. In other words, faking is still highly possible within the
ML approach. Which of the above two views is correct is ultimately
an empirical question. We thus call for future empirical research to
examine the fakability of machine-inferred personality scores.
Fourth, future research should also examine the criterion-related
and incremental validity of machine-inferred personality scores in
selection contexts. We speculate that in selection contexts, self-reported questionnaire-derived personality scores should have weaker
criterion-related validity due to applicant faking that introduces
irrelevant variance (e.g., Lanyon et al., 2014), whereas the criterion-related validity of machine-inferred scores should probably be maintained
owing to their presumed resistance to faking. As such, it is possible that machine
scores are more likely to demonstrate incremental validity in selection
contexts than in nonselection contexts such as the present one. We
also encourage future researchers to use more organizationally relevant criteria such as task performance, organizational citizenship
behaviors, counterproductive work behaviors, and turnover when
examining the criterion-related validity of machine scores.
Fifth, future research should address the low discriminant validity
of machine scores, as high intercorrelations among machine-inferred
domain scores tend to reduce the validity of a composite. We suggest
that the relatively poor discriminant validity of machine scores might
be attributed to the fact that existing ML algorithms largely
emphasize single-target optimization. Improvement
is possible through the multitask framework. For instance, for predicting multidimensional constructs (e.g., personality facet/domain
scores), a better approach may be to optimize
multiple targets simultaneously. A potential solution might be to integrate
target matrices (multidimensional facet/domain scores as ground truth
vector) into the multitask learning framework (i.e., a variation of
transfer learning where models are built simultaneously to perform a
set of related tasks, e.g., Ruder, 2017).
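The contrast between single-target and simultaneous multi-target optimization can be sketched with a simple linear analogue. The example below, in Python with scikit-learn, fits one elastic net per facet and then a single `MultiTaskElasticNet` that fits all facets jointly under a shared row-wise penalty. The data are synthetic, all sizes and penalty values are arbitrary assumptions, and a multitask linear model is only one simple instance of multitask learning, not the deep multitask framework reviewed by Ruder (2017) and not a method used in the present study. Whether joint fitting actually lowers the intercorrelations among predicted scores is an empirical question that this sketch does not settle.

```python
import numpy as np
from sklearn.linear_model import ElasticNet, MultiTaskElasticNet

def mean_abs_offdiag_corr(scores):
    """Average absolute correlation between different columns of `scores`."""
    r = np.corrcoef(scores, rowvar=False)
    off_diagonal = ~np.eye(r.shape[0], dtype=bool)
    return float(np.abs(r[off_diagonal]).mean())

# Synthetic stand-ins (assumptions): text features and questionnaire facet scores.
rng = np.random.default_rng(0)
n_people, n_features, n_facets = 300, 50, 6
X = rng.normal(size=(n_people, n_features))
true_weights = np.zeros((n_features, n_facets))
true_weights[:10] = rng.normal(size=(10, n_facets))
Y = X @ true_weights + rng.normal(size=(n_people, n_facets))

# Single-target optimization: one elastic net fitted separately for each facet.
single_pred = np.column_stack([
    ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, Y[:, j]).predict(X)
    for j in range(n_facets)
])

# Multitask optimization: all facets fitted jointly; the mixed-norm penalty
# keeps or drops each feature for all facets at once.
multi = MultiTaskElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, Y)
multi_pred = multi.predict(X)

# In-sample intercorrelations among the predicted facet scores.
print("single-target mean |r|:", round(mean_abs_offdiag_corr(single_pred), 3))
print("multitask mean |r|:", round(mean_abs_offdiag_corr(multi_pred), 3))
```

In an actual application, the intercorrelations would be evaluated in an independent test sample and compared against those among the questionnaire-derived scores.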
Finally, as mentioned earlier, one major advantage of the AI
chatbot approach to personality assessment is a less tedious testing
experience. However, such an advantage is assumed rather than
empirically verified. Thus, future research is needed to compare
applicant perceptions about the traditional personality assessment
and the AI chatbot approach.
Practical Considerations
Despite initial promises of the AI chatbot approach to personality
assessment, we believe that a series of practical issues need to be
sufficiently addressed before we could recommend its implementation in applied settings. First, the AI chatbot system used in this
study allows users to tailor-make the conversation agenda. This
makes sense, as organizations need to design interview questions
according to specific positions. However, there is little empirical
evidence within the chatbot approach that different sets of interview
questions would yield similar machine scores for the same individuals. In this regard, we agree with Hickman et al. (2022) that
vendors need to provide potential client organizations with enough
information about the training data such as sample demographics
and the list of interview questions the models are built on.
Second, there is no empirical evidence supporting the assumption
that machine scores based on the chatbot approach are resistant to
applicant faking. This is an important feature that must be established if we are to implement such an approach in real-world
selection contexts. Third, although it has been well established
that self-reported questionnaire-derived personality scores are
unlikely to result in adverse impact, there is no evidence that
machine scores based on the chatbot approach are also immune
from adverse impact. It is entirely possible that certain language
features might be associated with group membership and thus need
to be removed from the predictive models. We were unable to
examine this issue in the present study since the undergraduate
student population of this university is majority White (80%).
Fourth, although the present study shows that machine scores based
on the chatbot approach had small criterion-related validity, we still
need more evidence for criterion-related validity in industrial organizations with expanded criterion domains.
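As one illustration of how adverse impact might be screened for once suitable samples are available, the Python sketch below computes two common descriptive indices on machine scores for two groups: a standardized mean difference (Cohen's d with a pooled standard deviation) and the ratio of selection rates at a cutoff, which is often compared against the four-fifths rule of thumb. The group data, the cutoff, and the function names are hypothetical; no such analysis was possible in the present study.

```python
import numpy as np

def cohens_d(a, b):
    """Standardized mean difference (a minus b) using the pooled standard deviation."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2)
    return (a.mean() - b.mean()) / np.sqrt(pooled_var)

def impact_ratio(a, b, cutoff):
    """Lower group selection rate divided by the higher one, selecting scores >= cutoff."""
    rate_a = np.mean(a >= cutoff)
    rate_b = np.mean(b >= cutoff)
    higher = max(rate_a, rate_b)
    return min(rate_a, rate_b) / higher if higher > 0 else float("nan")

# Hypothetical machine scores for two groups (synthetic, for illustration only).
rng = np.random.default_rng(0)
group_a = rng.normal(loc=0.0, scale=1.0, size=500)
group_b = rng.normal(loc=0.2, scale=1.0, size=500)

print("d =", round(cohens_d(group_a, group_b), 3))
print("impact ratio at cutoff 0.5 =", round(impact_ratio(group_a, group_b, 0.5), 3))
```

A ratio below 0.80 is conventionally treated as a flag for further scrutiny rather than as proof of adverse impact.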
Fifth, evidence for the robustness of our findings in terms of the
volume of input participants provide during the online chat is
encouraging and should be appealing to organizations interested in
implementing the AI chatbot approach in selection practice. However, it is probably necessary to identify the lower limit (minimal
number of sentences) that would still sustain good psychometric properties of machine-inferred personality scores. Sixth, considering recent
research showing that other-reported questionnaire-derived personality scores have a unique predictive power of job performance (e.g.,
Connelly et al., 2022), it is worthwhile to supplement the present
predictive models based on self-report with parallel models based on
other-report. Once other-report models are built, the AI chatbot
system may provide more useful personality information for test-takers, thus further materializing the efficacy feature of the ML
approach to personality assessment.
Finally, the ML-based chatbot approach to personality assessment comes with potential challenges related to professional ethics
and ethical AI (e.g., concerns involving the fair and responsible use
of AI). For instance, is it ethical or legally defensible to have job
applicants go through a chatbot interview without telling them that
their textual data will be mined for selection-relevant information?
Organizations might also decide to transform recorded in-person
interviews into texts, apply predictive models such as the one used in
the present study, and obtain machine-inferred personality scores for
talent management purposes. AI chatbots may soon be an enormously popular tool for initial recruiting and screening processes,
and corporate actors may not hesitate to repurpose harvested data for
new applications.¹⁰
¹⁰ We thank an anonymous reviewer for raising this point.
References
American College Testing. (2018). Guide to the 2018 ACT/SAT concordance.
Asparouhov, T., & Muthén, B. (2009). Exploratory structural equation modeling. Structural Equation Modeling, 16(3), 397–438. https://doi.org/10.1080/
Azucar, D., Marengo, D., & Settanni, M. (2018). Predicting the Big 5
personality traits from digital footprints on social media: A meta-analysis.
Personality and Individual Differences, 124, 150–159. https://doi.org/10
Bleidorn, W., & Hopwood, C. J. (2019). Using machine learning to advance
personality assessment and theory. Personality and Social Psychology
Review, 23(2), 190–203. https://doi.org/10.1177/1088868318772990
Booth, T., & Hughes, D. J. (2014). Exploratory structural equation modeling
of personality data. Assessment, 21(3), 260–271. https://doi.org/10.1177/
Borman, W. C., & Motowidlo, S. J. (1993). Expanding the criterion domain
to include elements of contextual performance. In N. Schmitt & W. C.
Borman (Eds.), Personnel selection in organizations (pp. 71–98).
Bosco, F. A., Aguinis, H., Singh, K., Field, J. G., & Pierce, C. A. (2015).
Correlational effect size benchmarks. Journal of Applied Psychology,
100(2), 431–449. https://doi.org/10.1037/a0038047
Cer, D., Yang, Y., Kong, S.-Y., Hua, N., Limtiaco, N., John, R., Constant,
N., Guajardo-Cespedes, M., Yuan, S., Tar, C., Sung, Y.-H., Strope, B., &
Kurzweil, R. (2018). Universal sentence encoder. ArXiv. https://arxiv
Chamorro-Premuzic, T., Winsborough, D., Sherman, R. A., & Hogan, R.
(2016). New talent signals: Shiny new objects or a brave new world?
Industrial and Organizational Psychology: Perspectives on Science and
Practice, 9(3), 621–640. https://doi.org/10.1017/iop.2016.6
Chen, L., Zhao, R., Leong, C. W., Lehman, B., Feng, G., & Hoque, M. (2017,
December 23–26). Automated video interview judgment on a large-sized
corpus collected online [Conference session]. 2017 7th international conference on affective computing and intelligent interaction, ACII 2017, San
Antonio, TX, United States. https://doi.org/10.1109/ACII.2017.8273646
Chittaranjan, G., Blom, J., & Gatica-Perez, D. (2013). Mining large-scale
smartphone data for personality studies. Personal and Ubiquitous Computing, 17(3), 433–450. https://doi.org/10.1007/s00779-011-0490-1
Connelly, B. S., McAbee, S. T., Oh, I.-S., Jung, Y., & Jung, C.-W. (2022). A
multirater perspective on personality and performance: An empirical
examination of the trait-reputation-identity model. Journal of Applied
Psychology, 107(8), 1352–1368. https://doi.org/10.1037/apl0000732
Connelly, B. S., & Ones, D. S. (2010). An other perspective on personality:
Meta-analytic integration of observers’ accuracy and predictive validity.
Psychological Bulletin, 136(6), 1092–1122. https://doi.org/10.1037/a0021212
Cook, A. (2021, November 9). Data leakage. Kaggle. Retrieved March 14,
2022, from https://www.kaggle.com/alexisbcook/data-leakage
Costa, P. T., Jr., & McCrae, R. R. (1992). Revised NEO Personality
Inventory (NEO-PI-R) and NEO Five-Factor Inventory (NEO-FFI) professional manual. Psychological Assessment Resources.
Cronbach, L. J., & Meehl, P. E. (1955). Construct validity in psychological
tests. Psychological Bulletin, 52(4), 281–302. https://doi.org/10.1037/
Devlin, J., Chang, M.-W., Lee, K., & Toutanova, K. (2019, June). BERT:
Pre-training of deep bidirectional transformers for language understanding [Conference session]. Proceedings of the 2019 Conference of the
North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers),
Minneapolis, MN, United States.
Foldes, H., Duehr, E. E., & Ones, D. S. (2008). Group differences in personality:
Meta-analyses comparing five U.S. racial groups. Personnel Psychology,
61(3), 579–616. https://doi.org/10.1111/j.1744-6570.2008.00123.x
Gnambs, T. (2014). A meta-analysis of dependability coefficients (test–retest
reliabilities) for measures of the Big Five. Journal of Research in
Personality, 52, 20–28. https://doi.org/10.1016/j.jrp.2014.06.003
Golbeck, J., Robles, C., & Turner, K. (2011, May). Predicting personality
with social media [Conference session]. Proceedings of the 2011 Annual
Conference on Human Factors in Computing Systems—CHI’11, Vancouver, BC, Canada. https://doi.org/10.1145/1979742.1979614
Goldberg, L. R. (1993). The structure of phenotypic personality traits.
American Psychologist, 48(1), 26–34. https://doi.org/10.1037/0003-066X
Goldberg, L. R. (1999). A broad-bandwidth, public domain, personality
inventory measuring the lower-level facets of several five-factor models.
In I. Mervielde, I. Deary, F. De Fruyt, & F. Ostendorf (Eds.), Personality
psychology in Europe (Vol. 7, pp. 7–28). Tilburg University Press.
Gou, L., Zhou, M. X., & Yang, H. (2014, April). KnowMe and ShareMe:
Understanding automatically discovered personality traits from social
media and user sharing preference [Conference session]. Proceedings of
the SIGCHI Conference on Human Factors in Computing System—
CHI’14, Toronto, ON, Canada. https://doi.org/10.1145/2556288.2557398
Gow, I. D., Kaplan, S. N., Larcker, D. F., & Zakolyukina, A. A. (2016). CEO
personality and firm policies [Working paper 22435]. National Bureau of
Economic Research. https://doi.org/10.3386/w22435
Guo, F., Gallagher, C. M., Sun, T., Tavoosi, S., & Min, H. (2021). Smarter
people analytics with organizational text data: Demonstrations using
classic and advanced NLP models. Human Resource Management Journal, 2021, 1–16. https://doi.org/10.1111/1748-8583.12426
Harrison, J. S., Thurgood, G. R., Boivie, S., & Pfarrer, M. D. (2019).
Measuring CEO personality: Developing, validating, and testing a linguistic tool. Strategic Management Journal, 40, 1316–1330. https://
Harrison, J. S., Thurgood, G. R., Boivie, S., & Pfarrer, M. D. (2020).
Perception is reality: How CEOs’ observed personality influences market
perceptions of firm risk and shareholder returns. Academy of Management
Journal, 63(4), 1166–1195. https://doi.org/10.5465/amj.2018.0626
Hauenstein, N. M. A., Bradley, K. M., O’Shea, P. G., Shah, Y. J., & Magill, D. P.
(2017). Interactions between motivation to fake and personality item characteristics: Clarifying the process. Organizational Behavior and Human
Decision Processes, 138, 74–92. https://doi.org/10.1016/j.obhdp.2016.11.002
Hickman, L., Bosch, N., Ng, V., Saef, R., Tay, L., & Woo, S. E. (2022).
Automated video interview personality assessments: Reliability, validity,
and generalizability investigations. Journal of Applied Psychology,
107(8), 1323–1351. https://doi.org/10.1037/apl0000695
Hoerl, A., & Kennard, R. (1988). Ridge regression. In S. Kotz, C. B. Read, N.
Balakrishnan, B. Vidakovic, & N. L. Johnson (Eds.), Encyclopedia of
statistical sciences (Vol. 8, pp. 129–136). Wiley.
Hoppe, S., Loetscher, T., Morey, S. A., & Bulling, A. (2018). Eye movements
during everyday behavior predict personality traits. Frontiers in Human
Neuroscience, 12, Article 105. https://doi.org/10.3389/fnhum.2018.00105
Hopwood, C. J., & Donnellan, M. B. (2010). How should the internal structure
of personality inventories be evaluated? Personality and Social Psychology
Review, 14(3), 332–346. https://doi.org/10.1177/1088868310361240
Hu, L. T., & Bentler, P. M. (1999). Cutoff criteria for fit indexes in
covariance structure analysis: Conventional criteria versus new alternatives. Structural Equation Modeling, 6(1), 1–55. https://doi.org/10.1080/
Hurtz, G. M., & Donovan, J. J. (2000). Personality and job performance: The
Big Five revisited. Journal of Applied Psychology, 85(6), 869–879. https://
Hwang, A. H. C., & Won, A. S. (2021, May). IdeaBot: Investigating social
facilitation in human-machine team creativity. In Y. Kitamura & A.
Quigley (Eds.), Proceedings of the 2021 CHI conference on human
factors in computing systems (pp. 1–16). ACM. https://doi.org/10.1145/
International Business Machines. (n.d.). What is a chatbot? Retrieved
December 28, 2022, from https://www.ibm.com/topics/chatbots
Jayaratne, M., & Jayatilleke, B. (2020). Predicting personality using answers
to open-ended interview questions. IEEE Access: Practical Innovations,
Open Solutions, 8, 115345–115355. https://doi.org/10.1109/ACCESS.2020
Jiang, Z., Rashik, M., Panchal, K., Jasim, M., Sarvghad, A., Riahi, P.,
DeWitt, E., Thurber, F., & Mahyar, N. (2023). CommunityBots: Creating
and evaluating a multi-agent chatbot platform for public input elicitation
[Conference session]. Accepted to the 26th ACM Conference on
Computer-Supported Cooperative Work and Social Computing, Minneapolis, Minnesota, United States.
Joulin, A., Grave, E., Bojanowski, P., & Mikolov, T. (2016). Bag of tricks for
efficient text classification. arXiv. https://doi.org/10.48550/arXiv.1607.01759
Judge, T. A., & Bono, J. E. (2001). Relationship of core self-evaluations traits
—Self-esteem, generalized self-efficacy, locus of control, and emotional
stability—With job satisfaction and job performance: A meta-analysis.
Journal of Applied Psychology, 86(1), 80–92. https://doi.org/10.1037/
Kim, S., Lee, J., & Gweon, G. (2019, May). Comparing data from chatbot
and web surveys: Effects of platform and conversational style on survey
response quality. In S. Brewster & G. Fitzpatrick (Eds.), Proceedings of
the 2019 CHI conference on human factors in computing systems (pp. 1–
12). ACM. https://doi.org/10.1145/3290605.3300316
Kosinski, M., Stillwell, D., & Graepel, T. (2013). Private traits and attributes
are predictable from digital records of human behavior. Proceedings of the
National Academy of Sciences of the United States of America, 110(15),
5802–5805. https://doi.org/10.1073/pnas.1218772110
Kulkarni, V., Kern, M. L., Stillwell, D., Kosinski, M., Matz, S., Ungar, L.,
Skiena, S., & Schwartz, H. A. (2018). Latent human traits in the language of
social media: An open-vocabulary approach. PLOS ONE, 13(11), Article
e0201703. https://doi.org/10.1371/journal.pone.0201703
Lanyon, R. I., Goodstein, L. D., & Wershba, R. (2014). ‘Good Impression’ as a
moderator in employment-related assessment. International Journal of
Selection and Assessment, 22(1), 52–61. https://doi.org/10.1111/ijsa.12056
LeCun, Y., Bengio, Y., & Hinton, G. (2015). Deep learning. Nature,
521(7553), 436–444. https://doi.org/10.1038/nature14539
Leutner, K., Liff, J., Zuloaga, L., & Mondragon, N. (2021). Hirevue’s
assessment science [White paper]. https://Hirevue.com. https://webapi.
Li, J., Zhou, M. X., Yang, H., & Mark, G. (2017, March). Confiding in and
listening to virtual agents: The effect of personality [Conference session].
Paper presented at the 22nd annual meeting of the intelligent user interfaces
community, Limassol, Cyprus. https://doi.org/10.1145/3025171.3025206
Li, W., Wu, C., Hu, X., Chen, J., Fu, S., Wang, F., & Zhang, D. (2020).
Quantitative personality predictions from a brief EEG recording. IEEE
Transactions on Affective Computing. Advance online publication. https://
Liu, A. X., Li, Y., & Xu, S. X. (2021). Assessing the unacquainted: Inferred
reviewer Personality and review helpfulness. Management Information
Systems Quarterly, 45(3), 1113–1148. https://doi.org/10.25300/MISQ/
Loevinger, J. (1957). Objective tests as instruments of psychological theory.
Psychological Reports, 3(3), 635–694. https://doi.org/10.2466/pr0.1957.3
Lorenzo-Seva, U., & ten Berge, J. M. (2006). Tucker’s congruence coefficient as a meaningful index of factor similarity. Methodology: European
Journal of Research Methods for the Behavioral and Social Sciences, 2(2),
57–64. https://doi.org/10.1027/1614-2241.2.2.57
Marinucci, A., Kraska, J., & Costello, S. (2018). Recreating the relationship
between subjective wellbeing and personality using machine learning: An
investigation into Facebook online behaviours. Big Data and Cognitive
Computing, 2(3), Article 29. https://doi.org/10.3390/bdcc2030029
Marsh, H. W., Guo, J., Dicke, T., Parker, P. D., & Craven, R. G. (2020).
Confirmatory factor analysis (CFA), exploratory structural equation
modeling (ESEM), and Set-ESEM: Optimal balance between goodness
of fit and parsimony. Multivariate Behavioral Research, 55(1), 102–119.
McAbee, S. T., & Connelly, B. S. (2016). A multi-rater framework for
studying personality: The trait-reputation-identity model. Psychological
Review, 123(5), 569–591. https://doi.org/10.1037/rev0000035
McAbee, S. T., & Oswald, F. L. (2013). The criterion-related validity of
personality measures for predicting GPA: A meta-analytic validity competition. Psychological Assessment, 25(2), 532–544. https://doi.org/10
McCarthy, J., & Wright, P. (2004). Technology as experience. Interactions,
11(5), 42–43. https://doi.org/10.1145/1015530.1015549
McCarthy, J. M., Bauer, T. N., Truxillo, D. M., Anderson, N. R., Costa, A. C., &
Ahmed, S. M. (2017). Applicant perspectives during selection: A review
addressing “So What?” “What’s New?” and “Where to Next?.” Journal of
Management, 43(6), 1693–1725. https://doi.org/10.1177/0149206316681846
Mikolov, T. (2012). Statistical language models based on neural networks.
Presentation at Google.
Mitchell, T. M. (1997). Machine learning. McGraw-Hill.
Morgeson, F. P., Campion, M. A., Dipboye, R. L., Hollenbeck, J. R., Murphy,
K., & Schmitt, N. (2007). Reconsidering the use of personality tests in
personnel selection contexts. Personnel Psychology, 60(3), 683–729.
Mulfinger, E., Wu, F., Alexander, L., III, & Oswald, F. L. (2020, February).
AI technologies in talent management systems: It glitters but is it gold?
[Poster Presentation]. Work in the 21st Century: Automation, Workers,
and Society, Houston, TX, United States.
Muthén, L. K., & Muthén, B. O. (2017). Mplus user’s guide (8th ed.).
(Original work published 1998).
Oswald, F. L., Behrend, T. S., Putka, D. J., & Sinar, E. (2020). Big data in
industrial-organizational psychology and human resource management:
Forward progress for organizational research and practice. Annual Review
of Organizational Psychology and Organizational Behavior, 7(1), 505–
533. https://doi.org/10.1146/annurev-orgpsych-032117-104553
Oswald, F. L., Schmitt, N., Kim, B. H., Ramsay, L. J., & Gillespie, M. A.
(2004). Developing a biodata measure and situational judgment inventory
as predictors of college student performance. Journal of Applied Psychology, 89(2), 187–207. https://doi.org/10.1037/0021-9010.89.2.187
Park, G., Schwartz, H. A., Eichstaedt, J. C., Kern, M. L., Kosinski, M.,
Stillwell, D. J., Ungar, L. H., & Seligman, M. E. (2015). Automatic
personality assessment through social media language. Journal of Personality and Social Psychology, 108(6), 934–952. https://doi.org/10.1037/
Paulhus, D. L., Robins, R. W., Trzesniewski, K. H., & Tracy, J. L. (2004).
Two replicable suppressor situations in personality research. Multivariate
Behavioral Research, 39(2), 303–328. https://doi.org/10.1207/s15327906
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel,
O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J.,
Passos, A., Cournapeau, D., Brucher, M., Perrot, M., & Duchesnay, E.
(2011). Scikit-learn: Machine learning in Python. Journal of Machine
Learning Research, 12, 2825–2830.
Pennebaker, J. W., Boyd, R. L., Jordan, K., & Blackburn, K. (2015). The
development and psychometric properties of LIWC2015. https://LIWC.net
Pervin, L. A. (1994). Further reflections on current trait theory. Psychological Inquiry, 5(2), 169–178. https://doi.org/10.1207/s15327965pli0502_19
Pulakos, E. D., Arad, S., Donovan, M. A., & Plamondon, K. E. (2000).
Adaptability in the workplace: Development of a taxonomy of adaptive
performance. Journal of Applied Psychology, 85(4), 612–624. https://
Putka, D. J., Beatty, A. S., & Reeder, M. C. (2018). Modern prediction methods:
New perspectives on a common problem. Organizational Research
Methods, 21(3), 689–732. https://doi.org/10.1177/1094428117697041
Putka, D. J., Oswald, F. L., Landers, R. N., Beatty, A. S., McCloy, R. A., & Yu,
M. C. (2022). Evaluating a natural language processing approach to estimating
KSA and interest job analysis ratings. Journal of Business and Psychology.
Advance online publication. https://doi.org/10.1007/s10869-022-09824-0
R Core Team. (2020). R: A language and environment for statistical computing.
R Foundation for Statistical Computing. http://www.R-project.org/
Revelle, W. (2022). Psych: Procedures for psychological, psychometric, and
personality research (Version 2.2.5). Northwestern University. http://
Ruder, S. (2017). An overview of multi-task learning in deep neural networks. arXiv. https://arxiv.org/abs/1706.05098
Sackett, P. R., Zhang, C., Berry, C. M., & Lievens, F. (2022). Revisiting
meta-analytic estimates of validity in personnel selection: Addressing
systematic overcorrection for restriction of range. Journal of Applied
Psychology, 107(11), 2040–2068. https://doi.org/10.1037/apl0000994
Sajjadiani, S., Sojourner, A. J., Kammeyer-Mueller, J. D., & Mykerezi, E.
(2019). Using machine learning to translate applicant work history into
predictors of performance and turnover. Journal of Applied Psychology,
104(10), 1207–1225. https://doi.org/10.1037/apl0000405
Shumanov, M., & Johnson, L. (2021). Making conversations with chatbots
more personalized. Computers in Human Behavior, 117, Article 106627.
Speer, A. B. (2021). Scoring dimension-level job performance from narrative
comments: Validity and generalizability when using natural language
processing. Organizational Research Methods, 24(3), 572–594. https://
Spisak, B. R., van der Laken, P. A., & Doornenbal, B. M. (2019). Finding the
right fuel for the analytical engine: Expanding the leader trait paradigm
through machine learning? The Leadership Quarterly, 30(4), 417–426.
Suen, H.-Y., Huang, K.-E., & Lin, C.-L. (2019). TensorFlow-based automatic personality recognition used in asynchronous video interviews.
IEEE Access: Practical Innovations, Open Solutions, 7, 61018–61023.
Sun, T. (2021). Artificial intelligence powered personality assessment: A
multidimensional psychometric natural language processing perspective
[Doctoral dissertation]. University of Illinois Urbana-Champaign. https://
Tay, L., Woo, S. E., Hickman, L., & Saef, R. M. (2020). Psychometric and
validity issues in machine learning approaches to personality assessment:
A focus on social media text mining. European Journal of Personality,
34(5), 826–844. https://doi.org/10.1002/per.2290
Tibshirani, R. (1996). Regression shrinkage and selection via the lasso.
Journal of the Royal Statistical Society. Series B. Methodological, 58(1),
267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x
Tippins, N. T., Oswald, F. L., & McPhail, S. M. (2021). Scientific, legal, and
ethical concerns about AI-based personnel selection tools: A call to action.
Personnel Assessment and Decisions, 7(2), 1–22. https://doi.org/10
Van Rossum, G., & Drake, F. L. (2009). Python 3 reference manual.
Völkel, S. T., Haeuslschmid, R., Werner, A., Hussmann, H., & Butz, A.
(2020). How to trick AI: Users’ strategies for protecting themselves from
automatic personality assessment. In R. Bernhaupt & F. F. Mueller (Eds.),
Proceedings of the 2020 CHI conference on human factors in computing
systems (pp. 1–15). ACM. https://doi.org/10.1145/3313831.3376877
Wang, S., & Chen, X. (2020). Recognizing CEO personality and its impact
on business performance: Mining linguistic cues from social media.
Information & Management, 57(5), Article 103173. https://doi.org/10
Woehr, D. J., Putka, D. J., & Bowler, M. C. (2012). An examination of g-theory methods for modeling multitrait–multimethod data: Clarifying links
to construct validity and confirmatory factor analysis. Organizational
Research Methods, 15(1), 134–161. https://doi.org/10.1177/109442811
Xiao, Z., Zhou, M. X., Liao, Q. V., Mark, G., Chi, C., Chen, W., & Yang, H.
(2020). Tell me about yourself: Using an AI-powered chatbot to conduct
conversational surveys with open-ended questions. ACM Transactions on
Computer-Human Interaction, 27(3), 1–37. https://doi.org/10.1145/3381804
Yang, Y., Abrego, G. H., Yuan, S., Guo, M., Shen, Q., Cer, D., Sung, Y-H.,
Strope, B., & Kurzweil, R. (2019). Improving multilingual sentence
embedding using Bi-directional Dual Encoder with additive margin
softmax. arXiv. https://doi.org/10.24963/ijcai.2019/746
Yarkoni, T. (2010). Personality in 100,000 words: A large-scale analysis of
personality and word use among bloggers. Journal of Research in Personality, 44(3), 363–373. https://doi.org/10.1016/j.jrp.2010.04.001
Youyou, W., Kosinski, M., & Stillwell, D. (2015). Computer-based personality judgments are more accurate than those made by humans. Proceedings of the National Academy of Sciences of the United States of America,
112(4), 1036–1040. https://doi.org/10.1073/pnas.1418680112
Zhang, B., Luo, J., Chen, Y., Roberts, B., & Drasgow, F. (2020). The road
less traveled: A cross-cultural study of the negative wording factor in
multidimensional scales. PsyArXiv. https://doi.org/10.31234/osf.io/2psyq
Zhang, B., Luo, J., Sun, T., Cao, M., & Drasgow, F. (2021). Small but
nontrivial: A comparison of six strategies to handle cross-loadings in
bifactor predictive models. Multivariate Behavioral Research. Advance
online publication. https://doi.org/10.1080/00273171.2021.1957664
Zhou, M. X., Chen, W., Xiao, Z., Yang, H., Chi, T., & Williams, R. (2019,
March 17–20). Getting virtually personal: Chatbots who actively listen to
you and infer your personality [Conference session]. Paper presented at
24th International Conference on Intelligent User Interfaces (IUI’19
Companion), Marina Del Rey, CA, United States. https://doi.org/10
Ziegler, M., MacCann, C., & Roberts, R. D. (Eds.). (2012). New perspectives
on faking in personality assessment. Oxford University Press.
Zou, H., & Hastie, T. (2005). Regularization and variable selection via the
elastic net. Journal of the Royal Statistical Society. Series B, Statistical
Methodology, 67(2), 301–320. https://doi.org/10.1111/j.1467-9868.2005
Received May 13, 2022
Revision received January 3, 2023
