DECODING RESUMES: THE DATA-DRIVEN APPROACH TO TALENT ACQUISITION

International Journal of Data Science Research and Development (IJDSRD)
Volume 2, Issue 1, January-December 2023, pp. 17-26, Article ID: IJDSRD_02_01_003
Available online at https://iaeme.com/Home/issue/IJDSRD?Volume=2&Issue=1
Journal ID: 2811-6201.
© IAEME Publication
Akshata Upadhye
Cincinnati, Ohio, USA
ABSTRACT
The process of talent acquisition has been profoundly transformed by developments in
data science and technology. In this paper, we begin by discussing the importance of
resume categorization in identifying qualified candidates. We also discuss the
challenges faced in scrutinizing unstructured resumes and the advantages offered by
machine learning and natural language processing techniques in this domain. We then
discuss the significance of resume categorization, its role in streamlining the initial
screening process, and its contribution to strategic talent pool management. We also
explore the advantages of Latent Dirichlet Allocation (LDA) topic modeling over
classification-based approaches. We train an LDA model and use it to extract the topic
distribution within the dataset and the keyword distribution within each topic. Finally,
we demonstrate the effectiveness of our approach by visualizing the topics on an
inter-topic distance map and highlighting the important keywords for a given topic.
Keywords: Resume Categorization, Talent Acquisition, Data Science, Natural
Language Processing, Latent Dirichlet Allocation, Recruitment Technology.
Cite this Article: Akshata Upadhye, Decoding Resumes: The Data-Driven Approach
to Talent Acquisition, International Journal of Data Science Research and Development
(IJDSRD), 2(1), 2023, pp. 17–26.
https://iaeme.com/Home/issue/IJDSRD?Volume=2&Issue=1
1. INTRODUCTION
With a large volume of job applications for each position, manual review of resumes is
time-consuming and inefficient. Resume categorization, often aided by technology and
automation, helps in streamlining the initial screening process and in allocating resources
effectively. Recruiters can focus their time and efforts on candidates who are the best fit for
the position, improving the overall recruitment process. Analyzing resumes helps match the
skills, qualifications, and experience of candidates with the specific requirements of the job,
so that only qualified candidates are considered for further evaluation. Therefore, employers
use resume categorization tools to match the skills and experience listed on a candidate's
resume with the requirements of the job opening.
https://iaeme.com/Home/journal/IJDSRD
17
editor@iaeme.com
This helps ensure that candidates meet the basic criteria for the position. Recruiters often
use applicant tracking systems (ATS) to aid the resume shortlisting process. These systems scan
resumes for specific keywords related to skills, experience, and qualifications, helping identify
suitable candidates more efficiently. Analyzing resumes also generates valuable data and
insights about candidate pools, which can be used for strategic decision-making, such as
identifying skill gaps or adjusting recruitment strategies, and for shortlisting candidates for
interviews or further assessments. By categorizing resumes, employers can create a pool of
potential candidates who are a good fit for the organization. Resume categorization also aids
in building and managing a talent pool: even if a candidate is not selected for a specific
position, their information can be stored for future opportunities. In summary, automating
resume categorization can help in making data-driven decisions that improve the organization’s
talent acquisition strategies.
2. RELATED WORK
When working with talent data, the resume is one of the key sources of information: it contains
the education, skills and work experience of a candidate, and its content is unstructured.
Research and practical applications of Machine Learning and Natural Language Processing
have proved useful for dealing with unstructured resume data. Categorizing talent databases
has gained popularity in recent years, as organizations seek more effective and data-driven
ways to manage and recruit talent. In this section we review recent research and developments
in categorizing talent data.
To handle the large number of resumes submitted for a job opening, many researchers have
developed techniques to extract skills and categorize resumes. Most of the research in this
area uses classification algorithms. In [1], the authors used TF-IDF for resume text
representation and performed classification using Machine Learning algorithms such as Naive
Bayes, Random Forest, and SVM. Another approach to categorizing resumes is to compute word
order similarity between sentences on a large dataset, as introduced in [2]. In [3], the
authors tested 9 different classifiers - Support Vector Machine (Linear, SGD, SVC, NuSVC),
Naive Bayes (Bernoulli, Multinomial, Gaussian), K-Nearest Neighbor, and Logistic Regression -
and obtained promising results from Support Vector Machine. Deep learning-based methods such
as Convolutional Neural Networks have also been used to classify resumes, as in [4]. In [5],
the authors demonstrated how resumes are classified using classifiers such as Random Forest,
Multinomial Naive Bayes, Logistic Regression, and Linear Support Vector Classifier, and the
classification results are then used to design a content-based recommendation system using
KNN and cosine similarity. Another approach is to use an ensemble of classifiers, as
presented in [6], in which the authors used 5 different classifiers - Naive Bayes,
Multinomial Naive Bayes, Linear SVC, Bernoulli Naive Bayes, and Logistic Regression - and
combined the votes from these classifiers to determine the final category assigned to a
resume. Finally, in [7], the authors designed a rule-based system that rates resumes based on
the keywords extracted from them in order to assign a category.
As this survey of existing approaches shows, most researchers have treated resume
categorization as a classification problem. Classification algorithms such as SVM, Naïve
Bayes and Random Forest can classify resumes with high precision and recall for specific
categories when they are properly trained, but they also have limitations of their own.
First, training classification models for resume analysis requires a large amount of labeled
data, which can be time-consuming and costly to obtain.
Classification models are less flexible when it comes to identifying latent topics or
discovering hidden patterns within the resumes, since they are designed to work with predefined
classes. Additionally, classification models may struggle to capture the broader context and
nuances present in resumes, especially when dealing with unstructured text. Finally, imbalanced
datasets, where one class greatly outnumbers others, can affect model performance and bias
results. Therefore, our goal is to build a model that can overcome the limitations of traditional
classifiers by discovering hidden patterns and topics within a corpus of resumes, without the
need for labeled training data.
3. BACKGROUND
3.1. Bag of Words
The first crucial step in our roadmap for categorizing resumes is preparing them for model
training. Resume data tends to be unstructured text, so extracting precise and meaningful
information from it can be challenging. To make use of the text within the resumes, our first
task is to create a standardized representation for all the resumes in our collection. We use
the Bag-of-Words (BoW) model [8], in which every resume is represented as a fixed-length
vector containing the frequency of each term within that resume. Stacked together, these
vectors form what is also known as a term frequency matrix. With this standardized
representation in place, we can feed it to machine learning models for categorization of
resumes.
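As a minimal sketch of the Bag-of-Words idea described above, the following Python snippet
builds a vocabulary and a term frequency matrix using only the standard library. The function
names and the two toy documents are hypothetical and serve only as an illustration; they are
not taken from the implementation used in this paper.

```python
from collections import Counter


def build_vocabulary(tokenized_docs):
    """Map every distinct term in the corpus to a fixed column index."""
    terms = sorted({term for doc in tokenized_docs for term in doc})
    return {term: index for index, term in enumerate(terms)}


def to_bow_vector(tokens, vocabulary):
    """Return a fixed-length list of term counts for one document."""
    counts = Counter(tokens)
    vector = [0] * len(vocabulary)
    for term, count in counts.items():
        if term in vocabulary:  # terms outside the vocabulary are ignored
            vector[vocabulary[term]] = count
    return vector


# Toy, already-tokenized "resumes" (hypothetical data).
docs = [
    ["python", "sql", "python", "reporting"],
    ["sales", "customer", "reporting"],
]
vocabulary = build_vocabulary(docs)
term_frequency_matrix = [to_bow_vector(doc, vocabulary) for doc in docs]
```

Every row of term_frequency_matrix has the same length (the vocabulary size), regardless of
how long the original resume was, which is what makes the representation fixed-length.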
3.2. Latent Dirichlet Allocation
Once we have preprocessed the resumes, our next step involves training a model to extract
relevant topics and keywords from the resume data. For this task, we will be using a
probabilistic topic modeling algorithm known as Latent Dirichlet Allocation [9], or LDA for
short. LDA is a widely recognized approach in the world of document clustering and topic
modeling. Its primary function is to identify hidden topics within a collection of documents by
estimating two key distributions:
• Probabilistic distribution of words in topics: this distribution reveals which words are
most likely to appear within specific topics.
• Probabilistic distribution of topics in documents: this distribution tells us how likely
various topics are to be present within each document.
The key advantage of LDA lies in its ability to uncover the underlying semantic structures
within documents. In other words, it helps us reveal the hidden meanings and themes that may
not be immediately apparent in the text.
3.3. pyLDAvis
Visualizing the topics and their important keywords helps in deriving insights from the data:
it shows the content, relationships, and structure of the topics within the resumes. We
therefore use pyLDAvis [11] to visualize the topics on an Inter-topic Distance Map and gain
insight into the hidden topics that the LDA topic model has uncovered within the data. The map
displays the relationships among topics: similar topics are placed closer together in the 2D
space and dissimilar topics are placed farther apart. Each bubble on this plot represents a
topic, with the bubble's size indicating how prevalent that topic is in our document collection.
Additionally, the top words for a topic are shown on the right side of the inter-topic plot.
When a topic is selected, the chart on the right shows the word frequencies for that topic in
detail. This kind of visualization is useful for understanding the data and for evaluating the
topics.
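To give a rough sense of what "similar" means on such a map: LDAvis-style maps are, to our
understanding of the default settings, based on the Jensen-Shannon divergence between the
word distributions of the topics, which is then projected to two dimensions. The sketch below
computes only the divergence, on made-up distributions over a four-word vocabulary. The
function names and the toy data are hypothetical, and the 2D projection step is omitted.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats; zero-probability terms of p add 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def jensen_shannon_divergence(p, q):
    """Symmetric divergence between two distributions, bounded above by ln(2)."""
    midpoint = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, midpoint) + 0.5 * kl_divergence(q, midpoint)


# Hypothetical topic-word distributions over the same four-word vocabulary.
topic_a = [0.5, 0.4, 0.1, 0.0]
topic_b = [0.4, 0.5, 0.1, 0.0]
topic_c = [0.0, 0.1, 0.4, 0.5]

distance_ab = jensen_shannon_divergence(topic_a, topic_b)
distance_ac = jensen_shannon_divergence(topic_a, topic_c)
```

In this toy example distance_ab is smaller than distance_ac, so topics a and b would be drawn
closer to each other than topics a and c.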
4. METHODOLOGY
Fig. 1. Data Preparation and workflow
In this section we discuss in detail the steps used for processing resumes and extracting
hidden topics within them and the visualization of topics.
4.1. Phase 1: Text Preprocessing
Several preprocessing steps are applied to every resume Ri in the resume dataset R. The text
of each resume is tokenized, stop words are removed, every token is lemmatized, and frequent
n-grams are added to the list of tokens.
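The preprocessing steps above can be sketched in plain Python as follows. This is only an
illustration under stated assumptions: the stop-word list is a tiny hypothetical sample, the
naive_lemmatize function is a crude stand-in for a real lemmatizer, the n-grams are limited to
bigrams, and the min_count threshold is an arbitrary choice. None of these details are
specified in this paper.

```python
import re
from collections import Counter

# Tiny illustrative stop-word list (assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "for", "to", "with"}


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only."""
    return re.findall(r"[a-z]+", text.lower())


def naive_lemmatize(token):
    """Crude stand-in for a lemmatizer: strips a trailing plural 's'."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(text):
    """Tokenize, drop stop words, then lemmatize the remaining tokens."""
    return [naive_lemmatize(t) for t in tokenize(text) if t not in STOP_WORDS]


def add_frequent_bigrams(docs, min_count=2):
    """Append bigrams seen at least min_count times in the corpus to each document."""
    counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, count in counts.items() if count >= min_count}
    return [
        doc + ["_".join(pair) for pair in zip(doc, doc[1:]) if pair in frequent]
        for doc in docs
    ]


# Hypothetical resume snippets.
resumes = [
    "Managed customer service teams and the sales reports",
    "Customer service representative for retail sales",
]
token_lists = [preprocess(text) for text in resumes]
with_bigrams = add_frequent_bigrams(token_lists)
```

In this toy corpus the pair "customer service" occurs in both snippets, so the token
customer_service is appended to both token lists.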
4.2. Phase 2: Text Representation
In this step the preprocessed documents are used to generate a fixed length vector representation
using the Bag-of-Words model. The fixed length vector representation, also known as the term
frequency matrix, can be used for topic modeling.
4.3. Phase 3: Topic and keyword extraction
Once we have extracted the term frequency vector representation for every resume, we train an
LDA topic model for n topics using the LDA implementation in the gensim library [10]. From the
trained model we extract the keyword distribution for every topic and the topic distribution
within every resume.
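A minimal sketch of this phase with gensim could look as follows. The helper names, the
number of passes, and the random seed are assumptions made for illustration and are not the
settings used in this paper. The gensim imports are placed inside the function only so that
the helpers can be defined in an environment where gensim is not installed.

```python
def train_lda(token_lists, num_topics, passes=10, seed=0):
    """Train an LDA model with gensim; returns (model, dictionary, corpus)."""
    from gensim.corpora import Dictionary
    from gensim.models import LdaModel

    dictionary = Dictionary(token_lists)  # maps each term to an integer id
    # Sparse Bag-of-Words: one list of (term_id, count) pairs per resume.
    corpus = [dictionary.doc2bow(tokens) for tokens in token_lists]
    model = LdaModel(
        corpus=corpus,
        id2word=dictionary,
        num_topics=num_topics,
        passes=passes,
        random_state=seed,
    )
    return model, dictionary, corpus


def top_keywords(model, topic_id, topn=30):
    """Keyword distribution of one topic as (word, probability) pairs."""
    return model.show_topic(topic_id, topn=topn)


def topic_distribution(model, bow):
    """Topic distribution of one resume as (topic_id, probability) pairs."""
    return model.get_document_topics(bow, minimum_probability=0.0)
```

With preprocessed token lists in hand, calling train_lda(token_lists, num_topics=n) followed
by top_keywords and topic_distribution would yield the two distributions described above.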
4.4. Phase 4: Visualization
Finally, we visualize the topics and the keyword distribution within each topic in a 2D space
using pyLDAvis.
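A possible way to produce this visualization from a trained gensim model is sketched below.
The function name and the output file name are hypothetical. The module path
pyLDAvis.gensim_models applies to pyLDAvis 3.x; older releases exposed the same functionality
as pyLDAvis.gensim, so the import may need adjusting depending on the installed version.

```python
def save_topic_map(model, corpus, dictionary, output_path="lda_topics.html"):
    """Write an interactive inter-topic distance map for a gensim LDA model to HTML."""
    # Imported here so the helper can be defined without pyLDAvis installed.
    import pyLDAvis
    import pyLDAvis.gensim_models as gensimvis

    # Combine the model, the Bag-of-Words corpus and the dictionary into plot data.
    prepared = gensimvis.prepare(model, corpus, dictionary)
    pyLDAvis.save_html(prepared, output_path)
    return output_path
```

Opening the resulting HTML file in a browser shows the topic bubbles on the left and the top
keywords of the selected topic on the right.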
5. DATASETS
For our research, we used a resume dataset from Kaggle containing 2500 resumes from the
following classes: HR, designer, information-technology, teacher, advocate,
business-development, healthcare, fitness, agriculture, BPO, sales, consultant, digital-media,
automobile, chef, finance, apparel, engineering, accountant, construction, public-relations,
banking, arts, aviation. The class distribution within the dataset is shown in Fig. 2.
Fig. 2. Class distribution in the data
6. RESULTS AND INTERPRETATION
In this section we examine the output of the LDA topic model by inspecting the topics on the
inter-topic visualization and the top keywords belonging to each of them. As discussed
earlier, topics that are close to each other on the map are similar, and topics that are far
from each other are dissimilar. Additionally, the keywords belonging to the selected topic
are displayed on the right side along with their probability scores.
Fig. 3 shows topic 1: based on its top 30 words and their frequencies, the resumes within
this topic appear to belong to candidates with experience in the construction industry. By
the same reading of the top 30 words and their frequencies, topic 2 (Fig. 4) corresponds to
candidates with experience in the customer success industry, topic 6 (Fig. 5) to the sales
and marketing industry, topic 7 (Fig. 6) to the teaching profession, and topic 9 (Fig. 7) to
the aviation industry.
Fig. 3. Topic 1 and its top 30 keywords
Fig. 4. Topic 2 and its top 30 keywords
Fig. 5. Topic 6 and its top 30 keywords
Fig. 6. Topic 7 and its top 30 keywords
Fig. 7. Topic 9 and its top 30 keywords
The inter-topic distance visualization therefore suggests that the LDA topic model is able to
uncover hidden themes within the resumes and to extract the top skills within each topic,
which is useful when recruiters have to review a large number of resumes for a particular
job. For instance, to select suitable candidates for a Customer Service Representative
position, we would prioritize reviewing resumes from topic 2, since those resumes have the
most suitable keywords for customer service jobs. Additionally, overlapping topics may
represent resumes with skills belonging to similar skill categories, such as the skills
required for marketing and sales. Apart from these individual topics, the overall
visualization also gives a better idea of how the candidates and their skills are distributed
in our data, which can help HR teams develop specific recruitment strategies. We have reviewed
only a few topics in this paper, since it is not possible to include images for all 25 topics.
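The shortlisting idea described above, prioritizing resumes that load most heavily on a
chosen topic, can be sketched as follows. The resume identifiers and probabilities are
made-up values for illustration, and the function name is hypothetical; in practice the
probabilities would come from the topic distribution that the LDA model assigns to each
resume.

```python
def shortlist_by_topic(doc_topic_probs, topic_id, top_n=3):
    """Return the top_n (resume_id, probability) pairs with the most weight on topic_id."""
    scored = [
        (resume_id, probs.get(topic_id, 0.0))  # a missing topic counts as probability 0
        for resume_id, probs in doc_topic_probs.items()
    ]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:top_n]


# Hypothetical topic distributions per resume: {topic_id: probability}.
doc_topic_probs = {
    "resume_001": {1: 0.70, 2: 0.20, 6: 0.10},
    "resume_002": {2: 0.85, 6: 0.15},
    "resume_003": {2: 0.40, 6: 0.60},
    "resume_004": {7: 1.00},
}

# Resumes to review first for a role matching topic 2.
shortlist = shortlist_by_topic(doc_topic_probs, topic_id=2, top_n=2)
```

Ranking by probability rather than assigning each resume to a single topic keeps candidates
whose resumes span several related topics, such as sales and marketing, visible to the
recruiter.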
7. CONCLUSION
In this paper we have discussed the importance of automating resume categorization for the
initial screening process and reviewed the existing techniques most commonly used for it. We
have explored the strengths of Latent Dirichlet Allocation (LDA) in uncovering hidden semantic
structures and found that it is a useful technique for sorting resumes into relevant groups.
We have also presented the results of our approach using inter-topic visualizations and the
top keywords of selected topics.
We have thus shown that probabilistic topic modeling can identify hidden insights within
resume data, and demonstrated how it can be leveraged to pick a group of suitable candidates
and to get an overall picture of all the resumes within the database.
REFERENCES
[1] Pal, Riya, Shahrukh Shaikh, Swaraj Satpute, and Sumedha Bhagwat. “Resume classification
using various machine learning algorithms.” In ITM Web of Conferences, vol. 44, p. 03011.
EDP Sciences, 2022.
[2] Shaikh, Razkeen, Nikita Phulkar, Harsha Bhute, Sana Kauser Shaikh, and Prajakta Bhapkar.
“An intelligent framework for e-recruitment system based on text categorization and semantic
analysis.” In 2021 Third International Conference on Inventive Research in Computing
Applications (ICIRCA), pp. 1076-1080. IEEE, 2021.
[3] Ali, Irfan, Nimra Mughal, Zahid Hussain Khand, Javed Ahmed, and Ghulam Mujtaba. “Resume
classification system using natural language processing and machine learning techniques.”
Mehran University Research Journal of Engineering and Technology 41, no. 1 (2022): 65-79.
[4] Nasser, Shabna, C. Sreejith, and M. Irshad. “Convolutional neural network with word
embedding based approach for resume classification.” In 2018 International Conference on
Emerging Trends and Innovations in Engineering and Technological Research (ICETIETR), pp.
1-6. IEEE, 2018.
[5] Roy, Pradeep Kumar, Sarabjeet Singh Chowdhary, and Rocky Bhatia. “A Machine Learning
approach for automation of Resume Recommendation system.” Procedia Computer Science 167
(2020): 2318-2327.
[6] Gopalakrishna, Suhas Tangadle, and Vijayaraghavan Vijayaraghavan. “Automated tool for
Resume classification using Sementic analysis.” International Journal of Artificial Intelligence
and Applications (IJAIA) 10, no. 1 (2019).
[7] Chandola, Divyanshu, Aditya Garg, Ankit Maurya, and Amit Kushwaha. “Online resume
parsing system using text analytics.” Journal of Multi-Disciplinary Engineering Technologies
9 (2015).
[8] Patil, Rajvardhan, Sorio Boit, Venkat Gudivada, and Jagadeesh Nandigam. “A Survey of Text
Representation and Embedding Techniques in NLP.” IEEE Access (2023).
[9] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent dirichlet allocation.” Journal of
Machine Learning Research 3, no. Jan (2003): 993-1022.
[10] Řehůřek, Radim, and Petr Sojka. “Gensim—statistical semantics in python.” Retrieved from
gensim.org (2011).
[11] Mabey, Ben. “pyLDAvis documentation.” (2021).
Citation: Akshata Upadhye, Decoding Resumes: The Data-Driven Approach to Talent Acquisition,
International Journal of Data Science Research and Development (IJDSRD), 2(1), 2023, pp. 17–26
Article Link:
https://iaeme.com/MasterAdmin/Journal_uploads/IJDSRD/VOLUME_2_ISSUE_1/IJDSRD_02_01_003.pdf
Abstract:
https://iaeme.com/Home/article_id/IJDSRD_02_01_003
Copyright: © 2023 Authors. This is an open-access article distributed under the terms of the Creative
Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium,
provided the original author and source are credited.
This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).
✉ editor@iaeme.com