International Journal of Data Science Research and Development (IJDSRD)
Volume 2, Issue 1, January-December 2023, pp. 17-26, Article ID: IJDSRD_02_01_003
Available online at https://iaeme.com/Home/issue/IJDSRD?Volume=2&Issue=1
Journal ID: 2811-6201. © IAEME Publication

DECODING RESUMES: THE DATA-DRIVEN APPROACH TO TALENT ACQUISITION

Akshata Upadhye
Cincinnati, Ohio, USA

ABSTRACT

The process of talent acquisition has been profoundly transformed by developments in data science and technology. In this paper, we begin by discussing the importance of resume categorization in identifying qualified candidates. We also discuss the challenges of scrutinizing unstructured resumes and the advantages that machine learning and natural language processing techniques offer in this domain. We then discuss the significance of resume categorization, its role in streamlining the initial screening process, and its contribution to strategic talent pool management. We also explore the unique advantages of Latent Dirichlet Allocation (LDA) topic modeling over classification-based approaches. We train an LDA model and use it to extract the topic distribution within the dataset and the keyword distribution within each topic. Finally, we demonstrate the effectiveness of our approach by visualizing the topics on an inter-topic distance map and highlighting the important keywords for a given topic.

Keywords: Resume Categorization, Talent Acquisition, Data Science, Natural Language Processing, Latent Dirichlet Allocation, Recruitment Technology.

Cite this Article: Akshata Upadhye, Decoding Resumes: The Data-Driven Approach to Talent Acquisition, International Journal of Data Science Research and Development (IJDSRD), 2(1), 2023, pp. 17–26. https://iaeme.com/Home/issue/IJDSRD?Volume=2&Issue=1

1. INTRODUCTION

With a large volume of job applications for each position, manual review of resumes is time-consuming and inefficient.
Resume categorization, often aided by technology and automation, helps streamline the initial screening process. Efficient resume categorization also helps allocate resources effectively: recruiters can focus their time and effort on the candidates who best fit the position, which improves the overall recruitment process. Analyzing resumes helps match the skills, qualifications, and experience of candidates with the specific requirements of the job, so that only qualified candidates are considered for further evaluation. Employers therefore use resume categorization tools to match the skills and experience listed on a candidate's resume with the requirements of the job opening, which helps ensure that candidates meet the basic criteria for the position.

Recruiters often use applicant tracking systems (ATS) to aid the resume shortlisting process. These systems scan resumes for specific keywords related to skills, experience, and qualifications, helping identify suitable candidates more efficiently. Analyzing resumes also generates valuable data and insights about candidate pools. This data can be used for strategic decision-making, such as identifying skill gaps or adjusting recruitment strategies.

Categorization also helps in shortlisting candidates for interviews or further assessments. By categorizing resumes, employers can create a pool of potential candidates who are a good fit for the organization. Resume categorization also aids in building and managing a talent pool: even if a candidate is not selected for a specific position, their information can be stored for future opportunities. In summary, automating resume categorization can help organizations make data-driven decisions that improve their talent acquisition strategies.

2. RELATED WORK

When working with talent data, the resume is one of the key sources of information. It contains the education, skills, and work experience of a candidate, and its content is unstructured. Research and practical applications in Machine Learning and Natural Language Processing have proved useful for dealing with unstructured resume data. Categorizing talent databases has gained popularity in recent years as organizations seek more effective, data-driven ways to manage and recruit talent. In this section we review the recent research and developments used to categorize talent data.

To handle the large number of resumes submitted for a job opening, many researchers have developed techniques to extract skills and categorize resumes. Most of the research in this area uses classification algorithms. In [1], the authors use TF-IDF to represent resume text and classify resumes with Machine Learning algorithms such as Naive Bayes, Random Forest, and SVM. Another approach to categorizing resumes, introduced in [2], is to compute word order similarity between sentences on a large dataset. In [3], the authors tested 9 different classifiers: Support Vector Machine (Linear, SGD, SVC, NuSVC), Naive Bayes (Bernoulli, Multinomial, Gaussian), K-Nearest Neighbor, and Logistic Regression, and obtained promising results from the Support Vector Machine. A further approach, found in [4], classifies resumes with deep learning-based methods such as Convolutional Neural Networks. In [5], the authors classify resumes using Random Forest, Multinomial Naive Bayes, Logistic Regression, and Linear Support Vector Classifier, and then use the classification results to design a content-based recommendation system based on KNN and cosine similarity.
Another interesting approach to categorizing resumes is to use an ensemble of classifiers, as presented in [6], in which the authors use 5 different classifiers (Naive Bayes, Multinomial Naive Bayes, Linear SVC, Bernoulli Naive Bayes, and Logistic Regression) and combine their votes to determine the final category assigned to a resume. Finally, in [7], the authors design a rule-based system that rates resumes based on the keywords extracted from them in order to assign a category.

As this survey of existing approaches shows, most researchers have used the classification approach. Although classification algorithms such as SVM, Naive Bayes, and Random Forest classify resumes with high precision and recall for specific categories when properly trained, they also have limitations:

• Training classification models for resume analysis requires a large amount of labeled data, which can be time-consuming and costly to obtain.
• Classification models are less flexible when it comes to identifying latent topics or discovering hidden patterns within the resumes, since they are designed to work with predefined classes.
• Classification models may struggle to capture the broader context and nuances present in resumes, especially when dealing with unstructured text.
• Imbalanced datasets, where one class greatly outnumbers the others, can affect model performance and bias results.

Therefore, our goal is to build a model that overcomes these limitations of traditional classifiers by discovering hidden patterns and topics within a corpus of resumes, without the need for labeled training data.

3. BACKGROUND

3.1. Bag of Words

The first step in our roadmap for categorizing resumes is preparing the resumes for model training. Resume data tends to be unstructured text, so extracting precise and meaningful information from it can be challenging. To make use of the text within the resumes, our first task is to create a standardized representation for all the resumes in our collection. This representation takes the form of a fixed-length vector and is known as the Bag-of-Words (BoW) model [8]. Using this approach, every resume is represented as a fixed-length vector containing the frequency of each term within that resume. Collectively, this representation is also known as the term frequency matrix. With this standardized representation in place, we can feed it to machine learning models for categorization of resumes.

3.2. Latent Dirichlet Allocation

Once we have preprocessed the resumes, our next step is to train a model that extracts relevant topics and keywords from the resume data. For this task, we use a probabilistic topic modeling algorithm known as Latent Dirichlet Allocation [9], or LDA for short. LDA is a widely recognized approach to document clustering and topic modeling. Its primary function is to identify hidden topics within a collection of documents by estimating two key distributions:

• Probabilistic distribution of words in topics: this distribution reveals which words are most likely to appear within specific topics.
• Probabilistic distribution of topics in documents: this distribution tells us how likely various topics are to be present within each document.

The key advantage of LDA lies in its ability to uncover the underlying semantic structures within documents. In other words, it helps reveal hidden meanings and themes that may not be immediately apparent in the text.

3.3. pyLDAvis

Visualizing the topics and their important keywords helps in deriving insights from the data. LDA topic visualization shows the content, relationships, and structure of the topics within the resumes. We therefore use pyLDAvis [11] to visualize the topics on an Inter-topic Distance Map and gain insight into the hidden topics the LDA model has uncovered within the data. This visualization displays the relationships among different topics: similar topics are placed closer together in the 2D space and dissimilar topics farther apart. Each bubble on the plot represents a topic, and the bubble's size indicates how prevalent that topic is in the document collection. The right side of the inter-topic plot shows the top words, and when a topic is selected it shows a detailed chart of word frequencies for that topic. This kind of visualization is useful for understanding the data and for evaluating the topics.

4. METHODOLOGY

Fig. 1. Data preparation and workflow

In this section we discuss in detail the steps used for processing resumes, extracting the hidden topics within them, and visualizing the topics.

4.1. Phase 1: Text Preprocessing

To use the text data in the resume dataset R, several preprocessing steps are applied to every resume Ri: the text is tokenized, stop words are removed, every token is lemmatized, and frequent n-grams are added to the list of tokens.

4.2. Phase 2: Text Representation

In this step the preprocessed documents are used to generate a fixed-length vector representation using the Bag-of-Words model. This representation, also known as the term frequency matrix, is the input to topic modeling.

4.3. Phase 3: Topic and keyword extraction

Once we have extracted the term frequency vector for every resume, we train an LDA topic model for n topics and extract the keyword distribution for every topic and the topic distribution within every resume, using the LDA implementation in the gensim library [10].

4.4. Phase 4: Visualization

Finally, we visualize the topics and the keyword distribution within the topics in a 2D space using pyLDAvis.

5. DATASETS

For our research, we used a resume dataset from Kaggle containing 2500 resumes from the following classes: HR, designer, information-technology, teacher, advocate, business-development, healthcare, fitness, agriculture, BPO, sales, consultant, digital-media, automobile, chef, finance, apparel, engineering, accountant, construction, public-relations, banking, arts, aviation. The class distribution within the dataset is shown in Fig. 2.

Fig. 2. Class distribution in the data

6. RESULTS AND INTERPRETATION

In this section we examine the performance of the LDA topic model by inspecting the topics on the inter-topic visualization and the top keywords belonging to each of these topics. As discussed earlier, topics that are close to each other are similar, and topics that are far from each other are dissimilar. On the right side, the keywords belonging to the selected topic are displayed along with their probability scores.

In Fig. 3, we have selected topic 1, and based on the top 30 words and their frequencies, we can say that the resumes within this topic belong to candidates with experience in the construction industry.
In Fig. 4, we have selected topic 2, and based on the top 30 words and their frequencies, we can say that the resumes within this topic belong to candidates with experience in the customer success industry. In Fig. 5, we have selected topic 6, and its top 30 words and their frequencies indicate candidates with experience in the sales and marketing industry. In Fig. 6, we have selected topic 7, whose top 30 words and their frequencies indicate candidates with experience in the teaching profession. In Fig. 7, we have selected topic 9, whose top 30 words and their frequencies indicate candidates with experience in the aviation industry.

Fig. 3. Topic 1 and its top 30 keywords

Fig. 4. Topic 2 and its top 30 keywords

Fig. 5. Topic 6 and its top 30 keywords

Fig. 6. Topic 7 and its top 30 keywords

Fig. 7. Topic 9 and its top 30 keywords

The inter-topic distance visualization therefore suggests that the LDA topic model is able to uncover hidden themes within the resumes and to extract the top skills within each topic, which is useful when recruiters have to review a large number of resumes for a particular job. For instance, if we wanted to select a set of suitable candidates for a Customer Service Representative position, we would prioritize reviewing resumes from topic 2, since those resumes have the most suitable keywords for customer service jobs.
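The shortlisting step described above can be sketched in a few lines. The example assumes that each resume's topic distribution is available as a list of `(topic_id, probability)` pairs, which is the shape gensim's `get_document_topics` returns; the helper names `dominant_topic` and `shortlist`, the resume identifiers, the probabilities, and the 0.5 threshold are all hypothetical and chosen only for illustration.

```python
def dominant_topic(topic_dist):
    """Return the (topic_id, probability) pair with the highest probability."""
    return max(topic_dist, key=lambda pair: pair[1])


def shortlist(doc_topics, topic_id, min_prob=0.5):
    """Return resume ids whose dominant topic is `topic_id`, most confident first."""
    hits = []
    for resume_id, dist in doc_topics.items():
        best_topic, best_prob = dominant_topic(dist)
        if best_topic == topic_id and best_prob >= min_prob:
            hits.append((resume_id, best_prob))
    hits.sort(key=lambda item: item[1], reverse=True)
    return [resume_id for resume_id, _ in hits]


# Hypothetical topic distributions for four resumes (made-up numbers).
doc_topics = {
    "resume_001": [(1, 0.10), (2, 0.85), (6, 0.05)],
    "resume_002": [(1, 0.70), (2, 0.20), (6, 0.10)],
    "resume_003": [(2, 0.55), (6, 0.45)],
    "resume_004": [(2, 0.40), (6, 0.35), (7, 0.25)],
}

# Resumes to review first for a customer service opening (topic 2).
customer_service_candidates = shortlist(doc_topics, topic_id=2)
print(customer_service_candidates)  # ['resume_001', 'resume_003']
```

Here `resume_004` is left out because its strongest topic falls below the threshold; lowering `min_prob` widens the pool at the cost of less focused matches. One caveat when applying this to a trained model: the topic numbers shown in the pyLDAvis map may not coincide with the model's internal topic ids, since pyLDAvis can reorder topics by prevalence, so the mapping should be checked before filtering.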
Additionally, overlapping topics may represent resumes whose skills belong to similar skill categories, such as the skills required for marketing and sales. Beyond the individual topics, the overall visualization also gives a better idea of how talent and skills are distributed in our data, which can help HR teams develop specific recruitment strategies. We have reviewed only a few topics in this paper, since it is not possible to include images for all of the 25 topics.

7. CONCLUSION

In this paper we have discussed the importance of automating resume categorization for the initial screening process and surveyed the existing techniques most commonly used for resume categorization. We have explored the unique strengths of Latent Dirichlet Allocation (LDA) in uncovering hidden semantic structures and found it to be a useful technique for sorting resumes into relevant groups. We have also presented the results of our approach using visualizations and the top keywords of selected topics. In doing so, we have shown how probabilistic topic modeling can identify hidden insights within resume data, and how it can be leveraged both to pick a group of suitable candidates and to get an overall picture of all the resumes within the database.

REFERENCES

[1] Pal, Riya, Shahrukh Shaikh, Swaraj Satpute, and Sumedha Bhagwat. “Resume classification using various machine learning algorithms.” In ITM Web of Conferences, vol. 44, p. 03011. EDP Sciences, 2022.
[2] Shaikh, Razkeen, Nikita Phulkar, Harsha Bhute, Sana Kauser Shaikh, and Prajakta Bhapkar. “An intelligent framework for e-recruitment system based on text categorization and semantic analysis.” In 2021 Third International Conference on Inventive Research in Computing Applications (ICIRCA), pp. 1076-1080. IEEE, 2021.
[3] Ali, Irfan, Nimra Mughal, Zahid Hussain Khand, Javed Ahmed, and Ghulam Mujtaba. “Resume classification system using natural language processing and machine learning techniques.” Mehran University Research Journal of Engineering and Technology 41, no. 1 (2022): 65-79.
[4] Nasser, Shabna, C. Sreejith, and M. Irshad. “Convolutional neural network with word embedding based approach for resume classification.” In 2018 International Conference on Emerging Trends and Innovations in Engineering and Technological Research (ICETIETR), pp. 1-6. IEEE, 2018.
[5] Roy, Pradeep Kumar, Sarabjeet Singh Chowdhary, and Rocky Bhatia. “A Machine Learning approach for automation of Resume Recommendation system.” Procedia Computer Science 167 (2020): 2318-2327.
[6] Gopalakrishna, Suhas Tangadle, and Vijayaraghavan Vijayaraghavan. “Automated tool for Resume classification using Sementic analysis.” International Journal of Artificial Intelligence and Applications (IJAIA) 10, no. 1 (2019).
[7] Chandola, Divyanshu, Aditya Garg, Ankit Maurya, and Amit Kushwaha. “Online resume parsing system using text analytics.” Journal of Multi-Disciplinary Engineering Technologies 9 (2015).
[8] Patil, Rajvardhan, Sorio Boit, Venkat Gudivada, and Jagadeesh Nandigam. “A Survey of Text Representation and Embedding Techniques in NLP.” IEEE Access (2023).
[9] Blei, David M., Andrew Y. Ng, and Michael I. Jordan. “Latent dirichlet allocation.” Journal of Machine Learning Research 3, no. Jan (2003): 993-1022.
[10] Řehůřek, Radim, and Petr Sojka. “Gensim—statistical semantics in python.” Retrieved from genism.org (2011).
[11] Mabey, Ben. “pyLDAvis documentation.” (2021).

Citation: Akshata Upadhye, Decoding Resumes: The Data-Driven Approach to Talent Acquisition, International Journal of Data Science Research and Development (IJDSRD), 2(1), 2023, pp. 17–26

Article Link: https://iaeme.com/MasterAdmin/Journal_uploads/IJDSRD/VOLUME_2_ISSUE_1/IJDSRD_02_01_003.pdf

Abstract: https://iaeme.com/Home/article_id/IJDSRD_02_01_003

Copyright: © 2023 Authors. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

This work is licensed under a Creative Commons Attribution 4.0 International License (CC BY 4.0).

✉ editor@iaeme.com
https://iaeme.com/Home/journal/IJDSRD