Context Matters: Exploring Document Classification Challenges Using LLMs

Jerónimo Villegas
Guane Enterprises
jeronimo.villegas@guane.com.co

Abstract

This study investigates the application of large language models to document classification, focusing on documents used in labor capacity loss assessment. Drawing inspiration from prior work, we conducted a replication study employing Azure’s GPT-3.5-turbo and Google’s Chat-Bison. Our aim was to understand the impact of positive and negative demonstrations on the accuracy of language models in the nuanced task of categorizing relevant and irrelevant information within documents. Our methodology involved querying the language models with varying numbers of positive and negative demonstrations, using a dataset of annotated documents related to labor capacity loss. Surprisingly, our results diverged from conventional expectations. With Azure’s GPT and positive demonstrations, we observed a non-linear relationship between the number of demonstrations and classification accuracy: contrary to prior findings, accuracy initially improved but then plateaued and, in some instances, decreased as the number of positive demonstrations grew. Conversely, our study confirmed the positive correlation between the number of negative demonstrations and classification accuracy. With Google’s Chat-Bison and positive demonstrations, we again observed a non-linear relationship between the number of demonstrations and classification accuracy, but this time accuracy first deteriorated and then improved. For negative demonstrations, the best accuracy occurred at two specific demonstration counts. This work contributes to the evolving discourse on language model behavior, emphasizing the nuanced dynamics between the nature of the task and the efficacy of different language models. The unexpected patterns uncovered in this study prompt a reconsideration of assumptions derived from previous research, highlighting the task-specific intricacies that influence language model performance. As document classification tasks continue to gain prominence, our findings underscore the importance of tailored approaches and context-specific investigations when leveraging language models for diverse domains.

Keywords: LLM, Classification, Accuracy, Precision, Demonstration

1 Introduction

In recent years, rapid advances in natural language processing have propelled language models to the forefront of various cognitive tasks. Among these tasks, document classification stands out as a critical area, with applications ranging from information retrieval to nuanced analysis. This study delves into the intricate landscape of document classification, focusing specifically on the assessment of labor capacity loss. Inspired by the work of [1], which explored the performance of large language models on diverse benchmarks, our investigation aims to extend these insights to the domain of labor-related documents. Labor capacity loss, a multifaceted issue with far-reaching implications, demands a meticulous approach to document categorization. As organizations grapple with an ever-expanding volume of textual data, the need for effective automated classification tools becomes increasingly apparent. Our choice of language models, Azure’s GPT-3.5-turbo and Google’s Chat-Bison, reflects the diversity of tools available for document analysis.
These models, renowned for their natural language understanding capabilities, serve as the bedrock for our exploration into the nuances of labor-related content. The central question guiding our research is the impact of positive and negative demonstrations on the accuracy of language models in the intricate task of categorizing relevant and irrelevant information within documents. As we navigate through this investigation, we will unravel the complexities inherent in document classification, addressing biases and contextual challenges that may influence the performance of language models. The remainder of this paper is organized as follows: Section 2 outlines the methodology adopted for our experiments, detailing the dataset, language models, and experimental procedures. Section 3 presents the results, offering a comprehensive analysis of the observed patterns. In Section 4, we engage in a discussion that interprets our findings, considering the implications for language model applications in document classification. Through this exploration, we contribute insights that advance our understanding of language model behavior in the context of labor capacity loss assessment.

2 Methodology

2.1 Data

Our dataset comprises a collection of 172 documents related to labor capacity loss, from which five were chosen for the graphs presented here. Each document has been meticulously annotated, assigning relevancy labels to individual pages. The task at hand involves the fine-grained classification of pages as containing either relevant or irrelevant information concerning labor capacity loss.

2.2 Language models

For our experiments, we employed two prominent language models: Azure’s GPT-3.5-turbo and Google’s Chat-Bison. These models were selected for their robust natural language processing capabilities and affordable cost, making them well suited to the nuanced task of document classification.

2.3 Experimental setup

For each test query, we categorize demonstrations into “positive demos”, which lead to the correct answer, and “negative demos”, which result in wrong answers. We used a pool of 12 demos, of which some were positive and some were negative. The whole experiment was conducted on 5 documents, amounting to 32 different pages, most of which were not relevant because they did not contain the information we needed.

Positive Demonstrations: To assess the impact of positive demonstrations, we adopted a nuanced approach inspired by [1]. Pages were considered for positive demonstrations if they yielded six or more correct answers across a series of single queries, each using a different example from the pool. Each page was then queried with varying numbers of positive demonstrations, ranging from one to six, capturing the model’s ability to generalize with increasing exposure to correct instances.

Negative Demonstrations: Similarly, for negative demonstrations, pages were selected if they resulted in six or more incorrect answers during the initial queries. This stringent criterion aimed to focus on pages where the models demonstrated difficulty in correctly classifying irrelevant information. The pages selected for negative demonstrations were also subjected to queries with varying numbers of examples, following the same one-to-six demonstration increments.

2.4 Metrics

For each combination of positive and negative demonstrations, we recorded both accuracy and precision metrics.
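To make this procedure concrete, the following minimal Python sketch illustrates the evaluation loop described in Sections 2.3 and 2.4. It is an illustrative reconstruction only: the function names (build_prompt, evaluate), the prompt wording, the ask_model callable, and the data layout are assumptions introduced for exposition rather than the exact code used in our experiments, and precision is computed here with respect to the relevant class.

# Hypothetical sketch of the evaluation loop from Sections 2.3 and 2.4.
# All names (build_prompt, evaluate, ask_model) and the prompt wording are
# illustrative assumptions, not the code actually used in the experiments.
import random

RELEVANT, IRRELEVANT = "RELEVANT", "IRRELEVANT"

def build_prompt(page_text, demos):
    """Assemble a prompt from k demonstration (page, label) pairs plus the target page."""
    parts = ["Classify each page as RELEVANT or IRRELEVANT to labor capacity loss.\n"]
    for demo_text, demo_label in demos:
        parts.append(f"Page: {demo_text}\nAnswer: {demo_label}\n")
    parts.append(f"Page: {page_text}\nAnswer:")
    return "\n".join(parts)

def evaluate(pages, demo_pool, ask_model, max_demos=6, seed=0):
    """For k = 1..max_demos demonstrations, record accuracy and precision over all pages.

    pages:      list of (page_text, gold_label) pairs annotated at the page level
    demo_pool:  list of (page_text, label) pairs used as demonstrations
    ask_model:  callable mapping a prompt string to RELEVANT or IRRELEVANT,
                e.g. a thin wrapper around GPT-3.5-turbo or Chat-Bison
    """
    rng = random.Random(seed)
    results = {}
    for k in range(1, max_demos + 1):
        correct = tp = fp = 0
        for page_text, gold in pages:
            demos = rng.sample(demo_pool, k)          # k demonstrations per query
            pred = ask_model(build_prompt(page_text, demos))
            correct += int(pred == gold)
            if pred == RELEVANT:                      # precision w.r.t. the relevant class
                tp += int(gold == RELEVANT)
                fp += int(gold != RELEVANT)
        results[k] = {
            "accuracy": correct / len(pages),
            "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        }
    return results

# Example with a trivial stand-in model that always answers IRRELEVANT:
# metrics = evaluate(pages, demo_pool, ask_model=lambda prompt: IRRELEVANT)

In this sketch, a fresh set of k demonstrations is sampled for each page, mirroring the one-to-six demonstration increments described above; an actual run would replace the ask_model stand-in with a wrapper around one of the two models.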
Accuracy represents the overall correctness of the model’s classifications, while precision focuses on the proportion of pages identified as relevant or irrelevant by the model that were identified correctly. This methodology was applied consistently across the chosen documents, resulting in a comprehensive analysis of how language models respond to varying degrees of exposure to positive and negative demonstrations in the context of labor capacity loss document classification.

3 Results

Our experimental findings shed light on the nuanced dynamics between positive and negative demonstrations in the context of labor capacity loss document classification. We present the results in terms of accuracy and precision metrics, capturing the performance of Azure’s GPT-3.5-turbo and Google’s Chat-Bison across varying numbers of demonstrations.

3.1 Azure

[Figures: Accuracy - Azure; Precision - Azure]

3.2 Bard

[Figures: Accuracy - Bard; Precision - Bard]

The observed patterns in both positive and negative demonstrations prompt a reevaluation of the assumed relationship between the number of demonstrations and model accuracy. Further insights into the qualitative aspects of the models’ behavior are discussed in the following section.

4 Discussion

The results of our experiments offer valuable insights into the complexities of leveraging language models, specifically Azure’s GPT-3.5-turbo and Google’s Chat-Bison (referred to as Bard), for document classification in the context of labor capacity loss. The following discussion delves into key observations, implications, and areas for further investigation.

4.1 Positive Demonstrations

Contrary to expectations and to the findings of [1], our experiments revealed a non-linear relationship between the number of positive demonstrations and classification accuracy. While initial exposure to positive demonstrations led to improved accuracy, further increases resulted in a plateau and, in some instances, a decline. This suggests a nuanced interplay between the nature of the queries, the complexity of the classification task, and the adaptability of the language model. The positive correlation observed from 3 to 6 demos in Bard aligns more closely with the conventional understanding, although not with the results reported by [1]. As the number of positive demonstrations increased, so did the accuracy, demonstrating the model’s capacity to generalize from positive examples.

In contrast, the experiments with negative demonstrations on Azure’s GPT-3.5-turbo exhibited a positive correlation with accuracy, consistent with previous findings. This suggests that increasing exposure to negative examples enhances the model’s ability to correctly classify irrelevant information. Unlike Azure’s GPT-3.5-turbo, Bard demonstrated a deterioration in accuracy as the number of negative demonstrations increased. This reaffirms the complexity of using large language models for nuanced document categorization and their high dependence on the nature of the task.

4.2 Biases and Challenges

The unexpected results with positive demonstrations highlight potential biases in our dataset or in the inherent nature of labor capacity loss documents. The tendency towards declining accuracy with more positive demonstrations may indicate the challenges of generalizing from a limited set of positive examples in the case of Azure’s GPT-3.5-turbo.
Additionally, the greater ease of classifying pages with relevant information compared to those without may point to biases in the nature of the queries or in the language used in labor-related documents.

4.3 Model-specific Considerations

The divergent behaviors of Azure’s GPT-3.5-turbo and Google’s Chat-Bison emphasize the model-specific nuances in language understanding. Each model’s response to different types and quantities of demonstrations reflects the need for careful consideration when selecting a language model for a specific task.

4.4 Limitations and Future Work

Our study is not without limitations. The dataset, while annotated for relevancy, may still contain inherent biases. Additionally, the choice of language models may influence the observed patterns. Future work should explore a wider range of benchmarks, datasets, and language models to validate and extend our findings.

In conclusion, our investigation contributes to the ongoing dialogue on the application of language models to document classification, offering nuanced insights that challenge conventional assumptions. The task-specific behaviors observed underscore the need for careful consideration of both positive and negative examples when prompting language models for real-world applications.

References

[1] Jiuhai Chen, Lichang Chen, Chen Zhu, and Tianyi Zhou. How many demonstrations do you need for in-context learning? arXiv:2303.08119, 2023.