Context Matters: Exploring Document Classification Challenges Using LLMs
Jerónimo Villegas
Guane Enterprises
jeronimo.villegas@guane.com.co
Abstract
This study investigates the application of large language models in the domain of document
classification, focusing on labor capacity loss assessment document classification. Drawing inspiration
from a prior work, we conducted a replication study employing Azure’s GPT-3.5-turbo and Google’s
Chat-Bison. Our aim was to understand the impact of positive and negative demonstrations on the
accuracy of language models in the nuanced task of categorizing relevant and irrelevant information
within documents.
Our methodology involved querying the language models with varying numbers of positive and
negative demonstrations, utilizing a dataset of annotated documents related to labor capacity loss.
Surprisingly, our results diverged from conventional expectations. In the case of Azure’s GPT with
positive demonstrations, we observed a non-linear relationship between the number of demonstrations
and classification accuracy. Contrary to prior findings, accuracy initially improved but plateaued,
and in some instances, decreased with an increased number of positive demonstrations. Conversely,
our study confirmed the positive correlation between the number of negative demonstrations and
classification accuracy. With Google's Chat-Bison and positive demonstrations, we likewise observed a non-linear relationship between the number of demonstrations and classification accuracy, but in this case accuracy first deteriorated and then improved. For negative demonstrations, the best accuracy occurred at two specific demonstration counts.
This work contributes to the evolving discourse on language model behavior, emphasizing the
nuanced dynamics between the nature of the task and the efficacy of different language models.
The unexpected patterns uncovered in this study prompt a reconsideration of assumptions derived
from previous research, highlighting the task-specific intricacies that influence language model performance. As document classification tasks continue to gain prominence, our findings underscore the
importance of tailored approaches and context-specific investigations in leveraging language models
for diverse domains.
Keywords: LLM, Classification, Accuracy, Precision, Demonstration
1 Introduction
In recent years, the rapid advancements in natural language processing have propelled language models
to the forefront of various cognitive tasks. Among these tasks, document classification stands out as a
critical area with applications ranging from information retrieval to nuanced analysis. This study delves
into the intricate landscape of document classification, focusing specifically on the assessment of labor
capacity loss.
Inspired by the work of [1], which explored the performance of large language models in diverse
benchmarks, our investigation aims to extend these insights to the domain of labor-related documents.
Labor capacity loss, a multifaceted issue with far-reaching implications, demands a meticulous approach
to document categorization. As organizations grapple with an ever-expanding volume of textual data,
the need for effective automated classification tools becomes increasingly apparent.
Our choice of language models, Azure’s GPT-3.5-turbo and Google’s Chat-Bison, reflects the diversity of tools available for document analysis. These models, renowned for their natural language
understanding capabilities, serve as the bedrock for our exploration into the nuances of labor-related
content.
The central question guiding our research is the impact of positive and negative demonstrations on
the accuracy of language models in the intricate task of categorizing relevant and irrelevant information
within documents. As we navigate through this investigation, we will unravel the complexities inherent in
document classification, addressing biases, and contextual challenges that may influence the performance
of language models.
The remainder of this paper is organized as follows: Section 2 outlines the methodology adopted
for our experiments, detailing the dataset, language models, and experimental procedures. Section
3 presents the results, offering a comprehensive analysis of the observed patterns. In Section 4, we
engage in a discussion that interprets our findings, considering the implications for language model
applications in document classification. Through this exploration, we contribute insights that advance
our understanding of language model behavior in the context of labor capacity loss assessment.
2 Methodology
2.1 Data
Our dataset comprises a collection of 172 documents related to labor capacity loss, from which five were
chosen for the appended graphs. Each document has been meticulously annotated, assigning relevancy
labels to individual pages. The task at hand involves the fine-grained classification of pages as either
containing relevant or irrelevant information concerning labor capacity loss.
2.2 Language models
For our experiments, we employed two prominent language models: Azure’s GPT-3.5-turbo and Google’s
Chat-Bison. These models were selected for their robust natural language processing capabilities and
affordable cost, making them well-suited for the nuanced task of document classification.
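The paper does not give the exact invocation details for either model. As a rough, minimal sketch (not the authors' code), the two models could be queried through their hosted SDKs roughly as follows, assuming the openai (v1.x) and google-cloud-aiplatform Python packages; the endpoint, key, API version, deployment, and project values are placeholders, not values from the study.

from openai import AzureOpenAI                  # Azure-hosted GPT-3.5-turbo
import vertexai
from vertexai.language_models import ChatModel  # Google's Chat-Bison on Vertex AI

# Azure OpenAI client (placeholder endpoint, key, API version, and deployment name)
azure_client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com/",
    api_key="<AZURE_OPENAI_KEY>",
    api_version="2023-05-15",
)

def ask_gpt35(prompt: str) -> str:
    # Single deterministic chat completion (temperature 0) against the deployment
    response = azure_client.chat.completions.create(
        model="gpt-35-turbo",  # Azure deployment name, assumed for illustration
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content

# Vertex AI Chat-Bison client (placeholder project and region)
vertexai.init(project="<gcp-project>", location="us-central1")
bison = ChatModel.from_pretrained("chat-bison")

def ask_bison(prompt: str) -> str:
    # Fresh single-turn chat per query, also deterministic
    chat = bison.start_chat()
    return chat.send_message(prompt, temperature=0).text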
2.3 Experimental setup
For each test query, we categorize demos into “positive demos” leading to the correct answer, and
“negative demos” resulting in wrong answers. We used a pool of 12 demos, of which some were positive and some were negative. The whole experiment was conducted on 5 documents, amounting to 32 different pages, most of which were not relevant, as they did not contain the information we needed.
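To make this categorization concrete, a minimal sketch follows (the prompt format and helper names are our illustration, not the paper's): a demo counts as positive for a given test page when a one-shot query containing that demo yields the page's correct label, and as negative otherwise.

from typing import Callable, List, Tuple

Demo = Tuple[str, str]  # (demonstration page text, its gold label)

def build_prompt(page_text: str, demos: List[Demo]) -> str:
    # Few-shot prompt: labelled demonstration pages first, then the page to classify
    parts = [f"Page:\n{text}\nLabel: {label}" for text, label in demos]
    parts.append(f"Page:\n{page_text}\nLabel:")
    return "\n\n".join(parts)

def split_demos(
    test_page: str,
    gold_label: str,
    demo_pool: List[Demo],
    query_fn: Callable[[str], str],  # e.g. ask_gpt35 or ask_bison sketched above
) -> Tuple[List[Demo], List[Demo]]:
    # Partition the 12-demo pool into positive and negative demos for one test page
    positive, negative = [], []
    for demo in demo_pool:
        answer = query_fn(build_prompt(test_page, [demo])).strip().lower()
        (positive if answer == gold_label.lower() else negative).append(demo)
    return positive, negative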
Positive Demonstrations:
To assess the impact of positive demonstrations, we adopted a nuanced approach inspired by [1].
Pages were considered for positive demonstrations if they yielded six or more correct answers across a series of one-shot queries, each using a different example from the pool. Each page was then queried
with varying numbers of positive demonstrations, ranging from one to six, capturing the model’s ability
to generalize with increasing exposure to correct instances.
Negative Demonstrations:
Similarly, for negative demonstrations, pages were selected if they resulted in six or more incorrect
answers during the initial queries. This stringent criterion aimed to focus on pages where the models
demonstrated difficulty in correctly classifying irrelevant information. The pages selected for negative
demonstrations were also subjected to queries with varying numbers of examples, following the same one
to six demonstration increments.
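Both screening criteria and the one-to-six demonstration sweep can be sketched with the hypothetical helpers above; the six-answer threshold mirrors the criterion just described, while the function and variable names are ours.

from typing import Callable, Dict, List, Tuple

def screen_pages(
    pages: List[Tuple[str, str]],  # (page text, gold label) for the 32 annotated pages
    demo_pool: List[Demo],         # the pool of 12 demonstrations
    query_fn: Callable[[str], str],
    threshold: int = 6,
) -> Tuple[List[Tuple[str, str]], List[Tuple[str, str]]]:
    # One one-shot query per demo in the pool: pages with >= threshold correct answers
    # become candidates for the positive-demo runs, pages with >= threshold incorrect
    # answers become candidates for the negative-demo runs.
    positive_pages, negative_pages = [], []
    for text, gold in pages:
        correct = sum(
            query_fn(build_prompt(text, [demo])).strip().lower() == gold.lower()
            for demo in demo_pool
        )
        if correct >= threshold:
            positive_pages.append((text, gold))
        if len(demo_pool) - correct >= threshold:
            negative_pages.append((text, gold))
    return positive_pages, negative_pages

def sweep_demonstrations(
    page: str, gold: str, demos: List[Demo],
    query_fn: Callable[[str], str], max_k: int = 6,
) -> Dict[int, bool]:
    # Query the page with 1..max_k demonstrations; record whether each answer is correct
    return {
        k: query_fn(build_prompt(page, demos[:k])).strip().lower() == gold.lower()
        for k in range(1, max_k + 1)
    }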
2.4 Metrics
For each combination of positive and negative demonstrations, we recorded both accuracy and precision
metrics. Accuracy represents the overall correctness of the model’s classification, while precision focuses
on the ratio of correctly identified relevant or irrelevant pages among the total identified by the model.
This methodology was applied consistently across the chosen documents, resulting in a comprehensive analysis of how language models respond to varying degrees of exposure to positive and negative
demonstrations in the context of labor capacity loss document classification.
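In code terms, the two metrics reduce to the following; the labels in the closing example are illustrative, not results from the study.

from typing import List

def accuracy(gold: List[str], pred: List[str]) -> float:
    # Fraction of pages whose predicted label matches the annotation
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def precision(gold: List[str], pred: List[str], positive_label: str = "relevant") -> float:
    # Among pages the model labelled `positive_label`, the fraction that truly carry it
    flagged = [g for g, p in zip(gold, pred) if p == positive_label]
    return sum(g == positive_label for g in flagged) / len(flagged) if flagged else 0.0

# Illustrative example with five hypothetical pages
gold = ["relevant", "irrelevant", "irrelevant", "relevant", "irrelevant"]
pred = ["relevant", "irrelevant", "relevant", "relevant", "irrelevant"]
print(accuracy(gold, pred))               # 0.8
print(precision(gold, pred, "relevant"))  # 2 of 3 pages flagged as relevant are correct, about 0.67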
3 Results
Our experimental findings shed light on the nuanced dynamics between positive and negative demonstrations in the context of labor capacity loss document classification. We present the results in terms
of accuracy and precision metrics, capturing the performance of Azure’s GPT-3.5-turbo and Google’s
Chat-Bison across varying numbers of demonstrations.
3.1 Azure
[Figures: Accuracy - Azure; Precision - Azure]
3.2 Bard
[Figures: Accuracy - Bard; Precision - Bard]
The observed patterns in both positive and negative demonstrations prompt a reevaluation of the
assumed relationship between the number of demonstrations and model accuracy. Further insights into
the qualitative aspects of the model’s behavior will be discussed in the following section.
4 Discussion
The results of our experiments offer valuable insights into the complexities of leveraging language models, specifically Azure’s GPT-3.5-turbo and Google’s Chat-Bison (referred to as Bard), for document
classification in the context of labor capacity loss. The following discussion delves into key observations,
implications, and areas for further investigation.
4.1 Positive Demonstrations
Contrary to expectations and findings by [1], our experiments revealed a non-linear relationship between
the number of positive demonstrations and classification accuracy. While initial exposure to positive
demonstrations led to improved accuracy, further increases resulted in a plateau and, in some instances,
a decline. This suggests a nuanced interplay between the nature of queries, the complexity of the
classification task, and the adaptability of the language model.
The positive correlation observed from 3 to 6 demos in Bard aligns more closely with the conventional understanding, although not with the results reported by [1]. As the number of positive demonstrations
increased, so did the accuracy, demonstrating the model’s capacity to generalize from positive examples.
In contrast, the experiments with negative demonstrations exhibited a positive correlation with accuracy, consistent with previous findings. This suggests that increasing exposure to negative examples
enhances the model’s ability to correctly classify irrelevant information.
Contrary to Azure’s GPT-3.5-turbo, Bard demonstrated a deterioration in accuracy with an increased
number of negative demonstrations. This reaffirms the complexity involved in using large language models for nuanced document categorization and their high dependence on the nature of the task.
4.2 Biases and Challenges
The unexpected results in positive demonstrations highlight potential biases in our dataset or the inherent
nature of the labor capacity loss documents. The bias towards declining accuracy with more positive
demonstrations may indicate the challenges in generalizing from a limited set of positive examples in the
case of Azure’s GPT-3.5-turbo.
Additionally, the ease of classifying pages with relevant information over those without may point to
biases in the nature of queries or the language used in labor-related documents.
4.3 Model-specific Considerations
The divergent behaviors of Azure’s GPT-3.5-turbo and Google’s Chat-Bison emphasize the model-specific
nuances in language understanding. Each model’s response to different types and quantities of demonstrations reflects the need for careful consideration when selecting a language model for specific tasks.
4.4 Limitations and Future Work
Our study is not without limitations. The dataset, while annotated for relevancy, may still contain
inherent biases. Additionally, the choice of language models may influence the observed patterns. Future
work should explore a wider range of benchmarks, datasets, and language models to validate and extend
our findings.
In conclusion, our investigation contributes to the ongoing dialogue on the application of language
models in document classification, offering nuanced insights that challenge conventional assumptions.
The task-specific behaviors observed underscore the need for careful consideration of both positive and
negative examples in training language models for real-world applications.
References
[1] Jiuhai Chen, Lichang Chen, Chen Zhu, and Tianyi Zhou. How many demonstrations do you need
for in-context learning? 2023. arXiv:2303.08119.