Wireless Personal Communications (2023) 129:2213–2237
https://doi.org/10.1007/s11277-023-10235-4

Sentiment Analysis and Sarcasm Detection using Deep Multi‑Task Learning

Yik Yang Tan1 · Chee‑Onn Chow1 · YongLiang Lim1 · Jeevan Kanesan1 · Joon Huang Chuah1

Accepted: 20 February 2023 / Published online: 4 March 2023
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2023

Abstract
Social media platforms such as Twitter and Facebook have become popular channels for people to record and express their feelings, opinions, and feedback over the last decades. With proper extraction techniques such as sentiment analysis, this information is useful in many areas, including product marketing, behavior analysis, and pandemic management. Sentiment analysis is a technique to analyze people’s thoughts, feelings, and emotions, and to categorize them as positive, negative, or neutral. There are many ways for someone to express their feelings and emotions. These sentiments are sometimes accompanied by sarcasm, especially when conveying intense emotion. Sarcasm is defined as a positive sentence with underlying negative intention. Most current research treats sentiment analysis and sarcasm detection as two distinct tasks, each handled as a standalone text categorization problem. In recent years, research using deep learning algorithms has significantly improved the performance of these standalone classifiers. One of the major issues faced by these approaches is that they cannot correctly classify sarcastic sentences as negative. With this in mind, we claim that knowing how to spot sarcasm will help sentiment classification and vice versa. Our work has shown that these two tasks are correlated. This paper proposes a multi-task learning-based framework that uses a deep neural network to model this correlation and thereby improve the overall performance of sentiment analysis.
The proposed method outperforms the existing methods by a margin of 3%, with an F1-score of 94%.

Keywords Sentiment analysis · Sarcasm detection · Deep learning algorithm · Multi-task learning

* Chee‑Onn Chow cochow@um.edu.my
1 Department of Electrical Engineering, Faculty of Engineering, Universiti Malaya, 50603 Kuala Lumpur, Malaysia

1 Introduction

According to Statista, there are 5.03 billion active internet users worldwide, about 63% of the global population, as of October 2022. Of this total, 93% are social media users [1]. Social media, such as Twitter, Reddit, and Facebook, has become an integral part of our lives. We share almost everything on social media, from social and business events to personal opinions and emotions. In addition, social media is a popular and reliable platform for obtaining information in almost real time. People have strong trust in the information received from and shared by other users on social media. In other words, people can inform and influence one another via social media platforms. This has significant social, political, and economic impacts on society.

Nowadays, almost all businesses use social media to engage directly with their consumers, to understand their needs, and to advertise their products or services. Consumers have complete control of what they want to see and how they want to respond. A single product review may affect consumer behavior and decision-making. As a result, a company’s successes and failures are made public and spread quickly and widely through social media platforms. For example, a survey by Podium claimed that 93% of internet users are influenced by customer reviews in their purchases and decisions [2]. A company that keeps pace with its customers’ thinking can therefore respond promptly and devise a successful strategy to compete with its rivals.
Another impact of social media was observed during the COVID-19 pandemic, which emerged in December 2019 [3] and had infected more than 619 million people and caused more than 6.55 million deaths as of October 2022, with the numbers still increasing. This has created great fear of infection and tremendous stress in people’s daily lives. According to the American Psychological Association, US adults recorded the highest stress level since the early days of the COVID-19 pandemic, and 80% of it is due to the prolonged stress caused by COVID-19 [4]. Social media has become one of the quickest ways for individuals to express themselves, and this causes newsfeeds to be flooded with information representing their thoughts. Analyzing these newsfeeds is a direct way to capture their emotions and sentiment [5].

Sentiment analysis (also known as opinion mining) is the process of identifying, extracting, and classifying subjective information from unstructured text using text analysis and computational linguistic techniques in Natural Language Processing (NLP) [6]. It aims to determine the polarity of sentences using word clues extracted from the sentence’s meaning [7, 8]. As a result, sentiment analysis is an essential technique for extracting valuable information from unstructured data sources, including tweets and reviews. It is widely used in extracting opinions from online sources, such as product reviews on the Web [9]. Several other fields have since been targeted using sentiment analysis, such as stock market forecasts [10] and responses to terrorist attacks [11]. In addition, research that overlaps sentiment analysis and the production of natural language has discussed several concerns related to the applicability of sentiment analysis, such as multi-lingual support [12] and irony detection [13].
Recently, sentiment analysis has played a crucial role in understanding people’s feelings during the COVID-19 pandemic. It helps governments understand people’s concerns about COVID-19 and take appropriate measures accordingly [14].

Despite the potential benefits, automated analysis has limitations, including the complexity of its implementation due to the uncertainty of natural language and the characteristics of the posted content. Tweets are an example: they are often accompanied by hashtags, emoticons, and links, which makes identifying the expressed sentiment difficult. Furthermore, automated techniques need large datasets of annotated posts or lexical databases of emotional terms associated with sentiment values. Unlike humans, machines have difficulty with the subjectivity of a text, such as a sarcastic context [15]. People frequently use positive words to express negative feelings in sarcastic texts. Sarcasm can therefore easily trick sentiment analysis models unless they are explicitly designed to take it into account. In addition, the variation in the terms used in sarcastic sentences makes it difficult to train a sentiment analysis model.

Because misclassified sarcastic texts can flip the polarity of a sentence, the primary aim of this paper is to study the effect of sarcasm detection on sentiment analysis, in order to improve existing sentiment analysis models for better accuracy and more intelligent information extraction.

The contribution of this paper is twofold. First, we design a general framework that tackles sentiment analysis and sarcasm detection for more intelligent and accurate information extraction. Second, we propose deep multi-task learning to simultaneously train two models, for sentiment analysis and sarcasm detection respectively, to reduce model complexity and increase model efficiency.
This paper is divided into five sections. A comprehensive literature review of the related work is presented in Sect. 2. Section 3 provides a detailed explanation of the proposed multi-task learning framework for sentiment analysis and sarcasm detection. Performance evaluation is presented in Sect. 4. Section 5 gives the conclusions and possible future work.

2 Related Work

The roots of sentiment analysis on written documents can be traced back to World War II, when the focus was mainly political in nature. It has been an active research focus since the mid-2000s, utilizing Natural Language Processing (NLP) to mine subjective information from various contents on the Internet. Different methods have been proposed to train sentiment analysis models, ranging from traditional machine learning algorithms to deep learning algorithms.

2.1 Sentiment Analysis Using Machine Learning

Up to the 1980s, most natural language processing algorithms were based on complex sets of hand-written rules. Since then, the introduction of machine learning algorithms has revolutionized natural language processing. Some early work classified sentiment into positive and negative categories, such as [7], in which three machine learning algorithms were used for sentiment classification: Support Vector Machine (SVM), Naïve Bayes classifier, and Maximum Entropy. The classification was carried out using n-gram features, namely unigrams, bigrams, and the combination of both. They also used the bag-of-words (BoW) representation as input to the machine learning algorithms. The promising performance shown in their studies indicated great potential. The syntactic relation between words has been used in [16] to analyze document-level sentiment.
In that work, sub-sequences of frequent terms and dependency sub-trees were derived from sentences to serve as features for the SVM algorithm. Unigrams, bigrams, word sub-sequences, and dependencies were also extracted from each sentence in the dataset. Another similar work uses a mix of unsupervised and supervised techniques to learn word vectors that capture both semantic term (document) information and rich sentiment content [17]. In [18], a mechanism that embeds higher-order n-gram phrases into a low-dimensional semantic latent space was proposed to define a sentiment classification function. The authors also used an SVM to construct a discriminative system that estimates the latent space parameters with a bias towards the classification task. This method can perform both binary classification and multi-score sentiment classification, which involves prediction within a set of sentiment scores.

A sentiment classification method using an entropy weighted genetic algorithm (EWGA) and SVM has been proposed in [19]. Different sets of features consisting of syntactic and stylistic characteristics were evaluated. The stylistic features reflect measures of word length distribution, vocabulary richness, and frequency of special characters. Weights are allocated to different sentiment attributes before the genetic algorithm is used to optimize the sentiment classification. SVM with ten-fold cross-validation was used to validate the model, and promising results were obtained.

2.2 Sentiment Analysis using Deep Learning

In recent years, deep learning has received an overwhelming reception because it does not require traditional, task-specific feature engineering, which makes it a more powerful alternative for sentiment analysis. In [20], an architecture using a deep neural network to determine the similarity of documents was proposed.
This architecture was trained to generate vectors for articles using multiple market news items obtained from T&C. The cosine similarity was then calculated among the labeled documents, considering the polarity of the documents while neglecting their contents. The proposed method achieved an outstanding performance in terms of similarity estimates of the articles.

The authors in [21] suggested a sequence-modeling-based neural network for sentiment analysis at the document level, focusing on customer reviews of a temporal nature. Their approach trained a recurrent neural network with gated recurrent units (RNN-GRU) to learn distributed product and user representations. These representations were then fed into a machine learning classifier for sentiment classification. The method was evaluated on three datasets obtained from Yelp and IMDb. Each review was tagged according to its rating score, and the network was trained using back-propagation with the Adam stochastic optimizer. Simulation results showed that sequence modeling of distributed product and user representations enhances document-level sentiment classification.

In [22], sentence-level sentiment analysis using a deep Recurrent Neural Network (RNN) composed of Long Short-Term Memory units (RNN-LSTM) was proposed. The motivation was that sentiment analysis with a unified feature set, including word embedding representations, sentiment knowledge, sentiment shifter rules, and statistical and linguistic knowledge, had not been previously studied. This combination provides sequential processing and overcomes some of the flaws of the conventional methods. A hybrid deep learning method called ConvNet-SVMBoVW was proposed in [23] to deal with real-time data for fine-grained sentiment analysis.
An aggregation model was created to calculate the hybrid polarity, and an SVM was trained on bag-of-visual-words (BoVW) features to predict the sentiment of visual content. The proposed method not only provided five levels of fine-grained sentiment analysis (highly positive, positive, neutral, negative, and highly negative) but also outperformed the existing methods.

The study in [24] examined the sentiments in datasets containing online reviews of cars and real estate in Arabic. The authors used the Bi-LSTM (Bidirectional Long Short-Term Memory), LSTM (Long Short-Term Memory), GRU (Gated Recurrent Unit), CNN (Convolutional Neural Network), and CNN-GRU deep learning algorithms in combination with the BERT word embedding model. The real estate dataset had about 6,434 opinions, whereas the automotive dataset included nearly 6,585 opinions. The records in both datasets were assigned one of three sentiment classes (negative, positive, and mixed). The BERT model with the LSTM produced the highest F1-score of 98.71% for the automotive dataset. For the real estate dataset, the BERT model with the CNN gave the highest F1-score of 98.67%.

Recently, multi-task learning has gained substantial attention in deep learning research. Multi-task learning allows multiple tasks to be performed simultaneously by a shared model. [25] proposed a multi-task learning approach based on CNN and RNN. This model jointly learns citation sentiment classification (CSC) and citation purpose classification (CPC) to boost the overall performance of automated citation analysis, while easing the problems of inadequate training data and time-consuming feature engineering.

[26] proposed a method utilizing sarcasm detection to improve standalone sentiment classifiers, similar to our proposed method. This method required two explicitly trained models: the sentiment model and the sarcasm model.
The feature extraction for sarcasm detection used top word features, unigrams, and 4 Boaziz features consisting of punctuation-related features, sentiment-related features, and lexical and syntactic features. The sarcasm detection used the Random Forest algorithm, while the sentiment classification used the Naïve Bayes algorithm. The model achieved 80.4% accuracy, 91.3% recall, and 83.2% precision. The evaluation also showed that sarcasm detection improves the performance of sentiment analysis by about 5.49%.

This section has introduced different techniques used by researchers for sentiment analysis, such as n-gram, hybrid MLT, and deep learning methods. Finding a suitable dataset and converting the data to numerical vector form is a crucial step that all researchers take to obtain better results. The accuracy obtained from various methods is high: for example, the n-gram method achieved an accuracy of 94.6% using SVM [18], and the hybrid MLT method achieved 91.7% using a hybrid of EWGA and SVM [19]. Most of the deep learning methods outperformed the conventional ones. However, some shortcomings can be observed in these methods. As discussed in the previous section, sarcastic context plays a vital role in sentiment classification. If sarcasm is not considered in the system, a sarcastic text is classified as a positive tweet, which is a misclassification. An additional step is required to resolve this misclassification and obtain a more accurate outcome.

A significant milestone has been achieved in the field of sentiment analysis in the last decades in mining and understanding text-based documents for various purposes using machine learning and deep learning methods. However, there are limitations in the existing techniques due to the inherent nature of languages, such as the use of sarcasm in expressing feelings. This issue was tackled in [26] using two explicitly trained models, one for each task.
This method provides more accurate sentiment analysis but comes at the price of higher complexity, longer processing time, and possible overfitting. In this paper, we propose a complete framework that enhances sentiment classification by detecting the presence of sarcasm in the sentence. To reduce the complexity and processing time, multi-task learning is deployed in our framework.

3 Deep Multi‑Task Learning Framework for Sentiment Analysis with Sarcasm Detection

Sarcasm is defined as the use of remarks that clearly carry the opposite meaning or sentiment. It is used to mock or annoy someone, or for humorous purposes. The presence of sarcasm in a document makes sentiment analysis less accurate, as the conventional methods are not capable of detecting sarcasm. In this paper, we propose a framework for sentiment analysis that considers the possible presence of sarcasm. Specifically, the framework uses pre-processed data to train a Bidirectional LSTM (Bi-LSTM) network to simultaneously perform two tasks: sentiment classification and sarcasm classification. This method is called multi-task learning and has been shown to improve learning efficiency and prediction accuracy for task-specific models. In addition, each classifier in the proposed framework has a distinct perceptron layer, and different activation functions are used in the perceptron layers to handle the different classification tasks, either binary or multiclass. The architecture of the proposed framework is given in Fig. 1, and each component of the framework is explained in detail in the remainder of this section.

Fig. 1 Sentiment analysis and sarcasm detection using deep multi-task learning: overall framework

3.1 Data Acquisition

Data acquisition is the first step in machine and deep learning and usually involves spending considerable time and resources on gathering data that may or may not be relevant. Two datasets are needed in the proposed framework: a sentiment dataset and a sarcasm dataset. In this paper, both datasets are obtained from Kaggle. The sentiment analysis dataset [27], which is extracted from Twitter, contains 227,599 tweets labeled as 0 for neutral, 1 for negative sentiment, and 2 for positive sentiment. To give a better understanding of the dataset, some sample headers are shown in Fig. 2a, and the label distribution is given in Fig. 2b. It is important to note that the dataset is slightly unbalanced, with the neutral label being the majority and the negative one being the minority.

Fig. 2 a Sentiment header data; b Amount of data for each label plotted in a graph

As for the sarcasm dataset [28], there are 28,619 tweets labeled with 0 for not sarcastic and 1 for sarcastic. Figure 3a shows some samples of the sarcastic and not sarcastic headers of tweets. Similarly, the dataset is slightly unbalanced, with more tweets that are not sarcastic (type 0), as shown in Fig. 3b.

Fig. 3 a Sarcasm header data; b Amount of data for each label plotted in a graph

3.2 Data Pre‑Processing

Most datasets available online or collected manually are noisy and unstructured in nature. Pre-processing is therefore needed to transform the noisy datasets into a format suitable for training, to ensure high accuracy. In this paper, the same pre-processing method is used for both datasets, since both sentiment and sarcasm classification use the same natural language processing (NLP) techniques. The pre-processing steps used in our work are summarized as follows:

a. Removal of irrelevant words such as hyperlinks and noisy words (such as retweet markers, stock market tickers, and stop words).
b. Removal of punctuation.
c. Word stemming to reduce inflected words to their word stem or base form.
d.
Tokenization to convert text to a vector representation as input to the deep learning model.

3.3 Deep Multi‑Task Learning Neural Network

Figure 4 shows the overall architecture of the multi-task learning deep neural network. In this stage, the pre-processed data are converted into numerical vectors before they are used to train the Bi-LSTM network for both sentiment classification and sarcasm classification.

3.3.1 Embedding Layer

Word embedding takes the pre-processed input and converts each word into a vector of numeric values. It is a typical method for representing words in text analysis, in which words with similar meanings are expected to be closer in the vector space. In this paper, the Word2Vec model is used to generate a word vector matrix to convert the inputs [29]. Word2Vec learns vectors whose cosine similarity represents the degree of semantic similarity between words. Each text of n words, T = w1, w2, …, wn, is converted into a dense matrix of n word vectors, each of embedding dimension d, as given in Eq. 1.

$$T = [w_1, w_2, \ldots, w_n] \in \mathbb{R}^{n \times d} \tag{1}$$

Input texts may have different lengths, but the network expects a fixed length, denoted as l. With this in mind, zero padding is applied to any text shorter than l. Thus, all texts have the same matrix dimension, and the final text input has dimension l × d, as given in Eq. 2.

$$T = [w_1, w_2, \ldots, w_n] \in \mathbb{R}^{l \times d} \tag{2}$$

3.3.2 Multi‑Task Learning

The purpose of the proposed framework is to enhance the performance of standalone sentiment classification by adding an auxiliary task, namely sarcasm detection. This can be achieved by training two explicit models sequentially: the sentiment model as the primary task and the sarcasm model as the secondary task. However, this method is inefficient, redundant, and costly in terms of processing time and computation power.
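As an illustration of the input pipeline described in Sects. 3.2 and 3.3.1, the following minimal sketch walks one tweet through pre-processing steps a to d and the zero padding of Eq. 2. It is not the authors' implementation: the stop-word list, the crude suffix-stripping `crude_stem` (a stand-in for a real stemmer), the regular expressions, and the truncation of texts longer than l are all assumptions made for this example. The resulting integer ids would then index rows of a Word2Vec matrix, which is not shown.

```python
import re
import string

# Hypothetical, deliberately tiny stop-word list (assumption for this sketch).
STOP_WORDS = {"a", "an", "the", "is", "are", "to", "of", "and", "so", "i"}


def crude_stem(word):
    """Very rough suffix stripping; a stand-in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Steps a-c: drop links, retweet markers, tickers, stop words, punctuation; then stem."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " ", text)   # hyperlinks
    text = re.sub(r"\brt\b", " ", text)         # retweet marker
    text = re.sub(r"\$[a-z]+", " ", text)       # stock tickers such as $aapl
    text = text.translate(str.maketrans("", "", string.punctuation))
    return [crude_stem(w) for w in text.split() if w not in STOP_WORDS]


def build_vocab(token_lists):
    """Map each distinct token to a positive integer id; 0 is reserved for padding."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab) + 1)
    return vocab


def encode_and_pad(tokens, vocab, length):
    """Step d plus zero padding: a fixed-length id sequence (longer texts are truncated)."""
    ids = [vocab[t] for t in tokens if t in vocab][:length]
    return ids + [0] * (length - len(ids))


tokens = preprocess("RT So many assignments, I love school life so much! https://t.co/x $AAPL")
vocab = build_vocab([tokens])
padded = encode_and_pad(tokens, vocab, 8)
```

Reserving id 0 for padding keeps padded positions distinguishable from real words, which is the usual reason for starting the vocabulary at 1.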
Instead of training two explicit models, we propose the use of multi-task learning in our framework to train these two tasks simultaneously with a shared Bi-LSTM layer. This method offers many advantages, including reduced overfitting through shared representations, reduced model complexity, improved data efficiency, and fast learning by leveraging auxiliary information. In addition, multi-task learning helps alleviate known weaknesses of deep learning methods, such as computational demand and large-scale data requirements [30].

Fig. 4 Architecture of Multi-Task Learning

As shown in Fig. 4, the two classification tasks share a single-layer Bi-LSTM recurrent neural network but have separate multilayer perceptron (MLP) layers. The sentences obtained from word embedding are fed into the Bi-LSTM network, and the output is then passed through a fully connected layer (FCL) to get the sentence representation, Z∗. This operation can be represented by Eq. 3,

$$Z^{*} = \mathrm{FCLayer}^{*}(\mathrm{Bidirection}(\mathrm{LSTM}(X))) \tag{3}$$

where * denotes the layer shared between the two tasks, and X is the list of input sentences obtained from the word embedding. The respective classification is then obtained by applying the activation function to the sentence representation Z∗. In this framework, the activation functions used are softmax and sigmoid for sentiment classification and sarcasm classification, respectively, as given by Eqs. 4a and 4b.

$$S_{\mathrm{sentiment}} = \mathrm{Softmax}(Z^{*}) \tag{4a}$$

$$S_{\mathrm{sarcasm}} = \mathrm{Sigmoid}(Z^{*}) \tag{4b}$$

3.3.3 Loss Function

The configuration of the framework allows the secondary task to inform the training of the primary task by computing the loss of the model using Eq. 5:

$$L_i = \sum_{(x,y)\in\Omega_1} L_1(x, y) + \sum_{(x,y)\in\Omega_2} L_2(x, y) \tag{5}$$

where L1 denotes the loss for the primary task, L2 denotes the loss for the secondary task, and Li is the total loss of the proposed model.
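To make Eqs. 4 and 5 concrete, the following sketch applies a softmax head and a sigmoid head to one shared representation and sums the two task losses. It is a hedged illustration, not the authors' code: the helper `linear` and the example weights are hypothetical stand-ins for the task-specific perceptron layers, and L1 and L2 are taken to be the usual categorical and binary cross-entropy losses.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits (sentiment head, Eq. 4a)."""
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(logit):
    """Logistic function (sarcasm head, Eq. 4b)."""
    return 1.0 / (1.0 + math.exp(-logit))


def linear(z, weights, bias):
    """Hypothetical dense layer: each row of `weights` yields one output logit."""
    return [sum(w * x for w, x in zip(row, z)) + b for row, b in zip(weights, bias)]


def categorical_ce(probs, target_index):
    """Cross-entropy for a one-hot target: minus log of the true-class probability."""
    return -math.log(probs[target_index])


def binary_ce(prob, target):
    """Binary cross-entropy for a single predicted probability."""
    return -(target * math.log(prob) + (1 - target) * math.log(1 - prob))


def joint_loss(sentiment_batch, sarcasm_batch):
    """Eq. 5: per-example losses summed over the sentiment and sarcasm datasets."""
    l1 = sum(categorical_ce(p, y) for p, y in sentiment_batch)
    l2 = sum(binary_ce(p, y) for p, y in sarcasm_batch)
    return l1 + l2


# One shared representation Z* feeding both heads (all numbers are made up).
z_shared = [0.5, -1.0]
s_sentiment = softmax(linear(z_shared, [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]], [0.0, 0.0, 0.0]))
s_sarcasm = sigmoid(linear(z_shared, [[2.0, 1.0]], [0.0])[0])
total = joint_loss([(s_sentiment, 0)], [(s_sarcasm, 1)])
```

Because both loss terms depend on the same shared parameters, a gradient step on the total moves the shared layer using signal from both tasks, which is the mechanism by which the auxiliary task can inform the primary one.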
The total loss Li is computed over every sentence present in the dataset Ωi. The cross-entropy losses for sentiment classification and sarcasm classification are given in Eqs. 6 and 7, respectively, where C is the number of classes, t is the ground-truth label, and s is the predicted probability.

$$\text{Categorical Cross Entropy} = -\sum_{i}^{C} t_i \log\big(f(\mathrm{Softmax})_i\big) \tag{6}$$

$$\text{Binary Cross Entropy} = -t_1 \log(s_1) - (1 - t_1)\log(1 - s_1) \tag{7}$$

To optimize the performance of the model, the RMSprop optimizer is used in our framework. At every epoch, the proposed algorithm computes the gradient of Li for each batch to update the model parameters.

3.3.4 Bi‑Directional Long Short‑Term Memory (Bi‑LSTM)

Deep recurrent neural networks (RNNs) have proven highly effective in sentiment analysis, as RNNs have feedback loops in the recurrent layer that allow information to be maintained as ‘memory’ over time. Training conventional RNNs to solve problems that require learning long-term temporal dependencies, such as sentences or tweets, can be difficult because the gradient of the loss function decays exponentially with time, a problem known as the vanishing gradient [31]. Hence, we use Bi-LSTM to tackle this problem. Bi-LSTM is a sequence processing model consisting of two LSTMs: one taking the input in the forward direction and the other in the backward direction. It provides more context to the algorithm and results in faster and more complete learning of the problem. In other words, it allows the network to preserve information from both the future and the past.

Fig. 5 Architecture of LSTM

The basic architecture of the LSTM network is illustrated in Fig. 5. The LSTM network is a type of RNN that uses special units called memory cells, in addition to standard units, to solve the vanishing gradient problem. An LSTM network accomplishes this by incorporating a cell state and three different gates: the forget gate, the input gate, and the output gate.
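The cell state and the three gates can be sketched as a single LSTM step. This is the standard textbook LSTM formulation written with scalar inputs and states for readability, not necessarily the exact variant used in the paper; the parameter dictionary `p` and the function names are assumptions made for this example.

```python
import math


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def lstm_step(x, h, c, p):
    """One scalar LSTM step; `p` maps a gate name to (w_x, w_h, bias)."""
    def pre(name):
        w_x, w_h, b = p[name]
        return w_x * x + w_h * h + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much of the candidate to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    g = math.tanh(pre("candidate"))  # candidate value for the cell state
    c_new = f * c + i * g
    h_new = o * math.tanh(c_new)
    return h_new, c_new


def run_lstm(xs, p):
    """Run the cell over a sequence and return the final hidden state."""
    h, c = 0.0, 0.0
    for x in xs:
        h, c = lstm_step(x, h, c, p)
    return h


def bi_lstm(xs, p_fwd, p_bwd):
    """Final hidden states of a forward and a backward pass, concatenated."""
    return [run_lstm(xs, p_fwd), run_lstm(list(reversed(xs)), p_bwd)]


zero = {name: (0.0, 0.0, 0.0) for name in ("forget", "input", "output", "candidate")}
h1, c1 = lstm_step(1.0, 0.0, 2.0, zero)
```

With all parameters at zero, every gate sits at 0.5, so the old cell state is halved and nothing new is written; training shifts the gate pre-activations so the cell keeps or discards information as needed.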
Using this explicit gating mechanism, the cell can determine what to do with the state vector at each step: read from it, write to it, or delete it. The input gate allows the cell to decide whether to update the cell state. The forget gate allows the cell to erase its memory, and the output gate determines whether to make the output information available. Ultimately, the ability to ‘memorize’ and forget makes LSTM a suitable method for the sentiment and sarcasm models because every word present in a sentence is crucial. Furthermore, it is vital to conserve information bi-directionally, from the past to the future and from the future to the past, when evaluating the nature of a sentence.

3.4 Multi‑Layer Perceptron (MLP)

Figure 6 gives a deeper insight into the multilayer perceptron used in the proposed framework. The primary task and the secondary task have similar architectures except for the activation functions of the dense layer. For both tasks, we apply the ReLU activation function so that all negative elements in Z∗ (the outputs of the bidirectional neural network) are set to 0, as shown in Eq. 8. This simplifies the backward propagation computation, giving benefits such as fast training and prevention of vanishing gradients.

$$\mathrm{ReLU}(x) = \max(0, x) \tag{8}$$

Fig. 6 Architecture of Multilayer Perceptron (MLP)

Then, dropout is added to the model as a regularizer, where the algorithm randomly sets half of the activations on the fully connected layers to zero during training. This technique improves the generalization ability and reduces overfitting [32]. Finally, training an LSTM-based multi-task model involves remembering broadly diverse sequential data. In this case, RMSprop is a suitable optimizer as it uses a moving average of squared gradients to normalize the gradient itself [33].
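The RMSprop update mentioned above can be sketched in a few lines. This is the commonly used form of the rule, offered as an illustration rather than the authors' training code: the learning rate of 0.001 matches Table 1, while the decay rate `rho = 0.9` and the stabilizer `eps` are assumed defaults.

```python
import math


def rmsprop_step(params, grads, avg_sq, lr=0.001, rho=0.9, eps=1e-8):
    """One RMSprop update; returns the new parameters and the new moving averages."""
    new_params, new_avg = [], []
    for w, g, v in zip(params, grads, avg_sq):
        v = rho * v + (1.0 - rho) * g * g            # moving average of squared gradients
        new_avg.append(v)
        new_params.append(w - lr * g / (math.sqrt(v) + eps))  # gradient scaled by its RMS
    return new_params, new_avg


# Two parameters whose gradients differ by four orders of magnitude.
params, avg = rmsprop_step([1.0, 1.0], [100.0, 0.01], [0.0, 0.0])
steps = [1.0 - params[0], 1.0 - params[1]]
```

In this example both parameters move by almost the same amount even though one gradient is 10,000 times larger than the other, which illustrates how dividing by the root of the moving average evens out the step size.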
Normalizing by this moving average balances the step size: it increases the step size for a small gradient to avoid vanishing gradients and decreases it for a large gradient to avoid exploding gradients, which ensures efficient learning by the model [34].

3.5 Model Analysis

In this section, we provide a preliminary analysis of the performance of the proposed framework. The proposed model is trained using the hyperparameters defined in Table 1, and Fig. 7 gives the training and validation losses. The validation loss reduces significantly from 1.4 to 0.5 between epochs 0 and 15, which shows that the model is learning effectively. From epoch 15 onwards, the model has stopped learning, as the validation loss has reached a steady state although the training loss is still decreasing. For this reason, the number of epochs is set to 10 for training, since the validation loss equals the training loss at this number of epochs. In terms of complexity, the proposed model is slightly more complex than conventional deep neural networks due to bi-directional data propagation. Compared to similar classification work that involves two tasks [26], the proposed model is much simpler due to the multi-task learning. Figure 7 shows that the validation loss curve remains constant at around 0.5 instead of going upwards. This indicates that the proposed method successfully reduces overfitting.

Table 1 Hyperparameter settings

Parameters            Values
Embedding dimension   Word2Vec = 200
Bi-LSTM output size   20
Dropout               0.4
Epoch                 50
Learning rate         0.001
Batch size            128
Validation split      0.2
Loss function         Cross-entropy
Optimizer             RMSprop

Fig. 7 Validation and training losses (model loss vs. epoch; x-axis: epoch, y-axis: loss; curves: loss and val_loss)

In terms of training time, we compare the performance with the model presented in [26], as given in Fig.
8, using the same hyperparameters as in our proposed method. The method in [26] uses the same input and hidden layer settings twice, once for the sentiment model and once for the sarcasm model, which is redundant because the model training has to be done twice. By contrast, the multi-task learning method allows us to train both tasks in each epoch and thus significantly lowers the training time. In terms of computation power, the proposed method requires less because the neural network used is smaller than that of [26]. The goal is to shrink the neural network so that unnecessary processes and mathematical operations are removed. Data therefore pass through the hidden layer once instead of twice, significantly reducing unwanted processing and computation.

Fig. 8 Architecture of the two explicitly trained sentiment and sarcasm models

4 Performance Evaluation

4.1 Experiment Settings

Figure 9 shows an overview of the experiments conducted to evaluate the performance of the proposed framework. Two categories of datasets are used. The first experiment compares the proposed framework against various baselines and variant models used in other similar works, for a fair comparison. For the second experiment, we extracted data from multiple social media platforms, including Twitter and Reddit, to eliminate bias by feeding unseen data to the model.

Table 2 shows the relationship between the sentiment and sarcasm classifications. If a tweet is classified as both positive and sarcastic, its final label is negative. Consider, for example, the tweet “So many assignments, I love school life so much.” The sentiment classifier will detect this tweet as positive because “Love school life so much” is highlighted by the classifier as a positive remark. On the other hand, the sarcasm classifier will identify it as sarcastic due to the phrase “So many assignments.” In this case, the tweet is negative.
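The combination rule of Table 2 amounts to one conditional, sketched below. The function name `combine_labels` and the string labels are hypothetical, and passing a neutral sentiment through unchanged is an assumption of this sketch, since Table 2 only lists the positive and negative cases.

```python
def combine_labels(sentiment, sarcastic):
    """Final polarity from the two classifier outputs, following Table 2.

    sentiment: "positive", "negative", or "neutral"
    sarcastic: True if the sarcasm classifier flags the text
    """
    if sarcastic and sentiment == "positive":
        return "negative"   # sarcasm flips an apparently positive text
    return sentiment        # all other cases keep the sentiment label (neutral: assumption)


# "So many assignments, I love school life so much." -> positive + sarcastic
final = combine_labels("positive", True)
```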
Fig. 9 Overview of performance evaluation

Table 2 Sentiment and sarcasm relationship

| Sentiment score | Sarcasm score | Output |
| --- | --- | --- |
| Negative | Not sarcastic | Negative |
| Negative | Sarcastic | Negative |
| Positive | Not sarcastic | Positive |
| Positive | Sarcastic | Negative |

4.2 Evaluation Metrics

The following standard metrics are used in this paper to benchmark the performance of the proposed framework.

a. Recall (also known as sensitivity) measures the ability of a model to detect the positive instances. It is the ratio of true positives (TP) to the sum of true positives and false negatives (FN), as given in Eq. 9.

$$\text{Recall} = \frac{TP}{TP + FN} \tag{9}$$

b. Precision is the accuracy of the positive predictions. It is defined as the ratio of true positives to the total predicted positives, which include both true positives (TP) and false positives (FP), as given in Eq. 10.

$$\text{Precision} = \frac{TP}{TP + FP} \tag{10}$$

c. F1-score is a helpful metric for comparing two classifiers. It considers both recall and precision and is defined in Eq. 11.

$$\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}} \tag{11}$$

Another commonly used metric, accuracy, is the ratio of correct predictions to total predictions. Accuracy does not distinguish between false negatives and false positives, which makes it unreliable on imbalanced data. Imagine a dataset of 1000 tweets, of which 900 are sarcastic and 100 are not sarcastic. If our classifier predicts every tweet as sarcastic, the accuracy is a high 90%, yet the F1-score of the not-sarcastic class is 0 because its recall is 0. Hence, our experiments use the F1-score as the primary metric to evaluate the performance of the proposed model.
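The imbalanced example above can be checked numerically with Eqs. 9 to 11. The sketch below is illustrative only: the helper name `precision_recall_f1`, the label strings, and the convention of returning 0 when a denominator is 0 are assumptions, and the scores are computed per class.

```python
def precision_recall_f1(y_true, y_pred, positive):
    """Return (precision, recall, F1) for one class, following Eqs. 9-11."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    # Assumed convention: a ratio with a zero denominator is reported as 0.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# 900 sarcastic and 100 not-sarcastic tweets; the classifier says "sarcastic" every time.
y_true = ["sarcastic"] * 900 + ["not_sarcastic"] * 100
y_pred = ["sarcastic"] * 1000

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
sarcastic_scores = precision_recall_f1(y_true, y_pred, "sarcastic")
not_sarcastic_scores = precision_recall_f1(y_true, y_pred, "not_sarcastic")
```

Here the accuracy is 0.9, while the not-sarcastic class has a recall and an F1-score of 0 because none of its 100 tweets is detected. The sarcastic class, by contrast, keeps a high F1-score, so the failure only becomes visible when the scores are inspected per class or averaged across classes.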
4.3 Experiment 1: Model Baselines and Variants

The dataset used is split into 80% for training and 20% for testing. The following baselines and variations of the models are compared.

a. Standalone classifier using RNN with the following characteristics:

$$Z_{\text{sentiment}} = \mathrm{FCLayer}_{\text{sentiment}}(\mathrm{RNN}(X_{\text{sentiment}}))$$
$$Z_{\text{sarcasm}} = \mathrm{FCLayer}_{\text{sarcasm}}(\mathrm{RNN}(X_{\text{sarcasm}}))$$
$$S_{\text{sentiment}} = \mathrm{Softmax}(Z_{\text{sentiment}})$$
$$S_{\text{sarcasm}} = \mathrm{Sigmoid}(Z_{\text{sarcasm}})$$

In this case, there are two standalone models, one for sentiment and one for sarcasm. X is the list of input sentences obtained from the word embeddings. We feed X to the RNN and then pass the output through a fully connected layer (FCLayer) to get the sentence representation Z. Finally, Z is passed through the dense output layer, which uses softmax for sentiment classification and sigmoid for sarcasm classification, giving $S_{\text{sentiment}}$ and $S_{\text{sarcasm}}$, respectively.

b. Standalone classifier using CNN with the following characteristics:

$$Z_{\text{sentiment}} = \mathrm{FCLayer}_{\text{sentiment}}(\mathrm{CNN}(X_{\text{sentiment}}))$$
$$Z_{\text{sarcasm}} = \mathrm{FCLayer}_{\text{sarcasm}}(\mathrm{CNN}(X_{\text{sarcasm}}))$$
$$S_{\text{sentiment}} = \mathrm{Softmax}(Z_{\text{sentiment}})$$
$$S_{\text{sarcasm}} = \mathrm{Sigmoid}(Z_{\text{sarcasm}})$$

This case is the same as case (a) except that CNN is used instead of RNN.

c. Standalone classifier using Bi-GRU with the following characteristics:

$$Z_{\text{sentiment}} = \mathrm{FCLayer}_{\text{sentiment}}(\mathrm{Bidirection}(\mathrm{GRU}(X_{\text{sentiment}})))$$
$$Z_{\text{sarcasm}} = \mathrm{FCLayer}_{\text{sarcasm}}(\mathrm{Bidirection}(\mathrm{GRU}(X_{\text{sarcasm}})))$$
$$S_{\text{sentiment}} = \mathrm{Softmax}(Z_{\text{sentiment}})$$
$$S_{\text{sarcasm}} = \mathrm{Sigmoid}(Z_{\text{sarcasm}})$$

This case is the same as the two previous cases except that Bi-GRU is used.

d. Standalone classifier using Bi-LSTM with the following characteristics:

$$Z_{\text{sentiment}} = \mathrm{FCLayer}_{\text{sentiment}}(\mathrm{Bidirection}(\mathrm{LSTM}(X_{\text{sentiment}})))$$
$$Z_{\text{sarcasm}} = \mathrm{FCLayer}_{\text{sarcasm}}(\mathrm{Bidirection}(\mathrm{LSTM}(X_{\text{sarcasm}})))$$
$$S_{\text{sentiment}} = \mathrm{Softmax}(Z_{\text{sentiment}})$$
$$S_{\text{sarcasm}} = \mathrm{Sigmoid}(Z_{\text{sarcasm}})$$

This case also involves two standalone models, as in the previous cases, except that Bi-LSTM is used.

e.
Multi-task learning using Bi-LSTM with the following characteristics:

$$Z^{*} = \mathrm{FCLayer}^{*}(\mathrm{Bidirection}(\mathrm{LSTM}(X)))$$
$$S_{\text{sentiment}} = \mathrm{Softmax}(Z^{*})$$
$$S_{\text{sarcasm}} = \mathrm{Sigmoid}(Z^{*})$$

where * denotes the layer shared between the sentiment and sarcasm tasks.

The experiment results are given in Fig. 10. First, when the four standalone models are compared, the conventional neural networks give poorer outcomes than the more recent Bi-LSTM network. The Bi-LSTM outperforms RNN, CNN, and Bi-GRU by margins of 0.4%, 0.1%, and 0.1%, respectively. It is essential to highlight that this model achieves an F1-score of 91% for sentiment classification and 92% for sarcasm classification. This improvement is achieved by the memory gates in the Bi-LSTM network, which preserve important data while erasing useless data. Besides, this network also allows the machine to learn bi-directionally, so information from both the past and the future is preserved for better classification.

Fig. 10 Experimental results using different variety of models: a sentiment classification; b sarcasm classification

Table 3 Distribution of sentiment sentences

| Datasets | Positive | Negative | Neutral |
| --- | --- | --- | --- |
| Reddit | 15,830 (42%) | 8277 (22%) | 13,142 (35%) |
| Twitter US airline | 2363 (16%) | 9178 (63%) | 3090 (21%) |
| Twitter | 1103 (17%) | 4001 (61%) | 1430 (22%) |

Next, we compare the standalone model with the proposed framework, which uses multi-task learning with a Bi-LSTM network. The proposed method outperforms all the other models in sentiment and sarcasm classification, with F1-scores of 94% and 93%, respectively. Compared to the standalone Bi-LSTM model, the improvement is 3% and 1%, respectively, for sentiment and sarcasm classification. This improvement arises because multi-task learning uses a shared layer, which reduces the risk of overfitting during gradient descent and ensures efficient learning. This also means that the sarcasm classifier successfully boosts the sentiment classifier’s performance.
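The shared-layer formulation of the multi-task case (two outputs computed from one shared representation Z*) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the trained model: the Bi-LSTM encoder is not implemented and Z* is simply given as a vector, the per-task weight matrices in `sent_head` and `sarc_head` are hypothetical values, and summing the two cross-entropy losses with equal weights is an assumption.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def dense(weights, bias, z):
    """Affine map with one output per weight row."""
    return [sum(w * v for w, v in zip(row, z)) + b for row, b in zip(weights, bias)]


def multitask_forward(z_shared, sent_head, sarc_head):
    """Apply both task heads to the same shared representation Z*."""
    s_sentiment = softmax(dense(sent_head["W"], sent_head["b"], z_shared))
    s_sarcasm = sigmoid(dense(sarc_head["W"], sarc_head["b"], z_shared)[0])
    return s_sentiment, s_sarcasm


def joint_loss(s_sentiment, s_sarcasm, sentiment_label, sarcasm_label):
    """Categorical plus binary cross-entropy; equal weighting is assumed."""
    eps = 1e-12
    sentiment_loss = -math.log(s_sentiment[sentiment_label] + eps)
    p = s_sarcasm if sarcasm_label == 1 else 1.0 - s_sarcasm
    return sentiment_loss + -math.log(p + eps)


# Hypothetical shared representation and head parameters.
z_star = [0.5, -1.0, 2.0]
sent_head = {"W": [[0.2, 0.1, 0.4], [-0.3, 0.5, 0.1], [0.0, -0.2, 0.3]], "b": [0.0, 0.0, 0.0]}
sarc_head = {"W": [[0.6, -0.4, 0.2]], "b": [0.1]}

s_sent, s_sarc = multitask_forward(z_star, sent_head, sarc_head)
loss = joint_loss(s_sent, s_sarc, sentiment_label=0, sarcasm_label=1)
```

Because both heads read the same `z_star`, a gradient step on the joint loss would update the shared layers with signals from both tasks, which is the mechanism the comparison with the standalone Bi-LSTM relies on.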
It can also be observed that the margin of improvement for the sentiment classifier is larger than that for the sarcasm classifier because sarcasm detection is a subtask of sentiment analysis. It is also important to note that the performance of the proposed model is consistent when we analyze the precision and recall values, which follow the same pattern as the F1-score. The margin of improvement is also about 3% and 1%, respectively, for sentiment and sarcasm classification.

4.4 Experiment 2: Testing the Proposed Method Using Unbiased Datasets

In this experiment, we evaluate the performance of the proposed method using different datasets and compare it with the Bi-LSTM standalone model, which is the best standalone model from the previous experiment. This experiment aims to analyze the performance of our model on unseen data and to gauge how well the model does in a real-life scenario. Three datasets obtained from Kaggle are used in this experiment, as summarized in Table 3 and explained as follows.

a. Reddit dataset
The Reddit dataset was collected from the comment section of Reddit posts. It is an unbalanced dataset in which positive sentiment is the largest class. This dataset fits the purpose of this experiment well because it is from a different social media platform, which may have a different writing structure.

b. Twitter US Airline dataset [35]
This dataset consists of tweets about US airlines written by their customers. The contributors were asked to classify each tweet as positive, negative, or neutral, followed by the reasons for giving such comments. It is also an unbalanced dataset, with negative sentiment being the majority of the data.

c. Twitter dataset
This dataset was scraped from Twitter using the Tweepy module. It is an unbalanced dataset, with negative sentiment being the majority of the data.
This dataset is the closest in nature to the dataset used to train the models.

Figure 11 shows the experimental results obtained through testing using different datasets. The F1-score is given in Fig. 11a. The Twitter dataset has an F1-score of 58% for the standalone sentiment classifier and 63% for the proposed method, which is the highest among the three datasets. We believe the reason for this better performance is the nature of the dataset, which is the same as the data used for training but of broader scope. The Reddit dataset obtained an F1-score of 55% for the standalone sentiment classifier and 61% for the proposed method, while the Twitter US airline dataset obtained an F1-score of 52% for the standalone sentiment classifier and 54% for the proposed method.

Generally, the performance of the proposed model on unbiased data is relatively low, with an average F1-score of about 59%. To better understand this poor performance, we observe the neutral score of the proposed method on these datasets, as shown in Fig. 11b. It is essential to highlight that poor performance on one of the labels significantly affects the overall F1-score of the model. The neutral label indeed has a deficient F1-score of 48%, 51%, and 44% for the Reddit, Twitter, and Twitter US airline datasets, respectively. The low F1-score is mainly due to the low recall on neutral sentiment, which is 33%, 36%, and 29% for the Reddit, Twitter, and Twitter US airline datasets, respectively. This suggests that many neutral sentences are missed, i.e., they are false negatives for the neutral class and are labeled as positive or negative instead. Figure 12 shows the word cloud for the neutral label in these datasets. We can observe that the words in neutral sentences are mostly stop words, which are removed in the pre-processing stage; consequently, the model has difficulty classifying neutral tweets. Nevertheless, the proposed method still outperforms the existing methods in analyzing unbiased datasets.
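The link between the low neutral recall and the low neutral F1-score can be made concrete by rearranging Eq. 11 for precision: P = F1 · R / (2R − F1). The sketch below applies this to the neutral-class values quoted above. It assumes that the quoted F1-scores and recalls refer to the same neutral class and are rounded to whole percentages, so the derived precision is only approximate and is not a figure reported in this paper; the function name `implied_precision` is hypothetical.

```python
def implied_precision(f1: float, recall: float) -> float:
    """Solve F1 = 2PR / (P + R) for P, given F1 and R."""
    denominator = 2 * recall - f1
    if denominator <= 0:
        raise ValueError("F1 must be smaller than 2 * recall")
    return f1 * recall / denominator


# Neutral-class (F1, recall) pairs as quoted in the text, rounded percentages.
reported = {
    "Reddit": (0.48, 0.33),
    "Twitter": (0.51, 0.36),
    "Twitter US airline": (0.44, 0.29),
}

precision = {name: implied_precision(f1, r) for name, (f1, r) in reported.items()}
```

Under these assumptions, the implied neutral precision is roughly 0.87 to 0.91 for all three datasets. This would mean that the model seldom predicts the neutral label but is usually right when it does, which is consistent with the observation that the low F1-score is driven by recall.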
Fig. 11 Experimental results using unbiased datasets: a F1-Score; b neutral score

Fig. 12 Word cloud for neutral label in datasets

5 Conclusions and Future Work

Sentiment analysis plays a vital role in the current digital world. It can provide useful information obtained from the Internet, especially social media, for various purposes. In this paper, we tackle the problem of inaccurate sentiment analysis caused by the use of sarcasm in some sentences. More specifically, a multi-task learning framework for simultaneous sentiment analysis and sarcasm detection has been proposed. In this framework, sarcasm detection aims to detect the sarcastic context in a sentence and improve the accuracy of sentiment analysis. Two experiments using different datasets were conducted to evaluate the performance of the proposed framework, and the results show promising improvement. It can be concluded that sarcastic tweets do affect the performance of the sentiment analysis model, and adding an efficient sarcasm detection model can significantly improve the overall performance of the sentiment analysis model.

The performance of the proposed method relies on the dataset used for training. As demonstrated in the second experiment presented in the previous section, the performance of the proposed framework is poorer on unseen datasets because neutral sentiments return poor recall scores, which also leads to lower F1-scores. This happens because the stop words, which are neutral in nature, are removed during the pre-processing stage, making the neutral class less significant. To overcome this problem, a more appropriate pre-processing technique is needed to effectively remove stop words without reducing the number of neutral statements in the datasets.
Acknowledgements This work was supported by the Impact Oriented Interdisciplinary Research Grant (IIRG) Programme, University of Malaya [IIRG002B-19IISS].

Authors Contributions Conceptualization, Y.-Y.T., C.-O.C., J.-H.C.; Methodology, Y.-Y.T., C.-O.C., J.K.; Software, Y.-Y.T.; Investigation, Y.-Y.T., C.-O.C.; Data, Y.-Y.T., J.K.; Analysis, Y.-Y.T., C.-O.C.; Validation, Y.L.; Writing (Original Draft), Y.-Y.T.; Writing (Review), C.-O.C.; Supervision, C.-O.C.; Funding Acquisition, C.-O.C.

Funding This work was supported by the Impact Oriented Interdisciplinary Research Grant (IIRG) Programme, University of Malaya [IIRG002B-19IISS].

Data Availability The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Code Availability The datasets generated during and/or analysed during the current study are available from the corresponding author on reasonable request.

Declarations

Conflicts of interest The authors declare that they have no conflicts of interest and competing interests.

References

1. Statista Research Department. (2022, September 20). Internet and social media users in the world 2022. Statista. Retrieved October 6, 2022, from https://www.statista.com/statistics/617136/digital-population-worldwide/
2. Huang, C., Wang, Y., Li, X., Ren, L., Zhao, J., Hu, Y., Zhang, L., Fan, G., Xu, J., Gu, X., Cheng, Z., Yu, T., Xia, J., Wei, Y., Wu, W., Xie, X., Yin, W., Li, H., Liu, M., & Cao, B. (2020). Clinical features of patients infected with 2019 novel coronavirus in Wuhan China. The Lancet, 395(10223), 497–506. https://doi.org/10.1016/s0140-6736(20)30183-5
3. American Psychological Association. (n.d.). APA: U.S. adults report highest stress level since early days of the COVID-19 pandemic. American Psychological Association. Retrieved October 6, 2022, from https://www.apa.org/news/press/releases/2021/02/adults-stress-pandemic
4. Online Reviews Stats & Insights. Podium. (n.d.). Retrieved October 6, 2022, from https://www.podium.com/resources/podium-state-of-online-reviews.
5. De Choudhury, Munmun, Counts, & Scott. (2012). The nature of emotional expression in social media: measurement, inference and utility. Human Computer Interaction Consortium (HCIC).
6. Zhao, J., Liu, K., & Xu, L. (2016). Sentiment analysis: Mining opinions, sentiments, and emotions. Computational Linguistics, 42(3), 595–598. https://doi.org/10.1162/coli_r_00259
7. Pang, B., & Lee, L. (2004). A sentimental education: Sentiment analysis using subjectivity summarization based on minimum cuts. In: Proceedings of the 42nd Annual Meeting on Association for Computational Linguistics–ACL ’04. https://doi.org/10.3115/1218955.1218990
8. Turney, P. D. (2001). Thumbs up or thumbs down? Semantic orientation applied to unsupervised classification of reviews. Proceedings of the 40th Annual Meeting on Association for Computational Linguistics–ACL ’02. https://doi.org/10.3115/1073083.1073153
9. Dave, K., Lawrence, S., & Pennock, D. M. (2003). Mining the peanut gallery: Opinion extraction and semantic classification of product reviews. In: Proceedings of the Twelfth International Conference on World Wide Web - WWW ’03. https://doi.org/10.1145/775152.775226
10. Khadjeh Nassirtoussi, A., Aghabozorgi, S., Ying Wah, T., & Ngo, D. C. (2014). Text mining for market prediction: A systematic review. Expert Systems with Applications, 41(16), 7653–7670. https://doi.org/10.1016/j.eswa.2014.06.009
11. Burnap, P., Williams, M. L., Sloan, L., Rana, O., Housley, W., Edwards, A., Knight, V., Procter, R., & Voss, A. (2014). Tweeting the terror: Modelling the social media reaction to the Woolwich terrorist attack. Social Network Analysis and Mining. https://doi.org/10.1007/s13278-014-0206-4
12. Hogenboom, A., Heerschop, B., Frasincar, F., Kaymak, U., & de Jong, F. (2014). Multi-lingual support for lexicon-based sentiment analysis guided by semantics. Decision Support Systems, 62, 43–53. https://doi.org/10.1016/j.dss.2014.03.004
13. Reyes, A., & Rosso, P. (2013). On the difficulty of automatically detecting irony: Beyond a simple case of negation. Knowledge and Information Systems, 40(3), 595–614. https://doi.org/10.1007/s10115-013-0652-8
14. Arunachalam, R., & Sarkar, S. (2013). The new eye of government: Citizen sentiment analysis in social media. In: Proceedings of the IJCNLP 2013 Workshop on Natural Language Processing for Social Media (SocialNLP), 23–28.
15. Diana, M., & MA, G. (2014). Who cares about sarcastic tweets? Investigating the impact of sarcasm on sentiment analysis. Lrec 2014 Proceedings.
16. Matsumoto, S., Takamura, H., & Okumura, M. (2005). Sentiment classification using word subsequences and dependency sub-trees. Advances in Knowledge Discovery and Data Mining. https://doi.org/10.1007/11430919_37
17. Maas, A., Daly, R. E., Pham, P. T., Huang, D., Ng, A. Y., & Potts, C. (2011). Learning word vectors for sentiment analysis. Proceedings of the 49th Annual Meeting of the Association for Computational Linguistics: Human Language Technologies, 142–150.
18. Bespalov, D., Bai, B., Qi, Y., & Shokoufandeh, A. (2011). Sentiment classification based on supervised latent N-gram analysis. Proceedings of the 20th ACM International Conference on Information and Knowledge Management - CIKM ’11. https://doi.org/10.1145/2063576.2063635
19. Abbasi, A., Chen, H., & Salem, A. (2008). Sentiment analysis in multiple languages: Feature selection for opinion classification in web forums. ACM Transactions on Information Systems, 26(3), 1–34. https://doi.org/10.1145/1361684.1361685
20. Yanagimoto, H., Shimada, M., & Yoshimura, A. (2013). Document similarity estimation for sentiment analysis using neural network. 2013 IEEE/ACIS 12th International Conference on Computer and Information Science (ICIS). https://doi.org/10.1109/icis.2013.6607825
21. Chen, T., Xu, R., He, Y., Xia, Y., & Wang, X. (2016). Learning user and product distributed representations using a sequence model for sentiment analysis. IEEE Computational Intelligence Magazine, 11(3), 34–44. https://doi.org/10.1109/mci.2016.2572539
22. Abdi, A., Shamsuddin, S. M., Hasan, S., & Piran, J. (2019). Deep learning-based sentiment classification of evaluative text based on multi-feature fusion. Information Processing & Management, 56(4), 1245–1259. https://doi.org/10.1016/j.ipm.2019.02.018
23. Kumar, A., Srinivasan, K., Cheng, W.-H., & Zomaya, A. Y. (2020). Hybrid context enriched deep learning model for fine-grained sentiment analysis in textual and visual semiotic modality social data. Information Processing & Management, 57(1), 102141. https://doi.org/10.1016/j.ipm.2019.102141
24. Yafoz, A., & Mouhoub, M. (2021). Sentiment analysis in Arabic social media using deep learning models. 2021 IEEE International Conference on Systems, Man, and Cybernetics SMC. https://doi.org/10.1109/smc52423.2021.9659245
25. Yousif, A., Niu, Z., Chambua, J., & Khan, Z. Y. (2019). Multi-task learning model based on recurrent convolutional neural networks for citation sentiment and purpose classification. Neurocomputing, 335, 195–205. https://doi.org/10.1016/j.neucom.2019.01.021
26. Yunitasari, Y., Musdholifah, A., & Sari, A. K. (2019). Sarcasm detection for sentiment analysis in Indonesian tweets. IJCCS Indonesian Journal of Computing and Cybernetics Systems, 13(1), 53. https://doi.org/10.22146/ijccs.41136
27. A, C. K. (2019). Twitter and reddit sentimental analysis dataset. Kaggle. Retrieved October 7, 2022, from https://www.kaggle.com/datasets/cosmos98/twitter-and-reddit-sentimental-analysis-dataset
28. Misra, R. (2019). News headlines dataset for sarcasm detection. Kaggle. Retrieved October 7, 2022, from https://www.kaggle.com/datasets/rmisra/news-headlines-dataset-for-sarcasm-detection
29. Mikolov, T., Kai, C., Corrado, G., & Jeffrey, D. (2013). Efficient estimation of word representations in vector space. ArXiv Preprint ArXiv:1301.3781.
30. Crawshaw, M. (2020). Multi-task learning with deep neural networks: A survey. ArXiv Preprint ArXiv:2009.09796.
31. Arbel, N. (2020). How LSTM networks solve the problem of vanishing gradients. Medium. Retrieved October 7, 2022, from https://medium.datadriveninvestor.com/how-do-lstm-networks-solve-the-problem-of-vanishing-gradients-a6784971a577
32. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., & Salakhutdinov, R. (2014). Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15, 1929–1958.
33. Agarap, A. F. (2018). Deep learning using rectified linear units (ReLU). ArXiv Preprint ArXiv:1803.08375.
34. Hochreiter, S. (1998). The vanishing gradient problem during learning recurrent neural nets and problem solutions. International Journal of Uncertainty, Fuzziness and Knowledge-Based Systems, 06(02), 107–116. https://doi.org/10.1142/s0218488598000094
35. Eight, F. (2019). Twitter us airline sentiment. Kaggle. Retrieved October 7, 2022, from https://www.kaggle.com/datasets/crowdflower/twitter-airline-sentiment

Publisher’s Note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

Yik Yang Tan is currently a student at the Department of Electrical Engineering, University of Malaya. His research topics include application of machine learning techniques and big data in solving engineering problems, making the most out of state-of-art techniques.
Chee‑Onn Chow received his Bachelor of Engineering (honors) and Master of Engineering Science degrees from University of Malaya, Malaysia in 1999 and 2001, respectively. He received his Doctorate of Engineering from the Tokai University, Japan in 2008. He joined the Department of Electrical Engineering, University of Malaya in 1999, and is currently an Associate Professor in the same department. His research areas include wireless communications, multimedia transmission, data analytics and machine learning. He is a member of IET and a senior member of IEEE. He is a registered Professional Engineer (Board of Engineers Malaysia) and Chartered Engineer (IET).

Jeevan Kanesan received his B.S. degree in Electrical & Electronics Engineering from University Technology Malaysia, Malaysia, in 1999, and his M.S. and Ph.D. degrees in mechanical engineering from University Science Malaysia, Malaysia in 2003 and 2006, respectively. He worked as an equipment engineer at Carsem Semiconductor, Malaysia between 2000 and 2001 and as an IC design engineer in the thermomechanical department, Intel Technology Sdn. Bhd., Malaysia from 2006 to 2008. He has been with the University of Malaya, Malaysia in the Electrical Engineering department since 2008. He has published over 50 peer-reviewed technical papers in international journals. His research interests include Nature Inspired Metaheuristics, Optimization, CAD of VLSI circuits and design and analysis of algorithms.

Joon Huang Chuah received the B.Eng. (Hons.) degree from the Universiti Teknologi Malaysia, the M.Eng. degree from the National University of Singapore, and the M.Phil. and Ph.D. degrees from the University of Cambridge. He is currently Head of VIP Research Group at the Department of Electrical Engineering, University of Malaya.
He was the Honorary Treasurer of IEEE Computational Intelligence Society (CIS) Malaysia Chapter and the Honorary Secretary of IEEE Council on RFID Malaysia Chapter. He is the Vice Chairman of the Institution of Engineering and Technology (IET) Malaysia Network. He is a Fellow and the Honorary Secretary of the Institution of Engineers, Malaysia (IEM). His main research interests include image processing, computational intelligence, IC design, and scanning electron microscopy.

YongLiang Lim received the Bachelor's degree and the Master's degree in Electronic Engineering from the University of Science Malaysia (USM) Engineering Campus, Malaysia, in 2016 and 2019, respectively. He is currently a Ph.D. student at the Department of Electrical Engineering, University of Malaya. His research interests are computational intelligence and context-awareness.