Temporal Mental Health Dynamics on Social Media
arXiv:2008.13121v3 [cs.CL] 2 Sep 2020
Tom Tabak1
School of Electronic Engineering and Computer Science
Queen Mary University of London
London, United Kingdom
tabaktom360@gmail.com
Abstract—We describe a set of experiments for building a
temporal mental health dynamics system. We utilise a preexisting methodology for distant-supervision of mental health
data mining from social media platforms and deploy the system
during the global COVID-19 pandemic as a case study. Despite
the challenging nature of the task, we produce encouraging
results, both explicit to the global pandemic and implicit to
a global phenomenon, Christmas Depression, supported by the
literature. We propose a methodology for providing insight into
temporal mental health dynamics to be utilised for strategic
decision-making.
Index Terms—Mental Health, Social Media, COVID-19
I. INTRODUCTION
Mental health issues pose a significant threat to the general population. Quantifiable data sources pertaining to mental health are scarce in comparison to physical health data
(Coppersmith et al. 2014). This scarcity complicates the
development of reliable diagnoses and effective treatments
for mental health issues of the kind that are the norm in physical
health (Righetti-Veltema et al. 1998). The scarcity is partially
due to complexity and variation in underlying causes of mental illness. Furthermore, the traditional method for gathering
population-level mental health data, behavioral surveys, is
costly and often delayed (De Choudhury, Counts & Horvitz
2013b).
Whilst widespread adoption of and engagement with social media
platforms has provided researchers with a plentiful data source
for a variety of tasks, including mental health diagnosis, it has
not yet yielded a concrete solution to mental health diagnosis
(Ayers et al. 2014). Conducting mental health diagnosis tasks
on social media data presents its own set of challenges: users
may convey a particular public persona through posts
that are not genuine; and the sample is drawn from a sub-population
that is either technologically savvy, which may lead to a
generational bias, or able to afford the financial cost
of the technology, which may lead to a demographic bias.
However, the richness and diversity of the available data's
content make it an attractive data source. Quantifiable data
from social media platforms is by nature social and, crucially
in the context of our case study, virtual.
Matthew Purver1,2
Department of Knowledge Technologies
Jožef Stefan Institute
Ljubljana, Slovenia
m.purver@qmul.ac.uk
Quantifiable social media data enables researchers to develop methodologies for distant mental health diagnosis and
analyse different mental illnesses (De Choudhury, Counts
& Horvitz 2013a). Distant detection and analysis enables
researchers to monitor relationships of temporal mental health
dynamics to adverse conditions such as war, economic crisis
or a pandemic such as the Coronavirus (COVID-19) pandemic.
COVID-19, a novel virus, proved to be fatal in many cases
during the global pandemic that started in 2019. Governments
reacted to the pandemic by placing measures restricting the
movement of people on and within their borders in an attempt
to slow the spread of the virus. The restrictions came in
the form of many consecutive temporary policies that varied
across countries in their execution. We focus on arguably the
most disruptive measure: the national lockdown. This required individuals
other than essential workers (e.g. healthcare
professionals) to remain in their own homes. Lockdown
enforcement varied across countries, but the premise was that
individuals were only permitted to leave their homes briefly
for essential shopping (food and medicine). This policy had
far-reaching social and economic impacts: growing concern
for individuals' own and their families' health and economic
well-being, and financial uncertainty as certain industries (such
as hospitality, retail and travel) suspended operations. As a
result, many individuals were made redundant, which
constrained their financial resources, and confinement
to their homes resulted in excess leisure time.
These experiences, along with uncertainty about the measures'
duration, created a unique period in which the general public
experienced similar stress and anxiety,
both of which are associated with clinical depression
(Hecht et al. 1989, Rickels & Schweizer 1993).
In this paper, we investigate the task of detecting whether
a user is diagnosis-worthy over a given period of time and
explore what the appropriate time period might be. We investigate
the role of class balance in datasets by experimenting
with a variety of training regimes. Finally, we examine the
temporal mental health dynamics in relation to the respective
national lockdowns and investigate how these temporal mental
health dynamics varied across countries highly disrupted by
the pandemic. Our main contributions in this paper are:
1) We demonstrate an improvement in mental health detection
performance with increasingly enriched sample representations.
2) We highlight the importance of the balance of
classes in the training dataset whilst remaining aware of an
approximate expected balance of classes in the unsupervised
(test) dataset. 3) We analyse empirically proven relationships
between populations' temporal mental health dynamics and
respective national lockdowns that can be used for strategic
decision-making purposes.
II. RELATED RESEARCH
A. Natural Language Processing for Mental Health Detection
Unlike physical health conditions that often show physical
symptoms, mental health is often reflected by more subtle
symptoms (Chung & Pennebaker 2007, De Choudhury, Counts
& Horvitz 2013a). This yielded a body of work that focused
on linguistic analysis of lexical and semantic uses in speech,
such as diagnosing a patient with depression and paranoia
(Oxman et al. 1982). Furthermore, an examination of college
students' essays found an increased use of negative emotional
lexical content in the group of students that had high scores on
depression scales (Rude et al. 2004). Such findings confirmed
that language can be an indicator of an individual's psychological
state (Bucci & Freedman 1981), which led to the
development of the Linguistic Inquiry and Word Count (LIWC)
software (Pennebaker et al. 2003, Tausczik & Pennebaker
2010), which allows users to evaluate texts based on word
counts in a variety of categories. More recent and larger-scale
computational linguistics has been applied to conversational
counselling by utilising data from an SMS service where
vulnerable users can engage in therapeutic discussion with
counsellors (Althoff et al. 2016). For a more in-depth review of
uses of natural language processing (NLP) techniques applied
in mental health the reader is referred to Trotzek et al. (2018).
B. Social Media as a Platform for Mental Health Monitoring
The widespread engagement in social media platforms by
users, coupled with the availability of platforms' data, enables
researchers to extract population-level health information that
makes it possible to track diseases, medications and symptoms
(Paul & Dredze 2011). The use of social media data is
attractive to researchers not only due to its vast domain
coverage but also due to the cheap methodologies by which
data can be collected in comparison to previously available
methodologies (Coppersmith et al. 2014). A plethora of mental
health monitoring literature has utilised these cheap and
efficient data mining methodologies across a variety of social
media platforms, such as Reddit (Losada & Crestani 2016),
Facebook (Guntuku et al. 2017) and Twitter (De Choudhury,
Gamon, Counts, & Horvitz 2013).
Twitter users' engagement with the popular social media
platform gives rise to social patterns that
can be analysed by researchers, making this platform a
widely used source for data mining. Additionally, the
customisable query parameters available in the Application
Programming Interface (API) allow researchers to monitor
specific populations and/or domains (De Choudhury, Counts
& Horvitz 2013b).
C. Mental Health Monitoring During COVID-19 Pandemic
In the context of the COVID-19 pandemic, we found a
handful of projects with intentions similar to our own: to monitor
depression during the pandemic. Li et al. (2020) gather
large-scale, pandemic-related Twitter data and infer depression
based on emotional characteristics and sentiment analysis of
tweets. Zhou et al. (2020) focus on detecting community-level
depression in Australia during the pandemic. They use the
distant-supervision methodology of Shen et al. (2017) to
gather a balanced dataset and the methodology of
Coppersmith et al. (2014) to model the rates of depression,
observing the relationship with the number of COVID-19
infections in the community. Our work differs from this in
three main areas: 1) We investigate the implication of different
sample representations to provide more context to our classifier.
2) We retain an imbalance in our development dataset.
3) We investigate European countries (France, Germany, Italy,
Spain and the United Kingdom) that experienced a relatively
high number of COVID-19 infections.
III. DIAGNOSIS CLASSIFIER EXPERIMENTS
In this section we describe the data mining methodology
used to build a distantly supervised dataset and the classifier
experiments conducted on this dataset.
A. Data
To conduct the proposed experiments, we first construct a
distantly supervised development dataset for each country, to
be used in training and validation of the classifier. The data
mining methods follow the distant-supervision methodology
proposed in Coppersmith et al. (2014), as it is relatively
cheap but also well-structured for clinical experiments.
We follow the widely-accepted methodology proposed by
Watson (1768), where diagnosed (Diagnosed) and non-diagnosed
(Control) groups are created. In this paper we
explore only depression as a mental health condition;
accordingly, we have a single Diagnosed group for each
country's development dataset. However, if multiple mental
health issues were to be explored, then the same number of different
Diagnosed groups would be required for each country's
dataset.
1) Diagnosed Group: We gather 200 public tweets with
a geolocation inside the country of interest, posted during
a two-week period in 2019. As we are searching for
depression Diagnosed tweets, this two-week period needs
to be chosen strategically, as we want to capture users that
have been diagnosed with depression rather than seasonal
affective disorder (SAD), a separate condition with
similar symptoms. Tweets, collected via Twitter's API1, were
retrieved based on lexical content indicating that the user has
a history of, or is currently dealing with, a clinical case, e.g. I was
diagnosed with depression, rather than expressing depression
in a colloquial context. Human annotators were then instructed
to remove tweets perceived to make a non-genuine
statement regarding the user's own diagnosis; most
of these referred to a third party. Examples of genuine
and non-genuine tweets encountered can be seen in Table I.
1 Twitter API: https://developer.twitter.com/en/docs
TABLE I
EXAMPLE OF ANNOTATION
Diagnosis indication | Example tweet
Genuine | I was diagnosed with severe depression and went through the works of treatment for it.
Non-genuine | It's official. My guinea pig has been diagnosed with depression
We then collect all (up to the 5,000 most recent) tweets made
public by the remaining users between the start of 2015 and
October 2019. Further filtering removes all users
with fewer than 20 tweets during this period and those whose
tweets do not meet our major language of instruction benchmark.
This benchmark requires 70% of the tweets collected to
be written in the major language of instruction of the country
of interest (e.g. English for the United Kingdom, Italian for Italy).
Following this filtering process, we apply preprocessing at the
tweet level: medial capital splitting; mention
white-space removal (i.e. any mention of another user
is shown as a unique mention token); the same
treatment for URLs; and removal of all uppercase and
non-emoticon-related punctuation.
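The filtering and preprocessing steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: the placeholder tokens `mentiontoken` and `urltoken`, the small ASCII emoticon pattern, the reading of "removal of all uppercase" as lowercasing, and the per-tweet language codes passed to `keep_user` are all hypothetical choices.

```python
import re

# Hypothetical placeholder tokens; the paper does not name the tokens it uses.
MENTION_TOKEN = "mentiontoken"
URL_TOKEN = "urltoken"

# Assumption: only a few common ASCII emoticons are protected from removal.
EMOTICON = re.compile(r"[:;=8][\-']?[)(\]\[dpo/\\|]")


def preprocess_tweet(text):
    """Tokenise mentions and URLs, split medial capitals, lowercase,
    and strip punctuation that is not part of an emoticon."""
    text = re.sub(r"https?://\S+", f" {URL_TOKEN} ", text)
    text = re.sub(r"@\w+", f" {MENTION_TOKEN} ", text)
    # Medial capital splitting: "MentalHealth" -> "Mental Health".
    text = re.sub(r"(?<=[a-z])(?=[A-Z])", " ", text)
    text = text.lower()
    tokens = []
    for tok in text.split():
        if EMOTICON.fullmatch(tok):
            tokens.append(tok)  # keep emoticons untouched
            continue
        tok = re.sub(r"[^\w]", "", tok)  # drop remaining punctuation
        if tok:
            tokens.append(tok)
    return " ".join(tokens)


def keep_user(tweet_langs, major_lang, min_tweets=20, min_share=0.7):
    """User-level filter: at least `min_tweets` tweets, of which at least
    `min_share` are in the country's major language.
    `tweet_langs` is an assumed list of per-tweet language codes."""
    if len(tweet_langs) < min_tweets:
        return False
    share = sum(1 for lang in tweet_langs if lang == major_lang) / len(tweet_langs)
    return share >= min_share
```

Under these assumptions, a user with 20 tweets of which 14 are in English would just pass the 70% benchmark for the United Kingdom.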
2) Control Group: We gather 10,000 public tweets with
a geolocation in the country of interest, posted during the same
two-week period in 2019 as for the Diagnosed group, and remove
any tweets made by Diagnosed group users. We then
follow a process similar to the Diagnosed collection
methodology, collecting up to the 5,000 most recent tweets for
each user from the period mentioned above.
TABLE II
COMPOSITION OF DEVELOPMENT DATASETS
Country | Group | No. Users | No. Tweets
France | Diagnosed | 57 | 190,447
France | Control | 1,041 | 2,861,580
Germany | Diagnosed | 53 | 160,864
Germany | Control | 1,138 | 2,802,959
Italy | Diagnosed | 38 | 132,743
Italy | Control | 1,051 | 2,514,483
Spain | Diagnosed | 53 | 107,833
Spain | Control | 1,013 | 2,564,966
U.K. | Diagnosed | 98 | 289,624
U.K. | Control | 1,365 | 3,319,201
As can be seen in Table II, we construct imbalanced
datasets. The World Health Organisation (WHO) claims that 264 million
people suffer from depression worldwide2, whilst, at the
time of writing, the global population is approximately 7.8
billion3. This would suggest that 1 in 30 individuals suffer
from depression. However, these figures are approximations.
Therefore, the extent to which our datasets are imbalanced is
not an attempt to create datasets that are representative of the
expected balance of classes, as this is unverifiable. Nevertheless,
our datasets present ratios of Control:Diagnosed
samples between 23.78:1 and 11.46:1, which came about from
the data mining methods previously described. We accept
these ratios to retain imbalanced datasets of a similar order of
magnitude to the expected balance whilst achieving reasonable
classifier performance.
2 World Health Organization, Depression, 2020, [Online]. Available:
https://www.who.int/news-room/fact-sheets/detail/depression [Accessed: 26
July 2020]
We inherit the caveats to the distant-supervision
approach of Coppersmith et al. (2014):
(a) When sampling a population we always run the risk
of capturing only a subpopulation of Control or
Diagnosed that is not fully representative of the population.
This is especially relevant as Diagnosed samples
are identified by the fact that they publicly speak
out about what is a deeply personal subject; this attribute
may not generalise well to the entire population.
(b) We do not verify the method used
to identify users in Diagnosed but rather rely on the
social stigma around mental illness, whereby it could be
regarded as unusual for a user to tweet about a fictitious
diagnosis of a mental health illness.
(c) Control is likely contaminated with users that are
diagnosed with a variety of conditions, perhaps mental
health related, whether they explicitly mention this or not.
We have made no attempt to remove such users.
(d) Depression is often comorbid with other mental health
issues (Aina & Susman 2006). As such, it is plausible
that the users forming Diagnosed are suffering from
other mental health conditions. This could suggest that the
classifier will be trained to pick up these hidden meaning
representations of other mental health issues and classify
them as depression. We have made no attempt to further
investigate nor remove such users from Diagnosed
as having a complex Diagnosed group is a realistic
representation of the task.
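The class-balance figures quoted before the caveats can be checked with simple arithmetic. The sketch below assumes that the Control:Diagnosed ratios are computed over tweet counts (the "No. Tweets" column of Table II), which is the reading that matches the reported range to within rounding; the dictionary and function names are illustrative only.

```python
# Tweet counts per country copied from Table II: (Diagnosed, Control).
TWEETS = {
    "France": (190_447, 2_861_580),
    "Germany": (160_864, 2_802_959),
    "Italy": (132_743, 2_514_483),
    "Spain": (107_833, 2_564_966),
    "U.K.": (289_624, 3_319_201),
}


def control_to_diagnosed_ratio(diagnosed, control):
    """Number of Control samples per Diagnosed sample."""
    return control / diagnosed


ratios = {
    country: control_to_diagnosed_ratio(diagnosed, control)
    for country, (diagnosed, control) in TWEETS.items()
}
most_imbalanced = max(ratios, key=ratios.get)   # country with the largest ratio
least_imbalanced = min(ratios, key=ratios.get)  # country with the smallest ratio

# Expected prevalence from the WHO and Worldometer figures cited in the text.
people_per_case = 7.8e9 / 264e6  # roughly one case per this many people

for country, ratio in ratios.items():
    print(f"{country}: {ratio:.2f}:1")
print(f"Approximately 1 in {people_per_case:.1f} people")
```

On this reading, Spain gives the upper end of the reported range and the U.K. the lower end.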
B. Methodology
In this section we describe the experiments conducted in
training our classifier to diagnose depression. The trained
classifier is deployed in Section IV for classifying samples
from an unsupervised experiment dataset which is then used
in analysing temporal mental health dynamics.
1) Sample Representation: We investigate the most appropriate
sample representation of our distantly supervised
dataset. We are faced with these considerations:
(a) Symptoms' temporal dependencies: as the tweets gathered
come from a variety of days, weeks, months and even
years, symptoms may only be present in specific
time-dependent samples. However, when samples are overwhelmingly
tweet-enriched, classifier performance
is traded off against retaining the symptoms' temporal
dependencies.
(b) As our final task will be to monitor and analyse the temporal
mental health dynamics, we are interested in modelling
the rate of depression at as fine a granularity as possible.
3 Worldometer. 2020. Worldometer - Real Time World Statistics. [online]
Available at: https://www.worldometers.info [Accessed 19 August 2020].
Therefore, the ability to accurately identify Diagnosed
samples and correctly discriminate between Control and
Diagnosed with the least tweet-enriched samples will be
vital in modelling a fine-grained rate of depression in the
deployment stage of the final task where conclusions could be
drawn in the context of the national lockdowns. The sample
representations we examined:
• Individual - each sample consists of a single tweet.
• User day - each sample consists of all tweets by a
unique user made public during a given day.
• User week - each sample consists of all tweets by a
unique user made public during a given week.
• All user - each sample consists of all tweets made
public by a unique user.
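The four sample representations listed above amount to grouping tweets under different keys. The following is a minimal sketch, assuming each tweet is available as a (user id, timestamp, text) tuple and that a "week" means an ISO calendar week; the function name, the representation labels and the choice to join tweets with a space are illustrative assumptions.

```python
from collections import defaultdict
from datetime import datetime


def build_samples(tweets, representation):
    """Group (user_id, timestamp, text) tuples into samples.

    representation: "individual", "user_day", "user_week" or "all_user".
    Returns a dict mapping a grouping key to the concatenated tweet text.
    """
    groups = defaultdict(list)
    for index, (user, timestamp, text) in enumerate(tweets):
        if representation == "individual":
            key = (user, index)  # one sample per tweet
        elif representation == "user_day":
            key = (user, timestamp.date())
        elif representation == "user_week":
            iso = timestamp.isocalendar()  # (ISO year, ISO week, weekday)
            key = (user, iso[0], iso[1])
        elif representation == "all_user":
            key = (user,)
        else:
            raise ValueError(f"unknown representation: {representation}")
        groups[key].append(text)
    return {key: " ".join(texts) for key, texts in groups.items()}


# Hypothetical toy data, for illustration only.
example = [
    ("u1", datetime(2020, 3, 23, 9), "a"),
    ("u1", datetime(2020, 3, 23, 18), "b"),
    ("u1", datetime(2020, 3, 25, 9), "c"),
    ("u1", datetime(2020, 3, 30, 9), "d"),
    ("u2", datetime(2020, 3, 23, 9), "e"),
]
counts = {r: len(build_samples(example, r))
          for r in ("individual", "user_day", "user_week", "all_user")}
```

Coarser keys yield fewer, more tweet-enriched samples, which is the trade-off between context and temporal granularity discussed in the considerations.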
We examine the performance of a benchmark, a Support Vector
Machine (SVM) with a linear kernel function (Peng et al.
2019), on datasets built with the different sample representations, where
the benchmark classifier inputs are sparse many-hot
encodings of the samples' lexical content.
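A many-hot encoding marks which vocabulary items occur in a sample, ignoring how often they occur. The sketch below is one possible reading, assuming whitespace tokenisation and a vocabulary built from the training samples; the helper names are hypothetical, and the exact encoding used for the benchmark is not specified here.

```python
def build_vocabulary(samples):
    """Assign each distinct whitespace-separated token an integer index."""
    vocab = {}
    for sample in samples:
        for token in sample.split():
            vocab.setdefault(token, len(vocab))
    return vocab


def many_hot_sparse(sample, vocab):
    """Sparse many-hot encoding: sorted indices of the tokens present.
    Tokens outside the vocabulary are ignored."""
    return sorted({vocab[token] for token in sample.split() if token in vocab})


def to_dense(indices, size):
    """Expand a sparse index list into a dense 0/1 vector."""
    dense = [0] * size
    for index in indices:
        dense[index] = 1
    return dense


vocab = build_vocabulary(["i feel sad", "i feel fine fine"])
sparse = many_hot_sparse("fine fine i unknown", vocab)
dense = to_dense(sparse, len(vocab))
```

Vectors of this kind could then be passed to any linear SVM implementation; a repeated token such as "fine" contributes a single 1, which is what distinguishes many-hot from count-based encodings.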
As we are working with imbalanced datasets, we need to
consider the metrics we use to assess the classifiers' performance.
The accuracy metric is insufficient for imbalanced
datasets, which is best illustrated with an example. If we have
a dataset with a 24:1 ratio between the samples of each
class, the classifier could achieve 96% accuracy by classifying
every sample as the majority class. The classifier is clearly not
discriminating between the distributions of the two classes
yet achieves high performance. As such, we
assess the performance of the classifier using the individual
classes' Precision (P), Recall (R) and F1 score measures, as
well as the Macro F1 score, for this and the remaining
experiments in this paper. The Precision measure tells us:
of all the samples the classifier labelled as a particular class,
what fraction are correct. The Recall measure tells us: of
all the samples that actually belong to that particular class,
what fraction did the classifier correctly identify. The F1
score is the harmonic mean of the two, and the Macro
F1 score is the unweighted mean of the F1 scores of all
classes. By having a more class-specific
breakdown of the classifiers' performance we can
better understand the strengths and limitations of our classifiers
and hence make a more informed decision when choosing the
highest performing classifier.
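The 24:1 example can be made concrete with a short sketch of the metrics described above. The label names and helper functions are illustrative; precision, recall and F1 are taken as 0 when their denominator is 0, which is a common convention assumed here rather than one stated in the text.

```python
def precision_recall_f1(y_true, y_pred, cls):
    """Per-class Precision, Recall and F1 (0.0 when undefined)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1


def macro_f1(y_true, y_pred, classes):
    """Unweighted mean of the per-class F1 scores."""
    scores = [precision_recall_f1(y_true, y_pred, c)[2] for c in classes]
    return sum(scores) / len(scores)


def accuracy(y_true, y_pred):
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


# 24 Control samples for every Diagnosed sample, all predicted as Control.
y_true = ["Control"] * 24 + ["Diagnosed"]
y_pred = ["Control"] * 25

acc = accuracy(y_true, y_pred)
diagnosed_scores = precision_recall_f1(y_true, y_pred, "Diagnosed")
macro = macro_f1(y_true, y_pred, ["Control", "Diagnosed"])
```

Here the accuracy is 0.96 while every Diagnosed measure is 0 and the Macro F1 is below 0.5, which is why the class-specific measures are the more informative summary.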
The results in Table III suggest that our benchmark
classifier improved in identifying Diagnosed with
increasingly tweet-enriched samples. Interestingly, however,
when presented with the User day sample representation,
a sharp decrease in performance on Control samples
causes a decrease in Macro F1 score when compared with
the F1 scores of both the Individual and User week sample
representations. Barring this decrease in Macro F1 score, we
can say that we are able to achieve improved performance
when using increasingly tweet-enriched samples. However,
the end task would benefit from fine-grained modelling
of the rate of depression, providing us with more detailed
relationships between the temporal mental health dynamics
and noteworthy dates. As such, our task is biased towards the
two fine-grained sample representations, Individual and
User day. As our benchmark classifier achieves superior
performance on the Individual sample representation, we
adopt this representation, as denoted by the asterisk in
Table III.
TABLE III
SVM PERFORMANCE ON VARYING SAMPLE REPRESENTATIONS OF U.K. DEVELOPMENT DATASETS
Sample Representation | Control P | Control R | Control F1 | Diagnosed P | Diagnosed R | Diagnosed F1 | Macro F1
Individual* | 0.92 | 0.99 | 0.95 | 0.27 | 0.05 | 0.08 | 0.52
User day | 0.96 | 0.74 | 0.84 | 0.06 | 0.36 | 0.1 | 0.52
User week | 0.92 | 0.97 | 0.94 | 0.26 | 0.11 | 0.15 | 0.55
All user | 0.94 | 0.99 | 0.96 | 0.5 | 0.14 | 0.22 | 0.59
2) Classifier Experiments on U.K. Development Dataset:
Having chosen the sample representation that balances
our fine-grained sample requirements with the benchmark
classifier performance, we must now build a classifier that
best discriminates between our two classes. The higher the
performance of the classifier, the more accurate the temporal
mental health dynamics will be in Section IV. We outline the
classifier architectures included in our experimentation:
• SVM: Linear kernel SVM as used in Section III-B1. This
classifier will serve as our benchmark.
• AVEPL_EFC (see footnote 4): Average pooling layer.
• CNN-MXPL_EFC: Unigram-level CNN with 1 filter
and kernel size of 1 and a Max-pooling layer.
• BILSTM_EFC: Bi-directional LSTM (Hochreiter &
Schmidhuber 1997).
• CNN-BILSTM_EFC: Unigram-level CNN with 1
filter and kernel size of 1 and a Bi-directional LSTM.
• CNN-ATT: Unigram-level CNN with 1 filter and
kernel size of 1, Attention layer (Vaswani et al. 2017)
and an Average pooling layer.
• BILSTM-SELFA: Bi-directional LSTM and a Self-attention layer.
• BERT: Pretrained BERT_Base (see footnote 5) fine-tuned on our
dataset (Devlin et al. 2018).
4 All uses of X_EFC indicate a learned embedding layer as the first layer
in the classifier, after the input, and 3 Fully Connected layers following the
specified classifier architecture, the activations of which are Rectified Linear
(ReLU), before the output layer with a sigmoid activation.
5 Exact pretrained BERT_Base version implementation available here:
https://tfhub.dev/google/bert_uncased_L-12_H-768_A-12/1
We use an Adam optimiser (Kingma
& Ba 2015) with a learning rate of 0.01 and a batch
size of 1,000. All classifiers were trained for a single epoch
with a training:validation split of 4:1, weighting the
samples of Diagnosed as 5 times more valuable than those
of Control. Training was done on a single Tesla P100-PCIE
with 16GB of RAM available through Google's Colaboratory6.
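One way to read "weighting the samples of Diagnosed as 5 times more valuable" is as a per-sample weight on a binary cross-entropy loss. The sketch below illustrates that reading only; the loss function, the clipping constant and the normalisation by the number of samples are assumptions, as the training objective is not spelled out in the text.

```python
import math


def weighted_binary_cross_entropy(y_true, y_prob, positive_weight=5.0, eps=1e-7):
    """Mean binary cross-entropy where Diagnosed (label 1) samples count
    `positive_weight` times as much as Control (label 0) samples.

    Assumption: the weighted losses are averaged over the number of samples.
    """
    total = 0.0
    for label, prob in zip(y_true, y_prob):
        prob = min(max(prob, eps), 1.0 - eps)  # avoid log(0)
        weight = positive_weight if label == 1 else 1.0
        loss = -(label * math.log(prob) + (1 - label) * math.log(1.0 - prob))
        total += weight * loss
    return total / len(y_true)


# The same level of uncertainty costs five times more on a Diagnosed sample.
loss_diagnosed = weighted_binary_cross_entropy([1], [0.5])
loss_control = weighted_binary_cross_entropy([0], [0.5])
```

With this weighting, misclassifying a Diagnosed sample is penalised more heavily, which pushes the classifier towards higher Diagnosed recall at some cost in precision.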
TABLE IV
CLASSIFIERS PERFORMANCE ON U.K. DEVELOPMENT DATASET
Classifier | Control P | Control R | Control F1 | Diagnosed P | Diagnosed R | Diagnosed F1 | Macro F1
SVM | 0.92 | 0.9 | 0.91 | 0.27 | 0.05 | 0.08 | 0.52
AVEPL | 0.94 | 0.95 | 0.94 | 0.33 | 0.26 | 0.29 | 0.62
CNN-MXPL | 0.94 | 0.95 | 0.94 | 0.31 | 0.25 | 0.28 | 0.61
BILSTM | 0.94 | 0.95 | 0.94 | 0.3 | 0.33 | 0.31 | 0.63
CNN-BILSTM | 0.93 | 0.97 | 0.95 | 0.3 | 0.15 | 0.2 | 0.57
CNN-ATT | 0.94 | 0.94 | 0.94 | 0.31 | 0.3 | 0.3 | 0.62
BILSTM-SELFA* | 0.94 | 0.94 | 0.94 | 0.32 | 0.31 | 0.31 | 0.63
BERT | 0.94 | 0.66 | 0.78 | 0.12 | 0.52 | 0.2 | 0.47
Table IV shows that all classifiers achieve significantly
higher performance on Control than Diagnosed. As
we are trying to correctly detect Diagnosed samples
and discriminate between the two classes, we prioritise the
Diagnosed Precision and Macro F1 score metrics. Guided
by these two metrics, three candidates emerge: AVEPL,
BILSTM and BILSTM-SELFA, achieving {Diagnosed Precision,
Macro F1} scores of {0.33, 0.62}, {0.3, 0.63} and {0.32,
0.63} respectively. Whilst the performance of these classifiers
is similar, BILSTM-SELFA achieves the best
combination of the desired metrics
(indicated by the asterisk) and as such we adopt
this classifier in further experiments.
3) Dataset Balance Experiment: In this section we investigate
the distribution of our datasets in training and validation
of our classifier. By conducting this experiment we intend to
gather an in-depth understanding of our task from a linguistic
standpoint. We train and validate the classifier on datasets with
varying balances to investigate the role of our imbalanced
dataset in the depression diagnosis task. This experiment
analyses the performance of the BILSTM-SELFA classifier
on a number of different training regimes:
• Balanced: a dataset containing all Diagnosed samples
and a downsampled set of Control samples.
• Imbalanced: a dataset with the development dataset's
distribution (see Table II).
Furthermore, we explore the effects of sample weighting
of the classes by weighting Diagnosed samples as 5 times
more valuable than Control samples, as mentioned in the
previous experiment. The performance of the BILSTM-SELFA
classifier on the different training regimes can be
seen in Table V.
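The Balanced regime described above keeps every Diagnosed sample and downsamples Control. A minimal sketch follows, assuming labels 0 (Control) and 1 (Diagnosed), uniform random sampling without replacement and a fixed seed for repeatability; none of these details are specified in the text.

```python
import random


def downsample_to_balance(samples, labels, majority_label=0, seed=0):
    """Keep all minority samples and an equally sized random subset of the
    majority class. Returns (samples, labels) in their original order."""
    rng = random.Random(seed)
    minority = [i for i, label in enumerate(labels) if label != majority_label]
    majority = [i for i, label in enumerate(labels) if label == majority_label]
    kept = rng.sample(majority, min(len(minority), len(majority)))
    indices = sorted(minority + kept)
    return [samples[i] for i in indices], [labels[i] for i in indices]


# Hypothetical toy data: 24 Control samples and 2 Diagnosed samples.
toy_samples = [f"tweet {i}" for i in range(26)]
toy_labels = [0] * 24 + [1] * 2
balanced_samples, balanced_labels = downsample_to_balance(toy_samples, toy_labels)
```

The Imbalanced regime corresponds to skipping this step and using the development dataset as collected.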
6 Google Colaboratory: https://colab.research.google.com
TABLE V
BILSTM-SELFA PERFORMANCE ON VARYING TRAINING REGIMES
Training | Validation | Sample Weighting | Control P | Control R | Control F1 | Diagnosed P | Diagnosed R | Diagnosed F1 | Macro F1
Balanced | Balanced | None | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69 | 0.69
Imbalanced | Imbalanced | None | 0.93 | 1 | 0.96 | 0.72 | 0.11 | 0.19 | 0.58
Imbalanced | Imbalanced | Weighted | 0.94 | 0.94 | 0.94 | 0.32 | 0.31 | 0.31 | 0.63
Balanced | Imbalanced | None | 0.95 | 0.63 | 0.76 | 0.14 | 0.66 | 0.23 | 0.49
Balanced | Imbalanced | Weighted | 0.99 | 0.13 | 0.23 | 0.09 | 0.98 | 0.16 | 0.2
Imbalanced | Balanced | None | 0.53 | 1 | 0.69 | 0.98 | 0.11 | 0.2 | 0.45
Imbalanced | Balanced | Weighted | 0.58 | 0.96 | 0.72 | 0.88 | 0.29 | 0.44 | 0.58
The Balanced-Balanced (Training-Validation) training
regime achieves encouraging results in terms of its Precision-Recall
trade-off for both classes, as well as the Macro F1
score. This shows that the problem is reasonably achievable
linguistically when the imbalance challenge is removed from
the equation. The Imbalanced-Imbalanced training regime
shows that adjusting the sample weighting is a successful measure
for adjusting the Precision-Recall trade-off
in our class of interest (Diagnosed). Our classifier performs
significantly worse in the Balanced-Imbalanced regime
than in the Imbalanced-Imbalanced regime, and its performance
is reduced further by the introduction of sample weighting.
This means that when trained
on a Balanced dataset our classifier is less robust to an
Imbalanced dataset at validation. Finally, our classifier
experiences a significant improvement in performance on the
Imbalanced-Balanced training regime when sample weighting
is introduced. However, as our final depression diagnosis task
involves an Imbalanced unsupervised dataset (discussed
in Section III-A2), the training regimes with
Balanced validation datasets are not suitable approximations
of our classifier's depression diagnosis performance. Therefore,
we conclude that Imbalanced training, with suitable
sample weighting, yields more desirable and robust depression
diagnosis performance, as the classifier is able to see a broader range of
data examples in training (i.e. no sub-sampling).
C. Results
We train separate BILSTM-SELFA classifiers for each
of the countries' imbalanced development datasets following
the Individual sample representation. The test performance
of these classifiers can be seen in Table VI.
TABLE VI
BILSTM-SELFA CLASSIFIER PERFORMANCE ON COUNTRIES' DEVELOPMENT DATASETS
Country | Control P | Control R | Control F1 | Diagnosed P | Diagnosed R | Diagnosed F1 | Macro F1
France | 0.96 | 0.94 | 0.95 | 0.29 | 0.37 | 0.33 | 0.64
Germany | 0.96 | 0.96 | 0.96 | 0.3 | 0.32 | 0.31 | 0.63
Italy | 0.97 | 0.96 | 0.97 | 0.38 | 0.4 | 0.39 | 0.68
Spain | 0.97 | 0.97 | 0.97 | 0.35 | 0.38 | 0.36 | 0.67
U.K. | 0.94 | 0.94 | 0.94 | 0.32 | 0.31 | 0.32 | 0.63
We observe that the BILSTM-SELFA classifier
architecture achieved similar performance on the remaining
countries' datasets to that achieved on the U.K. dataset. This
shows that the BILSTM-SELFA architecture is able to
generalise well to different languages and cultural differences
after training, producing an encouraging set of results
and increasing our confidence in its classification ability. Whilst
the BILSTM-SELFA classifier architecture achieved the
highest performance of all our classifier architectures, a
combination of 0.32 Diagnosed Precision and 0.63 Macro
F1 score leaves much to be desired. As such, we perform an
error analysis and examine the significance of its results.
1) Error Analysis: Table VII shows the input samples (Text),
the Prediction Type and the Sigmoid Output, which is
produced by the output layer of the classifier and is responsible for the
final classification of the samples. The Sigmoid Output is
normalised to the range [0, 1] ⊂ R, where an output of 0.5
represents the decision boundary and can therefore be interpreted
as complete uncertainty by the classifier as to how the sample
should be classified. A Sigmoid Output of 1 is complete
certainty that the sample should be classified as positive
(Diagnosed) and an output of 0 is complete certainty the
sample should be classified as negative (Control).
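The interpretation of the Sigmoid Output can be sketched as a simple thresholding rule. The function names are illustrative, and assigning an output of exactly 0.5 to Diagnosed is an arbitrary tie-breaking assumption; the four outputs used below are the values reported in Table VII.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function mapping a real value to (0, 1)."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def classify(sigmoid_output, threshold=0.5):
    """Return the predicted class and the distance from the decision boundary.

    Assumption: an output exactly at the threshold is labelled Diagnosed.
    """
    label = "Diagnosed" if sigmoid_output >= threshold else "Control"
    margin = abs(sigmoid_output - threshold)
    return label, margin


# Sigmoid Outputs reported in Table VII.
for output in (0.999, 0.507, 0.19, 0.001):
    label, margin = classify(output)
    print(f"{output}: {label} (margin {margin:.3f})")
```

The margin gives a rough sense of how polarised a prediction is: an output of 0.507 sits almost on the boundary, whereas 0.999 and 0.001 are close to the extremes.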
TABLE VII
CLASSIFICATION EXAMPLES FOR ERROR ANALYSIS
Prediction Type | Text | Sigmoid Output
True Positive | hi davenport handmade is a small one man business i make handmade wooden bowls pens jewellery boxes and other wooden items in a workshop that i built myself it started as a way of overcoming depression and has taken over my life | 0.999
False Positive | im too depressed lol | 0.507
False Negative | "i miss you too man its actually depressing me" | 0.19
True Negative | half term kids camps are up on wandsworth common with a dedicated kids football camp | 0.001
We observe that the True Positive example mentions
"overcoming depression", which implies that the user has
recovered from depression, as one overcomes other health
issues. The Sigmoid Output for this sample is 0.999, which
indicates extremely high certainty by the classifier that this sample
follows the distribution of Diagnosed. At the other
end of the scale, the True Negative sample discusses a topic
that is completely unrelated to depression and does not imply that
the individual suffers from it; as such, it is classified as part of
Control with a Sigmoid Output of 0.001.
However, we find that the Texts of the two samples misclassified
by the classifier are rather similar. They both use words
derived from 'depress' in rather colloquial contexts,
with no indication of a past diagnosis or clinical sense
of depression - a distinction we would want our
classifier to be able to make. It is also
noteworthy that the Sigmoid Outputs of these two samples are
much less polarised than those of the correctly classified samples, with
the Sigmoid Output of the False Positive sample only just
falling within the Diagnosed classification region.
These two incorrectly classified samples reflect the complexity
of depression diagnosis from distantly supervised tweets.
2) Significance of Results: We investigate the significance
of our classifiers' results by performing a $\chi^2$ significance
test. Our null hypothesis, $H_0$, states that both sets of data,
our classifiers' predictions ($D_P$) and the distribution it is
being tested against ($D_T$), have been drawn from the same
distribution ($D$).
$$H_0 : D_P \cap D_T \subseteq D \tag{1}$$
We compare the distribution of the classifiers' predictions
against a random uniformly distributed set (denoted by Uniform)
and against a randomly distributed set following the distribution
of the development datasets (denoted by Weighted). All
classifier results in Table VI differ statistically significantly from
the random baselines according to the $\chi^2$ significance test
(see Table VIII in Appendix A). Therefore, we can reject $H_0$
and conclude that the predictions of the classifier and those of
the respective randomly distributed benchmarks have not been
drawn from the same distribution.
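The exact construction of the test is given in the appendix rather than here, so the following is only one plausible reading: a goodness-of-fit test of the predicted class counts against the class proportions of a baseline (Uniform or Weighted). The function names and the example counts are hypothetical, and the p-value formula is valid only for one degree of freedom, i.e. two classes.

```python
import math


def chi_square_statistic(observed_counts, expected_proportions):
    """Pearson chi-square statistic of observed counts against the counts
    expected under the given class proportions."""
    total = sum(observed_counts)
    return sum(
        (observed - total * proportion) ** 2 / (total * proportion)
        for observed, proportion in zip(observed_counts, expected_proportions)
    )


def p_value_one_dof(statistic):
    """Upper-tail probability of a chi-square variable with 1 degree of
    freedom, using P(Z^2 > x) = erfc(sqrt(x / 2)) for standard normal Z."""
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical predicted counts: [Control, Diagnosed].
predicted = [900, 100]

uniform = [0.5, 0.5]
# Weighted baseline following the U.K. development dataset (Table II tweets).
control, diagnosed = 3_319_201, 289_624
weighted = [control / (control + diagnosed), diagnosed / (control + diagnosed)]

stat_uniform = chi_square_statistic(predicted, uniform)
stat_weighted = chi_square_statistic(predicted, weighted)
p_uniform = p_value_one_dof(stat_uniform)
p_weighted = p_value_one_dof(stat_weighted)
```

A small p-value leads to rejecting the hypothesis that the predictions follow the baseline's class proportions; with only two classes, the statistic has a single degree of freedom.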
IV. MONITORING AND ANALYSIS
We prepare the unsupervised dataset and deploy the previously
trained BILSTM-SELFA classifier to annotate this
dataset. We analyse the relationships between the temporal
mental health dynamics and the respective national lockdown dates.
A. Data
In this section we discuss the procedure for constructing the
unsupervised experiment dataset, to be used for monitoring
the temporal mental health dynamics during the respective
pandemic-inflicted national lockdowns.
1) Experiment Dataset: The experiment dataset is used
for the deployment of the classifier, which is trained and
validated on the development set, to analyse the temporal
mental health dynamics of a country. We start by gathering
tweets made public by users during the first two weeks
of 2020 with a geolocation within the country of interest.
We then follow the same data mining methodology as for
Control, outlined in Section III-A, for the period from
1 December 2019 until 15 May 2020, applying similar
user-level and tweet-level preprocessing and filtering
to these datasets. The composition of these experiment datasets,
along with key dates, is given in Table IX in Appendix B.
The key dates are the official dates on which the
commencement of, and the first step towards easing, the
national lockdowns were announced, rather than the dates on
which these measures were first implemented, as we anticipate
that the announcement would provoke users to express their
opinion more than the implementation of the measures. We
acknowledge some caveats to the methodology in relation
to the temporal mental health dynamics during the respective
national lockdowns:
(a) The activity-level of users whose lifestyles have been
highly disrupted by the national lockdowns may be overstated during this period, due to increased leisure time.
(b) The language-based filtering component may exclude certain users of the population such as stranded expatriates
that use a non-majority language to communicate their
thoughts. Such samples may contain a bias towards a
higher rate of depression.
B. Methodology
To monitor and analyse temporal mental health dynamics
we must first deploy our trained BILSTM-SELFA classifier on the
respective countries’ unsupervised experiment datasets. Once
we have the classifier’s predictions, we model the rate of
depression on any given day, R_t, using the following equation:
$$ R_t = \frac{\sum_{i=1}^{N_t} \Phi(x_i)}{N_t} \tag{2} $$
where Φ represents our trained classifier, x_i is the input and
N_t is the total number of samples on day t. The output
of the classifier is binary, Φ(x_i) ∈ {0, 1}. R_t is a
normalised continuous value between 0 and 1, interpreted as
the proportion of tweets at t that are classified as Diagnosed: 0
meaning all samples belong to Control and 1 meaning all
samples belong to Diagnosed.
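The daily rate defined in Equation (2) can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the `(day, label)` input format, the function name `daily_rate` and the example predictions are assumptions, with `label` standing in for the binary classifier output Φ(x_i).

```python
from collections import defaultdict
from datetime import date


def daily_rate(predictions):
    """Return {day: R_t}, the share of samples labelled Diagnosed (1) on each day.

    `predictions` is an iterable of (day, label) pairs with label in {0, 1}.
    """
    totals = defaultdict(int)     # N_t: number of samples on day t
    positives = defaultdict(int)  # sum of Phi(x_i) over samples on day t
    for day, label in predictions:
        totals[day] += 1
        positives[day] += label
    return {day: positives[day] / totals[day] for day in sorted(totals)}


# Hypothetical classifier outputs for two days.
example_predictions = [
    (date(2019, 12, 24), 1),
    (date(2019, 12, 24), 0),
    (date(2019, 12, 24), 0),
    (date(2019, 12, 24), 0),
    (date(2019, 12, 25), 1),
    (date(2019, 12, 25), 1),
]

rates = daily_rate(example_predictions)
for day, rate in rates.items():
    print(day.isoformat(), rate)
```

Days with no samples simply do not appear in the result, which avoids a division by zero when N_t = 0.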
C. Results
Figures 1 and 2 (see Appendix C) display the temporal
mental health dynamics for the countries under investigation.
It is noteworthy that R_t across the different countries is a
function of the ratio of Control:Diagnosed samples in the
country-specific datasets on which the classifier was trained.
As such, the rates are not directly comparable across
countries. Instead, we analyse how R_t in each country
changed over time and how that trajectory differed from
those of other countries.
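The figure captions state that the rates are smoothed with a 7-day moving average, and the discussion that follows reports relative changes in the rate. The sketch below shows one way to compute both. The trailing (rather than centred) window, the function names and the example values are assumptions, since the paper does not specify these details.

```python
def moving_average(values, window=7):
    """Trailing moving average; returns one value per full window."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[i - window + 1 : i + 1]) / window
        for i in range(window - 1, len(values))
    ]


def relative_change(before, after):
    """Percentage change from `before` to `after`."""
    return (after - before) / before * 100.0


# Hypothetical daily rates: one week at 0.1 followed by one week at 0.2.
daily_rates = [0.1] * 7 + [0.2] * 7
smoothed = moving_average(daily_rates)

print(len(smoothed), "smoothed points")
print(f"change between first and last: {relative_change(smoothed[0], smoothed[-1]):.0f}%")
```

Because each country's rate depends on its own training data, a relative change within one country is easier to interpret than a difference in absolute rates between countries.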
D. Discussion
Foremost, we note that we categorically cannot, nor do we,
state that the rates of depression discussed in this section are
caused by the imposition of respective national lockdowns
or any other measures of any type, taken by governments to
combat the spread of the virus. In this section, we merely offer
interpretations of the rates of depression in line with explicit
relationships that we discover between the rates of depression
and key events that occurred during the time-period included
in this case study.
Upon examination of the U.K. rate of depression (R_UK),
the first distinct observation we make is not related to any
pandemic-related measures but rather to the sharp, non-sustained
increase in R_UK of over 50% on Christmas day, before it
decreases back to the status quo the next day. Upon further
investigation we find that this phenomenon is well-documented
(Hillard & Buckman 1982), and the fact that our classifier was
able to identify it, without explicitly being
aware of its existence, is encouraging. We continue observing
R_UK chronologically. On March 9th, the Italian National Lockdown
begins. From this point on we observe a sharp, sustained
increase in R_UK until approximately March 23rd, when the U.K.
National Lockdown begins (the U.K. being the last country in this
study to impose such a restriction), at which point R_UK
somewhat plateaus. We interpret
this as an increase in anxiety, a symptom of depression,
amongst the U.K. population as neighbouring countries take
decisive measures to slow the spread. A key theme in the
build-up to the U.K. national lockdown was the intentional
delay of its implementation so as to ensure maximum utility from
the policy⁷. However, a report published on the 16th of March
by the Imperial College COVID-19 response team⁸ estimated
that the combative approach taken by the U.K. government at
the time would result in 250,000 deaths. The report was
well-publicized by the British media and was arguably a factor
in the change of combative approach by the U.K. government.
This is somewhat supported by the change in R_UK during the
U.K. National Lockdown, where we see a sustained decrease
for the majority of the period. Finally, we observe a slight
increase in R_UK towards the end of, and in the aftermath of,
the National Lockdown, which could perhaps be interpreted as
anxiety and concern amongst the population at the uncertainty
they face from both a social and an economic
perspective.
The rates of depression of France (R_FR), Germany (R_DE),
Italy (R_IT) and Spain (R_ES) behave rather differently from
R_UK. Firstly, R_IT increases sharply by over 100% over the
initial days of the Italian National Lockdown. This can be
interpreted as anxiety and concern in response to the quickly
imposed stringent measures by the Italian government in
response to the outbreak of the virus in Italy, which at this
point was largely believed to be the epicentre of the pandemic.
This was coupled with economic turmoil and great concern
over the capacity of hospitals and their ability to handle the
high requirement for intensive care units that would ensue⁹.
A similar pattern emerges in R_FR: a sharp increase over the
initial days of the French National Lockdown period, after
which R_FR continues to rise throughout the lockdown period
at a lower and inconsistent rate. A similar account applies
to R_ES. In contrast, the major increase in R_DE occurs
in the build-up month, whilst during the German National
Lockdown R_DE increases in the initial days, albeit at a lower
rate. R_DE then plateaus and decreases, creating a turning
point in R_DE during the German National Lockdown.
7 ITV News. 2020. Coronavirus: Boris Johnson Announces UK Government’s Plan To Tackle Virus Spread, Youtube. [online] Available at: https://www.youtube.com/watch?v=2U1YoKujYeY&list=PLFXSE3NhAYiZdb2qijJ7uemIB-IAYK5-y&index=893&t=0s [Accessed 1 September 2020].
8 Ferguson, N., Laydon, D., Nedjati-Gilani, G., Imai, N., Ainslie, K., Baguelin, M., Bhatia, S., Boonyasiri, A., Cucunubá, Z., Cuomo-Dannenburg, G., Dighe, A., Dorigatti, I., Fu, H., Gaythorpe, K., Green, W., Hamlet, A., Hinsley, W., Okell, L., van Elseland, S., Thompson, H., Verity, R., Volz, E., Wang, H., Wang, Y., Walker, P., Walters, C., Winskill, C., Donnelly, C., Riley, Steven, R. and Ghani, A. 2020. Report 9: Impact Of Non-Pharmaceutical Interventions (Npis) To Reduce COVID-19 Mortality And Healthcare Demand. [online] Imperial.ac.uk. Available at: https://www.imperial.ac.uk/media/imperial-college/medicine/sph/ide/gida-fellowships/Imperial-College-COVID19-NPI-modelling-16-03-2020.pdf [Accessed 1 September 2020].
9 CIDRAP. 2020. Doctors: COVID-19 Pushing Italian ICUs Toward Collapse. [online] Available at: https://www.cidrap.umn.edu/news-perspective/2020/03/doctors-covid-19-pushing-italian-icus-toward-collapse [Accessed 8 August 2020].
Fig. 1. U.K. rate of depression before and during the National Lockdown. Noise in the rate of depression has been smoothed with a 7-day moving average.
Furthermore, we can observe the R of the respective countries
following the easing of the respective lockdowns and interpret
them as the countries’ outlook on the easing of restrictions.
From the time period that we have available, it seems that
the French and Spanish general populations experienced a
reduction in symptoms of depression, such as anxiety, as
evidenced by the clear reduction in R_FR and R_ES respectively.
We can therefore conclude, tentatively, that the easing of
restrictions was met with an improvement
in the mental state of the general populations of France and
Spain, that the mental state of the Italian and German general
populations deteriorated, and that the general mental state of the
U.K. was rather agnostic to the easing of restrictions.
We are hesitant to state that the changes in the rates of
depression were caused by the imposition or easing of national
lockdowns. To make such a claim we would be required
to undertake a more fine-grained causality study, which is
beyond the scope of this paper; we note this for
future work. We can, however, claim to have discovered clear
relationships between the drastic changes in the behaviour of
rates of depression during the periods of the build-up to, during
and in the aftermath of national lockdowns.
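The three periods referred to above (build-up, during and aftermath) can be made concrete by labelling each day relative to the key dates in Table IX. The sketch below uses the U.K. dates from that table. The daily rates are hypothetical, and treating the easing date itself as the first day of the aftermath, as well as summarising each period by its mean, are illustrative assumptions rather than the analysis performed in this paper.

```python
from collections import defaultdict
from datetime import date


def lockdown_phase(day, begin, eased):
    """Label a day as 'build-up', 'lockdown' or 'aftermath' relative to the key dates."""
    if day < begin:
        return "build-up"
    if day < eased:
        return "lockdown"
    return "aftermath"


def mean_rate_by_phase(rates, begin, eased):
    """Average the daily rates within each phase."""
    buckets = defaultdict(list)
    for day, rate in rates.items():
        buckets[lockdown_phase(day, begin, eased)].append(rate)
    return {phase: sum(vals) / len(vals) for phase, vals in buckets.items()}


# U.K. key dates as listed in Table IX.
UK_BEGIN = date(2020, 3, 23)
UK_EASED = date(2020, 4, 30)

# Hypothetical daily rates, for illustration only.
example_rates = {
    date(2020, 3, 1): 0.10,
    date(2020, 3, 22): 0.14,
    date(2020, 4, 1): 0.12,
    date(2020, 5, 5): 0.13,
}

phase_means = mean_rate_by_phase(example_rates, UK_BEGIN, UK_EASED)
for phase, mean in phase_means.items():
    print(phase, round(mean, 3))
```

A summary of this kind describes association only and, like the rest of the discussion, says nothing about causation.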
V. CONCLUSION
Our set of experiments has been conducted with the aim
of providing organisations with a methodology for monitoring
and analysing temporal mental health dynamics using social
media data. We examine sample representations and their
ability to impact classifier performance. We investigate the role
of including an imbalanced dataset in the classifier training
regime. Our classifier provides encouraging performance on
two fronts: firstly, it is able to discriminate with
high certainty between clear Diagnosed and Control
samples and, secondly, it was able to identify the Christmas
Depression phenomenon supported by the literature. Finally,
we present an analysis and discussion of the rates of depression
and their relationships with key events during the COVID-19
pandemic. We reiterate that through the analysis conducted in
this paper, we cannot state that the measures imposed caused
the changes in rates of depression during the pandemic and
leave this causality analysis for future work. Mental health
monitoring methodologies such as the one proposed in this
paper can be adopted by governments, to identify relationships
between the general population’s mental health state and
imposed measures; by mental health authorities, to assist in
planning and in targeting individual locations in which to
dynamically concentrate their resources; and, for a more
commercial use case, by corporations involved in
producing or disseminating drugs, such as pharmaceuticals, to
combat mental health issues.
ACKNOWLEDGMENTS
Purver is partially supported by the EPSRC under grant
EP/S033564/1, and by the European Union’s Horizon 2020
programme under grant agreements 769661 (SAAM, Supporting
Active Ageing through Multimodal coaching) and
825153 (EMBEDDIA, Cross-Lingual Embeddings for
Less-Represented Languages in European News Media). The results
of this publication reflect only the authors’ views and the
Commission is not responsible for any use that may be made
of the information it contains. We express our thanks to all of
our data annotators: L. Achour, L. Del Zombo, N. Fiore, M.
Hechler and R. Medivil Zamudio.
REFERENCES
Aina, Y. & Susman, J. (2006), ‘Understanding comorbidity with depression and anxiety disorders’, JAOA 2006
106, 509–514.
Althoff, T., Clark, K. & Leskovec, J. (2016), ‘Large-scale
analysis of counselling conversations: An application of
natural language processing to mental health’, TACL .
Ayers, J., Althouse, B. & Dredze, M. (2014), ‘Could behavioral medicine lead the web data revolution?’, JAMA
311(14), 1399–1400.
Bucci, W. & Freedman, N. (1981), ‘The language of depression’, Bulletin of the Menninger Clinic 45(4), 334.
Chung, C. & Pennebaker, J. (2007), ‘The psychological functions of function words’, Social communication 1, 343–359.
Coppersmith, G., Dredze, M. & Harman, C. (2014), ‘Quantifying mental health signals in twitter’, In ACL Workshop
on Computational Linguistics and Clinical Psychology .
De Choudhury, M., Counts, S. & Horvitz, E. (2013a), ‘Predicting postpartum changes in behavior and mood via social
media’, In Proceedings of the SIGCHI conference on human
factors in computing systems pp. 3267–3276.
De Choudhury, M., Counts, S. & Horvitz, E. (2013b), ‘Social
media as a measurement tool of depression in populations’,
In Proceedings of the Annual ACM Web Science Conference .
De Choudhury, M., Gamon, M., Counts, S., & Horvitz, E.
(2013), ‘Predicting depression via social media’, In Proceedings of the International AAAI Conference on Weblogs
and Social Media (ICWSM) .
Devlin, J., Chang, M.-W., Lee, K. & Toutanova, K. (2018),
‘Bert: Pre-training of deep bidirectional transformers for
language understanding’, In proceedings of NAACL .
Guntuku, S., Yaden, D., Kern, M., Ungar, L. & Eichstaedt,
J. (2017), ‘Detecting depression and mental illness on
social media: an integrative review’, Current Opinion in
Behavioral Sciences 18, 43–49.
Hecht, H., von Zerssen, D., Krieg, C., Possl, J. & Wittchen,
H. (1989), ‘Anxiety and depression: comorbidity, psychopathology, and social functioning’, Comprehensive Psychiatry 30, 420–433.
Hillard, J. & Buckman, J. (1982), ‘Christmas depression’,
JAMA 248(23), 3175–3176.
Hochreiter, S. & Schmidhuber, J. (1997), ‘Long short-term
memory’, Neural Computation 9(10), 1735–1780.
Kingma, D. & Ba, J. (2015), ‘Adam: A method for stochastic
optimization’, ICLR 2015 .
Li, I., Li, Y., Li, T., Alvarez-Napagao, S. & Garcia, D. (2020),
‘What are we depressed about when we talk about covid19:
Mental health analysis on tweets using natural language
processing’, arXiv preprint arXiv:2004.10899 .
Losada, D. & Crestani, F. (2016), ‘A test collection for
research on depression and language use’, Experimental IR
Meets Multilinguality, Multimodality, and Interaction: 7th
International Conference of the CLEF Association, CLEF
2016 pp. 28–39.
Oxman, T., Rosenberg, S. & Tucker, G. (1982), ‘The language
of paranoia’, American J. Psychiatry 139, 275–282.
Paul, M. J. & Dredze, M. (2011), ‘You are what you tweet:
Analyzing twitter for public health’, Fifth International
AAAI Conference on Weblogs and Social Media (ICWSM)
20, 265–272.
Peng, Z., Hu, Q. & Dang, J. (2019), ‘Multi-kernel svm
based depression recognition using social media data’, International Journal of Machine Learning and Cybersecurity
10, 43–57.
Pennebaker, J., Mehl, M. & Niederhoffer, K. (2003), ‘Psychological aspects of natural language use: Our words,
ourselves’, Annual Review of Psychology 54(1), 547–577.
Rickels, K. & Schweizer, E. (1993), ‘The treatment of generalized anxiety disorder in patients with depressive symptomatology’, Journal Clinical Psychiatry 54, 20–23.
Righetti-Veltema, M., Conne-Perrard, E., Bousquet, A. &
Manzano, J. (1998), ‘Risk factors and predictive signs
of postpartum depression’, Journal of Affective Disorders
49(3), 167–180.
Rude, S., Gortner, E. & Pennebaker, J. (2004), ‘Language use
of depressed and depression vulnerable college students’,
Cognition & Emotion 18(8), 1121–1133.
Shen, G., Jia, J., Nie, L., Feng, F., Zhang, C., Hu, T., Chua, T.-S. & Zhu, W. (2017), ‘Depression detection via harvesting
social media: A multimodal dictionary learning solution’,
in Proceedings of IJCAI pp. 3838–3844.
Tausczik, Y. & Pennebaker, J. (2010), ‘The psychological
meaning of words: Liwc and computerized text analysis
methods’, Journal of Language and Social Psychology
29(1), 24–54.
Trotzek, M., Koitka, S. & Friedrich, C. (2018), ‘Utilizing
neural networks and linguistic metadata for early detection
of depression indications in text sequences’, arXiv preprint
arXiv:1804.07000 .
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L.,
Gomez, A., Kaiser, L. & Polosukhin, I. (2017), ‘Attention is
all you need’, In Advances in Neural Information Processing
Systems pp. 6000–6010.
Watson, W. (1768), ‘An account of a series of experiments,
instituted with a view of ascertaining the most successful
method of inoculating the small-pox’, London: J. Nourse .
Zhou, J., Zogan, H., S.Yang, Jameel, S., Xu, G. & Chen,
F. (2020), ‘Detecting community depression dynamics
due to covid-19 pandemic in australia’, arXiv preprint
arXiv:2007.02325 .
APPENDIX
A. Significance of Results
TABLE VIII
SIGNIFICANCE IN PREDICTIONS OF BILSTM-SELFA CLASSIFIER
Country    Comparison    Significance
France     Uniform       χ² = 1,311,459 (p < 0.00001)
France     Weighted      χ² = 6,726 (p < 0.00001)
Germany    Uniform       χ² = 1,504,290 (p < 0.00001)
Germany    Weighted      χ² = 5,607 (p < 0.00001)
Italy      Uniform       χ² = 1,485,204 (p < 0.00001)
Italy      Weighted      χ² = 5,050 (p < 0.00001)
Spain      Uniform       χ² = 2,122,253 (p < 0.00001)
Spain      Weighted      χ² = 6,324 (p < 0.00001)
U.K.       Uniform       χ² = 1,242,848 (p < 0.00001)
U.K.       Weighted      χ² = 16,591 (p < 0.00001)
B. Experiment Dataset Composition
TABLE IX
COMPOSITION OF EXPERIMENT DATASETS
Country    Restrictions Begin    Restrictions Eased    No. Users    No. Tweets
France     17 March 2020 (1a)    11 May 2020 (2a)      1,351        945,919
Germany    22 March 2020 (3a)    6 May 2020 (4a)       1,643        998,248
Italy      9 March 2020 (5a)     27 April 2020 (6a)    1,725        764,089
Spain      14 March 2020 (7a)    28 April 2020 (8a)    2,060        1,012,847
U.K.       23 March 2020 (9a)    30 April 2020 (10a)   2,883        2,050,554
1a The Independent. 2020. France Imposes 15-Day Lockdown And Mobilises 100,000 Police To Enforce Coronavirus Restrictions. [online] Available at: https://www.independent.co.uk/news/world/europe/coronavirus-france-lockdown-cases-update-covid-19-macron-a9405136.html [Accessed 12 July 2020].
2a BBC News. 2020. France Eases Lockdown After Eight Weeks. [online] Available at: https://www.bbc.co.uk/news/world-europe-52615733 [Accessed 12 July 2020].
3a BBC News. 2020. Germany Bans Groups Of More Than Two To Curb Virus. [online] Available at: https://www.bbc.co.uk/news/world-europe-51999080 [Accessed 4 August 2020].
4a BBC News. 2020. Germany Says Football Can Resume And Shops Reopen. [online] Available at: https://www.bbc.co.uk/news/world-europe-52557718 [Accessed 4 August 2020].
5a CNN. 2020. All Of Italy Is In Lockdown As Coronavirus Cases Rise. [online] Available at: https://edition.cnn.com/2020/03/09/europe/coronavirus-italy-lockdown-intl/index.html [Accessed 12 July 2020].
6a BBC News. 2020. Coronavirus: Italy’s PM Outlines Lockdown Easing Measures. [online] Available at: https://www.bbc.com/news/amp/world-europe-52435273 [Accessed 12 July 2020].
7a The Guardian. 2020. Spain Orders Nationwide Lockdown To Battle Coronavirus. [online] Available at: https://www.theguardian.com/world/2020/mar/14/spain-government-set-to-order-nationwide-coronavirus-lockdown [Accessed 12 July 2020].
8a BBC News. 2020. Spain Plans Return To ’New Normal’ By End Of June. [online] Available at: https://www.bbc.co.uk/news/world-europe-52459034 [Accessed 12 July 2020].
9a BBC News. 2020. Coronavirus Updates: ’You Must Stay At Home’ UK Public Told - BBC News. [online] Available at: https://www.bbc.co.uk/news/live/world-52000039 [Accessed 12 July 2020].
10a BBC News. 2020. UK Past The Peak Of Coronavirus, Says PM. [online] Available at: https://www.bbc.co.uk/news/uk-52493500 [Accessed 12 July 2020].
C. Temporal Mental Health Dynamics Results
Fig. 2. France, Germany, Italy and Spain rates of depression before and
during respective national lockdowns. Noise in the rates of depression has been
smoothed with 7-day moving averages.