Uncovering deep user context from blogs
Robert McArthur
CSIRO ICT Centre
GPO Box 664
Canberra ACT 2601 Australia
+61-2-6216 7000
robert.mcarthur@csiro.au

ABSTRACT
People’s utterances are fundamentally different to other documents because they are more immediate and less thought through. While this makes them more natural – noisy and unstructured – it provides an unrivalled opportunity to see “inside” the author, to collect some context. The data requires analysis methods that have a relationship to human information processing: socio-cognitively motivated semantic systems. Using HAL, a method validated by cognitive science, the text from a large number of blog entries was analysed to uncover changes in the authors’ sense-of-self. Sense-of-self was measured through geometric projection of each author’s first-person usage onto key indicators of kin and negative emotion words. An example of non-clinical qualitative evaluation affirmed the utility and promise of the technique: that deep personal context can be uncovered from utterances through the appropriate analysis and inference.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval – information filtering, clustering, search process.
J.4 [Social and Behavioral Sciences]: Psychology

General Terms
Algorithms, Measurement, Experimentation, Human Factors, Theory

Keywords
context, socio-cognitive, semantics, HAL, sense-of-self

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
AND '08, July 24, 2008 Singapore
Copyright ©2008 ACM 978-1-60558-196-5... $5.00

1. INTRODUCTION
Information retrieval has a history of dealing with a variety of unstructured and noisy data (UAND). This has culminated in the phenomenon of the general web search engine. These can lay claim to dealing with an enormous amount of data, much of which is changing, noisy and relatively unstructured, in functionally very successful ways. In a highly competitive operational field, and with new directions in theory, there is a rising trend in information retrieval towards an increased involvement of context, especially user context. Context in this sense can be defined as information that allows retrieval and presentation in ways that are not necessarily representative of the average person, but are more personalized to an individual or identifiable group. The success of workshops at SIGIR [15] and the subsequent IIiX Symposium and workshops [2], the SIGIR exploratory search workshop [40], and the outcome of the 2004 SWIRL workshop [32] are indicators that elements of user context are both sought and becoming available: “The provision of tools…can yield great rewards for users, especially when contextual factors such as user emotion, task constraints, and dynamism of information needs are considered” [39].

User context is context about the user. While trite, this is an important distinction as it accentuates the human dimension which has frequently been acknowledged but ignored: we need to know more about the author and questioner/reader as people if we are to identify and use their context for better retrieval. Retrieval using unstructured, noisy data that are people’s utterances requires a different approach compared to documents created by multiple authors, and/or over a longer period of time. The most readily available contemporary manifestation of people’s utterances on which experimentation is possible in this search for context are blog entries.

This paper continues a line of research in which the utility of socio-cognitively motivated algorithms to uncover facts, relationships and context from automatic analysis of noisy, unstructured textual utterances is demonstrated. In the research described, a “deep” style of user context is uncovered by automatically analyzing blog entries. The change in a person’s “sense-of-self” is one example of “deep” personal context. It is more implicit information about a person beyond the absolute explicit (“she plays netball”) or relative explicit (“he types fast”). Decision support for computer-assisted clinical intervention is the domain in which this particular context has shown promise.

An experiment is presented in which the HAL algorithm from cognitive science [22], as a manifestation of socio-cognitively motivated analysis, is applied to UAND from the TREC Blog track. The results demonstrate the promising utility of the techniques applied to blog data by identifying individuals – candidates of interest – who may be exhibiting a low sense-of-self. The intention is that such a list would assist clinical decision making about who to support with the scarce resources available. Since such decisions over all blog data are beyond the scope of this study, an example is presented which demonstrates the insights possible by means of qualitative evaluation techniques.

This research is an important contribution demonstrating what appropriate methods of analysis of people’s utterances may be able to achieve. It is precisely the artefacts of human communication – expressing oneself in a human manner: lack of structure and “noise” – that present us as humans with the context we need, and present computer systems with electronic headaches.

The next section introduces background to the research, followed by a description of the experiment including methods and results. The paper concludes with some further ideas and directions on unstructured, noisy data.

2. BACKGROUND
Recent weblog workshops [1, 34] reflect the difficulty of dissecting UAND for a task, but also a strong interest in using such data for the discovery of context about the user in the form of the blogger’s age [4, 37], gender [35, 37], opinions and sentiments expressed [16, 30, 33], mood levels [3, 31], happiness [29], residential area [41] and social network(s) [12, 14, 19-21, 38]. The research areas have concentrated on particular ways of identifying or extracting information, providing fertile ground for TREC-style comparisons of the approaches. All of the tasks and methods show how interesting but difficult the problem of using UAND is. It has been the simpler problems in which reasonable success has occurred, almost always using methods successful in less noisy, non-utterance data (e.g. splog detection).

Considerable interest has been shown in the analysis of sentiment in weblogs [3, 29-31, 33]. This broad umbrella encompasses notions of mood, opinion, emotion and happiness, both explicit and implicit, short-term and over longer periods. Sentiment is only one aspect of user context though. Indeed, many of the published notions of sentiment derive from reflection of authorship. That is, determination of the mood, opinion or emotion comes through analysis of the artefacts of communication, the blog entry, and is deemed a reflection of the sentiment of the author at the time. The assumption is that a person’s writing reflects their inner state of being. Although it is easy to ignore the author and concentrate on the blog entry itself, the focus should remain on determining more about the user and how to gain this using UAND.

Figure 1: Different contexts drawn from textual data
[Figure labels. Context about the “document”: Sentiment (mood, happiness, opinion); Style (length, grammar). Context about the author: implicit, meta, aggregated (sense-of-self, others).]

To this end, more information about the user is required apart from the emphasis on the (usually) explicit manifestations of their selves. Figure 1 presents this difference by separating what we know about the document (e.g. style), what we know about the user from the document at a particular point in time (e.g. sentiment), and the more implicit and meta-information which is derived from the sentiment and which captures deeper context about the user. An example of such information is a computational manifestation of a person’s sense-of-self. This has been investigated previously using other UAND data – mailing list records – in an online health setting [24]. The research presented here utilizes that work as an exemplar of deeper user context, refining the methods and applying the ideas and techniques to a new type of data: blog posts. Thus this work uses very similar methods but investigates the utility in the very different data of blog posts, across a much larger collection of authors and posts, and therefore by necessity uses a more quantitative method of deciding candidates of interest.

2.1 Socio-cognitive systems
Freyd [10] provides a socio-cognitive frame for a viable communal (i.e., shared or conveyed) knowledge representation:

    what seems common to most of the main approaches to semantics is an assumption that values of semantics components, or features, are critical to word meaning. What is relevant to shareability theory is that a smaller number of features seem to be used than number of words. (pp195-6)

    I am arguing that a dimensional structure for representing knowledge is efficient for communicating meaning between individuals. That is, a small dimensional structure with a small number of values on each dimension is argued to be especially shareable, which might explain why such structures are observed. (pp198-9)

Freyd’s suppositions on the dimensional nature of shared knowledge are compatible with a recent, three-level socio-cognitive model of cognition by Gärdenfors [11], who argues that meanings of words come from conceptual structures in people’s heads. Of his three levels of representation, symbolic, conceptual and associationist, it is the middle, conceptual, level that is of relevance for this paper.

Gärdenfors’ socio-cognitive position is that meanings emerge from the conceptual structures harbored by individual cognition together with the linguistic power structure within the community: “The dimensions are the basic building blocks of representations on the conceptual level” (p43). Of significant note is his adoption of Freyd’s supposition: social interactions will constrain these conceptual structures. This constriction of the dimensional structure by the individual for social interaction is important. We tend to economise our utterances, losing structure and increasing noise, leveraging a shared background. This may be especially true in the persistent conversations of online discussion groups, which can become “grounded in sufficient mutual understanding to allow very brief, sketchy and implicit references to succeed without posing significant problems in interpretation.” [8]

2.2 Semantic spaces
Discovering and representing knowledge from utterances requires cognitive processing ability, or the ability to mimic it. While Gärdenfors presents an attractive socio-cognitive basis, high dimensional semantic spaces are the preferred computational models to operationalise the theory.

A human encountering a new concept derives the meaning via an accumulation of experience of the contexts in which the concept appears [9]. This opens the door to “learn” the meaning of a concept through how it appears within the context of other concepts. HAL is one of an ensemble of models of semantic memory from psychology that automatically construct a dimensional semantic space from a corpus of text [5, 6, 22, 23]. HAL has accounted for a wide variety of semantic effects in the cognitive and neuropsychological literature and has an enviable track record (http://locutus.ucr.edu/ has many publications). Latent Semantic Analysis (LSA) [18] is another leading model.
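
To make the idea of automatically constructing a semantic space concrete, the following is a minimal Python sketch of a HAL-style co-occurrence matrix. It is an illustration rather than the implementation used in this research: the function names hal_matrix and combine are hypothetical, the input is assumed to be already tokenised with stop words removed, and the weighting (window length minus distance) is an assumption read off the worked example in Section 3.1, where a window of l=3 gives adjacent words a strength of 2.

```python
def hal_matrix(tokens, window=3):
    """Direction-sensitive co-occurrence strengths (illustrative sketch).

    matrix[w][v] accumulates how strongly v PRECEDES w. The weight
    `window - distance` is an assumption based on the Section 3.1 example.
    """
    vocab = sorted(set(tokens), key=tokens.index)  # first-occurrence order
    matrix = {w: {v: 0 for v in vocab} for w in vocab}
    for i, word in enumerate(tokens):
        for distance in range(1, window):
            j = i - distance
            if j < 0:
                break  # ran off the start of the text
            matrix[word][tokens[j]] += window - distance
    return matrix


def combine(matrix):
    """Add each word's row and column vectors, discarding word order."""
    return {w: {v: matrix[w][v] + matrix[v][w] for v in matrix}
            for w in matrix}


# Content words of "the pain of seeing dad that way was too strong to bear"
tokens = ["pain", "seeing", "dad", "strong", "bear"]
combined = combine(hal_matrix(tokens, window=3))
for word in tokens:
    print(word, [combined[word][other] for other in tokens])
```

For the content words of the Section 3.1 example sentence, the combined vector for dad has strengths 1, 2, 0, 2 and 1 against pain, seeing, dad, strong and bear respectively.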

Burgess and Lund [6] examined whether HAL could represent abstract concepts, such as love, hate and joy. They found that, in a comparison with human raters in predicting abstract variables for a set of words, “global co-occurrence information carried in the word vectors can be used to predict a tangible proportion of the human likert scale ratings.”

Previous work [25-27] has shown that HAL can be used successfully in extracting knowledge, both explicit and tacit, directly from email utterances. The studies have also shown the validity of using a (modified) global association model like HAL, originally developed on a 300 million word corpus from Usenet, on datasets ranging from a few sentences [26] to a few thousand emails or documents [25]. For all of these reasons, HAL was chosen in [24] and here as the semantic space model by which a representation of “sense-of-self” could be captured from the discourse in blog entries.

Semantic spaces also have the property that they can be created on the basis of different variables – e.g. particular datasets derived from various times or authors – as well as through known global association models such as HAL. These local associations provide a framework for understanding, in authorship or longitudinally, the utterances of individuals as well as the community as a whole. This is used to capture the state of a person at a particular time. The minimum time period, such as a day or month, depends on the amount of textual utterance available: the more utterances there are for the period, and the more coherent they are, the better.

Therefore it is hypothesized that a socio-cognitively motivated paradigm and practical semantic spaces provide the best chance of representing and inferring user context. This research is the first to apply semantic spaces to blog data. Blog entries are quite different to most other text data because, unlike many types of documents, they are written by the individual as a personal communication but often not directed to a particular reader. Sometimes they are written in the knowledge that many thousands of people will read, digest and comment on the contents. Other blogs are more personal and it doesn’t seem to matter to the author whether anyone else reads them. Whichever is the case on this continuum, the blogs have a clear authorship and are often written in first person. Apart from email, on which it is notoriously difficult to conduct research because of privacy concerns, blogs are the principal other medium which has these features, and it is these features which provide the strongest source of information – context – about the author.

In brief, the advantage of blogs as against all other forms of data for identifying and analysing user context is that
1. they are written by a single, identifiable author;
2. they are revealing – often explicit personal information is presented;
3. they are freely available.

The next sections describe the particular experimental procedure and results to empirically investigate these notions.

2.3 Defining sense-of-self
As defined in [24], an important element of our personal context is our sense-of-self. It has been posited, and demonstrated to some degree, that people manifest this in their utterances by writing in the first person singular, using the personal pronouns I, me, my or mine. First-person form is one that is inherent to people’s utterances. The dimensional representation of all of the first-person terms is the beginning of a computational representation of the author’s sense-of-self. This representation derives automatically from the corpus from which the semantic space was computed or, in other words, the semantics of a person’s computational self derive from their utterances involving self-reference. See [24] for more justification and details.

Using a geometric point in the semantic space as the representation, movement in sense-of-self can be analysed against axes of reference. The same termsets as [24] were used here to form axis centroids, the average of the terms in the sets in Table 1, creating two axes of kin and negative emotion terms. The projection of the sense-of-self against these axes provides a temporal point for an author’s sense-of-self, while change in sense-of-self is determined by the distance between temporal points for the same author over different periods (e.g. days).

Table 1: Terms of interest
Negative emotion words: "hate", "pain", "anger", "painful", "fatigue", "fatigued", "angry"
Kin words: "mother", "father", "fathers", "grandfather", "mothers", "grandmother", "mother-in-law", "grandson", "sons", "son-in-law", "daughters", "granddaughter", "daughter-in-law", "sibling", "siblings", "husbands", "sister", "brother", "daughter", "son", "wife", "husband"
Words comprising the sense-of-self: "I", "me", "my", "mine"

In short, the HAL-produced semantic space is the knowledge representation framework; blog entries provide utterances of the right form for analysis; and sense-of-self is the geometric position of the author’s self-referents as against terms of kin and negative emotion in the semantic space. Movement in sense-of-self is determined using temporally determined sense-of-self points for individual authors. In practical terms, certain terms, chosen to exemplify the concepts of negative emotion and kin, are combined to form a 2D space onto which the monthly sense-of-self vectors of the authors are projected. The sense-of-self vectors arise from the identification of first-person language within the blog posts. A large movement within the 2D space of the monthly sense-of-self points indicates a potentially interesting change to the author’s sense-of-self. Terms used to create the axes are drawn from the original work, as well as more recent blog-related sources (like [29]).

It is important to note that while some blog entries provide personal utterances which provide evidence towards a sense-of-self, there are many that do not. These include splogs (spam-blogs), quotations, technology blogs, documents not in the first person, etc. Limiting the analysis to entries written in the first person, and substantially so when vector size is taken into account (see sections 4 and 5), removes any effect from these spurious or uninteresting blog entries.
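
The geometry described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used in the study: the function names are hypothetical, the three-dimensional vectors are toy values standing in for HAL vectors, and projection is assumed to be a dot product between unit-length vectors (the text specifies only that the sense-of-self is projected against the axis centroids).

```python
import math


def normalise(vector):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def centroid(term_vectors):
    """Average the unit-length term vectors, then normalise the average."""
    units = [normalise(v) for v in term_vectors]
    mean = [sum(column) / len(units) for column in zip(*units)]
    return normalise(mean)


def project(unit_vector, unit_axis):
    """Scalar projection of one unit vector onto another (assumed: dot product)."""
    return sum(a * b for a, b in zip(unit_vector, unit_axis))


def sense_of_self_point(first_person_vectors, kin_axis, neg_axis):
    """2D point (kin, negative emotion) for one author in one period."""
    self_vector = centroid(first_person_vectors)
    return project(self_vector, kin_axis), project(self_vector, neg_axis)


# Toy vectors only: real HAL vectors have one dimension per vocabulary term.
kin_axis = centroid([[2.0, 0.0, 0.0], [1.0, 0.0, 0.0]])   # e.g. "mother", "father"
neg_axis = centroid([[0.0, 1.0, 0.0]])                    # e.g. "hate"
point = sense_of_self_point([[3.0, 4.0, 0.0]], kin_axis, neg_axis)
print(point)
```

Computing one such point per author per period gives the temporal points whose separation is then taken as the change in sense-of-self.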

3. METHODS
3.1 HAL
A HAL space comprises high dimensional vector representations for each term in a vocabulary. Given an n-word vocabulary, the space is an n x n matrix constructed by moving a window of length l (typically 10) over the corpus by one word increments ignoring punctuation, sentence and paragraph boundaries. All words within the window are considered as co-occurring with each other with strengths inversely proportional to the distance between them. After traversing the corpus, an accumulated co-occurrence matrix for all the words in a target vocabulary is produced. Note that word pairs in HAL are direction sensitive – the co-occurrence information for words preceding every word and co-occurrence information for words following it are recorded separately by its row and column vectors.

The HAL matrix for an example text “the pain of seeing dad that way was too strong to bear” is depicted in the HAL matrix table below using a 3 word moving window (l=3). To read the table: the word dad occurs before bear (is related to it) with strength 1 (2 minus the 1 intervening word in the window). Stop words (the, of, that, way, was, too, to) were removed.

Table 2: HAL matrix for "the pain of seeing dad that way was too strong to bear"

          pain  seeing  dad  strong  bear
pain       0      2      1     0      0
seeing     2      0      2     1      0
dad        1      2      0     2      1
strong     0      1      2     0      2
bear       0      0      1     2      0

Prior studies [25] revealed that for the purposes of this research, it was not useful to preserve word order information, so the HAL vector of a word was represented by the addition of its row and column vectors. The quality of HAL vectors is influenced by the window size; the longer the window, the higher the chance of representing spurious associations between terms. Burgess et al. [5] used a size of ten in their studies. To limit the influence of stop words (frequently occurring words that do not help to differentiate documents or utterances) in dimensional reduction and reduce the frequency bias, the INQUERY [7] list provided stopwords that were deleted from the input to HAL. Thus a window of 10 non-stop words was used, rather than 10 words.

3.2 Data
The TREC Blog track [36] provided a large scale test collection of blog posts from 2006. The data, being a snapshot in time over a range of blog sites, appears to include all genres of blogs, from one-off to “A-list” [13], as well as spam for “a realistic research setting”, making it a particularly realistic example of UAND data. There were 100,649 unique blogs with 3,215,171 entries from 6/12/2005 to 21/2/2006.

The weblogs were analysed in a manner similar to [24]. Unlike in that study, some “pre-semantic” information such as part-of-speech tagging was not required because of the larger amount of data; for similar reasons, SVD (singular value decomposition) was not applied, thus only explicit associations or relationships were uncovered.

The amount of utterance data for each blogger on each day is wildly different. Also, examining an author’s text at the widest separation in time is more likely to uncover deeper, inherent, context and show pronounced change. Therefore 9 consecutive days in December, the first part of the data (20051207 to 20051215), and 7 consecutive days in February, the latter part (20060201 to 20060207), were chosen for analysis. Multiple days in each period were chosen to attempt to capture one of the qualities of personal blogging: repeated daily or near-daily entries are the type more likely to evince, and respond to, the deep personal analysis that this research desires to uncover.

In summary, the following stages were implemented, each followed by a comment:

1. The text of the blog entries was extracted from the XML-like syntax;
   This was more difficult than necessary due to the XML-like format used, the fact that the messages were often in their original HTML format – each of which used different methods and standards including Javascript – and the variety of character sets & languages involved.

2. Entries were tokenised, and non-English language blogs eliminated;
   This again was difficult as the texts were very ‘dirty’ – many strange characters, odd positioning of spaces and inter-word separators. It is unlikely more detailed linguistic analysis, such as part-of-speech and chunking, would succeed well because of both the format of the data as well as the language used. Non-English blogs were relatively easy to identify and remove using simple regexps.

3. A small number of stopwords were removed;
   Some words usually considered stopwords, like “I” or “my”, are words that are vital to understanding the deep personal information in the blog. Removal of too many words that are important indicators in terms of personal context risks loss of the ability, wholly or in part, to find the desired entries; but the cluttering of the vectors with ‘useless’ words harms the quality of the subsequent analysis. The choice of which and how many stopwords to remove is likely to depend upon the particular personal context under examination.

4. General cleaning of the text was performed (translation of forms like “I’ve” to “I have”, removal of URIs, converting to lower-case).

5. Hyperspace analogue to language (HAL, [5]) was performed over all the blog entries identified above with a window size of 10. The result was the creation of two combined vectors, one for the set of kin words and one for the set of negative emotion words.
   Normally a semantic space is constructed for all words in the corpus. However, due to the large amount of data, only vectors for terms of interest were computed rather than an entire semantic space (a vocabulary of 1,054,820 terms was identified; 53% were unknown to Unix spell leaving 498,746 known terms; even this is a very large semantic space compared to [5] or [17]). In keeping with previous research [24, 25, 28], pre- and post- associations were combined leaving a single vector for each word. Vectors were normalised to unit length. Each of the two vectors representing sets of words was created by summing the individual words’ vectors, averaging and then normalising the result.

6. Hyperspace analogue to language ([5]) was performed (window size=10) but only vectors for the sense-of-self were created. Also, since the sense-of-self is personal, the sense-of-self vector (calculated as in the previous stage) was created separately for each blog feed over the days in question.
   Over 56,000 sense-of-self vectors were created, with over 13,300 appearing on more than one day during the analysis period. Those occurring on only one day were eliminated.

7. The sense-of-self vectors were projected onto the two axes of interest using simple vector algebra: each sense-of-self vector for a blog feed in a day (vector of 1 x n) was projected onto the combined kin vector (1 x n) and the combined negative emotion vector (1 x n) yielding an x,y position against these two axes.
   The length of the sense-of-self vectors varied widely – the more contexts in which the terms comprising the sense-of-self were used, the larger the vector. A larger vector may be a better exemplar of the person’s sense-of-self, although it is likely to be a non-linear relationship – one occurrence of “I” may not be strongly evidential, but three occurrences may be as good as 30.

Table 2 presents the largest associations in the combined kin and negative emotion vectors to assist understanding of the method. For ease of reading, the normalised values (0-1) have been multiplied by 100,000. Bold indicates terms comprising the combined vector; italics indicates “interesting” associations.

Table 2: Largest associations for combined vectors

      Kin (n=123,589, x̄ =10.23)         Negative emotion (n=69,442, x̄ =20.04)
Rank  Term                   Value      Term            Value
 1    my                     57423      i               68227
 2    i                      38471      not             21778
 3    his                    24290      my              19564
 4    a                      18530      me              17689
 5    her                    18476      you             15234
 6    me                     16860      but             14084
 7    [possessive]           15774      have            13189
 8    you                    14362      back            12397
 9    not                    13014      so              11483
10    he                     12527      a               11427
11    [personalpossessive]   12478      [ellipsis]      11382
12    have                   11745      hate            10481
13    your                   10595      pain            10370
14    she                     9359      they            10322
15    had                     8187      do              10219
16    mother                  7009      them             8125
17    but                     6450      all              7986
18    one                     5657      like             7756
19    we                      5330      am               7739
20    so                      5271      is               7586
21    has                     5251      he               7490
22    sex                     5028      love             7222
23    him                     5003      just             6709
24    [ellipsis]              4845      much             6340
25    their                   4729      it               6128
26    wife                    4722      your             5992
27    like                    4626      up               5918
28    am                      4604      more             5781
29    up                      4568      his              5711
30    sister                  4492      we               5589
31    they                    4408      one              5434
32    our                     4366      people           5262
33    do                      4337      her              5126
34    said                    4294      she              5115
35    father                  4170      get              5065
36    was                     4162      really           5034
37    all                     4070      because          4899
38    now                     4066      did              4857
39    just                    3805      there            4733
40    would                   3788      him              4611
41    time                    3730      feel             4525
42    little                  3690      their            4250
43    son                     3618      out              4230
44    is                      3582      management       4105
45    brother                 3489      [possessive]     4065
46    nude                    3478      know             3958
47    out                     3403      being            3892
48    incest                  3386      anger            3803

4. RESULTS
The result of the experiment is a list of bloggers and the amount of change of sense-of-self between particular days in the first part of the corpus (Dec) and in the last part (Feb). Table 3 shows an example ranked list of feed numbers. The score is the Euclidean distance between the 2D sense-of-self point (kin axis, negative emotion axis) on one day and the point on another day. A high kin or negative emotion value indicates strength of or high usage on that axis. The vector size is also shown, along with the number of elements the sense-of-self vector has in common with the combined kin and negative emotion vectors. The higher the score, the more the sense-of-self has changed over the period.

While the table begins with the highest scoring blogs, the vector size of many of the top entries for one of the two days is very low, such as 12 for the first entry. These small vectors indicate very little first-person usage and little chance of detecting an anomaly. Therefore, while the highest of these scores are shown, the first interesting example of larger vector size occurs at a score of 0.68.
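
The scoring and ranking just described can be sketched as follows. This is a hedged illustration, not the study's code: the function names and the min_vector_size cut-off are hypothetical (the text observes that very small vectors are poor evidence but states no threshold), and comparing every pair of observed days for a feed is an assumption about how pairs were formed. The sample values are the kin and negative emotion values reported for feeds 050248 and 032140 in Table 3.

```python
import math
from itertools import combinations


def change_score(point_a, point_b):
    """Euclidean distance between two (kin, negative emotion) points."""
    return math.hypot(point_a[0] - point_b[0], point_a[1] - point_b[1])


def rank_candidates(observations, min_vector_size=0):
    """Rank feeds by change in sense-of-self, largest change first.

    Pairs where either day's sense-of-self vector is smaller than
    min_vector_size (a hypothetical cut-off) are skipped.
    """
    ranked = []
    for feed_id, days in observations.items():
        for a, b in combinations(days, 2):
            if min(a["size"], b["size"]) < min_vector_size:
                continue
            score = change_score((a["kin"], a["neg"]), (b["kin"], b["neg"]))
            ranked.append((score, feed_id, a["day"], b["day"]))
    return sorted(ranked, reverse=True)


# Values as reported in Table 3 for two feeds (prefix BLOG06-feed-).
OBSERVATIONS = {
    "050248": [
        {"day": "20060205", "kin": 0.04, "neg": 0.05, "size": 12},
        {"day": "20051211", "kin": 0.72, "neg": 0.87, "size": 1649},
    ],
    "032140": [
        {"day": "20051209", "kin": 0.23, "neg": 0.29, "size": 221},
        {"day": "20060203", "kin": 0.69, "neg": 0.79, "size": 725},
    ],
}

for score, feed_id, day_a, day_b in rank_candidates(OBSERVATIONS):
    print(f"{score:.2f} {feed_id} {day_a} {day_b}")
```

Passing a cut-off such as min_vector_size=100 drops feed 050248, whose first day has a vector of only 12 elements, and leaves feed 032140 as the top candidate.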
Table 3: Example output (n=9230 pairs, avg. score=0.10)
Score
1.07
Day
…
050248 0.04
0.72
072330 0.05
0.67
015221 0.06
0.65
…
…
0.05
0.87
0.07
0.79
0.10
0.80
…
20060205
20051211
20060201
20051214
20051207
20051214
…
0.68
032140 0.23
0.29
20051209
0.95
0.92
51
Kin Negativ
Blog
e
Feed ID value
emotion
value
(prefix:
BLOG0
6-feed-)
Vect Elements in
or
common
size with with
kin neg.
emoti
on
12
12
12
1649 1498 1442
10
10
10
537 516 513
28
28
28
329 317 318
…
…
…
221
213
205
…
…
0.69
…
0.79
…
20060203
…
725
…
678
...
homeless CRACK junkie. It's like having a tenant who shits the
floor and smells of whiskey-soaked rotten eggs, but won't pay
the rent. Sort of like a teenager. Eventually we will buy a new set
of hardware for the hatch door to keep him out... or keep him IN,
depending on whether or not I've taken my Maca that day. No
idea what we're going to do about the GIANT rodent in the back
crawlspace. So basically, Koala the kitten has sought residence
with his Daddy, whom I've only seen once, but looks like a
bigger version of Booga. Booga's going to be a big boy. He's
going to whip Mojo's ass one day. And now I return you the
regularly scheduled Days of Thanks...
681
…
5. DISCUSSION
It is important to commence the discussion with the proviso that
no clinical diagnosis is being offered. The list of candidate blogs
(e.g. Table 3) could be used as a decision support tool by
clinicians who would likely examine the entries in question, and
follow-up interesting leads with other, perhaps face-to-face,
interaction. However, with increasingly scarce resources, both at
an individual’s cognitive level and the medical establishment’s
staffing level, creating a candidate list of prospective respondents
would be extremely useful: after all, at the moment, there is no
analysis at all. Even should 1 in 100 or more leads prove worth
further analysis, the human benefits are likely to be worthwhile.
Example blog entries from the second day’s (20060203) feed are
(10 entries total):
Cutting Day Date: February 24, 2006 5:30AM Event:
Laparoscopy/Hysteroscopy/Septum Eviction 2006
Prep:
Occasional pacing; random anxiety dream; shallow breathing;
temporary attention deficit; excitement and hope that this might
fix the plumbing problems. Three weeks to go…
Ahem... yeah. Apparently, my MRI strongly suggests an
arcuate uterus. Arcuate, as in irregularly shaped. Arcuate, as in
almost heart-shaped (and not in a cheeky Valentines kind of
way). Arcuate, as in what one doctor noted when I was
pregnant, only to be contradicted by Dr. Asshole later. Arcuate,
as in lots more surgery. Arcuate, as in “might even be septate-but we won't be sure until the surgery because there's a gigantic
fibroid blocking our view.” I am having a difficult time finding
enough information about it. Most say that it doesn't increase
the risk of miscarriage, which I have to say is royal bullshit
according to my interview of my seven deceased fetuses. […]
They like it when you sweat...
Nothing yet. All day, I
waited patiently by the phone. I took my cell phone with me to
staff meetings, expecting to leave the meeting to talk to Dr.
Awesome. ...and I waited....and waited...and waited...At 5:55pm,
I got up and headed out the door to go home. Somewhere en
route from my desk to the car, they called. But of course, the
windtunnel I walk through known as “parking deck” was noisy
enough that OF COURSE, I didn't hear the damn phone ringing.
Ah, but they left me a message. “Mrs. Drab, we have the results
back for your MRI. If you will call us tomorrow, we will go over
those results with you. Also, you need bloodwork drawn. Call
and we will give the details.” Thanks, that was a fat load of help.
My imagination is running wild. Now, I'm fairly certain THIS is
The Shadow. This is why they are being coy. They don't want to
tell me about the evil creature that has attached itself to my
uterine wall. MMMMhmmmm... Mark my words. E-V-I-L C-R-E-A-T-U-R-E. Closely related to the Ripapod. Dammit, they found
a mutant Ripapod in there. Son of a bitch.
Why I will fail at Anger Management
My doctor (the
one who treated my TMJ) advised me to work on my anger
management issues. I will fail at this task. It's not that I'm a
defeatist or like to give up easily (anyone who's followed my
ridiculous Chronicles of Conceiving knows that). The truth is that
I've had this anger as long as I can remember. It's like a close
friend. The one who leaves cigarette burns on your couch and
never replaces the toilet paper, but at the end of the day you are
saying “OHHHH Anger. You're simply ADORABLE.” I'd miss the
little curmudgeonette if she were to leave me entirely. And let's
face it, with an asshole around every corner, it's a virtual
impossibility. For example, Gwyneth. She took a lovely little
band like Coldplay and twisted it with her evil witchy magic.
Here's the transition: BEFORE (c.2000) I awake to see that
no one is free We're all fugitives Look at the way we live Down
here, I cannot sleep from fear, no I said, which way do I turn Oh,
I forget everything I learn...
AFTER (c.2005) You cut me
down a tree and brought it back to me And that’s what made me
see where I was going wrong You put me on a shelf and kept
me for yourself I can only blame myself, you can only blame
me... Do you SEE what she did? […] Look at that Glad trash
The assessment as to which blog entries most contribute to
the context is almost certainly qualitative and task dependent, in
line with the socio-cognitive nature of the problem. The
determination of whether a particular blog entry is the basis for
intervention cannot be quantitatively, or simply, judged; anecdotal
evidence indicates that clinicians often disagree with their
colleagues’ conclusions. The good news is that first-level
identification of interesting candidates is often quite viable, since
as humans we make such comparisons every moment and are
relatively good at it.
Taking the results shown in Table 3, let us
examine the particular blog entries that form the basis of one of
them. The list is ordered by a basic similarity score;
however, a factor not encompassed by that score is the size of the
sense-of-self vector. A small vector means few uses of first-person
language and hence may not be reliable. The entry
BLOG06-feed-032140 has the highest score with the largest
number of associations in both days’ sense-of-self vectors, hence
it will serve as a practical case in point. Note that this isn’t to
downplay or ignore the strength of the other results, merely that
this datapoint’s text may be clearer to non-specialists. In a clinical
setting a practical decision would be made on how many blog
entries to examine versus a cut-off based on vector size and score.
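The practical decision described above can be sketched as a simple triage step. This is a minimal illustration only: the `Candidate` record, the `shortlist` function, the threshold values and the feed identifiers are all hypothetical, the numbers are not taken from Table 3, and the sketch assumes that higher scores are listed first.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    feed_id: str       # identifier of the blog feed (hypothetical names below)
    score: float       # basic similarity score used to order the list
    vector_size: int   # size of the sense-of-self vector


def shortlist(candidates, min_vector_size, min_score, limit):
    """Keep candidates passing both cut-offs, highest score first, at most `limit`."""
    kept = [c for c in candidates
            if c.vector_size >= min_vector_size and c.score >= min_score]
    # Ties on score are broken in favour of the larger (more reliable) vector.
    kept.sort(key=lambda c: (-c.score, -c.vector_size))
    return kept[:limit]


# Illustrative values only; they are assumptions, not results from the study.
candidates = [
    Candidate("feed-A", 0.90, 410),
    Candidate("feed-B", 0.85, 35),
    Candidate("feed-C", 0.80, 28),
    Candidate("feed-D", 0.40, 300),
]

for c in shortlist(candidates, min_vector_size=30, min_score=0.5, limit=5):
    print(c.feed_id, c.score, c.vector_size)
```

Where to place the two cut-offs remains a clinical judgment; the sketch only makes the trade-off between score and vector size explicit.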
The first day for this blog feed is 20051209, where the sense-of-self is (0.23, 0.29), while the second is 20060203 (0.69, 0.79).
Since analysis and interpretation of the meaning of these
inferences is qualitative, and opinion will differ, the most
appropriate demonstration is to show instances of blog entries
from the first day’s feed and contrast these to instances from the
second day. In bold is the title followed by the unadorned entry –
very unstructured and noisy textual data. Due to space constraints,
blog entries that are similar to, or do not add to those shown, are
not presented (there were 10 entries in total).
The 12 Days of Thanks: Day 10 Things I am thankful for
#10: Bruce Campbell This... is my BOOMSTICK!!!
One of
the best cheesy actors of our time, the star of the Evil Dead
series. I simply can't resist a movie with Bruce.
Intermission I am taking a brief break from giving thanks to
express a few thoughts. […] Koala, the fifth kitten in the litter
I've mentioned a half a million times disappeared for a while. I
couldn't figure out why. Until last night. We have a fucking
POSSUM living in our crawlspace. No wait, let me correct that.
We have a fucking POSSUM living in our BACK crawl space.
You see, we have two crawlspaces. One in the front and one in
the back. The one in the front is occassionally squatted by a
bag she has around her neck! How can I be mad when Gwyn's
prancing about in a garbage bag? It's hilarity at its finest. Chris
Martin has been bewitched by a bag lady! But there are plenty
of other assholes where Gwynnie drops off. It will always be a
challenge, and I'm not so sure I'm ready to give it up. Thank
GOD for Vicodin, huh?
number of blog entries to uncover changes in authors’ sense-of-self.
Prior work on less noisy mailing list messages [24] broke the
ground on which this study built. Using a similar definition,
sense-of-self was measured through geometric projection of
authors’ first-person usage onto key indicators of kin and negative
emotion words. Two candidates were presented, in the manner of
an early qualitative evaluation, to demonstrate the potential of the
technique: that deep personal context can be uncovered from
utterances through the appropriate analysis and inference.
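As a generic illustration of what projecting a co-occurrence vector onto indicator dimensions can look like, consider the sketch below. It is not a reconstruction of the method used in this study or in [24]: the function `projection_strength`, the toy vector and the two indicator word lists are assumptions made purely for the example.

```python
import math


def projection_strength(vector, indicators):
    """Length of the projection of `vector` onto the axes named in `indicators`,
    divided by the full vector length. Returns 0.0 for an empty or zero vector."""
    norm = math.sqrt(sum(w * w for w in vector.values()))
    if norm == 0.0:
        return 0.0
    projected = math.sqrt(sum(vector.get(term, 0.0) ** 2 for term in indicators))
    return projected / norm


# Hypothetical indicator word lists (assumptions, not the lists used in the paper).
KIN_INDICATORS = {"mother", "father", "sister"}
NEGATIVE_INDICATORS = {"angry", "sad", "afraid"}

# Toy co-occurrence weights for an author's first-person usage (invented values).
first_person = {"mother": 3.0, "angry": 4.0, "car": 12.0}

point = (
    projection_strength(first_person, KIN_INDICATORS),
    projection_strength(first_person, NEGATIVE_INDICATORS),
)
print(point)
```

Each coordinate lies between 0 and 1, so authors with very different amounts of text can be placed in the same two-dimensional space.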
Clearly, there is a difference between how and what is being
expressed in the two sets of blog entries, and in particular what is
“behind” that expression – the first day’s entries portray someone
who is outward looking, while the second is much more inwardly
focused and personal. Although the use of first-person in the two
sets of entries has not substantially changed, the implicit nature of
the messages is indicative of a possible change in sense-of-self
(using the same understanding of the particular meaning of this
concept expressed in [24]).
Further investigations are planned with clinicians to gain a
better understanding of how the currently subjective and qualitative
evaluation may be improved. Also, one of the lessons learnt
during examination of blog text is that the terminology indicating
kin or negative emotion may need to be extended so that
smaller vectors can be used: someone in trouble
may not be loquacious. One approach would be to find such
synonyms automatically through clustering in the semantic space
[17], again using the inherent lack of structure and noisiness of
the data itself to assist. Lastly, this research has focused on a
change in a person’s sense-of-self over two months. Detecting
where the sense-of-self is low and stays low may assist with depression
and other longer-term mental health issues.
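The distinction between a change in sense-of-self and a persistently low sense-of-self can be sketched as follows. The functions `shift` and `stays_low`, the threshold and the reading of "low" as both coordinates falling below that threshold are assumptions for illustration; only the two coordinate pairs are taken from the example discussed earlier.

```python
import math


def shift(before, after):
    """Euclidean distance between two sense-of-self points (Python 3.8+)."""
    return math.dist(before, after)


def stays_low(series, threshold, min_days):
    """True if each of the last `min_days` points has both coordinates below
    `threshold`. The threshold itself is a hypothetical parameter."""
    if len(series) < min_days:
        return False
    return all(max(point) < threshold for point in series[-min_days:])


# Coordinates reported for the example feed on 20051209 and 20060203.
change = shift((0.23, 0.29), (0.69, 0.79))
print(round(change, 2))  # 0.68

# Invented series, used only to show the persistence check.
print(stays_low([(0.5, 0.5), (0.1, 0.2), (0.2, 0.1)], threshold=0.3, min_days=2))
```

A change detector and a persistence detector answer different questions, so in practice both would probably be run over the same series of daily points.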
Another example is from the third line of Table 3 – from
BLOG06-feed-015221. While the vector size is very small (28),
there is still an interesting quality and meaning difference
between 20051214:
…I had this dream just before I woke up this morning. I was
back in high school; I'd gone back in time somehow. […] "Look
at us" I said to the woman standing next to me. "We're
so...young! Look at our faces. Nothing has happened to us yet."
[…] "It went by so fast" I whispered. "Didn't it go by fast? Like
the blink of an eye." "All of life is like that" she answered. "I
didn't do enough" I went on, realizing cues I missed, qualities I
failed to appreciate in my friends when I still had them there to
appreciate […] I'm going to die, I realized. I didn't think I said it
out loud, but she was just nodding at me calmly, unafraid and I
understood her as clearly as she'd spoken too. We're all going
to die . I woke up with my jaw aching, like I'd ground my teeth all
night in my sleep. Even though I couldn't argue with anything
she said in the dream, I sat on the edge of the bed shivering and
rubbing my jaw thoughtfully; wanting, more than anything, to go
on living.
The methods and results are an important addition to the
toolbox of analysis for noisy and unstructured textual utterance
data. While promising, more comprehensive qualitative
evaluation is required. For utterance data, such as blogs, emails
and mailing lists, where the task requires human-like mimicking
of the semantics and inference, socio-cognitively motivated
semantic systems are a principled choice. In the task of
identifying deep personal context, of which a sense-of-self is but
one example, they are as yet unrivaled.
7. REFERENCES
and 20051207:
Yes, our artificial tree really did break or lose some very
important piece of hardware that maintains the vertical stance,
and we did fix it with duct tape. We don't live in West Virginia
for nothing. Duct tape is one of the six greatest inventions of the
modern day, and don't you forget it. If you can't fix it with duct
tape, forget it. It's broke.[…]
Both examples are encouraging indications that detecting a
change in this form of deeper user context is feasible.
Importantly, and in contrast to the usual TREC modus operandi,
inter-“assessor” (clinician) differences should perhaps be
embraced rather than ameliorated, with the annotated
explanations forming a vital part of assessment and evaluation.
6. CONCLUSION
User context is an important missing element in modern IR
systems. Some forms such as sentiment and style-based context
are under investigation. Deep context is an important facet of a
person. Sense-of-self is one example of deep personal context. To
collect it, though, requires analysis of people’s direct utterances.
Those in text form are fundamentally different to other
documents. They are more immediate and less thought through,
noisy and unstructured. Analysis of utterances requires
methods that mimic human information processing: socio-cognitively motivated semantic systems.
This paper presented research using HAL, a method
validated by cognitive science, to analyse the text from a large
1. Adar, E., Glance, N. and Hurst, M. (eds.) 2006. 3rd Annual
Workshop on the Weblogging Ecosystem. WWW'06.
2. Ruthven, I., Borlund, P., Ingwersen, P., Belkin, N., Tombros,
A. and Vakkari, P. (eds.) 2006. Information Interaction in
Context. First IIiX Symposium on Information Interaction in
Context, (Copenhagen), ACM Press.
3. Balog, K. and Rijke, M.d. 2006. Decomposing Bloggers'
Moods: Towards a Time Series Analysis of Moods in the
Blogosphere. in Third annual workshop on the Weblogging
ecosystem, WWW2006, (Edinburgh).
4. Burger, J.D. and Henderson, J.C. 2006. An Exploration of
Observable Features Related to Blogger Age. in AAAI Spring
Symposia 2006 on Computational Approaches to Analysing
Weblogs, (Stanford, California).
5. Burgess, C., Livesay, K. and Lund, K. 1998. Explorations in
context space: words, sentences, discourse. Discourse
Processes, 25 (2&3). 211-257.
6. Burgess, C. and Lund, K. 1997. Representing Abstract
Words and Emotional Connotation in a High-Dimensional
Memory Space. Cognitive Science.
7. Callan, J.P., Croft, W.B. and Harding, S.M. 1992. The
(INQUERY) Retrieval System. in DEXA-92, 3rd
International Conference on Database and Expert Systems
Applications.
8. Ducheneaut, N.B. and Bellotti, V. 2002. Ceci n’est pas un
objet? Talking about things in email. Journal of Human-Computer
Interaction.
9. Firth, J. 1957. A Synopsis of Linguistic Theory 1930-1955.
in reprinted in Palmer, F.S.P.o.J.R.F.L.H. ed., Philological
Society, Oxford.
10. Freyd, J. 1983. Shareability: the social psychology of
epistemology. Cognitive Science, 7. 191-210.
11. Gärdenfors, P. 2000. Conceptual Spaces: The Geometry of
Thought. MIT Press.
12. Gu, L., Lento, T., Smith, M. and Johns, P. 2006. How do
blog gardens grow? Language community correlates with
network diffusion and adoption of blogging systems. in AAAI
Spring Symposia 2006 on Computational Approaches to
Analysing Weblogs, (Stanford, California).
13. Herring, S., Kouper, I., Paolillo, J.C., Scheidt, L.A.,
Tyworth, M., Welsch, P., Wright, E. and Yu, N. 2005.
Conversations in the Blogosphere: An analysis "from the
bottom up". in Thirty-Eighth Hawaii International
Conference on System Sciences (HICSS-38), Los Alamitos,
IEEE Press.
14. Hsu, W.H., Weninger, T., Pydimarri, T. and Paradesi,
M.S.R. 2006. Collaborative and Structural Recommendation
of Friends using Weblog-based Social Network Analysis. in
AAAI Spring Symposia 2006 on Computational Approaches
to Analysing Weblogs, (Stanford, California).
15. Ingwersen, P. and Järvelin, K. 2005. Information retrieval in
context: IRiX. SIGIR Forum, 39 (2). 31-39.
16. Ku, L.-W., Liang, Y.-T. and Chen, H.-H. 2006. Opinion
Extraction, Summarization and Tracking in News and Blog
Corpora. in AAAI Spring Symposia 2006 on Computational
Approaches to Analysing Weblogs, (Stanford, California).
17. Landauer, T.K. and Dumais, S.T. 1997. A solution to Plato's
problem: The latent semantic analysis theory of acquisition,
induction and representation of knowledge. Psychological
Review, 104. 211-240.
18. Landauer, T.K., Foltz, P.W. and Laham, D. 1998. An
introduction to latent semantic analysis. Discourse
Processes, 25 (2&3). 259-284.
19. Lento, T., Welser, H.T., Gu, L. and Smith, M. 2006. The
Ties that Blog: Examining the Relationship Between Social
Ties and Continued Participation in the Wallop Weblogging
System. in Third annual workshop on the Weblogging
ecosystem, WWW2006, (Edinburgh).
20. Li, X., Liu, B. and Yu, P.S. 2006. Mining Community
Structure of Named Entities from Web Pages and Blogs. in
AAAI Spring Symposia 2006 on Computational Approaches
to Analysing Weblogs, (Stanford, California).
21. Lin, Y.-R., Sundaram, H., Chi, Y., Tatemura, J. and Tseng,
B. 2006. Discovery of Blog Communities Based on Mutual
Awareness. in Third annual workshop on the Weblogging
ecosystem, WWW2006, (Edinburgh).
22. Lund, K. and Burgess, C. 1996. Producing high-dimensional
semantic spaces from lexical co-occurrence. Behavior
Research Methods, Instruments & Computers, 28 (2). 203-208.
23. Lund, K., Burgess, C. and Atchley, R.A. 1995. Semantic and
associative priming in high-dimensional semantic space. in
Cognitive Science, Erlbaum Publishers, Hillsdale, N.J.
24. McArthur, R., Bruza, P., Kralik, D. and Warren, J. 2006.
Projecting computational sense of self: A study of transition
in a chronic illness online community. in Proceedings of the
39th Hawaii International Conference on System Sciences
(HICSS-39).
25. McArthur, R. and Bruza, P.D. 2003. Discovery of implicit
and explicit connections between people using email
utterance. in Proceedings of the 8th European Conference on
Computer-supported Cooperative Work (ECSCW), Kluwer
Academic Publishers, 21-40.
26. McArthur, R. and Bruza, P.D. 2003. Discovery of tacit
knowledge and topical ebbs and flows within the utterances
of online community. in Ohsawa, Y. and McBurney, P. eds.
Chance Discovery - Foundations and Applications, Springer-Verlag.
27. McArthur, R. and Bruza, P.D. 2003. Finding Tacit
Knowledge in Online Communities. in Proceedings of the
Eighth Australasian Document Computing Symposium
(ADCS'03), NWO-MES Press, 61-66.
28. McArthur, R., Bruza, P.D. and Song, D. 2005. Policy
Conformance in the Corporate Blog Space. in Workshop on
Policy Management on the Web, WWW'05.
29. Mihalcea, R. and Liu, H. 2006. A Corpus-based approach to
finding happiness. in AAAI Spring Symposium on
Computational Approaches to Weblogs.
30. Mishne, G. and Glance, N. 2006. Predicting Movie Sales
from Blogger Sentiment. in AAAI Spring Symposia 2006 on
Computational Approaches to Analysing Weblogs, (Stanford,
California).
31. Mishne, G. and Rijke, M.d. 2006. Capturing Global Mood
Levels using Blog Posts. in AAAI Spring Symposia 2006 on
Computational Approaches to Analysing Weblogs, (Stanford,
California).
32. Moffat, A., Zobel, J. and Hawking, D. 2005. Recommended
Reading for IR Research Students. SIGIR Forum, 39 (2).
33. Mullen, T. and Malouf, R. 2006. A preliminary investigation
into sentiment analysis of informal political discourse. in
AAAI Spring Symposia 2006 on Computational Approaches
to Analysing Weblogs, (Stanford, California).
34. Nicolov, N., Salvetti, F., Liberman, M. and Martin, J.H.
2006. Computational Approaches to Analyzing Weblogs
Papers from the 2006 Spring Symposium, AAAI.
35. Nowson, S. and Oberlander, J. 2006. The Identity of
Bloggers: Openness and gender in personal weblogs. in
AAAI Spring Symposia 2006 on Computational Approaches
to Analysing Weblogs, (Stanford, California).
36. Ounis, I., Rijke, M.d., Craig, M., Mishne, G. and Soboroff, I.
2006. Overview of the TREC-2006 Blog Track. in Text
Retrieval Conference (TREC), NIST, Washington.
37. Schler, J., Koppel, M., Argamon, S. and Pennebaker, J. 2006.
Effects of Age and Gender on Blogging. in AAAI Spring
Symposia 2006 on Computational Approaches to Analysing
Weblogs, (Stanford, California).
38. Thelwall, M. 2006. Blogs During the London Attacks: Top
Information Sources and Topics. in Third annual workshop
on the Weblogging ecosystem, WWW2006, (Edinburgh).
39. White, R.W., Kules, B., Drucker, S.M. and Schraefel, M.C.
2006. Supporting exploratory search: introduction to special
issue. Commun. ACM, 49 (4). 36-39.
40. White, R.W., Muresan, G. and Marchionini, G. 2006. Report
on ACM SIGIR 2006 workshop on evaluating exploratory
search systems. SIGIR Forum, 40 (2). 52-60.
41. Yasuda, N., Hirao, T., Suzuki, J. and Isozaki, H. 2006.
Identifying Bloggers' Residential Area. in AAAI Spring
Symposia 2006 on Computational Approaches to Analysing
Weblogs, (Stanford, California).