Uncovering deep user context from blogs

Robert McArthur
CSIRO ICT Centre
GPO Box 664
Canberra ACT 2601 Australia
+61-2-6216 7000
robert.mcarthur@csiro.au

ABSTRACT
People’s utterances are fundamentally different to other documents because they are more immediate and less thought through. While this makes them more natural – noisy and unstructured – it provides an unrivalled opportunity to see “inside” the author, to collect some context. The data requires analysis methods that have a relationship to human information processing: socio-cognitively motivated semantic systems. Using HAL, a method validated by cognitive science, the text from a large number of blog entries was analysed to uncover changes in each author’s sense-of-self. Sense-of-self was measured through geometric projection of the author’s first-person usage onto key indicators of kin and negative emotion words. An example of non-clinical qualitative evaluation affirmed the utility and promise of the technique: that deep personal context can be uncovered from utterances through the appropriate analysis and inference.

Categories and Subject Descriptors
H.3.3 [Information Systems]: Information Search and Retrieval – information filtering, clustering, search process. J.4 [Social and Behavioral Sciences]: Psychology

General Terms
Algorithms, Measurement, Experimentation, Human Factors, Theory

Keywords
context, socio-cognitive, semantics, HAL, sense-of-self

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
AND '08, July 24, 2008 Singapore
Copyright ©2008 ACM 978-1-60558-196-5... $5.00

1. INTRODUCTION
Information retrieval has a history of dealing with a variety of unstructured and noisy data (UAND). This has culminated in the phenomenon of the general web search engine. These can lay claim to dealing with an enormous amount of data, much of which is changing, noisy and relatively unstructured, in functionally very successful ways. In a highly competitive operational field, and with new directions in theory, there is a rising trend in information retrieval to an increased involvement of context, especially user context. Context in this sense can be defined as information that allows retrieval and presentation in ways that are not necessarily representative of the average person, but are more personalized to an individual or identifiable group. The success of workshops at SIGIR [15] and the subsequent IIiX Symposium and workshops [2], the SIGIR exploratory search workshop [40], and the outcome of the 2004 SWIRL workshop [32] are indicators that elements of user context are both sought and becoming available: “The provision of tools…can yield great rewards for users, especially when contextual factors such as user emotion, task constraints, and dynamism of information needs are considered” [39].

User context is context about the user. While trite, this is an important distinction as it accentuates the human dimension which has frequently been acknowledged but ignored: we need to know more about the author and questioner/reader as people if we are to identify and use their context for better retrieval. Retrieval using unstructured, noisy data that are people’s utterances requires a different approach compared to documents created by multiple authors, and/or over a longer period of time. The most readily available contemporary manifestation of people’s utterances on which experimentation is possible in this search for context is the blog entry.

This paper continues a line of research in which the utility of socio-cognitively motivated algorithms to uncover facts, relationships and context from automatic analysis of noisy, unstructured textual utterances is demonstrated. In the research described, a “deep” style of user context is uncovered by automatically analyzing blog entries. The change in a person’s “sense-of-self” is one example of “deep” personal context. It is more implicit information about a person beyond the absolute explicit (“she plays netball”) or relative explicit (“he types fast”). Decision support for computer-assisted clinical intervention is the domain in which this particular context has shown promise.

An experiment is presented in which the HAL algorithm from cognitive science [22], as a manifestation of socio-cognitively motivated analysis, is applied to UAND from the TREC Blog track. The results demonstrate the promising utility of the techniques applied to blog data by identifying individuals – candidates of interest – who may be exhibiting a low sense-of-self. The intention is that such a list would assist clinical decision making about who to support with the scarce resources available. Since making such decisions over all blog data is beyond the scope of this study, an example is presented which demonstrates the insights possible by means of qualitative evaluation techniques.

This research is an important contribution demonstrating what appropriate methods of analysis of people’s utterances may be able to achieve. It is precisely the artefacts of human communication – expressing oneself in a human manner: lack of structure and “noise” – that present us as humans with the context we need, and present computer systems with electronic headaches.

The next section introduces background to the research, followed by a description of the experiment including methods and results. The paper concludes with some further ideas and directions on unstructured, noisy data.

2. BACKGROUND
Recent weblog workshops [1, 34] reflect the difficulty of dissecting UAND for a task, but also a strong interest in using such data for the discovery of context about the user in the form of the blogger’s age [4, 37], gender [35, 37], opinions and sentiments expressed [16, 30, 33], mood levels [3, 31], happiness [29], residential area [41] and social network(s) [12, 14, 19-21, 38]. The research areas have concentrated on particular ways of identifying or extracting information, providing fertile ground for TREC-style comparisons of the approaches. All of the tasks and methods show how interesting but difficult the problem of using UAND is. It has been the simpler problems in which reasonable success has occurred, almost always using methods successful in less noisy, non-utterance data (e.g. splog detection).

Considerable interest has been shown in the analysis of sentiment in weblogs [3, 29-31, 33]. This broad umbrella encompasses notions of mood, opinion, emotion and happiness, both explicit and implicit, short-term and over longer periods. Sentiment is only one aspect of user context though. Indeed, many of the published notions of sentiment derive from reflection of authorship. That is, determination of the mood, opinion or emotion comes through analysis of the artefacts of communication, the blog entry, and is deemed a reflection of the sentiment of the author at the time. The assumption is that a person’s writing reflects their inner state of being. Although it is easy to ignore the author and concentrate on the blog entry itself, the focus should remain on determining more about the user and how to gain this using UAND.

[Figure 1 labels. Context about the “document”: sentiment (mood, happiness, opinion) and style (length, grammar). Context about the author: implicit, meta, aggregated (sense-of-self, others).]
Figure 1: Different contexts drawn from textual data

To this end, more information about the user is required apart from the emphasis on the (usually) explicit manifestations of their selves. Figure 1 presents this difference by separating what we know about the document (e.g. style), what we know about the user from the document at a particular point in time (e.g. sentiment), and the more implicit and meta-information which is derived from the sentiment and which captures deeper context about the user. An example of such information is a computational manifestation of a person’s sense-of-self. This has been investigated previously using other UAND data – mailing list records – in an online health setting [24]. The research presented here utilizes that work as an exemplar of deeper user context, refining the methods and applying the ideas and techniques to a new type of data: blog posts. Thus this work uses very similar methods but investigates the utility in the very different data of blog posts, across a much larger collection of authors and posts, and therefore by necessity uses a more quantitative method of deciding candidates of interest.

2.1 Socio-cognitive systems
Freyd [10] provides a socio-cognitive frame for a viable communal (i.e., shared or conveyed) knowledge representation:

  what seems common to most of the main approaches to semantics is an assumption that values of semantics components, or features, are critical to word meaning. What is relevant to shareability theory is that a smaller number of features seem to be used than number of words. (pp195-6)

  I am arguing that a dimensional structure for representing knowledge is efficient for communicating meaning between individuals. That is, a small dimensional structure with a small number of values on each dimension is argued to be especially shareable, which might explain why such structures are observed. (pp198-9)

Freyd’s suppositions on the dimensional nature of shared knowledge are compatible with a recent, three-level socio-cognitive model of cognition by Gärdenfors [11], who argues that meanings of words come from conceptual structures in people’s heads. Of his three levels of representation, symbolic, conceptual and associationist, it is the middle, conceptual, level that is of relevance for this paper. Gärdenfors’ socio-cognitive position is that meanings emerge from the conceptual structures harbored by individual cognition together with the linguistic power structure within the community: “The dimensions are the basic building blocks of representations on the conceptual level” (p43). Of significant note is his adoption of Freyd’s supposition: social interactions will constrain these conceptual structures.

This constriction of the dimensional structure by the individual for social interaction is important. We tend to economise our utterances, losing structure and increasing noise, leveraging a shared background. This may be especially true in the persistent conversations of online discussion groups, which can become “grounded in sufficient mutual understanding to allow very brief, sketchy and implicit references to succeed without posing significant problems in interpretation.” [8]

2.2 Semantic spaces
Discovering and representing knowledge from utterances requires cognitive processing ability, or the ability to mimic it. While Gärdenfors presents an attractive socio-cognitive basis, high dimensional semantic spaces are the preferred computational models to operationalise the theory.

A human encountering a new concept derives the meaning via an accumulation of experience of the contexts in which the concept appears [9]. This opens the door to “learn” the meaning of a concept through how it appears within the context of other concepts. HAL is one of an ensemble of models of semantic memory from psychology that automatically constructs a dimensional semantic space from a corpus of text [5, 6, 22, 23].
HAL has accounted for a wide variety of semantic effects in the cognitive and neuropsychological literature and has an enviable track record (http://locutus.ucr.edu/ has many publications). Latent Semantic Analysis (LSA) [18] is another leading model.

Burgess and Lund [6] examined whether HAL could represent abstract concepts, such as love, hate and joy. They found that, in a comparison with human raters in predicting abstract variables for a set of words, “global co-occurrence information carried in the word vectors can be used to predict a tangible proportion of the human likert scale ratings.”

Previous work [25-27] has shown that HAL can be used successfully in extracting knowledge directly from email utterances, both explicit and tacit. The studies have also shown the validity of using a (modified) global association model like HAL, originally developed on a 300 million word corpus from Usenet, on datasets ranging from a few sentences [26] to a few thousand emails or documents [25]. For all of these reasons, HAL was chosen in [24] and here as the semantic space model by which a representation of “sense-of-self” could be captured from the discourse in blog entries.

Semantic spaces also have the property that they can be created on the basis of different variables – e.g. particular datasets derived from various times or authors – as well as through known global association models such as HAL. These local associations provide a framework for understanding, in authorship or longitudinally, the utterances of individuals as well as the community as a whole. This is used to capture the state of a person at a particular time. The minimum time period, such as a day or month, depends on the amount of textual utterance available: the more coherent and more utterances for the time the better.

Therefore it is hypothesized that a socio-cognitively motivated paradigm and practical semantic spaces provide the best chance of representing and inferring user context.

This research is the first to apply semantic spaces to blog data. Blog entries are quite different to most other text data because, unlike many types of documents, they are written by the individual as a personal communication but often not directed to a particular reader. Sometimes they are written in the knowledge that many thousands of people will read, digest and comment on the contents. Other blogs are more personal and it doesn’t seem to matter to the author whether anyone else reads them. Whichever is the case on this continuum, the blogs have a clear authorship and are often written in first person. Apart from email, on which it is notoriously difficult to conduct research because of privacy concerns, blogs are the principal other medium which has these features, and it is these features which provide the strongest source of information – context – about the author.

2.3 Defining sense-of-self
As defined in [24], an important element of our personal context is our sense-of-self. It has been posited, and demonstrated to some degree, that people manifest this in their utterances by writing in the first person singular form, using the personal pronouns I, me, my or mine. First-person form is one that is inherent to people’s utterances. The dimensional representation of all of the first-person terms is the beginnings of a computational representation of the author’s sense-of-self. This representation derives automatically from the corpus from which the semantic space was computed or, in other words, the semantics of a person’s computational self derive from their utterances involving self-reference. See [24] for more justification and details.

Using a geometric point in the semantic space as the representation, movement in sense-of-self can be analysed against axes of reference. The same termsets as [24] were used here to form axis centroids, the average of the terms in the sets in Table 1, creating two axes of kin and negative emotion terms. The projection of the sense-of-self against these axes provides a temporal point for an author’s sense-of-self, while change in sense-of-self is determined by the distance between temporal points for the same author over different periods (e.g. days).

Table 1: Terms of interest
Negative emotion words: "hate", "pain", "anger", "painful", "fatigue", "fatigued", "angry"
Kin words: "mother", "father", "fathers", "grandfather", "mothers", "grandmother", "mother-in-law", "grandson", "sons", "son-in-law", "daughters", "granddaughter", "daughter-in-law", "sibling", "siblings", "husbands", "sister", "brother", "daughter", "son", "wife", "husband"
Words comprising the sense-of-self: "I", "me", "my", "mine"

In short, the HAL-produced semantic space is the knowledge representation framework; blog entries provide utterances of the right form for analysis; and sense-of-self is the geometric position of the author’s self-referents as against terms of kin and negative emotion in the semantic space. Movement in sense-of-self is determined using temporally determined sense-of-self points for individual authors.

In practical terms, certain terms, chosen to exemplify the concepts of negative emotion and kin, are combined to form a 2D space onto which the monthly sense-of-self vectors of the authors are projected. The sense-of-self vectors arise from the identification of first-person language within the blog posts. A large movement within the 2D space of the monthly sense-of-self points indicates a potentially interesting change to the author’s sense-of-self. Terms used to create the axes are drawn from the original work, as well as more recent blog-related sources (like [29]).

It is important to note that while some blog entries provide personal utterances which provide evidence towards a sense-of-self, there are many that do not. These include splogs (spamblogs), quotations, technology blogs, documents not in the first person, etc.
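The axis centroids, projection and movement described in this section could be sketched along the following lines. This is an illustration only, not the code used in the study: the function names are hypothetical, the toy vectors are invented, and the projection is assumed to be a dot product between unit-length vectors, which is one reading of the "simple vector algebra" mentioned in the Methods.

```python
import math


def unit(vector):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm > 0 else list(vector)


def centroid_axis(term_vectors):
    """Average the unit-length vectors of a term set and renormalise."""
    units = [unit(v) for v in term_vectors]
    mean = [sum(column) / len(units) for column in zip(*units)]
    return unit(mean)


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def sense_of_self_point(self_vector, kin_axis, neg_axis):
    """Project a sense-of-self vector onto the two axes (assumed: dot product)."""
    s = unit(self_vector)
    return dot(s, unit(kin_axis)), dot(s, unit(neg_axis))


def movement(point_a, point_b):
    """Euclidean distance between two (kin, negative emotion) points."""
    return math.hypot(point_a[0] - point_b[0], point_a[1] - point_b[1])


# Toy three-dimensional "semantic space"; the vectors are invented.
kin_axis = centroid_axis([[4, 1, 0], [2, 2, 0]])
neg_axis = centroid_axis([[0, 1, 3], [0, 2, 2]])
day_one = sense_of_self_point([1, 0, 0], kin_axis, neg_axis)
day_two = sense_of_self_point([1, 2, 4], kin_axis, neg_axis)
print(day_one, day_two, movement(day_one, day_two))

# The two points reported in the Discussion for BLOG06-feed-032140:
print(round(movement((0.23, 0.29), (0.69, 0.79)), 2))  # 0.68
```

The last line recomputes the distance between the two points the paper reports for that feed and agrees with the published score of 0.68, which suggests the movement measure is plain Euclidean distance in the 2D space.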
Limiting the analysis to entries written in the first person, and substantially so when vector size is taken into account (see sections 4 and 5), in practice removes any effect from these spurious or uninteresting blog entries.

In brief, the advantage of blogs as against all other forms of data for identifying and analysing user context is that
  1. they are written by a single, identifiable author;
  2. they are revealing – often explicit personal information is presented;
  3. they are freely available.

The next sections describe the particular experimental procedure and results to empirically investigate these notions.

3. METHODS
The amount of utterance data for each blogger on each day is wildly different. Also, the largest difference in time between examining an author’s text is more likely to uncover deeper, inherent, context and show pronounced change. Therefore 9 consecutive days in the first part of the data, December (20051207, 20051208, 20051209, 20051210, 20051211, 20051212, 20051213, 20051214, 20051215), and 7 days in the latter period of February (20060201, 20060202, 20060203, 20060204, 20060205, 20060206, 20060207) were chosen for analysis. Multiple days in each period were chosen to attempt to capture one of the qualities of personal blogging – that repeated daily or near-daily entries are the type more likely to evince and respond to the deep personal analysis this research desires to uncover.

3.1 HAL
A HAL space comprises high dimensional vector representations for each term in a vocabulary. Given an n-word vocabulary, the space is an n x n matrix constructed by moving a window of length l (typically 10) over the corpus by one word increments ignoring punctuation, sentence and paragraph boundaries. All words within the window are considered as co-occurring with each other with strengths inversely proportional to the distance between them.
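As a rough illustration of the window procedure, the sketch below builds the co-occurrence weights for the example sentence used in this section. It is not the original implementation: the function names and the small stop-word set are hypothetical, and the weighting (window length minus distance) is an assumption chosen because it reproduces the strengths in the paper's example matrix.

```python
from collections import defaultdict


def hal_matrix(tokens, window=3):
    """Accumulate direction-sensitive co-occurrence weights.

    weights[a][b] is the total strength with which `a` precedes `b`.
    A pair at distance d (1 <= d < window) contributes window - d,
    an assumed weighting that matches the worked example.
    """
    weights = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(tokens):
        for d in range(1, window):
            if i + d >= len(tokens):
                break
            weights[word][tokens[i + d]] += window - d
    return weights


def combined_vector(weights, word, vocabulary):
    """Add a word's row (words that follow it) and column (words that precede it)."""
    return [weights[word][other] + weights[other][word] for other in vocabulary]


# Illustrative stop-word set for this one sentence only.
STOP_WORDS = {"the", "of", "that", "way", "was", "too", "to"}
text = "the pain of seeing dad that way was too strong to bear"
tokens = [w for w in text.split() if w not in STOP_WORDS]
vocabulary = tokens  # the five remaining words are all distinct

space = hal_matrix(tokens, window=3)
for word in vocabulary:
    print(word, combined_vector(space, word, vocabulary))
```

Summing the row and column of each word discards word order, so the printed vectors form a symmetric matrix. Stop words are removed before the window is applied, so the window counts non-stop words only.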
After traversing the corpus, an accumulated co-occurrence matrix for all the words in a target vocabulary is produced. Note that word pairs in HAL are direction sensitive – the co-occurrence information for words preceding every word and co-occurrence information for words following it are recorded separately by its row and column vectors.

The HAL matrix for an example text “the pain of seeing dad that way was too strong to bear” is depicted in Table 2 using a 3 word moving window (l=3). As an example of reading Table 2, the word dad occurs before (is related to) bear with strength 1 (2 minus the 1 intervening word in the window). Stop words (every word other than pain, seeing, dad, strong and bear) were removed.

Table 2: HAL matrix for "the pain of seeing dad that way was too strong to bear"

          pain  seeing  dad  strong  bear
pain        0      2     1      0      0
seeing      2      0     2      1      0
dad         1      2     0      2      1
strong      0      1     2      0      2
bear        0      0     1      2      0

Prior studies [25] revealed that for the purposes of this research, it was not useful to preserve word order information, so the HAL vector of a word was represented by the addition of its row and column vectors. The quality of HAL vectors is influenced by the window size; the longer the window, the higher the chance of representing spurious associations between terms. Burgess et al. [5] used a size of ten in their studies. To limit the influence of stop words (frequently occurring words that do not help to differentiate documents or utterances) in dimensional reduction and reduce the frequency bias, the INQUERY [7] list provided stopwords that were deleted from the input to HAL. Thus a window of 10 non-stop words was used, rather than 10 words.

3.2 Data
The TREC Blog track [36] provided a large scale test collection of blog posts from 2006. The data, being a snapshot in time over a range of blog sites, appears to include all genres of blogs, from one-off to “A-list” [13], as well as spam for “a realistic research setting”, making it a particularly realistic example of UAND. There were 100,649 unique blogs with 3,215,171 entries from 6/12/2005 to 21/2/2006.

The weblogs were analysed in a manner similar to [24]. Unlike in that study, some “pre-semantic” information such as part-of-speech tagging was not required because of the larger amount of data; for similar reasons, SVD (singular value decomposition) was not applied, thus only explicit associations or relationships were uncovered.

In summary, the following stages were implemented, with comments on each stage:

1. The text of the blog entries was extracted from the XML-like syntax.
   This was more difficult than necessary due to the XML-like format used, the fact that the messages were often in their original HTML format – each of which used different methods and standards including Javascript – and the variety of character sets and languages involved.

2. Entries were tokenised, and non-English language blogs eliminated.
   This again was difficult as the texts were very ‘dirty’ – many strange characters, odd positioning of spaces and inter-word separators. It is unlikely that more detailed linguistic analysis, such as part-of-speech tagging and chunking, would succeed well because of both the format of the data and the language used. Non-English blogs were relatively easy to identify and remove using simple regexps.

3. A small number of stopwords were removed.
   Some words usually considered stopwords, like “I” or “my”, are words that are vital to understanding the deep personal information in the blog. Removal of too many words that are important indicators in terms of personal context risks loss of the ability, wholly or in part, to find the desired entries; but the cluttering of the vectors with ‘useless’ words harms the quality of the subsequent analysis. The choice of which and how many stopwords to remove is likely to depend upon the particular personal context under examination.

4. General cleaning of the text was performed (translation of forms like “I’ve” to “I have”, removal of URIs, converting to lower-case).

5. Hyperspace analogue to language (HAL, [5]) was performed over all the blog entries identified above with a window size of 10. The result was the creation of two combined vectors, one for the set of kin words and one for the set of negative emotion words.
   Normally a semantic space is constructed for all words in the corpus. However, due to the large amount of data, only vectors for terms of interest were computed rather than an entire semantic space (a vocabulary of 1,054,820 terms was identified; 53% were unknown to Unix spell leaving 498,746 known terms; even this is a very large semantic space compared to [5] or [17]). In keeping with previous research [24, 25, 28], pre- and post-associations were combined leaving a single vector for each word. Vectors were normalised to unit length. Each of the two vectors representing sets of words was created by summing the individual words’ vectors, averaging and then normalising the result.

6. Hyperspace analogue to language ([5]) was performed (window size=10) but only vectors for the sense-of-self were created.
   Also, since the sense-of-self is personal, the sense-of-self vector (calculated as in the previous stage) was created separately for each blog feed over the days in question. Over 56,000 sense-of-self vectors were created, with over 13,300 appearing on more than one day during the analysis period. Those occurring on only one day were eliminated.

7. The sense-of-self vectors were projected onto the two axes of interest using simple vector algebra: each sense-of-self vector for a blog feed in a day (vector of 1 x n) was projected onto the combined kin vector (1 x n) and the combined negative emotion vector (1 x n) yielding an x,y position against these two axes.
   The length of the sense-of-self vectors varied widely – the more contexts in which the terms comprising the sense-of-self were used, the larger the vector. A larger vector may be a better exemplar of the person’s sense-of-self, although it is likely to be a non-linear relationship – one occurrence of “I” may not be strongly evidential, but three occurrences may be as good as 30.

4. RESULTS
Table 2 presents the largest associations in the combined kin and negative emotion vectors to assist understanding of the method. For ease of reading, the normalised values (0-1) have been multiplied by 100,000. Terms comprising the combined vector are marked with an asterisk.

Table 2: Largest associations for combined vectors

Kin (n=123,589, x̄=10.23)          Negative emotion (n=69,442, x̄=20.04)
Term                   Value       Term            Value
my                     57423       i               68227
i                      38471       not             21778
his                    24290       my              19564
a                      18530       me              17689
her                    18476       you             15234
me                     16860       but             14084
[possessive]           15774       have            13189
you                    14362       back            12397
not                    13014       so              11483
he                     12527       a               11427
[personalpossessive]   12478       [ellipsis]      11382
have                   11745       hate*           10481
your                   10595       pain*           10370
she                     9359       they            10322
had                     8187       do              10219
mother*                 7009       them             8125
but                     6450       all              7986
one                     5657       like             7756
we                      5330       am               7739
so                      5271       is               7586
has                     5251       he               7490
sex                     5028       love             7222
him                     5003       just             6709
[ellipsis]              4845       much             6340
their                   4729       it               6128
wife*                   4722       your             5992
like                    4626       up               5918
am                      4604       more             5781
up                      4568       his              5711
sister*                 4492       we               5589
they                    4408       one              5434
our                     4366       people           5262
do                      4337       her              5126
said                    4294       she              5115
father*                 4170       get              5065
was                     4162       really           5034
all                     4070       because          4899
now                     4066       did              4857
just                    3805       there            4733
would                   3788       him              4611
time                    3730       feel             4525
little                  3690       their            4250
son*                    3618       out              4230
is                      3582       management       4105
brother*                3489       [possessive]     4065
nude                    3478       know             3958
out                     3403       being            3892
incest                  3386       anger*           3803

The result of the experiment is a list of bloggers and the amount of change of sense-of-self between particular days in the first part of the corpus (Dec) and in the last part (Feb). Table 3 shows an example ranked list of feed numbers. The score is the Euclidean distance between the 2D sense-of-self point (kin axis, negative emotion axis) on one day and the point on another day. A high kin or negative emotion value indicates strength of or high usage on that axis. The vector size is also shown, along with the number of elements the sense-of-self vector has in common with the combined kin and negative emotion vectors. The higher the score, the more the sense-of-self has changed over the period. While the table begins with the highest scoring blogs, the vector size of many of the top entries for one of the two days is very low, such as 12 for the first entry. These small vectors indicate very little first-person usage and little chance of detecting an anomaly. Therefore, while the highest of these scores are shown, the first interesting example of larger vector size occurs at a score of 0.68.

Table 3: Example output (n=9230 pairs, avg. score=0.10)
Blog Feed IDs carry the prefix BLOG06-feed-. Each score has one row per day.

Score  Blog Feed ID  Kin value  Negative emotion value  Day       Vector size  Elements in common with kin  Elements in common with neg. emotion
1.07   050248        0.04       0.05                    20060205  12           12                           12
                     0.72       0.87                    20051211  1649         1498                         1442
0.95   072330        0.05       0.07                    20060201  10           10                           10
                     0.67       0.79                    20051214  537          516                          513
0.92   015221        0.06       0.10                    20051207  28           28                           28
                     0.65       0.80                    20051214  329          317                          318
…      …             …          …                       …         …            …                            …
0.68   032140        0.23       0.29                    20051209  221          213                          205
                     0.69       0.79                    20060203  725          678                          681
…      …             …          …                       …         …            …                            …

5. DISCUSSION
It is important to commence the discussion with the proviso that no clinical diagnosis is being offered. The list of candidate blogs (e.g. Table 3) could be used as a decision support tool by clinicians who would likely examine the entries in question, and follow up interesting leads with other, perhaps face-to-face, interaction. However, with increasingly scarce resources, both at an individual’s cognitive level and the medical establishment’s staffing level, creating a candidate list of prospective respondents would be extremely useful: after all, at the moment, there is no analysis at all. Even should 1 in 100 or more leads prove worth further analysis, the human benefits are likely to be worthwhile.

The assessment as to which blog entries most contribute to the context is almost certainly qualitative and task dependent, in line with the socio-cognitive nature of the problem. The determination of whether a particular blog entry is the basis for intervention cannot be quantitatively, or simply, judged; anecdotal evidence indicates that different clinicians often disagree with colleagues’ conclusions. The good news is that often the first level identification of interesting candidates is quite viable since as humans we do such comparisons every moment and are, relatively, good at it.

For example, using Table 3, let us examine the particular blog entries that form the basis of one of the results. The list is ordered by a basic distance score; however, a factor not encompassed by that score is the size of the sense-of-self vector. A small vector means few uses of first-person language and hence may not be reliable. The entry BLOG06-feed-032140 has the highest score with the largest number of associations in both days’ sense-of-self vectors, hence it will serve as a practical case in point. Note that this isn’t to downplay or ignore the strength of the other results, merely that this datapoint’s text may be clearer to non-specialists. In a clinical setting a practical decision would be made on how many blog entries to examine versus a cut-off based on vector size and score.

The first day for this blog feed is 20051209, where the sense-of-self is (0.23, 0.29), while the second is 20060203 (0.69, 0.79). Since analysis and interpretation of the meaning of these inferences is qualitative, and opinion will differ, the most appropriate demonstration is to show instances of blog entries from the first day’s feed and contrast these to instances from the second day. The title of each entry is given on its own line, followed by the unadorned entry – very unstructured and noisy textual data. Due to space constraints, blog entries that are similar to, or do not add to those shown, are not presented (there were 10 entries in total).

The 12 Days of Thanks: Day 10
Things I am thankful for #10: Bruce Campbell This... is my BOOMSTICK!!! One of the best cheesy actors of our time, the star of the Evil Dead series. I simply can't resist a movie with Bruce.

Intermission
I am taking a brief break from giving thanks to express a few thoughts. […] Koala, the fifth kitten in the litter I've mentioned a half a million times disappeared for a while. I couldn't figure out why. Until last night. We have a fucking POSSUM living in our crawlspace. No wait, let me correct that. We have a fucking POSSUM living in our BACK crawl space. You see, we have two crawlspaces. One in the front and one in the back. The one in the front is occassionally squatted by a homeless CRACK junkie. It's like having a tenant who shits the floor and smells of whiskey-soaked rotten eggs, but won't pay the rent. Sort of like a teenager. Eventually we will buy a new set of hardware for the hatch door to keep him out... or keep him IN, depending on whether or not I've taken my Maca that day. No idea what we're going to do about the GIANT rodent in the back crawlspace. So basically, Koala the kitten has sought residence with his Daddy, whom I've only seen once, but looks like a bigger version of Booga. Booga's going to be a big boy. He's going to whip Mojo's ass one day. And now I return you the regularly scheduled Days of Thanks...

Example blog entries from the second day’s (20060203) feed are (10 entries total):

Cutting Day
Date: February 24, 2006 5:30AM Event: Laparoscopy/Hysteroscopy/Septum Eviction 2006 Prep: Occasional pacing; random anxiety dream; shallow breathing; temporary attention deficit; excitement and hope that this might fix the plumbing problems. Three weeks to go…

Ahem... yeah.
Apparently, my MRI strongly suggests an arcuate uterus. Arcuate, as in irregularly shaped. Arcuate, as in almost heart-shaped (and not in a cheeky Valentines kind of way). Arcuate, as in what one doctor noted when I was pregnant, only to be contradicted by Dr. Asshole later. Arcuate, as in lots more surgery. Arcuate, as in “might even be septate-but we won't be sure until the surgery because there's a gigantic fibroid blocking our view.” I am having a difficult time finding enough information about it. Most say that it doesn't increase the risk of miscarriage, which I have to say is royal bullshit according to my interview of my seven deceased fetuses. […]

They like it when you sweat...
Nothing yet. All day, I waited patiently by the phone. I took my cell phone with me to staff meetings, expecting to leave the meeting to talk to Dr. Awesome. ...and I waited....and waited...and waited...At 5:55pm, I got up and headed out the door to go home. Somewhere en route from my desk to the car, they called. But of course, the windtunnel I walk through known as “parking deck” was noisy enough that OF COURSE, I didn't hear the damn phone ringing. Ah, but they left me a message. “Mrs. Drab, we have the results back for your MRI. If you will call us tomorrow, we will go over those results with you. Also, you need bloodwork drawn. Call and we will give the details.” Thanks, that was a fat load of help. My imagination is running wild. Now, I'm fairly certain THIS is The Shadow. This is why they are being coy. They don't want to tell me about the evil creature that has attached itself to my uterine wall. MMMMhmmmm... Mark my words. E-V-I-L C-R-EA-T-U-R-E. Closely related to the Ripapod. Dammit, they found a mutant Ripapod in there. Son of a bitch.

Why I will fail at Anger Management
My doctor (the one who treated my TMJ) advised me to work on my anger management issues. I will fail at this task. It's not that I'm a defeatist or like to give up easily (anyone who's followed my ridiculous Chronicles of Conceiving knows that). The truth is that I've had this anger as long as I can remember. It's like a close friend. The one who leaves cigarette burns on your couch and never replaces the toilet paper, but at the end of the day you are saying “OHHHH Anger. You're simply ADORABLE.” I'd miss the little curmudgeonette if she were to leave me entirely. And let's face it, with an asshole around every corner, it's a virtual impossibility. For example, Gwyneth. She took a lovely little band like Coldplay and twisted it with her evil witchy magic. Here's the transition: BEFORE (c.2000) I awake to see that no one is free We're all fugitives Look at the way we live Down here, I cannot sleep from fear, no I said, which way do I turn Oh, I forget everything I learn... AFTER (c.2005) You cut me down a tree and brought it back to me And that’s what made me see where I was going wrong You put me on a shelf and kept me for yourself I can only blame myself, you can only blame me... Do you SEE what she did? […] Look at that Glad trash bag she has around her neck! How can I be mad when Gwyn's prancing about in a garbage bag? It's hilarity at its finest. Chris Martin has been bewitched by a bag lady! But there are plenty of other assholes where Gwynnie drops off. It will always be a challenge, and I'm not so sure I'm ready to give it up. Thank GOD for Vicodin, huh?
Clearly, there is a difference between how and what is being expressed in the two sets of blog entries, and in particular what is “behind” that expression – the first day’s entries portray someone who is outward looking, while the second is much more inwardly focused and personal. Although the use of first-person in the two sets of entries has not substantially changed, the implicit nature of the messages is indicative of a possible change in sense-of-self (using the same understanding of the particular meaning of this concept expressed in [24]).

Another example is from the third line of Table 3 – from BLOG06-feed-015221. While the vector size is very small (28), there is still an interesting quality and meaning difference between 20051214:

…I had this dream just before I woke up this morning. I was back in high school; I'd gone back in time somehow. […] "Look at us" I said to the woman standing next to me. "We're so...young! Look at our faces. Nothing has happened to us yet." […] "It went by so fast" I whispered. "Didn't it go by fast? Like the blink of an eye." "All of life is like that" she answered. "I didn't do enough" I went on, realizing cues I missed, qualities I failed to appreciate in my friends when I still had them there to appreciate […] I'm going to die, I realized. I didn't think I said it out loud, but she was just nodding at me calmly, unafraid and I understood her as clearly as she'd spoken too. We're all going to die. I woke up with my jaw aching, like I'd ground my teeth all night in my sleep. Even though I couldn't argue with anything she said in the dream, I sat on the edge of the bed shivering and rubbing my jaw thoughtfully; wanting, more than anything, to go on living.

and 20051207:

Yes, our artificial tree really did break or lose some very important piece of hardware that maintains the vertical stance, and we did fix it with duct tape. We don't live in West Virginia for nothing. Duct tape is one of the six greatest inventions of the modern day, and don't you forget it. If you can't fix it with duct tape, forget it. It's broke.[…]

Both examples are encouraging indications that detecting a change in this form of deeper user context is feasible. Importantly, and in contrast to the usual TREC modus operandi, inter-“assessor” (clinician) differences should perhaps be embraced rather than ameliorated, with the annotated explanations forming a vital part of assessment and evaluation.

6. CONCLUSION

User context is an important missing element in modern IR systems. Some forms such as sentiment and style-based context are under investigation. Deep context is an important facet of a person. Sense-of-self is one example of deep personal context. To collect it, though, requires analysis of people’s direct utterances. Those in text form are fundamentally different to other documents. They are more immediate and less thought through, full of noise and unstructured. Analysis of utterances requires methods that mimic human information processing: socio-cognitively motivated semantic systems.

This paper presented research using HAL, a method validated by cognitive science, to analyse the text from a large number of blog entries to uncover changes in authors’ sense-of-self. Prior work on less noisy mailing list messages [24] broke the ground on which this study built. Using a similar definition, sense-of-self was measured through geometric projection of authors’ first-person usage onto key indicators of kin and negative emotion words. Two candidates were presented, in the manner of an early qualitative evaluation, to demonstrate the potential of the technique: that deep personal context can be uncovered from utterances through the appropriate analysis and inference.

Further investigations are planned with clinicians to gain a better understanding of how the currently subjective and qualitative evaluation may be improved. Also, one of the lessons learnt during examination of blog text is that the terminology indicating kin or negative emotion may need to be extended in order to increase the ability to use smaller vectors – someone in trouble may not be loquacious. One approach would be to find these synonyms automatically through clustering in the semantic space [17], again using the inherent lack of structure and noisiness of the data itself to assist. Lastly, this research has focused on a change in a person’s sense-of-self over two months. Detecting where the sense-of-self is low and stays low may assist with depression and other longer-term mental health issues.

The methods and results are an important addition to the toolbox of analysis for noisy and unstructured textual utterance data. While the results are promising, more comprehensive qualitative evaluation is required. For utterance data, such as blogs, emails and mailing lists, where the task requires human-like mimicking of the semantics and inference, socio-cognitively motivated semantic systems are a principled choice. In the task of identifying deep personal context, of which a sense-of-self is but one example, they are as yet unrivaled.

7. REFERENCES

1. Adar, E., Glance, N. and Hurst, M. (eds.) 2006. 3rd Annual Workshop on the Weblogging Ecosystem. WWW'06.
2. Ruthven, I., Borlund, P., Ingwersen, P., Belkin, N., Tombros, A. and Vakkari, P. (eds.) 2006. Information Interaction in Context. first IIiX Symposium on Information Interaction in Context, (Copenhagen), ACM Press.
3. Balog, K. and Rijke, M.d. 2006. Decomposing Bloggers' Moods: Towards a Time Series Analysis of Moods in the Blogosphere. in Third annual workshop on the Weblogging ecosystem, WWW2006, (Edinburgh).
4. Burger, J.D. and Henderson, J.C. 2006. An Exploration of Observable Features Related to Blogger Age. in AAAI Spring Symposia 2006 on Computational Approaches to Analysing Weblogs, (Stanford, California).
5. Burgess, C., Livesay, K. and Lund, K. 1998. Explorations in context space: words, sentences, discourse. Discourse Processes, 25 (2&3). 211-257.
6. Burgess, C. and Lund, K. 1997. Representing Abstract Words and Emotional Connotation in a High-Dimensional Memory Space. Cognitive Science.
7. Callan, J.P., Croft, W.B. and Harding, S.M. 1992. The (INQUERY) Retrieval System. in DEXA-92, 3rd International Conference on Database and Expert Systems Applications.
8. Ducheneaut, N.B. and Bellotti, V. 2002. Ceci n’est pas un objet? Talking about things in email. Journal of Human-Computer Interaction.
9. Firth, J. 1957. A Synopsis of Linguistic Theory 1930-1955. in reprinted in Palmer, F.S.P.o.J.R.F.L.H. ed., Philological Society, Oxford.
10. Freyd, J. 1983. Shareability: the social psychology of epistemology. Cognitive Science, 7. 191-210.
11. Gärdenfors, P. 2000. Conceptual Spaces: The Geometry of Thought. MIT Press.
12. Gu, L., Lento, T., Smith, M. and Johns, P. 2006. How do blog gardens grow? Language community correlates with network diffusion and adoption of blogging systems. in AAAI Spring Symposia 2006 on Computational Approaches to Analysing Weblogs, (Stanford, California).
13. Herring, S., Kouper, I., Paolillo, J.C., Scheidt, L.A., Tyworth, M., Welsch, P., Wright, E. and Yu, N. 2005. Conversations in the Blogosphere: An analysis "from the bottom up". in Thirty-Eighth Hawaii International Conference on System Sciences (HICSS-38), Los Alamitos, IEEE Press.
14. Hsu, W.H., Weninger, T., Pydimarri, T. and Paradesi, M.S.R. 2006. Collaborative and Structural Recommendation of Friends using Weblog-based Social Network Analysis. in AAAI Spring Symposia 2006 on Computational Approaches to Analysing Weblogs, (Stanford, California).
15. Ingwersen, P. and Järvelin, K. 2005. Information retrieval in context: IRiX. SIGIR Forum, 39 (2). 31-39.
16. Ku, L.-W., Liang, Y.-T. and Chen, H.-H. 2006. Opinion Extraction, Summarization and Tracking in News and Blog Corpora. in AAAI Spring Symposia 2006 on Computational Approaches to Analysing Weblogs, (Stanford, California).
17. Landauer, T.K. and Dumais, S.T. 1997. A solution to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, 104. 211-240.
18. Landauer, T.K., Foltz, P.W. and Laham, D. 1998. An introduction to latent semantic analysis. Discourse Processes, 25 (2&3). 259-284.
19. Lento, T., Welser, H.T., Gu, L. and Smith, M. 2006. The Ties that Blog: Examining the Relationship Between Social Ties and Continued Participation in the Wallop Weblogging System. in Third annual workshop on the Weblogging ecosystem, WWW2006, (Edinburgh).
20. Li, X., Liu, B. and Yu, P.S. 2006. Mining Community Structure of Named Entities from Web Pages and Blogs. in AAAI Spring Symposia 2006 on Computational Approaches to Analysing Weblogs, (Stanford, California).
21. Lin, Y.-R., Sundaram, H., Chi, Y., Tatemura, J. and Tseng, B. 2006. Discovery of Blog Communities Based on Mutual Awareness. in Third annual workshop on the Weblogging ecosystem, WWW2006, (Edinburgh).
22. Lund, K. and Burgess, C. 1996. Producing high-dimensional semantic spaces from lexical co-occurrence. Behavior Research Methods, Instruments & Computers, 28 (2). 203-208.
23. Lund, K., Burgess, C. and Atchley, R.A. 1995. Semantic and associative priming in high-dimensional semantic space. in Cognitive Science, Erlbaum Publishers, Hillsdale, N.J.
24. McArthur, R., Bruza, P., Kralik, D. and Warren, J. 2006. Projecting computational sense of self: A study of transition in a chronic illness online community. in Proceedings of the 39th Hawaii International Conference on System Sciences (HICSS-39).
25. McArthur, R. and Bruza, P.D. 2003. Discovery of implicit and explicit connections between people using email utterance. in Proceedings of the 8th European Conference on Computer-supported Cooperative Work (ECSCW), Kluwer Academic Publishers, 21-40.
26. McArthur, R. and Bruza, P.D. 2003. Discovery of tacit knowledge and topical ebbs and flows within the utterances of online community. in Ohsawa, Y. and McBurney, P. eds. Chance Discovery - Foundations and Applications, Springer-Verlag.
27. McArthur, R. and Bruza, P.D. 2003. Finding Tacit Knowledge in Online Communities. in Proceedings of the Eighth Australasian Document Computing Symposium (ADCS'03), NWO-MES Press, 61-66.
28. McArthur, R., Bruza, P.D. and Song, D. 2005. Policy Conformance in the Corporate Blog Space. in Workshop on Policy Management on the Web, WWW'05.
29. Mihalcea, R. and Liu, H. 2006. A Corpus-based approach to finding happiness. in AAAI Spring Symposium on Computational Approaches to Weblogs.
30. Mishne, G. and Glance, N. 2006. Predicting Movie Sales from Blogger Sentiment. in AAAI Spring Symposia 2006 on Computational Approaches to Analysing Weblogs, (Stanford, California).
31. Mishne, G. and Rijke, M.d. 2006. Capturing Global Mood Levels using Blog Posts. in AAAI Spring Symposia 2006 on Computational Approaches to Analysing Weblogs, (Stanford, California).
32. Moffat, A., Zobel, J. and Hawking, D. 2005. Recommended Reading for IR Research Students. SIGIR Forum, 39 (2).
33. Mullen, T. and Malouf, R. 2006. A preliminary investigation into sentiment analysis of informal political discourse. in AAAI Spring Symposia 2006 on Computational Approaches to Analysing Weblogs, (Stanford, California).
34. Nicolov, N., Salvetti, F., Liberman, M. and Martin, J.H. 2006. Computational Approaches to Analyzing Weblogs Papers from the 2006 Spring Symposium, AAAI.
35. Nowson, S. and Oberlander, J. 2006. The Identity of Bloggers: Openness and gender in personal weblogs. in AAAI Spring Symposia 2006 on Computational Approaches to Analysing Weblogs, (Stanford, California).
36. Ounis, I., Rijke, M.d., Craig, M., Mishne, G. and Soboroff, I. 2006. Overview of the TREC-2006 Blog Track. in Text Retrieval Conference (TREC), NIST, Washington.
37. Schler, J., Koppel, M., Argamon, S. and Pennebaker, J. 2006. Effects of Age and Gender on Blogging. in AAAI Spring Symposia 2006 on Computational Approaches to Analysing Weblogs, (Stanford, California).
38. Thelwall, M. 2006. Blogs During the London Attacks: Top Information Sources and Topics. in Third annual workshop on the Weblogging ecosystem, WWW2006, (Edinburgh).
39. White, R.W., Kules, B., Drucker, S.M. and Schraefel, M.C. 2006. Supporting exploratory search: introduction to special issue. Commun. ACM, 49 (4). 36-39.
40. White, R.W., Muresan, G. and Marchionini, G. 2006. Report on ACM SIGIR 2006 workshop on evaluating exploratory search systems. SIGIR Forum, 40 (2). 52-60.
41. Yasuda, N., Hirao, T., Suzuki, J. and Isozaki, H. 2006. Identifying Bloggers' Residential Area. in AAAI Spring Symposia 2006 on Computational Approaches to Analysing Weblogs, (Stanford, California).
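As an illustrative aside on the candidate list discussed in Section 5, the following is a minimal Python sketch of how a shortlist might be drawn up from two days' sense-of-self vectors using a cut-off on vector size. It is not the method used in the paper: the Euclidean `change` score stands in for the unspecified "basic similarity score", and the names `Candidate`, `change`, `shortlist`, `min_size` and `top_k` are all hypothetical. Only the coordinates of BLOG06-feed-032140 and the vector size of BLOG06-feed-015221 come from the text; every other number is a placeholder.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    feed_id: str
    first_day: Tuple[float, float]   # sense-of-self on the first day
    second_day: Tuple[float, float]  # sense-of-self on the second day
    vector_size: int                 # associations in the sense-of-self vector


def change(c: Candidate) -> float:
    """Euclidean distance between the two days (an assumed score)."""
    return math.dist(c.first_day, c.second_day)


def shortlist(cands: List[Candidate], min_size: int, top_k: int) -> List[Candidate]:
    """Drop candidates with small vectors, then keep the top_k largest changes."""
    kept = [c for c in cands if c.vector_size >= min_size]
    return sorted(kept, key=change, reverse=True)[:top_k]


candidates = [
    # Coordinates are from Section 5; the vector size is a placeholder.
    Candidate("BLOG06-feed-032140", (0.23, 0.29), (0.69, 0.79), 400),
    # Vector size 28 is from Section 5; the coordinates are placeholders.
    Candidate("BLOG06-feed-015221", (0.30, 0.30), (0.60, 0.55), 28),
    # Entirely hypothetical feed with a small change.
    Candidate("toy-feed", (0.50, 0.50), (0.55, 0.52), 300),
]

for c in shortlist(candidates, min_size=100, top_k=2):
    print(c.feed_id, round(change(c), 3), c.vector_size)
```

A hard cut-off such as `min_size=100` would hide BLOG06-feed-015221, whose entries the discussion nonetheless found interesting, which suggests that in practice the threshold is a judgement for the clinician rather than a fixed constant.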