Page 1 of 17

MySpace Comments [1]

Mike Thelwall
School of Computing and Information Technology, University of Wolverhampton, Wulfruna Street, Wolverhampton WV1 1LY, UK. E-mail: m.thelwall@wlv.ac.uk Tel: +44 1902 321470 Fax: +44 1902 321478

Purpose - The public messages exchanged between friends in social network sites provide a record of informal communication on an unprecedented scale and, in some countries, for a wide cross-section of the population. This study investigates the characteristics of social network comments to give a broad overview to serve as a baseline for future research.

Design/methodology/approach - English comments from a representative sample of public MySpace profiles are examined with a collection of exploratory analyses, using automatic data processing, quantitative techniques and content analyses.

Findings - Comments are normally for general friendship maintenance and are typically short, with 95% having 57 or fewer words. They contain a combination of standard spelling, apparently accidental mistakes, slang, sentence fragments, “typographic slang” and interjections. Several new creative spelling variants derived from previous forms of computer-mediated communication have become extremely common, including u, ur, :), haha, and lol. The vast majority of comments, 97%, contain at least one non-standard language feature, suggesting that members almost universally recognise the informal nature of this kind of messaging.

Research limitations/implications - The investigation only covers MySpace and only analyses English comments.

Practical implications - MySpace comments should not be written in, or judged by, standard linguistic norms and may cause special problems for information retrieval.

Originality/value - This is the first large-scale study of language in social network comments.

Introduction

The public messages exchanged by social network site members, sometimes called comments or wall postings, are a new type of text-based communication.
These messages are unusual in that they are public - either world-visible or visible to all of a member’s friends - and can be permanently associated with the identity of the poster – more directly and publicly so than listserv postings. The widespread use of social network sites in many countries (boyd & Ellison, 2007) makes them an important object of study and also gives an opportunity to investigate informal interpersonal communication on a larger scale than previously possible.

Earlier forms of computer-mediated communication for interpersonal or informal communication have previously been investigated – typically with a case study approach or a potentially unrepresentative sample due to the limitations of the technology. These studies have shown the emergence of many forms of non-standard English and distinctive stylistic features (reviewed below). In addition to the intrinsic linguistic interest of these phenomena, online information retrieval can be affected: if social network sites contain casual language and spelling errors, then they could be difficult to search effectively (see also Baron, 2003) and difficult to translate automatically (Climent, Moré, Oliver, et al., 2007). Moreover, if social network profiles are typically not rich in useful information, then search engines might wish to allocate low search rankings to them, and a convenient automatic mechanism for this would be to penalise slang or incorrect spelling.

This article focuses on English language comments in one social network site, MySpace, using an exploratory set of predominantly quantitative analyses. The choice of MySpace is due to its popularity, apparently being the most visited site for U.S. web users at the start of 2007 (Prescott, 2007), and because of its amenability to quantitative analysis (Escher, 2007). The analysis includes text length, common words, spelling, grammar and rare words. It is an initial exploratory study to highlight issues and patterns for future in-depth investigations.

[1] Thelwall, M. (2009, to appear). MySpace comments. Online Information Review, 31.

Language in Computer-Mediated Communication

Language and CMC types

Perhaps the key issue for early online language researchers was the degree to which Internet language is similar to spoken rather than written language (Baron, 2003; Crystal, 2006). Previous findings are ambiguous: its linguistic features can fit between the two (Ko, 1996 – an educational chatroom) or can be different from both (e.g., modals in Yates, 1996 – a student-oriented discussion forum). Similarly, a study showed that language in an international Bulletin Board System (BBS) covering a mixture of recreational and serious topics tended to be more informal than most written forms and was quite similar to the language of formal interviews, but with a higher degree of abstract information (Collot & Belmore, 1996). The problem with attempts to generalise from such studies is that “internet language” is too broad a category and a more nuanced approach is needed (Herring, 2002).

There are many different computer-mediated communication (CMC) modes (Baron, 2003), including email (one-to-one, asynchronous), instant messaging (one-to-one, synchronous), blogs (one-to-many, asynchronous), live streaming broadcasts (one-to-many, synchronous), chat applications (many-to-many, synchronous), and listservs or wikis (many-to-many, asynchronous). Online CMC also varies in the extent to which it is product-oriented or process-oriented (Baron, 2003). Those with more durable outputs (e.g., blogs) probably tend to use more carefully chosen language whereas those with less durable outputs (e.g., chat, instant messaging) are more oriented towards the process the users are engaged in and the use of casual language may be more appropriate. Since CMC services vary in their capabilities and usages, it is useful to have dimensions through which to compare and analyse their language.
In particular it is important to recognise that Internet language is not homogeneous, but is socially constructed by users appropriating available technologies (Androutsopoulos, 2006). For example, although similar kinds of messages are possible with instant messaging and mobile phone text messages, they are integrated in different ways into people’s lives because of their differing conveniences (Grinter, Palen, & Eldridge, 2006).

Herring’s (2007) faceted classification scheme summarises a wide range of factors that may influence the language used within a particular CMC context, distinguishing between medium and situational types. Partially quoting from Herring (2007), the medium factors include: synchronicity (asynchronous/synchronous); persistence of transcript (how long the record of the communication is likely to survive); maximum permitted message length; and whether the messages are private or anonymous. The situational factors are more complex and often have to be evaluated qualitatively. Again partially quoting from Herring (2007), situational factors include: participation structure (e.g., one-to-one or one-to-many; group size; the number of active participants); participant demographic, attitudinal, skills-based and other characteristics; purpose of communication environment; and the topic or theme of group or messages. As a consequence of the research summarised in Herring’s classification scheme, it is important to note that there are many different factors that influence the kind of language used in online communication, or even in any given type of online communication.

CMC language variations and innovations

It seems that when new communication technologies arrive, there is a burst of creativity as users develop new styles and patterns of use (e.g., Danet, Ruedenberg, & Rosenbaum-Tamari, 1997; North, 2006). CMC has seen the introduction or expansion in the written use of slang, emoticons, and abbreviations.
Some abbreviations, such as irl (in real life), seem to be specific to the Internet, and others seem to have been spawned by the functions of a CMC device. An important motivation for use of abbreviations by people in devices for which they are not convenient may be to show group membership or conformity (Crystal, 1997, in Baron, 2003).

Mobile phone (cellphone) text messages and instant messaging (IM) seem to promote independence in teenagers, and innovative styles of use are likely to emerge amongst adolescents since these are known leaders of linguistic change (Eckert, 2003). For the general population, text messages seem to be used for a variety of purposes, but asking questions and transmitting personal information seem to be two of the most common uses (Faulkner & Culwin, 2005). The types of abbreviations commonly used in text messages include: dropping one or more letters from words, phonetic spelling, and using symbols or numbers for sounds – letter/number homophones (e.g., @, h8) (Grinter & Eldridge, 2003). Many of the shortenings seem to be ad hoc, created just to speed the writing of the message. Other shortenings include contractions that remove all or most vowels, clipping the final ‘g’ of words, and strings of initial letters of the words in standard phrases (e.g., bfn = bye for now) (Thurlow, 2003).

In addition to shortenings there are creative non-standard spellings, such as those designed to portray an accent (e.g., wiv for with) or as a humorous alternative (e.g., lata for later) (Thurlow, 2003). Humorous spellings have also been noticed in dating chat rooms (del-Teso-Craviotto, 2006) and misspellings have a long tradition of pre-CMC use in comic literature (e.g., Kline, 1907). Thurlow’s (2003) analysis also found that a few (student) text messages were obscurely encoded to the extent that they were incomprehensible to his researchers.

Language innovations other than spelling also occur.
For example, within multi-user environments, like chatrooms, a convention has emerged to preface a comment with the name of the intended recipient (Werry, 1996). Similarly, short and fragmented sentences can also be common in chatrooms (Radić-Bojanić, 2006).

Herring (2001) emphasises that CMC typing variations should not be seen as mistakes but as natural adaptations to the affordances of particular devices, services or contexts. For example, the avoidance of capital letters can speed typing in a rapid synchronous workplace exchange (Murray, 1990, cited in Herring, 2001). Variations can also be optimised for expressiveness rather than speed. Examples include written descriptions of sounds, such as laughing or crying, and the use of repeated letters (MacKinnon, 1995, cited in Herring, 2001).

One of the most complete lists of standard variations in CMC text is that of Anis (2007) for French mobile phone text messages. In addition to most of the examples mentioned above having French equivalents, there are many more variants. Some are language-specific, such as the omission of accents for letters and the substitution of k for qu, and others are more general, such as merging consecutive words. Anis emphasises that although some of the spelling variants are positively cryptic, they are expressive and playful, making them apparently effective communication in context.

Social network language

Social network sites are online environments that typically let registered members set up a personal home page, add their own content, and invoke ‘friend’ connections with other users. These friend connections are normally two-way with each friend having a picture of the other on their profile page or friend list pages. Social network sites like MySpace and Facebook seem to be particularly popular amongst younger users, and to have become near-ubiquitous amongst some groups, such as U.S. students (Golder, Wilkinson, & Huberman, 2007).
Social network sites are not homogeneous: they have different online environments and user groups. For instance, Facebook originates from within education and seems to have more educated users than MySpace (boyd, 2007). Some sites with social network features target a particular activity, such as news discovery for digg.com (Lerman, 2006). Young members of social network sites like MySpace seem to use them primarily to communicate within existing friendship groups rather than to make new friends (boyd, 2008) or to flirt (Pew Research Center for the People & the Press, 2007), although there are many varied uses of social network sites (e.g., Fono & Raynes-Goldie, 2006). Many users have hundreds of ‘friends’ – which may be predominantly acquaintances or strangers (Thelwall, 2008b) – but the majority of interactions probably occur amongst offline friends (Golder et al., 2007).

Social network sites appear to be integrated into the daily lives of their users (Kim & Yun, 2007) rather than having a separate partitioned existence. In fact, social network sites can be important arenas in which to express personal identities (boyd & Heer, 2006). Although multiple communication modes are typically supported, such as blogs, pictures, email, instant messaging, and video, one distinctive way of communicating is to write comments on a friend’s profile page (i.e., their default main page in the social network site). Comments are an interesting communication phenomenon because they are public – either world-visible or visible to all of the recipient’s friends. Comments are described differently by site, for example testimonials and wall postings are alternative names, and some sites also allow comments about blog postings, pictures and videos. Nevertheless, public conversations between friends by writing on each other’s profile page seem to be very common.
The public nature of these comments makes them amenable to researchers who can access and analyse those that are not restricted to the owner’s friends. The relatively permanent nature of social network comments makes them a potential threat to orthodox standards of language use because members have, in theory, the space and time to take care with their comments and so the creators have little defence against accusations of linguistic “sloppiness”.

Although no previous research seems to have analysed grammar and spelling in social network sites, some have discussed language use. One study analysed language overlaps in LiveJournal (Herring et al., 2007), showing the existence of multiple language communities, although sometimes bridged by multilingual individuals or journals with extensive non-text content. Other studies have analysed swearing in MySpace, showing that it is very common - occurring in around a third to half of teen MySpaces (Hinduja & Patchin, 2008; Thelwall, 2008a). The prevalence of swearing indicates that social network language can be highly informal.

Research Questions

This study investigates the comments found in the “Friends Comments” section of MySpace profile pages, which typically contains a set of text messages written by friends (although some comments contain images and some seem to be written by spam bots that have accessed a friend’s login information). This is an exploratory analysis using the information-centred research philosophy (Thelwall, Wouters, & Fry, 2008) of rapid (often shallow) exploratory analyses of new information sources to highlight potential applications and to develop appropriate methods to extract useful data (type ICR4 in terms of: http://cybermetrics.wlv.ac.uk/icr.html). In particular, a key objective is to give a broad overview to serve as a baseline for future research. The following research questions are addressed in the analysis, focusing on linguistic aspects.

1. What is the topic or purpose of typical comments?
2. What is the median length for comments?
3. Are rare words or spellings more frequent than rare words in standard written English?
4. Are any non-standard spellings common?
5. Are there any common types of non-standard word spellings and words?
6. What proportion of MySpace comments avoids all instances of non-standard English?

Research Design

The overall research design was to download the profile pages of a large random sample of MySpace users, to extract a random sample of English comments from these pages, and then to address the research questions with this data.

Data

A sample of MySpace comments was created for analysis via the member ID feature. Each MySpace member has an ID which uniquely identifies their profile page URL and can be used to deduce their joining date. A random sample of 30,000 URLs was chosen and automatically downloaded on July 17, 2007 using SocSciBot 4 (socscibot.wlv.ac.uk), representing profiles that were created on July 3, 2006. From this collection the following were rejected:

- Members with 0 or 1 friends (unlikely to be real users)
- Members with private profiles (comments not available)
- Members registered as musicians, film-makers or comedians (not typical users)

All remaining profiles were processed to extract all profile page comments, i.e., up to the 50 most recent comments per profile, a total of 173,730. All comments that were either only pictures or were from a small set of standard commercial spam message types were automatically removed. A random sample of 8,000 out of the remaining 149,913 comments was then manually checked to filter out any remaining spam, as well as to remove any non-English comments and any viral messages (e.g., with an instruction to forward the comment).
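The profile rejection rules and the cap of 50 comments per profile can be sketched in code. This is a minimal illustration only: the `Profile` record, its field names and the account type labels are hypothetical and are not taken from SocSciBot or from MySpace, and the later spam, viral and language filtering (partly manual in the study) is not modelled.

```python
from dataclasses import dataclass, field

# Hypothetical record for one downloaded profile page. The field names
# are illustrative assumptions, not SocSciBot or MySpace structures.
@dataclass
class Profile:
    member_id: int
    friend_count: int
    is_private: bool
    account_type: str  # assumed labels, e.g. "standard", "musician"
    comments: list = field(default_factory=list)  # most recent first

EXCLUDED_TYPES = {"musician", "film-maker", "comedian"}
MAX_COMMENTS = 50  # at most the 50 most recent comments per profile


def keep_profile(p):
    """Apply the three rejection rules listed in the text."""
    if p.friend_count <= 1:      # 0 or 1 friends: unlikely to be a real user
        return False
    if p.is_private:             # comments not available
        return False
    if p.account_type in EXCLUDED_TYPES:  # not typical users
        return False
    return True


def extract_comments(profiles):
    """Collect up to MAX_COMMENTS comments from each retained profile."""
    collected = []
    for p in profiles:
        if keep_profile(p):
            collected.extend(p.comments[:MAX_COMMENTS])
    return collected


profiles = [
    Profile(1, 0, False, "standard", ["hi"]),
    Profile(2, 25, True, "standard", ["hidden"]),
    Profile(3, 40, False, "musician", ["great gig"]),
    Profile(4, 12, False, "standard", ["hey whats up", "happy birthday!"]),
]
kept = extract_comments(profiles)
print(kept)
```

Only the fourth toy profile passes all three rules, so only its comments are retained.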
It is not fully possible to separate English from non-English comments since code-switching is a recognised phenomenon in online communication (e.g., Axelsson, Abelin, & Schroeder, 2007; Lee, 2007; Siebenhaar, 2006). The final set of comments for analysis consisted of 6,859, containing a total of about 95,000 words.

Methods

The topic or purpose of MySpace comments was investigated through an informal content analysis by the author of a random sample of 200 comments, excluding spam, viral and non-English comments. The analysis is subjective because the comments are often short, part of longer exchanges, and may be decoded by the recipients in ways that the author does not understand. Thurlow’s (2003) SMS text message categories are used as a baseline because SMS messages are also short messages between friends. Anonymised and sometimes truncated examples are given to illustrate the findings.

To measure comment lengths, all HTML tags were removed from each comment and the number of characters in the remainder was measured. The comments were then split into separate words by dividing each comment at whitespace markers (single or multiple: spaces, tabs, and/or line ends) or punctuation (except hyphens or apostrophes within words). The number of resulting “words” in each comment was then counted.

For the third research question, a word frequency distribution for MySpace comments was calculated by tallying the frequency of all “words” found in the comments, split as described above, after converting all capital letters to lower-case.

For the fourth research question, a table of the most frequently occurring words was produced for comparison with similar tables for British and U.S. English.

For the fifth research question, a set of 400 words occurring only once in the collection of 6,859 comments was investigated to get a sample of rare words. This is an artificial sample and the proportions of different types of words are not meaningful.
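The length measurement, word splitting and frequency tallying steps above can be approximated as follows. This is a sketch under stated assumptions rather than the program used in the study: the regular expressions are illustrative, only ASCII letters, digits and straight apostrophes are handled, and emoticons such as :) are discarded, whereas the published word lists evidently retain them.

```python
import re
import statistics
from collections import Counter

# Assumption: anything between angle brackets is an HTML tag.
TAG_RE = re.compile(r"<[^>]+>")
# Assumption: a "word" is a run of ASCII letters or digits, optionally
# joined by internal hyphens or apostrophes (e.g. "don't", "b-day").
WORD_RE = re.compile(r"[A-Za-z0-9]+(?:['\-][A-Za-z0-9]+)*")


def strip_tags(comment):
    """Remove HTML tags, leaving the remaining text unchanged."""
    return TAG_RE.sub("", comment)


def tokenize(comment):
    """Split a comment into words at whitespace and punctuation."""
    return WORD_RE.findall(strip_tags(comment))


def comment_lengths(comment):
    """Return (characters, words) after tag removal and trimming."""
    text = strip_tags(comment).strip()
    return len(text), len(tokenize(comment))


def word_frequencies(comments):
    """Tally lower-cased word frequencies over all comments."""
    counts = Counter()
    for c in comments:
        counts.update(w.lower() for w in tokenize(c))
    return counts


def hapaxes(counts):
    """Words occurring exactly once: the pool for a rare-word sample."""
    return sorted(w for w, n in counts.items() if n == 1)


comments = ["<b>Hey</b> whats up?", "hey, don't be late!!", "Happy b-day"]
counts = word_frequencies(comments)
median_words = statistics.median(comment_lengths(c)[1] for c in comments)
print(counts.most_common(1), hapaxes(counts), median_words)
```

A random sample of once-only words could then be drawn from `hapaxes(counts)` with `random.sample`; the sampling procedure actually used is not specified beyond being a set of 400 such words.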
A larger comment sample would probably have included a lower proportion of correctly spelled words. This is based upon the assumption that incorrectly spelt words are less likely to be repeated than correctly spelt words. For example, if all incorrect spellings were unique, then the proportion of words that were incorrectly spelt would increase linearly with the size of a corpus, whereas the number of unique words in a corpus normally increases logarithmically with its size (i.e., at a lower rate), following Zipf (1949). The purpose of the sampling process is hence only to generate a sample of relatively rare spellings. The words in the sample were classified by the author using an inductive content analysis: initially grouping the words into similar sets and then formalising the category definitions and re-categorising the words. The categories chosen by this process overlap and the results are subjective but serve the purpose of highlighting a variety of types of rare words.

Finally, to assess the spelling of MySpace comment words and to identify slang, the 6,859 comments were copied into Microsoft Word and its U.S. English spell-checker used as the primary dictionary. Two coders (the author and a final year linguistics student) classified each comment for the presence of any or all of: slang or typographic slang (defined as informal methods of spelling words), punctuation errors, spelling errors, interjections, pictograms and non-standard uses of capital letters. The frequency of occurrence of these features in each comment was not recorded: only their presence or absence. In addition, the classifiers judged each comment for following an accepted standard grammatical format, using their own knowledge of the rules of grammar. Here “grammar” is interpreted as encompassing all language rules apart from those listed above.
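For a presence/absence coding by two coders of the kind described above, agreement is commonly summarised by the raw percentage agreement together with Cohen's kappa, which corrects for agreement expected by chance. The following is a minimal sketch of that standard calculation; the function name and the toy codings are hypothetical and are not data from the study.

```python
def cohen_kappa(coder_a, coder_b):
    """Return (observed agreement, Cohen's kappa) for two label lists."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two non-empty codings of equal length")
    n = len(coder_a)
    # Proportion of items on which the two coders gave the same label.
    observed = sum(a == b for a, b in zip(coder_a, coder_b)) / n
    # Chance agreement from each coder's marginal label proportions.
    labels = set(coder_a) | set(coder_b)
    expected = sum(
        (coder_a.count(label) / n) * (coder_b.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        # Both coders used a single identical label throughout.
        return observed, 1.0
    return observed, (observed - expected) / (1 - expected)


# Toy example: 1 = feature present in the comment, 0 = absent.
a = [1, 1, 1, 0, 0, 0, 1, 0, 1, 1]
b = [1, 1, 0, 0, 0, 0, 1, 0, 1, 1]
agreement, kappa = cohen_kappa(a, b)
print(round(agreement, 3), round(kappa, 3))
```

In this toy case the coders disagree on one comment in ten, and kappa is lower than the raw agreement because half of the agreement would be expected by chance.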
Inter-coder agreement was calculated and in all cases of disagreement the author made the final classification decision (see Appendix for more details of the scheme).

A set of simple automatic analyses was also conducted using a purpose-built program (available from the author) that read each comment and produced summary statistics. For instance, one part of the program checked each comment to see whether it was entirely in lower-case and counted those that were.

Results

Themes

In terms of Thurlow’s (2003) SMS categories, the vast majority of comments (78%) appeared to be for general friendship maintenance (e.g., Have A Great 4th!; haha keep drinking your jack, you sick son of a bitch!; hey happy belated; same here i am soo board; TOD LIKES ICKI PORN ew; Hey, whats up?), rather than for any more practical purpose. The remainder exchanged some kind of non-trivial information (e.g., I got dat prom video on my page), arranged external meetings (e.g., You out tonight my dear?) or were (possibly) romantic (e.g., I MISS U TRAC!; I LOVE YOU!!!!!!!!!). In contrast to Thurlow’s (2003) text messages, there were no explicit comments about sex.

Almost half of the comments (43.5%) did not have a clear topic of discussion (e.g., hey baby how's it going; You guys are rockers!!; hah thats so me!!) but the main clearly identifiable topics were: MySpace (11.5%, e.g., ur second top friend is tim happy or what lol?; Thanks 4 the +; u totally should check out my space lol), birthdays (7.5%, happy birthday!), and music (7.5%, I love the picture and the song!).

One noticeable feature was creative humour (e.g., Whats Up PROVIDER??! hows things goin..how was the songfest,, blah blah blah..; Make it back safe and don't be actin crazy nigga!!!). Although there was only one joke in the set, 25% of the comments appeared to be humorous in some way (e.g., with lol or :) following a comment, or otherwise judged as attempting humour; unusual spellings were not counted as humour).
Another common element was an expression of interest in the target of the comment, or someone known to them. A total of 30.5% of comments contained such requests (e.g., how are you?; what’s up; are you doing …?; how is …?), although two thirds of these were stock polite greetings like what’s up, which may primarily serve as salutations or phatic communion (Malinowski, 1923).

Finally, love was another common theme. Fully 22% of the comments contained an expression of love, either through the word love, hugs, hearts or kisses or a variant of miss you. These seemed to be predominantly expressing friendship (i.e., friendship love) rather than romantic love (e.g., ha ha miss you cuz; LIL SIS love you!!; jus wanna holla @ you an show ur page sum luv).

The extent of correct formal written English in comments

The manual checks of 400 random MySpace comments gave the results in Table 1. There was a high degree of agreement between the classifiers for the categories within the table, as supported by the Cohen's kappa values (Neuendorf, 2002), which was probably due to the prescriptive classification scheme. The disagreements occurred mostly in cases where instances of non-standard English could be classified in multiple ways, or where there were so many non-standard features that judging grammar was difficult. For example, the comment “wat it do castro,watz with u now these dayz homie” was not coded for “Other non-standard English grammar” because “wat it do” and “watz with u” were judged to be slang phrases. Another common type of issue is represented by the comment “o sorry i dont know the password no my parents got an email not me”, which could have been classified as having non-standard grammar due to sentences run together without conjunctions. Instead it was classified as having non-standard punctuation, assuming that the primary issue was the absence of all punctuation rather than a non-standard sentence construction.

Table 1. Types of non-standard English found in MySpace comments

Aspect of non-standard English* | Comments containing | Inter-coder agreement (kappa)
Typographic slang or abbreviations (e.g., omg, lol, hugz, @) | 41% | 94.3% (.882)
Slang, including dialect, swearing, and idiomatic slang sayings | 51% | 88.5% (.771)
Non-standard spelling other than the above | 33% | 91.0% (.789)
Non-standard punctuation | 81% | 95.3% (.829)
Pictograms | 16% | 99.5% (.981)
Interjections (e.g., haha, muahh, huh, but not oh) | 13% | 98.0% (.913)
Non-standard capitalisation | 75% | 99.0% (.973)
Other non-standard English grammar | 56% | 91.5% (.824)
Not standard formal written English (i.e., any of the above) | 97% | 99.2% (.866)

*See Appendix for more details of classes.

From Table 1 it is clear that comments entirely in standard formal written English are extremely rare. Examples of comments judged completely correct include: “Happy Thanksgiving!”, “I like visitors. Xk” and the possibly facetious “Dear Friend, It says your birthday is June 20. That must be incorrect! I do believe your birthday is November 30.”

The most common causes of “other non-standard English grammar” were incomplete sentences and sentences merged together without punctuation or conjunctions. The incomplete sentences often missed a pronoun (e.g., “Just sayin sup.”) or a (main or auxiliary) verb (e.g., “how you been”).

Additional punctuation and capitalisation statistics

This section reports automatic analyses of the 6,859 comment lines judged to be valid non-viral, non-spam and English. After excluding escaped characters and all leading and trailing white-space characters, all except 2 of the comments were non-null and were automatically processed for the patterns below to see whether some non-standard language features were common.

- All upper case: 7.5% (515)
- All lower case: 37.9% (2,600)
- Ending in a valid sentence terminator (full stop, quotes, ! or ?): 49.4% (3,389)
- Starting with a letter of the alphabet: 98.6% (6,760)
- Starting with an upper case letter of the alphabet, if starting with a letter of the alphabet: 50.9% (3,439)

Comment lengths

This section analyses the distribution of comment lengths, as measured in characters and words (after excluding escaped characters). The median number of words per comment is 14 and the median number of characters per comment is 68. Figure 1 shows the ‘hooked power law’ shape (see similar graphs in Pennock, Flake, Lawrence, Glover, & Giles, 2002). There is probably a basic power law (e.g., Barabási & Albert, 1999) in comment lengths, with shorter comments being much more common than longer comments. The hook shape at the top left of the graph shows that very short comments are much rarer than would be expected for a pure power law. This probably reflects the need to write long enough comments to convey a nontrivial message. Almost all (95%) MySpace comments have 57 or fewer words. Hence, although comments are sometimes very long, the typical comment is about the length of a short sentence, and the overwhelming majority are not longer than a few sentences.

Figure 1. Distribution of 6,859 comment lengths (words). Note the log-log scale.

Word frequency distribution

Figure 2 reports the distribution of word frequencies, showing a visually almost perfect power law. Classic text should illustrate a perfect power law, with a few words being very common (i.e., having a high word frequency – a point on the right of the graph below) and many words being rare (i.e., having a low word frequency – a point on the left of the graph below). The linear fit on the left of the graph is not quite perfect, and the straight line pattern evident for frequencies 2 to 10 is not matched by word frequency 1, which is higher than the line would predict. Although the difference looks small on the graph, it is large in absolute terms because of the logarithmic scale.
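The comparison between the point for word frequency 1 and the straight line through frequencies 2 to 10 can be made concrete with a least-squares fit on log-log axes. The sketch below is one possible way to do this and is not the procedure used in the study; the function names are hypothetical and the example spectrum is synthetic, constructed so that once-only words are 1.5 times more numerous than the fitted line predicts.

```python
import math
from collections import Counter


def frequency_spectrum(word_counts):
    """Map each frequency k to the number of distinct words seen k times."""
    return Counter(word_counts.values())


def loglog_fit(spectrum, k_min=2, k_max=10):
    """Least-squares line through (log k, log V_k) for k_min <= k <= k_max."""
    pts = [
        (math.log(k), math.log(v))
        for k, v in spectrum.items()
        if k_min <= k <= k_max and v > 0
    ]
    if len(pts) < 2:
        raise ValueError("need at least two points to fit a line")
    n = len(pts)
    mean_x = sum(x for x, _ in pts) / n
    mean_y = sum(y for _, y in pts) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in pts) / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def hapax_excess(spectrum, k_min=2, k_max=10):
    """Observed count of once-only words divided by the line's value at k = 1."""
    _, intercept = loglog_fit(spectrum, k_min, k_max)
    predicted = math.exp(intercept)  # log(1) = 0, so the line gives exp(intercept)
    return spectrum.get(1, 0) / predicted


# Synthetic spectrum: an exact power law V_k = 1000 / k^2 for k = 2..10,
# plus an inflated number of words occurring once.
spectrum = {k: 1000.0 / k ** 2 for k in range(2, 11)}
spectrum[1] = 1500.0
slope, intercept = loglog_fit(spectrum)
excess = hapax_excess(spectrum)
print(round(slope, 3), round(excess, 3))
```

A ratio above 1 corresponds to the point for frequency 1 lying above the fitted line, which on a logarithmic axis can look like a small gap even when the absolute difference is large.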
This is in contrast to similar graphs for British English and academic web sites, for example, in which the lines are straight and the point for word frequency 1 does not deviate (Thelwall, 2005). This confirms that there are more unique words in comments than in “normal” text. This cannot be the result of the typical short length of MySpace comments resulting in a high proportion of unique words in each comment (a high “type/token ratio” in the terminology of Chafe & Danielewicz, 1987), because, in general, text lengths vary the slope of the line in Figure 2 but not its overall shape. This suggests that there is an additional process at work, which could be a force for creative variety in spelling or word choice, or simply extra carelessness in spelling.

Figure 2. Distribution of word frequencies in 6,859 comments. Note the log-log scale.

Common words

Table 2 reports the most common MySpace comment words, after converting all capital letters to lower-case. The table highlights words that are not found in the top 100 for general British English, as calculated from the British National Corpus (Leech et al., 2001), and the top 100 from general written American English, as extracted from the Brown corpus (http://www.giwersworld.org/computers/linux/common-words-freq.phtml). Note that the methods used for the British National Corpus statistics are not quite the same as those here, and both corpora cover language from at least twenty years ago. In particular, the Brown corpus word list has apostrophes removed (e.g., don’t -> dont) and the British list splits compound abbreviated words at apostrophes (e.g., I’m counts as two components, I and ’m). Hence the comparison is approximate and serves only to draw attention to potentially significant words.

Several abbreviations are included in Table 2: u and ya for you, ur for your, im normally for I’m, whats for what’s, dont for don’t. A few non-words are also present, such as lol (laugh out loud), :) and haha.
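The highlighting step, finding frequent comment words that are absent from a reference top-100 list, amounts to a simple set comparison. The sketch below illustrates the idea only; the function names are hypothetical, and the short reference list is a placeholder rather than the actual British National Corpus or Brown corpus list.

```python
from collections import Counter


def top_words(counts, n=100):
    """The n most frequent words, most frequent first."""
    return [word for word, _ in counts.most_common(n)]


def not_in_reference(words, reference):
    """Words from `words` that do not appear in the reference list."""
    ref = {w.lower() for w in reference}
    return [w for w in words if w.lower() not in ref]


# Toy counts and a placeholder reference list (not a real corpus list).
counts = Counter({"i": 50, "you": 40, "u": 30, "lol": 20, "the": 10})
reference = ["the", "I", "you", "of"]
highlighted = not_in_reference(top_words(counts, 4), reference)
print(highlighted)
```

Because the reference lists treat apostrophes differently, as noted above, any real comparison would need the comment words normalised in the same way as each list before matching.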
The digit 2 is often used as a homophone for to or too. The rank order of the word frequencies seems more similar to spoken than written English. For example, I is the most frequent word in conversational British English (Kilgarriff, 1997), as in the British National Corpus (Burnard, 1995) and the second most common in general spoken British English (Leech et al., 2001), but is only seventeenth most common in general written British English (Leech et al., 2001) (see descriptions and data at http://www.kilgarriff.co.uk/bnc-readme.html and http://www.comp.lancs.ac.uk/ ucrel/ bncfreq/flists.html). Nevertheless, there are also clear deviations from spoken English, not only in terms of spellings and the lack of pause-fillers like er (Leech et al., 2001) but also in words like love (ranked 555 in spoken British English (Kilgarriff, 1997)), happy (ranked 503 in spoken British English), and miss (ranked 1,052 in spoken British English). These three word frequencies are probably closer to those of a written genre: letter-writing (Leech, Rayson, & Wilson, 2001). Also noticeable in Table 2 are words related to movement (come, go, going, back) and time (day, weekend) that seem to fit an orientation on small-talk. Table 2. The most common words in the comments sections. Bold words are not in the top 100 for general British English, and italic words are not in the top 100 for general American English. 
Rank      Word
1-10      i, you, to, the, and, a, u, me, hey, my
11-20     it, for, in, love, is, that, so, up, your, on
21-30     have, of, are, just, lol, but, we, how, be, ya
31-40     at, was, well, what, get, like, good, im, know, out
41-50     been, this, with, see, hope, all, do, not, if, happy
51-60     miss, going, go, time, i'm, ur, back, some, got, there
61-70     when, can, will, thanks, its, or, by, from, now, whats
71-80     say, day, new, hi, much, one, no, about, haha, call
81-90     come, :), soon, too, need, birthday, 2, am, had, here
91-100    dont, doing, as, think, man, page, great, did, weekend, work

Table 3 reports the same frequencies as Table 2, but includes the original case of the words. The high frequency of words with a capitalised initial letter seems consistent with a letter-writing style, but it is interesting that You, U and YOU appear. Perhaps it is logical to capitalise U since I is a capital letter.

Table 3. The most common words in the comments sections, retaining letter case. Words including an upper-case letter are in bold.

Rank      Word
1-10      you, to, I, i, the, and, a, u, me, it
11-20     my, for, in, is, that, on, up, your, so, have
21-30     of, are, hey, love, but, lol, be, just, was, at
31-40     we, ya, out, get, Hey, know, like, how, well, been
41-50     with, good, see, this, what, all, im, do, not, going
51-60     time, go, hope, back, if, miss, there, will, can, ur
61-70     got, some, when, or, its, from, say, U, by, :)
71-80     now, about, 2, much, haha, You, Love, one, soon, call
81-90     come, need, new, too, am, I'm, whats, had, doing, day
91-100    no, YOU, think, as, here, dont, man, work, A, really

Rare words

Table 4 reports the classification of a random sample of 400 words occurring only once in the collection. Whilst Table 4 includes many valid words and numbers, there is evidence of a systematic pattern of new word creation and deliberately made-up spellings.
Several new common practices are evident: substituting numbers for similar-looking letters; truncating words; lengthening words by repeating letters; phonetic spellings; and substituting z for s. Two of these patterns, repeated letters and interjections, seem to be devices to emphasise the importance of words or to convey emotion (as with emoticons). Giving emphasis is also an important function of swearing (Jay, 2000).

Table 4. Classification of 400 words occurring once.

Type: Correct spelling other than proper nouns
Definition: Non-slang word found in dictionary (including standard grammatical variations)
Number: 150

Type: Name or other proper noun
Definition: Identified as such through personal knowledge or web searches
Number: 75
Examples: kontiwa, jens, ap, chc, andreas, bridi

Type: Number or code
Number: 14
Examples: 651, 8850, 3y, 2am, 8888888888888880, 7772, 89, r34, 808

Type: Non-English
Definition: Identified as a non-noun dictionary word in a non-English language
Number: 6
Examples: prego, vhiida (vida), bleu, musica, pelo, interesante

Type: Apparent typo
Definition: Judged a small spelling variation of a recognised word.
Number: 49
Examples: copyed, sumit, riends, doign, frend, experance, tomarow, materal, miester, internt, crys, andress, manillow, chrismtas, privlaged, punkin (pumpkin), valintines, arund, dewin (doing), roomate, cousinn, aout (about), visiiting, cann, encuragement, mixs, bearly, sappose, freinds, centery, dosent, appreaceate, apreciation, rteeth (teeth), locuacious, dosnt, cheak, skys, layed, sigle, goos (good), whent, destraction, biusines, earler, gpoing, wak

Type: Two words with merged spelling
Definition: Two words normally written separately
Number: 9
Examples: babymamma, thankya, yawhy, soundfreak, carcrash, lotsa, whatchu, coinslot, yeadat

Type: Slang or made-up word
Definition: Non-dictionary word or described as slang in dictionary; not used as a proper noun.
Number: 23
Examples: dokie, scaggy, dangit, hotty, aght (alright), alreet, oik, mangina, yute, cracka, badboii, gawd, yids, numnuts, croc, favs, wuddup, roxes (rocks), mcwalmartenheimer, picy, goina, evrythang

Type: Deliberately made-up spelling
Definition: Judged a large spelling variation of a recognised word or part of a systematic spelling variation pattern.
Number: 28
Examples: pt (put), finishin, styllz, n0t, deyz, nathin (nothing), l3t (let), getin, slidin, startz, altho, wurd, seri0us, c0de, choon (tune), reazun, a0l, hoez, mackin (making), 5tyll, sayz, bos (boss), niger

Type: Word with repeated letter
Definition: Spelling variant of existing word with at least one extra repeated letter
Number: 31
Examples: myspacee, byeee, chiick, weirdddd, duhh, soooooooo, sleeep, misss, loveee, nobbbin, joyceeeeeeeeeeeee, sweeeet, loove, gwaannnin, hiiigh, beee, crazyyy, helllooooooooooooooooooooo, herrrrrre, goooodd, annnnt, okayyyy, souuulful, weeeeeel, meliiiiiissssssaaaaaaaa, homelesssssss, livee, congratss, happpyy, moneyyy, geeetair (guitar)

Type: Interjection
Definition: Judged to be describing a vocal sound
Number: 13
Examples: boohoo, wuhu, heheh, muhahahahahahahahahahaha, awwwwwwwwwww, mmmuuuaaaahhhzzz, teeheee, muahzz, whew, aahahahahahahahahahaaha, awwwwwww, bya, hahahhah

Type: Unknown
Number: 2
Examples: blac, tc

Total: 400

Discussion and limitations

The results give some answers to four research questions exploring MySpace comment language. First, the “normal” length for MySpace comments seems to be about 14 words (for comments that contain at least one word) and almost all comments are not longer than a few sentences. Hence, comments are typically brief communications.
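The two quantitative measures discussed in this section, comment length and the distribution of word frequencies, can be illustrated with a minimal Python sketch. It is not the processing used in the study: the tokenisation rule, the nearest-rank percentile, the use of the median as the "typical" length, and all function names (tokenize, length_summary, frequency_spectrum) are assumptions made for illustration only.

```python
import math
from collections import Counter
from statistics import median


def tokenize(comment):
    """Lower-case a comment and split it on whitespace (an assumed rule).

    Tokens containing a letter or digit have edge punctuation trimmed;
    tokens without any (e.g. the pictogram ":)") are kept unchanged.
    """
    tokens = []
    for raw in comment.lower().split():
        if any(ch.isalnum() for ch in raw):
            raw = raw.strip(".,!?;:\"()")
        tokens.append(raw)
    return tokens


def comment_lengths(comments):
    """Number of tokens in each comment."""
    return [len(tokenize(c)) for c in comments]


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def length_summary(comments):
    """Median and 95th percentile length of comments with at least one word."""
    lengths = [n for n in comment_lengths(comments) if n > 0]
    return {"median": median(lengths), "p95": percentile(lengths, 95)}


def frequency_spectrum(comments):
    """Return (word -> count, count -> number of distinct words with that count).

    Plotting the second mapping on log-log axes gives a graph of the kind
    described for the word-frequency distribution; the entry for count 1
    is the number of words that occur only once.
    """
    word_counts = Counter(t for c in comments for t in tokenize(c))
    spectrum = Counter(word_counts.values())
    return word_counts, spectrum


if __name__ == "__main__":
    # Invented example comments, not data from the study.
    sample = ["hey u!! how are you", "lol :) miss you", "happy birthday 2 u", "heyyy"]
    print(length_summary(sample))
    word_counts, spectrum = frequency_spectrum(sample)
    print(len(word_counts) / sum(word_counts.values()))  # type/token ratio
    print(sorted(spectrum.items()))
```

Because the tokeniser keeps tokens such as ":)" and "2", they are counted as words in the same way that they appear in Tables 2 and 3; a different tokenisation rule would change both the length figures and the number of words occurring once.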
The distribution of word frequencies differs from standard English in the sense that rare words or spellings are more frequent than would be expected for a pure power law. This confirms the casual observation that typos, slang and innovative spellings are a significant part of MySpace commenting. A few non-standard spellings were common enough to be in the top 100 for MySpace comments. These include abbreviated spellings, abbreviations, pictograms, and interjections. It will be interesting to see if any of these become accepted as recognised alternative dictionary spellings because of their widespread use. There were some patterns in the types of rare non-standard word spellings and words used in MySpace. The main patterns were the use of repeated letters, probably for emphasis or playfulness, and the typing of interjections, typically expressing emotion. Although MySpace comment text seems intuitively closer to spoken than written English, and this is backed up by the prevalence of the personal pronouns you and I, there are many features, including those discussed above, that make MySpace comments a clearly distinctive variety. There is also a technical problem for comparing MySpace text to spoken language, which is that spoken language has to be transcribed, and this transcription necessarily uses correct spellings, phonetic spelling, or a system based upon pronunciation for all recognisable words used. Hence, it is difficult to quantitatively compare the two, especially due to the prevalence of contractions using apostrophes (often omitted in MySpace comments) such as ’s and n’t in spoken English. In contrast, the non-standard spelling and grammar probably make MySpace comments clearly distinct from similar written forms, such as the personal letter. The vast majority of MySpace comments do not exclusively use formal written English (grammar and spelling; a lack of slang), and various types of non-standard spelling are prevalent. 
Comments written entirely in formal English are hence likely to stand out as inappropriate, which may detract from their message or stigmatise the commenter as abnormal or inexperienced. In contrast, a range of “incorrect” styles, such as lower-case messages, are common.

This research has limitations in the extent to which it can be generalised to other social network sites and languages. It would not be reasonable to hypothesise that the patterns of spelling and language use in MySpace would also be found in all other social network sites. Facebook comments (i.e., wall posts) may have fewer spelling errors and creative spellings because of Facebook’s tendency to have more educated members (boyd, 2007) using it in an educational context (Golder et al., 2007). Similarly, it seems likely that different spelling patterns may emerge amongst different language user groups. In Chinese, for example, repeated characters (logograms) may not always make linguistic sense - although ASCII characters are sometimes used anyway for speed (Lee, 2007). Nevertheless, repeated characters representing useful adjectives have been identified in online Chinese, such as 漂漂 (beautiful-beautiful, Yang, 2007). In contrast, in Japanese it seems that similar emphasis may be gained instead by the use of additional or alternate symbols designed for expressiveness (Nishimura, 2007).

Conclusions

MySpace comments are typically informal and often creative and fun. Their language seems to diverge from formal written English because of the need to convey meanings that are difficult to communicate quickly in standard written forms, for example expressing emotion or emphasis. MySpace comments written entirely in formal English are rare. They contain a combination of standard spelling, apparently accidental mistakes, slang, sentence fragments, typographic slang and interjections. Several new spellings have become commonplace, including u, ur, :), haha, and lol.
Although some codes of practice are developing, it seems unlikely that a new formal MySpace grammar will emerge because of the playfulness with language evident in many of the deliberately incorrectly spelt words (e.g., one commenter doubled every letter i in each word).

The variety in grammar and spelling poses new challenges for future research in natural language processing and information retrieval because the former, and the latter to some extent, rely upon regular, predictable patterns in text in order to process it effectively. For example, one common application of natural language processing is automatic translation. Whilst this would be useful for MySpace comments (perhaps to support bilingual friendships), specialist research is needed because techniques developed for standard English are unlikely to work well with MySpace comments. In information retrieval research, techniques such as latent semantic analysis (Deerwester, Dumais, Furnas, et al., 1990) could be used to help identify new synonyms for existing words, but all such techniques rely upon spellings being used frequently enough to form a statistically identifiable pattern. To help both types of research, it would be useful to extend the findings in the current paper to different languages, and to different social network sites. It would also be useful to analyse MySpace comments on a broader scale as grammatical units and also as dialogs between the profile owners. This should give a wider perspective on the linguistic phenomenon of social network commenting.

In terms of the wider applications of the findings, social network comments should not be assessed by employers in terms of formal written English standards because this would signal members as deviant rather than linguistically skilled, at least in MySpace. The frequency of non-standard spelling and slang also has implications for web information retrieval.
MySpace is indexed by Google (34.9 million pages reported indexed on June 16, 2008) and other search engines despite it seeming unlikely that many pages would ever be valuable in search results, with the main exceptions being musicians’ spaces. In particular, given the predominance of very personal communications in comments, it seems logical that search engines should take steps to keep them out of search results, for example by allocating them a low rank. This research suggests that this could be achieved in a generic manner by penalising the ranking of pages with many non-standard spellings, especially if these are close to the search keywords.

References

Androutsopoulos, J. (2006). Introduction: Sociolinguistics and computer-mediated communication. Journal of Sociolinguistics, 10(4), 419-438.
Anis, J. (2007). Netography: Unconventional spelling in French SMS text messages. In B. Danet & S. C. Herring (Eds.), The multilingual internet: Language, culture, and communication online (pp. 87-115). Oxford: Oxford University Press.
Axelsson, A.-S., Abelin, A., & Schroeder, R. (2007). Anyone speak Swedish? Tolerance for language shifting in graphical multiuser virtual environments. In B. Danet & S. C. Herring (Eds.), The multilingual internet: Language, culture, and communication online (pp. 362-381). Oxford: Oxford University Press.
Barabási, A. L., & Albert, R. (1999). Emergence of scaling in random networks. Science, 286(5439), 509-512.
Baron, N. S. (2003). Language of the Internet. In A. Farghali (Ed.), The Stanford Handbook
Biber, D. (2003). Variation among University spoken and written registers: A new multi-dimensional analysis. In P. Leistyna & C. F. Meyer (Eds.), Corpus Analysis: Language Structure and Language Use (pp. 47-70). Amsterdam: Rodopi.
boyd, d. (2007). Viewing American class divisions through Facebook and MySpace. Apophenia Blog Essay (June 24), Retrieved July 12, 2007 from: http://www.danah.org/papers/essays/ClassDivisions.html.
boyd, d. (2008). Why youth (heart) social network sites: The role of networked publics in teenage social life. In D. Buckingham (Ed.), Youth, identity, and digital media (pp. 119-142). Cambridge, MA: MIT Press.
boyd, d., & Ellison, N. (2007). Social network sites: Definition, history, and scholarship. Journal of Computer-Mediated Communication, 13(1), Retrieved December 10, 2007 from: http://jcmc.indiana.edu/vol2013/issue2001/boyd.ellison.html.
boyd, d., & Heer, J. (2006). Profiles as conversation: Networked identity performance on Friendster. Proceedings of the Hawai'i International Conference on System Sciences (HICSS-39, January 4-7), Retrieved July 3, 2007 from: http://www.danah.org/papers/HICSS2006.pdf.
Burnard, L. (1995). Users' reference guide to the British National Corpus. Oxford: Oxford University Computing Services.
Chafe, W., & Danielewicz, J. (1987). Properties of spoken and written language. In R. Horowitz & S. J. Samuels (Eds.), Comprehending oral and written language (pp. 83-113). San Diego: Academic Press, Inc.
Climent, S., Moré, J., Oliver, A., Sánchez, I., Taulé, M., & Salvatierra, M. (2007). Enhancing the Status of Catalan versus Spanish in Online Academic Forums: Obstacles to Machine Translation. In B. Danet & S. C. Herring (Eds.), The multilingual internet: Language, culture, and communication online (pp. 209-230). Oxford: Oxford University Press.
Collot, M., & Belmore, N. (1996). Electronic language: A new variety. In S. C. Herring (Ed.), Computer-mediated communication - linguistic, social and cross-cultural perspectives (pp. 13-28). Amsterdam: John Benjamins.
Crystal, D. (2006). Language and the Internet (2nd ed.). Cambridge, UK: Cambridge University Press.
Danet, B., Ruedenberg, L., & Rosenbaum-Tamari, Y. (1997). 'Hmmm... Where's That Smoke Coming From?' Writing, Play and Performance on Internet Relay Chat. Journal of Computer-mediated Communication, 2(4), Retrieved March 3, 2008 from: http://jcmc.indiana.edu/vol2002/issue2004/danet.html.
Deerwester, S., Dumais, S., Furnas, G. W., Landauer, T. K., & Harshman, R. (1990). Indexing by Latent Semantic Analysis. Journal of the American Society for Information Science, 41(6), 391-407.
del-Teso-Craviotto, M. (2006). Language and sexuality in Spanish and English dating chats. Journal of Sociolinguistics, 10(4), 460-480.
Eckert, P. (2003). Language and gender in adolescence. In J. Holmes & M. Meyerhoff (Eds.), The Handbook of Language and Gender (pp. 381-400). Oxford: Blackwell.
Escher, T. (2007). The geography of (online) social networks. Web 2.0, York University, Retrieved September 18, 2007 from: http://people.oii.ox.ac.uk/escher/wpcontent/uploads/2007/2009/Escher_York_presentation.pdf.
Faulkner, X., & Culwin, F. (2005). When fingers do the talking: a study of text messaging. Interacting with Computers, 17(2), 167-185.
Fono, D., & Raynes-Goldie, K. (2006). Hyperfriendship and beyond: Friendship and social norms on Livejournal, Association of Internet Researchers (AOIR-6), Chicago. In M. Consalvo & C. Haythornthwaite (Eds.), Internet research annual volume 4: Selected papers from the Association of Internet Researchers conference. New York: Peter Lang.
Golder, S. A., Wilkinson, D., & Huberman, B. A. (2007). Rhythms of social interaction: Messaging within a massive online network, 3rd International Conference on Communities and Technologies (CT2007), East Lansing, MI.
Grinter, R. E., & Eldridge, M. (2003). Wan2tlk? Everyday text messaging. CHI 2003, 441-448.
Grinter, R. E., Palen, L., & Eldridge, M. (2006). Chatting with teenagers: Considering the place of chat technologies in teen life. ACM Transactions on Computer-Human Interaction, 13(4), 423-447.
Herring, S. C., Paolillo, J., Ramos-Vielba, I., Kouper, I., Wright, E., Stoerger, S., et al. (2007). Language Networks on LiveJournal. Proceedings of the Fortieth Hawaii International Conference on System Sciences (HICSS-40), Retrieved November 21, 2007 from: http://www.blogninja.com/hicss07.pdf.
Herring, S. C. (2001). Computer-mediated discourse. In D. Schiffrin, D. Tannen & H. E. Hamilton (Eds.), Discourse Analysis. Malden, MA: Blackwell.
Herring, S. C. (2002). Computer-mediated communication on the Internet. Annual Review of Information Science and Technology, 36, 109-168.
Herring, S. C. (2007). A faceted classification scheme for computer-mediated discourse. Language@Internet, 4, Retrieved June 12, 2008 from: http://www.languageatinternet.de/articles/2007/2761.
Hinduja, S., & Patchin, J. W. (2008). Personal information of adolescents on the Internet: A quantitative content analysis of MySpace. Journal of Adolescence, 31(1), 125-146.
Jay, T. (2000). Why we curse. New York: John Benjamins.
Kilgarriff, A. (1997). Putting frequencies in the dictionary. International Journal of Lexicography, 10(2), 135-155.
Kim, K.-H., & Yun, H. (2007). Cying for me, Cying for us: Relational dialectics in a Korean social network site. Journal of Computer-Mediated Communication, 13(1), Retrieved December 19 from: http://jcmc.indiana.edu/vol13/issue11/kim.yun.html.
Kline, L. W. (1907). The psychology of humor. The American Journal of Psychology, 18(4), 421-441.
Ko, K.-K. (1996). Structural characteristics of computer-mediated language: A comparative analysis of InterChange discourse. Electronic Journal of Communication, 6(3), Retrieved February 27, 2008, from: http://www.cios.org/www/ejc/v2006n2396.htm.
Lee, C. K. M. (2007). Text-making practices beyond the classroom context: Private instant messaging in Hong Kong. Computers and Composition, 24(3), 285-301.
Leech, G., Rayson, P., & Wilson, A. (2001). Word frequencies in written and spoken English: Based on the British National Corpus. London: Longman.
Lerman, K. (2006). Social networks and social information filtering on Digg. ArXiv.org, Retrieved April 23, 2007 from: http://arxiv.org/abs/cs.HC/0612046.
Malinowski, B. (1923). The problem of meaning in primitive languages. In C. K. Ogden & I. A. Richards (Eds.), The Meaning of Meaning (pp. 296-346). Routledge & Kegan Paul.
Neuendorf, K. (2002). The content analysis guidebook. London: Sage.
Nishimura, Y. (2007). Linguistic innovations and interactional features in Japanese BBS communication. In B. Danet & S. C. Herring (Eds.), The multilingual internet: Language, culture, and communication online (pp. 163-183). Oxford: Oxford University Press.
North, S. (2006). Making connections with new technologies. In J. Maybin & J. Swann (Eds.), The art of English: Everyday creativity (pp. 209-230). Basingstoke, Hampshire: Palgrave Macmillan.
Pennock, D., Flake, G. W., Lawrence, S., Glover, E. J., & Giles, C. L. (2002). Winners don't take all: Characterizing the competition for links on the web. Proceedings of the National Academy of Sciences, 99(8), 5207-5211.
Pew Research Center for the People & the Press. (2007). Social networking websites and teens: An overview. Retrieved June 4, 2007, from http://www.pewinternet.org/PPF/r/198/report_display.asp
Prescott, L. (2007). Hitwise US consumer generated media report. Retrieved March 19, 2007 from: http://www.hitwise.com/.
Radić-Bojanić, B. (2006). Fragmentation/integration and involvement/detachment in chatroom discourse. Skase Journal of Theoretical Linguistics, 3(1), Retrieved March 3, 2008 from: http://www.skase.sk/Volumes/JTL2005/2004.pdf.
Siebenhaar, B. (2006). Code choice and code-switching in Swiss-German Internet Relay Chat rooms. Journal of Sociolinguistics, 10(4), 481-506.
Thelwall, M., Wouters, P., & Fry, J. (2008). Information-Centred Research for large-scale analysis of new information sources. Journal of the American Society for Information Science and Technology, 59(9), 1523-1527.
Thelwall, M. (2005). Text characteristics of English language university web sites. Journal of the American Society for Information Science and Technology, 56(6), 609-619.
Thelwall, M. (2008a). Fk yea I swear: Cursing and gender in a corpus of MySpace pages. Corpora, 3(1), 83-107.
Thelwall, M. (2008b). Social networks, gender and friending: An analysis of MySpace member profiles. Journal of the American Society for Information Science and Technology, 59(8), 1321-1330.
Thurlow, C. (2003). Generation Txt? The sociolinguistics of young people's text-messaging. Discourse Analysis Online, 1(1), Retrieved January 3, 2008 from: http://extra.shu.ac.uk/daol/articles/v2001/n2001/a2003/thurlow2002003-paper.html.
Werry, C. (1996). Linguistic and interactional features of Internet Relay Chat. In S. C. Herring (Ed.), Computer-mediated communication: Linguistic, social and cross-cultural perspectives (pp. 47-61). Philadelphia: John Benjamins.
Yang, C. (2007). Chinese Internet language: A sociolinguistic analysis of adaptations of the Chinese writing system. language@internet, 4, Retrieved February 27, 2008 from: http://www.languageatinternet.de/articles/2007/1142.
Yates, S. J. (1996). Oral and written aspects of computer conferencing. In S. C. Herring (Ed.), Computer-Mediated Communication: Linguistic, social and cross-cultural perspectives (pp. 29-46). Amsterdam: John Benjamins.
Zipf, G. K. (1949). Human behavior and the principle of least effort: An introduction to human ecology. Cambridge, MA: Addison-Wesley.

Appendix: Table 1 classification instructions.

Regard comments as correct if they fit either U.S. or British English.

Typographic slang includes acronyms like: omg, lol, xx, ur, u, r, y, 2, z for s (e.g., hugz, boyz), @, shortenings (e.g., b/c for because), x for ks (e.g., thanx); luv for love; wat for what, numerical shortenings like: l8r, m8.

Slang includes: da, yeah, yep, ya for you, dude, man, chink, yo, like (as interjection), witcha, witchu; shortenings (e.g., cuz, bro, sup, wit, in(ing)), swearing (fuck, ass, god), sayings like pour it up, whatever, what's good; this includes all written attempts to echo dialect or nonstandard pronunciation.
Spelling: Do not count slang or typographic slang as wrong spelling, but do count multiple-letter slang (e.g., hellllooo) as wrong spelling. Assume that all proper nouns are correct but otherwise assume non-slang terms not in a dictionary are spelling mistakes.

Punctuation: Commas, apostrophes (in possessives, but also in words like don't), full stops, colons, semi-colons used where the text seems to need them for standard English (but see the grammar vs. punctuation section). Count the use of multiple consecutive punctuation marks as an error (e.g., !!!, ?! unless ellipsis (exactly three full stops)). Do not count a missing full stop at the end of a comment as a punctuation or grammar error if it follows a closing statement, e.g., "see ya" or "Later, kate xx" or "ttfn". A space should follow punctuation except for apostrophes, quotes, and sentence endings.

Pictograms: Any text pictures, e.g., :-) ^..^ also include ♥ as a picture.

Interjections: e.g., huh, haha, mwuahh, but not "oh".

Capitals correct: The use of capital letters for proper nouns, I, and sentence beginnings is correct - or title case is used if the comment seems to be a title or caption - or the capitals are appropriate for a letter format (e.g., Dear Jane, How are you?). Capitals are optional following a colon.

Grammar: Ignore all of the above errors for this section, and check if the sentence deviates from standard formal English in any other way. For example, check for subject-verb agreement, incomplete sentences, missing verbs, pronouns or nouns, sentences that don't make sense, sentences or phrases run together without joining words (e.g., and, but).

Grammar vs. punctuation errors: In cases where the grammar is wrong because punctuation is missing, record a punctuation error if the author is deliberately avoiding all punctuation, but record a grammar error if the author is using some punctuation but has missed out some. E.g., the comment "how are you what are you doing today" would count as a punctuation error (no punctuation at all) but the comment "how are you what are you doing today?" would count as a grammar error rather than a punctuation error because the author should have used something like "?", ".", or "and" between the two phrases.
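The instructions above were written for human classifiers, but a few of the surface features they describe can be approximated automatically. The following Python sketch is only a rough heuristic under stated assumptions and is not the procedure used in the study: the regular expressions, the short word list (taken from the examples in the instructions) and the function name flag_features are all illustrative choices.

```python
import re

# Three or more identical letters in a row (e.g. "hellllooo"). Doubled letters
# are normal in standard English ("good"), so only longer runs are flagged.
REPEATED_LETTER = re.compile(r"([a-z])\1{2,}", re.IGNORECASE)

# Two or more consecutive punctuation marks; an exact ellipsis is allowed later.
PUNCT_RUN = re.compile(r"[!?.,;:]{2,}")

# A deliberately small, illustrative set of pictograms (not an exhaustive list).
PICTOGRAM = re.compile(r"[:;=]-?[)(DPp]|\^\.\.\^|♥")

# A subset of the typographic slang examples given in the instructions.
TYPOGRAPHIC_SLANG = {"omg", "lol", "xx", "ur", "u", "r", "y", "2",
                     "b/c", "thanx", "luv", "wat", "l8r", "m8"}


def flag_features(comment):
    """Return heuristic True/False flags for four surface features."""
    # Remove pictograms first so that "^..^" is not counted as punctuation.
    text = PICTOGRAM.sub(" ", comment)
    tokens = [t.strip(".,!?;:\"'()").lower() for t in text.split()]
    runs = [m.group() for m in PUNCT_RUN.finditer(text)]
    return {
        "pictogram": bool(PICTOGRAM.search(comment)),
        "repeated_letters": bool(REPEATED_LETTER.search(text)),
        "multiple_punctuation": any(run != "..." for run in runs),
        "typographic_slang": any(t in TYPOGRAPHIC_SLANG for t in tokens),
    }


if __name__ == "__main__":
    # Invented example comments, not data from the study.
    for example in ["hellllooo!!! how r u :)", "See you tomorrow..."]:
        print(example, flag_features(example))
```

Rules of this kind will misfire on cases a human would resolve easily (for example, "www" contains a repeated letter and the digit 2 may be a genuine number), and judgements such as slang, grammar and whether a proper noun is spelt correctly still need the manual procedure described above.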