Twitter Ontology
Travis Allen

Reason for Existence

Social media has increasingly become a method of intelligence gathering and cultural study for the military, political organizations, and swaths of academia. As more and more individuals take to various forms of social media, these media have come to serve as channels for not only greater volumes of data, but more important and societally relevant data as well. One of these forms of social media is Twitter. Whereas Twitter was mostly a social phenomenon prior to 2010, it was the Arab Spring that catapulted Twitter into the spotlight of the world community, highlighting just how impactful the technology truly is [1]. Twitter and Facebook played important roles in helping the rebellion coordinate protests, distribute information, combat propaganda, and relay the reality of the situation on the ground to the world at large.

Following the Arab Spring, Twitter has become a major component of social media in the US and abroad. Being able to extract useful knowledge out of Twitter will be vitally important to a wide range of organizations. In March 2013, Twitter reported there were over 200 million registered user accounts sending over 400 million Tweets per day [2]. This tremendous usage will provide valuable information to anyone able to make sense of the mammoth amount of data.

The goal of the Twitter ontology is to give those working on the platform a framework for making sense of the sprawling, multitudinous messages being transmitted at every second of the day. Building an ontology to understand the structure of Twitter will be an important step in making sense of the data it generates. By mapping the lexicon and architecture of the messages sent across the platform (“Tweets”), those studying the content will have better tools with which to gather information. The scope of the Twitter ontology is to accurately represent the mechanical functions of the Twitter platform.

The Challenge

Twitter has a fundamentally unique conceptual model that is best understood by contrasting it with e-mail. As the clear mainstay among communication protocols from the earliest days of the Internet, e-mail has a very approachable model. Because the capacity of the Internet to depart from real-world analogues was not yet fully understood early on, e-mail was modeled on standard paper mail. An e-mail is created by a user and then sent to another user or users. This idea of generating a message and sending it specifically to some set of recipients (even the idea of having recipients at all) is rooted in traditional mail systems. Each message is viewed only by those it is addressed to and, once sent, exists in the possession of the recipient. Users do not go out into cyberspace looking at public e-mails; rather, they check their inbox to see what has been sent directly to them. If a user wishes to receive e-mail from another person, that person will have to obtain the user’s e-mail address in some way. Similarly, a user cannot browse all the e-mails an individual has sent, nor is there typically a way to find another user’s e-mail address unless they want it to be found. Even the structure of an e-mail is itself reminiscent of real-world paper letters. E-mails contain fields meant to mirror paper mail, such as From, To, Subject, and even Carbon Copy and Blind Carbon Copy lines.

Figure 1

Twitter, however, functions significantly differently, in a way that does not easily map to real-world parallels. Each account has a username, which is by default viewable by anyone who wishes to find it. Nearly every tweet published by an account is public, even correspondence between accounts. When composing a tweet, one does not necessarily send it to anyone in particular, but rather broadcasts it, with anyone who cares to view it able to do so. Tweets also never really leave the possession of the composing account.
Even years after a tweet’s creation, an account can return to its own tweet and delete it, leaving all responses (even if there are thousands of them) pointing to something that no longer exists. Tweets also lack the more regimented structure of old-world communications or e-mail. A Tweet is constrained to 140 characters, and that is just about the only limitation. There are no subject lines or carbon copies. The platform is designed specifically for messages to be brief and freeform, akin to jotting down an idea on a post-it note and sticking it to a public bulletin board. All of these factors make building a conceptual model of Twitter a significant challenge.

The Resources

No other Twitter ontologies appear to exist, and there is very little formal documentation of the system. As there was no apparent prior work on this front, data collection is organic. In beginning development of the Twitter ontology, the first place to look for information is the glossary supplied by Twitter [3]. This glossary is the only real definitive list of terms relevant to the service, and even it is described only as “vocabulary frequently used to talk about features on our site and aspects of our service.” As Twitter strives to maintain a community-managed lexicon and use pattern, some uses and social functions are not formalized by the company in any way. This results in a distinction between the actual mechanical structure of Twitter and the social activity that takes place within it. This ontology seeks to capture the former. Capturing the latter would likely require drawing together a variety of linguistic and anthropological studies.

Figure 2

The Process

The first step in creating the ontology is finding every term, with its associated definition, in the Twitter glossary that seems to be a part of the architecture of the software. Once a list of domain vocabulary is established, relationships can be understood.
After a list of formal terms is pulled from the glossary, actual tweets themselves are studied for any function or feature that is not accurately represented in the glossary. Any feature not explicitly mentioned in the glossary is recorded, and a rough definition is created for it. The hope is that combining a defined glossary with real usage of the service will form an expansive base of terms from which to begin building the ontology.

Figure 3

Once the list of terms is developed, the next step is to begin sketching out relationships between proposed classes. Using a spreadsheet, each class is listed along with the other classes it could be related to. Possible object properties are included between the two classes. These object properties are meant to capture the relationship only; they are not intended to be final expressions. This reflects the inability to express information about the edges themselves. The culmination of this step is a listing of each proposed class and all possible edges to other classes.

Figure 4

Following this step, all unique object properties are listed to search for redundancies. By extracting object properties from their triples, similarities across several properties can be identified so that like terms can be merged into a single functioning term. Reducing the number of new object properties in an ontology is an important step towards making it compatible with other existing ontologies.

Once the list of entities and rough object properties is finished, a conceptual model is created on a whiteboard. The goal here is to better understand how the various entities interact with each other. It is at this point in the process that relationships such as “is_a” are scrutinized to ensure no term overloading or confusion is occurring. Every connection is included in the diagram, with key terms becoming apparent as more connections are drawn.
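
The bookkeeping behind the spreadsheet and redundancy steps described above can be sketched in a few lines of Python. This is only an illustrative sketch: the class names are taken from this paper, but the draft property names, the merge table, and the function names are assumptions made for the example and are not the contents of the actual spreadsheet.

```python
from collections import Counter

# Hypothetical draft triples: (subject class, object property, object class).
# Class names come from the paper; the property spellings are illustrative only.
draft_triples = [
    ("Hashtag", "part_of", "Tweet"),
    ("Mention", "is_part_of", "Tweet"),
    ("Message_Content", "part_of", "Tweet"),
    ("Account", "has", "Username"),
    ("Account", "has_a", "Biography"),
    ("Retweet", "is_a", "Tweet"),
]

# Assumed merge table: variant property names mapped to one canonical name.
canonical = {"is_part_of": "part_of", "has_a": "has"}


def normalize(triples, merges):
    """Rewrite each triple so merged properties use their canonical name."""
    return [(s, merges.get(p, p), o) for s, p, o in triples]


def unique_properties(triples):
    """Return the sorted list of distinct object properties in the triples."""
    return sorted({p for _, p, _ in triples})


def connection_counts(triples):
    """Count how many edges touch each class, as subject or as object."""
    counts = Counter()
    for s, _, o in triples:
        counts[s] += 1
        counts[o] += 1
    return counts


merged = normalize(draft_triples, canonical)
print(unique_properties(draft_triples))  # five draft properties
print(unique_properties(merged))         # three after merging like terms
print(connection_counts(merged).most_common(2))
```

Listing the properties before and after normalization makes near-duplicates easy to spot, and the connection counts give a rough, mechanical version of the observation that key terms stand out as more connections are drawn.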
On the whiteboard diagram, classes are labeled in boxes, with color coding to denote the difference between continuants and occurrents, and each edge between entities is labeled. There is constant referencing of both the initial list of classes and relationships and of Twitter itself, to make sure all classes are included and all relationships are accurately defined.

With the structure mapped on the whiteboard, it can be transferred over to Protégé. Introduction of the classes and object properties to Protégé begins with the Basic Formal Ontology (BFO) [4]. By beginning with BFO, the Twitter ontology will be more compatible with other ontologies, especially those that also reference BFO. The majority of classes end up as generically dependent continuants, although the processes included fall under the occurrent class. After establishing which parts of BFO are required, the Information Artifact Ontology (IAO) is used to identify more specific classes that the Twitter components are a part of. Merging the IAO with the Twitter ontology is a simple and straightforward procedure, as both have been built with BFO as their upper-level ontology. The Twitter ontology acts as an extension of the IAO. As the IAO undergoes future revisions, the Twitter ontology will have to adjust accordingly. Information artifacts are a rich domain with a great many unique features, which will take time to fully appreciate.

The .OWL File

Twenty-eight unique classes are created for the ontology, not counting terms imported from BFO or the IAO. There are twenty object properties in the file.

Because Twitter is exclusively a digital entity, the IAO is the most appropriate mid-level ontology for this project. Within the IAO, the most important class to the Twitter ontology is the Information Content Entity class.
This class is defined by the IAO as “an entity that is generically dependent on some artifact and stands in relation of aboutness to some entity.” This is an excellent definition for a considerable number of Twitter ontology terms, if perhaps a bit high level. Quite a few classes relating to Twitter (a Hashtag, a Mention, a Tweet) are generically dependent, and they are “about” something, whether that is another username, a specific topic, or a user’s opinion about the current weather.

Several terms in the IAO are considered for inclusion, specifically the classes “Document” and “Document Part.” Initially these seem compatible with the Twitter lexicon, but upon further review the Document class seems inadequate for expressing the idea of a tweet. The examples provided by the IAO (a journal article, a patent application, and a laboratory notebook) do not meaningfully reflect what a tweet is in reality. A more appropriate term may be “Message,” which will be covered in the discussion of possible improvements.

The Twitter service revolves entirely around the messages sent across the client, which are called Tweets. Unlike other social media sites such as Facebook, Twitter is a focused client, with all functionality related to the sending and reading of Tweets. Understandably, “Tweet,” which the official Twitter glossary defines as “a message posted via Twitter containing 140 characters or fewer,” is the single most used class in the ontology, with twenty-two uses. It is referenced by its various components, such as “Hashtag” or “Mention,” and also as an output of the process of creation. The class Tweet is either linked directly to nearly every other class in the ontology or is only a single hop away.

“Account” is the second most connected class in the ontology, with eighteen uses.
While Twitter eschews the term “account” in its glossary and official literature, discussions with project oversight led to the addition of the Account class. This is because a Username is not truly an account, but rather a string of characters. In order to keep the terms Username and Account closely linked, it is asserted that each Account has exactly one Username, and each Username is part of exactly one Account. Accounts have passwords, short biographies (bios), profile pictures, and other things associated with the common usage of accounts. Twitter has no official definition of account, so one was created. That definition is “A membership to a service, website, or related entity. An account with a digital entity typically includes a login and/or username, password, and possibly additional information about the account owner.” Further integration with the IAO would likely result in a more refined definition. Twitter’s official definition of a username is as follows: “Also known as a Twitter handle. Must be unique and contain fewer than fifteen characters. Is used to identify you on Twitter for replies and mentions.”

The domain-specific information is rather flat, with at most two levels below the Information Content Entity class. Because the subject matter is so narrow, there is not the same depth that is apparent in domains such as medicine or manufacturing. Most relationships are not “is_a,” but rather horizontal in nature. Object properties such as “part_of” or “has” are considerably more relevant to mapping the domain. As the IAO grows and the Twitter ontology is better integrated with forthcoming classes, the relationship hierarchy will grow slightly more complicated.

Evaluation

Evaluation testing needs to be performed on all ontologies to ensure their veridicality. For the Twitter ontology, the most straightforward method is to take real Tweets and consider whether each component is represented in the ontology.
A Tweet was chosen and this evaluation was performed. Fig. 5 shows a real tweet that has been diagrammed to represent how the various components are modeled in the ontology. Each visible component is an instance of a class in the ontology. In the top left corner of the image, the picture is an instance of “Profile Picture.” Adjacent to this is the name “Julian Goldstein,” which is an instance of the “Display Name” class. It is possible this is the original creator’s real name as well, although that cannot be verified. Underneath the display name is “@JulianGoldstein,” the “Username” of the “Account” that the tweet originated from. In this instance it matches the display name, but that is not necessarily the case. To the right of those are two buttons. The first button allows the current user to send a “Direct Message” to the username of the Tweet creator. The Follow button, represented by the “Follow” process, causes the “Follower” role to inhere in the current username. The orange and green marks in the upper right corner are an “Annotation” representing a favorited Tweet and indicating that the Tweet has undergone the “Retweeting” process and is now a “Retweet.”

Figure 5

The body of the Tweet contains the “Message Content,” which is an “iao.Textual Entity.” This particular example lacks instances of “Hashtag” and “Mention.” Below the message content are three buttons: Reply, Retweet, and Favorite. All three are included in the ontology as processes. Below that, the Tweet is marked as having been retweeted and favorited a certain number of times; these counts are outputs of the aforementioned processes. Finally, at the bottom, are a “Time” and “Date,” both of which are designated by a “Timestamp.”

Overall, the evaluation of this Tweet is encouraging. Every single item on the page, as well as the processes that create them, is represented in the ontology. A single test is hardly proof that the ontology is exhaustive and functional, but it is a promising start.
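
The component-by-component check described above could be mechanized along the following lines. This is a minimal sketch under stated assumptions: the class labels are the ones quoted in the walkthrough, but the flat set, the hand-annotated component list, and the function name `uncovered` are hypothetical and are not read from the actual .owl file.

```python
# Class labels as quoted in the walkthrough above. This flat set is a
# simplification for illustration; it is not loaded from the real .owl file.
ONTOLOGY_CLASSES = {
    "Tweet", "Account", "Username", "Display Name", "Profile Picture",
    "Direct Message", "Follow", "Follower", "Annotation", "Retweeting",
    "Retweet", "Message Content", "Hashtag", "Mention", "Timestamp",
}

# Hypothetical hand annotation of one rendered tweet:
# (what is visible on the page, class label the annotator assigned).
observed_components = [
    ("avatar image", "Profile Picture"),
    ("Julian Goldstein", "Display Name"),
    ("@JulianGoldstein", "Username"),
    ("follow button", "Follow"),
    ("orange and green corner marks", "Annotation"),
    ("body text", "Message Content"),
    ("time and date line", "Timestamp"),
]


def uncovered(components, classes):
    """Return the components whose assigned label is not a known class."""
    return [(seen, label) for seen, label in components if label not in classes]


gaps = uncovered(observed_components, ONTOLOGY_CLASSES)
print("all components covered" if not gaps else gaps)

# A made-up label that is not in the set, to show what a gap looks like.
print(uncovered([("poll widget", "Poll")], ONTOLOGY_CLASSES))
```

Running the same check over many annotated Tweets, rather than one, would give a stronger indication of coverage; any label that comes back in the gap list is a candidate for a new class or a revised definition.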

The true path rule can help identify misplaced classes. By following is_a relationships upwards, logical inconsistencies can be identified when a subclass is not entirely a type of its parent class.

Figure 6
List is_a Column
Profile is_a Column
Timeline is_a Column
Biography is_a TextualEntity
Message_Content is_a TextualEntity
Retweet is_a Tweet
User is_a SpecificallyDependentContinuant
Follower is_a User
Replying is_a Creation
Retweeting is_a Process

Figure 6 is a mapping of all classes with a parent class that is not from the IAO or BFO, as well as three others. Reading across, the table shows that the natural language reading of “is a” works on each line. Some domain knowledge is required to understand all relationships, such as knowing what a profile consists of (see footnote 1) or the roles of follower and user (see footnote 2).

Improvement

As is, the current .owl file has plenty of room for improvement. The most immediate way to effect this would be increased integration with the Information Artifact Ontology. Working with the next release of the IAO would hopefully identify more specific classes that the various Twitter classes would reside under. Increased specificity of classes in this way would lead to a more accurate understanding of the service as a whole, along with a more detailed asserted hierarchy. It would also open the door to merging with other similar ontologies, such as ontologies describing other social media networks, e-mail, instant messaging, communication clients such as Skype, Internet message boards, and Internet-based communication in general. A class such as “Message_Content” would likely be shared across multiple domains, as most types of messaging include some form of content to relay their intentions.

Digging deeper into the transmission protocol of the Twitter service would be another avenue for improving the ontology.
Increased explication of technical specifications may lead to significant overlap with other protocol functions, revealing increased similarity of functionality between Twitter and other services. A stronger representation of the technical aspects may provide increased insight into matters of security and efficiency, especially when other platforms that share protocols develop improved methods.

Expanding beyond mechanical representation would mean integration with language ontologies, a rich field that has much to complement the Twitter ontology. Currently, this ontology only seeks to map the relations among the components of the software. All of the language actually in use by people accessing the service is rolled up into the Message_Content class, which exists as a textual entity with no inspection of its contents. Incorporating language ontologies would give interested parties significant insight into the way the platform is used culturally. This type of interconnectedness promises to be one of the most fruitful ways to explore Twitter and make use of the Twitter ontology.

Footnotes
1. A profile is a combination of several classes, such as profile picture and biography, along with a list of tweets from whichever username’s profile is being viewed.
2. In order to have the role Follower, a person must already have the role User. A person cannot be a Follower without being a User.

Web Resources
1. Basic Formal Ontology. http://www.ifomis.org/bfo/
2. Information Artifact Ontology. https://code.google.com/p/information-artifact-ontology/

Citations
1. Beaumont, Peter. “The truth about Twitter, Facebook, and the uprisings in the Arab world.” The Guardian. 24 February 2011. Web. 11 November 2013. http://www.theguardian.com/world/2011/feb/25/twitter-facebook-uprisings-arab-libya
2. Tsukayama, Hayley. “Twitter turns 7: Users send over 400 million tweets per day.” Washington Post. 21 March 2013. Web. 11 November 2013.
http://articles.washingtonpost.com/2013-03-21/business/37889387_1_tweets-jackdorsey-twitter
3. “The Twitter Glossary.” Twitter.com. Web. https://support.twitter.com/articles/166337the-twitter-glossary