Twitter Ontology

Travis Allen
Reason for Existence
Social media has increasingly become a method of intelligence gathering and cultural
study for the military, political organizations, and swaths of academia. As more and more
individuals take to various forms of social media, the mediums have come to serve as
transmission methods for not only greater volumes of data, but more important and societally
relevant data as well. One of these forms of social media is Twitter.
Whereas Twitter was mostly a social phenomenon prior to 2010, it was the Arab Spring
that catapulted Twitter into the spotlight of the world community, highlighting just how impactful
the technology truly is [1]. Twitter and Facebook played important roles in helping the rebellion
coordinate protests, distribute information, combat propaganda, and relay the reality of the
situation on the ground to the world at large. Following the Arab Spring, Twitter has become a
major component of social media in the US and abroad.
Being able to extract useful knowledge out of Twitter will be vitally important to a wide
range of organizations. In March 2013, Twitter reported there were over 200 million registered
user accounts sending over 400 million Tweets per day [2]. This tremendous usage will provide
valuable information to any who are able to make sense of the mammoth amount of data.
The goal of the Twitter ontology is to provide those working on the platform a framework
to make sense of the sprawling, multitudinous messages being transmitted at every second of
the day. Building an ontology to understand the structure of Twitter will be an important step in
making sense of the data it generates. By mapping the lexicon and architecture of the
messages sent across the platform (“Tweets”), those studying the content will have better tools
with which to gather information.
The scope of the Twitter ontology is to accurately represent the mechanical functions of
the Twitter platform.
The Challenge
Twitter has a fundamentally unique conceptual model that is best understood by
contrasting it with e-mail. As the clear mainstay in communication protocols from the earliest
days of the Internet, e-mail has a very approachable model. As the capacity of the Internet to
depart from real-world analogues was not yet fully understood early on, e-mail was modeled on
standard paper mail. An e-mail is created by a user, and then sent to another user or users.
This idea of generating a message and sending it specifically to some set of recipients – even
having recipients – is rooted in traditional mail systems. Each message is viewed only by those
it is addressed to, and once sent, exists in possession of the recipient.
Users do not go out into cyberspace looking at public e-mails; rather, they check their
inbox to see what has been sent directly to them. If a user wishes to receive e-mail from another
person, they will have to obtain that individual’s e-mail address in some way. Similarly, a
user cannot browse all the e-mails an individual has sent, nor is there typically a way to find
another user’s e-mail address unless they want it to be found. Even the structure of an e-mail is
itself reminiscent of real-world paper letters. E-mails contain fields meant to mirror paper mail,
such as From, To, Subject, and even Carbon Copy and Blind Carbon Copy lines.
Figure 1
Twitter, however, functions significantly differently, in a way that does not easily map to
real-world parallels. Each account has a username, which is by default viewable by anyone who
wishes to find it. Nearly every tweet published by an account is public, even correspondence
between accounts. When composing a tweet, one does not necessarily send it to anyone in
particular, but rather broadcasts to nobody in particular, with anyone who cares to view it able to
do so. Tweets also never really leave the possession of the composing account. Even years
after a tweet’s creation, the account can return to it and delete it, leaving all responses –
even if there are thousands of them – pointing to something that no longer exists.
Tweets also lack the more regimented structure of old world communications or e-mail.
A Tweet is constrained to 140 characters, and that is just about the only limitation. There are
no subject lines or carbon copies. The platform is designed specifically for messages to be brief
and freeform, akin to jotting down an idea on a post-it note and sticking it to a public bulletin
board. All of these factors make building a conceptual model of Twitter a significant challenge.
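The contrast between the two conceptual models can be made concrete with a minimal sketch. The class and field names below are illustrative assumptions rather than terms from the ontology; only the 140-character limit and the e-mail header fields come from the discussion above.

```python
from dataclasses import dataclass, field
from typing import List

MAX_TWEET_LENGTH = 140  # character limit cited in the text


@dataclass
class Email:
    """Addressed message: modeled on paper mail, delivered to named recipients."""
    sender: str
    to: List[str]
    subject: str
    body: str
    cc: List[str] = field(default_factory=list)
    bcc: List[str] = field(default_factory=list)

    def visible_to(self, reader: str) -> bool:
        # Only the sender and the addressed recipients can read the message.
        return reader == self.sender or reader in self.to + self.cc + self.bcc


@dataclass
class Tweet:
    """Broadcast message: no recipients, no subject, only brief free text."""
    username: str
    text: str
    public: bool = True  # assumption: models "nearly every tweet is public"

    def __post_init__(self) -> None:
        if len(self.text) > MAX_TWEET_LENGTH:
            raise ValueError("a Tweet is limited to 140 characters")

    def visible_to(self, reader: str) -> bool:
        # A public tweet is readable by anyone who cares to look.
        return self.public or reader == self.username
```

The e-mail carries its audience inside the message, while the tweet carries none; visibility is a property of the broadcast rather than of an address list.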
The Resources
No other Twitter ontologies appear to exist, and there is very little formal
documentation of the system. As there is no apparent prior work on this front, data collection
is organic. In beginning development of the Twitter ontology, the first place to find information is
the glossary supplied by Twitter [3]. This glossary is the only definitive list of terms relevant to
the service, and even it is defined only as “vocabulary frequently used to talk about features on our
site and aspects of our service.” As Twitter strives to maintain a community-managed lexicon
and use pattern, some uses and social functions are not formalized by the company in any way.
This results in a distinction between the actual mechanical structure of Twitter and the social
activity that takes place within it. This ontology seeks to capture the former. The latter would
likely be a culmination of a variety of language and anthropological studies.
Figure 2
The Process
The first step in creating the ontology is finding every term and its associated definition in
the Twitter glossary that seems to be a part of the architecture of the software. Once a list of
domain vocabulary is established, relationships can be understood. After a list of formal terms
is pulled from the glossary, actual tweets themselves are studied for any function or feature
that is not accurately represented in the glossary. Any feature not explicitly mentioned in the
glossary is recorded, with a rough definition created. The hope is that combining a defined
glossary with real usage of the service will form an expansive base of terms from which to
begin building the ontology.
Figure 3
Once the list of terms is developed, the next step is to begin sketching out relationships
between proposed classes. Using a spreadsheet, each class is listed, along with other classes it
could be related to. Possible object properties are included between the two classes. These
object properties are meant to capture the relationship only; they are not intended to be final
expressions. This reflects the inability to express further information about edges at this stage. The
culmination of this step is a listing of each proposed class and all possible edges to other
classes.
Figure 4
Following this step, all unique object properties are listed to search for redundancies. By
extracting object properties from their triples, similarities across several properties can be
identified so that like terms can be merged into a single term. Reducing the
number of new object properties in an ontology is an important step towards making it
compatible with other existing ontologies.
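The redundancy search described above can be sketched in a few lines. The draft triples and the merge mapping below are hypothetical examples, not the contents of the actual spreadsheet; only “has” and “part_of” are property names that appear in the ontology itself.

```python
from collections import defaultdict

# Hypothetical draft triples: (subject class, object property, object class).
draft_triples = [
    ("Tweet", "has", "Hashtag"),
    ("Tweet", "contains", "Mention"),
    ("Account", "has", "Username"),
    ("Username", "part_of", "Account"),
    ("Hashtag", "is_part_of", "Tweet"),
]

# Hypothetical merge decisions: draft wording -> single retained property.
canonical = {"contains": "has", "is_part_of": "part_of"}


def merge_properties(triples, mapping):
    """Rewrite each triple so that it uses the retained property name."""
    return [(s, mapping.get(p, p), o) for s, p, o in triples]


def properties_in_use(triples):
    """Group the (subject, object) class pairs under each distinct property."""
    usage = defaultdict(list)
    for s, p, o in triples:
        usage[p].append((s, o))
    return dict(usage)


before = properties_in_use(draft_triples)
after = properties_in_use(merge_properties(draft_triples, canonical))
print(sorted(before))  # ['contains', 'has', 'is_part_of', 'part_of']
print(sorted(after))   # ['has', 'part_of']
```

Listing the properties apart from their triples makes near-synonyms easy to spot, and applying the mapping afterwards leaves every class pair intact while shrinking the property vocabulary.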
Once the list of entities and rough object properties is finished, a conceptual model is
created on a whiteboard. The goal here is to better understand how the various entities
interact with each other. It is at this point in the process that relationships such as “is_a” are
scrutinized to ensure no term overloading or confusion is occurring. Every connection is
included in the diagram, with key terms becoming apparent as more connections are drawn.
Classes are labeled in boxes, with color coding to distinguish continuants from
occurrents, and each edge between entities is labeled. Both the initial list of classes and
relationships and Twitter itself are consulted constantly to make sure all classes
are included and relationships accurately defined. With the structure mapped on the whiteboard, it
can be transferred over to Protege.
Introduction of the classes and object properties to Protege begins with the Basic Formal
Ontology (BFO) [4]. By beginning with BFO, the Twitter ontology will be more compatible with
other ontologies, especially those that also reference BFO. The majority of classes end up as
generically dependent continuants, although the processes included fall under the occurrent
class.
After establishing which parts of BFO are required, the Information Artifact Ontology
(IAO) is utilized to identify more specific classes that the Twitter components would be a part of.
The merging of the IAO with the Twitter ontology is a simple and straightforward procedure, as
both have been built with BFO as their upper-level ontology. The Twitter ontology acts as an
extension of the IAO. As the IAO undergoes future revisions, the Twitter ontology will have to
adjust accordingly. Information artifacts are a rich domain with a great deal of unique features,
which will take time to fully appreciate.
The .OWL File
Twenty-eight unique classes are created for the ontology, not including terms imported
from BFO or the IAO. There are twenty object properties in the file.
Because Twitter is exclusively a digital entity, the IAO is the most appropriate mid-level
ontology for this project. Within the IAO, the most important class to the Twitter ontology is the
Information Content Entity class. This class is defined by the IAO as “an entity that is generically
dependent on some artifact and stands in relation of aboutness to some entity.” This
definition fits a considerable number of Twitter ontology terms, if perhaps at a rather high
level. Quite a few classes relating to Twitter - a Hashtag, a Mention, a Tweet - are generically
dependent, and they are “about” something, whether what they are about is another username,
a specific topic, or a user’s opinion about the current weather.
Several terms in the IAO are considered for inclusion, specifically the classes “Document”
and “Document Part.” Initially these seem compatible with the Twitter lexicon, but upon further
review the document class seems inadequate for expressing the idea of a tweet. The examples
provided by the IAO of a journal article, a patent application, and a laboratory notebook do not
meaningfully reflect what a tweet is in reality. A more appropriate term may be “Message,”
which will be covered in the discussion of possible improvements.
The service of Twitter revolves entirely around the messages sent across the client, which
are called Tweets. Unlike other social media sites such as Facebook, Twitter is a focused
client with all functionality related to the sending and reading of Tweets. Understandably,
“Tweet” is the single most used class in the ontology, with twenty-two uses of the term, which
the official Twitter glossary defines as “a message posted via Twitter containing 140 characters
or fewer.” It is referenced by its various components such as “Hashtag” or “Mention,” and also
as an output of the process of creation. In some way, the class Tweet is linked to nearly every
other class in the ontology, or is only a single hop away.
“Account” is the second most connected class in the ontology, with eighteen uses. While
Twitter eschews the term “account” in its glossary and official literature, discussions with
project oversight led to the addition of the Account class. This is because a Username is not
truly an account, but rather a string of characters. In order to keep the terms Username and
Account closely linked, it is asserted that each Account has exactly one Username, and each
Username is part of exactly one Account.
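The one-to-one assertion between Account and Username can be checked against instance data with a short sketch. The function name and the account identifiers are hypothetical; the check only detects an Account with more than one Username or a Username shared by more than one Account, and does not detect an Account with no Username at all.

```python
def check_one_to_one(pairs):
    """Return a list of violations of the exactly-one constraint.

    `pairs` is an iterable of (account_id, username) assertions.
    """
    by_account, by_username, problems = {}, {}, []
    for account, username in pairs:
        # setdefault stores the first value seen and returns it thereafter.
        if by_account.setdefault(account, username) != username:
            problems.append(f"{account} has more than one Username")
        if by_username.setdefault(username, account) != account:
            problems.append(f"{username} is part of more than one Account")
    return problems


print(check_one_to_one([("acct1", "@alice"), ("acct2", "@bob")]))
# []
print(check_one_to_one([("acct1", "@alice"), ("acct1", "@al")]))
# ['acct1 has more than one Username']
```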
Accounts have passwords, short biographies (bios), profile pictures, and other things
associated with the common usage of accounts. Twitter has no official definition of account, so
one was created. That definition is “A membership to a service, website, or related entity. An
account with a digital entity typically includes a login and/or username, password, and possibly
additional information about the account owner.” Further integration with the IAO would likely
result in a more refined definition. Twitter’s official definition of a username is as follows: “Also
known as a Twitter handle. Must be unique and contain fewer than fifteen characters. Is used to
identify you on Twitter for replies and mentions.”
The domain-specific information is rather flat, with only two levels at most past the
Information Content Entity class. Because the subject matter is so narrow, there is not the same
depth that is apparent in domains such as medicine or manufacturing. Most relationships are
not “is_a,” but rather horizontal in nature. Object properties such as “part_of” or “has” are
considerably more relevant to mapping the domain. As the IAO grows and the Twitter ontology
is better integrated with forthcoming classes, the relationship hierarchy will grow slightly more
complicated.
Evaluation
Evaluation testing needs to be performed on all ontologies to ensure their veridicality.
For the Twitter ontology, the most straightforward method is to take real Tweets and consider
whether each component is represented in the ontology. A Tweet was chosen, and this
evaluation was performed.
Fig. 5 shows a real tweet that has been diagrammed to represent how the various
components are modeled in the ontology. Each visible component is an instance of a class in
the ontology. In the top left corner of the image, the picture is the account’s “Profile Picture.”
Adjacent to this is the name “Julian Goldstein,” an instance of the “Display Name” class. It is possible
this is the original creator’s real name as well, although that cannot be verified. Underneath the
display name is “@JulianGoldstein,” the “Username” of the “Account” that the tweet originated
from. In this instance it is the same as the display name, but that is not necessarily the case.
To the right of those are two buttons. The first button allows the current user to send a
“Direct Message” to the username of the Tweet creator. The Follow button, represented by the
“Follow” process, causes the “Follower” role to inhere in the current username. The orange and
green marks to the upper right corner are an “Annotation” representing a favorited Tweet, and
indicating that the Tweet has undergone the “Retweeting” process and is now a “Retweet.”
Figure 5
The body of the Tweet contains the “Message Content,” which is an “iao.Textual Entity.”
This particular example lacks instances of “Hashtag” and “Mention.” Below the message content
are three buttons: Reply, Retweet, and Favorite. All three are included in the ontology as
processes.
Below that the Tweet is marked as having been retweeted and favorited a certain
number of times, which are outputs of the aforementioned processes. Finally, at the bottom, are
a “Time” and “Date,” both of which are designated by a “Timestamp.”
Overall, the evaluation of this Tweet is encouraging. Every single item on the page, as
well as the processes that create them, is represented in the ontology. A single test is hardly
proof that the ontology is exhaustive and functional, but it is a promising start.
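The coverage test described in this section amounts to a set difference, sketched below. The class set contains only names quoted in the walkthrough above and is not the full list of classes in the .owl file; “Poll” is a hypothetical component added purely to show what a gap would look like.

```python
# Class names quoted in the evaluation walkthrough (illustrative subset).
ONTOLOGY_CLASSES = {
    "Profile Picture", "Display Name", "Username", "Account",
    "Direct Message", "Follow", "Follower", "Annotation",
    "Retweeting", "Retweet", "Message Content", "Hashtag", "Mention",
    "Time", "Date", "Timestamp",
}


def uncovered(components, classes):
    """Return the observed components that have no corresponding class."""
    return sorted(set(components) - set(classes))


observed = ["Profile Picture", "Display Name", "Username",
            "Message Content", "Timestamp"]
print(uncovered(observed, ONTOLOGY_CLASSES))             # []
print(uncovered(observed + ["Poll"], ONTOLOGY_CLASSES))  # ['Poll']
```

Running the same check over many diagrammed tweets, rather than one, would give a stronger indication of whether the ontology is exhaustive.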
The true path rule can help identify misplaced classes. By following is_a relationships
upwards, logical inconsistencies can be identified when a subclass is not entirely a type of its
parent class.
Figure 6
Subclass          Relation   Parent class
List              is_a       Column
Profile           is_a       Column
Timeline          is_a       Column
Biography         is_a       TextualEntity
Message_Content   is_a       TextualEntity
Retweet           is_a       Tweet
User              is_a       SpecificallyDependentContinuant
Follower          is_a       User
Replying          is_a       Creation
Retweeting        is_a       Process
Figure 6 is a mapping of all classes with a parent class that is not from the IAO or BFO,
as well as three others. Reading across, the table shows that the natural language reading of “is
a” works on each line. Some domain knowledge is required to understand all relationships, such
as knowing what a profile consists of (footnote 1), or the roles of follower and user (footnote 2).
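Following is_a relationships upward, as the true path rule requires, can be sketched as a simple walk over a child-to-parent mapping. The mapping reproduces the pairs listed in Figure 6; the function name is an illustrative assumption, and the sketch assumes each class has a single asserted parent.

```python
# Child -> parent pairs as listed in Figure 6.
IS_A = {
    "List": "Column",
    "Profile": "Column",
    "Timeline": "Column",
    "Biography": "TextualEntity",
    "Message_Content": "TextualEntity",
    "Retweet": "Tweet",
    "User": "SpecificallyDependentContinuant",
    "Follower": "User",
    "Replying": "Creation",
    "Retweeting": "Process",
}


def ancestors(cls, is_a=IS_A):
    """Follow is_a links upward, returning every ancestor in order."""
    chain, seen = [], {cls}
    while cls in is_a:
        cls = is_a[cls]
        if cls in seen:
            raise ValueError(f"is_a cycle detected at {cls}")
        seen.add(cls)
        chain.append(cls)
    return chain


print(ancestors("Follower"))
# ['User', 'SpecificallyDependentContinuant']
```

A reviewer can then read each class against every entry in its chain: if “a Follower is a User” and “a Follower is a SpecificallyDependentContinuant” both hold, the placement passes; a link that reads false points to a misplaced class.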
Improvement
As is, the current .owl file has plenty of room for improvement. The most
immediate way to achieve this would be increased integration with the Information Artifact
Ontology. Working with the next release of the IAO would hopefully identify more specific
classes that the various Twitter classes would reside under. Increased specificity of classes in
this way would lead to a more accurate understanding of the service as a whole, along with a
more detailed asserted hierarchy. It would also open the door to merging with other similar
ontologies, such as ontologies describing other social media networks, e-mail, instant
messaging, communication clients such as Skype, Internet message boards, and generally
various types of Internet-based communication. A class such as “Message_Content” would
likely be shared across multiple domains, as most types of messaging include some form of
content to relay their intentions.
Digging deeper into the transmission protocol of the Twitter service would be another
avenue for improving the ontology. Increased explication of technical specifications may reveal
significant overlap with other protocol functions, and thus greater similarity of functionality
between Twitter and other services. A stronger representation of the technical aspects may
provide increased insight into matters of security and efficiency, especially when other platforms that
share protocols develop improved methods.
Footnote 1. A profile is a combination of several classes such as profile picture and biography, along with a list of
tweets from whichever username’s profile is being viewed.
Footnote 2. In order to have the role Follower, a person must already have the role User. A person cannot be a
Follower without being a User.
A further step, expanding beyond mechanical representation, would be integration with language
ontologies, a rich field that has much to complement the Twitter ontology. Currently, this
ontology only seeks to map the relations among the components of the software. All of the language
actually in use by people accessing the service is rolled up into the Message_Content class, which
exists as a textual entity with no inspection of the contents. Incorporating language ontologies
would give interested parties a much fuller understanding of the way the platform is used
culturally. This type of interconnectedness promises to be one of the most fruitful
ways to explore Twitter and make use of the Twitter ontology.
Web Resources
1. Basic Formal Ontology. http://www.ifomis.org/bfo/
2. Information Artifact Ontology. https://code.google.com/p/information-artifact-ontology/
Citations
1. Beaumont, Peter. “The truth about Twitter, Facebook, and the uprisings in the Arab
world.” The Guardian. 24 February 2011. Web. 11 November 2013.
http://www.theguardian.com/world/2011/feb/25/twitter-facebook-uprisings-arab-libya
2. Tsukayama, Hayley. “Twitter turns 7: Users send over 400 million tweets per day.”
Washington Post. 21 March 2013. Web. 11 November 2013.
http://articles.washingtonpost.com/2013-03-21/business/37889387_1_tweets-jackdorsey-twitter
3. “The Twitter Glossary.” Twitter.com. Web. https://support.twitter.com/articles/166337the-twitter-glossary