SOCIAL NETWORK BENCHMARK DATA GENERATOR

Last update: 2013/09/04

This document describes the features of the social network data generator (SNDG). It includes an overview of the data generation process, the methods for data generation (i.e., value domains, constraints, statistical distributions and data correlations), and the data dictionaries used by the generator.

DATA SCHEMA

The data schema of the social network benchmark is presented in the following picture.

OVERVIEW OF THE DATA GENERATION PROCESS

The implementation of the SNDG is based on the S3G2 [2] generator. The S3G2 Java source code has been adapted to the social network data schema, preserving some of its methods for generating data correlations and statistical distributions.

One of the main features of the SNDG is the use of MapReduce tasks running on Apache Hadoop. This allows data to be generated in parallel and the generator to scale to terabytes of data.

The generator requires the definition of the following input parameters:

    numtotalUser: <long>
    startYear: <date>
    numYears: <date>
    serializerType: <ttl|csv>

The process of data generation includes the following steps:

1. Initialize parameters and dictionaries.
2. Create persons, including relations with universities and companies (correlated by location).
3. Generate interests/tags of persons (correlated by location).
4. Generate friendship relationships (university-country correlated friendships, interest correlated friendships, and random friendships).
5. Generate forums, posts and comments (correlated with the interests of the person).
6. Serialize the generated data (including static data about Places and TagClasses).

METHODS FOR DATA GENERATION

ENTITY: PERSON

Person.id
It is a unique sequential number.

Person.creationDate
It is a uniform random date between startYear and endYear, where endYear is derived from the generation parameters startYear and numYears.

Person.firstName
The first name of a person is correlated with location and gender. For each location (country) there is a set of names obtained from the “givennameByCountryBirthPlace” dictionary.
The names are sorted by popularity: N1, N2, N3, N4, N5, N6. N1, N3 and N5 are assigned to males; N2, N4 and N6 are assigned to females. The order of the lists depends on the birth period:

    From 1980 to 1985, the list of male names is: N1, N3, N5
    From 1985 to 1990, the list of male names is: N3, N1, N5 (the order of a few top names is randomly changed with respect to the period 1980 to 1985)
    From 1980 to 1985, the list of female names is: N2, N4, N6
    From 1985 to 1990, the list of female names is: N4, N2, N6

(This method must be reviewed.)

Person.lastName
The last name of the person is randomly selected from the values available in the dictionary "surnameByCountryBirthPlace.txt". The selection is correlated with the country of the person.

Person.gender
It is generated randomly, with a 50% chance each for male and female.

Person.birthday
A uniform random date between 1-1-1980 and 1-1-1990.

Person.email
The email domains are obtained from the dictionary “email.txt”. Some domains, such as Gmail, are more popular than others: the top 5 domains have popularity scores, and the remaining domains are randomly distributed.

Person.speaks
Randomly generated but correlated with the location of the person. There is also a chance of having English as a foreign language.

Person.browserUsed
Randomly generated by selecting a browser from the dictionary “browsersDic”.

Person.locationIP
The IP address of a person is correlated with the country defined by the relation Person.isLocatedIn. It uses the data in the “ipaddrByCountries” directory.

Person-isLocatedIn-Place(City)
Sets the location id and also the Z-order of the location by using the location dictionary. The distribution of persons follows the population of each country.

Person-studyAt-Place(University)
The universities selected are correlated with the location of the person (i.e., the city given by the relation isLocatedIn). Each location has a collection of universities (see the institutesCityByCountry.txt dictionary). The top 10 universities have a much higher probability of being selected than the others (by default, 90%).
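The following is a minimal Java sketch of one way the 90% rule for the top 10 universities could work; it is not taken from the SNDG source code. The class name UniversityPicker, the method pick, the ordering of the list from most to least popular, and the uniform choice inside each group are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; it is not part of the SNDG source code.
class UniversityPicker {
    static final int TOP_COUNT = 10;            // size of the favoured group
    static final double TOP_PROBABILITY = 0.9;  // default share stated in the text

    // Picks one university from a list assumed to be ordered from most to
    // least popular. The uniform choice inside each group is an assumption.
    static String pick(List<String> universities, Random rng) {
        if (universities.isEmpty()) {
            throw new IllegalArgumentException("no universities for this location");
        }
        int top = Math.min(TOP_COUNT, universities.size());
        boolean hasTail = universities.size() > top;
        // Stay in the top group with probability 0.9, or always if there is no tail.
        if (!hasTail || rng.nextDouble() < TOP_PROBABILITY) {
            return universities.get(rng.nextInt(top));
        }
        // Otherwise choose among the remaining universities.
        return universities.get(top + rng.nextInt(universities.size() - top));
    }

    public static void main(String[] args) {
        List<String> universities = new ArrayList<>();
        for (int i = 1; i <= 30; i++) {
            universities.add("University " + i); // placeholder names
        }
        Random rng = new Random(7);
        int fromTop = 0;
        int draws = 10000;
        for (int i = 0; i < draws; i++) {
            String chosen = pick(universities, rng);
            if (universities.indexOf(chosen) < TOP_COUNT) {
                fromTop++;
            }
        }
        System.out.println("share from top 10: " + (double) fromTop / draws);
    }
}
```

With a fixed seed the sketch is reproducible, and the printed share should be close to 0.9 for a location that has more than 10 universities. Locations with 10 or fewer universities always draw from the whole list.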
The value of the attribute classYear is given by the birth year of the person plus a random value between 20 and 25.

Person-workAt-Organisation(Company)
The companies selected are correlated with the location of the person (i.e., the city given by the relation isLocatedIn). The value of the attribute workFrom is given by the birth year of the person plus a random value between 20 and 25.

Person-hasInterest-Tag
Generation method:
1. Selection of a “main” tag, using the dictionary dicCelebritiesByCountry.txt:
   a. Get the Location.id of the person.
   b. Using the dictionary dicCelebritiesByCountry.txt, get a random celebrity related to the person's country with a 50% probability. If the location (country) of the user has no celebrities, a random one is selected.
2. Determine the number of tags by selecting a random value between the parameters minNumTagsPerUser (currently 1) and maxNumTagsPerUser (currently 10).
3. Selection of additional tags:
   a. Get topics related to the main topic by using the topicMatrixId.txt dictionary.
   b. If the main topic has no related topics, choose another random row id.

Person-likes-Post
Feature to review.

Person-knows-Person
The relation has been designed to follow a power-law distribution by using the SSJ library [1]. The number of friends of a person is determined by the function PowerDist of the SSJ library. The assigned friendships are divided as follows: 45% are created from the university-location data, 45% from the tags, and the remaining 10% are random.

Person-follows-Person
Feature to implement.

ENTITY: FORUM

Each forum corresponds to the Wall of a person.

Forum.id
The id of a user wall is generated by multiplying the id of the person by 2. The ids of the other forums are generated sequentially after the last user wall id.

Forum.title
There are three possible title patterns:
1. For the user wall: Wall of [firstname] [lastname]
2. For albums: Album [number] of [firstname] [lastname]
3. For groups: Group for [celebrity] in [place]

Forum.creationDate
A random date between the person's creationDate and endYear.

Forum.hasModerator
The person who created the forum. A wall is created for each person. In addition, for each month between the person's creationDate and endYear, a random number of album forums and a random number of group forums are created; those random numbers are defined in the private parameter file.

Forum-hasTag-Tag
A random subset of the creator's tags.

Forum-hasMember-Person
A random subset of the creator's knows set.

Forum-containerOf-Post
A random number of posts, defined in the private parameter file, is generated for the forum.

ENTITY: POST

Post.id
It is a unique sequential number.

Post.creationDate
A random date between the person's creationDate and endYear.

Post.browserUsed
The creator's browser.

Post.locationIP
Usually it is an IP correlated to the creator's country, but there is a certain chance of having a random IP; that chance has one value in the summer season and another in the normal season.

Post.content
The content of a post is a fragment of the abstracts of the tags available in the “tagText.txt” dictionary (they were obtained from Wikipedia).

Post.language
Only available for text posts. One random language from the creator's speaks relation.

Post.imageFile
Only available for photo posts. A text with the pattern: photo[number].jpg.

Post-hasCreator-Person
For each user and each month, a certain number of posts is generated, as defined in the private parameter file.

Post-isLocatedIn-Place(Country)
The country correlated to the IP.

Post-hasTag-Tag
The tags of a post are selected from the list of the user's tags. One tag is always selected, to force a non-empty set; each of the other tags has a 1/5 chance of being selected. The number of tags of a post is therefore in the range [1, number of user's tags].

ENTITY: COMMENT

Comment.id
It is a unique sequential number.
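Looking back at the Post-hasTag-Tag rule of the POST entity, the selection of a non-empty tag subset can be sketched in Java as follows. This is an illustrative sketch, not the generator's code: the class name PostTagSelector, the method selectTags and the uniform choice of the guaranteed tag are assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical helper for illustration; it is not part of the SNDG source code.
class PostTagSelector {
    static final double TAG_PROBABILITY = 0.2; // the 1/5 chance from the text

    // Returns a non-empty subset of the user's tags, in their original order.
    static List<Integer> selectTags(List<Integer> userTags, Random rng) {
        if (userTags.isEmpty()) {
            throw new IllegalArgumentException("the user has no tags");
        }
        // Assumption: the guaranteed tag is chosen uniformly at random.
        int forced = rng.nextInt(userTags.size());
        List<Integer> selected = new ArrayList<>();
        for (int i = 0; i < userTags.size(); i++) {
            // The forced tag is always kept; every other tag has a 1/5 chance.
            if (i == forced || rng.nextDouble() < TAG_PROBABILITY) {
                selected.add(userTags.get(i));
            }
        }
        return selected;
    }

    public static void main(String[] args) {
        List<Integer> userTags = Arrays.asList(11, 42, 57, 63, 98); // placeholder tag ids
        Random rng = new Random(3);
        for (int post = 0; post < 5; post++) {
            System.out.println("post " + post + " tags: " + selectTags(userTags, rng));
        }
    }
}
```

Because one position is always kept, the result size stays within [1, number of user's tags], which matches the range stated for Post-hasTag-Tag.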

Comment.creationDate
A random date between the creationDate of the last comment on the base post and that creationDate plus one day.

Comment.browserUsed
The creator's browser.

Comment.locationIP
Usually it is an IP correlated to the creator's country, but there is a certain chance of having a random IP; that chance has one value in the summer season and another in the normal season.

Comment.content
The content of a comment is a fragment of the abstracts of the tags available in the “tagText.txt” dictionary (they were obtained from Wikipedia).

Comment-hasCreator-Person
A random friend or member of the group is selected as the creator of the comment.

Comment-isLocatedIn-Place(Country)
The country correlated to the IP.

Comment-replyOf-Post
There is a chance of 1/(number of created comments + 1) that the comment is a reply to the original post.

Comment-replyOf-Comment
There is a chance of (number of created comments)/(number of created comments + 1) that the comment is a reply to a previously created comment.

ENTITY: ORGANISATION

Organisation.id
It is a unique sequential number.

Organisation.type
There are two types of organisation: university and company.

Organisation.name
The names of universities and companies are obtained from the dictionaries institutesCityByCountry.txt and companiesByCountry.txt.

ENTITY: TAG

Tag.id
It is a unique sequential number.

Tag.name
The tags are obtained from the “dicTopic.txt” dictionary.

Tag-hasType-TagClass
It classifies the tag in a TagClass by using the “dicTopic.txt” dictionary.

ENTITY: TAGCLASS

TagClass.id
It is a unique sequential number.

TagClass.name
The values are obtained from the “tagClasses.txt” dictionary.

TagClass-isSubClassOf-TagClass
This relationship defines a taxonomy of TagClasses. The relationships are obtained from the “tagHierarchy.txt” dictionary.

ENTITY: PLACE

Place.id
It is a unique sequential number.

Place.type
There are three types of places: city, country and continent.

Place.name
The names of continents and countries are obtained from the “dicLocation.txt” dictionary.
The names of cities are obtained from the “popularPlacesByCountry” dictionary.

Place-isPartOf-Place
The relationship isPartOf between two places is obtained from the “dicLocation.txt” and “popularPlacesByCountry” dictionaries. It defines a hierarchy of places, where “City isPartOf Country” and “Country isPartOf Continent”.

DICTIONARIES

browsersDic.txt
Contains the name of the browser and its probability. Used to assign browsers to persons according to browser popularity. Used by BrowserDictionary.java.
Sample:
    Chrome 0.279
    Internet Explorer 0.232
    Firefox 0.422
    ...

companiesByCountry.txt
Contains the country and the name of a company in that country. Used to give a person a workplace in his or her home country, if available. Used by CompanyDictionary.
Sample:
    Afghanistan KamAir
    Afghanistan BalkhAirlines
    Afghanistan KhyberAfghanAirlines

countryAbbrMapping.txt
Contains the abbreviation and the name of the country it refers to. This is used to link countries to IP addresses (see idzones). Used by IPAddressDictionary.java.
Sample:
    ac United Kingdom academic institutions
    ad Andorra
    ae United Arab Emirates

dicCelebritiesByCountry.txt
Contains the countryId, the celebrityId and the cumulative probability of the celebrity's popularity within that country. Used to assign to a person a celebrity from the same country, if available. Used by TagDictionary.java.
Sample:
    0 0 0.27328605200945627
    0 1 0.4884160756501182
    0 2 0.649645390070922

dicLocation.txt
Contains the continent name, the country name, latitude, longitude, population and cumulative probability of the population. Used to create the continent-country hierarchy and to distribute the nationality of users according to the population data. Used by LocationDictionary.java.
Sample:
    Asia Afghanistan 35 69 15500000 0.0028010447
    Africa Algeria 37 3 29100867 0.008059937
    Africa Angola -9 13 5646177 0.0090802721

dicTopic.txt
Contains the tagId, the tagClassId, the tag name and the tag foaf:name.
Used in the serialization part of the software to assign names to the tags and write the basic class of each tag. Used by TagDictionary.java.
Sample:
    0 349
    1 211
    2 98

email.txt
Contains the email domain name and, for the most popular domains, its probability; for the rest it contains only the name. Used to assign email domains to users. Used by EmailDictionary.java.
Sample:
    gmail.com 0.45
    gmx.com 0.20
    yahoo.com 0.18

givennameByCountryBirthPlace.txt.freq.full
Contains the country name, the first name, the gender, the birth date period and an unused number. Used to assign a first name to the user according to gender and age. Used by NamesDictionary.java.
Sample:
    Abkhazia Diana 0 0 1
    Abkhazia Maya 0 0 1
    Abkhazia Diana Gurtskaya 1 0 1
    Abkhazia Diana 0 1 1

institutesCityByCountry.txt
Contains the country name, the university name and the city of that university. Used to create the country->city hierarchy and to assign to the user a university from the same country. Used by OrganizationsDictionary.java (all data) and LocationDictionary.java (the country->city data).
Sample:
    Åland_Islands Åland University of Applied Sciences Mariehamn
    Abkhazia Abkhazian State University Sukhumi
    Afghanistan Paktia University Gardēz
    Afghanistan Baghlan University Puli_Khumri

languagesByCountry.txt
Contains the name of the country and a list of language data: the ISO 639-1 code, a * if it is an official language, and the percentage of speakers (0 if unknown). Used to assign to a user the languages of his or her country. Used by LanguageDictionary.java.
Sample:
    Aruba es 12.6 en 7.7 nl * 5.8
    Antigua and Barbuda en * 0
    United_Arab_Emirates ar * 0 fa 0 en 0 hi 0 ur 0

popularPlacesByCountry.txt
Contains the country name, the location name, the location name with spaces, latitude and longitude. Used by PopularPlacesDictionary.java.
Sample:
    Afghanistan Ab-Kol Ab-Kol 36.22000122070312 68.5
    Afghanistan Ab_Bazan Ab Bazan 36.93333435058594 69.94999694824219
    Afghanistan Ab_Daw Ab Daw 36.25 71.16666412353516
    Afghanistan Ab_Gaj Ab Gaj 36.98333358764648 72.69999694824219

smartPhonesProviders.txt (unused)
Contains the names of smartphone providers. Used by UserAgentDictionary.java.
Sample:
    IPhone
    IPad
    HTC
    Samsung
    LG

surnameByCountryBirthPlace.txt.freq.sort
Contains the number of occurrences of the last name, the country name and the last name. Used to assign a surname to the user. Used by NamesDictionary.java.
Sample:
    2,Abkhazia,Gurtskaya
    1,Abkhazia,Kopitseva
    1,Adjara,Vashalomidze
    7,Afghanistan,Zaland

tagClasses.txt
Contains the tagClassId, the name and the RDF label. Used to serialize the name and label of the tagClasses. Used by TagDictionary.java.
Sample:
    0 Thing thing
    1 BasketballLeague basketball league
    2 LunarCrater lunar crater
    3 MilitaryPerson military person

tagHierarchy.txt
Contains the base tagClassId and the parent tagClassId. Used to build the tag hierarchy in the serialization process. Used by TagDictionary.java.
Sample:
    19 179
    136 338
    173 211
    230 149

tagText.txt
Contains the tagId and a text. Used to assign a text to the posts and comments related to the tag. Used by TagDictionary.java.
Sample:
    0 Hamid Karzai, GCMG (Pashto: حامد کرزی, Hāmid Karzay; born 24 December 1957) is the 12th and ...
    1 Jalāl ad-Dīn Muḥammad Balkhī, also known as Jalāl ad-Dīn Muḥammad Rūmī and ..
    2 Mahmud of Ghazni, actually Yamīn ad-Dawlah Abdul-Qāṣim ...

topicMatrixId.txt
Contains the id of topic 1, the id of topic 2, the cumulative percentage for topic 1 and the number of times topic 1 and topic 2 appear in the same text. Used to select a list of tags correlated with the main interest of the user.
Used by TagDictionary.java.
Sample:
    2909 4870 0.0 8.0
    2909 4871 2.392072671167751E-4 8.0
    2909 4872 4.784145342335503E-4 2.0

topuniversities.txt (unused)
Contains the name of the university, the country and a cumulative percentage.
Sample:
    University of Cambridge United_Kingdom 100
    Harvard University United_States 99.18
    Yale University United_States 98.68

idzones
The folder resources/ipaddrByCountries contains a list of files named XX.zone, where XX is a valid country abbreviation contained in the countryAbbrMapping.txt dictionary. Each file contains a list of IP address ranges (in CIDR notation) of that country.
Sample of ad.zone (Andorra):
    85.94.160.0/19
    91.187.64.0/19
    109.111.96.0/19
    194.158.64.0/19

REFERENCES

[1] SSJ: Stochastic Simulation in Java. University of Montreal. http://www.iro.umontreal.ca/~simardr/ssj/indexe.html
[2] S3G2: A Scalable Structure-Correlated Social Graph Generator. Minh-Duc Pham, Peter A. Boncz and Orri Erling. TPCTC. 2010.
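
As a closing illustration, several dictionaries described above (for example dicCelebritiesByCountry.txt and dicLocation.txt) store a cumulative probability per row. The Java sketch below shows one common way such a column can be used to draw a row: generate a uniform number u in [0, 1) and take the first row whose cumulative value is at least u. It is a sketch under assumptions, not the generator's code: the class name CumulativeLookup and the method indexFor are hypothetical, and each row's value is assumed to be the upper bound of its interval, as the dicCelebritiesByCountry.txt sample suggests.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical helper for illustration; it is not part of the SNDG source code.
class CumulativeLookup {
    // Returns the index of the first row whose cumulative probability is >= u.
    // The array must be sorted in ascending order, as cumulative values are.
    static int indexFor(double[] cumulative, double u) {
        int pos = Arrays.binarySearch(cumulative, u);
        if (pos < 0) {
            pos = -pos - 1; // binarySearch encodes the insertion point as -(point) - 1
        }
        // Clamp in case u lies above the last listed value (e.g. a truncated file).
        return Math.min(pos, cumulative.length - 1);
    }

    public static void main(String[] args) {
        // The three cumulative values of the dicCelebritiesByCountry.txt sample (country 0).
        double[] cumulative = {0.27328605200945627, 0.4884160756501182, 0.649645390070922};
        Random rng = new Random(5);
        for (int i = 0; i < 5; i++) {
            double u = rng.nextDouble();
            System.out.println("u=" + u + " -> row " + indexFor(cumulative, u));
        }
    }
}
```

The binary search keeps each draw logarithmic in the number of rows, which matters for large dictionaries. Note that the topicMatrixId.txt sample starts at 0.0, so that file may follow a different convention (lower bound per row), and the lookup would need to be adapted accordingly.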