schema-features

SOCIAL NETWORK BENCHMARK
DATA GENERATOR
This document describes the features of the social network data generator (SNDG).
It includes an overview of the data generation process, the methods for data
generation (i.e., value domains, constraints, statistical distributions and data
correlations), and the data dictionaries used by the generator.
DATA SCHEMA
The data schema of the social network benchmark is presented in the following
picture.
Last update: 2013/09/04
OVERVIEW OF THE DATA GENERATION PROCESS
The implementation of the SNDG is based on the S3G2 [2] generator. The S3G2 Java
source code has been adapted to the social network data schema while preserving
some of its methods for generating data correlations and statistical distributions.
One of the main features of the SNDG is the use of MapReduce tasks running on
Apache Hadoop, which allows it to scale to terabytes of data generated in parallel.
The generator requires the definition of the following input parameters:
 numtotalUser: <long>
 startYear: <date>
 numYears: <date>
 serializerType: <ttl|csv>
The process of data generation includes the following steps:
1. Initialize parameters and dictionaries.
2. Create persons including relations with universities and companies
(correlated by location).
3. Generate interest/tags of persons (correlated by location).
4. Generate friendship relationships (university-country correlated friendships,
interest correlated friendships, and random friendships).
5. Generate forums, posts and comments (correlated with the interests of the
person).
6. Serialize the generated data (including static data about Places and
TagClasses).
METHODS FOR DATA GENERATION
ENTITY: PERSON
Person.id
It is a unique sequential number.
Person.creationDate
It is a uniformly distributed random date between startYear and endYear. (endYear
is not listed among the input parameters; it is presumably startYear plus numYears.)
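As a minimal sketch of this rule (not the generator's actual code), the date can be drawn by picking a uniform offset in seconds. The function name is hypothetical, and the sketch assumes that endYear means January 1st of startYear + numYears.

```python
import random
from datetime import datetime, timedelta

def random_creation_date(start_year, num_years, rng):
    """Return a uniformly distributed datetime in [start, end)."""
    start = datetime(start_year, 1, 1)
    # Assumption: endYear is derived as startYear + numYears.
    end = datetime(start_year + num_years, 1, 1)
    span_seconds = int((end - start).total_seconds())
    return start + timedelta(seconds=rng.randrange(span_seconds))
```

Passing a seeded random.Random instance as rng keeps the generated dates reproducible between runs.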
Person.firstName
The name of a person is correlated with location/gender.
For each location (country) we have a set of names obtained from the
“givennameByCountryBirthPlace” dictionary.
The names are sorted by popularity: N1, N2, N3, N4, N5, N6.
We assign N1, N3 and N5 to males, and N2, N4 and N6 to females.
From 1980 - 1985, the list of male names is: N1, N3, N5
From 1985 - 1990, the list of male names is: N3, N1, N5 (the order of a few top
names is randomly changed compared to the period 1980 - 1985)
From 1980 - 1985, the list of female names is: N2, N4, N6
From 1985 - 1990, the list of female names is: N4, N2, N6
(This method must be reviewed)
Person.lastName
The lastName of the person is randomly selected from the values available in the
dictionary "surnameByCountryBirthPlace.txt". The procedure considers a
correlation with the country of the person.
Person.gender
It is generated randomly, with a 50% chance each for male and female.
Person.birthday
A uniformly distributed random date between 1-1-1980 and 1-1-1990.
Person.email
The email domains are obtained from the dictionary “email.txt”.
Some email domains, such as Gmail, are more popular than others. The top-5 domains
have popularity scores in the dictionary; the remaining domains are chosen at random.
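One way to read this rule is sketched below. It assumes that the scores are plain (non-cumulative) probabilities and that the unscored domains share the leftover probability uniformly; neither is confirmed by this document. The three scored domains are taken from the email.txt sample, while the function name and the two "example" domains are hypothetical.

```python
import random

# Scored domains from the email.txt sample; the unscored ones are placeholders.
SCORED = [("gmail.com", 0.45), ("gmx.com", 0.20), ("yahoo.com", 0.18)]
OTHERS = ["example.org", "example.net"]

def pick_email_domain(scored, others, rng):
    """Pick a scored domain by its probability, else a random unscored one."""
    r = rng.random()
    cumulative = 0.0
    for domain, prob in scored:
        cumulative += prob
        if r < cumulative:
            return domain
    # Assumption: the leftover probability mass is spread uniformly.
    return rng.choice(others)
```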
Person.speaks
Randomly generated but correlated with the location of the person. There is a
chance to have English as a foreign language.
Person.browserUsed
Randomly generated by selecting a browser from the dictionary “browsersDic”.
Person.locationIP
The IP address of a person is correlated with the country defined by the relation
Person.isLocatedIn. It uses the data in the “ipaddrByCountries” directory.
Person-isLocatedIn-Place(City)
Sets the location id and the Z-order of the location by using the location
dictionary.
The distribution of persons over countries follows the population of each country.
Person-studyAt-Place(University)
The universities selected are correlated with the location of the person (i.e., the city
given by relation isLocatedIn).
Each location has a collection of universities (see the dictionary). The top-10
universities have a much higher probability of being selected than the others
(90% by default).
The value of the attribute classYear is given by the birth year of the person plus a
random value between 20 and 25.
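The selection rule and the classYear formula can be sketched as follows. This is an illustration under assumptions: the 90% is read as the share of picks that go to the top-10 group as a whole, the choice within each group is uniform, and the bounds 20 and 25 are inclusive. Both function names are hypothetical.

```python
import random

def pick_university(universities, rng, top_n=10, top_prob=0.9):
    """universities is ordered so that the first top_n entries are the top ones."""
    top, rest = universities[:top_n], universities[top_n:]
    # Fall back to the top group when the location has no other universities.
    if not rest or rng.random() < top_prob:
        return rng.choice(top)
    return rng.choice(rest)

def class_year(birth_year, rng):
    """Birth year plus a random value between 20 and 25 (assumed inclusive)."""
    return birth_year + rng.randint(20, 25)
```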
Person-workAt-Organisation(Company)
The companies selected are correlated with the location of the person (i.e., the city
given by relation isLocatedIn).
The value of the attribute workFrom is given by the birth year of the person plus a
random value between 20 and 25.
Person-hasInterest-Tag
Generation method:
1. Selection of a “main” tag -> uses dictionary dicCelebritiesByCountry.txt:
a. Get the Location.Id of the person.
b. By using the dictionary dicCelebritiesByCountry, get a random
celebrity related to his/her country with a 50% probability. If the
location (country) of the user doesn't have celebrities then it selects a
random one.
2. Determine the number of tags by selecting a random value between the
minimum number of tags per user (right now 1) and maxNumTagsPerUser (right now
10).
3. Selection of additional tags.
a. Get topics related to the main topic by using the matrixId dictionary.
b. If the main topic has no related topics choose another random row id.
Person-likes-Post
Feature to review.
Person-knows-Person
The relation has been designed to follow a power-law distribution by using the SSJ
library [1]. The number of friends for a person is determined by the function
PowerDist of the SSJ library.
The assigned friendships are divided in the following fashion: 45% will be created
from the university-location data, 45% from the tags, and the last 10% will be
random.
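The 45/45/10 split can be illustrated with a small helper that divides a person's friend count into the three categories. The function name is hypothetical, and assigning the rounding remainder to the random category is an assumption of this sketch. The friend count itself would come from the power-law distribution, which is not reproduced here.

```python
def split_friend_budget(num_friends, shares=(0.45, 0.45, 0.10)):
    """Split a friend count into (university, interest, random) parts."""
    university = int(num_friends * shares[0])
    interest = int(num_friends * shares[1])
    # Assumption: whatever is left after rounding goes to random friendships.
    random_part = num_friends - university - interest
    return university, interest, random_part
```

For example, a person with 100 friends would get 45 university-location friendships, 45 interest friendships and 10 random ones.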
Person-follows-Person
Feature to implement.
ENTITY: FORUM
Each forum corresponds to the Wall of a person.
Forum.id
The id of a user wall is generated by multiplying the id of the person by 2. The
ids of the other forums are generated sequentially after the last user wall id.
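A minimal sketch of this id scheme follows. The function names are hypothetical, and the sketch assumes that the last user wall id is twice the largest person id.

```python
import itertools

def wall_forum_id(person_id):
    """Id of the wall forum that belongs to a person."""
    return person_id * 2

def other_forum_ids(last_person_id):
    """Sequential ids for albums and groups, starting after the last wall id."""
    return itertools.count(wall_forum_id(last_person_id) + 1)
```

With a largest person id of 99, the last wall id is 198 and the first album or group forum receives id 199.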
Forum.title
There are three possible title patterns.
1. For user walls: Wall of [firstname] [lastname]
2. For albums: Album [number] of [firstname] [lastname]
3. For groups: Group for [celebrity] in [place]
Forum.creationDate
A random date between the person creationDate and endYear.
Forum.hasModerator
The person who created the forum. One wall is created for each person. In addition,
for each month between the person's creationDate and endYear, a random number of
album forums and of group forums is created; these random numbers are defined in
the private parameter file.
Forum-hasTag-Tag
A random subset of the creator's tags.
Forum-hasMember-Person
A random subset of the persons the creator knows.
Forum-containerOf-Post
A random number of posts is generated for the forum; this number is defined in the
private parameter file.
ENTITY: POST
Post.Id
It is a unique sequential number.
Post.creationDate
A random date between the person creationDate and endYear.
Post.browserUsed
The creator's browser.
Post.locationIP
Usually it is an IP address correlated with the creator's country, but there is a
certain chance of a random IP address instead; the chance in the summer season
differs from the chance in the normal season.
Post.content
The content of a post is a fragment of the abstracts of its tags, available in the
“tagText.txt” dictionary (they were obtained from Wikipedia).
Post.language
Only available for text posts. One random language from the creator's speaks relation.
Post.imageFile
Only available for photo posts. A string with the pattern: photo[number].jpg.
Post-hasCreator-Person
For each user and for each month, a certain number of posts is generated, as
defined in the private parameter file.
Post-isLocatedIn-Place(country)
The country correlated to the IP.
Post-hasTag-tag
The tags for a post are selected from the list of the user's tags. Each tag has a 1/5
chance of being selected, except for one tag that is always selected to guarantee a
non-empty set.
The number of tags for a post is therefore in the range [1, number of user's tags].
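The tag selection for a post can be sketched as follows. The function name is hypothetical, and choosing the forced tag uniformly at random is an assumption, since the text does not say how that tag is picked.

```python
import random

def pick_post_tags(user_tags, rng, prob=0.2):
    """Return a non-empty subset of user_tags."""
    # One tag is always included so the result is never empty.
    forced = rng.choice(user_tags)
    chosen = {forced}
    for tag in user_tags:
        # Every other tag is included independently with probability 1/5.
        if tag != forced and rng.random() < prob:
            chosen.add(tag)
    return chosen
```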
ENTITY: COMMENT
Comment.id
It is a unique sequential number.
Comment.creationDate
A random date between the creationDate of the last comment on the base post and
that creationDate plus one day.
Comment.browserUsed
The creator's browser.
Comment.locationIP
Usually it is an IP address correlated with the creator's country, but there is a
certain chance of a random IP address instead; the chance in the summer season
differs from the chance in the normal season.
Comment.content
The content of a comment is a fragment of the abstracts of the tags available in the
“tagText.txt” dictionary (they were obtained from Wikipedia).
Comment-hasCreator-Person
A random friend or member of the group is selected as the creator of the comment.
Comment-isLocatedIn-Place(Country)
The country correlated to the IP.
Comment-replyOf-Post
There is a chance of 1/(Num created comments + 1) to be a direct reply to the
original post.
Comment-replyOf-Comment
There is a chance of (Num created comments)/(Num created comments + 1) to be a
reply to one of the previously created comments instead of the original post.
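The two complementary probabilities for the reply target can be sketched like this. The function name is hypothetical, and picking the parent comment uniformly among the existing comments is an assumption of the sketch.

```python
import random

def pick_reply_target(post_id, comment_ids, rng):
    """comment_ids holds the ids of the comments already created for the post."""
    n = len(comment_ids)
    # With probability 1/(n + 1) the new comment replies to the post itself.
    if rng.random() < 1.0 / (n + 1):
        return ("post", post_id)
    # Otherwise, with probability n/(n + 1), it replies to an earlier comment.
    return ("comment", rng.choice(comment_ids))
```

The first comment always replies to the post, because with no earlier comments the probability 1/(0 + 1) is 1.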
ENTITY: ORGANISATION
Organisation.id
It is a unique sequential number.
Organisation.type
There are two types of organization: university and company.
Organisation.name
The names of universities and companies are obtained from the dictionaries
institutesCityByCountry.txt and companiesByCountry.txt.
ENTITY: TAG
Tag.id
It is a unique sequential number.
Tag.name
The tags are obtained from the dicTopic.txt dictionary.
Tag-hasType-TagClass
It classifies the tag in a TagClass by using the “dicTopic.txt” dictionary.
ENTITY: TAGCLASS
TagClass.id
It is a unique sequential number.
TagClass.name
The values are obtained from the “tagClasses.txt” dictionary.
TagClass-isSubClassOf-TagClass
This relationship defines a taxonomy of TagClasses. The relationships are obtained
from the “tagHierarchy.txt” dictionary.
ENTITY: PLACE
Place.id
It is a unique sequential number.
Place.type
There are three types of places: city, country and continent.
Place.name
The names of continents and countries are obtained from the “dicLocation.txt”
dictionary. The names of cities are obtained from the “popularPlacesByCountry”
dictionary.
Place-isPartOf-Place
The relationship isPartOf between two places is obtained from the “dicLocation.txt”
and “popularPlacesByCountry” dictionaries.
It defines a hierarchy of places, where “City isPartOf Country” and “Country isPartOf
Continent”.
DICTIONARIES
browsersDic.txt
Contains the name of the browser and its probability. Used to assign browsers to the
persons based on the popularity probability.
Used by BrowserDictionary.java
Sample:
Chrome 0.279
Internet Explorer 0.232
Firefox 0.422
...
companiesByCountry.txt
Contains the country and the name of a company in that country. Used to assign each
person a workplace in their home country, if available.
Used by CompanyDictionary.
Sample:
Afghanistan KamAir
Afghanistan BalkhAirlines
Afghanistan KhyberAfghanAirlines
countryAbbrMapping.txt
Contains the abbreviation and the name of the country it refers to. It is used
to link countries to IP addresses (see: ipzones).
Used by IPAddressDictionary.java.
Sample:
ac United Kingdom academic institutions
ad Andorra
ae United Arab Emirates
dicCelebritiesByCountry.txt
Contains the countryId, the celebrityId and its cumulated probability of popularity
within that country. Used to assign a celebrity of the same country of the person if
available.
Used by TagDictionary.java
Sample:
0 0 0.27328605200945627
0 1 0.4884160756501182
0 2 0.649645390070922
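A dictionary with cumulated probabilities is typically sampled by drawing a uniform number and finding the first row whose cumulated value exceeds it. The sketch below shows that technique with a binary search; it is not taken from TagDictionary.java. The first three rows come from the sample above, the fourth row (celebrity 3 with cumulated probability 1.0) is made up to close the distribution, and the function name is hypothetical.

```python
import bisect

# Celebrity ids and cumulated probabilities for country 0.
# The last entry is a made-up row that closes the distribution at 1.0.
CELEBRITY_IDS = [0, 1, 2, 3]
CUMULATED = [0.27328605200945627, 0.4884160756501182, 0.649645390070922, 1.0]

def pick_by_cumulative(ids, cumulated, r):
    """Return the first id whose cumulated probability exceeds r, for r in [0, 1)."""
    index = bisect.bisect_right(cumulated, r)
    # Clamp in case the cumulated values stop short of 1.0.
    return ids[min(index, len(ids) - 1)]
```

A draw of r = 0.3 falls between the first two cumulated values and therefore selects celebrity 1.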
dicLocation.txt
Contains the continent name, the country name, latitude, longitude, population and
cumulative probability of population. Used to create the region-country hierarchy
and to distribute the user nationality according to the population data.
Used by LocationDictionary.java
Sample:
Asia Afghanistan 35 69 15500000 0.0028010447
Africa Algeria 37 3 29100867 0.008059937
Africa Angola -9 13 5646177 0.0090802721
dicTopic.txt
Contains the tagId, the tagClassId, the tag name and the tag foaf:name. Used in the
serialization part of the software to assign names to the tags and write the tag basic
class.
Used by TagDictionary.java
Sample:
0 349
1 211
2 98
email.txt
Contains the email domain name and its probability for the most popular ones and
only the name for the rest. Used to assign email domains to the user.
Used by EmailDictionary.java.
Sample:
gmail.com 0.45
gmx.com 0.20
yahoo.com 0.18
givennameByCountryBirthPlace.txt.freq.full
Contains the CountryName, firstName, gender, birthdate period and an unused
number. Used to assign a first name to the user according to the gender and
age.
Used by NamesDictionary.java
Sample:
Abkhazia Diana 0 0 1
Abkhazia Maya 0 0 1
Abkhazia Diana Gurtskaya 1 0 1
Abkhazia Diana 0 1 1
institutesCityByCountry.txt
Contains the country name, the university name and the city of that university.
Used to create the country->city hierarchy and to assign to the user a
university from the same country. Used by OrganizationsDictionary.java (all data)
and LocationDictionary.java (the country->city data)
Sample:
Åland_Islands Åland University of Applied Sciences Mariehamn
Abkhazia Abkhazian State University Sukhumi
Afghanistan Paktia University Gardēz
Afghanistan Baghlan University Puli_Khumri
languagesByCountry.txt
Contains the name of the country and a list of language data: the ISO 639-1 code, * if
it is an official language and the speaker percentage (0 if unknown). Used to assign
the user languages spoken in their country.
Used by LanguageDictionary.java
Sample:
Aruba es 12.6 en 7.7 nl * 5.8
Antigua and Barbuda en * 0
United_Arab_Emirates ar * 0 fa 0 en 0 hi 0 ur 0
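The language list of a row can be parsed as sketched below. To avoid guessing the column delimiter of the real file, the hypothetical function parse_languages takes only the tokens that follow the country name. It is an illustration of the format described above, not the parser in LanguageDictionary.java.

```python
def parse_languages(fields):
    """Parse tokens such as ['es', '12.6', 'nl', '*', '5.8'].

    Returns a list of (iso_code, is_official, speaker_percentage) tuples.
    """
    result = []
    i = 0
    while i < len(fields):
        code = fields[i]
        i += 1
        # An optional '*' marks an official language.
        official = i < len(fields) and fields[i] == "*"
        if official:
            i += 1
        percentage = float(fields[i])  # 0 means the percentage is unknown
        i += 1
        result.append((code, official, percentage))
    return result
```

For the Aruba row, the tokens "es 12.6 en 7.7 nl * 5.8" yield Spanish and English as non-official languages and Dutch as the official one.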
popularPlacesByCountry.txt
Contains the country name, the location name, the location name with spaces,
latitude and longitude. Used by PopularPlacesDictionary.java.
Sample:
Afghanistan Ab-Kol Ab-Kol 36.22000122070312 68.5
Afghanistan Ab_Bazan Ab Bazan 36.93333435058594 69.94999694824219
Afghanistan Ab_Daw Ab Daw 36.25 71.16666412353516
Afghanistan Ab_Gaj Ab Gaj 36.98333358764648 72.69999694824219
smartPhonesProviders.txt (unused)
Contains the name of smartphone providers. Used by UserAgentDictionary.java
Sample:
IPhone
IPad
HTC
Samsung
LG
surnameByCountryBirthPlace.txt.freq.sort
Contains the number of appearances of the last name, the country name and the last
name. Used to assign a surname to the user.
Used by NamesDictionary.java
Sample:
2,Abkhazia,Gurtskaya
1,Abkhazia,Kopitseva
1,Adjara,Vashalomidze
7,Afghanistan,Zaland
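The comma-separated rows above can be loaded and sampled as sketched below. Weighting the choice by the number of appearances is an assumption of this sketch; the document only states that the surname is selected randomly with a correlation to the country. Both function names are hypothetical.

```python
import csv
import random
from collections import defaultdict

SAMPLE_ROWS = [
    "2,Abkhazia,Gurtskaya",
    "1,Abkhazia,Kopitseva",
    "1,Adjara,Vashalomidze",
    "7,Afghanistan,Zaland",
]

def load_surnames(lines):
    """Group (surname, count) pairs by country."""
    by_country = defaultdict(list)
    for count, country, surname in csv.reader(lines):
        by_country[country].append((surname, int(count)))
    return by_country

def pick_surname(by_country, country, rng):
    """Pick a surname of the country, weighted by its number of appearances."""
    surnames, counts = zip(*by_country[country])
    return rng.choices(surnames, weights=counts, k=1)[0]
```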
tagClasses.txt
Contains the tagClassId, the name and the RDF label. Used to serialize the name and
label of the tagClasses.
Used by TagDictionary.java
Sample:
0 Thing thing
1 BasketballLeague basketball league
2 LunarCrater lunar crater
3 MilitaryPerson military person
tagHierarchy.txt
Contains the base tagClassId and the parent tagClassId. Used to build the tag
hierarchy in the serialize process.
Used by TagDictionary.java
Sample:
19 179
136 338
173 211
230 149
tagText.txt
Contains the tagId and a text. Used to assign a text to the post and comments related
to its tags. Used by TagDictionary.java
Sample:
0 Hamid Karzai, GCMG (Pashto: حامد کرزی, Hāmid Karzay; born 24 December
1957) is the 12th and ...
1 Jalāl ad-Dīn Muḥammad Balkhī, also known as Jalāl ad-Dīn Muḥammad Rūmī and ..
2 Mahmud of Ghazni, actually Yamīn ad-Dawlah Abdul-Qāṣim ...
topicMatrixId.txt
Contains the id of topic 1, the id of topic 2, the cumulative % for topic 1 and the
number of times topic 1 and topic 2 appear in the same text. Used to select a list of
tags correlated with the main interest of the user.
Used by TagDictionary.java
Sample:
2909 4870 0.0 8.0
2909 4871 2.392072671167751E-4 8.0
2909 4872 4.784145342335503E-4 2.0
topuniversities.txt (unused)
Contains the name of the university, the country and a cumulative percentage.
Sample:
University of Cambridge United_Kingdom 100
Harvard University United_States 99.18
Yale University United_States 98.68
ipzones
The folder resources/ipaddrByCountries contains a list of files named XX.zone,
where XX is a valid country abbreviation from the countryAbbrMapping.txt
dictionary. Each file contains a list of IP address ranges (in CIDR notation)
assigned to the country.
Sample of ad.zone (Andorra):
85.94.160.0/19
91.187.64.0/19
109.111.96.0/19
194.158.64.0/19
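A zone file of this form can be turned into a random address for the country with Python's standard ipaddress module, as sketched below. The function name is hypothetical, and choosing the range uniformly (rather than weighted by its size) is an assumption; the document does not describe how the generator picks an address within a zone.

```python
import ipaddress
import random

# Contents of ad.zone (Andorra) from the sample above.
AD_ZONE = [
    "85.94.160.0/19",
    "91.187.64.0/19",
    "109.111.96.0/19",
    "194.158.64.0/19",
]

def random_ip_from_zone(cidr_lines, rng):
    """Pick a random address from one of the CIDR ranges of a zone file."""
    networks = [ipaddress.ip_network(line.strip())
                for line in cidr_lines if line.strip()]
    network = rng.choice(networks)
    # Index into the range to get a concrete address.
    return network[rng.randrange(network.num_addresses)]
```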
REFERENCES
[1] SSJ: Stochastic Simulation in Java. University of Montreal
http://www.iro.umontreal.ca/~simardr/ssj/indexe.html
[2] S3G2: A Scalable Structure-Correlated Social Graph Generator. Minh-Duc Pham,
Peter A. Boncz and Orri Erling. TPCTC. 2010