Resources Team Write up

RESOURCES TEAM WRITE UP
THE RESOURCES TEAM MEMBERS
Aristea M. Zafeiropoulou, Hao Chen (Jerry), Jinchuan Wang (Leo), Laura German, Lisa Sugiura
and Long Cheng (Jackie/Lenson).
INTRODUCTION
The aim of this collaborative project between the Tsinghua students and the visiting University of
Southampton students is to extract, analyse and visualise online data created by young people in
China and the UK. The data must focus on young people’s perceptions of other countries. The
following countries have been selected: for UK views, China, USA, France, Germany, Japan,
Australia, Canada and Singapore; and for Chinese views, the UK, USA, France, Germany, Japan,
Australia, Canada and Singapore.
The Chinese data will be extracted from the www.tianya.com and www.baidu.com bulletin boards,
and the UK data from www.thestudentroom.co.uk and www.twitter.com. All of the data collected
will come from users located in either the UK or China, and will have been created in the past week.
The role of the resources team is to scope the project by deciding:
i. What resources should be used, and how many? How can we ensure that comparable data is
obtained from both Chinese and UK websites?
ii. What data extraction tools should be used? Websites are structured differently and may
require different tools. Are open software tools available online, or does the team need to
create its own tool?
iii. What data format should be used to save the extracted data? The format needs to be as
comprehensive, usable and consistent as possible for Group 2 (Analysis Team).
iv. What should the size of the dataset and the timeliness of the data be? Should data be taken
from the past week, month or year?
v. What is meant by ‘young people’?
vi. What countries should be used in the project? Covering all of the countries in the world may
be too difficult in the project time allocated.
SELECTED RESOURCES
CHINESE RESOURCES:
www.tianya.com – Used primarily by young people from all over China.
www.baidu.com – Used by young people; its bulletin boards can be searched by exact search term.
Figure 1: Screenshot of Tianya showing the categories of discussion located on the left side of
the screen.
Justifications:
In China, bulletin boards are very popular amongst young people wishing to express their
perceptions about many topics. The bulletin boards are useful because they are primarily used by
young people and data can be selected from specific categories, such as the UK. Also, as the
bulletin boards are only used in China, the location of users is very easy to establish and there is
no difficulty in extracting location-specific data. This is a problem for UK data extraction, where
the most popular websites are used by a global community. Age data can also be extracted from
each post, although it must be kept in mind that there is no way to verify whether all the users
have given their true ages.
Issues:
Extracting data from the most popular social networking sites in China, Weibo (comparable to
Twitter) and RenRen (comparable to Facebook), was discounted on the grounds that the data could
not be accessed within the project time constraints. For example, the Weibo developers provide
an API, but it has access restrictions (only 150 accesses are allowed per hour). Also, permission
to access this data is required from the website owners. Instead, the two Chinese bulletin boards
were used, as they are still very popular amongst young people and data can be easily extracted
free from restriction.
Limitations:
It is unclear whether a user has many different accounts on a bulletin board, for example if they
are trolling.
UK RESOURCES:
www.thestudentroom.co.uk
www.twitter.com
Justifications:
In the UK, general bulletin boards are not as popular as in China, and they are not specifically
used by young people. ‘The Student Room’ is the main bulletin board used by students to discuss
general topics in the UK; therefore, it was selected as the online forum most comparable to the
Chinese bulletin boards. Other bulletin boards found to be used by students had a very small
following (up to 150 users) and were therefore discounted. They cannot be compared to the huge
user base of the Chinese bulletin boards and would not give a wide enough reflection of UK
perceptions of other countries.
Will Fyson (WF) and Chris Phethean (CP) at the University of Southampton must be
acknowledged for their huge input in extracting data from Twitter. Twitter was used as the second
resource, as it was decided that there is no other comparable bulletin board available for data
extraction. Twitter is very popular in the UK for expressing general opinion, so it was chosen as
the second comparative forum to the Chinese bulletin boards.
Issues:
The resources team is assuming that the majority of ‘The Student Room’ users are ‘young
people’, as this is a forum used by students. However, not all students are young; mature
students are one example. The website gives no indication of the age of individual users, so this
limitation of the research must be kept in mind: the assumption is reasonable but cannot be
verified. Bearing this in mind, ‘The Student Room’ is used because it is the only popular forum
whose users can reasonably be expected to fall under the category of ‘young people’.
Establishing ‘young people’ as users of Twitter is also problematic, as not all users make their age
known. Even if they do, it is still not certain whether this information can be trusted. Therefore,
WF and CP used two ways to extract data from users who would reasonably be assumed to be
‘young people’.
WF collected Tweets posted within a 100 mile radius of a few UK cities (London, Manchester and
Edinburgh, chosen to cover a large section of the UK) and then filtered those Tweets by their
authors' self-descriptions, keeping those containing key words such as ‘young’, ‘student’,
‘undergraduate’ etc. The problem with this approach is that large numbers of young people are
likely to be missed: Tweets are first gathered by location and only then filtered, and only about
an hour's worth of Tweets can be collected (due to the Twitter API's usage restrictions).
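The keyword filter that WF applied to author descriptions can be illustrated with a minimal sketch. This is not WF's actual code. It assumes each Tweet has already been collected into a plain dictionary with hypothetical `text` and `description` fields, and the keyword list contains only the three words named above.

```python
import re

# Keywords named in the write-up; the original list ends with "etc.",
# so any further keywords would be an assumption.
YOUTH_KEYWORDS = ("young", "student", "undergraduate")

# Whole-word, case-insensitive match that also accepts a plural "s".
_KEYWORD_RE = re.compile(
    r"\b(?:" + "|".join(re.escape(k) for k in YOUTH_KEYWORDS) + r")s?\b",
    re.IGNORECASE,
)


def looks_like_young_author(description):
    """Return True if the self-description contains one of the keywords."""
    return bool(description) and _KEYWORD_RE.search(description) is not None


def filter_tweets_by_author(tweets):
    """Keep only Tweets whose author description matches a keyword.

    `tweets` is a list of dicts with hypothetical keys "text" and
    "description"; this is an assumed layout, not the Twitter API's own.
    """
    return [t for t in tweets if looks_like_young_author(t.get("description"))]
```

Authors with an empty or missing description are dropped, which is one reason this approach can miss large numbers of young people.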
CP took a different approach and found all of the followers of the Twitter accounts of the
Liberal Youth, Young Labour UK and Cons Future, which together have 10,062 followers. The
assumption is that these accounts will mostly be followed by young people. The followers' Tweets
were examined and those about the selected countries were extracted. This of course has the
slight bias of only getting opinions from politically active young people, but it does look
farther back in time and produces some more relevant Tweets.
INFORMATION EXTRACTION
‘Web of documents’ – limitations of data extraction
CHINESE TOOL
WGET – GNU software, freely available on the Web, that automatically downloads HTML pages from
static websites. The Chinese data is saved as a .txt file.
Justifications:
Chinese bulletin boards (www.tianya.com and www.baidu.com) are static websites. They both
have categories of discussion, such as ‘UK’ or ‘Australia’, where users talk about their perceptions
of a particular country in one place. Therefore, WGET is an appropriate tool to use for data
extraction.
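As an illustration only, a recursive WGET download of one discussion category could be launched from Python as sketched below. The board URL and output folder are hypothetical placeholders, and the depth and wait values are assumptions rather than the settings the team used. The options shown are standard GNU Wget options.

```python
import shlex


def build_wget_command(board_url, out_dir, depth=2, wait_seconds=1):
    """Build a wget command line for mirroring one discussion category.

    All argument values are illustrative assumptions.
    """
    return [
        "wget",
        "--recursive",                    # follow links from the start page
        f"--level={depth}",               # limit how deep the recursion goes
        "--no-parent",                    # stay inside the category
        f"--wait={wait_seconds}",         # pause between requests
        f"--directory-prefix={out_dir}",  # where downloaded pages are saved
        board_url,
    ]


# Hypothetical category URL, used only to show the resulting command.
cmd = build_wget_command("http://example.com/board/uk/", "data/uk")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where wget is installed.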
UK TOOLS
Manual extraction – The resources team will extract data from www.thestudentroom.co.uk
manually. The UK data is saved in Notepad as a UTF-8 text file. The manual extraction process:
i. Use the general search and type in a specific country, for example China;
ii. All of the threads mentioning the search term ‘China’ will appear;
iii. Click on each thread to be taken to the post that mentions the key search term;
iv. Check that the poster has a location based in the UK. If the location is unknown, ignore that
post;
v. If there is a UK location, copy the post to Notepad and put four dashes (----) after the post
to separate it from the others that will be copied;
vi. Be careful of duplicate posts;
vii. Repeat this process for all of the posts written in the last week for the key term that has
been searched, i.e. China;
viii. Save in Notepad in UTF-8 format;
ix. Repeat this entire process for all of the countries selected;
x. Ensure that each file contains UK perceptions about one particular country, for example, ‘how
the UK views China’, ‘how the UK views USA’ …
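A file produced by the steps above could be read back into individual posts with a short script such as the following sketch. It assumes the four-dash separator sits on a line of its own, and it treats two posts as duplicates when they differ only in letter case or spacing. Both are assumptions about how the files were prepared.

```python
from pathlib import Path

SEPARATOR = "----"  # the four dashes placed after each copied post


def split_posts(raw_text):
    """Split the text of one file into a list of posts."""
    posts, current = [], []
    for line in raw_text.splitlines():
        if line.strip() == SEPARATOR:
            post = "\n".join(current).strip()
            if post:
                posts.append(post)
            current = []
        else:
            current.append(line)
    # Keep a final post even if the last separator was forgotten.
    tail = "\n".join(current).strip()
    if tail:
        posts.append(tail)
    return posts


def drop_duplicates(posts):
    """Remove repeated posts, ignoring case and whitespace differences."""
    seen, unique = set(), []
    for post in posts:
        key = " ".join(post.split()).lower()
        if key not in seen:
            seen.add(key)
            unique.append(post)
    return unique


def load_posts(path):
    """Read one UTF-8 file (with or without a byte order mark)."""
    text = Path(path).read_text(encoding="utf-8-sig")
    return drop_duplicates(split_posts(text))
```

The `utf-8-sig` codec is used because Notepad may write a byte order mark at the start of a UTF-8 file.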
Outsourcing extraction to fellow Web Science colleagues at the University of Southampton – They
used the Twitter API to extract data.
Justifications:
Unfortunately, WGET does not work with www.thestudentroom.co.uk, for a number of reasons.
First, there are no categories in which all of the discussion is about a particular country, so
a key word such as ‘China’ has to be used in a search; the data cannot be extracted from one
place, unlike on the Chinese bulletin boards. Second, it is a dynamic website, so not all of the
data is saved on the server as static pages. The resources team could not find an online tool
that can automatically extract this data, and the time constraints of this project did not allow
a complete program to be written from scratch. Therefore, the resources team extracted the data
manually.
Started to Create a Tool:
The data from ‘The Student Room’ was collected using multiple Python 2.7 scripts. The scripts
took posts from the past week and applied two filters. The first filter was a search for the
country, such as ‘China’; the second was applied within each thread and identified the user who
created each post, to avoid duplication. This tool needs future work to add the location check
and to assess the relevance of each post to a particular country.
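The two filters can be sketched as follows. This is not the team's script: it is written for Python 3, it assumes the posts have already been downloaded into dictionaries with hypothetical `thread_id`, `author`, `posted` and `text` fields, and it interprets "avoiding duplication" as keeping one post per user per thread.

```python
from datetime import datetime, timedelta


def filter_posts(posts, term, now, days=7):
    """Apply the date window and the two filters to a list of posts."""
    cutoff = now - timedelta(days=days)
    needle = term.lower()
    seen = set()
    kept = []
    for post in posts:
        if post["posted"] < cutoff:
            continue  # older than the collection window
        if needle not in post["text"].lower():
            continue  # filter 1: the country is not mentioned
        key = (post["thread_id"], post["author"])
        if key in seen:
            continue  # filter 2: this user was already seen in the thread
        seen.add(key)
        kept.append(post)
    return kept


# Invented sample data, used only to show the call.
sample = [
    {"thread_id": 1, "author": "a", "posted": datetime(2020, 1, 7), "text": "China is vast"},
    {"thread_id": 1, "author": "a", "posted": datetime(2020, 1, 7), "text": "More on China"},
]
print(len(filter_posts(sample, "China", now=datetime(2020, 1, 8))))
```

Plain substring matching also accepts words such as "Chinatown", so a real tool would need a stricter match.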
RESEARCH LIMITATIONS
‘YOUNG PEOPLE’
The research aims to create an illustrative case study about the views of young people from the
UK and China, and their views on other countries from around the world. However, it is very
difficult to precisely define what is meant by ‘young people’; could this be teenagers or people in
their early twenties? Therefore, the resources team have decided that ‘young people’ will be
defined widely, encompassing 18-35 year olds.
ESTABLISHING USERS’ AGES
One of the greatest issues faced by the resources team is how a user’s age is to be
determined. There is little age-specific data attached to the users’ posts, and in the UK it is
quite difficult to find a forum that is primarily used by young people to discuss perceptions of
other countries. Instead, the resources, apart from Twitter, have been selected on the assumption
that the majority of users will fall under the category of ‘young people’ (18-35 year olds).
NATIONALITY
The Chinese bulletin boards are national and it can be immediately inferred that the majority of
the data is being created by Chinese nationals; however, this is much more difficult to establish in
the UK.
Some of the UK data on ‘The Student Room’ is created by individuals located in the UK who may
not necessarily be UK nationals. Even though the site has a UK domain name, it also seems popular
with students outside the UK. Also, a substantial portion of the users offer no location
information, so their posts have to be discounted from the research project. Therefore, the
project is limited in two ways: data collected in the UK cannot be confirmed to have been
produced by UK nationals, and posts with no location given cannot be confirmed to come from the
UK at all.
POPULARITY OF THE WEBSITE
Weibo, RenRen and Facebook cannot be used for data extraction. Therefore, the highly popular
Chinese bulletin boards and Twitter were selected as being comparable. Although
www.thestudentroom.co.uk has a substantial following, it is much smaller than that of the other
resources. Therefore, fewer people are discussing perceptions there, and this should be taken into
account. China also has a much greater population than the UK, so there may be more people
discussing different countries and thus more data available to extract.
ISOLATING SPECIFIC GROUPS OF YOUNG PEOPLE
In China, www.tianya.com and www.baidu.com are used by all groups of young people. However,
www.thestudentroom.co.uk is specifically targeted at one group of young people: students. It
must be taken into account that this limits the range of views about different countries. Perhaps
this group is more interested in talking about gap year travel and specific topics from a
student point of view. It would perhaps be useful to expand this research to look at other groups
of people such as:
- Chinese ethnic origin;
MEASURING POST RELEVANCE
Posts may not be country specific. For example, an individual located in the UK may create a
post that contains a comparison between two countries: “China has a greater population than
Spain.” This highlights the importance of both quantitative and qualitative methods in this
project. Quantitative methods are very important in extracting posts that mention specific
countries; however, qualitative methods will sometimes be vital to establish the context of a
specific post. One way to prioritise this qualitative work is to count the frequency of the search
term, e.g. ‘China’, per thread. The higher the frequency of the search term in a thread, the more
likely that thread is to be relevant to perceptions of a particular country. Further research is
required to examine how to flag up posts that need further qualitative input.
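Counting the search term per thread could look like the sketch below. It assumes posts are held as dictionaries with hypothetical `thread_id` and `text` fields, and it matches the term as a whole word regardless of case.

```python
import re
from collections import Counter


def term_frequency_by_thread(posts, term):
    """Count whole-word, case-insensitive occurrences of `term` per thread."""
    pattern = re.compile(r"\b" + re.escape(term) + r"\b", re.IGNORECASE)
    counts = Counter()
    for post in posts:
        counts[post["thread_id"]] += len(pattern.findall(post["text"]))
    return counts


def rank_threads(posts, term):
    """Return (thread_id, count) pairs, most frequent first."""
    return term_frequency_by_thread(posts, term).most_common()
```

Threads near the top of the ranking are candidates for inclusion, while threads with a single passing mention could be flagged for qualitative review.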
DIFFERENT TERMS
Countries can be referred to by many different key search terms, such as ‘America’, ‘USA’,
‘Americans’ and ‘the United States’. Locations may not even be country specific, but relate to
cities and areas within that country, for example within the UK: Brum (Birmingham), Bristol,
Glasgow, Belfast, Cardiff, London and the West Country. This was a problem encountered on ‘The
Student Room’. How this location data can be extracted automatically is an area for future
research.
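One possible starting point for that future work is a hand-built alias table, sketched below. The table contains only the example terms listed above, so it is far from complete, and the function and variable names are illustrative.

```python
import re

# Alias table built only from the examples given in the text.
COUNTRY_ALIASES = {
    "USA": ["America", "USA", "Americans", "the United States"],
    "UK": ["UK", "Brum", "Birmingham", "Bristol", "Glasgow", "Belfast",
           "Cardiff", "London", "the West Country"],
}


def _compile(aliases):
    # Longest aliases first, so "Americans" is tried before "America".
    ordered = sorted(aliases, key=len, reverse=True)
    body = "|".join(re.escape(alias) for alias in ordered)
    return re.compile(r"\b(?:" + body + r")\b", re.IGNORECASE)


PATTERNS = {country: _compile(a) for country, a in COUNTRY_ALIASES.items()}


def countries_mentioned(text):
    """Return the set of countries whose aliases appear in `text`."""
    return {country for country, p in PATTERNS.items() if p.search(text)}
```

A table like this still needs manual upkeep, and ambiguous place names (for example, a town called Bristol outside the UK) would still require qualitative checking.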
OPINIONS MAY CHANGE OVER TIME
The program created to extract data should be run every week to extract new posts. If this
became a long-term project, data could be collected about how perceptions of other countries
develop over time. Retrospective perceptions could also be obtained.
LACK OF ONLINE RESOURCES
There is a lack of freely available online Web scrapers and there was not enough time to create a
complete program. Future work will be based on the Python tool created.
CHINESE BAR CHART
Baidu and Tianya Data (bar chart; vertical axis: total number, 0 to 70,000)
Figure 2: The total number of Chinese posts collected during this project.
UK BAR CHART
'The Student Room' (bar chart; vertical axis: 0 to 900)
Figure 3: The total number of posts collected from 'The Student Room' during this project.
FUTURE WORK
Chinese Data
1. Obtain data from websites which have larger numbers of users, for example http://weibo.com/
and http://renren.com/. These are similar to Twitter and Facebook. The team could use an
open API to get the data, but one IP address can only get limited information, for example 150
items per hour. This may be a problem, but it could be resolved by using many different
computers together.
2. Find some automatic tools to replace the manual collection of data to free up time and enable
faster results.
3. Previously the team assumed that all the views were from young people, but did not have the
evidence to prove this. The team has to do some work to know the age of the individuals
contributing the data.
4. Previously the data was given to the analysis team as a text file; however, this may not be the
most effective method, as the data was not separated properly. The resources team could look
at using a database instead, managed by the analysis team. This may make the data easier to
analyse later on. The database could be populated every day by servers that collect the data
automatically and keep running, so that new data is continually added. If the database
changes, the map produced by the visualization team can also be updated easily. The resources
team will not store the data itself, as it will pass the data to the analysis team to store in
their database; the resources team could perhaps keep a copy for backup purposes.
5. Another solution is to collect and store the data in RDF format; this would make it easier to
collaborate with Group 2 (Analysis Team) and would improve general compatibility. This seems
to be a better solution than using domain-specific databases.
6. Natural language processing (NLP) – need more country-specific, context-specific NLP.
UK Data
1. Continue to collect data from the most popular social networking websites such as Twitter and
Facebook.
2. Successfully implement tools to enable the electronic collection of data to avoid manual
collection. Decide on the best method (follow examples set by Southampton students during
their collection of Twitter data).
3. Try to use methods that will ensure the data is obtained from young people.
4. Previously the data was given to the analysis team as a text file; however, this may not be the
most effective method, as the data was not separated properly. The resources team could look
at using a database instead, managed by the analysis team. This may make the data easier to
analyse later on. The database could be populated every day by servers that collect the data
automatically and keep running, so that new data is continually added. If the database
changes, the map produced by the visualization team can also be updated easily. The resources
team will not store the data itself, as it will pass the data to the analysis team to store in
their database; the resources team could perhaps keep a copy for backup purposes.
5. Another solution is to collect and store the data in RDF format; this would make it easier to
collaborate with Group 2 (Analysis Team) and would improve general compatibility. This seems
to be a better solution than using domain-specific databases.
6. Natural language processing (NLP) – need more country-specific, context-specific NLP.
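The database option in point 4 of both lists could be prototyped with the SQLite library that ships with Python, as in this sketch. The table layout, column names and function names are all assumptions made for illustration; the project did not define a schema.

```python
import sqlite3

# Hypothetical schema: one row per collected post.
SCHEMA = """
CREATE TABLE IF NOT EXISTS posts (
    id      INTEGER PRIMARY KEY,
    source  TEXT NOT NULL,   -- e.g. 'thestudentroom', 'tianya'
    viewer  TEXT NOT NULL,   -- whose view it is: 'UK' or 'China'
    target  TEXT NOT NULL,   -- the country being discussed
    author  TEXT NOT NULL,
    posted  TEXT,            -- ISO 8601 date string
    body    TEXT NOT NULL,
    UNIQUE (source, author, body)
)
"""


def open_store(path=":memory:"):
    """Open (or create) the database and make sure the table exists."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_post(conn, source, viewer, target, author, posted, body):
    """Insert one post; return False if it was already stored."""
    with conn:  # commits on success, rolls back on error
        cur = conn.execute(
            "INSERT OR IGNORE INTO posts "
            "(source, viewer, target, author, posted, body) "
            "VALUES (?, ?, ?, ?, ?, ?)",
            (source, viewer, target, author, posted, body),
        )
    return cur.rowcount == 1


def count_by_target(conn, viewer):
    """Number of stored posts per discussed country for one viewer group."""
    rows = conn.execute(
        "SELECT target, COUNT(*) FROM posts WHERE viewer = ? "
        "GROUP BY target ORDER BY target",
        (viewer,),
    )
    return dict(rows.fetchall())
```

Because of the uniqueness constraint, a collection script that runs every day can re-submit posts it has already seen without creating duplicates.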