the LAK Dataset - DSpace Open Universiteit

advertisement
Fostering Analytics on Learning Analytics Research:
the LAK Dataset
Davide Taibi
Stefan Dietze
Institute for Educational Technologies
National Research Council of Italy
Via Ugo La Malfa 153, Palermo
Italy
L3S Research Center
Appelstr. 9a
30167 Hannover
Germany
davide.taibi@itd.cnr.it
dietze@l3s.de
ABSTRACT
This paper describes the Learning Analytics and Knowledge
(LAK) Dataset, an unprecedented collection of structured data
created from a set of key research publications in the emerging
field of learning analytics. The unstructured publications have
been processed and exposed in a variety of formats, most notably
according to Linked Data principles, in order to provide
simplified access for researchers and practitioners. The aim of this
dataset is to provide the opportunity to conduct investigations, for
instance, about the evolution of the research field over time,
correlations with other disciplines or to provide compelling
applications which take advantage of the dataset in an innovative
manner. In this paper, we describe the dataset, the design choices
and rationale and provide an outlook on future investigations.
Categories and Subject Descriptors
H.3.4 [Semantic Web], I.2.4 [Ontologies]
General Terms
Algorithms,
Documentation,
Standardization.
Design,
Experimentation,
Linked
The LAK dataset provides access to documents already available
online in unstructured form but also to research works not
publicly accessible before at all. The collection includes the
proceedings of the International Educational Data Mining
Society4, as well as the Journal of Educational Technology and
Society a special issue on Learning Analytics5. In both cases full
text has been freely available already but in unstructured format.
Content which has previously been accessible to subscribers only
includes the “Proceedings of the ACM International Conference
on Learning Analytics and Knowledge” edited by ACM. In the
framework of our initiative, ACM is providing freely its ACM
Digital Library in the Learning Analytics field solely for research
purposes.
The following table describes in details the set of papers that have
been collected, processed and exposed in a structured form in the
Learning Analytics and Knowledge (LAK) dataset:
Keywords
Learning Analytics, Data,
Educational Data Mining
allow the computational analyses of research publications in the
emerging Learning Analytics field, we have published in a
machine-readable format a comprehensive set of scientific papers
in the field of learning analytics and educational data mining. This
can serve as input for studies into scientometrics, investigations
into the evolution of the overall discipline or correlations with
other fields.
Data,
Semantic
Table 1 : Papers included in the LAK dataset
Web,
Publication
1. INTRODUCTION
As part of an international team of research practitioners
consisting of the Society for Learning Analytics Research
(SoLAR)1, ACM2, the LinkedUp project3, the Educational
Technology Institute of the National Research Council of Italy
(CNR-ITD), we have released an unprecedented resource for the
Learning Analytics and Educational Data Mining. In order to
Permission to make digital or hard copies of all or part of this work for
personal or classroom use is granted without fee provided that copies are
not made or distributed for profit or commercial advantage and that
copies bear this notice and the full citation on the first page. To copy
otherwise, or republish, to post on servers or to redistribute to lists,
requires prior specific permission and/or a fee.
1
www.solaresearch.org/
2
http://acm.org/
3
http://linkedup-project.eu/
# of papers
Proceedings of the ACM International
Conference on Learning Analytics and
Knowledge (LAK) (2011-12)
66
The open access journal Educational
Technology & Society special issue on
“Learning
and
Knowledge
Analytics”:
Educational Technology & Society (Special
Issue on Learning & Knowledge Analytics,
edited by George Siemens & Dragan Gašević),
2012, 15, (3), pp. 1-163.
10
Proceedings of the International Conference on
Educational Data Mining (2008-12)
239
Journal of Educational Data Mining (2008-12)
16
4
http://www.educationaldatamining.org/proceedings
5
http://www.ifets.info/issues.php?id=56
Figure 1: Knowledge extraction process
For each paper the following information are collected: title,
authors, keywords, abstract, full-text and its relationship with the
type of publication or event, (journal or conference proceedings).
2. FROM SCHOLARLY PAPER TO
STRUCTURED DATA
plain
textual
It is important to note that beside the common metadata for the
learning analytics papers such as: title, abstract, authors and
affiliation also the full text of the papers is stored in the dataset.
At this stage the full text has been stored without considering its
separation in paragraphs and sections, however the elaboration
performed at step number 2 has also identified the titles and
paragraphs of sections and subsections, thus providing the basis
for analyzing full text with further granularity in next versions of
the LAK dataset. The referenced papers are also extracted but are
not made available in the LAK dataset in this version of the
dataset .
the
textual
2.2
2.1 The extraction process
In order to process and analyze the set of unstructured journals
and conferences papers data was transformed into structured data.
While each conference proceeding is available on the Web in PDF
format, but each collection has its own structure. Even if in some
cases the most used format is the ACM template6, papers not
always comply with it entirely, calling for some specifically
adapted extraction mechanisms. The overall knowledge
extraction process is composed of three main steps:
1.
Transforming PDF
representation.
documents
2.
Cleaning up
information.
consolidation
3.
Extracting structured data from text.
and
to
of
In the first step the PDF file containing the proceeding of a
conferences, or the papers of a journal is split up in order to have
one document for each paper. Then each PDF file has been
elaborated with pdf2text tool in order to have a textual
representation for each paper.
In the second step the text files are elaborated in order to
transform them in a partially structured format that can be
elaborated automatically. In particular at this step tables and
figures are removed from the paper, maintaining their captions,
that can be useful for text mining processing, footnotes have been
also removed from the text, while bulleted or numbered list have
been organized using an homogeneous format.
As part of the third step, text files are being processed in order to
extract from them the most important sections of the document.
Regarding the authors, their name, affiliation, country are
represented using the FOAF ontology.
The schema
The schema used to describe the papers in the dataset is based on
two established schemas: the Semantic Web Conference (SWC)
ontology7 (already used to describe metadata about publications
from the Semantic Web conferences and related events8) and the
Linked Education schema9. The Linked Education schema has
been developed to represent and catalog both educational and
educational related datasets, which are datasets not specifically
created for education but that can be used in an educational
context. The schema has been used to annotate datasets and
resources as part of an integrated dataset10 which contains
educationally relevant resources such as : LinkedUniversities[1]
and the mEducator Educational Resources [4] with their Open
Educational Resources and materials explicitly related
to
education, as well as implicitly educationally relevant datasets
such as BBC Programmes[3], ACM Library Metadata11 and
Europeana [2] datasets. The main entities collected in the LAK
dataset are paper authors, institutions and papers, related to the
7
http://data.semanticweb.org/ns/swc/ontology
8
http://data.semanticweb.org/
9
http://data.linkededucation.org/ns/linked-education.rdf
10
http://linkedup.l3s.uni-hannover.de:8880/openrdfsesame/repositories/linked-learning-selection?query
6
http://www.acm.org/sigs/publications/proceedings-templates
11
http://acm.rkbexplorer.com/
learning analytics area. Authors and institutions have been
represented using respectively the classes Person and
Organization of the FOAF ontology, while to represent papers, the
class InProceedings of the SWRC ontology has been used. The
LAK Dataset, at the time of writing, includes 779 authors,
connected to 295 institution, and 315 posters, abstract, short and
full papers.
2.3 Access Methods
Conference, Leuven, Belgium (April 2013)18. The challenge is
revolving around the overall question on what insights can be
gained from analytics on the LAK corpus about the general
discipline of Learning Analytics and its connection to other fields.
How can we make sense of this emerging field’s historical roots,
current state, and future trends, based on how its members report
and debate their research? Challenge submissions should exploit
the LAK Dataset by covering one or more of the following, nonexclusive list of topics:
In order to support different access method for the data, the
resources of the LAK datasets have been published in different
formats:

Analysis & assessment of the emerging LAK community
in terms of topics, people, citations or connections with
other fields

A dump file in zipped RDF/XML file format can be directly
downloaded from the SoLAR research web page.

Innovative applications to explore, navigate and visualise
the dataset (and/or its correlation with other datasets)

A version of the dataset in a format that can be elaborated
through the R statistic software have been provided12

Usage of the dataset as part of recommender systems

A Linked Data endpoint with a public SPARQL endpoint
has been developed in order to provide access to structured
RDF metadata according to LOD principles.13
The following SPARQL query14 on the LAK dataset can be used
to extract the full text of all 2011 papers (LAK 2011, and EDM
2011 conferences) in .srx format (XML file which can be opened
in any text editor):
PREFIX led:<http://data.linkededucation.org/ns/linked-education.rdf#>
PREFIX swrc:<http://swrc.ontoware.org/ontology#>
SELECT ?paper ?fulltext WHERE { ?paper led:body ?fulltext . ?paper
swrc:year ?year . FILTER (?year = “2011”) }
On the SoLAR Website some useful examples for querying the
SPARQL endpoint15 of the LAK dataset have been reported16.
The example queries allow users, for instance, to retrieve: the
papers co-authored by two selected authors; all papers published
in both EDM and LAK conferences by the authors affiliated to an
institution.
4. CONCLUSIONS
The LAK Dataset has been a first starting point to allow the
analysis of leaning analytics as an emerging research discipline
and its definition and evolution. Along with the growth of the
research works in learning analytics and related fields, we intend
to expand the dataset by adding new research publications. In
addition, while the dataset currently contains plain metadata and
full text of the research publications, it is envisaged to extract and
add additional data about contained entities and topics, to provide
simple means for assessing, exploring and navigating the data.
5. ACKNOWLEDGMENTS
This work is partly funded by the European Union under FP7
Grant Agreement No 317620 (LinkedUp).
6. REFERENCES
3. LAK DATA CHALLENGE
[1] Fernandez, M., d'Aquin, M., and Motta, E. 2011. Linking
Data Across Universities: An Integrated Video Lectures
Dataset. In Proceeding of the 10th International Semantic
Web Conference (ISWC 2011), 23 - 27 Oct 2011, Bonn,
Germany.
Beyond merely publishing the data, we are actively encouraging
its innovative use and exploitation as part of a public LAK Data
Challenge17 sponsored by the European Project LinkedUp. An
initial competition is co-located with the ACM LAK13
[2] Haslhofer, B., Isaac. A. 2011. data.europeana.eu - The
Europeana Linked Open Data Pilot. In Proceeding of the
International Conference on Dublin Core and Metadata
Applications (DC 2011).
12
http://www.r-project.org/
13
http://data.linkededucation.org/openrdf-sesame/repositories/lakconference?query=[your sparql query]
14
http://data.linkededucation.org/openrdf-sesame/repositories/lakconference?queryLn=SPARQL&query=PREFIX%20led%3A%3Chttp
%3A%2F%2Fdata.linkededucation.org%2Fns%2Flinkededucation.rdf%23%3E%0APREFIX%20swrc%3A%3Chttp%3A%2F%
2Fswrc.ontoware.org%2Fontology%23%3E%0A%0Aselect%20%3Fpa
per%20%3Ffulltext%20where%20%7B%3Fpaper%20led%3Abody%2
0%3Ffulltext%20.%20%3Fpaper%20swrc%3Ayear%20%3Fyear%20.
%20FILTER%20%28%3Fyear%20%3D%20%222011%22%29%20%7
D&infer=true
15
http://data.linkededucation.org/openrdf-sesame/repositories/lakconference
16
http://www.solaresearch.org/resources/lak-dataset/sparql-queries/
17
http://www.solaresearch.org/events/lak/lak-data-challenge/
[3] Kobilarov, G., Scott T., Raimond Y., Oliver S., Sizemore C.,
Smethurst M., Bizer C., Lee R. 2009. Media Meets Semantic
Web - How the BBC Uses DBpedia and Linked Data to
Make Conections. In Proceedings of the 6th European
Semantic Web Conference (ESWC2009).
[4] Mitsopoulou, E., Taibi, D., Giordano, D., Dietze, S., Yu, H.
Q., Bamidis, P., Bratsas, C. and Woodham, L. 2011.
Connecting Medical Educational Resources to the Linked
Data Cloud: the mEducator RDF Schema, Store and API, in
Linked Learning 2011, Proceedings of the 1st International
Workshop on eLearning Approaches for the Linked Data
Age, CEUR-WS, Vol. 717, 2011
18
http://lakconference2013.wordpress.com/
A Dynamic Topic Model of Learning Analytics Research
Michael Derntl
Nikou Günnemann
Ralf Klamma
RWTH Aachen University
Advanced Community
Information Systems (ACIS)
Aachen, Germany
RWTH Aachen University
Advanced Community
Information Systems (ACIS)
Aachen, Germany
RWTH Aachen University
Advanced Community
Information Systems (ACIS)
Aachen, Germany
derntl@dbis.rwthaachen.de
nikou@dbis.rwthaachen.de
klamma@dbis.rwthaachen.de
ABSTRACT
Research on learning analytics and educational data mining has been published since the first conference on Educational Data Mining (EDM) in 2008 and gained momentum
through the establishment of the Learning Analytics and
Knowledge (LAK) conference in 2011. This paper addresses
the LAK Data Challenge from the perspective of visual analytics of topic dynamics in the LAK Dataset between 2008
and 2012. The data set was processed using probabilistic,
dynamic topic mining algorithms. To enable exploration
and visual analysis of the resulting topic model by LAK
researchers and stakeholders we developed and deployed DVITA, a web-based browsing tool for dynamic topic models.
In this paper we explore answers to the questions about past,
present, and future of LAK posed in the Data Challenge
based on a topic model of all papers in the LAK Dataset.
We also briefly describe how users can explore the LAK topic
model on their own using D-VITA.
1.
OBJECTIVES
The LAK Data Challenge called for contributions to make
sense of the field of learning analytics including its “roots,
current state, and future trends, based on how its members
report and debate their research”1 . This paper tackles the
challenge by presenting facts obtained from statistical analyses of the paper full texts included in the provided LAK
Dataset [7]. The main contributions are as follows:
1. A dynamic topic model was computed using the approach presented in [3]. Using this dynamic topic model
we explore in Section 4 three questions about the evolution of topics in the LAK Dataset to distill knowledge
about past, present and future of LAK research.
2. In Section 5 we describe the visual analytics application D-VITA2 , which puts the toolkit to answer the
1
2
http://solaresearch.org/events/lak/lak-data-challenge/
http://monet.informatik.rwth-aachen.de/DVita/?id=16
Copyright 2013 by the authors
Figure 1: Yearly distribution of papers over venues
questions posed in the LAK Data Challenge into the
user’s hands. D-VITA is a web-based tool that offers
topic-based views on the LAK Dataset using a pointand-click metaphor and simple visualizations.
2.
DATASET AND PREPROCESSING
The LAK Dataset underlying the analyses presented in this
paper includes the EDM conference proceedings 2008–2012
(239 papers), the LAK conference proceedings 2011–2012
(66 papers), and the papers of the 2012 Special Issue on
Learning Analytics in the Educational Technology and Society journal (10 papers; herafter referred to as ETS). The
RDF representation of the LAK Dataset was processed by a
script that extracted for each paper the identifier, venue ∈
{LAK, EDM, ETS}, year of publication, title, authors, abstract, full text, and hyperlink to the full RDF description
on data.linkededucation.org. The distribution of the 315
papers over time and venues is given in Figure 1.
In the next preprocessing step the paper records were cleaned
by removing stopwords and by applying stemming methods
on the included word sets. For word stemming we used the
Porter Stemming technique [6], which is well established for
this purpuse. As a result, close to 5000 distinct word stems
were identified as being used in the 315 papers.
3.
DYNAMIC TOPIC MINING
From a text mining perspective the LAK Dataset represents
a text corpus in which a set of words is used in a set of papers. To identify what is relevant to LAK research, we used
the dynamic topic modeling approach described in [3] to obtain the distribution of words over a pre-defined number of
topics. This is a probabilistic, unsupervised machine learning approach that has been gaining increasing prominence
recently [2]. In these probabilistic topic models a topic is a
distribution of words, so each topic is typically represented
by its most frequently occurring words. Topic mining also
obtains the distribution of these topics over the papers in
the data set. Dynamic topic mining applies these analysis steps using several consecutive time slices in the data
set. For the LAK Dataset, we chose the five calendar years
∈ {2008 . . . 2012} as time slices. The results will thus reveal
the evolution of topics over documents during those discrete
time slices, and the evolution of words used in the papers
for each topic over time.
with pairwise divergence values is displayed in Figure 2. The
maximum Jensen-Shannon divergence value is ln(2) ≈ .69.
The darker the cell color, the lower the divergence, thus the
higher the similarity. The matrix is generally “light-colored”,
indicating that the topics’ word distributions diverge to a
high degree. Topic pair (A, S) has the lowest dissimilarity
value, and Figure 3 reveals why: both topics are about student modeling. Topic S generally appears to have several
loosely related topics.
Dynamic topic mining requires the analyst to pre-set the
number of topics. Based on previous experiments with varying numbers of topics in paper collections in well-defined
subject areas, we decided to run the analysis of the LAK
Dataset with a set of 20 topics. This number, while somewhat arbitrary, shall provide for sufficient discriminatory
power for both the distribution of topics over papers and the
distribution of words over topics. With fewer topics, terms
like ‘learning’, for instance, are more likely to be present
with relatively high relevance in many topics, while a larger
preset would increase the number of topics exposed in each
paper. Both situations would impede reasonable interpretation and visualization of the results.
4.
A word of explanation regarding the labels used to refer to
topics in this paper: mathematically each topic is a distribution over words. In a dynamic topic model this distribution changes over time, i.e. a specific word may rise or
fall in relevance for a topic. In the rest of the paper we
will therefore label each topic with an ordered tuple representing those words with the highest mean relevance for this
topic over time. In topic modeling literature we found that
four words is a good number to form a topic label. For instance, for topic “students model parameters skill” the
most relevant word on average is student followed by model,
parameters, and skill. For illustration, based on the word
distribution for this topic in 2008 only, the label would be
“model student skill learning”. Often, such word tuples
are rephrased as more expressive labels; for instance “student
modeling” could be appropriate in our example.
The obtained topic model including 20 topics was analyzed
to see whether the topics have sufficient discriminatory power.
To this end, we used the ten most important words for
each topic and the corresponding probability distributions to
compute a dissimilarity measure of the distributions by using the Jensen-Shannon divergence measure [5]. The matrix
ANALYSIS OF LAK TOPIC DYNAMICS
In this section we explore three questions about the LAK
Dataset, intending to shed some light on the past and present
topics of learning analytics research, along with a cautious
glimpse into the future.
Question 1: What have been the most relevant topics
overall in the LAK data set?
This question addresses the LAK Data Challenge aspects
of roots and current state of learning analytics. Figure 3
shows an overview chart of the 20 topics identified in the
LAK Dataset. The horizontal axis reflects the rank of mean
relevance of each topic and the vertical axis reflects the rank
of stability3 over the five time slices in the dataset. The size
of each bubble reflects the relevance of the topic in 2012, the
most recent period. We make several observations:
• The most relevant topics most prominently feature the
terms students/learners, model, and data. This aligns
well with SoLAR’s definition of learning analytics as
“the measurement, collection, analysis and reporting
of data about learners and their contexts, for purposes
of understanding and optimizing learning and the environments in which it occurs,”[1] considering that understanding and optimization is necessarily based on
models of learners and data.
• The topic with the highest mean relevance is “student
model parameters skill” (A); this topic also has the
highest variance in relevance.
• In the top-right quadrant we find topic “model data
features prediction” (B) which has a strong relevance in 2012, high mean relevance rank over all years
and a high stability. As such, it can be considered
as one of the core topics in the LAK Dataset. In
2012 the distribution of words in this topic would advocate the label “prediction model data students”,
i.e. prediction is currently most relevant for this topic.
• Topic “network community discussion analysis” (R)
is also worth looking at. While it is relatively irrelevant and volatile, it is among the relevant topics in
2012 (cf. the bubble size). The topic evolution chart
in Figure 4 reveals that this topic accumulated most
of its relevance in 2011, the year of the first LAK conference. Also, 8 of the 10 papers with the strongest
focus on this topic in 2011 were published in the LAK
conference (see bottom portion of Figure 4) although
EDM published 2.5 times the number of papers in that
Figure 2: Overview of topic divergence
3
Stability was computed by inverting the variance of the
topic’s relevance over time
Figure 3: Topic stability plotted against average topic relevance over time.
the topic, i.e. the more documents expose this topic. Since
each document exposes different topics to varying degrees
the relevance of topic P
k at time t is formally defined as
relevance(k, t) := |D1t | d∈Dt θd [k], where Dt is the set of
documents belonging to time t, and θd is the topic distribution for document d. Observing the ThemeRiver in Figure 5
it is evident that there were some shifts in topic focus during
the years 2008 and 2010, where we have only the EDM papers in the dataset. Between 2010 and 2011 we identify the
strongest turbulence, presumably based on substantial shifts
in topic foci introduced by the 2011 LAK conference. Interestingly the topic distribution remains rather stable during
the last time slice, in which LAK 2012, EDM 2012 and the
ETS special issue are included. This might suggest that
these three publication venues propelled the convergence of
LAK research as represented in the LAK Dataset.
Figure 4: Evolution (top) and most representative
papers in 2011 (bottom) of topic “network community discussion analysis”
year. This topic, in 2012 represented by the word order “network community social user”, therefore appears to be a genuine LAK topic which was previously
rather irrelevant for the EDM conference.
Question 2: What changes in topic dynamics did the
first LAK conference in 2011 bring about?
This question aims to reveal whether and how the LAK community relates to the EDM community in terms of topics
covered by their papers. To explore this we look (a) at at
the overall distribution of topics over time and (b) at the
relative change of topic relevance between 2010 and 2011.
The evolution of the overall distribution of topics is illustrated in the ThemeRiver in Figure 5. In a ThemeRiver [4]
the horizontal axis represents the points in time to which the
documents in a dataset belong (in the LAK Dataset that is
the publication date), and the vertical axis represents the
relevance of the topic. Each current in the ThemeRiver
therefore presents the dynamic development of a selected
topic over time. The wider the current, the more relevant is
To see which topics rose in relevance between 2010 and
2011 we filter for topics and zoom into the transition between 2010 and 2011 as illustrated in Figure 6. Those three
topics that have their absolute highest relevance in 2011
are marked with an up-pointing triangle with a solid-black
outline. These are “model students data probability”,
“network community discussion analysis”, and “problem
students model types”, indicating an increased focus on
student modeling as well as community and network analysis through the first LAK conference in 2011.
Question 3: What topics rose the most in 2012, the
most recent time slice in the data set?
This question looks into what the dynamic topic model of
the LAK Dataset suggests as rising topics over the next
year(s). We try to answer this by identifying those five topics that had the highest rise in relevance between 2011 and
2012. The topic labels represent the word distribution in
2012, and the number in parentheses indicates the absolute
gain in relevance:
Figure 5: Overall distribution of topic relevance between 2008 and 2012
The Document and Word Evolution Panel shows for the
selected topic an ordered list of the most relevant papers
in the “Relevant Documents” tab. The icons next to each
document allow showing the topic pie for the document
and its content, respectively. The “Similar Docs” icon will
bring up the Document Browser with a list of similar documents. Under the “Word Evolution” tab the user will find a
ThemeRiver illustrating the evolution of the distribution of
words in the selected topic over time.
Figure 6: Topics with rising relevance in 2011
D-VITA also offers a Document Browser to perform keywordbased search, explore the topic distribution of documents,
and navigate documents based on similarity.
6.
1.
2.
3.
4.
5.
students data courses system (+.054)
students interaction participants analysis (+.036)
learning analytics social learners (+.035)
students actions learning state (+.025)
data user learning dataset (+.013)
In sum these five topics have accumulated a share of 42% of
the topic distribution by 2012, starting from 11% in 2008 (cf.
Figure 7). These developments indicate a strong increase in
focus on the students’ activities and actions in courses as
well as social and interaction analytics.
CONCLUSION
In a nutshell, we discovered the following: Regarding the
past, we found that LAK and EDM do have a substantial
shared topic foundation including themes like student modeling, data classification, and clustering. We also found that
the EDM conference series had some turbulence in topical
focus between 2008 and 2010, the time window when only
EDM papers are present in the dataset.
Regarding the present we found that the LAK Dataset exposes a strong emphasis on learner modeling, data modeling, analysis and prediction. The first LAK conference in
2011 also brought some considerable shifts in topic focus;
e.g. LAK 2011 has visibly strengthened network and social
analysis aspects on top of EDM topics.
Regarding the near future we found that the shifts in the
topics’ proportions in 2012 appear rather moderate, thus
indicating a phase of convergence of LAK research topics.
Projecting recent topic shifts into the future, we can expect
increased emphasis on social and interaction aspects and a
sustained, strong role of students as research subjects.
Figure 7: Cumulative relevance of the top-five rising
topics 2012 over all years
5.
D-VITA TOPIC ANALYTICS TOOKIT
Except for Figures 1 and 3 all figures were produced using
D-VITA, a web-based visual analytics tool we developed and
deployed for visual analytics of dynamic topic models. The
tool allows users to visually interact with the output of the
dynamic topic mining algorithms on the LAK Dataset. The
application window shown in Figure 8 has three panels:
The Topics Panel shows the list of topics obtained by the
dynamic topic modeling algorithm; topics can be sorted by
rising, falling and mean relevance, as well as variance of
relevance. The topics can be filtered using keywords; in the
screen shot the keyword “visual” is used as a filter. The topic
list thus only includes topics whose set of relevant words
includes this word stem. Topics checked by the user will be
visualized in the ThemeRiver in the Topic Evolution Panel.
The Topic Evolution Panel shows a ThemeRiver of evolution of relevance of the topics selected in the Topics Panel.
Data points for each topic and time slice, respectively, can
be clicked, which will trigger the display of detailed information on the clicked topic at the selected time slice in the
Document and Word Evolution Panel.
7.
ACKNOWLEDGMENTS
This work was supported by the European Commission through
the the support action TEL-Map (FP7-257822) and the integrated project Layers (FP7-318209).
8.
REFERENCES
[1] About SoLAR, 2012.
http://www.solaresearch.org/mission/about/.
[2] D. M. Blei. Probabilistic topic models. Commun. ACM,
55(4):77–84, 2012.
[3] D. M. Blei and J. D. Lafferty. Dynamic topic models.
In ICML, pages 113–120, 2006.
[4] S. Havre, E. G. Hetzler, P. Whitney, and L. T. Nowell.
Themeriver: Visualizing thematic changes in large
document collections. IEEE Trans. Vis. Comput.
Graph., 8(1):9–20, 2002.
[5] J. Lin. Divergence Measures Based on the Shannon
Entropy. IEEE Transactions on Information Theory,
37(1), 1991.
[6] M. F. Porter. An algorithm for suffix stripping.
Program, 14(3):130–137, 1980.
[7] D. Taibi and S. Dietze. Fostering analytics on learning
analytics research: the LAK dataset, Technical Report,
03/2013, 2013. http://resources.linkededucation.
org/2013/03/lak-dataset-taibi.pdf.
Figure 8: Application window showing ThemeRiver and document list (rotated image)
Socio-semantic Networks of Research Publications
in the Learning Analytics Community
Soude Fazeli, Hendrik Drachsler, Peter Sloep
Open University of the Netherlands (OUNL)
Centre for Learning Sciences and Technologies (CELSTEC)
6401 DL Heerlen, The Netherlands
0031-(0)45-576-2218
{soude.fazeli,hendrik.drachsler,peter.sloep}@ou.nl
ABSTRACT
2. Motivation
In this paper, we present network visualizations and an analysis of
publications data from the LAK (Learning Analytics and
Knowledge) in 2011 and 2012, and the special edition on
Learning and Knowledge Analytics in Journal of Educational
Technology and Society (JETS) in 2012.
It is often difficult for conference attendees to decide which
workshops or sessions are suitable and relevant for them.
Therefore, a list of recommended authors and papers based on
shared interests could be supportive to plan the conference
participation more efficiently and effectively. There already exist
several papers published regarding awareness support for
researchers (Reinhardt et al., 2012; Fisichella et al., 2010; Ochoa
et al., 2009; Henry et al., 2009) and scientific recommender
systems (Huang et al., 2002; Wang & Blei, 2010) but none of
them has analyzed the Learning Analytics datasets for this
purpose yet.
Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: Information filtering;
K.3.m [computers and education]: Miscellaneous
General Terms
Algorithm, visualizations
Keywords
Network, recommender, visualization, dataset, learning analytics,
degree
1. Introduction
1
The Society for Learning Analytics Research (SOLAR) provided
a dataset to solicit contributions to the LAK data challenge2
sponsored by the FP7 European Project LinkedUp3. The dataset
contains research publications in learning analytics and
educational data mining for the years 2010, 2011, and 2012 (Taibi
& Dietze, 2013). An overview of the dataset is shown in Figure 1.
The dataset contains in total, 173 authors and 76 papers from the
LAK (Learning Analytics and Knowledge) conference series in
2011 and 2012, and the special edition on learning and knowledge
analytics in the Journal of Educational Technology and Society
(JETS) in 2012. We found 24 authors who contributed to all three
scientific proceedings.
Having access to a dataset always offers new opportunities,
particularly in the educational domain, that lacks public datasets
for running experimental studies (Verbert, Drachsler, Manouselis,
Wolpers, Vuorikari, & Duval, 2011). Therefore, we used this
dataset to present visualization of the authors and papers network,
and to carry out a deeper analysis of the generated networks. Our
overall aim is to use such a graph of authors and papers to
recommend similar items to a target user. In the following
sections, we evaluate the suitability of the LAK dataset for this
purpose.
1
http://www.solaresearch.org/
2
http://www.solaresearch.org/events/lak/lak-data-challenge/
3
http://linkedup-project.eu/
Our overall vision is to support the LAK attendees with a list of
LAK authors and papers that are relevant for their own research
interests. Such a recommendation could be created based on one
or more of their own research papers but also on a short essay or
even a tag cloud summarizing the research interest and objectives.
Such a priority list can support the awareness of the attendees and
empower the network of like-minded authors in the attendees’
particular research focus.
Figure 3. The used datasets
In this paper, then, we aim to explore and identify like-minded
authors within the LAK dataset. Supposing that we have a
network of all the LAK authors and papers, the main research
questions are:
RQ1. How are the authors connected and which authors share
more connections and are more central in terms of sharing
commonalities with the others?
RQ2. How are the papers connected to each other in terms of
similarity?
To answer these questions, we went through two main steps in our
analysis: 1. Finding patterns of similarity between authors and
papers, 2. Visualizing networks of the LAK authors and papers.
We will now describe each step in detail.
4.1. The LAK authors network
Figure 2 presents a network of the LAK authors in which red
nodes represent the authors and the edges show the similarity
between the publications of two authors. The result shows how
the LAK authors are connected in terms of their publications'
commonalities. Moreover, the network shows the users who share
more commonalities than do other authors. We call them ‘central
authors’. In the next section, we show how they are connected
with the other authors in the network.
4.2. The LAK authors’ degree centrality
(The Appendix shows a larger version)
3. Data processing
To find relationships between authors, we first computed the
4
similarity of the papers with the TF-IDF algorithm. TF-IDF can
create a weighted list of the most commonly used terms in
research articles. To generate the TF-IDF matrix for the LAK
dataset, we first converted the LAK data from RDF to text files,
5
which is an accepted format for the Mahout system. Then, we ran
the default TF-IDF algorithm provided by Mahout on the text
files. We removed the stop words by setting the configuration
variables within Mahout to 90%. Thus, if a word appears in 90%
of the document, it is considered as a stop word (e.g. and, or, the,
etc.) and is removed from the similarity matrix. As a final
outcome we had:
•
•
A so-called dictionary of all the terms in the LAK
dataset
A binary sequence file that includes the TF-IDF
weighted vectors
For computing similarity between the LAK authors, we used the
T-index algorithm (Fazeli, Zarghami, Dokoohaki, & Matskin,
2010) as a collaborative filtering recommender algorithm that
generates a graph of users. In it the nodes are users and the edges
show the relationship between users that originates from similarity
of user profiles. The T-index algorithm originally makes
recommendations based on the ratings data of users. We extended
the T-index algorithm to be able to process tags and keywords
6
extracted from the linked data e.g. RDF files. We used Jena APIs
to process RDF files and to handle Ontology Web Language
(OWL) files that describe the generated graph of authors and
papers. Jena helps to develop semantic Web application and tools.
4. Data visualization
We visualized the generated graphs of authors and papers with the
Welkin7 tool. Welkin takes an OWL file as input and provides
visualization of the data as output. We present visualizations of
the LAK authors and the LAK papers generated by Welkin in the
following sub sections.
4
http://en.wikipedia.org/wiki/Tf–idf
5
http://mahout.apache.org/
6
http://jena.apache.org/
7
http://simile.mit.edu/welkin/
140
n=10
121
n=5
120
100
indegree
Figure 2. The LAK authors’ network
For some node in the network, the degree centrality shows the
total number of incoming and outgoing edges. It is a metric
commonly used for Social Network Analysis (SNA) (De Liddo,
Buckingham Shum, Quinto, Bachler, & Cannavacciuolo, 2011;
Gu´eret, Groth, Stadler, & Lehmann, 2012; Opsahl, Agneessens,
& Skvoretz, 2010). In other words, the degree of a node describes
how many other nodes are connected to the target node. In fact, it
helps to measure how many hubs are in the network. We describe
hubs as the nodes that have the most connections to the others in
the network. The degree centrality metric may be used to
strengthen a network by providing its nodes with more
connections. In this data study, degree centrality is used to
measure the relevance of an author’s papers to the other authors in
the network.
96
92
85
80
71
57
60
64
55
55
46
45
36
40
50
49
45
44
17
16
u9
u10
35
22
20
0
u1
u2
u3
u4
u5
u6
u7
u8
Then first t en central authors
Figure 3. The degree centrality of the top ten central authors
Figure 3 shows the degree centrality for the first ten authors with
the highest similarity degree with respect to the LAK publications.
The horizontal axis (x) shows the top ten central users, e.g. u1 is
the author whose paper(s) has the highest degree. The vertical axis
(y) shows the degree values that describe the number of
relationships of a each user shown in the x-axis. Figure 3 also
shows degree centrality for two different sizes of nearest
neighborhoods (n). Such neighborhoods are commonly used in
collaborative filtering recommender algorithms. By increasing the
neighborhood size n, the degree of the authors increases
accordingly. As a result, we will have a larger number of central
authors when n is higher (e.g. n=10). As can be seen in Figure 3,
degree for the first central author (u1) is equal to 121 if n=10 and
97 if n=5. These high scores show the high relevancy of u1’s
publications to the authors. As a consequence, u1 will appear in
the top-n authors recommendations more often than the other
authors.
5. Discussion and conclusions
The results presented here, allow us to answer our research
questions in the following way:
RQ1. How are the authors connected? Which authors share more
connections and are more central in terms of sharing
commonalities with the others?
Figure 4. The LAK papers network
(The Appendix shows a larger version)
4.3. The LAK papers network
Figure 4 shows a network of the LAK papers. The red nodes are
papers and the edges between them represent the similarity of the
papers. By finding similar papers, we can recommend the most
similar papers to specific authors. This increases the awareness of
the authors about papers which are relevant to them and published
in their communities.
Figure 4 shows that, some of the papers share more similarity with
the others and own a higher degree number. As with the central
authors, these papers will appear more often in the top
recommendation list than the other papers of the dataset. One
may interpret their degree as their popularity. Therefore, the
papers with higher degree values are more popular and,
presumably, they are more of interests to users. For the
publication data, interests of users derives from the words and
terms they have used more frequently in their papers.
70
60
degree
Table 1. The first ten central authors
Author
Degree
Hendrik Drachsler
116
Kon Shing Kenneth Chung
87
Wolfgang Greller
80
Javier Melenchon
66
Brandon White
59
Vania Dimitrova
50
Erik Duval
45
Rebecca Ferguson
44
Anna Lea Dyckhoff
40
Simon Buckingham Shum
39
58
51
50
40
We presented a visualization of the authors’ network to provide an
overview of how they are connected to each other. To justify the
authors’ connections and relationships, we evaluated the degree
centrality for the first ten, most central authors. Table 1 presents
the first ten central authors and their degree to show the authors
with the highest relevancy of their publications with others in the
network. Table 1 shows the degree of the authors for sizes of
neighborhoods equal to 10.
34
47
46
42
34
30
30
n=5
24
23
21
20
33
19
20
31
18
n=10
27
26
17
16
10
0
p1
p2
p3
p4
p5
p6
p7
p8
p9
p10
Top t en papers
Figure 5. The degree centrality of Top ten papers
4.4. The LAK papers’ degree centrality
Figure 5 shows the degree centrality for the first ten papers that
are most similar to the other papers. We selected the first ten top
papers with the highest degrees. The horizontal axis (x) shows the
top ten papers e.g. p1 is the paper with the highest similarity and
thus, the highest degree value among the others shown by the
vertical axis (y). Figure 5 shows degree centrality for two
different sizes of nearest neighborhoods (n), 5 and 10. By
increasing the n, the degree of the papers increases accordingly.
As a result, we will have a larger number of top papers if n is
higher (here, when n=10). In Figure 5, the degree for the first top
paper (p1) is equal to 53 (n=10) and 29 (n=5). This shows how
much p1 shares similarity with other papers. As a consequence, p1
can be considered as the most popular paper and it has the highest
chance to appear in the top paper recommendations.
RQ2. How are the papers connected to each other in terms of
similarity?
We presented degree centrality of the LAK papers to give insight
in their relationships in the papers’ visualized network. We
selected the top ten papers that have the highest similarity with the
other papers. To show which papers are placed in the top ten
papers’ list, we present the title and authors for each paper.
The top ten papers are not necessarily by the authors who are
identified as the central authors. Although most of the central
authors also appear in top ten papers’ list (see Table 2), the order
is not the same. As we investigated the LAK data, we found out
that some of the central authors have more than one paper. For
instance, Hendrik Drachsler has contributed to four papers. In this
study, similarity is calculated based on all papers of an author. So,
it is quite probable that not each and every one of the authors’
papers individually has the highest similarity to the other papers.
Although some of the central authors are common to the two
tables, only one of the papers authored by those central authors
appears in the top ten papers list shown by Table 2.
Table 2. The Top ten papers
Paper
Authors
Learning Dispositions and
Transferable Competencies:
Pedagogy, Modelling and Learning
Analytics
Simon Buckingham-Shum,
Ruth Deakin Crick
The Pulse of Learning Analytics
Understandings and Expectations
from the Stakeholders
Social Learning Analytics: Five
Approaches
Hendrik Drachsler,
Wolfgang Greller
Multi-mediated Community
Structure in a Socio-Technical
Network
Modelling Learning &
Performance: A Social Networks
Perspective
Teaching Analytics: A Clustering
and Triangulation Study of Digital
Library User Data
Monitoring Student Progress
Through Their Written "Point of
Originality"
Learning Designs and Learning
Analytics
Dan Suthers, Kar Hai Chu
A Multidimensional Analysis Tool
for Visualizing Online Interactions
Eunchul Lee,
M'hammed Abdous
Using computational methods to
discover student science
conceptions in interview data
Bruce Sherin
Rebecca Ferguson,
Simon Buckingham-Shum
Walter Christian Paredes,
Kon Shing Kenneth Chung
Beijie Xu,
Mimi M Recker
Johann Ari Larusson,
Brandon White
Lori Lockyer,
Shane Dawson
Overall, we found that the LAK dataset can help conference
attendees to become more aware of their research network, which,
in its turn, is useful for sharing knowledge and experiences.
However, the current dataset contains no user feedback or
evaluations to evaluate either an author or a paper recommender
system in terms of common metrics such as prediction accuracy
and coverage of the generated recommendations. For future
analysis it would be helpful if the LAK dataset also contains
references to the papers. The references could be used to identify
the top cited authors and papers within the LAK dataset and
beyond. As a further step, we are planning to try additional social
network analysis measures besides degree, such as betweenness or
closeness.
6.
References
De Liddo, A., Buckingham Shum, S., Quinto, I., Bachler, M., &
Cannavacciuolo, L. (2011). Discourse-centric learning
analytics Conference Item. LAK 2011: 1st International
Conference on Learning Analytics & Knowledge. Banff,
Alberta.
Fazeli, S., Zarghami, A., Dokoohaki, N., & Matskin, M. (2010).
Elevating Prediction Accuracy in Trust-aware
Collaborative Filtering Recommenders through T-index
Metric and TopTrustee lists. JOURNAL OF EMERGING
TECHNOLOGIES IN WEB INTELLIGENCE, 2(4), 300–
309. doi:doi:10.4304/jetwi.2.4.300-309
Gu´eret, C., Groth, P., Stadler, C., & Lehmann, J. (2012).
Assessing Linked Data Mappings using Network Measures.
Proceedings of the 9th international conference on The
Semantic Web: research and applications (pp. 87–102).
Springer-Verlag Berlin, Heidelberg. doi:10.1007/978-3642-30284-8_13
Opsahl, T., Agneessens, F., & Skvoretz, J. (2010). Node centrality
in weighted networks: Generalizing degree and shortest
paths. Social Networks, 32(3), 245–251.
doi:10.1016/j.socnet.2010.03.006
Taibi, D., & Dietze, S. (2013). Fostering analytics on learning
analytics research: the LAK dataset.
Verbert, K., Drachsler, H., Manouselis, N., Wolpers, M.,
Vuorikari, R., & Duval, E. (2011). Dataset-driven research
for improving recommender systems for learning.
Proceedings of the 1st International Conference on
Learning Analytics and Knowledge (pp. 44–53). ACM,
New York, NY, USA.
7. Appendix
7.1. The LAK authors’ network
7.2. The LAK papers’ network
Paperista: Visual Exploration of Semantically Annotated
Research Papers
Nikola Milikic
Uros Krcadinac
Jelena Jovanovic
Faculty of Organizational Sciences,
University of Belgrade
Jove Ilića 154
Belgrade 11000, Serbia
+381-11-3950853
Faculty of Organizational Sciences,
University of Belgrade
Jove Ilića 154
Belgrade 11000, Serbia
+381-11-3950853
Faculty of Organizational Sciences,
University of Belgrade
Jove Ilića 154
Belgrade 11000, Serbia
+381-11-3950853
nikola.milikic@gmail.com
Bojan Brankov
uros@krcadinac.com
Srdjan Keca
jeljov@gmail.com
UZROK Labs
107 Nehruova
Belgrade 10070, Serbia
+381-63-581879
UZROK Labs
107 Nehruova
Belgrade 10070, Serbia
+381-61-3115661
bb@uzrok.com
ABSTRACT
sk@uzrok.com
We consider the problem of visualizing and exploring a dataset
about research publications from the fields of Learning Analytics
(LA) and Educational Data Mining (EDM). Our approach is based
on semantic annotation that associates publications from the
dataset with Wikipedia topics. We present a visualization and
exploration tool, called Paperista (www.uzrok.com/paperista),
which presents these topics in the form of bubble and line charts.
The tool provides multiple views, thus allowing users to observe
and interact with topics, understand their evolution and
relationships over time, and compare data originating from
different research fields (i.e., LA and EDM). Moreover, user can
explore papers to which the presented topics are related to, and
make related Web searches to access the papers themselves.
Categories and Subject Descriptors
D.2.2 [Software Engineering]: Design Tools and Techniques user interfaces
General Terms
Algorithms, Design
Keywords
Learning Analytics, Visualization, Research Papers
1. MOTIVATION
The field of Learning Analytics is emerging in the past few years
and attracting more and more researchers from other areas of
Technology Enhanced Learning (TEL). It aims to address the
current needs in the broad area of education by making use of the
latest trends in information technologies where everything is
moving towards Big Data and real-time analytics.
Learning Analytics (LA) is defined as “the measurement,
collection, analysis and reporting of data about learners and their
contexts, for purposes of understanding and optimising learning
and the environments in which it occurs” [16]. It is often equated
with other similar fields in the TEL area, such as Academic
Analytics or Educational Data Mining (EDM) [14]. EDM is a
research field that focuses on using computational approaches,
namely data mining and machine learning, to analyze educational
data in order to facilitate and enhance educational process, and
contribute to the overall improvement of students’ learning
experience [17]. Even though both LA and EDM are selfcontained research fields, they are intertwined and overlap in
topics they cover. They share many similarities, but also have
some distinct differences as discussed by Siemens and Baker [18].
One of the similarities emphasized by these authors is that both
fields reflect the emergence of data-intensive approaches to
education, where both communities have the goal of analyzing
large-scale educational data in order to support research and
practice in education. They differ in the level of automation they
aim to achieve. In particular, EDM has a greater focus on
automating support for educational processes, such as adaptation
and personalization of learning environments and learning
processes. On the other hand, LA has a considerably greater focus
on leveraging human judgment, on informing and empowering
instructors and learners to reflect over and improve learning
processes.
The Society for Learning Analytics Research (SoLAR) has
published LAK dataset1 containing structured data about research
publications from Learning Analytics and Knowledge (LAK)
Conference, Educational Data Mining Conference, and Journal of
Educational Technology & Society (JETS) Special Issue on LAK.
The data are represented in the RDF form, which makes them
easy to integrate and process by applications.
In this paper, we propose an approach to visualizing and exploring
the LAK dataset. It is centered around the topics covered by the
papers from the dataset, and is intended to give an overall view of
the topics that LA and the EDM fields cover. As the focus of
researchers and the degree of relevance of particular topics have
been changing over years, our approach tries to show a trend of
those changes through the whole period the dataset covers,
namely from 2008 to 2012. It also allows for topic-based
exploration of research papers and easy navigation to them.
2. RELATED WORK
In [11], authors present an interesting work aimed at automating
the creation of relations between research areas by using
semantically annotated data about research papers in a particular
1
www.solaresearch.org/resources/lak-dataset
area. As a continuation of this work, the same authors have
created a tool, called Rexplore2, which, among other things,
visualizes authors migration patterns across research areas [15].
In terms of visual representation, we find interesting an approach
to visualization of tags (topics) and categories of tags over time.
For example, Dubinko et al. [4] consider the problem of
visualizing the evolution of Flickr tags. The authors present a new
slider-based approach based on a characterization of the most
interesting tags. A Flash-based animation in a web browser allows
the user to observe and interact with the tags. Zhang et al. [5]
present an approach to classification and visualization of temporal
and geographic tag distributions. The authors argue that their
approach can help humans recognize semantic relationships
between tags. Lemma [6] presents the Ebony system, an
application for browsing, navigation, and visualization of the
DBLP database. Wattenberg [7] introduces arc diagrams for
representing complex patterns of repetition in string data.
Watteberg application, the Shape of Song, visualizes music files,
creating a static representation of repetition throughout a time
series. However, to our knowledge, there has been no (published)
research work on the visualization of research topics and
publications in the areas of LA and EDM.
3. THE PAPERISTA SYSTEM
Our approach is illustrated through a Web application called
Paperista. The application visualizes topics associated with
research publications from the LAK dataset, allowing users to
browse through papers, compare LA and EDM research fields,
and make related Web searches. Visualizations are created for
each individual year in order to display relevant topics in the LA
and EDM fields for a specific year, but also for all years
combined in order to give an overall depiction of the topic
distribution in these research areas.
3.1 Data Preparation and Analysis
LAK dataset consists of data about conferences and journal papers
published in the LA and EDM research fields in the 2008-2012
period. For each paper, the following elements are available: title,
author(s), abstract, keyword(s) and full text. Also, basic
information about authors is available, such as name and
affiliation.
3.1.1 Topic Extraction
Since one of the main features of Paperista is visualization of
research topics relevant for the given corpus, the first step in the
data preparation process was to extract main topics of the papers
encompassed by the LAK Dataset. A straightforward approach
was to use keywords associated with the papers. This is because
the authors themselves have compiled those keywords, and it is
them who know the best which topics describe their work in the
most appropriate way. However, the downside of this approach is
that those keywords are given as free form text and are not
consistent with any existing formal vocabulary. This makes them
inconsistent throughout the corpus. Furthermore, the dataset is
incomplete in regard to keywords as for conferences EDM 2008,
2009 and 2010 no keywords are provided.
Thus, we decided to employ a service for semantic annotation in
order to detect paper topics. We took into consideration two
Wikipedia based semantic annotators: TagMe3 and DBpedia
2
3
http://technologies.kmi.open.ac.uk/rexplore
http://tagme.di.unipi.it
Spotlight4. The decision to use Wikipedia based annotator was
motivated by the fact that Wikipedia is the largest corpus of open
encyclopedic knowledge and is often used as a well established
large-scale taxonomy [8]. Both annotator services are designed to
look for and retrieve recognized Wikipedia concepts from the
given text. They can be configured to the specific needs of any
particular usage scenario (i.e., corpus). TagMe is designed to
identify Wikipedia concepts specifically in short texts. Its REST
API5 allows for configuration of two parameters: i) the rho
parameter which refers to the "goodness" of an annotation with
respect to the topics of the input text, and ii) the epsilon parameter
which is used for fine-tuning the disambiguation process and
indicates whether to favor the most-common topics or to take the
context more into account [9]. DBpedia Spotlight annotates a
given text with concepts from DBpedia, a structured
representation of Wikipedia [12]. DBpedia Spotlight REST API6
exposes two parameters: confidence of the annotation process that
takes into account factors such as the topical pertinence and the
contextual ambiguity; support parameter specifies the minimum
number of inlinks7 [10]. We used only paper title and abstract for
topic extraction, based on an assumption that these two elements
contain mentions of the most important and interesting topics a
paper is related to. In order to decide which service for semantic
annotation to use, the two services were tested with a random
sample comprising 5% of all papers and with different parameter
settings. The best results were achieved by the TagMe service
(rho=0.15; epsilon=0.5). For this reason, TagMe service was
employed to annotate all papers in the corpus.
3.1.2 Identifying Popular Topics
Once having all papers associated with topics, we calculated the
significance of each topic. Numerical statistic called TF-IDF
(Term Frequency – Inverse Document Frequency)8 was used as it
calculates how important a word is to a document in a corpus of
documents. This metric was adapted to our case and used to
calculate the importance of a topic in a paper. Instead of
calculating the frequency of a word, we calculate the frequency of
a topic.
Since Paperista allows for visualizing topics in a specific year and
overall (in all years, 2008-2012), the significance was calculated
for corpora containing papers from each of these different time
periods. Accordingly, we had six different corpora and calculated
the significance of a topic for each corpus. In order to present only
the most significant topics, we have filtered the topic set to only
those whose significance for a particular period was over 0.01.
This threshold was empirically chosen and presents the best
balance between the relevance of topics and their presentation in
the Paperista’s visualizations (i.e., assuring easy comprehension
by users).
3.1.3 Topic Cleaning
Even though the output of TagMe service consisted of topics that
are relevant to the papers’ content, some of them can hardly be
considered as relevant research topics in the LA and EDM fields
as they are too general. For instance, topics like Methodology,
4
http://spotlight.dbpedia.org
http://tagme.di.unipi.it/tagme_help.html
6
http://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Webservice
7
Inlinik, or inline link, are incoming links from other DBpedia
concepts to the observed DBpedia concept
8
http://en.wikipedia.org/wiki/Tf-idf
5
Research, and Experiment can be associated with almost every
paper in this corpus. Actually, these topics can be related to
research papers from almost any other research area. Similarly,
some of the retrieved topics were not relevant to research papers
from the LAK dataset. Such topics resulted from imperfection of
the TagMe tool (and semantic annotation tools, in general). Some
examples of these alien topics include The T.O. Show, Ade Easily,
Henry Snapp, etc. For instance, Henry Snapp topic was apparently
mistaken with the SNAPP tool9, a popular learning analytics and
visualization tool. Hence, it was important to detect and exclude
all these generic and alien topics from the final visualization in
order to reduce the noise.
Once having relatedness calculated for all the topics in our corpus,
we compiled two lists to help us detect removal candidates. In the
first list, each topic was associated with the number of other topics
that topic is related to. This gave us an insight into which topics
can be considered too general/specific (the higher the number of
related topics, the more generic the topic is, and vice versa). In the
second list, each topic was associated with a sum of its relatedness
with all the other topics. This list was meant to complement the
first one. The rationale here is that there might be a topic with fair
number of relations to other topics, but those relatedness values
are weak. This behavior also qualifies a topic to be considered as
too specific or alien.
We applied topic cleaning approach similar to [11]. The idea is to
identify topics that have little or no relationships with other topics
in the corpus. This can be an indicator that a topic is too specific
or alien to our set of identified topics and thus can be considered
as an exclusion candidate. On the other hand, if a topic has
relationships with too many other topics, this can be an indicator
that a topic is too generic and again should be considered as an
exclusion candidate. In order to detect these outlier topics, we
needed a measure of relatedness between topics. To that end, we
used the Wikipedia Miner10 service that calculates semantic
relatedness of two topics by finding the corresponding Wikipedia
articles, and calculating similarity of those articles by comparing
their incoming and outgoing links [13]. Wikipedia Miner has a
REST API11 that allows for retrieving this information
programmatically.
The initial idea with compiling these two lists was that topics to
be removed will be at the beginning and the end of the lists (top
and bottom 10%), and that they could be removed automatically.
However, by examining the lists, among the obvious exclusion
candidates, there were also several topics that should not have
been excluded. For instance, topics like Online tutoring, Process
mining, Educational data mining etc. were at the end of both lists
making them removal candidates, even though these topics are
obviously highly relevant for LA and EDM fields. The reason for
this lays in the nature of Wikipedia itself and the fact that not
many other articles in Wikipedia link to these topics. Thus, the
topic removal process could not be done completely automatically
and an expert in the area was consulted to mark the topics that
should not be excluded.
Figure 1 - Paperista Interface
9
http://www.snappvis.org
http://wikipedia-miner.cms.waikato.ac.nz
11
http://wikipedia-miner.cms.waikato.ac.nz/services
10
3.2 Data Visualization and Exploration
The topic visualization applied in Paperista is inspired by the New
York Times visualizations Four Ways to Slice Obama’s 2013
Budget Proposal [1] and At the National Conventions, the Words
They Used [2].
The Paperista visualization includes bubble and line charts,
allowing users to gain insights into topic trends within the LA and
EDM fields. Bubble charts show the importance of a certain topic
for the entire dataset, each year, and/or each field. By changing
different views, users can watch the changes within the dataset
and compare the two fields. Animated transitions between charts
help users understand these processes. In addition, since the
animation does not show precise changes in topic’s relevancy
(calculated using TF-IDF metric, see Sect. 3.1), users are also
presented with a relevancy line chart for each topic.
The user interface (Figure 1) consists of an animated bubble
cloud, two button sliders, a sidebar, and an optional timeline. The
first slider button (All Years / By Year) allows users to choose
between the “All Years” and “By Year” views. “All Years” view
presents relevant topics for the entire corpus of publications. “By
Year” view activates a timeline, showing relevant topics for each
year. By using the slider, users can follow the change in topic
relevancy through the years the data is available for (2008-2012).
The second slider button (All Topics / Group Topics) allows for
grouping and regrouping of topics. “All Topics” view shows one
circle-shaped bubble chart. “Group Topics” view divides the chart
into two groups of bubbles. The first group presents topics that
appear only in the EDM field. The third one shows topics related
only to the LA field (i.e., LAK and JETS publications). The group
in the middle shows “mixed” topics, i.e., those that appear at least
once in both EDM and LAK/JETS. Different views of the bubble
chart are presented on Figure 2.
The size of a bubble represents the topic’s relevancy (i.e., TF–IDF
value). Two research fields, EDM and LA, are color-coded. Each
bubble is divided into two slices the size of which corresponds to
the frequency of that topic within publications of each of the two
sources. For the years 2008-2010, the dataset contains data only
for the EDM research field, so the bubbles are one-colored.
The order of topic bubbles is intended to help users compare the
two fields. The leftmost bubbles represent mostly EDM-related
topics, while the rightmost bubbles mostly belong to the LA field.
Moreover, clicking on a bubble creates a line chart in a sidebar.
The line chart shows the growth and decline of a certain topic.
In addition to the visualization, the Paperista application allows
users to browse papers by topic. When a user clicks on a
particular bubble (topic), a list of papers related to that topic
appears in the right sidebar (represented by a title and a list of
authors). Clicking on the particular paper opens a link to Google
scholar with a name of the article as a search query. Thus, if a
paper is available online, a user could easily obtain the paper
using the Paperista system.
Figure 2 - Different views of the bubble chart: (1) All Year / All Topic (no highlights); (2) All Years / Group Topics
(with highlights); (3) By Year (2009) / All Topics (no highlights); and (4) By Year (2012) / Group Topics (with
highlights)
Furthermore, when a user hovers over the paper title, all topics
related to that paper become highlighted. By hovering over
papers, users can gain quick insight about topic connections
between publications. Users can also distinguish papers annotated
with highly relevant topics from those marked with insignificant
ones. This can show which papers are more related to the fields of
EDM and LA, and which can be viewed as “outliers”.
3.3 Paperista Architecture and Dataset API
The Paperista system consists of a Web application and a server
application that provides RESTful API for communicating with
the dataset. The Web-based visualization is written in D3, a
JavaScript library for manipulating documents based on data [3].
We have chosen D3 because of its good performance for
animation and interaction within the Web environment. The
visualization is available at the following address:
www.uzrok.com/paperista.
All data about conference topics and their significance (explained
in Section 3.1) is available as a part of Paperista Dataset API. This
API supports a REST model for accessing the data and it is
available at: http://147.91.128.71:9090/LAKChallenge2013. The
Paperista’s Web application calls these operations in order to
access data from the dataset (for example, a click on a topic
triggers a call to the API, which returns a list of papers).
4. DISCUSSION
When looking at the view displaying topic distribution in all years
(Figure 2.1), one can observe that EDM conference dominates in
almost all topics. This is due to the fact that EDM conference is
being organized longer than the LAK conference (3 years longer),
and thus the LAK dataset contains overall more papers coming
from the EDM conference.
Filtering topics by years allows for observing the popularity of
topics in a particular year and a particular field (LA or EDM).
This further enables one to observe the shift in interest for a
particular topic by researchers in the LA and EDM fields
throughout the years. For instance, one can observe that before
2011, the topic of Learning Analytics was not much popular in the
papers from the EDM field; thus this topic is not displayed at all
in visualizations for years 2008-2010. In 2011, it boomed in
popularity as indicated by the significant rise in the number of
papers covering it. In fact, this was the first year the LAK
conference was organized, and it immediately occupied the
attention of researchers interested in the topic of Learning
Analytics. Interestingly, this topic also gained some traction
among the researchers publishing in the EDM field. In 2012, the
topic’s popularity grew even bigger and the researchers covering
it directed their effort toward the LA field. This resulted in papers
published within the LA field to almost exclusively cover the
topic of Learning Analytics. Similarly, we can observe topics that
have kept high popularity in both areas over years. For instance,
this is the case with the Data topic, obviously as a consequence of
research in both areas concentrating on the analysis of large
amounts of data coming from various learning systems and other
sources.
The application also allows us to observe that topics such as
Intelligent Tutoring System, Prediction and Accuracy and
Precision mostly kept their popularity throughout the years and
stayed exclusively within the EDM field. On the other hand, one
can observe that the large majority of topics have been covered by
both fields. This suggests that the similarities between the two
fields are significant as they share many research topics.
5. CONCLUSION
In this paper we have presented our approach to visualizing topics
and their trends in the LA and EDM fields. Our application allows
for easy identification of the main topics researchers in these
fields have been focusing on, and also exploration of papers
related to those topics.
When compared to other similar tools that provide visualization of
research topics, our tool is the most similar to the previously
mentioned Rexplore tool. However, while Rexplore is more
focused on relations between authors and topics in research areas,
Paperista’s focus is on research topics and their trends over time.
Also, Paperista allows for exploring papers related to different
topics.
Future work for Paperista will be primarily directed towards
extending the system to support other datasets, related to other
research areas. Since the LAK dataset is RDF-based, Paperista
can easily be expanded to support other RDF-based datasets
expressed using the same or related vocabulary, such as the
Semantic Web Dog Food corpus12. Regarding the interface, we
plan to introduce keyword-based search functionality for
searching a topic by its name. This would allow for easy
navigation to a desired topic and filtering papers related to it. The
final goal for Paperista is to become a universal visualization tool
for research papers.
6. REFERENCES
[1] Carter, S. Four Ways to Slice Obama’s 2013 Budget
Proposal. New York Times, 2012. Available online:
http://www.nytimes.com/interactive/2012/02/13/us/politics/2
013-budget-proposal-graphic.html
[2] Bostok, M., Carter, S., and Ericson, M. At the National
Conventions, the Words They Used. New York Times, 2012.
Available online:
http://www.nytimes.com/interactive/2012/09/06/us/politics/c
onvention-word-counts.html
[3] Bostok, M., Ogievetsky, V., and Heer J. D3: Data-Driven
Documents. IEEE Trans. Visualization & Comp. Graphics
(Proc. InfoVis), 2011. Available online:
http://vis.stanford.edu/papers/d3
[4] Dubinko, M. et. al. Visualizing Tags over Time. WWW 2006,
Edinbourgh. Available online:
http://labs.rightnow.com/colloquium/papers/visualizing_tags.
pdf
[5] Zhang H., Korayem M., You E., and Crandall D. J. Beyond
Co-occurrence: Discovering and Visualizing Tag
Relationships from Geo-spatial and Temporal Similarities.
Available online:
http://www.cs.indiana.edu/~zhanhaip/wsdm2012clustering.pdf
[6] Lemma, R. Visualizing the DBLP Database. Bachelor
Thesis, 2010. Available online:
http://www.inf.usi.ch/faculty/lanza/Downloads/Lemm2010a.
pdf
12
http://data.semanticweb.org
[7] Wattenberg, M. Arc Diagrams: Visualizing Structure in
Strings. InfoVis 2002. Available online:
http://hint.fm/papers/arc-diagrams.pdf
[8] Ponzetto, S. P., & Strube, M. (2007, July). Deriving a large
scale taxonomy from Wikipedia. In Proceedings of the
national conference on artificial intelligence(Vol. 22, No. 2,
p. 1440). Menlo Park, CA; Cambridge, MA; London; AAAI
Press; MIT Press; 1999. Available online: http://www.hits.org/english/research/nlp/papers/ponzetto07b.pdf
[9] Ferragina, P., & Scaiella, U. (2010, October). TAGME: onthe-fly annotation of short text fragments (by wikipedia
entities). In Proceedings of the 19th ACM international
conference on Information and knowledge management (pp.
1625-1628). ACM.
[10] Mendes, P. N., Jakob, M., García-Silva, A., & Bizer, C.
(2011, September). Dbpedia spotlight: Shedding light on the
web of documents. In Proceedings of the 7th International
Conference on Semantic Systems (pp. 1-8). ACM.
[11] Osborne, F., & Motta, E. (2012). Mining semantic relations
between research areas. The Semantic Web–ISWC 2012, 410426.
[12] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak,
R., & Ives, Z. (2007). Dbpedia: A nucleus for a web of open
data. The Semantic Web, 722-735.
[13] Milne, D., & Witten, I. H. (2008, October). Learning to link
with wikipedia. InProceedings of the 17th ACM conference
on Information and knowledge management (pp. 509-518).
ACM.
[14] Siemens, G., & Long, P. (2011). Penetrating the Fog:
Analytics in Learning and Education. Educause
Review, 46(5), 30-32.
[15] Osborne, F., & Motta, E. (2012). Making Sense of Research
with Rexplore. The Semantic Web–ISWC 2012
[16] 1st International Conference on Learning Analytics and
Knowledge, Banff, Alberta, February 27–March 1, 2011, link
https://tekri.athabascau.ca/analytics/
[17] Romero, C., & Ventura, S. (2010). Educational data mining:
a review of the state of the art. Systems, Man, and
Cybernetics, Part C: Applications and Reviews, IEEE
Transactions on, 40(6), 601-618.
[18] Siemens, G., & Baker, R. S. D. (2012, April). Learning
analytics and educational data mining: towards
communication and collaboration. In Proceedings of the 2nd
International Conference on Learning Analytics and
Knowledge (pp. 252-254). ACM.
Analysis of the Community of Learning Analytics
Sadia Nawaz
Farshid Marbouti
Purdue University
West Lafayette, IN, USA
Purdue University
West Lafayette, IN, USA
Johannes Strobel
Purdue University
West Lafayette, IN, USA
sadia@alumni.purdue.edu
fmarbout@purdue.edu
jstrobel@purdue.edu
The trends of the learning analytics community being presented in
this paper are in terms of authors, their affiliation and
geographical location. Thus the most influential authors,
institutes, and countries who have been actively contributing to
this field are brought out. In addition, this paper identifies
collaborations among authors, institutes, and countries. The paper
also tries to explore the research themes followed by the learning
analytics community.
Analytics, scholars from different disciplines such as education,
technology, and social sciences are contributing towards this field
[6]. Different authors with different backgrounds, expertise and
purpose publish and present their work in Learning Analytics
related journals and conferences. To draw a better understanding
of who are top collaborates in the field and which institutes and
countries are more active in creating and disseminating
knowledge, we analyzed the data described in the previous
section.
1. DATA AND TOOLS
3. AUTHORSHIP TRENDS
ABSTRACT
The data that is analyzed in this paper consists of the conference
on Learning Analytics and Knowledge (LAK) 2011–2012,
Educational Data Mining (EDM) conference 2008–2012 and the
Journal of Educational Technology and Society (JETS) special
edition on learning and knowledge analytics. This data was
provided on the Society for Learning Analytics Research
(SoLAR) website in xml format [1]. The xml data converted to
tabular data using an xml to csv convertor [2]. The converted csv
files were then processed and merged using macro programming
in MS Excel. Later, this data was using NodeXL tool – an open
source template for Microsoft® Excel® [3]. It allows the user to
work on different worksheets for different operations such as
„Edges‟ worksheet can be used to compute the inter/intra
collaboration. „Vertices‟ worksheet allows the display and
computation of individual node properties such as degree,
betweenness, centrality etc. Other tools that have been utilized in
this paper include NetDraw [4] and IBM‟s Many-eyes [5].
2. MOTIVATION
Complete summary of various (author related) statistics has been
provided in table 1 (detailed definition of these graph theory
related terms is available at [7]). Analysis of authors provides
information which not only helps in understanding the growth of
the field (in terms of publication counts and author counts etc.)
but also is used to predict the future of the field e.g.,: information
such as „connected components‟ and „maximum edges in a
connected component‟ is showing that the graphs are getting well
populated and connected – thus, employing more inclination
towards collaboration Overall it can be said that the field itself is
growing as apparent from node counts (2008-2012) and article
counts (the sum of single and multi-author article counts).
Similarly, self-loop count together with single vertex connected
component can show how many authors of the single authored
publication have / have not collaborated (within this data)? e.g.,
the last column indicates that overall there have been 26 singleauthored articles by 25 authors. It was found that 14 of these
authors have had no collaborative work in this data. And it was
also found that „Stephen E. Fancsali‟ is the only author with two
single authored publications.
With increase of attention to interdisciplinary field of Learning
Table 1: Combined statistics for EDM, LAK and JETS
Graph Metric (graph theory terminologies)
2008
2009
2010
2011
2012
Total
Total unique vertices / nodes (authors)
74
79
151
193
281
623
Unique edges (edge is loop for single author articles & straight line otherwise)
100
106
208
251
435
938
Edges with duplicates (i.e., edge weight is greater than 1)
(These edges show joint authorship in more than one publication )
17
18
50
42
48
337
Total edges
117
124
258
293
483
1275
Self-loop (single author articles)
4
1
3
10
8
26
Multi-author article count
27
31
61
75
96
27
Connected components (authors forming a cluster based on authorship)
20
22
38
53
79
140
Single-vertex connected components
(Count of the authors of single author articles who did not collaborate)
4
0
3
8
7
14
Maximum vertices in a connected component
Maximum edges in a connected component
15
33
7
16
15
36
29
72
22
76
113
370
Table 4: Top 10 authors with highest degree counts
4. COLLABORATION TRENDS
Collaboration as defined in Oxford dictionary [8] is the „action of
working with someone to produce something‟ and in current
context it represents co-authorship of an article by two or more
researchers. This term can be extended to institutes and even
countries and hence extended collaboration patterns will be
extracted between and within institutes and countries respectively.
Table 2 shows that there have been 938 pairs of authors who
collaborated just once (this number includes single author articles
- since in that case a self-loop serves as an edge to itself).
Alternatively, it can be stated that 73.57% of all articles have been
written by the authors who have collaborated just once. It could
either mean that new collaborations are forming or that the
authors published just once and then they started working in other
research areas, with other authors or they started targeting other
venues. Therefore, initiatives such as LAK Data challenge will
attract more researchers towards this field and hence may help in
further growth and development of authorship networks.
Table 2: Overall collaboration pattern
Author Pairs
Article Counts
1
10
2
6
2
5
10
4
15
3
110
2
938
1
1(10)+2(6)+2(5)+10(4)+15(3)+110(2)+938(1) =1275
Table 3 presents some of the top collaborators e.g., N.T.
Heffernan had been a co-author with J.E. Beck and Z.A. Pardos in
6 articles. Such analysis can help in finding active researchers and
collaborators in this field.
Table 3: Top collaborators based on article count
Author
S. Ventura
Neil T. Heffernan
Arnon Hershkovitz
Sujith M. Gowda
Author
C. Romero
Joseph E. Beck,
Zachary A. Pardos
Rafi Nachmias
Ryan S. J. d. Baker
Article Count
10
6, 6
5
5
5. DIVERSITY
Diversity in this context is the count of distinct researchers – a
given author may have worked with. Table 4 aims at identifying
the contributors who have worked with most diverse group of
authors e.g., K.R. Koedinger has worked with 34 distinct authors
and Ryan Baker has worked with 25 distinct authors. We also
extracted the graph of these top contributors (based on degree)
i.e., a graph which includes these top authors and all of their
collaborators; and it was found that this new graph consists of 128
authors (roughly 21% of the total authors). This percentage shows
the significance of the top authors towards EDM, LAK, JETS and
in general towards learning analytics.
Author
Kenneth R. Koedinger
Ryan S. J. d. Baker
C. Romero
Vincent Aleven
S. Ventura
Neil T. Heffernan
Sujith M. Gowda
Mykola Pechenizkiy
Arthur C. Graesser
Jack Mostow
Degree
34
25
19
18
17
16
15
15
14
13
Article Count
17
11
11
5
11
16
5
7
4
12
6. GEOGRAPHICAL LOCATION
Next, the geographical analysis of this dataset is presented which
aims to explore the countries that have been extending this field
especially through contributions to the venues: EDM, LAK and
JETS. There have been contributions from 41 different countries.
For extracting this information, all aliases of a country‟s name
were merged e.g., Netherland, Netherlands, The_Netherlands etc.
were all merged together. The top countries that have had
international collaborations are provided in table 5. Clearly, USA
and UK are on top of the list. To illustrate the collaboration
patterns between countries figure 1 is drawn using „NetDraw‟. In
this figure an edge between two countries depicts the coauthorship between the researchers from these countries. The edge
width (also represented by a number) shows the strength of such
collaboration. Also, different symbols have been used for different
nodes based on their „betweeness‟ values. „Betweenness
centrality‟ is the “number of times a node acts as a bridge along
the shortest path between two other nodes” [9]. Clearly, USA, UK
and Germany are on top of this list based on degree and centrality
measures. It is apparent that most of the nodes have „betweenness‟
value of zero as depicted with a „+‟ symbol. It indicates the
peripheral nature of these nodes and thus depicts the birth or
growth of this field – in that newer nodes are being added and the
graph is currently sparse. Figure 2 illustrates geographical
diversity of collaborators. The smaller circles show lesser
diversity in terms of collaboration (with researchers from other
countries). Similarly, larger circles are indicative of the countries
whose researchers have more diverse group of co-authors (from
across the world). In this figure a small table at the bottom depicts
the count of papers from each continent. Thus it brings out the
most active region for research in the area of learning analytics.
Clearly, North America and Europe are at the top of this list
(complete geographical mapping is available at [5]).
Table 5: Top international collaborators
Country
USA
UK
Australia, Germany
Netherland
Canada, Belgium, Greece, Spain
Degree
11
10
6
5
4
Figure 1: Collaboration in terms of geographical location
Figure 2: Geographical diversity of collaborators
7. AUTHOR AFFILIATION
Next, the institutional affiliation of authors was analyzed and it
was found that there have been contributions from 200 different
institutes world-wide. The ranking of the top few institutes in
terms of collaboration with other institutes is provided in table
6. The term degree represents count of unique institutes that a
given institute may have worked with. This term can be
influenced by both the „article counts‟ and the „coauthor
counts‟. Table 7 provides the institutes with highest count of
intra-institute collaboration and table 8 provides the „institute –
pairs‟ that have had highest collaboration. Such analysis is
beneficial to research institutes and organizations so that they
may collaborate and extend further studies in the field of
learning analytics. Figure 3 illustrates trends of collaboration
between institutes.
Table 6: Top institutes with highest counts of distinct
collaborators
Institute
Carnegie Mellon University
University of Cordoba
Stanford University
Fraunhofer Institute for Applied Information Technology
Dept. Computer wetenschappen, KU Leuven
Worcester Polytechnic Institute
Open University of the Netherlands
University of Pittsburgh
Degree
20
9
8
7
7
7
6
6
Table 7: Top institutes with highest count of intra-institute
collaboration
Institute
Worcester Polytechnic Institute
Carnegie Mellon University
Eindhoven University of Technology
University of Cordoba
University of Memphis
Universitat Oberta de Catalunya (UOC)
University of North Carolina at Charlotte
RWTH Aachen University
Self-loop count
116
107
36
33
31
31
20
16
8. RESEARCH THEMES
In order to track the research themes being followed by learning
analytics society and to see their emergence over time, the
authors conducted a keyword based analysis. The information
for this analysis has been extracted from the keyword (subject)
section of the data provided by Society for Learning Analytics
Research (SoLAR) website [1]. However, for initial two years
i.e., 2008-2009 this field is empty, similarly some of the articles
in later years had this field empty. Therefore, it was decided to
use the „title‟ field for the purpose of keyword extraction. The
selection of „title‟ field rather than the „abstract‟ field for the
purpose of keyword extraction relies on an earlier study by the
authors of this paper [10]. Later, Hermetic Word Frequency
Counter (HWFC) software [11] was used to parse out top 30
keywords for each year. Some of the common English keywords
are already ignored by this software, as available in its stop word
list. Other words which are apparent by the nature of the venues
EDM, LAK and JETS were then manually eliminated (since
they would not bring any insightful information for this
analysis) e.g., student, learn, knowledge, education etc. Further
refinement was made to merge varying instances of the same
word such as „visual, visualize, visualization‟ etc. Then, IBM‟s
Many-eyes software utility was used to obtain the Matrix Chart
as provided in figure 4. In this figure top 30 keywords for each
year have been presented. It should be noted that since the count
of articles and venues has also increased over years; therefore,
the relative rank or position of keywords will be discussed rather
than absolute frequency counts. From this figure, it was found
that the usage of some of the keywords such as „visualization,
intelligent, network*‟ is increasing over time. Some keywords
such as „model*, system*, tutor*‟ retain their ranks. The
keywords „online, collaborat*, performance‟ etc. show
fluctuating trends. Similarly, other trends can be interpreted.
The authors further extracted the context of these keywords: it
was found that „visualization co-occurs with data-mining‟,
„intelligent appears with tutoring system‟. The word „online‟ has
a broader class of co-occurring keywords which includes
„learning, education, university, assessment systems, tutoring,
courses, curriculum‟ etc. Interestingly, in 2012 the context
changed to „online communities, interactions and social
learning‟ etc. Due to space restriction further analysis cannot be
provided in this paper.
CONCLUSION
In this paper the data of past five years of publications related to
learning analytics are analyzed. The trends show increasing
number of authors and more collaboration between authors as
well as institutes. Geographical analysis of authors shows that
scholars from different countries have been collaborating and
contributing towards this field. Top authors, collaborators, and
institutes are identified in this paper. The authors also attempted
to bring out the research themes followed by the learning
analytics community based on the frequency of the usage of
keywords.
The authors plan to extend this study based on author‟s
disciplinary diversity and on the association between authors
and their explored research areas within learning analytics.
Table 8: Top pairs for inter-institute collaboration
Institute
Worcester Polytechnic Institute
Claremont Graduate University
University of Belgrade
Northern Illinois University
Hochschule fur Wirtschaft und Recht
Beuth Hochschule fur Technik Berlin
Universidade Federal de Alagoas
Fraunhofer Institute for Applied Information Technology
Institute
Carnegie Mellon University
University of Memphis
Simon Fraser University
University of Memphis
Hochschule fur Technik und Wirtschaft
Hochschule fur Technik und Wirtschaft
Carnegie Mellon University
Saarland University
Edge weight
37
18
9
9
8
8
8
8
Figure 3: Trends of collaboration in terms of author affiliation
REFERENCES
[1] Taibi, D., Dietze, S., Fostering analytics on learning
analytics research: the LAK dataset, Technical Report,
03/2013
[7] YWORKS, 2013. Y works developer‟s guide glossary.
http://docs.yworks.com/yfiles/doc/developersguide/glossary.html
[2] LUXON SOFTWARE, 2013. Luxon software converter.
http://www.luxonsoftware.com/converter/xmltocsv
[8] OXFORD DICTIONARIES, 2013. Oxford dictionary
collaboration.
http://oxforddictionaries.com/definition/english/collaborati
on
[3] NODEXL, 2013. NodeXL. http://nodexl.codeplex.com/
[4] Borgatti, S.P., 2002. NetDraw Software for Network
Visualization. Analytic Technologies: Lexington, KY
[5] IBM,
2013.
Many
eyes.
http://www958.ibm.com/software/analytics/manyeyes/visualizations/a
nalysis-of-the-community-of-learn
[6] Ferguson, R. 2012. The State Of Learning Analytics in
2012: A Review and Future Challenges. Technical Report
KMI 12-01, Knowledge Media Institute, The Open
University, UK.
http://kmi.open.ac.uk/publications/techreport/kmi-12-01
[9] WIKIPEDIA,
2013.
Wikipedia
centrality.
http://en.wikipedia.org/wiki/Betweenness#Betweenness_ce
ntrality
[10] Nawaz, S., Strobel, J., 2013. IEEE Transactions on
Education – authorship and content analysis, under
preparation
[11] HERMETIC, 2013. Hermetic Word Frequency Counter.
http://www.hermetic.ch/wfc/wfc.htm
Figure 4: Keyword analysis for research theme extraction
Cite4Me: Semantic Retrieval and Analysis of Scientific
Publications
Bernardo Pereira Nunes
Besnik Fetahu
Marco Antonio Casanova
PUC-Rio
Rio de Janeiro, Brazil
L3S Research Center
Appelstrasse 9a
Hannover, Germany
PUC-Rio
Rio de Janeiro, Brazil
bnunes@inf.puc-rio.br
casanova@inf.puc-rio.br
fetahu@l3s.de
ABSTRACT
This paper presents the Cite4Me Web application and its features
created for the LAK Challenge 2013. The Web application focuses
on two main directions: (i) interlinking of the LAK dataset with
related data sources from the Linked Open Data cloud; and (ii)
providing innovative search, visualization, retrieval and recommendation of scientific publications from the LAK dataset and related
interlinked resources. Our approach is based on semantic and cooccurrence relations to provide new browsing experiences to Web
users and an overview of scientific data available. Furthermore, we
present a detailed analysis of the LAK dataset along with applications which contributes to the development of the learning analytics
field.
1.
INTRODUCTION
The volume of information on the Web has been growing steadily
over the last decade and has doubled every two years. The vast
amount of data available on the Web along with new means of
communications have transformed our society, including the way
we work, live, relate to each other and learn.
In the midst of change, the Learning Analytics emerges to make
sense of the produced educational data reported by learners, professors, institutions and so on. Analyzing and understanding the
changes along the past years help us to understand the current state
and be aware of the forthcoming trends, enabling a new outlook of
the future of learning.
A recent challenge initiative of SOLAR1 and LinkedUp2 project
arises to leverage the creation of tools that enables the analysis,
visualization, browsing and recommendation of scientific and educational data.
Although the scientific field has fostered the creation of new
applications in several areas, such as medical, biology, physics,
amongst others, the information access is based mostly on free
text search and on hierarchical classification system of the pub1
Society for Learning Analytics Research - http://www.
solaresearch.org
2
http://linkedup-project.eu/
lications3 . However, current approaches by main digital library
providers, such as ACM Digital Library4 and Elsevier5 , do not
represent the current state of research on exploring resources using approaches from Information Retrieval, Information Extraction
and Semantic Web. Thus, get an overview of research topics, find
publications and discover new nomenclatures are an arduous and
laborious task that are not always successful.
In this paper, we introduce Cite4Me a novel application for exploratory search, retrieval and visualization of scientific publications. Cite4Me intends to provide to the end users a single point for
accessing papers and hence reducing efforts of searching in several
data sources. Our system takes advantage of reference datasets,
such as DBpedia6 , to explore semantic relationships between scientific papers and user queries. Additionally, an analysis of topic
coverage and shared concepts from related educational datasets, extracted from the Linked Open Data cloud, will be introduced.
The remaining of the paper is organized as follows. Section 2
presents the approach used for searching, retrieving and recommending papers. Section 3 describes the process of dataset discovery and interlinking and Section 4 shows a brief result analysis
of the data discovery. Finally, Section 5 presents related work and
Section 6 presents some concluding remarks.
2.
CITE4ME
As one of the main goals of the field of “Learning Analytics” is
to support students in their learning process, we developed a Web
application called Cite4Me7 that assists students in making decisions to find scientific publications and identify relevant research
topics.
Cite4Me implements semantic and co-occurrence methods to (a)
search and retrieve scientific publications; and (b) recommend scientific publications. Moreover, it provides a Web interface that facilitates the search for publications and may help users on discovering related terms to a given query.
In this section, we provide an overview of the major features
of the Web application and its Web interface that assist users to
explore scientific data on the Web.
2.1
Search and Retrieval
Cite4Me relies on search functionalities to meet the users needs.
Briefly, we implemented standard Information Retrieval (IR) and
Semantic Web (SW) approaches to retrieve and recommend scientific papers to the users. We divided this subsection into (i) free text
3
http://www.acm.org/about/class/
http://dl.acm.org
5
http://www.elsevier.com
6
http://dbpedia.org
7
http://www.cite4me.com/
4
c 2013 by the papers’ authors. Copying permitted only for priCopyright vate and academic purposes.
LAK-Data Challenge ’13 Leuven, Belgium
search; (ii) exploratory search; and (iii) semantic search.
2.1.1
Free Text Search
The purpose of the free text search functionality is to offer users
the abilities to search for mentions, titles and authors of academic
publications contained in the LAK dataset. Even though, this functionality is similar to existing digital libraries, we agree that this is a
basic functionality that must be provided by our application. Therefore, we use standard vector space models (tf-idf ) for indexing and
retrieving documents.
The tf-idf scores were computed for each term extracted from the
publication content after applying stemming [14]. Furthermore, the
searching functionality offers boolean queries with standard operators, such as ’OR’, ’AND’, and also a ranking of the matching
publications based on the sum of tf-idf scores from the individual
query terms.
In summary, our free text search provides to the users publications (P) that match query terms and non-matching publications
P0 , which are related to P according to a degree of similarity (see
Eq. 1), but does not contain the query terms.
The similarity between a matching publication P and other nonmatching publication P0 in the LAK dataset is measured by the
standard cosine similarity measure, which is built on top of the
computed tf-idf scores.
S im(P, P0 ) =
P · P0
|P||P0 |
(1)
where P and P0 represent the tf-idf scores for the terms in two distinct publications.
2.1.2
Semantic Search
Cite4Me provides also a semantic search engine that assists users
to find publications semantically related to the query terms. Analogously to explicit semantic analysis (ESA) technique [5], the relatedness score, is computed between the enriched concepts found in
the publications’ content.
Basically, the semantic search is an adaptation of the free text
search presented in the Section 2.1.1. Instead of computing the
tf-idf scores for the words in a text, it computes the tf-idf score
8
2.2
http://dbpedia.org/spotlight
Paper recommendation
Another key feature of our system is the paper recommendation
based on semantic relationships extracted from reference datasets.
The recommendation is based on a previous work [12, 11], where
we exploit the number of paths and the distance (length of a path)
between given entities to compute a relatedness score between extracted entities and associated documents. The first step to measure
the relatedness between documents is to compute the semantic connectivity score (S CS e ) of the entities found in each text (see Eq. 2).
S CS e (a, b) =
τ
X
βl · |paths<l>
(a,b) |
(2)
l=1
where |paths<l>
(a,b) | is the number of paths between a and b of length l
and 0 < β ≤ 1 is a positive damping factor. As in [12, 11], we used
β = 0.5 as our damping factor. Furthermore, we also constrained
the length of a path to τ = 4.
Based on the score for entities, we then define the semantic connectivity score (S CS w ) between two documents W1 and W2 as follows:







 X
1
|E
∩
E
|
1
2
 ∗
S CS e (e1 , e2 ) +
S CS w (W1 , W2 )= 
 |E1 | ∗ |E2 |

2


e1 ∈E1
e2 ∈E2
Exploratory Search
In this section, we provide detailed insights on the exploratory
search functionality of our application. As a preliminary step to
provide analytics and information about the actual content and topics coverage, all the scientific publications contained in the LAK
dataset are previously enriched. The enrichment process was performed using DBpedia Spotlight API8 , where entities, entity types
and their respective categories were extracted.
After the enrichment process, we cluster the publications according to entities and its categories found in each document. The publications are clustered in a tree-based structure over the enrichments.
Note that, each node of the tree represents a topic in which a publication under this node covers. Thus, the exploratory search is
performed through the topics covered by each publication.
The process of linking publications, categories and extra resources is mediated by DBpedia knowledge graph, where we use
the dcterms:subject property to match the resources.
Thus, as a result, the exploratory search provides a way to
explore resources through the connections between their topics,
which facilitates the search for topically related resources. Figure 1
shows the exploratory search.
2.1.3
for the entities contained in a publication. Finally, the ranking of
the results is based on the sum of the tf-idf scores of the matching
concepts.
Figure 2 illustrates the semantic search functionality. It also generates a tag cloud from matching publications, showing the most
prominent terms for a given query. Specifically, the tag cloug based
on the results helps the users to have an insight about the topics and
may assist in finding related terms previously unknown by them.
(3)
e1 ,e2
where Ei is the set of entities associated with Wi , for i = 1, 2. Note
that documents that contain the same entities receive an extra bonus
(the second term on the right-hand side of Eq. 3).
Thus, a list of documents pairs is generated and ranked according
the score and suggested to the user. Figure 3 illustrates the paper
recommendation process computed based on S CS w .
3.
DATASET DISCOVERY AND INTERLINKING
This section briefly describes the datasets used on automatic related data discovery from DataHub9 and future steps on dataset discovery and interlinking.
3.1
LAK Dataset
The LAK dataset contains the metadata of the papers published
in the proceedings of LAK conference 2011-12, a special issue of
Learning and Knowledge Analytics: Educational Technology &
Society, the proceedings of the International Conference on Educational Data Mining (2008-12) and the Journal of Educational Data
Mining (2008-12). In total, 315 descriptions of papers containing
detailed information about authors, institutions, conference venues
and the full content of the paper were available.
3.2
Data Analysis
The goal of the data analysis procedure is to align the various
publications in the LAK dataset based on mutual information, such
9
http://www.datahub.io
Figure 1: Preview of the exploratory search funcionality.
Figure 2: Preview of the semantic search funcionality.
Figure 3: An example of paper recommendation based on S CS w .
4.
EVALUATION OF DATA ANALYSIS
AND DATA DISCOVERY
This section presents an overview of the results obtained by analyzing the LAK dataset with respect to the constructed feature set
that describes topics covered by individual publications. Moreover,
based on the data analysis procedure and shared information, we
show that the establishment of links between the different publications within the LAK dataset and from other datasets in DataHub
is possible.
In the following subsections, we show the analysis of the LAK
dataset and the discovery of relevant datasets and publications.
4.1
Figure 4: Relevant Dataset Discovery Framework based on
the generated feature set used to query DataHub Linked Data
provider.
as the topics covered by them. This is achieved using well established datasets like DBpedia10 and Freebase11 , where a reference
point for the unstructured textual content of publications is created
through an enrichment process.
Again, the enrichment process is carried out using DBpedia
Spotlight12 [10] and addresses several issues of significant importance. For instance, it offers several advantages such as: (i) identification of (common) named entities, (ii) disambiguation; and (iii)
expansion of the limited dataset and resource descriptions with additional background knowledge.
3.3
Data Discovery
Our Web application uses as its starting point the instances in
the LAK dataset to automatically explore and recommend to users,
datasets that covers similar topics. In order to query, detect and
interlink related datasets, we chose the DataHub as a data provider.
DataHub serves as a collecting point of datasets from various fields
and currently it has over 5000 datasets. Note that, from the large
number of datasets, only 300 datasets are provided as Linked Open
Data. As the latter is the main focus of our work, the analysis and
interlinking process is focused for such datasets.
Briefly, the data discovery is performed using CKAN13 data
management framework from DataHub, where based on data analysis and user interests (such as topics covered by a publication/resource) related datasets are suggested.
Additionally, we provide to the user a set of resources, amongst
other data analytics, that enables the user to harvest and correlate
new information from the discovered resources, considering the
LAK dataset as a starting point of such discovery.
This approach presents several advantages such as the adoption
and the widespread use of Linked Data principles for publishing
scientific papers. Nowadays, many conferences make their proceedings and journals freely accessible, hence our approach would
take advantage of such open data and offer users topically relevant
papers for a particular resource in the LAK dataset.
10
http://dbpedia.org
11
http://www.freebase.com/
12
http://spotlight.dbpedia.org/
13
http://www.ckan.org
Data Analysis
The data analysis of the LAK dataset focuses mostly on assessing the individual publications for their topic coverage. In this manner, we build a connected data graph consisting of the individual
publications and items from the feature set. This step is necessary
to provide the exploratory search functionality, where based on the
established edges between publications and feature set items, we
can navigate through the publications or topics of interest. Therefore, the results obtained with respect to the constructed feature set
and LAK dataset graph are shown in what follows.
Table 1 shows the top ranked items for each of the feature sets,
along with the number of associations an item has with respect to
all publications (entity, category and type items). Figure 5 shows
the constructed data graph for the LAK dataset.
4.2
Data Discovery
After creating the feature set based on the information provided
from reference datasets, we are able to query for relevant datasets
in DataHub.
Thus, for the top ranked feature set items, the data discovery
for relevant resources is considered. Table 2 shows the discovered
resources and datasets for the top-10 entity items. Note that, we focus only on bibliographic datasets, since we aim at recommending
topically related scientific publications. Due to the lack of bibliographic datasets, we were not able to find related publications
for all entities considered. Table 2 summarizes the discovered resources. The dataset names are represented by their acronyms as
follows: b3kat - “Bayerische Staatsbibliothek", hebis - “Hessisches Bibliotheks Informations System" and npg - “Nature Publishing Group - ALL".
Additionally, from the set of 96 bibliographic datasets available,
only a few of them were offered as Linked Data, thus narrowing
our search space for relevant resources.
Entity
Data
Learning
Data mining
Algorithm
Education
Analysis
Student
Knowledge
Methodology
Statistics
b3kat
14
5
0
4
17
42
7
11
4
7
hebis
0
0
0
0
1
1
1
0
0
0
npg
12
1
0
0
2
6
1
0
1
1
Table 2: Number of discovered resources from the bibliographic
group for the top ranked items from the entity feature set, based
on the LAK dataset.
Entity
Data
Learning
Data_mining
Algorithm
Education
Analysis
Student
Knowledge
Methodology
Statistics
System
Scientific_modelling
Prediction
Data_set
Statistical_classification
Evaluation
Standard_deviation
Probability
Behavior
Interaction
Assoc.
90
80
67
50
49
48
46
46
42
41
37
37
36
36
30
29
29
28
26
24
Category
Educational_psychology
Data_analysis
Learning
Scientific_method
Neuropsychological_assessment
Greek_loanwords
Data
Evaluation_methods
Computer_data
Research_methods
Systems_science
Formal_sciences
Data_management
Cognitive_science
Statistical_terminology
Developmental_psychology
Intelligence
Data_mining
Critical_thinking
Thought
Assoc.
161
150
139
137
136
135
131
129
126
124
118
108
108
107
107
101
93
91
87
84
Type
DBpedia:TopicalConcept
Freebase:/book
Freebase:/book/book_subject
Freebase:/media_common
Freebase:/media_common/quotation_subject
Freebase:/computer
Freebase:/education
Freebase:/education/field_of_study
Freebase:/computer/software_genre
Freebase:/internet
Freebase:/internet/website_category
Freebase:/award
Freebase:/media_common/media_genre
Freebase:/organization
Freebase:/award/award_discipline
Freebase:/business
Freebase:/organization/organization_sector
Freebase:/people
Freebase:/film
Freebase:/book/periodical_subject
Table 1: Top ranked items from the feature set for the LAK Dataset, from the dataset analysis.
Industrial processes
Unit
operations
Laboratory
techniques
Distillation
Separation
processes
Paper
products
Books
http://data.linkeded...
Quantification
Computer
security mo...
Multivariable
calculus
Differential
operators
Religious
fundamenta...
Java
specification
Competition
JVM programming
lang... r...lan...
Java programming
Economic
Social
statusproblems
Narcissism
language...
JavaProgramming
platform
Musical notation
Predicate logic
Propositional calculus
Artisans
Host
cities
ofBay
theCompany...
C...
Hudson's
Populated
places
est...
Orders
ofEdmonton
magnitude
Kindness
Populated
places est...
Garment
industry
Hybrid
vehicles
Electric
vehicles
Drawing
Engines
Economic
development
Development
economics
Genetic
algorithms
Analog
circuits
British
brands
Clothing
retailers o...
Theories
of law
Fashion programming
design
Retail Genetic
companies
of
...algorit...
Reproduction
Evolutionary
Propositions
Companies
establishe...
Teachable
units
for ... ininference
Supermarkets
Companies
based
L... of Nort...
Immediate
Flynn's
taxonomy
1884
establishments
...
Zoology
Department
stores
Spirituality
Department
stores
of...
design
MarksElectronic
& Spencer
Military
Legal
terms
Multiple
births
Electronics
terms
Popular
Mediaculture
studies
Cluster
computing
Fault-tolerant
compu...
Local
area
networks
Supercomputers
Theology
Concurrent
programmi...
Oligarchy
Social events
School
examinations
Educational
qualific...
1951 introductions
Russell
Group
Lawyers
Case
Western
Reserve...
Authority
Lecturers
Ecclesiastical
titles institut...
Educational
Lists byofcountry
History
Cambridge
Oxbridge
Organisations
based
...
Coimbra
Group
University
of Cambri...
Visitor
Anglican
ecclesiasti...
Law
degrees
Culture
in attractions
Cambridge...
Counterculture
Anglican
priests
Universities
and col...
Legalestablishments
education
1209
...
Logic symbols
Marxist theory
Classes of computers
http://data.linkeded...
Systems of formal lo...
Environmental
design
Landscape
Architecturearchitecture
Visual
arts physical ...
Individual
Art of
media
Works
art
Secret
military
prog...
Military
projects
Manhattan
Project
Machines
Atomic bombings
of
H... of ...
Military
history
Codeofnames
Articles
withhistory
exampl...
Military
...
Criticism
Nuclear
weapons
prog...
Review
websites
Articles
with of
exampl...
Units
frequency
Musical
Nucleartechniques
history
ofRhythm
t...
Literary
concepts
Philosophy
Nuclear
weapons
of t...
Syntax
C programming
langua...
Class-based
programm...
Multilingual
websites
Philosophy
of biology
Programming
language...
Inverse
problems
Crystallography
Internet
activism
Lexis
(linguistics)
Dynamically
typed
pr...
Programming
language...
Approximation
algori...
Collaborative
projects
Developmental
biology
Reality
by type
GNUstep
Customary
units
of m...
Objective-C
NeXT
General
encyclopedias
Discrete
geometry
Continuous
mappings
Habitat
(ecology)
te...
Wikipedia
Search
algorithms
Wikimedia
projects
Open
content
projects
HistoryMathematical
Chemical
properties
Landscape
ecology
Mineralogy
Free
encyclopedias
constants
Commerce
Condensed
matter
phy...databases
Articles
withOnline
exampl...
Lexical
Mass
dictionaries
Online
encyclopedias
Internet
properties
...
Creative
Commons-lic...
ForcePhysiology
Functional
programming
Grammatical
cases
Power
laws
Searching
Imperial
Tails
ofunits
probability...
Statistical
mechanics
Quantity
Computational
physics
Computer
physics
eng...
String
(computer
sci...
Operators
(programmi...
Discovery
Communicat...
Video
game
development
Television
channels
... tel...
Discovery
Channel
English-language
Structural
engineering
Modes
Bridges
Catholic
music
Melody
types music
Ancient
Greek
Higher
education
in ...
Higher
education
in ...
Architectural
commun...
Personality
MBTI
types tests
Personality
typologies
Jungian
tradition
Symbols
Logical connectives
Notation
Semiotics
Military
units
ando...
f...
Academic
United
States
Navy
Ad-hoc
and
for...
Model theory
Peerliterature
review
Taskunits
forces
ExpertScience
witnesses
VirtuePsychologists
Tests studies
Acronyms
MentalMusical
health
profes...
Safety
Master's
degrees
AIDS
origin hypotheses
Pandemics
Scientists
composition
Association of Indep...
Health Syndromes
disasters
Proof theory
HIV/AIDS
Formal languages
Vocabulary
Christian
iconography
Combat
Set theory
Measuring
instruments
Trigonometry
Monetaryineconomics
Federalism
the Un...
Cross
symbols
Christian
symbols
Quality
management
Paper
Enterprise
architect...
Heraldic
ordinaries
Papermaking
Computer storage media
Stationery
Christian
terms
MilitaryTransducers
operations
...
Information
Punched
card
Foodarchitects
safety
Unit
record
equipment
Recursion
Electronics
Evidence law
Intention
IBM unit
record
equi...
Pragmatics
Religious
symbols
Self care
Programming language...
TraditionalWriting
Chinese
... materials
Cognitive
neuroscience
Causality
Employment
compensat...
Engineers
media
IBM
storage
devices
Government
of
the
Un...
Sensors
History
of
computing...
Packaging
Mental
content
Money
History of software
School qualifications
Theory
of computation
Units of angle
Industrial
automation
Causal inference
Articles to be merge...
Central processing
u...
Programming
processing
1789 establishments ...
ChinesePrinting
inventions
http://data.linkeded...
Self-referenceidiomsInstruction
Physical exercise
http://data.linkeded...
Automation
materials
Microprocessors
Paper art
Buildings
and struct...
Style (fiction)
Building
Middle
States Associ...
Electricity
Electric
current
Real estate
Ballistics
Theatre
Culture jamming
Traditions
Alchemical processes
Performance
art tech...
Combinatorics
Measure
theory
Television
technology
Argumentson words
Finite
rings
Integral
calculus
Digital
imaging
Ceremonies
Instruction set arch...
Graduation
Science occupations
Articles with exampl...Metalogic
Computer-aided
design
Domain-specific
prog...
Languages
Literature
Internet radio
Digital geometry
ArtComputer
movements
Psychiatry
controver...
Linguistic research
Trains
graphics da...
Physical layer proto...
InternetPeercasting
broadcasting
Engineering discipli...
Art materials
Firearm
terminology
Video
game genres
Corpus
Literary
Modular arithmetic
Writing criticism
occupations Fiction
media
syst...linguistics
Microsoft
Internet
television
RailWindows
transport
Criminology
Computer engineering Streaming
Action
(genre)
North Central
Associ...
Psychiatric
diagnosis
Digital
television
Materials
Action
video
games
Continuing
education
Electronic engineering
Labeling
theory
Cloud
storage
Philosophical
method...
Entertainment
http://data.linkeded...
Applications of dist...
Political engineering
Music
Fiction-writing
mode
Metaphysics
Academic administrat...
Reality
Video on demand serv...
Association of Ameri...
Analogy argume...
Behaviorism
Philosophical
Wikipedia articles w...
http://data.linkeded...
Subjects taught in m...
Drug delivery
devices
Homogeneous
chemical...
Independent agencies...
Solutions
http://data.linkeded...
1950 establishments
Foundations
based in... ...
Examinations
Dosage forms
Organizations
establ...
Nothing
Concepts in aesthetics
Space
National
Science Fou...
Colloidal
chemistry
Belief
Physical
chemistry
Simple living
ScienceFunding
and technolo...
Pittsburgh History &...
Veracity
bodies
Environments
Musical terminology
Psychiatric institut...
Educational institut...
Elementary
arithmetic
Approximations
http://data.linkeded...Laboratories
Macroeconomics
History of psychiatry
Software
bugs
Core
issues
in
ethics
Universities
and
col...
Documents
Anti-psychiatry
Medical terms
Carnegie
Mellon
Univ...
Political culture
Oak Ridge
Associated...
Social institutions
Integrated
circuits Reliability
Italian loanwords
Failureengineer...
Cohort studies
http://data.linkeded...
Inquiry
Discovery
and invent...
Grammar
frameworks
Theories
of aesthetics
Maintenance
Printing
Semiconductor devices
Arts
Healthcare quality Units of time
Syntactic
Dispositional beliefs
Rhetoric
Highertransforma...
category theory
Evidence-based
Bias
Health
informaticsmedic...
Geometric
shapes
Fractions
Communalism
HTML
Social
theories
Informal fallacies
Noam
Chomsky
http://data.linkeded...
Telecommunications
Books by type
Philosophical
theories
Morphology
Giving
Australian inventions
Meetings
Metanarratives
TextbooksLatin words and phra...
http://data.linkeded...
Category
theory
Interest (psychology)Grid computing
History
of sociology
Physical cosmology
Fluid
dynamics
Fluid
mechanics
Ratios Least squares
http://data.linkeded...
21st linguistics
century
Theoretical
physics
Single equation meth...
Chronology
New
Thought
terms
Generative
Binary operations
Paradoxes
Pipingmechanics
CheminformaticsString
Hierarchy
Continuum
Typingtheory
Employment
Economics of uncerta...
Aerodynamics
Millennia
Modern
history
School terminology
century
Natural20th
philosophy
Films by type ge...
Multi-dimensional
Idealism
Wealth
Adulthood
Public speaking
Architectural
design
Functionalism
Module
theory
Conformity
Philosophy of life
Empiricism
Development
Neuroscience
3D computer graphics
Real analysis
People in informatio...
Presentation
Skills
Academic
pressureIndustry
in...
Software requirements
Construction equipment
Schoolteachers
Computer programmers
Language
psychology
Education Positive
and traini...
Infrastructure
National Association...
Production economics
Universities and col...
Academic
degrees
http://data.linkeded...
Virtual communities
3Deffects
imaging
Educators
Standards-based Construction
educ...
Elementary and prima...
Outsourcing
Community organizing
http://data.linkeded... Visual
Calendars
Definition
Selection
Horology Performing
Order theory
Projects
Design
High schools and sec...
Memory
arts
Patent law
Education terminology
Population genetics
economics
Philosophical
logic
Engineering
occupati...
1801Welfare
establishments
...
Economic growth
New York Post
Spacetime
http://data.linkeded...
Teaching
Individualism
Alexander
Hamilton
Graphics
file formats
Publications
establi...
PatternsYouth
Syntactic entities
Newspapers
published...
Education
economics
Standardized tests
Accounting
terminology
SelfExploration
School-related terms
Contemporary
artart
Postmodern
Centimetre%E2%80%93g...
Electronic
documents
Adobe
Systems
Engineering
Project management
Elections
Economicconcepts
indicators
Orders of magnitude ...
Research
institutes
Humanities
Marriage
Legal
entities
Legal
research...
Vertical
transport
d...
News Corporation sub...
Occupations
1764 establishments
Physical quantities
Colonial
architectur...
Government
Happiness
SI base units
Campuses
http://data.linkeded... Energy
ISO
standards
Economics
ofCompanies
regulat...instituti...
Time
Ethical principles
Digital
press
Cultural
studies
Massachusetts
Instit...
EmbeddedPlanned
systemsscience and ...
Learning to read
Doctoral
degrees
Newbased
England
Scale model scales
Organizations
... Associat...
Corporatism Cohort study methods
Political
terms
Introductory physics
Agriculture
Educational
http://data.linkeded...
Institutions
founded...
Didactics
Motivation
Utility
Emotions
Elevators
Qualities of thought
Labour
law institut...
modeling
Grammar Fundamental physics ...
Software optimizationScale
Unitedmanagement
Nations
Gener...
Change
Private
universities
Personal development
Corporations
Forms
of atti...
government
Students
Occupations
in music
Positive
mental
Classical genetics
Ethics
Political science te...
Think
tanks
Conductors
(music)
Harvard
University
Computer performance
Analytic functions
Titles
Standards
organizati...
Model aircraft
Dynamical systems
Democracy
Scientific
documents
Meaning (philosophy ...
Rooms
Health
research
Music performance
Education reform
Higher education
Business
Rhodelaw
Island in...the ...
Finance
16th
arrondissement
Help
desk
Concurrent computing
Formal systems
Brown
University
Evidence-based
pract...
Remote
desktop
Uncertainty of numbers
Public relations
Organisation
Mechanical vibrations
Topology Numbers
Educational facilities
Risk SchoolsElectronic feedback
Corporations
law for Eco...
Public
universities
Theses
Mathematical axioms
Organizational
culture
Georgian
architectur...
Law
Customer
experience Socioeconomics
...
Exponentials
Filter frequency res...
Page
description
lan...
Psychology
Communication
science
Social
concepts
Mathematical analysis Encodings
Regulation
Mathematical proofs
Colonial
Colleges
Educational
institut...
History
of ideas
Concepts in physics
Axiology
Types of
business
en...
Social psychology
International
econom...
http://data.linkeded...
Wave mechanics
Non-profit
organizat...
Social
anthropology
Iterative methods
Mathematics education
Educational stages Television
Academia
Quality
assurance
Governance
Budgets
Fundraising
Polynomials
Culture
COBOL terminology
Professional
titles ...
Academic
institutions
Discourse analysis
Object-oriented
prog...
Sociology
of
culture
Philosophy
of
language
Human
behavior
Validity
(statistics)
University
of
Chicago
Vector
graphics
Statistics
education
Group
processes
Heat
transfer
Meta-analysis
Harvard
Medical
School
Role
status
Charles Sanders Peirce
Gothic
archi...
Filter theory Mathematical optimiz... Computational fluid ...
Mergers
andRevival
acquisit...
Social philosophy
Training
.NET programming lan...
Mathematical science...
Underwater diving sa...
Behavior
Polymorphism
Sociodynamics
Systematic
review
Biotechnology
Summary statistics
f...
Cambridge
Identity
Committee
on Institu...
Computer
libraries
Genetics
or genomics...
Compilers
Sun Microsystems
Greek
inventions
http://data.linkeded...
Sorting algorithms
Logic
Universities
and
col...
School types
Professorial
degrees
Numerical analysis
Abstract algebraMathematics
Organizations
establ...
Home economics
Syntactic relationsh... Community
Equations
Types
of
functions
Applied
linguistics
Technology
Infinity
http://data.linkeded...
Types of university
...
Engineering
Mind%E2%80%93body pr...
Film and video techn...
Control theory
Elementary algebra
Science education
State functions
Interpolation
Coding theoryLinear algebra
Alternative education
Theories of mind
Algebra
http://data.linkeded...
Probability distribu... Philosophy of psycho...
Sentences by type
Phenomena
Phenomenology
Mathematical
concepts
Behavioral concepts
Human resource manag...
Psychological
theories
Interdisciplinary fi...
DNA
Production and manuf...
Philosophical concepts
Special functions
Web 2.0 neologisms Local
http://data.linkeded...
Computer
programming...
Chemical engineering
government
Operations research
Education policy
Parallel computing
Potential
http://data.linkeded...
Citizen media
http://data.linkeded...
Lexicography
http://data.linkeded...
Branches
of philosophy
Differential equations
Attention-deficit
hy...
Human-based computat...
Phase transitions
Psychology articles ...
Attention
http://data.linkeded...
Ecological metrics
Peer-to-peer computing
Functions and mappings
Semantics
Ontology
http://data.linkeded...
Linguistics
Oral communication
Manufacturing
Optimal control
Arabic loanwords
File sharing networks
Systems engineering
Computer storage
Culturaldev...
economics
Education by subject
Neuroimaging
Curricula
Image processing
Cryptography
Population
BehaviouralPerception
sciences
Radiology
Evaluation
Recording
EvolutionaryUnits
biology
Sound production
tec...
terminology
Brain
methods theory
Classified information
ofEconomics
information...
Integral transforms
Accountability Learning Business
Simulation
Data collection
Unsolved
problems
in... sensitiv...
Information
Consciousness
Concepts
in metaphys...
Computer
storage
Optimization
Storage
media
Mental processes
Object-based
program...
Behavioral and socia...
Organs algorit...
English words and ph...
Mathematical structu...
Programming
language...
Academic publishing
Leadership
Demography
Physics
Wagering
BASIC interpreters
Geometric algorithms
Greek loanwords
Grammatical voices
Microsoft
Visual
Stu...
Film techniques
BASIC
compilers
Microsoft
BASIC
Calculus Physical sciences
http://data.linkeded...
Capital
Mathematical economics
Creativity
BusinessPositions of authority
Concepts
Procedural programmi...
Management occupations
Socialsociology
economy
Management
Algorithm descriptio...
Articles with exampl...
Statistical forecast...
Economic
http://data.linkeded...
Game theory
Limbic system
http://data.linkeded...
Environmental issues...
Biology theories
Theories
Emotion
Bayesian inference
Group theory
Concepts in ethics
Covariance and corre...
Epistemology
http://data.linkeded...
Concepts in epistemo...
http://data.linkeded...
Human communication
Error detection and ...
Time series
analysis
Cognition
Pedagogy
Policy
http://data.linkeded...
Systems psychology
DebatingExperimental psychol...
Progressive Era in t... MultimediaSocial epistemology
BiostatisticsMathematical relations
Video game design
Architects
Mathematical termino...
Philosophical school...
Philosophy of educat...
Science experiments
http://data.linkeded...
Politics by issue
Formal methods
Systems
Educational television
American philosophy
Western
art
Radioactivity
History
of museums
Social sciences
Translation
Education
Arabic words and http://data.linkeded...
phr...
Mathematical physics
Philosophical moveme...
Branches
of
psychology
Types
of
organization
Normal
distribution
Sports terminology
Historical
scientifi...
Metrology
Social sciences meth...
Education-related te...
Pragmatism
Geometry
History
of biology
Museum
collections
Terminology
Writing
Bibliometrics
Relationship counsel...Interpersonal relati...
Government
Demographics
Research
Political philosophy
Forteana
Educational assessme...
Methods in sociology
Westerncourt
classical
mu...
Collecting
http://data.linkeded...
History
of earth
sci... development
Standardization
http://data.linkeded...
Philosophy
ofEconomies
mind
Canadian
system
Knowledge sharing
Consumer behaviour
Human
M-estimators
http://data.linkeded...
Length
http://data.linkeded...
National
security
http://data.linkeded...
Professional
certifi...
Critical
thinking
Strategic
management
Former
courts
and
tr...
Economic systems
Demographic economics
Mind
Problem solving
Compiler constructionReference
Support
vector
machi...
United
States
federa... ratios
User interfaces
Teleconferencing
Statistical
Nature Wikis
Statistical terminol...
Institutes
Quality
control
Integrated developme...
http://data.linkeded...
Measurement
Film production
Transportation
plann...
Engineering statistics
Video
Hypothesis
testing
(computing)
Bioinformatics
1993Process
introductions
Supply chain
managem...
Tools
Philosophy of science
Associative arrays
Greek words and phra...
Population ecology
Videotelephony
Adolescence
Statements
Structure
http://data.linkeded...
http://data.linkeded...
Article Feedback 5 A...
Accessibility
Operating system tec...
Sociology index
Knowledge
Urban design
Philosophy of mathem...
Ergonomics
Usability
Probability assessment
Linear programming
Programming paradigms
Proprietary
database...
Web analytics
Data-centric
program...
Marketing
Personality theories
Postgraduate schools
Vector calculus
http://data.linkeded...
Russian inventions
Error
Design of experiments
ERP
software
Gambling terminology P-complete problems
Internet
marketing
http://data.linkeded...
Object-oriented prog...
Biology
software
Social issues
Library science
Mathematical finance
Mental structures
Cybernetics
Systems
theory
Concepts in logic
Educational technology Programming1992
language...
Data warehousing
Health
Psychometrics
Observation
Continuous distribut...
Privacy
Elementary mathematics
Convex optimization
Acoustics
Desktopdatabase
databases...
app...
Microsoft
Analytics Digital rights
Business planning
User interface techn...
Arrays
Qualitative research
Sociological termino... Literacy
Pharmacokinetics
Thought
http://data.linkeded...
History of mining
Information Age
Statistical theory
Financial data analy...
http://data.linkeded...
Human rights
Decision theory Periodic table
Earth sciences
Conditionals
Clubs and societies
Abstraction
Organizations
Nonverbal
communicat...
Statistical deviatio...
Business terms
Information
technology
Interrogative words ...
Psychological testing
Academic disciplines
Statistical models
Bayesian
statistics
Theory of probabilit...
Society
Articles with incons...
Mining
Knowledge
representa...
Regression
analysis
Neuropsychological
a...
Anthropological
cate...
Information
Occupational
safety science
...
Genetics
Programming
language...
http://data.linkeded...
Classification
systems
Diagrams
Social groups
Ecology
Computing
Signal processing
Articles with exampl...
Sampling (statistics)
Strategy
Collaboration
Process management
Cultural history
Mathematical logic http://data.linkeded...
Economics
Computer programming
American inventions
Innovation
Urban studiesSociolinguistics
and pl...
http://data.linkeded...
ComputerInnovation
occupations
economics
Evaluation methods
Scientific revolution Planning
http://data.linkeded...
Actuarial science
Resources
Latent variable models
Statistical inference
International relati...
http://data.linkeded...
Science and technolo...
Crowdsourcing
Comparison
of assess...
http://data.linkeded...
Information technolo...
Reading
Sources
of
knowledge
Heuristics
Parametric statistics
Analysis
Virtual
reality Anthropology
enforcement
titles
Educational
software
http://data.linkeded...
http://data.linkeded...
Learning
Probability
Means
Standards
LawLaw
enforcement
occu...
Types of marketing
Conceptual models
http://data.linkeded...
Part-time employment
Dynamic programming
Randomness
Medical statistics
Product management
Microsoft developmen...
Human–computer
inter...
Market research
Ecosystems
Publishing
terms
http://data.linkeded... Complex dynamics
Law
enforcement
Science
Business
software
Sociological terms
Cognitive science
Personal life
Educational psycholo...
Probability and stat... http://data.linkeded...
Law enforcement
occu...
Computing
terminology
Summary
statistics
Conservation
Distance education
http://data.linkeded...
http://data.linkeded...
Police
ranks
Basic concepts
se...
Futurology
Singapore
Police
Force
Logic in
and
statistics
Questionnaire constr...
History of Organizational
education behav...
Formal sciences
Source codeScientific
method
http://data.linkeded...
Tool-using species
Articles containing ...
Vectors
Functional analysis
Human%E2%80%93comput...
Learning management ...
Cultural landscapes
Communication
Research methods
http://data.linkeded...
http://data.linkeded...
Megafauna of Australia Hidden Markov models
Decision
Support Sys...
Technical factors of...
Virtual learning env...
Expert
systems
Developmental psycho...
http://data.linkeded...
Environmental science
Architectural termin...
http://data.linkeded...
Places
Probability
theory
Methodology
Physical
objects
http://data.linkeded...
Critical phenomena Ship construction
http://data.linkeded...
Interaction
Mathematical notation
of scie...
Logical consequence
http://data.linkeded...
Estimation theory
Forestry
ComputerEpistemology
data Interpretation
(phil... Prediction
Environment
Economic anthropology
Megafauna of Eurasia
Statistical data sets
Systems science
http://data.linkeded...
Personality
Generalized linear m...
Social research
Sociological theories
http://data.linkeded...
Technical
communicat...
Aptitude
Plants
http://data.linkeded...
Apes
Mathematical
modeling
Materials science http://data.linkeded...
http://data.linkeded...
Megafauna
of North A...
Infographics
http://data.linkeded...
Natural language pro...
Scientific modeling
Trees
Factorial
and binomi...
Online gaming
services
Artificial intellige...
http://data.linkeded...
Online
games
Internet
properties
...
Cosmopolitan
species
Human
geography
Information
retrieval
Folksonomy
http://data.linkeded...
http://data.linkeded...
Secure
communication
Hypertext
Sequences
Photo
sharing
Socialization
http://data.linkeded...
Medicinal chemistry
Plotting
Statistical
methodsand series
Econometrics
Megafauna of South
A... software
Reputation management
Risk management
Archival
science
Discrete distributions
Computer
art
Data management Auxiliary sciences o...
1999 introductions
X86-64
Linux distrib...
Real algebraic geome...
Content management s...
Plant morphology
Educational
video
ga...
Linux
numerical
anal...
name
system
Groupware
Sensitivity analysis
Student
culture
Data analysis
RiskFacebook
analysis
Systems
ecology
Numerical
programmin...
Complex Domain
systems
theory
1988management
introductions
Social
bookmarking
Real numbers
Software architecture
Self-organization
Revolutionary tactics
Intelligence
Document
...
Mathematical series
Semantic
Web
%3C!--Professions--%...
Debian-based distrib...
Data
analysis software
Network
performance
Megafauna of
Africa
Quality control tools
Organizational theory
IRIX software
Computer-mediated
co...
Internet
access
Free software cultur...
Permutations
Taxonomy
Health
promotion
Security
http://data.linkeded...
Probability distribu...
Mathematical and
qua...
Transdisciplinarity
http://data.linkeded... Educational psychology
DebianPostmodernism
http://data.linkeded...
Mass media
Zoomable
User Interf...
Computer
network
sec...
Statistical intervals
C softwarecross-pl...
Blog
hosting
services
Matter
Interoperability
XML-based
standards
Proprietary
Business Poisson
intelligence
http://data.linkeded...
Social
systems
English inventions
processesPharmacology
Workflow
technology
1993 software
Symbian
software
Media
technology
Social
constructionism
Psychological
attitude
Virtual
avatars
Internet
privacy
Psephology
Education
Uni...
Companies
establishe...
Logical fallacies
Array programming la...
Statistical
data types
Cognitive
psychology
Social
networks theory
Performance
management
Matrices
State
schoolsininthe
the...
Information systems
Education schools
Categorical data
Video
game culture
Translation
studies
Communication
Parsing
Data
Superorganisms
Network
theory
Data structures
Elementary special
f... models
Intelligence
(inform...
Biological
systems
Cascading
Style Sheets
Industrial design
2004
establishments
1989 introductions
Markov
Network
addressing
Mobile
computers
Deliberative
methods
Algorithms
on strings ...
Natural sciences
Teacher training
Windows
Phone
software
Conjugate prior dist...
http://data.linkeded... Exponential
Application software
software
Web design
Web services
Types of databases
Film
and
video termi...
Websites
which
mirro...
Types
of
library
family d...
Reasoning
Games
of Bada
mental
skill
Computer architecture
Software
development
Digital
humanities
Internet Protocol
Cross-platform
softw... trade
Human
rights
by
issue
Hawaiian
words and p...
http://data.linkeded...
Broadband
Windows word http://data.linkeded...
process...
Digital
libraries
Vector spaces
Community
buildingInternational
Stylesheet
languages
E-commerce
Statistics
Software
design
Graphical models
Distributions with Megafauna
c... Numerical linear alg...
Social information
Computational
statis...
Learning
psychology
1983
software
Articlesp...
in
need
of ...
MUD
terminology
Underlying
principle...
Information
Articles with exampl...
Community
websites
Blog
software
Symbiosis
Computer
graphicsengineering
Home
computer
software
Companies
establishe...
Technical
communicat...
Mac OS
X word
proces...
Industrial
Notetaking
Stable distributions
Dimension
Animals described
in...
Curves
Atari
ST software
Estimation
of densit...
Privately
held
compa...
Bayesian networks
RSS servers
Collective intellige...
Intellectual
propert... 1969 introductions
Open methodologies
Decision trees
Global
internet
Quality
Proxy
Quantitative
research
Mac OS word processors
Marketing
research
c... comm...
Mathematical sciences
Holism
Machine learning
Computer
security
so...
Scottish inventions
Internet
protocols
Lifelong
learning
Invasive mammalStatistics
spec... articles ...
Socialsc...
networking se...
Computer
security
Classification algor...
Deviance
and pr...
social ...
Computer law
2000s in computer
Laptops
Geography
http://data.linkeded...
Automatic
identifica...
Digital
media
Companies
based
in
R...
Data
mining
Social
media
Logarithms
Articles
with
exampl...
Communication design
Cloud applications
Representation theor...
Sociology
Buzzwords
Experimental physics
Environmental health
Relational database
...
BlackBerry
software
Crime
prevention
Rights
Data modeling
Charts
InternetWeb 2.0
http://data.linkeded...
http://data.linkeded...
World Wide Web
Representation theory
Internet
forum termi...
InternetNeologisms
ages
Theoretical computer...
Persistent
Worlds
Prospect http://data.linkeded...
theory
Web
syndication
form...
Digital technology
http://data.linkeded...
Internet
architecture
Learning in computer...
http://data.linkeded...
Windows software
Bilinear operators
Enterprise applicati... New media
http://data.linkeded...
Internet memes
Identifiers
Algorithms
Qualia
Sociocultural
global...
Brand management Documentary
film tec...
Articles with exampl...
Statistical
classifi...
Speech recognition
http://data.linkeded...
Sound
Microsoft Office
Sparse matrices
interpre...http://data.linkeded...
Service-oriented
(bu... platformsScience-related lists
photography
Control flowDigitalProbability
Metadata
Computing
Networks
http://data.linkeded...
Analytical chemistry
Waves Radiation health eff...
Information technolo...
Assistive technology
Euclidean solid geom...
Hearing
Statistical charts a...
Mutation
Elementary shapes
History of radio
All articlestheory
lacking...
Computability
http://data.linkeded...
Television genres
Utility software type Websites
http://data.linkeded...
Computational resour...Elementary geometry
SQL
One
Articles lacking sou...
Complex numbers
Computer science
Articles
including
r...
Metric
geometry
http://data.linkeded...
Database theory
Article Feedback 5
Web development
Declarative
programm...
Articles
with exampl...
Applied mathematics
Local
government
Local
government
dis...in ...
http://data.linkeded...
Web applications
http://data.linkeded...
Computational science
Philosophy
disambigu...
Disk
file systems
Places
in Berkshire
...
Economics
ofest...
transpo...
Relational model
Populated
Integrals
ofplaces
telecommu...
Modernism neuros...
Computational
SoftwareHistory
companies
b...memory
Computer
String
similarity
me... Area
Constraint programming
Economic
geography
Historical eras
Query languages
Towns
in Berkshire
Sociocultural
evolut...
Input/output
Reporting
Cloud
computing
States
of the United...
ComputersComputer
Northern
American co...
Companies
establishe...
Electronic circuit v...
DOS
on authorities
IBM
PC compat...
Linear operators in ...
languages
Unitary
...
Epidemiology
Local
authorities
ad...Household
Multivariate statist...
Radio formats
Member
states
ofincome
NATO
Theories of history
Globalization
Accesshttp://data.linkeded...
control
Database management ...
Markup languages
Subdivisions
of the......
Cloud
platforms
Countries
bordering
Logic in computer sc...
Online
education
bordering
...informati...
http://data.linkeded...
IBM
PC compatibles
English-speaking
cou...
Kennet
and AvonCountries
Canal
First-level
administ...
Cultural geography
Geographic
Income
in
the United...
Website management
Former
confederations
History of television United
Particle physics
States
Visualization (graph...
Superpowers
Identity
management
Country
subdivisions...
Earth
sciences
data
...
School districts... Internet search algo...
Combinatorial optimi...
States and
territori...
Data modeling langua...
1776
establishments
Justification
G8 nations
Countries
borderingdensity
...
http://data.linkeded... Logging
Search
engine optimi...
Population
Google
Income countr...
http://data.linkeded...
Bicontinental
Library cataloging a...
Thermodynamic entropy
Computational comple...
Algebraic structures
Analytic geometry
Link analysis
Programming language...
Internet companies o...
Pharmaceutical indus...
Data
types
Databases
Secondary
education
Internet properties
...
Philosophy of therma...
Real-time
web
Combinatorics
Statistical natural ...
http://data.linkeded...
Privately
Cartography
Geometric measurement
Twitterheld compa...
Automotive
transmiss...
Automobile
transmiss...
Mechanical
power con...
Engagement
http://data.linkeded...
Knowledge bases
Classical logic
19th-century
mathema...
1856
births
SaintRussian
Petersburg
Sta...
statisticians
Political theories
20th-century
Nationalism mathema...
People from
Ryazan
Sovereignty
1922 deaths
Sustainability
Full
Members
of war
the
...
Boolean
algebra
Aftermath
of
LogicalGrief
calculi
Three-digit
telephon...
Members
of
the ...
Red
Hat
Free Full
package
managem...
Linux
package
manage... Distance
Probability
theorists
Open University
Health
sciences
education i...
Russian
mathematicians
Health
care
Aid
Inductive
reasoning
1981
introductions
Educational
institut... ...
1969 establishments
Emergency
telephone
Former
Eastern
Ortho...
Primary
care ... Exempt
Medicine
charities
introductions
Distance
education i...
MIPS 1968
Technologies
of Commo...
Archive
formats Association
Public
services
Charities based in B...
Nursing
International
Charities based in S...
Support develo...
groups
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
Neuropsychology
Sexual arousal
Sexual emotions
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
Frequency domain ana...
Electrical circuits
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
http://data.linkeded...
2006
establishments
...
Companies
based
inFree
S...
educational
sof...
Online chat
Free
learning
suppor...
Internet culture
Free learning manage...
Free Free
content
managem...
software
progra... Internet forums
Computer networking
Youth rights
Telecommunications
e...
Types of communities
Open problems
Emerging technologies
English
Heritage
Listed
buildings
inof...the U...
Archaeology
Applied
psychology
British
architecture
Town
and
country pla...
Aesthetics
Consensus
reality
Educational
research
Systemic
Risk - Beha...
Collaborative
software
Subroutines
Recommender
Application
layer pr... systemshttp://data.linkeded...
Free software
Multiple choice
Programming
constructs
University
of Cambri...
Computational lingui... GraphArtificial
theory intellige...
http://data.linkeded...
Files
World Wide Web Conso...
Computer file systems
Clinical research
http://data.linkeded...
Inter-process commun... http://data.linkeded...
Vision
Photography
Non-parametric stati...
Computer security
http://data.linkeded...
Graphic
design
Maps
Survey
methodology
Types
of polling
French
loanwords
Graphical
user Android
inter...
IOS software
software
Neural networks
System
administration
Services
management ...
industry
VariableSoftware
(computer
p...
GUI game
widgets
Video
gameplay
Business
models
Statistical
outliers user inter...
Software
distribution
Graphical
Video game
terminology
Computer networks
Companies listed on ...
Technology in society
Computer file formats
Internet
search
engi...
History
of the
Inter...
Software licenses
Building
engineering
Advertising
publicat...
Computer jargon
Blogging
Words coined
in the ...
Literary
genres
Politics and
technol...
Internet terminology
Network HTTP
protocols
Server
hardware
Servers
(computing)
http://data.linkeded...
http://data.linkeded...
Set families
South
Asian countries
Article
Feedback
Bla...
India
Member states
of the...
States
and territori...
G15 nations
Member
states of the...
Countries
the Ind...
BRICS
Former British
colon...
International rankings
Systems
Modeling
Lan...
Matrix decompositionsOpen formats
Unified
Modeling
Lan...
http://data.linkeded...
Humans
Private equity portf...
Asymptotic
analysis
Analysis
of algorithms
Trees (data structur...
Data serialization f...
Fundamental
analysis
Stock market
Foreign
exchange
mar...
Derivatives
(finance)
Tasks
of
Natural
lan...
2002
software
.NET
framework
Commodities
market
Navigation
Modelstranslation
oftr...
computation
Computer-assisted
Machine
Concurrency
(compute...
Formal
specification...
Petri
nets applicatio...
Microsoft
GPS
Mac OS X software
Distributed computin...
Software
usingprogramm...
the M...
Freedesktop.org
Application
Operating
systems
Free
software
Freegraphics
windowing
systems
X Window
System
Thermodynamics
Non-standard
Binary
arithmetic analysis
Rotation
Model
selection
Oracle
acquisitions
Free software
compan...
Surfaces
Pi
Google Earth
ofLexical
mathematics
Software
that uses Qt
2005units
software
Angle
Units History
of information
Words
Linux
software variable
Field theory
Regression
...
Orientation
Geometric
centers
Temperature
Astrological
aspects
companies
b...
Binary
treesSoftware
Triangle
centers
Mathematics
of
infin...
Keyhole
Markup
Langu...
Remote
sensing
GIS
fileTravel
formats
Bilinear
Open
Alliance
Differential
geometr...
Units
of linguistic
...forms
BigTable
implementat...
Companies
establishe...
Nucleic
acids
Freeware
Primitive Circles
types
Affine
geometry
Virtual
globes
History
ofClassical
calculusmechanics
Orbits
Celestial
mechanics
Kinematics
Conic
sections
Euclidean
geometry
Companies
based
in S...
http://data.linkeded...
Perimeter security
Telecommunication th...
Spreadsheet
software
Mac
OS software
Inventory
Signs
of death
Enterprise
modelling
Composting
Business
processwaste
Biodegradable
Anaerobic
digestion
Articles
containing
... ...
Hydrology
Water
waveshazards
Sociological
theory
Water
streams
Weather
Historiography
Floodlandforms
Weather
Fluvial
Geomorphology
Rivers
Postmodern
theory
BasicWater
meteorological...
http://data.linkeded...
Nursing
Food and
Drugresearch
Admini...
DrugClinical
discovery
trials
1985 software
NP-complete
problems
Graph families
Perfect graphs
connectivity
1971Differential
in Graph
computer
sci...
topology
Complexity
classes
Parity
Graph
data structures
Differential
geometry
Matrix
normal
Singular
valueforms
decom...
Glossaries
of mathem...
Manifolds
Matrix
theory
Algebraic
graph
theory
Experiments
Stochastic processes
Statistical programm...
Statistical software
Web browsers
Software
development...
http://data.linkeded...
Software
Project development...
management s...
Markov processes
Chemical kinetics
Environmental moveme...
Coordinate systems
Ren%C3%A9
Descartes
Lie algebras
Lie groups
Geometric
topology
Robust statistics
Java platform software
http://data.linkeded...
Software engineering
Free
algebraicgroup
struc...
Geometric
theory
Combinatorial
...
Propertiesgroup
of groups
Blogs
Australian
televisio...
Ethernet
English-language
tel...
Kohlberg Kravis
Robe...
Seven
Network
Networking
hardware
Television
channels
...
Member states of the...
G20 nations
Federal countries
Liberal democracies
Spam filtering
http://data.linkeded...
Symmetry
Exploratory data ana...
ClusterGeostatistics
analysis
Geodesy
Text messaging
Discrete mathematics
Network architecture
Spreadsheet
file file
for...
Bibliography
fo...
PresentationXML
layer p...
Computer hardware co...
Cloud
computing
prov...o...
Computer
companies
1969
in computer
Directed
graphs sci...
Composite
data
types
IBM
Collier
Trophy
recip...
1975
establishments
...
National
Medal
of
Te...
Electronics
companie...
Semiconductor
compan...
1896
establishments
...
Companies
establishe...
UML
Partners
American
brands
Information
theory
Computer
storage
com...
Display
technology
c...
Social
classes
Companies
establishe...
Socialism
Abstract compan...
data
types
security
Companies
listed
on ... so...
Point
ofComputer
sale
compan...
Multinational
Publicly
traded
comp...
1911
establishments
Microsoft
Software
companies b...
Companies
based
in...W...
Social
divisions
Companies
establishe...
Companies
based
in R...
Transaction
processing
Dow
Jones
Industrial...
Companies
in the
Dow...
Windows
administration
Children
Olympic
sports
Safety codes
Stairways
Childhood
MacArchitectural
OS user interface
elements
Figure Ice
skating
Sports
entertainment
Garden
features
dancing
Blogospheres
Solid mechanics
Property se...
Hacking
(computer
Web security
exploits
Deformation
Injection
exploits
Vitaceae
Property
law
Computer Viticulture
security
ex...
Grape
varieties
Social
inequality
Software
testing
Security
compliance
WoodUniv...
Alumni of Woodworking
Keele
People
from
Stoke-on...
The
Pogues
members
1955
births
English
guitarists
English
banjoists
People Living
associated
wi...
people
Woodcarving
Figure 5: Topic coverage of LAK data graph for the individual resources.
Number theory
http://data.linkeded...
Integers
Cardinal numbers
Zonohedra
Prismatoid
polyhedra
Space-filling
polyhe...
Cubes
Platonic
solids
Volume
Assoc.
150
142
142
138
136
125
122
120
120
118
118
114
105
103
103
102
99
99
94
93
5.
RELATED WORK
Cobo et al.[3] presents an analysis of student participation in online discussion forums using an agglomerative hierarchical clustering algorithm, and explore the profiles to find relevant activity patterns and detect different student profiles. Barber et al. [1]
uses a predictive analytic model to prevent students from failing
in courses. They analyze several variables, such as grades, age,
attendance and others, that can impede the student learning.Kahn
et al. [7] present a long-term study using hierarchical cluster analysis, t-tests and Pearson correlation that identified seven behavior
patterns of learners in online discussion forums based on their access. García-Solórzano et al. [6] introduce a new educational monitoring tool that helps tutors to monitor the development of the
students. Unlike traditional monitoring systems, they propose a
faceted browser visualization tool to facilitate the analysis of the
student progress. Glass [8] provides a versatile visualization tool to
enable the creation of additional visualizations of data collections.
Essa et al. [4] utilize predictive models to identify learners academically at-risk. They present the problem with an interesting
analogy to the patient-doctor workflow, where first they identify the
problem, analyze the situation and then prescribe courses that are
indicated to help the student to succeed. Siadaty et al.[13] present
the Learn-B environment, a hub system that captures information
about the users usage in different softwares and learning activities
in their workplace and present to the user feedback to support future
decisions, planning and accompanies them in the learning process.
In the same way, McAuley et al. [9] propose a visual analytics to support organizational learning in online communities. They
present their analysis through an adjacency matrix and an adjustable timeline that show the communication-actions of the users
and is able to organize it into temporal patterns. Bramucci et al. [2]
presents Sherpa an academic recommendation system to support
students on making decisions. For instance, using the learner profiles they recommend courses or make interventions in case that
students are at-risk.
In the related work, we showed how different perspectives and
the necessity of new tools and methods to make data available and
help decision-makers.
6.
CONCLUSION
In this paper we presented the main features of the Cite4Me Web
application. Cite4Me makes use of several data sources to provide
information for users interested on scientific publications and its
applications.
Additionally, we provided a general framework on data discovery and correlated resources based on a constructed feature set,
consisting of items extracted from reference datasets. It made possible for users, to search and relate resources from a dataset with
other resources offered as Linked Data.
For more information about the Cite4Me Web application refer
to http://www.cite4me.com.
7.
REFERENCES
[1] R. Barber and M. Sharkey. Course correction: using analytics
to predict course success. In Proc. of the 2nd International
Conference on Learning Analytics and Knowledge, LAK
’12, pages 259–262, New York, NY, USA, 2012. ACM.
[2] R. Bramucci and J. Gaston. Sherpa: increasing student
success with a recommendation engine. In Proc. of the 2nd
International Conference on Learning Analytics and
Knowledge, LAK ’12, pages 82–83, New York, NY, USA,
2012. ACM.
[3] G. Cobo, D. García-Solórzano, J. A. Morán, E. Santamaría,
C. Monzo, and J. Melenchón. Using agglomerative
hierarchical clustering to model learner participation profiles
in online discussion forums. In Proc. of the 2nd International
Conference on Learning Analytics and Knowledge, LAK
’12, pages 248–251, New York, NY, USA, 2012. ACM.
[4] A. Essa and H. Ayad. Student success system: risk analytics
and data visualization using ensembles of predictive models.
In Proc. of the 2nd International Conference on Learning
Analytics and Knowledge, LAK ’12, pages 158–161, New
York, NY, USA, 2012. ACM.
[5] E. Gabrilovich and S. Markovitch. Computing semantic
relatedness using wikipedia-based explicit semantic analysis.
In Proc. of the 20th international joint conference on
Artifical intelligence, IJCAI’07, pages 1606–1611, San
Francisco, CA, USA, 2007. Morgan Kaufmann Pub. Inc.
[6] D. García-Solórzano, G. Cobo, E. Santamaría, J. A. Morán,
C. Monzo, and J. Melenchón. Educational monitoring tool
based on faceted browsing and data portraits. In Proc. of the
2nd International Conference on Learning Analytics and
Knowledge, LAK ’12, pages 170–178, New York, NY, USA,
2012. ACM.
[7] T. M. Khan, F. Clear, and S. S. Sajadi. The relationship
between educational performance and online access routines:
analysis of students’ access to an online discussion forum. In
Proc. of the 2nd International Conference on Learning
Analytics and Knowledge, LAK ’12, pages 226–229, New
York, NY, USA, 2012. ACM.
[8] D. Leony, A. Pardo, L. de la Fuente Valentín, D. S.
de Castro, and C. D. Kloos. Glass: a learning analytics
visualization tool. In Proc. of the 2nd International
Conference on Learning Analytics and Knowledge, LAK
’12, pages 162–163, New York, NY, USA, 2012. ACM.
[9] J. McAuley, A. O’Connor, and D. Lewis. Exploring
reflection in online communities. In Proc. of the 2nd
International Conference on Learning Analytics and
Knowledge, LAK ’12, pages 102–110, New York, NY, USA,
2012. ACM.
[10] P. N. Mendes, M. Jakob, A. García-Silva, and C. Bizer.
Dbpedia spotlight: shedding light on the web of documents.
In Proc. of the 7th International Conference on Semantic
Systems, I-Semantics ’11, pages 1–8, New York, NY, USA,
2011. ACM.
[11] B. Pereira Nunes, S. Dietze, M. A. Casanova, R. Kawase,
B. Fetahu, and W. Nejdl. Combining a co-occurrence-based
and a semantic measure for entity linking. In ESWC, 2013
(to appear).
[12] B. Pereira Nunes, R. Kawase, S. Dietze, D. Taibi, M. A.
Casanova, and W. Nejdl. Can entities be friends? In
G. Rizzo, P. Mendes, E. Charton, S. Hellmann, and
A. Kalyanpur, editors, Proc. of the Web of Linked Entities
Workshop in conjuction with the 11th International Semantic
Web Conference, volume 906 of CEUR-WS.org, pages
45–57, Nov. 2012.
[13] M. Siadaty, D. Gašević, J. Jovanović, N. Milikić, Z. Jeremić,
L. Ali, A. Giljanović, and M. Hatala. Learn-b: a social
analytics-enabled tool for self-regulated workplace learning.
In Proc. of the 2nd International Conference on Learning
Analytics and Knowledge, LAK ’12, pages 115–119, New
York, NY, USA, 2012. ACM.
[14] C. van Rijsbergen, S. Robertson, and M. Porter. New models
in probabilistic information retrieval. 1980.
Visualizing the LAK/EDM Literature Using Combined
Concept and Rhetorical Sentence Extraction
Davide Taibi1, Ágnes Sándor2, Duygu Simsek3,
Simon Buckingham Shum3, Anna DeLiddo3, Rebecca Ferguson3
1
Institute for Educational
Technologies
Italian National Research Council
Via Ugo La Malfa 153
90146 Palermo, Italy
davide.taibi@itd.cnr.it
2
Parsing & Semantics Group
Xerox Research Centre Europe
6 Chemin de Maupertuis
F-38240 Meylan, France
agnes.sandor@xrce.xerox.com
ABSTRACT
Scientific communication demands more than the mere listing of
empirical findings or assertion of beliefs. Arguments must be
constructed to motivate problems, expose weaknesses, justify
higher-order concepts, and support claims to be advancing the
field. Researchers learn to signal clearly in their writing when
they are making such moves, and the progress of natural
language processing technology has made it possible to combine
conventional concept extraction with rhetorical analysis that
detects these moves. To demonstrate the potential of this
technology, this short paper documents preliminary analyses of
the dataset published by the Society for Learning Analytics,
comprising the full texts from primary conferences and journals
in Learning Analytics and Knowledge (LAK) and Educational
Data Mining (EDM). We document the steps taken to analyse
the papers thematically using Edge Betweenness Clustering,
combined with sentence extraction using the Xerox Incremental
Parser's rhetorical analysis, which detects the linguistic forms
used by authors to signal argumentative discourse moves. Initial
results indicate that the refined subset derived from more
complex concept extraction and rhetorically significant
sentences, yields additional relevant clusters. Finally, we
illustrate how the results of this analysis can be rendered as a
visual analytics dashboard.
Categories and Subject Descriptors
2. THE LAK DATASET
We selected the LAK Dataset1 published by the Society for
Learning Analytics Research (SoLAR2), which provides
machine-readable plain-text versions of the Learning Analytics
and Knowledge (LAK) conference proceedings and a journal
special issue related to learning analytics, and of the Educational
Data Mining (EDM) conferences and journal.
The corpus was extracted using the SPARQL endpoint of the
LAK dataset. The corpus comprised the following:



General Terms





Design
Keywords
Learning Analytics, Corpus Analysis, Scientific Rhetoric,
Visualization, Network Analysis, Natural Language Processing
Our overall aims are to provide users automatically with
suggestions about similar papers, about connections between
papers, and to present these similarities and connections in ways
that are both meaningful and searchable.
In order to achieve this, we integrated three different approaches
to linking and analysing a specific dataset of scientific papers
(see section 2). These approaches were:
1.
2.
3.
network analysis
rhetorical analysis
visualization of the results
The Open University
Knowledge Media Institute &
Institute of Educational Technology
Milton Keynes, MK7 6AA, UK
firstname.lastname@open.ac.uk
Network analysis yields sets of related papers based on
statistical corpus processing (Section 3). In order to improve the
precision of information about the content of the connections
among the papers, we carried out semantic and rhetorical
analysis (Section 4). On the one hand, we extracted similar
concepts in order to provide topical similarity indicators
(Section 4.1) and, on the other hand, we extracted salient
sentences that indicate the main research topics of these papers
(Section 4.2). We repeated the statistical analysis of this reduced
list of concepts, and of the reduced list of salient sentences. At
the end of this paper, we present the design and implementation
of the first prototype of an analytics dashboard (Section 5),
which is designed to summarize results of the socio-semanticrhetorical analysis in a way that users will find both meaningful
and easy to explore.
K.3.1 [Computers and Education]: Computer Uses in
Education
1. INTRODUCTION AND MOTIVATION
3
24 papers presented at the LAK2011 conference
42 papers presented at the LAK2012 conference
10 papers from the journal of Educational Technology
and Society special issue on learning analytics
31 papers presented at the EDM2008 conference
32 papers presented at the EDM2009 conference
64 papers presented at the EDM2010 conference
61 papers presented at the EDM2011 conference
52 papers presented at the EDM2012 conference
For each resource, the title, description and keywords properties
were used to feed the data mining processes employed in our
analysis. At the end of this initial process, a relational database
was used to store 305 papers, 599 authors, 448 distinct
keywords. After this preliminary phase the entire LAK Dataset
1
LAK Dataset: http://www.solaresearch.org/resources/lak-dataset
Published by SoLAR and made available to the LAK Data challenge
of the 3rd International Conference on Learning Analytics and
Knowledge (http://lakconference.org)
2
http://www.solaresearch.org
was analyzed by using the Xerox Incremental Parser (XIP) [1]
for concept extraction and rhetorical analysis, a total of 305
papers, from which XIP extracted 7,847 sentences and 40,163
concepts.
3. STATISTICAL ANALYSIS
A preliminary analysis reported the most-used keywords, the
most frequently occurring authors and the most-referenced
papers. A second phase of analysis was then carried out using
the data-mining tool, RapidMiner [2].
3.1 Statistical Data from RapidMiner
A three-step process was developed in order to analyze the
corpus using the data-mining tool, RapidMiner:



Process documents from file: this module generates
word vectors from the text files.
Select attributes: This allows users to select the
attributes to be considered by the analysis. In our case,
a threshold was set in order to eliminate less important
elements in the word vectors.
Data to similarity: This module was used to calculate
a similarity index for the conference papers based on
Cosine similarity.
and their aggregations [4]. The yEd tool allows users to balance
quality and speed of the cluster algorithm by the use of a slider.
When the quality is set at the highest value, the Girvan and
Newman algorithm is used in its normal form. At the opposite
end, the lowest quality value produces the fastest running time.
In this case it executes a local betweenness calculation following
Gregory’s algorithm [5]. When a mid value is chosen for quality
and speed, the fast betweenness approximation of Brandes and
Pich [6] is applied. In this case, less accurate clustering is
balanced by a lower execution time.
The clusters created with yEd have the following properties:

each node (paper) is a member of exactly one cluster

each node shares many edges with other members of
its cluster, where edges represent the connection
between a pair of papers if their similarity values is
more than a threshold value (0.3 in our experiment).

each node shares few or no edges with nodes of other
clusters
Figure 1 shows a visualization of the primary clusters. Some of
the clusters did seem to have thematic coherence, while others
were harder to label:





The first block ‘Process Documents from file’ is made up of the
following steps:





Tokenize: This operator splits the text of a document
into a sequence of tokens.
Replace token: This operator is used to replace
tokens, for instance in cases where words are
misspelled.
Filter tokens (by length): This operator filters tokens
based on their length. In our case, all the words with
fewer than three characters were removed.
Filter stopwords (English): This operator filters
English stopwords from a document by removing
every token that is the same as a stopword from the
built-in stopword list.
Stem (Snowball): This operator stems words by
applying stemming using the Snowball tool.3
Cluster 1: collaborative, learning, social
Cluster 2: skills, model, slip, guess, parameters
Cluster 3: causality, variables, model, construct
Cluster 4: question, fit, grain, school, skill
Cluster 5: translating, sentences, grinder, corpus
At the end of the main process, the ‘Data to Similarity’ step
returns two results:
a)
b)
The list of the most relevant words (stemmed version)
used in the entire corpus
The measured similarity index between the papers that
make up the corpus.
We employed the similarity relationships between papers to
build a network of papers. In this network each node represents
a paper, and an edge between two paper is created if the
similarity value of a pair of papers overcome a threshold of 0.3.
3.2 Analysing the Network of Papers
The network of papers was then analysed with the yEd tool4 in
order to extract clusters of documents using the algorithm for
natural clusters “based on Edge Betweenness Clustering
proposed by Girvan and Newman” [3]. This algorithm has been
successfully used in Network Analysis to study communities
3
http://snowball.tartarus.org
4
http://www.yworks.com
Figure 1: Results of initial LAK paper clustering analysis
The complete list of the papers belonging to the clusters has
been reported in the web page5 associated to this work.
This analysis was word-driven and not concept-driven. The next
step was to try and refine this by distilling (1) a richer set of
concepts, and (2) a more salient subset of sentences.
4. SEMANTIC ANALYSIS
In order to go beyond full-text statistical analysis and find
connections between papers at the level of the claims they make,
we processed the corpus using the Xerox Incremental Parser
5
http://www.pa.itd.cnr.it/lak-data-challenge.html
(XIP) [1] for extracting concepts and rhetorically salient
sentences [7].
4.1 Concept Extraction
The basic module of XIP performs morphosyntactic analysis,
part-of-speech tagging, constituent analysis and dependency
extraction on free text. Since we define concepts as simple or
compound noun phrases, they can be identified using general
morphosyntactic analysis. Examples of extracted concepts are
analytics, learning analytics, social learning analytics and
social network analytics.
4.2 Rhetorical Analysis
Scientific research does not consist in providing a list of facts,
but in the construction of narrative and argumentation around
facts. In articles, researchers make hypotheses, support, refute,
reconsider, confirm, and build on previous ideas in order to
support their ideas and findings. The aim of rhetorical analysis is
to detect where authors signal that they are making such moves.
This analysis builds on the widely studied feature of research
articles that, besides their well-defined standard structure (title,
abstract, keywords, often IMRAD body structure) rhetorical
moves emphasize articles’ contribution to the state of the art,
and the research problems they address. In previous work [7] we
described a list of rhetorical moves that characterize such salient
messages, together with the extraction methodology. Figure 2
lists the detected rhetorical moves (in caps) together with
examples of expressions that mark them.
A basic observation concerns the distribution of the pairs of
similar papers yielded by the three methods. According to the
expectations, the most similarity pairs have been yielded by
taking into account the full text only in both the LAK and the
EDM collection. There are considerable overlaps among the
three methods, and there are cases when just one method yields
similarity pairs. In subsequent evaluations we aim at evaluating
these various cases. As a first step towards a more complete
evaluation, we have selected some pairs of papers and checked
their similarity according to some independent similarity
indicators. We have found that our statistical method is coherent
with independent similarity indicators in case of high similarity
scores and that in these cases, similarity is found with and
without XIP-extracted text. This indicates the validity of our
statistical method in these cases for finding related papers. In the
case where no independent similarity indicator could be found,
but we do have XIP-based similarity pairs, we looked for related
key claims or findings in the pairs of papers 7. In the cases where
the similarity score between the two papers was high we did find
such interesting related claims in the two papers. However, in
cases where the similarity measure is low, we did not find any
related claims. This indicates that we might want to define a
threshold score. The details of the preliminary tests are reported
in the web page.
5. XIP DASHBOARD
The XIP Dashboard was designed to provide visual analytics
from XIP output in order to help readers assess the current state
of the art in terms of trends, patterns, gaps and connections in
the LAK and EDM literature. The dashboard also draws
attention to candidate patterns of potential significance within
the dataset:



the occurrence of domain concepts in different
metadiscourse contexts (e.g. effective tutoring
dialogue in sentences classified as contrast).
trends over time (e.g. the development of an idea)
trends within and differences between research
communities, as reflected in their publications.
5.1 Implementation
Figure 2: Rhetorical moves (in capital red letters) followed
by some examples of expressions used to signify them in
papers
Once the XIP concept extraction and rhetorical analysis were
concluded we repeated the cluster analysis on the XIP-filtered
lists of concepts and salient sentences. Thus our statistical
analysis (described in Section 3.) of the LAK dataset has been
conducted in three different ways:



considering the full text of the articles
considering only the salient sentences extracted by
XIP
considering only the concepts extracted by XIP
The comparison of the sets of papers yielded by the three
approaches is still ongoing. At this stage we can only present
some preliminary observations concerning pairs of similar
papers yielded by the three kinds of input. The data obtained
through this preliminary evaluation is reported in the web page6
related to this work.
6
http://www.pa.itd.cnr.it/lak-data-challenge.html
All the papers in the LAK dataset were analyzed using XIP. The
output files of the XIP analysis, one per paper, were then
imported into a MySQL database, and the user interface was
implemented using PHP and JavaScript, making use of Google
Chart Tools for the interactive visualizations.8
5.2 User Interface
The dashboard consists of three sections, each showing different
analytical results in different types of chart.
Section one of the dashboard shows two line charts, representing
the LAK and the EDM conferences respectively. Each line chart
shows the distribution of the number of salient sentences over
time and by rhetorical marker type (see Figure 2 for a list of the
types of rhetorical markers). Each coloured line in these line
charts indicates how many sentences of a specific rhetorical type
were extracted, and how this number changed by year (Figure 3
shows the line chart for the EDM conference).
7
The related claims have been searched by reading the pairs of
sentences. Our long-term goal is to provide the related claims
automatically.
8
https://developers.google.com/chart
6. SUMMARY
This short paper has summarised an approach to conducting
‘analytics on Learning Analytics’. The LAK Dataset comprising
LAK and EDM literature has been analyzed in order to identify
clusters of papers dealing with similar topics (conceptual
clustering), and in order to identify key contributions of papers
in terms of the claims authors make, as signalled by rhetorical
patterns. Our preliminary tests are promising, but more thorough
testing is needed to validate the method. Finally, we showed
how the results of this analysis are beginning to be visualized
using an analytics dashboard. All the secondary datasets
produced have been published as open data, for further research.
Figure 3: Rhetorical sentences graphed by year, for EDM
The second section of the dashboard (Figure 4) allows users to
select a combination of the extracted concepts, in order to
visualize the occurrence of these concepts in papers within any
or all research communities represented in the corpus– that is to
say across the whole LAK dataset (EDM plus LAK conference).
Figure 6: Distribution of rhetorical types in XIP-classified
sentences within a selected concept bubble
Figure 4: Number of papers with rhetorically extracted
sentences containing user-selected concepts
The third dashboard section consists of a bubble chart that
displays the occurrence of papers within the entire dataset,
filtered by user-selected concepts (Figure 5). This visualization
can be restricted to display just the LAK or the EDM
conference. In Figure 5, each bubble represents a concept that
has been selected by the user. This is associated with a specific
number of papers and sentences in which that concept has been
detected. The colour saturation of each bubble (expressed by the
color spectrum shown at the top) represents the ‘density’ of the
chosen concept as defined by the number of XIP-extracted
sentences in which the concept occurs. The darker the colour,
the greater the density.
In the longer term, the aim of this research is to provide users
with automatic suggestions about similar papers and about
connections between papers, and to present these similarities
and connections in ways that are both meaningful and
searchable for the users. Future steps will validate the outputs
from these analyses with researchers, and test the usability of the
dashboard with different end-users (e.g. researchers, educators,
students).
7. REFERENCES
[1]
Salah Aït-Mokhtar, Jean-Pierre Chanod, and Claude
Roux. (2002). Robustness beyond shallowness:
incremental dependency parsing. Natural Language
Engineering, 8(2/3):121-144.
[2]
Jungermann, F. (2009). Information extraction with
RapidMiner. In Proceedings of the GSCL Symposium’
Sprachtechnologie und eHumanities’. W. Hoeppner, ed.
[3]
Girvan M. and Newman. M. E. J. 2002. Community
structure in social and biological networks. Proceedings of
the National Academy of Sciences. 99, 12, 7821-7826.
[4]
Newman MEJ: Detecting Community Structure in
Networks. Eur Phys J B 2004, 38:321-330.
[5]
Gregory, S.: Local Betweenness for Finding Communities
in Networks. Technical Report, University of Bristol
(2008).
[6]
Brandes, U., Pich, C., Centrality Estimation in Large
Networks. Intl. Journal of Bifurcation and Chaos in
Applied Sciences and Engineering 17(7) 2303–2318
[7]
Ágnes Sándor. (2007). Modeling metadiscourse
conveying the author's rhetorical strategy in biomedical
research abstracts. Revue Française de Linguistique
Appliquée 200(2): 97-109
Figure 5: Concept ‘density’ within XIP sentences, by year
and number of papers
When a concept bubble is selected (Figure 6), a pie chart pops
up representing the relative distribution of the rhetorical types
for that bubble (that is to say for that concept, and across the
papers and sentences in which the concept has been detected).
Ontology Learning to Analyze Research Trends in
Learning Analytics Publications
Srećko Joksimović
Amal Zouaq
Department of Mathematics and
Computer Science
Royal Military College of Canada
Kingston, ON, Canada
+1 613 541 6000, Ext. 6478
amal.zouaq@rmc.ca
sjoksimo@sfu.ca
ABSTRACT
In this paper, we show how ontology learning tools can be used to
reveal (i) the central research topics that are tackled in the published literature on learning analytics and educational data mining;
and (ii)relationships between these research topics and iii)
(dis)similarities between learning analytics and educational data
mining.
Categories and Subject Descriptors
I.2.7 [Artificial Intelligence]: Natural Language Processing;
G.2.2 [Discrete Mathematics]: Graph Theory
General Terms
Algorithms, Measurement, Experimentation
Keywords
Ontology learning, deep parsing, filtering, information retrieval,
ranking algorithms, graph theoretic statistics
1. INTRODUCTION
Learning analytics is a new research discipline. Although it attracted a considerable amount of attention in educational research
and practice, debate is still very active about the scope of the discipline. The definition of learning analytics offered by the Society
for Learning Analytics Research [7], which is commonly used in
the literature to date, gives a general framework for the main tasks
learning analytics are about. However, given the youth of the
discipline, there are generally two open questions:
-
Dragan Gašević
School of Interactive Arts and Tech- School of Computing and Information
nologies
Systems
Simon Fraser University
Athabasca University
Surrey, BC, Canada
Athabasca, AB, Canada
+1 778 782 7474
+1 604 569 8515
dgasevic@acm.org
papers presented at the LAK conference editions and another one
for the papers presented at the EDM conference editions – in
order to compare the two conferences based on concepts and relationships gauged as most important. We also performed analysis
based on (a) paper abstracts only and (b) main body of text of the
papers.
In this short report, we first describe the data analysis pipeline.
This is followed by a very brief discussion of a small fragment of
the results we obtained in our analysis. The complete results in the
CSV format are available at [8].
2. DATA ANALYSIS PIPELINE
The data analysis relies on our ontology learning tool, OntoCmaps[10]. Ontology learning from text is a multi-layer knowledge
extraction task that targets the following components:
Terms and concepts: The first step consists in identifying candidate expressions in texts. These expressions are then ranked using
some kind of measure (statistical metrics, graph-based metrics,
etc.) to extract those that are relevant for the domain. These filtered relevant expressions are then considered “concepts” in the
ontology learning community.
Taxonomy: This step identifies “is-a” links in texts, generally
using patterns indicating a taxonomical link in text such as
Hearst’s patterns[11], or using the inner structure of multiword
expressions. For example, a “carnivorous plant” can be considered
a “plant” just by looking at the syntactic structure “Adjective
noun” of the expression.
What are the central research topics that are tackled in the
published literature?
What are the relationships between the central research topics?
What are similarities and differences between learning analytics and educational data mining?
Conceptual relationships: This step uses various techniques (patterns, machine learning, etc.) to identify any kind of transversal
relations, with a domain and range.
To address the above questions, we aimed to analyze systematically textual content available in the LAK Challenge data set. In
particular, we used a state-of-the-art ontology learning tool, OntoCmaps, that enabled the automatic (i) parsing of textual content,
(ii) creation of conceptual maps based on the extracted concepts
and relationships, and (iii) filtering/ranking of the most important
concepts and relationships based on measures of information retrieval, graph theory, and voting theory. The concept extraction
and their filtering/ranking was done (i) for each edition of the two
conferences and the journal special issue (from the LAK 2013
Challenge dataset)individually (i.e., LAK 2011-2012, EDM 20082013, and LAK ET&S special issue) to see the emerging trends
through the years; and (ii) by creating two subsets – one for the
OntoCmaps requires a domain corpus as input. As such, LAK and
EDM proceedings (the LAK dataset [13]) were an appropriate set
of texts to test the ontology learning process. OntoCmaps relies on
three main phases to learn a domain ontology: 1) the extraction
phase that performs a deep semantic analysis based on dependency patterns; 2) the integration phase that builds concept maps,
which are composed of terms and labeled relationships, and uses
basic disambiguation techniques. These concept maps form a
graph; and finally 3) the filtering phase where various metrics
rank the items (terms and relationships) in concept maps.
-
Axioms: Finally, axioms here mean defined classes, or rules from
texts.
2.1 The Extraction Phase
In the extraction phase, OntoCmapsis based on a hierarchy of
syntactic patterns. Each pattern describes a set of syntactic rela-
tionships that permit the extraction of a “semantic representation”.
OntoCmaps does not rely on any predefined domain knowledge. It
uses two NLP tools to obtain the syntactic representations: the
Stanford Parser along with its dependency module [2] and the
Stanford parts-of-speech (POS) Tagger [6]. Given a sentence, the
Stanford parser generates syntactic dependency relations between
each pair of related words of a sentence. The POS Tagger identifies words’ parts-of-speech. Based on these two inputs, OntoCmaps creates a pattern syntactic format that enriches words in
each dependency relation with their parts-of-speech. This enriched
representation is then used as input to a pattern recognition task.
A recognized pattern fires a rule that applies various transformations on the syntactic representation to obtain a “semantic representation”, in the form of expressions, triples or sets of triples.
The patterns are divided into conceptual patterns and hierarchical
patterns. Hierarchical patterns concentrate on the extraction of
taxonomical links, following the work of [11], but based on the
dependency formalism. Conceptual patterns identify the main
structures of the language that can be transformed into triples
useful for the extraction of conceptual relations. They are organized into a hierarchy from most-detailed patterns (containing the
biggest number of dependency relationships) to least detailed. The
extraction phase targets deeper levels of the hierarchy first to
avoid extracting too abstract or incomplete representations. For
instance, if the pattern “nsubj-dobj-xcomp” exists in text, the extractor should fire it instead of firing one of its higher-level counterparts “nsubj-dobj” and “nsubj-xcomp”which contain only a
subset of the syntactic relationships of interest. If a pattern is instantiated, then all its parents in the hierarchy are disregarded.
2.2 The Integration Phase
In this integration phase, all the extracted relationships are gathered into concept maps. Some basic term disambiguation tasks
are performed at this level mainly: i) lemmatization which considers singular, plural and other forms of the same terms or relationships as referring to a single concept or relationship; ii) basic synonym detection based on abbreviation relations that are generated
by the Stanford parser and iii) a kind of co-reference resolution
phase that is built in some of the patterns, and that allows for the
creation of semantic links between terms in a sentence, even if not
direct dependency links existed in the original dependency representation. For example, in the sentence: carnivorous plants are
organisms which eat insects, the co-reference resolution creates a
relation “eat” between the term “carnivorous plants” and the term
“insects” while the grammatical representation links the term
“plants” to the term “insects”.
All these operations result in concept maps around various terms.
For example, if there were a number of statements around the
term “carnivorous plants” in texts, it is likely that a concept map
around “carnivorous plants” will be created. This process is repeated for all identified terms and relationships and results in an
aggregation of concept maps through links between various concept maps, thus constituting a graph, with terms representing
nodes, and relationships representing edges.
2.3 The Filtering Phase
The third and last phase for learning the domain ontology is the
filtering phase, which aims at ranking the items in concept maps
(domain terms, taxonomical links, and conceptual links).
2.3.1 Concept Filtering
A number of metrics from graph theory and from information
retrieval are used to identify relevant terms. Graph-based metrics
were computed using the JUNG framework [3]. These metrics
include:
•
The Degree centrality of a node which identifies the number
of edges from and to a given node.
•
The Betweenness centrality, which assigns each node a value
that is derived from the number of shortest paths that pass
through it;
•
The HITS algorithm which ranks nodes according to the
importance of hubs and authorities [5]. This resulted in two
measures Hits-Hubs and Hits-Authority;
•
The PageRank of a node [1];
•
We also computed standard information retrieval metrics,
mainly term frequency (TF) and TF-IDF.
Finally, using the graph-based metrics, we defined a number of
voting schemes with the aim of improving the precision of filtering. All the VS relied on three metrics that were identified as being among the best metrics in previous experiments [10][11]:
Degree, Betweenness and HITS-Hubs. The VS include:
•
•
•
The majority voting scheme, which recognizes a term as an
important one if it is chosen by at least k metrics out of n
with k>n/2.
Borda Count Voting Scheme: This method assigns a “rank”
to each candidate. A candidate who is ranked first receive n
points (n=size of the domain terms to be ranked), second n-1,
third n-2 and so on. The “score” of a term for all metrics is
equal to the sum of the points obtained by the term in each
metric.
Nauru Voting Scheme: The Nauru voting scheme is based on
the sum of the inverted rank of each term in each metric. It is
used to put more emphasis on higher ranks.
Table 1 shows the top ranked concepts based on the majority voting scheme. All the base metrics (Betweenness, PageRank, Degree, etc.) and voting schemes have been computed and can be
found at [8]. The Web site [8] also features a visualization of the
extracted data based on the obtained concept maps. The visualization is performed per venue (EDM/LAK/ETS-SI), per corpus
(only abstracts or main texts) and per year (2008-2012).
2.3.2 Relationship Filtering
Similarly, a number of metrics were used to identify important
relationships.
The first measure consists of all the relationships that occur between important terms (determined through the voting schemes)
as important relationships. This constitutes our voting schemes for
relationships, which were based on the results of the majority
voting scheme for concepts.
The second measure ranks relationships based on Edge Betweenness centrality, which is a measure of the importance of edges
based on the number of shortest paths which contain them.
The third measure is based on assigning frequencies of cooccurrence weights based on the Dice coefficient [9], a standard
measure for semantic relatedness.
Table 2 shows an excerpt of the top ranked relationships based on
the majority voting scheme. Contrary to standard named entity
extractors, an important aspect of using ontology learning is the
ability to extract relationships as well, thus, obtaining not only
topics but also relationships (taxonomical and conceptual) between these topics. A better approach would mix the two approaches and combine topic extraction using named entity extractors, linked data semantic annotators and ontology learning.
Table 1.Top ranked concepts based on the majority voting scheme extracted the subsets of the LAK 2013 Challenge dataset
LAK
(abstracts)
LAK
(paper body)
EDM
(abstracts)
EDM
(paper body)
student (0.50)
student (0.75)
student (0.75)
student (0.75)
datum (0.45)
datum (0.20)
model (0.38)
model (0.23)
informal_learn
(0.31)
learner (0.15)
datum (0.37)
datum (0.19)
learn (0.31)
course (0.15)
method (0.19)
skill (0.09)
teacher (0.29)
analysis (0.12)
paper (0.16)
problem (0.08)
model (0.27)
activity (0.11)
system (0.13)
result (0.06)
learning_analytics
(0.26)
user (0.10)
result (0.12)
method (0.06)
learner (0.25)
tool (0.10)
approach (0.11)
parameter (0.05)
social_factor (0.21)
learn (0.09)
skill (0.08)
question (0.05)
social_learn (0.19)
analytics (0.07)
analysis (0.07)
performance (0.05)
effective_learn
(0.19)
group (0.07)
intelligent_
tutoring_system(0.07)
system (0.05)
group_learn (0.17)
system (0.07)
behavior (0.07)
approach (0.04)
knowledge_
professional (0.17)
teacher (0.06)
tool (0.07)
example (0.04)
Lak (0.17)
instructor (0.06)
work (0.06)
feature (0.04)
knowledge (0.17)
network (0.06)
Researcher (0.06)
item (0.04)
Table 2.Top ranked relationships based on the majority voting scheme extracted the subsets of the LAK 2013 Challenge dataset.
Each cell in the table contains a concept-relationship-concept triplet
LAK
(abstracts)
learner–build–knowledge (1)
datum–obtained
(0.81)
from–learner
LAK
(paper body)
course–being recorded as
well as to–student (1)
datum–break ability to
educate
effectively–
student (0.60)
EDM
(abstracts)
EDM
(paper body)
datum–mining–method (1)
model–fit–student (1)
method–linguistics in–paper (0.95)
learning_analytics–important step
for–teachers_of_tomorrow (0.78)
system–addresses individually–student (0.45)
model–are trained over–datum (0.70)
teachers_of_tomorrow–is
teacher (0.77)
analysis–have since been
moved as–student (0.37)
system–provides–student (0.61)
tool–incorporate functionality to
access–datum (0.65)
network–impacting–
student (0.31)
student–are represented by–model
(0.56)
model–can be used to inform–
student (0.64)
process–finally
should
promote reflection on–
instructor (0.29)
model–can detect–student (0.50)
datum–obtained
(0.62)
tool–identify–student
(0.27)
datum–derived from–student (0.43)
learner–generating–datum (0.58)
datum–may be presented
to–learner (0.25)
goal–has been investigated
researcher (0.42)
student–accessing–
online_discussion_forum (0.56)
activity–conducted
user (0.25)
tutoring_system–is a–system (0.40)
model–can be used to inform–
teacher (0.51)
student–flock to–online_service
(0.48)
datum–are combined to calculate–
likelihood_of_student (0.45)
group–will
contain–
student (0.25)
environment–capture–
datum (0.24)
model–highly
accurate
on–student (0.22)
average–miss–student
(0.21)
role–are imposed on–
student(0.21)
information–useful for–
student (0.20)
a–
from–instructor
instructor–guide–student (0.39)
learn–integral
to–
success_of_community (0.37)
likelihood_of_student–is related
to–student (0.36)
by–
question–were based–
student (0.62)
by–
student–study
with–
intelligent_tutoring_system(0.39)
skill–studied
in–tutoring_system
(0.38)
intelligent_tutoring_system–are
informed by–datum (0.32)
analysis–reveals–unexpected_result
(0.30)
unexpected_result–is a–result (0.30)
collaborative–learning–
interactions_of_student (0.29)
datum–are collected
far
from–student
(0.96)
skill–will have been
covered
by–student
(0.67)
problem–assign for–
student (0.67)
example–
parameterization by–
student (0.63)
student–provides
useful evidence to–
model (0.60)
step–requires–student
(0.57)
performance–
dependent
upon–
student (0.56)
accuracy–varies
across–student (0.48)
student–is guessing–
result (0.48)
student–collect–datum
(0.45)
word–uttered
by–
student (0.44)
datum–were used to
build–model (0.44)
skill–are included in–
model (0.41)
We can also notice that we were not always successful in extracting meaningful relationships labels from this corpus. One possible
explanation is the type of texts (publications) and the amount of
noise in these texts. In fact, OntoCmaps is made to run on clean
plain sentences that describe a domain of interest and define it.
Parts of research papers such as figure captions, formulas, and
references represent noise for OntoCmaps. Additional cleaning of
the input texts would be necessary. However, even when the labels were not meaningful, the existence of a link between two
concepts (unlabeled relationship) was shedding some light on the
domain (see Section 3).
3. FINDINGS
In this section, we present only results of the 15-top ranked concepts and relationships according to the Majority Voting Scheme
(Betweenness, Degree, and Hits-Hub) as shown in Tables 1-2
(N.B. As can be noticed in the tables, the majority of the terms are
lemmatized, that is, we show only their lemma or root. For example,informal_learn for informal learning or datum for data. In few
cases, such as learning_analytics, the lemmatizer returned the
expression itself). First, we could not possible include all the results of all the metrics we calculated in our experiment (those
results are available at [8]). Second, we selected the metrics which
were proven to be most accurate in our previous research [10],
[11]. Finally, it should be noted that the purpose of our experiment here was not to evaluate the effectiveness of individual metrics, but rather to experiment if ontology learning technology can
shed some light on the questions posed in the introduction of relevance to the LAK 2013 Data Challenge.
Concepts reported in Table 1 reveal that papers of both the LAK
and EDM conferences have students, data and models as shared
concepts. However, it is clear that LAK papers also focus on
teachers/instructors, informal learning, and social, networked, and
group learning. On the other hand, EDM papers focus on (data
mining) methods and approaches, intelligent tutoring systems,
features (extraction), and various types of parameters.
Figure 1.Two conceptual maps extracted from the abstracts of the papers presented at the LAK conference
Relationships reported in Table 2 further corroborate the observation that the LAK papers are more focused on teachers in order to
empower them with learning analytics and to help them guide
students. Moreover, there is an emphasis on (promoting) reflection of both students and instructors. Various aspects of social
learning such as role playing and impact of communities appear to
be highly popular topics in the LAK papers. On the other hand,
EDM papers are much more focused on intelligent tutoring systems, accuracy of different types of (predictive) models, and revealing unexpected patterns. Certainly, focus on data is shared by
both the LAK and EDM communities, but LAK also seems to be
focused on data collected by and for instructors, not only for students. This probably indicates a trend that the LAK community
has so far acknowledged the role of instructors in the learning
process and aimed at supporting them as much as learners. The
EDM community has however focused more on measuring and
predicting specific types of skills. This is consistent with their
focus on intelligent tutoring systems in which automated assessment of learners’ skills is of paramount importance.
Finally, we were also able to visualize the extracted conceptual
graphs. In Figure 1, we show the relationships of concept learning
analytics as extracted from the abstracts of the papers presented at
the LAK conference. This figure further corroborates earlier oobservations by indicating that learning analytics is an integral part
of teaching profession, is an important step for teachers of tomo
tomorrow and learners, and offers a new approach. This figure reveals
also the nature of learning analytics to promote qualitative unde
understanding of context
ntext of information. Learning analytics is also
(strongly) related to discourse analytics, which seems to be conco
sistent
tent with the strong emphasis of learning analytics on social
learning and which is further confirmed by extracted relationships
of discoursee learning analytics with sense-making,
sense
argumentation
and social, all of which are types of skills recognized as imporimpo
tant for the modern society.
Figure 2.. Visualization of top 30 ranked concepts based on the majority voting scheme extracted from the abstracts of the LAK
2013 Challenge dataset.
In future work, we plan to analyze further the research trends over
the years for the LAK and EDM communities. Anot
Another of our
goals is to compare the extractions of an ontology learning system
such as OntoCmaps with Linked data Semantic Annotators such
as DBPedia Spotlight1 or Alchemy2.
4. CONCLUSION
Funnily, our text analysis tool inferred that EDM is an abbrevi
abbreviation of learning analytics.. This probably comes from the open
debate reflected in the analyzed papers about the relationships
between learning analytics and educational data mining. We hope
that this paper sheds some light on the (dis)similarities of the two
areas.
s. We also hope that our analysis of the LAK 2013 Data Cha
Challenge dataset with the ontology learning tools indicated a high
potential of this type of analytics to help the research community
of new research discipline define itself and relationships with
1
https://github.com/dbpedia-spotlight/dbpedia-spotlight/
spotlight/
2
http://www.alchemyapi.com/
closest
osest communities. More interesting results are available on our
website [8]. For example, those results allow for (i) comparing
results of different concept/relationship measures and (ii) chronochron
logical trends emerging throughout the years of individual edied
tions of both the conferences. An example
xample of one of the visualizavisualiz
tions available at [8] is presented in Figure 2.
Of course, ontology learning tools are not perfectly accurate, and
thus, few “strange” concepts and relationships are shown in our
tables. An opportunity is however in combining such ontology
learning tools as starting points of the concept map development
of the learning analytics domain, which can then be refined
through crowd sourcing (e.g., in a Wiki-like
Wiki
manner).
5. REFERENCES
[1] Brin, S. & Page, L. (1998). The anatomy of a large-scale
large
hyper-textual
textual web search engine, Stanford University.
[2] De Marneffe, M-C,
C, MacCartney, B. and Manning. C.D.
(2006). Generating Typed Dependency Parses from Phrase
Structure Parses. In Proc. of LREC, pp. 449-454,
449
ELRA.
[3] JUNG (2013). Last retrieved from http://jung.sourceforge.net/
[4] Klein, D. and Manning, C.D. (2003). Accurate Unlexicalized
Parsing. Proc. of the 41st Meeting of the Association for
Computational Linguistics, pp. 423-430.
[5] Kleinberg, J. (1999). Authoritative sources in a hyperlinked
environment, Journal of the ACM 46(5): 604-632, ACM.
[6] Toutanova, K., Klein, D., Manning, C.D. & Singer, Y.
(2003). Feature-rich part-of-speech tagging with a cyclic dependency network, In Proc. of HLT-NAACL, pp. 252-259.
[7] http://www.solaresearch.org/mission/about/
[8] http://lakchallenge.co.nf
[9] Van Rijsbergen, CornelisJoost (1979). Information Retrieval.
London: Butterworths. ISBN 3-642-12274-4.
[10] Zouaq, A., Gasevic, D. and Hatala, M. (2011). Towards
Open Ontology Learning and Filtering, Information Systems,
36(7): 1064–1081.
[11] Zouaq, A., Gasevic, D. and Hatala, M. (2012a). Voting
Theory for Concept Detection. The 9th Extended Semantic
Web Conference 2012 (ESWC 2012), pp. 315-329.
[12] Hearst, M. A. (1992). Automatic acquisition of hyponyms
from large text corpora. In Proc.14th Conference on Computational Linguistics – Vol. 2 (COLING '92), 539-545.
[13] Taibi, D., Dietze, S., Fostering analytics on learning analytics
research: the LAK dataset, Technical Report, 03/2013, URL:
http://resources.linkededucation.org/2013/03/lak-datasettaibi.pdf.
Download