Learning Analytics: Tool Matrix

David Dornan
Tool (URL)
Opportunities in Learning
Analytic Solutions
One of the biggest hurdles in developing learning analytic tools is developing data governance and privacy policies for accessing
student data. The two initiatives in this section offer frameworks for opening access to student attention/learning data. The first
initiative provides a start to developing data collection standards, and the second provides inspiration on how and why it is not only feasible to
deliver free open courses, but also makes sense as a way of providing a community-based research environment in which to explore, develop and
test learning theories and learning feedback mechanisms/tools.
PSLC (Pittsburgh
Science of Learning
Center) DataShop
The PSLC DataShop is a repository
containing course data from a
variety of math, science, and
language courses.
Data Standards
Open Learning
This is an exciting initiative taking
place at Carnegie Mellon University.
Students’ interaction with free online course material/activities
provides a virtual learning analytic
laboratory to experiment with
algorithms and feedback
From Solo Sport to Community
Based Research Activity
Initiatives like PSLC will help the
learning analytics community
develop standards for collecting,
anonymizing and sharing student-level
course data.
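As a rough illustration of what an anonymizing step for student-level data might look like, the sketch below replaces a student identifier with a keyed hash and drops direct identifiers. The field names (`student_id`, `name`, `email`) and the helper functions are hypothetical and are not taken from the PSLC DataShop. Strictly speaking this is pseudonymization, so a real standard would need to address re-identification risk as well.

```python
import hashlib
import hmac

# Hypothetical direct identifiers to strip before sharing.
DIRECT_IDENTIFIERS = {"student_id", "name", "email"}


def pseudonymize(student_id: str, secret_key: bytes) -> str:
    """Return a stable token for a student ID using a keyed SHA-256 hash."""
    digest = hmac.new(secret_key, student_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def anonymize_records(records, secret_key):
    """Drop direct identifiers and add a pseudonymous token to each row."""
    cleaned = []
    for row in records:
        out = {k: v for k, v in row.items() if k not in DIRECT_IDENTIFIERS}
        out["student_token"] = pseudonymize(row["student_id"], secret_key)
        cleaned.append(out)
    return cleaned


# Made-up sample rows, for illustration only.
sample = [
    {"student_id": "1001", "name": "A. Student", "email": "a@example.org",
     "course": "MATH1000", "attempts": 3},
    {"student_id": "1001", "name": "A. Student", "email": "a@example.org",
     "course": "STAT2040", "attempts": 1},
]
shared = anonymize_records(sample, secret_key=b"institution-held-secret")
```

Because the key stays with the contributing institution, the same student maps to the same token across courses without the repository ever seeing the original ID.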
Convincing individual institutions to
contribute to this type of data
repository may be difficult given that
many institutions do not have data
governance/sharing policies to share
this type of information internally.
Herbert Simon from Carnegie
Mellon University states that,
“Improvement in Post Secondary
Education will require converting
teaching from a ‘solo sport’ to a
community based research activity.”
There are often two concerns
related to conducting
experimentation using learning
1. Privacy concerns related to
accessing student related data.
2. Ethical concerns related to testing
different feedback/instructional
response mechanisms.
By offering free courses to students
with full disclosure of how their
interactions will be tracked and
analyzed, these two issues are
no longer roadblocks for conducting
learning analytics research. As
learning material/objects become
commodities, the development of
learning analytics tools that help
guide and direct students will
become what is valued and this
requires that institutions build
expertise in developing and
sustaining the communities required
to conduct community based
learning research.
Database Storage
The majority of current learning analytics initiatives are handled adequately using relational databases. However, as learning analytics
programs begin to make use of the semantic web and social media tools, there will be a need to start exploring data storage technology
that can handle large unstructured data sets. This section provides a brief description of the data storage required for LA programs.
Relational Database
For years we have used relational
databases to structure the data
required for our analyses. Data is
stored in tables consisting of rows
and columns. The columns are well-defined
attributes pertaining to an
object represented by a table. There
are good open source relational
databases such as Greenplum and
MySQL. However, most universities
have standard supported RDBMS
offerings. At the University of
Guelph we support both SQL Server
and Oracle's RDBMS.
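To make the rows-and-columns idea concrete, here is a minimal sketch using Python's built-in sqlite3 module rather than any of the products named above. The `enrolment` table, its columns and the sample values are invented for illustration.

```python
import sqlite3

# In-memory database; the table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE enrolment ("
    "student_token TEXT, course_code TEXT, final_grade REAL)"
)

# Each row is one object (an enrolment); each column is a well-defined attribute.
conn.executemany(
    "INSERT INTO enrolment VALUES (?, ?, ?)",
    [
        ("s1", "MATH1000", 72.0),
        ("s2", "MATH1000", 88.0),
        ("s1", "STAT2040", 65.0),
    ],
)

# A typical analytic query: enrolment count and average grade per course.
summary = conn.execute(
    "SELECT course_code, COUNT(*), AVG(final_grade) "
    "FROM enrolment GROUP BY course_code ORDER BY course_code"
).fetchall()
```

The same SQL would run with little change on SQL Server or Oracle, which is one reason structured institutional data tends to stay in a relational store.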
Oracle provides a secure repository
for structured data. The recent
release of 11g also provides
integration with the R engine
permitting it to access data stored in
the database.
Map Reduce
Hadoop is an Apache project
inspired by Google's MapReduce and
the Google File System. It has
become a standard for distributing
large unstructured data sets. It
provides a framework that can
distribute large data sets over a
number of servers and can provide
intermediate results as data flows
through the framework's pipeline.
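The MapReduce pattern that inspired Hadoop can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle and reduce phases, not Hadoop's API; the function names are made up, and a real cluster would run the map and reduce steps on many servers in parallel.

```python
from collections import defaultdict


def map_phase(document):
    """Emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group the intermediate values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Combine all values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run the three phases in sequence over a list of documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    grouped = shuffle(pairs)
    return dict(reduce_phase(k, v) for k, v in grouped.items())


counts = word_count(["forum post about R", "another forum post"])
```

The grouped pairs produced by `shuffle` correspond to the intermediate results that flow through the framework's pipeline.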
As learning analytics programs begin
to make use of the semantic web
and social media tools there will be a
need to start exploring data storage
technology that can handle large
unstructured data.
Universities have good relational
database infrastructures including
expertise. As LA programs grow to
include analysis of unstructured
data, universities will need to
develop skill and capacity to offer
Hadoop data storage and retrieval
There are a number of companies
that lease access to processing via
virtual servers. Amazon’s EC2 is a
common cloud server option
available to host applications.
It is becoming common for
organizations to look at moving
applications to the cloud. For many
of the traditional services, like the
RDBMS, there is resistance to
cloud-based deployments. This resistance
is primarily due to privacy concerns
and resistance to change. As LA
programs require access to new
technologies such as Hadoop and
require infrequent massive
analytical cycles, there may be an
opportunity to introduce cloud-based
offerings such as EC2.
The first assignment for this course
(the development of a LA tool)
provided me an opportunity to
deploy an application using EC2.
EC2 is a great way to explore new
technologies. If mistakes are made
one simply redeploys a new EC2
instance. There are many publicly
available instances that save time in
deploying complete environments.
In developing my LA tool, I deployed
an Oracle XE instance (which
required virtually no effort) and
another RedHat instance where I
installed RevoDeployR. Since
RevoDeployR was a new tool for me,
I had to start over several times
before completing a successful
installation. It is possible to create
backup images in EC2. However, it
was not as intuitive as creating a
new instance.
Data Cleansing/Integration
Prior to conducting data analysis and presenting it through visualizations, data must be acquired (extracted), integrated, cleansed and
stored in an appropriate data structure. The tools that perform these tasks are commonly referred to as ETL (extract, transform, load) tools. Given the need for
both structured and unstructured data (as described in the above section), the ideal ETL tools will be able to access and load data to and
from data sources including RSS feeds, API calls, RDBMS and unstructured data stores such as Hadoop.
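The extract, cleanse and load steps described above can be sketched with the Python standard library alone. This is only a toy pipeline under stated assumptions: the CSV layout, the `grades` table and the cleansing rules are all hypothetical, and the ETL tools reviewed in this section do far more.

```python
import csv
import io
import sqlite3

# Made-up raw extract with stray whitespace, mixed case and a missing grade.
RAW_CSV = """student_token,course_code,grade
a1, math1000 ,72
b2,MATH1000,
c3,stat2040,65
"""


def extract(text):
    """Read the raw CSV text into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def cleanse(rows):
    """Trim whitespace, normalize course codes, drop rows with no grade."""
    cleaned = []
    for row in rows:
        grade = row["grade"].strip()
        if not grade:
            continue
        cleaned.append(
            (row["student_token"].strip(),
             row["course_code"].strip().upper(),
             float(grade))
        )
    return cleaned


def load(rows, conn):
    """Write the cleansed rows into a relational table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS grades "
        "(student_token TEXT, course_code TEXT, grade REAL)"
    )
    conn.executemany("INSERT INTO grades VALUES (?, ?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
load(cleanse(extract(RAW_CSV)), conn)
```

Keeping the three steps as separate functions mirrors how ETL tools chain independent steps, so a new source (for example an RSS feed) would only require a new `extract`.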
Needlebase is a web-based
webscraping tool that provides an
easy to use interface to acquire,
integrate and cleanse web-based
data. As a user navigates a website
tagging page elements of interest,
Needlebase detects the underlying
database structure and web
navigation and automates the
collection of the underlying data into
a table of data.
Needlebase is a great tool for
accessing a website's underlying
data when direct access to the data
is not readily available. I have used
Needlebase to create a lookup table
for archived National Occupation
Codes and to create a lookup table
for our undergraduate course
There is no API access to the
Needlebase scripts that are created.
It seems best for one off extracts or
for applications where the entire
dataset is acquired using
Needlebase tools. It does not seem
all that useful for an integrated
solution. One other restriction that I
ran across using this tool was that it
did not support accessing websites
requiring authentication.
Pentaho Integration
Pentaho Data Integration (PDI) is a
powerful, easy-to-learn open source
ETL tool that supports acquiring data
from a variety of data sources
including flat files, relational
databases, Hadoop databases, RSS
Feeds, and RESTful API calls. It can
also be used to cleanse and output
data to the same list of data sources.
PDI provides a versatile ETL tool that
can grow with the evolution of an
institution's learning analytics
program. For example, initially a LA
program may start with institutional
data that is easily accessible via
institutional relational databases. As
the program grows to include text
mining and recommendation
systems that require extracting
unstructured data outside the
institution, the skills developed with
PDI will accommodate the new
sources of data collection and
There are two concerns that I have
with PDI:
1. Pentaho does not have built in
integration with R statistics. Instead
Pentaho data mining integration
focuses on a WEKA module.
2. Pentaho is moving away from the
open source model. Originally PDI
was an open source ETL tool called
Kettle developed by Matt Casters.
Since Pentaho acquired Kettle (and
Matt Casters), it has become a central
piece to their subscription based BI
Suite and the support costs are
growing at a rapid pace. Twice, I
have budgeted for support on this
product only to find that the support
costs have more than doubled year
over year.
Talend is another open source ETL
tool that has many of the same
features as PDI. The main
differences between PDI and Talend
are presented in the following blog
Talend has the same strengths as
described above with the additional
benefit of having built in integration
with R.
The main difference from my
perspective is that Talend is a code
generator whereas PDI is not. I have
also found PDI a much easier tool to
learn and use.
Yahoo Pipes
Yahoo provides this free web-based
GUI tool that allows users to extract
web-based data and create a data
stream that will cleanse, filter or
enhance data prior to outputting the
data via an RSS feed.
Since PDI and Talend seem to be
able to provide the same ability as
Yahoo Pipes, I did not spend a great
deal of time exploring Yahoo Pipes.
However, it seems to me that Yahoo
Pipes could provide the web-scraping
functionality that Needlebase
provides, yet offer an RSS feed output
that could be picked up by either
Talend or Pentaho in order to
schedule nightly loads. It might be a
more efficient way to pass web-based
data streams through various
APIs prior to extraction using PDI.
The one concern that I have with respect to
Yahoo Pipes is that some of the
unstructured data that will require
analysis in a LA system will be posts
by students. If a free public service
like Yahoo Pipes is being used to
stream data through various analytic
APIs, we will potentially release
personal student data.
Statistical Modeling
There are three major statistical software packages: SAS, SPSS and R. All three of these tools are excellent for developing
analytic/predictive models that are useful in developing learning analytics models. This section focuses on R. The open source project R
has numerous packages and commercial add-ons available that position it well to grow with any LA program. Given that many researchers
are proficient in R, incorporating the R engine into a LA platform also offers an opportunity to engage faculty in the development of
reusable models/algorithms.
R is an active open source project
that has numerous packages
available to perform any type of
statistical modeling.
R's strength is the fact that it
is widely used by the research
community. Code for analysis is
widely available and there are many
packages available to help with any
type of analysis and presentation
that might be of interest. Some of
these include:
1) Visualization:
a) ggplot2 provides good
charting functionality.
b) googleVis provides an
interface between R and the
Google Visualization API
2) Text Mining:
a) tm provides functions for
manipulating text including
stripping whitespace and
stop words and removing
suffixes (stemming).
b) openNLP identifies words as
nouns, verbs, adjectives or
c) wordnet provides access to
the WordNet library. This is often
used to replace similar words
with a common word prior to
text analysis.
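The text-preparation steps listed above (stripping whitespace, removing stop words, stemming) can be sketched without any packages. The example below is in Python rather than R and does not use tm. The stop-word list and the suffix rules are deliberately tiny assumptions, and the stemmer is a crude stand-in for a real algorithm such as Porter's.

```python
import re

# Tiny illustrative stop-word list; real lists are much longer.
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "is", "are", "in"}

# Crude suffix rules, checked in order; not a real stemming algorithm.
SUFFIXES = ("ing", "ed", "s")


def naive_stem(word):
    """Strip the first matching suffix if at least 3 letters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def preprocess(text):
    """Lowercase, tokenize on letters, drop stop words, then stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [naive_stem(t) for t in tokens if t not in STOP_WORDS]


terms = preprocess("  The students   are discussing  graded assignments ")
# terms == ["student", "discuss", "grad", "assignment"]
```

The result "grad" for "graded" shows why stemmed terms are usually used for counting and matching rather than for display.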
Although I really like R there are two
issues that may be of concern to
some universities:
1) Lack of Support - only Revolution
R provides support for the R
2) High Level of Expertise Required
to Develop and Maintain R. How
does a university retain people
that have the skill required to
develop and maintain
R/RevoDeployR? However, since
many faculty and students are
proficient with R, perhaps
building a platform similar to
Datameer (see below) would
allow R code to be community
sourced allowing the majority of
faculty and students to easily
access and build their own
learning dashboards.
Here are a few articles that show the
power of using a few of these text
mining packages:
1. Creating a wordle using tm and
ggplot - http://www.r-bloggers.com/building-a-better-word-cloud/
2. An overview of conducting text
analysis using R - http://www.jstatsoft.org/v25/i05/pa
Oracle has also integrated R into its
11g RDBMS, allowing R models direct
access to RDBMS data.
Revolution R
Offerings Including:
- RevoDeployR
- RevoConnectR
- Integration with IBM Netezza
Revolution R provides support for
the open source R engine and
provides add-ons to enhance the
integration and use of R within
databases and websites.
RevoDeployR is a server-based
platform that provides access to the
R engine via a RESTful API.
RevoConnectR allows use of
Hadoop stored data by the R engine.
Revolution R also provides
integration with IBM Netezza data
warehouse appliances providing a
scalable infrastructure for analyzing
very large datasets.
Revolution R is the only commercial
support offering for R. Revolution R
will be useful for institutions that
have procurement or risk
management policies that restrict
the use of open source products.
The support that I received using
RevoDeployR was very slow.
However, I am not a supported
Revolution R tools are free for
research purposes and their support
contract or licenses for institutional
purposes (i.e. learning analytics and
dashboards) are very reasonable. I
was quoted $4,000/core for the
RevoDeployR product.
This is an open source Apache
module named mod_R that embeds
the R statistical engine inside the
web server.
Zementis ADAPA
Zementis offers a PMML-based
scoring engine which can be
deployed on-site, within a
Greenplum database, within an Excel
spreadsheet or consumed as a web
service using the Zementis Amazon
cloud-based service. By using the
PMML (Predictive Model Markup
Language) standard ADAPA can
easily leverage predictive models
developed in the major statistical
software including R, SAS and SPSS.
It can quickly provide scoring based
on any of the following modeling
- Support Vector Machines
- Naive Bayes Classifiers
- Ruleset Models
- Clustering Models
- Decision Trees
- Regression Models
- Scorecards
- Association Rules
- Neural Networks
ADAPA allows for easy consumption
of predictive scores into a student or
faculty web based learning
dashboard. The cloud-based service,
starting at only $0.99/hr, requires
only a $2000/semester
investment. I tried using the API to
create a Purdue-like dashboard in
the LA tool, but I did not have time
to get it working properly.
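One way the hourly rate and the per-semester figure can be reconciled is sketched below. It assumes, and this is only an assumption, that the service runs around the clock for a 12-week semester; the text does not state how the semester figure was derived.

```python
HOURLY_RATE = 0.99          # USD per hour, as quoted above
HOURS_PER_DAY = 24          # assumption: the service is never shut down
DAYS_PER_WEEK = 7
WEEKS_PER_SEMESTER = 12     # assumption: length of one semester


def semester_cost(rate=HOURLY_RATE, weeks=WEEKS_PER_SEMESTER):
    """Cost of running continuously for one semester at the hourly rate."""
    hours = HOURS_PER_DAY * DAYS_PER_WEEK * weeks
    return rate * hours


cost = round(semester_cost(), 2)  # 2016 hours at $0.99 -> 1995.84
```

Under those assumptions the total comes to roughly $2000; shutting the instance down outside peak periods would lower it.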
Zementis has partnered with
RevoDeployR to create their
web-based subscription service using
RevoDeployR. So if RevoDeployR is
part of your LA architecture, it could
provide the same functionality using
your in house service.
Network Analysis
Network analysis focuses on the relationships between entities. Whether the entities are students, researchers, learning objects or ideas,
network analysis attempts to understand how the entities are connected rather than understand the attributes of the entities. Measures
include density, centrality, connectivity, betweenness and degrees. This is an important area to explore as we take up the challenge from Herbert Simon
(of Carnegie Mellon University) and nudge learning and teaching ‘from a solo sport to a community based research activity’.
Network analysis can not only help us identify patterns that reveal disconnected students or help predict success based on network
metrics; these tools can also help students develop the networking skills that will be required for successful lifelong learning and research.
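Two of the measures named above, density and degree centrality, are simple enough to compute by hand. The sketch below does so for a small, made-up forum reply network and flags isolated students. The student names, the edge list and the function names are hypothetical, and the tools reviewed in this section compute these and many more metrics for you.

```python
def build_graph(nodes, edges):
    """Build an undirected adjacency map from a node list and edge list."""
    graph = {node: set() for node in nodes}
    for a, b in edges:
        if a != b:  # ignore self-replies
            graph[a].add(b)
            graph[b].add(a)
    return graph


def density(graph):
    """Share of possible undirected ties that are actually present."""
    n = len(graph)
    if n < 2:
        return 0.0
    edge_count = sum(len(nbrs) for nbrs in graph.values()) / 2
    return edge_count / (n * (n - 1) / 2)


def degree_centrality(graph):
    """Each node's degree divided by the maximum possible degree (n - 1)."""
    n = len(graph)
    if n < 2:
        return {node: 0.0 for node in graph}
    return {node: len(nbrs) / (n - 1) for node, nbrs in graph.items()}


def isolated(graph):
    """Nodes with no ties at all, e.g. students nobody has replied to."""
    return sorted(node for node, nbrs in graph.items() if not nbrs)


students = ["ana", "ben", "cam", "dee"]
replies = [("ana", "ben"), ("ben", "cam"), ("ana", "cam")]
g = build_graph(students, replies)
```

Here `density(g)` is 0.5 (3 of 6 possible ties) and `isolated(g)` returns `["dee"]`, the kind of signal that could prompt an instructor to reach out.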
Social Networks Adapting
Pedagogical Practice (SNAPP) is a
network visualization tool that is
delivered as a 'bookmarklet'. Users
can easily create network
visualizations from LMS forums in
real time.
Self Assessment Tool for Students
SNAPP provides students with easy
access to network visualizations of
forum postings. These diagrams can
help students understand their
contribution to class discussions.
Identify at Risk Students/ Monitor
Impact of Learning Activity
Network Analysis visualizations can
help faculty identify students that
may be isolated. They can also be
used to see if specific activities have
impacted the class network.
NodeXL is an Excel add-on that
creates network visualizations from
a worksheet containing the lists of
edges. The tool provides the ability
to calculate common networking
measures such as density, centrality,
connectivity, betweenness and
degrees. Data can be exported in a
format that can be imported into
Gephi for further analysis or refined
Sophisticated Network Analysis
Both NodeXL and Gephi can be used
to explore network patterns. These
tools are useful for researchers. It
would be interesting to explore the
relationship between these network metrics
(e.g. centrality and betweenness) and
student success.
Gephi offers a standalone product
for analyzing networks. It is the
most advanced of the three network
analysis tools described in this
Simon Buckingham Shum and Anna De
Liddo have developed an enhanced
diigo-like tagging/bookmark tool
that allows a user to link their
contributions to other ideas and
websites with descriptive adjectives.
Idea Creation
While this tool provides the creators
with data that is useful to conduct
their discourse analysis research it
also provides people/researchers
with a tool that may help connect
them to people that have related
interests and ideas and may help to
stimulate new ideas and
Other Tools for Analysis
Viralheat provides a full-featured
tool set and an API that helps
monitor web content for specific
mentions of people, products and
Monitor and Evaluate
Course/Program Satisfaction
This relatively cheap analytics
offering could help introduce the
use of analytics by helping evaluate
a recruitment drive/strategy or
fundraising campaign.
Wolfram API
Princeton University provides a
lexical database that links English
words (or sets of words) by their
common meaning. It is essentially a
database that helps identify
Identify Main Concepts found in a
Learning Object / Forum Post
Leximancer provides sophisticated
text analysis and presentation of
concepts found in a learning object.
The API can return interactive
concept maps demonstrating how
different ideas connect.
The tool provides the ability to drill
from the concept map down to the
text that spawned the concept map.
Identify Main Concepts found in a
Learning Object / Forum Post
The Wolfram Alpha API provides
developers with the ability to
submit free text/questions from a
website to the Wolfram Alpha
engine and have the results
Dynamic Content Delivery
This lexical database is used in text
analysis to replace similar words
with one common descriptor.
Leximancer could be used to help
consolidate the main ideas of a
lecture or discussion groups. It can
also provide students with easy
access to the detailed discussion and
material related to a concept via a
link from the concept map to the
discussion forum posting.
The Wolfram Alpha API could be used to
provide supplemental material to
on-line discussion.
Linked Data
If Tim Berners-Lee's vision of linked data (http://www.ted.com/talks/tim_berners_lee_on_the_next_web.html) is successful in
transforming the internet into a huge database, the value of delivering content via courses and programs will diminish and
universities will need to find new ways of adding value to learning. Developing tools that can facilitate access to relevant
content using linked data could be one way that universities remain relevant in the higher learning sector.
e.g. DBPedia
Ontologies are essentially an agreed
upon concept map for a particular
domain of knowledge.
Reuters offers this free API that
takes text input and returns tags
that will link the concepts in the text
to other linked data on the web.
Dynamically Deliver Relevant
Using OpenCalais along with well-defined ontologies provides a
mechanism for dynamically
delivering/suggesting related
The presentation of the data after it has been extracted, cleansed and analyzed is critical to successfully engage students in learning and
acting on the information that is presented.
Google Visualization
Google Visualization provides an API
to their chart library allowing for the
creation of charts and other
visualizations. They have recently
released an API to add interactive
controls to their charts.
Protovis and D3 are JavaScript
frameworks for creating web-based
visualizations. Protovis is no longer
an active open source project. It has
been replaced by D3.
Interactive Learning Dashboards
All of these tools are useful for
creating visualizations for learning
feedback systems such as
Learning how to use these
tools/libraries requires a fair amount
of effort. Developer retention is a
risk for system maintenance and
The Motion Chart (purchased from
Gapminder) is one of my favourite
interactive charts that Google
provides access to via their API.
All of these tools can present data as
heat maps, network analysis
diagrams and tree maps. Here's a
link to an example dashboard
created in D3, presenting university
admission data.
Fusion Charts provides a commercial
JavaScript framework for creating
dynamic visualizations.
Reporting Suites
Many universities have reporting
tools available to create
visualizations. Tools include
Tableau, Cognos, Pentaho and
Jasper Reports.
All of these vendors provide good
tools to create reports and
dashboards. My favourite is
Tableau, however, JasperReports or
Pentaho are much more affordable.
Full Analytics Offerings
Using LOCO (Learning Object
Context Ontologies) student on-line
activities are mapped to specific
learning objectives. The tool set
provides faculty with feedback
related to how well material has
been understood; it also
provides network visualizations
describing student interaction.
The tool provides a framework for
describing on-line learning
Faculty Feedback Related to
Learning Success
Datameer provides a full set of tools
(http://www.datame allowing users to conduct advanced
analytics on Hadoop based data.
Engage Faculty in Learning
I like Datameer's wizard based
approach to user controlled
analytics. It provides some ideas on
how one could provide faculty with
the ability to contribute or reuse
predictive models, quickly test
historic data, deploy a learning
analytics algorithm and present the
results in a learning dashboard.
This approach may be too
complicated for delivery to the
masses, as I suspect that the majority
of faculty will want something that
requires less effort.