Data Mining Examples

Formal Foundations of Clustering
Clustering is a central unsupervised learning task with a wide variety of applications. However, in spite of
its popularity, it lacks a unified theoretical foundation. Recently, there has been work aimed at developing
such a theory. We discuss recent advances in clustering theory, starting with results on clustering axioms.
We will then discuss a new framework for addressing one of the most prominent problems in the field,
the selection of a clustering algorithm for a specific task. The framework rests on the identification of
central properties capturing the input-output behavior of clustering paradigms. We present several results
in this direction, including a characterization of linkage-based clustering methods.
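As a rough illustration of what a linkage-based method does, the sketch below implements single-linkage agglomerative clustering: every point starts as its own cluster, and the two closest clusters are merged until k remain. The function name, the stopping rule (a fixed k), and the use of Euclidean distance are assumptions made for this example; they are not taken from the talk.

```python
import math
from itertools import combinations


def single_linkage(points, k):
    """Merge the two closest clusters until only k clusters remain.

    Illustrative sketch: `points` is a list of coordinate tuples and
    `k` is the desired number of clusters (both are assumptions).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    clusters = [[p] for p in points]

    def linkage(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > k:
        # Pick the pair of clusters with the smallest linkage distance.
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]  # j > i, so index i is unaffected
    return clusters


example = single_linkage([(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)], k=3)
```

Replacing `min` with `max` or an average inside `linkage` would give complete or average linkage; other members of the linkage-based family differ in this one function.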
Towards Deeper Semantics: Composition of Semantic Relations and Semantic Representation of
Semantic representation of text is key to text understanding, performing inferences and reasoning.
In this talk, I focus on extracting meaning ignored by existing semantic parsers. I present a
framework to compose semantic relations and introduce a novel representation for negated
statements. Composition of semantic relations automatically obtains inference axioms in an
unsupervised manner. In contrast to approaches that extract relations from text, axioms take as
their input relations previously extracted and output a new relation. The key to this model is an
extended definition for semantic relations including semantic primitives coupled with domain and
range restrictions. The novel representation for negated statements reveals implicit positive
meanings and is grounded on detecting the focus of negation.
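The idea of an axiom that consumes already-extracted relations and emits a new one can be sketched as follows. The relation names, their domain and range types, and the single axiom in the table are hypothetical placeholders chosen for illustration; they are not the inventory or the axioms from the talk.

```python
# Hypothetical relation inventory: name -> (domain type, range type).
RELATIONS = {
    "PART_OF": ("object", "object"),
    "LOCATED_AT": ("object", "place"),
}

# Hypothetical axiom table: (R1, R2) -> R3,
# read as "R1(x, y) and R2(y, z) imply R3(x, z)".
AXIOMS = {("PART_OF", "LOCATED_AT"): "LOCATED_AT"}


def compose(facts):
    """Apply each axiom once to already-extracted (relation, x, y) triples."""
    inferred = set()
    for r1, x, y1 in facts:
        for r2, y2, z in facts:
            r3 = AXIOMS.get((r1, r2))
            # Chain only through a shared argument, and only when the
            # range of R1 is compatible with the domain of R2.
            if r3 and y1 == y2 and RELATIONS[r1][1] == RELATIONS[r2][0]:
                inferred.add((r3, x, z))
    # Report only relations that were not already in the input.
    return inferred - set(facts)


facts = [("PART_OF", "engine", "car"), ("LOCATED_AT", "car", "garage")]
new_facts = compose(facts)
```

Note that `compose` never reads text: its input is the output of an earlier relation extractor, which is the contrast the abstract draws with extraction-based approaches.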
Title: Discourse-Guided and Multi-faceted Event Recognition from Text
Events are an important type of information found throughout text.
Accurately extracting significant events from large volumes of text informs the government, companies
and the public regarding possible changing circumstances caused or implied by events.
Finding event information completely and accurately is challenging mainly due to the high complexity of
discourse phenomena. In this talk, I will present two discourse-guided event extraction architectures that
explore evidence and clues from wider discourse to seek out or validate pieces of event descriptions.
TIER is a multilayered event extraction architecture that performs text analysis at multiple granularities to
progressively "zoom in" on relevant event information. LINKER is a more principled discourse-guided
approach that models textual cohesion properties in a single structured sentence classifier.
To enable fast configuration of domain-specific event extraction systems, I will also present my work on
relieving the human annotation bottleneck that hinders current system training. In particular, I will focus on
the recent multi-faceted event recognition approach that uses event defining characteristics (facets), in
addition to event expressions, to effectively resolve the complexity of event descriptions. I will present a
novel bootstrapping algorithm that can automatically learn both event expressions and facets from
unannotated texts.
Title: Towards Large Scale Open Domain Natural Language Processing
Abstract: Machine Learning and Inference methods are becoming ubiquitous – a broad range of
scientific advances and technologies rely on machine learning techniques. In particular, the big
data revolution heavily depends on our ability to use statistical machine learning methods to
make sense of the large amounts of data we have. Research in Natural Language Processing
(NLP) has both benefited from and contributed to the advancement of machine learning and inference
methods. However, multiple problems still hinder the broad application of some of these
methods. Performance degradation of machine learning based systems in domains other than the
training domain is one of the key problems delaying widespread deployment of these systems. In
this talk, I will present techniques for "on the fly" domain adaptation that allow adaptation
without model retraining; with our methods, one can successfully apply a model trained on a
source domain to new domains. This is accomplished by transforming text from the test domain
to (statistically) look more like the training domain, where the trained model is accurate. This
process of text adaptation treats the model as a black box and thus enables the adaptation of complex
pipelines of models, which are commonly used in NLP. The next key challenge for machine
learning is scalability to processing vast amounts of data. Making decisions in NLP and other
disciplines often requires solving inference problems (e.g., Integer Linear Programs) many times
over the lifetime of the system, typically once per sentence. I will present a theory, and
significant experimental results, showing how to amortize the cost of inference over the lifetime
of any machine learning tool. I will present theorems for speeding up integer linear programs that
correspond to new, previously unseen problems, by reusing solutions of previously solved
integer programs. Our experiments indicate that we can cut up to 85% of the calls to an inference
Opinion Mining for the Internet Marketplace: Models, Algorithms and Predictive Analytics
The massive amount of user generated content in social media offers new forms of actionable
intelligence. Public sentiments in debates, blogs, and news comments are crucial to governmental
agencies for passing bills, gauging social unrest, and predicting elections and socio-economic indices. The
goal of my research is to build robust statistical models for opinion mining with applications to
marketing, social, and behavioral sciences. To achieve this goal, a number of research challenges need to
be addressed. The first challenge is to filter deceptive opinion spam/fraud. It is estimated that 15-20% of
opinions on the Web are fake. Hence, detecting opinion spam is a precondition for reliable opinion
mining. The second challenge is fine-grained information extraction to capture diverse types of opinions
(e.g., agreement/disagreement; contention/controversy, etc.) and latent sentiments expressed in debates
and discussions. The state-of-the-art machinery (e.g., topic modeling) falls short for such a task. I develop
several novel knowledge-induced sentiment topic models that respect notions of human semantics. The
third challenge is that social sentiments are inherently dynamic and change over time. To leverage the
sentiments over time for predictive analytics (e.g., predicting financial markets), I develop non-parametric
sentiment time-series and vector autoregression models. This talk is structured with a two-pronged focus:
(1) Statistical models for opinion mining and predictive analytics, (2) Models for filtering deceptive
opinion spam/fraud.
In the first part of the talk, I will present novel statistical models for sentiment analysis and talk about two
key frameworks: (1) Semi-supervised graphical models for mining fine-grained opinions in social
conversations, and (2) Bayesian nonparametrics, sentiment time-series, and vector autoregression models
for stock market prediction.
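For readers unfamiliar with vector autoregression, the sketch below fits a first-order model y_t = A y_{t-1} to a two-variable series by least squares and uses it for a one-step forecast. Treating the two variables as a sentiment score and a market return is an assumption made for illustration, as are the function names, the absence of an intercept, and the synthetic data; the models in the talk are not specified here and are likely richer.

```python
def fit_var1(series):
    """Least-squares fit of y_t = A y_{t-1} for a two-variable series.

    Illustrative sketch: `series` is a list of [v0, v1] observations,
    e.g. (hypothetically) [sentiment, return] per day; no intercept.
    """
    # Accumulate S = sum y_{t-1} y_{t-1}^T and C = sum y_t y_{t-1}^T.
    s = [[0.0, 0.0], [0.0, 0.0]]
    c = [[0.0, 0.0], [0.0, 0.0]]
    for prev, cur in zip(series, series[1:]):
        for i in range(2):
            for j in range(2):
                s[i][j] += prev[i] * prev[j]
                c[i][j] += cur[i] * prev[j]
    det = s[0][0] * s[1][1] - s[0][1] * s[1][0]
    if det == 0:
        raise ValueError("lagged observations are collinear; A is not identifiable")
    inv = [[s[1][1] / det, -s[0][1] / det],
           [-s[1][0] / det, s[0][0] / det]]
    # Normal equations: A = C S^{-1}.
    return [[sum(c[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def predict_next(a, last):
    """One-step-ahead forecast from the most recent observation."""
    return [a[i][0] * last[0] + a[i][1] * last[1] for i in range(2)]


# Synthetic, noise-free data generated from a known coefficient matrix.
true_a = [[0.5, 0.1], [0.2, 0.3]]
series = [[1.0, 2.0]]
for _ in range(5):
    series.append(predict_next(true_a, series[-1]))
estimated_a = fit_var1(series)
```

The off-diagonal entries of the fitted matrix are what make the model a vector autoregression: they let the lagged value of one series (here, sentiment) inform the forecast of the other.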
In the second part of the talk, I will discuss the problem of opinion spam and shed light on some
techniques for filtering opinion spam. The focus will be on modeling collusion and combating
Groupspam in e-Commerce reviews. The talk will conclude with a discussion about my ongoing research and
future research vision in opinion contagions, forecasting socio-economic indices, and healthcare.