Formal Foundations of Clustering

Clustering is a central unsupervised learning task with a wide variety of applications. In spite of its popularity, however, it lacks a unified theoretical foundation, and there has recently been work aimed at developing such a theory. We discuss recent advances in clustering theory, starting with results on clustering axioms. We then turn to a new framework for addressing one of the most prominent problems in the field: the selection of a clustering algorithm for a specific task. The framework rests on the identification of central properties that capture the input-output behavior of clustering paradigms. We present several results in this direction, including a characterization of linkage-based clustering methods.

=========================================================================

Towards Deeper Semantics: Composition of Semantic Relations and Semantic Representation of Negation

Semantic representation of text is key to text understanding, inference, and reasoning. In this talk, I focus on extracting meaning ignored by existing semantic parsers. I present a framework for composing semantic relations and introduce a novel representation for negated statements. Composition of semantic relations obtains inference axioms automatically, in an unsupervised manner. In contrast to approaches that extract relations from text, these axioms take previously extracted relations as input and output a new relation. The key to this model is an extended definition of semantic relations that includes semantic primitives coupled with domain and range restrictions. The novel representation for negated statements reveals implicit positive meanings and is grounded in detecting the focus of negation.

=========================================================================

Title: Discourse-Guided and Multi-faceted Event Recognition from Text

Events are an important type of information found throughout text. Accurately extracting significant events from large volumes of text can inform governments, companies, and the public about changing circumstances caused or implied by those events. Finding event information completely and accurately is challenging, mainly because of the complexity of discourse phenomena. In this talk, I will present two discourse-guided event extraction architectures that use evidence and clues from the wider discourse to seek out or validate pieces of event descriptions. TIER is a multilayered event extraction architecture that performs text analysis at multiple granularities to progressively "zoom in" on relevant event information. LINKER is a more principled discourse-guided approach that models textual cohesion properties in a single structured sentence classifier. To enable fast configuration of domain-specific event extraction systems, I will also present my work on relieving the human annotation bottleneck that hinders current system training. In particular, I will focus on a recent multi-faceted event recognition approach that uses event-defining characteristics (facets), in addition to event expressions, to resolve the complexity of event descriptions. I will present a novel bootstrapping algorithm that automatically learns both event expressions and facets from unannotated texts.
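As a rough, self-contained sketch of the kind of alternating bootstrapping loop described above (the toy corpus, thresholds, and the function name bootstrap are hypothetical; this is not the talk's actual algorithm), one step grows the facet lexicon from sentences containing known event expressions, and the next grows the expression lexicon from sentences containing known facets:

    # Illustrative bootstrapping loop: alternately grow a set of event
    # expressions and a set of facet phrases from unannotated sentences.
    from collections import Counter

    def bootstrap(sentences, seed_expressions, rounds=3, min_count=2):
        expressions = set(seed_expressions)
        facets = set()
        for _ in range(rounds):
            # 1) Sentences containing a known expression vote for candidate
            #    facets (here simply the other tokens in the sentence).
            facet_votes = Counter()
            for tokens in sentences:
                if expressions & set(tokens):
                    facet_votes.update(t for t in tokens if t not in expressions)
            facets |= {t for t, c in facet_votes.items() if c >= min_count}

            # 2) Sentences containing a known facet vote for candidate expressions.
            expr_votes = Counter()
            for tokens in sentences:
                if facets & set(tokens):
                    expr_votes.update(t for t in tokens if t not in facets)
            expressions |= {t for t, c in expr_votes.items() if c >= min_count}
        return expressions, facets

    corpus = [
        ["outbreak", "reported", "in", "the", "region"],
        ["officials", "reported", "an", "outbreak", "of", "flu"],
        ["flu", "cases", "confirmed", "in", "the", "region"],
    ]
    print(bootstrap(corpus, seed_expressions={"reported"}))

In practice the candidates would be scored with corpus statistics rather than raw co-occurrence counts, but the alternating structure is the same.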
=========================================================================

Title: Towards Large Scale Open Domain Natural Language Processing

Abstract: Machine Learning and Inference methods are becoming ubiquitous: a broad range of scientific advances and technologies rely on machine learning techniques. In particular, the big data revolution depends heavily on our ability to use statistical machine learning methods to make sense of the large amounts of data we have. Research in Natural Language Processing (NLP) has both benefited from and contributed to the advancement of machine learning and inference methods. However, multiple problems still hinder the broad application of some of these methods. Performance degradation of machine-learning-based systems in domains other than the training domain is one of the key problems delaying their widespread deployment. In this talk, I will present techniques for "on the fly" domain adaptation that allow adaptation without model retraining; with our methods, one can successfully apply a model trained on a source domain to new domains. This is accomplished by transforming text from the test domain so that it (statistically) looks more like the training domain, where the trained model is accurate. This process of text adaptation treats the model as a black box, which makes it possible to adapt the complex pipelines of models that are commonly used in NLP.

The next key challenge for machine learning is scaling to vast amounts of data. Making decisions in NLP and other disciplines often requires solving inference problems (e.g., Integer Linear Programs) many times over the lifetime of a system, typically once per sentence. I will present a theory, along with significant experimental results, showing how to amortize the cost of inference over the lifetime of any machine learning tool. I will present theorems for speeding up integer linear programs that correspond to new, previously unseen problems by reusing the solutions of previously solved integer programs. Our experiments indicate that we can cut up to 85% of the calls to an inference engine.
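A minimal sketch of the simplest form of this reuse, exact matching against a previously solved program (the brute-force 0-1 solver and cache below are purely illustrative, and the talk's theorems go further by reusing solutions for programs that are new but related):

    # Illustrative amortized inference: cache solutions to small 0-1 ILPs and
    # reuse them when an identical program (same objective and constraints)
    # reappears, so the solver is only invoked for genuinely new programs.
    from itertools import product

    _cache = {}
    solver_calls = 0

    def solve_ilp(costs, constraints):
        """Maximize costs . x over x in {0,1}^n subject to each
        constraint (coeffs, bound) meaning coeffs . x <= bound."""
        global solver_calls
        key = (tuple(costs), tuple((tuple(a), b) for a, b in constraints))
        if key in _cache:                  # reuse a previously solved program
            return _cache[key]
        solver_calls += 1                  # otherwise pay for a solver call
        best, best_val = None, float("-inf")
        for x in product((0, 1), repeat=len(costs)):
            if all(sum(a_i * x_i for a_i, x_i in zip(a, x)) <= b
                   for a, b in constraints):
                val = sum(c * x_i for c, x_i in zip(costs, x))
                if val > best_val:
                    best, best_val = x, val
        _cache[key] = (best, best_val)
        return best, best_val

    # Two sentences that induce the same program: the second one is free.
    print(solve_ilp([3, 2, 1], [([1, 1, 1], 2)]))
    print(solve_ilp([3, 2, 1], [([1, 1, 1], 2)]))
    print("solver calls:", solver_calls)   # -> 1

Many sentences induce structurally identical programs, which is what makes even this naive cache useful; the theorems mentioned above extend reuse to programs that are similar but not identical.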
=====================================================================

Opinion Mining for the Internet Marketplace: Models, Algorithms and Predictive Analytics

The massive amount of user-generated content in social media offers new forms of actionable intelligence. Public sentiment in debates, blogs, and news comments is crucial to governmental agencies for passing bills, gauging social unrest, predicting elections, and forecasting socio-economic indices. The goal of my research is to build robust statistical models for opinion mining, with applications to marketing and the social and behavioral sciences. To achieve this goal, a number of research challenges need to be addressed. The first challenge is filtering deceptive opinion spam/fraud: an estimated 15-20% of opinions on the Web are fake, so detecting opinion spam is a precondition for reliable opinion mining. The second challenge is fine-grained information extraction that captures diverse types of opinions (e.g., agreement/disagreement, contention/controversy) and latent sentiments expressed in debates and discussions. State-of-the-art machinery (e.g., topic modeling) falls short for this task, so I develop several novel knowledge-induced sentiment topic models that respect notions of human semantics. The third challenge is that social sentiments are inherently dynamic and change over time. To leverage sentiment over time for predictive analytics (e.g., predicting financial markets), I develop non-parametric sentiment time-series and vector autoregression models. This talk has a two-pronged focus: (1) statistical models for opinion mining and predictive analytics, and (2) models for filtering deceptive opinion spam/fraud.

In the first part of the talk, I will present novel statistical models for sentiment analysis and discuss two key frameworks: (1) semi-supervised graphical models for mining fine-grained opinions in social conversations, and (2) Bayesian nonparametrics, sentiment time-series, and vector autoregression models for stock market prediction. In the second part, I will discuss the problem of opinion spam and shed light on techniques for filtering it, focusing on modeling collusion and combating Groupspam in e-commerce reviews. The talk will conclude with a discussion of my ongoing research and future research vision in opinion contagion, forecasting of socio-economic indices, and healthcare.
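As a rough illustration of the vector-autoregression component only (synthetic data, a single lag, and ordinary least squares estimation; the talk's models are non-parametric and considerably richer), a VAR(1) over a daily sentiment score and a market return can produce a one-step-ahead forecast like this:

    # Minimal VAR(1) sketch: jointly model a daily sentiment score and a
    # market return, fit the coefficients by least squares, and make a
    # one-step-ahead forecast. Data and lag order are purely illustrative.
    import numpy as np

    rng = np.random.default_rng(0)
    T = 200
    sentiment = np.zeros(T)
    returns = np.zeros(T)
    for t in range(1, T):
        sentiment[t] = 0.6 * sentiment[t - 1] + rng.normal(scale=0.5)
        returns[t] = 0.3 * sentiment[t - 1] + 0.1 * returns[t - 1] + rng.normal(scale=0.5)

    Y = np.column_stack([sentiment, returns])          # shape (T, 2)
    X = np.hstack([np.ones((T - 1, 1)), Y[:-1]])       # intercept + lag-1 values
    A, *_ = np.linalg.lstsq(X, Y[1:], rcond=None)      # (3, 2) coefficient matrix

    x_last = np.concatenate(([1.0], Y[-1]))
    forecast = x_last @ A                              # next-day [sentiment, return]
    print("one-step-ahead forecast:", forecast)

The point of the joint model is that lagged sentiment enters the equation for returns (and vice versa), so shifts in public mood can feed directly into the market forecast.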