Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence

Learning with Imprecise Classes, Rare Instances, and Complex Relationships

Srinath Ravindran
Department of Computer Science, North Carolina State University, Raleigh, North Carolina 27695

Copyright © 2011, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.

Abstract

In applications including chemoinformatics, bioinformatics, information retrieval, text classification, computer vision, and others, a variety of common issues have been identified involving frequency of occurrence, variation and similarity of instances, and lack of precise class labels. These issues continue to be important hurdles in machine intelligence, and my doctoral thesis focuses on developing robust machine learning models that address them.

Problem Description

There are a variety of machine learning approaches that work well for problems involving IID data and cases with enough examples to represent the variation in the data. However, in many practical applications we may not have sufficient examples, the data may not be drawn from identical distributions, and there may be complex relations among individual data points. The problems that make the learning task difficult and often error prone typically fall into the following categories:

1. Within-class variation: instances within one category or class have different properties.
2. Inter-class similarity: instances from different categories or classes may have similar properties.
3. Rare instances: instances may be unseen or may be the only representatives of their kind, and could be present in either the training or the test data.
4. Lack of labeled instances: instances either belong to an unknown class or are assigned an imprecise class label.

Over the past few years, a variety of approaches have been developed to address each of these issues. Existing approaches to prediction most often do not consider these problems together, instead treating them as separate problems. However, many applications exhibit some combination of them. An example application is chemical toxicity prediction. Chemicals belong to various congeneric classes such as acids, alcohols, and amines. Often the chemicals in one class exhibit properties similar to chemicals in another congeneric class while differing from chemicals in the same class. The former phenomenon is an example of inter-class similarity and the latter of within-class variation. Moreover, the occurrence of some chemicals may be rare or unseen: a chemical may be the only representative of its class in the training data, or a chemical that is not present in the training set could be presented at a later stage (the test set) for a prediction task. Finally, the labels assigned to chemicals are determined experimentally, a task that is costly and, as often in other applications, error prone.

The most important challenge when considering rare instances and cases with a lack of examples is handling noisy data. The foremost task is to build a system that can tell the difference between a noisy instance and a rare instance. Moreover, many applications deal with large amounts of data, both in dimensionality and in the number of instances, which poses additional challenges in memory and time complexity.

The problems discussed above span all aspects of machine learning: classification, regression, and reinforcement learning. My thesis focuses on supervised learning tasks.

Research Questions

Many approaches have been demonstrated to successfully address each of the four problems discussed above. The more prominent approaches include multilevel models (Bach 2008; Gelman and Hill 2006), active learning (Cohn, Atlas, and Ladner 1994; Settles 2010), mixture or ensemble models (Bahler and Navarro 2000; Jordan and Jacobs 1994; Bishop 2007), multiple-instance learning (Zhou 2004), and transfer learning techniques (Pan and Yang 2008).

Multilevel modeling, also referred to as hierarchical modeling, is an increasingly popular approach to modeling data and is known to outperform classical regression in predictive accuracy (Gelman 2005). Multilevel models provide improved generalization in prediction tasks across various application domains, and have been studied under various guises, including Bayesian models (Gelman and Hill 2006), computer vision and visual cortex simulation (Sudderth et al. 2005; Bouvrie 2009), and linear models (Gelman and Hill 2006). However, multilevel models have found limited use in prediction problems in the presence of rare instances. The foremost question to address is: given the power of multilevel models, can they be applied to prediction in the presence of the problems discussed earlier? As we shall discuss in the next section, our current work has shown promising results in this direction.

While each of the first three problems has received attention for a relatively long time, the fourth problem ("lack of labeled instances") has attracted interest only recently. While active and semi-supervised learning models have been proposed, open questions remain concerning the quality of the labels in the data. Labeling data instances is not only costly but also error prone, for a variety of reasons such as labeler inexperience and fatigue. An instance could be assigned a wrong label, assigned an imprecise label, or left unlabeled. Apart from noisy or imprecise labels, class imbalance, within-class variation, and inter-class similarity remain major concerns in active learning. My thesis aims to address both imprecise labels and class imbalance for active learning. The major challenge here is establishing a trade-off between the cost of labeling and the classification error. At present, we are working on a solution to learning in the presence of imprecise class labels.

Current Progress

We have developed a multilevel model for supervised prediction tasks that achieves better performance than existing models in the presence of within-class variation, inter-class similarity, and rare instances, without compromising the overall error rate. Across multiple domains, the model generalizes at least as well as most existing models while correctly predicting more rare instances. A paper describing this work is currently under review.

In related previous work, we addressed the issue of detecting interesting patterns in data, especially determining whether a less frequent or rare pattern is interesting. Existing methods suffer from a variety of shortcomings. To begin, their output depends on the choice of a threshold for the value of support. If the support threshold is low, they tend to generate a large number of patterns, many of which are "uninteresting". Even if the threshold is sufficiently large, some of the patterns generated may already be known to the user as ground truth, while some interesting but infrequent patterns may be mistakenly overlooked. Our approach, being subjective, uses the relationships between the entities of a pattern as a factor in determining its interestingness. The major limitation of research in this direction is the lack of datasets or standards that define "interest", since interest is context dependent and subjective.

Future Work

Much multilevel modeling work has focused on batch learning and supervised learning. However, some applications require online learning, which often poses all four of the problems discussed earlier. We must investigate the performance of multilevel models in such applications and, if required, identify an alternative approach for online learning in the presence of the four problems.

Feature extraction and selection play a vital role in various machine learning tasks. More importantly, the presence of some features may be helpful in predicting rare cases. It is important to study the effect of both feature selection and feature extraction on improving prediction in the presence of the four problems discussed earlier.

We believe that sampling interesting instances could help improve the performance of active learning, especially in the presence of class imbalance. This would be a natural extension of my work on interestingness.

Finally, there is increasing interest in multitask learning and in learning from instances with multiple labels. All the problems addressed in my work arise in these two settings as well, making them an important direction for future research.

References

Bach, F. 2008. Exploring large feature spaces with hierarchical multiple kernel learning. CoRR abs/0809.1493.
Bahler, D., and Navarro, L. 2000. Methods for combining heterogeneous sets of classifiers. In Proceedings of the Seventeenth National Conference on Artificial Intelligence, Workshop on New Research Problems for Machine Learning. AAAI Press / The MIT Press.
Bishop, C. 2007. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer, 1st ed. 2006, corr. 2nd printing.
Bouvrie, J. V. 2009. Hierarchical Learning: Theory with Applications in Speech and Vision. Ph.D. Dissertation, Massachusetts Institute of Technology.
Cohn, D.; Atlas, L.; and Ladner, R. 1994. Improving generalization with active learning. Machine Learning 15:201–221.
Gelman, A., and Hill, J. 2006. Data Analysis Using Regression and Multilevel/Hierarchical Models. Analytical Methods for Social Research. Cambridge University Press.
Gelman, A. 2005. Multilevel (hierarchical) modeling: What it can and can't do. Technometrics.
Jordan, M. I., and Jacobs, R. A. 1994. Hierarchical mixtures of experts and the EM algorithm. Neural Computation 6(2):181–214.
Pan, S. J., and Yang, Q. 2008. A survey on transfer learning. Technical Report HKUST-CS08-08, Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China.
Settles, B. 2010. Active learning literature survey. Technical report, University of Wisconsin–Madison.
Sudderth, E. B.; Torralba, A.; Freeman, W. T.; and Willsky, A. S. 2005. Learning hierarchical models of scenes, objects, and parts. In ICCV, 1331–1338. IEEE Computer Society.
Zhou, Z.-H. 2004. Multi-instance learning: A survey. Technical report, Department of Computer Science and Technology, Nanjing University, China.
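As an illustration of the multilevel-modeling idea behind the thesis (this is a minimal sketch, not the thesis model; the toxicity values, variance settings, and function name are hypothetical), partial pooling shrinks each class's mean toward the grand mean in proportion to how few examples the class has, which is exactly the rare-instance regime:

```python
# Partial pooling: shrink each class mean toward the grand mean.
# Classes with few observations (rare classes) borrow strength from
# the pooled estimate; well-populated classes keep their own mean.
# All numbers are illustrative, not from the thesis.

def partial_pool(groups, sigma2_within=1.0, tau2_between=1.0):
    """groups: dict mapping class label -> list of observed values.
    Returns dict mapping class label -> shrunken mean estimate."""
    all_values = [v for vs in groups.values() for v in vs]
    grand_mean = sum(all_values) / len(all_values)
    estimates = {}
    for label, values in groups.items():
        n = len(values)
        group_mean = sum(values) / n
        # Precision-weighted compromise between the group mean and the
        # grand mean: the weight on the group mean grows with n.
        w = (n / sigma2_within) / (n / sigma2_within + 1.0 / tau2_between)
        estimates[label] = w * group_mean + (1.0 - w) * grand_mean
    return estimates

toxicity = {
    "alcohols": [2.1, 2.3, 1.9, 2.2, 2.0, 2.1],  # well-represented class
    "amines": [3.0],                              # rare: a single example
}
est = partial_pool(toxicity)
```

With equal within-class and between-class variances and a single observation, the weight w is 0.5, so a rare class's estimate sits halfway between its lone observation and the grand mean, rather than trusting one possibly noisy instance outright.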
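The trade-off between labeling cost and classification error in active learning is commonly managed by uncertainty sampling: spend the labeling budget on the instance the current model is least sure about. A minimal binary-classification sketch, with hypothetical instance ids and model probabilities (not the thesis system):

```python
# Uncertainty sampling: query the label of the instance whose predicted
# positive-class probability is closest to 0.5, i.e. where a binary
# model is least certain. Ids and probabilities are hypothetical.

def pick_query(pool):
    """pool: list of (instance_id, p_positive) pairs.
    Returns the id of the instance to send to the labeler next."""
    return min(pool, key=lambda pair: abs(pair[1] - 0.5))[0]

pool = [("x1", 0.95), ("x2", 0.52), ("x3", 0.10), ("x4", 0.70)]
query = pick_query(pool)  # "x2": probability 0.52 is nearest to 0.5
```

Note that this baseline assumes the returned labels are correct; under imprecise or noisy labels, the very instances the model is least sure about are often the ones labelers also find hardest, which is the tension the thesis targets.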
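The support-threshold sensitivity noted under Current Progress can be made concrete with a few lines of code. The transactions and thresholds below are hypothetical, chosen so that a low minimum support admits the rare pattern along with less interesting ones, while a modest increase discards it entirely:

```python
from itertools import combinations

# Count the support (fraction of transactions containing an itemset)
# of every item pair, then filter by a minimum-support threshold.
# Transactions are hypothetical, chosen to show threshold sensitivity.

transactions = [
    {"acid", "alcohol"}, {"acid", "alcohol"}, {"acid", "alcohol"},
    {"acid", "amine"},   {"alcohol", "amine"},
    {"acid", "alcohol"}, {"rare_a", "rare_b"},  # one rare co-occurrence
]

def frequent_pairs(transactions, min_support):
    """Return {pair: support} for pairs meeting the support threshold."""
    items = sorted(set().union(*transactions))
    n = len(transactions)
    result = {}
    for pair in combinations(items, 2):
        support = sum(set(pair) <= t for t in transactions) / n
        if support >= min_support:
            result[pair] = support
    return result

low = frequent_pairs(transactions, min_support=0.1)   # rare pair kept, with others
high = frequent_pairs(transactions, min_support=0.3)  # rare pair vanishes
```

At min_support=0.1 the rare ("rare_a", "rare_b") pair survives but so do other marginal pairs; at 0.3 only the dominant ("acid", "alcohol") pair remains. No threshold separates "rare but interesting" from "rare and noisy", which is why the thesis turns to the relationships between a pattern's entities instead.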