Selecting examples for regression
Supervisor: João Mendes Moreira (jmoreira@fe.up.pt)
Co-supervisor: Carlos Soares (csoares@fep.up.pt)

Data mining (DM) is becoming an increasingly important technology for businesses [Ghani & Soares, 2006; Moreira et al., 2005]. One important task addressed with DM techniques is prediction. The DM approach to prediction consists of using inductive learning methods, which analyze the available data to generate a model. This model is then used to make predictions for new examples. For instance, the examples could represent bus trips of a public transportation company and the prediction concerns the duration of a trip [Moreira et al., 2005]. Different methods can be used for prediction tasks, including neural networks, support vector machines and decision trees. The data used to train the model is referred to as the training set. The success of a DM approach to prediction depends on how suitable the algorithm is for the training set and on how representative the training set is of the new examples. Different pre-processing tasks can be used to address these issues [Reinartz, 2002; Blum & Langley, 1997], namely: example (or instance) selection [Liu & Motoda, 2001], feature selection [Guyon & Elisseeff, 2003] and the definition of domain values. The goal of example selection is to identify the data from the training set that is expected to yield the best possible model for a particular learning algorithm. There are successful approaches to example selection, such as the one developed for the linear kernel of support vector machines [Moreira et al., 2006]. A possible approach to example selection that has not been sufficiently explored is metalearning [Brazdil et al., 2009; Crammer et al., 2008].
This approach consists of learning about learning algorithms (hence the prefix "meta"), i.e., using inductive learning methods on the results of previous prediction problems to choose the experimental setup that is most suitable for a new prediction problem. In this particular case, the experimental setup to be selected consists of both the subset of data used for training and the learning algorithm. The goal of this project is to combine and extend previous work on example selection [Moreira et al., 2006] and metalearning [Brazdil et al., 2009] for the problem of selecting the best training data for a prediction problem. The work will be empirically tested on several datasets, including an application to bus travel time prediction at the public transportation company of Porto (STCP).

[Blum & Langley, 1997] Blum, A. L., P. Langley (1997). "Selection of relevant features and examples in machine learning." Artificial Intelligence 97(1-2): 245-271.
[Brazdil et al., 2009] Brazdil, P., C. Giraud-Carrier, C. Soares, R. Vilalta (2009). "Metalearning: Applications to Data Mining." Springer.
[Crammer et al., 2008] Crammer, K., M. Kearns, J. Wortman (2008). "Learning from multiple sources." Journal of Machine Learning Research 9: 1757-1774.
[Ghani & Soares, 2006] Ghani, R., C. Soares (2006). "Data mining for business applications: KDD-2006 workshop." ACM SIGKDD Explorations Newsletter.
[Guyon & Elisseeff, 2003] Guyon, I., A. Elisseeff (2003). "An introduction to variable and feature selection." Journal of Machine Learning Research 3: 1157-1182.
[Liu & Motoda, 2001] Liu, H., H. Motoda, Eds. (2001). "Instance Selection and Construction for Data Mining." Kluwer Academic Publishers.
[Moreira et al., 2005] Moreira, J. M., A. M. Jorge, J. F. Sousa, C. Soares (2005). "A data mining approach for trip time prediction in mass transit companies." Workshop on Data Mining for Business at ECML/PKDD 2005, Porto, Portugal, 63-66.
[Moreira et al., 2006] Moreira, J. M., A. M. Jorge, C. Soares, J. F. Sousa (2006). "Improving SVM-linear predictions using CART for example selection." International Symposium on Methodologies for Intelligent Systems, Springer, LNAI 4203: 632-641.
[Reinartz, 2002] Reinartz, T. (2002). "A unifying view on instance selection." Data Mining and Knowledge Discovery 6(2): 191-210.

Development of regression algorithms for censored data
Supervisor: João Mendes Moreira (jmoreira@fe.up.pt)

Prediction methods are an important technology for businesses. Regression refers to the prediction of numeric variables, while classification refers to the prediction of categorical variables. In certain business areas that use regression methods, the numeric variable is bounded. An example is survival analysis, the branch of statistics that deals with death in biological organisms and failure in mechanical systems. Another example is the analysis of performance (in this case, a real value bounded between 0 and 1) using exogenous variables in DEA - Data Envelopment Analysis (a state-of-the-art benchmarking method). Many other problems exist where the data is left-censored, right-censored or interval-censored. A common approach to this kind of problem is Tobit regression, a parametric statistical method. However, the assumptions of this model (homogeneous variance and independence of the errors) limit the range of problems to which it can be applied. In recent years, new inductive learning algorithms have been developed for the regression problem. Support vector regression [Smola & Scholkopf, 2004], random forests [Breiman, 2001] and random decision trees [Fan et al., 2006] are examples of such methods. Inductive learning methods make no such assumptions about the data. However, they typically assume non-censored data. For this reason, there is current research on adapting existing inductive learning approaches to problems with censored output data.
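One generic way of adapting a standard learner to censoring is to change its loss function so that predictions falling inside the observed interval are not penalized. The sketch below is purely illustrative (the helper `interval_censored_loss` is hypothetical, not taken from any of the cited works): it implements an interval-censored squared loss where a prediction outside the interval is penalized by its squared distance to the nearest bound.

```python
def interval_censored_loss(y_pred, lower, upper):
    """Squared loss for an interval-censored target known only to lie
    in [lower, upper]. Use lower = -inf for left-censored observations
    and upper = +inf for right-censored ones. A prediction inside the
    interval costs nothing; outside, the cost grows quadratically with
    the distance to the nearest bound."""
    below = max(lower - y_pred, 0.0)   # shortfall below the interval
    above = max(y_pred - upper, 0.0)   # excess above the interval
    return (below + above) ** 2

# Interval-, right- and left-censored observations, respectively:
print(interval_censored_loss(3.0, 2.0, 4.0))            # 0.0
print(interval_censored_loss(4.0, 5.0, float("inf")))   # 1.0
print(interval_censored_loss(2.0, float("-inf"), 1.0))  # 1.0
```

Any learner whose training procedure minimizes a pointwise loss can, in principle, be retrofitted with such a loss.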
Random survival forests [Ishwaran et al., 2008] are one such adaptation, developed for right-censored survival data. This proposal focuses on inductive learning approaches for interval-censored data. The work will be empirically tested on several datasets, including an application to the evaluation of bus line performance at the public transportation company of Porto (STCP).

[Breiman, 2001] Breiman, L. (2001). "Random forests." Machine Learning 45: 5-32.
[Fan et al., 2006] Fan, W., J. McCloskey, et al. (2006). "A general framework for accurate and fast regression by data summarization in random decision trees." The 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[Ishwaran et al., 2008] Ishwaran, H., et al. (2008). "Random survival forests." The Annals of Applied Statistics 2(3): 841-860.
[Smola & Scholkopf, 2004] Smola, A. J. and B. Scholkopf (2004). "A tutorial on support vector regression." Statistics and Computing 14: 199-222.

Towards an off-the-shelf method for heterogeneous ensembles
Supervisor: João Mendes Moreira (jmoreira@fe.up.pt)

In the early nineties, the use of multiple models (also called ensembles) for prediction tasks gained relevance due to work on homogeneous ensembles (i.e., ensembles using a single induction algorithm [Hansen & Salamon, 1990]) and heterogeneous ensembles (ensembles combining diverse induction algorithms [Perrone & Cooper, 1993]). Although the use of ensembles was not new at that time, it was from then on that ensemble learning became a major research line in several research communities, such as those on neural networks, machine learning, artificial intelligence, pattern recognition and computational statistics, among others. Bagging [Breiman, 1996], boosting [Freund & Schapire, 1996], random forests [Breiman, 2001] and random decision trees [Fan et al., 2006] are some of the ensemble methods for prediction that have obtained good results. The first three are nowadays important benchmarks among prediction methods.
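As an illustration of the homogeneous-ensemble principle behind bagging, the minimal sketch below (helper names are hypothetical, not from the cited works) trains the same base learner, a depth-1 regression tree or "stump", on bootstrap resamples of the training set and averages their predictions.

```python
import random

def fit_stump(xs, ys):
    """Fit a depth-1 regression tree on 1-D inputs: pick the split
    threshold that minimizes the total squared error of the two leaf means."""
    best_sse, best_split = float("inf"), None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if sse < best_sse:
            best_sse, best_split = sse, (t, ml, mr)
    t, ml, mr = best_split
    return lambda q: ml if q <= t else mr

def bagged_stumps(xs, ys, n_models=25, seed=0):
    """Homogeneous ensemble: the same learner trained on bootstrap
    resamples of the training set; predictions are averaged."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]  # bootstrap sample
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda q: sum(m(q) for m in models) / len(models)

# Example: a noisy step function around x = 0.5.
rng = random.Random(1)
xs = [rng.random() for _ in range(200)]
ys = [(1.0 if x > 0.5 else 0.0) + rng.gauss(0, 0.1) for x in xs]
predict = bagged_stumps(xs, ys)
print(predict(0.2), predict(0.8))  # close to 0 and 1, respectively
```

Averaging over resamples reduces the variance of the unstable base learner; a heterogeneous ensemble would instead combine models produced by different induction algorithms.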
All the methods previously referred to use homogeneous ensembles. However, some studies show that heterogeneous ensembles can improve on the results of homogeneous ones [Wichard et al., 2003; Moreira, 2008]. The main difficulty of this approach is that the best tuning and parameter set is problem-dependent, which meaningfully limits its use as an off-the-shelf method, i.e., its use by non-experts. Nevertheless, this is a promising area of research that has not yet been sufficiently explored. The goal of this project is to develop off-the-shelf heterogeneous ensembles, enabling their use by a broader community. The work will be empirically tested on several datasets, including an application to bus travel time prediction at the public transportation company of Porto (STCP).

[Breiman, 1996] Breiman, L. (1996). "Bagging predictors." Machine Learning 26: 123-140.
[Breiman, 2001] Breiman, L. (2001). "Random forests." Machine Learning 45: 5-32.
[Fan et al., 2006] Fan, W., J. McCloskey, et al. (2006). "A general framework for accurate and fast regression by data summarization in random decision trees." The 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining.
[Freund & Schapire, 1996] Freund, Y. and R. Schapire (1996). "Experiments with a new boosting algorithm." International Conference on Machine Learning, 148-156.
[Hansen & Salamon, 1990] Hansen, L. K. and P. Salamon (1990). "Neural network ensembles." IEEE Transactions on Pattern Analysis and Machine Intelligence 12(10): 993-1001.
[Moreira, 2008] Moreira, J. M. (2008). "Travel time prediction for the planning of mass transit companies: a machine learning approach." Ph.D. thesis, Faculty of Engineering, University of Porto.
[Perrone & Cooper, 1993] Perrone, M. P. and L. N. Cooper (1993). "When networks disagree: ensemble methods for hybrid neural networks." In R. J. Mammone (Ed.), Neural Networks for Speech and Image Processing. Chapman-Hall.
[Wichard et al., 2003] Wichard, J., C. Merkwirth, et al. (2003). "Building ensembles with heterogeneous models." Course of the International School on Neural Nets, Salerno, Italy.

A decision support system for timetable adjustments
Supervisor: João Mendes Moreira (jmoreira@fe.up.pt)
Co-supervisor: Jorge Freire de Sousa (jfsousa@fe.up.pt)

In recent years, public transportation companies have made important investments in order to have data about their service in real time. One of those investments was the implementation of Automatic Vehicle Location (using GPS), which makes it possible to know where each bus is at each moment. With this kind of system, companies are able to store both actual and planned data. With the actual data, it is now possible to improve operational planning. An important operational planning task in such companies is the definition of timetables. Until recently, timetables were defined using only average travel times. With actual data available, it is also possible to take into account the variability of travel times or the level of vehicle occupancy, for instance. This problem is usually treated as a single-objective optimization problem [Carey, 1998; Zhao et al., 2006], namely the minimization of passengers' waiting time. In [Moreira, 2008] it is shown that two objectives must be considered: the maximization of passengers' satisfaction (not necessarily the same as the minimization of passengers' waiting time, at least for some lines) and the minimization of operational costs. In fact, rather than a method that solves the partial problem in a deterministic way (as in [Zhao et al., 2006]), schedulers need a tool that gives them insights into the best solution, at least while there are no answers to questions such as "how does passengers' waiting time compare with the operational cost of an additional bus?" or "what is the impact of reducing slack times on operational costs?". The multi-objective nature of the problem justifies its study and the use of decision support systems.
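The passenger-side objective can be made concrete with the standard random-incidence result (a textbook formula, not specific to the cited works): for passengers arriving uniformly at random, the expected waiting time at a stop is E[W] = Σ h_i² / (2 Σ h_i), where the h_i are the headways between consecutive departures. The sketch below shows that, for the same number of buses (i.e., the same operational cost), irregular headways increase the expected wait.

```python
def expected_wait(headways):
    """Expected waiting time for passengers arriving uniformly at random,
    given the headways (gaps, in minutes) between consecutive departures:
    E[W] = sum(h_i^2) / (2 * sum(h_i))."""
    return sum(h * h for h in headways) / (2 * sum(headways))

# Four buses per hour in both cases: regular vs. bunched service.
print(expected_wait([15, 15, 15, 15]))  # 7.5 minutes
print(expected_wait([5, 25, 5, 25]))    # about 10.83 minutes
```

This is why minimizing waiting time is not just a matter of adding buses: redistributing the same departures already changes the objective, while adding a bus improves it only at a cost the scheduler must weigh explicitly.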
The expected areas of study are: (1) evaluation measures for the degree of achievement of the two objectives of the problem, depending on the bus line type [Strathman et al., 1998]; (2) analytical solutions to the partial problem of minimizing passengers' waiting time for the different bus line types (the one presented in [Zhao et al., 2006] is a good starting point for lines with high frequency); (3) detection of systematic delays using data mining approaches (the one presented in [Duarte, 2008] is a good starting point); (4) design and development of decision support systems for timetable adjustments.

[Carey, 1998] Carey, M. (1998). "Optimizing scheduled times, allowing for behavioural response." Transportation Research Part B 32(5): 329-342.
[Duarte, 2008] Duarte, E. (2008). "Técnicas de Mineração de Dados para suporte à decisão no Planeamento de Horários em Empresas de Transportes Públicos" (data mining techniques for decision support in timetable planning at public transport companies). M.Sc. thesis, University of Minho.
[Moreira, 2008] Moreira, J. M. (2008). "Travel time prediction for the planning of mass transit companies: a machine learning approach." Ph.D. thesis, University of Porto.
[Strathman et al., 1998] Strathman, J. G., K. J. Dueker, T. Kimpel, R. Gerhart, K. Turner, P. Taylor, S. Callas, D. Griffin, J. Hopper (1998). "Automated bus dispatching, operations control and service reliability: analysis of Tri-Met baseline service data." Technical report, University of Washington, U.S.A.
[Zhao et al., 2006] Zhao, J., M. Dessouky, S. Bukkapatnam (2006). "Optimal slack time for schedule-based transit operations." Transportation Science 40(4): 529-539.