Selecting examples for regression
Supervisor: João Mendes Moreira (jmoreira@fe.up.pt)
Co-supervisor: Carlos Soares (csoares@fep.up.pt)
Data mining (DM) is becoming an increasingly important technology for businesses [Ghani
& Soares, 2006; Moreira et al., 2005]. One important task that is addressed with DM
techniques is prediction. The DM approach to prediction consists of using inductive learning
methods, which analyze available data to generate a model. Then, this model is used for
making predictions concerning new examples. For instance, the examples could represent
bus trips from a public transportation company and the prediction is concerned with the
duration of the trip [Moreira et al., 2005]. Different methods can be used for prediction tasks,
including neural networks, support vector machines and decision trees.
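As a purely illustrative sketch of this train-then-predict workflow (the learning algorithm, the trip features and the data values below are made up and are not those used in [Moreira et al., 2005]):

# Minimal sketch of inductive learning for prediction: fit a model on past
# trips, then predict the duration of a new trip. Features and values are
# illustrative only.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Training set: each row is a past bus trip (departure hour, day of week,
# number of stops); the target is the observed trip duration in minutes.
X_train = np.array([[8, 1, 20], [9, 1, 20], [18, 5, 25], [14, 3, 22], [7, 2, 20]])
y_train = np.array([42.0, 38.0, 55.0, 40.0, 35.0])

# Inductive learning: analyse the available data to generate a model.
model = DecisionTreeRegressor(max_depth=3).fit(X_train, y_train)

# The model is then used to make predictions for new, unseen trips.
X_new = np.array([[17, 5, 25]])
print(model.predict(X_new))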
The data used to train the model is referred to as the training set. The success of a DM
approach to prediction depends on how suitable the algorithm is for the training set and how
representative the training set is of the new examples. Different pre-processing tasks can be
used to address these issues [Reinartz, 2002; Blum & Langley, 1997], namely: example (or
instance) selection [Liu & Motoda, 2001], feature selection [Guyon & Elisseeff, 2003], and
domain values definition. The goal of example selection is to identify the data from the
training set that are expected to yield the best possible model for a particular learning
algorithm. There are some examples of successful approaches for example selection, such as
the one for the linear kernel of support vector machines [Moreira et al., 2006]. A possible
approach to example selection which has not been sufficiently explored is metalearning
[Brazdil et al., 2009; Crammer et al., 2008]. This approach consists of learning about learning
algorithms (hence the prefix "meta"), i.e., using inductive learning methods on results of
previous prediction problems to choose the experimental setup that is most suitable for a new
prediction problem. In this particular case, the experimental setup to be selected consists of
both the subset of data used for training and the learning algorithm.
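As an illustration of the example-selection idea only (not of the CART-based method of [Moreira et al., 2006], nor of a metalearning approach), the sketch below trains a linear support vector regressor on the full training set and on a subset obtained by discarding the examples a preliminary fit explains worst; the selection rule and the data are invented for illustration:

# Illustrative example selection: train a learner on a selected subset of the
# training data and compare it with training on all examples. The selection
# rule (drop the 10% of points with the largest residuals under a preliminary
# fit) is a deliberately simple stand-in, not the method of [Moreira et al., 2006].
import numpy as np
from sklearn.svm import LinearSVR
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(300, 1))
y = 3.0 * X.ravel() + rng.normal(0, 1, 300)
y[::20] += 30  # a few corrupted examples that hurt the linear fit

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

base = LinearSVR(max_iter=10000).fit(X_tr, y_tr)       # all examples
resid = np.abs(y_tr - base.predict(X_tr))
keep = resid < np.quantile(resid, 0.9)                  # example selection
selected = LinearSVR(max_iter=10000).fit(X_tr[keep], y_tr[keep])

print("all examples     :", mean_absolute_error(y_te, base.predict(X_te)))
print("selected examples:", mean_absolute_error(y_te, selected.predict(X_te)))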
The goal of this project is to combine and extend previous work on example selection
[Moreira et al., 2006] and metalearning [Brazdil et al., 2009] for the problem of selecting the
best training data for a prediction problem. The work will be empirically tested on several
datasets, including an application of bus travel time prediction from the public transportation
company of Porto (STCP).
[Blum & Langley, 1997] Blum, A. L., P. Langley (1997). “Selection of relevant features and
examples in machine learning.” Artificial Intelligence 97(1-2): 245-271.
[Brazdil et al., 2009] Brazdil, P., C. Giraud-Carrier, C. Soares, R. Vilalta (2009).
“Metalearning: Applications to Data Mining”, Springer.
[Crammer et al., 2008] Crammer, K., M. Kearns, J. Wortman (2008). “Learning from
Multiple Sources.” Journal of Machine Learning Research 9:1757-1774.
[Ghani & Soares, 2006] Ghani, R., C. Soares (2006). "Data mining for business applications:
KDD-2006 workshop", ACM SIGKDD Explorations Newsletter.
[Guyon & Elisseeff, 2003] Guyon, I., A. Elisseeff (2003). "An introduction to variable and
feature selection." Journal of Machine Learning Research 3: 1157-1182.
[Liu & Motoda, 2001] Liu, H., H. Motoda, Eds (2001). “Instance selection and construction
for data mining,” Kluwer Academic Publishers.
[Moreira et al., 2005] Moreira, J. M., A. M. Jorge, J. F. Sousa, C. Soares (2005), “A Data
Mining approach for trip time prediction in mass transit companies.”, Workshop on Data
Mining for Business at ECML/PKDD 2005, Porto - Portugal, 63-66.
[Moreira et al., 2006] Moreira, J. M., A. M. Jorge, C. Soares, J. F. Sousa (2006). “Improving
SVM-linear predictions using CART for example selection.” International Symposium on
Methodologies for Intelligent Systems, Springer, LNAI 4203: 632-641.
[Reinartz, 2002] Reinartz, T. (2002). “A unifying view on instance selection.” Data Mining
and Knowledge Discovery 6(2): 191-210.
Development of regression algorithms for censored data
Supervisor: João Mendes Moreira (jmoreira@fe.up.pt)
Prediction methods are an important technology for businesses. Regression refers to the
prediction of numeric variables while classification refers to the prediction of categorical
variables. In certain areas of business that use regression methods, the numeric variable is
bounded. An example is survival analysis, a branch of statistics which deals with death in
biological organisms and failure in mechanical systems. Another example is the analysis of
performance (performance is, in this case, a real value bounded between 0 and 1) using
exogenous variables in DEA - Data Envelopment Analysis (a state-of-the-art benchmarking
method). Many other problems exist where data is left-censored, right-censored or
interval-censored. A common approach to solve this kind of problem is Tobit regression, a
parametric statistical method. However, the assumptions of this model (homogeneous
variance and independence of the errors) limit the range of problems where it can be applied.
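A compact sketch of the Tobit idea for a left-censored case follows: a latent linear model with Gaussian errors is assumed, and censored observations contribute the probability of falling beyond the censoring bound to the likelihood instead of a density term. The data and model are invented for illustration only.

# Sketch of Tobit regression for data left-censored at zero, assuming a latent
# linear model y* = X b + e with e ~ N(0, s^2) and observing y = max(y*, 0).
# Illustrative only; a real application would also check the homogeneous
# variance and error independence assumptions mentioned above.
import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

rng = np.random.default_rng(1)
X = np.column_stack([np.ones(500), rng.normal(size=500)])
y = np.maximum(X @ np.array([-0.5, 2.0]) + rng.normal(size=500), 0.0)
censored = y == 0.0

def neg_log_lik(params):
    b, s = params[:-1], np.exp(params[-1])
    mu = X @ b
    ll_unc = norm.logpdf(y[~censored], loc=mu[~censored], scale=s)  # observed values
    ll_cen = norm.logcdf((0.0 - mu[censored]) / s)                  # P(y* <= 0)
    return -(ll_unc.sum() + ll_cen.sum())

res = minimize(neg_log_lik, x0=np.zeros(X.shape[1] + 1), method="BFGS")
print("coefficients:", res.x[:-1], "sigma:", np.exp(res.x[-1]))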
In the last few years new inductive learning algorithms have been developed for the
regression problem. Support vector regression [Smola & Scholkopf, 2004], random forests
[Breiman, 2001] and random decision trees [Fan et al., 2006] are examples of such methods.
Inductive learning methods do not make such assumptions about the data. However, they
typically assume non-censored data. For this reason, there is ongoing research on adapting
existing inductive learning approaches to problems with censored output data.
This is the case of random survival forests [Ishwaran et al., 2008] for right-censored survival
data.
This proposal focuses on inductive learning approaches for interval-censored data. The work will
be empirically tested on several datasets, including an application on the evaluation of bus
line performance from the public transportation company of Porto (STCP).
[Breiman, 2001] Breiman, L. (2001). “Random forests.” Machine Learning 45: 5-32.
[Fan et al., 2006] Fan, W., J. McCloskey, et al. (2006). “A general framework for accurate
and fast regression by data summarization in random decision trees.” The 12th ACM
SIGKDD international conference on Knowledge Discovery and Data Mining.
[Ishwaran et al., 2008] Ishwaran, H., U. B. Kogalur, E. H. Blackstone, M. S. Lauer (2008). “Random survival forests.” The Annals of
Applied Statistics 2(3): 841-860.
[Smola & Scholkopf, 2004] Smola, A. J. and B. Scholkopf (2004). “A tutorial on support
vector regression.” Statistics and Computing 14: 199-222.
Towards an off-the-shelf method for heterogeneous
ensembles
Supervisor: João Mendes Moreira (jmoreira@fe.up.pt)
In the early nineties, the use of multiple models (also named ensembles) to accomplish the
prediction task gained relevance due to works using homogeneous ensembles (i.e.,
ensembles using the same induction algorithm [Hansen & Salamon, 1990]) and
heterogeneous ensembles (ensembles using diverse induction algorithms [Perrone & Cooper,
1993]). Although the use of ensembles was not new at that time, it was from then on that
ensemble learning became a major research line for different research communities, such as
those working on neural networks, machine learning, artificial intelligence, pattern recognition,
computational statistics, among others. Bagging [Breiman, 1996], boosting [Freund &
Schapire, 1996], random forests [Breiman, 2001], random decision trees [Fan et al., 2006]
are some of the ensemble methods for prediction that obtained good results. The first three
are nowadays important benchmarks of prediction methods.
All the methods referred to above use homogeneous ensembles. However, some studies
show that heterogeneous ensembles can improve on the results of homogeneous ones [Wichard et
al., 2003; Moreira, 2008]. The main difficulty of this approach is that the best combination of
algorithms and parameter settings is problem dependent, which significantly limits its use as an
off-the-shelf method, i.e., its use by non-experts. Nevertheless, this is a promising area of
research that has not yet been sufficiently explored.
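A minimal sketch of a heterogeneous regression ensemble is given below: several different induction algorithms are trained on the same data and their predictions averaged. The particular learners, their parameters and the unweighted average are illustrative choices only; selecting and tuning them per problem is precisely the difficulty discussed above.

# Minimal heterogeneous ensemble: train diverse induction algorithms on the
# same data and average their predictions. Learners, parameters and the plain
# average are illustrative choices, not a recommended off-the-shelf setup.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_absolute_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.svm import SVR
from sklearn.neighbors import KNeighborsRegressor

X, y = make_regression(n_samples=400, n_features=8, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

learners = [RandomForestRegressor(n_estimators=100, random_state=0),
            SVR(C=10.0),
            KNeighborsRegressor(n_neighbors=5)]
preds = np.column_stack([m.fit(X_tr, y_tr).predict(X_te) for m in learners])

for m, p in zip(learners, preds.T):
    print(type(m).__name__, mean_absolute_error(y_te, p))
print("heterogeneous ensemble (average):", mean_absolute_error(y_te, preds.mean(axis=1)))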
The goal of this project is to develop off-the-shelf heterogeneous ensembles, enabling their
use by a broader community. The work will be empirically tested on several datasets,
including an application of bus travel time prediction from the public transportation company
of Porto (STCP).
[Breiman, 1996] Breiman, L. (1996). “Bagging predictors.” Machine Learning 24: 123-140.
[Breiman, 2001] Breiman, L. (2001). “Random forests.” Machine Learning 45: 5-32.
[Fan et al., 2006] Fan, W., J. McCloskey, et al. (2006). “A general framework for accurate
and fast regression by data summarization in random decision trees.” The 12th ACM
SIGKDD international conference on Knowledge Discovery and Data Mining.
[Freund & Schapire, 1996] Freund, Y. and R. Schapire (1996). “Experiments with a new
boosting algorithm.” International Conference on Machine Learning, 148-156.
[Hansen & Salamon, 1990] Hansen, L. K. and P. Salamon (1990). “Neural network
ensembles.” IEEE Transactions on Pattern Analysis and Machine Intelligence 12(10):
993-1001.
[Moreira, 2008] Moreira, J. M. (2008). “Travel time prediction for the planning of mass
transit companies: a machine learning approach.” Ph.D. thesis, Faculty of Engineering,
University of Porto.
[Perrone & Cooper, 1993] Perrone, M. P. and L. N. Cooper (1993). “When networks disagree:
ensemble methods for hybrid neural networks.” Neural Networks for Speech and Image
Processing. R. J. Mammone (Ed.), Chapman-Hall.
[Wichard et al., 2003] Wichard, J., C. Merkwirth, et al. (2003). “Building ensembles with heterogeneous models.”
Course of the International School on Neural Nets, Salerno - Italy.
A decision support system for timetable adjustments
Supervisor: João Mendes Moreira (jmoreira@fe.up.pt)
Co-supervisor: Jorge Freire de Sousa (jfsousa@fe.up.pt)
In recent years, public transportation companies have made important investments in order to
have data about the service in real time. One of those investments was in the implementation
of Automatic Vehicle Location (using GPS) in order to know where each bus is at each
moment. With this kind of system, companies are able to store both actual and planning data.
With the actual data, it is now possible to improve operational planning. An important
operational planning task in this kind of company is the definition of timetables. Until
recently, timetables were defined using only the average travel times.
With the existence of actual data, it is possible to take into account the variability of travel
times or the level of vehicle occupancy, for instance. This problem is usually seen as a single
objective optimization problem [Carey, 1998; Zhao et al., 2006], namely, the minimization
of the passengers' waiting time. In [Moreira, 2008] it is shown that two objectives must be
considered: the maximization of passengers' satisfaction (not necessarily the same as the
minimization of passengers' waiting time, at least for some lines) and the minimization of
operational costs. In fact, rather than a method that solves the partial problem in a
deterministic way (as in [Zhao et al., 2006]), schedulers need a tool that gives them
insights into the best solution, at least while there are no answers to questions such as "how
does passengers' waiting time compare with the operational cost of an additional bus?", or
"what is the impact of reducing slack times on operational costs?". The multi-objective
nature of the problem justifies its study and the use of decision support systems.
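One well-known piece of this trade-off: for high-frequency lines with passengers arriving at random, the expected waiting time at a stop is E[H^2] / (2 E[H]), where H is the headway, so irregular headways increase waiting even when the mean headway (and hence the number of buses, i.e., the operational cost) stays the same. A small sketch of that calculation, with made-up headway values:

# Expected passenger waiting time under random arrivals: E[H^2] / (2 * E[H]),
# where H is the headway. The headway values below are invented; both series
# share the same mean (same number of buses) but differ in regularity.
import numpy as np

def expected_wait(headways):
    h = np.asarray(headways, dtype=float)
    return (h ** 2).mean() / (2.0 * h.mean())

planned = [10.0, 10.0, 10.0, 10.0, 10.0, 10.0]   # regular 10-minute headways
actual = [4.0, 16.0, 7.0, 13.0, 5.0, 15.0]       # same mean, irregular service

print("expected wait, planned timetable:", expected_wait(planned), "minutes")
print("expected wait, actual operation :", expected_wait(actual), "minutes")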
The expected areas of study are: (1) Evaluation measures for the degree of achievement of
the two objectives of the problem, dependent on the bus line type [Strathman et al., 1998]; (2)
Analytical solutions for the partial problem of minimization of passengers' waiting time for
the different bus line types (the one presented in [Zhao et al., 2006] is a good starting point for the
case of lines with high frequency); (3) Detection of systematic delays using data mining
approaches (the one presented in [Duarte, 2008] is a good starting point); (4) Design and
development of decision support systems for timetable adjustments.
[Carey, 1998] Carey, M. (1998). “Optimizing scheduled times, allowing for behavioural
response.” Transportation Research Part B, 32(5):329-342.
[Duarte, 2008] Duarte, E. (2008), “Técnicas de Mineração de Dados para suporte à decisão
no Planeamento de Horários em Empresas de Transportes Públicos.” M.Sc. thesis,
University of Minho.
[Moreira, 2008] Moreira, J. M. (2008). “Travel time prediction for the planning of mass
transit companies: a machine learning approach.” Ph.D. thesis, University of Porto.
[Strathman et al., 1998] Strathman, J. G., K. J. Dueker, T. Kimpel, R. Gerhart, K. Turner, P.
Taylor, S. Callas, D. Griffin, J. Hopper (1998). “Automated bus dispatching, operations
control and service reliability: analysis of Tri-Met baseline service data.” Technical report,
University of Washington - U.S.A.
[Zhao et al., 2006] Zhao, J., M. Dessouky, S. Bukkapatnam (2006). “Optimal slack time for
schedule-based transit operations.” Transportation Science, 40(4):529-539.