Statistical Modelling and Computational Learning
John Shawe-Taylor, Department of Computer Science, UCL

Detecting patterns in data
• The aim of statistical modelling and computational learning is to identify some stable pattern from a finite sample of data.
• Based on the detected pattern, we can then process future data more accurately, more efficiently, etc.
• The approach is useful when we are unable to specify the precise pattern a priori – the sample must be used to identify it.

Statistical Modelling vs Computational Learning
• Statistical Modelling:
– Interested in inferring underlying structure, eg clusters, parametric models, an underpinning network
– Frequently uses a prior distribution and inference to estimate a posterior over possible models
• Computational Learning:
– Interested in the ability to predict/process future data accurately, eg classifiers, regression, mappings
– Aims to give assurances of performance on unseen data

Types of analysis
• Classification: which of two or more classes does an example belong to? Eg document filtering.
• Regression: what real-valued function of the input matches the output values? Eg function approximation.
• Novelty detection: which new examples are unusual or anomalous? Eg engine monitoring.
• Ranking: what rank should be assigned to a given data item? Eg recommender systems.
• Clustering: natural structure in the data.
• Network: interactions that will persist.

Principled methods
• The key theoretical question is whether detected patterns are spurious or stable – are they in this sample by chance, or are they implicit in the distribution/system generating the data?
• This is a probabilistic or statistical question: estimate the stability of any patterns detected in the finite sample.
• What is the chance that we have been fooled by the sample?

Uniform convergence
• Typically we are attempting to assess an infinite (or very large) class of potential patterns.
• The question is whether the finite-sample behaviour of a pattern will match its future presence.
• For one function this needs a bound on the tail of the distribution, but for a class we need convergence uniformly over the class, cf. multiple hypothesis testing (see the worked bound below).
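A worked form of this requirement (an editorial illustration, not taken from the slides): assuming a loss bounded in [0, 1], a sample S of m independent examples and, for simplicity, a finite hypothesis class H, Hoeffding's inequality combined with a union bound over H gives

```latex
% Uniform convergence over a finite class H: Hoeffding's inequality for a
% single hypothesis, then a union bound over the |H| hypotheses in the class.
\Pr\Big( \sup_{h \in H} \big| \widehat{\operatorname{err}}_S(h) - \operatorname{err}_D(h) \big| > \epsilon \Big)
  \;\le\; 2\,|H|\, e^{-2 m \epsilon^{2}}
```

Setting the right-hand side equal to a confidence parameter δ and solving for ε gives the usual "with probability at least 1 − δ" statement. For infinite classes, |H| is replaced by covering numbers or the VC dimension, and the luckiness framework introduced on the following slides refines this further by letting the bound depend on how favourably the observed sample falls.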
High dimensional spaces
• There is a further problem: modern applications typically involve high-dimensional data.
• Kernel methods also project data into high-dimensional spaces in order to ensure that linear methods are powerful enough.
• This provides a general toolkit of methods, but...
• it kills more traditional statistical methods of analysis – the curse of dimensionality.

Luckiness framework (S-T, Bartlett, Williamson and Anthony, 1998)
• This allows you to use evidence of an effective low dimensionality even when working in high dimensions.
• Examples are:
– size of the margin in a support vector machine classifier,
– norm of the weight vector in (ridge) regression,
– sparsity of a solution,
– evidence in the Bayesian posterior,
– residual in principal components analysis.

Luckiness framework
• Wins if the distribution of examples aligns with the pattern.
• Hence can work with a far richer set of hypotheses.
• Best-known example is the support vector machine, which uses the margin as a measure of luckiness.

Other luckiness approaches
• Large margins are just one way of detecting fortuitous distributions.
• Sparsity of representation appears to be a more fundamental measure.
• Gives rise to different algorithms, but the luckiness of large margins can also be justified by a sparsity argument.
• Brings us full circle back to Ockham's razor: parsimonious description.

Kernel-based learning
• The idea that we look for patterns in richer sets of hypotheses has given rise to kernel methods.
• Data is projected into a rich feature space where linear patterns are sought using an implicit kernel representation.
• The luckiness is typically measured by the norm of the linear function required for the task – equivalent to the margin in the case of classification (a worked sketch follows the cross-modal slide below).

Modularity of kernel methods
• Algorithms can be applied to different types of data by simply defining an appropriate kernel function to compare data items.
• The range of algorithms has been extended to include regression, ranking and novelty detection.
• Can be used for statistical tests of the null hypothesis that two samples are drawn from the same distribution (see the two-sample sketch below).
• Also enables links with statistical modelling techniques.

Kernel design methods
• Structure kernels use dynamic programming to compute sums over possible matches, eg string kernels.
• Kernels from probabilistic models, eg Fisher kernels.
• Kernel PCA to perform feature selection, eg latent semantic kernels.
• Kernel alignment/PLS: tuning features to the task.

Cross-modal analysis
• Not limited to the same representation.
• Consider web images and associated text.
• The kernel for an image includes wavelets and colour histograms – for text it is a bag of words.
• Creates a content-based image retrieval system.
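To make the kernel-based learning and luckiness slides above concrete, here is a minimal sketch, assuming NumPy and scikit-learn (neither is named in the slides): data with no linear separator in its input space is classified by a maximum-margin linear function in the implicit feature space of a Gaussian (RBF) kernel, and the norm of that linear function – the luckiness measure – is recovered from the dual coefficients.

```python
# Minimal kernel-classification sketch (assumes NumPy and scikit-learn are available).
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no linear separator exists in the 2-D input space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Maximum-margin linear separator in the implicit feature space of an RBF kernel.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("training accuracy:", clf.score(X, y))

# Luckiness measure: squared norm of the linear function in feature space,
# ||w||^2 = sum_ij (y_i a_i)(y_j a_j) K(x_i, x_j); the geometric margin is 1/||w||.
gamma = 1.0 / (X.shape[1] * X.var())       # the value gamma="scale" uses internally
sv = clf.support_vectors_                  # support vectors
a = clf.dual_coef_.ravel()                 # signed dual coefficients y_i * alpha_i
K = np.exp(-gamma * ((sv[:, None, :] - sv[None, :, :]) ** 2).sum(-1))
w_norm_sq = a @ K @ a
print("margin 1/||w|| =", 1.0 / np.sqrt(w_norm_sq))
```

Luckiness-style generalisation bounds then depend on this margin (or norm) rather than on the dimension of the feature space, which is what makes the high-dimensional projection safe.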
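The two-sample test mentioned on the modularity slide can be sketched as follows: an illustrative maximum mean discrepancy (MMD) statistic with a permutation test, using only NumPy. The specific kernel, bandwidth and sample sizes are assumptions, not details from the slides.

```python
# Illustrative kernel two-sample test (MMD statistic with a permutation test).
import numpy as np

rng = np.random.default_rng(0)


def gaussian_gram(Z, sigma=1.0):
    # K(z, z') = exp(-||z - z'||^2 / (2 sigma^2)) for all pairs of rows of Z.
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))


def mmd2(K, n):
    # Biased MMD^2 estimate from the Gram matrix of the pooled sample,
    # where the first n rows come from sample X and the rest from sample Y.
    Kxx, Kyy, Kxy = K[:n, :n], K[n:, n:], K[:n, n:]
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()


# Two samples; under the null hypothesis they come from the same distribution.
X = rng.normal(0.0, 1.0, size=(100, 2))
Y = rng.normal(0.5, 1.0, size=(100, 2))    # mean-shifted, so the null should be rejected

Z = np.vstack([X, Y])
K = gaussian_gram(Z)
observed = mmd2(K, len(X))

# Permutation test: relabel the pooled points at random and recompute the statistic.
perms = []
for _ in range(500):
    idx = rng.permutation(len(Z))
    perms.append(mmd2(K[np.ix_(idx, idx)], len(X)))

p_value = (np.sum(np.array(perms) >= observed) + 1) / (len(perms) + 1)
print(f"MMD^2 = {observed:.4f}, permutation p-value = {p_value:.3f}")
```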
Conclusion
• Statistical Modelling and Computational Learning aim to find patterns in data: SM is interested in the reliability of the pattern, CL in the quality of prediction.
• Using bounds to guide algorithm design can overcome problems with high dimensions.
• Combined with kernels, this allows the efficient use of linear methods in high-dimensional spaces.
• The methods provide a general toolkit for the practitioner making use of adaptive systems.
• Particular applications require specific tasks and representations/kernels that define new research challenges.

On-line optimisation of web pages
• Have to decide what content to place on a web page, eg advertising banners.
• Have some measure of response, eg click-throughs, purchases made, etc.
• Can be viewed as an example of the multi-armed bandit problem – need to trade exploration and exploitation on-line.
• Touch Clarity was a company providing this technology for high street banks and others.

Upper confidence bounds
• The basic algorithm maintains an estimate of the response rate (RR) for each arm (piece of content) together with its standard deviation.
• Serve the arm for which mean + S.D. is maximum: guaranteed either to be a good arm or to reduce the S.D.
• Provable bounds on the loss compared to playing the best arm all along.
• Can be generalised to the situation where additional information linearly determines the response rate.
• Kernel methods enable handling of non-linear response rates.
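A minimal sketch of the upper-confidence rule described above, assuming Bernoulli click-through responses and a UCB1-style confidence width (the slides do not specify the exact form of the confidence term used in practice):

```python
# Illustrative UCB content-selection loop; response rates and the bound form are assumed.
import math
import random

random.seed(0)

true_ctr = [0.02, 0.05, 0.035]          # unknown click-through rate of each banner
counts = [0] * len(true_ctr)            # times each arm has been served
means = [0.0] * len(true_ctr)           # running estimate of each arm's response rate

for t in range(1, 20001):
    if t <= len(true_ctr):
        arm = t - 1                     # serve each arm once to initialise the estimates
    else:
        # Pick the arm with the highest optimistic estimate: mean + confidence width;
        # the width shrinks as an arm is served more often, balancing exploration
        # and exploitation.
        ucb = [means[a] + math.sqrt(2.0 * math.log(t) / counts[a])
               for a in range(len(true_ctr))]
        arm = max(range(len(true_ctr)), key=lambda a: ucb[a])

    reward = 1.0 if random.random() < true_ctr[arm] else 0.0
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]   # incremental mean update

print("serves per arm:", counts)        # most traffic should go to the best arm
print("estimated response rates:", [round(m, 3) for m in means])
```

The generalisation mentioned on the slide, where side information determines the response rate linearly, corresponds to algorithms of the LinUCB type; replacing the linear model with a kernel regressor handles non-linear response rates.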