Statistical Modelling and Computational Learning
John Shawe-Taylor, Department of Computer Science, UCL

Detecting patterns in data
• The aim of statistical modelling and computational learning is to identify some stable pattern from a finite sample of data.
• Based on the detected pattern, we can then process future data more accurately, more efficiently, etc.
• The approach is useful when we are unable to specify the precise pattern a priori – the sample must be used to identify it.

Statistical Modelling vs Computational Learning
• Statistical Modelling:
– Interested in inferring underlying structure, eg clusters, parametric models, an underpinning network
– Frequently uses a prior distribution and inference to estimate a posterior over possible models
• Computational Learning:
– Interested in the ability to predict/process future data accurately, eg classifiers, regression, mappings
– Aims to give assurances of performance on unseen data

Types of analysis
• Classification: which of two or more classes does an example belong to? Eg document filtering.
• Regression: what real-valued function of the input matches the output values? Eg function approximation.
• Novelty detection: which new examples are unusual or anomalous? Eg engine monitoring.
• Ranking: what rank should be assigned to a given data item? Eg recommender systems.
• Clustering: natural structure in the data.
• Network: interactions that will persist.

Principled methods
• The key theoretical question is whether detected patterns are spurious or stable – are they in this sample by chance, or are they implicit in the distribution/system generating the data?
• This is a probabilistic or statistical question: estimate the stability of any patterns detected in the finite sample.
• What is the chance that we have been fooled by the sample?

Uniform convergence
• Typically we are attempting to assess an infinite (or very large) class of potential patterns.
• The question is whether the finite-sample behaviour of a pattern will match its future presence.
• For one function this needs a bound on the tail of the distribution, but for a class we need convergence uniformly over the class, cf. multiple hypothesis testing (see the worked bound below).
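A worked form of this requirement (an editorial illustration, not taken from the slides): assuming a loss bounded in [0, 1], a sample S of m independent examples and, for simplicity, a finite hypothesis class H, Hoeffding's inequality combined with a union bound over H gives

```latex
% Uniform convergence over a finite class H: Hoeffding's inequality for a
% single hypothesis, then a union bound over the |H| hypotheses in the class.
\Pr\Big( \sup_{h \in H} \big| \widehat{\operatorname{err}}_S(h) - \operatorname{err}_D(h) \big| > \epsilon \Big)
  \;\le\; 2\,|H|\, e^{-2 m \epsilon^{2}}
```

Setting the right-hand side equal to a confidence parameter δ and solving for ε gives the usual "with probability at least 1 − δ" statement. For infinite classes, |H| is replaced by covering numbers or the VC dimension, and the luckiness framework introduced on the following slides refines this further by letting the bound depend on how favourably the observed sample falls.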
High dimensional spaces
• There is a further problem: modern applications typically involve high-dimensional data.
• Kernel methods also project data into high-dimensional spaces in order to ensure that linear methods are powerful enough.
• This provides a general toolkit of methods, but...
• it kills more traditional statistical methods of analysis – the curse of dimensionality.

Luckiness framework (S-T, Bartlett, Williamson and Anthony, 1998)
• This allows you to use evidence of an effective low dimensionality even when working in high dimensions.
• Examples are:
– size of the margin in a support vector machine classifier,
– norm of the weight vector in (ridge) regression,
– sparsity of a solution,
– evidence in the Bayesian posterior,
– residual in principal components analysis.

Luckiness framework
• Wins if the distribution of examples aligns with the pattern.
• Hence can work with a far richer set of hypotheses.
• Best-known example is the support vector machine, which uses the margin as a measure of luckiness.

Other luckiness approaches
• Large margins are just one way of detecting fortuitous distributions.
• Sparsity of representation appears to be a more fundamental measure.
• Gives rise to different algorithms, but the luckiness of large margins can also be justified by a sparsity argument.
• Brings us full circle back to Ockham's razor: parsimonious description.

Kernel-based learning
• The idea that we look for patterns in richer sets of hypotheses has given rise to kernel methods.
• Data is projected into a rich feature space where linear patterns are sought using an implicit kernel representation.
• The luckiness is typically measured by the norm of the linear function required for the task – equivalent to the margin in the case of classification (a worked sketch follows the cross-modal slide below).

Modularity of kernel methods
• Algorithms can be applied to different types of data by simply defining an appropriate kernel function to compare data items.
• The range of algorithms has been extended to include regression, ranking and novelty detection.
• Can be used for statistical tests of the null hypothesis that two samples are drawn from the same distribution (see the two-sample sketch below).
• Also enables links with statistical modelling techniques.

Kernel design methods
• Structure kernels use dynamic programming to compute sums over possible matches, eg string kernels.
• Kernels from probabilistic models, eg Fisher kernels.
• Kernel PCA to perform feature selection, eg latent semantic kernels.
• Kernel alignment/PLS: tuning features to the task.

Cross-modal analysis
• Not limited to the same representation.
• Consider web images and associated text.
• The kernel for an image includes wavelets and colour histograms – for text it is a bag of words.
• Creates a content-based image retrieval system.
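To make the kernel-based learning and luckiness slides above concrete, here is a minimal sketch, assuming NumPy and scikit-learn (neither is named in the slides): data with no linear separator in its input space is classified by a maximum-margin linear function in the implicit feature space of a Gaussian (RBF) kernel, and the norm of that linear function – the luckiness measure – is recovered from the dual coefficients.

```python
# Minimal kernel-classification sketch (assumes NumPy and scikit-learn are available).
import numpy as np
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no linear separator exists in the 2-D input space.
X, y = make_circles(n_samples=200, factor=0.3, noise=0.05, random_state=0)

# Maximum-margin linear separator in the implicit feature space of an RBF kernel.
clf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)
print("training accuracy:", clf.score(X, y))

# Luckiness measure: squared norm of the linear function in feature space,
# ||w||^2 = sum_ij (y_i a_i)(y_j a_j) K(x_i, x_j); the geometric margin is 1/||w||.
gamma = 1.0 / (X.shape[1] * X.var())       # the value gamma="scale" uses internally
sv = clf.support_vectors_                  # support vectors
a = clf.dual_coef_.ravel()                 # signed dual coefficients y_i * alpha_i
K = np.exp(-gamma * ((sv[:, None, :] - sv[None, :, :]) ** 2).sum(-1))
w_norm_sq = a @ K @ a
print("margin 1/||w|| =", 1.0 / np.sqrt(w_norm_sq))
```

Luckiness-style generalisation bounds then depend on this margin (or norm) rather than on the dimension of the feature space, which is what makes the high-dimensional projection safe.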
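The two-sample test mentioned on the modularity slide can be sketched as follows: an illustrative maximum mean discrepancy (MMD) statistic with a permutation test, using only NumPy. The specific kernel, bandwidth and sample sizes are assumptions, not details from the slides.

```python
# Illustrative kernel two-sample test (MMD statistic with a permutation test).
import numpy as np

rng = np.random.default_rng(0)


def gaussian_gram(Z, sigma=1.0):
    # K(z, z') = exp(-||z - z'||^2 / (2 sigma^2)) for all pairs of rows of Z.
    sq = ((Z[:, None, :] - Z[None, :, :]) ** 2).sum(-1)
    return np.exp(-sq / (2.0 * sigma ** 2))


def mmd2(K, n):
    # Biased MMD^2 estimate from the Gram matrix of the pooled sample,
    # where the first n rows come from sample X and the rest from sample Y.
    Kxx, Kyy, Kxy = K[:n, :n], K[n:, n:], K[:n, n:]
    return Kxx.mean() + Kyy.mean() - 2.0 * Kxy.mean()


# Two samples; under the null hypothesis they come from the same distribution.
X = rng.normal(0.0, 1.0, size=(100, 2))
Y = rng.normal(0.5, 1.0, size=(100, 2))    # mean-shifted, so the null should be rejected

Z = np.vstack([X, Y])
K = gaussian_gram(Z)
observed = mmd2(K, len(X))

# Permutation test: relabel the pooled points at random and recompute the statistic.
perms = []
for _ in range(500):
    idx = rng.permutation(len(Z))
    perms.append(mmd2(K[np.ix_(idx, idx)], len(X)))

p_value = (np.sum(np.array(perms) >= observed) + 1) / (len(perms) + 1)
print(f"MMD^2 = {observed:.4f}, permutation p-value = {p_value:.3f}")
```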
Conclusion
• Statistical Modelling and Computational Learning aim to find patterns in data: SM is interested in the reliability of the pattern, CL in the quality of prediction.
• Using bounds to guide algorithm design can overcome problems with high dimensions.
• Combined with kernels, this allows the efficient use of linear methods in high-dimensional spaces.
• The methods provide a general toolkit for the practitioner making use of adaptive systems.
• Particular applications require specific tasks and representations/kernels that define new research challenges.

On-line optimisation of web pages
• Have to decide what content to place on a web page, eg advertising banners.
• Have some measure of response, eg click-throughs, purchases made, etc.
• Can be viewed as an example of the multi-armed bandit problem – need to trade exploration and exploitation on-line.
• Touch Clarity was a company providing this technology for high street banks and others.

Upper confidence bounds
• The basic algorithm maintains an estimate of the response rate (RR) for each arm (piece of content) together with its standard deviation.
• Serve the arm for which mean + S.D. is maximum: guaranteed either to be a good arm or to reduce the S.D.
• Provable bounds on the loss compared to playing the best arm all along.
• Can be generalised to the situation where additional information linearly determines the response rate.
• Kernel methods enable handling of non-linear response rates.
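A minimal sketch of the upper-confidence rule described above, assuming Bernoulli click-through responses and a UCB1-style confidence width (the slides do not specify the exact form of the confidence term used in practice):

```python
# Illustrative UCB content-selection loop; response rates and the bound form are assumed.
import math
import random

random.seed(0)

true_ctr = [0.02, 0.05, 0.035]          # unknown click-through rate of each banner
counts = [0] * len(true_ctr)            # times each arm has been served
means = [0.0] * len(true_ctr)           # running estimate of each arm's response rate

for t in range(1, 20001):
    if t <= len(true_ctr):
        arm = t - 1                     # serve each arm once to initialise the estimates
    else:
        # Pick the arm with the highest optimistic estimate: mean + confidence width;
        # the width shrinks as an arm is served more often, balancing exploration
        # and exploitation.
        ucb = [means[a] + math.sqrt(2.0 * math.log(t) / counts[a])
               for a in range(len(true_ctr))]
        arm = max(range(len(true_ctr)), key=lambda a: ucb[a])

    reward = 1.0 if random.random() < true_ctr[arm] else 0.0
    counts[arm] += 1
    means[arm] += (reward - means[arm]) / counts[arm]   # incremental mean update

print("serves per arm:", counts)        # most traffic should go to the best arm
print("estimated response rates:", [round(m, 3) for m in means])
```

The generalisation mentioned on the slide, where side information determines the response rate linearly, corresponds to algorithms of the LinUCB type; replacing the linear model with a kernel regressor handles non-linear response rates.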