
Machine Learning

MACHINE LEARNING: CAP 1 – INTRODUCTION TO MACHINE LEARNING
DATA
Data is a collection of data objects and their attributes.
An attribute is a property or characteristic of an object (also known as a variable), and a
collection of attributes describes an object.
An object is also known as a record or observation.
Data types:
Data can be represented in records (data matrix, distance matrix, document data or transaction data), in graphs
(world wide web or molecular structures) or in ordered form (spatial data, temporal data or genetic sequence data).
Data is the new oil. It allows us to:
o Discover new things.
o Increase productivity.
o Strategy (big databases).
o Understand where we are.
o Realize what we have achieved.
o Predict the future.
The progression: Data (symbols and senses) → Information (meaning and memory) → Knowledge (understanding) → Insight → Wisdom (vision).
Analytics: the ability to take data, i.e., to be able to understand it, process it, extract value from it, to visualize it and
to communicate it.
An automatic system is used to disseminate information to the various scenarios. This system will utilize machines for
auto-extracting information and for creating profiles for each of the ‘action points’ in an organization.
The objective is to supply information to support specific activities carried out by individuals/groups. The system
concerns itself with the acquisition of new information, its dissemination, storage, retrieval, and transmittal to the
action points it serves.
A data warehouse is composed of information resulting from ETL over data sources (CRM, ERP, SCM), in order to provide
this information to uses such as OLAP analysis, Data Mining and Reporting.
Data comes from two broad sources:
o User data – email, location, blogs, video, pictures, and social networks.
o Automatic data – apps, sensors, online shops, credit cards and frequent-customer cards.
Big Data sources:
o Social network and media (Facebook, twitter, LinkedIn, blogs, site comments).
o Mobile devices (call, text, location, in-app activity).
o Networked devices/sensors (internet connected hardware, sensors, beacon interactions).
o Internet transactions (purchases, banking activity, investment activity).
Big Data is characterized by the 4 V’s:
o Volume (data size)
o Velocity (speed of change)
o Variety (different forms of data sources) – Internet of Things (IoT)
o Veracity (uncertainty of data)
Big Data allows us to:
o Measure new perspectives of customers.
o Update measures accurately.
o Combine many different perspectives in analysis.
o Understand the context.
A data scientist is someone who is better at statistics than any software engineer and better at software engineering
than any statistician.
Big Data implies huge data volumes that cannot be processed effectively with
traditional applications.
Machine Learning (ML) is an artificial intelligence technique that is broadly used
in Data Mining. ML builds models that can predict values of variables.
Data Mining uses the predictive force of ML by applying various machine
learning algorithms to Big Data.
Data Analytics is all about automating insights into a dataset and supposes the
usage of queries and data aggregation procedures.
Data Analysis stands for human activities aimed at gaining some insight on a
data set.
MACHINE LEARNING
In a traditional programming method, we create a
program and give it the input (variables), which then
gives us an output.
This does not happen with Machine Learning. In ML,
we give the input (all the variables except the variable
we want to predict, i.e., the target) and the output (a
training dataset that already contains the value of the
target for some observations), and the algorithm
creates the program/model, which is then used to
predict the target variable for new observations, based
on the observations we already have.
Every machine learning algorithm has 3 components:
o Representation
o Evaluation
o Optimization
Types of learning:
o Supervised (inductive) learning: training
data includes desired outputs.
o Unsupervised learning: training data does
not include desired outputs.
o Semi-supervised learning: training data
includes a few desired outputs.
o Reinforcement learning: rewards from
sequence of actions.
DATA SCIENCE / ML TASKS
Common data mining tasks:
o Description
- Describes patterns or trends in data.
- Data mining models should be transparent, i.e., results should be interpretable by humans.
- High-quality description is accomplished using Exploratory Data Analysis (EDA).
o Clustering
- Refers to grouping records into classes of similar objects.
- Cluster: collection of records similar to one another, and dissimilar to records in other clusters.
- The clustering algorithm seeks to segment the data set into homogeneous subgroups.
- No target variable is specified, since clustering doesn’t try to classify/estimate/predict a target variable.
o Association
- Finds out which attributes ‘’go together’’.
- Commonly used for Market Basket Analysis (aka Affinity Analysis).
- Quantifies relationships between two or more attributes in the form of rules: IF antecedent THEN consequent.
- Rules are measured using support and confidence.
ex: A particular supermarket might find that on Thursday night 200 of 1,000 customers bought diapers, and of
those buying diapers, 50 also purchased beer.
Association Rule: “IF buy diapers, THEN buy beer”
Support = 50/1,000 = 5% and confidence = 50/200 = 25%
o Regression
- Similar to the Classification task, except the target variable is numeric.
- Models are built from complete data records (records include values for each predictor field and the numeric
target variable in the training set).
- For new observations, estimate the target variable.
o Classification
- Similar to the Regression task, except the target variable is categorical.
- Models are built from complete data records (records include values for each predictor field and the categorical
target variable in the training set).
- For new observations, estimate the target variable.
DATA MINING
A large database represents a large body of information that is presumed to be valuable since it records vital
measurements. Yet, this potential is far from being effectively accessible. The interfaces between humans and storage
do not support exploration, summarization or modelling of large databases. That is the goal of Data Mining (DM) and
Knowledge Discovery in Databases (KDD).
MACHINE LEARNING: CAP 3 – EXPLORATORY DATA ANALYSIS (EDA)
EXPLORATORY DATA ANALYSIS (EDA)
EDA is the initial investigation process on data to discover patterns and to spot anomalies using summary statistics
and graphical representations. The 2 main goals in EDA are:
o Getting to know the data.
- What is the shape of my dataset (number of rows and columns)?
- What are the data types of my variables?
- What are the main statistical characteristics of my variables?
- Are there interesting relationships between my variables?
- Do I have interesting patterns in my data?
o Identify data quality issues.
- Do I have anomalous events or outliers in my data?
STATISTICAL TECHNIQUES
Measures of central tendency (special case of measures of locations):
o Mean (numeric data): the average of the valid values taken by the variable. Sensitive to outliers.
o Median (numeric data): defined as the field value in the middle when the field values are sorted.
o Mode (categorical data): represents the value occurring with the greatest frequency.
More measures of location:
o Percentiles (numeric data)
o Quartiles (numeric data)
Measures of spread/variability:
o Range: difference between the maximum and the minimum.
o Standard deviation: measure the amount of variation or dispersion of a set of values.
o Mean absolute deviation (MAD): the average distance between each data point and the mean.
o Interquartile range (IQR): difference between the 75th and 25th percentiles.
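A minimal sketch of how these summary statistics can be computed with pandas (the values and the variable name are illustrative assumptions, not data from the notes):

```python
import pandas as pd

# Illustrative numeric variable; in practice this would be a column of your dataset.
age = pd.Series([23, 25, 25, 29, 31, 35, 41, 52, 95])

print(age.mean())                       # mean (sensitive to outliers, e.g. 95)
print(age.median())                     # median (middle value when sorted)
print(age.mode()[0])                    # mode (most frequent value)
print(age.max() - age.min())            # range
print(age.std())                        # standard deviation
print((age - age.mean()).abs().mean())  # mean absolute deviation (MAD)
print(age.quantile(0.75) - age.quantile(0.25))  # interquartile range (IQR)
```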
DATA VISUALIZATION
The process of displaying data (often in large quantities) in a meaningful fashion to provide insights that will support
better decisions.
Humans are great at seeing visual patterns and, with them, we can:
o Interpret vast amounts of data.
o Discover interesting patterns.
o Explain complex topics.
o Deliver information quickly.
Box Plot – shows:
o What the key values are, such as the
median, 25th percentile, etc.
o If there are any outliers and what
their values are.
o If the data is symmetrical.
o How tightly the data is grouped.
o If the data is skewed and, if so, in what
direction.
Histogram – shows:
o An estimate of where values are concentrated.
o What the extremes are and whether there are any gaps or unusual values.
Scatter Plot
o By displaying a variable in each axis, you can detect if a
relationship or correlation between the two variables exists.
o Various types of correlation can be interpreted through the
patterns displayed on scatter plots. These are:
- Positive (values increase together).
- Negative (one value decreases as the other increases).
- Null (no correlation).
o Points that end up far outside the general cluster of points are
known as outliers.
Line chart
o Used to display quantitative values over a
continuous interval or time period.
o A line graph is most frequently used to show trends
and to analyse how the data has changed over time.
o The line’s journey across the graph can create
patterns that reveal trends in a dataset.
SUMMARY
o Variable identification: define each variable and its role in the dataset.
o Univariate analysis:
- Continuous variables: box plots or histograms for each variable independently.
- Categorical variables: bar charts to show frequencies.
- Check the descriptive summary using statistical techniques.
o Bi-variate analysis – determine the interactions between variables by building visualization tools:
- Continuous and continuous: scatter plots.
- Continuous and categorical: stacked column chart.
- Categorical and continuous: boxplots combined with swarm plots.
o Detect if there are missing values or outliers.
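A minimal sketch of some of the univariate and bi-variate plots listed above, using matplotlib and seaborn on an illustrative DataFrame (the column names and values are assumptions):

```python
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Illustrative dataset: one continuous and one categorical variable.
df = pd.DataFrame({
    "income": [1200, 1500, 1800, 2100, 2500, 3200, 4000, 8000],
    "region": ["N", "S", "N", "S", "N", "S", "N", "S"],
})

fig, axes = plt.subplots(1, 3, figsize=(12, 3))
axes[0].hist(df["income"], bins=5)               # univariate: histogram of a continuous variable
axes[1].boxplot(df["income"])                    # univariate: box plot (median, quartiles, outliers)
sns.countplot(x="region", data=df, ax=axes[2])   # univariate: bar chart of a categorical variable
plt.tight_layout()
plt.show()

# Bi-variate: categorical vs continuous with a box plot per category.
sns.boxplot(x="region", y="income", data=df)
plt.show()
```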
MACHINE LEARNING: CAP 4 & 5 – DATA SCIENCE & DATA PRE-PROCESSING
1. METHODOLOGY
Methodology in data science (DS): a framework for recording experience, allowing projects to be replicated. It aids
project planning and management.
CRISP-DM (CRoss-Industry Standard Process for Data Mining): a data mining process model that transforms company
data into knowledge and management information (1996).
SEMMA (Sample, Explore, Modify, Model, Assess): refers to the process of conducting a data mining project (2000s).
OSEMN (Obtain, Scrub, Explore, Model, iNterpret): used as a blueprint for working on data problems using machine
learning tools (2010).
DATA SCIENCE LOOP
The data science loop is composed of six phases:
1. Data Cleaning
Make sure that the data:
o Is complete (no missing values)
o Is correct (no errors)
o Has acceptable values (no outliers)
2. Analyse & Sample
Get to know your data:
o Is the data useful to answer your problem?
o Does that data make sense with your business
intuition?
o Do we see any unexpected pattern?
o How is the data distributed?
3. Feature Engineering
“Improve” your data:
o Apply transformations to improve your data.
o Create new features by combining other features.
o Feature selection.
4. Model Build
Apply models:
o Choose the modelling technique.
o Build models.
5. Hyperparameter Optimization
Improve your models:
o Do hyperparameter tuning, i.e., choose the
best model parameters.
6. Evaluate & Compare
Get the best model:
o Evaluate models by comparing their
predictions with real data.
o Pick the best model based on performance
and other desired metrics: scalability,
interpretability, etc.
2. PRE-PROCESSING
Pre-processing: process of turning raw data into a state that can be easily used by an algorithm.
It is important to pre-process data because raw data is often incomplete, redundant and inconsistent, and has outliers,
missing values, etc.
2.1. DATA CLEANING
MISSING VALUES
o Data value that is not stored for a variable in the observation of interest.
o This causes problems for data analysis methods. Absence of information is rarely beneficial to the task of analysis.
o We can spot them by doing a descriptive summary of the data.
o To solve them we can: remove; fill with a constant or with the mean/mode; apply a predictive model; etc.
Types of missing values:
o Missing completely at random (MCAR): neither the unobserved values of the variable with missing data nor the
other variables in the dataset predict whether a value will be missing.
o Missing at random (MAR): other variables (but not the variable with missing data itself) in the dataset can be used to
predict missingness.
o Missing not at random (MNAR): the unobserved value of the variable with missing data is related to the reason it is
missing.
Solving missing values:
o Delete observations containing missing values:
- Not necessarily the best approach.
- The pattern of missing values may be systematic.
- Deleting records creates a biased subset.
- Valuable information in other fields is lost.
o Fill missing values with a constant:
- Missing numeric values replaced with 0.0.
- Missing categorical values replaced with ‘Missing’.
o Fill missing values with the mean/mode:
- Missing numeric values replaced with the mean.
- Missing categorical values replaced with the mode.
- Replacing missing values with the mode or mean sometimes works well.
- The mean is not always the best choice for a typical value.
- Resulting confidence levels for statistical inference become overoptimistic.
- Domain experts should be consulted regarding the approach to replace missing values.
- Benefits and drawbacks resulting from the replacement of missing values must be carefully evaluated.
o Fill missing values with a random value:
- Values randomly taken from the underlying distribution.
- Superior to mean substitution.
o Fill missing values with a predictive model (see the sketch after this list):
- KNN Imputer: match a point with its closest k neighbors in a multi-dimensional space.
- Other predictive models: Linear Regression, Decision Tree, Neural Network, etc.
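A minimal sketch of the constant/mean/mode strategies with pandas and of KNN imputation with scikit-learn (the column names and values are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

df = pd.DataFrame({
    "age":    [25, np.nan, 40, 33, np.nan, 51],
    "income": [1200, 1500, 2500, 1800, 3200, 4000],
    "city":   ["Lisbon", "Porto", None, "Porto", "Lisbon", None],
})

# Fill with a constant.
df["city"] = df["city"].fillna("Missing")

# Fill with the mean (numeric) or the mode (categorical) instead:
# df["age"] = df["age"].fillna(df["age"].mean())
# df["city"] = df["city"].fillna(df["city"].mode()[0])

# Fill with a predictive model: KNN imputation on the numeric columns.
# Each missing age is replaced by the mean age of the 2 nearest rows (here, nearest by income).
knn = KNNImputer(n_neighbors=2)
df[["age", "income"]] = knn.fit_transform(df[["age", "income"]])

print(df)
```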
MISCLASSIFICATIONS & INCOHERENCIES
We have to be careful and verify that the values are valid and consistent.
ex: Divorced at 12?
- There is no guarantee that the resulting records make sense.
- Alternative methods strive to replace values more precisely.
ex: Europe, France, USA, US
- Records classified inconsistently with respect to the origin of the customer.
- Maintain consistency (USA & US → North America; France → Europe).
OUTLIERS
o An observation that lies an abnormal distance from other values in a random sample from a population. Not
necessarily errors.
o Outliers distort the data distribution and statistical methods, and models are sensitive to distortions.
o To spot them we can use graphical methods (histograms, scatterplots, etc) or statistical methods (Z-score,
Isolation Forests, etc).
o To solve this issue, we can: remove, clip, ignore, etc.
Solving outliers:
o Remove:
- Only the most extreme ones.
- Rule of thumb: no more than 3% of your data. If more, try the other approaches for the less extreme outliers.
o Clipping (see the sketch after this list):
- Clip feature values between a lower bound and an upper bound.
- We can choose these lower and upper bound values using percentiles of the feature.
(Values below the lower bound are set to the lower bound, and values above the upper bound are set to the upper bound.)
o Assign a new value:
- If an outlier appears to be caused by a mistake in your data, try imputing a value (mean/median/predictive model).
o Transform:
- For example, try creating a percentile version of your original field and working with that new field instead.
(Create a new dataset in which all variables contain, for example, values within the percentiles 0 to 100. All
values that fell outside these percentiles in the old dataset will now adopt the corresponding percentile
value in the new dataset.)
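A minimal sketch of spotting outliers with an IQR rule and clipping them (the 1.5 × IQR multiplier, the percentile bounds and the data are assumptions, not prescribed by the notes):

```python
import pandas as pd

income = pd.Series([1200, 1500, 1800, 2100, 2500, 3200, 4000, 50000])  # 50000 is a clear outlier

q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = income[(income < lower) | (income > upper)]
print("Outliers:", outliers.tolist())

# Clipping: values outside the bounds are set to the bound values.
income_clipped = income.clip(lower=lower, upper=upper)

# Alternative: clip at chosen percentiles of the feature (e.g. 1st and 99th).
income_pct = income.clip(lower=income.quantile(0.01), upper=income.quantile(0.99))
```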
2.2. DATA TRANSFORMATION
SCALING
Variables tend to have different ranges from each other, and some data mining algorithms are badly affected by these
differences.
Variables with greater ranges tend to have larger influence on data model’s results. Therefore, numeric field values
should be normalized, in order to standardize the scale of effect each variable has on results.
Most common normalization methods:
o MinMax Normalization: guarantees that all data points lie within a given range, more commonly [0,1].
o Z score or standardization: it’s a measure of how many standard deviations below or above the population
mean a raw score is.
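A minimal sketch of both normalization methods with scikit-learn (the data is illustrative):

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler

X = np.array([[1.0, 200.0], [2.0, 400.0], [3.0, 900.0]])  # two features with very different ranges

# MinMax: x' = (x - min) / (max - min), every feature rescaled to [0, 1].
X_minmax = MinMaxScaler().fit_transform(X)

# Z-score: z = (x - mean) / std, every feature with mean 0 and standard deviation 1.
X_zscore = StandardScaler().fit_transform(X)

print(X_minmax)
print(X_zscore)
```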
POWER TRANSFORMS
Used to attempt to make the variable distribution closer to normal.
DUMMY VARIABLES
o Some methods demand numerical variables.
o There is a need to recode the categorical values into one or more flag (dummy) variables.
o A flag variable is a categorical variable taking only two values, 0 or 1.
o When a categorical variable takes k ≥ 3 possible values, define k−1 dummy variables.
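A minimal sketch of creating k−1 dummy variables with pandas (the column and categories are assumptions, borrowed from the golf example used later in the notes):

```python
import pandas as pd

df = pd.DataFrame({"outlook": ["sunny", "overcast", "rainy", "sunny"]})

# k = 3 categories -> drop_first=True keeps k - 1 = 2 flag variables.
dummies = pd.get_dummies(df, columns=["outlook"], drop_first=True)
print(dummies)  # columns: outlook_rainy, outlook_sunny (overcast is the reference category)
```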
BINNING/DISCRETIZING
o When we divide the numerical variables into bins or bands.
o Used when the algorithms prefer categorical variables rather than continuous variables.
We can use the following approaches:
Binning based on predictive value (supervised problems): divides the numerical predictors based on the effect each
partition has on the value of the target variable.
ex: Customers with less than four calls to customer service had a lower churn rate
than customers who had four or more calls to customer service. Bin the customer
service calls variable into two classes:
o Low (fewer than four)
o High (four or more)
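A minimal sketch of binning the customer-service-calls example with pandas (the column name and values are assumptions):

```python
import pandas as pd

calls = pd.Series([0, 1, 2, 3, 4, 5, 7])

# Binning with a domain-driven split point (fewer than four vs four or more).
calls_binned = pd.cut(calls, bins=[-1, 3, calls.max()], labels=["Low", "High"])

# Unsupervised alternative: equal-frequency bins (quartiles).
calls_quartiles = pd.qcut(calls, q=4, labels=["Q1", "Q2", "Q3", "Q4"])

print(calls_binned.tolist())
```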
RECLASSIFY CATEGORICAL VARIABLES
o Similar with binning, but for categorical variables.
o Applied to categorical variables with too many field values.
o Used when applying Logistic Regression and Decision Trees, since they perform suboptimally when confronted
with predictors containing too many field values.
o We can group many fields into one category.
2.3 DATA REDUCTION
We can perform data reduction by:
o Removing observations
- Duplicate observations: lead to an overweighting of those data values.
o Removing variables
- Multicollinearity: a condition where some of the variables are correlated with each other.
¬ Leads to instability in the solution space, possibly resulting in incoherent results.
¬ Overemphasizes particular components of the model.
- Too many variables:
¬ Unnecessarily complicates the interpretation of the analysis.
¬ Violates the principle of parsimony: one should consider keeping the number of variables to a size that can
be easily interpreted.
¬ In supervised problems, can lead to overfitting.
¬ Curse of dimensionality: high-dimensional data leads to sparse data – as the number of features increases,
complexity increases, and data analysis tasks become significantly harder.
One possible solution to the curse of dimensionality is principal component analysis (PCA):
o Defines new axes along which the majority of the variance is maintained.
o The new axes are the principal components (PCs) of the data.
o It is a linear projection – each PC is a linear combination of the original features.
o Geometrically, the PCs represent the directions of the data that explain a maximal amount of variance.
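A minimal sketch of PCA with scikit-learn, with standardization first since PCA is variance-based (the dataset and the number of components are assumptions):

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                    # illustrative dataset: 100 observations, 5 features
X[:, 3] = X[:, 0] + 0.1 * rng.normal(size=100)   # make one feature redundant (correlated with another)

X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)                # project onto the 2 directions of maximal variance
X_pca = pca.fit_transform(X_std)

print(pca.explained_variance_ratio_)     # share of variance kept by each principal component
print(pca.components_)                   # each PC is a linear combination of the original features
```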
3. FEATURE SELECTION
3.1 THE IMPORTANCE OF FEATURE SELECTION
Feature selection: process to select automatically or manually a subset of relevant features to use in model building.
It aims to reduce the number of input variables to those believed to be the most useful to a model for predicting
the target variable. This brings:
o Performance improvement: enhanced generalization by reducing overfitting.
o Computational cost decrease: shorter training time.
o Simpler models that are easier to understand.
o Reduced variable redundancy.
In some specific cases:
o SVMs (support-vector machines) and Neural Networks are vulnerable to irrelevant predictors – they shrink the
predictive performance of the models.
o Linear and Logistic Regressions are sensitive to correlated predictors. Removing redundancy will reduce
multicollinearity and allow a better fit of the models.
3.2 FEATURE SELECTION VS ?
Feature Selection VS Feature Engineering
Feature engineering aims to create new features from the original ones, with the goal of creating more effective models
with higher performance.
Feature selection allows selecting features from all the features available (the original ones and the created ones) to
allow the creation of more efficient models.
In a “normal” pipeline, we perform feature engineering before applying feature selection.
Feature Selection VS Dimensionality reduction
Dimensionality reduction uses unsupervised algorithms to reduce the number of features in a dataset. Those
techniques modify or transform features into a lower dimension.
Feature selection also reduces the number of features in a dataset; however, it is a process that selects and excludes
features without any kind of modification.
3.3 FEATURE SELECTION METHODS
FILTER METHODS
Filter methods: techniques used to select features from a dataset without the use of any ML algorithm.
There are 3 types of filter methods:
o Basic Filter Methods
- Constant features: variables that show the same value in every observation.
- Quasi-constant features: variables where a single value dominates almost all observations.
- Duplicated features (ex: a variable with values like ‘Portugal’, ‘Spain’ and ‘France’, and a variable with values
like ‘PT’, ‘SP’ and ‘FR’ – both variables give the same information).
o Correlation Filter Methods
Are correlated variables always bad?
- If two predictors are highly correlated with each other, they provide redundant information. We can
make an accurate prediction of the target variable with just one of those redundant variables.
- If a predictor is highly correlated with the target, this is a useful property and it should be used in the final
subset of variables chosen.
- Use a feature selection technique to understand the weight of each variable on the target, to choose the one
that is more important.
- Pearson correlation: continuous input and output variables.
¬ Can vary between -1 and 1.
¬ 1 means a positive correlation: the values of one variable increase as the values of the other increase.
¬ -1 means a negative correlation: the values of one variable decrease as the values of the other increase.
¬ 0 means no linear correlation between the two variables.
¬ Both variables should be normally distributed.
¬ There is a linear relationship between the two variables.
- Spearman correlation: continuous or ordinal input and output variables.
¬ Can vary between -1 and 1.
¬ 1 means a positive correlation: the values of one variable increase as the values of the other increase.
¬ -1 means a negative correlation: the values of one variable decrease as the values of the other increase.
¬ 0 means no monotonic correlation between the two variables.
¬ There are no assumptions on the distribution of the variables.
- Kendall correlation.
o Statistical and Ranking Filter Methods
- Chi-squared score.
- ANOVA.
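A minimal sketch of a correlation filter and of a statistical/ranking filter with pandas and scikit-learn (the dataset, the 0.9 threshold and k = 2 are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, f_classif

data = load_iris(as_frame=True)
X, y = data.data, data.target

# Correlation filter: flag pairs of predictors with high absolute Pearson correlation.
corr = X.corr(method="pearson").abs()
redundant = [(a, b) for a in corr.columns for b in corr.columns
             if a < b and corr.loc[a, b] > 0.9]
print("Highly correlated pairs:", redundant)

# Statistical/ranking filter: ANOVA F-score between each predictor and the target.
selector = SelectKBest(score_func=f_classif, k=2).fit(X, y)
print("Selected features:", X.columns[selector.get_support()].tolist())
```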
WRAPPER METHODS
Wrapper methods: use machine learning algorithms to select features. A search strategy is processed through the
space of possible feature subsets (each subset is evaluated based on the quality of the performance of a given algorithm).
The process is summarized as a loop:
1. Search for a subset of features: using a search method, we select a subset of features from the available ones.
2. Build a ML model: a chosen ML algorithm is trained on the previously selected subset of features.
3. Evaluate model performance: we evaluate the trained ML model with a chosen metric.
4. Was the desired condition met (model performance increase, predefined nº of features reached, …)?
If not, return to step 1; if yes, stop.
Finally, choose the subset with the best result in the validation phase.
There are 4 types of search in wrapper methods (a backward/RFE sketch is shown after the advantages below):
o Forward / Sequential: start with no features and add one at a time.
o Backward / RFE: start with all features and remove one at a time.
- Step 1: Start with all the features in the dataset.
- Step 2: Evaluate the performance of the algorithm.
- Step 3: Remove one feature at a time and evaluate the performance of the model.
- Step 4: Remove permanently the least significant feature among the remaining available ones.
- Step 5: Continue the loop by removing one feature at a time in each iteration until the pre-set criterion is
achieved.
o Bidirectional / Stepwise: does both forward and backward simultaneously.
o Exhaustive: try all possible feature combinations.
Advantages of wrapper methods:
o Wrapper methods take into consideration the interaction of features.
o Usually result in better predictive accuracy than filter methods.
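A minimal sketch of the backward / RFE wrapper strategy with scikit-learn (the estimator, the scaling step and the number of features to keep are assumptions):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)
X = StandardScaler().fit_transform(X)   # scale so the coefficients are comparable

# Recursive Feature Elimination: repeatedly fit the model and drop the least important feature.
rfe = RFE(estimator=LogisticRegression(max_iter=1000), n_features_to_select=5)
rfe.fit(X, y)

print("Kept features (mask):", rfe.support_)
print("Feature ranking (1 = kept):", rfe.ranking_)
```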
COMBINING METHODS
We have many techniques, and their results differ among themselves. So how do we choose the variables to keep?
We can apply different techniques and keep the variables that are selected most frequently across them, as in the example shown in the table.
MACHINE LEARNING: CAP 6 – PREDICTIVE MODELLING
DEVELOPING A MODEL
As with any other task performed in a computer, modelling requires a “program” providing detailed
instructions. These instructions are typically mathematical equations which characterize the relationship between
inputs and outputs. Formulating these equations is the central problem in modelling.
The best way to model is to formulate closed-form equations that define how the outputs are derived from the inputs.
Since all the characteristics are fixed when the equations are derived, we refer to them as fixed models. These kinds of
models are suitable for simple, fully understood problems.
When faced with an estimation problem, we might have a very good idea about how the inputs and outputs
interrelate, but not to the level of precision required by a fixed model.
PARAMETRIC MODELS (MODEL DRIVEN)
The key feature of parametric models is that explicit mathematical equations characterize the structure of the
relationship between inputs and outputs, but a few parameters are unspecified. The unspecified parameters are
chosen by examining data examples.
The stage of formulating mathematical equations allows for some flexibility in the model, which is fine tuned by
empirical analysis.
Parametric modelling usually requires a fair amount of knowledge concerning the problem.
ex: Linear regression is a parametric model.
NON-PARAMETRIC MODELS (DATA DRIVEN)
Models relying heavily on the use of data rather than on domain-specific human expertise can be called nonparametric
or data-driven models. They are very successful in solving complex problems and can produce arbitrarily complex
models.
The basic premise of nonparametric methods is that relations consistently occurring in the data set will recur in future
observations.
One important benefit is that nonparametric methods do not require a thorough understanding of the problem.
In summary:
o More correctly described as inductive approaches.
o Very similar to the way humans learn.
o Based on, sometimes painful, experience.
o Need to have plenty of experiences with the appropriate feedback.
Complex problems:
o Problems that are too complex to deal with.
o Though theoretically possible, there are problems in which it is not feasible to collect enough data and allow
enough processing time to generate a nonparametric model.
o Usually, we have information or intuition about which are the most relevant features.
o This knowledge is fundamental in reducing the “search space”.
In data-driven models, we simplify the problem by removing irrelevant information and extracting key features:
Raw input vector → Pre-processor → Pre-processed input vector → Nonparametric model → Output
This can be viewed as separating the model development into two parts: domain-specific knowledge (pre-processor)
and less well understood aspects (model).
ACQUIRE KNOWLEDGE
We can acquire knowledge in two forms:
o Sequential: the examples are presented one at a time and the representation of the knowledge is continually
changed until it converges.
o Batch: the examples are presented all at the same time and processed together.
NUMBER OF ATTRIBUTES
For non-parametric models:
o If we use a small number of attributes, we will be unable to distinguish between classes.
o If we use a large number of attributes (usual in Data Mining), we will come across the curse of dimensionality,
and it will be hard to visualize (“strange effects”).
o We will have to decide which attributes are relevant for the task, and which ones are redundant.
THE CURSE OF DIMENSIONALITY
As the number of dimensions increases, the space gets sparser and finding groups is more difficult.
With one dimension it seems like we have 3 groups,
but with 2 dimensions it seems like we have 5 groups…
SEPARABILITY PROBLEM
o Separable: the different categories do not intersect, making it easier to predict the value (zero error is
possible).
o Non-separable: the different categories intersect, so we cannot define a boundary that perfectly separates the
categories (the error is always greater than 0).
OTHER PROBLEMS
o Noise: may be in the classification of the examples or in the values of the attributes; Bad information is probably
worse than no information.
o Huge amounts of data
o Too much learning
OVERFITTING
Overfitting is a modelling error that happens when a model learns the detail and noise in the training data. As a result,
the algorithm cannot perform accurately against unseen data, defeating its purpose.
(Figure: Learning – training and validation examples are fed to the algorithm, which produces knowledge in the form of a classifier; Classification – the classifier assigns a class to new examples.)
There are 3 different types of datasets:
o Training set: the bigger, the better the classifier.
o Validation set: the bigger, the better the estimate of the optimal training.
o Test set: the bigger, the better the estimate of the performance of the classifier on unseen data.
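A minimal sketch of splitting data into the three sets with scikit-learn (the dataset and the 60/20/20 proportions are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# First split off the test set, then carve a validation set out of the remainder.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0, stratify=y)
X_train, X_val, y_train, y_val = train_test_split(
    X_train, y_train, test_size=0.25, random_state=0, stratify=y_train)  # 0.25 x 0.8 = 0.2

print(len(X_train), len(X_val), len(X_test))  # 90, 30, 30 observations
```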
MACHINE LEARNING: CAP 7 – CLASSIFIERS
CLASSIFICATION
Classification is a method used to build a model or classifier, to classify new examples, given a set of pre-classified
examples.
In supervised learning, classes are known for the examples used to build the classifier. A classifier can be a set of rules,
a decision tree, a neural network, etc.
Simple algorithms often work very well. There are many kinds of simple structure, such as:
o One attribute does all the work.
o All attributes contribute equally and independently.
o A weighted linear combination.
o Instance-based: use a few prototypes.
o Use simple logical rules.
Success of method depends on the domain.
For the following classifiers we will use a case study in which we want to predict if an individual will play golf, based
on the outlook (rainy, overcast, or sunny), the temperature (hot, mild, or cool), the humidity (high or normal) and if it
is windy (true or false).
CLASSIFIERS
1. ZEROR
o ZeroR is the simplest classification method: it is based only on the target and ignores all the predictors.
o It simply predicts the most frequent class.
o It has no predictive power, but it is a useful performance baseline for other classification methods.
ex:
2. ONER
o OneR is a simple classification algorithm that generates one rule for each predictor in the data.
o Learns a 1-level decision tree, i.e., rules that all test one particular attribute.
o Basic version:
- One branch for each value.
- Each branch assigns most frequent class.
- Error rate: proportion of instances that don’t belong to the majority class of their corresponding branch.
- Choose attribute with lowest error rate.
o Assumes nominal attributes.
o Missing values are treated as a separate attribute value.
ex:
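A minimal sketch of OneR on the play-golf case study with pandas (the rows below are an illustrative subset, not the full dataset from the slides):

```python
import pandas as pd

# Illustrative subset of the play-golf data.
df = pd.DataFrame({
    "outlook": ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast", "sunny", "rainy"],
    "windy":   [False, True, False, False, True, True, False, True],
    "play":    ["no", "no", "yes", "yes", "no", "yes", "yes", "no"],
})

best_attr, best_error, best_rules = None, 1.0, None
for attr in ["outlook", "windy"]:
    # One branch per attribute value: predict the most frequent class of that branch.
    rules = df.groupby(attr)["play"].agg(lambda s: s.mode()[0])
    error = (df["play"] != df[attr].map(rules)).mean()  # error rate of this 1-rule
    if error < best_error:
        best_attr, best_error, best_rules = attr, error, rules

print(f"OneR picks '{best_attr}' (error rate {best_error:.2f})")
print(best_rules)
```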
3. BAYESIAN CLASSIFIERS
o A family of simple “probabilistic classifiers” based on applying Bayes’ theorem with strong independence
assumptions between the features.
o Statistical classifiers: perform probabilistic prediction, i.e., predict class membership probabilities.
o Foundation: based on Bayes theorem.
o Performance: a simple Bayesian classifier, naïve Bayesian classifier, has comparable performance with
decision tree and selected neural network classifiers.
o Incremental: each training example can incrementally increase/decrease the probability that a hypothesis is
correct – prior knowledge can be combined with observed data.
o Standard: even when Bayesian methods are computationally intractable, they can provide a standard of
optimal decision making against which other methods can be measured.
Bayes’ Theorem Basics: P(H|X) = P(X|H) · P(H) / P(X), where H is a hypothesis (e.g., a class) and X is the observed evidence (the attribute values).
ex:
Naïve Bayes Classifier
We assume the attributes are conditionally independent.
This greatly reduces the computation costs: only count the class
distribution.
Naïve Bayes works surprisingly well, even when the independence assumption is clearly violated, because classification
doesn’t require accurate probability estimates as long as the maximum probability is assigned to the correct class. However,
adding too many redundant attributes will cause problems.
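A minimal sketch of a naïve Bayes classifier on two of the play-golf attributes, using scikit-learn's CategoricalNB after ordinal-encoding the categorical inputs (the rows are illustrative):

```python
import pandas as pd
from sklearn.naive_bayes import CategoricalNB
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "outlook":  ["sunny", "sunny", "overcast", "rainy", "rainy", "overcast", "sunny", "rainy"],
    "humidity": ["high", "high", "high", "normal", "normal", "normal", "normal", "high"],
    "play":     ["no", "no", "yes", "yes", "yes", "yes", "yes", "no"],
})

enc = OrdinalEncoder()
X = enc.fit_transform(df[["outlook", "humidity"]]).astype(int)  # categories as integer codes
y = df["play"]

nb = CategoricalNB()   # P(class | attributes) via Bayes' theorem + the independence assumption
nb.fit(X, y)

new = enc.transform([["sunny", "normal"]]).astype(int)
print(nb.predict(new), nb.predict_proba(new))
```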
4. INSTANCE BASED CLASSIFIERS
o A family of algorithms that, instead of performing explicit generalization, compare new problem instances with
instances seen in training.
o Simplest form of learning (rote learning):
- Training instances are searched for instance that most closely resembles new instance.
- The instances themselves represent the knowledge.
- Also called instance-based learning.
o Instance based learning is lazy learning (vs eager learning).
o Some methods: nearest neighbor, K-nearest neighbor, …
NEAREST NEIGHBOR CLASSIFIERS
Requires 3 things:
1. The set of stored records.
2. Distance Metric to compute distance between records.
3. The value of k, the number of nearest neighbors to retrieve.
To classify an unknown record:
o Compute the distance to the other training records:
- Euclidean distance.
- Manhattan distance: used when dealing with high dimensionality.
o Identify the k nearest neighbors (the k nearest neighbors of a record x are the data points that have the k smallest
distances to x).
o Determine the class label of the unknown record:
- Take the majority vote of class labels among the k nearest neighbors.
- Weigh the vote according to distance.
When choosing the value of k, we have to consider that:
o If k is too small:
- Sensitive to noise points.
- Very sensitive to outliers.
- Crisp frontiers.
- Leads to overfitting.
o If k is too large:
- Neighborhood may include points from other classes.
- Smooth frontiers.
- Unable to detect small variations.
- Leads to underfitting.
Scaling issues: attributes may have to be scaled to prevent distance measures from being dominated by one of the
attributes.
Disadvantages of k-NN:
o They are lazy learners.
o It does not build models explicitly, unlike eager learners such as decision tree induction and rule-based systems.
o Classifying unknown records is relatively expensive.
Difficulties of k-NN:
o Has to calculate the distance of the test case from all training cases.
o There may be irrelevant attributes amongst the attributes – curse of dimensionality.
o Requires a lot of memory to store the training set.
o Requires a LOT of time at the classification stage.
o Very sensitive to outliers.
o Very sensitive to the distance function chosen.
ex:
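A minimal sketch of k-NN classification with scikit-learn, scaling first so the distance is not dominated by one attribute (the dataset and k = 5 are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Scale first, then vote among the 5 nearest neighbors (Euclidean distance by default).
knn = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
knn.fit(X_train, y_train)

print("Test accuracy:", knn.score(X_test, y_test))
```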
MACHINE LEARNING: CAP 8 – DECISION TREES
1. DECISION TREES
o Non-parametric supervised learning algorithm used for classification and regression.
o Decision trees can be thought of as classification and estimation tools.
o One of its major advantages has to do with the fact that they represent rules, which are fairly simple to
interpret.
o In some problems we are just interested in achieving the best precision possible. In others we are more
interested in understanding the results and the way the model produces its estimates. Sometimes the
reasons that underlie certain decisions are of fundamental importance.
The objective is to:
- Discriminate between classes.
- Obtain leaves that are as pure as possible; ideally, each leaf only represents individuals of a particular class.
CLASSIFICATION TREES VS REGRESSION TREES
(Figure: an example classification tree and an example regression tree.)
ADVANTAGES OF USING DECISION TREES
o Interpretation: easily understand the underlying reason to the decision.
o No problems in dealing with different types of data (interval, ordinal, etc): not necessary to define the relative
importance of the variables.
o Insensitive to scale factors: different types of measurements can be used without the need for normalization.
o Automatic definition of the attributes that are more relevant in each case: the most relevant attributes appear
in the top part of the tree.
o Can be adapted to regression: linear local models in the leaves.
o Decision trees are considered a nonparametric method: no assumptions about the space distribution and the
classifier structure.
DISADVANTAGES OF USING DECISION TREES
o Most of the algorithms require a discrete target.
o Small variations in the data can result on very different trees.
o Sub-trees can be replicated several times.
o Worse results when dealing with many classes.
o Linear boundaries perpendicular to the axes.
DECISION TREES INDUCTION
Problems in building trees:
Basic algorithm (a greedy algorithm)
o Tree is constructed in a top-down recursive divide-and-conquer manner.
o At start, all the training observations are at the root.
o If attributes are continuous-valued, they are discretized in advance.
o Observations are partitioned recursively.
Conditions for stopping partitioning:
o All observations for a given node belong to the same class.
o There are no remaining attributes for further partitioning – majority voting is employed to classify the leaf.
o There are no samples left.
THE ALGORITHMS
2. CLASSIFICATION TREES
Classification trees are tree models where the target variable can take a discrete set of values.
DTT
Major characteristics:
o It is a greedy search.
o There is no “backtracking” (once a partition is done, it is not re-evaluated).
o It uses discriminative power as the selection measure.
ex:
We will use the discriminative metric to build the tree:
This is a measure of “dominance” or “purity”.
With one tail, we choose the bigger D.P., so we choose # nuclei, giving us the following tree:
Doing this iteratively, we conclude:
INFORMATION GAIN – ID3 / C4.5
Major characteristics:
o It uses entropy to measure the “disorder” in each independent variable.
o From entropy, we can calculate the information gain, the selection measure.
o ID3 handles only categorical attributes while C4.5 is able to deal also with numeric values.
ex:
We are going to check the entropy associated to the variable age:
So, in conclusion:
But what about if I wanted to use the original age (algorithm C4.5)?
o Originally Age is a continuous-valued attribute.
o So, we must determine the best split point for Age:
- Sort the values of Age.
- Every midpoint between each pair of adjacent values is a possible split point.
- Evaluate the resulting entropy with two partitions (Partition 1 ≤ split point < Partition 2).
- The point with the minimum entropy for Age is selected as the split point.
CART
CART uses the Gini Index, which is similar to entropy but without the log, so it is faster!
ex:
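A minimal sketch of a classification tree with scikit-learn, whose DecisionTreeClassifier implements a CART-style tree (criterion="gini" is the Gini index; "entropy" gives information-gain style splits; the dataset and max_depth are assumptions):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

tree = DecisionTreeClassifier(criterion="gini", max_depth=3, random_state=0)
tree.fit(X_train, y_train)

print("Test accuracy:", tree.score(X_test, y_test))
print(export_text(tree, feature_names=load_iris().feature_names))  # the tree as readable IF-THEN rules
```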
3. OVERFITTING IN DECISION TREES
An induced tree may overfit the training data:
o Too many branches; some may reflect anomalies due to noise or outliers.
o This results in poor accuracy for unseen samples.
To avoid overfitting in DTs we have two main approaches:
o Prepruning:
- Do not split a node if this would result in the goodness measure falling below a threshold.
- It is difficult to choose an appropriate threshold.
o Postpruning:
- Remove branches from a “fully grown” tree – get a sequence of progressively pruned trees.
- Use a set of data different from the training data to decide which is the “best pruned tree”.
The training set is used to develop the tree and the validation set is used to assess the generalization ability of the tree.
MACHINE LEARNING: CAP 9 – NEURAL NETWORKS
TYPES OF PROBLEMS
o Hydrology
o Spatial interaction models
o Climate predictions
o Modelling urban development
o Predicting erosion
o Property prices
NEURAL NETWORKS CLASSIFICATION
MULTI-LAYER PERCEPTRON (MLP) NEURAL NETWORKS
Advantages in forecast problems:
o Learn “automatically”.
o “Data driven”.
o Especially useful when we know little about the phenomena we are trying to model.
o Do non-linear interpolation.
o Universal approximators: systems that can approximate functions, however complex the shape of the
function may be.
BIOLOGICAL NEURON
Dendrites: receive information.
Cell body: process information.
Axon: carries processed information to other neurons.
Synapse: junction between Axon end and Dendrites of other neurons.
ARTIFICIAL NEURON
o Receives inputs X1, X2, …, Xp from other neurons or the environment.
o Inputs are fed in through connections with ‘weights’.
o Total input = weighted sum of inputs from all sources.
o A transfer function (activation function) converts the inputs to outputs.
o The output goes to other neurons or the environment.
1. THE PERCEPTRON MODEL
Each neuron receives a set of weighted inputs from the neurons of the previous
layer. All these inputs are added together in order to obtain the activation
value. The activation value is “squashed” by an activation function (sigmoid) in
order to determine the output value.
o Every neuron is numbered from 1 to n.
o The output of the jth neuron is o_j.
o The weight between input i and unit j is w_ij.
o The activation of the jth unit is a_j.
o The activation function is denoted by Φ(x).
Activation function:
o Transforms the inputs into outputs.
o Activation functions for the hidden units are needed to introduce nonlinearity into the network.
o Elements of the activation functions:
- Must have a “squashing” effect (improves numerical calculations).
- Non-decreasing monotonic function (order preserving).
Activation functions can take the following forms:
THE LEARNING PROCESS IN A NEURON
Need to define the synaptic weights (how can we define the w’s?).
General idea:
o Synapses that help achieving good results should be reinforced.
o Synapses that lead to unwanted results should be weakened.
THE TRAINING ALGORITHM
o Adjusts the weights in the network.
o It distributes the responsibility for the error measured in the output.
o The objective is to be able to generalize knowledge.
The training algorithm adjusts the synaptic weights in such a way that, given a set of inputs, the desired result is
achieved. The idea - trial and error:
o Randomly initialize the weights.
o If the result is good (big luck!), stop!
o If the result is wrong, we make a small adjustment.
Functional signals: propagate from the input neurons to the output neurons.
Error signals: propagate from the output neurons to the input neurons (layer by layer).
We can calculate the error and “backpropagate” it. Instead of making big and direct adjustments, we make many small
adjustments.
2. MULTI-LAYER PERCEPTRON
Input layer:
o Introduces the input values in the network.
o No activation function.
o Each neuron gets only one input, directly from outside.
Hidden layer:
o Classifies the characteristics.
o Two hidden layers are enough to deal with any problem,
no matter how complex it may be.
o Connects input and output layers.
Output layer:
o Same functionality as the hidden layer.
o The outputs are passed outside the network.
o Output of each neuron directly goes outside.
THE TRAINING ALGORITHM - BACKPROPAGATION
o A training algorithm which allows the adjustment of the weights in a multi-layer feedforward neural network.
o Theoretically, it can approximate any function (it can map inputs to outputs), no matter how complex.
o It can provide solution to linearly inseparable problems.
DELTA RULE
For the unit j in the output layer:
For the unit j in the hidden layer:
WEIGHT UPDATE
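A standard textbook formulation of the delta rule and weight update for a sigmoid MLP, stated here as a reference sketch using the notation introduced above (o_j, w_ij, Φ), with d_j the desired output and η the learning rate:

```latex
% Delta rule for unit j in the output layer (sigmoid, so \Phi'(net_j) = o_j(1 - o_j)):
\delta_j = (d_j - o_j)\, o_j (1 - o_j)

% Delta rule for unit j in a hidden layer (errors backpropagated from the units k it feeds):
\delta_j = o_j (1 - o_j) \sum_k \delta_k w_{jk}

% Weight update, with o_i the input arriving at unit j through weight w_{ij}:
\Delta w_{ij} = \eta\, \delta_j\, o_i, \qquad w_{ij} \leftarrow w_{ij} + \Delta w_{ij}
```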
MACHINE LEARNING: CAP 10 – UNSUPERVISED LEARNING MODELS
1. UNSUPERVISED LEARNING
Unsupervised learning refers to machine learning techniques that learn patterns from untagged (unlabeled) data.
Here we focus on descriptive models, where we can use clustering algorithms such as Hierarchical Clustering (HC),
k-Means and Self-Organizing Maps (SOM).
CLUSTERING VS CLASSIFICATION
Supervised learning is used to:
o Train on datasets with labels.
o Learn relationship between features and target.
o Classify unseen data points.
In unsupervised learning:
o Data does not have labels.
o It finds optimal clusters (groups).
o Not used with unseen data points.
2. CLUSTER ANALYSIS
Clustering: task of grouping data points (entities, observations, …) based on similarity. Assumes that similar points are
related and thus can be considered a group (cluster).
Cluster analysis:
o A basic conceptual activity of humans.
o Fundamental process common to many sciences, essential to the development of scientific theories.
o The possibility of reducing the complexity of the real world into sets of similar objects or phenomena is one of the
most powerful tools in the service of man.
The objective is to minimize the intra-cluster distance (within each cluster) and maximize the inter-cluster distance
(between clusters).
Cluster analysis is widely used in exploratory data analysis:
o It summarizes big datasets (we lose detail but gain understanding).
o It characterizes entities in a dataset (clients, products, etc).
It can be used in client segmentation, product segmentations or summarization of medical imagining.
MAIN STAGES
1. Definition of variables
o The type of problem determines the variables to choose from.
o Including discriminant variables is decisive.
o The quality of any cluster analysis is highly conditioned by the variables used.
o The choice of variables should be supported by a theoretical context.
o This process is carried out based on a set of variables that we know are good discriminators for the problem in
question.
o The quality of the cluster analysis reflects the discrimination ability of the variables used.
This is important because:
o Clustering is highly sensitive to the choice of features.
o Features have a large impact on the similarity of entities.
o Curse of dimensionality.
We should focus on:
o Picking features that are relevant for the problem.
o Avoiding correlated variables.
- Pearson correlation coefficient: measures the degree of linear association between two variables.
- Spearman correlation coefficient: determines the strength of the monotonic relationship between two variables.
o Avoiding irrelevant features (almost constant features).
2. Similarity measure
o Function that receives two entities and returns a similarity/dissimilarity score.
o The right choice is fundamental for obtaining good clusters.
o The choice depends on the type of data and problem: is the data categoric or numeric? Is it high-dimensional?
o The most common types of measures are geometric measures:
- Euclidean Distance: the straight-line distance between two elements (most used).
- Weighted Euclidean Distance: each variable is assigned a weight according to its importance to the analysis.
- Manhattan Distance: the distance between two points measured along axes at right angles (used for
high-dimensional applications).
3. Algorithm
4. Profiling
The main goal is to understand what distinguishes each cluster. We can do this by comparing means,
distributions, feature relationships, etc.
Interpretation – the goal is to obtain meaningful and useful clusters.
o Warnings:
- Random chance can often produce apparent clusters.
- Different cluster methods produce different results.
o Solutions:
- Obtain summary statistics.
- Also review the clusters in terms of variables not used in clustering.
- Label the clusters.
Desirable cluster features:
o Stability
- Are clusters and cluster assignments sensitive to slight changes in inputs?
- Are cluster assignments in partition B similar to those in partition A?
o Separation
- Check the ratio of between-cluster variation to within-cluster variation (higher is better).
3. HIERARCHICAL CLUSTERING
Hierarchical clustering (HC) is a method of cluster analysis which seeks to build a hierarchy of clusters.
AGGLOMERATIVE CLUSTERING
Step 1: Define a distance function (how to compute the distance between two data points?) – Seen in Topic 2.
Step 2: Pick a linkage criterion (how to compute the distance between two clusters?)
Ward’s method
o When each cluster has one record, there is no loss of information, and all individual values remain available.
o When records are joined together and represented in clusters, information about an individual record is
replaced by the information for the cluster to which it belongs.
o Employs a measure (Sum of Squares Error (SSE)) that measures the difference between individual records and
a group mean.
Step 3: Compute cluster hierarchy.
- Start with each data point in its own cluster.
- At each iteration, merge the two closest clusters.
- Stop when all data points belong to a single cluster.
Step 4: Pick number of clusters and get the cluster assignments.
Determining the number of clusters:
For a given distance between clusters, a horizontal line is drawn across the dendrogram; the branches it intersects
(the clusters that are far apart at that distance) define the final clusters.
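A minimal sketch of agglomerative clustering with SciPy, using Ward's linkage and cutting the tree to get cluster assignments (the data and the number of clusters are assumptions):

```python
import numpy as np
from scipy.cluster.hierarchy import dendrogram, fcluster, linkage

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.5, (20, 2)), rng.normal(4, 0.5, (20, 2))])  # two obvious groups

Z = linkage(X, method="ward")                    # step 3: build the cluster hierarchy (SSE-based merges)
labels = fcluster(Z, t=2, criterion="maxclust")  # step 4: cut the tree into 2 clusters

print(labels)
# dendrogram(Z) would draw the tree; a horizontal cut corresponds to choosing a distance threshold.
```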
DISADVANTAGES
Hierarchical clustering is a greedy algorithm. Thus:
o Once it performs an iteration (merging or splitting), it cannot go back → can lead to suboptimal solutions.
o Computationally intensive for big data → requires computing pairwise distances between all data points.
o It cannot be used to “predict” unseen data.
4. K-MEANS
K-Means is a clustering method that aims to partition n observations into k clusters in which each observation belongs
to the cluster with the nearest mean.
PARTITION OR OPTIMIZATION METHODS
Given a dataset with n objects:
o Construct k partitions, where each partition represents a cluster (k ≤ n).
o Classify the data into k groups, satisfying the following conditions:
- Each group contains at least one object.
- Each object belongs to only one cluster.
THE ALGORITHM
Consider we have two variables and want to group the observations into 5 clusters:
1. Initialization: seed definition (typically at random).
2. Each individual is associated with the closest seed.
3. Calculate the centroids of the clusters formed – recalculate each seed so that it lies at the center of its cloud
of points (this point is named the centroid).
4. Return to step 2 (2nd, 3rd, … iterations).
5. End when the centroids no longer change (final solution).
VARIANTS
o K-Means: each cluster is represented by the mean values of the objects belonging to the cluster.
o K-Medoids: variant of k-Means that changes the way the cluster centroid is defined. Each cluster is represented by
one of the objects located near the center of the cluster.
- Goal: reduce sensitivity to noise and allow for more flexible distance functions.
- It is not a hypothetical center, but a real point that is most at the center of the cluster.
o K-Mode: variant of k-Means for categorical data that uses the mode instead of the mean. It uses a simple
matching dissimilarity measure.
ADVANTAGES OF K-MEANS
o Simple to implement and understand.
o Easily adapts to new data points.
o Guarantees convergence.
o Current implementations are fast.
DISADVANTAGES OF K-MEANS
o The number of seeds: we need to set the starting number of clusters to create.
o The initialization: very sensitive to the initial positions of the seeds as well as to the existence of outliers.
o The distance used: Euclidean distance does not work well in high dimensions or with unscaled data.
o The data “shape”: methods of partition work well with clusters of spherical shapes. To find clusters with
complex shapes, methods of partition are not the best choice.
DEFINING NUMBER OF CLUSTERS
Deciding how many clusters to use is a difficult problem. We can define the number of clusters using the
following methods:
o Elbow method: produce various cluster solutions with different k and choose the best. As the number of
clusters increases, the cluster members are closer to one another.
- Goal: minimize intra-group variability.
o Dendrogram: use a hierarchical method and choose the number of clusters based on the dendrogram.
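A minimal sketch of k-Means and the elbow method with scikit-learn, where inertia is the intra-group variability being minimized (the data and the range of k are assumptions):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(c, 0.4, (30, 2)) for c in (0, 3, 6)])  # three blobs

inertias = []
for k in range(1, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)          # sum of squared distances to the closest centroid

print(list(zip(range(1, 8), np.round(inertias, 1))))  # the "elbow" should appear around k = 3
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)
```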
INITIALIZATION PROBLEM
o Use multiple forms of initialization.
o Reboot several times.
o Use more than one method.
o Use a relatively large number of clusters and re-cluster them based on the centroids.
K-Means++
o Variant of k-Means with a new seed initialization.
o Goal: address the sensitivity of the model to the initialization.
o Intuition: spreading out the k initial seeds leads to good cluster seeds.
SUMMARY
o Cluster analysis is an exploratory tool. Useful only when it produces meaningful clusters.
o Hierarchical clustering gives visual representation of different levels of clustering.
o On the other hand, because merges cannot be undone, it can be unstable, can vary highly depending on
settings, and it is computationally expensive.
o Non-hierarchical clustering is computationally cheap and more stable.
o It requires the user to set k.
o We can use both methods together.
o Be cautious of changing results → the data may not have definitive “real” clusters.
MACHINE LEARNING: CAP 11 – SOM
SELF ORGANIZING MAPS (SOM)
SOM is an unsupervised neural network used for clustering,
visualization, and dimensionality reduction.
The main idea is to map high-dimensional data into a
1- or 2-dimensional space (map).
Goal: cause different parts of the network to respond similarly to
certain input patterns.
How does it work?
o SOM is a grid of units which adapt the topological shape of a dataset, allowing us to visualize large datasets
and identify potential clusters.
o It learns the shape of a dataset by repeatedly moving its neurons closer to the data points.
UNITS
Units are the “neurons” of SOM. Each unit is interconnected with its neighbors and interacts and learns with them
during training.
COMPETITIVE LEARNING
The units compete between themselves to be the best.
Step 1: Define the size, learning rate and initial inputs.
o Each unit is a vector with size equal to the number of variables.
Step 2: Define weights (usually at random).
Step 3: Train the algorithm.
o For each observation presented, the algorithm finds the best matching unit (BMU) – unit that is closest to
the observation - and update weights – update in a way that the unit is closer to the observation.
o Also update the nearest units – so they are closer to the BMU.
Step 4: Repeat the process until we have good results.
o The distance between a neuron and its nearest neighbor is close to 0.
THE ALGORITHM
The algorithm has 3 phases:
o Competition: all units compete to be the best matching unit (BMU).
o Cooperation: the neighbors of the BMU are also updated.
o Update: all values are updated in a way to get closer to the observation.
To have a stable solution, the learning rate and neighborhood radius must converge to 0.
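A minimal numpy sketch of one SOM training update (competition: find the BMU; cooperation: compute a neighbourhood around it on the grid; update: pull those units towards the observation). The grid size, learning rate and neighbourhood radius are assumptions, and in a full training loop both would decay towards 0 as the notes state:

```python
import numpy as np

rng = np.random.default_rng(0)
grid_h, grid_w, n_vars = 5, 5, 3
weights = rng.random((grid_h, grid_w, n_vars))            # step 2: random initial unit weights
coords = np.indices((grid_h, grid_w)).transpose(1, 2, 0)  # (row, col) position of each unit on the grid

def som_update(x, weights, lr=0.5, radius=1.5):
    # Competition: the best matching unit (BMU) is the unit closest to the observation x.
    dists = np.linalg.norm(weights - x, axis=2)
    bmu = np.unravel_index(np.argmin(dists), dists.shape)
    # Cooperation: Gaussian neighbourhood around the BMU, measured on the grid.
    grid_dist = np.linalg.norm(coords - np.array(bmu), axis=2)
    influence = np.exp(-(grid_dist ** 2) / (2 * radius ** 2))
    # Update: move the BMU and its neighbours closer to the observation.
    weights += lr * influence[..., None] * (x - weights)
    return weights

x = rng.random(n_vars)          # one observation with n_vars variables
weights = som_update(x, weights)
```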
TOOLS TO EXPLORE SOM
o U-Matrices.
o Component planes.
o Hit plots.
o Parallel coordinate plots.
o Boxplots and histograms.
o Geographic map.
MACHINE LEARNING: CAP 12 – ASSOCIATION RULES
1. ASSOCIATION RULES
The objective is to extract patterns describing subsets of data:
o Events that occur simultaneously.
o Predict relationships between products.
It assumes that all data is categorical. It is not a good algorithm for numeric data.
Initially used for Market Basket Analysis to find how items purchased by customers are correlated.
ex:
This is not only applied to transactions. It can also be applied to a set of documents (a text document data set, where
each document is treated as a “bag” of keywords).
THE MODEL: RULES
o A transaction t contains X, a set of items (itemset) in I, if X ⊆ t.
o An association rule is an implication of the form: X → Y, where X, Y ⊂ I
and X ∩ Y = ∅.
o An itemset is a set of items.
o A k-itemset is an itemset with k items.
ex:
QUALITY OF ASSOCIATION RULES: RULES STRENGTH MEASURES
ex (based on the previous example): {Milk, Diapers} → {Beer}
An association rule is a pattern that states when X occurs, Y occurs with certain probability.
Rule strength measures: support, confidence, expected confidence, lift.
The support count of an itemset X, denoted by X.count, in a data set T, is the number of transactions in T that contain X.
Assume T has n transactions:
o support(X → Y) = (X ∪ Y).count / n
o confidence(X → Y) = (X ∪ Y).count / X.count
The support of a rule X → Y is the percentage of transactions in T that contain X ∪ Y, and can be seen as an estimate of
the probability P(X ∪ Y). The rule support thus determines how frequently the rule is applicable in the transaction set T.
Confidence is the percentage of transactions in T that contain X and also contain Y → P(Y|X), a conditional probability.
Expected confidence is the percentage of times the itemset Y appears in the database.
Lift is the ratio of the confidence and the expected confidence.
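A minimal sketch computing these measures for one rule from a list of transactions (the transactions and the rule are illustrative):

```python
transactions = [
    {"milk", "diapers", "beer"},
    {"milk", "bread"},
    {"diapers", "beer"},
    {"milk", "diapers", "bread"},
    {"diapers", "beer", "cola"},
]

X, Y = {"milk", "diapers"}, {"beer"}      # rule X -> Y
n = len(transactions)

xy_count = sum(X | Y <= t for t in transactions)   # transactions containing X U Y
x_count = sum(X <= t for t in transactions)
y_count = sum(Y <= t for t in transactions)

support = xy_count / n                    # estimate of P(X U Y)
confidence = xy_count / x_count           # estimate of P(Y | X)
expected_confidence = y_count / n         # estimate of P(Y)
lift = confidence / expected_confidence

print(support, confidence, expected_confidence, lift)
```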
THE GOAL
Find all rules that satisfy the user-specified minimum support (minsup) and minimum confidence (minconf).
ex:
ASSOCIATION RULE PROBLEM
Given a set I of all the items, a database D of transactions, minimum support s and minimum confidence c, find all
association rules X → Y with a minimum support s and confidence c.
Problem decomposition:
1. Find all sets of items that have minimum support (frequent itemsets).
ex: If the minimum support is 50%, then {Shoes, Jacket} is the only 2-itemset
that satisfies the minimum support.
If the minimum confidence is 50%, then the only two rules generated from this
2-itemset that have confidence greater than 50% are:
o Shoes → Jacket
- Support = 50% and Confidence = 66%.
o Jacket → Shoes
- Support = 50% and Confidence = 100%.
2. Use the frequent itemsets to generate the desired rules.
TRANSACTION DATA REPRESENTATION
o A simplistic view of shopping baskets.
o Some important information is not considered:
- The quantity of each item purchased.
- The price paid.
ALGORITHMS
There are a large number of algorithms we can use and each one uses different strategies and data structures.
However, their resulting set of rules are all the same. We will study the Apriori Algorithm.
2. APRIORI ALGORITHM
Frequent itemset property: every subset of a frequent itemset is also frequent.
o A transaction containing {beer, diapers, milk} also contains {beer, diapers}.
o If {beer, diapers, milk} is frequent, then {beer, diapers} is also frequent.
Contrapositive: if an itemset is not frequent, none of its supersets are frequent.
ex:
GENERATE CANDIDATES
ex: L3 = {abc, abd, acd, ace, bcd}
o Generating C4 from L3: abcd from abc and abd; acde from acd and ace.
o Pruning: acde is removed because ade is not in L3.
o So C4 = {abcd}
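A minimal sketch of the join-and-prune candidate generation step on the example above, with itemsets represented as sorted tuples:

```python
from itertools import combinations

L3 = [("a", "b", "c"), ("a", "b", "d"), ("a", "c", "d"), ("a", "c", "e"), ("b", "c", "d")]

def generate_candidates(Lk):
    k = len(Lk[0])
    Lk_set = set(Lk)
    candidates = set()
    for p, q in combinations(Lk, 2):
        if p[:-1] == q[:-1]:                       # join: itemsets sharing their first k-1 items
            candidates.add(tuple(sorted(set(p) | set(q))))
    # prune: drop candidates that have an infrequent k-subset (the Apriori property)
    return [c for c in candidates
            if all(sub in Lk_set for sub in combinations(c, k))]

print(generate_candidates(L3))   # [('a', 'b', 'c', 'd')] -- acde is pruned because ade is not in L3
```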
ADVANTAGES AND DISADVANTAGES
The Apriori algorithm seems to be very expensive, however:
o Level-wise search.
o k = the size of the largest itemset.
o It makes at most k passes over data.
o The algorithm is very fast.
o Under some conditions, all rules can be found in linear time.
o Scale up to large datasets.
3. CONCLUSION
After finding association rules, we can apply them in marketing actions, inventory management or CRM. It can also
be applied in bioinformatics, data analysis, web mining and medical diagnosis.
Types of association rules:
o Useful rules (ex: Diapers → Beer)
o Trivial rules (ex: Painting can → Painting brush)
o Unexplained rules
We can assess the quality of the results using:
o Confidence factor.
o Level of support.
o Lift.
o Expected confidence.
A credible rule must have:
o Good level of confidence.
o High level of support.
o Lift greater than one.