Chapter 5
Observation or record – A set of observed values of variables associated with a single entity, often displayed as a row in a spreadsheet or database
Unsupervised learning – Category of data mining techniques in which an algorithm explains relationships without an outcome variable to guide the process
Market segmentation – The partitioning of customers into groups that share common characteristics so that a business may target customers within a group with a tailored marketing strategy
k-means clustering – Process of organizing observations into one of k groups based on a measure of similarity (typically Euclidean distance)
Hierarchical clustering – Process of agglomerating observations into a series of nested groups based on a measure of similarity
Euclidean distance – Geometric measure of dissimilarity between observations based on the Pythagorean theorem
Manhattan distance – Measure of dissimilarity between two observations based on the sum of the absolute differences in each variable dimension
Matching coefficient – Measure of similarity between observations based on the number of matching values of categorical variables
Matching distance – Measure of dissimilarity between observations based on the matching coefficient
Jaccard's coefficient – Measure of similarity between observations consisting solely of binary categorical variables that considers only matches of nonzero entries
Single linkage – Method of calculating dissimilarity between clusters by considering only the two most similar observations between the two clusters
Complete linkage – Method of calculating dissimilarity between clusters by considering only the two most dissimilar observations between the two clusters
Group average linkage – Method of calculating dissimilarity between clusters by considering the average distance between each pair of observations between two clusters
Median linkage – Method that computes the similarity between two clusters as the median of the similarities between each pair of observations in the two clusters
Centroid linkage – Method of calculating dissimilarity between clusters by considering the distance between the centroids of the respective clusters
Ward's method – Procedure that partitions observations in a manner to obtain clusters with the least amount of information loss due to the aggregation
McQuitty's method – Method that computes the dissimilarity introduced by merging clusters A and B by, for each other cluster C, averaging the distance between A and C and the distance between B and C, and summing these average distances
Dendrogram – A tree diagram used to illustrate the sequence of nested clusters produced by hierarchical clustering
Association rules – An if-then statement describing the relationship between item sets
Market basket analysis – Analysis of items frequently co-occurring in transactions (such as purchases)
Antecedent – The item set corresponding to the if portion of an if-then association rule
Consequent – The item set corresponding to the then portion of an if-then association rule
Support – The percentage of transactions in which a collection of items occurs together in a transaction data set
Confidence – The conditional probability that the consequent of an association rule occurs given that the antecedent occurs
Lift ratio – The ratio of the performance of a data mining model measured against the performance of a random choice.
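The Euclidean and Manhattan distance measures defined above are simple enough to compute directly; a minimal sketch (the two example observations are made up for illustration):

```python
import math

def euclidean_distance(u, v):
    # Square root of the sum of squared differences (Pythagorean theorem)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def manhattan_distance(u, v):
    # Sum of the absolute differences in each variable dimension
    return sum(abs(a - b) for a, b in zip(u, v))

# Two observations measured on two variables
x = (1.0, 2.0)
y = (4.0, 6.0)
print(euclidean_distance(x, y))  # 5.0 (the 3-4-5 right triangle)
print(manhattan_distance(x, y))  # 7.0
```

Note that the Manhattan distance is never smaller than the Euclidean distance for the same pair of observations, so the choice of measure can change which observations a clustering method treats as most similar.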
In the context of association rules, the lift ratio is the ratio of the probability of the consequent occurring in a transaction that satisfies the antecedent versus the probability that the consequent occurs in a randomly selected transaction
Text mining – The process of extracting useful information from text data
Unstructured data – Data, such as text, audio, or video, that cannot be stored in a traditional structured database
Document – A piece of text, which can range from a single sentence to an entire book depending on the scope of the corresponding corpus
Term – The most basic unit of text comprising a document, typically corresponding to a word or word stem
Corpus – A collection of documents to be analyzed
Bag of words – An approach for processing text into a structured row-column data format in which documents correspond to row observations and words (or more specifically, terms) correspond to column variables
Presence/absence (binary) document-term matrix – A matrix with rows representing documents (units of text) and columns representing terms (words or word roots), and entries indicating either the presence or absence of a particular term in a particular document (1 = present and 0 = not present)
Tokenization – The process of dividing text into separate terms, referred to as tokens
Term normalization – A set of natural language processing techniques to map text into a standardized form
Stemming – The process of converting a word to its stem or root word
Stopwords – Common words in a language that are removed in the preprocessing of text
Frequency document-term matrix – A matrix whose rows represent documents (units of text) and columns represent terms (words or word roots), and whose entries are the number of times each term occurs in each document
Term frequency times inverse document frequency (TF-IDF) – Text mining measure that accounts for term frequency and the uniqueness of a term in a document relative to other documents in a corpus
Cosine distance – A measure of dissimilarity between two observations often used on frequency data derived from text because it is unaffected by the magnitude of the frequency and instead measures differences in frequency patterns
Word cloud – A visualization of text data based on word frequencies in a document or set of documents
Sentiment analysis – The process of clustering/categorizing comments or reviews as positive, negative, or neutral
Chapter 6
Census – Collection of data from every element in the population of interest
Statistical inference – The process of making estimates and drawing conclusions about one or more characteristics of a population (the value of one or more parameters) through the analysis of sample data drawn from the population
Sampled population – The population from which the sample is drawn
Frame – A listing of the elements from which the sample will be selected
Parameter – A measurable factor that defines a characteristic of a population, process, or system, such as a population mean μ, a population standard deviation σ, or a population proportion p
Simple random sample – A simple random sample of size n from a finite population of size N is a sample selected such that each possible sample of size n has the same probability of being selected
Random sample – A random sample from an infinite population is a sample selected such that the following conditions are satisfied: (1) each element selected comes from the same population, and (2) each element is selected independently
Sample statistic – A characteristic of sample data, such as a sample mean x̄, a sample standard deviation s, or a sample proportion p̄. The value of the sample statistic is used to estimate the value of the corresponding population parameter. Computing a sample mean, sample standard deviation, or sample proportion from sample data is called point estimation
Point estimator – The sample statistic, such as x̄, s, or p̄, that provides the point estimate of the population parameter
Point estimate – The value of a point estimator used in a particular instance as an estimate of a population parameter
Target population – The population for which statistical inferences such as point estimates are made. It is important for the target population to correspond as closely as possible to the sampled population
Random variable – A quantity whose values are not known with certainty
Sampling distribution – A probability distribution consisting of all possible values of a sample statistic
Unbiased – A property of a point estimator that is present when the expected value of the point estimator is equal to the population parameter it estimates
Standard error – The standard deviation of a point estimator
Finite population correction factor – The term √((N − n)/(N − 1)) that is used in the formulas for computing the estimated standard error for the sample mean and sample proportion whenever a finite population, rather than an infinite population, is being sampled.
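The standard error of the sample mean, with and without the finite population correction, can be sketched as follows (the sample values n = 100, N = 500, s = 10 are made-up illustrations):

```python
import math

def standard_error_mean(s, n, N=None):
    # Estimated standard error of the sample mean, s / sqrt(n); when a finite
    # population size N is supplied, the finite population correction factor
    # sqrt((N - n) / (N - 1)) is applied.
    se = s / math.sqrt(n)
    if N is not None:
        se *= math.sqrt((N - n) / (N - 1))
    return se

# Sample of n = 100 with sample standard deviation s = 10
print(standard_error_mean(10, 100))       # 1.0, treating the population as infinite
print(standard_error_mean(10, 100, 500))  # smaller, since n/N = 0.2 is not negligible
```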
The generally accepted rule of thumb is to ignore the finite population correction factor whenever n/N < 0.05
Sampling error – The difference between the value of a sample statistic (such as the sample mean, sample standard deviation, or sample proportion) and the value of the corresponding population parameter (population mean, population standard deviation, or population proportion) that occurs because a random sample is used to estimate the population parameter
Interval estimation – The process of using sample data to calculate a range of values that is believed to include the unknown value of a population parameter
Interval estimate – An estimate of a population parameter that provides an interval believed to contain the value of the parameter. For the interval estimates in this chapter, it has the form: point estimate ± margin of error
Margin of error – The ± value added to and subtracted from a point estimate in order to develop an interval estimate of a population parameter
t distribution – A family of probability distributions that can be used to develop an interval estimate of a population mean whenever the population standard deviation σ is unknown and is estimated by the sample standard deviation s
Degrees of freedom – A parameter of the t distribution. When the t distribution is used in the computation of an interval estimate of a population mean, the appropriate t distribution has n − 1 degrees of freedom, where n is the size of the sample
Standard normal distribution – A normal distribution with a mean of zero and a standard deviation of one
Confidence level – The confidence associated with an interval estimate. For example, if an interval estimation procedure provides intervals such that 95% of the intervals formed using the procedure will include the population parameter, the interval estimate is said to be constructed at the 95% confidence level
Confidence coefficient – The confidence level expressed as a decimal value. For example, 0.95 is the confidence coefficient for a 95% confidence level
Confidence interval – Another name for an interval estimate
Level of significance – The probability that the interval estimation procedure will generate an interval that does not contain the value of the parameter being estimated; also, the probability of making a Type I error when the null hypothesis is true as an equality
Null hypothesis – The hypothesis tentatively assumed to be true in the hypothesis testing procedure
Alternative hypothesis – The hypothesis concluded to be true if the null hypothesis is rejected
Type I error – The error of rejecting H0 when it is true
Type II error – The error of accepting H0 when it is false
One-tailed test – A hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in one tail of its sampling distribution
Two-tailed test – A hypothesis test in which rejection of the null hypothesis occurs for values of the test statistic in either tail of its sampling distribution
Test statistic – A statistic whose value helps determine whether a null hypothesis should be rejected
p value – The probability, assuming that H0 is true, of obtaining a random sample of size n that results in a test statistic at least as extreme as the one observed in the current sample. For a lower-tail test, the p value is the probability of obtaining a value for the test statistic as small as or smaller than that provided by the sample. For an upper-tail test, the p value is the probability of obtaining a value for the test statistic as large as or larger than that provided by the sample. For a two-tailed test, the p value is the probability of obtaining a value for the test statistic at least as unlikely as or more unlikely than that provided by the sample
Nonsampling error – Any difference between the value of a sample statistic (such as the sample mean, sample standard deviation, or sample proportion) and the value of the corresponding population parameter (population mean, population standard deviation, or population proportion) that is not the result of sampling error. These include, but are not limited to, coverage error, nonresponse error, measurement error, interview error, and processing error
Coverage error – Nonsampling error that results when the research objective and the population from which the sample is to be drawn are not aligned
Nonresponse error – Nonsampling error that results when some segments of the population are more likely or less likely to respond to the survey mechanism
Measurement error – Nonsampling error that results from an incorrect measurement of the characteristic of interest
Big data – Any set of data that is too large or too complex to be handled by standard data processing techniques and typical desktop software
Volume – The amount of data generated
Variety – The diversity in types and structure of data generated
Veracity – The reliability of the data generated
Velocity – The speed at which the data are generated
Tall data – A data set that has so many observations that traditional statistical inferences have little meaning
Wide data – A data set that has so many variables that simultaneous consideration of all variables is infeasible
Practical significance – The real-world impact the result of statistical inference will have on business decisions
Central limit theorem – A theorem stating that when enough independent random variables are added, the resulting sum is a normally distributed random variable.
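The central limit theorem can be illustrated with a small simulation using only the standard library (the sample size of 30 and the 2,000 repetitions are arbitrary illustration choices):

```python
import random
import statistics

random.seed(42)

# Draw many samples from a decidedly non-normal (uniform) population and
# record each sample's mean; by the central limit theorem the sample means
# cluster symmetrically around the population mean of 0.5.
sample_means = [
    statistics.mean(random.random() for _ in range(30))  # n = 30 per sample
    for _ in range(2000)
]

center = statistics.mean(sample_means)
spread = statistics.stdev(sample_means)
print(round(center, 3))  # close to 0.5, the population mean
print(round(spread, 3))  # close to sigma/sqrt(n) = (1/sqrt(12))/sqrt(30)
```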
This result allows one to use the normal probability distribution to approximate the sampling distributions of the sample mean and sample proportion for sufficiently large sample sizes
Hypothesis testing – The process of making a conjecture about the value of a population parameter, collecting sample data that can be used to assess this conjecture, measuring the strength of the evidence against the conjecture that is provided by the sample, and using these results to draw a conclusion about the conjecture
Chapter 7
Regression analysis – A statistical procedure used to develop an equation showing how the variables are related
Dependent variable – The variable that is being predicted or explained. It is denoted by y and is often referred to as the response
Independent variables – The variable(s) used for predicting or explaining values of the dependent variable. They are denoted by x and are often referred to as predictor variables
Multiple linear regression – Regression analysis involving one dependent variable and more than one independent variable
Estimated regression equation – The estimate of the regression equation developed from sample data by using the least squares method. The estimated multiple linear regression equation is ŷ = b0 + b1x1 + b2x2 + ⋯ + bqxq, where ŷ is the estimate of the mean value of y corresponding to given values of the independent variables, b0 is the estimated y-intercept, and b1, …, bq are the estimated slopes
Point estimator – A single value used as an estimate of the corresponding population parameter
Least squares method – A procedure for using sample data to find the estimated regression equation by determining the values of b0, b1, …, bq that minimize the sum of squared residuals
Interpretation of b0 and b1 – The slope b1 is the estimated change in the mean of the dependent variable y that is associated with a one-unit increase in the independent variable x.
The y-intercept b0 is the estimated value of the dependent variable y when the independent variable x is equal to 0
Residual – The difference between the observed value of the dependent variable and the value predicted using the estimated regression equation; for the ith observation, the ith residual is yi − ŷi
Experimental region – The range of values for the independent variables x1, x2, …, xq for the data that are used to estimate the regression model
Extrapolation – Prediction of the mean value of the dependent variable y for values of the independent variables x1, x2, …, xq that are outside the experimental region
Coefficient of determination – A measure of the goodness of fit of the estimated regression equation. It can be interpreted as the proportion of the variability in the dependent variable y that is explained by the estimated regression equation
Statistical inference – The process of making estimates and drawing conclusions about one or more characteristics of a population (the value of one or more parameters) through analysis of sample data drawn from the population
Hypothesis testing – The process of making a conjecture about the value of a population parameter, collecting sample data that can be used to assess this conjecture, measuring the strength of the evidence against the conjecture that is provided by the sample, and using these results to draw a conclusion about the conjecture
Interval estimation – The use of sample data to calculate a range of values that is believed to include the unknown value of a population parameter
t test – Statistical test based on the Student's t probability distribution that can be used to test the hypothesis that a regression parameter βj is zero; if this hypothesis is rejected, we conclude that there is a regression relationship between the jth independent variable and the dependent variable
Confidence interval – An estimate of a population parameter that provides an interval believed to contain the value of the parameter at some level of confidence
Confidence level – An indication of how frequently interval estimates based on samples of the same size taken from the same population using identical sampling techniques will contain the true value of the parameter we are estimating
Multicollinearity – The degree of correlation among independent variables in a regression model
Dummy variable – A variable used to model the effect of categorical independent variables in a regression model; generally takes only the value zero or one
Quadratic regression model – Regression model in which a nonlinear relationship between the independent and dependent variables is fit by including the independent variable and the square of the independent variable in the model; also referred to as a second-order polynomial model
Piecewise linear regression model – Regression model in which one linear relationship between the independent and dependent variables is fit for values of the independent variable below a prespecified value of the independent variable, a different linear relationship between the independent and dependent variables is fit for values of the independent variable above the prespecified value, and the two regressions have the same estimated value of the dependent variable (i.e., are joined) at the prespecified value of the independent variable
Knot – A prespecified value of the independent variable at which its relationship with the dependent variable changes in a piecewise linear regression model; also called the breakpoint or the joint
Interaction – Regression modeling technique used when the relationship between the dependent variable and one independent variable is different at different values of a second independent variable
Backward elimination – An iterative variable selection procedure that starts with a model with all independent variables and considers removing an independent variable at each step
Forward selection – An iterative variable selection procedure that starts with a model with no variables and considers adding an independent variable at each step
Stepwise selection – An iterative variable selection procedure that considers adding an independent variable and removing an independent variable at each step
Best subsets – A variable selection procedure that constructs and compares all possible models up to a specified number of independent variables
Overfitting – Fitting a model too closely to sample data, resulting in a model that does not accurately reflect the population
Cross-validation – Assessment of the performance of a model on data other than the data that were used to generate the model
Holdout method – Method of cross-validation in which sample data are randomly divided into mutually exclusive and collectively exhaustive sets; one set is then used to build the candidate models and the other set is used to compare model performances and ultimately select a model
Training set – The data set used to build the candidate models
Validation set – The data set used to compare model forecasts and ultimately pick a model for predicting values of the dependent variable
Prediction interval – An interval estimate of the prediction of an individual y value given values of the independent variables
Linear regression – Regression analysis in which relationships between the independent variables and the dependent variable are approximated by a straight line
p value – The probability that a random sample of the same size collected from the same population using the same procedure will yield stronger evidence against a hypothesis than the evidence in the sample data, given that the hypothesis is actually true
Parameter – A measurable factor that defines a characteristic of a population, process, or system
Random variable – A quantity whose values are not known with certainty
Regression model – The equation that describes how the dependent variable y is related to an independent variable x and an error term; the multiple linear regression model is y = β0 + β1x1 + β2x2 + ⋯ + βqxq + ε
Simple linear regression – Regression analysis involving one dependent variable and one independent variable
Chapter 8
Forecast – A prediction of future values of a time series
Time series – A set of observations on a variable measured at successive points in time or over successive periods of time
Stationary time series – A time series whose statistical properties are independent of time
Trend – The long-run shift or movement in the time series observable over several periods of time. A trend is usually the result of long-term factors such as:
• Population increases or decreases.
• Shifting demographic characteristics of the population.
• Improving technology.
• Changes in the competitive landscape.
• Changes in consumer preferences.
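A linear trend can be estimated with the least squares method from Chapter 7, using the standard formulas b1 = Σ(xi − x̄)(yi − ȳ) / Σ(xi − x̄)² and b0 = ȳ − b1x̄; a minimal sketch on made-up sales data:

```python
def least_squares(x, y):
    # b1 = sum((xi - xbar)(yi - ybar)) / sum((xi - xbar)^2); b0 = ybar - b1*xbar
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b1 = (sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
          / sum((xi - xbar) ** 2 for xi in x))
    b0 = ybar - b1 * xbar
    return b0, b1

# Hypothetical sales figures with an exact upward trend of 2 units per period
periods = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [12.0, 14.0, 16.0, 18.0, 20.0]
b0, b1 = least_squares(periods, sales)
print(b0, b1)  # 10.0 2.0
```

Because the example data lie exactly on a line, the fit recovers the intercept and slope exactly; real time series would also carry seasonal, cyclical, and irregular components around the trend.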
Seasonal pattern – The component of the time series that shows a periodic pattern over one year or less
Cyclical pattern – The component of the time series that results in periodic above-trend and below-trend behavior of the time series lasting more than one year
Naïve forecasting method – A forecasting technique that uses the value of the time series from the most recent period as the forecast for the next period
Forecast error – The amount by which the forecasted value ŷt differs from the observed value yt, denoted by et = yt − ŷt
Mean absolute error (MAE) – A measure of forecasting accuracy; the average of the absolute values of the forecast errors. Also referred to as mean absolute deviation (MAD)
Mean squared error (MSE) – A measure of the accuracy of a forecasting method; the average of the squared differences between the forecast values and the actual time series values
Mean absolute percentage error (MAPE) – A measure of the accuracy of a forecasting method; the average of the absolute values of the forecast errors as a percentage of the corresponding actual time series values
Moving average method – A method of forecasting or smoothing a time series that uses the average of the most recent n data values in the time series as the forecast for the next period
Exponential smoothing – A forecasting technique that uses a weighted average of past time series values as the forecast
Smoothing constant – A parameter of the exponential smoothing model that provides the weight given to the most recent time series value in the calculation of the forecast value
Autoregressive model – A regression model in which a regression relationship based on past time series values is used to predict future time series values
Causal models – Forecasting methods that relate a time series to other variables that are believed to explain or cause its behavior
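The forecast-accuracy measures above can be sketched with naïve forecasts, where each period's forecast is the previous period's actual value (the series values are made up for illustration):

```python
series = [110, 115, 125, 120, 125, 120]
forecasts = series[:-1]        # naive method: forecast for period t is y_(t-1)
actuals = series[1:]
errors = [y - f for y, f in zip(actuals, forecasts)]  # e_t = y_t - yhat_t

mae = sum(abs(e) for e in errors) / len(errors)
mse = sum(e ** 2 for e in errors) / len(errors)
mape = 100 * sum(abs(e) / y for e, y in zip(errors, actuals)) / len(errors)

print(mae)             # 6.0
print(mse)             # 40.0
print(round(mape, 2))  # 4.94
```

Note how MSE penalizes the single large error (the jump from 115 to 125) much more heavily than MAE does; that difference often drives the choice between the two measures.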
Chapter 9
Observation (record) – A set of observed values of variables associated with a single entity, often displayed as a row in a spreadsheet or database
Variable (feature) – A characteristic or quantity of interest that can take on different values
Features – A set of input variables used to predict an observation's outcome class or continuous outcome value
Supervised learning – Category of data mining techniques in which an algorithm learns how to classify or estimate an outcome variable of interest
Estimation – A predictive data mining task requiring the prediction of an observation's continuous outcome value
Classification – A predictive data mining task requiring the prediction of an observation's outcome class or category
Overfitting – A situation in which a model explains random patterns in the data on which it is trained rather than just the generalized relationships, resulting in a model with training set performance that greatly exceeds its performance on new data
Training set – Data used to build candidate predictive models
Validation set – Data used to evaluate candidate predictive models
Test set – Data used to compute an unbiased estimate of the final predictive model's performance
k-fold cross-validation – A robust procedure to train and validate models in which the observations to be used to train and validate the model are repeatedly randomly divided into k subsets called folds. In each iteration, one fold is designated as the validation set and the remaining k − 1 folds are designated as the training set.
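The fold-splitting step of k-fold cross-validation can be sketched without any modeling library; each index appears in exactly one validation fold:

```python
import random

def kfold_indices(n_obs, k, seed=0):
    # Randomly divide observation indices into k folds; each fold serves once
    # as the validation set while the remaining k-1 folds form the training set.
    idx = list(range(n_obs))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        training = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield training, validation

# 10 observations, k = 5: each iteration trains on 8 and validates on 2
for train, val in kfold_indices(10, 5):
    print(len(train), len(val))  # 8 2
```

A model would be fit on each training list and scored on the matching validation list; setting k equal to the number of observations gives leave-one-out cross-validation as a special case.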
The results of the iterations are then combined and evaluated
Leave-one-out cross-validation – A special case of k-fold cross-validation in which the number of folds equals the number of observations in the combined training and validation data
Undersampling – A technique that balances the number of Class 1 and Class 0 observations in a training set by removing majority class observations from the training set
Oversampling – A technique that balances the number of Class 1 and Class 0 observations in a training set by inserting copies of minority class observations into the training set
Confusion matrix – A matrix showing the counts of actual versus predicted class values
Overall error rate – The percentage of observations misclassified by a model in a data set
Accuracy – Measure of classification success defined as 1 minus the overall error rate
False positive – The misclassification of a Class 0 observation as Class 1
False negative – The misclassification of a Class 1 observation as Class 0
Class 0 error rate – The percentage of actual Class 0 observations misclassified by a model in a data set
Class 1 error rate – The percentage of actual Class 1 observations misclassified by a model in a data set
Cutoff value – The smallest value that the predicted probability of an observation can be for the observation to be classified as Class 1
Cumulative lift chart – A chart used to present how well a model performs in identifying observations most likely to be in Class 1 as compared with random classification
Decile-wise lift chart – A chart used to present how well a model performs at identifying observations for each of the top k deciles most likely to be in Class 1 versus a random classification
Sensitivity (recall) – The percentage of actual Class 1 observations correctly identified
Specificity – The percentage of actual Class 0 observations correctly identified
Precision – The percentage of observations predicted to be Class 1 that actually are Class 1
F1 score – A measure combining precision and sensitivity into a single metric
Receiver operating characteristic (ROC) curve – A chart used to illustrate the tradeoff between a model's ability to identify Class 1 observations and its Class 0 error rate
Area under the ROC curve (AUC) – A measure of a classification method's performance; an AUC of 0.5 implies that a method is no better than random classification, while a perfect classifier has an AUC of 1.0
Average error – The average difference between the actual values and the predicted values of observations in a data set; used to detect prediction bias
Root mean squared error (RMSE) – A performance measure of an estimation method defined as the square root of the average of the squared deviations between the actual values and predicted values of observations
Bias – The tendency of a predictive model to overestimate or underestimate the value of a continuous outcome
Logistic regression – A generalization of linear regression that predicts a categorical outcome variable by computing the log odds of the outcome as a linear function of the input variables
Mallows' Cp statistic – A measure for which small values approximately equal to the number of coefficients suggest promising logistic regression models
k-nearest neighbors (k-NN) – A data mining method that predicts (classifies or estimates) an observation i's outcome value based on the k observations most similar to observation i with respect to the input variables
Impurity – Measure of the heterogeneity of observations in a classification or regression tree
Classification tree – A tree that classifies a categorical outcome variable by splitting observations into groups via a sequence of hierarchical rules on the input variables
Regression tree – A tree that predicts values of a continuous outcome variable by splitting observations into groups via a sequence of hierarchical rules on the input variables
Ensemble method – A predictive data mining approach in which a committee of individual classification or estimation models is generated and a prediction is made by combining these individual predictions
Unstable – When small changes in the training set cause a model's predictions to fluctuate substantially
Bagging – An ensemble method that generates a committee of models based on different random samples and makes predictions based on the average prediction of the set of models
Out-of-bag estimation – A method of estimating the predictive performance of a bagging ensemble of m models (without a separate validation set) by leveraging the fact that the training of each model is based on only approximately 63.2% of the original observations (due to sampling with replacement)
Boosting – An ensemble method that iteratively samples from the original training data to generate individual models that target observations that were mispredicted in previously generated models, and then bases the ensemble prediction on the weighted average of the predictions of the individual models, where the weights are proportional to the individual models' accuracy
Random forests – A variant of the bagging ensemble method that generates a committee of classification or regression trees based on different random samples but restricts each individual tree to a limited number of randomly selected features (variables)
Chapter 10
What-if model – A model designed to study the impact of changes in model inputs on model outputs
Make-versus-buy decision – A decision often faced by companies that have to decide whether they should manufacture a product or outsource its production to another firm
Influence diagram – A visual representation that shows which entities influence others in a model
Decision variable – A model input the decision maker can control
Parameters – In a what-if model, the uncontrollable model inputs
Data table – An Excel tool that quantifies the impact of changing the value of a specific input on an output of interest
One-way data table – An Excel Data Table that summarizes a single input's impact on the output of interest
Two-way data table – An Excel Data Table that summarizes two inputs' impact on the output of interest
Goal Seek – An Excel tool that allows the user to determine the value for an input cell that will cause the value of a related output cell to equal some specified value, called the goal
Scenario Manager – An Excel tool that quantifies the impact of changing multiple inputs on one or more outputs of interest
Trace Precedents button – After selecting a cell, this button creates arrows pointing to the selected cell from cells that are part of the formula in that cell
Trace Dependents button – Shows arrows pointing from the selected cell to cells that depend on the selected cell
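A what-if model, a one-way data table, and a Goal Seek calculation can all be sketched outside Excel; the make-versus-buy figures below (fixed cost, unit costs) are hypothetical illustrations, not values from the text:

```python
# Hypothetical make-versus-buy what-if model
FIXED_COST = 50_000      # one-time cost to set up in-house manufacturing
UNIT_COST_MAKE = 12.0    # variable cost per unit if made in-house
UNIT_COST_BUY = 20.0     # price per unit if outsourced

def total_cost_make(quantity):
    return FIXED_COST + UNIT_COST_MAKE * quantity

def total_cost_buy(quantity):
    return UNIT_COST_BUY * quantity

# One-way data table analogue: vary the single input (quantity) and
# tabulate both outputs of interest
for q in range(2000, 12001, 2000):
    print(q, total_cost_make(q), total_cost_buy(q))

# Goal Seek analogue: find the breakeven quantity where the two total
# costs are equal, FIXED_COST / (UNIT_COST_BUY - UNIT_COST_MAKE)
breakeven = FIXED_COST / (UNIT_COST_BUY - UNIT_COST_MAKE)
print(breakeven)  # 6250.0
```

Here quantity plays the role of the decision variable, while the costs are parameters; below the breakeven quantity buying is cheaper, and above it making in-house is cheaper.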