Missing data:
Omission: delete rows that contain missing values.
Imputation (numerical): replace blanks with sensible values such as the mean or median.
Imputation (categorical): use a dummy value such as "unknown" or the predominant value.

Outliers: IQR = Q3 - Q1; lower fence = Q1 - 1.5(IQR); upper fence = Q3 + 1.5(IQR)

Q: Assuming that the mean Miles worked is 34,600 and the standard deviation of Miles worked is 15,950, what is the standardized Miles for Log ID #1? Round to the nearest hundredth.
A: (50,018 - 34,600) / 15,950 = 0.97

Euclidean distance: √((x - a)² + (y - b)²)

Imbalanced data set: a data set is imbalanced when 10% or less of the data belongs to the target class, and marginally imbalanced when 10-20% of the data belongs to the target class. When the data set is imbalanced, we need to use performance measures (like kappa) that take the imbalance into account.

Strengths/weaknesses of KNN:
+ simple to explain and implement
- requires that variables be numerical, so categorical variables must be transformed to numerical variables
- does not perform well on problems with too many variables

Strengths/weaknesses of classification trees:
+ easy to understand, interpret, and implement
+ works on both categorical and numerical data
+ does not require that numerical data be standardized
- small changes to the training data can result in drastically different trees
- requires a large data set to build a reasonably accurate model

Q: Why must the numerical variables be standardized before using KNN?
A: KNN uses similarity measures to quantify how close observations are to one another. Numerical variables can have different scales, and variables with large scales tend to dominate the similarity measure, so we standardize all numerical variables to put them on a common scale.
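The outlier-fence, standardization, and Euclidean distance formulas above can be sketched in Python (the miles values come from the worked example; the function names are mine):

```python
def iqr_fences(q1, q3):
    """Outlier fences from the 1.5 * IQR rule."""
    iqr = q3 - q1
    return q1 - 1.5 * iqr, q3 + 1.5 * iqr

def standardize(x, mean, sd):
    """z-score: number of standard deviations x lies from the mean."""
    return (x - mean) / sd

def euclidean(x, y, a, b):
    """Straight-line distance between points (x, y) and (a, b)."""
    return ((x - a) ** 2 + (y - b) ** 2) ** 0.5

# Standardized Miles for Log ID #1:
print(round(standardize(50_018, 34_600, 15_950), 2))  # 0.97
```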
Without standardization, the different scales can distort the true distance between points and lead to inaccurate results.

Relationship between the complexity parameter (cp) and tree size: a large cp gives a small tree; a small cp gives a large tree (and a large tree risks overfitting).

Overfitting occurs when an estimated model describes quirks of the data rather than the relationships between variables:
- the model becomes too complex
- it fails to describe the behavior in a new sample
- predictive power for new samples is compromised

Q: Assuming that the data used to train this model includes exactly 400 customers with gold status and there are an equal number of males and females, how many of them do not like the ambiance?
A: 200 * (1 - 0.5) + 200 * (1 - 0.7) = 160

For questions 17-18, assume that during cross-validation a cutoff value of 0.5 was selected and the test data set contains 50 observations in each leaf node.

Q: Based on the 80/20 rule, how many observations were used to train the model?
A: There are 4 leaf nodes, each comprising 50 observations, so the test data set contains 200 observations. If 200 observations is 20% of the data, then 80% of the data is 800 observations.

Q: What is the accuracy of this model? Round to the nearest hundredth.
A: Based on the predicted probabilities in the tree, the only group classified into the non-target class is customers with no loyalty status; all others are classified into the target class. Consider each leaf node. Gold-female: 50 observations classified as "Yes", and 50% are correct. Gold-male: 50 observations classified as "Yes", and 70% are correct. No status: 50 observations classified as "No", and 80% (= 1 - 0.2) are correct. Platinum status: 50 observations classified as "Yes", and 80% are correct.
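The three worked answers above can be checked with quick arithmetic (the leaf probabilities 0.5, 0.7, 0.2, and 0.8 are the ones from the tree described in the notes):

```python
# Gold-status customers who do not like the ambiance: 400 customers split
# evenly by sex, with P(likes ambiance) = 0.5 (female) and 0.7 (male).
dislike = 200 * (1 - 0.5) + 200 * (1 - 0.7)
print(dislike)  # 160.0

# 80/20 rule: 4 leaf nodes * 50 test observations = 200 = 20% of the data,
# so the training data is the remaining 80%.
test_n = 4 * 50
train_n = test_n / 0.20 - test_n
print(train_n)  # 800.0

# Accuracy: weighted average of the per-leaf correct-classification rates.
accuracy = (50 * 0.5 + 50 * 0.7 + 50 * 0.8 + 50 * 0.8) / 200
print(accuracy)  # 0.7
```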
(50 * 0.5 + 50 * 0.7 + 50 * 0.8 + 50 * 0.8) / 200 = 0.70

Misclassification rate: (FP + FN) / (TP + TN + FP + FN)
Accuracy rate: (TP + TN) / (TP + TN + FP + FN)
Sensitivity (target class): TP / (TP + FN)
Specificity: TN / (TN + FP)
Precision: TP / (TP + FP)

Johnson Developments, a commercial property owner, is trying to decide which of its properties to select for solar panel installation. Properties with solar panels installed are eligible for leases in excess of 10 years, and new tenants may choose the longer lease to reap the benefits of long-term utility cost savings and a more favorable public image. Johnson Developments hired a consulting company to develop a data mining model to help with the decision. The model predicts whether a property will be leased for 10 years or more if solar panels are installed, with "yes" being the target class. Performance measures obtained from the testing data are on the right.

Q: Using the data given, explain what positive predictive value (aka precision) means for this model.
A: The positive predictive value for this model is 0.9211. This means that about 92% of the properties the model predicts will be leased for 10+ years (once solar panels are installed) actually will be leased for 10+ years.

Q: Assume that the model recommends that Johnson Developments install solar panels on 10 of their properties. How many of those do you expect to be leased for 10 years or more? Round to the nearest whole number of properties.
A: 10 * 0.9211 = 9.2 ≈ 9 properties

Q: Analyze this scenario using the key principles of data ethics. Do you see any issues? Explain.
A: The key principles are (1) human first and (2) no biases. This scenario does not appear to have any human-first issues, though the information we have is limited; it is hard to envision that installing solar panels (or not) would go against any human interest. From a bias perspective, information is again limited. As long as the model is not amplifying unconscious bias, I would not have any concerns.
Something to look out for here, given that the model is assessing properties, is whether any socioeconomic biases are at work. For example, certain properties may appear to be better candidates for solar panels because they are in wealthier areas of town. If that is the case, the model deserves more scrutiny.

Data privacy: confidentiality, transparency, accountability

Data mining: the process of identifying hidden patterns and relationships in sets of data and then using them for valuable business insights

Business understanding: situational context, specific objectives, project schedule, deliverables
Data understanding: collecting raw data, preliminary results, potential hypotheses
Data preparation: record and variable selection, wrangling, cleaning
Modeling: selection and execution of data mining techniques; converting or transforming data to the formats/types needed for certain analyses; documenting assumptions; cross-validation
Evaluation: evaluate the performance of competing models, select the best models, review and interpret results, develop recommendations
Deployment: develop a set of actionable insights and a strategy for deployment/monitoring/feedback

Supervised learning: the target is known
Methods: regression, k-nearest neighbors, naïve Bayes, decision trees
Classification model: target is categorical; objective is to predict the class
Prediction model: target is numeric; objective is to predict the target for a new case
Business examples: classify movies as like/not-like, classify a stock as buy/hold/sell, predict sales based on past data

Unsupervised learning: the target is not known
Methods: principal components, clustering
Dimension reduction: convert high-dimensional data to lower-dimensional data
Pattern recognition: recognize patterns using machine learning
Business examples: identify unusual transactions to detect fraud, market similar products as a group
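Returning to the performance measures listed earlier: they can be written as small helper functions (a sketch; the tp/tn/fp/fn argument names are mine), along with the Johnson Developments expectation:

```python
def accuracy_rate(tp, tn, fp, fn):
    """(TP + TN) / (TP + TN + FP + FN)"""
    return (tp + tn) / (tp + tn + fp + fn)

def misclassification_rate(tp, tn, fp, fn):
    """(FP + FN) / (TP + TN + FP + FN)"""
    return (fp + fn) / (tp + tn + fp + fn)

def sensitivity(tp, fn):
    """TP / (TP + FN): how well the target class is caught."""
    return tp / (tp + fn)

def specificity(tn, fp):
    """TN / (TN + FP): how well the non-target class is caught."""
    return tn / (tn + fp)

def precision(tp, fp):
    """TP / (TP + FP): the positive predictive value."""
    return tp / (tp + fp)

# Expected 10+ year leases among 10 recommended properties,
# using the precision of 0.9211 from the Johnson Developments example:
print(round(10 * 0.9211))  # 9
```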