
Data Analytics

Missing data: omission – delete rows that contain missing values. Imputation (numerical) – replace blanks with sensible
values like the mean or median. Imputation (categorical) – replace blanks with an "unknown" dummy category or the predominant value.
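A minimal Python sketch of these three approaches, using an invented column with None marking the blanks:

```python
from collections import Counter
from statistics import mean

# Hypothetical sample columns with missing values (None)
miles = [50018, None, 28750, 41200, None, 33100]
status = ["gold", None, "platinum", "gold", "none", None]

# Omission: drop rows (here, values) that are missing
miles_kept = [m for m in miles if m is not None]

# Numerical imputation: fill blanks with the mean (median works the same way)
fill = mean(miles_kept)
miles_imputed = [m if m is not None else fill for m in miles]

# Categorical imputation: use an "unknown" dummy, or the predominant value
mode = Counter(s for s in status if s is not None).most_common(1)[0][0]
status_imputed = [s if s is not None else mode for s in status]
```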
Outliers: IQR = Q3 - Q1; lower = Q1 - 1.5(IQR); upper = Q3 + 1.5(IQR)
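The IQR fences can be computed with the standard library; the sample values here are invented, with one obvious outlier planted:

```python
from statistics import quantiles

data = [12, 15, 14, 10, 18, 22, 19, 95]  # 95 is a suspected outlier
q1, q2, q3 = quantiles(data, n=4)        # quartiles (default "exclusive" method)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower or x > upper]
```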
Assuming that the mean Miles worked is 34,600 and the standard deviation of Miles worked is 15,950, what is the
standardized Miles for Log ID #1? Round to the nearest hundredth. (50,018 – 34,600) / 15,950 = 0.97
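The same z-score calculation, as a small function:

```python
def standardize(x, mu, sigma):
    """z-score: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma

# Log ID #1 from the worked example above
z = standardize(50_018, 34_600, 15_950)
```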
Euclidean Distance: √((x - a)² + (y - b)²)
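A quick sketch of the formula; `math.dist` computes the same thing and generalizes to any number of variables:

```python
from math import dist, sqrt

# Two observations (x, y) and (a, b); the points are made up
p, q = (3, 4), (0, 0)

manual = sqrt((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2)
builtin = dist(p, q)  # same result via the standard library
```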
Imbalanced data set: An imbalanced data set is when 10% or less of the data belongs to the target class. We might also
say that when 10-20% of the data belongs to the target class, we have a marginally imbalanced data set. When the data
set is imbalanced, we need to use performance measures (like kappa) that take the imbalanced nature of the data into
account.
Strengths/Weaknesses of KNN: +simple to explain and implement –requires that variables be numerical –you have to
transform categorical variables to numerical variables –doesn’t perform well on problems with too many variables
Strengths/Weaknesses of Classification Tree: +easy to understand, interpret, and implement +used on categorical and
numerical data +does not require that numerical data are standardized –small changes to the training data result in
drastically different trees –requires a large dataset to build a reasonably accurate model
Why must the numerical variables be standardized before using KNN? KNN uses similarity measures to quantify how
close observations are to one another. Numerical variables can have different scales, and those with large scales tend
to dominate the similarity measure, so we standardize all numerical variables to put them on a common scale.
Different scales can distort the true distance between points and lead to inaccurate results.
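A small illustration of the point, with invented (miles, age) rows: before standardization the distance is driven almost entirely by miles; after z-scoring, both variables contribute comparably.

```python
from math import dist
from statistics import mean, stdev

# Hypothetical data on wildly different scales: (miles, age)
rows = [(50018, 25), (34600, 60), (41000, 31)]

def zscores(col):
    mu, sd = mean(col), stdev(col)
    return [(v - mu) / sd for v in col]

miles_z = zscores([r[0] for r in rows])
age_z = zscores([r[1] for r in rows])
scaled = list(zip(miles_z, age_z))

raw_d = dist(rows[0], rows[1])       # dominated by the miles difference
scaled_d = dist(scaled[0], scaled[1])  # both variables matter
```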
Relationship between complexity parameter and size of tree: large cp → small tree, and small cp → large tree (large
tree → risk of overfitting). Overfitting occurs when an estimated model describes quirks of the data rather than the
relationships between variables. The model becomes too complex, fails to describe the behavior in a new sample, and its predictive power for new samples is compromised.
Assuming that the data used to train this model includes exactly 400 customers with gold status and there are an equal
number of males and females, how many of them do not like the ambiance? 200 * (1 – 0.5) + 200 * (1 – 0.7) = 160
For questions 17-18, assume that during cross-validation a cutoff value of 0.5 was selected and the test data set contains
50 observations in each leaf node.
Based on the 80/20 rule, how many observations were used to train the model: There are 4 leaf nodes, each comprised
of 50 observations. Therefore, the test data set contains 200 observations. If 200 observations is 20% of the data, then
80% of the data is 800 observations.
What is the accuracy of this model? Round to the nearest hundredth. Based on the predicted probabilities in the tree,
the only group that will be classified into the non-target class is customers with no loyalty status. All others are classified
into the target class. Let’s consider each leaf node. Gold-Female: 50 observations classified as “Yes”, and 50% are
correct. Gold-Male: 50 observations classified as “Yes”, and 70% are correct. No status: 50 observations classified as
“No”, and 80% (= 1 - 0.2) are correct. Platinum status: 50 observations classified as “Yes”, and 80% are correct. (50 * 0.5
+ 50 * 0.7 + 50 * 0.8 + 50 * 0.8) / 200 = 0.7
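The same leaf-by-leaf accuracy calculation in Python, using the counts and rates from the worked example:

```python
# (observations in leaf, proportion classified correctly) for each of the
# four leaves: Gold-Female, Gold-Male, No status, Platinum
leaves = [(50, 0.5), (50, 0.7), (50, 0.8), (50, 0.8)]

correct = sum(n * p for n, p in leaves)
accuracy = correct / sum(n for n, _ in leaves)
```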
Misclassification rate: (FP+FN)/(TP+TN+FP+FN) Accuracy rate: (TP+TN)/(TP+TN+FP+FN)
Sensitivity(target class): TP/(TP+FN) Specificity: TN/(TN+FP)
Precision: TP/(TP+FP)
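These measures wrapped in a small helper. The confusion-matrix counts below are invented, chosen so that precision matches the 0.9211 used in the Johnson Developments example that follows:

```python
def confusion_metrics(tp, tn, fp, fn):
    """Performance measures from a 2x2 confusion matrix."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "misclassification": (fp + fn) / total,
        "sensitivity": tp / (tp + fn),  # recall on the target class
        "specificity": tn / (tn + fp),
        "precision": tp / (tp + fp),    # positive predictive value
    }

# Hypothetical counts: 35/38 predicted positives are correct -> 0.9211
m = confusion_metrics(tp=35, tn=105, fp=3, fn=7)
```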
Johnson Developments, a commercial property owner, is trying to decide which of their properties to select for solar
panel installation. Properties with solar panels installed are eligible for leases in excess of 10 years, and new tenants may
choose the longer lease to reap the benefits of long-term utility cost savings and a more favorable public image. Johnson
Developments hired a consulting company to develop a data mining model to help with the decision. The model predicts
whether a property will be leased for 10 years or more if solar panels are installed, with “yes” being the target class.
Performance measures obtained from the testing data are on the right.
Using the data given, explain what Positive Predictive Value (aka Precision)
means for this model. The positive predictive value for this model is 0.9211.
This means that about 92% of the properties the model predicts will be leased
for 10+ years (once solar panels are installed) actually will be leased for 10+ years.
Assume that the model recommends that Johnson Developments install solar
panels on 10 of their properties. How many of those do you expect to be
leased for 10 years or more? Round to the nearest whole number of
properties. 10 * 0.9211 = 9.211 ≈ 9 properties
Analyze this scenario using the key principles of data ethics. Do you see any
issues? Explain.
The key principles are (1) human first, and (2) no biases. This scenario does
not appear to have any human first issues, though the information we have is
limited. It is hard to envision that installing solar panels (or not) would go
against any human interest. From a bias perspective, again, information is limited. As long as the model isn’t amplifying
unconscious bias, then I would not have any concerns. Something to look out for here, given the model is assessing
properties, is whether there are any socioeconomic biases at work. For example, certain properties may appear to be
better candidates for solar panels because they are in wealthier areas of town. If this is the case, then it is something to
consider with more scrutiny. Data Privacy: confidentiality, transparency, accountability
Data mining: the process of identifying hidden patterns and relationships in sets of data and then using them for valuable
business insights Business understanding: situational context, specific objectives, project schedule, deliverables Data
understanding: collecting raw data, preliminary results, potential hypotheses Data preparation: record and variable
selection, wrangling, cleaning Modeling: selection and execution of data mining techniques, convert or transform data
to formats/types needed for certain analyses, document assumptions, cross-validation Evaluation: evaluate
performance of competing models, select best models, review and interpret results, develop recommendations
Deployment: develop a set of actionable insights and a strategy for deployment/monitoring/feedback
Supervised: target is known Methods: regression, k-Nearest Neighbors, naïve Bayes, Decision Trees Classification model:
target is categorical, objective is to predict class Prediction model: target is numeric, objective is to predict the target
for a new case Business examples: classify movies as like/not-like, classify a stock as buy/hold/sell, predict sales based
on past data
Unsupervised: target is not known Methods: principal components, clustering Dimension reduction: convert high-dimensional data to lower dimensions Pattern recognition: recognize patterns using machine learning Business examples: identify
unusual transactions to ID fraud, marketing similar products as a group