Uploaded by ybkahveci

dats501 2021 spring week11 exam1

12/29/21, 6:13 PM
dats501_2021_spring_week11_exam1
40 questions
1. Forecasting the state of the weather for tomorrow as rainy or notRainy is part of....
A. Predictive Analytics
B. Prescriptive Analytics
C. Diagnostic Analytics
D. Descriptive Analytics
2. Which term best describes the data of an insurance company that uses a single powerful server with 512 gigabytes of RAM and 48 cores for its analytics operations?
A. Medium data
B. Small data
C. Big data
D. Extreme data
3. Complete the conversation between two students (Aang and Katara):
A: I used to think correlation implied causation. Then I took a statistics class. Now I don't.
K: Sounds like the class helped
A: ....
A. Not really, I just changed my mind
B. Definitely, it helped
C. Yeah, the professor is convincing
D. Well, maybe
4. What are the mean (rounded to the nearest integer) and the mode of the following set of numbers? { 4, 9, 8, 8, 2, 16, 4, 4, 8, 9, 6, 8 }
A. mean, mode: 7, 8
B. mean, mode: 7, 4
C. mean, mode: 8, 9
D. mean, mode: 6, 8
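The mean and mode in question 4 can be checked with a minimal sketch using Python's standard `statistics` module. The variable names are illustrative only, and the rounding to an integer reflects how the options are written rather than anything stated in the question.

```python
from statistics import mean, mode

values = [4, 9, 8, 8, 2, 16, 4, 4, 8, 9, 6, 8]

avg = mean(values)          # arithmetic mean: sum of values / number of values
most_common = mode(values)  # the value that occurs most often

print(f"sum={sum(values)}, n={len(values)}, mean={avg:.2f}, mode={most_common}")
# prints: sum=86, n=12, mean=7.17, mode=8
```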
5. Which statements are correct?
i. Q1 - 1.5 * IQR is close to -2.7 sigma for normally distributed, standardized data
ii. Box-plot method outlier detection boundaries cannot be flexed
iii. In general practice of box plot, outliers are found beyond 2 IQR distances from quartiles
iv. IQR includes 2/3 of the data
A. ii, iii
B. only i
C. i, iv
D. only ii
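Statements i and iv in question 5 can be examined numerically. The sketch below assumes a standard normal distribution (mean 0, sigma 1) and uses `statistics.NormalDist` from the standard library. The 1.5 multiplier is the conventional box-plot choice, not a fixed rule.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

q1 = std_normal.inv_cdf(0.25)  # first quartile
q3 = std_normal.inv_cdf(0.75)  # third quartile
iqr = q3 - q1                  # interquartile range

# Conventional box-plot fences (the multiplier 1.5 can be changed)
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr

# Share of the distribution that lies between Q1 and Q3
share_inside_iqr = std_normal.cdf(q3) - std_normal.cdf(q1)

print(f"lower fence = {lower_fence:.3f} sigma")
print(f"share inside IQR = {share_inside_iqr:.2f}")
```

Under these assumptions the lower fence lands near -2.7 sigma, and the interquartile range covers half of the data by construction.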
6. Which are correct?
i. Pie charts may deceive the human eye regarding the position of the slice
ii. Bubble charts are similar to the ones used to show the number of pandemic (COVID-19) patients in countries
iii. Scatterplot is used to present lines
iv. GIS (geographic information systems) is not useful to municipalities and the general directorate of highways
A. i, ii
B. only ii
C. iii, iv
D. i, ii, iv
https://app.quizalize.com/quiz/preview/Q29udGVudDo5ZjkzZDU1My0wMDczLTQ2OWYtOGZmYi05Y2ZiNzRkZjFjNWQ=
7. Order the options below by how well they explain bimodal data for the heights of a group of animals.
i. The same species in a region in 2 herds of different sizes
ii. 2 herds each with two different animal species
iii. 1 herd with the same species with two different genders without sexual dimorphism
iv. 2 herds of the same species from different geographies with environmental variation
A. iii, i, ii, iv
B. ii, iv, i, iii
C. ii, iv, iii, i
D. ii, iv, i, iii
8. An investment company asks one of its portfolio managers to find optimal solutions for two of its customers. Each customer gives $100 M to the investment manager. There are 3 portfolios to invest in, with a bankruptcy risk, meaning the risk of losing all the money.
Customer A is a risk-taker who sees bankruptcy only as a financial loss and penalizes it with the expected loss.
Customer B is risk-averse and penalizes bankruptcy with the square of the financial loss (i.e. losing 3 million feels like losing 9 million).
Note: The penalty is calculated from the invested amount and the probability of bankruptcy.
Portfolio 1 (Low Risk): 15 M return, 0 % bankruptcy chance
Portfolio 2 (Mid Risk): 25 M return, 2 % bankruptcy chance
Portfolio 3 (High Risk): 35 M return, 5 % bankruptcy chance
Calculate the utilities of the portfolios for each customer. Then choose the optimal portfolio for each of them.
A. "Customer A: Portfolio 3" - "Customer B: Portfolio 2"
B. "Customer A: Portfolio 2" - "Customer B: Portfolio 1"
C. "Customer A: Portfolio 1" - "Customer B: Portfolio 3"
D. "Customer A: Portfolio 2" - "Customer B: Portfolio 3"
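One possible reading of question 8 is sketched below. It assumes, and this is an interpretation rather than something the question states outright, that utility is the return minus a penalty, where the expected loss is the invested amount times the bankruptcy probability (in millions), Customer A's penalty is that expected loss, and Customer B's penalty is its square. The function names are hypothetical.

```python
INVESTED = 100.0  # million dollars per customer (from the question)

# name -> (return in millions, probability of bankruptcy)
portfolios = {
    "Portfolio 1": (15.0, 0.00),
    "Portfolio 2": (25.0, 0.02),
    "Portfolio 3": (35.0, 0.05),
}

def expected_loss(p_bankrupt):
    """Expected loss in millions: invested amount times bankruptcy probability."""
    return INVESTED * p_bankrupt

def utility_a(ret, p_bankrupt):
    """Assumed utility for Customer A: return minus expected loss."""
    return ret - expected_loss(p_bankrupt)

def utility_b(ret, p_bankrupt):
    """Assumed utility for Customer B: return minus squared expected loss."""
    return ret - expected_loss(p_bankrupt) ** 2

for name, (ret, p) in portfolios.items():
    print(f"{name}: A={utility_a(ret, p):.1f}, B={utility_b(ret, p):.1f}")

best_a = max(portfolios, key=lambda n: utility_a(*portfolios[n]))
best_b = max(portfolios, key=lambda n: utility_b(*portfolios[n]))
print("Customer A:", best_a, "| Customer B:", best_b)
```

A different penalty definition (for example, squaring the full 100 M loss before weighting by probability) would give different utilities, so the result depends on this assumption.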
9. Which marketing strategies are strongly related to the endowment effect?
i. Giving free Red Bulls to university campuses
ii. Not accepting online t-shirt returns without a valid reason
iii. Offering a deal for two pairs of socks for the price of one
iv. Halving the price of the internet for the newcomers for a limited amount of time
A. iii, iv
B. i, ii
C. ii, iii
D. i, iv
10. Which are true regarding the framing effect?
i. We prefer less certain outcomes when information is framed in negative language
ii. When people have to choose between an option framed in terms of a gain and an option framed in terms of
a loss, most people choose the option framed in terms of loss
iii. Tversky & Kahneman claim that when the frame is positive, people are more likely to take risks
iv. When something is framed in a positive way, people are more likely to go for the safest option
A. i, ii, iv
B. ii, iv
C. only iii
D. i, iii
11. In the movie Moneyball, Brad Pitt's character is the director of a baseball team who tries to compete with rich teams by using low-cost but statistically effective players and strategies. A Yale economics graduate helps him with this cause. The value of a player is scored with a formula similar to "Score = (Hits + Walks - Caught Stealing) * (Total Bases + 0.7 * Stolen Bases) / (At Bats + Walks + Caught Stealing)". What would be the best description of this scoring process if the scoring is performed for the end of the next season, using the currently unrealized statistics from today to the end of the next season?
A. Prescriptive analytics followed by predictive and descriptive analytics
B. Simulated solution with the important player features based on values predicted with unsupervised learning
C. Predicting the scores regarding the next season followed by utility optimization for the team state at the end of the next season
D. Descriptive value assignment based on the utilities provided by each important feature
12. Toss a coin 3 times. Let A = 'at least 2 tails' and B = 'second toss is heads'. What are P(A|B) and P(B|A)?
A. 1/3, 1/4
B. 1/5, 1/3
C. 1/4, 1/4
D. 1/4, 1/3
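The conditional probabilities in question 12 can be verified by enumerating all eight equally likely outcomes. This is a minimal sketch assuming a fair coin. The helper name `conditional` is illustrative.

```python
from fractions import Fraction
from itertools import product

# All 2**3 equally likely outcomes, e.g. ('H', 'T', 'T')
outcomes = list(product("HT", repeat=3))

A = {o for o in outcomes if o.count("T") >= 2}  # at least 2 tails
B = {o for o in outcomes if o[1] == "H"}        # second toss is heads

def conditional(event, given):
    """P(event | given) for equally likely outcomes."""
    return Fraction(len(event & given), len(given))

p_a_given_b = conditional(A, B)
p_b_given_a = conditional(B, A)
print(p_a_given_b, p_b_given_a)  # prints: 1/4 1/4
```

Only the outcome T, H, T belongs to both events, while A and B each contain four outcomes.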
13. X is a random variable that takes the values { 3, 5, 6, 8 }, and its cdf at these values is { 0.25, 0.45, 0.70, 1.00 } respectively.
What are P( X <= 5 ) and P( X = 7 )?
A. 0.70, 0.00
B. 0.45, 0.30
C. 0.70, 0.30
D. 0.45, 0.00
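For question 13, the pmf can be recovered from the cdf by taking successive differences. The sketch below assumes X takes only the four listed values. The rounding step is only there to suppress floating-point noise.

```python
values = [3, 5, 6, 8]
cdf = [0.25, 0.45, 0.70, 1.00]

# pmf at each value = jump of the cdf at that value
pmf = {}
previous = 0.0
for x, c in zip(values, cdf):
    pmf[x] = round(c - previous, 10)
    previous = c

def cdf_at(t):
    """P(X <= t), summed from the recovered pmf."""
    return sum(p for x, p in pmf.items() if x <= t)

p_le_5 = cdf_at(5)
p_eq_7 = pmf.get(7, 0.0)  # 7 is not a possible value of X

print(pmf)
print(p_le_5, p_eq_7)
```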
14. We have 2 dice:
- A 5-sided die with side values { 2, 4, 8, 16, 16 }. Each side's probability is inversely proportional to its face value
- A regular 6-sided die
What is the expected total after each die is thrown once?
A. 9
B. 8.5
C. 7.5
D. 8
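The expected total in question 14 is the sum of the two individual expectations. The sketch below normalizes the inverse face values into probabilities and uses exact fractions. It assumes the regular die is fair.

```python
from fractions import Fraction

faces = [2, 4, 8, 16, 16]

# Weights inversely proportional to the face value, then normalized
weights = [Fraction(1, f) for f in faces]
total_weight = sum(weights)
probs = [w / total_weight for w in weights]

e_five_sided = sum(f * p for f, p in zip(faces, probs))
e_six_sided = Fraction(sum(range(1, 7)), 6)  # fair die: 21/6

expected_total = e_five_sided + e_six_sided
print(e_five_sided, e_six_sided, expected_total)  # prints: 5 7/2 17/2
```

The weights 1/2, 1/4, 1/8, 1/16, 1/16 already sum to 1, so each face contributes exactly 1 to the expectation of the 5-sided die.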
15. A PMF (probability mass function) is commonly used when there are a small number of unique, discrete values.
The pmf of a discrete random variable X is given as follows: { x; P( X = x ) } = { ( -5, -1, 1, 4 ); ( 0.3, 0.25, 0.05, 0.4 ) }. Compute E(X).
A. -0.1
B. -0.4
C. 0.3
D. 0.0
16. X is a random variable with the following values and pmf, respectively: { 3, 4, 5 } and { 0.2, 0.6, 0.2 }. Compute the variance.
A. 0.2
B. 0.6
C. 0.8
D. 0.4
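The expectation in question 15 and the variance in question 16 both follow directly from the pmf definitions E(X) = sum of x * P(X = x) and Var(X) = E[(X - E(X))^2]. A minimal sketch, with illustrative helper names:

```python
def expectation(values, probs):
    """E(X) = sum of x * P(X = x)."""
    return sum(x * p for x, p in zip(values, probs))

def variance(values, probs):
    """Var(X) = sum of (x - E(X))**2 * P(X = x)."""
    mu = expectation(values, probs)
    return sum((x - mu) ** 2 * p for x, p in zip(values, probs))

# Question 15
e_q15 = expectation([-5, -1, 1, 4], [0.3, 0.25, 0.05, 0.4])

# Question 16
mean_q16 = expectation([3, 4, 5], [0.2, 0.6, 0.2])
var_q16 = variance([3, 4, 5], [0.2, 0.6, 0.2])

print(round(e_q15, 4), round(mean_q16, 4), round(var_q16, 4))
# prints: -0.1 4.0 0.4
```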
17. Which correlation coefficient shows the weakest relationship between two variables?
A. r = + 0.187
B. r = - 0.874
C. r = - 0.193
D. r = + 0.843
18. How many of the following are assumptions of regression?
i. linearity
ii. independence of errors
iii. normality of errors
iv. equal variance
v. nominal inputs
vi. narrow variance
A. 5
B. 4
C. 3
D. 6
19. A linear regression analysis with one independent variable is called
A. Simple linear regression
B. Logistic regression
C. Lasso regression
D. Multiple linear regression
20. In a multiple linear regression model, the coefficients represent
A. the change in X for a unit change in Y
B. the change in Y for unit changes in Xs
C. the change in Y for a unit change in X
D. the change in X for unit changes in Ys
21. A fitted regression equation is given by Y-hat = 150 + 7X. What is the sum of the absolute values of the residuals at the points (X1=20, Y1=300) and (X2=30, Y2=360)?
A. -10
B. 10
C. 0
D. 20
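The residuals in question 21 are the observed values minus the fitted values. A minimal sketch follows, where the function name `predict` is illustrative:

```python
def predict(x):
    """Fitted regression line from the question: Y-hat = 150 + 7X."""
    return 150 + 7 * x

points = [(20, 300), (30, 360)]  # (X, observed Y)

# residual = observed Y - predicted Y
residuals = [y - predict(x) for x, y in points]
sum_abs_residuals = sum(abs(r) for r in residuals)

print(residuals, sum_abs_residuals)  # prints: [10, 0] 10
```

A sum of absolute values cannot be negative, which rules out any negative option without calculation.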
22. Which of the below directly shows the significance of a linear regression model?
A. f statistic
B. standard error
C. p value
D. t statistic
23. The p-value of a variable can change when another variable is included in the model. Which of the below
is always correct?
A. The newly added variable is redundant
B. The old variable overfits
C. The old variable is problematic
D. The newly added variable is not fully uncorrelated
24. When an important variable is not available, another variable can wrongly try to explain its effect. This is called omitted variable bias. How many of the below may be a reliable way to catch such a problem?
i. checking the signs of the variables
ii. checking the size of the coefficients of the variables
iii. cleaning the outliers
A. 0
B. 2
C. 3
D. 1
25. Categorical data can be expressed as numeric thanks to ....
A. using text mining
B. using the most common elements
C. cleaning anomalies
D. converting to dummy variables
26. Which statement about logistic regression is wrong?
A. For a 3-class target in multinomial classification, 3 separate logistic regression predictions are needed (3 classifiers)
B. The independent variables should be independent of each other. That is, the model should have little or no multicollinearity
C. Logistic regression is mainly used for regression
D. ROC curve and accuracy can be used as success metrics
27. The term "odds" is defined as the ratio of the tested outcome to all other outcomes. Example: For a standard six-sided die, the odds of rolling the side "3" are 1/5.
The probability of a particular customer paying back his loan is 0.50. What are the odds of default (not paying back)?
A. 0.25
B. 2
C. 1
D. 0.5
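The definition of odds in question 27 corresponds to p / (1 - p) for an event with probability p. A minimal sketch with exact fractions follows, where the function name `odds` is illustrative:

```python
from fractions import Fraction

def odds(p):
    """Odds of an event with probability p: p / (1 - p)."""
    return p / (1 - p)

# Example from the question: rolling a "3" with a fair six-sided die
odds_three = odds(Fraction(1, 6))

# Probability of default = 1 - probability of paying back
p_default = 1 - Fraction(1, 2)
odds_default = odds(p_default)

print(odds_three, odds_default)  # prints: 1/5 1
```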
28. The correlation between
- Income and obesity is about -0.01
- Fast-food consumption and obesity is about 0.17
Which of the following statements are possible given the above information?
i. There could be a strong nonlinear relationship between income and obesity
ii. The relationship between income and obesity is strongly linear
iii. People with higher fast-food consumption are more likely to suffer from obesity
iv. Young people consume more fast food compared to the average of non-young people
A. i, iv
B. i, iii
C. ii, iii
D. ii, iv
29. Which of the following statements are more likely for good models relative to the models that overfit?
(i). Have higher bias
(ii). Have lower bias
(iii). Have higher variance
(iv). Have lower variance
A. ii, iv
B. i, iii
C. i, iv
D. ii, iii
30. A dataset is created by using a polynomial function of degree 2. Then some noisy data points are added. We try to explain the y values by using a polynomial function of degree 4 as a model. What is expected for this model in terms of variance and bias?
A. Low bias, low variance
B. High bias, low variance
C. High bias, low variance
D. Low bias, high variance
31. Which is not correct if the regularization parameter = 0 for lasso regression?
A. Small coefficients are penalized
B. Overfitting problems are not dealt with
C. Large coefficients are not penalized
D. The loss function is the same as the ordinary least squares loss function
32. Which is not correct if the regularization parameter is very high for ridge regression?
A. May lead to the coefficients of the variables becoming zero
B. Large coefficients are significantly penalized
C. May lead to a model that is too simple and ends up underfitting the data
D. May lead to worse performance than ordinary least squares
33. Which is not correct for a decision tree?
A. Variables for splits are chosen based on potential information gain
B. It is easy to explain
C. It can split the data at a node into two or more groups
D. Its greedy split algorithm can calculate information gain for splits two depths ahead
34. Which of the below is not correct for pruning?
A. The idea is to prevent the model from learning insignificantly small micro-segments
B. It is used to reduce the bias by sampling the amount of data used
C. It prevents overfitting
D. Pruning should reduce the size of a learning tree without reducing predictive accuracy as measured by a cross-validation set
35. How does random forest (RF) differ from bagging?
a. RF uses a bootstrapping algorithm.
b. RF uses random thresholds for each feature rather than searching for the best possible thresholds.
c. RF uses a random subset of features in trees.
d. RF is a serial boosting method.
A. b
B. c
C. a
D. d
36. h_1 = 1.60; w_1 = 73;
h_2 = 1.80; w_2 = XXXX;
BMI_1 = w_1 / ( h_1 * h_1 );
BMI_2 = w_2 / ( h_2 * h_2 );
Which value of XXXX below satisfies the condition BMI_1 = BMI_2 (to the nearest integer)?
A. 86
B. 89
C. 92
D. 83
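Question 36 can be solved by rearranging BMI_1 = BMI_2 for w_2, which gives w_2 = BMI_1 * h_2 * h_2. A minimal sketch that reuses the question's variable names in lowercase form:

```python
h_1, w_1 = 1.60, 73
h_2 = 1.80

bmi_1 = w_1 / (h_1 * h_1)

# Solve w_2 / (h_2 * h_2) == bmi_1 for w_2
w_2 = bmi_1 * (h_2 * h_2)
bmi_2 = w_2 / (h_2 * h_2)

print(f"BMI_1 = {bmi_1:.2f}, w_2 = {w_2:.2f}")
# prints: BMI_1 = 28.52, w_2 = 92.39
```

None of the integer options matches exactly, so the nearest one is the intended choice.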
37. Which of the below best describes the functionality of the list "append" method?
A. Adds new elements as a single element
B. Adds an element at the specified position
C. Adds new elements
D. Returns an error if the element is not in the list
38. You want jupyter notebook to show all the data instances while displaying data. Which code will you use?
A. pd.set_option("display.max.rows", None)
B. pd.set_option('display.max_colwidth', -1)
C. pd.set_option("display.max.columns", None)
D. pd.set_option('max_info_columns', 199)
39. Which one(s) are right?
i. The Python "random" library can be used to generate random integers
ii. If a user-defined function (udf) uses print() for its output instead of a return statement, and the result of calling the udf is assigned to a variable (var_x), then var_x refers to a "NoneType" object
iii. A user-defined function may not contain multiple return statements
iv. (9 // 3) and (9 / 3) give results of the same type
A. ii, iv
B. i, ii
C. i, ii, iii
D. i, iii
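Each statement in question 39 can be tried out directly in Python 3. The function names `shout` and `sign` below are hypothetical and exist only for illustration.

```python
import random

# i. The random module can generate random integers
n = random.randint(1, 6)  # an int between 1 and 6, inclusive

# ii. A function that only prints has no return value, so it returns None
def shout(message):
    print(message)

var_x = shout("hello")
print(type(var_x))  # <class 'NoneType'>

# iii. A function can contain several return statements
def sign(v):
    if v < 0:
        return "negative"
    if v == 0:
        return "zero"
    return "positive"

# iv. Floor division of ints gives an int; true division gives a float
floor_div = 9 // 3  # 3
true_div = 9 / 3    # 3.0
print(type(floor_div), type(true_div))
```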
40. Which one(s) are correct?
i. If x is a list, then x.append([4,5]) and x.extend([4,5]) both append objects to the end of x
ii. If x is a list, then x[::-1] and x.reverse() give the same result
iii. In a user-defined function, a "return" statement acts like a "break" statement in a loop
A. i, iii
B. i
C. ii, iii
D. iii
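The list behaviours referred to in question 40 can be observed with a short sketch. The function `first_even` is a hypothetical example of a return statement ending a loop early.

```python
# append adds its argument as ONE element; extend adds each item separately
x = [1, 2, 3]
x.append([4, 5])  # x is now [1, 2, 3, [4, 5]]

y = [1, 2, 3]
y.extend([4, 5])  # y is now [1, 2, 3, 4, 5]

# Slicing returns a new reversed list; reverse() works in place and returns None
z = [1, 2, 3]
sliced = z[::-1]             # [3, 2, 1], z itself is unchanged here
reverse_result = z.reverse() # None, z is now [3, 2, 1]

# return leaves the function immediately, which also ends the running loop
def first_even(numbers):
    for number in numbers:
        if number % 2 == 0:
            return number
    return None

print(x, y)
print(sliced, reverse_result, z)
print(first_even([1, 3, 4, 6]))  # prints: 4
```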