12/29/21, 6:13 PM
dats501_2021_spring_week11_exam1
40 questions
https://app.quizalize.com/quiz/preview/Q29udGVudDo5ZjkzZDU1My0wMDczLTQ2OWYtOGZmYi05Y2ZiNzRkZjFjNWQ=

1. Forecasting the state of the weather for tomorrow as rainy or notRainy is part of ....
A. Predictive Analytics
B. Prescriptive Analytics
C. Diagnostic Analytics
D. Descriptive Analytics

2. Which term best describes the data of an insurance company that runs its analytics operations on a single powerful server with 512 GB of RAM and 48 cores?
A. Medium data
B. Small data
C. Big data
D. Extreme data

3. Complete the conversation between two students (Aang and Katara):
A: I used to think correlation implied causation. Then I took a statistics class. Now I don't.
K: Sounds like the class helped.
A: ....
A. Not really, I just changed my mind
B. Definitely, it helped
C. Yeah, the professor is convincing
D. Well, maybe

4. What are the mean and mode of the following set of numbers?
{ 4, 9, 8, 8, 2, 16, 4, 4, 8, 9, 6, 8 }
A. mean, mode: 7, 8
B. mean, mode: 7, 4
C. mean, mode: 8, 9
D. mean, mode: 6, 8

5. Which statements are correct?
i. Q1 - 1.5 * IQR is close to -2.7 sigma for normally distributed, normalized data
ii. The outlier detection boundaries of the box-plot method cannot be flexed
iii. In the general practice of box plots, outliers are found beyond 2 IQR distances from the quartiles
iv. The IQR includes 2/3 of the data
A. ii, iii
B. only i
C. i, iv
D. only ii

6. Which are correct?
i. Pie charts may deceive the human eye regarding the position of the slice
ii. Bubble charts are similar to the charts used to show the number of pandemic (COVID-19) patients by country
iii. A scatterplot is used to present lines
iv. GIS (geographic information systems) is not useful to municipalities and the general directorate of highways
A. i, ii
B. only ii
C. iii, iv
D. i, ii, iv

7.
Order the options below according to their explanatory power for bimodal data on the heights of a group of animals.
i. The same species in a region in 2 herds of different sizes
ii. 2 herds, each of a different animal species
iii. 1 herd of the same species with two different genders without sexual dimorphism
iv. 2 herds of the same species from different geographies with environmental variation
A. iii, i, ii, iv
B. ii, iv, i, iii
C. ii, iv, iii, i
D. ii, iv, i, iii

8. An investment company asks one of its portfolio managers to find optimal solutions for two of its customers. Each customer gives $100 M to the investment manager. There are 3 portfolios to invest in, each with a risk of bankruptcy, meaning losing all the money.
Customer A is a risk-taker who sees bankruptcy only as a financial loss and penalizes it with the expected loss.
Customer B is risk-averse and penalizes bankruptcy with the square of the financial loss (i.e. losing 3 million feels like losing 9 million).
Note: The penalty is calculated from the invested amount and the probability of bankruptcy.
Portfolio 1 (Low Risk): 15 M return, 0 % chance of bankruptcy
Portfolio 2 (Mid Risk): 25 M return, 2 % chance of bankruptcy
Portfolio 3 (High Risk): 35 M return, 5 % chance of bankruptcy
Calculate the utilities of the portfolios for each customer. Then choose the optimal portfolio for each of them.
A. "Customer A: Portfolio 3" - "Customer B: Portfolio 2"
B. "Customer A: Portfolio 2" - "Customer B: Portfolio 1"
C. "Customer A: Portfolio 1" - "Customer B: Portfolio 3"
D. "Customer A: Portfolio 2" - "Customer B: Portfolio 3"

9. Which marketing strategies are strongly related to the endowment effect?
i. Giving out free Red Bulls on university campuses
ii. Not accepting online t-shirt returns without a valid reason
iii. Offering a deal of two pairs of socks for the price of one
iv. Halving the price of internet service for new customers for a limited time
A. iii, iv
B. i, ii
C. ii, iii
D. i, iv

10. Which are true regarding the framing effect?
i.
We prefer less certain outcomes when information is framed in negative language
ii. When people have to choose between an option framed in terms of a gain and an option framed in terms of a loss, most people choose the option framed in terms of a loss
iii. Tversky & Kahneman claim that when the frame is positive, people are more likely to take risks
iv. When something is framed in a positive way, people are more likely to go for the safest option
A. i, ii, iv
B. ii, iv
C. only iii
D. i, iii

11. In the movie Moneyball, Brad Pitt's character is the director of a baseball team who tries to compete with rich teams by using low-cost but statistically effective players and strategies. A Yale economics graduate helps him in this cause.
The value of a player is scored with a formula similar to
"Score = (Hits + Walks - Caught Stealing) * (Total Bases + 0.7 * Stolen Bases) / (At Bats + Walks + Caught Stealing)"
What would be the best description of this scoring process if the scoring is performed for the end of the next season, using the currently unrealized statistics from today to the end of the next season?
A. Prescriptive analytics followed by predictive and descriptive analytics
B. Simulated solution with the important player features based on values predicted with unsupervised learning
C. Predicting the scores regarding the next season followed by utility optimization for the team state at the end of the next season
D. Descriptive value assignment based on the utilities provided by each important feature

12. Toss a coin 3 times. Let
A = 'at least 2 tails'
B = 'second toss is heads'
What are P(A|B) and P(B|A)?
A. 1/3, 1/4
B. 1/5, 1/3
C. 1/4, 1/4
D. 1/4, 1/3

13. X is a random variable with values { 3, 5, 6, 8 } and corresponding cdf values { 0.25, 0.45, 0.70, 1.00 }.
What are P( X <= 5 ) and P( X = 7 )?
A. 0.70, 0.00
B.
0.45, 0.30
C. 0.70, 0.30
D. 0.45, 0.00

14. We have 2 dice:
- A 5-sided die with side values { 2, 4, 8, 16, 16 }. Each side's probability is inversely proportional to its face value
- A regular 6-sided die
What is the expected total after each die is thrown once?
A. 9
B. 8.5
C. 7.5
D. 8

15. A PMF (probability mass function) is commonly used when there are a small number of unique, discrete values.
The pmf of a discrete random variable X is given as follows:
{ x; P( X = x ) }
{ ( -5, -1, 1, 4 ); ( 0.3, 0.25, 0.05, 0.4 ) }
Compute E(X).
A. -0.1
B. -0.4
C. 0.3
D. 0.0

16. X is a variable with the following values and pmf, respectively: { 3, 4, 5 } & { 0.2, 0.6, 0.2 }
Compute the variance.
A. 0.2
B. 0.6
C. 0.8
D. 0.4

17. Which correlation coefficient shows the weakest relationship between two variables?
A. r = + 0.187
B. r = - 0.874
C. r = - 0.193
D. r = + 0.843

18. How many of the below are assumptions of regression?
i. linearity
ii. independence of errors
iii. normality of errors
iv. equal variance
v. nominal inputs
vi. narrow variance
A. 5
B. 4
C. 3
D. 6

19. A linear regression analysis with a single independent variable is
A. Simple linear regression
B. Logistic regression
C. Lasso regression
D. Multiple linear regression

20. In a multiple linear regression model, the coefficients represent
A. the change in X for a unit change in Y
B. the change in Y for unit changes in Xs
C. the change in Y for a unit change in X
D. the change in X for unit changes in Ys

21. A fitted regression equation is given by Y-hat = 150 + 7X. What is the sum of the absolute values of the residuals at the points ( X1 = 20, Y1 = 300 ) and ( X2 = 30, Y2 = 360 )?
A. -10
B. 10
C. 0
D. 20

22. Which of the below directly shows the significance of a linear regression model?
A. F statistic
B. standard error
C. p value
D. t statistic

23.
The p-value of a variable can change when another variable is included in the model. Which of the below is always correct?
A. The newly added variable is redundant
B. The old variable overfits
C. The old variable is problematic
D. The newly added variable is not fully uncorrelated

24. When an important variable is not available, another variable may wrongly try to explain its effect. This is omitted variable bias. How many of the below may be a reliable way to catch such a problem?
i. checking the signs of the variables
ii. checking the size of the coefficients of the variables
iii. cleaning the outliers
A. 0
B. 2
C. 3
D. 1

25. Categorical data can be expressed as numeric thanks to ....
A. using text mining
B. using the most common elements
C. cleaning anomalies
D. converting to dummy variables

26. Which is wrong for logistic regression?
A. For a 3-class target in multinomial classification, 3 separate logistic regression predictions are needed (3 classifiers)
B. The independent variables should be independent of each other. That is, the model should have little or no multicollinearity
C. Logistic Regression is mainly used for Regression
D. ROC curve and accuracy can be used as success metrics

27. The term "odds" is defined as the ratio of the tested outcome to all other outcomes.
Example: For a standard six-sided die, the odds of rolling the side "3" are 1/5.
The probability of a particular customer paying back his loan is 0.50. What are the odds of default (not paying back)?
A. 0.25
B. 2
C. 1
D. 0.5

28. The correlation between
- income and obesity is about -0.01
- fast-food consumption and obesity is about 0.17
Which of the following options are possible regarding the above information?
A. There could be a strong nonlinear relationship between income and obesity
B.
The relationship between income and obesity is strongly linear
C. People with higher fast-food consumption are more likely to suffer from obesity
D. Young people consume more fast food compared to the average of non-young people

A. A, D
B. A, C
C. B, C
D. B, D

29. Which of the following statements are more likely for good models relative to models that overfit?
(i). Have higher bias
(ii). Have lower bias
(iii). Have higher variance
(iv). Have lower variance
A. ii, iv
B. i, iii
C. i, iv
D. ii, iii

30. A dataset is generated from a degree-2 polynomial function, and then some noisy data points are added. We try to model the y values with a polynomial function of degree 4. What should be expected from this model in terms of variance and bias?
A. Low bias, low variance
B. High bias, low variance
C. High bias, low variance
D. Low bias, high variance

31. Which is not correct if the regularization parameter = 0 for lasso regression?
A. Small coefficients are penalized
B. Overfitting problems are not dealt with
C. Large coefficients are not penalized
D. The loss function is the same as the ordinary least squares loss function

32. Which is not correct if the regularization parameter is very high for ridge regression?
A. May lead to the coefficients of the variables becoming zero
B. Large coefficients are significantly penalized
C. May lead to a model that is too simple and ends up underfitting the data
D. May lead to worse performance than ordinary least squares

33. Which is not correct for a decision tree?
A. Variables for splits are chosen based on potential information gain
B. It is easy to explain
C. It can split the data at a node into two or more groups
D. Its greedy split algorithm can calculate information gain for splits two depths ahead

34. Which of the below is not correct for pruning?
A.
The idea is to prevent the model from learning insignificantly small micro-segments
B. It is used to reduce the bias by sampling the amount of data used
C. It prevents overfitting
D. Pruning should reduce the size of a learning tree without reducing predictive accuracy as measured by a cross-validation set

35. What is the difference between random forest (RF) and bagging?
a. RF uses the bootstrapping algorithm.
b. RF uses random thresholds for each feature rather than searching for the best possible thresholds.
c. RF uses a random subset of features in trees.
d. RF is a serial boosting method.
A. b
B. c
C. a
D. d

36.
h_1 = 1.60;
w_1 = 73;
h_2 = 1.80;
w_2 = XXXX;
BMI_1 = w_1 / ( h_1 * h_1 );
BMI_2 = w_2 / ( h_2 * h_2 );
Which value of XXXX below satisfies the condition BMI_1 = BMI_2?
A. 86
B. 89
C. 92
D. 83

37. Which of the below best describes the functionality of the list "append" method?
A. Adds new elements as a single element
B. Adds an element at the specified position
C. Adds new elements
D. Returns an error if the element is not in the list

38. You want Jupyter Notebook to show all the data instances while displaying data. Which code will you use?
A. pd.set_option("display.max.rows", None)
B. pd.set_option('display.max_colwidth', -1)
C. pd.set_option("display.max.columns", None)
D. pd.set_option('max_info_columns', 199)

39. Which one(s) are right?
i. The Python "random" library can be used to generate random integers
ii. If we use print() as the output of a user-defined function (udf) and assign the udf's result to a variable (var_x), var_x points to "NoneType" in memory
iii. There might not be multiple return statements in a user-defined function
iv. ( 9 // 3 ) and ( 9 / 3 ) give the same type of result
A. ii, iv
B. i, ii
C. i, ii, iii
D. i, iii

40. Which one(s) are correct?
i.
If x is a list, then x.append([4,5]) and x.extend([4,5]) both append objects to the end of x
ii. If x is a list, then x[::-1] and x.reverse() give the same result
iii. In a user-defined function, the "return" statement acts like a "break" statement in a loop
A. i, iii
B. i
C. ii, iii
D. iii
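
The list and function behaviors that questions 37, 39 and 40 refer to can be checked directly in an interpreter. The following is a minimal sketch, not part of the original exam: the list contents and the helper name first_even are made up for illustration.

```python
# Illustrative sketch only; the sample lists and the helper name
# first_even are assumptions, not taken from the exam.

# append adds its argument as ONE element; extend adds each item of it.
a = [1, 2, 3]
a.append([4, 5])
b = [1, 2, 3]
b.extend([4, 5])

# Slicing builds a new reversed list; reverse() reverses in place
# and returns None.
c = [1, 2, 3]
reversed_copy = c[::-1]
reverse_result = c.reverse()


def first_even(numbers):
    """Return the first even number; return ends both the loop and the function."""
    for n in numbers:
        if n % 2 == 0:
            return n
    return None


print(a)                         # [1, 2, 3, [4, 5]]
print(b)                         # [1, 2, 3, 4, 5]
print(reversed_copy)             # [3, 2, 1]
print(reverse_result)            # None
print(c)                         # [3, 2, 1]
print(first_even([1, 3, 4, 6]))  # 4
```

Note that return leaves the whole function, while break only leaves the enclosing loop, so the two are similar but not interchangeable.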