Uploaded by Kobbie Manrique

Data Science Quiz - Reviewer

Objective-type Quiz
A. Statistics and Probability
1. A fair coin is tossed twice. What is the probability of getting exactly one
tail?
a) 1/4
b) 1/2
c) 3/4
d) 1/3
Answer: b. 1/2
Solution: Of the four equally likely outcomes (HH, HT, TH, TT), the two with
exactly one tail are HT and TH, each with a probability of 1/4. Adding those
probabilities gives 1/2.
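As a quick check, the outcomes of two tosses can be enumerated directly. This is a minimal sketch; the variable names are illustrative and not part of the quiz.

```python
from fractions import Fraction
from itertools import product

# All equally likely outcomes of two tosses: HH, HT, TH, TT
outcomes = list(product("HT", repeat=2))

# Keep the outcomes that contain exactly one tail
favorable = [o for o in outcomes if o.count("T") == 1]

p_one_tail = Fraction(len(favorable), len(outcomes))
print(p_one_tail)  # 1/2
```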
2. A random variable follows a normal distribution with mean μ = 5 and
standard deviation σ = 2. What is the probability that this random variable
is less than 3?
a) 0.9772
b) 0.0228
c) 0.3413
d) 0.1587
Answer: d. 0.1587
Solution: To solve this problem, we need to standardize the random variable
using the Z-score formula, which is:
Z = (X - μ) / σ
Where:
X is the value of the random variable
μ is the mean of the distribution
σ is the standard deviation of the distribution
For X = 3, μ = 5, and σ = 2, the Z-score is:
Z = (3 - 5) / 2 = -1
By looking up this Z-score in the standard normal distribution table, or using a
Z-score calculator, we find that the probability that the random variable is less
than 3 (i.e., the cumulative probability P(X<3) for a Z-score of -1) is
approximately 0.1587.
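The table lookup can be reproduced with Python's standard library. This sketch assumes `statistics.NormalDist` (Python 3.8+) as a stand-in for the Z-table; the variable names are illustrative.

```python
from statistics import NormalDist

mu, sigma, x = 5, 2, 3

z = (x - mu) / sigma                     # standardize: (3 - 5) / 2 = -1.0
p_from_z = NormalDist().cdf(z)           # P(Z < -1) under the standard normal
p_direct = NormalDist(mu, sigma).cdf(x)  # same probability without standardizing

print(z, round(p_from_z, 4))  # -1.0 0.1587
```

Both routes give the same value, which is why standardizing first loses nothing.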
3. In hypothesis testing, if the p-value is less than or equal to the
significance level, we...
a) Reject the null hypothesis
b) Do not reject the null hypothesis
c) Accept the null hypothesis
d) Accept the alternative hypothesis
Answer: a. Reject the null hypothesis
Solution: A p-value at or below the significance level means the observed data
would be unlikely if the null hypothesis were true, so we reject it.
4. Which of the following is NOT a characteristic of a binomial
distribution?
a) Consists of n identical trials
b) Each trial results in one of two outcomes
c) The probability of success changes from trial to trial
d) Trials are independent
Answer: c. The probability of success changes from trial to trial
Solution: In a binomial distribution, the probability of success is constant across
all trials.
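With a constant success probability p, the chance of exactly k successes in n independent trials follows the binomial formula. The helper below is a small sketch; `binomial_pmf` is a hypothetical name chosen for illustration.

```python
from math import comb

def binomial_pmf(k, n, p):
    """P(exactly k successes in n independent trials, each with success probability p)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# Two tosses of a fair coin, counting a tail as a success: P(exactly one tail)
print(binomial_pmf(1, 2, 0.5))  # 0.5
```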
5. What is the purpose of sampling in statistics?
a) To make conclusions about a population based on a subset of that
population
b) To observe every individual in a population
c) To conduct a hypothesis test
d) To eliminate sources of bias
Answer: a. To make conclusions about a population based on a subset of that
population
Solution: Sampling allows estimation of population parameters without
examining every individual in the population.
6. What is the variance of a standard normal distribution?
a) 1
b) 0
c) Infinite
d) Can't be determined
Answer: a. 1
Solution: A standard normal distribution has a mean of 0 and a variance of 1.
7. The probability that event A occurs is 1/3, and the probability that event
B occurs, given that A has occurred, is 1/4. What is the joint probability of
A and B?
a) 1/7
b) 1/12
c) 7/12
d) 4/12
Answer: b. 1/12
Solution: The joint probability of A and B is P(A ∩ B) = P(A) * P(B | A) = 1/3 *
1/4 = 1/12
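The multiplication rule can be checked with exact fractions. A minimal sketch, with illustrative variable names:

```python
from fractions import Fraction

p_a = Fraction(1, 3)          # P(A)
p_b_given_a = Fraction(1, 4)  # P(B | A)

p_a_and_b = p_a * p_b_given_a  # multiplication rule: P(A and B) = P(A) * P(B | A)
print(p_a_and_b)  # 1/12
```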
8. Which of the following distributions is used to describe the behavior of
a count variable?
a) Normal distribution
b) Chi-Square distribution
c) Poisson distribution
d) F distribution
Answer: c. Poisson distribution
Solution: The Poisson distribution is often used to model the number of times
an event occurs in a specified interval of time or space.
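For a count variable with an average rate of lam events per interval, the Poisson probability of seeing exactly k events is lam^k * e^(-lam) / k!. The sketch below assumes a rate of 3 purely for illustration; `poisson_pmf` is a hypothetical helper name.

```python
from math import exp, factorial

def poisson_pmf(k, lam):
    """P(exactly k events) when events occur at an average rate of lam per interval."""
    return lam**k * exp(-lam) / factorial(k)

lam = 3.0  # assumed average number of events per interval
for k in range(5):
    print(k, round(poisson_pmf(k, lam), 4))
```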
9. Sampling distribution of a statistic becomes approximately a normal
distribution when...
a) Population is normally distributed, or if sample size is large.
b) Population is not normally distributed and the sample size is small.
c) Population is uniformly distributed, or if the sample size is unknown.
d) Population is normally distributed, and the sample size is small.
Answer: a. Population is normally distributed, or if sample size is large.
Solution: If the population is normally distributed, the sampling distribution
of the mean is normal for any sample size. Otherwise, according to the Central
Limit Theorem, it becomes approximately normal when the sample size is large.
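A small simulation can make the large-sample case concrete. This sketch assumes a skewed exponential population with mean 1 and a sample size of 50; both choices are illustrative, not from the quiz.

```python
import random
from statistics import mean, stdev

random.seed(0)  # fixed seed so the run is repeatable

def sample_mean(n):
    # Draw n values from a skewed (exponential) population with mean 1
    return mean(random.expovariate(1.0) for _ in range(n))

n = 50
means = [sample_mean(n) for _ in range(2000)]

print(round(mean(means), 3))   # close to the population mean, 1
print(round(stdev(means), 3))  # close to 1 / sqrt(50), about 0.141
```

Although the population is far from normal, the sample means cluster symmetrically around 1 with a spread near sigma / sqrt(n).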
10. If events A and B are mutually exclusive, P(A ∩ B) is...
a) equal to P(A) * P(B)
b) equal to P(A) or P(B)
c) equal to zero
d) can't be determined
Answer: c. equal to zero
Solution: If events A and B are mutually exclusive (cannot happen at the same
time), the probability of both of them happening is zero.
B. Machine Learning
1. Which of the following is an example of a supervised learning task?
a) Clustering customers into different segments
b) Predicting house prices based on various features
c) Identifying fraud in a dataset
d) Organizing news articles into different topics
Answer: b. Predicting house prices based on various features
Solution: Supervised learning involves predicting a target variable based on
given features. In this case, the house prices are the targets and are predicted
based on various features.
2. Which of the following algorithms can be used for both classification
and regression tasks?
a) K-means clustering
b) Support Vector Machines (SVM)
c) Hierarchical clustering
d) Apriori
Answer: b. Support Vector Machines (SVM)
Solution: SVMs can be used for both classification (by separating different
classes using a hyperplane) and regression (by introducing a margin of error).
3. Which of the following techniques can be used to avoid overfitting in a
decision tree?
a) Dimensionality reduction
b) Pruning
c) Clustering
d) Feature scaling
Answer: b. Pruning (a. Dimensionality reduction can also help)
Solution:
Pruning is a technique that removes unnecessary branches from the decision
tree. It is aimed specifically at decision trees and helps to avoid overfitting
by making the model more general.
Dimensionality reduction is a technique that reduces the number of features in
the dataset. This can also help to avoid overfitting by making the model simpler.
4. What is the main advantage of ensemble methods like Random Forest
over single decision trees?
a) They are faster to train
b) They are simpler to understand
c) They reduce variance and improve accuracy
d) They require less memory
Answer: c. They reduce variance and improve accuracy
Solution: Ensemble methods like Random Forests combine multiple weak
learners (decision trees) to create a strong learner that reduces variance and
improves prediction accuracy.
5. In k-nearest neighbors (k-NN) algorithm, what does 'k' represent?
a) The number of clusters
b) The number of features
c) The number of neighbors used to predict the class of a given instance
d) The number of classes in the target variable
Answer: c. The number of neighbors used to predict the class of a given
instance
Solution: In k-NN, 'k' represents the number of nearest neighbors considered
when predicting the class or value of a given instance.
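The role of 'k' is easiest to see in a tiny implementation. This is a sketch for classification only, assuming Euclidean distance and majority voting; `knn_predict` and the training data are made up for illustration.

```python
from collections import Counter
from math import dist

def knn_predict(train, query, k):
    """train is a list of (point, label) pairs; returns the majority label of the k nearest points."""
    nearest = sorted(train, key=lambda pair: dist(pair[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Made-up 2-D training data
train = [((1, 1), "red"), ((1, 2), "red"), ((2, 1), "red"),
         ((6, 6), "blue"), ((7, 6), "blue"), ((6, 7), "blue")]

print(knn_predict(train, (2, 2), k=3))  # red
```

For regression, the vote would be replaced by an average of the neighbors' values.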
6. A type of unsupervised learning where the algorithm learns the inherent
structure of the data is known as...
a) Classification
b) Regression
c) Clustering
d) Reinforcement learning
Answer: c. Clustering
Solution: Clustering is an unsupervised learning technique where the algorithm
tries to group similar instances together into clusters, based on the inherent
structure of the data.
7. The purpose of Principal Component Analysis (PCA) is:
a) Increase dimensionality of data
b) Reduce dimensionality of data
c) Increase variance retained
d) Reduce variance retained
Answer: b. Reduce dimensionality of data
Solution: Principal Component Analysis (PCA) is a dimensionality reduction
technique that transforms a large set of variables into a smaller one that still
contains most of the information in the large set.
8. In a Support Vector Machine (SVM), what is the 'kernel trick' used for?
a) To speed up the SVM training process
b) To convert non-linearly separable data into linearly separable by adding
more dimensions
c) To determine the number of support vectors
d) To choose the optimal hyperparameters for the SVM
Answer: b. To convert non-linearly separable data into linearly separable by
adding more dimensions
Solution: The 'kernel trick' lets an SVM treat non-linearly separable data
in a lower-dimensional space as linearly separable data in a higher-dimensional
space, without computing the higher-dimensional coordinates explicitly.
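The idea can be illustrated with a degree-2 polynomial kernel, chosen here as an assumed example. The kernel value computed in 2-D equals the inner product of explicitly mapped 3-D vectors; `phi`, `dot`, and `poly_kernel` are hypothetical helper names.

```python
from math import isclose, sqrt

def phi(v):
    """Explicit map from 2-D to 3-D for the degree-2 polynomial kernel."""
    x1, x2 = v
    return (x1 * x1, sqrt(2) * x1 * x2, x2 * x2)

def dot(a, b):
    return sum(p * q for p, q in zip(a, b))

def poly_kernel(a, b):
    """Same inner product, computed without ever building the 3-D vectors."""
    return dot(a, b) ** 2

a, b = (1.0, 2.0), (3.0, 4.0)
print(dot(phi(a), phi(b)), poly_kernel(a, b))  # both are 121, up to floating-point rounding
```

An SVM only needs these inner products, so it can work in the higher-dimensional space at the cost of the cheaper 2-D computation.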
9. What output can you expect from an unsupervised learning algorithm
analyzing customer data?
a) Predicted customer churn rates
b) Customer segmentation groups
c) Predicted customer lifetime value
d) Future customer behavior
Answer: b. Customer segmentation groups
Solution: Unsupervised learning algorithms are capable of scouring through
large volumes of customer data and segregating customers into discrete
categories or segmentation groups based on common patterns.
10. Which one of the following evaluation metrics is mainly used for
regression problems?
a) Precision
b) Recall
c) Mean Squared Error (MSE)
d) F1 Score
Answer: c. Mean Squared Error (MSE)
Solution: Mean Squared Error (MSE) is a common evaluation metric used to
measure the average of the squares of the errors between the actual and
predicted values in a regression problem.
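The definition translates directly into a few lines of code. A minimal sketch with made-up values; `mean_squared_error` here is a local helper, not a library call.

```python
def mean_squared_error(actual, predicted):
    """Average of the squared differences between actual and predicted values."""
    errors = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sum(errors) / len(errors)

actual = [3.0, 5.0, 2.0, 7.0]      # made-up targets
predicted = [2.0, 5.0, 4.0, 8.0]   # made-up model outputs

print(mean_squared_error(actual, predicted))  # 1.5
```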
C. Data Processing
1. What is the purpose of data transformation in data processing?
a) To handle missing values in the data
b) To normalize the values of features
c) To create new features from existing ones
d) To remove outliers from the dataset
Answer: b. To normalize the values of features
Solution: Data transformation is used to scale or normalize the values of
features to a common range, which helps algorithms during training.
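One common way to bring a feature into a shared range is min-max scaling to [0, 1]. The sketch below assumes that method; other transformations, such as standardization, are equally valid choices. `min_max_scale` is a hypothetical helper name.

```python
def min_max_scale(values):
    """Rescale values linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant feature carries no spread; map everything to 0
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

ages = [20, 30, 40, 60]  # made-up feature values
print(min_max_scale(ages))  # [0.0, 0.25, 0.5, 1.0]
```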
2. Which software library is commonly used for data manipulation and
analysis in Python?
a) Scikit-learn
b) TensorFlow
c) Pandas
d) Keras
Answer: c. Pandas
Solution: Pandas is a popular library in Python used for data manipulation,
analysis, and exploration tasks.
3. What is the purpose of data wrangling in the data processing pipeline?
a) To handle missing values and outliers in the data
b) To transform and reshape the data into a suitable format
c) To perform statistical modeling and analysis
d) To visualize and interpret the data
Answer: b. To transform and reshape the data into a suitable format
Solution: Data wrangling involves manipulating, cleaning, and organizing the
data into a format that is suitable for further analysis.
4. Which technique is commonly used to handle missing values in a
dataset?
a) Data imputation
b) Feature scaling
c) Dimensionality reduction
d) Outlier detection
Answer: a. Data imputation
Solution: Data imputation is the technique of estimating or substituting missing
values in a dataset with appropriate values.
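Mean imputation is one simple strategy, assumed here for illustration; median or model-based imputation are alternatives. In this sketch, `None` marks a missing value and `impute_mean` is a hypothetical helper name.

```python
from statistics import mean

def impute_mean(values):
    """Replace each missing entry (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in values]

print(impute_mean([4.0, None, 6.0, 8.0]))  # [4.0, 6.0, 6.0, 8.0]
```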
5. What is the first step in the data analysis process?
a) Data cleaning and preprocessing
b) Data visualization and exploration
c) Building machine learning models
d) Evaluating model performance
Answer: a. Data cleaning and preprocessing
Solution: The first step in data analysis is to clean and preprocess the data to
ensure its quality and prepare it for further analysis.
6. Which visualization technique is best suited for exploring the
relationship between two numerical variables?
a) Bar plot
b) Line plot
c) Scatter plot
d) Histogram
Answer: c. Scatter plot
Solution: A scatter plot allows you to visualize the relationship between two
numerical variables, showing the distribution of the data points and any patterns
or correlations between them.
7. What is the purpose of feature engineering in data processing?
a) To transform numerical features into categorical features
b) To reduce the dimensions of the dataset
c) To create new features from existing ones
d) To handle missing values in the dataset
Answer: c. To create new features from existing ones
Solution: Feature engineering involves creating new features from existing
ones that may better represent the underlying patterns and relationships in the
data.
8. Which technique is commonly used to reduce the dimensions of a
high-dimensional dataset?
a) One-hot encoding
b) Principal Component Analysis (PCA)
c) Feature scaling
d) Data imputation
Answer: b. Principal Component Analysis (PCA)
Solution: PCA is a technique used for dimensionality reduction by transforming
high-dimensional data into a lower-dimensional representation while retaining
most of the information.
9. What is the purpose of outlier detection in data preprocessing?
a) To remove missing values from the dataset
b) To handle class imbalance in the target variable
c) To identify potential errors or abnormalities in the data
d) To minimize the impact of noisy data on model performance
Answer: c. To identify potential errors or abnormalities in the data
Solution: Outlier detection helps identify data points that deviate significantly
from the expected patterns and may indicate errors, anomalies, or interesting
phenomena in the dataset.
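A common rule of thumb flags points that lie more than 1.5 interquartile ranges beyond the quartiles. The sketch below assumes that convention and the default quartile method of `statistics.quantiles`; `iqr_outliers` and the data are made up for illustration.

```python
from statistics import quantiles

def iqr_outliers(values, k=1.5):
    """Return the values lying more than k * IQR below Q1 or above Q3."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

data = [10, 12, 11, 13, 12, 11, 95]  # made-up measurements
print(iqr_outliers(data))  # [95]
```

Flagged points still need a human judgment call: they may be errors, or they may be the most interesting observations in the dataset.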
10. What is the most commonly used measure of central tendency?
a) Mean
b) Median
c) Mode
d) Range
Answer: a. Mean
Solution: The mean is the most commonly used measure of central tendency,
representing the average value of a dataset.
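The three measures of central tendency listed in the options can be compared on a small dataset. The values below are made up and include one extreme point to show how the mean reacts to it.

```python
from statistics import mean, median, mode

salaries = [30, 32, 32, 35, 40, 200]  # made-up values with one extreme point

print(mean(salaries))    # 61.5
print(median(salaries))  # 33.5
print(mode(salaries))    # 32
```

The mean is pulled toward the extreme value, while the median and mode stay near the bulk of the data.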