
Data Science Interview Questions

The latest data science questions for you
If you want to become a data scientist, you must know these questions. A data scientist should not be evaluated only on his/her knowledge of machine learning; he/she should also have good expertise in statistics. onlineitguru provides data science interview questions from basic to advanced level to help you prepare.
1. What is data science?
Data science is a "concept to unify statistics, data analysis, machine learning and their related
methods" in order to "understand and analyze actual phenomena" with data.
Data Science involves using automated methods to analyze massive amounts of data and to extract
knowledge from them.
2. What is selection bias?
Selection bias occurs when the sample obtained is not representative of the population intended to be analyzed.
3. Which language is more suitable for text analytics? R or Python?
Python is the most prominent language used in machine learning, to my knowledge, and is generally the stronger choice for text analytics. However, R is also good. In one of the scenarios we worked on, R was far better in time complexity when executing some recommendation-based models.
4. What is Linear Regression?
Linear regression is a statistical technique where the score of a variable Y is predicted from the score
of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.
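As a minimal sketch of simple linear regression in Python, assuming scipy is available (the sample data and variable names are invented for illustration):

import numpy as np
from scipy.stats import linregress

# Illustrative data: X is the predictor, Y is the criterion variable
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit Y = slope * X + intercept by least squares
result = linregress(x, y)
print(f"slope={result.slope:.3f}, intercept={result.intercept:.3f}")

# Predict Y for a new X value
x_new = 6.0
print("predicted Y:", result.slope * x_new + result.intercept)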
5. Can you use machine learning for time series analysis?
Yes, it can be used but it depends on the applications.
6. What is the difference between machine learning and deep learning?
Machine learning uses algorithms to parse data, learn from that data, and make informed decisions based on what it has learned. Deep learning, by contrast, structures such algorithms in layers to create an artificial "neural network" that can learn and make intelligent decisions on its own.
7. What is deep learning?
Deep learning is one of the foundations of artificial intelligence (AI), and the current interest in deep
learning is due in part to the buzz surrounding AI. At its simplest, deep learning can be thought of as
a way to automate predictive analytics.
8. What Is A Recommender System?
A recommendation system, or recommender system, tries to make predictions about user preferences and make recommendations that should interest customers.
A recommendation system is any system that automatically suggests content for website readers and users. Recommender systems are one of the most common and easily understandable applications of big data.
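As a minimal illustrative sketch, one common approach is item-based collaborative filtering with cosine similarity; the ratings matrix below is invented for illustration:

import numpy as np

# Rows = users, columns = items; 0 means "not rated" (illustrative data)
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_sim(a, b):
    # Cosine similarity between two rating columns
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

# How similar is each item to item 0, judged by rating patterns?
sims = [cosine_sim(ratings[:, 0], ratings[:, j]) for j in range(ratings.shape[1])]
print(sims)  # items rated similarly to item 0 can be recommended to its fans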
What is "naive" in Naive Bayes?
The algorithm is "naive" because it assumes that all features are independent of one another, an assumption that may or may not turn out to be correct.
9. Can you write the formula to calculate R-squared?
R-squared can be calculated using the formula below:
R-squared = 1 - (Residual Sum of Squares / Total Sum of Squares)
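A quick sketch of that formula in Python (the observed and predicted values are invented for illustration):

import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])      # observed values (illustrative)
y_hat = np.array([2.8, 5.1, 7.2, 8.9])  # model predictions (illustrative)

rss = np.sum((y - y_hat) ** 2)          # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
r_squared = 1 - rss / tss
print(f"R-squared = {r_squared:.4f}")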
10. Explain The Various Benefits Of R Language?
The R programming language includes a software suite that is used for graphical representation, statistical computing, data manipulation and calculation.
11. What Are Feature Vectors?
A feature vector is an n-dimensional vector of numerical features that represents some object, for example term occurrence frequencies for a document or the pixels of an image. The vector space associated with these vectors is called the feature space.
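As a small illustrative sketch, a term-frequency feature vector for a toy document (the vocabulary and sentence are invented):

import numpy as np

vocabulary = ["data", "science", "model"]  # fixed feature order (illustrative)
document = "data science uses data and a model".split()

# Each dimension counts how often one vocabulary term occurs
feature_vector = np.array([document.count(term) for term in vocabulary])
print(feature_vector)  # [2 1 1] -> a point in a 3-dimensional feature space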
12. Compare SAS, Python and R programming?
Corporate setups that require more hands-on assistance and training choose SAS as an option. Researchers and statisticians tend to choose R, as it helps with heavy calculations; as they say, R was meant to get the job done, not to ease your computer's load. Python has been the best choice for startups today due to its lightweight nature and growing community.
SAS is ruling the market now. So, from an immediate job perspective, my rating would be SAS - R - Python. But I think that with time, as more structured data is put in place, R and Python will be at par or perhaps more in demand.
13. What Are The Types Of Biases That Can Occur During Sampling?
• Selection bias
• Undercoverage bias
• Survivorship bias
14. What is Star Schema?
The star schema is the simplest style of data mart schema and is the approach most widely used to
develop data warehouses and dimensional data marts. The star schema is an important special case
of the snowflake schema, and is more effective for handling simpler queries.
15. Why does data cleaning play an important role in analysis?
Cleaning data from multiple sources to transform it into a format that data analysts or data scientists can work with is a cumbersome process.
Cleaning tools should also be used on a regular basis, as inaccurate data levels can grow quickly, compromising the database and decreasing business efficiency.
16. What is linear regression?
Linear regression is an important tool in analytics. The technique uses statistical calculations to plot
a trend line in a set of data points. In simple linear regression a single independent variable is used
to predict the value of a dependent variable. In multiple linear regression two or more independent
variables are used to predict the value of a dependent variable. The difference between the two is
the number of independent variables.
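A minimal sketch of multiple linear regression with scikit-learn, assuming it is installed (the data is invented for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Multiple linear regression: two independent variables per row (illustrative)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.1, 4.9, 11.2, 10.8, 15.1])  # dependent variable

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("prediction for [6, 6]:", model.predict(np.array([[6.0, 6.0]])))

For simple linear regression, X would just have a single column.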
17. What is power analysis?
Power is the probability of detecting an effect, given that the effect is really there. A power analysis involves estimating one of four parameters (sample size, effect size, significance level and power) given values for the other three. This is a powerful tool both in the design and in the analysis of experiments that we wish to interpret using statistical hypothesis tests.
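A minimal sketch using statsmodels to solve for the sample size of a two-sample t-test, given the other three parameters (the effect size, alpha and power values are illustrative assumptions):

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Solve for sample size per group, given the other three parameters
n = analysis.solve_power(effect_size=0.5,  # assumed medium effect (Cohen's d)
                         alpha=0.05,       # significance level
                         power=0.8)        # desired power
print(f"required sample size per group: {n:.1f}")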
18. What is the difference between data design and data model?
Data design is the process of designing a database. The main output of data design is a detailed logical data model of a database.
A data model gives you a conceptual understanding of how data is structured in a database; it is hard-coded into the DBMS software, so you can think of it as a sort of facility provided by the database.
Database design is the process of producing a detailed data model of a database. The term database design can be used to describe many different parts of the design of an overall database system.
19. Which technique is used to predict categorical responses?
Classification methods are used to predict a binary or multi-class target variable.
You could use conventional parametric models such as logistic regression, multinomial regression, linear discriminant analysis, etc.
20. What are the important skills to have in Python with regard to data analysis?
To be able to do data analysis in Python, you should be good with the basics of Python and the packages below. Along with the basics of Python, you should be comfortable working with data frames and series, as well as with multi-dimensional arrays and visualisation.
• Pandas: the most important package for data analysis. You can load CSV, Excel, tables and a variety of other data formats, view the data in tabular format and work on rows and columns.
• Numpy: used for working with multi-dimensional arrays.
• Matplotlib: used for visualisation.
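A minimal sketch of the three packages working together; the file name data.csv and its contents are assumptions for illustration:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Pandas: load a CSV and inspect rows and columns (file name is illustrative)
df = pd.read_csv("data.csv")
print(df.head())

# Numpy: treat the numeric columns as a multi-dimensional array
values = df.select_dtypes(include=np.number).to_numpy()
print("column means:", values.mean(axis=0))

# Matplotlib: visualise the first numeric column
plt.hist(values[:, 0], bins=20)
plt.show()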
21. Describe the difference between univariate, bivariate and multivariate analysis?
Univariate and multivariate represent two approaches to statistical analysis. Univariate involves the
analysis of a single variable while multivariate analysis examines two or more variables. Most
multivariate analysis involves a dependent variable and multiple independent variables.
Univariate analysis is the simplest form of data analysis where the data being analyzed contains only
one variable. Since it's a single variable it doesn’t deal with causes or relationships. The main
purpose of univariate analysis is to describe the data and find patterns that exist within it.
Bivariate analysis is used to find out if there is a relationship between two different variables.
Something as simple as creating a scatterplot by plotting one variable against another on a Cartesian
plane (think X and Y axis) can sometimes give you a picture of what the data is trying to tell you. If
the data seems to fit a line or curve then there is a relationship or correlation between the two
variables. For example, one might choose to plot caloric intake versus weight.
Multivariate analysis is the analysis of three or more variables. There are many ways to perform
multivariate analysis depending on your goals. Some of these methods include Additive Tree,
Canonical Correlation Analysis, Cluster Analysis, Correspondence Analysis / Multiple Correspondence
Analysis, Factor Analysis, Generalized Procrustean Analysis, MANOVA, Multidimensional Scaling,
Multiple Regression Analysis, Partial Least Square Regression, Principal Component Analysis /
Regression / PARAFAC, and Redundancy Analysis.
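As a quick sketch of the bivariate example above (caloric intake plotted against weight), with made-up numbers:

import matplotlib.pyplot as plt

# Illustrative data: daily caloric intake (x) vs. body weight in kg (y)
calories = [1800, 2000, 2200, 2500, 2800, 3000]
weight = [62, 66, 70, 76, 83, 88]

plt.scatter(calories, weight)
plt.xlabel("Daily caloric intake")
plt.ylabel("Weight (kg)")
plt.show()  # points near a line suggest a relationship between the variables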
22. What is logistic regression for?
The logistic regression is a predictive analysis. Logistic regression is used to describe data and to
explain the relationship between one dependent binary variable and one or more nominal, ordinal,
interval or ratio-level independent variables.
Logistic regression predicts the probability of an outcome that can only have two values (i.e. a
dichotomy). The prediction is based on the use of one or several predictors (numerical and
categorical). A linear regression is not appropriate for predicting the value of a binary variable for
two reasons:
• A linear regression will predict values outside the acceptable range (e.g. predicting probabilities outside the range 0 to 1).
• Since the dichotomous experiments can only have one of two possible values for each experiment, the residuals will not be normally distributed about the predicted line.
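A minimal sketch of logistic regression with scikit-learn predicting a dichotomous outcome (the data is invented for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# One numeric predictor, binary outcome (illustrative data)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
# Unlike linear regression, predicted probabilities stay between 0 and 1
print(model.predict_proba(np.array([[3.5]])))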
23. What is K-means?
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data
(i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the
data, with the number of groups represented by the variable K. The algorithm works iteratively to
assign each data point to one of K groups based on the features that are provided. Data points are
clustered based on feature similarity. The results of the K-means clustering algorithm are:
• The centroids of the K clusters, which can be used to label new data
• Labels for the training data (each data point is assigned to a single cluster)
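A minimal sketch of K-means with scikit-learn (the points and the choice K=2 are illustrative):

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points forming two loose groups (illustrative data)
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("centroids:", kmeans.cluster_centers_)  # can be used to label new data
print("labels:", kmeans.labels_)              # one cluster per training point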
24. Do gradient descent methods always converge to the same point?
No, they do not, because in some cases gradient descent reaches a local minimum or local optimum point rather than the global optimum. Whether it does depends on the data and the starting conditions.
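A small sketch illustrating this: plain gradient descent on a non-convex function lands in different minima from different starting points (the function and learning rate are chosen only for illustration):

def grad(x):
    # Derivative of f(x) = x**4 - 3*x**2 + x, which has two local minima
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, lr=0.01, steps=1000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Different starting conditions converge to different optima
print(gradient_descent(-2.0))  # settles near the left minimum (about -1.3)
print(gradient_descent(+2.0))  # settles near the right minimum (about +1.1)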
25. What is the goal of A/B Testing?
A/B testing is a method of comparing two versions of a webpage or app against each other to
determine which one performs better.
The goal of A/B testing is to identify changes to the web page that maximize or increase an outcome of interest. It enables you to analyze which of the two versions performs better and generates better conversion rates.
In an A/B test, you take a webpage or app screen and modify it to create a second version of the
same page.
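A minimal sketch of judging an A/B test with a two-proportion z-test via statsmodels (the visitor and conversion counts are invented):

from statsmodels.stats.proportion import proportions_ztest

# Illustrative results: conversions out of visitors for versions A and B
conversions = [120, 150]
visitors = [2400, 2350]

stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the difference in conversion rates is real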
26. What does data import in R language mean?
R Commander is used to import data in the R language. To start the R Commander GUI, the user must type the command Rcmdr into the console. There are three different ways in which data can be imported in the R language:
• Users can select the data set in the dialog box or enter the name of the data set (if they know it).
• Data can also be entered directly using the editor of R Commander via Data -> New Data Set. However, this works well only when the data set is not too large.
• Data can also be imported from a URL, from a plain text file (ASCII), from any other statistical package or from the clipboard.
27. How Many Data Structures Does R Language Have?
R programming supports five basic types of data structure, namely vector, matrix, list, data frame and factor. Vector – this data structure contains elements of the same type, i.e., integer, double, logical, complex, etc.
28. What Is The Command Used To Store R Objects In A File?
save(x, file="x.RData")
The function save() can be used to save one or more R objects to a specified file (in .RData or .rda file formats). The objects can be read back from the file using the function load().
29. What Is Interpolation And Extrapolation?
Interpolation is estimating data points that fall within the range of the data you have, i.e. between your existing data points.
Extrapolation is estimating data points beyond the range of your data set. Interpolation is the
estimation of a point between endpoints that have been sampled. Thus, the estimate is constrained
by the sample values at the endpoints of the line segment over which you are estimating (and the
estimated/expected function over the interval between the end points).
Extrapolation is an estimate of the value of a point in a range beyond that spanned by existing
sample points. Because it is an extension of a trend, it is less constrained and as you move away
from the last sampling point you have, the uncertainty in your estimate increases.
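A small sketch contrasting the two with numpy (the sampled points are invented for illustration):

import numpy as np

# Sampled points lying on a simple trend (illustrative)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 2.0, 4.0, 6.0])

# Interpolation: estimate inside the sampled range, constrained by endpoints
print(np.interp(1.5, x, y))  # 3.0, between the samples at x=1 and x=2

# Extrapolation: extend the fitted trend beyond the last sample point
slope, intercept = np.polyfit(x, y, deg=1)
print(slope * 5.0 + intercept)  # 10.0, with growing uncertainty beyond x=3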
30. What Is Root Cause Analysis?
Root cause analysis (RCA) is a systematic process for finding and identifying the root cause of a
problem or event.
Root cause analysis helps identify what, how and why something happened, thus preventing
recurrence.
RCA has a wide range of advantages, but it is dramatically beneficial in the continuous atmosphere
of software development and information technology.