
Data Science Interview Questions

The latest data science questions for you
If you want to become a data scientist, you must know these questions. A data scientist should not be evaluated only on his/her knowledge of machine learning; he/she should also have good expertise in statistics. onlineitguru provides data science interview questions from basic to advanced level to help you prepare.
1. What is data science?
Data science is a "concept to unify statistics, data analysis, machine learning and their related
methods" in order to "understand and analyze actual phenomena" with data.
Data Science involves using automated methods to analyze massive amounts of data and to extract
knowledge from them.
2. What is selection bias?
Selection bias occurs when the sample obtained is not representative of the population intended to be analyzed.
3. Which language is more suitable for text analytics? R or Python?
Python is the most prominent language used in machine learning, to my knowledge, and is generally the stronger choice for text analytics. However, R is also good. In one of the scenarios we worked on, R was far better in time complexity when executing some recommendation-based models.
4. What is Linear Regression?
Linear regression is a statistical technique where the score of a variable Y is predicted from the score
of a second variable X. X is referred to as the predictor variable and Y as the criterion variable.
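As a minimal sketch of simple linear regression in Python, assuming scipy is available (the sample data and variable names are invented for illustration):

import numpy as np
from scipy.stats import linregress

# Illustrative data: X is the predictor, Y is the criterion variable
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Fit Y = slope * X + intercept by least squares
result = linregress(x, y)
print(f"slope={result.slope:.3f}, intercept={result.intercept:.3f}")

# Predict Y for a new X value
x_new = 6.0
print("predicted Y:", result.slope * x_new + result.intercept)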
5. Can you use machine learning for time series analysis?
Yes, it can be used but it depends on the applications.
6. What is the difference between machine learning and deep learning?
Machine learning uses algorithms to parse data, learn from that data, and make informed decisions based on what it has learned. Deep learning, by contrast, structures such algorithms in layers to create an artificial "neural network" that can learn and make intelligent decisions on its own.
7. What is deep learning?
Deep learning is one of the foundations of artificial intelligence (AI), and the current interest in deep
learning is due in part to the buzz surrounding AI. At its simplest, deep learning can be thought of as
a way to automate predictive analytics.
8. What Is A Recommender System?
A recommendation system, or recommender system, tries to make predictions about user preferences and make recommendations that should interest customers.
A recommendation system is any system that automatically suggests content for website readers and users. Recommender systems are one of the most common and easily understandable applications of big data.
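As a minimal illustrative sketch, one common approach is item-based collaborative filtering with cosine similarity; the ratings matrix below is invented for illustration:

import numpy as np

# Rows = users, columns = items; 0 means "not rated" (illustrative data)
ratings = np.array([
    [5.0, 4.0, 0.0, 1.0],
    [4.0, 5.0, 1.0, 0.0],
    [1.0, 0.0, 5.0, 4.0],
])

def cosine_sim(a, b):
    # Cosine similarity between two rating columns
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9)

# How similar is each item to item 0, judged by rating patterns?
sims = [cosine_sim(ratings[:, 0], ratings[:, j]) for j in range(ratings.shape[1])]
print(sims)  # items rated similarly to item 0 can be recommended to its fans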
What is "naive" in Naive Bayes?
The algorithm is "naive" because it assumes that all features are independent of one another, an assumption that may or may not turn out to be correct.
9. Can you write the formula to calculate R-squared?
R-squared can be calculated using the formula below:
R-squared = 1 - (Residual Sum of Squares / Total Sum of Squares)
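A quick sketch of that formula in Python (the observed and predicted values are invented for illustration):

import numpy as np

y = np.array([3.0, 5.0, 7.0, 9.0])      # observed values (illustrative)
y_hat = np.array([2.8, 5.1, 7.2, 8.9])  # model predictions (illustrative)

rss = np.sum((y - y_hat) ** 2)          # residual sum of squares
tss = np.sum((y - y.mean()) ** 2)       # total sum of squares
r_squared = 1 - rss / tss
print(f"R-squared = {r_squared:.4f}")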
10. Explain The Various Benefits Of R Language?
The R programming language includes a software suite that is used for graphical representation, statistical computing, data manipulation and calculation.
11. What Are Feature Vectors?
A feature vector is an n-dimensional vector of numerical features that represents some object, for example term occurrence frequencies for a document or the pixels of an image. The vector space associated with these vectors is called the feature space.
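As a small illustrative sketch, a term-frequency feature vector for a toy document (the vocabulary and sentence are invented):

import numpy as np

vocabulary = ["data", "science", "model"]  # fixed feature order (illustrative)
document = "data science uses data and a model".split()

# Each dimension counts how often one vocabulary term occurs
feature_vector = np.array([document.count(term) for term in vocabulary])
print(feature_vector)  # [2 1 1] -> a point in a 3-dimensional feature space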
12. Compare SAS, Python and R programming?
Corporate setups that require more hands-on assistance and training choose SAS as an option. Researchers and statisticians tend to choose R, as it helps with heavy calculations; as they say, R was meant to get the job done, not to ease your computer's load. Python has been the best choice for startups today due to its lightweight nature and growing community.
SAS is ruling the market now. So, from an immediate job perspective, my rating would be SAS - R - Python. But I think that with time, as more structured data is put in place, R and Python will be at par or perhaps more in demand.
13. What Are The Types Of Biases That Can Occur During Sampling?
• Selection bias
• Undercoverage bias
• Survivorship bias
14. What is Star Schema?
The star schema is the simplest style of data mart schema and is the approach most widely used to
develop data warehouses and dimensional data marts. The star schema is an important special case
of the snowflake schema, and is more effective for handling simpler queries.
15. Why does data cleaning play an important role in analysis?
Cleaning data from multiple sources to transform it into a format that data analysts or data scientists can work with is a cumbersome process.
Cleaning tools should also be used on a regular basis, as inaccurate data levels can grow quickly, compromising the database and decreasing business efficiency.
16. What is linear regression?
Linear regression is an important tool in analytics. The technique uses statistical calculations to plot
a trend line in a set of data points. In simple linear regression a single independent variable is used
to predict the value of a dependent variable. In multiple linear regression two or more independent
variables are used to predict the value of a dependent variable. The difference between the two is
the number of independent variables.
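A minimal sketch of multiple linear regression with scikit-learn, assuming it is installed (the data is invented for illustration):

import numpy as np
from sklearn.linear_model import LinearRegression

# Multiple linear regression: two independent variables per row (illustrative)
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0], [4.0, 3.0], [5.0, 5.0]])
y = np.array([5.1, 4.9, 11.2, 10.8, 15.1])  # dependent variable

model = LinearRegression().fit(X, y)
print("coefficients:", model.coef_, "intercept:", model.intercept_)
print("prediction for [6, 6]:", model.predict(np.array([[6.0, 6.0]])))

For simple linear regression, X would just have a single column.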
17. What is power analysis?
Power is the probability of detecting an effect, given that the effect is really there. A power analysis involves estimating one of four parameters (sample size, effect size, significance level and power) given values for the other three. This is a powerful tool both in the design and in the analysis of experiments that we wish to interpret using statistical hypothesis tests.
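A minimal sketch using statsmodels to solve for the sample size of a two-sample t-test, given the other three parameters (the effect size, alpha and power values are illustrative assumptions):

from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()
# Solve for sample size per group, given the other three parameters
n = analysis.solve_power(effect_size=0.5,  # assumed medium effect (Cohen's d)
                         alpha=0.05,       # significance level
                         power=0.8)        # desired power
print(f"required sample size per group: {n:.1f}")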
18. What is the difference between data design and data model?
Data design is the process of designing a database. The main output of data design is a detailed logical data model of a database.
A data model gives you a conceptual understanding of how data is structured in a database; it is hard-coded into the DBMS software, so you can think of it as a sort of facility provided by the database.
Database design is the process of producing a detailed data model of a database. The term database design can be used to describe many different parts of the design of an overall database system.
19. Which technique is used to predict categorical responses?
Classification methods are used to predict a binary or multi-class target variable.
You could use conventional parametric models such as logistic regression, multinomial regression, linear discriminant analysis, etc.
20. What are the important skills to have in Python with regard to data analysis?
To be able to do data analysis in Python, you should be good with the basics of Python and the packages below. Along with the basics of Python, you should be comfortable working with data frames and series, as well as with multi-dimensional arrays and visualisation.
• Pandas: the most important package for data analysis. You can load CSV, Excel, tables and a variety of other data formats, view the data in tabular format and work on rows and columns.
• Numpy: used for working with multi-dimensional arrays.
• Matplotlib: used for visualisation.
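A minimal sketch of the three packages working together; the file name data.csv and its contents are assumptions for illustration:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# Pandas: load a CSV and inspect rows and columns (file name is illustrative)
df = pd.read_csv("data.csv")
print(df.head())

# Numpy: treat the numeric columns as a multi-dimensional array
values = df.select_dtypes(include=np.number).to_numpy()
print("column means:", values.mean(axis=0))

# Matplotlib: visualise the first numeric column
plt.hist(values[:, 0], bins=20)
plt.show()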
21. Describe the difference between univariate, bivariate and multivariate analysis?
Univariate and multivariate represent two approaches to statistical analysis. Univariate involves the
analysis of a single variable while multivariate analysis examines two or more variables. Most
multivariate analysis involves a dependent variable and multiple independent variables.
Univariate analysis is the simplest form of data analysis where the data being analyzed contains only
one variable. Since it's a single variable it doesn’t deal with causes or relationships. The main
purpose of univariate analysis is to describe the data and find patterns that exist within it.
Bivariate analysis is used to find out if there is a relationship between two different variables.
Something as simple as creating a scatterplot by plotting one variable against another on a Cartesian
plane (think X and Y axis) can sometimes give you a picture of what the data is trying to tell you. If
the data seems to fit a line or curve then there is a relationship or correlation between the two
variables. For example, one might choose to plot caloric intake versus weight.
Multivariate analysis is the analysis of three or more variables. There are many ways to perform
multivariate analysis depending on your goals. Some of these methods include Additive Tree,
Canonical Correlation Analysis, Cluster Analysis, Correspondence Analysis / Multiple Correspondence
Analysis, Factor Analysis, Generalized Procrustean Analysis, MANOVA, Multidimensional Scaling,
Multiple Regression Analysis, Partial Least Square Regression, Principal Component Analysis /
Regression / PARAFAC, and Redundancy Analysis.
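As a quick sketch of the bivariate example above (caloric intake plotted against weight), with made-up numbers:

import matplotlib.pyplot as plt

# Illustrative data: daily caloric intake (x) vs. body weight in kg (y)
calories = [1800, 2000, 2200, 2500, 2800, 3000]
weight = [62, 66, 70, 76, 83, 88]

plt.scatter(calories, weight)
plt.xlabel("Daily caloric intake")
plt.ylabel("Weight (kg)")
plt.show()  # points near a line suggest a relationship between the variables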
22. What is logistic regression for?
The logistic regression is a predictive analysis. Logistic regression is used to describe data and to
explain the relationship between one dependent binary variable and one or more nominal, ordinal,
interval or ratio-level independent variables.
Logistic regression predicts the probability of an outcome that can only have two values (i.e. a
dichotomy). The prediction is based on the use of one or several predictors (numerical and
categorical). A linear regression is not appropriate for predicting the value of a binary variable for
two reasons:
• A linear regression will predict values outside the acceptable range (e.g. predicting probabilities outside the range 0 to 1).
• Since the dichotomous experiments can only have one of two possible values for each experiment, the residuals will not be normally distributed about the predicted line.
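A minimal sketch of logistic regression with scikit-learn predicting a dichotomous outcome (the data is invented for illustration):

import numpy as np
from sklearn.linear_model import LogisticRegression

# One numeric predictor, binary outcome (illustrative data)
X = np.array([[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]])
y = np.array([0, 0, 0, 1, 1, 1])

model = LogisticRegression().fit(X, y)
# Unlike linear regression, predicted probabilities stay between 0 and 1
print(model.predict_proba(np.array([[3.5]])))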
23. What is K-means?
K-means clustering is a type of unsupervised learning, which is used when you have unlabeled data
(i.e., data without defined categories or groups). The goal of this algorithm is to find groups in the
data, with the number of groups represented by the variable K. The algorithm works iteratively to
assign each data point to one of K groups based on the features that are provided. Data points are
clustered based on feature similarity. The results of the K-means clustering algorithm are:
• The centroids of the K clusters, which can be used to label new data
• Labels for the training data (each data point is assigned to a single cluster)
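A minimal sketch of K-means with scikit-learn (the points and the choice K=2 are illustrative):

import numpy as np
from sklearn.cluster import KMeans

# Unlabeled 2-D points forming two loose groups (illustrative data)
X = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 0.5],
              [8.0, 8.0], [8.5, 9.0], [9.0, 8.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print("centroids:", kmeans.cluster_centers_)  # can be used to label new data
print("labels:", kmeans.labels_)              # one cluster per training point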
24. Do gradient descent methods always converge to the same point?
No, they do not, because in some cases gradient descent reaches a local minimum or local optimum point rather than the global optimum. Whether it does depends on the data and the starting conditions.
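A small sketch illustrating this: plain gradient descent on a non-convex function lands in different minima from different starting points (the function and learning rate are chosen only for illustration):

def grad(x):
    # Derivative of f(x) = x**4 - 3*x**2 + x, which has two local minima
    return 4 * x**3 - 6 * x + 1

def gradient_descent(x, lr=0.01, steps=1000):
    for _ in range(steps):
        x -= lr * grad(x)
    return x

# Different starting conditions converge to different optima
print(gradient_descent(-2.0))  # settles near the left minimum (about -1.3)
print(gradient_descent(+2.0))  # settles near the right minimum (about +1.1)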
25. What is the goal of A/B Testing?
A/B testing is a method of comparing two versions of a webpage or app against each other to
determine which one performs better.
The goal of A/B testing is to identify changes to the web page that maximize or increase an outcome of interest. It enables you to analyze which of the two versions performs better and generates better conversion rates.
In an A/B test, you take a webpage or app screen and modify it to create a second version of the
same page.
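A minimal sketch of judging an A/B test with a two-proportion z-test via statsmodels (the visitor and conversion counts are invented):

from statsmodels.stats.proportion import proportions_ztest

# Illustrative results: conversions out of visitors for versions A and B
conversions = [120, 150]
visitors = [2400, 2350]

stat, p_value = proportions_ztest(count=conversions, nobs=visitors)
print(f"z = {stat:.2f}, p = {p_value:.4f}")
# A small p-value suggests the difference in conversion rates is real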
26. What does data import in R language mean?
R Commander is used to import data in the R language. To start the R Commander GUI, the user must type the command Rcmdr into the console. There are three different ways in which data can be imported in the R language:
• Users can select the data set in the dialog box or enter the name of the data set (if they know it).
• Data can also be entered directly using the editor of R Commander via Data -> New Data Set. However, this works well only when the data set is not too large.
• Data can also be imported from a URL, from a plain text file (ASCII), from any other statistical package or from the clipboard.
27. How Many Data Structures Does R Language Have?
R programming supports five basic types of data structure, namely vector, matrix, list, data frame and factor. Vector – this data structure contains elements of the same type, i.e., integer, double, logical, complex, etc.
28. What Is The Command Used To Store R Objects In A File?
save(x, file="x.RData")
The function save() can be used to save one or more R objects to a specified file (in .RData or .rda file formats). The objects can be read back from the file using the function load().
29. What Is Interpolation And Extrapolation?
Interpolation is estimating data points that fall within the range of the data you have, i.e. between your existing data points.
Extrapolation is estimating data points beyond the range of your data set. Interpolation is the
estimation of a point between endpoints that have been sampled. Thus, the estimate is constrained
by the sample values at the endpoints of the line segment over which you are estimating (and the
estimated/expected function over the interval between the end points).
Extrapolation is an estimate of the value of a point in a range beyond that spanned by existing
sample points. Because it is an extension of a trend, it is less constrained and as you move away
from the last sampling point you have, the uncertainty in your estimate increases.
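A small sketch contrasting the two with numpy (the sampled points are invented for illustration):

import numpy as np

# Sampled points lying on a simple trend (illustrative)
x = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.0, 2.0, 4.0, 6.0])

# Interpolation: estimate inside the sampled range, constrained by endpoints
print(np.interp(1.5, x, y))  # 3.0, between the samples at x=1 and x=2

# Extrapolation: extend the fitted trend beyond the last sample point
slope, intercept = np.polyfit(x, y, deg=1)
print(slope * 5.0 + intercept)  # 10.0, with growing uncertainty beyond x=3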
30. What Is Root Cause Analysis?
Root cause analysis (RCA) is a systematic process for finding and identifying the root cause of a
problem or event.
Root cause analysis helps identify what, how and why something happened, thus preventing
recurrence.
RCA has a wide range of advantages, but it is dramatically beneficial in the continuous atmosphere
of software development and information technology.