Data Science Interview Question & Answer by KrishnaMildain

Data Science Interview Question Q. What is Data Science? List the differences between supervised and unsupervised learning. Data Science is a blend of various tools, algorithms, and machine learning principles with the goal to discover hidden patterns from the raw data. How is this different from what statisticians have been doing for years? The answer lies in the difference between explaining and predicting.

Data Science Interview Question The differences between supervised and unsupervised learning are as follows; Supervised Learning

Unsupervised Learning

Input data is labelled.

Input data is unlabelled.

Uses a training data set.

Uses the input data set.

Used for prediction.

Used for analysis.

Enables classification and regression.

Enables Classification, Density Estimation, & Dimension Reduction

Data Science Interview Question Q: What are some of the steps for data wrangling and data cleaning before applying machine learning algorithms? There are many steps that can be taken when data wrangling and data cleaning. Some of the most common steps are listed below: • Data profiling: Almost everyone starts off by getting an understanding of their dataset. More specifically, you can look at the shape of the dataset with .shape and a description of your numerical variables with .describe(). • Data visualizations: Sometimes, it’s useful to visualize your data with histograms, boxplots, and scatterplots to better understand the relationships between variables and also to identify potential outliers.

Data Science Interview Question • Syntax error: This includes making sure there’s no white space, making sure letter casing is consistent, and checking for typos. You can check for typos by using .unique() or by using bar graphs. • Standardization or normalization: Depending on the dataset your working with and the machine learning method you decide to use, it may be useful to standardize or normalize your data so that different scales of different variables don’t negatively impact the performance of your model. • Handling null values: There are a number of ways to handle null values including deleting rows with null values altogether, replacing null values with the mean/median/mode, replacing null values with a new category (eg. unknown), predicting the values, or using machine learning models that can deal with null values.

Data Science Interview Question Differentiate between Data Science, Machine Learning, and AI. Criteria

Data Science

Machine Learning

Data Science is not exactly a subset A subset of AI that of machine learning but it uses Definition focuses on a narrow machine learning to analyze and range of activities. make future predictions.

Role

It can take on a business role.

Artificial Intelligence A wide term that focuses on applications ranging from Robotics to Text Analysis.

It is a combination of It is a purely technical both business role. and technical aspects.

Data Science Interview Question

Scope

Data Science is a broad term for Machine learning fits diverse disciplines AI is a sub-field of computer within the data and is not merely science. science spectrum. about developing and training models.

Loosely integrated

A sub-field of computer science consisting of various tasks like Machine learning is a planning, moving around in the subfield of AI and is world, recognizing objects and tightly integrated. sounds, speaking, translating, performing social or business transactions, creative work.

Data Science Interview Question

Q. Python or R – Which one would you prefer for text analytics? The best possible answer for this would be Python because it has a Pandas library that provides easy-to-use data structures and high-performance data analysis tools. Q. Which technique is used to predict categorical responses? Classification technique is used widely in mining for classifying data sets. Q. How to combat Overfitting and Underfitting? To combat overfitting and underfitting, you can resample the data to estimate the model accuracy (k-fold cross-validation) and by having a validation dataset to evaluate the model.

Data Science Interview Question Q. What is Selection Bias? Selection bias is a kind of error that occurs when the researcher decides who is going to be studied. It is usually associated with research where the selection of participants isn’t random. It is sometimes referred to as the selection effect. It is the distortion of statistical analysis, resulting from the method of collecting samples. If the selection bias is not taken into account, then some conclusions of the study may not be accurate. The types of selection bias include: • Sampling bias: It is a systematic error due to a non-random sample of a population causing some members of the population to be less likely to be included than others resulting in a biased sample.

Data Science Interview Question

• Time interval: A trial may be terminated early at an extreme value (often for ethical reasons), but the extreme value is likely to be reached by the variable with the largest variance, even if all variables have a similar mean. • Data: When specific subsets of data are chosen to support a conclusion or rejection of bad data on arbitrary grounds, instead of according to previously stated or generally agreed criteria. • Attrition: Attrition bias is a kind of selection bias caused by attrition (loss of participants) discounting trial subjects/tests that did not run to completion.

Data Science Interview Question Q: Neural Network Fundamentals A neural network is a multi-layered model inspired by the human brain. Like the neurons in our brain, the circles above represent a node. The blue circles represent the input layer, the black circles represent the hidden layers, and the green circles represent the output layer. Each node in the hidden layers represents a function that the inputs go through, ultimately leading to an output in the green circles. The formal term for these functions is called the sigmoid activation function.

Data Science Interview Question Q: What is cross-validation? Cross-validation is essentially a technique used to assess how well a model performs on a new independent dataset. The simplest example of crossvalidation is when you split your data into two groups: training data and testing data, where you use the training data to build the model and the testing data to test the model. Q. What are Recommender Systems? A subclass of information filtering systems that are meant to predict the preferences or ratings that a user would give to a product. Recommender systems are widely used in movies, news, research articles, products, social tags, music, etc.

Data Science Interview Question Differentiate between univariate, bivariate, and multivariate analysis. These are descriptive statistical analysis techniques that can be differentiated based on the number of variables involved at a given point in time. For example, the pie charts of sales based on territory involve only one variable and can be referred to as univariate analysis. If the analysis attempts to understand the difference between 2 variables at the time as in a scatterplot, then it is referred to as bivariate analysis. For example, analyzing the volume of sales and spending can be considered as an example of bivariate analysis. Analysis that deals with the study of more than two variables to understand the effect of variables on the responses is referred to as multivariate analysis.

Data Science Interview Question What is the difference between Cluster and Systematic Sampling? Cluster sampling is a technique used when it becomes difficult to study the target population spread across a wide area and simple random sampling cannot be applied. A cluster sample is a probability sample where each sampling unit is a collection or cluster of elements. Systematic sampling is a statistical technique where elements are selected from an ordered sampling frame. In systematic sampling, the list is progressed in a circular manner so once you reach the end of the list, it is progressed from the top again. The best example for systematic sampling is the equal probability method. What is Collaborative filtering? The process of filtering used by most of the recommender systems to find patterns or information by collaborating viewpoints, various data sources, and multiple agents.

Data Science Interview Question Q: Explain what precision and recall are? Recall attempts to answer “What proportion of actual positives was identified correctly?”

Precision attempts to answer “What proportion of positive identifications was actually correct?”

Data Science Interview Question Q3. What is bias-variance trade-off? Bias: Bias is an error introduced in your model due to oversimplification of the machine learning algorithm. It can lead to underfitting. When you train your model at that time model makes simplified assumptions to make the target function easier to understand. Low bias machine learning algorithms — Decision Trees, k-NN and SVM High bias machine learning algorithms — Linear Regression, Logistic Regression Variance: Variance is error introduced in your model due to complex machine learning algorithm, your model learns noise also from the training data set and performs badly on test data set. It can lead to high sensitivity and overfitting.

Data Science Interview Question Normally, as you increase the complexity of your model, you will see a reduction in error due to lower bias in the model. However, this only happens until a particular point. As you continue to make your model more complex, you end up over-fitting your model and hence your model will start suffering from high variance.

Data Science Interview Question Q4. What is a confusion matrix? The confusion matrix is a 2X2 table that contains 4 outputs provided by the binary classifier. Various measures, such as error-rate, accuracy, specificity, sensitivity, precision and recall are derived from it. Confusion Matrix

Mildain-Trainings Mildain-trainings offers Several trainings all over the world. At Mildain-trainings Training courses cover all the key topics in-depth learning & it will be Instructorled trainings as well, At Mildain-trainings all our trainers are having more than 12 years of practical industry experience which will help you to grow your career.

Mildain-Trainings The Key Features of the Mildaintrainings on training Courses are: • • • • • • •

Instructor LED Training Real-Time Learning Assignments 24×7 Expert Supports 100% Placement Support Classroom and Online Training Learn with Real-life examples.

Thank You For more information please visit our website https://mildaintrainings.com/