Are you a beginner in data science and machine learning looking to ace your job interview? This article is here to help. We have compiled a list of more than 20 interview questions that beginners can expect in their data science job interviews. The questions cover a wide range of topics, including data pre-processing, machine learning algorithms, evaluation metrics, and more. Each question comes with a detailed answer to help you prepare for your interview and showcase your knowledge and skills. With these questions and answers, you will be well-prepared to impress your potential employer and land your dream job in data science.
Q.1 What does one understand by the term Data Science?
Data Science is a field of study that involves the extraction of insights, knowledge, and meaning from structured and unstructured data using scientific methods, processes, algorithms, and systems.
It involves the use of statistical, mathematical, computational, and machine learning techniques to analyse and interpret data, and to derive actionable insights that can be used to make informed decisions, improve business operations, optimize processes, and drive innovation.
Data Science involves several steps, including data collection, data cleaning, data pre-processing, exploratory data analysis, feature engineering, model selection, model training, model evaluation, and deployment.
Data scientists work with large and complex datasets from various sources, including social media, IoT devices, web logs, sensors, and customer transactions, among others. They use programming languages such as Python, R, and SQL, as well as tools such as Jupyter notebooks, Apache Hadoop, Apache Spark, and machine learning frameworks like TensorFlow and scikit-learn.
Q.2 What is the difference between data analytics and data science?
Data analytics and data science are two related but distinct fields that are focused on working with data to uncover insights and drive decision-making. Here are some key differences between the two:
Focus: Data analytics is primarily focused on using statistical and computational methods to analyse existing data sets and identify patterns and trends. Data science, on the other hand, is focused on the entire data pipeline, from data collection and cleaning to analysis and modelling.
Methods: Data analytics typically involves working with structured data sets, such as those found in databases or spreadsheets, and using statistical methods like regression analysis or hypothesis testing to identify insights. Data science, on the other hand, often involves working with unstructured or semi-structured data, such as text or image data, and using machine learning and artificial intelligence techniques to develop predictive models.
Skillset: Data analytics typically requires proficiency in statistical analysis, data visualization, and data management tools like SQL or Excel. Data science, on the other hand, requires a broader range of skills, including programming in languages like Python or R, machine learning, and big data technologies like Hadoop and Spark.
Scope: Data analytics is often focused on solving specific business problems, such as optimizing marketing campaigns or improving supply chain efficiency. Data science, on the other hand, is often focused on developing more generalizable models and algorithms that can be applied to a wide range of data sets and problems.
In summary, data analytics and data science are both important fields that involve working with data, but they have different focuses, methods, skillsets, and scopes.
Q.3 What are some of the techniques used for sampling? What is the main advantage of sampling?
In data science, there are several techniques used for sampling, including:
Simple random sampling: A technique where each observation in the dataset has an equal chance of being selected for the sample.
Stratified sampling: A technique where the data is divided into strata or subgroups based on certain characteristics, and then a random sample is taken from each stratum.
Cluster sampling: A technique where the data is divided into clusters, and then a random sample of clusters is selected. All observations within the selected clusters are included in the sample.
Systematic sampling: A technique where a subset of observations is selected by using a fixed interval between observations.
Oversampling: A technique where the sample is intentionally skewed to oversample a particular subgroup in the population, such as a minority group, to ensure adequate representation.
The main advantage of sampling in data science is that it reduces the amount of data that needs to be processed, making analysis faster and cheaper. A well-designed sample is also representative of the population, which helps avoid introducing bias into the analysis. In addition, sampling makes it possible to test hypotheses and to estimate population parameters, such as means and variances, from the sample data. Finally, working with a smaller, carefully selected sample can improve data quality and lead to more accurate models and analyses.
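As a quick illustration, here is a minimal sketch of simple random sampling and stratified sampling using pandas and scikit-learn. The DataFrame and its "segment" column are invented for the example.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical dataset: 'segment' is the characteristic used for stratification.
df = pd.DataFrame({
    "segment": ["A"] * 800 + ["B"] * 200,
    "value": range(1000),
})

# Simple random sampling: every row has an equal chance of being selected.
simple_sample = df.sample(n=100, random_state=42)

# Stratified sampling: the sample preserves the 80/20 split of segments A and B.
stratified_sample, _ = train_test_split(
    df, train_size=100, stratify=df["segment"], random_state=42
)

print(simple_sample["segment"].value_counts())
print(stratified_sample["segment"].value_counts())
```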
Q.4 List down the conditions for Overfitting and Underfitting.
Overfitting and underfitting are two common issues in machine learning and data science. The conditions for overfitting and underfitting are as follows:
Conditions for Overfitting are:
The model has very high accuracy on the training data but performs poorly on new, unseen data.
The model is too complex for the amount of available training data.
The model fits the noise in the training data instead of the underlying patterns.
The model is too sensitive to the training data and cannot generalize well to new data.
The model has too many parameters relative to the number of training examples.
Conditions for Underfitting are:
The model has poor performance on both the training and test data.
The model is too simple and cannot capture the underlying patterns in the data.
The model is not complex enough to fit the training data.
The model has high bias and low variance, leading to underfitting.
The model is not trained for long enough, or the learning rate is too low.
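The two failure modes are easy to see in practice. The sketch below, which uses a synthetic dataset and decision trees purely for illustration, compares training and test accuracy for an overly complex model and an overly simple one.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unconstrained tree memorizes the training data (overfitting):
# near-perfect training accuracy, noticeably lower test accuracy.
deep_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
print("deep tree:", deep_tree.score(X_train, y_train), deep_tree.score(X_test, y_test))

# A one-level tree is too simple to capture the signal (underfitting):
# mediocre accuracy on both the training and the test set.
stump = DecisionTreeClassifier(max_depth=1, random_state=0).fit(X_train, y_train)
print("stump    :", stump.score(X_train, y_train), stump.score(X_test, y_test))
```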

Q.5 Differentiate between the long and wide format data.
In data science, there are two commonly used formats for storing and organizing data: long format and wide format.
The wide format is also known as the "unstacked" or "matrix" format: each variable is stored in its own column and each observation in its own row. This format is easy to read and allows quick comparisons across variables, which makes it convenient for reporting and spreadsheet-style analysis.
On the other hand, the long format, also known as the "stacked" or "tall" format, stores the data as a "long" table in which each row holds a single measurement (an identifier, the name of a variable, and its value), so each observation is spread over multiple rows. This format is well suited to tasks that involve grouping, aggregation, filtering, and statistical modelling, and it is the shape that many analysis and plotting libraries expect.
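A minimal pandas sketch of converting between the two formats; the column names are invented for the example.

```python
import pandas as pd

# Wide format: one row per subject, one column per measured variable.
wide = pd.DataFrame({
    "subject": ["s1", "s2"],
    "height": [170, 182],
    "weight": [65, 80],
})

# Wide -> long: each row now holds one (subject, variable, value) triple.
long = wide.melt(id_vars="subject", var_name="variable", value_name="value")

# Long -> wide: pivot the stacked table back into one column per variable.
wide_again = long.pivot(index="subject", columns="variable", values="value")

print(long)
print(wide_again)
```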
Q.6 What are Eigenvectors and Eigenvalues?
In data science, Eigenvectors and Eigenvalues are concepts from linear algebra that describe important characteristics of a dataset, such as the directions and magnitudes of its principal components.
An Eigenvector is a non-zero vector whose direction is left unchanged by a linear transformation; when applied to a dataset's covariance matrix, the Eigenvectors point along the directions of greatest variation in the data. The corresponding Eigenvalue is a scalar that measures how much variance the data has along that Eigenvector's direction.
In practice, the Eigenvectors and Eigenvalues are obtained by an eigendecomposition of the dataset's covariance matrix, which is the core of Principal Component Analysis (PCA). PCA transforms a set of correlated variables into a set of uncorrelated variables called principal components, which represent the most significant sources of variation in the dataset.
The Eigenvectors and Eigenvalues of a dataset can provide valuable insights into the underlying structure of the data. For example, the Eigenvectors can reveal the most significant features of the data, while the Eigenvalues can provide information about the amount of variance in the data that is captured by these features.
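A short NumPy sketch, using synthetic data, showing that the eigendecomposition of the covariance matrix yields the principal directions and the variance along each of them.

```python
import numpy as np

rng = np.random.default_rng(0)
# Two correlated features: most of the variance lies along one direction.
x = rng.normal(size=500)
X = np.column_stack([x, 0.5 * x + rng.normal(scale=0.3, size=500)])

# PCA boils down to the eigendecomposition of the covariance matrix.
cov = np.cov(X, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov)  # eigh: covariance matrices are symmetric

# Sort from largest to smallest eigenvalue; the top eigenvector is the first principal component.
order = np.argsort(eigenvalues)[::-1]
print("Eigenvalues (variance per direction):", eigenvalues[order])
print("First principal component:", eigenvectors[:, order[0]])
```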
Q.7 What does it mean when the p-values are high and low?
In statistical hypothesis testing, the p-value measures the strength of evidence against a null hypothesis. A low p-value indicates that data as extreme as what was observed would be unlikely if the null hypothesis were true, while a high p-value means the observed data is consistent with the null hypothesis and could easily have occurred by chance.
Specifically, a p-value less than or equal to the significance level (usually 0.05) is considered to be statistically significant, indicating strong evidence against the null hypothesis. In other words, a low p-value suggests that the null hypothesis can be rejected in favour of the alternative hypothesis.
On the other hand, a p-value greater than the significance level indicates weak evidence against the null hypothesis, and thus, the null hypothesis cannot be rejected.
Therefore, in data science, if the p-value is low, it suggests that the result is statistically significant and not due to chance, while if the p-value is high, the result is not statistically significant, and it may be due to chance.
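As a small illustration, the sketch below runs a two-sample t-test with SciPy on synthetic data; the groups and the 0.05 threshold are chosen only for the example.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Two hypothetical groups, e.g. page-load times under two different designs.
group_a = rng.normal(loc=10.0, scale=2.0, size=50)
group_b = rng.normal(loc=11.0, scale=2.0, size=50)

# Null hypothesis: the two groups have the same mean.
t_stat, p_value = stats.ttest_ind(group_a, group_b)
print(f"p-value = {p_value:.4f}")

# Compare against the chosen significance level (commonly 0.05).
if p_value <= 0.05:
    print("Reject the null hypothesis: the difference is statistically significant.")
else:
    print("Fail to reject the null hypothesis: the difference may be due to chance.")
```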
Q.8 When is resampling done?
Resampling is a technique commonly used in data science to estimate the performance of a machine learning model on unseen data or to validate a model. Resampling involves creating new training and testing sets from the original dataset, which can help to evaluate the performance of a model in a more robust and reliable way.
Resampling can be done at different stages of the data science process, depending on the specific task and the nature of the data. Here are some examples:
Cross-validation: Cross-validation is a resampling technique that involves splitting the data into multiple folds and using each fold as a validation set while training the model on the remaining data. This technique is commonly used to estimate the performance of a model and to tune hyperparameters.
Bootstrap: Bootstrap is a resampling technique that involves creating multiple random samples with replacement from the original dataset. This technique is commonly used to estimate the variability of a statistic, such as the mean or the standard deviation, and to construct confidence intervals.
Jackknife: Jackknife is a resampling technique that involves creating multiple subsamples from the original dataset by leaving out one observation at a time. This technique is commonly used to estimate the bias and variance of a statistic and to construct confidence intervals.
Overall, resampling is a powerful tool that can help to improve the reliability and accuracy of machine learning models in data science.
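Here is a minimal sketch of two of these techniques, cross-validation with scikit-learn and a bootstrap confidence interval with NumPy; the dataset and the number of resamples are arbitrary choices for the example.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Cross-validation: 5 folds, each used once as the validation set.
X, y = load_iris(return_X_y=True)
scores = cross_val_score(LogisticRegression(max_iter=1000), X, y, cv=5)
print("CV accuracy per fold:", scores)

# Bootstrap: resample with replacement to estimate the variability of the mean.
rng = np.random.default_rng(0)
data = rng.normal(loc=5.0, scale=2.0, size=200)
boot_means = [rng.choice(data, size=data.size, replace=True).mean() for _ in range(1000)]
print("Bootstrap 95% CI for the mean:", np.percentile(boot_means, [2.5, 97.5]))
```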
Q.9 What do you understand by Imbalanced Data?
In data science, imbalanced data refers to a situation where the distribution of classes in a dataset is not equal. Specifically, one class may have significantly more examples than the others. This is a common problem in many real-world applications, such as fraud detection, medical diagnosis, and rare event prediction, where the target class is often the minority class.
Imbalanced data can lead to biased models, where the model tends to favor the majority class and perform poorly on the minority class. This is because the model has seen many more examples of the majority class and therefore has a better understanding of it. As a result, the model may have difficulty identifying the minority class, which can lead to inaccurate predictions.
To address imbalanced data, various techniques are used in data science, such as resampling methods, cost-sensitive learning, and ensemble methods. These techniques aim to balance the distribution of classes in the dataset and improve the model's ability to predict the minority class accurately.
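One simple cost-sensitive option in scikit-learn is the class_weight parameter, shown in the sketch below on a synthetic 95/5 dataset; the exact numbers are only illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data: roughly 95% majority class, 5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" penalizes mistakes on the minority class more heavily.
model = LogisticRegression(class_weight="balanced", max_iter=1000)
model.fit(X_train, y_train)

# Accuracy alone is misleading here; look at per-class precision and recall instead.
print(classification_report(y_test, model.predict(X_test)))
```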
Q.10 Are there any differences between the expected value and mean value?
In data science, the terms "expected value" and "mean value" are often used interchangeably, as they both refer to the central tendency of a distribution. However, there is a subtle difference between the two concepts.
The expected value is a theoretical concept used in probability theory that represents the long-run average of a random variable over an infinite number of trials. It is denoted as E(X) and calculated by summing the product of each possible value of X with its corresponding probability.
On the other hand, the mean value is a statistical concept that represents the arithmetic average of a sample or population. It is denoted as μ and calculated by summing all the values in a dataset and dividing by the number of observations.
The expected value is a property of the underlying probability distribution, while the mean (in the sense of a sample mean) is computed from observed data. By the law of large numbers, the sample mean approaches the expected value as the number of observations grows, but for any finite sample the two will generally differ slightly.
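A fair six-sided die makes the distinction concrete: the worked example below (plain Python and NumPy) compares the theoretical expected value with the sample mean of simulated rolls.

```python
import numpy as np

# Theoretical expected value of a fair six-sided die:
# E(X) = sum over faces of value * probability = (1+2+3+4+5+6) * (1/6) = 3.5
expected_value = sum(face * (1 / 6) for face in range(1, 7))

# Sample mean of a finite number of observed rolls.
rng = np.random.default_rng(0)
rolls = rng.integers(1, 7, size=100)   # 100 simulated rolls (values 1 to 6)
sample_mean = rolls.mean()

print("Expected value:", expected_value)          # 3.5
print("Sample mean of 100 rolls:", sample_mean)   # close to, but not exactly, 3.5
```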
Q.11 What do you understand by Survivorship Bias?
Survivorship bias is a common issue in data science that occurs when only successful or surviving observations are taken into account, while failed or non-surviving observations are excluded from the analysis. This can lead to biased conclusions and incorrect decisions, as the excluded data could provide valuable insights and information.
For example, survivorship bias could occur when analysing the performance of a particular investment strategy by only considering the successful investments, while ignoring the unsuccessful ones. This could lead to the false conclusion that the strategy is effective, when in fact it may not be.
In data science, it's important to be aware of survivorship bias and take steps to mitigate its effects, such as including data from unsuccessful or non-surviving observations, or applying appropriate statistical methods to account for the bias.
Q.12 Define the terms KPI, lift, model fitting, robustness and DOE.
KPI (Key Performance Indicator): It is a measurable value that helps to assess how effectively an organization is achieving its key objectives. In data science, KPIs are used to track and evaluate the success of data-driven initiatives and strategies. Common KPIs in data science include metrics such as conversion rates, customer retention rates, revenue growth, and return on investment (ROI).
Lift: It is a measure of how much better a predictive model performs than random targeting. It is calculated as the ratio of the response rate of the group selected by the model to the response rate of the overall population. A lift value of 1 indicates that the model performs no better than random selection, while a lift value greater than 1 indicates that the model adds value.
Model Fitting: Model fitting is the process of adjusting the parameters of a statistical or machine learning model to minimize the difference between the predicted values of the model and the actual data. In other words, it is the process of finding the best possible model to fit the available data.
Robustness: It is the ability of a statistical or machine learning model to maintain its predictive accuracy and stability even when the data it is trained on is noisy or contains outliers. A robust model is less sensitive to variations in the data and is more likely to provide accurate predictions when applied to new data.
DOE (Design of Experiments): It is a statistical method used to determine the relationship between multiple factors (input variables) and a response variable in a system or process. The goal of DOE is to optimize the process or system by identifying the most important factors and their optimal settings. DOE is widely used in industries such as manufacturing, engineering, and quality control to improve product quality, reduce costs, and increase efficiency. In data science, DOE is used to optimize the performance of machine learning models by determining the optimal values of model hyperparameters.
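To make the lift definition concrete, here is a small NumPy sketch that computes lift at the top decile of model scores; the response rates and scores are simulated purely for illustration.

```python
import numpy as np

# Simulated campaign data: y_true = did the customer respond (1/0),
# scores = model-predicted probability of responding.
rng = np.random.default_rng(0)
y_true = rng.binomial(1, 0.1, size=10_000)
scores = y_true * 0.3 + rng.random(10_000) * 0.7  # imperfect but informative scores

# Target the top 10% of customers ranked by model score.
cutoff = np.quantile(scores, 0.9)
targeted = scores >= cutoff

overall_rate = y_true.mean()
targeted_rate = y_true[targeted].mean()

# Lift = response rate in the targeted group / overall response rate.
print("Lift at the top decile:", targeted_rate / overall_rate)
```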
Q.13 Define confounding variables.
Confounding variables, also known as confounders, are variables that are related to both the independent variable(s) and the dependent variable(s) in a study. Confounding variables can potentially distort or obscure the true relationship between the independent and dependent variables. They are an important consideration in data science because failing to account for them can lead to inaccurate or misleading results.
For example, imagine a study investigating the relationship between exercise and heart health. If the study finds that people who exercise regularly have better heart health than those who don't, this relationship could be confounded by factors such as age, diet, smoking, and genetics. If these factors are not controlled for, they could explain some or all of the observed relationship between exercise and heart health, and the true relationship between these variables may be unclear.
To address the issue of confounding variables, researchers use a variety of techniques, such as stratification, matching, and statistical adjustment, to account for the effects of these variables and better isolate the relationship between the independent and dependent variables of interest.
Q.14 Define and explain selection bias?
Selection bias in data science occurs when certain groups or observations in a dataset are systematically excluded or underrepresented in the sample used for analysis, resulting in a distorted view of reality. This can occur at various stages of the data collection and analysis process, such as during the selection of participants, the measurement of variables, or the analysis of results.
For example, selection bias can occur when a study only includes participants from a specific geographic area or demographic group, leading to results that may not be generalizable to other populations. Similarly, selection bias can occur when certain variables are only measured for some participants but not others, leading to incomplete or biased conclusions.
Selection bias can also occur when participants are self-selected or volunteer for a study, as these individuals may differ systematically from those who do not volunteer, leading to a biased sample. This is known as self-selection bias.
To avoid selection bias, it is important to use appropriate sampling methods to ensure that the sample is representative of the population of interest, and to collect data on all relevant variables for all participants. Additionally, it is important to carefully consider the potential sources of bias in the analysis and to control for them using appropriate statistical techniques.
Q.15 Define bias-variance trade-off?
In data science, the bias-variance trade-off refers to the trade-off between the ability of a model to fit the training data well (low bias) and its ability to generalize to new data (low variance).
Bias refers to the difference between the expected (or average) predictions of a model and the true values, given a fixed set of input variables. High bias can lead to underfitting, where the model is too simple and fails to capture the true underlying relationships in the data.
Variance, on the other hand, refers to the variability of the model's predictions for different training sets. High variance can lead to overfitting, where the model is too complex and captures noise or random fluctuations in the training data, making it unable to generalize well to new data.
The trade-off arises because reducing bias typically involves increasing the complexity of the model, which increases the variance. Conversely, reducing variance typically involves simplifying the model, which increases the bias. Therefore, finding the optimal balance between bias and variance is crucial for building a model that can perform well on both the training and test data.
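One common way to see the trade-off is to vary model complexity and watch training and test error, as in the NumPy sketch below; the sine-wave data and the polynomial degrees are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 40))
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.2, size=40)   # noisy training data
x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)                          # noise-free test targets

for degree in (1, 4, 15):
    coeffs = np.polyfit(x, y, degree)                 # fit a polynomial of this degree
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    # degree 1: high bias (underfits); degree 15: high variance (overfits);
    # an intermediate degree balances the two.
    print(f"degree={degree:2d}  train MSE={train_mse:.3f}  test MSE={test_mse:.3f}")
```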
Q.16 Define the confusion matrix?
The confusion matrix is a table commonly used to evaluate the performance of a classification model in data science. It cross-tabulates the model's predictions against the actual outcomes; for a binary classifier it contains four counts: true positives, true negatives, false positives, and false negatives. From these counts, metrics such as accuracy, precision, recall, and F1-score can be derived, and any systematic patterns in the model's errors become visible.
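A minimal sketch with scikit-learn, using made-up labels and predictions:

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and model predictions for a binary classifier.
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 1, 0]

# Rows are the actual classes, columns the predicted classes:
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
```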
Q.17 What is logistic regression? State an example where you have recently used logistic regression.
Logistic regression is a statistical method used in data science for predicting binary outcomes (i.e., yes/no, true/false, 0/1). It is a type of regression analysis where the dependent variable is categorical and the independent variables can be continuous, categorical, or a mixture of both. The aim of logistic regression is to estimate the probability that the dependent variable belongs to a certain category based on the values of the independent variables. A typical example is customer churn prediction: estimating the probability that a customer will cancel a subscription (yes/no) from their usage history and demographic features.
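Here is a minimal scikit-learn sketch on the built-in breast-cancer dataset, chosen only because it provides a convenient binary outcome:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Binary outcome: malignant vs benign tumour.
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000)
model.fit(X_train, y_train)

# predict_proba returns the estimated probability of each class.
print("Test accuracy:", model.score(X_test, y_test))
print("P(class=1) for the first test sample:", model.predict_proba(X_test[:1])[0, 1])
```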
Q.18 What is Linear Regression? What are some of the major drawbacks of the linear model?
Linear regression is a statistical method used in data science to model the relationship between a dependent variable (often denoted as "Y") and one or more independent variables (often denoted as "X"). It assumes a linear relationship between the variables and aims to find the best fitting line that explains the relationship between the two.
The equation for a simple linear regression model with one independent variable can be represented as:
Y = β0 + β1X + ε
where Y is the dependent variable, X is the independent variable, β0 is the intercept, β1 is the coefficient for X, and ε is the error term.
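As a quick illustration, the sketch below fits this simple linear model with scikit-learn on simulated data whose true intercept and slope are 2 and 3; the numbers are invented for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Simulated data following Y = 2.0 + 3.0 * X + noise.
rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=(100, 1))
y = 2.0 + 3.0 * X[:, 0] + rng.normal(scale=1.0, size=100)

model = LinearRegression().fit(X, y)
print("Estimated intercept (beta 0):", model.intercept_)  # close to 2.0
print("Estimated slope (beta 1):", model.coef_[0])        # close to 3.0
```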
Some of the major drawbacks of the linear model in data science include:
Linearity assumption: Linear regression assumes that the relationship between the variables is linear, which might not always be the case. In such situations, linear regression may not provide the best fit to the data.
Overfitting: Linear regression models with too many variables can overfit the data, meaning that the model is too complex and only fits the training data well but does not generalize well to new data.
Underfitting: On the other hand, linear regression models with too few variables may underfit the data and not capture the full relationship between the variables.
Outliers: Linear regression is sensitive to outliers, which can significantly affect the fit of the model.
Assumptions of independence and normality: Linear regression assumes that the errors are independent and normally distributed, which may not be the case in all situations. Violation of these assumptions can lead to inaccurate results.
Limited applicability: Linear regression is useful for modelling linear relationships, but it may not be applicable for modelling complex nonlinear relationships between variables.
Q.19 What is a random forest? Explain its working.
A random forest is a popular ensemble learning method used in data science for both classification and regression problems. It involves constructing multiple decision trees at training time and then using their collective output to make predictions on new data.
The working of a random forest algorithm can be explained in the following steps:
Data Preparation: The first step is to prepare the data by splitting it into training and testing sets. The training set is used to build the random forest model, while the testing set is used to evaluate its performance.
Building Decision Trees: The next step is to build a large number of decision trees, where each tree is trained on a random subset of the training data. This process is repeated multiple times to create a forest of decision trees.
Voting System: To make predictions on new data, the random forest algorithm aggregates the predictions of all the individual decision trees. This is done by taking the majority vote of the predicted class (in case of classification) or the average of the predicted values (in case of regression).
Feature Importance: Random forests can also be used to measure the importance of each feature in the data. This is done by calculating the total reduction of the impurity (or entropy) caused by each feature across all the decision trees in the forest.
Hyperparameter Tuning: Lastly, the performance of a random forest model can be further improved by tuning its hyperparameters. Some of the important hyperparameters include the number of decision trees, the maximum depth of each tree, and the minimum number of samples required to split a node.
Overall, random forests are a powerful and flexible machine learning algorithm that can handle a wide range of data types and perform well on both small and large datasets.
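A minimal scikit-learn sketch on the built-in iris dataset, showing the voting-based predictions and the feature importances described above; the hyperparameter values are chosen only for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# 100 decision trees, each trained on a bootstrap sample of the training data;
# the forest predicts by majority vote across the trees.
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X_train, y_train)

print("Test accuracy:", forest.score(X_test, y_test))
print("Feature importances:", forest.feature_importances_)
```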

Q.20 In a time interval of 15-minutes, the probability that you may see a shooting star or a bunch of them is 0.2. What is the percentage chance of you seeing at least one star shooting from the sky if you are under it for about an hour?
A 15-minute interval is one quarter of an hour, and the probability of seeing a shooting star in any such interval is given as 0.2. The probability of not seeing a shooting star in a 15-minute interval is therefore 1 - 0.2 = 0.8.
To calculate the probability of seeing at least one shooting star in an hour, we can use the complement rule. That is, we can calculate the probability of not seeing any shooting stars in an hour, and then subtract that probability from 1 to get the probability of seeing at least one shooting star.
Since an hour contains 4 such intervals, and assuming the intervals are independent of one another, the probability of not seeing any shooting stars in an hour is:
0.8 x 0.8 x 0.8 x 0.8 = 0.4096
Therefore, the probability of seeing at least one shooting star in an hour is:
1 - 0.4096 = 0.5904
So there is a 59.04% chance of seeing at least one shooting star in an hour.
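The same arithmetic in a few lines of Python, purely as a sanity check:

```python
# Probability of at least one sighting across four independent 15-minute intervals,
# each with a 0.2 chance of seeing a shooting star.
p_miss_one_interval = 1 - 0.2
p_miss_all_four = p_miss_one_interval ** 4   # 0.4096
p_at_least_one = 1 - p_miss_all_four         # 0.5904
print(f"{p_at_least_one:.2%}")               # 59.04%
```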
Q.21 What is deep learning? What is the difference between deep learning and machine learning?
Deep learning is a subset of machine learning that uses artificial neural networks to learn and make predictions on large and complex data sets. It is called "deep" learning because the neural networks used in this type of machine learning are typically composed of many layers, making them capable of processing vast amounts of data and recognizing patterns that would be difficult or impossible for humans to identify.
The main difference between deep learning and machine learning in data science is the complexity of the data that each is designed to handle. While machine learning algorithms can be trained on simpler, structured data sets with relatively few features, deep learning algorithms are more suited to unstructured data sets with many features, such as images, videos, or natural language text.
Another key difference is the way in which the algorithms are trained. Machine learning algorithms typically require the input of a human data scientist who selects and engineers features that the algorithm can use to make predictions. In contrast, deep learning algorithms can automatically extract features from raw data, without requiring explicit feature engineering by a human expert.
Overall, deep learning has emerged as a powerful tool for handling large and complex data sets, and it has been applied successfully in many areas, including computer vision, natural language processing, speech recognition, and robotics.
Q.22 What is Gradient and Gradient Descent?
In mathematics, a gradient is a vector that represents the slope of a function at a given point. It is the direction of the steepest ascent of the function at that point. In data science and machine learning, the gradient is commonly used in optimization algorithms to find the minimum of a function.
Gradient descent is an optimization algorithm used to find the minimum of a function. It is a method of iteratively adjusting the parameters of a model in the direction of the negative gradient of the loss function with respect to the model's parameters. The goal of gradient descent is to find the parameters that minimize the loss function and produce the best model performance. There are different variants of gradient descent, such as batch gradient descent, stochastic gradient descent, and mini-batch gradient descent, each with their own advantages and disadvantages depending on the problem and data set.
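A tiny, self-contained sketch of plain gradient descent on the one-parameter loss L(w) = (w - 3)^2; the loss, learning rate, and number of steps are chosen only for illustration.

```python
# Minimize L(w) = (w - 3)^2 with plain gradient descent.
# The gradient is dL/dw = 2 * (w - 3).
w = 0.0              # initial guess
learning_rate = 0.1

for step in range(100):
    gradient = 2 * (w - 3)
    w -= learning_rate * gradient   # step in the direction of the negative gradient

print("w after gradient descent:", w)   # converges towards the minimum at w = 3
```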
To better understand the topic of data science interview questions for beginners, we recommend checking out our related articles as well. These articles provide additional insights and tips on how to prepare for data science job interviews and what to expect during the interview process. By combining the information from those articles with the questions and answers provided here, you will be well-equipped to excel in your data science job interview.
We have a selection of videos that can be helpful when you're applying for jobs or looking to enhance your skills:
1. HiDevs Community | Enhance Your Profile with Expert Mentorship & Real-World Projects
- Watch here: Video 1
2. HiDevs | Elevate Your Resume | LinkedIn Profile | Cover Letter
- Watch here: Video 2
3. HiDevs Community Platform Overview | Job Support | Resume Building | LinkedIn Enhancement
- Watch here: Video 3
4. HiDevs | Unlock Your Tech Future | Create a Standout Resume | Optimize LinkedIn | Get Job Assistance | Access Upskill Programs for Free
- Watch here: Video 4
We also have a collection of short videos that you may find valuable: Short Videos
Book a free 1:1 mentorship call with our founder, Mr. Deepak Chawla, here; his LinkedIn profile is linked here for more information.