101 Data Science Interview Questions and Answers | Interview Prep

As the title suggests, below you will find 101 data science interview questions, each with an example of how you might answer. The questions are divided into 6 categories:

  • Stats, Probability, and Maths
  • Machine Learning
  • Programming
  • Experiential/Behavioral Questions
  • SQL
  • Data Wrangling

We hope this resource is able to help you ace your next interview!

Experiential/Behavioral Questions

What was the most challenging project you have worked on so far? Can you explain your learning outcomes?

It is worth preparing an answer in advance, since interviews are intimidating for most people and it is hard to formulate a well thought-out example on the spot. Keep the following points in mind while finalizing your answer:

  • Choose an appropriate example: Pick a project that’s most relevant to the responsibilities of the job you’re applying for.
  • Be Specific: Take the hiring manager through the process of the project. Break the project down into goals and milestones, explain how you achieved them, and describe your own responsibilities. If you were managing a group project, mention your communication and group management skills. Look at the keywords in the company's job description so that you know what they are looking for. For instance, if they are looking for a leader, explain your role as a leader in the project.
  • Explain Your Position Clearly: Highlight the outcomes of the project and your role in achieving them. Align your learning outcomes with the aims of the company you are applying to. In addition to the learning outcomes, the hiring manager should hear about the challenges you faced during the project and how you overcame them. To sum up, the project journey should communicate your willingness to learn and to overcome hurdles.

According to your judgement, does Data Science differ from Machine Learning?

Data Science

Data science is the processing and analysis of data to draw useful insights from it. For instance, when you log on to Netflix and browse shows and genres, you are generating data. These activities are tracked for each user, and the data is consumed by data scientists at the back end to understand customer behaviour. This is one of the reasons you see customized ads everywhere for a product you are currently searching for. This is one of the simplest applications of data science.

Machine Learning

Machine learning is just one part of the work done by data scientists. Data is generated in massive volumes, which makes it extremely cumbersome for a data scientist to work through by hand. A machine learning algorithm can learn from and process large data sets with little human intervention. For example, Facebook's News Feed relies on machine learning. The algorithm gathers behavioural information about a user by consistently tracking their activity, then uses that past behaviour to train itself to predict the user's interests and recommend content and notifications on the News Feed.

Give an example of a scenario in which you faced selection bias. How did you avoid it?

How would you describe Data Science to a Business Executive?

Stats, Probability and Maths

How would you select a representative sample of search queries from 5 million search queries?

Some key features need to be kept in mind while selecting a representative sample.

Diversity: A sample must be as diverse as the 5 million search queries. It should be sensitive to the local differences among the search queries and should preserve those features.

Consistency: We need to make sure that any change we see in our sample data is also reflected in the true population which is the 5 million queries.

Transparency: It is extremely important to decide the appropriate sample size and structure so that it is a true representative. These properties of a sample should be discussed to ensure that the results are accurate.

Discuss how to randomly select a sample from a product user population.

The sampling techniques to select a sample from a product user population can be divided into two categories:

Probability sampling methods

  • Simple Random Sampling
  • Stratified Sampling
  • Clustered Sampling
  • Systematic Sampling

Non-Probability sampling methods

  • Convenience Sampling
  • Snowball Sampling
  • Quota Sampling
  • Judgement Sampling
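
As a rough illustration of the first two probability methods, the sketch below draws a simple random sample and a proportionally stratified sample using only Python's standard library. The user records, the `platform` field and the helper name `stratified_sample` are assumptions made up for this example.

```python
import random
from collections import defaultdict

def stratified_sample(users, key, fraction, seed=0):
    """Sample the same fraction from every stratum defined by `key`."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for user in users:
        strata[user[key]].append(user)
    sample = []
    for group in strata.values():
        k = max(1, round(len(group) * fraction))  # at least one per stratum
        sample.extend(rng.sample(group, k))
    return sample

# Hypothetical population: 1,000 users, 70% on Android and 30% on iOS.
users = [{"id": i, "platform": "android" if i % 10 < 7 else "ios"}
         for i in range(1000)]

simple = random.Random(0).sample(users, 100)               # simple random sampling
stratified = stratified_sample(users, "platform", 0.10)    # stratified sampling
```

The stratified sample keeps the 70/30 platform split of the population exactly, whereas the simple random sample only matches it on average.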

What is the importance of Markov Chains in Data Science?

Markov chains can be used in marketing analytics. A Markov chain is a stochastic model describing a sequence of possible events that are probabilistically related to each other. The probability of the upcoming event depends only on the present state and not on the previous states. This is called the memoryless (Markov) property: the model disregards past events and uses only the present information to predict the next state. For instance, imagine you run an online selling platform and would like to know whether a customer is at the stage of "considering buying a product" or "purchasing a product". These are states the customer can be in at any point in their purchase journey. A Markov chain comes in handy here: given the current state and the transition probabilities of moving from one state to another, we can predict the next stage. In this case, we can predict how likely a customer is to buy the specified product.
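
The purchase-journey idea can be sketched with a small transition table. The state names (including the extra `browsing` state) and every probability below are invented for illustration; a real model would estimate them from customer data.

```python
# Hypothetical states and transition probabilities for a purchase journey.
transitions = {
    "browsing":    {"browsing": 0.6, "considering": 0.3, "purchasing": 0.1},
    "considering": {"browsing": 0.2, "considering": 0.5, "purchasing": 0.3},
    "purchasing":  {"browsing": 0.0, "considering": 0.0, "purchasing": 1.0},
}

def step(distribution):
    """Advance a probability distribution over states by one transition."""
    nxt = {state: 0.0 for state in transitions}
    for state, prob in distribution.items():
        for target, p in transitions[state].items():
            nxt[target] += prob * p
    return nxt

# Memoryless property: the next distribution depends only on the current one.
current = {"browsing": 0.0, "considering": 1.0, "purchasing": 0.0}
after_one = step(current)      # purchasing probability: 0.3
after_two = step(after_one)    # purchasing probability: 0.47
```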

How do you prove that males are on average taller than females by knowing just gender or height?

We can use null and alternative hypotheses, the basis of statistical significance testing. First, compare the sample mean of male heights with the sample mean of female heights. The null hypothesis states that the mean male height and the mean female height are the same. The alternative hypothesis states that the mean male height is greater than the mean female height. A one-tailed hypothesis test can be used to decide whether to reject the null hypothesis, and the p-value tells us whether the observed difference is statistically significant. Strictly speaking, the test does not prove the claim; it shows whether the data provide strong evidence against the null hypothesis.
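
A minimal sketch of such a one-tailed test follows. To stay within the standard library it uses a permutation test rather than the more common t-test, and the height values are made up for illustration.

```python
import random
from statistics import mean

def one_sided_permutation_test(a, b, n_permutations=5000, seed=0):
    """Estimate P(mean(a) - mean(b) >= observed) under H0: no difference."""
    rng = random.Random(seed)
    observed = mean(a) - mean(b)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)                      # relabel the groups at random
        diff = mean(pooled[:len(a)]) - mean(pooled[len(a):])
        if diff >= observed:
            hits += 1
    return (hits + 1) / (n_permutations + 1)

# Made-up heights in centimetres, for illustration only.
male = [178, 182, 175, 169, 180, 185, 177, 173, 181, 176]
female = [165, 160, 170, 158, 163, 167, 161, 166, 159, 164]

p_value = one_sided_permutation_test(male, female)
reject_null = p_value < 0.05   # 0.05 is an assumed significance level
```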

What is the difference between Maximum Likelihood Estimation(MLE) and Maximum A Posteriori(MAP)?

Maximum Likelihood Estimation (MLE) and Maximum A Posteriori (MAP) are both methods for estimating a parameter in the setting of probability distributions or graphical models. MAP usually comes up in a Bayesian setting because, as the name suggests, it works on the posterior distribution rather than only the likelihood, as MLE does. If you have useful prior information, the posterior distribution will be more informative than the likelihood function. Comparing the MLE and MAP objectives, the only difference is the inclusion of the prior P(θ) in MAP; otherwise they are identical. In other words, the likelihood is weighted by the prior.
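
For reference, the two objectives can be written side by side. The notation (θ for the parameter, X for the observed data) is an assumption, since the text above does not fix any symbols beyond P(θ).

```latex
\begin{aligned}
\hat{\theta}_{\text{MLE}} &= \arg\max_{\theta} \; P(X \mid \theta) \\
\hat{\theta}_{\text{MAP}} &= \arg\max_{\theta} \; P(\theta \mid X)
                           = \arg\max_{\theta} \; P(X \mid \theta)\, P(\theta)
\end{aligned}
```

The evidence P(X) is dropped from the MAP objective because it does not depend on θ. With a uniform prior, P(θ) is constant and MAP reduces to MLE.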

What does P-Value mean?

The p-value is used to assess statistical significance when testing a null hypothesis. It stands for probability value and is the probability of observing a result at least as extreme as the one obtained, assuming the null hypothesis is true. A small p-value indicates that the result would be unlikely to occur by chance alone if the null hypothesis held; such results are called statistically significant. A large p-value indicates that the result is within chance or normal sampling error, so the test is not significant and we fail to reject the null hypothesis.

Define the Central Limit Theorem (CLT) and its applications.

To make statistical inferences about data, it is important to understand the Central Limit Theorem. It states that the distribution of the sample mean approaches a normal distribution as the sample size grows, regardless of the shape of the population distribution, provided the population has finite variance. The theorem lets us quantify the probability that a random sample will deviate from the population without having to take a new sample to compare it with. Because of this theorem, we don’t need the characteristics of the whole population to understand how likely our sample is to be representative of it. Confidence intervals, hypothesis testing and p-value analysis are all based on the CLT. In a nutshell, the CLT lets us make inferences about a population from a sample.
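
A quick simulation can make the theorem concrete. The sketch below assumes an exponential population (chosen only because it is clearly non-normal) and checks that the sample means cluster around the population mean with a spread close to σ/√n.

```python
import random
from statistics import mean, stdev

rng = random.Random(0)
n = 50          # size of each sample
trials = 2000   # number of samples drawn

# Exponential population with rate 1: skewed, mean 1, standard deviation 1.
sample_means = [mean([rng.expovariate(1.0) for _ in range(n)])
                for _ in range(trials)]

center = mean(sample_means)    # expected to be close to the population mean, 1
spread = stdev(sample_means)   # expected to be close to 1 / sqrt(n), about 0.141
```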

Machine Learning

The world is filled with data in the form of images, videos, audio and text. Machine learning is responsible for deriving meaning from this data. It is a tool you can employ to answer questions about your data. Machine learning is the science of getting computers to act without being explicitly programmed. The primary aim is to allow computers to learn automatically, without human intervention or assistance, and adjust their actions accordingly.

Explain Logistic Regression and its assumptions.

Logistic regression is a go-to method for classification. It models the probability of the default class (e.g. the first class). It uses the sigmoid function, which takes any real-valued number and maps it to a value between 0 and 1, to predict the output class. Logistic regression comes in two main types: binary and multinomial. Binary logistic regression deals with two categories, whereas multinomial logistic regression deals with three or more.


Binary logistic regression requires the dependent variable to be binary. The independent variables should be independent of each other; that is, the model should have little or no multicollinearity. The independent variables should also be linearly related to the log odds.
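
As a small sketch of how the sigmoid turns a linear combination of inputs into a class probability, consider a single-feature model. The coefficient values and the 0.5 cut-off are hypothetical; in practice the coefficients are estimated from data.

```python
import math

def sigmoid(z):
    """Map any real number to the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(x, intercept, slope):
    """P(y = 1 | x) for a one-feature binary logistic regression."""
    return sigmoid(intercept + slope * x)

# Hypothetical coefficients chosen so that the log odds at x = 2 are zero.
p = predict_proba(x=2.0, intercept=-1.0, slope=0.5)   # log odds 0, so p = 0.5
label = 1 if p >= 0.5 else 0
```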

Explain Linear Regression and its assumptions.

Linear regression is useful for finding the relationship between two continuous variables. One is the predictor or independent variable and the other is the response or dependent variable.


There are 5 basic assumptions of linear regression:

  1. Linear relationship: The relationship between the dependent and independent variables is linear.
  2. Multivariate normality: Multiple regression assumes that the residuals are normally distributed.
  3. No or little multicollinearity between the independent variables.
  4. No autocorrelation: The residuals should be independent of each other. Autocorrelation occurs when the value of a variable is correlated with its own values for related observations, such as consecutive points in a time series. It violates the assumption of instance independence, which underlies most conventional models.
  5. Homoscedasticity: The variance around the regression line is the same for all values of the predictor variable.
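
A minimal sketch of fitting a one-predictor linear regression by ordinary least squares is shown below. The data points are made up, and the residuals it computes are what one would inspect when checking the normality, autocorrelation and homoscedasticity assumptions.

```python
def fit_simple_linear_regression(x, y):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Made-up data lying close to the line y = 1 + 2x.
x = [1, 2, 3, 4, 5]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
intercept, slope = fit_simple_linear_regression(x, y)

# Residuals: observed value minus fitted value for each point.
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
```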

What is cross validation?

Cross-validation is a technique for evaluating predictive models. It partitions the original sample into a training set to train the model and a test set to evaluate it. In k-fold cross-validation, the original sample is randomly partitioned into k equal-size subsamples. Of the k subsamples, a single subsample is retained as the validation data for testing the model, and the remaining k-1 subsamples are used as training data. The process is then repeated k times, with each of the k subsamples used exactly once as the validation data. The k results from the folds can then be averaged to produce a single estimate. The advantage of this method is that all observations are used for both training and validation, and each observation is used for validation exactly once. This helps reduce bias in the performance estimate, because every observation from the original dataset has the chance of appearing in both the training and the test set.
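
The partitioning step can be sketched as follows. The function name `k_fold_indices` is an assumption for this example; it only produces the index splits and leaves model training and scoring to the caller.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_indices, validation_indices) for each of k folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]   # k near-equal subsamples
    for i in range(k):
        validation = folds[i]
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, validation

splits = list(k_fold_indices(n_samples=10, k=5))
```

Each observation appears in exactly one validation fold and in the training set of the other k-1 folds.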

Describe Binary Classification.

Binary classification is the process of predicting the class of a given set of data points. These classes are also known as targets/ labels. This predictive modeling approximates a mapping function (f) from input variables (X) to discrete output variables (y). For example, spam detection in email service providers can be identified as a classification problem. This is a binary classification since there are only 2 classes: spam and not spam.

What is bias-variance trade off?

Bias and variance are components of a model's prediction error. A model with high bias pays very little attention to the training data and oversimplifies the problem, leading to underfitting. A model with high variance pays a lot of attention to the training data and does not generalize well to unseen data, leading to overfitting. A proper understanding of these errors helps us not only to build accurate models but also to avoid the mistakes of overfitting and underfitting.

Underfitting/Bias: Bias error is the difference between the expected (average) prediction of the model and the true value. Imagine repeating the model-building process more than once, each time with a new variation of the data. Because of the randomness in the underlying data sets, we get a set of predictions for each point. Bias measures how far these predictions deviate, on average, from the true value we are trying to predict.

Overfitting/Variance: Variance error is the variability of a model's prediction for a given data point when the model is rebuilt on different data sets. It indicates the model's sensitivity to small variations in the training data. For instance, if a model has high variance, small changes in the training data can result in large changes in its predictions.

There is no analytical way to measure the point at which we can achieve the bias-variance tradeoff. To figure it out, it's essential to explore the complexity of the model and measure the prediction error in order to minimize the overall error.

What is the use of regularization? What are the differences between L1 and L2 regularization?

Regularization is a technique that reduces generalization error by penalizing model complexity, so that the function fitted to the training set does not overfit.

The key difference between L1 and L2 regularization is the penalty term. Lasso regression (Least Absolute Shrinkage and Selection Operator), which uses L1 regularization, adds the absolute value of each coefficient's magnitude as a penalty term to the loss function. Ridge regression, which uses L2 regularization, adds the squared magnitude of each coefficient as a penalty term to the loss function. Another difference is that Lasso can shrink the coefficients of less important features all the way to zero, removing those features altogether. It therefore works well for feature selection when we have a huge number of features.
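
Written out for a squared-error loss, the two penalised objectives look like this. The symbols are assumptions made for illustration: β_j are the model coefficients, ŷ_i the predictions, and λ ≥ 0 the regularization strength.

```latex
\begin{aligned}
\text{Lasso (L1):} \quad & J(\beta) = \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2 + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert \\
\text{Ridge (L2):} \quad & J(\beta) = \sum_{i=1}^{n} \bigl(y_i - \hat{y}_i\bigr)^2 + \lambda \sum_{j=1}^{p} \beta_j^{2}
\end{aligned}
```

Setting λ = 0 recovers the unregularized loss in both cases; larger values of λ shrink the coefficients more strongly.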

Why is the Rectified Linear Unit (ReLU) a good activation function?

ReLUs are better for training deep neural networks than the traditional sigmoid or hyperbolic tangent activation functions because they help address the problem of vanishing gradients. With sigmoid or tanh, learning is very slow for inputs of large magnitude because the gradient there is small. When a neuron's activation saturates at either end of its range, the gradient in that region is close to 0. During backpropagation, this local gradient is multiplied by the gradient flowing back from the layers above. Hence, if the local gradient is very small, the gradients gradually vanish and almost no signal flows through the neuron to its weights. ReLU has a gradient of exactly 1 for every positive input, so it does not saturate in that region, and networks using it tend to learn faster. ReLUs are typically used in the hidden layers of deep neural networks.
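
The effect can be illustrated numerically. The sketch below multiplies the local activation gradients across several layers; it deliberately ignores the weight matrices, and the depth and pre-activation value are assumptions chosen only to show the trend.

```python
import math

def sigmoid_grad(z):
    """Derivative of the sigmoid: at most 0.25, near 0 for large |z|."""
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)

def relu_grad(z):
    """Derivative of ReLU: exactly 1 for every positive input."""
    return 1.0 if z > 0 else 0.0

# Product of local gradients across 10 layers, each with pre-activation z = 2.
depth, z = 10, 2.0
sigmoid_chain = sigmoid_grad(z) ** depth   # shrinks towards zero
relu_chain = relu_grad(z) ** depth         # stays at 1.0
```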

Why do we use feature selection while modelling?

The features used to train a machine learning model have a huge influence on its performance. Some features will be more influential than others on model accuracy. Irrelevant features can increase the complexity of the model and add noise to the data, which can hurt performance. Features may also be redundant if they are highly correlated with another feature; these can be removed from the data set with little or no loss of information. Feature selection methods identify and remove irrelevant and redundant attributes that do not contribute to the accuracy of a predictive model. Moreover, variable selection reduces the number of dimensions and so mitigates the curse of dimensionality. With fewer features, training requires less memory and computational power, which leads to shorter training times and reduces the common problem of overfitting.

What are the Naive Bayes fundamentals?

Naive Bayes is a probabilistic machine learning model that’s primarily used for text classification. It learns the probability that an object with certain features belongs to a particular class. The algorithm is called "naive" because it assumes that the occurrence of a feature is independent of the occurrence of the other features. The classifier is based on Bayes' theorem, which gives us a way to calculate a conditional probability, that is, the probability of an event A given prior knowledge of related events. There are essentially three types of Naive Bayes:

  1. Multinomial Naive Bayes: It is used when we have discrete data. With respect to text classification, if the words can be represented in terms of their occurrences/frequency count, then use this method.
  2. Bernoulli Naive Bayes: It assumes that the input features are binary, with only two categories (e.g. 0 can represent that a word is not present in the document and 1 that it is present). If you just care about the presence or absence of a particular word in the document, use Bernoulli classification.
  3. Gaussian Naive Bayes: It is used when the features are continuous and are assumed to follow a normal distribution within each class. For example, the features of the Iris dataset are sepal width, petal width, sepal length and petal length, all of which are continuous measurements.
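
The role of the independence assumption is easiest to see in the formula. In the sketch below, C denotes a class and x_1, …, x_n the features; this notation is an assumption, as the text above does not introduce symbols.

```latex
P(C \mid x_1, \dots, x_n)
  = \frac{P(C)\, P(x_1, \dots, x_n \mid C)}{P(x_1, \dots, x_n)}
  \;\propto\; P(C) \prod_{i=1}^{n} P(x_i \mid C)
```

The first equality is Bayes' theorem; the final step is where the "naive" assumption that features are conditionally independent given the class is applied. The three variants listed above differ only in how they model each P(x_i | C).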

What are time series forecasting techniques?

The following are some of the most common time series methods:

  1. Simple moving average: A simple moving average (SMA) is the simplest forecasting technique. It is calculated by adding up the values of the last ‘n’ periods and dividing the total by ‘n’. The moving average value is then used as the forecast for the next period.
  2. Exponential Smoothing: Exponential Smoothing assigns exponentially decreasing weights as the observations get older.
  3. Autoregressive Integrated Moving Average (ARIMA): This is a statistical technique that uses time series data to predict future values. The parameters of an ARIMA model are (p, d, q), which refer to the autoregressive order, the degree of differencing (the integrated part) and the moving-average order, respectively. ARIMA handles trends and other non-stationary behaviour through differencing; seasonal patterns are usually handled by its seasonal extension, SARIMA.
  4. Neural networks: They are also used for time series forecasting. There is an increasing interest in using neural networks to model and forecast time series.
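
The first two techniques are simple enough to sketch directly. The demand figures and the smoothing factor `alpha` below are made up for illustration, and the exponential smoothing shown is the basic single-parameter form.

```python
def simple_moving_average_forecast(series, n):
    """Forecast the next period as the mean of the last n observations."""
    return sum(series[-n:]) / n

def exponential_smoothing_forecast(series, alpha):
    """Forecast the next period with simple exponential smoothing."""
    level = series[0]
    for value in series[1:]:
        # Recent observations get weight alpha; older ones decay geometrically.
        level = alpha * value + (1 - alpha) * level
    return level

# Made-up monthly demand figures.
demand = [10, 12, 14, 16, 18]
sma = simple_moving_average_forecast(demand, n=3)          # (14 + 16 + 18) / 3 = 16.0
ses = exponential_smoothing_forecast(demand, alpha=0.5)    # 16.125
```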

What are neural networks?

A neural network is like a black box that takes one or more inputs and processes them into an output. The network itself consists of many small units called neurons, which are grouped into several layers. Neurons in one layer are connected to the neurons in the next layer through weighted connections. Neural networks can be used for both regression and classification tasks involving image, audio, video and text data. They are used in both supervised and unsupervised learning.

What experiment would you run to implement new features on Facebook?

A/B testing can be used to check how the general audience responds to new features. It is valuable because different audiences behave, well, differently: something that works for one company may not necessarily work for another. A/B testing is an experiment in which you "split" your audience to test a number of variations of a campaign or new feature and determine which performs better. In other words, you can show version A of a feature to one half of your audience and version B to the other half, then compare the success rates of the two versions.

If a Product Manager says that they want to double the number of ads in Facebook's Newsfeed, how would you figure out if this is a good idea or not?

You can use A/B testing to draw a conclusion about the success rate of the ads. A/B testing means experimenting with and comparing two variations of an online or offline campaign element, such as an ad text, a headline or the ads themselves. For example, one set of users can be shown double the usual number of ads on their News Feed while a second set continues to see the existing number of ads. The reactions of both sets can be recorded using an appropriate feedback system. This approach helps the company decide what percentage of the audience is comfortable seeing the ads and responding to them. With a sufficiently large effect, even a relatively small sample in an A/B test can provide significant, actionable results about which changes are most engaging for users.
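
One common way to compare the two groups is a two-proportion z-test, sketched below with the standard library. The choice of test, the metric (users who clicked at least one ad) and all the counts are assumptions for illustration; a real analysis would also track negative signals such as session length or churn.

```python
import math

def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for a difference between two proportions."""
    p_a, p_b = successes_a / n_a, successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / se
    p_value = math.erfc(abs(z) / math.sqrt(2))   # two-sided normal tail area
    return z, p_value

# Made-up counts: group A sees the usual ad load, group B sees double.
z, p_value = two_proportion_z_test(successes_a=200, n_a=5000,
                                   successes_b=260, n_b=5000)
```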

What are different metrics to classify a dataset?

The performance metrics for classification problems are as follows:

  1. Confusion Matrix
  2. Accuracy
  3. Precision and Recall
  4. F1 Score
  5. AUC-ROC Curve.

The choice of performance metric depends on the type of question and the dataset. For instance, if the dataset is balanced, accuracy is a good measure of model performance. A confusion matrix is a good alternative if you want to know the cost of false positives and false negatives.
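
The first four metrics can all be derived from the confusion-matrix counts, as the sketch below shows for a binary problem. The labels and predictions are made up for illustration.

```python
def classification_metrics(y_true, y_pred):
    """Confusion-matrix counts plus accuracy, precision, recall and F1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == 1 and p == 1 for t, p in pairs)
    tn = sum(t == 0 and p == 0 for t, p in pairs)
    fp = sum(t == 0 and p == 1 for t, p in pairs)
    fn = sum(t == 1 and p == 0 for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"tp": tp, "tn": tn, "fp": fp, "fn": fn,
            "accuracy": (tp + tn) / len(pairs),
            "precision": precision, "recall": recall, "f1": f1}

y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]
metrics = classification_metrics(y_true, y_pred)
```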

How will you design the heatmap for Uber drivers to provide recommendation on where to wait for passengers? How would you approach this?

To design the heatmap, some of the pointers are listed as follows:

  • You can use k-means clustering to group previous journeys in similar areas.
  • Perform exploratory data analysis to analyse how long it took a driver to find the client once they arrived at the pick-up location.
  • Additionally the model can use maps to identify whether it is possible to pick up people at those points or not in terms of practicality. For instance, it would be inconvenient to pick up people from busy markets so a nearby pickup point should be suggested to ensure efficiency and quick service.

What is AUC - ROC Curve?

When we need to evaluate or visualize the performance of a classification model, most commonly a binary one, we use the AUC (Area Under the Curve) ROC (Receiver Operating Characteristic) curve. The ROC curve plots the true positive rate against the false positive rate at different thresholds, and the AUC represents the degree of separability. It tells us how well the model can distinguish between classes such as spam and not spam. The higher the AUC, the better the model is at predicting spam email as spam and non-spam email as non-spam. A highly accurate model has an AUC close to 1, which reflects a good measure of separability. An AUC near 0.5 means the model cannot separate the classes at all, and an AUC near 0 means it is systematically predicting the opposite class.
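
AUC has a useful interpretation: it is the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one. The sketch below computes it directly from that definition for a binary problem; the labels and scores are made up, and the pairwise loop is meant for clarity rather than efficiency.

```python
def roc_auc(y_true, scores):
    """AUC as the fraction of (positive, negative) pairs ranked correctly."""
    positives = [s for t, s in zip(y_true, scores) if t == 1]
    negatives = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(positives) * len(negatives))

y_true = [1, 1, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
auc = roc_auc(y_true, scores)    # 8 of 9 pairs are ordered correctly
```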

What is the difference between bagging and boosting?

Bagging and boosting are similar in that they are both ensemble techniques, in which a set of weak learners is combined to create a strong learner that performs better than any single one. In bagging, each model is trained independently and in parallel on a bootstrap sample of the data, and the outputs are aggregated at the end without preference for any model. Boosting, meanwhile, is all about "teamwork": the models are trained sequentially, and each new model concentrates on the examples that the previous models handled poorly. Which technique to use depends on the data.

What is k-means?

K-means is an unsupervised clustering algorithm. It starts by choosing k centroids at random and assigns every data point to its nearest centroid. It then moves each centroid to the mean of the points assigned to it and repeats the two steps until the assignments stop changing, keeping the distances between points and their cluster centroid as small as possible. The result is that the unlabelled input data is grouped into distinct clusters.
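
A minimal sketch of the assign-then-update loop on one-dimensional data follows. The function name, the fixed iteration count and the data are assumptions for illustration; library implementations add smarter initialization and convergence checks.

```python
import random

def k_means_1d(points, k, iterations=20, seed=0):
    """Lloyd's algorithm on 1-D data: assign to nearest centroid, then update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)          # random initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda c: abs(x - centroids[c]))
            clusters[nearest].append(x)
        # Move each centroid to the mean of its cluster (keep it if empty).
        centroids = [sum(c) / len(c) if c else centroids[i]
                     for i, c in enumerate(clusters)]
    return sorted(centroids)

points = [1.0, 1.2, 0.8, 9.0, 9.5, 10.5]
centroids = k_means_1d(points, k=2)   # roughly [1.0, 9.67]
```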

How does a neural network with one layer and one input and output compare to a logistic regression?

Neural networks and logistic regression are both used for classification problems. Logistic regression can be seen as the simplest form of neural network, a single unit with a sigmoid activation, which produces straightforward linear decision boundaries. Neural networks are a superset that can represent more complex decision boundaries to cater for larger and more complex data. Logistic regression models cannot capture complex non-linear relationships between the features, whereas a neural network with hidden layers and non-linear activation functions can.

If the model is not perfect, how would you like to select the threshold so that the model outputs 1 or 0 for label?

To make a decision, we need to understand the consequences of choosing a particular decision boundary. You need to find out the relative cost of a false positive versus a false negative. A precision-recall curve for the model can be plotted on your validation data to see how the trade-off changes with the threshold. For instance, if you accidentally label a true potential customer as negative, you will lose that customer. This analysis helps in deciding on the right threshold for the model.
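
One way to turn this reasoning into a procedure is to score every candidate threshold by its total misclassification cost on validation data. The cost values, labels and predicted probabilities in the sketch below are all assumptions for illustration.

```python
def choose_threshold(y_true, scores, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost."""
    best_threshold, best_cost = None, float("inf")
    for threshold in sorted(set(scores)):
        pairs = list(zip(y_true, scores))
        fp = sum(t == 0 and s >= threshold for t, s in pairs)
        fn = sum(t == 1 and s < threshold for t, s in pairs)
        cost = cost_fp * fp + cost_fn * fn
        if cost < best_cost:
            best_threshold, best_cost = threshold, cost
    return best_threshold, best_cost

# Made-up validation labels and predicted probabilities.
y_true = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.3, 0.4, 0.6, 0.7, 0.9]

# Assume losing a real customer (false negative) costs five times a false alarm.
threshold, cost = choose_threshold(y_true, scores, cost_fp=1, cost_fn=5)
```

With these numbers the cheapest choice is a threshold of 0.4, which accepts one false positive in order to miss no true customers.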

Describe linear regression vs. logistic regression.

Linear regression is used to describe the relationship between independent variables and a continuous dependent variable, whereas logistic regression is used to model a categorical dependent variable.

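Explain PCA and its assumptions.

Principal component analysis (PCA) is a dimensionality reduction technique for large data sets. It transforms a large set of variables into a smaller one that still contains most of the information in the original set, and it is often used to make data easier to explore and visualize. PCA makes no explicit distributional assumptions, but it implicitly assumes that the important structure in the data is linear and is captured by the directions of greatest variance. Because it is sensitive to the scale of the variables, features are usually standardized first.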


What are the factors in the algorithm Uber uses to assign rides to drivers?

The following list of factors can be used to assign rides to drivers:

  1. Drivers who are online at the time of the request.
  2. Drivers who have never been rated lower than 3/4 by the passenger making the request.
  3. Drivers who are closest to the requesting passenger.
  4. Drivers who don’t have a destination filter set that excludes the passenger’s destination.

What is the role of a cost function?

A cost function is used to learn the parameters of a machine learning model such that the total error is as small as possible. It is a measure of how wrong the model is in its ability to estimate the relationship between the dependent and independent variables. The function is typically expressed in terms of the difference between the predicted value and the actual value. Every algorithm can have its own cost function, depending on the problem.
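
Mean squared error is one common example of a cost function for regression. The sketch below, with made-up values, shows that a closer fit produces a lower cost.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between actual and predicted values."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

actual = [3.0, 5.0, 7.0]
good_fit = [2.5, 5.0, 7.5]
poor_fit = [0.0, 9.0, 4.0]

cost_good = mean_squared_error(actual, good_fit)   # (0.25 + 0 + 0.25) / 3
cost_poor = mean_squared_error(actual, poor_fit)   # (9 + 16 + 9) / 3
```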

Explain advantages and drawbacks of Support Vector Machines(SVM).

Advantages

  • It has a regularisation parameter, which can be tweaked to avoid over-fitting.
  • SVM uses the kernel trick, so you can build a modified version of the model depending on the complexity of the problem.

Drawbacks

  • Choosing the right kernel for the problem type is tricky. Kernel models are usually quite sensitive to over-fitting, so a lot of knowledge is required to make sound decisions.
  • It is difficult to tune the hyperparameters so that the error is minimized.

How does a logistic regression model know what the coefficients are?

The coefficients are estimated from the training data, typically by maximum likelihood estimation: the model picks the values that make the observed labels most probable. To see what the fitted coefficients mean, first consider the case where the input variable is continuous. The first coefficient is the y-axis intercept: when the input feature is 0, the log(odds of the output variable) equals the intercept. The second coefficient is the slope: it represents the change in the log(odds of the output variable) for every one-unit increase along the x-axis. Now consider the case where the input variable is discrete. Take the example where a mouse is "obese" or "not obese", and the independent variable is whether the mouse has a normal gene or a mutated gene. In this case the first coefficient (the intercept) is the log(odds of obesity) for mice with the normal gene, and the second coefficient is the log odds ratio, which tells us, on a log scale, how much having the mutated gene increases or decreases the odds of being obese.

Is random weight assignment better than assigning the same weights to the units in the hidden layer?

To answer this question, consider what happens when all the weights are assigned the same value. Neural networks use gradient descent to optimize their parameters and reduce the cost function, so they need an initialization point from which to move towards a local minimum. If every unit in a hidden layer starts with the same weights, every unit computes the same output and receives the same gradient, so the units stay identical after each iteration and the layer behaves like a single unit. Starting from the same point every time also leads to the same solution every time. With random initialization, each unit starts from a different point and can learn a different feature, which gives the network a better chance of finding a good local minimum and reducing the cost. This technique is known as breaking the symmetry.

Formulate Latent Semantic Analysis(LSA) and Latent Dirichlet Allocation(LDA) techniques.

Latent Semantic Analysis (LSA) and Latent Dirichlet Allocation (LDA) are both topic modelling techniques. LSA (also known as Latent Semantic Indexing, LSI) learns latent topics by performing a matrix decomposition on the term-document matrix. Its objective is to reduce the number of dimensions for tasks such as classification in Natural Language Processing. LDA is a "generative probabilistic model" that uses unsupervised learning to discover topics in a collection of documents.


Data Analysis

What are the core steps for data pre-processing before applying machine learning algorithms?

Data pre-processing is the process of giving structure to the data for better understanding and decision making. The following steps summarize the data pre-processing pipeline:

  1. Discovering/Data Acquisition: Gather the data from all the sources and try to understand and make sense of your data.
  2. Structuring/Data Transformation: Since the data may come in different formats and sizes, it needs to be given a consistent size and shape when merged together.
  3. Cleaning: This step consists of imputing null values and treating outliers/anomalies in the data to make the data ready for further analysis.
  4. Exploratory Data Analysis: Try to find patterns in the dataset and extract new features from the given data in order to optimize the performance of the applied machine learning model.
  5. Validating: This stage verifies data consistency and quality.
  6. Publishing/Modeling: The wrangled data is ready to be processed further by an algorithm or machine learning model.

How do you detect if a new observation is an outlier?

To detect outliers, the following methods can be used:

  • Box plot (box-and-whisker plot): Any value above the upper limit or below the lower limit of the whiskers is treated as an outlier. Only data that lies within the lower and upper limits is considered statistically normal and is used for further analysis.
  • Standard deviation: Flag points that lie more than 3 standard deviations from the mean. The three-sigma rule of thumb is a conventional heuristic in the empirical sciences that nearly all values lie within three standard deviations of the mean.
  • Clustering: Use K-means or Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to detect outliers.
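
The first two approaches can be sketched with the standard library. The 1.5 × IQR fences are the usual box-plot convention and are an assumption here, since the limits are not spelled out above; the data values are made up.

```python
from statistics import mean, stdev, quantiles

def iqr_outliers(values, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot fences."""
    q1, _, q3 = quantiles(values, n=4)
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr
    return [v for v in values if v < low or v > high]

def is_three_sigma_outlier(new_value, reference):
    """Check a new observation against the three-sigma rule on reference data."""
    return abs(new_value - mean(reference)) > 3 * stdev(reference)

data = [10, 12, 11, 13, 12, 11, 10, 12, 13, 11, 95]
flagged = iqr_outliers(data)                           # [95]
is_new_outlier = is_three_sigma_outlier(95, data[:-1])
```

Note that the three-sigma check compares the new observation with statistics computed from the earlier data only, so a single extreme value cannot inflate the standard deviation it is judged against.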

If Facebook's likes per user and minutes spent on the platform are increasing but the total number of users is decreasing, what could be the root cause?

There are multiple ways to approach this question. One is to gather context about the problem. The following factors can be analyzed from the data in order to reach a sound conclusion:

Timeline: Is the drop in users a one-time event or has it happened progressively?

Region: Is the decline in the number of users happening in a specific region? If so, the problem might be related to a country’s regulations or a competing product in that region.

Platforms: Is the decline happening on specific platforms, such as iOS, Android or others? If so, compare the number of users who are leaving on each platform.

How can bogus Facebook accounts be detected?

The company can use the stored data to identify inauthentic profiles by looking for patterns, such as repeatedly posting the same content or a sudden spike in messaging activity. Moreover, an unusually high number of requests from a particular account might be suspicious as well.

What would you add to Facebook and how would you pitch it and measure its success?

The choice of feature to add is yours. To test any feature, let's walk through an example. To check the popularity of a Facebook feature such as stories, it would be a good idea to measure how frequently people are using the feature. One metric to use is the average number of times a user shares a story per day, per week, or per month. To test whether users want to first view stories from close friends or whether they want to see stories from all their friends, we can measure how many times a user clicks on stories from friends they don’t engage with that frequently before they click on stories from their close friends. If people click on stories randomly without prioritizing stories from close friends, then a more appropriate ordering of stories should be considered. These are examples of a few metrics to monitor. These measures, in addition to others, can be used to evaluate whether these components are achieving the overall goals set by Facebook.

How do you inspect missing data?

Start by counting the missing values in each column to see how much of the data is affected. The following techniques can then be used to handle missing data:

  1. Imputation of missing values depending on whether the data is numerical or categorical.
  2. Replacing values with the mean, median, or mode.
  3. Using the average value of the K nearest neighbours as an imputation estimate.
  4. Using linear regression to predict values.
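
As a rough sketch of counting and simple imputation, the snippet below uses only the standard library and represents missing entries as `None`. The helper names `count_missing` and `impute` and the sample rows are hypothetical; in practice a library such as pandas would usually be used instead.

```python
import statistics

def count_missing(rows):
    """Count None values per column in a list of dict rows."""
    counts = {}
    for row in rows:
        for key, value in row.items():
            counts[key] = counts.get(key, 0) + (1 if value is None else 0)
    return counts

def impute(values, strategy="mean"):
    """Replace None with the mean, median, or mode of the observed values."""
    observed = [v for v in values if v is not None]
    if strategy == "mean":
        fill = statistics.fmean(observed)
    elif strategy == "median":
        fill = statistics.median(observed)
    elif strategy == "mode":
        fill = statistics.mode(observed)  # also works for categorical data
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    return [fill if v is None else v for v in values]

# Assumed sample data for illustration.
rows = [
    {"age": 30.0, "city": "Oslo"},
    {"age": None, "city": "Oslo"},
    {"age": 40.0, "city": None},
]

print(count_missing(rows))
print(impute([r["age"] for r in rows], "mean"))
print(impute([r["city"] for r in rows], "mode"))
```

Mean and median apply to numerical columns, while the mode is the usual choice for categorical ones.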

What are anomaly detection methods?

Anomaly detection is a technique used to identify unusual patterns, called outliers, that do not conform to expected behavior. The simplest approach to identifying irregularities in data is to flag the data points that deviate from common statistical properties of a distribution, including mean, median, mode, and quantiles. Density-based anomaly detection is a machine learning approach which works on the assumption that normal data points occur in a dense neighborhood and abnormalities are far away. The nearest set of data points is evaluated using a score, which could be Euclidean distance or another distance measure. Another technique to detect anomalies is the Z-score, which is a parametric outlier detection method. This technique assumes a Gaussian distribution of the data. The outliers are the data points that are in the tails of the distribution and therefore far from the mean.
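
The density-based idea can be illustrated with a simple nearest-neighbour score: each point is scored by its mean Euclidean distance to its `k` closest neighbours, so isolated points score high. This is a simplified sketch under that assumption, not a full algorithm such as DBSCAN or LOF; the function name, the value of `k`, and the sample points are made up for the example.

```python
import math

def knn_scores(points, k=2):
    """Score each point by its mean Euclidean distance to its k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(sum(dists[:k]) / k)
    return scores

# Four points in a dense neighbourhood and one far away (assumed data).
points = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.2), (1.2, 1.1), (8.0, 8.0)]

scores = knn_scores(points)
most_anomalous = max(range(len(points)), key=scores.__getitem__)
print(points[most_anomalous])
```

A threshold on the score, chosen from the data, would then decide which points are flagged.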

How do you solve multicollinearity?

Multicollinearity occurs when independent variables in a regression model are correlated. This correlation is a problem because independent variables should be independent. To solve this issue, remove highly correlated predictors from the model. If you have two or more correlated variables, remove one from the model since they supply redundant information. You can also use Principal Component Analysis (PCA) to combine correlated predictors into a smaller number of uncorrelated components.
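
One way to sketch the "remove one of each correlated pair" step is to compute pairwise Pearson correlations and keep a predictor only if it is not strongly correlated with one already kept. The `0.9` threshold, the function names, and the sample features are assumptions for illustration; a variance inflation factor check is a common alternative.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def drop_correlated(columns, threshold=0.9):
    """Keep a column only if |r| with every already-kept column is below threshold."""
    kept = []
    for name in columns:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Assumed predictors: height is recorded twice in different units.
features = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [35, 22, 48, 29, 41],
}

print(drop_correlated(features))
```

Which column of a correlated pair is kept depends on the iteration order here, so in practice domain knowledge should guide that choice.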

How does caching work and how do you use it in Data science?

It is often necessary to save intermediate data files when the process of loading and/or manipulating data takes a considerable amount of time. That's where caching comes in. With server-side caching, elements that have already been computed do not need to be recomputed. When you want to grab some data that is expensive to look up (in terms of time or other resources), you cache it so that the next time you look up that same data, it’s much less expensive. Caching also enables content to be retrieved faster because an entire network round trip is not necessary. Caches maintained close to the user, like the browser cache, can make data retrieval extremely fast.
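
In Python, a small in-memory cache can be sketched with `functools.lru_cache` from the standard library. The `load_features` function below is a hypothetical stand-in for an expensive loading or transformation step, and `call_count` exists only to show that the second call is served from the cache.

```python
import functools

call_count = 0  # tracks how often the expensive body actually runs

@functools.lru_cache(maxsize=128)
def load_features(dataset_id):
    """Hypothetical expensive step; the result is cached per dataset_id."""
    global call_count
    call_count += 1
    return tuple(x * x for x in range(dataset_id))

first = load_features(5)   # computed
second = load_features(5)  # returned from the cache, body not re-run

print(call_count)
print(load_features.cache_info())
```

`lru_cache` requires hashable arguments and keeps results only for the lifetime of the process; saving intermediate files to disk is the usual choice when results must survive between runs.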

What metrics would you use to track whether Uber's strategy of using paid advertising to acquire customers works?

Customer acquisition cost (CAC) is a metric which can be used to track consumers as they progress from interested leads to acquired customers. CAC is the cost of convincing a potential customer to buy a product or service. It can be calculated by dividing all the costs spent on acquiring more customers (marketing expenses) by the number of customers acquired in the period the money was spent. For example, if Uber spent $200 on marketing in a year and acquired 200 customers in the same year, their CAC is $1.00.
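
The calculation can be written as a one-line function. The function name is an assumption for illustration; the figures are the ones from the example above.

```python
def customer_acquisition_cost(marketing_spend, customers_acquired):
    """CAC = marketing spend / customers acquired in the same period."""
    if customers_acquired <= 0:
        raise ValueError("customers_acquired must be positive")
    return marketing_spend / customers_acquired

# $200 of marketing spend and 200 customers acquired in the same year.
cac = customer_acquisition_cost(200, 200)
print(f"CAC: ${cac:.2f}")
```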

How to optimize marketing spend between various marketing channels?

Finding the ideal marketing strategy is a skill. Two questions to keep in mind are where you should invest and how often you should reassess where you are investing. The most important thing is to choose a set of metrics that determines which channels get more investment. This will help you make sound decisions about which marketing campaigns work and which ones aren't successful.


How would you handle NULLs when querying a data set?

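NULL means either that the value exists but is unknown, or that there is no information about whether a value exists. SQL databases reserve the NULL keyword to denote an unknown or missing value. When querying, test for missing values with IS NULL or IS NOT NULL rather than with =, and use a function such as COALESCE to substitute a default where one makes sense.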
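
A short sketch using Python's built-in `sqlite3` module shows common ways to handle NULLs in a query. The `orders` table and its columns are hypothetical, and other databases may offer additional functions beyond the standard `IS NULL` and `COALESCE` used here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, discount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, 5.0), (2, None), (3, 2.5)],  # Python None is stored as SQL NULL
)

# A comparison with NULL is never true, so "= NULL" matches nothing.
wrong = conn.execute("SELECT id FROM orders WHERE discount = NULL").fetchall()

# IS NULL is the correct test for a missing value.
missing = conn.execute("SELECT id FROM orders WHERE discount IS NULL").fetchall()

# COALESCE substitutes a default for NULL.
filled = conn.execute(
    "SELECT id, COALESCE(discount, 0.0) FROM orders ORDER BY id"
).fetchall()

# Aggregates such as AVG ignore NULL rows.
avg_discount = conn.execute("SELECT AVG(discount) FROM orders").fetchone()[0]

print(wrong, missing, filled, avg_discount)
```

Whether to filter NULL rows out or replace them with a default depends on what a missing value means for the analysis.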


How will you explain the JOIN function in SQL in the simplest possible way?

SQL handles queries across more than one table through the use of JOINs. JOINs are clauses in SQL statements that link two tables together, usually based on the common keys that define the relationship between those two tables. The key is the common column between the two tables. There are several types of JOINs:

  • INNER: Returns only the rows from both tables that satisfy the join condition.
  • LEFT: Returns all the rows of the table on the left side of the join and the matching rows of the table on the right side. Where there is no match on the right side, the result contains NULL.
  • RIGHT: Returns all the rows of the table on the right side of the join and the matching rows of the table on the left side. Where there is no match on the left side, the result contains NULL.
  • FULL: Combines the results of both LEFT and RIGHT JOIN. The result contains all the rows from both tables. Where there is no match, the result contains NULL.
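
The difference between INNER and LEFT joins can be seen in a small sketch using Python's built-in `sqlite3` module. The `customers` and `orders` tables are hypothetical. RIGHT and FULL joins are left out because older SQLite versions do not support them; a RIGHT JOIN gives the same rows as a LEFT JOIN with the two tables swapped.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
INSERT INTO customers VALUES (1, 'Ann'), (2, 'Bob'), (3, 'Cy');
INSERT INTO orders VALUES (10, 1, 20.0), (11, 1, 5.0), (12, 2, 7.5);
""")

# INNER JOIN: only customers that have at least one matching order.
inner = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    INNER JOIN orders AS o ON o.customer_id = c.id
    ORDER BY o.id
""").fetchall()

# LEFT JOIN: every customer; amount is NULL (None) when there is no order.
left = conn.execute("""
    SELECT c.name, o.amount
    FROM customers AS c
    LEFT JOIN orders AS o ON o.customer_id = c.id
    ORDER BY c.id, o.id
""").fetchall()

print(inner)
print(left)
```

Here `customer_id` is the common key: the customer without orders disappears from the INNER result but is kept, with a NULL amount, in the LEFT result.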


This is a companion discussion topic for the original entry at https://blog.datasciencedojo.com/p/67cc900e-bd88-4c57-8f0d-57a6a7e8a74f/