Lasso Regression vs. Ridge Regression

I’ve been experimenting with Lasso and Ridge regression, but how do I select one method over the other?

We know that both types of regularization are used to manage the bias-variance trade-off in a model.

Ridge Regression

Ridge regression helps reduce the effective complexity of the model, since the greater the complexity, the greater the variance (over-fitting).
Instead of removing predictors by setting their coefficients to exactly zero, ridge regression penalizes coefficients that stray too far from zero so that they stay small. This reduces model complexity while keeping all the variables in the model.
In ridge regression, we minimize the sum of squared residuals plus a penalty on the size of the parameter estimates, which shrinks them towards zero.
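A minimal sketch of that shrinkage with scikit-learn (the synthetic data, variable names, and the alpha value below are illustrative assumptions on my part, not something from the question):

```python
import numpy as np
from sklearn.linear_model import Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# Only the first two predictors actually drive the response.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# alpha plays the role of λ: the larger it is, the stronger the shrinkage.
ridge = Ridge(alpha=10.0).fit(X, y)
print(ridge.coef_)  # every coefficient is shrunk towards zero, but none is exactly zero
```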

Lasso Regression

In this type of regression, a penalty for non-zero coefficients is also added, but unlike ridge regression, which penalizes the sum of squared coefficients (the L2 penalty), lasso penalizes the sum of their absolute values (the L1 penalty). As a result, for high values of λ, many coefficients are forced to be exactly 0 under lasso, which never happens in ridge regression.
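Here is the same kind of sketch for lasso (again, the data and the alpha value are illustrative assumptions, not from the question):

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
# As before, only the first two predictors carry signal.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=100)

# A fairly large alpha (λ) drives the coefficients of the noise predictors to exactly 0.0.
lasso = Lasso(alpha=1.0).fit(X, y)
print(lasso.coef_)  # the irrelevant coefficients should come out as exact zeros
```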

The only difference between the ridge and lasso loss functions is the penalty term.
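Written out in one common formulation (with ŷᵢ the fitted value, βⱼ the coefficients, and λ the regularization strength), the two objectives differ only in that last term:

$$\text{Ridge:}\quad \min_{\beta}\ \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p}\beta_j^{2}$$

$$\text{Lasso:}\quad \min_{\beta}\ \sum_{i=1}^{n}(y_i - \hat{y}_i)^2 + \lambda \sum_{j=1}^{p}|\beta_j|$$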

General rules to follow when choosing one over the other:

  • Lasso performs well if there are a small number of significant parameters and the others are close to zero (when only a few predictors actually influence the response variable).
  • Ridge works well if there are many large parameters of about the same value (when most predictors impact the response).
  • The points above do not hold in every case, so it is recommended to run cross-validation to select the better-suited model for your specific problem (see the sketch after this list).
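A rough sketch of what that cross-validation comparison could look like with scikit-learn's LassoCV and RidgeCV (the data, the alpha grid, and the 5-fold choice are illustrative assumptions):

```python
import numpy as np
from sklearn.linear_model import Lasso, LassoCV, Ridge, RidgeCV
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
# Only two of the ten predictors actually influence the response.
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.5, size=200)

# Tune λ (alpha) for each method by cross-validation over the same grid.
alphas = np.logspace(-3, 2, 50)
lasso_alpha = LassoCV(alphas=alphas, cv=5).fit(X, y).alpha_
ridge_alpha = RidgeCV(alphas=alphas, cv=5).fit(X, y).alpha_

# Compare the two tuned models on cross-validated R² and keep whichever scores higher.
lasso_r2 = cross_val_score(Lasso(alpha=lasso_alpha), X, y, cv=5).mean()
ridge_r2 = cross_val_score(Ridge(alpha=ridge_alpha), X, y, cv=5).mean()
print(f"Lasso CV R2: {lasso_r2:.3f}   Ridge CV R2: {ridge_r2:.3f}")
```

On data like this, where most predictors are irrelevant, lasso would typically come out ahead; with many moderately useful predictors the comparison tends to favour ridge.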

Lasso Regression uses L1 regularization and Ridge Regression uses L2 regularization, so the decision comes down to choosing between the two regularization methods.

L1 regularization aims to shrink the coefficients of the variables that have little or no impact all the way to zero, whereas L2 regularization aims for a compromise, shrinking all coefficients more evenly.

If you think the problem you are working on has some variables that play no role in predicting the dependent variable, opt for Lasso; otherwise, go with Ridge.