#=======================================================================
#
# Code sample illustrating the concept of Bias-Variance Tradeoff in 
# terms of feature engineering.
#
#=======================================================================
library(ggplot2)
# Load iris dataset and set up
data("iris")
iris$Species <- as.factor(iris$Species)
# Visualize Petal.Width as a predictive feature. Notice
# the overlap in the density plot - this is indicative of
# variance in the data for this feature. Conversely,
# clean separation is indicative of a lack of variance.
ggplot(iris, aes(x = Petal.Width, fill = Species)) +
  theme_bw() +
  geom_density(alpha = 0.5)
# Visualize Petal.Length as a predictive feature.
ggplot(iris, aes(x = Petal.Length, fill = Species)) +
  theme_bw() +
  geom_density(alpha = 0.5)
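# A rough numeric companion to the two density plots above: the
# per-species minimum and maximum of each petal measurement. Where one
# species' range runs into the next one's, the densities overlap.
# This is only a sketch; min/max are sensitive to outliers, and the
# object names below are illustrative, not part of the original script.
petal_width_range <- sapply(split(iris$Petal.Width, iris$Species), range)
petal_length_range <- sapply(split(iris$Petal.Length, iris$Species), range)
rownames(petal_width_range) <- c("min", "max")
rownames(petal_length_range) <- c("min", "max")
print(petal_width_range)
print(petal_length_range)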
# Visualize Petal.Width vs Petal.Length by Species. Notice that,
# while unlikely, combinations of features can decrease bias
# (i.e., increase accuracy) while simultaneously reducing
# variance.
ggplot(iris, aes(x = Petal.Width, y = Petal.Length,
                 color = Species)) +
  theme_bw() +
  geom_point(size = 2.0)
# Perform some feature engineering. In some cases, ratios of
# numeric features are good predictors. Not in this case,
# however, as the ratio increases the amount of overlap in the
# density plot.
iris$Petal.Ratio <- iris$Petal.Width / iris$Petal.Length
ggplot(iris, aes(x = Petal.Ratio, fill = Species)) +
  theme_bw() +
  geom_density(alpha = 0.5)
# This time, engineer an interaction of two numeric features.
# The density plot does not make clear whether this is a good or
# bad feature. May want to explore further with a mighty Random
# Forest.
iris$Petal.Interaction <- iris$Petal.Width * iris$Petal.Length
ggplot(iris, aes(x = Petal.Interaction, fill = Species)) +
  theme_bw() +
  geom_density(alpha = 0.5)
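# A minimal sketch of the Random Forest follow-up suggested above. It
# assumes the randomForest package is installed (the original script
# does not load it), so the block is skipped when it is missing. The
# seed, ntree value, and object names are arbitrary illustrative
# choices. The engineered columns are recomputed on a copy so that the
# block does not depend on earlier assignments.
if (requireNamespace("randomForest", quietly = TRUE)) {
  iris_rf_data <- transform(
    iris,
    Petal.Ratio = Petal.Width / Petal.Length,
    Petal.Interaction = Petal.Width * Petal.Length
  )
  set.seed(42)
  rf_fit <- randomForest::randomForest(
    Species ~ .,
    data = iris_rf_data,
    ntree = 500,
    importance = TRUE
  )
  # The printed fit includes the out-of-bag error estimate.
  print(rf_fit)
  # Compare the engineered features against the raw measurements.
  print(randomForest::importance(rf_fit))
  randomForest::varImpPlot(rf_fit)
}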
# NOTE - Set your working directory to where the Titanic files
#        are stored.
train <- read.csv("train.csv", stringsAsFactors = FALSE)
# Setup factors
train$Sex <- as.factor(train$Sex)
train$Survived <- as.factor(train$Survived)
# Visualize Survived in terms of Sex. Notice how the
# proportions are not pure. This is indicative of variance
# in the data. While Sex is certainly the best out of the
# box Titanic feature, it still demonstrates quite a bit
# of potential variance.
ggplot(train, aes(x = Sex, fill = Survived)) +
  theme_bw() +
  geom_bar()
# By adding segments we have the ability to discover 
# potential hidden patterns in the data, but at the 
# expense of adding more variance. This is illustrated
# in the following plot.
ggplot(train, aes(x = Pclass, fill = Survived)) +
  theme_bw() +
  facet_wrap(~ Sex) +
  geom_bar()
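# A numeric view of the segmented plot above: passenger count and
# survival rate per Sex x Pclass segment. Smaller segments mean each
# rate is estimated from fewer passengers, which is one way to see the
# added variance. survival_by_segment() is a hypothetical helper written
# for this sketch; it assumes Survived is coded 0/1 as in train.csv.
survival_by_segment <- function(df) {
  survived <- as.character(df$Survived) == "1"
  groups <- list(Sex = df$Sex, Pclass = df$Pclass)
  counts <- aggregate(list(n = survived), by = groups, FUN = length)
  rates <- aggregate(list(survival_rate = survived), by = groups, FUN = mean)
  merge(counts, rates, by = c("Sex", "Pclass"))
}
# Only run against the Titanic data when it has been loaded.
if (exists("train")) {
  print(survival_by_segment(train))
}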
# The Bias-Variance Tradeoff is very clear in the following
# plot. The segmentation clearly shows areas where Age
# is cleanly separated between survival/death, but also shows
# many areas of overlap that are indicative of variance.
ggplot(train, aes(x = Age, fill = Survived)) +
  theme_bw() +
  facet_wrap(Sex ~ Pclass) +
  geom_density(alpha = 0.5)
# Bias: Feature engineering can reduce bias. Adding meaningful
# features gives the model more information about the data and
# helps it capture the underlying patterns.
# Variance: High variance can be reduced by reducing the number of
# features in the model. Several methods are available to check
# which features add little value to the model and which are
# important (hint: hypothesis testing).
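# One possible reading of the hypothesis-testing hint, sketched on the
# iris data: a Kruskal-Wallis test per numeric feature against Species.
# A small p-value suggests the feature's distribution differs between
# species; a large one marks a candidate for removal. The choice of
# test and the object names are assumptions made for illustration, and
# a p-value alone is not a complete feature-selection procedure.
numeric_features <- c("Sepal.Length", "Sepal.Width",
                      "Petal.Length", "Petal.Width")
feature_p_values <- sapply(numeric_features, function(feature) {
  kruskal.test(iris[[feature]], iris$Species)$p.value
})
print(sort(feature_p_values))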