The Lesser of Two Evils in Machine Learning: Variance and Bias
Every society throughout mankind’s history has had its own version of the principle of the lesser of two evils. In idyllic ancient Greece, Homer sang the tale of how Odysseus, on his homeward journey, was faced with the vexing choice of sacrificing six of his crewmen to satiate the voracious appetite of Scylla the six-headed monster or losing his entire crew to Charybdis the whirlpool. I don’t have to tell you the outcome of that dilemma, but the modern data scientist is faced with an analogous dilemma between our own Scylla and Charybdis: bias and variance in machine learning algorithms.
What is Bias?
Data scientists are not unique in encountering bias through their work. I remember first learning about bias in the scope of clinical research, where selection and information biases must be minimized to ensure the validity of the study. We encounter and can observe cognitive biases in our daily interactions, such as how confirmation bias has locked the two parties in the United States each into its own political bubble with an ever-widening divide.
In the scope of machine learning, bias is the difference between the expected (or average) value of the estimator (θ_hat) and the parameter we are trying to estimate (θ):
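$$\text{Bias}(\hat{\theta}) = E[\hat{\theta}] - \theta$$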
Practically, we look to the training error as an indication of the amount of bias in the model.
A model with high bias:
- oversimplifies the relationship between the features and the target and pays little attention to the training data
- has a high training error
- is considered underfit if the variance is low
- causes the algorithm to miss the relevant relations between the features and the target variable
- is what parametric algorithms (such as linear and logistic regression) are prone to, since they are easier to understand but generally less flexible
To reduce the bias in a model, indicated by a high training error, we can employ any of the following techniques (a brief code sketch follows the list):
- add more input features, since the current feature set may be too general to capture the underlying relationship
- add more complexity, such as introducing polynomial features
- decrease or remove any regularization, such as decreasing lambda
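As a rough, non-authoritative sketch of the last two ideas, here is how adding polynomial features and weakening the regularization might look in scikit-learn; the toy dataset, degree, and alpha values are placeholders, not prescriptions.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy nonlinear data (stand-in for a real dataset)
rng = np.random.RandomState(42)
X = np.sort(rng.uniform(-3, 3, size=(200, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=200)

# High-bias baseline: a straight line with heavy regularization
underfit = Ridge(alpha=100.0).fit(X, y)

# Lower-bias model: polynomial features plus much weaker regularization
flexible = make_pipeline(PolynomialFeatures(degree=5), Ridge(alpha=0.01)).fit(X, y)

print("training MSE, underfit :", mean_squared_error(y, underfit.predict(X)))
print("training MSE, flexible :", mean_squared_error(y, flexible.predict(X)))
# The training error drops for the more flexible model -- the signature of reduced bias.
```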
What is Variance?
Also known by its other manifestations as the square of the standard deviation or the covariance of a variable with itself, variance is the amount by which the estimate of the target function would change given different training data; in other words, its sensitivity to small fluctuations in the training data. It is defined as the difference between the expected value of the squared estimator and the squared expectation of the estimator:
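$$\text{Var}(\hat{\theta}) = E[\hat{\theta}^2] - \left(E[\hat{\theta}]\right)^2$$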
Practically, we can look at the difference between the training and testing errors as an indication of the amount of variance in the model.
A model with high variance:
- pays a lot of attention to the training data, hence will perform well on the training dataset, but not on the testing dataset
- causes the algorithm to model random noise in the training data
- has a significant difference between training and testing errors
- is what non-parametric algorithms such as k-Nearest Neighbors and Decision Trees are prone to, because they are very flexible and tune themselves closely to the data points
- is considered overfit if the error due to bias for the model is low
To remediate high variance in a learning algorithm, we can employ any of the following techniques (sketched in code after the list):
- increasing the size of the training dataset
- reducing the number of input features by feature selection
- increasing regularization, such as increasing lambda or using Ridge or Lasso regression
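Here is a minimal, hedged sketch of the last two remedies in scikit-learn: univariate feature selection followed by a more strongly regularized Ridge fit, plus a Lasso fit whose L1 penalty performs feature selection on its own. The data and parameter values are arbitrary placeholders.

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.pipeline import make_pipeline

# Toy data: 100 samples, 50 features, only two of them informative
rng = np.random.RandomState(0)
X = rng.normal(size=(100, 50))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=100)

# Keep only the 10 most relevant features, then shrink the coefficients hard
variance_tamed = make_pipeline(
    SelectKBest(score_func=f_regression, k=10),
    Ridge(alpha=10.0),  # larger alpha -> stronger shrinkage -> lower variance
).fit(X, y)

# Lasso drives some coefficients exactly to zero (built-in feature selection)
sparse_model = Lasso(alpha=0.1).fit(X, y)
print("non-zero Lasso coefficients:", int(np.sum(sparse_model.coef_ != 0)))
```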
What is the Dilemma or Tradeoff?
Bias and variance have an often antagonistic relationship: as one increases, the other decreases. They don't like to share the spotlight. For instance, when we employ techniques such as regularization and feature selection to decrease variance, bias increases. When we increase the model size or add more features to reduce the bias, we inevitably increase the variance. The classic U-shaped curve of test error against model complexity illustrates that point.
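Below is a minimal sketch of how such a curve can be produced with scikit-learn; polynomial degree stands in for model complexity, and the toy data, degree range, and cross-validation settings are placeholder assumptions.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import validation_curve
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy nonlinear data (stand-in for a real dataset)
rng = np.random.RandomState(1)
X = np.sort(rng.uniform(-3, 3, size=(150, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.3, size=150)

degrees = np.arange(1, 11)
train_scores, test_scores = validation_curve(
    make_pipeline(PolynomialFeatures(), LinearRegression()),
    X, y,
    param_name="polynomialfeatures__degree",
    param_range=degrees,
    scoring="neg_mean_squared_error",
    cv=5,
)

# Convert the negated MSE scores back to errors, averaged over the CV folds
train_err = -train_scores.mean(axis=1)
test_err = -test_scores.mean(axis=1)
for d, tr, te in zip(degrees, train_err, test_err):
    print(f"degree {int(d):2d}  train MSE {tr:.3f}  test MSE {te:.3f}")
# Training error keeps shrinking as the degree grows (bias falls),
# while test error eventually climbs back up (variance takes over).
```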
Why are we even concerned with bias and variance?
Bias and variance are the two main components of the total error for a model, where the last term represents the irreducible error resulting from the noise:
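$$\text{Total Error} = \text{Bias}^2 + \text{Variance} + \text{Irreducible Error}$$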
In essence, we are never able to predict Y from X with zero prediction error. In the most ideal case, we could theoretically reduce the error due to variance and the error due to bias down to zero, but the irreducible error would remain. For linear regression models, we can represent the expected test error at a new point x_0 as the mean squared error:
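$$E\left[\left(y_0 - \hat{f}(x_0)\right)^2\right] = \text{Var}\left(\hat{f}(x_0)\right) + \left[\text{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \text{Var}(\epsilon)$$

where $\hat{f}(x_0)$ is the model's prediction at $x_0$ and $\text{Var}(\epsilon)$ is the irreducible error due to noise.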
Through this process of performance assessment and model selection, we are looking to minimize the RMSE.
One way to gain a better intuition of the relationship between variance, bias, and total error is to interpret it as a Pythagorean relationship between the variables. Given that the variance is the square of the standard deviation:
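$$\text{RMSE}^2 = \text{Bias}^2 + \text{SD}^2$$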
where the two lengths are the standard deviation and bias and the hypotenuse is the root mean squared error.
High Bias and Low Variance (Underfitting):
- models are consistent (predictions are similar to one another) but inaccurate on average
- can be identified by a high training error and a validation error similar to the training error
- in the diagram, indicated by 1 and 4: notice the small gap between the test and training errors
High Variance and Low Bias (Overfitting):
- models are somewhat accurate on average but inconsistent (predictions are not similar to one another)
- can be identified with a low training error and high testing error
- in the diagram, indicated by 3 and 6: notice the large gap between the test and training errors
High Variance and High Bias:
- models are both inaccurate and inconsistent on average
- can be identified with high training error and even higher validation error
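These regimes can be read off a learning curve: when both errors plateau high and close together, the model is biased; when the training error is low but the validation error stays well above it, variance dominates. Below is a hedged sketch using scikit-learn's learning_curve with a placeholder estimator and toy data.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import learning_curve

# Toy regression data (stand-in for a real problem)
rng = np.random.RandomState(7)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.5, -2.0, 0.0, 0.0, 0.5]) + rng.normal(scale=1.0, size=300)

sizes, train_scores, val_scores = learning_curve(
    Ridge(alpha=1.0), X, y,
    train_sizes=np.linspace(0.2, 1.0, 5),
    scoring="neg_mean_squared_error",
    cv=5,
)

for n, tr, va in zip(sizes, -train_scores.mean(axis=1), -val_scores.mean(axis=1)):
    print(f"n={int(n):3d}  train MSE {tr:.2f}  validation MSE {va:.2f}")
# High bias: both errors are high and close together.
# High variance: training error is low while validation error stays well above it.
```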
Ultimately, we are aiming for a model that has the lowest variance and the lowest bias.
I am currently in Phase 3 of the Flatiron School Data Science Online Immersive Bootcamp.