Impact of outliers on decision trees
What is an Outlier?
An outlier is an observation that deviates significantly from the rest of the
data. Outliers can be caused by measurement or execution errors.
The impact of outliers depends on the type of model we are using. Different
models respond to outliers differently, so the effect on accuracy varies from
model to model.
Impact on Decision Trees
There are two cases to consider for decision trees:
Case 1: Outliers in the predictor variables (continuous variables)
Case 2: Outliers in the target variable (continuous variable)
Note: In statistics, outliers have a considerable effect only if they are
present in a continuous (numerical) column. Rare occurrences of a particular
class in a categorical variable should instead be treated as anomalies or
class imbalance and handled accordingly.
In case 1, there will almost certainly be no impact.
In case 2, there might be some impact, but not necessarily.
This is because a decision tree is built by splitting the data points into
multiple groups in a multidimensional space, choosing each split to minimize
a criterion such as the sum of squared residuals. It does not fit a
hyperplane the way linear regression and support vector machines do.
Moreover, it works as an “if-then” procedure: the model asks specific
questions of the data. If the condition is satisfied, it gives one defined
output; otherwise it gives another defined output.
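The splitting idea described above can be sketched in a few lines of plain Python. This is only an illustration, not the procedure of any particular library: the helper names (`sse`, `best_split`, `partition`) and the tiny data set are assumptions made up for this example. It searches for the single threshold that minimizes the sum of squared residuals, then repeats the search after replacing the largest predictor value with an extreme one.

```python
def sse(values):
    """Sum of squared residuals of `values` around their own mean."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_sse) of the best single split on one predictor."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # no threshold fits between two equal predictor values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sse(left) + sse(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


def partition(xs, threshold):
    """True for points sent to the left branch (x <= threshold)."""
    return [x <= threshold for x in xs]


# Hypothetical data, invented for this sketch only.
xs = [1, 2, 3, 10, 11, 12]
ys = [5.0, 6.0, 5.5, 20.0, 21.0, 19.5]
xs_outlier = [1, 2, 3, 10, 11, 1200]  # last predictor value is now an outlier

split_clean = best_split(xs, ys)
split_outlier = best_split(xs_outlier, ys)
same_groups = partition(xs, split_clean[0]) == partition(xs_outlier, split_outlier[0])
```

In this toy case the chosen threshold and the resulting groups are identical with and without the extreme predictor value, because the search only depends on the order of the predictor values. An outlier that sits right next to the chosen split point could still shift the threshold itself.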
Let’s understand this with an example.
We need to predict the occupation of a person from their salary.
Here, the predictor variable is Salary and the target variable is Occupation.
The persons with Person Id 10 and 11 are outliers: person 10 earns a salary
far above that of everyone else, and person 11 earns only $2, which is very
low compared with all the other persons.
However, this decision tree can predict the occupations of person 10 and
person 11 without any error, even though they are outliers.
Hence, an outlier in the predictor variables does not affect the predictive
ability of the model most of the time.
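To make the "if-then" behaviour concrete, here is a minimal sketch of a one-split tree written as a plain Python function. The threshold of 50,000, the occupation labels, and the salary values are all hypothetical and are not taken from the table in this post; they only show that extreme salaries still fall on one side of the question and receive an ordinary prediction.

```python
def predict_occupation(salary, threshold=50_000):
    """Toy one-split tree. The threshold and labels are hypothetical."""
    if salary <= threshold:
        return "Teacher"
    return "Doctor/Engineer"


# Hypothetical salaries, including a very low one ($2) and a very high one.
salaries = [2, 30_000, 45_000, 80_000, 95_000, 1_000_000]
predictions = [predict_occupation(s) for s in salaries]

for salary, occupation in zip(salaries, predictions):
    print(f"{salary:>9} -> {occupation}")
```

The tree never uses how far a salary lies from the threshold, only which side it is on, so the two extreme values are handled exactly like their neighbours.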
Here, we have only one predictor variable: Salary.
So, the nodes will be split in a plane like this:
If we have multiple predictor variables, the nodes
will be split like this:
Case 2:
In this problem, we need to predict the salary of a person from their
occupation.
Here, the predictor variable is Occupation and the target variable is Salary.
There are two approaches to decide the output here. Group the data points
according to the predictor variable and use either:
1) the mean of the target variable as the prediction, or
2) the median of the target variable as the prediction.
In the above decision tree, the predicted salary of engineers and doctors is
inflated: the prediction comes to around 207K, while most of them earn a
salary below 100K.
Hence, the outlier had an effect in this situation.
Let’s take the median instead of the mean as the prediction and see what
changes.
Here, both predictions (salary of doctors/engineers and salary of teachers)
seem less affected by the outliers. This is because we used the median
instead of the mean to predict the outcome.
Hence, if we use the median to predict the target value, the prediction is
less likely to be affected by outliers; if we use the mean, the prediction
may well be affected by them.
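The difference between the two aggregate functions can be checked with a short sketch using Python's standard `statistics` module. The salaries below (in thousands of dollars) are hypothetical values chosen for illustration and are not the figures from the table in this post; `leaf_predictions` is likewise a helper name invented for this example.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (occupation, salary in thousands) rows; one extreme salary.
rows = [
    ("Teacher", 40), ("Teacher", 45), ("Teacher", 50),
    ("Doctor/Engineer", 70), ("Doctor/Engineer", 80),
    ("Doctor/Engineer", 90), ("Doctor/Engineer", 95),
    ("Doctor/Engineer", 1000),  # outlier in the target variable
]


def leaf_predictions(rows, aggregate):
    """Group salaries by occupation and aggregate each group into one prediction."""
    groups = defaultdict(list)
    for occupation, salary in rows:
        groups[occupation].append(salary)
    return {occupation: aggregate(values) for occupation, values in groups.items()}


by_mean = leaf_predictions(rows, mean)      # Doctor/Engineer is pulled up to 267
by_median = leaf_predictions(rows, median)  # Doctor/Engineer stays at 90
```

With these made-up numbers the single extreme salary drags the mean-based prediction far above what most people in the group earn, while the median-based prediction stays close to the typical value. The group without an outlier gets the same prediction either way.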
Summary:
The metric used for splitting the nodes of a decision tree (information gain /
Gini impurity) and the aggregate function (mean / median) used to produce a
continuous prediction play a major role in how outliers affect the decision
tree.
If the outliers are present in the predictor variables, there will almost
certainly be no impact.
If the outliers are present in the target variable, there might be some
impact (but not necessarily).