Impact of Outliers on Decision Trees

 

What is an Outlier?

     



An outlier is an observation that deviates significantly from the rest of the observations. Outliers can be caused by measurement or execution errors.

The impact of outliers depends on the type of model we are using. Different models differ in how much outliers affect their accuracy.

Impact on Decision Trees

There are two cases to consider for decision trees:

Case 1: Impact of outliers in predictor variables (continuous variable)

Case 2: Impact of outliers in target variables (continuous variable)

 

Note: In statistics, outliers have a considerable effect only when they occur in a continuous (numerical) column. Rare occurrences of a particular class in a categorical variable should instead be treated as anomalies or class imbalance and handled accordingly.

In case 1, there will almost certainly be no impact.

In case 2, there might be some impact, but not necessarily.

This is because a decision tree is built by repeatedly splitting the data points into groups in a multidimensional space, choosing each split to minimize the sum of squared residuals (for a continuous target). Unlike linear regression or support vector machines, it does not fit a hyperplane.

Moreover, it works as an “if-then” procedure: the model asks specific questions of the data. If a condition is satisfied, it gives one defined output; otherwise it gives another.
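To make the splitting idea concrete, here is a minimal sketch of how a single split could be chosen by minimizing the sum of squared residuals, followed by the resulting if-then rule. The function names and the small `xs`/`ys` dataset are hypothetical and only for illustration; real libraries add many refinements on top of this.

```python
def sum_squared_residuals(values):
    """Sum of squared differences between each value and the group mean."""
    if not values:
        return 0.0
    group_mean = sum(values) / len(values)
    return sum((v - group_mean) ** 2 for v in values)


def best_split(xs, ys):
    """Return (threshold, total_ssr) for the rule x <= threshold with the lowest SSR."""
    pairs = sorted(zip(xs, ys))
    best = None
    for i in range(1, len(pairs)):
        left_x, right_x = pairs[i - 1][0], pairs[i][0]
        if left_x == right_x:
            continue  # equal predictor values cannot be separated
        threshold = (left_x + right_x) / 2  # midpoint between neighbours
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        total = sum_squared_residuals(left) + sum_squared_residuals(right)
        if best is None or total < best[1]:
            best = (threshold, total)
    return best


def predict(x, threshold, left_value, right_value):
    """A one-level tree is a single if-then rule."""
    if x <= threshold:
        return left_value
    return right_value


# Hypothetical data: one continuous predictor and one continuous target.
xs = [1, 2, 3, 10, 11, 12]
ys = [30, 32, 31, 80, 82, 81]

threshold, ssr = best_split(xs, ys)
print(threshold, ssr)
print(predict(2, threshold, 31, 81))
```

Note that the candidate thresholds are midpoints between neighbouring predictor values, so only the ordering of the predictor decides which points fall on each side of a split.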

Let’s understand this with an example.


We need to predict a person's occupation from their salary.


Case 1:

Our data looks like this:

 


Here, the predictor variable is salary and the target variable is occupation. Persons 10 and 11 are outliers: person 10 earns a salary far above everyone else's, and person 11 earns only $2, far below everyone else's.

 

 





However, this decision tree can predict the occupations of person 10 and person 11 without any error, even though both are outliers.



Hence, an outlier in a predictor variable usually does not affect the predictive ability of the model.
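The following sketch illustrates why, under simplifying assumptions: a single split chosen by weighted Gini impurity, and made-up salaries and occupations (not the data from the table above). The helper names `gini`, `best_threshold`, and `predict` are hypothetical. Pushing the lowest and highest salaries to extreme values leaves the chosen threshold, and therefore the predictions, unchanged.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def best_threshold(salaries, labels):
    """Salary threshold that minimizes the weighted Gini impurity of the two children."""
    pairs = sorted(zip(salaries, labels))
    n = len(pairs)
    best_t, best_score = None, float("inf")
    for i in range(1, n):
        low, high = pairs[i - 1][0], pairs[i][0]
        if low == high:
            continue  # equal salaries cannot be separated
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        score = (len(left) * gini(left) + len(right) * gini(right)) / n
        if score < best_score:
            best_t, best_score = (low + high) / 2, score
    return best_t


def predict(salary, threshold):
    """If-then rule for this hypothetical two-class example."""
    return "teacher" if salary <= threshold else "engineer"


# Hypothetical data, for illustration only.
labels = ["teacher"] * 4 + ["engineer"] * 4
normal = [30_000, 35_000, 40_000, 45_000, 80_000, 85_000, 90_000, 95_000]
# Same people, but the lowest salary becomes 2 and the highest 5,000,000.
with_outliers = [2, 35_000, 40_000, 45_000, 80_000, 85_000, 90_000, 5_000_000]

t_normal = best_threshold(normal, labels)
t_outliers = best_threshold(with_outliers, labels)
print(t_normal, t_outliers)
print(predict(2, t_outliers), predict(5_000_000, t_outliers))
```

The threshold sits between the two groups of ordinary salaries, so the extreme values simply fall on the correct side of it.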

Here, we have only one predictor variable: salary.

So the nodes will be split in a plane like this:



If we have multiple predictor variables, the nodes will be split like this:




Case 2:

In this problem, we need to predict a person's salary from their occupation.


Here, the predictor variable is occupation.

The target variable is salary.

There are two approaches to deciding the output here. Group the data points by the predictor variable and use either:

1)   the mean of the target variable in each group as the prediction, or

2)   the median of the target variable in each group as the prediction.

 



 

 

 




In the above decision tree, the predicted salary of engineers and doctors is inflated: the prediction comes to around 207K, while most of them earn less than 100K.

Hence, the outlier did affect the prediction in this situation.

Let’s use the median instead of the mean as the prediction and look at the result.



Here, both predictions (the salary of doctors/engineers and the salary of teachers) seem less affected by the outliers. This is because we used the median instead of the mean to predict the outcome.

Hence, if we use the median to predict the target value, the prediction is less likely to be affected by outliers. If we use the mean, outliers can pull the prediction away from the typical value.
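A small sketch of the two aggregation choices, using Python's standard `statistics` module. The salary figures (in thousands) are hypothetical and are not the values from the tree above; `leaf_predictions` is an illustrative helper name.

```python
from statistics import mean, median

# Hypothetical salaries in thousands, grouped by occupation.
salaries = {
    "doctor/engineer": [80, 85, 90, 95, 1000],  # 1000 is the outlier
    "teacher": [30, 35, 40, 45, 50],
}


def leaf_predictions(groups, aggregate):
    """Predict one value per group by applying the aggregate to its targets."""
    return {occupation: aggregate(values) for occupation, values in groups.items()}


mean_pred = leaf_predictions(salaries, mean)
median_pred = leaf_predictions(salaries, median)

print(mean_pred)
print(median_pred)
```

In this made-up data the single outlier pulls the mean of the first group far above what most of its members earn, while the median stays at the middle value. The group without an outlier gets the same prediction either way.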

Summary:

Two things largely determine how much outliers affect a decision tree: the metric used for splitting the nodes (information gain / Gini impurity) and the aggregate function (mean / median) used to produce a prediction for a continuous target.

If the outliers are present in predictor variables, there will almost certainly be no impact.

If the outliers are present in the target variable, there might be some impact (but not necessarily).

 
