To best address data cleaning and data preparation for Machine Learning, I developed a workflow for handling outliers that improves consistency and outcome quality. Since the presence of one or more outliers in the data set can severely affect a Machine Learning model, the standard approach is often to drop these data points and obtain cleaned data that behave as we “expect”. However, without a robust workflow, the choice between removing and keeping outliers can easily introduce severe bias, or produce a model that does not generalize well, hurting its performance.
A workflow to handle outliers
To guide the decision making and properly manage outliers, I use a simple but robust workflow:
- Detection
- Understanding
- Cleaning
- Documentation
Additionally, for my Exploratory Data Analysis tasks, I often use Jupyter Notebooks, keeping track of the whole session, code, and commentary.
Detection
Generally speaking, an outlier is a data point that differs significantly from other observations.
The underlying assumption in this statement is that the set of “other observations” is big enough to give us some insight into the general data distribution or its statistical properties.
Single variable
Using a simple example and the standard Python Seaborn library, I start by plotting both the distribution and the boxplot:
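A minimal sketch of these two plots (the sample variable and its values are hypothetical, for illustration only):

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# hypothetical single-variable sample: mostly normal data plus two extreme values
rng = np.random.default_rng(42)
sample = np.concatenate([rng.normal(loc=50, scale=5, size=200), [95, 100]])

# distribution on the left, boxplot on the right
fig, axes = plt.subplots(1, 2, figsize=(10, 4))
sns.histplot(sample, kde=True, ax=axes[0])
sns.boxplot(x=sample, ax=axes[1])
plt.show()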
The boxplot gives us a good way to identify outliers, bearing in mind that identifying them does not automatically imply removing them. The blue box goes from the first quartile ($Q_{25}$, 25% of the data) to the third quartile ($Q_{75}$, 75% of the data), defining the interquartile range ($R_{IQR}$). A data point is an outlier if it falls below $Q_{25}$ or above $Q_{75}$ by more than $1.5\,R_{IQR}$; the two cases are denoted below as $O_l$ and $O_h$.
The whiskers extend to the minimum and maximum of the data, excluding the outliers.
$$ \begin{aligned} O_{l}:\quad & x < Q_{25} - 1.5\,R_{IQR} \\ O_{h}:\quad & x > Q_{75} + 1.5\,R_{IQR} \end{aligned} $$
Note that these levels can be easily codified with a NumPy function for programmatic removal:
# quartiles of the sample
q25, q75 = np.percentile(a=sample, q=[25, 75])
# interquartile range and the 1.5 * IQR outlier limits
IQR = q75 - q25
lower_limit = q25 - 1.5 * IQR
higher_limit = q75 + 1.5 * IQR
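With the limits in place, a boolean mask filters the outliers out; a sketch, assuming sample is the NumPy array used above:

# keep only the points within the outlier limits
cleaned = sample[(sample >= lower_limit) & (sample <= higher_limit)]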
Correlated variables
While preparing data for Machine Learning, another very common source of possible outliers appears when two highly correlated variables show a few data points strongly misaligned with the general correlation.
For instance, using the Ames Housing Data from Kaggle, we pick SalePrice as the label and explore which features are most correlated with it.
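A minimal setup, assuming the dataset has been downloaded locally (the CSV filename below is hypothetical):

import pandas as pd

# hypothetical local path to the Kaggle CSV
df = pd.read_csv('AmesHousing.csv')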
# numeric_only restricts the correlation to numeric columns (required in recent Pandas)
df.corr(numeric_only=True)['SalePrice'].sort_values(ascending=False).head()
SalePrice 1.000000
Overall Qual 0.799262
Gr Liv Area 0.706780
Garage Cars 0.647877
Garage Area 0.640401
Name: SalePrice, dtype: float64
Unsurprisingly, both the overall quality (Overall Qual) and the above-grade living area (Gr Liv Area) features are highly correlated with SalePrice, so a Machine Learning model could leverage these correlations to predict house sale prices.
Then, I systematically check for possible outliers by looking at a scatter plot of the chosen label against the two or three most correlated features. For instance:
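Plotting Gr Liv Area against SalePrice takes a single Seaborn call (the same call reappears below in the Cleaning section):

sns.scatterplot(data=df, x='Gr Liv Area', y='SalePrice')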

As we see in the picture, three data points have very high Gr Liv Area values, above 4000, with a SalePrice below 200000, qualifying them as outliers.
If the final purpose is to build a Machine Learning algorithm to predict sale price, these three points are very dangerous as they may completely mislead the model.
Given our purpose, we are leaning towards removing these three data points, but we don't know yet why these three houses were such a bargain. In fact, there might be unrecorded transactions, such as barter to make up the difference, or potentially fraudulent behavior. Before removing outliers we need to try and understand them better.
Understanding
Occasionally, outliers are signals of new, interesting dynamics just emerging in the data; they may deserve further study and should be kept. Often, they are just meaningless noise without any useful information, and they can seriously mislead our model.
Outliers can be due to many reasons:
- They may simply be due to expected fluctuations in the population
- A sensor may have had a temporary mismeasurement
- Humans may have tampered with the data, made a recording mistake or had fraudulent behavior
- Some external factor may have affected the feature or its measurement
- The underlying data-generating mechanism differs for extreme values (the king effect)
- A new phenomenon is taking place that is not part of our current understanding
For instance, let’s say we are modeling the likely price of used cars based on brand, year, and engine power. During our Exploratory Data Analysis (EDA) we discover that two or three specific cars in our data set were sold at 10 times the expected price, realizing that these data points are likely to affect our model significantly.
The importance of domain knowledge
Should we remove these outliers?
We cannot decide without acquiring some domain knowledge.
In scenario 1, after some investigation, we discover these three outliers are cars that previously belonged to a famous actor, purchased years ago by a single (rich) fan who gladly overpaid for them. Since our model is not designed to handle this type of rare, odd transaction, we can safely drop these data points.
Alternatively, in scenario 2, we sold these three cars in three independent transactions within the last few months. We also discover these cars have become old enough to be featured in recent issues of Classic Cars magazine. A new trend, therefore, may be emerging, potentially opening up a completely different market of classic-car aficionados. Suddenly these three outliers are far more interesting, as they may help represent future data in this segment.
Data Science is a team sport: the best data science organizations work together with the business, incorporating domain knowledge to make sense of the data and unveil insights and opportunities.
Cleaning
When we decide to remove the outliers, I use the powerful Python Pandas library. With our data stored in the df data frame, we pick the three outliers:
df[(df['Gr Liv Area'] > 4000) & (df['SalePrice'] < 200000)]

I remove these outliers from the data frame by retrieving their index and dropping them:
# index labels of the three outliers
outliers_index = df[(df['Gr Liv Area'] > 4000) & (df['SalePrice'] < 200000)].index
# drop the outlier rows (axis=0), then verify with a new scatter plot
df = df.drop(outliers_index, axis=0)
sns.scatterplot(data=df, x='Gr Liv Area', y='SalePrice')

Once satisfied, I save the cleaned data frame to a different .csv file for future use, using Pandas.
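A one-liner does the job (the output filename is hypothetical):

# write the cleaned data to a new CSV, without the row index
df.to_csv('AmesHousing_cleaned.csv', index=False)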
Documentation
I believe keeping documentation is super-important: to hand over the dataset to others, to remind myself months down the road, and to code possible filters when new data are integrated.
I leverage Jupyter notebooks, which make documentation extremely easy. What I do:
- I make a copy of the notebook (which probably reveals my age…:)
- On the copy I remove non-essential code, double-checks, etc.
- I add markdown cells with commentary, especially about the reasoning behind removing or keeping outliers.
Done – it takes me no more than 10 minutes.
Why I need a consistent workflow to handle outliers
Outliers are patterns in data that deviate from expected normal behaviour. Outliers may contain extremely important information, making them a fascinating topic with very interesting applications in domains such as medical diagnostics or malicious activity detection.
In many cases, however, outliers are just unwanted noise in the data, and we need to remove them before any data analysis or further modelling is performed. We must have a way to identify the “normal behaviour” in our data, especially when that normal behaviour keeps evolving and emerging patterns appear.
Acknowledging the complexity of the task, I propose a simple, robust workflow that requires only a few Python libraries to detect, understand, and clean outliers. I cannot stress enough the importance of incorporating domain knowledge, while leaving a documentation trail to use as a reference if novel, emerging patterns are confirmed by new data.