10 Imputation Techniques for Dealing with Outliers

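Imputation is the process of replacing missing or outlier values in a dataset with substituted values. This is often done to improve the quality of the data for analysis or modeling. There are various imputation techniques, and the choice of method depends on the nature of the data and the analysis goals. The snippets below assume that NumPy is imported as np, that data (or df) is a pandas DataFrame, and that condition (or outliers_condition) is a boolean mask that is True for the rows treated as outliers.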
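The snippets in this list all use a boolean outlier mask without showing how to build one. A minimal sketch, assuming Tukey's 1.5 * IQR rule and a small made-up DataFrame (the data, the rule, and the threshold are illustrative assumptions, not something the article prescribes):

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): 100.0 is an obvious outlier.
data = pd.DataFrame({"feature": [10.0, 12.0, 11.0, 13.0, 100.0, 12.0, 11.0, 9.0]})

# Tukey's rule: flag points more than 1.5 * IQR outside the quartiles.
q1 = data["feature"].quantile(0.25)
q3 = data["feature"].quantile(0.75)
iqr = q3 - q1
condition = (data["feature"] < q1 - 1.5 * iqr) | (data["feature"] > q3 + 1.5 * iqr)

print(data[condition])  # rows flagged as outliers
```

Other rules (z-scores, domain thresholds) produce the same kind of mask and can be swapped in without changing the imputation code.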
1. Mean Imputation:
- Replace outliers with the mean of the variable.
data_without_outliers = data.copy()
# Compute the mean from the non-outlier rows so the outliers do not skew it
data_without_outliers['feature'] = np.where(condition, data.loc[~condition, 'feature'].mean(), data['feature'])
2. Median Imputation:
- Replace outliers with the median of the variable.
data_without_outliers = data.copy()
data_without_outliers['feature'] = np.where(condition, data['feature'].median(), data['feature'])
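To see why the choice of statistic matters, here is a small sketch on made-up numbers. The data and the threshold used to flag the outlier are assumptions for illustration only. It compares the mean of all rows, the mean of the non-outlier rows, and the median of all rows:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"feature": [10.0, 12.0, 11.0, 13.0, 100.0]})
condition = data["feature"] > 50  # hypothetical outlier rule for this toy data

mean_all = data["feature"].mean()                    # pulled up by the outlier
mean_clean = data.loc[~condition, "feature"].mean()  # mean of the non-outlier rows
median_all = data["feature"].median()                # robust to the single outlier

print(mean_all, mean_clean, median_all)  # 29.2 11.5 12.0

# Impute with the mean of the clean rows.
data_without_outliers = data.copy()
data_without_outliers["feature"] = np.where(condition, mean_clean, data["feature"])
```

The mean over all rows is dominated by the value being replaced, while the median and the clean-row mean stay close to the bulk of the data.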
3. Mode Imputation:
- Replace outliers with the mode (most frequent value) of the variable. This works best for discrete or categorical data.
data_without_outliers = data.copy()
mode_value = data['feature'].mode().iloc[0]
data_without_outliers['feature'] = np.where(condition, mode_value, data['feature'])
4. Constant Imputation:
- Replace outliers with a predefined constant value.
constant_value = 0 # Adjust as needed
data_without_outliers = data.copy()
data_without_outliers['feature'] = np.where(condition, constant_value, data['feature'])
5. Regression Imputation:
- Use regression models to predict missing or outlier values based on other features.
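from sklearn.linear_model import LinearRegression
# X_train, y_train: the other features and 'feature' for the non-outlier rows
# X_outliers: the other features for the outlier rows
impute_model = LinearRegression()
impute_model.fit(X_train, y_train)
imputed_values = impute_model.predict(X_outliers)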
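The regression snippet leaves the training and prediction sets undefined. One way to build them, sketched on a made-up two-column DataFrame (the column name 'other', the data, and the outlier threshold are all assumptions), is to fit on the non-outlier rows and predict only the flagged rows:

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Hypothetical data: 'feature' equals 2 * 'other', except for one outlier.
data = pd.DataFrame({
    "other":   [1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
    "feature": [2.0, 4.0, 6.0, 80.0, 10.0, 12.0],
})
condition = data["feature"] > 50  # hypothetical outlier rule

# Fit on the clean rows only, then predict the flagged rows.
X_train = data.loc[~condition, ["other"]]
y_train = data.loc[~condition, "feature"]
X_outliers = data.loc[condition, ["other"]]

impute_model = LinearRegression().fit(X_train, y_train)
imputed_values = impute_model.predict(X_outliers)

data_without_outliers = data.copy()
data_without_outliers.loc[condition, "feature"] = imputed_values
print(data_without_outliers)
```

Fitting on rows that include the outliers would let them distort the very model used to replace them, which is why they are excluded here.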
6. K-Nearest Neighbors Imputation:
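- Replace outliers with the average of the values from their k-nearest neighbors.
from sklearn.impute import KNNImputer
# KNNImputer only fills NaN, so mask the outliers first (assumes all-numeric columns)
data_masked = data.copy()
data_masked.loc[condition, 'feature'] = np.nan
imputer = KNNImputer(n_neighbors=5)
data_without_outliers = imputer.fit_transform(data_masked)  # returns a NumPy array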
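KNNImputer fills only missing (NaN) entries, so flagged outliers have to be set to NaN before it can replace them. A minimal sketch on made-up data (the column 'other', the values, the threshold, and n_neighbors=2 are assumptions chosen to keep the example small):

```python
import numpy as np
import pandas as pd
from sklearn.impute import KNNImputer

data = pd.DataFrame({
    "other":   [1.0, 2.0, 3.0, 4.0, 5.0],
    "feature": [10.0, 20.0, 300.0, 40.0, 50.0],
})
condition = data["feature"] > 100  # hypothetical outlier rule

# Mask the outliers as NaN so the imputer treats them as missing.
masked = data.copy()
masked.loc[condition, "feature"] = np.nan

imputer = KNNImputer(n_neighbors=2)
data_without_outliers = pd.DataFrame(
    imputer.fit_transform(masked),  # fit_transform returns a NumPy array
    columns=data.columns,
    index=data.index,
)
print(data_without_outliers)
```

Here the two nearest rows by 'other' have feature values 20 and 40, so the outlier is replaced by their average. Wrapping the result in a DataFrame restores the column names that the array output drops.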
7. Interpolation Imputation:
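- Replace outliers by interpolating between neighboring values. This suits ordered data such as time series.
# interpolate() only fills NaN, so mask the outliers first
data_without_outliers = data.assign(feature=data['feature'].mask(condition)).interpolate(method='linear', limit_direction='both')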
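pandas interpolate() fills only NaN entries, so an outlier has to be masked before its neighbors can be used to replace it. A short sketch on a made-up ordered series (the data and the threshold are assumptions):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({"feature": [1.0, 2.0, 50.0, 4.0, 5.0]})
condition = data["feature"] > 10  # hypothetical outlier rule

# Mask the outlier, then interpolate linearly between its neighbors.
masked = data.copy()
masked.loc[condition, "feature"] = np.nan
data_without_outliers = masked.interpolate(method="linear", limit_direction="both")

print(data_without_outliers["feature"].tolist())  # [1.0, 2.0, 3.0, 4.0, 5.0]
```

The result depends on row order, so this approach only makes sense when neighboring rows are genuinely related.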
8. Random Imputation:
- Replace outliers with random values drawn from a normal distribution fitted to the non-outlier values of the variable.
data_without_outliers = data.copy()
clean = data.loc[~condition, 'feature']
# Assumes 'feature' is roughly normal; the outliers are excluded from the fit
random_values = np.random.normal(clean.mean(), clean.std(), size=condition.sum())
data_without_outliers.loc[condition, 'feature'] = random_values
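When the variable is clearly not normal, one alternative is to resample from the observed non-outlier values instead of from a fitted normal distribution. This is a swapped-in variant rather than the snippet above, sketched on made-up data with an assumed threshold and a fixed seed for reproducibility:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # seeded generator so the run is repeatable

data = pd.DataFrame({"feature": [10.0, 12.0, 11.0, 13.0, 100.0, 12.0, -90.0]})
condition = data["feature"].abs() > 50  # hypothetical outlier rule

# Draw replacements from the empirical distribution of the clean values.
clean_values = data.loc[~condition, "feature"].to_numpy()
random_values = rng.choice(clean_values, size=int(condition.sum()), replace=True)

data_without_outliers = data.copy()
data_without_outliers.loc[condition, "feature"] = random_values
print(data_without_outliers)
```

Every imputed value is one that actually occurs in the data, so the replacement can never fall outside the observed range.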
9. Custom Imputation:
- Replace outliers with a custom value chosen for your use case.
# Assuming 'df' is your DataFrame, 'column' holds the numerical data,
# and 'outliers_condition' is a boolean mask marking the outliers
custom_value = 999
df['column_imputed_custom'] = np.where(outliers_condition, custom_value, df['column'])
10. Multiple Imputation (using fancyimpute library):
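- Iteratively model each feature from the others to estimate replacement values. A single run yields one completed dataset; multiple imputation repeats the run with different random seeds and pools the results.
from fancyimpute import IterativeImputer
# Assuming 'df' is an all-numeric DataFrame and 'outliers_condition' masks the outliers in 'column'
# IterativeImputer only fills NaN, so mask the outliers first
df_masked = df.copy()
df_masked.loc[outliers_condition, 'column'] = np.nan
# Fit on all columns: with a single column there is nothing to regress on
imputer = IterativeImputer()
df['column_imputed_iterative'] = imputer.fit_transform(df_masked)[:, df.columns.get_loc('column')]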
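A single IterativeImputer run produces one completed dataset. A rough sketch of drawing several imputations is shown here with scikit-learn's IterativeImputer (an assumption made for this example; the import above uses fancyimpute), with sample_posterior=True and a different random_state per run. The synthetic data, the column 'other', the threshold, and the choice of five runs are all illustrative:

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the import below)
from sklearn.impute import IterativeImputer

# Synthetic data (hypothetical): 'column' is about 2 * 'other', plus one injected outlier.
rng = np.random.default_rng(0)
other = rng.normal(size=50)
df = pd.DataFrame({"other": other, "column": 2 * other + rng.normal(scale=0.1, size=50)})
df.loc[0, "column"] = 500.0
outliers_condition = df["column"].abs() > 100  # hypothetical outlier rule

# The imputer only fills NaN, so mask the outlier first.
masked = df.copy()
masked.loc[outliers_condition, "column"] = np.nan

# Draw several imputations by sampling from the posterior with different seeds.
completed = []
for seed in range(5):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    filled = imputer.fit_transform(masked)
    completed.append(filled[outliers_condition.to_numpy(), 1])  # column index 1 is 'column'

pooled = np.mean(completed, axis=0)  # simple pooled estimate across the runs
print(pooled)
```

The spread across the five draws gives a rough sense of how uncertain the imputed value is, which a single imputation hides.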
Choose the imputation method based on the characteristics of your data and the analysis context. Be cautious when imputing values, as it introduces assumptions and may impact the results of your analysis or modeling.