On Solving ML Data Quality Problems

Poor data quality negatively affects many data-processing efforts; therefore, we must address data quality for reliable modeling outcomes.

Responsible machine learning starts with good-quality data, as models are only as good as the data on which they were trained. Without quality data, decision-makers cannot trust ML-powered applications to help them make informed decisions. So, below I have compiled a list of common data quality problems along with their corresponding feature engineering solutions.

Incomplete Data

Incomplete data should not be confused with missing data. Missing data occurs when individual observations lack values for some variables, whereas a dataset is incomplete when it does not contain the information needed to describe the phenomenon of interest. Easy-to-spot symptoms of incomplete datasets are useless models and meaningless results.

For instance, suppose you conduct a survey to understand the dietary habits of a population, but you only collect information from adults about their food intake and do not include the habits of children. This dataset would be incomplete because it lacks information on a significant portion of the population, and any results may not accurately represent that population.

Solution: Unfortunately, there is no methodology to fix an incomplete dataset, because we cannot force data to understand, predict, explain, or describe a phenomenon it never captured. The best option is to collect more and better data and set aside the original incomplete dataset.

Biased Data

Unlike incomplete data, biased data contains information about the phenomenon of interest; however, that information is consistently and systematically wrong.

Bias, to some extent, can be found in any data due to personal biases influencing what data to collect and how it's collected. Additionally, how we define things we collect data on can also introduce bias. If our understanding of the world is shaped by our subjective interpretations of reality, then human biases can poison data collection, analysis, and interpretability.

Examples of biased data include a study on the effectiveness of a medication that only collects data from patients who had a positive experience (it will not represent the broader patient population), or studies built on subjectively defined concepts such as "success," "mental health," or "crime."

Solution: Unfortunately, again, there is no analytical remedy for biased data. The only course of action is to dispose of the original biased dataset and find more and better data.

Wide Data

A dataset is in wide format when each variable has its own column and each value in the first (identifier) column is unique. Wide (or long) data is not inherently wrong; in many cases, it makes sense to keep a dataset in a particular format.

For example, a wide data format is helpful for calculating averages of observations (row values) across features (columns). However, if the goal is to visualize multiple variables on a single plot, a long format may be better.
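
To make the reshaping concrete, here is a minimal sketch (using a small, hypothetical table of yearly sales) that converts a wide dataset to long format with pandas' melt():

import pandas as pd

# hypothetical wide-format data: one sales column per year
wide = pd.DataFrame({
    'store': ['A', 'B'],
    'sales_2021': [100, 150],
    'sales_2022': [120, 170],
})

# reshape to long format: one row per (store, year) pair, which is handy for plotting
long = wide.melt(id_vars='store', var_name='year', value_name='sales')
print(long)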

Wide data becomes a problem when you encounter lengthy, intolerable compute times or face the curse of dimensionality, which renders models meaningless.

Solution: I would recommend the following three solutions to overcome wide data problems:

Feature Selection

Finding the best subset of the original variables in a dataset, typically by measuring each variable’s relationship to the target and keeping the variables with the strongest relationships.

from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression

# generate sample data
X, y = make_regression(n_samples=100, n_features=50, n_informative=10, noise=0.5)

# create a linear regression model
model = LinearRegression()

# select the top 10 features using RFE
selector = RFE(model, n_features_to_select=10)
selector.fit(X, y)

# print the selected features
selected_features = [i for i, x in enumerate(selector.support_) if x]
print("Selected Features:", selected_features)

In this example, we first generate sample data with 50 features, 10 of which are informative. Then we create a linear regression model and use Recursive Feature Elimination (RFE) to select the top 10 features.

Feature Extraction

Combining the original variables in a data set into a new, smaller set of more representative variables, very often using unsupervised learning methods.

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

# load the iris dataset
iris = load_iris()
X = iris.data

# create a PCA object and fit it to the data
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X)

# print the shape of the original and transformed data
print("Original shape:", X.shape)
print("Transformed shape:", X_pca.shape)

In this example, we load the iris dataset with 150 samples and 4 features. We then use PCA to reduce the data from 4 dimensions to 2, fitting the PCA object to the data and transforming it to obtain the new 2-dimensional representation. Lastly, we print the shapes of the original and transformed data to show the dimensionality reduction.

L1 Regularization (Lasso Regression)

Lasso is a regularization technique that performs feature selection through a form of linear regression with shrinkage. L1 regularization adds a penalty equal to the absolute value of the magnitude of the coefficients, which can drive some coefficients to exactly zero, producing a simpler model.

In short: shrinking less important features' coefficients to zero works well for feature selection.

from sklearn.linear_model import Lasso
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# load the California Housing dataset
california = fetch_california_housing()

# split the dataset into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(california.data, california.target, test_size=0.2, random_state=42)

# create a Lasso object and fit it to the training data
lasso = Lasso(alpha=0.1)
lasso.fit(X_train, y_train)

# make predictions on the testing data
y_pred = lasso.predict(X_test)

# calculate the mean squared error
mse = mean_squared_error(y_test, y_pred)
print("Mean Squared Error:", mse)

Sparse Data

Not to be confused with missing values, sparse data is when most of the values for a variable are empty or zero. With sparse data, the values are known but carry little information; with missing data, the data points are unknown.

Take customer purchase data, for example. If each column in a dataset represents whether an individual product was purchased, there will likely be many zero or empty cells, which could hurt the accuracy of a model that predicts future purchases or make it difficult to identify spending patterns.

Sparse data can negatively affect machine learning performance by limiting a model's ability to make accurate predictions and by increasing its space and time complexity. In some cases, however, sparseness is a non-issue because certain algorithms handle it automatically and elegantly; others do not.

Solution: Three options to deal with sparse data are the following:

  • Reduce noise by removing sparse features from the model.
  • Make features denser by using methods such as Principal Component Analysis or Feature Hashing (see the sketch after this list).
  • Appropriate algorithm selection: use models that are robust to sparse features.
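
For the second option, here is a minimal sketch (with a hypothetical list of purchase records) using scikit-learn's FeatureHasher to collapse many sparse product-indicator columns into a small, fixed number of hashed features:

from sklearn.feature_extraction import FeatureHasher

# hypothetical purchase records: one dict per customer, mapping product to quantity
purchases = [
    {'apples': 1, 'bread': 2},
    {'bread': 1, 'coffee': 3},
    {'apples': 2, 'tea': 1},
]

# hash the many possible product columns into a fixed number of features
hasher = FeatureHasher(n_features=8, input_type='dict')
X_hashed = hasher.transform(purchases)

# the result is a compact 3 x 8 matrix instead of one column per product
print('Hashed feature matrix shape:', X_hashed.shape)
print(X_hashed.toarray())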

Imbalanced Data: Inputs & Targets

A case of imbalanced data typically arises when the classes in classification problems are not represented equally. For example, the dataset below with 100 instances contains two different labels: "A" and "B."

id    label   col3   col4
1     A       51     67
2     B       7      20
...   ...     ...    ...
97    A       73     20
98    B       16     9
100   B       69     53

In this two-class classification problem, 80 percent of the instances are labeled "B" and the remaining 20 percent are labeled "A." Since the split is 80:20, the dataset is imbalanced. As a result, the classification problem faces the accuracy paradox, where a model with a seemingly decent accuracy (say, 80 percent) may merely be reflecting the underlying label distribution.

Imbalanced data is a likely cause of single-class and biased model predictions.

Solution: There are a few options to tackle imbalanced data:

Collect More Data

Although pretty obvious, it is sometimes overlooked.

Proportional Sampling

Oversample the rows containing the rare class (or undersample the rows containing the common class) so the class proportions become more balanced. Both oversampling and undersampling artificially inflate the relative frequency of rare events, which helps models learn to predict them but can make interpreting results, particularly predicted probabilities, more challenging.

import pandas as pd
from sklearn.utils import resample

# load the imbalanced dataset
df = pd.DataFrame({'class': ['A']*20 + ['B']*80})

# count the number of instances for each class
class_counts = df['class'].value_counts()

# determine the smallest class size
min_class_size = class_counts.min()

# resample each class to the smallest class size
sampled_df = pd.concat([resample(df[df['class'] == c], n_samples=min_class_size, replace=False, random_state=42) for c in class_counts.index])

# verify the class balance of the sampled dataset
print(sampled_df['class'].value_counts())

In this example, we create an imbalanced dataset with 20 instances labeled "A" and 80 instances labeled "B." Then, we count the number of instances for each class and determine the smallest class size. Next, we use the resample function from scikit-learn to resample each class to the smallest class size, using the n_samples, replace, and random_state parameters to control the sampling. Lastly, we verify the class balance of the sampled dataset by printing the number of instances for each class.

Change Performance Metric

Given that accuracy is a misleading metric when working with imbalanced data, try using different performance measures that may grant more insight. Some popular metrics include the following (a quick sketch computing them follows the list):

  • Confusion Matrix
  • Precision
  • Recall
  • F1 Score
  • ROC Curve
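
As a quick sketch with made-up labels and predictions for an imbalanced problem, scikit-learn can compute all of these in a few lines, and the output shows how a high accuracy can hide a low recall (the accuracy paradox in action):

from sklearn.metrics import (accuracy_score, confusion_matrix, precision_score,
                             recall_score, f1_score, roc_auc_score)

# made-up ground truth and predictions: 90 negatives, 10 positives
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 90 + [1] * 3 + [0] * 7                       # the model finds only 3 of the 10 positives
y_score = [0.1] * 80 + [0.5] * 10 + [0.9] * 3 + [0.4] * 7   # hypothetical predicted probabilities

print('Accuracy:', accuracy_score(y_true, y_pred))          # high, but misleading
print('Confusion matrix:\n', confusion_matrix(y_true, y_pred))
print('Precision:', precision_score(y_true, y_pred))
print('Recall:', recall_score(y_true, y_pred))              # low: most positives are missed
print('F1 Score:', f1_score(y_true, y_pred))
print('ROC AUC:', roc_auc_score(y_true, y_score))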

A personal favorite of mine is the "capture rate," which measures the proportion of positive instances correctly identified within a top-scoring slice of predictions. For example, if we train a model for a classification problem with 900 negative instances and 100 positive instances, we can sort the predictions by their probability scores and calculate the capture rate at different thresholds. Speaking fictitiously, we might look at the top 20 percent of scores and find 80 of the positives there while 20 fall below the cutoff, for a capture rate of 80 percent at that threshold.
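
Capture rate is not a built-in scikit-learn metric, so here is a minimal hand-rolled sketch, assuming you already have arrays of true labels and predicted probabilities; it sorts the predictions by score and checks what fraction of all positives lands in the top 20 percent:

import numpy as np

def capture_rate(y_true, y_score, top_fraction=0.2):
    # fraction of all positives that fall in the top-scoring slice of predictions
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    n_top = int(len(y_score) * top_fraction)
    top_idx = np.argsort(y_score)[::-1][:n_top]   # indices of the highest scores
    return y_true[top_idx].sum() / y_true.sum()

# hypothetical labels and scores: 900 negatives, 100 positives
rng = np.random.default_rng(42)
y_true = np.array([0] * 900 + [1] * 100)
y_score = np.where(y_true == 1, rng.uniform(0.4, 1.0, 1000), rng.uniform(0.0, 0.7, 1000))

print('Capture rate at top 20%:', capture_rate(y_true, y_score, top_fraction=0.2))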

Ultimately, I'd recommend using many metrics in conjunction with each other for a more complete picture of a model's performance.

Outliers

Data outliers can seriously spoil and mislead the training of machine learning algorithms that are sensitive to the range and distribution of attribute values. More specifically, outliers damage predictive models that minimize squared error: the model works to shrink the disproportionately large squared residuals produced by outliers at the expense of fitting the more reliable data points.

Additionally, extreme values increase the variability of models, which decreases statistical power and violates the basic assumptions of popular methods such as regression, ANOVA, and other statistical modeling approaches.

Solutions: If you detect outliers in your data, consider discretization and winsorizing:

Discretization

Changing a numeric variable into an ordinal or nominal categorical variable based on value ranges of the original numeric variable. Discretization can also be referred to as "binning." Discretization has many benefits:

  • When restricted to linear models, binning helps introduce nonlinearity because each bin in a variable gets its own parameter.
  • Binning smoothes complex signals in training data, often decreasing overfitting.
  • Binning deals with missing values elegantly by assigning them to their own bin.
  • Binning handles outliers elegantly by assigning all outlying values to the ‘high’ or ‘low’ bin in training and new data.

import pandas as pd
import numpy as np

# load the dataset
df = pd.read_csv('data.csv')

# create bins for the 'age' column using quantiles
df['age_bin'] = pd.qcut(df['age'], q=4, labels=False)

# print the frequency distribution of the bins
print(df['age_bin'].value_counts())

In this example, we load the dataset and bin the age column into 4 bins using quantiles with pd.qcut(). We create a new column age_bin to store the binned values and verify the success of binning by printing the frequency distribution of the bins.

Winsorizing

Replacing a variable’s outlying values with more central values of that variable.

import pandas as pd
import numpy as np

# load the dataset
df = pd.read_csv('data.csv')

# print summary statistics of the 'age' column before winsorizing
print("Before winsorizing:\n", df['age'].describe())

# winsorize the 'age' column to the 5th and 95th percentiles
age_winsorized = np.clip(df['age'], np.percentile(df['age'], 5), np.percentile(df['age'], 95))

# replace the original 'age' column with the winsorized values
df['age'] = age_winsorized

# print summary statistics of the 'age' column after winsorizing
print("\nAfter winsorizing:\n", df['age'].describe())

In this example, we winsorize the 'age' column by clipping the extreme values to the 5th and 95th percentiles using np.clip(). We print summary statistics before and after replacing the column so the two distributions can be compared.

Missing Values

Missing values occur when no data is stored for a variable in an observation of interest.

Solution: When facing missing data points, you can replace missing values with values inferred from the rest of the data, known as imputation, or try discretization (binning, with a separate bin for missing values). A tree-based algorithm or a Naive Bayes classifier may be a good choice here.

import pandas as pd
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# load the dataset
df = pd.read_csv('data.csv')

# split the dataset into training and testing sets
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)

# create an imputer object using an ExtraTreesRegressor estimator
imputer = IterativeImputer(estimator=ExtraTreesRegressor(n_estimators=10, random_state=42), random_state=42)

# fit the imputer object to the training data
imputer.fit(train_df)

# impute the missing values in the training and testing data
train_df_imputed = pd.DataFrame(imputer.transform(train_df), columns=train_df.columns)
test_df_imputed = pd.DataFrame(imputer.transform(test_df), columns=test_df.columns)

# fit a simple linear model on the imputed training data and evaluate it on the imputed testing data
model = LinearRegression()
model.fit(train_df_imputed.drop('target', axis=1), train_df_imputed['target'])
y_pred = model.predict(test_df_imputed.drop('target', axis=1))
mse = mean_squared_error(test_df_imputed['target'], y_pred)
print("Mean Squared Error:", mse)

In this example, we load the dataset, split it into training and testing sets, and use IterativeImputer from scikit-learn with an ExtraTreesRegressor estimator to impute missing values in both sets. We then fit a simple linear model on the imputed training set, predict on the imputed testing set, and calculate the mean squared error.

The imputer estimates each missing value from the relationships between variables, here using a forest of extremely randomized trees, and iterates until the imputed values converge, resulting in a complete dataset for further analysis or modeling.

Character Variables

Also known as "strings" or factor variables, character variables are recognized by computers as text. They can pose data wrangling and cleaning challenges, resulting in information loss, computational errors, and biased results.

Solution: Encoding and the same solutions for missing values are appropriate fixes for character variables.

Encoding

Changing the representation of a variable. In data mining applications, categorical character variables are often encoded as numeric variables so they can be used with algorithms that cannot accept character or categorical inputs.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# load the dataset
df = pd.read_csv('data.csv')

# convert the 'gender' column to a categorical variable
df['gender'] = pd.Categorical(df['gender'])

# create a one-hot encoder object
encoder = OneHotEncoder()

# encode the 'gender' column using one-hot encoding
gender_encoded = encoder.fit_transform(df[['gender']])

# print the encoded categories
print(encoder.categories_)

In this example, we convert the 'gender' column to categorical using pd.Categorical() and encode it using one-hot encoding with OneHotEncoder() from scikit-learn. The encoded matrix is stored in gender_encoded, and we print encoder.categories_ to verify which categories were encoded.

High Cardinality Categorical Variables

Cardinality refers to the number of possible values a categorical variable can assume. For example, the variable "NBA Team" has 30 possible values because there are 30 teams in the NBA. High cardinality is when a variable has too many (often rare) categorical values, which causes problems such as overfitting, memory issues, and huge encoded matrices.

Solution: To handle high cardinality, our options include discretization and target encoding (categorical or numeric).

import pandas as pd
from category_encoders import TargetEncoder

# load the dataset
df = pd.read_csv('data.csv')

# create a target encoder object
encoder = TargetEncoder(cols=['NBA Team'])

# fit the encoder to the data and transform the 'NBA Team' column
df_encoded = encoder.fit_transform(df['NBA Team'], df['Points'])

# print the encoded categories
print(df_encoded)

In this example, we first load the dataset and create a target encoder object using TargetEncoder() from the category_encoders library. We specify the 'NBA Team' column as the column to encode and fit the encoder using the 'Points' column as the target variable. We then transform the 'NBA Team' column with the trained encoder, store the result in df_encoded, and print it to verify the encoding.

Disparate Variable Scales

When variables are not measured using the same scale, you have a case of disparate variable scales. The lack of uniformity in measuring variables can create standardization problems that prevent comparison.

Solution: To fix disparate variable scales, you will need to implement standardization, which is enforcing similar scales on a set of variables.

import pandas as pd
from sklearn.preprocessing import StandardScaler

# load the dataset
df = pd.read_csv('data.csv')

# create a standard scaler object
scaler = StandardScaler()

# scale the 'feature1' and 'feature2' columns using the scaler object
df[['feature1', 'feature2']] = scaler.fit_transform(df[['feature1', 'feature2']])

# print the standardized dataset
print(df)

In this example, we use StandardScaler() from scikit-learn to standardize the 'feature1' and 'feature2' columns in the dataset. The standardized dataset is printed to verify the success of the standardization process.

Basically, standardization transforms each variable in the dataset to have a mean of 0 and a standard deviation of 1, so all variables end up measured on the same scale.

Strong Multicollinearity

Multicollinearity is a statistical concept where two or more independent variables are correlated with each other and consequently carry overlapping information about the dependent variable. In other words, if two or more predictor variables are perfectly correlated, neither adds unique predictive power.
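
Before picking a fix, it can help to confirm that multicollinearity is actually present. A minimal sketch using variance inflation factors from statsmodels (assuming the same hypothetical data.csv with a 'y' column used below) might look like this:

import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# load the predictors (hypothetical file and columns)
X = pd.read_csv('data.csv').drop('y', axis=1)

# add an intercept column so the VIFs are computed against a proper design matrix
X = add_constant(X)

# compute the VIF for each predictor; values above roughly 5-10 suggest strong multicollinearity
vif = pd.Series(
    [variance_inflation_factor(X.values, i) for i in range(X.shape[1])],
    index=X.columns,
)
print(vif)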

Solution: Remedies for multicollinearity include feature selection, feature extraction, and L2 regularization.

L2 Regularization

L2 regularization helps deal with multicollinearity by shrinking the coefficients while keeping all variables in the model. It lets the data estimate the contribution of each predictor and penalizes the coefficients of less important predictors rather than dropping them outright.

import pandas as pd
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler

# load the dataset
df = pd.read_csv('data.csv')

# separate the dependent variable from the independent variables
X = df.drop('y', axis=1)
y = df['y']

# standardize the independent variables
scaler = StandardScaler()
X = scaler.fit_transform(X)

# create a ridge regression object with a regularization strength of 0.5
ridge = Ridge(alpha=0.5)

# fit the ridge regression object to the data
ridge.fit(X, y)

# print the coefficients of the ridge regression object
print(ridge.coef_)

In this example, we use Ridge Regression with L2 regularization to handle multicollinearity. The independent variables are standardized using StandardScaler(), and the ridge regression object is fitted to the data with a regularization strength of 0.5. The coefficients of the ridge regression object are printed to verify the success of the regularization process.

Dirty Data

Dirty data comes in all shapes and sizes, but it usually causes one or more of the following issues:

  • Information Loss
  • Biased Models and Inaccurate Results
  • Lengthy, Intolerable Compute Times
  • Unstable Parameter Estimates
  • Out-of-Domain Predictions

Solution: A combination of the solution strategies explained above.

Conclusion

Part of what makes data science intriguing is the variety of challenges you face when encountering new problems. Hopefully, after reading this article, you have a general idea of what some of those problems look like and where to start solving them with respect to data quality.
