The effect of (bad) Data Quality on Model Accuracy in Supervised Machine Learning — Results from Simulation Studies

Gerhard Svolba
14 min read · Feb 23, 2022

Data quality obviously plays an important role in data science and in the validity of the results you obtain. This article addresses the case where data quality cannot be improved retrospectively and you need to perform your analyses on the data as they are.

Simulation studies for supervised machine learning have been run to quantify the effect of (bad) data quality on the accuracy of your results, compared to a dataset with no data quality problems.

The following four data quality criteria have been studied in these simulation case studies:

  • Data Quantity — How much data do I need? The effect of different numbers of available observations and events on the model outcome is studied.
  • Data Availability — The consequences of withholding the set of the most important variables are analyzed in these case studies.
  • Data Correctness — “Bias in the data”: What happens if random and systematic bias is introduced into the input and target variables in supervised machine learning?
  • Data Completeness — How do missing values affect predictive power?

In this article the simulation results for the first 3 criteria are discussed in detail. Details for “Data Completeness / Missing values” are illustrated in a separate article: Quantifying the Effect of Missing Values on Model Accuracy in Supervised Machine Learning Models


General simulation procedure

Preprocessing

For the simulations, a supervised machine learning task with a binary target variable has been used. The data are taken from four real-life datasets from different industries.

In order to have a “perfect” starting dataset for the simulations, the data have been preprocessed in the following way, which leads to a training dataset with no missing values (a short sketch follows the list):

  • If a variable has more than 5% missing values → the variable is dropped
  • Observations with missing values for the remaining variables (≤ 5% of missing values) are removed from the analysis.
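
As a minimal illustration, the preprocessing could look like this in Python/pandas (the original study was implemented in SAS; the 5% threshold comes from the rules above, everything else is illustrative):

```python
import pandas as pd

def make_clean_baseline(df: pd.DataFrame, max_missing_share: float = 0.05) -> pd.DataFrame:
    """Build the 'perfect' starting dataset with no missing values."""
    # Drop every variable with more than 5% missing values
    missing_share = df.isna().mean()
    keep_cols = missing_share[missing_share <= max_missing_share].index
    # Drop the remaining observations that still contain missing values
    return df[keep_cols].dropna()
```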

Multiple modeling cycles are run to retrieve a stable model with good predictive power. The list of variables of this model is frozen for use in the simulations.

Running the simulations

The following graph shows the procedure for the simulations and their evaluation. The data are split into training and validation partitions. The test partition is used as “scoring data” to mirror a dataset that has not been used in model training. This makes it possible to evaluate the effect of data quality problems (such as missing or biased values) in model scoring.

In the next step (blue rectangle), “scenario specific” pre-treatment of the data is performed. This includes, for example:

  • the deletion of records for the analysis of data quantity
  • the removal of input variables
  • the introduction of biases to the data
  • the insertion of missing values

Finally, the regression model with the frozen set of variables is trained on these data and the results are evaluated.
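
The following hedged Python sketch illustrates one simulation run along these steps. The original study used SAS Enterprise Miner; here, a scikit-learn logistic regression stands in for the frozen-variable regression model, and `apply_scenario` is a placeholder for the scenario-specific pre-treatment:

```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

def run_scenario(df, frozen_vars, target, apply_scenario, bias_scoring_data=False):
    """Split the data, apply the scenario pre-treatment, train the frozen-variable
    model, and evaluate the %Response in the top 5% of scored customers."""
    train, score = train_test_split(df, test_size=0.3, stratify=df[target], random_state=42)

    # Scenario-specific pre-treatment: delete records, remove variables,
    # introduce bias, or insert missing values
    train = apply_scenario(train)
    if bias_scoring_data:
        # Optionally disturb the scoring data as well (training AND scoring biased)
        score = apply_scenario(score)

    model = LogisticRegression(max_iter=1000)
    model.fit(train[frozen_vars], train[target])

    # Score the held-out partition and compute the response rate in the top 5%
    score = score.assign(p=model.predict_proba(score[frozen_vars])[:, 1])
    top = score.nlargest(max(1, int(0.05 * len(score))), "p")
    return top[target].mean()
```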

Possible bias in the models in the simulation scenarios

Note that providing a predefined set of optimal variables for the predictive models in the different simulation scenarios generates a bias toward too optimistic model performance, because the model does not need to find the optimal set of predictor variables itself. For data with data quality problems, such as too few observations, high numbers of missing values, or bias in the input data, a variable selection performed on these data may not result in the best set of predictors.

Data quality can impact the selection of the optimal inputs. Thus, the simulation scenarios are influenced by this a priori variable selection (and, consequently, by a priori knowledge). Nevertheless, it has been decided to provide a predefined set of variables in order to compare apples to apples and to remove possible bias in the scenario results caused by differences in variable selection.

Framework for a business case calculation

Introduction

A fictional reference company is used to calculate a business case based on the outcome of the different simulation scenarios. The change in model quality is transferred into a response rate and the respective profit is expressed in US dollars.

This is done in order to illustrate the effect of different data quality changes from a business perspective. As a note of caution, the numbers in US dollars should only be considered rough indicators based on the assumptions of the simulation scenarios and on the business case as described here. In individual cases, these values and relationships may differ.

The reference company Quality DataCom

The reference company, Quality DataCom, used in the following sections operates in the communications industry. The company has 2 million customers and runs campaigns to promote the cross- and up-sell of its products and services. In a typical campaign, about 100,000 customers (5% of the customer base) with the highest response probability are contacted. The average response of a customer to an offer (offer take-up or product upgrade) represents a profit of $25.

Assume that you use an analytic model that correctly predicts 19% of the positive responses in the top 5% of the customer base. Response here means that the campaign contact results in a product purchase or upgrade. In total, this leads to 19,000 responding customers in this campaign, who generate a profit of $475,000 (19,000 x $25).
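
A quick back-of-the-envelope check of this baseline business case (all figures are taken from the text; the Python snippet is only for illustration):

```python
customer_base = 2_000_000     # customers of Quality DataCom
contact_share = 0.05          # top 5% with the highest response probability
response_rate = 0.19          # 19% of the contacted customers respond
profit_per_response = 25      # USD per responding customer

contacted = customer_base * contact_share    # 100,000 contacted customers
responders = contacted * response_rate       # 19,000 responding customers
profit = responders * profit_per_response    # 475,000 USD campaign profit
print(f"Expected campaign profit: ${profit:,.0f}")
```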

Data Quantity — How much data do I need?

For the simulations, the variation of the data quantity has been performed in two different ways:

The variation of the number of events (blue line)

Here events are gradually deleted from the data in order to study the change in predictive power if fewer events are available for modeling. Non-events are not deleted from the data.

This reflects the frequently encountered situation where there is a sufficient number of non-events in the data, but only a limited number of observations with events has been observed.

The variation of the number of events only, while keeping the number of non-events constant, implicitly causes a variation of the sampling ratio between events and non-events. This, however, is not the focus of the study that is performed here.

The variation of the number of observations in general (red line)

Here the number of observations is systematically reduced in order to study the impact on the predictive power. Both events and non-events are deleted from the data so that the sampling ratio between events and non-events stays (approximately) constant.

This reflects the situation where only a limited number of observations is available for the analysis. This scenario illustrates the effect of additional observations on the predictive power (for example, by providing more customer records or interviewing more respondents).
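
The two reduction strategies could be mimicked as follows (a sketch; `target` is assumed to be a 0/1 event flag, and the sampling details are illustrative rather than the exact procedure of the study):

```python
import pandas as pd

def reduce_events_only(df: pd.DataFrame, target: str, n_events: int, seed: int = 42) -> pd.DataFrame:
    """Blue line: keep only n_events event records, keep all non-events."""
    events = df[df[target] == 1].sample(n=n_events, random_state=seed)
    non_events = df[df[target] == 0]
    return pd.concat([events, non_events])

def reduce_all_observations(df: pd.DataFrame, target: str, fraction: float, seed: int = 42) -> pd.DataFrame:
    """Red line: reduce events and non-events by the same fraction, so the
    event/non-event ratio stays (approximately) constant."""
    return (df.groupby(target, group_keys=False)
              .sample(frac=fraction, random_state=seed))
```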

Results

Figure: Comparison of the %Response between “only events excluded” and “events and non-events excluded”

As expected, the %Response increases with an increasing number of events (blue line). The relationship between additional event cases and model performance gains is strongest in the range from 50 to 100 events. Above 400 events, the improvements get smaller, and above 1,000 events (and also at 2,000 and 5,000 events), only a slight further increase in model quality can be seen.

These results also show that there is a difference between the two cases where either only events, or both events and non-events, are removed from the data. The model quality is better if just events are removed, because the remaining non-events still contribute information to the model.

For scenarios with up to 1,000 events, it makes a difference whether just events or both events and non-events are removed from the data. Thus, as long as the total sample size is not sufficiently large, additional non-event records also contribute to the predictive power of the model, because they increase the total “n.”

Calculating the business case

For the reference company described above, these results mean the following (a small helper after the list makes the per-event arithmetic explicit):

  • Running a campaign that has been trained on 50 events has an expected cumulative profit of $286,500, whereas 100 events in the training data have an expected cumulative profit of $364,000.
  • Thus, an additional profit of $77,500 can be created per campaign if the training data contain 100 events instead of 50 events. This is an additional profit of $1,550 for each additional event in the training data.
  • Increases in total event numbers on a higher base (for example, from 500 to 1,000) still generate an additional total profit of $13,750 per campaign. Broken down to additional profit for each additional event in the training data, the result is $27.50.
  • Running a campaign that has been trained on 50 events has an expected cumulative profit of $372,750, whereas 100 events in the training data have an expected cumulative profit of $418,750.
  • This leads to an additional profit of $46,000 per campaign, if the training data contain 100 events instead of 50 events. This is an additional profit of $920 for each additional event in the training data.
  • In the scenarios with higher event numbers (for example, 500 and 750) the additional profit is still $18,500 per campaign. And the additional profit for each additional event in the training data is still $74.
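
The “additional profit for each additional event” figures above are simply the additional profit per campaign divided by the number of additional events; a tiny, purely illustrative helper makes the arithmetic explicit:

```python
def profit_per_additional_event(extra_profit: float, extra_events: int) -> float:
    """Additional profit per additional event in the training data."""
    return extra_profit / extra_events

# From 500 to 1,000 events: $13,750 additional profit over 500 additional events
print(profit_per_additional_event(13_750, 500))   # 27.5 -> $27.50 per event
```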

Data Availability — What if I can’t use certain variables?

The consequences of withholding the set of the most important variables are analyzed in these case studies.

The analysis described in this section is based on the availability of variables: if a certain subset of variables allows you to build a good model, how much does the predictive power of this model decrease if this set of variables is removed from the training data? In this case, the analytical model has to use other variables for model building.

The correlation between variables is important here, both for determining a representative replacement value for missing values and for finding alternate predictor variables. If, for example, a certain variable is deemed important for the model but cannot be made available, other variables will probably take over its predictive content and power because they are correlated with it.

The following scenarios have been studied:

  • The unconstrained model that can use all available variables except the Customer Age variable (Model 1 without Age). Note that customer age has intentionally been chosen here because it is, in many cases, a very important variable (all four baseline models presented in chapter 16 of the book referenced below use the customer age variable), but often it cannot be made available, or it can only be made available with (systematic) missing values.
  • The unconstrained model that can use all available variables except the AGE variable and those variables that are in the same variable cluster as age (Model 1 without Age Cluster). This scenario reflects the situation where the non-availability of a certain variable often triggers the non-availability of correlated variables. This is especially the case if different variables are derived from the same variable in transactional data: if this variable is not available, all derived variables that are based on it will not be available either (see the sketch after this list).
  • The model that can use all variables, except those that are in the unconstrained model (Model 2 [without Model 1 Variables]).
  • The model that can use all variables, except those that are in Model 1 and 2 (Model 3 [without Model 1 and 2 Variables]).
  • The model that can use all variables, except those that are in Model 1, 2, and 3 (Model 4 [without Model 1, 2, and 3 Variables]).
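
As an illustration of how the “Age Cluster” scenario could be constructed: the original study derived the cluster from variable clustering; in the sketch below, a simple correlation threshold with the age variable stands in for cluster membership (the column name `AGE` and the 0.6 threshold are assumptions):

```python
import pandas as pd

def drop_age_cluster(df: pd.DataFrame, age_col: str = "AGE", threshold: float = 0.6) -> pd.DataFrame:
    """Remove the age variable and all interval variables strongly correlated with it."""
    corr_with_age = df.corr(numeric_only=True)[age_col].abs()
    age_cluster = corr_with_age[corr_with_age >= threshold].index.tolist()  # includes AGE itself
    return df.drop(columns=age_cluster)
```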

Results

For the simulation data, the results show a monotonic drop in the average %Response from model 1 to model 4.

It is interesting to see that the removal of the age variable alone causes a drop in the average %Response from 16.4% to 14%, which is almost as large as the decrease from model 1 to model 2. This means that the non-availability of the age variable alone accounts for almost all of the decrease in the %Response observed when excluding all model 1 variables.

You can also see that the %Response of Model 1 without the Age Cluster variables falls below the %Response of Model 2. Model 2, even though it cannot use any of the model 1 variables, performs better because it can include correlated substitute variables for age, which are not available in the model where all variables correlated with age have been removed.

Results from the business case

For the reference company described above, the results for the removal of the age variable mean an expected cumulative profit of $410,000 with the age variable and an expected cumulative profit of $350,000 without it. The absolute difference of $60,000 can be a trigger to pay particular attention to the completeness and correctness of the age variable in a customer database.

Data Correctness — “Bias in the data”

“Invisible” data quality problem

Data correctness is, in many cases, an invisible data quality problem because it often cannot be seen explicitly. Unlike missing values, which can be queried directly from the data and summarized for each variable, whether the available data are biased or not cannot be detected by simply checking the values. Often, business and validation rules are applied to decide whether a value is correct or not.

There are different methods to profile the data quality status in terms of data correctness. In some cases, hard-fact rules based on value ranges, lists of values, or integrity checks can be used to decide whether a value is correct or not. In other cases, however, checking for data correctness can only be done on a probabilistic basis (“this value is probably not correct”). For example, statistical methods can be used to flag observations that fall outside upper and lower validation limits calculated from the standard deviation (a sketch follows).
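
A minimal sketch of both types of checks (the limits and the rule parameters are placeholders):

```python
import pandas as pd

def check_value_range(values: pd.Series, lower: float, upper: float) -> pd.Series:
    """Hard-fact rule: flag values that fall outside an allowed value range."""
    return ~values.between(lower, upper)

def flag_statistical_outliers(values: pd.Series, k: float = 3.0) -> pd.Series:
    """Probabilistic check: flag values outside mean +/- k standard deviations
    ('value is probably not correct')."""
    mean, std = values.mean(), values.std()
    return (values < mean - k * std) | (values > mean + k * std)
```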

Random and systematic bias

As a data scientist you should differentiate between random and systematic bias in the data. Random biases occur for each analysis subject with the same probability, irrespective of the other variables. Also, the bias itself does not point systematically in one direction.

Systematic biases are assumed to occur for each analysis subject with a different probability. Also, the direction of the bias can be systematically upward, downward, or toward the center. Consider, for example, the case where all observations show the identical value, or where all observations with a large value show a small value instead.

For simplicity, systematic bias has been created in the data by defining 10 equal-sized clusters of analysis subjects. Depending on the definition of the respective simulation scenario, the variables of all observations in one or more of these clusters are biased (see the sketch below).
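
A possible sketch of this cluster-based approach (the random cluster assignment and the size of the shift are illustrative assumptions, not the exact settings of the study):

```python
import numpy as np
import pandas as pd

def bias_by_clusters(df: pd.DataFrame, cols, biased_clusters=(0,), shift: float = 1.0,
                     n_clusters: int = 10, seed: int = 42) -> pd.DataFrame:
    """Assign analysis subjects to 10 equal-sized clusters and add a systematic
    shift to the selected variables of all observations in the chosen clusters."""
    rng = np.random.default_rng(seed)
    cluster = pd.Series(rng.permutation(len(df)) % n_clusters, index=df.index)
    out = df.copy()
    mask = cluster.isin(biased_clusters)
    out.loc[mask, cols] = out.loc[mask, cols] + shift
    return out
```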

Biased values in the scoring data partition

The simulations that are performed to assess the effect of bias in the input variables are set up in such a way that the data partition used to evaluate model quality also contains biased values.

This scenario refers to situations where data quality problems like biased values occur not only in the model training phase but also in the application of the model, that is, where the data quality problems have not yet been fixed and are present in both the training and the scoring data.

Scenario Overview

The following scenarios have been created based on the data (an illustrative sketch follows the list):

  • R0: No bias is introduced into the data.
  • R0.5: Half a standard deviation is randomly added/subtracted from the standardized interval input variables.
  • R1: One standard deviation is randomly added/subtracted from the standardized interval input variables.
  • R2: Two standard deviations are randomly added/subtracted from the standardized interval input variables.
  • R3: Three standard deviations are randomly added/subtracted from the standardized interval input variables.
  • S0: The values of the input variables are set to 0.
  • S-1: The sign of the input variables is changed.
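
A hedged sketch of how these scenarios could be applied to standardized interval inputs (after standardization, one standard deviation equals 1; function and parameter names are illustrative):

```python
import numpy as np

def apply_bias_scenario(X: np.ndarray, scenario: str, seed: int = 42) -> np.ndarray:
    """Apply one of the bias scenarios to standardized interval input variables."""
    rng = np.random.default_rng(seed)
    if scenario == "R0":                 # no bias
        return X
    if scenario.startswith("R"):         # R0.5, R1, R2, R3: randomly add/subtract k std devs
        k = float(scenario[1:])
        signs = rng.choice([-1.0, 1.0], size=X.shape)
        return X + k * signs             # std dev of standardized inputs is 1
    if scenario == "S0":                 # systematic: set all values to 0
        return np.zeros_like(X)
    if scenario == "S-1":                # systematic: flip the sign
        return -X
    raise ValueError(f"Unknown scenario: {scenario}")
```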

Results

Figure: Box plot of the %Response for different settings of random and systematic bias in the training and scoring data

You can see that an increased amount of random disturbance in the input data lowers the average response rate from 19.29% for R0 to 17.63% for R1, 15.88% for R2, and 14.99% for R3. This means that the relative loss in the response rate for R1 (+/- 1 standard deviation) compared to R0 is already 8.6% ((19.29 - 17.63) / 19.29 ≈ 0.086).

You can also see that a systematic bias has a much larger impact: Scenario S0 drops to 15.55% and Scenario S-1 to 13.63%.

Conclusion from the simulation studies

From a data quality point of view, the following conclusions can be drawn from the simulation studies:

Increasing the number of events and non-events matters

The simulation results have shown that model quality improves with an increasing number of events and non-events. This is especially true for cases with up to 250 events, but it also holds for cases with more events: additional data still improve model quality to some extent.

Age variable is important and there are compensation effects between the variables

In all the simulation data sets, an age variable was present. Removing the age variable from the list of available input variables showed a decrease in model quality. It has, however, also been shown that the non-availability of the input variables of the best model can be compensated, to some extent, by other input variables.

It makes a difference whether data disturbances occur in the training data only or in both the training and scoring data

The simulations have shown that it makes a difference whether only the training data are biased by missing or incorrect data or whether both the training and scoring data are biased. Biasing the scoring data decreases model quality.

Random disturbances affect model quality much less than systematic disturbances

It has been shown that many models can cope with random disturbances to a certain extent. The introduction of systematic biases, however, causes a larger decrease in model quality.

Webinar presentations

Links

Links to two webinars on related analyses are included in this article. My data preparation for data science webinar contains more contributions around this topic.

Medium Article: Is your data ready for data science? — Motivating this topic from a sail-race-analysis example: https://gerhard-svolba.medium.com/is-your-data-ready-for-data-science-motivating-this-topic-from-a-sail-race-analysis-example-7dde97a68e4d?

Medium Article: Quantifying the Effect of Missing Values on Model Accuracy in Supervised Machine Learning Models: https://gerhard-svolba.medium.com/quantifying-the-effect-of-missing-values-on-model-accuracy-in-supervised-machine-learning-models-8d47d7eca921

https://www.youtube.com/watch?v=61mKcvEj5a0&list=PLdMxv2SumIKsqedLBq0t_a2_6d7jZ6Akq&index=8

SAS Communities Article: Using SAS Enterprise Miner for Predictive Modeling Simulation Studies

Presentation #102 in my slide collection contains more visuals on this topic.

Chapters 16, 17, 18, and 19 in my SAS Press book “Data Quality for Analytics Using SAS” discuss these topics in more detail.

