Explaining Sign Inversion of Parameter Estimates in Multiple Regression Models — A Story-Telling Approach

Gerhard Svolba
Dec 8, 2021

Explaining ML models — Classical multiple regression models can be a challenge as well

Interpretability and explainability of machine learning models have always been a hot topic. While the topic has recently received a lot of attention under the headings of “Fairness”, “Transparency” and “Interpretability”, it is by no means limited to complex ML algorithms like random forests, gradient boosting, neural networks or support vector machines.

The results of multiple regression models can also be hard to explain to business people and decision makers. Consider the example where a variable that is positively correlated with the target in univariate analysis receives a negative regression coefficient in multiple regression analysis.

This article focuses on the fact that the sign of a parameter estimate can invert between the univariate and the multiple regression model, which can confuse those who have to apply the model results in their business processes.

The purpose of this text is to provide a story-telling approach to explain and illustrate an occurrence of sign inversion to business people, rather than to give an overview of how to tackle this problem statistically.

Why does “Shopping Frequency” now have a negative effect?

Data scientists and statisticians usually encounter such a situation more than once in their career: They explain the results of a statistical model to medical researchers, marketing campaign managers, process engineers or fraud investigators. And the positive relationship of a particular variable with the outcome variable, which has been seen in univariate explanatory analysis, suddenly turns into a negative relationship in the multiple model.

For example: The shopping frequency in the last 3 months has a clear positive association with whether the customer redeemed a voucher for a new product line. However, as soon as the event “Voucher Redemption” is analyzed in a model with other variables like age, loyalty card status, gender, or average basket size, the model exhibits a negative effect of the variable “shopping frequency”.

Even if there is a clear statistical explanation for this fact (the variables are correlated, and the parameter estimates of a multiple model depend on each other and must be interpreted in the context of the model setup), it is often hard to explain (and justify) such results to business people. It can even lead to the model not being accepted or used at all.

A quick example with demo data

Consider the CARS data which are included in every SAS installation in the SASHELP.CARS dataset. This table contains data like invoice amount, horsepower, cylinders, engine size, wheelbase, or miles per gallon for 428 cars.

A regression model for INVOICE is built to study and quantify the effect of the car features

  • horsepower
  • cylinders
  • enginesize
  • mpg_city
  • weight
  • wheelbase

on the invoice amount.

Univariate Analysis

The following table shows the parameter estimate and the adjusted R² from the six univariate regression models, one per variable. (Note that the code to re-run these examples with SAS Visual Analytics or SAS/STAT can be found at the end of this article.)

Parameter    Estimate   Adj-R2
==============================
Cylinders      7319.6   0.4149
EngineSize     8983.4   0.3171
Horsepower      202.3   0.6778
Mpg_City      -1584.4   0.2195
Weight           10.3   0.1938
Wheelbase       314.8   0.0197

The results make sense from an interpretation point of view:

  • The larger and stronger the car is, the more it costs.
  • The larger the car and the stronger the engine, the more fuel it consumes and the fewer miles can be driven with one gallon.

Multiple regression model with LASSO selection

LASSO regression only picks two variables (HORSEPOWER and WHEELBASE) from the above list to explain invoice amount. This model has an adjusted-R² of 0.7063.

Parameter   Estimate
====================
Horsepower     204.4
Wheelbase     -279.8

From the results, it can be seen that:

  • Only two variables are selected
  • The estimate of the effect of WHEELBASE (horizontal distance between the centers of the front and rear wheels) changes from +314.8 in the univariate model to -279.8 in the multiple model.

Large cars are cheaper?

If we just look at the parameter WHEELBASE, it looks as if larger cars have a lower price, which is counterintuitive and differs from the univariate model. However, we need to take into account that both parameters, HORSEPOWER and WHEELBASE, are estimated together and thus influence each other.
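How two jointly estimated parameters influence each other can be made explicit in the simplest case. The following is a sketch under stated assumptions: ordinary least squares with exactly two predictors, and notation that is introduced here for illustration only (index 1 for HORSEPOWER, index 2 for WHEELBASE, y for INVOICE, r for a Pearson correlation). The LASSO estimates shown above are additionally shrunk, so the formula illustrates the mechanism rather than reproducing those numbers.

```latex
\beta_2^{*} \;=\; \frac{r_{y2} \;-\; r_{12}\, r_{y1}}{1 - r_{12}^{2}}
```

Here \beta_2^{*} is the standardized coefficient of the second predictor; the unstandardized coefficient has the same sign. Since the denominator is positive, the sign depends only on the numerator: even if r_{y2} is positive (a positive univariate effect), the coefficient becomes negative as soon as r_{12} r_{y1} exceeds r_{y2}, that is, when the second predictor is strongly correlated with a first predictor that itself explains the outcome very well.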

The Story of the “Three Friends in the Theater” — Explaining Sign Inversion to Business People

Here is an approach that I have used successfully to explain sign inversion to business people.

Three friends go to a theater to see a performance of a classical play. Each of them likes the performance a lot. If each of them were asked individually, each would give positive feedback on the play. So their univariate coefficient is positive.

However, if they are asked together whether they liked the play, the following situation could happen:

  • The first one responds: “It was the best play I have ever seen. The actors were performing so well that I could not stop watching them.”
  • The second replies: “The choreography was wonderful. I have never seen a comparable ensemble of scenery, actors, music, and lighting.”
  • The third one, even though he liked the play a lot, thought that the first two exaggerated the positive impression of the play to some extent. He wants to correct this impression, so after the first two statements he might say that he generally liked the play, but he thought the choreography was a little overdone and the male actor did not really fit the role.

In the context of the two other opinions, the third opinion moves the results from a positive opinion to a slightly negative one. So, the sign for his opinion, when expressed in this context, changes from positive to negative.

From a strict mathematical point of view, the situation above differs somewhat from the estimation of coefficients in multiple regression, because the coefficients are determined simultaneously and not sequentially as in the story.

However, this description gives a good explanation that can be used when speaking to non-analytical and non-mathematical people to explain the inversion of signs of regression coefficients.

Connecting the story with the CARS example

In the above example we can conclude that

  • Variable HORSEPOWER already tells a very strong story about the engine and the consumption of the car, so no other variable, such as engine size, needs to be “interviewed”.
  • When adding variable WHEELBASE, the model finds a better fit by fine-tuning this effect with a negative sign (correcting the story of variable HORSEPOWER).

Note that the univariate effect remains unchanged; the parameter estimates change only in the context of the multiple regression model.
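To see how strongly the variables in this example are related to each other, the pairwise correlations can be inspected. The following is a minimal sketch using PROC CORR on the same SASHELP.CARS table; the choice of variables is an assumption made for illustration and is not part of the original analysis.

```sas
/* Pearson correlations between the outcome and three of the predictors. */
/* By default PROC CORR uses all available pairs of observations,        */
/* so rows with a missing value are only dropped for the affected pairs. */
proc corr data=sashelp.cars;
   var invoice horsepower cylinders wheelbase;
run;
```

If WHEELBASE turns out to be correlated more strongly with HORSEPOWER than with INVOICE, that would make a sign change in the multiple model plausible.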

Example for a model with 3 variables

When you extend the CARS model to three variables, you see the same situation as has been presented in the story above.

  • Variable HORSEPOWER “says”: the stronger, the more expensive
  • Variable CYLINDERS “says”: more cylinders, higher price
  • Variable WHEELBASE “corrects” their statements by using a negative parameter estimate.

Parameter   Estimate
====================
Horsepower     206.5
Cylinders      988.4
Wheelbase     -478.6

Conclusion

The approach presented above does not “solve” the problem of sign inversion. However, it has proven to be a good tool to explain model results and to make sure that they are understood and accepted by those who should work with them.

Code Examples

The following SAS code was used to produce the examples above.

You might want to use the following statement to reduce the output to the objects you need.

ods select parameterestimates  fitstatistics;

Univariate regression model

proc glmselect data=sashelp.cars;
   model invoice = cylinders;
run;
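The univariate step above shows one variable only. To produce all six univariate models in one go, a small macro loop can be used. This is a sketch and not the author's original code: the macro name UNIVARIATE, the TITLE statements, and the use of selection=none (so that the single variable is always kept in the model) are assumptions made for illustration.

```sas
/* Hypothetical helper macro: fits one univariate model per variable in VARS */
%macro univariate(vars);
   %local i var;
   %do i = 1 %to %sysfunc(countw(&vars, %str( )));
      %let var = %scan(&vars, &i, %str( ));

      title "Univariate model: invoice = &var";

      /* Restrict the output to the two tables of interest for this step */
      ods select ParameterEstimates FitStatistics;

      proc glmselect data=sashelp.cars;
         model invoice = &var / selection=none;
      run;
   %end;
   title;   /* clear the title again */
%mend univariate;

%univariate(horsepower cylinders enginesize mpg_city weight wheelbase)
```

The parameter estimate and the adjusted R² of each model can then be read from the ParameterEstimates and FitStatistics tables of the respective step.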

LASSO regression model

proc glmselect data=sashelp.cars;
   model invoice =
         horsepower cylinders enginesize mpg_city weight wheelbase /
         selection=lasso;
run;

Multiple model with 3 variables

proc glmselect data=sashelp.cars;
   model invoice = horsepower cylinders wheelbase / selection=none;
run;

Links

My data preparation for data science webinar contains more contributions around this topic.

This case study has been taken from chapter 8 of my SAS Press book Data Quality for Analytics Using SAS. More data science case studies can be found in my other books, e.g. Applying Data Science: Business Case Studies Using SAS.
