Practicality of Using Transformations in MLR

Lydia Gibson

Introduction

  • In previous research, I’ve used multiple linear regression (MLR) to explore the relationship between median salary of STEM majors and gender demographics.
  • The final reduced model used the inverse transformation of the response variable, chosen because of the skewness of its distribution, to improve the model fit.
  • While transforming response variables can lead to better fitting models, these models are not easy to explain.

Outline

  • Problem
  • Data Source
  • Methods
  • Results
  • Conclusion
  • Further Research

Problem

  • How much prediction power is lost by not using a transformed response variable in a linear regression model?

  • Is the gain in prediction power from the transformation worth losing the ability to easily explain your model?

Data Source

  • The data set was obtained from the American Community Survey (ACS) 2010-2012 Public Use Microdata Series (PUMS).
  • It has 76 observations and 9 variables.

Exploratory Data Analysis

Methods

Model Without Transformation

Parameter                                | Coefficient |      SE |               95% CI | t(71) |      p
--------------------------------------------------------------------------------------------------------
(Intercept)                              |    36421.43 | 2597.45 | [31242.27, 41600.59] | 14.02 | < .001
Major category [Computers & Mathematics] |     6324.03 | 3915.80 | [-1483.85, 14131.90] |  1.62 | 0.111 
Major category [Engineering]             |    20961.33 | 3162.87 | [14654.74, 27267.92] |  6.63 | < .001
Major category [Health]                  |      403.57 | 3823.34 | [-7219.95,  8027.09] |  0.11 | 0.916 
Major category [Physical Sciences]       |     5468.57 | 4023.95 | [-2554.95, 13492.09] |  1.36 | 0.178 
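
Part of what makes this model easy to explain is that each coefficient is a dollar amount. The sketch below is written in Python purely for illustration and is not the R code behind the table. It assumes the major category is treatment (dummy) coded, so the intercept is the predicted median salary for the unnamed reference category and every other row is an offset from it. The variable names are hypothetical.

```python
# Coefficients copied from the table above (response in dollars).
intercept = 36421.43
offsets = {
    "Computers & Mathematics": 6324.03,
    "Engineering": 20961.33,
    "Health": 403.57,
    "Physical Sciences": 5468.57,
}

# Assumption: treatment coding, so prediction = intercept + category offset.
predicted = {"(reference category)": intercept}
for category, offset in offsets.items():
    predicted[category] = intercept + offset

for category, salary in predicted.items():
    print(f"{category:<26} {salary:>12,.2f}")
```

Read this way, the Engineering row says that the predicted median salary for Engineering majors is roughly 21,000 dollars above the reference category.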

Why a Transformation?

Model With Transformation

Parameter                                | Coefficient |       SE |         95% CI | t(71) |      p
---------------------------------------------------------------------------------------------------
(Intercept)                              |    2.79e-05 | 9.88e-07 | [ 0.00,  0.00] | 28.23 | < .001
Major category [Computers & Mathematics] |   -4.19e-06 | 1.49e-06 | [ 0.00,  0.00] | -2.81 | 0.006 
Major category [Engineering]             |   -9.69e-06 | 1.20e-06 | [ 0.00,  0.00] | -8.06 | < .001
Major category [Health]                  |   -1.47e-07 | 1.45e-06 | [ 0.00,  0.00] | -0.10 | 0.920 
Major category [Physical Sciences]       |   -3.33e-06 | 1.53e-06 | [ 0.00,  0.00] | -2.17 | 0.033 
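
Two things make this table harder to read than the previous one: the 95% CI column is rounded to 0.00, and the coefficients are in units of 1/dollars. The Python sketch below (illustrative only, not the presenter's R code) rebuilds approximate intervals from the printed coefficients and standard errors and back-transforms the fitted values to dollars. It assumes the response is 1/(median salary) and that both models share the same 71 residual degrees of freedom, so the critical value can be recovered from the intercept row of the untransformed table. Because the inputs are rounded, the results are approximate.

```python
# Estimates and standard errors copied from the table above (units: 1/dollars).
intercept, intercept_se = 2.79e-05, 9.88e-07
offsets = {
    "Computers & Mathematics": (-4.19e-06, 1.49e-06),
    "Engineering": (-9.69e-06, 1.20e-06),
    "Health": (-1.47e-07, 1.45e-06),
    "Physical Sciences": (-3.33e-06, 1.53e-06),
}

# Assumption: same t(71) critical value as the untransformed model, recovered
# from its intercept row as (upper CI bound - estimate) / SE, about 1.994.
t_crit = (41600.59 - 36421.43) / 2597.45

def conf_int(estimate, se):
    """Approximate 95% interval: estimate +/- t_crit * se."""
    return (estimate - t_crit * se, estimate + t_crit * se)

ci = {"(Intercept)": conf_int(intercept, intercept_se)}
for name, (estimate, se) in offsets.items():
    ci[name] = conf_int(estimate, se)

# Back-transform: a fitted value on the 1/salary scale maps to dollars via 1/x.
predicted_salary = {"(reference category)": 1 / intercept}
for name, (estimate, _) in offsets.items():
    predicted_salary[name] = 1 / (intercept + estimate)

for name, (low, high) in ci.items():
    print(f"{name:<26} [{low: .2e}, {high: .2e}]")
for name, salary in predicted_salary.items():
    print(f"{name:<26} {salary:>12,.0f}")
```

A negative coefficient on the inverse scale corresponds to a higher salary, and the back-transformed value is the reciprocal of a mean of reciprocals rather than a mean salary. Both points have to be explained to the audience before the numbers make sense.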

Results

Diagnostic Plots for Model Without Transformation

Diagnostic Plots for Model With Transformation

Metrics Comparison

# Comparison of Model Performance Indices

Name        | Model |   AIC (weights) |   BIC (weights) |    R2 | R2 (adj.) |      RMSE
---------------------------------------------------------------------------------------
lm1_reduced |    lm |  1618.1 (<.001) |  1632.1 (<.001) | 0.486 |     0.457 |  9393.617
lm2_reduced |    lm | -1678.7 (>.999) | -1664.7 (>.999) | 0.573 |     0.549 | 3.573e-06
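
AIC, BIC, and RMSE are all measured on the scale of the response, which is dollars for lm1_reduced and 1/dollars for lm2_reduced. Only the \(R^2\) columns are unit-free. The Python sketch below (illustrative only, not part of the original R analysis) shows this by reproducing the reported AIC and BIC from the RMSE alone. It assumes the usual Gaussian log-likelihood for a linear model, with RSS/n equal to RMSE squared, and six estimated parameters: five coefficients plus the residual variance.

```python
import math

n_obs = 76     # observations reported in the Data Source section
n_params = 6   # assumption: 5 regression coefficients + residual variance

def gaussian_aic_bic(rmse, n, k):
    """AIC and BIC of a Gaussian linear model, given RMSE = sqrt(RSS / n)."""
    neg2_loglik = n * (math.log(2 * math.pi) + math.log(rmse ** 2) + 1)
    return neg2_loglik + 2 * k, neg2_loglik + k * math.log(n)

aic_raw, bic_raw = gaussian_aic_bic(9393.617, n_obs, n_params)   # lm1_reduced
aic_inv, bic_inv = gaussian_aic_bic(3.573e-06, n_obs, n_params)  # lm2_reduced

print(f"lm1_reduced  AIC {aic_raw:9.1f}  BIC {bic_raw:9.1f}")
print(f"lm2_reduced  AIC {aic_inv:9.1f}  BIC {bic_inv:9.1f}")
```

Because the AIC is a function of the RMSE, the large gap between the two models reflects the change of units as much as any change in fit. The AIC and BIC weights in the table should therefore not be read as evidence for one model over the other.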

Conclusion

  • Regression models with an inverse-transformed response variable are not easy to explain to individuals without a statistics background, an audience that is common in statistical consulting.
  • Based on the adjusted \(R^2\) values of the two models (0.457 without the transformation, 0.549 with it), fitting the linear regression model without the inverse-transformed response gives up less than 10 percentage points of explained variability in the response.
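
The adjusted \(R^2\) comparison can be checked from the Metrics Comparison table. The Python sketch below (illustrative only, not the presenter's R code) recomputes adjusted \(R^2\) from the reported \(R^2\) with the standard formula \(1 - (1 - R^2)(n - 1)/(n - p - 1)\). It assumes \(n = 76\) observations and \(p = 4\) dummy variables for major category. It then expresses the gap between the models both in absolute and in relative terms.

```python
n_obs = 76         # observations in the data set
n_predictors = 4   # assumption: 4 dummy variables, intercept not counted

def adjusted_r2(r2, n, p):
    """Adjusted R^2 for a linear model with p predictors and n observations."""
    return 1 - (1 - r2) * (n - 1) / (n - p - 1)

adj_raw = adjusted_r2(0.486, n_obs, n_predictors)  # without transformation
adj_inv = adjusted_r2(0.573, n_obs, n_predictors)  # with inverse transformation

absolute_gap = adj_inv - adj_raw        # difference in percentage-point terms
relative_gap = absolute_gap / adj_inv   # share of the transformed model's value

print(f"adjusted R^2: {adj_raw:.3f} vs {adj_inv:.3f}")
print(f"absolute gap: {absolute_gap:.3f}")
print(f"relative gap: {relative_gap:.1%}")
```

The gap is about 9 percentage points of explained variability, which is roughly 17% of the transformed model's adjusted \(R^2\).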

Further Research

  • I would like to redo my analysis with a data set that has more quantitative variables, and run the regression models using the tidymodels framework.
  • I would like to do a comparison of the data visualizations and analyses available in the various ggplot2 extension packages (ggpubr, easystats, lindia, ggstatsplot) used in this presentation.

Acknowledgements

  • I would like to thank my colleagues, Sara Hatter and Ken Vu, with whom I collaborated on the previous research projects, Gender Wage Inequality in STEM and Unemployment in STEM, which laid the groundwork for this project.
  • I would like to acknowledge the FiveThirtyEight blog for uploading the data behind their story, The Economic Guide To Picking A College Major, which was used for this analysis.
  • I would like to thank Prof. Eric Suess for his guidance with this research and the #rstats community members who helped with the styling of this presentation.

Appendix