The R-cubed. Preliminary Thoughts [MARTIN C, J.M.]
José-Manuel Martin Coronado
Chief Economist, EMECEP Consultoría www.emecep-consultoria.com
Professor and Researcher, Instituto de Econometria de Lima, IEL, www.institutoeconometria.com
In Econometrics, or let us say in the theory and practice of linear regression, the R-squared is nothing new, nor is it strange, since it comes directly from the Analysis of Variance (ANOVA) and is the complement of the Sum of Squared Residuals (SSR) minimized through the Least Squares Estimator (LSE).
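The ANOVA decomposition behind this can be verified numerically. The following is a minimal sketch on simulated data (the model and coefficients are illustrative, not from the article): the total sum of squares splits into the explained and residual parts, and the R-squared is the complement of the residual share.

```python
import numpy as np

rng = np.random.default_rng(42)

# Illustrative linear model: y = 2 + 3x + noise
x = rng.normal(size=200)
y = 2 + 3 * x + rng.normal(scale=2.0, size=200)

# OLS fit of y on [1, x] via least squares
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta

sst = np.sum((y - y.mean()) ** 2)       # total sum of squares
ssr = np.sum((y - y_hat) ** 2)          # sum of squared residuals
sse = np.sum((y_hat - y.mean()) ** 2)   # explained sum of squares

# ANOVA decomposition: SST = SSE + SSR
assert np.isclose(sst, sse + ssr)

# R-squared as the complement of the residual share
r_squared = 1 - ssr / sst
print(f"R-squared = {r_squared:.3f}")
```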
However, since the beginnings of regression statistics, it has always been subject to improvement (by one group) or invalidation (by others), in all cases because it is not a perfect indicator of goodness of fit, as if one really existed. The goal is clear: that the "observed deterministic" part of the model be more important than the non-deterministic or "implicit deterministic" part.
Sometimes the search for a good R-squared is abandoned on the grounds of consistency theory (via the CLT). That is to say, omitted variable bias (OVB) supposedly does not matter at all because, if in the end the modeler has enough information, the results will converge to the true parameter thanks to the CLT, ignoring that such a view is purely theoretical and carries a certain frequentist bias.
The search for an R-squared that better displays global explanatory power as a percentage, which is much more readable and quantitatively weighted than a hypothesis test of global significance, is necessary and should continue. And if the aim is prediction, merely having high individual significance is next to nothing.
However, the discussion is deeper and less peaceful, because there are cases of symmetric relationships for which a high R-squared matters little, and other cases in which a very high R-squared implies model over-fitting, which is not good either, owing to a lack of generalization power or the fallacy of tautological causality, among other problems.
For this reason the adjusted R-squared is often used, or the R-squared without constant, the predicted R-squared, the between R-squared, the within R-squared, the overall R-squared, the pseudo-R-squared, or some alternative R-squared, among others. The search for a "better" R-squared is thus a continuous task, in the hope of finding a useful relative goodness-of-fit indicator.
On the other hand, one could say that the R-squared, in very simplistic terms, is the equivalent of the correlation coefficient, but squared. Squaring unavoidably reduces the measured strength of the relationship: if the correlation between two variables is 90%, the coefficient of determination is 81%. It is an adjustment, or reduction, of the relationship between two groups of variables, the observed dependent variable and the fitted dependent variable.
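This equivalence holds exactly in the simple (one-regressor) case, and can be checked directly. A short sketch, again on illustrative simulated data: the squared Pearson correlation between x and y matches the regression R-squared, and squaring shrinks any correlation below 100%.

```python
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(size=500)
y = 1 + 0.8 * x + rng.normal(scale=0.5, size=500)

# Pearson correlation between x and y
r = np.corrcoef(x, y)[0, 1]

# R-squared from the simple regression of y on x
X = np.column_stack([np.ones_like(x), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ beta
r2 = 1 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

# In simple regression, R-squared equals the squared correlation,
# and squaring reduces any |r| < 1 (e.g. 0.90 -> 0.81).
assert np.isclose(r ** 2, r2)
print(f"r = {r:.3f}, R-squared = {r2:.3f}")
```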
In theory, the correlation coefficient does not take into account the impact of imperfection in the modeling, because it only considers the explanatory variables in its denominator; that is, it forgets the bias error due to omitted variables. Much will depend, though, on the values of the variance of X relative to the variance of Y, and of both relative to the covariance of X and Y.
However, in some cases an R-squared of 60% is considered sufficient, which may be arguable, if not acceptable. In other cases an R-squared of 40% may be accepted, which can also be argued; depending on the case it may be wise to consider it low, though being close to 50% it may be forgiven by some researchers. But what if another coefficient of determination were used?
In the following episodes, the implementation of the R-cubed will be justified as an alternative coefficient of determination, from a practical approach, with statistical simulation of its usefulness. Its clear advantage is that it visually raises concerns when an R-squared is significantly low, preventing an unwarranted interpretation of the coefficients and of individual hypothesis tests.
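The article has not yet defined the R-cubed, so the following is only a hypothetical sketch of the visual effect described: if one reads "R-cubed" as the goodness-of-fit raised to the 3/2 power, i.e. (R^2)^(3/2) = |r|^3, then low R-squared values shrink much more visibly than high ones. This definition is an assumption for illustration, not the author's.

```python
def r_cubed(r_squared: float) -> float:
    """Hypothetical 'R-cubed' as (R^2)^(3/2), i.e. |r|^3.
    An illustrative assumption only -- NOT the definition the
    article will develop in later episodes."""
    return r_squared ** 1.5

# Low values are penalized more visibly than high ones
for r2 in (0.90, 0.60, 0.40, 0.20):
    print(f"R-squared = {r2:.2f} -> candidate R-cubed = {r_cubed(r2):.2f}")
```

Under this reading, an R-squared of 40% would display as roughly 25%, flagging the weak fit more loudly, while an R-squared of 90% would barely drop, to about 85%.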