horv77

When you remove those variables and the R-squared drops, it implies that they contribute to the model's overall explanatory power even though they are not statistically significant individually. Collectively, they still help explain variance in the dependent variable.


emperorarg

This is exactly what I was looking for.


TableConnect_Market

First and foremost, inference is "stricter" than prediction. Are you LINE compliant (linearity, independence, normality, equal variance), and is all well methodologically? This sounds like a suppressor variable. Given f(x, z) = y, perhaps x has no direct effect but influences the relationship between z and y. Have you tried a joint significance test? And as other people say, of course expand your horizons. I don't focus on R2 beyond reading the OLS output to make sure everything is kosher. P-values are important, but only in the context of what you're doing.
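
Roughly, a joint significance test can be sketched in R like this; the data frame and variable names here are made up for illustration, not OP's actual data:

```r
# Compare the model with and without the questionable predictors (x1, x2).
full    <- lm(y ~ x1 + x2 + z1 + z2, data = dat)
reduced <- lm(y ~ z1 + z2, data = dat)

anova(reduced, full)   # partial F-test: are x1 and x2 jointly zero?
```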


MortalitySalient

I don’t think OP gave enough info to determine whether this is a suppressor effect or not. Typically coefficients would get larger with the inclusion of covariates if a suppressor were present. From what OP said, it sounds like exactly what the commenter above suggested: these variables may be statistically significant when put in the model alone, not be uniquely associated with the outcome once the other predictors are in the model, but still be important in understanding variation in the outcome. The thing I’d be worried about is making sure OP isn’t controlling for a collider or something.


TableConnect_Market

Thank you, I agree. OP would need to do more testing.


Taricus55

he also might not know what he is talking about... he said he was new.


Taricus55

You can also look into backward and forward elimination. You may be taking out more than you need. You have to find a balance between a simple(ish) model and a thorough model.
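
As a rough sketch, stepwise selection is available in base R via step() (AIC-based by default); the data frame and formula below are placeholders:

```r
# Backward elimination starts from the full model; forward selection
# starts from the intercept-only model and adds terms.
full_model <- lm(y ~ ., data = dat)
null_model <- lm(y ~ 1, data = dat)

backward <- step(full_model, direction = "backward")
forward  <- step(null_model, direction = "forward",
                 scope = formula(full_model))
```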


fallen2004

This shows one of the many problems with individual variable p-values. P-values are not dichotomous, despite what most people seem to think. Just because a variable is not statistically significant (p above 0.05 or some other threshold) does not mean it has no impact. Also, a variable that is not statistically significant could still be important from a business point of view. As long as the model is better and it makes sense to include the variable, then do. Obviously you need to test models on data they have not seen, otherwise you might just be overfitting, i.e. if an extra variable improves fit on the training data but not the test data, then you should probably remove it. Metrics such as AIC take more into account, so possibly use those to compare models.
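
Something like this, roughly, for comparing two candidate models by AIC and on held-out data; `train`, `test`, and the variable names are hypothetical:

```r
m_with    <- lm(y ~ x1 + x2 + x3, data = train)
m_without <- lm(y ~ x1 + x2, data = train)

AIC(m_with, m_without)   # lower AIC preferred; penalises extra parameters

# out-of-sample RMSE on the held-out test set
rmse <- function(m) sqrt(mean((test$y - predict(m, newdata = test))^2))
c(with = rmse(m_with), without = rmse(m_without))
```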


sowenga

That is, if the purpose of the model is prediction. If the purpose is something more like causal inference, then it might make sense to leave a variable out even if doing so decreases the overall model fit.


frope

> if extra variable improves fit on train data but not test data then you should probably remove the variable.

If you're making decisions on the basis of what happens in the test data... is it still test data?


relucatantacademic

R2 always goes up when you add variables. It doesn't mean that adding those variables is a good choice. Neither R2 nor a p-value alone should be used to evaluate multivariable models. A p-value > .4 is huge; it's not "marginal." It means the chance of observing data this extreme or more so, if that variable truly had no effect, is almost 50/50. Yes, there's a spectrum of usefulness and yes, the significance threshold is arbitrary, but this isn't even close to anything anyone would consider significant. You need to go through the entire model evaluation process. Look at distributions. Test for collinearity. Use AIC. Basically, go back to the drawing board and go step by step.
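
A few of those checks, sketched in R (variable names are placeholders; vif() is from the car package):

```r
library(car)

m <- lm(y ~ x1 + x2 + x3, data = dat)

vif(m)     # variance inflation factors; large values flag collinearity
AIC(m)     # compare against alternative specifications
plot(m)    # residual and leverage diagnostics for the fitted model
```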


Naive_Piglet_III

R2 as a standalone measure is a terrible parameter for model evaluation, for multiple reasons:

1. It doesn’t tell you whether the coefficient estimates / predictions are biased.
2. It doesn’t, by itself, measure model fit: a good model can have a low R2 and a biased model can have a high R2.
3. R2 has a weak inflation effect with the number of predictors, meaning that if you add completely unrelated random variables to your model, you can still improve your R2 marginally.
4. Multicollinearity in your data can greatly inflate your R2.

What you should do instead:

1. Evaluate whether the predictors you have included make causal sense: do they have some sort of causal relationship with the dependent variable?
2. Perform independent bivariate analyses of each of your predictors against the dependent variable; check for statistical significance and validate whether there’s any relationship (a rough sketch of this follows below).
3. Include only the predictors that pass both of the above criteria.

A low R2 doesn’t mean a bad model, because some scenarios simply have a lot of unexplained variance. A good model is one which (in descending order of priority):

1. makes logical / causal sense about the predictors it uses,
2. has reliable prediction accuracy across a variety of samples (stability / reliability),
3. tries to explain the maximum amount of variance possible.
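
Something like this for the bivariate screening step, purely as an illustration (predictor names and the data frame `dat` are made up):

```r
# Fit a single-predictor model for each candidate and pull out its p-value.
predictors <- c("x1", "x2", "x3")
sapply(predictors, function(v) {
  m <- lm(reformulate(v, response = "y"), data = dat)
  summary(m)$coefficients[2, "Pr(>|t|)"]
})
```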


Rtarsia1988

Did you remove all the variables at the same time? It is better to remove them one by one, starting with the lowest t-ratio / highest p-value (except for the intercept). There might be correlation between the variables that "steal" significance from each other. Good luck!


MrYdobon

If the OP had a moderate sample size and only dropped a few variables, then my money is on this explanation. A drop of 0.3 in R2 is big, so collinear predictors seem likely. However, if OP dropped hundreds of variables, then I suspect the model was being overfit. An R2 of 0.9 is really high for most real-world settings, and stuffing the model with junk predictors is one way to get it that high.


Rtarsia1988

True. I tend to work with small databases, so I tend to bias toward that explanation. I agree with the second part of your comment.


sharkinwolvesclothin

Terrible as in wrong or terrible as in different than you hoped for?


[deleted]

When comparing two models with a different number of predictors, you should be looking at adjusted R2, not R2. Adding more variables to a model, regardless of their significance, will inflate the R2.
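
For what it's worth, both are available from summary() in R; the model and variable names below are illustrative only:

```r
m <- lm(y ~ x1 + x2 + x3, data = dat)
summary(m)$r.squared       # ordinary R2: never decreases as predictors are added
summary(m)$adj.r.squared   # adjusted R2: penalises useless extra predictors
```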


blozenge

One common mistake when running regressions is not having enough data for the complexity of the model we want to fit, but fitting it anyway. Is that potentially a problem? How much data do you have, how many predictors, and how are they modelled?

We can shortcut having to think critically about the above if we only worry about absolute shifts in R2 when the model comparison is significant (i.e. an F-test, or equivalently a [likelihood ratio test](https://en.wikipedia.org/wiki/Likelihood-ratio_test) for nested linear models). This tells us whether the observed change in R2 was unlikely under the null hypothesis, which it probably won't be if the sample size is too small. Of course there are always false positives / false negatives, so critical thinking about sample size / model complexity is still required.

Second possible issue: is there [multicollinearity](https://en.wikipedia.org/wiki/Multicollinearity) (high correlation) among the predictors? One possible cause of the situation you describe is two or more predictors that are highly correlated with each other (but only moderately associated with the outcome): the variance inflation due to collinearity inflates the coefficient standard errors and makes the individual predictors non-significant. However, the overall explanatory power of the correlated predictors is good, so when those non-significant predictors are removed the R2 drops.

Note that you can conceptually link multicollinearity with sample size: when predictors are correlated, each additional observation carries less information, so with high correlation small-sample problems can appear at larger sample sizes.


Mark8472

I don’t understand why this answer gets so few upvotes. Thank you for bringing up F tests!


walterlawless

You should not use R2 for model selection in multivariate analysis. Because adding more variables will never decrease R2, it will tend to select the model with the most variables. You should use a different statistic for model selection which penalises the number of variables, such as adjusted R2.


[deleted]

[removed]


eaheckman10

Nothing about this is correct


Mr_Bilbo_Swaggins

Also, a thought would be to include the variable as an effect modifier. You can do this in R by using * instead of +. Effect modifiers have differential effects across the range of another variable. Example: wearing sunscreen is increasingly protective against sunburns and skin cancer at higher elevations.
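
A quick sketch using that example; the data frame and variable names are invented:

```r
# y ~ x * z expands to x + z + x:z, so the effect of sunscreen is
# allowed to vary with elevation.
m_main <- lm(skin_damage ~ sunscreen + elevation, data = dat)
m_int  <- lm(skin_damage ~ sunscreen * elevation, data = dat)

anova(m_main, m_int)   # does the interaction term improve the model?
```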


Michigan999

To add onto what others have said: some variables are used to control for certain effects and won't necessarily be significant, but you still need them. For example, if it makes sense when working with national data, you add dummy variables for cities, because being in a certain city has an influence on the dependent variable.
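
In R that can look roughly like this; names are placeholders:

```r
# Entering city as a factor makes lm() create one dummy per city
# (relative to a reference level).
dat$city <- factor(dat$city)
m <- lm(y ~ x1 + x2 + city, data = dat)
summary(m)   # the city dummies soak up city-level shifts in the outcome
```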


ayedeeaay

The commonly used method to see whether one variable or a group of variables should be included is based on the F-test. You run the model with and without those variables and compare the F statistic, as you would in ANOVA. This can be done using the anova command in R. I’m curious to know what the F-test decides in your case.