Multicollinearity

Solution 

  1. Provide a correlation matrix of the variables under investigation and comment on the patterns of correlation you see amongst the variables as well as in relation to the predicted variable of interest.
  2. With the exception of Depression, all variables showed a significant correlation with Startleblink. The highest correlation was between Startleblink and Antisaccade (r = 0.740; p < 0.01), followed by Startleblink and Anxiety (r = 0.543; p < 0.01) and Startleblink and PTSD (r = 0.349; p < 0.01). Other significant correlations were observed between PTSD and Anxiety (r = 0.331; p < 0.01), between PTSD and Antisaccade (r = 0.311; p < 0.01), and between Anxiety and Antisaccade (r = 0.380; p < 0.01).
Correlations (Pearson r, two-tailed p in parentheses; N = 70 throughout)

             Startleblink     PTSD             Anxiety          Depression      Antisaccade
Startleblink 1                .349** (.003)    .543** (.000)    .185 (.125)     .740** (.000)
PTSD         .349** (.003)    1                .331** (.005)    -.121 (.317)    .311** (.009)
Anxiety      .543** (.000)    .331** (.005)    1                .303* (.011)    .380** (.001)
Depression   .185 (.125)      -.121 (.317)     .303* (.011)     1               .079 (.518)
Antisaccade  .740** (.000)    .311** (.009)    .380** (.001)    .079 (.518)     1
**. Correlation is significant at the 0.01 level (2-tailed).
*. Correlation is significant at the 0.05 level (2-tailed).
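Each entry in the matrix above is a Pearson correlation coefficient. As a minimal sketch of how one coefficient is computed (pure Python; `pearson_r` is an illustrative helper, not part of the SPSS output, and the score lists are hypothetical):

```python
import math

def pearson_r(x, y):
    """Pearson correlation: covariance of x and y divided by the
    product of their standard deviations (constant factors cancel)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

# Perfectly linearly related scores give r = 1.
print(round(pearson_r([1, 2, 3], [2, 4, 6]), 3))  # 1.0
```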

Startleblink is the response variable or dependent variable

Independent variables: PTSD, Anxiety, Depression and Antisaccade

Would you be concerned about multicollinearity?

Multicollinearity refers to strong correlations among the independent variables. A good statistical model ideally has independent variables that correlate weakly with one another but strongly with the dependent variable. In the present study, no high correlations were observed between the independent variables (the largest was Anxiety vs. Antisaccade, r = 0.380). There was, however, a strong correlation between the dependent variable Startleblink and Antisaccade (r = 0.740) and a moderate one between Startleblink and Anxiety (r = 0.543), which is desirable.

How and when does multicollinearity bias a regression model?

Moderate multicollinearity may not be problematic. However, severe multicollinearity is a problem because it inflates the variance of the coefficient estimates and makes the estimates very sensitive to minor changes in the model. Fortunately, in our dataset the correlations among the independent variables were low, indicating low multicollinearity.

The variance inflation factor (VIF) indicates the extent to which multicollinearity is present in a regression analysis. A VIF of 5 or greater is a reason to be concerned about multicollinearity.
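As an illustration with a figure from this dataset: in the two-predictor case the VIF has the closed form 1/(1 - r2), where r is the correlation between the two predictors. Using the largest correlation between two IVs here, Anxiety vs. Antisaccade (r = .380):

```python
# VIF for a pair of predictors: 1 / (1 - r^2).
# r = .380 is the Anxiety-Antisaccade correlation from the table above.
r = 0.380
vif = 1 / (1 - r ** 2)
print(round(vif, 2))  # 1.17
```

This is well below the threshold of 5, consistent with the low-multicollinearity conclusion.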

  2. Perform a standard regression analysis with the IVs predicting startle blink response. Report on the effects observed, and comment on the contribution of the variables in predicting startle blink response.
Variables Entered/Removeda
Model Variables Entered Variables Removed Method
1 Antisaccade, Depression, PTSD, Anxietyb . Enter
a. Dependent Variable: Startleblink
b. All requested variables entered.

The model summary is shown in the table below. An R2 of 0.635 was observed; that is, the model explains 63.5% of the variance in startle blink response. The unique contribution of each variable is best judged from the standardized coefficients (Beta) and their significance in the coefficients table, rather than from the bivariate correlations alone.

Model Summary
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .797a .635 .612 1.94082
a. Predictors: (Constant), Antisaccade, Depression, PTSD, Anxiety

The fitted model is shown below.

startleblink = -6.888 + 0.112*PTSD + 0.088*Anxiety + 0.034*Depression + 0.018*Antisaccade

R2 = 0.635 and SEE = 1.94;

Coefficientsa
Model Unstandardized Coefficients Standardized Coefficients t Sig.
B Std. Error Beta
1 (Constant) -6.888 1.341 -5.138 .000
PTSD .112 .117 .080 .957 .342
Anxiety .088 .030 .265 2.974 .004
Depression .034 .042 .067 .824 .413
Antisaccade .018 .003 .609 7.336 .000
a. Dependent Variable: Startleblink

 

  3. Perform a stepwise regression analysis and compare your results with that of 2 above. How has the stepwise method altered the contribution of the independent variables? How can the regression analysis check for multicollinearity? And are there cases for concern?

The model developed using the stepwise regression analysis method is shown below.

Model Summary
Model R R Square Adjusted R Square Std. Error of the Estimate
1 .740a .547 .540 2.11259
2 .792b .627 .616 1.93046
a. Predictors: (Constant), Antisaccade
b. Predictors: (Constant), Antisaccade, Anxiety

As the model summary shows, the stepwise model fits almost as well as the standard model (R2 = 0.627 vs. 0.635) while retaining only two predictors, Antisaccade and Anxiety.

startleblink = -6.592 + 0.019*Antisaccade + 0.102*Anxiety

R2 = 0.627 and SEE = 1.93;

Compared to the previous model, this one has fewer independent variables and nearly the same performance; it is therefore a more parsimonious model for predicting startleblink.

Coefficientsa
Model Unstandardized Coefficients Standardized Coefficients t Sig.
B Std. Error Beta
1 (Constant) -3.306 1.061 -3.117 .003
Antisaccade .022 .002 .740 9.061 .000
2 (Constant) -6.592 1.299 -5.075 .000
Antisaccade .019 .002 .623 7.727 .000
Anxiety .102 .027 .306 3.800 .000
a. Dependent Variable: Startleblink

Checking for Multicollinearity – A condition index greater than 15 indicates a possible problem, and an index greater than 30 suggests a serious problem with collinearity. As can be seen in the Condition Index column of the table below, there were no worrisome cases.

Collinearity Diagnosticsa
Model Dimension Eigenvalue Condition Index Variance Proportions
(Constant) Antisaccade Anxiety
1 1 1.971 1.000 .01 .01
2 .029 8.280 .99 .99
2 1 2.948 1.000 .00 .01 .00
2 .033 9.507 .14 .99 .19
3 .019 12.374 .86 .00 .80
a. Dependent Variable: Startleblink
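As a sketch of where the Condition Index column comes from (this mirrors the usual eigenvalue construction; the data below are hypothetical, not the study's): the columns of the design matrix, constant included, are scaled to unit length, and each index is sqrt(lambda_max / lambda_i) over the eigenvalues of the scaled cross-product matrix. For a two-column design (constant plus one predictor) this reduces to a closed form:

```python
import math

# For two unit-scaled columns the cross-product matrix is [[1, c], [c, 1]],
# where c is the cosine between the columns; its eigenvalues are 1 + c and
# 1 - c, so the largest condition index is sqrt((1 + c) / (1 - c)).
x = [10.0, 200.0, 390.0, 50.0, 350.0]   # hypothetical predictor values
ones = [1.0] * len(x)
c = sum(a * b for a, b in zip(ones, x)) / (
    math.sqrt(len(x)) * math.sqrt(sum(a * a for a in x)))
condition_index = math.sqrt((1 + c) / (1 - c))
print(round(condition_index, 2))  # 2.95
```

With these well-spread values the index stays far below the warning threshold of 15, as in the SPSS table above.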
  4. Assume that you are interested in assessing the contribution of PTSD first and Antisaccade (inhibitory control) on the last step. Comment on the contribution of each model and the F-change and R2 change on every step. How do the results of this model compare with the stepwise model?

PTSD was entered only in the first step of the hierarchical model and was removed by the stepwise method, indicating that it contributes little once the other predictors are included. Antisaccade was present in every model and makes the largest contribution to explaining Startleblink.

  5. Write a brief summary on the role of PTSD in explaining startle blink response in soldiers when exposed to upsetting scenes. What did the analysis identify as the most important contributor?

PTSD showed only a modest correlation with Startleblink (r = 0.349) and was removed from the model by the stepwise method. The most important contributor was Antisaccade (r = 0.740; p < 0.01).

  6. Using the regression equation of the final model of the hierarchical method (as 4 above), if a soldier scores 8 on PTSD, 46 on Anxiety, 20 on Depression, and had an antisaccade latency of 301 msec, what would the model predict for his startle blink response?

startleblink = -6.888 + 0.112*PTSD + 0.088*Anxiety + 0.034*Depression + 0.018*Antisaccade

R2 = 0.635 and SEE = 1.94;

startleblink = -6.888 + 0.112*8 + 0.088*46 + 0.034*20 + 0.018*301

startleblink = 4.154
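The arithmetic above can be checked directly (`predict_startleblink` is an illustrative helper name):

```python
# The fitted standard-regression equation from the coefficients table.
def predict_startleblink(ptsd, anxiety, depression, antisaccade):
    return (-6.888 + 0.112 * ptsd + 0.088 * anxiety
            + 0.034 * depression + 0.018 * antisaccade)

# The soldier's scores: PTSD = 8, Anxiety = 46, Depression = 20,
# antisaccade latency = 301 msec.
print(round(predict_startleblink(8, 46, 20, 301), 3))  # 4.154
```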

Logistic model

1.

In the exploratory data analysis, no missing cases were observed, as shown in the table “Case Processing Summary”.

Case Processing Summary
Unweighted Casesa N Percent
Selected Cases Included in Analysis 30 100.0
Missing Cases 0 .0
Total 30 100.0
Unselected Cases 0 .0
Total 30 100.0
a. If weight is in effect, see classification table for the total number of cases.

 

The table below shows that 1 = relapsed and 0 = no relapse.

Dependent Variable Encoding
Original Value Internal Value
no relapse 0
relapsed 1

The table below shows both the observed and predicted values of the dependent variable, indicated by 0’s and 1’s. In the dataset, 17 of 30 cases (56.7%) were relapsed patients. The table gives the percentage of cases for which the dependent variable was correctly predicted by the model.

Block 0: Beginning Block

Classification Tablea,b
Observed Predicted
relapse Percentage Correct
no relapse relapsed
Step 0 relapse no relapse 0 13 .0
relapsed 0 17 100.0
Overall Percentage 56.7
a. Constant is included in the model.
b. The cut value is .500

The null model is shown in the next table; it contains only the constant (intercept), indicated by B, together with the standard error (S.E.) around that coefficient. The Wald chi-square statistic tests the null hypothesis that the constant equals 0. Because the p-value (0.467) is greater than the critical value of .05, the null hypothesis is not rejected, so there is no evidence that the intercept differs from 0. Usually this finding is not of interest to researchers. Exp(B) is the exponentiation of the B coefficient, an odds ratio of 1.308; it is reported by default because odds ratios can be easier to interpret than coefficients.

Variables in the Equation
B S.E. Wald df Sig. Exp(B)
Step 0 Constant .268 .368 .530 1 .467 1.308

In the next table, the Score test is used to predict whether each independent variable would be significant in the model. Looking at the p-values (the column labelled “Sig.”), only WMC (p < 0.001) and FE (p = 0.003) were statistically significant predictors. The Overall Statistics row shows the result of including all predictors in the model.

Variables not in the Equation
Score df Sig.
Step 0 Variables WMC 20.865 1 .000
FE 8.967 1 .003
Severity 7.805 9 .554
Severity(1) 2.802 1 .094
Severity(2) 2.802 1 .094
Severity(3) .136 1 .713
Severity(4) .027 1 .869
Severity(5) .084 1 .773
Severity(6) .632 1 .427
Severity(7) .039 1 .844
Severity(8) .136 1 .713
Severity(9) .136 1 .713
Overall Statistics 23.335 11 .016

 Block 1: Method = Forward Stepwise

Omnibus Tests of Model Coefficients
Chi-square df Sig.
Step 1 Step 31.059 3 .000
Block 31.059 3 .000
Model 31.059 3 .000
  2. Comment on the significance of the model(s) tested, the amount of variation explained by the model(s), and the -2LLs of interest.

The -2LL is shown in the table below.

Model Summary
Step -2 Log likelihood Cox & Snell R Square Nagelkerke R Square
1 9.995a .645 .865
a. Estimation terminated at iteration number 8 because parameter estimates changed by less than .001.
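The pseudo-R2 values above can be reconstructed from the reported -2LL of the fitted model (9.995), the model chi-square (31.059, from the omnibus table), and n = 30 cases, using the standard Cox & Snell and Nagelkerke formulas; a quick check:

```python
import math

n = 30
neg2ll_model = 9.995      # -2LL of the fitted model (Model Summary)
chi_square = 31.059       # model chi-square (Omnibus table)
neg2ll_null = neg2ll_model + chi_square   # -2LL of the null model

# Cox & Snell: 1 - exp(-chi_square / n).
cox_snell = 1 - math.exp(-chi_square / n)
# Nagelkerke rescales Cox & Snell so its maximum is 1.
nagelkerke = cox_snell / (1 - math.exp(-neg2ll_null / n))

print(round(cox_snell, 3), round(nagelkerke, 3))  # 0.645 0.865
```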

The next table shows the omnibus test for the forward stepwise model. The value in the Sig. column is the probability of obtaining the chi-square statistic given that the null hypothesis is true. There is just one step, with a large chi-square statistic (29.521). The p-value is compared to a critical value (.05 or .01) to determine whether the overall model is statistically significant; here the model is statistically significant because the p-value is less than the significance level.

Omnibus Tests of Model Coefficients
Chi-square df Sig.
Step 1 Step 29.521 1 .000
Block 29.521 1 .000
Model 29.521 1 .000

The next table gives the values of the logistic regression equation for predicting the dependent variable from the independent variable, in log-odds units. The prediction equation is:

log(p/(1-p)) = 12.077 - 4.241*WMC

where p is the probability of relapse.
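The log-odds equation above can be turned into a predicted probability with the logistic (sigmoid) transform; a small sketch (`relapse_probability` is an illustrative name, not part of the SPSS output):

```python
import math

def relapse_probability(wmc):
    """Invert the log-odds: p = 1 / (1 + exp(-(b0 + b1 * WMC)))."""
    log_odds = 12.077 - 4.241 * wmc
    return 1 / (1 + math.exp(-log_odds))
```

Because the WMC coefficient is negative, the predicted probability of relapse falls as the WMC score rises.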

  3. Which variable(s) predict the likelihood of relapse significantly? Provide an interpretation of the significance of Exp(B) of the significant variable(s).

Because these coefficients are in log-odds units they can be difficult to interpret, so they are often converted into odds ratios. You can do this by hand by exponentiating the coefficient, or by looking at the column labelled “Exp(B)”.

The constant is the expected value of the log-odds of relapse when all of the predictor variables equal zero.

For every one-unit increase in WMC score, we expect a 4.241 decrease in the log-odds of relapse; equivalently, the odds of relapse are multiplied by Exp(B) = 0.014. WMC is the only significant predictor retained in the model (p = .007).
Variables in the Equation
B S.E. Wald df Sig. Exp(B)
Step 1a WMC -4.241 1.569 7.307 1 .007 .014
Constant 12.077 4.764 6.428 1 .011 175795.687
a. Variable(s) entered on step 1: WMC.
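The Exp(B) column is simply the exponentiated B coefficient; a one-line check for WMC:

```python
import math

# Odds ratio for WMC: exponentiate its B coefficient from the table above.
b_wmc = -4.241
odds_ratio = math.exp(b_wmc)
print(round(odds_ratio, 3))  # 0.014
```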

ROC curve

A measure of goodness-of-fit often used to evaluate the fit of a logistic regression model is based on the simultaneous measure of sensitivity (true positives) and specificity (true negatives) for all possible cutoff points. First, we calculate sensitivity and specificity pairs for each possible cutoff point and plot sensitivity on the y axis against (1 - specificity) on the x axis. This curve is called the receiver operating characteristic (ROC) curve. The area under the ROC curve ranges from 0.5 to 1.0, with larger values indicating better fit.
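Equivalently, the area under the ROC curve is the probability that a randomly chosen positive case receives a higher predicted probability than a randomly chosen negative case (ties count as half); a minimal rank-based sketch with hypothetical predicted probabilities, not the study's data:

```python
def auc(pos_scores, neg_scores):
    """Rank-based AUC: fraction of (positive, negative) pairs where the
    positive case scores higher; ties contribute 0.5."""
    pairs = [(p, q) for p in pos_scores for q in neg_scores]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p, q in pairs)
    return wins / len(pairs)

# Hypothetical predicted probabilities for relapsed vs. non-relapsed cases.
print(auc([0.9, 0.8, 0.4], [0.6, 0.3]))
```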

Area Under the Curve
Test Result Variable(s):   Predicted probability
Area Std. Errora Asymptotic Sig.b Asymptotic 95% Confidence Interval
Lower Bound Upper Bound
.950 .049 .000 .854 1.000
a. Under the nonparametric assumption
b. Null hypothesis: true area = 0.5

 

The SPSS output shows the ROC curve. The area under the curve is 0.950, with a 95% confidence interval of (.854, 1.000). The area is also significantly different from 0.5 (p < .001), meaning that the logistic regression classifies the groups significantly better than chance.