# Understanding regression analysis

Regression analysis is a predictive modeling technique that investigates the relationship between a dependent variable and one or more independent variables. It is used for forecasting, time series modeling, and finding cause-and-effect relationships between variables. Regression analysis is an important tool for modeling and analyzing data.

## Introduction

The objective of this study is, first, to determine whether there is a significant difference in the average length of hospital stay between the pre and post periods and, second, to determine whether there is a significant difference in the average length of hospital stay among different insurance types. We also explore which variables are important for predicting the length of stay in the hospital. The dataset was extracted from the Sentara hospital system and covers patients undergoing Coronary Artery Bypass Graft (CABG) surgery. It contains information on 1,013 patients.

## Method

To test whether the average length of hospital stay differs between the pre and post periods, we use a simple linear regression model. The model to be estimated is

$$\text{length of stay} = \beta_0 + \beta_1\,\text{pre\_post}$$

The null and alternative hypotheses are

$H_0: \beta_1 = 0$ (there is no significant difference in the average length of hospital stay between the pre and post periods)

$H_1: \beta_1 \neq 0$ (there is a significant difference in the average length of hospital stay between the pre and post periods)

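
As a minimal illustration of this setup, the following Python sketch fits a simple regression on a 0/1 period indicator and shows that the estimated slope equals the difference between the two group means. The data values and the helper name `simple_ols` are hypothetical and are not taken from the study; the analysis itself was run in SAS (see the Appendix).

```python
from statistics import mean


def simple_ols(x, y):
    """Return (intercept, slope) of the least-squares line y = b0 + b1 * x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical toy data (not from the study): 0 = pre period, 1 = post period.
pre_post = [0, 0, 0, 0, 1, 1, 1, 1]
length_of_stay = [12, 14, 11, 13, 10, 12, 11, 9]

b0, b1 = simple_ols(pre_post, length_of_stay)
pre_mean = mean(y for x, y in zip(pre_post, length_of_stay) if x == 0)
post_mean = mean(y for x, y in zip(pre_post, length_of_stay) if x == 1)

# With a 0/1 predictor, the intercept is the mean of the group coded 0
# and the slope is the difference between the two group means.
print(b0, b1, pre_mean, post_mean)
```

Testing $\beta_1 = 0$ in this model is therefore the same as testing whether the two period means are equal.
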
To determine whether the average length of hospital stay differs among insurance types, we use one-way ANOVA. The null and alternative hypotheses are:

$H_0$: There is no significant difference in the average length of hospital stay among insurance types.

$H_1$: There is a significant difference in the average length of hospital stay among insurance types.

The assumptions underlying this test are:

1.  Independence: Each record in the data must be a distinct and independent entity. This is met because each observation belongs to only one group of the categorical variable.
2.  Normality: The response at each factor level is normally distributed, i.e. the length of stay within each insurance group must be normally distributed.
3.  Homogeneity of variance: The variances of the groups are equal.

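
To make the test statistic concrete, here is a small Python sketch of how the one-way ANOVA F statistic is built from between-group and within-group variation. The three groups and their values are hypothetical, and `one_way_anova_f` is an illustrative helper rather than the SAS procedure used in the study.

```python
from statistics import mean


def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a list of groups of observations."""
    all_values = [v for g in groups for v in g]
    grand_mean = mean(all_values)
    k, n = len(groups), len(all_values)
    # Variation of the group means around the grand mean
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Variation of the observations around their own group mean
    ss_within = sum((v - mean(g)) ** 2 for g in groups for v in g)
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Hypothetical length-of-stay values for three insurance groups
groups = [[10, 12, 11], [14, 15, 13], [9, 8, 10]]
f_stat, df1, df2 = one_way_anova_f(groups)
print(f_stat, df1, df2)
```

With four insurance groups and 1,013 patients, the same degrees-of-freedom rule gives 4 - 1 = 3 and 1,013 - 4 = 1,009.
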
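To identify the variables that are important for predicting the length of stay in the hospital, we use a multiple linear regression model. The model to be estimated is

$$\text{length of stay} = \beta_0 + \beta_1\,\text{hosp charge} + \beta_2\,\text{race} + \beta_3\,\text{insurance} + \beta_4\,\text{age} + \beta_5\,\text{infection} + \beta_6\,\text{heart attack} + \beta_7\,\text{glucose}$$

The hypotheses to be tested are

$$H_{01}: \beta_1 = 0;\; H_{02}: \beta_2 = 0;\; \dots;\; H_{07}: \beta_7 = 0$$

$$H_{11}: \beta_1 \neq 0;\; H_{12}: \beta_2 \neq 0;\; \dots;\; H_{17}: \beta_7 \neq 0$$
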
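
Race and insurance are categorical, so in practice each of them enters the model as a set of 0/1 indicator columns rather than a single number. The Python sketch below shows one common way to do this (reference coding). The helper name `dummy_code`, the example values, and the choice of `Private` as the reference level are assumptions made for illustration only.

```python
def dummy_code(values, reference):
    """Expand a categorical variable into 0/1 columns, omitting the reference level."""
    levels = sorted(set(values) - {reference})
    rows = [[1 if v == level else 0 for level in levels] for v in values]
    return levels, rows


# Hypothetical insurance values for five patients
insurance = ["Private", "Medicaid", "Medicare", "Others", "Private"]
levels, rows = dummy_code(insurance, reference="Private")

print(levels)
for value, row in zip(insurance, rows):
    print(value, row)
```

A variable with four levels contributes three indicator columns, and each coefficient compares one level with the reference level.
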
The assumptions of simple and multiple linear regression are:

Linearity: There must be a linear relationship between the dependent and the independent variables.

No autocorrelation: Autocorrelation occurs when the residuals are not independent of each other, in other words when the value of y(x+1) is not independent of the value of y(x). For the linear regression model, we expect the residuals to be independent of one another.

Normality of residuals: We expect the residuals from the model to be normally distributed.

No heteroscedasticity: We expect the residual variance to be constant. Heteroscedasticity occurs when the variance of the residuals changes across observations.

No outliers: Outliers are values that are much larger or much smaller than the other observations, and they may bias the estimates from the regression model. We require that no outliers exist in the dataset.

Little or no multicollinearity: Multicollinearity exists when the independent variables are very highly correlated with one another. We therefore expect the correlations between the independent variables not to be too high.
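
Two of these assumptions can be checked with simple summary numbers. The Python sketch below computes the Durbin-Watson statistic for autocorrelation (values near 2 suggest no first-order autocorrelation) and a Pearson correlation between two predictors for multicollinearity. The residuals and predictor values are hypothetical, and these particular checks are offered as an illustration rather than as the diagnostics used in the study.

```python
from math import sqrt
from statistics import mean


def durbin_watson(residuals):
    """Durbin-Watson statistic computed from residuals in observation order."""
    numerator = sum(
        (residuals[i] - residuals[i - 1]) ** 2 for i in range(1, len(residuals))
    )
    denominator = sum(e ** 2 for e in residuals)
    return numerator / denominator


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# Hypothetical residuals from a fitted model
residuals = [1.0, -1.0, -1.0, 1.0, 1.0, -1.0, -1.0, 1.0]
dw = durbin_watson(residuals)

# Hypothetical values of two predictors
age = [50, 60, 70, 80]
glucose = [120, 130, 150, 140]
r = pearson_r(age, glucose)

print(dw, r)
```
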

## Results

The descriptive statistics in Table 1 show that the average patient age is 63.9 years (SD = 10.06 years). The average length of stay in the hospital is 11.73 days (SD = 8.03 days), while the average hospital charge is \$150,606.1 (SD = \$108,393.8) and the average glucose level is 137.25 (SD = 15.84). Of the patients, 43.83% were measured during the pre-implementation period and 56.17% during the post-implementation period. 72.06% of patients suffered a heart attack while 27.94% did not. 17.47% of patients suffered in-hospital infections while 82.53% did not. By insurance type, 5.73% of patients have Medicaid, 25.77% have Medicare, 9.97% have other insurance, and 58.54% have private insurance. By race, 28.23% are African American, 63.28% are White, and 7.9% are of other races.

Table 2 presents the results of the simple linear regression of length of stay on a dummy variable indicating the pre- or post-implementation period. The slope estimate is significant (β = -1.154, 95% CI [-2.15, -0.16], p = 0.02), which means the average length of stay differs significantly between the pre- and post-implementation periods.

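
The Table 2 estimates can be turned into group means and checked against the overall mean in Table 1. The short Python sketch below does this, assuming that the pre-implementation period is the reference category (coded 0), which is consistent with the Discussion but is not stated explicitly in the output.

```python
# Estimates reported in Table 2
intercept = 12.38  # mean length of stay in the reference period (assumed: pre)
slope = -1.15      # change in mean length of stay for the other period (assumed: post)

pre_mean = intercept
post_mean = intercept + slope

# Shares of patients in each period, from Table 1
share_pre, share_post = 0.4383, 0.5617
overall_mean = share_pre * pre_mean + share_post * post_mean

print(round(pre_mean, 2), round(post_mean, 2), round(overall_mean, 2))
```

The weighted average of the two implied group means comes out at about 11.73 days, which agrees with the overall mean length of stay reported in Table 1.
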
Table 3 presents the ANOVA results testing for differences in length of stay among insurance types. The result, F(3, 1009) = 4.39, p = 0.004, means we reject the null hypothesis: there is a significant difference in the average length of stay among the insurance types. The multiple comparison results show that the only significant pairwise difference is between private insurance and Medicaid (p = 0.019).

Table 4 presents the results of the multiple regression model used to identify which variables are important in predicting length of stay. Age (β = 0.03, 95% CI [0.002, 0.058], p = 0.034), hospital charge (β = 5.92e-05, 95% CI [5.67e-05, 6.17e-05], p < 0.001), heart attack (β = 1.25, 95% CI [0.68, 1.83], p < 0.01), infection (β = 1.7, 95% CI [0.995, 2.40], p < 0.001), and average glucose (β = 0.05, 95% CI [0.03, 0.07], p < 0.01) were the only significant variables in the model.
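
Table 1: Descriptive summary of Infection data

| Variable | Mean | Std. Dev. |
|---|---|---|
| pat_age | 63.9 | 10.06 |
| losadmitdischarge | 11.73445 | 8.031901 |
| hosp_charge | 150606.1 | 108393.8 |
| glucose | 137.2452 | 15.84332 |

| Variable | Level | n | % |
|---|---|---|---|
| pre_post | pre | 444 | 43.83 |
| | post | 569 | 56.17 |
| heart-attack | No | 283 | 27.94 |
| | Yes | 730 | 72.06 |
| infection | No | 836 | 82.53 |
| | Yes | 177 | 17.47 |
| Insurance | Medicaid | 58 | 5.73 |
| | Medicare | 261 | 25.77 |
| | Others | 101 | 9.97 |
| | Private | 593 | 58.54 |
| Race | AAs | 286 | 28.23 |
| | Others | 80 | 7.9 |
| | Whites | 641 | 63.28 |
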
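
The percentages in Table 1 follow directly from the counts. As a quick check, the Python sketch below recomputes the insurance percentages from the reported counts; only the variable names are introduced here.

```python
# Insurance counts as reported in Table 1
counts = {"Medicaid": 58, "Medicare": 261, "Others": 101, "Private": 593}

total = sum(counts.values())
percentages = {name: round(100 * n / total, 2) for name, n in counts.items()}

print(total)
print(percentages)
```

The counts add up to the 1,013 patients in the dataset, and the recomputed percentages match the reported ones.
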
Table 2: Simple linear regression model

| Variables | Estimate | p-value | 95% confidence interval |
|---|---|---|---|
| Intercept | 12.38 | <.0001 | (11.63, 13.13) |
| pre-post | -1.15 | 0.02 | (-2.15, -0.16) |
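
Table 2 reports the confidence interval but not the standard error. As a rough consistency check, the Python sketch below backs out an approximate standard error from the interval and converts it into a two-sided p-value. It assumes a normal approximation with a critical value of 1.96, which is reasonable for a sample of this size but is not necessarily how SAS computed the reported values.

```python
from math import erfc, sqrt

# Slope estimate and 95% confidence limits for pre-post
estimate, lower, upper = -1.154, -2.15, -0.16

z_crit = 1.96  # assumed normal critical value for a 95% interval
std_error = (upper - lower) / (2 * z_crit)
z = estimate / std_error

# Two-sided p-value under the normal approximation
p_two_sided = erfc(abs(z) / sqrt(2))

print(round(std_error, 3), round(z, 2), round(p_two_sided, 3))
```

The resulting p-value is close to 0.02, in line with the value reported in Table 2.
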

Table 3: ANOVA result

| Test | df | Test statistic | p |
|---|---|---|---|
| ANOVA F | 3, 1009 | 4.39 | 0.0044 |

| Multiple comparison | p |
|---|---|
| Medicare - Medicaid | 0.735 |
| Others - Medicaid | 0.674 |
| Others - Medicare | 1.000 |
| Private - Medicaid | 0.019 |
| Private - Medicare | 0.084 |
| Private - Others | 1.000 |

Table 4: Parameter Estimates of the Multiple Linear Regression

## Discussion

Length of stay in the post-implementation period is lower than in the pre-implementation period by 1.15 days (95% CI [-2.15, -0.16]). This suggests that the intervention program was successful. Length of stay also differs significantly across insurance types, but the difference occurs only between private insurance patients and Medicaid patients; no difference is found between the other pairs.

Age (β = 0.03, 95% CI [0.002, 0.058], p = 0.034), hospital charge (β = 5.92e-05, 95% CI [5.67e-05, 6.17e-05], p < 0.001), heart attack (β = 1.25, 95% CI [0.68, 1.83], p < 0.01), infection (β = 1.7, 95% CI [0.995, 2.40], p < 0.001), and average glucose (β = 0.05, 95% CI [0.03, 0.07], p < 0.01) were the significant factors affecting length of stay in the hospital.

An additional year of age translates to 0.03 more days in the hospital. This is plausible, since the body tends to respond less well to treatment as people age. Hospital charge has a very small per-dollar effect on length of stay: a one-dollar increase in hospital charge is associated with an increase of about 5.1 seconds (5.92e-05 days × 86,400 seconds per day), or equivalently, \$100,000 in additional hospital charges corresponds to an additional 5.92 days in the hospital. Patients who experienced a heart attack stayed 1.25 days longer than those who did not, and those who suffered an infection while in the hospital stayed 1.7 days longer than those who did not. A one-unit increase in average glucose increases the length of stay by 0.05 days. White patients stay in the hospital 0.63 days less than African American patients (p = 0.036).
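
Because the hospital charge coefficient is expressed in days per dollar, it helps to convert it into more readable units. The Python sketch below does the arithmetic using the coefficients reported in the Results; the variable names are illustrative.

```python
beta_charge = 5.92e-05  # days of stay per additional dollar of hospital charge
beta_age = 0.03         # days of stay per additional year of age

SECONDS_PER_DAY = 24 * 60 * 60

seconds_per_dollar = beta_charge * SECONDS_PER_DAY  # effect of one extra dollar
days_per_100k = beta_charge * 100_000               # effect of an extra $100,000
days_per_decade = beta_age * 10                     # effect of ten extra years of age

print(round(seconds_per_dollar, 2), round(days_per_100k, 2), round(days_per_decade, 2))
```
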

## Appendix

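/* Descriptive statistics for the numeric variables */
proc means data=WORK.QUERY chartype mean std min max n vardef=df;
run;

/* Frequency tables for the categorical variables */
proc freq data=WORK.QUERY;
    tables pre_post insurance race heartattack infection / plots=(freqplot cumfreqplot);
run;

/* Simple linear regression of length of stay on pre_post.
   Assumed: the proc glmselect and model statements are missing from the original listing. */
proc glmselect data=WORK.QUERY outdesign(addinputvars)=Work.reg_design;
    class pre_post / param=glm;
    model losadmitdischarge = pre_post / showpvalues selection=none;
run;
proc reg data=Work.reg_design alpha=0.05 plots(only)=(diagnostics residuals observedbypredicted);
    where pre_post is not missing;
    ods select ParameterEstimates OutputStatistics ResidualStatistics SpecTest DiagnosticsPanel ResidualPlot ObservedByPredicted;
    model losadmitdischarge = &_GLSMOD; /* assumed: design columns written by proc glmselect */
run;

/* Multiple linear regression (proc glmselect statement assumed, as above) */
proc glmselect data=WORK.QUERY outdesign(addinputvars)=Work.reg_design;
    class insurance race heartattack infection / param=glm;
    model losadmitdischarge = pat_age avg_glucose hosp_charge insurance race heartattack infection / showpvalues selection=none;
run;
proc reg data=Work.reg_design alpha=0.05 plots(only)=(diagnostics residuals observedbypredicted);
    where insurance is not missing and race is not missing and heartattack is not missing and infection is not missing;
    ods select ParameterEstimates OutputStatistics ResidualStatistics SpecTest DiagnosticsPanel ResidualPlot ObservedByPredicted;
    model losadmitdischarge = &_GLSMOD; /* assumed: design columns written by proc glmselect */
run;

/* One-way ANOVA of length of stay by insurance type */
proc glm data=WORK.QUERY;
    class insurance;
    model losadmitdischarge = insurance; /* assumed: the original listing ends after the class statement */
run;
quit;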