# Understanding Logistic Regression

Logistic regression is a statistical method for finding an equation that predicts the outcome of a binary variable. Its main difference from linear regression lies in the quantity being modeled: linear regression models the expected value of a continuous outcome directly, whereas logistic regression models the log odds of the outcome as a linear function of the predictors. Because the log odds can take any real value while the implied probability always stays between 0 and 1, logistic regression suits yes/no outcomes that linear regression handles poorly.
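
To make the log-odds idea concrete, here is a minimal Python sketch of the mapping between a probability and its log odds. The coefficients `b0` and `b1` are hypothetical values chosen for illustration; they are not estimated from any data in this document.

```python
import math

def log_odds(p):
    """Logit: map a probability in (0, 1) to the log odds."""
    return math.log(p / (1 - p))

def inv_logit(z):
    """Logistic function: map any real number back to a probability."""
    return 1 / (1 + math.exp(-z))

# Hypothetical coefficients, for illustration only
b0, b1 = -1.0, 0.5

def predicted_probability(x):
    """The linear predictor b0 + b1*x is a log odds; convert it to a probability."""
    return inv_logit(b0 + b1 * x)
```

With these assumed coefficients, `predicted_probability(2.0)` corresponds to a log odds of zero, that is, even odds.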

## Descriptive statistics

Descriptive statistics of the diagnostic interval by year (Table 1) show that the average diagnostic interval is highest in 2015 (M=36.87, SD=42.41, N=1,548), followed by 2010 (M=34.56, SD=41.47, N=1,353), 2016 (M=33.85, SD=41.94, N=1,630), 2013 (M=33.40, SD=38.17, N=1,422), 2012 (M=32.52, SD=36.74, N=1,330), and 2014 (M=32.35, SD=39.03, N=1,356); it is lowest in 2011 (M=31.87, SD=39.19, N=1,270). By region, Region 4 (M=42.36, SD=48.00, N=622) has the highest average diagnostic interval, followed by Region 1 (M=34.78, SD=40.64, N=4,445) and Region 3 (M=33.19, SD=38.47, N=1,760); Region 2 has the lowest (M=30.73, SD=37.88, N=3,082).
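
A grouped summary of this kind (N, mean, and standard deviation per level, then a ranking by mean) can be sketched with the Python standard library. The records below are hypothetical toy values, not the study data, and `summarize_by` is an illustrative helper name.

```python
from collections import defaultdict
from statistics import mean, stdev

# Hypothetical toy records: (year, diagnostic interval in days)
records = [(2010, 30), (2010, 40), (2011, 10), (2011, 20), (2011, 60)]

def summarize_by(records):
    """Group values by level and report N, mean (M), and sample SD."""
    groups = defaultdict(list)
    for level, value in records:
        groups[level].append(value)
    return {
        level: {"N": len(vals), "M": mean(vals), "SD": stdev(vals)}
        for level, vals in groups.items()
    }

summary = summarize_by(records)

# Levels ordered from highest to lowest average interval
ranked = sorted(summary, key=lambda level: summary[level]["M"], reverse=True)
```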

## Data research analysis

The model is a multiple linear regression. The dependent variable is the diagnostic interval (in days), and the independent variables of interest are year of diagnosis and health region. The control variables are age group, community size, cancer stage, neighborhood income, and detection method. All independent and control variables are coded as dummy variables, and the diagnostic interval is regressed on them. Because every regressor is a dummy variable, the first level of each variable is omitted as the base category to avoid perfect multicollinearity. The adjusted average diagnostic interval for any other level is then estimated by adding the constant to that level's coefficient, and we test the hypothesis that this adjusted average equals 49 days (the 7-week target). If p<0.05, we conclude that the guideline is not adhered to for that level. The final hypothesized model is

$$
\text{diag\_int} = \beta_0 + \sum_{j=1}^{6}\beta_j\,\text{Year}_j + \sum_{j=7}^{9}\beta_j\,\text{Region}_j + \beta_{10}\,\text{csize} + \sum_{j=11}^{14}\beta_j\,\text{stage}_j + \beta_{15}\,\text{Income} + \sum_{j=16}^{18}\beta_j\,\text{age\_group}_j + \beta_{19}\,\text{det} + \varepsilon
$$

where $\beta_0$ is the constant and every other term is a dummy variable for a non-base level, giving 19 dummy coefficients in total.
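
The "constant plus coefficient" step can be illustrated with a small sketch. It dummy-codes a single categorical variable with a base level, fits ordinary least squares by solving the normal equations, and recovers each level's average as the constant plus that level's coefficient. The data are hypothetical toy values and the helper names `dummy_code` and `ols` are illustrative; this is not the Stata procedure used in the study, and with several categorical variables the constant instead refers to the combination of all base levels.

```python
def dummy_code(levels, base):
    """Return column names and rows: an intercept plus one 0/1 column per non-base level."""
    others = sorted(set(levels) - {base})
    rows = [[1.0] + [1.0 if lv == o else 0.0 for o in others] for lv in levels]
    return ["_cons"] + others, rows

def ols(X, y):
    """Solve the normal equations (X'X) b = X'y by Gauss-Jordan elimination."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y]
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for c in range(k):
        # Partial pivoting: bring the largest remaining entry in column c to the diagonal
        p = max(range(c, k), key=lambda r: abs(A[r][c]))
        A[c], A[p] = A[p], A[c]
        pivot = A[c][c]
        A[c] = [v / pivot for v in A[c]]
        for r in range(k):
            if r != c:
                factor = A[r][c]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[c])]
    return [A[i][k] for i in range(k)]

# Hypothetical toy data: region label and diagnostic interval in days
levels = ["R1", "R1", "R2", "R2", "R4", "R4"]
y = [30, 40, 25, 35, 50, 70]

names, X = dummy_code(levels, base="R1")
b = ols(X, y)

# Adjusted average for each non-base level = constant + that level's coefficient
adjusted = {name: b[0] + bj for name, bj in zip(names[1:], b[1:])}
```

Omitting the base level is what keeps `X'X` invertible: with a dummy for every level, the dummy columns would sum to the intercept column.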

## Data research output model

Study Objective: The objective of the study is to investigate whether the guideline of a 7-week target for the diagnostic interval is adhered to in every year and every health region in Alberta.

Method: The data are simulated records of all first-ever primary breast cancers in women in Alberta. The data set consists of 9,909 observations and 10 variables: id, diagnostic interval, region, year, detection method, age, age group, cancer stage, community size, and neighborhood income. The method of analysis is multiple linear regression, carried out in Stata 14.

Result: The regression result is presented in Table 2. Our interest is in the last two columns, which give the adjusted average for each non-base level and the p-value for the null hypothesis that it equals 49. For the base levels, the adjusted average diagnostic interval is the constant (M=51.39), which is not significantly different from 49 (p=0.3465). For year, all p-values are greater than 0.05 except for 2015 (M=54.00, p=0.0469). For health region, all p-values are greater than 0.05 except for Region 4 (M=58.63, p=0.0007). Among the control variables, all p-values for income and age group are greater than 0.05, which means their adjusted averages are not significantly different from 49 days. However, for urban community size, the adjusted mean is significantly greater than 49 (M=53.87, p=0.0034). For cancer stage (stages 1 to 4, all p<0.001) and screen detection (M=41.58, p=0.0045), the average diagnostic interval is significantly less than 49 days.
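
The adjusted estimates can be checked by hand from the coefficients in Table 2. The sketch below adds the constant to a few of the reported coefficients and, for the base levels only, recomputes the test against 49 days. Two assumptions are mine rather than the source's: that the reported p-values come from a two-sided test, and that a normal approximation to the t distribution is adequate with 9,847 residual degrees of freedom. Under those assumptions the p-value for the constant should land close to the reported 0.3465. The same test for a non-base level would also need the covariance between the constant and the coefficient, which Table 2 does not report.

```python
import math

CONS, CONS_SE = 51.38636, 2.534833  # _cons and its standard error, from Table 2
TARGET = 49                         # 7-week guideline expressed in days

# Selected coefficients from Table 2
coefs = {
    "year 2015": 2.609912,
    "Region 4": 7.243908,
    "Urban": 2.484267,
    "detn Yes": -9.80668,
}

# Adjusted estimate for a level = constant + that level's coefficient
adjusted = {name: CONS + b for name, b in coefs.items()}

# Two-sided test of H0: constant = 49 (normal approximation, an assumption)
t_stat = (CONS - TARGET) / CONS_SE
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

# Decision rule used in the text: p < 0.05 means the guideline is not adhered to
guideline_not_met = p_value < 0.05
```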

Conclusion: Given the results above, we conclude that the guideline is adhered to in all years except 2015 and in all health regions except Region 4. Income and age group do not affect whether the guideline is met, while community size, detection method, and cancer stage do.

## Appendix

Table 1: Summary statistics of the diagnostic interval by independent variables

| Variable | Level | Obs | Mean | Std. Dev. | Min | Max |
|---|---|---|---|---|---|---|
| Year | 2010 | 1,353 | 34.5558 | 41.47153 | 0 | 295 |
| | 2011 | 1,270 | 31.86535 | 39.19425 | 0 | 280 |
| | 2012 | 1,330 | 32.51579 | 36.73751 | 0 | 241 |
| | 2013 | 1,422 | 33.39803 | 38.17135 | 0 | 268 |
| | 2014 | 1,356 | 32.35103 | 39.02998 | 0 | 310 |
| | 2015 | 1,548 | 36.8708 | 42.4108 | 0 | 285 |
| | 2016 | 1,630 | 33.85215 | 41.93524 | 0 | 281 |
| Region | Region 1 | 4,445 | 34.78313 | 40.64375 | 0 | 267 |
| | Region 2 | 3,082 | 30.72875 | 37.88113 | 0 | 310 |
| | Region 3 | 1,760 | 33.19375 | 38.46828 | 0 | 285 |
| | Region 4 | 622 | 42.35691 | 48.00052 | 0 | 268 |
| csize | Rural | 2,071 | 33.27764 | 38.8659 | 0 | 268 |
| | Urban | 7,838 | 33.83082 | 40.33199 | 0 | 310 |
| stage | 0 | 1,334 | 44.43853 | 46.42146 | 0 | 285 |
| | 1 | 3,990 | 32.50752 | 39.29039 | 0 | 295 |
| | 2 | 3,014 | 31.05209 | 37.88881 | 0 | 310 |
| | 3 | 1,210 | 33.27603 | 39.17303 | 0 | 263 |
| | 4 | 361 | 31.14404 | 36.56252 | 0 | 225 |
| Incomeq | High | 4,095 | 33.26935 | 39.37119 | 0 | 295 |
| | Low | 5,772 | 34.01421 | 40.50235 | 0 | 310 |
| Age group | 39- | 603 | 37.94859 | 44.98094 | 0 | 310 |
| | 40-49 | 1,721 | 32.49448 | 38.4354 | 0 | 295 |
| | 50-69 | 5,451 | 33.23115 | 39.31082 | 0 | 285 |
| | 70+ | 2,134 | 34.73993 | 41.52299 | 0 | 285 |
| Detection method | No | 5,860 | 36.64693 | 42.89485 | 0 | 310 |
| | Yes | 4,049 | 29.47222 | 35.04632 | 0 | 267 |

Table 2: Regression result

| Source | SS | df | MS |
|---|---|---|---|
| Model | 499903.2 | 19 | 26310.69 |
| Residual | 15314479 | 9,847 | 1555.243 |
| Total | 15814382 | 9,866 | 1602.917 |

| Statistic | Value |
|---|---|
| Number of obs | 9,867 |
| F(19, 9847) | 16.92 |
| Prob > F | 0 |
| R-squared | 0.0316 |
| Adj R-squared | 0.0297 |
| Root MSE | 39.437 |

| diag_int | Level | Coef. | Std. Err. | t | P>\|t\| | 95% CI lower | 95% CI upper | Adjusted estimate | p (H0: adjusted estimate = 49) |
|---|---|---|---|---|---|---|---|---|---|
| year | 2011 | -2.78388 | 1.544251 | -1.8 | 0.071 | -5.81092 | 0.243174 | 48.60249 | 0.8764 |
| | 2012 | -1.86133 | 1.52701 | -1.22 | 0.223 | -4.85458 | 1.131927 | 49.52503 | 0.8371 |
| | 2013 | -1.05957 | 1.501508 | -0.71 | 0.48 | -4.00284 | 1.883691 | 50.32679 | 0.6017 |
| | 2014 | -1.49114 | 1.520678 | -0.98 | 0.327 | -4.47198 | 1.489703 | 49.89522 | 0.7248 |
| | 2015 | 2.609912 | 1.471206 | 1.77 | 0.076 | -0.27395 | 5.493776 | 53.99627 | 0.0469 |
| | 2016 | -0.46243 | 1.456265 | -0.32 | 0.751 | -3.31701 | 2.392143 | 50.92393 | 0.439 |
| stage | 1 | -12.6648 | 1.253695 | -10.1 | 0 | -15.1223 | -10.2073 | 38.72153 | <0.001 |
| | 2 | -16.7358 | 1.336492 | -12.52 | 0 | -19.3556 | -14.116 | 34.65059 | <0.001 |
| | 3 | -15.5823 | 1.616628 | -9.64 | 0 | -18.7512 | -12.4134 | 35.80404 | <0.001 |
| | 4 | -17.8391 | 2.384647 | -7.48 | 0 | -22.5135 | -13.1647 | 33.5473 | <0.001 |
| rhan | Region 2 | -4.46662 | 0.940487 | -4.75 | 0 | -6.31017 | -2.62308 | 46.91974 | 0.4271 |
| | Region 3 | -0.48025 | 1.362242 | -0.35 | 0.724 | -3.15052 | 2.190025 | 50.90611 | 0.427 |
| | Region 4 | 7.243908 | 1.704822 | 4.25 | 0 | 3.902108 | 10.58571 | 58.63027 | 0.0007 |
| incomeqn | Low | 0.433228 | 0.806866 | 0.54 | 0.591 | -1.1484 | 2.014851 | 51.81959 | 0.2616 |
| age_groupn | 40-49 | -2.75159 | 1.896045 | -1.45 | 0.147 | -6.46823 | 0.965043 | 48.63477 | 0.8687 |
| | 50-69 | -1.48077 | 1.742776 | -0.85 | 0.396 | -4.89697 | 1.935429 | 49.90559 | 0.6624 |
| | 70+ | -0.92921 | 1.847533 | -0.5 | 0.615 | -4.55076 | 2.692331 | 50.45715 | 0.4982 |
| detn | Yes | -9.80668 | 0.883767 | -11.1 | 0 | -11.539 | -8.07432 | 41.57968 | 0.0045 |
| csize2 | Urban | 2.484267 | 1.270398 | 1.96 | 0.051 | -0.00597 | 4.974507 | 53.87063 | 0.0034 |
| _cons | | 51.38636 | 2.534833 | 20.27 | 0 | 46.41757 | 56.35515 | | 0.3465 |