Logistic Regression Model R Homework

Descriptive statistics

The descriptive statistics of diagnostic interval by year show that the average diagnostic interval is highest in 2015 (M=36.87, SD=42.41, N=1,548), followed by 2010 (M=34.56, SD=41.47, N=1,353), 2016 (M=33.85, SD=41.94, N=1,630), 2013 (M=33.40, SD=38.17, N=1,422), 2012 (M=32.52, SD=36.74, N=1,330), and 2014 (M=32.35, SD=39.03, N=1,356), and is lowest in 2011 (M=31.87, SD=39.19, N=1,270). By region, region 4 (M=42.36, SD=48.00, N=622) has the highest average diagnostic interval, followed by region 1 (M=34.78, SD=40.64, N=4,445) and region 3 (M=33.19, SD=38.47, N=1,760); region 2 has the lowest (M=30.73, SD=37.88, N=3,082).
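The ordering of years and regions can be checked directly against the means in Table 1. The snippet below is a minimal sketch: the values are copied from Table 1, and the helper name `rank_by_mean` is only illustrative.

```python
# Mean diagnostic interval (days) by group, copied from Table 1.
year_means = {
    2010: 34.5558, 2011: 31.86535, 2012: 32.51579, 2013: 33.39803,
    2014: 32.35103, 2015: 36.8708, 2016: 33.85215,
}
region_means = {
    "Region 1": 34.78313, "Region 2": 30.72875,
    "Region 3": 33.19375, "Region 4": 42.35691,
}


def rank_by_mean(means):
    """Return the group labels ordered from highest to lowest mean."""
    return sorted(means, key=means.get, reverse=True)


year_order = rank_by_mean(year_means)
region_order = rank_by_mean(region_means)
print(year_order)
print(region_order)
```

Sorting by the unrounded means avoids ties or reversals that two-decimal rounding could introduce.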

Data research analysis

The regression model to be used is a multiple linear regression model. The dependent variable is diagnostic interval, and the independent variables of interest are year of diagnosis and health region. The control variables are age group, community size, cancer stage, neighborhood income, and detection method. All independent and control variables will be coded as dummy variables, and the diagnostic interval will then be regressed on them. Because every regressor is a dummy variable, the first level of each variable is taken as the base category to avoid perfect multicollinearity. The adjusted average diagnostic interval for each remaining level is estimated by adding the constant to that level's coefficient, and we test the null hypothesis that this adjusted average equals 49 days (7 weeks). If p<0.05, we conclude that the adjusted average differs from the 49-day target and, where it exceeds 49 days, that the guideline is not adhered to. The final hypothesized model is given as

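$$\text{diag\_int} = \beta_0 + \beta_{1\text{-}6}\,\text{Year dummies} + \beta_{7\text{-}9}\,\text{Region dummies} + \beta_{10}\,\text{csize dummy} + \beta_{11\text{-}14}\,\text{stage dummies} + \beta_{15}\,\text{Income dummy} + \beta_{16\text{-}18}\,\text{age\_group dummies} + \beta_{19}\,\text{det dummy} + \varepsilon$$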
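The dummy coding with the first level as base, described above, can be sketched as follows. This is an illustration only: the helper `dummy_code` and the three sample observations are hypothetical and are not taken from the data set.

```python
def dummy_code(values, levels):
    """Code a categorical variable as 0/1 indicators, dropping levels[0] as the base."""
    kept = levels[1:]  # the base level gets no column of its own
    return [[1 if v == lvl else 0 for lvl in kept] for v in values]


region_levels = ["Region 1", "Region 2", "Region 3", "Region 4"]
sample = ["Region 1", "Region 3", "Region 4"]  # hypothetical observations

rows = dummy_code(sample, region_levels)
for obs, row in zip(sample, rows):
    print(obs, row)
```

An observation in the base level (Region 1 here) has all indicators equal to 0, so its expected diagnostic interval is carried by the constant alone. This is why the adjusted average for any other level is the constant plus that level's coefficient.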

Data research output model

Study Objective: The objective of the study is to investigate whether the guideline of a 7-week target for diagnostic intervals is adhered to in every year and every region in Alberta.

Method: The data are simulated records of all first-ever primary breast cancers in women in Alberta. The data set consists of 9,909 observations and 10 variables: id, diagnostic interval, region, year, detection method, age, age group, cancer stage, community size, and neighborhood income. The method of analysis is a multiple linear regression model, estimated with STATA 14 software.

Result: The regression result is presented in Table 2. Our interest is in the last two columns, which give the adjusted average for each level other than the base and the p-value for the null hypothesis that this average equals 49. For the base levels, the adjusted average diagnostic interval is the constant (51.39), which is not significantly different from 49 (p=0.3465). For year, all p-values are greater than 0.05 except for 2015 (M=54.00, p=0.0469). For health region, all p-values are greater than 0.05 except for region 4 (M=58.63, p=0.0007). For the control variables, all p-values for income and age group are greater than 0.05, which means their adjusted averages are not significantly different from 49 days. However, for urban community size, the adjusted mean is significantly greater than 49 (M=53.87, p=0.0034). For cancer stage (stages 1 to 4) and screen detection, the adjusted average diagnostic interval is significantly less than 49 days.
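The two quantities in the last columns of Table 2 can be written out explicitly. The following is a sketch, assuming $\beta_0$ denotes the regression constant (`_cons`) and $\beta_j$ the coefficient on the dummy for level $j$. The covariance term is the standard form for testing a linear combination of coefficients; the source does not report it.

```latex
\begin{aligned}
\hat{\mu}_j &= \hat{\beta}_0 + \hat{\beta}_j, \\
H_0 &: \beta_0 + \beta_j = 49, \\
t_j &= \frac{\hat{\beta}_0 + \hat{\beta}_j - 49}
            {\sqrt{\widehat{\operatorname{Var}}(\hat{\beta}_0)
                 + \widehat{\operatorname{Var}}(\hat{\beta}_j)
                 + 2\,\widehat{\operatorname{Cov}}(\hat{\beta}_0, \hat{\beta}_j)}}.
\end{aligned}
```

For the base levels the dummy term drops out, so the test reduces to $H_0: \beta_0 = 49$, using only the constant and its standard error.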

Conclusion: Given the results above, we conclude that the guideline is adhered to in all years except 2015 and in all health regions except region 4. Income and age group do not affect whether the guideline is met, while community size, detection method, and cancer stage do.
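The decision rule behind this conclusion can be reproduced from Table 2 for a few selected levels. The sketch below copies the constant, coefficients, and reported p-values from Table 2; the dictionary layout and variable names are illustrative assumptions.

```python
CONS = 51.38636  # _cons from Table 2
TARGET = 49      # 7-week guideline, in days

# level: (coefficient, reported p-value for H0: adjusted estimate = 49), from Table 2
levels = {
    "year 2015": (2.609912, 0.0469),
    "Region 2": (-4.46662, 0.4271),
    "Region 4": (7.243908, 0.0007),
    "csize Urban": (2.484267, 0.0034),
    "detn Yes": (-9.80668, 0.0045),
}

# Adjusted estimate = constant + coefficient of the level's dummy.
adjusted = {name: CONS + coef for name, (coef, _) in levels.items()}
# Levels whose adjusted estimate differs from 49 at the 5% level.
flagged = [name for name, (_, p) in levels.items() if p < 0.05]

for name, value in adjusted.items():
    side = "above" if value > TARGET else "below"
    mark = "significant" if name in flagged else "not significant"
    print(f"{name}: {value:.2f} days ({side} 49, {mark} at 0.05)")
```

The direction matters as well as the p-value: 2015, region 4, and urban communities are significantly above 49 days, while screen-detected cases are significantly below it.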

Appendix

Table 1: Summary statistics of the diagnostic interval by independent variables

Variable	Levels	Obs	Mean	Std.Dev.	Min	Max
Year	2010	1,353	34.5558	41.47153	0	295
	2011	1,270	31.86535	39.19425	0	280
	2012	1,330	32.51579	36.73751	0	241
	2013	1,422	33.39803	38.17135	0	268
	2014	1,356	32.35103	39.02998	0	310
	2015	1,548	36.8708	42.4108	0	285
	2016	1,630	33.85215	41.93524	0	281
Region	Region 1	4,445	34.78313	40.64375	0	267
	Region 2	3,082	30.72875	37.88113	0	310
	Region 3	1,760	33.19375	38.46828	0	285
	Region 4	622	42.35691	48.00052	0	268
csize	Rural	2,071	33.27764	38.8659	0	268
csize	Urban	7,838	33.83082	40.33199	0	310
stage	0	1,334	44.43853	46.42146	0	285
	1	3,990	32.50752	39.29039	0	295
	2	3,014	31.05209	37.88881	0	310
	3	1,210	33.27603	39.17303	0	263
	4	361	31.14404	36.56252	0	225
Incomeq	High	4,095	33.26935	39.37119	0	295
Incomeq	Low	5,772	34.01421	40.50235	0	310
Age group	39-	603	37.94859	44.98094	0	310
	40-49	1,721	32.49448	38.4354	0	295
	50-69	5,451	33.23115	39.31082	0	285
	70+	2,134	34.73993	41.52299	0	285
Detection method	No	5,860	36.64693	42.89485	0	310
Detection method	Yes	4,049	29.47222	35.04632	0	267
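As a consistency check on Table 1, the level counts of each variable can be summed and compared with the 9,909 observations stated in the Method paragraph. The counts below are copied from the Obs column; the variable names in the code are illustrative.

```python
# Observation counts (Obs column) copied from Table 1.
counts = {
    "Year": [1353, 1270, 1330, 1422, 1356, 1548, 1630],
    "Region": [4445, 3082, 1760, 622],
    "csize": [2071, 7838],
    "stage": [1334, 3990, 3014, 1210, 361],
    "Incomeq": [4095, 5772],
    "Age group": [603, 1721, 5451, 2134],
    "Detection method": [5860, 4049],
}
TOTAL = 9909  # observations in the data set, per the Method paragraph

totals = {name: sum(obs) for name, obs in counts.items()}
# Variables whose level counts fall short of the full data set.
missing = {name: TOTAL - n for name, n in totals.items() if n != TOTAL}

print(totals)
print(missing)
```

Only the income variable falls short of the total, and its count equals the 9,867 observations reported in Table 2. A plausible reading, which the source does not state, is that cases with missing neighborhood income were dropped from the regression.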

Table 2: Regression results

Source	SS	df	MS	Number of obs	=	9,867
				F(19, 9847)	=	16.92
Model	499903.2	19	26310.69	Prob > F	=	0
Residual	15314479	9,847	1555.243	R-squared	=	0.0316
				Adj R-squared	=	0.0297
Total	15814382	9,866	1602.917	Root MSE	=	39.437
diag_int	Coef.	Std. Err.	t	P>t	[95% Conf.	Interval]	Adjusted estimate (_cons + Coef.)	p (H0: adjusted estimate = 49)
year
2011	-2.78388	1.544251	-1.8	0.071	-5.81092	0.243174	48.60249	0.8764
2012	-1.86133	1.52701	-1.22	0.223	-4.85458	1.131927	49.52503	0.8371
2013	-1.05957	1.501508	-0.71	0.48	-4.00284	1.883691	50.32679	0.6017
2014	-1.49114	1.520678	-0.98	0.327	-4.47198	1.489703	49.89522	0.7248
2015	2.609912	1.471206	1.77	0.076	-0.27395	5.493776	53.99627	0.0469
2016	-0.46243	1.456265	-0.32	0.751	-3.31701	2.392143	50.92393	0.439
stage
1	-12.6648	1.253695	-10.1	0	-15.1223	-10.2073	38.72153	<0.001
2	-16.7358	1.336492	-12.52	0	-19.3556	-14.116	34.65059	<0.001
3	-15.5823	1.616628	-9.64	0	-18.7512	-12.4134	35.80404	<0.001
4	-17.8391	2.384647	-7.48	0	-22.5135	-13.1647	33.5473	<0.001
rhan
Region 2	-4.46662	0.940487	-4.75	0	-6.31017	-2.62308	46.91974	0.4271
Region 3	-0.48025	1.362242	-0.35	0.724	-3.15052	2.190025	50.90611	0.427
Region 4	7.243908	1.704822	4.25	0	3.902108	10.58571	58.63027	0.0007
incomeqn
Low	0.433228	0.806866	0.54	0.591	-1.1484	2.014851	51.81959	0.2616
age_groupn
40-49	-2.75159	1.896045	-1.45	0.147	-6.46823	0.965043	48.63477	0.8687
50-69	-1.48077	1.742776	-0.85	0.396	-4.89697	1.935429	49.90559	0.6624
70+	-0.92921	1.847533	-0.5	0.615	-4.55076	2.692331	50.45715	0.4982
detn
Yes	-9.80668	0.883767	-11.1	0	-11.539	-8.07432	41.57968	0.0045
csize2
Urban	2.484267	1.270398	1.96	0.051	-0.00597	4.974507	53.87063	0.0034
_cons	51.38636	2.534833	20.27	0	46.41757	56.35515		0.3465
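The p-value in the last column of the `_cons` row can be approximated from the constant and its standard error alone. This is a rough sketch: it uses the standard normal distribution in place of the t distribution with 9,847 residual degrees of freedom, which makes little difference at this sample size.

```python
import math

cons, se = 51.38636, 2.534833  # _cons and its standard error, from Table 2
target = 49                    # guideline in days

# Test statistic for H0: constant = 49.
t_stat = (cons - target) / se
# Two-sided p-value under a standard normal (approximation to t with 9,847 df).
p_value = math.erfc(abs(t_stat) / math.sqrt(2))

print(f"t = {t_stat:.3f}, p = {p_value:.4f}")
```

The result should be close to the 0.3465 reported in Table 2. The same shortcut does not work for the other rows, because testing the constant plus a coefficient also requires the covariance between the two estimates, which the table does not show.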
