Data Analysis in Stata using IPEDS dataset
Instructions:
The purpose of this assignment is to continue to develop your skills using the essential commands in Stata, statistical tests, and interpreting Stata output. Create a DoFile to track the commands you use.
Dataset:
This assignment uses data from the IPEDS dataset and includes data for 104 California Community Colleges during the 201415 academic year.
 Identify the number of observations and variables in the dataset.
 dta includes the same data as Assign2_data.dta. This dataset includes the variable locale, where each institution is coded as rural, suburban, or urban. The dataset also includes a new variable called success_center. This variable is coded as 1 if the institution has a center dedicated to student success and 0 if the institution does not (Note: success_center is a fictitious variable).
 You are interested in understanding what factors predict and explain the variance in associate’s degree completion among California community colleges. Because an institution’s degree “production” is likely relative to the number of students enrolled (that is, colleges that enroll large numbers of students will likely produce large numbers of degrees), you first need to create a new dependent variable that adjusts for this. Create a new variable called deg_enroll that represents the ratio of associate’s degree conferred (associate_deg) to the total number of students enrolled (total_enroll). This calculation should result in a value that ranges from 0 to 1 for each community college. Summarize deg_enroll and describe the distribution of the variable.
 Based on the analysis of the distribution of deg_enroll, you should identify four observations that are outliers (two with a value of 0, and two with extremely high values). For all subsequent questions on this assignment, exclude these four observations from your analysis. That is, your analysis for all subsequent questions should only have 100 observations, rather than 104.
 Next, you want to see which independent variables relate to your dependent variable. First, create a Pearson’s correlation matrix among the following six independent variables that you intend to include in your regression model: academic_per_fte, stuservice_exp_fte, ft_retention,res_allocation, locale, and success_center. Identify which variables, if any, are correlated and indicate the significance and r value for variables that are correlated. You do not need to interpret each r value, just list those binary relationships that are statistically significant and list their r
 Conduct a regression to determine if the following five variables (academic_per_fte, stuservice_exp_fte, ft_retention, res_allocation, locale) predict colleges’ degree production (deg_enroll). Interpret the output of the regression analysis by: a) describing if the overall model is significant; b) interpreting the rsquared value; c) determining if the regression coefficient for the independent variables are significant; and d) interpreting the regression coefficients for the significant independent variables.
 Note: When you interpret the regression coefficient for a binary variable (i.e., res_allocation), the category that is coded as 0 is the reference category and only the coefficient for the category coded as 1 is reported in the Stata output. Refer to Pages 173176 of the Longest book for a reminder of how to interpret binary/dichotomous variables (also referred to as dummy variables).
 NOTE: When you add the locale variable to the Stata regression code, rather than typing “locale”, type “locale”. This syntax tells Stata that this variable is a categorical variable with more than two categories and to use the category coded as 3 as the reference group. This means that the Stata output will display coefficients for the categories coded as 1 and 2, and you interpret these coefficients relative to the category coded as 3.
 After assessing this model, you decide to add one additional independent variable to the regression model you produced in Question 6: success_center. Conduct a new regression analysis. Interpret the output of the regression analysis by: a) describing if the overall model is significant; b) interpreting the rsquared value; c) interpreting how much variance in the outcome is explained by the success_center variable; d) determining if the regression coefficient for the independent variables are significant; and e) interpreting the regression coefficients for the significant independent variables.
 Print your entire output to a PDF by typing: translate @Results assignment3.pdf
 Upload the PDF, the Assignment in a Word document, and the DoFile to Canvas.
Solution
Quantitative Research Methods
 Identify the number of observations and variables in the dataset.
describe
Contains data from C:\Users\Joaquin\Desktop\Stata 17112017\Assign3_data.dta
obs: 104
vars: 16 14 Nov 2017 09:43
size: 9,464
——————————————————————————————–
storage display value
variable name type format label variable label
——————————————————————————————–
id long %12.0g Institution ID
name str53 %53s Institution Name
state str2 %9s
urbanization byte %8.0g Degree of ubanization
associate_degint %8.0g Number of Associate’s Degrees Conferred in 201415
ft_retention byte %8.0g FullTime Student Retention Rate
total_enroll long %12.0g Total Fall 2014 Enrollment
ft_enrollint %8.0g FullTime Enrollment in Fall 2014
pt_enrollint %8.0g Part Time Enrollment in Fall 2014
inst_exp_fteint %8.0g Instruction Expenses Per FTE
pubservice_ex~eint %8.0g Public Service Expenses per FTE
stuservice_ex~eint %8.0g Student Service Expenses per FTE
academic_per_~eint %8.0g Academic Support Expenses per FTE
res_allocation float %9.0g Resource Allocation Policy
locale float %9.0g loclabelUbanization
success_center float %9.0g Presence of success center
——————————————————————————————–
Sorted by:
There are 104 observations in 16 variables.
 dta includes the same data as Assign2_data.dta. This dataset includes the variable locale, where each institution is coded as rural, suburban, or urban. The dataset also includes a new variable called success_center. This variable is coded as 1 if the institution has a center dedicated to student success and 0 if the institution does not (Note: success_center is a fictitious variable).
 You are interested in understanding what factors predict and explain the variance in associate’s degree completion among California community colleges. Because an institution’s degree “production” is likely relative to the number of students enrolled, you first need to create a new dependent variable that adjusts for this. Create a new variable called deg_enroll that represents the ratio of associate’s degree conferred (associate_deg) to the total number of students enrolled (total_enroll). This calculation should result in a value that ranges from 0 to 1 for each community college. Summarize deg_enroll and describe the distribution of the variable.
generatedeg_enroll = associate_deg / total_enroll
summarizedeg_enroll, detail
deg_enroll
————————————————————
Percentiles Smallest
1% 0 0
5% .0297899 0
10% .0404624 .024646 Obs 104
25% .05415 .0283109 Sum of Wgt. 104
50% .0624418 Mean .0641819
Largest Std. Dev. .0210354
75% .0777014 .0984634
90% .0901145 .1046278 Variance .0004425
95% .0963641 .1161496 Skewness .0910757
99% .1161496 .1291017 Kurtosis 4.145744
The ratio of associate’s degree conferred to the total number of students enrolled has an average of 0.0641819. This means that, on average, they have conferred the associate degree to 6.4 percent of students enrolled in California community colleges. The distribution is slightly negative asymmetric, so there is a higher concentration of data in the right part of the series; that is, in values greater than the median, as can be seen in the histogram of this variable, which is presented in Figure 2. Although the distribution is slightly asymmetric, the arithmetic mean is very similar to the median and it is observed that the kurtosis is greater than 3, which means that there is a higher concentration of data around the arithmetic mean. With respect to the percentiles of the distribution, it is observed that 25 percent of the community colleges have an associate’s degree conferred ratio less than or equal to 0.05415. In addition, 75 percent of community colleges have an associate’s degree conferred ratio less than or equal to 0.0963641.
 Based on the analysis of the distribution of deg_enroll, you should identify four observations that are outliers (two with a value of 0, and two with extremely high values). For all subsequent questions on this assignment, exclude these four observations from your analysis. That is, your analysis for all subsequent questions should only have 100 observations, rather than 104.
In the box plot shown in Figure 1 it shows that the distribution is moderately asymmetric. The median of the data is closer to the first quartile. In addition, outliers have indeed been identified. Only three points are observed although we know that there are four observations, since in two cases the value 0 is presented in the variable deg_enroll.
Figure 1.Box plot of the variable deg_enroll.
lvdeg_enroll
# 104 deg_enroll
———————————
M 52.5  .0624418  spread pseudosigma
F 26.5  .05415 .0659257 .0777014  .0235514 .0175239
E 13.5  .0424345 .0656624 .0888903  .0464558 .0202944
D 7  .031401 .0626133 .0938257  .0624247 .020496
C 4  .0283109 .0633872 .0984634  .0701525 .0193786
B 2.5  .012323 .0613558 .1103887  .0980657 .0240574
A 1.5  0 .0613128 .1226256  .1226256 .026843
1  0 .0645508 .1291017  .1291017 .0260788
 
  # below # above
inner fence  .0188229 .1130284  2 2
outer fence  .0165042 .1483555  0 0
The lv command has allowed to identify the extreme values. The row in which it is indicated that there are four “mild outliers” is highlighted in yellow. These values are outside the inner fence but not outside the outer fence.
generate outlier = 0
replace outlier = 1 if deg_enroll<= 0.0188229  deg_enroll>= 0.1130284
A variable called outlier was created that is equal to one if the value of deg_enroll is less than or equal to 0.0188229 or if the value of deg_enroll is greater than or equal to 0.1130284. This variable will be used to filter the data for the following questions.
Figure 2.Histogram of the variable deg_enroll.
 Next, you want to see which independent variables relate to your dependent variable. First, create a Pearson’s correlation matrix among the following six independent variables that you intend to include in your regression model: academic_per_fte, stuservice_exp_fte, ft_retention, res_allocation, locale, and success_center. Identify which variables, if any, are correlated and indicate the significance and r value for variables that are correlated. You do not need to interpret each r value, just list those binary relationships that are statistically significant and list their r
pwcorrdeg_enrollacademic_per_ftestuservice_exp_fteft_retentionres_allocation locale success_center if outlier == 0, sig
 deg_en~lacadem~estuser~eft_ret~nres_al~n locale succes~r
————+—————————————————————
deg_enroll  1.0000


academic_p~e  0.0648 1.0000
 0.5220

stuservice~e  0.0306 0.1987 1.0000
 0.7628 0.0475

ft_retention  0.1340 0.0008 0.2783 1.0000
 0.1837 0.9937 0.0051

res_alloca~n 0.0310 0.2403 0.3746 0.1659 1.0000
 0.7594 0.0160 0.0001 0.0990

locale  0.3093 0.1069 0.1919 0.3387 0.0794 1.0000
 0.0017 0.2897 0.0558 0.0006 0.4324

success_ce~r  0.3511 0.1340 0.2025 0.1744 0.0210 0.2341 1.0000
 0.0003 0.1838 0.0433 0.0827 0.8360 0.0191
Based upon the ttest results, there is a linear relationship in the population between:
 deg_enroll and locale: r = 0.3093
 deg_enroll and success_center: r = 0.3511
 academic_per_fte and stuservice_exp_fte: r = 0.1987
 academic_per_fte and res_allocation: r = 0.2403
 stuservice_exp_fte and ft_retention: r = 0.2783
 stuservice_exp_fte and res_allocation: r = 0.3746
 stuservice_exp_fte and success_center: r = 0.2025
 ft_retention and locale: r = 0.3387
 locale and success_center: r = 0.2341
Conduct a regression to determine if the following five variables (academic_per_fte, stuservice_exp_fte, ft_retention, res_allocation, locale) predict colleges’ degree production (deg_enroll). Interpret the output of the regression analysis by: a) describing if the overall model is significant; b) interpreting the rsquared value; c) determining if the regression coefficient for the independent variables are significant; and d) interpreting the regression coefficients for the significant independent variables.

 Note: When you interpret the regression coefficient for a binary variable (i.e., res_allocation), the category that is coded as 0 is the reference category and only the coefficient for the category coded as 1 is reported in the Stata output. Refer to Pages 173176 of the Longest book for a reminder of how to interpret binary/dichotomous variables (also referred to as dummy variables).
 NOTE: When you add the locale variable to the Stata regression code, rather than typing “locale”, type “locale”. This syntax tells Stata that this variable is a categorical variable with more than two categories and to use the category coded as 3 as the reference group. This means that the Stata output will display coefficients for the categories coded as 1 and 2, and you interpret these coefficients relative to the category coded as 3.
regressdeg_enrollacademic_per_ftestuservice_exp_fteft_retentionres_allocation ib3.locale if outlier == 0
Source  SS df MS Number of obs = 100
————+——————————— F(6, 93) = 3.20
Model  .005200357 6 .000866726 Prob> F = 0.0067
Residual  .025220643 93 .00027119 Rsquared = 0.1709
————+——————————— Adj Rsquared = 0.1175
Total  .030420999 99 .000307283 Root MSE = .01647
————————————————————————————
deg_enroll  Coef. Std. Err. t P>t [95% Conf. Interval]
——————+—————————————————————
academic_per_fte  7.82e07 2.87e06 0.27 0.786 4.92e06 6.48e06
stuservice_exp_fte  1.61e06 3.77e06 0.43 0.671 9.09e06 5.88e06
ft_retention  .0006757 .0002535 2.67 0.009 .0001723 .0011792
res_allocation .0001321 .0038217 0.03 0.973 .0077213 .0074572

locale 
Rural  .0214497 .0053498 4.01 0.000 .0108262 .0320733
Suburban  .0069706 .0036337 1.92 0.058 .0002452 .0141864

_cons  .0129716 .0191434 0.68 0.500 .0250434 .0509866
————————————————————————————
 It is observed that Prob>F = 0.0067 < 0.05 and then the overall model is statistically significant.
 The Rsquared = 0.1709. The 17.09% of the variation in colleges’ degree production (deg_enroll) is accounted for by the variables included in the regression model (academic_per_fte, stuservice_exp_fte, ft_retention, res_allocationand locale).
 To analyze if the regression coefficient for the independent variables are significant, it is observed that the pvalue is less than 0.05 only for the variables ft_retention and Rural. Both rows were highlighted in yellow. The beta coefficient is significantly different from 0 only for variables ft_retention and Rural.
 When FullTime Student Retention Rate (ft_retention) increases on a oneunit, the colleges’ degree production (deg_enroll) increases in .0006757. The predicted mean of the colleges’ degree production (deg_enroll) are approximately .0214497 higher in the Rural than in the Urban, when controlling for Academic Support Expenses per FTE (academic_per_fte), Student Service Expenses per FTE (stuservice_exp_fte), FullTime Student Retention Rate (ft_retention) and Resource Allocation Policy (res_allocation).
 After assessing this model, you decide to add one additional independent variable to the regression model you produced in Question 6: success_center. Conduct a new regression analysis. Interpret the output of the regression analysis by: a) describing if the overall model is significant; b) interpreting the rsquared value; c) interpreting how much variance in the outcome is explained by the success_center variable; d) determining if the regression coefficient for the independent variables are significant; and e) interpreting the regression coefficients for the significant independent variables.
regressdeg_enrollacademic_per_ftestuservice_exp_fteft_retentionres_allocation ib3.locale success_center if outlier == 0
Source  SS df MS Number of obs = 100
————+——————————— F(7, 92) = 5.29
Model  .008727823 7 .001246832 Prob> F = 0.0000
Residual  .021693176 92 .000235795 Rsquared = 0.2869
————+——————————— Adj Rsquared = 0.2326
Total  .030420999 99 .000307283 Root MSE = .01536
————————————————————————————
deg_enroll  Coef. Std. Err. t P>t [95% Conf. Interval]
——————+—————————————————————
academic_per_fte  4.38e07 2.70e06 0.16 0.871 5.79e06 4.92e06
stuservice_exp_fte  3.81e06 3.56e06 1.07 0.287 .0000109 3.26e06
ft_retention  .0008009 .0002386 3.36 0.001 .000327 .0012747
res_allocation  .0013128 .0035832 0.37 0.715 .0058036 .0084293

locale 
Rural  .0197394 .005008 3.94 0.000 .0097931 .0296858
Suburban  .0039636 .0034763 1.14 0.257 .0029407 .0108679

success_center  .0126712 .0032761 3.87 0.000 .0061646 .0191777
_cons  .0034944 .0180179 0.19 0.847 .0322906 .0392795
————————————————————————————
 It is observed that pvalue (Prob> F) < 0.001 and then the overall model is statistically significant.
 The Rsquared = 0.2869. The 28.69% of the variation in colleges’ degree production (deg_enroll) is accounted for by the variables included in the regression model (academic_per_fte, stuservice_exp_fte, ft_retention, res_allocation, locale andsuccess_center).
 The coefficient of determination increased by 0.116 with the incorporation of variable success_center in the model. Therefore, 11.6% of the variation in colleges’ degree production (deg_enroll) is explained by the success_center variable. The variable success_centerimproves the prediction of colleges’ degree production.
 To analyze if the regression coefficient for the independent variables are significant, it is observed that the pvalue is less than 0.05 only for the variables ft_retention,Rural and success_center. Those rows were highlighted in yellow. The beta coefficient is significantly different from 0 only for variables ft_retention,Rural and success_center.
 When FullTime Student Retention Rate (ft_retention) increases on a oneunit, the colleges’ degree production (deg_enroll) increases in 0 .0008009. The predicted mean of the colleges’ degree production (deg_enroll) are approximately 0 .0197394 higher in the Rural than in the Urban, when controlling for Academic Support Expenses per FTE (academic_per_fte), Student Service Expenses per FTE (stuservice_exp_fte), FullTime Student Retention Rate (ft_retention), Resource Allocation Policy (res_allocation) and Presence of success center (success_center). The predicted mean of the colleges’ degree production (deg_enroll) are approximately 0.0126712 higher in the institutions that have a center dedicated to student success than the institutions that does not have that ones when controlling for Academic Support Expenses per FTE (academic_per_fte), Student Service Expenses per FTE (stuservice_exp_fte), FullTime Student Retention Rate (ft_retention), Resource Allocation Policy (res_allocation).
 Print your entire output to a PDF by typing: translate @Results assignment3.pdf
 Upload the PDF, the Assignment in a Word document, and the DoFile to Canvas.