# Data Analysis in Stata using IPEDS dataset

Instructions:

The purpose of this assignment is to continue to develop your skills using the essential commands in Stata, statistical tests, and interpreting Stata output. Create a Do-File to track the commands you use.

Dataset:

This assignment uses data from the IPEDS dataset and includes data for 104 California Community Colleges during the 2014-15 academic year.

1. Identify the number of observations and variables in the dataset.
2. dta includes the same data as Assign2_data.dta. This dataset includes the variable locale, where each institution is coded as rural, suburban, or urban. The dataset also includes a new variable called success_center. This variable is coded as 1 if the institution has a center dedicated to student success and 0 if the institution does not (Note: success_center is a fictitious variable).
3. You are interested in understanding what factors predict and explain the variance in associate’s degree completion among California community colleges. Because an institution’s degree “production” is likely relative to the number of students enrolled (that is, colleges that enroll large numbers of students will likely produce large numbers of degrees), you first need to create a new dependent variable that adjusts for this. Create a new variable called deg_enroll that represents the ratio of associate’s degree conferred (associate_deg) to the total number of students enrolled (total_enroll). This calculation should result in a value that ranges from 0 to 1 for each community college. Summarize deg_enroll and describe the distribution of the variable.
4. Based on the analysis of the distribution of deg_enroll, you should identify four observations that are outliers (two with a value of 0, and two with extremely high values). For all subsequent questions on this assignment, exclude these four observations from your analysis. That is, your analysis for all subsequent questions should only have 100 observations, rather than 104.
5. Next, you want to see which independent variables relate to your dependent variable. First, create a Pearson’s correlation matrix among the following six independent variables that you intend to include in your regression model: academic_per_fte, stuservice_exp_fte, ft_retention,res_allocation, locale, and success_center. Identify which variables, if any, are correlated and indicate the significance and r value for variables that are correlated. You do not need to interpret each r value, just list those binary relationships that are statistically significant and list their r
6. Conduct a regression to determine if the following five variables (academic_per_fte, stuservice_exp_fte, ft_retention, res_allocation, locale) predict colleges’ degree production (deg_enroll). Interpret the output of the regression analysis by: a) describing if the overall model is significant; b) interpreting the r-squared value; c) determining if the regression coefficient for the independent variables are significant; and d) interpreting the regression coefficients for the significant independent variables.
1. Note: When you interpret the regression coefficient for a binary variable (i.e., res_allocation), the category that is coded as 0 is the reference category and only the coefficient for the category coded as 1 is reported in the Stata output. Refer to Pages 173-176 of the Longest book for a reminder of how to interpret binary/dichotomous variables (also referred to as dummy variables).
2. NOTE: When you add the locale variable to the Stata regression code, rather than typing “locale”, type “locale”. This syntax tells Stata that this variable is a categorical variable with more than two categories and to use the category coded as 3 as the reference group. This means that the Stata output will display coefficients for the categories coded as 1 and 2, and you interpret these coefficients relative to the category coded as 3.
7. After assessing this model, you decide to add one additional independent variable to the regression model you produced in Question 6: success_center. Conduct a new regression analysis. Interpret the output of the regression analysis by: a) describing if the overall model is significant; b) interpreting the r-squared value; c) interpreting how much variance in the outcome is explained by the success_center variable; d) determining if the regression coefficient for the independent variables are significant; and e) interpreting the regression coefficients for the significant independent variables.
8. Print your entire output to a PDF by typing: translate @Results assignment3.pdf
9. Upload the PDF, the Assignment in a Word document, and the Do-File to Canvas.

Solution

Quantitative Research Methods

1. Identify the number of observations and variables in the dataset.

describe

Contains data from C:\Users\Joaquin\Desktop\Stata 17-11-2017\Assign3_data.dta

obs:           104

vars:            16                          14 Nov 2017 09:43

size:         9,464

——————————————————————————————–

storage   display    value

variable name   type    format     label      variable label

——————————————————————————————–

id              long    %12.0g                Institution ID

name            str53   %53s                  Institution Name

state           str2    %9s

urbanization    byte    %8.0g                 Degree of ubanization

associate_degint     %8.0g                 Number of Associate’s Degrees Conferred in 2014-15

ft_retention    byte    %8.0g                 Full-Time Student Retention Rate

total_enroll    long    %12.0g                Total Fall 2014 Enrollment

ft_enrollint     %8.0g                 Full-Time Enrollment in Fall 2014

pt_enrollint     %8.0g                 Part Time Enrollment in Fall 2014

inst_exp_fteint     %8.0g                 Instruction Expenses Per FTE

pubservice_ex~eint     %8.0g                 Public Service Expenses per FTE

stuservice_ex~eint     %8.0g                 Student Service Expenses per FTE

res_allocation  float   %9.0g                 Resource Allocation Policy

locale          float   %9.0g      loclabelUbanization

success_center  float   %9.0g                 Presence of success center

——————————————————————————————–

Sorted by:

There are 104 observations in 16 variables.

1. dta includes the same data as Assign2_data.dta. This dataset includes the variable locale, where each institution is coded as rural, suburban, or urban. The dataset also includes a new variable called success_center. This variable is coded as 1 if the institution has a center dedicated to student success and 0 if the institution does not (Note: success_center is a fictitious variable).
2. You are interested in understanding what factors predict and explain the variance in associate’s degree completion among California community colleges. Because an institution’s degree “production” is likely relative to the number of students enrolled, you first need to create a new dependent variable that adjusts for this. Create a new variable called deg_enroll that represents the ratio of associate’s degree conferred (associate_deg) to the total number of students enrolled (total_enroll). This calculation should result in a value that ranges from 0 to 1 for each community college. Summarize deg_enroll and describe the distribution of the variable.

generatedeg_enroll = associate_deg / total_enroll

summarizedeg_enroll, detail

deg_enroll

————————————————————-

Percentiles      Smallest

1%            0              0

5%     .0297899              0

10%     .0404624        .024646       Obs                 104

25%       .05415       .0283109       Sum of Wgt.         104

50%     .0624418                      Mean           .0641819

Largest       Std. Dev.      .0210354

75%     .0777014       .0984634

90%     .0901145       .1046278       Variance       .0004425

95%     .0963641       .1161496       Skewness      -.0910757

99%     .1161496       .1291017       Kurtosis       4.145744

The ratio of associate’s degree conferred to the total number of students enrolled has an average of 0.0641819. This means that, on average, they have conferred the associate degree to 6.4 percent of students enrolled in California community colleges. The distribution is slightly negative asymmetric, so there is a higher concentration of data in the right part of the series; that is, in values greater than the median, as can be seen in the histogram of this variable, which is presented in Figure 2. Although the distribution is slightly asymmetric, the arithmetic mean is very similar to the median and it is observed that the kurtosis is greater than 3, which means that there is a higher concentration of data around the arithmetic mean. With respect to the percentiles of the distribution, it is observed that 25 percent of the community colleges have an associate’s degree conferred ratio less than or equal to 0.05415. In addition, 75 percent of community colleges have an associate’s degree conferred ratio less than or equal to 0.0963641.

1. Based on the analysis of the distribution of deg_enroll, you should identify four observations that are outliers (two with a value of 0, and two with extremely high values). For all subsequent questions on this assignment, exclude these four observations from your analysis. That is, your analysis for all subsequent questions should only have 100 observations, rather than 104.

In the box plot shown in Figure 1 it shows that the distribution is moderately asymmetric. The median of the data is closer to the first quartile. In addition, outliers have indeed been identified. Only three points are observed although we know that there are four observations, since in two cases the value 0 is presented in the variable deg_enroll. Figure 1.Box plot of the variable deg_enroll.

lvdeg_enroll

#    104                deg_enroll

———————————

M     52.5 |             .0624418            |    spread  pseudosigma

F     26.5 |    .05415   .0659257   .0777014 |  .0235514    .0175239

E     13.5 |  .0424345   .0656624   .0888903 |  .0464558    .0202944

D      7   |   .031401   .0626133   .0938257 |  .0624247     .020496

C      4   |  .0283109   .0633872   .0984634 |  .0701525    .0193786

B      2.5 |   .012323   .0613558   .1103887 |  .0980657    .0240574

A      1.5 |         0   .0613128   .1226256 |  .1226256     .026843

1   |         0   .0645508   .1291017 |  .1291017    .0260788

|                                 |

|                                 |   # below     # above

inner fence |  .0188229              .1130284 |         2           2

outer fence | -.0165042              .1483555 |         0           0

The lv command has allowed to identify the extreme values. The row in which it is indicated that there are four “mild outliers” is highlighted in yellow. These values are outside the inner fence but not outside the outer fence.

generate outlier = 0

replace outlier = 1 if deg_enroll<= 0.0188229 | deg_enroll>= 0.1130284

A variable called outlier was created that is equal to one if the value of deg_enroll is less than or equal to 0.0188229 or if the value of deg_enroll is greater than or equal to 0.1130284. This variable will be used to filter the data for the following questions. Figure 2.Histogram of the variable deg_enroll.

1. Next, you want to see which independent variables relate to your dependent variable. First, create a Pearson’s correlation matrix among the following six independent variables that you intend to include in your regression model: academic_per_fte, stuservice_exp_fte, ft_retention, res_allocation, locale, and success_center. Identify which variables, if any, are correlated and indicate the significance and r value for variables that are correlated. You do not need to interpret each r value, just list those binary relationships that are statistically significant and list their r

pwcorrdeg_enrollacademic_per_ftestuservice_exp_fteft_retentionres_allocation locale success_center if outlier == 0, sig

————-+—————————————————————

deg_enroll |   1.0000

|

|

|   0.5220

|

stuservice~e |  -0.0306   0.1987   1.0000

|   0.7628   0.0475

|

ft_retention |   0.1340  -0.0008  -0.2783   1.0000

|   0.1837   0.9937   0.0051

|

res_alloca~n|  -0.0310   0.2403   0.3746  -0.1659   1.0000

|   0.7594   0.0160   0.0001   0.0990

|

locale |  -0.3093  -0.1069  -0.1919   0.3387  -0.0794   1.0000

|   0.0017   0.2897   0.0558   0.0006   0.4324

|

success_ce~r |   0.3511   0.1340   0.2025  -0.1744   0.0210  -0.2341   1.0000

|   0.0003   0.1838   0.0433   0.0827   0.8360   0.0191

Based upon the t-test results, there is a linear relationship in the population between:

1. deg_enroll and locale: r = -0.3093
2. deg_enroll and success_center: r = 0.3511
3. academic_per_fte and stuservice_exp_fte: r = 0.1987
4. academic_per_fte and res_allocation: r = 0.2403
5. stuservice_exp_fte and ft_retention: r = -0.2783
6. stuservice_exp_fte and res_allocation: r = 0.3746
7. stuservice_exp_fte and success_center: r = 0.2025
8. ft_retention and locale: r = 0.3387
9. locale and success_center: r = -0.2341

Conduct a regression to determine if the following five variables (academic_per_fte, stuservice_exp_fte, ft_retention, res_allocation, locale) predict colleges’ degree production (deg_enroll). Interpret the output of the regression analysis by: a) describing if the overall model is significant; b) interpreting the r-squared value; c) determining if the regression coefficient for the independent variables are significant; and d) interpreting the regression coefficients for the significant independent variables.

1. Note: When you interpret the regression coefficient for a binary variable (i.e., res_allocation), the category that is coded as 0 is the reference category and only the coefficient for the category coded as 1 is reported in the Stata output. Refer to Pages 173-176 of the Longest book for a reminder of how to interpret binary/dichotomous variables (also referred to as dummy variables).
2. NOTE: When you add the locale variable to the Stata regression code, rather than typing “locale”, type “locale”. This syntax tells Stata that this variable is a categorical variable with more than two categories and to use the category coded as 3 as the reference group. This means that the Stata output will display coefficients for the categories coded as 1 and 2, and you interpret these coefficients relative to the category coded as 3.

regressdeg_enrollacademic_per_ftestuservice_exp_fteft_retentionres_allocation ib3.locale if outlier == 0

Source |       SS           df       MS      Number of obs   =       100

————-+———————————-   F(6, 93)        =      3.20

Model |  .005200357         6  .000866726   Prob> F        =    0.0067

Residual |  .025220643        93   .00027119   R-squared       =    0.1709

Total |  .030420999        99  .000307283   Root MSE        =    .01647

————————————————————————————

deg_enroll |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

——————-+—————————————————————-

academic_per_fte |   7.82e-07   2.87e-06     0.27   0.786    -4.92e-06    6.48e-06

stuservice_exp_fte |  -1.61e-06   3.77e-06    -0.43   0.671    -9.09e-06    5.88e-06

ft_retention |   .0006757   .0002535     2.67   0.009     .0001723    .0011792

res_allocation|  -.0001321   .0038217    -0.03   0.973    -.0077213    .0074572

|

locale |

Rural  |   .0214497   .0053498     4.01   0.000     .0108262    .0320733

Suburban  |   .0069706   .0036337     1.92   0.058    -.0002452    .0141864

|

_cons |   .0129716   .0191434     0.68   0.500    -.0250434    .0509866

————————————————————————————

1. It is observed that Prob>F = 0.0067 < 0.05 and then the overall model is statistically significant.
2. The R-squared = 0.1709. The 17.09% of the variation in colleges’ degree production (deg_enroll) is accounted for by the variables included in the regression model (academic_per_fte, stuservice_exp_fte, ft_retention, res_allocationand locale).
3. To analyze if the regression coefficient for the independent variables are significant, it is observed that the p-value is less than 0.05 only for the variables ft_retention and Rural. Both rows were highlighted in yellow. The beta coefficient is significantly different from 0 only for variables ft_retention and Rural.
4. When Full-Time Student Retention Rate (ft_retention) increases on a one-unit, the colleges’ degree production (deg_enroll) increases in .0006757. The predicted mean of the colleges’ degree production (deg_enroll) are approximately .0214497 higher in the Rural than in the Urban, when controlling for Academic Support Expenses per FTE (academic_per_fte), Student Service Expenses per FTE (stuservice_exp_fte), Full-Time Student Retention Rate (ft_retention) and Resource Allocation Policy (res_allocation).
1. After assessing this model, you decide to add one additional independent variable to the regression model you produced in Question 6: success_center. Conduct a new regression analysis. Interpret the output of the regression analysis by: a) describing if the overall model is significant; b) interpreting the r-squared value; c) interpreting how much variance in the outcome is explained by the success_center variable; d) determining if the regression coefficient for the independent variables are significant; and e) interpreting the regression coefficients for the significant independent variables.

regressdeg_enrollacademic_per_ftestuservice_exp_fteft_retentionres_allocation ib3.locale  success_center if outlier == 0

Source |       SS           df       MS      Number of obs   =       100

————-+———————————-   F(7, 92)        =      5.29

Model |  .008727823         7  .001246832   Prob> F        =    0.0000

Residual |  .021693176        92  .000235795   R-squared       =    0.2869

Total |  .030420999        99  .000307283   Root MSE        =    .01536

————————————————————————————

deg_enroll |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]

——————-+—————————————————————-

academic_per_fte |  -4.38e-07   2.70e-06    -0.16   0.871    -5.79e-06    4.92e-06

stuservice_exp_fte |  -3.81e-06   3.56e-06    -1.07   0.287    -.0000109    3.26e-06

ft_retention |   .0008009   .0002386     3.36   0.001      .000327    .0012747

res_allocation |   .0013128   .0035832     0.37   0.715    -.0058036    .0084293

|

locale |

Rural  |   .0197394    .005008     3.94   0.000     .0097931    .0296858

Suburban  |   .0039636   .0034763     1.14   0.257    -.0029407    .0108679

|

success_center |   .0126712   .0032761     3.87   0.000     .0061646    .0191777

_cons |   .0034944   .0180179     0.19   0.847    -.0322906    .0392795

————————————————————————————

1. It is observed that p-value (Prob> F) < 0.001 and then the overall model is statistically significant.
2. The R-squared = 0.2869. The 28.69% of the variation in colleges’ degree production (deg_enroll) is accounted for by the variables included in the regression model (academic_per_fte, stuservice_exp_fte, ft_retention, res_allocation, locale andsuccess_center).
3. The coefficient of determination increased by 0.116 with the incorporation of variable success_center in the model. Therefore, 11.6% of the variation in colleges’ degree production (deg_enroll) is explained by the success_center variable. The variable success_centerimproves the prediction of colleges’ degree production.
4. To analyze if the regression coefficient for the independent variables are significant, it is observed that the p-value is less than 0.05 only for the variables ft_retention,Rural and success_center. Those rows were highlighted in yellow. The beta coefficient is significantly different from 0 only for variables ft_retention,Rural and success_center.
5. When Full-Time Student Retention Rate (ft_retention) increases on a one-unit, the colleges’ degree production (deg_enroll) increases in 0 .0008009. The predicted mean of the colleges’ degree production (deg_enroll) are approximately 0 .0197394 higher in the Rural than in the Urban, when controlling for Academic Support Expenses per FTE (academic_per_fte), Student Service Expenses per FTE (stuservice_exp_fte), Full-Time Student Retention Rate (ft_retention), Resource Allocation Policy (res_allocation) and Presence of success center (success_center). The predicted mean of the colleges’ degree production (deg_enroll) are approximately 0.0126712 higher in the institutions that have a center dedicated to student success than the institutions that does not have that ones when controlling for Academic Support Expenses per FTE (academic_per_fte), Student Service Expenses per FTE (stuservice_exp_fte), Full-Time Student Retention Rate (ft_retention), Resource Allocation Policy (res_allocation).
6. Print your entire output to a PDF by typing: translate @Results assignment3.pdf
7. Upload the PDF, the Assignment in a Word document, and the Do-File to Canvas.     