Interpret Regression

Solution 

In Stata, open the WAGE1.dta data. These are different data, with which we will examine the relationship between wage, education, experience and job tenure.
DON’T run, but examine the regression: $wage = \beta_0 + \beta_1 educ + \beta_2 tenure + \beta_3 tenure^2 + u$

  1. Use a scatter plot and LOESS fit line to explain your expectations for the signs of the coefficients on tenure and tenure².

The full regression does not have to be estimated to obtain a scatterplot that shows the relationship between wage and tenure (and tenure²) net of education. Instead, we can regress wage on educ and save the residuals (as wage_resids), and then regress tenure on educ and save the residuals (as tenure_resids). A scatterplot of these two residual variables shows the relationship between wage and tenure after the effects of educ are removed. The lowess curve on the scatterplot provides insight on whether the relationship between wage and tenure is linear or quadratic (i.e., on whether the tenure² term improves on the linear fit). If the scatterplot is nearly linear, we would expect a coefficient near zero on tenure². If it is u-shaped, we would expect a negative coefficient on the linear term and a positive coefficient on the squared term; if it is inverted u-shaped, we would expect a positive coefficient on the linear term and a negative coefficient on the squared term.

The Stata code to accomplish this and the graph follow:

quietly regress wage educ

predict wage_resids, residuals

quietly regress tenure educ

predict tenure_resids, residuals

label variable wage_resids "Residuals of wage on educ"

label variable tenure_resids "Residuals of tenure on educ"

lowess wage_resids tenure_resids, recast(scatter) mcolor(ltblue) msize(vsmall) mfcolor(ltblue) mlcolor(blue) mlwidth(vthin) lineopts(lwidth(medthick)) title("Lowess: Residuals of wage on educ by residuals of tenure on educ") legend(order(3 "Lowess: Residuals of wage on educ by residuals of tenure on educ"))

The part of the graph where the residuals of tenure on educ are less than 0 is u-shaped, which implies a negative coefficient on the linear term and a positive coefficient on the squared term. However, if the graph started where the residuals of tenure on educ are greater than 0, the regression would view this as an inverted u, with a positive estimated coefficient on tenure and a negative estimated coefficient on tenure². I would expect the inverted u to be the overall result.
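The residual-on-residual idea used above can be checked on made-up data. The following is a minimal sketch in R, assuming simulated variables that only borrow the names wage, educ and tenure (the numbers are hypothetical and are not the WAGE1 data). It shows that the slope from regressing the wage residuals on the tenure residuals equals the coefficient on tenure in the regression of wage on educ and tenure.

```r
# Simulated stand-in for WAGE1 (hypothetical values, not the real data)
set.seed(1)
n      <- 200
educ   <- rnorm(n, mean = 12, sd = 2)
tenure <- pmax(0, 0.5 * educ + rnorm(n, mean = 0, sd = 5))
wage   <- 1 + 0.5 * educ + 0.2 * tenure + rnorm(n)

# Remove the effect of educ from both wage and tenure
wage_resids   <- resid(lm(wage ~ educ))
tenure_resids <- resid(lm(tenure ~ educ))

# Slope of residuals on residuals
slope_partial <- unname(coef(lm(wage_resids ~ tenure_resids))["tenure_resids"])

# Coefficient on tenure in the regression that includes educ directly
slope_full <- unname(coef(lm(wage ~ educ + tenure))["tenure"])

# Lowess curve through the residual scatter (x first, then y)
smooth <- lowess(tenure_resids, wage_resids)

print(c(partial = slope_partial, full = slope_full))
```

The two slopes agree, which is why the shape of the residual scatter is informative about the tenure terms in the full regression.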

  1. Run and interpret the regression.

The Stata command to estimate the regression and its results follow:

regress wage educ tenure tenursq, cformat(%9.3f) pformat(%5.3f) sformat(%8.3f)

      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(  3,   522) =   80.10
       Model |  2257.25872     3  752.419574           Prob > F      =  0.0000
    Residual |  4903.15557   522  9.39301833           R-squared     =  0.3152
-------------+------------------------------           Adj R-squared =  0.3113
       Total |  7160.41429   525  13.6388844           Root MSE      =  3.0648

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |      0.562      0.048   11.609   0.000        0.467       0.657
      tenure |      0.330      0.048    6.917   0.000        0.236       0.424
     tenursq |     -0.006      0.002   -3.194   0.001       -0.009      -0.002
       _cons |     -2.420      0.638   -3.795   0.000       -3.672      -1.167
------------------------------------------------------------------------------

The regression model provides the best fit, in the sense of the smallest sum of squared residuals from the fitted line, of the variable wage on the variables educ, tenure and tenure². The R², 0.3152, implies that 31.52% of the overall variation in wage can be explained by variation in educ, tenure and tenure². Also, the F statistic, 80.10 with 3 and 522 degrees of freedom and associated p value < 0.0001, implies that we can reject the null hypothesis that all of the slope coefficients are equal to zero in favor of the alternative that at least one slope coefficient is different from zero.
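As a check on how the summary statistics relate to the sums of squares in the table, here is a short sketch in R that recomputes R² and the F statistic from the reported Model and Residual rows. The variable names are illustrative only.

```r
# Sums of squares and degrees of freedom copied from the Stata output
ss_model <- 2257.25872
ss_resid <- 4903.15557
df_model <- 3
df_resid <- 522

ss_total <- ss_model + ss_resid

# R-squared: share of total variation explained by the model
r2 <- ss_model / ss_total

# F statistic: mean square of the model over mean square of the residual
f_stat <- (ss_model / df_model) / (ss_resid / df_resid)

# p value of the F test that all slope coefficients are zero
p_value <- pf(f_stat, df_model, df_resid, lower.tail = FALSE)

print(round(c(R2 = r2, F = f_stat), 4))
```

Rounded to the precision Stata displays, these reproduce the reported R-squared of 0.3152 and F of 80.10.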

The constant, -2.420, is interpreted as the average wage of a worker with 0 years of tenure and 0 years of education, but since such a worker does not exist in the data, we should interpret the constant as the point where the fitted line would cross the wage axis.

The coefficient on the linear tenure term, 0.330, is the increase in the average wage associated with a one-unit increase in tenure through the linear term alone, holding educ constant and ignoring the decrease that comes from the negative coefficient on tenure². The estimated coefficient on tenure², -0.006, is the change in the average wage with a one-unit increase in the square of tenure, holding educ constant and ignoring the increase in the average wage from the linear tenure term. The net effect of a one-unit increase in tenure, holding education constant, is therefore 0.330 minus 0.006 times the change in the square of tenure when tenure increases by 1. Because that change equals 2*tenure + 1, the net effect depends on the value of tenure from which the one-unit increase is computed, and it may be either positive or negative.
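To make the dependence on the starting value of tenure concrete, the sketch below (in R, using the rounded coefficients 0.330 and -0.006 from the table; the function name is made up for illustration) evaluates the net effect of one more year of tenure at a few starting values.

```r
# Rounded coefficients from the regression table
b_tenure  <- 0.330
b_tenursq <- -0.006

# Net change in the fitted wage when tenure goes from `tenure` to `tenure + 1`,
# holding educ constant
effect_of_one_more_year <- function(tenure) {
  change_in_square <- (tenure + 1)^2 - tenure^2  # equals 2 * tenure + 1
  b_tenure + b_tenursq * change_in_square
}

starting_tenure <- c(0, 10, 27, 40)
net_effect <- effect_of_one_more_year(starting_tenure)
print(data.frame(starting_tenure, net_effect))
```

With these rounded coefficients the net effect is positive at low tenure, roughly zero around 27 years, and negative at 40 years.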

The estimated coefficient on educ, 0.562, is the change in the average wage with a 1 unit increase in educ (i.e., number of years of school completed), holding tenure constant.

  1. Explain in plain English the interpretation of the coefficient on educ.

The estimated coefficient on educ, 0.562, is the change in the average wage with a 1 unit increase in educ (i.e., number of years of school completed), holding tenure constant. In other words, one additional year of school completed is associated with an increase in the average wage of 0.562, holding everything else (i.e., tenure) constant.

Is the quadratic form of tenure appropriate in this regression (consider the following questions)?

  1. Are the estimated coefficients on tenure and tenure² of the expected sign? Explain.

Yes, the estimated coefficients have the signs that we expect. Wages should rise initially as one gains tenure (i.e., time with the same employer), but these gains should probably be fastest in the first 10 years or so, then taper off, and then quite possibly turn negative as a worker nears retirement.

  1. Are the estimated coefficients on tenure and tenure² statistically significant? Explain.

Yes, the estimated coefficients are statistically significant. The t statistic is just the estimated coefficient divided by its standard error (i.e., the standard deviation of the sampling distribution of the coefficient). Thus the t statistic is the coefficient re-scaled so that it is in the units “standard deviations from zero.” The p value of the t statistic is the probability of obtaining a t statistic this large or larger in absolute value if the null hypothesis that the population coefficient is equal to zero is true. If that probability turns out to be very small, that is good evidence that the population coefficient is not zero. Here, the p value for tenure is less than 0.001 and the p value for tenure² is 0.001, so for each of these, if the null hypothesis were true, there would be roughly 1 chance in 1,000 or less of obtaining a t statistic this large. This is very strong evidence that the population coefficients on these variables are not equal to zero.
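The calculation described above can be sketched in R from the numbers in the table. Note that the table rounds coefficients and standard errors to three decimals, so the ratio of the rounded values is only close to the t statistic Stata reports from the unrounded values.

```r
# Rounded values from the regression table
coef_tenure <- 0.330
se_tenure   <- 0.048
df_resid    <- 522

# t statistic from the rounded values (Stata reports 6.917 from unrounded ones)
t_from_rounded <- coef_tenure / se_tenure

# Two-sided p values from the t statistics Stata reports
p_tenure  <- 2 * pt(-abs(6.917),  df = df_resid)
p_tenursq <- 2 * pt(-abs(-3.194), df = df_resid)

print(c(t_from_rounded = t_from_rounded, p_tenure = p_tenure, p_tenursq = p_tenursq))
```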

iii. Did the inclusion of the quadratic form improve our regression model? Compare R² and adjusted R² within this model and with the model without the square terms (you don’t need to report the full regression results, just talk about the R² and adjusted R²). Explain.

In the model above, the R² is 0.3152 and the adjusted R² is 0.3113. In the model without tenure², the R² is 0.3019 and the adjusted R² is 0.2992. R² is guaranteed not to decrease (and usually increases) as we increase the number of variables, so the increase in R² from adding tenure² is neither surprising nor impressive. However, adjusted R² adjusts R² for the number of variables in the model relative to the number of observations in the data set, and reduces R² quite drastically as the number of variables gets closer to the number of observations (i.e., as the solution to the regression model gets closer and closer to looking like the solution to a system of simultaneous equations that fits every point exactly). The fact that the adjusted R² rises rather than falls when we add tenure² is a reason to believe that tenure² is a good variable to add, along with the fact that with tenure² the model fits an economic story that makes sense (see part d.i).
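The adjustment can be written out explicitly. The sketch below, in R, applies the standard adjusted R² formula to the two reported R² values with n = 526 observations; the function name is made up for illustration.

```r
# Adjusted R-squared: penalizes R-squared for the number of regressors k
adj_r2 <- function(r2, n, k) {
  1 - (1 - r2) * (n - 1) / (n - k - 1)
}

n <- 526
adj_with_square    <- adj_r2(0.3152, n, k = 3)  # educ, tenure, tenursq
adj_without_square <- adj_r2(0.3019, n, k = 2)  # educ, tenure

print(round(c(with_square = adj_with_square,
              without_square = adj_without_square), 4))
```

Both values match the reported adjusted R² figures (0.3113 and 0.2992) to the displayed precision.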

  1. Based on the results, on average, how many years should one expect to be at a firm before reaching their maximum wage (i.e., what level of tenure maximizes the relationship between wage and tenure)?

The answer can be found by finding the number of years of tenure at which the slope of the wage in tenure is 0. The slope of the wage in tenure is the partial derivative of the estimated wage equation with respect to tenure, which is 0.33 - 2*0.006*tenure = 0.33 - 0.012*tenure. Setting 0.33 - 0.012*tenure equal to zero and solving for tenure yields tenure = 27.5. The average wage is maximized after 27.5 years, and that is when the worker can expect to be earning the highest wage.
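The same turning point can be computed in R, both from the closed-form expression and by numerically maximizing the tenure part of the fitted equation. This is a sketch using the rounded coefficients; unrounded coefficients would give a somewhat different value.

```r
# Rounded coefficients from the regression table
b_tenure  <- 0.330
b_tenursq <- -0.006

# Closed form: set the derivative b_tenure + 2 * b_tenursq * tenure to zero
turning_point <- -b_tenure / (2 * b_tenursq)

# Numerical check: maximize the tenure contribution to the fitted wage
tenure_part <- function(tenure) b_tenure * tenure + b_tenursq * tenure^2
numeric_max <- optimize(tenure_part, interval = c(0, 60), maximum = TRUE)$maximum

print(c(closed_form = turning_point, numeric = numeric_max))
```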

2) Run the regression: $wage = \beta_0 + \beta_1 educ + \beta_2 tenure + \beta_3 tenure^2 + \beta_4 exper + u$

The Stata command to accomplish this and its results follow:

regress wage educ tenure tenursq exper, cformat(%9.3f) pformat(%5.3f) sformat(%8.3f)

      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(  4,   521) =   60.74
       Model |  2277.22215     4  569.305538           Prob > F      =  0.0000
    Residual |  4883.19214   521  9.37272963           R-squared     =  0.3180
-------------+------------------------------           Adj R-squared =  0.3128
       Total |  7160.41429   525  13.6388844           Root MSE      =  3.0615

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |      0.586      0.051   11.475   0.000        0.486       0.687
      tenure |      0.305      0.051    6.046   0.000        0.206       0.405
     tenursq |     -0.005      0.002   -2.978   0.003       -0.009      -0.002
       exper |      0.018      0.012    1.459   0.145       -0.006       0.041
       _cons |     -2.921      0.724   -4.036   0.000       -4.343      -1.499
------------------------------------------------------------------------------

  1. Interpret the regression.

The regression model provides the best fit, in the sense of the smallest sum of squared residuals from the fitted line, of the variable wage on the variables educ, tenure, tenure² and exper. The R², 0.3180, implies that 31.8% of the overall variation in wage can be explained by variation in educ, tenure, tenure² and exper. Also, the F statistic, 60.74 with 4 and 521 degrees of freedom and associated p value < 0.0001, implies that we can reject the null hypothesis that all of the slope coefficients are equal to zero in favor of the alternative that at least one slope coefficient is different from zero.

The constant, -2.921, is interpreted as the average wage of a worker with 0 years of tenure, 0 years of education, and 0 years of experience, but since such a worker does not exist in the data, we should interpret the constant as the point where the fitted line would cross the wage axis.

The coefficient on the linear tenure term, $\hat{\beta}_2$ = 0.305, is the increase in the average wage associated with a one-unit increase in tenure through the linear term alone, holding educ and exper constant and ignoring the decrease that comes from the negative coefficient on tenure². The estimated coefficient on tenure², -0.005, is the change in the average wage with a 1 unit increase in tenure squared, holding educ and exper constant and ignoring the increase in the average wage from the linear tenure term. The net effect of a one-unit increase in tenure, holding education and experience constant, is 0.305 minus 0.005 times the change in the square of tenure when tenure increases by 1. That change depends on the value of tenure from which the one-unit increase is computed, so the net effect may be either positive or negative.

The estimated coefficient on educ, 0.586, is the change in the average wage with a 1 unit increase in educ (i.e., number of years of school completed), holding exper and tenure constant.

The estimated coefficient on exper, 0.018, is the change in the average wage with a 1 unit increase in experience (i.e., number of years of work experience), holding tenure and educ constant.

  1. Are the coefficients statistically significant? Explain.

The estimated coefficients on the constant, educ, tenure and tenure² are statistically significant because all of their p values are less than 0.05 or any other conventionally chosen level of statistical significance. The estimated coefficient on exper has a p value of 0.145, which means that if the slope on exper in the population regression line were 0, there would still be a 0.145 probability (i.e., about 1 chance in 7) of obtaining a t statistic this large or larger in absolute value. This sample does not provide evidence that the coefficient on exper is statistically significantly different from 0.

  1. Did including experience improve our model compared to the previous model (in question (1))?
  2. Compare R2 and adjusted R2 within and between models to support your answer.

The R² for the model that does not include exper is 0.3152 and the R² for the model that includes exper is 0.3180, a very small increase in R² when exper is added. The adjusted R² for the model that does not include exper is 0.3113 and the adjusted R² for the model that includes exper is 0.3128, a very small increase in adjusted R² when exper is added. The improvement from adding exper to the model is slight in terms of the fit, and the coefficient on exper is not statistically significantly different from zero, so the overall conclusion is that exper does not lead to a model improvement.
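Another way to see how little exper adds is a partial F test built from the residual sums of squares of the two models. The following R sketch uses only numbers copied from the two Stata tables; with a single added variable, the square root of the partial F statistic should match the absolute t statistic on exper.

```r
# Residual sums of squares from the two Stata tables
ssr_without_exper <- 4903.15557   # model with educ, tenure, tenursq
ssr_with_exper    <- 4883.19214   # model that also includes exper
df_resid_with     <- 521
added_variables   <- 1

# Partial F: reduction in residual SS per added variable, relative to
# the residual mean square of the larger model
f_partial <- ((ssr_without_exper - ssr_with_exper) / added_variables) /
  (ssr_with_exper / df_resid_with)

# With one added variable, sqrt(F) equals |t| on that variable
t_implied <- sqrt(f_partial)
p_partial <- pf(f_partial, added_variables, df_resid_with, lower.tail = FALSE)

print(c(F = f_partial, t_implied = t_implied, p = p_partial))
```

The implied t statistic and p value line up with the 1.459 and 0.145 reported for exper.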

  1. Explain why you think there was or wasn’t an improvement to the model.

Tenure is the number of years with the same company; experience is the number of years working after school completion. For many workers these are the same number, and they tend to be close enough that the information added by experience is, to a large extent, already in the model through tenure. The correlation between exper and tenure is 0.4993, so there is not much new information added when exper is added to the model.

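3) Run the regression: $wage = \beta_0 + \beta_1 educ + \beta_2 tenure + \beta_3 tenure^2 + \beta_4 exper + \beta_5 exper^2 + u$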

The Stata command and its results follow:

regress wage educ tenure tenursq exper expersq, cformat(%9.3f) pformat(%5.3f) sformat(%8.3f)

      Source |       SS       df       MS              Number of obs =     526
-------------+------------------------------           F(  5,   520) =   55.22
       Model |  2483.27264     5  496.654529           Prob > F      =  0.0000
    Residual |  4677.14165   520  8.99450317           R-squared     =  0.3468
-------------+------------------------------           Adj R-squared =  0.3405
       Total |  7160.41429   525  13.6388844           Root MSE      =  2.9991

------------------------------------------------------------------------------
        wage |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
        educ |      0.552      0.051   10.928   0.000        0.453       0.652
      tenure |      0.244      0.051    4.778   0.000        0.144       0.345
     tenursq |     -0.003      0.002   -1.777   0.076       -0.007       0.000
       exper |      0.187      0.037    5.012   0.000        0.114       0.260
     expersq |     -0.004      0.001   -4.786   0.000       -0.005      -0.002
       _cons |     -3.403      0.716   -4.753   0.000       -4.810      -1.997
------------------------------------------------------------------------------

  1. Interpret the regression.

The regression model provides the best fit, in the sense of the smallest sum of squared residuals from the fitted line, of the variable wage on the variables educ, tenure, tenure², exper and exper². The R², 0.3468, implies that 34.68% of the overall variation in wage can be explained by variation in educ, tenure, tenure², exper and exper². Also, the F statistic, 55.22 with 5 and 520 degrees of freedom and associated p value < 0.0001, implies that we can reject the null hypothesis that all of the slope coefficients are equal to zero in favor of the alternative that at least one slope coefficient is different from zero.

The constant, -3.403, is interpreted as the average wage of a worker with 0 years of tenure, 0 years of education, and 0 years of experience, but since such a worker does not exist in the data, we should interpret the constant as the point where the fitted line would cross the wage axis.

The coefficient on the linear tenure term, 0.244, is the increase in the average wage associated with a one-unit increase in tenure through the linear term alone, holding educ and exper (and thereby exper² as well) constant and ignoring the decrease that comes from the negative coefficient on tenure². The estimated coefficient on tenure², -0.003, is the change in the average wage with a 1 unit increase in tenure squared, holding educ and exper constant and ignoring the increase in the average wage from the linear tenure term. The net effect of a one-unit increase in tenure is 0.244 minus 0.003 times the change in the square of tenure when tenure increases by 1. That change depends on the value of tenure from which the one-unit increase is computed, so the net effect may be either positive or negative.

The estimated coefficient on the linear term for exper, 0.187, is the change in the average wage with a 1 unit increase in experience (i.e., number of years of work experience), holding tenure and educ constant and ignoring the contribution of the exper² term when exper is increased by 1. The estimated coefficient on exper², -0.004, is the change in the average wage when exper² increases by 1. The net effect of a one-unit increase in exper is 0.187 minus 0.004 times the change in the square of exper when exper increases by 1. That change depends on the value of exper from which the one-unit increase is computed, so the net effect may be either positive or negative.

The estimated coefficient on educ, 0.552, is the change in the average wage with a 1 unit increase in educ (i.e., number of years of school completed), holding exper and tenure constant.

  1. Are the coefficients statistically significant? Explain.

All of the coefficients except the coefficient on tenure squared are significant at the 0.05 level of significance because their p values are less than 0.05. The tenure squared variable is statistically significant at the 0.10 level of significance because its p value, 0.076, is less than 0.10, but it is not statistically significant at the 0.05 level of significance because 0.076 > 0.05.

  1. Did including the quadratic form of experience improve our model compared to the previous model (in question (2))?
  2. Compare R2 and adjusted R2 within and between models to support your answer.

Yes, including the exper² term improved the fit of the model, as the R² increased from 0.3180 to 0.3468, and the adjusted R² improved from 0.3128 to 0.3405.

  1. Explain why you think there was or wasn’t an improvement to the model.

I think adding experience squared improved the model for a couple of reasons. First, all of the estimated coefficients are significant at the 0.10 level of significance or better, and all but tenure squared are significant at less than the 0.001 level of significance. Second, the fit of the model improved, as measured by both R² and adjusted R². I believe the explanation for why we get a better fit when we add experience and the square of experience, but do not get a better model when we add just experience, is that experience and tenure are correlated, whereas the square of experience and tenure are not so highly correlated, because squaring experience is not a linear transformation of experience. Experience squared and tenure squared are apparently not that highly correlated either.

Use the data birthweight.RDATA and R to answer the following questions. It is known that maternal smoking during pregnancy can lead to adverse effects on the fetus. In this problem we would like to estimate the impact of smoking during pregnancy on live birth weights.

4) To begin, we want to estimate the marginal effect of cigarette smoking on birth weight (in ounces) after controlling for the race of the mother (white) and sex of the baby (male). Use OLS to estimate the following equation: $bwght = \beta_0 + \beta_1 cigs + \beta_2 white + \beta_3 male + u$

  1. Write the SRF. 

  1. Interpret the regression (I provide guidance here on the points I expect you to address when I ask you to “interpret the regression” in forthcoming problems):
  2. Is the intercept meaningful? If so, explain its interpretation and if it is statistically significant.

For the intercept, cigs is equal to 0 (i.e., non-smoker), white is equal to zero (i.e., non-white mother) and male is equal to zero (i.e., female baby). Therefore the intercept, 113.278 ounces (7.08 pounds), is the average birthweight of a female baby of a non-smoking, non-white mother, which is a meaningful observation in the data. As we would expect, this estimate is statistically significantly different than zero, as indicated by the p value 2e-16 (decimal point followed by 15 zeroes followed by 2).

  1. What is the interpretation of the coefficient on cigs? Is it economically and/or statistically significant?

The interpretation of the coefficient on cigs is that the average birthweight of a baby of either sex, from either a white or non-white mother, decreases by 0.506 ounces for each additional cigarette the mother smokes per day, holding the race of the mother and the sex of the baby constant. Half an ounce per cigarette may not seem like a lot, but the change is the same for each additional cigarette, so a mother who smokes 20 cigarettes (a pack) a day lowers the average birthweight by about 10 ounces (i.e., 20*0.506 = 10.12), which is definitely an economically important amount.

iii. What is the interpretation of the coefficient on white? (Make sure that your interpretation is in reference to the proper comparison/base group.) Is it economically and/or statistically significant?

The coefficient on white indicates the difference, average birthweight of babies with white mothers minus the average birthweight of babies with non-white mothers, holding the sex of the baby constant across the different mothers (but allowing the constant sex to be either male or female) and holding the number of cigarettes smoked per day constant across mothers (but allowing the number of cigarettes smoked per day to be any reasonable number which might appear in the data). Thus on average, babies of white mothers weigh 6.23 ounces more than babies of non-white mothers, holding the sex of the baby and the number of cigarettes smoked per day constant.

  1. What is the interpretation of the coefficient on male? (Make sure that your interpretation is in reference to the proper comparison/base group.) Is it economically and/or statistically significant?

The coefficient on male, 3.052, implies that the average male baby weighs 3.052 ounces more than the average female baby, holding constant the number of cigarettes per day the mother smokes (but for any reasonable number of cigarettes which might appear in the data) and holding constant the race of the mother (but for any race).

  1. Provide a graph (use visreg()) of the (regression functions) relationship between bwght and cigs for white and non-white mothers (holding male constant). Explain what you see.

The R code to generate the graphs consists of a line of code to graph for males for non-white and white mothers and a line of code to graph for females of white and non-white mothers, which follow along with the graphs:

library(visreg)

fit <- lm(bwght ~ cigs + white + male, data = data)

visreg(fit, "cigs", by = "white", type = "conditional", cond = list(male = 1))

visreg(fit, "cigs", by = "white", type = "conditional", cond = list(male = 0))

Male babies

Female babies

Each graph shows the fitted relationship between number of cigarettes smoked per day (cigs) and birthweight in ounces (bwght), with the left panel showing this relationship for non-white mothers and the right panel showing this relationship for white mothers. The slopes are the same in all four panels, but the intercepts are lower for both sets of female babies than for their male counterparts, and the intercepts are lower for both sets of non-white mothers than their white counterparts. The gray bands in the graphs are for 95% confidence intervals for the predictions.

  1. Is there a statistical difference in birthweight of babies born to white vs. non-white women for all levels of smoking (cigs)?

From the graph, the 95% confidence intervals do not look as though they would overlap, which suggests there is a difference. For the estimated coefficient on white, the summary command reports a t statistic of 4.787, which has a p value of 1.88e-06. This implies that there is a statistically significant difference between whites and non-whites at all levels of cigarettes, as long as the sex of the baby and the number of cigarettes are the same across both whites and non-whites.

  1. Relate this to the coefficient on white from the regression. Specifically, why do you think that it is statistically significant when the CIs on the regression lines overlap for a large portion of the range of cigs?

The CIs on the graphs do not appear to overlap.

  1. What is the expected birth weight of the average female child born to a white mother that smokes 2 cigarettes a day?

Substitute male = 0, cigs = 2, and white = 1 into the SRF to obtain

113.278 - 0.506(2) + 6.23(1) + 3.052(0) = 118.496

  1. What is the expected birth weight of the average female child born to a white mother that smokes 35 cigarettes a day?

Substitute male = 0, white = 1 and cigs = 35 into the SRF to obtain

113.278 - 0.506(35) + 6.23(1) + 3.052(0) = 101.798
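The two substitutions can be reproduced with a small R function. This is a sketch that assumes the rounded coefficients quoted in the text (intercept 113.278, white 6.23, male 3.052) and a cigs coefficient of -0.506, which is the value consistent with the two fitted values above; the function name is made up for illustration.

```r
# Rounded coefficients as quoted in the text (assumed, not re-estimated)
b0      <- 113.278
b_cigs  <- -0.506
b_white <- 6.23
b_male  <- 3.052

# Fitted birth weight in ounces from the sample regression function
predict_bwght <- function(cigs, white, male) {
  b0 + b_cigs * cigs + b_white * white + b_male * male
}

bw_2_cigs  <- predict_bwght(cigs = 2,  white = 1, male = 0)
bw_35_cigs <- predict_bwght(cigs = 35, white = 1, male = 0)

print(c(two_cigs = bw_2_cigs, thirty_five_cigs = bw_35_cigs,
        difference = bw_2_cigs - bw_35_cigs))
```

Because the coefficients are rounded, the fitted value at 35 cigarettes differs slightly from what predict() returns with the unrounded estimates.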

  1. Calculate the confidence intervals on the fitted values obtained in parts (c) and (d). Based on these calculations, is there a significant difference in expected birth weight, on average, when the mother smokes 2 versus 35 cigarettes a day during pregnancy (for a white mother giving birth to a girl, as outlined in parts (c) and (d))?

For 2 cigarettes, white mother, female baby, the R commands and their results are given by the following:

reg1 <- lm(bwght ~ cigs + white + male, data = data)

newdata2 <- data.frame(cigs = 2, white = 1, male = 0)

predict(reg1, newdata2, interval = "prediction")

      fit      lwr      upr
1 118.496 79.37637 157.6155

Thus the 95% prediction interval for a white mother who smokes two cigarettes a day and has a female baby is [79.37637, 157.6155].

For 35 cigarettes, the 95% interval for prediction is given by the following R commands and their results:

newdata35 <- data.frame(cigs = 35, white = 1, male = 0)

predict(reg1, newdata35, interval = "prediction")

       fit      lwr      upr
1 101.8071 62.26082 141.3534

Thus the 95% prediction interval for a white mother who smokes 35 cigarettes per day and has a female baby is [62.26082, 141.3534]. Because the prediction intervals overlap (i.e., the lower bound for 2 cigarettes is less than the upper bound for 35 cigarettes), there is not a statistically significant difference between the two predictions. Note that interval = "prediction" gives an interval for an individual birth weight; interval = "confidence" would give the narrower interval for the average fitted value.
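The choice of interval type matters for the width of the interval. The sketch below uses simulated data (hypothetical values that only borrow the variable names, not the birthweight data) to compare what predict() returns for interval = "confidence" and interval = "prediction" at the same two points.

```r
# Simulated stand-in for the birthweight data (hypothetical values)
set.seed(42)
n   <- 300
sim <- data.frame(cigs  = rpois(n, 3),
                  white = rbinom(n, 1, 0.8),
                  male  = rbinom(n, 1, 0.5))
sim$bwght <- 113 - 0.5 * sim$cigs + 6 * sim$white + 3 * sim$male +
  rnorm(n, sd = 20)

fit_sim <- lm(bwght ~ cigs + white + male, data = sim)
new_obs <- data.frame(cigs = c(2, 35), white = 1, male = 0)

# Interval for the average fitted value at each point
conf_int <- predict(fit_sim, new_obs, interval = "confidence")

# Interval for a single new observation at each point
pred_int <- predict(fit_sim, new_obs, interval = "prediction")

conf_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pred_width <- pred_int[, "upr"] - pred_int[, "lwr"]
print(cbind(conf_width, pred_width))
```

Both calls return the same fitted values; the prediction interval is wider because it also accounts for the variation of an individual outcome around the regression line.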

Use the professor data in the statistical program of your choice to answer the following questions. These data contain information on salaries of professors at 9 large Midwestern universities.

5) We will examine if male professors earn, on average, more than female professors.

  1. On the same graph, compare the distribution of salaries for men and women.

I chose Stata, and the commands and graph follow:

twoway (histogram salary if female==1, start(10000) width(20000) color(green)) (histogram salary if female==0, start(10000) width(20000) fcolor(none) lcolor(black)), legend(order(1 "Female" 2 "Male")) xtitle(Salary) xlabel(10000(40000)210000)

  1. Explain what you see.

The histogram shows salaries in $20,000 increments, starting at $10,000. The solid green is the histogram for women, whereas the black outline with no fill is the histogram for males. Although no females earn below $30,000 and a small number of males do, males have a higher proportion of their salaries at all salaries greater than $70,000, have a lower proportion of their salaries between $30,000 and $70,000, and have all of the salaries above $150,000.

  1. What are the mean, median and standard deviation of salaries for men and women?

The Stata command and its results follow:

tabstat salary, statistics( count mean sd median ) by(female) columns(statistics)

Summary for variables: salary
     by categories of: female

  female |         N      mean        sd       p50
---------+----------------------------------------
    Male |       606  83776.46  28210.02     80043
  Female |        58  68319.62   22946.9     61464
---------+----------------------------------------
   Total |       664  82426.31   28116.4     78550
--------------------------------------------------

  1. Write the null and alternative hypotheses that represent the question above.

The null hypothesis is that the population average salary of male professors is less than or equal to the population average salary of female professors. The alternative hypothesis is that the population average salary of male professors is greater than the population average salary of female professors. This is a one-tailed test.

  1. What is the critical value for the test?

For the one tailed test at the α=0.05 level of significance, the critical value is 1.645 if the difference is given as male mean minus female mean, or -1.645 if the difference is taken as female mean minus male mean. In the former case, reject if t > 1.645, in the latter case, reject if t < -1.645.
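The critical value can be looked up in R. The 1.645 quoted above is the standard normal value; with the 662 degrees of freedom of this test, the t distribution gives a value that is only slightly larger, so the conclusion is unaffected. A short sketch:

```r
alpha <- 0.05

# One-tailed critical value from the standard normal distribution
crit_normal <- qnorm(1 - alpha)

# One-tailed critical value from the t distribution with n1 + n2 - 2 = 662 df
crit_t <- qt(1 - alpha, df = 662)

print(round(c(normal = crit_normal, t_662 = crit_t), 4))
```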

  1. Run a t-test (if you use R you should specify the option var.equal=TRUE in your t.test).

The Stata command to run the test and its results follow:

ttest salary, by(female)

Two-sample t test with equal variances
------------------------------------------------------------------------------
   Group |     Obs        Mean    Std. Err.   Std. Dev.   [95% Conf. Interval]
---------+--------------------------------------------------------------------
    Male |     606    83776.46    1145.954    28210.02    81525.92    86026.99
  Female |      58    68319.62    3013.076     22946.9    62286.04     74353.2
---------+--------------------------------------------------------------------
combined |     664    82426.31    1091.128     28116.4    80283.83    84568.79
---------+--------------------------------------------------------------------
    diff |            15456.83    3820.476                7955.124    22958.55
------------------------------------------------------------------------------
    diff = mean(Male) - mean(Female)                              t =   4.0458
Ho: diff = 0                                     degrees of freedom =      662

    Ha: diff < 0                 Ha: diff != 0                 Ha: diff > 0
 Pr(T < t) = 1.0000         Pr(|T| > |t|) = 0.0001          Pr(T > t) = 0.0000

Stata computes the difference as male mean minus female mean, so the critical value is 1.645. The t statistic is 4.0458, and the one-tailed p value (Pr(T > t) in the output above) is less than 0.0001. Therefore we reject the null hypothesis that the population mean of male salary is less than or equal to the population mean of female salary and conclude that the population mean of male salary is greater than the population mean of female salary.
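The t statistic can be rebuilt from the group summary statistics alone. The R sketch below applies the pooled-variance (equal variances) formula to the counts, means and standard deviations shown in the output; the variable names are illustrative.

```r
# Group summaries copied from the Stata output
n_m <- 606; mean_m <- 83776.46; sd_m <- 28210.02   # male
n_f <- 58;  mean_f <- 68319.62; sd_f <- 22946.9    # female

df <- n_m + n_f - 2

# Pooled variance under the equal-variances assumption
pooled_var <- ((n_m - 1) * sd_m^2 + (n_f - 1) * sd_f^2) / df

# Standard error of the difference in means, and the t statistic
se_diff <- sqrt(pooled_var * (1 / n_m + 1 / n_f))
t_stat  <- (mean_m - mean_f) / se_diff

# One-tailed p value for Ha: diff > 0
p_one_tailed <- pt(t_stat, df = df, lower.tail = FALSE)

print(c(se_diff = se_diff, t = t_stat, p = p_one_tailed))
```

Up to rounding of the inputs, this reproduces the standard error of 3820.476 and the t statistic of 4.0458.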

  1. Run the following regression: $salary = \beta_0 + \beta_1 female + u$

The Stata command and its results follow:

regress salary i.female, cformat(%9.5fc) pformat(%5.4f) sformat(%8.3f)

      Source |       SS       df       MS              Number of obs =     664
-------------+------------------------------           F(  1,   662) =   16.37
       Model |  1.2647e+10     1  1.2647e+10           Prob > F      =  0.0001
    Residual |  5.1148e+11   662   772622716           R-squared     =  0.0241
-------------+------------------------------           Adj R-squared =  0.0227
       Total |  5.2412e+11   663   790532179           Root MSE      =   27796

------------------------------------------------------------------------------
      salary |      Coef.   Std. Err.      t    P>|t|     [95% Conf. Interval]
-------------+----------------------------------------------------------------
    1.female |  -1.55e+04   3.82e+03   -4.046   0.0001    -2.30e+04   -7.96e+03
       _cons |   8.38e+04   1.13e+03   74.195   0.0000     8.16e+04    8.60e+04
------------------------------------------------------------------------------

The regression table displays fewer digits than the t test table, but the coefficient on 1.female is the difference between the female and male sample mean salaries (female minus male): -1.55e+04 = -15,500. This matches the t test's diff of 15,456.83 except for rounding and for sign, because the t test reports the male average minus the female average. The t statistic is also displayed with fewer digits: -4.046 is the negative of 4.0458, rounded to three decimal places. Clearly these are the same results.
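The regression table is also internally consistent with the t statistic, which a few lines of arithmetic can confirm. The sketch below is Python rather than Stata and uses only numbers copied from the regression output. It relies on three standard identities for a regression with a single regressor: F equals the square of the slope's t statistic, R-squared equals the model sum of squares over the total sum of squares, and Root MSE equals the square root of the residual mean square.

```python
import math

# Values copied from the regression output above.
t_female = -4.046          # t statistic on 1.female
f_reported = 16.37         # F(1, 662)
ss_model = 1.2647e10       # Model SS
ss_total = 5.2412e11       # Total SS
ms_resid = 772622716       # Residual MS
r2_reported = 0.0241
root_mse_reported = 27796

# With one regressor, F = t^2.
f_from_t = t_female ** 2

# R-squared is the share of the total sum of squares explained by the model.
r2 = ss_model / ss_total

# Root MSE is the square root of the residual mean square.
root_mse = math.sqrt(ms_resid)

print(f"t^2 = {f_from_t:.2f} (reported F = {f_reported})")
print(f"R-squared = {r2:.4f} (reported {r2_reported})")
print(f"Root MSE = {root_mse:.0f} (reported {root_mse_reported})")
```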

  1. What can you conclude from the two tests in parts (e) and (f)?

The two procedures give the same estimated difference in sample mean salaries and the same t statistic, apart from sign and the number of digits displayed. Both tests lead to rejection of the null hypothesis that the population average male salary is less than or equal to the population average female salary in favor of the alternative hypothesis that the population average male salary is greater than the population average female salary.

  1. Compare the results from parts (e) and (f). What do you notice? Explain.

The constant in the regression is the sample average male salary, displayed less precisely: 8.38e+04 is $83,800, compared with the more precise $83,776.46 in the t test output. The coefficient -1.55e+04 is the negative of the “diff” mean in the t test table, 15,456.83, up to the precision displayed; the sign flips because the 1.female coefficient measures the female sample mean minus the male sample mean, whereas diff is male minus female. For the same reason, the standard error of “diff”, 3,820.476, equals the standard error of the 1.female coefficient, 3.82e+03. Finally, although it cannot be seen from the output, the t test in the regression table is two-tailed, in contrast to the one-tailed test, Pr(T > t), used in part (e); the regression p-value, 0.0001, is the same as the p-value for the middle of the three t tests, Ha: diff != 0.
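The equivalence described above holds in general and not only for these data. The following sketch, in Python rather than Stata and using only the standard library, runs a pooled two-sample t test and a regression on a 0/1 dummy on a small set of hypothetical salaries (made-up numbers, not the homework data) and compares the results. The function names pooled_t and dummy_regression are illustrative only.

```python
import math
from statistics import mean

# Hypothetical toy salaries in thousands of dollars (NOT the homework data).
male = [82.0, 91.5, 77.0, 88.0, 95.5, 70.0]
female = [68.0, 73.5, 61.0, 79.0]

def pooled_t(a, b):
    """Equal-variance two-sample t test of mean(a) - mean(b)."""
    na, nb = len(a), len(b)
    ma, mb = mean(a), mean(b)
    ss = sum((v - ma) ** 2 for v in a) + sum((v - mb) ** 2 for v in b)
    s2 = ss / (na + nb - 2)                      # pooled variance
    se = math.sqrt(s2 * (1 / na + 1 / nb))
    return ma - mb, se, (ma - mb) / se

def dummy_regression(y, d):
    """OLS of y on a constant and a 0/1 dummy d (simple-regression formulas)."""
    n = len(y)
    ybar, dbar = mean(y), mean(d)
    sxx = sum((di - dbar) ** 2 for di in d)
    sxy = sum((di - dbar) * (yi - ybar) for di, yi in zip(d, y))
    slope = sxy / sxx
    intercept = ybar - slope * dbar
    rss = sum((yi - intercept - slope * di) ** 2 for di, yi in zip(d, y))
    se_slope = math.sqrt(rss / (n - 2) / sxx)
    return intercept, slope, se_slope, slope / se_slope

diff, se_diff, t_diff = pooled_t(male, female)   # male minus female

y = male + female
d = [0] * len(male) + [1] * len(female)          # d = 1 for female
cons, b_female, se_b, t_b = dummy_regression(y, d)

print(f"t test:     diff = {diff:.4f}, se = {se_diff:.4f}, t = {t_diff:.4f}")
print(f"regression: cons = {cons:.4f}, b = {b_female:.4f}, "
      f"se = {se_b:.4f}, t = {t_b:.4f}")
```

The constant reproduces the male mean, the dummy coefficient is the negative of diff, the two standard errors coincide, and the t statistics differ only in sign, mirroring the comparison of parts (e) and (f).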

Appendix

A1 Stata commands for questions 1-3

set more off

/* #1                                                    */
/* #1a. Use a scatterplot with a lowess curve to show    */
/* the relationship between wage and tenure and wage and */
/* tenure squared. The prefix "quietly" tells Stata not  */
/* to display the table of estimated coefficients; it is */
/* used because only the regression residuals, not the   */
/* regression itself, are of interest.                   */
quietly regress wage educ
predict wage_resids, residuals
quietly regress tenure educ
predict tenure_resids, residuals
label variable wage_resids "Residuals of wage on educ"
label variable tenure_resids "Residuals of tenure on educ"
lowess wage_resids tenure_resids, recast(scatter) mcolor(ltblue) msize(vsmall) mfcolor(ltblue) mlcolor(blue) mlwidth(vthin) lineopts(lwidth(medthick)) title(Lowess: "Residuals of wage on educ by residuals of tenure on educ") legend(order(3 "Lowess: Residuals of wage on educ by residuals of tenure on educ"))

/* #1b Estimate the regression.                          */
regress wage educ tenure tenursq, cformat(%9.3f) pformat(%5.3f) sformat(%8.3f)

/* #1.d.ii Estimate the model without the tenure square  */
/* variable and compare R squared and adjusted R squared.*/
regress wage educ tenure, cformat(%9.3f) pformat(%5.3f) sformat(%8.3f)

/* #2 Add the variable exper to the previous equation and */
/* estimate the new equation. Then report the pairwise    */
/* correlation between exper and tenure.                  */
regress wage educ tenure tenursq exper, cformat(%9.3f) pformat(%5.3f) sformat(%8.3f)
pwcorr exper tenure, obs sig

/* #3 Add an exper squared term to the regression model   */
/* from #2.                                               */
regress wage educ tenure tenursq exper expersq, cformat(%9.3f) pformat(%5.3f) sformat(%8.3f)

A2 R commands for question 4

## The working directory is "C:/R"
## Load the data file
setwd("C:/R")
load(file = "birthweight.RDATA")
View(data)

## Question 4. Regress birthweight in ounces (bwght) on
## number of cigarettes per day (cigs), an indicator for
## white mother (white), and an indicator for male child
## (male).
reg1 <- lm(data$bwght ~ data$cigs + data$white + data$male)
summary(reg1)

## Question 4c. Graph the relationship between bwght and cigs
## for white and non-white mothers, holding male constant.
install.packages("visreg")
library(visreg)
fit <- lm(bwght ~ cigs + white + male, data = data)
visreg(fit, "cigs", by = "white", type = "conditional", cond = list(male = 1))
visreg(fit, "cigs", by = "white", type = "conditional", cond = list(male = 0))

## Question 4f. 95% prediction interval for white mother, girl baby, 2 cigs per day.
## The model is re-estimated with data = data so that predict() can use newdata.
reg1 <- lm(bwght ~ cigs + white + male, data = data)
newdata2 <- data.frame(cigs = 2, white = 1, male = 0)
predict(reg1, newdata2, interval = "prediction")

## Same prediction interval for 35 cigs per day.
newdata35 <- data.frame(cigs = 35, white = 1, male = 0)
predict(reg1, newdata35, interval = "prediction")

A3 Stata commands for question 5

/* #5 */
/* #5a. A histogram that compares salaries of men and women. */
twoway (histogram salary if female==1, start(10000) width(20000) color(green)) (histogram salary if female==0, start(10000) width(20000) fcolor(none) lcolor(black)), legend(order(1 "Female" 2 "Male")) xtitle(Salary) xlabel(10000(40000)210000)

/* #5b. Mean, median, and standard deviation of male and     */
/* female salaries. Provide value labels.                    */
label define FEMALE 0 "Male" 1 "Female"
label values female FEMALE
tabstat salary, statistics(count mean sd median) by(female) columns(statistics)

/* #5e. t test.                                              */
ttest salary, by(female)

/* #5f. Regression.                                          */
regress salary i.female, cformat(%9.5fc) pformat(%5.4f) sformat(%8.3f)
