# Interpret Regression


**Solution**

**In Stata, open the WAGE1.dta data. These are different data with which we will examine the relationship between wage, education, experience, and job tenure.**

**DON’T run, but examine the regression:**

**Use a scatter plot and LOESS fit line to explain your expectations for the signs of**

The full regression does not have to be estimated to find a bivariate scatterplot that shows the relationship between *wage* and *tenure* and between *wage* and *tenure^{2}*. Instead, to find these relationships, we can regress *wage* on *educ* and save the residuals (as *wage_resids*), and then regress *tenure* on *educ* and save the residuals (as *tenure_resids*). These two residual variables will show the relationship between *wage* and *tenure* after the effects of *educ* are removed. The lowess curve on the scatterplot will provide insight on whether the relationship between wage and tenure is linear or quadratic (i.e., on whether the *tenure^{2}* term improves on the linear fit). If the scatterplot is nearly linear, then we would expect a zero coefficient on *tenure^{2}*; if the scatterplot is u-shaped, we would expect a negative coefficient on the linear term and a positive coefficient on the square term; and if it is inverted-u-shaped, we would expect a positive coefficient on the linear term and a negative coefficient on the square term.

The Stata code to accomplish this and the graph follow:

quietly regress wage educ

predict wage_resids, residuals

quietly regress tenure educ

predict tenure_resids, residuals

label variable wage_resids "Residuals of wage on educ"

label variable tenure_resids "Residuals of tenure on educ"

lowess wage_resids tenure_resids, recast(scatter) mcolor(ltblue) msize(vsmall) mfcolor(ltblue) mlcolor(blue) mlwidth(vthin) lineopts(lwidth(medthick)) title("Lowess: Residuals of wage on educ by residuals of tenure on educ") legend(order(3 "Lowess: Residuals of wage on educ by residuals of tenure on educ"))

The part of the graph where *Residuals of tenure on educ* is less than 0 is u-shaped, which implies a negative coefficient on the linear term and a positive coefficient on the square term. However, if the graph started at *Residuals of tenure on educ* > 0, then the regression would view this as an inverted u, with a positive estimated coefficient on *tenure* and a negative estimated coefficient on *tenure^{2}*. I would expect the inverted u to be the overall result.
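The partialling-out idea behind this answer is the Frisch–Waugh–Lovell theorem: regressing the *wage* residuals on the *tenure* residuals recovers exactly the *tenure* coefficient from the full multiple regression. A quick sanity check of that claim (in Python rather than Stata, and on synthetic data rather than WAGE1; the numbers are made up for illustration only):

```python
import random

random.seed(0)
n = 500
educ   = [random.gauss(12, 3) for _ in range(n)]
tenure = [0.5 * e + random.gauss(5, 2) for e in educ]          # correlated with educ
wage   = [-2 + 0.5 * e + 0.3 * t + random.gauss(0, 1)
          for e, t in zip(educ, tenure)]

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    ma, mb = mean(a), mean(b)
    return sum((x - ma) * (y - mb) for x, y in zip(a, b)) / len(a)

def simple_ols(x, y):
    """Slope and intercept of y on x."""
    b = cov(x, y) / cov(x, x)
    return b, mean(y) - b * mean(x)

# Multiple-regression coefficient on tenure (textbook two-regressor formula)
b_tenure = ((cov(tenure, wage) * cov(educ, educ) - cov(educ, wage) * cov(educ, tenure))
            / (cov(tenure, tenure) * cov(educ, educ) - cov(educ, tenure) ** 2))

# Partialling out: residualize wage and tenure on educ, then regress residual on residual
bw, aw = simple_ols(educ, wage)
bt, at = simple_ols(educ, tenure)
wage_resids   = [y - (aw + bw * e) for y, e in zip(wage, educ)]
tenure_resids = [t - (at + bt * e) for t, e in zip(tenure, educ)]
b_partial = cov(tenure_resids, wage_resids) / cov(tenure_resids, tenure_resids)

print(b_tenure, b_partial)  # identical by Frisch-Waugh-Lovell
```

By the theorem the two printed coefficients are identical, and with this seed both should also land near the true value of 0.3 used to generate the data.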

**Run and interpret the regression.**

The Stata command to estimate the regression and its results follow:

regress wage educ tenure tenursq, cformat(%9.3f) pformat(%5.3f) sformat(%8.3f)

Source | SS df MS Number of obs = 526

-------------+------------------------------ F( 3, 522) = 80.10

Model | 2257.25872 3 752.419574 Prob > F = 0.0000

Residual | 4903.15557 522 9.39301833 R-squared = 0.3152

-------------+------------------------------ Adj R-squared = 0.3113

Total | 7160.41429 525 13.6388844 Root MSE = 3.0648

------------------------------------------------------------------------------

wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

educ | 0.562 0.048 11.609 0.000 0.467 0.657

tenure | 0.330 0.048 6.917 0.000 0.236 0.424

tenursq | -0.006 0.002 -3.194 0.001 -0.009 -0.002

_cons | -2.420 0.638 -3.795 0.000 -3.672 -1.167

------------------------------------------------------------------------------

The regression model provides the best fit, in the sense of least sum of squared residuals from the fitted line, of the variable *wage* on the variables *educ*, *tenure*, and *tenure^{2}*. The R^{2}, 0.3152, implies that 31.52% of the overall variation in *wage* can be explained by variation in *educ*, *tenure*, and *tenure^{2}*. Also, the F statistic, 80.10 with 3 and 522 degrees of freedom and associated p value < 0.0001, implies that we can reject the null hypothesis that all of the slope coefficients are equal to zero in favor of the alternative that at least one slope coefficient is different from zero.
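The R^{2}, adjusted R^{2}, and F statistic can all be reproduced from the ANOVA portion of the Stata output alone. A quick check in plain Python, with the sums of squares and degrees of freedom copied from the table above:

```python
# ANOVA quantities copied from the Stata output above
model_ss, resid_ss, total_ss = 2257.25872, 4903.15557, 7160.41429
model_df, resid_df = 3, 522

r_squared = model_ss / total_ss                           # share of variation explained
f_stat = (model_ss / model_df) / (resid_ss / resid_df)    # Model MS / Residual MS
adj_r_squared = 1 - (resid_ss / resid_df) / (total_ss / (resid_df + model_df))

# These match the reported 0.3152, 80.10, and 0.3113
print(round(r_squared, 4), round(f_stat, 2), round(adj_r_squared, 4))
```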

The constant, -2.420, is interpreted as the average wage of a worker with 0 years of tenure and 0 years of education, but since such a worker does not exist, we should interpret the constant as where the fitted line would cross the *wage* axis.

The coefficient on the linear term on *tenure*, 0.330, indicates the increase in the average wage that is linearly related to *tenure* with a one unit increase in *tenure*, holding *educ* constant, and ignoring the decrease in the wage that happens because of the negative coefficient on *tenure^{2}*. The estimated coefficient on *tenure^{2}*, -0.006, estimates the decrease in the average wage with a 1 unit increase in the square of tenure, holding *educ* constant and ignoring the increase in the average wage from the linear tenure term. The net effect of a one unit increase in *tenure*, holding education constant, will be 0.330 minus 0.006 times the change in the square of tenure when tenure increases by 1, which will depend on the value of tenure that the 1 unit increase is computed from; this sum of effects may be either negative or positive.
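A small sketch of this net effect, using the rounded coefficients from the table (so the numbers are approximate): when tenure rises from t to t+1, the square rises by (t+1)² − t² = 2t+1, so the net effect is 0.330 − 0.006(2t+1), which shrinks as t grows.

```python
# Net effect on the average wage of one more year of tenure, starting from
# tenure = t and holding educ fixed (rounded coefficients from the table)
b_tenure, b_tenursq = 0.330, -0.006

def net_effect(t):
    # 0.330 * (change in tenure) + (-0.006) * (change in tenure squared)
    return b_tenure * 1 + b_tenursq * ((t + 1) ** 2 - t ** 2)

for t in (0, 5, 15, 27, 35):
    print(t, round(net_effect(t), 3))  # positive early, negative late
```

The sign flips just below tenure = 27.5, consistent with the maximum found in part e below.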

The estimated coefficient on *educ*, 0.562, is the change in the average wage with a 1 unit increase in *educ* (i.e., number of years of school completed), holding *tenure* constant.

**Explain in plain English the interpretation of**

The estimated coefficient on *educ*, 0.562, is the change in the average wage with a 1 unit increase in *educ* (i.e., number of years of school completed), holding *tenure* constant. In other words, a one unit increase in the number of years of school completed increases the average wage by 0.562, holding everything else (i.e., tenure) constant.

**Is the quadratic form of tenure appropriate in this regression (consider the following questions)?**

**Are the estimated coefficients on *tenure* and *tenure^{2}* the expected sign? Explain.**

Yes, the estimated coefficients have the signs we expect. Wages should rise initially as one gains tenure (i.e., time with the same employer), but wages should probably rise fastest in the first 10 years or so, then the gains should taper off, and wages may quite possibly decrease as a worker nears retirement.

**Are the estimated coefficients on *tenure* and *tenure^{2}* statistically significant? Explain.**

Yes, the estimated coefficients are statistically significant. The t statistic is just the estimated coefficient divided by its standard error (i.e., the standard deviation of the sampling distribution of the coefficient). Thus the t statistic is the coefficient re-scaled so that it is in units of "standard deviations from zero." The p value of the t statistic is the probability of obtaining a t statistic this large or larger in absolute value if the null hypothesis that the population coefficient is equal to zero is true. Thus the p value indicates how likely it would be to obtain a t statistic as large as was obtained if the population coefficient were zero, and if that turns out to be very unlikely, then that is good evidence that the population coefficient is not zero. Here, the p value for *tenure* is less than 0.001 and the p value for *tenure^{2}* is 0.001, so for each of these, if the null hypothesis were true, there would be no more than 1 chance in 1,000 that we would obtain a t statistic this large. This is very strong evidence that the population coefficients on these variables are not equal to zero.

**iii. Did the inclusion of the quadratic form improve our regression model? Compare R^{2} and adjusted R^{2} within this model and with the model without the square terms (you don't need to report the full regression results, just talk about the R^{2} and adjusted R^{2}). Explain.**

In the model above, the R^{2} is 0.3152, and the adjusted R^{2} is 0.3113. In the model without *tenure^{2}*, the R^{2} is 0.3019 and the adjusted R^{2} is 0.2992. R^{2} is guaranteed not to decrease (and usually increases) as we increase the number of variables, so the fact that we obtained an increase in R^{2} by adding *tenure^{2}* is neither surprising nor impressive. However, adjusted R^{2} adjusts R^{2} for the number of variables in the model relative to the number of observations in the data set, and reduces R^{2} quite drastically as the number of variables gets closer to the number of observations (i.e., as the solution to the regression model gets closer and closer to looking like the solution to a system of simultaneous equations with an exact solution that fits every point exactly). The result that the adjusted R^{2} does not decrease when we add *tenure^{2}* is a reason to believe that *tenure^{2}* is a good variable to add, along with the fact that with *tenure^{2}* the model fits an economic story that makes sense (see part d.i).

**Based on the results, on average, how many years should one expect to be at a firm before reaching their maximum *wage* (i.e., what is the level of *tenure* that maximizes the relationship between *wage* and *tenure*)?**

The answer can be found by finding the number of years of tenure at which the slope of the wage in tenure is 0. The slope of the wage in tenure is the partial derivative of the estimated wage equation with respect to tenure, which is 0.330 - 0.012 × *tenure*. Setting 0.330 - 0.012 × *tenure* equal to zero and solving for *tenure* yields *tenure* = 27.5. The average wage is maximized after 27.5 years, and that is when the worker can expect to be earning the highest wage.
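For a quadratic ax² + bx + c with a < 0, the maximum is at x = −b/(2a). This first-order condition can be checked numerically (plain Python, rounded coefficients from the table above):

```python
# Tenure that maximizes the predicted wage: set d(wage)/d(tenure) = 0
# for the tenure terms 0.330*tenure - 0.006*tenure^2
b_tenure, b_tenursq = 0.330, -0.006
tenure_star = -b_tenure / (2 * b_tenursq)
print(tenure_star)  # approximately 27.5
```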

**2) Run the regression:**

The Stata command to accomplish this and its results follow:

regress wage educ tenure tenursq exper, cformat(%9.3f) pformat(%5.3f) sformat(%8.3f)

Source | SS df MS Number of obs = 526

-------------+------------------------------ F( 4, 521) = 60.74

Model | 2277.22215 4 569.305538 Prob > F = 0.0000

Residual | 4883.19214 521 9.37272963 R-squared = 0.3180

-------------+------------------------------ Adj R-squared = 0.3128

Total | 7160.41429 525 13.6388844 Root MSE = 3.0615

------------------------------------------------------------------------------

wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

educ | 0.586 0.051 11.475 0.000 0.486 0.687

tenure | 0.305 0.051 6.046 0.000 0.206 0.405

tenursq | -0.005 0.002 -2.978 0.003 -0.009 -0.002

exper | 0.018 0.012 1.459 0.145 -0.006 0.041

_cons | -2.921 0.724 -4.036 0.000 -4.343 -1.499

------------------------------------------------------------------------------

**Interpret the regression.**

The regression model provides the best fit, in the sense of least sum of squared residuals from the fitted line, of the variable *wage* on the variables *educ*, *tenure*, *tenure^{2}*, and *exper*. The R^{2}, 0.3180, implies that 31.8% of the overall variation in *wage* can be explained by variation in *educ*, *tenure*, *tenure^{2}*, and *exper*. Also, the F statistic, 60.74, with degrees of freedom 4 and 521 and associated p value < 0.0001, implies that we can reject the null hypothesis that all of the slope coefficients are equal to zero in favor of the alternative that at least one slope coefficient is different from zero.

The constant, -2.921, is interpreted as the average wage of a worker with 0 years of tenure, 0 years of education, and 0 years of experience, but since such a worker does not exist, we should interpret the constant as where the fitted line would cross the *wage* axis.

The coefficient on the linear term on *tenure*, 0.305, indicates the increase in the average wage that is linearly related to *tenure* with a one unit increase in *tenure*, holding *educ* and *exper* constant, and ignoring the decrease in the wage that happens because of the negative coefficient on *tenure^{2}*. The estimated coefficient on *tenure^{2}*, -0.005, estimates the decrease in the average wage with a 1 unit increase in tenure squared, holding *educ* and *exper* constant and ignoring the increase in the average wage from the linear tenure term. The net effect of a one unit increase in *tenure*, holding education and experience constant, will be 0.305 minus 0.005 times the change in the square of tenure when tenure increases by 1, which will depend on the value of tenure that the 1 unit increase is computed from; this sum of effects may be either negative or positive.

The estimated coefficient on *educ*, 0.586, is the change in the average wage with a 1 unit increase in *educ* (i.e., number of years of school completed), holding *exper* and *tenure* constant.

The estimated coefficient on *exper*, 0.018, is the change in the average wage with a 1 unit increase in experience (i.e., number of years of work experience), holding *tenure* and *educ* constant.

**Are the coefficients statistically significant? Explain.**

The estimated coefficients on the constant, *educ*, *tenure*, and *tenure^{2}* are statistically significant because all of their p values are less than 0.05 or any other conventionally chosen level of statistical significance. The estimated coefficient on *exper* has a p value of 0.145, which means that if the slope on *exper* in the population regression line were 0, there would still be a 0.145 probability (i.e., about 1 chance in 7) of obtaining a t statistic this large or larger. This sample does not provide evidence that the coefficient on *exper* is statistically significantly different from 0.

**Did including *experience* improve our model compared to the previous model (in question (1))? Compare R^{2} and adjusted R^{2} within and between models to support your answer.**

The R^{2} for the model that does not include *exper* is 0.3152 and the R^{2} for the model that includes *exper* is 0.3180, a very small increase in R^{2} when *exper* is added. The adjusted R^{2} for the model that does not include *exper* is 0.3113 and the adjusted R^{2} for the model that includes *exper* is 0.3128, a very small increase in adjusted R^{2} when *exper* is added. The improvement from adding *exper* to the model is slight in terms of the fit, and the coefficient on *exper* is not statistically significantly different from zero, so the overall conclusion is that *exper* does not lead to a model improvement.
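The adjusted R^{2} values quoted here follow from the standard formula adj R^{2} = 1 - (1 - R^{2})(n - 1)/(n - k - 1), where n is the number of observations and k the number of slope regressors. A quick check in Python using the reported R^{2} values:

```python
# Adjusted R^2 from R^2, sample size n, and number of slope regressors k
def adj_r2(r2, n, k):
    return 1 - (1 - r2) * (n - 1) / (n - k - 1)

print(round(adj_r2(0.3152, 526, 3), 4))  # model without exper: 0.3113
print(round(adj_r2(0.3180, 526, 4), 4))  # model with exper:    0.3128
```

Because n = 526 is large relative to k, the penalty is mild, which is why the adjusted values sit so close to the raw R^{2} values.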

**Explain why you think there was or wasn’t an improvement to the model.**

Tenure is the number of years with the same company; experience is the number of years working after school completion. For many workers these are the same number, and they will tend to be close enough that the information added by experience is, to a large extent, already in the model through tenure. The correlation between *exper* and *tenure* is 0.4993, so there is not much new information added when *exper* is added to the model.

**3) Run the regression:**

The Stata command and its results follow:

regress wage educ tenure tenursq exper expersq, cformat(%9.3f) pformat(%5.3f) sformat(%8.3f)

Source | SS df MS Number of obs = 526

-------------+------------------------------ F( 5, 520) = 55.22

Model | 2483.27264 5 496.654529 Prob > F = 0.0000

Residual | 4677.14165 520 8.99450317 R-squared = 0.3468

-------------+------------------------------ Adj R-squared = 0.3405

Total | 7160.41429 525 13.6388844 Root MSE = 2.9991

------------------------------------------------------------------------------

wage | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

educ | 0.552 0.051 10.928 0.000 0.453 0.652

tenure | 0.244 0.051 4.778 0.000 0.144 0.345

tenursq | -0.003 0.002 -1.777 0.076 -0.007 0.000

exper | 0.187 0.037 5.012 0.000 0.114 0.260

expersq | -0.004 0.001 -4.786 0.000 -0.005 -0.002

_cons | -3.403 0.716 -4.753 0.000 -4.810 -1.997

------------------------------------------------------------------------------

**Interpret the regression.**

The regression model provides the best fit, in the sense of least sum of squared residuals from the fitted line, of the variable *wage* on the variables *educ*, *tenure*, *tenure^{2}*, *exper*, and *exper^{2}*. The R^{2}, 0.3468, implies that 34.68% of the overall variation in *wage* can be explained by variation in *educ*, *tenure*, *tenure^{2}*, *exper*, and *exper^{2}*. Also, the F statistic, 55.22, with degrees of freedom 5 and 520 and associated p value < 0.0001, implies that we can reject the null hypothesis that all of the slope coefficients are equal to zero in favor of the alternative that at least one slope coefficient is different from zero.

The constant, -3.403, is interpreted as the average wage of a worker with 0 years of tenure, 0 years of education, and 0 years of experience, but since such a worker does not exist, we should interpret the constant as where the fitted line would cross the *wage* axis.

The coefficient on the linear term on *tenure*, 0.244, indicates the increase in the average wage that is linearly related to *tenure* with a one unit increase in *tenure*, holding *educ* and *exper* (and thereby *exper^{2}* as well) constant, and ignoring the decrease in the wage that happens because of the negative coefficient on *tenure^{2}*. The estimated coefficient on *tenure^{2}*, -0.003, estimates the decrease in the average wage with a 1 unit increase in tenure squared, holding *educ* and *exper* constant and ignoring the increase in the average wage from the linear tenure term. The net effect of a one unit increase in *tenure* will be 0.244 minus 0.003 times the change in the square of tenure when tenure increases by 1, which will depend on the value of tenure that the 1 unit increase is computed from; this sum of effects may be either negative or positive.

The estimated coefficient on the linear term for *exper*, 0.187, is the change in the average wage with a 1 unit increase in experience (i.e., number of years of work experience), holding *tenure* and *educ* constant and ignoring the contribution of the *exper^{2}* term when *exper* is increased by 1. The estimated coefficient on *exper^{2}*, -0.004, is the change in the average wage when *exper^{2}* increases by 1. The net effect of an increase in *exper* will be 0.187 minus 0.004 times the change in the square of *exper* when *exper* increases by 1, which will depend on the value of *exper* that the 1 unit increase is computed from; this sum of effects may be either negative or positive.

The estimated coefficient on *educ*, 0.552, is the change in the average wage with a 1 unit increase in *educ* (i.e., number of years of school completed), holding *exper* and *tenure* constant.

**Are the coefficients statistically significant? Explain.**

All of the coefficients except the coefficient on tenure squared are significant at the 0.05 level of significance because their p values are less than 0.05. The tenure squared variable is statistically significant at the 0.10 level of significance because its p value, 0.076, is less than 0.10, but it is not statistically significant at the 0.05 level because 0.076 > 0.05.

**Did including the quadratic form of *experience* improve our model compared to the previous model (in question (2))? Compare R^{2} and adjusted R^{2} within and between models to support your answer.**

Yes, including the *exper^{2}* term improved the fit of the model, as the R^{2} increased from 0.3180 to 0.3468, and the adjusted R^{2} improved from 0.3128 to 0.3405.

**Explain why you think there was or wasn’t an improvement to the model.**

I think adding experience squared improved the model for a couple of reasons. First, all of the estimated coefficients are significant at at least the 0.10 level of significance, and all but tenure squared are significant at less than the 0.001 level of significance. Second, the fit of the model improved, as measured by both R^{2} and adjusted R^{2}. I believe the explanation for why we get a better fit when we add both experience and its square, but not when we add just experience, is that experience and tenure are correlated, while the square of experience is less highly correlated with tenure because squaring is not a linear transformation of experience, and experience squared and tenure squared are apparently not that highly correlated either.

**Use the data birthweight.RDATA and R to answer the following questions. It is known that maternal smoking during pregnancy can lead to adverse effects on the fetus. In this problem we would like to estimate the impact of smoking during pregnancy on live birth weights.**

**4) To begin, we want to estimate the marginal effect of cigarette smoking on birth weight (in ounces) after controlling for the race of the mother (*white*) and sex of the baby (*male*). Use OLS to estimate the following equation:**

**Write the SRF.**

**Interpret the regression (I provide guidance here on the points I expect you to address when I ask you to "interpret the regression" in forthcoming problems): Is the intercept meaningful? If so, explain its interpretation and whether it is statistically significant.**

For the intercept, *cigs* is equal to 0 (i.e., non-smoker), *white* is equal to zero (i.e., non-white mother), and *male* is equal to zero (i.e., female baby). Therefore the intercept, 113.278 ounces (7.08 pounds), is the average birthweight of a female baby of a non-smoking, non-white mother, which is a meaningful observation in the data. As we would expect, this estimate is statistically significantly different from zero, as indicated by the p value of less than 2e-16 (a decimal point followed by 15 zeroes followed by a 2).

**What is the interpretation of the coefficient on *cigs*? Is it economically and/or statistically significant?**

The interpretation of the coefficient on *cigs* is that the average birthweight of a baby of either sex from either a white or non-white mother will decrease by 0.506 ounces for each additional cigarette the mother smokes per day. Half an ounce per cigarette may not seem like a lot, but the change is the same for each additional cigarette, so a mother who smokes 20 cigarettes a day (a pack a day) lowers the average birthweight by about 10 ounces (i.e., 20 × 0.506), which is definitely an economically important amount.

**iii. What is the interpretation of the coefficient on white? (Make sure that your interpretation is in reference to the proper comparison/base group.) Is it economically and/or statistically significant? **

The coefficient on *white* indicates the difference, average birthweight of babies with white mothers minus the average birthweight of babies with non-white mothers, holding the sex of the baby constant across the different mothers (but allowing the constant sex to be either male or female) and holding the number of cigarettes smoked per day constant across mothers (but allowing the number of cigarettes smoked per day to be any reasonable number which might appear in the data). Thus on average, babies of white mothers weigh 6.23 ounces more than babies of non-white mothers, holding the sex of the baby and the number of cigarettes smoked per day constant.

**What is the interpretation of the coefficient on *male*? (Make sure that your interpretation is in reference to the proper comparison/base group.) Is it economically and/or statistically significant?**

The coefficient on *male*, 3.052, implies that the average male baby weighs 3.052 ounces more than the average female baby, holding constant the number of cigarettes per day the mother smokes (for any reasonable number of cigarettes which might appear in the data) and holding constant the race of the mother (but for any race).

**Provide a graph (use visreg()) of the relationship (the regression functions) between *bwght* and *cigs* for white and non-white mothers (holding *male* constant). Explain what you see.**

The R code to generate the graphs consists of a line of code to graph for males for non-white and white mothers and a line of code to graph for females of white and non-white mothers, which follow along with the graphs:

library(visreg)

fit <- lm(bwght ~ cigs + white + male, data = data)

visreg(fit, "cigs", by = "white", type = "conditional", cond = list(male = 1))

visreg(fit, "cigs", by = "white", type = "conditional", cond = list(male = 0))

Male babies

Female babies

Each graph shows the fitted relationship between number of cigarettes smoked per day (*cigs*) and birthweight in ounces (*bwght*), with the left panel showing this relationship for non-white mothers and the right panel showing this relationship for white mothers. The slopes are the same in all four panels, but the intercepts are lower for both sets of female babies than for their male counterparts, and the intercepts are lower for both sets of non-white mothers than their white counterparts. The gray bands in the graphs are for 95% confidence intervals for the predictions.

**Is there a statistical difference in birthweight of babies born to white vs. non-white women for all levels of smoking (*cigs*)?**

From the graph, the 95% confidence intervals do not look as though they would overlap, which suggests there is a difference. From the estimated coefficient on *white*, using the summary command, the t statistic is 4.787 with a p value of 1.88e-06, which implies that there is a statistically significant difference between whites and non-whites at all levels of cigarettes, as long as the sex of the baby and the number of cigarettes are the same across both whites and non-whites.

**Relate this to the coefficient on *white* from the regression. Specifically, why do you think that it is statistically significant when the CIs on the regression lines overlap for a large portion of the range of *cigs*?**

The CIs on the graphs do not appear to overlap.

**What is the expected birth weight of the average female child born to a white mother that smokes 2 cigarettes a day?**

Substitute male = 0, cigs=2, and white = 1 into the SRF to obtain

113.278 - 0.506(2) + 6.23(1) + 3.052(0) = 118.496

**What is the expected birth weight of the average female child born to a white mother that smokes 35 cigarettes a day?**

Substitute male = 0, white = 1 and cigs = 35 into the SRF to obtain

113.278 - 0.506(35) + 6.23(1) + 3.052(0) = 101.798
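Both fitted values can be checked directly from the coefficients reported earlier (intercept 113.278, *cigs* -0.506, *white* 6.23, *male* 3.052). A quick Python check:

```python
# Fitted birthweight from the estimated equation, coefficients from the text:
# bwght-hat = 113.278 - 0.506*cigs + 6.23*white + 3.052*male
def bwght_hat(cigs, white, male):
    return 113.278 - 0.506 * cigs + 6.23 * white + 3.052 * male

print(round(bwght_hat(2, 1, 0), 3))   # 118.496 (part c)
print(round(bwght_hat(35, 1, 0), 3))  # 101.798 (part d)
```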

**Calculate the confidence intervals on the fitted values obtained in parts (c) and (d). Based on these calculations, is there a significant difference in expected birth weight, on average, when the mother smokes 2 versus 35 cigarettes a day during pregnancy (for a white mother giving birth to a girl, as outlined in parts (c) and (d))?**

For 2 cigarettes, white mother, female baby, the R commands and their results are given by the following:

reg1 <- lm(bwght ~ cigs + white + male, data = data)

newdata2 <- data.frame(cigs = 2, white = 1, male = 0)

predict(reg1, newdata2, interval = "predict")

fit lwr upr

1 118.496 79.37637 157.6155

Thus the 95% prediction interval for a white mother who smokes two cigarettes a day with a female baby is [79.37637, 157.6155].

For 35 cigarettes, the 95% interval for prediction is given by the following R commands and their results:

newdata35 <- data.frame(cigs = 35, white = 1, male = 0)

predict(reg1, newdata35, interval = "predict")

fit lwr upr

1 101.8071 62.26082 141.3534

Thus the 95% prediction interval for a white mother who smokes 35 cigarettes per day and has a female baby is [62.26082, 141.3534]. Because the prediction intervals overlap (i.e., the lower bound for 2 cigarettes is less than the upper bound for 35 cigarettes), there is not a statistically significant difference between the two predictions.

**Use the professor data in the statistical program of your choice to answer the following questions. These data contain information on salaries of professors at 9 large Midwestern universities. **

**5) We will examine if male professors earn, on average, more than female professors. **

**On the same graph, compare the distribution of salaries for men and women.**

I chose Stata, and the commands and graph follow:

twoway (histogram salary if female==1, start(10000) width(20000) color(green)) (histogram salary if female==0, start(10000) width(20000) fcolor(none) lcolor(black)), legend(order(1 "Female" 2 "Male")) xtitle(Salary) xlabel(10000(40000)210000)

**Explain what you see.**

The histogram shows salaries in $20,000 increments, starting at $10,000. The solid green is the histogram for women, whereas the black outline with no fill is the histogram for men. Although no females earn below $30,000 and a small number of males do, males have a higher proportion of their salaries at all salary levels greater than $70,000, have a lower proportion of their salaries between $30,000 and $70,000, and account for all of the salaries above $150,000.

**What is the mean, median and standard deviation of salaries for men and women?**

The Stata command and its results follow:

tabstat salary, statistics( count mean sd median ) by(female) columns(statistics)

Summary for variables: salary

by categories of: female

female | N mean sd p50

-------+----------------------------------------

Male | 606 83776.46 28210.02 80043

Female | 58 68319.62 22946.9 61464

-------+----------------------------------------

Total | 664 82426.31 28116.4 78550

------------------------------------------------

**Write the null and alternative hypotheses that represent the question above.**

The null hypothesis is that the population average of male professors is less than or equal to the population average of female professors. The alternative hypothesis is that the population average for male professors is greater than the population average for female professors. This is a one-tailed test.

**What is the critical value for the test?**

For the one tailed test at the α=0.05 level of significance, the critical value is 1.645 if the difference is given as male mean minus female mean, or -1.645 if the difference is taken as female mean minus male mean. In the former case, reject if t > 1.645, in the latter case, reject if t < -1.645.

**Run a t-test (if you use R you should specify the option****var.equal=TRUE****in your****t.test****).**

The Stata command to run the test and its results follow:

ttest salary, by(female)

Two-sample t test with equal variances

------------------------------------------------------------------------------

Group | Obs Mean Std. Err. Std. Dev. [95% Conf. Interval]

---------+--------------------------------------------------------------------

Male | 606 83776.46 1145.954 28210.02 81525.92 86026.99

Female | 58 68319.62 3013.076 22946.9 62286.04 74353.2

---------+--------------------------------------------------------------------

combined | 664 82426.31 1091.128 28116.4 80283.83 84568.79

---------+--------------------------------------------------------------------

diff | 15456.83 3820.476 7955.124 22958.55

------------------------------------------------------------------------------

diff = mean(Male) - mean(Female) t = 4.0458

Ho: diff = 0 degrees of freedom = 662

Ha: diff < 0 Ha: diff != 0 Ha: diff > 0

Pr(T < t) = 1.0000 Pr(|T| > |t|) = 0.0001 Pr(T > t) = 0.0000

Stata computes the difference as male mean minus female mean, so the critical value is 1.645. The t statistic is 4.0458, and the one-tailed p value, Pr(T > t), is less than 0.0001. Therefore we reject the null hypothesis that the population mean of male salary is less than or equal to the population mean of female salary and conclude that the population mean of male salary is greater than the population mean of female salary.
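The t statistic can be reproduced from the summary statistics alone using the pooled (equal-variance) two-sample formula. A quick check in Python (any small discrepancy comes from the rounded means and standard deviations in the table):

```python
import math

# Summary statistics copied from the tabstat/ttest tables above
n_m, mean_m, sd_m = 606, 83776.46, 28210.02
n_f, mean_f, sd_f = 58, 68319.62, 22946.9

# Pooled variance, then the standard error of the difference in means
sp2 = ((n_m - 1) * sd_m**2 + (n_f - 1) * sd_f**2) / (n_m + n_f - 2)
se_diff = math.sqrt(sp2 * (1 / n_m + 1 / n_f))

t_stat = (mean_m - mean_f) / se_diff
print(t_stat)  # close to Stata's 4.0458
```

Note that the pooled variance here is the same quantity as the residual mean square in the dummy-variable regression below, which is why the two approaches give identical t statistics.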

**Run the following regression:**

The Stata command and its results follow:

regress salary i.female, cformat(%9.5fc) pformat(%5.4f) sformat(%8.3f)

Source | SS df MS Number of obs = 664

-------------+------------------------------ F( 1, 662) = 16.37

Model | 1.2647e+10 1 1.2647e+10 Prob > F = 0.0001

Residual | 5.1148e+11 662 772622716 R-squared = 0.0241

-------------+------------------------------ Adj R-squared = 0.0227

Total | 5.2412e+11 663 790532179 Root MSE = 27796

------------------------------------------------------------------------------

salary | Coef. Std. Err. t P>|t| [95% Conf. Interval]

-------------+----------------------------------------------------------------

1.female | -1.55e+04 3.82e+03 -4.046 0.0001 -2.30e+04 -7.96e+03

_cons | 8.38e+04 1.13e+03 74.195 0.0000 8.16e+04 8.60e+04

------------------------------------------------------------------------------

The regression table does not display as much detail as the t test table, but note that the difference between female salaries and male salaries (computed as female salary minus male salary) is the coefficient on 1.female, which is -1.55e+04 = -15,500; this differs from the table above only in showing less detail and in being the difference of the female average minus the male average rather than the male average minus the female average. The t statistic also has fewer significant digits displayed, but the t statistic of -4.046 is the negative of 4.0458, rounded to three decimal places. Clearly these are the same results.

**What can you conclude from the two tests in parts (e) and (f)?**

These are the same estimated sample mean salary difference and t statistics, up to differences in the number of significant digits displayed. Both tests lead to rejection of the null hypothesis that the population average male salary is less than or equal to the population average female salary in favor of the alternative hypothesis that the population average male salary is greater than the population average female salary.

**Compare the results from parts (e) and (f). What do you notice? Explain.**

The constant in the regression is the sample average male salary, displayed with less precision: 8.38e+04, or $83,800, compared with the more precise $83,776.46 from the t test command. The coefficient -1.55e+04 is the same as the “diff” mean in the t table, -15,456.83, up to the precision displayed. Also, observe that the standard error of “diff” equals the standard error of the 1.female coefficient, because the 1.female coefficient measures the difference between the female and male sample means. Finally, although it cannot be seen from the output, the t test in the regression is two-tailed, in contrast to the highlighted one-tailed t test; the p value in the regression table is the same as the p value for the two-tailed test (the center of the three t tests), 0.0001.
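The equivalence between the dummy-variable regression and the difference in group means is easy to verify numerically. A minimal Python sketch with made-up salaries (not the actual data):

```python
import numpy as np

# Toy salaries, group 0 then group 1 (like male/female); illustrative only
y = np.array([80.0, 84.0, 88.0, 70.0, 72.0, 74.0])
d = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])      # dummy, like i.female

# OLS of y on a constant and the dummy
X = np.column_stack([np.ones(len(y)), d])
b0, b1 = np.linalg.lstsq(X, y, rcond=None)[0]
# b0 is the group-0 mean; b1 is the group-1 mean minus the group-0 mean
print(b0, b1)
```

Here the group-0 mean is 84 and the group-1 mean is 72, so the fitted intercept is 84 and the dummy coefficient is -12, exactly the difference in means.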

**Appendix **

**A1 Stata commands for questions 1-3**

set more off

/* #1 */

/* #1a. Use a scatterplot with a lowess curve to show */

/* the relationship between wage and tenure and wage and */

/* tenure squared. The first part of the command, */

/* "quietly" tells Stata that I do not want to see the */

/* table of estimated coefficients, which I use because */

/* the regression itself is not of interest; the */

/* regression residuals are of interest. */

quietly regress wage educ

predict wage_resids, residuals

quietly regress tenure educ

predict tenure_resids, residuals

label variable wage_resids "Residuals of wage on educ"

label variable tenure_resids "Residuals of tenure on educ"

lowess wage_resids tenure_resids, recast(scatter) mcolor(ltblue) msize(vsmall) mfcolor(ltblue) mlcolor(blue) mlwidth(vthin) lineopts(lwidth(medthick)) title("Lowess: Residuals of wage on educ by residuals of tenure on educ") legend(order(3 "Lowess: Residuals of wage on educ by residuals of tenure on educ"))

/* #1b Estimate the regression. */

regress wage educ tenure tenursq, cformat(%9.3f) pformat(%5.3f) sformat(%8.3f)

/* #1.d.ii Estimate the model without the tenure square */

/* variable and compare R squared and adjusted R squared.*/

regress wage educ tenure, cformat(%9.3f) pformat(%5.3f) sformat(%8.3f)
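On the R-squared versus adjusted R-squared comparison in #1.d.ii: R-squared never falls when a regressor is added, while adjusted R-squared applies a degrees-of-freedom penalty. A minimal Python sketch with synthetic data (illustrative names, not WAGE1.dta):

```python
import numpy as np

# Synthetic data; names are illustrative, not WAGE1.dta
rng = np.random.default_rng(2)
n = 60
x1 = rng.normal(size=n)
noise = rng.normal(size=n)                 # an irrelevant extra regressor
y = 2.0 + 1.5 * x1 + rng.normal(size=n)

def r2_stats(y, cols):
    """R-squared and adjusted R-squared for an OLS fit with a constant."""
    X = np.column_stack([np.ones(len(y))] + cols)
    b, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ b
    k = len(cols)                          # regressors, excluding the constant
    tss = (y - y.mean()) @ (y - y.mean())
    r2 = 1 - (e @ e) / tss
    adj = 1 - (1 - r2) * (len(y) - 1) / (len(y) - k - 1)
    return r2, adj

r2_small, adj_small = r2_stats(y, [x1])
r2_big, adj_big = r2_stats(y, [x1, noise])
print(r2_small, adj_small, r2_big, adj_big)
```

The adjusted value is always below the unadjusted one (for any model with at least one regressor and imperfect fit), which is why comparing the two across specifications is informative.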

/* #2 Add the variable exper to the previous equation and */

/* estimate the new equation. Estimate the regression. */

regress wage educ tenure tenursq exper, cformat(%9.3f) pformat(%5.3f) sformat(%8.3f)

pwcorr exper tenure, obs sig

/* #3, add an exper squared term to the regression model */

/* #2. */

regress wage educ tenure tenursq exper expersq, cformat(%9.3f) pformat(%5.3f) sformat(%8.3f)
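The residual-on-residual construction in #1a is an instance of the Frisch-Waugh-Lovell theorem: the slope from regressing the wage residuals on the tenure residuals equals the tenure coefficient in the full multiple regression. A minimal Python check with synthetic data (not WAGE1.dta):

```python
import numpy as np

# Synthetic stand-in for the #1a construction; not the WAGE1.dta data
rng = np.random.default_rng(0)
n = 200
educ = rng.normal(12, 2, n)
tenure = 0.5 * educ + rng.normal(0, 1, n)
wage = 1.0 + 0.6 * educ + 0.8 * tenure + rng.normal(0, 1, n)

# Full multiple regression: wage on a constant, educ, and tenure
X = np.column_stack([np.ones(n), educ, tenure])
beta_full, *_ = np.linalg.lstsq(X, wage, rcond=None)

def resid(y, x):
    """Residuals from regressing y on a constant and x."""
    Z = np.column_stack([np.ones(len(x)), x])
    b, *_ = np.linalg.lstsq(Z, y, rcond=None)
    return y - Z @ b

wage_resids = resid(wage, educ)
tenure_resids = resid(tenure, educ)
# Slope of wage_resids on tenure_resids (no constant needed, since the
# residuals are orthogonal to the constant) equals the tenure coefficient
slope = (tenure_resids @ wage_resids) / (tenure_resids @ tenure_resids)
print(beta_full[2], slope)
```

This is why the scatterplot of the two residual series gives a faithful picture of the wage-tenure relationship with the effect of educ removed.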

**A2 R commands for question 4**

## The working directory is “C:/R”

## Load the data file

setwd("C:/R")

load(file="birthweight.RDATA")

View(data)

## Question 4. Regress birthweight in ounces, bwght, on

## number of cigarettes per day (cigs), an indicator for

## white mother, white, and an indicator for male child,

## (male).

reg1 <- lm(data$bwght~data$cigs+data$white+data$male)

summary(reg1)

## Question 4c, graph the relationship between bwght and cigs

## for white and non-white mothers holding male constant

install.packages("visreg")

library(visreg)

fit <- lm(bwght~cigs+white+male, data=data)

visreg(fit, "cigs", by = "white", type = "conditional", cond=list(male=1))

visreg(fit, "cigs", by = "white", type = "conditional", cond=list(male=0))

## Question 4f. 95% prediction interval for white mother, girl baby, 2 cigs per day

reg1 <- lm(bwght~cigs+white+male, data=data)

newdata2 <- data.frame(cigs=2,white=1,male=0)

predict(reg1, newdata2, interval="prediction")

newdata35 <- data.frame(cigs=35,white=1,male=0)

predict(reg1, newdata35, interval="prediction")
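The prediction interval that R's predict() computes for a new observation has the form yhat0 ± t × s × sqrt(1 + x0'(X'X)^(-1)x0). A minimal Python sketch with synthetic data; it uses 1.96 as a normal approximation to the exact t critical value, and all names are illustrative:

```python
import numpy as np

# Synthetic data; a sketch of the interval predict() computes, with 1.96
# standing in for the exact t critical value (an approximation)
rng = np.random.default_rng(1)
n = 100
x = rng.normal(10, 3, n)
y = 100 - 2.0 * x + rng.normal(0, 5, n)

X = np.column_stack([np.ones(n), x])
beta, *_ = np.linalg.lstsq(X, y, rcond=None)
e = y - X @ beta
s2 = e @ e / (n - 2)                  # residual variance estimate
XtX_inv = np.linalg.inv(X.T @ X)

def pred_interval(x0, z=1.96):
    """Approximate 95% prediction interval at x0 for a new observation."""
    v = np.array([1.0, x0])
    se = np.sqrt(s2 * (1 + v @ XtX_inv @ v))
    yhat = v @ beta
    return yhat - z * se, yhat + z * se

# Intervals widen as x0 moves away from the sample mean of x
print(pred_interval(10.0), pred_interval(25.0))
```

This is why the interval at 35 cigarettes per day, far above typical values in the data, is wider than the interval at 2 cigarettes per day.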

**A3 Stata commands for question 5**

/* #5 */

/* #5a, histogram that compares salaries of men and women. */

twoway (histogram salary if female==1, start(10000) width(20000) color(green)) (histogram salary if female==0, start(10000) width(20000) fcolor(none) lcolor(black)), legend(order(1 "Female" 2 "Male")) xtitle(Salary) xlabel(10000(40000)210000)

/* #5b, mean median and stddev of male and female salaries. */

/* Provide value labels. */

label define FEMALE 0 "Male" 1 "Female"

label values female FEMALE

tabstat salary, statistics( count mean sd median ) by(female) columns(statistics)

/* #5e, t test. */

ttest salary, by(female)

/* #5e, regression. */

regress salary i.female, cformat(%9.5fc) pformat(%5.4f) sformat(%8.3f)
