Regression Coefficient 

The first six questions pertain to results taken from the following paper:

Pascaline Dupas, Do Teenagers Respond to HIV Risk Information? Evidence from a Field Experiment in Kenya, American Economic Journal: Applied Economics

This paper analyzes the impact of an HIV prevention campaign in schools on teenagers’ sexual behavior in Kenya. There is considerable debate over whether school-based HIV / AIDS education programs can be effective in limiting the spread of HIV / AIDS among youths, and over what the content of these programs should be. Many sub-Saharan African countries have incorporated HIV / AIDS education in their school curriculum, but the great majority of those curricula are limited to risk avoidance information; they aim at completely eliminating pre-marital sex by promoting abstinence until marriage. They omit risk reduction information, for example that condom use reduces the risk of HIV transmission.

In 2004, a local non-governmental organization, ICS, implemented an HIV / AIDS prevention program in 71 out of 328 schools in the Busia district of Kenya. The program was a “relative-risk” reduction campaign (otherwise known as RR). In contrast to the standard HIV / AIDS education curriculum in Kenya, which focused on abstinence only, the NGO provided students in the targeted schools with information on the distribution of HIV infections by age and gender, thus informing teens of the “relative risks” of HIV exposure across age groups. The main focus of the program was eighth-grade girls. The rest of the schools did not participate in any HIV / AIDS information campaign.

The author identifies the impact of the HIV / AIDS information campaign on teenagers’ sexual behavior through two approaches. First, she compares rates of childbearing among students in schools that received the RR campaign with rates of childbearing among teenagers in schools that did not. Second, she compares the pre and post differences in childbearing among teenagers in the RR schools with the pre and post differences among teenagers in schools that did not receive the program, using data from before (2003) and after (2005) the program.
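The second approach can be sketched as a two-by-two comparison of group means. The childbearing rates below are made-up placeholders, not figures from the paper, and the function name `diff_in_diff` is hypothetical.

```python
def diff_in_diff(treat_pre, treat_post, comp_pre, comp_post):
    """Return the difference-in-differences of four group means."""
    treat_change = treat_post - treat_pre   # pre/post change in RR schools
    comp_change = comp_post - comp_pre      # pre/post change in comparison schools
    return treat_change - comp_change


# Hypothetical childbearing rates (NOT taken from the paper).
rates = {
    ("RR", 2003): 0.150,
    ("RR", 2005): 0.130,
    ("comparison", 2003): 0.150,
    ("comparison", 2005): 0.155,
}

dd = diff_in_diff(
    rates[("RR", 2003)],
    rates[("RR", 2005)],
    rates[("comparison", 2003)],
    rates[("comparison", 2005)],
)
print(f"difference-in-differences estimate: {dd:+.3f}")
```

Subtracting the comparison schools' change removes any shift between 2003 and 2005 that is common to both groups of schools.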

Table 1 provides an abbreviated table of variable definitions and descriptive statistics. (The paper includes a much larger number of variables, but we keep things simpler.) Table 2 provides an abbreviated version of the paper’s main regression results. Please read both tables carefully.

The author concludes that providing information on the relative risk of HIV infection led to a decrease in teen pregnancy, a proxy for the incidence of unprotected sex. Self-reported sexual behavior data suggest substitution away from older (riskier) partners and toward same-age partners. These results suggest that teenagers are responsive to risk information.

Table 1

Variable Definitions and Descriptive Statistics 

Variable | Definition | Mean (Std. Dev.)
Childbearing | A binary variable equal to 1 if the adolescent girl had started childbearing (i.e., was pregnant) within a year of starting eighth grade | .144 (.05)
RR | A binary variable equal to 1 if a school participated in the “relative risk” reduction program, 0 otherwise | Not reported
Year | A binary variable equal to 1 if the observation was in 2005, 0 if it was in 2003 | Not reported
RR * Year | An interaction term created by multiplying the “RR” variable with the “Year” variable | Not reported
Age | Age (in years) of the adolescent girl | 15.10 (1.2)
Sex | A binary variable equal to 1 if the girl reported having had sex in the past 12 months, 0 otherwise | .21 (.15)

Table 2

Regression Coefficient Estimates

(t-statistics in parentheses)

Dependent variable : childbearing

Variable | Regression 1 | Regression 2
Estimation Method | OLS | Difference-in-Differences OLS
RR | -.015 |
RR * Year | | -.024
Year | | .006
Observations | 5,988 | 10,968
R-squared | .12 | .23

*Indicates significance at the 5 percent level, **Indicates significance at the 1 percent level. All of the regressions in this table are estimated using OLS, not a probit. Regression 1 only uses data from after the program (2005). Regression 2 uses data from both years (2003 and 2005).

  1. Previous econometric studies of the effects of HIV and AIDS information campaigns on adolescents’ sexual behavior and pregnancies have employed simple cross-sectional datasets. Such studies attempted to estimate the ceteris paribus effect of attending an HIV and AIDS information campaign (RR) on pregnancies (as a proxy for sexual behavior) by running a simple cross-sectional regression of the form

$\text{childbearing}_i = \beta_0 + \beta_1 RR_i + \beta_2 X_i + \beta_3 Z_s + u_i$

where X and Z are basic, observed characteristics of students (X) and schools (Z), such as the adolescent’s age and the size of the school. The author of this paper argues that such cross-sectional regressions are likely to produce biased estimates of the ceteris paribus effect of HIV and AIDS information campaigns on sexual behavior (as measured by whether or not the girl is pregnant), because they omit controls for important but difficult-to-measure adolescent characteristics such as how “risky” a person is (we can call this characteristic $R_i$, for individual-specific risk preference). Explain your answer to the following questions as carefully and completely as possible.

  • Write out an equation describing the “true model” that we suspect is relevant in describing the relationship between participating in an HIV / AIDS information campaign (RR) and whether the adolescent is pregnant or has started childbearing (we will use both terms interchangeably). (You may assume the relationship is linear. The task here is just to get the correct things on the right- and left-hand sides of the equation, and to set up the notation you will need later in your answer.)
  • Write out an equation describing the “incorrectly specified model” that we are forced to run in the absence of a good measure of individual-specific risk preferences.
  • Write down the formula for the bias in the estimate of the coefficient on RR that would arise from omitting R from the regression. Use the notation for this specific example, not the general case, and clearly define any necessary additional terms. Which term is the bias term?
  • Under what conditions will our estimate of the coefficient on RR suffer from bias as a result of omitting R from the regression?
  • What guess would you make regarding the likely sign of the bias caused by the omission of R? (Note: You can assume that a higher value of R means that an individual is more “risk-loving”).
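As a generic illustration of how leaving out a correlated regressor shifts a slope estimate, the sketch below builds an artificial dataset and compares the slope from a short regression with the coefficient used to generate the data. Everything here is an assumption made for illustration: the variable names (`x`, `r`, `y`), the coefficient values, and the data-generating process are unrelated to the Kenyan data.

```python
import random


def mean(values):
    return sum(values) / len(values)


def slope(x, y):
    """OLS slope from a simple regression of y on x (with an intercept)."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


random.seed(0)
n = 1000
x = [random.choice([0, 1]) for _ in range(n)]        # included binary regressor
r = [0.5 * xi + random.gauss(0, 1) for xi in x]      # omitted regressor, correlated with x
b1, b2 = -0.02, 0.10                                 # assumed "true" coefficients
# No noise term is added, so the decomposition below holds exactly in the sample.
y = [1.0 + b1 * xi + b2 * ri for xi, ri in zip(x, r)]

short_slope = slope(x, y)   # slope when r is left out of the regression
delta = slope(x, r)         # slope from regressing the omitted variable on x
bias = b2 * delta           # gap between the short-regression slope and b1

print(f"short-regression slope: {short_slope:.4f}")
print(f"true b1 plus bias term: {b1 + bias:.4f}")
```

The two printed numbers coincide, and the gap between them and `b1` disappears if either `b2` or `delta` is zero.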
  2. The author of this paper explains that adolescents’ risk preferences ($R_i$) in Kenya are not time-invariant, but can vary over time ($R_{it}$). She argues that she can minimize omitted variable bias by using a difference-in-differences estimation, taking advantage of the unique panel nature of the dataset.

(a) The author uses a difference-in-differences (DD) estimation in Table 2, Regression 2. DD estimations usually rely upon a natural (or quasi-) experiment. What is the quasi-experiment that the author is referring to in this paper? In other words, who is the treatment group in this experiment? Who is the comparison group?

(b) Identify the “difference-in-differences” variable in Table 2, Regression 2. Referring to the coefficient and t-statistic for this variable, how would you interpret the effect of the RR program on an adolescent’s probability of childbearing? (One well-constructed sentence should suffice. Be sure to read the definitions of the variables).

  3. Now let us temporarily forget about the omitted variable problem and assume that Regression 2 in Table 2 satisfies all of the assumptions required for unbiasedness and efficiency and that the error term is normally distributed. Regression 2 from Table 2:

$\text{childbearing}_i = \beta_0 + \beta_1 RR_i + \beta_2 Year + \beta_3 (RR_i \times Year) + u_i$

Suppose that we wish to test the hypothesis that none of the right-hand side variables matter in the determination of whether the adolescent girl has started childbearing.

  • Write down the null and alternative hypotheses for this test.
  • Write down the unrestricted and restricted regressions that you would need to run to perform the relevant test of the null hypothesis.
  • Write down the formula for the test statistic you would calculate to perform this test, defining all of your notation carefully.
  • Using the above information, write down the critical value to which you would compare this statistic if you wished to test the hypothesis at the 5-percent significance level, and explain how you found the critical value.
  4. Let us use the same assumption that we had in Question 3 (in other words, let us temporarily forget about the omitted variable problem and assume that the regression in Table 2 satisfies all of the assumptions required for unbiasedness and efficiency and that the error term in the regression is normally distributed).
  • Look at the coefficient for RR in Regression 1, Table 2. Note that the term in parentheses is the t-statistic. What is the standard error for the OLS coefficient on RR? (Note: Please show your work in calculating the standard error, not just the final answer.)
  • What is the null hypothesis for the t-statistic on RR in Regression 1, Table 2? What is the alternative hypothesis for the t-statistic? Is this a one-sided or two-sided test?
  • On the basis of the coefficient and t-statistic on RR in Regression 1, Table 2, what conclusions can you draw regarding the “social science importance” and economic significance of the effect of RR on the likelihood of childbearing of adolescent girls? Answer completely and carefully, making explicit reference to any information that influences your conclusions.
  5. Now suppose that we replace the dependent variable in the regression in Table 2 (childbearing) with a self-reported measure of adolescents’ sexual behavior. It seems likely that this would contain some measurement error. That is, we suspect that the observed report of sexual behavior (let’s call it sex*) is the sum of true sexual behavior (sex) plus a measurement error (ε). Please answer the following carefully and completely.

(a) What do we mean by classical measurement error? That is, what conditions must be true of the measurement error ε for it to be called “classical”?

(b) If sexual behavior is indeed measured with error, and the measurement error (ε) is classical, what would be the consequences for OLS estimates of RR in Table 2, Regression 1?

(c) In the case of self-reported sexual behavior of adolescent girls, would you expect the measurement error to be classical? Why or why not? (Note: Be specific and refer to your answer in part a).

  6. By now, you have probably noticed that the regression in Table 2 has a binary dependent variable. Rather than using a probit, the author estimates a “Linear Probability Model” for the regression of childbearing on RR.
  • First, fill in the right-hand side of the following equation for a linear probability model (referring to Table 2, Regression 1).

$\Pr(\text{childbearing}_i = 1 \mid RR, Year) =$

  • What are the problems associated with using OLS to estimate the regression?
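A small numerical sketch can make the behavior of a linear probability model concrete. The six observations below are invented, `x` stands for a hypothetical continuous regressor (not a variable from Table 2), and `ols_line` is a helper written for this illustration only.

```python
def ols_line(x, y):
    """Return (intercept, slope) from a simple OLS regression of y on x."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    b = sxy / sxx
    return my - b * mx, b


# Invented data: a binary outcome regressed on a continuous regressor.
x = [1, 2, 3, 4, 5, 6]
y = [0, 0, 0, 1, 1, 1]

intercept, slope = ols_line(x, y)
fitted = [intercept + slope * xi for xi in x]            # fitted "probabilities"
out_of_range = [p for p in fitted if p < 0 or p > 1]

for xi, p in zip(x, fitted):
    # For a binary outcome, the implied conditional variance is p * (1 - p).
    print(f"x={xi}  fitted={p:.3f}  implied variance={p * (1 - p):.3f}")
print(f"fitted values outside [0, 1]: {len(out_of_range)}")
```

In this toy dataset the first and last fitted values fall outside the unit interval, and the implied variance changes from one observation to the next.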
  7. Look at the STATA printout for a probit regression, employing the dataset on Ethiopian children of primary school age.

The variables employed are defined as follows.

  • Attend = indicator variable equal to 1 if the child attends a school
  • Male = indicator variable equal to 1 if the child is male
  • Age = age of the child in years
  • Toprim = distance from the child’s house to the nearest primary school in kilometers
  • Rexppa = real per capita consumption expenditure in the child’s household, in Birr (Ethiopian currency)
  • If we look only at the printout on the next page and do not make any additional calculations, can we draw any conclusions about the statistical significance of the effect of toprim on the probability that a child attends school? If so, what conclusions do you draw? Explain your answer.
  • If we look at the printout on the next page and do not make any additional calculations, can we draw any conclusions about the “economic importance” of the effect of toprim on the probability that a child attends school? If so, what conclusions do you draw? Explain your answer.
  8. Suppose the true model generating a dataset is:

$\ln(O_i) = \beta_0 + \beta_1 \ln(L_i) + \beta_2 \ln(K_i) + u_i$

And the variance of the error is:

$\mathrm{Var}(u_i) = \sigma^2 \ln(L_i)$

  • When modeling heteroskedasticity, what does h(X) equal in the above equation?
  • Write down the transformed regression that you would estimate using Ordinary Least Squares in order to perform the appropriate Weighted Least Squares estimation “by hand.”
  • If you had to run this Weighted Least Squares estimation “by hand” in STATA, what option would you need to add and why?
  • What is true of the errors in the transformed equation?
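The effect of reweighting on an error term of this kind can be explored with simulated data. The sketch below is a simulation under stated assumptions, not an estimation on any real dataset: the labor values `L`, the value of `sigma`, and the sample size are all invented, and the variance function is taken to be ln(L) as in the equation above.

```python
import math
import random

random.seed(1)
sigma = 0.5
n = 2000

# Hypothetical labor inputs, all above 1 so that ln(L) is positive.
L = [random.uniform(2.0, 50.0) for _ in range(n)]
h = [math.log(v) for v in L]                        # variance function ln(L_i)
e = [random.gauss(0.0, sigma) for _ in range(n)]    # draws with constant variance sigma^2
u = [math.sqrt(hi) * ei for hi, ei in zip(h, e)]    # errors with Var(u_i) = sigma^2 * ln(L_i)

# Weight each observation by 1 / sqrt(ln(L_i)).
w = [1.0 / math.sqrt(hi) for hi in h]
u_star = [wi * ui for wi, ui in zip(w, u)]


def variance(values):
    m = sum(values) / len(values)
    return sum((a - m) ** 2 for a in values) / (len(values) - 1)


# Compare the spread of the errors for low and high values of ln(L).
order = sorted(range(n), key=lambda i: h[i])
low, high = order[: n // 2], order[n // 2:]
print("raw errors:     ", variance([u[i] for i in low]), variance([u[i] for i in high]))
print("weighted errors:", variance([u_star[i] for i in low]), variance([u_star[i] for i in high]))
```

In a full weighted regression the same weight would multiply every term of the equation, including the constant.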