Using probability to check the possibility of an event happening
Here, we use probability to see the connection between sleep and stroke. Also, we will use probability to see the connection between the two types of cancers.
PROBLEM 1:
a) A number of studies have been conducted to investigate the association between sleep duration and risk of stroke in middle-aged and older adults. One such study was conducted in a sample of 31,000 participants with an average age of 62 years. Information on sociodemographic characteristics and sleep duration were obtained by a self-administered questionnaire. The study took place over 6 years, during which 1,557 cases of stroke were documented. The risk of stroke was higher in participants who reported sleeping at least 9 hours per night, in comparison to the group who reported sleeping 7 to 8 hours per night; the relative risk was 1.23.
A news article reporting the results stated: “Sleeping a lot may increase the risk for stroke. A new study has found that compared with sleeping seven to eight hours a night, sleeping nine or more hours increased the relative risk for stroke by 23 percent. Maintaining appropriate sleep duration might hold great promise as primary prevention of stroke.”
Write a short response to the newspaper editor explaining clearly why the article is potentially misleading. Be sure to use language accessible to a general audience without a statistics background. Limit your answer to at most six sentences.
b) Suppose that you have purchased a 40-piece box of saltwater taffy; taffy is a type of soft, chewy candy that became popular in the United States in the 1800s. The box is advertised as containing a random assortment of 8 flavors. Calculate the probability that at least one flavor is missing from the box. If using an algebraic approach, explain your reasoning and any necessary assumptions. If using a simulation approach, be sure to clearly comment on any code and describe your logic.
c) Accurately distinguishing lung cancer from benign lung disease remains challenging, even with the use of imaging scans; computed tomography (CT) scans are known to have high sensitivity but poor specificity for lung cancer diagnosis. Tumor markers, molecules produced by a tumor associated with cancer or by the body in response to cancer, may be useful for clinical diagnosis. Consider two tumor markers for lung cancer, CYFRA 21-1 and CEA, which tend to be elevated in patients with lung cancer relative to those with benign lung disease. A study was conducted on patients with known lung cancer status to assess how these tumor markers could be used for clinical diagnosis. The study team observed that in patients with lung cancer, CYFRA 21-1 is normally distributed with a mean of 4.7 ng/mL and a standard deviation of 9.2 ng/mL while CEA is normally distributed with a mean of 5.9 ng/mL and standard deviation 19.8 ng/mL. In patients with benign lung disease, CYFRA 21-1 is normally distributed with a mean of 1.6 ng/mL and a standard deviation of 4.3 ng/mL while CEA is normally distributed with mean 2.2 ng/mL and standard deviation 5.3 ng/mL.
Use the data from this study to answer the following questions.
i. Compute the sensitivity and specificity of a diagnostic test based on classifying patients with CYFRA 21-1 level greater than 3.3 ng/mL as having lung cancer.
ii. Compute the sensitivity and specificity of a diagnostic test based on classifying patients with CEA levels greater than 5.0 ng/mL as having lung cancer.
iii. Explain the reasoning behind why a diagnostic test with low sensitivity may not be recommended for use in the general population but appropriate for use in high-risk groups, such as patients presenting with several risk factors or symptoms strongly predictive of lung cancer. Use language accessible to someone who has not taken a statistics course. Limit your answer to no more than six sentences.
iv. Suppose a high-risk patient is tested for elevated CYFRA 21-1 level and found to have
CYFRA 21-1 level below the cutoff in part i. Explain whether it seems reasonable to rule out lung cancer for this patient based on this test result and the test features computed in part i. Limit your answer to no more than six sentences, referencing numerical results as necessary. The study team is interested in whether a diagnostic test based on both CYFRA 21-1 and CEA is an improvement over tests based solely on one of the markers. Suppose that a patient is classified as having lung cancer if at least one of the markers is above the cutoffs used in parts i. and ii; i.e., a patient tests positive for lung cancer if CYFRA 21-1 level is greater than 3.3 ng/mL, CEA level is greater than 5.0 ng/mL, or both are elevated.
v. Compute the sensitivity and specificity of this diagnostic test. State any assumptions necessary to make the calculation and comment on whether those assumptions seem reasonable.
vi. Does the diagnostic test based on both markers represent an improvement over the tests in parts i. and ii. for use in high-risk patients? Explain your answer, referencing numerical results to support your reasoning.
Solution
a) “Dear Sir or madam. With reference to your article about the risks of longer sleeping hours to have a heart stroke, I am writing to clarify a few points that I do not agree with. As a statistics graduate, I would like to point out that the concept of relative risk is more complicated to express as an increase in percentage. One would prefer to express relative risk in terms, risks of people with longer sleeping hours are more by factor 1.23 higher than those with shorter hours. The other issue why the conclusion could be misleading is the fact it is a self-administered questionnaire. That means there are could be measurement errors or the sample given the older age of the sample could be not representative. For example, if people in the category longer sleeping hours are older than those with shorter sleeping hours, then the conclusion is completely wrong. In that respect, I would like to suggest taking extra care when trying to interpret non-scientific or reviewed studies.”
We need to find the probability that in the 40-piece box at least one of the 8 flavors is not included. we calculate first the probability that the 1st flavor, F_1 is missing. This is to say that the probability that each piece does not contain the 1st flavor.
P(s_1≠F_1 and s_2≠F_1⋯)=P(〖∩┴40〗┬(i=1) s_i≠F_1 )
The probability intersection of 40 independent events is the product of the following event: any sample does not contain the 1st flavor, which is given by P(s_i≠F_1 )=7/8 and hence the above probability is the product of all these events:
P(s_1≠F_1 and s_2≠F_1⋯)=∏_(i=1)^40▒P(s_i≠F_1 ) =∏_(i=1)^40▒7/8=(7/8)^40
Next, we calculate this probability for the other 7 flavors:
P(s_1≠F_1 and s_2≠F_1⋯)+P(s_1≠F_2 and s_2≠F_2⋯)+⋯=8(7/8)^40≈0.0383
Computing sensitivity and specificity with a cutpoint of 3.3 ng/mL as having lung cancer.
N=10^5
dat=data.frame(cancer=rep(c("YES","NO"),each=N/2))
dat[dat$cancer=="YES","CYFRA"]=rnorm(N/2,mean=4.7,sd=9.2)
dat[dat$cancer=="NO","CYFRA"]=rnorm(N/2,mean=1.6,sd=4.3)
dat[dat$cancer=="YES","CEA"]=rnorm(N/2,mean=5.9,sd=19.8)
dat[dat$cancer=="NO","CEA"]=rnorm(N/2,mean=2.2,sd=5.3)
dat$CYFRA_Pred=ifelse(dat$CYFRA>=3.3,"YES","NO")
CrossTable(dat$cancer,dat$CYFRA_Pred, prop.c = F,prop.r = T,prop.t = F,
expected = F, cell.layout = F, format="SAS", prop.chisq = F)
##
## ======================================
## dat$CYFRA_Pred
## dat$cancer NO YES Total
## --------------------------------------
## NO 32591 17409 50000
## row prop. 0.652 0.348 0.500
## --------------------------------------
## YES 21983 28017 50000
## row prop. 0.440 0.560 0.500
## --------------------------------------
## Total 54574 45426 100000
## ======================================
cat("\nSensitivity based on diag test CYFRA 21.1=",
sensitivity(factor(dat$CYFRA_Pred),factor(dat$cancer), negative="NO", positive="YES"))
##
## Sensitivity based on diag test CYFRA 21.1= 0.56034
cat("\nSpecifivity based on diag test CYFRA 21.1=",
specificity(factor(dat$CYFRA_Pred),factor(dat$cancer), negative="NO", positive="YES"))
##
## Specifivity based on diag test CYFRA 21.1= 0.65182
ii.Test based on CEA 5.0 ng.
dat$CEA_Pred=ifelse(dat$CEA>=5,"YES","NO")
CrossTable(dat$cancer,dat$CEA_Pred)
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
## ==========================================
## dat$CEA_Pred
## dat$cancer NO YES Total
## ------------------------------------------
## NO 35014 14986 50000
## 1034.948 1488.023
## 0.700 0.300 0.500
## 0.594 0.365
## 0.35 0.15
## ------------------------------------------
## YES 23965 26035 50000
## 1034.948 1488.023
## 0.479 0.521 0.500
## 0.406 0.635
## 0.24 0.26
## ------------------------------------------
## Total 58979 41021 100000
## 0.59 0.41
## ==========================================
cat("\nSensitivity based on diag test CEA 5.0 ng =",
sensitivity(factor(dat$CEA_Pred),factor(dat$cancer), negative="NO", positive="YES"))
##
## Sensitivity based on diag test CEA 5.0 ng = 0.5207
cat("\nSpecifivity based on diag test CEA 5.0 ng =",
specificity(factor(dat$CEA_Pred),factor(dat$cancer), negative="NO", positive="YES"))
##
## Specifivity based on diag test CEA 5.0 ng = 0.70028
We generate data from a large sample with an equal number of lung cancer patients
Diagnostic tests with low sensitivity imply that in the general population such tests do correctly identify enough risky patients, therefore such criteria are crucial for large-scale diagnostics. On the other hand, considering only high-risk populations such diagnostic tests might be effective because these types of misclassification could be much less among such high-risk groups.
iv. If a patient has a CYFRA-level below the considered cut-off 3.3, then given the specificity of this diagnostic test there is at least a 65% chance that the patient has no cancer. This can be read from the definition of specificity which can be translated as a proportion of correctly identified non-risk patients below the cut-off.
v. Here we set the rule for predicted classification of lung cancer if at least one of the cut-offs or rules in (i) and (ii) apply.
dat$Joint_Pred=factor(ifelse(dat$CEA>=5|dat$CYFRA>=3.3,"YES","NO"))
CrossTable(dat$cancer,dat$Joint_Pred)
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
## ==========================================
## dat$Joint_Pred
## dat$cancer NO YES Total
## ------------------------------------------
## NO 22828 27172 50000
## 2240.838 1124.693
## 0.457 0.543 0.500
## 0.683 0.408
## 0.228 0.272
## ------------------------------------------
## YES 10590 39410 50000
## 2240.838 1124.693
## 0.212 0.788 0.500
## 0.317 0.592
## 0.106 0.394
## ------------------------------------------
## Total 33418 66582 100000
## 0.334 0.666
## ==========================================
cat("\nSensitivity based on the joint diag test =",
sensitivity(factor(dat$Joint_Pred),factor(dat$cancer), negative="NO", positive="YES"))
##
## Sensitivity based on the joint diag test = 0.7882
cat("\nSpecifivity based on the joint diag test =",
specificity(factor(dat$Joint_Pred),factor(dat$cancer), negative="NO", positive="YES"))
##
## Specifivity based on the joint diag test = 0.45656
We observe an increase in specificity but still a lower sensitivity rate. This implies that such diagnostics are good for identifying people not having cancer. But at the same time has poor performance for identifying people having cancer.
PROBLEM 2: ICE CREAM SALES
Congratulations! You have inherited an ice cream truck for the summer of 2022. You will be selling ice cream from the truck every day that summer: 101 days in total, from Memorial Day weekend until Labor Day weekend. This problem will step you through by projecting how much ice cream you expect to sell (as well as how much revenue you expect to make) during the summer.
a) Ice cream sales are known to be very dependent on weather. Suppose that for any particular day, the weather is either rainy or sunny; on average, one-third of the days are rainy and two-thirds of the days are sunny. Compute the expected number of sunny days in summer 2022 and the probability of there being more rainy days than sunny days in summer 2022. You may assume that the weather is independent between days.
b) It is more realistic to think that tomorrow’s weather depends on today’s weather (for all days throughout the summer). Suppose that if it is sunny today, there is an 80% chance that tomorrow will also be sunny; however, if it is rainy today, there is only a 30% chance that tomorrow will be sunny. Additionally, suppose that the weather on ‘Day 0’ (i.e., the day before your ice cream truck opens) is known to be sunny. Based on a simulation with 10,000 replicates, estimate the probability of there being more rainy days than sunny days in summer 2022. Write a brief paragraph outlining the logic of the simulation and clearly comment on your code. The number of ice cream cones sold on a typical day depends on the weather. When it is sunny, the number of ice cream cones sold is approximately normally distributed with a mean of 200 and a standard deviation of 40. When it is rainy, the number of ice cream cones sold is approximately normally distributed with a mean of 120 and a standard deviation of 30. Use the round () function to round to the nearest whole number. You make $2.00 in profit for each ice cream cone sold and your fixed daily operating cost is $300 per day.
c) Based on a simulation with 10,000 replicates, estimate the probability that you lose money on anyone randomly selected day in the summer; write a brief paragraph outlining the logic of the simulation and clearly comment on your code. Assume the weather follows the pattern described in part a).
d) Based on a simulation with 10,000 replicates, estimate the probability that you lose money with your ice cream truck during the summer of 2022; write a brief paragraph outlining the logic of the simulation and clearly comment on your code. Assume the weather follows the pattern described in part b).
Solution
We note that the probability that a single day is sunny follows Bernoulli distribution and the number of sunny days,x_s in 101 days follows Binomial distribution with the success rate of a single sunny day (P(D_1=Sunny)=2/3) and a trial size of 101:
x_s∼Binom(2/3,101)
Next, we are interested in finding the expected value of x_s:
E(x_s )=np=101 2/3≈67.333≈67
The event that there are more sunny days than rainy is the same as the event there are at least 51 sunny days:
P(x_s≥51)=P(x_s=51)+P(x_s=52)+⋯=1-P(x_s≤50)
We use the R-statistical package to calculate this probability (although approximations using incomplete Beta-function exist).
1-pbinom(50,101,2/3)
## [1] 0.9997276
##equivalently
sum(dbinom(51:101,101,2/3))
## [1] 0.9997276
Considering the dependent model
We create the empty vector of 102 days (0 days as first) and then each step draw a Bernouilli sample with a success probability of 0.8 if it the previous day is sunny and 0.3 if the previous day is raining:
P(D_i=S|D_(i_1 )=S)=0.8 P(D_i=S|D_(i_1 )≠S)=0.3
We run collect these samples and count the number of cases where sunny days are more than 51.
set.seed(2344)
##number of simulations
N=10^4
n_sun_sim=rep(NA,N)
for(j in1:N)
{
## we create the empty vector of 102 days (0 day as first)
days=rep(NA,102)
days[1]=1
for(i in2:102)
{
days[i]=ifelse(days[i-1]==1,rbinom(1,1,0.8),rbinom(1,1,0.3))
}
n_sun_sim[j]=sum(days[-1])
}
cat("Probability of more sunny days=",sum(n_sun_sim>=51)/N)
## Probability of more sunny days= 0.886
We are looking for the probability that on a single day which could be rainy or sunny the money is lost. Given the variable profit of $2 and daily fixed cost of $300 this is an event of selling less than 150 cones:
P(Sun and n_c≤149)+P(Rain and n_c≤149)=P(Sun)P(n_c≤149|Sun)+P(Rain)P(n_c≤149|Rain)
The number of cone sales,n_c, is conditional on Bernouilli variable x_s that takes 1 if it is sunny and follows normal distribution.
n_c |x_s=1∼N(200,〖40〗^2 ) n_c | x_s=0∼N(120,〖30〗^2 )
#N=10^4
n_lose_sim=rep(NA,N)
for(j in1:N)
{
x_s=rbinom(1,1,2/3)
n_lose_sim[j]=ifelse(x_s==1,rnorm(1,mean=200,sd=40),rnorm(1,mean=120,sd=30))
}
cat("Proportion of financially unsuccessful days=",sum(n_lose_sim<=149)/N)
## Proportion of financially unsuccessful days= 0.3479
We can actually check validity (or accuracy) of this estimate using the additivity of expectation operator and R:
P(n_c<149)=P(n_c<149|x_s=1)*P(x_s=1)+P(n_c<149|x_s=0)*P(x_s=0)
pnorm(149,mean=200,sd=40)*2/3+pnorm(149,mean=120,sd=30)*1/3
## [1] 0.3451513
In order to calculate of losing money during the whole year of 2002, We extend the simulation.
n_balance_sim=rep(NA,N)
for(j in1:N)
{
## we create the empty vector of 102 days (0 day as first)
days=rep(NA,102)
balance=rep(NA,102)
balance[1]=0
days[1]=1
for(i in2:102)
{
days[i]=ifelse(days[i-1]==1,rbinom(1,1,0.8),rbinom(1,1,0.3))
n_c=ifelse(days[i]==1,rnorm(1,mean=200,sd=40),rnorm(1,mean=120,sd=30))
balance[i]=round(n_c)*2-300
}
n_balance_sim[j]=sum(balance)
}
cat("Proportion of financially unsuccessful years=",sum(n_balance_sim<=0)/N)
## Proportion of financially unsuccessful years= 0.0093