# Probability and Statistics

Final exam Probability and Statistics

Only the use of a non-graphical calculator and a clean copy of the formula sheet is allowed. This exam consists of fifteen multiple choice questions and three open questions. You should answer the open questions on the exam sheet.

**Multiple choice questions**

**Version A**

1. It is known that a patient will respond to a treatment of a particular disease with a probability of 0.9. If three patients are treated independently, determine the probability that at least one of them will respond.

a. 0.115 b. 0.999 c. 0.987 d. 0.901 e. 0.812 f. 0.534

2. You are going to play two games of chess with an opponent whom, let us assume, you have never played against before. Your opponent is equally likely to be a beginner, an intermediate or a master. Depending on the level of your opponent, your chances of winning an individual game are 90%, 50%, or 30%, respectively. What is the probability of winning the first game?

a. 0.35 b. 0.49 c. 0.66 d. 0.56 e. 0.54 f. 0.45

3. [continuation of the previous question]

Congratulations: you won the first game! Given this information, what is the probability that you will also win the second game? Assume that, given the level of your opponent, the outcomes of the games are independent.

a. 0.67 b. 0.23 c. 0.59 d. 0.76 e. 0.24 f. 0.15

4. A message is sent over a noisy channel. Suppose the message is a sequence of n bits. The channel is assumed to be noisy, thus there is a chance that any bit might be corrupted, resulting in an error (a 0 becomes a 1 and vice versa). Assume that the error events are independent. Let p be the probability that an individual bit has an error. Then the number of corrupted bits in a message can be best described with a distribution function of the type

a. Bernoulli b. Binomial c. Normal d. Geometric e. Exponential f. Poisson

5. The number of fish in StrandjeDelftseHout is assumed to follow a Pois(λ) distribution. Worried that there might be no fish at all, a statistician (no names given) adds one fish to the lake. Let Y be the resulting number of fish (so Y is 1 plus a Pois(λ) random variable). Find E[Y^{2}].

a. λ^{2} + 3λ + 1 b. (λ + 1)^{2} c. λ + 1 d. λ^{2} + λ e. 2λ + 1 f. λ + 2

6. For the independent and identically distributed (i.i.d.) random variables X1, X2, …, Xn with mean µ and variance σ^{2}, find the value n that will ensure that there is at least a 99% chance that the sample mean will be within 2 standard deviations (2σ) of the true mean µ. (Hint: use Chebyshev's inequality.)

a. 100 b. 25 c. 36 d. 44 e. 120 f. 210

7. Let X and Y be two independent and identically distributed positive random variables. Which of the following statement(s) is (are) true?

a. (ii) and (iii) b. (ii) and (iv) c. (iii) and (iv) d. none e. only (ii) f. only (iii)

8. Which of the following statement(s) is (are) true?

(i) Since the population size is always larger than the sample size, the sample statistic can never be larger than the population parameter.

(ii) The interquartile range is not a measure of dispersion.

(iii) As a measure of central location (center of the data), the mean is more influenced by extreme values (outliers) than the median.

(iv) A statistics professor asks students in a class their ages. On the basis of this information, the professor wants to make statements about the average age of all students in the university. The professor made use of a representative sample.

a. none b. only (i) c. only (iii) d. only (iv) e. (i) and (iv) f. (iii) and (iv)

9. Consider the dataset

1, 2, 4, 4, 5, 5, 5, 7, 7, 12, 14.

Which of the boxplots below corresponds to this dataset?

**a.** A **b.** B **c.** C **d.** D **e.** E **f.** F

10. The random variables *X*1 and *X*2 have means 1 and 2 respectively, with the same variance 4 and covariance 0.8. Let *Y* = *X*1 − 3*X*2 and *Z* = *α**X*1 + *X*2. What value of the constant *α* makes *Y* and *Z* uncorrelated?

**a.** 3 **b.** 2.71 **c.** 7 **d.** -2.71 **e.** -3.47 **f.** -7

11. Assume the Vitamin D content of a particular brand of vitamin supplement pills is normally distributed with mean 490 mg and standard deviation 12 mg. What is the approximate probability that a randomly selected pill contains at least 500 mg of Vitamin D?

**a.** 0.79 **b.** 0.83 **c.** 0.05 **d.** 0.51 **e.** 0.02 **f.** 0.20

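12. Let *X* and *Y* be two continuous random variables with joint probability density function

$$f(x, y) = \begin{cases} x^{2} + \tfrac{1}{3}xy, & 0 \le x \le 1,\ 0 \le y \le 2, \\ 0, & \text{otherwise.} \end{cases}$$

Then *E*[*XY*] is equal to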

13. Let *x*1 and *x*2 be two data points which we assume to be realizations of the random variables *X*1 and *X*2. We know that *X*1 and *X*2 have an *N*(*µ*, 2) distribution where *µ* is unknown. We define

$$T = \frac{2X_1 + 4X_2}{2}$$

and we use *T* as an estimator of *µ*. Compute the bias of *T*.

**a.** *µ* **b.** 2*µ* **c.** 3*µ* **d.** *µ* + 1 **e.** 2*µ* + 1 **f.** 3*µ* + 1

14. Which of the following statement(s) is (are) true?

(i) If (*l*n, *u*n) is a 95% confidence interval for *θ*, then *θ* ∈ (*l*n, *u*n) with probability 95%.

(ii) If the p-value is smaller than the significance level, then we have enough evidence to reject the null hypothesis *H*0.

(iii) The regression line obtained by the method of least squares is such that the sum of squared vertical distances from the data points to the line is minimized.

**a.** (i) and (ii) **b.** (i) and (iii) **c.** (ii) and (iii) **d.** only (i) **e.** only (ii) **f.** only (iii)

15. Let *x*1, *x*2, …, *x*n be a dataset modeled as a realization of a random sample *X*1, *X*2, …, *X*n with probability density function $f(x) = \alpha x^{-\alpha-1}$ for $x \ge 1$ (the random variables follow a Pareto distribution). Compute the maximum likelihood estimate of *α*.

**Solution**

1.

Probability that at least one of them will respond = 1 − probability that none of them will respond

= 1 − (0.1)^{3}

= 1 − 0.001 = 0.999

2.

There are three cases: the opponent is a beginner, an intermediate or a master, each with probability 1/3.

By the law of total probability, the probability of winning the first game is (1/3) * (0.9 + 0.5 + 0.3) = (1/3) * 1.7

= 1.7/3 ≈ 0.567, which matches option d (0.56).

3.

Now, given that you won the first game, we need P(win second | win first) = P(win both) / P(win first). The opponent is the same in both games, so the level does not change between games, and given the level the two games are independent.

If the opponent is a beginner, the probability of winning both games is 0.9 * 0.9 = 0.81.

Similarly, if the opponent is an intermediate, the probability of winning both games is 0.5 * 0.5 = 0.25.

And if the opponent is a master, the probability of winning both games is 0.3 * 0.3 = 0.09.

Applying the sum rule over the three levels gives P(win both) = (1/3) * (0.81 + 0.25 + 0.09) = 1.15/3. Dividing by P(win first) = 1.7/3, the probability of also winning the second game is 1.15/1.7 ≈ 0.676, which matches option a (0.67).
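The computations for questions 2 and 3 can be checked with exact fractions. This is only an illustrative sketch; the variable names are chosen here and are not part of the exam.

```python
from fractions import Fraction

# Prior over opponent levels (beginner, intermediate, master)
# and the per-game win probability at each level.
prior = [Fraction(1, 3)] * 3
win = [Fraction(9, 10), Fraction(5, 10), Fraction(3, 10)]

# Law of total probability for one win and for two wins in a row.
p_first = sum(p * w for p, w in zip(prior, win))
p_both = sum(p * w * w for p, w in zip(prior, win))

# Conditional probability of winning game 2 given a win in game 1.
p_second_given_first = p_both / p_first

print(float(p_first), float(p_second_given_first))
```

The exact values are 17/30 for the first game and 23/34 for the second game given a first win.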

4.

The number of corrupted bits follows a binomial distribution: each of the n bits is corrupted independently with the same probability p, so the count of corrupted bits is the number of successes in n independent trials with success probability p and failure probability (1 − p).

5.

Y = 1 + X, where X has a Pois(λ) distribution

E[Y] = 1 + λ

Var(Y) = λ

Now, from the formula Var(Y) = E[Y^{2}] − (E[Y])^{2}

E[Y^{2}] = Var(Y) + (E[Y])^{2}

= λ + (1 + λ)^{2}

= λ^{2} + 3λ + 1
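As a numerical sanity check of the closed form, the expectation can be approximated by summing (1 + k)^2 against the Poisson probabilities. The helper name and the value λ = 2.5 below are arbitrary choices for illustration.

```python
import math

def second_moment_shifted_poisson(lam, terms=200):
    """Approximate E[(1 + X)^2] for X ~ Pois(lam) by truncating the series."""
    total = 0.0
    pmf = math.exp(-lam)  # P(X = 0)
    for k in range(terms):
        total += (1 + k) ** 2 * pmf
        pmf *= lam / (k + 1)  # P(X = k + 1) obtained from P(X = k)
    return total

lam = 2.5  # arbitrary test value
numeric = second_moment_shifted_poisson(lam)
closed_form = lam**2 + 3 * lam + 1
print(numeric, closed_form)
```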

6.

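The sample mean has mean µ and variance σ^{2}/n. Chebyshev's inequality with k standard deviations gives

P(|sample mean − µ| ≥ kσ) ≤ (σ^{2}/n) / (kσ)^{2} = 1/(n k^{2}).

For at least a 99% chance of being within kσ we need 1/(n k^{2}) ≤ 0.01, so the required value of n is (1/k^{2}) * 100, where k is the number of standard deviations.

So the value of n will be (1/4) * 100 = 25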
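The same bound can be evaluated in a couple of lines. This sketch uses exact fractions to avoid rounding at the boundary; the names `k` and `max_miss` are illustrative.

```python
import math
from fractions import Fraction

k = 2                         # allowed deviation, in units of sigma
max_miss = Fraction(1, 100)   # at most a 1% chance of falling outside

# Chebyshev: P(|sample mean - mu| >= k*sigma) <= 1 / (n * k**2),
# so the smallest n with 1 / (n * k**2) <= max_miss is:
n_min = math.ceil(1 / (max_miss * k**2))
print(n_min)
```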

7.

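We know that E[X^{2}] = Var(X) + (E[X])^{2}

So, since Var(X) ≥ 0, the square root of E[X^{2}] is always greater than or equal to E[X], with equality only when Var(X) = 0.

P(X ≥ Y) < P(Y ≥ X) is not correct: X and Y are independent and identically distributed, so by symmetry P(X ≥ Y) = P(Y ≥ X).

E[min(X, Y)] ≤ min(E[X], E[Y])

This is true. Since min(X, Y) ≤ X and min(X, Y) ≤ Y, the expectation of the minimum is at most E[X] and at most E[Y], hence at most the smaller of the two.

E[X/Y] ≤ E[X]/E[Y] is not always true. By independence E[X/Y] = E[X] E[1/Y], and Jensen's inequality gives E[1/Y] ≥ 1/E[Y], so in fact E[X/Y] ≥ E[X]/E[Y].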

8.

(i) False. A sample statistic depends on the values in the sample, not on the size of the sample, so it can be larger or smaller than the population parameter.

(ii) False. The interquartile range is a measure of dispersion.

(iii) True. The mean is the average of all the data, so a single extreme value shifts it, whereas the median is the midpoint of the data sorted in ascending or descending order and is barely affected by outliers.

(iv) False. The students in one statistics class are not a representative sample of all students in the university; they are a convenience sample, and their ages may differ systematically from those of the whole student population.

Hence only (iii) is true.

9.

1, 2, 4, 4, 5, 5, 5, 7, 7, 12, 14

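This dataset has mean 6 and median 5.

The 25^{th} percentile is 4 and the 75^{th} percentile is 7, so the interquartile range is 3 and the upper fence is 7 + 1.5 * 3 = 11.5. The values 12 and 14 lie above it, so the dataset has 2 outliers.

Based on all these facts the most suitable boxplot is the 5^{th} plot (E).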
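The summary values used above can be reproduced with the standard library. Note that quartile conventions differ between textbooks and software; the `"exclusive"` method is an assumption made here, and it happens to reproduce the values 4 and 7.

```python
import statistics

data = [1, 2, 4, 4, 5, 5, 5, 7, 7, 12, 14]

mean = statistics.mean(data)
# "exclusive" is one of several quartile conventions.
q1, median, q3 = statistics.quantiles(data, n=4, method="exclusive")
iqr = q3 - q1

# 1.5 * IQR rule for flagging outliers in a boxplot.
low_fence, high_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = [x for x in data if x < low_fence or x > high_fence]

print(mean, q1, median, q3, outliers)
```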

10.

Y = X1 − 3X2 and Z = αX1 + X2

For Y and Z to be uncorrelated, the covariance of Y and Z should be zero.

Cov(Y, Z) = Cov(X1 − 3X2, αX1 + X2) = 0

α Cov(X1, X1) + Cov(X1, X2) − 3α Cov(X2, X1) − 3 Cov(X2, X2) = 0

Or, 4α + 0.8 − 3α * 0.8 − 3 * 4 = 0

Or, 1.6α = 11.2, so α = 11.2/1.6 = 7
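Because Cov(Y, Z) is linear in α, the equation can also be solved mechanically. A small sketch with exact fractions follows; the names `a` and `b` for the slope and constant term are introduced here for illustration.

```python
from fractions import Fraction

var1 = var2 = Fraction(4)   # Var(X1) = Var(X2) = 4
cov12 = Fraction(8, 10)     # Cov(X1, X2) = 0.8

# Cov(Y, Z) = a * alpha + b, by bilinearity of the covariance.
a = var1 - 3 * cov12        # coefficient of alpha
b = cov12 - 3 * var2        # constant term
alpha = -b / a

print(alpha)
```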

11.

First calculate the z-value by using z = (x − μ)/σ

z = (500 − 490)/12 = 10/12 = 0.8333

P(Z ≤ z) = 0.797

Hence the probability that a randomly selected pill contains at least 500 mg of Vitamin D will be

1 − 0.797 = 0.20 (approx.)
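The table lookup can be replaced by a direct evaluation of the normal distribution function. A minimal sketch using Python's standard library:

```python
from statistics import NormalDist

# Vitamin D content modeled as N(490, 12^2), as stated in the question.
pill = NormalDist(mu=490, sigma=12)
p_at_least_500 = 1 - pill.cdf(500)

print(round(p_at_least_500, 4))
```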

12.

f(x, y) = x^{2} + (1/3)xy if 0 ≤ x ≤ 1, 0 ≤ y ≤ 2, and 0 otherwise.

To find E[XY], compute the double integral of xy * f(x, y) over x from 0 to 1 and y from 0 to 2, which comes out to be 43/54.
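For completeness, the double integral can be worked out term by term, since both terms of the integrand factor into a function of x times a function of y:

```latex
\begin{aligned}
E[XY] &= \int_0^1 \int_0^2 xy \left( x^{2} + \tfrac{1}{3} xy \right) dy \, dx \\
      &= \int_0^1 x^{3} \, dx \int_0^2 y \, dy
         + \tfrac{1}{3} \int_0^1 x^{2} \, dx \int_0^2 y^{2} \, dy \\
      &= \tfrac{1}{4} \cdot 2 + \tfrac{1}{3} \cdot \tfrac{1}{3} \cdot \tfrac{8}{3}
       = \tfrac{1}{2} + \tfrac{8}{27} = \tfrac{43}{54}.
\end{aligned}
```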

13.

T = (2X1 + 4X2)/2

= X1 + 2X2

The bias of T as an estimator of μ is E[T] − μ, so T is unbiased only if E[T] equals μ.

E[T] = E[X1 + 2X2] = E[X1] + 2E[X2]

= μ + 2μ = 3μ

So the bias of T is 3μ − μ = 2μ.

14.

(i) False. Once the data are observed, (ln, un) is a fixed interval and θ is a fixed unknown number, so θ either lies in the interval or it does not. The 95% refers to the procedure: before the data are observed, the random interval covers θ with probability 95%.

(ii) True. If the p-value is smaller than the significance level α, the observed test statistic lies in the critical region, so there is a statistically significant difference and we have enough evidence to reject the null hypothesis H0.

(iii) True. The least squares regression line of y on x is the line that makes the sum of the *squares* of the vertical distances of the data points from the line as small as possible.

15.

f(x) = αx^{−α−1}

L(α) = α^{n} * x1^{−α−1} * x2^{−α−1} * … * xn^{−α−1}

Taking the logarithm gives ln L(α) = n ln(α) − (α + 1) [ln(x1) + ln(x2) + … + ln(xn)].

Now differentiate with respect to α and set the derivative equal to zero.

We get n/α = ln(x1 * x2 * … * xn)

Or, α = n / ln(x1 * x2 * … * xn)

Or, α = n / [ln(x1) + ln(x2) + … + ln(xn)]
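The resulting estimate is easy to compute from data. The function name and the sample below are made up for illustration, and the density is assumed to live on x ≥ 1.

```python
import math

def pareto_alpha_mle(xs):
    """MLE of alpha for the density alpha * x**(-alpha - 1), assumed on x >= 1."""
    log_sum = sum(math.log(x) for x in xs)
    if log_sum <= 0:
        raise ValueError("the sum of log-observations must be positive")
    return len(xs) / log_sum

# Made-up data whose logarithms are 1, 2 and 3.
sample = [math.e, math.e**2, math.e**3]
alpha_hat = pareto_alpha_mle(sample)
print(alpha_hat)
```

With three observations and a log-sum of 6, the estimate is 3/6 = 0.5 up to rounding.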

**Resit Exam Probability and Statistics**

Only the use of a **non-graphical** calculator and a clean copy of the formula sheet is allowed. This exam consists of fifteen multiple choice questions and three open questions. You should answer the open questions on the exam sheet.

**Multiple choice questions**

1. Let *A* and *B* be two events such that *P*(*A* ∪ *B*) = 5/6, *P*(*A* ∩ *B*) = 1/6 and *P*(*A*) = 2*P*(*B*). Compute *P*(*A*).

2. A spam filter is designed by looking at commonly occurring phrases in spam. Suppose that 80% of email is spam. Assume that in 10% of the spam emails, the phrase “free money” is used, whereas it is assumed that this phrase is only used in 1% of non-spam emails. What is the approximate probability that a newly arriving email contains the phrase “free money”?

**a.** 0.10 **b.** 0.25 **c.** 0.08 **d.** 0.01 **e.** 0.24 **f.** 0.44

3. [continuation of the previous question]

A new email has just arrived, which does mention the expression “free money”. What is the approximate probability that it is spam? Hint: use the exact probability computed in the previous exercise.

**a.** 0.25 **b.** 0.89 **c.** 0.76 **d.** 0.99 **e.** 0.54 **f.** 0.98

4. Let *X* and *Y* be random variables with joint probability mass function P(*X* = *a*, *Y* = *b*) given by the table below. Compute E[*X*].

**a.** 1.15 **b.** 1.90 **c.** -0.14 **d.** 0.06 **e.** -2.05 **f.** -0.70

5. A random variable *X* takes only values between 0 and 1 and has the cumulative distribution function The variance of *X* is

6. Which of the following is (are) the graph(s) of a cumulative distribution function *F*?

(i), (ii) and (iii) (i),(v) and (vi), (vii) and (viii)

**c.** (iii) and (iv) **d.** (i), (ii) and (vii)

**e.** (v), (vi) and (viii) **f.** (i), (v), (vii) and (viii)

7. During the third week after this resit, the number of students checking OSIRIS for their grades at the exam can be modelled by a Poisson process with intensity *λ* = 10 students per hour. Find the approximate probability that 2 students check OSIRIS between 10:00 and 10:20.

**a.** 0.5 **b.** 0.25 **c.** 0.20 **d.** 0.30 **e.** 0.35 **f.** 0.45

8. Assume the weights of adults in the Netherlands are approximately normally distributed with mean 84 kg and standard deviation 2.4 kg. What is the approximate probability that a randomly selected person weighs more than 90 kg?

**a.** 0.006 **b.** 0.083 **c.** 0.005 **d.** 0.051 **e.** 0.002 **f.** 0.020

9. Let the random variable *U* have a *U*(0, 1) distribution. We want to draw random numbers from a distribution with density

10. You have a sample of size *n* = 50. You sample with replacement 1000 times to get 1000 bootstrap samples. What is the size of each bootstrap sample?

**a.** 1000 **b.** 50000 **c.** 5000 **d.** 500 **e.** 50 **f.** 5

11. Let *X* be a random variable such that Var(*X*) > 0 and E[*X*] = ln 2. Then

12. Assume *Z*1, *Z*2, *Z*3 are independent standard normally distributed random variables, and let *X* = 3*Z*1 + 2*Z*2 + 3 and *Y* = 3*Z*2 + 2*Z*3 + 4. Then Cov(*X*, *Y*) is

**a.** 20 **b.** 12 **c.** 0 **d.** 6 **e.** 19 **f.** 13

13. Let *Z* be a standard normally distributed random variable and let *Y* = *e*^{*Z*}. Find the median of *Y*.

**a.** 0 **b.** 1.36 **c.** 0.5 **d.** 1.25 **e.** 0.7 **f.** 1

14. Which of the following statement(s) is (are) true?

(i) If a hypothesis test is conducted at level *α* = 0.05, then there is a 5% chance of rejecting the null hypothesis.

(ii) The regression line obtained by the method of least squares must pass through at least one of the data points.

(iii) The boxplot is a graphical tool that indicates the dispersion (spread) and the skewness in the data, and also shows outliers.

(iv) A biased estimator can never have a smaller mean squared error than an unbiased estimator.

**a.** only (i) **b.** only (ii) **c.** only (iii) **d.** only (iv) **e.** (i) and (iii) **f.** (iii) and (iv)

15. Assume you have a random sample from a certain population and form a 96% confidence interval for the population mean *µ*. Let $\bar{x}$ be the sample mean. What quantity is guaranteed by the construction of the confidence intervals to be in these confidence intervals?

**Solution**

1.

P (A ∪ B) = P (A) + P (B) – P(A ∩ B)

5/6 = 2 P(B) + P(B) – 1/6

3 P (B) = 5/6 + 1/6

3 P (B) = 1

P(B) = 1/3

So, P(A) = 2 P(B) = 2/3

2.

The probability that a newly arriving email contains the phrase “free money” is the sum of two cases:

- The mail is spam and contains the phrase: the probability of this case is 0.80 * 0.10 = 0.08
- The mail is not spam and contains the phrase: the probability of this case is 0.20 * 0.01 = 0.002

The total probability is 0.08 + 0.002 = 0.082 ≈ 0.08

3.

Now, given that the mail contains this phrase, we have to find the probability that it is spam. It can be found by using Bayes' theorem.

The required probability is:

(Probability that the mail is spam and contains this phrase) / (total probability that the mail contains this phrase)

= 0.08/0.082

≈ 0.98
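The two-step calculation of questions 2 and 3 (total probability, then Bayes' theorem) can be reproduced exactly. The variable names below are illustrative only.

```python
from fractions import Fraction

p_spam = Fraction(80, 100)
p_phrase_given_spam = Fraction(10, 100)
p_phrase_given_ham = Fraction(1, 100)   # "ham" = non-spam

# Law of total probability: P(phrase).
p_phrase = p_spam * p_phrase_given_spam + (1 - p_spam) * p_phrase_given_ham

# Bayes' theorem: P(spam | phrase).
p_spam_given_phrase = p_spam * p_phrase_given_spam / p_phrase

print(float(p_phrase), float(p_spam_given_phrase))
```

The exact posterior is 40/41, which rounds to 0.98.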

4.

E[X] = 0*(0.1+0.1+0.1) + 1* (0.1+0.2+0.1) + 2* (0.1+0+0.05) + 3* (0.1+0+0.05)

= 0+ 0.4+0.3+0.45 = 1.15

5.

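The population variance for a non-negative random variable can be expressed in terms of the cumulative distribution function *F* using

$$\operatorname{Var}(X) = 2\int_0^{\infty} u\,\bigl(1 - F(u)\bigr)\,du - \left(\int_0^{\infty} \bigl(1 - F(u)\bigr)\,du\right)^{2}.$$

Substituting the given *F* into this formula and evaluating the integrals, we get the value of the variance as 1/18.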

6.

F(x) = 1 − (x − 1)^2

The value of F(x) at x = 0 is 0 and the value of F(x) at x = 1 is 1.

The derivative of F(x) is −2(x − 1) = 2(1 − x), which vanishes at x = 1. That is, at x = 1 the slope is 0.

Based on all these facts it can be seen from the options that (i), (v), (vii) and (viii) are the suitable graphs.

7.

The intensity is λ = 10 students per hour, so for an interval of 20 minutes the expected number of students is µ = 10/3.

Now, using the Poisson distribution, the probability of x students is (µ^x * e^{−µ})/x!. Putting in µ = 10/3 ≈ 3.33 and x = 2, the result is 0.198 ≈ 0.20
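A short numerical check of this Poisson probability follows; the helper `poisson_pmf` is defined here for illustration.

```python
import math

def poisson_pmf(k, mu):
    """P(N = k) for N ~ Pois(mu)."""
    return mu**k * math.exp(-mu) / math.factorial(k)

rate_per_hour = 10
hours = 20 / 60               # the interval 10:00 to 10:20
p_two = poisson_pmf(2, rate_per_hour * hours)

print(round(p_two, 3))
```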

8.

First calculate the z-score = (x − μ)/σ

= (90 − 84)/2.4

= 6/2.4 = 2.5

P(Z ≤ 2.5) = 0.9938

Therefore, P(Z > 2.5) = 1 − 0.9938 = 0.0062 ≈ 0.006

9.

If Step 1: U = rand, then Step 2 will be calculated to be

10.

Each bootstrap sample is drawn with replacement from the original sample and has the same size as the original sample, so the size of each bootstrap sample will be 50.
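To make the sizes concrete, here is a minimal sketch of bootstrap resampling with the standard library. The original sample is made-up normal data; only the sizes matter.

```python
import random

rng = random.Random(0)
original = [rng.gauss(0, 1) for _ in range(50)]   # made-up sample, n = 50

# 1000 bootstrap samples, each drawn with replacement at the original size.
bootstrap_samples = [
    rng.choices(original, k=len(original)) for _ in range(1000)
]

print(len(bootstrap_samples), len(bootstrap_samples[0]))
```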

11.

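Given E[X] = ln 2 and Var(X) > 0.

The function e^{−x} is convex, so by Jensen's inequality

E[e^{−X}] ≥ e^{−E[X]} = e^{−ln 2} = e^{ln(1/2)} = 1/2

Equality E[e^{−X}] = 1/2 would require X to be constant. Since Var(X) > 0 and e^{−x} is strictly convex, the inequality is strict:

E[e^{−X}] > e^{−ln 2} = e^{ln(1/2)} = 1/2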
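A concrete example illustrates the strict inequality. The two-point distribution below is hypothetical, chosen only because it has mean ln 2 and positive variance.

```python
import math

# Hypothetical distribution: X = ln 2 - 1 or ln 2 + 1, each with probability 1/2.
values = [math.log(2) - 1, math.log(2) + 1]
probs = [0.5, 0.5]

mean_x = sum(p * x for p, x in zip(probs, values))
mean_exp_neg_x = sum(p * math.exp(-x) for p, x in zip(probs, values))

print(mean_x, mean_exp_neg_x)
```

Here E[e^{−X}] equals cosh(1)/2, which is larger than 1/2.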

12.

X = 3Z1 + 2Z2 + 3

Y = 3Z2 + 2Z3 + 4

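Cov(X, Y) = Cov(3Z1 + 2Z2 + 3, 3Z2 + 2Z3 + 4)

= Cov(3Z1, 3Z2 + 2Z3 + 4) + Cov(2Z2, 2Z3) + 6 Cov(Z2, Z2)

The first two terms are 0 because Z1, Z2, Z3 are independent, and constants do not contribute to the covariance, so

Cov(X, Y) = 6 Var(Z2)

= 6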
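For linear combinations of independent variables with unit variance, the covariance reduces to a dot product of the coefficient vectors. A tiny sketch, with coefficient lists written out by hand:

```python
# Coefficients on (Z1, Z2, Z3); additive constants do not affect covariance.
x_coeffs = [3, 2, 0]   # X = 3*Z1 + 2*Z2 + 3
y_coeffs = [0, 3, 2]   # Y = 3*Z2 + 2*Z3 + 4

# Independent Z's with Var(Zi) = 1: only matching indices contribute.
cov_xy = sum(a * b for a, b in zip(x_coeffs, y_coeffs))
print(cov_xy)
```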

13.

Since Z is a standard normally distributed random variable, its median (and mean) is 0. The function e^z is strictly increasing, so P(Y ≤ e^0) = P(Z ≤ 0) = 1/2. Hence the median of Y = e^Z is e^0 = 1.
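The same argument can be checked numerically: take the median of Z from the inverse distribution function and push it through the exponential.

```python
import math
from statistics import NormalDist

median_z = NormalDist().inv_cdf(0.5)   # median of the standard normal
median_y = math.exp(median_z)          # a monotone transform maps the median

print(median_z, median_y)
```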

14.

(i) False. The level α does not give the chance of rejecting the null hypothesis in general. Rather, it gives the chance of rejecting the null hypothesis when it is true.

(ii) False. The regression line obtained by the method of least squares need not pass through any of the data points. It may pass through none of the points or it may pass through every point.

(iii) True. The boxplot is a graphical tool that indicates the dispersion (spread) and the skewness in the data, and also shows outliers.

(iv) False. The MSE is the variance plus the squared bias, so a biased estimator with a small variance can have a smaller MSE than an unbiased one. In general, since the MSE is a function of the parameter, there will not be one “best” estimator in terms of MSE. Often the MSE curves of two estimators cross each other: for some parameter values one is better, for other values the other is better.

15.

The confidence interval is given by (sample mean − margin of error, sample mean + margin of error).

So it is guaranteed to contain the sample mean.
