Probability and Statistics

Final exam Probability and Statistics

Only the use of a non-graphical calculator and a clean copy of the formula sheet is allowed. This exam consists of fifteen multiple choice questions and three open questions. You should answer the open questions on the exam sheet.

Multiple choice questions
Version A

1. It is known that a patient will respond to a treatment of a particular disease with
a probability of 0.9. If three patients are treated independently, determine the
probability that at least one of them will respond.
a. 0.115 b. 0.999 c. 0.987 d. 0.901 e. 0.812 f. 0.534
2. You are going to play two games of chess with an opponent whom, let's assume, you
have never played against before. Your opponent is equally likely to be a beginner,
intermediate or a master. Depending on the level of your opponent, your chances
of winning an individual game are 90%, 50%, or 30%, respectively. What is the
probability of winning the first game?
a. 0.35 b. 0.49 c. 0.66 d. 0.56 e. 0.54 f. 0.45
3. [continuation of the previous question]
Congratulations: you won the first game! Given this information, what is the probability that you will also win the second game? Assume that, given the level of your
opponent, the outcomes of the games are independent.
a. 0.67 b. 0.23 c. 0.59 d. 0.76 e. 0.24 f. 0.15
4. A message is sent over a noisy channel. Suppose the message is a sequence of n bits.
The channel is assumed to be noisy, thus there is a chance that any bit might be
corrupted, resulting in an error (a 0 becomes a 1 and vice versa). Assume that the
error events are independent. Let p be the probability that an individual bit has an
error. Then the number of corrupted bits in a message can be best described with
a distribution function of the type
a. Bernoulli b. Binomial c. Normal
d. Geometric e. Exponential f. Poisson
5. The number of fish in StrandjeDelftseHout is assumed to follow a Pois(λ) distribution. Worried that there might be no fish at all, a statistician (no names given) adds
one fish to the lake. Let Y be the resulting number of fish (so Y is 1 plus a Pois(λ)
random variable). Find E[Y²].
a. λ² + 3λ + 1 b. (λ + 1)² c. λ + 1
d. λ² + λ e. 2λ + 1 f. λ + 2
6. For the independent and identically distributed (i.i.d.) random variables X1, X2, ..., Xn
with mean µ and variance σ², find the value n that will ensure that there is at least
a 99% chance that the sample mean will be within 2 standard deviations (2σ) of the
true mean µ. (Hint: use Chebyshev's inequality).
a. 100 b. 25 c. 36 d. 44 e. 120 f. 210
7. Let X and Y be two independent and identically distributed positive random variables. Which of the following statement(s) is (are) true?

a. (ii) and (iii) b. (ii) and (iv) c. (iii) and (iv) d. none
e. only (ii) f. only (iii)
8. Which of the following statement(s) is (are) true?
(i) Since the population size is always larger than the sample size, then the sample
statistic can never be larger than the population parameter.
(ii) The interquartile range is not a measure of dispersion.
(iii) As measure of central location (center of the data), the mean is more influenced
by extreme values (outliers) than the median.
(iv) A statistics professor asks students in a class their ages. On the basis of this
information, the professor wants to make statements about the average age of all
students in the university. The professor made use of a representative sample.
a. none b. only (i) c. only (iii) d. only (iv) e. (i) and (iv) f. (iii) and (iv)
9. Consider the dataset
1,2,4,4,5,5,5,7,7,12,14.
Which of the boxplots below corresponds to this dataset?

a. A b. B c. C d. D e. E f. F
10. The random variables X1 and X2 have means 1 and 2 respectively, with the same variance
4 and covariance 0.8. Let Y = X1 − 3X2 and Z = αX1 + X2. What value of the
constant α makes Y and Z uncorrelated?
a. 3 b. 2.71 c. 7 d. -2.71 e. -3.47 f. -7
    11. Assume the Vitamin D content of a particular brand of vitamin supplement pills is
    normally distributed with mean 490 mg and standard deviation 12 mg. What is
    the approximate probability that a randomly selected pill contains at least 500 mg
    of Vitamin D?
    a. 0.79 b. 0.83 c. 0.05 d. 0.51 e. 0.02 f. 0.20
12. Let X and Y be two continuous random variables with joint probability density function

f(x, y) = x² + (1/3)xy if 0 ≤ x ≤ 1, 0 ≤ y ≤ 2, and 0 otherwise.

Then E[XY] is equal to

13. Let x1 and x2 be two data points which we assume to be realizations of the random
variables X1 and X2. We know that X1 and X2 have N(µ, 2) distribution where µ
is unknown. We define

T = (2X1 + 4X2)/2

and we use T as an estimator of µ. Compute the bias of T.
a. µ b. 2µ c. 3µ d. µ + 1 e. 2µ + 1 f. 3µ + 1
14. Which of the following statement(s) is(are) true?
(i) If (ln, un) is a 95% confidence interval for θ, then θ ∈ (ln, un) with probability
95%.
(ii) If the p-value is smaller than the significance level, then we have enough evidence
to reject the null hypothesis H0.
(iii) The regression line obtained by the method of least squares is such that the sum
of squared vertical distances from the data points to the line is minimized.
a. (i) and (ii) b. (i) and (iii) c. (ii) and (iii)
d. only (i) e. only (ii) f. only (iii)
15. Let x1, x2, ..., xn be a dataset modeled as a realization of a random sample X1, X2, ..., Xn
with probability density function f(x) = αx^(−α−1) (the random
variables follow a Pareto distribution). Compute the maximum likelihood estimate
of α.

Solution 

1.

P(at least one of them responds) = 1 − P(none of them responds)

= 1 − (0.1)³

= 1 − 0.001 = 0.999

2.

There are three cases.

The opponent is a beginner, an intermediate player or a master, each with probability 1/3.

By the law of total probability, the probability of winning the first game is (1/3) * (0.9 + 0.5 + 0.3) = (1/3) * 1.7

= 1.7/3 ≈ 0.567, which is closest to option d (0.56).

3.

Now, given that you won the first game, we need the probability of also winning the second game. The opponent is the same in both games, so the level does not change between games; winning the first game only makes the weaker levels more likely.

Given the level, the two games are independent, so the probability of winning both games is (1/3) * (0.9² + 0.5² + 0.3²) = (1/3) * 1.15.

Dividing by the probability of winning the first game, (1/3) * 1.7, gives

P(win second | win first) = 1.15/1.7 ≈ 0.676, which corresponds to option a (0.67).
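The conditioning argument for questions 2 and 3 can be checked numerically. This is a minimal sketch that assumes, as the question states, that the same opponent plays both games and that the games are independent given the opponent's level; the variable names are illustrative only.

```python
from fractions import Fraction

# Prior over the opponent's level and the win probability per game at each level.
prior = [Fraction(1, 3)] * 3
p_win = [Fraction(9, 10), Fraction(5, 10), Fraction(3, 10)]

# Law of total probability for P(win first) and P(win both games).
# Given the level, the two games are independent, hence p * p.
p_first = sum(q * p for q, p in zip(prior, p_win))
p_both = sum(q * p * p for q, p in zip(prior, p_win))

# Conditional probability P(win second | win first).
p_second_given_first = p_both / p_first

print(float(p_first))               # roughly 0.567
print(float(p_second_given_first))  # roughly 0.676
```

Using exact fractions avoids rounding noise: the two results are 17/30 and 23/34.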

4.

The number of corrupted bits follows a binomial distribution: each of the n bits is corrupted independently with the same probability p, so the number of corrupted bits counts the successes in n independent trials with success probability p and failure probability (1 − p).

5.

Y = 1 + N, where N has a Pois(λ) distribution.

E[Y] = 1 + λ

Var(Y) = Var(N) = λ

Now, from the formula Var(Y) = E[Y²] − (E[Y])²,

E[Y²] = Var(Y) + (E[Y])²

= λ + (1 + λ)²

= λ² + 3λ + 1
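As a sanity check on the closed form λ² + 3λ + 1, the expectation E[(1 + N)²] can be summed directly over the Poisson probabilities. The sketch below is only a numerical check; the function name, the value λ = 2.5 and the truncation at 200 terms are arbitrary choices made for illustration.

```python
import math

def second_moment_shifted_poisson(lam, terms=200):
    """Approximate E[(1 + N)^2] for N ~ Pois(lam) by truncating the series."""
    total = 0.0
    pmf = math.exp(-lam)          # P(N = 0)
    for k in range(terms):
        total += (1 + k) ** 2 * pmf
        pmf *= lam / (k + 1)      # P(N = k + 1) obtained from P(N = k)
    return total

lam = 2.5
closed_form = lam**2 + 3 * lam + 1
print(second_moment_shifted_poisson(lam), closed_form)
```

For λ = 2.5 both numbers should agree with 14.75 up to floating-point error.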

6.

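The sample mean has variance σ²/n, so Chebyshev's inequality gives P(|X̄ − µ| ≥ kσ) ≤ (σ²/n)/(kσ)² = 1/(n k²),

where k is the number of standard deviations. We need 1/(n k²) ≤ 0.01, that is n ≥ (1/k²) * 100.

So with k = 2 the value of n will be (1/4) * 100 = 25.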
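The Chebyshev bound can be turned into a small helper that returns the smallest n for a given number of standard deviations k and a required confidence. This is a sketch under the assumption that the bound used is 1/(n k²) for the sample mean; the function name is hypothetical.

```python
from fractions import Fraction
import math

def chebyshev_sample_size(k, confidence):
    """Smallest n for which Chebyshev guarantees
    P(|sample mean - mu| < k * sigma) >= confidence,
    i.e. the smallest n with 1 / (n * k^2) <= 1 - confidence."""
    alpha = 1 - Fraction(confidence)
    return math.ceil(1 / (Fraction(k) ** 2 * alpha))

print(chebyshev_sample_size(2, Fraction(99, 100)))  # 25
```

Exact fractions are used so that the boundary case (here exactly 25) is not disturbed by floating-point rounding.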

7.

We know that E[X²] = Var(X) + (E[X])².

So the square root of E[X²] is always greater than or equal to E[X], since E[X²] carries the extra non-negative term Var(X).

P(X ≥ Y) < P(Y ≥ X) does not hold: X and Y are independent and identically distributed, so by symmetry the two probabilities are equal.

E[min(X, Y)] ≤ min(E[X], E[Y])

This is always true: min(X, Y) ≤ X and min(X, Y) ≤ Y, so its expectation is at most E[X] and at most E[Y], and hence at most the smaller of the two.

E[X/Y] ≤ E[X]/E[Y] is not true in general: by independence E[X/Y] = E[X] E[1/Y], and Jensen's inequality gives E[1/Y] ≥ 1/E[Y].

8.

(i) Since the population size is always larger than the sample size, then the sample statistic can never be larger than the population parameter. This is false: a sample statistic depends on the values that happen to be in the sample, not on the sample size, so it can fall above or below the population parameter.

(ii) This is false: the interquartile range is a measure of dispersion.

(iii) As measure of central location (centre of the data), the mean is more influenced by extreme values (outliers) than the median. This is true: the mean averages all the values, whereas the median is the middle value of the sorted data, so a few extreme values barely move it.

(iv) A statistics professor asks students in a class their ages. On the basis of this information, the professor wants to make statements about the average age of all students in the university. The professor made use of a representative sample. This is false: the students in one class are not a random sample of all students in the university, so the class cannot be treated as representative of the whole student population.

Hence only (iii) is true.

9.

1, 2, 4, 4, 5, 5, 5, 7, 7, 12, 14

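This dataset has mean 6 and median 5.

The 25th percentile is 4 and the 75th percentile is 7, so the interquartile range is 3. The values 12 and 14 lie more than 1.5 times the interquartile range above the 75th percentile (above 11.5), so they are the 2 outliers.

Based on these facts, the most suitable boxplot is the fifth one (E).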
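The summary numbers behind the boxplot can be reproduced with the standard library. This is a sketch that assumes the usual 1.5 × IQR rule for flagging outliers; quartile conventions differ between textbooks and software, and here the default "exclusive" method of `statistics.quantiles` is used.

```python
import statistics

data = [1, 2, 4, 4, 5, 5, 5, 7, 7, 12, 14]

# Quartiles with the default "exclusive" method of statistics.quantiles.
q1, median, q3 = statistics.quantiles(data, n=4)
iqr = q3 - q1

# Points beyond 1.5 * IQR from the quartiles are drawn as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [x for x in data if x < lower_fence or x > upper_fence]

print(statistics.mean(data), q1, median, q3, outliers)
```

With this convention the quartiles are 4, 5 and 7, and the flagged points are 12 and 14.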

10.

Y = X1 − 3X2 and Z = αX1 + X2

For Y and Z to be uncorrelated, the covariance of Y and Z should be zero.

Cov(Y, Z) = Cov(X1 − 3X2, αX1 + X2)

= α Cov(X1, X1) + Cov(X1, X2) − 3α Cov(X2, X1) − 3 Cov(X2, X2) = 0

Or, 4α + 0.8 − 3α * 0.8 − 3 * 4 = 0

Or, 1.6α = 11.2, so α = 11.2/1.6 = 7
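The bilinearity step can be mirrored in code. The sketch below writes Cov(Y, Z) as c0 + c1·α and solves for the root; `cov_linear` is a hypothetical helper introduced only for this illustration.

```python
from fractions import Fraction

def cov_linear(a, b, var1, var2, cov12):
    """Cov(a[0]*X1 + a[1]*X2, b[0]*X1 + b[1]*X2) expanded by bilinearity."""
    return (a[0] * b[0] * var1
            + (a[0] * b[1] + a[1] * b[0]) * cov12
            + a[1] * b[1] * var2)

var1 = var2 = Fraction(4)
cov12 = Fraction(4, 5)          # covariance 0.8
y = (1, -3)                     # Y = X1 - 3*X2

# Cov(Y, alpha*X1 + X2) = Cov(Y, X2) + alpha * Cov(Y, X1).
c0 = cov_linear(y, (0, 1), var1, var2, cov12)
c1 = cov_linear(y, (1, 0), var1, var2, cov12)
alpha = -c0 / c1

print(alpha)  # 7
```

Plugging the result back in, `cov_linear(y, (alpha, 1), var1, var2, cov12)` evaluates to zero, which confirms that Y and Z are uncorrelated for this α.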

11.

First calculate the z-value by using z = (x − μ)/σ

z = (500 − 490)/12 = 10/12 = 0.8333

P(Z ≤ 0.8333) = 0.797

Hence the probability that a randomly selected pill contains at least 500 mg of Vitamin D will be

1 − 0.797 ≈ 0.20
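Instead of a normal table, the tail probability can be computed with `statistics.NormalDist` from the Python standard library. A minimal sketch, with variable names chosen only for illustration:

```python
from statistics import NormalDist

# Vitamin D content modelled as N(490, 12^2), as stated in the question.
vitamin_d = NormalDist(mu=490, sigma=12)

# P(content >= 500) = 1 - P(content <= 500).
p_at_least_500 = 1 - vitamin_d.cdf(500)

print(round(p_at_least_500, 2))
```

The value is close to 0.20, matching option f up to table rounding.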

12.

f(x, y) = x² + (1/3)xy if 0 ≤ x ≤ 1, 0 ≤ y ≤ 2,

and 0 otherwise.

To find E[XY], integrate xy * f(x, y) over x from 0 to 1 and y from 0 to 2:

E[XY] = ∫∫ (x³y + (1/3)x²y²) dy dx = (1/4) * 2 + (1/3) * (1/3) * (8/3) = 1/2 + 8/27 = 43/54.
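Because the integrand is a polynomial on a rectangle, the double integral factorizes into products of one-dimensional monomial integrals and can be evaluated exactly. The sketch below assumes the density f(x, y) = x² + (1/3)xy on 0 ≤ x ≤ 1, 0 ≤ y ≤ 2 given in this solution; `integral_monomial` is a hypothetical helper.

```python
from fractions import Fraction

def integral_monomial(power, upper):
    """Exact integral of t**power over [0, upper]."""
    return Fraction(upper) ** (power + 1) / (power + 1)

# x*y*f(x, y) = x^3*y + (1/3)*x^2*y^2 on 0 <= x <= 1, 0 <= y <= 2.
e_xy = (integral_monomial(3, 1) * integral_monomial(1, 2)
        + Fraction(1, 3) * integral_monomial(2, 1) * integral_monomial(2, 2))

# Sanity check: the density x^2 + (1/3)*x*y itself should integrate to 1.
total_mass = (integral_monomial(2, 1) * integral_monomial(0, 2)
              + Fraction(1, 3) * integral_monomial(1, 1) * integral_monomial(1, 2))

print(e_xy, total_mass)  # 43/54 1
```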

13.

T = (2X1 + 4X2)/2

= X1 + 2X2

The bias of T as an estimator of μ is E[T] − μ; T is unbiased only if E[T] equals μ.

E[T] = E[X1 + 2X2] = E[X1] + 2E[X2]

= μ + 2μ = 3μ

So the bias of T is 3μ − μ = 2μ.

14.

(i) This is false. Once the data are observed, (ln, un) is a fixed interval and θ is a fixed unknown number, so θ either lies in (ln, un) or it does not. The 95% refers to the random interval (Ln, Un): before the data are collected, it covers θ with probability 95%.

(ii) This is true. A p-value smaller than the significance level α means the observed test statistic falls in the critical region, so we have enough evidence to reject the null hypothesis H0.

(iii) This is true. The least squares regression line of y on x is, by definition, the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.

15.

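f(x) = αx^(−α−1)

L(α) = f(x1) * f(x2) * ... * f(xn) = α^n * (x1 * x2 * ... * xn)^(−α−1)

Now take the logarithm and differentiate with respect to α:

ln L(α) = n ln(α) − (α + 1) * [ln(x1) + ln(x2) + ... + ln(xn)]

Setting the derivative to zero, we get n/α = ln(x1 * x2 * ... * xn)

Or, α = n/ln(x1 * x2 * ... * xn)

Or, α = n/[ln(x1) + ln(x2) + ... + ln(xn)]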
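The closed-form estimate can be checked against the log-likelihood directly. The sketch below assumes the density αx^(−α−1) with observations of at least 1 (the usual support for this Pareto form); the sample values are made up for illustration and the function names are hypothetical.

```python
import math

def pareto_mle(sample):
    """Closed-form MLE of alpha: n divided by the sum of log observations."""
    return len(sample) / sum(math.log(x) for x in sample)

def log_likelihood(alpha, sample):
    """Log-likelihood n*ln(alpha) - (alpha + 1) * sum(ln x_i)."""
    n = len(sample)
    return n * math.log(alpha) - (alpha + 1) * sum(math.log(x) for x in sample)

sample = [1.2, 3.5, 1.1, 2.0, 8.4]   # made-up data, all values >= 1
alpha_hat = pareto_mle(sample)

# The log-likelihood at alpha_hat should beat nearby values of alpha.
print(alpha_hat,
      log_likelihood(alpha_hat, sample) >= log_likelihood(0.9 * alpha_hat, sample),
      log_likelihood(alpha_hat, sample) >= log_likelihood(1.1 * alpha_hat, sample))
```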

Resit Exam Probability and Statistics 

Only the use of a non-graphical calculator and a clean copy of the formula sheet is allowed. This exam consists of fifteen multiple choice questions and three open questions. You should answer the open questions on the exam sheet.
Multiple choice questions

1. Let A and B be two events such that P(A ∪ B) = 5/6, P(A ∩ B) = 1/6 and

P(A) = 2P(B). Compute P(A).

2. A spam filter is designed by looking at commonly occurring phrases in spam. Suppose that 80% of email is spam. Assume that in 10% of the spam emails, the phrase
“free money” is used, whereas it is assumed that this phrase is only used in 1% of
non-spam emails. What is the approximate probability that a newly arriving email
contains the phrase “free money”?
a. 0.10 b. 0.25 c. 0.08 d. 0.01 e. 0.24 f. 0.44
3. [continuation of the previous question]
A new email has just arrived, which does mention the expression “free money”. What
is the approximate probability that it is spam? Hint: use the exact probability
computed in the previous exercise.
a. 0.25 b. 0.89 c. 0.76 d. 0.99 e. 0.54 f. 0.98
4. Let X and Y be random variables with joint probability mass function P(X = a, Y = b)
given by the table below. Compute E[X].

a. 1.15 b. 1.90 c. -0.14 d. 0.06 e. -2.05 f. -0.70
    5. A random variable X takes only values between 0 and 1 and has the cumulative
    distribution function The variance of X is

6. Which of the following is (are) the graph(s) of a cumulative distribution function F .

a. (i), (ii) and (iii) b. (i), (v) and (vi), (vii) and (viii)
c. (iii) and (iv) d. (i), (ii) and (vii)
e. (v), (vi) and (viii) f. (i), (v), (vii) and (viii)
7. During the third week after this resit, the number of students checking OSIRIS for
their grades at the exam can be modelled by a Poisson process with intensity λ = 10
students per hour. Find the approximate probability that 2 students check OSIRIS
between 10:00 and 10:20.
a. 0.5 b. 0.25 c. 0.20 d. 0.30 e. 0.35 f. 0.45
8. Assume the weights of adults in the Netherlands are approximately normally distributed with mean 84 kg and standard deviation 2.4 kg. What is the approximate
probability that a randomly selected person weighs more than 90 kg?
a. 0.006 b. 0.083 c. 0.005 d. 0.051 e. 0.002 f. 0.020
9. Let the random variable U have a U(0, 1) distribution. We want to draw random
numbers from a distribution with density

10. You have a sample of size n = 50. You sample with replacement 1000 times to get
1000 bootstrap samples. What is the size of each bootstrap sample?
a. 1000 b. 50000 c. 5000 d. 500 e. 50 f. 5

11. Let X be a random variable such that Var(X) > 0 and E[X] = ln 2. Then

12. Assume Z1, Z2, Z3 are independent standard normally distributed random variables,

and let X = 3Z1 + 2Z2 + 3 and Y = 3Z2 + 2Z3 + 4. Then Cov(X, Y) is
a. 20 b. 12 c. 0 d. 6 e. 19 f. 13
13. Let Z be a standard normally distributed random variable and let Y = e^Z. Find
the median of Y.
a. 0 b. 1.36 c. 0.5 d. 1.25 e. 0.7 f. 1
14. Which of the following statement(s) is(are) true?
(i) If a hypothesis test is conducted at level α = 0.05, then there is a 5% chance of
rejecting the null hypothesis.
(ii) The regression line obtained by the method of least squares must pass through
at least one of the data points.
(iii) The boxplot is a graphical tool that indicates the dispersion (spread) and the
skewness in the data, and also shows outliers.
(iv) A biased estimator can never have a smaller mean squared error than an unbiased estimator.
a. only (i) b. only (ii) c. only (iii)
d. only (iv) e. (i) and (iii) f. (iii) and (iv)
15. Assume you have a random sample from a certain population and form a 96%
confidence interval for the population mean µ. Let x̄ be the sample mean. What
quantity is guaranteed by the construction of the confidence intervals to be in these
confidence intervals?

Solution

 1.

P (A ∪ B) = P (A) + P (B) – P(A ∩ B)

5/6 = 2 P(B) + P(B) – 1/6

3 P (B) = 5/6 + 1/6

3 P (B) = 1

P(B) = 1/3

So, P(A) = 2 P(B) = 2/3

2.

The event that a newly arriving email contains the phrase “free money” splits into 2 cases:

  1. The mail is spam and contains this phrase; the probability of this case is 0.80 * 0.10 = 0.08
  2. The mail is not spam and contains this phrase; the probability of this case is 0.20 * 0.01 = 0.002

The total probability is 0.08 + 0.002 = 0.082 ≈ 0.08

3.

Now, given that the mail contains this phrase, we have to find the probability that it is spam. It can be found by using Bayes' theorem.

The required probability is:

(probability that a mail is spam and contains this phrase) / (total probability that a mail contains this phrase)

= 0.08/0.082

≈ 0.98

4.

E[X] = 0*(0.1+0.1+0.1) + 1* (0.1+0.2+0.1) + 2* (0.1+0+0.05) + 3* (0.1+0+0.05)

= 0+ 0.4+0.3+0.45 = 1.15

5.

The variance of a non-negative random variable can be expressed in terms of the cumulative distribution function F using

Var(X) = 2 ∫ x (1 − F(x)) dx − (∫ (1 − F(x)) dx)², with both integrals taken over x ≥ 0.

Substituting F(x) into this formula and solving gives a variance of 1/18.

6.

F(x) = 1 − (x − 1)^2

The value of F(x) at x = 0 is 0 and

the value of F(x) at x = 1 is 1.

The derivative of F(x) is −2(x − 1) = 2(1 − x), which vanishes at x = 1. That is, at x = 1 the slope will be 0.

Based on these facts, it can be seen from the options that (i), (v), (vii) and (viii) can be the suitable graphs.

7.

The intensity is λ = 10 students per hour. The interval from 10:00 to 10:20 lasts 20 minutes, that is 1/3 of an hour, so the number of students in it has a Poisson distribution with parameter 10/3.

Using the Poisson probability (μ^x * e^(−μ))/x! with μ = 10/3 ≈ 3.33 and x = 2, the result is 0.198 ≈ 0.20.
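The Poisson calculation is short enough to verify directly. A minimal sketch, where `poisson_pmf` is a hypothetical helper and the 20-minute window is expressed in hours:

```python
import math

def poisson_pmf(k, mu):
    """P(N = k) for N ~ Pois(mu)."""
    return mu ** k * math.exp(-mu) / math.factorial(k)

rate_per_hour = 10
window_hours = 20 / 60                # 10:00 to 10:20
mu = rate_per_hour * window_hours     # expected number of students in the window

print(round(poisson_pmf(2, mu), 3))
```

The result is close to 0.198, which rounds to option c (0.20).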

8.

First calculate the z-score = (x − μ)/σ

= (90 − 84)/2.4

= 6/2.4 = 2.5

P(Z ≤ 2.5) = 0.9938

Therefore, P(Z > 2.5) = 1 − 0.9938 = 0.0062 ≈ 0.006

9.

If Step 1: U = rand, then Step 2 will be calculated to be 

10.

Each bootstrap sample is drawn with replacement from the original sample and has the same size as the original sample, so the size of each bootstrap sample is 50.
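The resampling scheme can be sketched in a few lines. The original data below are placeholder values generated only so that the example runs; what matters is that each bootstrap sample is drawn with replacement and has the original sample's size.

```python
import random

random.seed(0)  # fixed seed so the sketch is reproducible

# Placeholder "original sample" of size n = 50 (values are arbitrary).
original = [random.gauss(0, 1) for _ in range(50)]

# 1000 bootstrap samples, each of size n drawn with replacement.
bootstrap_samples = [random.choices(original, k=len(original))
                     for _ in range(1000)]

print(len(bootstrap_samples), len(bootstrap_samples[0]))  # 1000 50
```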

11.

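Given E[X] = ln 2 and Var(X) > 0.

The function e^(−x) is convex, so by Jensen's inequality there are two possibilities for E[e^(−X)]:

a.

E[e^(−X)] = e^(−ln 2) = e^(ln(1/2)) = 1/2

b.

E[e^(−X)] > e^(−ln 2) = e^(ln(1/2)) = 1/2

Equality (case a) requires X to be constant. Since Var(X) > 0, X is not constant, and e^(−x) is strictly convex, so case b holds: E[e^(−X)] > 1/2.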
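A concrete example makes the comparison with 1/2 tangible. The two-point distribution below is hypothetical and chosen only because it has mean ln 2 and positive variance; it is one instance, not a proof.

```python
import math

# Hypothetical distribution: X = ln 2 - 1 or ln 2 + 1, each with probability 1/2.
values = [math.log(2) - 1, math.log(2) + 1]
probs = [0.5, 0.5]

mean_x = sum(p * x for p, x in zip(probs, values))
var_x = sum(p * (x - mean_x) ** 2 for p, x in zip(probs, values))
e_exp_neg_x = sum(p * math.exp(-x) for p, x in zip(probs, values))

print(mean_x, var_x, e_exp_neg_x)
```

Here E[e^(−X)] equals cosh(1)/2, roughly 0.77, which is strictly larger than e^(−E[X]) = 1/2.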

12.

X = 3Z1 + 2Z2 + 3

Y = 3Z2 + 2Z3 + 4

Cov(X, Y) = Cov(3Z1 + 2Z2 + 3, 3Z2 + 2Z3 + 4)

= Cov(3Z1, 3Z2 + 2Z3 + 4) + Cov(2Z2, 2Z3) + 6 Cov(Z2, Z2)

= 0 + 0 + 6 Var(Z2), because Z1, Z2, Z3 are independent and the constants do not contribute

= 6
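For linear combinations of independent variables with variance 1, the covariance reduces to a dot product of the coefficient vectors. A minimal sketch with illustrative names:

```python
# Coefficients of X and Y on (Z1, Z2, Z3); additive constants do not affect covariance.
x_coef = [3, 2, 0]   # X = 3*Z1 + 2*Z2 + 3
y_coef = [0, 3, 2]   # Y = 3*Z2 + 2*Z3 + 4

# For independent Z_i with variance 1:
# Cov(sum a_i Z_i, sum b_i Z_i) = sum a_i * b_i.
cov_xy = sum(a * b for a, b in zip(x_coef, y_coef))

print(cov_xy)  # 6
```

Only Z2 appears in both X and Y, so it is the only term that contributes.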

13.

Since Z is a standard normally distributed random variable, the mean and the median of Z are both 0. The function e^z is increasing, so P(Y ≤ e^0) = P(Z ≤ 0) = 1/2, and the median of Y = e^Z is e^0 = 1.
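The median of Y = e^Z can be checked by mapping the median of Z through the exponential, which is valid because the exponential is increasing. A minimal sketch using `statistics.NormalDist`:

```python
import math
from statistics import NormalDist

standard_normal = NormalDist()              # mean 0, standard deviation 1

median_z = standard_normal.inv_cdf(0.5)     # median of Z
median_y = math.exp(median_z)               # medians map through increasing functions

# Cross-check: P(Y <= median_y) = P(Z <= ln(median_y)) should be 1/2.
check = standard_normal.cdf(math.log(median_y))

print(median_y, check)
```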

14.

(i) The level α is not the chance of rejecting the null hypothesis in general. Rather, it is the chance of rejecting the null hypothesis when it is true.

(ii) The regression line obtained by the method of least squares need not pass through any of the data points. It may pass through none of the points, or it may pass through every point.

(iii) This is true: the boxplot is a graphical tool that indicates the dispersion (spread) and the skewness in the data, and also shows outliers.

(iv) In general, since the MSE is a function of the parameter, there will not be one “best” estimator in terms of MSE. Often, the MSEs of two estimators cross each other, that is, for some parameter values one is better and for other values the other is better. A biased estimator can therefore have a smaller MSE than an unbiased one.

15.

The confidence interval is given by (sample mean − margin of error, sample mean + margin of error).

So it definitely contains the sample mean.