This book is both a tutorial and a textbook. It is based on over 15 years of lectures in senior level calculus based courses in probability theory and mathematical statistics at the University of Louisville, USA. This book presents an introduction to probability and mathematical statistics and it is intended for students already having some mathematical background. This book contains more than 350 completely worked out examples and over 165 illustrations. Moreover, this book contains over 450 problems of varying degrees of difficulty to help students master their problem solving skill.

PROBABILITY

AND

MATHEMATICAL STATISTICS

Prasanna Sahoo

Department of Mathematics

University of Louisville

Louisville, KY 40292 USA

THIS BOOK IS DEDICATED TO

AMIT

SADHNA

MY PARENTS, TEACHERS

AND

STUDENTS

not be reproduced in any form or by any means, electronic or mechanical,

including photocopying, recording or any information storage and retrieval

system now known or to be invented, without written permission from the

author.

PREFACE

This book is both a tutorial and a textbook. This book presents an introduc-

tion to probability and mathematical statistics and it is intended for students

already having some elementary mathematical background. It is intended for

a one-year junior or senior level undergraduate or beginning graduate level

course in probability theory and mathematical statistics. The book contains

more material than normally would be taught in a one-year course. This

should give the teacher ﬂexibility with respect to the selection of the content

and level at which the book is to be used. This book is based on over 15

years of lectures in senior level calculus based courses in probability theory

and mathematical statistics at the University of Louisville.

Probability theory and mathematical statistics are difficult subjects both for students to comprehend and for teachers to explain. Despite the publication of a great many textbooks in this field, each one intended to provide an improvement over the previous textbooks, this subject is still difficult to comprehend. A good set of examples makes these subjects easy to understand. For this reason alone I have included more than 350 completely worked out examples and over 165 illustrations. I give a rigorous treatment of the fundamentals of probability and statistics using mostly calculus. I have given great attention to the clarity of the presentation of the materials. In the text, theoretical results are presented as theorems, propositions or lemmas, of which as a rule rigorous proofs are given. For the few exceptions to this rule references are given to indicate where details can be found. This book contains over 450 problems of varying degrees of difficulty to help students master their problem solving skills.

In many existing textbooks, the examples following the explanation of a topic are too few in number or too simple to obtain a thorough grasp of the principles involved. Often, in many books, examples are presented in abbreviated form that leaves out much material between steps, and requires that students derive the omitted materials themselves. As a result, students find examples difficult to understand. Moreover, in some textbooks, examples are often worded in a confusing manner. They do not state the problem and then present the solution. Instead, they pass through a general discussion, never revealing what is to be solved for. In this book, I give many examples to illustrate each topic. Often I provide illustrations to promote a better understanding of the topic. All examples in this book are formulated as questions and clear and concise answers are provided in step-by-step detail.

There are several good books on these subjects and perhaps there is

no need to bring a new one to the market. So for several years, this was

circulated as a series of typeset lecture notes among my students who were

preparing for the examination 110 of the Actuarial Society of America. Many

of my students encouraged me to formally write it as a book. Actuarial

students will beneﬁt greatly from this book. The book is written in simple

English; this might be an advantage to students whose native language is not

English.

I cannot claim that all the materials I have written in this book are mine.

I have learned the subject from many excellent books, such as Introduction

to Mathematical Statistics by Hogg and Craig, and An Introduction to Prob-

ability Theory and Its Applications by Feller. In fact, these books have had

a profound impact on me, and my explanations are inﬂuenced greatly by

these textbooks. If there are some similarities, then it is due to the fact

that I could not make improvements on the original explanations. I am very

thankful to the authors of these great textbooks. I am also thankful to the

Actuarial Society of America for letting me use their test problems. I thank

all my students in my probability theory and mathematical statistics courses

from 1988 to 2005 who helped me in many ways to make this book possible

in the present form. Lastly, if it weren't for the inﬁnite patience of my wife,

Sadhna, this book would never get out of the hard drive of my computer.

[...] the author on a Macintosh computer using TeX, the typesetting system [...] ated by the author using MATHEMATICA, a system for doing mathematics [...] designed by Maplesoft. The author is very thankful to the University of Louisville for providing many internal financial grants while this book was under preparation.

Prasanna Sahoo, Louisville

TABLE OF CONTENTS

1. Probability of Events
1.1. Introduction
1.2. Counting Techniques
1.3. Probability Measure
1.4. Some Properties of the Probability Measure
1.5. Review Exercises

2. Conditional Probability and Bayes' Theorem
2.1. Conditional Probability
2.2. Bayes' Theorem
2.3. Review Exercises

3. Random Variables and Distribution Functions
3.1. Introduction
3.2. Distribution Functions of Discrete Variables
3.3. Distribution Functions of Continuous Variables
3.4. Percentile for Continuous Random Variables
3.5. Review Exercises

4. Moments of Random Variables and Chebychev Inequality
4.1. Moments of Random Variables
4.2. Expected Value of Random Variables
4.3. Variance of Random Variables
4.4. Chebychev Inequality
4.5. Moment Generating Functions
4.6. Review Exercises

5. Some Special Discrete Distributions
5.1. Bernoulli Distribution
5.2. Binomial Distribution
5.3. Geometric Distribution
5.4. Negative Binomial Distribution
5.5. Hypergeometric Distribution
5.6. Poisson Distribution
5.7. Riemann Zeta Distribution
5.8. Review Exercises

6. Some Special Continuous Distributions
6.1. Uniform Distribution
6.2. Gamma Distribution
6.3. Beta Distribution
6.4. Normal Distribution
6.5. Lognormal Distribution
6.6. Inverse Gaussian Distribution
6.7. Logistic Distribution
6.8. Review Exercises

7. Two Random Variables
7.1. Bivariate Discrete Random Variables
7.2. Bivariate Continuous Random Variables
7.3. Conditional Distributions
7.4. Independence of Random Variables
7.5. Review Exercises

8. Product Moments of Bivariate Random Variables
8.1. Covariance of Bivariate Random Variables
8.2. Independence of Random Variables
8.3. Variance of the Linear Combination of Random Variables
8.4. Correlation and Independence
8.5. Moment Generating Functions
8.6. Review Exercises

9. Conditional Expectations of Bivariate Random Variables
9.1. Conditional Expected Values
9.2. Conditional Variance
9.3. Regression Curve and Scedastic Curves
9.4. Review Exercises

10. Functions of Random Variables and Their Distribution
10.1. Distribution Function Method
10.2. Transformation Method for Univariate Case
10.3. Transformation Method for Bivariate Case
10.4. Convolution Method for Sums of Random Variables
10.5. Moment Method for Sums of Random Variables
10.6. Review Exercises

11. Some Special Discrete Bivariate Distributions
11.1. Bivariate Bernoulli Distribution
11.2. Bivariate Binomial Distribution
11.3. Bivariate Geometric Distribution
11.4. Bivariate Negative Binomial Distribution
11.5. Bivariate Hypergeometric Distribution
11.6. Bivariate Poisson Distribution
11.7. Review Exercises

12. Some Special Continuous Bivariate Distributions
12.1. Bivariate Uniform Distribution
12.2. Bivariate Cauchy Distribution
12.3. Bivariate Gamma Distribution
12.4. Bivariate Beta Distribution
12.5. Bivariate Normal Distribution
12.6. Bivariate Logistic Distribution
12.7. Review Exercises

13. Sequences of Random Variables and Order Statistics
13.1. Distribution of Sample Mean and Variance
13.2. Laws of Large Numbers
13.3. The Central Limit Theorem
13.4. Order Statistics
13.5. Sample Percentiles
13.6. Review Exercises

14. Sampling Distributions Associated with the Normal Population
14.1. Chi-square distribution
14.2. Student's t-distribution
14.3. Snedecor's F-distribution
14.4. Review Exercises

15. Some Techniques for Finding Point Estimators of Parameters
15.1. Moment Method
15.2. Maximum Likelihood Method
15.3. Bayesian Method
15.4. Review Exercises

16. Criteria for Evaluating the Goodness of Estimators
16.1. The Unbiased Estimator
16.2. The Relatively Efficient Estimator
16.3. The Minimum Variance Unbiased Estimator
16.4. Sufficient Estimator
16.5. Consistent Estimator
16.6. Review Exercises

17. Some Techniques for Finding Interval Estimators of Parameters
17.1. Interval Estimators and Confidence Intervals for Parameters
17.2. Pivotal Quantity Method
17.3. Confidence Interval for Population Mean
17.4. Confidence Interval for Population Variance
17.5. Confidence Interval for Parameter of some Distributions not belonging to the Location-Scale Family
17.6. Approximate Confidence Interval for Parameter with MLE
17.7. The Statistical or General Method
17.8. Criteria for Evaluating Confidence Intervals
17.9. Review Exercises

18. Test of Statistical Hypotheses
18.1. Introduction
18.2. A Method of Finding Tests
18.3. Methods of Evaluating Tests
18.4. Some Examples of Likelihood Ratio Tests
18.5. Review Exercises

19. Simple Linear Regression and Correlation Analysis
19.1. Least Squared Method
19.2. Normal Regression Analysis
19.3. The Correlation Analysis
19.4. Review Exercises

20. Analysis of Variance
20.1. One-way Analysis of Variance with Equal Sample Sizes
20.2. One-way Analysis of Variance with Unequal Sample Sizes
20.3. Pairwise Comparisons
20.4. Tests for the Homogeneity of Variances
20.5. Review Exercises

21. Goodness of Fit Tests
21.1. Chi-Squared test
21.2. Kolmogorov-Smirnov test
21.3. Review Exercises

References

Answers to Selected Review Exercises

Chapter 1

PROBABILITY OF EVENTS

1.1. Introduction

During his lecture in 1929, Bertrand Russell said, "Probability is the most important concept in modern science, especially as nobody has the slightest notion what it means." Most people have some vague ideas about what the probability of an event means. The interpretation of the word probability involves synonyms such as chance, odds, uncertainty, prevalence, risk, expectancy, etc. "We use probability when we want to make an affirmation, but are not quite sure," writes J.R. Lucas.

There are many distinct interpretations of the word probability. A com-

plete discussion of these interpretations will take us to areas such as phi-

losophy, theory of algorithm and randomness, religion, etc. Thus, we will

only focus on two extreme interpretations. One interpretation is due to the

so-called objective school and the other is due to the subjective school.

The subjective school defines probabilities as subjective assignments based on rational thought with available information. Some subjective probabilists interpret probabilities as the degree of belief. Thus, it is difficult to interpret the probability of an event.

The objective school defines probabilities to be "long run" relative frequencies. This means that one should compute a probability by taking the number of favorable outcomes of an experiment and dividing it by the total number of possible outcomes of the experiment, and then taking the limit as the number of trials becomes large. Some statisticians object to the words "long run". The philosopher and statistician John Keynes said "in the long run we are all dead". The objective school uses the theory developed by Von Mises (1928) and Kolmogorov (1965). The Russian mathematician Kolmogorov gave the solid foundation of probability theory using measure theory. The advantage of Kolmogorov's theory is that one can construct probabilities according to the rules, compute other probabilities using axioms, and then interpret these probabilities.

In this book, we will study mathematically one interpretation of prob-

ability out of many. In fact, we will study probability theory based on the

theory developed by the late Kolmogorov. There are many applications of

probability theory. We are studying probability theory because we would

like to study mathematical statistics. Statistics is concerned with the de-

velopment of methods and their applications for collecting, analyzing and

interpreting quantitative data in such a way that the reliability of a con-

clusion based on data may be evaluated objectively by means of probability

statements. Probability theory is used to evaluate the reliability of conclu-

sions and inferences based on data. Thus, probability theory is fundamental

to mathematical statistics.

For an event $A$ of a discrete sample space $S$, the probability of $A$ can be computed by using the formula

$$P(A) = \frac{N(A)}{N(S)}$$

where $N(A)$ denotes the number of elements of $A$ and $N(S)$ denotes the number of elements in the sample space $S$. That is, when the sample space is finite and its outcomes are equally likely, the probability of an event $A$ can be computed by counting the number of elements in $A$ and dividing it by the number of elements in the sample space $S$.
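As a small illustration of the counting formula, the following Python sketch computes $P(A)$ for one roll of a die, where $A$ is the event that an even number is rolled. The function name `probability_by_counting` is ours and not part of the text, and the calculation assumes that all outcomes are equally likely.

```python
from fractions import Fraction

def probability_by_counting(event, sample_space):
    """Return N(A)/N(S), assuming a finite sample space with equally likely outcomes."""
    event = set(event)
    sample_space = set(sample_space)
    if not event <= sample_space:
        raise ValueError("the event must be a subset of the sample space")
    return Fraction(len(event), len(sample_space))

S = {1, 2, 3, 4, 5, 6}   # one roll of a die
A = {2, 4, 6}            # the event "an even number is rolled"
print(probability_by_counting(A, S))   # 1/2
```

Exact fractions are used instead of floating point numbers so that the result matches the hand computation.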

In the next section, we develop various counting techniques. The branch

of mathematics that deals with the various counting techniques is called

combinatorics.

1.2. Counting Techniques

There are three basic counting techniques. They are multiplication rule,

permutation and combination.

1.2.1. Multiplication Rule. If $E_1$ is an experiment with $n_1$ possible outcomes and $E_2$ is an experiment with $n_2$ possible outcomes, then the experiment which consists of performing $E_1$ first and then $E_2$ consists of $n_1 n_2$ possible outcomes.

Example 1.1. Find the possible number of outcomes in a sequence of two

tosses of a fair coin.

Answer: The number of possible outcomes is 2 · 2 = 4. This is evident from

the following tree diagram.

Tree diagram

Example 1.2. Find the number of possible outcomes of the rolling of a die

and then tossing a coin.

Answer: Here $n_1 = 6$ and $n_2 = 2$. Thus by the multiplication rule, the number of possible outcomes is 12.

Tree diagram

Example 1.3. How many different license plates are possible if Kentucky uses three letters followed by three digits?

Answer:

$$(26)^3 (10)^3 = (17576)(1000) = 17{,}576{,}000.$$
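The multiplication rule can be checked by brute-force enumeration. The sketch below lists the outcomes of Example 1.2 with `itertools.product` and recomputes the count of Example 1.3; the variable names are ours, and the plate count assumes the 26 letters of the English alphabet and the 10 decimal digits.

```python
from itertools import product
import string

# Example 1.2: roll a die, then toss a coin.
die = range(1, 7)
coin = ["H", "T"]
outcomes = list(product(die, coin))   # every (die, coin) pair
print(len(outcomes))                  # 12

# Example 1.3: three letters followed by three digits.
n_letters = len(string.ascii_uppercase)   # 26
n_digits = len(string.digits)             # 10
n_plates = n_letters**3 * n_digits**3
print(n_plates)                           # 17576000
```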

1.2.2. Permutation

Consider a set of 4 objects. Suppose we want to ﬁll 3 positions with

objects selected from the above 4. Then the number of possible ordered

arrangements is 24 and they are

abc   bac   cab   dab

abd   bad   cad   dac

acb   bca   cba   dbc

acd   bcd   cbd   dba

adc   bda   cdb   dca

adb   bdc   cda   dcb

The number of possible ordered arrangements can be computed as follows: Since there are 3 positions and 4 objects, the first position can be filled in 4 different ways. Once the first position is filled, the remaining 2 positions can be filled from the remaining 3 objects. Thus, the second position can be filled in 3 ways. The third position can be filled in 2 ways. Then the total number of ways 3 positions can be filled out of 4 objects is given by

$$(4)(3)(2) = 24.$$

In general, if $r$ positions are to be filled from $n$ objects, then the total number of possible ways they can be filled is given by

$$n(n-1)(n-2)\cdots(n-r+1) = \frac{n!}{(n-r)!} = {}_nP_r.$$

Thus, ${}_nP_r$ represents the number of ways $r$ positions can be filled from $n$ objects.

Definition 1.1. Each of the ${}_nP_r$ arrangements is called a permutation of $n$ objects taken $r$ at a time.

Example 1.4. How many permutations are there of all three of the letters a, b, and c?

Answer:

$${}_3P_3 = \frac{n!}{(n-r)!} = \frac{3!}{0!} = 6.$$

Example 1.5. Find the number of permutations of n distinct objects.

Answer:

$${}_nP_n = \frac{n!}{(n-n)!} = \frac{n!}{0!} = n!.$$

Example 1.6. Four names are drawn from the 24 members of a club for the offices of President, Vice-President, Treasurer, and Secretary. In how many different ways can this be done?

Answer:

$${}_{24}P_4 = \frac{(24)!}{(20)!} = (24)(23)(22)(21) = 255{,}024.$$
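The permutation counts above can be reproduced in a few lines of Python. This is only a sketch: the helper `n_permutations` is our own name for the formula $n!/(n-r)!$, and the four objects are labelled a, b, c, d as in the listing of the 24 arrangements.

```python
import math
from itertools import permutations

def n_permutations(n, r):
    """Number of ordered arrangements of r objects chosen from n: n!/(n-r)!."""
    return math.factorial(n) // math.factorial(n - r)

# Filling 3 positions from the 4 objects a, b, c, d.
arrangements = list(permutations("abcd", 3))
print(len(arrangements), n_permutations(4, 3))   # 24 24

print(n_permutations(3, 3))    # Example 1.4: 6
print(n_permutations(24, 4))   # Example 1.6: 255024
print(math.perm(24, 4))        # same value from the standard library (Python 3.8+)
```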

1.2.3. Combination

In permutation, order is important. But in many problems the order of selection is not important and interest centers only on the set of $r$ objects.

Let $c$ denote the number of subsets of size $r$ that can be selected from $n$ different objects. The $r$ objects in each set can be ordered in ${}_rP_r$ ways. Thus we have

$${}_nP_r = c \, ({}_rP_r).$$

From this, we get

$$c = \frac{{}_nP_r}{{}_rP_r} = \frac{n!}{(n-r)! \, r!}.$$

The number $c$ is denoted by $\binom{n}{r}$. Thus, the above can be written as

$$\binom{n}{r} = \frac{n!}{(n-r)! \, r!}.$$

Definition 1.2. Each of the $\binom{n}{r}$ unordered subsets is called a combination of $n$ objects taken $r$ at a time.

Example 1.7. How many committees of two chemists and one physicist can

be formed from 4 chemists and 3 physicists?

Answer:

$$\binom{4}{2}\binom{3}{1} = (6)(3) = 18.$$

Thus 18 different committees can be formed.
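The committee count of Example 1.7 can be confirmed by listing the committees explicitly. In the sketch below the labels C1, ..., C4 and P1, P2, P3 are hypothetical names for the chemists and physicists; the last line checks the relation ${}_nP_r = c \, ({}_rP_r)$ for $n = 4$ and $r = 2$.

```python
import math
from itertools import combinations

chemists = ["C1", "C2", "C3", "C4"]   # hypothetical labels
physicists = ["P1", "P2", "P3"]       # hypothetical labels

# Enumerate every committee of two chemists and one physicist.
committees = [
    (pair, physicist)
    for pair in combinations(chemists, 2)
    for physicist in physicists
]
print(len(committees))                      # 18
print(math.comb(4, 2) * math.comb(3, 1))    # 18

# The relation nPr = c * rPr from the derivation, for n = 4 and r = 2.
print(math.perm(4, 2) == math.comb(4, 2) * math.perm(2, 2))   # True
```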

1.2.4. Binomial Theorem

We know from lower level mathematics courses that

$$\begin{aligned}
(x+y)^2 &= x^2 + 2xy + y^2 \\
&= \binom{2}{0} x^2 + \binom{2}{1} xy + \binom{2}{2} y^2 \\
&= \sum_{k=0}^{2} \binom{2}{k} x^{2-k} y^k.
\end{aligned}$$

Similarly

$$\begin{aligned}
(x+y)^3 &= x^3 + 3x^2 y + 3xy^2 + y^3 \\
&= \binom{3}{0} x^3 + \binom{3}{1} x^2 y + \binom{3}{2} xy^2 + \binom{3}{3} y^3 \\
&= \sum_{k=0}^{3} \binom{3}{k} x^{3-k} y^k.
\end{aligned}$$

In general, using induction arguments, we can show that

$$(x+y)^n = \sum_{k=0}^{n} \binom{n}{k} x^{n-k} y^k.$$

This result is called the Binomial Theorem. The coefficient $\binom{n}{k}$ is called the binomial coefficient. A combinatorial proof of the Binomial Theorem follows. If we write $(x+y)^n$ as the product of $n$ copies of the factor $(x+y)$, that is

$$(x+y)^n = (x+y)(x+y)(x+y)\cdots(x+y),$$

then the coefficient of $x^{n-k} y^k$ is $\binom{n}{k}$, that is, the number of ways in which we can choose the $k$ factors providing the $y$'s.
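The Binomial Theorem can be checked numerically by multiplying out the factor $(x+y)$ repeatedly and comparing the resulting coefficients with $\binom{n}{k}$. The following Python sketch does this for $n = 5$; the function name `expand_binomial` and the choice $n = 5$ are ours.

```python
import math

def expand_binomial(n):
    """Coefficients of (x + y)^n, listed by increasing power of y,
    obtained by multiplying out the factor (x + y) n times."""
    coeffs = [1]                      # (x + y)^0 = 1
    for _ in range(n):
        # Multiplying by (x + y): the x-term keeps each power of y,
        # the y-term raises it by one.
        coeffs = [a + b for a, b in zip(coeffs + [0], [0] + coeffs)]
    return coeffs

n = 5
print(expand_binomial(n))                        # [1, 5, 10, 10, 5, 1]
print([math.comb(n, k) for k in range(n + 1)])   # [1, 5, 10, 10, 5, 1]
```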

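Remark 1.1. In 1665, Newton discovered the Binomial Series. The Binomial Series is given by

$$\begin{aligned}
(1+y)^{\alpha} &= 1 + \binom{\alpha}{1} y + \binom{\alpha}{2} y^2 + \cdots + \binom{\alpha}{n} y^n + \cdots \\
&= 1 + \sum_{k=1}^{\infty} \binom{\alpha}{k} y^k,
\end{aligned}$$

where $\alpha$ is a real number and

$$\binom{\alpha}{k} = \frac{\alpha(\alpha-1)(\alpha-2)\cdots(\alpha-k+1)}{k!}.$$

This $\binom{\alpha}{k}$ is called the generalized binomial coefficient.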

Now, we investigate some properties of the binomial coefficients.

Theorem 1.1. Let $n \in \mathbb{N}$ (the set of natural numbers) and $r = 0, 1, 2, \ldots, n$. Then

$$\binom{n}{r} = \binom{n}{n-r}.$$

Proof: By direct verification, we get

$$\binom{n}{n-r} = \frac{n!}{(n-n+r)! \, (n-r)!} = \frac{n!}{r! \, (n-r)!} = \binom{n}{r}.$$

This theorem says that the binomial coefficients are symmetrical.

Example 1.8. Evaluate $\binom{3}{1} + \binom{3}{2} + \binom{3}{0}$.

Answer: Since the combinations of 3 things taken 1 at a time are 3, we get $\binom{3}{1} = 3$. Similarly, $\binom{3}{0}$ is 1. By Theorem 1.1,

$$\binom{3}{1} = \binom{3}{2} = 3.$$

Hence

$$\binom{3}{1} + \binom{3}{2} + \binom{3}{0} = 3 + 3 + 1 = 7.$$

Theorem 1.2. For any positive integer $n$ and $r = 1, 2, 3, \ldots, n$, we have

$$\binom{n}{r} = \binom{n-1}{r} + \binom{n-1}{r-1}.$$

Proof:

$$\begin{aligned}
(1+y)^n &= (1+y)\,(1+y)^{n-1} \\
&= (1+y)^{n-1} + y\,(1+y)^{n-1},
\end{aligned}$$

so that

$$\begin{aligned}
\sum_{r=0}^{n} \binom{n}{r} y^r &= \sum_{r=0}^{n-1} \binom{n-1}{r} y^r + y \sum_{r=0}^{n-1} \binom{n-1}{r} y^r \\
&= \sum_{r=0}^{n-1} \binom{n-1}{r} y^r + \sum_{r=0}^{n-1} \binom{n-1}{r} y^{r+1}.
\end{aligned}$$

Equating the coefficients of $y^r$ from both sides of the above expression, we obtain

$$\binom{n}{r} = \binom{n-1}{r} + \binom{n-1}{r-1}$$

and the proof is now complete.

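Example 1.9. Evaluate $\binom{23}{10} + \binom{23}{9} + \binom{24}{11}$.

Answer:

$$\begin{aligned}
\binom{23}{10} + \binom{23}{9} + \binom{24}{11} &= \binom{24}{10} + \binom{24}{11} \\
&= \binom{25}{11} \\
&= \frac{25!}{(14)! \, (11)!} \\
&= 4{,}457{,}400.
\end{aligned}$$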
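Theorems 1.1 and 1.2 are easy to spot-check by machine. The sketch below verifies both identities for all $n < 30$ and recomputes Examples 1.8 and 1.9 with Python's `math.comb`; the range of $n$ is an arbitrary choice of ours, and a finite check of this kind is of course not a proof.

```python
import math

# Theorem 1.1 (symmetry) and Theorem 1.2 for small n.
for n in range(1, 30):
    for r in range(0, n + 1):
        assert math.comb(n, r) == math.comb(n, n - r)
    for r in range(1, n + 1):
        # math.comb(n - 1, n) is 0, which covers the case r = n.
        assert math.comb(n, r) == math.comb(n - 1, r) + math.comb(n - 1, r - 1)

# Example 1.8 and Example 1.9.
example_1_8 = math.comb(3, 1) + math.comb(3, 2) + math.comb(3, 0)
example_1_9 = math.comb(23, 10) + math.comb(23, 9) + math.comb(24, 11)
print(example_1_8)   # 7
print(example_1_9)   # 4457400
```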

Example 1.10. Use the Binomial Theorem to show that

$$\sum_{r=0}^{n} (-1)^r \binom{n}{r} = 0.$$

Answer: Using the Binomial Theorem, we get

$$(1+x)^n = \sum_{r=0}^{n} \binom{n}{r} x^r$$

for all real numbers $x$. Letting $x = -1$ in the above, we get

$$0 = \sum_{r=0}^{n} \binom{n}{r} (-1)^r.$$

Theorem 1.3. Let $m$ and $n$ be positive integers. Then

$$\sum_{r=0}^{k} \binom{m}{r} \binom{n}{k-r} = \binom{m+n}{k}.$$

Proof:

$$(1+y)^{m+n} = (1+y)^m \, (1+y)^n$$

$$\sum_{r=0}^{m+n} \binom{m+n}{r} y^r = \left( \sum_{r=0}^{m} \binom{m}{r} y^r \right) \left( \sum_{r=0}^{n} \binom{n}{r} y^r \right).$$

Equating the coefficients of $y^k$ from both sides of the above expression, we obtain

$$\binom{m+n}{k} = \binom{m}{0}\binom{n}{k} + \binom{m}{1}\binom{n}{k-1} + \cdots + \binom{m}{k}\binom{n}{k-k}$$

and the conclusion of the theorem follows.

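Example 1.11. Show that

$$\sum_{r=0}^{n} \binom{n}{r}^2 = \binom{2n}{n}.$$

Answer: Let $k = n$ and $m = n$. Then from Theorem 1.3, we get

$$\begin{aligned}
\sum_{r=0}^{k} \binom{m}{r} \binom{n}{k-r} &= \binom{m+n}{k} \\
\sum_{r=0}^{n} \binom{n}{r} \binom{n}{n-r} &= \binom{2n}{n} \\
\sum_{r=0}^{n} \binom{n}{r} \binom{n}{r} &= \binom{2n}{n} \\
\sum_{r=0}^{n} \binom{n}{r}^2 &= \binom{2n}{n}.
\end{aligned}$$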

Theorem 1.4. Let $n$ be a positive integer and $k = 1, 2, 3, \ldots, n$. Then

$$\binom{n}{k} = \sum_{m=k-1}^{n-1} \binom{m}{k-1}.$$

Proof: In order to establish the above identity, we use the Binomial Theorem together with the following result of elementary algebra

$$x^n - y^n = (x - y) \sum_{k=0}^{n-1} x^k y^{n-1-k}.$$

Note that

$$\begin{aligned}
\sum_{k=1}^{n} \binom{n}{k} x^k &= \sum_{k=0}^{n} \binom{n}{k} x^k - 1 \\
&= (x+1)^n - 1^n && \text{by Binomial Theorem} \\
&= (x + 1 - 1) \sum_{m=0}^{n-1} (x+1)^m && \text{by above identity} \\
&= x \sum_{m=0}^{n-1} \sum_{j=0}^{m} \binom{m}{j} x^j \\
&= \sum_{m=0}^{n-1} \sum_{j=0}^{m} \binom{m}{j} x^{j+1} \\
&= \sum_{k=1}^{n} \sum_{m=k-1}^{n-1} \binom{m}{k-1} x^k.
\end{aligned}$$

Hence equating the coefficients of $x^k$, we obtain

$$\binom{n}{k} = \sum_{m=k-1}^{n-1} \binom{m}{k-1}.$$

This completes the proof of the theorem.
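A quick numerical check of Theorem 1.4 is sketched below. The function name `column_sum` and the values $n = 6$, $k = 3$ are our own choices for illustration.

```python
import math

def column_sum(n, k):
    """Right-hand side of Theorem 1.4: sum of C(m, k-1) for m = k-1, ..., n-1."""
    return sum(math.comb(m, k - 1) for m in range(k - 1, n))

print(column_sum(6, 3), math.comb(6, 3))   # 20 20
```

For $n = 6$ and $k = 3$ the sum is $\binom{2}{2} + \binom{3}{2} + \binom{4}{2} + \binom{5}{2} = 1 + 3 + 6 + 10 = 20$.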

The following result

$$(x_1 + x_2 + \cdots + x_m)^n = \sum_{n_1 + n_2 + \cdots + n_m = n} \binom{n}{n_1, n_2, \ldots, n_m} x_1^{n_1} x_2^{n_2} \cdots x_m^{n_m}$$

is known as the multinomial theorem and it generalizes the binomial theorem. The sum is taken over all nonnegative integers $n_1, n_2, \ldots, n_m$ such that $n_1 + n_2 + \cdots + n_m = n$, and

$$\binom{n}{n_1, n_2, \ldots, n_m} = \frac{n!}{n_1! \, n_2! \cdots n_m!}.$$

This coefficient is known as the multinomial coefficient.
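As a rough check of the multinomial theorem, setting $x_1 = x_2 = \cdots = x_m = 1$ shows that the multinomial coefficients must add up to $m^n$. The Python sketch below confirms this for $n = 4$ and $m = 3$; the helper names `multinomial` and `compositions` are ours.

```python
import math
from itertools import product

def multinomial(*parts):
    """Multinomial coefficient n! / (n1! n2! ... nm!) with n = n1 + ... + nm."""
    result = math.factorial(sum(parts))
    for p in parts:
        result //= math.factorial(p)
    return result

def compositions(n, m):
    """All tuples of m nonnegative integers that add up to n."""
    return [c for c in product(range(n + 1), repeat=m) if sum(c) == n]

# Setting x1 = x2 = x3 = 1 in the theorem gives 3^n as the sum of all coefficients.
n, m = 4, 3
total = sum(multinomial(*c) for c in compositions(n, m))
print(total, m ** n)           # 81 81
print(multinomial(2, 1, 1))    # 12
```

Note that the tuples include zeros, in line with the sum running over nonnegative integers.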

1.3. Probability Measure

A random experiment is an experiment whose outcomes cannot be pre-

dicted with certainty. However, in most cases the collection of every possible

outcome of a random experiment can be listed.

Deﬁnition 1.3. A sample space of a random experiment is the collection of

all possible outcomes.

Example 1.12. What is the sample space for an experiment in which we

select a rat at random from a cage and determine its sex?

Answer: The sample space of this experiment is

$$S = \{M, F\}$$

where $M$ denotes the male rat and $F$ denotes the female rat.

Example 1.13. What is the sample space for an experiment in which the

state of Kentucky picks a three digit integer at random for its daily lottery?

Answer: The sample space of this experiment is

$$S = \{000, 001, 002, \ldots, 998, 999\}.$$

Example 1.14. What is the sample space for an experiment in which we

roll a pair of dice, one red and one green?

Answer: The sample space $S$ for this experiment is given by

{(1,1) (1,2) (1,3) (1,4) (1,5) (1,6)

(2,1) (2,2) (2,3) (2,4) (2,5) (2,6)

(3,1) (3,2) (3,3) (3,4) (3,5) (3,6)

(4,1) (4,2) (4,3) (4,4) (4,5) (4,6)

(5,1) (5,2) (5,3) (5,4) (5,5) (5,6)

(6,1) (6,2) (6,3) (6,4) (6,5) (6,6)}

This set $S$ can be written as

$$S = \{(x, y) \mid 1 \le x \le 6, \; 1 \le y \le 6\}$$

where the integer $x$ represents the number rolled on the red die and the integer $y$ denotes the number rolled on the green die.

Deﬁnition 1.4. Each element of the sample space is called a sample point.

Deﬁnition 1.5. If the sample space consists of a countable number of sample

points, then the sample space is said to be a countable sample space.

Deﬁnition 1.6. If a sample space contains an uncountable number of sample

points, then it is called a continuous sample space.

An event $A$ is a subset of the sample space $S$. It seems obvious that if $A$ and $B$ are events in the sample space $S$, then $A \cup B$, $A^c$, $A \cap B$ are also entitled to be events. Thus precisely we define an event as follows:

Definition 1.7. A subset $A$ of the sample space $S$ is said to be an event if it belongs to a collection $\mathcal{F}$ of subsets of $S$ satisfying the following three rules: (a) $S \in \mathcal{F}$; (b) if $A \in \mathcal{F}$ then $A^c \in \mathcal{F}$; and (c) if $A_j \in \mathcal{F}$ for $j \ge 1$, then $\bigcup_{j=1}^{\infty} A_j \in \mathcal{F}$. The collection $\mathcal{F}$ is called an event space or a $\sigma$-field. If $A$ is the outcome of an experiment, then we say that the event $A$ has occurred.

Example 1.15. Describe the sample space of rolling a die and interpret the

event {1,2}.

Answer: The sample space of this experiment is

$$S = \{1, 2, 3, 4, 5, 6\}.$$

The event $\{1, 2\}$ means getting either a 1 or a 2.

Example 1.16. First describe the sample space of rolling a pair of dice,

then describe the event A that the sum of numbers rolled is 7.

Answer: The sample space of this experiment is

$$S = \{(x, y) \mid x, y = 1, 2, 3, 4, 5, 6\}$$

and

$$A = \{(1,6), (6,1), (2,5), (5,2), (4,3), (3,4)\}.$$
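The sample space and the event of Example 1.16 are small enough to build explicitly. The following Python sketch represents $S$ as a set of ordered pairs and $A$ as a subset of it; the variable names mirror the text, and the code itself is only our illustration.

```python
from itertools import product

# Sample space for rolling a pair of dice: all ordered pairs (x, y).
S = set(product(range(1, 7), repeat=2))

# Event A: the sum of the two numbers rolled is 7.
A = {(x, y) for (x, y) in S if x + y == 7}

print(len(S))      # 36
print(sorted(A))   # [(1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)]
```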

Definition 1.8. Let $S$ be the sample space of a random experiment. A probability measure $P : \mathcal{F} \to [0, 1]$ is a set function which assigns real numbers to the various events of $S$ satisfying

(P1) $P(A) \ge 0$ for all events $A \in \mathcal{F}$,

(P2) $P(S) = 1$,

(P3) $P\left( \bigcup_{k=1}^{\infty} A_k \right) = \sum_{k=1}^{\infty} P(A_k)$ if $A_1, A_2, A_3, \ldots, A_k, \ldots$ are mutually disjoint events of $S$.

Any set function with the above three properties is a probability measure

for S . For a given sample space S , there may be more than one probability

measure. The probability of an event A is the value of the probability measure

at A , that is

$$\mathrm{Prob}(A) = P(A).$$

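Theorem 1.5. If $\emptyset$ is the empty set (that is, an impossible event), then

$$P(\emptyset) = 0.$$

Proof: Let $A_1 = S$ and $A_i = \emptyset$ for $i = 2, 3, \ldots, \infty$. Then

$$S = \bigcup_{i=1}^{\infty} A_i$$

where $A_i \cap A_j = \emptyset$ for $i \ne j$. By axiom 2 and axiom 3, we get

$$\begin{aligned}
1 &= P(S) && \text{(by axiom 2)} \\
&= P\left( \bigcup_{i=1}^{\infty} A_i \right) \\
&= \sum_{i=1}^{\infty} P(A_i) && \text{(by axiom 3)} \\
&= P(A_1) + \sum_{i=2}^{\infty} P(A_i) \\
&= P(S) + \sum_{i=2}^{\infty} P(\emptyset) \\
&= 1 + \sum_{i=2}^{\infty} P(\emptyset).
\end{aligned}$$

Therefore

$$\sum_{i=2}^{\infty} P(\emptyset) = 0.$$

Since $P(\emptyset) \ge 0$ by axiom 1, we have

$$P(\emptyset) = 0$$

and the proof of the theorem is complete.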

This theorem says that the probability of an impossible event is zero.

Note that if the probability of an event is zero, that does not mean the event

is empty (or impossible). There are random experiments in which there are

inﬁnitely many events each with probability 0. Similarly, if A is an event

with probability 1, then it does not mean A is the sample space S . In fact

there are random experiments in which one can ﬁnd inﬁnitely many events

each with probability 1.

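Theorem 1.6. Let $\{A_1, A_2, \ldots, A_n\}$ be a finite collection of $n$ events such that $A_i \cap A_j = \emptyset$ for $i \ne j$. Then

$$P\left( \bigcup_{i=1}^{n} A_i \right) = \sum_{i=1}^{n} P(A_i).$$

Proof: Consider the collection $\{A'_i\}_{i=1}^{\infty}$ of the subsets of the sample space $S$ such that

$$A'_1 = A_1, \; A'_2 = A_2, \; \ldots, \; A'_n = A_n$$

and

$$A'_{n+1} = A'_{n+2} = A'_{n+3} = \cdots = \emptyset.$$

Hence

$$\begin{aligned}
P\left( \bigcup_{i=1}^{n} A_i \right) &= P\left( \bigcup_{i=1}^{\infty} A'_i \right) \\
&= \sum_{i=1}^{\infty} P(A'_i) \\
&= \sum_{i=1}^{n} P(A'_i) + \sum_{i=n+1}^{\infty} P(A'_i) \\
&= \sum_{i=1}^{n} P(A_i) + \sum_{i=n+1}^{\infty} P(\emptyset) \\
&= \sum_{i=1}^{n} P(A_i) + 0 \\
&= \sum_{i=1}^{n} P(A_i)
\end{aligned}$$

and the proof of the theorem is now complete.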

When $n = 2$, the above theorem yields $P(A_1 \cup A_2) = P(A_1) + P(A_2)$ where $A_1$ and $A_2$ are disjoint (or mutually exclusive) events.

In the following theorem, we give a method for computing probability

of an event A by knowing the probabilities of the elementary events of the

sample space S.

Theorem 1.7. If A is an event of a discrete sample space S , then the

probability of A is equal to the sum of the probabilities of its elementary

events.

Proof: Any set $A$ in $S$ can be written as the union of its singleton sets. Let $\{O_i\}_{i=1}^{\infty}$ be the collection of all the singleton sets (or the elementary events) of $A$. Then

$$A = \bigcup_{i=1}^{\infty} O_i.$$

By axiom (P3), we get

$$P(A) = P\left( \bigcup_{i=1}^{\infty} O_i \right) = \sum_{i=1}^{\infty} P(O_i).$$

Example 1.17. If a fair coin is tossed twice, what is the probability of

getting at least one head?

Answer: The sample space of this experiment is

$$S = \{HH, HT, TH, TT\}.$$

The event $A$ is given by

$$A = \{\text{at least one head}\} = \{HH, HT, TH\}.$$

By Theorem 1.7, the probability of $A$ is the sum of the probabilities of its elementary events. Thus, we get

$$P(A) = P(HH) + P(HT) + P(TH) = \frac{1}{4} + \frac{1}{4} + \frac{1}{4} = \frac{3}{4}.$$
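The computation in Example 1.17 can be mirrored directly in code: assign a probability to each elementary event and add up those that belong to $A$. The sketch below assumes, as the example does, that the coin is fair, so each of the four outcomes has probability $1/4$; the dictionary `p` is our own device.

```python
from fractions import Fraction
from itertools import product

# Sample space for two tosses of a coin.
S = ["".join(t) for t in product("HT", repeat=2)]   # ['HH', 'HT', 'TH', 'TT']

# Assumption: the coin is fair, so each elementary event has probability 1/4.
p = {outcome: Fraction(1, len(S)) for outcome in S}

A = [outcome for outcome in S if "H" in outcome]    # at least one head
prob_A = sum(p[outcome] for outcome in A)
print(prob_A)                                       # 3/4
```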

Remark 1.2. Notice that here we are not computing the probability of the

elementary events by taking the number of points in the elementary event

and dividing by the total number of points in the sample space. We are

using the randomness to obtain the probability of the elementary events.

That is, we are assuming that each outcome is equally likely. This is why the

randomness is an integral part of probability theory.

Corollary 1.1. If $S$ is a finite sample space with $n$ equally likely sample elements and $A$ is an event in $S$ with $m$ elements, then the probability of $A$ is given by

$$P(A) = \frac{m}{n}.$$

Proof: By the previous theorem, we get

$$P(A) = P\left( \bigcup_{i=1}^{m} O_i \right) = \sum_{i=1}^{m} P(O_i) = \sum_{i=1}^{m} \frac{1}{n} = \frac{m}{n}.$$

The proof is now complete.

Example 1.18. A die is loaded in such a way that the probability of the

face with j dots turning up is proportional to j for j = 1, 2, ..., 6. What is

the probability, in one roll of the die, that an odd number of dots will turn

up?

Answer:

$$P(\{j\}) \propto j, \quad \text{that is,} \quad P(\{j\}) = k\,j$$

where $k$ is a constant of proportionality. Next, we determine this constant $k$ by using the axiom (P2). Using Theorem 1.7, we get

$$\begin{aligned}
P(S) &= P(\{1\}) + P(\{2\}) + P(\{3\}) + P(\{4\}) + P(\{5\}) + P(\{6\}) \\
&= k + 2k + 3k + 4k + 5k + 6k \\
&= (1 + 2 + 3 + 4 + 5 + 6)\,k \\
&= \frac{(6)(6+1)}{2}\,k \\
&= 21k.
\end{aligned}$$

Using (P2), we get

$$21k = 1.$$

Thus $k = \frac{1}{21}$. Hence, we have

$$P(\{j\}) = \frac{j}{21}.$$

Now, we want to find the probability of the odd number of dots turning up.

$$\begin{aligned}
P(\text{odd numbered dot will turn up}) &= P(\{1\}) + P(\{3\}) + P(\{5\}) \\
&= \frac{1}{21} + \frac{3}{21} + \frac{5}{21} \\
&= \frac{9}{21}.
\end{aligned}$$
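The loaded-die calculation of Example 1.18 can be reproduced with exact fractions. In this sketch the constant `k` is obtained by forcing the probabilities to add up to 1, as axiom (P2) requires; the variable names are ours.

```python
from fractions import Fraction

faces = range(1, 7)

# P({j}) = k * j, and axiom (P2) forces the probabilities to add up to 1.
k = Fraction(1, sum(faces))            # 1/21
p = {j: k * j for j in faces}

assert sum(p.values()) == 1            # axiom (P2)
p_odd = sum(p[j] for j in faces if j % 2 == 1)
print(k, p_odd)                        # 1/21 3/7
```

The printed value $3/7$ is $9/21$ in lowest terms.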

Remark 1.3. Recall that the sum of the first $n$ integers is equal to $\frac{n}{2}(n+1)$. That is,

$$1 + 2 + 3 + \cdots + (n-2) + (n-1) + n = \frac{n(n+1)}{2}.$$

This formula was first proven by Gauss (1777-1855) when he was a young schoolboy.

Remark 1.4. Gauss proved that the sum of the first $n$ positive integers is $\frac{n(n+1)}{2}$ when he was a schoolboy. Kolmogorov, the father of modern probability theory, proved that the sum of the first $n$ odd positive integers is $n^2$, when he was five years old.

1.4. Some Properties of the Probability Measure

Next, we present some theorems that will illustrate the various intuitive

properties of a probability measure.

Theorem 1.8. If $A$ is any event of the sample space $S$, then

$$P(A^c) = 1 - P(A)$$

where $A^c$ denotes the complement of $A$ with respect to $S$.

Proof: Let $A$ be any subset of $S$. Then $S = A \cup A^c$. Further, $A$ and $A^c$ are mutually disjoint. Thus, using (P3), we get

$$1 = P(S) = P(A \cup A^c) = P(A) + P(A^c).$$

Hence, we see that

$$P(A^c) = 1 - P(A).$$

This completes the proof.

Theorem 1.9. If $A \subseteq B \subseteq S$, then

$$P(A) \le P(B).$$

Proof: Note that $B = A \cup (B \setminus A)$ where $B \setminus A$ denotes all the elements $x$ that are in $B$ but not in $A$. Further, $A \cap (B \setminus A) = \emptyset$. Hence by (P3), we get

$$P(B) = P(A \cup (B \setminus A)) = P(A) + P(B \setminus A).$$

By axiom (P1), we know that $P(B \setminus A) \ge 0$. Thus, from the above, we get

$$P(B) \ge P(A)$$

and the proof is complete.

Theorem 1.10. If $A$ is any event in $S$, then

$$0 \le P(A) \le 1.$$


Proof: Follows from axioms (P1) and (P2) and Theorem 1.8.

Theorem 1.10. If $A$ and $B$ are any two events, then

$$P(A \cup B) = P(A) + P(B) - P(A \cap B).$$

Proof: It is easy to see that

$$A \cup B = A \cup (A^c \cap B)$$

and

$$A \cap (A^c \cap B) = \emptyset.$$

Hence by (P3), we get

$$P(A \cup B) = P(A) + P(A^c \cap B). \tag{1.1}$$

But the set $B$ can also be written as

$$B = (A \cap B) \cup (A^c \cap B).$$

Therefore, by (P3), we get

$$P(B) = P(A \cap B) + P(A^c \cap B). \tag{1.2}$$

Eliminating $P(A^c \cap B)$ from (1.1) and (1.2), we get

$$P(A \cup B) = P(A) + P(B) - P(A \cap B)$$

and the proof of the theorem is now complete.

The above theorem tells us how to calculate the probability that at least one of $A$ and $B$ occurs.

Example 1.19. If $P(A) = 0.25$ and $P(B) = 0.8$, then show that $0.05 \le P(A \cap B) \le 0.25$.

Answer: Since $A \cap B \subseteq A$ and $A \cap B \subseteq B$, by Theorem 1.9, we get

$$P(A \cap B) \le P(A) \quad \text{and also} \quad P(A \cap B) \le P(B).$$

Hence

$$P(A \cap B) \le \min\{P(A), P(B)\}.$$

This shows that

$$P(A \cap B) \le 0.25. \tag{1.3}$$

Since $A \cup B \subseteq S$, by Theorem 1.9, we get

$$P(A \cup B) \le P(S).$$

That is, by Theorem 1.10,

$$P(A) + P(B) - P(A \cap B) \le P(S).$$

Hence, we obtain

$$0.8 + 0.25 - P(A \cap B) \le 1$$

and this yields

$$0.8 + 0.25 - 1 \le P(A \cap B).$$

From this, we get

$$0.05 \le P(A \cap B). \tag{1.4}$$

From (1.3) and (1.4), we get

$$0.05 \le P(A \cap B) \le 0.25.$$
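The two inequalities used in Example 1.19 hold for any pair of events, so the bounds can be packaged as a small helper. This is a sketch under the assumption that only $P(A)$ and $P(B)$ are known; the name `intersection_bounds` is hypothetical, and the `max` with 0 adds axiom (P1) to the lower bound derived in the example.

```python
from fractions import Fraction


def intersection_bounds(p_a, p_b):
    """Return (lower, upper) bounds on P(A ∩ B) given only P(A) and P(B)."""
    lower = max(Fraction(0), p_a + p_b - 1)  # from P(A ∪ B) <= P(S) = 1 and (P1)
    upper = min(p_a, p_b)                    # from A ∩ B ⊆ A and A ∩ B ⊆ B
    return lower, upper


low, high = intersection_bounds(Fraction(25, 100), Fraction(8, 10))
print(low, high)  # 1/20 1/4, that is, 0.05 and 0.25
```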

Example 1.20. Let $A$ and $B$ be events in a sample space $S$ such that $P(A) = \frac{1}{2} = P(B)$ and $P(A^c \cap B^c) = \frac{1}{3}$. Find $P(A \cup B^c)$.

Answer: Notice that

$$A \cup B^c = A \cup (A^c \cap B^c).$$

Hence,

$$P(A \cup B^c) = P(A) + P(A^c \cap B^c) = \frac{1}{2} + \frac{1}{3} = \frac{5}{6}.$$

Theorem 1.11. If $A_1$ and $A_2$ are two events such that $A_1 \subseteq A_2$, then

$$P(A_2 \setminus A_1) = P(A_2) - P(A_1).$$

Proof: The event $A_2$ can be written as

$$A_2 = A_1 \cup (A_2 \setminus A_1)$$

where the sets $A_1$ and $A_2 \setminus A_1$ are disjoint. Hence

$$P(A_2) = P(A_1) + P(A_2 \setminus A_1)$$

which is

$$P(A_2 \setminus A_1) = P(A_2) - P(A_1)$$

and the proof of the theorem is now complete.

From calculus we know that a real function $f : \mathbb{R} \to \mathbb{R}$ (the set of real numbers) is continuous on $\mathbb{R}$ if and only if, for every convergent sequence $\{x_n\}_{n=1}^{\infty}$ in $\mathbb{R}$,

$$\lim_{n \to \infty} f(x_n) = f\left(\lim_{n \to \infty} x_n\right).$$


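Theorem 1.12. If $A_1, A_2, \ldots, A_n, \ldots$ is a sequence of events in sample space $S$ such that $A_1 \subseteq A_2 \subseteq \cdots \subseteq A_n \subseteq \cdots$, then

$$P\left(\bigcup_{n=1}^{\infty} A_n\right) = \lim_{n \to \infty} P(A_n).$$

Similarly, if $B_1, B_2, \ldots, B_n, \ldots$ is a sequence of events in sample space $S$ such that $B_1 \supseteq B_2 \supseteq \cdots \supseteq B_n \supseteq \cdots$, then

$$P\left(\bigcap_{n=1}^{\infty} B_n\right) = \lim_{n \to \infty} P(B_n).$$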

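Proof: Given an increasing sequence of events

$$A_1 \subseteq A_2 \subseteq \cdots \subseteq A_n \subseteq \cdots$$

we define a disjoint collection of events as follows:

$$E_1 = A_1, \qquad E_n = A_n \setminus A_{n-1} \quad \forall\, n \ge 2.$$

Then $\{E_n\}_{n=1}^{\infty}$ is a disjoint collection of events such that

$$\bigcup_{n=1}^{\infty} A_n = \bigcup_{n=1}^{\infty} E_n.$$

Further

$$\begin{aligned}
P\left(\bigcup_{n=1}^{\infty} A_n\right) &= P\left(\bigcup_{n=1}^{\infty} E_n\right) \\
&= \sum_{n=1}^{\infty} P(E_n) \\
&= \lim_{m \to \infty} \sum_{n=1}^{m} P(E_n) \\
&= \lim_{m \to \infty} \left[ P(A_1) + \sum_{n=2}^{m} \left[ P(A_n) - P(A_{n-1}) \right] \right] \\
&= \lim_{m \to \infty} P(A_m) \\
&= \lim_{n \to \infty} P(A_n).
\end{aligned}$$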


The second part of the theorem can be proved similarly.

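Note that

$$\lim_{n \to \infty} A_n = \bigcup_{n=1}^{\infty} A_n$$

and

$$\lim_{n \to \infty} B_n = \bigcap_{n=1}^{\infty} B_n.$$

Hence the results of the above theorem can be written as

$$P\left(\lim_{n \to \infty} A_n\right) = \lim_{n \to \infty} P(A_n)$$

and

$$P\left(\lim_{n \to \infty} B_n\right) = \lim_{n \to \infty} P(B_n),$$

and Theorem 1.12 is called the continuity theorem for the probability measure.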
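A small numerical illustration of the first part of Theorem 1.12 may help. The setting below is an assumption made for this sketch and is not taken from the text: the sample space is $S = \{1, 2, 3, \ldots\}$ with $P(\{k\}) = 2^{-k}$, and $A_n = \{1, \ldots, n\}$, so that the $A_n$ increase to $S$.

```python
from fractions import Fraction


def prob_of_A(n):
    """P(A_n) for A_n = {1, ..., n} when P({k}) = 2**(-k), k = 1, 2, 3, ..."""
    return sum(Fraction(1, 2**k) for k in range(1, n + 1))


# A_1 ⊆ A_2 ⊆ ... and the union of all A_n is the whole sample space S,
# so Theorem 1.12 says P(A_n) should increase to P(S) = 1.
for n in (1, 2, 5, 10, 20):
    print(n, float(prob_of_A(n)))
```

In this example $P(A_n) = 1 - 2^{-n}$ exactly, which makes the convergence to 1 easy to see.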

1.5. Review Exercises

1. If we randomly pick two television sets in succession from a shipment of

240 television sets of which 15 are defective, what is the probability that they

will both be defective?

2. A poll of 500 people determines that 382 like ice cream and 362 like cake.

How many people like both if each of them likes at least one of the two?

(Hint: Use $P(A \cup B) = P(A) + P(B) - P(A \cap B)$.)

3. The Mathematics Department of the University of Louisville consists of

8 professors, 6 associate professors, 13 assistant professors. In how many of

all possible samples of size 4, chosen without replacement, will every type of

professor be represented?

4. A pair of dice consisting of a six-sided die and a four-sided die is rolled and the sum is determined. Let $A$ be the event that a sum of 5 is rolled and let $B$ be the event that a sum of 5 or a sum of 9 is rolled. Find (a) $P(A)$, (b) $P(B)$, and (c) $P(A \cap B)$.

5. A faculty leader was meeting two students in Paris, one arriving by train from Amsterdam and the other arriving from Brussels at approximately the same time. Let $A$ and $B$ be the events that the trains are on time, respectively. If $P(A) = 0.93$, $P(B) = 0.89$ and $P(A \cap B) = 0.87$, then find the probability that at least one train is on time.


6. Bill, George, and Ross, in order, roll a die. The ﬁrst one to roll an even

number wins and the game is ended. What is the probability that Bill will

win the game?

7. Let $A$ and $B$ be events such that $P(A) = \frac{1}{2} = P(B)$ and $P(A^c \cap B^c) = \frac{1}{3}$. Find the probability of the event $A^c \cup B^c$.

8. Suppose a box contains 4 blue, 5 white, 6 red and 7 green balls. In how

many of all possible samples of size 5, chosen without replacement, will every

color be represented?

9. Using the Binomial Theorem, show that

$$\sum_{k=0}^{n} k \binom{n}{k} = n\,2^{n-1}.$$

10. A function consists of a domain $A$, a co-domain $B$ and a rule $f$. The rule $f$ assigns to each number in the domain $A$ one and only one letter in the co-domain $B$. If $A = \{1, 2, 3\}$ and $B = \{x, y, z, w\}$, then find all the distinct functions that can be formed from the set $A$ into the set $B$.

11. Let $S$ be a countable sample space. Let $\{O_i\}_{i=1}^{\infty}$ be the collection of all the elementary events in $S$. What should be the value of the constant $c$ such that $P(O_i) = c \left(\frac{1}{3}\right)^i$ will be a probability measure in $S$?

12. A box contains ﬁve green balls, three black balls, and seven red balls.

Two balls are selected at random without replacement from the box. What

is the probability that both balls are the same color?

13. Find the sample space of the random experiment which consists of tossing

a coin until the ﬁrst head is obtained. Is this sample space discrete?

14. Find the sample space of the random experiment which consists of tossing

a coin inﬁnitely many times. Is this sample space discrete?

15. Five fair dice are thrown. What is the probability that a full house is

thrown (that is, where two dice show one number and other three dice show

a second number)?

16. If a fair coin is tossed repeatedly, what is the probability that the third

head occurs on the nth toss?

17. In a particular softball league each team consists of 5 women and 5 men. In determining a batting order for 10 players, a woman must bat first, and successive batters must be of opposite sex. How many different batting orders are possible for a team?


18. An urn contains 3 red balls, 2 green balls and 1 yellow ball. Three balls

are selected at random and without replacement from the urn. What is the

probability that at least 1 color is not drawn?

19. A box contains four $10 bills, six $5 bills and two $1 bills. Two bills are

taken at random from the box without replacement. What is the probability

that both bills will be of the same denomination?

20. An urn contains $n$ white counters numbered 1 through $n$, $n$ black counters numbered 1 through $n$, and $n$ red counters numbered 1 through $n$. If two counters are to be drawn at random without replacement, what is the probability that both counters will be of the same color or bear the same number?

21. Two people take turns rolling a fair die. Person $X$ rolls first, then person $Y$, then $X$, and so on. The winner is the first to roll a 6. What is the probability that person $X$ wins?

22. Mr. Flowers plants 10 rose bushes in a row. Eight of the bushes are

white and two are red, and he plants them in random order. What is the

probability that he will consecutively plant seven or more white bushes?

23. Using mathematical induction, show that

$$\frac{d^n}{dx^n}\left[f(x) \cdot g(x)\right] = \sum_{k=0}^{n} \binom{n}{k} \frac{d^k}{dx^k}\left[f(x)\right] \cdot \frac{d^{n-k}}{dx^{n-k}}\left[g(x)\right].$$


Chapter 2

CONDITIONAL

PROBABILITIES

AND

BAYES' THEOREM

2.1. Conditional Probabilities

First, we give a heuristic argument for the definition of conditional probability, and then based on our heuristic argument, we define the conditional probability.

Consider a random experiment whose sample space is $S$. Let $B \subset S$. In many situations, we are only concerned with those outcomes that are elements of $B$. This means that we consider $B$ to be our new sample space. For the time being, suppose $S$ is a nonempty finite sample space and $B$ is a nonempty subset of $S$. Given this new discrete sample space $B$, how do we define the probability of an event $A$? Intuitively, one should define the probability of $A$ with respect to the new sample space $B$ as (see the figure above)

$$P(A \text{ given } B) = \frac{\text{the number of elements in } A \cap B}{\text{the number of elements in } B}.$$


We denote the conditional probability of $A$ given the new sample space $B$ as $P(A/B)$. Hence with this notation, we say that

$$P(A/B) = \frac{N(A \cap B)}{N(B)} = \frac{N(A \cap B)/N(S)}{N(B)/N(S)} = \frac{P(A \cap B)}{P(B)},$$

since $N(S) \neq 0$. Here $N(S)$ denotes the number of elements in $S$.

Thus, if the sample space is finite, then the above definition of the probability of an event $A$ given that the event $B$ has occurred makes sense intuitively. Now we define the conditional probability for any sample space (discrete or continuous) as follows.

Definition 2.1. Let $S$ be a sample space associated with a random experiment. The conditional probability of an event $A$, given that event $B$ has occurred, is defined by

$$P(A/B) = \frac{P(A \cap B)}{P(B)}$$

provided $P(B) > 0$.

This conditional probability measure $P(A/B)$ satisfies all three axioms of a probability measure. That is,

(CP1) $P(A/B) \ge 0$ for every event $A$

(CP2) $P(B/B) = 1$

(CP3) If $A_1, A_2, \ldots, A_k, \ldots$ are mutually exclusive events, then

$$P\left(\bigcup_{k=1}^{\infty} A_k / B\right) = \sum_{k=1}^{\infty} P(A_k/B).$$

Thus, it is a probability measure with respect to the new sample space $B$.

Example 2.1. A drawer contains 4 black, 6 brown, and 8 olive socks. Two

socks are selected at random from the drawer. (a) What is the probability

that both socks are of the same color? (b) What is the probability that both

socks are olive if it is known that they are of the same color?

Answer: The sample space of this experiment consists of

$$S = \{(x, y) \mid x, y \in \text{Bl}, \text{Ol}, \text{Br}\}.$$

The cardinality of $S$ is

$$N(S) = \binom{18}{2} = 153.$$


Let $A$ be the event that two socks selected at random are of the same color. Then the cardinality of $A$ is given by

$$N(A) = \binom{4}{2} + \binom{6}{2} + \binom{8}{2} = 6 + 15 + 28 = 49.$$

Therefore, the probability of $A$ is given by

$$P(A) = \frac{49}{\binom{18}{2}} = \frac{49}{153}.$$

Let $B$ be the event that two socks selected at random are olive. Then the cardinality of $B$ is given by

$$N(B) = \binom{8}{2}$$

and hence

$$P(B) = \frac{\binom{8}{2}}{\binom{18}{2}} = \frac{28}{153}.$$

Notice that $B \subset A$. Hence,

$$P(B/A) = \frac{P(A \cap B)}{P(A)} = \frac{P(B)}{P(A)} = \left(\frac{28}{153}\right)\left(\frac{153}{49}\right) = \frac{28}{49} = \frac{4}{7}.$$
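The counts in Example 2.1 can be reproduced with binomial coefficients from the standard library. This is a minimal sketch; the dictionary `socks` and the variable names are illustrative choices.

```python
from fractions import Fraction
from math import comb

socks = {"black": 4, "brown": 6, "olive": 8}

n_total = comb(sum(socks.values()), 2)            # N(S): all pairs of socks
n_same = sum(comb(c, 2) for c in socks.values())  # N(A): pairs of one color
n_olive = comb(socks["olive"], 2)                 # N(B): pairs of olive socks

p_same = Fraction(n_same, n_total)
p_olive_given_same = Fraction(n_olive, n_same)    # P(B/A), since B is a subset of A

print(p_same)              # 49/153
print(p_olive_given_same)  # 4/7
```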

Let $A$ and $B$ be two mutually disjoint events in a sample space $S$. We want to find a formula for computing the probability that the event $A$ occurs before the event $B$ in a sequence of trials. Let $P(A)$ and $P(B)$ be the probabilities that $A$ and $B$ occur, respectively. Then the probability that neither $A$ nor $B$ occurs is $1 - P(A) - P(B)$. Let us denote this probability by $r$, that is, $r = 1 - P(A) - P(B)$.

In the first trial, either $A$ occurs, or $B$ occurs, or neither $A$ nor $B$ occurs. If $A$ occurs in the first trial, then the probability that $A$ occurs before $B$ is 1. If $B$ occurs in the first trial, then the probability that $A$ occurs before $B$ is 0. If neither $A$ nor $B$ occurs in the first trial, we look at the outcomes of the second trial. If $A$ occurs in the second trial, then the probability that $A$ occurs before $B$ is 1. If $B$ occurs in the second trial, then the probability that $A$ occurs before $B$ is 0. If neither $A$ nor $B$ occurs in the second trial, we look at the outcomes of the third trial, and so on. This argument can be summarized in the following diagram.

Hence the probability that the event $A$ comes before the event $B$ is given by

$$\begin{aligned}
P(A \text{ before } B) &= P(A) + r\,P(A) + r^2 P(A) + r^3 P(A) + \cdots + r^n P(A) + \cdots \\
&= P(A)\left[1 + r + r^2 + \cdots + r^n + \cdots\right] \\
&= P(A)\,\frac{1}{1 - r} \\
&= P(A)\,\frac{1}{1 - [1 - P(A) - P(B)]} \\
&= \frac{P(A)}{P(A) + P(B)}.
\end{aligned}$$

The event $A$ before $B$ can also be interpreted as a conditional event. In this interpretation the event $A$ before $B$ means the occurrence of the event $A$ given that $A \cup B$ has already occurred. Thus we again have

$$P(A/A \cup B) = \frac{P(A \cap (A \cup B))}{P(A \cup B)} = \frac{P(A)}{P(A) + P(B)}.$$

Example 2.2. A pair of four-sided dice is rolled and the sum is determined.

What is the probability that a sum of 3 is rolled before a sum of 5 is rolled

in a sequence of rolls of the dice?


Answer: The sample space of this random experiment is

$$\begin{aligned}
\{&(1,1),\ (1,2),\ (1,3),\ (1,4), \\
&(2,1),\ (2,2),\ (2,3),\ (2,4), \\
&(3,1),\ (3,2),\ (3,3),\ (3,4), \\
&(4,1),\ (4,2),\ (4,3),\ (4,4)\}.
\end{aligned}$$

Let $A$ denote the event of getting a sum of 3 and $B$ denote the event of getting a sum of 5. The probability that a sum of 3 is rolled before a sum of 5 is rolled can be thought of as the conditional probability of a sum of 3, given that a sum of 3 or 5 has occurred. That is, $P(A/A \cup B)$. Hence

$$P(A/A \cup B) = \frac{P(A \cap (A \cup B))}{P(A \cup B)} = \frac{P(A)}{P(A) + P(B)} = \frac{N(A)}{N(A) + N(B)} = \frac{2}{2 + 4} = \frac{1}{3}.$$
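The answer to Example 2.2 can be checked by enumerating the 16 outcomes and applying the formula $P(A)/(P(A) + P(B))$. This is a minimal sketch; the helper name `prob_a_before_b` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def prob_a_before_b(p_a, p_b):
    """P(A before B) = P(A) / (P(A) + P(B)) for mutually disjoint A and B."""
    return p_a / (p_a + p_b)


rolls = list(product(range(1, 5), repeat=2))   # 16 equally likely outcomes
n_a = sum(1 for x, y in rolls if x + y == 3)   # N(A): sum of 3
n_b = sum(1 for x, y in rolls if x + y == 5)   # N(B): sum of 5
p_a = Fraction(n_a, len(rolls))
p_b = Fraction(n_b, len(rolls))

print(prob_a_before_b(p_a, p_b))  # 1/3
```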

Example 2.3. If we randomly pick two television sets in succession from a shipment of 240 television sets of which 15 are defective, what is the probability that they will both be defective?

Answer: Let $A$ denote the event that the first television picked was defective. Let $B$ denote the event that the second television picked was defective. Then $A \cap B$ will denote the event that both televisions picked were defective. Using the conditional probability, we can calculate

$$P(A \cap B) = P(A)\,P(B/A) = \left(\frac{15}{240}\right)\left(\frac{14}{239}\right) = \frac{7}{1912}.$$

In Example 2.3, we assume that we are sampling without replacement.

Deﬁnition 2.2. If an object is selected and then replaced before the next

object is selected, this is known as sampling with replacement. Otherwise, it

is called sampling without replacement.


Rolling a die is equivalent to sampling with replacement, whereas dealing

a deck of cards to players is sampling without replacement.

Example 2.4. A box of fuses contains 20 fuses, of which 5 are defective. If

3 of the fuses are selected at random and removed from the box in succession

without replacement, what is the probability that all three fuses are defective?

Answer: Let A be the event that the ﬁrst fuse selected is defective. Let B

be the event that the second fuse selected is defective. Let C be the event

that the third fuse selected is defective. The probability that all three fuses selected are defective is $P(A \cap B \cap C)$. Hence

$$P(A \cap B \cap C) = P(A)\,P(B/A)\,P(C/A \cap B) = \left(\frac{5}{20}\right)\left(\frac{4}{19}\right)\left(\frac{3}{18}\right) = \frac{1}{114}.$$
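Examples 2.3 and 2.4 both multiply a chain of conditional probabilities for draws without replacement. The sketch below generalizes that product; the function name `prob_all_defective` and its parameters are illustrative and not from the text.

```python
from fractions import Fraction


def prob_all_defective(total, defective, draws):
    """Multiply the conditional probabilities of drawing a defective item each time."""
    prob = Fraction(1)
    for i in range(draws):
        # After i defective items are removed, defective - i remain out of total - i.
        prob *= Fraction(defective - i, total - i)
    return prob


print(prob_all_defective(240, 15, 2))  # Example 2.3: 7/1912
print(prob_all_defective(20, 5, 3))    # Example 2.4: 1/114
```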

Definition 2.3. Two events $A$ and $B$ of a sample space $S$ are called independent if and only if

$$P(A \cap B) = P(A)\,P(B).$$

Example 2.5. The following diagram shows two events A and B in the

sample space S . Are the events A and B independent?

Answer: There are 10 black dots in $S$ and event $A$ contains 4 of these dots. So the probability of $A$ is $P(A) = \frac{4}{10}$. Similarly, event $B$ contains 5 black dots. Hence $P(B) = \frac{5}{10}$. The conditional probability of $A$ given $B$ is

$$P(A/B) = \frac{P(A \cap B)}{P(B)} = \frac{2}{5}.$$

This shows that $P(A/B) = P(A)$. Hence $A$ and $B$ are independent.

Theorem 2.1. Let $A, B \subseteq S$. If $A$ and $B$ are independent and $P(B) > 0$, then

$$P(A/B) = P(A).$$

Proof:

$$P(A/B) = \frac{P(A \cap B)}{P(B)} = \frac{P(A)\,P(B)}{P(B)} = P(A).$$

Theorem 2.2. If $A$ and $B$ are independent events, then $A^c$ and $B$ are independent. Similarly, $A$ and $B^c$ are independent.

Proof: We know that $A$ and $B$ are independent, that is,

$$P(A \cap B) = P(A)\,P(B)$$

and we want to show that $A^c$ and $B$ are independent, that is,

$$P(A^c \cap B) = P(A^c)\,P(B).$$

Since

$$\begin{aligned}
P(A^c \cap B) &= P(A^c/B)\,P(B) \\
&= [1 - P(A/B)]\,P(B) \\
&= P(B) - P(A/B)\,P(B) \\
&= P(B) - P(A \cap B) \\
&= P(B) - P(A)\,P(B) \\
&= P(B)\,[1 - P(A)] \\
&= P(B)\,P(A^c),
\end{aligned}$$

the events $A^c$ and $B$ are independent. Similarly, it can be shown that $A$ and $B^c$ are independent and the proof is now complete.

Remark 2.1. The concept of independence is fundamental. In fact, it is this concept that justifies the mathematical development of probability as a separate discipline from measure theory. Mark Kac said, "independence of events is not a purely mathematical concept." It can, however, be made plausible that it should be interpreted by the rule of multiplication of probabilities and this leads to the mathematical definition of independence.

Example 2.6. Flip a coin and then independently cast a die. What is the

probability of observing heads on the coin and a 2 or 3 on the die?

Answer: Let $A$ denote the event of observing a head on the coin and let $B$ be the event of observing a 2 or 3 on the die. Then

$$P(A \cap B) = P(A)\,P(B) = \left(\frac{1}{2}\right)\left(\frac{2}{6}\right) = \frac{1}{6}.$$
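Independence in Example 2.6, and the complement property of Theorem 2.2, can be verified by brute force on the 12 equally likely (coin, die) outcomes. This is a sketch; the names `S`, `prob`, `is_A` and `is_B` are illustrative, and equal likelihood of the 12 pairs is the modeling assumption.

```python
from fractions import Fraction
from itertools import product

# Sample space: every (coin, die) pair is assumed equally likely.
S = list(product(["H", "T"], range(1, 7)))


def prob(event):
    """Probability of an event given as a predicate on outcomes of S."""
    return Fraction(sum(1 for s in S if event(s)), len(S))


def is_A(s):  # heads on the coin
    return s[0] == "H"


def is_B(s):  # a 2 or a 3 on the die
    return s[1] in (2, 3)


p_a, p_b = prob(is_A), prob(is_B)
p_ab = prob(lambda s: is_A(s) and is_B(s))
p_acb = prob(lambda s: (not is_A(s)) and is_B(s))

print(p_ab, p_a * p_b)         # 1/6 1/6
print(p_acb, (1 - p_a) * p_b)  # 1/6 1/6
```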

Example 2.7. An urn contains 3 red, 2 white and 4 yellow balls. An

ordered sample of size 3 is drawn from the urn. If the balls are drawn with

replacement so that one outcome does not change the probabilities of others,

then what is the probability of drawing a sample that has balls of each color?

Also, ﬁnd the probability of drawing a sample that has two yellow balls and

a red ball or a red ball and two white balls?

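Answer:

$$P(RWY) = \left(\frac{3}{9}\right)\left(\frac{2}{9}\right)\left(\frac{4}{9}\right) = \frac{8}{243}$$

and

$$P(YYR \text{ or } RWW) = \left(\frac{4}{9}\right)\left(\frac{4}{9}\right)\left(\frac{3}{9}\right) + \left(\frac{3}{9}\right)\left(\frac{2}{9}\right)\left(\frac{2}{9}\right) = \frac{20}{243}.$$

If the balls are drawn without replacement, then

$$P(RWY) = \left(\frac{3}{9}\right)\left(\frac{2}{8}\right)\left(\frac{4}{7}\right) = \frac{1}{21},$$

$$P(YYR \text{ or } RWW) = \left(\frac{4}{9}\right)\left(\frac{3}{8}\right)\left(\frac{3}{7}\right) + \left(\frac{3}{9}\right)\left(\frac{2}{8}\right)\left(\frac{1}{7}\right) = \frac{7}{84}.$$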
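The four products in Example 2.7 follow one pattern: multiply the chance of each color in the stated order, updating the urn only when sampling without replacement. The sketch below mirrors the text's computation for the specific ordered sequences RWY, YYR and RWW; the function name `sequence_probability` is hypothetical.

```python
from fractions import Fraction

URN = {"R": 3, "W": 2, "Y": 4}


def sequence_probability(sequence, urn, with_replacement):
    """Probability of drawing the colors in `sequence` in exactly that order."""
    remaining = dict(urn)
    total = sum(remaining.values())
    prob = Fraction(1)
    for color in sequence:
        prob *= Fraction(remaining[color], total)
        if not with_replacement:
            remaining[color] -= 1  # the drawn ball leaves the urn
            total -= 1
    return prob


for replace in (True, False):
    p_rwy = sequence_probability("RWY", URN, replace)
    p_either = (sequence_probability("YYR", URN, replace)
                + sequence_probability("RWW", URN, replace))
    print(replace, p_rwy, p_either)
# True 8/243 20/243
# False 1/21 1/12
```

The last value prints as 1/12, which is 7/84 in lowest terms.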

There is a tendency to equate the concepts "mutually exclusive" and "independence". This is a fallacy. Two events $A$ and $B$ are mutually exclusive if $A \cap B = \emptyset$ and they are called possible if $P(A) \neq 0 \neq P(B)$.

Theorem 2.2. Two possible mutually exclusive events are always dependent

(that is not independent).


Proof: Suppose not. Then

$$\begin{aligned}
P(A \cap B) &= P(A)\,P(B) \\
P(\emptyset) &= P(A)\,P(B) \\
0 &= P(A)\,P(B).
\end{aligned}$$

Hence, we get either $P(A) = 0$ or $P(B) = 0$. This is a contradiction to the fact that $A$ and $B$ are possible events. This completes the proof.

Theorem 2.3. Two possible independent events are not mutually exclusive.

Proof: Let $A$ and $B$ be two possible independent events and suppose $A$ and $B$ are mutually exclusive. Then

$$P(A)\,P(B) = P(A \cap B) = P(\emptyset) = 0.$$

Therefore, we get either $P(A) = 0$ or $P(B) = 0$. This is a contradiction to the fact that $A$ and $B$ are possible events.

For possible events $A$ and $B$: if $A$ and $B$ are mutually exclusive, then they are not independent; and if $A$ and $B$ are independent, then they are not mutually exclusive.

2.2. Bayes' Theorem

There are many situations where the ultimate outcome of an experiment depends on what happens in various intermediate stages. This issue is resolved by Bayes' theorem.

Definition 2.4. Let $S$ be a set and let $\mathcal{P} = \{A_i\}_{i=1}^{m}$ be a collection of subsets of $S$. The collection $\mathcal{P}$ is called a partition of $S$ if

(a) $S = \bigcup_{i=1}^{m} A_i$

(b) $A_i \cap A_j = \emptyset$ for $i \neq j$.


Theorem 2.4. If the events $\{B_i\}_{i=1}^{m}$ constitute a partition of the sample space $S$ and $P(B_i) \neq 0$ for $i = 1, 2, \ldots, m$, then for any event $A$ in $S$

$$P(A) = \sum_{i=1}^{m} P(B_i)\,P(A/B_i).$$

Proof: Let $S$ be a sample space and $A$ be an event in $S$. Let $\{B_i\}_{i=1}^{m}$ be any partition of $S$. Then

$$A = \bigcup_{i=1}^{m} (A \cap B_i).$$

Thus

$$P(A) = \sum_{i=1}^{m} P(A \cap B_i) = \sum_{i=1}^{m} P(B_i)\,P(A/B_i).$$

Theorem 2.5. If the events $\{B_i\}_{i=1}^{m}$ constitute a partition of the sample space $S$ and $P(B_i) \neq 0$ for $i = 1, 2, \ldots, m$, then for any event $A$ in $S$ such that $P(A) \neq 0$

$$P(B_k/A) = \frac{P(B_k)\,P(A/B_k)}{\sum_{i=1}^{m} P(B_i)\,P(A/B_i)}, \qquad k = 1, 2, \ldots, m.$$

Proof: Using the definition of conditional probability, we get

$$P(B_k/A) = \frac{P(A \cap B_k)}{P(A)}.$$

Using Theorem 2.4, we get

$$P(B_k/A) = \frac{P(A \cap B_k)}{\sum_{i=1}^{m} P(B_i)\,P(A/B_i)} = \frac{P(B_k)\,P(A/B_k)}{\sum_{i=1}^{m} P(B_i)\,P(A/B_i)}.$$

This completes the proof.

This theorem is called Bayes' theorem. The probability $P(B_k)$ is called the prior probability. The probability $P(B_k/A)$ is called the posterior probability.

Example 2.8. Two boxes containing marbles are placed on a table. The boxes are labeled $B_1$ and $B_2$. Box $B_1$ contains 7 green marbles and 4 white marbles. Box $B_2$ contains 3 green marbles and 10 yellow marbles. The boxes are arranged so that the probability of selecting box $B_1$ is $\frac{1}{3}$ and the probability of selecting box $B_2$ is $\frac{2}{3}$. Kathy is blindfolded and asked to select a marble. She will win a color TV if she selects a green marble. (a) What is the probability that Kathy will win the TV (that is, she will select a green marble)? (b) If Kathy wins the color TV, what is the probability that the green marble was selected from the first box?

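Answer: Let $A$ be the event of drawing a green marble. The prior probabilities are $P(B_1) = \frac{1}{3}$ and $P(B_2) = \frac{2}{3}$.

(a) The probability that Kathy will win the TV is

$$\begin{aligned}
P(A) &= P(A \cap B_1) + P(A \cap B_2) \\
&= P(A/B_1)\,P(B_1) + P(A/B_2)\,P(B_2) \\
&= \left(\frac{7}{11}\right)\left(\frac{1}{3}\right) + \left(\frac{3}{13}\right)\left(\frac{2}{3}\right) \\
&= \frac{7}{33} + \frac{2}{13} \\
&= \frac{91}{429} + \frac{66}{429} \\
&= \frac{157}{429}.
\end{aligned}$$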

(b) Given that Kathy won the TV, the probability that the green marble was selected from $B_1$ is

$$P(B_1/A) = \frac{P(A/B_1)\,P(B_1)}{P(A/B_1)\,P(B_1) + P(A/B_2)\,P(B_2)} = \frac{\left(\frac{7}{11}\right)\left(\frac{1}{3}\right)}{\left(\frac{7}{11}\right)\left(\frac{1}{3}\right) + \left(\frac{3}{13}\right)\left(\frac{2}{3}\right)} = \frac{91}{157}.$$

[Tree diagram: box $B_1$ is selected with probability 1/3 (green marble 7/11, not a green marble 4/11); box $B_2$ is selected with probability 2/3 (green marble 3/13, not a green marble 10/13).]

Note that P (A/B1 ) is the probability of selecting a green marble from

B1 whereas P (B1/A) is the probability that the green marble was selected

from box B1 .
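Theorems 2.4 and 2.5 translate directly into two short functions. The sketch below applies them to the numbers of Example 2.8; the names `total_probability` and `posterior` are illustrative, and the partition is indexed from 0 as is usual in Python.

```python
from fractions import Fraction


def total_probability(priors, likelihoods):
    """P(A) = sum_i P(B_i) P(A/B_i) over a partition B_1, ..., B_m (Theorem 2.4)."""
    return sum(p * l for p, l in zip(priors, likelihoods))


def posterior(priors, likelihoods, k):
    """P(B_k/A) by Theorem 2.5; k is a 0-based index into the partition."""
    return priors[k] * likelihoods[k] / total_probability(priors, likelihoods)


priors = [Fraction(1, 3), Fraction(2, 3)]         # P(B1), P(B2)
likelihoods = [Fraction(7, 11), Fraction(3, 13)]  # P(A/B1), P(A/B2)

print(total_probability(priors, likelihoods))  # 157/429
print(posterior(priors, likelihoods, 0))       # 91/157
```

The posteriors over a partition always sum to 1, which gives a quick sanity check on any hand computation.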

Example 2.9. Suppose box $A$ contains 4 red and 5 blue chips and box $B$ contains 6 red and 3 blue chips. A chip is chosen at random from box $A$ and placed in box $B$. Finally, a chip is chosen at random from among those now in box $B$. What is the probability that a blue chip was transferred from box $A$ to box $B$ given that the chip chosen from box $B$ is red?

Answer: Let $E$ represent the event of moving a blue chip from box $A$ to box $B$, and let $R$ represent the event that the chip chosen from box $B$ is red. We want to find the probability that a blue chip was moved from box $A$ to box $B$ given that the chip chosen from $B$ was red. The probability of choosing a red chip from box $A$ is $P(E^c) = \frac{4}{9}$ and the probability of choosing a blue chip from box $A$ is $P(E) = \frac{5}{9}$. If a red chip was moved from box $A$ to box $B$, then box $B$ has 7 red chips and 3 blue chips. Thus the probability of choosing a red chip from box $B$ is $\frac{7}{10}$. Similarly, if a blue chip was moved from box $A$ to box $B$, then the probability of choosing a red chip from box $B$ is $\frac{6}{10}$.

Hence, the probability that a blue chip was transferred from box $A$ to box $B$ given that the chip chosen from box $B$ is red is given by

$$P(E/R) = \frac{P(R/E)\,P(E)}{P(R)} = \frac{\left(\frac{6}{10}\right)\left(\frac{5}{9}\right)}{\left(\frac{7}{10}\right)\left(\frac{4}{9}\right) + \left(\frac{6}{10}\right)\left(\frac{5}{9}\right)} = \frac{15}{29}.$$

Example 2.10. Sixty percent of new drivers have had driver education.

During their ﬁrst year, new drivers without driver education have probability

0.08 of having an accident, but new drivers with driver education have only a

0.05 probability of an accident. What is the probability a new driver has had

driver education, given that the driver has had no accident the ﬁrst year?

Answer: Let $A$ represent the event that a new driver has had driver education and $B$ represent the event that a new driver has had an accident in the first year. Let $A^c$ and $B^c$ be the complements of $A$ and $B$, respectively. We want to find the probability that a new driver has had driver education, given that the driver has had no accidents in the first year, that is, $P(A/B^c)$.

$$\begin{aligned}
P(A/B^c) &= \frac{P(A \cap B^c)}{P(B^c)} \\
&= \frac{P(B^c/A)\,P(A)}{P(B^c/A)\,P(A) + P(B^c/A^c)\,P(A^c)} \\
&= \frac{[1 - P(B/A)]\,P(A)}{[1 - P(B/A)]\,P(A) + [1 - P(B/A^c)]\,[1 - P(A)]} \\
&= \frac{\left(\frac{60}{100}\right)\left(\frac{95}{100}\right)}{\left(\frac{40}{100}\right)\left(\frac{92}{100}\right) + \left(\frac{60}{100}\right)\left(\frac{95}{100}\right)} \\
&= 0.6077.
\end{aligned}$$

Example 2.11. One-half percent of the population has AIDS. There is a test to detect AIDS. A positive test result is supposed to mean that you have AIDS but the test is not perfect. For people with AIDS, the test misses the diagnosis 2% of the time. And for the people without AIDS, the test incorrectly tells 3% of them that they have AIDS. (a) What is the probability that a person picked at random will test positive? (b) What is the probability that you have AIDS given that your test comes back positive?

Answer: Let $A$ denote the event that a person has AIDS and let $B$ denote the event that the test comes out positive.

(a) The probability that a person picked at random will test positive is given by

$$P(\text{test positive}) = (0.005)(0.98) + (0.995)(0.03) = 0.0049 + 0.02985 = 0.03475 \approx 0.035.$$

(b) The probability that you have AIDS given that your test comes back positive is given by

$$P(A/B) = \frac{\text{favorable positive branches}}{\text{total positive branches}} = \frac{(0.005)(0.98)}{(0.005)(0.98) + (0.995)(0.03)} \approx \frac{0.0049}{0.035} = 0.14.$$

[Tree diagram: AIDS with probability 0.005 (test positive 0.98, test negative 0.02); no AIDS with probability 0.995 (test positive 0.03, test negative 0.97).]
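The same answer can be reached by counting people instead of multiplying probabilities, which some readers find easier to follow. The sketch below assumes a hypothetical population of 200,000, chosen only so that every count is a whole number; the variable names are illustrative.

```python
from fractions import Fraction

population = 200_000                          # hypothetical population size
with_disease = population * 5 // 1000         # 0.5% of the population
without_disease = population - with_disease

true_positives = with_disease * 98 // 100     # the test detects 98% of cases
false_positives = without_disease * 3 // 100  # 3% of healthy people test positive

all_positives = true_positives + false_positives
p_positive = Fraction(all_positives, population)
p_disease_given_positive = Fraction(true_positives, all_positives)

print(float(p_positive))                # 0.03475
print(float(p_disease_given_positive))  # about 0.141
```

Out of 6950 positive tests only 980 belong to people with the disease, which is why the posterior probability is so much smaller than the test's 98% detection rate.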

Remark 2.2. This example illustrates why Bayes' theorem is so important.

What we would really like to know in this situation is a ﬁrst-stage result: Do

you have AIDS? But we cannot get this information without an autopsy. The

ﬁrst stage is hidden. But the second stage is not hidden. The best we can

do is make a prediction about the ﬁrst stage. This illustrates why backward

conditional probabilities are so useful.


2.3. Review Exercises

1. Let $P(A) = 0.4$ and $P(A \cup B) = 0.6$. For what value of $P(B)$ are $A$ and $B$ independent?

2. A die is loaded in such a way that the probability of the face with $j$ dots turning up is proportional to $j$ for $j = 1, 2, 3, 4, 5, 6$. In 6 independent throws of this die, what is the probability that each face turns up exactly once?

3. A system engineer is interested in assessing the reliability of a rocket composed of three stages. At take off, the engine of the first stage of the rocket must lift the rocket off the ground. If that engine accomplishes its task, the engine of the second stage must now lift the rocket into orbit. Once the engines in both stages 1 and 2 have performed successfully, the engine of the third stage is used to complete the rocket's mission. The reliability of the rocket is measured by the probability of the completion of the mission. If the probabilities of successful performance of the engines of stages 1, 2 and 3 are 0.99, 0.97 and 0.98, respectively, find the reliability of the rocket.

4. Identical twins come from the same egg and hence are of the same sex. Fraternal twins have a 50-50 chance of being the same sex. Among twins the probability of a fraternal set is $\frac{1}{3}$ and an identical set is $\frac{2}{3}$. If the next set of twins are of the same sex, what is the probability they are identical?

5. In rolling a pair of fair dice, what is the probability that a sum of 7 is

rolled before a sum of 8 is rolled ?

6. A card is drawn at random from an ordinary deck of 52 cards and replaced. This is done a total of 5 independent times. What is the conditional probability of drawing the ace of spades exactly 4 times, given that this ace is drawn at least 4 times?

7. Let $A$ and $B$ be independent events with $P(A) = P(B)$ and $P(A \cup B) = 0.5$. What is the probability of the event $A$?

8. An urn contains 6 red balls and 3 blue balls. One ball is selected at

random and is replaced by a ball of the other color. A second ball is then

chosen. What is the conditional probability that the ﬁrst ball selected is red,

given that the second ball was red?


9. A family has five children. Assuming that the probability of a girl on each birth was 0.5 and that the five births were independent, what is the probability the family has at least one girl, given that they have at least one boy?

10. An urn contains 4 balls numbered 0 through 3. One ball is selected at

random and removed from the urn and not replaced. All balls with nonzero

numbers less than that of the selected ball are also removed from the urn.

Then a second ball is selected at random from those remaining in the urn.

What is the probability that the second ball selected is numbered 3?

11. English and American spelling are rigour and rigor , respectively. A man

staying at Al Rashid hotel writes this word, and a letter taken at random from

his spelling is found to be a vowel. If 40 percent of the English-speaking men

at the hotel are English and 60 percent are American, what is the probability

that the writer is an Englishman?

12. A diagnostic test for a certain disease is said to be 90% accurate in that,

if a person has the disease, the test will detect with probability 0.9. Also, if

a person does not have the disease, the test will report that he or she doesn't

have it with probability 0.9. Only 1% of the population has the disease in

question. If the diagnostic test reports that a person chosen at random from

the population has the disease, what is the conditional probability that the

person, in fact, has the disease?

13. A small grocery store had 10 cartons of milk, 2 of which were sour. If

you are going to buy the 6th carton of milk sold that day at random, ﬁnd

the probability of selecting a carton of sour milk.

14. Suppose $Q$ and $S$ are independent events such that the probability that at least one of them occurs is $\frac{1}{3}$ and the probability that $Q$ occurs but $S$ does not occur is $\frac{1}{9}$. What is the probability of $S$?

15. A box contains 2 green and 3 white balls. A ball is selected at random from the box. If the ball is green, a card is drawn from a deck of 52 cards. If the ball is white, a card is drawn from the deck consisting of just the 16 pictures. (a) What is the probability of drawing a king? (b) What is the probability that a white ball was selected given that a king was drawn?


16. Five urns are numbered 3, 4, 5, 6 and 7, respectively. Inside each urn is $n^2$ dollars where $n$ is the number on the urn. The following experiment is performed: An urn is selected at random. If its number is a prime number the experimenter receives the amount in the urn and the experiment is over. If its number is not a prime number, a second urn is selected from the remaining four and the experimenter receives the total amount in the two urns selected. What is the probability that the experimenter ends up with exactly twenty-five dollars?

17. A cookie jar has 3 red marbles and 1 white marble. A shoebox has 1 red

marble and 1 white marble. Three marbles are chosen at random without

replacement from the cookie jar and placed in the shoebox. Then 2 marbles

are chosen at random and without replacement from the shoebox. What is

the probability that both marbles chosen from the shoebox are red?

18. An urn contains $n$ black balls and $n$ white balls. Three balls are chosen from the urn at random and without replacement. What is the value of $n$ if the probability is $\frac{1}{12}$ that all three balls are white?

19. An urn contains 10 balls numbered 1 through 10. Five balls are drawn

at random and without replacement. Let A be the event that "Exactly two

odd-numbered balls are drawn and they occur on odd-numbered draws from

the urn." What is the probability of event A?

20. I have ﬁve envelopes numbered 3, 4, 5, 6, 7 all hidden in a box. I

pick an envelope – if it is prime then I get the square of that number in

dollars. Otherwise (without replacement) I pick another envelope and then

get the sum of squares of the two envelopes I picked (in dollars). What is

the probability that I will get $25?

Chapter 3

RANDOM VARIABLES

AND

DISTRIBUTION FUNCTIONS

3.1. Introduction

In many random experiments, the elements of the sample space are not necessarily numbers. For example, in a coin tossing experiment the sample space consists of

$$S = \{\text{Head}, \text{Tail}\}.$$

Statistical methods involve primarily numerical data. Hence, one has to

'mathematize' the outcomes of the sample space. This mathematization, or

quantiﬁcation, is achieved through the notion of random variables.

Definition 3.1. Consider a random experiment whose sample space is $S$. A random variable $X$ is a function from the sample space $S$ into the set of real numbers $\mathbb{R}$ such that for each interval $I$ in $\mathbb{R}$, the set $\{s \in S \mid X(s) \in I\}$ is an event in $S$.

In a particular experiment a random variable $X$ would be some function that assigns a real number $X(s)$ to each possible outcome $s$ in the sample space. Given a random experiment, there can be many random variables. This is because, given two finite sets $A$ and $B$, the number of distinct functions from $A$ into $B$ is $|B|^{|A|}$. Here $|A|$ means the cardinality of the set $A$.

A random variable is not a variable, and it is not random, so the name is a misnomer. The following analogy illustrates the point. A random variable is like the Holy Roman Empire: it was not holy, it was not Roman, and it was not an empire. A random variable is neither random nor variable; it is simply a function. The values it takes on are both random and variable.

Definition 3.2. The set $\{x \in \mathbb{R} \mid x = X(s),\ s \in S\}$ is called the space of the random variable $X$.

The space of the random variable $X$ will be denoted by $R_X$. The space of the random variable $X$ is actually the range of the function $X : S \to \mathbb{R}$.

Example 3.1. Consider the coin tossing experiment. Construct a random variable $X$ for this experiment. What is the space of this random variable?

Answer: The sample space of this experiment is given by

$$S = \{\text{Head}, \text{Tail}\}.$$

Let us define a function from $S$ into the set of reals as follows:

$$X(\text{Head}) = 0, \qquad X(\text{Tail}) = 1.$$

Then $X$ is a valid map and thus, by our definition of a random variable, it is a random variable for the coin tossing experiment. The space of this random variable is

$$R_X = \{0, 1\}.$$

[Figure: the random variable $X$ maps the sample space {Head, Tail} to the real line, with $X(\text{Head}) = 0$ and $X(\text{Tail}) = 1$.]

Example 3.2. Consider an experiment in which a coin is tossed ten times. What is the sample space of this experiment? How many elements are in this sample space? Define a random variable for this sample space and then find the space of the random variable.

Answer: The sample space of this experiment is given by

$$S = \{s \mid s \text{ is a sequence of 10 heads or tails}\}.$$

The cardinality of $S$ is

$$|S| = 2^{10}.$$

Let $X : S \to \mathbb{R}$ be a function from the sample space $S$ into the set of reals $\mathbb{R}$ defined as follows:

$$X(s) = \text{number of heads in sequence } s.$$

Then $X$ is a random variable. This random variable, for example, maps the sequence $HHTTTHTTHH$ to the real number 5, that is

$$X(HHTTTHTTHH) = 5.$$

The space of this random variable is

$$R_X = \{0, 1, 2, \ldots, 10\}.$$
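The ten-toss experiment is small enough to enumerate. The following is a minimal sketch, not part of the original text, that builds the sample space, applies the random variable, and collects its space; the names `number_of_heads`, `sample_space`, and `space_of_x` are illustrative assumptions.

```python
from itertools import product


def number_of_heads(sequence):
    """The random variable X: map a toss sequence to its number of heads."""
    return sequence.count("H")


# Sample space S: every sequence of 10 tosses, each toss being H or T.
sample_space = ["".join(s) for s in product("HT", repeat=10)]

# Space of X: the set of values that X actually takes on S.
space_of_x = sorted({number_of_heads(s) for s in sample_space})

print(len(sample_space))               # cardinality of S
print(number_of_heads("HHTTTHTTHH"))   # value of X on the sequence used in the text
print(space_of_x)
```

Treating $X$ as an ordinary function on strings mirrors the remark that a random variable is simply a function on the sample space.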

Now, we introduce some notation. By $(X = x)$ we mean the event $\{s \in S \mid X(s) = x\}$. Similarly, $(a < X < b)$ means the event $\{s \in S \mid a < X(s) < b\}$ of the sample space $S$. These are illustrated in the following diagrams.

[Figure: two diagrams mapping the sample space to the real line. In the first, $(X = x)$ means the event $A$; in the second, $(a < X < b)$ means the event $B$.]

There are three types of random variables: discrete, continuous, and mixed. However, in most applications we encounter either discrete or continuous random variables. In this book we only treat these two types of random variables. First, we consider the discrete case and then we examine the continuous case.

Definition 3.3. If the space of a random variable $X$ is countable, then $X$ is called a discrete random variable.

3.2. Distribution Functions of Discrete Random Variables

Every random variable is characterized through its probability density

function.

Definition 3.4. Let $R_X$ be the space of the random variable $X$. The function $f : R_X \to \mathbb{R}$ defined by

$$f(x) = P(X = x)$$

is called the probability density function (pdf) of $X$.

Example 3.3. In an introductory statistics class of 50 students, there are 11 freshmen, 19 sophomores, 14 juniors and 6 seniors. One student is selected at random. What is the sample space of this experiment? Construct a random variable $X$ for this sample space and then find its space. Further, find the probability density function of this random variable $X$.

Answer: The sample space of this random experiment is

$$S = \{Fr, So, Jr, Sr\}.$$

Define a function $X : S \to \mathbb{R}$ as follows:

$$X(Fr) = 1, \quad X(So) = 2, \quad X(Jr) = 3, \quad X(Sr) = 4.$$

Then clearly $X$ is a random variable in $S$. The space of $X$ is given by

$$R_X = \{1, 2, 3, 4\}.$$

The probability density function of $X$ is given by

$$\begin{aligned}
f(1) &= P(X = 1) = \tfrac{11}{50} \\
f(2) &= P(X = 2) = \tfrac{19}{50} \\
f(3) &= P(X = 3) = \tfrac{14}{50} \\
f(4) &= P(X = 4) = \tfrac{6}{50}.
\end{aligned}$$

Example 3.4. A box contains 5 colored balls, 2 black and 3 white. Balls are drawn successively without replacement. If the random variable $X$ is the number of draws until the last black ball is obtained, find the probability density function for the random variable $X$.

Answer: Let 'B' denote the black ball, and 'W' denote the white ball. Then the sample space $S$ of this experiment is given by (see the figure below)

$$S = \{BB,\ BWB,\ WBB,\ BWWB,\ WBWB,\ WWBB,\ BWWWB,\ WWBWB,\ WWWBB,\ WBWWB\}.$$

Hence the sample space has 10 points, that is $|S| = 10$. Each of these outcomes has probability $\frac{1}{10}$, since every choice of positions for the two black balls among the five draws is equally likely. It is easy to see that the space of the random variable $X$ is $\{2, 3, 4, 5\}$.

[Figure: each outcome in the sample space $S$ is mapped by $X$ to its number of draws on the real line.]

Therefore, the probability density function of $X$ is given by

$$\begin{aligned}
f(2) &= P(X = 2) = \tfrac{1}{10}, & f(3) &= P(X = 3) = \tfrac{2}{10}, \\
f(4) &= P(X = 4) = \tfrac{3}{10}, & f(5) &= P(X = 5) = \tfrac{4}{10}.
\end{aligned}$$

Thus

$$f(x) = \frac{x - 1}{10}, \qquad x = 2, 3, 4, 5.$$
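As a check on this density, one can enumerate every ordering of the five balls and record the draw on which the second black ball appears. This is a sketch under the assumption that the balls are individually labeled (the labels `B1`, `B2`, `W1`, `W2`, `W3` and the function name are hypothetical), so that all 120 orderings are equally likely.

```python
from collections import Counter
from fractions import Fraction
from itertools import permutations

# Hypothetical labels: two black balls and three white balls.
balls = ["B1", "B2", "W1", "W2", "W3"]


def draws_until_last_black(order):
    """Position (1-based) of the draw that produces the last black ball."""
    return max(i for i, ball in enumerate(order, start=1) if ball.startswith("B"))


orders = list(permutations(balls))
counts = Counter(draws_until_last_black(order) for order in orders)

# Exact probabilities as fractions of the equally likely orderings.
pmf = {x: Fraction(c, len(orders)) for x, c in sorted(counts.items())}

for x, p in pmf.items():
    print(x, p)
```

Drawing stops in the experiment once the last black ball appears, but continuing the ordering to all five balls does not change the position of that ball, which is why full permutations can be used.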

Example 3.5. A pair of dice consisting of a six-sided die and a four-sided die is rolled and the sum is determined. Let the random variable $X$ denote this sum. Find the sample space, the space of the random variable, and the probability density function of $X$.

Answer: The sample space of this random experiment is given by

$$S = \left\{ \begin{array}{cccccc}
(1,1) & (1,2) & (1,3) & (1,4) & (1,5) & (1,6) \\
(2,1) & (2,2) & (2,3) & (2,4) & (2,5) & (2,6) \\
(3,1) & (3,2) & (3,3) & (3,4) & (3,5) & (3,6) \\
(4,1) & (4,2) & (4,3) & (4,4) & (4,5) & (4,6)
\end{array} \right\}.$$

The space of the random variable $X$ is given by

$$R_X = \{2, 3, 4, 5, 6, 7, 8, 9, 10\}.$$

Therefore, the probability density function of $X$ is given by

$$\begin{aligned}
f(2) &= P(X = 2) = \tfrac{1}{24}, & f(3) &= P(X = 3) = \tfrac{2}{24}, \\
f(4) &= P(X = 4) = \tfrac{3}{24}, & f(5) &= P(X = 5) = \tfrac{4}{24}, \\
f(6) &= P(X = 6) = \tfrac{4}{24}, & f(7) &= P(X = 7) = \tfrac{4}{24}, \\
f(8) &= P(X = 8) = \tfrac{3}{24}, & f(9) &= P(X = 9) = \tfrac{2}{24}, \\
f(10) &= P(X = 10) = \tfrac{1}{24}.
\end{aligned}$$
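The same table can be produced by counting outcomes. The sketch below assumes both dice are fair, so that the 24 ordered pairs are equally likely; the variable names are illustrative only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

four_sided = range(1, 5)
six_sided = range(1, 7)

# Sample space: ordered pairs (four-sided result, six-sided result).
outcomes = list(product(four_sided, six_sided))

# Count how many outcomes give each sum, then divide by |S|.
counts = Counter(a + b for a, b in outcomes)
pmf = {total: Fraction(c, len(outcomes)) for total, c in sorted(counts.items())}

for total, p in pmf.items():
    print(total, p)
```

Note that `Fraction` reduces automatically, so for instance 4/24 is printed as 1/6.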

Example 3.6. A fair coin is tossed 3 times. Let the random variable $X$ denote the number of heads in 3 tosses of the coin. Find the sample space, the space of the random variable, and the probability density function of $X$.

Answer: The sample space $S$ of this experiment consists of all binary sequences of length 3, that is

$$S = \{TTT,\ TTH,\ THT,\ HTT,\ THH,\ HTH,\ HHT,\ HHH\}.$$

[Figure: each of the eight outcomes in the sample space $S$ is mapped by $X$ to its number of heads on the real line.]

The space of this random variable is given by

$$R_X = \{0, 1, 2, 3\}.$$

Therefore, the probability density function of $X$ is given by

$$\begin{aligned}
f(0) &= P(X = 0) = \tfrac{1}{8} \\
f(1) &= P(X = 1) = \tfrac{3}{8} \\
f(2) &= P(X = 2) = \tfrac{3}{8} \\
f(3) &= P(X = 3) = \tfrac{1}{8}.
\end{aligned}$$

This can be written as follows:

$$f(x) = \binom{3}{x} \left(\frac{1}{2}\right)^{x} \left(\frac{1}{2}\right)^{3-x}, \qquad x = 0, 1, 2, 3.$$
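The closed form can be compared against direct enumeration of the eight outcomes. This is a small sketch assuming a fair coin; the dictionary names `enumerated` and `formula` are illustrative.

```python
from fractions import Fraction
from itertools import product
from math import comb

# All binary sequences of length 3.
tosses = ["".join(t) for t in product("HT", repeat=3)]

# Density obtained by counting outcomes with exactly x heads.
enumerated = {
    x: Fraction(sum(1 for t in tosses if t.count("H") == x), len(tosses))
    for x in range(4)
}

# Density obtained from the binomial-coefficient formula in the text.
formula = {
    x: comb(3, x) * Fraction(1, 2) ** x * Fraction(1, 2) ** (3 - x)
    for x in range(4)
}

print(enumerated == formula)
```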

The probability density function $f(x)$ of a random variable $X$ completely characterizes it. Some basic properties of a discrete probability density function are summarized below.

Theorem 3.1. If $X$ is a discrete random variable with space $R_X$ and probability density function $f(x)$, then

(a) $f(x) \ge 0$ for all $x$ in $R_X$, and

(b) $\displaystyle\sum_{x \in R_X} f(x) = 1$.

Example 3.7. If the probability density function of a random variable $X$ with space $R_X = \{1, 2, 3, \ldots, 12\}$ is given by

$$f(x) = k\,(2x - 1),$$

then what is the value of the constant $k$?

Answer:

$$\begin{aligned}
1 &= \sum_{x \in R_X} f(x) \\
&= \sum_{x \in R_X} k\,(2x - 1) \\
&= \sum_{x=1}^{12} k\,(2x - 1) \\
&= k \left[ 2 \sum_{x=1}^{12} x - 12 \right] \\
&= k \left[ 2\,\frac{(12)(13)}{2} - 12 \right] \\
&= 144\,k.
\end{aligned}$$

Hence

$$k = \frac{1}{144}.$$

Definition 3.5. The cumulative distribution function $F(x)$ of a random variable $X$ is defined as

$$F(x) = P(X \le x)$$

for all real numbers $x$.

Theorem 3.2. If $X$ is a random variable with the space $R_X$, then

$$F(x) = \sum_{t \le x} f(t)$$

for $x \in R_X$.

Example 3.8. If the probability density function of the random variable $X$ is given by

$$f(x) = \frac{1}{144}\,(2x - 1) \qquad \text{for } x = 1, 2, 3, \ldots, 12,$$

then find the cumulative distribution function of $X$.

Answer: The space of the random variable $X$ is given by

$$R_X = \{1, 2, 3, \ldots, 12\}.$$

Then

$$\begin{aligned}
F(1) &= \sum_{t \le 1} f(t) = f(1) = \frac{1}{144} \\
F(2) &= \sum_{t \le 2} f(t) = f(1) + f(2) = \frac{1}{144} + \frac{3}{144} = \frac{4}{144} \\
F(3) &= \sum_{t \le 3} f(t) = f(1) + f(2) + f(3) = \frac{1}{144} + \frac{3}{144} + \frac{5}{144} = \frac{9}{144} \\
&\;\;\vdots \\
F(12) &= \sum_{t \le 12} f(t) = f(1) + f(2) + \cdots + f(12) = 1.
\end{aligned}$$

$F(x)$ represents the accumulation of $f(t)$ up to $t \le x$.
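The normalizing constant and the running sums can be computed exactly with rational arithmetic. The sketch below is an illustration, not the author's method; `space`, `k`, `pmf`, and `cdf` are hypothetical names for the density $f(x) = k(2x - 1)$ on $\{1, \ldots, 12\}$.

```python
from fractions import Fraction
from itertools import accumulate

space = range(1, 13)

# Choose k so that the values k(2x - 1) sum to one over the space.
k = Fraction(1, sum(2 * x - 1 for x in space))
pmf = {x: k * (2 * x - 1) for x in space}

# The cdf at x is the running total of the density up to and including x.
cdf = dict(zip(space, accumulate(pmf.values())))

print(k)
print(cdf[1], cdf[2], cdf[3], cdf[12])
```

Because the partial sums of the odd numbers are perfect squares, the running totals equal $x^2/144$ at each point of the space, which matches the values $\frac{1}{144}, \frac{4}{144}, \frac{9}{144}$ worked out by hand.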

Theorem 3.3. Let $X$ be a random variable with cumulative distribution function $F(x)$. Then the cumulative distribution function satisfies the following:

(a) $F(-\infty) = 0$,

(b) $F(\infty) = 1$, and

(c) $F(x)$ is an increasing function, that is, if $x < y$, then $F(x) \le F(y)$ for all reals $x, y$.

The proof of this theorem is trivial and we leave it to the students.

Theorem 3.4. If the space $R_X$ of the random variable $X$ is given by $R_X = \{x_1 < x_2 < x_3 < \cdots < x_n\}$, then

$$\begin{aligned}
f(x_1) &= F(x_1) \\
f(x_2) &= F(x_2) - F(x_1) \\
f(x_3) &= F(x_3) - F(x_2) \\
&\;\;\vdots \\
f(x_n) &= F(x_n) - F(x_{n-1}).
\end{aligned}$$

[Figure: step graph of the cumulative distribution function with jumps of height $f(x_1), f(x_2), f(x_3), f(x_4)$ at $x_1, x_2, x_3, x_4$, reaching $F(x_4) = 1$.]

Theorem 3.2 tells us how to find the cumulative distribution function from the probability density function, whereas Theorem 3.4 tells us how to find the probability density function given the cumulative distribution function.

Example 3.9. Find the probability density function of the random variable $X$ whose cumulative distribution function is

$$F(x) = \begin{cases}
0.00 & \text{if } x < -1 \\
0.25 & \text{if } -1 \le x < 1 \\
0.50 & \text{if } 1 \le x < 3 \\
0.75 & \text{if } 3 \le x < 5 \\
1.00 & \text{if } x \ge 5.
\end{cases}$$

Also, find (a) $P(X \le 3)$, (b) $P(X = 3)$, and (c) $P(X < 3)$.

Answer: The space of this random variable is given by

$$R_X = \{-1, 1, 3, 5\}.$$

By the previous theorem, the probability density function of $X$ is given by

$$\begin{aligned}
f(-1) &= 0.25 \\
f(1) &= 0.50 - 0.25 = 0.25 \\
f(3) &= 0.75 - 0.50 = 0.25 \\
f(5) &= 1.00 - 0.75 = 0.25.
\end{aligned}$$

The probability $P(X \le 3)$ can be computed by using the definition of $F$. Hence

$$P(X \le 3) = F(3) = 0.75.$$

The probability $P(X = 3)$ can be computed from

$$P(X = 3) = F(3) - F(1) = 0.75 - 0.50 = 0.25.$$

Finally, we get $P(X < 3)$ from

$$P(X < 3) = P(X \le 1) = F(1) = 0.5.$$
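The differencing rule of Theorem 3.4 translates directly into a short routine. The following sketch assumes the cdf is supplied as its jump points and the values it takes at those points; `pdf_from_cdf` and the other names are illustrative, not from the text.

```python
from fractions import Fraction

# Jump points of the step cdf and the value of F at each of them.
points = [-1, 1, 3, 5]
cdf_values = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4), Fraction(1)]


def pdf_from_cdf(points, cdf_values):
    """Recover f(x_i) = F(x_i) - F(x_{i-1}), with F taken as 0 below x_1."""
    pmf = {}
    previous = Fraction(0)
    for x, value in zip(points, cdf_values):
        pmf[x] = value - previous
        previous = value
    return pmf


pmf = pdf_from_cdf(points, cdf_values)

p_at_most_3 = sum(p for x, p in pmf.items() if x <= 3)   # P(X <= 3)
p_equal_3 = pmf[3]                                       # P(X = 3)
p_less_than_3 = sum(p for x, p in pmf.items() if x < 3)  # P(X < 3)

print(pmf)
print(p_at_most_3, p_equal_3, p_less_than_3)
```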

We close this section with an example showing that there is no one-to-one correspondence between a random variable and its distribution function. Consider a fair coin tossing experiment with the sample space consisting of a head and a tail, that is $S = \{\text{head}, \text{tail}\}$. Define two random variables $X_1$ and $X_2$ from $S$ as follows:

$$X_1(\text{head}) = 0 \quad \text{and} \quad X_1(\text{tail}) = 1$$

and

$$X_2(\text{head}) = 1 \quad \text{and} \quad X_2(\text{tail}) = 0.$$

It is easy to see that both these random variables have the same distribution function, namely

$$F_{X_i}(x) = \begin{cases}
0 & \text{if } x < 0 \\
\frac{1}{2} & \text{if } 0 \le x < 1 \\
1 & \text{if } 1 \le x
\end{cases}$$

for $i = 1, 2$. Hence there is no one-to-one correspondence between a random variable and its distribution function.

3.3. Distribution Functions of Continuous Random Variables

A random variable $X$ is said to be continuous if its space is either an interval or a union of intervals. The following definition formally defines a continuous random variable.

Definition 3.6. A random variable $X$ is said to be a continuous random variable if there exists a continuous function $f : \mathbb{R} \to [0, \infty)$ such that for every set of real numbers $A$

$$P(X \in A) = \int_A f(x)\,dx. \tag{1}$$

Definition 3.7. The function $f$ in (1) is called the probability density function of the continuous random variable $X$.

It can be easily shown that for every probability density function $f$,

$$\int_{-\infty}^{\infty} f(x)\,dx = 1.$$

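Example 3.10. Is the real valued function $f : \mathbb{R} \to \mathbb{R}$ defined by

$$f(x) = \begin{cases}
2x^{-2} & \text{if } 1 < x < 2 \\
0 & \text{otherwise,}
\end{cases}$$

a probability density function for some random variable $X$?

Answer: We have to show that $f$ is nonnegative and the area under $f(x)$ is unity. Since $f(x) = 2x^{-2} > 0$ on the interval $(1, 2)$ and $f(x) = 0$ elsewhere, it is clear that $f$ is nonnegative. Next, we calculate

$$\begin{aligned}
\int_{-\infty}^{\infty} f(x)\,dx &= \int_{1}^{2} 2x^{-2}\,dx \\
&= -2 \left[ \frac{1}{x} \right]_{1}^{2} \\
&= -2 \left[ \frac{1}{2} - 1 \right] \\
&= 1.
\end{aligned}$$

Thus $f$ is a probability density function.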

Example 3.11. Is the real valued function $f : \mathbb{R} \to \mathbb{R}$ defined by

$$f(x) = \begin{cases}
1 + |x| & \text{if } -1 < x < 1 \\
0 & \text{otherwise,}
\end{cases}$$

a probability density function for some random variable $X$?

Answer: It is easy to see that $f$ is nonnegative, that is $f(x) \ge 0$, since $f(x) = 1 + |x|$. Next we show that the area under $f$ is not unity. For this we compute

$$\begin{aligned}
\int_{-\infty}^{\infty} f(x)\,dx &= \int_{-1}^{1} (1 + |x|)\,dx \\
&= \int_{-1}^{0} (1 - x)\,dx + \int_{0}^{1} (1 + x)\,dx \\
&= \left[ x - \frac{1}{2}x^2 \right]_{-1}^{0} + \left[ x + \frac{1}{2}x^2 \right]_{0}^{1} \\
&= 1 + \frac{1}{2} + 1 + \frac{1}{2} \\
&= 3.
\end{aligned}$$

Thus $f$ is not a probability density function for any random variable $X$.
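The two area computations in Examples 3.10 and 3.11 can be approximated numerically. The sketch below uses a plain midpoint rule, which is an assumption of this illustration rather than anything prescribed by the text; the helper `midpoint_integral` and the function names `f_310` and `f_311` are hypothetical.

```python
def midpoint_integral(f, a, b, n=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


def f_310(x):
    """Candidate density of Example 3.10."""
    return 2.0 / x ** 2 if 1 < x < 2 else 0.0


def f_311(x):
    """Candidate density of Example 3.11."""
    return 1.0 + abs(x) if -1 < x < 1 else 0.0


# Each function vanishes outside the stated interval, so it suffices to
# integrate over that interval only.
area_310 = midpoint_integral(f_310, 1.0, 2.0)
area_311 = midpoint_integral(f_311, -1.0, 1.0)

print(round(area_310, 6))  # close to 1, so f_310 is a density
print(round(area_311, 6))  # close to 3, so f_311 is not
```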

Example 3.12. For what value of the constant $c$ is the real valued function $f : \mathbb{R} \to \mathbb{R}$ given by

$$f(x) = \frac{c}{1 + (x - \theta)^2}, \qquad -\infty < x < \infty,$$

where $\theta$ is a real parameter, a probability density function for a random variable $X$?

Answer: Since $f$ is nonnegative, we see that $c \ge 0$. To find the value of $c$, we use the fact that for a pdf the area is unity, that is (substituting $z = x - \theta$)

$$\begin{aligned}
1 &= \int_{-\infty}^{\infty} f(x)\,dx \\
&= \int_{-\infty}^{\infty} \frac{c}{1 + (x - \theta)^2}\,dx \\
&= \int_{-\infty}^{\infty} \frac{c}{1 + z^2}\,dz \\
&= c \left[ \tan^{-1} z \right]_{-\infty}^{\infty} \\
&= c \left[ \tan^{-1}(\infty) - \tan^{-1}(-\infty) \right] \\
&= c \left[ \frac{1}{2}\pi + \frac{1}{2}\pi \right] \\
&= c\,\pi.
\end{aligned}$$

Hence $c = \frac{1}{\pi}$ and the density function becomes

$$f(x) = \frac{1}{\pi\,[1 + (x - \theta)^2]}, \qquad -\infty < x < \infty.$$

This density function is called the Cauchy distribution function with parameter $\theta$. If a random variable $X$ has this pdf then it is called a Cauchy random variable and is denoted by $X \sim CAU(\theta)$.

This distribution is symmetrical about $\theta$. Further, it achieves its maximum at $x = \theta$. The following figure illustrates the symmetry of the distribution for $\theta = 2$.

Example 3.13. For what value of the constant $c$ is the real valued function $f : \mathbb{R} \to \mathbb{R}$ given by

$$f(x) = \begin{cases}
c & \text{if } a \le x \le b \\
0 & \text{otherwise,}
\end{cases}$$

where $a, b$ are real constants, a probability density function for a random variable $X$?

Answer: Since $f$ is a pdf, $c$ is nonnegative. Further, since the area under $f$ is unity, we get

$$\begin{aligned}
1 &= \int_{-\infty}^{\infty} f(x)\,dx \\
&= \int_{a}^{b} c\,dx \\
&= c\,[x]_{a}^{b} \\
&= c\,[b - a].
\end{aligned}$$

Hence $c = \frac{1}{b - a}$, and the pdf becomes

$$f(x) = \begin{cases}
\frac{1}{b - a} & \text{if } a \le x \le b \\
0 & \text{otherwise.}
\end{cases}$$

This probability density function is called the uniform distribution on the interval $[a, b]$. If a random variable $X$ has this pdf then it is called a uniform random variable and is denoted by $X \sim UNIF(a, b)$. The following is a graph of the probability density function of a uniform random variable on the interval $[2, 5]$.

Definition 3.8. Let $f(x)$ be the probability density function of a continuous random variable $X$. The cumulative distribution function $F(x)$ of $X$ is defined as

$$F(x) = P(X \le x) = \int_{-\infty}^{x} f(t)\,dt.$$

The cumulative distribution function $F(x)$ represents the area under the probability density function $f(x)$ on the interval $(-\infty, x)$ (see figure below).

Like the discrete case, the cdf is an increasing function of $x$, and it takes value 0 at negative infinity and 1 at positive infinity.

Theorem 3.5. If $F(x)$ is the cumulative distribution function of a continuous random variable $X$, the probability density function $f(x)$ of $X$ is the derivative of $F(x)$, that is

$$\frac{d}{dx} F(x) = f(x).$$

Proof: By the Fundamental Theorem of Calculus, we get

$$\begin{aligned}
\frac{d}{dx} \big( F(x) \big) &= \frac{d}{dx} \left( \int_{-\infty}^{x} f(t)\,dt \right) \\
&= f(x)\,\frac{dx}{dx} = f(x).
\end{aligned}$$

This theorem tells us that if the random variable is continuous, then we can find the pdf given the cdf by taking the derivative of the cdf. Recall that for a discrete random variable, the pdf at a point in the space of the random variable can be obtained from the cdf by taking the difference between the cdf at the point and the cdf immediately below the point.

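Example 3.14. What is the cumulative distribution function of the Cauchy random variable with parameter $\theta$?

Answer: The cdf of $X$ is given by

$$\begin{aligned}
F(x) &= \int_{-\infty}^{x} f(t)\,dt \\
&= \int_{-\infty}^{x} \frac{1}{\pi\,[1 + (t - \theta)^2]}\,dt \\
&= \int_{-\infty}^{x - \theta} \frac{1}{\pi\,[1 + z^2]}\,dz \\
&= \frac{1}{\pi} \tan^{-1}(x - \theta) + \frac{1}{2}.
\end{aligned}$$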

Example 3.15. What is the probability density function of the random variable whose cdf is

$$F(x) = \frac{1}{1 + e^{-x}}, \qquad -\infty < x < \infty?$$

Answer: The pdf of the random variable is given by

$$\begin{aligned}
f(x) &= \frac{d}{dx} F(x) \\
&= \frac{d}{dx} \left( \frac{1}{1 + e^{-x}} \right) \\
&= \frac{d}{dx} \left( 1 + e^{-x} \right)^{-1} \\
&= (-1)\,(1 + e^{-x})^{-2}\,\frac{d}{dx} \left( 1 + e^{-x} \right) \\
&= \frac{e^{-x}}{(1 + e^{-x})^2}.
\end{aligned}$$

Next, we brieﬂy discuss the problem of ﬁnding probability when the cdf

is given. We summarize our results in the following theorem.

Theorem 3.6. Let $X$ be a continuous random variable whose cdf is $F(x)$. Then the following are true:

(a) $P(X < x) = F(x)$,

(b) $P(X > x) = 1 - F(x)$,

(c) $P(X = x) = 0$, and

(d) $P(a < X < b) = F(b) - F(a)$.
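The interval rule $P(a < X < b) = F(b) - F(a)$ can be illustrated with the cdf $F(x) = 1/(1 + e^{-x})$ and the density $f(x) = e^{-x}/(1 + e^{-x})^2$ from Example 3.15. This is a sketch only: the interval $(-1, 1)$, the midpoint rule, and all function names are assumptions made for the illustration.

```python
import math


def logistic_cdf(x):
    """cdf of Example 3.15."""
    return 1.0 / (1.0 + math.exp(-x))


def logistic_pdf(x):
    """Density obtained in Example 3.15 by differentiating the cdf."""
    return math.exp(-x) / (1.0 + math.exp(-x)) ** 2


def midpoint_integral(f, a, b, n=100_000):
    """Approximate the integral of f over [a, b] with the midpoint rule."""
    h = (b - a) / n
    return h * sum(f(a + (i + 0.5) * h) for i in range(n))


a, b = -1.0, 1.0
prob_from_cdf = logistic_cdf(b) - logistic_cdf(a)      # F(b) - F(a)
prob_from_pdf = midpoint_integral(logistic_pdf, a, b)  # area under f on (a, b)

print(round(prob_from_cdf, 6), round(prob_from_pdf, 6))
```

Both routes give the same number to within the accuracy of the numerical integration, which is the content of the interval rule.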

3.4. Percentiles for Continuous Random Variables

In this section, we discuss various percentiles of a continuous random

variable. If the random variable is discrete, then to discuss percentile, we

have to know the order statistics of samples. We shall treat the percentile of

discrete random variable in Chapter 13.

Definition 3.9. Let $p$ be a real number between 0 and 1. A 100$p$th percentile of the distribution of a random variable $X$ is any real number $q$ satisfying

$$P(X \le q) \le p \quad \text{and} \quad P(X > q) \le 1 - p.$$

A 100$p$th percentile is a measure of location for the probability distribution in the sense that $q$ divides the distribution of the probability mass into two parts, one having probability mass $p$ and the other having probability mass $1 - p$ (see diagram below).

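Example 3.16. If the random variable $X$ has the density function

$$f(x) = \begin{cases}
e^{x - 2} & \text{for } x < 2 \\
0 & \text{otherwise,}
\end{cases}$$

then what is the 75th percentile of $X$?

Answer: Since $100p = 75$, we get $p = 0.75$. By definition of percentile, we have

$$\begin{aligned}
0.75 = p &= \int_{-\infty}^{q} f(x)\,dx \\
&= \int_{-\infty}^{q} e^{x - 2}\,dx \\
&= \left[ e^{x - 2} \right]_{-\infty}^{q} \\
&= e^{q - 2}.
\end{aligned}$$

From this, solving for $q$, we get the 75th percentile to be

$$q = 2 + \ln\left( \frac{3}{4} \right).$$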

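Example 3.17. What is the 87.5th percentile for the distribution with density function

$$f(x) = \frac{1}{2}\,e^{-|x|}, \qquad -\infty < x < \infty?$$

Answer: Note that this density function is symmetric about the $y$-axis, that is $f(x) = f(-x)$. Hence

$$\int_{-\infty}^{0} f(x)\,dx = \frac{1}{2}.$$

Now we compute the 87.5th percentile $q$ of the above distribution.

$$\begin{aligned}
\frac{87.5}{100} &= \int_{-\infty}^{q} f(x)\,dx \\
&= \int_{-\infty}^{0} \frac{1}{2}\,e^{-|x|}\,dx + \int_{0}^{q} \frac{1}{2}\,e^{-|x|}\,dx \\
&= \int_{-\infty}^{0} \frac{1}{2}\,e^{x}\,dx + \int_{0}^{q} \frac{1}{2}\,e^{-x}\,dx \\
&= \frac{1}{2} + \int_{0}^{q} \frac{1}{2}\,e^{-x}\,dx \\
&= \frac{1}{2} + \frac{1}{2} - \frac{1}{2}\,e^{-q}.
\end{aligned}$$

Therefore, solving for $q$, we get

$$\begin{aligned}
0.125 &= \frac{1}{2}\,e^{-q} \\
q &= -\ln\left( \frac{25}{100} \right) = \ln 4.
\end{aligned}$$

Hence the 87.5th percentile of the distribution is $\ln 4$.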
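When the cdf cannot be inverted by hand, a percentile can be located numerically. The following sketch uses bisection on the cdf of the density $f(x) = \frac{1}{2}e^{-|x|}$; the bisection approach, the bracketing interval, and the names `laplace_cdf` and `percentile` are assumptions of this illustration, not the method used in the text.

```python
import math


def laplace_cdf(x):
    """cdf of the density f(x) = (1/2) exp(-|x|)."""
    if x < 0:
        return 0.5 * math.exp(x)
    return 1.0 - 0.5 * math.exp(-x)


def percentile(cdf, p, lo, hi, tol=1e-12):
    """Find q with cdf(q) = p by bisection, assuming cdf(lo) < p <= cdf(hi)."""
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if cdf(mid) < p:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0


q = percentile(laplace_cdf, 0.875, -50.0, 50.0)
print(q, math.log(4))
```

Bisection relies only on the cdf being increasing and continuous, so the same helper could be reused for other continuous distributions in this section.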

Example 3.18. Let the continuous random variable $X$ have the density function $f(x)$ as shown in the figure below:

What is the 25th percentile of the distribution of $X$?

Answer: Since the line passes through the points $(0, 0)$ and $(a, \frac{1}{4})$, the function $f(x)$ is equal to

$$f(x) = \frac{1}{4a}\,x.$$

Since $f(x)$ is a density function, the area under $f(x)$ should be unity. Hence

$$\begin{aligned}
1 &= \int_{0}^{a} f(x)\,dx \\
&= \int_{0}^{a} \frac{1}{4a}\,x\,dx \\
&= \frac{1}{8a}\,a^2 \\
&= \frac{a}{8}.
\end{aligned}$$

Thus $a = 8$. Hence the probability density function of $X$ is

$$f(x) = \frac{1}{32}\,x.$$

Now we want to find the 25th percentile.

$$\begin{aligned}
\frac{25}{100} &= \int_{0}^{q} f(x)\,dx \\
&= \int_{0}^{q} \frac{1}{32}\,x\,dx \\
&= \frac{1}{64}\,q^2.
\end{aligned}$$

Hence $q = \sqrt{16}$, that is, the 25th percentile of the above distribution is 4.

Deﬁnition 3.10. The 25th and 75th percentiles of any distribution are

called the ﬁrst and the third quartiles, respectively.

Deﬁnition 3.11. The 50th percentile of any distribution is called the median

of the distribution.

The median divides the distribution of the probability mass into two

equal parts (see the following ﬁgure).

If a probability density function f (x ) is symmetric about the y -axis, then the

median is always 0.

Example 3.19. A random variable is called standard normal if its probability density function is of the form

$$f(x) = \frac{1}{\sqrt{2\pi}}\,e^{-\frac{1}{2}x^2}, \qquad -\infty < x < \infty.$$

What is the median of $X$?

Answer: Notice that $f(x) = f(-x)$, hence the probability density function is symmetric about the $y$-axis. Thus the median of $X$ is 0.

Definition 3.12. A mode of the distribution of a continuous random variable $X$ is the value of $x$ where the probability density function $f(x)$ attains a relative maximum (see diagram).

[Figure: graph of a density $f(x)$ with two relative maxima, each marked as a mode.]

A mode of a random variable X is one of its most probable values. A

random variable can have more than one mode.

Example 3.20. Let $X$ be a uniform random variable on the interval $[0, 1]$, that is $X \sim UNIF(0, 1)$. How many modes does $X$ have?

Answer: Since $X \sim UNIF(0, 1)$, the probability density function of $X$ is

$$f(x) = \begin{cases}
1 & \text{if } 0 \le x \le 1 \\
0 & \text{otherwise.}
\end{cases}$$

Hence the derivative of $f(x)$ is

$$f'(x) = 0 \qquad \forall\, x \in (0, 1).$$

Therefore $X$ has infinitely many modes.

Example 3.21. Let $X$ be a Cauchy random variable with parameter $\theta = 0$, that is $X \sim CAU(0)$. What is the mode of $X$?

Answer: Since $X \sim CAU(0)$, the probability density function of $X$ is

$$f(x) = \frac{1}{\pi\,(1 + x^2)}, \qquad -\infty < x < \infty.$$

Hence

$$f'(x) = \frac{-2x}{\pi\,(1 + x^2)^2}.$$

Setting this derivative to 0, we get $x = 0$. Thus the mode of $X$ is 0.

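Example 3.22. Let $X$ be a continuous random variable with density function

$$f(x) = \begin{cases}
x^2 e^{-bx} & \text{for } x \ge 0 \\
0 & \text{otherwise,}
\end{cases}$$

where $b > 0$. What is the mode of $X$?

Answer:

$$\begin{aligned}
0 = \frac{df}{dx} &= 2x\,e^{-bx} - x^2\,b\,e^{-bx} \\
&= (2 - bx)\,x\,e^{-bx}.
\end{aligned}$$

Hence

$$x = 0 \quad \text{or} \quad x = \frac{2}{b}.$$

Since $f(0) = 0$ and $f(x) > 0$ for $x > 0$, the maximum is not at $x = 0$. Thus the mode of $X$ is $\frac{2}{b}$. The graph of $f(x)$ for $b = 4$ is shown below.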
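The location of the mode can be confirmed with a coarse grid search. The sketch below evaluates $x^2 e^{-bx}$ for $b = 4$ on an evenly spaced grid; the grid range, the spacing, and the name `density_shape` are assumptions of this illustration, and any positive constant multiplying the function would leave the location of the maximum unchanged.

```python
import math


def density_shape(x, b):
    """The function x^2 exp(-b x) from Example 3.22, for x >= 0."""
    return x * x * math.exp(-b * x)


b = 4.0

# Grid on [0, 5] with spacing 0.0001.
grid = [i / 10_000 for i in range(0, 50_001)]
mode = max(grid, key=lambda x: density_shape(x, b))

print(mode, 2.0 / b)
```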

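Example 3.23. A continuous random variable has density function

$$f(x) = \begin{cases}
\dfrac{3x^2}{\theta^3} & \text{for } 0 \le x \le \theta \\
0 & \text{otherwise,}
\end{cases}$$

for some $\theta > 0$. What is the ratio of the mode to the median for this distribution?

Answer: For fixed $\theta > 0$, the density function $f(x)$ is an increasing function. Thus, $f(x)$ has its maximum at the right end point of the interval $[0, \theta]$. Hence the mode of this distribution is $\theta$.

Next we compute the median of this distribution.

$$\begin{aligned}
\frac{1}{2} &= \int_{0}^{q} f(x)\,dx \\
&= \int_{0}^{q} \frac{3x^2}{\theta^3}\,dx \\
&= \left[ \frac{x^3}{\theta^3} \right]_{0}^{q} \\
&= \frac{q^3}{\theta^3}.
\end{aligned}$$

Hence

$$q = 2^{-\frac{1}{3}}\,\theta.$$

Thus the ratio of the mode of this distribution to the median is

$$\frac{\text{mode}}{\text{median}} = \frac{\theta}{2^{-\frac{1}{3}}\,\theta} = \sqrt[3]{2}.$$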

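Example 3.24. A continuous random variable has density function

$$f(x) = \begin{cases}
\dfrac{3x^2}{\theta^3} & \text{for } 0 \le x \le \theta \\
0 & \text{otherwise,}
\end{cases}$$

for some $\theta > 0$. What is the probability of $X$ less than the ratio of the mode to the median of this distribution?

Answer: In the previous example, we have shown that the ratio of the mode to the median of this distribution is given by

$$a := \frac{\text{mode}}{\text{median}} = \sqrt[3]{2}.$$

Hence the probability of $X$ less than the ratio of the mode to the median of this distribution is (assuming $a \le \theta$, that is $\theta \ge \sqrt[3]{2}$; otherwise the probability is 1)

$$\begin{aligned}
P(X < a) &= \int_{0}^{a} f(x)\,dx \\
&= \int_{0}^{a} \frac{3x^2}{\theta^3}\,dx \\
&= \left[ \frac{x^3}{\theta^3} \right]_{0}^{a} \\
&= \frac{a^3}{\theta^3} \\
&= \frac{\left( \sqrt[3]{2} \right)^3}{\theta^3} = \frac{2}{\theta^3}.
\end{aligned}$$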

3.5. Review Exercises

1. Let the random variable $X$ have the density function

$$f(x) = \begin{cases}
k\,x & \text{for } 0 \le x \le \sqrt{\frac{2}{k}} \\
0 & \text{elsewhere.}
\end{cases}$$

If the mode of this distribution is at $x = \frac{\sqrt{2}}{4}$, then what is the median of $X$?

2. The random variable $X$ has density function

$$f(x) = \begin{cases}
c\,x^{k+1}\,(1 - x)^k & \text{for } 0 < x < 1 \\
0 & \text{otherwise,}
\end{cases}$$

where $c > 0$ and $1 < k < 2$. What is the mode of $X$?

3. The random variable $X$ has density function

$$f(x) = \begin{cases}
(k + 1)\,x^2 & \text{for } 0 < x < 1 \\
0 & \text{otherwise,}
\end{cases}$$

where $k$ is a constant. What is the median of $X$?

4. What are the median and mode, respectively, for the density function

$$f(x) = \frac{1}{\pi\,(1 + x^2)}, \qquad -\infty < x < \infty?$$

5. What is the 10th percentile of the random variable $X$ whose probability density function is

$$f(x) = \begin{cases}
\frac{1}{\theta}\,e^{-\frac{x}{\theta}} & \text{if } x \ge 0,\ \theta > 0 \\
0 & \text{elsewhere?}
\end{cases}$$

6. What is the median of the random variable $X$ whose probability density function is

$$f(x) = \begin{cases}
\frac{1}{2}\,e^{-\frac{x}{2}} & \text{if } x \ge 0 \\
0 & \text{elsewhere?}
\end{cases}$$

7. A continuous random variable $X$ has the density

$$f(x) = \begin{cases}
\dfrac{3x^2}{8} & \text{for } 0 \le x \le 2 \\
0 & \text{otherwise.}
\end{cases}$$

What is the probability that $X$ is greater than its 75th percentile?

8. What is the probability density function of the random variable $X$ if its cumulative distribution function is given by

$$F(x) = \begin{cases}
0.0 & \text{if } x < 2 \\
0.5 & \text{if } 2 \le x < 3 \\
0.7 & \text{if } 3 \le x < \pi \\
1.0 & \text{if } x \ge \pi?
\end{cases}$$

9. Let the distribution of X for x > 0 be

F(x ) = 1 



k=0

xkex

k! .

What is the density function of X for x > 0?

10. Let $X$ be a random variable with cumulative distribution function

$$F(x) = \begin{cases}
1 - e^{-x} & \text{for } x > 0 \\
0 & \text{for } x \le 0.
\end{cases}$$

What is $P\left( 0 \le e^{X} \le 4 \right)$?

11. Let $X$ be a continuous random variable with density function

$$f(x) = \begin{cases}
a\,x^2\,e^{-10x} & \text{for } x \ge 0 \\
0 & \text{otherwise,}
\end{cases}$$

where $a > 0$. What is the probability of $X$ greater than or equal to the mode of $X$?

12. Let the random variable $X$ have the density function

$$f(x) = \begin{cases}
k\,x & \text{for } 0 \le x \le \sqrt{\frac{2}{k}} \\
0 & \text{elsewhere.}
\end{cases}$$

If the mode of this distribution is at $x = \frac{\sqrt{2}}{4}$, then what is the probability of $X$ less than the median of $X$?

13. The random variable $X$ has density function

$$f(x) = \begin{cases}
(k + 1)\,x^2 & \text{for } 0 < x < 1 \\
0 & \text{otherwise,}
\end{cases}$$

where $k$ is a constant. What is the probability of $X$ between the first and third quartiles?

14. Let $X$ be a random variable having continuous cumulative distribution function $F(x)$. What is the cumulative distribution function of $Y = \max(0, X)$?

15. Let $X$ be a random variable with probability density function

$$f(x) = \frac{2}{3^x} \qquad \text{for } x = 1, 2, 3, \ldots.$$

What is the probability that $X$ is even?

16. An urn contains 5 balls numbered 1 through 5. Two balls are selected

at random without replacement from the urn. If the random variable X

denotes the sum of the numbers on the 2 balls, then what are the space and

the probability density function of X?

17. A pair of six-sided dice is rolled and the sum is determined. If the

random variable X denotes the sum of the numbers rolled, then what are the

space and the probability density function of X?

18. Five digit codes are selected at random from the set {0,1,2, ..., 9 } with

replacement. If the random variable X denotes the number of zeros in ran-

domly chosen codes, then what are the space and the probability density

function of X?

19. An urn contains 10 coins of which 4 are counterfeit. Coins are removed

from the urn, one at a time, until all counterfeit coins are found. If the

random variable X denotes the number of coins removed to ﬁnd the ﬁrst

counterfeit one, then what are the space and the probability density function

of X?

20. Let $X$ be a random variable with probability density function

$$f(x) = \frac{2c}{3^x} \qquad \text{for } x = 1, 2, 3, 4, \ldots, \infty$$

for some constant $c$. What is the value of $c$? What is the probability that $X$ is even?

21. If the random variable $X$ possesses the density function

$$f(x) = \begin{cases}
c\,x & \text{if } 0 \le x \le 2 \\
0 & \text{otherwise,}
\end{cases}$$

then what is the value of $c$ for which $f(x)$ is a probability density function? What is the cumulative distribution function of $X$? Graph the functions $f(x)$ and $F(x)$. Use $F(x)$ to compute $P(1 \le X \le 2)$.

22. The length of time required by students to complete a 1-hour exam is a random variable with a pdf given by

$$f(x) = \begin{cases}
c\,x^2 + x & \text{if } 0 \le x \le 1 \\
0 & \text{otherwise.}
\end{cases}$$

What is the probability that a student finishes in less than half an hour?

23. What is the probability of, when blindfolded, hitting a circle inscribed

on a square wall?

24. Let $f(x)$ be a continuous probability density function. Show that, for every $-\infty < \mu < \infty$ and $\sigma > 0$, the function $\frac{1}{\sigma}\,f\!\left( \frac{x - \mu}{\sigma} \right)$ is also a probability density function.

25. Let $X$ be a random variable with probability density function $f(x)$ and cumulative distribution function $F(x)$. True or False?

(a) $f(x)$ can't be larger than 1. (b) $F(x)$ can't be larger than 1. (c) $f(x)$ can't decrease. (d) $F(x)$ can't decrease. (e) $f(x)$ can't be negative. (f) $F(x)$ can't be negative. (g) Area under $f$ must be 1. (h) Area under $F$ must be 1. (i) $f$ can't jump. (j) $F$ can't jump.

Chapter 4

MOMENTS OF RANDOM

VARIABLES

AND

CHEBYCHEV INEQUALITY

4.1. Moments of Random Variables

In this chapter, we introduce the concepts of various moments of a ran-

dom variable. Further, we examine the expected value and the variance of

random variables in detail. We shall conclude this chapter with a discussion

of Chebychev's inequality.

Definition 4.1. The $n$th moment about the origin of a random variable $X$, denoted by $E(X^n)$, is defined to be

$$E(X^n) = \begin{cases}
\displaystyle\sum_{x \in R_X} x^n f(x) & \text{if } X \text{ is discrete} \\[2ex]
\displaystyle\int_{-\infty}^{\infty} x^n f(x)\,dx & \text{if } X \text{ is continuous}
\end{cases}$$

for $n = 0, 1, 2, 3, \ldots$, provided the right side converges absolutely.

If $n = 1$, then $E(X)$ is called the first moment about the origin. If $n = 2$, then $E(X^2)$ is called the second moment of $X$ about the origin. In general, these moments may or may not exist for a given random variable. If for a random variable, a particular moment does not exist, then we say that the random variable does not have that moment. For these moments to exist one requires absolute convergence of the sum or the integral. Next, we shall define two important characteristics of a random variable, namely the expected value and variance. Occasionally $E(X^n)$ will be written as $E[X^n]$.

4.2. Expected Value of Random Variables

A random variable $X$ is characterized by its probability density function, which defines the relative likelihood of assuming one value over the others. In Chapter 3, we have seen that given a probability density function $f$ of a random variable $X$, one can construct the distribution function $F$ of it through summation or integration. Conversely, the density function $f(x)$ can be obtained as the marginal value or derivative of $F(x)$. The density function can be used to infer a number of characteristics of the underlying random variable. The two most important attributes are measures of location and dispersion. In this section, we treat the measure of location and treat the other measure in the next section.

Definition 4.2. Let $X$ be a random variable with space $R_X$ and probability density function $f(x)$. The mean $\mu_X$ of the random variable $X$ is defined as

$$\mu_X = \begin{cases}
\displaystyle\sum_{x \in R_X} x\,f(x) & \text{if } X \text{ is discrete} \\[2ex]
\displaystyle\int_{-\infty}^{\infty} x\,f(x)\,dx & \text{if } X \text{ is continuous}
\end{cases}$$

if the right hand side exists.

The mean of a random variable is a composite of its values weighted by the

corresponding probabilities. The mean is a measure of central tendency: the

value that the random variable takes "on average." The mean is also called

the expected value of the random variable X and is denoted by E (X ). The

symbol E is called the expectation operator. The expected value of a random

variable may or may not exist.

Example 4.1. If X is a uniform random variable on the interval (2, 7), then

what is the mean of X?

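Answer: The density function of X is

$$f(x) = \begin{cases} \frac{1}{5} & \text{if } 2 < x < 7 \\ 0 & \text{otherwise.} \end{cases}$$

Thus the mean or the expected value of X is

$$\begin{aligned}
\mu_X = E(X) &= \int_{-\infty}^{\infty} x\, f(x)\, dx \\
&= \int_{2}^{7} x\, \frac{1}{5}\, dx \\
&= \frac{1}{10} \left[ x^2 \right]_{2}^{7} \\
&= \frac{1}{10} (49 - 4) \\
&= \frac{45}{10} = \frac{9}{2} = \frac{2 + 7}{2}.
\end{aligned}$$

In general, if $X \sim UNIF(a, b)$, then $E(X) = \frac{1}{2}(a + b)$.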
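The general formula can be checked by repeating the computation of Example 4.1 with arbitrary endpoints. The following is a short sketch, assuming the uniform density $f(x) = \frac{1}{b-a}$ on $(a, b)$ with $a < b$:

$$\begin{aligned}
E(X) &= \int_{a}^{b} x\, \frac{1}{b - a}\, dx \\
&= \frac{1}{2(b - a)} \left[ x^2 \right]_{a}^{b} \\
&= \frac{b^2 - a^2}{2(b - a)} \\
&= \frac{(b - a)(b + a)}{2(b - a)} = \frac{a + b}{2}.
\end{aligned}$$

With $a = 2$ and $b = 7$ this gives $\frac{9}{2}$, in agreement with Example 4.1.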

Example 4.2. If X is a Cauchy random variable with parameter $\theta$, that is $X \sim CAU(\theta)$, then what is the expected value of X?

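Answer: We want to find the expected value of X if it exists. The expected value of X exists if the integral $\int_{\mathbb{R}} x f(x)\, dx$ converges absolutely, that is, if

$$\int_{\mathbb{R}} |x f(x)|\, dx < \infty.$$

If this integral diverges, then the expected value of X does not exist. Hence, let us find out whether $\int_{\mathbb{R}} |x f(x)|\, dx$ converges or not. Substituting $z = x - \theta$, and using $|z + \theta| \ge |z| - |\theta|$ together with $\int_{-\infty}^{\infty} \frac{1}{\pi[1 + z^2]}\, dz = 1$, we get

$$\begin{aligned}
\int_{\mathbb{R}} |x f(x)|\, dx &= \int_{-\infty}^{\infty} \left| x\, \frac{1}{\pi[1 + (x - \theta)^2]} \right| dx \\
&= \int_{-\infty}^{\infty} \left| (z + \theta)\, \frac{1}{\pi[1 + z^2]} \right| dz \\
&\ge 2 \int_{0}^{\infty} \frac{z}{\pi[1 + z^2]}\, dz - |\theta| \\
&= \left[ \frac{1}{\pi} \ln(1 + z^2) \right]_{0}^{\infty} - |\theta| \\
&= \frac{1}{\pi} \lim_{b \to \infty} \ln(1 + b^2) - |\theta| \\
&= \infty.
\end{aligned}$$

Since the above integral diverges, the expected value for the Cauchy distribution does not exist.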

Remark 4.1. Indeed, it can be shown that for a random variable X with the Cauchy distribution, $E(X^n)$ does not exist for any natural number n. Thus, Cauchy random variables have no moments at all.

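Example 4.3. If the probability density function of the random variable X is

$$f(x) = \begin{cases} (1 - p)^{x-1}\, p & \text{if } x = 1, 2, 3, 4, \ldots \\ 0 & \text{otherwise,} \end{cases}$$

then what is the expected value of X?

Answer: The expected value of X is

$$\begin{aligned}
E(X) &= \sum_{x \in R_X} x\, f(x) \\
&= \sum_{x=1}^{\infty} x\, (1 - p)^{x-1}\, p \\
&= p\, \frac{d}{dp} \left\{ \int \left[ \sum_{x=1}^{\infty} x\, (1 - p)^{x-1} \right] dp \right\} \\
&= p\, \frac{d}{dp} \left\{ \sum_{x=1}^{\infty} \int x\, (1 - p)^{x-1}\, dp \right\} \\
&= -p\, \frac{d}{dp} \left\{ \sum_{x=1}^{\infty} (1 - p)^{x} \right\} \\
&= -p\, \frac{d}{dp} \left\{ (1 - p)\, \frac{1}{1 - (1 - p)} \right\} \\
&= -p\, \frac{d}{dp} \left\{ \frac{1}{p} \right\} \\
&= p \left( \frac{1}{p} \right)^2 = \frac{1}{p}.
\end{aligned}$$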

Hence the expected value of X is the reciprocal of the parameter p.
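The reciprocal rule can be sanity-checked numerically. The sketch below truncates the series $\sum x (1-p)^{x-1} p$ after a fixed number of terms; the function name, the value p = 0.3 and the cutoff of 2000 terms are illustrative assumptions, not part of the text.

```python
# Partial-sum check of E(X) = 1/p for a geometric random variable.
# p = 0.3 and the 2000-term cutoff are illustrative choices.

def geometric_mean_partial(p, terms=2000):
    """Return sum_{x=1}^{terms} x * (1 - p)**(x - 1) * p."""
    return sum(x * (1 - p) ** (x - 1) * p for x in range(1, terms + 1))

p = 0.3
approx = geometric_mean_partial(p)
print(approx, 1 / p)  # the two values should agree to many decimal places
```

Because the terms decay geometrically, the truncated tail is negligible for any p that is not extremely close to 0.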

Definition 4.3. A random variable X whose probability density function is given by

$$f(x) = \begin{cases} (1 - p)^{x-1}\, p & \text{if } x = 1, 2, 3, 4, \ldots \\ 0 & \text{otherwise} \end{cases}$$

is called a geometric random variable and is denoted by $X \sim GEO(p)$.

Example 4.4. A couple decides to have 3 children. If none of the 3 is a

girl, they will try again; and if they still don't get a girl, they will try once

more. If the random variable X denotes the number of children the couple

will have following this scheme, then what is the expected value of X?

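Answer: Since the couple can have 3 or 4 or 5 children, the space of the random variable X is

$$R_X = \{3, 4, 5\}.$$

The probability density function of X is given by

$$\begin{aligned}
f(3) = P(X = 3) &= P(\text{at least one girl}) \\
&= 1 - P(\text{no girls}) \\
&= 1 - P(\text{3 boys in 3 tries}) \\
&= 1 - \left( P(\text{1 boy in each try}) \right)^3 \\
&= 1 - \left( \frac{1}{2} \right)^3 = \frac{7}{8},
\end{aligned}$$

$$\begin{aligned}
f(4) = P(X = 4) &= P(\text{3 boys and 1 girl in last try}) \\
&= \left( P(\text{1 boy in each try}) \right)^3 P(\text{1 girl in last try}) \\
&= \left( \frac{1}{2} \right)^3 \left( \frac{1}{2} \right) = \frac{1}{16},
\end{aligned}$$

$$\begin{aligned}
f(5) = P(X = 5) &= P(\text{4 boys and 1 girl in last try}) + P(\text{5 boys in 5 tries}) \\
&= \left( P(\text{1 boy in each try}) \right)^4 P(\text{1 girl in last try}) + \left( P(\text{1 boy in each try}) \right)^5 \\
&= \left( \frac{1}{2} \right)^4 \left( \frac{1}{2} \right) + \left( \frac{1}{2} \right)^5 = \frac{1}{16}.
\end{aligned}$$

Hence, the expected value of the random variable is

$$\begin{aligned}
E(X) &= \sum_{x \in R_X} x\, f(x) = \sum_{x=3}^{5} x\, f(x) \\
&= 3 f(3) + 4 f(4) + 5 f(5) \\
&= 3\, \frac{14}{16} + 4\, \frac{1}{16} + 5\, \frac{1}{16} \\
&= \frac{42 + 4 + 5}{16} \\
&= \frac{51}{16} = 3 \tfrac{3}{16}.
\end{aligned}$$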

Remark 4.2. We interpret this physically as meaning that if many couples have children according to this scheme, it is likely that the average family size would be near $3 \tfrac{3}{16}$ children.
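The arithmetic of Example 4.4 can be reproduced exactly with rational numbers. This is a minimal sketch; the variable names are illustrative, and it assumes, as the example does, that each child is a boy or a girl with probability 1/2 independently.

```python
from fractions import Fraction

half = Fraction(1, 2)

# Probability density function of X, the number of children, from Example 4.4.
pmf = {
    3: 1 - half**3,               # at least one girl among the first three
    4: half**3 * half,            # three boys, then a girl
    5: half**4 * half + half**5,  # four boys then a girl, or five boys
}

total = sum(pmf.values())                               # should be exactly 1
expected_children = sum(x * p for x, p in pmf.items())  # E(X)
print(total, expected_children)
```

Using `Fraction` avoids rounding, so the result can be compared directly with the value 51/16 obtained by hand.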

Example 4.5. A lot of 8 TV sets includes 3 that are defective. If 4 of the

sets are chosen at random for shipment to a hotel, how many defective sets

can they expect?

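Answer: Let X be the random variable representing the number of defective TV sets in a shipment of 4. Then the space of the random variable X is

$$R_X = \{0, 1, 2, 3\}.$$

Then the probability density function of X is given by

$$\begin{aligned}
f(x) = P(X = x) &= P(x \text{ defective TV sets in a shipment of four}) \\
&= \frac{\binom{3}{x} \binom{5}{4-x}}{\binom{8}{4}}, \qquad x = 0, 1, 2, 3.
\end{aligned}$$

Hence, we have

$$f(0) = \frac{\binom{3}{0} \binom{5}{4}}{\binom{8}{4}} = \frac{5}{70}, \qquad
f(1) = \frac{\binom{3}{1} \binom{5}{3}}{\binom{8}{4}} = \frac{30}{70},$$

$$f(2) = \frac{\binom{3}{2} \binom{5}{2}}{\binom{8}{4}} = \frac{30}{70}, \qquad
f(3) = \frac{\binom{3}{3} \binom{5}{1}}{\binom{8}{4}} = \frac{5}{70}.$$

Therefore, the expected value of X is given by

$$\begin{aligned}
E(X) &= \sum_{x \in R_X} x\, f(x) = \sum_{x=0}^{3} x\, f(x) \\
&= f(1) + 2 f(2) + 3 f(3) \\
&= \frac{30}{70} + 2\, \frac{30}{70} + 3\, \frac{5}{70} \\
&= \frac{30 + 60 + 15}{70} \\
&= \frac{105}{70} = 1.5.
\end{aligned}$$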


Remark 4.3. Since they cannot possibly get 1.5 defective TV sets, it should

be noted that the term "expect" is not used in its colloquial sense. Indeed, it

should be interpreted as an average pertaining to repeated shipments made

under given conditions.
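The counts in Example 4.5 can be verified with binomial coefficients from the standard library. The sketch below is illustrative: the function name and its keyword parameters are assumptions chosen for readability.

```python
from fractions import Fraction
from math import comb

def hypergeom_pmf(x, defective=3, good=5, drawn=4):
    """P(x defective sets among `drawn` sets chosen from defective + good sets)."""
    return Fraction(comb(defective, x) * comb(good, drawn - x),
                    comb(defective + good, drawn))

pmf_tv = {x: hypergeom_pmf(x) for x in range(4)}
expected_defective = sum(x * p for x, p in pmf_tv.items())
print(pmf_tv, expected_defective)
```

The expected value comes out as a fraction, which makes the point of Remark 4.3 concrete: it is an average over repeated shipments, not a count that any single shipment can attain.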

Now we prove a result concerning the expected value operator E.

Theorem 4.1. Let X be a random variable with pdf $f(x)$. If a and b are any two real numbers, then

$$E(aX + b) = a\, E(X) + b.$$

Proof: We will prove only for the continuous case.

$$\begin{aligned}
E(aX + b) &= \int_{-\infty}^{\infty} (a x + b)\, f(x)\, dx \\
&= \int_{-\infty}^{\infty} a x\, f(x)\, dx + \int_{-\infty}^{\infty} b\, f(x)\, dx \\
&= a \int_{-\infty}^{\infty} x\, f(x)\, dx + b \\
&= a\, E(X) + b.
\end{aligned}$$

To prove the discrete case, replace the integral by summation. This completes

the proof.

4.3. Variance of Random Variables

The variance of a random variable X measures the spread of its distribution.

Definition 4.4. Let X be a random variable with mean $\mu_X$. The variance of X, denoted by $Var(X)$, is defined as

$$Var(X) = E\left( [X - \mu_X]^2 \right).$$

It is also denoted by $\sigma_X^2$. The positive square root of the variance is

called the standard deviation of the random variable X . Like variance, the

standard deviation also measures the spread. The following theorem tells us

how to compute the variance in an alternative way.

Theorem 4.2. If X is a random variable with mean $\mu_X$ and variance $\sigma_X^2$, then

$$\sigma_X^2 = E(X^2) - (\mu_X)^2.$$

Proof:

$$\begin{aligned}
\sigma_X^2 &= E\left( [X - \mu_X]^2 \right) \\
&= E\left( X^2 - 2 \mu_X X + \mu_X^2 \right) \\
&= E(X^2) - 2 \mu_X E(X) + (\mu_X)^2 \\
&= E(X^2) - 2 \mu_X \mu_X + (\mu_X)^2 \\
&= E(X^2) - (\mu_X)^2.
\end{aligned}$$

Theorem 4.3. If X is a random variable with mean $\mu_X$ and variance $\sigma_X^2$, then

$$Var(aX + b) = a^2\, Var(X),$$

where a and b are arbitrary real constants.

Proof:

$$\begin{aligned}
Var(aX + b) &= E\left( [(aX + b) - \mu_{aX+b}]^2 \right) \\
&= E\left( [aX + b - E(aX + b)]^2 \right) \\
&= E\left( [aX + b - a \mu_X - b]^2 \right) \\
&= E\left( a^2 [X - \mu_X]^2 \right) \\
&= a^2 E\left( [X - \mu_X]^2 \right) \\
&= a^2\, Var(X).
\end{aligned}$$
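Theorems 4.1 and 4.3 can be illustrated on a small discrete distribution. The sketch below is only an example: the density `pmf_x` and the constants a = -2, b = 7 are arbitrary assumptions, and the helper names are hypothetical.

```python
from fractions import Fraction

def mean(pmf):
    """E(X) for a discrete density given as {value: probability}."""
    return sum(x * p for x, p in pmf.items())

def variance(pmf):
    """Var(X) = E([X - mu]^2), computed directly from the definition."""
    m = mean(pmf)
    return sum((x - m) ** 2 * p for x, p in pmf.items())

def affine(pmf, a, b):
    """Density of aX + b (a must be nonzero so that distinct values stay distinct)."""
    return {a * x + b: p for x, p in pmf.items()}

pmf_x = {0: Fraction(1, 4), 1: Fraction(1, 2), 3: Fraction(1, 4)}
a, b = -2, 7
pmf_y = affine(pmf_x, a, b)

print(mean(pmf_y), a * mean(pmf_x) + b)           # Theorem 4.1
print(variance(pmf_y), a ** 2 * variance(pmf_x))  # Theorem 4.3
```

Note that the shift b changes the mean but drops out of the variance, and that a negative a still gives a nonnegative variance because it enters squared.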

Example 4.6. Let X have the density function

$$f(x) = \begin{cases} \frac{2x}{k^2} & \text{for } 0 \le x \le k \\ 0 & \text{otherwise.} \end{cases}$$

For what value of k is the variance of X equal to 2?

Answer: The expected value of X is

$$E(X) = \int_{0}^{k} x\, f(x)\, dx = \int_{0}^{k} x\, \frac{2x}{k^2}\, dx = \frac{2}{3} k.$$

Similarly,

$$E(X^2) = \int_{0}^{k} x^2 f(x)\, dx = \int_{0}^{k} x^2\, \frac{2x}{k^2}\, dx = \frac{2}{4} k^2.$$

Hence, the variance is given by

$$Var(X) = E(X^2) - (\mu_X)^2 = \frac{2}{4} k^2 - \frac{4}{9} k^2 = \frac{1}{18} k^2.$$

Since this variance is given to be 2, we get

$$\frac{1}{18} k^2 = 2$$

and this implies that $k = \pm 6$. But k is given to be greater than 0, hence k must be equal to 6.

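Example 4.7. If the probability density function of the random variable is

$$f(x) = \begin{cases} 1 - |x| & \text{for } |x| < 1 \\ 0 & \text{otherwise,} \end{cases}$$

then what is the variance of X?

Answer: Since $Var(X) = E(X^2) - \mu_X^2$, we need to find the first and second moments of X. The first moment of X is given by

$$\begin{aligned}
\mu_X = E(X) &= \int_{-\infty}^{\infty} x\, f(x)\, dx \\
&= \int_{-1}^{1} x\, (1 - |x|)\, dx \\
&= \int_{-1}^{0} x\, (1 + x)\, dx + \int_{0}^{1} x\, (1 - x)\, dx \\
&= \int_{-1}^{0} (x + x^2)\, dx + \int_{0}^{1} (x - x^2)\, dx \\
&= \frac{1}{3} - \frac{1}{2} + \frac{1}{2} - \frac{1}{3} \\
&= 0.
\end{aligned}$$

The second moment $E(X^2)$ of X is given by

$$\begin{aligned}
E(X^2) &= \int_{-\infty}^{\infty} x^2 f(x)\, dx \\
&= \int_{-1}^{1} x^2 (1 - |x|)\, dx \\
&= \int_{-1}^{0} x^2 (1 + x)\, dx + \int_{0}^{1} x^2 (1 - x)\, dx \\
&= \int_{-1}^{0} (x^2 + x^3)\, dx + \int_{0}^{1} (x^2 - x^3)\, dx \\
&= \frac{1}{3} - \frac{1}{4} + \frac{1}{3} - \frac{1}{4} \\
&= \frac{1}{6}.
\end{aligned}$$

Thus, the variance of X is given by

$$Var(X) = E(X^2) - \mu_X^2 = \frac{1}{6} - 0 = \frac{1}{6}.$$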

Example 4.8. Suppose the random variable X has mean $\mu$ and variance $\sigma^2 > 0$. What are the values of the numbers a and b such that $a + bX$ has mean 0 and variance 1?

Answer: The mean of the random variable $a + bX$ is 0. Hence

$$0 = E(a + bX) = a + b\, E(X) = a + b \mu.$$

Thus $a = -b \mu$. Similarly, the variance of $a + bX$ is 1. That is

$$1 = Var(a + bX) = b^2\, Var(X) = b^2 \sigma^2.$$

Hence

$$b = \frac{1}{\sigma} \quad \text{and} \quad a = -\frac{\mu}{\sigma}$$

or

$$b = -\frac{1}{\sigma} \quad \text{and} \quad a = \frac{\mu}{\sigma}.$$

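Example 4.9. Suppose X has the density function

$$f(x) = \begin{cases} 3 x^2 & \text{for } 0 < x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

What is the expected area of a random isosceles right triangle with hypotenuse X?

Answer: Let ABC denote this random isosceles right triangle, with the right angle at B. Let $AC = x$. Then

$$AB = BC = \frac{x}{\sqrt{2}}, \qquad \text{Area of } ABC = \frac{1}{2} \cdot \frac{x}{\sqrt{2}} \cdot \frac{x}{\sqrt{2}} = \frac{x^2}{4}.$$

The expected area of this random triangle is given by

$$E(\text{area of random } ABC) = \int_{0}^{1} \frac{x^2}{4}\, 3 x^2\, dx = \frac{3}{20}.$$

[Figure: right triangle ABC with hypotenuse AC. The expected area of ABC is 0.15.]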

For the next example, we need the following results. For $-1 < x < 1$, let

$$g(x) = \sum_{k=0}^{\infty} a\, x^k = \frac{a}{1 - x}.$$

Then

$$g'(x) = \sum_{k=1}^{\infty} a\, k\, x^{k-1} = \frac{a}{(1 - x)^2},$$

and

$$g''(x) = \sum_{k=2}^{\infty} a\, k (k - 1)\, x^{k-2} = \frac{2a}{(1 - x)^3}.$$

Example 4.10. If the probability density function of the random variable X is

$$f(x) = \begin{cases} (1 - p)^{x-1}\, p & \text{if } x = 1, 2, 3, 4, \ldots \\ 0 & \text{otherwise,} \end{cases}$$

then what is the variance of X?

Answer: We want to find the variance of X. The variance of X can be written as

$$\begin{aligned}
Var(X) &= E(X^2) - [E(X)]^2 \\
&= E(X(X - 1)) + E(X) - [E(X)]^2.
\end{aligned}$$

We write the variance in this manner because $E(X^2)$ is awkward to sum directly, whereas $E(X(X - 1))$ follows at once from the series for $g''(x)$ given above. From Example 4.3, we know that $E(X) = \frac{1}{p}$. Hence, we now focus on finding the second factorial moment of X, that is $E(X(X - 1))$.

$$\begin{aligned}
E(X(X - 1)) &= \sum_{x=1}^{\infty} x (x - 1) (1 - p)^{x-1}\, p \\
&= \sum_{x=2}^{\infty} x (x - 1) (1 - p) (1 - p)^{x-2}\, p \\
&= \frac{2 p (1 - p)}{(1 - (1 - p))^3} = \frac{2 (1 - p)}{p^2}.
\end{aligned}$$

Hence

$$Var(X) = E(X(X - 1)) + E(X) - [E(X)]^2 = \frac{2 (1 - p)}{p^2} + \frac{1}{p} - \frac{1}{p^2} = \frac{1 - p}{p^2}.$$

4.4. Chebychev Inequality

We have taken it for granted, in Section 4.3, that the standard deviation (which is the positive square root of the variance) measures the spread of a distribution of a random variable. The spread is measured by the area between "two values". The area under the pdf between two values is the probability of X between the two values. If the standard deviation $\sigma$ measures the spread, then $\sigma$ should control the area between the "two values".

It is well known that if the probability density function is standard normal, that is

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} x^2}, \qquad -\infty < x < \infty,$$

then the mean $\mu = 0$ and the standard deviation $\sigma = 1$, and the area between the values $\mu - \sigma$ and $\mu + \sigma$ is about 68%.

Similarly, the area between the values $\mu - 2\sigma$ and $\mu + 2\sigma$ is about 95%. In this way, the standard deviation controls the area between the values $\mu - k\sigma$ and $\mu + k\sigma$ for some k if the distribution is standard normal. If we do not know the probability density function of a random variable, can we find an estimate of the area between the values $\mu - k\sigma$ and $\mu + k\sigma$ for some given k? This problem was solved by Chebychev, a well known Russian mathematician. He proved that the area under $f(x)$ on the interval $[\mu - k\sigma, \mu + k\sigma]$ is at least $1 - k^{-2}$. This is equivalent to saying the probability that a random variable is within k standard deviations of the mean is at least $1 - k^{-2}$.

Theorem 4.4 (Chebychev Inequality). Let X be a random variable with probability density function $f(x)$. If $\mu$ and $\sigma > 0$ are the mean and standard deviation of X, then

$$P(|X - \mu| < k \sigma) \ge 1 - \frac{1}{k^2}$$

for any positive real constant k.

[Figure: a density curve with the region between Mean - k SD and Mean + k SD shaded; the shaded area is at least $1 - k^{-2}$.]

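Proof: We assume that the random variable X is continuous. If X is not continuous we replace the integral by summation in the following proof. From the definition of variance, we have the following:

$$\begin{aligned}
\sigma^2 &= \int_{-\infty}^{\infty} (x - \mu)^2 f(x)\, dx \\
&= \int_{-\infty}^{\mu - k\sigma} (x - \mu)^2 f(x)\, dx + \int_{\mu - k\sigma}^{\mu + k\sigma} (x - \mu)^2 f(x)\, dx + \int_{\mu + k\sigma}^{\infty} (x - \mu)^2 f(x)\, dx.
\end{aligned}$$

Since $\int_{\mu - k\sigma}^{\mu + k\sigma} (x - \mu)^2 f(x)\, dx$ is nonnegative, we get from the above

$$\sigma^2 \ge \int_{-\infty}^{\mu - k\sigma} (x - \mu)^2 f(x)\, dx + \int_{\mu + k\sigma}^{\infty} (x - \mu)^2 f(x)\, dx. \qquad (4.1)$$

If $x \in (-\infty, \mu - k\sigma)$, then

$$x \le \mu - k\sigma.$$

Hence

$$k\sigma \le \mu - x$$

and therefore

$$k^2 \sigma^2 \le (\mu - x)^2.$$

That is $(\mu - x)^2 \ge k^2 \sigma^2$. Similarly, if $x \in (\mu + k\sigma, \infty)$, then

$$x \ge \mu + k\sigma.$$

Therefore

$$k^2 \sigma^2 \le (\mu - x)^2.$$

Thus if $x \notin (\mu - k\sigma, \mu + k\sigma)$, then

$$(\mu - x)^2 \ge k^2 \sigma^2. \qquad (4.2)$$

Using (4.2) and (4.1), we get

$$\sigma^2 \ge k^2 \sigma^2 \left[ \int_{-\infty}^{\mu - k\sigma} f(x)\, dx + \int_{\mu + k\sigma}^{\infty} f(x)\, dx \right].$$

Hence

$$\frac{1}{k^2} \ge \left[ \int_{-\infty}^{\mu - k\sigma} f(x)\, dx + \int_{\mu + k\sigma}^{\infty} f(x)\, dx \right].$$

Therefore

$$\frac{1}{k^2} \ge P(X \le \mu - k\sigma) + P(X \ge \mu + k\sigma).$$

Thus

$$\frac{1}{k^2} \ge P(|X - \mu| \ge k\sigma)$$

which is

$$P(|X - \mu| < k\sigma) \ge 1 - \frac{1}{k^2}.$$

This completes the proof of this theorem.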

The following integration formula

$$\int_{0}^{1} x^n (1 - x)^m\, dx = \frac{n!\, m!}{(n + m + 1)!}$$

will be used in the next example. In this formula m and n represent any two positive integers.

Example 4.11. Let the probability density function of a random variable X be

$$f(x) = \begin{cases} 630\, x^4 (1 - x)^4 & \text{if } 0 < x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

What is the exact value of $P(|X - \mu| \le 2\sigma)$? What is the approximate value of $P(|X - \mu| \le 2\sigma)$ when one uses the Chebychev inequality?

Answer: First, we find the mean and variance of the above distribution. The mean of X is given by

$$\begin{aligned}
E(X) &= \int_{0}^{1} x\, f(x)\, dx \\
&= \int_{0}^{1} 630\, x^5 (1 - x)^4\, dx \\
&= 630\, \frac{5!\, 4!}{(5 + 4 + 1)!} \\
&= 630\, \frac{5!\, 4!}{10!} \\
&= 630\, \frac{2880}{3628800} \\
&= \frac{630}{1260} = \frac{1}{2}.
\end{aligned}$$

Similarly, the variance of X can be computed from

$$\begin{aligned}
Var(X) &= \int_{0}^{1} x^2 f(x)\, dx - \mu^2 \\
&= \int_{0}^{1} 630\, x^6 (1 - x)^4\, dx - \frac{1}{4} \\
&= 630\, \frac{6!\, 4!}{(6 + 4 + 1)!} - \frac{1}{4} \\
&= 630\, \frac{6!\, 4!}{11!} - \frac{1}{4} \\
&= \frac{6}{22} - \frac{1}{4} \\
&= \frac{12}{44} - \frac{11}{44} = \frac{1}{44}.
\end{aligned}$$

Therefore, the standard deviation of X is

$$\sigma = \sqrt{\frac{1}{44}} \approx 0.15.$$

Thus

$$\begin{aligned}
P(|X - \mu| \le 2\sigma) &= P(|X - 0.5| \le 0.3) \\
&= P(-0.3 \le X - 0.5 \le 0.3) \\
&= P(0.2 \le X \le 0.8) \\
&= \int_{0.2}^{0.8} 630\, x^4 (1 - x)^4\, dx \\
&\approx 0.96.
\end{aligned}$$

If we use the Chebychev inequality, then we get a lower bound for the exact value we have. This bound is

$$P(|X - \mu| \le 2\sigma) \ge 1 - \frac{1}{4} = 0.75.$$

Hence, Chebychev inequality tells us that if we do not know the distribution of X, then $P(|X - \mu| \le 2\sigma)$ is at least 0.75.
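The exact probability in Example 4.11 can be computed without numerical integration by expanding $(1 - x)^4$ with the binomial theorem and integrating term by term. The sketch below does this with rational arithmetic; the function name `cdf` and the use of the rounded endpoints 0.2 and 0.8 (as in the example) are assumptions of this illustration.

```python
from fractions import Fraction
from math import comb

def cdf(x):
    """Exact integral of 630 t^4 (1 - t)^4 from 0 to x, via expanding (1 - t)^4."""
    x = Fraction(x)
    return sum(630 * comb(4, j) * (-1) ** j * x ** (5 + j) / (5 + j)
               for j in range(5))

exact = cdf(Fraction(4, 5)) - cdf(Fraction(1, 5))  # P(0.2 <= X <= 0.8)
chebychev_bound = 1 - Fraction(1, 2 ** 2)          # 1 - 1/k^2 with k = 2

print(float(exact), float(chebychev_bound))
```

The exact value is roughly 0.96, well above the guaranteed 0.75. This gap is typical: the Chebychev bound holds for every distribution with the given mean and variance, so it is rarely tight for any particular one.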

The lower the standard deviation, the smaller the spread of the distribution. If the standard deviation is zero, then the distribution has no spread.

This means that the distribution is concentrated at a single point. In the

literature, such distributions are called degenerate distributions. The above

ﬁgure shows how the spread decreases with the decrease of the standard

deviation.

4.5. Moment Generating Functions

We have seen in Section 4.3 that there are some distributions, such as geometric, whose moments are difficult to compute from the definition. A moment generating function is a real valued function from which one can generate all the moments of a given random variable. In many cases, it is easier to compute various moments of X using the moment generating function.

Definition 4.5. Let X be a random variable whose probability density function is $f(x)$. A real valued function $M : \mathbb{R} \to \mathbb{R}$ defined by

$$M(t) = E\left( e^{tX} \right)$$

is called the moment generating function of X if this expected value exists for all t in the interval $-h < t < h$ for some $h > 0$.

In general, not every random variable has a moment generating function.

But if the moment generating function of a random variable exists, then it

is unique. At the end of this section, we will give an example of a random

variable which does not have a moment generating function.

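Using the definition of expected value of a random variable, we obtain the explicit representation for $M(t)$ as

$$M(t) = \begin{cases} \displaystyle\sum_{x \in R_X} e^{tx} f(x) & \text{if } X \text{ is discrete} \\[2ex] \displaystyle\int_{-\infty}^{\infty} e^{tx} f(x)\, dx & \text{if } X \text{ is continuous.} \end{cases}$$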

Example 4.12. Let X be a random variable whose moment generating

function is M (t ) and n be any natural number. What is the nth derivative

of M (t ) at t = 0?

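Answer:

$$\frac{d}{dt} M(t) = \frac{d}{dt} E\left( e^{tX} \right) = E\left( \frac{d}{dt} e^{tX} \right) = E\left( X e^{tX} \right).$$

Similarly,

$$\frac{d^2}{dt^2} M(t) = \frac{d^2}{dt^2} E\left( e^{tX} \right) = E\left( \frac{d^2}{dt^2} e^{tX} \right) = E\left( X^2 e^{tX} \right).$$

Hence, in general we get

$$\frac{d^n}{dt^n} M(t) = \frac{d^n}{dt^n} E\left( e^{tX} \right) = E\left( \frac{d^n}{dt^n} e^{tX} \right) = E\left( X^n e^{tX} \right).$$

If we set $t = 0$ in the nth derivative, we get

$$\left. \frac{d^n}{dt^n} M(t) \right|_{t=0} = \left. E\left( X^n e^{tX} \right) \right|_{t=0} = E(X^n).$$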

Hence the nth derivative of the moment generating function of X evaluated

at t = 0 is the nth moment of X about the origin.

This example tells us that if we know the moment generating function of a random variable, then we can generate all the moments of X by taking derivatives of the moment generating function and then evaluating them at zero.

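Example 4.13. What is the moment generating function of the random variable X whose probability density function is given by

$$f(x) = \begin{cases} e^{-x} & \text{for } x > 0 \\ 0 & \text{otherwise?} \end{cases}$$

What are the mean and variance of X?

Answer: The moment generating function of X is

$$\begin{aligned}
M(t) &= E\left( e^{tX} \right) \\
&= \int_{0}^{\infty} e^{tx} f(x)\, dx \\
&= \int_{0}^{\infty} e^{tx} e^{-x}\, dx \\
&= \int_{0}^{\infty} e^{-(1 - t) x}\, dx \\
&= \left[ \frac{-1}{1 - t}\, e^{-(1 - t) x} \right]_{0}^{\infty} \\
&= \frac{1}{1 - t} \qquad \text{if } 1 - t > 0.
\end{aligned}$$

The expected value of X can be computed from $M(t)$ as

$$\begin{aligned}
E(X) &= \left. \frac{d}{dt} M(t) \right|_{t=0} \\
&= \left. \frac{d}{dt} (1 - t)^{-1} \right|_{t=0} \\
&= \left. (1 - t)^{-2} \right|_{t=0} \\
&= \left. \frac{1}{(1 - t)^2} \right|_{t=0} \\
&= 1.
\end{aligned}$$

Similarly

$$\begin{aligned}
E(X^2) &= \left. \frac{d^2}{dt^2} M(t) \right|_{t=0} \\
&= \left. \frac{d^2}{dt^2} (1 - t)^{-1} \right|_{t=0} \\
&= \left. 2 (1 - t)^{-3} \right|_{t=0} \\
&= \left. \frac{2}{(1 - t)^3} \right|_{t=0} \\
&= 2.
\end{aligned}$$

Therefore, the variance of X is

$$Var(X) = E(X^2) - (\mu)^2 = 2 - 1 = 1.$$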
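The moments in Example 4.13 can also be approximated numerically by differentiating $M(t) = \frac{1}{1 - t}$ at zero with central finite differences. This is a rough sketch: the step size h = 1e-4 and the helper names are assumptions of the illustration, and finite differences only approximate the derivatives.

```python
def mgf(t):
    """M(t) = 1 / (1 - t), the moment generating function from Example 4.13 (t < 1)."""
    return 1.0 / (1.0 - t)

def first_derivative_at_zero(f, h=1e-4):
    """Central difference approximation of f'(0)."""
    return (f(h) - f(-h)) / (2 * h)

def second_derivative_at_zero(f, h=1e-4):
    """Central difference approximation of f''(0)."""
    return (f(h) - 2 * f(0.0) + f(-h)) / (h * h)

mean_est = first_derivative_at_zero(mgf)            # approximates E(X)
second_moment_est = second_derivative_at_zero(mgf)  # approximates E(X^2)
variance_est = second_moment_est - mean_est ** 2    # approximates Var(X)

print(mean_est, second_moment_est, variance_est)
```

The printed values should be close to 1, 2 and 1 respectively. The same two helpers can be applied to any moment generating function that is defined in a neighborhood of zero.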

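Example 4.14. Let X have the probability density function

$$f(x) = \begin{cases} \frac{1}{9} \left( \frac{8}{9} \right)^x & \text{for } x = 0, 1, 2, \ldots \\ 0 & \text{otherwise.} \end{cases}$$

What is the moment generating function of the random variable X?

Answer:

$$\begin{aligned}
M(t) &= E\left( e^{tX} \right) \\
&= \sum_{x=0}^{\infty} e^{tx} f(x) \\
&= \sum_{x=0}^{\infty} e^{tx}\, \frac{1}{9} \left( \frac{8}{9} \right)^x \\
&= \frac{1}{9} \sum_{x=0}^{\infty} \left( e^{t}\, \frac{8}{9} \right)^x \\
&= \frac{1}{9} \cdot \frac{1}{1 - e^{t}\, \frac{8}{9}} \qquad \text{if } e^{t}\, \frac{8}{9} < 1 \\
&= \frac{1}{9 - 8 e^{t}} \qquad \text{if } t < \ln\left( \frac{9}{8} \right).
\end{aligned}$$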

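Example 4.15. Let X be a continuous random variable with density function

$$f(x) = \begin{cases} b\, e^{-bx} & \text{for } x > 0 \\ 0 & \text{otherwise,} \end{cases}$$

where $b > 0$. If $M(t)$ is the moment generating function of X, then what is $M(-6b)$?

Answer:

$$\begin{aligned}
M(t) &= E\left( e^{tX} \right) \\
&= \int_{0}^{\infty} b\, e^{tx} e^{-bx}\, dx \\
&= b \int_{0}^{\infty} e^{-(b - t) x}\, dx \\
&= \left[ \frac{-b}{b - t}\, e^{-(b - t) x} \right]_{0}^{\infty} \\
&= \frac{b}{b - t} \qquad \text{if } b - t > 0.
\end{aligned}$$

Hence $M(-6b) = \frac{b}{7b} = \frac{1}{7}$.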

Example 4.16. Let the random variable X have moment generating function $M(t) = (1 - t)^{-2}$ for $t < 1$. What is the third moment of X about the origin?

Answer: To compute the third moment $E(X^3)$ of X about the origin, we need to compute the third derivative of $M(t)$ at $t = 0$.

$$\begin{aligned}
M(t) &= (1 - t)^{-2} \\
M'(t) &= 2 (1 - t)^{-3} \\
M''(t) &= 6 (1 - t)^{-4} \\
M'''(t) &= 24 (1 - t)^{-5}.
\end{aligned}$$

Thus the third moment of X is given by

$$E\left( X^3 \right) = \frac{24}{(1 - 0)^5} = 24.$$

Theorem 4.5. Let $M(t)$ be the moment generating function of the random variable X. If

$$M(t) = a_0 + a_1 t + a_2 t^2 + \cdots + a_n t^n + \cdots \qquad (4.3)$$

is the Taylor series expansion of $M(t)$, then

$$E(X^n) = (n!)\, a_n$$

for every natural number n.

Proof: Let $M(t)$ be the moment generating function of the random variable X. The Taylor series expansion of $M(t)$ about 0 is given by

$$M(t) = M(0) + \frac{M'(0)}{1!}\, t + \frac{M''(0)}{2!}\, t^2 + \frac{M'''(0)}{3!}\, t^3 + \cdots + \frac{M^{(n)}(0)}{n!}\, t^n + \cdots$$

Since $E(X^n) = M^{(n)}(0)$ for $n \ge 1$ and $M(0) = 1$, we have

$$M(t) = 1 + \frac{E(X)}{1!}\, t + \frac{E(X^2)}{2!}\, t^2 + \frac{E(X^3)}{3!}\, t^3 + \cdots + \frac{E(X^n)}{n!}\, t^n + \cdots \qquad (4.4)$$

From (4.3) and (4.4), equating the coefficients of the like powers of t, we obtain

$$a_n = \frac{E(X^n)}{n!}$$

which is

$$E(X^n) = (n!)\, a_n.$$

This proves the theorem.


Example 4.17. What is the 479th moment of X about the origin, if the moment generating function of X is $\frac{1}{1 + t}$?

Answer: The Taylor series expansion of $M(t) = \frac{1}{1 + t}$ can be obtained by using long division (a technique we have learned in high school).

$$\begin{aligned}
M(t) &= \frac{1}{1 + t} \\
&= \frac{1}{1 - (-t)} \\
&= 1 + (-t) + (-t)^2 + (-t)^3 + \cdots + (-t)^n + \cdots \\
&= 1 - t + t^2 - t^3 + t^4 + \cdots + (-1)^n t^n + \cdots
\end{aligned}$$

Therefore $a_n = (-1)^n$ and from this we obtain $a_{479} = -1$. By Theorem 4.5,

$$E\left( X^{479} \right) = (479!)\, a_{479} = -479!$$

Example 4.18. If the moment generating function of a random variable X is

$$M(t) = \sum_{j=0}^{\infty} \frac{e^{(tj - 1)}}{j!},$$

then what is the probability of the event $X = 2$?

Answer: By examining the given moment generating function of X, it is easy to note that X is a discrete random variable with space $R_X = \{0, 1, 2, \ldots\}$. Hence by definition, the moment generating function of X is

$$M(t) = \sum_{j=0}^{\infty} e^{tj} f(j). \qquad (4.5)$$

But we are given that

$$M(t) = \sum_{j=0}^{\infty} \frac{e^{(tj - 1)}}{j!} = \sum_{j=0}^{\infty} \frac{e^{-1}}{j!}\, e^{tj}.$$

From (4.5) and the above, equating the coefficients of $e^{tj}$, we get

$$f(j) = \frac{e^{-1}}{j!} \qquad \text{for } j = 0, 1, 2, \ldots$$

Thus, the probability of the event $X = 2$ is given by

$$P(X = 2) = f(2) = \frac{e^{-1}}{2!} = \frac{1}{2e}.$$

Example 4.19. Let X be a random variable with

$$E(X^n) = 0.8 \qquad \text{for } n = 1, 2, 3, \ldots$$

What are the moment generating function and probability density function of X?

Answer:

$$\begin{aligned}
M(t) &= M(0) + \sum_{n=1}^{\infty} M^{(n)}(0) \left( \frac{t^n}{n!} \right) \\
&= M(0) + \sum_{n=1}^{\infty} E(X^n) \left( \frac{t^n}{n!} \right) \\
&= 1 + 0.8 \sum_{n=1}^{\infty} \left( \frac{t^n}{n!} \right) \\
&= 0.2 + 0.8 + 0.8 \sum_{n=1}^{\infty} \left( \frac{t^n}{n!} \right) \\
&= 0.2 + 0.8 \sum_{n=0}^{\infty} \left( \frac{t^n}{n!} \right) \\
&= 0.2\, e^{0 \cdot t} + 0.8\, e^{1 \cdot t}.
\end{aligned}$$

Therefore, we get $f(0) = P(X = 0) = 0.2$ and $f(1) = P(X = 1) = 0.8$. Hence the moment generating function of X is

$$M(t) = 0.2 + 0.8\, e^{t},$$

and the probability density function of X is

$$f(x) = \begin{cases} |x - 0.2| & \text{for } x = 0, 1 \\ 0 & \text{otherwise.} \end{cases}$$

Example 4.20. If the moment generating function of a random variable X is given by

$$M(t) = \frac{5}{15}\, e^{t} + \frac{4}{15}\, e^{2t} + \frac{3}{15}\, e^{3t} + \frac{2}{15}\, e^{4t} + \frac{1}{15}\, e^{5t},$$

then what is the probability density function of X? What is the space of the random variable X?

Answer: The moment generating function of X is given to be

$$M(t) = \frac{5}{15}\, e^{t} + \frac{4}{15}\, e^{2t} + \frac{3}{15}\, e^{3t} + \frac{2}{15}\, e^{4t} + \frac{1}{15}\, e^{5t}.$$

This suggests that X is a discrete random variable. Since X is a discrete random variable, by definition of the moment generating function, we see that

$$\begin{aligned}
M(t) &= \sum_{x \in R_X} e^{tx} f(x) \\
&= e^{t x_1} f(x_1) + e^{t x_2} f(x_2) + e^{t x_3} f(x_3) + e^{t x_4} f(x_4) + e^{t x_5} f(x_5).
\end{aligned}$$

Hence we have

$$f(x_1) = f(1) = \frac{5}{15}, \quad f(x_2) = f(2) = \frac{4}{15}, \quad f(x_3) = f(3) = \frac{3}{15}, \quad f(x_4) = f(4) = \frac{2}{15}, \quad f(x_5) = f(5) = \frac{1}{15}.$$

Therefore the probability density function of X is given by

$$f(x) = \frac{6 - x}{15} \qquad \text{for } x = 1, 2, 3, 4, 5$$

and the space of the random variable X is

$$R_X = \{1, 2, 3, 4, 5\}.$$

Example 4.21. If the probability density function of a discrete random variable X is

$$f(x) = \frac{6}{\pi^2 x^2} \qquad \text{for } x = 1, 2, 3, \ldots,$$

then what is the moment generating function of X?

Answer: If the moment generating function of X exists, then

$$\begin{aligned}
M(t) &= \sum_{x=1}^{\infty} e^{tx} f(x) \\
&= \sum_{x=1}^{\infty} e^{tx} \left( \frac{\sqrt{6}}{\pi x} \right)^2 \\
&= \sum_{x=1}^{\infty} \left( \frac{e^{tx}\, 6}{\pi^2 x^2} \right) \\
&= \frac{6}{\pi^2} \sum_{x=1}^{\infty} \frac{e^{tx}}{x^2}.
\end{aligned}$$

Now we show that, for any $h > 0$, the above infinite series diverges for some t in the interval $(-h, h)$. To do this, we apply the ratio test, that is

$$\begin{aligned}
\lim_{n \to \infty} \left| \frac{a_{n+1}}{a_n} \right| &= \lim_{n \to \infty} \left| \frac{e^{t(n+1)}}{(n + 1)^2} \cdot \frac{n^2}{e^{tn}} \right| \\
&= \lim_{n \to \infty} \left| \frac{e^{tn} e^{t}}{(n + 1)^2} \cdot \frac{n^2}{e^{tn}} \right| \\
&= \lim_{n \to \infty} e^{t} \left( \frac{n}{n + 1} \right)^2 \\
&= e^{t}.
\end{aligned}$$

For any $h > 0$, the interval $(-h, h)$ contains positive values of t, and for these $e^{t} > 1$, so the series diverges. Hence the expected value $E(e^{tX})$ does not exist on any interval $(-h, h)$, and for this random variable X the moment generating function does not exist.

Notice that for the above random variable, $E[X^n]$ does not exist for any natural number n. Hence the discrete random variable X in Example 4.21 has no moments. Similarly, the continuous random variable X whose probability density function is

$$f(x) = \begin{cases} \frac{1}{x^2} & \text{for } 1 \le x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

has no moment generating function and no moments.

In the following theorem we summarize some important properties of the

moment generating function of a random variable.

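Theorem 4.6. Let X be a random variable with the moment generating function $M_X(t)$. If a and b are any two real constants (with $b \ne 0$ in (4.8)), then

$$M_{X+a}(t) = e^{at} M_X(t) \qquad (4.6)$$

$$M_{bX}(t) = M_X(bt) \qquad (4.7)$$

$$M_{\frac{X+a}{b}}(t) = e^{\frac{a}{b} t}\, M_X\left( \frac{t}{b} \right). \qquad (4.8)$$

Proof: First, we prove (4.6).

$$\begin{aligned}
M_{X+a}(t) &= E\left( e^{t(X + a)} \right) \\
&= E\left( e^{tX + ta} \right) \\
&= E\left( e^{tX} e^{ta} \right) \\
&= e^{ta} E\left( e^{tX} \right) \\
&= e^{ta} M_X(t).
\end{aligned}$$

Similarly, we prove (4.7).

$$\begin{aligned}
M_{bX}(t) &= E\left( e^{t(bX)} \right) \\
&= E\left( e^{(tb) X} \right) \\
&= M_X(tb).
\end{aligned}$$

By using (4.6) and (4.7), we easily get (4.8).

$$\begin{aligned}
M_{\frac{X+a}{b}}(t) &= M_{\frac{X}{b} + \frac{a}{b}}(t) \\
&= e^{\frac{a}{b} t}\, M_{\frac{X}{b}}(t) \\
&= e^{\frac{a}{b} t}\, M_X\left( \frac{t}{b} \right).
\end{aligned}$$

This completes the proof of this theorem.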
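Property (4.8) can be checked numerically on a discrete random variable. The sketch below uses the density of Example 4.20; the constants a = 2, b = 4, t = 0.3 and the helper names are arbitrary assumptions made for the illustration.

```python
from math import exp, isclose

# Density from Example 4.20: f(x) = (6 - x) / 15 for x = 1, ..., 5.
pmf = {1: 5 / 15, 2: 4 / 15, 3: 3 / 15, 4: 2 / 15, 5: 1 / 15}

def mgf(pmf, t):
    """M(t) = sum of e^{t x} f(x) over the space of a discrete random variable."""
    return sum(exp(t * x) * p for x, p in pmf.items())

def transformed(pmf, a, b):
    """Density of (X + a) / b, with b nonzero."""
    return {(x + a) / b: p for x, p in pmf.items()}

a, b, t = 2.0, 4.0, 0.3
lhs = mgf(transformed(pmf, a, b), t)    # M_{(X+a)/b}(t) computed directly
rhs = exp(a * t / b) * mgf(pmf, t / b)  # e^{(a/b) t} M_X(t / b), as in (4.8)

print(lhs, rhs)
```

The two printed numbers agree up to floating point rounding. Setting b = 1 reduces the check to (4.6), and setting a = 0 reduces it to (4.7) with the constant 1/b.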

Definition 4.6. The nth factorial moment of a random variable X is

$$E\left( X (X - 1)(X - 2) \cdots (X - n + 1) \right).$$

Definition 4.7. The factorial moment generating function (FMGF) of X is denoted by $G(t)$ and defined as

$$G(t) = E\left( t^X \right).$$

It is not difficult to establish a relationship between the moment generating function (MGF) and the factorial moment generating function (FMGF). The relationship between them is the following:

$$G(t) = E\left( t^X \right) = E\left( e^{\ln t^X} \right) = E\left( e^{X \ln t} \right) = M(\ln t).$$

Thus, if we know the MGF of a random variable, we can determine its FMGF

and conversely.

Definition 4.8. Let X be a random variable. The characteristic function $\phi(t)$ of X is defined as

$$\begin{aligned}
\phi(t) &= E\left( e^{itX} \right) \\
&= E\left( \cos(tX) + i \sin(tX) \right) \\
&= E\left( \cos(tX) \right) + i\, E\left( \sin(tX) \right).
\end{aligned}$$

The probability density function can be recovered from the characteristic function by using the following formula

$$f(x) = \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-itx}\, \phi(t)\, dt.$$

Unlike the moment generating function, the characteristic function of a random variable always exists. For example, the Cauchy random variable X with probability density $f(x) = \frac{1}{\pi(1 + x^2)}$ has no moment generating function. However, the characteristic function is

$$\begin{aligned}
\phi(t) &= E\left( e^{itX} \right) \\
&= \int_{-\infty}^{\infty} \frac{e^{itx}}{\pi(1 + x^2)}\, dx \\
&= e^{-|t|}.
\end{aligned}$$

To evaluate the above integral one needs the theory of residues from complex analysis.

The characteristic function $\phi(t)$ satisfies the same set of properties as the moment generating functions as given in Theorem 4.6.

The following integrals

$$\int_{0}^{\infty} x^m e^{-x}\, dx = m! \qquad \text{if } m \text{ is a positive integer}$$

and

$$\int_{0}^{\infty} \sqrt{x}\, e^{-x}\, dx = \frac{\sqrt{\pi}}{2}$$

are needed for some problems in the Review Exercises of this chapter. These

formulas will be discussed in Chapter 6 while we describe the properties and

usefulness of the gamma distribution.

We end this chapter with the following comment about Taylor's series. Taylor's series was discovered to mimic the decimal expansion of real numbers. For example

$$125 = 1 (10)^2 + 2 (10)^1 + 5 (10)^0$$

is an expansion of the number 125 with respect to base 10. Similarly,

$$125 = 1 (9)^2 + 4 (9)^1 + 8 (9)^0$$

is an expansion of the number 125 in base 9 and it is 148. Given a function $f : \mathbb{R} \to \mathbb{R}$ and $x \in \mathbb{R}$, $f(x)$ is a real number and it can be expanded with respect to the base x. The expansion of $f(x)$ with respect to base x will have a form

$$f(x) = a_0 x^0 + a_1 x^1 + a_2 x^2 + a_3 x^3 + \cdots$$

which is

$$f(x) = \sum_{k=0}^{\infty} a_k x^k.$$

If we know the coefficients $a_k$ for $k = 0, 1, 2, 3, \ldots$, then we will have the expansion of $f(x)$ in base x. Taylor found the remarkable fact that the coefficients $a_k$ can be computed if $f(x)$ is sufficiently differentiable. He proved that for $k = 0, 1, 2, 3, \ldots$

$$a_k = \frac{f^{(k)}(0)}{k!} \qquad \text{with } f^{(0)}(0) = f(0).$$
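The base expansions quoted above are easy to reproduce by repeated division. The helper below is a small illustrative sketch; the name `digits` is an assumption, and it handles nonnegative integers only.

```python
def digits(n, base):
    """Digits of a nonnegative integer n in the given base, most significant first."""
    if n == 0:
        return [0]
    out = []
    while n > 0:
        n, r = divmod(n, base)  # r is the next digit, starting from the least significant
        out.append(r)
    return out[::-1]

print(digits(125, 10))  # digits of 125 in base 10
print(digits(125, 9))   # digits of 125 in base 9
```

The analogy with Taylor's series is that the digits play the role of the coefficients $a_k$ and the base plays the role of x.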


4.6. Review Exercises

1. In a state lottery a five-digit integer is selected at random. If a player bets 1 dollar on a particular number, the payoff (if that number is selected) is $500 minus the $1 paid for the ticket. Let X equal the payoff to the bettor. Find the expected value of X.

2. A discrete random variable X has probability density function of the form

$$f(x) = \begin{cases} c\, (8 - x) & \text{for } x = 0, 1, 2, 3, 4, 5 \\ 0 & \text{otherwise.} \end{cases}$$

(a) Find the constant c. (b) Find $P(X > 2)$. (c) Find the expected value $E(X)$ for the random variable X.

3. A random variable X has a cumulative distribution function

$$F(x) = \begin{cases} \frac{1}{2} x & \text{if } 0 < x \le 1 \\ x - \frac{1}{2} & \text{if } 1 < x \le \frac{3}{2}. \end{cases}$$

(a) Graph $F(x)$. (b) Graph $f(x)$. (c) Find $P(X \le 0.5)$. (d) Find $P(X \ge 0.5)$. (e) Find $P(X \le 1.25)$. (f) Find $P(X = 1.25)$.

4. Let X be a random variable with probability density function

$$f(x) = \begin{cases} \frac{1}{8} x & \text{for } x = 1, 2, 5 \\ 0 & \text{otherwise.} \end{cases}$$

(a) Find the expected value of X. (b) Find the variance of X. (c) Find the expected value of $2X + 3$. (d) Find the variance of $2X + 3$. (e) Find the expected value of $3X - 5X^2 + 1$.

5. The measured radius of a circle, R, has probability density function

$$f(r) = \begin{cases} 6 r (1 - r) & \text{if } 0 < r < 1 \\ 0 & \text{otherwise.} \end{cases}$$

(a) Find the expected value of the radius. (b) Find the expected circumference. (c) Find the expected area.

6. Let X be a continuous random variable with density function

$$f(x) = \begin{cases} \theta x + \frac{3}{2}\, \theta^{\frac{3}{2}} x^2 & \text{for } 0 < x < \frac{1}{\sqrt{\theta}} \\ 0 & \text{otherwise,} \end{cases}$$

where $\theta > 0$. What is the expected value of X?

7. Suppose X is a random variable with mean $\mu$ and variance $\sigma^2 > 0$. For what value of a, where $a > 0$, is $E\left( \left[ aX - \frac{1}{a} \right]^2 \right)$ minimized?

8. A rectangle is to be constructed having dimension X by 2X, where X is a random variable with probability density function

$$f(x) = \begin{cases} \frac{1}{2} & \text{for } 0 < x < 2 \\ 0 & \text{otherwise.} \end{cases}$$

What is the expected area of the rectangle?

9. A box is to be constructed so that the height is 10 inches and its base

is X inches by X inches. If X has a uniform distribution over the interval

[2, 8], then what is the expected volume of the box in cubic inches?

10. If X is a random variable with density function

$$f(x) = \begin{cases} 1.4\, e^{-2x} + 0.9\, e^{-3x} & \text{for } x > 0 \\ 0 & \text{elsewhere,} \end{cases}$$

then what is the expected value of X?

11. A fair coin is tossed. If a head occurs, 1 die is rolled; if a tail occurs, 2

dice are rolled. Let X be the total on the die or dice. What is the expected

value of X?

12. If velocities of the molecules of a gas have the probability density (Maxwell's law)

$$f(v) = \begin{cases} a\, v^2 e^{-h^2 v^2} & \text{for } v \ge 0 \\ 0 & \text{otherwise,} \end{cases}$$

then what are the expectation and the variance of the velocity of the molecules and also the magnitude of a for some given h?

13. A couple decides to have children until they get a girl, but they agree to

stop with a maximum of 3 children even if they haven't gotten a girl. If X

and Y denote the number of children and number of girls, respectively, then

what are E (X ) and E (Y)?

14. In roulette, a wheel stops with equal probability at any of the 38 numbers 0, 00, 1, 2, ..., 36. If you bet $1 on a number, then you win $36 (net gain is $35) if the number comes up; otherwise, you lose your dollar. What are your expected winnings?

15. If the moment generating function for the random variable X is $M_X(t) = \frac{1}{1 + t}$, what is the third moment of X about the point x = 2?

16. If the mean and the variance of a certain distribution are 2 and 8, what

are the ﬁrst three terms in the series expansion of the moment generating

function?

17. Let X be a random variable with density function

$$f(x) = \begin{cases} a\, e^{-ax} & \text{for } x > 0 \\ 0 & \text{otherwise,} \end{cases}$$

where $a > 0$. If $M(t)$ denotes the moment generating function of X, what is $M(-3a)$?

18. Suppose the random variable X has moment generating function

$$M(t) = \frac{1}{(1 - t)^k}, \qquad \text{for } t < 1.$$

What is the nth moment of X?

19. Two balls are dropped in such a way that each ball is equally likely to

fall into any one of four holes. Both balls may fall into the same hole. Let X

denote the number of unoccupied holes at the end of the experiment. What

is the moment generating function of X?

20. If the moment generating function of X is $M(t) = \frac{1}{(1 - t)^2}$ for $t < 1$, then what is the fourth moment of X?

21. Let the random variable X have the moment generating function

$$M(t) = \frac{e^{3t}}{1 - t^2}, \qquad -1 < t < 1.$$

What are the mean and the variance of X, respectively?

22. Let the random variable X have the moment generating function

$$M(t) = e^{3t + t^2}.$$

What is the second moment of X about x = 0?


23. Suppose the random variable X has the cumulative density function F(x). Show that the expected value of the random variable $(X - c)^2$ is minimum if c equals the expected value of X.

24. Suppose the continuous random variable X has the cumulative density function F(x). Show that the expected value of the random variable $|X - c|$ is minimum if c equals the median of X (that is, F(c) = 0.5).

25. Let the random variable X have the probability density function

$$f(x) = \frac{1}{2}\, e^{-|x|}, \quad -\infty < x < \infty.$$

What are the expected value and the variance of X?

26. If $M_X(t) = k\,(2 + 3e^t)^4$, what is the value of k?

27. Given the moment generating function of X as

$$M(t) = 1 + t + 4t^2 + 10t^3 + 14t^4 + \cdots$$

what is the third moment of X about its mean?

28. A set of measurements X has a mean of 7 and standard deviation of 0.2. For simplicity, a linear transformation Y = aX + b is to be applied to make the mean and variance both equal to 1. What are the values of the constants a and b?

29. A fair coin is to be tossed 3 times. The player receives 10 dollars if all three turn up heads and pays 3 dollars if there is one or no heads. No gain or loss is incurred otherwise. If Y is the gain of the player, what is the expected value of Y?

30. If X has the probability density function

$$f(x) = \begin{cases} e^{-x} & \text{for } x > 0 \\ 0 & \text{otherwise,} \end{cases}$$

then what is the expected value of the random variable $Y = e^{\frac{3}{4}X} + 6$?

31. If the probability density function of the random variable X is

$$f(x) = \begin{cases} (1-p)^{x-1}\,p & \text{if } x = 1, 2, 3, \ldots, \infty \\ 0 & \text{otherwise,} \end{cases}$$

then what is the expected value of the random variable $X^{-1}$?

Some Special Discrete Distributions 108

Chapter 5

SOME SPECIAL

DISCRETE

DISTRIBUTIONS

Given a random experiment, we can find the set of all possible outcomes, which is known as the sample space. Objects in a sample space may not be numbers. Thus, we use the notion of random variable to quantify the qualitative elements of the sample space. A random variable is characterized by either its probability density function or its cumulative distribution function. The other characteristics of a random variable are its mean, variance and moment generating function. In this chapter, we explore some frequently encountered discrete distributions and study their important characteristics.

5.1. Bernoulli Distribution

A Bernoulli trial is a random experiment in which there are precisely two possible outcomes, which we conveniently call 'failure' (F) and 'success' (S). We can define a random variable from the sample space {S, F} into the set of real numbers as follows:

$$X(F) = 0, \qquad X(S) = 1.$$

[Figure: the random variable X maps the sample space {S, F} into the real numbers, with X(F) = 0 and X(S) = 1.]

The probability density function of this random variable is

$$f(0) = P(X = 0) = 1 - p$$

$$f(1) = P(X = 1) = p,$$

where p denotes the probability of success. Hence

$$f(x) = p^x (1-p)^{1-x}, \quad x = 0, 1.$$

Definition 5.1. The random variable X is called the Bernoulli random variable if its probability density function is of the form

$$f(x) = p^x (1-p)^{1-x}, \quad x = 0, 1$$

where p is the probability of success.

We denote the Bernoulli random variable by writing $X \sim BER(p)$.

Example 5.1. What is the probability of getting a score of not less than 5 in a throw of a six-sided die?

Answer: Although there are six possible scores {1, 2, 3, 4, 5, 6}, we are grouping them into two sets, namely {1, 2, 3, 4} and {5, 6}. Any score in {1, 2, 3, 4} is a failure and any score in {5, 6} is a success. Thus, this is a Bernoulli trial with

$$P(X = 0) = P(\text{failure}) = \frac{4}{6} \quad \text{and} \quad P(X = 1) = P(\text{success}) = \frac{2}{6}.$$

Hence, the probability of getting a score of not less than 5 in a throw of a six-sided die is $\frac{2}{6}$.

Theorem 5.1. If X is a Bernoulli random variable with parameter p, then the mean, variance and moment generating functions are respectively given by

$$\mu_X = p$$

$$\sigma_X^2 = p\,(1-p)$$

$$M_X(t) = (1-p) + p\,e^t.$$

Proof: The mean of the Bernoulli random variable is

$$\begin{aligned}
\mu_X &= \sum_{x=0}^{1} x\, f(x) \\
&= \sum_{x=0}^{1} x\, p^x (1-p)^{1-x} \\
&= p.
\end{aligned}$$

Similarly, the variance of X is given by

$$\begin{aligned}
\sigma_X^2 &= \sum_{x=0}^{1} (x - \mu_X)^2 f(x) \\
&= \sum_{x=0}^{1} (x - p)^2\, p^x (1-p)^{1-x} \\
&= p^2 (1-p) + p\,(1-p)^2 \\
&= p\,(1-p)\,[\,p + (1-p)\,] \\
&= p\,(1-p).
\end{aligned}$$

Next, we find the moment generating function of the Bernoulli random variable

$$\begin{aligned}
M(t) &= E\left(e^{tX}\right) \\
&= \sum_{x=0}^{1} e^{tx}\, p^x (1-p)^{1-x} \\
&= (1-p) + e^t p.
\end{aligned}$$

This completes the proof. The moment generating function of X and all the moments of X are shown below for p = 0.5. Note that for the Bernoulli distribution all its moments about zero are the same and equal to p.
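The statements of Theorem 5.1 can be checked numerically by summing over the two points of the support. The following is a minimal sketch; the helper names `bernoulli_pmf` and `bernoulli_moment` are illustrative choices and do not come from the text.

```python
def bernoulli_pmf(x: int, p: float) -> float:
    """f(x) = p^x (1 - p)^(1 - x) for x in {0, 1}, and 0 elsewhere."""
    if x not in (0, 1):
        return 0.0
    return p**x * (1 - p) ** (1 - x)


def bernoulli_moment(k: int, p: float) -> float:
    """k-th moment about zero, E(X^k), summed over the support {0, 1}."""
    return sum(x**k * bernoulli_pmf(x, p) for x in (0, 1))


p = 0.5
mean = bernoulli_moment(1, p)
variance = bernoulli_moment(2, p) - mean**2
print(mean, variance)  # 0.5 0.25

# Every moment about zero of order k >= 1 equals p.
print([bernoulli_moment(k, p) for k in range(1, 6)])
```

Because x takes only the values 0 and 1, the term for x = 0 vanishes for every k ≥ 1, which is why all moments about zero coincide.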

5.2. Binomial Distribution

Consider a fixed number n of mutually independent Bernoulli trials. Suppose these trials have the same probability of success, say p. A random variable X is called a binomial random variable if it represents the total number of successes in n independent Bernoulli trials.

Now we determine the probability density function of a binomial random variable. Recall that the probability density function of X is defined as

$$f(x) = P(X = x).$$

Thus, to find the probability density function of X we have to find the probability of x successes in n independent trials.

If we have x successes in n trials, then the probability of each n-tuple with x successes and n − x failures is

$$p^x (1-p)^{n-x}.$$

However, there are $\binom{n}{x}$ tuples with x successes and n − x failures in n trials. Hence

$$P(X = x) = \binom{n}{x} p^x (1-p)^{n-x}.$$

Therefore, the probability density function of X is

$$f(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n.$$

Definition 5.2. The random variable X is called the binomial random variable with parameters p and n if its probability density function is of the form

$$f(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n$$

where 0 < p < 1 is the probability of success.

We will denote a binomial random variable with parameters p and n as $X \sim BIN(n, p)$.

Example 5.2. Is the real valued function f(x) given by

$$f(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n$$

where n and p are parameters, a probability density function?

Answer: To answer this question, we have to check that f(x) is nonnegative and $\sum_{x=0}^{n} f(x)$ is 1. It is easy to see that $f(x) \ge 0$. We show that the sum is one. By the binomial theorem,

$$\begin{aligned}
\sum_{x=0}^{n} f(x) &= \sum_{x=0}^{n} \binom{n}{x} p^x (1-p)^{n-x} \\
&= (p + 1 - p)^n \\
&= 1.
\end{aligned}$$

Hence f(x) is really a probability density function.

Example 5.3. On a five-question multiple-choice test there are five possible answers, of which one is correct. If a student guesses randomly and independently, what is the probability that she is correct only on questions 1 and 4?

Answer: Here the probability of success is $p = \frac{1}{5}$, and thus $1 - p = \frac{4}{5}$. Therefore, the probability that she is correct on questions 1 and 4 is

$$\begin{aligned}
P(\text{correct on questions 1 and 4}) &= p^2 (1-p)^3 \\
&= \left(\frac{1}{5}\right)^2 \left(\frac{4}{5}\right)^3 \\
&= \frac{64}{5^5} = 0.02048.
\end{aligned}$$

Example 5.4. On a five-question multiple-choice test there are five possible answers, of which one is correct. If a student guesses randomly and independently, what is the probability that she is correct only on two questions?

Answer: Here the probability of success is $p = \frac{1}{5}$, and thus $1 - p = \frac{4}{5}$. There are $\binom{5}{2}$ different ways she can be correct on two questions. Therefore, the probability that she is correct on two questions is

$$\begin{aligned}
P(\text{correct on two questions}) &= \binom{5}{2} p^2 (1-p)^3 \\
&= 10 \left(\frac{1}{5}\right)^2 \left(\frac{4}{5}\right)^3 \\
&= \frac{640}{5^5} = 0.2048.
\end{aligned}$$

Example 5.5. What is the probability of rolling two sixes and three nonsixes in 5 independent casts of a fair die?

Answer: Let the random variable X denote the number of sixes in 5 independent casts of a fair die. Then X is a binomial random variable with probability of success p and n = 5. The probability of getting a six is $p = \frac{1}{6}$. Hence

$$\begin{aligned}
P(X = 2) = f(2) &= \binom{5}{2} \left(\frac{1}{6}\right)^2 \left(\frac{5}{6}\right)^3 \\
&= 10 \left(\frac{1}{36}\right) \left(\frac{125}{216}\right) \\
&= \frac{1250}{7776} = 0.160751.
\end{aligned}$$
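The arithmetic in Examples 5.4 and 5.5 can be reproduced with a few lines of code. This is a minimal sketch using only the standard library; the function name `binomial_pmf` is an illustrative choice.

```python
from math import comb


def binomial_pmf(x: int, n: int, p: float) -> float:
    """f(x) = C(n, x) p^x (1 - p)^(n - x) for x = 0, 1, ..., n."""
    if not 0 <= x <= n:
        return 0.0
    return comb(n, x) * p**x * (1 - p) ** (n - x)


# Example 5.4: exactly two correct guesses out of five, p = 1/5.
print(round(binomial_pmf(2, 5, 1 / 5), 4))  # 0.2048

# Example 5.5: exactly two sixes in five casts, p = 1/6.
print(round(binomial_pmf(2, 5, 1 / 6), 6))  # 0.160751
```

Dividing the first result by the 10 possible positions of the two correct answers recovers the 0.02048 of Example 5.3, where the positions are fixed in advance.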

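Example 5.6. What is the probability of rolling at most two sixes in 5 independent casts of a fair die?

Answer: Let the random variable X denote the number of sixes in 5 independent casts of a fair die. Then X is a binomial random variable with probability of success p and n = 5. The probability of getting a six is $p = \frac{1}{6}$. Hence, the probability of rolling at most two sixes is

$$\begin{aligned}
P(X \le 2) = F(2) &= f(0) + f(1) + f(2) \\
&= \binom{5}{0} \left(\frac{1}{6}\right)^0 \left(\frac{5}{6}\right)^5 + \binom{5}{1} \left(\frac{1}{6}\right)^1 \left(\frac{5}{6}\right)^4 + \binom{5}{2} \left(\frac{1}{6}\right)^2 \left(\frac{5}{6}\right)^3 \\
&= \sum_{k=0}^{2} \binom{5}{k} \left(\frac{1}{6}\right)^k \left(\frac{5}{6}\right)^{5-k} \\
&\approx \frac{1}{2}\,(0.9421 + 0.9734) = 0.9577 \quad \text{(from binomial table)}.
\end{aligned}$$

The table value is an approximation: it averages the tabulated entries for p = 0.20 and p = 0.15, which bracket p = 1/6. Evaluating the sum directly gives 7500/7776 ≈ 0.9645.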
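The value obtained from the binomial table in Example 5.6 is an interpolation between the entries for p = 0.15 and p = 0.20, so it is worth comparing with the exact sum. The sketch below uses exact rational arithmetic; `binomial_cdf` is an illustrative helper name, not notation from the text.

```python
from fractions import Fraction
from math import comb


def binomial_cdf(k, n, p):
    """P(X <= k) for X ~ BIN(n, p); p may be a float or a Fraction."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k + 1))


exact = binomial_cdf(2, 5, Fraction(1, 6))
print(exact)                    # 625/648, the reduced form of 7500/7776
print(round(float(exact), 4))   # 0.9645

# Average of the two neighbouring table entries used in the example.
interpolated = (0.9421 + 0.9734) / 2
print(round(interpolated - float(exact), 4))  # size of the interpolation error
```

The interpolated figure differs from the exact probability in the third decimal place, which is typical when a table is read between two tabulated values of p.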

Theorem 5.2. If X is a binomial random variable with parameters p and n, then the mean, variance and moment generating functions are respectively given by

$$\mu_X = n\,p$$

$$\sigma_X^2 = n\,p\,(1-p)$$

$$M_X(t) = \left[(1-p) + p\,e^t\right]^n.$$

Proof: First, we determine the moment generating function M(t) of X and then we generate the mean and variance from M(t).

$$\begin{aligned}
M(t) &= E\left(e^{tX}\right) \\
&= \sum_{x=0}^{n} e^{tx} \binom{n}{x} p^x (1-p)^{n-x} \\
&= \sum_{x=0}^{n} \binom{n}{x} \left(p\,e^t\right)^x (1-p)^{n-x} \\
&= \left(p\,e^t + 1 - p\right)^n.
\end{aligned}$$

Hence

$$M'(t) = n \left(p\,e^t + 1 - p\right)^{n-1} p\,e^t.$$

Therefore

$$\mu_X = M'(0) = n\,p.$$

Similarly

$$M''(t) = n \left(p\,e^t + 1 - p\right)^{n-1} p\,e^t + n\,(n-1) \left(p\,e^t + 1 - p\right)^{n-2} \left(p\,e^t\right)^2.$$

Therefore

$$E(X^2) = M''(0) = n\,(n-1)\,p^2 + n\,p.$$

Hence

$$Var(X) = E(X^2) - \mu_X^2 = n\,(n-1)\,p^2 + n\,p - n^2 p^2 = n\,p\,(1-p).$$

This completes the proof.

Example 5.7. Suppose that 2000 points are selected independently and at random from the unit square $S = \{(x, y) \mid 0 \le x, y \le 1\}$. Let X equal the number of points that fall in $A = \{(x, y) \mid x^2 + y^2 < 1\}$. How is X distributed? What are the mean, variance and standard deviation of X?

Answer: If a point falls in A, then it is a success. If a point falls in the complement of A, then it is a failure. The probability of success is

$$p = \frac{\text{area of } A}{\text{area of } S} = \frac{1}{4}\,\pi.$$

Since the random variable represents the number of successes in 2000 independent trials, the random variable X is binomial with parameters $p = \frac{\pi}{4}$ and n = 2000, that is $X \sim BIN\left(2000, \frac{\pi}{4}\right)$.

Hence by Theorem 5.2,

$$\mu_X = 2000\, \frac{\pi}{4} = 1570.8,$$

and

$$\sigma_X^2 = 2000 \left(1 - \frac{\pi}{4}\right) \frac{\pi}{4} = 337.1.$$

The standard deviation of X is

$$\sigma_X = \sqrt{337.1} = 18.36.$$

Example 5.8. Let the probability that the birth weight (in grams) of babies in America is less than 2547 grams be 0.1. If X equals the number of babies that weigh less than 2547 grams at birth among 20 of these babies selected at random, then what is $P(X \le 3)$?

Answer: If a baby weighs less than 2547 grams, then it is a success; otherwise it is a failure. Thus X is a binomial random variable with probability of success p and n = 20. We are given that p = 0.1. Hence

$$\begin{aligned}
P(X \le 3) &= \sum_{k=0}^{3} \binom{20}{k} \left(\frac{1}{10}\right)^k \left(\frac{9}{10}\right)^{20-k} \\
&= 0.867 \quad \text{(from table)}.
\end{aligned}$$
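Examples 5.7 and 5.8 can both be checked numerically. The following sketch computes the theoretical mean and standard deviation of BIN(2000, π/4), runs a small seeded simulation of the point-picking experiment, and evaluates P(X ≤ 3) for BIN(20, 0.1) without a table. The seed, the number of repetitions and the helper names are arbitrary illustrative choices.

```python
import math
import random
from math import comb


def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ BIN(n, p), summed directly from the density."""
    return sum(comb(n, x) * p**x * (1 - p) ** (n - x) for x in range(k + 1))


def count_in_quarter_disk(n_points: int, rng: random.Random) -> int:
    """Number of uniform points in the unit square with x^2 + y^2 < 1."""
    return sum(1 for _ in range(n_points)
               if rng.random() ** 2 + rng.random() ** 2 < 1)


# Example 5.7: theoretical mean and standard deviation of BIN(2000, pi/4).
n, p = 2000, math.pi / 4
mean = n * p
sd = math.sqrt(n * p * (1 - p))
print(round(mean, 1), round(sd, 2))  # 1570.8 18.36

# Seeded simulation: repeat the 2000-point experiment 200 times.
rng = random.Random(0)
counts = [count_in_quarter_disk(n, rng) for _ in range(200)]
sample_mean = sum(counts) / len(counts)
print(sample_mean)  # should land close to the theoretical mean

# Example 5.8: P(X <= 3) for BIN(20, 0.1).
print(round(binomial_cdf(3, 20, 0.1), 3))  # 0.867
```

The simulated average fluctuates from seed to seed, but with 200 repetitions its standard error is roughly 18.36/√200 ≈ 1.3, so it should sit within a few units of 1570.8.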

Example 5.9. Let $X_1, X_2, X_3$ be three independent Bernoulli random variables with the same probability of success p. What is the probability density function of the random variable $X = X_1 + X_2 + X_3$?

Answer: The sample space of the three independent Bernoulli trials is

$$S = \{FFF, FFS, FSF, SFF, FSS, SFS, SSF, SSS\}.$$

The random variable $X = X_1 + X_2 + X_3$ represents the number of successes in each element of S. The following diagram illustrates this.

[Figure: Sum of three Bernoulli Trials]

Let p be the probability of success. Then

$$\begin{aligned}
f(0) &= P(X = 0) = P(FFF) = (1-p)^3 \\
f(1) &= P(X = 1) = P(FFS) + P(FSF) + P(SFF) = 3\,p\,(1-p)^2 \\
f(2) &= P(X = 2) = P(FSS) + P(SFS) + P(SSF) = 3\,p^2 (1-p) \\
f(3) &= P(X = 3) = P(SSS) = p^3.
\end{aligned}$$

Hence

$$f(x) = \binom{3}{x} p^x (1-p)^{3-x}, \quad x = 0, 1, 2, 3.$$

Thus

$$X \sim BIN(3, p).$$

In general, if $X_1, X_2, \ldots, X_n$ are independent and $X_i \sim BER(p)$, then $\sum_{i=1}^{n} X_i \sim BIN(n, p)$ and hence

$$E\left(\sum_{i=1}^{n} X_i\right) = n\,p$$

and

$$Var\left(\sum_{i=1}^{n} X_i\right) = n\,p\,(1-p).$$
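The counting argument of Example 5.9 extends to any number of trials: enumerate every success/failure tuple, weight it by its probability, and group by the number of successes. The sketch below does this by brute force and compares the result with the binomial density; the function name and the test value p = 0.3 are illustrative assumptions.

```python
from itertools import product
from math import comb


def sum_of_bernoulli_pmf(n: int, p: float) -> list[float]:
    """Density of X1 + ... + Xn for independent BER(p) trials, by enumeration."""
    pmf = [0.0] * (n + 1)
    for outcome in product((0, 1), repeat=n):  # 1 = success, 0 = failure
        successes = sum(outcome)
        pmf[successes] += p**successes * (1 - p) ** (n - successes)
    return pmf


p = 0.3
enumerated = sum_of_bernoulli_pmf(3, p)
binomial = [comb(3, x) * p**x * (1 - p) ** (3 - x) for x in range(4)]
print(enumerated)
print(binomial)  # agrees with the enumeration up to rounding
```

The enumeration visits 2^n tuples, so it is only practical for small n; its purpose is to make the role of the binomial coefficient as a tuple count explicit.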

The binomial distribution can arise whenever we select a random sample of n units with replacement. Each unit in the population is classified into one of two categories according to whether it does or does not possess a certain property. For example, the unit may be a person and the property may be whether he intends to vote "yes". If the unit is a machine part, the property may be whether the part is defective, and so on. If the proportion of units in the population possessing the property of interest is p, and if Z denotes the number of units in the sample of size n that possess the given property, then

$$Z \sim BIN(n, p).$$

5.3. Geometric Distribution

If X represents the total number of successes in n independent Bernoulli trials, then the random variable

$$X \sim BIN(n, p),$$

where p is the probability of success of a single Bernoulli trial and the probability density function of X is given by

$$f(x) = \binom{n}{x} p^x (1-p)^{n-x}, \quad x = 0, 1, \ldots, n.$$

Let X denote the trial number on which the first success occurs.

[Figure: Geometric Random Variable. Each sample point (..., FFS, FFFS, FFFFS, FFFFFS, FFFFFFS, ...) in the sample space is mapped to the trial number of its first success in the space of the random variable (1, 2, 3, 4, 5, 6, 7, ...).]

Hence the probability that the first success occurs on the xth trial is given by

$$f(x) = P(X = x) = (1-p)^{x-1}\, p.$$

Hence, the probability density function of X is

$$f(x) = (1-p)^{x-1}\, p, \quad x = 1, 2, 3, \ldots, \infty,$$

where p denotes the probability of success in a single Bernoulli trial.

Definition 5.3. A random variable X has a geometric distribution if its probability density function is given by

$$f(x) = (1-p)^{x-1}\, p, \quad x = 1, 2, 3, \ldots, \infty,$$

where p denotes the probability of success in a single Bernoulli trial.

If X has a geometric distribution we denote it as $X \sim GEO(p)$.

Example 5.10. Is the real valued function f(x) defined by

$$f(x) = (1-p)^{x-1}\, p, \quad x = 1, 2, 3, \ldots, \infty$$

where 0 < p < 1 is a parameter, a probability density function?

Answer: It is easy to check that $f(x) \ge 0$. Thus we only show that the sum is one.

$$\begin{aligned}
\sum_{x=1}^{\infty} f(x) &= \sum_{x=1}^{\infty} (1-p)^{x-1}\, p \\
&= p \sum_{y=0}^{\infty} (1-p)^y, \quad \text{where } y = x - 1 \\
&= p\, \frac{1}{1 - (1-p)} = 1.
\end{aligned}$$

Hence f(x) is a probability density function.

Example 5.11. The probability that a machine produces a defective item is 0.02. Each item is checked as it is produced. Assuming that these are independent trials, what is the probability that at least 100 items must be checked to find one that is defective?

Answer: Let X denote the trial number on which the first defective item is observed. We want to find

$$\begin{aligned}
P(X \ge 100) &= \sum_{x=100}^{\infty} f(x) \\
&= \sum_{x=100}^{\infty} (1-p)^{x-1}\, p \\
&= (1-p)^{99} \sum_{y=0}^{\infty} (1-p)^y\, p \\
&= (1-p)^{99} \\
&= (0.98)^{99} = 0.1353.
\end{aligned}$$

Hence the probability that at least 100 items must be checked to find one that is defective is 0.1353.
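The tail probability in Example 5.11 can be confirmed in two ways: through the closed form (1 − p)^(k−1) for P(X ≥ k), and by subtracting the first 99 terms of the density from one. This is a minimal sketch; the function names are illustrative.

```python
def geometric_pmf(x: int, p: float) -> float:
    """f(x) = (1 - p)^(x - 1) p for x = 1, 2, 3, ..."""
    return (1 - p) ** (x - 1) * p if x >= 1 else 0.0


def geometric_tail(k: int, p: float) -> float:
    """P(X >= k): the first k - 1 trials must all be failures."""
    return (1 - p) ** (k - 1)


p = 0.02
closed_form = geometric_tail(100, p)
by_complement = 1 - sum(geometric_pmf(x, p) for x in range(1, 100))

print(round(closed_form, 4))    # 0.1353
print(round(by_complement, 4))  # 0.1353
```

The two computations agree because the event "at least 100 items must be checked" is the same as "the first 99 items are all non-defective".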

Example 5.12. A gambler plays roulette at Monte Carlo and continues gambling, wagering the same amount each time on "Red", until he wins for the first time. If the probability of "Red" is $\frac{18}{38}$ and the gambler has only enough money for 5 trials, then (a) what is the probability that he will win before he exhausts his funds; (b) what is the probability that he wins on the second trial?

Answer:

$$p = P(\text{Red}) = \frac{18}{38}.$$

(a) Hence the probability that he will win before he exhausts his funds is given by

$$\begin{aligned}
P(X \le 5) &= 1 - P(X \ge 6) \\
&= 1 - (1-p)^5 \\
&= 1 - \left(1 - \frac{18}{38}\right)^5 \\
&= 1 - (0.5263)^5 = 1 - 0.0404 = 0.9596.
\end{aligned}$$

(b) Similarly, the probability that he wins on the second trial is given by

$$\begin{aligned}
P(X = 2) &= f(2) \\
&= (1-p)^{2-1}\, p \\
&= \left(1 - \frac{18}{38}\right) \left(\frac{18}{38}\right) \\
&= \frac{360}{1444} = 0.2493.
\end{aligned}$$
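Because the fifth power of 20/38 is easy to mis-round by hand, it is worth evaluating both parts of Example 5.12 with exact fractions. The sketch below is illustrative; the variable names are not from the text.

```python
from fractions import Fraction

p = Fraction(18, 38)  # probability of "Red" on a single spin

# (a) He wins within five trials unless all five trials are losses.
win_within_five = 1 - (1 - p) ** 5
print(win_within_five, round(float(win_within_five), 4))

# (b) A loss on the first trial followed by a win on the second.
win_on_second = (1 - p) * p
print(win_on_second, round(float(win_on_second), 4))  # 90/361 0.2493
```

Evaluated exactly, (20/38)^5 is about 0.0404, so the probability in part (a) is about 0.9596.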

The following theorem provides us with the mean, variance and moment

generating function of a random variable with the geometric distribution.

Theorem 5.3. If X is a geometric random variable with parameter p, then the mean, variance and moment generating functions are respectively given by

$$\mu_X = \frac{1}{p}$$

$$\sigma_X^2 = \frac{1-p}{p^2}$$

$$M_X(t) = \frac{p\,e^t}{1 - (1-p)\,e^t}, \quad \text{if } t < -\ln(1-p).$$

Proof: First, we compute the moment generating function of X and then we generate the mean and variance of X from it.

$$\begin{aligned}
M(t) &= \sum_{x=1}^{\infty} e^{tx} (1-p)^{x-1}\, p \\
&= p \sum_{y=0}^{\infty} e^{t(y+1)} (1-p)^y, \quad \text{where } y = x - 1 \\
&= p\,e^t \sum_{y=0}^{\infty} \left(e^t (1-p)\right)^y \\
&= \frac{p\,e^t}{1 - (1-p)\,e^t}, \quad \text{if } t < -\ln(1-p).
\end{aligned}$$

Differentiating M(t) with respect to t, we obtain

$$\begin{aligned}
M'(t) &= \frac{\left(1 - (1-p)\,e^t\right) p\,e^t + p\,e^t (1-p)\,e^t}{\left[1 - (1-p)\,e^t\right]^2} \\
&= \frac{p\,e^t \left[1 - (1-p)\,e^t + (1-p)\,e^t\right]}{\left[1 - (1-p)\,e^t\right]^2} \\
&= \frac{p\,e^t}{\left[1 - (1-p)\,e^t\right]^2}.
\end{aligned}$$

Hence

$$\mu_X = E(X) = M'(0) = \frac{1}{p}.$$

Similarly, the second derivative of M(t) can be obtained from the first derivative as

$$M''(t) = \frac{\left[1 - (1-p)\,e^t\right]^2 p\,e^t + p\,e^t\, 2 \left[1 - (1-p)\,e^t\right] (1-p)\,e^t}{\left[1 - (1-p)\,e^t\right]^4}.$$

Hence

$$M''(0) = \frac{p^3 + 2\,p^2 (1-p)}{p^4} = \frac{2-p}{p^2}.$$

Therefore, the variance of X is

$$\begin{aligned}
\sigma_X^2 &= M''(0) - \left(M'(0)\right)^2 \\
&= \frac{2-p}{p^2} - \frac{1}{p^2} \\
&= \frac{1-p}{p^2}.
\end{aligned}$$

This completes the proof of the theorem.
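The mean 1/p and variance (1 − p)/p² of the geometric distribution can also be checked without the moment generating function, by truncating the defining series. The sketch below assumes that 10,000 terms are enough for the tail to be negligible, which holds for the value p = 0.2 used here; the function name is an illustrative choice.

```python
def geometric_moments(p: float, terms: int = 10_000) -> tuple[float, float]:
    """Approximate (mean, variance) of GEO(p) from a truncated series."""
    mean = 0.0
    second_moment = 0.0
    for x in range(1, terms + 1):
        f_x = (1 - p) ** (x - 1) * p  # density at x
        mean += x * f_x
        second_moment += x * x * f_x
    return mean, second_moment - mean**2


p = 0.2
mean, variance = geometric_moments(p)
print(round(mean, 6), 1 / p)                 # both equal 5.0
print(round(variance, 6), (1 - p) / p**2)    # both are 20 up to rounding
```

For values of p very close to zero the tail decays slowly and the number of terms would have to be increased accordingly.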

Theorem 5.4. The random variable X is geometric if and only if it satisfies the memoryless property, that is

$$P(X > m+n \,/\, X > n) = P(X > m)$$

for all natural numbers n and m.

Proof: It is easy to check that the geometric distribution satisfies the lack of memory property

$$P(X > m+n \,/\, X > n) = P(X > m)$$

which is

$$P(X > m+n \text{ and } X > n) = P(X > m)\, P(X > n). \tag{5.1}$$

If X is geometric, that is $X \sim (1-p)^{x-1}\, p$, then

$$\begin{aligned}
P(X > n+m) &= \sum_{x=n+m+1}^{\infty} (1-p)^{x-1}\, p \\
&= (1-p)^{n+m} \\
&= (1-p)^n\, (1-p)^m \\
&= P(X > n)\, P(X > m).
\end{aligned}$$

Hence the geometric distribution has the lack of memory property. Let X be a random variable which satisfies the lack of memory property, that is

$$P(X > m+n \text{ and } X > n) = P(X > m)\, P(X > n).$$

We want to show that X is geometric. Define $g : \mathbb{N} \to \mathbb{R}$ by

$$g(n) := P(X > n) \tag{5.2}$$

Using (5.2) in (5.1), we get

$$g(m+n) = g(m)\, g(n) \quad \forall\, m, n \in \mathbb{N}, \tag{5.3}$$

since P(X > m+n and X > n) = P(X > m+n). Letting m = 1 in (5.3), we see that

$$\begin{aligned}
g(n+1) &= g(n)\, g(1) \\
&= g(n-1)\, g(1)^2 \\
&= g(n-2)\, g(1)^3 \\
&= \cdots \\
&= g(n-(n-1))\, g(1)^n \\
&= g(1)^{n+1} \\
&= a^{n+1},
\end{aligned}$$

where a := g(1) is a constant. Hence $g(n) = a^n$. From (5.2), we get

$$1 - F(n) = P(X > n) = a^n$$

and thus

$$F(n) = 1 - a^n.$$

Since F(n) is a distribution function

$$1 = \lim_{n \to \infty} F(n) = \lim_{n \to \infty} \left(1 - a^n\right).$$

From the above, we conclude that 0 < a < 1. We rename the constant a as (1 − p). Thus,

$$F(n) = 1 - (1-p)^n.$$

The probability density function of X is hence

$$\begin{aligned}
f(1) &= F(1) = p \\
f(2) &= F(2) - F(1) = 1 - (1-p)^2 - 1 + (1-p) = (1-p)\,p \\
f(3) &= F(3) - F(2) = 1 - (1-p)^3 - 1 + (1-p)^2 = (1-p)^2\, p \\
&\;\;\vdots \\
f(x) &= F(x) - F(x-1) = (1-p)^{x-1}\, p.
\end{aligned}$$

Thus X is geometric with parameter p. This completes the proof.
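The memoryless property can be illustrated numerically from the survival function P(X > n) = (1 − p)^n established in the proof. The sketch below is illustrative only: the function names and the values of m, n and p are arbitrary choices.

```python
def geometric_survival(n: int, p: float) -> float:
    """P(X > n) = (1 - p)^n: the first n trials are all failures."""
    return (1 - p) ** n


def conditional_survival(m: int, n: int, p: float) -> float:
    """P(X > m + n given X > n), computed as a ratio of survival probabilities."""
    return geometric_survival(m + n, p) / geometric_survival(n, p)


p = 0.25
for m, n in [(1, 1), (3, 4), (10, 2)]:
    lhs = conditional_survival(m, n, p)
    rhs = geometric_survival(m, p)
    print(m, n, lhs, rhs)  # the last two columns agree up to rounding
```

Having already waited n trials without a success does not change the distribution of the remaining waiting time, which is exactly what the equality of the two columns expresses.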

The difference between the binomial and the geometric distributions is the following. In the binomial distribution, the number of trials is predetermined, whereas in the geometric distribution it is the random variable.

5.4. Negative Binomial Distribution

Let X denote the trial number on which the rth success occurs. Here r is a positive integer greater than or equal to one. This is equivalent to saying that the random variable X denotes the number of trials needed to observe the rth success. Suppose we want to find the probability that the fifth head is observed on the 10th independent flip of an unbiased coin. This is a case of finding P(X = 10). Let us find the general case P(X = x).

$$\begin{aligned}
P(X = x) &= P(\text{first } x-1 \text{ trials contain } x-r \text{ failures and } r-1 \text{ successes}) \cdot P(r\text{th success in } x\text{th trial}) \\
&= \binom{x-1}{r-1} p^{r-1} (1-p)^{x-r}\, p \\
&= \binom{x-1}{r-1} p^{r} (1-p)^{x-r}, \quad x = r, r+1, \ldots, \infty.
\end{aligned}$$

Hence the probability density function of the random variable X is given by

$$f(x) = \binom{x-1}{r-1} p^{r} (1-p)^{x-r}, \quad x = r, r+1, \ldots, \infty.$$

Notice that this probability density function can also be expressed, with x now counting the number of failures that precede the rth success, as

$$f(x) = \binom{x+r-1}{r-1} p^{r} (1-p)^{x}, \quad x = 0, 1, \ldots, \infty.$$

[Figure: sample points for r = 4 (SSSS; FSSSS, SFSSS, SSFSS, SSSFS; FFSSSS, FSFSSS, FSSFSS, FSSSFS, ...), illustrating that X is NBIN(4, p).]

Definition 5.4. A random variable X has the negative binomial (or Pascal) distribution if its probability density function is of the form

$$f(x) = \binom{x-1}{r-1} p^{r} (1-p)^{x-r}, \quad x = r, r+1, \ldots, \infty,$$

where p is the probability of success in a single Bernoulli trial. We denote the random variable X whose distribution is negative binomial by writing $X \sim NBIN(r, p)$.

We need the following technical result to show that the above function

is really a probability density function. The technical result we are going to

establish is called the negative binomial theorem.

Theorem 5.5. Let r be a nonzero positive integer. Then

$$(1-y)^{-r} = \sum_{x=r}^{\infty} \binom{x-1}{r-1} y^{x-r}$$

where |y| < 1.

Proof: Define

$$h(y) = (1-y)^{-r}.$$

Now expanding h(y) by the Taylor series method about y = 0, we get

$$(1-y)^{-r} = \sum_{k=0}^{\infty} \frac{h^{(k)}(0)}{k!}\, y^k,$$

where $h^{(k)}(y)$ is the kth derivative of h. This kth derivative of h(y) can be directly computed, and direct computation gives

$$h^{(k)}(y) = r\,(r+1)\,(r+2) \cdots (r+k-1)\,(1-y)^{-(r+k)}.$$

Hence, we get

$$h^{(k)}(0) = r\,(r+1)\,(r+2) \cdots (r+k-1) = \frac{(r+k-1)!}{(r-1)!}.$$

Substituting this into the Taylor expansion of h(y), we get

$$\begin{aligned}
(1-y)^{-r} &= \sum_{k=0}^{\infty} \frac{(r+k-1)!}{(r-1)!\; k!}\, y^k \\
&= \sum_{k=0}^{\infty} \binom{r+k-1}{r-1} y^k.
\end{aligned}$$

Letting x = k + r, we get

$$(1-y)^{-r} = \sum_{x=r}^{\infty} \binom{x-1}{r-1} y^{x-r}.$$

This completes the proof of the theorem.

Theorem 5.5 can also be proved using the geometric series

$$\sum_{n=0}^{\infty} y^n = \frac{1}{1-y} \tag{5.4}$$

where |y| < 1. Differentiating both sides of the equality (5.4) k times and then simplifying, we have

$$\sum_{n=k}^{\infty} \binom{n}{k} y^{n-k} = \frac{1}{(1-y)^{k+1}}. \tag{5.5}$$

Letting n = x − 1 and k = r − 1 in (5.5), we have the asserted result.

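Example 5.13. Is the real valued function defined by

$$f(x) = \binom{x-1}{r-1} p^{r} (1-p)^{x-r}, \quad x = r, r+1, \ldots, \infty,$$

where 0 < p < 1 is a parameter, a probability density function?

Answer: It is easy to check that $f(x) \ge 0$. Now we show that $\sum_{x=r}^{\infty} f(x)$ is equal to one. By Theorem 5.5 with y = 1 − p,

$$\begin{aligned}
\sum_{x=r}^{\infty} f(x) &= \sum_{x=r}^{\infty} \binom{x-1}{r-1} p^{r} (1-p)^{x-r} \\
&= p^r \sum_{x=r}^{\infty} \binom{x-1}{r-1} (1-p)^{x-r} \\
&= p^r \left(1 - (1-p)\right)^{-r} \\
&= p^r\, p^{-r} \\
&= 1.
\end{aligned}$$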

Computing the mean and variance of the negative binomial distribution using the definition is difficult. However, if we use the moment generating approach, then it is not so difficult. Hence in the next example, we determine the moment generating function of this negative binomial distribution.

Example 5.14. What is the moment generating function of the random variable X whose probability density function is

$$f(x) = \binom{x-1}{r-1} p^{r} (1-p)^{x-r}, \quad x = r, r+1, \ldots, \infty?$$

Answer: The moment generating function of this negative binomial random variable is

$$\begin{aligned}
M(t) &= \sum_{x=r}^{\infty} e^{tx} f(x) \\
&= \sum_{x=r}^{\infty} e^{tx} \binom{x-1}{r-1} p^{r} (1-p)^{x-r} \\
&= p^r \sum_{x=r}^{\infty} e^{t(x-r)}\, e^{tr} \binom{x-1}{r-1} (1-p)^{x-r} \\
&= p^r e^{tr} \sum_{x=r}^{\infty} \binom{x-1}{r-1} e^{t(x-r)} (1-p)^{x-r} \\
&= p^r e^{tr} \sum_{x=r}^{\infty} \binom{x-1}{r-1} \left[e^t (1-p)\right]^{x-r} \\
&= p^r e^{tr} \left[1 - (1-p)\,e^t\right]^{-r} \\
&= \left(\frac{p\,e^t}{1 - (1-p)\,e^t}\right)^r, \quad \text{if } t < -\ln(1-p).
\end{aligned}$$

The following theorem can easily be proved.

Theorem 5.6. If $X \sim NBIN(r, p)$, then

$$E(X) = \frac{r}{p}$$

$$Var(X) = \frac{r\,(1-p)}{p^2}$$

$$M(t) = \left(\frac{p\,e^t}{1 - (1-p)\,e^t}\right)^r, \quad \text{if } t < -\ln(1-p).$$

Example 5.15. What is the probability that the fifth head is observed on the 10th independent flip of a coin?

Answer: Let X denote the number of trials needed to observe the 5th head. Hence X has a negative binomial distribution with r = 5 and $p = \frac{1}{2}$. We want to find

$$\begin{aligned}
P(X = 10) &= f(10) \\
&= \binom{9}{4} p^5 (1-p)^5 \\
&= \binom{9}{4} \left(\frac{1}{2}\right)^{10} \\
&= \frac{63}{512}.
\end{aligned}$$
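Example 5.15 and the moments of the negative binomial distribution (mean r/p and variance r(1 − p)/p²) can be checked directly from the density. In the sketch below the infinite sums are truncated at x = 400, which is an assumption that is harmless for p = 1/2; the function name is illustrative.

```python
from fractions import Fraction
from math import comb


def negative_binomial_pmf(x: int, r: int, p):
    """f(x) = C(x-1, r-1) p^r (1-p)^(x-r) for x = r, r+1, ...; p may be a Fraction."""
    if x < r:
        return 0
    return comb(x - 1, r - 1) * p**r * (1 - p) ** (x - r)


# Example 5.15: fifth head on the tenth flip of a fair coin.
print(negative_binomial_pmf(10, 5, Fraction(1, 2)))  # 63/512

# Truncated sums for total probability, mean and variance with r = 5, p = 1/2.
r, p = 5, 0.5
support = range(r, 400)
total = sum(negative_binomial_pmf(x, r, p) for x in support)
mean = sum(x * negative_binomial_pmf(x, r, p) for x in support)
second = sum(x * x * negative_binomial_pmf(x, r, p) for x in support)
print(round(total, 6), round(mean, 6), round(second - mean**2, 6))  # 1.0 10.0 10.0
```

Here r/p = 10 and r(1 − p)/p² = 10, so the truncated sums reproduce both moments.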

We close this section with the following comment. In the negative binomial distribution the parameter r is a positive integer. One can generalize the negative binomial distribution to allow noninteger values of the parameter r. To do this let us write the probability density function of the negative binomial distribution as

$$\begin{aligned}
f(x) &= \binom{x-1}{r-1} p^{r} (1-p)^{x-r} \\
&= \frac{(x-1)!}{(r-1)!\,(x-r)!}\, p^{r} (1-p)^{x-r} \\
&= \frac{\Gamma(x)}{\Gamma(r)\, \Gamma(x-r+1)}\, p^{r} (1-p)^{x-r}, \quad \text{for } x = r, r+1, \ldots, \infty,
\end{aligned}$$

where

$$\Gamma(z) = \int_0^{\infty} t^{z-1} e^{-t}\, dt$$

is the well known gamma function. The gamma function generalizes the notion of factorial and it will be treated in the next chapter.
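The gamma-function form of the density can be evaluated for noninteger r with the standard library's log-gamma function. The sketch below uses the identities (x − 1)! = Γ(x), (r − 1)! = Γ(r) and (x − r)! = Γ(x − r + 1); the function name and the test values r = 2.5 and p = 0.4 are illustrative assumptions, and the infinite sum is truncated after 400 terms.

```python
from math import exp, lgamma, log


def generalized_nb_pmf(x: float, r: float, p: float) -> float:
    """Gamma-function form of the negative binomial density, for x = r, r+1, ..."""
    log_coefficient = lgamma(x) - lgamma(r) - lgamma(x - r + 1)
    return exp(log_coefficient + r * log(p) + (x - r) * log(1 - p))


# For integer r the formula reproduces the binomial-coefficient form.
print(generalized_nb_pmf(10, 5, 0.5), 63 / 512)

# For noninteger r the values still sum to one over x = r, r+1, r+2, ...
r, p = 2.5, 0.4
total = sum(generalized_nb_pmf(r + k, r, p) for k in range(400))
print(round(total, 9))  # 1.0
```

Working with logarithms of the gamma function avoids overflow when x is large, which is the main reason for preferring `lgamma` over `gamma` here.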

5.5. Hypergeometric Distribution

Consider a collection of n objects which can be classified into two classes, say class 1 and class 2. Suppose that there are $n_1$ objects in class 1 and $n_2$ objects in class 2. A collection of r objects is selected from these n objects at random and without replacement. We are interested in finding out the probability that exactly x of these r objects are from class 1. If x of these r objects are from class 1, then the remaining r − x objects must be from class 2. We can select x objects from class 1 in any one of $\binom{n_1}{x}$ ways. Similarly, the remaining r − x objects can be selected in $\binom{n_2}{r-x}$ ways. Thus, the number of ways one can select a subset of r objects from a set of n objects, such that x of the objects will be from class 1 and r − x of the objects will be from class 2, is given by $\binom{n_1}{x} \binom{n_2}{r-x}$. Hence,

$$P(X = x) = \frac{\binom{n_1}{x} \binom{n_2}{r-x}}{\binom{n}{r}},$$

where $x \le r$, $x \le n_1$ and $r - x \le n_2$.

[Figure: From $n_1 + n_2$ objects select r objects such that x objects are of class I and r − x are of class II; x are selected out of the $n_1$ objects of class I and r − x are chosen out of the $n_2$ objects of class II.]

Definition 5.5. A random variable X is said to have a hypergeometric distribution if its probability density function is of the form

$$f(x) = \frac{\binom{n_1}{x} \binom{n_2}{r-x}}{\binom{n_1+n_2}{r}}, \quad x = 0, 1, 2, \ldots, r$$

where $x \le n_1$ and $r - x \le n_2$ with $n_1$ and $n_2$ being two positive integers. We shall denote such a random variable by writing

$$X \sim HYP(n_1, n_2, r).$$

Example 5.16. Suppose there are 3 defective items in a lot of 50 items. A sample of size 10 is taken at random and without replacement. Let X denote the number of defective items in the sample. What is the probability that the sample contains at most one defective item?

Answer: Clearly, $X \sim HYP(3, 47, 10)$. Hence the probability that the sample contains at most one defective item is

$$\begin{aligned}
P(X \le 1) &= P(X = 0) + P(X = 1) \\
&= \frac{\binom{3}{0} \binom{47}{10}}{\binom{50}{10}} + \frac{\binom{3}{1} \binom{47}{9}}{\binom{50}{10}} \\
&= 0.504 + 0.398 \\
&= 0.902.
\end{aligned}$$

Example 5.17. A random sample of 5 students is drawn without replacement from among 300 seniors, and each of these 5 seniors is asked if she/he has tried a certain drug. Suppose 50% of the seniors actually have tried the drug. What is the probability that two of the students interviewed have tried the drug?

Answer: Let X denote the number of students interviewed who have tried the drug. Hence the probability that two of the students interviewed have tried the drug is

$$\begin{aligned}
P(X = 2) &= \frac{\binom{150}{2} \binom{150}{3}}{\binom{300}{5}} \\
&= 0.3146.
\end{aligned}$$

Example 5.18. A radio supply house has 200 transistor radios, of which 3 are improperly soldered and 197 are properly soldered. The supply house randomly draws 4 radios without replacement and sends them to a customer. What is the probability that the supply house sends 2 improperly soldered radios to its customer?

Answer: The probability that the supply house sends 2 improperly soldered radios to its customer is

$$\begin{aligned}
P(X = 2) &= \frac{\binom{3}{2} \binom{197}{2}}{\binom{200}{4}} \\
&= 0.000895.
\end{aligned}$$
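The three hypergeometric examples above reduce to ratios of binomial coefficients, which are easy to evaluate exactly by machine. The following is a minimal sketch; `hypergeometric_pmf` is an illustrative name, and its argument order follows the notation HYP(n1, n2, r).

```python
from math import comb


def hypergeometric_pmf(x: int, n1: int, n2: int, r: int) -> float:
    """P(X = x) when r objects are drawn without replacement from n1 + n2."""
    if x < 0 or x > r or x > n1 or r - x > n2:
        return 0.0
    return comb(n1, x) * comb(n2, r - x) / comb(n1 + n2, r)


# Example 5.16: at most one defective item among 10 drawn from 3 + 47.
at_most_one = hypergeometric_pmf(0, 3, 47, 10) + hypergeometric_pmf(1, 3, 47, 10)
print(round(at_most_one, 3))                         # 0.902

# Example 5.17: two of five interviewed seniors, 150 + 150.
print(round(hypergeometric_pmf(2, 150, 150, 5), 4))  # 0.3146

# Example 5.18: two improperly soldered radios among 4 drawn from 3 + 197.
print(round(hypergeometric_pmf(2, 3, 197, 4), 6))    # 0.000895
```

Computed without intermediate rounding, the probability in Example 5.16 comes out near 0.902; adding two separately rounded terms can shift the last digit.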

Theorem 5.7. If $X \sim HYP(n_1, n_2, r)$, then

$$E(X) = r\, \frac{n_1}{n_1+n_2}$$

$$Var(X) = r \left(\frac{n_1}{n_1+n_2}\right) \left(\frac{n_2}{n_1+n_2}\right) \left(\frac{n_1+n_2-r}{n_1+n_2-1}\right).$$

Proof: Let $X \sim HYP(n_1, n_2, r)$. We compute the mean and variance of X by computing the first and the second factorial moments of the random variable X. First, we compute the first factorial moment (which is the same as the expected value) of X. The expected value of X is given by

$$\begin{aligned}
E(X) &= \sum_{x=0}^{r} x\, f(x) \\
&= \sum_{x=0}^{r} x\, \frac{\binom{n_1}{x} \binom{n_2}{r-x}}{\binom{n_1+n_2}{r}} \\
&= n_1 \sum_{x=1}^{r} \frac{(n_1-1)!}{(x-1)!\,(n_1-x)!}\, \frac{\binom{n_2}{r-x}}{\binom{n_1+n_2}{r}} \\
&= n_1 \sum_{x=1}^{r} \frac{\binom{n_1-1}{x-1} \binom{n_2}{r-x}}{\frac{n_1+n_2}{r} \binom{n_1+n_2-1}{r-1}} \\
&= r\, \frac{n_1}{n_1+n_2} \sum_{y=0}^{r-1} \frac{\binom{n_1-1}{y} \binom{n_2}{r-1-y}}{\binom{n_1+n_2-1}{r-1}}, \quad \text{where } y = x - 1 \\
&= r\, \frac{n_1}{n_1+n_2}.
\end{aligned}$$

The last equality is obtained since

$$\sum_{y=0}^{r-1} \frac{\binom{n_1-1}{y} \binom{n_2}{r-1-y}}{\binom{n_1+n_2-1}{r-1}} = 1.$$

Similarly, we find the second factorial moment of X to be

$$E(X(X-1)) = \frac{r\,(r-1)\, n_1 (n_1-1)}{(n_1+n_2)\,(n_1+n_2-1)}.$$

Therefore, the variance of X is

$$\begin{aligned}
Var(X) &= E(X^2) - E(X)^2 \\
&= E(X(X-1)) + E(X) - E(X)^2 \\
&= \frac{r\,(r-1)\, n_1 (n_1-1)}{(n_1+n_2)\,(n_1+n_2-1)} + r\, \frac{n_1}{n_1+n_2} - \left(r\, \frac{n_1}{n_1+n_2}\right)^2 \\
&= r \left(\frac{n_1}{n_1+n_2}\right) \left(\frac{n_2}{n_1+n_2}\right) \left(\frac{n_1+n_2-r}{n_1+n_2-1}\right).
\end{aligned}$$
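The closed forms for the hypergeometric mean and variance can be verified exactly for particular parameter values by summing over the support with rational arithmetic. This is a sketch under illustrative choices: the function name is hypothetical, and the parameters (3, 47, 10) are those of Example 5.16.

```python
from fractions import Fraction
from math import comb


def hypergeometric_moments(n1: int, n2: int, r: int) -> tuple[Fraction, Fraction]:
    """Exact (mean, variance) of HYP(n1, n2, r), summed over the support."""
    total = comb(n1 + n2, r)
    support = range(max(0, r - n2), min(r, n1) + 1)
    pmf = {x: Fraction(comb(n1, x) * comb(n2, r - x), total) for x in support}
    mean = sum(x * f for x, f in pmf.items())
    second = sum(x * x * f for x, f in pmf.items())
    return mean, second - mean**2


n1, n2, r = 3, 47, 10
mean, variance = hypergeometric_moments(n1, n2, r)

# Closed forms from Theorem 5.7.
mean_formula = Fraction(r * n1, n1 + n2)
variance_formula = Fraction(r * n1 * n2 * (n1 + n2 - r),
                            (n1 + n2) ** 2 * (n1 + n2 - 1))

print(mean, mean_formula)          # 3/5 3/5
print(variance, variance_formula)  # both 564/1225
```

Using `Fraction` removes any doubt about rounding, so equality between the summed and closed-form values is exact rather than approximate.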

5.6. Poisson Distribution

In this section, we deﬁne an important discrete distribution which is

widely used for modeling many real life situations. First, we deﬁne this

distribution and then we present some of its important properties.

Definition 5.6. A random variable X is said to have a Poisson distribution if its probability density function is given by

$$f(x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \ldots, \infty,$$

where $0 < \lambda < \infty$ is a parameter. We denote such a random variable by $X \sim POI(\lambda)$.

The probability density function f is called the Poisson distribution after Simeon D. Poisson (1781-1840).

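Example 5.19. Is the real valued function defined by

$$f(x) = \frac{e^{-\lambda} \lambda^x}{x!}, \quad x = 0, 1, 2, \ldots, \infty,$$

where $0 < \lambda < \infty$ is a parameter, a probability density function?

Answer: It is easy to check that $f(x) \ge 0$. We show that $\sum_{x=0}^{\infty} f(x)$ is equal to one.

$$\begin{aligned}
\sum_{x=0}^{\infty} f(x) &= \sum_{x=0}^{\infty} \frac{e^{-\lambda} \lambda^x}{x!} \\
&= e^{-\lambda} \sum_{x=0}^{\infty} \frac{\lambda^x}{x!} \\
&= e^{-\lambda}\, e^{\lambda} = 1.
\end{aligned}$$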

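Theorem 5.8. If $X \sim POI(\lambda)$, then

$$E(X) = \lambda$$

$$Var(X) = \lambda$$

$$M(t) = e^{\lambda (e^t - 1)}.$$

Proof: First, we find the moment generating function of X.

$$\begin{aligned}
M(t) &= \sum_{x=0}^{\infty} e^{tx} f(x) \\
&= \sum_{x=0}^{\infty} e^{tx}\, \frac{e^{-\lambda} \lambda^x}{x!} \\
&= e^{-\lambda} \sum_{x=0}^{\infty} \frac{e^{tx} \lambda^x}{x!} \\
&= e^{-\lambda} \sum_{x=0}^{\infty} \frac{\left(e^t \lambda\right)^x}{x!} \\
&= e^{-\lambda}\, e^{\lambda e^t} \\
&= e^{\lambda (e^t - 1)}.
\end{aligned}$$

Thus,

$$M'(t) = \lambda\, e^t\, e^{\lambda (e^t - 1)},$$

and

$$E(X) = M'(0) = \lambda.$$

Similarly,

$$M''(t) = \lambda\, e^t\, e^{\lambda (e^t - 1)} + \left(\lambda\, e^t\right)^2 e^{\lambda (e^t - 1)}.$$

Hence

$$M''(0) = E(X^2) = \lambda^2 + \lambda.$$

Therefore

$$Var(X) = E(X^2) - \left(E(X)\right)^2 = \lambda^2 + \lambda - \lambda^2 = \lambda.$$

This completes the proof.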
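The equality of mean and variance for the Poisson distribution can be checked numerically by truncating the defining series. The sketch below builds each density value from the previous one through the recurrence f(x) = f(x − 1)·λ/x, which avoids large factorials; the function name and the truncation at 200 terms are illustrative assumptions.

```python
from math import exp


def poisson_moments(lam: float, terms: int = 200) -> tuple[float, float]:
    """Approximate (mean, variance) of POI(lam) from a truncated series."""
    term = exp(-lam)  # f(0)
    mean = 0.0
    second_moment = 0.0
    for x in range(1, terms + 1):
        term *= lam / x  # f(x) = f(x - 1) * lam / x
        mean += x * term
        second_moment += x * x * term
    return mean, second_moment - mean**2


for lam in (1.0, 3.0, 4.6):
    mean, variance = poisson_moments(lam)
    print(lam, round(mean, 6), round(variance, 6))  # mean and variance both equal lam
```

For much larger values of λ the number of terms would need to grow, since the bulk of the probability mass sits near x = λ.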

Example 5.20. A random variable X has a Poisson distribution with a mean of 3. What is the probability that X is bounded by 1 and 3, that is, $P(1 \le X \le 3)$?

Answer:

$$\mu_X = 3 = \lambda$$

$$f(x) = \frac{\lambda^x e^{-\lambda}}{x!}$$

Hence

$$f(x) = \frac{3^x e^{-3}}{x!}, \quad x = 0, 1, 2, \ldots$$

Therefore

$$\begin{aligned}
P(1 \le X \le 3) &= f(1) + f(2) + f(3) \\
&= 3\, e^{-3} + \frac{9}{2}\, e^{-3} + \frac{27}{6}\, e^{-3} \\
&= 12\, e^{-3}.
\end{aligned}$$

Example 5.21. The number of traffic accidents per week in a small city has a Poisson distribution with mean equal to 3. What is the probability that exactly 2 accidents occur in 2 weeks?

Answer: The mean number of traffic accidents per week is 3. Thus, the mean number of accidents in two weeks is

$$\lambda = (3)\,(2) = 6.$$

Since

$$f(x) = \frac{\lambda^x e^{-\lambda}}{x!}$$

we get

$$f(2) = \frac{6^2 e^{-6}}{2!} = 18\, e^{-6}.$$

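Example 5.22. Let X have a Poisson distribution with parameter $\lambda = 1$. What is the probability that $X \ge 2$ given that $X \le 4$?

Answer:

$$P(X \ge 2 \,/\, X \le 4) = \frac{P(2 \le X \le 4)}{P(X \le 4)}.$$

$$\begin{aligned}
P(2 \le X \le 4) &= \sum_{x=2}^{4} \frac{\lambda^x e^{-\lambda}}{x!} \\
&= \frac{1}{e} \sum_{x=2}^{4} \frac{1}{x!} \\
&= \frac{17}{24\, e}.
\end{aligned}$$

Similarly

$$\begin{aligned}
P(X \le 4) &= \frac{1}{e} \sum_{x=0}^{4} \frac{1}{x!} \\
&= \frac{65}{24\, e}.
\end{aligned}$$

Therefore, we have

$$P(X \ge 2 \,/\, X \le 4) = \frac{17}{65}.$$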

Example 5.23. If the moment generating function of a random variable X is $M(t) = e^{4.6\,(e^t - 1)}$, then what are the mean and variance of X? What is the probability that X is between 3 and 6, that is P(3 < X < 6)?

Answer: Since the moment generating function of X is given by

$$M(t) = e^{4.6\,(e^t - 1)}$$

we conclude that $X \sim POI(\lambda)$ with $\lambda = 4.6$. Thus, by Theorem 5.8, we get

$$E(X) = 4.6 = Var(X).$$

$$\begin{aligned}
P(3 < X < 6) &= f(4) + f(5) \\
&= F(5) - F(3) \\
&= 0.686 - 0.326 \\
&= 0.36.
\end{aligned}$$
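The Poisson probabilities in Examples 5.20 through 5.23 can be reproduced without tables. This is a minimal sketch; the helper names `poisson_pmf` and `poisson_cdf` and the result variable names are illustrative choices.

```python
from math import exp, factorial


def poisson_pmf(x: int, lam: float) -> float:
    """f(x) = e^(-lam) lam^x / x! for x = 0, 1, 2, ..."""
    if x < 0:
        return 0.0
    return exp(-lam) * lam**x / factorial(x)


def poisson_cdf(k: int, lam: float) -> float:
    """F(k) = P(X <= k)."""
    return sum(poisson_pmf(x, lam) for x in range(k + 1))


# Example 5.20: P(1 <= X <= 3) with lam = 3, which equals 12 e^(-3).
p_520 = sum(poisson_pmf(x, 3) for x in (1, 2, 3))

# Example 5.21: exactly 2 accidents in two weeks, lam = 6, which equals 18 e^(-6).
p_521 = poisson_pmf(2, 6)

# Example 5.22: P(X >= 2 given X <= 4) with lam = 1, which equals 17/65.
p_522 = (poisson_cdf(4, 1) - poisson_cdf(1, 1)) / poisson_cdf(4, 1)

# Example 5.23: P(3 < X < 6) = F(5) - F(3) with lam = 4.6.
p_523 = poisson_cdf(5, 4.6) - poisson_cdf(3, 4.6)

print(p_520, p_521, p_522)
print(round(p_523, 2))  # 0.36
```

In Example 5.22 the factor 1/e cancels in the ratio, which is why the conditional probability is a plain rational number.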

5.7. Riemann Zeta Distribution

The zeta distribution was used by the Italian economist Vilfredo Pareto

(1848-1923) to study the distribution of family incomes of a given country.

Definition 5.7. A random variable X is said to have a Riemann zeta distribution if its probability density function is of the form

$$f(x) = \frac{1}{\zeta(\alpha+1)}\, x^{-(\alpha+1)}, \quad x = 1, 2, 3, \ldots, \infty$$

where $\alpha > 0$ is a parameter and

$$\zeta(s) = 1 + \left(\frac{1}{2}\right)^s + \left(\frac{1}{3}\right)^s + \left(\frac{1}{4}\right)^s + \cdots + \left(\frac{1}{x}\right)^s + \cdots$$

is the well known Riemann zeta function. A random variable having a Riemann zeta distribution with parameter $\alpha$ will be denoted by $X \sim RIZ(\alpha)$.

The following figures illustrate the Riemann zeta distribution for the case $\alpha = 2$ and $\alpha = 1$.

The following theorem is easy to prove and we leave its proof to the reader.

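Theorem 5.9. If $X \sim RIZ(\alpha)$, then

$$E(X) = \frac{\zeta(\alpha)}{\zeta(\alpha + 1)}$$

$$Var(X) = \frac{\zeta(\alpha - 1)\,\zeta(\alpha + 1) - (\zeta(\alpha))^{2}}{(\zeta(\alpha + 1))^{2}}.$$

Remark 5.1. If $0 < \alpha \le 1$, then $\zeta(\alpha) = \infty$. Hence if $X \sim RIZ(\alpha)$ and the parameter $\alpha \le 1$, then the variance of X is infinite.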
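The moment formulas for the zeta distribution can be illustrated numerically. The sketch below is an illustration only: it approximates $\zeta(s)$ by a truncated series (the function `zeta` and its `terms` parameter are assumptions made here, and the truncation is only adequate for $s > 1$), and the choice $\alpha = 3$ is arbitrary.

```python
import math


def zeta(s: float, terms: int = 200_000) -> float:
    """Truncated series 1 + 2^(-s) + 3^(-s) + ...; a rough approximation for s > 1."""
    return sum(k ** (-s) for k in range(1, terms + 1))


alpha = 3.0  # arbitrary illustrative parameter with finite mean and variance

# Mean and variance in terms of the zeta function.
mean = zeta(alpha) / zeta(alpha + 1)
variance = (zeta(alpha - 1) * zeta(alpha + 1) - zeta(alpha) ** 2) / zeta(alpha + 1) ** 2

# Sanity check of the truncated series against the known value zeta(2) = pi^2 / 6.
zeta2_error = abs(zeta(2.0) - math.pi ** 2 / 6)

print(mean, variance, zeta2_error)
```

For $\alpha \le 1$ the series for $\zeta(\alpha)$ diverges, so a truncated sum like this one would return a finite but meaningless number; the sketch should only be used where the series converges.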

5.8. Review Exercises

1. What is the probability of getting exactly 3 heads in 5 ﬂips of a fair coin?

2. On six successive ﬂips of a fair coin, what is the probability of observing

3 heads and 3 tails?

3. What is the probability that in 3 rolls of a pair of six-sided dice, exactly

one total of 7 is rolled?

4. What is the probability of getting exactly four "sixes" when a die is rolled

7 times?

5. In a family of 4 children, what is the probability that there will be exactly

two boys?

6. If a fair coin is tossed 4 times, what is the probability of getting at least

two heads?

7. In Louisville the probability that a thunderstorm will occur on any day

during the spring is 0.05. Assuming independence, what is the probability

that the ﬁrst thunderstorm occurs on April 5? (Assume spring begins on

March 1.)

8. A ball is drawn from an urn containing 3 white and 3 black balls. After

the ball is drawn, it is then replaced and another ball is drawn. This goes on

indeﬁnitely. What is the probability that, of the ﬁrst 4 balls drawn, exactly

2 are white?

9. What is the probability that a person ﬂipping a fair coin requires four

tosses to get a head?

10. Assume that hitting oil at one drilling location is independent of another,

and that, in a particular region, the probability of success at any individual

location is 0.3. Suppose the drilling company believes that a venture will

be proﬁtable if the number of wells drilled until the second success occurs

is less than or equal to 7. What is the probability that the venture will be

proﬁtable?

11. Suppose an experiment consists of tossing a fair coin until three heads

occur. What is the probability that the experiment ends after exactly six

ﬂips of the coin with a head on the ﬁfth toss as well as on the sixth?

12. Customers at Fred's Cafe win a $100 prize if their cash register receipts

show a star on each of the five consecutive days Monday, Tuesday, ...,

Friday in any one week. The cash register is programmed to print stars on

a randomly selected 10% of the receipts. If Mark eats at Fred's Cafe once

each day for four consecutive weeks, and if the appearance of the stars is

an independent process, what is the probability that Mark will win at least

$100?

13. If a fair coin is tossed repeatedly, what is the probability that the third

head occurs on the nth toss?

14. Suppose 30 percent of all electrical fuses manufactured by a certain

company fail to meet municipal building standards. What is the probability

that in a random sample of 10 fuses, exactly 3 will fail to meet municipal

building standards?

15. A bin of 10 light bulbs contains 4 that are defective. If 3 bulbs are chosen

without replacement from the bin, what is the probability that exactly k of

the bulbs in the sample are defective?

16. Let X denote the number of independent rolls of a fair die required to

obtain the ﬁrst "3". What is P (X 6)?

17. The number of automobiles crossing a certain intersection during any

time interval of length t minutes between 3:00 P.M. and 4:00 P.M. has a

Poisson distribution with mean t . Let W be time elapsed after 3:00 P.M.

before the ﬁrst automobile crosses the intersection. What is the probability

that W is less than 2 minutes?

18. In rolling one die repeatedly, what is the probability of getting the third

six on the xth roll?

19. A coin is tossed 6 times. What is the probability that the number of

heads in the ﬁrst 3 throws is the same as the number in the last 3 throws?

20. One hundred pennies are being distributed independently and at random

into 30 boxes, labeled 1, 2, ..., 30. What is the probability that there are

exactly 3 pennies in box number 1?

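21. The density function of a certain random variable is

$$f(x) = \begin{cases} \dbinom{22}{4x} (0.2)^{4x} (0.8)^{22 - 4x} & \text{if } x = 0, \tfrac{1}{4}, \tfrac{2}{4}, \dots, \tfrac{22}{4} \\[4pt] 0 & \text{otherwise.} \end{cases}$$

What is the expected value of $X^{2}$?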

22. If $M_X(t) = k\,(2 + 3e^{t})^{100}$, what is the value of k? What is the variance

of the random variable X?

23. If $M_X(t) = k \left( \dfrac{e^{t}}{7 - 5e^{t}} \right)^{3}$, what is the value of k? What is the variance of

the random variable X?

24. If for a Poisson distribution 2f (0) + f (2) = 2f (1), what is the mean of

the distribution?

25. The number of hits, X, per baseball game, has a Poisson distribution.

If the probability of a no-hit game is $\frac{1}{3}$, what is the probability of having 2

or more hits in a specified game?

26. Suppose X has a Poisson distribution with a standard deviation of 4.

What is the conditional probability that X is exactly 1 given that X 1 ?

27. A die is loaded in such a way that the probability of the face with j dots

turning up is proportional to $j^{2}$ for $j = 1, 2, 3, 4, 5, 6$. What is the probability

of rolling at most three sixes in 5 independent casts of this die?

28. A die is loaded in such a way that the probability of the face with j dots

turning up is proportional to $j^{2}$ for $j = 1, 2, 3, 4, 5, 6$. What is the probability

of getting the third six on the 7th roll of this loaded die?

Chapter 6

SOME SPECIAL

CONTINUOUS

DISTRIBUTIONS

In this chapter, we study some well known continuous probability density

functions. We want to study them because they arise in many applications.

We begin with the simplest probability density function.

6.1. Uniform Distribution

Let the random variable X denote the outcome when a point is selected at random from an interval $[a, b]$. We want to find the probability of the event $X \le x$, that is, we would like to determine the probability that the point selected from $[a, b]$ is less than or equal to x. To compute this probability, we need a probability measure $\mu$ that satisfies the three axioms of Kolmogorov (namely nonnegativity, normalization and countable additivity). For continuous variables, the events are intervals or unions of intervals. The length of the interval, when normalized, satisfies all three axioms and thus it can be used as a probability measure for one-dimensional random variables. Hence

$$P(X \le x) = \frac{\text{length of } [a, x]}{\text{length of } [a, b]}.$$

Thus, the cumulative distribution function F is

$$F(x) = P(X \le x) = \frac{x - a}{b - a}, \qquad a \le x \le b,$$

where a and b are any two real constants with $a < b$. To determine the probability density function from the cumulative distribution function, we calculate the derivative of $F(x)$. Hence

$$f(x) = \frac{d}{dx} F(x) = \frac{1}{b - a}, \qquad a \le x \le b.$$

Definition 6.1. A random variable X is said to be uniform on the interval $[a, b]$ if its probability density function is of the form

$$f(x) = \frac{1}{b - a}, \qquad a \le x \le b,$$

where a and b are constants. We denote a random variable X with the uniform distribution on the interval $[a, b]$ as $X \sim UNIF(a, b)$.

The uniform distribution provides a probability model for selecting points

at random from an interval [a, b ]. An important application of uniform dis-

tribution lies in random number generation. The following theorem gives

the mean, variance and moment generating function of a uniform random

variable.

Theorem 6.1. If X is uniform on the interval $[a, b]$ then the mean, variance and moment generating function of X are given by

$$E(X) = \frac{b + a}{2}$$

$$Var(X) = \frac{(b - a)^{2}}{12}$$

$$M(t) = \begin{cases} 1 & \text{if } t = 0 \\[4pt] \dfrac{e^{tb} - e^{ta}}{t\,(b - a)} & \text{if } t \ne 0. \end{cases}$$

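Proof:

$$\begin{aligned}
E(X) &= \int_{a}^{b} x\, f(x)\, dx \\
&= \int_{a}^{b} x\, \frac{1}{b - a}\, dx \\
&= \frac{1}{b - a} \left[ \frac{x^{2}}{2} \right]_{a}^{b} \\
&= \frac{1}{2}\,(b + a).
\end{aligned}$$

$$\begin{aligned}
E(X^{2}) &= \int_{a}^{b} x^{2} f(x)\, dx \\
&= \int_{a}^{b} x^{2}\, \frac{1}{b - a}\, dx \\
&= \frac{1}{b - a} \left[ \frac{x^{3}}{3} \right]_{a}^{b} \\
&= \frac{1}{b - a}\, \frac{b^{3} - a^{3}}{3} \\
&= \frac{1}{b - a}\, \frac{(b - a)(b^{2} + ba + a^{2})}{3} \\
&= \frac{1}{3}\,(b^{2} + ba + a^{2}).
\end{aligned}$$

Hence, the variance of X is given by

$$\begin{aligned}
Var(X) &= E(X^{2}) - (E(X))^{2} \\
&= \frac{1}{3}\,(b^{2} + ba + a^{2}) - \frac{(b + a)^{2}}{4} \\
&= \frac{1}{12} \left[ 4b^{2} + 4ba + 4a^{2} - 3a^{2} - 3b^{2} - 6ba \right] \\
&= \frac{1}{12} \left[ b^{2} - 2ba + a^{2} \right] \\
&= \frac{1}{12}\,(b - a)^{2}.
\end{aligned}$$

Next, we compute the moment generating function of X. First, we handle the case $t \ne 0$. Assume $t \ne 0$. Hence

$$\begin{aligned}
M(t) &= E\left(e^{tX}\right) \\
&= \int_{a}^{b} e^{tx}\, \frac{1}{b - a}\, dx \\
&= \frac{1}{b - a} \left[ \frac{e^{tx}}{t} \right]_{a}^{b} \\
&= \frac{e^{tb} - e^{ta}}{t\,(b - a)}.
\end{aligned}$$

If $t = 0$, we know that $M(0) = 1$, hence we get

$$M(t) = \begin{cases} 1 & \text{if } t = 0 \\[4pt] \dfrac{e^{tb} - e^{ta}}{t\,(b - a)} & \text{if } t \ne 0 \end{cases}$$

and this completes the proof.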
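As a numerical cross-check of Theorem 6.1, the mean, variance and moment generating function can be computed by direct integration and compared with the closed forms. This is a sketch under assumptions: the endpoints `a = 2`, `b = 8`, the value `t = 0.3`, and the helper `integrate` (a composite midpoint rule) are illustrative choices, not part of the text.

```python
import math


def integrate(g, lo: float, hi: float, n: int = 20_000) -> float:
    """Composite midpoint rule for the integral of g over [lo, hi]."""
    h = (hi - lo) / n
    return h * sum(g(lo + (i + 0.5) * h) for i in range(n))


a, b, t = 2.0, 8.0, 0.3  # illustrative values only


def pdf(x: float) -> float:
    """Density of UNIF(a, b) on its support."""
    return 1.0 / (b - a)


mean = integrate(lambda x: x * pdf(x), a, b)
second_moment = integrate(lambda x: x * x * pdf(x), a, b)
variance = second_moment - mean ** 2
mgf_numeric = integrate(lambda x: math.exp(t * x) * pdf(x), a, b)

# Closed forms from Theorem 6.1.
mean_closed = (b + a) / 2
variance_closed = (b - a) ** 2 / 12
mgf_closed = (math.exp(t * b) - math.exp(t * a)) / (t * (b - a))

print(mean, mean_closed)
print(variance, variance_closed)
print(mgf_numeric, mgf_closed)
```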

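Example 6.1. Suppose $Y \sim UNIF(0, 1)$ and $Y = \frac{1}{4} X^{2}$. What is the probability density function of X?

Answer: We shall find the probability density function of X through the cumulative distribution function of Y. The cumulative distribution function of X is given by

$$\begin{aligned}
F(x) &= P(X \le x) \\
&= P\left(X^{2} \le x^{2}\right) \\
&= P\left(\tfrac{1}{4} X^{2} \le \tfrac{1}{4} x^{2}\right) \\
&= P\left(Y \le \frac{x^{2}}{4}\right) \\
&= \int_{0}^{x^{2}/4} f(y)\, dy \\
&= \int_{0}^{x^{2}/4} dy \\
&= \frac{x^{2}}{4}.
\end{aligned}$$

Thus

$$f(x) = \frac{d}{dx} F(x) = \frac{x}{2}.$$

Hence the probability density function of X is given by

$$f(x) = \begin{cases} \dfrac{x}{2} & \text{for } 0 \le x \le 2 \\[4pt] 0 & \text{otherwise.} \end{cases}$$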

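Example 6.2. If X has a uniform distribution on the interval from 0 to 10, then what is $P\left(X + \frac{10}{X} \ge 7\right)$?

Answer: Since $X \sim UNIF(0, 10)$, the probability density function of X is $f(x) = \frac{1}{10}$ for $0 \le x \le 10$. Hence

$$\begin{aligned}
P\left(X + \frac{10}{X} \ge 7\right) &= P\left(X^{2} + 10 \ge 7X\right) \\
&= P\left(X^{2} - 7X + 10 \ge 0\right) \\
&= P\left((X - 5)(X - 2) \ge 0\right) \\
&= P(X \le 2 \text{ or } X \ge 5) \\
&= 1 - P(2 \le X \le 5) \\
&= 1 - \int_{2}^{5} f(x)\, dx \\
&= 1 - \int_{2}^{5} \frac{1}{10}\, dx \\
&= 1 - \frac{3}{10} = \frac{7}{10}.
\end{aligned}$$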

Example 6.3. If X is uniform on the interval from 0 to 3, what is the probability that the quadratic equation $4t^{2} + 4tX + X + 2 = 0$ has real solutions?

Answer: Since $X \sim UNIF(0, 3)$, the probability density function of X is

$$f(x) = \begin{cases} \dfrac{1}{3} & 0 \le x \le 3 \\[4pt] 0 & \text{otherwise.} \end{cases}$$

The quadratic equation $4t^{2} + 4tX + X + 2 = 0$ has real solutions if the discriminant of this equation is nonnegative. That is

$$16X^{2} - 16(X + 2) \ge 0,$$

which is

$$X^{2} - X - 2 \ge 0.$$

From this, we get

$$(X - 2)(X + 1) \ge 0.$$

The probability that the quadratic equation $4t^{2} + 4tX + X + 2 = 0$ has real roots is therefore

$$\begin{aligned}
P\left((X - 2)(X + 1) \ge 0\right) &= P(X \le -1 \text{ or } X \ge 2) \\
&= P(X \le -1) + P(X \ge 2) \\
&= \int_{-\infty}^{-1} f(x)\, dx + \int_{2}^{3} f(x)\, dx \\
&= 0 + \int_{2}^{3} \frac{1}{3}\, dx \\
&= \frac{1}{3} = 0.3333.
\end{aligned}$$
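Both Example 6.2 and Example 6.3 reduce to measuring the part of an interval on which an inequality holds. The sketch below approximates that fraction on a fine grid; the helper `uniform_prob` and its grid size are assumptions made here for illustration, and the result is a numerical approximation rather than an exact computation.

```python
def uniform_prob(event, lo: float, hi: float, n: int = 300_000) -> float:
    """Approximate P(event(X)) for X ~ UNIF(lo, hi).

    The interval is split into n equal cells and the event is tested at
    each cell midpoint; the fraction of hits approximates the probability.
    """
    h = (hi - lo) / n
    hits = sum(1 for i in range(n) if event(lo + (i + 0.5) * h))
    return hits / n


# Example 6.2: P(X + 10/X >= 7) for X ~ UNIF(0, 10).
p_62 = uniform_prob(lambda x: x + 10 / x >= 7, 0.0, 10.0)

# Example 6.3: the discriminant 16 X^2 - 16 (X + 2) is nonnegative, X ~ UNIF(0, 3).
p_63 = uniform_prob(lambda x: 16 * x * x - 16 * (x + 2) >= 0, 0.0, 3.0)

print(p_62, p_63)
```

Testing at midpoints keeps the grid away from `x = 0`, so the division in the first event is always defined.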

Theorem 6.2. If X is a continuous random variable with a strictly increasing cumulative distribution function $F(x)$, then the random variable Y, defined by

$$Y = F(X)$$

has the uniform distribution on the interval $[0, 1]$.

Proof: Since F is strictly increasing, the inverse $F^{-1}(x)$ of $F(x)$ exists. We want to show that the probability density function $g(y)$ of Y is $g(y) = 1$ for $0 < y < 1$. First, we find the cumulative distribution function $G(y)$ of Y.

$$\begin{aligned}
G(y) &= P(Y \le y) \\
&= P(F(X) \le y) \\
&= P\left(X \le F^{-1}(y)\right) \\
&= F\left(F^{-1}(y)\right) \\
&= y.
\end{aligned}$$

Hence the probability density function of Y is given by

$$g(y) = \frac{d}{dy} G(y) = \frac{d}{dy}\, y = 1.$$
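Theorem 6.2 can be illustrated by simulation. The sketch below is an assumption-laden illustration, not part of the text: it takes X to be exponential with mean 1, whose cumulative distribution function is $F(x) = 1 - e^{-x}$, draws a sample with Python's `random.expovariate`, and measures how far the empirical distribution of $Y = F(X)$ is from $G(y) = y$. The sample size and seed are arbitrary.

```python
import math
import random

random.seed(0)  # fixed seed so the run is reproducible
n = 100_000

# Draw X ~ exponential(mean 1) and transform by its own CDF F(x) = 1 - e^(-x).
ys = sorted(1.0 - math.exp(-random.expovariate(1.0)) for _ in range(n))

# Largest gap between the empirical CDF of Y (at the sample points) and G(y) = y.
max_gap = max(abs((i + 1) / n - y) for i, y in enumerate(ys))

print(max_gap)
```

A small `max_gap` is consistent with Y being uniform on [0, 1]; it is evidence from one simulated sample, not a proof.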

The following problem can be solved using this theorem but we solve it

without this theorem.

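Example 6.4. If the probability density function of X is

$$f(x) = \frac{e^{-x}}{(1 + e^{-x})^{2}}, \qquad -\infty < x < \infty,$$

then what is the probability density function of $Y = \dfrac{1}{1 + e^{-X}}$?

Answer: The cumulative distribution function of Y is given by

$$\begin{aligned}
G(y) &= P(Y \le y) \\
&= P\left(\frac{1}{1 + e^{-X}} \le y\right) \\
&= P\left(1 + e^{-X} \ge \frac{1}{y}\right) \\
&= P\left(e^{-X} \ge \frac{1 - y}{y}\right) \\
&= P\left(-X \ge \ln \frac{1 - y}{y}\right) \\
&= P\left(X \le -\ln \frac{1 - y}{y}\right) \\
&= \int_{-\infty}^{-\ln \frac{1 - y}{y}} \frac{e^{-x}}{(1 + e^{-x})^{2}}\, dx \\
&= \left[ \frac{1}{1 + e^{-x}} \right]_{-\infty}^{-\ln \frac{1 - y}{y}} \\
&= \frac{1}{1 + \frac{1 - y}{y}} \\
&= y.
\end{aligned}$$

Hence, the probability density function of Y is given by

$$g(y) = \begin{cases} 1 & \text{if } 0 < y < 1 \\ 0 & \text{otherwise.} \end{cases}$$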

Example 6.5. A box is to be constructed so that its height is 10 inches and its base is X inches by X inches. If X has a uniform distribution over the interval (2, 8), then what is the expected volume of the box in cubic inches?

Answer: Since $X \sim UNIF(2, 8)$,

$$f(x) = \frac{1}{8 - 2} = \frac{1}{6} \quad \text{on } (2, 8).$$

The volume V of the box is

$$V = 10\, X^{2}.$$

Hence

$$\begin{aligned}
E(V) &= E\left(10\, X^{2}\right) \\
&= 10\, E\left(X^{2}\right) \\
&= 10 \int_{2}^{8} x^{2}\, \frac{1}{6}\, dx \\
&= \frac{10}{6} \left[ \frac{x^{3}}{3} \right]_{2}^{8} \\
&= \frac{10}{18} \left[ 8^{3} - 2^{3} \right] = (5)(8)(7) = 280 \text{ cubic inches.}
\end{aligned}$$

Example 6.6. Two numbers are chosen independently and at random from the interval (0, 1). What is the probability that the two numbers differ by more than $\frac{1}{2}$?

Answer: Choose x from the x-axis between 0 and 1, and choose y from the y-axis between 0 and 1. The probability that the two numbers differ by more than $\frac{1}{2}$ is equal to the area of the shaded region in the figure, that is, the part of the unit square where $|x - y| > \frac{1}{2}$. This region consists of two corner triangles, each of area $\frac{1}{8}$. Thus

$$P\left(|X - Y| > \frac{1}{2}\right) = \frac{1}{8} + \frac{1}{8} = \frac{1}{4}.$$
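The answers to Examples 6.5 and 6.6 can be reproduced with a few lines of code. This is a sketch with assumed helper names (`integral_of_power` is introduced here for illustration): the expected volume is computed exactly with rational arithmetic, while the geometric probability is approximated by counting cells of a grid laid over the unit square.

```python
from fractions import Fraction


def integral_of_power(p: int, lo: int, hi: int) -> Fraction:
    """Exact integral of x^p over [lo, hi] for a nonnegative integer p."""
    return (Fraction(hi) ** (p + 1) - Fraction(lo) ** (p + 1)) / (p + 1)


# Example 6.5: E(V) = 10 * integral from 2 to 8 of x^2 * (1/6) dx.
expected_volume = 10 * Fraction(1, 6) * integral_of_power(2, 2, 8)

# Example 6.6: fraction of grid cells in the unit square whose midpoints
# satisfy |x - y| > 1/2. With midpoints (i + 0.5)/n and (j + 0.5)/n this
# is the integer condition 2 * |i - j| > n, which avoids rounding issues.
n = 400
hits = sum(1 for i in range(n) for j in range(n) if 2 * abs(i - j) > n)
p_66 = hits / n ** 2

print(expected_volume, p_66)
```

The grid estimate approaches 1/4 as `n` grows; for a finite grid it is slightly below that value because cells cut by the boundary lines are not counted.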

6.2. Gamma Distribution

The gamma distribution involves the notion of the gamma function. First, we develop the notion of the gamma function and study some of its well known properties. The gamma function, $\Gamma(z)$, is a generalization of the notion of factorial. The gamma function is defined as

$$\Gamma(z) := \int_{0}^{\infty} x^{z - 1} e^{-x}\, dx,$$

where z is a positive real number (that is, $z > 0$). The condition $z > 0$ is assumed for the convergence of the integral. Although the integral does not converge for $z < 0$, it can be shown by using an alternative definition of the gamma function that it is defined for all $z \in \mathbb{R} \setminus \{0, -1, -2, -3, \dots\}$.

The integral on the right side of the above expression is called Euler's

second integral, after the Swiss mathematician Leonhard Euler (1707-1783).

The graph of the gamma function is shown below. Observe that the zero and

negative integers correspond to vertical asymptotes of the graph of gamma

function.

Lemma 6.1. $\Gamma(1) = 1$.

Proof:

$$\Gamma(1) = \int_{0}^{\infty} x^{0} e^{-x}\, dx = \left[ -e^{-x} \right]_{0}^{\infty} = 1.$$

Lemma 6.2. The gamma function $\Gamma(z)$ satisfies the functional equation $\Gamma(z) = (z - 1)\, \Gamma(z - 1)$ for all real numbers $z > 1$.

Proof: Let z be a real number such that $z > 1$, and consider

$$\begin{aligned}
\Gamma(z) &= \int_{0}^{\infty} x^{z - 1} e^{-x}\, dx \\
&= \left[ -x^{z - 1} e^{-x} \right]_{0}^{\infty} + \int_{0}^{\infty} (z - 1)\, x^{z - 2} e^{-x}\, dx \\
&= (z - 1) \int_{0}^{\infty} x^{z - 2} e^{-x}\, dx \\
&= (z - 1)\, \Gamma(z - 1).
\end{aligned}$$

Although we have proved this lemma for all real $z > 1$, it actually holds for all real numbers $z \in \mathbb{R} \setminus \{1, 0, -1, -2, -3, \dots\}$.

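Lemma 6.3. $\Gamma\left(\frac{1}{2}\right) = \sqrt{\pi}$.

Proof: We want to show that

$$\Gamma\left(\tfrac{1}{2}\right) = \int_{0}^{\infty} \frac{e^{-x}}{\sqrt{x}}\, dx$$

is equal to $\sqrt{\pi}$. We substitute $y = \sqrt{x}$, hence the above integral becomes

$$\begin{aligned}
\Gamma\left(\tfrac{1}{2}\right) &= \int_{0}^{\infty} \frac{e^{-x}}{\sqrt{x}}\, dx \\
&= 2 \int_{0}^{\infty} e^{-y^{2}}\, dy, \qquad \text{where } y = \sqrt{x}.
\end{aligned}$$

Hence

$$\Gamma\left(\tfrac{1}{2}\right) = 2 \int_{0}^{\infty} e^{-u^{2}}\, du$$

and also

$$\Gamma\left(\tfrac{1}{2}\right) = 2 \int_{0}^{\infty} e^{-v^{2}}\, dv.$$

Multiplying the above two expressions, we get

$$\left( \Gamma\left(\tfrac{1}{2}\right) \right)^{2} = 4 \int_{0}^{\infty} \int_{0}^{\infty} e^{-(u^{2} + v^{2})}\, du\, dv.$$

Now we change the integral into polar form by the transformation $u = r \cos(\theta)$ and $v = r \sin(\theta)$. The Jacobian of the transformation is

$$\begin{aligned}
J(r, \theta) &= \det \begin{pmatrix} \dfrac{\partial u}{\partial r} & \dfrac{\partial u}{\partial \theta} \\[6pt] \dfrac{\partial v}{\partial r} & \dfrac{\partial v}{\partial \theta} \end{pmatrix} \\
&= \det \begin{pmatrix} \cos(\theta) & -r \sin(\theta) \\ \sin(\theta) & r \cos(\theta) \end{pmatrix} \\
&= r \cos^{2}(\theta) + r \sin^{2}(\theta) \\
&= r.
\end{aligned}$$

Hence, we get

$$\begin{aligned}
\left( \Gamma\left(\tfrac{1}{2}\right) \right)^{2} &= 4 \int_{0}^{\pi/2} \int_{0}^{\infty} e^{-r^{2}} J(r, \theta)\, dr\, d\theta \\
&= 4 \int_{0}^{\pi/2} \int_{0}^{\infty} e^{-r^{2}}\, r\, dr\, d\theta \\
&= 2 \int_{0}^{\pi/2} \int_{0}^{\infty} e^{-r^{2}}\, 2r\, dr\, d\theta \\
&= 2 \int_{0}^{\pi/2} \left[ \int_{0}^{\infty} e^{-r^{2}}\, d(r^{2}) \right] d\theta \\
&= 2 \int_{0}^{\pi/2} \Gamma(1)\, d\theta \\
&= \pi.
\end{aligned}$$

Therefore, we get

$$\Gamma\left(\tfrac{1}{2}\right) = \sqrt{\pi}.$$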

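Lemma 6.4. $\Gamma\left(-\frac{1}{2}\right) = -2 \sqrt{\pi}$.

Proof: By Lemma 6.2, we get

$$\Gamma(z) = (z - 1)\, \Gamma(z - 1)$$

for all $z \in \mathbb{R} \setminus \{1, 0, -1, -2, -3, \dots\}$. Letting $z = \frac{1}{2}$, we get

$$\Gamma\left(\tfrac{1}{2}\right) = \left(\tfrac{1}{2} - 1\right) \Gamma\left(\tfrac{1}{2} - 1\right)$$

which is

$$\Gamma\left(-\tfrac{1}{2}\right) = -2\, \Gamma\left(\tfrac{1}{2}\right) = -2 \sqrt{\pi}.$$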

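Example 6.7. Evaluate $\Gamma\left(\frac{5}{2}\right)$.

Answer:

$$\Gamma\left(\tfrac{5}{2}\right) = \frac{3}{2} \cdot \frac{1}{2}\, \Gamma\left(\tfrac{1}{2}\right) = \frac{3}{4} \sqrt{\pi}.$$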

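Example 6.8. Evaluate $\Gamma\left(-\frac{7}{2}\right)$.

Answer: Consider

$$\begin{aligned}
\Gamma\left(-\tfrac{1}{2}\right) &= -\frac{3}{2}\, \Gamma\left(-\tfrac{3}{2}\right) \\
&= \left(-\frac{3}{2}\right) \left(-\frac{5}{2}\right) \Gamma\left(-\tfrac{5}{2}\right) \\
&= \left(-\frac{3}{2}\right) \left(-\frac{5}{2}\right) \left(-\frac{7}{2}\right) \Gamma\left(-\tfrac{7}{2}\right).
\end{aligned}$$

Hence

$$\Gamma\left(-\tfrac{7}{2}\right) = \left(-\frac{2}{3}\right) \left(-\frac{2}{5}\right) \left(-\frac{2}{7}\right) \Gamma\left(-\tfrac{1}{2}\right) = \frac{16}{105} \sqrt{\pi}.$$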

Example 6.9. Evaluate $\Gamma(7.8)$.

Answer:

$$\begin{aligned}
\Gamma(7.8) &= (6.8)(5.8)(4.8)(3.8)(2.8)(1.8)\, \Gamma(1.8) \\
&= (3625.7)\, \Gamma(1.8) \\
&= (3625.7)(0.9314) = 3376.9.
\end{aligned}$$

Here we have used the gamma table to find $\Gamma(1.8)$ to be 0.9314.

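Example 6.10. If n is a natural number, then $\Gamma(n + 1) = n!$.

Answer:

$$\begin{aligned}
\Gamma(n + 1) &= n\, \Gamma(n) \\
&= n\,(n - 1)\, \Gamma(n - 1) \\
&= n\,(n - 1)(n - 2)\, \Gamma(n - 2) \\
&= \cdots \\
&= n\,(n - 1)(n - 2) \cdots (1)\, \Gamma(1) \\
&= n!
\end{aligned}$$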
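The values obtained in Lemmas 6.1 to 6.4 and Examples 6.7 to 6.10 can be compared against a library implementation of the gamma function. The sketch below uses `math.gamma` from the Python standard library, which also accepts negative non-integer arguments; the dictionary name `checks` is just a convenience chosen here.

```python
import math

sqrt_pi = math.sqrt(math.pi)

# Each entry pairs a computed value with the closed form derived in the text.
checks = {
    "Gamma(1) = 1": (math.gamma(1.0), 1.0),
    "Gamma(1/2) = sqrt(pi)": (math.gamma(0.5), sqrt_pi),
    "Gamma(-1/2) = -2 sqrt(pi)": (math.gamma(-0.5), -2 * sqrt_pi),
    "Gamma(5/2) = (3/4) sqrt(pi)": (math.gamma(2.5), 0.75 * sqrt_pi),
    "Gamma(-7/2) = (16/105) sqrt(pi)": (math.gamma(-3.5), 16 / 105 * sqrt_pi),
    "Gamma(6) = 5!": (math.gamma(6.0), float(math.factorial(5))),
}

for label, (computed, expected) in checks.items():
    print(f"{label}: computed={computed:.6f}, expected={expected:.6f}")

# Example 6.9: the table-based answer 3376.9 is accurate to about one decimal.
gamma_7_8 = math.gamma(7.8)
print(gamma_7_8)
```

The functional equation of Lemma 6.2 is what makes the negative half-integer values computable by hand; the library call simply confirms the arithmetic.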

Now we are ready to deﬁne the gamma distribution.

Definition 6.2. A continuous random variable X is said to have a gamma distribution if its probability density function is given by

$$f(x) = \begin{cases} \dfrac{1}{\Gamma(\alpha)\, \theta^{\alpha}}\, x^{\alpha - 1} e^{-x/\theta} & \text{if } 0 < x < \infty \\[6pt] 0 & \text{otherwise,} \end{cases}$$

where $\alpha > 0$ and $\theta > 0$. We denote a random variable with gamma distribution as $X \sim GAM(\theta, \alpha)$. The following diagram shows the graph of the gamma density for various values of the parameters $\theta$ and $\alpha$.

The following theorem gives the expected value, the variance, and the moment generating function of the gamma random variable.

Theorem 6.3. If $X \sim GAM(\theta, \alpha)$, then

$$E(X) = \theta\, \alpha$$

$$Var(X) = \theta^{2} \alpha$$

$$M(t) = \left( \frac{1}{1 - \theta t} \right)^{\alpha}, \qquad \text{if } t < \frac{1}{\theta}.$$

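Proof: First, we derive the moment generating function of X and then we compute the mean and variance from it. The moment generating function is

$$\begin{aligned}
M(t) &= E\left(e^{tX}\right) \\
&= \int_{0}^{\infty} \frac{1}{\Gamma(\alpha)\, \theta^{\alpha}}\, x^{\alpha - 1} e^{-x/\theta}\, e^{tx}\, dx \\
&= \int_{0}^{\infty} \frac{1}{\Gamma(\alpha)\, \theta^{\alpha}}\, x^{\alpha - 1} e^{-\frac{1}{\theta}(1 - \theta t)\, x}\, dx \\
&= \int_{0}^{\infty} \frac{1}{\Gamma(\alpha)\, \theta^{\alpha}}\, \frac{\theta^{\alpha}}{(1 - \theta t)^{\alpha}}\, y^{\alpha - 1} e^{-y}\, dy, \qquad \text{where } y = \tfrac{1}{\theta}(1 - \theta t)\, x \\
&= \frac{1}{(1 - \theta t)^{\alpha}} \int_{0}^{\infty} \frac{1}{\Gamma(\alpha)}\, y^{\alpha - 1} e^{-y}\, dy \\
&= \frac{1}{(1 - \theta t)^{\alpha}}, \qquad \text{since the integrand is } GAM(1, \alpha).
\end{aligned}$$

The first derivative of the moment generating function is

$$\begin{aligned}
M'(t) &= \frac{d}{dt}\, (1 - \theta t)^{-\alpha} \\
&= (-\alpha)\, (1 - \theta t)^{-\alpha - 1}\, (-\theta) \\
&= \alpha\, \theta\, (1 - \theta t)^{-(\alpha + 1)}.
\end{aligned}$$

Hence from above, we find the expected value of X to be

$$E(X) = M'(0) = \alpha\, \theta.$$

Similarly,

$$\begin{aligned}
M''(t) &= \frac{d}{dt} \left( \alpha\, \theta\, (1 - \theta t)^{-(\alpha + 1)} \right) \\
&= \alpha\, \theta\, (\alpha + 1)\, \theta\, (1 - \theta t)^{-(\alpha + 2)} \\
&= \alpha\, (\alpha + 1)\, \theta^{2}\, (1 - \theta t)^{-(\alpha + 2)}.
\end{aligned}$$

Thus, the variance of X is

$$\begin{aligned}
Var(X) &= M''(0) - (M'(0))^{2} \\
&= \alpha\, (\alpha + 1)\, \theta^{2} - \alpha^{2} \theta^{2} \\
&= \alpha\, \theta^{2}
\end{aligned}$$

and the proof of the theorem is now complete.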
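The step from the moment generating function to the mean and variance can be mimicked numerically with finite differences. This is a rough sketch, not the method of the proof: the functions `gamma_mgf` and `moments_from_mgf`, the step size `h`, and the parameter values are all assumptions made for illustration, and finite differences only approximate the derivatives at zero.

```python
def gamma_mgf(t: float, theta: float, alpha: float) -> float:
    """M(t) = (1 - theta * t)^(-alpha), defined for t < 1/theta."""
    if t >= 1.0 / theta:
        raise ValueError("the MGF is only defined for t < 1/theta")
    return (1.0 - theta * t) ** (-alpha)


def moments_from_mgf(mgf, h: float = 1e-4):
    """Approximate (mean, variance) from central differences of an MGF at 0."""
    first = (mgf(h) - mgf(-h)) / (2 * h)                 # approximates M'(0)
    second = (mgf(h) - 2 * mgf(0.0) + mgf(-h)) / h ** 2  # approximates M''(0)
    return first, second - first ** 2


theta, alpha = 2.0, 3.5  # illustrative parameter values
mean, variance = moments_from_mgf(lambda t: gamma_mgf(t, theta, alpha))

print(mean, alpha * theta)           # compare with E(X) = alpha * theta
print(variance, alpha * theta ** 2)  # compare with Var(X) = alpha * theta^2
```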

In ﬁgure below the graphs of moment generating function for various

values of the parameters are illustrated.

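Example 6.11. Let X have the density function

$$f(x) = \begin{cases} \dfrac{1}{\Gamma(\alpha)\, \theta^{\alpha}}\, x^{\alpha - 1} e^{-x/\theta} & \text{if } 0 < x < \infty \\[6pt] 0 & \text{otherwise,} \end{cases}$$

where $\alpha > 0$ and $\theta > 0$. If $\alpha = 4$, what is the mean of $\dfrac{1}{X^{3}}$?

Answer:

$$\begin{aligned}
E\left(X^{-3}\right) &= \int_{0}^{\infty} \frac{1}{x^{3}}\, f(x)\, dx \\
&= \int_{0}^{\infty} \frac{1}{x^{3}}\, \frac{1}{\Gamma(4)\, \theta^{4}}\, x^{3} e^{-x/\theta}\, dx \\
&= \frac{1}{3!\, \theta^{4}} \int_{0}^{\infty} e^{-x/\theta}\, dx \\
&= \frac{1}{3!\, \theta^{3}} \int_{0}^{\infty} \frac{1}{\theta}\, e^{-x/\theta}\, dx \\
&= \frac{1}{3!\, \theta^{3}}, \qquad \text{since the integrand is } GAM(\theta, 1).
\end{aligned}$$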

Definition 6.3. A continuous random variable is said to be an exponential random variable with parameter $\theta$ if its probability density function is of the form

$$f(x) = \begin{cases} \dfrac{1}{\theta}\, e^{-x/\theta} & \text{if } x > 0 \\[6pt] 0 & \text{otherwise,} \end{cases}$$

where $\theta > 0$. If a random variable X has an exponential density function with parameter $\theta$, then we denote it by writing $X \sim EXP(\theta)$.

An exponential distribution is a special case of the gamma distribution. If the parameter $\alpha = 1$, then the gamma distribution reduces to the exponential distribution. Hence most of the information about an exponential distribution can be obtained from the gamma distribution.

Example 6.12. What is the cumulative distribution function of a random variable which has an exponential distribution with variance 25?

Answer: Since an exponential distribution is a special case of the gamma distribution with $\alpha = 1$, from Theorem 6.3, we get $Var(X) = \theta^{2}$. But this is given to be 25. Thus, $\theta^{2} = 25$ or $\theta = 5$. Hence, the cumulative distribution function of X is

$$\begin{aligned}
F(x) &= \int_{0}^{x} f(t)\, dt \\
&= \int_{0}^{x} \frac{1}{5}\, e^{-t/5}\, dt \\
&= \frac{1}{5} \left[ -5\, e^{-t/5} \right]_{0}^{x} \\
&= 1 - e^{-x/5}.
\end{aligned}$$

Example 6.13. If the random variable X has a gamma distribution with parameters $\alpha = 1$ and $\theta = 1$, then what is the probability that X is between its mean and median?

Answer: Since $X \sim GAM(1, 1)$, the probability density function of X is

$$f(x) = \begin{cases} e^{-x} & \text{if } x > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Hence, the median q of X can be calculated from

$$\begin{aligned}
\frac{1}{2} &= \int_{0}^{q} e^{-x}\, dx \\
&= \left[ -e^{-x} \right]_{0}^{q} \\
&= 1 - e^{-q}.
\end{aligned}$$

Hence

$$\frac{1}{2} = 1 - e^{-q}$$

and from this, we get

$$q = \ln 2.$$

The mean of X can be found from Theorem 6.3.

$$E(X) = \alpha\, \theta = 1.$$

Hence the mean of X is 1 and the median of X is $\ln 2$. Thus

$$\begin{aligned}
P(\ln 2 \le X \le 1) &= \int_{\ln 2}^{1} e^{-x}\, dx \\
&= \left[ -e^{-x} \right]_{\ln 2}^{1} \\
&= e^{-\ln 2} - \frac{1}{e} \\
&= \frac{1}{2} - \frac{1}{e} \\
&= \frac{e - 2}{2e}.
\end{aligned}$$
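The exponential calculations in Examples 6.12 and 6.13 are easy to verify with a small helper for the cumulative distribution function $F(x) = 1 - e^{-x/\theta}$. The function name `exp_cdf` is an illustrative choice, not notation from the text.

```python
import math


def exp_cdf(x: float, theta: float) -> float:
    """CDF of EXP(theta): F(x) = 1 - e^(-x/theta) for x > 0, and 0 otherwise."""
    return 1.0 - math.exp(-x / theta) if x > 0 else 0.0


# Example 6.12: variance 25 means theta = 5.
theta_612 = math.sqrt(25.0)
f_612_at_5 = exp_cdf(5.0, theta_612)  # F(5) = 1 - e^(-1)

# Example 6.13: theta = 1, median ln 2, mean 1.
median = math.log(2.0)
mean = 1.0
p_between = exp_cdf(mean, 1.0) - exp_cdf(median, 1.0)

print(f_612_at_5, exp_cdf(median, 1.0), p_between)
```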

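Example 6.14. If the random variable X has a gamma distribution with parameters $\alpha = 1$ and $\theta = 2$, then what is the probability density function of the random variable $Y = e^{X}$?

Answer: First, we calculate the cumulative distribution function $G(y)$ of Y.

$$\begin{aligned}
G(y) &= P(Y \le y) \\
&= P\left(e^{X} \le y\right) \\
&= P(X \le \ln y) \\
&= \int_{0}^{\ln y} \frac{1}{2}\, e^{-x/2}\, dx \\
&= \frac{1}{2} \left[ -2\, e^{-x/2} \right]_{0}^{\ln y} \\
&= 1 - \frac{1}{e^{\frac{1}{2} \ln y}} \\
&= 1 - \frac{1}{\sqrt{y}}.
\end{aligned}$$

Hence, the probability density function of Y is given by

$$g(y) = \frac{d}{dy} G(y) = \frac{d}{dy} \left( 1 - \frac{1}{\sqrt{y}} \right) = \frac{1}{2\, y \sqrt{y}}.$$

Thus, if $X \sim GAM(2, 1)$ (that is, $\theta = 2$ and $\alpha = 1$), then the probability density function of $e^{X}$ is

$$f(x) = \begin{cases} \dfrac{1}{2\, x \sqrt{x}} & \text{if } 1 \le x < \infty \\[6pt] 0 & \text{otherwise.} \end{cases}$$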

Definition 6.4. A continuous random variable X is said to have a chi-square distribution with r degrees of freedom if its probability density function is of the form

$$f(x) = \begin{cases} \dfrac{1}{\Gamma\left(\frac{r}{2}\right) 2^{r/2}}\, x^{\frac{r}{2} - 1} e^{-x/2} & \text{if } 0 < x < \infty \\[6pt] 0 & \text{otherwise,} \end{cases}$$

where $r > 0$. If X has a chi-square distribution, then we denote it by writing $X \sim \chi^{2}(r)$.

The gamma distribution reduces to the chi-square distribution if $\alpha = \frac{r}{2}$ and $\theta = 2$. Thus, the chi-square distribution is a special case of the gamma distribution. Further, if $r \to \infty$, then the chi-square distribution tends to the normal distribution.

The chi-square distribution originated in the works of the British statistician Karl Pearson (1857-1936), but it was originally discovered by the German physicist F. R. Helmert (1843-1917).

Example 6.15. If $X \sim GAM(1, 1)$, then what is the probability density function of the random variable 2X?

Answer: We will use the moment generating method to find the distribution of 2X. The moment generating function of a gamma random variable is given by (see Theorem 6.3)

$$M(t) = (1 - \theta t)^{-\alpha}, \qquad \text{if } t < \frac{1}{\theta}.$$

Since $X \sim GAM(1, 1)$, the moment generating function of X is given by

$$M_X(t) = \frac{1}{1 - t}, \qquad t < 1.$$

Hence, the moment generating function of 2X is

$$\begin{aligned}
M_{2X}(t) &= M_X(2t) \\
&= \frac{1}{1 - 2t} \\
&= \frac{1}{(1 - 2t)^{2/2}} \\
&= \text{MGF of } \chi^{2}(2).
\end{aligned}$$

Hence, if X is an exponential with parameter 1, then 2X is chi-square with 2 degrees of freedom.

Example 6.16. If $X \sim \chi^{2}(5)$, then what is the probability that X is between 1.145 and 12.83?

Answer: The probability of X between 1.145 and 12.83 can be calculated from the following:

$$\begin{aligned}
P(1.145 \le X \le 12.83) &= P(X \le 12.83) - P(X \le 1.145) \\
&= \int_{0}^{12.83} f(x)\, dx - \int_{0}^{1.145} f(x)\, dx \\
&= \int_{0}^{12.83} \frac{1}{\Gamma\left(\frac{5}{2}\right) 2^{5/2}}\, x^{\frac{5}{2} - 1} e^{-x/2}\, dx - \int_{0}^{1.145} \frac{1}{\Gamma\left(\frac{5}{2}\right) 2^{5/2}}\, x^{\frac{5}{2} - 1} e^{-x/2}\, dx \\
&= 0.975 - 0.050 \qquad (\text{from the } \chi^{2} \text{ table}) \\
&= 0.925.
\end{aligned}$$

These integrals are hard to evaluate and so their values are taken from the chi-square table.

Example 6.17. If $X \sim \chi^{2}(7)$, then what are values of the constants a and b such that $P(a < X < b) = 0.95$?

Answer: Since

$$0.95 = P(a < X < b) = P(X < b) - P(X < a),$$

we get

$$P(X < b) = 0.95 + P(X < a).$$

We choose $a = 1.690$, so that

$$P(X < 1.690) = 0.025.$$

From this, we get

$$P(X < b) = 0.95 + 0.025 = 0.975.$$

Thus, from the chi-square table, we get $b = 16.01$.
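The table look-ups in Examples 6.16 and 6.17 can be approximated by integrating the chi-square density numerically. The sketch below uses a composite Simpson rule; the helper names `chi2_pdf` and `simpson` and the number of subintervals are assumptions made here, and the results match the table values only to about three decimals because the table entries are themselves rounded.

```python
import math


def chi2_pdf(x: float, r: float) -> float:
    """Chi-square density with r degrees of freedom, for x > 0."""
    return x ** (r / 2 - 1) * math.exp(-x / 2) / (math.gamma(r / 2) * 2 ** (r / 2))


def simpson(g, lo: float, hi: float, n: int = 2000) -> float:
    """Composite Simpson rule with an even number n of subintervals."""
    h = (hi - lo) / n
    total = g(lo) + g(hi)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * g(lo + i * h)
    return total * h / 3


# Example 6.16: P(1.145 <= X <= 12.83) for X ~ chi-square(5).
p_616 = simpson(lambda x: chi2_pdf(x, 5), 1.145, 12.83)

# Example 6.17: P(1.690 < X < 16.01) for X ~ chi-square(7).
p_617 = simpson(lambda x: chi2_pdf(x, 7), 1.690, 16.01)

print(p_616, p_617)
```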

Definition 6.5. A continuous random variable X is said to have an n-Erlang distribution if its probability density function is of the form

$$f(x) = \begin{cases} \lambda\, e^{-\lambda x}\, \dfrac{(\lambda x)^{n - 1}}{(n - 1)!} & \text{if } 0 < x < \infty \\[6pt] 0 & \text{otherwise,} \end{cases}$$

where $\lambda > 0$ is a parameter.

The gamma distribution reduces to the n-Erlang distribution if $\alpha = n$, where n is a positive integer, and $\theta = \frac{1}{\lambda}$. The gamma distribution can be generalized to include the Weibull distribution. We call this generalized distribution the unified distribution. The form of this distribution is the following:

$$f(x) = \begin{cases} \dfrac{\alpha}{\theta^{\alpha^{\psi}}\, \Gamma\left(\alpha^{\psi} + 1\right)}\, x^{\alpha - 1}\, e^{-\frac{x^{(\alpha - \alpha^{\psi} + 1)}}{\theta}} & \text{if } 0 < x < \infty \\[8pt] 0 & \text{otherwise,} \end{cases}$$

where $\theta > 0$, $\alpha > 0$, and $\psi \in \{0, 1\}$ are parameters.

If $\psi = 0$, the unified distribution reduces to

$$f(x) = \begin{cases} \dfrac{\alpha}{\theta}\, x^{\alpha - 1} e^{-\frac{x^{\alpha}}{\theta}} & \text{if } 0 < x < \infty \\[6pt] 0 & \text{otherwise} \end{cases}$$

which is known as the Weibull distribution. For $\alpha = 1$, the Weibull distribution becomes an exponential distribution. The Weibull distribution provides probabilistic models for life-length data of components or systems. The mean and variance of the Weibull distribution are given by

$$E(X) = \theta^{\frac{1}{\alpha}}\, \Gamma\left(1 + \frac{1}{\alpha}\right),$$

$$Var(X) = \theta^{\frac{2}{\alpha}} \left\{ \Gamma\left(1 + \frac{2}{\alpha}\right) - \left( \Gamma\left(1 + \frac{1}{\alpha}\right) \right)^{2} \right\}.$$

From this Weibull distribution, one can get the Rayleigh distribution by taking $\theta = 2\sigma^{2}$ and $\alpha = 2$. The Rayleigh distribution is given by

$$f(x) = \begin{cases} \dfrac{x}{\sigma^{2}}\, e^{-\frac{x^{2}}{2\sigma^{2}}} & \text{if } 0 < x < \infty \\[6pt] 0 & \text{otherwise.} \end{cases}$$

If $\psi = 1$, the unified distribution reduces to the gamma distribution.
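The Weibull mean and variance, written in terms of the gamma function as $\theta^{1/\alpha}\,\Gamma(1 + 1/\alpha)$ and $\theta^{2/\alpha}\{\Gamma(1 + 2/\alpha) - \Gamma(1 + 1/\alpha)^2\}$, can be checked against the two special cases mentioned in the text. This is a sketch under assumptions: the function names are chosen here, and the comparison values for the Rayleigh case ($\sigma\sqrt{\pi/2}$ for the mean and $\sigma^2(4 - \pi)/2$ for the variance) are standard facts about that distribution rather than statements from this chapter.

```python
import math


def weibull_mean(alpha: float, theta: float) -> float:
    """theta^(1/alpha) * Gamma(1 + 1/alpha)."""
    return theta ** (1 / alpha) * math.gamma(1 + 1 / alpha)


def weibull_var(alpha: float, theta: float) -> float:
    """theta^(2/alpha) * (Gamma(1 + 2/alpha) - Gamma(1 + 1/alpha)^2)."""
    g1 = math.gamma(1 + 1 / alpha)
    g2 = math.gamma(1 + 2 / alpha)
    return theta ** (2 / alpha) * (g2 - g1 ** 2)


# alpha = 1: the Weibull distribution is exponential with mean theta, variance theta^2.
theta = 5.0
exp_mean, exp_var = weibull_mean(1.0, theta), weibull_var(1.0, theta)

# alpha = 2, theta = 2 sigma^2: the Rayleigh distribution with scale sigma.
sigma = 1.5
ray_mean = weibull_mean(2.0, 2 * sigma ** 2)
ray_var = weibull_var(2.0, 2 * sigma ** 2)

print(exp_mean, exp_var, ray_mean, ray_var)
```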

6.3. Beta Distribution

The beta distribution is one of the basic distributions in statistics. It

has many applications in classical as well as Bayesian statistics. It is a ver-

satile distribution and as such it is used in modeling the behavior of random

variables that are positive but bounded in possible values. Proportions and

percentages fall in this category.

The beta distribution involves the notion of the beta function. First we explain the notion of the beta integral and some of its simple properties. Let $\alpha$ and $\beta$ be any two positive real numbers. The beta function $B(\alpha, \beta)$ is defined as

$$B(\alpha, \beta) = \int_{0}^{1} x^{\alpha - 1} (1 - x)^{\beta - 1}\, dx.$$

First, we prove a theorem that establishes the connection between the beta function and the gamma function.

Theorem 6.4. Let $\alpha$ and $\beta$ be any two positive real numbers. Then

$$B(\alpha, \beta) = \frac{\Gamma(\alpha)\, \Gamma(\beta)}{\Gamma(\alpha + \beta)},$$

where

$$\Gamma(z) = \int_{0}^{\infty} x^{z - 1} e^{-x}\, dx$$

is the gamma function.

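Proof: We prove this theorem by computing

$$\begin{aligned}
\Gamma(\alpha)\, \Gamma(\beta) &= \left( \int_{0}^{\infty} x^{\alpha - 1} e^{-x}\, dx \right) \left( \int_{0}^{\infty} y^{\beta - 1} e^{-y}\, dy \right) \\
&= \left( \int_{0}^{\infty} u^{2\alpha - 2} e^{-u^{2}}\, 2u\, du \right) \left( \int_{0}^{\infty} v^{2\beta - 2} e^{-v^{2}}\, 2v\, dv \right) \\
&= 4 \int_{0}^{\infty} \int_{0}^{\infty} u^{2\alpha - 1} v^{2\beta - 1} e^{-(u^{2} + v^{2})}\, du\, dv \\
&= 4 \int_{0}^{\pi/2} \int_{0}^{\infty} r^{2\alpha + 2\beta - 2} (\cos \theta)^{2\alpha - 1} (\sin \theta)^{2\beta - 1} e^{-r^{2}}\, r\, dr\, d\theta \\
&= \left( \int_{0}^{\infty} (r^{2})^{\alpha + \beta - 1} e^{-r^{2}}\, d(r^{2}) \right) \left( 2 \int_{0}^{\pi/2} (\cos \theta)^{2\alpha - 1} (\sin \theta)^{2\beta - 1}\, d\theta \right) \\
&= \Gamma(\alpha + \beta) \left( 2 \int_{0}^{\pi/2} (\cos \theta)^{2\alpha - 1} (\sin \theta)^{2\beta - 1}\, d\theta \right) \\
&= \Gamma(\alpha + \beta) \int_{0}^{1} t^{\alpha - 1} (1 - t)^{\beta - 1}\, dt \\
&= \Gamma(\alpha + \beta)\, B(\alpha, \beta).
\end{aligned}$$

The second line in the above computation is obtained by substituting $x = u^{2}$ and $y = v^{2}$. Similarly, the fourth and seventh lines are obtained by substituting $u = r \cos \theta$, $v = r \sin \theta$, and $t = \cos^{2} \theta$, respectively. This proves the theorem.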

The following two corollaries are consequences of the last theorem.

Corollary 6.1. For every positive $\alpha$ and $\beta$, the beta function is symmetric, that is

$$B(\alpha, \beta) = B(\beta, \alpha).$$

Corollary 6.2. For every positive $\alpha$ and $\beta$, the beta function can be written as

$$B(\alpha, \beta) = 2 \int_{0}^{\pi/2} (\cos \theta)^{2\alpha - 1} (\sin \theta)^{2\beta - 1}\, d\theta.$$

The following corollary is obtained by substituting $s = \dfrac{t}{1 - t}$ in the definition of the beta function.

Corollary 6.3. For every positive $\alpha$ and $\beta$, the beta function can be expressed as

$$B(\alpha, \beta) = \int_{0}^{\infty} \frac{s^{\alpha - 1}}{(1 + s)^{\alpha + \beta}}\, ds.$$

Using Theorem 6.4 and the property of the gamma function, we have the following corollary.

Corollary 6.4. For every positive real number $\beta$ and every positive integer $\alpha$, the beta function reduces to

$$B(\alpha, \beta) = \frac{(\alpha - 1)!}{(\alpha - 1 + \beta)(\alpha - 2 + \beta) \cdots (1 + \beta)\, \beta}.$$

Corollary 6.5. For every pair of positive integers $\alpha$ and $\beta$, the beta function satisfies the following recursive relation

$$B(\alpha, \beta) = \frac{(\alpha - 1)(\beta - 1)}{(\alpha + \beta - 1)(\alpha + \beta - 2)}\, B(\alpha - 1, \beta - 1).$$
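The relation between the beta and gamma functions, and the corollaries that follow from it, can be spot-checked numerically. The sketch below is illustrative only: `beta_gamma` evaluates $\Gamma(\alpha)\Gamma(\beta)/\Gamma(\alpha+\beta)$, `beta_integral` approximates the defining integral by a midpoint rule (reasonable here because both parameters are at least 1), and the parameter values are arbitrary choices.

```python
import math


def beta_gamma(a: float, b: float) -> float:
    """Gamma(a) Gamma(b) / Gamma(a + b)."""
    return math.gamma(a) * math.gamma(b) / math.gamma(a + b)


def beta_integral(a: float, b: float, n: int = 100_000) -> float:
    """Midpoint-rule approximation of the integral of x^(a-1) (1-x)^(b-1) over (0, 1)."""
    h = 1.0 / n
    return h * sum(((i + 0.5) * h) ** (a - 1) * (1 - (i + 0.5) * h) ** (b - 1)
                   for i in range(n))


a, b = 3.0, 2.5  # illustrative values: a is a positive integer, b a positive real

# Integral definition versus the gamma-function expression.
gap_integral = abs(beta_integral(a, b) - beta_gamma(a, b))

# Symmetry in the two arguments.
gap_symmetry = abs(beta_gamma(a, b) - beta_gamma(b, a))

# Integer first argument: (a-1)! / ((a-1+b)(a-2+b)...(1+b) b).
denominator = math.prod(k + b for k in range(int(a)))
gap_product = abs(math.factorial(int(a) - 1) / denominator - beta_gamma(a, b))

# Recursive relation, checked for the integer pair (3, 4).
lhs = beta_gamma(3, 4)
rhs = (3 - 1) * (4 - 1) / ((3 + 4 - 1) * (3 + 4 - 2)) * beta_gamma(2, 3)
gap_recursion = abs(lhs - rhs)

print(gap_integral, gap_symmetry, gap_product, gap_recursion)
```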

Definition 6.6. A random variable X is said to have the beta density function if its probability density function is of the form

$$f(x) = \begin{cases} \dfrac{1}{B(\alpha, \beta)}\, x^{\alpha - 1} (1 - x)^{\beta - 1} & \text{if } 0 < x < 1 \\[6pt] 0 & \text{otherwise} \end{cases}$$

for every positive $\alpha$ and $\beta$. If X has a beta distribution, then we symbolically denote this by writing $X \sim BETA(\alpha, \beta)$.

The following figure illustrates the graph of the beta distribution for various values of $\alpha$ and $\beta$.

The beta distribution reduces to the uniform distribution over (0, 1) if $\alpha = 1 = \beta$. The following theorem gives the mean and variance of the beta distribution.

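Theorem 6.5. If $X \sim BETA(\alpha, \beta)$, then

$$E(X) = \frac{\alpha}{\alpha + \beta}$$

$$Var(X) = \frac{\alpha \beta}{(\alpha + \beta)^{2} (\alpha + \beta + 1)}.$$

Proof: The expected value of X is given by

$$\begin{aligned}
E(X) &= \int_{0}^{1} x\, f(x)\, dx \\
&= \frac{1}{B(\alpha, \beta)} \int_{0}^{1} x^{\alpha} (1 - x)^{\beta - 1}\, dx \\
&= \frac{B(\alpha + 1, \beta)}{B(\alpha, \beta)} \\
&= \frac{\Gamma(\alpha + 1)\, \Gamma(\beta)}{\Gamma(\alpha + \beta + 1)} \cdot \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\, \Gamma(\beta)} \\
&= \frac{\alpha\, \Gamma(\alpha)\, \Gamma(\beta)}{(\alpha + \beta)\, \Gamma(\alpha + \beta)} \cdot \frac{\Gamma(\alpha + \beta)}{\Gamma(\alpha)\, \Gamma(\beta)} \\
&= \frac{\alpha}{\alpha + \beta}.
\end{aligned}$$

Similarly, we can show that

$$E\left(X^{2}\right) = \frac{\alpha\, (\alpha + 1)}{(\alpha + \beta + 1)(\alpha + \beta)}.$$

Therefore

$$Var(X) = E\left(X^{2}\right) - (E(X))^{2} = \frac{\alpha \beta}{(\alpha + \beta)^{2} (\alpha + \beta + 1)}$$

and the proof of the theorem is now complete.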

Example 6.18. The percentage of impurities per batch in a certain chemical product is a random variable X that follows the beta distribution given by

$$f(x) = \begin{cases} 60\, x^{3} (1 - x)^{2} & \text{for } 0 < x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

What is the probability that a randomly selected batch will have more than 25% impurities?

Answer: The probability that a randomly selected batch will have more than 25% impurities is given by

$$\begin{aligned}
P(X \ge 0.25) &= \int_{0.25}^{1} 60\, x^{3} (1 - x)^{2}\, dx \\
&= 60 \int_{0.25}^{1} \left( x^{3} - 2x^{4} + x^{5} \right) dx \\
&= 60 \left[ \frac{x^{4}}{4} - \frac{2x^{5}}{5} + \frac{x^{6}}{6} \right]_{0.25}^{1} \\
&= 60 \cdot \frac{657}{40960} = 0.9624.
\end{aligned}$$

Example 6.19. The proportion of time per day that all checkout counters in a supermarket are busy follows a distribution

$$f(x) = \begin{cases} k\, x^{2} (1 - x)^{9} & \text{for } 0 < x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

What is the value of the constant k so that $f(x)$ is a valid probability density function?

Answer: Using the definition of the beta function, we get that

$$\int_{0}^{1} x^{2} (1 - x)^{9}\, dx = B(3, 10).$$

Hence by Theorem 6.4, we obtain

$$B(3, 10) = \frac{\Gamma(3)\, \Gamma(10)}{\Gamma(13)} = \frac{1}{660}.$$

Hence k should be equal to 660.
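Both beta-distribution examples above involve only polynomials and factorials, so they can be verified exactly with rational arithmetic. The following sketch uses Python's `fractions` module; the helper name `antiderivative` is an illustrative choice.

```python
from fractions import Fraction
from math import factorial


def antiderivative(x) -> Fraction:
    """Antiderivative of x^3 - 2x^4 + x^5, evaluated exactly."""
    x = Fraction(x)
    return x ** 4 / 4 - 2 * x ** 5 / 5 + x ** 6 / 6


# Example 6.18: P(X >= 0.25) = 60 * integral from 1/4 to 1 of x^3 (1 - x)^2 dx.
p_618 = 60 * (antiderivative(1) - antiderivative(Fraction(1, 4)))

# Example 6.19: k = 1 / B(3, 10), with B(3, 10) = Gamma(3) Gamma(10) / Gamma(13)
# and Gamma(n + 1) = n! for natural numbers n.
k = Fraction(factorial(12), factorial(2) * factorial(9))

print(p_618, float(p_618), k)
```

Working with fractions avoids any rounding, so the decimal 0.9624 in Example 6.18 can be seen as a rounded form of an exact rational value.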

The beta distribution can be generalized to any bounded interval $[a, b]$. This generalized distribution is called the generalized beta distribution. If a random variable X has this generalized beta distribution we denote it by writing $X \sim GBETA(\alpha, \beta, a, b)$. The probability density of the generalized beta distribution is given by

$$f(x) = \begin{cases} \dfrac{1}{B(\alpha, \beta)}\, \dfrac{(x - a)^{\alpha - 1} (b - x)^{\beta - 1}}{(b - a)^{\alpha + \beta - 1}} & \text{if } a < x < b \\[8pt] 0 & \text{otherwise} \end{cases}$$

where $\alpha, \beta, a > 0$.

If $X \sim GBETA(\alpha, \beta, a, b)$, then

$$E(X) = (b - a)\, \frac{\alpha}{\alpha + \beta} + a$$

$$Var(X) = (b - a)^{2}\, \frac{\alpha \beta}{(\alpha + \beta)^{2} (\alpha + \beta + 1)}.$$

It can be shown that if $X = (b - a)\, Y + a$ and $Y \sim BETA(\alpha, \beta)$, then $X \sim GBETA(\alpha, \beta, a, b)$. Thus using Theorem 6.5, we get

$$E(X) = E((b - a)\, Y + a) = (b - a)\, E(Y) + a = (b - a)\, \frac{\alpha}{\alpha + \beta} + a$$

and

$$Var(X) = Var((b - a)\, Y + a) = (b - a)^{2}\, Var(Y) = (b - a)^{2}\, \frac{\alpha \beta}{(\alpha + \beta)^{2} (\alpha + \beta + 1)}.$$

6.4. Normal Distribution

Among continuous probability distributions, the normal distribution is very well known since it arises in many applications. The normal distribution was discovered by the French mathematician Abraham DeMoivre (1667-1754). DeMoivre wrote two important books. One is called the Annuities Upon Lives, the first book on actuarial sciences, and the second is called the Doctrine of Chances, one of the early books on probability theory. Pierre-Simon Laplace (1749-1827) applied the normal distribution to astronomy. Carl Friedrich Gauss (1777-1855) used the normal distribution in his studies of problems in physics and astronomy. Adolphe Quetelet (1796-1874) demonstrated that man's physical traits (such as height, chest expansion, weight etc.) as well as social traits follow the normal distribution. The main importance of the normal distribution lies in the central limit theorem, which says that the sample mean has approximately a normal distribution if the sample size is large.

Definition 6.7. A random variable X is said to have a normal distribution if its probability density function is given by

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2}}, \qquad -\infty < x < \infty,$$

where $-\infty < \mu < \infty$ and $0 < \sigma^{2} < \infty$ are arbitrary parameters. If X has a normal distribution with parameters $\mu$ and $\sigma^{2}$, then we write $X \sim N(\mu, \sigma^{2})$.

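Example 6.20. Is the real valued function defined by

$$f(x) = \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2}}, \qquad -\infty < x < \infty$$

a probability density function of some random variable X?

Answer: To answer this question, we must check that f is nonnegative and that it integrates to 1. The nonnegative part is trivial since the exponential function is always positive. Hence, using a property of the gamma function, we show that f integrates to 1 on $\mathbb{R}$.

$$\begin{aligned}
\int_{-\infty}^{\infty} f(x)\, dx &= \int_{-\infty}^{\infty} \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2}}\, dx \\
&= 2 \int_{\mu}^{\infty} \frac{1}{\sigma \sqrt{2\pi}}\, e^{-\frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2}}\, dx \\
&= \frac{2}{\sigma \sqrt{2\pi}} \int_{0}^{\infty} e^{-z}\, \frac{\sigma}{\sqrt{2z}}\, dz, \qquad \text{where } z = \frac{1}{2} \left( \frac{x - \mu}{\sigma} \right)^{2} \\
&= \frac{1}{\sqrt{\pi}} \int_{0}^{\infty} \frac{1}{\sqrt{z}}\, e^{-z}\, dz \\
&= \frac{1}{\sqrt{\pi}}\, \Gamma\left(\tfrac{1}{2}\right) = \frac{1}{\sqrt{\pi}}\, \sqrt{\pi} = 1.
\end{aligned}$$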

The following theorem tells us that the parameter $\mu$ is the mean and the parameter $\sigma^{2}$ is the variance of the normal distribution.

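Theorem 6.6. If $X \sim N(\mu, \sigma^2)$, then

$$E(X) = \mu, \qquad Var(X) = \sigma^2, \qquad M(t) = e^{\mu t + \frac{1}{2}\sigma^2 t^2}.$$

Proof: We prove this theorem by first computing the moment generating function and finding out the mean and variance of X from it.

$$\begin{aligned}
M(t) &= E\left(e^{tX}\right) \\
&= \int_{-\infty}^{\infty} e^{tx} f(x)\,dx \\
&= \int_{-\infty}^{\infty} e^{tx}\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \\
&= \int_{-\infty}^{\infty} e^{tx}\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}\left(x^2 - 2\mu x + \mu^2\right)} dx \\
&= \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}\left(x^2 - 2\mu x + \mu^2 - 2\sigma^2 t x\right)} dx \\
&= \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}\left(x - \mu - \sigma^2 t\right)^2}\, e^{\mu t + \frac{1}{2}\sigma^2 t^2} dx \\
&= e^{\mu t + \frac{1}{2}\sigma^2 t^2} \int_{-\infty}^{\infty} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2\sigma^2}\left(x - \mu - \sigma^2 t\right)^2} dx \\
&= e^{\mu t + \frac{1}{2}\sigma^2 t^2}.
\end{aligned}$$

The last integral integrates to 1 because the integrand is the probability density function of a normal random variable whose mean is $\mu + \sigma^2 t$ and variance $\sigma^2$, that is $N(\mu + \sigma^2 t, \sigma^2)$. Finally, from the moment generating function one determines the mean and variance of the normal distribution. We leave this part to the reader.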
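
The part left to the reader can be checked numerically. The following is a minimal sketch, not part of the original text: it integrates the density of Definition 6.7 with a trapezoidal rule, compares the numerical moment generating function with the closed form of Theorem 6.6, and recovers the mean and variance from finite differences of M(t) at t = 0. The function names, the integration range of 12 standard deviations and the step sizes are illustrative assumptions.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) as in Definition 6.7."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def integrate(g, a, b, n=20000):
    """Composite trapezoidal rule for g on [a, b] with n subintervals."""
    h = (b - a) / n
    total = 0.5 * (g(a) + g(b))
    for i in range(1, n):
        total += g(a + i * h)
    return total * h

def mgf_numeric(t, mu, sigma):
    """Approximate E[exp(tX)] by integrating over mu +/- 12 sigma (assumed wide enough)."""
    lo, hi = mu - 12.0 * sigma, mu + 12.0 * sigma
    return integrate(lambda x: math.exp(t * x) * normal_pdf(x, mu, sigma), lo, hi)

def mgf_closed_form(t, mu, sigma):
    """M(t) = exp(mu*t + sigma^2 * t^2 / 2) from Theorem 6.6."""
    return math.exp(mu * t + 0.5 * sigma ** 2 * t ** 2)

def mean_and_variance_from_mgf(mu, sigma, h=1e-4):
    """Central differences: M'(0) = E(X), M''(0) = E(X^2)."""
    m_plus = mgf_closed_form(h, mu, sigma)
    m_minus = mgf_closed_form(-h, mu, sigma)
    first = (m_plus - m_minus) / (2.0 * h)
    second = (m_plus - 2.0 + m_minus) / (h * h)  # M(0) = 1
    return first, second - first ** 2

if __name__ == "__main__":
    print(mgf_numeric(0.3, 1.5, 2.0), mgf_closed_form(0.3, 1.5, 2.0))
    print(mean_and_variance_from_mgf(1.5, 2.0))
```

Setting t = 0 in `mgf_numeric` also reproduces the statement of Example 6.20 that the density integrates to 1.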

Example 6.21. If X is any random variable with mean $\mu$ and variance $\sigma^2 > 0$, then what are the mean and variance of the random variable $Y = \frac{X-\mu}{\sigma}$?

Answer: The mean of the random variable Y is

$$\begin{aligned}
E(Y) &= E\left(\frac{X-\mu}{\sigma}\right) \\
&= \frac{1}{\sigma}\, E(X - \mu) \\
&= \frac{1}{\sigma}\, \left(E(X) - \mu\right) \\
&= \frac{1}{\sigma}\, (\mu - \mu) \\
&= 0.
\end{aligned}$$

The variance of Y is given by

$$\begin{aligned}
Var(Y) &= Var\left(\frac{X-\mu}{\sigma}\right) \\
&= \frac{1}{\sigma^2}\, Var(X - \mu) \\
&= \frac{1}{\sigma^2}\, Var(X) \\
&= \frac{1}{\sigma^2}\, \sigma^2 \\
&= 1.
\end{aligned}$$

Hence, if we define a new random variable by taking a random variable, subtracting its mean from it, and then dividing the result by its standard deviation, then this new random variable will have zero mean and unit variance.

Definition 6.8. A normal random variable is said to be standard normal, if its mean is zero and variance is one. We denote a standard normal random variable X by $X \sim N(0, 1)$.

The probability density function of standard normal distribution is the following:

$$f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}, \qquad -\infty < x < \infty.$$

Example 6.22. If $X \sim N(0, 1)$, what is the probability of the random variable X less than or equal to $-1.72$?

Answer:

$$\begin{aligned}
P(X \le -1.72) &= 1 - P(X \le 1.72) \\
&= 1 - 0.9573 \qquad \text{(from table)} \\
&= 0.0427.
\end{aligned}$$


Example 6.23. If $Z \sim N(0, 1)$, what is the value of the constant c such that $P(|Z| \le c) = 0.95$?

Answer:

$$\begin{aligned}
0.95 &= P(|Z| \le c) \\
&= P(-c \le Z \le c) \\
&= P(Z \le c) - P(Z \le -c) \\
&= 2\,P(Z \le c) - 1.
\end{aligned}$$

Hence

$$P(Z \le c) = 0.975,$$

and from this using the table we get

$$c = 1.96.$$
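
The table lookups in Examples 6.22 and 6.23 can be reproduced in software. The sketch below is an illustration rather than part of the original text; it uses `statistics.NormalDist` from the Python standard library, and the variable names are assumptions chosen for this example.

```python
from statistics import NormalDist

# Standard normal distribution N(0, 1).
Z = NormalDist(mu=0.0, sigma=1.0)

# Example 6.22: P(X <= -1.72), computed directly from the cdf.
p_622 = Z.cdf(-1.72)

# Example 6.23: P(|Z| <= c) = 0.95 means P(Z <= c) = (1 + 0.95) / 2 = 0.975,
# so c is the 0.975 quantile of the standard normal distribution.
c_623 = Z.inv_cdf((1.0 + 0.95) / 2.0)

if __name__ == "__main__":
    print(round(p_622, 4), round(c_623, 2))
```

The inverse cdf plays the role of reading the normal table backwards, from a probability to the corresponding value of c.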

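The following theorem is very important and allows us to find probabilities by using the standard normal table.

Theorem 6.7. If $X \sim N(\mu, \sigma^2)$, then the random variable $Z = \frac{X-\mu}{\sigma} \sim N(0, 1)$.

Proof: We will show that Z is standard normal by finding the probability density function of Z. We compute the probability density of Z by cumulative distribution function method.

$$\begin{aligned}
F(z) &= P(Z \le z) \\
&= P\left(\frac{X-\mu}{\sigma} \le z\right) \\
&= P(X \le \sigma z + \mu) \\
&= \int_{-\infty}^{\sigma z + \mu} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^2} dx \\
&= \int_{-\infty}^{z} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}w^2}\, dw, \qquad \text{where } w = \frac{x-\mu}{\sigma}.
\end{aligned}$$

Hence

$$f(z) = F'(z) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}.$$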

The following example illustrates how to use standard normal table to find probability for normal random variables.

Example 6.24. If $X \sim N(3, 16)$, then what is $P(4 \le X \le 8)$?

Answer:

$$\begin{aligned}
P(4 \le X \le 8) &= P\left(\frac{4-3}{4} \le \frac{X-3}{4} \le \frac{8-3}{4}\right) \\
&= P\left(\frac{1}{4} \le Z \le \frac{5}{4}\right) \\
&= P(Z \le 1.25) - P(Z \le 0.25) \\
&= 0.8944 - 0.5987 \\
&= 0.2957.
\end{aligned}$$

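Example 6.25. If $X \sim N(25, 36)$, then what is the value of the constant c such that $P(|X - 25| \le c) = 0.9544$?

Answer:

$$\begin{aligned}
0.9544 &= P(|X - 25| \le c) \\
&= P(-c \le X - 25 \le c) \\
&= P\left(-\frac{c}{6} \le \frac{X-25}{6} \le \frac{c}{6}\right) \\
&= P\left(-\frac{c}{6} \le Z \le \frac{c}{6}\right) \\
&= P\left(Z \le \frac{c}{6}\right) - P\left(Z \le -\frac{c}{6}\right) \\
&= 2\,P\left(Z \le \frac{c}{6}\right) - 1.
\end{aligned}$$

Hence

$$P\left(Z \le \frac{c}{6}\right) = 0.9772$$

and from this, using the normal table, we get

$$\frac{c}{6} = 2 \qquad \text{or} \qquad c = 12.$$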
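
Examples 6.24 and 6.25 both rely on the standardization of Theorem 6.7. As a hedged illustration (not from the original text), the sketch below evaluates the same quantities with `statistics.NormalDist`; the variable names are assumptions. Small differences from the worked answers come from the four-decimal rounding of the normal table.

```python
from statistics import NormalDist

# Example 6.24: X ~ N(3, 16), so the standard deviation is 4.
X = NormalDist(mu=3.0, sigma=4.0)
p_624 = X.cdf(8.0) - X.cdf(4.0)

# The same probability after standardizing: Z = (X - 3) / 4.
Z = NormalDist()
p_624_standardized = Z.cdf((8.0 - 3.0) / 4.0) - Z.cdf((4.0 - 3.0) / 4.0)

# Example 6.25: X ~ N(25, 36); P(|X - 25| <= c) = 0.9544 gives
# P(Z <= c / 6) = (1 + 0.9544) / 2 = 0.9772, hence c = 6 * quantile.
c_625 = 6.0 * Z.inv_cdf((1.0 + 0.9544) / 2.0)

if __name__ == "__main__":
    print(round(p_624, 4), round(p_624_standardized, 4), round(c_625, 2))
```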

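The following theorem can be proved similar to Theorem 6.7.

Theorem 6.8. If $X \sim N(\mu, \sigma^2)$, then the random variable $\left(\frac{X-\mu}{\sigma}\right)^2 \sim \chi^2(1)$.

Proof: Let $W = \left(\frac{X-\mu}{\sigma}\right)^2$ and $Z = \frac{X-\mu}{\sigma}$. We will show that the random variable W is chi-square with 1 degree of freedom. This amounts to showing that the probability density function of W to be

$$g(w) = \begin{cases} \dfrac{1}{\sqrt{2\pi w}}\, e^{-\frac{1}{2}w} & \text{if } 0 < w < \infty \\ 0 & \text{otherwise.} \end{cases}$$

We compute the probability density function of W by distribution function method. Let G(w) be the cumulative distribution function W, which is

$$\begin{aligned}
G(w) &= P(W \le w) \\
&= P\left(\left(\frac{X-\mu}{\sigma}\right)^2 \le w\right) \\
&= P\left(-\sqrt{w} \le \frac{X-\mu}{\sigma} \le \sqrt{w}\right) \\
&= P\left(-\sqrt{w} \le Z \le \sqrt{w}\right) \\
&= \int_{-\sqrt{w}}^{\sqrt{w}} f(z)\,dz,
\end{aligned}$$

where f(z) denotes the probability density function of the standard normal random variable Z. Thus, the probability density function of W is given by

$$\begin{aligned}
g(w) &= \frac{d}{dw}\, G(w) \\
&= \frac{d}{dw} \int_{-\sqrt{w}}^{\sqrt{w}} f(z)\,dz \\
&= f\left(\sqrt{w}\right) \frac{d\sqrt{w}}{dw} - f\left(-\sqrt{w}\right) \frac{d\left(-\sqrt{w}\right)}{dw} \\
&= \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}w}\, \frac{1}{2\sqrt{w}} + \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}w}\, \frac{1}{2\sqrt{w}} \\
&= \frac{1}{\sqrt{2\pi w}}\, e^{-\frac{1}{2}w}.
\end{aligned}$$

Thus, we have shown that W is chi-square with one degree of freedom and the proof is now complete.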

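Example 6.26. If $X \sim N(7, 4)$, what is $P\left(15.364 \le (X-7)^2 \le 20.095\right)$?

Answer: Since $X \sim N(7, 4)$, we get $\mu = 7$ and $\sigma = 2$. Thus

$$\begin{aligned}
P\left(15.364 \le (X-7)^2 \le 20.095\right) &= P\left(\frac{15.364}{4} \le \left(\frac{X-7}{2}\right)^2 \le \frac{20.095}{4}\right) \\
&= P\left(3.841 \le Z^2 \le 5.024\right) \\
&= P\left(0 \le Z^2 \le 5.024\right) - P\left(0 \le Z^2 \le 3.841\right) \\
&= 0.975 - 0.949 \\
&= 0.026.
\end{aligned}$$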
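
Theorem 6.8 also gives a direct way to evaluate chi-square probabilities with one degree of freedom: since $W = Z^2$, we have $P(W \le w) = 2\Phi(\sqrt{w}) - 1$. The following sketch (an illustration with assumed function and variable names, not part of the original text) applies this to Example 6.26. Computed this way the probability comes out near 0.025; the worked answer differs in the third decimal because it uses rounded table values.

```python
import math
from statistics import NormalDist

def chi2_1_cdf(w):
    """P(W <= w) for W = Z^2 with Z standard normal, i.e. W ~ chi-square(1)."""
    if w <= 0.0:
        return 0.0
    return 2.0 * NormalDist().cdf(math.sqrt(w)) - 1.0

# Example 6.26: X ~ N(7, 4), so ((X - 7) / 2)^2 is chi-square with 1 degree of freedom.
sigma = 2.0
lower = 15.364 / sigma ** 2   # 3.841
upper = 20.095 / sigma ** 2   # about 5.024
prob_626 = chi2_1_cdf(upper) - chi2_1_cdf(lower)

if __name__ == "__main__":
    print(round(prob_626, 3))
```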


A generalization of the normal distribution is the following:

$$g(x) = \frac{\nu\, \varphi(\nu)}{2\sigma\, \Gamma(1/\nu)}\, e^{-\left[\varphi(\nu)\, \frac{|x-\mu|}{\sigma}\right]^{\nu}}$$

where

$$\varphi(\nu) = \sqrt{\frac{\Gamma(3/\nu)}{\Gamma(1/\nu)}}$$

and $\nu$ and $\sigma$ are real positive constants and $-\infty < \mu < \infty$ is a real constant. The constant $\mu$ represents the mean and the constant $\sigma$ represents the standard deviation of the generalized normal distribution. If $\nu = 2$, then generalized normal distribution reduces to the normal distribution. If $\nu = 1$, then the generalized normal distribution reduces to the Laplace distribution whose density function is given by

$$f(x) = \frac{1}{2\theta}\, e^{-\frac{|x-\mu|}{\theta}}$$

where $\theta = \frac{\sigma}{\sqrt{2}}$. The generalized normal distribution is very useful in signal processing and in particular modeling of the discrete cosine transform (DCT) coefficients of a digital image.

6.5. Lognormal Distribution

The study of the lognormal distribution was initiated by Galton and McAlister in 1879. They came across this distribution while studying the use of the geometric mean as an estimate of location. Later, Kapteyn (1903) discussed the genesis of this distribution. This distribution can be defined as the distribution of a random variable whose logarithm is normally distributed. Often the size distribution of organisms, the distribution of species, the distribution of the number of persons in a census occupation class, the distribution of stars in the universe, and the distribution of the size of incomes are modeled by lognormal distributions. The lognormal distribution is used in biology, astronomy, economics, pharmacology and engineering. This distribution is sometimes known as the Galton-McAlister distribution. In economics, the lognormal distribution is called the Cobb-Douglas distribution.

Definition 6.9. A random variable X is said to have a lognormal distribution if its probability density function is given by

$$f(x) = \begin{cases} \dfrac{1}{x\,\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{\ln(x)-\mu}{\sigma}\right)^2}, & \text{if } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $-\infty < \mu < \infty$ and $0 < \sigma^2 < \infty$ are arbitrary parameters.

If X has a lognormal distribution with parameters $\mu$ and $\sigma^2$, then we write $X \sim \Lambda(\mu, \sigma^2)$.

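Example 6.27. If $X \sim \Lambda(\mu, \sigma^2)$, what is the 100pth percentile of X?

Answer: Let q be the 100pth percentile of X. Then by definition of percentile, we get

$$p = \int_{0}^{q} \frac{1}{x\,\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{\ln(x)-\mu}{\sigma}\right)^2} dx.$$

Substituting $z = \frac{\ln(x)-\mu}{\sigma}$ in the above integral, we have

$$\begin{aligned}
p &= \int_{-\infty}^{\frac{\ln(q)-\mu}{\sigma}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\, dz \\
&= \int_{-\infty}^{z_p} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}\, dz,
\end{aligned}$$

where $z_p = \frac{\ln(q)-\mu}{\sigma}$ is the 100pth percentile of the standard normal random variable. Hence the 100pth percentile of X is

$$q = e^{\sigma z_p + \mu},$$

where $z_p$ is the 100pth percentile of the standard normal random variable Z.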

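Theorem 6.9. If $X \sim \Lambda(\mu, \sigma^2)$, then

$$E(X) = e^{\mu + \frac{1}{2}\sigma^2}, \qquad Var(X) = \left[e^{\sigma^2} - 1\right] e^{2\mu + \sigma^2}.$$

Proof: Let t be a positive integer. We compute the tth moment of X.

$$\begin{aligned}
E\left(X^t\right) &= \int_{0}^{\infty} x^t f(x)\,dx \\
&= \int_{0}^{\infty} x^t\, \frac{1}{x\,\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{\ln(x)-\mu}{\sigma}\right)^2} dx.
\end{aligned}$$

Substituting $z = \ln(x)$ in the last integral, we get

$$E\left(X^t\right) = \int_{-\infty}^{\infty} e^{tz}\, \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{z-\mu}{\sigma}\right)^2} dz = M_Z(t),$$

where $M_Z(t)$ denotes the moment generating function of the random variable $Z \sim N(\mu, \sigma^2)$. Therefore,

$$M_Z(t) = e^{\mu t + \frac{1}{2}\sigma^2 t^2}.$$

Thus letting t = 1, we get

$$E(X) = e^{\mu + \frac{1}{2}\sigma^2}.$$

Similarly, taking t = 2, we have

$$E\left(X^2\right) = e^{2\mu + 2\sigma^2}.$$

Thus, we have

$$Var(X) = E\left(X^2\right) - E(X)^2 = \left[e^{\sigma^2} - 1\right] e^{2\mu + \sigma^2}$$

and now the proof of the theorem is complete.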

Example 6.28. If $X \sim \Lambda(0, 4)$, then what is the probability that X is between 1 and 12.1825?

Answer: Since $X \sim \Lambda(0, 4)$, the random variable $Y = \ln(X) \sim N(0, 4)$. Hence

$$\begin{aligned}
P(1 \le X \le 12.1825) &= P(\ln(1) \le \ln(X) \le \ln(12.1825)) \\
&= P(0 \le Y \le 2.50) \\
&= P(0 \le Z \le 1.25) \\
&= P(Z \le 1.25) - P(Z \le 0) \\
&= 0.8944 - 0.5000 \\
&= 0.3944.
\end{aligned}$$


Example 6.29. If the amount of time needed to solve a problem by a group of students follows the lognormal distribution with parameters $\mu$ and $\sigma^2$, then what is the value of $\mu$ so that the probability of solving a problem in 10 minutes or less by any randomly picked student is 95% when $\sigma^2 = 4$?

Answer: Let the random variable X denote the amount of time needed to solve a problem. Then $X \sim \Lambda(\mu, 4)$. We want to find $\mu$ so that $P(X \le 10) = 0.95$. Hence

$$\begin{aligned}
0.95 &= P(X \le 10) \\
&= P(\ln(X) \le \ln(10)) \\
&= P(\ln(X) - \mu \le \ln(10) - \mu) \\
&= P\left(\frac{\ln(X) - \mu}{2} \le \frac{\ln(10) - \mu}{2}\right) \\
&= P\left(Z \le \frac{\ln(10) - \mu}{2}\right),
\end{aligned}$$

where $Z \sim N(0, 1)$. Using the table for standard normal distribution, we get

$$\frac{\ln(10) - \mu}{2} = 1.65.$$

Hence

$$\mu = \ln(10) - 2(1.65) = 2.3025 - 3.300 = -0.9975.$$
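
Lognormal probabilities such as those in Examples 6.28 and 6.29 reduce to normal probabilities for ln(X). The sketch below is an illustration under that reduction, not part of the original text; the helper `lognormal_cdf` and the variable names are assumptions. The computed value of the parameter in Example 6.29 differs slightly from the worked answer because the table value 1.65 is a rounded 95th percentile.

```python
import math
from statistics import NormalDist

def lognormal_cdf(x, mu, sigma):
    """P(X <= x) when ln(X) ~ N(mu, sigma^2)."""
    if x <= 0.0:
        return 0.0
    return NormalDist(mu, sigma).cdf(math.log(x))

# Example 6.28: parameters mu = 0 and sigma^2 = 4, so sigma = 2.
p_628 = lognormal_cdf(12.1825, 0.0, 2.0) - lognormal_cdf(1.0, 0.0, 2.0)

# Example 6.29: solve P(X <= 10) = 0.95 with sigma = 2 for mu,
# that is (ln(10) - mu) / 2 = z_0.95.
z_95 = NormalDist().inv_cdf(0.95)
mu_629 = math.log(10.0) - 2.0 * z_95

if __name__ == "__main__":
    print(round(p_628, 4), round(mu_629, 4))
```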

6.6. Inverse Gaussian Distribution

If a sufficiently small macroscopic particle is suspended in a fluid that is in thermal equilibrium, the particle will move about erratically in response to natural collisional bombardments by the individual molecules of the fluid. This erratic motion is called "Brownian motion" after the botanist Robert Brown (1773-1858) who first observed this erratic motion in 1828. Independently, Einstein (1905) and Smoluchowski (1906) gave the mathematical description of Brownian motion. The distribution of the first passage time in Brownian motion is the inverse Gaussian distribution. This distribution was systematically studied by Tweedie in 1945. The interpurchase times of toothpaste of a family, the duration of labor strikes in a geographical region, word frequency in a language, conversion time for convertible bonds, length of employee service, and crop field size follow the inverse Gaussian distribution. The inverse Gaussian distribution is very useful for the analysis of certain skewed data.


Definition 6.10. A random variable X is said to have an inverse Gaussian distribution if its probability density function is given by

$$f(x) = \begin{cases} \sqrt{\dfrac{\lambda}{2\pi}}\; x^{-\frac{3}{2}}\, e^{-\frac{\lambda (x-\mu)^2}{2\mu^2 x}}, & \text{if } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < \mu < \infty$ and $0 < \lambda < \infty$ are arbitrary parameters.

If X has an inverse Gaussian distribution with parameters $\mu$ and $\lambda$, then we write $X \sim IG(\mu, \lambda)$.

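The characteristic function $\phi(t)$ of $X \sim IG(\mu, \lambda)$ is

$$\phi(t) = E\left(e^{itX}\right) = e^{\frac{\lambda}{\mu}\left[1 - \sqrt{1 - \frac{2i\mu^2 t}{\lambda}}\right]}.$$

Using this, we have the following theorem.

Theorem 6.10. If $X \sim IG(\mu, \lambda)$, then

$$E(X) = \mu, \qquad Var(X) = \frac{\mu^3}{\lambda}.$$

Proof: Since $\phi(t) = E\left(e^{itX}\right)$, the derivative $\phi'(t) = i\,E\left(X e^{itX}\right)$. Therefore $\phi'(0) = i\,E(X)$. We know the characteristic function $\phi(t)$ of $X \sim IG(\mu, \lambda)$ is

$$\phi(t) = e^{\frac{\lambda}{\mu}\left[1 - \sqrt{1 - \frac{2i\mu^2 t}{\lambda}}\right]}.$$

Differentiating $\phi(t)$ with respect to t, we have

$$\begin{aligned}
\phi'(t) &= \frac{d}{dt}\left[e^{\frac{\lambda}{\mu}\left[1 - \sqrt{1 - \frac{2i\mu^2 t}{\lambda}}\right]}\right] \\
&= e^{\frac{\lambda}{\mu}\left[1 - \sqrt{1 - \frac{2i\mu^2 t}{\lambda}}\right]}\; \frac{d}{dt}\left(\frac{\lambda}{\mu}\left[1 - \sqrt{1 - \frac{2i\mu^2 t}{\lambda}}\right]\right) \\
&= i\mu\; e^{\frac{\lambda}{\mu}\left[1 - \sqrt{1 - \frac{2i\mu^2 t}{\lambda}}\right]} \left[1 - \frac{2i\mu^2 t}{\lambda}\right]^{-\frac{1}{2}}.
\end{aligned}$$

Hence $\phi'(0) = i\mu$. Therefore, $E(X) = \mu$. Similarly, one can show that

$$Var(X) = \frac{\mu^3}{\lambda}.$$

This completes the proof of the theorem.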

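The distribution function F(x) of the inverse Gaussian random variable X with parameters $\mu$ and $\lambda$ was computed by Shuster (1968) as

$$F(x) = \Phi\left(\sqrt{\frac{\lambda}{x}}\left[\frac{x}{\mu} - 1\right]\right) + e^{\frac{2\lambda}{\mu}}\, \Phi\left(-\sqrt{\frac{\lambda}{x}}\left[\frac{x}{\mu} + 1\right]\right),$$

where $\Phi$ is the distribution function of the standard normal distribution.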

6.7. Logistic Distribution

The logistic distribution is often considered as an alternative to the univariate normal distribution. The logistic distribution has a shape very close to that of a normal distribution but has heavier tails than the normal. The logistic distribution is used in modeling demographic data. It is also used as an alternative to the Weibull distribution in life-testing.

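Definition 6.11. A random variable X is said to have a logistic distribution if its probability density function is given by

$$f(x) = \frac{\pi}{\sigma\sqrt{3}}\; \frac{e^{-\frac{\pi}{\sqrt{3}}\left(\frac{x-\mu}{\sigma}\right)}}{\left[1 + e^{-\frac{\pi}{\sqrt{3}}\left(\frac{x-\mu}{\sigma}\right)}\right]^2}, \qquad -\infty < x < \infty,$$

where $-\infty < \mu < \infty$ and $\sigma > 0$ are parameters.

If X has a logistic distribution with parameters $\mu$ and $\sigma$, then we write $X \sim LOG(\mu, \sigma)$.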

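Theorem 6.11. If $X \sim LOG(\mu, \sigma)$, then

$$E(X) = \mu, \qquad Var(X) = \sigma^2, \qquad M(t) = e^{\mu t}\, \Gamma\left(1 + \frac{\sqrt{3}}{\pi}\sigma t\right) \Gamma\left(1 - \frac{\sqrt{3}}{\pi}\sigma t\right), \quad |t| < \frac{\pi}{\sigma\sqrt{3}}.$$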

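Proof: First, we derive the moment generating function of X and then we compute the mean and variance of it. The moment generating function is

$$\begin{aligned}
M(t) &= \int_{-\infty}^{\infty} e^{tx} f(x)\,dx \\
&= \int_{-\infty}^{\infty} e^{tx}\, \frac{\pi}{\sigma\sqrt{3}}\; \frac{e^{-\frac{\pi}{\sqrt{3}}\left(\frac{x-\mu}{\sigma}\right)}}{\left[1 + e^{-\frac{\pi}{\sqrt{3}}\left(\frac{x-\mu}{\sigma}\right)}\right]^2}\, dx \\
&= e^{\mu t} \int_{-\infty}^{\infty} e^{sw}\, \frac{e^{-w}}{\left(1 + e^{-w}\right)^2}\, dw, \qquad \text{where } w = \frac{\pi (x-\mu)}{\sqrt{3}\,\sigma} \text{ and } s = \frac{\sqrt{3}\,\sigma}{\pi}\, t \\
&= e^{\mu t} \int_{-\infty}^{\infty} \left(e^{-w}\right)^{-s} \frac{e^{-w}}{\left(1 + e^{-w}\right)^2}\, dw \\
&= e^{\mu t} \int_{0}^{1} \left(z^{-1} - 1\right)^{-s} dz, \qquad \text{where } z = \frac{1}{1 + e^{-w}} \\
&= e^{\mu t} \int_{0}^{1} z^{s} (1-z)^{-s}\, dz \\
&= e^{\mu t}\, B(1+s,\, 1-s) \\
&= e^{\mu t}\, \frac{\Gamma(1+s)\, \Gamma(1-s)}{\Gamma(1+s+1-s)} \\
&= e^{\mu t}\, \frac{\Gamma(1+s)\, \Gamma(1-s)}{\Gamma(2)} \\
&= e^{\mu t}\, \Gamma(1+s)\, \Gamma(1-s) \\
&= e^{\mu t}\, \Gamma\left(1 + \frac{\sqrt{3}}{\pi}\sigma t\right) \Gamma\left(1 - \frac{\sqrt{3}}{\pi}\sigma t\right) \\
&= e^{\mu t}\, \left(\sqrt{3}\,\sigma t\right) \operatorname{cosec}\left(\sqrt{3}\,\sigma t\right).
\end{aligned}$$

The last step uses the identity $\Gamma(1+s)\,\Gamma(1-s) = \pi s \operatorname{cosec}(\pi s)$ with $\pi s = \sqrt{3}\,\sigma t$. We leave the rest of the proof to the reader.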
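
The gamma-function form of the logistic moment generating function can be sanity-checked numerically. The sketch below is an illustration, not part of the original text. It compares $\Gamma(1+s)\Gamma(1-s)$ with $\pi s / \sin(\pi s)$, which is the standard reflection identity for the gamma function, and estimates the variance from a second difference of M(t) at t = 0. The function names and the step size h are assumptions.

```python
import math

def gamma_product(s):
    """Gamma(1 + s) * Gamma(1 - s), defined here for |s| < 1."""
    return math.gamma(1.0 + s) * math.gamma(1.0 - s)

def reflection_form(s):
    """pi*s / sin(pi*s), the closed form of the product above for 0 < |s| < 1."""
    return math.pi * s / math.sin(math.pi * s)

def logistic_mgf(t, mu, sigma):
    """M(t) of LOG(mu, sigma) in the gamma-function form of Theorem 6.11."""
    s = math.sqrt(3.0) * sigma * t / math.pi
    if abs(s) >= 1.0:
        raise ValueError("M(t) is only defined for |t| < pi / (sigma * sqrt(3))")
    return math.exp(mu * t) * gamma_product(s)

def variance_from_mgf(mu, sigma, h=1e-3):
    """Var(X) = M''(0) - M'(0)^2, estimated with central differences."""
    m_plus, m_minus = logistic_mgf(h, mu, sigma), logistic_mgf(-h, mu, sigma)
    first = (m_plus - m_minus) / (2.0 * h)
    second = (m_plus - 2.0 * logistic_mgf(0.0, mu, sigma) + m_minus) / (h * h)
    return second - first ** 2

if __name__ == "__main__":
    print(gamma_product(0.25), reflection_form(0.25))
    print(variance_from_mgf(0.0, 2.0))
```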

6.8. Review Exercises

1. If $Y \sim UNIF(0, 1)$, then what is the probability density function of $X = -\ln Y$?

2. Let the probability density function of X be

$$f(x) = \begin{cases} e^{-x} & \text{if } x > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Let $Y = 1 - e^{-X}$. Find the distribution of Y.

3. After a certain time the weight W of crystals formed is given approximately by $W = e^X$ where $X \sim N(\mu, \sigma^2)$. What is the probability density function of W for $0 < w < \infty$?

4. What is the probability that a normal random variable with mean 6 and

standard deviation 3 will fall between 5.7 and 7.5 ?

5. Let X have a distribution with the 75th percentile equal to $\frac{1}{3}$ and probability density function equal to

$$f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{for } 0 < x < \infty \\ 0 & \text{otherwise.} \end{cases}$$

What is the value of the parameter $\lambda$?

6. If a normal distribution with mean $\mu$ and variance $\sigma^2 > 0$ has 46th percentile equal to $20\sigma$, then what is $\mu$ in terms of the standard deviation?

7. Let X be a random variable with cumulative distribution function

$$F(x) = \begin{cases} 0 & \text{if } x \le 0 \\ 1 - e^{-x} & \text{if } x > 0. \end{cases}$$

What is $P\left(0 \le e^X \le 4\right)$?

8. Let X have the density function

$$f(x) = \begin{cases} \dfrac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\, x^{\alpha-1} (1-x)^{\beta-1} & \text{for } 0 < x < 1 \\ 0 & \text{otherwise,} \end{cases}$$

where $\alpha > 0$ and $\beta > 0$. If $\beta = 6$ and $\alpha = 5$, what is the mean of the random variable $(1-X)^{-1}$?

9. R.A. Fisher proved that when $n \ge 30$ and Y has a chi-square distribution with n degrees of freedom, then $\sqrt{2Y} - \sqrt{2n-1}$ has an approximate standard normal distribution. Under this approximation, what is the 90th percentile of Y when n = 41?

10. Let Y have a chi-square distribution with 32 degrees of freedom so that its variance is 64. If $P(Y > c) = 0.0668$, then what is the approximate value of the constant c?

11. If in a certain normal distribution of X, the probability is 0.5 that X is less than 500 and 0.0227 that X is greater than 650. What is the standard deviation of X?

12. If $X \sim N(5, 4)$, then what is the probability that $8 < Y < 13$ where $Y = 2X + 1$?

13. Given the probability density function of a random variable X as

$$f(x) = \begin{cases} \theta e^{-\theta x} & \text{if } x > 0 \\ 0 & \text{otherwise,} \end{cases}$$

what is the nth moment of X about the origin?

14. If the random variable X is normal with mean 1 and standard deviation 2, then what is $P\left(X^2 - 2X \le 8\right)$?

15. Suppose X has a standard normal distribution and $Y = e^X$. What is the kth moment of Y?

16. If the random variable X has uniform distribution on the interval [0, a], what is the probability that the random variable is greater than its square, that is $P\left(X > X^2\right)$?

17. If the random variable Y has a chi-square distribution with 54 degrees of freedom, then what is the approximate 84th percentile of Y?

18. Let X be a continuous random variable with density function

$$f(x) = \begin{cases} \dfrac{2}{x^2} & \text{for } 1 < x < 2 \\ 0 & \text{elsewhere.} \end{cases}$$

If $Y = \sqrt{X}$, what is the density function for Y where nonzero?

19. If X is normal with mean 0 and variance 4, then what is the probability of the event $X - \frac{4}{X} \ge 0$, that is $P\left(X - \frac{4}{X} \ge 0\right)$?

20. If the waiting time at Rally's drive-in-window is normally distributed with mean 13 minutes and standard deviation 2 minutes, then what percentage of customers wait longer than 10 minutes but less than 15 minutes?

21. If X is uniform on the interval from $-5$ to 5, what is the probability that the quadratic equation $100t^2 + 20tX + 2X + 3 = 0$ has complex solutions?

22. If the random variable $X \sim Exp(\theta)$, then what is the probability density function of the random variable $Y = X\sqrt{X}$?

23. If the random variable $X \sim N(0, 1)$, then what is the probability density function of the random variable $Y = \sqrt{|X|}$?

24. If the random variable $X \sim \Lambda(\mu, \sigma^2)$, then what is the probability density function of the random variable ln(X)?

25. If the random variable $X \sim \Lambda(\mu, \sigma^2)$, then what is the mode of X?

26. If the random variable $X \sim \Lambda(\mu, \sigma^2)$, then what is the median of X?

27. If the random variable $X \sim \Lambda(\mu, \sigma^2)$, then what is the probability that the quadratic equation $4t^2 + 4tX + X + 2 = 0$ has real solutions?

28. Consider the Karl Pearson's differential equation $p(x)\,\frac{dy}{dx} + q(x)\,y = 0$ where $p(x) = a + bx + cx^2$ and $q(x) = x - d$. Show that if $a = c = 0$, $b > 0$, $d > -b$, then y(x) is gamma; and if $a = 0$, $b = -c$, $\frac{d-1}{b} < 1$, $\frac{d}{b} > -1$, then y(x) is beta.

29. Let a, b, $\alpha$, $\beta$ be any four real numbers with a < b and $\alpha$, $\beta$ positive. If $X \sim BETA(\alpha, \beta)$, then what is the probability density function of the random variable $Y = (b-a)X + a$?

30. A nonnegative continuous random variable X is said to be memoryless if $P(X > s + t \,/\, X > t) = P(X > s)$ for all $s, t \ge 0$. Show that the exponential random variable is memoryless.

31. Show that every nonnegative continuous memoryless random variable is an exponential random variable.

32. Using gamma function evaluate the following integrals:

(i) $\int_0^\infty e^{-x^2}\,dx$; (ii) $\int_0^\infty x\,e^{-x^2}\,dx$; (iii) $\int_0^\infty x^2 e^{-x^2}\,dx$; (iv) $\int_0^\infty x^3 e^{-x^2}\,dx$.

33. Using beta function evaluate the following integrals:

(i) $\int_0^1 x^2 (1-x)^2\,dx$; (ii) $\int_0^{100} x^5 (100-x)^7\,dx$; (iii) $\int_0^1 x^{11} (1-x^3)^7\,dx$.

34. If $\Gamma(z)$ denotes the gamma function, then prove that

$$\Gamma(1+t)\,\Gamma(1-t) = \pi t \operatorname{cosec}(\pi t).$$

35. Let $\alpha$ and $\beta$ be given positive real numbers, with $\alpha < \beta$. If two points are selected at random from a straight line segment of length $\beta$, what is the probability that the distance between them is at least $\alpha$?

36. If the random variable $X \sim GAM(\theta, \alpha)$, then what is the nth moment of X about the origin?



Chapter 7

TWO RANDOM VARIABLES

There are many random experiments that involve more than one random

variable. For example, an educator may study the joint behavior of grades

and time devoted to study; a physician may study the joint behavior of blood

pressure and weight. Similarly an economist may study the joint behavior of

business volume and proﬁt. In fact, most real problems we come across will

have more than one underlying random variable of interest.

7.1. Bivariate Discrete Random Variables

In this section, we develop all the necessary terminologies for studying

bivariate discrete random variables.

Deﬁnition 7.1. A discrete bivariate random variable (X, Y ) is an ordered

pair of discrete random variables.

Definition 7.2. Let (X, Y) be a bivariate random variable and let $R_X$ and $R_Y$ be the range spaces of X and Y, respectively. A real-valued function $f : R_X \times R_Y \to \mathbb{R}$ is called a joint probability density function for X and Y if and only if

$$f(x, y) = P(X = x,\, Y = y)$$

for all $(x, y) \in R_X \times R_Y$. Here, the event (X = x, Y = y) means the intersection of the events (X = x) and (Y = y), that is

$$(X = x) \cap (Y = y).$$

Example 7.1. Roll a pair of unbiased dice. If X denotes the smaller and Y denotes the larger outcome on the dice, then what is the joint probability density function of X and Y?

Answer: The sample space S of rolling two dice consists of

{(1, 1) (1, 2) (1, 3) (1, 4) (1, 5) (1, 6)
 (2, 1) (2, 2) (2, 3) (2, 4) (2, 5) (2, 6)
 (3, 1) (3, 2) (3, 3) (3, 4) (3, 5) (3, 6)
 (4, 1) (4, 2) (4, 3) (4, 4) (4, 5) (4, 6)
 (5, 1) (5, 2) (5, 3) (5, 4) (5, 5) (5, 6)
 (6, 1) (6, 2) (6, 3) (6, 4) (6, 5) (6, 6)}

The probability density function f(x, y) can be computed for X = 2 and Y = 3 as follows: There are two outcomes namely (2, 3) and (3, 2) in the sample S of 36 outcomes which contribute to the joint event (X = 2, Y = 3). Hence

$$f(2, 3) = P(X = 2,\, Y = 3) = \frac{2}{36}.$$

Similarly, we can compute the rest of the probabilities. The following table shows these probabilities:

 y = 6 |  2/36  2/36  2/36  2/36  2/36  1/36
 y = 5 |  2/36  2/36  2/36  2/36  1/36  0
 y = 4 |  2/36  2/36  2/36  1/36  0     0
 y = 3 |  2/36  2/36  1/36  0     0     0
 y = 2 |  2/36  1/36  0     0     0     0
 y = 1 |  1/36  0     0     0     0     0
       |  x=1   x=2   x=3   x=4   x=5   x=6

These tabulated values can be written as

$$f(x, y) = \begin{cases} \dfrac{1}{36} & \text{if } 1 \le x = y \le 6 \\[4pt] \dfrac{2}{36} & \text{if } 1 \le x < y \le 6 \\[4pt] 0 & \text{otherwise.} \end{cases}$$
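
The joint probabilities of Example 7.1 can also be obtained by brute-force enumeration of the 36 equally likely outcomes. The sketch below is an illustration (the function name is an assumption, not from the original text) that uses exact fractions so that the entries can be compared with the table directly.

```python
from fractions import Fraction
from itertools import product

def joint_pmf_min_max():
    """Joint pmf of X = smaller and Y = larger outcome of two fair dice."""
    pmf = {}
    for a, b in product(range(1, 7), repeat=2):
        key = (min(a, b), max(a, b))          # (x, y) with x <= y
        pmf[key] = pmf.get(key, Fraction(0)) + Fraction(1, 36)
    return pmf

if __name__ == "__main__":
    pmf = joint_pmf_min_max()
    for y in range(6, 0, -1):
        # Pairs with x > y never occur, so they default to probability 0.
        print(y, [str(pmf.get((x, y), Fraction(0))) for x in range(1, 7)])
```

Note that `Fraction` prints values in lowest terms, so 2/36 appears as 1/18.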

Example 7.2. A group of 9 executives of a certain firm include 4 who are married, 3 who never married, and 2 who are divorced. Three of the executives are to be selected for promotion. Let X denote the number of married executives and Y the number of never married executives among the 3 selected for promotion. Assuming that the three are randomly selected from the nine available, what is the joint probability density function of the random variables X and Y?

Answer: The number of ways we can choose 3 out of 9 is $\binom{9}{3}$ which is 84. Thus

$$\begin{aligned}
f(0, 0) &= P(X = 0,\, Y = 0) = \frac{0}{84} = 0 \\
f(1, 0) &= P(X = 1,\, Y = 0) = \frac{\binom{4}{1}\binom{3}{0}\binom{2}{2}}{84} = \frac{4}{84} \\
f(2, 0) &= P(X = 2,\, Y = 0) = \frac{\binom{4}{2}\binom{3}{0}\binom{2}{1}}{84} = \frac{12}{84} \\
f(3, 0) &= P(X = 3,\, Y = 0) = \frac{\binom{4}{3}\binom{3}{0}\binom{2}{0}}{84} = \frac{4}{84}.
\end{aligned}$$

Similarly, we can find the rest of the probabilities. The following table gives the complete information about these probabilities.

 y = 3 |  1/84   0      0      0
 y = 2 |  6/84   12/84  0      0
 y = 1 |  3/84   24/84  18/84  0
 y = 0 |  0      4/84   12/84  4/84
       |  x=0    x=1    x=2    x=3
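
The counting argument of Example 7.2 can be written as a small function. The following sketch is an illustration with assumed names (`joint_pmf`, and keyword parameters for the group sizes); it uses `math.comb` and exact fractions, and it treats the number of divorced executives selected as whatever remains after x married and y never married ones are chosen.

```python
from fractions import Fraction
from math import comb

def joint_pmf(x, y, married=4, never_married=3, divorced=2, chosen=3):
    """P(X = x, Y = y) when `chosen` executives are drawn at random without replacement."""
    z = chosen - x - y                      # number of divorced executives selected
    if x < 0 or y < 0 or z < 0 or z > divorced:
        return Fraction(0)
    total = comb(married + never_married + divorced, chosen)
    ways = comb(married, x) * comb(never_married, y) * comb(divorced, z)
    return Fraction(ways, total)

if __name__ == "__main__":
    for y in range(3, -1, -1):
        print(y, [str(joint_pmf(x, y)) for x in range(4)])
```

Printed fractions appear in lowest terms, for example 12/84 is shown as 1/7.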

Definition 7.3. Let (X, Y) be a discrete bivariate random variable. Let $R_X$ and $R_Y$ be the range spaces of X and Y, respectively. Let f(x, y) be the joint probability density function of X and Y. The function

$$f_1(x) = \sum_{y \in R_Y} f(x, y)$$

is called the marginal probability density function of X. Similarly, the function

$$f_2(y) = \sum_{x \in R_X} f(x, y)$$

is called the marginal probability density function of Y.

The following diagram illustrates the concept of marginal graphically.

Example 7.3. If the joint probability density function of the discrete random variables X and Y is given by

$$f(x, y) = \begin{cases} \dfrac{1}{36} & \text{if } 1 \le x = y \le 6 \\[4pt] \dfrac{2}{36} & \text{if } 1 \le x < y \le 6 \\[4pt] 0 & \text{otherwise,} \end{cases}$$

then what are marginals of X and Y?

Answer: The marginal of X can be obtained by summing the joint probability density function f(x, y) for all y values in the range space $R_Y$ of the random variable Y. That is

$$\begin{aligned}
f_1(x) &= \sum_{y \in R_Y} f(x, y) \\
&= \sum_{y=1}^{6} f(x, y) \\
&= f(x, x) + \sum_{y > x} f(x, y) + \sum_{y < x} f(x, y) \\
&= \frac{1}{36} + (6 - x)\,\frac{2}{36} + 0 \\
&= \frac{1}{36}\,[13 - 2x], \qquad x = 1, 2, ..., 6.
\end{aligned}$$

Similarly, one can obtain the marginal probability density of Y by summing over for all x values in the range space $R_X$ of the random variable X. Hence

$$\begin{aligned}
f_2(y) &= \sum_{x \in R_X} f(x, y) \\
&= \sum_{x=1}^{6} f(x, y) \\
&= f(y, y) + \sum_{x < y} f(x, y) + \sum_{x > y} f(x, y) \\
&= \frac{1}{36} + (y - 1)\,\frac{2}{36} + 0 \\
&= \frac{1}{36}\,[2y - 1], \qquad y = 1, 2, ..., 6.
\end{aligned}$$
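
The marginal sums of Example 7.3 are easy to check by direct summation. The sketch below is an illustration, not part of the original text; the function names are assumptions, and exact fractions are used so the results can be compared with the closed forms for the two marginals.

```python
from fractions import Fraction

def f(x, y):
    """Joint pmf of Example 7.3 (smaller and larger outcome of two fair dice)."""
    if 1 <= x == y <= 6:
        return Fraction(1, 36)
    if 1 <= x < y <= 6:
        return Fraction(2, 36)
    return Fraction(0)

def marginal_x(x):
    """f1(x): sum the joint pmf over all y in the range space {1, ..., 6}."""
    return sum(f(x, y) for y in range(1, 7))

def marginal_y(y):
    """f2(y): sum the joint pmf over all x in the range space {1, ..., 6}."""
    return sum(f(x, y) for x in range(1, 7))

if __name__ == "__main__":
    print([str(marginal_x(x)) for x in range(1, 7)])
    print([str(marginal_y(y)) for y in range(1, 7)])
```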

Example 7.4. Let X and Y be discrete random variables with joint probability density function

$$f(x, y) = \begin{cases} \dfrac{1}{21}\,(x + y) & \text{if } x = 1, 2;\ y = 1, 2, 3 \\ 0 & \text{otherwise.} \end{cases}$$

What are the marginal probability density functions of X and Y?

Answer: The marginal of X is given by

$$\begin{aligned}
f_1(x) &= \sum_{y=1}^{3} \frac{1}{21}\,(x + y) \\
&= \frac{1}{21}\,3x + \frac{1}{21}\,[1 + 2 + 3] \\
&= \frac{x + 2}{7}, \qquad x = 1, 2.
\end{aligned}$$

Similarly, the marginal of Y is given by

$$\begin{aligned}
f_2(y) &= \sum_{x=1}^{2} \frac{1}{21}\,(x + y) \\
&= \frac{2y}{21} + \frac{3}{21} \\
&= \frac{3 + 2y}{21}, \qquad y = 1, 2, 3.
\end{aligned}$$

From the above examples, note that the marginal $f_1(x)$ is obtained by summing across the columns. Similarly, the marginal $f_2(y)$ is obtained by summing across the rows.


The following theorem follows from the deﬁnition of the joint probability

density function.

Theorem 7.1. A real valued function f of two variables is a joint probability density function of a pair of discrete random variables X and Y (with range spaces $R_X$ and $R_Y$, respectively) if and only if

(a) $f(x, y) \ge 0$ for all $(x, y) \in R_X \times R_Y$;

(b) $\sum_{x \in R_X} \sum_{y \in R_Y} f(x, y) = 1$.

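Example 7.5. For what value of the constant k the function given by

$$f(x, y) = \begin{cases} k\,x\,y & \text{if } x = 1, 2, 3;\ y = 1, 2, 3 \\ 0 & \text{otherwise} \end{cases}$$

is a joint probability density function of some random variables X and Y?

Answer: Since

$$\begin{aligned}
1 &= \sum_{x=1}^{3} \sum_{y=1}^{3} f(x, y) \\
&= \sum_{x=1}^{3} \sum_{y=1}^{3} k\,x\,y \\
&= k\,[1 + 2 + 3 + 2 + 4 + 6 + 3 + 6 + 9] \\
&= 36\,k.
\end{aligned}$$

Hence

$$k = \frac{1}{36}$$

and the corresponding density function is given by

$$f(x, y) = \begin{cases} \dfrac{1}{36}\,x\,y & \text{if } x = 1, 2, 3;\ y = 1, 2, 3 \\ 0 & \text{otherwise.} \end{cases}$$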
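
Condition (b) of Theorem 7.1 determines the constant in Example 7.5: k is the reciprocal of the sum of the unnormalized weights over the support. A minimal sketch of that computation follows; the helper `normalizing_constant` is a hypothetical name introduced only for this illustration.

```python
from fractions import Fraction

def normalizing_constant(weight, support):
    """Return k such that k * weight(x, y) sums to 1 over the given support."""
    total = sum(Fraction(weight(x, y)) for x, y in support)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    return 1 / total

# Example 7.5: f(x, y) = k * x * y on x, y in {1, 2, 3}.
support = [(x, y) for x in (1, 2, 3) for y in (1, 2, 3)]
k = normalizing_constant(lambda x, y: x * y, support)

if __name__ == "__main__":
    print(k)
```

The same helper applies to any finite support, for instance the weights x + y used in Example 7.4.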

As in the case of one random variable, there are many situations where

one wants to know the probability that the values of two random variables

are less than or equal to some real numbers x and y.


Definition 7.4. Let X and Y be any two discrete random variables. The real valued function $F : \mathbb{R}^2 \to \mathbb{R}$ is called the joint cumulative probability distribution function of X and Y if and only if

$$F(x, y) = P(X \le x,\, Y \le y)$$

for all $(x, y) \in \mathbb{R}^2$. Here, the event $(X \le x,\, Y \le y)$ means $(X \le x) \cap (Y \le y)$.

From this definition it can be shown that for any real numbers a < b and c < d

$$P(a < X \le b,\ c < Y \le d) = F(b, d) + F(a, c) - F(a, d) - F(b, c).$$

Further, one can also show that

$$F(x, y) = \sum_{s \le x} \sum_{t \le y} f(s, t)$$

where (s, t) is any pair of nonnegative numbers.

7.2. Bivariate Continuous Random Variables

In this section, we shall extend the idea of probability density functions

of one random variable to that of two random variables.

Definition 7.5. The joint probability density function of the random variables X and Y is an integrable function f(x, y) such that

(a) $f(x, y) \ge 0$ for all $(x, y) \in \mathbb{R}^2$; and

(b) $\int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy = 1$.

Example 7.6. Let the joint density function of X and Y be given by

$$f(x, y) = \begin{cases} k\,x\,y^2 & \text{if } 0 < x < y < 1 \\ 0 & \text{otherwise.} \end{cases}$$

What is the value of the constant k?

Answer: Since f is a joint probability density function, we have

$$\begin{aligned}
1 &= \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,dx\,dy \\
&= \int_{0}^{1} \int_{0}^{y} k\,x\,y^2\,dx\,dy \\
&= \int_{0}^{1} k\,y^2 \int_{0}^{y} x\,dx\,dy \\
&= \frac{k}{2} \int_{0}^{1} y^4\,dy \\
&= \frac{k}{10} \left[y^5\right]_{0}^{1} \\
&= \frac{k}{10}.
\end{aligned}$$

Hence k = 10.

If we know the joint probability density function f of the random variables X and Y, then we can compute the probability of the event A from

$$P(A) = \iint_{A} f(x, y)\,dx\,dy.$$

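Example 7.7. Let the joint density of the continuous random variables X and Y be

$$f(x, y) = \begin{cases} \dfrac{6}{5}\left(x^2 + 2xy\right) & \text{if } 0 \le x \le 1;\ 0 \le y \le 1 \\ 0 & \text{elsewhere.} \end{cases}$$

What is the probability of the event $(X \le Y)$?

Answer: Let $A = (X \le Y)$. We want to find

$$\begin{aligned}
P(A) &= \iint_{A} f(x, y)\,dx\,dy \\
&= \int_{0}^{1} \left[\int_{0}^{y} \frac{6}{5}\left(x^2 + 2xy\right) dx\right] dy \\
&= \frac{6}{5} \int_{0}^{1} \left[\frac{x^3}{3} + x^2 y\right]_{x=0}^{x=y} dy \\
&= \frac{6}{5} \int_{0}^{1} \frac{4}{3}\,y^3\,dy \\
&= \frac{2}{5} \left[y^4\right]_{0}^{1} \\
&= \frac{2}{5}.
\end{aligned}$$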
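
Both Example 7.6 and Example 7.7 integrate a function over the triangle 0 < x < y < 1. As a rough numerical cross-check (an illustration, not part of the original text), the sketch below approximates such iterated integrals with a midpoint rule; the function name and the grid size n are assumptions, and the results are approximations rather than exact values.

```python
def iterated_integral(g, n=200):
    """Midpoint-rule approximation of the integral of g(x, y) over 0 < x < y < 1.

    The outer variable y runs over (0, 1); for each y the inner variable x
    runs over (0, y), matching the order of integration used in the examples.
    """
    hy = 1.0 / n
    total = 0.0
    for j in range(n):
        y = (j + 0.5) * hy
        hx = y / n
        inner = sum(g((i + 0.5) * hx, y) for i in range(n)) * hx
        total += inner * hy
    return total

# Example 7.6: the density k*x*y^2 must integrate to 1, so k = 1 / integral.
k_estimate = 1.0 / iterated_integral(lambda x, y: x * y ** 2)

# Example 7.7: P(X <= Y) for the density (6/5)(x^2 + 2xy) on the unit square.
p_estimate = iterated_integral(lambda x, y: 1.2 * (x ** 2 + 2.0 * x * y))

if __name__ == "__main__":
    print(round(k_estimate, 3), round(p_estimate, 3))
```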

Definition 7.6. Let (X, Y) be a continuous bivariate random variable. Let f(x, y) be the joint probability density function of X and Y. The function

$$f_1(x) = \int_{-\infty}^{\infty} f(x, y)\,dy$$

is called the marginal probability density function of X. Similarly, the function

$$f_2(y) = \int_{-\infty}^{\infty} f(x, y)\,dx$$

is called the marginal probability density function of Y.

Example 7.8. If the joint density function for X and Y is given by

$$f(x, y) = \begin{cases} \dfrac{3}{4} & \text{for } 0 < y^2 < x < 1 \\ 0 & \text{otherwise,} \end{cases}$$

then what is the marginal density function of X, for 0 < x < 1?

Answer: The domain of the f consists of the region bounded by the curve $x = y^2$ and the vertical line x = 1. (See the figure on the next page.)

Hence

$$\begin{aligned}
f_1(x) &= \int_{-\sqrt{x}}^{\sqrt{x}} \frac{3}{4}\,dy \\
&= \left[\frac{3}{4}\,y\right]_{-\sqrt{x}}^{\sqrt{x}} \\
&= \frac{3}{2}\sqrt{x}.
\end{aligned}$$

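Example 7.9. Let X and Y have joint density function

$$f(x, y) = \begin{cases} 2\,e^{-x-y} & \text{for } 0 < x \le y < \infty \\ 0 & \text{otherwise.} \end{cases}$$

What is the marginal density of X where nonzero?

Answer: The marginal density of X is given by

$$\begin{aligned}
f_1(x) &= \int_{-\infty}^{\infty} f(x, y)\,dy \\
&= \int_{x}^{\infty} 2\,e^{-x-y}\,dy \\
&= 2\,e^{-x} \int_{x}^{\infty} e^{-y}\,dy \\
&= 2\,e^{-x} \left[-e^{-y}\right]_{x}^{\infty} \\
&= 2\,e^{-x}\,e^{-x} \\
&= 2\,e^{-2x}, \qquad 0 < x < \infty.
\end{aligned}$$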

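Example 7.10. Let (X, Y) be distributed uniformly on the circular disk centered at (0, 0) with radius $\frac{2}{\sqrt{\pi}}$. What is the marginal density function of X where nonzero?

Answer: The equation of a circle with radius $\frac{2}{\sqrt{\pi}}$ and center at the origin is

$$x^2 + y^2 = \frac{4}{\pi}.$$

Hence, solving this equation for y, we get

$$y = \pm\sqrt{\frac{4}{\pi} - x^2}.$$

Thus, the marginal density of X is given by

$$\begin{aligned}
f_1(x) &= \int_{-\sqrt{\frac{4}{\pi} - x^2}}^{\sqrt{\frac{4}{\pi} - x^2}} f(x, y)\,dy \\
&= \int_{-\sqrt{\frac{4}{\pi} - x^2}}^{\sqrt{\frac{4}{\pi} - x^2}} \frac{1}{\text{area of the circle}}\,dy \\
&= \int_{-\sqrt{\frac{4}{\pi} - x^2}}^{\sqrt{\frac{4}{\pi} - x^2}} \frac{1}{4}\,dy \\
&= \left[\frac{1}{4}\,y\right]_{-\sqrt{\frac{4}{\pi} - x^2}}^{\sqrt{\frac{4}{\pi} - x^2}} \\
&= \frac{1}{2}\sqrt{\frac{4}{\pi} - x^2}.
\end{aligned}$$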

Definition 7.7. Let X and Y be the continuous random variables with joint probability density function f(x, y). The joint cumulative distribution function F(x, y) of X and Y is defined as

$$F(x, y) = P(X \le x,\, Y \le y) = \int_{-\infty}^{y} \int_{-\infty}^{x} f(u, v)\,du\,dv$$

for all $(x, y) \in \mathbb{R}^2$.

From the fundamental theorem of calculus, we again obtain

$$f(x, y) = \frac{\partial^2 F}{\partial x\, \partial y}.$$

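Example 7.11. If the joint cumulative distribution function of X and Y is given by

$$F(x, y) = \begin{cases} \dfrac{1}{5}\left(2x^3 y + 3x^2 y^2\right) & \text{for } 0 < x, y < 1 \\ 0 & \text{elsewhere,} \end{cases}$$

then what is the joint density of X and Y?

Answer:

$$\begin{aligned}
f(x, y) &= \frac{1}{5}\, \frac{\partial}{\partial x} \frac{\partial}{\partial y} \left(2x^3 y + 3x^2 y^2\right) \\
&= \frac{1}{5}\, \frac{\partial}{\partial x} \left(2x^3 + 6x^2 y\right) \\
&= \frac{1}{5} \left(6x^2 + 12xy\right) \\
&= \frac{6}{5} \left(x^2 + 2xy\right).
\end{aligned}$$

Hence, the joint density of X and Y is given by

$$f(x, y) = \begin{cases} \dfrac{6}{5}\left(x^2 + 2xy\right) & \text{for } 0 < x, y < 1 \\ 0 & \text{elsewhere.} \end{cases}$$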

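Example 7.12. Let X and Y have the joint density function

$$f(x, y) = \begin{cases} 2x & \text{for } 0 < x < 1;\ 0 < y < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

What is $P\left(X + Y \le 1 \,/\, X \le \frac{1}{2}\right)$?

Answer: (See the diagram below.)

$$\begin{aligned}
P\left(X + Y \le 1 \,/\, X \le \tfrac{1}{2}\right) &= \frac{P\left[(X + Y \le 1) \cap \left(X \le \frac{1}{2}\right)\right]}{P\left(X \le \frac{1}{2}\right)} \\
&= \frac{\int_{0}^{\frac{1}{2}} \left[\int_{0}^{\frac{1}{2}} 2x\,dx\right] dy + \int_{\frac{1}{2}}^{1} \left[\int_{0}^{1-y} 2x\,dx\right] dy}{\int_{0}^{1} \left[\int_{0}^{\frac{1}{2}} 2x\,dx\right] dy} \\
&= \frac{\frac{1}{8} + \frac{1}{24}}{\frac{1}{4}} = \frac{\frac{1}{6}}{\frac{1}{4}} \\
&= \frac{2}{3}.
\end{aligned}$$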

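Example 7.13. Let X and Y have the joint density function

$$f(x, y) = \begin{cases} x + y & \text{for } 0 \le x \le 1;\ 0 \le y \le 1 \\ 0 & \text{elsewhere.} \end{cases}$$

What is $P(2X \le 1 \,/\, X + Y \le 1)$?

Answer: We know that

$$P(2X \le 1 \,/\, X + Y \le 1) = \frac{P\left[\left(X \le \frac{1}{2}\right) \cap (X + Y \le 1)\right]}{P(X + Y \le 1)}.$$

$$\begin{aligned}
P[X + Y \le 1] &= \int_{0}^{1} \left[\int_{0}^{1-x} (x + y)\,dy\right] dx \\
&= \left[\frac{x^2}{2} - \frac{x^3}{3} - \frac{(1-x)^3}{6}\right]_{0}^{1} \\
&= \frac{2}{6} = \frac{1}{3}.
\end{aligned}$$

Similarly

$$\begin{aligned}
P\left[\left(X \le \tfrac{1}{2}\right) \cap (X + Y \le 1)\right] &= \int_{0}^{\frac{1}{2}} \left[\int_{0}^{1-x} (x + y)\,dy\right] dx \\
&= \left[\frac{x^2}{2} - \frac{x^3}{3} - \frac{(1-x)^3}{6}\right]_{0}^{\frac{1}{2}} \\
&= \frac{11}{48}.
\end{aligned}$$

Thus,

$$P(2X \le 1 \,/\, X + Y \le 1) = \left(\frac{11}{48}\right)\left(\frac{3}{1}\right) = \frac{11}{16}.$$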

7.3. Conditional Distributions

First, we motivate the definition of conditional distribution using discrete random variables and then based on this motivation we give a general definition of the conditional distribution. Let X and Y be two discrete random variables with joint probability density f(x, y). Then by definition of the joint probability density, we have

$$f(x, y) = P(X = x,\, Y = y).$$

If $A = \{X = x\}$, $B = \{Y = y\}$ and $f_2(y) = P(Y = y)$, then from the above equation we have

$$\begin{aligned}
P(\{X = x\} \,/\, \{Y = y\}) &= P(A \,/\, B) \\
&= \frac{P(A \cap B)}{P(B)} \\
&= \frac{P(\{X = x\} \text{ and } \{Y = y\})}{P(Y = y)} \\
&= \frac{f(x, y)}{f_2(y)}.
\end{aligned}$$

If we write the $P(\{X = x\} \,/\, \{Y = y\})$ as g(x / y), then we have

$$g(x \,/\, y) = \frac{f(x, y)}{f_2(y)}.$$

For the discrete bivariate random variables, we can write the conditional probability of the event {X = x} given the event {Y = y} as the ratio of the probability of the event $\{X = x\} \cap \{Y = y\}$ to the probability of the event {Y = y} which is

$$g(x \,/\, y) = \frac{f(x, y)}{f_2(y)}.$$

We use this fact to define the conditional probability density function given two random variables X and Y.

Definition 7.8. Let X and Y be any two random variables with joint density f(x, y) and marginals $f_1(x)$ and $f_2(y)$. The conditional probability density function g of X, given (the event) Y = y, is defined as

$$g(x \,/\, y) = \frac{f(x, y)}{f_2(y)}, \qquad f_2(y) > 0.$$

Similarly, the conditional probability density function h of Y, given (the event) X = x, is defined as

$$h(y \,/\, x) = \frac{f(x, y)}{f_1(x)}, \qquad f_1(x) > 0.$$

Example 7.14. Let X and Y be discrete random variables with joint probability function

$$f(x,y) = \begin{cases} \frac{1}{21}(x + y) & \text{for } x = 1, 2, 3;\ y = 1, 2 \\ 0 & \text{elsewhere.} \end{cases}$$

What is the conditional probability density function of X, given Y = 2?

Answer: We want to find $g(x/2)$. Since

$$g(x/2) = \frac{f(x,2)}{f_2(2)},$$

we should first compute the marginal of Y, that is $f_2(2)$. The marginal of Y is given by

$$f_2(y) = \sum_{x=1}^{3} \frac{1}{21}(x + y) = \frac{1}{21}(6 + 3y).$$

Hence $f_2(2) = \frac{12}{21}$. Thus, the conditional probability density function of X, given Y = 2, is

$$g(x/2) = \frac{f(x,2)}{f_2(2)} = \frac{\frac{1}{21}(x + 2)}{\frac{12}{21}} = \frac{1}{12}(x + 2), \qquad x = 1, 2, 3.$$
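The computation in Example 7.14 can be checked mechanically. The following is a minimal sketch, assuming exact rational arithmetic is acceptable; the function names `joint_pmf` and `conditional_x_given_y` are illustrative and do not come from the text.

```python
from fractions import Fraction

# Joint pmf of Example 7.14: f(x, y) = (x + y)/21 for x in {1, 2, 3}, y in {1, 2}.
def joint_pmf(x, y):
    if x in (1, 2, 3) and y in (1, 2):
        return Fraction(x + y, 21)
    return Fraction(0)

def conditional_x_given_y(y, xs=(1, 2, 3)):
    """Return {x: g(x/y)}, computed as f(x, y) / f2(y)."""
    f2_y = sum(joint_pmf(x, y) for x in xs)  # marginal of Y at y
    if f2_y == 0:
        raise ValueError("conditional pmf is undefined where f2(y) = 0")
    return {x: joint_pmf(x, y) / f2_y for x in xs}

g = conditional_x_given_y(2)
for x, prob in g.items():
    print(x, prob)  # each value equals (x + 2)/12 in lowest terms
```

Dividing by the marginal is what makes the conditional probabilities sum to one for the fixed value of y.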

Example 7.15. Let X and Y be discrete random variables with joint probability density function

$$f(x,y) = \begin{cases} \frac{x + y}{32} & \text{for } x = 1, 2;\ y = 1, 2, 3, 4 \\ 0 & \text{otherwise.} \end{cases}$$

What is the conditional probability of Y given X = x?

Answer:

$$f_1(x) = \sum_{y=1}^{4} f(x,y) = \frac{1}{32}\sum_{y=1}^{4} (x + y) = \frac{1}{32}(4x + 10).$$

Therefore

$$h(y/x) = \frac{f(x,y)}{f_1(x)} = \frac{\frac{1}{32}(x + y)}{\frac{1}{32}(4x + 10)} = \frac{x + y}{4x + 10}.$$

Thus, the conditional probability of Y given X = x is

$$h(y/x) = \begin{cases} \frac{x + y}{4x + 10} & \text{for } x = 1, 2;\ y = 1, 2, 3, 4 \\ 0 & \text{otherwise.} \end{cases}$$

Example 7.16. Let X and Y be continuous random variables with joint pdf

$$f(x,y) = \begin{cases} 12x & \text{for } 0 < y < 2x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

What is the conditional density function of Y given X = x?

Answer: First, we have to find the marginal of X.

$$f_1(x) = \int_{-\infty}^{\infty} f(x,y)\,dy = \int_0^{2x} 12x\,dy = 24x^2.$$

Thus, the conditional density of Y given X = x is

$$h(y/x) = \frac{f(x,y)}{f_1(x)} = \frac{12x}{24x^2} = \frac{1}{2x}, \qquad \text{for } 0 < y < 2x < 1,$$

and zero elsewhere.

Example 7.17. Let X and Y be random variables such that X has density function

$$f_1(x) = \begin{cases} 24x^2 & \text{for } 0 < x < \tfrac{1}{2} \\ 0 & \text{elsewhere} \end{cases}$$

and the conditional density of Y given X = x is

$$h(y/x) = \begin{cases} \frac{y}{2x^2} & \text{for } 0 < y < 2x \\ 0 & \text{elsewhere.} \end{cases}$$

What is the conditional density of X given Y = y over the appropriate domain?

Answer: The joint density f(x, y) of X and Y is given by

$$f(x,y) = h(y/x)\,f_1(x) = \frac{y}{2x^2}\,24x^2 = 12y \qquad \text{for } 0 < y < 2x < 1.$$

The marginal density of Y is given by

$$f_2(y) = \int_{-\infty}^{\infty} f(x,y)\,dx = \int_{y/2}^{1/2} 12y\,dx = 6y(1 - y), \qquad \text{for } 0 < y < 1.$$

Hence, the conditional density of X given Y = y is

$$g(x/y) = \frac{f(x,y)}{f_2(y)} = \frac{12y}{6y(1 - y)} = \frac{2}{1 - y}.$$

Thus, the conditional density of X given Y = y is given by

$$g(x/y) = \begin{cases} \frac{2}{1 - y} & \text{for } 0 < y < 2x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

Note that for a specific x, the function f(x, y) is the intersection (profile) of the surface z = f(x, y) with the plane x = constant. The conditional density $h(y/x)$ is the profile of f(x, y) normalized by the factor $\frac{1}{f_1(x)}$.


7.4. Independence of Random Variables

In this section, we define the concept of stochastic independence of two random variables X and Y. The conditional probability density function g of X given Y = y usually depends on y. If g is independent of y, then the random variables X and Y are said to be independent. This motivates the following definition.

Definition 7.8. Let X and Y be any two random variables with joint density f(x, y) and marginals $f_1(x)$ and $f_2(y)$. The random variables X and Y are (stochastically) independent if and only if

$$f(x,y) = f_1(x)\,f_2(y)$$

for all $(x, y) \in R_X \times R_Y$.

Example 7.18. Let X and Y be discrete random variables with joint density

$$f(x,y) = \begin{cases} \frac{1}{36} & \text{for } 1 \le x = y \le 6 \\ \frac{2}{36} & \text{for } 1 \le x < y \le 6. \end{cases}$$

Are X and Y stochastically independent?

Answer: The marginals of X and Y are given by

$$f_1(x) = \sum_{y=1}^{6} f(x,y) = f(x,x) + \sum_{y>x} f(x,y) + \sum_{y<x} f(x,y) = \frac{1}{36} + (6 - x)\frac{2}{36} + 0 = \frac{13 - 2x}{36}, \qquad \text{for } x = 1, 2, ..., 6$$

and

$$f_2(y) = \sum_{x=1}^{6} f(x,y) = f(y,y) + \sum_{x<y} f(x,y) + \sum_{x>y} f(x,y) = \frac{1}{36} + (y - 1)\frac{2}{36} + 0 = \frac{2y - 1}{36}, \qquad \text{for } y = 1, 2, ..., 6.$$

Since

$$f(1,1) = \frac{1}{36} \ne \frac{11}{36} \cdot \frac{1}{36} = f_1(1)\,f_2(1),$$

we conclude that $f(x,y) \ne f_1(x)\,f_2(y)$, and X and Y are not independent.

This example also illustrates that the marginals of X and Y can be determined if one knows the joint density f(x, y). However, if one knows only the marginals of X and Y, then it is not possible to find the joint density of X and Y unless the random variables are independent.
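The independence check of Example 7.18 amounts to comparing the joint density with the product of the marginals at every point of the range space. A minimal sketch, assuming exact fractions and illustrative names (`joint_pmf`, `is_independent`) that are not part of the text:

```python
from fractions import Fraction
from itertools import product

# Joint pmf of Example 7.18 on {1, ..., 6} x {1, ..., 6}.
def joint_pmf(x, y):
    if 1 <= x == y <= 6:
        return Fraction(1, 36)
    if 1 <= x < y <= 6:
        return Fraction(2, 36)
    return Fraction(0)

support = range(1, 7)

# Marginals obtained by summing the joint pmf over the other variable.
f1 = {x: sum(joint_pmf(x, y) for y in support) for x in support}
f2 = {y: sum(joint_pmf(x, y) for x in support) for y in support}

def is_independent():
    """True only if f(x, y) = f1(x) f2(y) at every point of the grid."""
    return all(joint_pmf(x, y) == f1[x] * f2[y]
               for x, y in product(support, support))

print(f1[1], f2[1], joint_pmf(1, 1), is_independent())
```

A single point where the factorization fails, such as (1, 1), is enough to rule out independence.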

Example 7.19. Let X and Y have the joint density

$$f(x,y) = \begin{cases} e^{-(x+y)} & \text{for } 0 < x, y < \infty \\ 0 & \text{otherwise.} \end{cases}$$

Are X and Y stochastically independent?

Answer: The marginals of X and Y are given by

$$f_1(x) = \int_0^{\infty} f(x,y)\,dy = \int_0^{\infty} e^{-(x+y)}\,dy = e^{-x}$$

and

$$f_2(y) = \int_0^{\infty} f(x,y)\,dx = \int_0^{\infty} e^{-(x+y)}\,dx = e^{-y}.$$

Hence

$$f(x,y) = e^{-(x+y)} = e^{-x}\,e^{-y} = f_1(x)\,f_2(y).$$

Thus, X and Y are stochastically independent.

Notice that if the joint density f(x, y) of X and Y can be factored into two nonnegative functions, one solely depending on x and the other solely depending on y, then X and Y are independent. We can use this factorization approach to predict when X and Y are not independent.

Example 7.20. Let X and Y have the joint density

$$f(x,y) = \begin{cases} x + y & \text{for } 0 < x < 1;\ 0 < y < 1 \\ 0 & \text{otherwise.} \end{cases}$$

Are X and Y stochastically independent?

Answer: Notice that

$$f(x,y) = x + y = x\left(1 + \frac{y}{x}\right).$$

Thus, the joint density cannot be factored into two nonnegative functions, one depending on x and the other depending on y; and therefore X and Y are not independent.

If X and Y are independent, then the random variables $U = \phi(X)$ and $V = \psi(Y)$ are also independent. Here $\phi, \psi : \mathbb{R} \to \mathbb{R}$ are some real valued functions. From this comment, one can conclude that if X and Y are independent, then the random variables $e^X$ and $Y^3 + Y^2 + 1$ are also independent.

Definition 7.9. The random variables X and Y are said to be independent and identically distributed (IID) if and only if they are independent and have the same distribution.

Example 7.21. Let X and Y be two independent random variables with identical probability density function given by

$$f(x) = \begin{cases} e^{-x} & \text{for } x > 0 \\ 0 & \text{elsewhere.} \end{cases}$$

What is the probability density function of $W = \min\{X, Y\}$?

Answer: Let G(w) be the cumulative distribution function of W. Then

$$\begin{aligned} G(w) &= P(W \le w) \\ &= 1 - P(W > w) \\ &= 1 - P(\min\{X, Y\} > w) \\ &= 1 - P(X > w \text{ and } Y > w) \\ &= 1 - P(X > w)\,P(Y > w) \qquad \text{(since X and Y are independent)} \\ &= 1 - \left(\int_w^{\infty} e^{-x}\,dx\right)\left(\int_w^{\infty} e^{-y}\,dy\right) \\ &= 1 - \left(e^{-w}\right)^2 \\ &= 1 - e^{-2w}. \end{aligned}$$

Thus, the probability density function of W is

$$g(w) = \frac{d}{dw}G(w) = \frac{d}{dw}\left(1 - e^{-2w}\right) = 2e^{-2w}.$$

Hence

$$g(w) = \begin{cases} 2e^{-2w} & \text{for } w > 0 \\ 0 & \text{elsewhere.} \end{cases}$$
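The result of Example 7.21 can be sanity-checked by simulation. The sketch below is only an illustration: the sample size, the seed, and the helper names are assumptions chosen here, and the comparison is approximate because it relies on random sampling.

```python
import math
import random

def simulate_min_of_exponentials(n, seed=0):
    """Draw n values of W = min(X, Y) with X, Y independent Exp(1)."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    return [min(rng.expovariate(1.0), rng.expovariate(1.0)) for _ in range(n)]

def empirical_cdf(samples, w):
    """Fraction of samples that are <= w."""
    return sum(1 for s in samples if s <= w) / len(samples)

samples = simulate_min_of_exponentials(100_000)
for w in (0.25, 0.5, 1.0):
    # Empirical value next to the derived G(w) = 1 - exp(-2w).
    print(w, empirical_cdf(samples, w), 1 - math.exp(-2 * w))
```

The empirical values should be close to $1 - e^{-2w}$, which is the distribution function of an exponential random variable with mean 1/2.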


7.5. Review Exercises

1. Let X and Y be discrete random variables with joint probability density function

$$f(x,y) = \begin{cases} \frac{1}{21}(x + y) & \text{for } x = 1, 2, 3;\ y = 1, 2 \\ 0 & \text{otherwise.} \end{cases}$$

What are the marginals of X and Y?

2. Roll a pair of unbiased dice. Let X be the maximum of the two faces and Y be the sum of the two faces. What is the joint density of X and Y?

3. For what value of c is the real valued function

$$f(x,y) = \begin{cases} c\,(x + 2y) & \text{for } x = 1, 2;\ y = 1, 2 \\ 0 & \text{otherwise} \end{cases}$$

a joint density for some random variables X and Y?

4. Let X and Y have the joint density

$$f(x,y) = \begin{cases} e^{-(x+y)} & \text{for } 0 \le x, y < \infty \\ 0 & \text{otherwise.} \end{cases}$$

What is P (X Y 2) ?

5. If the random variable X is uniform on the interval from $-1$ to 1, and the random variable Y is uniform on the interval from 0 to 1, what is the probability that the quadratic equation $t^2 + 2Xt + Y = 0$ has real solutions? Assume X and Y are independent.

6. Let Y have a uniform distribution on the interval (0, 1), and let the conditional density of X given Y = y be uniform on the interval from 0 to $\sqrt{y}$. What is the marginal density of X for 0 < x < 1?

7. If the joint cumulative distribution of the random variables X and Y is

$$F(x,y) = \begin{cases} \left(1 - e^{-x}\right)\left(1 - e^{-y}\right) & \text{for } x > 0,\ y > 0 \\ 0 & \text{otherwise,} \end{cases}$$

what is the joint probability density function of the random variables X and Y, and what is P(1 < X < 3, 1 < Y < 2)?

8. If the random variables X and Y have the joint density

$$f(x,y) = \begin{cases} \frac{6}{7}x & \text{for } 1 \le x + y \le 2,\ x \ge 0,\ y \ge 0 \\ 0 & \text{otherwise,} \end{cases}$$

what is the probability P (Y X2 ) ?

9. If the random variables X and Y have the joint density

$$f(x,y) = \begin{cases} \frac{6}{7}x & \text{for } 1 \le x + y \le 2,\ x \ge 0,\ y \ge 0 \\ 0 & \text{otherwise,} \end{cases}$$

what is the probability $P[\max(X, Y) > 1]$?

10. Let X and Y have the joint probability density function

$$f(x,y) = \begin{cases} \frac{5}{16}\,x y^2 & \text{for } 0 < x < y < 2 \\ 0 & \text{elsewhere.} \end{cases}$$

What is the marginal density function of X where it is nonzero?

11. Let X and Y have the joint probability density function

$$f(x,y) = \begin{cases} 4x & \text{for } 0 < x < \sqrt{y} < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

What is the marginal density function of Y, where nonzero?

12. A point (X, Y) is chosen at random from a uniform distribution on the circular disk of radius 1 centered at the point (1, 1). For a given value of X = x between 0 and 2 and for y in the appropriate domain, what is the conditional density function for Y?

13. Let X and Y be continuous random variables with joint density function

$$f(x,y) = \begin{cases} \frac{3}{4}(2 - x - y) & \text{for } 0 < x, y < 2;\ 0 < x + y < 2 \\ 0 & \text{otherwise.} \end{cases}$$

What is the conditional probability $P(X < 1 \mid Y < 1)$?

14. Let X and Y be continuous random variables with joint density function

$$f(x,y) = \begin{cases} 12x & \text{for } 0 < y < 2x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

What is the conditional density function of Y given X = x?

15. Let X and Y be continuous random variables with joint density function

$$f(x,y) = \begin{cases} 24xy & \text{for } x > 0,\ y > 0,\ 0 < x + y < 1 \\ 0 & \text{otherwise.} \end{cases}$$

What is the conditional probability $P\left(X < \tfrac{1}{2} \mid Y = \tfrac{1}{4}\right)$?

16. Let X and Y be two independent random variables with identical probability density function given by

$$f(x) = \begin{cases} e^{-x} & \text{for } x > 0 \\ 0 & \text{elsewhere.} \end{cases}$$

What is the probability density function of $W = \max\{X, Y\}$?

17. Let X and Y be two independent random variables with identical probability density function given by

$$f(x) = \begin{cases} \frac{3x^2}{\theta^3} & \text{for } 0 \le x \le \theta \\ 0 & \text{elsewhere,} \end{cases}$$

for some $\theta > 0$. What is the probability density function of $W = \min\{X, Y\}$?

18. Ron and Glenna agree to meet between 5 P.M. and 6 P.M. Suppose that each of them arrives at a time distributed uniformly at random in this time interval, independent of the other. Each will wait for the other at most 10 minutes (and if the other does not show up, will leave). What is the probability that they actually go out?


19. Let X and Y be two independent random variables distributed uniformly

on the interval [0, 1]. What is the probability of the event Y 1

2given that

Y1 2 X?

20. Let X and Y have the joint density

$$f(x,y) = \begin{cases} 8xy & \text{for } 0 < y < x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

What is P(X + Y > 1)?

21. Let X and Y be continuous random variables with joint density function

$$f(x,y) = \begin{cases} 2 & \text{for } 0 \le y \le x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

Are X and Y stochastically independent?

22. Let X and Y be continuous random variables with joint density function

$$f(x,y) = \begin{cases} 2x & \text{for } 0 < x, y < 1 \\ 0 & \text{otherwise.} \end{cases}$$

Are X and Y stochastically independent?

23. A bus and a passenger arrive at a bus stop at a uniformly distributed time over the interval 0 to 1 hour. Assume the arrival times of the bus and passenger are independent of one another and that the passenger will wait up to 15 minutes for the bus to arrive. What is the probability that the passenger will catch the bus?

24. Let X and Y be continuous random variables with joint density function

$$f(x,y) = \begin{cases} 4xy & \text{for } 0 \le x, y \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

0 otherwise.

What is the probability of the event X 1

2given that Y 3

25. Let X and Y be continuous random variables with joint density function

$$f(x,y) = \begin{cases} \frac{1}{2} & \text{for } 0 \le x \le y \le 2 \\ 0 & \text{otherwise.} \end{cases}$$

What is the probability of the event X 1

2given that Y = 1?


26. If the joint density of the random variables X and Y is

$$f(x,y) = \begin{cases} 1 & \text{if } 0 \le x \le y \le 1 \\ \frac{1}{2} & \text{if } 1 \le x \le 2,\ 0 \le y \le 1 \\ 0 & \text{otherwise,} \end{cases}$$

what is the probability of the event  X 3

2, Y  1

2?

27. If the joint density of the random variables X and Y is

$$f(x,y) = \begin{cases} \left(e^{\min\{x,y\}} - 1\right)e^{-(x+y)} & \text{if } 0 < x, y < \infty \\ 0 & \text{otherwise,} \end{cases}$$

then what is the marginal density function of X, where nonzero?


Chapter 8

PRODUCT MOMENTS OF BIVARIATE RANDOM VARIABLES

In this chapter, we define various product moments of a bivariate random variable. The main concept we introduce in this chapter is the notion of covariance between two random variables. Using this notion, we study the statistical dependence of two random variables.

8.1. Covariance of Bivariate Random Variables

First, we define the notion of product moment of two random variables, and then, using this product moment, we give the definition of covariance between two random variables.

Definition 8.1. Let X and Y be any two random variables with joint density function f(x, y). The product moment of X and Y, denoted by E(XY), is defined as

$$E(XY) = \begin{cases} \displaystyle\sum_{x \in R_X}\sum_{y \in R_Y} xy\,f(x,y) & \text{if X and Y are discrete} \\[2ex] \displaystyle\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\,f(x,y)\,dx\,dy & \text{if X and Y are continuous.} \end{cases}$$

Here, $R_X$ and $R_Y$ represent the range spaces of X and Y, respectively.

Definition 8.2. Let X and Y be any two random variables with joint density function f(x, y). The covariance between X and Y, denoted by Cov(X, Y) (or $\sigma_{XY}$), is defined as

$$Cov(X, Y) = E\left((X - \mu_X)(Y - \mu_Y)\right),$$

where $\mu_X$ and $\mu_Y$ are the means of X and Y, respectively.

Notice that the covariance of X and Y is really the product moment of $X - \mu_X$ and $Y - \mu_Y$. Further, the mean of X is given by

$$\mu_X = E(X) = \int_{-\infty}^{\infty} x\,f_1(x)\,dx = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} x\,f(x,y)\,dx\,dy,$$

and similarly the mean of Y is given by

$$\mu_Y = E(Y) = \int_{-\infty}^{\infty} y\,f_2(y)\,dy = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y\,f(x,y)\,dy\,dx.$$

Theorem 8.1. Let X and Y be any two random variables. Then

$$Cov(X, Y) = E(XY) - E(X)\,E(Y).$$

Proof:

$$\begin{aligned} Cov(X, Y) &= E\left((X - \mu_X)(Y - \mu_Y)\right) \\ &= E(XY - \mu_X Y - \mu_Y X + \mu_X\mu_Y) \\ &= E(XY) - \mu_X E(Y) - \mu_Y E(X) + \mu_X\mu_Y \\ &= E(XY) - \mu_X\mu_Y - \mu_Y\mu_X + \mu_X\mu_Y \\ &= E(XY) - \mu_X\mu_Y \\ &= E(XY) - E(X)\,E(Y). \end{aligned}$$

Corollary 8.1. $Cov(X, X) = \sigma_X^2$.

Proof:

$$Cov(X, X) = E(XX) - E(X)\,E(X) = E(X^2) - \mu_X^2 = Var(X) = \sigma_X^2.$$

Example 8.1. Let X and Y be discrete random variables with joint density

$$f(x,y) = \begin{cases} \frac{x + 2y}{18} & \text{for } x = 1, 2;\ y = 1, 2 \\ 0 & \text{elsewhere.} \end{cases}$$

What is the covariance $\sigma_{XY}$ between X and Y?

Answer: The marginal of X is

$$f_1(x) = \sum_{y=1}^{2} \frac{x + 2y}{18} = \frac{1}{18}(2x + 6).$$

Hence the expected value of X is

$$E(X) = \sum_{x=1}^{2} x\,f_1(x) = 1\,f_1(1) + 2\,f_1(2) = \frac{8}{18} + 2\,\frac{10}{18} = \frac{28}{18}.$$

Similarly, the marginal of Y is

$$f_2(y) = \sum_{x=1}^{2} \frac{x + 2y}{18} = \frac{1}{18}(3 + 4y).$$

Hence the expected value of Y is

$$E(Y) = \sum_{y=1}^{2} y\,f_2(y) = 1\,f_2(1) + 2\,f_2(2) = \frac{7}{18} + 2\,\frac{11}{18} = \frac{29}{18}.$$

Further, the product moment of X and Y is given by

$$\begin{aligned} E(XY) &= \sum_{x=1}^{2}\sum_{y=1}^{2} xy\,f(x,y) \\ &= f(1,1) + 2\,f(1,2) + 2\,f(2,1) + 4\,f(2,2) \\ &= \frac{3}{18} + 2\,\frac{5}{18} + 2\,\frac{4}{18} + 4\,\frac{6}{18} \\ &= \frac{3 + 10 + 8 + 24}{18} = \frac{45}{18}. \end{aligned}$$

Hence, the covariance between X and Y is given by

$$\begin{aligned} Cov(X, Y) &= E(XY) - E(X)\,E(Y) \\ &= \frac{45}{18} - \left(\frac{28}{18}\right)\left(\frac{29}{18}\right) \\ &= \frac{(45)(18) - (28)(29)}{(18)(18)} \\ &= \frac{810 - 812}{324} = -\frac{2}{324} = -0.00617. \end{aligned}$$

Remark 8.1. For an arbitrary random variable, the product moment and

covariance may or may not exist. Further, note that unlike variance, the

covariance between two random variables may be negative.

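Example 8.2. Let X and Y have the joint density function

$$f(x,y) = \begin{cases} x + y & \text{if } 0 < x, y < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

What is the covariance between X and Y?

Answer: The marginal density of X is

$$f_1(x) = \int_0^1 (x + y)\,dy = \left[xy + \frac{y^2}{2}\right]_{y=0}^{y=1} = x + \frac{1}{2}.$$

Thus, the expected value of X is given by

$$E(X) = \int_0^1 x\,f_1(x)\,dx = \int_0^1 x\left(x + \frac{1}{2}\right)dx = \left[\frac{x^3}{3} + \frac{x^2}{4}\right]_0^1 = \frac{7}{12}.$$

Similarly (or using the fact that the density is symmetric in x and y), we get

$$E(Y) = \frac{7}{12}.$$

Now, we compute the product moment of X and Y.

$$\begin{aligned} E(XY) &= \int_0^1\int_0^1 xy\,(x + y)\,dx\,dy \\ &= \int_0^1\int_0^1 \left(x^2 y + x y^2\right)dx\,dy \\ &= \int_0^1 \left[\frac{x^3 y}{3} + \frac{x^2 y^2}{2}\right]_{x=0}^{x=1} dy \\ &= \int_0^1 \left(\frac{y}{3} + \frac{y^2}{2}\right)dy \\ &= \left[\frac{y^2}{6} + \frac{y^3}{6}\right]_0^1 = \frac{1}{6} + \frac{1}{6} = \frac{4}{12}. \end{aligned}$$

Hence the covariance between X and Y is given by

$$Cov(X, Y) = E(XY) - E(X)\,E(Y) = \frac{4}{12} - \left(\frac{7}{12}\right)\left(\frac{7}{12}\right) = \frac{48 - 49}{144} = -\frac{1}{144}.$$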

Example 8.3. Let X and Y be continuous random variables with joint density function

$$f(x,y) = \begin{cases} 2 & \text{if } 0 < y < 1 - x;\ 0 < x < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

What is the covariance between X and Y?

Answer: The marginal density of X is given by

$$f_1(x) = \int_0^{1-x} 2\,dy = 2\,(1 - x).$$

Hence the expected value of X is

$$\mu_X = E(X) = \int_0^1 x\,f_1(x)\,dx = \int_0^1 2x\,(1 - x)\,dx = \frac{1}{3}.$$

Similarly, the marginal of Y is

$$f_2(y) = \int_0^{1-y} 2\,dx = 2\,(1 - y).$$

Hence the expected value of Y is

$$\mu_Y = E(Y) = \int_0^1 y\,f_2(y)\,dy = \int_0^1 2y\,(1 - y)\,dy = \frac{1}{3}.$$

The product moment of X and Y is given by

$$\begin{aligned} E(XY) &= \int_0^1\int_0^{1-x} xy\,f(x,y)\,dy\,dx \\ &= \int_0^1\int_0^{1-x} 2xy\,dy\,dx \\ &= 2\int_0^1 x\left[\frac{y^2}{2}\right]_0^{1-x} dx \\ &= 2\cdot\frac{1}{2}\int_0^1 x\,(1 - x)^2\,dx \\ &= \int_0^1 \left(x - 2x^2 + x^3\right)dx \\ &= \left[\frac{1}{2}x^2 - \frac{2}{3}x^3 + \frac{1}{4}x^4\right]_0^1 = \frac{1}{12}. \end{aligned}$$

Therefore, the covariance between X and Y is given by

$$Cov(X, Y) = E(XY) - E(X)\,E(Y) = \frac{1}{12} - \frac{1}{9} = \frac{3}{36} - \frac{4}{36} = -\frac{1}{36}.$$

Theorem 8.2. If X and Y are any two random variables and a, b, c, and d are real constants, then

$$Cov(aX + b,\ cY + d) = ac\,Cov(X, Y).$$

Proof:

$$\begin{aligned} Cov(aX + b,\ cY + d) &= E\left((aX + b)(cY + d)\right) - E(aX + b)\,E(cY + d) \\ &= E(acXY + adX + bcY + bd) - (aE(X) + b)(cE(Y) + d) \\ &= ac\,E(XY) + ad\,E(X) + bc\,E(Y) + bd \\ &\qquad - \left[ac\,E(X)E(Y) + ad\,E(X) + bc\,E(Y) + bd\right] \\ &= ac\left[E(XY) - E(X)\,E(Y)\right] \\ &= ac\,Cov(X, Y). \end{aligned}$$

Example 8.4. If the product moment of X and Y is 3 and the means of X and Y are both equal to 2, then what is the covariance of the random variables $2X + 10$ and $-\frac{5}{2}Y + 3$?

Answer: Since E(XY) = 3 and E(X) = 2 = E(Y), the covariance of X and Y is given by

$$Cov(X, Y) = E(XY) - E(X)\,E(Y) = 3 - 4 = -1.$$

Then the covariance of $2X + 10$ and $-\frac{5}{2}Y + 3$ is given by

$$Cov\left(2X + 10,\ -\frac{5}{2}Y + 3\right) = 2\left(-\frac{5}{2}\right)Cov(X, Y) = (-5)(-1) = 5.$$

Remark 8.2. Notice that Theorem 8.2 can be further improved. That is, if X, Y, Z are three random variables, then

$$Cov(X + Y, Z) = Cov(X, Z) + Cov(Y, Z)$$

and

$$Cov(X, Y + Z) = Cov(X, Y) + Cov(X, Z).$$

The first formula can be established as follows. Consider

$$\begin{aligned} Cov(X + Y, Z) &= E((X + Y)Z) - E(X + Y)\,E(Z) \\ &= E(XZ + YZ) - E(X)E(Z) - E(Y)E(Z) \\ &= E(XZ) - E(X)E(Z) + E(YZ) - E(Y)E(Z) \\ &= Cov(X, Z) + Cov(Y, Z). \end{aligned}$$

8.2. Independence of Random Variables

In this section, we study the effect of independence on the product moment (and hence on the covariance). We begin with a simple theorem.

Theorem 8.3. If X and Y are independent random variables, then

$$E(XY) = E(X)\,E(Y).$$

Proof: Recall that X and Y are independent if and only if

$$f(x,y) = f_1(x)\,f_2(y).$$

Let us assume that X and Y are continuous. Therefore

$$\begin{aligned} E(XY) &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\,f(x,y)\,dx\,dy \\ &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\,f_1(x)\,f_2(y)\,dx\,dy \\ &= \left(\int_{-\infty}^{\infty} x\,f_1(x)\,dx\right)\left(\int_{-\infty}^{\infty} y\,f_2(y)\,dy\right) \\ &= E(X)\,E(Y). \end{aligned}$$

If X and Y are discrete, then replace the integrals by appropriate sums to prove the same result.

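Example 8.5. Let X and Y be two independent random variables with respective density functions:

$$f(x) = \begin{cases} 3x^2 & \text{if } 0 < x < 1 \\ 0 & \text{otherwise} \end{cases}$$

and

$$g(y) = \begin{cases} 4y^3 & \text{if } 0 < y < 1 \\ 0 & \text{otherwise.} \end{cases}$$

What is $E\left(\frac{X}{Y}\right)$?

Answer: Since X and Y are independent, the joint density of X and Y is given by

$$h(x,y) = f(x)\,g(y).$$

Therefore

$$\begin{aligned} E\left(\frac{X}{Y}\right) &= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \frac{x}{y}\,h(x,y)\,dx\,dy \\ &= \int_0^1\int_0^1 \frac{x}{y}\,f(x)\,g(y)\,dx\,dy \\ &= \int_0^1\int_0^1 \frac{x}{y}\,3x^2\,4y^3\,dx\,dy \\ &= \left(\int_0^1 3x^3\,dx\right)\left(\int_0^1 4y^2\,dy\right) \\ &= \left(\frac{3}{4}\right)\left(\frac{4}{3}\right) = 1. \end{aligned}$$

Remark 8.3. The independence of X and Y does not imply $E\left(\frac{X}{Y}\right) = \frac{E(X)}{E(Y)}$ but only implies $E\left(\frac{X}{Y}\right) = E(X)\,E\left(Y^{-1}\right)$. Further, note that $E\left(Y^{-1}\right)$ is not equal to $\frac{1}{E(Y)}$.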

Theorem 8.4. If X and Y are independent random variables, then the covariance between X and Y is always zero, that is

$$Cov(X, Y) = 0.$$

Proof: Suppose X and Y are independent. Then by Theorem 8.3, we have E(XY) = E(X) E(Y). Consider

$$Cov(X, Y) = E(XY) - E(X)\,E(Y) = E(X)\,E(Y) - E(X)\,E(Y) = 0.$$

Example 8.6. Let the random variables X and Y have the joint density

$$f(x,y) = \begin{cases} \frac{1}{4} & \text{if } (x, y) \in \{(0, 1),\ (0, -1),\ (1, 0),\ (-1, 0)\} \\ 0 & \text{otherwise.} \end{cases}$$

What is the covariance of X and Y? Are the random variables X and Y independent?

Answer: The joint density of X and Y is shown in the following table with the marginals $f_1(x)$ and $f_2(y)$ (columns are values of x, rows are values of y).

    (x, y)    x = -1    x = 0    x = 1    f2(y)
    y = -1      0        1/4       0       1/4
    y =  0     1/4        0       1/4      2/4
    y =  1      0        1/4       0       1/4
    f1(x)      1/4       2/4      1/4

From this table, we see that

$$0 = f(0,0) \ne f_1(0)\,f_2(0) = \left(\frac{2}{4}\right)\left(\frac{2}{4}\right) = \frac{1}{4},$$

and thus the equality $f(x,y) = f_1(x)\,f_2(y)$ does not hold for all (x, y) in the range space of the joint variable (X, Y). Therefore X and Y are not independent.

Next, we compute the covariance between X and Y. For this we need E(X), E(Y) and E(XY). The expected value of X is

$$E(X) = \sum_{x=-1}^{1} x\,f_1(x) = (-1)\,f_1(-1) + (0)\,f_1(0) + (1)\,f_1(1) = -\frac{1}{4} + 0 + \frac{1}{4} = 0.$$

Similarly, the expected value of Y is

$$E(Y) = \sum_{y=-1}^{1} y\,f_2(y) = (-1)\,f_2(-1) + (0)\,f_2(0) + (1)\,f_2(1) = -\frac{1}{4} + 0 + \frac{1}{4} = 0.$$

The product moment of X and Y is given by

$$\begin{aligned} E(XY) &= \sum_{x=-1}^{1}\sum_{y=-1}^{1} xy\,f(x,y) \\ &= (1)\,f(-1,-1) + (0)\,f(-1,0) + (-1)\,f(-1,1) \\ &\quad + (0)\,f(0,-1) + (0)\,f(0,0) + (0)\,f(0,1) \\ &\quad + (-1)\,f(1,-1) + (0)\,f(1,0) + (1)\,f(1,1) \\ &= 0. \end{aligned}$$

Hence, the covariance between X and Y is given by

$$Cov(X, Y) = E(XY) - E(X)\,E(Y) = 0.$$

Remark 8.4. This example shows that if the covariance of X and Y is zero

that does not mean the random variables are independent. However, we know

from Theorem 8.4 that if X and Y are independent, then the Cov (X, Y ) is

always zero.
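The distinction between zero covariance and independence in Example 8.6 can be verified directly. The following is a small sketch with exact fractions; the variable names are illustrative and are not taken from the text.

```python
from fractions import Fraction
from itertools import product

# Joint pmf of Example 8.6: mass 1/4 on each of four points.
pmf = {pt: Fraction(1, 4) for pt in [(0, 1), (0, -1), (1, 0), (-1, 0)]}
values = (-1, 0, 1)

def f(x, y):
    return pmf.get((x, y), Fraction(0))

# Marginals of X and Y.
f1 = {x: sum(f(x, y) for y in values) for x in values}
f2 = {y: sum(f(x, y) for x in values) for y in values}

e_x = sum(x * f1[x] for x in values)
e_y = sum(y * f2[y] for y in values)
e_xy = sum(x * y * f(x, y) for x, y in product(values, values))
cov = e_xy - e_x * e_y

# Independence requires f(x, y) = f1(x) f2(y) at every grid point.
independent = all(f(x, y) == f1[x] * f2[y]
                  for x, y in product(values, values))

print("Cov(X, Y) =", cov, "independent:", independent)
```

The covariance vanishes because every point with positive mass has x = 0 or y = 0, while the factorization already fails at (0, 0).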


8.3. Variance of the Linear Combination of Random Variables

Given two random variables, X and Y, we determine the variance of their linear combination, that is, aX + bY.

Theorem 8.5. Let X and Y be any two random variables and let a and b be any two real numbers. Then

$$Var(aX + bY) = a^2\,Var(X) + b^2\,Var(Y) + 2ab\,Cov(X, Y).$$

Proof:

$$\begin{aligned} Var(aX + bY) &= E\left([aX + bY - E(aX + bY)]^2\right) \\ &= E\left([aX + bY - a\,E(X) - b\,E(Y)]^2\right) \\ &= E\left([a(X - \mu_X) + b(Y - \mu_Y)]^2\right) \\ &= E\left(a^2(X - \mu_X)^2 + b^2(Y - \mu_Y)^2 + 2ab\,(X - \mu_X)(Y - \mu_Y)\right) \\ &= a^2 E\left((X - \mu_X)^2\right) + b^2 E\left((Y - \mu_Y)^2\right) + 2ab\,E\left((X - \mu_X)(Y - \mu_Y)\right) \\ &= a^2\,Var(X) + b^2\,Var(Y) + 2ab\,Cov(X, Y). \end{aligned}$$
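As a quick numerical illustration of Theorem 8.5, the identity can be checked on a small joint density. The sketch below reuses the joint density of Example 8.1 and the arbitrary constants a = 2 and b = -3; both choices, as well as the helper names, are assumptions made for this illustration only.

```python
from fractions import Fraction

# Joint pmf borrowed from Example 8.1 (any finite joint pmf would do).
pmf = {(x, y): Fraction(x + 2 * y, 18) for x in (1, 2) for y in (1, 2)}

def expect(g):
    """E(g(X, Y)) for the finite joint pmf above."""
    return sum(g(x, y) * p for (x, y), p in pmf.items())

def var(g):
    """Var(g(X, Y)) = E(g^2) - (E(g))^2."""
    return expect(lambda x, y: g(x, y) ** 2) - expect(g) ** 2

def cov_xy():
    return (expect(lambda x, y: x * y)
            - expect(lambda x, y: x) * expect(lambda x, y: y))

a, b = 2, -3  # arbitrary constants chosen for the check
lhs = var(lambda x, y: a * x + b * y)
rhs = (a ** 2 * var(lambda x, y: x)
       + b ** 2 * var(lambda x, y: y)
       + 2 * a * b * cov_xy())
print(lhs, rhs)  # the two sides agree exactly
```

Because the arithmetic is exact, the two sides can be compared with equality rather than with a tolerance.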

Example 8.7. If Var(X + Y) = 3, Var(X − Y) = 1, E(X) = 1 and E(Y) = 2, then what is E(XY)?

Answer:

$$Var(X + Y) = \sigma_X^2 + \sigma_Y^2 + 2\,Cov(X, Y),$$

$$Var(X - Y) = \sigma_X^2 + \sigma_Y^2 - 2\,Cov(X, Y).$$

Hence, we get

$$Cov(X, Y) = \frac{1}{4}\left[Var(X + Y) - Var(X - Y)\right] = \frac{1}{4}[3 - 1] = \frac{1}{2}.$$

Therefore, the product moment of X and Y is given by

$$E(XY) = Cov(X, Y) + E(X)\,E(Y) = \frac{1}{2} + (1)(2) = \frac{5}{2}.$$

Example 8.8. Let X and Y be random variables with Var(X) = 4, Var(Y) = 9 and Var(X − Y) = 16. What is Cov(X, Y)?

Answer:

$$Var(X - Y) = Var(X) + Var(Y) - 2\,Cov(X, Y)$$

$$16 = 4 + 9 - 2\,Cov(X, Y).$$

Hence

$$Cov(X, Y) = -\frac{3}{2}.$$

Remark 8.5. Theorem 8.5 can be extended to three or more random variables. In the case of three random variables X, Y, Z, we have

$$Var(X + Y + Z) = Var(X) + Var(Y) + Var(Z) + 2\,Cov(X, Y) + 2\,Cov(Y, Z) + 2\,Cov(Z, X).$$

To see this, consider

$$\begin{aligned} Var(X + Y + Z) &= Var((X + Y) + Z) \\ &= Var(X + Y) + Var(Z) + 2\,Cov(X + Y, Z) \\ &= Var(X + Y) + Var(Z) + 2\,Cov(X, Z) + 2\,Cov(Y, Z) \\ &= Var(X) + Var(Y) + 2\,Cov(X, Y) + Var(Z) + 2\,Cov(X, Z) + 2\,Cov(Y, Z) \\ &= Var(X) + Var(Y) + Var(Z) + 2\,Cov(X, Y) + 2\,Cov(Y, Z) + 2\,Cov(Z, X). \end{aligned}$$

Theorem 8.6. If X and Y are independent random variables with E(X) = 0 = E(Y), then

$$Var(XY) = Var(X)\,Var(Y).$$

Proof:

$$\begin{aligned} Var(XY) &= E\left((XY)^2\right) - \left(E(X)\,E(Y)\right)^2 \\ &= E\left((XY)^2\right) \\ &= E\left(X^2 Y^2\right) \\ &= E\left(X^2\right)E\left(Y^2\right) \qquad \text{(by independence of X and Y)} \\ &= Var(X)\,Var(Y). \end{aligned}$$

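Example 8.9. Let X and Y be independent random variables, each with density

$$f(x) = \begin{cases} \frac{1}{2\theta} & \text{for } -\theta < x < \theta \\ 0 & \text{otherwise.} \end{cases}$$

If $Var(XY) = \frac{64}{9}$, then what is the value of $\theta$?

Answer:

$$E(X) = \int_{-\theta}^{\theta} \frac{1}{2\theta}\,x\,dx = \frac{1}{2\theta}\left[\frac{x^2}{2}\right]_{-\theta}^{\theta} = 0.$$

Since Y has the same density, we conclude that E(Y) = 0. Hence

$$\begin{aligned} \frac{64}{9} &= Var(XY) \\ &= Var(X)\,Var(Y) \\ &= \left(\int_{-\theta}^{\theta} \frac{1}{2\theta}\,x^2\,dx\right)\left(\int_{-\theta}^{\theta} \frac{1}{2\theta}\,y^2\,dy\right) \\ &= \left(\frac{\theta^2}{3}\right)\left(\frac{\theta^2}{3}\right) = \frac{\theta^4}{9}. \end{aligned}$$

Hence, we obtain

$$\theta^4 = 64 \qquad \text{or} \qquad \theta = 2\sqrt{2}.$$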

8.4. Correlation and Independence

The functional dependency of the random variable Y on the random variable X can be obtained by examining the correlation coefficient. The definition of the correlation coefficient $\rho$ between X and Y is given below.

Definition 8.3. Let X and Y be two random variables with variances $\sigma_X^2$ and $\sigma_Y^2$, respectively. Let the covariance of X and Y be Cov(X, Y). Then the correlation coefficient $\rho$ between X and Y is given by

$$\rho = \frac{Cov(X, Y)}{\sigma_X\,\sigma_Y}.$$

Theorem 8.7. If X and Y are independent, the correlation coefficient between X and Y is zero.

Proof:

$$\rho = \frac{Cov(X, Y)}{\sigma_X\,\sigma_Y} = \frac{0}{\sigma_X\,\sigma_Y} = 0.$$

Remark 8.4. The converse of this theorem is not true. If the correlation coefficient of X and Y is zero, then X and Y are said to be uncorrelated.

Lemma 8.1. If $X^*$ and $Y^*$ are the standardizations of the random variables X and Y, respectively, the correlation coefficient between $X^*$ and $Y^*$ is equal to the correlation coefficient between X and Y.

Proof: Let $\rho^*$ be the correlation coefficient between $X^*$ and $Y^*$. Further, let $\rho$ denote the correlation coefficient between X and Y. We will show that $\rho^* = \rho$. Consider

$$\begin{aligned} \rho^* &= \frac{Cov(X^*, Y^*)}{\sigma_{X^*}\,\sigma_{Y^*}} \\ &= Cov(X^*, Y^*) \\ &= Cov\left(\frac{X - \mu_X}{\sigma_X},\ \frac{Y - \mu_Y}{\sigma_Y}\right) \\ &= \frac{1}{\sigma_X\,\sigma_Y}\,Cov(X - \mu_X,\ Y - \mu_Y) \\ &= \frac{Cov(X, Y)}{\sigma_X\,\sigma_Y} \\ &= \rho. \end{aligned}$$

This lemma states that the value of the correlation coefficient between two random variables does not change when they are standardized.

Theorem 8.8. For any random variables X and Y, the correlation coefficient $\rho$ satisfies

$$-1 \le \rho \le 1,$$

and $\rho = 1$ or $\rho = -1$ implies that the random variable Y = aX + b, where a and b are arbitrary real constants with $a \ne 0$.

Proof: Let $\mu_X$ be the mean of X and $\mu_Y$ be the mean of Y, and let $\sigma_X^2$ and $\sigma_Y^2$ be the variances of X and Y, respectively. Further, let

$$X^* = \frac{X - \mu_X}{\sigma_X} \qquad \text{and} \qquad Y^* = \frac{Y - \mu_Y}{\sigma_Y}$$

be the standardizations of X and Y, respectively. Then

$$\mu_{X^*} = 0 \quad \text{and} \quad \sigma_{X^*}^2 = 1,$$

and

$$\mu_{Y^*} = 0 \quad \text{and} \quad \sigma_{Y^*}^2 = 1.$$

Thus

$$\begin{aligned} Var(X^* - Y^*) &= Var(X^*) + Var(Y^*) - 2\,Cov(X^*, Y^*) \\ &= \sigma_{X^*}^2 + \sigma_{Y^*}^2 - 2\rho^*\,\sigma_{X^*}\,\sigma_{Y^*} \\ &= 1 + 1 - 2\rho^* \\ &= 1 + 1 - 2\rho \qquad \text{(by Lemma 8.1)} \\ &= 2(1 - \rho). \end{aligned}$$

Since the variance of a random variable is always nonnegative, we get

$$2\,(1 - \rho) \ge 0,$$

which is

$$\rho \le 1.$$

By a similar argument, using $Var(X^* + Y^*)$, one can show that $-1 \le \rho$. Hence, we have $-1 \le \rho \le 1$. Now, we show that if $\rho = 1$ or $\rho = -1$, then Y and X are related through an affine transformation. Consider the case $\rho = 1$; then

$$Var(X^* - Y^*) = 0.$$

But if the variance of a random variable is 0, then all the probability mass is concentrated at a point (that is, the distribution of the corresponding random variable is degenerate). Thus $Var(X^* - Y^*) = 0$ implies $X^* - Y^*$ takes only one value. But $E[X^* - Y^*] = 0$. Thus, we get

$$X^* - Y^* \equiv 0 \qquad \text{or} \qquad X^* \equiv Y^*.$$

Hence

$$\frac{X - \mu_X}{\sigma_X} = \frac{Y - \mu_Y}{\sigma_Y}.$$

Solving this for Y in terms of X, we get

$$Y = aX + b$$

where

$$a = \frac{\sigma_Y}{\sigma_X} \qquad \text{and} \qquad b = \mu_Y - a\,\mu_X.$$

Thus if $\rho = 1$, then Y is linear in X. Similarly, we can show that for the case $\rho = -1$, the random variables X and Y are linearly related. This completes the proof of the theorem.

8.5. Moment Generating Functions

Similar to the moment generating function for the univariate case, one can define the moment generating function for the bivariate case to compute the various product moments. The moment generating function for the bivariate case is defined as follows:

Definition 8.4. Let X and Y be two random variables with joint density function f(x, y). A real valued function $M : \mathbb{R}^2 \to \mathbb{R}$ defined by

$$M(s, t) = E\left(e^{sX + tY}\right)$$

is called the joint moment generating function of X and Y if this expected value exists for all s in some interval $-h < s < h$ and for all t in some interval $-k < t < k$, for some positive h and k.

It is easy to see from this definition that

$$M(s, 0) = E\left(e^{sX}\right)$$

and

$$M(0, t) = E\left(e^{tY}\right).$$

From this we see that

$$E(X^k) = \left.\frac{\partial^k M(s,t)}{\partial s^k}\right|_{(0,0)}, \qquad E(Y^k) = \left.\frac{\partial^k M(s,t)}{\partial t^k}\right|_{(0,0)}$$

for k = 1, 2, 3, 4, ...; and

$$E(XY) = \left.\frac{\partial^2 M(s,t)}{\partial s\,\partial t}\right|_{(0,0)}.$$

Example 8.10. Let the random variables X and Y have the joint density

$$f(x,y) = \begin{cases} e^{-y} & \text{for } 0 < x < y < \infty \\ 0 & \text{otherwise.} \end{cases}$$

What is the joint moment generating function for X and Y?

Answer: The joint moment generating function of X and Y is given by

$$\begin{aligned} M(s, t) &= E\left(e^{sX + tY}\right) \\ &= \int_0^{\infty}\int_x^{\infty} e^{sx + ty}\,f(x,y)\,dy\,dx \\ &= \int_0^{\infty}\int_x^{\infty} e^{sx + ty}\,e^{-y}\,dy\,dx \\ &= \int_0^{\infty}\int_x^{\infty} e^{sx + ty - y}\,dy\,dx \\ &= \frac{1}{(1 - s - t)(1 - t)}, \qquad \text{provided } s + t < 1 \text{ and } t < 1. \end{aligned}$$

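Example 8.11. If the joint moment generating function of the random variables X and Y is

$$M(s, t) = e^{\,s + 3t + 2s^2 + 18t^2 + 12st},$$

what is the covariance of X and Y?

Answer:

$$M(s, t) = e^{\,s + 3t + 2s^2 + 18t^2 + 12st}$$

$$\frac{\partial M}{\partial s} = (1 + 4s + 12t)\,M(s, t), \qquad \left.\frac{\partial M}{\partial s}\right|_{(0,0)} = 1 \cdot M(0, 0) = 1.$$

$$\frac{\partial M}{\partial t} = (3 + 36t + 12s)\,M(s, t), \qquad \left.\frac{\partial M}{\partial t}\right|_{(0,0)} = 3 \cdot M(0, 0) = 3.$$

Hence

$$\mu_X = 1 \qquad \text{and} \qquad \mu_Y = 3.$$

Now we compute the product moment of X and Y.

$$\begin{aligned} \frac{\partial^2 M(s,t)}{\partial s\,\partial t} &= \frac{\partial}{\partial t}\left(\frac{\partial M}{\partial s}\right) \\ &= \frac{\partial}{\partial t}\left(M(s, t)\,(1 + 4s + 12t)\right) \\ &= (1 + 4s + 12t)\,\frac{\partial M}{\partial t} + M(s, t)\,(12). \end{aligned}$$

Therefore

$$\left.\frac{\partial^2 M(s,t)}{\partial s\,\partial t}\right|_{(0,0)} = 1\,(3) + 1\,(12).$$

Thus

$$E(XY) = 15$$

and the covariance of X and Y is given by

$$Cov(X, Y) = E(XY) - E(X)\,E(Y) = 15 - (3)(1) = 12.$$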
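The derivatives in Example 8.11 can be approximated numerically as a cross-check. The sketch below uses central finite differences, which is a technique substituted here for the symbolic differentiation in the text; the step sizes and function names are assumptions, and the results are approximate.

```python
import math

def M(s, t):
    """Joint moment generating function of Example 8.11."""
    return math.exp(s + 3 * t + 2 * s ** 2 + 18 * t ** 2 + 12 * s * t)

def d_ds(f, h=1e-5):
    """Central difference approximation of df/ds at (0, 0)."""
    return (f(h, 0.0) - f(-h, 0.0)) / (2 * h)

def d_dt(f, h=1e-5):
    """Central difference approximation of df/dt at (0, 0)."""
    return (f(0.0, h) - f(0.0, -h)) / (2 * h)

def d2_dsdt(f, h=1e-4):
    """Central difference approximation of the mixed partial at (0, 0)."""
    return (f(h, h) - f(h, -h) - f(-h, h) + f(-h, -h)) / (4 * h * h)

mu_x = d_ds(M)      # approximates E(X)
mu_y = d_dt(M)      # approximates E(Y)
e_xy = d2_dsdt(M)   # approximates E(XY)
cov = e_xy - mu_x * mu_y
print(mu_x, mu_y, e_xy, cov)
```

The printed values should be close to the exact results 1, 3, 15 and 12 obtained above, up to discretization and rounding error.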

Theorem 8.9. If X and Y are independent, then

$$M_{aX + bY}(t) = M_X(at)\,M_Y(bt),$$

where a and b are real parameters.

Proof: Let W = aX + bY. Hence

$$\begin{aligned} M_{aX + bY}(t) &= M_W(t) \\ &= E\left(e^{tW}\right) \\ &= E\left(e^{t(aX + bY)}\right) \\ &= E\left(e^{taX}\,e^{tbY}\right) \\ &= E\left(e^{taX}\right)E\left(e^{tbY}\right) \qquad \text{(by Theorem 8.3)} \\ &= M_X(at)\,M_Y(bt). \end{aligned}$$

This theorem is very powerful. It helps us to find the distribution of a linear combination of independent random variables. The following examples illustrate how one can use this theorem to determine the distribution of a linear combination.

Example 8.12. Suppose the random variable X is normal with mean 2 and

standard deviation 3 and the random variable Y is also normal with mean

0 and standard deviation 4. If X and Y are independent, then what is the

probability distribution of the random variable X +Y ?

Answer: Since X⇠ N (2, 9), the moment generating function of X is given

MX (t) = eµt+ 1

2 2 t 2 =e2t+ 9

2t 2 .

Similarly, since Y⇠ N (0,16),

MY (t) = eµt+ 1

2 2 t 2 =e 16

2t 2 .

Since X and Y are independent, the moment generating function of X + Y

is given by

MX+Y (t) = MX (t)MY (t)

=e2t+ 9

2t 2 e 16

2t 2

=e2t+ 25

2t 2 .

Hence X +Y⇠ N (2, 25). Thus, X +Y has a normal distribution with mean

2 and variance 25. From this information we can ﬁnd the probability density

function of W =X +Y as

f( w) = 1

p50⇡ e 1

2( w2

5) 2 ,1 < w < 1.

Product Moments of Bivariate Random Variables 234

Remark 8.6. In fact, if X and Y are independent normal random variables with means $\mu_X$ and $\mu_Y$ and variances $\sigma_X^2$ and $\sigma_Y^2$, respectively, then aX + bY is also normal with mean $a\mu_X + b\mu_Y$ and variance $a^2\sigma_X^2 + b^2\sigma_Y^2$.
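As a small numerical illustration of Example 8.12 and Remark 8.6, the sketch below compares the product of the two normal moment generating functions with the moment generating function of the claimed normal distribution at a few values of t. The function names are hypothetical and the test points are arbitrary.

```python
import math

def normal_mgf(t, mean, variance):
    """MGF of N(mean, variance): exp(mean*t + variance*t^2/2)."""
    return math.exp(mean * t + 0.5 * variance * t * t)

def linear_combination_params(a, mean_x, var_x, b, mean_y, var_y):
    """Mean and variance of aX + bY for independent normal X and Y."""
    return a * mean_x + b * mean_y, a * a * var_x + b * b * var_y

# Example 8.12: X ~ N(2, 9), Y ~ N(0, 16), W = X + Y (a = b = 1).
mean_w, var_w = linear_combination_params(1, 2, 9, 1, 0, 16)

# Largest gap between M_X(at) M_Y(bt) and the MGF of N(mean_w, var_w).
max_gap = max(
    abs(normal_mgf(t, 2, 9) * normal_mgf(t, 0, 16) - normal_mgf(t, mean_w, var_w))
    for t in (-0.3, 0.1, 0.5)
)
print(mean_w, var_w, max_gap)
```

Agreement at a handful of points is only a sanity check; the identity itself follows from Theorem 8.9.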

Example 8.13. Let X and Y be two independent and identically distributed

random variables. If their common distribution is chi-square with one degree

of freedom, then what is the distribution of X + Y? What is the moment

generating function of X − Y?

Answer: Since X and Y are both $\chi^2(1)$, the moment generating functions are

$$M_X(t) = \frac{1}{\sqrt{1-2t}} \qquad \text{and} \qquad M_Y(t) = \frac{1}{\sqrt{1-2t}}.$$

Since the random variables X and Y are independent, the moment generating function of X + Y is given by

$$M_{X+Y}(t) = M_X(t)\, M_Y(t) = \frac{1}{\sqrt{1-2t}}\,\frac{1}{\sqrt{1-2t}} = \frac{1}{(1-2t)^{\frac{2}{2}}}.$$

Hence $X + Y \sim \chi^2(2)$. Thus, if X and Y are independent chi-square random variables, then their sum is also a chi-square random variable.

Next, we show that X − Y is not a chi-square random variable, even if X and Y are both chi-square.

$$M_{X-Y}(t) = M_X(t)\, M_Y(-t) = \frac{1}{\sqrt{1-2t}}\,\frac{1}{\sqrt{1+2t}} = \frac{1}{\sqrt{1-4t^2}}.$$

This moment generating function does not correspond to the moment generating function of a chi-square random variable with any number of degrees of freedom. Further, it does not correspond to that of any of the familiar distributions.

Remark 8.7. If X and Y are chi-square and independent random variables,

then their linear combination is not necessarily a chi-square random variable.

Example 8.14. Let X and Y be two independent Bernoulli random variables

with parameter p . What is the distribution of X +Y ?

Answer: Since X and Y are Bernoulli with parameter p, their moment generating functions are

$$M_X(t) = (1-p) + p\,e^t \qquad \text{and} \qquad M_Y(t) = (1-p) + p\,e^t.$$

Since X and Y are independent, the moment generating function of their sum is the product of their moment generating functions, that is

$$M_{X+Y}(t) = M_X(t)\, M_Y(t) = \left(1 - p + p\,e^t\right)\left(1 - p + p\,e^t\right) = \left(1 - p + p\,e^t\right)^2.$$

Hence $X + Y \sim BIN(2, p)$. Thus the sum of two independent Bernoulli random variables is a binomial random variable with parameters 2 and p.
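The same conclusion can be reached without moment generating functions by convolving the two probability mass functions directly. The sketch below does this with exact fractions; the value p = 1/3 and the helper name `convolve` are illustrative assumptions.

```python
from fractions import Fraction
from math import comb

def convolve(pmf_a, pmf_b):
    """pmf of A + B for independent discrete A and B given as {value: prob}."""
    out = {}
    for a, prob_a in pmf_a.items():
        for b, prob_b in pmf_b.items():
            out[a + b] = out.get(a + b, 0) + prob_a * prob_b
    return out

p = Fraction(1, 3)                      # illustrative parameter value
bernoulli = {0: 1 - p, 1: p}            # pmf of a Bernoulli(p) variable
sum_pmf = convolve(bernoulli, bernoulli)

# pmf of BIN(2, p) written out from the binomial formula.
binomial = {k: comb(2, k) * p**k * (1 - p)**(2 - k) for k in range(3)}
print(sum_pmf == binomial)
```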

8.6. Review Exercises

1. Suppose that $X_1$ and $X_2$ are random variables with zero mean and unit variance. If the correlation coefficient of $X_1$ and $X_2$ is 0.5, then what is the variance of $Y = \sum_{k=1}^{2} k^2 X_k$?

2. If the joint density of the random variables X and Y is

$$f(x,y) = \begin{cases} \frac{1}{8} & \text{if } (x,y) \in \{(x,0),\,(0,-y) \mid x, y = -2, -1, 1, 2\} \\ 0 & \text{otherwise,} \end{cases}$$

what is the covariance of X and Y? Are X and Y independent?

3. Suppose the random variables X and Y are independent and identically distributed. Let Z = aX + Y. If the correlation coefficient between X and Z is $\frac{1}{3}$, then what is the value of the constant a?

4. Let X and Y be two independent random variables with chi-square distribution with 2 degrees of freedom. What is the moment generating function of the random variable 2X + 3Y? If possible, what is the distribution of 2X + 3Y?

5. Let X and Y be two independent random variables. If $X \sim BIN(n, p)$ and $Y \sim BIN(m, p)$, then what is the distribution of X + Y?

6. Let X and Y be two independent random variables. If X and Yare

both standard normal, then what is the distribution of the random variable

2X 2 +Y 2 ?

7. If the joint probability density function of X and Y is

$$f(x,y) = \begin{cases} 1 & \text{if } 0 < x < 1;\ 0 < y < 1 \\ 0 & \text{elsewhere,} \end{cases}$$

then what is the joint moment generating function of X and Y?

8. Let the joint density function of X and Y be

$$f(x,y) = \begin{cases} \frac{1}{36} & \text{if } 1 \le x = y \le 6 \\ \frac{2}{36} & \text{if } 1 \le x < y \le 6. \end{cases}$$

What is the correlation coefficient of X and Y?

9. Suppose that X and Y are random variables with joint moment generating function

$$M(s,t) = \left(\frac{1}{4}e^{s} + \frac{3}{8}e^{t} + \frac{3}{8}\right)^{10}$$

for all real s and t. What is the covariance of X and Y?

10. Suppose that X and Y are random variables with joint density function

$$f(x,y) = \begin{cases} \frac{1}{6\pi} & \text{for } \frac{x^2}{4} + \frac{y^2}{9} \le 1 \\ 0 & \text{for } \frac{x^2}{4} + \frac{y^2}{9} > 1. \end{cases}$$

What is the covariance of X and Y? Are X and Y independent?

11. Let X and Y be two random variables. Suppose E(X) = 1, E(Y) = 2, Var(X) = 1, Var(Y) = 2, and Cov(X, Y) = $\frac{1}{2}$. For what values of the constants a and b does the random variable aX + bY, whose expected value is 3, have minimum variance?

12. A box contains 5 white balls and 3 black balls. Draw 2 balls without replacement. If X represents the number of white balls and Y represents the number of black balls drawn, what is the covariance of X and Y?

13. If X represents the number of 1's and Y represents the number of 5's in three tosses of a fair six-sided die, what is the correlation between X and Y?

14. Let Y and Z be two random variables. If Var(Y) = 4, Var(Z) = 16, and Cov(Y, Z) = 2, then what is Var(3Z − 2Y)?

15. Three random variables $X_1, X_2, X_3$ have equal variances $\sigma^2$ and coefficient of correlation between $X_1$ and $X_2$ of $\rho$ and between $X_1$ and $X_3$ and between $X_2$ and $X_3$ of zero. What is the correlation between Y and Z where $Y = X_1 + X_2$ and $Z = X_2 + X_3$?

16. If X and Y are two independent Bernoulli random variables with parameter p, then what is the joint moment generating function of X − Y?

17. If $X_1, X_2, \ldots, X_n$ are normal random variables with variance $\sigma^2$ and covariance between any pair of random variables $\rho\sigma^2$, what is the variance of $\frac{1}{n}(X_1 + X_2 + \cdots + X_n)$?

18. The coefficient of correlation between X and Y is $\frac{1}{3}$ and $\sigma_X^2 = a$, $\sigma_Y^2 = 4a$, and $\sigma_Z^2 = 114$ where Z = 3X − 4Y. What is the value of the constant a?

19. Let X and Y be independent random variables with E(X) = 1, E(Y) = 2, and Var(X) = Var(Y) = $\sigma^2$. For what value of the constant k is the expected value of the random variable $k(X^2 - Y^2) + Y^2$ equal to $\sigma^2$?

20. Let X be a random variable with finite variance. If Y = 15 − X, then what is the coefficient of correlation between the random variables X and (X + Y)X?

Chapter 9

CONDITIONAL EXPECTATIONS OF BIVARIATE RANDOM VARIABLES

This chapter examines the conditional mean and conditional variance

associated with two random variables. The conditional mean is very useful

in Bayesian estimation of parameters with a square loss function. Further, the

notion of conditional mean sets the path for regression analysis in statistics.

9.1. Conditional Expected Values

Let X and Y be any two random variables with joint density f (x, y).

Recall that the conditional probability density of X, given the event Y = y, is defined as

$$g(x/y) = \frac{f(x,y)}{f_2(y)}, \qquad f_2(y) > 0,$$

where $f_2(y)$ is the marginal probability density of Y. Similarly, the conditional probability density of Y, given the event X = x, is defined as

$$h(y/x) = \frac{f(x,y)}{f_1(x)}, \qquad f_1(x) > 0,$$

where $f_1(x)$ is the marginal probability density of X.

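Definition 9.1. The conditional mean of X given Y = y is defined as

$$\mu_{X|y} = E(X \mid y),$$

where

$$E(X \mid y) = \begin{cases} \displaystyle\sum_{x \in R_X} x\, g(x/y) & \text{if } X \text{ is discrete} \\[2ex] \displaystyle\int_{-\infty}^{\infty} x\, g(x/y)\, dx & \text{if } X \text{ is continuous.} \end{cases}$$

Similarly, the conditional mean of Y given X = x is defined as

$$\mu_{Y|x} = E(Y \mid x),$$

where

$$E(Y \mid x) = \begin{cases} \displaystyle\sum_{y \in R_Y} y\, h(y/x) & \text{if } Y \text{ is discrete} \\[2ex] \displaystyle\int_{-\infty}^{\infty} y\, h(y/x)\, dy & \text{if } Y \text{ is continuous.} \end{cases}$$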

Example 9.1. Let X and Y be discrete random variables with joint proba-

bility density function

$$f(x,y) = \begin{cases} \frac{1}{21}(x+y) & \text{for } x = 1, 2, 3;\ y = 1, 2 \\ 0 & \text{otherwise.} \end{cases}$$

What is the conditional mean of X given Y = y, that is E(X|y)?

Answer: To compute the conditional mean of X given Y = y, we need the conditional density g(x/y) of X given Y = y. However, to find g(x/y), we need to know the marginal of Y, that is $f_2(y)$. Thus, we begin with

$$f_2(y) = \sum_{x=1}^{3} \frac{1}{21}(x+y) = \frac{1}{21}(6 + 3y).$$

Therefore, the conditional density of X given Y = y is given by

$$g(x/y) = \frac{f(x,y)}{f_2(y)} = \frac{x+y}{6+3y}, \qquad x = 1, 2, 3.$$

The conditional expected value of X given the event Y = y is

$$\begin{aligned}
E(X \mid y) &= \sum_{x \in R_X} x\, g(x/y) \\
&= \sum_{x=1}^{3} x\, \frac{x+y}{6+3y} \\
&= \frac{1}{6+3y}\left[\sum_{x=1}^{3} x^2 + y \sum_{x=1}^{3} x\right] \\
&= \frac{14 + 6y}{6 + 3y}, \qquad y = 1, 2.
\end{aligned}$$

Remark 9.1. Note that the conditional mean of X given Y = y is dependent only on y, that is E(X|y) is a function $\phi$ of y. In the above example, this function $\phi$ is a rational function, namely $\phi(y) = \frac{14+6y}{6+3y}$.
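The computation in Example 9.1 can be reproduced with exact rational arithmetic. The sketch below builds the conditional mean from the joint pmf and compares it with the rational function in Remark 9.1; the function names are hypothetical.

```python
from fractions import Fraction

X_VALUES = (1, 2, 3)
Y_VALUES = (1, 2)

def joint_pmf(x, y):
    """Joint pmf of Example 9.1: (x + y)/21 on the stated support."""
    if x in X_VALUES and y in Y_VALUES:
        return Fraction(x + y, 21)
    return Fraction(0)

def conditional_mean_x_given_y(y):
    """E(X | y) = sum_x x f(x, y) / f2(y)."""
    marginal_y = sum(joint_pmf(x, y) for x in X_VALUES)
    return sum(x * joint_pmf(x, y) for x in X_VALUES) / marginal_y

for y in Y_VALUES:
    closed_form = Fraction(14 + 6 * y, 6 + 3 * y)
    print(y, conditional_mean_x_given_y(y), closed_form)
```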

Example 9.2. Let X and Y have the joint density function

$$f(x,y) = \begin{cases} x + y & \text{for } 0 < x, y < 1 \\ 0 & \text{otherwise.} \end{cases}$$

What is the conditional mean $E\left(Y \mid X = \frac{1}{3}\right)$?

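Answer:

$$f_1(x) = \int_0^1 (x+y)\, dy = \left[xy + \frac{1}{2}y^2\right]_0^1 = x + \frac{1}{2}.$$

$$h(y/x) = \frac{f(x,y)}{f_1(x)} = \frac{x+y}{x + \frac{1}{2}}.$$

$$\begin{aligned}
E\left(Y \mid X = \tfrac{1}{3}\right) &= \int_0^1 y\, h(y/x)\, dy \\
&= \int_0^1 y\, \frac{x+y}{x+\frac{1}{2}}\, dy \\
&= \int_0^1 y\, \frac{\frac{1}{3}+y}{\frac{5}{6}}\, dy \\
&= \frac{6}{5}\int_0^1 \left(\frac{1}{3}y + y^2\right) dy \\
&= \frac{6}{5}\left[\frac{1}{6}y^2 + \frac{1}{3}y^3\right]_0^1 \\
&= \frac{6}{5}\left[\frac{1}{6} + \frac{2}{6}\right] \\
&= \frac{6}{5}\cdot\frac{3}{6} = \frac{3}{5}.
\end{aligned}$$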
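A numerical cross-check of Example 9.2 is sketched below. It approximates the two integrals with a midpoint rule, which is an illustrative choice rather than anything prescribed by the text, and the helper names are hypothetical.

```python
def joint_density(x, y):
    """Joint density of Example 9.2: x + y on the open unit square."""
    return x + y if 0 < x < 1 and 0 < y < 1 else 0.0

def midpoint_integral(func, lo, hi, n=10_000):
    """Midpoint-rule approximation of the integral of func over [lo, hi]."""
    width = (hi - lo) / n
    return width * sum(func(lo + (i + 0.5) * width) for i in range(n))

def conditional_mean_y_given_x(x):
    """E(Y | X = x) = (integral of y f(x, y) dy) / f1(x)."""
    marginal_x = midpoint_integral(lambda y: joint_density(x, y), 0.0, 1.0)
    numerator = midpoint_integral(lambda y: y * joint_density(x, y), 0.0, 1.0)
    return numerator / marginal_x

result = conditional_mean_y_given_x(1 / 3)
print(result)  # should be close to 3/5
```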

The mean of the random variable Y is a deterministic number. The conditional mean of Y given X = x, that is E(Y|x), is a function $\phi(x)$ of the variable x. Using this function, we form $\phi(X)$. This function $\phi(X)$ is a random variable. Thus starting from the deterministic function E(Y|x), we have formed the random variable $E(Y|X) = \phi(X)$. An important property of conditional expectation is given by the following theorem.

Theorem 9.1. The expected value of the random variable E(Y|X) is equal to the expected value of Y, that is

$$E_x\left(E_{y|x}(Y \mid X)\right) = E_y(Y),$$

where $E_x(X)$ stands for the expectation of X with respect to the distribution of X and $E_{y|x}(Y \mid X)$ stands for the expected value of Y with respect to the conditional density h(y/X).

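Proof: We prove this theorem for continuous variables and leave the discrete case to the reader.

$$\begin{aligned}
E_x\left(E_{y|x}(Y \mid X)\right) &= E_x\left[\int_{-\infty}^{\infty} y\, h(y/X)\, dy\right] \\
&= \int_{-\infty}^{\infty}\left(\int_{-\infty}^{\infty} y\, h(y/x)\, dy\right) f_1(x)\, dx \\
&= \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y\, h(y/x)\, f_1(x)\, dy\, dx \\
&= \int_{-\infty}^{\infty}\left(\int_{-\infty}^{\infty} h(y/x)\, f_1(x)\, dx\right) y\, dy \\
&= \int_{-\infty}^{\infty}\left(\int_{-\infty}^{\infty} f(x,y)\, dx\right) y\, dy \\
&= \int_{-\infty}^{\infty} y\, f_2(y)\, dy \\
&= E_y(Y).
\end{aligned}$$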

Example 9.3. An insect lays Y number of eggs, where Y has a Poisson distribution with parameter $\lambda$. If the probability of each egg surviving is p, then on the average how many eggs will survive?

Answer: Let X denote the number of surviving eggs. Then, given that Y = y (that is, given that the insect has laid y eggs), the random variable X has a binomial distribution with parameters y and p. Thus

$$X \mid Y \sim BIN(Y, p), \qquad Y \sim POI(\lambda).$$

Therefore, the expected number of survivors is given by

$$\begin{aligned}
E_x(X) &= E_y\left(E_{x|y}(X \mid Y)\right) \\
&= E_y(pY) \qquad (\text{since } X \mid Y \sim BIN(Y, p)) \\
&= p\, E_y(Y) \\
&= p\lambda. \qquad (\text{since } Y \sim POI(\lambda))
\end{aligned}$$
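The iterated expectation in Example 9.3 can be evaluated numerically by summing the conditional mean py against the Poisson probabilities. The sketch below truncates the infinite sum at an assumed cutoff `max_eggs`; the parameter values are illustrative only.

```python
import math

def expected_survivors(lam, p, max_eggs=200):
    """Approximate E_y(E(X | Y)) = sum_y (p*y) P(Y = y) for Y ~ POI(lam)."""
    pmf = math.exp(-lam)            # P(Y = 0)
    total = 0.0
    for y in range(1, max_eggs + 1):
        pmf *= lam / y              # P(Y = y) obtained from P(Y = y - 1)
        total += (p * y) * pmf      # conditional mean p*y weighted by P(Y = y)
    return total

lam, p = 4.0, 0.25                  # illustrative values, not from the text
print(expected_survivors(lam, p), p * lam)
```

The recursive update of the pmf avoids computing large factorials directly; the truncation error is negligible when `max_eggs` is far above `lam`.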

Deﬁnition 9.2. A random variable X is said to have a mixture distribution

if the distribution of X depends on a quantity which also has a distribution.

Example 9.4. A fair coin is tossed. If a head occurs, 1 die is rolled; if a tail

occurs, 2 dice are rolled. Let Y be the total on the die or dice. What is the

expected value of Y?

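Answer: Let X denote the outcome of tossing a coin. Then $X \sim BER(p)$, where the probability of success is $p = \frac{1}{2}$.

$$\begin{aligned}
E_y(Y) &= E_x\left(E_{y|x}(Y \mid X)\right) \\
&= \frac{1}{2}\, E_{y|x}(Y \mid X = 0) + \frac{1}{2}\, E_{y|x}(Y \mid X = 1) \\
&= \frac{1}{2}\left(\frac{1 + 2 + 3 + 4 + 5 + 6}{6}\right) + \frac{1}{2}\left(\frac{2 + 6 + 12 + 20 + 30 + 42 + 40 + 36 + 30 + 22 + 12}{36}\right) \\
&= \frac{1}{2}\left(\frac{126}{36} + \frac{252}{36}\right) \\
&= \frac{378}{72} \\
&= 5.25.
\end{aligned}$$

Note that the expected number of dots that show when 1 die is rolled is $\frac{126}{36}$, and the expected number of dots that show when 2 dice are rolled is $\frac{252}{36}$.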
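Example 9.4 can be verified by brute-force enumeration. The sketch below lists every outcome for one die and for two dice, computes each conditional mean exactly, and then averages them with the coin probabilities; the names are hypothetical.

```python
from fractions import Fraction
from itertools import product

FACES = range(1, 7)

def expected_total(num_dice):
    """Exact expected total when num_dice fair dice are rolled."""
    outcomes = list(product(FACES, repeat=num_dice))
    return Fraction(sum(sum(outcome) for outcome in outcomes), len(outcomes))

# Fair coin: one die with probability 1/2, two dice with probability 1/2.
dice_count_pmf = {1: Fraction(1, 2), 2: Fraction(1, 2)}

expected_y = sum(prob * expected_total(n) for n, prob in dice_count_pmf.items())
print(expected_total(1), expected_total(2), expected_y)
```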

Theorem 9.2. Let X and Y be two random variables with mean $\mu_X$ and $\mu_Y$, and standard deviation $\sigma_X$ and $\sigma_Y$, respectively. If the conditional expectation of Y given X = x is linear in x, then

$$E(Y \mid X = x) = \mu_Y + \rho\, \frac{\sigma_Y}{\sigma_X}\,(x - \mu_X),$$

where $\rho$ denotes the correlation coefficient of X and Y.

Proof: We assume that the random variables X and Y are continuous. If they are discrete, the proof of the theorem follows exactly the same way by replacing the integrals with summations. We are given that E(Y|X = x) is linear in x, that is

$$E(Y \mid X = x) = a\,x + b, \tag{9.0}$$

where a and b are two constants. Hence, from above we get

$$\int_{-\infty}^{\infty} y\, h(y/x)\, dy = a\,x + b$$

which implies

$$\int_{-\infty}^{\infty} y\, \frac{f(x,y)}{f_1(x)}\, dy = a\,x + b.$$

Multiplying both sides by $f_1(x)$, we get

$$\int_{-\infty}^{\infty} y\, f(x,y)\, dy = (a\,x + b)\, f_1(x). \tag{9.1}$$

Now integrating with respect to x, we get

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} y\, f(x,y)\, dy\, dx = \int_{-\infty}^{\infty} (a\,x + b)\, f_1(x)\, dx.$$

This yields

$$\mu_Y = a\,\mu_X + b. \tag{9.2}$$

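Now, we multiply (9.1) by x and then integrate the resulting expression with respect to x to get

$$\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} xy\, f(x,y)\, dy\, dx = \int_{-\infty}^{\infty} (a\,x^2 + b\,x)\, f_1(x)\, dx.$$

From this we get

$$E(XY) = a\, E\left(X^2\right) + b\,\mu_X. \tag{9.3}$$

Solving (9.2) and (9.3) for the unknowns a and b, we get

$$a = \frac{E(XY) - \mu_X \mu_Y}{\sigma_X^2} = \frac{\sigma_{XY}}{\sigma_X^2} = \frac{\sigma_{XY}}{\sigma_X \sigma_Y}\,\frac{\sigma_Y}{\sigma_X} = \rho\,\frac{\sigma_Y}{\sigma_X}.$$

Similarly, from (9.2) we get

$$b = \mu_Y - \rho\,\frac{\sigma_Y}{\sigma_X}\,\mu_X.$$

Substituting a and b into (9.0), we obtain the asserted result and the proof of the theorem is now complete.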

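Example 9.5. Suppose X and Y are random variables with $E(Y \mid X = x) = -x + 3$ and $E(X \mid Y = y) = -\frac{1}{4}y + 5$. What is the correlation coefficient of X and Y?

Answer: From Theorem 9.2, we get

$$\mu_Y + \rho\,\frac{\sigma_Y}{\sigma_X}\,(x - \mu_X) = -x + 3.$$

Therefore, equating the coefficients of the x terms, we get

$$\rho\,\frac{\sigma_Y}{\sigma_X} = -1. \tag{9.4}$$

Similarly, since

$$\mu_X + \rho\,\frac{\sigma_X}{\sigma_Y}\,(y - \mu_Y) = -\frac{1}{4}y + 5$$

we have

$$\rho\,\frac{\sigma_X}{\sigma_Y} = -\frac{1}{4}. \tag{9.5}$$

Multiplying (9.4) with (9.5), we get

$$\rho\,\frac{\sigma_Y}{\sigma_X}\;\rho\,\frac{\sigma_X}{\sigma_Y} = (-1)\left(-\frac{1}{4}\right)$$

which is

$$\rho^2 = \frac{1}{4}.$$

Solving this, we get

$$\rho = \pm\frac{1}{2}.$$

Since $\rho\,\frac{\sigma_Y}{\sigma_X} = -1$ and $\frac{\sigma_Y}{\sigma_X} > 0$, we get

$$\rho = -\frac{1}{2}.$$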
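The argument of Example 9.5 reduces to a two-line rule: the product of the two regression slopes equals $\rho^2$, and $\rho$ carries the common sign of the slopes. The sketch below encodes that rule, taking the slopes of the example as −1 and −1/4 (an assumption about the signs, read from the sign argument at the end of the example); the function name is hypothetical.

```python
import math

def correlation_from_slopes(slope_y_on_x, slope_x_on_y):
    """Correlation coefficient from the slopes of two linear regressions.

    slope_y_on_x = rho * sigma_Y / sigma_X and
    slope_x_on_y = rho * sigma_X / sigma_Y, so their product is rho^2.
    """
    product = slope_y_on_x * slope_x_on_y
    if product < 0 or product > 1:
        raise ValueError("slopes are not consistent with a correlation coefficient")
    sign = 1.0 if slope_y_on_x >= 0 else -1.0   # rho has the sign of the slopes
    return sign * math.sqrt(product)

rho = correlation_from_slopes(-1.0, -0.25)
print(rho)
```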

9.2. Conditional Variance

The variance of the probability density function f (y/x ) is called the

conditional variance of Y given that X = x . This conditional variance is

deﬁned as follows:

Deﬁnition 9.3. Let X and Y be two random variables with joint den-

sity f (x, y ) and f (y/x ) be the conditional density of Y given X = x . The

conditional variance of Y given X = x , denoted by V ar (Y |x ), is deﬁned as

$$Var(Y \mid x) = E\left(Y^2 \mid x\right) - \left(E(Y \mid x)\right)^2,$$

where E (Y |x ) denotes the conditional mean of Y given X = x.

Example 9.6. Let X and Y be continuous random variables with joint

probability density function

$$f(x,y) = \begin{cases} e^{-y} & \text{for } 0 < x < y < \infty \\ 0 & \text{otherwise.} \end{cases}$$

What is the conditional variance of Y given the knowledge that X = x?

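Answer: The marginal density $f_1(x)$ of X is given by

$$f_1(x) = \int_{-\infty}^{\infty} f(x,y)\, dy = \int_x^{\infty} e^{-y}\, dy = \left[-e^{-y}\right]_x^{\infty} = e^{-x}.$$

Thus, the conditional density of Y given X = x is

$$h(y/x) = \frac{f(x,y)}{f_1(x)} = \frac{e^{-y}}{e^{-x}} = e^{-(y-x)} \qquad \text{for } y > x.$$

Thus, given X = x, Y has an exponential distribution with parameter $\theta = 1$ and location parameter x. The conditional mean of Y given X = x is

$$\begin{aligned}
E(Y \mid x) &= \int_{-\infty}^{\infty} y\, h(y/x)\, dy \\
&= \int_x^{\infty} y\, e^{-(y-x)}\, dy \\
&= \int_0^{\infty} (z + x)\, e^{-z}\, dz \qquad \text{where } z = y - x \\
&= x\int_0^{\infty} e^{-z}\, dz + \int_0^{\infty} z\, e^{-z}\, dz \\
&= x\,\Gamma(1) + \Gamma(2) \\
&= x + 1.
\end{aligned}$$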

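Similarly, we compute the second moment of the distribution h(y/x).

$$\begin{aligned}
E\left(Y^2 \mid x\right) &= \int_{-\infty}^{\infty} y^2\, h(y/x)\, dy \\
&= \int_x^{\infty} y^2\, e^{-(y-x)}\, dy \\
&= \int_0^{\infty} (z + x)^2\, e^{-z}\, dz \qquad \text{where } z = y - x \\
&= x^2\int_0^{\infty} e^{-z}\, dz + \int_0^{\infty} z^2 e^{-z}\, dz + 2x\int_0^{\infty} z\, e^{-z}\, dz \\
&= x^2\,\Gamma(1) + \Gamma(3) + 2x\,\Gamma(2) \\
&= x^2 + 2 + 2x \\
&= (1 + x)^2 + 1.
\end{aligned}$$

Therefore

$$\begin{aligned}
Var(Y \mid x) &= E\left(Y^2 \mid x\right) - \left[E(Y \mid x)\right]^2 \\
&= (1 + x)^2 + 1 - (1 + x)^2 \\
&= 1.
\end{aligned}$$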

Remark 9.2. The variance of Y is 2. This can be seen as follows: Since the marginal of Y is given by $f_2(y) = \int_0^y e^{-y}\, dx = y\, e^{-y}$, the expected value of Y is $E(Y) = \int_0^{\infty} y^2 e^{-y}\, dy = \Gamma(3) = 2$, and $E\left(Y^2\right) = \int_0^{\infty} y^3 e^{-y}\, dy = \Gamma(4) = 6$. Thus, the variance of Y is Var(Y) = 6 − 4 = 2. However, given the knowledge X = x, the variance of Y is 1. Thus, in a way the prior knowledge reduces the variability (or the variance) of a random variable.
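A quick simulation makes the contrast in Remark 9.2 concrete. The sketch below samples X from its marginal $e^{-x}$ and then Y from the conditional density $e^{-(y-x)}$, which amounts to adding an independent unit exponential to X. The seed, sample size, and helper name are illustrative assumptions.

```python
import random

def variance(values):
    """Population variance of a list of numbers."""
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

rng = random.Random(0)
n = 200_000

# X has marginal density e^{-x}; given X = x, Y - x is exponential with mean 1.
xs = [rng.expovariate(1.0) for _ in range(n)]
ys = [x + rng.expovariate(1.0) for x in xs]

var_y = variance(ys)                                   # estimates Var(Y)
var_excess = variance([y - x for x, y in zip(xs, ys)]) # estimates Var(Y | X = x)
print(round(var_y, 2), round(var_excess, 2))
```

Because Y − x has the same distribution for every x in this example, the spread of Y − X estimates the conditional variance; that shortcut would not apply to a conditional variance that depends on x.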

Next, we simply state the following theorem concerning the conditional

variance without proof.

Theorem 9.3. Let X and Y be two random variables with mean $\mu_X$ and $\mu_Y$, and standard deviation $\sigma_X$ and $\sigma_Y$, respectively. If the conditional expectation of Y given X = x is linear in x, then

$$E_x\left(Var(Y \mid X)\right) = (1 - \rho^2)\, Var(Y),$$

where $\rho$ denotes the correlation coefficient of X and Y.

Example 9.7. Let E(Y|X = x) = 2x and $Var(Y \mid X = x) = 4x^2$, and let X have a uniform distribution on the interval from 0 to 1. What is the variance of Y?

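Answer: If E(Y|X = x) is a linear function of x, then

$$E(Y \mid X = x) = \mu_Y + \rho\,\frac{\sigma_Y}{\sigma_X}\,(x - \mu_X)$$

and

$$E_x\left(Var(Y \mid X)\right) = \sigma_Y^2\,(1 - \rho^2).$$

We are given that

$$\mu_Y + \rho\,\frac{\sigma_Y}{\sigma_X}\,(x - \mu_X) = 2x.$$

Hence, equating the coefficients of the x terms, we get

$$\rho\,\frac{\sigma_Y}{\sigma_X} = 2$$

which is

$$\rho = 2\,\frac{\sigma_X}{\sigma_Y}. \tag{9.6}$$

Further, we are given that

$$Var(Y \mid X = x) = 4x^2.$$

Since $X \sim UNIF(0, 1)$, we get the density of X to be f(x) = 1 on the interval (0, 1). Therefore,

$$E_x\left(Var(Y \mid X)\right) = \int_{-\infty}^{\infty} Var(Y \mid X = x)\, f(x)\, dx = \int_0^1 4x^2\, dx = 4\left[\frac{x^3}{3}\right]_0^1 = \frac{4}{3}.$$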

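By Theorem 9.3,

$$\begin{aligned}
\frac{4}{3} &= E_x\left(Var(Y \mid X)\right) \\
&= \sigma_Y^2\left(1 - \rho^2\right) \\
&= \sigma_Y^2\left(1 - 4\,\frac{\sigma_X^2}{\sigma_Y^2}\right) \\
&= \sigma_Y^2 - 4\,\sigma_X^2.
\end{aligned}$$

Hence

$$\sigma_Y^2 = \frac{4}{3} + 4\,\sigma_X^2.$$

Since $X \sim UNIF(0, 1)$, the variance of X is given by $\sigma_X^2 = \frac{1}{12}$. Therefore, the variance of Y is given by

$$\sigma_Y^2 = \frac{4}{3} + \frac{4}{12} = \frac{16}{12} + \frac{4}{12} = \frac{20}{12} = \frac{5}{3}.$$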
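The result of Example 9.7 can be cross-checked with exact arithmetic. The sketch below uses the decomposition Var(Y) = E[Var(Y|X)] + Var(E[Y|X]), a standard identity that is not stated in this section and is used here as an assumption of the sketch; for a linear conditional mean it agrees with Theorem 9.3. Polynomials in x are represented by coefficient lists, and the helper names are hypothetical.

```python
from fractions import Fraction

def uniform01_moment(k):
    """E[X^k] for X ~ UNIF(0, 1), i.e. the integral of x^k over (0, 1)."""
    return Fraction(1, k + 1)

def expect_poly(coeffs):
    """E[c0 + c1 X + c2 X^2 + ...] for X ~ UNIF(0, 1)."""
    return sum(Fraction(c) * uniform01_moment(k) for k, c in enumerate(coeffs))

def poly_square(coeffs):
    """Coefficients of the square of a polynomial."""
    out = [0] * (2 * len(coeffs) - 1)
    for i, a in enumerate(coeffs):
        for j, b in enumerate(coeffs):
            out[i + j] += a * b
    return out

cond_mean = [0, 2]       # E(Y | X = x) = 2x
cond_var = [0, 0, 4]     # Var(Y | X = x) = 4x^2

mean_of_cond_var = expect_poly(cond_var)
var_of_cond_mean = expect_poly(poly_square(cond_mean)) - expect_poly(cond_mean) ** 2
var_y = mean_of_cond_var + var_of_cond_mean
print(mean_of_cond_var, var_of_cond_mean, var_y)
```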

Example 9.8. Let E(X|Y = y) = 3y and Var(X|Y = y) = 2, and let Y have density function

$$f(y) = \begin{cases} e^{-y} & \text{if } y > 0 \\ 0 & \text{otherwise.} \end{cases}$$

What is the variance of X?

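Answer: By Theorem 9.3, we get

$$Var(X \mid Y = y) = \sigma_X^2\left(1 - \rho^2\right) = 2 \tag{9.7}$$

and

$$\mu_X + \rho\,\frac{\sigma_X}{\sigma_Y}\,(y - \mu_Y) = 3y.$$

Thus

$$\rho = 3\,\frac{\sigma_Y}{\sigma_X}.$$

Hence from (9.7), we get $E_y\left(Var(X \mid Y)\right) = 2$ and thus

$$\sigma_X^2\left(1 - 9\,\frac{\sigma_Y^2}{\sigma_X^2}\right) = 2$$

which is

$$\sigma_X^2 = 9\,\sigma_Y^2 + 2.$$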

Now, we compute the variance of Y. For this, we need E(Y) and $E\left(Y^2\right)$.

$$E(Y) = \int_0^{\infty} y\, f(y)\, dy = \int_0^{\infty} y\, e^{-y}\, dy = \Gamma(2) = 1.$$

Similarly

$$E\left(Y^2\right) = \int_0^{\infty} y^2 f(y)\, dy = \int_0^{\infty} y^2 e^{-y}\, dy = \Gamma(3) = 2.$$

Therefore

$$Var(Y) = E\left(Y^2\right) - \left[E(Y)\right]^2 = 2 - 1 = 1.$$

Hence, the variance of X can be calculated as

$$\sigma_X^2 = 9\,\sigma_Y^2 + 2 = 9\,(1) + 2 = 11.$$

Remark 9.3. Notice that, in Example 9.8, we calculated the variance of Y directly using the form of f(y). It is easy to note that f(y) has the form of an exponential density with parameter $\theta = 1$, and therefore its variance is the square of the parameter. This immediately gives $\sigma_Y^2 = 1$.

9.3. Regression Curve and Scedastic Curve

One of the major goals in most statistical studies is to establish relationships between two or more random variables. For example, a company would like to know the relationship between the potential sales of a new product and its price. Historically, regression analysis originated in the works of Sir Francis Galton (1822-1911), but much of the theory of regression analysis was developed later by Sir Ronald Fisher (1890-1962).

Definition 9.4. Let X and Y be two random variables with joint probability density function f(x, y) and let h(y/x) be the conditional density of Y given X = x. Then the conditional mean

$$E(Y \mid X = x) = \int_{-\infty}^{\infty} y\, h(y/x)\, dy$$

is called the regression function of Y on X. The graph of this regression function of Y on X is known as the regression curve of Y on X.

Example 9.9. Let X and Y be two random variables with joint density

$$f(x,y) = \begin{cases} x\, e^{-x(1+y)} & \text{if } x > 0;\ y > 0 \\ 0 & \text{otherwise.} \end{cases}$$

What is the regression function of Y on X?

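Answer: The marginal density $f_1(x)$ of X is

$$\begin{aligned}
f_1(x) &= \int_{-\infty}^{\infty} f(x,y)\, dy \\
&= \int_0^{\infty} x\, e^{-x(1+y)}\, dy \\
&= \int_0^{\infty} x\, e^{-x}\, e^{-xy}\, dy \\
&= x\, e^{-x}\int_0^{\infty} e^{-xy}\, dy \\
&= x\, e^{-x}\left[-\frac{1}{x}\, e^{-xy}\right]_0^{\infty} \\
&= e^{-x}.
\end{aligned}$$

The conditional density of Y given X = x is

$$h(y/x) = \frac{f(x,y)}{f_1(x)} = \frac{x\, e^{-x(1+y)}}{e^{-x}} = x\, e^{-xy}.$$

The conditional mean of Y given that X = x is

$$\begin{aligned}
E(Y \mid X = x) &= \int_{-\infty}^{\infty} y\, h(y/x)\, dy \\
&= \int_0^{\infty} y\, x\, e^{-xy}\, dy \\
&= \frac{1}{x}\int_0^{\infty} z\, e^{-z}\, dz \qquad (\text{where } z = xy) \\
&= \frac{1}{x}\,\Gamma(2) \\
&= \frac{1}{x}.
\end{aligned}$$

Thus, the regression function (or equation) of Y on X is given by

$$E(Y \mid x) = \frac{1}{x} \qquad \text{for } 0 < x < \infty.$$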

Deﬁnition 9.4. Let X and Y be two random variables with joint probability

density function f (x, y ) and let E (Y |X = x ) be the regression function of Y

on X . If this regression function is linear, then E (Y |X = x ) is called a linear

regression of Y on X . Otherwise, it is called nonlinear regression of Y on X.

Example 9.10. Given the regression lines E(Y|X = x) = x + 2 and $E(X \mid Y = y) = 1 + \frac{1}{2}y$, what is the expected value of X?

Answer: Since the conditional expectation E(Y|X = x) is linear in x, we get

$$\mu_Y + \rho\,\frac{\sigma_Y}{\sigma_X}\,(x - \mu_X) = x + 2.$$

Hence, equating the coefficients of x and the constant terms, we get

$$\rho\,\frac{\sigma_Y}{\sigma_X} = 1 \tag{9.8}$$

and

$$\mu_Y - \rho\,\frac{\sigma_Y}{\sigma_X}\,\mu_X = 2, \tag{9.9}$$

respectively. Now, using (9.8) in (9.9), we get

$$\mu_Y - \mu_X = 2. \tag{9.10}$$

Similarly, since E(X|Y = y) is linear in y, we get

$$\rho\,\frac{\sigma_X}{\sigma_Y} = \frac{1}{2} \tag{9.11}$$

and

$$\mu_X - \rho\,\frac{\sigma_X}{\sigma_Y}\,\mu_Y = 1. \tag{9.12}$$

Hence, substituting (9.11) into (9.12) and simplifying, we get

$$2\mu_X - \mu_Y = 2. \tag{9.13}$$

Now adding (9.13) to (9.10), we see that

$$\mu_X = 4.$$
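Example 9.10 amounts to solving two linear equations for the two means. Taking expectations of both regression lines (Theorem 9.1) gives $\mu_Y = a_1\mu_X + b_1$ and $\mu_X = a_2\mu_Y + b_2$. The sketch below solves this pair exactly; the function name and the parameter names `a1, b1, a2, b2` are hypothetical.

```python
from fractions import Fraction

def means_from_regression_lines(a1, b1, a2, b2):
    """Solve mu_Y = a1*mu_X + b1 and mu_X = a2*mu_Y + b2 for (mu_X, mu_Y)."""
    a1, b1, a2, b2 = map(Fraction, (a1, b1, a2, b2))
    if a1 * a2 == 1:
        raise ValueError("the two lines do not determine a unique pair of means")
    mu_x = (a2 * b1 + b2) / (1 - a1 * a2)   # substitute the first line into the second
    mu_y = a1 * mu_x + b1
    return mu_x, mu_y

# Example 9.10: E(Y | x) = x + 2 and E(X | y) = 1 + y/2.
mu_x, mu_y = means_from_regression_lines(1, 2, Fraction(1, 2), 1)
print(mu_x, mu_y)
```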

Remark 9.4. In statistics, a linear regression usually means the conditional expectation E(Y/x) is linear in the parameters, but not necessarily in x. Therefore, $E(Y/x) = \alpha + \theta x^2$ will be a linear model, whereas $E(Y/x) = \alpha\, x^{\theta}$ is not a linear regression model.

Definition 9.5. Let X and Y be two random variables with joint probability density function f(x, y) and let h(y/x) be the conditional density of Y given X = x. Then the conditional variance

$$Var(Y \mid X = x) = \int_{-\infty}^{\infty} y^2\, h(y/x)\, dy - \left[E(Y \mid X = x)\right]^2$$

is called the scedastic function of Y on X. The graph of this scedastic function of Y on X is known as the scedastic curve of Y on X.

Scedastic curves and regression curves are used for constructing families

of bivariate probability density functions with speciﬁed marginals.

9.4. Review Exercises

1. Given the regression lines E(Y|X = x) = x + 2 and $E(X \mid Y = y) = 1 + \frac{1}{2}y$, what is the expected value of Y?

2. If the joint density of X and Y is

$$f(x,y) = \begin{cases} k & \text{if } -1 < x < 1;\ x^2 < y < 1 \\ 0 & \text{elsewhere,} \end{cases}$$

where k is a constant, what is E(Y|X = x)?

3. Suppose the joint density of X and Y is defined by

$$f(x,y) = \begin{cases} 10xy^2 & \text{if } 0 < x < y < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

What is $E\left(X^2 \mid Y = y\right)$?

4. Let X and Y have joint density function

$$f(x,y) = \begin{cases} 2e^{-2(x+y)} & \text{if } 0 < x < y < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

What is the expected value of Y, given X = x, for x > 0?

5. Let X and Y have joint density function

$$f(x,y) = \begin{cases} 8xy & \text{if } 0 < x < 1;\ 0 < y < x \\ 0 & \text{elsewhere.} \end{cases}$$

What is the regression curve of y on x, that is, E(Y/X = x)?

6. Suppose X and Y are random variables with means $\mu_X$ and $\mu_Y$, respectively; and $E(Y \mid X = x) = -\frac{1}{3}x + 10$ and $E(X \mid Y = y) = -\frac{3}{4}y + 2$. What are the values of $\mu_X$ and $\mu_Y$?

7. Let X and Y have joint density

$$f(x,y) = \begin{cases} \frac{24}{5}(x+y) & \text{for } 0 \le 2y \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

What is the conditional expectation of X given Y = y?

8. Let X and Y have joint density

$$f(x,y) = \begin{cases} c\,xy^2 & \text{for } 0 \le y \le 2x;\ 1 \le x \le 5 \\ 0 & \text{otherwise.} \end{cases}$$

What is the conditional expectation of Y given X = x?

9. Let X and Y have joint density

$$f(x,y) = \begin{cases} e^{-y} & \text{for } y \ge x \ge 0 \\ 0 & \text{otherwise.} \end{cases}$$

What is the conditional expectation of X given Y = y?

10. Let X and Y have joint density

$$f(x,y) = \begin{cases} 2xy & \text{for } 0 \le y \le 2x \le 2 \\ 0 & \text{otherwise.} \end{cases}$$

What is the conditional expectation of Y given X = x?

11. Let E(Y|X = x) = 2 + 5x, Var(Y|X = x) = 3, and let X have the density function

$$f(x) = \begin{cases} \frac{1}{4}\,x\, e^{-\frac{x}{2}} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise.} \end{cases}$$

What are the mean and variance of the random variable Y?

12. Let E(Y|X = x) = 2x and $Var(Y \mid X = x) = 4x^2 + 3$, and let X have the density function

$$f(x) = \begin{cases} \frac{4}{\sqrt{\pi}}\,x^2 e^{-x^2} & \text{for } 0 \le x < \infty \\ 0 & \text{elsewhere.} \end{cases}$$

What is the variance of Y?

13. Let X and Y have joint density

$$f(x,y) = \begin{cases} 2 & \text{for } 0 < y < 1 - x;\ \text{and } 0 < x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

What is the conditional variance of Y given X = x?

14. Let X and Y have joint density

$$f(x,y) = \begin{cases} 4x & \text{for } 0 < x < \sqrt{y} < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

What is the conditional variance of Y given X = x?

15. Let X and Y have joint density

$$f(x,y) = \begin{cases} \frac{6}{7}\,x & \text{for } 1 \le x + y \le 2;\ x \ge 0,\ y \ge 0 \\ 0 & \text{elsewhere.} \end{cases}$$

What is the marginal density of Y ? What is the conditional variance of X

given Y = 3

16. Let X and Y have joint density

$$f(x,y) = \begin{cases} 12x & \text{for } 0 < y < 2x < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

What is the conditional variance of Y given X = 0.5?

17. Let the random variable W denote the number of students who take business calculus each semester at the University of Louisville. If the random variable W has a Poisson distribution with parameter $\lambda$ equal to 300 and the probability of each student passing the course is $\frac{3}{5}$, then on average how many students will pass business calculus?

18. If the conditional density of Y given X = x is given by

$$f(y/x) = \begin{cases} \binom{5}{y}\, x^{y} (1-x)^{5-y} & \text{if } y = 0, 1, 2, \ldots, 5 \\ 0 & \text{otherwise,} \end{cases}$$

and the marginal density of X is

$$f_1(x) = \begin{cases} 4x^3 & \text{if } 0 < x < 1 \\ 0 & \text{otherwise,} \end{cases}$$

then what is the conditional expectation of Y given the event X = x?

19. If the joint density of the random variables X and Y is

$$f(x,y) = \begin{cases} \frac{2 + (2x-1)(2y-1)}{2} & \text{if } 0 < x, y < 1 \\ 0 & \text{otherwise,} \end{cases}$$

then what is the regression function of Y on X?

20. If the joint density of the random variables X and Y is

$$f(x,y) = \begin{cases} \left(e^{\min\{x,y\}} - 1\right) e^{-(x+y)} & \text{if } 0 < x, y < \infty \\ 0 & \text{otherwise,} \end{cases}$$

then what is the conditional expectation of Y given X = x?

Chapter 10

TRANSFORMATION OF RANDOM VARIABLES AND THEIR DISTRIBUTIONS

In many statistical applications, given the probability distribution of a univariate random variable X, one would like to know the probability distribution of another univariate random variable $Y = \phi(X)$, where $\phi$ is some known function. For example, if we know the probability distribution of the random variable X, we would like to know the distribution of Y = ln(X). For a univariate random variable X, some commonly used transformed random variables Y of X are: $Y = X^2$, $Y = |X|$, $Y = \sqrt{|X|}$, $Y = \ln(X)$, $Y = \frac{X-\mu}{\sigma}$, and $Y = \left(\frac{X-\mu}{\sigma}\right)^2$. Similarly, for a bivariate random variable (X, Y), some of the most common transformations of X and Y are X + Y, XY, $\frac{X}{Y}$, min{X, Y}, max{X, Y} or $\sqrt{X^2 + Y^2}$. In this chapter, we examine various methods for finding the distribution of a transformed univariate or bivariate random variable, when the transformation and the distribution of the variable are known. First, we treat the univariate case. Then we treat the bivariate case.

We begin with an example for a univariate discrete random variable.

Example 10.1. The probability density function of the random variable X

is shown in the table below.

x      −2   −1   0   1   2   3   4

f(x ) 1

What is the probability density function of the random variable $Y = X^2$?

Answer: The space of the random variable X is $R_X = \{-2, -1, 0, 1, 2, 3, 4\}$. Then the space of the random variable Y is $R_Y = \{x^2 \mid x \in R_X\}$. Thus, $R_Y = \{0, 1, 4, 9, 16\}$. Now we compute the probability density function g(y) for y in $R_Y$.

$$\begin{aligned}
g(0) &= P(Y = 0) = P(X^2 = 0) = P(X = 0) = \tfrac{1}{10} \\
g(1) &= P(Y = 1) = P(X^2 = 1) = P(X = -1) + P(X = 1) = \tfrac{3}{10} \\
g(4) &= P(Y = 4) = P(X^2 = 4) = P(X = -2) + P(X = 2) = \tfrac{2}{10} \\
g(9) &= P(Y = 9) = P(X^2 = 9) = P(X = 3) = \tfrac{2}{10} \\
g(16) &= P(Y = 16) = P(X^2 = 16) = P(X = 4) = \tfrac{2}{10}.
\end{aligned}$$

We summarize the distribution of Y in the following table.

y      0      1      4      9      16

g(y)   1/10   3/10   2/10   2/10   2/10

[Figure: density function of Y = X²]

Example 10.2. The probability density function of the random variable X

is shown in the table below.

x      1     2     3     4     5     6

f(x)   1/6   1/6   1/6   1/6   1/6   1/6

What is the probability density function of the random variable Y = 2X + 1?

Answer: The space of the random variable X is $R_X = \{1, 2, 3, 4, 5, 6\}$. Then the space of the random variable Y is $R_Y = \{2x + 1 \mid x \in R_X\}$. Thus, $R_Y = \{3, 5, 7, 9, 11, 13\}$. Next we compute the probability density function g(y) for y in $R_Y$. The pdf g(y) is given by

$$\begin{aligned}
g(3) &= P(Y = 3) = P(2X + 1 = 3) = P(X = 1) = \tfrac{1}{6} \\
g(5) &= P(Y = 5) = P(2X + 1 = 5) = P(X = 2) = \tfrac{1}{6} \\
g(7) &= P(Y = 7) = P(2X + 1 = 7) = P(X = 3) = \tfrac{1}{6} \\
g(9) &= P(Y = 9) = P(2X + 1 = 9) = P(X = 4) = \tfrac{1}{6} \\
g(11) &= P(Y = 11) = P(2X + 1 = 11) = P(X = 5) = \tfrac{1}{6} \\
g(13) &= P(Y = 13) = P(2X + 1 = 13) = P(X = 6) = \tfrac{1}{6}.
\end{aligned}$$

We summarize the distribution of Y in the following table.

y      3     5     7     9     11    13

g(y)   1/6   1/6   1/6   1/6   1/6   1/6

The distributions of X and 2X + 1 are illustrated below.

Density Function of Y = 2X+1
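The bookkeeping in Examples 10.1 and 10.2 is the same in both cases: each probability of X is moved to the transformed value, and probabilities landing on the same value are added. The sketch below implements that rule. The individual entries of the table in Example 10.1 are not fully legible here, so `pmf_101` is an assumed pmf chosen only to be consistent with the sums in the answer; the helper name is hypothetical.

```python
from fractions import Fraction

def pushforward_pmf(pmf, transform):
    """pmf of transform(X) for a discrete X given as {value: probability}."""
    out = {}
    for x, prob in pmf.items():
        y = transform(x)
        out[y] = out.get(y, Fraction(0)) + prob   # merge values that collide
    return out

# ASSUMED pmf for Example 10.1 (consistent with g(0), g(1), g(4), g(9), g(16)).
pmf_101 = {
    -2: Fraction(1, 10), -1: Fraction(2, 10), 0: Fraction(1, 10),
    1: Fraction(1, 10), 2: Fraction(1, 10), 3: Fraction(2, 10), 4: Fraction(2, 10),
}
g_square = pushforward_pmf(pmf_101, lambda x: x * x)

# Example 10.2: X uniform on {1, ..., 6}, Y = 2X + 1.
pmf_102 = {x: Fraction(1, 6) for x in range(1, 7)}
g_affine = pushforward_pmf(pmf_102, lambda x: 2 * x + 1)

print(g_square)
print(g_affine)
```

Under the squaring map two different x values can collide, which changes the shape of the distribution; under the one-to-one map 2x + 1 no values collide and the probabilities are simply relabeled.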

In Example 10.1, we computed the distribution (that is, the probability density function) of the transformed random variable $Y = \phi(X)$, where $\phi(x) = x^2$. This transformation is neither increasing nor decreasing (that is, not monotonic) on the space, $R_X$, of the random variable X. Therefore, the distribution of Y turns out to be quite different from that of X. In Example 10.2, the form of the distribution of the transformed random variable $Y = \phi(X)$, where $\phi(x) = 2x + 1$, is essentially the same. This is mainly due to the fact that $\phi(x) = 2x + 1$ is monotonic on $R_X$.

In this chapter, we shall examine the probability density function of trans-

formed random variables by knowing the density functions of the original

random variables. There are several methods for ﬁnding the probability den-

sity function of a transformed random variable. Some of these methods are:

(1) distribution function method

(2) transformation method

(3) convolution method, and

(4) moment generating function method.

Among these four methods, the transformation method is the most useful one.

The convolution method is a special case of this method. The transformation

method is derived using the distribution function method.

10.1. Distribution Function Method

We have seen in chapter six that an easy way to find the probability density function of a transformation of continuous random variables is to determine its distribution function and then its density function by differentiation.

Example 10.3. A box is to be constructed so that the height is 4 inches and

its base is X inches by X inches. If X has a standard normal distribution,

what is the distribution of the volume of the box?

Answer: The volume of the box is a random variable, since X is a random variable. This random variable V is given by $V = 4X^2$. To find the density function of V, we first determine the form of the distribution function G(v) of V and then we differentiate G(v) to find the density function of V. The distribution function of V is given by

$$\begin{aligned}
G(v) &= P(V \le v) \\
&= P\left(4X^2 \le v\right) \\
&= P\left(-\tfrac{1}{2}\sqrt{v} \le X \le \tfrac{1}{2}\sqrt{v}\right) \\
&= \int_{-\frac{1}{2}\sqrt{v}}^{\frac{1}{2}\sqrt{v}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}\, dx \\
&= 2\int_{0}^{\frac{1}{2}\sqrt{v}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}\, dx \qquad (\text{since the integrand is even}).
\end{aligned}$$

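Hence, by the Fundamental Theorem of Calculus, we get

$$\begin{aligned}
g(v) &= \frac{dG(v)}{dv} \\
&= \frac{d}{dv}\left(2\int_{0}^{\frac{1}{2}\sqrt{v}} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}\, dx\right) \\
&= 2\,\frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{1}{2}\sqrt{v}\right)^2}\,\frac{1}{2}\,\frac{d\sqrt{v}}{dv} \\
&= \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{8}v}\,\frac{1}{2\sqrt{v}} \\
&= \frac{1}{\Gamma\left(\frac{1}{2}\right)\sqrt{8}}\; v^{\frac{1}{2}-1}\, e^{-\frac{v}{8}}, \qquad v > 0.
\end{aligned}$$

Hence $V \sim GAM\left(8, \frac{1}{2}\right)$.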

Example 10.4. If the density function of X is

$$f(x) = \begin{cases} \frac{1}{2} & \text{for } -1 < x < 1 \\ 0 & \text{otherwise,} \end{cases}$$

what is the probability density function of $Y = X^2$?

Answer: We first find the cumulative distribution function of Y and then by differentiation, we obtain the density of Y. The distribution function G(y) of Y is given by

$$\begin{aligned}
G(y) &= P(Y \le y) \\
&= P\left(X^2 \le y\right) \\
&= P\left(-\sqrt{y} \le X \le \sqrt{y}\right) \\
&= \int_{-\sqrt{y}}^{\sqrt{y}} \frac{1}{2}\, dx \\
&= \sqrt{y}.
\end{aligned}$$

Hence, the density function of Y is given by

$$g(y) = \frac{dG(y)}{dy} = \frac{d\sqrt{y}}{dy} = \frac{1}{2\sqrt{y}} \qquad \text{for } 0 < y < 1.$$

10.2. Transformation Method for Univariate Case

The following theorem is the backbone of the transformation method.

Theorem 10.1. Let X be a continuous random variable with probability density function f(x). Let y = T(x) be an increasing (or decreasing) function. Then the density function of the random variable Y = T(X) is given by

$$g(y) = \left|\frac{dx}{dy}\right|\, f(W(y))$$

where x = W(y) is the inverse function of T(x).

Proof: Suppose y = T(x) is an increasing function. The distribution function G(y) of Y is given by

$$\begin{aligned}
G(y) &= P(Y \le y) \\
&= P(T(X) \le y) \\
&= P(X \le W(y)) \\
&= \int_{-\infty}^{W(y)} f(x)\, dx.
\end{aligned}$$

Then, differentiating, we get the density function of Y, which is

$$\begin{aligned}
g(y) &= \frac{dG(y)}{dy} \\
&= \frac{d}{dy}\left(\int_{-\infty}^{W(y)} f(x)\, dx\right) \\
&= f(W(y))\,\frac{dW(y)}{dy} \\
&= f(W(y))\,\frac{dx}{dy} \qquad (\text{since } x = W(y)).
\end{aligned}$$

On the other hand, if y = T(x) is a decreasing function, then the distribution function of Y is given by

$$\begin{aligned}
G(y) &= P(Y \le y) \\
&= P(T(X) \le y) \\
&= P(X \ge W(y)) \qquad (\text{since } T(x) \text{ is decreasing}) \\
&= 1 - P(X \le W(y)) \\
&= 1 - \int_{-\infty}^{W(y)} f(x)\, dx.
\end{aligned}$$

As before, differentiating, we get the density function of Y, which is

$$\begin{aligned}
g(y) &= \frac{dG(y)}{dy} \\
&= \frac{d}{dy}\left(1 - \int_{-\infty}^{W(y)} f(x)\, dx\right) \\
&= -f(W(y))\,\frac{dW(y)}{dy} \\
&= -f(W(y))\,\frac{dx}{dy} \qquad (\text{since } x = W(y)).
\end{aligned}$$

Hence, combining both the cases, we get

$$g(y) = \left|\frac{dx}{dy}\right|\, f(W(y))$$

and the proof of the theorem is now complete.

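Example 10.5. Let $Z = \frac{X - \mu}{\sigma}$. If $X \sim N\left(\mu, \sigma^2\right)$, what is the probability density function of $Z$?

Answer: Here

$$
z = U(x) = \frac{x - \mu}{\sigma}.
$$

Hence, the inverse of $U$ is given by

$$
W(z) = x = \sigma z + \mu.
$$

Therefore

$$
\frac{dx}{dz} = \sigma.
$$

Hence, by Theorem 10.1, the density of $Z$ is given by

$$
\begin{aligned}
g(z) &= \left| \frac{dx}{dz} \right| f(W(z)) \\
&= \sigma\, \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{W(z) - \mu}{\sigma}\right)^2} \\
&= \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{\sigma z + \mu - \mu}{\sigma}\right)^2} \\
&= \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}z^2}.
\end{aligned}
$$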

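Example 10.6. Let $Z = \frac{X - \mu}{\sigma}$. If $X \sim N\left(\mu, \sigma^2\right)$, then show that $Z^2$ is chi-square with one degree of freedom, that is, $Z^2 \sim \chi^2(1)$.

Answer: Here

$$
y = T(x) = \left(\frac{x - \mu}{\sigma}\right)^2, \qquad \text{so that} \qquad x = \mu \pm \sigma\sqrt{y}.
$$

The function $T$ is not monotone on the whole real line: it is decreasing for $x < \mu$ and increasing for $x > \mu$. We therefore apply Theorem 10.1 on each branch and add the two contributions. On the increasing branch,

$$
W(y) = \mu + \sigma\sqrt{y}, \quad y > 0, \qquad \frac{dx}{dy} = \frac{\sigma}{2\sqrt{y}}.
$$

On the decreasing branch, $x = \mu - \sigma\sqrt{y}$ and $\left|\frac{dx}{dy}\right| = \frac{\sigma}{2\sqrt{y}}$ again. Since the normal density $f$ is symmetric about $\mu$, both branches contribute the same amount. The density of $Y$ is therefore

$$
\begin{aligned}
g(y) &= 2 \left| \frac{dx}{dy} \right| f(W(y)) \\
&= 2\, \frac{\sigma}{2\sqrt{y}}\, f(W(y)) \\
&= \frac{\sigma}{\sqrt{y}}\, \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{W(y) - \mu}{\sigma}\right)^2} \\
&= \frac{1}{\sqrt{2\pi y}}\, e^{-\frac{1}{2}\left(\frac{\sigma\sqrt{y} + \mu - \mu}{\sigma}\right)^2} \\
&= \frac{1}{\sqrt{2\pi y}}\, e^{-\frac{1}{2}y} \\
&= \frac{1}{\sqrt{\pi}\,\sqrt{2}}\, y^{-\frac{1}{2}}\, e^{-\frac{1}{2}y} \\
&= \frac{1}{\Gamma\left(\frac{1}{2}\right)\sqrt{2}}\, y^{\frac{1}{2}-1}\, e^{-\frac{1}{2}y}.
\end{aligned}
$$

Hence $Y \sim \chi^2(1)$.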
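The conclusion $Y \sim \chi^2(1)$ can be checked against the distribution function directly. The sketch below is an illustration only (the helper names are assumptions); it uses $P(Z^2 \le y) = P(|Z| \le \sqrt{y})$ for a standard normal $Z$ and compares its numerical slope with the $\chi^2(1)$ density.

```python
import math

def cdf_z_squared(y):
    """P(Z^2 <= y) for standard normal Z, i.e. P(|Z| <= sqrt(y)) = erf(sqrt(y / 2))."""
    return math.erf(math.sqrt(y / 2.0)) if y > 0 else 0.0

def chi_square_one_density(y):
    """Chi-square(1) density: y^(-1/2) e^(-y/2) / (Gamma(1/2) sqrt(2))."""
    return y ** -0.5 * math.exp(-y / 2.0) / (math.gamma(0.5) * math.sqrt(2.0))

def slope(func, y, h=1e-5):
    """Central-difference approximation of the derivative of func at y."""
    return (func(y + h) - func(y - h)) / (2.0 * h)

if __name__ == "__main__":
    for y in (0.3, 1.0, 4.0):
        print(y, slope(cdf_z_squared, y), chi_square_one_density(y))
```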

Example 10.7. Let $Y = -\ln X$. If $X \sim UNIF(0, 1)$, then what is the density function of $Y$ where nonzero?

Answer: We are given that

$$
y = T(x) = -\ln x.
$$

Hence, the inverse of $y = T(x)$ is given by

$$
W(y) = x = e^{-y}.
$$

Therefore

$$
\frac{dx}{dy} = -e^{-y}.
$$

Hence, by Theorem 10.1, the probability density of $Y$ for $y > 0$ is given by

$$
\begin{aligned}
g(y) &= \left| \frac{dx}{dy} \right| f(W(y)) \\
&= e^{-y}\, f(W(y)) \\
&= e^{-y}.
\end{aligned}
$$

Thus $Y \sim EXP(1)$. Hence, if $X \sim UNIF(0, 1)$, then the random variable $-\ln X \sim EXP(1)$.
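The formula of Theorem 10.1 translates directly into a small helper. The following is a sketch under the assumption that the caller supplies the inverse $W$ and its derivative by hand; `transformed_density` is a hypothetical name, not something defined in the text. It is applied here to $Y = -\ln X$ with $X \sim UNIF(0,1)$.

```python
import math

def transformed_density(f, inverse, inverse_derivative):
    """Return g(y) = |dx/dy| * f(W(y)) for a monotone transformation y = T(x)."""
    def g(y):
        return abs(inverse_derivative(y)) * f(inverse(y))
    return g

def uniform_density(x):
    """Density of UNIF(0, 1)."""
    return 1.0 if 0.0 < x < 1.0 else 0.0

# Y = -ln X, so W(y) = exp(-y) and dW/dy = -exp(-y)
g = transformed_density(
    uniform_density,
    lambda y: math.exp(-y),
    lambda y: -math.exp(-y),
)

if __name__ == "__main__":
    for y in (0.1, 1.0, 3.0):
        print(y, g(y), math.exp(-y))
```

For negative $y$ the inverse $e^{-y}$ falls outside $(0, 1)$, so the helper returns zero there, matching the support of $EXP(1)$.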

Although all the examples we have in this section involve continuous

random variables, the transformation method also works for the discrete

random variables.

10.3. Transformation Method for Bivariate Case

In this section, we extend Theorem 10.1 to the bivariate case and present some examples to illustrate the importance of this extension. We state this theorem without a proof.

Theorem 10.2. Let $X$ and $Y$ be two continuous random variables with joint density $f(x, y)$. Let $U = P(X, Y)$ and $V = Q(X, Y)$ be functions of $X$ and $Y$. If the functions $P(x, y)$ and $Q(x, y)$ have single valued inverses, say $X = R(U, V)$ and $Y = S(U, V)$, then the joint density $g(u, v)$ of $U$ and $V$ is given by

$$
g(u, v) = |J|\, f(R(u, v), S(u, v)),
$$

where $J$ denotes the Jacobian and is given by

$$
J = \det \begin{pmatrix} \dfrac{\partial x}{\partial u} & \dfrac{\partial x}{\partial v} \\[2mm] \dfrac{\partial y}{\partial u} & \dfrac{\partial y}{\partial v} \end{pmatrix}
= \frac{\partial x}{\partial u} \frac{\partial y}{\partial v} - \frac{\partial x}{\partial v} \frac{\partial y}{\partial u}.
$$

Example 10.8. Let $X$ and $Y$ have the joint probability density function

$$
f(x, y) = \begin{cases} 8xy & \text{for } 0 < x < y < 1 \\ 0 & \text{otherwise.} \end{cases}
$$

What is the joint density of $U = \frac{X}{Y}$ and $V = Y$?

Answer: Since

$$
U = \frac{X}{Y}, \qquad V = Y,
$$

we get by solving for $X$ and $Y$

$$
X = UY = UV, \qquad Y = V.
$$

Hence, the Jacobian of the transformation is given by

$$
J = \frac{\partial x}{\partial u} \frac{\partial y}{\partial v} - \frac{\partial x}{\partial v} \frac{\partial y}{\partial u} = v \cdot 1 - u \cdot 0 = v.
$$

The joint density function of $U$ and $V$ is

$$
\begin{aligned}
g(u, v) &= |J|\, f(R(u, v), S(u, v)) \\
&= |v|\, f(uv, v) \\
&= v \cdot 8(uv)v \\
&= 8uv^3.
\end{aligned}
$$

Note that, since

$$
0 < x < y < 1
$$

we have

$$
0 < uv < v < 1.
$$

The last inequalities yield

$$
0 < uv < v, \qquad 0 < v < 1.
$$

Therefore, we get

$$
0 < u < 1, \qquad 0 < v < 1.
$$

Thus, the joint density of $U$ and $V$ is given by

$$
g(u, v) = \begin{cases} 8uv^3 & \text{for } 0 < u < 1;\ 0 < v < 1 \\ 0 & \text{otherwise.} \end{cases}
$$
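The recipe of Theorem 10.2 can also be carried out numerically. The sketch below is illustrative only: the names are assumptions, and the Jacobian is estimated by central differences rather than computed symbolically. It rebuilds the joint density of Example 10.8 from $f(x, y) = 8xy$ and the inverse map $x = uv$, $y = v$.

```python
def joint_density_xy(x, y):
    """f(x, y) = 8xy on 0 < x < y < 1, zero elsewhere (Example 10.8)."""
    return 8.0 * x * y if 0.0 < x < y < 1.0 else 0.0

def inverse_map(u, v):
    """x = R(u, v) = uv and y = S(u, v) = v."""
    return u * v, v

def jacobian(u, v, h=1e-6):
    """Central-difference estimate of (dx/du)(dy/dv) - (dx/dv)(dy/du)."""
    x_up, y_up = inverse_map(u + h, v)
    x_um, y_um = inverse_map(u - h, v)
    x_vp, y_vp = inverse_map(u, v + h)
    x_vm, y_vm = inverse_map(u, v - h)
    dx_du, dy_du = (x_up - x_um) / (2 * h), (y_up - y_um) / (2 * h)
    dx_dv, dy_dv = (x_vp - x_vm) / (2 * h), (y_vp - y_vm) / (2 * h)
    return dx_du * dy_dv - dx_dv * dy_du

def joint_density_uv(u, v):
    """g(u, v) = |J| f(R(u, v), S(u, v))."""
    x, y = inverse_map(u, v)
    return abs(jacobian(u, v)) * joint_density_xy(x, y)

if __name__ == "__main__":
    for u, v in ((0.3, 0.5), (0.9, 0.2), (0.5, 0.99)):
        print(u, v, joint_density_uv(u, v), 8 * u * v ** 3)
```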

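Example 10.9. Let each of the independent random variables $X$ and $Y$ have the density function

$$
f(x) = \begin{cases} e^{-x} & \text{for } 0 < x < \infty \\ 0 & \text{otherwise.} \end{cases}
$$

What is the joint density of $U = X$ and $V = 2X + 3Y$ and the domain on which this density is positive?

Answer: Since

$$
U = X, \qquad V = 2X + 3Y,
$$

we get by solving for $X$ and $Y$

$$
X = U, \qquad Y = \frac{1}{3}V - \frac{2}{3}U.
$$

Hence, the Jacobian of the transformation is given by

$$
J = \frac{\partial x}{\partial u} \frac{\partial y}{\partial v} - \frac{\partial x}{\partial v} \frac{\partial y}{\partial u} = 1 \cdot \frac{1}{3} - 0 \cdot \left(-\frac{2}{3}\right) = \frac{1}{3}.
$$

The joint density function of $U$ and $V$ is

$$
\begin{aligned}
g(u, v) &= |J|\, f(R(u, v), S(u, v)) \\
&= \left|\frac{1}{3}\right| f\left(u, \tfrac{1}{3}v - \tfrac{2}{3}u\right) \\
&= \frac{1}{3}\, e^{-u}\, e^{-\frac{1}{3}v + \frac{2}{3}u} \\
&= \frac{1}{3}\, e^{-\left(\frac{u + v}{3}\right)}.
\end{aligned}
$$

Since

$$
0 < x < \infty, \qquad 0 < y < \infty,
$$

we get

$$
0 < u < \infty, \qquad 0 < v < \infty.
$$

Further, since $v = 2u + 3y$ and $3y > 0$, we have

$$
v > 2u.
$$

Hence, the domain of $g(u, v)$ where nonzero is given by

$$
0 < 2u < v < \infty.
$$

The joint density $g(u, v)$ of the random variables $U$ and $V$ is given by

$$
g(u, v) = \begin{cases} \frac{1}{3}\, e^{-\left(\frac{u + v}{3}\right)} & \text{for } 0 < 2u < v < \infty \\ 0 & \text{otherwise.} \end{cases}
$$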

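Example 10.10. Let $X$ and $Y$ be independent random variables, each with density function

$$
f(x) = \begin{cases} \lambda e^{-\lambda x} & \text{for } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}
$$

where $\lambda > 0$. Let $U = X + 2Y$ and $V = 2X + Y$. What is the joint density of $U$ and $V$?

Answer: Since

$$
U = X + 2Y, \qquad V = 2X + Y,
$$

we get by solving for $X$ and $Y$

$$
X = -\frac{1}{3}U + \frac{2}{3}V, \qquad Y = \frac{2}{3}U - \frac{1}{3}V.
$$

Hence, the Jacobian of the transformation is given by

$$
\begin{aligned}
J &= \frac{\partial x}{\partial u} \frac{\partial y}{\partial v} - \frac{\partial x}{\partial v} \frac{\partial y}{\partial u} \\
&= \left(-\frac{1}{3}\right)\left(-\frac{1}{3}\right) - \left(\frac{2}{3}\right)\left(\frac{2}{3}\right) \\
&= \frac{1}{9} - \frac{4}{9} \\
&= -\frac{1}{3}.
\end{aligned}
$$

The joint density function of $U$ and $V$ is

$$
\begin{aligned}
g(u, v) &= |J|\, f(R(u, v), S(u, v)) \\
&= \left|-\frac{1}{3}\right| f(R(u, v))\, f(S(u, v)) \\
&= \frac{1}{3}\, \lambda e^{-\lambda R(u, v)}\, \lambda e^{-\lambda S(u, v)} \\
&= \frac{1}{3}\, \lambda^2 e^{-\lambda [R(u, v) + S(u, v)]} \\
&= \frac{1}{3}\, \lambda^2 e^{-\lambda \left(\frac{u + v}{3}\right)}.
\end{aligned}
$$

The density is positive only where $x > 0$ and $y > 0$, that is, where $2v - u > 0$ and $2u - v > 0$. Hence, the joint density $g(u, v)$ of the random variables $U$ and $V$ is given by

$$
g(u, v) = \begin{cases} \frac{1}{3}\, \lambda^2 e^{-\lambda \left(\frac{u + v}{3}\right)} & \text{for } 0 < \frac{u}{2} < v < 2u < \infty \\ 0 & \text{otherwise.} \end{cases}
$$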
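A useful sanity check for Example 10.10 is that the joint density of $(U, V)$ integrates to one over the region where $x = (2v - u)/3$ and $y = (2u - v)/3$ are both positive, namely $u/2 < v < 2u$. The sketch below is a rough midpoint-rule check with illustrative names and an assumed rate parameter `lam`; it is not part of the original example.

```python
import math

def joint_density_uv(u, v, lam=1.0):
    """(1/3) lam^2 exp(-lam (u + v) / 3) where x = (2v - u)/3 > 0 and y = (2u - v)/3 > 0."""
    x = (2.0 * v - u) / 3.0
    y = (2.0 * u - v) / 3.0
    if x <= 0.0 or y <= 0.0:
        return 0.0
    return lam * lam * math.exp(-lam * (u + v) / 3.0) / 3.0

def total_mass(lam=1.0, u_max=60.0, steps_u=600, steps_v=200):
    """Midpoint-rule integral of the joint density over the strip u/2 < v < 2u."""
    du = u_max / steps_u
    total = 0.0
    for i in range(steps_u):
        u = (i + 0.5) * du
        dv = 1.5 * u / steps_v            # the strip (u/2, 2u) has width 1.5 u
        for j in range(steps_v):
            v = 0.5 * u + (j + 0.5) * dv
            total += joint_density_uv(u, v, lam) * dv * du
    return total

if __name__ == "__main__":
    print(total_mass())   # should be close to 1
```

Integrating the same expression over the whole quadrant $u > 0$, $v > 0$ would give 3 rather than 1, which is why the support matters.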

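Example 10.11. Let $X$ and $Y$ be independent random variables, each with density function

$$
f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}, \qquad -\infty < x < \infty.
$$

Let $U = \frac{X}{Y}$ and $V = Y$. What is the joint density of $U$ and $V$? Also, what is the density of $U$?

Answer: Since

$$
U = \frac{X}{Y}, \qquad V = Y,
$$

we get by solving for $X$ and $Y$

$$
X = UV, \qquad Y = V.
$$

Hence, the Jacobian of the transformation is given by

$$
J = \frac{\partial x}{\partial u} \frac{\partial y}{\partial v} - \frac{\partial x}{\partial v} \frac{\partial y}{\partial u} = v \cdot (1) - u \cdot (0) = v.
$$

The joint density function of $U$ and $V$ is

$$
\begin{aligned}
g(u, v) &= |J|\, f(R(u, v), S(u, v)) \\
&= |v|\, f(R(u, v))\, f(S(u, v)) \\
&= |v|\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}R^2(u, v)}\, \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}S^2(u, v)} \\
&= |v|\, \frac{1}{2\pi}\, e^{-\frac{1}{2}\left[R^2(u, v) + S^2(u, v)\right]} \\
&= |v|\, \frac{1}{2\pi}\, e^{-\frac{1}{2}\left[u^2 v^2 + v^2\right]} \\
&= |v|\, \frac{1}{2\pi}\, e^{-\frac{1}{2}v^2\left(u^2 + 1\right)}.
\end{aligned}
$$

Hence, the joint density $g(u, v)$ of the random variables $U$ and $V$ is given by

$$
g(u, v) = |v|\, \frac{1}{2\pi}\, e^{-\frac{1}{2}v^2\left(u^2 + 1\right)},
$$

where $-\infty < u < \infty$ and $-\infty < v < \infty$.

Next, we want to find the density of $U$. We can obtain this by finding the marginal of $U$ from the joint density of $U$ and $V$. Hence, the marginal $g_1(u)$ of $U$ is given by

$$
\begin{aligned}
g_1(u) &= \int_{-\infty}^{\infty} g(u, v)\, dv \\
&= \int_{-\infty}^{\infty} |v|\, \frac{1}{2\pi}\, e^{-\frac{1}{2}v^2\left(u^2 + 1\right)}\, dv \\
&= \int_{-\infty}^{0} (-v)\, \frac{1}{2\pi}\, e^{-\frac{1}{2}v^2\left(u^2 + 1\right)}\, dv + \int_{0}^{\infty} v\, \frac{1}{2\pi}\, e^{-\frac{1}{2}v^2\left(u^2 + 1\right)}\, dv \\
&= \frac{1}{2\pi}\, \frac{1}{2} \left[ \frac{2}{u^2 + 1}\, e^{-\frac{1}{2}v^2\left(u^2 + 1\right)} \right]_{-\infty}^{0} + \frac{1}{2\pi}\, \frac{1}{2} \left[ \frac{-2}{u^2 + 1}\, e^{-\frac{1}{2}v^2\left(u^2 + 1\right)} \right]_{0}^{\infty} \\
&= \frac{1}{2\pi}\, \frac{1}{u^2 + 1} + \frac{1}{2\pi}\, \frac{1}{u^2 + 1} \\
&= \frac{1}{\pi\left(u^2 + 1\right)}.
\end{aligned}
$$

Thus $U$ has the standard Cauchy distribution, that is, $U \sim CAU(0)$.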
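The marginalization step in Example 10.11 can be reproduced numerically. The following sketch (illustrative names, midpoint rule, truncated integration range) integrates the joint density over $v$ and compares the result with $1/(\pi(u^2+1))$.

```python
import math

def joint_density(u, v):
    """g(u, v) = |v| / (2 pi) * exp(-v^2 (u^2 + 1) / 2)."""
    return abs(v) / (2.0 * math.pi) * math.exp(-0.5 * v * v * (u * u + 1.0))

def marginal_u(u, v_max=12.0, steps=24000):
    """Midpoint-rule integral of g(u, v) over -v_max < v < v_max."""
    dv = 2.0 * v_max / steps
    return sum(joint_density(u, -v_max + (k + 0.5) * dv) for k in range(steps)) * dv

def cauchy_density(u):
    """Standard Cauchy density 1 / (pi (u^2 + 1))."""
    return 1.0 / (math.pi * (u * u + 1.0))

if __name__ == "__main__":
    for u in (-2.0, 0.0, 0.5, 3.0):
        print(u, marginal_u(u), cauchy_density(u))
```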

Remark 10.1. If $X$ and $Y$ are independent and standard normal random variables, then the quotient $\frac{X}{Y}$ is always a Cauchy random variable. However, the converse of this is not true. For example, if $X$ and $Y$ are independent and each have the same density function

$$
f(x) = \frac{\sqrt{2}}{\pi}\, \frac{1}{1 + x^4}, \qquad -\infty < x < \infty,
$$

then it can be shown that the random variable $\frac{X}{Y}$ is a Cauchy random variable. Laha (1959) and Kotlarski (1960) have given a complete description of the family of all probability density functions $f$ such that the quotient $\frac{X}{Y}$ follows the standard Cauchy distribution whenever $X$ and $Y$ are independent and identically distributed random variables with common density $f$.

Example 10.12. Let $X$ have a Poisson distribution with mean $\lambda$. Find a transformation $T(x)$ so that $Var(T(X))$ is free of $\lambda$, for large values of $\lambda$.

Answer: We expand the function $T(x)$ by Taylor's series about $\lambda$. Then, neglecting the higher order terms for large values of $\lambda$, we get

$$
T(x) = T(\lambda) + (x - \lambda)\, T'(\lambda) + \cdots
$$

where $T'(\lambda)$ represents the derivative of $T(x)$ at $x = \lambda$. Now, we compute the variance of $T(X)$.

$$
\begin{aligned}
Var(T(X)) &= Var\left(T(\lambda) + (X - \lambda)\, T'(\lambda) + \cdots\right) \\
&= Var(T(\lambda)) + Var\left((X - \lambda)\, T'(\lambda)\right) \\
&= 0 + [T'(\lambda)]^2\, Var(X - \lambda) \\
&= [T'(\lambda)]^2\, Var(X) \\
&= [T'(\lambda)]^2\, \lambda.
\end{aligned}
$$

We want $Var(T(X))$ to be free of $\lambda$ for large $\lambda$. Therefore, we have

$$
[T'(\lambda)]^2\, \lambda = k,
$$

where $k$ is a constant. From this, we get

$$
T'(\lambda) = \frac{c}{\sqrt{\lambda}},
$$

where $c = \sqrt{k}$. Solving this differential equation, we get

$$
T(\lambda) = c \int \frac{1}{\sqrt{\lambda}}\, d\lambda = 2c\sqrt{\lambda}.
$$

Hence, the transformation $T(x) = 2c\sqrt{x}$ will free $Var(T(X))$ of $\lambda$ if the random variable $X \sim POI(\lambda)$.
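To see the variance-stabilizing effect of $T(x) = 2c\sqrt{x}$ concretely, one can compute $Var(T(X))$ directly from the Poisson probabilities. The sketch below takes $c = 1$ and truncates the series far in the upper tail; both choices, and the function names, are assumptions made for illustration.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ POI(lam), computed in log space to avoid overflow."""
    return math.exp(-lam + k * math.log(lam) - math.lgamma(k + 1))

def variance_of_transform(transform, lam):
    """Var(T(X)) for X ~ POI(lam), summing the pmf far into the upper tail."""
    upper = int(lam + 20.0 * math.sqrt(lam) + 50)
    mean = second = 0.0
    for k in range(upper + 1):
        p = poisson_pmf(k, lam)
        t = transform(k)
        mean += t * p
        second += t * t * p
    return second - mean * mean

def stabilized(x):
    """T(x) = 2 c sqrt(x) with c = 1."""
    return 2.0 * math.sqrt(x)

if __name__ == "__main__":
    for lam in (25.0, 100.0, 400.0):
        print(lam, variance_of_transform(lambda x: x, lam),
              variance_of_transform(stabilized, lam))
```

The untransformed variance grows with $\lambda$, while the variance of $2\sqrt{X}$ stays near $c^2 = 1$ for the large values of $\lambda$ tried here.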

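Example 10.13. Let $X \sim POI(\lambda_1)$ and $Y \sim POI(\lambda_2)$. What is the probability density function of $X + Y$ if $X$ and $Y$ are independent?

Answer: Let us denote $U = X + Y$ and $V = X$. First of all, we find the joint density of $U$ and $V$ and then summing the joint density we determine the marginal of $U$, which is the density function of $X + Y$. Now writing $X$ and $Y$ in terms of $U$ and $V$, we get

$$
X = V, \qquad Y = U - X = U - V.
$$

Hence, the Jacobian of the transformation is given by

$$
J = \frac{\partial x}{\partial u} \frac{\partial y}{\partial v} - \frac{\partial x}{\partial v} \frac{\partial y}{\partial u} = (0)(-1) - (1)(1) = -1.
$$

The joint density function of $U$ and $V$ is

$$
\begin{aligned}
g(u, v) &= |J|\, f(R(u, v), S(u, v)) \\
&= |-1|\, f(v, u - v) \\
&= f(v)\, f(u - v) \\
&= \left( \frac{e^{-\lambda_1} \lambda_1^{v}}{v!} \right) \left( \frac{e^{-\lambda_2} \lambda_2^{u - v}}{(u - v)!} \right) \\
&= \frac{e^{-(\lambda_1 + \lambda_2)}\, \lambda_1^{v}\, \lambda_2^{u - v}}{(v)!\, (u - v)!},
\end{aligned}
$$

where $v = 0, 1, 2, ..., u$ and $u = 0, 1, 2, ..., \infty$. Hence, the marginal density of $U$ is given by

$$
\begin{aligned}
g_1(u) &= \sum_{v=0}^{u} \frac{e^{-(\lambda_1 + \lambda_2)}\, \lambda_1^{v}\, \lambda_2^{u - v}}{(v)!\, (u - v)!} \\
&= e^{-(\lambda_1 + \lambda_2)} \sum_{v=0}^{u} \frac{\lambda_1^{v}\, \lambda_2^{u - v}}{(v)!\, (u - v)!} \\
&= e^{-(\lambda_1 + \lambda_2)} \sum_{v=0}^{u} \frac{1}{u!} \binom{u}{v} \lambda_1^{v}\, \lambda_2^{u - v} \\
&= \frac{e^{-(\lambda_1 + \lambda_2)}}{u!}\, (\lambda_1 + \lambda_2)^{u}.
\end{aligned}
$$

Thus, the density function of $U = X + Y$ is given by

$$
g_1(u) = \begin{cases} \dfrac{e^{-(\lambda_1 + \lambda_2)}}{u!}\, (\lambda_1 + \lambda_2)^{u} & \text{for } u = 0, 1, 2, ..., \infty \\ 0 & \text{otherwise.} \end{cases}
$$

This example tells us that if $X \sim POI(\lambda_1)$ and $Y \sim POI(\lambda_2)$ and they are independent, then $X + Y \sim POI(\lambda_1 + \lambda_2)$.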
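The summation over $v$ in Example 10.13 is easy to reproduce. The following sketch (function names and the sample rates 1.5 and 2.5 are illustrative assumptions) sums $P(X = v)\,P(Y = u - v)$ and compares the result with the Poisson density of mean $\lambda_1 + \lambda_2$.

```python
import math

def poisson_pmf(k, lam):
    """P(X = k) for X ~ POI(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

def pmf_of_sum(u, lam1, lam2):
    """g1(u) = sum over v = 0..u of P(X = v) P(Y = u - v), the marginal of U = X + Y."""
    return sum(poisson_pmf(v, lam1) * poisson_pmf(u - v, lam2) for v in range(u + 1))

if __name__ == "__main__":
    for u in range(6):
        print(u, pmf_of_sum(u, 1.5, 2.5), poisson_pmf(u, 4.0))
```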


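Theorem 10.3. Let the joint density of the random variables $X$ and $Y$ be $f(x, y)$. Then the probability density functions of $X + Y$, $XY$, and $\frac{Y}{X}$ are given by

$$
\begin{aligned}
h_{X+Y}(v) &= \int_{-\infty}^{\infty} f(u, v - u)\, du \\
h_{XY}(v) &= \int_{-\infty}^{\infty} \frac{1}{|u|}\, f\left(u, \frac{v}{u}\right) du \\
h_{\frac{Y}{X}}(v) &= \int_{-\infty}^{\infty} |u|\, f(u, vu)\, du,
\end{aligned}
$$

respectively.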

Proof: Let $U = X$ and $V = X + Y$, so that $X = R(U, V) = U$ and $Y = S(U, V) = V - U$. Hence, the Jacobian of the transformation is given by

$$
J = \frac{\partial x}{\partial u} \frac{\partial y}{\partial v} - \frac{\partial x}{\partial v} \frac{\partial y}{\partial u} = 1.
$$

The joint density function of $U$ and $V$ is

$$
\begin{aligned}
g(u, v) &= |J|\, f(R(u, v), S(u, v)) \\
&= f(R(u, v), S(u, v)) \\
&= f(u, v - u).
\end{aligned}
$$

Hence, the marginal density of $V = X + Y$ is given by

$$
h_{X+Y}(v) = \int_{-\infty}^{\infty} f(u, v - u)\, du.
$$

Similarly, one can obtain the other two density functions. This completes the proof.

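In addition, if the random variables $X$ and $Y$ in Theorem 10.3 are independent and have the probability density functions $f(x)$ and $g(y)$ respectively, then we have

$$
\begin{aligned}
h_{X+Y}(z) &= \int_{-\infty}^{\infty} g(y)\, f(z - y)\, dy \\
h_{XY}(z) &= \int_{-\infty}^{\infty} \frac{1}{|y|}\, g(y)\, f\left(\frac{z}{y}\right) dy \\
h_{\frac{X}{Y}}(z) &= \int_{-\infty}^{\infty} |y|\, g(y)\, f(zy)\, dy.
\end{aligned}
$$

Here the integration variable is $y$, so the third formula is stated for the quotient $\frac{X}{Y}$; the density of $\frac{Y}{X}$ follows by interchanging the roles of $f$ and $g$.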

Each of the following figures shows how the distribution of the random variable $X + Y$ is obtained from the joint distribution of $(X, Y)$.

[Figures: two panels, each titled "Distribution of X+Y".]

Example 10.14. Roll an unbiased die twice. If $X$ denotes the outcome in the first roll and $Y$ denotes the outcome in the second roll, what is the distribution of the random variable $Z = \max\{X, Y\}$?

Answer: The space of $X$ is $R_X = \{1, 2, 3, 4, 5, 6\}$. Similarly, the space of $Y$ is $R_Y = \{1, 2, 3, 4, 5, 6\}$. Hence the space of the random variable $(X, Y)$ is $R_X \times R_Y$. The following table shows the distribution of $(X, Y)$.

    y \ x     1      2      3      4      5      6
      1      1/36   1/36   1/36   1/36   1/36   1/36
      2      1/36   1/36   1/36   1/36   1/36   1/36
      3      1/36   1/36   1/36   1/36   1/36   1/36
      4      1/36   1/36   1/36   1/36   1/36   1/36
      5      1/36   1/36   1/36   1/36   1/36   1/36
      6      1/36   1/36   1/36   1/36   1/36   1/36

The space of the random variable $Z = \max\{X, Y\}$ is $R_Z = \{1, 2, 3, 4, 5, 6\}$. Thus $Z = 1$ only if $(X, Y) = (1, 1)$. Hence $P(Z = 1) = \frac{1}{36}$. Similarly, $Z = 2$ only if $(X, Y) = (1, 2)$, $(2, 2)$ or $(2, 1)$. Hence, $P(Z = 2) = \frac{3}{36}$. Proceeding in a similar manner, we get the distribution of $Z$, which is summarized in the table below.

      z        1      2      3      4      5      6
     h(z)     1/36   3/36   5/36   7/36   9/36   11/36

In this example, the random variable $Z$ may be described as the best out of two rolls. Note that the probability density of $Z$ can also be stated as

$$
h(z) = \frac{2z - 1}{36}, \qquad \text{for } z \in \{1, 2, 3, 4, 5, 6\}.
$$
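The distribution of the maximum in Example 10.14 can be obtained by brute-force enumeration of the 36 equally likely outcomes. The sketch below uses exact fractions; the function name and its `sides` and `rolls` parameters are illustrative assumptions.

```python
from fractions import Fraction
from itertools import product

def distribution_of_max(sides=6, rolls=2):
    """Exact pmf of the maximum of independent fair die rolls, by enumeration."""
    weight = Fraction(1, sides ** rolls)
    pmf = {z: Fraction(0) for z in range(1, sides + 1)}
    for outcome in product(range(1, sides + 1), repeat=rolls):
        pmf[max(outcome)] += weight
    return pmf

if __name__ == "__main__":
    for z, p in distribution_of_max().items():
        print(z, p, Fraction(2 * z - 1, 36))
```

Calling `distribution_of_max(rolls=3)` gives the corresponding distribution for the best of three rolls.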

10.4. Convolution Method for Sums of Random Variables

In this section, we illustrate how the convolution technique can be used in finding the distribution of the sum of random variables when they are independent. This convolution technique does not work if the random variables are not independent.

Definition 10.1. Let $f$ and $g$ be two real valued functions. The convolution of $f$ and $g$, denoted by $f \star g$, is defined as

$$
\begin{aligned}
(f \star g)(z) &= \int_{-\infty}^{\infty} f(z - y)\, g(y)\, dy \\
&= \int_{-\infty}^{\infty} g(z - x)\, f(x)\, dx.
\end{aligned}
$$

Hence from this definition it is clear that $f \star g = g \star f$.

Let $X$ and $Y$ be two independent random variables with probability density functions $f(x)$ and $g(y)$. Then by Theorem 10.3, we get

$$
h(z) = \int_{-\infty}^{\infty} f(z - y)\, g(y)\, dy.
$$

Thus, this result shows that the density of the random variable $Z = X + Y$ is the convolution of the density of $X$ with the density of $Y$.

Example 10.15. What is the probability density of the sum of two independent random variables, each of which is uniformly distributed over the interval $[0, 1]$?

Answer: Let $Z = X + Y$, where $X \sim UNIF(0, 1)$ and $Y \sim UNIF(0, 1)$. Hence, the density function $f(x)$ of the random variable $X$ is given by

$$
f(x) = \begin{cases} 1 & \text{for } 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}
$$

Similarly, the density function $g(y)$ of $Y$ is given by

$$
g(y) = \begin{cases} 1 & \text{for } 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}
$$

Since $X$ and $Y$ are independent, the density function of $Z$ can be obtained by the method of convolution. Since the sum $z = x + y$ is between 0 and 2, we consider two cases. First, suppose $0 \le z \le 1$, then

$$
\begin{aligned}
h(z) &= (f \star g)(z) \\
&= \int_{-\infty}^{\infty} f(z - x)\, g(x)\, dx \\
&= \int_{0}^{1} f(z - x)\, g(x)\, dx \\
&= \int_{0}^{z} f(z - x)\, g(x)\, dx + \int_{z}^{1} f(z - x)\, g(x)\, dx \\
&= \int_{0}^{z} f(z - x)\, g(x)\, dx + 0 \qquad \text{(since } f(z - x) = 0 \text{ between } z \text{ and } 1\text{)} \\
&= \int_{0}^{z} dx \\
&= z.
\end{aligned}
$$

Similarly, if $1 \le z \le 2$, then

$$
\begin{aligned}
h(z) &= (f \star g)(z) \\
&= \int_{-\infty}^{\infty} f(z - x)\, g(x)\, dx \\
&= \int_{0}^{1} f(z - x)\, g(x)\, dx \\
&= \int_{0}^{z-1} f(z - x)\, g(x)\, dx + \int_{z-1}^{1} f(z - x)\, g(x)\, dx \\
&= 0 + \int_{z-1}^{1} f(z - x)\, g(x)\, dx \qquad \text{(since } f(z - x) = 0 \text{ between } 0 \text{ and } z - 1\text{)} \\
&= \int_{z-1}^{1} dx \\
&= 2 - z.
\end{aligned}
$$

Thus, the density function of $Z = X + Y$ is given by

$$
h(z) = \begin{cases} 0 & \text{for } -\infty < z \le 0 \\ z & \text{for } 0 \le z \le 1 \\ 2 - z & \text{for } 1 \le z \le 2 \\ 0 & \text{for } 2 < z < \infty. \end{cases}
$$

The graph of this density function looks like a tent and it is called a tent function. However, in the literature, this density function is known as Simpson's distribution.
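The tent shape of Example 10.15 can be recovered by approximating the convolution integral numerically. The sketch below uses a plain midpoint rule; the helper `convolve_at` and its integration limits are assumptions chosen for illustration.

```python
def uniform_density(x):
    """Density of UNIF(0, 1)."""
    return 1.0 if 0.0 <= x <= 1.0 else 0.0

def convolve_at(f, g, z, lower, upper, steps=20000):
    """Midpoint-rule approximation of the integral of f(z - x) g(x) dx over [lower, upper]."""
    dx = (upper - lower) / steps
    total = 0.0
    for k in range(steps):
        x = lower + (k + 0.5) * dx
        total += f(z - x) * g(x)
    return total * dx

def tent(z):
    """Closed-form density of the sum of two independent UNIF(0, 1) variables."""
    if 0.0 <= z <= 1.0:
        return z
    if 1.0 < z <= 2.0:
        return 2.0 - z
    return 0.0

if __name__ == "__main__":
    for z in (0.25, 0.5, 1.0, 1.3, 1.9, 2.5):
        print(z, convolve_at(uniform_density, uniform_density, z, 0.0, 1.0), tent(z))
```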

Example 10.16. What is the probability density of the sum of two independent random variables, each of which is gamma with parameters $\alpha = 1$ and $\theta = 1$?

Answer: Let $Z = X + Y$, where $X \sim GAM(1, 1)$ and $Y \sim GAM(1, 1)$. Hence, the density function $f(x)$ of the random variable $X$ is given by

$$
f(x) = \begin{cases} e^{-x} & \text{for } 0 < x < \infty \\ 0 & \text{otherwise.} \end{cases}
$$

Similarly, the density function $g(y)$ of $Y$ is given by

$$
g(y) = \begin{cases} e^{-y} & \text{for } 0 < y < \infty \\ 0 & \text{otherwise.} \end{cases}
$$

Since $X$ and $Y$ are independent, the density function of $Z$ can be obtained by the method of convolution. Notice that the sum $z = x + y$ is between 0 and $\infty$, and $0 < x < z$. Hence, the density function of $Z$ is given by

$$
\begin{aligned}
h(z) &= (f \star g)(z) \\
&= \int_{-\infty}^{\infty} f(z - x)\, g(x)\, dx \\
&= \int_{0}^{\infty} f(z - x)\, g(x)\, dx \\
&= \int_{0}^{z} e^{-(z - x)}\, e^{-x}\, dx \\
&= \int_{0}^{z} e^{-z + x}\, e^{-x}\, dx \\
&= \int_{0}^{z} e^{-z}\, dx \\
&= z\, e^{-z} \\
&= \frac{1}{\Gamma(2)\, 1^2}\, z^{2-1}\, e^{-\frac{z}{1}}.
\end{aligned}
$$

Hence $Z \sim GAM(1, 2)$. Thus, if $X \sim GAM(1, 1)$ and $Y \sim GAM(1, 1)$, then $X + Y \sim GAM(1, 2)$, that is, $X + Y$ is gamma with $\alpha = 2$ and $\theta = 1$. Recall that a gamma random variable with $\alpha = 1$ is known as an exponential random variable with parameter $\theta$. Thus, in view of the above example, we see that the sum of two independent exponential random variables is not necessarily an exponential variable.

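Example 10.17. What is the probability density of the sum of two independent random variables, each of which is standard normal?

Answer: Let $Z = X + Y$, where $X \sim N(0, 1)$ and $Y \sim N(0, 1)$. Hence, the density function $f(x)$ of the random variable $X$ is given by

$$
f(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{x^2}{2}}.
$$

Similarly, the density function $g(y)$ of $Y$ is given by

$$
g(y) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{y^2}{2}}.
$$

Since $X$ and $Y$ are independent, the density function of $Z$ can be obtained by the method of convolution. Notice that the sum $z = x + y$ is between $-\infty$ and $\infty$. Hence, the density function of $Z$ is given by

$$
\begin{aligned}
h(z) &= (f \star g)(z) \\
&= \int_{-\infty}^{\infty} f(z - x)\, g(x)\, dx \\
&= \frac{1}{2\pi} \int_{-\infty}^{\infty} e^{-\frac{(z - x)^2}{2}}\, e^{-\frac{x^2}{2}}\, dx \\
&= \frac{1}{2\pi}\, e^{-\frac{z^2}{4}} \int_{-\infty}^{\infty} e^{-\left(x - \frac{z}{2}\right)^2}\, dx \\
&= \frac{1}{2\pi}\, e^{-\frac{z^2}{4}}\, \sqrt{\pi} \left[ \int_{-\infty}^{\infty} \frac{1}{\sqrt{\pi}}\, e^{-\left(x - \frac{z}{2}\right)^2}\, dx \right] \\
&= \frac{1}{2\pi}\, e^{-\frac{z^2}{4}}\, \sqrt{\pi} \left[ \int_{-\infty}^{\infty} \frac{1}{\sqrt{\pi}}\, e^{-w^2}\, dw \right], \qquad \text{where } w = x - \frac{z}{2} \\
&= \frac{1}{\sqrt{4\pi}}\, e^{-\frac{z^2}{4}} \\
&= \frac{1}{\sqrt{4\pi}}\, e^{-\frac{1}{2}\left(\frac{z - 0}{\sqrt{2}}\right)^2}.
\end{aligned}
$$

The integral in the brackets equals one, since the integrand is the normal density function with mean $\mu = 0$ and variance $\sigma^2 = \frac{1}{2}$. Hence the sum of two independent standard normal random variables is again a normal random variable with mean zero and variance 2.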

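Example 10.18. What is the probability density of the sum of two independent random variables, each of which is Cauchy?

Answer: Let $Z = X + Y$, where $X \sim CAU(0)$ and $Y \sim CAU(0)$. Hence, the density functions $f(x)$ and $g(y)$ of the random variables $X$ and $Y$ are given by

$$
f(x) = \frac{1}{\pi\left(1 + x^2\right)} \qquad \text{and} \qquad g(y) = \frac{1}{\pi\left(1 + y^2\right)},
$$

respectively. Since $X$ and $Y$ are independent, the density function of $Z$ can be obtained by the method of convolution. Notice that the sum $z = x + y$ is between $-\infty$ and $\infty$. Hence, the density function of $Z$ is given by

$$
\begin{aligned}
h(z) &= (f \star g)(z) \\
&= \int_{-\infty}^{\infty} f(z - x)\, g(x)\, dx \\
&= \int_{-\infty}^{\infty} \frac{1}{\pi\left(1 + (z - x)^2\right)}\, \frac{1}{\pi\left(1 + x^2\right)}\, dx \\
&= \frac{1}{\pi^2} \int_{-\infty}^{\infty} \frac{1}{1 + (z - x)^2}\, \frac{1}{1 + x^2}\, dx.
\end{aligned}
$$

To integrate the above integral, we decompose the integrand using partial fraction decomposition. Hence

$$
\frac{1}{1 + (z - x)^2}\, \frac{1}{1 + x^2} = \frac{2Ax + B}{1 + x^2} + \frac{2C(z - x) + D}{1 + (z - x)^2}
$$

where

$$
A = \frac{1}{z\left(4 + z^2\right)} = C \qquad \text{and} \qquad B = \frac{1}{4 + z^2} = D.
$$

Now integration yields

$$
\begin{aligned}
\frac{1}{\pi^2} &\int_{-\infty}^{\infty} \frac{1}{1 + (z - x)^2}\, \frac{1}{1 + x^2}\, dx \\
&= \frac{1}{\pi^2 z^2 \left(4 + z^2\right)} \left[ z \ln\left( \frac{1 + x^2}{1 + (z - x)^2} \right) + z^2 \tan^{-1} x - z^2 \tan^{-1}(z - x) \right]_{-\infty}^{\infty} \\
&= \frac{1}{\pi^2 z^2 \left(4 + z^2\right)} \left[ 0 + z^2 \pi + z^2 \pi \right] \\
&= \frac{2}{\pi\left(4 + z^2\right)}.
\end{aligned}
$$

Hence the sum of two independent $CAU(0)$ random variables is not a $CAU(0)$ random variable: its density $\frac{2}{\pi(4 + z^2)}$ is that of a standard Cauchy variable multiplied by 2.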

If $X \sim CAU(0)$ and $Y \sim CAU(0)$ are independent, then it can be easily shown using Example 10.18 that the random variable $Z = \frac{X + Y}{2}$ is again Cauchy, that is, $Z \sim CAU(0)$. This is a remarkable property of the Cauchy distribution.
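Both statements about the Cauchy distribution can be checked numerically. The sketch below approximates the convolution integral of Example 10.18 by a midpoint rule on a wide truncated range (the range, step count, and names are assumptions) and then rescales to obtain the density of the average $\frac{X + Y}{2}$.

```python
import math

def cauchy_density(x):
    """Standard Cauchy density 1 / (pi (1 + x^2))."""
    return 1.0 / (math.pi * (1.0 + x * x))

def density_of_sum(z, half_width=400.0, steps=80000):
    """Midpoint-rule approximation of the convolution integral at z."""
    dx = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps):
        x = -half_width + (k + 0.5) * dx
        total += cauchy_density(z - x) * cauchy_density(x)
    return total * dx

def density_of_average(a):
    """If Z = X + Y has density h, then A = Z / 2 has density 2 h(2a)."""
    return 2.0 * density_of_sum(2.0 * a)

if __name__ == "__main__":
    for z in (-3.0, 0.0, 1.0, 5.0):
        print(z, density_of_sum(z), 2.0 / (math.pi * (4.0 + z * z)))
    print(density_of_average(0.7), cauchy_density(0.7))
```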

So far we have considered the convolution of two continuous independent random variables. However, the concept can be modified to the case when the random variables are discrete.

Let $X$ and $Y$ be two independent discrete random variables, both taking on values that are integers. Let $Z = X + Y$ be the sum of the two random variables. Hence $Z$ takes values on the set of integers. Suppose that $X = n$ where $n$ is some integer. Then $Z = z$ if and only if $Y = z - n$. Thus the event $(Z = z)$ is the union of the pairwise disjoint events $(X = n \text{ and } Y = z - n)$, where $n$ runs over the integers. The pdf $h(z)$ of $Z$ can be obtained as follows:

$$
P(Z = z) = \sum_{n=-\infty}^{\infty} P(X = n)\, P(Y = z - n)
$$

which is

$$
h(z) = \sum_{n=-\infty}^{\infty} f(n)\, g(z - n),
$$

where $f(x)$ and $g(y)$ are the pdfs of $X$ and $Y$, respectively.

Definition 10.2. Let $X$ and $Y$ be two independent integer-valued discrete random variables, with pdfs $f(x)$ and $g(y)$ respectively. Then the convolution of $f(x)$ and $g(y)$ is the pdf $h = f \star g$ given by

$$
h(m) = \sum_{n=-\infty}^{\infty} f(n)\, g(m - n),
$$

for $m = ..., -2, -1, 0, 1, 2, ...$. The function $h(z)$ is the pdf of the discrete random variable $Z = X + Y$.

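Example 10.19. Let each of the independent random variables $X$ and $Y$ represent the outcome of a roll of a fair six-sided die. What is the probability density function of the sum of $X$ and $Y$?

Answer: Since the range of $X$ as well as $Y$ is $\{1, 2, 3, 4, 5, 6\}$, the range of $Z = X + Y$ is $R_Z = \{2, 3, 4, ..., 11, 12\}$. The pdf of $Z$ is given by

$$
\begin{aligned}
h(2) &= f(1)\, g(1) = \frac{1}{6} \cdot \frac{1}{6} = \frac{1}{36} \\
h(3) &= f(1)\, g(2) + f(2)\, g(1) = \frac{1}{6} \cdot \frac{1}{6} + \frac{1}{6} \cdot \frac{1}{6} = \frac{2}{36} \\
h(4) &= f(1)\, g(3) + f(2)\, g(2) + f(3)\, g(1) = \frac{1}{6} \cdot \frac{1}{6} + \frac{1}{6} \cdot \frac{1}{6} + \frac{1}{6} \cdot \frac{1}{6} = \frac{3}{36}.
\end{aligned}
$$

Continuing in this manner we obtain $h(5) = \frac{4}{36}$, $h(6) = \frac{5}{36}$, $h(7) = \frac{6}{36}$, $h(8) = \frac{5}{36}$, $h(9) = \frac{4}{36}$, $h(10) = \frac{3}{36}$, $h(11) = \frac{2}{36}$, and $h(12) = \frac{1}{36}$. Putting these into one expression we have

$$
\begin{aligned}
h(z) &= \sum_{n=1}^{z-1} f(n)\, g(z - n) \\
&= \frac{6 - |z - 7|}{36}, \qquad z = 2, 3, 4, ..., 12.
\end{aligned}
$$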

It is easy to note that the convolution operation is commutative as well as associative. Using the associativity of the convolution operation one can compute the pdf of the random variable $S_n = X_1 + X_2 + \cdots + X_n$, where $X_1, X_2, ..., X_n$ are independent random variables each having the same pdf $f(x)$. Then the pdf of $S_1$ is $f(x)$. Since $S_n = S_{n-1} + X_n$ and the pdf of $X_n$ is $f(x)$, the pdf of $S_n$ can be obtained by induction.
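The inductive scheme $S_n = S_{n-1} + X_n$ can be sketched with a small discrete convolution routine. The code below is an illustration using exact fractions and dictionaries as pmfs; `convolve` and `pmf_of_sum` are hypothetical helper names. For two dice it reproduces the formula $\frac{6 - |z - 7|}{36}$ of Example 10.19.

```python
from fractions import Fraction

def convolve(f, g):
    """Discrete convolution h(m) = sum over n of f(n) g(m - n), for pmfs stored as dicts."""
    h = {}
    for n, p in f.items():
        for k, q in g.items():
            h[n + k] = h.get(n + k, Fraction(0)) + p * q
    return h

def pmf_of_sum(f, n):
    """pmf of S_n = X_1 + ... + X_n for independent X_i with common pmf f (n >= 1)."""
    result = dict(f)                    # pmf of S_1
    for _ in range(n - 1):
        result = convolve(result, f)    # S_k = S_{k-1} + X_k
    return result

die = {face: Fraction(1, 6) for face in range(1, 7)}

if __name__ == "__main__":
    two = pmf_of_sum(die, 2)
    for z in sorted(two):
        print(z, two[z], Fraction(6 - abs(z - 7), 36))
```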


10.5. Moment Generating Function Method

We know that if $X$ and $Y$ are independent random variables, then

$$
M_{X+Y}(t) = M_X(t)\, M_Y(t).
$$

This result can be used to find the distribution of the sum $X + Y$. Like the convolution method, this method can be used in finding the distribution of $X + Y$ if $X$ and $Y$ are independent random variables. We briefly illustrate the method using the following example.

Example 10.20. Let $X \sim POI(\lambda_1)$ and $Y \sim POI(\lambda_2)$. What is the probability density function of $X + Y$ if $X$ and $Y$ are independent?

Answer: Since $X \sim POI(\lambda_1)$ and $Y \sim POI(\lambda_2)$, we get

$$
M_X(t) = e^{\lambda_1 (e^t - 1)}
$$

and

$$
M_Y(t) = e^{\lambda_2 (e^t - 1)}.
$$

Further, since $X$ and $Y$ are independent, we have

$$
\begin{aligned}
M_{X+Y}(t) &= M_X(t)\, M_Y(t) \\
&= e^{\lambda_1 (e^t - 1)}\, e^{\lambda_2 (e^t - 1)} \\
&= e^{\lambda_1 (e^t - 1) + \lambda_2 (e^t - 1)} \\
&= e^{(\lambda_1 + \lambda_2)(e^t - 1)},
\end{aligned}
$$

that is, $X + Y \sim POI(\lambda_1 + \lambda_2)$. Hence the density function $h(z)$ of $Z = X + Y$ is given by

$$
h(z) = \begin{cases} \dfrac{e^{-(\lambda_1 + \lambda_2)}}{z!}\, (\lambda_1 + \lambda_2)^{z} & \text{for } z = 0, 1, 2, 3, ... \\ 0 & \text{otherwise.} \end{cases}
$$

Compare this example to Example 10.13. You will see that the moment method has a definite advantage over the convolution method. However, if you use the moment method in Example 10.15, then you will have a problem identifying the form of the density function of the random variable $X + Y$. Thus, it is difficult to say which method always works. Most of the time we pick a particular method based on the type of problem at hand.
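As a small numerical illustration of the moment generating function method in Example 10.20, the sketch below compares the product $M_X(t)\,M_Y(t)$ with $E\left(e^{tZ}\right)$ computed directly from the $POI(\lambda_1 + \lambda_2)$ density. The function names, the sample rates, and the truncation at 200 terms are assumptions made for illustration.

```python
import math

def poisson_mgf(t, lam):
    """M(t) = exp(lam (e^t - 1)) for a POI(lam) random variable."""
    return math.exp(lam * (math.exp(t) - 1.0))

def poisson_pmf(z, lam):
    """P(Z = z) for Z ~ POI(lam), computed in log space."""
    return math.exp(-lam + z * math.log(lam) - math.lgamma(z + 1))

def mgf_from_pmf(t, pmf, terms=200):
    """E[e^{tZ}] computed directly from a pmf on 0, 1, 2, ... (truncated series)."""
    return sum(math.exp(t * z) * pmf(z) for z in range(terms))

if __name__ == "__main__":
    lam1, lam2 = 1.5, 2.5
    for t in (-1.0, 0.0, 0.5, 1.0):
        product = poisson_mgf(t, lam1) * poisson_mgf(t, lam2)
        direct = mgf_from_pmf(t, lambda z: poisson_pmf(z, lam1 + lam2))
        print(t, product, direct)
```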


Example 10.21. What is the probability density function of the sum of two independent random variables, each of which is gamma with parameters $\theta$ and $\alpha$?

Answer: Let $X$ and $Y$ be two independent gamma random variables with parameters $\theta$ and $\alpha$, that is, $X \sim GAM(\theta, \alpha)$ and $Y \sim GAM(\theta, \alpha)$. From Theorem 6.3, the moment generating functions of $X$ and $Y$ are obtained as $M_X(t) = (1 - \theta t)^{-\alpha}$ and $M_Y(t) = (1 - \theta t)^{-\alpha}$, respectively. Since $X$ and $Y$ are independent, we have

$$
\begin{aligned}
M_{X+Y}(t) &= M_X(t)\, M_Y(t) \\
&= (1 - \theta t)^{-\alpha}\, (1 - \theta t)^{-\alpha} \\
&= (1 - \theta t)^{-2\alpha}.
\end{aligned}
$$

Thus $X + Y$ has the moment generating function of a gamma random variable with parameters $\theta$ and $2\alpha$. Therefore

$$
X + Y \sim GAM(\theta, 2\alpha).
$$

10.6. Review Exercises

1. Let $X$ be a continuous random variable with density function

$$
f(x) = \begin{cases} e^{-2x} + \frac{1}{2}\, e^{-x} & \text{for } 0 < x < \infty \\ 0 & \text{otherwise.} \end{cases}
$$

If $Y = e^{-2X}$, then what is the density function of $Y$ where nonzero?

2. Suppose that $X$ is a random variable with density function

$$
f(x) = \begin{cases} \frac{3}{8}\, x^2 & \text{for } 0 < x < 2 \\ 0 & \text{otherwise.} \end{cases}
$$

Let $Y = mX^2$, where $m$ is a fixed positive number. What is the density function of $Y$ where nonzero?

3. Let $X$ be a continuous random variable with density function

$$
f(x) = \begin{cases} 2\, e^{-2x} & \text{for } x > 0 \\ 0 & \text{otherwise} \end{cases}
$$

and let $Y = e^{-X}$. What is the density function $g(y)$ of $Y$ where nonzero?


4. What is the probability density of the sum of two independent random variables, each of which is uniformly distributed over the interval $[-2, 2]$?

5. Let $X$ and $Y$ be random variables with joint density function

$$
f(x, y) = \begin{cases} e^{-x} & \text{for } 0 < x < \infty;\ 0 < y < 1 \\ 0 & \text{elsewhere.} \end{cases}
$$

If $Z = X + 2Y$, then what is the joint density of $X$ and $Z$ where nonzero?

6. Let $X$ be a continuous random variable with density function

$$
f(x) = \begin{cases} \dfrac{2}{x^2} & \text{for } 1 < x < 2 \\ 0 & \text{elsewhere.} \end{cases}
$$

If $Y = \sqrt{X}$, then what is the density function of $Y$ for $1 < y < \sqrt{2}$?

7. What is the probability density of the sum of two independent random variables, each of which has the density function given by

$$
f(x) = \begin{cases} \dfrac{10 - x}{50} & \text{for } 0 < x < 10 \\ 0 & \text{elsewhere?} \end{cases}
$$

8. What is the probability density of the sum of two independent random variables, each of which has the density function given by

$$
f(x) = \begin{cases} \dfrac{a}{x^2} & \text{for } a \le x < \infty \\ 0 & \text{elsewhere?} \end{cases}
$$

9. Roll an unbiased die 3 times. If $U$ denotes the outcome in the first roll, $V$ denotes the outcome in the second roll, and $W$ denotes the outcome of the third roll, what is the distribution of the random variable $Z = \max\{U, V, W\}$?

10. The probability density of $V$, the velocity of a gas molecule, by the Maxwell-Boltzmann law is given by

$$
f(v) = \begin{cases} \dfrac{4h^3}{\sqrt{\pi}}\, v^2\, e^{-h^2 v^2} & \text{for } 0 \le v < \infty \\ 0 & \text{otherwise,} \end{cases}
$$

where $h$ is the Planck's constant. If $m$ represents the mass of a gas molecule, then what is the probability density of the kinetic energy $Z = \frac{1}{2} m V^2$?

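11. If the random variables $X$ and $Y$ have the joint density

$$
f(x, y) = \begin{cases} \frac{6}{7}\, x & \text{for } 1 \le x + y \le 2,\ x \ge 0,\ y \ge 0 \\ 0 & \text{otherwise,} \end{cases}
$$

what is the joint density of $U = 2X + 3Y$ and $V = 4X + Y$?

12. If the random variables $X$ and $Y$ have the joint density

$$
f(x, y) = \begin{cases} \frac{6}{7}\, x & \text{for } 1 \le x + y \le 2,\ x \ge 0,\ y \ge 0 \\ 0 & \text{otherwise,} \end{cases}
$$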

what is the density of X

13. Let $X$ and $Y$ have the joint probability density function

$$
f(x, y) = \begin{cases} \frac{5}{16}\, x y^2 & \text{for } 0 < x < y < 2 \\ 0 & \text{elsewhere.} \end{cases}
$$

What is the joint density function of $U = 3X - 2Y$ and $V = X + 2Y$ where it is nonzero?

14. Let $X$ and $Y$ have the joint probability density function

$$
f(x, y) = \begin{cases} 4x & \text{for } 0 < x < \sqrt{y} < 1 \\ 0 & \text{elsewhere.} \end{cases}
$$

What is the joint density function of $U = 5X - 2Y$ and $V = 3X + 2Y$ where it is nonzero?

15. Let $X$ and $Y$ have the joint probability density function

$$
f(x, y) = \begin{cases} 4x & \text{for } 0 < x < \sqrt{y} < 1 \\ 0 & \text{elsewhere.} \end{cases}
$$

What is the density function of $X - Y$?

16. Let $X$ and $Y$ have the joint probability density function

$$
f(x, y) = \begin{cases} 4x & \text{for } 0 < x < \sqrt{y} < 1 \\ 0 & \text{elsewhere.} \end{cases}
$$

What is the density function of X

17. Let $X$ and $Y$ have the joint probability density function

$$
f(x, y) = \begin{cases} 4x & \text{for } 0 < x < \sqrt{y} < 1 \\ 0 & \text{elsewhere.} \end{cases}
$$

What is the density function of $XY$?

18. Let $X$ and $Y$ have the joint probability density function

$$
f(x, y) = \begin{cases} \frac{5}{16}\, x y^2 & \text{for } 0 < x < y < 2 \\ 0 & \text{elsewhere.} \end{cases}
$$

What is the density function of Y

19. If $X$ is a uniform random variable on the interval $[0, 2]$ and $Y$ is a uniform random variable on the interval $[0, 3]$, then what is the probability density function of $X + Y$ if they are independent?

20. What is the probability density function of the sum of two independent random variables, each of which is binomial with parameters $n$ and $p$?

21. What is the probability density function of the sum of two independent random variables, each of which is exponential with mean $\theta$?

22. What is the probability density function of the average of two independent random variables, each of which is Cauchy with parameter $\theta = 0$?

23. What is the probability density function of the average of two independent random variables, each of which is normal with mean $\mu$ and variance $\sigma^2$?

24. Both roots of the quadratic equation $x^2 + \alpha x + \beta = 0$ can take all values from $-1$ to $+1$ with equal probabilities. What are the probability density functions of the coefficients $\alpha$ and $\beta$?

25. If $A, B, C$ are independent random variables uniformly distributed on the interval from zero to one, then what is the probability that the quadratic equation $Ax^2 + Bx + C = 0$ has real solutions?

26. The price of a stock on a given trading day changes according to the distribution $f(-1) = \frac{1}{4}$, $f(0) = \frac{1}{2}$, $f(1) = \frac{1}{8}$, and $f(2) = \frac{1}{8}$. Find the distribution for the change in stock price after two (independent) trading days.

Some Special Discrete Bivariate Distributions 290

Chapter 11

SOME

SPECIAL DISCRETE

BIVARIATE DISTRIBUTIONS

In this chapter, we shall examine some bivariate discrete probability density functions. Ever since the first statistical use of the bivariate normal distribution (which will be treated in Chapter 12) by Galton and Dickson in 1886, attempts have been made to develop families of bivariate distributions to describe non-normal variations. In many textbooks, only the bivariate normal distribution is treated. This is partly due to the dominant role the bivariate normal distribution has played in statistical theory. Recently, however, other bivariate distributions have started appearing in probability models and statistical sampling problems. This chapter will focus on some well-known bivariate discrete distributions whose marginal distributions are well-known univariate distributions. The book of K.V. Mardia gives an excellent exposition on various bivariate distributions.

11.1. Bivariate Bernoulli Distribution

We deﬁne a bivariate Bernoulli random variable by specifying the form

of the joint probability distribution.

Definition 11.1. A discrete bivariate random variable $(X, Y)$ is said to have the bivariate Bernoulli distribution if its joint probability density is of the form

$$
f(x, y) = \begin{cases} \dfrac{1!}{x!\, y!\, (1 - x - y)!}\, p_1^{x}\, p_2^{y}\, (1 - p_1 - p_2)^{1 - x - y} & \text{if } x, y = 0, 1 \\ 0 & \text{otherwise,} \end{cases}
$$

where $0 < p_1, p_2, p_1 + p_2 < 1$ and $x + y \le 1$. We denote a bivariate Bernoulli random variable by writing $(X, Y) \sim BER(p_1, p_2)$.

In the following theorem, we present the expected values and the variances of $X$ and $Y$, the covariance between $X$ and $Y$, and their joint moment generating function. Recall that the joint moment generating function of $X$ and $Y$ is defined as $M(s, t) := E\left(e^{sX + tY}\right)$.

Theorem 11.1. Let $(X, Y) \sim BER(p_1, p_2)$, where $p_1$ and $p_2$ are parameters. Then

$$
\begin{aligned}
E(X) &= p_1 \\
E(Y) &= p_2 \\
Var(X) &= p_1 (1 - p_1) \\
Var(Y) &= p_2 (1 - p_2) \\
Cov(X, Y) &= -p_1 p_2 \\
M(s, t) &= 1 - p_1 - p_2 + p_1 e^{s} + p_2 e^{t}.
\end{aligned}
$$

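Proof: First, we derive the joint moment generating function of $X$ and $Y$ and then establish the rest of the results from it. The joint moment generating function of $X$ and $Y$ is given by

$$
\begin{aligned}
M(s, t) &= E\left(e^{sX + tY}\right) \\
&= \sum_{x=0}^{1} \sum_{y=0}^{1} f(x, y)\, e^{sx + ty} \\
&= f(0, 0) + f(1, 0)\, e^{s} + f(0, 1)\, e^{t} + f(1, 1)\, e^{t + s} \\
&= 1 - p_1 - p_2 + p_1 e^{s} + p_2 e^{t} + 0 \cdot e^{t + s} \\
&= 1 - p_1 - p_2 + p_1 e^{s} + p_2 e^{t}.
\end{aligned}
$$

The expected value of $X$ is given by

$$
\begin{aligned}
E(X) &= \left. \frac{\partial M}{\partial s} \right|_{(0,0)} \\
&= \left. \frac{\partial}{\partial s} \left( 1 - p_1 - p_2 + p_1 e^{s} + p_2 e^{t} \right) \right|_{(0,0)} \\
&= \left. p_1 e^{s} \right|_{(0,0)} \\
&= p_1.
\end{aligned}
$$

Similarly, the expected value of $Y$ is given by

$$
\begin{aligned}
E(Y) &= \left. \frac{\partial M}{\partial t} \right|_{(0,0)} \\
&= \left. \frac{\partial}{\partial t} \left( 1 - p_1 - p_2 + p_1 e^{s} + p_2 e^{t} \right) \right|_{(0,0)} \\
&= \left. p_2 e^{t} \right|_{(0,0)} \\
&= p_2.
\end{aligned}
$$

The product moment of $X$ and $Y$ is

$$
\begin{aligned}
E(XY) &= \left. \frac{\partial^2 M}{\partial t\, \partial s} \right|_{(0,0)} \\
&= \left. \frac{\partial^2}{\partial t\, \partial s} \left( 1 - p_1 - p_2 + p_1 e^{s} + p_2 e^{t} \right) \right|_{(0,0)} \\
&= \left. \frac{\partial}{\partial t} \left( p_1 e^{s} \right) \right|_{(0,0)} \\
&= 0.
\end{aligned}
$$

Therefore the covariance of $X$ and $Y$ is

$$
Cov(X, Y) = E(XY) - E(X)\, E(Y) = -p_1 p_2.
$$

Similarly, it can be shown that

$$
E(X^2) = p_1 \qquad \text{and} \qquad E(Y^2) = p_2.
$$

Thus, we have

$$
Var(X) = E(X^2) - E(X)^2 = p_1 - p_1^2 = p_1 (1 - p_1)
$$

and

$$
Var(Y) = E(Y^2) - E(Y)^2 = p_2 - p_2^2 = p_2 (1 - p_2).
$$

This completes the proof of the theorem.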
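The moments in Theorem 11.1 can be confirmed by direct enumeration over the four support points. The sketch below uses exact fractions; the function names and the sample values of $p_1$ and $p_2$ are illustrative assumptions.

```python
from fractions import Fraction

def bernoulli_pmf(p1, p2):
    """Joint pmf of (X, Y) ~ BER(p1, p2); the point (1, 1) has probability zero."""
    return {(0, 0): 1 - p1 - p2, (1, 0): p1, (0, 1): p2, (1, 1): Fraction(0)}

def expectation(pmf, func):
    """E[func(X, Y)] under the joint pmf."""
    return sum(func(x, y) * p for (x, y), p in pmf.items())

def summary(p1, p2):
    """Return (E(X), E(Y), Var(X), Var(Y), Cov(X, Y))."""
    pmf = bernoulli_pmf(p1, p2)
    ex = expectation(pmf, lambda x, y: x)
    ey = expectation(pmf, lambda x, y: y)
    var_x = expectation(pmf, lambda x, y: x * x) - ex ** 2
    var_y = expectation(pmf, lambda x, y: y * y) - ey ** 2
    cov = expectation(pmf, lambda x, y: x * y) - ex * ey
    return ex, ey, var_x, var_y, cov

if __name__ == "__main__":
    print(summary(Fraction(1, 5), Fraction(3, 10)))
```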

The next theorem presents some information regarding the conditional distributions $f(x/y)$ and $f(y/x)$.

Theorem 11.2. Let $(X, Y) \sim BER(p_1, p_2)$, where $p_1$ and $p_2$ are parameters. Then the conditional distributions $f(y/x)$ and $f(x/y)$ are also Bernoulli and

$$
\begin{aligned}
E(Y/x) &= \frac{p_2 (1 - x)}{1 - p_1} \\
E(X/y) &= \frac{p_1 (1 - y)}{1 - p_2} \\
Var(Y/x) &= \frac{p_2 (1 - p_1 - p_2)(1 - x)}{(1 - p_1)^2} \\
Var(X/y) &= \frac{p_1 (1 - p_1 - p_2)(1 - y)}{(1 - p_2)^2}.
\end{aligned}
$$

Proof: Notice that

$$f(y/x) = \frac{f(x, y)}{f_1(x)} = \frac{f(x, y)}{\sum_{y=0}^{1} f(x, y)} = \frac{f(x, y)}{f(x, 0) + f(x, 1)}, \qquad x = 0, 1;\ y = 0, 1;\ 0 \le x + y \le 1.$$

Hence

$$f(1/0) = \frac{f(0, 1)}{f(0, 0) + f(0, 1)} = \frac{p_2}{1 - p_1 - p_2 + p_2} = \frac{p_2}{1 - p_1}$$

and

$$f(1/1) = \frac{f(1, 1)}{f(1, 0) + f(1, 1)} = \frac{0}{p_1 + 0} = 0.$$

Now we compute the conditional expectation $E(Y/x)$ for $x = 0, 1$. Hence

$$E(Y/x = 0) = \sum_{y=0}^{1} y\, f(y/0) = f(1/0) = \frac{p_2}{1 - p_1}$$

and

$$E(Y/x = 1) = f(1/1) = 0.$$

Merging these together, we have

$$E(Y/x) = \frac{p_2 (1 - x)}{1 - p_1}, \qquad x = 0, 1.$$

Similarly, we compute

$$E(Y^2/x = 0) = \sum_{y=0}^{1} y^2 f(y/0) = f(1/0) = \frac{p_2}{1 - p_1}$$

and

$$E(Y^2/x = 1) = f(1/1) = 0.$$

Therefore

$$\begin{aligned}
\operatorname{Var}(Y/x = 0) &= E(Y^2/x = 0) - E(Y/x = 0)^2 \\
&= \frac{p_2}{1 - p_1} - \left(\frac{p_2}{1 - p_1}\right)^2 \\
&= \frac{p_2 (1 - p_1) - p_2^2}{(1 - p_1)^2} \\
&= \frac{p_2 (1 - p_1 - p_2)}{(1 - p_1)^2}
\end{aligned}$$

and

$$\operatorname{Var}(Y/x = 1) = 0.$$

Merging these together, we have

$$\operatorname{Var}(Y/x) = \frac{p_2 (1 - p_1 - p_2)(1 - x)}{(1 - p_1)^2}, \qquad x = 0, 1.$$

The conditional expectation $E(X/y)$ and the conditional variance $\operatorname{Var}(X/y)$ can be obtained in a similar manner. We leave their derivations to the reader.
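The moments of the bivariate Bernoulli distribution can also be checked numerically by summing over its four support points. The following is a minimal sketch, not part of the original text: the function names and the sample values $p_1 = 0.3$, $p_2 = 0.5$ are illustrative assumptions, and the probabilities $f(0,0) = 1 - p_1 - p_2$, $f(1,0) = p_1$, $f(0,1) = p_2$, $f(1,1) = 0$ are the ones used in the proof of Theorem 11.1.

```python
def bivariate_bernoulli_pmf(p1, p2):
    """Joint pmf of BER(p1, p2) as a dict {(x, y): probability}."""
    return {(0, 0): 1 - p1 - p2, (1, 0): p1, (0, 1): p2, (1, 1): 0.0}


def expectation(pmf, g):
    """E[g(X, Y)] computed by direct summation over the support."""
    return sum(g(x, y) * prob for (x, y), prob in pmf.items())


def conditional_y_given_x(pmf, x):
    """Mean and variance of Y given X = x, computed from the joint pmf."""
    f1 = sum(prob for (a, _), prob in pmf.items() if a == x)  # marginal f1(x)
    mean = sum(b * prob for (a, b), prob in pmf.items() if a == x) / f1
    second = sum(b * b * prob for (a, b), prob in pmf.items() if a == x) / f1
    return mean, second - mean ** 2


# Illustrative parameter values (assumed, not from the text).
p1, p2 = 0.3, 0.5
pmf = bivariate_bernoulli_pmf(p1, p2)

ex = expectation(pmf, lambda x, y: x)
ey = expectation(pmf, lambda x, y: y)
var_x = expectation(pmf, lambda x, y: x * x) - ex ** 2
cov = expectation(pmf, lambda x, y: x * y) - ex * ey
mean0, var0 = conditional_y_given_x(pmf, 0)
mean1, var1 = conditional_y_given_x(pmf, 1)

print("E(X), E(Y):", ex, ey)
print("Var(X), Cov(X, Y):", var_x, cov)
print("E(Y/x=0), Var(Y/x=0):", mean0, var0)
```

Direct summation is chosen here because the support has only four points, so every quantity in Theorems 11.1 and 11.2 can be compared with its closed form without any approximation beyond floating-point rounding.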

11.2. Bivariate Binomial Distribution

The bivariate binomial random variable is defined by specifying the form of the joint probability distribution.

Definition 11.2. A discrete bivariate random variable $(X, Y)$ is said to have the bivariate binomial distribution with parameters $n, p_1, p_2$ if its joint probability density is of the form

$$f(x, y) = \begin{cases} \dfrac{n!}{x!\, y!\, (n - x - y)!}\, p_1^{x}\, p_2^{y}\, (1 - p_1 - p_2)^{n - x - y}, & \text{if } x, y = 0, 1, ..., n \\[2mm] 0 & \text{otherwise,} \end{cases}$$

where $0 < p_1, p_2$, $p_1 + p_2 < 1$, $x + y \le n$ and $n$ is a positive integer. We denote a bivariate binomial random variable by writing $(X, Y) \sim BIN(n, p_1, p_2)$.

The bivariate binomial distribution is also known as the trinomial distribution. It will be shown in the proof of Theorem 11.4 that the marginal distributions of $X$ and $Y$ are $BIN(n, p_1)$ and $BIN(n, p_2)$, respectively.

The following two examples illustrate the applicability of bivariate bino-

mial distribution.

Example 11.1. In the city of Louisville on a Friday night, radio station A

has 50 percent listeners, radio station B has 30 percent listeners, and radio

station C has 20 percent listeners. What is the probability that among 8

listeners in the city of Louisville, randomly chosen on a Friday night, 5 will

be listening to station A , 2 will be listening to station B , and 1 will be

listening to station C?

Answer: Let $X$ denote the number of listeners that listen to station A, and $Y$ denote the number of listeners that listen to station B. Then the joint distribution of $X$ and $Y$ is bivariate binomial with $n = 8$, $p_1 = \frac{5}{10}$, and $p_2 = \frac{3}{10}$. The probability that among 8 listeners in the city of Louisville, randomly chosen on a Friday night, 5 will be listening to station A, 2 will be listening to station B, and 1 will be listening to station C is given by

$$\begin{aligned}
P(X = 5, Y = 2) &= f(5, 2) \\
&= \frac{n!}{x!\, y!\, (n - x - y)!}\, p_1^{x}\, p_2^{y}\, (1 - p_1 - p_2)^{n - x - y} \\
&= \frac{8!}{5!\, 2!\, 1!} \left(\frac{5}{10}\right)^{5} \left(\frac{3}{10}\right)^{2} \left(\frac{2}{10}\right) \\
&= 0.0945.
\end{aligned}$$

Example 11.2. A certain game involves rolling a fair die and watching the

numbers of rolls of 4 and 5. What is the probability that in 10 rolls of the

die one 4 and three 5 will be observed?

Answer: Let $X$ denote the number of 4s and $Y$ denote the number of 5s. Then the joint distribution of $X$ and $Y$ is bivariate binomial with $n = 10$, $p_1 = \frac{1}{6}$, $p_2 = \frac{1}{6}$ and $1 - p_1 - p_2 = \frac{4}{6}$. Hence the probability that in 10 rolls of the die one 4 and three 5s will be observed is

$$\begin{aligned}
P(X = 1, Y = 3) &= f(1, 3) \\
&= \frac{n!}{x!\, y!\, (n - x - y)!}\, p_1^{x}\, p_2^{y}\, (1 - p_1 - p_2)^{n - x - y} \\
&= \frac{10!}{1!\, 3!\, (10 - 1 - 3)!} \left(\frac{1}{6}\right)^{1} \left(\frac{1}{6}\right)^{3} \left(1 - \frac{1}{6} - \frac{1}{6}\right)^{10 - 1 - 3} \\
&= \frac{10!}{1!\, 3!\, (10 - 1 - 3)!} \left(\frac{1}{6}\right)^{1} \left(\frac{1}{6}\right)^{3} \left(\frac{4}{6}\right)^{6} \\
&= \frac{573440}{10077696} \\
&= 0.0569.
\end{aligned}$$
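The two worked examples above can be reproduced by evaluating the joint density of Definition 11.2 directly. The sketch below is illustrative only; the function name `trinomial_pmf` is an assumption of this example and not notation from the text.

```python
from math import factorial


def trinomial_pmf(x, y, n, p1, p2):
    """Joint pmf of BIN(n, p1, p2) at (x, y), following Definition 11.2."""
    if x < 0 or y < 0 or x + y > n:
        return 0.0
    # Trinomial coefficient n! / (x! y! (n - x - y)!), computed exactly.
    coeff = factorial(n) // (factorial(x) * factorial(y) * factorial(n - x - y))
    return coeff * p1 ** x * p2 ** y * (1 - p1 - p2) ** (n - x - y)


# Example 11.1: 8 listeners, stations A, B, C with shares 0.5, 0.3, 0.2.
radio = trinomial_pmf(5, 2, 8, 0.5, 0.3)

# Example 11.2: 10 rolls of a fair die, one 4 and three 5s.
dice = trinomial_pmf(1, 3, 10, 1 / 6, 1 / 6)

print(round(radio, 4), round(dice, 4))
```

The coefficient is computed with integer arithmetic so that only the powers of the probabilities introduce floating-point rounding.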

Using the transformation method discussed in Chapter 10, it can be shown that if $X_1$, $X_2$ and $X_3$ are independent binomial random variables, then the joint distribution of the random variables

$$X = X_1 + X_2 \quad \text{and} \quad Y = X_1 + X_3$$

is bivariate binomial. This approach is known as the trivariate reduction technique for constructing bivariate distributions.

To establish the next theorem, we need a generalization of the binomial theorem, which was treated in Chapter 1. The following result generalizes the binomial theorem and can be called the trinomial theorem. Similar to the proof of the binomial theorem, one can establish

$$(a + b + c)^{n} = \sum_{x=0}^{n} \sum_{y=0}^{n} \binom{n}{x,\, y}\, a^{x}\, b^{y}\, c^{n - x - y},$$

where $0 \le x + y \le n$ and

$$\binom{n}{x,\, y} = \frac{n!}{x!\, y!\, (n - x - y)!}.$$

In the following theorem, we present the expected values of $X$ and $Y$, their variances, the covariance between $X$ and $Y$, and the joint moment generating function.

Theorem 11.3. Let $(X, Y) \sim BIN(n, p_1, p_2)$, where $n$, $p_1$ and $p_2$ are parameters. Then

$$\begin{aligned}
E(X) &= n\, p_1 \\
E(Y) &= n\, p_2 \\
\operatorname{Var}(X) &= n\, p_1 (1 - p_1) \\
\operatorname{Var}(Y) &= n\, p_2 (1 - p_2) \\
\operatorname{Cov}(X, Y) &= -n\, p_1 p_2 \\
M(s, t) &= \left(1 - p_1 - p_2 + p_1 e^{s} + p_2 e^{t}\right)^{n}.
\end{aligned}$$

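Proof: First, we find the joint moment generating function of $X$ and $Y$. The moment generating function $M(s, t)$ is given by

$$\begin{aligned}
M(s, t) &= E\left(e^{sX+tY}\right) \\
&= \sum_{x=0}^{n} \sum_{y=0}^{n-x} e^{sx+ty} f(x, y) \\
&= \sum_{x=0}^{n} \sum_{y=0}^{n-x} e^{sx+ty}\, \frac{n!}{x!\, y!\, (n - x - y)!}\, p_1^{x}\, p_2^{y}\, (1 - p_1 - p_2)^{n - x - y} \\
&= \sum_{x=0}^{n} \sum_{y=0}^{n-x} \frac{n!}{x!\, y!\, (n - x - y)!}\, \left(e^{s} p_1\right)^{x} \left(e^{t} p_2\right)^{y} (1 - p_1 - p_2)^{n - x - y} \\
&= \left(1 - p_1 - p_2 + p_1 e^{s} + p_2 e^{t}\right)^{n} \qquad \text{(by the trinomial theorem).}
\end{aligned}$$

The expected value of $X$ is given by

$$\begin{aligned}
E(X) &= \left.\frac{\partial M}{\partial s}\right|_{(0,0)} \\
&= \left.\frac{\partial}{\partial s}\left(1 - p_1 - p_2 + p_1 e^{s} + p_2 e^{t}\right)^{n}\right|_{(0,0)} \\
&= \left. n \left(1 - p_1 - p_2 + p_1 e^{s} + p_2 e^{t}\right)^{n-1} p_1 e^{s} \right|_{(0,0)} \\
&= n\, p_1.
\end{aligned}$$

Similarly, the expected value of $Y$ is given by

$$\begin{aligned}
E(Y) &= \left.\frac{\partial M}{\partial t}\right|_{(0,0)} \\
&= \left.\frac{\partial}{\partial t}\left(1 - p_1 - p_2 + p_1 e^{s} + p_2 e^{t}\right)^{n}\right|_{(0,0)} \\
&= \left. n \left(1 - p_1 - p_2 + p_1 e^{s} + p_2 e^{t}\right)^{n-1} p_2 e^{t} \right|_{(0,0)} \\
&= n\, p_2.
\end{aligned}$$

The product moment of $X$ and $Y$ is

$$\begin{aligned}
E(XY) &= \left.\frac{\partial^2 M}{\partial t\, \partial s}\right|_{(0,0)} \\
&= \left.\frac{\partial^2}{\partial t\, \partial s}\left(1 - p_1 - p_2 + p_1 e^{s} + p_2 e^{t}\right)^{n}\right|_{(0,0)} \\
&= \left.\frac{\partial}{\partial t}\left[ n \left(1 - p_1 - p_2 + p_1 e^{s} + p_2 e^{t}\right)^{n-1} p_1 e^{s} \right]\right|_{(0,0)} \\
&= n (n - 1)\, p_1 p_2.
\end{aligned}$$

Therefore the covariance of $X$ and $Y$ is

$$\operatorname{Cov}(X, Y) = E(XY) - E(X)\,E(Y) = n(n - 1)\, p_1 p_2 - n^2 p_1 p_2 = -n\, p_1 p_2.$$

Similarly, it can be shown that

$$E(X^2) = n(n - 1)\, p_1^2 + n\, p_1 \quad \text{and} \quad E(Y^2) = n(n - 1)\, p_2^2 + n\, p_2.$$

Thus, we have

$$\begin{aligned}
\operatorname{Var}(X) &= E(X^2) - E(X)^2 \\
&= n(n - 1)\, p_1^2 + n\, p_1 - n^2 p_1^2 \\
&= n\, p_1 (1 - p_1)
\end{aligned}$$

and similarly

$$\operatorname{Var}(Y) = E(Y^2) - E(Y)^2 = n\, p_2 (1 - p_2).$$

This completes the proof of the theorem.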

The following results are needed for the next theorem, and they can be established using the binomial theorem discussed in Chapter 1. For any real numbers $a$ and $b$, we have

$$\sum_{y=0}^{m} y \binom{m}{y} a^{y}\, b^{m-y} = m\, a\, (a + b)^{m-1} \tag{11.1}$$

and

$$\sum_{y=0}^{m} y^2 \binom{m}{y} a^{y}\, b^{m-y} = m\, a\, (m a + b)\, (a + b)^{m-2} \tag{11.2}$$

where $m$ is a positive integer.

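Example 11.3. If $X$ equals the number of ones and $Y$ equals the number of twos and threes when a pair of fair dice are rolled, then what is the correlation coefficient of $X$ and $Y$?

Answer: The joint density of $X$ and $Y$ is bivariate binomial and is given by

$$f(x, y) = \frac{2!}{x!\, y!\, (2 - x - y)!} \left(\frac{1}{6}\right)^{x} \left(\frac{2}{6}\right)^{y} \left(\frac{3}{6}\right)^{2 - x - y}, \qquad 0 \le x + y \le 2,$$

where $x$ and $y$ are nonnegative integers. By Theorem 11.3, we have

$$\operatorname{Var}(X) = n\, p_1 (1 - p_1) = 2 \cdot \frac{1}{6} \left(1 - \frac{1}{6}\right) = \frac{10}{36},$$

$$\operatorname{Var}(Y) = n\, p_2 (1 - p_2) = 2 \cdot \frac{2}{6} \left(1 - \frac{2}{6}\right) = \frac{16}{36},$$

and

$$\operatorname{Cov}(X, Y) = -n\, p_1 p_2 = -2 \cdot \frac{1}{6} \cdot \frac{2}{6} = -\frac{4}{36}.$$

Therefore

$$\operatorname{Corr}(X, Y) = \frac{\operatorname{Cov}(X, Y)}{\sqrt{\operatorname{Var}(X)\, \operatorname{Var}(Y)}} = -\frac{4}{4\sqrt{10}} = -0.3162.$$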
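The correlation coefficient in Example 11.3 can be cross-checked without Theorem 11.3 by summing over the six support points of the joint density. This is a hedged sketch; the helper names are assumptions introduced for illustration.

```python
from math import factorial, sqrt


def trinomial_pmf(x, y, n, p1, p2):
    """Joint pmf of BIN(n, p1, p2) at (x, y)."""
    coeff = factorial(n) // (factorial(x) * factorial(y) * factorial(n - x - y))
    return coeff * p1 ** x * p2 ** y * (1 - p1 - p2) ** (n - x - y)


# Example 11.3: two dice, X counts ones, Y counts twos and threes.
n, p1, p2 = 2, 1 / 6, 2 / 6
support = [(x, y) for x in range(n + 1) for y in range(n - x + 1)]
pmf = {(x, y): trinomial_pmf(x, y, n, p1, p2) for (x, y) in support}


def moment(g):
    """E[g(X, Y)] by direct summation over the support."""
    return sum(g(x, y) * prob for (x, y), prob in pmf.items())


ex, ey = moment(lambda x, y: x), moment(lambda x, y: y)
var_x = moment(lambda x, y: x * x) - ex ** 2
var_y = moment(lambda x, y: y * y) - ey ** 2
cov = moment(lambda x, y: x * y) - ex * ey
corr = cov / sqrt(var_x * var_y)

print(round(var_x, 4), round(var_y, 4), round(cov, 4), round(corr, 4))
```

The negative sign of the result reflects the constraint $x + y \le n$: a die that shows a one cannot also show a two or a three.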

The next theorem presents some information regarding the conditional distributions $f(x/y)$ and $f(y/x)$.

Theorem 11.4. Let $(X, Y) \sim BIN(n, p_1, p_2)$, where $n$, $p_1$ and $p_2$ are parameters. Then the conditional distributions $f(y/x)$ and $f(x/y)$ are also binomial and

$$\begin{aligned}
E(Y/x) &= \frac{p_2 (n - x)}{1 - p_1} \\
E(X/y) &= \frac{p_1 (n - y)}{1 - p_2} \\
\operatorname{Var}(Y/x) &= \frac{p_2 (1 - p_1 - p_2)(n - x)}{(1 - p_1)^2} \\
\operatorname{Var}(X/y) &= \frac{p_1 (1 - p_1 - p_2)(n - y)}{(1 - p_2)^2}.
\end{aligned}$$

Proof: Since $f(y/x) = \frac{f(x, y)}{f_1(x)}$, first we find the marginal density of $X$. The marginal density $f_1(x)$ of $X$ is given by

$$\begin{aligned}
f_1(x) &= \sum_{y=0}^{n-x} \frac{n!}{x!\, y!\, (n - x - y)!}\, p_1^{x}\, p_2^{y}\, (1 - p_1 - p_2)^{n - x - y} \\
&= \frac{n!\, p_1^{x}}{x!\, (n - x)!} \sum_{y=0}^{n-x} \frac{(n - x)!}{y!\, (n - x - y)!}\, p_2^{y}\, (1 - p_1 - p_2)^{n - x - y} \\
&= \binom{n}{x} p_1^{x}\, (1 - p_1 - p_2 + p_2)^{n - x} \qquad \text{(by the binomial theorem)} \\
&= \binom{n}{x} p_1^{x}\, (1 - p_1)^{n - x}.
\end{aligned}$$

In order to compute the conditional expectations, we need the conditional densities. The conditional density of $Y$ given $X = x$ is

$$\begin{aligned}
f(y/x) &= \frac{f(x, y)}{f_1(x)} \\
&= \frac{f(x, y)}{\binom{n}{x} p_1^{x}\, (1 - p_1)^{n - x}} \\
&= \frac{(n - x)!}{(n - x - y)!\, y!}\, p_2^{y}\, (1 - p_1 - p_2)^{n - x - y}\, (1 - p_1)^{x - n} \\
&= (1 - p_1)^{x - n} \binom{n - x}{y} p_2^{y}\, (1 - p_1 - p_2)^{n - x - y}.
\end{aligned}$$

Hence the conditional expectation of $Y$ given the event $X = x$ is

$$\begin{aligned}
E(Y/x) &= \sum_{y=0}^{n-x} y\, (1 - p_1)^{x - n} \binom{n - x}{y} p_2^{y}\, (1 - p_1 - p_2)^{n - x - y} \\
&= (1 - p_1)^{x - n} \sum_{y=0}^{n-x} y \binom{n - x}{y} p_2^{y}\, (1 - p_1 - p_2)^{n - x - y} \\
&= (1 - p_1)^{x - n}\, p_2\, (n - x)\, (1 - p_1)^{n - x - 1} \qquad \text{(by (11.1))} \\
&= \frac{p_2 (n - x)}{1 - p_1}.
\end{aligned}$$

Next, we find the conditional variance of $Y$ given the event $X = x$. For this we need the conditional expectation $E\left(Y^2/x\right)$, which is given by

$$\begin{aligned}
E\left(Y^2/x\right) &= \sum_{y=0}^{n-x} y^2 f(y/x) \\
&= \sum_{y=0}^{n-x} y^2\, (1 - p_1)^{x - n} \binom{n - x}{y} p_2^{y}\, (1 - p_1 - p_2)^{n - x - y} \\
&= (1 - p_1)^{x - n} \sum_{y=0}^{n-x} y^2 \binom{n - x}{y} p_2^{y}\, (1 - p_1 - p_2)^{n - x - y} \\
&= (1 - p_1)^{x - n}\, p_2\, (n - x)\, (1 - p_1)^{n - x - 2}\, [(n - x)\, p_2 + 1 - p_1 - p_2] \qquad \text{(by (11.2))} \\
&= \frac{p_2 (n - x)\, [(n - x)\, p_2 + 1 - p_1 - p_2]}{(1 - p_1)^2}.
\end{aligned}$$

Hence, the conditional variance of $Y$ given $X = x$ is

$$\begin{aligned}
\operatorname{Var}(Y/x) &= E\left(Y^2/x\right) - E(Y/x)^2 \\
&= \frac{p_2 (n - x)\, [(n - x)\, p_2 + 1 - p_1 - p_2]}{(1 - p_1)^2} - \left(\frac{p_2 (n - x)}{1 - p_1}\right)^2 \\
&= \frac{p_2 (1 - p_1 - p_2)(n - x)}{(1 - p_1)^2}.
\end{aligned}$$

Similarly, one can establish

$$E(X/y) = \frac{p_1 (n - y)}{1 - p_2} \quad \text{and} \quad \operatorname{Var}(X/y) = \frac{p_1 (1 - p_1 - p_2)(n - y)}{(1 - p_2)^2}.$$

This completes the proof of the theorem.

Note that $f(y/x)$ in the above theorem is a univariate binomial probability density function. To see this, observe that

$$(1 - p_1)^{x - n} \binom{n - x}{y} p_2^{y}\, (1 - p_1 - p_2)^{n - x - y} = \binom{n - x}{y} \left(\frac{p_2}{1 - p_1}\right)^{y} \left(1 - \frac{p_2}{1 - p_1}\right)^{n - x - y}.$$

Hence, $f(y/x)$ is the probability density function of a binomial random variable with parameters $n - x$ and $\frac{p_2}{1 - p_1}$.

The marginal density $f_2(y)$ of $Y$ can be obtained similarly as

$$f_2(y) = \binom{n}{y} p_2^{y}\, (1 - p_2)^{n - y},$$

where $y = 0, 1, ..., n$. The form of these densities shows that the marginals of the bivariate binomial distribution are again binomial.

Example 11.4. Let $W$ equal the weight of soap in a 1-kilogram box that is distributed in India. Suppose $P(W < 1) = 0.02$ and $P(W > 1.072) = 0.08$. Call a box of soap light, good, or heavy depending on whether $W < 1$, $1 \le W \le 1.072$, or $W > 1.072$, respectively. In a random sample of 50 boxes, let $X$ equal the number of light boxes and $Y$ the number of good boxes. What are the regression and scedastic curves of $Y$ on $X$?

Answer: The joint probability density function of $X$ and $Y$ is given by

$$f(x, y) = \frac{50!}{x!\, y!\, (50 - x - y)!}\, p_1^{x}\, p_2^{y}\, (1 - p_1 - p_2)^{50 - x - y}, \qquad 0 \le x + y \le 50,$$

where $x$ and $y$ are nonnegative integers. Hence, $(X, Y) \sim BIN(n, p_1, p_2)$, where $n = 50$, $p_1 = 0.02$ and $p_2 = 0.90$. The regression curve of $Y$ on $X$ is given by

$$E(Y/x) = \frac{p_2 (n - x)}{1 - p_1} = \frac{0.9\, (50 - x)}{1 - 0.02} = \frac{45}{49}\, (50 - x).$$

The scedastic curve of $Y$ on $X$ is the conditional variance of $Y$ given $X = x$, and it is equal to

$$\operatorname{Var}(Y/x) = \frac{p_2 (1 - p_1 - p_2)(n - x)}{(1 - p_1)^2} = \frac{0.9 \cdot 0.08\, (50 - x)}{(1 - 0.02)^2} = \frac{180}{2401}\, (50 - x).$$

Note that if n = 1, then bivariate binomial distribution reduces to bi-

variate Bernoulli distribution.

11.3. Bivariate Geometric Distribution

Recall that if the random variable $X$ denotes the trial number on which the first success occurs, then $X$ is univariate geometric. The probability density function of a univariate geometric variable is

$$f(x) = p^{x-1} (1 - p), \qquad x = 1, 2, 3, ..., \infty,$$

where $p$ is the probability of failure in a single Bernoulli trial. This univariate geometric distribution can be generalized to the bivariate case. Guldberg (1934) introduced the bivariate geometric distribution and Lundberg (1940) first used it in connection with problems of accident proneness. This distribution has found many applications in various statistical methods.

Definition 11.3. A discrete bivariate random variable $(X, Y)$ is said to have the bivariate geometric distribution with parameters $p_1$ and $p_2$ if its joint probability density is of the form

$$f(x, y) = \begin{cases} \dfrac{(x + y)!}{x!\, y!}\, p_1^{x}\, p_2^{y}\, (1 - p_1 - p_2), & \text{if } x, y = 0, 1, ..., \infty \\[2mm] 0 & \text{otherwise,} \end{cases}$$

where $0 < p_1, p_2$ and $p_1 + p_2 < 1$. We denote a bivariate geometric random variable by writing $(X, Y) \sim GEO(p_1, p_2)$.

Example 11.5. Motor vehicles arriving at an intersection can turn right or left or continue straight ahead. In a study of traffic patterns at this intersection over a long period of time, engineers have noted that 40 percent of the motor vehicles turn left, 25 percent turn right, and the remainder continue straight ahead. For the next ten cars entering the intersection, what is the probability that 5 cars will turn left, 4 cars will turn right, and the last car will go straight ahead?

Answer: Let $X$ denote the number of cars turning left and $Y$ denote the number of cars turning right. Since the last car will go straight ahead, the joint distribution of $X$ and $Y$ is bivariate geometric with parameters $p_1 = 0.4$, $p_2 = 0.25$ and $p_3 = 1 - p_1 - p_2 = 0.35$. For the next ten cars entering the intersection, the probability that 5 cars will turn left, 4 cars will turn right, and the last car will go straight ahead is given by

$$\begin{aligned}
P(X = 5, Y = 4) &= f(5, 4) \\
&= \frac{(x + y)!}{x!\, y!}\, p_1^{x}\, p_2^{y}\, (1 - p_1 - p_2) \\
&= \frac{(5 + 4)!}{5!\, 4!}\, (0.4)^5 (0.25)^4 (1 - 0.4 - 0.25) \\
&= \frac{9!}{5!\, 4!}\, (0.4)^5 (0.25)^4 (0.35) \\
&= 0.00176.
\end{aligned}$$
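The density of Definition 11.3 is easy to evaluate numerically, which gives an independent way to compute $f(5, 4)$ for Example 11.5 and to confirm that the probabilities sum to one. This is a minimal sketch; the function name and the truncation point of 150 terms per variable are assumptions made for illustration.

```python
from math import comb


def bivariate_geometric_pmf(x, y, p1, p2):
    """Joint pmf of GEO(p1, p2) at (x, y), following Definition 11.3."""
    # comb(x + y, x) equals (x + y)! / (x! y!).
    return comb(x + y, x) * p1 ** x * p2 ** y * (1 - p1 - p2)


# Example 11.5: 5 left turns, 4 right turns, then one car going straight.
p1, p2 = 0.4, 0.25
value = bivariate_geometric_pmf(5, 4, p1, p2)

# Truncated double sum; the neglected tail is of order (p1 + p2)**150.
total = sum(bivariate_geometric_pmf(x, y, p1, p2)
            for x in range(150) for y in range(150))

print(value, total)
```

The binomial coefficient counts the orderings of the left and right turns among the first nine cars; the final factor is the probability that the tenth car goes straight.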

The following technical result is essential for proving the next theorem. If $a$ and $b$ are positive real numbers with $0 < a + b < 1$, then

$$\sum_{x=0}^{\infty} \sum_{y=0}^{\infty} \frac{(x + y)!}{x!\, y!}\, a^{x}\, b^{y} = \frac{1}{1 - a - b}. \tag{11.3}$$

In the following theorem, we present the expected values and the variances of $X$ and $Y$, the covariance between $X$ and $Y$, and the moment generating function.

Theorem 11.5. Let $(X, Y) \sim GEO(p_1, p_2)$, where $p_1$ and $p_2$ are parameters. Then

$$\begin{aligned}
E(X) &= \frac{p_1}{1 - p_1 - p_2} \\
E(Y) &= \frac{p_2}{1 - p_1 - p_2} \\
\operatorname{Var}(X) &= \frac{p_1 (1 - p_2)}{(1 - p_1 - p_2)^2} \\
\operatorname{Var}(Y) &= \frac{p_2 (1 - p_1)}{(1 - p_1 - p_2)^2} \\
\operatorname{Cov}(X, Y) &= \frac{p_1 p_2}{(1 - p_1 - p_2)^2} \\
M(s, t) &= \frac{1 - p_1 - p_2}{1 - p_1 e^{s} - p_2 e^{t}}.
\end{aligned}$$

Proof: We only find the joint moment generating function $M(s, t)$ of $X$ and $Y$ and leave the proof of the rest to the reader. The joint moment generating function $M(s, t)$ is given by

$$\begin{aligned}
M(s, t) &= E\left(e^{sX+tY}\right) \\
&= \sum_{x=0}^{\infty} \sum_{y=0}^{\infty} e^{sx+ty} f(x, y) \\
&= \sum_{x=0}^{\infty} \sum_{y=0}^{\infty} e^{sx+ty}\, \frac{(x + y)!}{x!\, y!}\, p_1^{x}\, p_2^{y}\, (1 - p_1 - p_2) \\
&= (1 - p_1 - p_2) \sum_{x=0}^{\infty} \sum_{y=0}^{\infty} \frac{(x + y)!}{x!\, y!} \left(p_1 e^{s}\right)^{x} \left(p_2 e^{t}\right)^{y} \\
&= \frac{1 - p_1 - p_2}{1 - p_1 e^{s} - p_2 e^{t}} \qquad \text{(by (11.3)).}
\end{aligned}$$

The following results are needed for the next theorem. Let $a$ be a positive real number less than one. Then

$$\sum_{y=0}^{\infty} \frac{(x + y)!}{x!\, y!}\, a^{y} = \frac{1}{(1 - a)^{x+1}}, \tag{11.4}$$

$$\sum_{y=0}^{\infty} \frac{(x + y)!}{x!\, y!}\, y\, a^{y} = \frac{a (1 + x)}{(1 - a)^{x+2}}, \tag{11.5}$$

and

$$\sum_{y=0}^{\infty} \frac{(x + y)!}{x!\, y!}\, y^2 a^{y} = \frac{a (1 + x)}{(1 - a)^{x+3}}\, [a (x + 1) + 1]. \tag{11.6}$$

The next theorem presents some information regarding the conditional densities $f(x/y)$ and $f(y/x)$.

Theorem 11.6. Let $(X, Y) \sim GEO(p_1, p_2)$, where $p_1$ and $p_2$ are parameters. Then the conditional distributions $f(y/x)$ and $f(x/y)$ are also geometrical and

$$\begin{aligned}
E(Y/x) &= \frac{p_2 (1 + x)}{1 - p_2} \\
E(X/y) &= \frac{p_1 (1 + y)}{1 - p_1} \\
\operatorname{Var}(Y/x) &= \frac{p_2 (1 + x)}{(1 - p_2)^2} \\
\operatorname{Var}(X/y) &= \frac{p_1 (1 + y)}{(1 - p_1)^2}.
\end{aligned}$$

Proof: Again, as before, we first find the conditional probability density of $Y$ given the event $X = x$. The marginal density $f_1(x)$ is given by

$$\begin{aligned}
f_1(x) &= \sum_{y=0}^{\infty} f(x, y) \\
&= \sum_{y=0}^{\infty} \frac{(x + y)!}{x!\, y!}\, p_1^{x}\, p_2^{y}\, (1 - p_1 - p_2) \\
&= (1 - p_1 - p_2)\, p_1^{x} \sum_{y=0}^{\infty} \frac{(x + y)!}{x!\, y!}\, p_2^{y} \\
&= \frac{(1 - p_1 - p_2)\, p_1^{x}}{(1 - p_2)^{x+1}} \qquad \text{(by (11.4)).}
\end{aligned}$$

Therefore the conditional density of $Y$ given the event $X = x$ is

$$f(y/x) = \frac{f(x, y)}{f_1(x)} = \frac{(x + y)!}{x!\, y!}\, p_2^{y}\, (1 - p_2)^{x+1}.$$

The conditional expectation of $Y$ given $X = x$ is

$$\begin{aligned}
E(Y/x) &= \sum_{y=0}^{\infty} y\, f(y/x) \\
&= \sum_{y=0}^{\infty} y\, \frac{(x + y)!}{x!\, y!}\, p_2^{y}\, (1 - p_2)^{x+1} \\
&= \frac{p_2 (1 + x)}{1 - p_2} \qquad \text{(by (11.5)).}
\end{aligned}$$

Similarly, one can show that

$$E(X/y) = \frac{p_1 (1 + y)}{1 - p_1}.$$

To compute the conditional variance of $Y$ given the event that $X = x$, first we have to find $E\left(Y^2/x\right)$, which is given by

$$\begin{aligned}
E\left(Y^2/x\right) &= \sum_{y=0}^{\infty} y^2 f(y/x) \\
&= \sum_{y=0}^{\infty} y^2\, \frac{(x + y)!}{x!\, y!}\, p_2^{y}\, (1 - p_2)^{x+1} \\
&= \frac{p_2 (1 + x)}{(1 - p_2)^2}\, [p_2 (1 + x) + 1] \qquad \text{(by (11.6)).}
\end{aligned}$$

Therefore

$$\begin{aligned}
\operatorname{Var}(Y/x) &= E\left(Y^2/x\right) - E(Y/x)^2 \\
&= \frac{p_2 (1 + x)}{(1 - p_2)^2}\, [p_2 (1 + x) + 1] - \left(\frac{p_2 (1 + x)}{1 - p_2}\right)^2 \\
&= \frac{p_2 (1 + x)}{(1 - p_2)^2}.
\end{aligned}$$

The rest of the moments can be determined in a similar manner. The proof of the theorem is now complete.


11.4. Bivariate Negative Binomial Distribution

The univariate negative binomial distribution can be generalized to the

bivariate case. Guldberg (1934) introduced this distribution and Lundberg

(1940) ﬁrst used it in connection with problems of accident proneness. Arbous

and Kerrich (1951) arrived at this distribution by mixing parameters of the

bivariate Poisson distribution.

Definition 11.4. A discrete bivariate random variable $(X, Y)$ is said to have the bivariate negative binomial distribution with parameters $k$, $p_1$ and $p_2$ if its joint probability density is of the form

$$f(x, y) = \begin{cases} \dfrac{(x + y + k - 1)!}{x!\, y!\, (k - 1)!}\, p_1^{x}\, p_2^{y}\, (1 - p_1 - p_2)^{k}, & \text{if } x, y = 0, 1, ..., \infty \\[2mm] 0 & \text{otherwise,} \end{cases}$$

where $0 < p_1, p_2$, $p_1 + p_2 < 1$ and $k$ is a positive integer. We denote a bivariate negative binomial random variable by writing $(X, Y) \sim NBIN(k, p_1, p_2)$.

Example 11.6. An experiment consists of selecting a marble at random and with replacement from a box containing 10 white marbles, 15 black marbles and 5 green marbles. What is the probability that it takes exactly 11 trials to get 5 white marbles, 3 black marbles and the third green marble on the 11th trial?

Answer: Let $X$ denote the number of white marbles and $Y$ denote the number of black marbles. The joint distribution of $X$ and $Y$ is bivariate negative binomial with parameters $p_1 = \frac{1}{3}$, $p_2 = \frac{1}{2}$, and $k = 3$. Hence the probability that it takes exactly 11 trials to get 5 white marbles, 3 black marbles and the third green marble on the 11th trial is

$$\begin{aligned}
P(X = 5, Y = 3) &= f(5, 3) \\
&= \frac{(x + y + k - 1)!}{x!\, y!\, (k - 1)!}\, p_1^{x}\, p_2^{y}\, (1 - p_1 - p_2)^{k} \\
&= \frac{(5 + 3 + 3 - 1)!}{5!\, 3!\, (3 - 1)!} \left(\frac{1}{3}\right)^{5} \left(\frac{1}{2}\right)^{3} \left(1 - \frac{1}{3} - \frac{1}{2}\right)^{3} \\
&= \frac{10!}{5!\, 3!\, 2!} \left(\frac{1}{3}\right)^{5} \left(\frac{1}{2}\right)^{3} \left(\frac{1}{6}\right)^{3} \\
&= \frac{35}{5832} \\
&= 0.0060.
\end{aligned}$$
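The density of Definition 11.4 can be evaluated in exact rational arithmetic, which avoids rounding $\frac{1}{3}$ and $\frac{1}{6}$ to two decimals before raising them to powers. The sketch below computes $f(5, 3)$ for the marble experiment of Example 11.6; the function name is an assumption of this illustration.

```python
from fractions import Fraction
from math import factorial


def bivariate_negbin_pmf(x, y, k, p1, p2):
    """Joint pmf of NBIN(k, p1, p2) at (x, y), following Definition 11.4."""
    coeff = factorial(x + y + k - 1) // (
        factorial(x) * factorial(y) * factorial(k - 1)
    )
    return coeff * p1 ** x * p2 ** y * (1 - p1 - p2) ** k


# Example 11.6: 10 white, 15 black and 5 green marbles out of 30,
# so p1 = 1/3 (white), p2 = 1/2 (black), and k = 3 green marbles.
prob = bivariate_negbin_pmf(5, 3, 3, Fraction(1, 3), Fraction(1, 2))

print(prob, float(prob))
```

Passing `Fraction` arguments keeps every intermediate value exact, so the printed fraction is the probability itself rather than an approximation.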

The negative binomial theorem, which was treated in Chapter 5, can be generalized to

$$\sum_{x=0}^{\infty} \sum_{y=0}^{\infty} \frac{(x + y + k - 1)!}{x!\, y!\, (k - 1)!}\, p_1^{x}\, p_2^{y} = \frac{1}{(1 - p_1 - p_2)^{k}}. \tag{11.7}$$

In the following theorem, we present the expected values and the variances of $X$ and $Y$, the covariance between $X$ and $Y$, and the moment generating function.

Theorem 11.7. Let $(X, Y) \sim NBIN(k, p_1, p_2)$, where $k$, $p_1$ and $p_2$ are parameters. Then

$$\begin{aligned}
E(X) &= \frac{k\, p_1}{1 - p_1 - p_2} \\
E(Y) &= \frac{k\, p_2}{1 - p_1 - p_2} \\
\operatorname{Var}(X) &= \frac{k\, p_1 (1 - p_2)}{(1 - p_1 - p_2)^2} \\
\operatorname{Var}(Y) &= \frac{k\, p_2 (1 - p_1)}{(1 - p_1 - p_2)^2} \\
\operatorname{Cov}(X, Y) &= \frac{k\, p_1 p_2}{(1 - p_1 - p_2)^2} \\
M(s, t) &= \frac{(1 - p_1 - p_2)^{k}}{\left(1 - p_1 e^{s} - p_2 e^{t}\right)^{k}}.
\end{aligned}$$

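Proof: We only find the joint moment generating function $M(s, t)$ of the random variables $X$ and $Y$ and leave the rest to the reader. The joint moment generating function is given by

$$\begin{aligned}
M(s, t) &= E\left(e^{sX+tY}\right) \\
&= \sum_{x=0}^{\infty} \sum_{y=0}^{\infty} e^{sx+ty} f(x, y) \\
&= \sum_{x=0}^{\infty} \sum_{y=0}^{\infty} e^{sx+ty}\, \frac{(x + y + k - 1)!}{x!\, y!\, (k - 1)!}\, p_1^{x}\, p_2^{y}\, (1 - p_1 - p_2)^{k} \\
&= (1 - p_1 - p_2)^{k} \sum_{x=0}^{\infty} \sum_{y=0}^{\infty} \frac{(x + y + k - 1)!}{x!\, y!\, (k - 1)!} \left(e^{s} p_1\right)^{x} \left(e^{t} p_2\right)^{y} \\
&= \frac{(1 - p_1 - p_2)^{k}}{\left(1 - p_1 e^{s} - p_2 e^{t}\right)^{k}} \qquad \text{(by (11.7)).}
\end{aligned}$$

This completes the proof of the theorem.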

To establish the next theorem, we need the following three results. If $a$ is a positive real constant in the interval $(0, 1)$, then

$$\sum_{y=0}^{\infty} \frac{(x + y + k - 1)!}{(x + k - 1)!\, y!}\, a^{y} = \frac{1}{(1 - a)^{x+k}}, \tag{11.8}$$

$$\sum_{y=0}^{\infty} y\, \frac{(x + y + k - 1)!}{(x + k - 1)!\, y!}\, a^{y} = \frac{a (x + k)}{(1 - a)^{x+k+1}}, \tag{11.9}$$

and

$$\sum_{y=0}^{\infty} y^2\, \frac{(x + y + k - 1)!}{(x + k - 1)!\, y!}\, a^{y} = \frac{a (x + k)}{(1 - a)^{x+k+2}}\, [1 + (x + k)\, a]. \tag{11.10}$$

The next theorem presents some information regarding the conditional densities $f(x/y)$ and $f(y/x)$.

Theorem 11.8. Let $(X, Y) \sim NBIN(k, p_1, p_2)$, where $k$, $p_1$ and $p_2$ are parameters. Then the conditional densities $f(y/x)$ and $f(x/y)$ are also negative binomial and

$$\begin{aligned}
E(Y/x) &= \frac{p_2 (k + x)}{1 - p_2} \\
E(X/y) &= \frac{p_1 (k + y)}{1 - p_1} \\
\operatorname{Var}(Y/x) &= \frac{p_2 (k + x)}{(1 - p_2)^2} \\
\operatorname{Var}(X/y) &= \frac{p_1 (k + y)}{(1 - p_1)^2}.
\end{aligned}$$

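Proof: First, we find the marginal density of $X$. The marginal density $f_1(x)$ is given by

$$\begin{aligned}
f_1(x) &= \sum_{y=0}^{\infty} f(x, y) \\
&= \sum_{y=0}^{\infty} \frac{(x + y + k - 1)!}{x!\, y!\, (k - 1)!}\, p_1^{x}\, p_2^{y}\, (1 - p_1 - p_2)^{k} \\
&= (1 - p_1 - p_2)^{k}\, p_1^{x}\, \frac{(x + k - 1)!}{x!\, (k - 1)!} \sum_{y=0}^{\infty} \frac{(x + y + k - 1)!}{(x + k - 1)!\, y!}\, p_2^{y} \\
&= \frac{(x + k - 1)!}{x!\, (k - 1)!}\, \frac{(1 - p_1 - p_2)^{k}\, p_1^{x}}{(1 - p_2)^{x+k}} \qquad \text{(by (11.8)).}
\end{aligned}$$

The conditional density of $Y$ given the event $X = x$ is

$$f(y/x) = \frac{f(x, y)}{f_1(x)} = \frac{(x + y + k - 1)!}{(x + k - 1)!\, y!}\, p_2^{y}\, (1 - p_2)^{x+k}.$$

The conditional expectation $E(Y/x)$ is given by

$$\begin{aligned}
E(Y/x) &= \sum_{y=0}^{\infty} y\, \frac{(x + y + k - 1)!}{(x + k - 1)!\, y!}\, p_2^{y}\, (1 - p_2)^{x+k} \\
&= (1 - p_2)^{x+k} \sum_{y=0}^{\infty} y\, \frac{(x + y + k - 1)!}{(x + k - 1)!\, y!}\, p_2^{y} \\
&= (1 - p_2)^{x+k}\, \frac{p_2 (x + k)}{(1 - p_2)^{x+k+1}} \qquad \text{(by (11.9))} \\
&= \frac{p_2 (x + k)}{1 - p_2}.
\end{aligned}$$

The conditional expectation $E\left(Y^2/x\right)$ can be computed as follows:

$$\begin{aligned}
E\left(Y^2/x\right) &= \sum_{y=0}^{\infty} y^2\, \frac{(x + y + k - 1)!}{(x + k - 1)!\, y!}\, p_2^{y}\, (1 - p_2)^{x+k} \\
&= (1 - p_2)^{x+k} \sum_{y=0}^{\infty} y^2\, \frac{(x + y + k - 1)!}{(x + k - 1)!\, y!}\, p_2^{y} \\
&= (1 - p_2)^{x+k}\, \frac{p_2 (x + k)}{(1 - p_2)^{x+k+2}}\, [1 + (x + k)\, p_2] \qquad \text{(by (11.10))} \\
&= \frac{p_2 (x + k)}{(1 - p_2)^2}\, [1 + (x + k)\, p_2].
\end{aligned}$$

The conditional variance of $Y$ given $X = x$ is

$$\begin{aligned}
\operatorname{Var}(Y/x) &= E\left(Y^2/x\right) - E(Y/x)^2 \\
&= \frac{p_2 (x + k)}{(1 - p_2)^2}\, [1 + (x + k)\, p_2] - \left(\frac{p_2 (x + k)}{1 - p_2}\right)^2 \\
&= \frac{p_2 (x + k)}{(1 - p_2)^2}.
\end{aligned}$$

The conditional expected value $E(X/y)$ and conditional variance $\operatorname{Var}(X/y)$ can be computed in a similar way. This completes the proof.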

Note that if k = 1, then bivariate negative binomial distribution reduces

to bivariate geometric distribution.

11.5. Bivariate Hypergeometric Distribution

The univariate hypergeometric distribution can be generalized to the bivariate case. Isserlis (1914) introduced this distribution and Pearson (1924) gave various properties of this distribution. Pearson also fitted this distribution to observed data on the number of cards of a certain suit in two hands at whist.

Definition 11.5. A discrete bivariate random variable $(X, Y)$ is said to have the bivariate hypergeometric distribution with parameters $r, n_1, n_2, n_3$ if its joint probability distribution is of the form

$$f(x, y) = \begin{cases} \dfrac{\binom{n_1}{x} \binom{n_2}{y} \binom{n_3}{r - x - y}}{\binom{n_1 + n_2 + n_3}{r}}, & \text{if } x, y = 0, 1, ..., r \\[2mm] 0 & \text{otherwise,} \end{cases}$$

where $x \le n_1$, $y \le n_2$, $r - x - y \le n_3$ and $r$ is a positive integer less than or equal to $n_1 + n_2 + n_3$. We denote a bivariate hypergeometric random variable by writing $(X, Y) \sim HYP(r, n_1, n_2, n_3)$.

Example 11.7. A panel of prospective jurors includes 7 African Americans, 3 Asian Americans and 9 White Americans. If the selection is random, what is the probability that a jury will consist of 4 African Americans, 3 Asian Americans and 5 White Americans?

Answer: Here $n_1 = 7$, $n_2 = 3$ and $n_3 = 9$ so that $n = 19$. A total of 12 jurors will be selected so that $r = 12$. In this example $x = 4$, $y = 3$ and $r - x - y = 5$. Hence the probability that a jury will consist of 4 African Americans, 3 Asian Americans and 5 White Americans is

$$f(4, 3) = \frac{\binom{7}{4} \binom{3}{3} \binom{9}{5}}{\binom{19}{12}} = \frac{4410}{50388} = 0.0875.$$

Example 11.8. Among 25 silver dollars struck in 1903 there are 15 from the Philadelphia mint, 7 from the New Orleans mint, and 3 from the San Francisco mint. If 5 of these silver dollars are picked at random, what is the probability of getting 4 from the Philadelphia mint and 1 from the New Orleans mint?

Answer: Here $n = 25$, $r = 5$ and $n_1 = 15$, $n_2 = 7$, $n_3 = 3$. The probability of getting 4 from the Philadelphia mint and 1 from the New Orleans mint is

$$f(4, 1) = \frac{\binom{15}{4} \binom{7}{1} \binom{3}{0}}{\binom{25}{5}} = \frac{9555}{53130} = 0.1798.$$
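Both hypergeometric examples reduce to products of binomial coefficients, which Python's standard library computes exactly. The sketch below is illustrative; the function name is an assumption, and the jury calculation uses the group sizes $n_1 = 7$, $n_2 = 3$, $n_3 = 9$ that appear in the worked answer to Example 11.7.

```python
from math import comb


def bivariate_hypergeom_pmf(x, y, r, n1, n2, n3):
    """Joint pmf of HYP(r, n1, n2, n3) at (x, y), following Definition 11.5."""
    z = r - x - y  # number of items drawn from the third group
    if min(x, y, z) < 0 or x > n1 or y > n2 or z > n3:
        return 0.0
    return comb(n1, x) * comb(n2, y) * comb(n3, z) / comb(n1 + n2 + n3, r)


# Example 11.7: jury of 12 drawn from groups of sizes 7, 3 and 9.
jury = bivariate_hypergeom_pmf(4, 3, 12, 7, 3, 9)

# Example 11.8: 5 silver dollars drawn from 15, 7 and 3 coins per mint.
coins = bivariate_hypergeom_pmf(4, 1, 5, 15, 7, 3)

print(round(jury, 4), round(coins, 4))
```

The guard clause returns zero outside the support, so the same function can be summed over all $(x, y)$ to confirm that the probabilities add up to one.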

In the following theorem, we present the expected values and the variances of $X$ and $Y$, and the covariance between $X$ and $Y$.

Theorem 11.9. Let $(X, Y) \sim HYP(r, n_1, n_2, n_3)$, where $r$, $n_1$, $n_2$ and $n_3$ are parameters. Then

$$\begin{aligned}
E(X) &= \frac{r\, n_1}{n_1 + n_2 + n_3} \\
E(Y) &= \frac{r\, n_2}{n_1 + n_2 + n_3} \\
\operatorname{Var}(X) &= \frac{r\, n_1 (n_2 + n_3)}{(n_1 + n_2 + n_3)^2} \left(\frac{n_1 + n_2 + n_3 - r}{n_1 + n_2 + n_3 - 1}\right) \\
\operatorname{Var}(Y) &= \frac{r\, n_2 (n_1 + n_3)}{(n_1 + n_2 + n_3)^2} \left(\frac{n_1 + n_2 + n_3 - r}{n_1 + n_2 + n_3 - 1}\right) \\
\operatorname{Cov}(X, Y) &= -\frac{r\, n_1 n_2}{(n_1 + n_2 + n_3)^2} \left(\frac{n_1 + n_2 + n_3 - r}{n_1 + n_2 + n_3 - 1}\right).
\end{aligned}$$

Proof: We find only the mean and variance of $X$. The mean and variance of $Y$ can be found in a similar manner. The covariance of $X$ and $Y$ will be left to the reader as an exercise. To find the expected value of $X$, we need the marginal density $f_1(x)$ of $X$. The marginal of $X$ is given by

$$\begin{aligned}
f_1(x) &= \sum_{y=0}^{r-x} f(x, y) \\
&= \sum_{y=0}^{r-x} \frac{\binom{n_1}{x} \binom{n_2}{y} \binom{n_3}{r - x - y}}{\binom{n_1 + n_2 + n_3}{r}} \\
&= \frac{\binom{n_1}{x}}{\binom{n_1 + n_2 + n_3}{r}} \sum_{y=0}^{r-x} \binom{n_2}{y} \binom{n_3}{r - x - y} \\
&= \frac{\binom{n_1}{x}}{\binom{n_1 + n_2 + n_3}{r}} \binom{n_2 + n_3}{r - x} \qquad \text{(by Theorem 1.3).}
\end{aligned}$$

This shows that $X \sim HYP(n_1, n_2 + n_3, r)$. Hence, by Theorem 5.7, we get

$$E(X) = \frac{r\, n_1}{n_1 + n_2 + n_3}$$

and

$$\operatorname{Var}(X) = \frac{r\, n_1 (n_2 + n_3)}{(n_1 + n_2 + n_3)^2} \left(\frac{n_1 + n_2 + n_3 - r}{n_1 + n_2 + n_3 - 1}\right).$$

Similarly, the random variable $Y \sim HYP(n_2, n_1 + n_3, r)$. Hence, again by Theorem 5.7, we get

$$E(Y) = \frac{r\, n_2}{n_1 + n_2 + n_3}$$

and

$$\operatorname{Var}(Y) = \frac{r\, n_2 (n_1 + n_3)}{(n_1 + n_2 + n_3)^2} \left(\frac{n_1 + n_2 + n_3 - r}{n_1 + n_2 + n_3 - 1}\right).$$

The next theorem presents some information regarding the conditional densities $f(x/y)$ and $f(y/x)$.

Theorem 11.10. Let $(X, Y) \sim HYP(r, n_1, n_2, n_3)$, where $r$, $n_1$, $n_2$ and $n_3$ are parameters. Then the conditional distributions $f(y/x)$ and $f(x/y)$ are also hypergeometric and

$$\begin{aligned}
E(Y/x) &= \frac{n_2 (r - x)}{n_2 + n_3} \\
E(X/y) &= \frac{n_1 (r - y)}{n_1 + n_3} \\
\operatorname{Var}(Y/x) &= \frac{(r - x)\, n_2 n_3}{(n_2 + n_3)^2} \left(\frac{n_2 + n_3 - r + x}{n_2 + n_3 - 1}\right) \\
\operatorname{Var}(X/y) &= \frac{(r - y)\, n_1 n_3}{(n_1 + n_3)^2} \left(\frac{n_1 + n_3 - r + y}{n_1 + n_3 - 1}\right).
\end{aligned}$$

Proof: To find $E(Y/x)$, we need the conditional density $f(y/x)$ of $Y$ given the event $X = x$. The conditional density $f(y/x)$ is given by

$$f(y/x) = \frac{f(x, y)}{f_1(x)} = \frac{\binom{n_2}{y} \binom{n_3}{r - x - y}}{\binom{n_2 + n_3}{r - x}}.$$

Hence, the random variable $Y$ given $X = x$ is a hypergeometric random variable with parameters $n_2$, $n_3$, and $r - x$, that is

$$Y/x \sim HYP(n_2, n_3, r - x).$$

Hence, by Theorem 5.7, we get

$$E(Y/x) = \frac{n_2 (r - x)}{n_2 + n_3}$$

and

$$\operatorname{Var}(Y/x) = \frac{(r - x)\, n_2 n_3}{(n_2 + n_3)^2} \left(\frac{n_2 + n_3 - r + x}{n_2 + n_3 - 1}\right).$$

Similarly, one can find $E(X/y)$ and $\operatorname{Var}(X/y)$. The proof of the theorem is now complete.

11.6. Bivariate Poisson Distribution

The univariate Poisson distribution can be generalized to the bivariate case. In 1934, Campbell first derived this distribution. However, in 1944, Aitken gave the explicit formula for the bivariate Poisson distribution function. In 1964, Holgate also arrived at the bivariate Poisson distribution by deriving the joint distribution of $X = X_1 + X_3$ and $Y = X_2 + X_3$, where $X_1, X_2, X_3$ are independent Poisson random variables. Unlike the previous bivariate distributions, the conditional distributions of the bivariate Poisson distribution are not Poisson. In fact, Seshadri and Patil (1964) indicated that no bivariate distribution exists having both marginal and conditional distributions of Poisson form.

Definition 11.6. A discrete bivariate random variable $(X, Y)$ is said to have the bivariate Poisson distribution with parameters $\lambda_1, \lambda_2, \lambda_3$ if its joint probability density is of the form

$$f(x, y) = \begin{cases} \dfrac{e^{(-\lambda_1 - \lambda_2 + \lambda_3)}\, (\lambda_1 - \lambda_3)^{x}\, (\lambda_2 - \lambda_3)^{y}}{x!\, y!}\, \psi(x, y) & \text{for } x, y = 0, 1, ..., \infty \\[2mm] 0 & \text{otherwise,} \end{cases}$$

where

$$\psi(x, y) := \sum_{r=0}^{\min(x, y)} \frac{x^{(r)}\, y^{(r)}\, \lambda_3^{r}}{(\lambda_1 - \lambda_3)^{r}\, (\lambda_2 - \lambda_3)^{r}\, r!}$$

with

$$x^{(r)} := x (x - 1) \cdots (x - r + 1),$$

and $\lambda_1 > \lambda_3 > 0$, $\lambda_2 > \lambda_3 > 0$ are parameters. We denote a bivariate Poisson random variable by writing $(X, Y) \sim POI(\lambda_1, \lambda_2, \lambda_3)$.

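In the following theorem, we present the expected values and the variances of $X$ and $Y$, the covariance between $X$ and $Y$, and the joint moment generating function.

Theorem 11.11. Let $(X, Y) \sim POI(\lambda_1, \lambda_2, \lambda_3)$, where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are parameters. Then

$$\begin{aligned}
E(X) &= \lambda_1 \\
E(Y) &= \lambda_2 \\
\operatorname{Var}(X) &= \lambda_1 \\
\operatorname{Var}(Y) &= \lambda_2 \\
\operatorname{Cov}(X, Y) &= \lambda_3 \\
M(s, t) &= e^{-\lambda_1 - \lambda_2 + \lambda_3 + (\lambda_1 - \lambda_3) e^{s} + (\lambda_2 - \lambda_3) e^{t} + \lambda_3 e^{s+t}}.
\end{aligned}$$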

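The next theorem presents some special characteristics of the conditional densities $f(x/y)$ and $f(y/x)$.

Theorem 11.12. Let $(X, Y) \sim POI(\lambda_1, \lambda_2, \lambda_3)$, where $\lambda_1$, $\lambda_2$ and $\lambda_3$ are parameters. Then

$$\begin{aligned}
E(Y/x) &= \lambda_2 - \lambda_3 + \left(\frac{\lambda_3}{\lambda_1}\right) x \\
E(X/y) &= \lambda_1 - \lambda_3 + \left(\frac{\lambda_3}{\lambda_2}\right) y \\
\operatorname{Var}(Y/x) &= \lambda_2 - \lambda_3 + \left(\frac{\lambda_3 (\lambda_1 - \lambda_3)}{\lambda_1^2}\right) x \\
\operatorname{Var}(X/y) &= \lambda_1 - \lambda_3 + \left(\frac{\lambda_3 (\lambda_2 - \lambda_3)}{\lambda_2^2}\right) y.
\end{aligned}$$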
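Since no proofs are given for the bivariate Poisson theorems, a numerical check is a useful sanity test. The sketch below evaluates the joint density of Definition 11.6 on a truncated grid and compares the resulting moments with the stated formulas. The parameter names `lam1`, `lam2`, `lam3`, the sample values 2, 3 and 1, and the truncation at 60 are all assumptions made for illustration.

```python
from math import exp, factorial, perm


def bivariate_poisson_pmf(x, y, lam1, lam2, lam3):
    """Joint pmf of the bivariate Poisson distribution of Definition 11.6."""
    a, b = lam1 - lam3, lam2 - lam3
    # perm(x, r) is the falling factorial x (x - 1) ... (x - r + 1).
    series = sum(
        perm(x, r) * perm(y, r) * lam3 ** r / (a ** r * b ** r * factorial(r))
        for r in range(min(x, y) + 1)
    )
    return exp(-lam1 - lam2 + lam3) * a ** x * b ** y / (factorial(x) * factorial(y)) * series


lam1, lam2, lam3 = 2.0, 3.0, 1.0  # illustrative values with lam1, lam2 > lam3 > 0
grid = range(60)                  # truncation point; the neglected tail is negligible here
pmf = {(x, y): bivariate_poisson_pmf(x, y, lam1, lam2, lam3) for x in grid for y in grid}

total = sum(pmf.values())
ex = sum(x * p for (x, y), p in pmf.items())
ey = sum(y * p for (x, y), p in pmf.items())
cov = sum(x * y * p for (x, y), p in pmf.items()) - ex * ey

# Conditional mean of Y given X = 2, computed from the joint pmf.
f1 = sum(pmf[(2, y)] for y in grid)
cond_mean = sum(y * pmf[(2, y)] for y in grid) / f1

print(total, ex, ey, cov, cond_mean)
```

With these values the conditional mean formula gives $\lambda_2 - \lambda_3 + (\lambda_3 / \lambda_1)\, x = 3$ at $x = 2$, which the truncated sum should reproduce up to rounding.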

11.7. Review Exercises

1. A box contains 10 white marbles, 15 black marbles and 5 green marbles.

If 10 marbles are selected at random and without replacement, what is the

probability that 5 are white, 3 are black and 2 are green?

2. An urn contains 3 red balls, 2 green balls and 1 yellow ball. Three balls

are selected at random and without replacement from the urn. What is the

probability that at least 1 color is not drawn?

3. An urn contains 4 red balls, 8 green balls and 2 yellow balls. Five balls

are randomly selected, without replacement, from the urn. What is the

probability that 1 red ball, 2 green balls, and 2 yellow balls will be selected?

4. From a group of three Republicans, two Democrats, and one Independent,

a committee of two people is to be randomly selected. If X denotes the

Some Special Discrete Bivariate Distributions 316

number of Republicans and Y the number of Democrats on the committee,

then what is the variance of Y given that X = x?

5. If X equals the number of ones and Y the number of twos and threes

when a four fair dice are rolled, then what is the conditional variance of X

and Y = 1?

6. Motor vehicles arriving at an intersection can turn right or left or continue

straight ahead. In a study of traﬃ c patterns at this intersection over a long

period of time, engineers have noted that 40 percents of the motor vehicles

turn left, 25 percents turn right, and the remainder continue straight ahead.

For the next ﬁve cars entering the intersection, what is the probability that

at least one turn right?

7. Among a large number of applicants for a certain position, 60 percents

have only a high school education, 30 percents have some college training,

and 10 percents have completed a college degree. If 5 applicants are randomly

selected to be interviewed, what is the probability that at least one will have

completed a college degree?

8. In a population of 200 students who have just completed a ﬁrst course

in calculus, 50 have earned A 's, 80 B 's and remaining earned F 's. A sample

of size 25 is taken at random and without replacement from this population.

What is the probability that 10 students have A 's, 12 students have B 's and

3 students have F 's ?

9. If X equals the number of ones and Y the number of twos and threes

when a four fair dice are rolled, then what is the correlation coeﬃ cient of X

and Y?

10. If the joint moment generating function of X and Y is M (s, t ) =

k 4

7es 2et  5 , then what is the value of the constant k ? What is the corre-

lation coeﬃ cient between X and Y?

11. A die with 1 painted on three sides, 2 painted on two sides, and 3 painted

on one side is rolled 15 times. What is the probability that we will get eight

1's, six 2's and a 3 on the last roll?

12. The output of a machine is graded excellent 80 percent of the time, good 15 percent of the time, and defective 5 percent of the time. What is the probability that a random sample of size 15 has 10 excellent, 3 good, and 2 defective items?

13. An industrial product is graded by a machine excellent 80 percent of the time, good 15 percent of the time, and defective 5 percent of the time. A random sample of 15 items is graded. What is the probability that the machine will grade 10 excellent, 3 good, and 2 defective, with one of the defective items being the last one graded?

14. If $(X, Y) \sim HYP(n_1, n_2, n_3, r)$, then what is the covariance of the random variables X and Y?

Chapter 12

SOME SPECIAL CONTINUOUS BIVARIATE DISTRIBUTIONS

In this chapter, we study some well-known continuous bivariate probability density functions. First, we present the natural extensions of the univariate probability density functions that were treated in Chapter 6. Then we present some other bivariate distributions that have been reported in the literature. The bivariate normal distribution has been treated in most textbooks because of its dominant role in statistical theory. The other continuous bivariate distributions are rarely treated in textbooks, and several well-known bivariate distributions are treated in a textbook here for the first time. The monograph of K.V. Mardia gives an excellent exposition of various bivariate distributions. We begin this chapter with the bivariate uniform distribution.

12.1. Bivariate Uniform Distribution

In this section, we study the Morgenstern bivariate uniform distribution in detail. The marginals of the Morgenstern bivariate uniform distribution are uniform. In this sense, it is an extension of the univariate uniform distribution. Other bivariate uniform distributions will be pointed out without any in-depth treatment.

In 1956, Morgenstern introduced a one-parameter family of bivariate distributions whose univariate marginals are uniform distributions by the following formula

$$f(x, y) = f_1(x)\, f_2(y)\, \bigl( 1 + \alpha\, [2F_1(x) - 1]\, [2F_2(y) - 1] \bigr),$$

where $\alpha \in [-1, 1]$ is a parameter. If one assumes the cdf $F_i(x) = x$ and the pdf $f_i(x) = 1$ ($i = 1, 2$), then we arrive at the Morgenstern uniform distribution on the unit square. The joint probability density function $f(x, y)$ of the Morgenstern uniform distribution on the unit square is given by

$$f(x, y) = 1 + \alpha\, (2x - 1)(2y - 1), \qquad 0 < x, y \le 1, \quad -1 \le \alpha \le 1.$$

Next, we define the Morgenstern uniform distribution on an arbitrary rectangle $[a, b] \times [c, d]$.

Definition 12.1. A continuous bivariate random variable $(X, Y)$ is said to have the bivariate uniform distribution on the rectangle $[a, b] \times [c, d]$ if its joint probability density function is of the form

$$f(x, y) = \begin{cases} \dfrac{1 + \alpha \left( \frac{2x - 2a}{b - a} - 1 \right) \left( \frac{2y - 2c}{d - c} - 1 \right)}{(b - a)(d - c)} & \text{for } x \in [a, b],\ y \in [c, d] \\[2ex] 0 & \text{otherwise,} \end{cases}$$

where $\alpha$ is an a priori chosen parameter in $[-1, 1]$. We denote a Morgenstern bivariate uniform random variable on a rectangle $[a, b] \times [c, d]$ by writing $(X, Y) \sim UNIF(a, b, c, d, \alpha)$.

The following figures show the graph and the equi-density curves of $f(x, y)$ on the unit square with $\alpha = 0.5$.

In the following theorem, we present the expected values, the variances of the random variables X and Y, and the covariance between X and Y.

Theorem 12.1. Let $(X, Y) \sim UNIF(a, b, c, d, \alpha)$, where $a, b, c, d$ and $\alpha$ are parameters. Then

$$\begin{aligned}
E(X) &= \frac{b + a}{2} \\
E(Y) &= \frac{d + c}{2} \\
Var(X) &= \frac{(b - a)^2}{12} \\
Var(Y) &= \frac{(d - c)^2}{12} \\
Cov(X, Y) &= \frac{1}{36}\, \alpha\, (b - a)(d - c).
\end{aligned}$$

Proof: First, we determine the marginal density of X, which is given by

$$\begin{aligned}
f_1(x) &= \int_c^d f(x, y)\, dy \\
&= \int_c^d \frac{1 + \alpha \left( \frac{2x - 2a}{b - a} - 1 \right) \left( \frac{2y - 2c}{d - c} - 1 \right)}{(b - a)(d - c)}\, dy \\
&= \frac{1}{b - a}.
\end{aligned}$$

Thus, the marginal density of X is uniform on the interval from a to b. That is, $X \sim UNIF(a, b)$. Hence by Theorem 6.1, we have

$$E(X) = \frac{b + a}{2} \quad \text{and} \quad Var(X) = \frac{(b - a)^2}{12}.$$

Similarly, one can show that $Y \sim UNIF(c, d)$ and therefore by Theorem 6.1

$$E(Y) = \frac{d + c}{2} \quad \text{and} \quad Var(Y) = \frac{(d - c)^2}{12}.$$

The product moment of X and Y is

$$\begin{aligned}
E(XY) &= \int_c^d \int_a^b xy\, f(x, y)\, dx\, dy \\
&= \int_c^d \int_a^b xy\, \frac{1 + \alpha \left( \frac{2x - 2a}{b - a} - 1 \right) \left( \frac{2y - 2c}{d - c} - 1 \right)}{(b - a)(d - c)}\, dx\, dy \\
&= \frac{1}{36}\, \alpha\, (b - a)(d - c) + \frac{1}{4}\, (b + a)(d + c).
\end{aligned}$$

Thus, the covariance of X and Y is

$$\begin{aligned}
Cov(X, Y) &= E(XY) - E(X)\, E(Y) \\
&= \frac{1}{36}\, \alpha\, (b - a)(d - c) + \frac{1}{4}\, (b + a)(d + c) - \frac{1}{4}\, (b + a)(d + c) \\
&= \frac{1}{36}\, \alpha\, (b - a)(d - c).
\end{aligned}$$

This completes the proof of the theorem.
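The covariance formula of Theorem 12.1 can be checked numerically. The following is a minimal sketch, not part of the original text: the function names `morgenstern_pdf` and `expect`, the grid size, and the parameter values are all illustrative assumptions. It approximates the expectations with a midpoint rule on a grid over the rectangle.

```python
def morgenstern_pdf(x, y, a, b, c, d, alpha):
    """Morgenstern bivariate uniform density on [a, b] x [c, d] (Definition 12.1)."""
    if not (a <= x <= b and c <= y <= d):
        return 0.0
    u = (2 * x - 2 * a) / (b - a) - 1
    v = (2 * y - 2 * c) / (d - c) - 1
    return (1 + alpha * u * v) / ((b - a) * (d - c))


def expect(g, a, b, c, d, alpha, n=400):
    """Midpoint-rule approximation of E[g(X, Y)] on an n-by-n grid."""
    hx, hy = (b - a) / n, (d - c) / n
    total = 0.0
    for i in range(n):
        x = a + (i + 0.5) * hx
        for j in range(n):
            y = c + (j + 0.5) * hy
            total += g(x, y) * morgenstern_pdf(x, y, a, b, c, d, alpha)
    return total * hx * hy


# Illustrative parameter values (assumptions, not taken from the text).
a, b, c, d, alpha = 1.0, 4.0, -2.0, 3.0, 0.5

mean_x = expect(lambda x, y: x, a, b, c, d, alpha)
mean_y = expect(lambda x, y: y, a, b, c, d, alpha)
cov_numeric = expect(lambda x, y: x * y, a, b, c, d, alpha) - mean_x * mean_y
cov_formula = alpha * (b - a) * (d - c) / 36  # Theorem 12.1

print(mean_x, mean_y, cov_numeric, cov_formula)
```

The two covariance values should agree up to the discretization error of the grid. A finer grid (larger `n`) tightens the agreement at the cost of run time.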

In the next theorem, we state some information related to the conditional densities $f(y/x)$ and $f(x/y)$.

Theorem 12.2. Let $(X, Y) \sim UNIF(a, b, c, d, \alpha)$, where $a, b, c, d$ and $\alpha$ are parameters. Then

$$\begin{aligned}
E(Y/x) &= \frac{d + c}{2} + \frac{\alpha}{6}\, (d - c) \left( \frac{2x - 2a}{b - a} - 1 \right) \\
E(X/y) &= \frac{b + a}{2} + \frac{\alpha}{6}\, (b - a) \left( \frac{2y - 2c}{d - c} - 1 \right) \\
Var(Y/x) &= \frac{1}{36} \left( \frac{d - c}{b - a} \right)^2 \left[ \alpha^2 (a + b)(4x - a - b) + 3(b - a)^2 - 4\alpha^2 x^2 \right] \\
Var(X/y) &= \frac{1}{36} \left( \frac{b - a}{d - c} \right)^2 \left[ \alpha^2 (c + d)(4y - c - d) + 3(d - c)^2 - 4\alpha^2 y^2 \right].
\end{aligned}$$

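Proof: First, we determine the conditional density function $f(y/x)$. Recall that $f_1(x) = \frac{1}{b - a}$. Hence,

$$f(y/x) = \frac{1}{d - c} \left[ 1 + \alpha \left( \frac{2x - 2a}{b - a} - 1 \right) \left( \frac{2y - 2c}{d - c} - 1 \right) \right].$$

The conditional expectation $E(Y/x)$ is given by

$$\begin{aligned}
E(Y/x) &= \int_c^d y\, f(y/x)\, dy \\
&= \frac{1}{d - c} \int_c^d y \left[ 1 + \alpha \left( \frac{2x - 2a}{b - a} - 1 \right) \left( \frac{2y - 2c}{d - c} - 1 \right) \right] dy \\
&= \frac{d + c}{2} + \frac{\alpha}{6\, (d - c)^2} \left( \frac{2x - 2a}{b - a} - 1 \right) \left[ d^3 - c^3 + 3dc^2 - 3cd^2 \right] \\
&= \frac{d + c}{2} + \frac{\alpha}{6}\, (d - c) \left( \frac{2x - 2a}{b - a} - 1 \right),
\end{aligned}$$

where the last step uses $d^3 - c^3 + 3dc^2 - 3cd^2 = (d - c)^3$.

Similarly, the conditional expectation $E\left( Y^2/x \right)$ is given by

$$\begin{aligned}
E\left( Y^2/x \right) &= \int_c^d y^2 f(y/x)\, dy \\
&= \frac{1}{d - c} \int_c^d y^2 \left[ 1 + \alpha \left( \frac{2x - 2a}{b - a} - 1 \right) \left( \frac{2y - 2c}{d - c} - 1 \right) \right] dy \\
&= \frac{d^3 - c^3}{3\, (d - c)} + \frac{\alpha}{d - c} \left( \frac{2x - 2a}{b - a} - 1 \right) \frac{1}{6} \left( d^2 - c^2 \right) (d - c) \\
&= \frac{d^2 + dc + c^2}{3} + \frac{1}{6}\, \alpha \left( d^2 - c^2 \right) \left( \frac{2x - 2a}{b - a} - 1 \right).
\end{aligned}$$

Therefore, the conditional variance of Y given the event $X = x$ is

$$\begin{aligned}
Var(Y/x) &= E\left( Y^2/x \right) - E(Y/x)^2 \\
&= \frac{1}{36} \left( \frac{d - c}{b - a} \right)^2 \left[ \alpha^2 (a + b)(4x - a - b) + 3(b - a)^2 - 4\alpha^2 x^2 \right].
\end{aligned}$$

The conditional expectation $E(X/y)$ and the conditional variance $Var(X/y)$ can be found in a similar manner. This completes the proof of the theorem.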
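The conditional moments can be checked by integrating the conditional density $f(y/x)$ numerically. The sketch below is an illustration only: the helper name `cond_moments` and the parameter values are assumptions. The closed form used for the conditional mean is the value obtained by evaluating $\int_c^d y\, f(y/x)\, dy$ directly, namely $\frac{d+c}{2} + \frac{\alpha}{6}(d-c)\left(\frac{2x-2a}{b-a}-1\right)$, and the closed form for the variance is the expression stated in Theorem 12.2.

```python
def cond_moments(x, a, b, c, d, alpha, n=20000):
    """Midpoint-rule E(Y | X = x) and Var(Y | X = x) computed from f(y/x)."""
    u = (2 * x - 2 * a) / (b - a) - 1
    h = (d - c) / n
    m1 = 0.0
    m2 = 0.0
    for j in range(n):
        y = c + (j + 0.5) * h
        v = (2 * y - 2 * c) / (d - c) - 1
        w = (1 + alpha * u * v) / (d - c) * h  # f(y/x) dy on this cell
        m1 += y * w
        m2 += y * y * w
    return m1, m2 - m1 * m1


# Illustrative parameter values (assumptions, not taken from the text).
a, b, c, d, alpha, x = 1.0, 4.0, -2.0, 3.0, 0.5, 3.2

mean_num, var_num = cond_moments(x, a, b, c, d, alpha)

u = (2 * x - 2 * a) / (b - a) - 1
mean_closed = (d + c) / 2 + alpha * (d - c) * u / 6
var_closed = ((d - c) / (b - a)) ** 2 / 36 * (
    alpha**2 * (a + b) * (4 * x - a - b) + 3 * (b - a) ** 2 - 4 * alpha**2 * x**2
)

print(mean_num, mean_closed)
print(var_num, var_closed)
```

Plotting `mean_closed` against `x` gives the regression curve (a straight line in `x`), and plotting `var_closed` gives the scedastic curve (a parabola in `x`).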

The following figure illustrates the regression and scedastic curves of the Morgenstern uniform distribution on the unit square with $\alpha = 0.5$.

Next, we give a definition of another generalized bivariate uniform distribution.

Definition 12.2. Let $S \subset \mathbb{R}^2$ be a region in the Euclidean plane $\mathbb{R}^2$ with area $A$. The random variables X and Y are said to be bivariate uniform over $S$ if the joint density of X and Y is of the form

$$f(x, y) = \begin{cases} \dfrac{1}{A} & \text{for } (x, y) \in S \\[1ex] 0 & \text{otherwise.} \end{cases}$$

In 1965, Plackett constructed a class of bivariate distributions $F(x, y)$ for given marginals $F_1(x)$ and $F_2(y)$ as a root of the quadratic equation

$$(\alpha - 1)\, F(x, y)^2 - \{ 1 + (\alpha - 1)\, [F_1(x) + F_2(y)] \}\, F(x, y) + \alpha\, F_1(x)\, F_2(y) = 0$$

(where $0 < \alpha < \infty$) which satisfies the Fréchet inequalities

$$\max \{ F_1(x) + F_2(y) - 1,\ 0 \} \le F(x, y) \le \min \{ F_1(x),\ F_2(y) \}.$$

The class of bivariate joint density functions constructed by Plackett is the following:

$$f(x, y) = \frac{\alpha\, f_1(x)\, f_2(y)\, \left[ (\alpha - 1) \{ F_1(x) + F_2(y) - 2F_1(x) F_2(y) \} + 1 \right]}{\left[ S(x, y)^2 - 4\alpha (\alpha - 1)\, F_1(x)\, F_2(y) \right]^{3/2}}$$

where

$$S(x, y) = 1 + (\alpha - 1)\, (F_1(x) + F_2(y)).$$

If one takes $F_i(x) = x$ and $f_i(x) = 1$ (for $i = 1, 2$), then the joint density function constructed by Plackett reduces to

$$f(x, y) = \frac{\alpha\, \left[ (\alpha - 1) \{ x + y - 2xy \} + 1 \right]}{\left[ \{ 1 + (\alpha - 1)(x + y) \}^2 - 4\alpha (\alpha - 1)\, xy \right]^{3/2}}$$

where $0 \le x, y \le 1$, and $\alpha > 0$. But unfortunately this is not a bivariate density function since this bivariate density does not integrate to one. This fact was missed by both Plackett (1965) and Mardia (1967).

12.2. Bivariate Cauchy Distribution

Recall that the univariate Cauchy probability distribution was defined in Chapter 3 as

$$f(x) = \frac{\theta}{\pi \left[ \theta^2 + (x - \alpha)^2 \right]}, \qquad -\infty < x < \infty,$$

where $\theta > 0$ and $\alpha$ are real parameters. The parameter $\alpha$ is called the location parameter. In Chapter 4, we have pointed out that any random variable whose probability density function is Cauchy has no moments. Furthermore, this random variable has no moment generating function. The Cauchy distribution is widely used for instructional purposes besides its statistical use. The main purpose of this section is to generalize the univariate Cauchy distribution to the bivariate case and study its various intrinsic properties. We define the bivariate Cauchy random variables by using the form of their joint probability density function.

Definition 12.3. A continuous bivariate random variable $(X, Y)$ is said to have the bivariate Cauchy distribution if its joint probability density function is of the form

$$f(x, y) = \frac{\theta}{2\pi \left[ \theta^2 + (x - \alpha)^2 + (y - \beta)^2 \right]^{3/2}}, \qquad -\infty < x, y < \infty,$$

where $\theta$ is a positive parameter and $\alpha$ and $\beta$ are location parameters. We denote a bivariate Cauchy random variable by writing $(X, Y) \sim CAU(\theta, \alpha, \beta)$.

The following figures show the graph and the equi-density curves of the Cauchy density function $f(x, y)$ with parameters $\alpha = 0 = \beta$ and $\theta = 0.5$.

The bivariate Cauchy distribution can be derived by considering the distribution of radioactive particles emanating from a source that hit a two-dimensional screen. This distribution is a special case of the bivariate t-distribution, which was first constructed by Karl Pearson in 1923.

The following theorem shows that if a bivariate random variable (X, Y ) is

Cauchy, then it has no moments like the univariate Cauchy random variable.

Further, for a bivariate Cauchy random variable (X, Y ), the covariance (and

hence the correlation) between X and Y does not exist.

Theorem 12.3. Let $(X, Y) \sim CAU(\theta, \alpha, \beta)$, where $\theta > 0$, $\alpha$ and $\beta$ are parameters. Then the moments $E(X)$, $E(Y)$, $Var(X)$, $Var(Y)$, and $Cov(X, Y)$ do not exist.

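Proof: In order to find the moments of X and Y, we need their marginal distributions. First, we find the marginal of X, which is given by

$$\begin{aligned}
f_1(x) &= \int_{-\infty}^{\infty} f(x, y)\, dy \\
&= \int_{-\infty}^{\infty} \frac{\theta}{2\pi \left[ \theta^2 + (x - \alpha)^2 + (y - \beta)^2 \right]^{3/2}}\, dy.
\end{aligned}$$

To evaluate the above integral, we make the trigonometric substitution

$$y = \beta + \sqrt{\theta^2 + (x - \alpha)^2}\, \tan \psi.$$

Hence

$$dy = \sqrt{\theta^2 + (x - \alpha)^2}\, \sec^2 \psi\, d\psi$$

and

$$\begin{aligned}
\left[ \theta^2 + (x - \alpha)^2 + (y - \beta)^2 \right]^{3/2} &= \left[ \theta^2 + (x - \alpha)^2 \right]^{3/2} \left( 1 + \tan^2 \psi \right)^{3/2} \\
&= \left[ \theta^2 + (x - \alpha)^2 \right]^{3/2} \sec^3 \psi.
\end{aligned}$$

Using these in the above integral, we get

$$\begin{aligned}
\int_{-\infty}^{\infty} \frac{\theta}{2\pi \left[ \theta^2 + (x - \alpha)^2 + (y - \beta)^2 \right]^{3/2}}\, dy
&= \frac{\theta}{2\pi} \int_{-\pi/2}^{\pi/2} \frac{\sqrt{\theta^2 + (x - \alpha)^2}\, \sec^2 \psi}{\left[ \theta^2 + (x - \alpha)^2 \right]^{3/2} \sec^3 \psi}\, d\psi \\
&= \frac{\theta}{2\pi \left[ \theta^2 + (x - \alpha)^2 \right]} \int_{-\pi/2}^{\pi/2} \cos \psi\, d\psi \\
&= \frac{\theta}{\pi \left[ \theta^2 + (x - \alpha)^2 \right]}.
\end{aligned}$$

Hence, the marginal of X is a Cauchy distribution with parameters $\theta$ and $\alpha$. Thus, for the random variable X, the expected value $E(X)$ and the variance $Var(X)$ do not exist (see Example 4.2). In a similar manner, it can be shown that the marginal distribution of Y is also Cauchy with parameters $\theta$ and $\beta$, and hence $E(Y)$ and $Var(Y)$ do not exist. Since

$$Cov(X, Y) = E(XY) - E(X)\, E(Y),$$

it is easy to see that $Cov(X, Y)$ also does not exist. This completes the proof of the theorem.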
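The marginal computation in the proof can be reproduced numerically with the same trigonometric substitution. The following is a minimal sketch under illustrative assumptions: the function names, the number of quadrature nodes, and the parameter values are not from the text. It integrates the joint density over y and compares the result with the univariate Cauchy density with parameters $\theta$ and $\alpha$.

```python
import math


def bivariate_cauchy_pdf(x, y, theta, alpha, beta):
    """Joint density of Definition 12.3."""
    r2 = theta**2 + (x - alpha) ** 2 + (y - beta) ** 2
    return theta / (2 * math.pi * r2**1.5)


def marginal_x(x, theta, alpha, beta, n=20000):
    """Integrate f(x, y) over y using y = beta + c * tan(psi), -pi/2 < psi < pi/2."""
    c = math.sqrt(theta**2 + (x - alpha) ** 2)
    h = math.pi / n
    total = 0.0
    for k in range(n):
        psi = -math.pi / 2 + (k + 0.5) * h       # midpoint of the k-th cell
        y = beta + c * math.tan(psi)
        dy_dpsi = c / math.cos(psi) ** 2          # derivative of the substitution
        total += bivariate_cauchy_pdf(x, y, theta, alpha, beta) * dy_dpsi
    return total * h


# Illustrative parameter values (assumptions, not taken from the text).
theta, alpha, beta, x = 0.5, 0.0, 0.0, 0.3

numeric = marginal_x(x, theta, alpha, beta)
closed_form = theta / (math.pi * (theta**2 + (x - alpha) ** 2))

print(numeric, closed_form)
```

After the substitution the integrand is proportional to $\cos \psi$, which is why a plain midpoint rule on $(-\pi/2, \pi/2)$ is adequate here even though the original range of integration is infinite.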

The conditional distribution of Y given the event $X = x$ is given by

$$f(y/x) = \frac{f(x, y)}{f_1(x)} = \frac{1}{2}\, \frac{\theta^2 + (x - \alpha)^2}{\left[ \theta^2 + (x - \alpha)^2 + (y - \beta)^2 \right]^{3/2}}.$$

Similarly, the conditional distribution of X given the event $Y = y$ is

$$f(x/y) = \frac{1}{2}\, \frac{\theta^2 + (y - \beta)^2}{\left[ \theta^2 + (x - \alpha)^2 + (y - \beta)^2 \right]^{3/2}}.$$

The next theorem states some properties of the conditional densities $f(y/x)$ and $f(x/y)$.

Theorem 12.4. Let $(X, Y) \sim CAU(\theta, \alpha, \beta)$, where $\theta > 0$, $\alpha$ and $\beta$ are parameters. Then the conditional expectations

$$E(Y/x) = \beta, \qquad E(X/y) = \alpha,$$

and the conditional variances $Var(Y/x)$ and $Var(X/y)$ do not exist.

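Proof: First, we show that $E(Y/x)$ is $\beta$. The conditional expectation of Y given the event $X = x$ can be computed as

$$\begin{aligned}
E(Y/x) &= \int_{-\infty}^{\infty} y\, f(y/x)\, dy \\
&= \int_{-\infty}^{\infty} y\, \frac{1}{2}\, \frac{\theta^2 + (x - \alpha)^2}{\left[ \theta^2 + (x - \alpha)^2 + (y - \beta)^2 \right]^{3/2}}\, dy \\
&= \frac{1}{4} \left[ \theta^2 + (x - \alpha)^2 \right] \int_{-\infty}^{\infty} \frac{d\left( \theta^2 + (x - \alpha)^2 + (y - \beta)^2 \right)}{\left[ \theta^2 + (x - \alpha)^2 + (y - \beta)^2 \right]^{3/2}} \\
&\quad + \frac{\beta}{2} \left[ \theta^2 + (x - \alpha)^2 \right] \int_{-\infty}^{\infty} \frac{dy}{\left[ \theta^2 + (x - \alpha)^2 + (y - \beta)^2 \right]^{3/2}} \\
&= \frac{1}{4} \left[ \theta^2 + (x - \alpha)^2 \right] \left[ - \frac{2}{\sqrt{\theta^2 + (x - \alpha)^2 + (y - \beta)^2}} \right]_{-\infty}^{\infty} \\
&\quad + \frac{\beta}{2} \left[ \theta^2 + (x - \alpha)^2 \right] \int_{-\pi/2}^{\pi/2} \frac{\cos \psi\, d\psi}{\left[ \theta^2 + (x - \alpha)^2 \right]} \\
&= 0 + \beta \\
&= \beta.
\end{aligned}$$

Similarly, it can be shown that $E(X/y) = \alpha$. Next, we show that the conditional variance of Y given $X = x$ does not exist. To show this, we need $E\left( Y^2/x \right)$, which is given by

$$E\left( Y^2/x \right) = \int_{-\infty}^{\infty} y^2\, \frac{1}{2}\, \frac{\theta^2 + (x - \alpha)^2}{\left[ \theta^2 + (x - \alpha)^2 + (y - \beta)^2 \right]^{3/2}}\, dy.$$

The above integral does not exist and hence the conditional second moment of Y given $X = x$ does not exist. As a consequence, $Var(Y/x)$ also does not exist. Similarly, the variance of X given the event $Y = y$ also does not exist. This completes the proof of the theorem.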

12.3. Bivariate Gamma Distributions

In this section, we present three different bivariate gamma probability density functions and study some of their intrinsic properties.

Definition 12.4. A continuous bivariate random variable $(X, Y)$ is said to have the bivariate gamma distribution if its joint probability density function is of the form

$$f(x, y) = \begin{cases} \dfrac{(xy)^{\frac{1}{2}(\alpha - 1)}}{(1 - \theta)\, \Gamma(\alpha)\, \theta^{\frac{1}{2}(\alpha - 1)}}\; e^{-\frac{x + y}{1 - \theta}}\; I_{\alpha - 1}\!\left( \dfrac{2\sqrt{\theta x y}}{1 - \theta} \right) & \text{if } 0 \le x, y < \infty \\[2ex] 0 & \text{otherwise,} \end{cases}$$

where $\theta \in [0, 1)$ and $\alpha > 0$ are parameters, and

$$I_k(z) := \sum_{r=0}^{\infty} \frac{\left( \frac{1}{2} z \right)^{k + 2r}}{r!\; \Gamma(k + r + 1)}.$$

As usual, we denote this bivariate gamma random variable by writing $(X, Y) \sim GAMK(\alpha, \theta)$. The function $I_k(z)$ is called the modified Bessel function of the first kind of order $k$. In explicit form $f(x, y)$ is given by

$$f(x, y) = \begin{cases} \dfrac{1}{\theta^{\alpha - 1}\, \Gamma(\alpha)}\; e^{-\frac{x + y}{1 - \theta}} \displaystyle\sum_{k=0}^{\infty} \dfrac{(\theta x y)^{\alpha + k - 1}}{k!\; \Gamma(\alpha + k)\; (1 - \theta)^{\alpha + 2k}} & \text{for } 0 \le x, y < \infty \\[2ex] 0 & \text{otherwise.} \end{cases}$$

The following figures show the graph of the joint density function $f(x, y)$ of a bivariate gamma random variable with parameters $\alpha = 1$ and $\theta = 0.5$ and the equi-density curves of $f(x, y)$.

In 1941, Kibble found this bivariate gamma density function. However, Wicksell in 1933 had constructed the characteristic function of this bivariate gamma density function without knowing the explicit form of this density function. If $\{ (X_i, Y_i) \mid i = 1, 2, \ldots, n \}$ is a random sample from a bivariate normal distribution with zero means, then the bivariate random variable $(X, Y)$, where $X = \frac{1}{n} \sum_{i=1}^{n} X_i^2$ and $Y = \frac{1}{n} \sum_{i=1}^{n} Y_i^2$, has a bivariate gamma distribution. This fact was established by Wicksell by finding the characteristic function of $(X, Y)$. This bivariate gamma distribution has found applications in noise theory (see Rice (1944, 1945)).

The following theorem provides some important characteristics of the bivariate gamma distribution of Kibble.

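Theorem 12.5. Let the random variable $(X, Y) \sim GAMK(\alpha, \theta)$, where $0 < \alpha < \infty$ and $0 \le \theta < 1$ are parameters. Then the marginals of X and Y are univariate gamma and

$$\begin{aligned}
E(X) &= \alpha \\
E(Y) &= \alpha \\
Var(X) &= \alpha \\
Var(Y) &= \alpha \\
Cov(X, Y) &= \alpha\, \theta \\
M(s, t) &= \frac{1}{\left[ (1 - s)(1 - t) - \theta\, s\, t \right]^{\alpha}}.
\end{aligned}$$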

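Proof: First, we show that the marginal distribution of X is univariate gamma with parameter $\alpha$ (and scale parameter equal to 1). The marginal density of X is given by

$$\begin{aligned}
f_1(x) &= \int_0^{\infty} f(x, y)\, dy \\
&= \int_0^{\infty} \frac{1}{\theta^{\alpha - 1}\, \Gamma(\alpha)}\; e^{-\frac{x + y}{1 - \theta}} \sum_{k=0}^{\infty} \frac{(\theta x y)^{\alpha + k - 1}}{k!\; \Gamma(\alpha + k)\; (1 - \theta)^{\alpha + 2k}}\, dy \\
&= \sum_{k=0}^{\infty} \frac{1}{\theta^{\alpha - 1}\, \Gamma(\alpha)}\; e^{-\frac{x}{1 - \theta}}\; \frac{(\theta x)^{\alpha + k - 1}}{k!\; \Gamma(\alpha + k)\; (1 - \theta)^{\alpha + 2k}} \int_0^{\infty} y^{\alpha + k - 1}\, e^{-\frac{y}{1 - \theta}}\, dy \\
&= \sum_{k=0}^{\infty} \frac{1}{\theta^{\alpha - 1}\, \Gamma(\alpha)}\; e^{-\frac{x}{1 - \theta}}\; \frac{(\theta x)^{\alpha + k - 1}}{k!\; \Gamma(\alpha + k)\; (1 - \theta)^{\alpha + 2k}}\; (1 - \theta)^{\alpha + k}\; \Gamma(\alpha + k) \\
&= \sum_{k=0}^{\infty} \left( \frac{\theta}{1 - \theta} \right)^{k} \frac{1}{k!\; \Gamma(\alpha)}\; x^{\alpha + k - 1}\, e^{-\frac{x}{1 - \theta}} \\
&= \frac{1}{\Gamma(\alpha)}\; x^{\alpha - 1}\, e^{-\frac{x}{1 - \theta}} \sum_{k=0}^{\infty} \frac{1}{k!} \left( \frac{x \theta}{1 - \theta} \right)^{k} \\
&= \frac{1}{\Gamma(\alpha)}\; x^{\alpha - 1}\, e^{-\frac{x}{1 - \theta}}\, e^{\frac{x \theta}{1 - \theta}} \\
&= \frac{1}{\Gamma(\alpha)}\; x^{\alpha - 1}\, e^{-x}.
\end{aligned}$$

Thus, the marginal distribution of X is gamma with parameter $\alpha$ and scale parameter equal to 1. Therefore, by Theorem 6.3, we obtain

$$E(X) = \alpha, \qquad Var(X) = \alpha.$$

Similarly, we can show that the marginal density of Y is gamma with parameter $\alpha$ and scale parameter equal to 1. Hence, we have

$$E(Y) = \alpha, \qquad Var(Y) = \alpha.$$

The moment generating function can be computed in a similar manner and we leave it to the reader. This completes the proof of the theorem.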
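The marginal result can be checked numerically from the explicit series form of the density. The sketch below is illustrative only: the function names, the truncation of the series at 120 terms, the upper integration limit of 40, and the parameter values are assumptions chosen for this example. The series terms are accumulated in log space to avoid overflow.

```python
import math


def kibble_pdf(x, y, alpha, theta, terms=120):
    """Explicit series form of the Kibble bivariate gamma density, 0 < theta < 1."""
    if x <= 0 or y <= 0:
        return 0.0
    log_txy = math.log(theta * x * y)
    log_1mt = math.log(1 - theta)
    # log of the factor in front of the sum
    log_pref = (-(x + y) / (1 - theta)
                - (alpha - 1) * math.log(theta) - math.lgamma(alpha))
    total = 0.0
    for k in range(terms):
        log_term = ((alpha + k - 1) * log_txy - math.lgamma(k + 1)
                    - math.lgamma(alpha + k) - (alpha + 2 * k) * log_1mt)
        total += math.exp(log_pref + log_term)
    return total


def marginal_x(x, alpha, theta, upper=40.0, n=4000):
    """Midpoint-rule integral of f(x, y) over 0 < y < upper (truncated range)."""
    h = upper / n
    return h * sum(kibble_pdf(x, (j + 0.5) * h, alpha, theta) for j in range(n))


# Illustrative parameter values (assumptions, not taken from the text).
alpha, theta, x = 2.0, 0.5, 1.5

numeric = marginal_x(x, alpha, theta)
gamma_pdf = x ** (alpha - 1) * math.exp(-x) / math.gamma(alpha)

print(numeric, gamma_pdf)
```

The truncation limits are adequate for these parameter values because the integrand decays exponentially in y. Larger values of x or of $\theta$ close to 1 would call for a larger `upper` and more series terms.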

The following results are needed for the next theorem. From calculus we know that

$$e^{z} = \sum_{k=0}^{\infty} \frac{z^k}{k!}, \qquad (12.1)$$

and the infinite series on the right converges for all $z \in \mathbb{R}$. Differentiating both sides of (12.1) and then multiplying the resulting expression by $z$, one obtains

$$z\, e^{z} = \sum_{k=0}^{\infty} k\, \frac{z^k}{k!}. \qquad (12.2)$$

If one differentiates (12.2) again with respect to $z$ and multiplies the resulting expression by $z$, then one gets

$$z\, e^{z} + z^2 e^{z} = \sum_{k=0}^{\infty} k^2\, \frac{z^k}{k!}. \qquad (12.3)$$

Theorem 12.6. Let the random variable $(X, Y) \sim GAMK(\alpha, \theta)$, where $0 < \alpha < \infty$ and $0 \le \theta < 1$ are parameters. Then

$$\begin{aligned}
E(Y/x) &= \theta\, x + (1 - \theta)\, \alpha \\
E(X/y) &= \theta\, y + (1 - \theta)\, \alpha \\
Var(Y/x) &= (1 - \theta)\, [\, 2\theta\, x + (1 - \theta)\, \alpha\, ] \\
Var(X/y) &= (1 - \theta)\, [\, 2\theta\, y + (1 - \theta)\, \alpha\, ].
\end{aligned}$$

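Proof: First, we will find the conditional probability density function of Y given $X = x$, which is given by

$$\begin{aligned}
f(y/x) &= \frac{f(x, y)}{f_1(x)} \\
&= \frac{1}{\theta^{\alpha - 1}\, x^{\alpha - 1}\, e^{-x}}\; e^{-\frac{x + y}{1 - \theta}} \sum_{k=0}^{\infty} \frac{(\theta x y)^{\alpha + k - 1}}{k!\; \Gamma(\alpha + k)\; (1 - \theta)^{\alpha + 2k}} \\
&= e^{x - \frac{x}{1 - \theta}} \sum_{k=0}^{\infty} \frac{1}{\Gamma(\alpha + k)\; (1 - \theta)^{\alpha + 2k}}\; \frac{(\theta x)^k}{k!}\; y^{\alpha + k - 1}\, e^{-\frac{y}{1 - \theta}}.
\end{aligned}$$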

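Next, we compute the conditional expectation of Y given the event $X = x$. The conditional expectation $E(Y/x)$ is given by

$$\begin{aligned}
E(Y/x) &= \int_0^{\infty} y\, f(y/x)\, dy \\
&= \int_0^{\infty} y\; e^{x - \frac{x}{1 - \theta}} \sum_{k=0}^{\infty} \frac{1}{\Gamma(\alpha + k)\; (1 - \theta)^{\alpha + 2k}}\; \frac{(\theta x)^k}{k!}\; y^{\alpha + k - 1}\, e^{-\frac{y}{1 - \theta}}\, dy \\
&= e^{x - \frac{x}{1 - \theta}} \sum_{k=0}^{\infty} \frac{1}{\Gamma(\alpha + k)\; (1 - \theta)^{\alpha + 2k}}\; \frac{(\theta x)^k}{k!} \int_0^{\infty} y^{\alpha + k}\, e^{-\frac{y}{1 - \theta}}\, dy \\
&= e^{x - \frac{x}{1 - \theta}} \sum_{k=0}^{\infty} \frac{1}{\Gamma(\alpha + k)\; (1 - \theta)^{\alpha + 2k}}\; \frac{(\theta x)^k}{k!}\; (1 - \theta)^{\alpha + k + 1}\; \Gamma(\alpha + k + 1) \\
&= (1 - \theta)\; e^{x - \frac{x}{1 - \theta}} \sum_{k=0}^{\infty} (\alpha + k)\, \frac{1}{k!} \left( \frac{\theta x}{1 - \theta} \right)^{k} \\
&= (1 - \theta)\; e^{x - \frac{x}{1 - \theta}} \left[ \alpha\, e^{\frac{\theta x}{1 - \theta}} + \frac{\theta x}{1 - \theta}\, e^{\frac{\theta x}{1 - \theta}} \right] \qquad \text{(by (12.1) and (12.2))} \\
&= (1 - \theta)\, \alpha + \theta\, x.
\end{aligned}$$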

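In order to determine the conditional variance of Y given the event $X = x$, we need the conditional expectation of $Y^2$ given the event $X = x$. This conditional expectation can be evaluated as follows:

$$\begin{aligned}
E(Y^2/x) &= \int_0^{\infty} y^2 f(y/x)\, dy \\
&= \int_0^{\infty} y^2\; e^{x - \frac{x}{1 - \theta}} \sum_{k=0}^{\infty} \frac{1}{\Gamma(\alpha + k)\; (1 - \theta)^{\alpha + 2k}}\; \frac{(\theta x)^k}{k!}\; y^{\alpha + k - 1}\, e^{-\frac{y}{1 - \theta}}\, dy \\
&= e^{x - \frac{x}{1 - \theta}} \sum_{k=0}^{\infty} \frac{1}{\Gamma(\alpha + k)\; (1 - \theta)^{\alpha + 2k}}\; \frac{(\theta x)^k}{k!} \int_0^{\infty} y^{\alpha + k + 1}\, e^{-\frac{y}{1 - \theta}}\, dy \\
&= e^{x - \frac{x}{1 - \theta}} \sum_{k=0}^{\infty} \frac{1}{\Gamma(\alpha + k)\; (1 - \theta)^{\alpha + 2k}}\; \frac{(\theta x)^k}{k!}\; (1 - \theta)^{\alpha + k + 2}\; \Gamma(\alpha + k + 2) \\
&= (1 - \theta)^2\; e^{x - \frac{x}{1 - \theta}} \sum_{k=0}^{\infty} (\alpha + k + 1)(\alpha + k)\, \frac{1}{k!} \left( \frac{\theta x}{1 - \theta} \right)^{k} \\
&= (1 - \theta)^2\; e^{x - \frac{x}{1 - \theta}} \sum_{k=0}^{\infty} \left( \alpha^2 + 2\alpha k + k^2 + \alpha + k \right) \frac{1}{k!} \left( \frac{\theta x}{1 - \theta} \right)^{k} \\
&= (1 - \theta)^2 \left[ \alpha^2 + \alpha + (2\alpha + 1)\, \frac{\theta x}{1 - \theta} + e^{x - \frac{x}{1 - \theta}} \sum_{k=0}^{\infty} \frac{k^2}{k!} \left( \frac{\theta x}{1 - \theta} \right)^{k} \right] \\
&= (1 - \theta)^2 \left[ \alpha^2 + \alpha + (2\alpha + 1)\, \frac{\theta x}{1 - \theta} + \frac{\theta x}{1 - \theta} + \left( \frac{\theta x}{1 - \theta} \right)^{2} \right] \qquad \text{(by (12.3))} \\
&= (\alpha^2 + \alpha)(1 - \theta)^2 + 2(\alpha + 1)\, \theta\, (1 - \theta)\, x + \theta^2 x^2.
\end{aligned}$$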

The conditional variance of Y given $X = x$ is

$$\begin{aligned}
Var(Y/x) &= E(Y^2/x) - E(Y/x)^2 \\
&= (\alpha^2 + \alpha)(1 - \theta)^2 + 2(\alpha + 1)\, \theta\, (1 - \theta)\, x + \theta^2 x^2 \\
&\quad - \left[ (1 - \theta)^2 \alpha^2 + \theta^2 x^2 + 2\, \alpha\, \theta\, (1 - \theta)\, x \right] \\
&= (1 - \theta)\, [\, \alpha\, (1 - \theta) + 2\, \theta\, x\, ].
\end{aligned}$$

Since the density function $f(x, y)$ is symmetric, that is $f(x, y) = f(y, x)$, the conditional expectation $E(X/y)$ and the conditional variance $Var(X/y)$ can be obtained by interchanging x with y in the formulae of $E(Y/x)$ and $Var(Y/x)$. This completes the proof of the theorem.
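The series that appear in this proof can be summed numerically to confirm the closed forms of Theorem 12.6. In the sketch below (the function name, the truncation at 200 terms, and the parameter values are illustrative assumptions), the weight $e^{-w} w^k / k!$ with $w = \theta x / (1 - \theta)$ is the common factor $e^{x - x/(1-\theta)} \frac{1}{k!} \left( \frac{\theta x}{1 - \theta} \right)^k$ of the sums for $E(Y/x)$ and $E(Y^2/x)$.

```python
import math


def kibble_conditional_moments(x, alpha, theta, terms=200):
    """Sum the series for E(Y/x) and E(Y^2/x); requires x > 0 and 0 < theta < 1."""
    w = theta * x / (1 - theta)
    s1 = 0.0
    s2 = 0.0
    for k in range(terms):
        # e^{-w} w^k / k!, evaluated in log space for numerical stability
        weight = math.exp(-w + k * math.log(w) - math.lgamma(k + 1))
        s1 += (alpha + k) * weight
        s2 += (alpha + k + 1) * (alpha + k) * weight
    mean = (1 - theta) * s1
    second_moment = (1 - theta) ** 2 * s2
    return mean, second_moment - mean**2


# Illustrative parameter values (assumptions, not taken from the text).
alpha, theta, x = 2.5, 0.4, 3.0

mean_series, var_series = kibble_conditional_moments(x, alpha, theta)
mean_closed = theta * x + (1 - theta) * alpha
var_closed = (1 - theta) * (2 * theta * x + (1 - theta) * alpha)

print(mean_series, mean_closed)
print(var_series, var_closed)
```

The regression curve $E(Y/x)$ is linear in x with slope $\theta$, and the scedastic curve $Var(Y/x)$ is also linear in x, so the conditional spread grows with the conditioning value.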

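In 1941, Cherian constructed a bivariate gamma distribution whose probability density function is given by

$$f(x, y) = \begin{cases} \dfrac{e^{-(x + y)}}{\prod_{i=1}^{3} \Gamma(\alpha_i)} \displaystyle\int_0^{\min\{x, y\}} \dfrac{z^{\alpha_3}\, (x - z)^{\alpha_1}\, (y - z)^{\alpha_2}}{z\, (x - z)\, (y - z)}\; e^{z}\, dz & \text{if } 0 < x, y < \infty \\[2ex] 0 & \text{otherwise,} \end{cases}$$

where $\alpha_1, \alpha_2, \alpha_3 \in (0, \infty)$ are parameters. If a bivariate random variable $(X, Y)$ has a Cherian bivariate gamma probability density function with parameters $\alpha_1$, $\alpha_2$ and $\alpha_3$, then we denote this by writing $(X, Y) \sim GAMC(\alpha_1, \alpha_2, \alpha_3)$.

It can be shown that the marginals of $f(x, y)$ are given by

$$f_1(x) = \begin{cases} \dfrac{1}{\Gamma(\alpha_1 + \alpha_3)}\; x^{\alpha_1 + \alpha_3 - 1}\, e^{-x} & \text{if } 0 < x < \infty \\[1ex] 0 & \text{otherwise} \end{cases}$$

and

$$f_2(y) = \begin{cases} \dfrac{1}{\Gamma(\alpha_2 + \alpha_3)}\; y^{\alpha_2 + \alpha_3 - 1}\, e^{-y} & \text{if } 0 < y < \infty \\[1ex] 0 & \text{otherwise.} \end{cases}$$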

Hence, we have the following theorem.

Theorem 12.7. If $(X, Y) \sim GAMC(\alpha, \beta, \gamma)$, then

$$\begin{aligned}
E(X) &= \alpha + \gamma \\
E(Y) &= \beta + \gamma \\
Var(X) &= \alpha + \gamma \\
Var(Y) &= \beta + \gamma \\
E(XY) &= \gamma + (\alpha + \gamma)(\beta + \gamma).
\end{aligned}$$

The following theorem can be established by first computing the conditional probability density functions. We leave the proof of the following theorem to the reader.

Theorem 12.8. If $(X, Y) \sim GAMC(\alpha, \beta, \gamma)$, then

$$E(Y/x) = \beta + \frac{\gamma}{\alpha + \gamma}\, x \quad \text{and} \quad E(X/y) = \alpha + \frac{\gamma}{\beta + \gamma}\, y.$$

David and Fix (1961) have studied the rank correlation and regression for

samples from this distribution. For an account of this bivariate gamma dis-

tribution the interested reader should refer to Moran (1967).
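The moments in Theorems 12.7 and 12.8 can be explored by simulation. The sketch below rests on an assumption that is not stated in the text above: it generates $(X, Y)$ as $X = U + W$, $Y = V + W$ with independent gamma variables $U$, $V$, $W$ of shapes $\alpha$, $\beta$, $\gamma$ and unit scale, whose joint density is the convolution integral given for the Cherian distribution. The function name, seed, sample size, and parameter values are illustrative.

```python
import random


def sample_cherian(shape_a, shape_b, shape_g, rng):
    """One draw of (X, Y) = (U + W, V + W); construction assumed, see lead-in."""
    u = rng.gammavariate(shape_a, 1.0)
    v = rng.gammavariate(shape_b, 1.0)
    w = rng.gammavariate(shape_g, 1.0)   # shared component inducing dependence
    return u + w, v + w


rng = random.Random(12345)
shape_a, shape_b, shape_g = 2.0, 3.0, 1.5   # illustrative values only
n = 200000
pairs = [sample_cherian(shape_a, shape_b, shape_g, rng) for _ in range(n)]

mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n
mean_xy = sum(x * y for x, y in pairs) / n
var_x = sum((x - mean_x) ** 2 for x, _ in pairs) / n
cov_xy = mean_xy - mean_x * mean_y

# Least-squares slope of Y on X, to compare with the slope in Theorem 12.8.
slope_y_on_x = cov_xy / var_x

print(mean_x, shape_a + shape_g)
print(mean_xy, shape_g + (shape_a + shape_g) * (shape_b + shape_g))
print(slope_y_on_x, shape_g / (shape_a + shape_g))
```

Because the regression of Y on x in Theorem 12.8 is linear, the least-squares slope estimated from the sample should be close to $\gamma / (\alpha + \gamma)$, up to Monte Carlo error.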

In 1934, McKay gave another bivariate gamma distribution whose probability density function is of the form

$$f(x, y) = \begin{cases} \dfrac{\theta^{\alpha + \beta}}{\Gamma(\alpha)\, \Gamma(\beta)}\; x^{\alpha - 1}\, (y - x)^{\beta - 1}\, e^{-\theta y} & \text{if } 0 < x < y < \infty \\[2ex] 0 & \text{otherwise,} \end{cases}$$

where $\theta, \alpha, \beta \in (0, \infty)$ are parameters. If the form of the joint density of the random variable $(X, Y)$ is similar to the density function of the bivariate gamma distribution of McKay, then we write $(X, Y) \sim GAMM(\theta, \alpha, \beta)$.

The graph of the probability density function $f(x, y)$ of the bivariate gamma distribution of McKay for $\theta = \alpha = \beta = 2$ is shown below. The other figure illustrates the equi-density curves of this joint density function $f(x, y)$.

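It can be shown that if $(X, Y) \sim GAMM(\theta, \alpha, \beta)$, then the marginal $f_1(x)$ of X and the marginal $f_2(y)$ of Y are given by

$$f_1(x) = \begin{cases} \dfrac{\theta^{\alpha}}{\Gamma(\alpha)}\; x^{\alpha - 1}\, e^{-\theta x} & \text{if } 0 \le x < \infty \\[1ex] 0 & \text{otherwise} \end{cases}$$

and

$$f_2(y) = \begin{cases} \dfrac{\theta^{\alpha + \beta}}{\Gamma(\alpha + \beta)}\; y^{\alpha + \beta - 1}\, e^{-\theta y} & \text{if } 0 \le y < \infty \\[1ex] 0 & \text{otherwise.} \end{cases}$$

Hence $X \sim GAM\left( \alpha, \frac{1}{\theta} \right)$ and $Y \sim GAM\left( \alpha + \beta, \frac{1}{\theta} \right)$. Therefore, we have the following theorem.

Theorem 12.9. If $(X, Y) \sim GAMM(\theta, \alpha, \beta)$, then

$$\begin{aligned}
E(X) &= \frac{\alpha}{\theta} \\
E(Y) &= \frac{\alpha + \beta}{\theta} \\
Var(X) &= \frac{\alpha}{\theta^2} \\
Var(Y) &= \frac{\alpha + \beta}{\theta^2} \\
M(s, t) &= \left( \frac{\theta}{\theta - s - t} \right)^{\alpha} \left( \frac{\theta}{\theta - t} \right)^{\beta}.
\end{aligned}$$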

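We state the various properties of the conditional densities of $f(x, y)$, without proof, in the following theorem.

Theorem 12.10. If $(X, Y) \sim GAMM(\theta, \alpha, \beta)$, then

$$\begin{aligned}
E(Y/x) &= x + \frac{\beta}{\theta} \\
E(X/y) &= \frac{\alpha\, y}{\alpha + \beta} \\
Var(Y/x) &= \frac{\beta}{\theta^2} \\
Var(X/y) &= \frac{\alpha\, \beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}\; y^2.
\end{aligned}$$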
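A simulation gives a quick plausibility check of Theorems 12.9 and 12.10. The McKay density factors as a gamma density in $x$ (shape $\alpha$, rate $\theta$) times a gamma density in $y - x$ (shape $\beta$, rate $\theta$), so one way to sample it, sketched here, is to draw $X$ and an independent increment $Y - X$. The function name, seed, sample size, and parameter values are illustrative assumptions.

```python
import random


def sample_mckay(theta, alpha, beta, rng):
    """Draw (X, Y) with X ~ gamma(alpha) and Y - X ~ gamma(beta), both with scale 1/theta."""
    x = rng.gammavariate(alpha, 1.0 / theta)
    increment = rng.gammavariate(beta, 1.0 / theta)   # independent of x
    return x, x + increment


rng = random.Random(2024)
theta, alpha, beta = 2.0, 2.0, 2.0   # same values as the figure described in the text
n = 200000
pairs = [sample_mckay(theta, alpha, beta, rng) for _ in range(n)]

mean_x = sum(x for x, _ in pairs) / n
mean_y = sum(y for _, y in pairs) / n
var_y = sum((y - mean_y) ** 2 for _, y in pairs) / n
cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in pairs) / n

# E(X/y) is linear in y with slope alpha / (alpha + beta), so the
# least-squares slope of X on Y should be close to that value.
slope_x_on_y = cov_xy / var_y

print(mean_x, alpha / theta)
print(mean_y, (alpha + beta) / theta)
print(var_y, (alpha + beta) / theta**2)
print(slope_x_on_y, alpha / (alpha + beta))
```

Note that Python's `random.gammavariate` takes a shape and a scale, so the rate $\theta$ enters as the scale `1.0 / theta`.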

We know that the univariate exponential distribution is a special case of the univariate gamma distribution. Similarly, the bivariate exponential distribution is a special case of the bivariate gamma distribution. On taking the index parameters to be unity in the Kibble and Cherian bivariate gamma distributions given above, we obtain the corresponding bivariate exponential distributions. The bivariate exponential probability density function corresponding to the bivariate gamma distribution of Kibble is given by

$$f(x, y) = \begin{cases} e^{-\left( \frac{x + y}{1 - \theta} \right)} \displaystyle\sum_{k=0}^{\infty} \dfrac{(\theta x y)^{k}}{k!\; \Gamma(k + 1)\; (1 - \theta)^{2k + 1}} & \text{if } 0 < x, y < \infty \\[2ex] 0 & \text{otherwise,} \end{cases}$$

where $\theta \in (0, 1)$ is a parameter. The bivariate exponential distribution corresponding to the Cherian bivariate distribution is the following:

$$f(x, y) = \begin{cases} \left[ e^{\min\{x, y\}} - 1 \right] e^{-(x + y)} & \text{if } 0 < x, y < \infty \\[1ex] 0 & \text{otherwise.} \end{cases}$$

In 1960, Gumbel studied the following bivariate exponential distribution whose density function is given by

$$f(x, y) = \begin{cases} \left[ (1 + \theta x)(1 + \theta y) - \theta \right] e^{-(x + y + \theta x y)} & \text{if } 0 < x, y < \infty \\[1ex] 0 & \text{otherwise,} \end{cases}$$

where $\theta > 0$ is a parameter.

In 1967, Marshall and Olkin introduced the following bivariate exponential distribution

$$F(x, y) = \begin{cases} 1 - e^{-(\alpha + \gamma) x} - e^{-(\beta + \gamma) y} + e^{-(\alpha x + \beta y + \gamma \max\{x, y\})} & \text{if } x, y > 0 \\[1ex] 0 & \text{otherwise,} \end{cases}$$

where $\alpha, \beta, \gamma > 0$ are parameters. The exponential distribution function of Marshall and Olkin satisfies the lack of memory property

$$P(X > x + t,\ Y > y + t \,/\, X > t,\ Y > t) = P(X > x,\ Y > y).$$
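The lack of memory property can be verified directly from the distribution function. The following sketch (function names and parameter values are illustrative assumptions) obtains the joint survival probability $P(X > x, Y > y)$ from $F$ by inclusion-exclusion, using the marginal distribution functions $1 - e^{-(\alpha + \gamma)x}$ and $1 - e^{-(\beta + \gamma)y}$ that arise from $F(x, y)$ by letting the other argument tend to infinity.

```python
import math


def mo_cdf(x, y, a, b, g):
    """Marshall-Olkin joint distribution function F(x, y)."""
    if x <= 0 or y <= 0:
        return 0.0
    return (1 - math.exp(-(a + g) * x) - math.exp(-(b + g) * y)
            + math.exp(-(a * x + b * y + g * max(x, y))))


def mo_survival(x, y, a, b, g):
    """P(X > x, Y > y) = 1 - F1(x) - F2(y) + F(x, y) for x, y > 0."""
    f1 = 1 - math.exp(-(a + g) * x)   # marginal cdf of X
    f2 = 1 - math.exp(-(b + g) * y)   # marginal cdf of Y
    return 1 - f1 - f2 + mo_cdf(x, y, a, b, g)


# Illustrative parameter values (assumptions, not taken from the text).
a, b, g = 0.7, 1.3, 0.5
x, y, t = 0.4, 1.1, 0.8

lhs = mo_survival(x + t, y + t, a, b, g) / mo_survival(t, t, a, b, g)
rhs = mo_survival(x, y, a, b, g)

print(lhs, rhs)
```

The property holds because the survival function equals $e^{-(\alpha x + \beta y + \gamma \max\{x, y\})}$ and $\max\{x + t, y + t\} = \max\{x, y\} + t$.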

12.4. Bivariate Beta Distribution

The bivariate beta distribution (also known as Dirichlet distribution ) is

one of the basic distributions in statistics. The bivariate beta distribution

is used in geology, biology, and chemistry for handling compositional data

which are subject to nonnegativity and constant-sum constraints. It is also

used nowadays with increasing frequency in statistical modeling, distribu-

tion theory and Bayesian statistics. For example, it is used to model the

distribution of brand shares of certain consumer products, and in describing

the joint distribution of two soil strength parameters. Further, it is used in

modeling the proportions of the electorate who vote for a candidate in a

two-candidate election. In Bayesian statistics, the beta distribution is very

popular as a prior since it yields a beta distribution as posterior. In this

section, we give some basic facts about the bivariate beta distribution.

Definition 12.5. A continuous bivariate random variable $(X, Y)$ is said to have the bivariate beta distribution if its joint probability density function is of the form

$$f(x, y) = \begin{cases} \dfrac{\Gamma(\theta_1 + \theta_2 + \theta_3)}{\Gamma(\theta_1)\, \Gamma(\theta_2)\, \Gamma(\theta_3)}\; x^{\theta_1 - 1}\, y^{\theta_2 - 1}\, (1 - x - y)^{\theta_3 - 1} & \text{if } 0 < x,\ 0 < y,\ x + y < 1 \\[2ex] 0 & \text{otherwise,} \end{cases}$$

where $\theta_1, \theta_2, \theta_3$ are positive parameters. We will denote a bivariate beta random variable $(X, Y)$ with positive parameters $\theta_1$, $\theta_2$ and $\theta_3$ by writing $(X, Y) \sim Beta(\theta_1, \theta_2, \theta_3)$.

The following figures show the graph and the equi-density curves of $f(x, y)$ on the domain of its definition.

In the following theorem, we present the expected values, the variances

of the random variables X and Y , and the correlation between X and Y.

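Theorem 12.11. Let $(X, Y) \sim Beta(\theta_1, \theta_2, \theta_3)$, where $\theta_1$, $\theta_2$ and $\theta_3$ are positive a priori chosen parameters. Then $X \sim Beta(\theta_1, \theta_2 + \theta_3)$ and $Y \sim Beta(\theta_2, \theta_1 + \theta_3)$ and

$$\begin{aligned}
E(X) &= \frac{\theta_1}{\theta}, & Var(X) &= \frac{\theta_1\, (\theta - \theta_1)}{\theta^2\, (\theta + 1)} \\
E(Y) &= \frac{\theta_2}{\theta}, & Var(Y) &= \frac{\theta_2\, (\theta - \theta_2)}{\theta^2\, (\theta + 1)} \\
Cov(X, Y) &= - \frac{\theta_1\, \theta_2}{\theta^2\, (\theta + 1)}
\end{aligned}$$

where $\theta = \theta_1 + \theta_2 + \theta_3$.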

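Proof: First, we show that $X \sim Beta(\theta_1, \theta_2 + \theta_3)$ and $Y \sim Beta(\theta_2, \theta_1 + \theta_3)$. Since $(X, Y) \sim Beta(\theta_1, \theta_2, \theta_3)$, the joint density of $(X, Y)$ is given by

$$f(x, y) = \frac{\Gamma(\theta)}{\Gamma(\theta_1)\, \Gamma(\theta_2)\, \Gamma(\theta_3)}\; x^{\theta_1 - 1}\, y^{\theta_2 - 1}\, (1 - x - y)^{\theta_3 - 1},$$

where $\theta = \theta_1 + \theta_2 + \theta_3$. Thus the marginal density of X is given by

$$\begin{aligned}
f_1(x) &= \int_0^{1 - x} f(x, y)\, dy \\
&= \frac{\Gamma(\theta)}{\Gamma(\theta_1)\, \Gamma(\theta_2)\, \Gamma(\theta_3)}\; x^{\theta_1 - 1} \int_0^{1 - x} y^{\theta_2 - 1}\, (1 - x - y)^{\theta_3 - 1}\, dy \\
&= \frac{\Gamma(\theta)}{\Gamma(\theta_1)\, \Gamma(\theta_2)\, \Gamma(\theta_3)}\; x^{\theta_1 - 1}\, (1 - x)^{\theta_3 - 1} \int_0^{1 - x} y^{\theta_2 - 1} \left( 1 - \frac{y}{1 - x} \right)^{\theta_3 - 1} dy.
\end{aligned}$$

Now we substitute $u = \frac{y}{1 - x}$ in the above integral. Then we have

$$\begin{aligned}
f_1(x) &= \frac{\Gamma(\theta)}{\Gamma(\theta_1)\, \Gamma(\theta_2)\, \Gamma(\theta_3)}\; x^{\theta_1 - 1}\, (1 - x)^{\theta_2 + \theta_3 - 1} \int_0^1 u^{\theta_2 - 1}\, (1 - u)^{\theta_3 - 1}\, du \\
&= \frac{\Gamma(\theta)}{\Gamma(\theta_1)\, \Gamma(\theta_2)\, \Gamma(\theta_3)}\; x^{\theta_1 - 1}\, (1 - x)^{\theta_2 + \theta_3 - 1}\, B(\theta_2, \theta_3) \\
&= \frac{\Gamma(\theta)}{\Gamma(\theta_1)\, \Gamma(\theta_2 + \theta_3)}\; x^{\theta_1 - 1}\, (1 - x)^{\theta_2 + \theta_3 - 1}
\end{aligned}$$

since

$$\int_0^1 u^{\theta_2 - 1}\, (1 - u)^{\theta_3 - 1}\, du = B(\theta_2, \theta_3) = \frac{\Gamma(\theta_2)\, \Gamma(\theta_3)}{\Gamma(\theta_2 + \theta_3)}.$$

This proves that the random variable $X \sim Beta(\theta_1, \theta_2 + \theta_3)$. Similarly, one can show that the random variable $Y \sim Beta(\theta_2, \theta_1 + \theta_3)$. Now using Theorem 6.5, we see that

$$\begin{aligned}
E(X) &= \frac{\theta_1}{\theta}, & Var(X) &= \frac{\theta_1\, (\theta - \theta_1)}{\theta^2\, (\theta + 1)} \\
E(Y) &= \frac{\theta_2}{\theta}, & Var(Y) &= \frac{\theta_2\, (\theta - \theta_2)}{\theta^2\, (\theta + 1)},
\end{aligned}$$

where $\theta = \theta_1 + \theta_2 + \theta_3$.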

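Next, we compute the product moment of X and Y. Consider

$$\begin{aligned}
E(XY) &= \int_0^1 \left[ \int_0^{1 - x} xy\, f(x, y)\, dy \right] dx \\
&= \frac{\Gamma(\theta)}{\Gamma(\theta_1)\, \Gamma(\theta_2)\, \Gamma(\theta_3)} \int_0^1 \left[ \int_0^{1 - x} xy\; x^{\theta_1 - 1}\, y^{\theta_2 - 1}\, (1 - x - y)^{\theta_3 - 1}\, dy \right] dx \\
&= \frac{\Gamma(\theta)}{\Gamma(\theta_1)\, \Gamma(\theta_2)\, \Gamma(\theta_3)} \int_0^1 \left[ \int_0^{1 - x} x^{\theta_1}\, y^{\theta_2}\, (1 - x - y)^{\theta_3 - 1}\, dy \right] dx \\
&= \frac{\Gamma(\theta)}{\Gamma(\theta_1)\, \Gamma(\theta_2)\, \Gamma(\theta_3)} \int_0^1 x^{\theta_1}\, (1 - x)^{\theta_3 - 1} \left[ \int_0^{1 - x} y^{\theta_2} \left( 1 - \frac{y}{1 - x} \right)^{\theta_3 - 1} dy \right] dx.
\end{aligned}$$

Now we substitute $u = \frac{y}{1 - x}$ in the above integral to obtain

$$E(XY) = \frac{\Gamma(\theta)}{\Gamma(\theta_1)\, \Gamma(\theta_2)\, \Gamma(\theta_3)} \int_0^1 x^{\theta_1}\, (1 - x)^{\theta_2 + \theta_3} \left[ \int_0^1 u^{\theta_2}\, (1 - u)^{\theta_3 - 1}\, du \right] dx.$$

Since

$$\int_0^1 u^{\theta_2}\, (1 - u)^{\theta_3 - 1}\, du = B(\theta_2 + 1, \theta_3)$$

and

$$\int_0^1 x^{\theta_1}\, (1 - x)^{\theta_2 + \theta_3}\, dx = B(\theta_1 + 1, \theta_2 + \theta_3 + 1),$$

we have

$$\begin{aligned}
E(XY) &= \frac{\Gamma(\theta)}{\Gamma(\theta_1)\, \Gamma(\theta_2)\, \Gamma(\theta_3)}\; B(\theta_2 + 1, \theta_3)\; B(\theta_1 + 1, \theta_2 + \theta_3 + 1) \\
&= \frac{\Gamma(\theta)}{\Gamma(\theta_1)\, \Gamma(\theta_2)\, \Gamma(\theta_3)} \cdot \frac{\theta_1\, \Gamma(\theta_1)\, (\theta_2 + \theta_3)\, \Gamma(\theta_2 + \theta_3)}{\theta\, (\theta + 1)\, \Gamma(\theta)} \cdot \frac{\theta_2\, \Gamma(\theta_2)\, \Gamma(\theta_3)}{(\theta_2 + \theta_3)\, \Gamma(\theta_2 + \theta_3)} \\
&= \frac{\theta_1\, \theta_2}{\theta\, (\theta + 1)}, \qquad \text{where } \theta = \theta_1 + \theta_2 + \theta_3.
\end{aligned}$$

Now it is easy to compute the covariance of X and Y since

$$\begin{aligned}
Cov(X, Y) &= E(XY) - E(X)\, E(Y) \\
&= \frac{\theta_1\, \theta_2}{\theta\, (\theta + 1)} - \frac{\theta_1}{\theta} \cdot \frac{\theta_2}{\theta} \\
&= - \frac{\theta_1\, \theta_2}{\theta^2\, (\theta + 1)}.
\end{aligned}$$

The proof of the theorem is now complete.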

The correlation coefficient of X and Y can be computed using the covariance as

$$\rho = \frac{Cov(X, Y)}{\sqrt{Var(X)\, Var(Y)}} = - \sqrt{\frac{\theta_1\, \theta_2}{(\theta_1 + \theta_3)(\theta_2 + \theta_3)}}.$$
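The product moment, covariance, and correlation above can be checked by integrating the joint density over the triangle $0 < x,\ 0 < y,\ x + y < 1$. The following is a rough numerical sketch, with illustrative function names, grid size, and parameter values (chosen with every $\theta_i \ge 2$ so that the density is smooth and vanishes on the boundary); it is not a general-purpose integrator.

```python
import math


def bivariate_beta_pdf(x, y, t1, t2, t3):
    """Joint density of Definition 12.5, evaluated via logarithms."""
    if x <= 0 or y <= 0 or x + y >= 1:
        return 0.0
    log_c = (math.lgamma(t1 + t2 + t3) - math.lgamma(t1)
             - math.lgamma(t2) - math.lgamma(t3))
    return math.exp(log_c + (t1 - 1) * math.log(x) + (t2 - 1) * math.log(y)
                    + (t3 - 1) * math.log(1 - x - y))


def expect(g, t1, t2, t3, n=300):
    """Midpoint-rule approximation of E[g(X, Y)] over the triangle x + y < 1."""
    h = 1.0 / n
    total = 0.0
    for i in range(n):
        x = (i + 0.5) * h
        for j in range(n - i):
            y = (j + 0.5) * h
            total += g(x, y) * bivariate_beta_pdf(x, y, t1, t2, t3)
    return total * h * h


t1, t2, t3 = 2.0, 3.0, 4.0   # illustrative values only
t = t1 + t2 + t3

ex = expect(lambda x, y: x, t1, t2, t3)
ey = expect(lambda x, y: y, t1, t2, t3)
exy = expect(lambda x, y: x * y, t1, t2, t3)
var_x = expect(lambda x, y: x * x, t1, t2, t3) - ex**2
var_y = expect(lambda x, y: y * y, t1, t2, t3) - ey**2

cov_numeric = exy - ex * ey
rho_numeric = cov_numeric / math.sqrt(var_x * var_y)

cov_formula = -t1 * t2 / (t**2 * (t + 1))
rho_formula = -math.sqrt(t1 * t2 / ((t1 + t3) * (t2 + t3)))

print(exy, t1 * t2 / (t * (t + 1)))
print(cov_numeric, cov_formula)
print(rho_numeric, rho_formula)
```

The covariance is always negative, which reflects the constraint $x + y < 1$: a larger value of X leaves less room for Y.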

The next theorem states some properties of the conditional density functions $f(x/y)$ and $f(y/x)$.

Theorem 12.12. Let $(X, Y) \sim Beta(\theta_1, \theta_2, \theta_3)$ where $\theta_1$, $\theta_2$ and $\theta_3$ are positive parameters. Then

$$\begin{aligned}
E(Y/x) &= \frac{\theta_2\, (1 - x)}{\theta_2 + \theta_3}, & Var(Y/x) &= \frac{\theta_2\, \theta_3\, (1 - x)^2}{(\theta_2 + \theta_3)^2\, (\theta_2 + \theta_3 + 1)} \\
E(X/y) &= \frac{\theta_1\, (1 - y)}{\theta_1 + \theta_3}, & Var(X/y) &= \frac{\theta_1\, \theta_3\, (1 - y)^2}{(\theta_1 + \theta_3)^2\, (\theta_1 + \theta_3 + 1)}.
\end{aligned}$$

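Proof: We know that if $(X, Y) \sim Beta(\theta_1, \theta_2, \theta_3)$, the random variable $X \sim Beta(\theta_1, \theta_2 + \theta_3)$. Therefore

$$\begin{aligned}
f(y/x) &= \frac{f(x, y)}{f_1(x)} \\
&= \frac{1}{1 - x}\; \frac{\Gamma(\theta_2 + \theta_3)}{\Gamma(\theta_2)\, \Gamma(\theta_3)} \left( \frac{y}{1 - x} \right)^{\theta_2 - 1} \left( 1 - \frac{y}{1 - x} \right)^{\theta_3 - 1}
\end{aligned}$$

for all $0 < y < 1 - x$. Thus the random variable $\frac{Y}{1 - x} \,\big|\, X = x$ is a beta random variable with parameters $\theta_2$ and $\theta_3$.

Now we compute the conditional expectation of $Y/x$. Consider

$$\begin{aligned}
E(Y/x) &= \int_0^{1 - x} y\, f(y/x)\, dy \\
&= \frac{1}{1 - x}\; \frac{\Gamma(\theta_2 + \theta_3)}{\Gamma(\theta_2)\, \Gamma(\theta_3)} \int_0^{1 - x} y \left( \frac{y}{1 - x} \right)^{\theta_2 - 1} \left( 1 - \frac{y}{1 - x} \right)^{\theta_3 - 1} dy.
\end{aligned}$$

Now we substitute $u = \frac{y}{1 - x}$ in the above integral to obtain

$$\begin{aligned}
E(Y/x) &= \frac{\Gamma(\theta_2 + \theta_3)}{\Gamma(\theta_2)\, \Gamma(\theta_3)}\; (1 - x) \int_0^1 u^{\theta_2}\, (1 - u)^{\theta_3 - 1}\, du \\
&= \frac{\Gamma(\theta_2 + \theta_3)}{\Gamma(\theta_2)\, \Gamma(\theta_3)}\; (1 - x)\; B(\theta_2 + 1, \theta_3) \\
&= \frac{\Gamma(\theta_2 + \theta_3)}{\Gamma(\theta_2)\, \Gamma(\theta_3)}\; (1 - x)\; \frac{\theta_2\, \Gamma(\theta_2)\, \Gamma(\theta_3)}{(\theta_2 + \theta_3)\, \Gamma(\theta_2 + \theta_3)} \\
&= \frac{\theta_2}{\theta_2 + \theta_3}\; (1 - x).
\end{aligned}$$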

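Next, we compute $E(Y^2/x)$. Consider

$$\begin{aligned}
E(Y^2/x) &= \int_0^{1 - x} y^2 f(y/x)\, dy \\
&= \frac{1}{1 - x}\; \frac{\Gamma(\theta_2 + \theta_3)}{\Gamma(\theta_2)\, \Gamma(\theta_3)} \int_0^{1 - x} y^2 \left( \frac{y}{1 - x} \right)^{\theta_2 - 1} \left( 1 - \frac{y}{1 - x} \right)^{\theta_3 - 1} dy \\
&= \frac{\Gamma(\theta_2 + \theta_3)}{\Gamma(\theta_2)\, \Gamma(\theta_3)}\; (1 - x)^2 \int_0^1 u^{\theta_2 + 1}\, (1 - u)^{\theta_3 - 1}\, du \\
&= \frac{\Gamma(\theta_2 + \theta_3)}{\Gamma(\theta_2)\, \Gamma(\theta_3)}\; (1 - x)^2\; B(\theta_2 + 2, \theta_3) \\
&= \frac{\Gamma(\theta_2 + \theta_3)}{\Gamma(\theta_2)\, \Gamma(\theta_3)}\; (1 - x)^2\; \frac{(\theta_2 + 1)\, \theta_2\, \Gamma(\theta_2)\, \Gamma(\theta_3)}{(\theta_2 + \theta_3 + 1)(\theta_2 + \theta_3)\, \Gamma(\theta_2 + \theta_3)} \\
&= \frac{(\theta_2 + 1)\, \theta_2}{(\theta_2 + \theta_3 + 1)(\theta_2 + \theta_3)}\; (1 - x)^2.
\end{aligned}$$

Therefore

$$Var(Y/x) = E(Y^2/x) - E(Y/x)^2 = \frac{\theta_2\, \theta_3\, (1 - x)^2}{(\theta_2 + \theta_3)^2\, (\theta_2 + \theta_3 + 1)}.$$

Similarly, one can compute $E(X/y)$ and $Var(X/y)$. We leave this computation to the reader. The proof of the theorem is now complete.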

The Dirichlet distribution can be extended from the unit square $(0, 1)^2$ to an arbitrary rectangle $(a_1, b_1) \times (a_2, b_2)$.

Definition 12.6. A continuous bivariate random variable $(X_1, X_2)$ is said to have the generalized bivariate beta distribution if its joint probability density function is of the form

$$f(x_1, x_2) = \frac{\Gamma(\theta_1 + \theta_2 + \theta_3)}{\Gamma(\theta_1)\, \Gamma(\theta_2)\, \Gamma(\theta_3)} \prod_{k=1}^{2} \left( \frac{x_k - a_k}{b_k - a_k} \right)^{\theta_k - 1} \left( 1 - \frac{x_k - a_k}{b_k - a_k} \right)^{\theta_3 - 1}$$

where $0 < x_1, x_2$, $x_1 + x_2 < 1$ and $\theta_1, \theta_2, \theta_3, a_1, b_1, a_2, b_2$ are parameters. We will denote a bivariate generalized beta random variable $(X, Y)$ with positive parameters $\theta_1$, $\theta_2$ and $\theta_3$ by writing $(X, Y) \sim GBeta(\theta_1, \theta_2, \theta_3, a_1, b_1, a_2, b_2)$.

It can be shown that if $X_k = (b_k - a_k)\, Y_k + a_k$ (for $k = 1, 2$) and $(Y_1, Y_2) \sim Beta(\theta_1, \theta_2, \theta_3)$, then $(X_1, X_2) \sim GBeta(\theta_1, \theta_2, \theta_3, a_1, b_1, a_2, b_2)$. Therefore, by Theorem 12.11, we have the following theorem.

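Theorem 12.13. Let $(X, Y) \sim GBeta(\theta_1, \theta_2, \theta_3, a_1, b_1, a_2, b_2)$, where $\theta_1$, $\theta_2$ and $\theta_3$ are positive a priori chosen parameters. Then $\frac{X - a_1}{b_1 - a_1} \sim Beta(\theta_1, \theta_2 + \theta_3)$ and $\frac{Y - a_2}{b_2 - a_2} \sim Beta(\theta_2, \theta_1 + \theta_3)$ and

$$\begin{aligned}
E(X) &= (b_1 - a_1)\, \frac{\theta_1}{\theta} + a_1, & Var(X) &= (b_1 - a_1)^2\, \frac{\theta_1\, (\theta - \theta_1)}{\theta^2\, (\theta + 1)} \\
E(Y) &= (b_2 - a_2)\, \frac{\theta_2}{\theta} + a_2, & Var(Y) &= (b_2 - a_2)^2\, \frac{\theta_2\, (\theta - \theta_2)}{\theta^2\, (\theta + 1)} \\
Cov(X, Y) &= -(b_1 - a_1)(b_2 - a_2)\, \frac{\theta_1\, \theta_2}{\theta^2\, (\theta + 1)}
\end{aligned}$$

where $\theta = \theta_1 + \theta_2 + \theta_3$.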

Another generalization of the bivariate beta distribution is the following:

Deﬁnition 12.7. A continuous bivariate random variable (X1 , X2 ) is said to

have the generalized bivariate beta distribution if its joint probability density

function is of the form

f(x1 , x2 ) = 1

B(↵1 ,1 ) B(↵2 ,2 ) x ↵ 1 1 (1  x 1 )  1 ↵2 2 x ↵ 2 1

2(1 x 1 x 2 )  2 1

Some Special Continuous Bivariate Distributions 342

where $0 < x_1, x_2$, $x_1 + x_2 < 1$ and $\alpha_1, \alpha_2, \beta_1, \beta_2$ are parameters.

It is not difficult to see that $X \sim Beta(\alpha_1, \beta_1)$ and $Y \sim Beta(\alpha_2, \beta_2)$.

12.5. Bivariate Normal Distribution

The bivariate normal distribution is a generalization of the univariate

normal distribution. The ﬁrst statistical treatment of the bivariate normal

distribution was given by Galton and Dickson in 1886. Although there are

several other bivariate distributions as discussed above, the bivariate normal

distribution still plays a dominant role. The development of normal theory has been intensive, and most attention has centered on the bivariate normal distribution because of its relatively simple mathematical treatment. In this section, we give an in-depth treatment of the bivariate normal distribution.

Deﬁnition 12.8. A continuous bivariate random variable (X, Y ) is said to

have the bivariate normal distribution if its joint probability density function

is of the form

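\[
f(x, y) = \frac{1}{2\pi\,\sigma_1\,\sigma_2\,\sqrt{1 - \rho^2}}\; e^{-\frac{1}{2} Q(x, y)}, \qquad -\infty < x, y < \infty,
\]

where $\mu_1, \mu_2 \in \mathbb{R}$, $\sigma_1, \sigma_2 \in (0, \infty)$ and $\rho \in (-1, 1)$ are parameters, and

\[
Q(x, y) := \frac{1}{1 - \rho^2} \left[ \left(\frac{x - \mu_1}{\sigma_1}\right)^2 - 2\rho \left(\frac{x - \mu_1}{\sigma_1}\right) \left(\frac{y - \mu_2}{\sigma_2}\right) + \left(\frac{y - \mu_2}{\sigma_2}\right)^2 \right].
\]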

As usual, we denote this bivariate normal random variable by writing $(X, Y) \sim N(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$. The graph of $f(x, y)$ has the shape of a "mountain". The pair $(\mu_1, \mu_2)$ tells us where the center of the mountain is located in the $(x, y)$-plane, while $\sigma_1^2$ and $\sigma_2^2$ measure the spread of this mountain in the $x$-direction and $y$-direction, respectively. The parameter $\rho$ determines the shape and orientation of the mountain on the $(x, y)$-plane. The following figures show the graphs of the bivariate normal distributions with different values of the correlation coefficient $\rho$. The first two figures illustrate the graph of the bivariate normal distribution with $\rho = 0$, $\mu_1 = \mu_2 = 0$, and $\sigma_1 = \sigma_2 = 1$ and the equi-density plots. The next two figures illustrate the graph of the bivariate normal distribution with $\rho = 0.5$, $\mu_1 = \mu_2 = 0$, and $\sigma_1 = \sigma_2 = 0.5$ and the equi-density plots. The last two figures illustrate the graph of the bivariate normal distribution with $\rho = -0.5$, $\mu_1 = \mu_2 = 0$, and $\sigma_1 = \sigma_2 = 0.5$ and the equi-density plots.


One of the remarkable features of the bivariate normal distribution is that if we vertically slice the graph of $f(x, y)$ along any direction, we obtain a univariate normal distribution. In particular, if we vertically slice the graph of $f(x, y)$ along the $x$-axis, we obtain a univariate normal distribution. That is, the marginal of $f(x, y)$ is again normal. One can show that the marginals of $f(x, y)$ are given by

\[
f_1(x) = \frac{1}{\sigma_1 \sqrt{2\pi}}\; e^{-\frac{1}{2} \left(\frac{x - \mu_1}{\sigma_1}\right)^2}
\]

and

\[
f_2(y) = \frac{1}{\sigma_2 \sqrt{2\pi}}\; e^{-\frac{1}{2} \left(\frac{y - \mu_2}{\sigma_2}\right)^2}.
\]

In view of these, the following theorem is obvious.


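Theorem 12.14. If $(X, Y) \sim N(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$, then

\[
\begin{aligned}
E(X) &= \mu_1 \\
E(Y) &= \mu_2 \\
Var(X) &= \sigma_1^2 \\
Var(Y) &= \sigma_2^2 \\
Corr(X, Y) &= \rho \\
M(s, t) &= e^{\mu_1 s + \mu_2 t + \frac{1}{2}\left(\sigma_1^2 s^2 + 2\rho\,\sigma_1 \sigma_2\, s t + \sigma_2^2 t^2\right)}.
\end{aligned}
\]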

Proof: It is easy to establish the formulae for $E(X)$, $E(Y)$, $Var(X)$ and $Var(Y)$. Here we only establish the moment generating function. Since $(X, Y) \sim N(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$, we have $X \sim N(\mu_1, \sigma_1^2)$ and $Y \sim N(\mu_2, \sigma_2^2)$. Further, for any $s$ and $t$, the random variable $W = sX + tY$ is again normal with

\[
\mu_W = s\mu_1 + t\mu_2 \quad\text{and}\quad \sigma_W^2 = s^2 \sigma_1^2 + 2 s t \rho\, \sigma_1 \sigma_2 + t^2 \sigma_2^2.
\]

Since $W$ is a normal random variable, its moment generating function is given by

\[
M(\tau) = e^{\mu_W \tau + \frac{1}{2} \tau^2 \sigma_W^2}.
\]

The joint moment generating function of $(X, Y)$ is

\[
\begin{aligned}
M(s, t) &= E\left(e^{sX + tY}\right) \\
&= e^{\mu_W + \frac{1}{2} \sigma_W^2} \\
&= e^{\mu_1 s + \mu_2 t + \frac{1}{2}\left(\sigma_1^2 s^2 + 2\rho\,\sigma_1 \sigma_2\, s t + \sigma_2^2 t^2\right)}.
\end{aligned}
\]

This completes the proof of the theorem.
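The closed form in Theorem 12.14 can be checked numerically. The following is a minimal sketch, not part of the original text: the parameter values, the sample size, the seed and all function names are illustrative assumptions. It builds $(X, Y)$ from two independent standard normals and compares a Monte Carlo estimate of $E(e^{sX+tY})$ with the formula.

```python
import math
import random


def bvn_mgf(s, t, mu1, mu2, sigma1, sigma2, rho):
    """Closed-form joint MGF of the bivariate normal (Theorem 12.14)."""
    quad = sigma1**2 * s**2 + 2 * rho * sigma1 * sigma2 * s * t + sigma2**2 * t**2
    return math.exp(mu1 * s + mu2 * t + 0.5 * quad)


def sample_bvn(rng, mu1, mu2, sigma1, sigma2, rho):
    """Draw one (X, Y) pair by mixing two independent standard normals."""
    z1 = rng.gauss(0.0, 1.0)
    z2 = rng.gauss(0.0, 1.0)
    x = mu1 + sigma1 * z1
    y = mu2 + sigma2 * (rho * z1 + math.sqrt(1.0 - rho**2) * z2)
    return x, y


def mc_mgf(s, t, params, n=100_000, seed=12345):
    """Monte Carlo estimate of E[exp(sX + tY)]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x, y = sample_bvn(rng, *params)
        total += math.exp(s * x + t * y)
    return total / n


# Illustrative parameters (assumed): mu1, mu2, sigma1, sigma2, rho
params = (1.0, -1.0, 1.0, 2.0, 0.5)
exact = bvn_mgf(0.3, 0.2, *params)
estimate = mc_mgf(0.3, 0.2, params)
print(exact, estimate)
```

For these values $\mu_W = 0.1$ and $\sigma_W^2 = 0.37$, so the exact value is $e^{0.285}$; the simulated value should agree up to sampling error.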

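It can be shown that the conditional density of $Y$ given $X = x$ is

\[
f(y/x) = \frac{1}{\sigma_2 \sqrt{2\pi (1 - \rho^2)}}\; e^{-\frac{1}{2} \left(\frac{y - b}{\sigma_2 \sqrt{1 - \rho^2}}\right)^2}
\]

where

\[
b = \mu_2 + \rho\,\frac{\sigma_2}{\sigma_1}\,(x - \mu_1).
\]

Similarly, the conditional density $f(x/y)$ is

\[
f(x/y) = \frac{1}{\sigma_1 \sqrt{2\pi (1 - \rho^2)}}\; e^{-\frac{1}{2} \left(\frac{x - c}{\sigma_1 \sqrt{1 - \rho^2}}\right)^2}
\]

where

\[
c = \mu_1 + \rho\,\frac{\sigma_1}{\sigma_2}\,(y - \mu_2).
\]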

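In view of the form of $f(y/x)$ and $f(x/y)$, the following theorem is transparent.

Theorem 12.15. If $(X, Y) \sim N(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$, then

\[
\begin{aligned}
E(Y/x) &= \mu_2 + \rho\,\frac{\sigma_2}{\sigma_1}\,(x - \mu_1) \\
E(X/y) &= \mu_1 + \rho\,\frac{\sigma_1}{\sigma_2}\,(y - \mu_2) \\
Var(Y/x) &= \sigma_2^2\,(1 - \rho^2) \\
Var(X/y) &= \sigma_1^2\,(1 - \rho^2).
\end{aligned}
\]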
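As a small illustration of Theorem 12.15, the sketch below evaluates the conditional mean and variance of $Y$ given $X = x$. The function name and the numerical values are assumptions chosen for the example and do not come from the text.

```python
def conditional_y_given_x(x, mu1, mu2, sigma1, sigma2, rho):
    """Return (E(Y/x), Var(Y/x)) for a bivariate normal pair (Theorem 12.15)."""
    mean = mu2 + rho * (sigma2 / sigma1) * (x - mu1)
    var = sigma2**2 * (1.0 - rho**2)
    return mean, var


# Illustrative values (assumed): mu1 = 0, mu2 = 1, sigma1 = 2, sigma2 = 3, rho = 0.5
mean, var = conditional_y_given_x(2.0, mu1=0.0, mu2=1.0, sigma1=2.0, sigma2=3.0, rho=0.5)
print(mean, var)
```

Note that the conditional mean is linear in $x$, while the conditional variance does not depend on $x$ at all.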

We have seen that if (X, Y ) has a bivariate normal distribution, then the

distributions of X and Y are also normal. However, the converse of this is

not true. That is if X and Y have normal distributions as their marginals,

then their joint distribution is not necessarily bivariate normal.

Now we present some characterization theorems concerning the bivariate

normal distribution. The ﬁrst theorem is due to Cramer (1941).

Theorem 12.16. The random variables $X$ and $Y$ have a joint bivariate normal distribution if and only if every linear combination of $X$ and $Y$ has a univariate normal distribution.

Theorem 12.17. The random variables $X$ and $Y$ with unit variances and correlation coefficient $\rho$ have a joint bivariate normal distribution if and only if

\[
\frac{\partial}{\partial \rho}\, E[g(X, Y)] = E\left[\frac{\partial^2}{\partial X\, \partial Y}\, g(X, Y)\right]
\]

holds for an arbitrary function $g(x, y)$ of two variables.

Many interesting characterizations of bivariate normal distribution can

be found in the survey paper of Hamedani (1992).

12.6. Bivariate Logistic Distributions

In this section, we study two bivariate logistic distributions. A univariate

logistic distribution is often considered as an alternative to the univariate

normal distribution. The univariate logistic distribution has a shape very

close to that of a univariate normal distribution but has heavier tails than


the normal. This distribution is also used as an alternative to the univariate

Weibull distribution in life-testing. The univariate logistic distribution has

the following probability density function

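\[
f(x) = \frac{\pi}{\sigma \sqrt{3}}\; \frac{e^{-\frac{\pi}{\sqrt{3}} \left(\frac{x - \mu}{\sigma}\right)}}{\left[1 + e^{-\frac{\pi}{\sqrt{3}} \left(\frac{x - \mu}{\sigma}\right)}\right]^2}, \qquad -\infty < x < \infty,
\]

where $-\infty < \mu < \infty$ and $\sigma > 0$ are parameters. The parameter $\mu$ is the mean and the parameter $\sigma$ is the standard deviation of the distribution. A random variable $X$ with the above logistic distribution will be denoted by $X \sim LOG(\mu, \sigma)$. It is well known that the moment generating function of the univariate logistic distribution is given by

\[
M(t) = e^{\mu t}\; \Gamma\!\left(1 + \frac{\sqrt{3}}{\pi}\,\sigma t\right) \Gamma\!\left(1 - \frac{\sqrt{3}}{\pi}\,\sigma t\right)
\]

for $|t| < \frac{\pi}{\sigma \sqrt{3}}$. We give a brief proof of the above result for $\mu = 0$ and $\sigma = \frac{\pi}{\sqrt{3}}$. With these assumptions, the logistic density function reduces to

\[
f(x) = \frac{e^{-x}}{(1 + e^{-x})^2}.
\]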

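The moment generating function with respect to this density function is

\[
\begin{aligned}
M(t) &= \int_{-\infty}^{\infty} e^{tx} f(x)\, dx \\
&= \int_{-\infty}^{\infty} e^{tx}\, \frac{e^{-x}}{(1 + e^{-x})^2}\, dx \\
&= \int_{-\infty}^{\infty} \left(e^{-x}\right)^{-t} \frac{e^{-x}}{(1 + e^{-x})^2}\, dx \\
&= \int_0^1 \left(z^{-1} - 1\right)^{-t} dz \qquad \text{where } z = \frac{1}{1 + e^{-x}} \\
&= \int_0^1 z^{t}\,(1 - z)^{-t}\, dz \\
&= B(1 + t,\, 1 - t) \\
&= \frac{\Gamma(1 + t)\, \Gamma(1 - t)}{\Gamma(1 + t + 1 - t)} \\
&= \frac{\Gamma(1 + t)\, \Gamma(1 - t)}{\Gamma(2)} \\
&= \Gamma(1 + t)\, \Gamma(1 - t) \\
&= \pi t\, \operatorname{cosec}(\pi t).
\end{aligned}
\]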
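The last step of this derivation, $\Gamma(1+t)\,\Gamma(1-t) = \pi t \operatorname{cosec}(\pi t)$, is easy to confirm numerically. The sketch below is an illustration only; the function names and the test points are assumptions, and it relies on nothing beyond `math.gamma` from the Python standard library.

```python
import math


def logistic_mgf(t):
    """MGF of the density e^{-x} / (1 + e^{-x})^2, which exists for |t| < 1."""
    if not -1.0 < t < 1.0:
        raise ValueError("the MGF exists only for |t| < 1")
    return math.gamma(1.0 + t) * math.gamma(1.0 - t)


def closed_form(t):
    """pi*t / sin(pi*t), with its limiting value 1 at t = 0."""
    if t == 0.0:
        return 1.0
    return math.pi * t / math.sin(math.pi * t)


points = (-0.7, -0.2, 0.0, 0.3, 0.9)  # illustrative test points
for t in points:
    print(t, logistic_mgf(t), closed_form(t))
```

The two columns agree to floating-point accuracy, and both blow up as $|t| \to 1$, which is why the moment generating function is restricted to $|t| < 1$ in this standardized case.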


Recall that the marginals and conditionals of the bivariate normal distribution are univariate normal. This beautiful property enjoyed by the bivariate normal distribution is apparently lacking from the other bivariate distributions we have discussed so far. If we cannot define a bivariate logistic distribution so that the conditionals and marginals are univariate logistic, then we would like to have at least one of the marginal distributions logistic and the conditional distribution of the other variable logistic. The following bivariate logistic distribution is due to Gumbel (1961).

Deﬁnition 12.9. A continuous bivariate random variable (X, Y ) is said to

have the bivariate logistic distribution of ﬁrst kind if its joint probability

density function is of the form

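\[
f(x, y) = \frac{2\pi^2\; e^{-\frac{\pi}{\sqrt{3}} \left(\frac{x - \mu_1}{\sigma_1} + \frac{y - \mu_2}{\sigma_2}\right)}}{3\,\sigma_1 \sigma_2 \left[1 + e^{-\frac{\pi}{\sqrt{3}} \left(\frac{x - \mu_1}{\sigma_1}\right)} + e^{-\frac{\pi}{\sqrt{3}} \left(\frac{y - \mu_2}{\sigma_2}\right)}\right]^3}, \qquad -\infty < x, y < \infty,
\]

where $-\infty < \mu_1, \mu_2 < \infty$ and $0 < \sigma_1, \sigma_2 < \infty$ are parameters. If a random variable $(X, Y)$ has a bivariate logistic distribution of first kind, then we express this by writing $(X, Y) \sim LOGF(\mu_1, \mu_2, \sigma_1, \sigma_2)$. The following figures show the graph of $f(x, y)$ with $\mu_1 = 0 = \mu_2$ and $\sigma_1 = 1 = \sigma_2$ and the equi-density plots.

It can be shown that marginally, $X$ is a logistic random variable. That is, $X \sim LOG(\mu_1, \sigma_1)$. Similarly, $Y \sim LOG(\mu_2, \sigma_2)$. These facts lead us to the following theorem.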

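Theorem 12.18. If the random variable $(X, Y) \sim LOGF(\mu_1, \mu_2, \sigma_1, \sigma_2)$, then

\[
\begin{aligned}
E(X) &= \mu_1 \\
E(Y) &= \mu_2 \\
Var(X) &= \sigma_1^2 \\
Var(Y) &= \sigma_2^2 \\
E(XY) &= \tfrac{1}{2}\,\sigma_1 \sigma_2 + \mu_1 \mu_2,
\end{aligned}
\]

and the moment generating function is given by

\[
M(s, t) = e^{\mu_1 s + \mu_2 t}\; \Gamma\!\left(1 + \frac{(\sigma_1 s + \sigma_2 t)\sqrt{3}}{\pi}\right) \Gamma\!\left(1 - \frac{\sigma_1 s \sqrt{3}}{\pi}\right) \Gamma\!\left(1 - \frac{\sigma_2 t \sqrt{3}}{\pi}\right)
\]

for $|s| < \frac{\pi}{\sigma_1 \sqrt{3}}$ and $|t| < \frac{\pi}{\sigma_2 \sqrt{3}}$.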

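It is an easy exercise to see that if the random variables $X$ and $Y$ have a joint bivariate logistic distribution of first kind, then the correlation between $X$ and $Y$ is $\frac{1}{2}$. This can be considered as one of the drawbacks of this distribution in the sense that it limits the dependence between the random variables $X$ and $Y$.

The conditional density of $Y$ given $X = x$ is

\[
f(y/x) = \frac{2\pi}{\sigma_2 \sqrt{3}}\; \frac{e^{-\frac{\pi}{\sqrt{3}} \left(\frac{y - \mu_2}{\sigma_2}\right)} \left[1 + e^{-\frac{\pi}{\sqrt{3}} \left(\frac{x - \mu_1}{\sigma_1}\right)}\right]^2}{\left[1 + e^{-\frac{\pi}{\sqrt{3}} \left(\frac{x - \mu_1}{\sigma_1}\right)} + e^{-\frac{\pi}{\sqrt{3}} \left(\frac{y - \mu_2}{\sigma_2}\right)}\right]^3}.
\]

Similarly, the conditional density of $X$ given $Y = y$ is

\[
f(x/y) = \frac{2\pi}{\sigma_1 \sqrt{3}}\; \frac{e^{-\frac{\pi}{\sqrt{3}} \left(\frac{x - \mu_1}{\sigma_1}\right)} \left[1 + e^{-\frac{\pi}{\sqrt{3}} \left(\frac{y - \mu_2}{\sigma_2}\right)}\right]^2}{\left[1 + e^{-\frac{\pi}{\sqrt{3}} \left(\frac{x - \mu_1}{\sigma_1}\right)} + e^{-\frac{\pi}{\sqrt{3}} \left(\frac{y - \mu_2}{\sigma_2}\right)}\right]^3}.
\]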

Using these densities, the next theorem offers various conditional properties of the bivariate logistic distribution.

Theorem 12.19. If the random variable $(X, Y) \sim LOGF(\mu_1, \mu_2, \sigma_1, \sigma_2)$, then

\[
\begin{aligned}
E(Y/x) &= 1 - \ln\left(1 + e^{-\frac{\pi}{\sqrt{3}} \left(\frac{x - \mu_1}{\sigma_1}\right)}\right) \\
E(X/y) &= 1 - \ln\left(1 + e^{-\frac{\pi}{\sqrt{3}} \left(\frac{y - \mu_2}{\sigma_2}\right)}\right) \\
Var(Y/x) &= \frac{\pi^3}{3} - 1 \\
Var(X/y) &= \frac{\pi^3}{3} - 1.
\end{aligned}
\]

It was pointed out earlier that one of the drawbacks of this bivariate

logistic distribution of ﬁrst kind is that it limits the dependence of the ran-

dom variables. The following bivariate logistic distribution was suggested to

rectify this drawback.

Deﬁnition 12.10. A continuous bivariate random variable (X, Y ) is said to

have the bivariate logistic distribution of second kind if its joint probability

density function is of the form

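\[
f(x, y) = \frac{[\phi_\alpha(x, y)]^{1 - 2\alpha}}{[1 + \phi_\alpha(x, y)]^2} \left(\frac{\phi_\alpha(x, y) - 1}{\phi_\alpha(x, y) + 1} + \alpha\right) e^{\alpha (x + y)}, \qquad -\infty < x, y < \infty,
\]

where $\alpha > 0$ is a parameter, and $\phi_\alpha(x, y) := \left(e^{\alpha x} + e^{\alpha y}\right)^{\frac{1}{\alpha}}$. As before, we denote a bivariate logistic random variable of second kind $(X, Y)$ by writing $(X, Y) \sim LOGS(\alpha)$.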

The marginal densities of $X$ and $Y$ are again logistic and they are given by

\[
f_1(x) = \frac{e^{-x}}{(1 + e^{-x})^2}, \qquad -\infty < x < \infty
\]

and

\[
f_2(y) = \frac{e^{-y}}{(1 + e^{-y})^2}, \qquad -\infty < y < \infty.
\]

It was shown by Oliveira (1961) that if $(X, Y) \sim LOGS(\alpha)$, then the correlation between $X$ and $Y$ is

\[
\rho(X, Y) = 1 - \frac{1}{2\alpha^2}.
\]


12.7. Review Exercises

1. If $(X, Y) \sim N(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$ with $Q(x, y) = x^2 + 2y^2 - 2xy + 2x - 2y + 1$, then what is the value of the conditional variance of $Y$ given the event $X = x$?

2. If $(X, Y) \sim N(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$ with

\[
Q(x, y) = \frac{1}{102} \left[(x + 3)^2 - 16(x + 3)(y - 2) + 4(y - 2)^2\right],
\]

then what is the value of the conditional expectation of $Y$ given $X = x$?

3. If $(X, Y) \sim N(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$, then what is the correlation coefficient of the random variables $U$ and $V$, where $U = 2X + 3Y$ and $V = 2X - 3Y$?

4. Let the random variables $X$ and $Y$ denote the height and weight of wild turkeys. If the random variables $X$ and $Y$ have a bivariate normal distribution with $\mu_1 = 18$ inches, $\mu_2 = 15$ pounds, $\sigma_1 = 3$ inches, $\sigma_2 = 2$ pounds, and $\rho = 0.75$, then what is the expected weight of one of these wild turkeys that is 17 inches tall?

5. If $(X, Y) \sim N(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$, then what is the moment generating function of the random variables $U$ and $V$, where $U = 7X + 3Y$ and $V = 7X - 3Y$?

6. Let $(X, Y)$ have a bivariate normal distribution. The mean of $X$ is 10 and the variance of $X$ is 12. The mean of $Y$ is $-5$ and the variance of $Y$ is 5. If the covariance of $X$ and $Y$ is 4, then what is the probability that $X + Y$ is greater than 10?

7. Let $X$ and $Y$ have a bivariate normal distribution with means $\mu_X = 5$ and $\mu_Y = 6$, standard deviations $\sigma_X = 3$ and $\sigma_Y = 2$, and covariance $\sigma_{XY} = 2$. Let $\Phi$ denote the cumulative distribution function of a normal random variable with mean 0 and variance 1. What is $P(2 \le X - Y \le 5)$ in terms of $\Phi$?

8. If $(X, Y) \sim N(\mu_1, \mu_2, \sigma_1, \sigma_2, \rho)$ with $Q(x, y) = x^2 + xy - 2y^2$, then what is the conditional distribution of $X$ given the event $Y = y$?

9. If $(X, Y) \sim GAMK(\alpha, \theta)$, where $0 < \alpha < \infty$ and $0 \le \theta < 1$ are parameters, then show that the moment generating function is given by

\[
M(s, t) = \left(\frac{1}{(1 - s)(1 - t) - \theta\, s\, t}\right)^{\alpha}.
\]

10. Let X and Y have a bivariate gamma distribution of Kibble with pa-

rameters ↵ = 1 and 0 ✓ < 0. What is the probability that the random

variable 7X is less than 1

11. If $(X, Y) \sim GAMC(\alpha, \beta, \gamma)$, then what are the regression and scedastic curves of $Y$ on $X$?

12. The position of a random point (X, Y ) is equally probable anywhere on

a circle of radius R and whose center is at the origin. What is the probability

density function of each of the random variables X and Y ? Are the random

variables X and Y independent?

13. If $(X, Y) \sim GAMC(\alpha, \beta, \gamma)$, what is the correlation coefficient of the random variables $X$ and $Y$?

14. Let $X$ and $Y$ have a bivariate exponential distribution of Gumbel with parameter $\theta > 0$. What is the regression curve of $Y$ on $X$?

15. A screen of a navigational radar station represents a circle of radius 12

inches. As a result of noise, a spot may appear with its center at any point

of the circle. Find the expected value and variance of the distance between

the center of the spot and the center of the circle.

16. Let X and Y have a bivariate normal distribution. Which of the following

statements must be true?

(I) Any nonzero linear combination of X and Y has a normal distribution.

(II) E (Y /X = x ) is a linear function of x.

(III) $Var(Y/X = x) \le Var(Y)$.

17. If $(X, Y) \sim LOGS(\alpha)$, then what is the correlation between $X$ and $Y$?

18. If $(X, Y) \sim LOGF(\mu_1, \mu_2, \sigma_1, \sigma_2)$, then what is the correlation between the random variables $X$ and $Y$?

19. If $(X, Y) \sim LOGF(\mu_1, \mu_2, \sigma_1, \sigma_2)$, then show that marginally $X$ and $Y$ are univariate logistic.

20. If $(X, Y) \sim LOGF(\mu_1, \mu_2, \sigma_1, \sigma_2)$, then what is the scedastic curve of the random variable $Y$ on $X$?


Chapter 13

SEQUENCES

OF

RANDOM VARIABLES

AND

ORDER STATISTICS

In this chapter, we generalize some of the results we have studied in the

previous chapters. We do these generalizations because the generalizations

are needed in the subsequent chapters relating to mathematical statistics. In

this chapter, we also examine the weak law of large numbers, Bernoulli's law

of large numbers, the strong law of large numbers, and the central limit the-

orem. Further, in this chapter, we treat the order statistics and percentiles.

13.1. Distribution of sample mean and variance

Consider a random experiment. Let X be the random variable associ-

ated with this experiment. Let f (x ) be the probability density function of X.

Let us repeat this experiment $n$ times. Let $X_k$ be the random variable associated with the $k$th repetition. Then the collection of the random variables $\{X_1, X_2, ..., X_n\}$ is a random sample of size $n$. Hereafter, we simply denote $X_1, X_2, ..., X_n$ as a random sample of size $n$. The random variables $X_1, X_2, ..., X_n$ are independent and identically distributed with the common probability density function $f(x)$.

For a random sample, functions such as the sample mean $\overline{X}$ and the sample variance $S^2$ are called statistics. In a particular sample, say $x_1, x_2, ..., x_n$, we observe $\overline{x}$ and $s^2$. We may consider

\[
\overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i
\]

and

\[
S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2
\]

as random variables, and $\overline{x}$ and $s^2$ are the realizations from a particular sample.

In this section, we are mainly interested in ﬁnding the probability distri-

butions of the sample mean X and sample variance S2 , that is the distribution

of the statistics of samples.

Example 13.1. Let X1 and X2 be a random sample of size 2 from a distri-

bution with probability density function

\[
f(x) = \begin{cases} 6x(1 - x) & \text{if } 0 < x < 1 \\ 0 & \text{otherwise.} \end{cases}
\]

What are the mean and variance of the sample sum $Y = X_1 + X_2$?

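Answer: The population mean

\[
\begin{aligned}
\mu_X &= E(X) \\
&= \int_0^1 x\, 6x(1 - x)\, dx \\
&= 6 \int_0^1 x^2 (1 - x)\, dx \\
&= 6\, B(3, 2) \qquad \text{(here $B$ denotes the beta function)} \\
&= 6\, \frac{\Gamma(3)\, \Gamma(2)}{\Gamma(5)} \\
&= 6 \left(\frac{1}{12}\right) \\
&= \frac{1}{2}.
\end{aligned}
\]

Since $X_1$ and $X_2$ have the same distribution, we obtain $\mu_{X_1} = \frac{1}{2} = \mu_{X_2}$. Hence the mean of $Y$ is given by

\[
\begin{aligned}
E(Y) &= E(X_1 + X_2) \\
&= E(X_1) + E(X_2) \\
&= \frac{1}{2} + \frac{1}{2} \\
&= 1.
\end{aligned}
\]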

Next, we compute the variance of the population $X$. The variance of $X$ is given by

\[
\begin{aligned}
Var(X) &= E\left(X^2\right) - E(X)^2 \\
&= \int_0^1 6x^3 (1 - x)\, dx - \left(\frac{1}{2}\right)^2 \\
&= 6 \int_0^1 x^3 (1 - x)\, dx - \frac{1}{4} \\
&= 6\, B(4, 2) - \frac{1}{4} \\
&= 6\, \frac{\Gamma(4)\, \Gamma(2)}{\Gamma(6)} - \frac{1}{4} \\
&= 6 \left(\frac{1}{20}\right) - \frac{1}{4} \\
&= \frac{6}{20} - \frac{5}{20} \\
&= \frac{1}{20}.
\end{aligned}
\]

Since $X_1$ and $X_2$ have the same distribution as the population $X$, we get

\[
Var(X_1) = \frac{1}{20} = Var(X_2).
\]

Hence, the variance of the sample sum $Y$ is given by

\[
\begin{aligned}
Var(Y) &= Var(X_1 + X_2) \\
&= Var(X_1) + Var(X_2) + 2\, Cov(X_1, X_2) \\
&= Var(X_1) + Var(X_2) \\
&= \frac{1}{20} + \frac{1}{20} \\
&= \frac{1}{10}.
\end{aligned}
\]
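The arithmetic of Example 13.1 can be reproduced exactly with rational numbers. The sketch below is illustrative only (the helper name `raw_moment` is an assumption); it uses the fact that, for the density $6x(1-x)$ on $(0,1)$, $E(X^k) = 6\int_0^1 (x^{k+1} - x^{k+2})\,dx = 6\left(\frac{1}{k+2} - \frac{1}{k+3}\right)$.

```python
from fractions import Fraction


def raw_moment(k):
    """E[X^k] for the density f(x) = 6x(1 - x) on (0, 1)."""
    return 6 * (Fraction(1, k + 2) - Fraction(1, k + 3))


mean_x = raw_moment(1)                 # population mean
var_x = raw_moment(2) - mean_x**2      # population variance

# X1 and X2 are independent copies of X, so means and variances simply add.
mean_y = 2 * mean_x
var_y = 2 * var_x
print(mean_x, var_x, mean_y, var_y)
```

Working with `Fraction` avoids rounding, so the values can be compared directly with the fractions obtained in the example.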

Example 13.2. Let X1 and X2 be a random sample of size 2 from a distri-

bution with density

\[
f(x) = \begin{cases} \frac{1}{4} & \text{for } x = 1, 2, 3, 4 \\ 0 & \text{otherwise.} \end{cases}
\]

What is the distribution of the sample sum Y = X1 + X2 ?


Answer: Since the range space of X1 as well as X2 is {1,2,3,4} , the range

space of Y = X1 + X2 is

RY = {2 , 3 , 4 , 5 , 6 , 7 , 8} .

Let g (y ) be the density function of Y . We want to ﬁnd this density function.

First, we ﬁnd g (2), g (3) and so on.

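\[
\begin{aligned}
g(2) &= P(Y = 2) \\
&= P(X_1 + X_2 = 2) \\
&= P(X_1 = 1 \text{ and } X_2 = 1) \\
&= P(X_1 = 1)\, P(X_2 = 1) \qquad \text{(by independence of $X_1$ and $X_2$)} \\
&= f(1)\, f(1) \\
&= \left(\frac{1}{4}\right) \left(\frac{1}{4}\right) = \frac{1}{16}.
\end{aligned}
\]

\[
\begin{aligned}
g(3) &= P(Y = 3) \\
&= P(X_1 + X_2 = 3) \\
&= P(X_1 = 1 \text{ and } X_2 = 2) + P(X_1 = 2 \text{ and } X_2 = 1) \\
&= P(X_1 = 1)\, P(X_2 = 2) + P(X_1 = 2)\, P(X_2 = 1) \qquad \text{(by independence of $X_1$ and $X_2$)} \\
&= f(1)\, f(2) + f(2)\, f(1) \\
&= \left(\frac{1}{4}\right) \left(\frac{1}{4}\right) + \left(\frac{1}{4}\right) \left(\frac{1}{4}\right) = \frac{2}{16}.
\end{aligned}
\]

\[
\begin{aligned}
g(4) &= P(Y = 4) \\
&= P(X_1 + X_2 = 4) \\
&= P(X_1 = 1 \text{ and } X_2 = 3) + P(X_1 = 3 \text{ and } X_2 = 1) + P(X_1 = 2 \text{ and } X_2 = 2) \\
&= P(X_1 = 1)\, P(X_2 = 3) + P(X_1 = 3)\, P(X_2 = 1) + P(X_1 = 2)\, P(X_2 = 2) \qquad \text{(by independence of $X_1$ and $X_2$)} \\
&= f(1)\, f(3) + f(3)\, f(1) + f(2)\, f(2) \\
&= \left(\frac{1}{4}\right) \left(\frac{1}{4}\right) + \left(\frac{1}{4}\right) \left(\frac{1}{4}\right) + \left(\frac{1}{4}\right) \left(\frac{1}{4}\right) \\
&= \frac{3}{16}.
\end{aligned}
\]

Similarly, we get

\[
g(5) = \frac{4}{16}, \qquad g(6) = \frac{3}{16}, \qquad g(7) = \frac{2}{16}, \qquad g(8) = \frac{1}{16}.
\]

Thus, putting these into one expression, we get

\[
\begin{aligned}
g(y) &= P(Y = y) \\
&= \sum_{k=1}^{y-1} f(k)\, f(y - k) \\
&= \frac{4 - |y - 5|}{16}, \qquad y = 2, 3, 4, ..., 8.
\end{aligned}
\]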

Remark 13.1. Note that $g(y) = \sum_{k=1}^{y-1} f(k)\, f(y - k)$ is the discrete convolution of $f$ with itself. The concept of convolution was introduced in chapter 10.

The above example can also be done using the moment generating function method as follows:

\[
\begin{aligned}
M_Y(t) &= M_{X_1 + X_2}(t) \\
&= M_{X_1}(t)\, M_{X_2}(t) \\
&= \left(\frac{e^{t} + e^{2t} + e^{3t} + e^{4t}}{4}\right) \left(\frac{e^{t} + e^{2t} + e^{3t} + e^{4t}}{4}\right) \\
&= \left(\frac{e^{t} + e^{2t} + e^{3t} + e^{4t}}{4}\right)^2 \\
&= \frac{e^{2t} + 2e^{3t} + 3e^{4t} + 4e^{5t} + 3e^{6t} + 2e^{7t} + e^{8t}}{16}.
\end{aligned}
\]

Hence, the density of $Y$ is given by

\[
g(y) = \frac{4 - |y - 5|}{16}, \qquad y = 2, 3, 4, ..., 8.
\]
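The discrete convolution used in Example 13.2 is short to compute directly. The following sketch (the function name `convolve` and the dictionary representation of a probability mass function are assumptions made for illustration) builds the distribution of $Y = X_1 + X_2$ and compares it with the closed form $(4 - |y - 5|)/16$.

```python
from fractions import Fraction


def convolve(p, q):
    """Distribution of the sum of two independent discrete random variables.

    p and q map each possible value to its probability.
    """
    out = {}
    for a, pa in p.items():
        for b, qb in q.items():
            out[a + b] = out.get(a + b, Fraction(0)) + pa * qb
    return out


# f(x) = 1/4 for x = 1, 2, 3, 4, as in Example 13.2
f = {x: Fraction(1, 4) for x in (1, 2, 3, 4)}
g = convolve(f, f)

for y in sorted(g):
    print(y, g[y], Fraction(4 - abs(y - 5), 16))
```

The same function can be applied repeatedly to obtain the distribution of a sum of more than two independent observations.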

Theorem 13.1. If $X_1, X_2, ..., X_n$ are mutually independent random variables with densities $f_1(x_1), f_2(x_2), ..., f_n(x_n)$ and $E[u_i(X_i)]$, $i = 1, 2, ..., n$ exist, then

\[
E\left[\prod_{i=1}^{n} u_i(X_i)\right] = \prod_{i=1}^{n} E[u_i(X_i)],
\]

where $u_i$ ($i = 1, 2, ..., n$) are arbitrary functions.

Proof: We prove the theorem assuming that the random variables $X_1, X_2, ..., X_n$ are continuous. If the random variables are not continuous, then the proof follows exactly in the same manner if one replaces the integrals by summations. Since

\[
\begin{aligned}
E\left(\prod_{i=1}^{n} u_i(X_i)\right) &= E\left(u_1(X_1) \cdots u_n(X_n)\right) \\
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u_1(x_1) \cdots u_n(x_n)\, f(x_1, ..., x_n)\, dx_1 \cdots dx_n \\
&= \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} u_1(x_1) \cdots u_n(x_n)\, f_1(x_1) \cdots f_n(x_n)\, dx_1 \cdots dx_n \\
&= \int_{-\infty}^{\infty} u_1(x_1)\, f_1(x_1)\, dx_1 \cdots \int_{-\infty}^{\infty} u_n(x_n)\, f_n(x_n)\, dx_n \\
&= E\left(u_1(X_1)\right) \cdots E\left(u_n(X_n)\right) \\
&= \prod_{i=1}^{n} E\left(u_i(X_i)\right),
\end{aligned}
\]

the proof of the theorem is now complete.

Example 13.3. Let X and Y be two random variables with the joint density

\[
f(x, y) = \begin{cases} e^{-(x + y)} & \text{for } 0 < x, y < \infty \\ 0 & \text{otherwise.} \end{cases}
\]

What is the expected value of the continuous random variable $Z = X^2 Y^2 + X Y^2 + X^2 + X$?

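Answer: Since

\[
\begin{aligned}
f(x, y) &= e^{-(x + y)} \\
&= e^{-x}\, e^{-y} \\
&= f_1(x)\, f_2(y),
\end{aligned}
\]

the random variables $X$ and $Y$ are mutually independent. Hence, the expected value of $X$ is

\[
\begin{aligned}
E(X) &= \int_0^{\infty} x\, f_1(x)\, dx \\
&= \int_0^{\infty} x\, e^{-x}\, dx \\
&= \Gamma(2) \\
&= 1.
\end{aligned}
\]

Similarly, the expected value of $X^2$ is given by

\[
\begin{aligned}
E\left(X^2\right) &= \int_0^{\infty} x^2 f_1(x)\, dx \\
&= \int_0^{\infty} x^2 e^{-x}\, dx \\
&= \Gamma(3) \\
&= 2.
\end{aligned}
\]

Since the marginals of $X$ and $Y$ are the same, we also get $E(Y) = 1$ and $E(Y^2) = 2$. Further, by Theorem 13.1, we get

\[
\begin{aligned}
E[Z] &= E\left[X^2 Y^2 + X Y^2 + X^2 + X\right] \\
&= E\left[\left(X^2 + X\right) \left(Y^2 + 1\right)\right] \\
&= E\left[X^2 + X\right] E\left[Y^2 + 1\right] \qquad \text{(by Theorem 13.1)} \\
&= \left(E\left[X^2\right] + E[X]\right) \left(E\left[Y^2\right] + 1\right) \\
&= (2 + 1)(2 + 1) \\
&= 9.
\end{aligned}
\]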

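Theorem 13.2. If $X_1, X_2, ..., X_n$ are mutually independent random variables with respective means $\mu_1, \mu_2, ..., \mu_n$ and variances $\sigma_1^2, \sigma_2^2, ..., \sigma_n^2$, then the mean and variance of $Y = \sum_{i=1}^{n} a_i X_i$, where $a_1, a_2, ..., a_n$ are real constants, are given by

\[
\mu_Y = \sum_{i=1}^{n} a_i \mu_i \quad\text{and}\quad \sigma_Y^2 = \sum_{i=1}^{n} a_i^2 \sigma_i^2.
\]

Proof: First we show that $\mu_Y = \sum_{i=1}^{n} a_i \mu_i$. Since

\[
\begin{aligned}
\mu_Y &= E(Y) \\
&= E\left(\sum_{i=1}^{n} a_i X_i\right) \\
&= \sum_{i=1}^{n} a_i\, E(X_i) \\
&= \sum_{i=1}^{n} a_i \mu_i,
\end{aligned}
\]

we have the asserted result. Next we show $\sigma_Y^2 = \sum_{i=1}^{n} a_i^2 \sigma_i^2$. Since $Cov(X_i, X_j) = 0$ for $i \ne j$, we have

\[
\begin{aligned}
\sigma_Y^2 &= Var(Y) \\
&= Var\left(\sum_{i=1}^{n} a_i X_i\right) \\
&= \sum_{i=1}^{n} a_i^2\, Var(X_i) \\
&= \sum_{i=1}^{n} a_i^2 \sigma_i^2.
\end{aligned}
\]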

This completes the proof of the theorem.

Example 13.4. Let the independent random variables $X_1$ and $X_2$ have means $\mu_1 = -4$ and $\mu_2 = 3$, respectively, and variances $\sigma_1^2 = 4$ and $\sigma_2^2 = 9$. What are the mean and variance of $Y = 3X_1 - 2X_2$?

Answer: The mean of $Y$ is

\[
\begin{aligned}
\mu_Y &= 3\mu_1 - 2\mu_2 \\
&= 3(-4) - 2(3) \\
&= -18.
\end{aligned}
\]

Similarly, the variance of $Y$ is

\[
\begin{aligned}
\sigma_Y^2 &= (3)^2 \sigma_1^2 + (-2)^2 \sigma_2^2 \\
&= 9\,\sigma_1^2 + 4\,\sigma_2^2 \\
&= 9(4) + 4(9) \\
&= 72.
\end{aligned}
\]
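Theorem 13.2 translates directly into a few lines of code. The sketch below is illustrative (the function name is an assumption); it computes the mean and variance of a linear combination of independent random variables and reproduces the numbers of Example 13.4.

```python
def linear_combination_moments(coeffs, means, variances):
    """Mean and variance of Y = sum(a_i * X_i) for independent X_i (Theorem 13.2)."""
    mean = sum(a * m for a, m in zip(coeffs, means))
    var = sum(a * a * v for a, v in zip(coeffs, variances))
    return mean, var


# Example 13.4: Y = 3*X1 - 2*X2 with means (-4, 3) and variances (4, 9)
mean_y, var_y = linear_combination_moments([3, -2], [-4, 3], [4, 9])
print(mean_y, var_y)
```

The variance formula squares each coefficient, so the sign of $a_i$ affects the mean but not the variance. The formula is valid only when the covariances vanish, as they do under independence.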

Example 13.5. Let X1 , X2 , ..., X50 be a random sample of size 50 from a

distribution with density

\[
f(x) = \begin{cases} \frac{1}{\theta}\, e^{-\frac{x}{\theta}} & \text{for } 0 \le x < \infty \\ 0 & \text{otherwise.} \end{cases}
\]

What are the mean and variance of the sample mean X?

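Answer: Since the distribution of the population $X$ is exponential, the mean and variance of $X$ are given by

\[
\mu_X = \theta, \quad\text{and}\quad \sigma_X^2 = \theta^2.
\]

Thus, the mean of the sample mean is

\[
\begin{aligned}
E\left(\overline{X}\right) &= E\left(\frac{X_1 + X_2 + \cdots + X_{50}}{50}\right) \\
&= \frac{1}{50} \sum_{i=1}^{50} E(X_i) \\
&= \frac{1}{50} \sum_{i=1}^{50} \theta \\
&= \frac{1}{50}\, 50\, \theta = \theta.
\end{aligned}
\]

The variance of the sample mean is given by

\[
\begin{aligned}
Var\left(\overline{X}\right) &= Var\left(\sum_{i=1}^{50} \frac{1}{50} X_i\right) \\
&= \sum_{i=1}^{50} \left(\frac{1}{50}\right)^2 \sigma_{X_i}^2 \\
&= \sum_{i=1}^{50} \left(\frac{1}{50}\right)^2 \theta^2 \\
&= 50 \left(\frac{1}{50}\right)^2 \theta^2 \\
&= \frac{\theta^2}{50}.
\end{aligned}
\]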

Theorem 13.3. If $X_1, X_2, ..., X_n$ are independent random variables with respective moment generating functions $M_{X_i}(t)$, $i = 1, 2, ..., n$, then the moment generating function of $Y = \sum_{i=1}^{n} a_i X_i$ is given by

\[
M_Y(t) = \prod_{i=1}^{n} M_{X_i}(a_i t).
\]

Proof: Since

\[
\begin{aligned}
M_Y(t) &= M_{\sum_{i=1}^{n} a_i X_i}(t) \\
&= \prod_{i=1}^{n} M_{a_i X_i}(t) \\
&= \prod_{i=1}^{n} M_{X_i}(a_i t),
\end{aligned}
\]

we have the asserted result and the proof of the theorem is now complete.

Example 13.6. Let X1 , X2 , ..., X10 be the observations from a random

sample of size 10 from a distribution with density

\[
f(x) = \frac{1}{\sqrt{2\pi}}\; e^{-\frac{1}{2} x^2}, \qquad -\infty < x < \infty.
\]

What is the moment generating function of the sample mean?

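Answer: The density of the population $X$ is a standard normal. Hence, the moment generating function of each $X_i$ is

\[
M_{X_i}(t) = e^{\frac{1}{2} t^2}, \qquad i = 1, 2, ..., 10.
\]

The moment generating function of the sample mean is

\[
\begin{aligned}
M_{\overline{X}}(t) &= M_{\sum_{i=1}^{10} \frac{1}{10} X_i}(t) \\
&= \prod_{i=1}^{10} M_{X_i}\left(\frac{1}{10}\, t\right) \\
&= \prod_{i=1}^{10} e^{\frac{t^2}{200}} \\
&= \left(e^{\frac{t^2}{200}}\right)^{10} \\
&= e^{\frac{1}{2} \left(\frac{1}{10}\right) t^2}.
\end{aligned}
\]

Hence $\overline{X} \sim N\left(0, \frac{1}{10}\right)$.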

The last example tells us that if we take a sample of any size from

a standard normal population, then the sample mean also has a normal

distribution.

The following theorem says that a linear combination of random variables

with normal distributions is again normal.

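Theorem 13.4. If $X_1, X_2, ..., X_n$ are mutually independent random variables such that

\[
X_i \sim N\left(\mu_i, \sigma_i^2\right), \qquad i = 1, 2, ..., n,
\]

then the random variable $Y = \sum_{i=1}^{n} a_i X_i$ is a normal random variable with mean and variance

\[
\mu_Y = \sum_{i=1}^{n} a_i \mu_i \quad\text{and}\quad \sigma_Y^2 = \sum_{i=1}^{n} a_i^2 \sigma_i^2,
\]

that is $Y \sim N\left(\sum_{i=1}^{n} a_i \mu_i,\; \sum_{i=1}^{n} a_i^2 \sigma_i^2\right)$.

Proof: Since each $X_i \sim N\left(\mu_i, \sigma_i^2\right)$, the moment generating function of each $X_i$ is given by

\[
M_{X_i}(t) = e^{\mu_i t + \frac{1}{2} \sigma_i^2 t^2}.
\]

Hence using Theorem 13.3, we have

\[
\begin{aligned}
M_Y(t) &= \prod_{i=1}^{n} M_{X_i}(a_i t) \\
&= \prod_{i=1}^{n} e^{a_i \mu_i t + \frac{1}{2} a_i^2 \sigma_i^2 t^2} \\
&= e^{\sum_{i=1}^{n} a_i \mu_i t + \frac{1}{2} \sum_{i=1}^{n} a_i^2 \sigma_i^2 t^2}.
\end{aligned}
\]

Thus the random variable $Y \sim N\left(\sum_{i=1}^{n} a_i \mu_i,\; \sum_{i=1}^{n} a_i^2 \sigma_i^2\right)$. The proof of the theorem is now complete.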

Example 13.7. Let $X_1, X_2, ..., X_n$ be the observations from a random sample of size $n$ from a normal distribution with mean $\mu$ and variance $\sigma^2 > 0$. What are the mean and variance of the sample mean $\overline{X}$?

Answer: The expected value (or mean) of the sample mean is given by

\[
\begin{aligned}
E\left(\overline{X}\right) &= \frac{1}{n} \sum_{i=1}^{n} E(X_i) \\
&= \frac{1}{n} \sum_{i=1}^{n} \mu \\
&= \mu.
\end{aligned}
\]

Similarly, the variance of the sample mean is

\[
Var\left(\overline{X}\right) = \sum_{i=1}^{n} Var\left(\frac{X_i}{n}\right) = \sum_{i=1}^{n} \left(\frac{1}{n}\right)^2 \sigma^2 = \frac{\sigma^2}{n}.
\]

This example along with the previous theorem says that if we take a random sample of size $n$ from a normal population with mean $\mu$ and variance $\sigma^2$, then the sample mean is also normal with mean $\mu$ and variance $\frac{\sigma^2}{n}$, that is $\overline{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right)$.

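Example 13.8. Let $X_1, X_2, ..., X_{64}$ be a random sample of size 64 from a normal distribution with $\mu = 50$ and $\sigma^2 = 16$. What are $P(49 < X_8 < 51)$ and $P\left(49 < \overline{X} < 51\right)$?

Answer: Since $X_8 \sim N(50, 16)$, we get

\[
\begin{aligned}
P(49 < X_8 < 51) &= P(49 - 50 < X_8 - 50 < 51 - 50) \\
&= P\left(\frac{49 - 50}{4} < \frac{X_8 - 50}{4} < \frac{51 - 50}{4}\right) \\
&= P\left(-\frac{1}{4} < \frac{X_8 - 50}{4} < \frac{1}{4}\right) \\
&= P\left(-\frac{1}{4} < Z < \frac{1}{4}\right) \\
&= 2\, P\left(Z < \frac{1}{4}\right) - 1 \\
&= 0.1974 \qquad \text{(from normal table).}
\end{aligned}
\]

By the previous theorem, we see that $\overline{X} \sim N\left(50, \frac{16}{64}\right)$. Hence

\[
\begin{aligned}
P\left(49 < \overline{X} < 51\right) &= P\left(49 - 50 < \overline{X} - 50 < 51 - 50\right) \\
&= P\left(\frac{49 - 50}{\sqrt{\frac{16}{64}}} < \frac{\overline{X} - 50}{\sqrt{\frac{16}{64}}} < \frac{51 - 50}{\sqrt{\frac{16}{64}}}\right) \\
&= P\left(-2 < \frac{\overline{X} - 50}{\sqrt{\frac{16}{64}}} < 2\right) \\
&= P(-2 < Z < 2) \\
&= 2\, P(Z < 2) - 1 \\
&= 0.9544 \qquad \text{(from normal table).}
\end{aligned}
\]

This example tells us that $\overline{X}$ has a greater probability of falling in an interval containing $\mu$ than a single observation, say $X_8$ (or in general any $X_i$).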
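The two probabilities in Example 13.8 can be recomputed without a normal table. This is a minimal sketch using `statistics.NormalDist` from the Python standard library (available from Python 3.8); the variable names are illustrative assumptions.

```python
import math
from statistics import NormalDist

mu, sigma2, n = 50.0, 16.0, 64

# A single observation X_8 ~ N(50, 16) and the sample mean ~ N(50, 16/64).
single = NormalDist(mu, math.sqrt(sigma2))
mean_dist = NormalDist(mu, math.sqrt(sigma2 / n))

p_single = single.cdf(51) - single.cdf(49)
p_mean = mean_dist.cdf(51) - mean_dist.cdf(49)
print(round(p_single, 4), round(p_mean, 4))
```

The second value differs from the tabulated 0.9544 in the last digit only because a four-decimal table rounds $P(Z < 2)$ to 0.9772 before doubling.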

Theorem 13.5. Let the distributions of the random variables $X_1, X_2, ..., X_n$ be $\chi^2(r_1), \chi^2(r_2), ..., \chi^2(r_n)$, respectively. If $X_1, X_2, ..., X_n$ are mutually independent, then $Y = X_1 + X_2 + \cdots + X_n \sim \chi^2\left(\sum_{i=1}^{n} r_i\right)$.

Proof: Since each $X_i \sim \chi^2(r_i)$, the moment generating function of each $X_i$ is given by

\[
M_{X_i}(t) = (1 - 2t)^{-\frac{r_i}{2}}.
\]

By Theorem 13.3, we have

\[
M_Y(t) = \prod_{i=1}^{n} M_{X_i}(t) = \prod_{i=1}^{n} (1 - 2t)^{-\frac{r_i}{2}} = (1 - 2t)^{-\frac{1}{2} \sum_{i=1}^{n} r_i}.
\]

Hence $Y \sim \chi^2\left(\sum_{i=1}^{n} r_i\right)$ and the proof of the theorem is now complete.

The proof of the following theorem is an easy consequence of Theorem

13.5 and we leave the proof to the reader.

Theorem 13.6. If $Z_1, Z_2, ..., Z_n$ are mutually independent and each one is standard normal, then $Z_1^2 + Z_2^2 + \cdots + Z_n^2 \sim \chi^2(n)$, that is the sum is chi-square with $n$ degrees of freedom.

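For our next theorem, we write

\[
\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i \quad\text{and}\quad S_n^2 = \frac{1}{n - 1} \sum_{i=1}^{n} \left(X_i - \overline{X}_n\right)^2.
\]

Hence

\[
\overline{X}_2 = \frac{1}{2} (X_1 + X_2)
\]

and

\[
\begin{aligned}
S_2^2 &= \left(X_1 - \overline{X}_2\right)^2 + \left(X_2 - \overline{X}_2\right)^2 \\
&= \frac{1}{4} (X_1 - X_2)^2 + \frac{1}{4} (X_2 - X_1)^2 \\
&= \frac{1}{2} (X_1 - X_2)^2.
\end{aligned}
\]

Further, it can be shown that

\[
\overline{X}_{n+1} = \frac{n\, \overline{X}_n + X_{n+1}}{n + 1} \tag{13.1}
\]

and

\[
n\, S_{n+1}^2 = (n - 1)\, S_n^2 + \frac{n}{n + 1} \left(X_{n+1} - \overline{X}_n\right)^2. \tag{13.2}
\]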
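The recursions (13.1) and (13.2) can be checked against the direct definitions of the sample mean and sample variance. The sketch below is an illustration with made-up data; the function names are assumptions, and exact rational arithmetic is used so that the comparison is not affected by rounding.

```python
from fractions import Fraction


def sample_mean(xs):
    return sum(xs, Fraction(0)) / len(xs)


def sample_variance(xs):
    m = sample_mean(xs)
    return sum((x - m) ** 2 for x in xs) / (len(xs) - 1)


def update(n, mean_n, var_n, x_new):
    """Apply (13.1) and (13.2) to pass from sample size n to n + 1."""
    mean_next = (n * mean_n + x_new) / (n + 1)
    var_next = ((n - 1) * var_n + Fraction(n, n + 1) * (x_new - mean_n) ** 2) / n
    return mean_next, var_next


data = [Fraction(v) for v in (3, 7, 1, 8, 2, 9)]  # illustrative data

# Start from the first two observations and add the rest one at a time.
mean, var = sample_mean(data[:2]), sample_variance(data[:2])
for n in range(2, len(data)):
    mean, var = update(n, mean, var, data[n])

print(mean, var, sample_mean(data), sample_variance(data))
```

These one-pass updates are the same identities that drive the induction in the proof of Theorem 13.7: the new sample variance is built from the old one plus a single squared term involving $X_{n+1} - \overline{X}_n$.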

The following theorem is very useful in mathematical statistics. In order to prove this theorem we need the following result, which can be established with some effort: two linear combinations of a pair of independent normally distributed random variables are themselves bivariate normal, and hence if they are uncorrelated, they are independent. The proof of the following theorem is based on the inductive proof by Stigler (1984).

Theorem 13.7. If $X_1, X_2, ..., X_n$ is a random sample of size $n$ from the normal distribution $N(\mu, \sigma^2)$, then the sample mean $\overline{X}_n = \frac{1}{n} \sum_{i=1}^{n} X_i$ and the sample variance $S_n^2 = \frac{1}{n - 1} \sum_{i=1}^{n} \left(X_i - \overline{X}_n\right)^2$ have the following properties:

(a) $\dfrac{(n - 1)\, S_n^2}{\sigma^2} \sim \chi^2(n - 1)$, and

(b) $\overline{X}_n$ and $S_n^2$ are independent.

Proof: We prove this theorem by induction. First, consider the case $n = 2$. Since each $X_i \sim N(\mu, \sigma^2)$ for $i = 1, 2, ..., n$, we have $X_1 + X_2 \sim N(2\mu, 2\sigma^2)$ and $X_1 - X_2 \sim N(0, 2\sigma^2)$. Hence

\[
\frac{X_1 - X_2}{\sqrt{2\sigma^2}} \sim N(0, 1)
\]

and therefore

\[
\frac{1}{2}\, \frac{(X_1 - X_2)^2}{\sigma^2} \sim \chi^2(1).
\]

This proves (a), that is, $\frac{S_2^2}{\sigma^2} \sim \chi^2(1)$.

Since $X_1$ and $X_2$ are independent,

\[
\begin{aligned}
Cov(X_1 + X_2,\, X_1 - X_2) &= Cov(X_1, X_1) + Cov(X_1, X_2) - Cov(X_2, X_1) - Cov(X_2, X_2) \\
&= \sigma^2 + 0 - 0 - \sigma^2 \\
&= 0.
\end{aligned}
\]

Therefore $X_1 + X_2$ and $X_1 - X_2$ are uncorrelated bivariate normal random variables. Hence they are independent. Thus $\frac{1}{2}(X_1 + X_2)$ and $\frac{1}{2}(X_1 - X_2)^2$ are independent random variables. This proves (b), that is $\overline{X}_2$ and $S_2^2$ are independent.

Now assume the conclusion (that is (a) and (b)) holds for the sample of size $n$. We prove it holds for a sample of size $n + 1$.

Since $X_1, X_2, ..., X_{n+1}$ are independent and each $X_i \sim N(\mu, \sigma^2)$, we have $\overline{X}_n \sim N\left(\mu, \frac{\sigma^2}{n}\right)$. Moreover $\overline{X}_n$ and $X_{n+1}$ are independent. Hence by (13.1), $\overline{X}_{n+1}$ is a linear combination of the independent random variables $\overline{X}_n$ and $X_{n+1}$.

The linear combination $X_{n+1} - \overline{X}_n$ of the random variables $X_{n+1}$ and $\overline{X}_n$ is a normal random variable with mean 0 and variance $\frac{n + 1}{n} \sigma^2$. Hence

\[
\frac{X_{n+1} - \overline{X}_n}{\sqrt{\frac{n + 1}{n}\, \sigma^2}} \sim N(0, 1).
\]

Therefore

\[
\frac{n}{n + 1}\, \frac{\left(X_{n+1} - \overline{X}_n\right)^2}{\sigma^2} \sim \chi^2(1).
\]

Since $X_{n+1}$ and $S_n^2$ are independent random variables, and by the induction hypothesis $\overline{X}_n$ and $S_n^2$ are independent, dividing (13.2) by $\sigma^2$ we get

\[
\begin{aligned}
\frac{n\, S_{n+1}^2}{\sigma^2} &= \frac{(n - 1)\, S_n^2}{\sigma^2} + \frac{n}{n + 1}\, \frac{\left(X_{n+1} - \overline{X}_n\right)^2}{\sigma^2} \\
&= \chi^2(n - 1) + \chi^2(1) \\
&= \chi^2(n).
\end{aligned}
\]

Hence (a) follows.

Finally, the induction hypothesis and the fact that

\[
\overline{X}_{n+1} = \frac{n\, \overline{X}_n + X_{n+1}}{n + 1}
\]

show that $\overline{X}_{n+1}$ is independent of $S_n^2$. Since

\[
\begin{aligned}
Cov\left(n \overline{X}_n + X_{n+1},\, X_{n+1} - \overline{X}_n\right) &= n\, Cov\left(\overline{X}_n, X_{n+1}\right) - n\, Cov\left(\overline{X}_n, \overline{X}_n\right) + Cov\left(X_{n+1}, X_{n+1}\right) - Cov\left(X_{n+1}, \overline{X}_n\right) \\
&= 0 - n\, \frac{\sigma^2}{n} + \sigma^2 - 0 = 0,
\end{aligned}
\]

the random variables $n \overline{X}_n + X_{n+1}$ and $X_{n+1} - \overline{X}_n$ are uncorrelated. Since these two random variables are normal, they are independent. Hence $\left(n \overline{X}_n + X_{n+1}\right)/(n + 1)$ and $\left(X_{n+1} - \overline{X}_n\right)^2/(n + 1)$ are also independent. Since $\overline{X}_{n+1}$ and $S_n^2$ are independent, it follows that $\overline{X}_{n+1}$ and

\[
\frac{n - 1}{n}\, S_n^2 + \frac{1}{n + 1} \left(X_{n+1} - \overline{X}_n\right)^2
\]

are independent and hence $\overline{X}_{n+1}$ and $S_{n+1}^2$ are independent. This proves (b) and the proof of the theorem is now complete.

Remark 13.2. At first sight the statement (b) might seem odd since the sample mean $\overline{X}_n$ occurs explicitly in the definition of the sample variance $S_n^2$. This remarkable independence of $\overline{X}_n$ and $S_n^2$ is a unique property that distinguishes the normal distribution from all other probability distributions.

Example 13.9. Let $X_1, X_2, ..., X_n$ denote a random sample from a normal distribution with variance $\sigma^2 > 0$. If the first percentile of the statistic $W = \sum_{i=1}^{n} \frac{\left(X_i - \overline{X}\right)^2}{\sigma^2}$ is 1.24, where $\overline{X}$ denotes the sample mean, what is the sample size $n$?

Answer:

\[
\begin{aligned}
\frac{1}{100} &= P(W \le 1.24) \\
&= P\left(\sum_{i=1}^{n} \frac{\left(X_i - \overline{X}\right)^2}{\sigma^2} \le 1.24\right) \\
&= P\left(\frac{(n - 1)\, S^2}{\sigma^2} \le 1.24\right) \\
&= P\left(\chi^2(n - 1) \le 1.24\right).
\end{aligned}
\]

Thus from the $\chi^2$-table, we get

\[
n - 1 = 7
\]

and hence the sample size $n$ is 8.

Example 13.10. Let $X_1, X_2, ..., X_4$ be a random sample from a normal distribution with unknown mean and variance equal to 9. Let $S^2 = \frac{1}{3}\sum_{i=1}^{4} (X_i - \overline{X})^2$. If $P(S^2 \le k) = 0.05$, then what is $k$?

Answer:

$$\begin{aligned}
0.05 &= P\left(S^2 \le k\right) \\
&= P\left(\frac{3S^2}{9} \le \frac{3}{9}\,k\right) \\
&= P\left(\chi^2(3) \le \frac{3}{9}\,k\right).
\end{aligned}$$

From the $\chi^2$-table with 3 degrees of freedom, we get

$$\frac{3}{9}\,k = 0.35$$

and thus the constant $k$ is given by

$$k = 3(0.35) = 1.05.$$

13.2. Laws of Large Numbers

In this section, we mainly examine the weak law of large numbers. The weak law of large numbers states that if $X_1, X_2, ..., X_n$ is a random sample of size $n$ from a population $X$ with mean $\mu$, then the sample mean $\overline{X}$ rarely deviates from the population mean $\mu$ when the sample size $n$ is very large. In other words, the sample mean $\overline{X}$ converges in probability to the population mean $\mu$. We begin this section with a result known as the Markov inequality, which is needed to establish the weak law of large numbers.

Theorem 13.8 (Markov Inequality). Suppose $X$ is a nonnegative random variable with mean $E(X)$. Then

$$P(X \ge t) \le \frac{E(X)}{t}$$

for all $t > 0$.

Proof: We assume the random variable $X$ is continuous. If $X$ is not continuous, then a proof can be obtained for this case by replacing the integrals with summations in the following proof. Since

$$\begin{aligned}
E(X) &= \int_{-\infty}^{\infty} x f(x)\,dx \\
&= \int_{-\infty}^{t} x f(x)\,dx + \int_{t}^{\infty} x f(x)\,dx \\
&\ge \int_{t}^{\infty} x f(x)\,dx \\
&\ge \int_{t}^{\infty} t f(x)\,dx \qquad \text{because } x \in [t, \infty) \\
&= t \int_{t}^{\infty} f(x)\,dx \\
&= t\,P(X \ge t),
\end{aligned}$$

we see that

$$P(X \ge t) \le \frac{E(X)}{t}.$$

This completes the proof of the theorem.
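As a quick numerical illustration of the Markov inequality, the following sketch compares the exact tail probability of an exponential random variable with mean 1 against the bound $E(X)/t$. The choice of the exponential distribution and the helper names `markov_bound` and `exponential_tail` are assumptions made only for this illustration; they do not come from the text.

```python
import math


def markov_bound(mean, t):
    """Markov upper bound E(X)/t on P(X >= t) for nonnegative X, capped at 1."""
    if t <= 0:
        raise ValueError("t must be positive")
    return min(1.0, mean / t)


def exponential_tail(t, mean=1.0):
    """Exact P(X >= t) when X is exponential with the given mean (illustrative choice)."""
    return math.exp(-t / mean)


# Compare the exact tail with the Markov bound at a few thresholds.
for t in (0.5, 1.0, 2.0, 5.0):
    print(f"t={t}: exact tail={exponential_tail(t):.4f}, Markov bound={markov_bound(1.0, t):.4f}")
```

The bound holds at every threshold but is far from tight for large $t$, which is typical: the Markov inequality uses only the mean of the distribution.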

In Theorem 4.4 of Chapter 4, the Chebychev inequality was treated. Let $X$ be a random variable with mean $\mu$ and standard deviation $\sigma$. Then the Chebychev inequality says that

$$P(|X - \mu| < k\sigma) \ge 1 - \frac{1}{k^2}$$

for any positive constant $k$. This result can be obtained easily using Theorem 13.8 as follows. By the Markov inequality, we have

$$P\left((X - \mu)^2 \ge t^2\right) \le \frac{E\left((X - \mu)^2\right)}{t^2}$$

for all $t > 0$. Since the events $(X - \mu)^2 \ge t^2$ and $|X - \mu| \ge t$ are the same, we get

$$P\left((X - \mu)^2 \ge t^2\right) = P(|X - \mu| \ge t) \le \frac{E\left((X - \mu)^2\right)}{t^2}$$

for all $t > 0$. Hence

$$P(|X - \mu| \ge t) \le \frac{\sigma^2}{t^2}.$$

Letting $t = k\sigma$ in the above inequality, we see that

$$P(|X - \mu| \ge k\sigma) \le \frac{1}{k^2}.$$

Hence

$$1 - P(|X - \mu| < k\sigma) \le \frac{1}{k^2}.$$

The last inequality yields the Chebychev inequality

$$P(|X - \mu| < k\sigma) \ge 1 - \frac{1}{k^2}.$$

Now we are ready to treat the weak law of large numbers.

Theorem 13.9. Let $X_1, X_2, ...$ be a sequence of independent and identically distributed random variables with $\mu = E(X_i)$ and $\sigma^2 = Var(X_i) < \infty$ for $i = 1, 2, ...$. Then

$$\lim_{n \to \infty} P(|\overline{S}_n - \mu| \ge \varepsilon) = 0$$

for every $\varepsilon > 0$. Here $\overline{S}_n$ denotes $\frac{X_1 + X_2 + \cdots + X_n}{n}$.

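Proof: By Theorem 13.2 (or Example 13.7) we have

$$E(\overline{S}_n) = \mu \quad \text{and} \quad Var(\overline{S}_n) = \frac{\sigma^2}{n}.$$

By Chebychev's inequality

$$P(|\overline{S}_n - E(\overline{S}_n)| \ge \varepsilon) \le \frac{Var(\overline{S}_n)}{\varepsilon^2}$$

for $\varepsilon > 0$. Hence

$$P(|\overline{S}_n - \mu| \ge \varepsilon) \le \frac{\sigma^2}{n\,\varepsilon^2}.$$

Taking the limit as $n$ tends to infinity, we get

$$\lim_{n \to \infty} P(|\overline{S}_n - \mu| \ge \varepsilon) \le \lim_{n \to \infty} \frac{\sigma^2}{n\,\varepsilon^2}$$

which yields

$$\lim_{n \to \infty} P(|\overline{S}_n - \mu| \ge \varepsilon) = 0$$

and the proof of the theorem is now complete.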
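The bound $\sigma^2/(n\varepsilon^2)$ used in this proof can be compared with a simulation. The sketch below tosses a fair coin $n = 1000$ times, repeats the experiment 200 times, and records how often the sample mean misses $1/2$ by at least $\varepsilon = 0.05$. The function names, the seed, and the particular values of $n$, $\varepsilon$ and the number of trials are assumptions chosen for illustration only.

```python
import random


def chebyshev_bound(variance, n, eps):
    """Upper bound sigma^2 / (n * eps^2) on P(|mean - mu| >= eps), capped at 1."""
    return min(1.0, variance / (n * eps ** 2))


def deviation_frequency(p, n, eps, trials, seed=0):
    """Fraction of simulated samples whose mean deviates from p by at least eps."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    count = 0
    for _ in range(trials):
        heads = sum(1 for _ in range(n) if rng.random() < p)
        if abs(heads / n - p) >= eps:
            count += 1
    return count / trials


# Fair coin: mu = 1/2 and sigma^2 = 1/4.
bound = chebyshev_bound(0.25, 1000, 0.05)
freq = deviation_frequency(0.5, 1000, 0.05, trials=200)
print("Chebychev bound:", bound)
print("simulated frequency of large deviations:", freq)
```

The simulated frequency is typically far below the bound, which reflects that Chebychev's inequality is a worst-case statement over all distributions with the given variance.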

It is possible to prove the weak law of large numbers assuming only that $E(X)$ exists and is finite, but the proof is more involved.

The weak law of large numbers says that the sequence of sample means $\left(\overline{S}_n\right)_{n=1}^{\infty}$ from a population $X$ stays close to the population mean $E(X)$ most of the time. Let us consider an experiment that consists of tossing a coin infinitely many times. Let $X_i$ be 1 if the $i$th toss results in a Head, and 0 otherwise. The weak law of large numbers says that

$$\overline{S}_n = \frac{X_1 + X_2 + \cdots + X_n}{n} \to \frac{1}{2} \quad \text{as } n \to \infty \tag{13.3}$$

but it is easy to come up with sequences of tosses for which (13.3) is false:

H H H H H H H H H H H H · · · · · ·

H H T H H T H H T H H T · · · · · ·

The strong law of large numbers (Theorem 13.11) states that the set of "bad sequences" like the ones given above has probability zero.

Note that the assertion of Theorem 13.9 for any $\varepsilon > 0$ can also be written as

$$\lim_{n \to \infty} P(|\overline{S}_n - \mu| < \varepsilon) = 1.$$

The type of convergence we saw in the weak law of large numbers is not the type of convergence discussed in calculus. This type of convergence is called convergence in probability and is defined as follows.

Definition 13.1. Suppose $X_1, X_2, ...$ is a sequence of random variables defined on a sample space $S$. The sequence converges in probability to the random variable $X$ if, for any $\varepsilon > 0$,

$$\lim_{n \to \infty} P(|X_n - X| < \varepsilon) = 1.$$

In view of the above definition, the weak law of large numbers states that the sample mean $\overline{X}$ converges in probability to the population mean $\mu$.

The following theorem is known as the Bernoulli law of large numbers

and is a special case of the weak law of large numbers.

Theorem 13.10. Let $X_1, X_2, ...$ be a sequence of independent and identically distributed Bernoulli random variables with probability of success $p$. Then, for any $\varepsilon > 0$,

$$\lim_{n \to \infty} P(|\overline{S}_n - p| < \varepsilon) = 1$$

where $\overline{S}_n$ denotes $\frac{X_1 + X_2 + \cdots + X_n}{n}$.

The fact that the relative frequency of occurrence of an event $E$ is very likely to be close to its probability $P(E)$ for large $n$ can be derived from the weak law of large numbers. Consider a repeatable random experiment repeated a large number of times independently. Let $X_i = 1$ if $E$ occurs on the $i$th repetition and $X_i = 0$ if $E$ does not occur on the $i$th repetition. Then

$$\mu = E(X_i) = 1 \cdot P(E) + 0 \cdot (1 - P(E)) = P(E) \quad \text{for } i = 1, 2, 3, ...$$

and

$$X_1 + X_2 + \cdots + X_n = N(E)$$

where $N(E)$ denotes the number of times $E$ occurs. Hence by the weak law of large numbers, we have

$$\begin{aligned}
\lim_{n \to \infty} P\left(\left|\frac{N(E)}{n} - P(E)\right| \ge \varepsilon\right)
&= \lim_{n \to \infty} P\left(\left|\frac{X_1 + X_2 + \cdots + X_n}{n} - \mu\right| \ge \varepsilon\right) \\
&= \lim_{n \to \infty} P\left(\left|\overline{S}_n - \mu\right| \ge \varepsilon\right) \\
&= 0.
\end{aligned}$$

Hence, for large $n$, the relative frequency of occurrence of the event $E$ is very likely to be close to its probability $P(E)$.

Now we present the strong law of large numbers without a proof.

Theorem 13.11. Let $X_1, X_2, ...$ be a sequence of independent and identically distributed random variables with $\mu = E(X_i)$ and $\sigma^2 = Var(X_i) < \infty$ for $i = 1, 2, ...$. Then

$$P\left(\lim_{n \to \infty} \overline{S}_n = \mu\right) = 1.$$

Here $\overline{S}_n$ denotes $\frac{X_1 + X_2 + \cdots + X_n}{n}$.

The type of convergence in Theorem 13.11 is called almost sure convergence. The notion of almost sure convergence is defined as follows.

Definition 13.2. Suppose the random variable $X$ and the sequence $X_1, X_2, ...$ of random variables are defined on a sample space $S$. The sequence $X_n(w)$ converges almost surely to $X(w)$ if

$$P\left(\left\{ w \in S \;\Big|\; \lim_{n \to \infty} X_n(w) = X(w) \right\}\right) = 1.$$

It can be shown that almost sure convergence implies convergence in probability, but the converse does not hold.

13.3. The Central Limit Theorem

Consider a random sample of measurements $\{X_i\}_{i=1}^{n}$. The $X_i$'s are identically distributed and their common distribution is the distribution of the population. We have seen that if the population distribution is normal, then the sample mean $\overline{X}$ is also normal. More precisely, if $X_1, X_2, ..., X_n$ is a random sample from a normal distribution with density

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}$$

then

$$\overline{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right).$$

The central limit theorem (also known as the Lindeberg-Levy Theorem) states that even though the population distribution may be far from being normal, for large sample size $n$ the distribution of the standardized sample mean is approximately standard normal, with better approximations obtained with larger sample sizes. Mathematically this can be stated as follows.

Theorem 13.12 (Central Limit Theorem). Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution with mean $\mu$ and variance $\sigma^2 < \infty$. Then the limiting distribution of

$$Z_n = \frac{\overline{X} - \mu}{\sigma / \sqrt{n}}$$

is standard normal, that is, $Z_n$ converges in distribution to $Z$ where $Z$ denotes a standard normal random variable.

The type of convergence used in the central limit theorem is called convergence in distribution and is defined as follows.

Definition 13.3. Suppose $X$ is a random variable with cumulative distribution function $F(x)$ and $X_1, X_2, ...$ is a sequence of random variables with cumulative distribution functions $F_1(x), F_2(x), ...$, respectively. The sequence $X_n$ converges in distribution to $X$ if

$$\lim_{n \to \infty} F_n(x) = F(x)$$

for all values $x$ at which $F(x)$ is continuous. The distribution of $X$ is called the limiting distribution of $X_n$.

Whenever a sequence of random variables $X_1, X_2, ...$ converges in distribution to the random variable $X$, it will be denoted by $X_n \xrightarrow{d} X$.

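Example 13.11. Let $Y = X_1 + X_2 + \cdots + X_{15}$ be the sum of a random sample of size 15 from the distribution whose density function is

$$f(x) = \begin{cases} \frac{3}{2}\,x^2 & \text{if } -1 < x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

What is the approximate value of $P(-0.3 \le Y \le 1.5)$ when one uses the central limit theorem?

Answer: First, we find the mean $\mu$ and variance $\sigma^2$ for the density function $f(x)$. The mean for this distribution is given by

$$\mu = \int_{-1}^{1} \frac{3}{2}\,x^3\,dx = \frac{3}{2}\left[\frac{x^4}{4}\right]_{-1}^{1} = 0.$$

Hence the variance of this distribution is given by

$$\begin{aligned}
Var(X) &= E(X^2) - [E(X)]^2 \\
&= \int_{-1}^{1} \frac{3}{2}\,x^4\,dx \\
&= \frac{3}{2}\left[\frac{x^5}{5}\right]_{-1}^{1} \\
&= \frac{3}{2}\cdot\frac{2}{5} \\
&= 0.6.
\end{aligned}$$

By the central limit theorem, $Y$ is approximately normal with mean $15\mu = 0$ and variance $15\sigma^2 = 15(0.6) = 9$. Therefore

$$\begin{aligned}
P(-0.3 \le Y \le 1.5) &= P(-0.3 - 0 \le Y - 0 \le 1.5 - 0) \\
&= P\left(\frac{-0.3}{\sqrt{15(0.6)}} \le \frac{Y - 0}{\sqrt{15(0.6)}} \le \frac{1.5}{\sqrt{15(0.6)}}\right) \\
&= P(-0.10 \le Z \le 0.50) \\
&= P(Z \le 0.50) + P(Z \le 0.10) - 1 \\
&= 0.6915 + 0.5398 - 1 \\
&= 0.2313.
\end{aligned}$$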
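The table lookup in this example can be reproduced numerically. The sketch below evaluates the standard normal cumulative distribution function through the error function from Python's standard library and applies the normal approximation to a sum of $n$ independent observations. The helper names `std_normal_cdf` and `clt_sum_probability` are illustrative assumptions, not notation from the text.

```python
import math


def std_normal_cdf(z):
    """Standard normal cumulative distribution function via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def clt_sum_probability(a, b, n, mean, variance):
    """Normal approximation to P(a <= Y <= b), where Y is a sum of n i.i.d. terms."""
    sd = math.sqrt(n * variance)       # standard deviation of the sum
    lower = (a - n * mean) / sd        # standardized lower limit
    upper = (b - n * mean) / sd        # standardized upper limit
    return std_normal_cdf(upper) - std_normal_cdf(lower)


# Example 13.11: n = 15, mu = 0, sigma^2 = 0.6.
p = clt_sum_probability(-0.3, 1.5, 15, 0.0, 0.6)
print(round(p, 4))
```

The result agrees with the value 0.2313 obtained from the normal table, up to the rounding of the table entries.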

Example 13.12. Let $X_1, X_2, ..., X_n$ be a random sample of size $n = 25$ from a population that has a mean $\mu = 71.43$ and variance $\sigma^2 = 56.25$. Let $\overline{X}$ be the sample mean. What is the probability that the sample mean is between 68.91 and 71.97?

Answer: The mean of $\overline{X}$ is given by $E(\overline{X}) = 71.43$. The variance of $\overline{X}$ is given by

$$Var(\overline{X}) = \frac{\sigma^2}{n} = \frac{56.25}{25} = 2.25.$$

In order to find the probability that the sample mean is between 68.91 and 71.97, we need the distribution of the population. However, the population distribution is unknown. Therefore, we use the central limit theorem. The central limit theorem says that $\frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \sim N(0,1)$ as $n$ approaches infinity. Therefore

$$\begin{aligned}
P(68.91 \le \overline{X} \le 71.97)
&= P\left(\frac{68.91 - 71.43}{\sqrt{2.25}} \le \frac{\overline{X} - 71.43}{\sqrt{2.25}} \le \frac{71.97 - 71.43}{\sqrt{2.25}}\right) \\
&= P(-1.68 \le W \le 0.36) \\
&= P(W \le 0.36) + P(W \le 1.68) - 1 \\
&= 0.5941.
\end{aligned}$$

Example 13.13. Light bulbs are installed successively into a socket. If we assume that each light bulb has a mean life of 2 months with a standard deviation of 0.25 months, what is the probability that 40 bulbs last at least 7 years?

Answer: Let $X_i$ denote the life time of the $i$th bulb installed. The 40 light bulbs last a total time of

$$S_{40} = X_1 + X_2 + \cdots + X_{40}.$$

By the central limit theorem

$$\frac{\sum_{i=1}^{n} X_i - n\mu}{\sqrt{n\,\sigma^2}} \sim N(0,1) \quad \text{as } n \to \infty.$$

Thus

$$\frac{S_{40} - (40)(2)}{\sqrt{(40)(0.25)^2}} \sim N(0,1).$$

That is

$$\frac{S_{40} - 80}{1.581} \sim N(0,1).$$

Therefore

$$\begin{aligned}
P(S_{40} \ge 7(12)) &= P\left(\frac{S_{40} - 80}{1.581} \ge \frac{84 - 80}{1.581}\right) \\
&= P(Z \ge 2.530) \\
&= 0.0057.
\end{aligned}$$

Example 13.14. Light bulbs are installed into a socket. Assume that each has a mean life of 2 months with standard deviation of 0.25 month. How many bulbs $n$ should be bought so that one can be 95% sure that the supply of $n$ bulbs will last 5 years?

Answer: Let $X_i$ denote the life time of the $i$th bulb installed. The $n$ light bulbs last a total time of

$$S_n = X_1 + X_2 + \cdots + X_n.$$

The total life span $S_n$ has

$$E(S_n) = 2n \quad \text{and} \quad Var(S_n) = \frac{n}{16}.$$

By the central limit theorem, we get

$$\frac{S_n - E(S_n)}{\sqrt{n}/4} \sim N(0,1).$$

Thus, we seek $n$ such that

$$\begin{aligned}
0.95 &= P(S_n \ge 60) \\
&= P\left(\frac{S_n - 2n}{\sqrt{n}/4} \ge \frac{60 - 2n}{\sqrt{n}/4}\right) \\
&= P\left(Z \ge \frac{240 - 8n}{\sqrt{n}}\right) \\
&= 1 - P\left(Z \le \frac{240 - 8n}{\sqrt{n}}\right).
\end{aligned}$$

From the standard normal table, we get

$$\frac{240 - 8n}{\sqrt{n}} = -1.645$$

which implies

$$8n - 1.645\sqrt{n} - 240 = 0.$$

Solving this quadratic equation for $\sqrt{n}$, we get

$$\sqrt{n} = -5.375 \quad \text{or} \quad 5.581.$$

Since $\sqrt{n}$ must be positive, $\sqrt{n} = 5.581$. Thus $n = 31.15$. So we should buy 32 bulbs.
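The quadratic in $\sqrt{n}$ can be solved directly with the quadratic formula. The following sketch does this for a general mean life, standard deviation, required duration and normal quantile; the function name `bulbs_needed` and its parameters are hypothetical names introduced for this illustration. It solves $\mu x^2 - z\sigma x - T = 0$ for $x = \sqrt{n}$, which for $\mu = 2$, $\sigma = 0.25$, $T = 60$ and $z = 1.645$ is the equation above divided by 4.

```python
import math


def bulbs_needed(mean_life, sd_life, months, z):
    """Return (sqrt_n, n) where n is the smallest integer supply size.

    Solves mean_life * x**2 - z * sd_life * x - months = 0 for x = sqrt(n),
    keeping the positive root, and rounds x**2 up to the next integer.
    """
    a = mean_life
    b = -z * sd_life
    c = -months
    disc = b * b - 4.0 * a * c          # discriminant, positive here
    x = (-b + math.sqrt(disc)) / (2.0 * a)  # positive root only
    return x, math.ceil(x * x)


# Example 13.14: mean 2 months, sd 0.25 months, 60 months, z = 1.645.
root, n = bulbs_needed(2.0, 0.25, 60.0, 1.645)
print(round(root, 3), round(root * root, 2), n)
```

Rounding up is the natural choice here because buying fewer bulbs than the computed value would push the confidence below 95%.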

Example 13.15. American Airlines claims that the average number of people who pay for in-flight movies, when the plane is fully loaded, is 42 with a standard deviation of 8. A sample of 36 fully loaded planes is taken. What is the probability that, on average, fewer than 38 people per plane paid for the in-flight movies?

Answer: Here, we would like to find $P(\overline{X} < 38)$. Since we do not know the distribution of $\overline{X}$, we will use the central limit theorem. We are given that the population mean is $\mu = 42$ and the population standard deviation is $\sigma = 8$. Moreover, we are dealing with a sample of size $n = 36$. Thus

$$\begin{aligned}
P(\overline{X} < 38) &= P\left(\frac{\overline{X} - 42}{8/6} < \frac{38 - 42}{8/6}\right) \\
&= P(Z < -3) \\
&= 1 - P(Z < 3) \\
&= 1 - 0.9987 \\
&= 0.0013.
\end{aligned}$$

Since we have not yet seen the proof of the central limit theorem, first let us go through some examples to see the main idea behind the proof of the central limit theorem. Later, at the end of this section, a proof of the central limit theorem will be given. We know from the central limit theorem that if $X_1, X_2, ..., X_n$ is a random sample of size $n$ from a distribution with mean $\mu$ and variance $\sigma^2$, then

$$\frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \xrightarrow{d} Z \sim N(0,1) \quad \text{as } n \to \infty.$$

However, the above expression is not equivalent to

$$\overline{X} \xrightarrow{d} Z \sim N\left(\mu, \frac{\sigma^2}{n}\right) \quad \text{as } n \to \infty$$

as the following example shows.

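Example 13.16. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a gamma distribution with parameters $\theta = 1$ and $\alpha = 1$. What is the distribution of the sample mean $\overline{X}$? Also, what is the limiting distribution of $\overline{X}$ as $n \to \infty$?

Answer: Since each $X_i \sim GAM(1, 1)$, the probability density function of each $X_i$ is given by

$$f(x) = \begin{cases} e^{-x} & \text{if } x > 0 \\ 0 & \text{otherwise} \end{cases}$$

and hence the moment generating function of each $X_i$ is

$$M_{X_i}(t) = \frac{1}{1 - t}.$$

First we determine the moment generating function of the sample mean $\overline{X}$, and then examine this moment generating function to find the probability distribution of $\overline{X}$. Since

$$\begin{aligned}
M_{\overline{X}}(t) &= M_{\frac{1}{n}\sum_{i=1}^{n} X_i}(t) \\
&= \prod_{i=1}^{n} M_{X_i}\left(\frac{t}{n}\right) \\
&= \prod_{i=1}^{n} \frac{1}{1 - \frac{t}{n}} \\
&= \frac{1}{\left(1 - \frac{t}{n}\right)^n},
\end{aligned}$$

therefore $\overline{X} \sim GAM\left(\frac{1}{n}, n\right)$.

Next, we find the limiting distribution of $\overline{X}$ as $n \to \infty$. This can be done again by finding the limiting moment generating function of $\overline{X}$ and identifying the distribution of $\overline{X}$. Consider

$$\begin{aligned}
\lim_{n \to \infty} M_{\overline{X}}(t) &= \lim_{n \to \infty} \frac{1}{\left(1 - \frac{t}{n}\right)^n} \\
&= \frac{1}{\lim_{n \to \infty} \left(1 - \frac{t}{n}\right)^n} \\
&= \frac{1}{e^{-t}} \\
&= e^{t}.
\end{aligned}$$

Thus, the sample mean $\overline{X}$ has a degenerate limiting distribution, that is, all the probability mass is concentrated at one point of the space of $\overline{X}$ (here the point 1, since $e^{t}$ is the moment generating function of the constant 1).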


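Example 13.17. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a gamma distribution with parameters $\theta = 1$ and $\alpha = 1$. What is the distribution of

$$\frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \quad \text{as } n \to \infty$$

where $\mu$ and $\sigma$ are the population mean and standard deviation, respectively?

Answer: From Example 13.16, we know that

$$M_{\overline{X}}(t) = \frac{1}{\left(1 - \frac{t}{n}\right)^n}.$$

Since the population distribution is gamma with $\theta = 1$ and $\alpha = 1$, the population mean $\mu$ is 1 and the population variance $\sigma^2$ is also 1. Therefore

$$\begin{aligned}
M_{\frac{\overline{X} - 1}{1/\sqrt{n}}}(t) &= M_{\sqrt{n}\,\overline{X} - \sqrt{n}}(t) \\
&= e^{-\sqrt{n}\,t}\, M_{\overline{X}}\left(\sqrt{n}\,t\right) \\
&= e^{-\sqrt{n}\,t}\, \frac{1}{\left(1 - \frac{\sqrt{n}\,t}{n}\right)^n} \\
&= \frac{1}{e^{\sqrt{n}\,t}\left(1 - \frac{t}{\sqrt{n}}\right)^n}.
\end{aligned}$$

The limiting moment generating function can be obtained by taking the limit of the above expression as $n$ tends to infinity. That is,

$$\begin{aligned}
\lim_{n \to \infty} M_{\frac{\overline{X} - 1}{1/\sqrt{n}}}(t) &= \lim_{n \to \infty} \frac{1}{e^{\sqrt{n}\,t}\left(1 - \frac{t}{\sqrt{n}}\right)^n} \\
&= e^{\frac{1}{2}t^2} \qquad \text{(using MAPLE)}.
\end{aligned}$$

Since $e^{\frac{1}{2}t^2}$ is the moment generating function of the standard normal distribution, the limiting distribution of $\frac{\overline{X} - \mu}{\sigma/\sqrt{n}}$ is $N(0,1)$.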
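The limit attributed to MAPLE in this example can be checked numerically. The sketch below evaluates the logarithm of $e^{-\sqrt{n}\,t}\left(1 - t/\sqrt{n}\right)^{-n}$ for increasing $n$ and compares it with $t^2/2$. Working on the log scale and using `math.log1p` is a design choice made here to avoid overflow and loss of precision; the function name `log_mgf_standardized_mean` is an assumption for this illustration.

```python
import math


def log_mgf_standardized_mean(t, n):
    """Log of exp(-sqrt(n) t) * (1 - t/sqrt(n))**(-n); requires t < sqrt(n)."""
    root = math.sqrt(n)
    if t >= root:
        raise ValueError("the moment generating function exists only for t < sqrt(n)")
    return -root * t - n * math.log1p(-t / root)


t = 1.0
for n in (10, 100, 10_000, 1_000_000):
    value = log_mgf_standardized_mean(t, n)
    print(n, value, "target:", t * t / 2.0)
```

The printed values approach 0.5 as $n$ grows, consistent with the limiting moment generating function $e^{t^2/2}$ of the standard normal distribution.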

The following theorem is used to prove the central limit theorem.

Theorem 13.13 (Lévy Continuity Theorem). Let $X_1, X_2, ...$ be a sequence of random variables with distribution functions $F_1(x), F_2(x), ...$ and moment generating functions $M_{X_1}(t), M_{X_2}(t), ...$, respectively. Let $X$ be a random variable with distribution function $F(x)$ and moment generating function $M_X(t)$. If for all $t$ in the open interval $(-h, h)$ for some $h > 0$

$$\lim_{n \to \infty} M_{X_n}(t) = M_X(t),$$

then at the points of continuity of $F(x)$

$$\lim_{n \to \infty} F_n(x) = F(x).$$

The proof of this theorem is beyond the scope of this book.

The following limit

$$\lim_{n \to \infty} \left[1 + \frac{t}{n} + \frac{d(n)}{n}\right]^n = e^{t}, \quad \text{if } \lim_{n \to \infty} d(n) = 0, \tag{13.4}$$

whose proof we leave to the reader, can be established using advanced calculus. Here $t$ is independent of $n$.

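Now we proceed to prove the central limit theorem assuming that the moment generating function of the population $X$ exists. Let $M_{X - \mu}(t)$ be the moment generating function of the random variable $X - \mu$. We denote $M_{X - \mu}(t)$ as $M(t)$ when there is no danger of confusion. Then

$$\left.\begin{aligned}
M(0) &= 1, \\
M'(0) &= E(X - \mu) = E(X) - \mu = \mu - \mu = 0, \\
M''(0) &= E\left((X - \mu)^2\right) = \sigma^2.
\end{aligned}\right\} \tag{13.5}$$

By Taylor series expansion of $M(t)$ about 0, we get

$$M(t) = M(0) + M'(0)\,t + \frac{1}{2}\,M''(\eta)\,t^2$$

where $\eta \in (0, t)$. Hence using (13.5), we have

$$\begin{aligned}
M(t) &= 1 + \frac{1}{2}\,M''(\eta)\,t^2 \\
&= 1 + \frac{1}{2}\,\sigma^2 t^2 + \frac{1}{2}\,M''(\eta)\,t^2 - \frac{1}{2}\,\sigma^2 t^2 \\
&= 1 + \frac{1}{2}\,\sigma^2 t^2 + \frac{1}{2}\left[M''(\eta) - \sigma^2\right] t^2.
\end{aligned}$$

Now using $M(t)$ we compute the moment generating function of $Z_n$. Note that

$$Z_n = \frac{\overline{X} - \mu}{\sigma/\sqrt{n}} = \frac{1}{\sigma\sqrt{n}} \sum_{i=1}^{n} (X_i - \mu).$$

Hence

$$\begin{aligned}
M_{Z_n}(t) &= \prod_{i=1}^{n} M_{X_i - \mu}\left(\frac{t}{\sigma\sqrt{n}}\right) \\
&= \prod_{i=1}^{n} M_{X - \mu}\left(\frac{t}{\sigma\sqrt{n}}\right) \\
&= \left[M\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n \\
&= \left[1 + \frac{t^2}{2n} + \frac{\left(M''(\eta) - \sigma^2\right) t^2}{2n\,\sigma^2}\right]^n
\end{aligned}$$

for $0 < |\eta| < \frac{1}{\sigma\sqrt{n}}\,|t|$. Note that since $0 < |\eta| < \frac{1}{\sigma\sqrt{n}}\,|t|$, we have

$$\lim_{n \to \infty} \frac{t}{\sigma\sqrt{n}} = 0, \quad \lim_{n \to \infty} \eta = 0, \quad \text{and} \quad \lim_{n \to \infty} \left(M''(\eta) - \sigma^2\right) = 0. \tag{13.6}$$

Letting

$$d(n) = \frac{\left(M''(\eta) - \sigma^2\right) t^2}{2\,\sigma^2}$$

and using (13.6), we see that $\lim_{n \to \infty} d(n) = 0$, and

$$M_{Z_n}(t) = \left[1 + \frac{t^2}{2n} + \frac{d(n)}{n}\right]^n. \tag{13.7}$$

Using (13.7) and the limit (13.4) with $t$ replaced by $\frac{t^2}{2}$, we have

$$\lim_{n \to \infty} M_{Z_n}(t) = \lim_{n \to \infty} \left[1 + \frac{t^2}{2n} + \frac{d(n)}{n}\right]^n = e^{\frac{1}{2}t^2}.$$

Hence by the Lévy continuity theorem, we obtain

$$\lim_{n \to \infty} F_n(x) = \Phi(x)$$

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution. Thus $Z_n \xrightarrow{d} Z$ and the proof of the theorem is now complete.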

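Now we give another proof of the central limit theorem using the L'Hospital rule. This proof is essentially due to Tardiff (1981).

As before, let $Z_n = \frac{\overline{X} - \mu}{\sigma/\sqrt{n}}$. Then $M_{Z_n}(t) = \left[M\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n$ where $M(t)$ is the moment generating function of the random variable $X - \mu$. Hence from (13.5), we have $M(0) = 1$, $M'(0) = 0$, and $M''(0) = \sigma^2$. Letting $h = \frac{t}{\sigma\sqrt{n}}$, we see that $n = \frac{t^2}{\sigma^2 h^2}$. Hence if $n \to \infty$, then $h \to 0$. Using these and applying the L'Hospital rule twice, we compute

$$\begin{aligned}
\lim_{n \to \infty} M_{Z_n}(t) &= \lim_{n \to \infty} \left[M\left(\frac{t}{\sigma\sqrt{n}}\right)\right]^n \\
&= \lim_{n \to \infty} \exp\left(n \ln M\left(\frac{t}{\sigma\sqrt{n}}\right)\right) \\
&= \lim_{h \to 0} \exp\left(\frac{t^2}{\sigma^2}\,\frac{\ln\left(M(h)\right)}{h^2}\right) \qquad \left(\tfrac{0}{0} \text{ form}\right) \\
&= \lim_{h \to 0} \exp\left(\frac{t^2}{\sigma^2}\,\frac{\frac{1}{M(h)}\,M'(h)}{2h}\right) \qquad \text{(L'Hospital rule)} \\
&= \lim_{h \to 0} \exp\left(\frac{t^2}{\sigma^2}\,\frac{M'(h)}{2h\,M(h)}\right) \qquad \left(\tfrac{0}{0} \text{ form}\right) \\
&= \lim_{h \to 0} \exp\left(\frac{t^2}{\sigma^2}\,\frac{M''(h)}{2M(h) + 2h\,M'(h)}\right) \qquad \text{(L'Hospital rule)} \\
&= \exp\left(\frac{t^2}{\sigma^2}\,\frac{M''(0)}{2M(0)}\right) \\
&= \exp\left(\frac{t^2}{\sigma^2}\,\frac{\sigma^2}{2}\right) \\
&= \exp\left(\frac{1}{2}\,t^2\right).
\end{aligned}$$

Hence by the Lévy continuity theorem, we obtain

$$\lim_{n \to \infty} F_n(x) = \Phi(x)$$

where $\Phi(x)$ is the cumulative distribution function of the standard normal distribution. Thus as $n \to \infty$, the random variable $Z_n \xrightarrow{d} Z$, where $Z \sim N(0,1)$.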

Remark 13.3. In contrast to the moment generating function, since the

characteristic function of a random variable always exists, the original proof

of the central limit theorem involved the characteristic function (see for ex-

ample An Introduction to Probability Theory and Its Applications, Volume II

by Feller). In 1988, Brown gave an elementary proof using very clever Tay-

lor series expansions, where the use of the characteristic function has been

avoided.

13.4. Order Statistics

Often, sample values such as the smallest, largest, or middle observation

from a random sample provide important information. For example, the


highest ﬂood water or lowest winter temperature recorded during the last

50 years might be useful when planning for future emergencies. The median

price of houses sold during the previous month might be useful for estimating

the cost of living. The statistics highest, lowest or median are examples of

order statistics.

Deﬁnition 13.4. Let X1 , X2 , ..., Xn be observations from a random sam-

ple of size n from a distribution f (x ). Let X(1) denote the smallest of

{X1 , X2 , ..., Xn }, X(2) denote the second smallest of {X1 , X2 , ..., Xn }, and

similarly X(r) denote the r th smallest of {X1 , X2 , ..., Xn } . Then the ran-

dom variables X(1) , X(2) , ..., X(n) are called the order statistics of the sam-

ple X1 , X2 , ..., Xn . In particular, X(r) is called the r th -order statistic of

X1 , X2 , ..., Xn .

The sample range, R , is the distance between the smallest and the largest

observation. That is,

$$R = X_{(n)} - X_{(1)}.$$

This is an important statistic which is deﬁned using order statistics.

The distributions of the order statistics are very important when one uses these statistics in any statistical investigation. The next theorem gives the distribution of an order statistic.

Theorem 13.14. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution with density function $f(x)$. Then the probability density function of the $r$th order statistic, $X_{(r)}$, is

$$g(x) = \frac{n!}{(r-1)!\,(n-r)!}\,[F(x)]^{r-1}\, f(x)\, [1 - F(x)]^{n-r},$$

where $F(x)$ denotes the cdf of $f(x)$.

Proof: We prove the theorem assuming $f(x)$ is continuous. In the case $f(x)$ is discrete the proof has to be modified appropriately. Let $h$ be a positive real number and $x$ be an arbitrary point in the domain of $f$. Let us divide the real line into three segments, namely

$$\mathbb{R} = (-\infty, x) \cup [x, x+h) \cup [x+h, \infty).$$

The probability, say $p_1$, that a sample value falls into the first interval $(-\infty, x)$ is given by

$$p_1 = \int_{-\infty}^{x} f(t)\,dt = F(x).$$

Similarly, the probability $p_2$ that a sample value falls into the second interval $[x, x+h)$ is

$$p_2 = \int_{x}^{x+h} f(t)\,dt = F(x+h) - F(x).$$

In the same way, we can compute the probability $p_3$ that a sample value falls into the third interval:

$$p_3 = \int_{x+h}^{\infty} f(t)\,dt = 1 - F(x+h).$$

Then the probability, $P_h(x)$, that $(r-1)$ sample values fall in the first interval, one falls in the second interval, and $(n-r)$ fall in the third interval is

$$P_h(x) = \binom{n}{r-1,\,1,\,n-r}\, p_1^{r-1}\, p_2^{1}\, p_3^{n-r} = \frac{n!}{(r-1)!\,(n-r)!}\, p_1^{r-1}\, p_2\, p_3^{n-r}.$$

Hence the probability density function $g(x)$ of the $r$th order statistic is given by

$$\begin{aligned}
g(x) &= \lim_{h \to 0} \frac{P_h(x)}{h} \\
&= \lim_{h \to 0} \left[\frac{n!}{(r-1)!\,(n-r)!}\, p_1^{r-1}\, \frac{p_2}{h}\, p_3^{n-r}\right] \\
&= \frac{n!}{(r-1)!\,(n-r)!}\, [F(x)]^{r-1} \lim_{h \to 0} \frac{F(x+h) - F(x)}{h}\, \lim_{h \to 0} [1 - F(x+h)]^{n-r} \\
&= \frac{n!}{(r-1)!\,(n-r)!}\, [F(x)]^{r-1}\, F'(x)\, [1 - F(x)]^{n-r} \\
&= \frac{n!}{(r-1)!\,(n-r)!}\, [F(x)]^{r-1}\, f(x)\, [1 - F(x)]^{n-r}.
\end{aligned}$$
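The density formula of Theorem 13.14 translates directly into a small function. The sketch below evaluates $g(x)$ for a given density and cumulative distribution function passed in as callables; the function name `order_stat_pdf` and the two test distributions (exponential with mean 1 and uniform on $[0, 1]$) are choices made for illustration.

```python
import math


def order_stat_pdf(x, r, n, pdf, cdf):
    """Density of the r-th order statistic of a sample of size n (Theorem 13.14)."""
    if not 1 <= r <= n:
        raise ValueError("r must satisfy 1 <= r <= n")
    coeff = math.factorial(n) / (math.factorial(r - 1) * math.factorial(n - r))
    return coeff * cdf(x) ** (r - 1) * pdf(x) * (1.0 - cdf(x)) ** (n - r)


# Minimum of two exponential(1) observations: the formula gives 2 e^{-2y}.
exp_pdf = lambda y: math.exp(-y)
exp_cdf = lambda y: 1.0 - math.exp(-y)
print(order_stat_pdf(0.7, 1, 2, exp_pdf, exp_cdf), 2.0 * math.exp(-1.4))

# Median of three uniform(0, 1) observations: the formula gives 6 y (1 - y).
uni_pdf = lambda y: 1.0
uni_cdf = lambda y: y
print(order_stat_pdf(0.25, 2, 3, uni_pdf, uni_cdf), 6.0 * 0.25 * 0.75)
```

Passing the density and distribution function as arguments keeps the helper independent of any particular population, so the same function covers the minimum, the maximum and every order statistic in between.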

Example 13.18. Let $X_1, X_2$ be a random sample from a distribution with density function

$$f(x) = \begin{cases} e^{-x} & \text{for } 0 \le x < \infty \\ 0 & \text{otherwise.} \end{cases}$$

What is the density function of $Y = \min\{X_1, X_2\}$ where nonzero?

Answer: The cumulative distribution function of $f(x)$ is

$$F(x) = \int_{0}^{x} e^{-t}\,dt = 1 - e^{-x}.$$

In this example, $n = 2$ and $r = 1$. Hence, the density of $Y$ is

$$\begin{aligned}
g(y) &= \frac{2!}{0!\,1!}\,[F(y)]^{0}\, f(y)\, [1 - F(y)] \\
&= 2 f(y)\, [1 - F(y)] \\
&= 2 e^{-y} \left(1 - 1 + e^{-y}\right) \\
&= 2 e^{-2y}.
\end{aligned}$$

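Example 13.19. Let $Y_1 < Y_2 < \cdots < Y_6$ be the order statistics from a random sample of size 6 from a distribution with density function

$$f(x) = \begin{cases} 2x & \text{for } 0 < x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

What is the expected value of $Y_6$?

Answer: Since $f(x) = 2x$, we have

$$F(x) = \int_{0}^{x} 2t\,dt = x^2.$$

The density function of $Y_6$ is given by

$$\begin{aligned}
g(y) &= \frac{6!}{5!\,0!}\,[F(y)]^{5}\, f(y) \\
&= 6 \left(y^2\right)^5 2y \\
&= 12\,y^{11}.
\end{aligned}$$

Hence, the expected value of $Y_6$ is

$$\begin{aligned}
E(Y_6) &= \int_{0}^{1} y\, g(y)\,dy \\
&= \int_{0}^{1} y\, 12\,y^{11}\,dy \\
&= \frac{12}{13} \left[y^{13}\right]_{0}^{1} \\
&= \frac{12}{13}.
\end{aligned}$$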

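Example 13.20. Let $X$, $Y$ and $Z$ be independent uniform random variables on the interval $(0, a)$. Let $W = \min\{X, Y, Z\}$. What is the expected value of $\left(1 - \frac{W}{a}\right)^2$?

Answer: The probability density function of $X$ (or $Y$ or $Z$) is

$$f(x) = \begin{cases} \frac{1}{a} & \text{if } 0 < x < a \\ 0 & \text{otherwise.} \end{cases}$$

Thus the cumulative distribution function of $f(x)$ is given by

$$F(x) = \begin{cases} 0 & \text{if } x \le 0 \\ \frac{x}{a} & \text{if } 0 < x < a \\ 1 & \text{if } x \ge a. \end{cases}$$

Since $W = \min\{X, Y, Z\}$, $W$ is the first order statistic of the random sample $X, Y, Z$. Thus, the density function of $W$ is given by

$$\begin{aligned}
g(w) &= \frac{3!}{0!\,1!\,2!}\,[F(w)]^{0}\, f(w)\, [1 - F(w)]^{2} \\
&= 3 f(w)\, [1 - F(w)]^{2} \\
&= 3 \left(1 - \frac{w}{a}\right)^2 \frac{1}{a} \\
&= \frac{3}{a} \left(1 - \frac{w}{a}\right)^2.
\end{aligned}$$

Thus, the pdf of $W$ is given by

$$g(w) = \begin{cases} \frac{3}{a}\left(1 - \frac{w}{a}\right)^2 & \text{if } 0 < w < a \\ 0 & \text{otherwise.} \end{cases}$$

The expected value of $\left(1 - \frac{W}{a}\right)^2$ is

$$\begin{aligned}
E\left[\left(1 - \frac{W}{a}\right)^2\right] &= \int_{0}^{a} \left(1 - \frac{w}{a}\right)^2 g(w)\,dw \\
&= \int_{0}^{a} \left(1 - \frac{w}{a}\right)^2 \frac{3}{a} \left(1 - \frac{w}{a}\right)^2 dw \\
&= \int_{0}^{a} \frac{3}{a} \left(1 - \frac{w}{a}\right)^4 dw \\
&= -\frac{3}{5} \left[\left(1 - \frac{w}{a}\right)^5\right]_{0}^{a} \\
&= \frac{3}{5}.
\end{aligned}$$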

Example 13.21. Let $X_1, X_2, ..., X_n$ be a random sample from a population $X$ with uniform distribution on the interval $[0, 1]$. What is the probability distribution of the sample range $W := X_{(n)} - X_{(1)}$?

Answer: To find the distribution of $W$, we need the joint distribution of the random variable $\left(X_{(n)}, X_{(1)}\right)$. The joint distribution of $\left(X_{(n)}, X_{(1)}\right)$ is given by

$$h(x_1, x_n) = n(n-1)\, f(x_1)\, f(x_n)\, [F(x_n) - F(x_1)]^{n-2},$$

where $x_n \ge x_1$ and $f(x)$ is the probability density function of $X$. To determine the probability distribution of the sample range $W$, we consider the transformation

$$\left.\begin{aligned} U &= X_{(1)} \\ W &= X_{(n)} - X_{(1)} \end{aligned}\right\}$$

which has an inverse

$$\left.\begin{aligned} X_{(1)} &= U \\ X_{(n)} &= U + W. \end{aligned}\right\}$$

The Jacobian of this transformation is

$$J = \det \begin{pmatrix} 1 & 0 \\ 1 & 1 \end{pmatrix} = 1.$$

Hence the joint density of $(U, W)$ is given by

$$\begin{aligned}
g(u, w) &= |J|\, h(x_1, x_n) \\
&= n(n-1)\, f(u)\, f(u+w)\, [F(u+w) - F(u)]^{n-2}
\end{aligned}$$

where $w \ge 0$. Note that $f(u)$ and $f(u+w)$ are simultaneously nonzero if $0 \le u \le 1$ and $0 \le u + w \le 1$, that is, if $0 \le u \le 1 - w$. Thus, the probability density function of $W$ is given by

$$\begin{aligned}
j(w) &= \int_{-\infty}^{\infty} g(u, w)\,du \\
&= \int_{-\infty}^{\infty} n(n-1)\, f(u)\, f(u+w)\, [F(u+w) - F(u)]^{n-2}\,du \\
&= n(n-1)\, w^{n-2} \int_{0}^{1-w} du \\
&= n(n-1)\,(1 - w)\, w^{n-2}
\end{aligned}$$

where $0 \le w \le 1$.

13.5. Sample Percentiles

The sample median, $M$, is a number such that approximately one-half of the observations are less than $M$ and one-half are greater than $M$.

Definition 13.5. Let $X_1, X_2, ..., X_n$ be a random sample. The sample median $M$ is defined as

$$M = \begin{cases} X_{\left(\frac{n+1}{2}\right)} & \text{if } n \text{ is odd} \\[4pt] \frac{1}{2}\left[X_{\left(\frac{n}{2}\right)} + X_{\left(\frac{n+2}{2}\right)}\right] & \text{if } n \text{ is even.} \end{cases}$$

The median is a measure of location like the sample mean.

Recall that for a continuous distribution, the 100$p$th percentile, $\pi_p$, is a number such that

$$p = \int_{-\infty}^{\pi_p} f(x)\,dx.$$

Definition 13.6. The 100$p$th sample percentile is defined as

$$\pi_p = \begin{cases} X_{([np])} & \text{if } p < 0.5 \\ M & \text{if } p = 0.5 \\ X_{(n+1-[n(1-p)])} & \text{if } p > 0.5, \end{cases}$$

where $[b]$ denotes the number $b$ rounded to the nearest integer.

Example 13.22. Let $X_1, X_2, ..., X_{12}$ be a random sample of size 12. What is the 65th percentile of this sample?

Answer:

$$\begin{aligned}
100p &= 65 \\
p &= 0.65 \\
n(1 - p) &= (12)(1 - 0.65) = 4.2 \\
[n(1 - p)] &= [4.2] = 4.
\end{aligned}$$

Hence, by definition, the 65th percentile is

$$\begin{aligned}
\pi_{0.65} &= X_{(n+1-[n(1-p)])} \\
&= X_{(13-4)} \\
&= X_{(9)}.
\end{aligned}$$

Thus, the 65th percentile of the random sample $X_1, X_2, ..., X_{12}$ is the 9th-order statistic.
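Definitions 13.5 and 13.6 can be sketched as a short function that sorts the sample and picks the appropriate order statistic. Two points below are assumptions rather than part of the definitions: ties in "rounded to the nearest integer" are broken upward (for example 2.5 becomes 3), and an error is raised when the resulting index falls outside $1, ..., n$. The function names are hypothetical.

```python
import math


def nearest_integer(b):
    """Round b to the nearest integer; halves are rounded up (an assumption)."""
    return math.floor(b + 0.5)


def sample_percentile(sample, p):
    """100p-th sample percentile following Definitions 13.5 and 13.6."""
    xs = sorted(sample)
    n = len(xs)
    if p == 0.5:
        # Sample median: middle value, or the average of the two middle values.
        if n % 2 == 1:
            return xs[(n + 1) // 2 - 1]
        return 0.5 * (xs[n // 2 - 1] + xs[n // 2])
    if p < 0.5:
        k = nearest_integer(n * p)
    else:
        k = n + 1 - nearest_integer(n * (1.0 - p))
    if not 1 <= k <= n:
        raise ValueError("percentile index falls outside the sample")
    return xs[k - 1]  # order statistics are 1-indexed, Python lists are 0-indexed


data = list(range(1, 13))  # the values 1, 2, ..., 12, so X_(k) equals k
print(sample_percentile(data, 0.65))  # the 9th order statistic
```

With the sample $1, 2, ..., 12$ each order statistic equals its own index, so the function returning 9 matches the 9th-order statistic found in Example 13.22.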

For any number p between 0 and 1, the 100pth sample percentile is an

observation such that approximately np observations are less than this ob-

servation and n (1  p ) observations are greater than this.

Deﬁnition 13.7. The 25th percentile is called the lower quartile while the

75th percentile is called the upper quartile. The distance between these two

quartiles is called the interquartile range.

Example 13.23. If a sample of size 3 from a uniform distribution over $[0, 1]$ is observed, what is the probability that the sample median is between $\frac{1}{4}$ and $\frac{3}{4}$?

Answer: When a sample of $(2n + 1)$ random variables is observed, the $(n + 1)$th smallest random variable is called the sample median. For our problem, the sample median is given by

$$X_{(2)} = \text{2nd smallest of } \{X_1, X_2, X_3\}.$$

Let $Y = X_{(2)}$. The density function of each $X_i$ is given by

$$f(x) = \begin{cases} 1 & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Hence, the cumulative distribution function of $f(x)$ on $[0, 1]$ is

$$F(x) = x.$$

Thus the density function of $Y$ is given by

$$\begin{aligned}
g(y) &= \frac{3!}{1!\,1!}\,[F(y)]^{2-1}\, f(y)\, [1 - F(y)]^{3-2} \\
&= 6\, F(y)\, f(y)\, [1 - F(y)] \\
&= 6y\,(1 - y).
\end{aligned}$$

Therefore

$$\begin{aligned}
P\left(\frac{1}{4} < Y < \frac{3}{4}\right) &= \int_{1/4}^{3/4} g(y)\,dy \\
&= \int_{1/4}^{3/4} 6y\,(1 - y)\,dy \\
&= 6 \left[\frac{y^2}{2} - \frac{y^3}{3}\right]_{1/4}^{3/4} \\
&= \frac{11}{16}.
\end{aligned}$$

13.6. Review Exercises

1. Suppose we roll a die 1000 times. What is the probability that the sum

of the numbers obtained lies between 3000 and 4000?

2. Suppose Kathy flips a coin 1000 times. What is the probability she will get at least 600 heads?

3. At a certain large university the weights of the male students and female students are approximately normally distributed with means and standard deviations of 180 and 20, and 130 and 15, respectively. If a male and a female are selected at random, what is the probability that the sum of their weights is less than 280?

4. Seven observations are drawn from a population with an unknown con-

tinuous distribution. What is the probability that the least and the greatest

observations bracket the median?

5. If the random variable $X$ has the density function

$$f(x) = \begin{cases} 2\,(1 - x) & \text{for } 0 \le x \le 1 \\ 0 & \text{otherwise,} \end{cases}$$

what is the probability that the larger of 2 independent observations of X

will exceed 1

6. Let X1 , X2, X3 be a random sample from the uniform distribution on the

interval (0, 1). What is the probability that the sample median is less than

0.4?

7. Let $X_1, X_2, X_3, X_4, X_5$ be a random sample from the uniform distribution on the interval $(0, \theta)$, where $\theta$ is unknown, and let $X_{max}$ denote the largest observation. For what value of the constant $k$ is the expected value of the random variable $k X_{max}$ equal to $\theta$?

8. A random sample of size 16 is to be taken from a normal population having

mean 100 and variance 4. What is the 90th percentile of the distribution of

the sample mean?

9. If the density function of a random variable $X$ is given by

$$f(x) = \begin{cases} \frac{1}{2x} & \text{for } \frac{1}{e} < x < e \\ 0 & \text{otherwise,} \end{cases}$$

what is the probability that one of the two independent observations of $X$ is less than 2 and the other is greater than 1?

10. Five observations have been drawn independently and at random from

a continuous distribution. What is the probability that the next observation

will be less than all of the ﬁrst 5?

11. Let the random variable $X$ denote the length of time it takes to complete a mathematics assignment. Suppose the density function of $X$ is given by

$$f(x) = \begin{cases} e^{-(x - \theta)} & \text{for } \theta < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $\theta$ is a positive constant that represents the minimum time to complete a mathematics assignment. If $X_1, X_2, ..., X_5$ is a random sample from this distribution, what is the expected value of $X_{(1)}$?

12. Let $X$ and $Y$ be two independent random variables with identical probability density function given by

$$f(x) = \begin{cases} e^{-x} & \text{for } x > 0 \\ 0 & \text{elsewhere.} \end{cases}$$

What is the probability density function of $W = \max\{X, Y\}$?

13. Let $X$ and $Y$ be two independent random variables with identical probability density function given by

$$f(x) = \begin{cases} \frac{3x^2}{\theta^3} & \text{for } 0 \le x \le \theta \\ 0 & \text{elsewhere,} \end{cases}$$

for some $\theta > 0$. What is the probability density function of $W = \min\{X, Y\}$?

14. Let $X_1, X_2, ..., X_n$ be a random sample from a uniform distribution on the interval from 0 to 5. What is the limiting moment generating function of $\frac{\overline{X} - \mu}{\sigma/\sqrt{n}}$ as $n \to \infty$?

15. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a normal distribution with mean $\mu$ and variance 1. If the 75th percentile of the statistic $W = \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2$ is 28.24, what is the sample size $n$?

16. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a Bernoulli distribution with probability of success $p = \frac{1}{2}$. What is the limiting distribution of the sample mean $\overline{X}$?

17. Let $X_1, X_2, ..., X_{1995}$ be a random sample of size 1995 from a distribution with probability density function

$$f(x) = \frac{e^{-\lambda}\,\lambda^{x}}{x!}, \qquad x = 0, 1, 2, 3, ....$$

What is the distribution of $1995\,\overline{X}$?

18. Suppose X1 , X2 , ..., Xn is a random sample from the uniform distribution

on (0, 1) and Z be the sample range. What is the probability that Z is less

than or equal to 0.5?

19. Let X1 , X2 , ..., X9 be a random sample from a uniform distribution on

the interval [1, 12]. Find the probability that the next to smallest is greater

than or equal to 4?

20. A machine needs 4 out of its 6 independent components to operate. Let $X_1, X_2, ..., X_6$ be the lifetimes of the respective components. Suppose each is exponentially distributed with parameter $\theta$. What is the probability density function of the machine lifetime?

21. Suppose X1 , X2 , ..., X2n+1 is a random sample from the uniform dis-

tribution on (0, 1). What is the probability density function of the sample

median X(n+1) ?

22. Let $X$ and $Y$ be two random variables with joint density

$$f(x, y) = \begin{cases} 12x & \text{if } 0 < y < 2x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

What is the expected value of the random variable $Z = X^2 Y^3 + X^2 - X - Y^3$?

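23. Let $X_1, X_2, ..., X_{50}$ be a random sample of size 50 from a distribution with density

$$f(x) = \begin{cases} \dfrac{1}{\Gamma(\alpha)\,\theta^{\alpha}}\, x^{\alpha - 1} e^{-\frac{x}{\theta}} & \text{for } 0 < x < \infty \\ 0 & \text{otherwise.} \end{cases}$$

What are the mean and variance of the sample mean $\overline{X}$?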

24. Let $X_1, X_2, ..., X_{100}$ be a random sample of size 100 from a distribution with density

$$f(x) = \begin{cases} \dfrac{e^{-\lambda} \lambda^x}{x!} & \text{for } x = 0, 1, 2, ..., \infty \\ 0 & \text{otherwise.} \end{cases}$$

What is the probability that $\overline{X}$ is greater than or equal to 1?


Chapter 14

SAMPLING DISTRIBUTIONS ASSOCIATED WITH THE NORMAL POPULATIONS

Given a random sample $X_1, X_2, ..., X_n$ from a population $X$ with probability distribution $f(x; \theta)$, where $\theta$ is a parameter, a statistic is a function $T$ of $X_1, X_2, ..., X_n$, that is

$$T = T(X_1, X_2, ..., X_n),$$

which is free of the parameter $\theta$. If the distribution of the population is known, then sometimes it is possible to find the probability distribution of the statistic $T$. The probability distribution of the statistic $T$ is called the sampling distribution of $T$. The joint distribution of the random variables $X_1, X_2, ..., X_n$ is called the distribution of the sample. The distribution of the sample is the joint density

$$f(x_1, x_2, ..., x_n; \theta) = f(x_1; \theta)\, f(x_2; \theta) \cdots f(x_n; \theta) = \prod_{i=1}^{n} f(x_i; \theta)$$

since the random variables $X_1, X_2, ..., X_n$ are independent and identically distributed.

Since the normal population is very important in statistics, the sampling distributions associated with the normal population are very important. The most important sampling distributions associated with the normal population are the following: the chi-square distribution, the Student's t-distribution, the F-distribution, and the beta distribution. In this chapter, we only consider the first three distributions, since the last distribution was considered earlier.

14.1. Chi-square distribution

In this section, we treat the Chi-square distribution, which is one of the

very useful sampling distributions.

Definition 14.1. A continuous random variable $X$ is said to have a chi-square distribution with $r$ degrees of freedom if its probability density function is of the form

$$f(x; r) = \begin{cases} \dfrac{1}{\Gamma\left(\frac{r}{2}\right) 2^{\frac{r}{2}}}\, x^{\frac{r}{2} - 1} e^{-\frac{x}{2}} & \text{if } 0 \le x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $r > 0$. If $X$ has a chi-square distribution, then we denote it by writing $X \sim \chi^2(r)$. Recall that a gamma distribution reduces to a chi-square distribution if $\alpha = \frac{r}{2}$ and $\theta = 2$. The mean and variance of $X$ are $r$ and $2r$, respectively.

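Thus, the chi-square distribution is a special case of the gamma distribution. Further, if $r \to \infty$, then the chi-square distribution tends to the normal distribution, in the sense that the standardized variable $\frac{X - r}{\sqrt{2r}}$ converges in distribution to $N(0, 1)$.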

Example 14.1. If $X \sim GAM(1, 1)$, then what is the probability density function of the random variable $2X$?

Answer: We will use the moment generating method to find the distribution of $2X$. The moment generating function of a gamma random variable is given by

$$M(t) = (1 - \theta t)^{-\alpha}, \quad \text{if } t < \frac{1}{\theta}.$$


Since $X \sim GAM(1, 1)$, the moment generating function of $X$ is given by

$$M_X(t) = \frac{1}{1 - t}, \quad t < 1.$$

Hence, the moment generating function of $2X$ is

$$M_{2X}(t) = M_X(2t) = \frac{1}{1 - 2t} = \frac{1}{(1 - 2t)^{\frac{2}{2}}} = \text{MGF of } \chi^2(2).$$

Hence, if $X$ is $GAM(1, 1)$ or is an exponential with parameter 1, then $2X$ is chi-square with 2 degrees of freedom.
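The conclusion of Example 14.1 can be checked empirically. The following is a minimal simulation sketch, not part of the original text: it draws exponential variates with parameter 1, doubles them, and compares the empirical distribution function with the chi-square(2) distribution function $1 - e^{-x/2}$. The function names, the seed, and the number of draws are illustrative assumptions.

```python
import math
import random

def simulate_twice_exponential(n_draws=100_000, seed=0):
    """Draw X ~ EXP(1) n_draws times and return the values of 2X."""
    rng = random.Random(seed)
    return [2.0 * rng.expovariate(1.0) for _ in range(n_draws)]

def chi2_2_cdf(x):
    """Distribution function of chi-square(2): 1 - exp(-x/2) for x > 0."""
    return 1.0 - math.exp(-x / 2.0) if x > 0 else 0.0

samples = simulate_twice_exponential()
for x in (1.0, 2.0, 5.0):
    empirical = sum(s <= x for s in samples) / len(samples)
    print(f"P(2X <= {x}): empirical {empirical:.3f}, chi-square(2) {chi2_2_cdf(x):.3f}")
```

With this many draws the two columns should agree to about two decimal places; the agreement is statistical evidence only, while the moment generating function argument above is the actual proof.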

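Example 14.2. If $X \sim \chi^2(5)$, then what is the probability that $X$ is between 1.145 and 12.83?

Answer: The probability of $X$ between 1.145 and 12.83 can be calculated from the following:

$$\begin{aligned}
P(1.145 \le X \le 12.83) &= P(X \le 12.83) - P(X \le 1.145) \\
&= \int_0^{12.83} f(x)\,dx - \int_0^{1.145} f(x)\,dx \\
&= \int_0^{12.83} \frac{1}{\Gamma\left(\frac{5}{2}\right) 2^{\frac{5}{2}}}\, x^{\frac{5}{2} - 1} e^{-\frac{x}{2}}\,dx - \int_0^{1.145} \frac{1}{\Gamma\left(\frac{5}{2}\right) 2^{\frac{5}{2}}}\, x^{\frac{5}{2} - 1} e^{-\frac{x}{2}}\,dx \\
&= 0.975 - 0.050 \quad (\text{from } \chi^2 \text{ table}) \\
&= 0.925.
\end{aligned}$$

The above integrals are hard to evaluate by hand and thus their values are taken from the chi-square table.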
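Where a table is not at hand, the two integrals can be approximated numerically. The sketch below is an illustration rather than the book's method: it applies the composite Simpson rule to the density of Definition 14.1. The helper names and the number of subintervals are assumptions, and the density helper is written only for $r > 2$, where the density vanishes at 0.

```python
import math

def chi2_pdf(x, r):
    """Chi-square density with r degrees of freedom (valid here for r > 2)."""
    if x <= 0:
        return 0.0  # for r > 2 the density is 0 at the origin
    return x ** (r / 2 - 1) * math.exp(-x / 2) / (math.gamma(r / 2) * 2 ** (r / 2))

def simpson(f, a, b, n=2000):
    """Composite Simpson rule on [a, b] with n (even) subintervals."""
    h = (b - a) / n
    total = f(a) + f(b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * f(a + i * h)
    return total * h / 3

def chi2_cdf(x, r):
    """P(X <= x) for X ~ chi-square(r), by numerical integration from 0 to x."""
    return simpson(lambda t: chi2_pdf(t, r), 0.0, x)

prob = chi2_cdf(12.83, 5) - chi2_cdf(1.145, 5)
print(f"P(1.145 <= X <= 12.83) is approximately {prob:.4f}")
```

The numerical value should agree with the table-based answer 0.925 to about three decimal places.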

Example 14.3. If $X \sim \chi^2(7)$, then what are the values of the constants $a$ and $b$ such that $P(a < X < b) = 0.95$?

Answer: Since

$$0.95 = P(a < X < b) = P(X < b) - P(X < a),$$

we get

$$P(X < b) = 0.95 + P(X < a).$$


We choose $a = 1.690$, so that

$$P(X < 1.690) = 0.025.$$

From this, we get

$$P(X < b) = 0.95 + 0.025 = 0.975.$$

Thus, from the chi-square table, we get $b = 16.01$.

The following theorems were studied earlier in Chapters 6 and 13 and they are very useful in finding the sampling distributions of many statistics. We state these theorems here for the convenience of the reader.

Theorem 14.1. If $X \sim N(\mu, \sigma^2)$, then $\left(\frac{X - \mu}{\sigma}\right)^2 \sim \chi^2(1)$.

Theorem 14.2. If $X \sim N(\mu, \sigma^2)$ and $X_1, X_2, ..., X_n$ is a random sample from the population $X$, then

$$\sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2 \sim \chi^2(n).$$

Theorem 14.3. If $X \sim N(\mu, \sigma^2)$ and $X_1, X_2, ..., X_n$ is a random sample from the population $X$, then

$$\frac{(n - 1)\, S^2}{\sigma^2} \sim \chi^2(n - 1).$$

Theorem 14.4. If $X \sim GAM(\theta, \alpha)$, then

$$\frac{2}{\theta}\, X \sim \chi^2(2\alpha).$$

Example 14.4. A new component is placed in service and $n$ spares are available. If the times to failure in days are independent exponential variables, that is $X_i \sim EXP(100)$, how many spares would be needed to be 95% sure of successful operation for at least two years?

Answer: Since $X_i \sim EXP(100)$,

$$\sum_{i=1}^{n} X_i \sim GAM(100, n).$$


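Hence, by Theorem 14.4, the random variable

$$Y = \frac{2}{100} \sum_{i=1}^{n} X_i \sim \chi^2(2n).$$

We have to find the number of spares $n$ such that

$$\begin{aligned}
0.95 &= P\left(\sum_{i=1}^{n} X_i \ge 2 \text{ years}\right) \\
&= P\left(\sum_{i=1}^{n} X_i \ge 730 \text{ days}\right) \\
&= P\left(\frac{2}{100} \sum_{i=1}^{n} X_i \ge \frac{2}{100} \cdot 730\right) \\
&= P\left(\frac{2}{100} \sum_{i=1}^{n} X_i \ge \frac{730}{50}\right) \\
&= P\left(\chi^2(2n) \ge 14.6\right).
\end{aligned}$$

From the $\chi^2$ table, $2n = 25$. Hence $n = 13$ (after rounding up to the next integer). Thus, 13 spares are needed to be 95% sure of successful operation for at least two years.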
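The table lookup in Example 14.4 can be cross-checked without a chi-square table. The sketch below relies on a standard identity that the text does not state and that is used here as an assumption: for an even number of degrees of freedom, $P(\chi^2(2n) \ge x) = \sum_{k=0}^{n-1} e^{-x/2} (x/2)^k / k!$. The function names are illustrative.

```python
import math

def prob_total_life_at_least(n, mean_life, horizon):
    """P(X_1 + ... + X_n >= horizon) for independent EXP(mean_life) lifetimes.

    Uses P(chi-square(2n) >= x) = sum_{k<n} exp(-x/2) (x/2)^k / k!
    with x = 2 * horizon / mean_life.
    """
    half_x = horizon / mean_life
    return sum(math.exp(-half_x) * half_x ** k / math.factorial(k) for k in range(n))

def smallest_n(mean_life, horizon, confidence):
    """Smallest n for which the total lifetime reaches the horizon with the given confidence."""
    n = 1
    while prob_total_life_at_least(n, mean_life, horizon) < confidence:
        n += 1
    return n

n_needed = smallest_n(mean_life=100.0, horizon=730.0, confidence=0.95)
print(n_needed, prob_total_life_at_least(n_needed, 100.0, 730.0))
```

Searching over integer $n$ directly avoids the rounding step from $2n = 25$ and arrives at the same count as the example.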

Example 14.5. If $X \sim N(10, 25)$ and $X_1, X_2, ..., X_{501}$ is a random sample of size 501 from the population $X$, then what is the expected value of the sample variance $S^2$?

Answer: We will use Theorem 14.3 to do this problem. By Theorem 14.3, we see that

$$\frac{(501 - 1)\, S^2}{\sigma^2} \sim \chi^2(500).$$

Hence, the expected value of $S^2$ is given by

$$\begin{aligned}
E\left[S^2\right] &= E\left[\frac{25}{500} \cdot \frac{500}{25}\, S^2\right] \\
&= \frac{25}{500}\, E\left[\frac{500}{25}\, S^2\right] \\
&= \frac{1}{20}\, E\left[\chi^2(500)\right] \\
&= \frac{1}{20} \cdot 500 \\
&= 25.
\end{aligned}$$


14.2. Student's t-distribution

Here we treat the Student's t-distribution, which is also one of the very useful sampling distributions.

Definition 14.2. A continuous random variable $X$ is said to have a t-distribution with $\nu$ degrees of freedom if its probability density function is of the form

$$f(x; \nu) = \frac{\Gamma\left(\frac{\nu + 1}{2}\right)}{\sqrt{\pi \nu}\; \Gamma\left(\frac{\nu}{2}\right) \left(1 + \frac{x^2}{\nu}\right)^{\frac{\nu + 1}{2}}}, \quad -\infty < x < \infty,$$

where $\nu > 0$. If $X$ has a t-distribution with $\nu$ degrees of freedom, then we denote it by writing $X \sim t(\nu)$.

The t-distribution was discovered by W.S. Gosset (1876-1936) of England, who published his work under the pseudonym of Student. Therefore, this distribution is known as Student's t-distribution. This distribution is a generalization of the Cauchy distribution and the normal distribution. That is, if $\nu = 1$, then the probability density function of $X$ becomes

$$f(x; 1) = \frac{1}{\pi (1 + x^2)}, \quad -\infty < x < \infty,$$

which is the Cauchy distribution. Further, if $\nu \to \infty$, then

$$\lim_{\nu \to \infty} f(x; \nu) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2} x^2}, \quad -\infty < x < \infty,$$

which is the probability density function of the standard normal distribution. The following figure shows the graph of t-distributions with various degrees of freedom.

Example 14.6. If $T \sim t(10)$, then what is the probability that $T$ is at least 2.228?


Answer: The probability that $T$ is at least 2.228 is given by

$$\begin{aligned}
P(T \ge 2.228) &= 1 - P(T < 2.228) \\
&= 1 - 0.975 \quad (\text{from } t \text{ table}) \\
&= 0.025.
\end{aligned}$$

Example 14.7. If $T \sim t(19)$, then what is the value of the constant $c$ such that $P(|T| \le c) = 0.95$?

Answer:

$$\begin{aligned}
0.95 &= P(|T| \le c) \\
&= P(-c \le T \le c) \\
&= P(T \le c) - 1 + P(T \le c) \\
&= 2\, P(T \le c) - 1.
\end{aligned}$$

Hence

$$P(T \le c) = 0.975.$$

Thus, using the t-table, we get for 19 degrees of freedom

$$c = 2.093.$$

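Theorem 14.5. If the random variable $X$ has a t-distribution with $\nu$ degrees of freedom, then

$$E[X] = \begin{cases} 0 & \text{if } \nu \ge 2 \\ \text{DNE} & \text{if } \nu = 1 \end{cases}$$

and

$$Var[X] = \begin{cases} \dfrac{\nu}{\nu - 2} & \text{if } \nu \ge 3 \\ \text{DNE} & \text{if } \nu = 1, 2 \end{cases}$$

where DNE means does not exist.

Theorem 14.6. If $Z \sim N(0, 1)$ and $U \sim \chi^2(\nu)$ and in addition, $Z$ and $U$ are independent, then the random variable $W$ defined by

$$W = \frac{Z}{\sqrt{\frac{U}{\nu}}}$$

has a t-distribution with $\nu$ degrees of freedom.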


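Theorem 14.7. If $X \sim N(\mu, \sigma^2)$ and $X_1, X_2, ..., X_n$ is a random sample from the population $X$, then

$$\frac{\overline{X} - \mu}{\frac{S}{\sqrt{n}}} \sim t(n - 1).$$

Proof: Since each $X_i \sim N(\mu, \sigma^2)$,

$$\overline{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right).$$

Thus,

$$\frac{\overline{X} - \mu}{\frac{\sigma}{\sqrt{n}}} \sim N(0, 1).$$

Further, from Theorem 14.3 we know that

$$\frac{(n - 1)\, S^2}{\sigma^2} \sim \chi^2(n - 1).$$

Since $\overline{X}$ and $S^2$ are independent for a sample from a normal population, Theorem 14.6 applies. Hence

$$\frac{\overline{X} - \mu}{\frac{S}{\sqrt{n}}} = \frac{\dfrac{\overline{X} - \mu}{\sigma / \sqrt{n}}}{\sqrt{\dfrac{(n - 1)\, S^2}{(n - 1)\, \sigma^2}}} \sim t(n - 1) \quad (\text{by Theorem 14.6}).$$

This completes the proof of the theorem.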

Example 14.8. Let $X_1, X_2, X_3, X_4$ be a random sample of size 4 from a standard normal distribution. If the statistic $W$ is given by

$$W = \frac{X_1 - X_2 + X_3}{\sqrt{X_1^2 + X_2^2 + X_3^2 + X_4^2}},$$

then what is the expected value of $W$?

Answer: Since $X_i \sim N(0, 1)$, we get

$$X_1 - X_2 + X_3 \sim N(0, 3)$$

and

$$\frac{X_1 - X_2 + X_3}{\sqrt{3}} \sim N(0, 1).$$

Further, since $X_i \sim N(0, 1)$, we have

$$X_i^2 \sim \chi^2(1)$$


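and hence

$$X_1^2 + X_2^2 + X_3^2 + X_4^2 \sim \chi^2(4).$$

Thus,

$$\frac{\dfrac{X_1 - X_2 + X_3}{\sqrt{3}}}{\sqrt{\dfrac{X_1^2 + X_2^2 + X_3^2 + X_4^2}{4}}} = \frac{2}{\sqrt{3}}\, W \sim t(4).$$

Now using the distribution of $W$, we find the expected value of $W$.

$$\begin{aligned}
E[W] &= \frac{\sqrt{3}}{2}\, E\left[\frac{2}{\sqrt{3}}\, W\right] \\
&= \frac{\sqrt{3}}{2}\, E[t(4)] \\
&= \frac{\sqrt{3}}{2} \cdot 0 \\
&= 0.
\end{aligned}$$

Strictly speaking, Theorem 14.6 requires the numerator and the denominator to be independent, which fails here because both involve $X_1, X_2, X_3$. The conclusion $E[W] = 0$ still holds: $|W| \le \sqrt{3}$ by the Cauchy-Schwarz inequality, so the expectation exists, and $W$ changes sign when every $X_i$ is replaced by $-X_i$, which leaves the joint distribution of the sample unchanged.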

Example 14.9. If $X \sim N(0, 1)$ and $X_1, X_2$ is a random sample of size two from the population $X$, then what is the 75th percentile of the statistic $W = \frac{X_1}{\sqrt{X_2^2}}$?

Answer: Since each $X_i \sim N(0, 1)$, we have

$$X_1 \sim N(0, 1), \qquad X_2^2 \sim \chi^2(1).$$

Hence

$$W = \frac{X_1}{\sqrt{X_2^2}} \sim t(1).$$

The 75th percentile $a$ of $W$ is then given by

$$0.75 = P(W \le a).$$

Hence, from the t-table, we get

$$a = 1.0.$$

Hence the 75th percentile of $W$ is 1.0.
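For one degree of freedom the table value can be confirmed in closed form. Integrating the Cauchy density $\frac{1}{\pi(1 + x^2)}$ gives the distribution function $F(x) = \frac{1}{2} + \frac{1}{\pi}\arctan x$, so the $p$-th quantile is $\tan\left(\pi\left(p - \frac{1}{2}\right)\right)$. The short sketch below evaluates it; the function name is an illustrative assumption.

```python
import math

def t1_quantile(p):
    """Quantile of t(1), the standard Cauchy distribution: tan(pi * (p - 1/2))."""
    return math.tan(math.pi * (p - 0.5))

a = t1_quantile(0.75)
print(f"75th percentile of t(1) is approximately {a:.4f}")
```

This shortcut is specific to $\nu = 1$; for other degrees of freedom the quantile has to come from a table or from numerical integration of the density in Definition 14.2.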

Example 14.10. Suppose $X_1, X_2, ..., X_n$ is a random sample from a normal distribution with mean $\mu$ and variance $\sigma^2$. If $\overline{X} = \frac{1}{n} \sum_{i=1}^{n} X_i$ and $V^2 = \frac{1}{n} \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2$, and $X_{n+1}$ is an additional observation, what is the value of $m$ so that the statistic $\frac{m\left(\overline{X} - X_{n+1}\right)}{V}$ has a t-distribution?

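Answer: Since

$$\begin{aligned}
& X_i \sim N(\mu, \sigma^2) \\
\Rightarrow\ & \overline{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right) \\
\Rightarrow\ & \overline{X} - X_{n+1} \sim N\left(\mu - \mu,\ \frac{\sigma^2}{n} + \sigma^2\right) \\
\Rightarrow\ & \overline{X} - X_{n+1} \sim N\left(0,\ \frac{n + 1}{n}\, \sigma^2\right) \\
\Rightarrow\ & \frac{\overline{X} - X_{n+1}}{\sigma \sqrt{\frac{n + 1}{n}}} \sim N(0, 1).
\end{aligned}$$

Now, we establish a relationship between $V^2$ and $S^2$. We know that

$$\begin{aligned}
(n - 1)\, S^2 &= (n - 1)\, \frac{1}{n - 1} \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2 \\
&= \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2 \\
&= n \left(\frac{1}{n} \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2\right) \\
&= n\, V^2.
\end{aligned}$$

Hence, by Theorem 14.3,

$$\frac{n\, V^2}{\sigma^2} = \frac{(n - 1)\, S^2}{\sigma^2} \sim \chi^2(n - 1).$$

Thus

$$\sqrt{\frac{n - 1}{n + 1}}\; \frac{\overline{X} - X_{n+1}}{V} = \frac{\dfrac{\overline{X} - X_{n+1}}{\sigma \sqrt{\frac{n + 1}{n}}}}{\sqrt{\dfrac{n\, V^2}{\sigma^2\, (n - 1)}}} \sim t(n - 1).$$

Thus by comparison, we get

$$m = \sqrt{\frac{n - 1}{n + 1}}.$$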


14.3. Snedecor's F-distribution

The next sampling distribution to be discussed in this chapter is Snedecor's F-distribution. This distribution has many applications in mathematical statistics. In the analysis of variance, this distribution is used to develop the technique for testing the equality of population means.

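Definition 14.3. A continuous random variable $X$ is said to have an F-distribution with $\nu_1$ and $\nu_2$ degrees of freedom if its probability density function is of the form

$$f(x; \nu_1, \nu_2) = \begin{cases} \dfrac{\Gamma\left(\frac{\nu_1 + \nu_2}{2}\right) \left(\frac{\nu_1}{\nu_2}\right)^{\frac{\nu_1}{2}} x^{\frac{\nu_1}{2} - 1}}{\Gamma\left(\frac{\nu_1}{2}\right) \Gamma\left(\frac{\nu_2}{2}\right) \left(1 + \frac{\nu_1}{\nu_2}\, x\right)^{\frac{\nu_1 + \nu_2}{2}}} & \text{if } 0 \le x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $\nu_1, \nu_2 > 0$. If $X$ has an F-distribution with $\nu_1$ and $\nu_2$ degrees of freedom, then we denote it by writing $X \sim F(\nu_1, \nu_2)$.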

The F-distribution was named in honor of Sir Ronald Fisher by George Snedecor. The F-distribution arises as the distribution of a ratio of variances. Like the other two distributions, this distribution also tends to the normal distribution as $\nu_1$ and $\nu_2$ become very large. The following figure illustrates the shape of the graph of this distribution for various degrees of freedom.

The following theorem gives us the mean and variance of Snedecor's F-distribution.

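Theorem 14.8. If the random variable $X \sim F(\nu_1, \nu_2)$, then

$$E[X] = \begin{cases} \dfrac{\nu_2}{\nu_2 - 2} & \text{if } \nu_2 \ge 3 \\ \text{DNE} & \text{if } \nu_2 = 1, 2 \end{cases}$$

and

$$Var[X] = \begin{cases} \dfrac{2 \nu_2^2 (\nu_1 + \nu_2 - 2)}{\nu_1 (\nu_2 - 2)^2 (\nu_2 - 4)} & \text{if } \nu_2 \ge 5 \\ \text{DNE} & \text{if } \nu_2 = 1, 2, 3, 4. \end{cases}$$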


Here DNE means does not exist.

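Example 14.11. If $X \sim F(9, 10)$, what is $P(X \ge 3.02)$? Also, find the mean and variance of $X$.

Answer:

$$\begin{aligned}
P(X \ge 3.02) &= 1 - P(X \le 3.02) \\
&= 1 - P(F(9, 10) \le 3.02) \\
&= 1 - 0.95 \quad (\text{from } F \text{ table}) \\
&= 0.05.
\end{aligned}$$

Next, we determine the mean and variance of $X$ using Theorem 14.8. Hence,

$$E(X) = \frac{\nu_2}{\nu_2 - 2} = \frac{10}{10 - 2} = \frac{10}{8} = 1.25$$

and

$$\begin{aligned}
Var(X) &= \frac{2 \nu_2^2 (\nu_1 + \nu_2 - 2)}{\nu_1 (\nu_2 - 2)^2 (\nu_2 - 4)} \\
&= \frac{2\, (10)^2\, (19 - 2)}{9\, (8)^2\, (6)} \\
&= \frac{(25)\, (17)}{(27)\, (16)} \\
&= \frac{425}{432} = 0.9838.
\end{aligned}$$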
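The arithmetic of Theorem 14.8 is easy to mechanize. The sketch below uses exact rational arithmetic so that the fractions in the example can be compared directly; it assumes integer degrees of freedom, and the function names are illustrative rather than taken from the text.

```python
from fractions import Fraction

def f_mean(nu1, nu2):
    """Mean of F(nu1, nu2) per Theorem 14.8; None when it does not exist."""
    if nu2 < 3:
        return None
    return Fraction(nu2, nu2 - 2)

def f_variance(nu1, nu2):
    """Variance of F(nu1, nu2) per Theorem 14.8; None when it does not exist."""
    if nu2 < 5:
        return None
    return Fraction(2 * nu2 ** 2 * (nu1 + nu2 - 2),
                    nu1 * (nu2 - 2) ** 2 * (nu2 - 4))

print(f_mean(9, 10), f_variance(9, 10), float(f_variance(9, 10)))
```

Returning `None` for the cases marked DNE keeps the "does not exist" cases distinct from a numerical value.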

Theorem 14.9. If $X \sim F(\nu_1, \nu_2)$, then the random variable $\frac{1}{X} \sim F(\nu_2, \nu_1)$.

This theorem is very useful for computing probabilities like $P(X \le 0.2439)$. If you look at an F-table, you will notice that the table starts with values bigger than 1. Our next example illustrates how to find such probabilities using Theorem 14.9.

Example 14.12. If $X \sim F(6, 9)$, what is the probability that $X$ is less than or equal to 0.2439?


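Answer: We use the above theorem to compute

$$\begin{aligned}
P(X \le 0.2439) &= P\left(\frac{1}{X} \ge \frac{1}{0.2439}\right) \\
&= P\left(F(9, 6) \ge \frac{1}{0.2439}\right) \quad (\text{by Theorem 14.9}) \\
&= 1 - P\left(F(9, 6) \le \frac{1}{0.2439}\right) \\
&= 1 - P(F(9, 6) \le 4.10) \\
&= 1 - 0.95 \\
&= 0.05.
\end{aligned}$$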

The following theorem says that the F-distribution arises as the distribution of a random variable which is the quotient of two independently distributed chi-square random variables, each of which is divided by its degrees of freedom.

Theorem 14.10. If $U \sim \chi^2(\nu_1)$ and $V \sim \chi^2(\nu_2)$, and the random variables $U$ and $V$ are independent, then the random variable

$$\frac{U / \nu_1}{V / \nu_2} \sim F(\nu_1, \nu_2).$$

Example 14.13. Let $X_1, X_2, ..., X_4$ and $Y_1, Y_2, ..., Y_5$ be two random samples of size 4 and 5 respectively, from a standard normal population. What is the variance of the statistic

$$T = \frac{5}{4} \cdot \frac{X_1^2 + X_2^2 + X_3^2 + X_4^2}{Y_1^2 + Y_2^2 + Y_3^2 + Y_4^2 + Y_5^2}\ ?$$

Answer: Since the population is standard normal, we get

$$X_1^2 + X_2^2 + X_3^2 + X_4^2 \sim \chi^2(4).$$

Similarly,

$$Y_1^2 + Y_2^2 + Y_3^2 + Y_4^2 + Y_5^2 \sim \chi^2(5).$$

Thus

$$T = \frac{5}{4} \cdot \frac{X_1^2 + X_2^2 + X_3^2 + X_4^2}{Y_1^2 + Y_2^2 + Y_3^2 + Y_4^2 + Y_5^2} = \frac{\dfrac{X_1^2 + X_2^2 + X_3^2 + X_4^2}{4}}{\dfrac{Y_1^2 + Y_2^2 + Y_3^2 + Y_4^2 + Y_5^2}{5}} \sim F(4, 5).$$


Therefore

$$\begin{aligned}
Var(T) &= Var[F(4, 5)] \\
&= \frac{2\, (5)^2\, (7)}{4\, (3)^2\, (1)} \\
&= \frac{350}{36} \\
&= 9.72.
\end{aligned}$$

Theorem 14.11. Let $X \sim N(\mu_1, \sigma_1^2)$ and $X_1, X_2, ..., X_n$ be a random sample of size $n$ from the population $X$. Let $Y \sim N(\mu_2, \sigma_2^2)$ and $Y_1, Y_2, ..., Y_m$ be a random sample of size $m$ from the population $Y$. Then the statistic

$$\frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} \sim F(n - 1, m - 1),$$

where $S_1^2$ and $S_2^2$ denote the sample variances of the first and the second sample, respectively.

Proof: Since

$$X_i \sim N(\mu_1, \sigma_1^2),$$

by Theorem 14.3 we get

$$\frac{(n - 1)\, S_1^2}{\sigma_1^2} \sim \chi^2(n - 1).$$

Similarly, since

$$Y_i \sim N(\mu_2, \sigma_2^2),$$

by Theorem 14.3 we get

$$\frac{(m - 1)\, S_2^2}{\sigma_2^2} \sim \chi^2(m - 1).$$

Therefore

$$\frac{S_1^2 / \sigma_1^2}{S_2^2 / \sigma_2^2} = \frac{\dfrac{(n - 1)\, S_1^2}{(n - 1)\, \sigma_1^2}}{\dfrac{(m - 1)\, S_2^2}{(m - 1)\, \sigma_2^2}} \sim F(n - 1, m - 1).$$

This completes the proof of the theorem.

Because of this theorem, the F-distribution is also known as the variance-ratio distribution.


14.4. Review Exercises

1. Let $X_1, X_2, ..., X_5$ be a random sample of size 5 from a normal distribution with mean zero and standard deviation 2. Find the sampling distribution of the statistic $X_1 + 2X_2 - X_3 + X_4 + X_5$.

2. Let $X_1, X_2, X_3$ be a random sample of size 3 from a standard normal distribution. Find the distribution of $X_1^2 + X_2^2 + X_3^2$. If possible, find the sampling distribution of $X_1^2 - X_2^2$. If not, justify why you cannot determine its distribution.

3. Let $X_1, X_2, ..., X_6$ be a random sample of size 6 from a standard normal distribution. Find the sampling distribution of the statistics $\frac{X_1 + X_2 + X_3}{\sqrt{X_4^2 + X_5^2 + X_6^2}}$ and $\frac{X_1 - X_2 - X_3}{\sqrt{X_4^2 + X_5^2 + X_6^2}}$.

4. Let $X_1, X_2, X_3$ be a random sample of size 3 from an exponential distribution with a parameter $\theta > 0$. Find the distribution of the sample (that is, the joint distribution of the random variables $X_1, X_2, X_3$).

5. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a normal population with mean $\mu$ and variance $\sigma^2 > 0$. What is the expected value of the sample variance $S^2 = \frac{1}{n - 1} \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2$?

6. Let $X_1, X_2, X_3, X_4$ be a random sample of size 4 from a standard normal population. Find the distribution of the statistic $\frac{X_1 + X_4}{\sqrt{X_2^2 + X_3^2}}$.

7. Let $X_1, X_2, X_3, X_4$ be a random sample of size 4 from a standard normal population. Find the sampling distribution (if possible) and moment generating function of the statistic $2X_1^2 + 3X_2^2 + X_3^2 + 4X_4^2$. What is the probability distribution of the sample?

8. Let $X$ equal the maximal oxygen intake of a human on a treadmill, where the measurements are in milliliters of oxygen per minute per kilogram of weight. Assume that for a particular population the mean of $X$ is $\mu = 54.03$ and the standard deviation is $\sigma = 5.8$. Let $\overline{X}$ be the sample mean of a random sample $X_1, X_2, ..., X_{47}$ of size 47 drawn from $X$. Find the probability that the sample mean is between 52.761 and 54.453.

9. Let $X_1, X_2, ..., X_n$ be a random sample from a normal distribution with mean $\mu$ and variance $\sigma^2$. What is the variance of $V^2 = \frac{1}{n} \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2$?

10. If $X$ is a random variable with mean $\mu$ and variance $\sigma^2$, then $\mu - 2\sigma$ is called the lower $2\sigma$ point of $X$. Suppose a random sample $X_1, X_2, X_3, X_4$ is drawn from a chi-square distribution with two degrees of freedom. What is the lower $2\sigma$ point of $X_1 + X_2 + X_3 + X_4$?

11. Let $X$ and $Y$ be independent normal random variables such that the mean and variance of $X$ are 2 and 4, respectively, while the mean and variance of $Y$ are 6 and $k$, respectively. A sample of size 4 is taken from the $X$-distribution and a sample of size 9 is taken from the $Y$-distribution. If $P\left(\overline{Y} - \overline{X} > 8\right) = 0.0228$, then what is the value of the constant $k$?

12. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution with density function

$$f(x; \lambda) = \begin{cases} \lambda\, e^{-\lambda x} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise.} \end{cases}$$

What is the distribution of the statistic $Y = 2\lambda \sum_{i=1}^{n} X_i$?

13. Suppose $X$ has a normal distribution with mean 0 and variance 1, $Y$ has a chi-square distribution with $n$ degrees of freedom, $W$ has a chi-square distribution with $p$ degrees of freedom, and $W$, $X$, and $Y$ are independent. What is the sampling distribution of the statistic $V = \frac{X}{\sqrt{\frac{W + Y}{p + n}}}$?

14. A random sample $X_1, X_2, ..., X_n$ of size $n$ is selected from a normal population with mean $\mu$ and standard deviation 1. Later an additional independent observation $X_{n+1}$ is obtained from the same population. What is the distribution of the statistic $(X_{n+1} - \mu)^2 + \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2$, where $\overline{X}$ denotes the sample mean?

15. Let $T = \frac{k\,(X + Y)}{\sqrt{Z^2 + W^2}}$, where $X$, $Y$, $Z$, and $W$ are independent normal random variables with mean 0 and variance $\sigma^2 > 0$. For exactly one value of $k$, $T$ has a t-distribution. If $r$ denotes the degrees of freedom of that distribution, then what is the value of the pair $(k, r)$?

16. Let $X$ and $Y$ be joint normal random variables with common mean 0, common variance 1, and covariance $\frac{1}{2}$. What is the probability of the event $X + Y \le \sqrt{3}$, that is $P\left(X + Y \le \sqrt{3}\right)$?

17. Suppose $X_j = Z_j - Z_{j-1}$, where $j = 1, 2, ..., n$ and $Z_0, Z_1, ..., Z_n$ are independent and identically distributed with common variance $\sigma^2$. What is the variance of the random variable $\frac{1}{n} \sum_{j=1}^{n} X_j$?

18. A random sample of size 5 is taken from a normal distribution with mean 0 and standard deviation 2. Find the constant $k$ such that 0.05 is equal to the probability that the sum of the squares of the sample observations exceeds the constant $k$.

19. Let $X_1, X_2, ..., X_n$ and $Y_1, Y_2, ..., Y_n$ be two random samples from independent normal distributions with $Var[X_i] = \sigma^2$ and $Var[Y_i] = 2\sigma^2$, for $i = 1, 2, ..., n$ and $\sigma^2 > 0$. If $U = \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2$ and $V = \sum_{i=1}^{n} \left(Y_i - \overline{Y}\right)^2$, then what is the sampling distribution of the statistic $\frac{2U + V}{2\sigma^2}$?

20. Suppose $X_1, X_2, ..., X_6$ and $Y_1, Y_2, ..., Y_9$ are independent, identically distributed normal random variables, each with mean zero and variance $\sigma^2 > 0$. What is the 95th percentile of the statistic

$$W = \frac{\sum_{i=1}^{6} X_i^2}{\sum_{j=1}^{9} Y_j^2}\ ?$$

21. Let $X_1, X_2, ..., X_6$ and $Y_1, Y_2, ..., Y_8$ be independent random samples from a normal distribution with mean 0 and variance 1, and

$$Z = \frac{4 \sum_{i=1}^{6} X_i^2}{3 \sum_{j=1}^{8} Y_j^2}\ ?$$

22. Give a proof of Theorem 14.9.


Chapter 15

SOME TECHNIQUES FOR FINDING POINT ESTIMATORS OF PARAMETERS

A statistical population consists of all the measurements of interest in a statistical investigation. Usually a population is described by a random variable $X$. If we can gain some knowledge about the probability density function $f(x; \theta)$ of $X$, then we also gain some knowledge about the population under investigation.

A sample is a portion of the population usually chosen by a method of random sampling, and as such it is a set of random variables $X_1, X_2, ..., X_n$ with the same probability density function $f(x; \theta)$ as the population. Once the sampling is done, we get

$$X_1 = x_1,\ X_2 = x_2,\ \cdots,\ X_n = x_n$$

where $x_1, x_2, ..., x_n$ are the sample data.

Every statistical method employs a random sample to gain information about the population. Since the population is characterized by the probability density function $f(x; \theta)$, in statistics one makes statistical inferences about the population distribution $f(x; \theta)$ based on sample information. A statistical inference is a statement based on sample information about the population. There are three types of statistical inferences: (1) estimation, (2)


hypothesis testing, and (3) prediction. The goal of this chapter is to examine some well known point estimation methods.

In point estimation, we try to find the parameter $\theta$ of the population distribution $f(x; \theta)$ from the sample information. Thus, in parametric point estimation one assumes the functional form of the pdf $f(x; \theta)$ to be known and estimates only the unknown parameter $\theta$ of the population using information available from the sample.

Definition 15.1. Let $X$ be a population with the density function $f(x; \theta)$, where $\theta$ is an unknown parameter. The set of all admissible values of $\theta$ is called a parameter space and it is denoted by $\Omega$, that is

$$\Omega = \{\theta \in \mathbb{R}^m \mid f(x; \theta) \text{ is a pdf}\}$$

for some natural number $m$.

Example 15.1. If $X \sim EXP(\theta)$, then what is the parameter space of $\theta$?

Answer: Since $X \sim EXP(\theta)$, the density function of $X$ is given by

$$f(x; \theta) = \frac{1}{\theta}\, e^{-\frac{x}{\theta}}.$$

If $\theta$ is zero or negative then $f(x; \theta)$ is not a density function. Thus, the admissible values of $\theta$ are all the positive real numbers. Hence

$$\Omega = \{\theta \in \mathbb{R} \mid 0 < \theta < \infty\} = \mathbb{R}_+.$$

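Example 15.2. If $X \sim N(\mu, \sigma^2)$, what is the parameter space?

Answer: The parameter space $\Omega$ is given by

$$\begin{aligned}
\Omega &= \left\{\theta \in \mathbb{R}^2 \mid f(x; \theta) \sim N(\mu, \sigma^2)\right\} \\
&= \left\{(\mu, \sigma) \in \mathbb{R}^2 \mid -\infty < \mu < \infty,\ 0 < \sigma < \infty\right\} \\
&= \mathbb{R} \times \mathbb{R}_+ \\
&= \text{upper half plane.}
\end{aligned}$$

In general, a parameter space is a subset of $\mathbb{R}^m$. Statistics is concerned with the estimation of the unknown parameter $\theta$ from a random sample $X_1, X_2, ..., X_n$. Recall that a statistic is a function of $X_1, X_2, ..., X_n$ and free of the population parameter $\theta$.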


Definition 15.2. Let $X \sim f(x; \theta)$ and $X_1, X_2, ..., X_n$ be a random sample from the population $X$. Any statistic that can be used to guess the parameter $\theta$ is called an estimator of $\theta$. The numerical value of this statistic is called an estimate of $\theta$. The estimator of the parameter $\theta$ is denoted by $\widehat{\theta}$.

One of the basic problems is how to find an estimator of the population parameter $\theta$. There are several methods for finding an estimator of $\theta$. Some of these methods are:

(1) Moment Method

(2) Maximum Likelihood Method

(3) Bayes Method

(4) Least Squares Method

(5) Minimum Chi-Squares Method

(6) Minimum Distance Method

In this chapter, we only discuss the ﬁrst three methods of estimating a

population parameter.

15.1. Moment Method

Let $X_1, X_2, ..., X_n$ be a random sample from a population $X$ with probability density function $f(x; \theta_1, \theta_2, ..., \theta_m)$, where $\theta_1, \theta_2, ..., \theta_m$ are $m$ unknown parameters. Let

$$E\left(X^k\right) = \int_{-\infty}^{\infty} x^k f(x; \theta_1, \theta_2, ..., \theta_m)\, dx$$

be the $k$th population moment about 0. Further, let

$$M_k = \frac{1}{n} \sum_{i=1}^{n} X_i^k$$

be the $k$th sample moment about 0.

In the moment method, we find the estimators for the parameters $\theta_1, \theta_2, ..., \theta_m$ by equating the first $m$ population moments (if they exist) to the first $m$ sample moments, that is

$$\begin{aligned}
E(X) &= M_1 \\
E\left(X^2\right) &= M_2 \\
E\left(X^3\right) &= M_3 \\
&\ \ \vdots \\
E\left(X^m\right) &= M_m.
\end{aligned}$$
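The sample moments on the right-hand side of these equations are simple averages of powers of the data. The following minimal sketch computes $M_k$ for a small data set; the function name and the data values are hypothetical and serve only as an illustration.

```python
def sample_moment(data, k):
    """k-th sample moment about 0: M_k = (1/n) * sum of x_i ** k."""
    return sum(x ** k for x in data) / len(data)

data = [1.0, 2.0, 3.0, 4.0]  # hypothetical observations
first_moment = sample_moment(data, 1)   # the sample mean
second_moment = sample_moment(data, 2)
print(first_moment, second_moment)
```

The moment method then solves the system obtained by setting these numbers equal to the population moments, which are functions of the unknown parameters.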


The moment method is one of the classical methods for estimating parameters, and its motivation comes from the fact that the sample moments are in some sense estimates of the population moments. The moment method was first discovered by the British statistician Karl Pearson in 1902. Now we provide some examples to illustrate this method.

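Example 15.3. Let $X \sim N(\mu, \sigma^2)$ and $X_1, X_2, ..., X_n$ be a random sample of size $n$ from the population $X$. What are the estimators of the population parameters $\mu$ and $\sigma^2$ if we use the moment method?

Answer: Since the population is normal, that is

$$X \sim N(\mu, \sigma^2),$$

we know that

$$E(X) = \mu, \qquad E\left(X^2\right) = \sigma^2 + \mu^2.$$

Hence

$$\mu = E(X) = M_1 = \frac{1}{n} \sum_{i=1}^{n} X_i = \overline{X}.$$

Therefore, the estimator of the parameter $\mu$ is $\overline{X}$, that is

$$\widehat{\mu} = \overline{X}.$$

Next, we find the estimator of $\sigma^2$ by equating $E(X^2)$ to $M_2$. Note that

$$\begin{aligned}
\sigma^2 &= \sigma^2 + \mu^2 - \mu^2 \\
&= E\left(X^2\right) - \mu^2 \\
&= M_2 - \mu^2 \\
&= \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \overline{X}^2 \\
&= \frac{1}{n} \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2.
\end{aligned}$$

The last line follows from the fact that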




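$$\begin{aligned}
\frac{1}{n} \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2 &= \frac{1}{n} \sum_{i=1}^{n} \left(X_i^2 - 2 X_i \overline{X} + \overline{X}^2\right) \\
&= \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \frac{1}{n} \sum_{i=1}^{n} 2 X_i \overline{X} + \frac{1}{n} \sum_{i=1}^{n} \overline{X}^2 \\
&= \frac{1}{n} \sum_{i=1}^{n} X_i^2 - 2 \overline{X}\, \frac{1}{n} \sum_{i=1}^{n} X_i + \overline{X}^2 \\
&= \frac{1}{n} \sum_{i=1}^{n} X_i^2 - 2 \overline{X}\, \overline{X} + \overline{X}^2 \\
&= \frac{1}{n} \sum_{i=1}^{n} X_i^2 - \overline{X}^2.
\end{aligned}$$

Thus, the estimator of $\sigma^2$ is $\frac{1}{n} \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2$, that is

$$\widehat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2.$$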
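As a numerical illustration of Example 15.3, the sketch below computes the two moment estimators from data and confirms that $M_2 - M_1^2$ coincides with the average squared deviation from the mean. The function name and the data are hypothetical; `statistics.pvariance` from the Python standard library is used only as an independent check, since it also divides by $n$.

```python
import statistics

def moment_estimates_normal(data):
    """Moment estimators for N(mu, sigma^2): (M_1, M_2 - M_1^2)."""
    n = len(data)
    m1 = sum(data) / n
    m2 = sum(x * x for x in data) / n
    return m1, m2 - m1 ** 2

data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]  # hypothetical observations
mu_hat, sigma2_hat = moment_estimates_normal(data)
print(mu_hat, sigma2_hat, statistics.pvariance(data))
```

Note that this estimator divides by $n$, so it differs from the sample variance $S^2$, which divides by $n - 1$.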

Example 15.4. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a population $X$ with probability density function

$$f(x; \theta) = \begin{cases} \theta\, x^{\theta - 1} & \text{if } 0 < x < 1 \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < \theta < \infty$ is an unknown parameter. Using the method of moments, find an estimator of $\theta$. If $x_1 = 0.2$, $x_2 = 0.6$, $x_3 = 0.5$, $x_4 = 0.3$ is a random sample of size 4, then what is the estimate of $\theta$?

Answer: To find an estimator, we shall equate the population moment to the sample moment. The population moment $E(X)$ is given by

$$\begin{aligned}
E(X) &= \int_0^1 x\, f(x; \theta)\, dx \\
&= \int_0^1 x\, \theta\, x^{\theta - 1}\, dx \\
&= \theta \int_0^1 x^{\theta}\, dx \\
&= \frac{\theta}{\theta + 1} \left[x^{\theta + 1}\right]_0^1 \\
&= \frac{\theta}{\theta + 1}.
\end{aligned}$$


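We know that $M_1 = \overline{X}$. Now setting $M_1$ equal to $E(X)$ and solving for $\theta$, we get

$$\overline{X} = \frac{\theta}{\theta + 1},$$

that is

$$\theta = \frac{\overline{X}}{1 - \overline{X}},$$

where $\overline{X}$ is the sample mean. Thus, the statistic $\frac{\overline{X}}{1 - \overline{X}}$ is an estimator of the parameter $\theta$. Hence

$$\widehat{\theta} = \frac{\overline{X}}{1 - \overline{X}}.$$

Since $x_1 = 0.2$, $x_2 = 0.6$, $x_3 = 0.5$, $x_4 = 0.3$, we have $\overline{X} = 0.4$ and

$$\widehat{\theta} = \frac{0.4}{1 - 0.4} = \frac{2}{3}$$

is an estimate of $\theta$.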
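The estimate in Example 15.4 can be reproduced in a few lines. The sketch below applies $\widehat{\theta} = \overline{X} / (1 - \overline{X})$ to the four observations given in the example; the function name is an illustrative assumption.

```python
def moment_estimate_theta(data):
    """Moment estimator for f(x; theta) = theta * x**(theta - 1) on (0, 1)."""
    xbar = sum(data) / len(data)
    return xbar / (1.0 - xbar)

theta_hat = moment_estimate_theta([0.2, 0.6, 0.5, 0.3])
print(f"moment estimate of theta is approximately {theta_hat:.4f}")
```

The sample mean is 0.4, so the estimate is $0.4 / 0.6$, which is two thirds up to floating-point rounding.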

Example 15.5. What is the basic principle of the moment method?

Answer: To choose a value for the unknown population parameter for which

the observed data have the same moments as the population.

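Example 15.6. Suppose $X_1, X_2, ..., X_7$ is a random sample from a population $X$ with density function

$$f(x; \beta) = \begin{cases} \dfrac{x^6\, e^{-\frac{x}{\beta}}}{\Gamma(7)\, \beta^7} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise.} \end{cases}$$

Find an estimator of $\beta$ by the moment method.

Answer: Since we have only one parameter, we need to compute only the first population moment $E(X)$ about 0. Thus,

$$\begin{aligned}
E(X) &= \int_0^{\infty} x\, f(x; \beta)\, dx \\
&= \int_0^{\infty} x\, \frac{x^6\, e^{-\frac{x}{\beta}}}{\Gamma(7)\, \beta^7}\, dx \\
&= \frac{1}{\Gamma(7)} \int_0^{\infty} \left(\frac{x}{\beta}\right)^7 e^{-\frac{x}{\beta}}\, dx \\
&= \beta\, \frac{1}{\Gamma(7)} \int_0^{\infty} y^7 e^{-y}\, dy \\
&= \beta\, \frac{1}{\Gamma(7)}\, \Gamma(8) \\
&= 7\beta.
\end{aligned}$$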


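Since $M_1 = \overline{X}$, equating $E(X)$ to $M_1$, we get

$$7\beta = \overline{X},$$

that is

$$\beta = \frac{1}{7}\, \overline{X}.$$

Therefore, the estimator of $\beta$ by the moment method is given by

$$\widehat{\beta} = \frac{1}{7}\, \overline{X}.$$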

Example 15.7. Suppose $X_1, X_2, ..., X_n$ is a random sample from a population $X$ with density function

$$f(x; \theta) = \begin{cases} \dfrac{1}{\theta} & \text{if } 0 < x < \theta \\ 0 & \text{otherwise.} \end{cases}$$

Find an estimator of $\theta$ by the moment method.

Answer: Examining the density function of the population $X$, we see that $X \sim UNIF(0, \theta)$. Therefore

$$E(X) = \frac{\theta}{2}.$$

Now, equating this population moment to the sample moment, we obtain

$$\frac{\theta}{2} = E(X) = M_1 = \overline{X}.$$

Therefore, the estimator of $\theta$ is

$$\widehat{\theta} = 2\, \overline{X}.$$

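Example 15.8. Suppose $X_1, X_2, ..., X_n$ is a random sample from a population $X$ with density function

$$f(x; \alpha, \beta) = \begin{cases} \dfrac{1}{\beta - \alpha} & \text{if } \alpha < x < \beta \\ 0 & \text{otherwise.} \end{cases}$$

Find the estimators of $\alpha$ and $\beta$ by the moment method.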


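Answer: Examining the density function of the population $X$, we see that $X \sim UNIF(\alpha, \beta)$. Since the distribution has two unknown parameters, we need the first two population moments. Therefore

$$E(X) = \frac{\alpha + \beta}{2} \quad \text{and} \quad E\left(X^2\right) = \frac{(\beta - \alpha)^2}{12} + E(X)^2.$$

Equating these moments to the corresponding sample moments, we obtain

$$\frac{\alpha + \beta}{2} = E(X) = M_1 = \overline{X},$$

that is

$$\alpha + \beta = 2\, \overline{X} \qquad (1)$$

and

$$\frac{(\beta - \alpha)^2}{12} + E(X)^2 = E\left(X^2\right) = M_2 = \frac{1}{n} \sum_{i=1}^{n} X_i^2,$$

which is

$$\begin{aligned}
(\beta - \alpha)^2 &= 12 \left[\frac{1}{n} \sum_{i=1}^{n} X_i^2 - E(X)^2\right] \\
&= 12 \left[\frac{1}{n} \sum_{i=1}^{n} X_i^2 - \overline{X}^2\right] \\
&= 12 \left[\frac{1}{n} \sum_{i=1}^{n} \left(X_i^2 - \overline{X}^2\right)\right].
\end{aligned}$$

Hence, we get

$$\beta - \alpha = \pm \sqrt{\frac{12}{n} \sum_{i=1}^{n} \left(X_i^2 - \overline{X}^2\right)}. \qquad (2)$$

Adding equation (1) to equation (2), we obtain

$$2\beta = 2\, \overline{X} \pm 2 \sqrt{\frac{3}{n} \sum_{i=1}^{n} \left(X_i^2 - \overline{X}^2\right)},$$

that is

$$\beta = \overline{X} \pm \sqrt{\frac{3}{n} \sum_{i=1}^{n} \left(X_i^2 - \overline{X}^2\right)}.$$

Similarly, subtracting (2) from (1), we get

$$\alpha = \overline{X} \mp \sqrt{\frac{3}{n} \sum_{i=1}^{n} \left(X_i^2 - \overline{X}^2\right)}.$$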


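Since $\alpha < \beta$, we see that the estimators of $\alpha$ and $\beta$ are

$$\widehat{\alpha} = \overline{X} - \sqrt{\frac{3}{n} \sum_{i=1}^{n} \left(X_i^2 - \overline{X}^2\right)} \quad \text{and} \quad \widehat{\beta} = \overline{X} + \sqrt{\frac{3}{n} \sum_{i=1}^{n} \left(X_i^2 - \overline{X}^2\right)}.$$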

15.2. Maximum Likelihood Method

The maximum likelihood method was first used by Sir Ronald Fisher in 1922 (see Fisher (1922)) for finding an estimator of an unknown parameter. However, the method originated in the works of Gauss and Bernoulli. Next, we describe the method in detail.

Definition 15.3. Let $X_1, X_2, ..., X_n$ be a random sample from a population $X$ with probability density function $f(x; \theta)$, where $\theta$ is an unknown parameter. The likelihood function, $L(\theta)$, is the distribution of the sample. That is

$$L(\theta) = \prod_{i=1}^{n} f(x_i; \theta).$$

This definition says that the likelihood function of a random sample $X_1, X_2, ..., X_n$ is the joint density of the random variables $X_1, X_2, ..., X_n$.

The $\theta$ that maximizes the likelihood function $L(\theta)$ is called the maximum likelihood estimator of $\theta$, and it is denoted by $\widehat{\theta}$. Hence

$$\widehat{\theta} = \operatorname{Arg} \sup_{\theta \in \Omega} L(\theta),$$

where $\Omega$ is the parameter space of $\theta$ so that $L(\theta)$ is the joint density of the sample.

The method of maximum likelihood in a sense picks out of all the possible values of $\theta$ the one most likely to have produced the given observations $x_1, x_2, ..., x_n$. The method is summarized below:

(1) Obtain a random sample $x_1, x_2, ..., x_n$ from the distribution of a population $X$ with probability density function $f(x; \theta)$.

(2) Define the likelihood function for the sample $x_1, x_2, ..., x_n$ by $L(\theta) = f(x_1; \theta)\, f(x_2; \theta) \cdots f(x_n; \theta)$.

(3) Find the expression for $\theta$ that maximizes $L(\theta)$. This can be done directly or by maximizing $\ln L(\theta)$.

(4) Replace $\theta$ by $\widehat{\theta}$ to obtain an expression for the maximum likelihood estimator for $\theta$.

(5) Find the observed value of this estimator for a given sample.


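Example 15.9. If $X_1, X_2, ..., X_n$ is a random sample from a distribution with density function

$$f(x;\theta) = \begin{cases} (1-\theta)\, x^{-\theta} & \text{if } 0 < x < 1 \\ 0 & \text{elsewhere,} \end{cases}$$

what is the maximum likelihood estimator of $\theta$?

Answer: The likelihood function of the sample is given by

$$L(\theta) = \prod_{i=1}^{n} f(x_i;\theta).$$

Therefore

$$\ln L(\theta) = \ln\left(\prod_{i=1}^{n} f(x_i;\theta)\right) = \sum_{i=1}^{n} \ln f(x_i;\theta) = \sum_{i=1}^{n} \ln\left[(1-\theta)\, x_i^{-\theta}\right] = n \ln(1-\theta) - \theta \sum_{i=1}^{n} \ln x_i.$$

Now we maximize $\ln L(\theta)$ with respect to $\theta$.

$$\frac{d \ln L(\theta)}{d\theta} = \frac{d}{d\theta}\left(n \ln(1-\theta) - \theta \sum_{i=1}^{n} \ln x_i\right) = -\frac{n}{1-\theta} - \sum_{i=1}^{n} \ln x_i.$$

Setting this derivative $\frac{d \ln L(\theta)}{d\theta}$ to 0, we get

$$\frac{d \ln L(\theta)}{d\theta} = -\frac{n}{1-\theta} - \sum_{i=1}^{n} \ln x_i = 0,$$

that is

$$\frac{1}{1-\theta} = -\frac{1}{n}\sum_{i=1}^{n} \ln x_i = -\overline{\ln x},$$

so that

$$\theta = 1 + \frac{1}{\overline{\ln x}}.$$

This $\theta$ can be shown to be a maximum by the second derivative test and we leave this verification to the reader. Therefore, the estimator of $\theta$ is

$$\widehat{\theta} = 1 + \frac{1}{\overline{\ln X}}.$$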
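The closed form in Example 15.9 can be checked numerically. The following is a minimal sketch, not part of the original text: the sample values and the function names `log_likelihood` and `mle_theta` are made up for illustration, and the grid search is only a crude independent check of the formula $\widehat{\theta} = 1 + 1/\overline{\ln x}$.

```python
import math

def log_likelihood(theta, xs):
    """ln L(theta) = n ln(1 - theta) - theta * sum(ln x_i), defined for theta < 1."""
    n = len(xs)
    return n * math.log(1.0 - theta) - theta * sum(math.log(x) for x in xs)

def mle_theta(xs):
    """Closed-form estimate from Example 15.9: 1 + 1 / mean(ln x_i)."""
    mean_log = sum(math.log(x) for x in xs) / len(xs)
    return 1.0 + 1.0 / mean_log

# Hypothetical observations in (0, 1), chosen only for illustration.
sample = [0.2, 0.5, 0.7, 0.9, 0.35]
theta_hat = mle_theta(sample)

# Crude grid search over theta in [-5, 0.989] with step 0.001.
grid = [-5.0 + 0.001 * k for k in range(5990)]
theta_grid = max(grid, key=lambda t: log_likelihood(t, sample))

print(theta_hat, theta_grid)
```

Because $\ln L(\theta)$ is concave here, the grid maximizer should land within one grid step of the closed-form value.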

Example 15.10. If $X_1, X_2, ..., X_n$ is a random sample from a distribution with density function

$$f(x;\beta) = \begin{cases} \dfrac{x^{6} e^{-x/\beta}}{\Gamma(7)\, \beta^{7}} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

then what is the maximum likelihood estimator of $\beta$?

Answer: The likelihood function of the sample is given by

$$L(\beta) = \prod_{i=1}^{n} f(x_i;\beta).$$

Thus,

$$\ln L(\beta) = \sum_{i=1}^{n} \ln f(x_i;\beta) = 6\sum_{i=1}^{n} \ln x_i - \frac{1}{\beta}\sum_{i=1}^{n} x_i - n \ln(6!) - 7n \ln(\beta).$$

Therefore

$$\frac{d}{d\beta} \ln L(\beta) = \frac{1}{\beta^{2}}\sum_{i=1}^{n} x_i - \frac{7n}{\beta}.$$

Setting this derivative $\frac{d}{d\beta} \ln L(\beta)$ to zero, we get

$$\frac{1}{\beta^{2}}\sum_{i=1}^{n} x_i - \frac{7n}{\beta} = 0,$$

which yields

$$\beta = \frac{1}{7n}\sum_{i=1}^{n} x_i.$$

This $\beta$ can be shown to be a maximum by the second derivative test and again we leave this verification to the reader. Hence, the estimator of $\beta$ is given by

$$\widehat{\beta} = \frac{1}{7}\,\overline{X}.$$

Remark 15.1. Note that this maximum likelihood estimator of $\beta$ is the same as the one found for $\beta$ using the moment method in Example 15.6. However, in general the estimators obtained by different methods are different, as the following example illustrates.

Example 15.11. If $X_1, X_2, ..., X_n$ is a random sample from a distribution with density function

$$f(x;\theta) = \begin{cases} \dfrac{1}{\theta} & \text{if } 0 < x < \theta \\ 0 & \text{otherwise,} \end{cases}$$

then what is the maximum likelihood estimator of $\theta$?

Answer: The likelihood function of the sample is given by

$$L(\theta) = \prod_{i=1}^{n} f(x_i;\theta) = \prod_{i=1}^{n} \frac{1}{\theta}, \qquad \theta > x_i \ (i = 1, 2, 3, ..., n)$$

$$= \left(\frac{1}{\theta}\right)^{n}, \qquad \theta > \max\{x_1, x_2, ..., x_n\}.$$

Hence the parameter space of $\theta$ with respect to $L(\theta)$ is given by

$$\Omega = \{\theta \in \mathbb{R} \mid x_{\max} < \theta < \infty\} = (x_{\max}, \infty).$$

Now we maximize $L(\theta)$ on $\Omega$. First, we compute $\ln L(\theta)$ and then differentiate it to get

$$\ln L(\theta) = -n \ln(\theta)$$

and

$$\frac{d}{d\theta} \ln L(\theta) = -\frac{n}{\theta} < 0.$$

Therefore $\ln L(\theta)$ is a decreasing function of $\theta$ and as such the maximum of $\ln L(\theta)$ occurs at the left end point of the interval $(x_{\max}, \infty)$. Therefore, at $\theta = x_{\max}$ the likelihood function achieves its maximum. Hence the maximum likelihood estimator of $\theta$ is given by

$$\widehat{\theta} = X_{(n)},$$

where $X_{(n)}$ denotes the $n$th order statistic of the given sample.

Thus, Example 15.7 and Example 15.11 say that if we estimate the parameter $\theta$ of a distribution with uniform density on the interval $(0, \theta)$, then the maximum likelihood estimator is given by

$$\widehat{\theta} = X_{(n)},$$

whereas

$$\widehat{\theta} = 2\,\overline{X}$$

is the estimator obtained by the method of moments. Hence, in general these two methods do not provide the same estimator of an unknown parameter.
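To make the contrast concrete, here is a small sketch that evaluates both estimators on the same data. The sample values and the function names are hypothetical and serve only as an illustration of the two formulas above.

```python
def mle_uniform_upper(xs):
    """Maximum likelihood estimate of theta for Uniform(0, theta): the largest observation."""
    return max(xs)

def moment_uniform_upper(xs):
    """Moment estimate of theta for Uniform(0, theta): twice the sample mean."""
    return 2.0 * sum(xs) / len(xs)

# Hypothetical sample, assumed to come from Uniform(0, theta).
sample = [0.8, 2.9, 1.7, 0.4, 2.2]

theta_mle = mle_uniform_upper(sample)
theta_mom = moment_uniform_upper(sample)
print(theta_mle, theta_mom)
```

On this sample the two estimates differ (the largest observation versus twice the mean), which is the point made in the text.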

Example 15.12. Let $X_1, X_2, ..., X_n$ be a random sample from a distribution with density function

$$f(x;\theta) = \begin{cases} \sqrt{\dfrac{2}{\pi}}\; e^{-\frac{1}{2}(x-\theta)^{2}} & \text{if } x \ge \theta \\ 0 & \text{elsewhere.} \end{cases}$$

What is the maximum likelihood estimator of $\theta$?

Answer: The likelihood function $L(\theta)$ is given by

$$L(\theta) = \left(\sqrt{\frac{2}{\pi}}\right)^{n} \prod_{i=1}^{n} e^{-\frac{1}{2}(x_i-\theta)^{2}}, \qquad x_i \ge \theta \ (i = 1, 2, 3, ..., n).$$

Hence the parameter space of $\theta$ is given by

$$\Omega = \{\theta \in \mathbb{R} \mid 0 \le \theta \le x_{\min}\} = [0, x_{\min}],$$

where $x_{\min} = \min\{x_1, x_2, ..., x_n\}$. Now we evaluate the logarithm of the likelihood function.

$$\ln L(\theta) = \frac{n}{2}\ln\left(\frac{2}{\pi}\right) - \frac{1}{2}\sum_{i=1}^{n} (x_i-\theta)^{2},$$

where $\theta$ is on the interval $[0, x_{\min}]$. Now we maximize $\ln L(\theta)$ subject to the condition $0 \le \theta \le x_{\min}$. Taking the derivative, we get

$$\frac{d}{d\theta} \ln L(\theta) = -\frac{1}{2}\sum_{i=1}^{n} (x_i-\theta)\, 2\, (-1) = \sum_{i=1}^{n} (x_i-\theta).$$

In this example, if we equate the derivative to zero, then we get $\theta = \overline{x}$. But this value of $\theta$ is not in the parameter space $\Omega$. Thus, $\theta = \overline{x}$ is not the solution. Hence to find the solution of this optimization problem, we examine the behavior of $\ln L(\theta)$ on the interval $[0, x_{\min}]$. Note that

$$\frac{d}{d\theta} \ln L(\theta) = -\frac{1}{2}\sum_{i=1}^{n} (x_i-\theta)\, 2\, (-1) = \sum_{i=1}^{n} (x_i-\theta) > 0$$

since each $x_i$ is bigger than $\theta$. Therefore, the function $\ln L(\theta)$ is an increasing function on the interval $[0, x_{\min}]$ and as such it will achieve its maximum at the right end point of the interval $[0, x_{\min}]$. Therefore, the maximum likelihood estimator of $\theta$ is given by

$$\widehat{\theta} = X_{(1)},$$

where $X_{(1)}$ denotes the smallest observation in the random sample $X_1, X_2, ..., X_n$.

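Example 15.13. Let $X_1, X_2, ..., X_n$ be a random sample from a normal population with mean $\mu$ and variance $\sigma^{2}$. What are the maximum likelihood estimators of $\mu$ and $\sigma^{2}$?

Answer: Since $X \sim N(\mu, \sigma^{2})$, the probability density function of $X$ is given by

$$f(x;\mu,\sigma) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}.$$

The likelihood function of the sample is given by

$$L(\mu,\sigma) = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x_i-\mu}{\sigma}\right)^{2}}.$$

Hence, the logarithm of this likelihood function is given by

$$\ln L(\mu,\sigma) = -\frac{n}{2}\ln(2\pi) - n\ln(\sigma) - \frac{1}{2\sigma^{2}}\sum_{i=1}^{n} (x_i-\mu)^{2}.$$

Taking the partial derivatives of $\ln L(\mu,\sigma)$ with respect to $\mu$ and $\sigma$, we get

$$\frac{\partial}{\partial\mu} \ln L(\mu,\sigma) = -\frac{1}{2\sigma^{2}}\sum_{i=1}^{n} (x_i-\mu)(-2) = \frac{1}{\sigma^{2}}\sum_{i=1}^{n} (x_i-\mu)$$

and

$$\frac{\partial}{\partial\sigma} \ln L(\mu,\sigma) = -\frac{n}{\sigma} + \frac{1}{\sigma^{3}}\sum_{i=1}^{n} (x_i-\mu)^{2}.$$

Setting $\frac{\partial}{\partial\mu} \ln L(\mu,\sigma) = 0$ and $\frac{\partial}{\partial\sigma} \ln L(\mu,\sigma) = 0$, and solving for the unknowns $\mu$ and $\sigma$, we get

$$\mu = \frac{1}{n}\sum_{i=1}^{n} x_i = \overline{x}.$$

Thus the maximum likelihood estimator of $\mu$ is

$$\widehat{\mu} = \overline{X}.$$

Similarly,

$$-\frac{n}{\sigma} + \frac{1}{\sigma^{3}}\sum_{i=1}^{n} (x_i-\mu)^{2} = 0$$

implies

$$\sigma^{2} = \frac{1}{n}\sum_{i=1}^{n} (x_i-\mu)^{2}.$$

Again, the $\mu$ and $\sigma^{2}$ found by the first derivative test can be shown to give a maximum using the second derivative test for functions of two variables. Hence, using the estimator of $\mu$ in the above expression, we get the estimator of $\sigma^{2}$ to be

$$\widehat{\sigma^{2}} = \frac{1}{n}\sum_{i=1}^{n} (X_i-\overline{X})^{2}.$$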
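The two estimators from Example 15.13 are easy to compute directly. The sketch below uses a hypothetical sample and a made-up function name `normal_mle`; note that the variance estimate divides by $n$, as derived above, and not by $n - 1$.

```python
def normal_mle(xs):
    """Return (mu_hat, sigma2_hat) for a normal sample, following Example 15.13."""
    n = len(xs)
    mu_hat = sum(xs) / n
    # Maximum likelihood variance estimate: divisor n, not n - 1.
    sigma2_hat = sum((x - mu_hat) ** 2 for x in xs) / n
    return mu_hat, sigma2_hat

# Hypothetical observations, for illustration only.
sample = [4.0, 6.0, 5.0, 9.0, 6.0]
mu_hat, sigma2_hat = normal_mle(sample)
print(mu_hat, sigma2_hat)
```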

Example 15.14. Suppose $X_1, X_2, ..., X_n$ is a random sample from a distribution with density function

$$f(x;\alpha,\beta) = \begin{cases} \dfrac{1}{\beta-\alpha} & \text{if } \alpha < x < \beta \\ 0 & \text{otherwise.} \end{cases}$$

Find the estimators of $\alpha$ and $\beta$ by the method of maximum likelihood.

Answer: The likelihood function of the sample is given by

$$L(\alpha,\beta) = \prod_{i=1}^{n} \frac{1}{\beta-\alpha} = \left(\frac{1}{\beta-\alpha}\right)^{n}$$

for all $\alpha \le x_i$ $(i = 1, 2, ..., n)$ and for all $\beta \ge x_i$ $(i = 1, 2, ..., n)$. Hence, the domain of the likelihood function is

$$\Omega = \{(\alpha,\beta) \mid 0 < \alpha \le x_{(1)} \text{ and } x_{(n)} \le \beta < \infty\}.$$

It is easy to see that $L(\alpha,\beta)$ is maximum if $\alpha = x_{(1)}$ and $\beta = x_{(n)}$. Therefore, the maximum likelihood estimators of $\alpha$ and $\beta$ are

$$\widehat{\alpha} = X_{(1)} \quad\text{and}\quad \widehat{\beta} = X_{(n)}.$$

The maximum likelihood estimator $\widehat{\theta}$ of a parameter $\theta$ has a remarkable property known as the invariance property. This invariance property says that if $\widehat{\theta}$ is a maximum likelihood estimator of $\theta$, then $g(\widehat{\theta})$ is the maximum likelihood estimator of $g(\theta)$, where $g$ is a function from $\mathbb{R}^{k}$ to a subset of $\mathbb{R}^{m}$. This result was proved by Zehna in 1966. We state this result as a theorem without a proof.

Theorem 15.1. Let $\widehat{\theta}$ be a maximum likelihood estimator of a parameter $\theta$ and let $g(\theta)$ be a function of $\theta$. Then the maximum likelihood estimator of $g(\theta)$ is given by $g\big(\widehat{\theta}\big)$.

Now we give two examples to illustrate the importance of this theorem.

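Example 15.15. Let $X_1, X_2, ..., X_n$ be a random sample from a normal population with mean $\mu$ and variance $\sigma^{2}$. What are the maximum likelihood estimators of $\sigma$ and $\mu - \sigma$?

Answer: From Example 15.13, we have the maximum likelihood estimators of $\mu$ and $\sigma^{2}$ to be

$$\widehat{\mu} = \overline{X}$$

and

$$\widehat{\sigma^{2}} = \frac{1}{n}\sum_{i=1}^{n} (X_i-\overline{X})^{2} =: \Sigma^{2} \ \text{(say)}.$$

Now using the invariance property of the maximum likelihood estimator we have

$$\widehat{\sigma} = \Sigma$$

and

$$\widehat{\mu - \sigma} = \overline{X} - \Sigma.$$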

Example 15.16. Suppose $X_1, X_2, ..., X_n$ is a random sample from a distribution with density function

$$f(x;\alpha,\beta) = \begin{cases} \dfrac{1}{\beta-\alpha} & \text{if } \alpha < x < \beta \\ 0 & \text{otherwise.} \end{cases}$$

Find the estimator of $\sqrt{\alpha^{2}+\beta^{2}}$ by the method of maximum likelihood.

Answer: From Example 15.14, we have the maximum likelihood estimators of $\alpha$ and $\beta$ to be

$$\widehat{\alpha} = X_{(1)} \quad\text{and}\quad \widehat{\beta} = X_{(n)},$$

respectively. Now using the invariance property of the maximum likelihood estimator we see that the maximum likelihood estimator of $\sqrt{\alpha^{2}+\beta^{2}}$ is

$$\sqrt{X_{(1)}^{2} + X_{(n)}^{2}}.$$
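The invariance property amounts to a plug-in rule: estimate the parameters first, then apply $g$. A minimal sketch for the setting of Example 15.16 follows; the sample and the function names are hypothetical.

```python
import math

def uniform_mle(xs):
    """Maximum likelihood estimates (alpha_hat, beta_hat) for Uniform(alpha, beta)."""
    return min(xs), max(xs)

def mle_of_root_sum_squares(xs):
    """Plug-in estimate of sqrt(alpha^2 + beta^2), justified by the invariance property."""
    alpha_hat, beta_hat = uniform_mle(xs)
    return math.sqrt(alpha_hat ** 2 + beta_hat ** 2)

# Hypothetical sample with smallest value 3 and largest value 4.
sample = [3.0, 4.0, 3.5]
print(mle_of_root_sum_squares(sample))
```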

The concept of information in statistics was introduced by Sir Ronald Fisher, and it is known as Fisher information.

Definition 15.4. Let $X$ be an observation from a population with probability density function $f(x;\theta)$. Suppose $f(x;\theta)$ is continuous, twice differentiable and its support does not depend on $\theta$. Then the Fisher information, $I(\theta)$, in a single observation $X$ about $\theta$ is given by

$$I(\theta) = \int_{-\infty}^{\infty} \left[\frac{d \ln f(x;\theta)}{d\theta}\right]^{2} f(x;\theta)\, dx.$$

Thus $I(\theta)$ is the expected value of the square of the random variable $\frac{d \ln f(X;\theta)}{d\theta}$. That is,

$$I(\theta) = E\left(\left[\frac{d \ln f(X;\theta)}{d\theta}\right]^{2}\right).$$

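In the following lemma, we give an alternative formula for the Fisher information.

Lemma 15.1. The Fisher information contained in a single observation about the unknown parameter $\theta$ can be given alternatively as

$$I(\theta) = -\int_{-\infty}^{\infty} \left[\frac{d^{2} \ln f(x;\theta)}{d\theta^{2}}\right] f(x;\theta)\, dx.$$

Proof: Since $f(x;\theta)$ is a probability density function,

$$\int_{-\infty}^{\infty} f(x;\theta)\, dx = 1. \tag{3}$$

Differentiating (3) with respect to $\theta$, we get

$$\frac{d}{d\theta}\int_{-\infty}^{\infty} f(x;\theta)\, dx = 0.$$

Rewriting the last equality, we obtain

$$\int_{-\infty}^{\infty} \frac{d f(x;\theta)}{d\theta}\, \frac{1}{f(x;\theta)}\, f(x;\theta)\, dx = 0,$$

which is

$$\int_{-\infty}^{\infty} \frac{d \ln f(x;\theta)}{d\theta}\, f(x;\theta)\, dx = 0. \tag{4}$$

Now differentiating (4) with respect to $\theta$, we see that

$$\int_{-\infty}^{\infty} \left[\frac{d^{2} \ln f(x;\theta)}{d\theta^{2}}\, f(x;\theta) + \frac{d \ln f(x;\theta)}{d\theta}\, \frac{d f(x;\theta)}{d\theta}\right] dx = 0.$$

Rewriting the last equality, we have

$$\int_{-\infty}^{\infty} \left[\frac{d^{2} \ln f(x;\theta)}{d\theta^{2}}\, f(x;\theta) + \frac{d \ln f(x;\theta)}{d\theta}\, \frac{d f(x;\theta)}{d\theta}\, \frac{1}{f(x;\theta)}\, f(x;\theta)\right] dx = 0,$$

which is

$$\int_{-\infty}^{\infty} \left(\frac{d^{2} \ln f(x;\theta)}{d\theta^{2}} + \left[\frac{d \ln f(x;\theta)}{d\theta}\right]^{2}\right) f(x;\theta)\, dx = 0.$$

The last equality implies that

$$\int_{-\infty}^{\infty} \left[\frac{d \ln f(x;\theta)}{d\theta}\right]^{2} f(x;\theta)\, dx = -\int_{-\infty}^{\infty} \left[\frac{d^{2} \ln f(x;\theta)}{d\theta^{2}}\right] f(x;\theta)\, dx.$$

Hence using the definition of Fisher information, we have

$$I(\theta) = -\int_{-\infty}^{\infty} \left[\frac{d^{2} \ln f(x;\theta)}{d\theta^{2}}\right] f(x;\theta)\, dx$$

and the proof of the lemma is now complete.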

The following two examples illustrate how one can determine Fisher information.

Example 15.17. Let $X$ be a single observation taken from a normal population with unknown mean $\mu$ and known variance $\sigma^{2}$. Find the Fisher information in a single observation $X$ about $\mu$.

Answer: Since $X \sim N(\mu, \sigma^{2})$, the probability density of $X$ is given by

$$f(x;\mu) = \frac{1}{\sqrt{2\pi\sigma^{2}}}\, e^{-\frac{1}{2\sigma^{2}}(x-\mu)^{2}}.$$

Hence

$$\ln f(x;\mu) = -\frac{1}{2}\ln(2\pi\sigma^{2}) - \frac{(x-\mu)^{2}}{2\sigma^{2}}.$$

Therefore

$$\frac{d \ln f(x;\mu)}{d\mu} = \frac{x-\mu}{\sigma^{2}}$$

and

$$\frac{d^{2} \ln f(x;\mu)}{d\mu^{2}} = -\frac{1}{\sigma^{2}}.$$

Hence

$$I(\mu) = -\int_{-\infty}^{\infty} \left(-\frac{1}{\sigma^{2}}\right) f(x;\mu)\, dx = \frac{1}{\sigma^{2}}.$$

Example 15.18. Let $X_1, X_2, ..., X_n$ be a random sample from a normal population with unknown mean $\mu$ and known variance $\sigma^{2}$. Find the Fisher information in this sample of size $n$ about $\mu$.

Answer: Let $I_n(\mu)$ be the required Fisher information. Then from the definition, we have

$$\begin{aligned} I_n(\mu) &= -E\left(\frac{d^{2} \ln f(X_1, X_2, ..., X_n;\mu)}{d\mu^{2}}\right) \\ &= -E\left(\frac{d^{2}}{d\mu^{2}}\left\{\ln f(X_1;\mu) + \cdots + \ln f(X_n;\mu)\right\}\right) \\ &= -E\left(\frac{d^{2} \ln f(X_1;\mu)}{d\mu^{2}}\right) - \cdots - E\left(\frac{d^{2} \ln f(X_n;\mu)}{d\mu^{2}}\right) \\ &= I(\mu) + \cdots + I(\mu) \\ &= n\, I(\mu) \\ &= \frac{n}{\sigma^{2}} \quad \text{(using Example 15.17).} \end{aligned}$$

This example shows that if $X_1, X_2, ..., X_n$ is a random sample from a population $X \sim f(x;\theta)$, then the Fisher information, $I_n(\theta)$, in a sample of size $n$ about the parameter $\theta$ is equal to $n$ times the Fisher information in $X$ about $\theta$. Thus

$$I_n(\theta) = n\, I(\theta).$$
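As a numerical sanity check of Examples 15.17 and 15.18, the sketch below approximates $I(\mu) = E[(d \ln f/d\mu)^{2}]$ for a normal density with a midpoint rule and then scales it by $n$. The parameter values, the integration width, and the function names are assumptions made for illustration.

```python
import math

def normal_pdf(x, mu, sigma):
    """Density of N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2.0 * math.pi))

def fisher_information_mu(mu, sigma, steps=20000, width=10.0):
    """Approximate E[(d ln f / d mu)^2] by a midpoint rule on mu +/- width * sigma."""
    lo = mu - width * sigma
    h = 2.0 * width * sigma / steps
    total = 0.0
    for k in range(steps):
        x = lo + (k + 0.5) * h
        score = (x - mu) / sigma ** 2  # d ln f / d mu for the normal density
        total += score * score * normal_pdf(x, mu, sigma) * h
    return total

# Assumed values: mu = 1, sigma = 2, sample size n = 25.
mu, sigma, n = 1.0, 2.0, 25
info_one = fisher_information_mu(mu, sigma)  # should be close to 1 / sigma^2
info_sample = n * info_one                   # I_n(mu) = n I(mu)
print(info_one, info_sample)
```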

If $X$ is a random variable with probability density function $f(x;\theta)$, where $\theta = (\theta_1, ..., \theta_n)$ is an unknown parameter vector, then the Fisher information, $I(\theta)$, is an $n \times n$ matrix given by

$$I(\theta) = \big(I_{ij}(\theta)\big) = \left(-E\left(\frac{\partial^{2} \ln f(X;\theta)}{\partial\theta_i\, \partial\theta_j}\right)\right).$$

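Example 15.19. Let $X_1, X_2, ..., X_n$ be a random sample from a normal population with mean $\mu$ and variance $\sigma^{2}$. What is the Fisher information matrix, $I_n(\mu, \sigma^{2})$, of the sample of size $n$ about the parameters $\mu$ and $\sigma^{2}$?

Answer: Let us write $\theta_1 = \mu$ and $\theta_2 = \sigma^{2}$. The Fisher information, $I_n(\theta)$, in a sample of size $n$ about the parameter $(\theta_1, \theta_2)$ is equal to $n$ times the Fisher information in the population about $(\theta_1, \theta_2)$, that is

$$I_n(\theta_1, \theta_2) = n\, I(\theta_1, \theta_2). \tag{5}$$

Since there are two parameters $\theta_1$ and $\theta_2$, the Fisher information matrix $I(\theta_1, \theta_2)$ is a $2 \times 2$ matrix given by

$$I(\theta_1, \theta_2) = \begin{pmatrix} I_{11}(\theta_1, \theta_2) & I_{12}(\theta_1, \theta_2) \\ I_{21}(\theta_1, \theta_2) & I_{22}(\theta_1, \theta_2) \end{pmatrix} \tag{6}$$

where

$$I_{ij}(\theta_1, \theta_2) = -E\left(\frac{\partial^{2} \ln f(X;\theta_1,\theta_2)}{\partial\theta_i\, \partial\theta_j}\right)$$

for $i = 1, 2$ and $j = 1, 2$. Now we proceed to compute $I_{ij}$. Since

$$f(x;\theta_1,\theta_2) = \frac{1}{\sqrt{2\pi\theta_2}}\, e^{-\frac{(x-\theta_1)^{2}}{2\theta_2}},$$

we have

$$\ln f(x;\theta_1,\theta_2) = -\frac{1}{2}\ln(2\pi\theta_2) - \frac{(x-\theta_1)^{2}}{2\theta_2}.$$

Taking partials of $\ln f(x;\theta_1,\theta_2)$, we have

$$\frac{\partial \ln f(x;\theta_1,\theta_2)}{\partial\theta_1} = \frac{x-\theta_1}{\theta_2}, \qquad \frac{\partial \ln f(x;\theta_1,\theta_2)}{\partial\theta_2} = -\frac{1}{2\theta_2} + \frac{(x-\theta_1)^{2}}{2\theta_2^{2}},$$

$$\frac{\partial^{2} \ln f(x;\theta_1,\theta_2)}{\partial\theta_1^{2}} = -\frac{1}{\theta_2}, \qquad \frac{\partial^{2} \ln f(x;\theta_1,\theta_2)}{\partial\theta_2^{2}} = \frac{1}{2\theta_2^{2}} - \frac{(x-\theta_1)^{2}}{\theta_2^{3}},$$

$$\frac{\partial^{2} \ln f(x;\theta_1,\theta_2)}{\partial\theta_1\, \partial\theta_2} = -\frac{x-\theta_1}{\theta_2^{2}}.$$

Hence

$$I_{11}(\theta_1, \theta_2) = -E\left(-\frac{1}{\theta_2}\right) = \frac{1}{\theta_2} = \frac{1}{\sigma^{2}}.$$

Similarly,

$$I_{21}(\theta_1, \theta_2) = I_{12}(\theta_1, \theta_2) = -E\left(-\frac{X-\theta_1}{\theta_2^{2}}\right) = \frac{E(X)}{\theta_2^{2}} - \frac{\theta_1}{\theta_2^{2}} = \frac{\theta_1}{\theta_2^{2}} - \frac{\theta_1}{\theta_2^{2}} = 0$$

and

$$I_{22}(\theta_1, \theta_2) = -E\left(-\frac{(X-\theta_1)^{2}}{\theta_2^{3}} + \frac{1}{2\theta_2^{2}}\right) = \frac{E\big((X-\theta_1)^{2}\big)}{\theta_2^{3}} - \frac{1}{2\theta_2^{2}} = \frac{\theta_2}{\theta_2^{3}} - \frac{1}{2\theta_2^{2}} = \frac{1}{2\theta_2^{2}} = \frac{1}{2\sigma^{4}}.$$

Thus from (5), (6) and the above calculations, the Fisher information matrix is given by

$$I_n(\theta_1, \theta_2) = n \begin{pmatrix} \dfrac{1}{\sigma^{2}} & 0 \\ 0 & \dfrac{1}{2\sigma^{4}} \end{pmatrix} = \begin{pmatrix} \dfrac{n}{\sigma^{2}} & 0 \\ 0 & \dfrac{n}{2\sigma^{4}} \end{pmatrix}.$$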

Now we present an important theorem about the maximum likelihood estimator without a proof.

Theorem 15.2. Under certain regularity conditions on $f(x;\theta)$, the maximum likelihood estimator $\widehat{\theta}$ of $\theta$ based on a random sample of size $n$ from a population $X$ with probability density $f(x;\theta)$ is asymptotically normally distributed with mean $\theta$ and variance $\frac{1}{n I(\theta)}$. That is

$$\widehat{\theta}_{ML} \sim N\left(\theta, \frac{1}{n\, I(\theta)}\right) \quad \text{as } n \to \infty.$$

The following example shows that the maximum likelihood estimator of a parameter is not necessarily unique.

Example 15.20. If $X_1, X_2, ..., X_n$ is a random sample from a distribution with density function

$$f(x;\theta) = \begin{cases} \dfrac{1}{2} & \text{if } \theta - 1 \le x \le \theta + 1 \\ 0 & \text{otherwise,} \end{cases}$$

then what is the maximum likelihood estimator of $\theta$?

Answer: The likelihood function of this sample is given by

$$L(\theta) = \begin{cases} \left(\dfrac{1}{2}\right)^{n} & \text{if } \max\{x_1, ..., x_n\} - 1 \le \theta \le \min\{x_1, ..., x_n\} + 1 \\ 0 & \text{otherwise.} \end{cases}$$

Since the likelihood function is a constant, any value in the interval $[\max\{x_1, ..., x_n\} - 1, \min\{x_1, ..., x_n\} + 1]$ is a maximum likelihood estimate of $\theta$.
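The non-uniqueness in Example 15.20 can be seen by evaluating the likelihood at several points of the interval. In this sketch the sample and the function names are hypothetical; the likelihood takes the same value $(1/2)^{n}$ everywhere on $[\max x_i - 1, \min x_i + 1]$ and is zero outside it.

```python
def likelihood(theta, xs):
    """L(theta) for the density 1/2 on [theta - 1, theta + 1]."""
    inside = all(theta - 1.0 <= x <= theta + 1.0 for x in xs)
    return 0.5 ** len(xs) if inside else 0.0

def mle_interval(xs):
    """Every theta in [max(xs) - 1, min(xs) + 1] maximizes the likelihood."""
    lo, hi = max(xs) - 1.0, min(xs) + 1.0
    if lo > hi:
        raise ValueError("sample range exceeds 2; not consistent with this model")
    return lo, hi

# Hypothetical sample whose range is less than 2.
sample = [4.2, 4.9, 5.1, 4.6]
lo, hi = mle_interval(sample)
print(lo, hi, likelihood(4.5, sample), likelihood(5.0, sample), likelihood(5.3, sample))
```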

Example 15.21. What is the basic principle of maximum likelihood esti-

mation?

Answer: To choose a value of the parameter for which the observed data

have as high a probability or density as possible. In other words a maximum

likelihood estimate is a parameter value under which the sample data have

the highest probability.

15.3. Bayesian Method

In the classical approach, the parameter $\theta$ is assumed to be an unknown, but fixed quantity. A random sample $X_1, X_2, ..., X_n$ is drawn from a population with probability density function $f(x;\theta)$ and based on the observed values in the sample, knowledge about the value of $\theta$ is obtained.

In the Bayesian approach $\theta$ is considered to be a quantity whose variation can be described by a probability distribution (known as the prior distribution). This is a subjective distribution, based on the experimenter's belief, and is formulated before the data are seen (hence the name prior distribution). A sample is then taken from a population where $\theta$ is a parameter and the prior distribution is updated with this sample information. This updated prior is called the posterior distribution. The updating is done with the help of Bayes' theorem, hence the name Bayesian method.

In this section, we shall denote the population density $f(x;\theta)$ as $f(x/\theta)$, that is, the density of the population $X$ given the parameter $\theta$.

Definition 15.5. Let $X_1, X_2, ..., X_n$ be a random sample from a distribution with density $f(x/\theta)$, where $\theta$ is the unknown parameter to be estimated. The probability density function of the random variable $\theta$ is called the prior distribution of $\theta$ and is usually denoted by $h(\theta)$.

Definition 15.6. Let $X_1, X_2, ..., X_n$ be a random sample from a distribution with density $f(x/\theta)$, where $\theta$ is the unknown parameter to be estimated. The conditional density, $k(\theta/x_1, x_2, ..., x_n)$, of $\theta$ given the sample $x_1, x_2, ..., x_n$ is called the posterior distribution of $\theta$.

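Example 15.22. Let $X_1 = 1$, $X_2 = 2$ be a random sample of size 2 from a distribution with probability density function

$$f(x/\theta) = \binom{3}{x} \theta^{x} (1-\theta)^{3-x}, \qquad x = 0, 1, 2, 3.$$

If the prior density of $\theta$ is

$$h(\theta) = \begin{cases} k & \text{if } \frac{1}{2} < \theta < 1 \\ 0 & \text{otherwise,} \end{cases}$$

what is the posterior distribution of $\theta$?

Answer: Since $h(\theta)$ is the probability density of $\theta$, we should get

$$\int_{\frac{1}{2}}^{1} h(\theta)\, d\theta = 1,$$

which implies

$$\int_{\frac{1}{2}}^{1} k\, d\theta = 1.$$

Therefore $k = 2$. The joint density of the sample and the parameter is given by

$$\begin{aligned} u(x_1, x_2, \theta) &= f(x_1/\theta)\, f(x_2/\theta)\, h(\theta) \\ &= \binom{3}{x_1} \theta^{x_1} (1-\theta)^{3-x_1} \binom{3}{x_2} \theta^{x_2} (1-\theta)^{3-x_2}\, 2 \\ &= 2 \binom{3}{x_1} \binom{3}{x_2} \theta^{x_1+x_2} (1-\theta)^{6-x_1-x_2}. \end{aligned}$$

Hence,

$$u(1, 2, \theta) = 2 \binom{3}{1} \binom{3}{2} \theta^{3} (1-\theta)^{3} = 18\, \theta^{3} (1-\theta)^{3}.$$

The marginal distribution of the sample is

$$\begin{aligned} g(1, 2) &= \int_{\frac{1}{2}}^{1} u(1, 2, \theta)\, d\theta \\ &= \int_{\frac{1}{2}}^{1} 18\, \theta^{3} (1-\theta)^{3}\, d\theta \\ &= 18 \int_{\frac{1}{2}}^{1} \theta^{3} \left(1 + 3\theta^{2} - 3\theta - \theta^{3}\right) d\theta \\ &= 18 \int_{\frac{1}{2}}^{1} \left(\theta^{3} + 3\theta^{5} - 3\theta^{4} - \theta^{6}\right) d\theta \\ &= \frac{9}{140}. \end{aligned}$$

The conditional distribution of the parameter $\theta$ given the sample $X_1 = 1$ and $X_2 = 2$ is given by

$$k(\theta/x_1 = 1, x_2 = 2) = \frac{u(1, 2, \theta)}{g(1, 2)} = \frac{18\, \theta^{3} (1-\theta)^{3}}{\frac{9}{140}} = 280\, \theta^{3} (1-\theta)^{3}.$$

Therefore, the posterior distribution of $\theta$ is

$$k(\theta/x_1 = 1, x_2 = 2) = \begin{cases} 280\, \theta^{3} (1-\theta)^{3} & \text{if } \frac{1}{2} < \theta < 1 \\ 0 & \text{otherwise.} \end{cases}$$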
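The constants $9/140$ and $280$ in Example 15.22 can be reproduced with exact rational arithmetic. The sketch below represents each density as a list of polynomial coefficients in $\theta$ (lowest degree first); the helper names are made up for this illustration.

```python
from fractions import Fraction
from math import comb

def binomial_kernel(x, n):
    """Coefficients (lowest degree first) of C(n, x) * t^x * (1 - t)^(n - x)."""
    coeffs = [Fraction(0)] * (n + 1)
    for j in range(n - x + 1):
        coeffs[x + j] = Fraction(comb(n, x) * comb(n - x, j) * (-1) ** j)
    return coeffs

def poly_mul(p, q):
    """Product of two polynomials given as coefficient lists."""
    out = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, a in enumerate(p):
        for j, b in enumerate(q):
            out[i + j] += a * b
    return out

def integrate(coeffs, lo, hi):
    """Exact integral of the polynomial over [lo, hi]."""
    return sum(c * (hi ** (k + 1) - lo ** (k + 1)) / (k + 1)
               for k, c in enumerate(coeffs))

prior_height = Fraction(2)                 # k = 2 on (1/2, 1)
lo, hi = Fraction(1, 2), Fraction(1)
joint = [prior_height * c
         for c in poly_mul(binomial_kernel(1, 3), binomial_kernel(2, 3))]
marginal = integrate(joint, lo, hi)        # g(1, 2)
posterior = [c / marginal for c in joint]  # coefficients of k(theta / x1 = 1, x2 = 2)
print(marginal, posterior[3])
```

The coefficient of $\theta^{3}$ in the posterior polynomial is the leading constant of $280\,\theta^{3}(1-\theta)^{3}$.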

Remark 15.2. If $X_1, X_2, ..., X_n$ is a random sample from a population with density $f(x/\theta)$, then the joint density of the sample and the parameter is given by

$$u(x_1, x_2, ..., x_n, \theta) = h(\theta) \prod_{i=1}^{n} f(x_i/\theta).$$

Given this joint density, the marginal density of the sample can be computed using the formula

$$g(x_1, x_2, ..., x_n) = \int_{-\infty}^{\infty} h(\theta) \prod_{i=1}^{n} f(x_i/\theta)\, d\theta.$$

Now using Bayes' rule, the posterior distribution of $\theta$ can be computed as follows:

$$k(\theta/x_1, x_2, ..., x_n) = \frac{h(\theta) \prod_{i=1}^{n} f(x_i/\theta)}{\int_{-\infty}^{\infty} h(\theta) \prod_{i=1}^{n} f(x_i/\theta)\, d\theta}.$$

In the Bayesian method, we use two types of loss functions.

Definition 15.7. Let $X_1, X_2, ..., X_n$ be a random sample from a distribution with density $f(x/\theta)$, where $\theta$ is the unknown parameter to be estimated. Let $\widehat{\theta}$ be an estimator of $\theta$. The function

$$L_2\big(\widehat{\theta}, \theta\big) = \big(\widehat{\theta} - \theta\big)^{2}$$

is called the squared error loss. The function

$$L_1\big(\widehat{\theta}, \theta\big) = \big|\widehat{\theta} - \theta\big|$$

is called the absolute error loss.

The loss function $L$ represents the 'loss' incurred when $\widehat{\theta}$ is used in place of the parameter $\theta$.

Definition 15.8. Let $X_1, X_2, ..., X_n$ be a random sample from a distribution with density $f(x/\theta)$, where $\theta$ is the unknown parameter to be estimated. Let $\widehat{\theta}$ be an estimator of $\theta$ and let $L\big(\widehat{\theta}, \theta\big)$ be a given loss function. The expected value of this loss function with respect to the population distribution $f(x/\theta)$, that is

$$R_L(\theta) = \int L\big(\widehat{\theta}, \theta\big)\, f(x/\theta)\, dx,$$

is called the risk.

The posterior density of the parameter $\theta$ given the sample $x_1, x_2, ..., x_n$, that is

$$k(\theta/x_1, x_2, ..., x_n),$$

contains all information about $\theta$. In Bayesian estimation of a parameter one chooses an estimate $\widehat{\theta}$ for $\theta$ such that

$$k\big(\widehat{\theta}/x_1, x_2, ..., x_n\big)$$

is maximum subject to a loss function. Mathematically, this is equivalent to minimizing the integral

$$\int_{\Omega} L\big(\widehat{\theta}, \theta\big)\, k(\theta/x_1, x_2, ..., x_n)\, d\theta$$

with respect to $\widehat{\theta}$, where $\Omega$ denotes the support of the prior density $h(\theta)$ of the parameter $\theta$.

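Example 15.23. Suppose one observation was taken of a random variable $X$ which yielded the value 2. The density function for $X$ is

$$f(x/\theta) = \begin{cases} \dfrac{1}{\theta} & \text{if } 0 < x < \theta \\ 0 & \text{otherwise,} \end{cases}$$

and the prior distribution for the parameter $\theta$ is

$$h(\theta) = \begin{cases} \dfrac{3}{\theta^{4}} & \text{if } 1 < \theta < \infty \\ 0 & \text{otherwise.} \end{cases}$$

If the loss function is $L(z, \theta) = (z-\theta)^{2}$, then what is the Bayes' estimate for $\theta$?

Answer: The prior density of the random variable $\theta$ and the probability density function of the population are as given above. Hence, the joint probability density function of the sample and the parameter is given by

$$u(x, \theta) = h(\theta)\, f(x/\theta) = \frac{3}{\theta^{4}} \cdot \frac{1}{\theta} = \begin{cases} 3\theta^{-5} & \text{if } 0 < x < \theta \text{ and } 1 < \theta < \infty \\ 0 & \text{otherwise.} \end{cases}$$

The marginal density of the sample (for $x \ge 1$) is given by

$$g(x) = \int_{x}^{\infty} u(x, \theta)\, d\theta = \int_{x}^{\infty} 3\theta^{-5}\, d\theta = \frac{3}{4x^{4}}.$$

Thus, if $x = 2$, then $g(2) = \frac{3}{64}$. The posterior density of $\theta$ when $x = 2$ is given by

$$k(\theta/x = 2) = \frac{u(2, \theta)}{g(2)} = \frac{64}{3}\, 3\theta^{-5} = \begin{cases} 64\, \theta^{-5} & \text{if } 2 < \theta < \infty \\ 0 & \text{otherwise.} \end{cases}$$

Now, we find the Bayes estimator by minimizing the expression $E[L(\theta, z)/x = 2]$. That is

$$\widehat{\theta} = \operatorname*{Arg\,min}_{z \in \Omega} \int_{\Omega} L(\theta, z)\, k(\theta/x = 2)\, d\theta.$$

Let us call this integral $\psi(z)$. Then

$$\psi(z) = \int_{\Omega} L(\theta, z)\, k(\theta/x = 2)\, d\theta = \int_{2}^{\infty} (z-\theta)^{2}\, k(\theta/x = 2)\, d\theta = \int_{2}^{\infty} (z-\theta)^{2}\, 64\, \theta^{-5}\, d\theta.$$

We want to find the value of $z$ which yields a minimum of $\psi(z)$. This can be done by taking the derivative of $\psi(z)$ and finding where the derivative is zero.

$$\begin{aligned} \frac{d}{dz}\psi(z) &= \frac{d}{dz} \int_{2}^{\infty} (z-\theta)^{2}\, 64\, \theta^{-5}\, d\theta \\ &= 2 \int_{2}^{\infty} (z-\theta)\, 64\, \theta^{-5}\, d\theta \\ &= 2 \int_{2}^{\infty} z\, 64\, \theta^{-5}\, d\theta - 2 \int_{2}^{\infty} \theta\, 64\, \theta^{-5}\, d\theta \\ &= 2z - \frac{16}{3}. \end{aligned}$$

Setting this derivative of $\psi(z)$ to zero and solving for $z$, we get

$$2z - \frac{16}{3} = 0 \quad\Rightarrow\quad z = \frac{8}{3}.$$

Since $\frac{d^{2}\psi(z)}{dz^{2}} = 2$, the function $\psi(z)$ has a minimum at $z = \frac{8}{3}$. Hence, the Bayes' estimate of $\theta$ is $\frac{8}{3}$.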
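The value $8/3$ in Example 15.23 can be cross-checked numerically. The sketch below computes posterior moments of $64\,\theta^{-5}$ on $(2, \infty)$ after the substitution $t = 1/\theta$, which turns each moment into a polynomial integral over $(0, 1/2)$. The function names and the midpoint rule are choices made for this illustration only.

```python
def posterior_moment(m, steps=10000):
    """E[theta^m | x = 2] for the posterior 64 * theta^(-5) on (2, inf), with m < 4.

    With t = 1 / theta the integral becomes the integral of 64 * t^(3 - m)
    over (0, 1/2), approximated here by a midpoint rule.
    """
    h = 0.5 / steps
    return sum(64.0 * ((k + 0.5) * h) ** (3 - m) * h for k in range(steps))

m0, m1, m2 = (posterior_moment(m) for m in range(3))

def expected_squared_loss(z):
    """psi(z) = E[(z - theta)^2 | x = 2], expanded in terms of posterior moments."""
    return z * z * m0 - 2.0 * z * m1 + m2

bayes_estimate = m1 / m0  # minimizer of psi, i.e. the posterior mean
print(bayes_estimate, expected_squared_loss(bayes_estimate))
```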


In Example 15.23, we have found the Bayes' estimate of $\theta$ by directly minimizing $\int_{\Omega} L\big(\widehat{\theta}, \theta\big)\, k(\theta/x_1, x_2, ..., x_n)\, d\theta$ with respect to $\widehat{\theta}$. The next result is very useful for finding the Bayes' estimate using a quadratic loss function. Notice that if $L\big(\widehat{\theta}, \theta\big) = \big(\theta - \widehat{\theta}\big)^{2}$, then $\int_{\Omega} L\big(\widehat{\theta}, \theta\big)\, k(\theta/x_1, x_2, ..., x_n)\, d\theta$ is $E\big((\theta - \widehat{\theta})^{2}/x_1, x_2, ..., x_n\big)$. The following theorem is based on the fact that the function $\phi$ defined by $\phi(c) = E\big[(X - c)^{2}\big]$ attains its minimum if $c = E[X]$.

Theorem 15.3. Let $X_1, X_2, ..., X_n$ be a random sample from a distribution with density $f(x/\theta)$, where $\theta$ is the unknown parameter to be estimated. If the loss function is squared error, then the Bayes' estimator $\widehat{\theta}$ of the parameter $\theta$ is given by

$$\widehat{\theta} = E(\theta/x_1, x_2, ..., x_n),$$

where the expectation is taken with respect to the density $k(\theta/x_1, x_2, ..., x_n)$.

Now we give several examples to illustrate the use of this theorem.

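Example 15.24. Suppose the prior distribution of $\theta$ is uniform over the interval $(0, 1)$. Given $\theta$, the population $X$ is uniform over the interval $(0, \theta)$. If the squared error loss function is used, find the Bayes' estimator of $\theta$ based on a sample of size one.

Answer: The prior density of $\theta$ is given by

$$h(\theta) = \begin{cases} 1 & \text{if } 0 < \theta < 1 \\ 0 & \text{otherwise.} \end{cases}$$

The density of the population is given by

$$f(x/\theta) = \begin{cases} \dfrac{1}{\theta} & \text{if } 0 < x < \theta \\ 0 & \text{otherwise.} \end{cases}$$

The joint density of the sample and the parameter is given by

$$u(x, \theta) = h(\theta)\, f(x/\theta) = 1 \cdot \frac{1}{\theta} = \begin{cases} \dfrac{1}{\theta} & \text{if } 0 < x < \theta < 1 \\ 0 & \text{otherwise.} \end{cases}$$

The marginal density of the sample is

$$g(x) = \int_{x}^{1} u(x, \theta)\, d\theta = \int_{x}^{1} \frac{1}{\theta}\, d\theta = \begin{cases} -\ln x & \text{if } 0 < x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

The conditional density of $\theta$ given the sample is

$$k(\theta/x) = \frac{u(x, \theta)}{g(x)} = \begin{cases} -\dfrac{1}{\theta \ln x} & \text{if } 0 < x < \theta < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

Since the loss function is quadratic error, the Bayes' estimator of $\theta$ is

$$\begin{aligned} \widehat{\theta} &= E[\theta/x] \\ &= \int_{x}^{1} \theta\, k(\theta/x)\, d\theta \\ &= \int_{x}^{1} \theta \left(-\frac{1}{\theta \ln x}\right) d\theta \\ &= -\frac{1}{\ln x} \int_{x}^{1} d\theta \\ &= \frac{x-1}{\ln x}. \end{aligned}$$

Thus, the Bayes' estimator of $\theta$ based on one observation $X$ is

$$\widehat{\theta} = \frac{X-1}{\ln X}.$$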

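Example 15.25. Given $\theta$, the random variable $X$ has a binomial distribution with $n = 2$ and probability of success $\theta$. If the prior density of $\theta$ is

$$h(\theta) = \begin{cases} k & \text{if } \frac{1}{2} < \theta < 1 \\ 0 & \text{otherwise,} \end{cases}$$

what is the Bayes' estimate of $\theta$ for a squared error loss if $X = 1$?

Answer: Note that $\theta$ is uniform on the interval $\left(\frac{1}{2}, 1\right)$, hence $k = 2$. Therefore, the prior density of $\theta$ is

$$h(\theta) = \begin{cases} 2 & \text{if } \frac{1}{2} < \theta < 1 \\ 0 & \text{otherwise.} \end{cases}$$

The population density is given by

$$f(x/\theta) = \binom{n}{x} \theta^{x} (1-\theta)^{n-x} = \binom{2}{x} \theta^{x} (1-\theta)^{2-x}, \qquad x = 0, 1, 2.$$

The joint density of the sample and the parameter $\theta$ is

$$u(x, \theta) = h(\theta)\, f(x/\theta) = 2 \binom{2}{x} \theta^{x} (1-\theta)^{2-x},$$

where $\frac{1}{2} < \theta < 1$ and $x = 0, 1, 2$. The marginal density of the sample is given by

$$g(x) = \int_{\frac{1}{2}}^{1} u(x, \theta)\, d\theta.$$

This integral is easy to evaluate if we substitute $X = 1$ now. Hence

$$\begin{aligned} g(1) &= \int_{\frac{1}{2}}^{1} 2 \binom{2}{1} \theta (1-\theta)\, d\theta \\ &= \int_{\frac{1}{2}}^{1} \left(4\theta - 4\theta^{2}\right) d\theta \\ &= 4 \left[\frac{\theta^{2}}{2} - \frac{\theta^{3}}{3}\right]_{\frac{1}{2}}^{1} \\ &= \frac{2}{3} \left[3\theta^{2} - 2\theta^{3}\right]_{\frac{1}{2}}^{1} \\ &= \frac{2}{3} \left[(3 - 2) - \left(\frac{3}{4} - \frac{2}{8}\right)\right] \\ &= \frac{1}{3}. \end{aligned}$$

Therefore, the posterior density of $\theta$ given $x = 1$ is

$$k(\theta/x = 1) = \frac{u(1, \theta)}{g(1)} = 12\, (\theta - \theta^{2}),$$

where $\frac{1}{2} < \theta < 1$. Since the loss function is quadratic error, the Bayes' estimate of $\theta$ is

$$\begin{aligned} \widehat{\theta} &= E[\theta/x = 1] \\ &= \int_{\frac{1}{2}}^{1} \theta\, k(\theta/x = 1)\, d\theta \\ &= \int_{\frac{1}{2}}^{1} 12\, \theta\, (\theta - \theta^{2})\, d\theta \\ &= \left[4\theta^{3} - 3\theta^{4}\right]_{\frac{1}{2}}^{1} \\ &= 1 - \frac{5}{16} \\ &= \frac{11}{16}. \end{aligned}$$

Hence, based on the sample of size one with $X = 1$, the Bayes' estimate of $\theta$ is $\frac{11}{16}$, that is

$$\widehat{\theta} = \frac{11}{16}.$$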
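The same calculation can be carried out exactly for any binomial observation under a uniform prior on an interval. The following sketch is an illustration under those assumptions; the function name `posterior_mean` and its signature are made up here. It expands $\theta^{x}(1-\theta)^{n-x}$ as a polynomial and integrates term by term with rational arithmetic; constant factors cancel in the ratio.

```python
from fractions import Fraction
from math import comb

def posterior_mean(x, n, lo=Fraction(1, 2), hi=Fraction(1)):
    """Posterior mean of theta for X ~ Binomial(n, theta), uniform prior on (lo, hi)."""
    # Polynomial expansion of theta^x * (1 - theta)^(n - x): exponent -> coefficient.
    kernel = {x + j: Fraction(comb(n - x, j) * (-1) ** j) for j in range(n - x + 1)}

    def integral(shift):
        # Exact integral of theta^shift * kernel(theta) over [lo, hi].
        return sum(c * (hi ** (k + shift + 1) - lo ** (k + shift + 1)) / (k + shift + 1)
                   for k, c in kernel.items())

    return integral(1) / integral(0)

print(posterior_mean(1, 2))  # setting of Example 15.25
```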

The following theorem helps us to evaluate the Bayes estimate when the loss function is absolute error loss. This theorem is based on the fact that a function $\phi(c) = E\big[\,|X - c|\,\big]$ is minimum if $c$ is the median of $X$.

Theorem 15.4. Let $X_1, X_2, ..., X_n$ be a random sample from a distribution with density $f(x/\theta)$, where $\theta$ is the unknown parameter to be estimated. If the loss function is absolute error, then the Bayes estimator $\widehat{\theta}$ of the parameter $\theta$ is given by

$$\widehat{\theta} = \text{median of } k(\theta/x_1, x_2, ..., x_n),$$

where $k(\theta/x_1, x_2, ..., x_n)$ is the posterior distribution of $\theta$.

The following are some examples to illustrate the above theorem.

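Example 15.26. Given $\theta$, the random variable $X$ has a binomial distribution with $n = 3$ and probability of success $\theta$. If the prior density of $\theta$ is

$$h(\theta) = \begin{cases} k & \text{if } \frac{1}{2} < \theta < 1 \\ 0 & \text{otherwise,} \end{cases}$$

what is the Bayes' estimate of $\theta$ for an absolute difference error loss if the sample consists of one observation $x = 3$?

Answer: Since the prior density of $\theta$ is

$$h(\theta) = \begin{cases} 2 & \text{if } \frac{1}{2} < \theta < 1 \\ 0 & \text{otherwise,} \end{cases}$$

and the population density is

$$f(x/\theta) = \binom{3}{x} \theta^{x} (1-\theta)^{3-x},$$

the joint density of the sample and the parameter is given by

$$u(3, \theta) = h(\theta)\, f(3/\theta) = 2\, \theta^{3},$$

where $\frac{1}{2} < \theta < 1$. The marginal density of the sample (at $x = 3$) is given by

$$g(3) = \int_{\frac{1}{2}}^{1} u(3, \theta)\, d\theta = \int_{\frac{1}{2}}^{1} 2\, \theta^{3}\, d\theta = \left[\frac{\theta^{4}}{2}\right]_{\frac{1}{2}}^{1} = \frac{15}{32}.$$

Therefore, the conditional density of $\theta$ given $X = 3$ is

$$k(\theta/x = 3) = \frac{u(3, \theta)}{g(3)} = \begin{cases} \dfrac{64}{15}\, \theta^{3} & \text{if } \frac{1}{2} < \theta < 1 \\ 0 & \text{elsewhere.} \end{cases}$$

Since the loss function is absolute error, the Bayes' estimator is the median of the probability density function $k(\theta/x = 3)$. That is

$$\frac{1}{2} = \int_{\frac{1}{2}}^{\widehat{\theta}} \frac{64}{15}\, \theta^{3}\, d\theta = \frac{64}{60} \left[\theta^{4}\right]_{\frac{1}{2}}^{\widehat{\theta}} = \frac{64}{60} \left[\widehat{\theta}^{\,4} - \frac{1}{16}\right].$$

Solving the above equation for $\widehat{\theta}$, we get

$$\widehat{\theta} = \sqrt[4]{\frac{17}{32}} = 0.8537.$$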
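When the median equation has no convenient closed form, it can be solved numerically. The sketch below applies plain bisection to the posterior distribution function of Example 15.26; the function names and tolerance are assumptions of this illustration, and the closed form $\sqrt[4]{17/32}$ is used only as a check.

```python
def posterior_cdf(theta):
    """Distribution function of k(theta / x = 3) = (64/15) * theta^3 on (1/2, 1)."""
    return (16.0 / 15.0) * (theta ** 4 - 1.0 / 16.0)

def bisect_median(cdf, lo, hi, tol=1e-10):
    """Find the point where an increasing cdf equals 1/2 by bisection on [lo, hi]."""
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if cdf(mid) < 0.5:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

median = bisect_median(posterior_cdf, 0.5, 1.0)
print(median)
```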

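Example 15.27. Suppose the prior distribution of $\theta$ is uniform over the interval $(2, 5)$. Given $\theta$, $X$ is uniform over the interval $(0, \theta)$. What is the Bayes' estimator of $\theta$ for absolute error loss if $X = 1$?

Answer: Since the prior density of $\theta$ is

$$h(\theta) = \begin{cases} \dfrac{1}{3} & \text{if } 2 < \theta < 5 \\ 0 & \text{otherwise,} \end{cases}$$

and the population density is

$$f(x/\theta) = \begin{cases} \dfrac{1}{\theta} & \text{if } 0 < x < \theta \\ 0 & \text{elsewhere,} \end{cases}$$

the joint density of the sample and the parameter is given by

$$u(x, \theta) = h(\theta)\, f(x/\theta) = \frac{1}{3\theta},$$

where $2 < \theta < 5$ and $0 < x < \theta$. The marginal density of the sample (at $x = 1$) is given by

$$g(1) = \int_{1}^{5} u(1, \theta)\, d\theta = \int_{1}^{2} u(1, \theta)\, d\theta + \int_{2}^{5} u(1, \theta)\, d\theta = \int_{2}^{5} \frac{1}{3\theta}\, d\theta = \frac{1}{3} \ln\left(\frac{5}{2}\right).$$

Therefore, the conditional density of $\theta$ given the sample $x = 1$ is

$$k(\theta/x = 1) = \frac{u(1, \theta)}{g(1)} = \frac{1}{\theta \ln\left(\frac{5}{2}\right)}.$$

Since the loss function is absolute error, the Bayes estimate of $\theta$ is the median of $k(\theta/x = 1)$. Hence

$$\frac{1}{2} = \int_{2}^{\widehat{\theta}} \frac{1}{\theta \ln\left(\frac{5}{2}\right)}\, d\theta = \frac{1}{\ln\left(\frac{5}{2}\right)} \ln\left(\frac{\widehat{\theta}}{2}\right).$$

Solving for $\widehat{\theta}$, we get

$$\widehat{\theta} = \sqrt{10} = 3.16.$$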

Example 15.28. What is the basic principle of Bayesian estimation?

Answer: The basic principle behind the Bayesian estimation method consists of choosing a value of the parameter $\theta$ for which the observed data have as high a posterior probability $k(\theta/x_1, x_2, ..., x_n)$ of $\theta$ as possible subject to a loss function.

15.4. Review Exercises

1. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution with a probability density function

$$f(x;\theta) = \begin{cases} \dfrac{1}{2\theta} & \text{if } -\theta < x < \theta \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < \theta$ is a parameter. Using the moment method find an estimator for the parameter $\theta$.

2. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution with a probability density function

$$f(x;\theta) = \begin{cases} (\theta + 1)\, x^{-\theta-2} & \text{if } 1 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < \theta$ is a parameter. Using the moment method find an estimator for the parameter $\theta$.

3. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution with a probability density function

$$f(x;\theta) = \begin{cases} \theta^{2}\, x\, e^{-\theta x} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < \theta$ is a parameter. Using the moment method find an estimator for the parameter $\theta$.

4. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution with a probability density function

$$f(x;\theta) = \begin{cases} \theta\, x^{\theta-1} & \text{if } 0 < x < 1 \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < \theta$ is a parameter. Using the maximum likelihood method find an estimator for the parameter $\theta$.

5. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution with a probability density function

$$f(x;\theta) = \begin{cases} (\theta + 1)\, x^{-\theta-2} & \text{if } 1 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < \theta$ is a parameter. Using the maximum likelihood method find an estimator for the parameter $\theta$.

6. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution with a probability density function

$$f(x;\theta) = \begin{cases} \theta^{2}\, x\, e^{-\theta x} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < \theta$ is a parameter. Using the maximum likelihood method find an estimator for the parameter $\theta$.

7. Let $X_1, X_2, X_3, X_4$ be a random sample from a distribution with density function

$$f(x;\beta) = \begin{cases} \dfrac{1}{\beta}\, e^{-\frac{(x-4)}{\beta}} & \text{for } x > 4 \\ 0 & \text{otherwise,} \end{cases}$$

where $\beta > 0$. If the data from this random sample are 8.2, 9.1, 10.6 and 4.9, respectively, what is the maximum likelihood estimate of $\beta$?

8. Given $\theta$, the random variable $X$ has a binomial distribution with $n = 2$ and probability of success $\theta$. If the prior density of $\theta$ is

$$h(\theta) = \begin{cases} k & \text{if } \frac{1}{2} < \theta < 1 \\ 0 & \text{otherwise,} \end{cases}$$

what is the Bayes' estimate of $\theta$ for a squared error loss if the sample consists of $x_1 = 1$ and $x_2 = 2$?

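9. Suppose two observations were taken of a random variable $X$ which yielded the values 2 and 3. The density function for $X$ is

$$f(x/\theta) = \begin{cases} \dfrac{1}{\theta} & \text{if } 0 < x < \theta \\ 0 & \text{otherwise,} \end{cases}$$

and the prior distribution for the parameter $\theta$ is

$$h(\theta) = \begin{cases} 3\, \theta^{-4} & \text{if } \theta > 1 \\ 0 & \text{otherwise.} \end{cases}$$

If the loss function is quadratic, then what is the Bayes' estimate for $\theta$?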

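10. The Pareto distribution is often used in the study of incomes and has the cumulative density function

$$F(x;\alpha,\theta) = \begin{cases} 1 - \left(\dfrac{\alpha}{x}\right)^{\theta} & \text{if } \alpha \le x \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < \alpha < \infty$ and $1 < \theta < \infty$ are parameters. Find the maximum likelihood estimates of $\alpha$ and $\theta$ based on a sample of size 5 with values 3, 5, 2, 7, 8.

11. The Pareto distribution is often used in the study of incomes and has the cumulative density function

$$F(x;\alpha,\theta) = \begin{cases} 1 - \left(\dfrac{\alpha}{x}\right)^{\theta} & \text{if } \alpha \le x \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < \alpha < \infty$ and $1 < \theta < \infty$ are parameters. Using the moment method find estimates of $\alpha$ and $\theta$ based on a sample of size 5 with values 3, 5, 2, 7, 8.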

12. Suppose one observation was taken of a random variable X which yielded

the value 2. The density function for Xis

f(x/µ ) = 1

p2⇡ e 1

2(xµ) 2  1 < x < 1,

and prior distribution of µis

h(µ) = 1

p2⇡ e 1

2µ 2  1 <µ< 1.

Probability and Mathematical Statistics 449

If the loss function is quadratic, then what is the Bayes' estimate for µ?

13. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution with probability density

$$f(x) = \begin{cases} \dfrac{1}{\theta} & \text{if } 2\theta \le x \le 3\theta \\ 0 & \text{otherwise,} \end{cases}$$

where $\theta > 0$. What is the maximum likelihood estimator of $\theta$?

14. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution with probability density

$$f(x) = \begin{cases} 1 - \theta^2 & \text{if } 0 \le x \le \dfrac{1}{1-\theta^2} \\ 0 & \text{otherwise,} \end{cases}$$

where $\theta > 0$. What is the maximum likelihood estimator of $\theta$?

15. Given $\theta$, the random variable $X$ has a binomial distribution with $n = 3$ and probability of success $\theta$. If the prior density of $\theta$ is

$$h(\theta) = \begin{cases} k & \text{if } \frac{1}{2} < \theta < 1 \\ 0 & \text{otherwise,} \end{cases}$$

what is the Bayes' estimate of $\theta$ for an absolute difference error loss if the sample consists of one observation $x = 1$?

16. Suppose the random variable $X$ has the cumulative density function $F(x)$. Show that the expected value of the random variable $(X - c)^2$ is minimum if $c$ equals the expected value of $X$.

17. Suppose the continuous random variable $X$ has the cumulative density function $F(x)$. Show that the expected value of the random variable $|X - c|$ is minimum if $c$ equals the median of $X$ (that is, $F(c) = 0.5$).

18. Eight independent trials are conducted of a given system with the following results: S, F, S, F, S, S, S, S, where S denotes success and F denotes failure. What is the maximum likelihood estimate of the probability of successful operation $p$?

19. What is the maximum likelihood estimate of if the 5 values 4

5, 2

3, 1,

2, 5

4were drawn from the population for which f (x ; ) = 1

2(1 +  ) 5  x

2  ?


20. If a sample of ﬁve values of X is taken from the population for which

f(x ; t ) = 2( t1)tx , what is the maximum likelihood estimator of t?

21. A sample of size $n$ is drawn from a gamma distribution

$$f(x;\beta) = \begin{cases} \dfrac{x^3\, e^{-\frac{x}{\beta}}}{6\,\beta^4} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise.} \end{cases}$$

What is the maximum likelihood estimator of $\beta$?

22. The probability density function of the random variable X is deﬁned by

f(x ;  ) =  1 2

3+p x if 0 x1

0 otherwise.

What is the maximum likelihood estimate of the parameter  based on two

independent observations x1 = 1

4and x 2 = 9

16 ?

23. Let X1 , X2 , ..., Xn be a random sample from a distribution with density

function f (x ; ) = 

2e |x µ| . What is the maximum likelihood estimator of

?

24. Suppose $X_1, X_2, ...$ are independent random variables, each with probability of success $p$ and probability of failure $1 - p$, where $0 \le p \le 1$. Let $N$ be the number of observations needed to obtain the first success. What is the maximum likelihood estimator of $p$ in terms of $N$?

25. Let $X_1, X_2, X_3$ and $X_4$ be a random sample from the discrete distribution $X$ such that

$$P(X = x) = \begin{cases} \dfrac{\theta^{2x}\, e^{-\theta^2}}{x!} & \text{for } x = 0, 1, 2, ..., \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $\theta > 0$. If the data are 17, 10, 32, 5, what is the maximum likelihood estimate of $\theta$?

26. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a population with a probability density function

$$f(x;\alpha,\beta) = \begin{cases} \dfrac{\beta^{\alpha}}{\Gamma(\alpha)}\, x^{\alpha-1}\, e^{-\beta x} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $\alpha$ and $\beta$ are parameters. Using the moment method find the estimators for the parameters $\alpha$ and $\beta$.

27. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a population distribution with the probability density function

$$f(x;p) = \binom{10}{x}\, p^x (1-p)^{10-x}$$

for $x = 0, 1, ..., 10$, where $p$ is a parameter. Find the Fisher information in the sample about the parameter $p$.

28. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a population distribution with the probability density function

$$f(x;\theta) = \begin{cases} \theta^2\, x\, e^{-\theta x} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < \theta$ is a parameter. Find the Fisher information in the sample about the parameter $\theta$.

29. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a population distribution with the probability density function

$$f(x;\mu,\sigma^2) = \begin{cases} \dfrac{1}{x\,\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{\ln(x)-\mu}{\sigma}\right)^2} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $-\infty < \mu < \infty$ and $0 < \sigma^2 < \infty$ are unknown parameters. Find the Fisher information matrix in the sample about the parameters $\mu$ and $\sigma^2$.

30. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a population distribution with the probability density function

$$f(x;\mu,\lambda) = \begin{cases} \sqrt{\dfrac{\lambda}{2\pi}}\; x^{-\frac{3}{2}}\, e^{-\frac{\lambda (x-\mu)^2}{2\mu^2 x}} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < \mu < \infty$ and $0 < \lambda < \infty$ are unknown parameters. Find the Fisher information matrix in the sample about the parameters $\mu$ and $\lambda$.

31. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution with a probability density function

$$f(x) = \begin{cases} \dfrac{1}{\Gamma(\alpha)\,\theta^{\alpha}}\, x^{\alpha-1}\, e^{-\frac{x}{\theta}} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $\alpha > 0$ and $\theta > 0$ are parameters. Using the moment method find estimators for the parameters $\alpha$ and $\theta$.

32. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution with a probability density function

$$f(x;\theta) = \frac{1}{\pi\left[1 + (x-\theta)^2\right]}, \qquad -\infty < x < \infty,$$

where $0 < \theta$ is a parameter. Using the maximum likelihood method find an estimator for the parameter $\theta$.

33. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution with a probability density function

$$f(x;\theta) = \frac{1}{2}\, e^{-|x-\theta|}, \qquad -\infty < x < \infty,$$

where $0 < \theta$ is a parameter. Using the maximum likelihood method find an estimator for the parameter $\theta$.

34. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a population distribution with the probability density function

$$f(x;\lambda) = \begin{cases} \dfrac{\lambda^x\, e^{-\lambda}}{x!} & \text{if } x = 0, 1, ..., \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $\lambda > 0$ is an unknown parameter. Find the Fisher information matrix in the sample about the parameter $\lambda$.

35. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a population distribution with the probability density function

$$f(x;p) = \begin{cases} (1-p)^{x-1}\, p & \text{if } x = 1, ..., \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < p < 1$ is an unknown parameter. Find the Fisher information matrix in the sample about the parameter $p$.

36. Let $X_1, X_2, ..., X_n$ be a random sample from a population $X$ having the probability density function

$$f(x;\theta) = \begin{cases} \dfrac{2}{\theta^2}\,(\theta - x) & \text{if } 0 \le x \le \theta \\ 0 & \text{otherwise,} \end{cases}$$

where $\theta > 0$ is a parameter. Find an estimator for $\theta$ using the moment method.

37. A box contains 50 red and blue balls out of which ✓ are red. A sample

of 30 balls is to be selected without replacement. If X denotes the number

of red balls in the sample, then ﬁnd an estimator for ✓ using the moment

method.


Chapter 16

CRITERIA FOR EVALUATING THE GOODNESS OF ESTIMATORS

We have seen in Chapter 15 that, in general, different parameter estimation methods yield different estimators. For example, if $X \sim UNIF(0,\theta)$ and $X_1, X_2, ..., X_n$ is a random sample from the population $X$, then the estimator of $\theta$ obtained by the moment method is

$$\hat{\theta}_{MM} = 2\,\overline{X},$$

whereas the estimator obtained by the maximum likelihood method is

$$\hat{\theta}_{ML} = X_{(n)},$$

where $\overline{X}$ and $X_{(n)}$ are the sample average and the $n$th order statistic, respectively. Now the question arises: which of the two estimators is better? Thus, we need some criteria to evaluate the goodness of an estimator. Some well known criteria for evaluating the goodness of an estimator are: (1) Unbiasedness, (2) Efficiency and Relative Efficiency, (3) Uniform Minimum Variance Unbiasedness, (4) Sufficiency, and (5) Consistency.

In this chapter, we shall examine only the first four criteria in detail. The concepts of unbiasedness, efficiency and sufficiency were introduced by Sir Ronald Fisher.

16.1. The Unbiased Estimator

Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a population with probability density function $f(x;\theta)$. An estimator $\hat{\theta}$ of $\theta$ is a function of the random variables $X_1, X_2, ..., X_n$ which is free of the parameter $\theta$. An estimate is a realized value of an estimator that is obtained when a sample is actually taken.

Definition 16.1. An estimator $\hat{\theta}$ of $\theta$ is said to be an unbiased estimator of $\theta$ if and only if

$$E\left(\hat{\theta}\right) = \theta.$$

If $\hat{\theta}$ is not unbiased, then it is called a biased estimator of $\theta$.

An estimator of a parameter may not equal the actual value of the parameter for every realization of the sample $X_1, X_2, ..., X_n$, but if it is unbiased then on average it will equal the parameter.

Example 16.1. Let $X_1, X_2, ..., X_n$ be a random sample from a normal population with mean $\mu$ and variance $\sigma^2 > 0$. Is the sample mean $\overline{X}$ an unbiased estimator of the parameter $\mu$?

Answer: Since each $X_i \sim N(\mu, \sigma^2)$, we have

$$\overline{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right).$$

That is, the sample mean is normal with mean $\mu$ and variance $\frac{\sigma^2}{n}$. Thus

$$E\left(\overline{X}\right) = \mu.$$

Therefore, the sample mean $\overline{X}$ is an unbiased estimator of $\mu$.

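Example 16.2. Let $X_1, X_2, ..., X_n$ be a random sample from a normal population with mean $\mu$ and variance $\sigma^2 > 0$. What is the maximum likelihood estimator of $\sigma^2$? Is this maximum likelihood estimator an unbiased estimator of the parameter $\sigma^2$?

Answer: In Example 15.13, we have shown that the maximum likelihood estimator of $\sigma^2$ is

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2.$$

Now, we examine the unbiasedness of this estimator:

$$\begin{aligned}
E\left(\hat{\sigma}^2\right) &= E\left[\frac{1}{n} \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2\right] \\
&= E\left[\frac{n-1}{n}\, \frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2\right] \\
&= \frac{n-1}{n}\, E\left[\frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2\right] \\
&= \frac{n-1}{n}\, E\left[S^2\right] \\
&= \frac{\sigma^2}{n}\, E\left[\frac{n-1}{\sigma^2}\, S^2\right] \qquad \left(\text{since } \frac{n-1}{\sigma^2}\, S^2 \sim \chi^2(n-1)\right) \\
&= \frac{\sigma^2}{n}\, E\left[\chi^2(n-1)\right] \\
&= \frac{\sigma^2}{n}\,(n-1) \\
&= \frac{n-1}{n}\,\sigma^2 \\
&\ne \sigma^2.
\end{aligned}$$

Therefore, the maximum likelihood estimator of $\sigma^2$ is a biased estimator.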

Next, in the following example, we show that the sample variance $S^2$ given by the expression

$$S^2 = \frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2$$

is an unbiased estimator of the population variance $\sigma^2$ irrespective of the population distribution.

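Example 16.3. Let $X_1, X_2, ..., X_n$ be a random sample from a population with mean $\mu$ and variance $\sigma^2 > 0$. Is the sample variance $S^2$ an unbiased estimator of the population variance $\sigma^2$?

Answer: Note that the distribution of the population is not given. However, we are given $E(X_i) = \mu$ and $E[(X_i - \mu)^2] = \sigma^2$. In order to find $E\left(S^2\right)$, we need $E\left(\overline{X}\right)$ and $E\left(\overline{X}^2\right)$. Thus we proceed to find these two expected values. Consider

$$E\left(\overline{X}\right) = E\left(\frac{X_1 + X_2 + \cdots + X_n}{n}\right) = \frac{1}{n} \sum_{i=1}^{n} E(X_i) = \frac{1}{n} \sum_{i=1}^{n} \mu = \mu.$$

Similarly,

$$Var\left(\overline{X}\right) = Var\left(\frac{X_1 + X_2 + \cdots + X_n}{n}\right) = \frac{1}{n^2} \sum_{i=1}^{n} Var(X_i) = \frac{1}{n^2} \sum_{i=1}^{n} \sigma^2 = \frac{\sigma^2}{n}.$$

Therefore

$$E\left(\overline{X}^2\right) = Var\left(\overline{X}\right) + E\left(\overline{X}\right)^2 = \frac{\sigma^2}{n} + \mu^2.$$

Consider

$$\begin{aligned}
E\left(S^2\right) &= E\left[\frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2\right] \\
&= \frac{1}{n-1}\, E\left[\sum_{i=1}^{n} \left(X_i^2 - 2\,\overline{X}\,X_i + \overline{X}^2\right)\right] \\
&= \frac{1}{n-1}\, E\left[\sum_{i=1}^{n} X_i^2 - n\,\overline{X}^2\right] \\
&= \frac{1}{n-1} \left\{\sum_{i=1}^{n} E\left[X_i^2\right] - E\left[n\,\overline{X}^2\right]\right\} \\
&= \frac{1}{n-1} \left[n\left(\sigma^2 + \mu^2\right) - n\left(\mu^2 + \frac{\sigma^2}{n}\right)\right] \\
&= \frac{1}{n-1} \left[(n-1)\,\sigma^2\right] \\
&= \sigma^2.
\end{aligned}$$

Therefore, the sample variance $S^2$ is an unbiased estimator of the population variance $\sigma^2$.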
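The bias of the estimator with divisor $n$ and the unbiasedness of $S^2$ can be checked empirically. The following is only an illustrative sketch: the choice of a uniform(0, 1) population (variance 1/12), the sample size, the number of trials and the function names are assumptions made for this illustration and are not part of the text.

```python
import random


def variance_estimators(sample):
    """Return (divisor-n estimate, divisor-(n-1) estimate) for one sample."""
    n = len(sample)
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)
    return ss / n, ss / (n - 1)


def average_estimates(n, trials, seed=0):
    """Average both estimators over many uniform(0, 1) samples of size n."""
    rng = random.Random(seed)
    total_mle = 0.0
    total_s2 = 0.0
    for _ in range(trials):
        sample = [rng.random() for _ in range(n)]
        mle, s2 = variance_estimators(sample)
        total_mle += mle
        total_s2 += s2
    return total_mle / trials, total_s2 / trials


if __name__ == "__main__":
    n = 5
    true_var = 1 / 12  # variance of the assumed uniform(0, 1) population
    avg_mle, avg_s2 = average_estimates(n, trials=200_000)
    # The divisor-n average should be near (n-1)/n * sigma^2, S^2 near sigma^2.
    print("divisor n    :", avg_mle, "target", (n - 1) / n * true_var)
    print("divisor n - 1:", avg_s2, "target", true_var)
```

A non-normal population is used on purpose, since Example 16.3 does not assume normality.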

Example 16.4. Let $X$ be a random variable with mean 2. Let $\hat{\theta}_1$ and $\hat{\theta}_2$ be unbiased estimators of the second and third moments, respectively, of $X$ about the origin. Find an unbiased estimator of the third moment of $X$ about its mean in terms of $\hat{\theta}_1$ and $\hat{\theta}_2$.

Answer: Since $\hat{\theta}_1$ and $\hat{\theta}_2$ are the unbiased estimators of the second and third moments of $X$ about the origin, we get

$$E\left(\hat{\theta}_1\right) = E\left(X^2\right) \quad \text{and} \quad E\left(\hat{\theta}_2\right) = E\left(X^3\right).$$

The third moment of $X$ about its mean is

$$\begin{aligned}
E\left[(X-2)^3\right] &= E\left[X^3 - 6X^2 + 12X - 8\right] \\
&= E\left[X^3\right] - 6\,E\left[X^2\right] + 12\,E[X] - 8 \\
&= E\left(\hat{\theta}_2\right) - 6\,E\left(\hat{\theta}_1\right) + 24 - 8 \\
&= E\left(\hat{\theta}_2 - 6\,\hat{\theta}_1 + 16\right).
\end{aligned}$$

Thus, the unbiased estimator of the third moment of $X$ about its mean is $\hat{\theta}_2 - 6\,\hat{\theta}_1 + 16$.

Example 16.5. Let $X_1, X_2, ..., X_5$ be a sample of size 5 from the uniform distribution on the interval $(0, \theta)$, where $\theta$ is unknown. Let the estimator of $\theta$ be $k\,X_{max}$, where $k$ is some constant and $X_{max}$ is the largest observation. In order for $k\,X_{max}$ to be an unbiased estimator, what should be the value of the constant $k$?

Answer: The probability density function of $X_{max}$ is given by

$$g(x) = \frac{5!}{4!\,0!}\,[F(x)]^4 f(x) = 5 \left(\frac{x}{\theta}\right)^4 \frac{1}{\theta} = \frac{5}{\theta^5}\, x^4.$$

If $k\,X_{max}$ is an unbiased estimator of $\theta$, then

$$\begin{aligned}
\theta &= E(k\,X_{max}) \\
&= k\,E(X_{max}) \\
&= k \int_0^{\theta} x\, g(x)\, dx \\
&= k \int_0^{\theta} \frac{5}{\theta^5}\, x^5\, dx \\
&= \frac{5}{6}\, k\,\theta.
\end{aligned}$$

Hence,

$$k = \frac{6}{5}.$$
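The value $k = \frac{6}{5}$ from Example 16.5 can be checked by simulation. This is a rough sketch only: the value $\theta = 2$, the number of trials and the function name `mean_scaled_max` are arbitrary choices made for illustration.

```python
import random


def mean_scaled_max(theta, k, n=5, trials=200_000, seed=0):
    """Monte Carlo average of k * X_max for uniform(0, theta) samples of size n."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += k * max(rng.uniform(0, theta) for _ in range(n))
    return total / trials


if __name__ == "__main__":
    theta = 2.0  # assumed "true" parameter, chosen only for the illustration
    print("k = 6/5:", mean_scaled_max(theta, 6 / 5))  # should be close to theta
    print("k = 1  :", mean_scaled_max(theta, 1.0))    # should be close to 5*theta/6
```

With $k = 1$ the average falls below $\theta$, which reflects the fact that the largest observation alone underestimates the endpoint of the interval.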

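Example 16.6. Let $X_1, X_2, ..., X_n$ be a sample of size $n$ from a distribution with unknown mean $-\infty < \mu < \infty$, and unknown variance $\sigma^2 > 0$. Show that the statistics $\overline{X}$ and

$$Y = \frac{X_1 + 2X_2 + \cdots + nX_n}{\frac{n(n+1)}{2}}$$

are both unbiased estimators of $\mu$. Further, show that $Var\left(\overline{X}\right) < Var(Y)$.

Answer: First, we show that $\overline{X}$ is an unbiased estimator of $\mu$:

$$E\left(\overline{X}\right) = E\left(\frac{X_1 + X_2 + \cdots + X_n}{n}\right) = \frac{1}{n} \sum_{i=1}^{n} E(X_i) = \frac{1}{n} \sum_{i=1}^{n} \mu = \mu.$$

Hence, the sample mean $\overline{X}$ is an unbiased estimator of the population mean irrespective of the distribution of $X$. Next, we show that $Y$ is also an unbiased estimator of $\mu$:

$$\begin{aligned}
E(Y) &= E\left(\frac{X_1 + 2X_2 + \cdots + nX_n}{\frac{n(n+1)}{2}}\right) \\
&= \frac{2}{n(n+1)} \sum_{i=1}^{n} i\,E(X_i) \\
&= \frac{2}{n(n+1)} \sum_{i=1}^{n} i\,\mu \\
&= \frac{2}{n(n+1)}\, \mu\, \frac{n(n+1)}{2} \\
&= \mu.
\end{aligned}$$

Hence, $\overline{X}$ and $Y$ are both unbiased estimators of the population mean irrespective of the distribution of the population. The variance of $\overline{X}$ is given by

$$\begin{aligned}
Var\left(\overline{X}\right) &= Var\left[\frac{X_1 + X_2 + \cdots + X_n}{n}\right] \\
&= \frac{1}{n^2}\, Var\left[X_1 + X_2 + \cdots + X_n\right] \\
&= \frac{1}{n^2} \sum_{i=1}^{n} Var[X_i] \\
&= \frac{\sigma^2}{n}.
\end{aligned}$$

Similarly, the variance of $Y$ can be calculated as follows:

$$\begin{aligned}
Var[Y] &= Var\left[\frac{X_1 + 2X_2 + \cdots + nX_n}{\frac{n(n+1)}{2}}\right] \\
&= \frac{4}{n^2(n+1)^2}\, Var\left[1\,X_1 + 2\,X_2 + \cdots + n\,X_n\right] \\
&= \frac{4}{n^2(n+1)^2} \sum_{i=1}^{n} Var[i\,X_i] \\
&= \frac{4}{n^2(n+1)^2} \sum_{i=1}^{n} i^2\, Var[X_i] \\
&= \frac{4}{n^2(n+1)^2}\, \sigma^2 \sum_{i=1}^{n} i^2 \\
&= \sigma^2\, \frac{4}{n^2(n+1)^2}\, \frac{n(n+1)(2n+1)}{6} \\
&= \frac{2}{3}\, \frac{2n+1}{n+1}\, \frac{\sigma^2}{n} \\
&= \frac{2}{3}\, \frac{2n+1}{n+1}\, Var\left(\overline{X}\right).
\end{aligned}$$

Since $\frac{2}{3}\, \frac{2n+1}{n+1} > 1$ for $n \ge 2$, we see that $Var\left(\overline{X}\right) < Var[Y]$. This shows that although the estimators $\overline{X}$ and $Y$ are both unbiased estimators of $\mu$, the variance of the sample mean $\overline{X}$ is smaller than the variance of $Y$.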
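The variance ratio derived in Example 16.6 can be confirmed with exact rational arithmetic. In this sketch each estimator is treated as a weighted sum of independent observations with common variance $\sigma^2$, so its variance is $\sigma^2$ times the sum of the squared weights; the function names are hypothetical.

```python
from fractions import Fraction


def variance_factor_mean(n):
    """Var(sample mean) / sigma^2: each of the n weights is 1/n."""
    return sum(Fraction(1, n) ** 2 for _ in range(n))


def variance_factor_weighted(n):
    """Var(Y) / sigma^2 for weights i / (n(n+1)/2), i = 1, ..., n."""
    total = Fraction(n * (n + 1), 2)
    return sum((Fraction(i) / total) ** 2 for i in range(1, n + 1))


for n in range(2, 7):
    ratio = variance_factor_weighted(n) / variance_factor_mean(n)
    closed_form = Fraction(2, 3) * Fraction(2 * n + 1, n + 1)
    assert ratio == closed_form  # matches (2/3)(2n+1)/(n+1)
    print(n, ratio)
```

For every $n \ge 2$ in the loop the printed ratio exceeds 1, in agreement with the inequality $Var\left(\overline{X}\right) < Var(Y)$.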

In statistics, between two unbiased estimators one prefers the estimator which has the minimum variance. This leads to our next topic. However, before we move to the next topic we complete this section with some known disadvantages of the notion of unbiasedness. The first disadvantage is that an unbiased estimator for a parameter may not exist. The second disadvantage is that the property of unbiasedness is not invariant under functional transformation, that is, if $\hat{\theta}$ is an unbiased estimator of $\theta$ and $g$ is a function, then $g(\hat{\theta})$ may not be an unbiased estimator of $g(\theta)$.

16.2. The Relatively Efficient Estimator

We have seen in Example 16.6 that the sample mean

$$\overline{X} = \frac{X_1 + X_2 + \cdots + X_n}{n}$$

and the statistic

$$Y = \frac{X_1 + 2X_2 + \cdots + nX_n}{1 + 2 + \cdots + n}$$

are both unbiased estimators of the population mean. However, we have also seen that

$$Var\left(\overline{X}\right) < Var(Y).$$

The following figure graphically illustrates the shape of the distributions of both the unbiased estimators.

If an unbiased estimator has a smaller variance or dispersion, then it has a greater chance of being close to the true parameter $\theta$. Therefore when two estimators of $\theta$ are both unbiased, one should pick the one with the smaller variance.

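Definition 16.2. Let $\hat{\theta}_1$ and $\hat{\theta}_2$ be two unbiased estimators of $\theta$. The estimator $\hat{\theta}_1$ is said to be more efficient than $\hat{\theta}_2$ if

$$Var\left(\hat{\theta}_1\right) < Var\left(\hat{\theta}_2\right).$$

The ratio $\eta$ given by

$$\eta\left(\hat{\theta}_1, \hat{\theta}_2\right) = \frac{Var\left(\hat{\theta}_2\right)}{Var\left(\hat{\theta}_1\right)}$$

is called the relative efficiency of $\hat{\theta}_1$ with respect to $\hat{\theta}_2$.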

Example 16.7. Let $X_1, X_2, X_3$ be a random sample of size 3 from a population with mean $\mu$ and variance $\sigma^2 > 0$. If the statistics $\overline{X}$ and $Y$ given by

$$Y = \frac{X_1 + 2X_2 + 3X_3}{6}$$

are two unbiased estimators of the population mean $\mu$, then which one is more efficient?

Answer: Since $E(X_i) = \mu$ and $Var(X_i) = \sigma^2$, we get

$$\begin{aligned}
E\left(\overline{X}\right) &= E\left(\frac{X_1 + X_2 + X_3}{3}\right) \\
&= \frac{1}{3}\left(E(X_1) + E(X_2) + E(X_3)\right) \\
&= \frac{1}{3}\, 3\mu \\
&= \mu
\end{aligned}$$

and

$$\begin{aligned}
E(Y) &= E\left(\frac{X_1 + 2X_2 + 3X_3}{6}\right) \\
&= \frac{1}{6}\left(E(X_1) + 2E(X_2) + 3E(X_3)\right) \\
&= \frac{1}{6}\, 6\mu \\
&= \mu.
\end{aligned}$$

Therefore both $\overline{X}$ and $Y$ are unbiased. Next we determine the variances of both estimators. They are given by

$$\begin{aligned}
Var\left(\overline{X}\right) &= Var\left(\frac{X_1 + X_2 + X_3}{3}\right) \\
&= \frac{1}{9}\left[Var(X_1) + Var(X_2) + Var(X_3)\right] \\
&= \frac{1}{9}\, 3\sigma^2 \\
&= \frac{12}{36}\,\sigma^2
\end{aligned}$$

and

$$\begin{aligned}
Var(Y) &= Var\left(\frac{X_1 + 2X_2 + 3X_3}{6}\right) \\
&= \frac{1}{36}\left[Var(X_1) + 4\,Var(X_2) + 9\,Var(X_3)\right] \\
&= \frac{1}{36}\, 14\sigma^2 \\
&= \frac{14}{36}\,\sigma^2.
\end{aligned}$$

Therefore

$$\frac{12}{36}\,\sigma^2 = Var\left(\overline{X}\right) < Var(Y) = \frac{14}{36}\,\sigma^2.$$

Hence, $\overline{X}$ is more efficient than the estimator $Y$. Further, the relative efficiency of $\overline{X}$ with respect to $Y$ is given by

$$\eta\left(\overline{X}, Y\right) = \frac{14}{12} = \frac{7}{6}.$$

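Example 16.8. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a population with density

$$f(x;\theta) = \begin{cases} \dfrac{1}{\theta}\, e^{-\frac{x}{\theta}} & \text{if } 0 \le x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $\theta > 0$ is a parameter. Are the estimators $X_1$ and $\overline{X}$ unbiased? Given $X_1$ and $\overline{X}$, which one is the more efficient estimator of $\theta$?

Answer: Since the population $X$ is exponential with parameter $\theta$, that is $X \sim EXP(\theta)$, its mean and variance are given by

$$E(X) = \theta \quad \text{and} \quad Var(X) = \theta^2.$$

Since $X_1, X_2, ..., X_n$ is a random sample from $X$, we see that the statistic $X_1 \sim EXP(\theta)$. Hence, the expected value of $X_1$ is $\theta$ and thus it is an unbiased estimator of the parameter $\theta$. Also, the sample mean is an unbiased estimator of $\theta$ since

$$E\left(\overline{X}\right) = \frac{1}{n} \sum_{i=1}^{n} E(X_i) = \frac{1}{n}\, n\theta = \theta.$$

Next, we compute the variances of the unbiased estimators $X_1$ and $\overline{X}$. It is easy to see that

$$Var(X_1) = \theta^2$$

and

$$\begin{aligned}
Var\left(\overline{X}\right) &= Var\left(\frac{X_1 + X_2 + \cdots + X_n}{n}\right) \\
&= \frac{1}{n^2} \sum_{i=1}^{n} Var(X_i) \\
&= \frac{1}{n^2}\, n\theta^2 \\
&= \frac{\theta^2}{n}.
\end{aligned}$$

Hence

$$\frac{\theta^2}{n} = Var\left(\overline{X}\right) < Var(X_1) = \theta^2.$$

Thus $\overline{X}$ is more efficient than $X_1$ and the relative efficiency of $\overline{X}$ with respect to $X_1$ is

$$\eta\left(\overline{X}, X_1\right) = \frac{\theta^2}{\frac{\theta^2}{n}} = n.$$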

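Example 16.9. Let $X_1, X_2, X_3$ be a random sample of size 3 from a population with density

$$f(x;\lambda) = \begin{cases} \dfrac{\lambda^x\, e^{-\lambda}}{x!} & \text{if } x = 0, 1, 2, ..., \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $\lambda$ is a parameter. Are the estimators given by

$$\hat{\lambda}_1 = \frac{1}{4}\left(X_1 + 2X_2 + X_3\right) \quad \text{and} \quad \hat{\lambda}_2 = \frac{1}{9}\left(4X_1 + 3X_2 + 2X_3\right)$$

unbiased? Given $\hat{\lambda}_1$ and $\hat{\lambda}_2$, which one is the more efficient estimator of $\lambda$? Find an unbiased estimator of $\lambda$ whose variance is smaller than the variances of $\hat{\lambda}_1$ and $\hat{\lambda}_2$.

Answer: Since each $X_i \sim POI(\lambda)$, we get

$$E(X_i) = \lambda \quad \text{and} \quad Var(X_i) = \lambda.$$

It is easy to see that

$$E\left(\hat{\lambda}_1\right) = \frac{1}{4}\left(E(X_1) + 2E(X_2) + E(X_3)\right) = \frac{1}{4}\, 4\lambda = \lambda,$$

and

$$E\left(\hat{\lambda}_2\right) = \frac{1}{9}\left(4E(X_1) + 3E(X_2) + 2E(X_3)\right) = \frac{1}{9}\, 9\lambda = \lambda.$$

Thus, both $\hat{\lambda}_1$ and $\hat{\lambda}_2$ are unbiased estimators of $\lambda$. Now we compute their variances to find out which one is more efficient. It is easy to note that

$$\begin{aligned}
Var\left(\hat{\lambda}_1\right) &= \frac{1}{16}\left(Var(X_1) + 4\,Var(X_2) + Var(X_3)\right) \\
&= \frac{1}{16}\, 6\lambda \\
&= \frac{6}{16}\,\lambda \\
&= \frac{486}{1296}\,\lambda,
\end{aligned}$$

and

$$\begin{aligned}
Var\left(\hat{\lambda}_2\right) &= \frac{1}{81}\left(16\,Var(X_1) + 9\,Var(X_2) + 4\,Var(X_3)\right) \\
&= \frac{1}{81}\, 29\lambda \\
&= \frac{29}{81}\,\lambda \\
&= \frac{464}{1296}\,\lambda.
\end{aligned}$$

Since

$$Var\left(\hat{\lambda}_2\right) < Var\left(\hat{\lambda}_1\right),$$

the estimator $\hat{\lambda}_2$ is more efficient than the estimator $\hat{\lambda}_1$. We have seen in Section 16.1 that the sample mean is always an unbiased estimator of the population mean irrespective of the population distribution. The variance of the sample mean always equals $\frac{1}{n}$ times the population variance, where $n$ denotes the sample size. Hence, we get

$$Var\left(\overline{X}\right) = \frac{\lambda}{3} = \frac{432}{1296}\,\lambda.$$

Therefore, we get

$$Var\left(\overline{X}\right) < Var\left(\hat{\lambda}_2\right) < Var\left(\hat{\lambda}_1\right).$$

Thus, the sample mean has even smaller variance than the two unbiased estimators given in this example.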
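The comparison in Example 16.9 only uses the weights of each linear estimator: a weighted sum of independent observations with common mean and common variance is unbiased for the mean when the weights add up to 1, and its variance is the common variance times the sum of squared weights. A small sketch with exact fractions follows; the helper names and the dictionary keys are made up for this illustration.

```python
from fractions import Fraction


def is_unbiased(weights):
    """A linear estimator sum(w_i * X_i) is unbiased for the mean iff sum(w_i) = 1."""
    return sum(weights) == 1


def variance_factor(weights):
    """Variance of sum(w_i * X_i) divided by the common variance of the X_i."""
    return sum(w * w for w in weights)


estimators = {
    "lambda_hat_1": [Fraction(1, 4), Fraction(2, 4), Fraction(1, 4)],
    "lambda_hat_2": [Fraction(4, 9), Fraction(3, 9), Fraction(2, 9)],
    "sample_mean": [Fraction(1, 3)] * 3,
}

for name, weights in estimators.items():
    # Prints the unbiasedness check and the variance as a multiple of lambda.
    print(name, is_unbiased(weights), variance_factor(weights))

# Relative efficiency of the sample mean with respect to lambda_hat_1.
print(variance_factor(estimators["lambda_hat_1"])
      / variance_factor(estimators["sample_mean"]))
```

The three variance factors come out as 3/8, 29/81 and 1/3, which are the fractions 486/1296, 464/1296 and 432/1296 of the example in lowest terms.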

In view of this example, we have now encountered a new problem: how to find an unbiased estimator which has the smallest variance among all unbiased estimators of a given parameter. We resolve this issue in the next section.

16.3. The Uniform Minimum Variance Unbiased Estimator

Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a population with probability density function $f(x;\theta)$. Recall that an estimator $\hat{\theta}$ of $\theta$ is a function of the random variables $X_1, X_2, ..., X_n$ which does not depend on $\theta$.

Definition 16.3. An unbiased estimator $\hat{\theta}$ of $\theta$ is said to be a uniform minimum variance unbiased estimator of $\theta$ if and only if

$$Var\left(\hat{\theta}\right) \le Var\left(\hat{T}\right)$$

for any unbiased estimator $\hat{T}$ of $\theta$.

If an estimator $\hat{\theta}$ is unbiased then the mean of this estimator is equal to the parameter $\theta$, that is

$$E\left(\hat{\theta}\right) = \theta$$

and the variance of $\hat{\theta}$ is

$$Var\left(\hat{\theta}\right) = E\left[\left(\hat{\theta} - E\left(\hat{\theta}\right)\right)^2\right] = E\left[\left(\hat{\theta} - \theta\right)^2\right].$$

This variance, if it exists, is a function of the unbiased estimator $\hat{\theta}$ and it has a minimum in the class of all unbiased estimators of $\theta$. Therefore we have an alternative definition of the uniform minimum variance unbiased estimator.

Definition 16.4. An unbiased estimator $\hat{\theta}$ of $\theta$ is said to be a uniform minimum variance unbiased estimator of $\theta$ if it minimizes the variance

$$E\left[\left(\hat{\theta} - \theta\right)^2\right].$$

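Example 16.10. Let $\hat{\theta}_1$ and $\hat{\theta}_2$ be unbiased estimators of $\theta$. Suppose $Var\left(\hat{\theta}_1\right) = 1$, $Var\left(\hat{\theta}_2\right) = 2$ and $Cov\left(\hat{\theta}_1, \hat{\theta}_2\right) = \frac{1}{2}$. What are the values of $c_1$ and $c_2$ for which $c_1 \hat{\theta}_1 + c_2 \hat{\theta}_2$ is an unbiased estimator of $\theta$ with minimum variance among unbiased estimators of this type?

Answer: We want $c_1 \hat{\theta}_1 + c_2 \hat{\theta}_2$ to be a minimum variance unbiased estimator of $\theta$. Then

$$\begin{aligned}
& E\left[c_1 \hat{\theta}_1 + c_2 \hat{\theta}_2\right] = \theta \\
\Rightarrow\ & c_1 E\left[\hat{\theta}_1\right] + c_2 E\left[\hat{\theta}_2\right] = \theta \\
\Rightarrow\ & c_1 \theta + c_2 \theta = \theta \\
\Rightarrow\ & c_1 + c_2 = 1 \\
\Rightarrow\ & c_2 = 1 - c_1.
\end{aligned}$$

Therefore

$$\begin{aligned}
Var\left[c_1 \hat{\theta}_1 + c_2 \hat{\theta}_2\right] &= c_1^2\, Var\left[\hat{\theta}_1\right] + c_2^2\, Var\left[\hat{\theta}_2\right] + 2\,c_1 c_2\, Cov\left(\hat{\theta}_1, \hat{\theta}_2\right) \\
&= c_1^2 + 2c_2^2 + c_1 c_2 \\
&= c_1^2 + 2(1-c_1)^2 + c_1(1-c_1) \\
&= 2(1-c_1)^2 + c_1 \\
&= 2 + 2c_1^2 - 3c_1.
\end{aligned}$$

Hence, the variance $Var\left[c_1 \hat{\theta}_1 + c_2 \hat{\theta}_2\right]$ is a function of $c_1$. Let us denote this function by $\phi(c_1)$, that is

$$\phi(c_1) := Var\left[c_1 \hat{\theta}_1 + c_2 \hat{\theta}_2\right] = 2 + 2c_1^2 - 3c_1.$$

Taking the derivative of $\phi(c_1)$ with respect to $c_1$, we get

$$\frac{d}{dc_1}\,\phi(c_1) = 4c_1 - 3.$$

Setting this derivative to zero and solving for $c_1$, we obtain

$$4c_1 - 3 = 0 \ \Rightarrow\ c_1 = \frac{3}{4}.$$

Therefore

$$c_2 = 1 - c_1 = 1 - \frac{3}{4} = \frac{1}{4}.$$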
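The minimization in Example 16.10 can be reproduced with exact arithmetic. The closed-form minimizer used in `best_c1` below is not stated in the text; it comes from setting the derivative of the general quadratic $c_1^2\,v_1 + (1-c_1)^2\,v_2 + 2c_1(1-c_1)\,\kappa$ to zero, where $v_1$, $v_2$ and $\kappa$ stand for the two variances and the covariance. The function and parameter names are hypothetical.

```python
from fractions import Fraction


def combo_variance(c1, var1=Fraction(1), var2=Fraction(2), cov=Fraction(1, 2)):
    """Variance of c1*theta1_hat + (1 - c1)*theta2_hat."""
    c2 = 1 - c1
    return c1 * c1 * var1 + c2 * c2 * var2 + 2 * c1 * c2 * cov


def best_c1(var1=Fraction(1), var2=Fraction(2), cov=Fraction(1, 2)):
    """Root of the derivative of combo_variance with respect to c1."""
    return (var2 - cov) / (var1 + var2 - 2 * cov)


c1 = best_c1()
print("c1 =", c1, " c2 =", 1 - c1, " minimum variance =", combo_variance(c1))

# Nearby weights give a larger variance, so the stationary point is a minimum.
for delta in (Fraction(-1, 10), Fraction(1, 10)):
    assert combo_variance(c1 + delta) > combo_variance(c1)
```

With the values of the example this gives $c_1 = \frac{3}{4}$ and $c_2 = \frac{1}{4}$, and the combined estimator has variance $\frac{7}{8}$, smaller than the variance of either $\hat{\theta}_1$ or $\hat{\theta}_2$ alone.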

In Example 16.10, we saw that if $\hat{\theta}_1$ and $\hat{\theta}_2$ are any two unbiased estimators of $\theta$, then $c\,\hat{\theta}_1 + (1-c)\,\hat{\theta}_2$ is also an unbiased estimator of $\theta$ for any $c \in \mathbb{R}$. Hence given two estimators $\hat{\theta}_1$ and $\hat{\theta}_2$,

$$C = \left\{ \hat{\theta} \ \middle|\ \hat{\theta} = c\,\hat{\theta}_1 + (1-c)\,\hat{\theta}_2,\ c \in \mathbb{R} \right\}$$

forms an uncountable class of unbiased estimators of $\theta$. When the variances of $\hat{\theta}_1$ and $\hat{\theta}_2$ are known along with their covariance, then, as in Example 16.10, we are able to determine the minimum variance unbiased estimator in the class $C$. If the variances of the estimators $\hat{\theta}_1$ and $\hat{\theta}_2$ are not known, then it is very difficult to find the minimum variance estimator even in the class of estimators $C$. Notice that $C$ is a subset of the class of all unbiased estimators and finding a minimum variance unbiased estimator in this class is a difficult task.

One way to find a uniform minimum variance unbiased estimator for a parameter is to use the Cramér-Rao lower bound or the Fisher information inequality.

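Theorem 16.1. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a population $X$ with probability density $f(x;\theta)$, where $\theta$ is a scalar parameter. Let $\hat{\theta}$ be any unbiased estimator of $\theta$. Suppose the likelihood function $L(\theta)$ is a differentiable function of $\theta$ and satisfies

$$\frac{d}{d\theta} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h(x_1, ..., x_n)\, L(\theta)\, dx_1 \cdots dx_n = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} h(x_1, ..., x_n)\, \frac{d}{d\theta} L(\theta)\, dx_1 \cdots dx_n \tag{1}$$

for any $h(x_1, ..., x_n)$ with $E(h(X_1, ..., X_n)) < \infty$. Then

$$Var\left(\hat{\theta}\right) \ge \frac{1}{E\left[\left(\dfrac{\partial \ln L(\theta)}{\partial \theta}\right)^2\right]}. \tag{CR1}$$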

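Proof: Since $L(\theta)$ is the joint probability density function of the sample $X_1, X_2, ..., X_n$,

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} L(\theta)\, dx_1 \cdots dx_n = 1. \tag{2}$$

Differentiating (2) with respect to $\theta$ we have

$$\frac{d}{d\theta} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} L(\theta)\, dx_1 \cdots dx_n = 0$$

and use of (1) with $h(x_1, ..., x_n) = 1$ yields

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \frac{d}{d\theta} L(\theta)\, dx_1 \cdots dx_n = 0. \tag{3}$$

Rewriting (3) as

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \frac{dL(\theta)}{d\theta}\, \frac{1}{L(\theta)}\, L(\theta)\, dx_1 \cdots dx_n = 0$$

we see that

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \frac{d \ln L(\theta)}{d\theta}\, L(\theta)\, dx_1 \cdots dx_n = 0.$$

Hence

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \theta\, \frac{d \ln L(\theta)}{d\theta}\, L(\theta)\, dx_1 \cdots dx_n = 0. \tag{4}$$

Since $\hat{\theta}$ is an unbiased estimator of $\theta$, we see that

$$E\left(\hat{\theta}\right) = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \hat{\theta}\, L(\theta)\, dx_1 \cdots dx_n = \theta. \tag{5}$$

Differentiating (5) with respect to $\theta$, we have

$$\frac{d}{d\theta} \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \hat{\theta}\, L(\theta)\, dx_1 \cdots dx_n = 1.$$

Again using (1) with $h(X_1, ..., X_n) = \hat{\theta}$, we have

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \hat{\theta}\, \frac{d}{d\theta} L(\theta)\, dx_1 \cdots dx_n = 1. \tag{6}$$

Rewriting (6) as

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \hat{\theta}\, \frac{dL(\theta)}{d\theta}\, \frac{1}{L(\theta)}\, L(\theta)\, dx_1 \cdots dx_n = 1$$

we have

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \hat{\theta}\, \frac{d \ln L(\theta)}{d\theta}\, L(\theta)\, dx_1 \cdots dx_n = 1. \tag{7}$$

From (4) and (7), we obtain

$$\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left(\hat{\theta} - \theta\right) \frac{d \ln L(\theta)}{d\theta}\, L(\theta)\, dx_1 \cdots dx_n = 1. \tag{8}$$

By the Cauchy-Schwarz inequality,

$$\begin{aligned}
1 &= \left( \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left(\hat{\theta} - \theta\right) \frac{d \ln L(\theta)}{d\theta}\, L(\theta)\, dx_1 \cdots dx_n \right)^2 \\
&\le \left( \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left(\hat{\theta} - \theta\right)^2 L(\theta)\, dx_1 \cdots dx_n \right) \cdot \left( \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} \left(\frac{d \ln L(\theta)}{d\theta}\right)^2 L(\theta)\, dx_1 \cdots dx_n \right) \\
&= Var\left(\hat{\theta}\right)\, E\left[\left(\frac{\partial \ln L(\theta)}{\partial \theta}\right)^2\right].
\end{aligned}$$

Therefore

$$Var\left(\hat{\theta}\right) \ge \frac{1}{E\left[\left(\dfrac{\partial \ln L(\theta)}{\partial \theta}\right)^2\right]}$$

and the proof of the theorem is now complete.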

If $L(\theta)$ is twice differentiable with respect to $\theta$, the inequality (CR1) can be stated equivalently as

$$Var\left(\hat{\theta}\right) \ge \frac{-1}{E\left[\dfrac{\partial^2 \ln L(\theta)}{\partial \theta^2}\right]}. \tag{CR2}$$

The inequalities (CR1) and (CR2) are known as the Cramér-Rao lower bound for the variance of $\hat{\theta}$ or the Fisher information inequality. The condition (1) interchanges the order of integration and differentiation. Therefore any distribution whose range depends on the value of the parameter is not covered by this theorem. Hence a distribution like the uniform distribution may not be analyzed using the Cramér-Rao lower bound.
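To see why the uniform distribution has to be excluded, one can compare the exact variance of an unbiased estimator with the expression that (CR1) would produce if it were applied formally. The sketch below relies on two standard facts that are not derived in this section and are taken as assumptions here: for a random sample of size $n$ from the uniform distribution on $(0, \theta)$, $E[X_{(n)}] = \frac{n}{n+1}\,\theta$ and $E[X_{(n)}^2] = \frac{n}{n+2}\,\theta^2$. Formally, $\ln f(x;\theta) = -\ln\theta$ would give the "bound" $\frac{\theta^2}{n}$. The function names are hypothetical.

```python
from fractions import Fraction


def scaled_max_variance_factor(n):
    """Var((n+1)/n * X_(n)) / theta^2 for a uniform(0, theta) sample of size n."""
    mean_max = Fraction(n, n + 1)           # E[X_(n)] / theta
    second_moment_max = Fraction(n, n + 2)  # E[X_(n)^2] / theta^2
    scale = Fraction(n + 1, n)              # makes the estimator unbiased
    return scale ** 2 * (second_moment_max - mean_max ** 2)


def formal_bound_factor(n):
    """(theta^2 / n) / theta^2: what (CR1) would give if it were applicable."""
    return Fraction(1, n)


for n in (2, 5, 10):
    print(n, scaled_max_variance_factor(n), formal_bound_factor(n))
```

The unbiased estimator $\frac{n+1}{n}\,X_{(n)}$ has variance $\frac{\theta^2}{n(n+2)}$, which is below $\frac{\theta^2}{n}$. This does not contradict Theorem 16.1, because condition (1) fails when the range of the distribution depends on $\theta$.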

If an unbiased estimator $\hat{\theta}$ attains the Cramér-Rao lower bound, that is, if equality holds in (CR1), then it has minimum variance among all unbiased estimators. We state this as a theorem without giving a proof.

Theorem 16.2. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a population $X$ with probability density $f(x;\theta)$, where $\theta$ is a parameter. If $\hat{\theta}$ is an unbiased estimator of $\theta$ and

$$Var\left(\hat{\theta}\right) = \frac{1}{E\left[\left(\dfrac{\partial \ln L(\theta)}{\partial \theta}\right)^2\right]},$$

then $\hat{\theta}$ is a uniform minimum variance unbiased estimator of $\theta$. The converse of this is not true.

Definition 16.5. An unbiased estimator $\hat{\theta}$ is called an efficient estimator if it attains the Cramér-Rao lower bound, that is

$$Var\left(\hat{\theta}\right) = \frac{1}{E\left[\left(\dfrac{\partial \ln L(\theta)}{\partial \theta}\right)^2\right]}.$$

In view of the above theorem it is easy to note that an efficient estimator of a parameter is always a uniform minimum variance unbiased estimator of the parameter. However, not every uniform minimum variance unbiased estimator of a parameter is efficient. In other words, not every uniform minimum variance unbiased estimator of a parameter attains equality in the Cramér-Rao lower bound

$$Var\left(\hat{\theta}\right) \ge \frac{1}{E\left[\left(\dfrac{\partial \ln L(\theta)}{\partial \theta}\right)^2\right]}.$$

Example 16.11. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution with density function

$$f(x;\theta) = \begin{cases} 3\theta\, x^2 e^{-\theta x^3} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise.} \end{cases}$$

What is the Cramér-Rao lower bound for the variance of an unbiased estimator of the parameter $\theta$?

Answer: Let $\hat{\theta}$ be an unbiased estimator of $\theta$. The Cramér-Rao lower bound for the variance of $\hat{\theta}$ is given by

$$Var\left(\hat{\theta}\right) \ge \frac{-1}{E\left[\dfrac{\partial^2 \ln L(\theta)}{\partial \theta^2}\right]},$$

where $L(\theta)$ denotes the likelihood function of the given random sample $X_1, X_2, ..., X_n$. Since the likelihood function of the sample is

$$L(\theta) = \prod_{i=1}^{n} 3\theta\, x_i^2\, e^{-\theta x_i^3},$$

we get

$$\ln L(\theta) = n \ln\theta + \sum_{i=1}^{n} \ln\left(3x_i^2\right) - \theta \sum_{i=1}^{n} x_i^3,$$

$$\frac{\partial \ln L(\theta)}{\partial \theta} = \frac{n}{\theta} - \sum_{i=1}^{n} x_i^3,$$

and

$$\frac{\partial^2 \ln L(\theta)}{\partial \theta^2} = -\frac{n}{\theta^2}.$$

Hence, using this in the Cramér-Rao inequality, we get

$$Var\left(\hat{\theta}\right) \ge \frac{\theta^2}{n}.$$

Thus the Cramér-Rao lower bound for the variance of the unbiased estimator of $\theta$ is $\frac{\theta^2}{n}$.
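The second derivative computed in Example 16.11 can be sanity-checked numerically with a central finite difference. This is only a sketch: the data values, the value of $\theta$, the step size `h` and the function names are arbitrary choices made for the illustration.

```python
import math


def log_likelihood(theta, xs):
    """ln L(theta) for the density 3*theta*x^2*exp(-theta*x^3), x > 0."""
    return sum(math.log(3 * theta * x * x) - theta * x ** 3 for x in xs)


def second_derivative(theta, xs, h=1e-4):
    """Central finite-difference approximation of d^2 ln L / d theta^2."""
    return (log_likelihood(theta + h, xs)
            - 2 * log_likelihood(theta, xs)
            + log_likelihood(theta - h, xs)) / (h * h)


xs = [0.5, 0.9, 1.2, 0.7]  # hypothetical observations, n = 4
theta = 1.5                # hypothetical parameter value

d2 = second_derivative(theta, xs)
print("numerical second derivative:", d2)
print("-n / theta^2               :", -len(xs) / theta ** 2)
print("lower bound theta^2 / n    :", -1 / d2)
```

Because the second derivative $-\frac{n}{\theta^2}$ does not involve the observations, its expected value is the same constant, which is why no expectation needs to be evaluated in the example.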

Example 16.12. Let $X_1, X_2, ..., X_n$ be a random sample from a normal population with unknown mean $\mu$ and known variance $\sigma^2 > 0$. What is the maximum likelihood estimator of $\mu$? Is this maximum likelihood estimator an efficient estimator of $\mu$?

Answer: The probability density function of the population is

$$f(x;\mu) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2\sigma^2}(x-\mu)^2}.$$

Thus

$$\ln f(x;\mu) = -\frac{1}{2} \ln\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2}(x-\mu)^2$$

and hence

$$\ln L(\mu) = -\frac{n}{2} \ln\left(2\pi\sigma^2\right) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2.$$

Taking the derivative of $\ln L(\mu)$ with respect to $\mu$, we get

$$\frac{d \ln L(\mu)}{d\mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu).$$

Setting this derivative to zero and solving for $\mu$, we see that $\hat{\mu} = \overline{X}$. The variance of $\overline{X}$ is given by

$$Var\left(\overline{X}\right) = Var\left(\frac{X_1 + X_2 + \cdots + X_n}{n}\right) = \frac{\sigma^2}{n}.$$

Next we determine the Cramér-Rao lower bound for the estimator $\overline{X}$. We already know that

$$\frac{d \ln L(\mu)}{d\mu} = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu)$$

and hence

$$\frac{d^2 \ln L(\mu)}{d\mu^2} = -\frac{n}{\sigma^2}.$$

Therefore

$$E\left[\frac{d^2 \ln L(\mu)}{d\mu^2}\right] = -\frac{n}{\sigma^2}$$

and

$$\frac{-1}{E\left[\dfrac{d^2 \ln L(\mu)}{d\mu^2}\right]} = \frac{\sigma^2}{n}.$$

Thus

$$Var\left(\overline{X}\right) = \frac{-1}{E\left[\dfrac{d^2 \ln L(\mu)}{d\mu^2}\right]}$$

and $\overline{X}$ is an efficient estimator of $\mu$. Since every efficient estimator is a uniform minimum variance unbiased estimator, $\overline{X}$ is a uniform minimum variance unbiased estimator of $\mu$.

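Example 16.13. Let $X_1, X_2, ..., X_n$ be a random sample from a normal population with known mean $\mu$ and unknown variance $\sigma^2 > 0$. What is the maximum likelihood estimator of $\sigma^2$? Is this maximum likelihood estimator a uniform minimum variance unbiased estimator of $\sigma^2$?

Answer: Let us write $\theta = \sigma^2$. Then

$$f(x;\theta) = \frac{1}{\sqrt{2\pi\theta}}\, e^{-\frac{1}{2\theta}(x-\mu)^2}$$

and

$$\ln L(\theta) = -\frac{n}{2} \ln(2\pi) - \frac{n}{2} \ln(\theta) - \frac{1}{2\theta} \sum_{i=1}^{n} (x_i - \mu)^2.$$

Differentiating $\ln L(\theta)$ with respect to $\theta$, we have

$$\frac{d}{d\theta} \ln L(\theta) = -\frac{n}{2}\,\frac{1}{\theta} + \frac{1}{2\theta^2} \sum_{i=1}^{n} (x_i - \mu)^2.$$

Setting this derivative to zero and solving for $\theta$, we see that

$$\hat{\theta} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2.$$

Next we show that this estimator is unbiased. For this we consider

$$\begin{aligned}
E\left(\hat{\theta}\right) &= E\left[\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2\right] \\
&= \frac{\sigma^2}{n}\, E\left[\sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2\right] \\
&= \frac{\theta}{n}\, E\left(\chi^2(n)\right) \\
&= \frac{\theta}{n}\, n = \theta.
\end{aligned}$$

Hence $\hat{\theta}$ is an unbiased estimator of $\theta$. The variance of $\hat{\theta}$ can be obtained as follows:

$$\begin{aligned}
Var\left(\hat{\theta}\right) &= Var\left[\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2\right] \\
&= \frac{\sigma^4}{n^2}\, Var\left[\sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2\right] \\
&= \frac{\theta^2}{n^2}\, Var\left(\chi^2(n)\right) \\
&= \frac{\theta^2}{n^2}\, 2n \\
&= \frac{2\theta^2}{n} = \frac{2\sigma^4}{n}.
\end{aligned}$$

Finally we determine the Cramér-Rao lower bound for the variance of $\hat{\theta}$. The second derivative of $\ln L(\theta)$ with respect to $\theta$ is

$$\frac{d^2 \ln L(\theta)}{d\theta^2} = \frac{n}{2\theta^2} - \frac{1}{\theta^3} \sum_{i=1}^{n} (x_i - \mu)^2.$$

Hence

$$\begin{aligned}
E\left[\frac{d^2 \ln L(\theta)}{d\theta^2}\right] &= \frac{n}{2\theta^2} - \frac{1}{\theta^3}\, E\left[\sum_{i=1}^{n} (X_i - \mu)^2\right] \\
&= \frac{n}{2\theta^2} - \frac{\theta}{\theta^3}\, E\left[\chi^2(n)\right] \\
&= \frac{n}{2\theta^2} - \frac{n}{\theta^2} \\
&= -\frac{n}{2\theta^2}.
\end{aligned}$$

Thus

$$\frac{-1}{E\left[\dfrac{d^2 \ln L(\theta)}{d\theta^2}\right]} = \frac{2\theta^2}{n} = \frac{2\sigma^4}{n}.$$

Therefore

$$Var\left(\hat{\theta}\right) = \frac{-1}{E\left[\dfrac{d^2 \ln L(\theta)}{d\theta^2}\right]}.$$

Hence $\hat{\theta}$ is an efficient estimator of $\theta$. Since every efficient estimator is a uniform minimum variance unbiased estimator, $\frac{1}{n} \sum_{i=1}^{n} (X_i - \mu)^2$ is a uniform minimum variance unbiased estimator of $\sigma^2$.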

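Example 16.14. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a normal population with known mean $\mu$ and variance $\sigma^2 > 0$. Show that $S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (X_i - \overline{X})^2$ is an unbiased estimator of $\sigma^2$. Further, show that $S^2$ cannot attain the Cramér-Rao lower bound.

Answer: From Example 16.3, we know that $S^2$ is an unbiased estimator of $\sigma^2$. The variance of $S^2$ can be computed as follows:

$$\begin{aligned}
Var\left(S^2\right) &= Var\left[\frac{1}{n-1} \sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2\right] \\
&= \frac{\sigma^4}{(n-1)^2}\, Var\left[\sum_{i=1}^{n} \left(\frac{X_i - \overline{X}}{\sigma}\right)^2\right] \\
&= \frac{\sigma^4}{(n-1)^2}\, Var\left(\chi^2(n-1)\right) \\
&= \frac{\sigma^4}{(n-1)^2}\, 2(n-1) \\
&= \frac{2\sigma^4}{n-1}.
\end{aligned}$$

Next we let $\theta = \sigma^2$ and determine the Cramér-Rao lower bound for the variance of $S^2$. The second derivative of $\ln L(\theta)$ with respect to $\theta$ is

$$\frac{d^2 \ln L(\theta)}{d\theta^2} = \frac{n}{2\theta^2} - \frac{1}{\theta^3} \sum_{i=1}^{n} (x_i - \mu)^2.$$

Hence

$$\begin{aligned}
E\left[\frac{d^2 \ln L(\theta)}{d\theta^2}\right] &= \frac{n}{2\theta^2} - \frac{1}{\theta^3}\, E\left[\sum_{i=1}^{n} (X_i - \mu)^2\right] \\
&= \frac{n}{2\theta^2} - \frac{\theta}{\theta^3}\, E\left[\chi^2(n)\right] \\
&= \frac{n}{2\theta^2} - \frac{n}{\theta^2} \\
&= -\frac{n}{2\theta^2}.
\end{aligned}$$

Thus

$$\frac{-1}{E\left[\dfrac{d^2 \ln L(\theta)}{d\theta^2}\right]} = \frac{2\theta^2}{n} = \frac{2\sigma^4}{n}.$$

Hence

$$\frac{2\sigma^4}{n-1} = Var\left(S^2\right) > \frac{-1}{E\left[\dfrac{d^2 \ln L(\theta)}{d\theta^2}\right]} = \frac{2\sigma^4}{n}.$$

This shows that $S^2$ cannot attain the Cramér-Rao lower bound.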

The Cramér-Rao lower bound approach has two disadvantages: (1) not every density function f(x; θ) satisfies the assumptions of the Cramér-Rao theorem, and (2) not every allowable estimator attains the Cramér-Rao lower bound. In either of these situations, one does not know whether an estimator is a uniform minimum variance unbiased estimator or not.

16.4. Sufficient Estimator

In many situations, we cannot easily find the distribution of the estimator $\hat{\theta}$ of a parameter θ even though we know the distribution of the population. Therefore, we have no way to know whether our estimator $\hat{\theta}$ is unbiased or biased. Hence, we need some other criteria to judge the quality of an estimator. Sufficiency is one such criterion for judging the quality of an estimator.

Recall that an estimator of a population parameter is a function of the sample values that does not contain the parameter. An estimator summarizes the information found in the sample about the parameter. If an estimator summarizes just as much information about the parameter being estimated as the sample does, then the estimator is called a sufficient estimator.

Definition 16.6. Let X ∼ f(x; θ) be a population and let X1, X2, ..., Xn be a random sample of size n from this population X. An estimator $\hat{\theta}$ of the parameter θ is said to be a sufficient estimator of θ if the conditional distribution of the sample given the estimator $\hat{\theta}$ does not depend on the parameter θ.

Example 16.15. If X1, X2, ..., Xn is a random sample from the distribution with probability density function

$$f(x;\theta) = \begin{cases} \theta^x (1-\theta)^{1-x} & \text{if } x = 0, 1 \\ 0 & \text{elsewhere,} \end{cases}$$

where 0 < θ < 1. Show that $Y = \sum_{i=1}^{n} X_i$ is a sufficient statistic of θ.

Answer: First, we find the distribution of the sample. This is given by

$$f(x_1, x_2, ..., x_n) = \prod_{i=1}^{n} \theta^{x_i} (1-\theta)^{1-x_i} = \theta^{y} (1-\theta)^{n-y} .$$

Since each Xi ∼ BER(θ), we have

$$Y = \sum_{i=1}^{n} X_i \sim BIN(n, \theta) .$$

If X1 = x1, X2 = x2, ..., Xn = xn and $Y = \sum_{i=1}^{n} x_i$, then

$$f(x_1, x_2, ..., x_n, y) = \begin{cases} f(x_1, x_2, ..., x_n) & \text{if } y = \sum_{i=1}^{n} x_i , \\ 0 & \text{if } y \neq \sum_{i=1}^{n} x_i . \end{cases}$$

Therefore, the probability density function of Y is given by

$$g(y) = \binom{n}{y} \theta^{y} (1-\theta)^{n-y} .$$

Now, we find the conditional density of the sample given the estimator Y, that is

$$\begin{aligned}
f(x_1, x_2, ..., x_n / Y = y) &= \frac{f(x_1, x_2, ..., x_n, y)}{g(y)} \\
&= \frac{f(x_1, x_2, ..., x_n)}{g(y)} \\
&= \frac{\theta^{y} (1-\theta)^{n-y}}{\binom{n}{y} \theta^{y} (1-\theta)^{n-y}} \\
&= \frac{1}{\binom{n}{y}} .
\end{aligned}$$

Hence, the conditional density of the sample given the statistic Y is independent of the parameter θ. Therefore, by definition Y is a sufficient statistic.
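The conclusion of Example 16.15 can be checked by brute force for a small sample size. The sketch below is illustrative only: the function name `conditional_given_sum` and the values n = 4, θ = 0.2 and θ = 0.7 are assumptions chosen for the check, not part of the text. It enumerates every Bernoulli sample of size n and divides the density of the sample by the BIN(n, θ) density of Y.

```python
from itertools import product
from math import comb, isclose

def conditional_given_sum(theta, n):
    """Map each 0/1 sample of size n to f(sample) / g(y), where y = sum(sample)."""
    result = {}
    for sample in product((0, 1), repeat=n):
        y = sum(sample)
        joint = theta**y * (1 - theta) ** (n - y)               # density of the sample
        g_y = comb(n, y) * theta**y * (1 - theta) ** (n - y)    # BIN(n, theta) density of Y
        result[sample] = joint / g_y
    return result

# Two different (hypothetical) parameter values, same sample size.
cond_a = conditional_given_sum(0.2, 4)
cond_b = conditional_given_sum(0.7, 4)

# The conditional density agrees for both values of theta and equals 1 / C(n, y).
for sample, value in cond_a.items():
    assert isclose(value, cond_b[sample])
    assert isclose(value, 1 / comb(4, sum(sample)))
```

Every sample with the same total y receives the same conditional probability, whatever θ is, which is what sufficiency of Y requires.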

Example 16.16. If X1, X2, ..., Xn is a random sample from the distribution with probability density function

$$f(x;\theta) = \begin{cases} e^{-(x-\theta)} & \text{if } \theta < x < \infty \\ 0 & \text{elsewhere,} \end{cases}$$

where −∞ < θ < ∞. What is the maximum likelihood estimator of θ? Is this maximum likelihood estimator a sufficient estimator of θ?

Answer: We have seen in Chapter 15 that the maximum likelihood estimator of θ is Y = X(1), that is, the first order statistic of the sample. Let us find the probability density of this statistic, which is given by

$$\begin{aligned}
g(y) &= \frac{n!}{(n-1)!}\, [F(y)]^{0}\, f(y)\, [1 - F(y)]^{n-1} \\
&= n\, f(y)\, [1 - F(y)]^{n-1} \\
&= n\, e^{-(y-\theta)} \left[ 1 - \left( 1 - e^{-(y-\theta)} \right) \right]^{n-1} \\
&= n\, e^{n\theta} e^{-ny} .
\end{aligned}$$

The probability density of the random sample is

$$f(x_1, x_2, ..., x_n) = \prod_{i=1}^{n} e^{-(x_i - \theta)} = e^{n\theta} e^{-n\overline{x}} ,$$

where $n\overline{x} = \sum_{i=1}^{n} x_i$. Let A be the event (X1 = x1, X2 = x2, ..., Xn = xn) and let B denote the event (Y = y). Then A ⊂ B and therefore A ∩ B = A. Now, we find the conditional density of the sample given the estimator Y, that is

$$\begin{aligned}
f(x_1, x_2, ..., x_n / Y = y) &= P(X_1 = x_1, X_2 = x_2, ..., X_n = x_n / Y = y) \\
&= P(A/B) \\
&= \frac{P(A \cap B)}{P(B)} \\
&= \frac{P(A)}{P(B)} \\
&= \frac{f(x_1, x_2, ..., x_n)}{g(y)} \\
&= \frac{e^{n\theta} e^{-n\overline{x}}}{n\, e^{n\theta} e^{-ny}} \\
&= \frac{e^{-n\overline{x}}}{n\, e^{-ny}} .
\end{aligned}$$

Hence, the conditional density of the sample given the statistic Y is independent of the parameter θ. Therefore, by definition Y is a sufficient statistic.

We have seen that to verify whether an estimator is sufficient or not one has to examine the conditional density of the sample given the estimator. To compute this conditional density one has to use the density of the estimator. The density of the estimator is not always easy to find. Therefore, verifying the sufficiency of an estimator using this definition is not always easy. The following factorization theorem of Fisher and Neyman helps to decide when an estimator is sufficient.

Theorem 16.3. Let X1, X2, ..., Xn denote a random sample with probability density function f(x1, x2, ..., xn; θ), which depends on the population parameter θ. The estimator $\hat{\theta}$ is sufficient for θ if and only if

$$f(x_1, x_2, ..., x_n; \theta) = \phi(\hat{\theta}, \theta)\, h(x_1, x_2, ..., x_n)$$

where φ depends on x1, x2, ..., xn only through $\hat{\theta}$ and h(x1, x2, ..., xn) does not depend on θ.

Now we give two examples to illustrate the factorization theorem.

Example 16.17. Let X1, X2, ..., Xn be a random sample from a distribution with density function

$$f(x;\lambda) = \begin{cases} \dfrac{\lambda^{x} e^{-\lambda}}{x!} & \text{if } x = 0, 1, 2, ..., \infty \\ 0 & \text{elsewhere,} \end{cases}$$

where λ > 0 is a parameter. Find the maximum likelihood estimator of λ and show that the maximum likelihood estimator of λ is a sufficient estimator of the parameter λ.

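Answer: First, we find the density of the sample or the likelihood function of the sample. The likelihood function of the sample is given by

$$\begin{aligned}
L(\lambda) &= \prod_{i=1}^{n} f(x_i; \lambda) \\
&= \prod_{i=1}^{n} \frac{\lambda^{x_i} e^{-\lambda}}{x_i!} \\
&= \frac{\lambda^{n\overline{x}}\, e^{-n\lambda}}{\prod_{i=1}^{n} (x_i!)} .
\end{aligned}$$

Taking the logarithm of the likelihood function, we get

$$\ln L(\lambda) = n\overline{x} \ln \lambda - n\lambda - \ln \prod_{i=1}^{n} (x_i!) .$$

Therefore

$$\frac{d}{d\lambda} \ln L(\lambda) = \frac{1}{\lambda}\, n\overline{x} - n .$$

Setting this derivative to zero and solving for λ, we get

$$\lambda = \overline{x} .$$

The second derivative test assures us that the above λ is a maximum. Hence, the maximum likelihood estimator of λ is the sample mean $\overline{X}$. Next, we show that $\overline{X}$ is sufficient, by using the Factorization Theorem of Fisher and Neyman. We factor the joint density of the sample as

$$\begin{aligned}
L(\lambda) &= \frac{\lambda^{n\overline{x}}\, e^{-n\lambda}}{\prod_{i=1}^{n} (x_i!)} \\
&= \left[ \lambda^{n\overline{x}}\, e^{-n\lambda} \right] \frac{1}{\prod_{i=1}^{n} (x_i!)} \\
&= \phi(\overline{X}, \lambda)\, h(x_1, x_2, ..., x_n) .
\end{aligned}$$

Therefore, the estimator $\overline{X}$ is a sufficient estimator of λ.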
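One practical consequence of the factorization in Example 16.17 is that two samples with the same total have a likelihood ratio that does not involve λ, since the λ-dependent factor cancels. The following sketch checks this numerically. The helper names and the two samples are hypothetical choices made for illustration.

```python
from math import exp, factorial, isclose, prod

def poisson_likelihood(sample, lam):
    """Joint Poisson density of the sample, evaluated at the parameter value lam."""
    return prod(lam**x * exp(-lam) / factorial(x) for x in sample)

def likelihood_ratio(sample_a, sample_b, lam):
    """L(lam; sample_a) / L(lam; sample_b)."""
    return poisson_likelihood(sample_a, lam) / poisson_likelihood(sample_b, lam)

# Two hypothetical samples of size 4 with the same sum (10), hence the same sample mean.
sample_a = [0, 3, 2, 5]
sample_b = [4, 1, 4, 1]

ratios = [likelihood_ratio(sample_a, sample_b, lam) for lam in (0.5, 2.0, 7.5)]

# The ratio reduces to h(sample_a) / h(sample_b) = (4! 1! 4! 1!) / (0! 3! 2! 5!) = 0.4
# for every lam, because the factor depending on lam cancels.
assert all(isclose(r, 0.4) for r in ratios)
```

Only the factor h(x1, ..., xn) = 1/∏(xi!) distinguishes the two samples, which is the role the Factorization Theorem assigns to it.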

Example 16.18. Let X1, X2, ..., Xn be a random sample from a normal distribution with density function

$$f(x;\mu) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-\mu)^2} ,$$

where −∞ < µ < ∞ is a parameter. Find the maximum likelihood estimator of µ and show that the maximum likelihood estimator of µ is a sufficient estimator.

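Answer: We know that the maximum likelihood estimator of µ is the sample mean $\overline{X}$. Next, we show that this maximum likelihood estimator $\overline{X}$ is a sufficient estimator of µ. The joint density of the sample is given by

$$\begin{aligned}
f(x_1, x_2, ..., x_n; \mu) &= \prod_{i=1}^{n} f(x_i; \mu) \\
&= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x_i - \mu)^2} \\
&= \left( \frac{1}{\sqrt{2\pi}} \right)^{n} e^{-\frac{1}{2} \sum_{i=1}^{n} (x_i - \mu)^2} \\
&= \left( \frac{1}{\sqrt{2\pi}} \right)^{n} e^{-\frac{1}{2} \sum_{i=1}^{n} [(x_i - \overline{x}) + (\overline{x} - \mu)]^2} \\
&= \left( \frac{1}{\sqrt{2\pi}} \right)^{n} e^{-\frac{1}{2} \sum_{i=1}^{n} \left[ (x_i - \overline{x})^2 + 2(x_i - \overline{x})(\overline{x} - \mu) + (\overline{x} - \mu)^2 \right]} \\
&= \left( \frac{1}{\sqrt{2\pi}} \right)^{n} e^{-\frac{1}{2} \sum_{i=1}^{n} \left[ (x_i - \overline{x})^2 + (\overline{x} - \mu)^2 \right]} \\
&= \left( \frac{1}{\sqrt{2\pi}} \right)^{n} e^{-\frac{n}{2} (\overline{x} - \mu)^2}\, e^{-\frac{1}{2} \sum_{i=1}^{n} (x_i - \overline{x})^2} .
\end{aligned}$$

Hence, by the Factorization Theorem, $\overline{X}$ is a sufficient estimator of the population mean.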

Note that the probability density function of Example 16.17, which is

$$f(x;\lambda) = \begin{cases} \dfrac{\lambda^{x} e^{-\lambda}}{x!} & \text{if } x = 0, 1, 2, ..., \infty \\ 0 & \text{elsewhere,} \end{cases}$$

can be written as

$$f(x;\lambda) = e^{\{ x \ln \lambda - \lambda - \ln x! \}}$$

for x = 0, 1, 2, ... This density function is of the form

$$f(x;\lambda) = e^{\{ K(x) A(\lambda) + S(x) + B(\lambda) \}} .$$

Similarly, the probability density function of Example 16.12, which is

$$f(x;\mu) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}(x-\mu)^2} ,$$

can also be written as

$$f(x;\mu) = e^{\left\{ x\mu - \frac{x^2}{2} - \frac{\mu^2}{2} - \frac{1}{2} \ln(2\pi) \right\}} .$$

This probability density function is of the form

$$f(x;\mu) = e^{\{ K(x) A(\mu) + S(x) + B(\mu) \}} .$$

We have also seen that in both examples the sufficient estimator was the sample mean $\overline{X}$, which can be written as $\frac{1}{n} \sum_{i=1}^{n} X_i$.

Our next theorem gives a general result in this direction. The following

theorem is known as the Pitman-Koopman theorem.

Theorem 16.4. Let X1, X2, ..., Xn be a random sample from a distribution with probability density function of the exponential form

$$f(x;\theta) = e^{\{ K(x) A(\theta) + S(x) + B(\theta) \}}$$

on a support free of θ. Then the statistic $\sum_{i=1}^{n} K(X_i)$ is a sufficient statistic for the parameter θ.

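Proof: The joint density of the sample is

$$\begin{aligned}
f(x_1, x_2, ..., x_n; \theta) &= \prod_{i=1}^{n} f(x_i; \theta) \\
&= \prod_{i=1}^{n} e^{\{ K(x_i) A(\theta) + S(x_i) + B(\theta) \}} \\
&= e^{\left\{ \sum_{i=1}^{n} K(x_i) A(\theta) + \sum_{i=1}^{n} S(x_i) + n B(\theta) \right\}} \\
&= e^{\left\{ \sum_{i=1}^{n} K(x_i) A(\theta) + n B(\theta) \right\}}\, e^{\left\{ \sum_{i=1}^{n} S(x_i) \right\}} .
\end{aligned}$$

Hence by the Factorization Theorem the estimator $\sum_{i=1}^{n} K(X_i)$ is a sufficient statistic for the parameter θ. This completes the proof.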

Example 16.19. Let X1, X2, ..., Xn be a random sample from a distribution with density function

$$f(x;\theta) = \begin{cases} \theta x^{\theta - 1} & \text{for } 0 < x < 1 \\ 0 & \text{otherwise,} \end{cases}$$

where θ > 0 is a parameter. Using the Pitman-Koopman Theorem find a sufficient estimator of θ.

Answer: The Pitman-Koopman Theorem says that if the probability density function can be expressed in the form of

$$f(x;\theta) = e^{\{ K(x) A(\theta) + S(x) + B(\theta) \}}$$

then $\sum_{i=1}^{n} K(X_i)$ is a sufficient statistic for θ. The given population density can be written as

$$\begin{aligned}
f(x;\theta) &= \theta x^{\theta - 1} \\
&= e^{\{ \ln [ \theta x^{\theta - 1} ] \}} \\
&= e^{\{ \ln \theta + (\theta - 1) \ln x \}} .
\end{aligned}$$

Thus,

$$K(x) = \ln x, \qquad A(\theta) = \theta, \qquad S(x) = -\ln x, \qquad B(\theta) = \ln \theta .$$

Hence by the Pitman-Koopman Theorem,

$$\sum_{i=1}^{n} K(X_i) = \sum_{i=1}^{n} \ln X_i = \ln \prod_{i=1}^{n} X_i .$$

Thus $\ln \prod_{i=1}^{n} X_i$ is a sufficient statistic for θ.

Remark 16.1. Notice that $\prod_{i=1}^{n} X_i$ is also a sufficient statistic of θ, since knowing $\ln \left( \prod_{i=1}^{n} X_i \right)$, we also know $\prod_{i=1}^{n} X_i$.

Example 16.20. Let X1, X2, ..., Xn be a random sample from a distribution with density function

$$f(x;\theta) = \begin{cases} \dfrac{1}{\theta}\, e^{-\frac{x}{\theta}} & \text{for } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where 0 < θ < ∞ is a parameter. Find a sufficient estimator of θ.

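Answer: First, we rewrite the population density in the exponential form. That is

$$\begin{aligned}
f(x;\theta) &= \frac{1}{\theta}\, e^{-\frac{x}{\theta}} \\
&= e^{\ln \left[ \frac{1}{\theta} e^{-\frac{x}{\theta}} \right]} \\
&= e^{-\ln \theta - \frac{x}{\theta}} .
\end{aligned}$$

Hence

$$K(x) = x, \qquad A(\theta) = -\frac{1}{\theta}, \qquad S(x) = 0, \qquad B(\theta) = -\ln \theta .$$

Hence by the Pitman-Koopman Theorem,

$$\sum_{i=1}^{n} K(X_i) = \sum_{i=1}^{n} X_i = n\overline{X} .$$

Thus, $n\overline{X}$ is a sufficient statistic for θ. Since knowing $n\overline{X}$ we also know $\overline{X}$, the estimator $\overline{X}$ is also a sufficient estimator of θ.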
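The decompositions used in Examples 16.19 and 16.20 can be sanity-checked numerically by rebuilding each density from its K, A, S and B pieces. The sketch below is only an illustration: `exp_form`, `sufficient_statistic` and the evaluation points are hypothetical names and values, not anything defined in the text.

```python
from math import exp, isclose, log

def exp_form(K, A, S, B):
    """Build f(x; theta) = exp{K(x) A(theta) + S(x) + B(theta)} from its four pieces."""
    return lambda x, theta: exp(K(x) * A(theta) + S(x) + B(theta))

# Example 16.19: f(x; theta) = theta * x**(theta - 1) on 0 < x < 1.
power_density = exp_form(K=log, A=lambda t: t, S=lambda x: -log(x), B=log)

# Example 16.20: f(x; theta) = (1 / theta) * exp(-x / theta) on 0 < x < infinity.
expo_density = exp_form(K=lambda x: x, A=lambda t: -1.0 / t,
                        S=lambda x: 0.0, B=lambda t: -log(t))

def sufficient_statistic(K, sample):
    """Pitman-Koopman sufficient statistic: the sum of K(x_i) over the sample."""
    return sum(K(x) for x in sample)

# The rebuilt densities agree with the closed forms at arbitrary test points.
assert isclose(power_density(0.3, 2.5), 2.5 * 0.3 ** 1.5)
assert isclose(expo_density(1.7, 2.0), 0.5 * exp(-1.7 / 2.0))

# For Example 16.20 the statistic is simply the sample total n * xbar.
assert isclose(sufficient_statistic(lambda x: x, [1.0, 2.5, 0.5]), 4.0)
```

Writing the pieces as separate functions makes the point of the theorem visible: only K enters the sufficient statistic, while A, S and B affect the density but not the data summary.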

Example 16.21. Let X1, X2, ..., Xn be a random sample from a distribution with density function

$$f(x;\theta) = \begin{cases} e^{-(x-\theta)} & \text{for } \theta < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where −∞ < θ < ∞ is a parameter. Can the Pitman-Koopman Theorem be used to find a sufficient statistic for θ?

Answer: No. We cannot use the Pitman-Koopman Theorem to find a sufficient statistic for θ since the domain where the population density is nonzero is not free of θ.

Next, we present the connection between the maximum likelihood estimator and the sufficient estimator. If there is a sufficient estimator for the parameter θ and if the maximum likelihood estimator of this θ is unique, then the maximum likelihood estimator is a function of the sufficient estimator. That is

$$\hat{\theta}_{ML} = \psi(\hat{\theta}_{S}) ,$$

where ψ is a real valued function, $\hat{\theta}_{ML}$ is the maximum likelihood estimator of θ, and $\hat{\theta}_{S}$ is the sufficient estimator of θ.

Similarly, a connection can be established between the uniform minimum variance unbiased estimator and the sufficient estimator of a parameter θ. If there is a sufficient estimator for the parameter θ and if the uniform minimum variance unbiased estimator of this θ is unique, then the uniform minimum variance unbiased estimator is a function of the sufficient estimator. That is

$$\hat{\theta}_{MVUE} = \eta(\hat{\theta}_{S}) ,$$

where η is a real valued function, $\hat{\theta}_{MVUE}$ is the uniform minimum variance unbiased estimator of θ, and $\hat{\theta}_{S}$ is the sufficient estimator of θ.

Finally, we may ask "If there are sufficient estimators, why are there not necessary estimators?" In fact, there are. Dynkin (1951) gave the following definition.

Definition 16.7. An estimator is said to be a necessary estimator if it can be written as a function of every sufficient estimator.

16.5. Consistent Estimator

Let X1, X2, ..., Xn be a random sample from a population X with density f(x; θ). Let $\hat{\theta}$ be an estimator of θ based on the sample of size n. Obviously the estimator depends on the sample size n. In order to reflect the dependency of $\hat{\theta}$ on n, we denote $\hat{\theta}$ as $\hat{\theta}_n$.

Definition 16.7. Let X1, X2, ..., Xn be a random sample from a population X with density f(x; θ). A sequence of estimators $\{\hat{\theta}_n\}$ of θ is said to be consistent for θ if and only if the sequence $\{\hat{\theta}_n\}$ converges in probability to θ, that is, for any ε > 0

$$\lim_{n \to \infty} P\left( \left| \hat{\theta}_n - \theta \right| \geq \epsilon \right) = 0 .$$

Note that consistency is actually a concept relating to a sequence of estimators $\{\hat{\theta}_n\}_{n=n_0}^{\infty}$ but we usually say "consistency of $\hat{\theta}_n$" for simplicity. Further, consistency is a large sample property of an estimator.

The following theorem states that if the mean squared error goes to zero as n goes to infinity, then $\{\hat{\theta}_n\}$ converges in probability to θ.

Theorem 16.5. Let X1, X2, ..., Xn be a random sample from a population X with density f(x; θ) and $\{\hat{\theta}_n\}$ be a sequence of estimators of θ based on the sample. If the variance of $\hat{\theta}_n$ exists for each n and is finite and

$$\lim_{n \to \infty} E\left( \left( \hat{\theta}_n - \theta \right)^2 \right) = 0$$

then, for any ε > 0,

$$\lim_{n \to \infty} P\left( \left| \hat{\theta}_n - \theta \right| \geq \epsilon \right) = 0 .$$

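Proof: By the Markov Inequality (see Theorem 13.8) we have

$$P\left( \left( \hat{\theta}_n - \theta \right)^2 \geq \epsilon^2 \right) \leq \frac{E\left( \left( \hat{\theta}_n - \theta \right)^2 \right)}{\epsilon^2}$$

for all ε > 0. Since the events

$$\left( \hat{\theta}_n - \theta \right)^2 \geq \epsilon^2 \quad \text{and} \quad \left| \hat{\theta}_n - \theta \right| \geq \epsilon$$

are the same, we see that

$$P\left( \left( \hat{\theta}_n - \theta \right)^2 \geq \epsilon^2 \right) = P\left( \left| \hat{\theta}_n - \theta \right| \geq \epsilon \right) \leq \frac{E\left( \left( \hat{\theta}_n - \theta \right)^2 \right)}{\epsilon^2}$$

for all n ∈ ℕ. Hence if

$$\lim_{n \to \infty} E\left( \left( \hat{\theta}_n - \theta \right)^2 \right) = 0$$

then

$$\lim_{n \to \infty} P\left( \left| \hat{\theta}_n - \theta \right| \geq \epsilon \right) = 0$$

and the proof of the theorem is complete.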

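Let

$$B\left( \hat{\theta}, \theta \right) = E\left( \hat{\theta} \right) - \theta$$

be the bias. If an estimator is unbiased, then $B(\hat{\theta}, \theta) = 0$. Next we show that

$$E\left( \left( \hat{\theta} - \theta \right)^2 \right) = Var\left( \hat{\theta} \right) + \left[ B\left( \hat{\theta}, \theta \right) \right]^2 . \qquad (1)$$

To see this consider

$$\begin{aligned}
E\left( \left( \hat{\theta} - \theta \right)^2 \right) &= E\left( \hat{\theta}^2 - 2\hat{\theta}\theta + \theta^2 \right) \\
&= E\left( \hat{\theta}^2 \right) - 2E\left( \hat{\theta} \right)\theta + \theta^2 \\
&= E\left( \hat{\theta}^2 \right) - \left[ E\left( \hat{\theta} \right) \right]^2 + \left[ E\left( \hat{\theta} \right) \right]^2 - 2E\left( \hat{\theta} \right)\theta + \theta^2 \\
&= Var\left( \hat{\theta} \right) + \left[ E\left( \hat{\theta} \right) \right]^2 - 2E\left( \hat{\theta} \right)\theta + \theta^2 \\
&= Var\left( \hat{\theta} \right) + \left[ E\left( \hat{\theta} \right) - \theta \right]^2 \\
&= Var\left( \hat{\theta} \right) + \left[ B\left( \hat{\theta}, \theta \right) \right]^2 .
\end{aligned}$$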

In view of (1), we can say that if

$$\lim_{n \to \infty} Var\left( \hat{\theta}_n \right) = 0 \qquad (2)$$

and

$$\lim_{n \to \infty} B\left( \hat{\theta}_n, \theta \right) = 0 \qquad (3)$$

then

$$\lim_{n \to \infty} E\left( \left( \hat{\theta}_n - \theta \right)^2 \right) = 0 .$$

In other words, to show a sequence of estimators is consistent we have to verify the limits (2) and (3).
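The decomposition of the mean squared error into variance plus squared bias also holds exactly for the empirical distribution of any finite collection of estimates, which gives a quick numerical check. The sketch below is an illustration under assumed settings: the function names, the choice of the estimator (1/n) Σ (Xi − X̄)² for a normal variance, and the values n = 5, σ = 2 and 2000 trials are all hypothetical.

```python
import random
from math import isclose
from statistics import fmean, pvariance

def ml_variance_estimates(n, sigma, trials, seed=0):
    """Simulate the estimator (1/n) * sum (x_i - xbar)^2 on `trials` normal samples of size n."""
    rng = random.Random(seed)
    estimates = []
    for _ in range(trials):
        xs = [rng.gauss(0.0, sigma) for _ in range(n)]
        estimates.append(pvariance(xs))  # population variance divides by n
    return estimates

def mse_decomposition(estimates, theta):
    """Return (mean squared error, variance, bias) of the estimates about theta."""
    mse = fmean((t - theta) ** 2 for t in estimates)
    variance = pvariance(estimates)
    bias = fmean(estimates) - theta
    return mse, variance, bias

estimates = ml_variance_estimates(n=5, sigma=2.0, trials=2000)
mse, variance, bias = mse_decomposition(estimates, theta=4.0)

# Empirical version of identity (1): MSE = Var + Bias^2, up to rounding error.
assert isclose(mse, variance + bias**2, rel_tol=1e-9)
```

With these settings the simulated bias comes out negative, which is consistent with this estimator underestimating the variance on average.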

Example 16.22. Let X1, X2, ..., Xn be a random sample from a normal population X with mean µ and variance σ² > 0. Is the likelihood estimator

$$\hat{\sigma}^2 = \frac{1}{n} \sum_{i=1}^{n} \left( X_i - \overline{X} \right)^2$$

of σ² a consistent estimator of σ²?

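Answer: Since $\hat{\sigma}^2$ depends on the sample size n, we denote $\hat{\sigma}^2$ as $\hat{\sigma}^2_n$. Hence

$$\hat{\sigma}^2_n = \frac{1}{n} \sum_{i=1}^{n} \left( X_i - \overline{X} \right)^2 .$$

The variance of $\hat{\sigma}^2_n$ is given by

$$\begin{aligned}
Var\left( \hat{\sigma}^2_n \right) &= Var\left( \frac{1}{n} \sum_{i=1}^{n} \left( X_i - \overline{X} \right)^2 \right) \\
&= \frac{1}{n^2}\, Var\left( \sigma^2\, \frac{(n-1) S^2}{\sigma^2} \right) \\
&= \frac{\sigma^4}{n^2}\, Var\left( \frac{(n-1) S^2}{\sigma^2} \right) \\
&= \frac{\sigma^4}{n^2}\, Var\left( \chi^2(n-1) \right) \\
&= \frac{2(n-1)\sigma^4}{n^2} \\
&= \left( \frac{1}{n} - \frac{1}{n^2} \right) 2\sigma^4 .
\end{aligned}$$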

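Hence

$$\lim_{n \to \infty} Var\left( \hat{\sigma}^2_n \right) = \lim_{n \to \infty} \left( \frac{1}{n} - \frac{1}{n^2} \right) 2\sigma^4 = 0 .$$

The bias $B(\hat{\sigma}^2_n, \sigma^2)$ is given by

$$\begin{aligned}
B\left( \hat{\sigma}^2_n, \sigma^2 \right) &= E\left( \hat{\sigma}^2_n \right) - \sigma^2 \\
&= E\left( \frac{1}{n} \sum_{i=1}^{n} \left( X_i - \overline{X} \right)^2 \right) - \sigma^2 \\
&= \frac{1}{n}\, E\left( \sigma^2\, \frac{(n-1) S^2}{\sigma^2} \right) - \sigma^2 \\
&= \frac{\sigma^2}{n}\, E\left( \chi^2(n-1) \right) - \sigma^2 \\
&= \frac{(n-1)\sigma^2}{n} - \sigma^2 \\
&= -\frac{\sigma^2}{n} .
\end{aligned}$$

Thus

$$\lim_{n \to \infty} B\left( \hat{\sigma}^2_n, \sigma^2 \right) = -\lim_{n \to \infty} \frac{\sigma^2}{n} = 0 .$$

Hence $\frac{1}{n} \sum_{i=1}^{n} \left( X_i - \overline{X} \right)^2$ is a consistent estimator of σ².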

In the last example we saw that the likelihood estimator of variance is a consistent estimator. In general, if the density function f(x; θ) of a population satisfies some mild conditions, then the maximum likelihood estimator of θ is consistent. Similarly, if the density function f(x; θ) of a population satisfies some mild conditions, then the estimator obtained by the moment method is also consistent.
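For the variance estimator of Example 16.22, the closed forms Var = 2(n − 1)σ⁴/n² and bias = −σ²/n make the two limits easy to tabulate. The short sketch below does so for σ² = 1. The function name `ml_variance_moments` and the chosen sample sizes are assumptions made for illustration.

```python
from math import isclose

def ml_variance_moments(n, sigma2):
    """Exact variance, bias and mean squared error of (1/n) * sum (X_i - Xbar)^2
    for a normal sample of size n with variance sigma2."""
    variance = 2 * (n - 1) * sigma2**2 / n**2
    bias = -sigma2 / n
    return variance, bias, variance + bias**2

# Variance, bias and mean squared error all shrink towards zero as n grows.
table = {n: ml_variance_moments(n, 1.0) for n in (10, 100, 1000, 10000)}

variance_10, bias_10, mse_10 = table[10]
assert isclose(variance_10, 0.18) and isclose(bias_10, -0.1) and isclose(mse_10, 0.19)
```

Both the variance and the bias decay at rate 1/n, so the mean squared error does as well, which is the content of limits (2) and (3) for this estimator.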

Let X1, X2, ..., Xn be a random sample from a population X with density function f(x; θ), where θ is a parameter. One can generate a consistent estimator using the moment method as follows. First, find a function U(x) such that

$$E(U(X)) = g(\theta)$$

where g is a function of θ. If g is a one-to-one function of θ and has a continuous inverse, then the estimator

$$\hat{\theta}_{MM} = g^{-1}\left( \frac{1}{n} \sum_{i=1}^{n} U(X_i) \right)$$

is consistent for θ. To see this, by the law of large numbers, we get

$$\frac{1}{n} \sum_{i=1}^{n} U(X_i) \xrightarrow{P} E(U(X)) .$$

Hence

$$\frac{1}{n} \sum_{i=1}^{n} U(X_i) \xrightarrow{P} g(\theta)$$

and therefore

$$g^{-1}\left( \frac{1}{n} \sum_{i=1}^{n} U(X_i) \right) \xrightarrow{P} \theta .$$

Thus

$$\hat{\theta}_{MM} \xrightarrow{P} \theta$$

and $\hat{\theta}_{MM}$ is a consistent estimator of θ.

Example 16.23. Let X1, X2, ..., Xn be a random sample from a distribution with density function

$$f(x;\theta) = \begin{cases} \dfrac{1}{\theta}\, e^{-\frac{x}{\theta}} & \text{for } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where 0 < θ < ∞ is a parameter. Using the moment method find a consistent estimator of θ.

Answer: Let U(x) = x. Then

$$g(\theta) = E(U(X)) = \theta .$$

The function g(θ) = θ for θ > 0 is a one-to-one function and continuous. Moreover, the inverse of g is given by g⁻¹(x) = x. Thus

$$\begin{aligned}
\hat{\theta}_n &= g^{-1}\left( \frac{1}{n} \sum_{i=1}^{n} U(X_i) \right) \\
&= g^{-1}\left( \frac{1}{n} \sum_{i=1}^{n} X_i \right) \\
&= g^{-1}\left( \overline{X} \right) \\
&= \overline{X} .
\end{aligned}$$

Therefore,

$$\hat{\theta}_n = \overline{X}$$

is a consistent estimator of θ.
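A small simulation can make the consistency of the sample mean in Example 16.23 visible. The sketch below draws from the exponential density with mean θ by inverting its distribution function. The function name, the seed, the value θ = 2 and the sample sizes are hypothetical choices, and a single simulated run only suggests, rather than proves, convergence in probability.

```python
import math
import random

def sample_mean_exponential(theta, n, seed=0):
    """Sample mean of n draws from the density (1/theta) * exp(-x/theta), x > 0."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        u = rng.random()                        # uniform on [0, 1)
        total += -theta * math.log(1.0 - u)     # inverse-CDF draw
    return total / n

theta = 2.0
errors = {n: abs(sample_mean_exponential(theta, n) - theta) for n in (10, 1000, 100000)}

# With 100000 observations the standard deviation of the sample mean is about 0.006,
# so the estimate should sit very close to theta.
assert errors[100000] < 0.1
```

Since Var(X̄) = θ²/n and X̄ is unbiased here, limits (2) and (3) hold, which is the analytic counterpart of what the simulation illustrates.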

Since consistency is a large sample property of an estimator, some statis-

ticians suggest that consistency should not be used alone for judging the

goodness of an estimator; rather it should be used along with other criteria.

16.6. Review Exercises

1. Let T1 and T2 be estimators of a population parameter θ based upon the same random sample. If $T_i \sim N(\theta, \sigma_i^2)$, i = 1, 2, and if T = bT1 + (1 − b)T2, then for what value of b is T a minimum variance unbiased estimator of θ?

2. Let X1, X2, ..., Xn be a random sample from a distribution with density function

$$f(x;\theta) = \frac{1}{2\theta}\, e^{-\frac{|x|}{\theta}} , \qquad -\infty < x < \infty ,$$

where 0 < θ is a parameter. What is the expected value of the maximum likelihood estimator of θ? Is this estimator unbiased?

3. Let X1, X2, ..., Xn be a random sample from a distribution with density function

$$f(x;\theta) = \frac{1}{2\theta}\, e^{-\frac{|x|}{\theta}} , \qquad -\infty < x < \infty ,$$

where 0 < θ is a parameter. Is the maximum likelihood estimator an efficient estimator of θ?

4. A random sample X1, X2, ..., Xn of size n is selected from a normal distribution with variance σ². Let S² be the unbiased estimator of σ², and T be the maximum likelihood estimator of σ². If 20T − 19S² = 0, then what is the sample size?

5. Suppose X and Y are independent random variables each with density function

$$f(x) = \begin{cases} 2x\theta^2 & \text{for } 0 < x < \frac{1}{\theta} \\ 0 & \text{otherwise.} \end{cases}$$

If k(X + 2Y) is an unbiased estimator of θ⁻¹, then what is the value of k?

6. An object of length c is measured by two persons using the same instrument. The instrument error has a normal distribution with mean 0 and variance 1. The first person measures the object 25 times, and the average of the measurements is $\overline{X} = 12$. The second person measures the object 36 times, and the average of the measurements is $\overline{Y} = 12.8$. To estimate c we use the weighted average $a\overline{X} + b\overline{Y}$ as an estimator. Determine the constants a and b such that $a\overline{X} + b\overline{Y}$ is the minimum variance unbiased estimator of c and then calculate the minimum variance unbiased estimate of c.

7. Let X1, X2, ..., Xn be a random sample from a distribution with probability density function

$$f(x) = \begin{cases} 3\theta x^2 e^{-\theta x^3} & \text{for } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where θ > 0 is an unknown parameter. Find a sufficient statistic for θ.

8. Let X1, X2, ..., Xn be a random sample from a Weibull distribution with probability density function

$$f(x) = \begin{cases} \dfrac{\beta}{\theta^{\beta}}\, x^{\beta - 1} e^{-\left( \frac{x}{\theta} \right)^{\beta}} & \text{if } x > 0 \\ 0 & \text{otherwise,} \end{cases}$$

where θ > 0 and β > 0 are parameters. Find a sufficient statistic for θ with β known, say β = 2. If β is unknown, can you find a single sufficient statistic for θ?

9. Let X1, X2 be a random sample of size 2 from a population with probability density

$$f(x;\theta) = \begin{cases} \dfrac{1}{\theta}\, e^{-\frac{x}{\theta}} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where θ > 0 is an unknown parameter. If $Y = \sqrt{X_1 X_2}$, then what should be the value of the constant k such that kY is an unbiased estimator of the parameter θ?

10. Let X1, X2, ..., Xn be a random sample from a population with probability density function

$$f(x;\theta) = \begin{cases} \dfrac{1}{\theta} & \text{if } 0 < x < \theta \\ 0 & \text{otherwise,} \end{cases}$$

where θ > 0 is an unknown parameter. If $\overline{X}$ denotes the sample mean, then what should be the value of the constant k such that $k\overline{X}$ is an unbiased estimator of θ?

11. Let X1, X2, ..., Xn be a random sample from a population with probability density function

$$f(x;\theta) = \begin{cases} \dfrac{1}{\theta} & \text{if } 0 < x < \theta \\ 0 & \text{otherwise,} \end{cases}$$

where θ > 0 is an unknown parameter. If X_med denotes the sample median, then what should be the value of the constant k such that kX_med is an unbiased estimator of θ?

12. What do you understand by an unbiased estimator of a parameter θ? What is the basic principle of the maximum likelihood estimation of a parameter θ? What is the basic principle of the Bayesian estimation of a parameter θ? What is the main difference between the Bayesian method and the likelihood method?

13. Let X1, X2, ..., Xn be a random sample from a population X with density function

$$f(x;\theta) = \begin{cases} \dfrac{\theta}{(1+x)^{\theta + 1}} & \text{for } 0 \leq x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where θ > 0 is an unknown parameter. What is a sufficient statistic for the parameter θ?

14. Let X1, X2, ..., Xn be a random sample from a population X with density function

$$f(x;\theta) = \begin{cases} \dfrac{x}{\theta^2}\, e^{-\frac{x^2}{2\theta^2}} & \text{for } 0 \leq x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where θ is an unknown parameter. What is a sufficient statistic for the parameter θ?

15. Let X1, X2, ..., Xn be a random sample from a distribution with density function

$$f(x;\theta) = \begin{cases} e^{-(x-\theta)} & \text{for } \theta < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where −∞ < θ < ∞ is a parameter. What is the maximum likelihood estimator of θ? Find a sufficient statistic of the parameter θ.

16. Let X1, X2, ..., Xn be a random sample from a distribution with density function

$$f(x;\theta) = \begin{cases} e^{-(x-\theta)} & \text{for } \theta < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where −∞ < θ < ∞ is a parameter. Are the estimators X(1) and $\overline{X} - 1$ unbiased estimators of θ? Which one is more efficient than the other?

17. Let X1, X2, ..., Xn be a random sample from a population X with density function

$$f(x;\theta) = \begin{cases} \theta x^{\theta - 1} & \text{for } 0 \leq x < 1 \\ 0 & \text{otherwise,} \end{cases}$$

where θ > 1 is an unknown parameter. What is a sufficient statistic for the parameter θ?

18. Let X1, X2, ..., Xn be a random sample from a population X with density function

$$f(x;\theta) = \begin{cases} \theta \alpha x^{\alpha - 1} e^{-\theta x^{\alpha}} & \text{for } 0 \leq x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where θ > 0 and α > 0 are parameters. What is a sufficient statistic for the parameter θ for a fixed α?

19. Let X1, X2, ..., Xn be a random sample from a population X with density function

$$f(x;\theta) = \begin{cases} \dfrac{\theta \alpha^{\theta}}{x^{(\theta + 1)}} & \text{for } \alpha < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where θ > 0 and α > 0 are parameters. What is a sufficient statistic for the parameter θ for a fixed α?

20. Let X1, X2, ..., Xn be a random sample from a population X with density function

$$f(x;\theta) = \begin{cases} \dbinom{m}{x} \theta^{x} (1-\theta)^{m-x} & \text{for } x = 0, 1, 2, ..., m \\ 0 & \text{otherwise,} \end{cases}$$

where 0 < θ < 1 is a parameter. Show that $\frac{\overline{X}}{m}$ is a uniform minimum variance unbiased estimator of θ for a fixed m.

21. Let X1, X2, ..., Xn be a random sample from a population X with density function

$$f(x;\theta) = \begin{cases} \theta x^{\theta - 1} & \text{for } 0 < x < 1 \\ 0 & \text{otherwise,} \end{cases}$$

where θ > 1 is a parameter. Show that $-\frac{1}{n} \sum_{i=1}^{n} \ln(X_i)$ is a uniform minimum variance unbiased estimator of $\frac{1}{\theta}$.

22. Let X1, X2, ..., Xn be a random sample from a uniform population X on the interval [0, θ], where θ > 0 is a parameter. Is the likelihood estimator $\hat{\theta} = X_{(n)}$ of θ a consistent estimator of θ?

23. Let X1, X2, ..., Xn be a random sample from a population X ∼ POI(λ), where λ > 0 is a parameter. Is the estimator $\overline{X}$ of λ a consistent estimator of λ?

24. Let X1, X2, ..., Xn be a random sample from a population X having the probability density function

$$f(x;\theta) = \begin{cases} \theta x^{\theta - 1} & \text{if } 0 < x < 1 \\ 0 & \text{otherwise,} \end{cases}$$

where θ > 0 is a parameter. Is the estimator $\hat{\theta} = \frac{\overline{X}}{1 - \overline{X}}$ of θ, obtained by the moment method, a consistent estimator of θ?

25. Let X1, X2, ..., Xn be a random sample from a population X having the probability density function

$$f(x;p) = \begin{cases} \dbinom{m}{x} p^{x} (1-p)^{m-x} & \text{if } x = 0, 1, 2, ..., m \\ 0 & \text{otherwise,} \end{cases}$$

where 0 < p < 1 is a parameter and m is a fixed positive integer. What is the maximum likelihood estimator for p? Is this maximum likelihood estimator for p an efficient estimator?

26. Let X1, X2, ..., Xn be a random sample from a population X having the probability density function

$$f(x;\theta) = \begin{cases} \theta x^{\theta - 1} & \text{if } 0 < x < 1 \\ 0 & \text{otherwise,} \end{cases}$$

where θ > 0 is a parameter. Is the estimator $\hat{\theta} = \frac{\overline{X}}{1 - \overline{X}}$ of θ, obtained by the moment method, a consistent estimator of θ? Justify your answer.

Chapter 17

SOME TECHNIQUES FOR FINDING INTERVAL ESTIMATORS FOR PARAMETERS

In point estimation we find a value for the parameter θ given sample data. For example, if X1, X2, ..., Xn is a random sample of size n from a population with probability density function

$$f(x;\theta) = \begin{cases} \sqrt{\dfrac{2}{\pi}}\, e^{-\frac{1}{2}(x-\theta)^2} & \text{for } x \geq \theta \\ 0 & \text{otherwise,} \end{cases}$$

then the likelihood function of θ is

$$L(\theta) = \prod_{i=1}^{n} \sqrt{\frac{2}{\pi}}\, e^{-\frac{1}{2}(x_i - \theta)^2} ,$$

where x1 ≥ θ, x2 ≥ θ, ..., xn ≥ θ. This likelihood function simplifies to

$$L(\theta) = \left( \sqrt{\frac{2}{\pi}} \right)^{n} e^{-\frac{1}{2} \sum_{i=1}^{n} (x_i - \theta)^2}$$

where min{x1, x2, ..., xn} ≥ θ. Taking the natural logarithm of L(θ) and maximizing, we obtain the maximum likelihood estimator of θ as the first order statistic of the sample X1, X2, ..., Xn, that is

$$\hat{\theta} = X_{(1)} ,$$

where X(1) = min{X1, X2, ..., Xn}. Suppose the true value of θ = 1. Using the maximum likelihood estimator of θ, we are trying to guess this value of θ based on a random sample. Suppose X1 = 1.5, X2 = 1.1, X3 = 1.7, X4 = 2.1, X5 = 3.1 is a set of sample data from the above population. Then based on this random sample, we will get

$$\hat{\theta}_{ML} = X_{(1)} = \min\{1.5, 1.1, 1.7, 2.1, 3.1\} = 1.1 .$$

If we take another random sample, say X1 = 1.8, X2 = 2.1, X3 = 2.5, X4 = 3.1, X5 = 2.6, then the maximum likelihood estimator of this θ will be $\hat{\theta} = 1.8$ based on this sample. The graph of the density function f(x; θ) for θ = 1 is shown below.

From the graph, it is clear that a number close to 1 has a higher chance of being randomly picked by the sampling process than the numbers that are substantially bigger than 1. Hence, it makes sense that θ should be estimated by the smallest sample value. However, from this example we see that the point estimate of θ is not equal to the true value of θ. Even if we take many random samples, the estimate of θ will rarely equal the actual value of the parameter. Hence, instead of finding a single value for θ, we should report a range of probable values for the parameter θ with a certain degree of confidence. This brings us to the notion of confidence interval of a parameter.

17.1. Interval Estimators and Conﬁdence Intervals for Parameters

The interval estimation problem can be stated as follows: Given a random sample X1, X2, ..., Xn and a probability value 1 − α, find a pair of statistics L = L(X1, X2, ..., Xn) and U = U(X1, X2, ..., Xn) with L ≤ U such that the probability of θ being on the random interval [L, U] is 1 − α. That is

$$P(L \leq \theta \leq U) = 1 - \alpha .$$

Recall that a sample is a portion of the population usually chosen by the method of random sampling and as such it is a set of random variables X1, X2, ..., Xn with the same probability density function f(x; θ) as the population. Once the sampling is done, we get

$$X_1 = x_1, \; X_2 = x_2, \; \cdots, \; X_n = x_n$$

where x1, x2, ..., xn are the sample data.

Definition 17.1. Let X1, X2, ..., Xn be a random sample of size n from a population X with density f(x; θ), where θ is an unknown parameter. The interval estimator of θ is a pair of statistics L = L(X1, X2, ..., Xn) and U = U(X1, X2, ..., Xn) with L ≤ U such that if x1, x2, ..., xn is a set of sample data, then θ belongs to the interval [L(x1, x2, ..., xn), U(x1, x2, ..., xn)].

The interval [l, u] will be denoted as an interval estimate of θ whereas the random interval [L, U] will denote the interval estimator of θ. Notice that the interval estimator of θ is the random interval [L, U]. Next, we define the 100(1 − α)% confidence interval for the unknown parameter θ.

Definition 17.2. Let X1, X2, ..., Xn be a random sample of size n from a population X with density f(x; θ), where θ is an unknown parameter. The interval estimator of θ is called a 100(1 − α)% confidence interval for θ if

$$P(L \leq \theta \leq U) = 1 - \alpha .$$

The random variable L is called the lower confidence limit and U is called the upper confidence limit. The number (1 − α) is called the confidence coefficient or degree of confidence.

There are several methods for constructing confidence intervals for an unknown parameter θ. Some well known methods are: (1) Pivotal Quantity Method, (2) Maximum Likelihood Estimator (MLE) Method, (3) Bayesian Method, (4) Invariant Methods, (5) Inversion of Test Statistic Method, and (6) The Statistical or General Method.

In this chapter, we only focus on the pivotal quantity method and the MLE method. We also briefly examine the statistical or general method. The pivotal quantity method is mainly due to George Barnard and David Fraser of the University of Waterloo, and this method is perhaps one of the most elegant methods of constructing confidence intervals for unknown parameters.

17.2. Pivotal Quantity Method

In this section, we explain how the notion of pivotal quantity can be used to construct a confidence interval for an unknown parameter. We will also examine how to find pivotal quantities for parameters associated with certain probability density functions. We begin with the formal definition of the pivotal quantity.

Deﬁnition 17.3. Let X1 , X2 , ..., Xn be a random sample of size n from a

population X with probability density function f (x ;✓ ), where ✓ is an un-

known parameter. A pivotal quantity Q is a function of X1 , X2 , ..., Xn and ✓

whose probability distribution is independent of the parameter ✓.

Notice that the pivotal quantity Q(X1 , X2 , ..., Xn ,✓ ) will usually contain

both the parameter ✓ and an estimator (that is, a statistic) of ✓ . Now we

give an example of a pivotal quantity.

Example 17.1. Let X1, X2, ..., Xn be a random sample from a normal population X with mean µ and a known variance σ². Find a pivotal quantity for the unknown parameter µ.

Answer: Since each Xi ∼ N(µ, σ²),

$$\overline{X} \sim N\left( \mu, \frac{\sigma^2}{n} \right) .$$

Standardizing $\overline{X}$, we see that

$$\frac{\overline{X} - \mu}{\frac{\sigma}{\sqrt{n}}} \sim N(0, 1) .$$

The quantity Q given by

$$Q(X_1, X_2, ..., X_n, \mu) = \frac{\overline{X} - \mu}{\frac{\sigma}{\sqrt{n}}}$$

is a pivotal quantity since it is a function of X1, X2, ..., Xn and µ and its probability density function is free of the parameter µ.
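To illustrate how a pivot of this kind leads to an interval estimator, the sketch below inverts P(−z ≤ Q ≤ z) = 1 − α, with z the upper α/2 quantile of N(0, 1), to obtain the interval X̄ ± zσ/√n, and then estimates its coverage by simulation. This is a minimal sketch under stated assumptions: the function names, the seed and the values µ = 3, σ = 2, n = 25 are hypothetical, and the inversion step is the standard one rather than a derivation quoted from the text.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def z_interval(xs, sigma, alpha=0.05):
    """Interval for the mean from the pivot Q = (xbar - mu) / (sigma / sqrt(n)), sigma known."""
    n = len(xs)
    z = NormalDist().inv_cdf(1 - alpha / 2)     # upper alpha/2 quantile of N(0, 1)
    half_width = z * sigma / sqrt(n)
    xbar = fmean(xs)
    return xbar - half_width, xbar + half_width

def coverage(mu, sigma, n, trials, alpha=0.05, seed=0):
    """Fraction of simulated samples whose interval contains the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        xs = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = z_interval(xs, sigma, alpha)
        hits += lower <= mu <= upper
    return hits / trials

estimated_coverage = coverage(mu=3.0, sigma=2.0, n=25, trials=2000)

# The estimate should be near the nominal level 1 - alpha = 0.95.
assert 0.92 < estimated_coverage < 0.98
```

Because the distribution of Q does not involve µ, the same quantile z works for every value of µ, which is why the coverage does not depend on the true mean chosen in the simulation.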

There is no general rule for ﬁnding a pivotal quantity (or pivot) for

a parameter ✓ of an arbitrarily given density function f (x ;✓ ). Hence to

some extents, ﬁnding pivots relies on guesswork. However, if the probability

density function f (x ;✓ ) belongs to the location-scale family, then there is a

systematic way to ﬁnd pivots.


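Definition 17.4. Let $g : \mathbb{R} \to \mathbb{R}$ be a probability density function. Then for any $\mu$ and any $\sigma > 0$, the family of functions

$$\mathcal{F} = \left\{ f(x; \mu, \sigma) = \frac{1}{\sigma}\, g\!\left(\frac{x - \mu}{\sigma}\right) \;\middle|\; \mu \in (-\infty, \infty),\ \sigma \in (0, \infty) \right\}$$

is called the location-scale family with standard probability density $g$. The parameter $\mu$ is called the location parameter and the parameter $\sigma$ is called the scale parameter. If $\sigma = 1$, then $\mathcal{F}$ is called the location family. If $\mu = 0$, then $\mathcal{F}$ is called the scale family.

It should be noted that each member $f(x; \mu, \sigma)$ of the location-scale family is a probability density function. If we take $g(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}$, then the normal density function

$$f(x; \mu, \sigma) = \frac{1}{\sigma}\, g\!\left(\frac{x - \mu}{\sigma}\right) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x - \mu}{\sigma}\right)^2}, \qquad -\infty < x < \infty$$

belongs to the location-scale family. The density function

$$f(x; \theta) = \begin{cases} \frac{1}{\theta}\, e^{-\frac{x}{\theta}} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

belongs to the scale family. However, the density function

$$f(x; \theta) = \begin{cases} \theta\, x^{\theta - 1} & \text{if } 0 < x < 1 \\ 0 & \text{otherwise,} \end{cases}$$

does not belong to the location-scale family.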
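As a small numerical check of Definition 17.4, the following sketch builds a member of the location-scale family from the standard normal density $g$ and compares it with the normal density computed directly. The helper name `location_scale_density` and the values of $\mu$, $\sigma$ and $x$ are illustrative assumptions, not part of the text.

```python
from statistics import NormalDist


def location_scale_density(g, x, mu, sigma):
    """Return f(x; mu, sigma) = (1/sigma) * g((x - mu) / sigma)."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    return g((x - mu) / sigma) / sigma


# Standard density g: the N(0, 1) density.
standard_normal_pdf = NormalDist(0.0, 1.0).pdf

# Illustrative location and scale values (assumptions, not from the text).
mu, sigma = 3.0, 2.0
for x in (-1.0, 0.5, 3.0, 7.25):
    via_family = location_scale_density(standard_normal_pdf, x, mu, sigma)
    direct = NormalDist(mu, sigma).pdf(x)
    print(f"x={x:6.2f}  family={via_family:.6f}  direct={direct:.6f}")
```

The two columns should agree up to rounding, which is what membership of the normal density in the location-scale family means.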

It is relatively easy to find pivotal quantities for the location or scale parameter when the density function of the population belongs to the location-scale family $\mathcal{F}$. When the density function belongs to the location family, the pivot for the location parameter $\mu$ is $\hat{\mu} - \mu$, where $\hat{\mu}$ is the maximum likelihood estimator of $\mu$. If $\hat{\sigma}$ is the maximum likelihood estimator of $\sigma$, then the pivot for the scale parameter $\sigma$ is $\frac{\hat{\sigma}}{\sigma}$ when the density function belongs to the scale family. The pivot for the location parameter $\mu$ is $\frac{\hat{\mu} - \mu}{\hat{\sigma}}$ and the pivot for the scale parameter $\sigma$ is $\frac{\hat{\sigma}}{\sigma}$ when the density function belongs to the location-scale family. Sometimes it is appropriate to make a minor modification to the pivot obtained in this way, such as multiplying by a constant, so that the modified pivot will have a known distribution.


Remark 17.1. A pivotal quantity can also be constructed using a sufficient statistic for the parameter. Suppose $T = T(X_1, X_2, ..., X_n)$ is a sufficient statistic based on a random sample $X_1, X_2, ..., X_n$ from a population $X$ with probability density function $f(x; \theta)$. Let the probability density function of $T$ be $g(t; \theta)$. If $g(t; \theta)$ belongs to the location family, then an appropriate constant multiple of $T - a(\theta)$ is a pivotal quantity for the location parameter $\theta$ for some suitable expression $a(\theta)$. If $g(t; \theta)$ belongs to the scale family, then an appropriate constant multiple of $\frac{T}{b(\theta)}$ is a pivotal quantity for the scale parameter $\theta$ for some suitable expression $b(\theta)$. Similarly, if $g(t; \theta)$ belongs to the location-scale family, then an appropriate constant multiple of $\frac{T - a(\theta)}{b(\theta)}$ is a pivotal quantity for the location parameter $\theta$ for some suitable expressions $a(\theta)$ and $b(\theta)$.

Algebraic manipulations of pivots are key factors in finding confidence intervals. If $Q = Q(X_1, X_2, ..., X_n, \theta)$ is a pivot, then a $100(1 - \alpha)\%$ confidence interval for $\theta$ may be constructed as follows: First, find two values $a$ and $b$ such that

$$P(a \le Q \le b) = 1 - \alpha,$$

then convert the inequality $a \le Q \le b$ into the form $L \le \theta \le U$.

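For example, if $X$ is a normal population with unknown mean $\mu$ and known variance $\sigma^2$, then its pdf belongs to the location-scale family. A pivot for $\mu$ is $\frac{\overline{X} - \mu}{S}$. However, since the variance $\sigma^2$ is known, there is no need to take $S$. So we consider the pivot $\frac{\overline{X} - \mu}{\sigma}$ to construct the $100(1 - 2\alpha)\%$ confidence interval for $\mu$. Since our population $X \sim N(\mu, \sigma^2)$, the sample mean $\overline{X}$ is also normal with the same mean $\mu$ and with standard deviation equal to $\frac{\sigma}{\sqrt{n}}$. Hence

$$\begin{aligned}
1 - 2\alpha &= P\left(-z_\alpha \le \frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \le z_\alpha\right) \\
&= P\left(\mu - z_\alpha \frac{\sigma}{\sqrt{n}} \le \overline{X} \le \mu + z_\alpha \frac{\sigma}{\sqrt{n}}\right) \\
&= P\left(\overline{X} - z_\alpha \frac{\sigma}{\sqrt{n}} \le \mu \le \overline{X} + z_\alpha \frac{\sigma}{\sqrt{n}}\right).
\end{aligned}$$

Therefore, the $100(1 - 2\alpha)\%$ confidence interval for $\mu$ is

$$\left[\overline{X} - z_\alpha \frac{\sigma}{\sqrt{n}},\ \overline{X} + z_\alpha \frac{\sigma}{\sqrt{n}}\right].$$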


Here $z_\alpha$ denotes the $100(1 - \alpha)$-percentile (or $(1 - \alpha)$-quantile) of a standard normal random variable $Z$, that is

$$P(Z \le z_\alpha) = 1 - \alpha,$$

where $\alpha \le 0.5$ (see figure below). Note that $\alpha = P(Z \le -z_\alpha)$ if $\alpha \le 0.5$.

A $100(1 - \alpha)\%$ confidence interval for a parameter $\theta$ has the following interpretation. If $X_1 = x_1, X_2 = x_2, ..., X_n = x_n$ is a sample of size $n$, then based on this sample we construct a $100(1 - \alpha)\%$ confidence interval $[l, u]$ which is a subinterval of the real line $\mathbb{R}$. Suppose we take a large number of samples from the underlying population and construct all the corresponding $100(1 - \alpha)\%$ confidence intervals; then approximately $100(1 - \alpha)\%$ of these intervals would include the unknown value of the parameter $\theta$.
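The long-run interpretation can be illustrated by simulation. The sketch below repeatedly draws normal samples with known variance, builds the interval $\overline{x} \mp z_{\alpha/2}\, \sigma/\sqrt{n}$ for each sample, and reports the fraction of intervals that contain the true mean. The function names, the seed, and the values of $\mu$, $\sigma$, $n$ and the number of trials are assumptions made only for illustration.

```python
import random
from statistics import NormalDist, fmean


def z_interval(sample, sigma, conf_level):
    """Known-variance interval: xbar -/+ z_{alpha/2} * sigma / sqrt(n)."""
    alpha = 1.0 - conf_level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    xbar = fmean(sample)
    half_width = z * sigma / len(sample) ** 0.5
    return xbar - half_width, xbar + half_width


def coverage(mu, sigma, n, conf_level, trials, seed=0):
    """Fraction of simulated intervals that contain the true mean mu."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = z_interval(sample, sigma, conf_level)
        hits += lower <= mu <= upper
    return hits / trials


# Hypothetical settings chosen only for illustration.
print(coverage(mu=10.0, sigma=3.0, n=25, conf_level=0.95, trials=2000))
```

The printed fraction should be close to 0.95, though it varies with the seed and the number of trials.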

In the next several sections, we illustrate how the pivotal quantity method can be used to determine confidence intervals for various parameters.

17.3. Conﬁdence Interval for Population Mean

At the outset, we use the pivotal quantity method to construct a confidence interval for the mean of a normal population. Here we assume first that the population variance is known and then that the variance is unknown. Next, we construct the confidence interval for the mean of a population with a continuous, symmetric and unimodal probability distribution by applying the central limit theorem.

Let $X_1, X_2, ..., X_n$ be a random sample from a population $X \sim N(\mu, \sigma^2)$, where $\mu$ is an unknown parameter and $\sigma^2$ is a known parameter. First of all, we need a pivotal quantity $Q(X_1, X_2, ..., X_n, \mu)$. To construct this pivotal quantity, we find the maximum likelihood estimator of the parameter $\mu$. We know that $\hat{\mu} = \overline{X}$. Since each $X_i \sim N(\mu, \sigma^2)$, the distribution of the sample mean is given by

$$\overline{X} \sim N\left(\mu, \frac{\sigma^2}{n}\right).$$

It is easy to see that the distribution of the estimator of $\mu$ is not independent of the parameter $\mu$. If we standardize $\overline{X}$, then we get

$$\frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1).$$

The distribution of the standardized $\overline{X}$ is independent of the parameter $\mu$. This standardized $\overline{X}$ is the pivotal quantity since it is a function of the sample $X_1, X_2, ..., X_n$ and the parameter $\mu$, and its probability distribution is independent of the parameter $\mu$. Using this pivotal quantity, we construct the confidence interval as follows:

$$\begin{aligned}
1 - \alpha &= P\left(-z_{\frac{\alpha}{2}} \le \frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \le z_{\frac{\alpha}{2}}\right) \\
&= P\left(\overline{X} - \frac{\sigma}{\sqrt{n}}\, z_{\frac{\alpha}{2}} \le \mu \le \overline{X} + \frac{\sigma}{\sqrt{n}}\, z_{\frac{\alpha}{2}}\right).
\end{aligned}$$

Hence, the $100(1 - \alpha)\%$ confidence interval for $\mu$ when the population $X$ is normal with the known variance $\sigma^2$ is given by

$$\left[\overline{X} - \frac{\sigma}{\sqrt{n}}\, z_{\frac{\alpha}{2}},\ \overline{X} + \frac{\sigma}{\sqrt{n}}\, z_{\frac{\alpha}{2}}\right].$$

This says that if samples of size $n$ are taken from a normal population with mean $\mu$ and known variance $\sigma^2$ and if the interval

$$\left[\overline{X} - \frac{\sigma}{\sqrt{n}}\, z_{\frac{\alpha}{2}},\ \overline{X} + \frac{\sigma}{\sqrt{n}}\, z_{\frac{\alpha}{2}}\right]$$

is constructed for every sample, then in the long run $100(1 - \alpha)\%$ of the intervals will cover the unknown parameter $\mu$, and hence with a confidence of $100(1 - \alpha)\%$ we can say that $\mu$ lies in the interval

$$\left[\overline{X} - \frac{\sigma}{\sqrt{n}}\, z_{\frac{\alpha}{2}},\ \overline{X} + \frac{\sigma}{\sqrt{n}}\, z_{\frac{\alpha}{2}}\right].$$

The interval estimate of $\mu$ is found by taking a good (here maximum likelihood) estimator $\overline{X}$ of $\mu$ and adding and subtracting $z_{\frac{\alpha}{2}}$ times the standard deviation of $\overline{X}$.

Remark 17.2. By definition, a $100(1 - \alpha)\%$ confidence interval for a parameter $\theta$ is an interval $[L, U]$ such that the probability of $\theta$ being in the interval $[L, U]$ is $1 - \alpha$. That is

$$1 - \alpha = P(L \le \theta \le U).$$

One can find infinitely many pairs $L$, $U$ such that

$$1 - \alpha = P(L \le \theta \le U).$$

Hence, there are infinitely many confidence intervals for a given parameter. However, we only consider the confidence interval of shortest length. If a confidence interval is constructed by omitting equal tail areas, then we obtain what is known as the central confidence interval. In a symmetric distribution, it can be shown that the central confidence interval is of the shortest length.

Example 17.2. Let $X_1, X_2, ..., X_{11}$ be a random sample of size 11 from a normal distribution with unknown mean $\mu$ and variance $\sigma^2 = 9.9$. If $\sum_{i=1}^{11} x_i = 132$, then what is the 95% confidence interval for $\mu$?

Answer: Since each $X_i \sim N(\mu, 9.9)$, the confidence interval for $\mu$ is given by

$$\left[\overline{X} - \frac{\sigma}{\sqrt{n}}\, z_{\frac{\alpha}{2}},\ \overline{X} + \frac{\sigma}{\sqrt{n}}\, z_{\frac{\alpha}{2}}\right].$$

Since $\sum_{i=1}^{11} x_i = 132$, the sample mean is $\overline{x} = \frac{132}{11} = 12$. Also, we see that

$$\sqrt{\frac{\sigma^2}{n}} = \sqrt{\frac{9.9}{11}} = \sqrt{0.9}.$$

Further, since $1 - \alpha = 0.95$, $\alpha = 0.05$. Thus

$$z_{\frac{\alpha}{2}} = z_{0.025} = 1.96 \quad \text{(from normal table)}.$$

Using this information in the expression of the confidence interval for $\mu$, we get

$$\left[12 - 1.96\sqrt{0.9},\ 12 + 1.96\sqrt{0.9}\right],$$

that is

$$[10.141,\ 13.859].$$
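The computation in Example 17.2 can be reproduced with a few lines of Python. This is a minimal sketch: the function name `mean_interval_known_variance` is an assumption, and the critical value $z_{\alpha/2}$ is obtained from `statistics.NormalDist` instead of the normal table.

```python
from statistics import NormalDist


def mean_interval_known_variance(xbar, variance, n, conf_level):
    """Return xbar -/+ z_{alpha/2} * sqrt(variance / n)."""
    alpha = 1.0 - conf_level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)  # z_{alpha/2}
    half_width = z * (variance / n) ** 0.5
    return xbar - half_width, xbar + half_width


# Data of Example 17.2: n = 11, sum of observations 132, variance 9.9.
lower, upper = mean_interval_known_variance(
    xbar=132 / 11, variance=9.9, n=11, conf_level=0.95
)
print(f"[{lower:.3f}, {upper:.3f}]")  # [10.141, 13.859]
```

Calling the same function with `conf_level=0.90` uses $z_{0.05} \approx 1.645$ in place of the rounded table value 1.64, so the endpoints may differ from hand computations in the third decimal place.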


Example 17.3. Let $X_1, X_2, ..., X_{11}$ be a random sample of size 11 from a normal distribution with unknown mean $\mu$ and variance $\sigma^2 = 9.9$. If $\sum_{i=1}^{11} x_i = 132$, then for what value of the constant $k$ is

$$\left[12 - k\sqrt{0.9},\ 12 + k\sqrt{0.9}\right]$$

a 90% confidence interval for $\mu$?

Answer: The 90% confidence interval for $\mu$ when the variance is given is

$$\left[\overline{x} - \frac{\sigma}{\sqrt{n}}\, z_{\frac{\alpha}{2}},\ \overline{x} + \frac{\sigma}{\sqrt{n}}\, z_{\frac{\alpha}{2}}\right].$$

Thus we need to find $\overline{x}$, $\sqrt{\frac{\sigma^2}{n}}$ and $z_{\frac{\alpha}{2}}$ corresponding to $1 - \alpha = 0.9$. Hence

$$\overline{x} = \frac{\sum_{i=1}^{11} x_i}{11} = \frac{132}{11} = 12,$$

$$\sqrt{\frac{\sigma^2}{n}} = \sqrt{\frac{9.9}{11}} = \sqrt{0.9},$$

$$z_{0.05} = 1.64 \quad \text{(from normal table)}.$$

Hence, the confidence interval for $\mu$ at 90% confidence level is

$$\left[12 - (1.64)\sqrt{0.9},\ 12 + (1.64)\sqrt{0.9}\right].$$

Comparing this interval with the given interval, we get

$$k = 1.64,$$

and the corresponding 90% confidence interval is $[10.444,\ 13.556]$.

Remark 17.3. Notice that the length of the 90% confidence interval for $\mu$ is 3.112. However, the length of the 95% confidence interval is 3.718. Thus, the higher the confidence level, the longer the confidence interval. Hence, the length of the confidence interval increases with the confidence level. In view of this fact, we see that if the confidence level is zero, then the length is also zero. That is, when the confidence level is zero, the confidence interval of $\mu$ degenerates into the point $\overline{X}$.

Until now we have considered the case when the population is normal with unknown mean $\mu$ and known variance $\sigma^2$. Now we consider the case when the population is non-normal but its probability density function is continuous, symmetric and unimodal. If the sample size is large, then by the central limit theorem

$$\frac{\overline{X} - \mu}{\sigma/\sqrt{n}} \sim N(0, 1) \quad \text{as } n \to \infty.$$

Thus, in this case we can take the pivotal quantity to be

$$Q(X_1, X_2, ..., X_n, \mu) = \frac{\overline{X} - \mu}{\sigma/\sqrt{n}}$$

if the sample size is large (generally $n \ge 32$). Since the pivotal quantity is the same as before, we get the same expression for the $100(1 - \alpha)\%$ confidence interval, that is

$$\left[\overline{X} - \frac{\sigma}{\sqrt{n}}\, z_{\frac{\alpha}{2}},\ \overline{X} + \frac{\sigma}{\sqrt{n}}\, z_{\frac{\alpha}{2}}\right].$$

Example 17.4. Let $X_1, X_2, ..., X_{40}$ be a random sample of size 40 from a distribution with known variance and unknown mean $\mu$. If $\sum_{i=1}^{40} x_i = 286.56$ and $\sigma^2 = 10$, then what is the 90 percent confidence interval for the population mean $\mu$?

Answer: Since $1 - \alpha = 0.90$, we get $\frac{\alpha}{2} = 0.05$. Hence, $z_{0.05} = 1.64$ (from the standard normal table). Next, we find the sample mean

$$\overline{x} = \frac{286.56}{40} = 7.164.$$

Hence, the confidence interval for $\mu$ is given by

$$\left[7.164 - (1.64)\sqrt{\frac{10}{40}},\ 7.164 + (1.64)\sqrt{\frac{10}{40}}\right],$$

that is

$$[6.344,\ 7.984].$$


Example 17.5. In sampling from a nonnormal distribution with a variance of 25, how large must the sample size be so that the length of a 95% confidence interval for the mean is 1.96?

Answer: For a large sample from a population with a variance of 25, the confidence interval for the mean is

$$\left[\overline{x} - \frac{\sigma}{\sqrt{n}}\, z_{\frac{\alpha}{2}},\ \overline{x} + \frac{\sigma}{\sqrt{n}}\, z_{\frac{\alpha}{2}}\right].$$

Thus the length of the confidence interval is

$$\ell = 2\, z_{\frac{\alpha}{2}} \sqrt{\frac{\sigma^2}{n}} = 2\, z_{0.025} \sqrt{\frac{25}{n}} = 2\,(1.96) \sqrt{\frac{25}{n}}.$$

But we are given that the length of the confidence interval is $\ell = 1.96$. Thus

$$1.96 = 2\,(1.96) \sqrt{\frac{25}{n}}, \qquad \sqrt{n} = 10, \qquad n = 100.$$

Hence, the sample size must be 100 so that the length of the 95% confidence interval will be 1.96.
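Solving $\ell = 2 z_{\alpha/2}\, \sigma / \sqrt{n}$ for $n$ gives $n = (2 z_{\alpha/2}\, \sigma / \ell)^2$, which the following sketch evaluates and rounds up to the next integer. The function name `sample_size_for_length` is an assumption; the critical value comes from `statistics.NormalDist` in place of the table value 1.96.

```python
import math
from statistics import NormalDist


def sample_size_for_length(variance, length, conf_level):
    """Smallest n with 2 * z_{alpha/2} * sqrt(variance / n) <= length."""
    alpha = 1.0 - conf_level
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    return math.ceil((2.0 * z) ** 2 * variance / length ** 2)


# Data of Example 17.5: variance 25, desired length 1.96, 95% confidence.
print(sample_size_for_length(variance=25.0, length=1.96, conf_level=0.95))  # 100
```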

So far, we have discussed the construction of a confidence interval for the population mean when the variance is known. It is very unlikely that one will know the variance without knowing the population mean, and thus what we have treated so far in this section is not very realistic. Now we treat the case of constructing the confidence interval for the population mean when the population variance is also unknown. First of all, we begin with the construction of the confidence interval assuming the population $X$ is normal.

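Suppose $X_1, X_2, ..., X_n$ is a random sample from a normal population $X$ with mean $\mu$ and variance $\sigma^2 > 0$. Let the sample mean and sample variance be $\overline{X}$ and $S^2$, respectively. Then

$$\frac{(n - 1) S^2}{\sigma^2} \sim \chi^2(n - 1)$$

and

$$\frac{\overline{X} - \mu}{\sqrt{\frac{\sigma^2}{n}}} \sim N(0, 1).$$

Therefore, since $\overline{X}$ and $S^2$ are independent for a normal sample, the random variable defined by the ratio of $\frac{\overline{X} - \mu}{\sqrt{\sigma^2 / n}}$ to $\sqrt{\frac{(n - 1) S^2}{(n - 1)\, \sigma^2}}$ has a $t$-distribution with $(n - 1)$ degrees of freedom, that is

$$Q(X_1, X_2, ..., X_n, \mu) = \frac{\dfrac{\overline{X} - \mu}{\sqrt{\frac{\sigma^2}{n}}}}{\sqrt{\dfrac{(n - 1) S^2}{(n - 1)\, \sigma^2}}} = \frac{\overline{X} - \mu}{\sqrt{\frac{S^2}{n}}} \sim t(n - 1),$$

where $Q$ is the pivotal quantity to be used for the construction of the confidence interval for $\mu$. Using this pivotal quantity, we construct the confidence interval as follows:

$$\begin{aligned}
1 - \alpha &= P\left(-t_{\frac{\alpha}{2}}(n - 1) \le \frac{\overline{X} - \mu}{S/\sqrt{n}} \le t_{\frac{\alpha}{2}}(n - 1)\right) \\
&= P\left(\overline{X} - \frac{S}{\sqrt{n}}\, t_{\frac{\alpha}{2}}(n - 1) \le \mu \le \overline{X} + \frac{S}{\sqrt{n}}\, t_{\frac{\alpha}{2}}(n - 1)\right).
\end{aligned}$$

Hence, the $100(1 - \alpha)\%$ confidence interval for $\mu$ when the population $X$ is normal with the unknown variance $\sigma^2$ is given by

$$\left[\overline{X} - \frac{S}{\sqrt{n}}\, t_{\frac{\alpha}{2}}(n - 1),\ \overline{X} + \frac{S}{\sqrt{n}}\, t_{\frac{\alpha}{2}}(n - 1)\right].$$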

Example 17.6. A random sample of 9 observations from a normal population yields the observed statistics $\overline{x} = 5$ and $\frac{1}{8} \sum_{i=1}^{9} (x_i - \overline{x})^2 = 36$. What is the 95% confidence interval for $\mu$?

Answer: Since

$$n = 9, \qquad \overline{x} = 5, \qquad s^2 = 36 \qquad \text{and} \qquad 1 - \alpha = 0.95,$$

the 95% confidence interval for $\mu$ is given by

$$\left[\overline{x} - \frac{s}{\sqrt{n}}\, t_{\frac{\alpha}{2}}(n - 1),\ \overline{x} + \frac{s}{\sqrt{n}}\, t_{\frac{\alpha}{2}}(n - 1)\right],$$

that is

$$\left[5 - \frac{6}{\sqrt{9}}\, t_{0.025}(8),\ 5 + \frac{6}{\sqrt{9}}\, t_{0.025}(8)\right],$$

which is

$$\left[5 - \frac{6}{\sqrt{9}}\,(2.306),\ 5 + \frac{6}{\sqrt{9}}\,(2.306)\right].$$

Hence, the 95% confidence interval for $\mu$ is given by $[0.388,\ 9.612]$.
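The arithmetic of Example 17.6 can be sketched as follows. Because the Python standard library has no $t$-distribution quantile function, the critical value $t_{0.025}(8) = 2.306$ is passed in from the $t$-table, as in the example. The function name `mean_interval_unknown_variance` is an assumption.

```python
def mean_interval_unknown_variance(xbar, s2, n, t_crit):
    """Return xbar -/+ t_crit * sqrt(s2 / n), with t_crit = t_{alpha/2}(n - 1)."""
    half_width = t_crit * (s2 / n) ** 0.5
    return xbar - half_width, xbar + half_width


# Data of Example 17.6: n = 9, xbar = 5, s^2 = 36, t_{0.025}(8) = 2.306.
lower, upper = mean_interval_unknown_variance(xbar=5.0, s2=36.0, n=9, t_crit=2.306)
print(f"[{lower:.3f}, {upper:.3f}]")  # [0.388, 9.612]
```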

Example 17.7. Which of the following is true of a 95% conﬁdence interval

for the mean of a population?

(a) The interval includes 95% of the population values on the average.

(b) The interval includes 95% of the sample values on the average.

Answer: Neither statement is correct, since a 95% confidence interval for the population mean $\mu$ means that the interval has a 95% chance of including the population mean $\mu$.

Finally, we consider the case when the population is non-normal but its probability density function is continuous, symmetric and unimodal. If some weak conditions are satisfied, then the sample variance $S^2$ of a random sample of size $n \ge 2$ converges stochastically to $\sigma^2$. Therefore, in

$$\frac{\dfrac{\overline{X} - \mu}{\sqrt{\frac{\sigma^2}{n}}}}{\sqrt{\dfrac{(n - 1) S^2}{(n - 1)\, \sigma^2}}} = \frac{\overline{X} - \mu}{\sqrt{\frac{S^2}{n}}}$$

the numerator of the left-hand member converges to $N(0, 1)$ and the denominator of that member converges to 1. Hence

$$\frac{\overline{X} - \mu}{\sqrt{\frac{S^2}{n}}} \sim N(0, 1) \quad \text{as } n \to \infty.$$

This fact can be used for the construction of a confidence interval for the population mean when the variance is unknown and the population distribution is nonnormal. We let the pivotal quantity be

$$Q(X_1, X_2, ..., X_n, \mu) = \frac{\overline{X} - \mu}{\sqrt{\frac{S^2}{n}}}$$

and obtain the following confidence interval

$$\left[\overline{X} - \frac{S}{\sqrt{n}}\, z_{\frac{\alpha}{2}},\ \overline{X} + \frac{S}{\sqrt{n}}\, z_{\frac{\alpha}{2}}\right].$$


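We summarize the results of this section by the following table.

Population | Variance $\sigma^2$ | Sample Size $n$ | Confidence Limits
normal | known | $n \ge 2$ | $\overline{x} \mp z_{\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}}$
normal | not known | $n \ge 2$ | $\overline{x} \mp t_{\frac{\alpha}{2}}(n - 1) \frac{s}{\sqrt{n}}$
not normal | known | $n \ge 32$ | $\overline{x} \mp z_{\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}}$
not normal | known | $n < 32$ | no formula exists
not normal | not known | $n \ge 32$ | $\overline{x} \mp t_{\frac{\alpha}{2}}(n - 1) \frac{s}{\sqrt{n}}$
not normal | not known | $n < 32$ | no formula exists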

17.4. Conﬁdence Interval for Population Variance

In this section, we will first describe the method for constructing the confidence interval for the variance when the population is normal with a known population mean $\mu$. Then we treat the case when the population mean is also unknown.

Let $X_1, X_2, ..., X_n$ be a random sample from a normal population $X$ with known mean $\mu$ and unknown variance $\sigma^2$. We would like to construct a $100(1 - \alpha)\%$ confidence interval for the variance $\sigma^2$, that is, we would like to find estimates of $L$ and $U$ such that

$$P\left(L \le \sigma^2 \le U\right) = 1 - \alpha.$$

To find these estimates of $L$ and $U$, we first construct a pivotal quantity. Thus

$$X_i \sim N(\mu, \sigma^2), \qquad \frac{X_i - \mu}{\sigma} \sim N(0, 1), \qquad \left(\frac{X_i - \mu}{\sigma}\right)^2 \sim \chi^2(1), \qquad \sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2 \sim \chi^2(n).$$

We define the pivotal quantity $Q(X_1, X_2, ..., X_n, \sigma^2)$ as

$$Q(X_1, X_2, ..., X_n, \sigma^2) = \sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2$$

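which has a chi-square distribution with $n$ degrees of freedom. Hence

$$\begin{aligned}
1 - \alpha &= P(a \le Q \le b) \\
&= P\left(a \le \sum_{i=1}^{n} \left(\frac{X_i - \mu}{\sigma}\right)^2 \le b\right) \\
&= P\left(\frac{1}{a} \ge \frac{\sigma^2}{\sum_{i=1}^{n} (X_i - \mu)^2} \ge \frac{1}{b}\right) \\
&= P\left(\frac{\sum_{i=1}^{n} (X_i - \mu)^2}{a} \ge \sigma^2 \ge \frac{\sum_{i=1}^{n} (X_i - \mu)^2}{b}\right) \\
&= P\left(\frac{\sum_{i=1}^{n} (X_i - \mu)^2}{b} \le \sigma^2 \le \frac{\sum_{i=1}^{n} (X_i - \mu)^2}{a}\right) \\
&= P\left(\frac{\sum_{i=1}^{n} (X_i - \mu)^2}{\chi^2_{1 - \frac{\alpha}{2}}(n)} \le \sigma^2 \le \frac{\sum_{i=1}^{n} (X_i - \mu)^2}{\chi^2_{\frac{\alpha}{2}}(n)}\right).
\end{aligned}$$

Therefore, the $100(1 - \alpha)\%$ confidence interval for $\sigma^2$ when the mean is known is given by

$$\left[\frac{\sum_{i=1}^{n} (X_i - \mu)^2}{\chi^2_{1 - \frac{\alpha}{2}}(n)},\ \frac{\sum_{i=1}^{n} (X_i - \mu)^2}{\chi^2_{\frac{\alpha}{2}}(n)}\right].$$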

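Example 17.8. A random sample of 9 observations from a normal population with $\mu = 5$ yields the observed statistics $\frac{1}{8} \sum_{i=1}^{9} x_i^2 = 39.125$ and $\sum_{i=1}^{9} x_i = 45$. What is the 95% confidence interval for $\sigma^2$?

Answer: We have been given that

$$n = 9 \qquad \text{and} \qquad \mu = 5.$$

Further we know that

$$\sum_{i=1}^{9} x_i = 45 \qquad \text{and} \qquad \frac{1}{8} \sum_{i=1}^{9} x_i^2 = 39.125.$$

Hence

$$\sum_{i=1}^{9} x_i^2 = 313,$$

and

$$\begin{aligned}
\sum_{i=1}^{9} (x_i - \mu)^2 &= \sum_{i=1}^{9} x_i^2 - 2\mu \sum_{i=1}^{9} x_i + 9\mu^2 \\
&= 313 - 450 + 225 \\
&= 88.
\end{aligned}$$

Since $1 - \alpha = 0.95$, we get $\frac{\alpha}{2} = 0.025$ and $1 - \frac{\alpha}{2} = 0.975$. Using the chi-square table we have

$$\chi^2_{0.025}(9) = 2.700 \qquad \text{and} \qquad \chi^2_{0.975}(9) = 19.02.$$

Hence, the 95% confidence interval for $\sigma^2$ is given by

$$\left[\frac{\sum_{i=1}^{n} (X_i - \mu)^2}{\chi^2_{1 - \frac{\alpha}{2}}(n)},\ \frac{\sum_{i=1}^{n} (X_i - \mu)^2}{\chi^2_{\frac{\alpha}{2}}(n)}\right],$$

that is

$$\left[\frac{88}{19.02},\ \frac{88}{2.7}\right],$$

which is

$$[4.63,\ 32.59].$$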
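The steps of Example 17.8 can be sketched in Python. The sum of squared deviations about the known mean is rebuilt from the summary statistics, and the chi-square quantiles are passed in from the table, since the standard library does not provide them. The function name `variance_interval_known_mean` is an assumption.

```python
def variance_interval_known_mean(sum_x, sum_x2, n, mu, chi2_lower, chi2_upper):
    """Interval for sigma^2 when mu is known.

    chi2_lower and chi2_upper are the alpha/2 and 1 - alpha/2 quantiles
    of the chi-square distribution with n degrees of freedom.
    """
    # sum (x_i - mu)^2 = sum x_i^2 - 2 mu sum x_i + n mu^2
    ss = sum_x2 - 2.0 * mu * sum_x + n * mu ** 2
    return ss / chi2_upper, ss / chi2_lower


# Data of Example 17.8 with the table values 2.700 and 19.02.
lower, upper = variance_interval_known_mean(
    sum_x=45.0, sum_x2=313.0, n=9, mu=5.0, chi2_lower=2.700, chi2_upper=19.02
)
print(f"[{lower:.2f}, {upper:.2f}]")  # [4.63, 32.59]
```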

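Remark 17.4. Since the $\chi^2$ distribution is not symmetric, the above confidence interval is not necessarily the shortest. Later, we describe how one constructs a confidence interval of shortest length.

Consider a random sample $X_1, X_2, ..., X_n$ from a normal population $X \sim N(\mu, \sigma^2)$, where the population mean $\mu$ and population variance $\sigma^2$ are unknown. We want to construct a $100(1 - \alpha)\%$ confidence interval for the population variance. We know that

$$\frac{(n - 1) S^2}{\sigma^2} \sim \chi^2(n - 1) \quad \Rightarrow \quad \frac{\sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2}{\sigma^2} \sim \chi^2(n - 1).$$

We take $\frac{\sum_{i=1}^{n} (X_i - \overline{X})^2}{\sigma^2}$ as the pivotal quantity $Q$ to construct the confidence interval for $\sigma^2$. Hence, we have

$$\begin{aligned}
1 - \alpha &= P\left(\chi^2_{\frac{\alpha}{2}}(n - 1) \le Q \le \chi^2_{1 - \frac{\alpha}{2}}(n - 1)\right) \\
&= P\left(\chi^2_{\frac{\alpha}{2}}(n - 1) \le \frac{\sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2}{\sigma^2} \le \chi^2_{1 - \frac{\alpha}{2}}(n - 1)\right) \\
&= P\left(\frac{\sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2}{\chi^2_{1 - \frac{\alpha}{2}}(n - 1)} \le \sigma^2 \le \frac{\sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2}{\chi^2_{\frac{\alpha}{2}}(n - 1)}\right).
\end{aligned}$$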

Hence, the $100(1 - \alpha)\%$ confidence interval for the variance $\sigma^2$ when the population mean is unknown is given by

$$\left[\frac{\sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2}{\chi^2_{1 - \frac{\alpha}{2}}(n - 1)},\ \frac{\sum_{i=1}^{n} \left(X_i - \overline{X}\right)^2}{\chi^2_{\frac{\alpha}{2}}(n - 1)}\right].$$

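Example 17.9. Let $X_1, X_2, ..., X_{13}$ be a random sample of size 13 from a normal distribution $N(\mu, \sigma^2)$. If $\sum_{i=1}^{13} x_i = 246.61$ and $\sum_{i=1}^{13} x_i^2 = 4806.61$, find the 90% confidence interval for $\sigma^2$.

Answer: We have $\overline{x} = 18.97$ and

$$\begin{aligned}
s^2 &= \frac{1}{n - 1} \sum_{i=1}^{n} (x_i - \overline{x})^2 \\
&= \frac{1}{n - 1} \left[\sum_{i=1}^{n} x_i^2 - n\, \overline{x}^2\right] \\
&= \frac{1}{12} \left[4806.61 - 4678.2\right] \\
&= \frac{1}{12}\, 128.41.
\end{aligned}$$

Hence, $12 s^2 = 128.41$. Further, since $1 - \alpha = 0.90$, we get $\frac{\alpha}{2} = 0.05$ and $1 - \frac{\alpha}{2} = 0.95$. Therefore, from the chi-square table, we get

$$\chi^2_{0.95}(12) = 21.03, \qquad \chi^2_{0.05}(12) = 5.23.$$

Hence, the 90% confidence interval for $\sigma^2$ is

$$\left[\frac{128.41}{21.03},\ \frac{128.41}{5.23}\right],$$

that is

$$[6.11,\ 24.55].$$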

Example 17.10. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution $N(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are unknown parameters. What is the shortest 90% confidence interval for the standard deviation $\sigma$?

Answer: Let $S^2$ be the sample variance. Then

$$\frac{(n - 1) S^2}{\sigma^2} \sim \chi^2(n - 1).$$

Using this random variable as a pivot, we can construct a $100(1 - \alpha)\%$ confidence interval for $\sigma$ from

$$1 - \alpha = P\left(a \le \frac{(n - 1) S^2}{\sigma^2} \le b\right)$$

by suitably choosing the constants $a$ and $b$. Hence, the confidence interval for $\sigma$ is given by

$$\left[\sqrt{\frac{(n - 1) S^2}{b}},\ \sqrt{\frac{(n - 1) S^2}{a}}\right].$$

The length of this confidence interval is given by

$$L(a, b) = S \sqrt{n - 1} \left[\frac{1}{\sqrt{a}} - \frac{1}{\sqrt{b}}\right].$$

In order to find the shortest confidence interval, we should find a pair of constants $a$ and $b$ such that $L(a, b)$ is minimum. Thus, we have a constrained minimization problem. That is

$$\left.\begin{array}{l}
\text{Minimize } L(a, b) \\
\text{subject to the condition } \displaystyle\int_a^b f(u)\, du = 1 - \alpha,
\end{array}\right\} \qquad \text{(MP)}$$

where

$$f(x) = \frac{1}{\Gamma\left(\frac{n - 1}{2}\right) 2^{\frac{n - 1}{2}}}\, x^{\frac{n - 1}{2} - 1}\, e^{-\frac{x}{2}}.$$

Differentiating $L$ with respect to $a$, we get

$$\frac{dL}{da} = S \sqrt{n - 1} \left[-\frac{1}{2}\, a^{-\frac{3}{2}} + \frac{1}{2}\, b^{-\frac{3}{2}}\, \frac{db}{da}\right].$$

From

$$\int_a^b f(u)\, du = 1 - \alpha,$$

we find the derivative of $b$ with respect to $a$ as follows:

$$\frac{d}{da} \int_a^b f(u)\, du = \frac{d}{da}(1 - \alpha),$$

that is

$$f(b)\, \frac{db}{da} - f(a) = 0.$$

Thus, we have

$$\frac{db}{da} = \frac{f(a)}{f(b)}.$$

Substituting this into the expression for the derivative of $L$, we get

$$\frac{dL}{da} = S \sqrt{n - 1} \left[-\frac{1}{2}\, a^{-\frac{3}{2}} + \frac{1}{2}\, b^{-\frac{3}{2}}\, \frac{f(a)}{f(b)}\right].$$

Setting this derivative to zero, we get

$$S \sqrt{n - 1} \left[-\frac{1}{2}\, a^{-\frac{3}{2}} + \frac{1}{2}\, b^{-\frac{3}{2}}\, \frac{f(a)}{f(b)}\right] = 0,$$

which yields

$$a^{\frac{3}{2}} f(a) = b^{\frac{3}{2}} f(b).$$

Using the form of $f$, we get from the above expression

$$a^{\frac{3}{2}}\, a^{\frac{n - 3}{2}}\, e^{-\frac{a}{2}} = b^{\frac{3}{2}}\, b^{\frac{n - 3}{2}}\, e^{-\frac{b}{2}},$$

that is

$$a^{\frac{n}{2}}\, e^{-\frac{a}{2}} = b^{\frac{n}{2}}\, e^{-\frac{b}{2}}.$$

From this we get

$$\ln\left(\frac{a}{b}\right) = \frac{a - b}{n}.$$

Hence, to obtain the pair of constants $a$ and $b$ that will produce the shortest confidence interval for $\sigma$, we have to solve the following system of nonlinear equations

$$\left.\begin{array}{l}
\displaystyle\int_a^b f(u)\, du = 1 - \alpha \\[2ex]
\ln\left(\dfrac{a}{b}\right) = \dfrac{a - b}{n}.
\end{array}\right\} \qquad (\star)$$

If $a_o$ and $b_o$ are solutions of $(\star)$, then the shortest confidence interval for $\sigma$ is given by

$$\left[\sqrt{\frac{(n - 1) S^2}{b_o}},\ \sqrt{\frac{(n - 1) S^2}{a_o}}\right].$$

Since this system of nonlinear equations is hard to solve analytically, numerical solutions are given in the statistical literature in the form of a table for finding the shortest interval for the variance.
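As an alternative to tables, the system $(\star)$ can be solved numerically. The following is a rough sketch using only the Python standard library, not a definitive implementation: the chi-square distribution function is evaluated through a series for the incomplete gamma function, quantiles are found by bisection, and a second bisection is run on the probability $p$ left below $a$. All function names, iteration counts and search bounds are assumptions, and the bracketing used in the outer bisection was only reasoned through for moderate $n$ such as $n = 13$.

```python
import math


def chi2_cdf(x, df):
    """Chi-square CDF via the series for the regularized incomplete gamma function."""
    if x <= 0.0:
        return 0.0
    s, z = df / 2.0, x / 2.0
    term = 1.0 / s
    total = term
    k = 1
    while term > 1e-17 * total:
        term *= z / (s + k)
        total += term
        k += 1
    return total * math.exp(s * math.log(z) - z - math.lgamma(s))


def chi2_quantile(p, df):
    """Invert chi2_cdf by bisection on a generous bracket."""
    lo, hi = 0.0, df + 10.0 * math.sqrt(2.0 * df) + 50.0
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if chi2_cdf(mid, df) < p:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


def shortest_sigma_constants(n, alpha):
    """Solve the system for (a, b): coverage 1 - alpha and ln(a/b) = (a - b)/n."""
    df = n - 1

    def residual(p):
        # p is the probability below a, so the coverage constraint fixes b.
        a = chi2_quantile(p, df)
        b = chi2_quantile(p + 1.0 - alpha, df)
        return math.log(a / b) - (a - b) / n, a, b

    lo, hi = 1e-9, alpha - 1e-9
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if residual(mid)[0] < 0.0:
            lo = mid
        else:
            hi = mid
    _, a, b = residual(0.5 * (lo + hi))
    return a, b


a_o, b_o = shortest_sigma_constants(n=13, alpha=0.10)
print(a_o, b_o)
```

For a sample of size 13, the resulting interval for $\sigma$ is $\left[\sqrt{12 s^2 / b_o},\ \sqrt{12 s^2 / a_o}\right]$, which should be somewhat shorter than the equal-tail interval based on the table values 5.23 and 21.03.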

17.5. Confidence Interval for Parameter of some Distributions not belonging to the Location-Scale Family

In this section, we illustrate the pivotal quantity method for finding confidence intervals for a parameter $\theta$ when the density function does not belong to the location-scale family. The following density functions do not belong to the location-scale family:

$$f(x; \theta) = \begin{cases} \theta\, x^{\theta - 1} & \text{if } 0 < x < 1 \\ 0 & \text{otherwise,} \end{cases}$$

$$f(x; \theta) = \begin{cases} \frac{1}{\theta} & \text{if } 0 < x < \theta \\ 0 & \text{otherwise.} \end{cases}$$

We will construct interval estimators for the parameters in these density functions. The same idea for finding the interval estimators can be used to find interval estimators for parameters of density functions that belong to the location-scale family such as

$$f(x; \theta) = \begin{cases} \frac{1}{\theta}\, e^{-\frac{x}{\theta}} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise.} \end{cases}$$

To find the pivotal quantities for the above mentioned distributions and others we need the following three results. The first result is Theorem 6.2, while the proof of the second result is easy and we leave it to the reader.

Theorem 17.1. Let $F(x; \theta)$ be the cumulative distribution function of a continuous random variable $X$. Then

$$F(X; \theta) \sim UNIF(0, 1).$$

Theorem 17.2. If $X \sim UNIF(0, 1)$, then

$$-\ln X \sim EXP(1).$$

Theorem 17.3. Let $X_1, X_2, ..., X_n$ be a random sample from a distribution with density function

$$f(x; \theta) = \begin{cases} \frac{1}{\theta}\, e^{-\frac{x}{\theta}} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $\theta > 0$ is a parameter. Then the random variable

$$\frac{2}{\theta} \sum_{i=1}^{n} X_i \sim \chi^2(2n).$$

Proof: Let $Y = \frac{2}{\theta} \sum_{i=1}^{n} X_i$. Now we show that the sampling distribution of $Y$ is chi-square with $2n$ degrees of freedom. We use the moment generating method to show this. The moment generating function of $Y$ is given by

$$\begin{aligned}
M_Y(t) &= M_{\frac{2}{\theta} \sum_{i=1}^{n} X_i}(t) \\
&= \prod_{i=1}^{n} M_{X_i}\left(\frac{2}{\theta}\, t\right) \\
&= \prod_{i=1}^{n} \left(1 - \theta\, \frac{2}{\theta}\, t\right)^{-1} \\
&= (1 - 2t)^{-n} \\
&= (1 - 2t)^{-\frac{2n}{2}}.
\end{aligned}$$

Since $(1 - 2t)^{-\frac{2n}{2}}$ corresponds to the moment generating function of a chi-square random variable with $2n$ degrees of freedom, we conclude that

$$\frac{2}{\theta} \sum_{i=1}^{n} X_i \sim \chi^2(2n).$$

Theorem 17.4. Let $X_1, X_2, ..., X_n$ be a random sample from a distribution with density function

$$f(x; \theta) = \begin{cases} \theta\, x^{\theta - 1} & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise,} \end{cases}$$

where $\theta > 0$ is a parameter. Then the random variable $-2\theta \sum_{i=1}^{n} \ln X_i$ has a chi-square distribution with $2n$ degrees of freedom.

Proof: We are given that

$$X_i \sim \theta\, x^{\theta - 1}, \qquad 0 < x < 1.$$

Hence, the cdf of $f$ is

$$F(x; \theta) = \int_0^x \theta\, u^{\theta - 1}\, du = x^{\theta}.$$

Thus by Theorem 17.1, each

$$F(X_i; \theta) \sim UNIF(0, 1),$$

that is

$$X_i^{\theta} \sim UNIF(0, 1).$$

By Theorem 17.2, each

$$-\ln X_i^{\theta} \sim EXP(1),$$

that is

$$-\theta \ln X_i \sim EXP(1).$$

By Theorem 17.3 (with $\theta = 1$), we obtain

$$-2\theta \sum_{i=1}^{n} \ln X_i \sim \chi^2(2n).$$

Hence, the sampling distribution of $-2\theta \sum_{i=1}^{n} \ln X_i$ is chi-square with $2n$ degrees of freedom.
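Theorem 17.4 can be checked informally by simulation. The sketch below samples from the density $\theta x^{\theta - 1}$ by inverting $F(x; \theta) = x^{\theta}$, computes $-2\theta \sum \ln X_i$ many times, and compares the sample mean and variance with the values $2n$ and $4n$ of a $\chi^2(2n)$ variable. The function name, the seeds, and the chosen values of $\theta$, $n$ and the number of repetitions are assumptions for illustration only.

```python
import math
import random


def pivot_draws(theta, n, reps, seed):
    """Simulate Q = -2 * theta * sum(ln X_i) for the density theta * x**(theta - 1) on (0, 1)."""
    rng = random.Random(seed)
    draws = []
    for _ in range(reps):
        # Inverse-CDF sampling: F(x) = x**theta, so X = U**(1/theta), U uniform on (0, 1].
        xs = [(1.0 - rng.random()) ** (1.0 / theta) for _ in range(n)]
        draws.append(-2.0 * theta * sum(math.log(x) for x in xs))
    return draws


n = 5
for theta, seed in ((0.5, 1), (4.0, 2)):  # illustrative parameter values
    q = pivot_draws(theta, n, reps=20000, seed=seed)
    mean = sum(q) / len(q)
    var = sum((v - mean) ** 2 for v in q) / (len(q) - 1)
    print(f"theta={theta}: mean={mean:.2f} (expect about {2 * n}), "
          f"var={var:.2f} (expect about {4 * n})")
```

The summaries should be roughly the same for both values of $\theta$, which is the sense in which the distribution of the pivot does not depend on the parameter.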

The following theorem, whose proof follows from Theorems 17.1, 17.2 and 17.3, is the key to finding pivotal quantities of many distributions that do not belong to the location-scale family. Further, this theorem can also be used for finding the pivotal quantities for parameters of some distributions that belong to the location-scale family.

Theorem 17.5. Let $X_1, X_2, ..., X_n$ be a random sample from a continuous population $X$ with a distribution function $F(x; \theta)$. If $F(x; \theta)$ is monotone in $\theta$, then the statistic $Q = -2 \sum_{i=1}^{n} \ln F(X_i; \theta)$ is a pivotal quantity and has a chi-square distribution with $2n$ degrees of freedom (that is, $Q \sim \chi^2(2n)$).

It should be noted that the condition that $F(x; \theta)$ is monotone in $\theta$ is needed to ensure an interval. Otherwise we may get a confidence region instead of a confidence interval. Further note that the statistic $-2 \sum_{i=1}^{n} \ln\left(1 - F(X_i; \theta)\right)$ also has a chi-square distribution with $2n$ degrees of freedom, that is

$$-2 \sum_{i=1}^{n} \ln\left(1 - F(X_i; \theta)\right) \sim \chi^2(2n).$$


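Example 17.11. If $X_1, X_2, ..., X_n$ is a random sample from a population with density

$$f(x; \theta) = \begin{cases} \theta\, x^{\theta - 1} & \text{if } 0 < x < 1 \\ 0 & \text{otherwise,} \end{cases}$$

where $\theta > 0$ is an unknown parameter, what is a $100(1 - \alpha)\%$ confidence interval for $\theta$?

Answer: To construct a confidence interval for $\theta$, we need a pivotal quantity. That is, we need a random variable which is a function of the sample and the parameter, and whose probability distribution is known but does not involve $\theta$. We use the random variable

$$Q = -2\theta \sum_{i=1}^{n} \ln X_i \sim \chi^2(2n)$$

as the pivotal quantity. The $100(1 - \alpha)\%$ confidence interval for $\theta$ can be constructed from

$$\begin{aligned}
1 - \alpha &= P\left(\chi^2_{\frac{\alpha}{2}}(2n) \le Q \le \chi^2_{1 - \frac{\alpha}{2}}(2n)\right) \\
&= P\left(\chi^2_{\frac{\alpha}{2}}(2n) \le -2\theta \sum_{i=1}^{n} \ln X_i \le \chi^2_{1 - \frac{\alpha}{2}}(2n)\right) \\
&= P\left(\frac{\chi^2_{\frac{\alpha}{2}}(2n)}{-2 \sum_{i=1}^{n} \ln X_i} \le \theta \le \frac{\chi^2_{1 - \frac{\alpha}{2}}(2n)}{-2 \sum_{i=1}^{n} \ln X_i}\right).
\end{aligned}$$

Hence, the $100(1 - \alpha)\%$ confidence interval for $\theta$ is given by

$$\left[\frac{\chi^2_{\frac{\alpha}{2}}(2n)}{-2 \sum_{i=1}^{n} \ln X_i},\ \frac{\chi^2_{1 - \frac{\alpha}{2}}(2n)}{-2 \sum_{i=1}^{n} \ln X_i}\right].$$

Here $\chi^2_{1 - \frac{\alpha}{2}}(2n)$ denotes the $\left(1 - \frac{\alpha}{2}\right)$-quantile of a chi-square random variable $Y$, that is

$$P\left(Y \le \chi^2_{1 - \frac{\alpha}{2}}(2n)\right) = 1 - \frac{\alpha}{2},$$

and $\chi^2_{\frac{\alpha}{2}}(2n)$ similarly denotes the $\frac{\alpha}{2}$-quantile of $Y$, that is

$$P\left(Y \le \chi^2_{\frac{\alpha}{2}}(2n)\right) = \frac{\alpha}{2}$$

for $\alpha \le 0.5$ (see figure below).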

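Example 17.12. If $X_1, X_2, ..., X_n$ is a random sample from a distribution with density function

$$f(x; \theta) = \begin{cases} \frac{1}{\theta} & \text{if } 0 < x < \theta \\ 0 & \text{otherwise,} \end{cases}$$

where $\theta > 0$ is a parameter, then what is the $100(1 - \alpha)\%$ confidence interval for $\theta$?

Answer: The cumulative distribution function of $f(x; \theta)$ is

$$F(x; \theta) = \begin{cases} 0 & \text{if } x \le 0 \\ \frac{x}{\theta} & \text{if } 0 < x < \theta \\ 1 & \text{if } x \ge \theta. \end{cases}$$

Since

$$\begin{aligned}
-2 \sum_{i=1}^{n} \ln F(X_i; \theta) &= -2 \sum_{i=1}^{n} \ln\left(\frac{X_i}{\theta}\right) \\
&= 2n \ln \theta - 2 \sum_{i=1}^{n} \ln X_i,
\end{aligned}$$

by Theorem 17.5, the quantity $2n \ln \theta - 2 \sum_{i=1}^{n} \ln X_i \sim \chi^2(2n)$. Since $2n \ln \theta - 2 \sum_{i=1}^{n} \ln X_i$ is a function of the sample and the parameter and its distribution is independent of $\theta$, it is a pivot for $\theta$. Hence, we take

$$Q(X_1, X_2, ..., X_n, \theta) = 2n \ln \theta - 2 \sum_{i=1}^{n} \ln X_i.$$

The $100(1 - \alpha)\%$ confidence interval for $\theta$ can be constructed from

$$\begin{aligned}
1 - \alpha &= P\left(\chi^2_{\frac{\alpha}{2}}(2n) \le Q \le \chi^2_{1 - \frac{\alpha}{2}}(2n)\right) \\
&= P\left(\chi^2_{\frac{\alpha}{2}}(2n) \le 2n \ln \theta - 2 \sum_{i=1}^{n} \ln X_i \le \chi^2_{1 - \frac{\alpha}{2}}(2n)\right) \\
&= P\left(\chi^2_{\frac{\alpha}{2}}(2n) + 2 \sum_{i=1}^{n} \ln X_i \le 2n \ln \theta \le \chi^2_{1 - \frac{\alpha}{2}}(2n) + 2 \sum_{i=1}^{n} \ln X_i\right) \\
&= P\left(e^{\frac{1}{2n}\left\{\chi^2_{\frac{\alpha}{2}}(2n) + 2 \sum_{i=1}^{n} \ln X_i\right\}} \le \theta \le e^{\frac{1}{2n}\left\{\chi^2_{1 - \frac{\alpha}{2}}(2n) + 2 \sum_{i=1}^{n} \ln X_i\right\}}\right).
\end{aligned}$$

Hence, the $100(1 - \alpha)\%$ confidence interval for $\theta$ is given by

$$\left[e^{\frac{1}{2n}\left\{\chi^2_{\frac{\alpha}{2}}(2n) + 2 \sum_{i=1}^{n} \ln X_i\right\}},\ e^{\frac{1}{2n}\left\{\chi^2_{1 - \frac{\alpha}{2}}(2n) + 2 \sum_{i=1}^{n} \ln X_i\right\}}\right].$$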

The density function of the following example belongs to the scale family.

However, one can use Theorem 17.5 to ﬁnd a pivot for the parameter and

determine the interval estimators for the parameter.

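Example 17.13. If $X_1, X_2, ..., X_n$ is a random sample from a distribution with density function

$$f(x; \theta) = \begin{cases} \frac{1}{\theta}\, e^{-\frac{x}{\theta}} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $\theta > 0$ is a parameter, then what is the $100(1 - \alpha)\%$ confidence interval for $\theta$?

Answer: The cumulative distribution function $F(x; \theta)$ of the density function

$$f(x; \theta) = \begin{cases} \frac{1}{\theta}\, e^{-\frac{x}{\theta}} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise} \end{cases}$$

is given by

$$F(x; \theta) = 1 - e^{-\frac{x}{\theta}}.$$

Hence

$$-2 \sum_{i=1}^{n} \ln\left(1 - F(X_i; \theta)\right) = \frac{2}{\theta} \sum_{i=1}^{n} X_i.$$

Thus

$$\frac{2}{\theta} \sum_{i=1}^{n} X_i \sim \chi^2(2n).$$

We take $Q = \frac{2}{\theta} \sum_{i=1}^{n} X_i$ as the pivotal quantity. The $100(1 - \alpha)\%$ confidence interval for $\theta$ can be constructed using

$$\begin{aligned}
1 - \alpha &= P\left(\chi^2_{\frac{\alpha}{2}}(2n) \le Q \le \chi^2_{1 - \frac{\alpha}{2}}(2n)\right) \\
&= P\left(\chi^2_{\frac{\alpha}{2}}(2n) \le \frac{2}{\theta} \sum_{i=1}^{n} X_i \le \chi^2_{1 - \frac{\alpha}{2}}(2n)\right) \\
&= P\left(\frac{2 \sum_{i=1}^{n} X_i}{\chi^2_{1 - \frac{\alpha}{2}}(2n)} \le \theta \le \frac{2 \sum_{i=1}^{n} X_i}{\chi^2_{\frac{\alpha}{2}}(2n)}\right).
\end{aligned}$$

Hence, the $100(1 - \alpha)\%$ confidence interval for $\theta$ is given by

$$\left[\frac{2 \sum_{i=1}^{n} X_i}{\chi^2_{1 - \frac{\alpha}{2}}(2n)},\ \frac{2 \sum_{i=1}^{n} X_i}{\chi^2_{\frac{\alpha}{2}}(2n)}\right].$$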

In this section, we have seen that a $100(1 - \alpha)\%$ confidence interval for the parameter $\theta$ can be constructed by taking the pivotal quantity $Q$ to be either

$$Q = -2 \sum_{i=1}^{n} \ln F(X_i; \theta)$$

or

$$Q = -2 \sum_{i=1}^{n} \ln\left(1 - F(X_i; \theta)\right).$$

In either case, the distribution of $Q$ is chi-square with $2n$ degrees of freedom, that is $Q \sim \chi^2(2n)$. Since the chi-square distribution is not symmetric, the confidence intervals constructed in this section do not have the shortest length. In order to have a shortest confidence interval one has to solve the following minimization problem:

$$\left.\begin{array}{l}
\text{Minimize } L(a, b) \\
\text{subject to the condition } \displaystyle\int_a^b f(u)\, du = 1 - \alpha,
\end{array}\right\} \qquad \text{(MP)}$$

where $L(a, b)$ is the length of the interval obtained from $a \le Q \le b$ and $f$ is the probability density function of $Q \sim \chi^2(2n)$, that is

$$f(x) = \frac{1}{\Gamma(n)\, 2^{n}}\, x^{n - 1}\, e^{-\frac{x}{2}}.$$

In the case of Example 17.13, the minimization process leads to the following system of nonlinear equations

$$\left.\begin{array}{l}
\displaystyle\int_a^b f(u)\, du = 1 - \alpha \\[2ex]
\ln\left(\dfrac{a}{b}\right) = \dfrac{a - b}{2(n + 1)}.
\end{array}\right\} \qquad \text{(NE)}$$

If $a_o$ and $b_o$ are solutions of (NE), then the shortest confidence interval for $\theta$ is given by

$$\left[\frac{2 \sum_{i=1}^{n} X_i}{b_o},\ \frac{2 \sum_{i=1}^{n} X_i}{a_o}\right].$$

17.6. Approximate Conﬁdence Interval for Parameter with MLE

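In this section, we discuss how to construct an approximate $100(1 - \alpha)\%$ confidence interval for a population parameter $\theta$ using its maximum likelihood estimator $\hat{\theta}$. Let $X_1, X_2, ..., X_n$ be a random sample from a population $X$ with density $f(x; \theta)$. Let $\hat{\theta}$ be the maximum likelihood estimator of $\theta$. If the sample size $n$ is large, then using the asymptotic property of the maximum likelihood estimator, we have

$$\frac{\hat{\theta} - E\left(\hat{\theta}\right)}{\sqrt{Var\left(\hat{\theta}\right)}} \sim N(0, 1) \quad \text{as } n \to \infty,$$

where $Var\left(\hat{\theta}\right)$ denotes the variance of the estimator $\hat{\theta}$. Since the maximum likelihood estimator of $\theta$ is asymptotically unbiased (that is, approximately unbiased for large $n$), we get

$$\frac{\hat{\theta} - \theta}{\sqrt{Var\left(\hat{\theta}\right)}} \sim N(0, 1) \quad \text{as } n \to \infty.$$

The variance $Var\left(\hat{\theta}\right)$ can be computed directly whenever possible or using the Cramér-Rao lower bound

$$Var\left(\hat{\theta}\right) \ge \frac{-1}{E\left[\dfrac{d^2 \ln L(\theta)}{d\theta^2}\right]}.$$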

Probability and Mathematical Statistics 525

Now using Q = 

✓✓

V ar  

✓as the pivotal quantity, we construct an approxi-

mate (1 ↵ )100% conﬁdence interval for ✓as

1↵ =P  z↵

2Qz ↵

2

=P



z ↵

2

✓ ✓

V ar  

✓ z↵

2



.

If V ar  

✓ is free of ✓, then have

1↵ =P

✓z↵

2V ar  

✓  ✓

✓+z↵

2V ar  

✓ .

Thus 100(1 ↵ )% approximate conﬁdence interval for ✓is



✓z↵

2V ar  

✓ ,

✓+z↵

2V ar  

✓

provided V ar  

✓ is free of ✓.

Remark 17.5. In many situations V ar  

✓ is not free of the parameter ✓.

In those situations we still use the above form of the conﬁdence interval by

replacing the parameter ✓ by 

✓in the expression of V ar  

✓ .

Next, we give some examples to illustrate this method.

Example 17.14. Let $X_1, X_2, ..., X_n$ be a random sample from a population $X$ with probability density function
$$f(x;p) = \begin{cases} p^x (1-p)^{(1-x)} & \text{if } x = 0, 1 \\ 0 & \text{otherwise.} \end{cases}$$
What is a $100(1-\alpha)\%$ approximate confidence interval for the parameter $p$?

Answer: The likelihood function of the sample is given by
$$L(p) = \prod_{i=1}^n p^{x_i} (1-p)^{(1-x_i)}.$$
Taking the logarithm of the likelihood function, we get
$$\ln L(p) = \sum_{i=1}^n \left[x_i \ln p + (1 - x_i)\ln(1-p)\right].$$
Differentiating the above expression, we get
$$\frac{d \ln L(p)}{dp} = \frac{1}{p}\sum_{i=1}^n x_i - \frac{1}{1-p}\sum_{i=1}^n (1 - x_i).$$
Setting this equal to zero and solving for $p$, we get
$$\frac{n\bar{x}}{p} - \frac{n - n\bar{x}}{1-p} = 0,$$
that is
$$(1-p)\, n\bar{x} = p\,(n - n\bar{x}),$$
which is
$$n\bar{x} - p\, n\bar{x} = p\, n - p\, n\bar{x}.$$
Hence
$$p = \bar{x}.$$
Therefore, the maximum likelihood estimator of $p$ is given by
$$\hat{p} = \bar{X}.$$
The variance of $\bar{X}$ is
$$\operatorname{Var}\left(\bar{X}\right) = \frac{\sigma^2}{n}.$$
Since $X \sim Ber(p)$, the variance $\sigma^2 = p(1-p)$, and
$$\operatorname{Var}(\hat{p}) = \operatorname{Var}\left(\bar{X}\right) = \frac{p(1-p)}{n}.$$
Since $\operatorname{Var}(\hat{p})$ is not free of the parameter $p$, we replace $p$ by $\hat{p}$ in the expression of $\operatorname{Var}(\hat{p})$ to get
$$\operatorname{Var}(\hat{p}) \simeq \frac{\hat{p}(1-\hat{p})}{n}.$$
The $100(1-\alpha)\%$ approximate confidence interval for the parameter $p$ is given by
$$\left[\hat{p} - z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}},\ \hat{p} + z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}\right]$$
which is
$$\left[\bar{X} - z_{\frac{\alpha}{2}}\sqrt{\frac{\bar{X}(1-\bar{X})}{n}},\ \bar{X} + z_{\frac{\alpha}{2}}\sqrt{\frac{\bar{X}(1-\bar{X})}{n}}\right].$$
The above confidence interval is a $100(1-\alpha)\%$ approximate confidence interval for a proportion.

Example 17.15. A poll was taken of university students before a student election. Of 78 students contacted, 33 said they would vote for Mr. Smith. The population may be taken as 2200. Obtain 95% confidence limits for the proportion of voters in the population in favor of Mr. Smith.

Answer: The sample proportion $\hat{p}$ is given by
$$\hat{p} = \frac{33}{78} = 0.4231.$$
Hence
$$\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} = \sqrt{\frac{(0.4231)(0.5769)}{78}} = 0.0559.$$
The upper 2.5 percent point of the standard normal distribution is given by
$$z_{0.025} = 1.96 \quad \text{(from table)}.$$
Hence, the lower confidence limit of the 95% confidence interval is
$$\begin{aligned}
\hat{p} - z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} &= 0.4231 - (1.96)(0.0559) \\
&= 0.4231 - 0.1096 \\
&= 0.3135.
\end{aligned}$$
Similarly, the upper confidence limit of the 95% confidence interval is
$$\begin{aligned}
\hat{p} + z_{\frac{\alpha}{2}}\sqrt{\frac{\hat{p}(1-\hat{p})}{n}} &= 0.4231 + (1.96)(0.0559) \\
&= 0.4231 + 0.1096 \\
&= 0.5327.
\end{aligned}$$
Hence, the 95% confidence limits for the proportion of voters in the population in favor of Smith are 0.3135 and 0.5327.
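The arithmetic of Example 17.15 can be checked with a short script. This is a sketch; the function name `proportion_interval` is a hypothetical helper, and because it carries full precision instead of the rounded intermediate values used above, the last printed digit can differ slightly from the hand computation.

```python
import math
from statistics import NormalDist


def proportion_interval(successes, n, alpha=0.05):
    """Approximate interval p_hat -/+ z_{alpha/2} * sqrt(p_hat (1 - p_hat) / n)."""
    p_hat = successes / n
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    std_err = math.sqrt(p_hat * (1.0 - p_hat) / n)
    return p_hat - z * std_err, p_hat + z * std_err


lower, upper = proportion_interval(33, 78)
print(round(lower, 4), round(upper, 4))
```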

Remark 17.6. In Example 17.15, the 95% approximate confidence interval for the parameter $p$ was $[0.3135, 0.5327]$. This confidence interval can be improved to a shorter interval by means of a quadratic inequality. Now we explain how the interval can be improved. First note that in Example 17.14, which we are using for Example 17.15, the approximate value of the variance of the ML estimator $\hat{p}$ was obtained to be $\frac{p(1-p)}{n}$. However, this is the exact variance of $\hat{p}$. Now the pivotal quantity $Q = \dfrac{\hat{p} - p}{\sqrt{\operatorname{Var}(\hat{p})}}$ becomes
$$Q = \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}}.$$
Using this pivotal quantity, we can construct a 95% confidence interval as
$$\begin{aligned}
0.95 &= P\left(-z_{0.025} \le \frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}} \le z_{0.025}\right) \\
&= P\left(\left|\frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}}\right| \le 1.96\right).
\end{aligned}$$
Using $\hat{p} = 0.4231$ and $n = 78$, we solve the inequality
$$\left|\frac{\hat{p} - p}{\sqrt{\frac{p(1-p)}{n}}}\right| \le 1.96$$
which is
$$\left|\frac{0.4231 - p}{\sqrt{\frac{p(1-p)}{78}}}\right| \le 1.96.$$
Squaring both sides of the above inequality and simplifying, we get
$$78\,(0.4231 - p)^2 \le (1.96)^2\,(p - p^2).$$
The last inequality is equivalent to
$$13.96306158 - 69.84520000\, p + 81.84160000\, p^2 \le 0.$$
Solving this quadratic inequality, we obtain $[0.3196, 0.5338]$ as a 95% confidence interval for $p$. This interval is an improvement since its length is 0.2142 whereas the length of the interval $[0.3135, 0.5327]$ is 0.2192.
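The quadratic inequality of Remark 17.6 can be solved for general $\hat{p}$, $n$ and $z$. The sketch below expands $n(\hat{p} - p)^2 \le z^2 (p - p^2)$ into $A p^2 + B p + C \le 0$ and returns the two roots; the function name `quadratic_interval` is a hypothetical helper.

```python
import math


def quadratic_interval(p_hat, n, z):
    """Roots of n (p_hat - p)^2 = z^2 (p - p^2), written as A p^2 + B p + C = 0."""
    a_coef = n + z * z
    b_coef = -(2.0 * n * p_hat + z * z)
    c_coef = n * p_hat * p_hat
    disc = math.sqrt(b_coef * b_coef - 4.0 * a_coef * c_coef)
    return (-b_coef - disc) / (2.0 * a_coef), (-b_coef + disc) / (2.0 * a_coef)


# Values used in Remark 17.6.
lower, upper = quadratic_interval(0.4231, 78, 1.96)
print(round(lower, 4), round(upper, 4), round(upper - lower, 4))
```

With these inputs the coefficients are 81.8416, -69.8452 and 13.96306158, which are the ones appearing in the inequality above.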

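Example 17.16. If $X_1, X_2, ..., X_n$ is a random sample from a population with density
$$f(x;\theta) = \begin{cases} \theta\, x^{\theta-1} & \text{if } 0 < x < 1 \\ 0 & \text{otherwise,} \end{cases}$$
where $\theta > 0$ is an unknown parameter, what is a $100(1-\alpha)\%$ approximate confidence interval for $\theta$ if the sample size is large?

Answer: The likelihood function $L(\theta)$ of the sample is
$$L(\theta) = \prod_{i=1}^n \theta\, x_i^{\theta-1}.$$
Hence
$$\ln L(\theta) = n \ln\theta + (\theta - 1)\sum_{i=1}^n \ln x_i.$$
The first derivative of the logarithm of the likelihood function is
$$\frac{d}{d\theta}\ln L(\theta) = \frac{n}{\theta} + \sum_{i=1}^n \ln x_i.$$
Setting this derivative to zero and solving for $\theta$, we obtain
$$\theta = -\frac{n}{\sum_{i=1}^n \ln x_i}.$$
Hence, the maximum likelihood estimator of $\theta$ is given by
$$\hat{\theta} = -\frac{n}{\sum_{i=1}^n \ln X_i}.$$
Finding the variance of this estimator is difficult. We approximate its variance by computing the Cramér-Rao bound for this estimator. The second derivative of the logarithm of the likelihood function is given by
$$\frac{d^2}{d\theta^2}\ln L(\theta) = \frac{d}{d\theta}\left(\frac{n}{\theta} + \sum_{i=1}^n \ln x_i\right) = -\frac{n}{\theta^2}.$$
Hence
$$E\left(\frac{d^2}{d\theta^2}\ln L(\theta)\right) = -\frac{n}{\theta^2}.$$
Therefore
$$\operatorname{Var}\left(\hat{\theta}\right) \ge \frac{\theta^2}{n}.$$
Thus we take
$$\operatorname{Var}\left(\hat{\theta}\right) \simeq \frac{\theta^2}{n}.$$
Since $\operatorname{Var}(\hat{\theta})$ has $\theta$ in its expression, we replace the unknown $\theta$ by its estimate $\hat{\theta}$ so that
$$\operatorname{Var}\left(\hat{\theta}\right) \simeq \frac{\hat{\theta}^2}{n}.$$
The $100(1-\alpha)\%$ approximate confidence interval for $\theta$ is given by
$$\left[\hat{\theta} - z_{\frac{\alpha}{2}}\frac{\hat{\theta}}{\sqrt{n}},\ \hat{\theta} + z_{\frac{\alpha}{2}}\frac{\hat{\theta}}{\sqrt{n}}\right],$$
which is
$$\left[-\frac{n}{\sum_{i=1}^n \ln X_i} + z_{\frac{\alpha}{2}}\frac{\sqrt{n}}{\sum_{i=1}^n \ln X_i},\ -\frac{n}{\sum_{i=1}^n \ln X_i} - z_{\frac{\alpha}{2}}\frac{\sqrt{n}}{\sum_{i=1}^n \ln X_i}\right].$$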

Remark 17.7. In Section 17.2, we derived the exact confidence interval for $\theta$ when the population distribution is exponential. The exact $100(1-\alpha)\%$ confidence interval for $\theta$ was given by
$$\left[-\frac{\chi^2_{\frac{\alpha}{2}}(2n)}{2\sum_{i=1}^n \ln X_i},\ -\frac{\chi^2_{1-\frac{\alpha}{2}}(2n)}{2\sum_{i=1}^n \ln X_i}\right].$$
Note that this exact confidence interval is not the shortest confidence interval for the parameter $\theta$.

Example 17.17. If $X_1, X_2, ..., X_{49}$ is a random sample from a population with density
$$f(x;\theta) = \begin{cases} \theta\, x^{\theta-1} & \text{if } 0 < x < 1 \\ 0 & \text{otherwise,} \end{cases}$$
where $\theta > 0$ is an unknown parameter, what are 90% approximate and exact confidence intervals for $\theta$ if $\sum_{i=1}^{49} \ln X_i = -0.7567$?

Answer: We are given the following:
$$n = 49, \qquad \sum_{i=1}^{49} \ln X_i = -0.7567, \qquad 1 - \alpha = 0.90.$$
Hence, we get
$$z_{0.05} = 1.64, \qquad -\frac{n}{\sum_{i=1}^n \ln X_i} = \frac{49}{0.7567} = 64.75$$
and
$$-\frac{\sqrt{n}}{\sum_{i=1}^n \ln X_i} = \frac{7}{0.7567} = 9.25.$$
Hence, the approximate confidence interval is given by
$$[64.75 - (1.64)(9.25),\ 64.75 + (1.64)(9.25)]$$
that is $[49.58, 79.92]$.

Next, we compute the exact 90% confidence interval for $\theta$ using the formula
$$\left[-\frac{\chi^2_{\frac{\alpha}{2}}(2n)}{2\sum_{i=1}^n \ln X_i},\ -\frac{\chi^2_{1-\frac{\alpha}{2}}(2n)}{2\sum_{i=1}^n \ln X_i}\right].$$
From the chi-square table, we get
$$\chi^2_{0.05}(98) = 77.93 \quad \text{and} \quad \chi^2_{0.95}(98) = 124.34.$$
Hence, the exact 90% confidence interval is
$$\left[\frac{77.93}{(2)(0.7567)},\ \frac{124.34}{(2)(0.7567)}\right]$$
that is $[51.49, 82.16]$.
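The two intervals of Example 17.17 can be reproduced as follows. This is a sketch with hypothetical function names; the normal and chi-square percentiles are the table values quoted in the example, passed in as plain numbers rather than computed.

```python
import math


def approx_interval(n, sum_log, z):
    """theta_hat -/+ z * theta_hat / sqrt(n) with theta_hat = -n / sum(ln x_i)."""
    theta_hat = -n / sum_log
    half_width = z * theta_hat / math.sqrt(n)
    return theta_hat - half_width, theta_hat + half_width


def exact_interval(sum_log, chi2_lower, chi2_upper):
    """[chi2_lower, chi2_upper] / (-2 * sum(ln x_i)) for the chi-square(2n) pivot."""
    denom = -2.0 * sum_log
    return chi2_lower / denom, chi2_upper / denom


approx = approx_interval(49, -0.7567, 1.64)
exact = exact_interval(-0.7567, 77.93, 124.34)
print(approx, exact)
```

Small differences in the second decimal place relative to the hand computation come from rounding 64.75 and 9.25 before multiplying.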

Example 17.18. If $X_1, X_2, ..., X_n$ is a random sample from a population with density
$$f(x;\theta) = \begin{cases} (1-\theta)\,\theta^x & \text{if } x = 0, 1, 2, ..., \infty \\ 0 & \text{otherwise,} \end{cases}$$
where $0 < \theta < 1$ is an unknown parameter, what is a $100(1-\alpha)\%$ approximate confidence interval for $\theta$ if the sample size is large?

Answer: The logarithm of the likelihood function of the sample is
$$\ln L(\theta) = \ln\theta \sum_{i=1}^n x_i + n \ln(1-\theta).$$
Differentiating, we obtain
$$\frac{d}{d\theta}\ln L(\theta) = \frac{\sum_{i=1}^n x_i}{\theta} - \frac{n}{1-\theta}.$$
Equating this derivative to zero and solving for $\theta$, we get $\theta = \frac{\bar{x}}{1+\bar{x}}$. Thus, the maximum likelihood estimator of $\theta$ is given by
$$\hat{\theta} = \frac{\bar{X}}{1 + \bar{X}}.$$

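Next, we find the variance of this estimator using the Cramér-Rao lower bound. For this, we need the second derivative of $\ln L(\theta)$. Hence
$$\frac{d^2}{d\theta^2}\ln L(\theta) = -\frac{n\bar{x}}{\theta^2} - \frac{n}{(1-\theta)^2}.$$
Therefore
$$\begin{aligned}
E\left(\frac{d^2}{d\theta^2}\ln L(\theta)\right) &= E\left(-\frac{n\bar{X}}{\theta^2} - \frac{n}{(1-\theta)^2}\right) \\
&= -\frac{n}{\theta^2}\,E\left(\bar{X}\right) - \frac{n}{(1-\theta)^2} \\
&= -\frac{n}{\theta^2}\cdot\frac{\theta}{1-\theta} - \frac{n}{(1-\theta)^2} \qquad \left(\text{since each } X_i + 1 \sim GEO(1-\theta), \text{ so that } E(X_i) = \tfrac{\theta}{1-\theta}\right) \\
&= -\frac{n}{\theta(1-\theta)}\left[1 + \frac{\theta}{1-\theta}\right] \\
&= -\frac{n}{\theta\,(1-\theta)^2}.
\end{aligned}$$
Therefore
$$\operatorname{Var}\left(\hat{\theta}\right) \simeq \frac{\hat{\theta}\left(1-\hat{\theta}\right)^2}{n}.$$
The $100(1-\alpha)\%$ approximate confidence interval for $\theta$ is given by
$$\left[\hat{\theta} - z_{\frac{\alpha}{2}}\left(1-\hat{\theta}\right)\sqrt{\frac{\hat{\theta}}{n}},\ \hat{\theta} + z_{\frac{\alpha}{2}}\left(1-\hat{\theta}\right)\sqrt{\frac{\hat{\theta}}{n}}\right],$$
where
$$\hat{\theta} = \frac{\bar{X}}{1 + \bar{X}}.$$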
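A minimal sketch for Example 17.18 follows. As a simplifying assumption it does not take the expectation of the second derivative; instead it evaluates $-\frac{d^2}{d\theta^2}\ln L(\theta) = \frac{n\bar{x}}{\theta^2} + \frac{n}{(1-\theta)^2}$ at $\hat{\theta}$ (the observed information) and uses its reciprocal as the variance. The function name and the sample are hypothetical.

```python
import math
from statistics import NormalDist, fmean


def geometric_theta_interval(sample, alpha=0.05):
    """Approximate interval for theta in f(x; theta) = (1 - theta) theta^x, x = 0, 1, 2, ..."""
    n = len(sample)
    x_bar = fmean(sample)
    theta_hat = x_bar / (1.0 + x_bar)  # maximum likelihood estimate
    # Observed information: minus the second derivative of ln L at theta_hat.
    info = n * x_bar / theta_hat**2 + n / (1.0 - theta_hat) ** 2
    z = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    half_width = z / math.sqrt(info)
    return theta_hat, theta_hat - half_width, theta_hat + half_width


# Hypothetical data with sample mean 1, so theta_hat = 1/2.
theta_hat, lower, upper = geometric_theta_interval([0, 1, 2, 1])
print(theta_hat, lower, upper)
```

The sample must contain at least one nonzero observation, otherwise $\hat{\theta} = 0$ and the information is undefined.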

17.7. The Statistical or General Method

Now we briefly describe the statistical or general method for constructing a confidence interval. Let $X_1, X_2, ..., X_n$ be a random sample from a population with density $f(x;\theta)$, where $\theta$ is an unknown parameter. We want to determine an interval estimator for $\theta$. Let $T(X_1, X_2, ..., X_n)$ be some statistic having the density function $g(t;\theta)$. Let $p_1$ and $p_2$ be two fixed positive numbers in the open interval $(0,1)$ with $p_1 + p_2 < 1$. Now we define two functions $h_1(\theta)$ and $h_2(\theta)$ as follows:
$$p_1 = \int_{-\infty}^{h_1(\theta)} g(t;\theta)\,dt \quad \text{and} \quad p_2 = \int_{h_2(\theta)}^{\infty} g(t;\theta)\,dt$$
such that
$$P\left(h_1(\theta) < T(X_1, X_2, ..., X_n) < h_2(\theta)\right) = 1 - p_1 - p_2.$$
If $h_1(\theta)$ and $h_2(\theta)$ are monotone functions in $\theta$, then we can find a confidence interval
$$P(u_1 < \theta < u_2) = 1 - p_1 - p_2$$
where $u_1 = u_1(t)$ and $u_2 = u_2(t)$. The statistic $T(X_1, X_2, ..., X_n)$ may be a sufficient statistic, or a maximum likelihood estimator. If we minimize the length $u_2 - u_1$ of the confidence interval, subject to the condition $1 - p_1 - p_2 = 1 - \alpha$ for $0 < \alpha < 1$, we obtain the shortest confidence interval based on the statistic $T$.

17.8. Criteria for Evaluating Confidence Intervals

In many situations, one can have more than one confidence interval for the same parameter $\theta$. Thus it is necessary to have a set of criteria to decide whether a particular interval is better than the other intervals. Some well known criteria are: (1) Shortest Length and (2) Unbiasedness. Now we only briefly describe these criteria.

The criterion of shortest length demands that a good $100(1-\alpha)\%$ confidence interval $[L, U]$ of a parameter $\theta$ should have the shortest length $\ell = U - L$. In the pivotal quantity method one finds a pivot $Q$ for a parameter $\theta$ and then, converting the probability statement
$$P(a < Q < b) = 1 - \alpha$$
to
$$P(L < \theta < U) = 1 - \alpha,$$
obtains a $100(1-\alpha)\%$ confidence interval for $\theta$. If the constants $a$ and $b$ can be found such that the difference $U - L$ depending on the sample $X_1, X_2, ..., X_n$ is minimum for every realization of the sample, then the random interval $[L, U]$ is said to be the shortest confidence interval based on $Q$.

If the pivotal quantity $Q$ has certain types of density functions, then one can easily construct a confidence interval of shortest length. The following result is important in this regard.

Theorem 17.6. Let the density function of the pivot $Q \sim h(q;\theta)$ be continuous and unimodal. If in some interval $[a, b]$ the density function $h$ has a mode, and satisfies conditions (i) $\int_a^b h(q;\theta)\,dq = 1 - \alpha$ and (ii) $h(a) = h(b) > 0$, then the interval $[a, b]$ is of the shortest length among all intervals that satisfy condition (i).
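To see Theorem 17.6 at work numerically, the sketch below assumes a standard normal pivot (an illustrative choice, not taken from the text). It builds several intervals $[a, b]$ that all satisfy condition (i) with $\alpha = 0.05$ but put different probability in the lower tail, and prints their lengths together with $h(a)$ and $h(b)$. The function name `interval_with_lower_tail` is hypothetical.

```python
from statistics import NormalDist

std_normal = NormalDist()


def interval_with_lower_tail(tail, alpha=0.05):
    """[a, b] with P(Q < a) = tail and P(a < Q < b) = 1 - alpha for Q ~ N(0, 1)."""
    a = std_normal.inv_cdf(tail)
    b = std_normal.inv_cdf(tail + 1.0 - alpha)
    return a, b


for tail in (0.005, 0.015, 0.025, 0.035, 0.045):
    a, b = interval_with_lower_tail(tail)
    print(f"tail={tail:.3f}  a={a:.4f}  b={b:.4f}  length={b - a:.4f}  "
          f"h(a)={std_normal.pdf(a):.4f}  h(b)={std_normal.pdf(b):.4f}")
```

Among the intervals listed, the one with equal tails is the only one for which the density takes the same value at both endpoints, and it is also the shortest, in line with condition (ii).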

If the density function is not unimodal, then minimization of $\ell$ is necessary to construct a shortest confidence interval. One of the weaknesses of this shortest length criterion is that in some cases, $\ell$ could be a random variable. Often, the expected length of the interval $E(\ell) = E(U - L)$ is also used as a criterion for evaluating the goodness of an interval. However, this too has weaknesses. A weakness of this criterion is that minimization of $E(\ell)$ depends on the unknown true value of the parameter $\theta$. If the sample size is very large, then every approximate confidence interval constructed using the MLE method has minimum expected length.

A confidence interval is only shortest based on a particular pivot $Q$. It is possible to find another pivot $Q^\star$ which may yield an even shorter interval than the shortest interval found based on $Q$. The question that naturally arises is how to find the pivot that gives the shortest confidence interval among all other pivots. It has been pointed out that a pivotal quantity $Q$ which is some function of the complete and sufficient statistics gives the shortest confidence interval.

Unbiasedness is yet another criterion for judging the goodness of an interval estimator. The unbiasedness is defined as follows. A $100(1-\alpha)\%$ confidence interval $[L, U]$ of the parameter $\theta$ is said to be unbiased if
$$P(L \le \theta^\star \le U) \begin{cases} \ge 1 - \alpha & \text{if } \theta^\star = \theta \\ \le 1 - \alpha & \text{if } \theta^\star \ne \theta. \end{cases}$$

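17.9. Review Exercises

1. Let $X_1, X_2, ..., X_n$ be a random sample from a population with gamma density function
$$f(x;\theta,\beta) = \begin{cases} \dfrac{1}{\Gamma(\beta)\,\theta^\beta}\, x^{\beta-1} e^{-\frac{x}{\theta}} & \text{for } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$
where $\theta$ is an unknown parameter and $\beta > 0$ is a known parameter. Show that
$$\left[\frac{2\sum_{i=1}^n X_i}{\chi^2_{1-\frac{\alpha}{2}}(2n\beta)},\ \frac{2\sum_{i=1}^n X_i}{\chi^2_{\frac{\alpha}{2}}(2n\beta)}\right]$$
is a $100(1-\alpha)\%$ confidence interval for the parameter $\theta$.

2. Let $X_1, X_2, ..., X_n$ be a random sample from a population with Weibull density function
$$f(x;\theta,\beta) = \begin{cases} \dfrac{\beta}{\theta}\, x^{\beta-1} e^{-\frac{x^\beta}{\theta}} & \text{for } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$
where $\theta$ is an unknown parameter and $\beta > 0$ is a known parameter. Show that
$$\left[\frac{2\sum_{i=1}^n X_i^\beta}{\chi^2_{1-\frac{\alpha}{2}}(2n)},\ \frac{2\sum_{i=1}^n X_i^\beta}{\chi^2_{\frac{\alpha}{2}}(2n)}\right]$$
is a $100(1-\alpha)\%$ confidence interval for the parameter $\theta$.

3. Let $X_1, X_2, ..., X_n$ be a random sample from a population with Pareto density function
$$f(x;\theta,\beta) = \begin{cases} \theta\,\beta^\theta\, x^{-(\theta+1)} & \text{for } \beta \le x < \infty \\ 0 & \text{otherwise,} \end{cases}$$
where $\theta$ is an unknown parameter and $\beta > 0$ is a known parameter. Show that
$$\left[\frac{2\sum_{i=1}^n \ln\left(\frac{X_i}{\beta}\right)}{\chi^2_{1-\frac{\alpha}{2}}(2n)},\ \frac{2\sum_{i=1}^n \ln\left(\frac{X_i}{\beta}\right)}{\chi^2_{\frac{\alpha}{2}}(2n)}\right]$$
is a $100(1-\alpha)\%$ confidence interval for $\frac{1}{\theta}$.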

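4. Let $X_1, X_2, ..., X_n$ be a random sample from a population with Laplace density function
$$f(x;\theta) = \frac{1}{2\theta}\, e^{-\frac{|x|}{\theta}}, \qquad -\infty < x < \infty,$$
where $\theta$ is an unknown parameter. Show that
$$\left[\frac{2\sum_{i=1}^n |X_i|}{\chi^2_{1-\frac{\alpha}{2}}(2n)},\ \frac{2\sum_{i=1}^n |X_i|}{\chi^2_{\frac{\alpha}{2}}(2n)}\right]$$
is a $100(1-\alpha)\%$ confidence interval for $\theta$.

5. Let $X_1, X_2, ..., X_n$ be a random sample from a population with density function
$$f(x;\theta) = \begin{cases} \dfrac{1}{2\theta^2}\, x^3 e^{-\frac{x^2}{2\theta}} & \text{for } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$
where $\theta$ is an unknown parameter. Show that
$$\left[\frac{\sum_{i=1}^n X_i^2}{\chi^2_{1-\frac{\alpha}{2}}(4n)},\ \frac{\sum_{i=1}^n X_i^2}{\chi^2_{\frac{\alpha}{2}}(4n)}\right]$$
is a $100(1-\alpha)\%$ confidence interval for $\theta$.

6. Let $X_1, X_2, ..., X_n$ be a random sample from a population with density function
$$f(x;\theta,\beta) = \begin{cases} \dfrac{\beta\,\theta\, x^{\beta-1}}{\left(1 + x^\beta\right)^{\theta+1}} & \text{for } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$
where $\theta$ is an unknown parameter and $\beta > 0$ is a known parameter. Show that
$$\left[\frac{\chi^2_{\frac{\alpha}{2}}(2n)}{2\sum_{i=1}^n \ln\left(1 + X_i^\beta\right)},\ \frac{\chi^2_{1-\frac{\alpha}{2}}(2n)}{2\sum_{i=1}^n \ln\left(1 + X_i^\beta\right)}\right]$$
is a $100(1-\alpha)\%$ confidence interval for $\theta$.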

7. Let $X_1, X_2, ..., X_n$ be a random sample from a population with density function
$$f(x;\theta) = \begin{cases} e^{-(x-\theta)} & \text{if } \theta < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$
where $\theta \in \mathbb{R}$ is an unknown parameter. Then show that $Q = X_{(1)} - \theta$ is a pivotal quantity. Using this pivotal quantity find a $100(1-\alpha)\%$ confidence interval for $\theta$.

8. Let $X_1, X_2, ..., X_n$ be a random sample from a population with density function
$$f(x;\theta) = \begin{cases} e^{-(x-\theta)} & \text{if } \theta < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$
where $\theta \in \mathbb{R}$ is an unknown parameter. Then show that $Q = 2n\left(X_{(1)} - \theta\right)$ is a pivotal quantity. Using this pivotal quantity find a $100(1-\alpha)\%$ confidence interval for $\theta$.

9. Let $X_1, X_2, ..., X_n$ be a random sample from a population with density function
$$f(x;\theta) = \begin{cases} e^{-(x-\theta)} & \text{if } \theta < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$
where $\theta \in \mathbb{R}$ is an unknown parameter. Then show that $Q = e^{-\left(X_{(1)} - \theta\right)}$ is a pivotal quantity. Using this pivotal quantity find a $100(1-\alpha)\%$ confidence interval for $\theta$.

10. Let $X_1, X_2, ..., X_n$ be a random sample from a population with uniform density function
$$f(x;\theta) = \begin{cases} \dfrac{1}{\theta} & \text{if } 0 \le x \le \theta \\ 0 & \text{otherwise,} \end{cases}$$
where $0 < \theta$ is an unknown parameter. Then show that $Q = \dfrac{X_{(n)}}{\theta}$ is a pivotal quantity. Using this pivotal quantity find a $100(1-\alpha)\%$ confidence interval for $\theta$.

11. Let $X_1, X_2, ..., X_n$ be a random sample from a population with uniform density function
$$f(x;\theta) = \begin{cases} \dfrac{1}{\theta} & \text{if } 0 \le x \le \theta \\ 0 & \text{otherwise,} \end{cases}$$
where $0 < \theta$ is an unknown parameter. Then show that $Q = \dfrac{X_{(n)} - X_{(1)}}{\theta}$ is a pivotal quantity. Using this pivotal quantity find a $100(1-\alpha)\%$ confidence interval for $\theta$.

12. If $X_1, X_2, ..., X_n$ is a random sample from a population with density
$$f(x;\theta) = \begin{cases} \sqrt{\dfrac{2}{\pi}}\; e^{-\frac{1}{2}(x-\theta)^2} & \text{if } \theta \le x < \infty \\ 0 & \text{otherwise,} \end{cases}$$
where $\theta$ is an unknown parameter, what is a $100(1-\alpha)\%$ approximate confidence interval for $\theta$ if the sample size is large?

13. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution with a probability density function
$$f(x;\theta) = \begin{cases} (\theta + 1)\, x^{-\theta-2} & \text{if } 1 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$
where $0 < \theta$ is a parameter. What is a $100(1-\alpha)\%$ approximate confidence interval for $\theta$ if the sample size is large?

14. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution with a probability density function
$$f(x;\theta) = \begin{cases} \theta^2\, x\, e^{-\theta x} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$
where $0 < \theta$ is a parameter. What is a $100(1-\alpha)\%$ approximate confidence interval for $\theta$ if the sample size is large?

15. Let $X_1, X_2, ..., X_n$ be a random sample from a distribution with density function
$$f(x;\beta) = \begin{cases} \dfrac{1}{\beta}\, e^{-\frac{(x-4)}{\beta}} & \text{for } x > 4 \\ 0 & \text{otherwise,} \end{cases}$$
where $\beta > 0$. What is a $100(1-\alpha)\%$ approximate confidence interval for $\beta$ if the sample size is large?

16. Let $X_1, X_2, ..., X_n$ be a random sample from a distribution with density function
$$f(x;\theta) = \begin{cases} \dfrac{1}{\theta} & \text{for } 0 \le x \le \theta \\ 0 & \text{otherwise,} \end{cases}$$
where $0 < \theta$. What is a $100(1-\alpha)\%$ approximate confidence interval for $\theta$ if the sample size is large?

17. A sample $X_1, X_2, ..., X_n$ of size $n$ is drawn from a gamma distribution
$$f(x;\beta) = \begin{cases} \dfrac{x^3\, e^{-\frac{x}{\beta}}}{6\beta^4} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise.} \end{cases}$$
What is a $100(1-\alpha)\%$ approximate confidence interval for $\beta$ if the sample size is large?

18. Let $X_1, X_2, ..., X_n$ be a random sample from a continuous population $X$ with a distribution function $F(x;\theta)$. Show that the statistic $Q = -2\sum_{i=1}^n \ln F(X_i;\theta)$ is a pivotal quantity and has a chi-square distribution with $2n$ degrees of freedom.

19. Let $X_1, X_2, ..., X_n$ be a random sample from a continuous population $X$ with a distribution function $F(x;\theta)$. Show that the statistic $Q = -2\sum_{i=1}^n \ln\left(1 - F(X_i;\theta)\right)$ is a pivotal quantity and has a chi-square distribution with $2n$ degrees of freedom.

Chapter 18

TEST OF STATISTICAL HYPOTHESES FOR PARAMETERS

18.1. Introduction

Inferential statistics consists of estimation and hypothesis testing. We have already discussed various methods of finding point and interval estimators of parameters. We have also examined the goodness of an estimator.

Suppose $X_1, X_2, ..., X_n$ is a random sample from a population with probability density function given by
$$f(x;\theta) = \begin{cases} (1+\theta)\, x^\theta & \text{for } 0 < x < 1 \\ 0 & \text{otherwise,} \end{cases}$$
where $\theta > 0$ is an unknown parameter. Further, let $n = 4$ and suppose $x_1 = 0.92$, $x_2 = 0.75$, $x_3 = 0.85$, $x_4 = 0.8$ is a set of random sample data from the above distribution. If we apply the maximum likelihood method, then we will find that the estimator $\hat{\theta}$ of $\theta$ is
$$\hat{\theta} = -1 - \frac{4}{\ln(X_1) + \ln(X_2) + \ln(X_3) + \ln(X_4)}.$$
Hence, the maximum likelihood estimate of $\theta$ is
$$\begin{aligned}
\hat{\theta} &= -1 - \frac{4}{\ln(0.92) + \ln(0.75) + \ln(0.85) + \ln(0.80)} \\
&= -1 + \frac{4}{0.7567} = 4.2861.
\end{aligned}$$
Therefore, the corresponding probability density function of the population is given by
$$f(x) = \begin{cases} 5.2861\, x^{4.2861} & \text{for } 0 < x < 1 \\ 0 & \text{otherwise.} \end{cases}$$
Since the point estimate will rarely equal the true value of $\theta$, we would like to report a range of values with some degree of confidence. If we want to report an interval of values for $\theta$ with a confidence level of 90%, then we need a 90% confidence interval for $\theta$. If we use the pivotal quantity method, then we will find that the confidence interval for $\theta$ is
$$\left[-1 - \frac{\chi^2_{\frac{\alpha}{2}}(8)}{2\sum_{i=1}^4 \ln X_i},\ -1 - \frac{\chi^2_{1-\frac{\alpha}{2}}(8)}{2\sum_{i=1}^4 \ln X_i}\right].$$
Since $\chi^2_{0.05}(8) = 2.73$, $\chi^2_{0.95}(8) = 15.51$, and $\sum_{i=1}^4 \ln(x_i) = -0.7567$, we obtain
$$\left[-1 + \frac{2.73}{2(0.7567)},\ -1 + \frac{15.51}{2(0.7567)}\right]$$
which is
$$[\,0.803,\ 9.249\,].$$
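The numbers in this introductory example can be reproduced with a few lines of code. This is a sketch only; the chi-square percentiles are the table values quoted above, entered as constants, and the variable names are chosen for illustration. Because the script does not round the sum of logarithms to four decimals, its results can differ from the hand computation in the third decimal place.

```python
import math

data = [0.92, 0.75, 0.85, 0.80]
n = len(data)
sum_log = sum(math.log(x) for x in data)  # about -0.7567

# Maximum likelihood estimate for f(x; theta) = (1 + theta) x^theta on (0, 1).
theta_hat = -1.0 - n / sum_log

# 90% interval from the chi-square(2n) pivot, using the quoted table values.
chi2_05, chi2_95 = 2.73, 15.51
lower = -1.0 - chi2_05 / (2.0 * sum_log)
upper = -1.0 - chi2_95 / (2.0 * sum_log)
print(round(theta_hat, 4), round(lower, 3), round(upper, 3))
```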

Thus we may draw inference, at a 90% confidence level, that the population $X$ has the distribution
$$f(x;\theta) = \begin{cases} (1+\theta)\, x^\theta & \text{for } 0 < x < 1 \\ 0 & \text{otherwise,} \end{cases} \tag{$\star$}$$
where $\theta \in [0.803, 9.249]$. If we think carefully, we will notice that we have made one assumption. The assumption is that the observable quantity $X$ can be modeled by a density function as shown in ($\star$). Since we are concerned with parametric statistics, our assumption is in fact about $\theta$.

Based on the sample data, we found that an interval estimate of $\theta$ at a 90% confidence level is $[0.803, 9.249]$. But, we assumed that $\theta \in [0.803, 9.249]$. However, we cannot be sure that our assumption regarding the parameter is real and is not due to chance in the random sampling process. The validation of this assumption can be done by the hypothesis test. In this chapter, we discuss testing of statistical hypotheses. Most of the ideas regarding the hypothesis test came from Jerzy Neyman and Egon Pearson during 1928-1938.

Definition 18.1. A statistical hypothesis $H$ is a conjecture about the distribution $f(x;\theta)$ of a population $X$. This conjecture is usually about the parameter $\theta$ if one is dealing with parametric statistics; otherwise it is about the form of the distribution of $X$.

Definition 18.2. A hypothesis $H$ is said to be a simple hypothesis if $H$ completely specifies the density $f(x;\theta)$ of the population; otherwise it is called a composite hypothesis.

Definition 18.3. The hypothesis to be tested is called the null hypothesis. The negation of the null hypothesis is called the alternative hypothesis. The null and alternative hypotheses are denoted by $H_o$ and $H_a$, respectively.

If $\theta$ denotes a population parameter, then the general format of the null hypothesis and alternative hypothesis is
$$H_o: \theta \in \Omega_o \quad \text{and} \quad H_a: \theta \in \Omega_a \tag{$\star$}$$
where $\Omega_o$ and $\Omega_a$ are subsets of the parameter space $\Omega$ with
$$\Omega_o \cap \Omega_a = \emptyset \quad \text{and} \quad \Omega_o \cup \Omega_a \subseteq \Omega.$$

Remark 18.1. If $\Omega_o \cup \Omega_a = \Omega$, then ($\star$) becomes
$$H_o: \theta \in \Omega_o \quad \text{and} \quad H_a: \theta \notin \Omega_o.$$
If $\Omega_o$ is a singleton set, then $H_o$ reduces to a simple hypothesis. For example, if $\Omega_o = \{4.2861\}$, then the null hypothesis becomes $H_o: \theta = 4.2861$ and the alternative hypothesis becomes $H_a: \theta \ne 4.2861$. Hence, the null hypothesis $H_o: \theta = 4.2861$ is a simple hypothesis and the alternative $H_a: \theta \ne 4.2861$ is a composite hypothesis.

Definition 18.4. A hypothesis test is an ordered sequence
$$(X_1, X_2, ..., X_n;\ H_o, H_a;\ C)$$
where $X_1, X_2, ..., X_n$ is a random sample from a population $X$ with the probability density function $f(x;\theta)$, $H_o$ and $H_a$ are hypotheses concerning the parameter $\theta$ in $f(x;\theta)$, and $C$ is a Borel set in $\mathbb{R}^n$.

Remark 18.2. Borel sets are defined using the notion of $\sigma$-algebra. A collection of subsets $\mathcal{A}$ of a set $S$ is called a $\sigma$-algebra if (i) $S \in \mathcal{A}$, (ii) $A^c \in \mathcal{A}$, whenever $A \in \mathcal{A}$, and (iii) $\bigcup_{k=1}^\infty A_k \in \mathcal{A}$, whenever $A_1, A_2, ..., A_n, ... \in \mathcal{A}$. The Borel sets are the members of the smallest $\sigma$-algebra containing all open sets of $\mathbb{R}^n$. Two examples of Borel sets in $\mathbb{R}^n$ are the sets that arise by countable union of closed intervals in $\mathbb{R}^n$, and countable intersection of open sets in $\mathbb{R}^n$.

The set $C$ is called the critical region in the hypothesis test. The critical region is obtained using a test statistic $W(X_1, X_2, ..., X_n)$. If the outcome of $(X_1, X_2, ..., X_n)$ turns out to be an element of $C$, then we decide to accept $H_a$; otherwise we accept $H_o$.

Broadly speaking, a hypothesis test is a rule that tells us for which sample values we should decide to accept $H_o$ as true and for which sample values we should decide to reject $H_o$ and accept $H_a$ as true. Typically, a hypothesis test is specified in terms of a test statistic $W$. For example, a test might specify that $H_o$ is to be rejected if the sample total $\sum_{k=1}^n X_k$ is less than 8. In this case the critical region $C$ is the set $\{(x_1, x_2, ..., x_n) \mid x_1 + x_2 + \cdots + x_n < 8\}$.

18.2. A Method of Finding Tests

There are several methods to find test procedures and they are: (1) Likelihood Ratio Tests, (2) Invariant Tests, (3) Bayesian Tests, and (4) Union-Intersection and Intersection-Union Tests. In this section, we only examine likelihood ratio tests.

Definition 18.5. The likelihood ratio test statistic for testing the simple null hypothesis $H_o: \theta \in \Omega_o$ against the composite alternative hypothesis $H_a: \theta \notin \Omega_o$ based on a set of random sample data $x_1, x_2, ..., x_n$ is defined as
$$W(x_1, x_2, ..., x_n) = \frac{\max\limits_{\theta \in \Omega_o} L(\theta, x_1, x_2, ..., x_n)}{\max\limits_{\theta \in \Omega} L(\theta, x_1, x_2, ..., x_n)},$$
where $\Omega$ denotes the parameter space, and $L(\theta, x_1, x_2, ..., x_n)$ denotes the likelihood function of the random sample, that is
$$L(\theta, x_1, x_2, ..., x_n) = \prod_{i=1}^n f(x_i;\theta).$$
A likelihood ratio test (LRT) is any test that has a critical region $C$ (that is, rejection region) of the form
$$C = \{(x_1, x_2, ..., x_n) \mid W(x_1, x_2, ..., x_n) \le k\},$$
where $k$ is a number in the unit interval $[0, 1]$.

If $H_o: \theta = \theta_o$ and $H_a: \theta = \theta_a$ are both simple hypotheses, then the likelihood ratio test statistic is defined as
$$W(x_1, x_2, ..., x_n) = \frac{L(\theta_o, x_1, x_2, ..., x_n)}{L(\theta_a, x_1, x_2, ..., x_n)}.$$
Now we give some examples to illustrate this definition.

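Example 18.1. Let $X_1, X_2, X_3$ denote three independent observations from a distribution with density
$$f(x;\theta) = \begin{cases} (1+\theta)\, x^\theta & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$
What is the form of the LRT critical region for testing $H_o: \theta = 1$ versus $H_a: \theta = 2$?

Answer: In this example, $\theta_o = 1$ and $\theta_a = 2$. By the above definition, the form of the critical region is given by
$$\begin{aligned}
C &= \left\{(x_1, x_2, x_3) \in \mathbb{R}^3 \ \middle|\ \frac{L(\theta_o, x_1, x_2, x_3)}{L(\theta_a, x_1, x_2, x_3)} \le k\right\} \\
&= \left\{(x_1, x_2, x_3) \in \mathbb{R}^3 \ \middle|\ \frac{(1+\theta_o)^3 \prod_{i=1}^3 x_i^{\theta_o}}{(1+\theta_a)^3 \prod_{i=1}^3 x_i^{\theta_a}} \le k\right\} \\
&= \left\{(x_1, x_2, x_3) \in \mathbb{R}^3 \ \middle|\ \frac{8\, x_1 x_2 x_3}{27\, x_1^2 x_2^2 x_3^2} \le k\right\} \\
&= \left\{(x_1, x_2, x_3) \in \mathbb{R}^3 \ \middle|\ \frac{1}{x_1 x_2 x_3} \le \frac{27}{8}\, k\right\} \\
&= \left\{(x_1, x_2, x_3) \in \mathbb{R}^3 \mid x_1 x_2 x_3 \ge a\right\},
\end{aligned}$$
where $a$ is some constant. Hence the likelihood ratio test is of the form: "Reject $H_o$ if $\prod_{i=1}^3 X_i \ge a$."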
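The simplification in Example 18.1 can be checked numerically. The sketch below evaluates the likelihood ratio directly from the density and compares it with the closed form $\frac{8}{27\, x_1 x_2 x_3}$; the function names and the sample values are hypothetical.

```python
import math


def likelihood(theta, xs):
    """L(theta) = prod (1 + theta) * x_i**theta for observations in [0, 1]."""
    return math.prod((1.0 + theta) * x**theta for x in xs)


def lr_statistic(xs, theta_o=1.0, theta_a=2.0):
    """W = L(theta_o) / L(theta_a) for two simple hypotheses."""
    return likelihood(theta_o, xs) / likelihood(theta_a, xs)


sample = (0.5, 0.5, 0.5)
w = lr_statistic(sample)
closed_form = 8.0 / (27.0 * math.prod(sample))
print(w, closed_form)
```

Since the ratio decreases as the product $x_1 x_2 x_3$ grows, small values of $W$ correspond to large values of the product, which is why the critical region has the form $x_1 x_2 x_3 \ge a$.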

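Example 18.2. Let $X_1, X_2, ..., X_{12}$ be a random sample from a normal population with mean zero and variance $\sigma^2$. What is the form of the LRT critical region for testing the null hypothesis $H_o: \sigma^2 = 10$ versus $H_a: \sigma^2 = 5$?

Answer: Here $\sigma_o^2 = 10$ and $\sigma_a^2 = 5$. By the above definition, the form of the critical region is given by (with $\sigma_o^2 = 10$ and $\sigma_a^2 = 5$)
$$\begin{aligned}
C &= \left\{(x_1, x_2, ..., x_{12}) \in \mathbb{R}^{12} \ \middle|\ \frac{L\left(\sigma_o^2, x_1, x_2, ..., x_{12}\right)}{L\left(\sigma_a^2, x_1, x_2, ..., x_{12}\right)} \le k\right\} \\
&= \left\{(x_1, x_2, ..., x_{12}) \in \mathbb{R}^{12} \ \middle|\ \prod_{i=1}^{12} \frac{\frac{1}{\sqrt{2\pi\sigma_o^2}}\, e^{-\frac{1}{2}\left(\frac{x_i}{\sigma_o}\right)^2}}{\frac{1}{\sqrt{2\pi\sigma_a^2}}\, e^{-\frac{1}{2}\left(\frac{x_i}{\sigma_a}\right)^2}} \le k\right\} \\
&= \left\{(x_1, x_2, ..., x_{12}) \in \mathbb{R}^{12} \ \middle|\ \left(\frac{1}{2}\right)^6 e^{\frac{1}{20}\sum_{i=1}^{12} x_i^2} \le k\right\} \\
&= \left\{(x_1, x_2, ..., x_{12}) \in \mathbb{R}^{12} \ \middle|\ \sum_{i=1}^{12} x_i^2 \le a\right\},
\end{aligned}$$
where $a$ is some constant. Hence the likelihood ratio test is of the form: "Reject $H_o$ if $\sum_{i=1}^{12} X_i^2 \le a$."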

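Example 18.3. Suppose that $X$ is a random variable about which the hypothesis $H_o: X \sim UNIF(0,1)$ against $H_a: X \sim N(0,1)$ is to be tested. What is the form of the LRT critical region based on one observation of $X$?

Answer: In this example, $L_o(x) = 1$ and $L_a(x) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{1}{2}x^2}$. By the above definition, the form of the critical region is given by
$$\begin{aligned}
C &= \left\{x \in \mathbb{R} \ \middle|\ \frac{L_o(x)}{L_a(x)} \le k\right\}, \quad \text{where } k \in [0, \infty) \\
&= \left\{x \in \mathbb{R} \ \middle|\ \sqrt{2\pi}\, e^{\frac{1}{2}x^2} \le k\right\} \\
&= \left\{x \in \mathbb{R} \ \middle|\ x^2 \le 2\ln\left(\frac{k}{\sqrt{2\pi}}\right)\right\} \\
&= \{x \in \mathbb{R} \mid x \le a\},
\end{aligned}$$
where $a$ is some constant. Hence the likelihood ratio test is of the form: "Reject $H_o$ if $X \le a$."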

In the above three examples, we have dealt with the case when the null as well as the alternative were simple. If the null hypothesis is simple (for example, $H_o: \theta = \theta_o$) and the alternative is a composite hypothesis (for example, $H_a: \theta \ne \theta_o$), then the following algorithm can be used to construct the likelihood ratio critical region:

(1) Find the likelihood function $L(\theta, x_1, x_2, ..., x_n)$ for the given sample.

(2) Find $L(\theta_o, x_1, x_2, ..., x_n)$.

(3) Find $\max\limits_{\theta \in \Omega} L(\theta, x_1, x_2, ..., x_n)$.

(4) Rewrite $\dfrac{L(\theta_o, x_1, x_2, ..., x_n)}{\max\limits_{\theta \in \Omega} L(\theta, x_1, x_2, ..., x_n)}$ in a "suitable form".

(5) Use step (4) to construct the critical region.

Now we give an example to illustrate these steps.

Example 18.4. Let $X$ be a single observation from a population with probability density
$$f(x;\theta) = \begin{cases} \dfrac{\theta^x e^{-\theta}}{x!} & \text{for } x = 0, 1, 2, ..., \infty \\ 0 & \text{otherwise,} \end{cases}$$
where $\theta \ge 0$. Find the likelihood ratio critical region for testing the null hypothesis $H_o: \theta = 2$ against the composite alternative $H_a: \theta \ne 2$.

Answer: The likelihood function based on one observation $x$ is
$$L(\theta, x) = \frac{\theta^x e^{-\theta}}{x!}.$$
Next, we find $L(\theta_o, x)$ which is given by
$$L(2, x) = \frac{2^x e^{-2}}{x!}.$$
Our next step is to evaluate $\max\limits_{\theta \ge 0} L(\theta, x)$. For this we differentiate $L(\theta, x)$ with respect to $\theta$, and then set the derivative to 0 and solve for $\theta$. Hence
$$\frac{dL(\theta, x)}{d\theta} = \frac{1}{x!}\left[e^{-\theta}\, x\, \theta^{x-1} - \theta^x e^{-\theta}\right]$$
and $\frac{dL(\theta, x)}{d\theta} = 0$ gives $\theta = x$. Hence
$$\max_{\theta \ge 0} L(\theta, x) = \frac{x^x e^{-x}}{x!}.$$
To do step (4), we consider
$$\frac{L(2, x)}{\max\limits_{\theta \in \Omega} L(\theta, x)} = \frac{\frac{2^x e^{-2}}{x!}}{\frac{x^x e^{-x}}{x!}}$$
which simplifies to
$$\frac{L(2, x)}{\max\limits_{\theta \in \Omega} L(\theta, x)} = \left(\frac{2e}{x}\right)^x e^{-2}.$$
Thus, the likelihood ratio critical region is given by
$$C = \left\{x \in \mathbb{R} \ \middle|\ \left(\frac{2e}{x}\right)^x e^{-2} \le k\right\} = \left\{x \in \mathbb{R} \ \middle|\ \left(\frac{2e}{x}\right)^x \le a\right\}$$
where $a$ is some constant. The likelihood ratio test is of the form: "Reject $H_o$ if $\left(\frac{2e}{X}\right)^X \le a$."
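The five steps above can be followed in code for the Poisson case of Example 18.4. This is a minimal sketch: the function names, the cutoff $k = 0.2$ and the truncation of the sample space at `x_max` are assumptions made only for the illustration.

```python
import math


def poisson_lr(x, theta_o=2.0):
    """W(x) = L(theta_o, x) / max_theta L(theta, x) = (theta_o e / x)^x e^(-theta_o)."""
    if x == 0:
        # For x = 0 the likelihood e^(-theta) is maximal at theta = 0, so W = e^(-theta_o).
        return math.exp(-theta_o)
    return (theta_o * math.e / x) ** x * math.exp(-theta_o)


def critical_region(k, theta_o=2.0, x_max=20):
    """All x in 0..x_max with W(x) <= k."""
    return [x for x in range(x_max + 1) if poisson_lr(x, theta_o) <= k]


for x in range(9):
    print(x, round(poisson_lr(x), 4))
print(critical_region(0.2, x_max=8))
```

The statistic equals 1 at $x = 2$, where the data agree best with $H_o$, and falls off on both sides, so the critical region collects observations that are either very small or very large.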

So far, we have learned how to find tests for testing the null hypothesis against the alternative hypothesis. However, we have not considered the goodness of these tests. In the next section, we consider various criteria for evaluating the goodness of a hypothesis test.

18.3. Methods of Evaluating Tests

There are several criteria to evaluate the goodness of a test procedure. Some well known criteria are: (1) Powerfulness, (2) Unbiasedness and Invariancy, and (3) Local Powerfulness. In order to examine some of these criteria, we need some terminology such as error probabilities, power functions, type I error, and type II error. First, we develop this terminology.

A statistical hypothesis is a conjecture about the distribution $f(x;\theta)$ of the population $X$. This conjecture is usually about the parameter $\theta$ if one is dealing with parametric statistics; otherwise it is about the form of the distribution of $X$. If the hypothesis completely specifies the density $f(x;\theta)$ of the population, then it is said to be a simple hypothesis; otherwise it is called a composite hypothesis. The hypothesis to be tested is called the null hypothesis. We often hope to reject the null hypothesis based on the sample information. The negation of the null hypothesis is called the alternative hypothesis. The null and alternative hypotheses are denoted by $H_o$ and $H_a$, respectively.

In a hypothesis test, the basic problem is to decide, based on the sample information, whether the null hypothesis is true. There are four possible situations that determine whether our decision is correct or in error. These four situations are summarized below:

                 $H_o$ is true        $H_o$ is false
Accept $H_o$     Correct Decision     Type II Error
Reject $H_o$     Type I Error         Correct Decision

Deﬁnition 18.6. Let Ho :✓2 ⌦o and Ha :✓ 62 ⌦o be the null and

alternative hypotheses to be tested based on a random sample X1 , X2 , ..., Xn

from a population X with density f (x ;✓ ), where ✓ is a parameter. The

signiﬁcance level of the hypothesis test

Ho :✓2 ⌦o and Ha :✓ 62 ⌦o ,

denoted by ↵ , is deﬁned as

↵=P (Type I Error) .

Thus, the signiﬁcance level of a hypothesis test we mean the probability of

rejecting a true null hypothesis, that is

↵=P (Reject Ho / Ho is true) .

This is also equivalent to

↵=P (Accept Ha / Ho is true) .

Definition 18.7. Let $H_o: \theta \in \Omega_o$ and $H_a: \theta \notin \Omega_o$ be the null and alternative hypotheses to be tested based on a random sample $X_1, X_2, ..., X_n$ from a population $X$ with density $f(x;\theta)$, where $\theta$ is a parameter. The probability of type II error of the hypothesis test

$$H_o: \theta \in \Omega_o \quad \text{and} \quad H_a: \theta \notin \Omega_o,$$

denoted by $\beta$, is defined as

$$\beta = P(\text{Accept } H_o \,/\, H_o \text{ is false}).$$

Similarly, this is also equivalent to

$$\beta = P(\text{Accept } H_o \,/\, H_a \text{ is true}).$$

Remark 18.3. Note that $\alpha$ can be numerically evaluated if the null hypothesis is a simple hypothesis and the rejection rule is given. Similarly, $\beta$ can be evaluated if the alternative hypothesis is simple and the rejection rule is known. If the null and the alternative are composite hypotheses, then $\alpha$ and $\beta$ become functions of $\theta$.

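Example 18.5. Let $X_1, X_2, ..., X_{20}$ be a random sample from a distribution with probability density function

$$f(x;p) = \begin{cases} p^x (1-p)^{1-x} & \text{if } x = 0, 1 \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < p \le \frac{1}{2}$ is a parameter. The hypothesis $H_o: p = \frac{1}{2}$ is to be tested against $H_a: p < \frac{1}{2}$. If $H_o$ is rejected when $\sum_{i=1}^{20} X_i \le 6$, then what is the probability of type I error?

Answer: Since each observation $X_i \sim BER(p)$, the sum of the observations $\sum_{i=1}^{20} X_i \sim BIN(20, p)$. The probability of type I error is given by

$$\begin{aligned}
\alpha &= P(\text{Type I Error}) \\
&= P(\text{Reject } H_o \,/\, H_o \text{ is true}) \\
&= P\left( \sum_{i=1}^{20} X_i \le 6 \,\Big/\, H_o \text{ is true} \right) \\
&= P\left( \sum_{i=1}^{20} X_i \le 6 \,\Big/\, H_o: p = \tfrac{1}{2} \right) \\
&= \sum_{k=0}^{6} \binom{20}{k} \left(\tfrac{1}{2}\right)^k \left(1 - \tfrac{1}{2}\right)^{20-k} \\
&= 0.0577 \quad \text{(from binomial table).}
\end{aligned}$$

Hence the probability of type I error is 0.0577.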
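The table lookup in Example 18.5 can be cross-checked by summing the binomial probabilities directly. The following is a minimal sketch using only the Python standard library; the helper name `binomial_cdf` is an illustrative choice and does not come from the text.

```python
from math import comb

def binomial_cdf(c, n, p):
    """P(S <= c) for S ~ BIN(n, p), summed term by term."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(c + 1))

# Type I error of the rule "reject H_o if sum(X_i) <= 6" with n = 20, p = 1/2.
alpha = binomial_cdf(6, 20, 0.5)
print(round(alpha, 4))  # agrees with the table value 0.0577
```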

Example 18.6. Let $p$ represent the proportion of defectives in a manufacturing process. To test $H_o: p \le \frac{1}{4}$ versus $H_a: p > \frac{1}{4}$, a random sample of size 5 is taken from the process. If the number of defectives is 4 or more, the null hypothesis is rejected. What is the probability of rejecting $H_o$ if $p = \frac{1}{5}$?

Answer: Let $X$ denote the number of defectives out of a random sample of size 5. Then $X$ is a binomial random variable with $n = 5$ and $p = \frac{1}{5}$. Hence, the probability of rejecting $H_o$ is given by

$$\begin{aligned}
\alpha &= P(\text{Reject } H_o \,/\, H_o \text{ is true}) \\
&= P(X \ge 4 \,/\, H_o \text{ is true}) \\
&= P\left(X \ge 4 \,/\, p = \tfrac{1}{5}\right) \\
&= P\left(X = 4 \,/\, p = \tfrac{1}{5}\right) + P\left(X = 5 \,/\, p = \tfrac{1}{5}\right) \\
&= \binom{5}{4} p^4 (1-p)^1 + \binom{5}{5} p^5 (1-p)^0 \\
&= 5 \left(\tfrac{1}{5}\right)^4 \left(\tfrac{4}{5}\right) + \left(\tfrac{1}{5}\right)^5 \\
&= \left(\tfrac{1}{5}\right)^5 [20 + 1] \\
&= \frac{21}{3125}.
\end{aligned}$$

Hence the probability of rejecting the null hypothesis $H_o$ is $\frac{21}{3125}$.

Example 18.7. A random sample of size 4 is taken from a normal distribution with unknown mean $\mu$ and variance $\sigma^2 > 0$. To test $H_o: \mu = 0$ against $H_a: \mu < 0$ the following test is used: "Reject $H_o$ if and only if $X_1 + X_2 + X_3 + X_4 < -20$." Find the value of $\sigma$ so that the significance level of this test will be close to 0.14.

Answer: Since

$$\begin{aligned}
0.14 &= \alpha \quad \text{(significance level)} \\
&= P(\text{Type I Error}) \\
&= P(\text{Reject } H_o \,/\, H_o \text{ is true}) \\
&= P(X_1 + X_2 + X_3 + X_4 < -20 \,/\, H_o: \mu = 0) \\
&= P\left(\overline{X} < -5 \,/\, H_o: \mu = 0\right) \\
&= P\left( \frac{\overline{X} - 0}{\sigma/2} < \frac{-5 - 0}{\sigma/2} \right) \\
&= P\left( Z < -\frac{10}{\sigma} \right),
\end{aligned}$$

we get from the standard normal table

$$1.08 = \frac{10}{\sigma}.$$

Therefore

$$\sigma = \frac{10}{1.08} = 9.26.$$

Hence, the standard deviation has to be 9.26 so that the significance level will be close to 0.14.
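The table lookup in Example 18.7 can be reproduced with the standard normal quantile function. This is a small sketch that assumes Python's `statistics.NormalDist`; the variable names are illustrative.

```python
from statistics import NormalDist

alpha = 0.14
# Under H_o the sample mean of 4 observations is N(0, sigma^2 / 4), so
# P(X1 + X2 + X3 + X4 < -20) = P(Z < -10 / sigma).
z = NormalDist().inv_cdf(alpha)   # lower-tail quantile, roughly -1.08
sigma = -10 / z
print(round(sigma, 2))
```

The result differs from 10/1.08 only in the digits that the printed table cannot resolve.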

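Example 18.8. A normal population has a standard deviation of 16. The critical region for testing $H_o: \mu = 5$ versus the alternative $H_a: \mu = k$ is $\overline{X} > k - 2$. What would be the value of the constant $k$ and the sample size $n$ which would allow the probability of Type I error to be 0.0228 and the probability of Type II error to be 0.1587?

Answer: It is given that the population $X \sim N(\mu, 16^2)$. Since

$$\begin{aligned}
0.0228 &= \alpha \\
&= P(\text{Type I Error}) \\
&= P(\text{Reject } H_o \,/\, H_o \text{ is true}) \\
&= P\left(\overline{X} > k - 2 \,/\, H_o: \mu = 5\right) \\
&= P\left( \frac{\overline{X} - 5}{\sqrt{256/n}} > \frac{k - 7}{\sqrt{256/n}} \right) \\
&= P\left( Z > \frac{k - 7}{\sqrt{256/n}} \right) \\
&= 1 - P\left( Z \le \frac{k - 7}{\sqrt{256/n}} \right).
\end{aligned}$$

Hence, from the standard normal table, we have

$$\frac{(k - 7)\sqrt{n}}{16} = 2$$

which gives

$$(k - 7)\sqrt{n} = 32.$$

Similarly

$$\begin{aligned}
0.1587 &= P(\text{Type II Error}) \\
&= P(\text{Accept } H_o \,/\, H_a \text{ is true}) \\
&= P\left(\overline{X} \le k - 2 \,/\, H_a: \mu = k\right) \\
&= P\left( \frac{\overline{X} - \mu}{\sqrt{256/n}} \le \frac{k - 2 - \mu}{\sqrt{256/n}} \,\Big/\, H_a: \mu = k \right) \\
&= P\left( \frac{\overline{X} - k}{\sqrt{256/n}} \le \frac{k - 2 - k}{\sqrt{256/n}} \right) \\
&= P\left( Z \le -\frac{2}{\sqrt{256/n}} \right) \\
&= 1 - P\left( Z \le \frac{2\sqrt{n}}{16} \right).
\end{aligned}$$

Hence $0.1587 = 1 - P\left( Z \le \frac{2\sqrt{n}}{16} \right)$ or $P\left( Z \le \frac{2\sqrt{n}}{16} \right) = 0.8413$. Thus, from the standard normal table, we have

$$\frac{2\sqrt{n}}{16} = 1$$

which yields

$$n = 64.$$

Substituting this value of $n$ in

$$(k - 7)\sqrt{n} = 32,$$

we see that $k = 11$.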
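The two conditions of Example 18.8 can also be solved numerically rather than from the table. The sketch below assumes `statistics.NormalDist` for the normal quantiles and rounds $n$ to the nearest integer, since the table values 2 and 1 are themselves rounded quantiles.

```python
from math import sqrt
from statistics import NormalDist

sd, mu0, alpha, beta = 16, 5, 0.0228, 0.1587
std_normal = NormalDist()
z_alpha = std_normal.inv_cdf(1 - alpha)   # close to 2
z_beta = std_normal.inv_cdf(1 - beta)     # close to 1

# Type II condition: 2 * sqrt(n) / sd = z_beta
n = round((sd * z_beta / 2) ** 2)
# Type I condition: (k - 2 - mu0) * sqrt(n) / sd = z_alpha
k = mu0 + 2 + sd * z_alpha / sqrt(n)
print(n, round(k, 2))
```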

While deciding to accept $H_o$ or $H_a$, we may make a wrong decision. The probability $\delta$ of a wrong decision can be computed as follows:

$$\begin{aligned}
\delta &= P(H_a \text{ accepted and } H_o \text{ is true}) + P(H_o \text{ accepted and } H_a \text{ is true}) \\
&= P(H_a \text{ accepted} \,/\, H_o \text{ is true}) \, P(H_o \text{ is true}) + P(H_o \text{ accepted} \,/\, H_a \text{ is true}) \, P(H_a \text{ is true}) \\
&= \alpha \, P(H_o \text{ is true}) + \beta \, P(H_a \text{ is true}).
\end{aligned}$$

In most cases, the probabilities $P(H_o \text{ is true})$ and $P(H_a \text{ is true})$ are not known. Therefore, it is, in general, not possible to determine the exact numerical value of the probability $\delta$ of making a wrong decision. However, since $\delta$ is a weighted sum of $\alpha$ and $\beta$, and $P(H_o \text{ is true}) + P(H_a \text{ is true}) = 1$, we have

$$\delta \le \max\{\alpha, \beta\}.$$

A good decision rule (or a good test) is the one which yields the smallest $\delta$. In view of the above inequality, one will have a small $\delta$ if the probability of type I error as well as the probability of type II error are small.

The alternative hypothesis is mostly a composite hypothesis. Thus, it is not possible to find a single value for the probability of type II error, $\beta$. For a composite alternative, $\beta$ is a function of $\theta$. That is, $\beta: \Omega_o^c \to [0, 1]$. Here $\Omega_o^c$ denotes the complement of the set $\Omega_o$ in the parameter space $\Omega$. In hypothesis testing, instead of $\beta$, one usually considers the power of the test $1 - \beta(\theta)$, and a small probability of type II error is equivalent to a large power of the test.

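Definition 18.8. Let $H_o: \theta \in \Omega_o$ and $H_a: \theta \notin \Omega_o$ be the null and alternative hypotheses to be tested based on a random sample $X_1, X_2, ..., X_n$ from a population $X$ with density $f(x;\theta)$, where $\theta$ is a parameter. The power function of a hypothesis test

$$H_o: \theta \in \Omega_o \quad \text{versus} \quad H_a: \theta \notin \Omega_o$$

is a function $\pi: \Omega \to [0, 1]$ defined by

$$\pi(\theta) = \begin{cases} P(\text{Type I Error}) & \text{if } H_o \text{ is true} \\ 1 - P(\text{Type II Error}) & \text{if } H_a \text{ is true.} \end{cases}$$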

Example 18.9. A manufacturing firm needs to test the null hypothesis $H_o$ that the probability $p$ of a defective item is 0.1 or less, against the alternative hypothesis $H_a: p > 0.1$. The procedure is to select two items at random. If both are defective, $H_o$ is rejected; otherwise, a third is selected. If the third item is defective, $H_o$ is rejected. In all other cases, $H_o$ is accepted. What is the power of the test in terms of $p$ (if $H_o$ is true)?

Answer: Let $p$ be the probability of a defective item. We want to calculate the power of the test at the null hypothesis. The power function of the test is given by

$$\pi(p) = \begin{cases} P(\text{Type I Error}) & \text{if } p \le 0.1 \\ 1 - P(\text{Type II Error}) & \text{if } p > 0.1. \end{cases}$$

Hence, we have

$$\begin{aligned}
\pi(p) &= P(\text{Reject } H_o \,/\, H_o \text{ is true}) \\
&= P(\text{Reject } H_o \,/\, H_o: p = p) \\
&= P(\text{first two items are both defective} \,/\, p) \\
&\quad + P(\text{at least one of the first two items is not defective and third is} \,/\, p) \\
&= p^2 + (1-p)^2 p + \binom{2}{1} p (1-p) p \\
&= p + p^2 - p^3.
\end{aligned}$$

The graph of this power function is shown below.

[Figure: graph of the power function $\pi(p)$.]
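The closed form $p + p^2 - p^3$ can be checked by enumerating the outcomes of the sampling procedure. The following sketch treats the three items as independent Bernoulli trials; summing over the third item even when it is not inspected does not change the rejection probability. The function name `power` is an illustrative choice.

```python
from itertools import product

def power(p):
    """P(reject H_o) for the two-stage rule, by enumerating item outcomes."""
    total = 0.0
    for first, second, third in product([0, 1], repeat=3):  # 1 = defective
        prob = 1.0
        for item in (first, second, third):
            prob *= p if item else 1 - p
        # Reject if the first two are both defective; otherwise reject
        # exactly when the third item is defective.
        if (first and second) or third:
            total += prob
    return total

for p in (0.05, 0.1, 0.3):
    print(p, power(p), p + p**2 - p**3)
```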

Remark 18.4. If $X$ denotes the number of independent trials needed to obtain the first success, then $X \sim GEO(p)$, and

$$P(X = k) = (1-p)^{k-1} p,$$

where $k = 1, 2, 3, ..., \infty$. Further

$$P(X \le n) = 1 - (1-p)^n$$

since

$$\begin{aligned}
\sum_{k=1}^{n} (1-p)^{k-1} p &= p \sum_{k=1}^{n} (1-p)^{k-1} \\
&= p \, \frac{1 - (1-p)^n}{1 - (1-p)} \\
&= 1 - (1-p)^n.
\end{aligned}$$

Example 18.10. Let $X$ be the number of independent trials required to obtain a success where $p$ is the probability of success on each trial. The hypothesis $H_o: p = 0.1$ is to be tested against the alternative $H_a: p = 0.3$. The hypothesis is rejected if $X \le 4$. What is the power of the test if $H_a$ is true?

Answer: The power function is given by

$$\pi(p) = \begin{cases} P(\text{Type I Error}) & \text{if } p = 0.1 \\ 1 - P(\text{Type II Error}) & \text{if } p = 0.3. \end{cases}$$

Hence, we have

$$\begin{aligned}
\pi(0.3) &= 1 - P(\text{Accept } H_o \,/\, H_o \text{ is false}) \\
&= P(\text{Reject } H_o \,/\, H_a \text{ is true}) \\
&= P(X \le 4 \,/\, H_a \text{ is true}) \\
&= P(X \le 4 \,/\, p = 0.3) \\
&= \sum_{k=1}^{4} P(X = k \,/\, p = 0.3) \\
&= \sum_{k=1}^{4} (1-p)^{k-1} p \quad \text{(where } p = 0.3\text{)} \\
&= \sum_{k=1}^{4} (0.7)^{k-1} (0.3) \\
&= 0.3 \sum_{k=1}^{4} (0.7)^{k-1} \\
&= 1 - (0.7)^4 \\
&= 0.7599.
\end{aligned}$$

Hence, the power of the test at the alternative is 0.7599.
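As a quick numerical check of Example 18.10, the geometric sum can be evaluated directly. The sketch below also evaluates the same rejection rule at $p = 0.1$, which gives the type I error of the test; that second number is not computed in the text and is included only for comparison.

```python
def geometric_cdf(n, p):
    """P(X <= n) for X ~ GEO(p), summing P(X = k) = (1 - p)**(k - 1) * p."""
    return sum((1 - p) ** (k - 1) * p for k in range(1, n + 1))

power_at_alternative = geometric_cdf(4, 0.3)   # equals 1 - 0.7**4
size_at_null = geometric_cdf(4, 0.1)           # equals 1 - 0.9**4
print(round(power_at_alternative, 4), round(size_at_null, 4))
```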

Example 18.11. Let $X_1, X_2, ..., X_{25}$ be a random sample of size 25 drawn from a normal distribution with unknown mean $\mu$ and variance $\sigma^2 = 100$. It is desired to test the null hypothesis $\mu = 4$ against the alternative $\mu = 6$. What is the power at $\mu = 6$ of the test with rejection rule: reject $\mu = 4$ if $\sum_{i=1}^{25} X_i \ge 125$?

Answer: The power of the test at the alternative is

$$\begin{aligned}
\pi(6) &= 1 - P(\text{Type II Error}) \\
&= 1 - P(\text{Accept } H_o \,/\, H_o \text{ is false}) \\
&= P(\text{Reject } H_o \,/\, H_a \text{ is true}) \\
&= P\left( \sum_{i=1}^{25} X_i \ge 125 \,\Big/\, H_a: \mu = 6 \right) \\
&= P\left( \overline{X} \ge 5 \,/\, H_a: \mu = 6 \right) \\
&= P\left( \frac{\overline{X} - 6}{10/\sqrt{25}} \ge \frac{5 - 6}{10/\sqrt{25}} \right) \\
&= P\left( Z \ge -\tfrac{1}{2} \right) \\
&= 0.6915.
\end{aligned}$$
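The power in Example 18.11 follows from the sampling distribution of $\overline{X}$ under the alternative. A minimal sketch, assuming `statistics.NormalDist` and illustrative variable names:

```python
from math import sqrt
from statistics import NormalDist

n, sigma, mu_alt, cutoff_sum = 25, 10, 6, 125
# Under H_a the sample mean is N(6, sigma^2 / n); the rule sum >= 125 is mean >= 5.
xbar_dist = NormalDist(mu=mu_alt, sigma=sigma / sqrt(n))
power = 1 - xbar_dist.cdf(cutoff_sum / n)
print(round(power, 4))
```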

Example 18.12. An urn contains 7 balls, $\theta$ of which are red. A sample of size 2 is drawn without replacement to test $H_o: \theta \le 1$ against $H_a: \theta > 1$. If the null hypothesis is rejected when one or more red balls are drawn, find the power of the test when $\theta = 2$.

Answer: The power of the test at $\theta = 2$ is given by

$$\begin{aligned}
\pi(2) &= 1 - P(\text{Type II Error}) \\
&= 1 - P(\text{Accept } H_o \,/\, H_o \text{ is false}) \\
&= 1 - P(\text{zero red balls are drawn} \,/\, 2 \text{ balls were red}) \\
&= 1 - \frac{\binom{5}{2}}{\binom{7}{2}} \\
&= 1 - \frac{10}{21} \\
&= \frac{11}{21} \\
&= 0.524.
\end{aligned}$$
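The same hypergeometric count gives the power at every value of $\theta$, not only at $\theta = 2$. The sketch below uses exact fractions; the function name `power` and its default arguments are illustrative.

```python
from fractions import Fraction
from math import comb

def power(theta, total=7, draws=2):
    """P(at least one red ball in the sample) when theta of the balls are red."""
    p_no_red = Fraction(comb(total - theta, draws), comb(total, draws))
    return 1 - p_no_red

for theta in range(8):
    print(theta, power(theta), float(power(theta)))
```

Evaluating at $\theta = 1$, the boundary of the null hypothesis, gives the largest type I error probability of this rule.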

In all of these examples, we have seen that if the rule for rejection of the null hypothesis $H_o$ is given, then one can compute the significance level or power function of the hypothesis test. The rejection rule is given in terms of a statistic $W(X_1, X_2, ..., X_n)$ of the sample $X_1, X_2, ..., X_n$. For instance, in Example 18.5, the rejection rule was: "Reject the null hypothesis $H_o$ if $\sum_{i=1}^{20} X_i \le 6$." Similarly, in Example 18.7, the rejection rule was: "Reject $H_o$ if and only if $X_1 + X_2 + X_3 + X_4 < -20$", and so on. The statistic $W$, used in the statement of the rejection rule, partitions the set $S^n$ into two subsets, where $S$ denotes the support of the density function of the population $X$. One subset is called the rejection or critical region and the other subset is called the acceptance region. The rejection rule is obtained in such a way that the probability of the type I error is as small as possible and the power of the test at the alternative is as large as possible.

Next, we give two definitions that will lead us to the definition of the uniformly most powerful test.

Definition 18.9. Given $0 \le \delta \le 1$, a test (or test procedure) $T$ for testing the null hypothesis $H_o: \theta \in \Omega_o$ against the alternative $H_a: \theta \in \Omega_a$ is said to be a test of level $\delta$ if

$$\max_{\theta \in \Omega_o} \pi(\theta) \le \delta,$$

where $\pi(\theta)$ denotes the power function of the test $T$.

Definition 18.10. Given $0 \le \delta \le 1$, a test (or test procedure) for testing the null hypothesis $H_o: \theta \in \Omega_o$ against the alternative $H_a: \theta \in \Omega_a$ is said to be a test of size $\delta$ if

$$\max_{\theta \in \Omega_o} \pi(\theta) = \delta.$$

Definition 18.11. Let $T$ be a test procedure for testing the null hypothesis $H_o: \theta \in \Omega_o$ against the alternative $H_a: \theta \in \Omega_a$. The test (or test procedure) $T$ is said to be the uniformly most powerful (UMP) test of level $\delta$ if $T$ is of level $\delta$ and for any other test $W$ of level $\delta$,

$$\pi_T(\theta) \ge \pi_W(\theta)$$

for all $\theta \in \Omega_a$. Here $\pi_T(\theta)$ and $\pi_W(\theta)$ denote the power functions of tests $T$ and $W$, respectively.

Remark 18.5. If $T$ is a test procedure for testing $H_o: \theta = \theta_o$ against $H_a: \theta = \theta_a$ based on sample data $x_1, ..., x_n$ from a population $X$ with a continuous probability density function $f(x;\theta)$, then there is a critical region $C$ associated with the test procedure $T$, and the power function of $T$ can be computed as

$$\pi_T = \int_C L(\theta_a, x_1, ..., x_n) \, dx_1 \cdots dx_n.$$

Similarly, the size of a critical region $C$, say $\alpha$, can be given by

$$\alpha = \int_C L(\theta_o, x_1, ..., x_n) \, dx_1 \cdots dx_n.$$

The following famous result tells us which tests are uniformly most powerful if the null hypothesis and the alternative hypothesis are both simple.

Theorem 18.1 (Neyman-Pearson). Let $X_1, X_2, ..., X_n$ be a random sample from a population with probability density function $f(x;\theta)$. Let

$$L(\theta, x_1, ..., x_n) = \prod_{i=1}^{n} f(x_i;\theta)$$

be the likelihood function of the sample. Then any critical region $C$ of the form

$$C = \left\{ (x_1, x_2, ..., x_n) \;\Big|\; \frac{L(\theta_o, x_1, ..., x_n)}{L(\theta_a, x_1, ..., x_n)} \le k \right\}$$

for some constant $0 \le k < \infty$ is best (or uniformly most powerful) of its size for testing $H_o: \theta = \theta_o$ against $H_a: \theta = \theta_a$.

Proof: We assume that the population has a continuous probability density function. If the population has a discrete distribution, the proof can be appropriately modified by replacing integration by summation.

Let $C$ be the critical region of size $\alpha$ as described in the statement of the theorem. Let $B$ be any other critical region of size $\alpha$. We want to show that the power of $C$ is greater than or equal to that of $B$. In view of Remark 18.5, we would like to show that

$$\int_C L(\theta_a, x_1, ..., x_n) \, dx_1 \cdots dx_n \ge \int_B L(\theta_a, x_1, ..., x_n) \, dx_1 \cdots dx_n. \qquad (1)$$

Since $C$ and $B$ are both critical regions of size $\alpha$, we have

$$\int_C L(\theta_o, x_1, ..., x_n) \, dx_1 \cdots dx_n = \int_B L(\theta_o, x_1, ..., x_n) \, dx_1 \cdots dx_n. \qquad (2)$$

The last equality (2) can be written as

$$\begin{aligned}
&\int_{C \cap B} L(\theta_o, x_1, ..., x_n) \, dx_1 \cdots dx_n + \int_{C \cap B^c} L(\theta_o, x_1, ..., x_n) \, dx_1 \cdots dx_n \\
&\quad = \int_{C \cap B} L(\theta_o, x_1, ..., x_n) \, dx_1 \cdots dx_n + \int_{C^c \cap B} L(\theta_o, x_1, ..., x_n) \, dx_1 \cdots dx_n
\end{aligned}$$

since

$$C = (C \cap B) \cup (C \cap B^c) \quad \text{and} \quad B = (C \cap B) \cup (C^c \cap B). \qquad (3)$$

Therefore from the last equality, we have

$$\int_{C \cap B^c} L(\theta_o, x_1, ..., x_n) \, dx_1 \cdots dx_n = \int_{C^c \cap B} L(\theta_o, x_1, ..., x_n) \, dx_1 \cdots dx_n. \qquad (4)$$

Since

$$C = \left\{ (x_1, x_2, ..., x_n) \;\Big|\; \frac{L(\theta_o, x_1, ..., x_n)}{L(\theta_a, x_1, ..., x_n)} \le k \right\} \qquad (5)$$

we have

$$L(\theta_a, x_1, ..., x_n) \ge \frac{L(\theta_o, x_1, ..., x_n)}{k} \qquad (6)$$

on $C$, and

$$L(\theta_a, x_1, ..., x_n) < \frac{L(\theta_o, x_1, ..., x_n)}{k} \qquad (7)$$

on $C^c$. Therefore from (4), (6) and (7), we have

$$\begin{aligned}
\int_{C \cap B^c} L(\theta_a, x_1, ..., x_n) \, dx_1 \cdots dx_n
&\ge \int_{C \cap B^c} \frac{L(\theta_o, x_1, ..., x_n)}{k} \, dx_1 \cdots dx_n \\
&= \int_{C^c \cap B} \frac{L(\theta_o, x_1, ..., x_n)}{k} \, dx_1 \cdots dx_n \\
&\ge \int_{C^c \cap B} L(\theta_a, x_1, ..., x_n) \, dx_1 \cdots dx_n.
\end{aligned}$$

Thus, we obtain

$$\int_{C \cap B^c} L(\theta_a, x_1, ..., x_n) \, dx_1 \cdots dx_n \ge \int_{C^c \cap B} L(\theta_a, x_1, ..., x_n) \, dx_1 \cdots dx_n.$$

From (3) and the last inequality, we see that

$$\begin{aligned}
\int_C L(\theta_a, x_1, ..., x_n) \, dx_1 \cdots dx_n
&= \int_{C \cap B} L(\theta_a, x_1, ..., x_n) \, dx_1 \cdots dx_n + \int_{C \cap B^c} L(\theta_a, x_1, ..., x_n) \, dx_1 \cdots dx_n \\
&\ge \int_{C \cap B} L(\theta_a, x_1, ..., x_n) \, dx_1 \cdots dx_n + \int_{C^c \cap B} L(\theta_a, x_1, ..., x_n) \, dx_1 \cdots dx_n \\
&= \int_B L(\theta_a, x_1, ..., x_n) \, dx_1 \cdots dx_n
\end{aligned}$$

and hence the theorem is proved.

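Now we give several examples to illustrate the use of this theorem.

Example 18.13. Let $X$ be a random variable with a density function $f(x)$. What is the critical region for the best test of

$$H_o: f(x) = \begin{cases} \frac{1}{2} & \text{if } -1 < x < 1 \\ 0 & \text{elsewhere,} \end{cases}$$

against

$$H_a: f(x) = \begin{cases} 1 - |x| & \text{if } -1 < x < 1 \\ 0 & \text{elsewhere,} \end{cases}$$

at the significance size $\alpha = 0.10$?

Answer: We assume that the test is performed with a sample of size 1. Using the Neyman-Pearson Theorem, the best critical region for the best test at the significance size $\alpha$ is given by

$$\begin{aligned}
C &= \left\{ x \in \mathbb{R} \;\Big|\; \frac{L_o(x)}{L_a(x)} \le k \right\} \\
&= \left\{ x \in \mathbb{R} \;\Big|\; \frac{\frac{1}{2}}{1 - |x|} \le k \right\} \\
&= \left\{ x \in \mathbb{R} \;\Big|\; |x| \le 1 - \frac{1}{2k} \right\} \\
&= \left\{ x \in \mathbb{R} \;\Big|\; \frac{1}{2k} - 1 \le x \le 1 - \frac{1}{2k} \right\}.
\end{aligned}$$

Since

$$\begin{aligned}
0.1 &= P(C) \\
&= P\left( \frac{L_o(X)}{L_a(X)} \le k \,\Big/\, H_o \text{ is true} \right) \\
&= P\left( \frac{\frac{1}{2}}{1 - |X|} \le k \,\Big/\, H_o \text{ is true} \right) \\
&= P\left( \frac{1}{2k} - 1 \le X \le 1 - \frac{1}{2k} \,\Big/\, H_o \text{ is true} \right) \\
&= \int_{\frac{1}{2k} - 1}^{1 - \frac{1}{2k}} \frac{1}{2} \, dx \\
&= 1 - \frac{1}{2k},
\end{aligned}$$

we get the critical region $C$ to be

$$C = \{ x \in \mathbb{R} \mid -0.1 \le x \le 0.1 \}.$$

Thus the best critical region is $C = [-0.1, 0.1]$ and the best test is: "Reject $H_o$ if $-0.1 \le X \le 0.1$".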
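To see what "best of its size" means in Example 18.13, one can compare the Neyman-Pearson region $[-0.1, 0.1]$ with another region of the same size. In the sketch below the competing region $(-1, -0.9] \cup [0.9, 1)$ is a hypothetical choice made only for illustration; both probabilities are computed from the closed-form distribution functions of the two densities.

```python
def prob_under_null(a, b):
    """P(a < X < b) when f(x) = 1/2 on (-1, 1); assumes -1 <= a <= b <= 1."""
    return (b - a) / 2

def triangular_cdf(x):
    """CDF of the alternative density f(x) = 1 - |x| on (-1, 1)."""
    if x <= 0:
        return (1 + x) ** 2 / 2
    return 1 - (1 - x) ** 2 / 2

def prob_under_alt(a, b):
    return triangular_cdf(b) - triangular_cdf(a)

best = [(-0.1, 0.1)]                   # Neyman-Pearson region
other = [(-1.0, -0.9), (0.9, 1.0)]     # hypothetical competitor of equal size

for name, region in (("best", best), ("other", other)):
    size = sum(prob_under_null(a, b) for a, b in region)
    power = sum(prob_under_alt(a, b) for a, b in region)
    print(name, round(size, 4), round(power, 4))
```

Both regions have size 0.1, but the central region has much higher power because the alternative density is concentrated near zero.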

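Example 18.14. Suppose $X$ has the density function

$$f(x;\theta) = \begin{cases} (1 + \theta) \, x^{\theta} & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

Based on a single observed value of $X$, find the most powerful critical region of size $\alpha = 0.1$ for testing $H_o: \theta = 1$ against $H_a: \theta = 2$.

Answer: By the Neyman-Pearson Theorem, the form of the critical region is given by

$$\begin{aligned}
C &= \left\{ x \in \mathbb{R} \;\Big|\; \frac{L(\theta_o, x)}{L(\theta_a, x)} \le k \right\} \\
&= \left\{ x \in \mathbb{R} \;\Big|\; \frac{(1 + \theta_o) \, x^{\theta_o}}{(1 + \theta_a) \, x^{\theta_a}} \le k \right\} \\
&= \left\{ x \in \mathbb{R} \;\Big|\; \frac{2x}{3x^2} \le k \right\} \\
&= \left\{ x \in \mathbb{R} \;\Big|\; \frac{1}{x} \le \frac{3}{2} k \right\} \\
&= \{ x \in \mathbb{R} \mid x \ge a \},
\end{aligned}$$

where $a$ is some constant. Hence the most powerful or best test is of the form: "Reject $H_o$ if $X \ge a$."

Since the significance level of the test is given to be $\alpha = 0.1$, the constant $a$ can be determined. Now we proceed to find $a$. Since

$$\begin{aligned}
0.1 &= \alpha \\
&= P(\text{Reject } H_o \,/\, H_o \text{ is true}) \\
&= P(X \ge a \,/\, \theta = 1) \\
&= \int_a^1 2x \, dx \\
&= 1 - a^2,
\end{aligned}$$

hence

$$a^2 = 1 - 0.1 = 0.9.$$

Therefore

$$a = \sqrt{0.9},$$

since $k$ in the Neyman-Pearson Theorem is positive. Hence, the most powerful test is given by "Reject $H_o$ if $X \ge \sqrt{0.9}$".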
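The size of the test in Example 18.14 can be confirmed, and its power at $\theta = 2$ evaluated, from the distribution function $F(x;\theta) = x^{\theta+1}$ on $[0, 1]$. The power value is not computed in the text; the sketch below is only an illustration with assumed variable names.

```python
from math import sqrt

def cdf(x, theta):
    """CDF of f(x; theta) = (1 + theta) * x**theta on [0, 1]."""
    return x ** (theta + 1)

a = sqrt(0.9)
size = 1 - cdf(a, theta=1)    # P(X >= a) under H_o
power = 1 - cdf(a, theta=2)   # P(X >= a) under H_a
print(round(size, 4), round(power, 4))
```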

Example 18.15. Suppose that $X$ is a random variable about which the hypothesis $H_o: X \sim UNIF(0, 1)$ against $H_a: X \sim N(0, 1)$ is to be tested. What is the most powerful test with a significance level $\alpha = 0.05$ based on one observation of $X$?

Answer: By the Neyman-Pearson Theorem, the form of the critical region is given by

$$\begin{aligned}
C &= \left\{ x \in \mathbb{R} \;\Big|\; \frac{L_o(x)}{L_a(x)} \le k \right\} \\
&= \left\{ x \in \mathbb{R} \;\Big|\; \sqrt{2\pi} \, e^{\frac{1}{2} x^2} \le k \right\} \\
&= \left\{ x \in \mathbb{R} \;\Big|\; x^2 \le 2 \ln\left( \frac{k}{\sqrt{2\pi}} \right) \right\} \\
&= \{ x \in \mathbb{R} \mid x \le a \},
\end{aligned}$$

where $a$ is some constant. Hence the most powerful or best test is of the form: "Reject $H_o$ if $X \le a$."

Since the significance level of the test is given to be $\alpha = 0.05$, the constant $a$ can be determined. Now we proceed to find $a$. Since

$$\begin{aligned}
0.05 &= \alpha \\
&= P(\text{Reject } H_o \,/\, H_o \text{ is true}) \\
&= P(X \le a \,/\, X \sim UNIF(0, 1)) \\
&= \int_0^a dx \\
&= a,
\end{aligned}$$

hence $a = 0.05$. Thus, the most powerful critical region is given by

$$C = \{ x \in \mathbb{R} \mid 0 < x \le 0.05 \}$$

based on the support of the uniform distribution on the open interval $(0, 1)$. Since the support of this uniform distribution is the interval $(0, 1)$, the acceptance region (or the complement of $C$ in $(0, 1)$) is

$$C^c = \{ x \in \mathbb{R} \mid 0.05 < x < 1 \}.$$

However, since the support of the standard normal distribution is $\mathbb{R}$, the actual critical region should be the complement of $C^c$ in $\mathbb{R}$. Therefore, the critical region of this hypothesis test is the set

$$\{ x \in \mathbb{R} \mid x \le 0.05 \text{ or } x \ge 1 \}.$$

The most powerful test for $\alpha = 0.05$ is: "Reject $H_o$ if $X \le 0.05$ or $X \ge 1$."

Example 18.16. Let $X_1, X_2, X_3$ denote three independent observations from a distribution with density

$$f(x;\theta) = \begin{cases} (1 + \theta) \, x^{\theta} & \text{if } 0 \le x \le 1 \\ 0 & \text{otherwise.} \end{cases}$$

What is the form of the best critical region of size 0.034 for testing $H_o: \theta = 1$ versus $H_a: \theta = 2$?

Answer: By the Neyman-Pearson Theorem, the form of the critical region is given by (with $\theta_o = 1$ and $\theta_a = 2$)

$$\begin{aligned}
C &= \left\{ (x_1, x_2, x_3) \in \mathbb{R}^3 \;\Big|\; \frac{L(\theta_o, x_1, x_2, x_3)}{L(\theta_a, x_1, x_2, x_3)} \le k \right\} \\
&= \left\{ (x_1, x_2, x_3) \in \mathbb{R}^3 \;\Big|\; \frac{(1 + \theta_o)^3 \prod_{i=1}^{3} x_i^{\theta_o}}{(1 + \theta_a)^3 \prod_{i=1}^{3} x_i^{\theta_a}} \le k \right\} \\
&= \left\{ (x_1, x_2, x_3) \in \mathbb{R}^3 \;\Big|\; \frac{8 \, x_1 x_2 x_3}{27 \, x_1^2 x_2^2 x_3^2} \le k \right\} \\
&= \left\{ (x_1, x_2, x_3) \in \mathbb{R}^3 \;\Big|\; \frac{1}{x_1 x_2 x_3} \le \frac{27}{8} k \right\} \\
&= \left\{ (x_1, x_2, x_3) \in \mathbb{R}^3 \;\Big|\; x_1 x_2 x_3 \ge a \right\},
\end{aligned}$$

where $a$ is some constant. Hence the most powerful or best test is of the form: "Reject $H_o$ if $\prod_{i=1}^{3} X_i \ge a$."

Since the significance level of the test is given to be $\alpha = 0.034$, the constant $a$ can be determined. To evaluate the constant $a$, we need the probability distribution of $X_1 X_2 X_3$. The distribution of $X_1 X_2 X_3$ is not easy to get. Hence, we will use Theorem 17.5. There, we have shown that $-2(1 + \theta) \sum_{i=1}^{3} \ln X_i \sim \chi^2(6)$. Now we proceed to find $a$. Since

$$\begin{aligned}
0.034 &= \alpha \\
&= P(\text{Reject } H_o \,/\, H_o \text{ is true}) \\
&= P(X_1 X_2 X_3 \ge a \,/\, \theta = 1) \\
&= P(\ln(X_1 X_2 X_3) \ge \ln a \,/\, \theta = 1) \\
&= P(-2(1 + \theta) \ln(X_1 X_2 X_3) \le -2(1 + \theta) \ln a \,/\, \theta = 1) \\
&= P(-4 \ln(X_1 X_2 X_3) \le -4 \ln a) \\
&= P\left( \chi^2(6) \le -4 \ln a \right),
\end{aligned}$$

hence from the chi-square table, we get

$$-4 \ln a = 1.4.$$

Therefore

$$a = e^{-0.35} = 0.7047.$$

Hence, the most powerful test is given by "Reject $H_o$ if $X_1 X_2 X_3 \ge 0.7047$".

The critical region $C$ is the region above the surface $x_1 x_2 x_3 = 0.7047$ of the unit cube $[0, 1]^3$. The following figure illustrates this region.

[Figure: the critical region is to the right of the shaded surface.]
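The chi-square table value used in Example 18.16 can be checked without a table, because a chi-square distribution with an even number of degrees of freedom has a closed-form distribution function. The helper `chi2_cdf_even` below is an illustrative name, and the formula applies only to even degrees of freedom.

```python
from math import exp, factorial, log

def chi2_cdf_even(x, df):
    """CDF of chi-square with even df = 2m: 1 - exp(-x/2) * sum_{j<m} (x/2)^j / j!."""
    m = df // 2
    half = x / 2
    return 1 - exp(-half) * sum(half**j / factorial(j) for j in range(m))

a = 0.7047
# Size of the region x1 * x2 * x3 >= a under theta = 1 is P(chi2(6) <= -4 ln a).
size = chi2_cdf_even(-4 * log(a), df=6)
print(round(size, 4))
```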

Example 18.17. Let $X_1, X_2, ..., X_{12}$ be a random sample from a normal population with mean zero and variance $\sigma^2$. What is the most powerful test of size 0.025 for testing the null hypothesis $H_o: \sigma^2 = 10$ versus $H_a: \sigma^2 = 5$?

Answer: By the Neyman-Pearson Theorem, the form of the critical region is given by (with $\sigma_o^2 = 10$ and $\sigma_a^2 = 5$)

$$\begin{aligned}
C &= \left\{ (x_1, x_2, ..., x_{12}) \in \mathbb{R}^{12} \;\Big|\; \frac{L(\sigma_o^2, x_1, x_2, ..., x_{12})}{L(\sigma_a^2, x_1, x_2, ..., x_{12})} \le k \right\} \\
&= \left\{ (x_1, x_2, ..., x_{12}) \in \mathbb{R}^{12} \;\Big|\; \prod_{i=1}^{12} \frac{\frac{1}{\sqrt{2\pi\sigma_o^2}} \, e^{-\frac{1}{2} \left( \frac{x_i}{\sigma_o} \right)^2}}{\frac{1}{\sqrt{2\pi\sigma_a^2}} \, e^{-\frac{1}{2} \left( \frac{x_i}{\sigma_a} \right)^2}} \le k \right\} \\
&= \left\{ (x_1, x_2, ..., x_{12}) \in \mathbb{R}^{12} \;\Big|\; \left( \frac{1}{2} \right)^6 e^{\frac{1}{20} \sum_{i=1}^{12} x_i^2} \le k \right\} \\
&= \left\{ (x_1, x_2, ..., x_{12}) \in \mathbb{R}^{12} \;\Big|\; \sum_{i=1}^{12} x_i^2 \le a \right\},
\end{aligned}$$

where $a$ is some constant. Hence the most powerful or best test is of the form: "Reject $H_o$ if $\sum_{i=1}^{12} X_i^2 \le a$."

Since the significance level of the test is given to be $\alpha = 0.025$, the constant $a$ can be determined. To evaluate the constant $a$, we need the probability distribution of $X_1^2 + X_2^2 + \cdots + X_{12}^2$. It can be shown that $\sum_{i=1}^{12} \left( \frac{X_i}{\sigma} \right)^2 \sim \chi^2(12)$. Now we proceed to find $a$. Since

$$\begin{aligned}
0.025 &= \alpha \\
&= P(\text{Reject } H_o \,/\, H_o \text{ is true}) \\
&= P\left( \sum_{i=1}^{12} X_i^2 \le a \,\Big/\, \sigma^2 = 10 \right) \\
&= P\left( \sum_{i=1}^{12} \left( \frac{X_i}{\sqrt{10}} \right)^2 \le \frac{a}{10} \,\Big/\, \sigma^2 = 10 \right) \\
&= P\left( \chi^2(12) \le \frac{a}{10} \right),
\end{aligned}$$

hence from the chi-square table, we get

$$\frac{a}{10} = 4.4.$$

Therefore

$$a = 44.$$

Hence, the most powerful test is given by "Reject $H_o$ if $\sum_{i=1}^{12} X_i^2 \le 44$." The best critical region of size 0.025 is given by

$$C = \left\{ (x_1, x_2, ..., x_{12}) \in \mathbb{R}^{12} \;\Big|\; \sum_{i=1}^{12} x_i^2 \le 44 \right\}.$$
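The cutoff in Example 18.17 is a lower quantile of the $\chi^2(12)$ distribution, which can be found by bisection on the closed-form distribution function for even degrees of freedom. This is a sketch under that assumption; the function names and the bracketing interval are illustrative choices.

```python
from math import exp, factorial

def chi2_cdf_even(x, df):
    """CDF of chi-square with even df = 2m (closed form, even df only)."""
    m = df // 2
    half = x / 2
    return 1 - exp(-half) * sum(half**j / factorial(j) for j in range(m))

def chi2_quantile_even(prob, df, lo=0.0, hi=200.0):
    """Solve chi2_cdf_even(x, df) = prob by bisection on [lo, hi]."""
    for _ in range(200):
        mid = (lo + hi) / 2
        if chi2_cdf_even(mid, df) < prob:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

sigma0_sq = 10
a = sigma0_sq * chi2_quantile_even(0.025, df=12)
print(round(a, 2))  # close to the table-based value 44
```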

In the last five examples, we have found the most powerful tests and corresponding critical regions when both $H_o$ and $H_a$ are simple hypotheses. If either $H_o$ or $H_a$ is not simple, then it is not always possible to find the most powerful test and corresponding critical region. In this situation, a hypothesis test is found by using the likelihood ratio. A test obtained by using the likelihood ratio is called the likelihood ratio test and the corresponding critical region is called the likelihood ratio critical region.

18.4. Some Examples of Likelihood Ratio Tests

In this section, we illustrate, using the likelihood ratio, how one can construct a hypothesis test when one of the hypotheses is not simple. As pointed out earlier, the test we will construct using the likelihood ratio need not be the most powerful test. However, such a test has all the desirable properties of a hypothesis test. To construct the test one has to follow a sequence of steps. These steps are outlined below:

(1) Find the likelihood function $L(\theta, x_1, x_2, ..., x_n)$ for the given sample.

(2) Evaluate $\max_{\theta \in \Omega_o} L(\theta, x_1, x_2, ..., x_n)$.

(3) Find the maximum likelihood estimator $\widehat{\theta}$ of $\theta$.

(4) Compute $\max_{\theta \in \Omega} L(\theta, x_1, x_2, ..., x_n)$ using $L(\widehat{\theta}, x_1, x_2, ..., x_n)$.

(5) Using steps (2) and (4), find $W(x_1, ..., x_n) = \dfrac{\max_{\theta \in \Omega_o} L(\theta, x_1, x_2, ..., x_n)}{\max_{\theta \in \Omega} L(\theta, x_1, x_2, ..., x_n)}$.

(6) Using step (5) determine $C = \{ (x_1, x_2, ..., x_n) \mid W(x_1, ..., x_n) \le k \}$, where $k \in [0, 1]$.

(7) Reduce $W(x_1, ..., x_n) \le k$ to an equivalent inequality $\widehat{W}(x_1, ..., x_n) \le A$.

(8) Determine the distribution of $\widehat{W}(x_1, ..., x_n)$.

(9) Find $A$ such that the given $\alpha$ equals $P\left( \widehat{W}(x_1, ..., x_n) \le A \mid H_o \text{ is true} \right)$.
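Steps (1) through (5) can be sketched numerically for one concrete model. The example below assumes a normal population with known variance and a simple null $\mu = \mu_o$, so that $\Omega_o$ is a single point and the maximum likelihood estimator is the sample mean. The sample values and function names are made up for illustration.

```python
from math import exp, log, pi

def log_likelihood(mu, xs, sigma=1.0):
    """Step (1): log of the normal likelihood with known sigma."""
    n = len(xs)
    return (-n * log(sigma) - n / 2 * log(2 * pi)
            - sum((x - mu) ** 2 for x in xs) / (2 * sigma**2))

def likelihood_ratio(xs, mu0, sigma=1.0):
    """Steps (2)-(5): W = max over Omega_o of L divided by max over Omega of L."""
    null_max = log_likelihood(mu0, xs, sigma)      # Omega_o = {mu0}
    mu_hat = sum(xs) / len(xs)                     # MLE of mu
    full_max = log_likelihood(mu_hat, xs, sigma)
    return exp(null_max - full_max)

sample = [4.1, 5.3, 6.0, 4.8]   # hypothetical data
w = likelihood_ratio(sample, mu0=5.0)
print(w)  # always lies in (0, 1]; values near 1 support H_o
```

Working with log-likelihoods and exponentiating the difference avoids underflow when the sample is large.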

In the remaining examples, for notational simplicity, we will denote the likelihood function $L(\theta, x_1, x_2, ..., x_n)$ simply as $L(\theta)$.

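Example 18.19. Let $X_1, X_2, ..., X_n$ be a random sample from a normal population with mean $\mu$ and known variance $\sigma^2$. What is the likelihood ratio test of size $\alpha$ for testing the null hypothesis $H_o: \mu = \mu_o$ versus the alternative hypothesis $H_a: \mu \ne \mu_o$?

Answer: The likelihood function of the sample is given by

$$\begin{aligned}
L(\mu) &= \prod_{i=1}^{n} \left( \frac{1}{\sigma\sqrt{2\pi}} \right) e^{-\frac{1}{2\sigma^2} (x_i - \mu)^2} \\
&= \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^n e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2}.
\end{aligned}$$

Since $\Omega_o = \{\mu_o\}$, we obtain

$$\begin{aligned}
\max_{\mu \in \Omega_o} L(\mu) &= L(\mu_o) \\
&= \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^n e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu_o)^2}.
\end{aligned}$$

We have seen in Example 15.13 that if $X \sim N(\mu, \sigma^2)$, then the maximum likelihood estimator of $\mu$ is $\overline{X}$, that is

$$\widehat{\mu} = \overline{X}.$$

Hence

$$\max_{\mu \in \Omega} L(\mu) = L(\widehat{\mu}) = \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^n e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \overline{x})^2}.$$

Now the likelihood ratio statistic $W(x_1, x_2, ..., x_n)$ is given by

$$W(x_1, x_2, ..., x_n) = \frac{\left( \frac{1}{\sigma\sqrt{2\pi}} \right)^n e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu_o)^2}}{\left( \frac{1}{\sigma\sqrt{2\pi}} \right)^n e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \overline{x})^2}}$$

which simplifies to

$$W(x_1, x_2, ..., x_n) = e^{-\frac{n}{2\sigma^2} (\overline{x} - \mu_o)^2}.$$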

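Now the inequality $W(x_1, x_2, ..., x_n) \le k$ becomes

$$e^{-\frac{n}{2\sigma^2} (\overline{x} - \mu_o)^2} \le k,$$

which can be rewritten as

$$(\overline{x} - \mu_o)^2 \ge -\frac{2\sigma^2}{n} \ln(k)$$

or

$$|\overline{x} - \mu_o| \ge K$$

where $K = \sqrt{-\frac{2\sigma^2}{n} \ln(k)}$. In view of the above inequality, the critical region can be described as

$$C = \{ (x_1, x_2, ..., x_n) \mid |\overline{x} - \mu_o| \ge K \}.$$

Since we are given the size of the critical region to be $\alpha$, we can determine the constant $K$. Since the size of the critical region is $\alpha$, we have

$$\alpha = P\left( \left| \overline{X} - \mu_o \right| \ge K \right).$$

For finding $K$, we need the probability density function of the statistic $\overline{X} - \mu_o$ when the population $X$ is $N(\mu, \sigma^2)$ and the null hypothesis $H_o: \mu = \mu_o$ is true. Since $\sigma^2$ is known and $X_i \sim N(\mu, \sigma^2)$,

$$\frac{\overline{X} - \mu_o}{\frac{\sigma}{\sqrt{n}}} \sim N(0, 1)$$

and

$$\begin{aligned}
\alpha &= P\left( \left| \overline{X} - \mu_o \right| \ge K \right) \\
&= P\left( \left| \frac{\overline{X} - \mu_o}{\frac{\sigma}{\sqrt{n}}} \right| \ge \frac{K\sqrt{n}}{\sigma} \right) \\
&= P\left( |Z| \ge \frac{K\sqrt{n}}{\sigma} \right) \quad \text{where } Z = \frac{\overline{X} - \mu_o}{\frac{\sigma}{\sqrt{n}}} \\
&= 1 - P\left( -\frac{K\sqrt{n}}{\sigma} \le Z \le \frac{K\sqrt{n}}{\sigma} \right),
\end{aligned}$$

we get

$$z_{\frac{\alpha}{2}} = \frac{K\sqrt{n}}{\sigma}$$

which is

$$K = z_{\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}},$$

where $z_{\frac{\alpha}{2}}$ is a real number such that the integral of the standard normal density from $z_{\frac{\alpha}{2}}$ to $\infty$ equals $\frac{\alpha}{2}$.

Hence, the likelihood ratio test is given by "Reject $H_o$ if

$$\left| \overline{X} - \mu_o \right| \ge z_{\frac{\alpha}{2}} \frac{\sigma}{\sqrt{n}}.\text{"}$$

If we denote

$$z = \frac{\overline{x} - \mu_o}{\frac{\sigma}{\sqrt{n}}}$$

then the above inequality becomes

$$|z| \ge z_{\frac{\alpha}{2}}.$$

Thus the critical region is given by

$$C = \left\{ (x_1, x_2, ..., x_n) \mid |z| \ge z_{\frac{\alpha}{2}} \right\}.$$

This tells us that the null hypothesis must be rejected when the absolute value of $z$ takes on a value greater than or equal to $z_{\frac{\alpha}{2}}$.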

Remark 18.6. The hypothesis $H_a: \mu \ne \mu_o$ is called a two-sided alternative hypothesis. An alternative hypothesis of the form $H_a: \mu > \mu_o$ is called a right-sided alternative. Similarly, $H_a: \mu < \mu_o$ is called a left-sided alternative. In the above example, if we had a right-sided alternative, that is $H_a: \mu > \mu_o$, then the critical region would have been

$$C = \{ (x_1, x_2, ..., x_n) \mid z \ge z_{\alpha} \}.$$

Similarly, if the alternative had been left-sided, that is $H_a: \mu < \mu_o$, then the critical region would have been

$$C = \{ (x_1, x_2, ..., x_n) \mid z \le -z_{\alpha} \}.$$

We summarize the three cases of hypothesis tests of the mean (of the normal population with known variance) in the following table.

H_o              H_a                Critical Region (or Test)
$\mu = \mu_o$    $\mu > \mu_o$      $z = \frac{\overline{x} - \mu_o}{\sigma/\sqrt{n}} \ge z_{\alpha}$
$\mu = \mu_o$    $\mu < \mu_o$      $z = \frac{\overline{x} - \mu_o}{\sigma/\sqrt{n}} \le -z_{\alpha}$
$\mu = \mu_o$    $\mu \ne \mu_o$    $|z| = \left| \frac{\overline{x} - \mu_o}{\sigma/\sqrt{n}} \right| \ge z_{\frac{\alpha}{2}}$
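The three rejection rules for a normal mean with known variance translate directly into a small decision function. The sketch below assumes `statistics.NormalDist` for the quantiles $z_{\alpha}$ and $z_{\alpha/2}$; the function name `z_test`, the labels for the alternatives, and the data are illustrative and not from the text.

```python
from math import sqrt
from statistics import NormalDist

def z_test(xs, mu0, sigma, alpha, alternative="two-sided"):
    """Return (z, reject) for H_o: mu = mu0 with known sigma."""
    n = len(xs)
    z = (sum(xs) / n - mu0) / (sigma / sqrt(n))
    upper = NormalDist().inv_cdf             # upper(1 - a) is z_a in the text
    if alternative == "greater":             # H_a: mu > mu0
        reject = z >= upper(1 - alpha)
    elif alternative == "less":              # H_a: mu < mu0
        reject = z <= -upper(1 - alpha)
    elif alternative == "two-sided":         # H_a: mu != mu0
        reject = abs(z) >= upper(1 - alpha / 2)
    else:
        raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")
    return z, reject

data = [12, 11, 13, 12]   # hypothetical sample
for alt in ("greater", "less", "two-sided"):
    print(alt, z_test(data, mu0=10, sigma=2, alpha=0.05, alternative=alt))
```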

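Example 18.20. Let $X_1, X_2, ..., X_n$ be a random sample from a normal population with mean $\mu$ and unknown variance $\sigma^2$. What is the likelihood ratio test of size $\alpha$ for testing the null hypothesis $H_o: \mu = \mu_o$ versus the alternative hypothesis $H_a: \mu \ne \mu_o$?

Answer: In this example,

$$\begin{aligned}
\Omega &= \left\{ (\mu, \sigma^2) \in \mathbb{R}^2 \mid -\infty < \mu < \infty, \ \sigma^2 > 0 \right\}, \\
\Omega_o &= \left\{ (\mu_o, \sigma^2) \in \mathbb{R}^2 \mid \sigma^2 > 0 \right\}, \\
\Omega_a &= \left\{ (\mu, \sigma^2) \in \mathbb{R}^2 \mid \mu \ne \mu_o, \ \sigma^2 > 0 \right\}.
\end{aligned}$$

These sets are illustrated below.

[Figure: the parameter sets $\Omega$, $\Omega_o$ and $\Omega_a$.]

The likelihood function is given by

$$\begin{aligned}
L(\mu, \sigma^2) &= \prod_{i=1}^{n} \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right) e^{-\frac{1}{2} \left( \frac{x_i - \mu}{\sigma} \right)^2} \\
&= \left( \frac{1}{\sqrt{2\pi\sigma^2}} \right)^n e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2}.
\end{aligned}$$

Next, we find the maximum of $L(\mu, \sigma^2)$ on the set $\Omega_o$. Since the set $\Omega_o$ is equal to $\left\{ (\mu_o, \sigma^2) \in \mathbb{R}^2 \mid 0 < \sigma < \infty \right\}$, we have

$$\max_{(\mu, \sigma^2) \in \Omega_o} L(\mu, \sigma^2) = \max_{\sigma^2 > 0} L(\mu_o, \sigma^2).$$

Since $L(\mu_o, \sigma^2)$ and $\ln L(\mu_o, \sigma^2)$ achieve the maximum at the same $\sigma$ value, we determine the value of $\sigma$ where $\ln L(\mu_o, \sigma^2)$ achieves the maximum. Taking the natural logarithm of the likelihood function, we get

$$\ln L(\mu_o, \sigma^2) = -\frac{n}{2} \ln(\sigma^2) - \frac{n}{2} \ln(2\pi) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu_o)^2.$$

Differentiating $\ln L(\mu_o, \sigma^2)$ with respect to $\sigma^2$, we get from the last equality

$$\frac{d}{d\sigma^2} \ln L(\mu_o, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu_o)^2.$$

Setting this derivative to zero and solving for $\sigma$, we obtain

$$\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_o)^2}.$$

Thus $\ln L(\mu_o, \sigma^2)$ attains its maximum at $\sigma = \sqrt{\frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_o)^2}$. Since this value of $\sigma$ also yields the maximum value of $L(\mu_o, \sigma^2)$, we have

$$\max_{\sigma^2 > 0} L(\mu_o, \sigma^2) = \left( 2\pi \, \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_o)^2 \right)^{-\frac{n}{2}} e^{-\frac{n}{2}}.$$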

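Next, we determine the maximum of $L(\mu, \sigma^2)$ on the set $\Omega$. As before, we consider $\ln L(\mu, \sigma^2)$ to determine where $L(\mu, \sigma^2)$ achieves its maximum. Taking the natural logarithm of $L(\mu, \sigma^2)$, we obtain

$$\ln L(\mu, \sigma^2) = -\frac{n}{2} \ln(\sigma^2) - \frac{n}{2} \ln(2\pi) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (x_i - \mu)^2.$$

Taking the partial derivatives of $\ln L(\mu, \sigma^2)$ first with respect to $\mu$ and then with respect to $\sigma^2$, we get

$$\frac{\partial}{\partial \mu} \ln L(\mu, \sigma^2) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (x_i - \mu),$$

and

$$\frac{\partial}{\partial \sigma^2} \ln L(\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{n} (x_i - \mu)^2,$$

respectively. Setting these partial derivatives to zero and solving for $\mu$ and $\sigma$, we obtain

$$\mu = \overline{x} \quad \text{and} \quad \sigma^2 = \frac{n-1}{n} s^2,$$

where $s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \overline{x})^2$ is the sample variance.

Substituting these optimal values of $\mu$ and $\sigma$ into $L(\mu, \sigma^2)$, we obtain

$$\max_{(\mu, \sigma^2) \in \Omega} L(\mu, \sigma^2) = \left( 2\pi \, \frac{1}{n} \sum_{i=1}^{n} (x_i - \overline{x})^2 \right)^{-\frac{n}{2}} e^{-\frac{n}{2}}.$$

Hence

$$\frac{\max_{(\mu, \sigma^2) \in \Omega_o} L(\mu, \sigma^2)}{\max_{(\mu, \sigma^2) \in \Omega} L(\mu, \sigma^2)}
= \frac{\left( 2\pi \, \frac{1}{n} \sum_{i=1}^{n} (x_i - \mu_o)^2 \right)^{-\frac{n}{2}} e^{-\frac{n}{2}}}{\left( 2\pi \, \frac{1}{n} \sum_{i=1}^{n} (x_i - \overline{x})^2 \right)^{-\frac{n}{2}} e^{-\frac{n}{2}}}
= \left( \frac{\sum_{i=1}^{n} (x_i - \mu_o)^2}{\sum_{i=1}^{n} (x_i - \overline{x})^2} \right)^{-\frac{n}{2}}.$$

Since

$$\sum_{i=1}^{n} (x_i - \overline{x})^2 = (n - 1) s^2$$

and

$$\sum_{i=1}^{n} (x_i - \mu_o)^2 = \sum_{i=1}^{n} (x_i - \overline{x})^2 + n (\overline{x} - \mu_o)^2,$$

we get

$$W(x_1, x_2, ..., x_n) = \frac{\max_{(\mu, \sigma^2) \in \Omega_o} L(\mu, \sigma^2)}{\max_{(\mu, \sigma^2) \in \Omega} L(\mu, \sigma^2)} = \left( 1 + \frac{n}{n-1} \frac{(\overline{x} - \mu_o)^2}{s^2} \right)^{-\frac{n}{2}}.$$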

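Now the inequality $W(x_1, x_2, ..., x_n) \le k$ becomes

$$\left( 1 + \frac{n}{n-1} \frac{(\overline{x} - \mu_o)^2}{s^2} \right)^{-\frac{n}{2}} \le k,$$

which can be rewritten as

$$\left( \frac{\overline{x} - \mu_o}{s} \right)^2 \ge \frac{n-1}{n} \left( k^{-\frac{2}{n}} - 1 \right)$$

or

$$\left| \frac{\overline{x} - \mu_o}{\frac{s}{\sqrt{n}}} \right| \ge K$$

where $K = \sqrt{(n - 1) \left[ k^{-\frac{2}{n}} - 1 \right]}$. In view of the above inequality, the critical region can be described as

$$C = \left\{ (x_1, x_2, ..., x_n) \;\Big|\; \left| \frac{\overline{x} - \mu_o}{\frac{s}{\sqrt{n}}} \right| \ge K \right\}$$

and the best likelihood ratio test is: "Reject $H_o$ if $\left| \frac{\overline{x} - \mu_o}{s/\sqrt{n}} \right| \ge K$". Since we are given the size of the critical region to be $\alpha$, we can find the constant $K$. For finding $K$, we need the probability density function of the statistic $\frac{\overline{x} - \mu_o}{s/\sqrt{n}}$ when the population $X$ is $N(\mu, \sigma^2)$ and the null hypothesis $H_o: \mu = \mu_o$ is true.

Since the population is normal with mean $\mu$ and variance $\sigma^2$,

$$\frac{\overline{X} - \mu_o}{\frac{S}{\sqrt{n}}} \sim t(n - 1),$$

where $S^2$ is the sample variance and equals $\frac{1}{n-1} \sum_{i=1}^{n} \left( X_i - \overline{X} \right)^2$. Hence

$$K = t_{\frac{\alpha}{2}}(n - 1),$$

where $t_{\frac{\alpha}{2}}(n - 1)$ is a real number such that the integral of the density of the t-distribution with $n - 1$ degrees of freedom from $t_{\frac{\alpha}{2}}(n - 1)$ to $\infty$ equals $\frac{\alpha}{2}$.

Therefore, the likelihood ratio test is given by "Reject $H_o: \mu = \mu_o$ if

$$\left| \overline{X} - \mu_o \right| \ge t_{\frac{\alpha}{2}}(n - 1) \frac{S}{\sqrt{n}}.\text{"}$$

If we denote

$$t = \frac{\overline{x} - \mu_o}{\frac{s}{\sqrt{n}}}$$

then the above inequality becomes

$$|t| \ge t_{\frac{\alpha}{2}}(n - 1).$$

Thus the critical region is given by

$$C = \left\{ (x_1, x_2, ..., x_n) \mid |t| \ge t_{\frac{\alpha}{2}}(n - 1) \right\}.$$

This tells us that the null hypothesis must be rejected when the absolute value of $t$ takes on a value greater than or equal to $t_{\frac{\alpha}{2}}(n - 1)$.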
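The reduction from the likelihood ratio $W$ to the $t$ statistic in Example 18.20 can be checked numerically: computing $W$ from the two maximized likelihoods should agree with $\left(1 + \frac{t^2}{n-1}\right)^{-n/2}$, which is the displayed formula for $W$ rewritten in terms of $t$. The data and function names below are hypothetical.

```python
from math import exp, log, pi, sqrt

def max_log_lik(xs, mu):
    """Normal log-likelihood at mean mu with sigma^2 at its maximizer (1/n) sum (x - mu)^2."""
    n = len(xs)
    var_hat = sum((x - mu) ** 2 for x in xs) / n
    return -n / 2 * log(2 * pi * var_hat) - n / 2

def lrt_and_t(xs, mu0):
    """Return the likelihood ratio W and the t statistic for H_o: mu = mu0."""
    n = len(xs)
    xbar = sum(xs) / n
    s = sqrt(sum((x - xbar) ** 2 for x in xs) / (n - 1))
    w = exp(max_log_lik(xs, mu0) - max_log_lik(xs, xbar))
    t = (xbar - mu0) / (s / sqrt(n))
    return w, t

sample = [9.8, 10.4, 11.1, 10.9, 9.5]   # hypothetical data
w, t = lrt_and_t(sample, mu0=10.0)
n = len(sample)
print(w, (1 + t**2 / (n - 1)) ** (-n / 2))  # the two values should agree
```

Because $W$ is a decreasing function of $|t|$, rejecting for small $W$ is the same as rejecting for large $|t|$.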

Remark 18.7. In the above example, if we had a right-sided alternative, that is $H_a: \mu > \mu_o$, then the critical region would have been

$$C = \{ (x_1, x_2, ..., x_n) \mid t \ge t_{\alpha}(n - 1) \}.$$

Similarly, if the alternative had been left-sided, that is $H_a: \mu < \mu_o$, then the critical region would have been

$$C = \{ (x_1, x_2, ..., x_n) \mid t \le -t_{\alpha}(n - 1) \}.$$

We summarize the three cases of hypothesis tests of the mean (of the normal population with unknown variance) in the following table.

H_o              H_a                Critical Region (or Test)
$\mu = \mu_o$    $\mu > \mu_o$      $t = \frac{\overline{x} - \mu_o}{s/\sqrt{n}} \ge t_{\alpha}(n - 1)$
$\mu = \mu_o$    $\mu < \mu_o$      $t = \frac{\overline{x} - \mu_o}{s/\sqrt{n}} \le -t_{\alpha}(n - 1)$
$\mu = \mu_o$    $\mu \ne \mu_o$    $|t| = \left| \frac{\overline{x} - \mu_o}{s/\sqrt{n}} \right| \ge t_{\frac{\alpha}{2}}(n - 1)$

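Example 18.21. Let $X_1, X_2, ..., X_n$ be a random sample from a normal population with mean $\mu$ and variance $\sigma^2$. What is the likelihood ratio test of significance of size $\alpha$ for testing the null hypothesis $H_o : \sigma^2 = \sigma_o^2$ versus $H_a : \sigma^2 \neq \sigma_o^2$?

Answer: In this example,

$$\Omega = \left\{ (\mu, \sigma^2) \in \mathbb{R}^2 \mid -\infty < \mu < \infty,\; \sigma^2 > 0 \right\},$$
$$\Omega_o = \left\{ (\mu, \sigma_o^2) \in \mathbb{R}^2 \mid -\infty < \mu < \infty \right\},$$
$$\Omega_a = \left\{ (\mu, \sigma^2) \in \mathbb{R}^2 \mid -\infty < \mu < \infty,\; \sigma \neq \sigma_o \right\}.$$

These sets are illustrated below.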

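The likelihood function is given by

$$L(\mu, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left(\frac{x_i - \mu}{\sigma}\right)^2} = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{n} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2}.$$

Next, we find the maximum of $L(\mu, \sigma^2)$ on the set $\Omega_o$. Since the set $\Omega_o$ is equal to $\left\{ (\mu, \sigma_o^2) \in \mathbb{R}^2 \mid -\infty < \mu < \infty \right\}$, we have

$$\max_{(\mu, \sigma^2) \in \Omega_o} L(\mu, \sigma^2) = \max_{-\infty < \mu < \infty} L(\mu, \sigma_o^2).$$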

Since $L(\mu, \sigma_o^2)$ and $\ln L(\mu, \sigma_o^2)$ achieve the maximum at the same $\mu$ value, we determine the value of $\mu$ where $\ln L(\mu, \sigma_o^2)$ achieves the maximum. Taking the natural logarithm of the likelihood function, we get

$$\ln L(\mu, \sigma_o^2) = -\frac{n}{2}\ln(\sigma_o^2) - \frac{n}{2}\ln(2\pi) - \frac{1}{2\sigma_o^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$

Differentiating $\ln L(\mu, \sigma_o^2)$ with respect to $\mu$, we get from the last equality

$$\frac{d}{d\mu} \ln L(\mu, \sigma_o^2) = \frac{1}{\sigma_o^2}\sum_{i=1}^{n}(x_i - \mu).$$

Setting this derivative to zero and solving for $\mu$, we obtain

$$\mu = \bar{x}.$$

Hence, we obtain

$$\max_{-\infty < \mu < \infty} L(\mu, \sigma_o^2) = \left(\frac{1}{2\pi\sigma_o^2}\right)^{\frac{n}{2}} e^{-\frac{1}{2\sigma_o^2}\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$

Next, we determine the maximum of $L(\mu, \sigma^2)$ on the set $\Omega$. As before, we consider $\ln L(\mu, \sigma^2)$ to determine where $L(\mu, \sigma^2)$ achieves its maximum. Taking the natural logarithm of $L(\mu, \sigma^2)$, we obtain

$$\ln L(\mu, \sigma^2) = -n \ln(\sigma) - \frac{n}{2}\ln(2\pi) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(x_i - \mu)^2.$$

Taking the partial derivatives of $\ln L(\mu, \sigma^2)$ first with respect to $\mu$ and then with respect to $\sigma^2$, we get

$$\frac{\partial}{\partial \mu} \ln L(\mu, \sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(x_i - \mu),$$

and

$$\frac{\partial}{\partial \sigma^2} \ln L(\mu, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(x_i - \mu)^2,$$

respectively. Setting these partial derivatives to zero and solving for $\mu$ and $\sigma$, we obtain

$$\mu = \bar{x} \quad \text{and} \quad \sigma^2 = \frac{n-1}{n}\, s^2,$$

where $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2$ is the sample variance.

Substituting these optimal values of $\mu$ and $\sigma$ into $L(\mu, \sigma^2)$, we obtain

$$\max_{(\mu, \sigma^2) \in \Omega} L(\mu, \sigma^2) = \left(\frac{n}{2\pi(n-1)s^2}\right)^{\frac{n}{2}} e^{-\frac{n}{2(n-1)s^2}\sum_{i=1}^{n}(x_i - \bar{x})^2}.$$

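Therefore

$$W(x_1, x_2, ..., x_n) = \frac{\max\limits_{(\mu, \sigma^2) \in \Omega_o} L(\mu, \sigma^2)}{\max\limits_{(\mu, \sigma^2) \in \Omega} L(\mu, \sigma^2)} = \frac{\left(\frac{1}{2\pi\sigma_o^2}\right)^{\frac{n}{2}} e^{-\frac{1}{2\sigma_o^2}\sum_{i=1}^{n}(x_i - \bar{x})^2}}{\left(\frac{n}{2\pi(n-1)s^2}\right)^{\frac{n}{2}} e^{-\frac{n}{2(n-1)s^2}\sum_{i=1}^{n}(x_i - \bar{x})^2}} = n^{-\frac{n}{2}}\, e^{\frac{n}{2}} \left(\frac{(n-1)s^2}{\sigma_o^2}\right)^{\frac{n}{2}} e^{-\frac{(n-1)s^2}{2\sigma_o^2}}.$$

Now the inequality $W(x_1, x_2, ..., x_n) \leq k$ becomes

$$n^{-\frac{n}{2}}\, e^{\frac{n}{2}} \left(\frac{(n-1)s^2}{\sigma_o^2}\right)^{\frac{n}{2}} e^{-\frac{(n-1)s^2}{2\sigma_o^2}} \leq k,$$

which is equivalent to

$$\left(\frac{(n-1)s^2}{\sigma_o^2}\right)^{n} e^{-\frac{(n-1)s^2}{\sigma_o^2}} \leq \left( k \left(\frac{n}{e}\right)^{\frac{n}{2}} \right)^{2} := K_o,$$

where $K_o$ is a constant. Let $H$ be a function defined by

$$H(w) = w^n e^{-w}.$$

Using this, we see that the above inequality becomes

$$H\!\left(\frac{(n-1)s^2}{\sigma_o^2}\right) \leq K_o.$$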

The ﬁgure below illustrates this inequality.

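Since $H'(w) = w^{n-1} e^{-w}(n - w)$, the function $H$ increases on $(0, n)$ and decreases on $(n, \infty)$, so $H(w) \leq K_o$ holds only for small or for large values of $w$. From this it follows that, for suitable constants $K_1 < K_2$,

$$\frac{(n-1)s^2}{\sigma_o^2} \leq K_1 \quad \text{or} \quad \frac{(n-1)s^2}{\sigma_o^2} \geq K_2.$$

In view of these inequalities, the critical region can be described as

$$C = \left\{ (x_1, x_2, ..., x_n) \;\middle|\; \frac{(n-1)s^2}{\sigma_o^2} \leq K_1 \ \text{ or } \ \frac{(n-1)s^2}{\sigma_o^2} \geq K_2 \right\},$$

and the best likelihood ratio test is: "Reject $H_o$ if $\frac{(n-1)S^2}{\sigma_o^2} \leq K_1$ or $\frac{(n-1)S^2}{\sigma_o^2} \geq K_2$."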

Since we are given the size of the critical region to be $\alpha$, we can determine the constants $K_1$ and $K_2$. As the sample $X_1, X_2, ..., X_n$ is taken from a normal distribution with mean $\mu$ and variance $\sigma^2$, we get

$$\frac{(n-1)S^2}{\sigma_o^2} \sim \chi^2(n-1)$$

when the null hypothesis $H_o : \sigma^2 = \sigma_o^2$ is true.

Therefore, the likelihood ratio critical region $C$ becomes

$$\left\{ (x_1, x_2, ..., x_n) \;\middle|\; \frac{(n-1)s^2}{\sigma_o^2} \leq \chi^2_{\frac{\alpha}{2}}(n-1) \ \text{ or } \ \frac{(n-1)s^2}{\sigma_o^2} \geq \chi^2_{1-\frac{\alpha}{2}}(n-1) \right\}$$

and the likelihood ratio test is: "Reject $H_o : \sigma^2 = \sigma_o^2$ if $\frac{(n-1)S^2}{\sigma_o^2} \leq \chi^2_{\frac{\alpha}{2}}(n-1)$ or $\frac{(n-1)S^2}{\sigma_o^2} \geq \chi^2_{1-\frac{\alpha}{2}}(n-1)$", where $\chi^2_{\frac{\alpha}{2}}(n-1)$ is a real number such that the integral of the chi-square density function with $(n-1)$ degrees of freedom from $0$ to $\chi^2_{\frac{\alpha}{2}}(n-1)$ is $\frac{\alpha}{2}$. Further, $\chi^2_{1-\frac{\alpha}{2}}(n-1)$ denotes the real number such that the integral of the chi-square density function with $(n-1)$ degrees of freedom from $\chi^2_{1-\frac{\alpha}{2}}(n-1)$ to $\infty$ is $\frac{\alpha}{2}$.

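Remark 18.8. We summarize the three cases of hypotheses test of the variance (of the normal population with unknown mean) in the following table.

$H_o$                      $H_a$                        Critical Region (or Test)
$\sigma^2 = \sigma_o^2$    $\sigma^2 > \sigma_o^2$      $\chi^2 = \dfrac{(n-1)s^2}{\sigma_o^2} \geq \chi^2_{1-\alpha}(n-1)$
$\sigma^2 = \sigma_o^2$    $\sigma^2 < \sigma_o^2$      $\chi^2 = \dfrac{(n-1)s^2}{\sigma_o^2} \leq \chi^2_{\alpha}(n-1)$
$\sigma^2 = \sigma_o^2$    $\sigma^2 \neq \sigma_o^2$   $\chi^2 = \dfrac{(n-1)s^2}{\sigma_o^2} \geq \chi^2_{1-\alpha/2}(n-1)$ or $\chi^2 = \dfrac{(n-1)s^2}{\sigma_o^2} \leq \chi^2_{\alpha/2}(n-1)$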
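As an illustration of how the table is applied, here is a minimal Python sketch. It is not part of the original text: the function names `chi_square_statistic` and `reject_variance` are hypothetical, the data are made up, and the chi-square critical values are assumed to be read from a table and passed in by the caller.

```python
import statistics


def chi_square_statistic(sample, sigma0_sq):
    """Return (n - 1) * s^2 / sigma0^2, where s^2 is the sample variance."""
    n = len(sample)
    s_sq = statistics.variance(sample)  # divisor n - 1
    return (n - 1) * s_sq / sigma0_sq


def reject_variance(chi_sq, alternative, lower=None, upper=None):
    """Apply the table. `lower` is the lower-tail critical value and `upper`
    the upper-tail critical value of chi-square(n - 1) for the chosen size."""
    if alternative == "greater":      # Ha: sigma^2 > sigma0^2
        return chi_sq >= upper
    if alternative == "less":         # Ha: sigma^2 < sigma0^2
        return chi_sq <= lower
    if alternative == "two-sided":    # Ha: sigma^2 != sigma0^2
        return chi_sq <= lower or chi_sq >= upper
    raise ValueError("alternative must be 'greater', 'less' or 'two-sided'")


# Hypothetical data, not taken from the text.
sample = [1.0, 2.0, 3.0, 4.0, 5.0]
chi_sq = chi_square_statistic(sample, sigma0_sq=2.0)
# 0.484 and 11.143 are the tabulated 0.025 and 0.975 quantiles of chi-square(4).
print(chi_sq, reject_variance(chi_sq, "two-sided", lower=0.484, upper=11.143))
```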

18.5. Review Exercises

1. Five trials $X_1, X_2, ..., X_5$ of a Bernoulli experiment were conducted to test $H_o : p = \frac{1}{2}$ against $H_a : p = \frac{3}{4}$. The null hypothesis $H_o$ will be rejected if $\sum_{i=1}^{5} X_i = 5$. Find the probability of Type I and Type II errors.

2. A manufacturer of car batteries claims that the life of his batteries is normally distributed with a standard deviation equal to 0.9 year. If a random sample of 10 of these batteries has a standard deviation of 1.2 years, do you think that $\sigma > 0.9$ year? Use a 0.05 level of significance.

3. Let $X_1, X_2, ..., X_8$ be a random sample of size 8 from a Poisson distribution with parameter $\lambda$. Reject the null hypothesis $H_o : \lambda = 0.5$ if the observed sum $\sum_{i=1}^{8} x_i \geq 8$. First, compute the significance level $\alpha$ of the test. Second, find the power function of the test, as a function of $\lambda$, as a sum of Poisson probabilities when $H_a$ is true.

4. Suppose $X$ has the density function

$$f(x; \theta) = \begin{cases} \frac{1}{\theta} & \text{for } 0 < x < \theta \\ 0 & \text{otherwise.} \end{cases}$$

If one observation of $X$ is taken, what are the probabilities of Type I and Type II errors in testing the null hypothesis $H_o : \theta = 1$ against the alternative hypothesis $H_a : \theta = 2$, if $H_o$ is rejected for $X > 0.92$?

5. Let $X$ have the density function

$$f(x; \theta) = \begin{cases} (\theta + 1)\, x^{\theta} & \text{for } 0 < x < 1, \text{ where } \theta > 0 \\ 0 & \text{otherwise.} \end{cases}$$

The hypothesis $H_o : \theta = 1$ is to be rejected in favor of $H_1 : \theta = 2$ if $X > 0.90$. What is the probability of Type I error?

6. Let $X_1, X_2, ..., X_6$ be a random sample from a distribution with density function

$$f(x; \theta) = \begin{cases} \theta\, x^{\theta - 1} & \text{for } 0 < x < 1, \text{ where } \theta > 0 \\ 0 & \text{otherwise.} \end{cases}$$

The null hypothesis $H_o : \theta = 1$ is to be rejected in favor of the alternative $H_a : \theta > 1$ if and only if at least 5 of the sample observations are larger than 0.7. What is the significance level of the test?

7. A researcher wants to test $H_o : \theta = 0$ versus $H_a : \theta = 1$, where $\theta$ is a parameter of a population of interest. The statistic $W$, based on a random sample of the population, is used to test the hypothesis. Suppose that under $H_o$, $W$ has a normal distribution with mean 0 and variance 1, and under $H_a$, $W$ has a normal distribution with mean 4 and variance 1. If $H_o$ is rejected when $W > 1.50$, then what are the probabilities of a Type I and a Type II error, respectively?

8. Let $X_1$ and $X_2$ be a random sample of size 2 from a normal distribution $N(\mu, 1)$. Find the likelihood ratio critical region of size 0.005 for testing the null hypothesis $H_o : \mu = 0$ against the composite alternative $H_a : \mu \neq 0$.

9. Let $X_1, X_2, ..., X_{10}$ be a random sample from a Poisson distribution with mean $\theta$. What is the most powerful (or best) critical region of size 0.08 for testing the null hypothesis $H_o : \theta = 0.1$ against $H_a : \theta = 0.5$?

10. Let $X$ be a random sample of size 1 from a distribution with probability density function

$$f(x; \theta) = \begin{cases} \left(1 - \frac{\theta}{2}\right) + \theta x & \text{if } 0 \leq x \leq 1 \\ 0 & \text{otherwise.} \end{cases}$$

For a significance level $\alpha = 0.1$, what is the best (or uniformly most powerful) critical region for testing the null hypothesis $H_o : \theta = -1$ against $H_a : \theta = 1$?

11. Let $X_1, X_2$ be a random sample of size 2 from a distribution with probability density function

$$f(x; \theta) = \begin{cases} \dfrac{\theta^{x} e^{-\theta}}{x!} & \text{if } x = 0, 1, 2, 3, ... \\ 0 & \text{otherwise,} \end{cases}$$

where $\theta > 0$. For a significance level $\alpha = 0.053$, what is the best critical region for testing the null hypothesis $H_o : \theta = 1$ against $H_a : \theta = 2$? Sketch the graph of the best critical region.

12. Let $X_1, X_2, ..., X_8$ be a random sample of size 8 from a distribution with probability density function

$$f(x; \theta) = \begin{cases} \dfrac{\theta^{x} e^{-\theta}}{x!} & \text{if } x = 0, 1, 2, 3, ... \\ 0 & \text{otherwise,} \end{cases}$$

where $\theta > 0$. What is the likelihood ratio critical region for testing the null hypothesis $H_o : \theta = 1$ against $H_a : \theta \neq 1$? If $\alpha = 0.1$, can you determine the best likelihood ratio critical region?

13. Let $X_1, X_2, ..., X_n$ be a random sample of size $n$ from a distribution with probability density function

$$f(x; \beta) = \begin{cases} \dfrac{x^{6} e^{-\frac{x}{\beta}}}{\Gamma(7)\, \beta^{7}} & \text{if } x > 0 \\ 0 & \text{otherwise,} \end{cases}$$

where $\beta > 0$. What is the likelihood ratio critical region for testing the null hypothesis $H_o : \beta = 5$ against $H_a : \beta \neq 5$? What is the most powerful test?

14. Let $X_1, X_2, ..., X_5$ denote a random sample of size 5 from a population $X$ with probability density function

$$f(x; \theta) = \begin{cases} (1 - \theta)^{x-1}\, \theta & \text{if } x = 1, 2, 3, ..., \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < \theta < 1$ is a parameter. What is the likelihood ratio critical region of size 0.05 for testing $H_o : \theta = 0.5$ versus $H_a : \theta \neq 0.5$?

15. Let $X_1, X_2, X_3$ denote a random sample of size 3 from a population $X$ with probability density function

$$f(x; \mu) = \frac{1}{\sqrt{2\pi}}\, e^{-\frac{(x - \mu)^2}{2}}, \qquad -\infty < x < \infty,$$

where $-\infty < \mu < \infty$ is a parameter. What is the likelihood ratio critical region of size 0.05 for testing $H_o : \mu = 3$ versus $H_a : \mu \neq 3$?

16. Let $X_1, X_2, X_3$ denote a random sample of size 3 from a population $X$ with probability density function

$$f(x; \theta) = \begin{cases} \dfrac{1}{\theta}\, e^{-\frac{x}{\theta}} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < \theta < \infty$ is a parameter. What is the likelihood ratio critical region for testing $H_o : \theta = 3$ versus $H_a : \theta \neq 3$?

17. Let $X_1, X_2, X_3$ denote a random sample of size 3 from a population $X$ with probability density function

$$f(x; \theta) = \begin{cases} \dfrac{e^{-\theta}\, \theta^{x}}{x!} & \text{if } x = 0, 1, 2, 3, ..., \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < \theta < \infty$ is a parameter. What is the likelihood ratio critical region for testing $H_o : \theta = 0.1$ versus $H_a : \theta \neq 0.1$?

18. A box contains 4 marbles, $\theta$ of which are white and the rest are black. A sample of size 2 is drawn to test $H_o : \theta = 2$ versus $H_a : \theta \neq 2$. If the null hypothesis is rejected when both marbles are the same color, find the significance level of the test.

19. Let $X_1, X_2, X_3$ denote a random sample of size 3 from a population $X$ with probability density function

$$f(x; \theta) = \begin{cases} \dfrac{1}{\theta} & \text{for } 0 \leq x \leq \theta \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < \theta < \infty$ is a parameter. What is the likelihood ratio critical region of size $\frac{117}{125}$ for testing $H_o : \theta = 5$ versus $H_a : \theta \neq 5$?

20. Let $X_1$, $X_2$ and $X_3$ denote three independent observations from a distribution with density

$$f(x; \beta) = \begin{cases} \dfrac{1}{\beta}\, e^{-\frac{x}{\beta}} & \text{for } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < \beta < \infty$ is a parameter. What is the best (or uniformly most powerful) critical region for testing $H_o : \beta = 5$ versus $H_a : \beta = 10$?

21. Suppose $X$ has the density function

$$f(x; \theta) = \begin{cases} \frac{1}{\theta} & \text{for } 0 < x < \theta \\ 0 & \text{otherwise.} \end{cases}$$

If $X_1, X_2, X_3, X_4$ is a random sample of size 4 taken from $X$, what are the probabilities of Type I and Type II errors in testing the null hypothesis $H_o : \theta = 1$ against the alternative hypothesis $H_a : \theta = 2$, if $H_o$ is rejected for

max{X1 , X2, X3 , X4 } 1

22. Let $X_1, X_2, X_3$ denote a random sample of size 3 from a population $X$ with probability density function

$$f(x; \theta) = \begin{cases} \dfrac{1}{\theta}\, e^{-\frac{x}{\theta}} & \text{if } 0 < x < \infty \\ 0 & \text{otherwise,} \end{cases}$$

where $0 < \theta < \infty$ is a parameter. The null hypothesis $H_o : \theta = 3$ is to be rejected in favor of the alternative $H_a : \theta \neq 3$ if and only if $\bar{X} > 6.296$. What is the significance level of the test?

Chapter 19

SIMPLE LINEAR REGRESSION AND CORRELATION ANALYSIS

Let $X$ and $Y$ be two random variables with joint probability density function $f(x, y)$. Then the conditional density of $Y$ given that $X = x$ is

$$f(y/x) = \frac{f(x, y)}{g(x)}$$

where

$$g(x) = \int_{-\infty}^{\infty} f(x, y)\, dy$$

is the marginal density of $X$. The conditional mean of $Y$

$$E(Y \mid X = x) = \int_{-\infty}^{\infty} y\, f(y/x)\, dy$$

is called the regression equation of $Y$ on $X$.

Example 19.1. Let $X$ and $Y$ be two random variables with the joint probability density function

$$f(x, y) = \begin{cases} x\, e^{-x(1+y)} & \text{if } x > 0,\ y > 0 \\ 0 & \text{otherwise.} \end{cases}$$

Find the regression equation of $Y$ on $X$ and then sketch the regression curve.

Answer: Since $f(x, y)$ vanishes for $y \leq 0$, the marginal density of $X$ for $x > 0$ is given by

$$g(x) = \int_{-\infty}^{\infty} f(x, y)\, dy = \int_{0}^{\infty} x\, e^{-x(1+y)}\, dy = \int_{0}^{\infty} x\, e^{-x} e^{-xy}\, dy = x\, e^{-x} \int_{0}^{\infty} e^{-xy}\, dy = x\, e^{-x} \left[ -\frac{1}{x}\, e^{-xy} \right]_{0}^{\infty} = e^{-x}.$$

The conditional density of $Y$ given $X = x$ is

$$f(y/x) = \frac{f(x, y)}{g(x)} = \frac{x\, e^{-x(1+y)}}{e^{-x}} = x\, e^{-xy}, \qquad y > 0.$$

The conditional mean of $Y$ given $X = x$ is given by

$$E(Y/x) = \int_{-\infty}^{\infty} y\, f(y/x)\, dy = \int_{0}^{\infty} y\, x\, e^{-xy}\, dy = \frac{1}{x}.$$

Thus the regression equation of $Y$ on $X$ is

$$E(Y/x) = \frac{1}{x}, \qquad x > 0.$$

The graph of this equation of $Y$ on $X$ is shown below.

[Figure: graph of the regression equation $E(Y/x) = 1/x$]
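The closed form $E(Y/x) = 1/x$ can be checked numerically. The sketch below is only an illustration and not part of the original text: the function name `conditional_mean` is hypothetical, and the integral over $(0, \infty)$ is approximated by a midpoint rule on a truncated interval.

```python
import math


def conditional_mean(x, steps=20000):
    """Approximate E(Y | X = x) = integral_0^inf y * x * exp(-x*y) dy
    by the midpoint rule on [0, 50/x]; the tail beyond 50/x is negligible."""
    upper = 50.0 / x
    h = upper / steps
    total = 0.0
    for k in range(steps):
        y = (k + 0.5) * h          # midpoint of the k-th subinterval
        total += y * x * math.exp(-x * y)
    return total * h


# Compare the numerical value with 1/x for a few values of x.
for x in (0.5, 1.0, 2.0, 4.0):
    print(x, conditional_mean(x), 1.0 / x)
```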

From this example it is clear that the conditional mean $E(Y/x)$ is a function of $x$. If this function is of the form $\alpha + \beta x$, then the corresponding regression equation is called a linear regression equation; otherwise it is called a nonlinear regression equation. The term linear regression refers to a specification that is linear in the parameters. Thus $E(Y/x) = \alpha + \beta x^2$ is also a linear regression equation. The regression equation $E(Y/x) = \alpha x^{\beta}$ is an example of a nonlinear regression equation.

The main purpose of regression analysis is to predict $Y_i$ from the knowledge of $x_i$ using a relationship like

$$E(Y_i/x_i) = \alpha + \beta x_i.$$

The $Y_i$ is called the response or dependent variable, whereas $x_i$ is called the predictor or independent variable. The term regression has an interesting history, dating back to Francis Galton (1822-1911). Galton studied the heights of fathers and sons, in which he observed a regression (a "turning back") from the heights of sons to the heights of their fathers. That is, tall fathers tend to have tall sons and short fathers tend to have short sons. However, he also found that very tall fathers tend to have shorter sons and very short fathers tend to have taller sons. Galton called this phenomenon regression towards the mean.

In regression analysis, that is, when investigating the relationship between a predictor and a response variable, there are two steps to the analysis. The first step is totally data oriented. This step is always performed. The second step is the statistical one, in which we draw conclusions about the (population) regression equation $E(Y_i/x_i)$. Normally the regression equation contains several parameters. There are two well known methods for finding the estimates of the parameters of the regression equation. These two methods are: (1) the least squares method and (2) the normal regression method.

19.1. The Least Squares Method

Let $\{(x_i, y_i) \mid i = 1, 2, ..., n\}$ be a set of data. Assume that

$$E(Y_i/x_i) = \alpha + \beta x_i, \qquad (1)$$

that is

$$y_i = \alpha + \beta x_i, \qquad i = 1, 2, ..., n.$$

Then the sum of the squares of the error is given by

$$\mathcal{E}(\alpha, \beta) = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2. \qquad (2)$$

The least squares estimates of $\alpha$ and $\beta$ are defined to be those values which minimize $\mathcal{E}(\alpha, \beta)$. That is,

$$\left(\hat{\alpha}, \hat{\beta}\right) = \arg\min_{(\alpha, \beta)} \mathcal{E}(\alpha, \beta).$$

This least squares method is due to Adrien M. Legendre (1752-1833). Note that the least squares method also works even if the regression equation is nonlinear (that is, not of the form (1)).

Next, we give several examples to illustrate the method of least squares.

Example 19.2. Given the five pairs of points $(x, y)$ shown in the table below

x    4    0   -2    3    1
y    5    0    0    6    3

what is the line of the form $y = x + b$ that best fits the data by the method of least squares?

Answer: Suppose the best fit line is $y = x + b$. Then for each $x_i$, $x_i + b$ is the estimated value of $y_i$. The difference between $y_i$ and the estimated value of $y_i$ is the error or the residual corresponding to the $i$th measurement. That is, the error corresponding to the $i$th measurement is given by

$$\epsilon_i = y_i - x_i - b.$$

Hence the sum of the squares of the errors is

$$\mathcal{E}(b) = \sum_{i=1}^{5} \epsilon_i^2 = \sum_{i=1}^{5} (y_i - x_i - b)^2.$$

Differentiating $\mathcal{E}(b)$ with respect to $b$, we get

$$\frac{d}{db}\mathcal{E}(b) = 2\sum_{i=1}^{5} (y_i - x_i - b)(-1).$$

Setting $\frac{d}{db}\mathcal{E}(b)$ equal to 0, we get

$$\sum_{i=1}^{5} (y_i - x_i - b) = 0$$

which is

$$5b = \sum_{i=1}^{5} y_i - \sum_{i=1}^{5} x_i.$$

Using the data, we see that

$$5b = 14 - 6$$

which yields $b = \frac{8}{5}$. Hence the best fitted line is

$$y = x + \frac{8}{5}.$$

Example 19.3. Suppose the line $y = bx + 1$ is fit by the method of least squares to the 3 data points

x    1    2    4
y    2    2    0

What is the value of the constant $b$?

Answer: The error corresponding to the $i$th measurement is given by

$$\epsilon_i = y_i - b x_i - 1.$$

Hence the sum of the squares of the errors is

$$\mathcal{E}(b) = \sum_{i=1}^{3} \epsilon_i^2 = \sum_{i=1}^{3} (y_i - b x_i - 1)^2.$$

Differentiating $\mathcal{E}(b)$ with respect to $b$, we get

$$\frac{d}{db}\mathcal{E}(b) = 2\sum_{i=1}^{3} (y_i - b x_i - 1)(-x_i).$$

Setting $\frac{d}{db}\mathcal{E}(b)$ equal to 0, we get

$$\sum_{i=1}^{3} (y_i - b x_i - 1)\, x_i = 0$$

which in turn yields

$$b = \frac{\sum_{i=1}^{3} x_i y_i - \sum_{i=1}^{3} x_i}{\sum_{i=1}^{3} x_i^2}.$$

Using the given data we see that

$$b = \frac{6 - 7}{21} = -\frac{1}{21},$$

and the best fitted line is

$$y = -\frac{1}{21}\, x + 1.$$
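The arithmetic in Examples 19.2 and 19.3 can be reproduced with exact fractions. This is a small illustrative sketch, not part of the original text; the variable names are hypothetical, and the third $x$ value of Example 19.2 is taken as $-2$, which is what the sum $\sum x_i = 6$ used in that example requires.

```python
from fractions import Fraction

# Example 19.2: fit y = x + b, so b = (sum(y) - sum(x)) / n.
x2 = [4, 0, -2, 3, 1]
y2 = [5, 0, 0, 6, 3]
b_intercept = Fraction(sum(y2) - sum(x2), len(x2))

# Example 19.3: fit y = b*x + 1, so b = (sum(x*y) - sum(x)) / sum(x^2).
x3 = [1, 2, 4]
y3 = [2, 2, 0]
b_slope = Fraction(
    sum(x * y for x, y in zip(x3, y3)) - sum(x3),
    sum(x * x for x in x3),
)

print(b_intercept, b_slope)  # prints 8/5 -1/21
```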

Example 19.4. Observations $y_1, y_2, ..., y_n$ are assumed to come from a model with

$$E(Y_i/x_i) = \theta + 2 \ln x_i$$

where $\theta$ is an unknown parameter and $x_1, x_2, ..., x_n$ are given constants. What is the least squares estimate of the parameter $\theta$?

Answer: The sum of the squares of errors is

$$\mathcal{E}(\theta) = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \theta - 2 \ln x_i)^2.$$

Differentiating $\mathcal{E}(\theta)$ with respect to $\theta$, we get

$$\frac{d}{d\theta}\mathcal{E}(\theta) = 2\sum_{i=1}^{n} (y_i - \theta - 2 \ln x_i)(-1).$$

Setting $\frac{d}{d\theta}\mathcal{E}(\theta)$ equal to 0, we get

$$\sum_{i=1}^{n} (y_i - \theta - 2 \ln x_i) = 0$$

which is

$$\theta = \frac{1}{n}\left( \sum_{i=1}^{n} y_i - 2\sum_{i=1}^{n} \ln x_i \right).$$

Hence the least squares estimate of $\theta$ is

$$\hat{\theta} = \bar{y} - \frac{2}{n}\sum_{i=1}^{n} \ln x_i.$$

Example 19.5. Given the three pairs of points $(x, y)$ shown below:

x    4    1    2
y    2    1    0

What is the curve of the form $y = x^{\beta}$ that best fits the data by the method of least squares?

Answer: The sum of the squares of the errors is given by

$$\mathcal{E}(\beta) = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} \left( y_i - x_i^{\beta} \right)^2.$$

Differentiating $\mathcal{E}(\beta)$ with respect to $\beta$, we get

$$\frac{d}{d\beta}\mathcal{E}(\beta) = 2\sum_{i=1}^{n} \left( y_i - x_i^{\beta} \right)\left( -x_i^{\beta} \ln x_i \right).$$

Setting this derivative $\frac{d}{d\beta}\mathcal{E}(\beta)$ to 0, we get

$$\sum_{i=1}^{n} y_i\, x_i^{\beta} \ln x_i = \sum_{i=1}^{n} x_i^{\beta}\, x_i^{\beta} \ln x_i.$$

Using the given data we obtain

$$(2)\, 4^{\beta} \ln 4 = 4^{2\beta} \ln 4 + 2^{2\beta} \ln 2$$

which simplifies to

$$4 = (2)\, 4^{\beta} + 1 \quad \text{or} \quad 4^{\beta} = \frac{3}{2}.$$

Taking the natural logarithm of both sides of the above expression, we get

$$\beta = \frac{\ln 3 - \ln 2}{\ln 4} = 0.2925.$$

Thus the least squares best fit model is $y = x^{0.2925}$.
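When the stationarity equation has no convenient closed form, it can be solved numerically. As a hedged illustration (not part of the original text), the sketch below applies plain bisection to the equation of Example 19.5 and compares the root with $\ln(3/2)/\ln 4$; the helper names `normal_equation` and `bisect` and the bracket $[0, 1]$ are choices made here.

```python
import math

xs = [4.0, 1.0, 2.0]
ys = [2.0, 1.0, 0.0]


def normal_equation(beta):
    """sum_i (y_i - x_i^beta) * x_i^beta * ln(x_i); zero at a stationary point."""
    return sum((y - x ** beta) * x ** beta * math.log(x) for x, y in zip(xs, ys))


def bisect(f, lo, hi, tol=1e-10):
    """Plain bisection; assumes f(lo) and f(hi) have opposite signs."""
    f_lo = f(lo)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if (f_mid > 0) == (f_lo > 0):
            lo, f_lo = mid, f_mid   # root lies in the upper half
        else:
            hi = mid                # root lies in the lower half
    return 0.5 * (lo + hi)


beta_hat = bisect(normal_equation, 0.0, 1.0)
print(beta_hat, math.log(1.5) / math.log(4.0))
```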

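Example 19.6. Observations $y_1, y_2, ..., y_n$ are assumed to come from a model with $E(Y_i/x_i) = \alpha + \beta x_i$, where $\alpha$ and $\beta$ are unknown parameters, and $x_1, x_2, ..., x_n$ are given constants. What are the least squares estimates of the parameters $\alpha$ and $\beta$?

Answer: The sum of the squares of the errors is given by

$$\mathcal{E}(\alpha, \beta) = \sum_{i=1}^{n} \epsilon_i^2 = \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2.$$

Differentiating $\mathcal{E}(\alpha, \beta)$ with respect to $\alpha$ and $\beta$ respectively, we get

$$\frac{\partial}{\partial \alpha}\mathcal{E}(\alpha, \beta) = 2\sum_{i=1}^{n} (y_i - \alpha - \beta x_i)(-1)$$

and

$$\frac{\partial}{\partial \beta}\mathcal{E}(\alpha, \beta) = 2\sum_{i=1}^{n} (y_i - \alpha - \beta x_i)(-x_i).$$

Setting these partial derivatives $\frac{\partial}{\partial \alpha}\mathcal{E}(\alpha, \beta)$ and $\frac{\partial}{\partial \beta}\mathcal{E}(\alpha, \beta)$ to 0, we get

$$\sum_{i=1}^{n} (y_i - \alpha - \beta x_i) = 0 \qquad (3)$$

and

$$\sum_{i=1}^{n} (y_i - \alpha - \beta x_i)\, x_i = 0. \qquad (4)$$

From (3), we obtain

$$\sum_{i=1}^{n} y_i = n\alpha + \beta \sum_{i=1}^{n} x_i$$

which is

$$\bar{y} = \alpha + \beta \bar{x}. \qquad (5)$$

Similarly, from (4), we have

$$\sum_{i=1}^{n} x_i y_i = \alpha \sum_{i=1}^{n} x_i + \beta \sum_{i=1}^{n} x_i^2$$

which can be rewritten as follows

$$\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y}) + n \bar{x}\, \bar{y} = n \alpha \bar{x} + \beta \left[ \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x}) + n \bar{x}^2 \right]. \qquad (6)$$

Defining

$$S_{xy} := \sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})$$

we see that (6) reduces to

$$S_{xy} + n \bar{x}\, \bar{y} = \alpha\, n \bar{x} + \beta \left[ S_{xx} + n \bar{x}^2 \right]. \qquad (7)$$

Substituting (5) into (7), we have

$$S_{xy} + n \bar{x}\, \bar{y} = \left[ \bar{y} - \beta \bar{x} \right] n \bar{x} + \beta \left[ S_{xx} + n \bar{x}^2 \right].$$

Simplifying the last equation, we get

$$S_{xy} = \beta\, S_{xx}$$

which is

$$\beta = \frac{S_{xy}}{S_{xx}}. \qquad (8)$$

In view of (8) and (5), we get

$$\alpha = \bar{y} - \frac{S_{xy}}{S_{xx}}\, \bar{x}. \qquad (9)$$

Thus the least squares estimates of $\alpha$ and $\beta$ are

$$\hat{\alpha} = \bar{y} - \frac{S_{xy}}{S_{xx}}\, \bar{x} \quad \text{and} \quad \hat{\beta} = \frac{S_{xy}}{S_{xx}},$$

respectively.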

We need some notation. The random variable $Y$ given $X = x$ will be denoted by $Y_x$. Note that this is the variable that appears in the model $E(Y/x) = \alpha + \beta x$. When one chooses in succession values $x_1, x_2, ..., x_n$ for $x$, a sequence $Y_{x_1}, Y_{x_2}, ..., Y_{x_n}$ of random variables is obtained. For the sake of convenience, we denote the random variables $Y_{x_1}, Y_{x_2}, ..., Y_{x_n}$ simply as $Y_1, Y_2, ..., Y_n$. To do some statistical analysis, we make the following three assumptions:

(1) $E(Y_x) = \alpha + \beta x$ so that $\mu_i = E(Y_i) = \alpha + \beta x_i$;

(2) $Y_1, Y_2, ..., Y_n$ are independent;

(3) Each of the random variables $Y_1, Y_2, ..., Y_n$ has the same variance $\sigma^2$.

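Theorem 19.1. Under the above three assumptions, the least squares estimators $\hat{\alpha}$ and $\hat{\beta}$ of a linear model $E(Y/x) = \alpha + \beta x$ are unbiased.

Proof: From the previous example, we know that the least squares estimators of $\alpha$ and $\beta$ are

$$\hat{\alpha} = \bar{Y} - \frac{S_{xY}}{S_{xx}}\, \bar{x} \quad \text{and} \quad \hat{\beta} = \frac{S_{xY}}{S_{xx}}$$

where

$$S_{xY} := \sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y}).$$

First, we show $\hat{\beta}$ is unbiased. Consider

$$\begin{aligned}
E\left(\hat{\beta}\right) &= E\left(\frac{S_{xY}}{S_{xx}}\right) = \frac{1}{S_{xx}}\, E(S_{xY}) \\
&= \frac{1}{S_{xx}}\, E\left( \sum_{i=1}^{n} (x_i - \bar{x})(Y_i - \bar{Y}) \right) \\
&= \frac{1}{S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x})\, E\left(Y_i - \bar{Y}\right) \\
&= \frac{1}{S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x})\, E(Y_i) - \frac{1}{S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x})\, E\left(\bar{Y}\right) \\
&= \frac{1}{S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x})\, E(Y_i) - \frac{1}{S_{xx}}\, E\left(\bar{Y}\right) \sum_{i=1}^{n} (x_i - \bar{x}) \\
&= \frac{1}{S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x})\, E(Y_i) = \frac{1}{S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x})(\alpha + \beta x_i) \\
&= \alpha\, \frac{1}{S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x}) + \beta\, \frac{1}{S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x})\, x_i \\
&= \beta\, \frac{1}{S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x})\, x_i \\
&= \beta\, \frac{1}{S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x})\, x_i - \beta\, \frac{1}{S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x})\, \bar{x} \\
&= \beta\, \frac{1}{S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x})(x_i - \bar{x}) \\
&= \beta\, \frac{1}{S_{xx}}\, S_{xx} = \beta.
\end{aligned}$$

Thus the estimator $\hat{\beta}$ is an unbiased estimator of the parameter $\beta$.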

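Next, we show that $\hat{\alpha}$ is also an unbiased estimator of $\alpha$. Consider

$$\begin{aligned}
E(\hat{\alpha}) &= E\left( \bar{Y} - \frac{S_{xY}}{S_{xx}}\, \bar{x} \right) = E\left(\bar{Y}\right) - \bar{x}\, E\left( \frac{S_{xY}}{S_{xx}} \right) \\
&= E\left(\bar{Y}\right) - \bar{x}\, E\left(\hat{\beta}\right) = E\left(\bar{Y}\right) - \bar{x}\, \beta \\
&= \frac{1}{n} \sum_{i=1}^{n} E(Y_i) - \bar{x}\, \beta \\
&= \frac{1}{n} \sum_{i=1}^{n} (\alpha + \beta x_i) - \bar{x}\, \beta \\
&= \frac{1}{n} \left( n\alpha + \beta \sum_{i=1}^{n} x_i \right) - \bar{x}\, \beta \\
&= \alpha + \beta \bar{x} - \bar{x}\, \beta = \alpha.
\end{aligned}$$

This proves that $\hat{\alpha}$ is an unbiased estimator of $\alpha$ and the proof of the theorem is now complete.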
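Unbiasedness can also be observed empirically. The following Monte Carlo sketch is an illustration only and not part of the original text: the true parameters, the design points, the number of replications and the uniform error distribution are all hypothetical choices. The errors are deliberately not normal, since the theorem uses only the three assumptions above.

```python
import random


def least_squares_line(xs, ys):
    """Return (alpha_hat, beta_hat) with beta_hat = Sxy / Sxx."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    beta_hat = s_xy / s_xx
    return y_bar - beta_hat * x_bar, beta_hat


random.seed(0)
alpha, beta = 2.0, 0.5            # hypothetical true parameters
xs = [1.0, 2.0, 3.0, 4.0, 5.0]    # fixed design points
reps = 20000

sum_alpha_hat = 0.0
sum_beta_hat = 0.0
for _ in range(reps):
    # Uniform(-1, 1) errors: mean zero, common variance, independent.
    ys = [alpha + beta * x + random.uniform(-1.0, 1.0) for x in xs]
    a_hat, b_hat = least_squares_line(xs, ys)
    sum_alpha_hat += a_hat
    sum_beta_hat += b_hat

mean_alpha_hat = sum_alpha_hat / reps
mean_beta_hat = sum_beta_hat / reps
# The averages should be close to the true values 2.0 and 0.5.
print(mean_alpha_hat, mean_beta_hat)
```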

19.2. The Normal Regression Analysis

In a regression analysis, we assume that the $x_i$'s are constants while the $y_i$'s are values of the random variables $Y_i$'s. A regression analysis is called a normal regression analysis if the conditional density of $Y_i$ given $X_i = x_i$ is of the form

$$f(y_i/x_i) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{1}{2}\left( \frac{y_i - \alpha - \beta x_i}{\sigma} \right)^2}$$

where $\sigma^2$ denotes the variance, and $\alpha$ and $\beta$ are the regression coefficients. That is, $Y|_{x_i} \sim N(\alpha + \beta x_i, \sigma^2)$. If there is no danger of confusion, then we will write $Y_i$ for $Y|_{x_i}$. The figure on the next page shows the regression model of $Y$ with equal variances, and with means falling on the straight line $\mu_y = \alpha + \beta x$.

Normal regression analysis is concerned with the estimation of $\sigma$, $\alpha$, and $\beta$. We use the maximum likelihood method to estimate these parameters. The likelihood function of the sample is given by

$$L(\sigma, \alpha, \beta) = \prod_{i=1}^{n} f(y_i/x_i)$$

and

$$\ln L(\sigma, \alpha, \beta) = \sum_{i=1}^{n} \ln f(y_i/x_i) = -n \ln \sigma - \frac{n}{2}\ln(2\pi) - \frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2.$$

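Taking the partial derivatives of $\ln L(\sigma, \alpha, \beta)$ with respect to $\alpha$, $\beta$ and $\sigma$ respectively, we get

$$\frac{\partial}{\partial \alpha} \ln L(\sigma, \alpha, \beta) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)$$

$$\frac{\partial}{\partial \beta} \ln L(\sigma, \alpha, \beta) = \frac{1}{\sigma^2} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)\, x_i$$

$$\frac{\partial}{\partial \sigma} \ln L(\sigma, \alpha, \beta) = -\frac{n}{\sigma} + \frac{1}{\sigma^3} \sum_{i=1}^{n} (y_i - \alpha - \beta x_i)^2.$$

Equating each of these partial derivatives to zero and solving the system of three equations, we obtain the maximum likelihood estimators of $\beta$, $\alpha$, $\sigma$ as

$$\hat{\beta} = \frac{S_{xY}}{S_{xx}}, \qquad \hat{\alpha} = \bar{Y} - \frac{S_{xY}}{S_{xx}}\, \bar{x}, \qquad \text{and} \qquad \hat{\sigma} = \sqrt{ \frac{1}{n} \left[ S_{YY} - \frac{S_{xY}}{S_{xx}}\, S_{xY} \right] },$$

where

$$S_{xY} = \sum_{i=1}^{n} (x_i - \bar{x})\left( Y_i - \bar{Y} \right).$$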

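Theorem 19.2. In the normal regression analysis, the maximum likelihood estimators $\hat{\beta}$ and $\hat{\alpha}$ are unbiased estimators of $\beta$ and $\alpha$, respectively.

Proof: Recall that

$$\hat{\beta} = \frac{S_{xY}}{S_{xx}} = \frac{1}{S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x})\left( Y_i - \bar{Y} \right) = \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_{xx}} \right) Y_i,$$

where $S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2$. Thus $\hat{\beta}$ is a linear combination of the $Y_i$'s. Since $Y_i \sim N\left(\alpha + \beta x_i, \sigma^2\right)$, we see that $\hat{\beta}$ is also a normal random variable.

First we show $\hat{\beta}$ is an unbiased estimator of $\beta$. Since

$$E\left(\hat{\beta}\right) = E\left( \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_{xx}} \right) Y_i \right) = \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_{xx}} \right) E(Y_i) = \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_{xx}} \right) (\alpha + \beta x_i) = \beta,$$

the maximum likelihood estimator of $\beta$ is unbiased.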

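Next, we show that $\hat{\alpha}$ is also an unbiased estimator of $\alpha$. Consider

$$\begin{aligned}
E(\hat{\alpha}) &= E\left( \bar{Y} - \frac{S_{xY}}{S_{xx}}\, \bar{x} \right) = E\left(\bar{Y}\right) - \bar{x}\, E\left( \frac{S_{xY}}{S_{xx}} \right) \\
&= E\left(\bar{Y}\right) - \bar{x}\, E\left(\hat{\beta}\right) = E\left(\bar{Y}\right) - \bar{x}\, \beta \\
&= \frac{1}{n} \sum_{i=1}^{n} E(Y_i) - \bar{x}\, \beta \\
&= \frac{1}{n} \sum_{i=1}^{n} (\alpha + \beta x_i) - \bar{x}\, \beta \\
&= \frac{1}{n} \left( n\alpha + \beta \sum_{i=1}^{n} x_i \right) - \bar{x}\, \beta \\
&= \alpha + \beta \bar{x} - \bar{x}\, \beta = \alpha.
\end{aligned}$$

This proves that $\hat{\alpha}$ is an unbiased estimator of $\alpha$ and the proof of the theorem is now complete.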

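Theorem 19.3. In normal regression analysis, the distributions of the estimators $\hat{\beta}$ and $\hat{\alpha}$ are given by

$$\hat{\beta} \sim N\left( \beta, \frac{\sigma^2}{S_{xx}} \right) \quad \text{and} \quad \hat{\alpha} \sim N\left( \alpha, \frac{\sigma^2}{n} + \frac{\bar{x}^2 \sigma^2}{S_{xx}} \right)$$

where

$$S_{xx} = \sum_{i=1}^{n} (x_i - \bar{x})^2.$$

Proof: Since

$$\hat{\beta} = \frac{S_{xY}}{S_{xx}} = \frac{1}{S_{xx}} \sum_{i=1}^{n} (x_i - \bar{x})\left( Y_i - \bar{Y} \right) = \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_{xx}} \right) Y_i,$$

the estimator $\hat{\beta}$ is a linear combination of the $Y_i$'s. As $Y_i \sim N\left(\alpha + \beta x_i, \sigma^2\right)$, we see that $\hat{\beta}$ is also a normal random variable. By Theorem 19.2, $\hat{\beta}$ is an unbiased estimator of $\beta$.

The variance of $\hat{\beta}$ is given by

$$\begin{aligned}
Var\left(\hat{\beta}\right) &= \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_{xx}} \right)^2 Var(Y_i/x_i) \\
&= \sum_{i=1}^{n} \left( \frac{x_i - \bar{x}}{S_{xx}} \right)^2 \sigma^2 \\
&= \frac{1}{S_{xx}^2} \sum_{i=1}^{n} (x_i - \bar{x})^2\, \sigma^2 \\
&= \frac{\sigma^2}{S_{xx}}.
\end{aligned}$$

Hence $\hat{\beta}$ is a normal random variable with mean (or expected value) $\beta$ and variance $\frac{\sigma^2}{S_{xx}}$. That is, $\hat{\beta} \sim N\left( \beta, \frac{\sigma^2}{S_{xx}} \right)$.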

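Now we determine the distribution of $\hat{\alpha}$. Since each $Y_i \sim N(\alpha + \beta x_i, \sigma^2)$, the distribution of $\bar{Y}$ is given by

$$\bar{Y} \sim N\left( \alpha + \beta \bar{x}, \frac{\sigma^2}{n} \right).$$

Since

$$\hat{\beta} \sim N\left( \beta, \frac{\sigma^2}{S_{xx}} \right)$$

the distribution of $\bar{x}\hat{\beta}$ is given by

$$\bar{x}\, \hat{\beta} \sim N\left( \bar{x}\, \beta, \bar{x}^2\, \frac{\sigma^2}{S_{xx}} \right).$$

Since $\hat{\alpha} = \bar{Y} - \bar{x}\hat{\beta}$, and $\bar{Y}$ and $\bar{x}\hat{\beta}$ are two normal random variables, $\hat{\alpha}$ is also a normal random variable with mean equal to $\alpha + \beta\bar{x} - \beta\bar{x} = \alpha$ and variance equal to $\frac{\sigma^2}{n} + \frac{\bar{x}^2 \sigma^2}{S_{xx}}$. That is,

$$\hat{\alpha} \sim N\left( \alpha, \frac{\sigma^2}{n} + \frac{\bar{x}^2 \sigma^2}{S_{xx}} \right)$$

and the proof of the theorem is now complete.

It should be noted that in the proof of the last theorem, we have assumed the fact that $\bar{Y}$ and $\bar{x}\hat{\beta}$ are statistically independent.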

In the next theorem, we give an unbiased estimator of the variance $\sigma^2$. For this we need the distribution of the statistic $U$ given by

$$U = \frac{n\, \hat{\sigma}^2}{\sigma^2}.$$

It can be shown (we will omit the proof, for a proof see Graybill (1961)) that the distribution of the statistic

$$U = \frac{n\, \hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-2).$$

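Theorem 19.4. An unbiased estimator $S^2$ of $\sigma^2$ is given by

$$S^2 = \frac{n\, \hat{\sigma}^2}{n-2},$$

where $\hat{\sigma} = \sqrt{ \frac{1}{n} \left[ S_{YY} - \frac{S_{xY}}{S_{xx}}\, S_{xY} \right] }$.

Proof: Since

$$\begin{aligned}
E(S^2) &= E\left( \frac{n\, \hat{\sigma}^2}{n-2} \right) \\
&= \frac{\sigma^2}{n-2}\, E\left( \frac{n\, \hat{\sigma}^2}{\sigma^2} \right) \\
&= \frac{\sigma^2}{n-2}\, E\left( \chi^2(n-2) \right) \\
&= \frac{\sigma^2}{n-2}\, (n-2) = \sigma^2,
\end{aligned}$$

the estimator $S^2$ is an unbiased estimator of $\sigma^2$. The proof of the theorem is now complete.

Note that the estimator $S^2$ can be written as $S^2 = \frac{SSE}{n-2}$, where

$$SSE = S_{YY} - \hat{\beta}\, S_{xY} = \sum_{i=1}^{n} \left[ y_i - \hat{\alpha} - \hat{\beta} x_i \right]^2.$$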

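In the next theorem we give the distribution of two statistics that can be used for testing hypotheses and constructing confidence intervals for the regression parameters $\alpha$ and $\beta$.

Theorem 19.5. The statistics

$$Q_{\beta} = \frac{\hat{\beta} - \beta}{\hat{\sigma}} \sqrt{ \frac{(n-2)\, S_{xx}}{n} }$$

and

$$Q_{\alpha} = \frac{\hat{\alpha} - \alpha}{\hat{\sigma}} \sqrt{ \frac{(n-2)\, S_{xx}}{n\, (\bar{x})^2 + S_{xx}} }$$

both have a t-distribution with $n-2$ degrees of freedom.

Proof: From Theorem 19.3, we know that

$$\hat{\beta} \sim N\left( \beta, \frac{\sigma^2}{S_{xx}} \right).$$

Hence by standardizing, we get

$$Z = \frac{\hat{\beta} - \beta}{\sqrt{ \frac{\sigma^2}{S_{xx}} }} \sim N(0, 1).$$

Further, we know that the likelihood estimator of $\sigma$ is

$$\hat{\sigma} = \sqrt{ \frac{1}{n} \left[ S_{YY} - \frac{S_{xY}}{S_{xx}}\, S_{xY} \right] }$$

and the distribution of the statistic $U = \frac{n\, \hat{\sigma}^2}{\sigma^2}$ is chi-square with $n-2$ degrees of freedom.

Since $Z = \frac{\hat{\beta} - \beta}{\sqrt{\sigma^2 / S_{xx}}} \sim N(0, 1)$ and $U = \frac{n\, \hat{\sigma}^2}{\sigma^2} \sim \chi^2(n-2)$, by Theorem 14.6, the statistic $\frac{Z}{\sqrt{U/(n-2)}} \sim t(n-2)$. Hence

$$Q_{\beta} = \frac{\hat{\beta} - \beta}{\hat{\sigma}} \sqrt{ \frac{(n-2)\, S_{xx}}{n} } = \frac{\hat{\beta} - \beta}{\sqrt{ \frac{n\, \hat{\sigma}^2}{(n-2)\, S_{xx}} }} = \frac{ \dfrac{\hat{\beta} - \beta}{\sqrt{\sigma^2 / S_{xx}}} }{ \sqrt{ \dfrac{n\, \hat{\sigma}^2}{(n-2)\, \sigma^2} } } \sim t(n-2).$$

Similarly, it can be shown that

$$Q_{\alpha} = \frac{\hat{\alpha} - \alpha}{\hat{\sigma}} \sqrt{ \frac{(n-2)\, S_{xx}}{n\, (\bar{x})^2 + S_{xx}} } \sim t(n-2).$$

This completes the proof of the theorem.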

In the normal regression model, if $\beta = 0$, then $E(Y_x) = \alpha$. This implies that $E(Y_x)$ does not depend on $x$. Therefore if $\beta \neq 0$, then $E(Y_x)$ is dependent on $x$. Thus the null hypothesis $H_o : \beta = 0$ should be tested against $H_a : \beta \neq 0$. To devise a test we need the distribution of $\hat{\beta}$. Theorem 19.3 says that $\hat{\beta}$ is normally distributed with mean $\beta$ and variance $\frac{\sigma^2}{S_{xx}}$. Therefore, we have

$$Z = \frac{\hat{\beta} - \beta}{\sqrt{ \frac{\sigma^2}{S_{xx}} }} \sim N(0, 1).$$

In practice the variance $Var(Y_i/x_i)$, which is $\sigma^2$, is usually unknown. Hence the above statistic $Z$ is not very useful. However, using the statistic $Q_{\beta}$, we can devise a hypothesis test to test the hypothesis $H_o : \beta = \beta_o$ against $H_a : \beta \neq \beta_o$ at a significance level $\gamma$. For this one has to evaluate the quantity

$$|t| = \left| \frac{\hat{\beta} - \beta_o}{\sqrt{ \frac{n\, \hat{\sigma}^2}{(n-2)\, S_{xx}} }} \right| = \left| \frac{\hat{\beta} - \beta_o}{\hat{\sigma}} \sqrt{ \frac{(n-2)\, S_{xx}}{n} } \right|$$

and compare it to the quantile $t_{\gamma/2}(n-2)$. The hypothesis test, at significance level $\gamma$, is then "Reject $H_o : \beta = \beta_o$ if $|t| > t_{\gamma/2}(n-2)$".

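The statistic

$$Q_{\beta} = \frac{\hat{\beta} - \beta}{\hat{\sigma}} \sqrt{ \frac{(n-2)\, S_{xx}}{n} }$$

is a pivotal quantity for the parameter $\beta$ since the distribution of this quantity $Q_{\beta}$ is a t-distribution with $n-2$ degrees of freedom. Thus it can be used for the construction of a $(1-\gamma)100\%$ confidence interval for the parameter $\beta$ as follows:

$$\begin{aligned}
1 - \gamma &= P\left( -t_{\frac{\gamma}{2}}(n-2) \leq \frac{\hat{\beta} - \beta}{\hat{\sigma}} \sqrt{ \frac{(n-2)\, S_{xx}}{n} } \leq t_{\frac{\gamma}{2}}(n-2) \right) \\
&= P\left( \hat{\beta} - t_{\frac{\gamma}{2}}(n-2)\, \hat{\sigma} \sqrt{ \frac{n}{(n-2)\, S_{xx}} } \leq \beta \leq \hat{\beta} + t_{\frac{\gamma}{2}}(n-2)\, \hat{\sigma} \sqrt{ \frac{n}{(n-2)\, S_{xx}} } \right).
\end{aligned}$$

Hence, the $(1-\gamma)100\%$ confidence interval for $\beta$ is given by

$$\left[ \hat{\beta} - t_{\frac{\gamma}{2}}(n-2)\, \hat{\sigma} \sqrt{ \frac{n}{(n-2)\, S_{xx}} },\ \ \hat{\beta} + t_{\frac{\gamma}{2}}(n-2)\, \hat{\sigma} \sqrt{ \frac{n}{(n-2)\, S_{xx}} } \right].$$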
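The interval above is straightforward to compute once the tabulated t quantile is known. The sketch below is a minimal Python illustration and not part of the original text: the function name `slope_confidence_interval` and the data are hypothetical, and `t_crit` stands for the t quantile with $n-2$ degrees of freedom, which the caller is assumed to look up.

```python
import math


def slope_confidence_interval(xs, ys, t_crit):
    """Interval beta_hat -/+ t_crit * sigma_hat * sqrt(n / ((n - 2) * Sxx)),
    where sigma_hat^2 = (Syy - beta_hat * Sxy) / n is the likelihood estimate."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    beta_hat = s_xy / s_xx
    # max(..., 0.0) guards against a tiny negative value caused by rounding.
    sigma_hat = math.sqrt(max(s_yy - beta_hat * s_xy, 0.0) / n)
    half_width = t_crit * sigma_hat * math.sqrt(n / ((n - 2) * s_xx))
    return beta_hat - half_width, beta_hat + half_width


# Hypothetical data; 3.182 is the tabulated value t_{0.025}(3) for a 95% interval.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
print(slope_confidence_interval(xs, ys, 3.182))
```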

In a similar manner one can devise a hypothesis test for $\alpha$ and construct a confidence interval for $\alpha$ using the statistic $Q_{\alpha}$. We leave these to the reader.

Now we give two examples to illustrate how to find the normal regression line and the related quantities.

Example 19.7. The following data show the number of hours, $x$, which ten persons studied for a French test and their scores, $y$, on the test:

x     4    9   10   14    4    7   12   22    1   17
y    31   58   65   73   37   44   60   91   21   84

Find the normal regression line that approximates the regression of test scores on the number of hours studied. Further test the hypothesis $H_o : \beta = 3$ versus $H_a : \beta \neq 3$ at the significance level 0.02.

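Answer: From the above data, we have

$$\sum_{i=1}^{10} x_i = 100, \qquad \sum_{i=1}^{10} x_i^2 = 1376, \qquad \sum_{i=1}^{10} y_i = 564, \qquad \sum_{i=1}^{10} y_i^2 = 36562, \qquad \sum_{i=1}^{10} x_i y_i = 6945,$$

$$S_{xx} = 376, \qquad S_{xy} = 1305, \qquad S_{yy} = 4752.4.$$

Hence

$$\hat{\beta} = \frac{s_{xy}}{s_{xx}} = 3.471 \quad \text{and} \quad \hat{\alpha} = \bar{y} - \hat{\beta}\, \bar{x} = 21.690.$$

Thus the normal regression line is

$$y = 21.690 + 3.471x.$$

This regression line is shown below.

[Figure: regression line $y = 21.690 + 3.471x$]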

Now we test the hypothesis $H_o : \beta = 3$ against $H_a : \beta \neq 3$ at 0.02 level of significance. From the data, the maximum likelihood estimate of $\sigma$ is

$$\begin{aligned}
\hat{\sigma} &= \sqrt{ \frac{1}{n} \left[ S_{yy} - \frac{S_{xy}}{S_{xx}}\, S_{xy} \right] } \\
&= \sqrt{ \frac{1}{n} \left[ S_{yy} - \hat{\beta}\, S_{xy} \right] } \\
&= \sqrt{ \frac{1}{10} \left[ 4752.4 - (3.471)(1305) \right] } \\
&= 4.720
\end{aligned}$$

and

$$|t| = \left| \frac{3.471 - 3}{4.720} \sqrt{ \frac{(8)(376)}{10} } \right| = 1.73.$$

Hence

$$1.73 = |t| < t_{0.01}(8) = 2.896.$$

Thus we do not reject the null hypothesis that $H_o : \beta = 3$ at the significance level 0.02.

This means that we cannot conclude that on the average an extra hour of study will increase the score by more than 3 points.
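The computations of Example 19.7 can be reproduced with a few lines of Python. This sketch is an illustration only and not part of the original text; the variable names are hypothetical. Because it carries $\hat{\beta}$ at full precision, the last digits of $\hat{\alpha}$ and $\hat{\sigma}$ may differ slightly from the rounded values in the text.

```python
import math

# Data of Example 19.7: hours studied (x) and test scores (y).
x = [4, 9, 10, 14, 4, 7, 12, 22, 1, 17]
y = [31, 58, 65, 73, 37, 44, 60, 91, 21, 84]
n = len(x)

# Corrected sums of squares and products.
s_xx = sum(v * v for v in x) - sum(x) ** 2 / n
s_yy = sum(v * v for v in y) - sum(y) ** 2 / n
s_xy = sum(a * b for a, b in zip(x, y)) - sum(x) * sum(y) / n

beta_hat = s_xy / s_xx
alpha_hat = sum(y) / n - beta_hat * sum(x) / n
sigma_hat = math.sqrt((s_yy - beta_hat * s_xy) / n)

# Test statistic for Ho: beta = 3.
t_abs = abs((beta_hat - 3) / sigma_hat) * math.sqrt((n - 2) * s_xx / n)

print(s_xx, s_xy, round(s_yy, 1))
print(round(beta_hat, 3), round(alpha_hat, 3), round(sigma_hat, 3), round(t_abs, 2))
# Compare t_abs with the tabulated value t_{0.01}(8) = 2.896.
print(t_abs < 2.896)
```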

Example 19.8. The frequency of chirping of a cricket is thought to be related to temperature. This suggests the possibility that temperature can be estimated from the chirp frequency. The following data show the number of chirps per second, $x$, by the striped ground cricket and the temperature, $y$, in Fahrenheit:

x    20   16   20   18   17   16   15   17   15   16
y    89   72   93   84   81   75   70   82   69   83

Find the normal regression line that approximates the regression of temperature on the number of chirps per second by the striped ground cricket. Further test the hypothesis $H_o : \beta = 4$ versus $H_a : \beta \neq 4$ at the significance level 0.1.

Answer: From the above data, we have

$$\sum_{i=1}^{10} x_i = 170, \qquad \sum_{i=1}^{10} x_i^2 = 2920, \qquad \sum_{i=1}^{10} y_i = 798, \qquad \sum_{i=1}^{10} y_i^2 = 64270, \qquad \sum_{i=1}^{10} x_i y_i = 13688,$$

$$S_{xx} = 30, \qquad S_{xy} = 122, \qquad S_{yy} = 589.6.$$

Hence

$$\hat{\beta} = \frac{s_{xy}}{s_{xx}} = 4.067 \quad \text{and} \quad \hat{\alpha} = \bar{y} - \hat{\beta}\, \bar{x} = 10.661.$$

Thus the normal regression line is

$$y = 10.661 + 4.067x.$$

This regression line is shown below.

[Figure: regression line $y = 10.661 + 4.067x$]

Now we test the hypothesis $H_o: \beta = 4$ against $H_a: \beta \neq 4$ at the 0.1 level of significance. With $S_{xx} = 30$, $S_{xy} = 122$ and $S_{yy} = 589.6$ computed from the data, the maximum likelihood estimate of $\sigma$ is

$$\hat\sigma = \sqrt{\frac{1}{n}\left[S_{yy} - \frac{S_{xy}}{S_{xx}}\, S_{xy}\right]} = \sqrt{\frac{1}{n}\left[S_{yy} - \hat\beta\, S_{xy}\right]} = \sqrt{\frac{1}{10}\left[589.6 - (4.067)(122)\right]} = 3.057$$

and

$$|t| = \left|\frac{4.067 - 4}{3.057}\right| \sqrt{\frac{(8)(30)}{10}} = 0.107.$$

Hence

$$0.107 = |t| < t_{0.05}(8) = 1.860.$$

Thus we do not reject the null hypothesis $H_o: \beta = 4$ at the significance level 0.1.

Let $\mu_x = \alpha + \beta x$ and write $\hat Y_x = \hat\alpha + \hat\beta x$ for an arbitrary but fixed $x$. Then $\hat Y_x$ is an estimator of $\mu_x$. The following theorem gives various properties of this estimator.

Theorem 19.6. Let $x$ be an arbitrary but fixed real number. Then

(i) $\hat Y_x$ is a linear estimator of $Y_1, Y_2, ..., Y_n$,

(ii) $\hat Y_x$ is an unbiased estimator of $\mu_x$, and

(iii) $Var\left(\hat Y_x\right) = \left[\frac{1}{n} + \frac{(x - \bar x)^2}{S_{xx}}\right] \sigma^2$.

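Proof: First we show that $\hat Y_x$ is a linear estimator of $Y_1, Y_2, ..., Y_n$. Since

$$\begin{aligned}
\hat Y_x &= \hat\alpha + \hat\beta x \\
&= \bar Y - \hat\beta\, \bar x + \hat\beta x \\
&= \bar Y + \hat\beta\, (x - \bar x) \\
&= \bar Y + \sum_{k=1}^{n} \frac{(x_k - \bar x)(x - \bar x)}{S_{xx}}\, Y_k \\
&= \sum_{k=1}^{n} \frac{Y_k}{n} + \sum_{k=1}^{n} \frac{(x_k - \bar x)(x - \bar x)}{S_{xx}}\, Y_k \\
&= \sum_{k=1}^{n} \left[\frac{1}{n} + \frac{(x_k - \bar x)(x - \bar x)}{S_{xx}}\right] Y_k,
\end{aligned}$$

$\hat Y_x$ is a linear estimator of $Y_1, Y_2, ..., Y_n$.

Next, we show that $\hat Y_x$ is an unbiased estimator of $\mu_x$. Since

$$E\left(\hat Y_x\right) = E\left(\hat\alpha + \hat\beta x\right) = E(\hat\alpha) + E(\hat\beta)\, x = \alpha + \beta x = \mu_x,$$

$\hat Y_x$ is an unbiased estimator of $\mu_x$.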

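Finally, we calculate the variance of $\hat Y_x$ using Theorem 19.3. The variance of $\hat Y_x$ is given by

$$\begin{aligned}
Var\left(\hat Y_x\right) &= Var\left(\hat\alpha + \hat\beta x\right) \\
&= Var(\hat\alpha) + Var\left(\hat\beta x\right) + 2\, Cov\left(\hat\alpha, \hat\beta x\right) \\
&= \left[\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\right] \sigma^2 + x^2\, \frac{\sigma^2}{S_{xx}} + 2x\, Cov\left(\hat\alpha, \hat\beta\right) \\
&= \left[\frac{1}{n} + \frac{\bar x^2}{S_{xx}}\right] \sigma^2 + x^2\, \frac{\sigma^2}{S_{xx}} - 2x\, \frac{\bar x\, \sigma^2}{S_{xx}} \\
&= \left[\frac{1}{n} + \frac{(x - \bar x)^2}{S_{xx}}\right] \sigma^2.
\end{aligned}$$

In this computation we have used the fact that

$$Cov\left(\hat\alpha, \hat\beta\right) = -\frac{\bar x\, \sigma^2}{S_{xx}},$$

whose proof is left to the reader as an exercise. The proof of the theorem is now complete.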
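The three claims of Theorem 19.6 can be checked numerically through the weights $c_k = \frac{1}{n} + \frac{(x_k - \bar x)(x - \bar x)}{S_{xx}}$ from part (i). The sketch below is only an illustration: it uses the chirp counts from Example 19.8 as the fixed $x_k$ and the point $x = 14$, and the helper name `fitted_value_weights` is hypothetical. Unbiasedness corresponds to $\sum c_k = 1$ and $\sum c_k x_k = x$, and the variance factor in (iii) equals $\sum c_k^2$.

```python
def fitted_value_weights(xs, x):
    """Weights c_k with Y_hat_x = sum(c_k * Y_k), as in part (i) of the proof."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((xk - xbar) ** 2 for xk in xs)
    weights = [1 / n + (xk - xbar) * (x - xbar) / sxx for xk in xs]
    return weights, xbar, sxx

xs = [20, 16, 20, 18, 17, 16, 15, 17, 15, 16]  # chirps per second (Example 19.8)
x = 14
weights, xbar, sxx = fitted_value_weights(xs, x)

sum_w = sum(weights)                                # equals 1, so alpha is reproduced
sum_wx = sum(w * xk for w, xk in zip(weights, xs))  # equals x, so beta * x is reproduced
var_factor = sum(w * w for w in weights)            # Var(Y_hat_x) / sigma^2
closed_form = 1 / len(xs) + (x - xbar) ** 2 / sxx   # factor stated in part (iii)
print(sum_w, sum_wx, var_factor, closed_form)
```

The check is exact up to floating-point rounding because it involves no simulation; it relies only on the $Y_k$ being independent with common variance $\sigma^2$.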

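By Theorem 19.3, we see that

$$\hat\beta \sim N\left(\beta, \frac{\sigma^2}{S_{xx}}\right) \quad \text{and} \quad \hat\alpha \sim N\left(\alpha, \frac{\sigma^2}{n} + \frac{\bar x^2 \sigma^2}{S_{xx}}\right).$$

Since $\hat Y_x = \hat\alpha + \hat\beta x$, the random variable $\hat Y_x$ is also a normal random variable with mean $\mu_x$ and variance

$$Var\left(\hat Y_x\right) = \left[\frac{1}{n} + \frac{(x - \bar x)^2}{S_{xx}}\right] \sigma^2.$$

Hence standardizing $\hat Y_x$, we have

$$\frac{\hat Y_x - \mu_x}{\sqrt{Var\left(\hat Y_x\right)}} \sim N(0, 1).$$

If $\sigma^2$ is known, then one can take the statistic $Q = \frac{\hat Y_x - \mu_x}{\sqrt{Var\left(\hat Y_x\right)}}$ as a pivotal quantity to construct a confidence interval for $\mu_x$. The $(1 - \gamma)100\%$ confidence interval for $\mu_x$ when $\sigma^2$ is known is given by

$$\left[\hat Y_x - z_{\gamma/2} \sqrt{Var(\hat Y_x)}, \ \hat Y_x + z_{\gamma/2} \sqrt{Var(\hat Y_x)}\right].$$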


Example 19.9. The following data show the number of chirps per second, x, by the striped ground cricket and the temperature, y, in degrees Fahrenheit:

x   20  16  20  18  17  16  15  17  15  16
y   89  72  93  84  81  75  70  82  69  83

What is the 95% confidence interval for $\beta$? What is the 95% confidence interval for $\mu_x$ when $x = 14$ and $\sigma = 3.057$?

Answer: The data are those of Example 19.8, for which

$$n = 10, \quad \hat\beta = 4.067, \quad \hat\sigma = 3.057 \quad \text{and} \quad S_{xx} = 2920 - \frac{170^2}{10} = 30.$$

The $(1 - \gamma)100\%$ confidence interval for $\beta$ is given by

$$\left[\hat\beta - t_{\gamma/2}(n-2)\, \hat\sigma \sqrt{\frac{n}{(n-2)\, S_{xx}}}, \ \hat\beta + t_{\gamma/2}(n-2)\, \hat\sigma \sqrt{\frac{n}{(n-2)\, S_{xx}}}\right].$$

Therefore the 95% confidence interval for $\beta$ is

$$\left[4.067 - t_{0.025}(8)\, (3.057) \sqrt{\frac{10}{(8)(30)}}, \ 4.067 + t_{0.025}(8)\, (3.057) \sqrt{\frac{10}{(8)(30)}}\right]$$

which is

$$[\,4.067 - t_{0.025}(8)\, (0.6240), \ 4.067 + t_{0.025}(8)\, (0.6240)\,].$$

Since from the t-table we have $t_{0.025}(8) = 2.306$, the 95% confidence interval for $\beta$ becomes

$$[\,4.067 - (2.306)(0.6240), \ 4.067 + (2.306)(0.6240)\,]$$

which is [2.628, 5.506].

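If the variance $\sigma^2$ is not known, then we can use the fact that the statistic $U = \frac{n\, \hat\sigma^2}{\sigma^2}$ is chi-square with $n - 2$ degrees of freedom to obtain a pivotal quantity for $\mu_x$. This can be done as follows:

$$Q = \frac{\hat Y_x - \mu_x}{\hat\sigma} \sqrt{\frac{(n-2)\, S_{xx}}{S_{xx} + n\, (x - \bar x)^2}} = \frac{\dfrac{\hat Y_x - \mu_x}{\sqrt{\left[\frac{1}{n} + \frac{(x - \bar x)^2}{S_{xx}}\right] \sigma^2}}}{\sqrt{\dfrac{n\, \hat\sigma^2}{(n-2)\, \sigma^2}}} \sim t(n-2).$$

Using this pivotal quantity one can construct a $(1 - \gamma)100\%$ confidence interval for the mean $\mu_x$ as

$$\left[\hat Y_x - t_{\gamma/2}(n-2)\, \hat\sigma \sqrt{\frac{S_{xx} + n\, (x - \bar x)^2}{(n-2)\, S_{xx}}}, \ \hat Y_x + t_{\gamma/2}(n-2)\, \hat\sigma \sqrt{\frac{S_{xx} + n\, (x - \bar x)^2}{(n-2)\, S_{xx}}}\right].$$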

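Next we determine the 95% confidence interval for $\mu_x$ when $x = 14$ and $\sigma = 3.057$. The $(1 - \gamma)100\%$ confidence interval for $\mu_x$ when $\sigma^2$ is known is given by

$$\left[\hat Y_x - z_{\gamma/2} \sqrt{Var(\hat Y_x)}, \ \hat Y_x + z_{\gamma/2} \sqrt{Var(\hat Y_x)}\right].$$

From the data, we have $\bar x = 17$, $\bar y = 79.8$, $S_{xx} = 30$, $\hat\beta = 4.067$ and $\hat\alpha = \bar y - \hat\beta\, \bar x = 10.661$, so that

$$\hat Y_x = \hat\alpha + \hat\beta x = 10.661 + (4.067)(14) = 67.599$$

and

$$Var\left(\hat Y_x\right) = \left[\frac{1}{10} + \frac{(14 - 17)^2}{30}\right] \sigma^2 = (0.4)(3.057)^2 = 3.738.$$

The 95% confidence interval for $\mu_x$ is given by

$$\left[67.599 - z_{0.025} \sqrt{3.738}, \ 67.599 + z_{0.025} \sqrt{3.738}\right]$$

and since $z_{0.025} = 1.96$ (from the normal table), we have

$$[\,67.599 - (1.96)(1.933), \ 67.599 + (1.96)(1.933)\,]$$

which is [63.81, 71.39].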

We now consider the predictions made by the normal regression equation $\hat Y_x = \hat\alpha + \hat\beta x$. The quantity $\hat Y_x$ gives an estimate of $\mu_x = \alpha + \beta x$. Each time we compute a regression line from a random sample we are observing one possible linear equation in a population consisting of all possible linear equations. Further, the actual value of $Y_x$ that will be observed for a given value of $x$ is normal with mean $\alpha + \beta x$ and variance $\sigma^2$. So the actual observed value will be different from $\mu_x$. Thus, the predicted value $\hat Y_x$ will be in error from two different sources, namely (1) $\hat\alpha$ and $\hat\beta$ are randomly distributed about $\alpha$ and $\beta$, and (2) $Y_x$ is randomly distributed about $\mu_x$.


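Let $y_x$ denote the actual value of $Y_x$ that will be observed for the value $x$ and consider the random variable

$$D = Y_x - \hat\alpha - \hat\beta x.$$

Since $D$ is a linear combination of normal random variables, $D$ is also a normal random variable.

The mean of $D$ is given by

$$E(D) = E(Y_x) - E(\hat\alpha) - x\, E(\hat\beta) = \alpha + \beta x - \alpha - x \beta = 0.$$

Since the future observation $Y_x$ is independent of $\hat\alpha$ and $\hat\beta$, the variance of $D$ is given by

$$\begin{aligned}
Var(D) &= Var\left(Y_x - \hat\alpha - \hat\beta x\right) \\
&= Var(Y_x) + Var(\hat\alpha) + x^2\, Var(\hat\beta) + 2x\, Cov(\hat\alpha, \hat\beta) \\
&= \sigma^2 + \frac{\sigma^2}{n} + \frac{\bar x^2 \sigma^2}{S_{xx}} + x^2\, \frac{\sigma^2}{S_{xx}} - 2x\, \frac{\bar x\, \sigma^2}{S_{xx}} \\
&= \sigma^2 + \frac{\sigma^2}{n} + \frac{(x - \bar x)^2 \sigma^2}{S_{xx}} \\
&= \frac{(n+1)\, S_{xx} + n\, (x - \bar x)^2}{n\, S_{xx}}\, \sigma^2.
\end{aligned}$$

Therefore

$$D \sim N\left(0, \ \frac{(n+1)\, S_{xx} + n\, (x - \bar x)^2}{n\, S_{xx}}\, \sigma^2\right).$$

We standardize $D$ to get

$$Z = \frac{D - 0}{\sqrt{\frac{(n+1)\, S_{xx} + n\, (x - \bar x)^2}{n\, S_{xx}}\, \sigma^2}} \sim N(0, 1).$$

Since in practice the variance of $Y_x$, which is $\sigma^2$, is unknown, we cannot use $Z$ to construct a confidence interval for a predicted value $y_x$.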

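We know that $U = \frac{n\, \hat\sigma^2}{\sigma^2} \sim \chi^2(n-2)$. By Theorem 14.6, the statistic $\frac{Z}{\sqrt{U/(n-2)}} \sim t(n-2)$. Hence

$$\begin{aligned}
Q &= \frac{y_x - \hat\alpha - \hat\beta x}{\hat\sigma} \sqrt{\frac{(n-2)\, S_{xx}}{(n+1)\, S_{xx} + n\, (x - \bar x)^2}} \\
&= \frac{\dfrac{y_x - \hat\alpha - \hat\beta x}{\sqrt{\frac{(n+1)\, S_{xx} + n\, (x - \bar x)^2}{n\, S_{xx}}\, \sigma^2}}}{\sqrt{\dfrac{n\, \hat\sigma^2}{(n-2)\, \sigma^2}}} \\
&= \frac{\dfrac{D - 0}{\sqrt{Var(D)}}}{\sqrt{\dfrac{n\, \hat\sigma^2}{(n-2)\, \sigma^2}}} \\
&= \frac{Z}{\sqrt{\dfrac{U}{n-2}}} \sim t(n-2).
\end{aligned}$$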

The statistic $Q$ is a pivotal quantity for the predicted value $y_x$ and one can use it to construct a $(1 - \gamma)100\%$ confidence interval for $y_x$. The $(1 - \gamma)100\%$ confidence interval, $[a, b]$, for $y_x$ is given by

$$1 - \gamma = P\left(-t_{\gamma/2}(n-2) \leq Q \leq t_{\gamma/2}(n-2)\right) = P(a \leq y_x \leq b),$$

where

$$a = \hat\alpha + \hat\beta x - t_{\gamma/2}(n-2)\, \hat\sigma \sqrt{\frac{(n+1)\, S_{xx} + n\, (x - \bar x)^2}{(n-2)\, S_{xx}}}$$

and

$$b = \hat\alpha + \hat\beta x + t_{\gamma/2}(n-2)\, \hat\sigma \sqrt{\frac{(n+1)\, S_{xx} + n\, (x - \bar x)^2}{(n-2)\, S_{xx}}}.$$

This confidence interval for $y_x$ is usually known as the prediction interval for the predicted value $y_x$ based on the given $x$. The prediction interval represents an interval that has a probability equal to $1 - \gamma$ of containing not a parameter but a future value $y_x$ of the random variable $Y_x$. In many instances the prediction interval is more relevant to a scientist or engineer than the confidence interval on the mean $\mu_x$.
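To see how much wider a prediction interval is than the confidence interval for the mean, both half-widths can be computed side by side. The sketch below starts from $Var(\hat Y_x) = \left[\frac{1}{n} + \frac{(x - \bar x)^2}{S_{xx}}\right]\sigma^2$ and $Var(D) = \sigma^2 + \frac{\sigma^2}{n} + \frac{(x - \bar x)^2 \sigma^2}{S_{xx}}$ and scales by $\sqrt{n/(n-2)}$ to account for $\hat\sigma$ being the maximum likelihood estimate. The inputs are the summary values of the test-score example ($n = 10$, $\bar x = 10$, $S_{xx} = 376$, $\hat\sigma = 4.720$, line $21.690 + 3.471x$) with the table value $t_{0.025}(8) = 2.306$; the point $x = 12$ and the function names are assumptions made purely for illustration.

```python
import math

def mean_response_halfwidth(t_crit, sigma_hat, n, sxx, xbar, x):
    """Half-width of the t-based confidence interval for mu_x."""
    return t_crit * sigma_hat * math.sqrt((sxx + n * (x - xbar) ** 2) / ((n - 2) * sxx))

def prediction_halfwidth(t_crit, sigma_hat, n, sxx, xbar, x):
    """Half-width of the prediction interval for a future observation y_x."""
    return t_crit * sigma_hat * math.sqrt(((n + 1) * sxx + n * (x - xbar) ** 2) / ((n - 2) * sxx))

N, XBAR, SXX = 10, 10.0, 376.0
ALPHA_HAT, BETA_HAT, SIGMA_HAT = 21.690, 3.471, 4.720
T_CRIT = 2.306   # t_{0.025}(8), from a t-table
x_new = 12.0     # hypothetical number of hours studied

center = ALPHA_HAT + BETA_HAT * x_new
mw = mean_response_halfwidth(T_CRIT, SIGMA_HAT, N, SXX, XBAR, x_new)
pw = prediction_halfwidth(T_CRIT, SIGMA_HAT, N, SXX, XBAR, x_new)
print("mean response:", (center - mw, center + mw))
print("prediction:   ", (center - pw, center + pw))
```

The squared half-widths differ by exactly $t^2 \hat\sigma^2\, n/(n-2)$, which is the contribution of the extra $\sigma^2$ term in $Var(D)$.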

Example 19.10. The following data show the number of chirps per second, x, by the striped ground cricket and the temperature, y, in degrees Fahrenheit:

x   20  16  20  18  17  16  15  17  15  16
y   89  72  93  84  81  75  70  82  69  83

What is the 95% prediction interval for $y_x$ when $x = 14$?

Answer: The data are those of Example 19.8, for which $\bar x = 17$, $\bar y = 79.8$ and

$$n = 10, \quad \hat\beta = 4.067, \quad \hat\alpha = \bar y - \hat\beta\, \bar x = 10.661, \quad \hat\sigma = 3.057 \quad \text{and} \quad S_{xx} = 30.$$

Thus the normal regression line is

$$y_x = 10.661 + 4.067x.$$

Since $x = 14$, the corresponding predicted value $y_x$ is given by

$$y_x = 10.661 + (4.067)(14) = 67.599.$$

Therefore

$$\begin{aligned}
a &= \hat\alpha + \hat\beta x - t_{\gamma/2}(n-2)\, \hat\sigma \sqrt{\frac{(n+1)\, S_{xx} + n\, (x - \bar x)^2}{(n-2)\, S_{xx}}} \\
&= 67.599 - t_{0.025}(8)\, (3.057) \sqrt{\frac{(11)(30) + (10)(14 - 17)^2}{(8)(30)}} \\
&= 67.599 - (2.306)(3.057)(1.3229) \\
&= 58.27.
\end{aligned}$$

Similarly

$$\begin{aligned}
b &= \hat\alpha + \hat\beta x + t_{\gamma/2}(n-2)\, \hat\sigma \sqrt{\frac{(n+1)\, S_{xx} + n\, (x - \bar x)^2}{(n-2)\, S_{xx}}} \\
&= 67.599 + (2.306)(3.057)(1.3229) \\
&= 76.92.
\end{aligned}$$

Hence the 95% prediction interval for $y_x$ when $x = 14$ is [58.27, 76.92].

19.3. The Correlation Analysis

In the first two sections of this chapter, we examined the regression problem and did an in-depth study of the least squares and the normal regression analysis. In the regression analysis, we assumed that the values of $X$ are not random variables, but are fixed. However, the values of $Y_x$ for a given value of $x$ are randomly distributed about $E(Y_x) = \mu_x = \alpha + \beta x$. Further, letting $\varepsilon$ be a random variable with $E(\varepsilon) = 0$ and $Var(\varepsilon) = \sigma^2$, one can model the so-called regression problem by

$$Y_x = \alpha + \beta x + \varepsilon.$$

In this section, we examine the correlation problem. Unlike the regression problem, here both $X$ and $Y$ are random variables and the correlation problem can be modeled by

$$E(Y) = \alpha + \beta\, E(X).$$

From an experimental point of view this means that we are observing a random vector $(X, Y)$ drawn from some bivariate population.

Recall that if $(X, Y)$ is a bivariate random variable then the correlation coefficient $\rho$ is defined as

$$\rho = \frac{E\left((X - \mu_X)(Y - \mu_Y)\right)}{\sqrt{E\left((X - \mu_X)^2\right)\, E\left((Y - \mu_Y)^2\right)}}$$

where $\mu_X$ and $\mu_Y$ are the means of the random variables $X$ and $Y$, respectively.

Definition 19.1. If $(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)$ is a random sample from a bivariate population, then the sample correlation coefficient is defined as

$$R = \frac{\sum_{i=1}^{n} (X_i - \bar X)(Y_i - \bar Y)}{\sqrt{\sum_{i=1}^{n} (X_i - \bar X)^2}\, \sqrt{\sum_{i=1}^{n} (Y_i - \bar Y)^2}}.$$

The corresponding quantity computed from data $(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)$ will be denoted by $r$ and it is an estimate of the correlation coefficient $\rho$.

Now we give a geometrical interpretation of the sample correlation coefficient based on a paired data set $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$. We can associate this data set with two vectors $\vec x = (x_1, x_2, ..., x_n)$ and $\vec y = (y_1, y_2, ..., y_n)$ in $\mathbb{R}^n$. Let $L$ be the subset $\{\lambda\, \vec e \mid \lambda \in \mathbb{R}\}$ of $\mathbb{R}^n$, where $\vec e = (1, 1, ..., 1) \in \mathbb{R}^n$. Consider the linear space $V$ given by $\mathbb{R}^n$ modulo $L$, that is $V = \mathbb{R}^n / L$. The linear space $V$ is illustrated in the figure below for $n = 2$.

[Figure: illustration of the linear space V for n = 2]

We denote the equivalence class associated with the vector $\vec x$ by $[\vec x]$. In the linear space $V$ it can be shown that the points $(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)$ are collinear if and only if the vectors $[\vec x]$ and $[\vec y]$ in $V$ are proportional. We define an inner product on this linear space $V$ by

$$\langle [\vec x], [\vec y] \rangle = \sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y).$$

Then the angle $\theta$ between the vectors $[\vec x]$ and $[\vec y]$ is given by

$$\cos(\theta) = \frac{\langle [\vec x], [\vec y] \rangle}{\sqrt{\langle [\vec x], [\vec x] \rangle}\, \sqrt{\langle [\vec y], [\vec y] \rangle}}$$

which is

$$\cos(\theta) = \frac{\sum_{i=1}^{n} (x_i - \bar x)(y_i - \bar y)}{\sqrt{\sum_{i=1}^{n} (x_i - \bar x)^2}\, \sqrt{\sum_{i=1}^{n} (y_i - \bar y)^2}} = r.$$

Thus the sample correlation coefficient $r$ can be interpreted geometrically as the cosine of the angle between the vectors $[\vec x]$ and $[\vec y]$. From this viewpoint the following theorem is obvious.


Theorem 19.7. The sample correlation coefficient $r$ satisfies the inequality

$$-1 \leq r \leq 1.$$

The sample correlation coefficient $r = \pm 1$ if and only if the set of points $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$ for $n \geq 3$ are collinear.

To do some statistical analysis, we assume that the paired data is a random sample of size $n$ from a bivariate normal population $(X, Y) \sim BVN(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$. Then the conditional distribution of the random variable $Y$ given $X = x$ is normal, that is

$$Y|_x \sim N\left(\mu_2 + \rho\, \frac{\sigma_2}{\sigma_1}\, (x - \mu_1), \ \sigma_2^2\, (1 - \rho^2)\right).$$

This can be viewed as a normal regression model $E(Y|_x) = \alpha + \beta x$ where $\alpha = \mu_2 - \rho\, \frac{\sigma_2}{\sigma_1}\, \mu_1$, $\beta = \rho\, \frac{\sigma_2}{\sigma_1}$, and $Var(Y|_x) = \sigma_2^2\, (1 - \rho^2)$.

Since $\beta = \rho\, \frac{\sigma_2}{\sigma_1}$, if $\rho = 0$, then $\beta = 0$. Hence the null hypothesis $H_o: \rho = 0$ is equivalent to $H_o: \beta = 0$. In the previous section, we devised a hypothesis test for testing $H_o: \beta = \beta_o$ against $H_a: \beta \neq \beta_o$. This hypothesis test, at significance level $\gamma$, is "Reject $H_o: \beta = \beta_o$ if $|t| \geq t_{\gamma/2}(n-2)$", where

$$t = \frac{\hat\beta - \beta_o}{\hat\sigma} \sqrt{\frac{(n-2)\, S_{xx}}{n}}.$$

If $\beta = 0$, then we have

$$t = \frac{\hat\beta}{\hat\sigma} \sqrt{\frac{(n-2)\, S_{xx}}{n}}. \qquad (10)$$

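Now we express $t$ in terms of the sample correlation coefficient $r$. Recall that

$$\hat\beta = \frac{S_{xy}}{S_{xx}}, \qquad (11)$$

$$\hat\sigma^2 = \frac{1}{n}\left[S_{yy} - \frac{S_{xy}}{S_{xx}}\, S_{xy}\right], \qquad (12)$$

and

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}}. \qquad (13)$$

Now using (11), (12), and (13), we compute

$$\begin{aligned}
t &= \frac{\hat\beta}{\hat\sigma} \sqrt{\frac{(n-2)\, S_{xx}}{n}} \\
&= \frac{S_{xy}}{S_{xx}} \cdot \frac{\sqrt{n}}{\sqrt{S_{yy} - \frac{S_{xy}}{S_{xx}}\, S_{xy}}} \cdot \sqrt{\frac{(n-2)\, S_{xx}}{n}} \\
&= \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}} \cdot \frac{1}{\sqrt{1 - \frac{S_{xy}}{S_{xx}}\, \frac{S_{xy}}{S_{yy}}}} \cdot \sqrt{n-2} \\
&= \sqrt{n-2}\; \frac{r}{\sqrt{1 - r^2}}.
\end{aligned}$$

Hence the test of the null hypothesis $H_o: \rho = 0$ against $H_a: \rho \neq 0$, at significance level $\gamma$, is "Reject $H_o: \rho = 0$ if $|t| \geq t_{\gamma/2}(n-2)$", where $t = \sqrt{n-2}\; \frac{r}{\sqrt{1 - r^2}}$.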

The above test does not extend to testing values of $\rho$ other than $\rho = 0$. However, tests for nonzero values of $\rho$ can be achieved by the following result.

Theorem 19.8. Let $(X_1, Y_1), (X_2, Y_2), ..., (X_n, Y_n)$ be a random sample from a bivariate normal population $(X, Y) \sim BVN(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2, \rho)$. If

$$V = \frac{1}{2} \ln\left(\frac{1 + R}{1 - R}\right) \quad \text{and} \quad m = \frac{1}{2} \ln\left(\frac{1 + \rho}{1 - \rho}\right),$$

then

$$Z = \sqrt{n - 3}\; (V - m) \to N(0, 1) \quad \text{as } n \to \infty.$$

This theorem says that the statistic $V$ is approximately normal with mean $m$ and variance $\frac{1}{n-3}$ when $n$ is large. This statistic can be used to devise a hypothesis test for nonzero values of $\rho$. Hence the test of the null hypothesis $H_o: \rho = \rho_o$ against $H_a: \rho \neq \rho_o$, at significance level $\gamma$, is "Reject $H_o: \rho = \rho_o$ if $|z| \geq z_{\gamma/2}$", where $z = \sqrt{n - 3}\; (V - m_o)$ and $m_o = \frac{1}{2} \ln\left(\frac{1 + \rho_o}{1 - \rho_o}\right)$.
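The test based on Theorem 19.8 takes only a few lines to compute. The sketch below is illustrative: the inputs $r = 0.782$, $n = 6$ and $\rho_o = 0.5$ are hypothetical choices, $z_{0.025} = 1.96$ is the usual normal table value, and the function name is made up. Since the theorem is an asymptotic statement, such a small $n$ serves only to show the arithmetic.

```python
import math

def fisher_z_statistic(r, n, rho0):
    """z = sqrt(n - 3) * (V - m_o), with V and m_o as in Theorem 19.8."""
    v = 0.5 * math.log((1 + r) / (1 - r))
    m0 = 0.5 * math.log((1 + rho0) / (1 - rho0))
    return math.sqrt(n - 3) * (v - m0)

z = fisher_z_statistic(r=0.782, n=6, rho0=0.5)  # hypothetical inputs
Z_CRIT = 1.96                                   # z_{0.025}, from the normal table
reject_h0 = abs(z) >= Z_CRIT
print(round(z, 3), reject_h0)
```

When $\rho_o = 0$ the term $m_o$ vanishes, so the same function also gives a large-sample alternative to the t-test above.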

Example 19.11. The following data were obtained in a study of the relationship between the weight and chest size of infants at birth:

x, weight in kg        2.76  2.17  5.53  4.31  2.30  3.70
y, chest size in cm    29.5  26.3  36.6  27.8  28.3  28.6

Determine the sample correlation coefficient $r$ and then test the null hypothesis $H_o: \rho = 0$ against the alternative hypothesis $H_a: \rho \neq 0$ at a significance level 0.01.

Answer: From the above data we find that

$$\bar x = 3.46 \quad \text{and} \quad \bar y = 29.51.$$

Next, we compute $S_{xx}$, $S_{yy}$ and $S_{xy}$ using a tabular representation.

x - x̄     y - ȳ     (x - x̄)(y - ȳ)   (x - x̄)²   (y - ȳ)²
-0.70     -0.01       0.007          0.490      0.000
-1.29     -3.21       4.141          1.664     10.304
 2.07      7.09      14.676          4.285     50.268
 0.85     -1.71      -1.453          0.722      2.924
-1.16     -1.21       1.404          1.346      1.464
 0.24     -0.91      -0.218          0.058      0.828
                 Sxy = 18.557   Sxx = 8.565   Syy = 65.788

Hence, the correlation coefficient $r$ is given by

$$r = \frac{S_{xy}}{\sqrt{S_{xx}\, S_{yy}}} = \frac{18.557}{\sqrt{(8.565)(65.788)}} = 0.782.$$

The computed $t$ value is given by

$$t = \sqrt{n-2}\; \frac{r}{\sqrt{1 - r^2}} = \sqrt{(6 - 2)}\; \frac{0.782}{\sqrt{1 - (0.782)^2}} = 2.509.$$

From the t-table we have $t_{0.005}(4) = 4.604$. Since

$$2.509 = |t| \not\geq t_{0.005}(4) = 4.604$$

we do not reject the null hypothesis $H_o: \rho = 0$.
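The tabular computation above can be reproduced directly from the raw data. The following is a minimal sketch using only the six data pairs and the table value $t_{0.005}(4) = 4.604$ from this example; the function name is an assumption. Because it keeps unrounded means, its $t$ value can differ from the hand computation in the third decimal place.

```python
import math

def correlation_t_test(xs, ys):
    """Return the sample correlation r and t = sqrt(n - 2) * r / sqrt(1 - r^2)."""
    n = len(xs)
    xbar = sum(xs) / n
    ybar = sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    syy = sum((y - ybar) ** 2 for y in ys)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)
    t = math.sqrt(n - 2) * r / math.sqrt(1 - r * r)
    return r, t

weights = [2.76, 2.17, 5.53, 4.31, 2.30, 3.70]  # x, weight in kg
chests = [29.5, 26.3, 36.6, 27.8, 28.3, 28.6]   # y, chest size in cm
r, t = correlation_t_test(weights, chests)
T_CRIT = 4.604                                  # t_{0.005}(4), from a t-table
reject_h0 = abs(t) >= T_CRIT
print(round(r, 3), round(t, 3), reject_h0)
```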

19.4. Review Exercises

1. Let $Y_1, Y_2, ..., Y_n$ be $n$ independent random variables such that each $Y_i \sim N(\beta x_i, \sigma^2)$, where both $\beta$ and $\sigma^2$ are unknown parameters. If $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$ is a data set where $y_1, y_2, ..., y_n$ are the observed values based on $x_1, x_2, ..., x_n$, then find the maximum likelihood estimators $\hat\beta$ and $\hat\sigma^2$ of $\beta$ and $\sigma^2$.

2. Let $Y_1, Y_2, ..., Y_n$ be $n$ independent random variables such that each $Y_i \sim N(\beta x_i, \sigma^2)$, where both $\beta$ and $\sigma^2$ are unknown parameters. If $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$ is a data set where $y_1, y_2, ..., y_n$ are the observed values based on $x_1, x_2, ..., x_n$, then show that the maximum likelihood estimator $\hat\beta$ is normally distributed. What are the mean and variance of $\hat\beta$?

3. Let $Y_1, Y_2, ..., Y_n$ be $n$ independent random variables such that each $Y_i \sim N(\beta x_i, \sigma^2)$, where both $\beta$ and $\sigma^2$ are unknown parameters. If $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$ is a data set where $y_1, y_2, ..., y_n$ are the observed values based on $x_1, x_2, ..., x_n$, then find an unbiased estimator $\hat\sigma^2$ of $\sigma^2$ and then find a constant $k$ such that $k\, \hat\sigma^2 \sim \chi^2(2n)$.

4. Let $Y_1, Y_2, ..., Y_n$ be $n$ independent random variables such that each $Y_i \sim N(\beta x_i, \sigma^2)$, where both $\beta$ and $\sigma^2$ are unknown parameters. If $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$ is a data set where $y_1, y_2, ..., y_n$ are the observed values based on $x_1, x_2, ..., x_n$, then find a pivotal quantity for $\beta$ and using this pivotal quantity construct a $(1 - \gamma)100\%$ confidence interval for $\beta$.

5. Let $Y_1, Y_2, ..., Y_n$ be $n$ independent random variables such that each $Y_i \sim N(\beta x_i, \sigma^2)$, where both $\beta$ and $\sigma^2$ are unknown parameters. If $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$ is a data set where $y_1, y_2, ..., y_n$ are the observed values based on $x_1, x_2, ..., x_n$, then find a pivotal quantity for $\sigma^2$ and using this pivotal quantity construct a $(1 - \gamma)100\%$ confidence interval for $\sigma^2$.

6. Let $Y_1, Y_2, ..., Y_n$ be $n$ independent random variables such that each $Y_i \sim EXP(\beta x_i)$, where $\beta$ is an unknown parameter. If $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$ is a data set where $y_1, y_2, ..., y_n$ are the observed values based on $x_1, x_2, ..., x_n$, then find the maximum likelihood estimator $\hat\beta$ of $\beta$.

7. Let $Y_1, Y_2, ..., Y_n$ be $n$ independent random variables such that each $Y_i \sim EXP(\beta x_i)$, where $\beta$ is an unknown parameter. If $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$ is a data set where $y_1, y_2, ..., y_n$ are the observed values based on $x_1, x_2, ..., x_n$, then find the least squares estimator $\hat\beta$ of $\beta$.

8. Let $Y_1, Y_2, ..., Y_n$ be $n$ independent random variables such that each $Y_i \sim POI(\beta x_i)$, where $\beta$ is an unknown parameter. If $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$ is a data set where $y_1, y_2, ..., y_n$ are the observed values based on $x_1, x_2, ..., x_n$, then find the maximum likelihood estimator $\hat\beta$ of $\beta$.

9. Let $Y_1, Y_2, ..., Y_n$ be $n$ independent random variables such that each $Y_i \sim POI(\beta x_i)$, where $\beta$ is an unknown parameter. If $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$ is a data set where $y_1, y_2, ..., y_n$ are the observed values based on $x_1, x_2, ..., x_n$, then find the least squares estimator $\hat\beta$ of $\beta$.

10. Let $Y_1, Y_2, ..., Y_n$ be $n$ independent random variables such that each $Y_i \sim POI(\beta x_i)$, where $\beta$ is an unknown parameter. If $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$ is a data set where $y_1, y_2, ..., y_n$ are the observed values based on $x_1, x_2, ..., x_n$, show that the least squares estimator and the maximum likelihood estimator of $\beta$ are both unbiased estimators of $\beta$.

11. Let $Y_1, Y_2, ..., Y_n$ be $n$ independent random variables such that each $Y_i \sim POI(\beta x_i)$, where $\beta$ is an unknown parameter. If $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$ is a data set where $y_1, y_2, ..., y_n$ are the observed values based on $x_1, x_2, ..., x_n$, then find the variances of both the least squares estimator and the maximum likelihood estimator of $\beta$.

12. Given the five pairs of points (x, y) shown below:

x   10      20     30     40     50
y   50.071  0.078  0.112  0.120  0.131

What curve of the form $y = a + bx + cx^2$ best fits the data by the method of least squares?

13. Given the five pairs of points (x, y) shown below:

x   4   7   9   10  11
y   10  16  22  20  25

What curve of the form $y = a + bx$ best fits the data by the method of least squares?

14. The following data were obtained from the grades of six students selected at random:

Mathematics Grade, x   72  94  82  74  65  85
English Grade, y       76  86  65  89  80  92

Find the sample correlation coefficient $r$ and then test the null hypothesis $H_o: \rho = 0$ against the alternative hypothesis $H_a: \rho \neq 0$ at a significance level 0.01.

15. Given a set of data $\{(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)\}$, what is the least squares estimate of $\alpha$ if $y = \alpha$ is fitted to this data set?

16. Given a set of data points $\{(2, 3), (4, 6), (5, 7)\}$, what curve of the form $y = \alpha + \beta x^2$ best fits the data by the method of least squares?

17. Given a data set $\{(1, 1), (2, 1), (2, 3), (3, 2), (4, 3)\}$ and $Y_x \sim N(\alpha + \beta x, \sigma^2)$, find the point estimate of $\sigma^2$ and then construct a 90% confidence interval for $\sigma$.

18. For the data set $\{(1, 1), (2, 1), (2, 3), (3, 2), (4, 3)\}$ determine the correlation coefficient $r$. Test the null hypothesis $H_o: \rho = 0$ versus $H_a: \rho \neq 0$ at a significance level 0.01.


Chapter 20

ANALYSIS OF VARIANCE

In Chapter 19, we examined how a quantitative independent variable x can be used for predicting the value of a quantitative dependent variable y. In this chapter we would like to examine whether one or more independent (or predictor) variables affect a dependent (or response) variable y. This chapter differs from the last chapter because the independent variable may now be either quantitative or qualitative. It also differs from the last chapter in assuming that the response measurements were obtained for specific settings of the independent variables. Selecting the settings of the independent variables is another aspect of experimental design. It enables us to tell whether changes in the independent variables cause changes in the mean response and it permits us to analyze the data using a method known as analysis of variance (or ANOVA). Sir Ronald Aylmer Fisher (1890-1962) developed the analysis of variance in the 1920s and used it to analyze data from agricultural experiments.

The ANOVA investigates independent measurements from several treatments or levels of one or more factors (that is, the predictor variables). The technique of ANOVA consists of partitioning the total sum of squares into component sums of squares due to different factors and the error. For instance, suppose there are Q factors. Then the total sum of squares ($SS_T$) is partitioned as

$$SS_T = SS_A + SS_B + \cdots + SS_Q + SS_{Error},$$

where $SS_A$, $SS_B$, ..., and $SS_Q$ represent the sums of squares associated with the factors A, B, ..., and Q, respectively. If the ANOVA involves only one factor, then it is called one-way analysis of variance. Similarly, if it involves two factors, then it is called two-way analysis of variance. If it involves more than two factors, then the corresponding ANOVA is called higher order analysis of variance. In this chapter we only treat the one-way analysis of variance.

The analysis of variance is a special case of the linear models that represent the relationship between a continuous response variable y and one or more predictor variables (either continuous or categorical) in the form

$$y = X \beta + \epsilon \qquad (1)$$

where $y$ is an $m \times 1$ vector of observations of the response variable, $X$ is the $m \times n$ design matrix determined by the predictor variables, $\beta$ is an $n \times 1$ vector of parameters, and $\epsilon$ is an $m \times 1$ vector of random errors (or disturbances) independent of each other and having distribution.

20.1. One-Way Analysis of Variance with Equal Sample Sizes

The standard model of one-way ANOVA is given by

$$Y_{ij} = \mu_i + \epsilon_{ij} \quad \text{for } i = 1, 2, ..., m, \ j = 1, 2, ..., n, \qquad (2)$$

where $m \geq 2$ and $n \geq 2$. In this model, we assume that each random variable

$$Y_{ij} \sim N(\mu_i, \sigma^2) \quad \text{for } i = 1, 2, ..., m, \ j = 1, 2, ..., n. \qquad (3)$$

Note that because of (3), each $\epsilon_{ij}$ in model (2) is normally distributed with mean zero and variance $\sigma^2$.

We are given $m$ independent samples, each of size $n$, where the members of the $i$th sample, $Y_{i1}, Y_{i2}, ..., Y_{in}$, are normal random variables with mean $\mu_i$ and unknown variance $\sigma^2$. That is,

$$Y_{ij} \sim N\left(\mu_i, \sigma^2\right), \quad i = 1, 2, ..., m, \ j = 1, 2, ..., n.$$

We will be interested in testing the null hypothesis

$$H_o: \mu_1 = \mu_2 = \cdots = \mu_m = \mu$$

against the alternative hypothesis

$$H_a: \text{not all the means are equal.}$$

In the following theorem we present the maximum likelihood estimators of the parameters $\mu_1, \mu_2, ..., \mu_m$ and $\sigma^2$.

Theorem 20.1. Suppose the one-way ANOVA model is given by the equation (2) where the $\epsilon_{ij}$'s are independent and normally distributed random variables with mean zero and variance $\sigma^2$ for $i = 1, 2, ..., m$ and $j = 1, 2, ..., n$. Then the MLE's of the parameters $\mu_i$ ($i = 1, 2, ..., m$) and $\sigma^2$ of the model are given by

$$\hat\mu_i = \bar Y_{i\bullet}, \quad i = 1, 2, ..., m,$$

$$\hat\sigma^2 = \frac{1}{nm}\, SS_W,$$

where $\bar Y_{i\bullet} = \frac{1}{n} \sum_{j=1}^{n} Y_{ij}$ and $SS_W = \sum_{i=1}^{m} \sum_{j=1}^{n} \left(Y_{ij} - \bar Y_{i\bullet}\right)^2$ is the within samples sum of squares.

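Proof: The likelihood function is given by

$$L(\mu_1, \mu_2, ..., \mu_m, \sigma^2) = \prod_{i=1}^{m} \prod_{j=1}^{n} \left\{\frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(Y_{ij} - \mu_i)^2}{2\sigma^2}}\right\} = \left(\frac{1}{\sqrt{2\pi\sigma^2}}\right)^{nm} e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{m} \sum_{j=1}^{n} (Y_{ij} - \mu_i)^2}.$$

Taking the natural logarithm of the likelihood function $L$, we obtain

$$\ln L(\mu_1, \mu_2, ..., \mu_m, \sigma^2) = -\frac{nm}{2} \ln(2\pi\sigma^2) - \frac{1}{2\sigma^2} \sum_{i=1}^{m} \sum_{j=1}^{n} (Y_{ij} - \mu_i)^2. \qquad (4)$$

Now taking the partial derivative of (4) with respect to $\mu_1, \mu_2, ..., \mu_m$ and $\sigma^2$, we get

$$\frac{\partial \ln L}{\partial \mu_i} = \frac{1}{\sigma^2} \sum_{j=1}^{n} (Y_{ij} - \mu_i) \qquad (5)$$

and

$$\frac{\partial \ln L}{\partial \sigma^2} = -\frac{nm}{2\sigma^2} + \frac{1}{2\sigma^4} \sum_{i=1}^{m} \sum_{j=1}^{n} (Y_{ij} - \mu_i)^2. \qquad (6)$$

Equating these partial derivatives to zero and solving for $\mu_i$ and $\sigma^2$, respectively, we have

$$\mu_i = \bar Y_{i\bullet}, \quad i = 1, 2, ..., m,$$

$$\sigma^2 = \frac{1}{nm} \sum_{i=1}^{m} \sum_{j=1}^{n} \left(Y_{ij} - \bar Y_{i\bullet}\right)^2,$$

where

$$\bar Y_{i\bullet} = \frac{1}{n} \sum_{j=1}^{n} Y_{ij}.$$

It can be checked that these solutions yield the maximum of the likelihood function and we leave this verification to the reader. Thus the maximum likelihood estimators of the model parameters are given by

$$\hat\mu_i = \bar Y_{i\bullet}, \quad i = 1, 2, ..., m,$$

$$\hat\sigma^2 = \frac{1}{nm}\, SS_W,$$

where $SS_W = \sum_{i=1}^{m} \sum_{j=1}^{n} \left(Y_{ij} - \bar Y_{i\bullet}\right)^2$. The proof of the theorem is now complete.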

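Define

$$\bar Y_{\bullet\bullet} = \frac{1}{nm} \sum_{i=1}^{m} \sum_{j=1}^{n} Y_{ij}. \qquad (7)$$

Further, define

$$SS_T = \sum_{i=1}^{m} \sum_{j=1}^{n} \left(Y_{ij} - \bar Y_{\bullet\bullet}\right)^2, \qquad (8)$$

$$SS_W = \sum_{i=1}^{m} \sum_{j=1}^{n} \left(Y_{ij} - \bar Y_{i\bullet}\right)^2, \qquad (9)$$

and

$$SS_B = \sum_{i=1}^{m} \sum_{j=1}^{n} \left(\bar Y_{i\bullet} - \bar Y_{\bullet\bullet}\right)^2. \qquad (10)$$

Here $SS_T$ is the total sum of squares, $SS_W$ is the within sum of squares, and $SS_B$ is the between sum of squares.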

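Next we consider the partitioning of the total sum of squares. The following lemma gives us such a partition.

Lemma 20.1. The total sum of squares is equal to the sum of the within and between sums of squares, that is

$$SS_T = SS_W + SS_B. \qquad (11)$$

Proof: Rewriting (8) we have

$$\begin{aligned}
SS_T &= \sum_{i=1}^{m} \sum_{j=1}^{n} \left(Y_{ij} - \bar Y_{\bullet\bullet}\right)^2 \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n} \left[(Y_{ij} - \bar Y_{i\bullet}) + (\bar Y_{i\bullet} - \bar Y_{\bullet\bullet})\right]^2 \\
&= \sum_{i=1}^{m} \sum_{j=1}^{n} (Y_{ij} - \bar Y_{i\bullet})^2 + \sum_{i=1}^{m} \sum_{j=1}^{n} (\bar Y_{i\bullet} - \bar Y_{\bullet\bullet})^2 + 2 \sum_{i=1}^{m} \sum_{j=1}^{n} (Y_{ij} - \bar Y_{i\bullet})(\bar Y_{i\bullet} - \bar Y_{\bullet\bullet}) \\
&= SS_W + SS_B + 2 \sum_{i=1}^{m} \sum_{j=1}^{n} (Y_{ij} - \bar Y_{i\bullet})(\bar Y_{i\bullet} - \bar Y_{\bullet\bullet}).
\end{aligned}$$

The cross-product term vanishes, that is

$$\sum_{i=1}^{m} \sum_{j=1}^{n} (Y_{ij} - \bar Y_{i\bullet})(\bar Y_{i\bullet} - \bar Y_{\bullet\bullet}) = \sum_{i=1}^{m} (\bar Y_{i\bullet} - \bar Y_{\bullet\bullet}) \sum_{j=1}^{n} (Y_{ij} - \bar Y_{i\bullet}) = 0,$$

because $\sum_{j=1}^{n} (Y_{ij} - \bar Y_{i\bullet}) = 0$ for each $i$. Hence we obtain the asserted result $SS_T = SS_W + SS_B$ and the proof of the lemma is complete.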
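The partition in Lemma 20.1 is easy to confirm on a small data set. The sketch below uses three made-up samples of size three (so $m = 3$, $n = 3$); the data and the function name `anova_sums` are purely illustrative.

```python
def anova_sums(groups):
    """Return (SS_T, SS_W, SS_B) for m samples of equal size n."""
    m = len(groups)
    n = len(groups[0])
    grand_mean = sum(sum(g) for g in groups) / (m * n)
    group_means = [sum(g) / n for g in groups]
    ss_t = sum((y - grand_mean) ** 2 for g in groups for y in g)
    ss_w = sum((y - gm) ** 2 for g, gm in zip(groups, group_means) for y in g)
    # The inner sum over j in (10) just repeats each term n times.
    ss_b = sum(n * (gm - grand_mean) ** 2 for gm in group_means)
    return ss_t, ss_w, ss_b

groups = [[4.0, 5.0, 6.0], [7.0, 9.0, 8.0], [1.0, 2.0, 3.0]]  # hypothetical data
ss_t, ss_w, ss_b = anova_sums(groups)
print(ss_t, ss_w, ss_b, abs(ss_t - (ss_w + ss_b)) < 1e-9)
```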

The following theorem is a technical result and is needed for testing the null hypothesis against the alternative hypothesis.

Theorem 20.2. Consider the ANOVA model

$$Y_{ij} = \mu_i + \epsilon_{ij}, \quad i = 1, 2, ..., m, \ j = 1, 2, ..., n,$$

where $Y_{ij} \sim N\left(\mu_i, \sigma^2\right)$. Then

(a) the random variable $\frac{SS_W}{\sigma^2} \sim \chi^2(m(n-1))$, and

(b) the statistics $SS_W$ and $SS_B$ are independent.

Further, if the null hypothesis $H_o: \mu_1 = \mu_2 = \cdots = \mu_m = \mu$ is true, then

(c) the random variable $\frac{SS_B}{\sigma^2} \sim \chi^2(m-1)$,

(d) the statistic $\frac{SS_B\, m(n-1)}{SS_W\, (m-1)} \sim F(m-1, m(n-1))$, and

(e) the random variable $\frac{SS_T}{\sigma^2} \sim \chi^2(nm-1)$.


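Proof: In Chapter 13, we have seen in Theorem 13.7 that if $X_1, X_2, ..., X_n$ are independent random variables each one having the distribution $N(\mu, \sigma^2)$, then their mean $\bar X$ and $\sum_{i=1}^{n} (X_i - \bar X)^2$ have the following properties:

(i) $\bar X$ and $\sum_{i=1}^{n} (X_i - \bar X)^2$ are independent, and

(ii) $\frac{1}{\sigma^2} \sum_{i=1}^{n} (X_i - \bar X)^2 \sim \chi^2(n-1)$.

Now using (i) and (ii), we establish this theorem.

(a) Using (ii), we see that

$$\frac{1}{\sigma^2} \sum_{j=1}^{n} \left(Y_{ij} - \bar Y_{i\bullet}\right)^2 \sim \chi^2(n-1)$$

for each $i = 1, 2, ..., m$. Since

$$\sum_{j=1}^{n} \left(Y_{ij} - \bar Y_{i\bullet}\right)^2 \quad \text{and} \quad \sum_{j=1}^{n} \left(Y_{i'j} - \bar Y_{i'\bullet}\right)^2$$

are independent for $i' \neq i$, we obtain

$$\sum_{i=1}^{m} \frac{1}{\sigma^2} \sum_{j=1}^{n} \left(Y_{ij} - \bar Y_{i\bullet}\right)^2 \sim \chi^2(m(n-1)).$$

Hence

$$\frac{SS_W}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^{m} \sum_{j=1}^{n} \left(Y_{ij} - \bar Y_{i\bullet}\right)^2 = \sum_{i=1}^{m} \frac{1}{\sigma^2} \sum_{j=1}^{n} \left(Y_{ij} - \bar Y_{i\bullet}\right)^2 \sim \chi^2(m(n-1)).$$

(b) Since for each $i = 1, 2, ..., m$, the random variables $Y_{i1}, Y_{i2}, ..., Y_{in}$ are independent and

$$Y_{i1}, Y_{i2}, ..., Y_{in} \sim N\left(\mu_i, \sigma^2\right),$$

we conclude by (i) that

$$\sum_{j=1}^{n} \left(Y_{ij} - \bar Y_{i\bullet}\right)^2 \quad \text{and} \quad \bar Y_{i\bullet}$$

are independent. Further

$$\sum_{j=1}^{n} \left(Y_{ij} - \bar Y_{i\bullet}\right)^2 \quad \text{and} \quad \bar Y_{i'\bullet}$$

are independent for $i' \neq i$. Therefore, each of the statistics

$$\sum_{j=1}^{n} \left(Y_{ij} - \bar Y_{i\bullet}\right)^2, \quad i = 1, 2, ..., m$$

is independent of the statistics $\bar Y_{1\bullet}, \bar Y_{2\bullet}, ..., \bar Y_{m\bullet}$, and the statistics

$$\sum_{j=1}^{n} \left(Y_{ij} - \bar Y_{i\bullet}\right)^2, \quad i = 1, 2, ..., m$$

are independent. Thus it follows that the sets

$$\sum_{j=1}^{n} \left(Y_{ij} - \bar Y_{i\bullet}\right)^2, \ i = 1, 2, ..., m \quad \text{and} \quad \bar Y_{i\bullet}, \ i = 1, 2, ..., m$$

are independent. Thus

$$\sum_{i=1}^{m} \sum_{j=1}^{n} \left(Y_{ij} - \bar Y_{i\bullet}\right)^2 \quad \text{and} \quad \sum_{i=1}^{m} \sum_{j=1}^{n} \left(\bar Y_{i\bullet} - \bar Y_{\bullet\bullet}\right)^2$$

are independent. Hence by definition, the statistics $SS_W$ and $SS_B$ are independent.

(c) Suppose the null hypothesis $H_o: \mu_1 = \mu_2 = \cdots = \mu_m = \mu$ is true. Then the sample means $\bar Y_{1\bullet}, \bar Y_{2\bullet}, ..., \bar Y_{m\bullet}$ are independent and identically distributed with $N\left(\mu, \frac{\sigma^2}{n}\right)$. Therefore by (ii)

$$\frac{n}{\sigma^2} \sum_{i=1}^{m} \left(\bar Y_{i\bullet} - \bar Y_{\bullet\bullet}\right)^2 \sim \chi^2(m-1).$$

Hence

$$\frac{SS_B}{\sigma^2} = \frac{1}{\sigma^2} \sum_{i=1}^{m} \sum_{j=1}^{n} \left(\bar Y_{i\bullet} - \bar Y_{\bullet\bullet}\right)^2 = \frac{n}{\sigma^2} \sum_{i=1}^{m} \left(\bar Y_{i\bullet} - \bar Y_{\bullet\bullet}\right)^2 \sim \chi^2(m-1).$$

(d) Since

$$\frac{SS_W}{\sigma^2} \sim \chi^2(m(n-1)) \quad \text{and} \quad \frac{SS_B}{\sigma^2} \sim \chi^2(m-1),$$

and these two statistics are independent by (b), therefore

$$\frac{\dfrac{SS_B}{(m-1)\, \sigma^2}}{\dfrac{SS_W}{m(n-1)\, \sigma^2}} \sim F(m-1, m(n-1)).$$

That is

$$\frac{\dfrac{SS_B}{m-1}}{\dfrac{SS_W}{m(n-1)}} \sim F(m-1, m(n-1)).$$

(e) Under $H_o$, the random variables $Y_{ij}$, $i = 1, 2, ..., m$, $j = 1, 2, ..., n$ are independent and each has the distribution $N(\mu, \sigma^2)$. By (ii) we see that

$$\frac{1}{\sigma^2} \sum_{i=1}^{m} \sum_{j=1}^{n} \left(Y_{ij} - \bar Y_{\bullet\bullet}\right)^2 \sim \chi^2(nm-1).$$

Hence we have

$$\frac{SS_T}{\sigma^2} \sim \chi^2(nm-1)$$

and the proof of the theorem is now complete.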

From Theorem 20.1, we see that the maximum likelihood estimator of each $\mu_i$ ($i = 1, 2, ..., m$) is given by

$$\hat\mu_i = \bar Y_{i\bullet},$$

and since $\bar Y_{i\bullet} \sim N\left(\mu_i, \frac{\sigma^2}{n}\right)$,

$$E(\hat\mu_i) = E\left(\bar Y_{i\bullet}\right) = \mu_i.$$

Thus the maximum likelihood estimators are unbiased estimators of $\mu_i$ for $i = 1, 2, ..., m$.

Since

$$\hat\sigma^2 = \frac{SS_W}{mn}$$

and by Theorem 20.2, $\frac{1}{\sigma^2}\, SS_W \sim \chi^2(m(n-1))$, we have

$$E\left(\hat\sigma^2\right) = E\left(\frac{SS_W}{mn}\right) = \frac{1}{mn}\, \sigma^2\, E\left(\frac{1}{\sigma^2}\, SS_W\right) = \frac{1}{mn}\, \sigma^2\, m(n-1) \neq \sigma^2.$$

Thus the maximum likelihood estimator $\hat\sigma^2$ of $\sigma^2$ is biased. However, the estimator $\frac{SS_W}{m(n-1)}$ is an unbiased estimator. Similarly, when $H_o$ is true, the estimator $\frac{SS_T}{mn-1}$ is an unbiased estimator whereas $\frac{SS_T}{mn}$ is a biased estimator of $\sigma^2$.

Theorem 20.3. Suppose the one-way ANOVA model is given by the equation (2) where the $\epsilon_{ij}$'s are independent and normally distributed random variables with mean zero and variance $\sigma^2$ for $i = 1, 2, ..., m$ and $j = 1, 2, ..., n$. The null hypothesis $H_o: \mu_1 = \mu_2 = \cdots = \mu_m = \mu$ is rejected whenever the test statistic $F$ satisfies

$$F = \frac{SS_B / (m-1)}{SS_W / (m(n-1))} > F_\alpha(m-1, m(n-1)), \qquad (12)$$

where $\alpha$ is the significance level of the hypothesis test and $F_\alpha(m-1, m(n-1))$ denotes the $100(1 - \alpha)$ percentile of the F-distribution with $m - 1$ numerator and $nm - m$ denominator degrees of freedom.

Proof: Under the null hypothesis Ho : µ1 = µ2 =··· = µm = µ , the

likelihood function takes the form

L( µ, 2 ) =



i=1



j=1  1

p2⇡2 e (Yij µ)2

2 2 

= 1

p2⇡2  nm

1

2 2



i=1



j=1

(Yij µ)2

Taking the natural logarithm of the likelihood function and then maximizing

it, we obtain

 µ= Y•• and  H o =1

mn SS T

as the maximum likelihood estimators of µ and 2 , respectively. Inserting

these estimators into the likelihood function, we have the maximum of the

likelihood function, that is

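$$\max L(\mu, \sigma^2) = \left( \frac{1}{\sqrt{2\pi\hat{\sigma}^2_{H_o}}} \right)^{nm} e^{-\frac{1}{2\hat{\sigma}^2_{H_o}} \sum_{i=1}^{m} \sum_{j=1}^{n} (Y_{ij}-\overline{Y}_{\bullet\bullet})^2}.$$

Simplifying the above expression, we see that

$$\max L(\mu, \sigma^2) = \left( \frac{1}{\sqrt{2\pi\hat{\sigma}^2_{H_o}}} \right)^{nm} e^{-\frac{mn}{2\, SS_T}\, SS_T}$$

which is

$$\max L(\mu, \sigma^2) = \left( \frac{1}{\sqrt{2\pi\hat{\sigma}^2_{H_o}}} \right)^{nm} e^{-\frac{mn}{2}}. \qquad (13)$$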

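When no restrictions are imposed, we get the maximum of the likelihood function from Theorem 20.1 as

$$\max L(\mu_1, \mu_2, ..., \mu_m, \sigma^2) = \left( \frac{1}{\sqrt{2\pi\hat{\sigma}^2}} \right)^{nm} e^{-\frac{1}{2\hat{\sigma}^2} \sum_{i=1}^{m} \sum_{j=1}^{n} (Y_{ij}-\overline{Y}_{i\bullet})^2}.$$

Simplifying the above expression, we see that

$$\max L(\mu_1, \mu_2, ..., \mu_m, \sigma^2) = \left( \frac{1}{\sqrt{2\pi\hat{\sigma}^2}} \right)^{nm} e^{-\frac{mn}{2\, SS_W}\, SS_W}$$

which is

$$\max L(\mu_1, \mu_2, ..., \mu_m, \sigma^2) = \left( \frac{1}{\sqrt{2\pi\hat{\sigma}^2}} \right)^{nm} e^{-\frac{mn}{2}}. \qquad (14)$$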

Next we find the likelihood ratio statistic $W$ for testing the null hypothesis $H_o : \mu_1 = \mu_2 = \cdots = \mu_m = \mu$. Recall that the likelihood ratio statistic $W$ can be found by evaluating

$$W = \frac{\max L(\mu, \sigma^2)}{\max L(\mu_1, \mu_2, ..., \mu_m, \sigma^2)}.$$

Using (13) and (14), we see that

$$W = \left( \frac{\hat{\sigma}^2}{\hat{\sigma}^2_{H_o}} \right)^{\frac{mn}{2}}. \qquad (15)$$

Hence the likelihood ratio test to reject the null hypothesis $H_o$ is given by the inequality

$$W < k_0$$

where $k_0$ is a constant. Using (15) and simplifying, we get

$$\frac{\hat{\sigma}^2_{H_o}}{\hat{\sigma}^2} > k_1$$

where $k_1 = \left( \frac{1}{k_0} \right)^{\frac{2}{mn}}$. Hence

$$\frac{SS_T/mn}{SS_W/mn} = \frac{\hat{\sigma}^2_{H_o}}{\hat{\sigma}^2} > k_1.$$

Using Lemma 20.1 we have

$$\frac{SS_W + SS_B}{SS_W} > k_1.$$

Therefore

$$\frac{SS_B}{SS_W} > k \qquad (16)$$

where $k = k_1 - 1$. In order to find the cutoff point $k$ in (16), we use Theorem 20.2 (d). Therefore

$$F = \frac{SS_B/(m-1)}{SS_W/(m(n-1))} > \frac{m(n-1)}{m-1}\, k.$$

Since $F$ has an $F$ distribution, we obtain

$$\frac{m(n-1)}{m-1}\, k = F_\alpha(m-1, m(n-1)).$$

Thus, at a significance level $\alpha$, reject the null hypothesis $H_o$ if

$$F = \frac{SS_B/(m-1)}{SS_W/(m(n-1))} > F_\alpha(m-1, m(n-1))$$

and the proof of the theorem is complete.

The various quantities used in carrying out the test described in Theorem

20.3 are presented in a tabular form known as the ANOVA table.

Source of     Sums of     Degrees of     Mean                        F-statistic
variation     squares     freedom        squares                     F

Between       SS_B        m - 1          MS_B = SS_B/(m - 1)         F = MS_B/MS_W
Within        SS_W        m(n - 1)       MS_W = SS_W/(m(n - 1))
Total         SS_T        mn - 1

Table 20.1. One-Way ANOVA Table


At a significance level $\alpha$, the likelihood ratio test is: "Reject the null hypothesis $H_o : \mu_1 = \mu_2 = \cdots = \mu_m = \mu$ if $F > F_\alpha(m-1, m(n-1))$." One can also use the notion of $p$-value to perform this hypothesis test. If the value of the test statistic is $F = \gamma$, then the $p$-value is defined as

$$p\text{-value} = P(F(m-1, m(n-1)) \geq \gamma).$$

Alternatively, at a significance level $\alpha$, the likelihood ratio test is: "Reject the null hypothesis $H_o : \mu_1 = \mu_2 = \cdots = \mu_m = \mu$ if $p$-value $< \alpha$."

The following ﬁgure illustrates the notions of between sample variation

and within sample variation.

The ANOVA model described in (2), that is

$$Y_{ij} = \mu_i + \epsilon_{ij} \quad \text{for } i = 1, 2, ..., m, \; j = 1, 2, ..., n,$$

can be rewritten as

$$Y_{ij} = \mu + \alpha_i + \epsilon_{ij} \quad \text{for } i = 1, 2, ..., m, \; j = 1, 2, ..., n,$$

where $\mu$ is the mean of the $m$ values of $\mu_i$, and $\sum_{i=1}^{m} \alpha_i = 0$. The quantity $\alpha_i$ is called the effect of the $i$th treatment. Thus any observed value is the sum of an overall mean $\mu$, a treatment or class deviation $\alpha_i$, and a random element from a normally distributed random variable $\epsilon_{ij}$ with mean zero and variance $\sigma^2$. This model is called model I, the fixed effects model. The effects of the treatments or classes, measured by the parameters $\alpha_i$, are regarded as fixed but unknown quantities to be estimated. In this fixed effect model the null hypothesis $H_o$ is now

$$H_o : \alpha_1 = \alpha_2 = \cdots = \alpha_m = 0$$

and the alternative hypothesis is

$$H_a : \text{not all the } \alpha_i \text{ are zero}.$$

The random effects model, also known as model II, is given by

$$Y_{ij} = \mu + A_i + \epsilon_{ij} \quad \text{for } i = 1, 2, ..., m, \; j = 1, 2, ..., n,$$

where $\mu$ is the overall mean and

$$A_i \sim N(0, \sigma_A^2) \quad \text{and} \quad \epsilon_{ij} \sim N(0, \sigma^2).$$

In this model, the variances $\sigma_A^2$ and $\sigma^2$ are unknown quantities to be estimated. The null hypothesis of the random effect model is $H_o : \sigma_A^2 = 0$ and the alternative hypothesis is $H_a : \sigma_A^2 > 0$. In this chapter we do not consider the random effect model.

Before we present some examples, we point out the three assumptions on which the ANOVA is based:

(1) Independent Samples: The samples taken from the populations under consideration should be independent of one another.

(2) Normal Population: For each population, the variable under consideration should be normally distributed.

(3) Equal Variance: The variances of the variables under consideration should be the same for all the populations.

Example 20.1. The data in the following table give the number of hours of relief provided by 5 different brands of headache tablets administered to 25 subjects experiencing fevers of 38°C or more. Perform the analysis of variance and test the hypothesis at the 0.05 level of significance that the mean number of hours of relief provided by the tablets is the same for all 5 brands.

Tablets

A B C D F

5 9 3 2 7

4 7 5 3 6

8 8 2 4 9

6 6 3 1 4

3 9 7 4 7

Answer: Using the formulas (8), (9) and (10), we compute the sums of squares $SS_W$, $SS_B$ and $SS_T$ as

$$SS_W = 57.60, \quad SS_B = 79.44, \quad \text{and} \quad SS_T = 137.04.$$

The ANOVA table for this problem is shown below.

Source of     Sums of     Degrees of     Mean        F-statistic
variation     squares     freedom        squares     F

Between        79.44        4            19.86       6.90
Within         57.60       20             2.88
Total         137.04       24

At the significance level $\alpha = 0.05$, we find from the F-table that $F_{0.05}(4, 20) = 2.8661$. Since

$$6.90 = F > F_{0.05}(4, 20) = 2.8661$$

we reject the null hypothesis that the mean number of hours of relief provided by the tablets is the same for all 5 brands.

Note that using a statistical package like MINITAB, SAS or SPSS we can compute the $p$-value to be

$$p\text{-value} = P(F(4, 20) \geq 6.90) = 0.001.$$

Hence again we reach the same conclusion since the $p$-value is less than the given $\alpha$ for this problem.
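The sums of squares in Example 20.1 can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function name `balanced_anova` is a hypothetical helper introduced only for illustration, and it assumes equally sized samples as in the balanced case.

```python
# Hours of relief for the five tablet brands in Example 20.1 (one list per brand).
samples = {
    "A": [5, 4, 8, 6, 3],
    "B": [9, 7, 8, 6, 9],
    "C": [3, 5, 2, 3, 7],
    "D": [2, 3, 4, 1, 4],
    "F": [7, 6, 9, 4, 7],
}


def balanced_anova(groups):
    """Return (SS_W, SS_B, SS_T, F) for m samples that all have the same size n."""
    m = len(groups)
    n = len(groups[0])
    means = [sum(g) / n for g in groups]               # sample means
    grand = sum(sum(g) for g in groups) / (m * n)      # grand mean
    ss_w = sum((y - mu) ** 2 for g, mu in zip(groups, means) for y in g)
    ss_b = n * sum((mu - grand) ** 2 for mu in means)
    ss_t = sum((y - grand) ** 2 for g in groups for y in g)
    f = (ss_b / (m - 1)) / (ss_w / (m * (n - 1)))      # MS_B / MS_W
    return ss_w, ss_b, ss_t, f


ss_w, ss_b, ss_t, f_stat = balanced_anova(list(samples.values()))
print(round(ss_w, 2), round(ss_b, 2), round(ss_t, 2), round(f_stat, 2))
```

For these data the sketch gives $SS_W = 57.60$, $SS_B = 79.44$ and $SS_T = 137.04$, with $F$ close to 6.90, and it also confirms the decomposition $SS_T = SS_W + SS_B$.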


Example 20.2. Perform the analysis of variance and test the null hypothesis at the 0.05 level of significance for the following two data sets.

       Data Set 1                  Data Set 2
         Sample                      Sample
    A       B       C           A       B       C

   8.1     8.0    14.8         9.2     9.5     9.4
   4.2    15.1     5.3         9.1     9.5     9.3
  14.7     4.7    11.1         9.2     9.5     9.3
   9.9    10.4     7.9         9.2     9.6     9.3
  12.1     9.0     9.3         9.3     9.5     9.2
   6.2     9.8     7.4         9.2     9.4     9.3

Answer: Computing the sums of squares $SS_W$, $SS_B$ and $SS_T$, we have the following two ANOVA tables:

Source of     Sums of     Degrees of     Mean        F-statistic
variation     squares     freedom        squares     F

Between         0.3         2             0.1        0.01
Within        187.2        15            12.5
Total         187.5        17

and

Source of     Sums of     Degrees of     Mean        F-statistic
variation     squares     freedom        squares     F

Between       0.280         2            0.140       35.0
Within        0.060        15            0.004
Total         0.340        17

At the significance level $\alpha = 0.05$, we find from the F-table that $F_{0.05}(2, 15) = 3.68$. For the first data set, since

$$0.01 = F < F_{0.05}(2, 15) = 3.68$$

we do not reject the null hypothesis whereas for the second data set,

$$35.0 = F > F_{0.05}(2, 15) = 3.68$$

we reject the null hypothesis.

Remark 20.1. Note that the sample means are the same in both data sets. However, there is less variation among the sample points within the samples of the second data set, so the ANOVA finds a more significant difference among the means in the second data set. This example suggests that the larger the variation among sample means compared with the variation of the measurements within samples, the greater is the evidence to indicate a difference among population means.

20.2. One-Way Analysis of Variance with Unequal Sample Sizes

In the previous section, we examined the theory of ANOVA when the samples are of the same size. When the samples are of the same size we say that the ANOVA is in the balanced case. In this section we examine the theory of ANOVA for the unbalanced case, that is, when the samples are of different sizes. In experimental work, one often encounters the unbalanced case due to the death of experimental animals in a study, the dropout of human subjects from a study, or damage to experimental materials used in a study. Our analysis of the last section for equal sample sizes remains valid but has to be modified to accommodate the different sample sizes.

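Consider $m$ independent samples of respective sizes $n_1, n_2, ..., n_m$, where the members of the $i$th sample, $Y_{i1}, Y_{i2}, ..., Y_{in_i}$, are normal random variables with mean $\mu_i$ and unknown variance $\sigma^2$. That is,

$$Y_{ij} \sim N(\mu_i, \sigma^2), \quad i = 1, 2, ..., m, \; j = 1, 2, ..., n_i.$$

Let us denote $N = n_1 + n_2 + \cdots + n_m$. Again, we will be interested in testing the null hypothesis

$$H_o : \mu_1 = \mu_2 = \cdots = \mu_m = \mu$$

against the alternative hypothesis

$$H_a : \text{not all the means are equal}.$$

Now, defining

$$\overline{Y}_{i\bullet} = \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij}, \qquad (17)$$

$$\overline{Y}_{\bullet\bullet} = \frac{1}{N} \sum_{i=1}^{m} \sum_{j=1}^{n_i} Y_{ij}, \qquad (18)$$

$$SS_T = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \left( Y_{ij} - \overline{Y}_{\bullet\bullet} \right)^2, \qquad (19)$$

$$SS_W = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \left( Y_{ij} - \overline{Y}_{i\bullet} \right)^2, \qquad (20)$$

and

$$SS_B = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \left( \overline{Y}_{i\bullet} - \overline{Y}_{\bullet\bullet} \right)^2, \qquad (21)$$

we have the following results analogous to the results in the previous section.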

Theorem 20.4. Suppose the one-way ANOVA model is given by the equation (2) where the $\epsilon_{ij}$'s are independent and normally distributed random variables with mean zero and variance $\sigma^2$ for $i = 1, 2, ..., m$ and $j = 1, 2, ..., n_i$. Then the MLE's of the parameters $\mu_i$ ($i = 1, 2, ..., m$) and $\sigma^2$ of the model are given by

$$\hat{\mu}_i = \overline{Y}_{i\bullet}, \quad i = 1, 2, ..., m,$$

$$\hat{\sigma}^2 = \frac{1}{N}\, SS_W,$$

where $\overline{Y}_{i\bullet} = \frac{1}{n_i} \sum_{j=1}^{n_i} Y_{ij}$ and $SS_W = \sum_{i=1}^{m} \sum_{j=1}^{n_i} \left( Y_{ij} - \overline{Y}_{i\bullet} \right)^2$ is the within samples sum of squares.

Lemma 20.2. The total sum of squares is equal to the sum of within and between sum of squares, that is $SS_T = SS_W + SS_B$.

Theorem 20.5. Consider the ANOVA model

$$Y_{ij} = \mu_i + \epsilon_{ij}, \quad i = 1, 2, ..., m, \; j = 1, 2, ..., n_i,$$

where $Y_{ij} \sim N(\mu_i, \sigma^2)$. Then

(a) the random variable $\frac{SS_W}{\sigma^2} \sim \chi^2(N - m)$, and

(b) the statistics $SS_W$ and $SS_B$ are independent.

Further, if the null hypothesis $H_o : \mu_1 = \mu_2 = \cdots = \mu_m = \mu$ is true, then

(c) the random variable $\frac{SS_B}{\sigma^2} \sim \chi^2(m - 1)$,

(d) the statistic $\frac{SS_B\,(N - m)}{SS_W\,(m - 1)} \sim F(m - 1, N - m)$, and

(e) the random variable $\frac{SS_T}{\sigma^2} \sim \chi^2(N - 1)$.

Theorem 20.6. Suppose the one-way ANOVA model is given by the equation (2) where the $\epsilon_{ij}$'s are independent and normally distributed random variables with mean zero and variance $\sigma^2$ for $i = 1, 2, ..., m$ and $j = 1, 2, ..., n_i$. The null hypothesis $H_o : \mu_1 = \mu_2 = \cdots = \mu_m = \mu$ is rejected whenever the test statistic $F$ satisfies

$$F = \frac{SS_B/(m-1)}{SS_W/(N-m)} > F_\alpha(m-1, N-m),$$

where $\alpha$ is the significance level of the hypothesis test and $F_\alpha(m-1, N-m)$ denotes the $100(1-\alpha)$ percentile of the $F$-distribution with $m-1$ numerator and $N-m$ denominator degrees of freedom.

The corresponding ANOVA table for this case is

Source of     Sums of     Degrees of     Mean                      F-statistic
variation     squares     freedom        squares                   F

Between       SS_B        m - 1          MS_B = SS_B/(m - 1)       F = MS_B/MS_W
Within        SS_W        N - m          MS_W = SS_W/(N - m)
Total         SS_T        N - 1

Table 20.2. One-Way ANOVA Table with unequal sample size
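The quantities in Table 20.2 follow directly from formulas (17) to (21). The Python sketch below is an illustration only and not part of the original text: the function name `one_way_anova` and the small data set `toy` are hypothetical, chosen so that the arithmetic is easy to follow by hand.

```python
def one_way_anova(groups):
    """One-way ANOVA for m samples of possibly unequal sizes n_1, ..., n_m."""
    m = len(groups)
    N = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]          # formula (17)
    grand = sum(sum(g) for g in groups) / N            # formula (18)
    ss_t = sum((y - grand) ** 2 for g in groups for y in g)                 # (19)
    ss_w = sum((y - mu) ** 2 for g, mu in zip(groups, means) for y in g)    # (20)
    ss_b = sum(len(g) * (mu - grand) ** 2 for g, mu in zip(groups, means))  # (21)
    ms_b = ss_b / (m - 1)
    ms_w = ss_w / (N - m)
    return {"SS_B": ss_b, "SS_W": ss_w, "SS_T": ss_t,
            "df_B": m - 1, "df_W": N - m, "F": ms_b / ms_w}


# Hypothetical data (not from the text): three samples of sizes 3, 4 and 2.
toy = [[4.0, 5.0, 6.0], [7.0, 9.0, 8.0, 8.0], [5.0, 7.0]]
result = one_way_anova(toy)
print(result)
```

Because the inner sum in (21) does not depend on $j$, the code evaluates $SS_B$ as $\sum_i n_i (\overline{Y}_{i\bullet} - \overline{Y}_{\bullet\bullet})^2$. The resulting $F$ would then be compared with $F_\alpha(m-1, N-m)$ from an F-table.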

Example 20.3. Three sections of elementary statistics were taught by different instructors. A common final examination was given. The test scores are given in the table below. Perform the analysis of variance and test the hypothesis at the 0.05 level of significance that there is a difference in the average grades given by the three instructors.

Elementary Statistics

Instructor A Instructor B Instructor C

75 90 17

91 80 81

83 50 55

45 93 70

82 53 61

75 87 43

68 76 89

47 82 73

38 78 58

80 70

Answer: Using the formulas (17) - (21), we compute the sums of squares $SS_W$, $SS_B$ and $SS_T$ as

$$SS_W = 10362, \quad SS_B = 755, \quad \text{and} \quad SS_T = 11117.$$

The ANOVA table for this problem is shown below.

Source of     Sums of     Degrees of     Mean        F-statistic
variation     squares     freedom        squares     F

Between          755        2             377        1.02
Within         10362       28             370
Total          11117       30

At the significance level $\alpha = 0.05$, we find from the F-table that $F_{0.05}(2, 28) = 3.34$. Since

$$1.02 = F < F_{0.05}(2, 28) = 3.34$$

we do not reject the null hypothesis that there is no difference in the average grades given by the three instructors.

Note that using a statistical package like MINITAB, SAS or SPSS we can compute the $p$-value to be

$$p\text{-value} = P(F(2, 28) \geq 1.02) = 0.374.$$

Hence again we reach the same conclusion since the $p$-value is greater than the given $\alpha$ for this problem.

We conclude this section by pointing out the advantages of choosing equal sample sizes (balanced case) over the choice of unequal sample sizes (unbalanced case). The first advantage is that the $F$-statistic is insensitive to slight departures from the assumption of equal variances when the sample sizes are equal. The second advantage is that the choice of equal sample sizes minimizes the probability of committing a type II error.

20.3. Pairwise Comparisons

When the null hypothesis is rejected using the $F$-test in ANOVA, one may still want to know where the difference among the means is. There are several methods to find out where the significant differences in the means lie after the ANOVA procedure is performed. Among the most commonly used tests are the Scheffé test and the Tukey test. In this section, we give a brief description of these tests.

In order to perform the Scheffé test, we have to compare the means two at a time using all possible combinations of means. Since we have $m$ means, we need $\binom{m}{2}$ pairwise comparisons. A pairwise comparison can be viewed as a test of the null hypothesis $H_0 : \mu_i = \mu_k$ against the alternative $H_a : \mu_i \neq \mu_k$ for all $i \neq k$.

To conduct this test we compute the statistic

$$F_s = \frac{\left( \overline{Y}_{i\bullet} - \overline{Y}_{k\bullet} \right)^2}{MS_W \left( \frac{1}{n_i} + \frac{1}{n_k} \right)},$$

where $\overline{Y}_{i\bullet}$ and $\overline{Y}_{k\bullet}$ are the means of the samples being compared, $n_i$ and $n_k$ are the respective sample sizes, and $MS_W$ is the within-group mean square. We reject the null hypothesis at a significance level of $\alpha$ if

$$F_s > (m-1)\, F_\alpha(m-1, N-m)$$

where $N = n_1 + n_2 + \cdots + n_m$.

Example 20.4. Perform the analysis of variance and test the null hypothesis at the 0.05 level of significance for the data given in the table below. Further perform a Scheffé test to determine where the significant differences in the means lie.

        Sample
   1       2       3

  9.2     9.5     9.4
  9.1     9.5     9.3
  9.2     9.5     9.3
  9.2     9.6     9.3
  9.3     9.5     9.2
  9.2     9.4     9.3

Answer: The ANOVA table for this data is given by

Source of     Sums of     Degrees of     Mean        F-statistic
variation     squares     freedom        squares     F

Between       0.280         2            0.140       35.0
Within        0.060        15            0.004
Total         0.340        17

At the significance level $\alpha = 0.05$, we find from the F-table that $F_{0.05}(2, 15) = 3.68$. Since

$$35.0 = F > F_{0.05}(2, 15) = 3.68$$

we reject the null hypothesis. Now we perform the Scheffé test to determine where the significant differences in the means lie. From the given data, we obtain $\overline{Y}_{1\bullet} = 9.2$, $\overline{Y}_{2\bullet} = 9.5$ and $\overline{Y}_{3\bullet} = 9.3$. Since $m = 3$, we have to make 3 pairwise comparisons, namely $\mu_1$ with $\mu_2$, $\mu_1$ with $\mu_3$, and $\mu_2$ with $\mu_3$. First we consider the comparison of $\mu_1$ with $\mu_2$. For this case, we find

$$F_s = \frac{\left( \overline{Y}_{1\bullet} - \overline{Y}_{2\bullet} \right)^2}{MS_W \left( \frac{1}{n_1} + \frac{1}{n_2} \right)} = \frac{(9.2 - 9.5)^2}{0.004 \left( \frac{1}{6} + \frac{1}{6} \right)} = 67.5.$$

Since

$$67.5 = F_s > 2\, F_{0.05}(2, 15) = 7.36$$

we reject the null hypothesis $H_0 : \mu_1 = \mu_2$ in favor of the alternative $H_a : \mu_1 \neq \mu_2$.

Next we consider the comparison of $\mu_1$ with $\mu_3$. For this case, we find

$$F_s = \frac{\left( \overline{Y}_{1\bullet} - \overline{Y}_{3\bullet} \right)^2}{MS_W \left( \frac{1}{n_1} + \frac{1}{n_3} \right)} = \frac{(9.2 - 9.3)^2}{0.004 \left( \frac{1}{6} + \frac{1}{6} \right)} = 7.5.$$

Since

$$7.5 = F_s > 2\, F_{0.05}(2, 15) = 7.36$$

we reject the null hypothesis $H_0 : \mu_1 = \mu_3$ in favor of the alternative $H_a : \mu_1 \neq \mu_3$.

Finally we consider the comparison of $\mu_2$ with $\mu_3$. For this case, we find

$$F_s = \frac{\left( \overline{Y}_{2\bullet} - \overline{Y}_{3\bullet} \right)^2}{MS_W \left( \frac{1}{n_2} + \frac{1}{n_3} \right)} = \frac{(9.5 - 9.3)^2}{0.004 \left( \frac{1}{6} + \frac{1}{6} \right)} = 30.0.$$

Since

$$30.0 = F_s > 2\, F_{0.05}(2, 15) = 7.36$$

we reject the null hypothesis $H_0 : \mu_2 = \mu_3$ in favor of the alternative $H_a : \mu_2 \neq \mu_3$.
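The three Scheffé comparisons of Example 20.4 can be reproduced with a few lines of Python. This is only a sketch and not part of the original text: the variable names are illustrative, and the critical value 3.68 is simply the $F_{0.05}(2, 15)$ table value quoted in the example rather than something the code computes.

```python
from itertools import combinations

means = {1: 9.2, 2: 9.5, 3: 9.3}   # sample means from Example 20.4
sizes = {1: 6, 2: 6, 3: 6}         # sample sizes n_i
ms_w = 0.004                       # within-group mean square from the ANOVA table
f_crit = 3.68                      # F_0.05(2, 15), taken from the F-table
m = len(means)
threshold = (m - 1) * f_crit       # Scheffe rejection threshold (m - 1) F_alpha

scheffe = {}
for i, k in combinations(means, 2):
    # F_s = (mean_i - mean_k)^2 / (MS_W (1/n_i + 1/n_k))
    fs = (means[i] - means[k]) ** 2 / (ms_w * (1 / sizes[i] + 1 / sizes[k]))
    scheffe[(i, k)] = (fs, fs > threshold)
    print(f"mu{i} vs mu{k}: F_s = {fs:.1f}, reject H0: {fs > threshold}")
```

The statistics come out near 67.5, 7.5 and 30.0, each above the threshold 7.36, in agreement with the example.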

Next consider the Tukey test. The Tukey test is applicable when we have a balanced case, that is, when the sample sizes are equal. For the Tukey test we compute the statistic

$$Q = \frac{\overline{Y}_{i\bullet} - \overline{Y}_{k\bullet}}{\sqrt{MS_W / n}}$$

where $\overline{Y}_{i\bullet}$ and $\overline{Y}_{k\bullet}$ are the means of the samples being compared, $n$ is the size of the samples, and $MS_W$ is the within-group mean square. At a significance level $\alpha$, we reject the null hypothesis $H_0$ if

$$|Q| > Q_\alpha(m, \nu)$$

where $\nu$ represents the degrees of freedom for the error mean square.

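Example 20.5. For the data given in Example 20.4 perform a Tukey test to determine where the significant differences in the means lie.

Answer: We have seen that $\overline{Y}_{1\bullet} = 9.2$, $\overline{Y}_{2\bullet} = 9.5$ and $\overline{Y}_{3\bullet} = 9.3$. Here $m = 3$, $n = 6$ and $\nu = 15$, and from the table of the studentized range distribution $Q_{0.05}(3, 15) = 3.67$.

First we compare $\mu_1$ with $\mu_2$. For this we compute

$$Q = \frac{\overline{Y}_{1\bullet} - \overline{Y}_{2\bullet}}{\sqrt{MS_W / n}} = \frac{9.2 - 9.5}{\sqrt{0.004/6}} = -11.6189.$$

Since

$$11.6189 = |Q| > Q_{0.05}(3, 15) = 3.67$$

we reject the null hypothesis $H_0 : \mu_1 = \mu_2$ in favor of the alternative $H_a : \mu_1 \neq \mu_2$.

Next we compare $\mu_1$ with $\mu_3$. For this we compute

$$Q = \frac{\overline{Y}_{1\bullet} - \overline{Y}_{3\bullet}}{\sqrt{MS_W / n}} = \frac{9.2 - 9.3}{\sqrt{0.004/6}} = -3.8729.$$

Since

$$3.8729 = |Q| > Q_{0.05}(3, 15) = 3.67$$

we reject the null hypothesis $H_0 : \mu_1 = \mu_3$ in favor of the alternative $H_a : \mu_1 \neq \mu_3$.

Finally we compare $\mu_2$ with $\mu_3$. For this we compute

$$Q = \frac{\overline{Y}_{2\bullet} - \overline{Y}_{3\bullet}}{\sqrt{MS_W / n}} = \frac{9.5 - 9.3}{\sqrt{0.004/6}} = 7.7459.$$

Since

$$7.7459 = |Q| > Q_{0.05}(3, 15) = 3.67$$

we reject the null hypothesis $H_0 : \mu_2 = \mu_3$ in favor of the alternative $H_a : \mu_2 \neq \mu_3$.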
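The Tukey statistics for the three pairs can be computed as in the following Python sketch, which is an illustration and not part of the original text. It evaluates only $Q = (\overline{Y}_{i\bullet} - \overline{Y}_{k\bullet})/\sqrt{MS_W/n}$; the critical value $Q_\alpha(m, \nu)$ still has to be read from a studentized range table.

```python
import math
from itertools import combinations

means = {1: 9.2, 2: 9.5, 3: 9.3}   # sample means from Example 20.4
n = 6                              # common sample size (balanced case)
ms_w = 0.004                       # within-group mean square
std_err = math.sqrt(ms_w / n)      # denominator of the Tukey statistic

# Q for every pair (i, k) with i < k; the sign shows which mean is larger.
tukey_q = {(i, k): (means[i] - means[k]) / std_err
           for i, k in combinations(means, 2)}

for (i, k), q in tukey_q.items():
    print(f"mu{i} vs mu{k}: Q = {q:.4f}, |Q| = {abs(q):.4f}")
```

The absolute values are approximately 11.6189, 3.8729 and 7.7459 for the pairs $(\mu_1, \mu_2)$, $(\mu_1, \mu_3)$ and $(\mu_2, \mu_3)$, respectively.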

Often in scientific and engineering problems, the experiment dictates the need for comparing simultaneously each treatment with a control. Now we describe a test developed by C. W. Dunnett for determining significant differences between each treatment mean and the control. Suppose we wish to test the $m$ hypotheses

$$H_0 : \mu_0 = \mu_i \quad \text{versus} \quad H_a : \mu_0 \neq \mu_i \quad \text{for } i = 1, 2, ..., m,$$

where $\mu_0$ represents the mean yield for the population of measurements in which the control is used. To test the null hypotheses specified by $H_0$ against two-sided alternatives for an experimental situation in which there are $m$ treatments, excluding the control, and $n$ observations per treatment, we first calculate

$$D_i = \frac{\overline{Y}_{i\bullet} - \overline{Y}_{0\bullet}}{\sqrt{2\, MS_W / n}}, \quad i = 1, 2, ..., m.$$

At a significance level $\alpha$, we reject the null hypothesis $H_0$ if

$$|D_i| > D_{\alpha/2}(m, \nu)$$

where $\nu$ represents the degrees of freedom for the error mean square. The values of the quantity $D_{\alpha/2}(m, \nu)$ are tabulated for various $\alpha$, $m$ and $\nu$.

Example 20.6. For the data given in the table below perform a Dunnett test to determine any significant differences between each treatment mean and the control.

  Control     Sample 1     Sample 2

    9.2         9.5          9.4
    9.1         9.5          9.3
    9.2         9.5          9.3
    9.2         9.6          9.3
    9.3         9.5          9.2
    9.2         9.4          9.3

Answer: The ANOVA table for this data is given by

Source of     Sums of     Degrees of     Mean        F-statistic
variation     squares     freedom        squares     F

Between       0.280         2            0.140       35.0
Within        0.060        15            0.004
Total         0.340        17

At the significance level $\alpha = 0.05$, we find from the F-table that $F_{0.05}(2, 15) = 3.68$. Since

$$35.0 = F > F_{0.05}(2, 15) = 3.68$$

we reject the null hypothesis. Now we perform the Dunnett test to determine if there are any significant differences between each treatment mean and the control. From the given data, we obtain $\overline{Y}_{0\bullet} = 9.2$, $\overline{Y}_{1\bullet} = 9.5$ and $\overline{Y}_{2\bullet} = 9.3$, and from the Dunnett table we find that $D_{0.025}(2, 15) = 2.44$. Since $m = 2$, we have to make 2 pairwise comparisons, namely $\mu_0$ with $\mu_1$, and $\mu_0$ with $\mu_2$. First we consider the comparison of $\mu_0$ with $\mu_1$. For this case, we find

$$D_1 = \frac{\overline{Y}_{1\bullet} - \overline{Y}_{0\bullet}}{\sqrt{2\, MS_W / n}} = \frac{9.5 - 9.2}{\sqrt{2\,(0.004)/6}} = 8.2158.$$

Since

$$8.2158 = D_1 > D_{0.025}(2, 15) = 2.44$$

we reject the null hypothesis $H_0 : \mu_1 = \mu_0$ in favor of the alternative $H_a : \mu_1 \neq \mu_0$.

Next we find

$$D_2 = \frac{\overline{Y}_{2\bullet} - \overline{Y}_{0\bullet}}{\sqrt{2\, MS_W / n}} = \frac{9.3 - 9.2}{\sqrt{2\,(0.004)/6}} = 2.7386.$$

Since

$$2.7386 = D_2 > D_{0.025}(2, 15) = 2.44$$

we reject the null hypothesis $H_0 : \mu_2 = \mu_0$ in favor of the alternative $H_a : \mu_2 \neq \mu_0$.
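The two Dunnett statistics of Example 20.6 can be checked with the short Python sketch below. It is an illustration only and not part of the original text; the critical value 2.44 is the tabulated $D_{0.025}(2, 15)$ quoted in the example, entered by hand.

```python
import math

control_mean = 9.2                   # mean of the control sample
treatment_means = {1: 9.5, 2: 9.3}   # means of Sample 1 and Sample 2
n = 6                                # observations per treatment
ms_w = 0.004                         # within-group mean square
d_crit = 2.44                        # D_0.025(2, 15), taken from the Dunnett table

std_err = math.sqrt(2 * ms_w / n)    # denominator sqrt(2 MS_W / n)
dunnett = {i: (mean - control_mean) / std_err
           for i, mean in treatment_means.items()}
rejected = {i: abs(d) > d_crit for i, d in dunnett.items()}

for i, d in dunnett.items():
    print(f"D_{i} = {d:.4f}, reject H0: {rejected[i]}")
```

Both statistics, about 8.2158 and 2.7386, exceed 2.44, so each treatment mean is judged different from the control mean.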

20.4. Tests for the Homogeneity of Variances

One of the assumptions behind the ANOVA is that of equal variance, that is, the variances of the variables under consideration should be the same for all populations. Earlier we pointed out that the $F$-statistic is insensitive to slight departures from the assumption of equal variances when the sample sizes are equal. Nevertheless it is advisable to run a preliminary test for homogeneity of variances. Such a test would certainly be advisable in the case of unequal sample sizes if there is a doubt concerning the homogeneity of population variances.

Suppose we want to test the null hypothesis

$$H_0 : \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_m^2$$

versus the alternative hypothesis

$$H_a : \text{not all variances are equal}.$$

A frequently used test for the homogeneity of population variances is the Bartlett test. Bartlett (1937) proposed a test for equal variances that is a modification of the normal-theory likelihood ratio test.

We will use this test to test the above null hypothesis $H_0$ against $H_a$. First, we compute the $m$ sample variances $S_1^2, S_2^2, ..., S_m^2$ from the samples of size $n_1, n_2, ..., n_m$, with $n_1 + n_2 + \cdots + n_m = N$. The test statistic $B_c$ is given by

$$B_c = \frac{(N - m) \ln S_p^2 - \sum_{i=1}^{m} (n_i - 1) \ln S_i^2}{1 + \frac{1}{3(m-1)} \left( \sum_{i=1}^{m} \frac{1}{n_i - 1} - \frac{1}{N - m} \right)}$$

where the pooled variance $S_p^2$ is given by

$$S_p^2 = \frac{\sum_{i=1}^{m} (n_i - 1)\, S_i^2}{N - m} = MS_W.$$

It is known that the sampling distribution of $B_c$ is approximately chi-square with $m - 1$ degrees of freedom, that is

$$B_c \sim \chi^2(m - 1)$$

when $(n_i - 1) \geq 3$. Thus the Bartlett test rejects the null hypothesis $H_0 : \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_m^2$ at a significance level $\alpha$ if

$$B_c > \chi^2_{1-\alpha}(m - 1),$$

where $\chi^2_{1-\alpha}(m - 1)$ denotes the upper $(1 - \alpha)100$ percentile of the chi-square distribution with $m - 1$ degrees of freedom.

Example 20.7. For the following data perform an ANOVA and then apply

Bartlett test to examine if the homogeneity of variances condition is met for

a signiﬁcance level 0.05.

Data

Sample 1 Sample 2 Sample 3 Sample 4

34 29 32 34

28 32 34 29

29 31 30 32

37 43 42 28

42 31 32 32

27 29 33 34

29 28 29 29

35 30 27 31

25 37 37 30

29 44 26 37

41 29 29 43

40 31 31 42


Answer: The ANOVA table for this data is given by

Source of     Sums of     Degrees of     Mean        F-statistic
variation     squares     freedom        squares     F

Between         16.2        3             5.4        0.20
Within        1202.2       44            27.3
Total         1218.5       47

At the significance level $\alpha = 0.05$, we find from the F-table that $F_{0.05}(3, 44) = 2.84$. Since

$$0.20 = F < F_{0.05}(3, 44) = 2.84$$

we do not reject the null hypothesis.

Now we compute the Bartlett test statistic $B_c$. From the data the variances of each group can be found to be

$$S_1^2 = 35.2836, \quad S_2^2 = 30.1401, \quad S_3^2 = 19.4481, \quad S_4^2 = 24.4036.$$

Further, the pooled variance is

$$S_p^2 = MS_W = 27.3.$$

The statistic $B_c$ is

$$B_c = \frac{(N - m) \ln S_p^2 - \sum_{i=1}^{m} (n_i - 1) \ln S_i^2}{1 + \frac{1}{3(m-1)} \left( \sum_{i=1}^{m} \frac{1}{n_i - 1} - \frac{1}{N - m} \right)}$$

$$= \frac{44 \ln 27.3 - 11 \left[ \ln 35.2836 + \ln 30.1401 + \ln 19.4481 + \ln 24.4036 \right]}{1 + \frac{1}{3(4-1)} \left( \frac{4}{12-1} - \frac{1}{48-4} \right)}$$

$$= \frac{1.0537}{1.0378} = 1.0153.$$

From the chi-square table we find that $\chi^2_{0.95}(3) = 7.815$. Hence, since

$$1.0153 = B_c < \chi^2_{0.95}(3) = 7.815,$$

we do not reject the null hypothesis that the variances are equal. Hence the Bartlett test suggests that the homogeneity of variances condition is met.
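The Bartlett computation can be reproduced with the Python sketch below, which is an illustration and not part of the original text. The function name `bartlett_statistic` and its optional `pooled` argument are assumptions made for this sketch; passing `pooled=27.3` mimics the rounded pooled variance used in the worked example, while leaving it out uses the unrounded value.

```python
import math


def bartlett_statistic(variances, sizes, pooled=None):
    """Bartlett statistic B_c from the sample variances S_i^2 and sizes n_i."""
    m = len(variances)
    N = sum(sizes)
    if pooled is None:
        # Pooled variance S_p^2 = sum (n_i - 1) S_i^2 / (N - m)
        pooled = sum((n - 1) * s2 for s2, n in zip(variances, sizes)) / (N - m)
    numerator = (N - m) * math.log(pooled) - sum(
        (n - 1) * math.log(s2) for s2, n in zip(variances, sizes)
    )
    correction = 1 + (sum(1 / (n - 1) for n in sizes) - 1 / (N - m)) / (3 * (m - 1))
    return numerator / correction


variances = [35.2836, 30.1401, 19.4481, 24.4036]   # from Example 20.7
sizes = [12, 12, 12, 12]

b_text = bartlett_statistic(variances, sizes, pooled=27.3)  # rounded S_p^2
b_exact = bartlett_statistic(variances, sizes)              # unrounded S_p^2
print(round(b_text, 4), round(b_exact, 4))
```

With the rounded pooled variance the sketch gives about 1.0153, and with the unrounded pooled variance about 1.04. Either value is far below $\chi^2_{0.95}(3) = 7.815$, so the conclusion is the same.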

The Bartlett test assumes that the $m$ samples are taken from $m$ normal populations. Thus the Bartlett test is sensitive to departures from normality. The Levene test is an alternative to the Bartlett test that is less sensitive to departures from normality. Levene (1960) proposed a test for the homogeneity of population variances that considers the random variables

$$W_{ij} = \left( Y_{ij} - \overline{Y}_{i\bullet} \right)^2$$

and applies a one-way analysis of variance to these variables. If the $F$-test is significant, the homogeneity of variances is rejected.

Levene (1960) also proposed using $F$-tests based on other transformations of the deviations, such as the absolute deviations $W_{ij} = |Y_{ij} - \overline{Y}_{i\bullet}|$. Brown and Forsythe (1974c) proposed using the transformed variables based on the absolute deviations from the median, that is $W_{ij} = |Y_{ij} - Med(Y_{i\bullet})|$, where $Med(Y_{i\bullet})$ denotes the median of group $i$. Again if the $F$-test is significant, the homogeneity of variances is rejected.

Example 20.8. For the data in Example 20.7 do a Levene test to examine if the homogeneity of variances condition is met for a significance level 0.05.

Answer: From the data we find that $\overline{Y}_{1\bullet} = 33.00$, $\overline{Y}_{2\bullet} = 32.83$, $\overline{Y}_{3\bullet} = 31.83$, and $\overline{Y}_{4\bullet} = 33.42$. Next we compute $W_{ij} = \left( Y_{ij} - \overline{Y}_{i\bullet} \right)^2$. The resulting values are given in the table below.

              Transformed Data
Sample 1     Sample 2     Sample 3     Sample 4

    1          14.7          0.0          0.3
   25           0.7          4.7         19.5
   16           3.4          3.4          2.0
   16         103.4        103.4         29.3
   81           3.4          0.0          2.0
   36          14.7          1.4          0.3
   16          23.4          8.0         19.5
    4           8.0         23.4          5.8
   64          17.4         26.7         11.7
   16         124.7         34.0         12.8
   64          14.7          8.0         91.8
   49           3.4          0.7         73.7

Now we perform an ANOVA on the data given in the table above. The ANOVA table for this data is given by

Source of     Sums of     Degrees of     Mean        F-statistic
variation     squares     freedom        squares     F

Between        1430         3             477        0.46
Within        45491        44            1034
Total         46922        47

At the significance level $\alpha = 0.05$, we find from the F-table that $F_{0.05}(3, 44) = 2.84$. Since

$$0.46 = F < F_{0.05}(3, 44) = 2.84$$

we do not reject the null hypothesis that the variances are equal. Hence the Levene test suggests that the homogeneity of variances condition is met.
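The Levene procedure amounts to a one-way ANOVA on the squared deviations $W_{ij} = (Y_{ij} - \overline{Y}_{i\bullet})^2$. The Python sketch below is an illustration and not part of the original text; the helper names `anova_f` and `levene_f` are hypothetical, and the data are the four samples of Example 20.7 read column by column.

```python
data = [
    [34, 28, 29, 37, 42, 27, 29, 35, 25, 29, 41, 40],  # Sample 1
    [29, 32, 31, 43, 31, 29, 28, 30, 37, 44, 29, 31],  # Sample 2
    [32, 34, 30, 42, 32, 33, 29, 27, 37, 26, 29, 31],  # Sample 3
    [34, 29, 32, 28, 32, 34, 29, 31, 30, 37, 43, 42],  # Sample 4
]


def anova_f(groups):
    """Return (SS_B, SS_W, F) of a one-way ANOVA."""
    m = len(groups)
    N = sum(len(g) for g in groups)
    means = [sum(g) / len(g) for g in groups]
    grand = sum(sum(g) for g in groups) / N
    ss_b = sum(len(g) * (mu - grand) ** 2 for g, mu in zip(groups, means))
    ss_w = sum((y - mu) ** 2 for g, mu in zip(groups, means) for y in g)
    return ss_b, ss_w, (ss_b / (m - 1)) / (ss_w / (N - m))


def levene_f(groups):
    """Levene test: ANOVA applied to W_ij = (Y_ij - mean of sample i)^2."""
    transformed = []
    for g in groups:
        mu = sum(g) / len(g)
        transformed.append([(y - mu) ** 2 for y in g])
    return anova_f(transformed)


ss_b, ss_w, f_levene = levene_f(data)
print(round(ss_b), round(ss_w), round(f_levene, 2))
```

For these data the $F$ statistic is about 0.46, well below the critical value 2.84 used in the example.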

Although the Bartlett test is the most widely used test for homogeneity of variances, a test due to Cochran provides a computationally simple procedure. The Cochran test is one of the best methods for detecting cases where the variance of one of the groups is much larger than that of the other groups. The test statistic of the Cochran test is given by

$$C = \frac{\max_{1 \leq i \leq m} S_i^2}{\sum_{i=1}^{m} S_i^2}.$$

The Cochran test rejects the null hypothesis $H_0 : \sigma_1^2 = \sigma_2^2 = \cdots = \sigma_m^2$ at a significance level $\alpha$ if

$$C > C_\alpha.$$

The critical values of $C_\alpha$ were originally published by Eisenhart et al (1947) for some combinations of degrees of freedom $\nu$ and the number of groups $m$. Here the degrees of freedom $\nu$ are

$$\nu = \max_{1 \leq i \leq m} (n_i - 1).$$

Example 20.9. For the data in Example 20.7 perform a Cochran test to

examine if the homogeneity of variances condition is met for a signiﬁcance

level 0.05.


Answer: From the data the variances of each group can be found to be

$$S_1^2 = 35.2836, \quad S_2^2 = 30.1401, \quad S_3^2 = 19.4481, \quad S_4^2 = 24.4036.$$

Hence the test statistic for the Cochran test is

$$C = \frac{35.2836}{35.2836 + 30.1401 + 19.4481 + 24.4036} = \frac{35.2836}{109.2754} = 0.3229.$$

The critical value $C_{0.05}(3, 11)$ is given by 0.4884. Since

$$0.3229 = C < C_{0.05}(3, 11) = 0.4884,$$

at a significance level $\alpha = 0.05$, we do not reject the null hypothesis that the variances are equal. Hence the Cochran test suggests that the homogeneity of variances condition is met.
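Since the Cochran statistic is just the largest sample variance divided by the sum of all sample variances, it is easy to check numerically. The Python lines below are a minimal sketch and not part of the original text; the critical value 0.4884 is the tabulated value quoted in the example, entered by hand.

```python
variances = [35.2836, 30.1401, 19.4481, 24.4036]   # S_i^2 from Example 20.7
c_crit = 0.4884                                    # tabulated critical value quoted in the text

cochran_c = max(variances) / sum(variances)        # C = max S_i^2 / sum S_i^2
reject = cochran_c > c_crit

print(round(cochran_c, 4), reject)
```

The ratio evaluates to about 0.3229, which is below the critical value, so the null hypothesis of equal variances is not rejected.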

20.5. Exercises

1. A consumer organization wants to compare the prices charged for a particular brand of refrigerator in three types of stores in Louisville: discount stores, department stores and appliance stores. Random samples of 6 stores of each type were selected. The results are shown below.

Discount Department Appliance

1200 1700 1600

1300 1500 1500

1100 1450 1300

1400 1300 1500

1250 1300 1700

1150 1500 1400

At the 0.05 level of significance, is there any evidence of a difference in the average price between the types of stores?

2. It is conjectured that a certain gene might be linked to ovarian cancer. Ovarian cancer is sub-classified into three categories: stage I, stage II and stage III-IV. There are three random samples available, one from each stage. The samples are labelled with three color dyes and hybridized on a four channel cDNA microarray (one channel remains unused). The experiment is repeated 5 times and the following data were obtained.

Microarray Data

Array mRNA 1 mRNA 2 mRNA 3

1 100 95 70

2 90 93 72

3 105 79 81

4 83 85 74

5 78 90 75

Is there any difference between the averages of the three mRNA samples at the 0.05 significance level?

3. A stock market analyst thinks 4 mutual funds generate about the same return. He collected the accompanying rate-of-return data on 4 different mutual funds during the last 7 years. The data are given in the table below.

Mutual Funds

Year A B C D

2000 12 11 13 15

2001 12 17 19 11

2002 13 18 15 12

2004 18 20 25 11

2005 17 19 19 10

2006 18 12 17 10

2007 12 15 20 12

Do a one-way ANOVA to decide whether the funds give different performance at the 0.05 significance level.

4. Give a proof of the Theorem 20.4.

5. Give a proof of the Lemma 20.2.

6. Give a proof of the Theorem 20.5.

7. Give a proof of the Theorem 20.6.

8. An automobile company produces and sells its cars under 3 different brand names. An autoanalyst wants to see whether the different brands of cars have the same performance. He tested 20 cars from 3 different brands and recorded the mileage per gallon.

Brand 1 Brand 2 Brand 3

32 31 34

29 28 25

32 30 31

25 34 37

35 39 32

33 36

34 38

Do the data suggest a rejection of the null hypothesis at a significance level 0.05 that the mileage per gallon generated by the three different brands is the same?

Chapter 21

GOODNESS OF FITS

TESTS

In point estimation, interval estimation or hypothesis testing we always started with a random sample $X_1, X_2, ..., X_n$ of size $n$ from a known distribution. In order to apply the theory to data analysis one has to know the distribution of the sample. Quite often the experimenter (or data analyst) assumes the nature of the sample distribution based on his subjective knowledge.

Goodness of fit tests are performed to validate the experimenter's opinion about the distribution of the population from which the sample is drawn. The most commonly known and most frequently used goodness of fit tests are the Kolmogorov-Smirnov (KS) test and the Pearson chi-square ($\chi^2$) test.

There is a controversy over which test is the most powerful, but the general feeling seems to be that the Kolmogorov-Smirnov test is probably more powerful than the chi-square test in most situations. The KS test measures the distance between distribution functions, while the $\chi^2$ test measures the distance between density functions. Usually, if the population distribution is continuous, then one uses the Kolmogorov-Smirnov test, whereas if the population distribution is discrete, then one performs Pearson's chi-square goodness of fit test.

21.1. Kolmogorov-Smirnov Test

Let $X_1, X_2, ..., X_n$ be a random sample from a population $X$. We hypothesize that the distribution of $X$ is $F(x)$. Further, we wish to test our hypothesis. Thus our null hypothesis is

$$H_o : X \sim F(x).$$


We would like to design a test of this null hypothesis against the alternative

$$H_a : X \not\sim F(x).$$

In order to design a test, first of all we need a statistic which will unbiasedly estimate the unknown distribution $F(x)$ of the population $X$ using the random sample $X_1, X_2, ..., X_n$. Let $x_{(1)} < x_{(2)} < \cdots < x_{(n)}$ be the observed values of the order statistics $X_{(1)}, X_{(2)}, ..., X_{(n)}$. The empirical distribution of the random sample is defined as

$$F_n(x) = \begin{cases} 0 & \text{if } x < x_{(1)}, \\ \frac{k}{n} & \text{if } x_{(k)} \leq x < x_{(k+1)}, \text{ for } k = 1, 2, ..., n-1, \\ 1 & \text{if } x_{(n)} \leq x. \end{cases}$$

The graph of the empirical distribution function F4 (x ) is shown below.

Empirical Distribution Function

For a fixed value of $x$, the empirical distribution function can be considered as a random variable that takes on the values

$$0, \; \frac{1}{n}, \; \frac{2}{n}, \; ..., \; \frac{n-1}{n}, \; \frac{n}{n}.$$

First we show that $F_n(x)$ is an unbiased estimator of the population distribution $F(x)$. That is,

$$E(F_n(x)) = F(x) \qquad (1)$$

for a fixed value of $x$. To establish (1), we need the probability density function of the random variable $F_n(x)$. From the definition of the empirical distribution we see that if exactly $k$ observations are less than or equal to $x$, then

$$F_n(x) = \frac{k}{n}$$

which is

$$n\, F_n(x) = k.$$

The probability that an observation is less than or equal to $x$ is given by $F(x)$.

[Figure: Distribution of the Empirical Distribution Function]

Hence (see figure above)

$$P(n F_n(x) = k) = P\left(F_n(x) = \frac{k}{n}\right) = \binom{n}{k} [F(x)]^k [1 - F(x)]^{n-k}$$

for $k = 0, 1, ..., n$. Thus

$$n F_n(x) \sim BIN(n, F(x)).$$

Thus the expected value of the random variable $n F_n(x)$ is given by

$$E(n F_n(x)) = n F(x)$$

$$n E(F_n(x)) = n F(x)$$

$$E(F_n(x)) = F(x).$$

This shows that, for a fixed $x$, $F_n(x)$ equals, on average, the population distribution function $F(x)$. Hence the empirical distribution function $F_n(x)$ is an unbiased estimator of $F(x)$.

Since $n F_n(x) \sim BIN(n, F(x))$, the variance of $n F_n(x)$ is given by

$$Var(n F_n(x)) = n F(x) [1 - F(x)].$$

Hence the variance of $F_n(x)$ is

$$Var(F_n(x)) = \frac{F(x) [1 - F(x)]}{n}.$$

It is easy to see that $Var(F_n(x)) \to 0$ as $n \to \infty$ for all values of $x$. Thus the empirical distribution function $F_n(x)$ and $F(x)$ tend to be closer to each other for large $n$. As a matter of fact, Glivenko, a Russian mathematician, proved that $F_n(x)$ converges to $F(x)$ uniformly in $x$ as $n \to \infty$ with probability one.
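The mean and variance of the empirical distribution function can be checked exactly from the binomial probabilities. The sketch below is an illustration only: the helper name `ecdf_moments` and the values n = 12 and F(x) = 3/10 are assumptions chosen for the example.

```python
from fractions import Fraction
from math import comb

def ecdf_moments(n, F):
    """Exact mean and variance of F_n(x) when F(x) = F, using n*F_n(x) ~ BIN(n, F)."""
    pmf = [comb(n, k) * F**k * (1 - F)**(n - k) for k in range(n + 1)]
    mean = sum(Fraction(k, n) * p for k, p in enumerate(pmf))
    second = sum(Fraction(k, n)**2 * p for k, p in enumerate(pmf))
    return mean, second - mean**2

n, F = 12, Fraction(3, 10)      # illustrative values, not from the text
mean, var = ecdf_moments(n, F)
print(mean, var)                # mean equals F and var equals F*(1 - F)/n
```

Exact rational arithmetic is used so that the comparison with F(x) and F(x)[1 - F(x)]/n is free of rounding error.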

Because of the convergence of the empirical distribution function to the theoretical distribution function, it makes sense to construct a goodness of fit test based on the closeness of $F_n(x)$ and the hypothesized distribution $F(x)$. Let

$$D_n = \max_{x \in \mathbb{R}} |F_n(x) - F(x)|.$$

That is, $D_n$ is the maximum of all pointwise differences $|F_n(x) - F(x)|$. The distribution of the Kolmogorov-Smirnov statistic $D_n$ can be derived. However, we shall not do that here as the derivation is quite involved. Instead, we give a closed form formula for $P(D_n \le d)$. If $X_1, X_2, ..., X_n$ is a sample from a population with continuous distribution function $F(x)$, then

$$P(D_n \le d) = \begin{cases} 0 & \text{if } d \le \frac{1}{2n}, \\ n! \prod_{i=1}^{n} \int_{2i-d}^{\frac{2i-1}{n}+d} du & \text{if } \frac{1}{2n} < d < 1, \\ 1 & \text{if } d \ge 1, \end{cases}$$

where $du = du_1 du_2 \cdots du_n$ with $0 < u_1 < u_2 < \cdots < u_n < 1$. Further,

$$\lim_{n \to \infty} P(\sqrt{n}\, D_n \le d) = 1 - 2 \sum_{k=1}^{\infty} (-1)^{k-1} e^{-2k^2 d^2}.$$

These formulas show that the distribution of the Kolmogorov-Smirnov statis-

tic Dn is distribution free, that is, it does not depend on the distribution F

of the population.
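The limiting series for the distribution of the scaled statistic converges very quickly, so a short partial sum is enough in practice. The following sketch evaluates it; the function name `kolmogorov_limit_cdf`, the number of terms, and the evaluation point d = 1.36 are assumptions made for illustration.

```python
from math import exp

def kolmogorov_limit_cdf(d, terms=100):
    """Partial sum of 1 - 2 * sum_{k>=1} (-1)**(k-1) * exp(-2 * k**2 * d**2)."""
    series = sum((-1) ** (k - 1) * exp(-2.0 * k * k * d * d)
                 for k in range(1, terms + 1))
    return 1.0 - 2.0 * series

value = kolmogorov_limit_cdf(1.36)   # illustrative evaluation point
print(round(value, 4))
```

Nothing in the function refers to the population distribution F, which is the distribution-free property mentioned above.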

For most situations, it is sufficient to use the following approximation due to Kolmogorov:

$$P(D_n \le d) \approx 1 - 2 e^{-2nd^2} \quad \text{for } d > \frac{1}{\sqrt{n}}.$$

If the null hypothesis $H_o : X \sim F(x)$ is true, the statistic $D_n$ is small. It is therefore reasonable to reject $H_o$ if and only if the observed value of $D_n$ is larger than some constant $d_n$. If the level of significance is given to be $\alpha$, then the constant $d_n$ can be found from

$$\alpha = P(D_n > d_n \mid H_o \text{ is true}) \approx 2 e^{-2 n d_n^2}.$$

This yields the following hypothesis test: Reject $H_o$ if $D_n \ge d_n$ where

$$d_n = \sqrt{-\frac{1}{2n} \ln\left(\frac{\alpha}{2}\right)}$$

is obtained from the above Kolmogorov's approximation. Note that the approximate value of $d_{12}$ obtained by the above formula is equal to 0.3533 when $\alpha = 0.1$; however, a more accurate value of $d_{12}$ is 0.34.
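The approximate critical value quoted above for n = 12 and significance level 0.1 can be reproduced directly from the formula for the constant d_n. The function name `ks_critical_value` is an assumption made for this sketch.

```python
from math import log, sqrt

def ks_critical_value(n, alpha):
    """Approximate d_n obtained by solving alpha = 2 * exp(-2 * n * d_n**2)."""
    return sqrt(-log(alpha / 2.0) / (2.0 * n))

d12 = ks_critical_value(12, 0.1)
print(round(d12, 4))
```

The result agrees with the approximate value 0.3533 stated in the text; the tabulated value 0.34 is slightly smaller, so the approximation is a little conservative for this sample size.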

Next we address the issue of the computation of the statistic $D_n$. Let us define

$$D_n^+ = \max_{x \in \mathbb{R}} \{F_n(x) - F(x)\}$$

and

$$D_n^- = \max_{x \in \mathbb{R}} \{F(x) - F_n(x)\}.$$

Then it is easy to see that

$$D_n = \max\{D_n^+, D_n^-\}.$$

Further, since $F_n(x_{(i)}) = \frac{i}{n}$, it can be shown that

$$D_n^+ = \max\left\{ \max_{1 \le i \le n} \left[ \frac{i}{n} - F(x_{(i)}) \right], 0 \right\}$$

and

$$D_n^- = \max\left\{ \max_{1 \le i \le n} \left[ F(x_{(i)}) - \frac{i-1}{n} \right], 0 \right\}.$$

Therefore it can also be shown that

$$D_n = \max_{1 \le i \le n} \left\{ \max\left[ \frac{i}{n} - F(x_{(i)}), \; F(x_{(i)}) - \frac{i-1}{n} \right] \right\}.$$

The following figure illustrates the Kolmogorov-Smirnov statistic $D_n$ when $n = 4$.

[Figure: Kolmogorov-Smirnov Statistic]
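The computational formula for the Kolmogorov-Smirnov statistic translates almost line by line into code. The sketch below is an illustration under stated assumptions: the name `ks_statistic` and the four sample values are hypothetical, and the hypothesized distribution is taken to be uniform on (0, 1).

```python
def ks_statistic(sample, cdf):
    """D_n = max over i of max(i/n - F(x_(i)), F(x_(i)) - (i-1)/n)."""
    ordered = sorted(sample)
    n = len(ordered)
    d_plus = max(i / n - cdf(x) for i, x in enumerate(ordered, start=1))
    d_minus = max(cdf(x) - (i - 1) / n for i, x in enumerate(ordered, start=1))
    return max(d_plus, d_minus)

# Hypothetical sample tested against UNIF(0, 1), whose cdf is F(x) = x on [0, 1]
sample = [0.10, 0.45, 0.50, 0.90]
D4 = ks_statistic(sample, lambda x: x)
print(D4)
```

Only the n ordered observations need to be inspected, because the largest gap between the step function and a continuous distribution function always occurs at one of the jump points.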

Example 21.1. The data on the heights of 12 infants are given below: 18.2, 21.4, 22.6, 17.4, 17.6, 16.7, 17.1, 21.4, 20.1, 17.9, 16.8, 23.1. Test the hypothesis that the data came from some normal population at a significance level $\alpha = 0.1$.

Answer: Here, the null hypothesis is

$$H_o : X \sim N(\mu, \sigma^2).$$

First we estimate $\mu$ and $\sigma^2$ from the data. Thus, we get

$$\bar{x} = \frac{230.3}{12} = 19.2$$

and

$$s^2 = \frac{4482.01 - \frac{1}{12}(230.3)^2}{12 - 1} = \frac{62.17}{11} = 5.65.$$

Hence $s = 2.38$. Then by the null hypothesis

$$F(x_{(i)}) = P\left( Z \le \frac{x_{(i)} - 19.2}{2.38} \right)$$

where $Z \sim N(0, 1)$ and $i = 1, 2, ..., n$. Next we compute the Kolmogorov-Smirnov statistic $D_n$ for the given sample of size 12 using the following tabular form.

 i    x_(i)    F(x_(i))    i/12 - F(x_(i))    F(x_(i)) - (i-1)/12
 1    16.7     0.1469         -0.0636               0.1469
 2    16.8     0.1562          0.0105               0.0729
 3    17.1     0.1894          0.0606               0.0227
 4    17.4     0.2236          0.1097              -0.0264
 5    17.6     0.2514          0.1653              -0.0819
 6    17.9     0.2912          0.2088              -0.1255
 7    18.2     0.3372          0.2461              -0.1628
 8    20.1     0.6480          0.0187               0.0647
 9    21.4     0.8212          0.0121               0.0712
10    21.4
11    22.6     0.9236         -0.0069               0.0903
12    23.1     0.9495          0.0505               0.0328

Thus

$$D_{12} = 0.2461.$$

From the tabulated value, we see that $d_{12} = 0.34$ for significance level $\alpha = 0.1$. Since $D_{12}$ is smaller than $d_{12}$, we accept the null hypothesis $H_o : X \sim N(\mu, \sigma^2)$. Hence the data are consistent with having come from a normal population.

Example 21.2. Let $X_1, X_2, ..., X_{10}$ be a random sample from a distribution whose probability density function is

$$f(x) = \begin{cases} 1 & \text{if } 0 < x < 1 \\ 0 & \text{otherwise.} \end{cases}$$

Based on the observed values 0.62, 0.36, 0.23, 0.76, 0.65, 0.09, 0.55, 0.26, 0.38, 0.24, test the hypothesis $H_o : X \sim UNIF(0, 1)$ against $H_a : X \not\sim UNIF(0, 1)$ at a significance level $\alpha = 0.1$.

Answer: The null hypothesis is $H_o : X \sim UNIF(0, 1)$. Thus

$$F(x) = \begin{cases} 0 & \text{if } x < 0 \\ x & \text{if } 0 \le x < 1 \\ 1 & \text{if } x \ge 1. \end{cases}$$

Hence

$$F(x_{(i)}) = x_{(i)} \quad \text{for } i = 1, 2, ..., n.$$

Next we compute the Kolmogorov-Smirnov statistic $D_n$ for the given sample of size 10 using the following tabular form.

 i    x_(i)    F(x_(i))    i/10 - F(x_(i))    F(x_(i)) - (i-1)/10
 1    0.09     0.09           0.01                 0.09
 2    0.23     0.23          -0.03                 0.13
 3    0.24     0.24           0.06                 0.04
 4    0.26     0.26           0.14                -0.04
 5    0.36     0.36           0.14                -0.04
 6    0.38     0.38           0.22                -0.12
 7    0.55     0.55           0.15                -0.05
 8    0.62     0.62           0.18                -0.08
 9    0.65     0.65           0.25                -0.15
10    0.76     0.76           0.24                -0.14

Thus

$$D_{10} = 0.25.$$

From the tabulated value, we see that $d_{10} = 0.37$ for significance level $\alpha = 0.1$. Since $D_{10}$ is smaller than $d_{10}$, we accept the null hypothesis

$$H_o : X \sim UNIF(0, 1).$$

21.2. Chi-square Test

The chi-square goodness of fit test was introduced by Karl Pearson in 1900. Recall that the Kolmogorov-Smirnov test is only for testing a specific continuous distribution. Thus if we wish to test the null hypothesis

$$H_o : X \sim BIN(n, p)$$

against the alternative $H_a : X \not\sim BIN(n, p)$, then we cannot use the Kolmogorov-Smirnov test. The Pearson chi-square goodness of fit test can be used for testing a null hypothesis involving discrete as well as continuous distributions. Unlike the Kolmogorov-Smirnov test, the Pearson chi-square test uses the density function of the population $X$.

Let $X_1, X_2, ..., X_n$ be a random sample from a population $X$ with probability density function $f(x)$. We wish to test the null hypothesis

$$H_o : X \sim f(x)$$

against

$$H_a : X \not\sim f(x).$$

If the probability density function $f(x)$ is continuous, then we divide up the abscissa of the probability density function $f(x)$ and calculate the probability $p_i$ for each of the intervals by using

$$p_i = \int_{x_{i-1}}^{x_i} f(x) \, dx,$$

where $\{x_0, x_1, ..., x_m\}$ is a partition of the domain of $f(x)$.

[Figure: Discretization of continuous density function]

Let $Y_1, Y_2, ..., Y_m$ denote the number of observations (from the random sample $X_1, X_2, ..., X_n$) in the 1st, 2nd, 3rd, ..., $m$th interval, respectively.

Since the sample size is $n$, the number of observations expected to fall in the $i$th interval is equal to $n p_i$. Then

$$Q = \sum_{i=1}^{m} \frac{(Y_i - n p_i)^2}{n p_i}$$

measures the closeness of the observed $Y_i$ to the expected number $n p_i$. For large $n$, the distribution of $Q$ is approximately chi-square with $m - 1$ degrees of freedom. The derivation of this fact is quite involved and beyond the scope of this introductory level book.

Although the distribution of $Q$ for $m > 2$ is hard to derive, for $m = 2$ it is not very difficult. Thus we give a derivation to convince the reader that $Q$ has a $\chi^2$ distribution. Notice that $Y_1 \sim BIN(n, p_1)$. Hence for large $n$ by the central limit theorem, we have

$$\frac{Y_1 - n p_1}{\sqrt{n p_1 (1 - p_1)}} \sim N(0, 1).$$

Thus

$$\frac{(Y_1 - n p_1)^2}{n p_1 (1 - p_1)} \sim \chi^2(1).$$

Since

$$\frac{(Y_1 - n p_1)^2}{n p_1 (1 - p_1)} = \frac{(Y_1 - n p_1)^2}{n p_1} + \frac{(Y_1 - n p_1)^2}{n (1 - p_1)},$$

we have

$$\frac{(Y_1 - n p_1)^2}{n p_1} + \frac{(Y_1 - n p_1)^2}{n (1 - p_1)} \sim \chi^2(1),$$

which is

$$\frac{(Y_1 - n p_1)^2}{n p_1} + \frac{(n - Y_2 - n + n p_2)^2}{n p_2} \sim \chi^2(1)$$

due to the facts that $Y_1 + Y_2 = n$ and $p_1 + p_2 = 1$. Hence

$$\sum_{i=1}^{2} \frac{(Y_i - n p_i)^2}{n p_i} \sim \chi^2(1),$$

that is, the chi-square statistic $Q$ has an approximate chi-square distribution.

Now the simple null hypothesis

$$H_o : p_1 = p_{10}, \; p_2 = p_{20}, \; \cdots, \; p_m = p_{m0}$$

is to be tested against the composite alternative

$$H_a : \text{at least one } p_i \text{ is not equal to } p_{i0} \text{ for some } i.$$

Here $p_{10}, p_{20}, ..., p_{m0}$ are fixed probability values. If the null hypothesis is true, then the statistic

$$Q = \sum_{i=1}^{m} \frac{(Y_i - n p_{i0})^2}{n p_{i0}}$$

has an approximate chi-square distribution with $m - 1$ degrees of freedom. If the significance level $\alpha$ of the hypothesis test is given, then

$$\alpha = P\left( Q \ge \chi^2_{1-\alpha}(m - 1) \right)$$

and the test is "Reject $H_o$ if $Q \ge \chi^2_{1-\alpha}(m - 1)$." Here $\chi^2_{1-\alpha}(m - 1)$ denotes a real number such that the integral of the chi-square density function with $m - 1$ degrees of freedom from zero to this real number $\chi^2_{1-\alpha}(m - 1)$ is $1 - \alpha$.

Now we give several examples to illustrate the chi-square goodness-of-ﬁt test.

Example 21.3. A die was rolled 30 times with the results shown below:

Number of spots    1   2   3   4   5   6
Frequency (x_i)    1   4   9   9   2   5

If a chi-square goodness of fit test is used to test the hypothesis that the die is fair at a significance level $\alpha = 0.05$, then what is the value of the chi-square statistic and the decision reached?

Answer: In this problem, the null hypothesis is

$$H_o : p_1 = p_2 = \cdots = p_6 = \frac{1}{6}.$$

The alternative hypothesis is that not all $p_i$'s are equal to $\frac{1}{6}$. The test will be based on 30 trials, so $n = 30$. The test statistic is

$$Q = \sum_{i=1}^{6} \frac{(x_i - n p_i)^2}{n p_i},$$

where $p_1 = p_2 = \cdots = p_6 = \frac{1}{6}$. Thus

$$n p_i = (30) \frac{1}{6} = 5$$

and

$$Q = \sum_{i=1}^{6} \frac{(x_i - n p_i)^2}{n p_i} = \sum_{i=1}^{6} \frac{(x_i - 5)^2}{5} = \frac{1}{5} [16 + 1 + 16 + 16 + 9] = \frac{58}{5} = 11.6.$$

The tabulated $\chi^2$ value for $\chi^2_{0.95}(5)$ is given by

$$\chi^2_{0.95}(5) = 11.07.$$

Since

$$11.6 = Q > \chi^2_{0.95}(5) = 11.07,$$

the null hypothesis $H_o : p_1 = p_2 = \cdots = p_6 = \frac{1}{6}$ should be rejected.
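The arithmetic of Example 21.3 can be verified in a few lines. The critical value 11.07 is the tabulated number quoted in the example, not something computed here, and the variable names are assumptions made for the sketch.

```python
frequencies = [1, 4, 9, 9, 2, 5]     # observed counts for faces 1..6
n = sum(frequencies)                  # 30 rolls
expected = n / 6                      # n * p_i = 5 under H_o
Q = sum((x - expected) ** 2 / expected for x in frequencies)
critical = 11.07                      # tabulated chi-square value quoted in the example
print(Q, Q > critical)
```

The face with frequency 5 contributes zero to the sum, which is why only five terms appear inside the bracket in the worked answer.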

Example 21.4. It is hypothesized that an experiment results in outcomes K, L, M and N with probabilities $\frac{1}{5}$, $\frac{3}{10}$, $\frac{1}{10}$ and $\frac{2}{5}$, respectively. Forty independent repetitions of the experiment have results as follows:

Outcome      K    L    M    N
Frequency    11   14   5    10

If a chi-square goodness of fit test is used to test the above hypothesis at the significance level $\alpha = 0.01$, then what is the value of the chi-square statistic and the decision reached?

Answer: Here the null hypothesis to be tested is

$$H_o : p(K) = \frac{1}{5}, \; p(L) = \frac{3}{10}, \; p(M) = \frac{1}{10}, \; p(N) = \frac{2}{5}.$$

The test will be based on $n = 40$ trials. The test statistic is

$$Q = \sum_{k=1}^{4} \frac{(x_k - n p_k)^2}{n p_k}$$

$$= \frac{(x_1 - 8)^2}{8} + \frac{(x_2 - 12)^2}{12} + \frac{(x_3 - 4)^2}{4} + \frac{(x_4 - 16)^2}{16}$$

$$= \frac{(11 - 8)^2}{8} + \frac{(14 - 12)^2}{12} + \frac{(5 - 4)^2}{4} + \frac{(10 - 16)^2}{16}$$

$$= \frac{9}{8} + \frac{4}{12} + \frac{1}{4} + \frac{36}{16}$$

$$= \frac{95}{24} = 3.958.$$

From the chi-square table, we have

$$\chi^2_{0.99}(3) = 11.35.$$

Thus

$$3.958 = Q < \chi^2_{0.99}(3) = 11.35.$$

Therefore we accept the null hypothesis.
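Because the hypothesized probabilities in Example 21.4 are simple fractions, the statistic can be computed exactly with rational arithmetic. The dictionary names below are assumptions made for this sketch; the critical value 11.35 is the tabulated number quoted in the example.

```python
from fractions import Fraction

observed = {"K": 11, "L": 14, "M": 5, "N": 10}
hypothesized = {"K": Fraction(1, 5), "L": Fraction(3, 10),
                "M": Fraction(1, 10), "N": Fraction(2, 5)}

n = sum(observed.values())                       # 40 repetitions
Q = sum((observed[k] - n * hypothesized[k]) ** 2 / (n * hypothesized[k])
        for k in observed)
print(Q, float(Q))
```

Working with `Fraction` reproduces the exact value 95/24 from the worked answer before any rounding to three decimals.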

Example 21.5. Test at the 10% significance level the hypothesis that the following data

06.88 06.92 04.80 09.85 07.05 19.06 06.54 03.67 02.94 04.89
69.82 06.97 04.34 13.45 05.74 10.07 16.91 07.47 05.04 07.97
15.74 00.32 04.14 05.19 18.69 02.45 23.69 44.10 01.70 02.14
05.79 03.02 09.87 02.44 18.99 18.90 05.42 01.54 01.55 20.99
07.99 05.38 02.36 09.66 00.97 04.82 10.43 15.06 00.49 02.81

give the values of a random sample of size 50 from an exponential distribution with probability density function

$$f(x; \theta) = \begin{cases} \frac{1}{\theta} e^{-\frac{x}{\theta}} & \text{if } 0 < x < \infty \\ 0 & \text{elsewhere,} \end{cases}$$

where $\theta > 0$.

Answer: From the data $\bar{x} = 9.74$ and $s = 11.71$. Notice that

$$H_o : X \sim EXP(\theta).$$

Hence we have to partition the domain of the exponential distribution into $m$ parts. There is no rule to determine what the value of $m$ should be. We assume $m = 10$ (an arbitrary choice for the sake of convenience). We partition the domain of the given probability density function into 10 mutually disjoint sets of equal probability. This partition can be found as follows.

Note that $\bar{x}$ estimates $\theta$. Thus

$$\hat{\theta} = \bar{x} = 9.74.$$

Now we compute the points $x_1, x_2, ..., x_{10}$ which will be used to partition the domain of $f(x)$:

$$\frac{1}{10} = \int_{0}^{x_1} \frac{1}{\theta} e^{-\frac{x}{\theta}} \, dx = \left[ -e^{-\frac{x}{\theta}} \right]_{0}^{x_1} = 1 - e^{-\frac{x_1}{\theta}}.$$

Hence

$$x_1 = \theta \ln\left( \frac{10}{9} \right) = 9.74 \ln\left( \frac{10}{9} \right) = 1.026.$$

Using the value of $x_1$, we can find the value of $x_2$. That is

$$\frac{1}{10} = \int_{x_1}^{x_2} \frac{1}{\theta} e^{-\frac{x}{\theta}} \, dx = e^{-\frac{x_1}{\theta}} - e^{-\frac{x_2}{\theta}}.$$

Hence

$$x_2 = -\theta \ln\left( e^{-\frac{x_1}{\theta}} - \frac{1}{10} \right).$$

In general

$$x_k = -\theta \ln\left( e^{-\frac{x_{k-1}}{\theta}} - \frac{1}{10} \right)$$

for $k = 1, 2, ..., 9$ (with $x_0 = 0$), and $x_{10} = \infty$. Using these $x_k$'s we find the intervals $A_k = [x_k, x_{k+1})$, which are tabulated in the table below along with the number of data points in each interval.

Interval A_i         Frequency (o_i)    Expected value (e_i)
[0, 1.026)                 3                    5
[1.026, 2.173)             4                    5
[2.173, 3.474)             6                    5
[3.474, 4.975)             6                    5
[4.975, 6.751)             7                    5
[6.751, 8.925)             7                    5
[8.925, 11.727)            5                    5
[11.727, 15.676)           2                    5
[15.676, 22.427)           7                    5
[22.427, ∞)                3                    5
Total                     50                   50

From this table, we compute the statistic

$$Q = \sum_{i=1}^{10} \frac{(o_i - e_i)^2}{e_i} = 6.4$$

and from the chi-square table, we obtain

$$\chi^2_{0.9}(9) = 14.68.$$

Since

$$6.4 = Q < \chi^2_{0.9}(9) = 14.68,$$

we accept the null hypothesis that the sample was taken from a population with exponential distribution.
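The whole of Example 21.5 can be retraced numerically. The sketch below uses the closed form x_k = -θ ln(1 - k/m) for the cell boundaries, which is equivalent to the recursion in the answer because each step removes probability 1/10 from the upper tail; the variable names are assumptions made for illustration, and the estimate 9.74 is taken from the example.

```python
from math import log

data = [
    6.88, 6.92, 4.80, 9.85, 7.05, 19.06, 6.54, 3.67, 2.94, 4.89,
    69.82, 6.97, 4.34, 13.45, 5.74, 10.07, 16.91, 7.47, 5.04, 7.97,
    15.74, 0.32, 4.14, 5.19, 18.69, 2.45, 23.69, 44.10, 1.70, 2.14,
    5.79, 3.02, 9.87, 2.44, 18.99, 18.90, 5.42, 1.54, 1.55, 20.99,
    7.99, 5.38, 2.36, 9.66, 0.97, 4.82, 10.43, 15.06, 0.49, 2.81,
]
theta, m = 9.74, 10                    # estimate and number of cells from the example

# Cell boundaries: the k/m quantiles of EXP(theta), x_k = -theta * ln(1 - k/m)
cuts = [-theta * log(1 - k / m) for k in range(1, m)]

# A point falls in cell j when exactly j boundaries are <= it
counts = [0] * m
for x in data:
    counts[sum(c <= x for c in cuts)] += 1

expected = len(data) / m               # 5 observations per cell
Q = sum((o - expected) ** 2 / expected for o in counts)
print([round(c, 3) for c in cuts])
print(counts, Q)
```

Choosing cells of equal probability makes every expected count equal to n/m, so the statistic reduces to the sum of squared deviations of the counts from 5, divided by 5.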

21.3. Review Exercises

1. The data on the heights of 4 infants are: 18.2, 21.4, 16.7 and 23.1. For a significance level $\alpha = 0.1$, use the Kolmogorov-Smirnov test to test the hypothesis that the data came from some uniform population on the interval (15, 25). (Use $d_4 = 0.56$ at $\alpha = 0.1$.)

2. A four-sided die was rolled 40 times with the following results:

Number of spots    1   2   3    4
Frequency          5   9   10   16

If a chi-square goodness of fit test is used to test the hypothesis that the die is fair at a significance level $\alpha = 0.05$, then what is the value of the chi-square statistic?

3. A coin is tossed 500 times and $k$ heads are observed. If the chi-square distribution is used to test the hypothesis that the coin is unbiased, this hypothesis will be accepted at the 5 percent level of significance if and only if $k$ lies between what values? (Use $\chi^2_{0.05}(1) = 3.84$.)

4. It is hypothesized that an experiment results in outcomes A, C, T and G with probabilities $\frac{1}{16}$, $\frac{5}{16}$, $\frac{1}{8}$ and $\frac{3}{8}$, respectively. Eighty independent repetitions of the experiment have results as follows:

Outcome      A   G    C    T
Frequency    3   28   15   34

If a chi-square goodness of fit test is used to test the above hypothesis at the significance level $\alpha = 0.1$, then what is the value of the chi-square statistic and the decision reached?

5. A die was rolled 50 times with the results shown below:

Number of spots    1   2   3    4    5   6
Frequency (x_i)    8   7   12   13   4   6

If a chi-square goodness of fit test is used to test the hypothesis that the die is fair at a significance level $\alpha = 0.1$, then what is the value of the chi-square statistic and the decision reached?

6. Test at the 10% significance level the hypothesis that the following data

05.88 05.92 03.80 08.85 06.05 18.06 05.54 02.67 01.94 03.89
70.82 07.97 05.34 14.45 06.74 11.07 17.91 08.47 06.04 08.97
16.74 01.32 03.14 06.19 19.69 03.45 24.69 45.10 02.70 03.14
04.79 02.02 08.87 03.44 17.99 17.90 04.42 01.54 01.55 19.99
06.99 05.38 03.36 08.66 01.97 03.82 11.43 14.06 01.49 01.81

give the values of a random sample of size 50 from an exponential distribution with probability density function

$$f(x; \theta) = \begin{cases} \frac{1}{\theta} e^{-\frac{x}{\theta}} & \text{if } 0 < x < \infty \\ 0 & \text{elsewhere,} \end{cases}$$

where $\theta > 0$.

7. Test at the 10% significance level the hypothesis that the following data

0.88 0.92 0.80 0.85 0.05 0.06 0.54 0.67 0.94 0.89
0.82 0.97 0.34 0.45 0.74 0.07 0.91 0.47 0.04 0.97
0.74 0.32 0.14 0.19 0.69 0.45 0.69 0.10 0.70 0.14
0.79 0.02 0.87 0.44 0.99 0.90 0.42 0.54 0.55 0.99
0.94 0.38 0.36 0.66 0.97 0.82 0.43 0.06 0.49 0.81

give the values of a random sample of size 50 from a distribution with probability density function

$$f(x; \theta) = \begin{cases} (1 + \theta) x^{\theta} & \text{if } 0 \le x \le 1 \\ 0 & \text{elsewhere,} \end{cases}$$

where $\theta > 0$.

8. Test at the 10% significance level the hypothesis that the following data

06.88 06.92 04.80 09.85 07.05 19.06 06.54 03.67 02.94 04.89
29.82 06.97 04.34 13.45 05.74 10.07 16.91 07.47 05.04 07.97
15.74 00.32 04.14 05.19 18.69 02.45 23.69 24.10 01.70 02.14
05.79 03.02 09.87 02.44 18.99 18.90 05.42 01.54 01.55 20.99
07.99 05.38 02.36 09.66 00.97 04.82 10.43 15.06 00.49 02.81

give the values of a random sample of size 50 from a distribution with probability density function

$$f(x; \theta) = \begin{cases} \frac{1}{\theta} & \text{if } 0 \le x \le \theta \\ 0 & \text{elsewhere.} \end{cases}$$

9. Suppose that in 60 rolls of a die the outcomes 1, 2, 3, 4, 5, and 6 occur with frequencies $n_1$, $n_2$, 14, 8, 10, and 8 respectively. What is the least value of $\sum_{i=1}^{2} (n_i - 10)^2$ for which the chi-square test rejects the hypothesis that the die is fair at the 1% level of significance?

10. It is hypothesized that of all marathon runners 70% are adult men, 25% are adult women, and 5% are youths. To test this hypothesis, the following data from a recent marathon are used:

Adult Men    Adult Women    Youths    Total
630          300            70        1000

A chi-square goodness-of-fit test is used. What is the value of the statistic?

Goodness of Fit Tests 670

Probability and Mathematical Statistics 671

REFERENCES

[1] Aitken, A. C. (1944). Statistical Mathematics . 3rd edn. Edinburgh and

London: Oliver and Boyd,

[2] Arbous, A. G. and Kerrich, J. E. (1951). Accident statistics and the

concept of accident-proneness. Biometrics , 7, 340-432.

[3] Arnold, S. (1984). Pivotal quantities and invariant conﬁdence regions.

Statistics and Decisions 2, 257-280.

[4] Bain, L. J. and Engelhardt. M. (1992). Introduction to Probability and

Mathematical Statistics. Belmont: Duxbury Press.

[5] Bartlett, M. S. (1937). Properties of suﬃ ciency and statistical tests.

Proceedings of the Royal Society, London, Ser. A, 160, 268-282.

[6] Bartlett, M. S. (1937). Some examples of statistical methods of research

in agriculture and applied biology. J. R. Stat. Soc., Suppli. , 4, 137-183.

[7] Brown, L. D. (1988). Lecture Notes, Department of Mathematics, Cornell

University. Ithaca, New York.

[8] Brown, M. B. and Forsythe, A. B. (1974). Robust tests for equality of

variances. Journal of American Statistical Association, 69, 364-367.

[9] Campbell, J. T. (1934). THe Poisson correlation function. Proc. Edin.

Math. Soc., Series 2, 4, 18-26.

[10] Casella, G. and Berger, R. L. (1990). Statistical Inference. Belmont:

Wadsworth.

[11] Castillo, E. (1988). Extreme Value Theory in Engineering. San Diego:

Academic Press.

[12] Cherian, K. C. (1941). A bivariate correlated gamma-type distribution

function. J. Indian Math. Soc. , 5, 133-144.

References 672

[13] Dahiya, R., and Guttman, I. (1982). Shortest conﬁdence and prediction

intervals for the log-normal. The canadian Journal of Statistics 10, 777-

891.

[14] David, F.N. and Fix, E. (1961). Rank correlation and regression in a

non-normal surface. Proc. 4th Berkeley Symp. Math. Statist. & Prob.,

1, 177-197.

[15] Desu, M. (1971). Optimal conﬁdence intervals of ﬁxed width. The Amer-

ican Statistician 25, 27-29.

[16] Dynkin, E. B. (1951). Necessary and suﬃ cient statistics for a family of

probability distributions. English translation in Selected Translations in

Mathematical Statistics and Probability, 1 (1961), 23-41.

[17] Einstein, A. (1905). ¨

Uber die von der molekularkinetischen Theorie

der W¨arme geforderte Bewegung von in ruhenden Fl¨ussigkeiten sus-

pendierten Teilchen, Ann. Phys. 17, 549560.

[18] Eisenhart, C., Hastay, M. W. and Wallis, W. A. (1947). Selected Tech-

niques of Statistical Analysis, New York: McGraw-Hill.

[19] Feller, W. (1968). An Introduction to Probability Theory and Its Appli-

cations, Volume I. New York: Wiley.

[20] Feller, W. (1971). An Introduction to Probability Theory and Its Appli-

cations, Volume II. New York: Wiley.

[21] Ferentinos, K. K. (1988). On shortest conﬁdence intervals and their

relation uniformly minimum variance unbiased estimators. Statistical

Papers 29, 59-75.

[22] Fisher, R. A. (1922), On the mathematical foundations of theoretical

statistics. Reprinted in Contributions to Mathematical Statistics (by R.

A. Fisher) (1950), J. Wiley & Sons, New York.

[23] Freund, J. E. and Walpole, R. E. (1987). Mathematical Statistics. En-

glewood Cli↵ s: Prantice-Hall.

[24] Galton, F. (1879). The geometric mean in vital and social statistics.

Proc. Roy. Soc., 29, 365-367.

Probability and Mathematical Statistics 673

[25] Galton, F. (1886). Family likeness in stature. With an appendix by

J.D.H. Dickson. Proc. Roy. Soc., 40, 42-73.

[26] Ghahramani, S. (2000). Fundamentals of Probability. Upper Saddle

River, New Jersey: Prentice Hall.

[27] Graybill, F. A. (1961). An Introduction to Linear Statistical Models, Vol.

1. New YorK: McGraw-Hill.

[28] Guenther, W. (1969). Shortest conﬁdence intervals. The American

Statistician 23, 51-53.

[29] Guldberg, A. (1934). On discontinuous frequency functions of two vari-

ables. Skand. Aktuar. , 17, 89-117.

[30] Gumbel, E. J. (1960). Bivariate exponetial distributions. J. Amer.

Statist. Ass., 55, 698-707.

[31] Hamedani, G. G. (1992). Bivariate and multivariate normal characteri-

zations: a brief survey. Comm. Statist. Theory Methods , 21, 2665-2688.

[32] Hamming, R. W. (1991). The Art of Probability for Scientists and En-

gineers New York: Addison-Wesley.

[33] Hogg, R. V. and Craig, A. T. (1978). Introduction to Mathematical

Statistics. New York: Macmillan.

[34] Hogg, R. V. and Tanis, E. A. (1993). Probability and Statistical Inference.

New York: Macmillan.

[35] Holgate, P. (1964). Estimation for the bivariate Poisson distribution.

Biometrika, 51, 241-245.

[36] Kapteyn, J. C. (1903). Skew Frequency Curves in Biology and Statistics.

Astronomical Laboratory, Noordho↵ , Groningen.

[37] Kibble, W. F. (1941). A two-variate gamma type distribution. Sankhya,

5, 137-150.

[38] Kolmogorov, A. N. (1933). Grundbegri↵ e der Wahrscheinlichkeitsrech-

nung. Erg. Math., Vol 2, Berlin: Springer-Verlag.

[39] Kolmogorov, A. N. (1956). Foundations of the Theory of Probability.

New York: Chelsea Publishing Company.

References 674

[40] Kotlarski, I. I. (1960). On random variables whose quotient follows the

Cauchy law. Colloquium Mathematicum . 7, 277-284.

[41] Isserlis, L. (1914). The application of solid hypergeometrical series to

frequency distributions in space. Phil. Mag. , 28, 379-403.

[42] Laha, G. (1959). On a class of distribution functions where the quotient

follows the Cauchy law. Trans. Amer. Math. Soc. 93, 205-215.

[43] Levene, H. (1960). In Contributions to Probability and Statistics: Essays

in Honor of Harold Hotelling. I. Olkin et. al. eds., Stanford University

Press, 278-292.

[44] Lundberg, O. (1934). On Random Processes and their Applications to

Sickness and Accident Statistics. Uppsala: Almqvist and Wiksell.

[45] Mardia, K. V. (1970). Families of Bivariate Distributions. London:

Charles Griﬃ n & Co Ltd.

[46] Marshall, A. W. and Olkin, I. (1967). A multivariate exponential distri-

bution. J. Amer. Statist. Ass. , 62. 30-44.

[47] McAlister, D. (1879). The law of the geometric mean. Proc. Roy. Soc.,

29, 367-375.

[48] McKay, A. T. (1934). Sampling from batches. J. Roy. Statist. Soc.,

Suppliment, 1, 207-216.

[49] Meyer, P. L. (1970). Introductory Probability and Statistical Applica-

tions. Reading: Addison-Wesley.

[50] Mood, A., Graybill, G. and Boes, D. (1974). Introduction to the Theory

of Statistics (3rd Ed.). New York: McGraw-Hill.

[51] Moran, P. A. P. (1967). Testing for correlation between non-negative

variates. Biometrika , 54, 385-394.

[52] Morgenstern, D. (1956). Einfache Beispiele zweidimensionaler Verteilun-

gen. Mitt. Math. Statist. , 8, 234-235.

[53] Papoulis, A. (1990). Probability and Statistics. Englewood Cli↵s:

Prantice-Hall.

[54] Pearson, K. (1924). On the moments of the hypergeometrical series.

Biometrika, 16, 157-160.

Probability and Mathematical Statistics 675

[55] Pestman, W. R. (1998). Mathematical Statistics: An Introduction New

York: Walter de Gruyter.

[56] Pitman, J. (1993). Probability. New York: Springer-Verlag.

[57] Plackett, R. L. (1965). A class of bivariate distributions. J. Amer.

Statist. Ass., 60, 516-522.

[58] Rice, S. O. (1944). Mathematical analysis of random noise. Bell. Syst.

Tech. J., 23, 282-332.

[59] Rice, S. O. (1945). Mathematical analysis of random noise. Bell. Syst.

Tech. J., 24, 46-156.

[60] Rinaman, W. C. (1993). Foundations of Probability and Statistics. New

York: Saunders College Publishing.

[61] Rosenthal, J. S. (2000). A First Look at Rigorous Probability Theory.

Singapore: World Scientiﬁc.

[62] Ross, S. (1988). A First Course in Probability. New York: Macmillan.

[63] Ross, S. M. (2000). Introduction to Probability and Statistics for Engi-

neers and Scientists. San Diego: Harcourt Academic Press.

[64] Roussas, G. (2003). An Introduction to Probability and Statistical Infer-

ence. San Diego: Academic Press.

[65] Sahai, H. and Ageel, M. I. (2000). The Analysis of Variance. Boston:

Birkhauser.

[66] Seshadri, V. and Patil, G. P. (1964). A characterization of a bivariate

distribution by the marginal and the conditional distributions of the same

component. Ann. Inst. Statist. Math. , 15, 215-221.

[67] H. Sche↵ ´e (1959). The Analysis of Variance. New York: Wiley.

[68] Smoluchowski, M. (1906). Zur kinetischen Theorie der Brownschen

Molekularbewe-gung und der Suspensionen, Ann. Phys. 21, 756780.

[69] Snedecor, G. W. and Cochran, W. G. (1983). Statistical Methods. 6th

eds. Iowa State University Press, Ames, Iowa.

[70] Stigler, S. M. (1984). Kruskal's proof of the joint distribution of Xand

s2 . The American Statistician, 38, 134-135.

References 676

[71] Sveshnikov, A. A. (1978). Problems in Probability Theory, Mathematical

Statistics and Theory of Random Functions. New York: Dover.

[72] Tardi↵ , R. M. (1981). L'Hospital rule and the central limit theorem.

American Statistician, 35, 43-44.

[73] Taylor, L. D. (1974). Probability and Mathematical Statistics. New York:

Harper & Row.

[74] Tweedie, M. C. K. (1945). Inverse statistical variates. Nature, 155, 453.

[75] Waissi, G. R. (1993). A unifying probability density function. Appl.

Math. Lett. 6, 25-26.

[76] Waissi, G. R. (1994). An improved unifying density function. Appl.

Math. Lett. 7, 71-73.

[77] Waissi, G. R. (1998). Transformation of the unifying density to the

normal distribution. Appl. Math. Lett. 11, 45-28.

[78] Wicksell, S. D. (1933). On correlation functions of Type III. Biometrika,

25, 121-133.

[79] Zehna, P. W. (1966). Invariance of maximum likelihood estimators. An-

nals of Mathematical Statistics , 37, 744.

ANSWERS TO SELECTED REVIEW EXERCISES

CHAPTER 1

1. 7/1912.

2. 244.

3. 7488.

4. (a) 4/24, (b) 6/24 and (c) 4/24.

5. 0.95.

6. 4

7. 2

8. 7560.

10. 43.

11. 2.

12. 0.3238.

13. S has countable number of elements.

14. S has uncountable number of elements.

15. 25/648.

16. $(n-1)(n-2)\left(\frac{1}{2}\right)^{n+1}$.

17. $(5!)^2$.

18. 7/10.

19. 1

20. $\frac{n+1}{3n-1}$.

21. 6/11.

22. 1


CHAPTER 2

1. 1

2. $\frac{(6!)^2}{(21)^6}$.

3. 0.941.

4. 4

5. 6/11.

6. 255/256.

7. 0.2929.

8. 10/17.

9. 30/31.

10. 7/24.

11. ( 4

10 )( 3

10 ) ( 3

6) + ( 6

10 ) ( 2

5).

12. $\frac{(0.01)(0.9)}{(0.01)(0.9) + (0.99)(0.1)}$.

13. 1

14. 2

15. (a)  2

5 4

52 +  3

5 4

16  and (b) ( 3

5)( 4

16 )

5) ( 4

52 ) + ( 3

5) ( 4

16 ).

16. 1

17. 3

18. 5.

19. 5/42.

20. 1


CHAPTER 3

1. 1

2. $\frac{k+1}{2k+1}$.

3. $\frac{1}{\sqrt{2}}$.

4. Mode of $X$ = 0 and median of $X$ = 0.

5. $\theta \ln\left(\frac{10}{9}\right)$.

6. 2 ln 2.

7. 0.25.

8. $f(2) = 0.5$, $f(3) = 0.2$, $f(\pi) = 0.3$.

9. $f(x) = \frac{1}{6} x^3 e^{-x}$.

10. 3

11. a = 500, mode = 0. 2, and P (X 0. 2) = 0 .6766.

12. 0.5.

13. 0.5.

14. 1F (y ).

15. 1

16. $R_X = \{3, 4, 5, 6, 7, 8, 9\}$;
$f(3) = f(4) = \frac{2}{20}$, $f(5) = f(6) = f(7) = \frac{4}{20}$, $f(8) = f(9) = \frac{2}{20}$.

17. $R_X = \{2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12\}$;
$f(2) = \frac{1}{36}$, $f(3) = \frac{2}{36}$, $f(4) = \frac{3}{36}$, $f(5) = \frac{4}{36}$, $f(6) = \frac{5}{36}$, $f(7) = \frac{6}{36}$, $f(8) = \frac{5}{36}$, $f(9) = \frac{4}{36}$, $f(10) = \frac{3}{36}$, $f(11) = \frac{2}{36}$, $f(12) = \frac{1}{36}$.

18. $R_X = \{0, 1, 2, 3, 4, 5\}$;
$f(0) = \frac{59049}{10^5}$, $f(1) = \frac{32805}{10^5}$, $f(2) = \frac{7290}{10^5}$, $f(3) = \frac{810}{10^5}$, $f(4) = \frac{45}{10^5}$, $f(5) = \frac{1}{10^5}$.

19. $R_X = \{1, 2, 3, 4, 5, 6, 7\}$;
$f(1) = 0.4$, $f(2) = 0.2666$, $f(3) = 0.1666$, $f(4) = 0.0952$, $f(5) = 0.0476$, $f(6) = 0.0190$, $f(7) = 0.0048$.

20. c = 1 and P (X = even ) = 1

21. c= 1

2,P(1  X2) = 3

22. c= 3

2and P X 1

2= 3

16 .


CHAPTER 4

1. 0.995.

2. (a) 1/33, (b) 12/33, (c) 65/33.

3. (c) 0.25, (d) 0.75, (e) 0.75, (f) 0.

4. (a) 3.75, (b) 2.6875, (c) 10.5, (d) 10.75, (e) 71.5.

5. (a) 0.5, (b) $\pi$, (c) $\frac{3}{10}\pi$.

6. 17

p✓ .

7. 4

1

E(x2 ) .

8. 8

9. 280.

10. 9

20 .

11. 5.25.

12. a= 4h3

p⇡ ,E( X) = 2

hp ⇡,V ar( X ) = 1

h2  3

2 4

⇡.

13. E (X ) = 7

4,E( Y) = 7

14.  2

38 .

15. 38.

16. M (t ) = 1 + 2t + 6 t2 +··· .

17. 1

18. n  n1

i=0 (k+ i).

19. 1

43e 2t+e 3t .

20. 120.

21. E (X ) = 3, V ar (X ) = 2.

22. 11.

23. c= E (X).

24. F (c ) = 0 .5.

25. E (X ) = 0, V ar (X ) = 2.

26. 1

625 .

27. 38.

28. a = 5 and b =  34 or a =  5 and b = 36.

29. 0.25.

30. 10.

31.  1

1ppln p.


CHAPTER 5

1. 5/16.

2. 5/16.

3. 25/72.

4. 4375/279936.

5. 3

6. 11/16.

7. 0.008304.

8. 3

9. 1

10. 0.671.

11. 1/16.

12. 0.0000399994.

13. $\frac{n^2 - 3n + 2}{2^{n+1}}$.

14. 0.2668.

15. ( 6

3k ) ( 4

(10

3),0k  3.

16. 0.4019.

17. 1 1

e2 .

18.  x1

2 1

6 3  5

6 x3 .

19. 5

16 .

20. 0.22345.

21. 1.43.

22. 24.

23. 26.25.

24. 2.

25. 0.3005.

26. 4

e4 1 .

27. 0.9130.

28. 0.1239.


CHAPTER 6

1. f (x ) = ex 0<x< 1.

2. Y⇠ U N IF (0 , 1).

3. f (w ) = 1

wp 2⇡2 e  1

2( ln wµ

) 2 .

4. 0.2313.

5. 3 ln 4.

6. 20.1.

7. 3

8. 2.0.

9. 53.04.

10. 44.5314.

11. 75.

12. 0.4649.

13. n!

✓n .

14. 0.8664.

15. e 1

2k 2 .

16. 1

17. 64.3441.

18. g (y ) =  4

y3 if 0 < y < p 2

0 otherwise.

19. 0.5.

20. 0.7745.

21. 0.4.

22. 2

3✓y  1

3e  y

✓.

23. 4

p2⇡ y e  y 4

24. ln(X )⇠  \(µ, 2 ).

25. eµ2 .

26. eµ .

27. 0.3669.

29. Y⇠ GBE T A(↵ , , a, b).

32. (i) 1

2p⇡, (ii) 1

2, (iii) 1

4p⇡, (iv) 1

33. (i) 1

180 , (ii) (100) 13 5! 7!

13! , (iii) 1

360 .

35.  1 ↵

 2 .

36. E (Xn ) = ✓n (n+↵)

( ↵) .


CHAPTER 7

1. $f_1(x) = \frac{2x+3}{21}$, and $f_2(y) = \frac{3y+6}{21}$.

2. f (x, y ) =  1

36 if 1 < x < y = 2 x < 12

36 if 1 < x < y < 2 x < 12

0 otherwise.

3. 1

18 .

4. 1

2e4 .

5. 1

6. f1 (x) =  2(1  x) if 0 <x<1

0 otherwise.

7. (e2  1)(e 1)

e5 .

8. 0.2922.

9. 5

10. f1 (x) =  5

48 x(8  x 3 ) if 0 < x < 2

0 otherwise.

11. f2 ( y ) =  2 y if 0 < y < 1

0 otherwise.

12. f (y/x ) =  1

1+p1(x 1)2 if (x 1) 2 + (y 1) 2 1

0 otherwise.

13. 6

14. $f(y/x) = \begin{cases} \frac{1}{2x} & \text{if } 0 < y < 2x < 1 \\ 0 & \text{otherwise.} \end{cases}$

15. 4

16. $g(w) = 2e^{-w} - 2e^{-2w}$.

17. g (w ) =  1w 3

✓3  6w2

✓3 .

18. $\frac{11}{36}$.

19. $\frac{7}{12}$.

20. 5

21. No.

22. Yes.

23. $\frac{7}{32}$.

24. 1

25. 1

26. x ex .


CHAPTER 8

1. 13.

2. $Cov(X, Y) = 0$. Since $0 = f(0, 0) \neq f_1(0) f_2(0) = \frac{1}{4}$, $X$ and $Y$ are not independent.

3. $\frac{1}{\sqrt{8}}$.

4. $\frac{1}{(1 - 4t)(1 - 6t)}$.

5. $X + Y \sim BIN(n + m, p)$.

6. $\frac{1}{2}\left(X^2 + Y^2\right) \sim EXP(1)$.

7. M (s, t ) = e s 1

s+ e t 1

9.  15

16 .

10. Cov ( X, Y ) = 0. No.

11. a= 6

8and b= 9

12. Cov = 45

112 .

13. Corr( X, Y ) =  1

14. 136.

15. 1

2p1 + ⇢.

16. (1 p + pet )(1 p + pet ).

17.  2

n[1 + (n 1)⇢].

18. 2.

19. 4

20. 1.
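Answer 2 of this chapter rests on the fact that zero covariance does not imply independence. The Python sketch below checks the same three numbers on a hypothetical joint pmf. The pmf (four equally likely points on the coordinate axes) is an assumption chosen only because it reproduces $f(0, 0) = 0$ and $f_1(0) f_2(0) = \frac{1}{4}$; it is not taken from the problem statement, and all names in the code are illustrative.

```python
from fractions import Fraction

# Hypothetical joint pmf (an assumption, not the book's problem):
# four equally likely points on the coordinate axes.
PMF = {
    (-1, 0): Fraction(1, 4),
    (1, 0): Fraction(1, 4),
    (0, -1): Fraction(1, 4),
    (0, 1): Fraction(1, 4),
}


def expectation(g):
    """Return E[g(X, Y)] under PMF."""
    return sum(p * g(x, y) for (x, y), p in PMF.items())


def marginal_x(x0):
    """Return f1(x0) = P(X = x0)."""
    return sum(p for (x, _), p in PMF.items() if x == x0)


def marginal_y(y0):
    """Return f2(y0) = P(Y = y0)."""
    return sum(p for (_, y), p in PMF.items() if y == y0)


# Covariance: E[XY] - E[X] E[Y].
cov = (expectation(lambda x, y: x * y)
       - expectation(lambda x, y: x) * expectation(lambda x, y: y))

# Independence would require f(x, y) = f1(x) f2(y) at every point;
# one failing point, here the origin, is enough to rule it out.
joint_at_origin = PMF.get((0, 0), Fraction(0))
product_at_origin = marginal_x(0) * marginal_y(0)

print("Cov(X, Y)     =", cov)
print("f(0, 0)       =", joint_at_origin)
print("f1(0) * f2(0) =", product_at_origin)
```

Exact rational arithmetic (`fractions.Fraction`) is used so that the comparison between the joint pmf and the product of the marginals is not affected by floating-point rounding.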


CHAPTER 9

1. 6.

2. 1

2(1 + x 2 ).

3. 1

2y 2 .

4. 1

2+x.

5. 2x.

6. µX = 22

3and µ Y = 112

7. 1

2+3y 28y 3

1+2y 8y2 .

8. 3

2x.

9. 1

2y.

10. 4

3x.

11. 203.

12. 15  1

⇡.

13. 1

12 (1 x) 2 .

14. 1

12 1x 2  2 .

15. f2 ( y ) = 





96y

7for 0 y1

3(2y )2

7if 1 y  2

and V ar  X | Y = 3

2= 1

72 .

16. $\frac{1}{12}$.

17. 180.

19. $\frac{x}{6} + \frac{5}{12}$.

20. $\frac{x}{2} + 1$.


CHAPTER 10

1. g (y ) =  1

2+ 1

4p y for 0 y  1

0 otherwise.

2. g (y ) = 





mp m for 0 y  4m

0 otherwise.

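3. $g(y) = \begin{cases} 2y & \text{for } 0 \le y \le 1 \\ 0 & \text{otherwise.} \end{cases}$

4. $g(z) = \begin{cases} \frac{1}{16}(z + 4) & \text{for } -4 \le z \le 0 \\ \frac{1}{16}(4 - z) & \text{for } 0 \le z \le 4 \\ 0 & \text{otherwise.} \end{cases}$

5. $g(z, x) = \begin{cases} \frac{1}{2} e^{-x} & \text{for } 0 < x < z < 2 + x < \infty \\ 0 & \text{otherwise.} \end{cases}$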

6. g (y ) =  4

y3 for 0 < y < p 2

0 otherwise.

7. g (z ) = 









15000  z 2

250 + z

25 for 0 z10

15  2z

25  z 2

250  z 3

15000 for 10 z20

0 otherwise.

8. g (u ) = 





4a2

u3 ln  ua

a+ 2 a(u 2a)

u2 (ua) for 2a u < 1

0 otherwise.

9. h( y ) = 3z2  2z+1

216 , z = 1, 2,3,4,5,6.

10. g (z ) = 





4h3

mp ⇡ 2 z

me  2h2z

mfor 0 z < 1

0 otherwise.

11. g (u, v ) =   3u

350 + 9v

350 for 10  3u +v 20, u  0, v  0

0 otherwise.

12. $g_1(u) = \begin{cases} \frac{2u}{(1 + u)^3} & \text{if } 0 \le u < \infty \\ 0 & \text{otherwise.} \end{cases}$


13. g (u, v ) = 





5[ 9v3 5u2 v +3uv2 +u3 ]

32768 for 0 < 2v + 2u < 3v u < 16

0 otherwise.

14. g (u, v ) =  u+v

32 for 0 < u + v < 2p5 v 3 u < 8

0 otherwise.

15. g1 (u) = 









2 + 4u + 2u2 if  1u 0

2p 1 4u if 0 u  1

0 otherwise.

16. g1 (u) = 









3uif 0 u1

3u  5 if 1 u < 1

0 otherwise.

17. g1 (u) = 





4u 1

34u if 0 u1

0 otherwise.

18. $g_1(u) = \begin{cases} 2u^{-3} & \text{if } 1 \le u < \infty \\ 0 & \text{otherwise.} \end{cases}$

19. f (w ) =











6if 0 w2

6if 2 w3

5w

6if 3 w5

0 otherwise.

20. $BIN(2n, p)$

21. $GAM(\theta, 2)$

22. $CAU(0)$

23. $N(2\mu, 2\sigma^2)$

24. f1 (↵ ) = 





4(2 |↵|) if |↵ |  2

0 otherwise,

f2 ( ) =   1

2ln(||) if | | 1

0 otherwise.


CHAPTER 11

2. $\frac{7}{10}$.

3. $\frac{960}{75}$.

6. 0.7627.


CHAPTER 12

6. 0.16.


CHAPTER 13

3. 0.115.

4. 1.0.

5. $\frac{7}{16}$.

6. 0.352.

7. 6

8. 100.64.

9. 1+ln(2)

10. $[1 - F(x_6)]^5$.

11. ✓+ 1

12. 2 ew [1  ew ].

13. 6w 2

✓3 1 w 3

✓3 .

14. N (0,1).

15. 25.

16. $X$ has a degenerate distribution with MGF $M(t) = e^{\frac{1}{2}t}$.

17. $POI(1995)$.

18.  1

2 n (n + 1).

19. 8 8

119 35.

20. f (x ) = 60

✓1e  x

✓ 3 e  3x

✓for 0 < x < 1.

21. $X_{(n+1)} \sim Beta(n + 1, n + 1)$.


CHAPTER 14

1. N (0,32).

2. $\chi^2(3)$; the MGF of $X_1^2 - X_2^2$ is $M(t) = \frac{1}{\sqrt{1 - 4t^2}}$.

3. t(3).

4. f (x1 , x2, x3 ) = 1

✓3 e  (x1 +x 2+x 3)

✓.

5.  2

6. t(2).

7. $M(t) = \frac{1}{\sqrt{(1 - 2t)(1 - 4t)(1 - 6t)(1 - 8t)}}$.

8. 0.625.

9.  4

n2 2(n 1).

10. 0.

11. 27.

12. $\chi^2(2n)$.

13. $t(n + p)$.

14. $\chi^2(n)$.

15. (1,2).

16. 0.84.

17. 2 2

n2 .

18. 11.07.

19. $\chi^2(2n - 2)$.

20. 2.25.

21. 6.37.


CHAPTER 15

1. 



3



i=1

2. 1

X1 .

3. 2

4.  n



i=1

ln Xi

5. n



i=1

ln Xi

1.

6. 2

7. 4.2

8. $\frac{19}{26}$.

9. 15

10. 2.

11. $\hat{\alpha} = 3.534$ and $\hat{\beta} = 3.409$.

12. 1.

13. $\frac{1}{3}\max\{x_1, x_2, \ldots, x_n\}$.

14.  1 1

max{x1,x2,...,xn } .

15. 0.6207.

18. 0.75.

19. 1 + 5

ln(2) .

20. ¯

1+ ¯

21. ¯

22. 8.

23. n



i=1|X i µ|


24. 1

25. p ¯

26. ˆ

=n¯

(n 1)S2 and ˆ ↵=n¯

(n 1)S2 .

27. 10 n

p(1p ) .

28. 2n

✓2 .

29.  n

2 0

24 .

30.  n

µ3 0

22 .

31.  ↵=X



,

=1

X 1

n n

i=1 X 2

iX .

32. $\hat{\theta}$ is obtained by solving numerically the equation $\sum_{i=1}^{n} \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2} = 0$.

33. $\hat{\theta}$ is the median of the sample.

34. n

.

35. n

(1p ) p2 .

36. 

✓= 3 X.

37. 

✓=50

30 X.
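Answer 32 of this chapter says only that the estimate is found by solving $\sum_{i=1}^{n} \frac{2(x_i - \theta)}{1 + (x_i - \theta)^2} = 0$ numerically. The Python sketch below shows one possible way to do that, using plain bisection. The choice of bisection, the function names, and the sample values are all assumptions made for illustration; the book does not prescribe a method. The left-hand side is nonnegative at $\theta = \min x_i$ and nonpositive at $\theta = \max x_i$, so a root is bracketed between them. The equation can have several roots, and bisection returns one of them, which need not be the global maximizer of the likelihood.

```python
def score(theta, xs):
    """Left-hand side of the equation in answer 32, evaluated at theta."""
    return sum(2.0 * (x - theta) / (1.0 + (x - theta) ** 2) for x in xs)


def solve_score_equation(xs, tol=1e-12, max_iter=200):
    """Find one root of score(theta, xs) = 0 by bisection.

    The bracket [min(xs), max(xs)] works because every term of the sum
    is >= 0 at theta = min(xs) and <= 0 at theta = max(xs).
    """
    lo, hi = min(xs), max(xs)
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        if score(mid, xs) > 0:
            lo = mid  # root lies to the right of mid
        else:
            hi = mid  # root lies at or to the left of mid
        if hi - lo < tol:
            break
    return 0.5 * (lo + hi)


# Hypothetical sample, made up for illustration only.
sample = [1.2, -0.4, 3.1, 0.7, 2.0]
theta_hat = solve_score_equation(sample)

print("root of the equation:", theta_hat)
print("score at the root:   ", score(theta_hat, sample))
```

When several roots exist, a common safeguard is to compare the log-likelihood at each candidate root, or to start the search near the sample median.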


CHAPTER 16

1. b= 2

2cov(T 1 ,T 2 )

2

1+ 2

22cov(T 1 ,T 2 .

2. $\hat{\theta} = |X|$, $E(|X|) = \theta$, unbiased.

4. n = 20.

5. k= 1

6. $a = \frac{25}{61}$, $b = \frac{36}{61}$, $c = 12.47$.



i=1



i=1

i, no.

9. $k = \frac{4}{\pi}$.

10. k = 2.

11. k = 2.

13. ln



i=1

(1 + Xi ).

14.



i=1

15. $X_{(1)}$, and sufficient.

16. X(1) is biased and X 1 is unbiased. X(1) is more efficient than X 1.

17.



i=1

ln Xi .

18.



i=1

Xi .

19.



i=1

ln Xi .

22. Yes.

23. Yes.


24. Yes.

25. Yes.

26. Yes.


CHAPTER 17

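7. The pdf of $Q$ is $g(q) = \begin{cases} n e^{-nq} & \text{if } 0 < q < \infty \\ 0 & \text{otherwise.} \end{cases}$

The confidence interval is $\left[ X_{(1)} - \frac{1}{n}\ln\left(\frac{2}{\alpha}\right),\; X_{(1)} - \frac{1}{n}\ln\left(\frac{2}{2 - \alpha}\right) \right]$.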

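8. The pdf of $Q$ is $g(q) = \begin{cases} \frac{1}{2} e^{-\frac{1}{2} q} & \text{if } 0 < q < \infty \\ 0 & \text{otherwise.} \end{cases}$

The confidence interval is $\left[ X_{(1)} - \frac{1}{n}\ln\left(\frac{2}{\alpha}\right),\; X_{(1)} - \frac{1}{n}\ln\left(\frac{2}{2 - \alpha}\right) \right]$.

9. The pdf of $Q$ is $g(q) = \begin{cases} n q^{n-1} & \text{if } 0 < q < 1 \\ 0 & \text{otherwise.} \end{cases}$

The confidence interval is $\left[ X_{(1)} - \frac{1}{n}\ln\left(\frac{2}{\alpha}\right),\; X_{(1)} - \frac{1}{n}\ln\left(\frac{2}{2 - \alpha}\right) \right]$.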

10. The pdf g (q ) of Q is given by g (q ) =  n q n1 if 0 q  1

0 otherwise.

The conﬁdence interval is  2

↵ 1

nX (n), 2

2↵ 1

nX (n) .

11. The pdf of $Q$ is given by $g(q) = \begin{cases} n(n - 1)\, q^{n-2} (1 - q) & \text{if } 0 \le q \le 1 \\ 0 & \text{otherwise.} \end{cases}$

12.  X(1)  z ↵

pn , X (1) +z ↵

pn  .

13.  

✓z↵

2

✓+1

pn , 

✓+z↵

2

✓+1

pn ,where 

✓= 1 + n

n

i=1 ln x i .

14.  2

Xz↵

2 2

n X2 , 2

X+z↵

2 2

n X2 .

15.  X4  z ↵

X4

pn , X 4 + z ↵

X4

pn .

16.  X(n)  z ↵

X(n)

(n +1) p n +2 , X (n)+z ↵

X(n)

(n +1) p n +2 .

17.  1

4Xz↵

8p n, 1

4X+z↵

8p n.


CHAPTER 18

1. $\alpha = 0.03125$ and $\beta = 0.763$.

2. Do not reject $H_o$.

3. ↵ = 0. 0511 and  ( ) = 1 



x=0

(8)x e8

x! ,6= 0 .5.

4. $\alpha = 0.08$ and $\beta = 0.46$.

5. $\alpha = 0.19$.

6. $\alpha = 0.0109$.

7. $\alpha = 0.0668$ and $\beta = 0.0062$.

8. C= {(x1 , x2 ) | x2  3.9395 }.

9. C= {(x1 , ..., x10 ) |x 0.3}.

10. C= {x2 [0, 1] |x 0.829}.

11. C= {(x1 , x2 ) | x1 +x2  5 }.

12. C= {(x1 , ..., x8 ) |x x ln x a}.

13. C= {(x1 , ..., xn ) | 35 ln x x a}.

14. C=  (x1 , ..., x5 )| x

2x 2  5x5 x 5 a .

15. C= {(x1 , x2, x3 ) | |x 3 | 1.96}.

16. C=  (x1 , x2 , x3 )| x e 1

3x a .

17. C=  (x1 , x2 , ..., xn )| e

10 x  3xa .

18. 1

19. C=  (x1 , x2, x3 )| x(3)  3

p117 .

20. C= {(x1 , x2, x3 ) |x 12.04}.

21. $\alpha = \frac{1}{16}$ and $\beta = \frac{255}{256}$.

22. $\alpha = 0.05$.


CHAPTER 21

9.  2

i=1(n i 10) 2 63.43.

10. 25.
