Introduction to Statistical Inference


The following paragraphs provide a quick review of ideas central to statistical inference. For a more detailed treatment, see Chapter 6 of [DeGroot, 1986].

Parameters and Parameter Spaces

Statistical inference concerns the problem of inferring properties of an unknown distribution from data generated by that distribution. The most common type of inference involves approximating the unknown distribution by choosing a distribution from a restricted family of distributions. Generally the restricted family of distributions is specified parametrically.

For example, N(mu,sigma^2) represents the family of normal distributions where mu corresponds to the mean of a normal distribution and ranges over the real numbers (R) and sigma^2 corresponds to the variance of a normal distribution and ranges over the positive real numbers (R^+). mu and sigma^2 are called parameters, with R and R^+ their corresponding parameter spaces. The two-dimensional space R cross R^+ is the parameter space for the family of normal distributions. By fixing mu = 0 we can restrict attention to the subfamily of zero-mean normal distributions. In the following, we use theta to denote a real-valued parameter (or a vector of real-valued parameters theta = (theta_1,...,theta_k)) with parameter space Omega_theta (in the vector case, Omega_theta = Product_{i=1}^k Omega_{theta_i}).
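To make the notation concrete, here is a minimal Python sketch (our own illustration, not from [DeGroot, 1986]) that treats the normal family as a density indexed by a point (mu, sigma^2) in the parameter space R cross R^+ and obtains the zero-mean subfamily by fixing mu = 0.
    import math

    def normal_pdf(x, mu, sigma2):
        # Density of N(mu, sigma^2); the parameter (mu, sigma2) ranges over R cross R^+.
        return math.exp(-(x - mu) ** 2 / (2.0 * sigma2)) / math.sqrt(2.0 * math.pi * sigma2)

    def zero_mean_normal_pdf(x, sigma2):
        # Restricted subfamily obtained by fixing mu = 0; its parameter space is R^+.
        return normal_pdf(x, 0.0, sigma2)

    print(normal_pdf(1.0, 0.0, 2.0), zero_mean_normal_pdf(1.0, 2.0))  # equal by construction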

Priors, Posteriors, and Likelihoods

Let X = X_1,...,X_n be a sequence of independent observations generated by the distribution (p.f. or p.d.f.) f(X|theta). One common problem in statistical inference involves estimating theta from the observations. In some cases (the hard-line Bayesian would say in all cases), we have prior information about theta which can be summarized in a prior distribution xi(theta) representing the relative likelihood of values of theta prior to observing any data. The posterior distribution xi(theta|x) represents the relative likelihood after observing x = x_1,...,x_n (i.e., X_1=x_1,...,X_n=x_n). The relation between the prior and posterior distributions is given by Bayes rule
                                    f_n(x|theta) xi(theta)
     xi(theta|x) = -------------------------------------------------------
                    Integral_{Omega_theta} f_n(x|theta) xi(theta) d theta
where f_n is the probability function for sequences of length n.
        f_n(x|theta) = f(x_1|theta) f(x_2|theta) ... f(x_n|theta)
The function f_n is often called the likelihood function. Since the denominator in Bayes rule is independent of theta, we have the following proportionality.
             xi(theta|x) ~ f_n(x|theta) xi(theta)
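As a concrete illustration of this proportionality (a hypothetical worked example, not taken from [DeGroot, 1986]), the following Python sketch computes a discretized posterior for the success probability theta of Bernoulli observations: evaluate the likelihood f_n(x|theta) and the prior xi(theta) on a grid over Omega_theta, multiply, and normalize.
    import numpy as np

    x = np.array([1, 0, 1, 1, 0, 1, 1, 1])       # hypothetical Bernoulli observations
    theta_grid = np.linspace(0.001, 0.999, 999)  # discretization of Omega_theta = (0, 1)
    prior = np.full_like(theta_grid, 1.0 / theta_grid.size)  # xi(theta): uniform prior

    # Likelihood f_n(x|theta) = f(x_1|theta) ... f(x_n|theta) at each grid value of theta.
    likelihood = np.prod(
        [theta_grid**x_i * (1 - theta_grid)**(1 - x_i) for x_i in x], axis=0)

    # Bayes rule: the posterior is proportional to likelihood times prior;
    # dividing by the sum plays the role of the denominator in Bayes rule.
    posterior = likelihood * prior
    posterior /= posterior.sum()
    print(theta_grid[np.argmax(posterior)])      # mode of the discretized posterior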
If the observations are obtained one at a time, we can update the posterior distribution as follows.
        xi(theta|x_1) ~ f(x_1|theta) xi(theta)
        xi(theta|x_1,x_2) ~ f(x_2|theta) xi(theta|x_1)
                        ...
        xi(theta|x_1,...,x_n) ~ f(x_n|theta) xi(theta|x_1,...,x_{n-1})
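Continuing the same hypothetical Bernoulli example, the sketch below checks that folding in one observation at a time produces the same posterior as the batch computation.
    import numpy as np

    x = np.array([1, 0, 1, 1, 0, 1, 1, 1])       # hypothetical Bernoulli observations
    theta_grid = np.linspace(0.001, 0.999, 999)
    prior = np.full_like(theta_grid, 1.0 / theta_grid.size)

    # Batch posterior: xi(theta|x) ~ f_n(x|theta) xi(theta).
    batch = prior * np.prod(
        [theta_grid**x_i * (1 - theta_grid)**(1 - x_i) for x_i in x], axis=0)
    batch /= batch.sum()

    # Sequential posterior: xi(theta|x_1,...,x_i) ~ f(x_i|theta) xi(theta|x_1,...,x_{i-1}).
    seq = prior.copy()
    for x_i in x:
        seq *= theta_grid**x_i * (1 - theta_grid)**(1 - x_i)
        seq /= seq.sum()

    print(np.allclose(seq, batch))               # True: both routes give the same posterior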

Conjugate Prior Distributions

It is convenient if, starting from a prior xi(theta) in a particular family and given an observation x, we end up with a posterior distribution xi(theta|x) in the same family. This is the defining property of a family of conjugate prior distributions. Here are the families of conjugate prior distributions for some sample-generating distributions; a small worked example follows the table.
        probability distribution        corresponding family of 
        generating the sample           conjugate prior distributions
        ------------------------        -----------------------------
        Poisson                         gamma
        normal                          normal
        exponential                     gamma
        binomial                        beta
        multinomial                     Dirichlet 
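For instance (our own worked example), a beta prior combined with binomial data yields a beta posterior: if theta has a Beta(a,b) prior and we observe k successes in n trials, the posterior is Beta(a+k, b+n-k).
    # Conjugacy of the beta prior with binomial data.
    a, b = 2.0, 2.0                      # hypothetical Beta(a, b) prior on theta
    n, k = 10, 7                         # hypothetical data: k successes in n trials
    a_post, b_post = a + k, b + (n - k)  # posterior is Beta(a + k, b + n - k): same family
    print(a_post, b_post, a_post / (a_post + b_post))  # posterior parameters and posterior mean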

Bayes Estimators

Suppose that X_1,...,X_n is a random sample generated from f(x|theta). An estimator delta(X_1,...,X_n) provides an estimate (delta(x_1,...,x_n)) for theta.

If a is an estimate for theta and L(theta,a) is a real-valued loss function, the expected loss of choosing a before observing any data is

        E(L(theta,a)) = Integral_{Omega_theta} L(theta,a) xi(theta) d theta
or, after observing the data x,
        E(L(theta,a)|x) = Integral_{Omega_theta} L(theta,a) xi(theta|x) d theta
The squared-error loss function is perhaps the most common loss function used in practice.
                L(theta,a) = (theta - a)^2
A Bayes estimator delta^* is an estimator such that
        E(L(theta,delta^*(x))|x) = min_a E(L(theta,a)|x)
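Under squared-error loss, the value of a that minimizes the expected posterior loss is the mean of the posterior distribution, so the Bayes estimate can be read directly off a discretized posterior. A sketch using the same hypothetical Bernoulli setup as above:
    import numpy as np

    x = np.array([1, 0, 1, 1, 0, 1, 1, 1])       # hypothetical Bernoulli observations
    theta_grid = np.linspace(0.001, 0.999, 999)
    prior = np.full_like(theta_grid, 1.0 / theta_grid.size)
    posterior = prior * np.prod(
        [theta_grid**x_i * (1 - theta_grid)**(1 - x_i) for x_i in x], axis=0)
    posterior /= posterior.sum()

    def expected_loss(a):
        # E(L(theta, a) | x) for the squared-error loss, over the discretized posterior.
        return np.sum((theta_grid - a) ** 2 * posterior)

    bayes_estimate = np.sum(theta_grid * posterior)  # posterior mean
    grid_minimizer = theta_grid[np.argmin([expected_loss(a) for a in theta_grid])]
    print(bayes_estimate, grid_minimizer)            # essentially the same value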

Maximum Likelihood Estimators

In cases in which the experimenter feels uncomfortable assigning a prior xi(theta), a maximum likelihood estimator is often used instead of a Bayes estimator. The maximum likelihood estimate (MLE) is that value of theta maximizing f_n(x|theta). In some (but not all) cases, the MLE is identical to the estimate generated by a Bayes estimator.
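For Bernoulli observations (continuing the hypothetical example above), maximizing the likelihood over a grid agrees with the closed-form MLE, which for this family is simply the sample proportion of successes.
    import numpy as np

    x = np.array([1, 0, 1, 1, 0, 1, 1, 1])       # hypothetical Bernoulli observations
    theta_grid = np.linspace(0.001, 0.999, 999)

    # Log-likelihood log f_n(x|theta) = sum_i log f(x_i|theta) at each grid value of theta.
    log_lik = np.sum(
        [x_i * np.log(theta_grid) + (1 - x_i) * np.log(1 - theta_grid) for x_i in x], axis=0)

    mle_numeric = theta_grid[np.argmax(log_lik)]
    mle_closed_form = x.mean()                   # sample proportion of successes
    print(mle_numeric, mle_closed_form)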

In many practical cases, we are forced to choose a family of distributions without knowing whether or not the distribution generating the observed samples is in this particular family. A robust estimator is one that produces a good estimate even when the generating distribution is not in the chosen family of distributions (see Chapter 9 of [DeGroot, 1986]).
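As a simple illustration (our own example; the text above does not single out any particular robust estimator), the sketch below contrasts the sample mean and the sample median as estimators of location when a nominally normal sample is contaminated by gross outliers.
    import numpy as np

    rng = np.random.default_rng(0)
    sample = rng.normal(loc=0.0, scale=1.0, size=100)  # mostly N(0, 1) observations
    sample[:5] = 50.0                                  # a few gross outliers

    print(np.mean(sample))    # badly distorted by the outliers
    print(np.median(sample))  # still close to 0: the median is robust to the contamination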

Sufficient Statistics

Any real-valued function T = r(X_1,...,X_n) of the observations generated by the distribution f_n(X) is called a statistic. For real-valued X, the sample mean (average) r(x_1,...,x_n) = 1/n Sum_{i=1}^n x_i is a common statistic. The function r(x) = 3.14 for all x in Omega_X is also a statistic, though admittedly less useful than the sample mean.

Suppose that T = r(X) is a statistic and t is any value of T. Let f_n(x|theta)|_{r(x)=t} denote the conditional joint distribution for x given that r(x)=t. In general, f_n(x|theta)|_{r(x)=t} will depend on theta. If f_n(x|theta)|_{r(x)=t} does not depend on theta, then T is called a sufficient statistic. A sufficient statistic summarizes all of the information in a random sample, so that knowledge of the individual values in the sample is irrelevant in searching for a good estimator for theta. For example, if the generating distribution is a zero-mean normal distribution, then the sample variance is a sufficient statistic for estimating sigma^2. Sufficient statistics are often used where a maximum likelihood or Bayes estimator is unsuitable. The sample mean and sample variance are said to be jointly sufficient statistics for the mean and variance of normal distributions.
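The following sketch (our own illustration) shows the sense in which the data enter only through a sufficient statistic: for zero-mean normal observations, two samples with the same sum of squares (equivalently, the same sample variance) produce identical likelihood functions for sigma^2.
    import numpy as np

    def log_lik_sigma2(x, sigma2_grid):
        # Log-likelihood of zero-mean normal data as a function of sigma^2.
        n = len(x)
        return -0.5 * n * np.log(2 * np.pi * sigma2_grid) - np.sum(np.square(x)) / (2 * sigma2_grid)

    sigma2_grid = np.linspace(0.1, 10.0, 100)
    x1 = np.array([1.0, -2.0, 2.0])   # sum of squares = 9
    x2 = np.array([3.0, 0.0, 0.0])    # a different sample with the same sum of squares

    # Identical likelihood functions: only the sufficient statistic matters.
    print(np.allclose(log_lik_sigma2(x1, sigma2_grid), log_lik_sigma2(x2, sigma2_grid)))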

