[Math Review] Statistics Basic: Estimation

Posted: 2019-02-07 09:26:38


Two Types of Estimation

One of the major applications of statistics is estimating population parameters from sample statistics. There are two types of estimation:

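Point Estimate: a single value, computed from a sample statistic, that serves as the estimate of a population parameter (for example, the sample mean as an estimate of the population mean).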

[Figure: Point estimates of average height with multiple samples (Source: Zhihu)]

Confidence Interval: an interval constructed by a method that captures the population parameter a specified proportion of the time.

[Figure: 95% confidence interval of average height with multiple samples (Source: Zhihu)]
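To see why a single point estimate is not the whole story, here is a minimal simulation sketch. The population (heights with mean 170 cm and standard deviation 10 cm), the sample size, and the function name `sample_means` are assumptions made up for illustration; they do not come from the figures above.

```python
import random
import statistics


def sample_means(population_mean, population_sd, sample_size, num_samples, seed=0):
    """Draw several samples and return the sample mean (point estimate) of each."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    means = []
    for _ in range(num_samples):
        sample = [rng.gauss(population_mean, population_sd) for _ in range(sample_size)]
        means.append(statistics.mean(sample))
    return means


# Hypothetical population: heights with mean 170 cm and standard deviation 10 cm.
estimates = sample_means(170.0, 10.0, sample_size=50, num_samples=5)
for i, m in enumerate(estimates, start=1):
    print(f"sample {i}: point estimate of the mean = {m:.2f} cm")
```

Each sample gives a slightly different point estimate, which is the motivation for reporting an interval as well.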

Confidence Interval for the Mean

Population Variance is known

Suppose that M is the mean of N samples X₁, X₂, ..., X_N, i.e.

$$M = \frac{1}{N}\sum_{i=1}^{N} X_i$$

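According to the Central Limit Theorem, the sampling distribution of the mean M is approximately (exactly, if the population itself is normal)

$$M \sim N\left(\mu, \frac{\sigma^2}{N}\right)$$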

where μ and σ² are the mean and variance of the population, respectively. If repeated samples were taken and the 95% confidence interval computed for each sample, 95% of the intervals would contain the population mean. The starting point is the interval that is symmetric about μ and covers an area of 0.95 under the sampling distribution of M: the sample mean M falls inside this interval with probability 0.95.


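That is,

$$P(\mu - 1.96\,\sigma_M \le M \le \mu + 1.96\,\sigma_M) = 0.95$$

where $\sigma_M = \sigma/\sqrt{N}$ is the standard error of the mean. Rearranging the inequality shows that μ lies within 1.96 σ_M of M with the same probability. Since we don't know the mean of the population, we center the interval on the sample mean M instead:

$$M - 1.96\,\sigma_M \le \mu \le M + 1.96\,\sigma_M$$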
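The known-variance interval can be sketched in a few lines of Python using only the standard library. The function names `z_confidence_interval` and `coverage`, and the simulated population (μ = 170, σ = 10), are assumptions for illustration only. The coverage check repeats the sampling many times and counts how often the interval contains the true mean.

```python
import math
import random
from statistics import NormalDist, mean


def z_confidence_interval(sample, sigma, confidence=0.95):
    """Confidence interval for the mean when the population sd `sigma` is known."""
    n = len(sample)
    m = mean(sample)
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * sigma / math.sqrt(n)  # z times the standard error of the mean
    return m - half_width, m + half_width


def coverage(mu, sigma, n, trials, seed=0):
    """Fraction of simulated 95% intervals that contain the true mean `mu`."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lo, hi = z_confidence_interval(sample, sigma)
        if lo <= mu <= hi:
            hits += 1
    return hits / trials


# Hypothetical population: mean 170, standard deviation 10, samples of size 25.
print(f"empirical coverage: {coverage(mu=170.0, sigma=10.0, n=25, trials=2000):.3f}")
```

With enough trials the printed fraction should land close to 0.95, which is the "95% of the intervals would contain the population mean" statement in simulated form.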

Population Variance is Unknown

Degrees of Freedom

The degrees of freedom (df) of an estimate is the number of independent pieces of information on which the estimate is based. In general, the degrees of freedom for an estimate is equal to the number of values minus the number of parameters estimated en route to the estimate in question.

If the variance in a sample is used to estimate the variance in a population, we can't calculate the sample variance as

$$\hat{\sigma}^2 = \frac{1}{N}\sum_{i=1}^{N}(X_i - M)^2$$

That's because the sample mean M must itself be estimated from the same data, which uses up one degree of freedom. The degrees of freedom are therefore N-1, and the previous formula underestimates the variance. Instead, we should use the following formula

$$s^2 = \frac{1}{N-1}\sum_{i=1}^{N}(X_i - M)^2$$

where s² is the estimate of the variance and M is the sample mean. The denominator of this formula is the degrees of freedom.
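The underestimation is easy to check numerically. The sketch below is an illustration under assumed values (a normal population with σ = 2, samples of size 5); the function name `average_variance_estimates` is made up for this example. It relies on `statistics.pvariance`, which divides by N, and `statistics.variance`, which divides by N-1.

```python
import random
import statistics


def average_variance_estimates(sigma, n, trials, seed=0):
    """Average the divide-by-N and divide-by-(N-1) variance estimates over many samples."""
    rng = random.Random(seed)
    biased_total = 0.0
    unbiased_total = 0.0
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        biased_total += statistics.pvariance(sample)   # denominator N
        unbiased_total += statistics.variance(sample)  # denominator N - 1
    return biased_total / trials, unbiased_total / trials


# Hypothetical population with variance sigma^2 = 4 and small samples of size 5.
biased, unbiased = average_variance_estimates(sigma=2.0, n=5, trials=10000)
print(f"true variance            = {2.0 ** 2:.3f}")
print(f"average with N           = {biased:.3f}")
print(f"average with N - 1       = {unbiased:.3f}")
```

On average the divide-by-N estimate comes out near (N-1)/N times the true variance, while the divide-by-(N-1) estimate comes out near the true variance.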

Student's t-Distribution

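Suppose that X is a normally distributed random variable, i.e., X ~ N(μ, σ²), and that X₁, X₂, ..., X_N are N independent samples of X.

$$M = \frac{1}{N}\sum_{i=1}^{N} X_i$$

is the sample mean and

$$s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}(X_i - M)^2}$$

is the sample standard deviation.

$$Z = \frac{M - \mu}{\sigma/\sqrt{N}}$$

is a random variable with a standard normal distribution.

$$T = \frac{M - \mu}{s/\sqrt{N}}$$

is a random variable with Student's t distribution with N-1 degrees of freedom.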

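The probability density function of T is

$$f(t) = \frac{\Gamma\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\left(\frac{\nu}{2}\right)}\left(1 + \frac{t^2}{\nu}\right)^{-\frac{\nu+1}{2}}$$

where $\nu$ is the number of degrees of freedom (here $\nu = N-1$) and $\Gamma$ is the gamma function.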

The t distribution is very similar to the normal distribution when the estimate of variance is based on many degrees of freedom, but has relatively more scores in its tails when there are fewer degrees of freedom. Here are t distributions with 2, 4, and 10 degrees of freedom and the standard normal distribution. Notice that the normal distribution has relatively more scores in the center of the distribution and the t distribution has relatively more in the tails.

[Figure: t distributions with 2, 4, and 10 degrees of freedom and the standard normal distribution]

The t distribution is therefore leptokurtic. The t distribution approaches the normal distribution as the degrees of freedom increase.
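The comparison between the t and normal densities can be reproduced with a short sketch that evaluates the standard formula for the t density directly. The function names `t_pdf` and `normal_pdf` are chosen for this example, and the evaluation points 0 (center) and 3 (tail) are arbitrary choices for illustration.

```python
import math


def t_pdf(t, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    # Use log-gamma so the normalizing constant does not overflow for large df.
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


def normal_pdf(z):
    """Density of the standard normal distribution."""
    return math.exp(-z * z / 2) / math.sqrt(2 * math.pi)


# Compare the height at the center (t = 0) and in the tail (t = 3).
for df in (2, 4, 10, 1000):
    print(f"df={df:>4}: f(0)={t_pdf(0.0, df):.4f}  f(3)={t_pdf(3.0, df):.4f}")
print(f"normal : f(0)={normal_pdf(0.0):.4f}  f(3)={normal_pdf(3.0):.4f}")
```

For small degrees of freedom the t density is lower at the center and higher in the tail than the normal density, and the two become nearly identical as the degrees of freedom grow.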

Confidence Interval of t Distribution

Now consider the case in which you have a normal distribution but you do not know the standard deviation. You sample N values and compute the sample mean (M) and estimate the standard error of the mean (σ_M) with s_M. What is the probability that M will be within 1.96 s_M of the population mean (μ)? This is a difficult problem because there are two ways in which M could be more than 1.96 s_M from μ: (1) M could, by chance, be either very high or very low and (2) s_M could, by chance, be very low. Intuitively, it makes sense that the probability of being within 1.96 standard errors of the mean should be smaller than in the case when the standard deviation is known (and cannot be underestimated).

Luckily, however, we can prove that the random variable T follows Student's t distribution. So we can use the t distribution to estimate the mean of a normally distributed population in situations where the sample size is small and the population standard deviation is unknown. A 90% confidence interval can be calculated as

$$M - A\,s_M \le \mu \le M + A\,s_M, \qquad s_M = \frac{s}{\sqrt{N}}$$

where A is the value of T such that the interval from -A to A contains 90% of the area of the t distribution with N-1 degrees of freedom. We can look up A in a t table.
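Instead of a t table, the critical value A can also be found numerically. The sketch below is one possible approach using only the standard library: it integrates the t density with Simpson's rule and searches for A by bisection. The helper names (`t_pdf`, `t_central_area`, `t_critical`, `t_confidence_interval`) and the ten height values are assumptions invented for this example, not data from the article.

```python
import math
import statistics


def t_pdf(t, df):
    """Density of Student's t distribution with `df` degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))


def t_central_area(a, df, steps=1000):
    """Area under the t density between -a and a (composite Simpson's rule)."""
    # The density is symmetric, so integrate from 0 to a and double the result.
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 2 * total * h / 3


def t_critical(confidence, df):
    """Find A such that the area between -A and A equals `confidence`."""
    lo, hi = 0.0, 1.0
    while t_central_area(hi, df) < confidence:  # expand until A is bracketed
        hi *= 2
    for _ in range(50):  # bisection
        mid = (lo + hi) / 2
        if t_central_area(mid, df) < confidence:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def t_confidence_interval(sample, confidence=0.90):
    """Confidence interval for the mean when the population sd is unknown."""
    n = len(sample)
    m = statistics.mean(sample)
    s_m = statistics.stdev(sample) / math.sqrt(n)  # estimated standard error
    a = t_critical(confidence, df=n - 1)
    return m - a * s_m, m + a * s_m


# Hypothetical sample of ten heights in cm.
heights = [172.1, 168.4, 175.0, 169.3, 171.8, 166.9, 173.5, 170.2, 174.4, 167.7]
low, high = t_confidence_interval(heights, confidence=0.90)
print(f"A for 90% and df=9: {t_critical(0.90, 9):.3f}")
print(f"90% confidence interval for the mean: ({low:.2f}, {high:.2f})")
```

The interval is wider than the one a normal critical value would give for the same sample, which matches the intuition that estimating the standard deviation adds uncertainty.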


Original article: https://www.cnblogs.com/sherrydatascience/p/10354428.html
