Confidence Interval in Python
The confidence interval (CI) is a core tool in statistics and one that data scientists use regularly. In this article, I will explain it with the necessary formulas and show how to calculate it using Python.
Confidence Interval
As the name suggests, a confidence interval is a range of values. Ideally, it contains the true value of the population parameter being estimated. Each interval comes with a confidence level, which is expressed as a percentage. A 95% confidence interval is the most common. You can use other levels such as 90%, 97%, 99%, or even 75% if your research demands it. Let's understand it with an example:
Here is a statement:
“In a sample of 659 parents with toddlers, about 85% stated they use a car seat for all travel with their toddler. From these results, a 95% confidence interval was provided, going from about 82.3% up to 87.7%.”
This statement means we are 95% confident that the population proportion of parents who use a car seat for all travel with their toddler lies between 82.3% and 87.7%. The 95% describes the method rather than this one interval: if we repeatedly drew new samples of the same size from the population and built an interval from each, about 95% of those intervals would contain the true population proportion.
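Remember, a 95% confidence interval does not mean there is a 95% probability that the true value lies in this particular interval. The true proportion is fixed; it is the interval that changes from sample to sample.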
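The repeated-sampling reading of "95%" can be checked with a small simulation. This is only a sketch: the "true" proportion of 0.85 is a hypothetical value chosen for illustration, and the helper names `proportion_ci` and `coverage` are not from the original article. The interval uses the normal approximation with z = 1.96.

```python
import math
import random

Z_95 = 1.96  # z-score for a 95% confidence level


def proportion_ci(successes, n, z=Z_95):
    """Return (low, high) for a normal-approximation CI of a proportion."""
    p_hat = successes / n
    se = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - z * se, p_hat + z * se


def coverage(true_p, n, trials, seed=0):
    """Fraction of simulated intervals that contain true_p."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    hits = 0
    for _ in range(trials):
        # Draw one sample of size n from a population with proportion true_p.
        successes = sum(rng.random() < true_p for _ in range(n))
        low, high = proportion_ci(successes, n)
        if low <= true_p <= high:
            hits += 1
    return hits / trials


# Hypothetical population: pretend the true proportion is exactly 0.85.
rate = coverage(true_p=0.85, n=659, trials=1000)
print(f"Share of intervals containing the true proportion: {rate:.3f}")
```

The printed share should land close to 0.95. Any single interval either contains the true proportion or it does not; the 95% is a property of the procedure across many samples.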
The confidence interval is so popular and useful because we usually cannot collect data from an entire population. In the example above, we could not get information from every parent with a toddler. We had to calculate the result from 659 parents and use it to estimate the value for the overall population. So it is reasonable to allow for a margin of error and report a range. That is why we use a confidence interval, which is a range.
To construct a confidence interval, we want a simple random sample and normally distributed data. If the sample size is large enough (30 or more is a common rule of thumb), the data themselves do not need to be normally distributed, because the sampling distribution of the estimate is then approximately normal.
How to Calculate the Confidence Interval
The calculation of the confidence interval involves the best estimate, which is obtained from the sample, and a margin of error. We take the best estimate and then add and subtract the margin of error. Here are the formulas for the confidence interval and the margin of error:
CI = best estimate ± margin of error
margin of error = z × SE
Here, SE is the standard error and z is the z-score for the chosen confidence level.
Normally, a CI is calculated for one of two statistical parameters: the proportion or the mean.
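Combining the two formulas above, we can write the CI as follows:
CI = best estimate ± z × SE
For a proportion p estimated from a sample of size n, SE = sqrt(p × (1 − p) / n). For a mean, SE = s / sqrt(n), where s is the sample standard deviation.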
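For the mean, the combined formula can be sketched as below. The data are hypothetical (randomly generated for illustration, not taken from any study), and `mean_ci` is an illustrative helper name. The sketch assumes a sample large enough for the z-based interval to be reasonable.

```python
import math
import random
import statistics


def mean_ci(values, z=1.96):
    """Return (low, high): sample mean plus/minus z times the standard error."""
    mean = statistics.mean(values)
    s = statistics.stdev(values)       # sample standard deviation
    se = s / math.sqrt(len(values))    # standard error of the mean
    margin = z * se                    # margin of error
    return mean - margin, mean + margin


# Hypothetical sample of 200 measurements, generated only for this example.
rng = random.Random(42)
sample = [rng.gauss(50, 10) for _ in range(200)]

low, high = mean_ci(sample)
print(f"Sample mean: {statistics.mean(sample):.2f}")
print(f"95% CI for the mean: {low:.2f} to {high:.2f}")
```

Passing a different `z` gives an interval at another confidence level.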
The sample proportion or sample mean is calculated from the sample and serves as the best estimate of the population value. In the “parents with toddlers” example, the best estimate of the proportion of parents who use a car seat for all travel with their toddler is 85%, or 0.85. The z-score is fixed by the confidence level (CL).
The z-score for a 95% confidence interval is 1.96, provided the sample size is large enough (30 or more).
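With the sample size, the best estimate, and the z-score in hand, the interval from the car-seat statement can be reproduced step by step. This sketch assumes the standard error of a proportion, sqrt(p × (1 − p) / n), and uses the rounded figure of 85% quoted in the statement.

```python
import math

n = 659        # sample size from the statement
p_hat = 0.85   # best estimate: sample proportion
z = 1.96       # z-score for a 95% confidence level

se = math.sqrt(p_hat * (1 - p_hat) / n)  # standard error of the proportion
margin = z * se                          # margin of error
low, high = p_hat - margin, p_hat + margin

print(f"Standard error:  {se:.4f}")
print(f"Margin of error: {margin:.4f}")
print(f"95% CI: {low:.1%} to {high:.1%}")
```

The last line prints an interval of about 82.3% to 87.7%, matching the statement.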
Here are the z-scores for some commonly used confidence levels:
90%: 1.645
95%: 1.96
99%: 2.576
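The z-score for any confidence level can also be computed rather than looked up. A minimal sketch using `statistics.NormalDist` from the Python standard library (available from Python 3.8) is shown below; `z_score` is an illustrative helper name, and the list of levels is just the set mentioned earlier in this article.

```python
from statistics import NormalDist


def z_score(confidence_level):
    """Two-sided z-score for a confidence level given as a fraction, e.g. 0.95."""
    tail = (1 - confidence_level) / 2      # probability left in each tail
    return NormalDist().inv_cdf(1 - tail)  # quantile of the standard normal


for cl in (0.75, 0.90, 0.95, 0.97, 0.99):
    print(f"{cl:.0%} confidence level: z = {z_score(cl):.3f}")
```

The interval is two-sided, so half of the remaining probability goes into each tail. That is why a 95% level uses the 97.5th percentile of the standard normal distribution.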