Python实现常用的假设检验-百木园

开门见山。

这篇文章，教大家用Python实现常用的假设检验！

在这里插入图片描述

服从什么分布，就用什么区间估计方式，也就就用什么检验！

比如：两个样本方差比服从F分布，区间估计就采用F分布计算临界值（从而得出置信区间），最终采用F检验。

在这里插入图片描述

建设检验的基本步骤：

在这里插入图片描述

前言

假设检验用到的Python工具包

•Statsmodels是Python中，用于实现统计建模和计量经济学的工具包，主要包括描述统计、统计模型估计和统计推断

•Scipy是一个数学、科学和工程计算Python工具包，主要包括统计,优化,整合,线性代数等等与科学计算有关的包

导入数据

python学习交流Q群：906715085####
from sklearn.datasets import load_iris
import numpy as np
#导入IRIS数据集
iris = load_iris()
iris=pd.DataFrame(iris.data,columns=
[\'sepal_length\',\'sepal_width\',\'petal_legth\',\'petal_width\'])print(iris)
一个总体均值的z检验

Python学习交流Q群：906715085###
np.mean(iris[\'petal_legth\'])
\'\'\'
原假设：鸢尾花花瓣平均长度是4.2
备择假设：鸢尾花花瓣平均长度不是4.2
\'\'\'
import statsmodels.stats.weightstats
z, pval = 
statsmodels.stats.weightstats.ztest(iris
[\'petal_legth\'], value=4.2)
print(z,pval)

\'\'\'
P=0.002 <5%, 拒绝原假设，接受备则假设。
\'\'\'
一个总体均值的t检验

import scipy.stats
t, pval = scipy.stats.ttest_1samp(iris
[\'petal_legth\'], popmean=4.0)print(t, pval)
\'\'\'
P=0.0959 > 5%, 接受原假设，即花瓣长度为4.0。
 \'\'\'

模拟双样本t检验

#取两个样本
iris_1 = iris[iris.petal_legth >= 2]
iris_2 = iris[iris.petal_legth < 2]
print(np.mean(iris_1[\'petal_legth\']))
print(np.mean(iris_2[\'petal_legth\']))
\'\'\'
H0: 两种鸢尾花花瓣长度一样
H1: 两种鸢尾花花瓣长度不一样
\'\'\'

import scipy.stats
t, pval = scipy.stats.ttest_ind(iris_1
[\'petal_legth\'],iris_2[\'petal_legth\'])
print(t,pval)
\'\'\'
p<0.05,拒绝H0，认为两种鸢尾花花瓣长度不一样
\'\'\'

在这里插入图片描述

练习

数据字段说明：

•gender：性别，1为男性，2为女性

•Temperature:体温

•HeartRate：心率

•共130行，3列

•用到的数据链接：pan.baidu.com/s/1t4SKF6

本周需要解决的几个小问题：

人体体温的总体均值是否为98.6华氏度？
人体的温度是否服从正态分布?
人体体温中存在的异常数据是哪些？
男女体温是否存在明显差异？
体温与心率间的相关性(强？弱？中等?)

1.1 探索数据

import numpy as np
import pandas as pd
from scipy import stats
data = pd.read_csv
(\"C:\\\\Users\\\\baihua\\\\Desktop\\\\test.csv\")
print(data.head())
sample_size = data.size #130*3
out：   
Temperature  Gender  HeartRate
0         96.3       1         70
1         96.7       1         71
2         96.9       1         74
3         97.0       1         80
4         97.1       1         73
print(data.describe())
out： 
Temperature      
Gender   HeartRatecount   130.000000  130.000000  130.000000
mean     98.249231    1.500000   73.761538
std       0.733183    0.501934    7.062077
min      96.300000    1.000000   57.000000
25%      97.800000    1.000000   69.000000
50%      98.300000    1.500000   74.000000
75%      98.700000    2.000000   79.000000
max     100.800000    2.000000   89.000000
人体体温均值是98.249231

1.2 人体的温度是否服从正态分布?

\'\'\'
人体的温度是否服从正态分布?
先画出分布的直方图，然后使用scipy.stat.kstest
函数进行判断。

\'\'\'
%matplotlib inline
import seaborn as 
snssns.distplot(data[\'Temperature\'], 
color=\'b\', bins=10, kde=True)

在这里插入图片描述

stats.kstest(data[\'Temperature\'], \'norm\')
out：
KstestResult(statistic=1.0, pvalue=0.0)
\'\'\'
p<0.05,不符合正态分布
\'\'\'
判断是否服从t分布

\'\'\'判断是否服从t分布:
\'\'\'
np.random.seed(1)
ks = stats.t.fit(data[\'Temperature\'])
df = ks[0]
loc = ks[1]
scale = ks[2]
t_estm = stats.t.rvs(df=df, loc=loc, 
scale=scale, size=sample_size)
stats.ks_2samp(data[\'Temperature\'], 
t_estm)

\'\'\'
 pvalue=0.4321464176976891 <0.05,认为体温服从t分布
 \'\'\'
判断是否服从卡方分布

\'\'\'
判断是否服从卡方分布:

\'\'\'np.random.seed(1)
chi_square = stats.chi2.fit(data
[\'Temperature\'])
df = chi_square[0]
loc = chi_square[1]
scale = chi_square[2]
chi_estm = stats.chi2.rvs(df=df, loc=loc, 
scale=scale, size=sample_size)
stats.ks_2samp(data[\'Temperature\'],
 chi_estm)
\'\'\'
pvalue=0.3956146564478842>0.05,认为体温服从卡方分布
\'\'\'

绘制卡方分布直方图

\'\'\'
绘制卡方分布图

\'\'\'
from matplotlib import pyplot as plt
plt.figure()
data[\'Temperature\'].plot(kind = \'kde\')
chi2_distribution = stats.chi2(chi_square
[0], chi_square[1],chi_square[2])
x = np.linspace(chi2_distribution.ppf
(0.01), chi2_distribution.ppf(0.99), 100)
plt.plot(x, chi2_distribution.pdf(x),
 c=\'orange\')
 plt.xlabel(\'Human temperature\')
 plt.title(\'temperature on chi_square\',
  size=20)
 plt.legend([\'test_data\', \'chi_square\'])

在这里插入图片描述

1.3 人体体温中存在的异常数据是哪些？

\'\'\'
已知体温数据服从卡方分布的情况下，可以直接使用
Python计算出P=0.025和P=0.925时(该函数使用单侧概率值)的分布值，在分布值两侧的数据属于小概率，认为是异常值。

\'\'\'
lower1=chi2_distribution.ppf(0.025)
lower2=chi2_distribution.ppf(0.925)
t=data[\'Temperature\']
print(t[t<lower1] )
print(t[t>lower2])

out：
0     96.3
1     96.7
65    96.4
66    96.7
67    96.8
Name: Temperature, dtype: float64
63      99.4
64      99.5
126     99.4
127     99.9
128    100.0
129    100.8
Name: Temperature, dtype: float64

1.4 男女体温差异是否显著

\'\'\'
此题是一道两个总体均值之差的假设检验问题，因为是否存在差别并不涉及方向，所以是双侧检验。建立原假设和备择假设如下：
H0:u1-u2 =0  没有显著差
H1:u1-u2 != 0  有显著差别

\'\'\'
data.groupby([\'Gender\']).size() #样本量65
male_df = data.loc[data[\'Gender\'] == 1]
female_df = data.loc[data[\'Gender\'] == 2]

\'\'\'
使用Python自带的函数，P用的双侧累计概率
\'\'\'

import scipy.stats
t, pval = scipy.stats.ttest_ind(male_df
[\'Temperature\'],female_df[\'Temperature\'])
print(t,pval)if pval > 0.05:    
print(\'不能拒绝原假设，男女体温无明显差异。\')
else:    
print(\'拒绝原假设，男女体温存在明显差异。\')
out：
-2.2854345381654984 0.02393188312240236拒绝原假设，男女体温存在明显差异。

在这里插入图片描述

1.5 体温与心率间的相关性(强？弱？中等?)

\'\'\'
体温与心率间的相关性(强？弱？中等?)
\'\'\'
heartrate_s = data[\'HeartRate\']
temperature_s = data[\'Temperature\']
from matplotlib import pyplot as plt
plt.scatter(heartrate_s, temperature_s)

在这里插入图片描述

stat, p = stats.pearsonr(heartrate_s,
 temperature_s)
 print(\'stat=%.3f, p=%.3f\' % (stat, p))
 print(stats.pearsonr(heartrate_s,
  temperature_s))
\'\'\'
相关系数为0.004，可以认为二者之间没有相关性
\'\'\'

最后，今天给大家分享的这篇文章到这里就结束了，喜欢的小伙伴记得点赞收藏，有问题的小伙伴记得及时提出来解决问题，下一篇文章见了。
在这里插入图片描述

来源：https://www.cnblogs.com/123456feng/p/16132444.html
本站部分图文来源于网络，如有侵权请联系删除。

Python实现常用的假设检验

前言

练习

1.1 探索数据

1.2 人体的温度是否服从正态分布?

1.3 人体体温中存在的异常数据是哪些？

1.4 男女体温差异是否显著

1.5 体温与心率间的相关性(强？弱？中等?)

相关推荐

热门文章