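I cleaned up your code a little. It writes to StringIO, which is leaner than writing to a real file, and it sets the default look with seaborn instead of plain matplotlib so the plot looks more modern.

The bin thresholds should be the same for both samples if you want the statistical test to line up. I think iterating over the file to build the bins this way may make the whole thing take longer than it needs to. Counter might be useful because you only have to loop once, and you would end up with the same bin size for both samples. Convert the floats to ints, since you are binning them together anyway: from collections import Counter, then C = Counter() and C[value] += 1. At the end you have a dict, and you can build the bins from list(C.keys()). That works well because your data is so rough.

Also, see whether there is a way to do the chunksize reading with numpy instead of pandas, because numpy indexing is much faster. Try a %timeit on DF.iloc[i,j] and on ARRAY[i,j] and you will see what I mean. I wrote most of this as functions to make it more modular.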
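Here is a rough way to run that indexing comparison outside a notebook, using the standard timeit module instead of the %timeit magic. The 1000 x 3 table, the index values, and the repeat count are made-up stand-ins for one chunk of your data, so treat the numbers as indicative only.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical 1000 x 3 table of floats, standing in for one chunk of the real data
rng = np.random.default_rng(0)
ARRAY = rng.normal(100, 30, size=(1000, 3))
DF = pd.DataFrame(ARRAY)

i, j = 500, 2  # arbitrary row and column to look up
n = 2000       # number of repeated lookups per measurement

# Time scalar lookups through pandas positional indexing and through plain NumPy indexing
t_iloc = timeit.timeit(lambda: DF.iloc[i, j], number=n)
t_array = timeit.timeit(lambda: ARRAY[i, j], number=n)

print("DF.iloc[i, j]: %.2f us per lookup" % (t_iloc / n * 1e6))
print("ARRAY[i, j]:   %.2f us per lookup" % (t_array / n * 1e6))
```

Both lookups return the same value; only the per-call overhead differs, and that overhead matters most when you index element by element inside a loop.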
import numpy as np
import pandas as pd
import math
import matplotlib.pyplot as plt
from io import StringIO
from scipy.stats import ks_2samp
import seaborn as sns; sns.set()
%matplotlib inline
# Added seaborn because it looks better
mu = [100, 120]
sigma = 30
def write_random(file, mu, sigma=30):
    # Draw 10000 normally distributed values and write them as tab-separated rows
    dist = np.random.normal(mu, sigma, 10000)
    for i, s in enumerate(dist):
        file.write('{}\t{}\t{}\n'.format("label_A-%d" % i, "label_B-%d" % i, str(s)))
    return file
#Writing to StringIO instead of an actual file
gs1_test_1 = write_random(StringIO(),mu=100)
gs1_test_2 = write_random(StringIO(),mu=120)
chunksize = 1000
def make_hist(fh, ax):
    # First pass: find the min, max, and row count needed to define the bins
    low = np.inf
    high = -np.inf
    lines = 0
    fh.seek(0)
    for chunk in pd.read_table(fh, header=None, chunksize=chunksize, sep='\t'):
        low = np.minimum(chunk.iloc[:, 2].min(), low)  # btw, iloc is way slower than numpy array indexing
        high = np.maximum(chunk.iloc[:, 2].max(), high)  # you might wanna import and do the chunks with numpy
        lines += len(chunk)  # count actual rows so a short final chunk is not overcounted
    nbins = math.ceil(math.sqrt(lines))
    bin_edges = np.linspace(low, high, nbins + 1)
    total = np.zeros(nbins, np.int64)  # was np.uint32 zeros, CHANGED TO int64
    # Second pass: histogram each chunk against the fixed bin edges
    fh.seek(0)
    for chunk in pd.read_table(fh, header=None, chunksize=chunksize, sep='\t'):
        # compute bin counts over the 3rd column
        subtotal, e = np.histogram(chunk.iloc[:, 2], bins=bin_edges)  # np.ndarray filled with np.int64
        # accumulate bin counts over chunks
        total += subtotal
    # Draw the pre-binned counts on the axes that were passed in
    ax.hist(bin_edges[:-1], bins=bin_edges, weights=total, alpha=0.5)
    return (ax, bin_edges, total)
#Make the plot canvas to write on to give it to the function
fig,ax = plt.subplots()
test_1_data = make_hist(gs1_test_1,ax)
test_2_data = make_hist(gs1_test_2,ax)
#test_1_data[1] == test_2_data[1] The bins should be the same if you're going to try and compare them...
ax.set_title("ks: %f, p_in_the_v: %f" % ks_2samp(test_1_data[2], test_2_data[2]))
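The Counter idea from the notes at the top could look roughly like this. It is only a sketch: count_values and shared_bins are names I made up, it assumes the same three-column tab-separated layout that write_random produces, and it rounds each float to the nearest integer (plain truncation with int() would also fit the description). Each file is read once, and both samples end up on the same unit-width integer bins.

```python
from collections import Counter
from io import StringIO


def count_values(fh):
    """Tally the third tab-separated column in a single pass, rounded to integers."""
    counts = Counter()
    fh.seek(0)
    for line in fh:
        value = float(line.rstrip("\n").split("\t")[2])
        counts[round(value)] += 1  # round() on a float returns an int
    return counts


def shared_bins(counts_a, counts_b):
    """Lay two Counters out over one common run of integer bins."""
    lo = min(min(counts_a), min(counts_b))
    hi = max(max(counts_a), max(counts_b))
    centers = list(range(lo, hi + 1))
    # A Counter returns 0 for keys it has never seen, so empty bins need no special case
    totals_a = [counts_a[k] for k in centers]
    totals_b = [counts_b[k] for k in centers]
    return centers, totals_a, totals_b


# Tiny hand-made samples in the same label/label/value layout
sample_a = StringIO("a\tb\t1.2\na\tb\t1.4\na\tb\t3.0\n")
sample_b = StringIO("a\tb\t2.6\na\tb\t4.9\n")

centers, totals_a, totals_b = shared_bins(count_values(sample_a), count_values(sample_b))
print(centers)   # [1, 2, 3, 4, 5]
print(totals_a)  # [2, 0, 1, 0, 0]
print(totals_b)  # [0, 0, 1, 0, 1]
```

Because both count lists share the same centers, they can be plotted or compared bin by bin without a second pass over either file.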