Interesting question!
Following Scott's suggestion, I gave it a quick try.
Inputs:
import random
import pandas as pd
import numpy as np
# fixing the random seed
random.seed(a=1, version=2)
# formatting floats
pd.options.display.float_format = '{:.1f}'.format
# given inputs
count = 5388
mean = 4173
median = 4072
lower_percentile = 10
lower_percentile_value = 2720
upper_percentile = 90
upper_percentile_value = 5676
max_value = 6325
min_value = 2101
Function:
def generate_dataset(count, mean, median, lower_percentile, upper_percentile,
                     lower_percentile_value, upper_percentile_value,
                     min_value, max_value
                     ):
    # Calculate the number of values that fall within each percentile band
    p_1_size = int(float(lower_percentile) * float(count) / 100)
    p_4_size = int(count - (float(upper_percentile) * float(count) / 100))
    p_2_size = int((count / 2) - p_1_size)
    p_3_size = int((count / 2) - p_4_size)
    # can be used to adjust the mean: this is the upper bound of the third
    # range (the `mean` argument itself is not used below)
    mean_adjuster = 5790
    # randomly pick the right number of values from each range
    # (range() excludes its upper bound)
    p_1 = random.choices(range(min_value, lower_percentile_value), k=p_1_size)
    p_2 = random.choices(range(lower_percentile_value, median), k=p_2_size)
    p_3 = random.choices(range(median, mean_adjuster), k=p_3_size)
    p_4 = random.choices(range(upper_percentile_value, max_value), k=p_4_size)
    return p_1 + p_2 + p_3 + p_4
dataset = generate_dataset(
    count, mean, median, lower_percentile, upper_percentile,
    lower_percentile_value, upper_percentile_value, min_value, max_value
)
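As a rough sanity check on `mean_adjuster`, the expected mean of the generated data can be worked out without sampling, because each of the four ranges is drawn uniformly. The sketch below is not part of the original function: `expected_mean` and `closest_adjuster` are hypothetical helper names, and the defaults simply repeat the inputs above.

```python
def expected_mean(mean_adjuster, count=5388, median=4072,
                  lower_percentile=10, upper_percentile=90,
                  lower_percentile_value=2720, upper_percentile_value=5676,
                  min_value=2101, max_value=6325):
    """Expected mean of generate_dataset's output for a given mean_adjuster."""
    # Same segment sizes as in generate_dataset
    p_1_size = int(lower_percentile * count / 100)
    p_4_size = int(count - upper_percentile * count / 100)
    p_2_size = int(count / 2 - p_1_size)
    p_3_size = int(count / 2 - p_4_size)

    def range_mean(start, stop):
        # Mean of a uniform draw from range(start, stop); stop is excluded
        return (start + stop - 1) / 2

    total = (
        p_1_size * range_mean(min_value, lower_percentile_value)
        + p_2_size * range_mean(lower_percentile_value, median)
        + p_3_size * range_mean(median, mean_adjuster)
        + p_4_size * range_mean(upper_percentile_value, max_value)
    )
    return total / (p_1_size + p_2_size + p_3_size + p_4_size)


def closest_adjuster(target_mean, candidates):
    """Hypothetical helper: candidate whose expected mean is nearest the target."""
    return min(candidates, key=lambda a: abs(expected_mean(a) - target_mean))


print(round(expected_mean(5790), 1))
print(closest_adjuster(4173, range(5677, 6325)))
```

With `mean_adjuster = 5790` the expected mean comes out at roughly 4171.4, so how close a particular run lands to 4173 also depends on the seed. Note the trade-off: the further `mean_adjuster` sits above `upper_percentile_value`, the more third-range values exceed 5676, which pushes the 90th percentile up.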
Comparison:
# converting into DataFrame
df = pd.DataFrame({"x": dataset})
new_count = len(df)
new_mean = np.mean(df.x)
new_median = np.quantile(df.x, 0.5)
new_lower_percentile = np.quantile(df.x, lower_percentile/100)
new_upper_percentile = np.quantile(df.x, upper_percentile/100)
compare = pd.DataFrame(
    {
        "value": ["count", "mean", "median", "low_p", "high_p"],
        "original": [count, mean, median, lower_percentile_value, upper_percentile_value],
        "new": [new_count, new_mean, new_median, new_lower_percentile, new_upper_percentile]
    }
)
print(compare)
Output:
    value  original    new
0   count      5388 5388.0
1    mean      4173 4173.4
2  median      4072 4072.5
3   low_p      2720 2720.4
4  high_p      5676 5743.0
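Getting the values to match exactly is a bit tricky when everything is an integer rather than a float.
You could add another variable to help control the mean, or change the random seed and see whether you get closer values. Alternatively, you could write a function that keeps changing the seed until the values are equal. (It might take a few minutes, or a few centuries :)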
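For the seed idea, here is a minimal sketch of what such a search might look like. It is only an illustration under some assumptions: `search_seed` and its `tolerance` and `max_tries` parameters are made up for this example, the generator is restated with the inputs above as defaults and takes its own `random.Random` instance, and only the mean is checked (the percentiles would need similar checks).

```python
import random
import statistics


def generate_dataset(rng, count=5388, median=4072,
                     lower_percentile=10, upper_percentile=90,
                     lower_percentile_value=2720, upper_percentile_value=5676,
                     min_value=2101, max_value=6325, mean_adjuster=5790):
    # Same logic as the function above, but drawing from the given rng
    p_1_size = int(lower_percentile * count / 100)
    p_4_size = int(count - upper_percentile * count / 100)
    p_2_size = int(count / 2 - p_1_size)
    p_3_size = int(count / 2 - p_4_size)
    p_1 = rng.choices(range(min_value, lower_percentile_value), k=p_1_size)
    p_2 = rng.choices(range(lower_percentile_value, median), k=p_2_size)
    p_3 = rng.choices(range(median, mean_adjuster), k=p_3_size)
    p_4 = rng.choices(range(upper_percentile_value, max_value), k=p_4_size)
    return p_1 + p_2 + p_3 + p_4


def search_seed(target_mean, tolerance, max_tries=1000):
    """Try seeds 0..max_tries-1; stop at the first one whose mean is within
    tolerance of target_mean, otherwise return the best one seen."""
    best = None
    for seed in range(max_tries):
        data = generate_dataset(random.Random(seed))
        mean = statistics.fmean(data)
        error = abs(mean - target_mean)
        if best is None or error < best[2]:
            best = (seed, mean, error)
        if error <= tolerance:
            break
    return best  # (seed, mean, absolute error)


seed, mean, error = search_seed(target_mean=4173, tolerance=0.05, max_tries=200)
print(seed, round(mean, 2))
```

Capping the number of tries and returning the best seed seen avoids the "few centuries" case, at the cost of possibly not reaching the tolerance.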
Cheers!