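I would like to ask whether the current Dataset API allows an oversampling algorithm to be implemented. I am dealing with a highly imbalanced class problem, and I think it would be useful to oversample specific classes during dataset parsing, i.e. during online generation. I have seen the implementation of the rejection_resample function, but it removes samples instead of duplicating them, and it slows down batch generation when the target distribution differs a lot from the initial one. What I want to achieve is: take an example, look at its class probability, and decide whether to duplicate it. Then call dataset.shuffle(...) and dataset.batch(...) and get an iterator.
The best approach, in my opinion, is to oversample the low-probability classes and subsample the most probable ones. I would like to do this online because it is more flexible.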
This problem has been resolved in issue #14451.
I am just posting the answer here to make it more visible to other developers.
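The sample code oversamples low-frequency classes and undersamples high-frequency ones, where class_target_prob is just the uniform distribution in my case. I wanted to check some conclusions from the recent manuscript "A systematic study of the class imbalance problem in convolutional neural networks".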
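The answer's code reads two fields from every example, class_prob and class_target_prob, but does not show how they are produced. One possible way to derive them, sketched here in plain Python, is to count label frequencies and pair each with a uniform target. This assumes the labels fit in memory; the helper name attach_class_probs and the "label" key are hypothetical and not part of the original answer.

```python
from collections import Counter


def attach_class_probs(labels):
    """Return one dict per label with empirical and uniform target probabilities.

    Hypothetical helper: the field names mirror those read by the answer's
    code, but how the author actually computed them is not shown.
    """
    counts = Counter(labels)
    total = len(labels)
    num_classes = len(counts)
    return [
        {
            "label": label,
            "class_prob": counts[label] / total,     # empirical class frequency
            "class_target_prob": 1.0 / num_classes,  # uniform target distribution
        }
        for label in labels
    ]


# Toy label list: eight examples of class 0, two of class 1.
examples = attach_class_probs([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
print(examples[0])
print(examples[-1])
```

In a tf.data pipeline the same two numbers would have to be attached to each element, for instance inside the parsing function.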
Oversampling of specific classes is done by calling:
dataset = dataset.flat_map(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(oversample_classes(x))
)
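To see what the flat_map/repeat combination does without running TensorFlow, here is a plain-Python analogy built on itertools. It only illustrates the semantics and is not the tf.data API; copies_for is a hypothetical stand-in for oversample_classes.

```python
from itertools import chain, repeat


def copies_for(example):
    # Hypothetical stand-in for oversample_classes(): the rare class "b"
    # is emitted three times, every other example once.
    return 3 if example == "b" else 1


def oversample(examples):
    # Each example expands into copies_for(example) identical items, and the
    # resulting runs are flattened into one stream, like flat_map + repeat.
    return chain.from_iterable(repeat(x, copies_for(x)) for x in examples)


result = list(oversample(["a", "b", "a"]))
print(result)  # ['a', 'b', 'b', 'b', 'a']
```

Note that the copies come out adjacent to each other, which is why shuffling after this step matters.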
Here is the full code snippet that does everything:
import tensorflow as tf  # TensorFlow 1.x API

# sampling parameters
oversampling_coef = 0.9  # if equal to 0 then oversample_classes() always returns 1
undersampling_coef = 0.5  # if equal to 0 then undersampling_filter() always returns True

def oversample_classes(example):
    """
    Returns the number of copies of the given example.
    """
    class_prob = example['class_prob']
    class_target_prob = example['class_target_prob']
    prob_ratio = tf.cast(class_target_prob / class_prob, dtype=tf.float32)
    # soften the ratio: if oversampling_coef == 0 we recover the original distribution
    prob_ratio = prob_ratio ** oversampling_coef
    # for classes with probability higher than class_target_prob we
    # want to return 1
    prob_ratio = tf.maximum(prob_ratio, 1.0)
    # for low-probability classes this number will be very large
    repeat_count = tf.floor(prob_ratio)
    # prob_ratio can be e.g. 1.9, which means there is still a 90%
    # chance that we should return 2 instead of 1
    repeat_residual = prob_ratio - repeat_count  # a number between 0 and 1
    residual_acceptance = tf.less_equal(
        tf.random_uniform([], dtype=tf.float32), repeat_residual
    )
    residual_acceptance = tf.cast(residual_acceptance, tf.int64)
    repeat_count = tf.cast(repeat_count, dtype=tf.int64)
    return repeat_count + residual_acceptance

def undersampling_filter(example):
    """
    Computes whether the given example is kept (True) or rejected (False).
    """
    class_prob = example['class_prob']
    class_target_prob = example['class_target_prob']
    prob_ratio = tf.cast(class_target_prob / class_prob, dtype=tf.float32)
    prob_ratio = prob_ratio ** undersampling_coef
    prob_ratio = tf.minimum(prob_ratio, 1.0)
    acceptance = tf.less_equal(tf.random_uniform([], dtype=tf.float32), prob_ratio)
    return acceptance

# `dataset` is assumed to yield dicts that contain 'class_prob' and
# 'class_target_prob'; `sess` is assumed to be an existing tf.Session.
dataset = dataset.flat_map(
    lambda x: tf.data.Dataset.from_tensors(x).repeat(oversample_classes(x))
)
dataset = dataset.filter(undersampling_filter)
dataset = dataset.repeat(-1)
dataset = dataset.shuffle(2048)
dataset = dataset.batch(32)
sess.run(tf.global_variables_initializer())
iterator = dataset.make_one_shot_iterator()
next_element = iterator.get_next()
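As a sanity check on the arithmetic, the two functions can be mirrored in plain Python and run over a synthetic label stream. This is a sketch, not the author's notebook: the 90/9/1 class split, the seed, and the names expected_copies, sample_copies and keep are assumptions made for illustration.

```python
import math
import random
from collections import Counter

OVERSAMPLING_COEF = 0.9
UNDERSAMPLING_COEF = 0.5


def expected_copies(class_prob, target_prob):
    # Mean repeat count: max((target / prob) ** coef, 1).
    return max((target_prob / class_prob) ** OVERSAMPLING_COEF, 1.0)


def sample_copies(class_prob, target_prob, rng):
    # Stochastic rounding: take the floor, then add one extra copy with
    # probability equal to the fractional part.
    ratio = expected_copies(class_prob, target_prob)
    base = math.floor(ratio)
    return base + (1 if rng.random() <= ratio - base else 0)


def keep(class_prob, target_prob, rng):
    # Accept with probability min((target / prob) ** coef, 1).
    ratio = min((target_prob / class_prob) ** UNDERSAMPLING_COEF, 1.0)
    return rng.random() <= ratio


rng = random.Random(0)
class_probs = {0: 0.90, 1: 0.09, 2: 0.01}  # assumed toy imbalance
target = 1.0 / len(class_probs)            # uniform target

labels = rng.choices(
    list(class_probs), weights=list(class_probs.values()), k=100_000
)
resampled = Counter()
for label in labels:
    p = class_probs[label]
    # Oversample first, then filter each copy, as in the pipeline order above.
    for _ in range(sample_copies(p, target, rng)):
        if keep(p, target, rng):
            resampled[label] += 1

total = sum(resampled.values())
for label in sorted(class_probs):
    print(label, round(resampled[label] / total, 3))
```

Under these assumed numbers the class shares move, in expectation, from 90/9/1 percent to roughly 51/27/22 percent. The result is not exactly uniform because both coefficients are below 1 and therefore soften the correction.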
Update #1
Here is a simple Jupyter notebook that implements the above oversampling/undersampling on a toy model.