tf.estimator error: ResourceExhausted: too many open files (TF keeps events.out.tfevents files open)

2024-01-11

I get the following error after calling train_model multiple times in the class below:

terminate called after throwing an instance of 'std::system_error' what(): Resource temporarily unavailable

After instantiating the class, I want to be able to call train_model multiple times with a fixed num_steps and a different model_dir each time.

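(I think there is a memory leak somewhere, but I can't figure out what is causing it: GPU memory usage does not change after each call to train_model, but RAM usage increases slightly after each call. When I get the error, my machine still has plenty of free RAM and GPU memory.) --> Update: it is not a memory leak; too many files are open.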

Update:

I ran this on another machine and the error is clearer: ResourceExhausted: too many open files
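A "too many open files" error means the process has reached its limit on open file descriptors. As a minimal sketch, assuming a Unix-like system, the limit can be read from Python with the standard `resource` module (the helper name `fd_limits` is only for illustration and is not part of the original post):

```python
import resource


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = fd_limits()
print("soft limit:", soft, "hard limit:", hard)
```

Raising the soft limit (for example with `ulimit -n`) would likely only delay the error if each call keeps opening new files.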

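I looked at lsof and noticed that TF keeps opening events.out.tfevents files after each call to train_model with a different checkpoint directory. Any idea how to close the events.out.tfevents files after calling tf.estimator?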
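The same check that lsof gives can be done from inside the Python process. The sketch below is Linux-only (it assumes `/proc/self/fd` is available) and the helper name `open_event_files` is hypothetical; it lists files the current process has open whose name contains a given pattern:

```python
import os


def open_event_files(pattern="events.out.tfevents"):
    """Return paths of files this process has open whose name contains `pattern`.

    Linux-only: relies on /proc/self/fd. Returns an empty list where that
    directory is not available.
    """
    fd_dir = "/proc/self/fd"
    if not os.path.isdir(fd_dir):
        return []
    paths = []
    for fd in os.listdir(fd_dir):
        try:
            target = os.readlink(os.path.join(fd_dir, fd))
        except OSError:
            # The descriptor was closed between listdir() and readlink().
            continue
        if pattern in os.path.basename(target):
            paths.append(target)
    return paths
```

Printing `len(open_event_files())` after each train_model call should show whether the number of open event files grows with every new model_dir.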

Here is my code (a simple feed-forward neural network with dropout and batch normalization, used for classification):

import os

import numpy as np
import tensorflow as tf  # TF 1.x API (tf.estimator, tf.layers, tf.contrib)


class Model:
    # Note: self.data_dict, self.class_weights and self.setup_tf_logger are
    # used below but their definitions were not included in the original post.

    def __init__(self):
        self.data_loaded = False
        self.train_data = None
        self.valid_data = None
        print('class created!')

    def input_fn(self, mode, batch_size, num_epochs=None):
        if mode == 'train':
            features_cont = self.data_dict['x_train_cont']
            features_cat = self.data_dict['x_train_cat']
            labels = self.data_dict['y_train']
        elif mode == 'valid':
            features_cont = self.data_dict['x_valid_cont']
            features_cat = self.data_dict['x_valid_cat']
            labels = self.data_dict['y_valid']
            num_epochs = 1
        elif mode == 'test':
            features_cont = self.data_dict['x_test_cont']
            features_cat = self.data_dict['x_test_cat']
            labels = np.zeros([features_cont.shape[0], 1])
            num_epochs = 1

        # concat cont and cat features
        features = np.concatenate([features_cont, features_cat], axis=1)

        # shuffle only the training data
        shuffle = mode == 'train'

        return tf.estimator.inputs.numpy_input_fn(features, labels,
                                                  batch_size=batch_size,
                                                  num_epochs=num_epochs,
                                                  shuffle=shuffle)

    def model_fn(self, features, labels, mode, params):
        is_training = mode == tf.estimator.ModeKeys.TRAIN
        l2_weights = params['l2_w']
        dropout_rates = params['dropout']
        n_classes = 2  # binary classification
        x_in = features  # already concatenated in input_fn
        hidden_layer = x_in

        # hidden layers: dense -> batch norm -> ReLU -> dropout
        for i, num_units in enumerate(params['num_units']):
            hidden_layer = tf.layers.dense(
                inputs=hidden_layer,
                units=num_units,
                name='hidden_{}'.format(i),
                kernel_regularizer=tf.contrib.layers.l2_regularizer(scale=l2_weights[i]))

            hidden_layer = tf.layers.batch_normalization(hidden_layer, training=is_training)

            hidden_layer = tf.nn.relu(hidden_layer)

            hidden_layer = tf.layers.dropout(inputs=hidden_layer,
                                             rate=dropout_rates[i],
                                             name='hidden_drop_{}'.format(i),
                                             training=is_training)

        logits = tf.layers.dense(inputs=hidden_layer,
                                 units=n_classes,
                                 name='output')

        predictions = tf.nn.softmax(logits, name='probability_predictions')

        if mode == tf.estimator.ModeKeys.PREDICT:
            return tf.estimator.EstimatorSpec(mode,
                                              predictions={'predictions': predictions},
                                              # export_outputs=export_outputs
                                              )

        # per-example weights looked up from the class weights
        weights = tf.gather(tf.constant(self.class_weights), tf.cast(labels[:, 1], tf.int32))
        l2_loss = tf.add_n(tf.get_collection(tf.GraphKeys.REGULARIZATION_LOSSES))
        loss = tf.losses.softmax_cross_entropy(labels, logits, weights) + l2_loss

        auc = tf.metrics.auc(labels[:, 1], predictions[:, 1])
        eval_metric_ops = {'auc': auc}

        if mode == tf.estimator.ModeKeys.EVAL:
            return tf.estimator.EstimatorSpec(mode, loss=loss,
                                              eval_metric_ops=eval_metric_ops)

        assert mode == tf.estimator.ModeKeys.TRAIN
        # needed for batch norm layer
        extra_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
        global_step = tf.train.get_global_step()
        optimizer = tf.train.AdamOptimizer(learning_rate=params['learning_rate'], epsilon=1e-07)
        with tf.control_dependencies(extra_ops):
            train_op = optimizer.minimize(loss, global_step=global_step)

        # Set logging hook for tf.estimator
        logging_hook = tf.train.LoggingTensorHook({'step': global_step,
                                                   'loss': loss,
                                                   # 'auc': auc[1]
                                                   },
                                                  every_n_iter=1)
        return tf.estimator.EstimatorSpec(mode=mode,
                                          loss=loss,
                                          train_op=train_op,
                                          training_hooks=[logging_hook])

    def train_model(self, hps, model_dir=None, num_steps=None):
        max_steps = None
        num_epochs = None
        # get TF logger
        tf.logging.set_verbosity(tf.logging.INFO)
        if not os.path.exists(model_dir):
            os.makedirs(model_dir)
        self.setup_tf_logger(model_dir)
        config = tf.ConfigProto(allow_soft_placement=True,
                                log_device_placement=False)
        config.gpu_options.allow_growth = True
        run_config = tf.estimator.RunConfig(
            save_checkpoints_steps=int(1e10),  # large value: no periodic checkpoints during the run
            keep_checkpoint_max=1,
            model_dir=model_dir,
            session_config=config
        )
        batch_size = hps['batch_size']

        # build the input functions once and reuse them across calls
        if self.train_data is None:
            self.train_data = self.input_fn(mode='train',
                                            batch_size=batch_size,
                                            num_epochs=num_epochs)
            self.valid_data = self.input_fn(mode='valid',
                                            batch_size=100000,
                                            num_epochs=1)

        model = tf.estimator.Estimator(model_fn=self.model_fn,
                                       params=hps,
                                       config=run_config)
        model.train(input_fn=self.train_data,
                    steps=num_steps,
                    max_steps=None)
        eval_out = model.evaluate(input_fn=self.valid_data)
        return eval_out['auc']

Update 2:

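I had to change the TF code to work around this. Currently, in basic_session_run_hooks.py and estimator.py, flush() is called on the summary writer, which only dumps the data but does not close the file.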
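The same distinction between flush() and close() holds for ordinary Python file objects. The sketch below uses only the standard library, not the TF summary writer, and the file name is made up for the demonstration:

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "events.demo")
f = open(path, "w")
f.write("summary data\n")

# flush() pushes buffered data out, but the file descriptor stays open.
f.flush()
still_open_after_flush = not f.closed

# close() flushes any remaining data and releases the file descriptor.
f.close()
closed_after_close = f.closed

print(still_open_after_flush, closed_after_close)
```

Each writer that is only flushed therefore keeps one descriptor in use for as long as the writer object is alive.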

I changed the calls on the summary writer from flush() to close(). The files now seem to be closed after calling tf.estimator, and I no longer get the ResourceExhausted error.

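The TensorFlow team must have a reason for using flush() instead of close() on the summary writers (probably the cost of opening and closing files), but this can lead to problems like the one I report here.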
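An alternative that may avoid editing the TF sources is to clear TF's summary writer cache between calls. This is a different technique from the patch described above, not the author's method. The sketch assumes the open event files belong to writers held in `FileWriterCache`, reached here through `tf.compat.v1` (which assumes TF 1.13 or later); the helper name `close_cached_summary_writers` is hypothetical:

```python
def close_cached_summary_writers():
    """Close every summary writer held in TF's FileWriterCache.

    Returns True if the cache was cleared, False if TensorFlow (or the
    tf.compat.v1 API) is not available in this environment.
    """
    try:
        import tensorflow as tf
        cache = tf.compat.v1.summary.FileWriterCache
    except (ImportError, AttributeError):
        return False
    # clear() closes each cached writer and then empties the cache.
    cache.clear()
    return True
```

A call to `close_cached_summary_writers()` would go at the end of train_model, after model.evaluate(...) has returned, so that no writer is closed while the estimator is still using it.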


tensorflow