While training an NFM model, a checkpoint was saved every 2000 steps. Training started out fine, but when saving the model with tf.train.Saver's save method, the following error appeared:
tensorflow.python.framework.errors_impl.InvalidArgumentError: Cannot serialize protocol buffer of type tensorflow.GraphDef as the serialized size (2239053850bytes) would be larger than the limit (2147483647 bytes)
Taken at face value, the TF Graph has grown past 2 GB and can no longer be serialized. Inspecting the saved model files confirms this: each .meta file (the file holding the graph structure) saved every 2000 steps is roughly 200 MB larger than the previous one, which is clearly abnormal. Moreover, I am using an NFM model, which is not large in itself, so why does this error occur?
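Before digging into the code, a quick way to confirm the diagnosis (my own debugging sketch, not part of the original training script) is to log the serialized GraphDef size during training; if ops are leaking into the graph, the number keeps climbing between checkpoints:

    import tensorflow as tf

    def log_graph_size(step):
        # as_graph_def() returns the GraphDef protobuf; ByteSize() is its
        # serialized size, which must stay under the 2147483647-byte limit.
        graph_def = tf.get_default_graph().as_graph_def()
        print("step %d: GraphDef %d bytes, %d nodes"
              % (step, graph_def.ByteSize(), len(graph_def.node)))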
First, look at the relevant part of the model code:
class NFMNetwork:
    ...

    def initialized_sparse_embedding(self, weights_indices, weights_values,
                                     weights_shape, ids, values):
        # Wrap the pretrained weights (passed in as Python arrays in COO
        # form) into a SparseTensor, densify it, and look up embeddings.
        weights_indices_tensor = tf.convert_to_tensor(weights_indices, dtype=tf.int64)
        weights_values_tensor = tf.convert_to_tensor(weights_values, dtype=tf.float32)
        weights_shape_tensor = tf.convert_to_tensor(
            [int(weights_shape[0]), int(weights_shape[1])], dtype=tf.int64)
        weights_sparse = tf.sparse.SparseTensor(
            weights_indices_tensor, weights_values_tensor, weights_shape_tensor)
        weights = tf.sparse.to_dense(weights_sparse, default_value=0,
                                     validate_indices=False)
        result = tf.nn.embedding_lookup_sparse(
            weights, ids, values, partition_strategy="div", combiner="mean")
        return result

    def get_initialized_embeddings(self, weights_indices, weights_values,
                                   num_features, embed_units, features, values):
        weights_shape = [num_features, embed_units]
        hidden1 = self.initialized_sparse_embedding(
            weights_indices, weights_values, weights_shape, features, values)
        return tf.nn.leaky_relu(hidden1)

    def inference(self, feature_dict, ...):
        ...
        for (feature_name, feature_info) in self.feature_config.word2vec_sparse_feature.items():
            with tf.variable_scope("sparse_feature_embedding_%s" % feature_name.replace('#', '_')):
                sparse_embedding_feature[feature_name] = self.get_initialized_embeddings(
                    word2vec_indices,
                    word2vec_values,
                    num_features=feature_info['feature_max_size'],
                    embed_units=feature_info['feature_embedding_size'],
                    features=feature_dict[feature_name],
                    values=None,
                )
        ......
The crux is the initialized_sparse_embedding method. Its purpose is to load pretrained word vectors into the model, i.e., a pretrained embedding, but I used tf.convert_to_tensor incorrectly. tf.convert_to_tensor turns a Python array into a tensor, but, contrary to what one might assume, it is added to the graph as an operation rather than as a graph variable or constant that gets reused, and the data is stored inside that operation (see this article: https://zhuanlan.zhihu.com/p/31308247). During training, inference is called frequently, and with it initialized_sparse_embedding. Even though the call sites are wrapped in tf.variable_scope, that scope only governs variable reuse; because tf.convert_to_tensor creates graph operations, the scope has no constraining effect, and every call appends new operations, each carrying a full copy of the embedding data, to the computation graph. The graph structure therefore kept growing, until around step 30000 it exceeded 2 GB and triggered the error above.
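A minimal reproduction of the leak (a sketch I wrote to illustrate the point; the shape and scope name are made up): each call to tf.convert_to_tensor on a Python/numpy array creates a fresh Const node holding a full copy of the data, and wrapping the call in a reused variable_scope does not deduplicate it:

    import numpy as np
    import tensorflow as tf

    weights = np.random.rand(1000, 64).astype(np.float32)  # stand-in for pretrained vectors
    for i in range(3):
        with tf.variable_scope("demo", reuse=tf.AUTO_REUSE):
            _ = tf.convert_to_tensor(weights)  # adds a new Const op every time
        num_nodes = len(tf.get_default_graph().as_graph_def().node)
        print("after call %d: %d nodes in the graph" % (i + 1, num_nodes))
    # The node count increases on every iteration, exactly like the meta file did.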
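One possible fix, sketched under my own naming (the cache attribute and helper method are hypothetical, not from the original code): build the conversion ops exactly once per feature and reuse the cached dense tensor on every later inference call:

    class NFMNetwork:
        ...

        def _dense_pretrained_weights(self, name, weights_indices,
                                      weights_values, weights_shape):
            # Hypothetical per-feature cache: the Const ops and the
            # sparse-to-dense conversion are added to the graph only once.
            if not hasattr(self, "_weights_cache"):
                self._weights_cache = {}
            if name not in self._weights_cache:
                indices = tf.convert_to_tensor(weights_indices, dtype=tf.int64)
                values = tf.convert_to_tensor(weights_values, dtype=tf.float32)
                shape = tf.convert_to_tensor(
                    [int(weights_shape[0]), int(weights_shape[1])], dtype=tf.int64)
                sparse = tf.sparse.SparseTensor(indices, values, shape)
                self._weights_cache[name] = tf.sparse.to_dense(
                    sparse, default_value=0, validate_indices=False)
            return self._weights_cache[name]

Even with caching, the embedding data is still baked into the GraphDef as Const ops; if the pretrained vectors are large, a cleaner variant is to create a non-trainable tf.Variable and load the data once at startup through a tf.placeholder-fed assign op, which keeps the weights out of the serialized graph (and out of the .meta file) entirely.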