I am running two workers/replicas and one parameter server, like:
--ps_hosts='hosta.com:2222' --worker_hosts='hosta.com:2223,hostb.com:2223'
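For context, each process parses these flags into a cluster spec and starts its own in-process server. A minimal sketch of the usual wiring (FLAGS.job_name is an assumed flag name, the rest matches the flags above):

import tensorflow as tf

# Build the cluster from the flag values above; each process then
# starts an in-process server for its own job/task.
cluster = tf.train.ClusterSpec({
    'ps': FLAGS.ps_hosts.split(','),
    'worker': FLAGS.worker_hosts.split(','),
})
server = tf.train.Server(cluster,
                         job_name=FLAGS.job_name,  # assumed flag name
                         task_index=FLAGS.task_id)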
I use tf.train.SyncReplicasOptimizer like:
opt = tf.train.SyncReplicasOptimizer(
    opt,
    replicas_to_aggregate=2,
    replica_id=FLAGS.task_id,
    total_num_replicas=2,
    variables_to_average=variables_to_average)
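For reference, with this API the synchronization machinery only engages if the chief worker also runs the optimizer's chief queue runner and initial-token op. A minimal sketch of that wiring, where sv and sess are assumed names for a tf.train.Supervisor and its managed session:

# Chief-only setup required by SyncReplicasOptimizer (sketch).
chief_queue_runner = opt.get_chief_queue_runner()
init_tokens_op = opt.get_init_tokens_op()
if FLAGS.task_id == 0:
    # The chief drains aggregated gradients and seeds the sync token queue.
    sv.start_queue_runners(sess, [chief_queue_runner])
    sess.run(init_tokens_op)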
From the logs I can see that worker0 (hosta.com:2223) is much faster than worker1 (hostb.com:2223), presumably because of cross-machine network communication. It looks like worker0 is not waiting for worker1's gradients: even after I killed worker1's job, worker0 kept training. worker0 also prints many duplicate log lines for the same step, e.g.
INFO:tensorflow:Worker 0: 2016-04-21 03:24:02.659749: step 29010, loss = 0.40 (812.0 examples/sec; 0.315 sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:02.990509: step 29010, loss = 0.59 (775.3 examples/sec; 0.330 sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:04.650522: step 29013, loss = 0.56 (774.0 examples/sec; 0.331 sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:04.989555: step 29013, loss = 0.47 (756.3 examples/sec; 0.338 sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:06.549120: step 29016, loss = 0.49 (816.6 examples/sec; 0.313 sec/batch)
INFO:tensorflow:Worker 0: 2016-04-21 03:24:06.867229: step 29016, loss = 0.48 (806.1 examples/sec; 0.318 sec/batch)
So, shouldn't tf.train.SyncReplicasOptimizer block and wait for gradients from all replicas_to_aggregate workers?