
TensorFlow Distributed: CreateSession Still Waiting For Response From Worker: /job:ps/replica:0/task:0

I'm trying to run my first example of distributed training with TF. I've used the example that is in TF documentation https://www.tensorflow.org/deploy/distributed with one ps and

Solution 1:

OK, the problem was kind of silly! I ssh into the remote server as user@example.com, and that is what I was using to define the cluster IP address in TensorFlow. It turned out that I should only be using example.com, and the problem was solved after that change!
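To make the fix concrete, here is a minimal sketch of cleaning the addresses before they go into the cluster definition. The helper names `strip_ssh_user` and `build_cluster` and the ports 2222 and 2223 are hypothetical, chosen only for illustration; the original post does not show its ports or code.

```python
def strip_ssh_user(address):
    """Drop a leading 'user@' (an SSH login), leaving only 'host:port'."""
    return address.rsplit("@", 1)[-1]


def build_cluster(ps_addresses, worker_addresses):
    """Return a cluster dict whose entries are plain 'host:port' strings."""
    cluster = {
        "ps": [strip_ssh_user(a) for a in ps_addresses],
        "worker": [strip_ssh_user(a) for a in worker_addresses],
    }
    # Reject entries that are still not of the form host:port.
    for job_name, addresses in cluster.items():
        for address in addresses:
            host, sep, port = address.rpartition(":")
            if not sep or not host or not port.isdigit():
                raise ValueError(
                    "bad address for job %r: %r" % (job_name, address))
    return cluster


# Ports 2222 and 2223 are assumptions, not values from the original post.
cluster = build_cluster(["user@example.com:2222"], ["user@example.com:2223"])
print(cluster)
```

The resulting dict has the shape that `tf.train.ClusterSpec` accepts in TensorFlow 1.x, so it could be passed there directly, assuming the rest of the setup follows the documentation example.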

Another thing I found other people suggesting is to make sure the task ID matches the cluster IP address, i.e. each process is started with the task index of its own address in the cluster definition. For simplicity, try one ps and one worker, both on the same machine, and see if it works for you.
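A rough sketch of that single-machine check follows. The `localhost` ports and the helper `address_for` are hypothetical and used only to illustrate how a job name and task index select one address from the cluster definition.

```python
# One ps and one worker on the same machine; the ports are assumptions.
local_cluster = {
    "ps": ["localhost:2222"],
    "worker": ["localhost:2223"],
}


def address_for(cluster, job_name, task_index):
    """Return the address a process claims with this job name and task index."""
    addresses = cluster[job_name]
    if not 0 <= task_index < len(addresses):
        raise IndexError(
            "job %r has no task %d" % (job_name, task_index))
    return addresses[task_index]


# Each process should be launched with the pair that maps to the address
# it is actually reachable on.
for job_name in ("ps", "worker"):
    print(job_name, 0, address_for(local_cluster, job_name, 0))
```

If the ps process is started with a task index that maps to a different address than the one it listens on, the worker has nothing to connect to, which is consistent with the "still waiting for response" message in the title.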
