TensorFlow Distributed: CreateSession Still Waiting For Response From Worker: /job:ps/replica:0/task:0
I'm trying to run my first example of distributed training with TF. I've used the example from the TF documentation https://www.tensorflow.org/deploy/distributed with one ps and
Solution 1:
OK, the problem was kind of silly! I SSH to the remote server as user@example.com, and that is what I was using to define the cluster address in TensorFlow. It turned out that I should only be using example.com (the hostname, without the user@ prefix), and the problem was solved after that!
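As a rough sketch of that fix, the snippet below strips an SSH-style user@ prefix from each address before building the cluster definition. The helper name strip_ssh_user, the port 2222, and the worker address are assumptions made up for illustration; they are not from the original setup or from TensorFlow itself.

```python
# Hypothetical helper (not part of TensorFlow): keep only "host" or "host:port".
def strip_ssh_user(address):
    """Drop a leading 'user@' so the address is usable in a cluster definition."""
    return address.rsplit("@", 1)[-1]

# Addresses as they might have been written by mistake.
# Port 2222 and the worker entry are assumptions for illustration.
raw_hosts = {
    "ps": ["user@example.com:2222"],
    "worker": ["localhost:2222"],
}

# Clean every address in every job.
cluster = {
    job: [strip_ssh_user(addr) for addr in addrs]
    for job, addrs in raw_hosts.items()
}

print(cluster)
# In TF 1.x, a dict of this shape can be passed to tf.train.ClusterSpec(cluster).
```

Addresses that have no user@ prefix pass through unchanged, so it should be safe to run the helper over every entry.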
Another thing that I found other people suggesting is that the task id given to each process should match the position of that machine's address in the cluster definition. For simplicity, try with one ps and one worker, both on the same machine, and see if that works for you.
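A minimal sketch of that single-machine test layout, assuming two free local ports (2222 and 2223 are arbitrary choices, and address_for is a hypothetical helper). It only shows how a job name and task index pick out one address from the cluster definition; starting the actual servers is left as a comment.

```python
# One ps and one worker on the same machine.
# Ports 2222 and 2223 are assumptions; any two free ports should do.
cluster = {
    "ps": ["localhost:2222"],      # job "ps", task index 0
    "worker": ["localhost:2223"],  # job "worker", task index 0
}

# Hypothetical helper: look up which address a (job, task index) pair refers to.
def address_for(cluster, job_name, task_index):
    return cluster[job_name][task_index]

ps_address = address_for(cluster, "ps", 0)
worker_address = address_for(cluster, "worker", 0)

print("ps task 0 listens on", ps_address)
print("worker task 0 listens on", worker_address)

# In TF 1.x, each process would then be started with its own pair, e.g.
#   tf.train.Server(tf.train.ClusterSpec(cluster), job_name="ps", task_index=0)
# The process started with task_index=i must be the one running at
# cluster[job_name][i].
```

If a process is launched with a task index that points at a different machine's address, the other tasks will likely keep waiting for it, which matches the symptom in the title.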