Avoid Simultaneously Reading Multiple Files For A Dask Array
From a library, I get a function that reads a file and returns a numpy array. I want to build a Dask array with multiple blocks from multiple files, where each block is the result of calling this function on one file. However, I need to make sure that no two files are read simultaneously.
Solution 1:
You could use a distributed lock primitive, so that your loader function does acquire-read-release:
import numpy as np
import dask
import distributed

read_lock = distributed.Lock('numpy-read')

@dask.delayed
def load_numpy(lock, fn):
    lock.acquire()  # serialise reads: only one file is read at a time
    try:
        out = np.load(fn)
    finally:
        lock.release()
    return out

y = [load_numpy(read_lock, '%d.npy' % i) for i in range(2)]
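To assemble these delayed reads into a single Dask array, you can wrap each one with da.from_delayed and stack them. A minimal sketch, assuming each file holds a 1-D float array of known length (a hypothetical 1000 here):

import dask.array as da

# from_delayed needs the shape/dtype of each block up front
blocks = [da.from_delayed(d, shape=(1000,), dtype=float) for d in y]
x = da.stack(blocks)  # one block per file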
Also, da.from_array accepts a lock argument, so you could create the individual arrays directly from np.load, supplying the lock.
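A sketch of that route, assuming the inputs are plain .npy files that np.load can memory-map, so the real I/O happens lazily, guarded by the shared lock:

import numpy as np
import dask.array as da
import distributed

read_lock = distributed.Lock('numpy-read')

# mmap_mode='r' defers the actual disk reads until a chunk is accessed;
# from_array then wraps every access in the shared lock
arrays = [da.from_array(np.load('%d.npy' % i, mmap_mode='r'),
                        chunks='auto', lock=read_lock)
          for i in range(2)]
x = da.stack(arrays)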
Alternatively, you could assign a single unit of a custom resource to each worker (which may have multiple threads), and then compute (or persist) requiring one unit per file-read task, as in the example in the linked doc.
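A sketch of the resources route, assuming a distributed cluster whose workers were each started with one unit of a hypothetical FILEREAD resource:

# each worker is started with the resource, e.g.:
#   dask-worker scheduler-address:8786 --resources "FILEREAD=1"

from distributed import Client

client = Client('scheduler-address:8786')  # hypothetical scheduler address

# Requiring one FILEREAD unit per task means at most one such task runs
# on a worker at a time. For simplicity this applies the requirement to
# the whole graph; dask.annotate can target only the read tasks.
future = client.compute(x, resources={'FILEREAD': 1})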
Response to comment: to_hdf wasn't specified in the question, and I am not sure why it is being brought up now; however, you can use da.store(compute=False) with an h5py.File, and then specify the resource to use when calling compute. Note that this does not materialise the data into memory.
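A minimal sketch of that, reusing the client and array from above (the file and dataset names are hypothetical):

import h5py
import dask.array as da

f = h5py.File('out.h5', mode='w')  # hypothetical output file
dset = f.create_dataset('data', shape=x.shape, dtype=x.dtype)

# compute=False returns a lazy dask delayed object instead of writing now
stored = da.store(x, dset, compute=False)

# the resource is chosen only at compute time; data streams into the
# file block by block rather than materialising in memory
future = client.compute(stored, resources={'FILEREAD': 1})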