Avoid Simultaneously Reading Multiple Files For A Dask Array

From a library, I get a function that reads a file and returns a numpy array. I want to build a Dask array with multiple blocks from multiple files, where each block is the result of calling this function on a different file. How can I make sure that two files are never read at the same time?

Solution 1:

You could use a distributed lock primitive, so that your loader function performs acquire-read-release.

import numpy as np
import dask
import distributed

read_lock = distributed.Lock('numpy-read')

@dask.delayed
def load_numpy(lock, fn):
    # Only one task across the cluster holds the lock at a time,
    # so the file reads never overlap
    lock.acquire()
    try:
        out = np.load(fn)
    finally:
        lock.release()
    return out

y = [load_numpy(read_lock, '%d.npy' % i) for i in range(2)]
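
To get from those delayed reads to an actual Dask array, one option is da.from_delayed followed by da.stack; the shape and dtype below are assumptions you would replace with the real per-file values:

import dask.array as da

# Each delayed load becomes a single-chunk array; stacking them
# yields one Dask array with one block per file
blocks = [da.from_delayed(d, shape=(1000,), dtype=np.float64) for d in y]
x = da.stack(blocks)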

Also, da.from_array accepts a lock argument, so you could instead create the individual arrays directly with np.load and supply the same lock there.
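
A minimal sketch of that variant, assuming the files can be memory-mapped so that opening them up front is cheap (file names and chunking are placeholders); the shared lock then serializes the actual reads that happen at compute time:

import numpy as np
import dask.array as da
from threading import Lock

lock = Lock()  # or distributed.Lock('numpy-read') on a cluster

# mmap_mode='r' defers the real disk reads until dask pulls chunks;
# da.from_array wraps chunk access in the given lock
arrays = [da.from_array(np.load('%d.npy' % i, mmap_mode='r'),
                        chunks='auto', lock=lock)
          for i in range(2)]
x = da.stack(arrays)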

Alternatively, you could assign a single unit of a custom resource to each worker (even one running multiple threads), and then compute (or persist) with a requirement of one unit per file-read task, as in the example in the linked doc.
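
A sketch of the resources approach, assuming a distributed cluster whose workers were each started with one unit of a hypothetical READ resource (the scheduler address and file names are placeholders):

# Start each worker with one unit of the custom resource:
#   dask-worker scheduler-address:8786 --resources "READ=1"

import numpy as np
import dask
from dask.distributed import Client

client = Client('scheduler-address:8786')  # placeholder address

@dask.delayed
def load_numpy(fn):
    return np.load(fn)

tasks = [load_numpy('%d.npy' % i) for i in range(2)]

# Each task needs one READ unit, so a worker holding READ=1 runs
# at most one file read at a time, no matter how many threads it has
futures = client.compute(tasks, resources={'READ': 1})
results = client.gather(futures)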

Response to comment: to_hdf wasn't specified in the question, so I am not sure why it is being asked about now; however, you can use da.store(compute=False) with an h5py.File, and then specify the resource to use when calling compute. Note that this does not materialise the data into memory.
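
A sketch of that pattern, continuing from the resource-constrained setup above (x is the stacked array from the earlier sketch; the output file name and dataset name are placeholders):

import h5py
import dask.array as da

f = h5py.File('out.h5', mode='w')  # placeholder output file
dset = f.create_dataset('data', shape=x.shape, dtype=x.dtype)

# Build the write graph only; nothing is read or materialised yet
stored = da.store(x, dset, compute=False)

# Reads and writes stream chunk by chunk, with at most one
# concurrent file read enforced by the READ resource
stored.compute(resources={'READ': 1})
f.close()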

