
Adding State To A Function Which Gets Called Via Pool.map -- How To Avoid Pickling Errors

I've hit the common problem of getting a pickle error when using the multiprocessing module. My exact problem is that I need to give the function I'm calling some state before I ca

Solution 1:

Map is a way to distribute a workload across processes. If you store the data in the function itself, I think you defeat that original purpose.

Let's try to find out why it is slower. That's not normal, so something else must be going on.

First, the number of processes must be suitable for the machine running them. In your example you're using a pool of 2 processes, so a total of 3 processes is involved (the two workers plus the parent). How many cores are on the system you're using? What else is running? What's the system load while crunching data? What does the function do with the data? Does it access the disk? Or maybe it uses a database, which means there is probably another process competing for the disk and the cores. What about memory? Is there enough to hold the initial lists?
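As a quick check of the first point, here is a minimal sketch that sizes the pool from the core count instead of hard-coding it. The names `suggested_pool_size` and `crunch` are made up for illustration and are not from the question; `crunch` just stands in for whatever the real function does.

```python
import os
from multiprocessing import Pool


def suggested_pool_size(reserve=0):
    """Return a worker count no larger than the CPU count (never below 1).

    `reserve` is a hypothetical knob for leaving cores free for other work.
    """
    cpus = os.cpu_count() or 1  # os.cpu_count() can return None
    return max(1, cpus - reserve)


def crunch(x):
    # Placeholder for the real CPU-bound function (assumption).
    return x * x


if __name__ == "__main__":
    size = suggested_pool_size()
    print("cores reported:", os.cpu_count(), "pool size used:", size)
    with Pool(processes=size) as pool:
        results = pool.map(crunch, range(1000))
    print("items processed:", len(results))
```

If the machine is already busy with other jobs, passing a non-zero `reserve` keeps the pool from oversubscribing the cores.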

The right implementation is your Attempt 1.

Try profiling the execution, for example with iostat. That way you can spot the bottlenecks.

If it stalls on the CPU, then you can try some tweaks to the code.

From another answer on Stack Overflow (by me, so no problem copying and pasting it here :P ):

You're using .map(), which collects all the results and only then returns. So for a large dataset you're probably stuck in the collecting phase.

You can try using .imap(), which is the iterator version of .map(), or even .imap_unordered() if the order of the results is not important (as it seems from your example).
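A minimal sketch of the .imap_unordered() variant, assuming the results can be consumed one at a time (here they are simply summed). `work` and `consume_unordered` are hypothetical names, not taken from the question.

```python
from multiprocessing import Pool


def work(x):
    # Placeholder for the real per-item function (assumption).
    return x * x


def consume_unordered(func, items, processes=2):
    """Process results as they arrive instead of waiting for the full list."""
    total = 0
    with Pool(processes=processes) as pool:
        # Results are yielded in completion order, not input order.
        # The loop must stay inside the `with` block: leaving it
        # terminates the pool.
        for result in pool.imap_unordered(func, items):
            total += result
    return total


if __name__ == "__main__":
    print(consume_unordered(work, range(10_000)))
```

Because each result is handled as soon as a worker returns it, the parent never has to hold the complete result list in memory the way .map() does.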

Here's the relevant documentation. Worth noting this line from the imap() entry:

For very long iterables using a large value for chunksize can make the job complete much faster than using the default value of 1.
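To see whether chunksize matters for your data, you could time the same job with a few values. This is only a sketch: `timed_imap` and `work` are illustrative names, and the numbers you get will depend entirely on your machine and on how expensive the real function is.

```python
import time
from multiprocessing import Pool


def work(x):
    # Placeholder for the real per-item function (assumption).
    return x + 1


def timed_imap(func, items, chunksize, processes=2):
    """Run pool.imap with the given chunksize; return (results, seconds)."""
    start = time.perf_counter()
    with Pool(processes=processes) as pool:
        # imap keeps input order; chunksize controls how many items
        # are sent to a worker in one batch.
        results = list(pool.imap(func, items, chunksize=chunksize))
    return results, time.perf_counter() - start


if __name__ == "__main__":
    data = range(100_000)
    for size in (1, 100, 10_000):
        _, elapsed = timed_imap(work, data, chunksize=size)
        print(f"chunksize={size}: {elapsed:.3f}s")
```

With a very cheap function like this placeholder, most of the time goes into passing items between processes, which is exactly the overhead a larger chunksize reduces.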

