Wednesday, October 7, 2015

Why you should learn Python's multiprocessing module

Python multiprocessing basics

This post is not an in-depth guide to multiprocessing in Python, or even a brief intro. Rather, it is intended to motivate you to bother learning it. When I recently experimented with Python's cross-platform multiprocessing module, I was pleasantly surprised by how easy it was to use. I was able to quickly parallelize iterative tasks in Python. For example, I have a script that runs the same text parsing on files in multiple directories. A simple tool from the multiprocessing module let me run the text processing in multiple directories simultaneously. This saved a lot of time, since the alternative was looping over the list of directories and running the same command in each directory one at a time. You might have heard that Python is not a good language for parallel processing because of the Global Interpreter Lock (GIL), but the multiprocessing module sidesteps the lock by running separate processes, each with its own interpreter.
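To make the directory workflow above a little more concrete, here is a rough sketch of what it might look like. It is not my actual script: the function names are made up for illustration, and counting words in .txt files is just a stand-in for "the same text parsing". It uses Pool.map, which is covered in the next section.

```python
import multiprocessing as mp
import os
import sys


def count_words_in_directory(directory):
    # Hypothetical stand-in for the real parsing step: count the
    # whitespace-separated words in every .txt file directly inside
    # the given directory.
    total = 0
    for name in sorted(os.listdir(directory)):
        if name.endswith(".txt"):
            with open(os.path.join(directory, name)) as handle:
                total += len(handle.read().split())
    return total


def count_words_in_directories(directories, processes=4):
    # One task per directory; the worker processes handle them side by side.
    pool = mp.Pool(processes=processes)
    try:
        counts = pool.map(count_words_in_directory, directories)
    finally:
        pool.close()
        pool.join()
    return dict(zip(directories, counts))


if __name__ == "__main__":
    # Directories are taken from the command line, e.g.
    #   python count_words.py dir_a dir_b
    results = count_words_in_directories(sys.argv[1:])
    for directory in sorted(results):
        print("%s: %d" % (directory, results[directory]))
```

The sequential alternative would be a plain for loop calling count_words_in_directory on each directory in turn.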



Simple example using Pool

The multiprocessing module has several objects that may be useful for you; one that stands out is the Pool class. Pool lets you use multiple processors to run a set of independent tasks in parallel with very little code. To use it, first create a Pool object and then call its map method, which applies a single-parameter function to every element of an iterable such as a list (if you pass a dictionary, it iterates over the keys). In other words, map takes two inputs, the function and the collection, and calls the function once per element. Check it out:

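import multiprocessing as mp
import time

a_list = range(8)

def f(x):
    # Simulate a slow task, then print the input.
    time.sleep(5)
    print(x)

if __name__ == "__main__":
    # The guard keeps worker processes from re-running this block on
    # platforms that start them by re-importing the script (e.g. Windows).
    pool = mp.Pool(processes=8)
    pool.map(f, a_list)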

When run, this produces:

$ python mp.py 
2
1
0
3
4
5
6
7

In this arbitrary and simple example the last two lines of code do all the work. These two lines create the pool object and apply the function "f" to each element in "a_list" using 8 worker processes, which run simultaneously if enough CPU cores are available. The pool map method is analogous to the built-in Python map function. Also notice that f simply prints the input parameter x, but when run the output is not in the original order of a_list. This is the expected result because 8 processes are running in parallel and each one prints as soon as it finishes sleeping. Pool.map does return its results in the order of the input, but it makes no guarantee about the order in which the calls actually run. This means that the tasks you assign to the pool must be completely independent of one another.
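The difference between execution order and result order is easier to see when the function returns a value instead of printing it. Here is a minimal sketch; slow_square and parallel_squares are names I made up for this illustration, and the sleep times are arbitrary, chosen so that larger inputs finish sooner.

```python
import multiprocessing as mp
import time


def slow_square(x):
    # Sleep longer for smaller inputs so that later tasks tend to finish first.
    time.sleep(0.1 * max(0, 3 - x))
    return x * x


def parallel_squares(values, processes=4):
    # Pool.map blocks until every task is done and returns the results in
    # the same order as the inputs, regardless of which worker finished first.
    pool = mp.Pool(processes=processes)
    try:
        return pool.map(slow_square, values)
    finally:
        pool.close()
        pool.join()


if __name__ == "__main__":
    print(parallel_squares([0, 1, 2, 3]))
```

Run as a script, this prints [0, 1, 4, 9], the same list the built-in map would give you. So if you need ordered output, return values from your function and print them afterwards rather than printing inside the workers.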

You can easily extend this example to fit more complicated workflows, and as you can see it is incredibly simple. Chances are you have access to a multi-core or hyper-threaded processor that you are under-utilizing. So don't be scared any more: this Python module lets anyone use multiprocessing in a very simple way. Any time you invest in learning this module will be repaid in shorter runtimes later. Cheers
