Python - The multiprocessing Library

Author: AIHGF
Published: April 18, 2021
Category: Python

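multiprocessing uses subprocesses instead of threads, which effectively sidesteps the GIL (Global Interpreter Lock).

The multiprocessing module makes it possible to take full advantage of multiple cores on a machine.

The multiprocessing.pool.Pool object in the multiprocessing library offers a way to parallelize the execution of a function across multiple input values by distributing the input data across processes (data parallelism).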
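To make the "subprocesses instead of threads" point concrete, the following minimal sketch asks each pool worker for its process ID. The helper names `worker_pid` and `collect_worker_pids` are assumptions introduced only for this illustration; they are not part of the original article.

```python
import os
from multiprocessing import Pool


def worker_pid(_):
    # Hypothetical helper: reports which process executed the task.
    return os.getpid()


def collect_worker_pids(n_tasks=8, n_workers=2):
    # Run n_tasks trivial tasks and collect the distinct worker process IDs.
    with Pool(processes=n_workers) as pool:
        return set(pool.map(worker_pid, range(n_tasks)))


if __name__ == "__main__":
    print("parent pid:", os.getpid())
    print("worker pids:", collect_worker_pids())
```

The worker IDs differ from the parent's ID, which shows that the tasks ran in separate processes, each with its own interpreter and its own GIL.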

multiprocessing.pool.Pool provides the following methods:

[1] - apply(func[, args[, kwds]]), equivalent to apply_async( ... ).get()
[2] - apply_async(func[, args[, kwds[, callback[, error_callback]]]])
[3] - map(func, iterable[, chunksize]), equivalent to map_async( ... ).get()
[4] - map_async(func, iterable[, chunksize[, callback[, error_callback]]])
[5] - imap(func, iterable[, chunksize])
[6] - imap_unordered(func, iterable[, chunksize])
[7] - starmap(func, iterable[, chunksize]), equivalent to starmap_async( ... ).get()
[8] - starmap_async(func, iterable[, chunksize[, callback[, error_callback]]])
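The equivalences noted in the list can be checked directly. This is a minimal sketch, assuming a hypothetical worker `cube` that does not appear in the original article:

```python
from multiprocessing import Pool


def cube(x):
    # Hypothetical worker used only for this illustration.
    return x ** 3


def compare_sync_and_async():
    with Pool(processes=2) as pool:
        blocking = pool.apply(cube, (3,))
        via_async = pool.apply_async(cube, (3,)).get()
        mapped = pool.map(cube, [1, 2, 3])
        mapped_async = pool.map_async(cube, [1, 2, 3]).get()
    return blocking, via_async, mapped, mapped_async


if __name__ == "__main__":
    print(compare_sync_and_async())  # (27, 27, [1, 8, 27], [1, 8, 27])
```

The blocking variants are convenient when only one call is in flight; the `_async` variants matter once several tasks should run at the same time.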

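1. apply and apply_async

1.1. apply

apply(func[,args[,kwds]])

apply blocks until the result is ready, so when it is called in a loop each task must finish before the next one starts. It therefore gives no speedup.

Example 1: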

from multiprocessing import Pool
import time

def square(x):
    time.sleep(2)  # simulate a slow task
    print('[INFO]...processing: ', x)
    return x ** 2


if __name__ == '__main__':
    xs = [1, 2, 3, 4]
    start = time.time()
    with Pool(2) as p:
        for x in xs:
            ret = p.apply(square, (x,))  # blocks until this task finishes
            print(ret)
    print("[INFO]timecost: ", time.time() - start)

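Example 2:

# Reuses square and xs from Example 1.
if __name__ == '__main__':
    start = time.time()
    with Pool(processes=2) as p:
        # Each apply call blocks, so the tasks still run one after another.
        ret = list(p.apply(square, args=(x,)) for x in xs)
    print("[INFO]timecost: ", time.time() - start)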

Example 3:

# With tqdm (reuses square and xs from Example 1)
from tqdm import tqdm

if __name__ == '__main__':
    start = time.time()
    with Pool(processes=2) as p:
        # The bar advances once per completed apply call.
        ret = list(tqdm((p.apply(square, args=(x,)) for x in xs), total=len(xs)))
    print("[INFO]timecost: ", time.time() - start)
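The signature above also accepts `kwds`, which none of the examples use. A minimal sketch of passing keyword arguments through apply, assuming a hypothetical worker `scale` that is not part of the original article:

```python
from multiprocessing import Pool


def scale(x, factor=1, offset=0):
    # Hypothetical worker: not from the original article.
    return x * factor + offset


def run_apply_with_kwds():
    with Pool(processes=2) as pool:
        # args supplies positional arguments, kwds supplies keyword arguments.
        return pool.apply(scale, args=(5,), kwds={"factor": 3, "offset": 1})


if __name__ == "__main__":
    print(run_apply_with_kwds())  # 5 * 3 + 1 = 16
```

The same `args` and `kwds` parameters are accepted by apply_async.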

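1.2. apply_async

apply_async(func[,args[,kwds[,callback[,error_callback]]]])

Each call submits a single task, but asynchronously: the call returns immediately, so the next task can be submitted without waiting for the previous one to finish.

Unlike apply, apply_async is asynchronous and returns an AsyncResult object. Call its .get() method to wait for the result; if the result is not needed, it does not have to be fetched. Because the tasks run concurrently on the pool's workers, this does give a speedup.

Example 1: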

from multiprocessing import Pool
import time

def square(x):
    time.sleep(2)  # simulate a slow task
    print('[INFO]...processing: ', x)
    return x ** 2


if __name__ == '__main__':
    xs = [1, 2, 3, 4]
    start = time.time()
    with Pool(2) as p:
        rets = []
        for x in xs:
            ret = p.apply_async(square, (x,))  # returns immediately
            rets.append(ret)
        for ret in rets:
            print(ret.get())  # get() blocks until this result is ready
    print("[INFO]timecost: ", time.time() - start)

Example 2:

# Reuses square and xs from Example 1.
if __name__ == '__main__':
    start = time.time()
    with Pool(processes=2) as p:
        rets = list(p.apply_async(square, args=(x,)) for x in xs)
        rets = [r.get() for r in rets]
    print("[INFO]timecost: ", time.time() - start)

Example 3:

# With tqdm (reuses square and xs from Example 1)
from tqdm import tqdm

if __name__ == '__main__':
    start = time.time()
    with Pool(processes=2) as p:
        rets = list(p.apply_async(square, args=(x,)) for x in xs)
        # The bar advances as each result is fetched, in submission order.
        rets = [r.get() for r in tqdm(rets)]
    print("[INFO]timecost: ", time.time() - start)

Example 4:

# Callback approach
# With tqdm (reuses square and xs from Example 1)
from tqdm import tqdm

if __name__ == '__main__':
    start = time.time()
    with tqdm(total=len(xs)) as pbar:
        with Pool(processes=2) as p:
            def callback(*args):
                # Runs in the parent process each time a task completes.
                pbar.update()
                return
            results = [
                p.apply_async(
                    square,
                    args=(x, ),
                    callback=callback) for x in xs]
            results = [r.get() for r in results]
    print("[INFO]timecost: ", time.time() - start)
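The apply_async signature also takes an `error_callback`, which the examples above do not exercise. The sketch below routes successes and failures to separate lists; the worker `reciprocal` and the helper `run_with_callbacks` are assumptions made for illustration only.

```python
from multiprocessing import Pool


def reciprocal(x):
    # Hypothetical worker; raises ZeroDivisionError for x == 0.
    return 1 / x


def run_with_callbacks(values):
    successes, failures = [], []
    with Pool(processes=2) as pool:
        handles = [
            pool.apply_async(
                reciprocal,
                (v,),
                callback=successes.append,       # called with the return value
                error_callback=failures.append,  # called with the exception
            )
            for v in values
        ]
        # wait() blocks until the task finishes but, unlike get(),
        # does not re-raise the worker's exception.
        for handle in handles:
            handle.wait()
    return sorted(successes), [type(exc).__name__ for exc in failures]


if __name__ == "__main__":
    print(run_with_callbacks([1, 0, 4]))
```

Both callbacks run in the parent process, so they should return quickly; a slow callback delays the handling of other results.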

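2. map and map_async

Note: map and map_async pass exactly one argument to func. When func takes several arguments, starmap is usually the better choice.

2.1. map

map(func,iterable[,chunksize])

Blocks until every task in the task list has finished, then continues.

Example 1: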

from multiprocessing import Pool
import time

def square(x):  # map passes exactly one argument to func
    time.sleep(2)  # simulate a slow task
    print('[INFO]...processing: ', x)
    return x ** 2


if __name__ == '__main__':
    xs = [1, 2, 3, 4]
    start = time.time()
    with Pool(2) as p:
        ret = p.map(square, xs)  # blocks until all tasks finish
    print(ret)
    print("[INFO]timecost: ", time.time() - start)

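Example 2:

# With tqdm (reuses square and xs from Example 1)
from tqdm import tqdm

if __name__ == '__main__':
    start = time.time()
    with Pool(processes=2) as p:
        # map returns the complete list, so the bar only moves after all tasks finish.
        rets = list(tqdm(p.map(square, xs, chunksize=len(xs)//2)))
    print(rets)
    print("[INFO]timecost: ", time.time() - start)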
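Because map hands func a single argument, extra fixed arguments have to be bound beforehand. One common way is functools.partial; the sketch below assumes a hypothetical two-argument worker `power` and is only an illustration, not the original author's method.

```python
from functools import partial
from multiprocessing import Pool


def power(base, exponent):
    # Hypothetical two-argument worker.
    return base ** exponent


def squares_via_partial(values):
    # A partial object can be pickled as long as the wrapped function
    # is defined at module level.
    square = partial(power, exponent=2)
    with Pool(processes=2) as pool:
        return pool.map(square, values, chunksize=2)


if __name__ == "__main__":
    print(squares_via_partial([1, 2, 3, 4]))  # [1, 4, 9, 16]
```

When the extra arguments vary per item rather than staying fixed, starmap is the more direct tool.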

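2.2. map_async

map_async(func,iterable[,chunksize[,callback[,error_callback]]])

map_async converts the iterable to a list before splitting it into tasks. The call returns an AsyncResult immediately; its get() method blocks until all results are ready.

Example: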

# Reuses square and xs from the map example.
if __name__ == '__main__':
    start = time.time()
    with Pool(processes=2) as p:
        rets = p.map_async(square, xs)
        print(rets.get())  # get() blocks until all results are ready
    print("[INFO]timecost: ", time.time() - start)
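The object returned by map_async can also be polled or given a callback. The following is a rough sketch under the assumption of a hypothetical worker `slow_double`; note that the callback receives the whole result list once, not one item at a time.

```python
import time
from multiprocessing import Pool


def slow_double(x):
    # Hypothetical worker that simulates a short task.
    time.sleep(0.2)
    return 2 * x


def run_map_async():
    collected = []
    with Pool(processes=2) as pool:
        handle = pool.map_async(slow_double, [1, 2, 3, 4], callback=collected.extend)
        ready_at_submit = handle.ready()  # checked right after submission
        handle.wait()                     # block until every task is done
        results = handle.get()
    return ready_at_submit, results, collected


if __name__ == "__main__":
    print(run_map_async())
```

Between submission and wait(), the parent process is free to do other work, which is the main reason to prefer map_async over map.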

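3. imap and imap_unordered

Like map_async, imap and imap_unordered do not block when called. The differences are:

[1] - map_async converts the iterable to a list before dispatching tasks, whereas imap and imap_unordered consume it lazily as an iterable. map_async is slightly more efficient, while imap and imap_unordered use significantly less memory.

[2] - For results, imap and imap_unordered return an iterator that yields each result as soon as it is available, whereas map_async must wait for all tasks to finish before its get() returns a list.

The difference between imap and imap_unordered is:

Like map_async, imap returns results in input order, so it may wait on an earlier task even when later ones have already finished; imap_unordered has no such constraint.

The iterator returned by imap_unordered yields results from whichever tasks finish first.

The main reasons to use imap/imap_unordered instead of map_async are:

[1] - The iterable is large enough that converting it to a list would exhaust memory or use too much of it.

[2] - You want to start processing results before all of them are complete.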
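Both reasons can be illustrated together: the input below is a generator rather than a list, and each result is folded into a running total as soon as it arrives. The names `double`, `numbers`, and `running_totals` are assumptions introduced for this sketch.

```python
from multiprocessing import Pool


def double(x):
    # Hypothetical worker.
    return 2 * x


def numbers(limit):
    # A generator: imap pulls items from it while dispatching tasks,
    # instead of converting it to a list up front as map does.
    for n in range(limit):
        yield n


def running_totals(limit):
    totals = []
    total = 0
    with Pool(processes=2) as pool:
        for value in pool.imap(double, numbers(limit)):
            total += value  # each result is usable as soon as it arrives
            totals.append(total)
    return totals


if __name__ == "__main__":
    print(running_totals(5))  # [0, 2, 6, 12, 20]
```

With map, the loop body could only start after the last task had finished.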

3.1. imap

imap(func,iterable[,chunksize])

Example 1:

from multiprocessing import Pool
import time

def square(x):  # imap passes exactly one argument to func
    time.sleep(2)  # simulate a slow task
    print('[INFO]...processing: ', x)
    return x ** 2


if __name__ == '__main__':
    xs = [1, 2, 3, 4]
    start = time.time()
    with Pool(2) as p:
        rets = p.imap(square, xs)  # does not block
        for ret in rets:  # blocks here until the next result is ready
            print(ret)
    print("[INFO]timecost: ", time.time() - start)

Example 2:

# Reuses square and xs from Example 1.
if __name__ == '__main__':
    start = time.time()
    with Pool(processes=2) as p:
        results = list(p.imap(square, xs, chunksize=len(xs) // 2))
        print(results)
    print("[INFO]timecost: ", time.time() - start)

Example 3:

# With tqdm (reuses square and xs from Example 1)
from tqdm import tqdm

if __name__ == '__main__':
    start = time.time()
    with Pool(processes=2) as p:
        results = list(tqdm(p.imap(square, xs, chunksize=len(xs) // 2), total=len(xs)))
        print(results)
    print("[INFO]timecost: ", time.time() - start)

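3.2. imap_unordered

imap_unordered(func,iterable[,chunksize])

Compared with imap, the results of imap_unordered are unordered: whichever task finishes first has its result returned first. The results of imap are ordered.

Example 1: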

from multiprocessing import Pool
import time

def square(x):  # imap_unordered passes exactly one argument to func
    time.sleep(2)  # simulate a slow task
    print('[INFO]...processing: ', x)
    return x ** 2


if __name__ == '__main__':
    xs = [1, 2, 3, 4]
    start = time.time()
    with Pool(2) as p:
        rets = p.imap_unordered(square, xs)  # does not block
        for ret in rets:  # blocks here until any result is ready
            print(ret)
    print("[INFO]timecost: ", time.time() - start)

Example 2:

# With tqdm (reuses square and xs from Example 1)
from tqdm import tqdm

if __name__ == '__main__':
    start = time.time()
    with Pool(processes=2) as p:
        results = list(tqdm(p.imap_unordered(square, xs, chunksize=len(xs) // 2), total=len(xs)))
        print(results)
    print("[INFO]timecost: ", time.time() - start)
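The ordering difference is easiest to see when tasks take different amounts of time. In this sketch, the worker `sleep_then_return` and the helper `compare_orderings` are hypothetical names chosen for illustration; the pool gets one worker per task so that all tasks start together.

```python
import time
from multiprocessing import Pool


def sleep_then_return(seconds):
    # Hypothetical worker: the argument doubles as the task duration.
    time.sleep(seconds)
    return seconds


def compare_orderings(delays):
    with Pool(processes=len(delays)) as pool:
        ordered = list(pool.imap(sleep_then_return, delays))
        unordered = list(pool.imap_unordered(sleep_then_return, delays))
    return ordered, unordered


if __name__ == "__main__":
    ordered, unordered = compare_orderings([0.6, 0.1, 0.3])
    print("imap:          ", ordered)    # always the input order
    print("imap_unordered:", unordered)  # typically shortest task first
```

The order produced by imap_unordered depends on scheduling and timing, so code that consumes it should not rely on any particular sequence.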

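4. starmap and starmap_async

starmap and starmap_async differ from map and map_async in that they can pass multiple arguments: each element of the iterable is a tuple of arguments that is unpacked into the call to func.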
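Put differently, starmap calls func(*item) for every item. The sketch below compares the pool result with the equivalent serial loop; `power` and `run_starmap` are hypothetical names used only here.

```python
from multiprocessing import Pool


def power(base, exponent):
    # Hypothetical two-argument worker.
    return base ** exponent


def run_starmap(pairs):
    with Pool(processes=2) as pool:
        parallel = pool.starmap(power, pairs)
    # starmap calls power(*pair) for each element, like this serial loop.
    serial = [power(*pair) for pair in pairs]
    return parallel, serial


if __name__ == "__main__":
    print(run_starmap([(2, 3), (3, 2), (10, 0)]))  # ([8, 9, 1], [8, 9, 1])
```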

4.1. starmap

Example:

from multiprocessing import Pool
import time

def power(x, y):
    time.sleep(2)  # simulate a slow task
    print('[INFO]...processing: ', x, y)
    return x ** y

if __name__ == '__main__':
    xs = [1, 2, 3, 4]
    start = time.time()
    with Pool(processes=2) as p:
        # Each (x, y) tuple from zip is unpacked into power(x, y).
        rets = p.starmap(power, zip(xs, xs), chunksize=len(xs)//2)
    print(rets)
    print("[INFO]timecost: ", time.time() - start)

4.2. starmap_async

Example:

from multiprocessing import Pool
import time

def power(x, y):
    time.sleep(2)  # simulate a slow task
    print('[INFO]...processing: ', x, y)
    return x ** y

if __name__ == '__main__':
    xs = [1, 2, 3, 4]
    start = time.time()
    with Pool(processes=2) as p:
        # starmap_async returns immediately; get() blocks until all results are ready.
        rets = p.starmap_async(power, zip(xs, xs), chunksize=len(xs)//2).get()
    print(rets)
    print("[INFO]timecost: ", time.time() - start)
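Since get() on an async result accepts a timeout, a caller can check in briefly and fall back to other work if the results are not ready yet. A minimal sketch, assuming a hypothetical worker `slow_add`:

```python
import multiprocessing
import time
from multiprocessing import Pool


def slow_add(x, y):
    # Hypothetical worker that simulates a short task.
    time.sleep(0.5)
    return x + y


def get_with_timeout():
    with Pool(processes=2) as pool:
        handle = pool.starmap_async(slow_add, [(1, 2), (3, 4)])
        try:
            handle.get(timeout=0.05)  # far shorter than the task duration
            timed_out = False
        except multiprocessing.TimeoutError:
            timed_out = True
        # A later get() without a timeout still returns the full result list.
        return timed_out, handle.get()


if __name__ == "__main__":
    print(get_with_timeout())  # (True, [3, 7])
```

A timeout does not cancel the tasks; they keep running in the pool, and the results remain available afterwards.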

Last modified: July 13, 2021


