
[WIP / PyTorch notes] How to use DistributedDataParallel

Preface

  • The tutorial follows the official documentation
  • Environment: torch 2.0.1+cu118 with the matching torchaudio and torchvision

I haven't fully understood this yet; posting it now so I don't forget.

Official explanation

  • The official docs recommend DistributedDataParallel over DataParallel: "DistributedDataParallel uses multiprocessing where a process is created for each GPU, while DataParallel uses multithreading. By using multiprocessing, each GPU has its dedicated process, this avoids the performance overhead caused by GIL of Python interpreter."

In other words, DDP uses multiprocessing and creates one process per GPU, whereas DP uses multithreading. With multiprocessing, each GPU has its own dedicated process, which avoids the performance overhead caused by the Python interpreter's GIL.

DP example code

# Example given in the official docs
net = torch.nn.DataParallel(model, device_ids=[0, 1, 2])
output = net(input_var)  # input_var can be on any device, including CPU

# I also used the following approach when training ResNet50
mycuda = 'cuda'
myDevice = torch.device(mycuda if torch.cuda.is_available() else 'cpu')
resnet50 = nn.DataParallel(resnet50).to(device=myDevice)

# With mycuda set to 'cuda', the work is automatically spread across all visible GPUs
# To pin a specific GPU instead, you can use 'cuda:0', 'cuda:1', etc.; the number after the colon is the GPU id
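(Not from the original notes) A minimal sketch of how the device choice interacts with DataParallel, assuming a toy nn.Linear model as a stand-in: device_ids restricts which GPU ids participate, and .to() has to put the parameters on device_ids[0].

import torch
import torch.nn as nn

toy = nn.Linear(10, 5)  # placeholder model, only for illustration

if torch.cuda.device_count() >= 2:
    # Use only GPUs 0 and 1; outputs are gathered on device_ids[0] ('cuda:0')
    net = nn.DataParallel(toy, device_ids=[0, 1]).to('cuda:0')
elif torch.cuda.is_available():
    # A single GPU: DataParallel still runs, but there is nothing to parallelize
    net = nn.DataParallel(toy).to('cuda')
else:
    net = toy  # plain CPU fallback

x = torch.randn(4, 10)   # the input may stay on CPU; DataParallel scatters it
print(net(x).shape)      # torch.Size([4, 5])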

Pros

  • Easy to use

Cons (copied from elsewhere; I don't fully understand them yet)

  • Latency introduced by replicating the model in every forward pass
  • Unbalanced GPU utilization
  • No support for multi-node multi-GPU training

Official DDP demo code - haven't figured it out yet, rough - no rush, I don't need it for now

import os
import sys
import tempfile
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp

from torch.nn.parallel import DistributedDataParallel as DDP

# On Windows platform, the torch.distributed package only
# supports Gloo backend, FileStore and TcpStore.
# For FileStore, set init_method parameter in init_process_group
# to a local file. Example as follow:
# init_method="file:///f:/libtmp/some_file"
# dist.init_process_group(
#    "gloo",
#    rank=rank,
#    init_method=init_method,
#    world_size=world_size)
# For TcpStore, same way as on Linux.

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))


def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # create model and move it to GPU with id rank
    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])
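    # Note (my addition): device_ids=[rank] binds each process to exactly one GPU,
    # i.e. the standard one-process-per-GPU DDP setup.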

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    cleanup()


def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)
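
The snippet above ends without an entry point. A minimal launcher, following the pattern in the official tutorial (the 2-GPU assertion is my assumption for this multi-GPU demo):

if __name__ == "__main__":
    # Spawn one process per GPU; each process runs demo_basic(rank, world_size)
    n_gpus = torch.cuda.device_count()
    assert n_gpus >= 2, f"This demo expects at least 2 GPUs, got {n_gpus}"
    world_size = n_gpus
    run_demo(demo_basic, world_size)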