[WIP / PyTorch Notes] How to Use DistributedDataParallel
Preface
- The tutorial follows the official documentation
- Environment: torch 2.0.1+cu118, with the matching torchaudio and torchvision
I haven't fully understood this yet; posting it first so I don't forget.
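For reference, the whole stack can be installed from the cu118 wheel index. A minimal sketch, assuming torchvision 0.15.2 and torchaudio 2.0.2 are the releases that pair with torch 2.0.1 (worth double-checking against the official compatibility matrix):
# version pins below are my assumption for the torch 2.0.1 + cu118 stack
pip install torch==2.0.1 torchvision==0.15.2 torchaudio==2.0.2 --index-url https://download.pytorch.org/whl/cu118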
Official Explanation
- The official docs recommend DistributedDataParallel over DataParallel:
DistributedDataParallel uses multiprocessing where a process is created for each GPU, while DataParallel uses multithreading. By using multiprocessing, each GPU has its dedicated process, this avoids the performance overhead caused by GIL of Python interpreter.
In other words: DDP uses multiprocessing and creates one process per GPU, whereas DP uses multithreading. Because each GPU gets its own dedicated process, DDP avoids the performance overhead caused by the Python interpreter's GIL.
DataParallel Example Code
# Example from the official docs
net = torch.nn.DataParallel(model, device_ids=[0, 1, 2])
output = net(input_var)  # input_var can be on any device, including CPU

# I also used the following approach when training ResNet-50
mycuda = 'cuda'
myDevice = torch.device(mycuda if torch.cuda.is_available() else 'cpu')
resnet50 = nn.DataParallel(resnet50).to(device=myDevice)
# If mycuda is set to 'cuda', the work is automatically spread across all visible GPUs
# To target a specific GPU, use 'cuda:0', 'cuda:1', etc., where the number is the GPU id
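As a small illustration of the comments above (my own sketch, not from the official docs; the model here is just a placeholder), device_ids can also restrict DataParallel to a subset of GPUs, and CUDA_VISIBLE_DEVICES achieves a similar effect from outside the script:

import os
import torch
import torch.nn as nn

# Optional: hide all but GPUs 0 and 1 from PyTorch (must be set before CUDA is initialized)
# os.environ["CUDA_VISIBLE_DEVICES"] = "0,1"

model = nn.Linear(10, 5)  # placeholder model just for this sketch

if torch.cuda.is_available() and torch.cuda.device_count() > 1:
    # Replicate only onto GPUs 0 and 1; outputs are gathered on device_ids[0]
    net = nn.DataParallel(model, device_ids=[0, 1]).to('cuda:0')
else:
    net = model.to('cuda' if torch.cuda.is_available() else 'cpu')

out = net(torch.randn(8, 10))  # the input can live on the CPU; DP scatters it across the replicas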
Pros
- Easy to use
Cons (copied from a reference; I don't fully understand every point yet)
- Latency introduced by replicating the model onto each GPU in every forward pass
- Unbalanced GPU utilization
- No support for multi-node, multi-GPU training
Official DDP Demo Code (haven't made sense of it yet, which is a bit painful, but no rush since I don't need it for now)
import os
import sys
import tempfile
import torch
import torch.distributed as dist
import torch.nn as nn
import torch.optim as optim
import torch.multiprocessing as mp

from torch.nn.parallel import DistributedDataParallel as DDP

# On Windows platform, the torch.distributed package only
# supports Gloo backend, FileStore and TcpStore.
# For FileStore, set init_method parameter in init_process_group
# to a local file. Example as follow:
# init_method="file:///f:/libtmp/some_file"
# dist.init_process_group(
#    "gloo",
#    rank=rank,
#    init_method=init_method,
#    world_size=world_size)
# For TcpStore, same way as on Linux.

def setup(rank, world_size):
    os.environ['MASTER_ADDR'] = 'localhost'
    os.environ['MASTER_PORT'] = '12355'

    # initialize the process group
    dist.init_process_group("gloo", rank=rank, world_size=world_size)

def cleanup():
    dist.destroy_process_group()

class ToyModel(nn.Module):
    def __init__(self):
        super(ToyModel, self).__init__()
        self.net1 = nn.Linear(10, 10)
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5)

    def forward(self, x):
        return self.net2(self.relu(self.net1(x)))

def demo_basic(rank, world_size):
    print(f"Running basic DDP example on rank {rank}.")
    setup(rank, world_size)

    # create model and move it to GPU with id rank
    model = ToyModel().to(rank)
    ddp_model = DDP(model, device_ids=[rank])

    loss_fn = nn.MSELoss()
    optimizer = optim.SGD(ddp_model.parameters(), lr=0.001)

    optimizer.zero_grad()
    outputs = ddp_model(torch.randn(20, 10))
    labels = torch.randn(20, 5).to(rank)
    loss_fn(outputs, labels).backward()
    optimizer.step()

    cleanup()

def run_demo(demo_fn, world_size):
    mp.spawn(demo_fn,
             args=(world_size,),
             nprocs=world_size,
             join=True)
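The snippet above only defines the pieces; to actually launch the demo, the official tutorial spawns one process per GPU from a main guard, roughly like this (treat it as a sketch):

if __name__ == "__main__":
    # Spawn one worker process per visible GPU; mp.spawn passes the rank as the first argument
    n_gpus = torch.cuda.device_count()
    assert n_gpus >= 2, f"Requires at least 2 GPUs to run, but got {n_gpus}"
    world_size = n_gpus
    run_demo(demo_basic, world_size)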