"find_unused_parameters=True" 这个参数的 true 设置会影响多卡模型训练的收

2023-04-01 11:42:14 +08:00
 yiyi1010

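When I set "find_unused_parameters=True", the model does not converge during training. It looks as if it has learned nothing at all, so I suspect something is wrong with the gradients.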

When I set "find_unused_parameters=False", I get the error below. It says the decoder parameters did not receive gradients. What could cause the decoder to receive no gradients? Any suggestions?

In my model training, if "find_unused_parameters=False", it will raise an error as follows:

RuntimeError: Expected to have finished reduction in the prior iteration before starting a new one. This error indicates that your module has parameters that were not used in producing loss. You can enable unused parameter detection by passing the keyword argument find_unused_parameters=True to torch.nn.parallel.DistributedDataParallel, and by making sure all forward function outputs participate in calculating loss. If you already have done the above, then the distributed data parallel module wasn't able to locate the output tensors in the return value of your module's forward function. Please include the loss function and the structure of the return value of forward of your module when reporting this issue (e.g. list, dict, iterable).

Parameters which did not receive grad for rank 1: decoder.0.norm1.weight, decoder.0.norm1.bias, decoder.0.self_attn.qkv.weight, decoder.0.self_attn.proj.weight, decoder.0.self_attn.proj.bias, decoder.0.norm_q.weight, decoder.0.norm_q.bias, decoder.0.norm_v.weight, decoder.0.norm_v.bias, decoder.0.cross_attn.q_map.weight, decoder.0.cross_attn.k_map.weight, decoder.0.cross_attn.v_map.weight, decoder.0.cross_attn.proj.weight, decoder.0.cross_attn.proj.bias, decoder.0.norm2.weight, decoder.0.norm2.bias, decoder.0.mlp.fc1.weight, decoder.0.mlp.fc1.bias, decoder.0.mlp.fc2.weight, decoder.0.mlp.fc2.bias, decoder.1.norm1.weight, decoder.1.norm1.bias, decoder.1.self_attn.qkv.weight, decoder.1.self_attn.proj.weight, decoder.1.self_attn.proj.bias, decoder.1.norm_q.weight, decoder.1.norm_q.bias, decoder.1.norm_v.weight, decoder.1.norm_v.bias, decoder.1.cross_attn.q_map.weight, decoder.1.cross_attn.k_map.weight, decoder.1.cross_attn.v_map.weight, decoder.1.cross_attn.proj.weight, decoder.1.cross_attn.proj.bias, decoder.1.norm2.weight, decoder.1.norm2.bias, decoder.1.mlp.fc1.weight, decoder.1.mlp.fc1.bias, decoder.1.mlp.fc2.weight, decoder.1.mlp.fc2.bias

Parameter indices which did not receive grad for rank 1: 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 153 154 155 156 157 158 159 160 161 162 163 164 165 166 167 168 169 170 171 172 173 174 175 176 177 178 179

ERROR:torch.distributed.elastic.multiprocessing.api:failed (exitcode: 1) local_rank: 0 (pid: 463885) of binary: /home/mnt/xyqian/miniconda3/envs/detector_21806_2/bin/python
/home/mnt/xyqian/miniconda3/envs/detector_21806_2/lib/python3.8/site-packages/torch/distributed/elastic/multiprocessing/errors/__init__.py:367: UserWarning:
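
The error says some parameters were "not used in producing loss". One way to check that outside of DDP, as a rough sketch only: run a single forward/backward pass in one process and list the trainable parameters whose .grad is still None. The TinyModel below, its use_decoder flag and the params_without_grad helper are all hypothetical stand-ins for illustration, not the model from this post.

```python
import torch
import torch.nn as nn


class TinyModel(nn.Module):
    """Hypothetical stand-in: the decoder only runs when use_decoder=True."""

    def __init__(self):
        super().__init__()
        self.encoder = nn.Linear(4, 4)
        self.decoder = nn.Linear(4, 4)

    def forward(self, x, use_decoder=False):
        h = self.encoder(x)
        # When the decoder branch is skipped, its parameters never enter the graph.
        return self.decoder(h) if use_decoder else h


def params_without_grad(model):
    """Names of trainable parameters that got no gradient in the last backward pass."""
    return [
        name
        for name, p in model.named_parameters()
        if p.requires_grad and p.grad is None
    ]


model = TinyModel()
loss = model(torch.randn(2, 4)).sum()  # decoder is skipped here
loss.backward()
unused = params_without_grad(model)
print(unused)
```

If a check like this on the real model lists the same decoder.* names as the error, the decoder output is probably not reaching the loss in that iteration on that rank, for example because a branch in forward skips it or because its output is dropped before the loss is computed.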

922 views
Node: Q&A
0 replies

https://www.v2ex.com/t/928956
