Project scenario:

A model is built with TensorFlow's Keras API and then trained on multiple GPUs using tf.distribute.MirroredStrategy.

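Problem description:

After the rest of the program had been debugged, training was started. The call to model.fit reported the following error and the program aborted.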

Non-OK-status: GpuLaunchKernel( SwapDimension1And2InTensor3UsingTiles<T, NumThreads, TileLongSide, TileShortSide>, total_tiles_count, NumThreads, 0, d.stream(), input, input_dims, output) status: Internal: invalid configuration argument Aborted (core dumped)

No error occurred when training on a single GPU.

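Cause analysis:

batch_size did not match the number of GPUs. When training on multiple GPUs, each GPU processes batch_size / gpu_num samples, so batch_size should be an integer multiple of the number of GPUs in use. I had previously trained on a single GPU with batch_size=1, and then switched to two GPUs without changing batch_size.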
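The arithmetic behind this can be checked with a small sketch. The helper names below are hypothetical and are not part of TensorFlow; they only restate the batch_size / gpu_num relationship described above and make no claim about how TensorFlow handles the leftover samples internally.

```python
def per_replica_batch_size(global_batch_size: int, num_gpus: int) -> float:
    """Share of one global batch that each GPU would receive."""
    return global_batch_size / num_gpus


def divides_evenly(global_batch_size: int, num_gpus: int) -> bool:
    """True when every GPU gets the same whole number of samples."""
    return global_batch_size % num_gpus == 0


# The failing setup from this post: batch_size=1 on two GPUs.
print(per_replica_batch_size(1, 2))  # 0.5, not a whole sample
print(divides_evenly(1, 2))          # False

# The working setup: batch_size=2 on two GPUs.
print(per_replica_batch_size(2, 2))  # 1.0
print(divides_evenly(2, 2))          # True
```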

Solution:

Set batch_size to an integer multiple of the number of GPUs in use. In my program, changing it to batch_size=2 solved the problem.
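One way to avoid hard-coding the value is to derive the global batch size from the number of replicas the strategy reports. The following is a minimal sketch under assumptions, not the original program: global_batch_size_for and train_with_mirrored_strategy are hypothetical helpers, and build_model stands for any user-supplied function that returns a compiled Keras model. TensorFlow is imported inside the training function so the arithmetic helper can be used on its own.

```python
def global_batch_size_for(per_replica_batch_size: int, num_replicas: int) -> int:
    """Return a batch size that is an exact multiple of the replica count."""
    if per_replica_batch_size < 1 or num_replicas < 1:
        raise ValueError("both arguments must be positive integers")
    return per_replica_batch_size * num_replicas


def train_with_mirrored_strategy(build_model, x, y, per_replica_batch_size=1):
    """Train on all visible GPUs with a batch size that divides evenly.

    build_model is a hypothetical callable that creates and compiles
    a Keras model; x and y are the training inputs and targets.
    """
    import tensorflow as tf  # imported lazily; requires TensorFlow 2.x

    strategy = tf.distribute.MirroredStrategy()
    # num_replicas_in_sync is the number of devices the strategy mirrors onto.
    batch_size = global_batch_size_for(
        per_replica_batch_size, strategy.num_replicas_in_sync
    )
    with strategy.scope():
        # Model creation and compilation happen inside the strategy scope.
        model = build_model()
    return model.fit(x, y, batch_size=batch_size)


# With two GPUs and one sample per GPU, this gives the batch_size=2 fix above.
print(global_batch_size_for(1, 2))  # 2
```

Scaling from a per-GPU batch size keeps the multiple-of-GPU-count condition satisfied when the same script is later run on a machine with a different number of GPUs.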

Reference: Tensorflow mirrored strategy error: Non-OK-status: GpuLaunchKernel( SwapDimension1And2InTensor3UsingTiles)

Non-OK-status: GpuLaunchKernel error during TensorFlow distributed training
