Author: ChaucerG
Source: 集智书童
PP-YOLOE is an excellent single-stage anchor-free model built on PP-YOLOv2, surpassing a variety of popular YOLO models. PP-YOLOE comes as a series of models, s/m/l/x, which can be configured via a width multiplier and a depth multiplier. PP-YOLOE avoids special operators such as deformable convolution and Matrix NMS, so it can be easily deployed on a wide range of hardware.
1. Model Architecture
PP-YOLOE consists of the following components:
A scalable backbone and neck
Task Alignment Learning (TAL)
An efficient task-aligned head with DFL and VFL
The SiLU activation function
1.1 Backbone
The backbone of PP-YOLOE is essentially ResNet improved with RepVGG blocks and the CSP design idea, and it also uses the SiLU activation function, Effective SE Attention, and a few other modules. Let's go through these pieces one by one.
1. RepVGG
RepVGG is a network improved from VGG; its main ideas are:
Adding identity and residual branches to the VGG blocks, essentially bringing the essence of ResNet into VGG;
Converting all network layers into 3×3 convolutions at inference time through an op-fusion strategy, which makes the network easy to deploy and accelerate.
The figure above shows the reparameterization process at inference time, which is essentially a process of op fusion and op replacement. Figure A shows the whole reparameterization from a structural perspective, and Figure B from the perspective of the model parameters. The reparameterization steps are as follows:
Step 1: Fuse the convolution and BN layers inside the residual block (a fusion that many deep learning frameworks also perform at inference time). The blue box in the figure fuses the 3×3 convolution with its BN layer, the black box fuses the 1×1 convolution with its BN layer, and the yellow box fuses the identity branch, viewed as an equivalent convolution, with its BN layer. Denoting the pre-fusion convolution weight by W, the BN running mean by μ, the BN standard deviation by σ, and the BN scale and shift factors by γ and β, the fused convolution has weight W' = (γ/σ)·W and bias b' = β − γμ/σ. (A small numerical check of this formula is given right after these steps.)
Step 2: Convert the fused convolutions into 3×3 convolutions, i.e., turn kernels of other sizes into 3×3 kernels. The residual block may contain a 1×1 convolution branch and an identity branch, shown as the black and yellow boxes in the figure. For the 1×1 branch, the 1×1 kernel value is placed at the center of a zero-initialized 3×3 kernel, as detailed in the purple box. For the identity branch, which leaves the input feature map unchanged, the equivalent 3×3 kernel has the value 1 at the center position of its own input channel and 0 elsewhere, so convolving it with the input reproduces the original values, as detailed in the brown box.
Step 3: Merge the residual branches into a single 3×3 convolution by summing the weights W and biases b of all branches, which yields the final fused 3×3 convolution layer.
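As a quick numerical check of the Conv-BN fusion formula from Step 1 (toy single-channel numbers, not taken from the paper), folding BN into the convolution weight and bias leaves the output unchanged:

import numpy as np

w, x = 0.5, 2.0                                   # conv weight and an input value (1x1 conv, 1 channel)
mu, var, gamma, beta, eps = 0.3, 0.04, 1.2, 0.1, 1e-5
y_ref = gamma * (w * x - mu) / np.sqrt(var + eps) + beta   # conv followed by BN
w_fused = gamma / np.sqrt(var + eps) * w                    # W' = (gamma / sigma) * W
b_fused = beta - gamma * mu / np.sqrt(var + eps)            # b' = beta - gamma * mu / sigma
y_fused = w_fused * x + b_fused                              # the fused conv
print(np.isclose(y_ref, y_fused))                            # True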
Why use a VGG-style model?
Besides the belief that simple is beautiful, a VGG-style minimalist model has at least five practical advantages:
3×3 convolution is fast. On GPUs, the computational density (theoretical FLOPs divided by the time used) of 3×3 convolution can reach 4× that of 1×1 and 5×5 convolutions.
A single-path architecture is very fast because of its high parallelism. At the same FLOPs, a few large, whole operations are far more efficient than many small, fragmented ones.
A single-path architecture saves memory. Although ResNet's shortcut costs almost no computation, it roughly doubles the activation memory footprint.
A single-path architecture is more flexible; it is easy to change the width of each layer (e.g., for pruning).
The RepVGG body contains only one operator: 3×3 convolution followed by ReLU. When designing a dedicated chip, a large number of 3×3 conv + ReLU compute units can be integrated for a given chip area or cost, achieving very high efficiency.
The figure below shows the ONNX export of RepVGG after inference-time fusion; the graph is visibly much simpler.
class RepVggBlock(nn.Layer):
    def __init__(self, ch_in, ch_out, act='relu'):
        super(RepVggBlock, self).__init__()
        self.ch_in = ch_in
        self.ch_out = ch_out
        self.conv1 = ConvBNLayer(ch_in, ch_out, 3, stride=1, padding=1, act=None)
        self.conv2 = ConvBNLayer(ch_in, ch_out, 1, stride=1, padding=0, act=None)
        self.act = get_act_fn(act) if act is None or isinstance(act, (str, dict)) else act

    def forward(self, x):
        if hasattr(self, 'conv'):
            # after convert_to_deploy(): a single fused 3x3 conv
            y = self.conv(x)
        else:
            # training-time structure: 3x3 branch + 1x1 branch
            y = self.conv1(x) + self.conv2(x)
        y = self.act(y)
        return y

    def convert_to_deploy(self):
        if not hasattr(self, 'conv'):
            self.conv = nn.Conv2D(in_channels=self.ch_in, out_channels=self.ch_out, kernel_size=3, stride=1, padding=1, groups=1)
        kernel, bias = self.get_equivalent_kernel_bias()
        self.conv.weight.set_value(kernel)
        self.conv.bias.set_value(bias)

    def get_equivalent_kernel_bias(self):
        # fuse the two Conv-BN branches into a single 3x3 kernel and bias for inference
        kernel3x3, bias3x3 = self._fuse_bn_tensor(self.conv1)
        kernel1x1, bias1x1 = self._fuse_bn_tensor(self.conv2)
        return kernel3x3 + self._pad_1x1_to_3x3_tensor(kernel1x1), bias3x3 + bias1x1

    def _pad_1x1_to_3x3_tensor(self, kernel1x1):
        if kernel1x1 is None:
            return 0
        else:
            # place the 1x1 kernel at the center of a 3x3 kernel
            return nn.functional.pad(kernel1x1, [1, 1, 1, 1])

    def _fuse_bn_tensor(self, branch):
        if branch is None:
            return 0, 0
        kernel = branch.conv.weight
        running_mean = branch.bn._mean
        running_var = branch.bn._variance
        gamma = branch.bn.weight
        beta = branch.bn.bias
        eps = branch.bn._epsilon
        std = (running_var + eps).sqrt()
        t = (gamma / std).reshape((-1, 1, 1, 1))
        # W' = (gamma / std) * W, b' = beta - gamma * mean / std
        return kernel * t, beta - running_mean * gamma / std
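A minimal, hypothetical sanity check (assuming the RepVggBlock and ConvBNLayer classes above are defined in scope): after convert_to_deploy(), the fused 3×3 convolution should reproduce the two-branch output almost exactly.

import paddle

paddle.seed(0)
block = RepVggBlock(ch_in=32, ch_out=32, act='relu')
block.eval()                                  # use BN running statistics so the fusion is exact
x = paddle.rand([1, 32, 56, 56])
y_two_branch = block(x)                       # 3x3 branch + 1x1 branch
block.convert_to_deploy()
y_fused = block(x)                            # single fused 3x3 conv
print((y_two_branch - y_fused).abs().max().item())   # expected to be on the order of 1e-6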
2. Swish Activation Function
From the code and the formula, Swish contains SiLU as a special case: Swish(x) = x·sigmoid(βx), while SiLU(x) = x·sigmoid(x), i.e., Swish with β = 1.
The plots therefore generally show SiLU in place of Swish, because the Swish used in PP-YOLOE is exactly this case, i.e., the SiLU activation function.
β is a constant or a trainable parameter. Swish is unbounded above, bounded below, smooth, and non-monotonic, and it performs better than ReLU on deep models.
For example, simply replacing ReLU units with Swish improves the top-1 ImageNet classification accuracy of Mobile NASNet-A by 0.9% and of Inception-ResNet-v2 by 0.6%.
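As a quick illustration (a sketch, not the PaddleDetection implementation), Swish and SiLU can be written directly with Paddle ops:

import paddle
import paddle.nn.functional as F

def swish(x, beta=1.0):
    # Swish(x) = x * sigmoid(beta * x); beta = 1 gives SiLU, which is what PP-YOLOE effectively uses
    return x * F.sigmoid(beta * x)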
class ConvBNLayer(nn.Layer):
def __init__(self, ch_in, ch_out, filter_size=3, stride=1, groups=1, padding=0, act=None):
super(ConvBNLayer, self).__init__()
self.conv = nn.Conv2D(in_channels=ch_in, out_channels=ch_out, kernel_size=filter_size, stride=stride, padding=padding, groups=groups, bias_attr=False)
self.bn = nn.BatchNorm2D(ch_out, weight_attr=ParamAttr(regularizer=L2Decay(0.0)), bias_attr=ParamAttr(regularizer=L2Decay(0.0)))
self.act = get_act_fn(act) if act is None or isinstance(act, (str, dict)) else act
def forward(self, x):
x = self.conv(x)
x = self.bn(x)
x = self.act(x)
return x
3. Effective SE Attention
This module comes from the eSE module in "CenterMask: Real-Time Anchor-Free Instance Segmentation".
An eSE channel attention module is added on the output. The original SE module uses two FC layers for the channel weight mapping, but to reduce computation the hidden channel dimension of the FC layers is usually shrunk (below the input channels), which loses some information. For this reason the paper replaces the two FC layers with a single FC layer.
class EffectiveSELayer(nn.Layer):
def __init__(self, channels, act='hardsigmoid'):
super(EffectiveSELayer, self).__init__()
self.fc = nn.Conv2D(channels, channels, kernel_size=1, padding=0)
self.act = get_act_fn(act) if act is None or isinstance(act, (str, dict)) else act
def forward(self, x):
x_se = x.mean((2, 3), keepdim=True)
x_se = self.fc(x_se)
return x * self.act(x_se)
4. CSPNet Structure
The main idea of CSPNet is the partial dense block, which is designed to:
Increase the gradient paths: with the split-and-merge strategy, the number of gradient paths is doubled. Thanks to the cross-stage design, it also mitigates the drawback of concatenating explicitly copied feature maps;
Balance the computation of each layer: the number of base channels in DenseNet is usually much larger than the growth rate. Since the dense layers in a partial dense block only operate on half of the base channels, nearly half of the computational bottleneck is removed;
Reduce memory traffic: suppose the base feature map of a dense block in DenseNet has size w×h×c, the growth rate is d, and there are m dense layers in total. Then the CIO (convolutional input/output, a proxy for memory traffic) of the dense block is (c×m) + (m²+m)×d/2, while the CIO of the partial dense block is ((c×m) + (m²+m)×d)/2. Although m and d are usually much smaller than c, a partial dense block can save up to half of the network's memory traffic. (A small numerical illustration follows this list.)
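For illustration only, plugging made-up numbers into the two CIO expressions above (c = 256 base channels, m = 4 dense layers, growth rate d = 32):

c, m, d = 256, 4, 32
cio_dense = c * m + (m * m + m) * d // 2        # 1344 for the dense block
cio_partial = (c * m + (m * m + m) * d) // 2    # 832 for the partial dense block, roughly 38% less traffic
print(cio_dense, cio_partial)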
class CSPResStage(nn.Layer):
def __init__(self, block_fn, ch_in, ch_out, n, stride, act='relu', attn='eca'):
super(CSPResStage, self).__init__()
ch_mid = (ch_in + ch_out) // 2
if stride == 2:
self.conv_down = ConvBNLayer(ch_in, ch_mid, 3, stride=2, padding=1, act=act)
else:
self.conv_down = None
self.conv1 = ConvBNLayer(ch_mid, ch_mid // 2, 1, act=act)
self.conv2 = ConvBNLayer(ch_mid, ch_mid // 2, 1, act=act)
self.blocks = nn.Sequential(* [block_fn(ch_mid // 2, ch_mid // 2, act=act, shortcut=True) for i in range(n)])
if attn:
self.attn = EffectiveSELayer(ch_mid, act='hardsigmoid')
else:
self.attn = None
self.conv3 = ConvBNLayer(ch_mid, ch_out, 1, act=act)
def forward(self, x):
if self.conv_down is not None:
x = self.conv_down(x)
y1 = self.conv1(x)
y2 = self.blocks(self.conv2(x))
y = paddle.concat([y1, y2], axis=1)
if self.attn is not None:
y = self.attn(y)
y = self.conv3(y)
return y
5. SPP Structure
SPP-Net stands for Spatial Pyramid Pooling; it was proposed in 2015 by Kaiming He at Microsoft Research and mainly solves two problems:
It avoids the incomplete cropping and shape distortion caused by the crop/warp operations that R-CNN applies to image regions.
It removes the repeated convolutional feature extraction over the same image, which greatly speeds up proposal generation and saves computation.
Key properties of SPP
SPP produces a fixed-size output regardless of the input size.
SPP uses multiple pooling windows.
SPP can take the same image at different scales as input and still produce pooled features of the same length.
Other properties
Because SPP can handle inputs with different aspect ratios and different sizes, it improves scale invariance and reduces over-fitting.
Experiments show that training with images of diverse sizes makes the network converge more easily than training with a single size.
SPP is independent of the specific CNN design: placing SPP after the last convolution layer does not change the network structure; it simply replaces the original pooling layer.
It can be used not only for image classification but also for object detection.
Through the SPP module, local and global features are fused at the feature-map level (which is why the largest pooling kernel of the spatial pyramid should be as close as possible to the size of the feature map being pooled), enriching the representational power of the final feature map and thus improving mAP.
class SPP(nn.Layer):
def __init__(self, ch_in, ch_out, k, pool_size, act='swish', data_format='NCHW'):
super(SPP, self).__init__()
self.pool = []
self.data_format = data_format
for i, size in enumerate(pool_size):
pool = self.add_sublayer('pool{}'.format(i), nn.MaxPool2D(kernel_size=size, stride=1, padding=size // 2, data_format=data_format, ceil_mode=False))
self.pool.append(pool)
self.conv = ConvBNLayer(ch_in, ch_out, k, padding=k // 2, act=act)
def forward(self, x):
outs = [x]
for pool in self.pool:
outs.append(pool(x))
if self.data_format == 'NCHW':
y = paddle.concat(outs, axis=1)
else:
y = paddle.concat(outs, axis=-1)
y = self.conv(y)
return y
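Hypothetical usage of the SPP module above (assuming the ConvBNLayer and get_act_fn helpers are defined): with pool sizes [5, 9, 13] the input is concatenated with three pooled copies of itself, so the internal conv sees 4× the input channels. This is also why CSPStage below constructs SPP(ch_mid * 4, ch_mid, 1, [5, 9, 13]).

import paddle

spp = SPP(ch_in=256 * 4, ch_out=256, k=1, pool_size=[5, 9, 13], act='swish')
x = paddle.rand([1, 256, 20, 20])
y = spp(x)
print(y.shape)   # [1, 256, 20, 20]: channels restored by the 1x1 conv after the 4x concat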
class CSPStage(nn.Layer):
def __init__(self, block_fn, ch_in, ch_out, n, act='swish', spp=False):
super(CSPStage, self).__init__()
ch_mid = int(ch_out // 2)
self.conv1 = ConvBNLayer(ch_in, ch_mid, 1, act=act)
self.conv2 = ConvBNLayer(ch_in, ch_mid, 1, act=act)
self.convs = nn.Sequential()
next_ch_in = ch_mid
for i in range(n):
self.convs.add_sublayer(str(i), eval(block_fn)(next_ch_in, ch_mid, act=act, shortcut=False))
if i == (n - 1) // 2 and spp:
self.convs.add_sublayer('spp', SPP(ch_mid * 4, ch_mid, 1, [5, 9, 13], act=act))
next_ch_in = ch_mid
self.conv3 = ConvBNLayer(ch_mid * 2, ch_out, 1, act=act)
def forward(self, x):
y1 = self.conv1(x)
y2 = self.conv2(x)
y2 = self.convs(y2)
y = paddle.concat([y1, y2], axis=1)
y = self.conv3(y)
return y
1.2 Neck
The neck of PP-YOLOE still adopts the FPN+PAN pattern. Drawing the neck as a 3D diagram makes it easier to see how the two parts are fused through the FPN structure.
As shown in the figure, FPN is top-down: high-level features are upsampled and fused with low-level features to obtain the feature maps used for prediction.
Unlike a plain FPN, PP-YOLOE adds a bottom-up feature pyramid after the FPN. The FPN is top-down and passes strong high-level semantic features downwards, enhancing the whole pyramid, but it only propagates semantic information, not localization information. To address exactly this point, a bottom-up pyramid is appended after the FPN as a complement, passing strong low-level localization features upwards.
class CustomCSPPAN(nn.Layer):
__shared__ = ['norm_type', 'data_format', 'width_mult', 'depth_mult', 'trt']
def __init__(self, in_channels=[256, 512, 1024], out_channels=[1024, 512, 256], norm_type='bn', act='leaky',
stage_fn='CSPStage', block_fn='BasicBlock', stage_num=1, block_num=3, drop_block=False,
block_size=3, keep_prob=0.9, spp=False, data_format='NCHW', width_mult=1.0,
depth_mult=1.0, trt=False):
super(CustomCSPPAN, self).__init__()
out_channels = [max(round(c * width_mult), 1) for c in out_channels]
block_num = max(round(block_num * depth_mult), 1)
act = get_act_fn(act, trt=trt) if act is None or isinstance(act, (str, dict)) else act
self.num_blocks = len(in_channels)
self.data_format = data_format
self._out_channels = out_channels
in_channels = in_channels[::-1]
fpn_stages = []
fpn_routes = []
for i, (ch_in, ch_out) in enumerate(zip(in_channels, out_channels)):
if i > 0:
ch_in += ch_pre // 2
stage = nn.Sequential()
for j in range(stage_num):
stage.add_sublayer(str(j), eval(stage_fn)(block_fn, ch_in if j == 0 else ch_out, ch_out, block_num, act=act, spp=(spp and i == 0)))
if drop_block:
stage.add_sublayer('drop', DropBlock(block_size, keep_prob))
fpn_stages.append(stage)
if i < self.num_blocks - 1:
fpn_routes.append(ConvBNLayer(ch_in=ch_out, ch_out=ch_out // 2, filter_size=1, stride=1, padding=0, act=act))
ch_pre = ch_out
self.fpn_stages = nn.LayerList(fpn_stages)
self.fpn_routes = nn.LayerList(fpn_routes)
pan_stages = []
pan_routes = []
for i in reversed(range(self.num_blocks - 1)):
pan_routes.append(ConvBNLayer(ch_in=out_channels[i + 1], ch_out=out_channels[i + 1], filter_size=3, stride=2, padding=1, act=act))
ch_in = out_channels[i] + out_channels[i + 1]
ch_out = out_channels[i]
stage = nn.Sequential()
for j in range(stage_num):
stage.add_sublayer(str(j), eval(stage_fn)(block_fn, ch_in if j == 0 else ch_out, ch_out, block_num, act=act, spp=False))
if drop_block:
stage.add_sublayer('drop', DropBlock(block_size, keep_prob))
pan_stages.append(stage)
self.pan_stages = nn.LayerList(pan_stages[::-1])
self.pan_routes = nn.LayerList(pan_routes[::-1])
def forward(self, blocks, for_mot=False):
blocks = blocks[::-1]
fpn_feats = []
for i, block in enumerate(blocks):
if i > 0:
block = paddle.concat([route, block], axis=1)
route = self.fpn_stages[i](block)
fpn_feats.append(route)
if i < self.num_blocks - 1:
route = self.fpn_routes[i](route)
route = F.interpolate(route, scale_factor=2., data_format=self.data_format)
pan_feats = [fpn_feats[-1], ]
route = fpn_feats[-1]
for i in reversed(range(self.num_blocks - 1)):
block = fpn_feats[i]
route = self.pan_routes[i](route)
block = paddle.concat([route, block], axis=1)
route = self.pan_stages[i](block)
pan_feats.append(route)
return pan_feats[::-1]
1.3 Head
The head of PP-YOLOE is still the head of TOOD, i.e., T-Head, which mainly consists of a Cls Head and a Loc Head. Specifically, T-Head first performs classification and localization predictions on the FPN features; then TAL computes task alignment information based on the proposed task-alignment metric; finally, T-Head automatically adjusts the classification probabilities and localization predictions according to the information passed back from TAL.
Since both tasks are predicted from this shared interactive feature, but the two tasks clearly have different feature requirements, the authors design a layer attention to adapt the feature separately for each task. This part is also very simple and can be understood as a channel-wise attention mechanism. It produces a task-specific feature for each task, which is then used to generate the required classification or localization feature maps.
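The ESEAttn stem used by the head code below is not shown in this excerpt. The following is a rough sketch of what it does, consistent with how it is called as stem(feat, avg_feat), assuming the ConvBNLayer defined earlier; details of the official implementation may differ:

import paddle.nn as nn
import paddle.nn.functional as F

class ESEAttn(nn.Layer):
    def __init__(self, feat_channels, act='swish'):
        super(ESEAttn, self).__init__()
        self.fc = nn.Conv2D(feat_channels, feat_channels, 1)               # single FC (1x1 conv), eSE style
        self.conv = ConvBNLayer(feat_channels, feat_channels, 1, act=act)

    def forward(self, feat, avg_feat):
        weight = F.sigmoid(self.fc(avg_feat))   # channel weights from the globally pooled feature
        return self.conv(feat * weight)         # reweight the feature, then project it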
class PPYOLOEHead(nn.Layer):
__shared__ = ['num_classes', 'trt', 'exclude_nms']
__inject__ = ['static_assigner', 'assigner', 'nms']
def __init__(self,
in_channels=[1024, 512, 256],
num_classes=80,
act='swish',
fpn_strides=(32, 16, 8),
grid_cell_scale=5.0,
grid_cell_offset=0.5,
reg_max=16,
static_assigner_epoch=4,
use_varifocal_loss=True,
static_assigner='ATSSAssigner',
assigner='TaskAlignedAssigner',
nms='MultiClassNMS',
eval_input_size=[],
loss_weight={'class': 1.0, 'iou': 2.5, 'dfl': 0.5,},
trt=False,
exclude_nms=False):
super(PPYOLOEHead, self).__init__()
assert len(in_channels) > 0, "len(in_channels) should > 0"
self.in_channels = in_channels
self.num_classes = num_classes
self.fpn_strides = fpn_strides
self.grid_cell_scale = grid_cell_scale
self.grid_cell_offset = grid_cell_offset
self.reg_max = reg_max
self.iou_loss = GIoULoss()
self.loss_weight = loss_weight
self.use_varifocal_loss = use_varifocal_loss
self.eval_input_size = eval_input_size
self.static_assigner_epoch = static_assigner_epoch
self.static_assigner = static_assigner
self.assigner = assigner
self.nms = nms
self.exclude_nms = exclude_nms
# stem
self.stem_cls = nn.LayerList()
self.stem_reg = nn.LayerList()
act = get_act_fn(act, trt=trt) if act is None or isinstance(act, (str, dict)) else act
for in_c in self.in_channels:
self.stem_cls.append(ESEAttn(in_c, act=act))
self.stem_reg.append(ESEAttn(in_c, act=act))
# pred head
self.pred_cls = nn.LayerList()
self.pred_reg = nn.LayerList()
for in_c in self.in_channels:
self.pred_cls.append(nn.Conv2D(in_c, self.num_classes, 3, padding=1))
self.pred_reg.append(nn.Conv2D(in_c, 4 * (self.reg_max + 1), 3, padding=1))
# projection conv
self.proj_conv = nn.Conv2D(self.reg_max + 1, 1, 1, bias_attr=False)
self._init_weights()
@classmethod
def from_config(cls, cfg, input_shape):
return {'in_channels': [i.channels for i in input_shape], }
def forward_train(self, feats, targets):
anchors, anchor_points, num_anchors_list, stride_tensor = generate_anchors_for_grid_cell(feats, self.fpn_strides, self.grid_cell_scale, self.grid_cell_offset)
cls_score_list, reg_distri_list = [], []
for i, feat in enumerate(feats):
avg_feat = F.adaptive_avg_pool2d(feat, (1, 1))
cls_logit = self.pred_cls[i](self.stem_cls[i](feat, avg_feat) + feat)
reg_distri = self.pred_reg[i](self.stem_reg[i](feat, avg_feat))
# cls and reg
cls_score = F.sigmoid(cls_logit)
cls_score_list.append(cls_score.flatten(2).transpose([0, 2, 1]))
reg_distri_list.append(reg_distri.flatten(2).transpose([0, 2, 1]))
cls_score_list = paddle.concat(cls_score_list, axis=1)
reg_distri_list = paddle.concat(reg_distri_list, axis=1)
return self.get_loss([cls_score_list, reg_distri_list, anchors, anchor_points, num_anchors_list, stride_tensor], targets)
def forward_eval(self, feats):
if self.eval_input_size:
anchor_points, stride_tensor = self.anchor_points, self.stride_tensor
else:
anchor_points, stride_tensor = self._generate_anchors(feats)
cls_score_list, reg_dist_list = [], []
for i, feat in enumerate(feats):
b, _, h, w = feat.shape
l = h * w
avg_feat = F.adaptive_avg_pool2d(feat, (1, 1))
cls_logit = self.pred_cls[i](self.stem_cls[i](feat, avg_feat) + feat)
reg_dist = self.pred_reg[i](self.stem_reg[i](feat, avg_feat))
reg_dist = reg_dist.reshape([-1, 4, self.reg_max + 1, l]).transpose([0, 2, 1, 3])
reg_dist = self.proj_conv(F.softmax(reg_dist, axis=1))
# cls and reg
cls_score = F.sigmoid(cls_logit)
cls_score_list.append(cls_score.reshape([b, self.num_classes, l]))
reg_dist_list.append(reg_dist.reshape([b, 4, l]))
cls_score_list = paddle.concat(cls_score_list, axis=-1)
reg_dist_list = paddle.concat(reg_dist_list, axis=-1)
return cls_score_list, reg_dist_list, anchor_points, stride_tensor
def forward(self, feats, targets=None):
assert len(feats) == len(self.fpn_strides), "The size of feats is not equal to size of fpn_strides"
if self.training:
return self.forward_train(feats, targets)
else:
return self.forward_eval(feats)
2. Sample Assignment
2.1 The ATSS Assigner
The ATSS paper points out that the difference between one-stage anchor-based and center-based anchor-free detectors mainly comes from how positive and negative samples are selected. Based on this, it proposes ATSS (Adaptive Training Sample Selection), which automatically selects suitable anchor boxes as positives according to statistics of the GT boxes, and greatly improves model performance without extra computation or parameters.
ATSS selects positive samples as follows (a small NumPy sketch of the core rule is given right after this list):
Compute the IoU between each gt bbox and all anchors of the multi-scale output levels.
Compute the L2 distance between the center of each gt bbox and the centers of all anchors of the multi-scale output levels.
For each output level and each gt bbox, pick the top-k (a hyperparameter, 9 by default) anchors with the smallest L2 distance. With l output levels, each gt bbox gets top-k × l candidate locations.
For each gt bbox, compute the mean and standard deviation of the IoUs of all its candidates; their sum is the adaptive threshold for that gt bbox.
For each gt bbox, the candidates whose IoU exceeds the threshold are taken as positives and are responsible for predicting that gt bbox.
If top-k is set too large, some positive locations may fall outside the gt bbox; these positives must be filtered out and treated as background samples.
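The toy NumPy sketch below (hypothetical numbers, a single GT and a single level) illustrates the core rule: candidates are the top-k anchors by center distance, and the IoU threshold is mean + std of the candidates' IoUs. The full batched, multi-level implementation is the ATSSAssigner class further below.

import numpy as np

ious = np.array([0.05, 0.30, 0.55, 0.62, 0.48, 0.10])   # IoU of each anchor with the GT
center_dist = np.array([80., 12., 6., 4., 9., 60.])      # distance from each anchor center to the GT center
topk = 4
candidates = np.argsort(center_dist)[:topk]               # top-k anchors closest to the GT center
thr = ious[candidates].mean() + ious[candidates].std()    # adaptive IoU threshold (mean + std)
positives = [int(i) for i in candidates if ious[i] >= thr]
print(round(float(thr), 3), positives)                    # ~0.607, [3]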
1. Two key properties of ATSS:
It guarantees that all positive anchors lie around the ground truth.
Most importantly, it fine-tunes the positive-sample threshold per level according to the characteristics of each level.
2. Contributions of ATSS
It points out that the essential difference between anchor-based and anchor-free detectors is actually how positive and negative training samples are defined;
It proposes Adaptive Training Sample Selection, which automatically selects positive and negative samples according to the statistical characteristics of the objects;
It shows that tiling multiple anchors at every location of the image does not improve detection performance.
class ATSSAssigner(nn.Layer):
"""Bridging the Gap Between Anchor-based and Anchor-free Detection
via Adaptive Training Sample Selection
"""
__shared__ = ['num_classes']
def __init__(self, topk=9, num_classes=80, force_gt_matching=False, eps=1e-9):
super(ATSSAssigner, self).__init__()
self.topk = topk
self.num_classes = num_classes
self.force_gt_matching = force_gt_matching
self.eps = eps
def _gather_topk_pyramid(self, gt2anchor_distances, num_anchors_list, pad_gt_mask):
pad_gt_mask = pad_gt_mask.tile([1, 1, self.topk]).astype(paddle.bool)
gt2anchor_distances_list = paddle.split(gt2anchor_distances, num_anchors_list, axis=-1)
num_anchors_index = np.cumsum(num_anchors_list).tolist()
num_anchors_index = [0, ] + num_anchors_index[:-1]
is_in_topk_list = []
topk_idxs_list = []
for distances, anchors_index in zip(gt2anchor_distances_list, num_anchors_index):
num_anchors = distances.shape[-1]
topk_metrics, topk_idxs = paddle.topk(distances, self.topk, axis=-1, largest=False)
topk_idxs_list.append(topk_idxs + anchors_index)
topk_idxs = paddle.where(pad_gt_mask, topk_idxs, paddle.zeros_like(topk_idxs))
is_in_topk = F.one_hot(topk_idxs, num_anchors).sum(axis=-2)
is_in_topk = paddle.where(is_in_topk > 1, paddle.zeros_like(is_in_topk), is_in_topk)
is_in_topk_list.append(is_in_topk.astype(gt2anchor_distances.dtype))
is_in_topk_list = paddle.concat(is_in_topk_list, axis=-1)
topk_idxs_list = paddle.concat(topk_idxs_list, axis=-1)
return is_in_topk_list, topk_idxs_list
@paddle.no_grad()
def forward(self, anchor_bboxes, num_anchors_list, gt_labels, gt_bboxes, pad_gt_mask, bg_index, gt_scores=None, pred_bboxes=None):
"""
The ATSS assignment proceeds as follows:
1. Compute the IoU between all anchor bboxes and the GT boxes
2. Compute the center distance between all anchor bboxes and the GT boxes
3. On each pyramid level, for each gt, select the k anchors whose centers are closest to the gt center; in total k*l candidates are selected for each gt
4. Take the IoUs of these candidates, compute their mean and std, and use mean + std as the IoU threshold
5. Select candidates whose IoU is greater than or equal to the threshold as positives
6. Limit the centers of the positive samples to lie inside the gt
7. If an anchor box is assigned to multiple gts, keep the one with the highest IoU.
Args:
anchor_bboxes (Tensor, float32): pre-defined anchors, shape(L, 4),
"xmin, xmax, ymin, ymax" format
num_anchors_list (List): num of anchors in each level
gt_labels (Tensor, int64|int32): Label of gt_bboxes, shape(B, n, 1)
gt_bboxes (Tensor, float32): Ground truth bboxes, shape(B, n, 4)
pad_gt_mask (Tensor, float32): 1 means bbox, 0 means no bbox, shape(B, n, 1)
bg_index (int): background index
gt_scores (Tensor|None, float32) Score of gt_bboxes,
shape(B, n, 1), if None, then it will initialize with one_hot label
pred_bboxes (Tensor, float32, optional): predicted bounding boxes, shape(B, L, 4)
Returns:
assigned_labels (Tensor): (B, L)
assigned_bboxes (Tensor): (B, L, 4)
assigned_scores (Tensor): (B, L, C), if pred_bboxes is not None, then output ious
"""
assert gt_labels.ndim == gt_bboxes.ndim and gt_bboxes.ndim == 3
num_anchors, _ = anchor_bboxes.shape
batch_size, num_max_boxes, _ = gt_bboxes.shape
# 1. compute IoU between GT and all anchor bboxes, [B, n, L]
ious = iou_similarity(gt_bboxes.reshape([-1, 4]), anchor_bboxes)
ious = ious.reshape([batch_size, -1, num_anchors])
# 2. compute center distance between GT and all anchor bboxes, [B, n, L]
gt_centers = bbox_center(gt_bboxes.reshape([-1, 4])).unsqueeze(1)
anchor_centers = bbox_center(anchor_bboxes)
gt2anchor_distances = (gt_centers - anchor_centers.unsqueeze(0)).norm(2, axis=-1).reshape([batch_size, -1, num_anchors])
# 3. on each pyramid level, for each gt, select the k anchors whose centers are closest to the gt center; k*l candidates in total per gt
# based on the center distance, [B, n, L]
is_in_topk, topk_idxs = self._gather_topk_pyramid(gt2anchor_distances, num_anchors_list, pad_gt_mask)
# 4. take the IoUs of these candidates, compute mean and std, and use mean + std as the IoU threshold
iou_candidates = ious * is_in_topk
iou_threshold = paddle.index_sample(iou_candidates.flatten(stop_axis=-2), topk_idxs.flatten(stop_axis=-2))
iou_threshold = iou_threshold.reshape([batch_size, num_max_boxes, -1])
iou_threshold = iou_threshold.mean(axis=-1, keepdim=True) + iou_threshold.std(axis=-1, keepdim=True)
is_in_topk = paddle.where(iou_candidates > iou_threshold.tile([1, 1, num_anchors]), is_in_topk, paddle.zeros_like(is_in_topk))
# 6. limit the centers of the positive samples inside the gt, [B, n, L]
is_in_gts = check_points_inside_bboxes(anchor_centers, gt_bboxes)
# select positive samples, [B, n, L]
mask_positive = is_in_topk * is_in_gts * pad_gt_mask
# 7. if an anchor box is assigned to multiple gts, keep the one with the highest IoU
mask_positive_sum = mask_positive.sum(axis=-2)
if mask_positive_sum.max() > 1:
mask_multiple_gts = (mask_positive_sum.unsqueeze(1) > 1).tile([1, num_max_boxes, 1])
is_max_iou = compute_max_iou_anchor(ious)
mask_positive = paddle.where(mask_multiple_gts, is_max_iou, mask_positive)
mask_positive_sum = mask_positive.sum(axis=-2)
# 8. make sure every gt_bbox is matched to an anchor
if self.force_gt_matching:
is_max_iou = compute_max_iou_gt(ious) * pad_gt_mask
mask_max_iou = (is_max_iou.sum(-2, keepdim=True) == 1).tile([1, num_max_boxes, 1])
mask_positive = paddle.where(mask_max_iou, is_max_iou, mask_positive)
mask_positive_sum = mask_positive.sum(axis=-2)
assigned_gt_index = mask_positive.argmax(axis=-2)
# gather the assigned targets
batch_ind = paddle.arange(end=batch_size, dtype=gt_labels.dtype).unsqueeze(-1)
assigned_gt_index = assigned_gt_index + batch_ind * num_max_boxes
assigned_labels = paddle.gather(gt_labels.flatten(), assigned_gt_index.flatten(), axis=0)
assigned_labels = assigned_labels.reshape([batch_size, num_anchors])
assigned_labels = paddle.where(mask_positive_sum > 0, assigned_labels, paddle.full_like(assigned_labels, bg_index))
assigned_bboxes = paddle.gather(gt_bboxes.reshape([-1, 4]), assigned_gt_index.flatten(), axis=0)
assigned_bboxes = assigned_bboxes.reshape([batch_size, num_anchors, 4])
assigned_scores = F.one_hot(assigned_labels, self.num_classes)
if pred_bboxes is not None:
# assigned iou
ious = batch_iou_similarity(gt_bboxes, pred_bboxes) * mask_positive
ious = ious.max(axis=-2).unsqueeze(-1)
assigned_scores *= ious
elif gt_scores is not None:
gather_scores = paddle.gather(gt_scores.flatten(), assigned_gt_index.flatten(), axis=0)
gather_scores = gather_scores.reshape([batch_size, num_anchors])
gather_scores = paddle.where(mask_positive_sum > 0, gather_scores, paddle.zeros_like(gather_scores))
assigned_scores *= gather_scores.unsqueeze(-1)
return assigned_labels, assigned_bboxes, assigned_scores
2.2 The Task-Aligned Assigner (TOOD)
TOOD proposes Task Alignment Learning (TAL) to explicitly pull the optimal anchors of the two tasks closer together. This is achieved through a sample assignment strategy and a task-aligned loss: the sample assignment computes a task alignment degree for each anchor, while the task-aligned loss gradually unifies the best anchors for classification and localization.
Like recently proposed one-stage detectors, TOOD adopts a Backbone-FPN-Head architecture. For efficiency and simplicity, TOOD, like ATSS, places a single anchor at each location.
As discussed, because the classification and localization tasks diverge, existing one-stage detectors suffer from task misalignment. TOOD addresses this by explicitly aligning the two tasks with T-Head + TAL, as shown in the figure above; T-Head and TAL work in a collaborative way to improve the alignment of the two tasks.
Concretely, TOOD selects samples as follows: first, T-Head performs classification and localization predictions on the FPN features; then, TAL computes task alignment information based on the proposed task-alignment metric; finally, T-Head automatically adjusts the classification probabilities and localization predictions according to the information passed back from TAL.
1. Task-Aligned Head
The proposed T-Head is shown in the figure below. It has a very simple structure: a feature extractor plus TAP (Task-Aligned Predictor) modules.
To strengthen the interaction between classification and localization, the feature extractor learns task-interactive features, shown as the blue box in the figure. This design not only facilitates task interaction but also provides multi-level, multi-scale features for the two tasks.
Let X^fpn denote the FPN feature. The feature extractor uses N consecutive convolutions to compute the task-interactive features: X_1^inter = δ(conv_1(X^fpn)) and X_k^inter = δ(conv_k(X_{k-1}^inter)) for k > 1, where δ denotes the activation function.
The feature extractor thus yields rich multi-scale features, which are fed into the two subsequent TAP modules for classification and localization alignment.
2. Task-Aligned Sample Assignment
To cooperate with NMS, the anchor assignment of training samples should satisfy the following rules:
A well-aligned anchor should predict a high classification score together with a precise localization;
A misaligned anchor should have a low classification score and be suppressed in the NMS stage.
Based on these two rules, TOOD designs an anchor alignment metric to measure the degree of task alignment at the anchor level. This alignment metric is integrated into the sample assignment and the loss function to dynamically refine the prediction of each anchor.
Since the classification score and the IoU characterize prediction quality for the two tasks, the task alignment is measured by a high-order combination of the two, defined as t = s^α · u^β,
where s and u denote the classification score and the IoU value respectively, and α and β control the influence of the two terms. Therefore t plays an important role in the joint optimization, encouraging the network to focus dynamically on high-quality, task-aligned anchors.
To improve the alignment of the two tasks, TAL focuses on task-aligned anchors and adopts a simple assignment rule to select training samples: for each instance, the m anchors with the largest t values are selected as positives, and the remaining anchors as negatives. Training then proceeds with new loss functions designed to align classification and localization.
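A toy sketch of the alignment metric and top-m selection (hypothetical values; the real batched logic lives in the TaskAlignedAssigner class below):

import numpy as np

alpha, beta, m = 1.0, 6.0, 2
cls_score = np.array([0.70, 0.55, 0.20, 0.80])   # s: predicted score for the GT class
iou = np.array([0.60, 0.75, 0.30, 0.50])          # u: IoU between the predicted box and the GT
t = (cls_score ** alpha) * (iou ** beta)          # task alignment metric t = s^alpha * u^beta
positives = np.argsort(-t)[:m]                     # the top-m most aligned anchors become positives
print(t.round(4), positives)                       # [0.0327 0.0979 0.0001 0.0125], [1 0]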
class TaskAlignedAssigner(nn.Layer):
def __init__(self, topk=13, alpha=1.0, beta=6.0, eps=1e-9):
super(TaskAlignedAssigner, self).__init__()
self.topk = topk
self.alpha = alpha
self.beta = beta
self.eps = eps
@paddle.no_grad()
def forward(self, pred_scores, pred_bboxes, anchor_points, num_anchors_list, gt_labels, gt_bboxes, pad_gt_mask, bg_index, gt_scores=None):
"""
The Task-Aligned Assigner proceeds as follows:
1. Compute the alignment metric between all predicted bboxes and the gts
2. Select the top-k bboxes as candidates for each gt
3. Limit the centers of the positive samples inside the gt (because an anchor-free detector can only predict positive distances)
4. If an anchor box is assigned to multiple gts, keep the one with the highest IoU.
Args:
pred_scores (Tensor, float32): predicted class probability, shape(B, L, C)
pred_bboxes (Tensor, float32): predicted bounding boxes, shape(B, L, 4)
anchor_points (Tensor, float32): pre-defined anchors, shape(L, 2), "cxcy" format
num_anchors_list (List): num of anchors in each level, shape(L)
gt_labels (Tensor, int64|int32): Label of gt_bboxes, shape(B, n, 1)
gt_bboxes (Tensor, float32): Ground truth bboxes, shape(B, n, 4)
pad_gt_mask (Tensor, float32): 1 means bbox, 0 means no bbox, shape(B, n, 1)
bg_index (int): background index
gt_scores (Tensor|None, float32) Score of gt_bboxes, shape(B, n, 1)
Returns:
assigned_labels (Tensor): (B, L)
assigne