Machine Learning: Implementing YOLO v3 in TensorFlow (TF-Slim)

Demo image with detected objects

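I assume you are familiar with convolutional neural networks (CNNs), object detection, the YOLO v3 architecture, and the TensorFlow and TF-Slim frameworks. If not, it is better to start with the corresponding papers and tutorials. I will not explain what each line does; instead I present working code and explain the problems I stumbled upon along the way.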

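1. Setup

I want to organise the code in a way similar to how it is organised in the TensorFlow models repository. I use TF-Slim because it lets us define common arguments such as the activation function and the batch normalization parameters once, as defaults, which makes defining neural networks much faster.

We start with the yolo_v3.py file, where we will put the functions that initialise the network and the function that loads the pre-trained weights.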

# -*- coding: utf-8 -*-

import numpy as np  # used later by load_weights
import tensorflow as tf

slim = tf.contrib.slim


def darknet53(inputs):
    """
    Builds Darknet-53 model.
    """
    pass


def yolo_v3(inputs, num_classes, is_training=False, data_format='NCHW', reuse=False):
    """
    Creates YOLO v3 model.

    :param inputs: a 4-D tensor of size [batch_size, height, width, channels].
        Dimension batch_size may be undefined.
    :param num_classes: number of predicted classes.
    :param is_training: whether is training or not.
    :param data_format: data format NCHW or NHWC.
    :param reuse: whether or not the network and its variables should be reused.
    :return:
    """
    pass


def load_weights(var_list, weights_file):
    """
    Loads and converts pre-trained weights.

    :param var_list: list of network variables.
    :param weights_file: name of the binary file.
    :return:
    """
    pass

At the top of the file, add the necessary constants (tuned by the authors of YOLO).

_BATCH_NORM_DECAY = 0.9

_BATCH_NORM_EPSILON = 1e-05

_LEAKY_RELU = 0.1

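YOLO v3 normalizes the input to the range 0..1. Most layers in the detector apply batch normalization right after the convolution, have no biases, and use the Leaky ReLU activation. It is therefore convenient to define a slim arg scope that handles these cases for us. In the layers that do not use BN and Leaky ReLU, we need to override these defaults explicitly.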

    # spatial size of the input image; the detection layers need it later
    img_size = inputs.get_shape().as_list()[1:3]

    # transpose the inputs to NCHW
    if data_format == 'NCHW':
        inputs = tf.transpose(inputs, [0, 3, 1, 2])

    # normalize values to range [0..1]
    inputs = inputs / 255

    # set batch norm params
    batch_norm_params = {
        'decay': _BATCH_NORM_DECAY,
        'epsilon': _BATCH_NORM_EPSILON,
        'scale': True,
        'is_training': is_training,
        'fused': None,  # Use fused batch norm if possible.
    }

    # Set activation_fn and parameters for conv2d, batch_norm.
    with slim.arg_scope([slim.conv2d, slim.batch_norm, _fixed_padding], data_format=data_format, reuse=reuse):
        with slim.arg_scope([slim.conv2d], normalizer_fn=slim.batch_norm, normalizer_params=batch_norm_params,
                            biases_initializer=None, activation_fn=lambda x: tf.nn.leaky_relu(x, alpha=_LEAKY_RELU)):
            with tf.variable_scope('darknet-53'):
                inputs = darknet53(inputs)
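To make the defaults set by the arg scope concrete, here is a minimal pure-Python sketch of what happens to a single bias-free convolution output at inference time: batch normalization with fixed moving statistics, followed by Leaky ReLU. The function names are made up for illustration and are not part of yolo_v3.py; only the two constants come from the article.

```python
import math

# Constants taken from the article.
_BATCH_NORM_EPSILON = 1e-05
_LEAKY_RELU = 0.1


def batch_norm_inference(x, gamma, beta, mean, variance, eps=_BATCH_NORM_EPSILON):
    """Batch normalization with fixed (moving) statistics, as applied at inference time."""
    return gamma * (x - mean) / math.sqrt(variance + eps) + beta


def leaky_relu(x, alpha=_LEAKY_RELU):
    """Leaky ReLU: identity for positive inputs, a small slope for negative ones."""
    return x if x > 0 else alpha * x


def conv_post_processing(conv_output, gamma, beta, mean, variance):
    """Applies batch norm and then Leaky ReLU to one convolution output value."""
    return leaky_relu(batch_norm_inference(conv_output, gamma, beta, mean, variance))


print(conv_post_processing(2.0, gamma=1.0, beta=0.0, mean=0.0, variance=1.0))
print(conv_post_processing(-2.0, gamma=1.0, beta=0.0, mean=0.0, variance=1.0))
```

With unit gamma and variance the positive value passes through almost unchanged, while the negative one is scaled by roughly 0.1.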

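We are now ready to define the Darknet-53 layers.

2. Implementation of the Darknet-53 layers

In the YOLO v3 paper, the authors present a new, deeper feature extractor called Darknet-53. As its name suggests, it contains 53 convolutional layers, each followed by a batch normalization layer and a Leaky ReLU activation. Downsampling is done by conv layers with stride=2.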

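Before we define the convolutional layers, we have to realise that the authors' implementation uses fixed padding that is independent of the input size. To get the same behaviour, we can use the function below.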

@tf.contrib.framework.add_arg_scope
def _fixed_padding(inputs, kernel_size, *args, mode='CONSTANT', **kwargs):
    """
    Pads the input along the spatial dimensions independently of input size.

    Args:
      inputs: A tensor of size [batch, channels, height_in, width_in] or
        [batch, height_in, width_in, channels] depending on data_format.
      kernel_size: The kernel to be used in the conv2d or max_pool2d operation.
                   Should be a positive integer.
      data_format: The input format ('NHWC' or 'NCHW').
      mode: The mode for tf.pad.

    Returns:
      A tensor with the same format as the input with the data either intact
      (if kernel_size == 1) or padded (if kernel_size > 1).
    """
    pad_total = kernel_size - 1
    pad_beg = pad_total // 2
    pad_end = pad_total - pad_beg

    if kwargs['data_format'] == 'NCHW':
        padded_inputs = tf.pad(inputs, [[0, 0], [0, 0],
                                        [pad_beg, pad_end], [pad_beg, pad_end]], mode=mode)
    else:
        padded_inputs = tf.pad(inputs, [[0, 0], [pad_beg, pad_end],
                                        [pad_beg, pad_end], [0, 0]], mode=mode)
    return padded_inputs

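_fixed_padding pads the input along the height and width dimensions with the appropriate number of zeros (when mode='CONSTANT'). We will also use mode='SYMMETRIC' later on.

Now we can define the _conv2d_fixed_padding function: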

def _conv2d_fixed_padding(inputs, filters, kernel_size, strides=1):
    if strides > 1:
        inputs = _fixed_padding(inputs, kernel_size)
    inputs = slim.conv2d(inputs, filters, kernel_size, stride=strides, padding=('SAME' if strides == 1 else 'VALID'))
    return inputs
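The effect of fixed padding on the spatial size can be checked with a little arithmetic. The sketch below is pure Python with illustrative helper names that do not appear in yolo_v3.py. It repeats the padding computation of _fixed_padding and derives the output size of the 'VALID' stride-2 convolution that follows it, assuming a 416-pixel input side (an assumption; the article does not fix the input size at this point).

```python
def fixed_padding_amounts(kernel_size):
    """Same arithmetic as _fixed_padding: the total padding is kernel_size - 1."""
    pad_total = kernel_size - 1
    pad_beg = pad_total // 2
    pad_end = pad_total - pad_beg
    return pad_beg, pad_end


def strided_conv_output_size(input_size, kernel_size, stride):
    """Output size of a 'VALID' convolution applied after fixed padding."""
    pad_beg, pad_end = fixed_padding_amounts(kernel_size)
    padded = input_size + pad_beg + pad_end
    return (padded - kernel_size) // stride + 1


size = 416  # assumed input side length
for _ in range(5):  # Darknet-53 has five stride-2 convolutions
    size = strided_conv_output_size(size, kernel_size=3, stride=2)
    print(size)
```

Each stride-2 layer halves the side length, so the five downsampling steps give an overall stride of 32.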

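The Darknet-53 model is built from a number of blocks, each with 2 conv layers and a shortcut connection, followed by a downsampling layer. To avoid boilerplate code, we define the _darknet53_block function: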

def _darknet53_block(inputs, filters):
    shortcut = inputs
    inputs = _conv2d_fixed_padding(inputs, filters, 1)
    inputs = _conv2d_fixed_padding(inputs, filters * 2, 3)

    inputs = inputs + shortcut
    return inputs

Finally, we have all the building blocks required for the Darknet-53 model:

def darknet53(inputs):
    """
    Builds Darknet-53 model.
    """
    inputs = _conv2d_fixed_padding(inputs, 32, 3)
    inputs = _conv2d_fixed_padding(inputs, 64, 3, strides=2)
    inputs = _darknet53_block(inputs, 32)
    inputs = _conv2d_fixed_padding(inputs, 128, 3, strides=2)

    for i in range(2):
        inputs = _darknet53_block(inputs, 64)

    inputs = _conv2d_fixed_padding(inputs, 256, 3, strides=2)

    for i in range(8):
        inputs = _darknet53_block(inputs, 128)

    inputs = _conv2d_fixed_padding(inputs, 512, 3, strides=2)

    for i in range(8):
        inputs = _darknet53_block(inputs, 256)

    inputs = _conv2d_fixed_padding(inputs, 1024, 3, strides=2)

    for i in range(4):
        inputs = _darknet53_block(inputs, 512)

    return inputs

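After the last block, the original architecture has a global average pooling layer and a softmax, but YOLO v3 does not use them (so in practice we have 52 layers instead of 53).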
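The figure of 52 can be verified by counting the convolutions in the darknet53 function above. The short tally below is only a bookkeeping sketch; the variable names are illustrative.

```python
# Number of residual blocks in each of the five stages of darknet53().
blocks_per_stage = [1, 2, 8, 8, 4]

first_conv = 1                              # the initial 3x3 conv with 32 filters
downsampling_convs = len(blocks_per_stage)  # one stride-2 conv in front of each stage
block_convs = 2 * sum(blocks_per_stage)     # every _darknet53_block holds two convs

total_convs = first_conv + downsampling_convs + block_convs
print(total_convs)  # 52
```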

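3. Implementation of the YOLO v3 detection layers

The features extracted by Darknet-53 are passed to the detection layers. The detection module is built from a number of conv layers grouped in blocks, upsampling layers, and 3 conv layers with a linear activation function that make detections at 3 different scales. We start by writing the helper function _yolo_block: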

def _yolo_block(inputs, filters):
    inputs = _conv2d_fixed_padding(inputs, filters, 1)
    inputs = _conv2d_fixed_padding(inputs, filters * 2, 3)
    inputs = _conv2d_fixed_padding(inputs, filters, 1)
    inputs = _conv2d_fixed_padding(inputs, filters * 2, 3)
    inputs = _conv2d_fixed_padding(inputs, filters, 1)
    route = inputs
    inputs = _conv2d_fixed_padding(inputs, filters * 2, 3)
    return route, inputs

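The activations from the 5th layer in the block are routed to another conv layer and then upsampled, while the activations from the 6th layer go to _detection_layer, which we define now: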

def _detection_layer(inputs, num_classes, anchors, img_size, data_format):
    num_anchors = len(anchors)
    # linear 1x1 conv with biases: no batch norm and no activation here
    predictions = slim.conv2d(inputs, num_anchors * (5 + num_classes), 1, stride=1, normalizer_fn=None,
                              activation_fn=None, biases_initializer=tf.zeros_initializer())

    shape = predictions.get_shape().as_list()
    grid_size = _get_size(shape, data_format)
    dim = grid_size[0] * grid_size[1]
    bbox_attrs = 5 + num_classes

    # flatten the grid so that every row describes one box
    if data_format == 'NCHW':
        predictions = tf.reshape(predictions, [-1, num_anchors * bbox_attrs, dim])
        predictions = tf.transpose(predictions, [0, 2, 1])

    predictions = tf.reshape(predictions, [-1, num_anchors * dim, bbox_attrs])

    stride = (img_size[0] // grid_size[0], img_size[1] // grid_size[1])

    # express the anchors in grid-cell units
    anchors = [(a[0] / stride[0], a[1] / stride[1]) for a in anchors]

    box_centers, box_sizes, confidence, classes = tf.split(predictions, [2, 2, 1, num_classes], axis=-1)

    box_centers = tf.nn.sigmoid(box_centers)
    confidence = tf.nn.sigmoid(confidence)

    # offsets of the grid cells, repeated for every anchor
    grid_x = tf.range(grid_size[0], dtype=tf.float32)
    grid_y = tf.range(grid_size[1], dtype=tf.float32)
    a, b = tf.meshgrid(grid_x, grid_y)

    x_offset = tf.reshape(a, (-1, 1))
    y_offset = tf.reshape(b, (-1, 1))

    x_y_offset = tf.concat([x_offset, y_offset], axis=-1)
    x_y_offset = tf.reshape(tf.tile(x_y_offset, [1, num_anchors]), [1, -1, 2])

    # box centers in input-image pixels
    box_centers = box_centers + x_y_offset
    box_centers = box_centers * stride

    # box sizes in input-image pixels
    anchors = tf.tile(anchors, [dim, 1])
    box_sizes = tf.exp(box_sizes) * anchors
    box_sizes = box_sizes * stride

    detections = tf.concat([box_centers, box_sizes, confidence], axis=-1)

    classes = tf.nn.sigmoid(classes)
    predictions = tf.concat([detections, classes], axis=-1)
    return predictions

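This layer transforms the raw predictions according to the bounding-box equations from the YOLO paper. Because YOLO v3 detects objects of different sizes and aspect ratios at each scale, the layer takes an anchors argument, which is a list of 3 (width, height) tuples for each scale. The anchors need to be tailored to the dataset (in this tutorial we use the anchors for the COCO dataset). Just add this constant at the top of the yolo_v3.py file.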

_ANCHORS = [(10, 13), (16, 30), (33, 23), (30, 61), (62, 45), (59, 119), (116, 90), (156, 198), (373, 326)]
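For reference, the transformation applied by _detection_layer can be written for a single anchor in a single grid cell: the center is (sigmoid(t_x) + c_x, sigmoid(t_y) + c_y) scaled by the stride, and the size is the anchor size multiplied by exp(t_w) and exp(t_h). The sketch below is pure Python; decode_box and its parameter names are illustrative and not part of the article's code.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def decode_box(t_x, t_y, t_w, t_h, cell_x, cell_y, anchor, stride):
    """Turns the raw outputs for one anchor in one grid cell into a box in input-image pixels.

    anchor is an (anchor_w, anchor_h) pair in pixels, e.g. one entry of _ANCHORS.
    """
    anchor_w, anchor_h = anchor
    center_x = (sigmoid(t_x) + cell_x) * stride
    center_y = (sigmoid(t_y) + cell_y) * stride
    width = anchor_w * math.exp(t_w)
    height = anchor_h * math.exp(t_h)
    return center_x, center_y, width, height


# All-zero raw outputs place the box in the middle of the cell with exactly the anchor size.
print(decode_box(0.0, 0.0, 0.0, 0.0, cell_x=3, cell_y=5, anchor=(116, 90), stride=32))
```

Dividing the anchors by the stride and multiplying the result by the stride again, as the TensorFlow code does, gives the same sizes as using the pixel anchors directly.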

We need one small helper function, _get_size, which returns the height and width of the input:

def _get_size(shape, data_format):
    if len(shape) == 4:
        shape = shape[1:]
    return shape[1:3] if data_format == 'NCHW' else shape[0:2]

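As mentioned earlier, the last building block we need to implement YOLO v3 is the upsample layer. The YOLO detector uses bilinear upsampling. Why can't we just use the standard tf.image.resize_bilinear method from the TensorFlow API? The reason is that, as of today (TF version 1.8.0), all upsampling methods use constant padding mode, whereas the standard padding method in the YOLO authors' repo and in PyTorch is edge. This small difference has a significant impact on the detections.

To work around this, we manually pad the input with 1 pixel using mode='SYMMETRIC', which for a 1-pixel pad is equivalent to edge mode: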

# we just need to pad with one pixel, so we set kernel_size = 3
# (data_format is supplied by the surrounding slim arg scope)
inputs = _fixed_padding(inputs, 3, mode='SYMMETRIC')
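The difference between the two padding modes is easiest to see on one row of pixels. The following pure-Python sketch imitates the 'CONSTANT' and 'SYMMETRIC' modes of tf.pad on a list; pad_1d is a hypothetical helper written only for this illustration.

```python
def pad_1d(values, pad, mode):
    """Pads a 1-D list on both sides, mimicking tf.pad's 'CONSTANT' and 'SYMMETRIC' modes."""
    values = list(values)
    if pad == 0:
        return values
    if mode == 'CONSTANT':
        left = [0] * pad
        right = [0] * pad
    elif mode == 'SYMMETRIC':
        # mirror the values, including the border element itself
        left = values[:pad][::-1]
        right = values[-pad:][::-1]
    else:
        raise ValueError('unsupported mode: %s' % mode)
    return left + values + right


row = [5, 7, 9]
print(pad_1d(row, 1, 'CONSTANT'))   # [0, 5, 7, 9, 0]
print(pad_1d(row, 1, 'SYMMETRIC'))  # [5, 5, 7, 9, 9]
```

With a single pixel of padding, 'SYMMETRIC' simply repeats the border value, which is what edge padding does; interpolating against repeated border pixels instead of zeros keeps the edges of the upsampled feature map from being darkened.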

The whole _upsample function looks as follows:

def _upsample(inputs, out_shape, data_format='NCHW'):
    # we need to pad with one pixel, so we set kernel_size = 3
    inputs = _fixed_padding(inputs, 3, mode='SYMMETRIC')

    # tf.image.resize_bilinear accepts input in format NHWC
    if data_format == 'NCHW':
        inputs = tf.transpose(inputs, [0, 2, 3, 1])

    # out_shape is [N, C, H, W] for NCHW and [N, H, W, C] for NHWC
    if data_format == 'NCHW':
        height = out_shape[2]
        width = out_shape[3]
    else:
        height = out_shape[1]
        width = out_shape[2]

    # we padded with 1 pixel from each side and upsample by factor of 2, so new dimensions will be
    # greater by 4 pixels after interpolation
    new_height = height + 4
    new_width = width + 4

    inputs = tf.image.resize_bilinear(inputs, (new_height, new_width))

    # trim back to desired size
    inputs = inputs[:, 2:-2, 2:-2, :]

    # back to NCHW if needed
    if data_format == 'NCHW':
        inputs = tf.transpose(inputs, [0, 3, 1, 2])

    inputs = tf.identity(inputs, name='upsampled')
    return inputs
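The size bookkeeping in _upsample (pad by 1, resize to the target plus 4, crop 2 from each side) can be traced for one spatial dimension. This is a small arithmetic sketch with an illustrative function name, assuming an upsampling factor of 2 as in the article.

```python
def upsample_sizes(in_size, factor=2, pad=1):
    """Tracks one spatial dimension through pad -> resize -> crop, as in _upsample."""
    target = in_size * factor             # the size we want in the end (taken from out_shape)
    padded = in_size + 2 * pad            # after _fixed_padding with kernel_size = 3
    resized = target + 2 * pad * factor   # size requested from resize_bilinear (target + 4)
    cropped = resized - 2 * pad * factor  # after inputs[:, 2:-2, 2:-2, :]
    return padded, resized, cropped


print(upsample_sizes(13))  # (15, 30, 26)
```

The resized length is exactly twice the padded length, so the interpolation really is a factor-2 upsampling of the padded map, and the crop removes the 2 output pixels that each padded input pixel turned into.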

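The upsampled activations are concatenated along the channel axis with activations from the Darknet-53 layers. That is why we need to go back to the darknet53 function and also return the activations from the conv layers just before the 4th and 5th downsampling layers.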

def darknet53(inputs):
    """
    Builds Darknet-53 model.
    """
    inputs = _conv2d_fixed_padding(inputs, 32, 3)
    inputs = _conv2d_fixed_padding(inputs, 64, 3, strides=2)
    inputs = _darknet53_block(inputs, 32)
    inputs = _conv2d_fixed_padding(inputs, 128, 3, strides=2)

    for i in range(2):
        inputs = _darknet53_block(inputs, 64)

    inputs = _conv2d_fixed_padding(inputs, 256, 3, strides=2)

    for i in range(8):
        inputs = _darknet53_block(inputs, 128)

    route1 = inputs
    inputs = _conv2d_fixed_padding(inputs, 512, 3, strides=2)

    for i in range(8):
        inputs = _darknet53_block(inputs, 256)

    route2 = inputs
    inputs = _conv2d_fixed_padding(inputs, 1024, 3, strides=2)

    for i in range(4):
        inputs = _darknet53_block(inputs, 512)

    return route1, route2, inputs

Now we are ready to define the detector module. Let's go back to the yolo_v3 function and add the following lines under the slim arg scope:

            with tf.variable_scope('darknet-53'):
                route_1, route_2, inputs = darknet53(inputs)

            with tf.variable_scope('yolo-v3'):
                route, inputs = _yolo_block(inputs, 512)
                detect_1 = _detection_layer(inputs, num_classes, _ANCHORS[6:9], img_size, data_format)
                detect_1 = tf.identity(detect_1, name='detect_1')

                inputs = _conv2d_fixed_padding(route, 256, 1)
                upsample_size = route_2.get_shape().as_list()
                inputs = _upsample(inputs, upsample_size, data_format)
                inputs = tf.concat([inputs, route_2], axis=1 if data_format == 'NCHW' else 3)

                route, inputs = _yolo_block(inputs, 256)

                detect_2 = _detection_layer(inputs, num_classes, _ANCHORS[3:6], img_size, data_format)
                detect_2 = tf.identity(detect_2, name='detect_2')

                inputs = _conv2d_fixed_padding(route, 128, 1)
                upsample_size = route_1.get_shape().as_list()
                inputs = _upsample(inputs, upsample_size, data_format)
                inputs = tf.concat([inputs, route_1], axis=1 if data_format == 'NCHW' else 3)

                _, inputs = _yolo_block(inputs, 128)

                detect_3 = _detection_layer(inputs, num_classes, _ANCHORS[0:3], img_size, data_format)
                detect_3 = tf.identity(detect_3, name='detect_3')

                detections = tf.concat([detect_1, detect_2, detect_3], axis=1)
                return detections

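4. Converting the pre-trained COCO weights

We have defined the detector's architecture. To use it, we must either train it on our own dataset or use pre-trained weights. Weights pre-trained on the COCO dataset are publicly available. We can download them with this command: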

wget https://pjreddie.com/media/files/yolov3.weights

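The structure of this binary file is as follows:

The first 5 int32 values are header information: the major version number, the minor version number, the subversion number, and the number of images seen by the network during training (this counter is stored as a 64-bit value, so it takes up the last two of the five slots). After them come 62 001 757 float32 values, which are the weights of each conv and batch norm layer. It is important to remember that they are saved in row-major format, which is the opposite of the (column-major) format used by TensorFlow. Concretely, as the loading code later in this section shows, each kernel is stored as (filters, input_channels, height, width) and has to be transposed to TensorFlow's (height, width, input_channels, filters).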
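The layout described above (5 header integers followed by a flat run of float32 values) can be tried out without downloading anything. The sketch below uses only the standard library and a synthetic in-memory stand-in for the file; the header numbers in it are made up, the function name is illustrative, and little-endian byte order is an assumption that holds on typical x86 machines.

```python
import io
import struct


def read_darknet_weights(fp, header_ints=5):
    """Reads the int32 header and the float32 payload from a Darknet-style weights stream."""
    header = struct.unpack('<%di' % header_ints, fp.read(4 * header_ints))
    payload = fp.read()
    count = len(payload) // 4
    weights = struct.unpack('<%df' % count, payload[:4 * count])
    return header, weights


# Synthetic stand-in for a real file: 5 made-up header ints followed by 3 floats.
fake_file = io.BytesIO(struct.pack('<5i', 0, 2, 0, 1000, 0) + struct.pack('<3f', 0.5, -1.0, 2.0))
header, weights = read_darknet_weights(fake_file)
print(header)
print(weights)
```

For the real file, which holds tens of millions of values, reading the payload into a NumPy array as the article does is far more practical than unpacking into a tuple.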

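So how do we read the weights from this file?

We start with the first conv layer. Most of the convolutional layers are immediately followed by a batch normalization layer. In that case, we first need to read 4 * num_filters values for the batch norm layer (beta, gamma, moving mean and moving variance, in the order in which the loading code below consumes them), and then kernel_size[0] * kernel_size[1] * num_filters * input_channels weights of the conv layer.

In the opposite case, when the conv layer is not followed by a batch norm layer, we read num_filters bias values instead of the batch norm parameters.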
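The two cases can be summarised in a small counting helper. This is only an arithmetic sketch; conv_layer_value_count is a hypothetical name, and the two example layers are read off the network definition earlier in the article (the first Darknet-53 conv, and the first detection conv for the 80 COCO classes).

```python
def conv_layer_value_count(num_filters, input_channels, kernel_size, has_batch_norm):
    """Number of float32 values stored in the weights file for one conv layer."""
    kernel_values = kernel_size * kernel_size * num_filters * input_channels
    if has_batch_norm:
        extra = 4 * num_filters  # beta, gamma, moving mean, moving variance
    else:
        extra = num_filters      # biases
    return extra + kernel_values


# First Darknet-53 layer: 3x3 conv, 3 input channels (RGB), 32 filters, with batch norm.
print(conv_layer_value_count(32, 3, 3, has_batch_norm=True))      # 992
# First detection conv: 1x1 conv, 1024 input channels, 3 * (5 + 80) = 255 filters, with biases.
print(conv_layer_value_count(255, 1024, 1, has_batch_norm=False))  # 261375
```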

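Let's start writing the load_weights function. It takes 2 arguments: the list of variables in our graph and the name of the binary file.

We start by opening the file, skipping the first 5 int32 values, and reading everything else as a flat array: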

def load_weights(var_list, weights_file):
    with open(weights_file, "rb") as fp:
        _ = np.fromfile(fp, dtype=np.int32, count=5)

        weights = np.fromfile(fp, dtype=np.float32)

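Then we use two pointers: the first, i, iterates over the list of variables var_list, and the second, ptr, iterates over the loaded array weights. We need to check the type of the layer that follows the one currently being processed and read the appropriate number of values. The function returns a list of tf.assign ops. I check the type of a layer simply by looking at its name.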

    ptr = 0
    i = 0
    assign_ops = []
    while i < len(var_list) - 1:
        var1 = var_list[i]
        var2 = var_list[i + 1]
        # do something only if we process conv layer
        if 'Conv' in var1.name.split('/')[-2]:
            # check type of next layer
            if 'BatchNorm' in var2.name.split('/')[-2]:
                # load batch norm params
                gamma, beta, mean, var = var_list[i + 1:i + 5]
                # the file stores them in this order
                batch_norm_vars = [beta, gamma, mean, var]
                for bn_var in batch_norm_vars:
                    shape = bn_var.shape.as_list()
                    num_params = np.prod(shape)
                    var_weights = weights[ptr:ptr + num_params].reshape(shape)
                    ptr += num_params
                    assign_ops.append(tf.assign(bn_var, var_weights, validate_shape=True))

                # we move the pointer by 4, because we loaded 4 variables
                i += 4
            elif 'Conv' in var2.name.split('/')[-2]:
                # load biases
                bias = var2
                bias_shape = bias.shape.as_list()
                bias_params = np.prod(bias_shape)
                bias_weights = weights[ptr:ptr + bias_params].reshape(bias_shape)
                ptr += bias_params
                assign_ops.append(tf.assign(bias, bias_weights, validate_shape=True))

                # we loaded 1 variable (the biases)
                i += 1

            # we can load weights of conv layer
            shape = var1.shape.as_list()
            num_params = np.prod(shape)

            # the file stores the kernel as (filters, input_channels, height, width)
            var_weights = weights[ptr:ptr + num_params].reshape((shape[3], shape[2], shape[0], shape[1]))
            # remember to transpose to column-major
            var_weights = np.transpose(var_weights, (2, 3, 1, 0))
            ptr += num_params
            assign_ops.append(tf.assign(var1, var_weights, validate_shape=True))
            i += 1

    return assign_ops
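The reshape followed by the transpose is the step that is easiest to get wrong, so here is the same rearrangement spelled out index by index in pure Python on a tiny kernel. darknet_to_tf_kernel is an illustrative helper, not part of the article's code; it turns a flat buffer in (filters, input_channels, height, width) order into nested lists indexed [height][width][input_channel][filter].

```python
def darknet_to_tf_kernel(flat, out_ch, in_ch, height, width):
    """Rearranges a flat Darknet kernel (out, in, h, w) into nested lists indexed [h][w][in][out]."""
    def at(o, i, h, w):
        # position of element (o, i, h, w) in the flat Darknet buffer
        return flat[((o * in_ch + i) * height + h) * width + w]

    return [[[[at(o, i, h, w) for o in range(out_ch)]
              for i in range(in_ch)]
             for w in range(width)]
            for h in range(height)]


# A tiny kernel: 2 filters, 1 input channel, spatial size 1x2.
flat = [10, 11,  # filter 0: w=0, w=1
        20, 21]  # filter 1: w=0, w=1
kernel = darknet_to_tf_kernel(flat, out_ch=2, in_ch=1, height=1, width=2)
print(kernel)  # [[[[10, 20]], [[11, 21]]]]
```

In the output the filter index has moved to the innermost position, which is where TensorFlow's conv2d expects it.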

Now we can restore the weights of the model by running something like this:

with tf.variable_scope('model'):
    model = yolo_v3(inputs, 80)

model_vars = tf.global_variables(scope='model')
assign_ops = load_weights(model_vars, 'yolov3.weights')

sess = tf.Session()
sess.run(assign_ops)

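For future use, it is probably easier to export the weights with tf.train.Saver and load them from a checkpoint.

5. Implementation of the post-processing algorithms

Our model returns a tensor of shape: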

batch_size x 10647 x (num_classes + 5 bounding box attrs)

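The number 10647 is the sum 507 + 2028 + 8112, which are the numbers of possible objects detected at each scale. The 5 values describing the bounding box attributes stand for center_x, center_y, width, height and confidence. In most cases it is easier to work with the coordinates of two points: the top-left and the bottom-right corner. Let's convert the output of the detector to this format.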
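The three numbers follow from the grid sizes at the three scales. The short check below assumes a square 416x416 input, which is inferred from 507 = 3 * 13 * 13 rather than stated in the article, together with strides of 32, 16 and 8 and 3 anchors per scale; the function name is illustrative.

```python
def boxes_per_scale(input_size=416, strides=(32, 16, 8), anchors_per_scale=3):
    """Number of predicted boxes at each detection scale for a square input."""
    return [anchors_per_scale * (input_size // s) ** 2 for s in strides]


counts = boxes_per_scale()
print(counts, sum(counts))  # [507, 2028, 8112] 10647

num_classes = 80                            # COCO
channels_per_scale = 3 * (5 + num_classes)  # filters of each detection conv
print(channels_per_scale)                   # 255
```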

The function that does this is straightforward:

def detections_boxes(detections):
    center_x, center_y, width, height, attrs = tf.split(detections, [1, 1, 1, 1, -1], axis=-1)
    w2 = width / 2
    h2 = height / 2
    x0 = center_x - w2
    y0 = center_y - h2
    x1 = center_x + w2
    y1 = center_y + h2

    boxes = tf.concat([x0, y0, x1, y1], axis=-1)
    detections = tf.concat([boxes, attrs], axis=-1)
    return detections

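Our detector usually detects the same object multiple times (with slightly different centers and sizes). In most cases we do not want to keep all of these detections, which differ only by a few pixels. The standard solution to this problem is non-max suppression (NMS).

Why don't we use the tf.image.non_max_suppression function from the TensorFlow API? There are 2 main reasons. First, in my opinion it is much better to perform NMS per class, because we may have a situation where objects from 2 different classes overlap heavily and a global NMS would suppress one of the boxes. Second, some people complain that this function is slow because it has not been optimised yet.

Let's implement the NMS algorithm. First, we need a function that computes the IoU (Intersection over Union) of two bounding boxes: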

def _iou(box1, box2):
    b1_x0, b1_y0, b1_x1, b1_y1 = box1
    b2_x0, b2_y0, b2_x1, b2_y1 = box2

    int_x0 = max(b1_x0, b2_x0)
    int_y0 = max(b1_y0, b2_y0)
    int_x1 = min(b1_x1, b2_x1)
    int_y1 = min(b1_y1, b2_y1)

    # clamp at 0 so that boxes that do not overlap get an intersection of 0
    int_area = max(int_x1 - int_x0, 0) * max(int_y1 - int_y0, 0)

    b1_area = (b1_x1 - b1_x0) * (b1_y1 - b1_y0)
    b2_area = (b2_x1 - b2_x0) * (b2_y1 - b2_y0)

    iou = int_area / (b1_area + b2_area - int_area + 1e-05)

    return iou
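A quick worked example helps to sanity-check an IoU routine. The standalone sketch below is a separate illustrative function, not the article's _iou: it assumes boxes with non-zero area, omits the small epsilon in the denominator so that the numbers are exact, and clamps the intersection width and height at 0 so that disjoint boxes score 0.

```python
def iou(box1, box2):
    """IoU of two boxes given as (x0, y0, x1, y1); assumes non-degenerate boxes."""
    int_w = max(min(box1[2], box2[2]) - max(box1[0], box2[0]), 0)
    int_h = max(min(box1[3], box2[3]) - max(box1[1], box2[1]), 0)
    int_area = int_w * int_h
    area1 = (box1[2] - box1[0]) * (box1[3] - box1[1])
    area2 = (box2[2] - box2[0]) * (box2[3] - box2[1])
    return int_area / (area1 + area2 - int_area)


print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # overlap 1, union 7, so 1/7
print(iou((0, 0, 1, 1), (2, 2, 3, 3)))  # disjoint boxes, so 0.0
```

Without the clamping, two boxes that are disjoint along both axes would yield a positive "intersection", because the product of two negative lengths is positive.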

Now we can write the non_max_suppression function. I use the NumPy library for fast vector operations.

def non_max_suppression(predictions_with_boxes, confidence_threshold, iou_threshold=0.4):
    """
    Applies Non-max suppression to prediction boxes.

    :param predictions_with_boxes: 3D numpy array, first 4 values in 3rd dimension are bbox attrs, 5th is confidence
    :param confidence_threshold: the threshold for deciding if prediction is valid
    :param iou_threshold: the threshold for deciding if two boxes overlap
    :return: dict: class -> [(box, score)]
    """

It takes 3 arguments: the output of our YOLO v3 detector, the confidence threshold and the IoU threshold. The body of the function is as follows:

    conf_mask = np.expand_dims((predictions_with_boxes[:, :, 4] > confidence_threshold), -1)
    predictions = predictions_with_boxes * conf_mask

    result = {}
    for i, image_pred in enumerate(predictions):
        shape = image_pred.shape
        non_zero_idxs = np.nonzero(image_pred)
        image_pred = image_pred[non_zero_idxs]
        image_pred = image_pred.reshape(-1, shape[-1])

        bbox_attrs = image_pred[:, :5]
        classes = image_pred[:, 5:]
        classes = np.argmax(classes, axis=-1)

        unique_classes = list(set(classes.reshape(-1)))

        for cls in unique_classes:
            cls_mask = classes == cls
            cls_boxes = bbox_attrs[np.nonzero(cls_mask)]
            cls_boxes = cls_boxes[cls_boxes[:, -1].argsort()[::-1]]
            cls_scores = cls_boxes[:, -1]
            cls_boxes = cls_boxes[:, :-1]

            while len(cls_boxes) > 0:
                box = cls_boxes[0]
                score = cls_scores[0]
                if cls not in result:
                    result[cls] = []
                result[cls].append((box, score))
                cls_boxes = cls_boxes[1:]
                ious = np.array([_iou(box, x) for x in cls_boxes])
                iou_mask = ious < iou_threshold
                cls_boxes = cls_boxes[np.nonzero(iou_mask)]
                cls_scores = cls_scores[np.nonzero(iou_mask)]

    return result
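To illustrate the argument for per-class NMS made earlier, the following pure-Python sketch compares class-agnostic and per-class greedy suppression on three hypothetical detections (a person on a horse, plus a duplicate person box). All names and numbers here are invented for the illustration; this is a simplified stand-in, not the NumPy implementation above.

```python
def _overlap(a, b):
    """IoU of two (x0, y0, x1, y1) boxes with non-zero area."""
    w = max(min(a[2], b[2]) - max(a[0], b[0]), 0)
    h = max(min(a[3], b[3]) - max(a[1], b[1]), 0)
    inter = w * h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union


def greedy_nms(detections, iou_threshold):
    """detections: list of (box, score, cls). Keeps the best box, drops its overlaps, repeats."""
    remaining = sorted(detections, key=lambda d: d[1], reverse=True)
    kept = []
    while remaining:
        best = remaining.pop(0)
        kept.append(best)
        remaining = [d for d in remaining if _overlap(best[0], d[0]) < iou_threshold]
    return kept


def per_class_nms(detections, iou_threshold):
    """Runs greedy_nms separately for every class label."""
    kept = []
    for cls in sorted(set(d[2] for d in detections)):
        kept += greedy_nms([d for d in detections if d[2] == cls], iou_threshold)
    return kept


# Hypothetical detections: a person on a horse, plus a duplicate person box.
detections = [
    ((10, 10, 110, 210), 0.9, 'person'),
    ((12, 12, 112, 212), 0.6, 'person'),
    ((5, 20, 115, 215), 0.8, 'horse'),
]
print(len(greedy_nms(detections, 0.4)))     # 1: the horse is suppressed by the person
print(len(per_class_nms(detections, 0.4)))  # 2: one person and one horse survive
```

The class-agnostic variant discards the horse because its box overlaps the higher-scoring person box, while the per-class variant removes only the duplicate person.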

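We have now implemented all the functions needed to run YOLO v3.

6. Summary

In the repo (https://github.com/mystic123/tensorflow-yolo-v3) you can find the code and some demo scripts for running detections. The detector works with both the NHWC and NCHW data formats, so you can easily choose whichever is faster on your machine.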


NSapientia
