Complete Analysis of YOLOv8 Decoding Process
This article provides a detailed analysis of prediction decoding and post-processing mechanisms in the YOLOv8 object detection algorithm, including key components such as DFL (Distribution Focal Loss) decoding and Non-Maximum Suppression (NMS).
Table of Contents
- 1. Prediction Decoding Process (decode_predictions)
- 2. Post-processing Workflow (post_process)
- 3. Core Method Analysis
- 4. Complete Code Reference
1. Prediction Decoding Process (decode_predictions)
YOLOv8 adopts an anchor-free design, where the prediction decoding process converts network outputs into standard bounding box format. The entire process can be divided into several key steps:
1.1 Grid Point Generation
Generate reference point coordinates and corresponding stride values for each feature map:
# Generate anchor points and corresponding strides
anchors, strides = (x.transpose(0, 1) for x in
self.make_anchors(predictions[1], self.stride, 0.5))
# anchors: grid point coordinates for all feature maps
# strides: corresponding stride values (8/16/32)
This step accomplishes:
- Generating grid points for three feature maps (P3/P4/P5)
- Generating corresponding stride values (P3:8, P4:16, P5:32)
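The make_anchors helper is called above but not listed in this article. Below is a minimal standalone sketch that follows the ultralytics-style logic of placing one anchor point at the center of every grid cell; the exact function body is an assumption for illustration, not code taken from this project:
import torch

def make_anchors(feats, strides, grid_cell_offset=0.5):
    """Generate grid-cell center points and per-point strides for a list of feature maps."""
    anchor_points, stride_tensor = [], []
    dtype, device = feats[0].dtype, feats[0].device
    for feat, stride in zip(feats, strides):
        h, w = feat.shape[2:]
        # Cell centers in feature-map units: 0.5, 1.5, 2.5, ...
        sx = torch.arange(w, device=device, dtype=dtype) + grid_cell_offset
        sy = torch.arange(h, device=device, dtype=dtype) + grid_cell_offset
        sy, sx = torch.meshgrid(sy, sx, indexing="ij")
        anchor_points.append(torch.stack((sx, sy), -1).view(-1, 2))   # (h*w, 2)
        stride_tensor.append(torch.full((h * w, 1), stride, dtype=dtype, device=device))
    return torch.cat(anchor_points), torch.cat(stride_tensor)          # (8400, 2), (8400, 1)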
1.2 Feature Map Processing
Process prediction results from three scales uniformly, separating bounding box and class predictions:
# Concatenate prediction results from three feature maps
x_cat = torch.cat([xi.view(1, self.nc + 16 * 4, -1) for xi in predictions[1]], 2)
# P3: (1, nc+64, 6400) # 80*80=6400
# P4: (1, nc+64, 1600) # 40*40=1600
# P5: (1, nc+64, 400) # 20*20=400
# Separate bounding box predictions and class predictions
box, cls = x_cat.split((16 * 4, self.nc), 1)
# box: (1, 64, 8400) # 64=16*4, each coordinate encoded with 16 values
# cls: (1, nc, 8400) # nc is the number of classes
Dimension unpacking to prepare for DFL decoding:
b, c, a = box.shape
# b: batch size, usually 1
# c: channels, equals 64 (16*4), representing encoding dimension for each bounding box
# a: anchors, equals 8400 (6400+1600+400), representing total number of grid points from all feature maps
# For example:
# If box.shape == (1, 64, 8400), then:
b = 1 # batch size
c = 64 # 16*4: each of the 4 box distances (left, top, right, bottom) is encoded with 16 values
a = 8400 # total grid points = P3(80*80) + P4(40*40) + P5(20*20)
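These shapes can be checked quickly with dummy tensors; the snippet below is only a sanity check and assumes nc = 80:
import torch

nc = 80
p3 = torch.randn(1, nc + 64, 80, 80)   # stride 8
p4 = torch.randn(1, nc + 64, 40, 40)   # stride 16
p5 = torch.randn(1, nc + 64, 20, 20)   # stride 32

x_cat = torch.cat([xi.view(1, nc + 64, -1) for xi in (p3, p4, p5)], 2)
print(x_cat.shape)            # torch.Size([1, 144, 8400])

box, cls = x_cat.split((16 * 4, nc), 1)
print(box.shape, cls.shape)   # torch.Size([1, 64, 8400]) torch.Size([1, 80, 8400])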
1.3 DFL Decoding Implementation
YOLOv8 uses DFL (Distribution Focal Loss) to convert a discrete probability distribution over 16 bins into a continuous coordinate value. The core of this step is a weighted average implemented as a fixed-weight 1x1 convolution:
# Create 1x1 convolution for DFL decoding
# - Input channels: 16 (encoding dimension for each coordinate)
# - Output channels: 1 (single decoded value)
# - Kernel size: 1x1
# - No bias needed
# - No gradient needed (fixed weights)
conv = nn.Conv2d(16, 1, 1, bias=False).requires_grad_(False)
# Create array [0,1,2,...,15]
x = torch.arange(16, dtype=torch.float) # tensor([0., 1., 2., ..., 15.])
# Set convolution weights to [0,1,2,...,15]
conv.weight.data[:] = nn.Parameter(x.view(1, 16, 1, 1))
# This way convolution operation equals weighted average:
# output = 0*p0 + 1*p1 + 2*p2 + ... + 15*p15
Implementation details analysis:
# 1. Create array [0,1,2,...,15]
x = torch.arange(16, dtype=torch.float)
# tensor([0., 1., 2., ..., 15.])
# 2. Reshape to convolution weight shape
x = x.view(1, 16, 1, 1)
# 1: output channels
# 16: input channels
# 1,1: convolution kernel size
# shape: torch.Size([1, 16, 1, 1])
# 3. Set it as convolution layer weights
conv.weight.data[:] = nn.Parameter(x)
# conv.weight.shape = (1, 16, 1, 1)
Core principles of DFL design:
- Use 1x1 convolution to implement weighted average
- Weights fixed to [0-15], no learning needed
- Convert 16 probability values into one continuous coordinate value
Example:
# If predicted probability distribution is:
probs = [0.0, 0.7, 0.3, 0.0, ..., 0.0]
# Through convolution (weighted average) get:
result = 1*0.7 + 2*0.3  # = 1.3
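This equivalence is easy to verify with a small self-contained check (not part of the original code): the fixed-weight convolution and an explicit weighted sum over the bins 0..15 give the same result:
import torch
import torch.nn as nn

conv = nn.Conv2d(16, 1, 1, bias=False).requires_grad_(False)
conv.weight.data[:] = torch.arange(16, dtype=torch.float).view(1, 16, 1, 1)

probs = torch.zeros(1, 16, 1, 1)
probs[0, 1], probs[0, 2] = 0.7, 0.3   # the distribution from the example above

print(conv(probs).item())                                    # 1.3
print((torch.arange(16.0) * probs.view(-1)).sum().item())    # 1.3, explicit weighted sum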
1.4 Coordinate Transformation
Finally, use DFL decoding and convert to actual bounding box coordinates:
# Reshape using the (b, c, a) dimensions unpacked above, apply softmax over the 16 bins
# of each box distance, then apply the fixed-weight convolution to get one value per distance
dfl = conv(box.view(b, 4, 16, a).transpose(2, 1).softmax(1)).view(b, 4, a)
# Convert distances to box coordinates, then multiply by the stride to map grid units back to input-image pixels
dbox = self.dist2bbox(dfl, anchors.unsqueeze(0), xywh=True, dim=1) * strides
# Combine final results
return torch.cat((dbox, cls.sigmoid()), 1)
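For reference, the tensor returned here stacks box coordinates and class probabilities along the channel dimension; the concrete numbers below assume nc = 80:
# out = decode_predictions(predictions)  # shape: (1, 4 + nc, 8400) -> (1, 84, 8400) when nc = 80
# out[:, :4, :]  -> (cx, cy, w, h) in input-image pixels (already multiplied by the stride)
# out[:, 4:, :]  -> per-class probabilities after sigmoid, one row per class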
2. Post-processing Workflow (post_process)
Post-processing mainly completes four tasks:
- Confidence filtering: Remove detection boxes with low confidence
- Coordinate conversion: Convert center-point format (cx, cy, w, h) to corner format (x1, y1, x2, y2)
- NMS processing: Remove overlapping detection boxes, keep the optimal ones
- Scale restoration: Map coordinates back to original image size
2.1 Confidence Filtering
First process input format and perform initial confidence filtering:
# 1. Process input format
prediction = pred[0] if isinstance(pred, (list, tuple)) else pred
# 2. First round confidence filtering
xc = prediction[:, 4:84].amax(1) > conf_thres # Keep boxes whose highest class probability exceeds the threshold (4:84 assumes nc=80)
2.2 Coordinate Format Conversion
Convert center point format to top-left bottom-right format:
# 3. Coordinate format conversion
prediction = prediction.transpose(-1, -2)
prediction[..., :4] = self.xywh2xyxy(prediction[..., :4]) # Convert center point format to top-left bottom-right format
# 4-5. Filter based on confidence
x = prediction[0] # Take first batch
x = x[xc[0]] # Keep high confidence boxes
Get final bounding boxes and classes:
# 6-7. Get final bounding boxes and classes
box, cls = x.split((4, self.nc), 1) # Separate coordinates and class probabilities
conf, j = cls.max(1, keepdim=True) # Find class with highest probability
x = torch.cat((box, conf, j.float()), 1)[conf.view(-1) > conf_thres]
Code implementation details:
# 1. First look at data shapes
box.shape # (N, 4) - Bounding box coordinates (x1,y1,x2,y2)
conf.shape # (N, 1) - Maximum confidence values
j.shape # (N, 1) - Corresponding class indices
# 2. torch.cat concatenation operation
x = torch.cat((box, conf, j.float()), 1)
# Result: (N, 6)
# - First 4 columns are bounding box coordinates
# - 5th column is confidence
# - 6th column is class index
# 3. conf.view(-1)
conf.view(-1) # Change shape from (N,1) to (N,)
# 4. Filter based on confidence threshold
mask = conf.view(-1) > conf_thres # Boolean mask
x = x[mask] # Only keep rows with confidence above threshold
# Example:
# Suppose we have 3 detection boxes
box = torch.tensor([[10., 20., 30., 40.],     # Box1
                    [50., 60., 70., 80.],     # Box2
                    [90., 100., 110., 120.]]) # Box3 (float dtype so torch.cat with conf works)
conf = torch.tensor([[0.9], # Box1 confidence
[0.3], # Box2 confidence
[0.8]]) # Box3 confidence
j = torch.tensor([[0], # Box1 is class 0
[1], # Box2 is class 1
[2]]) # Box3 is class 2
# Concatenate and filter (conf_thres=0.5)
x = torch.cat((box, conf, j.float()), 1)[conf.view(-1) > 0.5]
# Result only keeps Box1 and Box3 (confidence>0.5)
2.3 Non-Maximum Suppression
Execute Non-Maximum Suppression (NMS) to remove overlapping boxes:
# 8. Non-Maximum Suppression (NMS)
# 1. Calculate class offset
c = x[:, 5:6] * max_wh # max_wh=7680
# For example:
# Class 0: 0 * 7680 = 0
# Class 1: 1 * 7680 = 7680
# Class 2: 2 * 7680 = 15360
# 2. Add the offset to the box coordinates and extract the confidence scores
boxes, scores = x[:, :4] + c, x[:, 4]
# Boxes of different classes are shifted into disjoint coordinate regions,
# so NMS only suppresses overlaps within the same class
# 3. Execute NMS
i = torchvision.ops.nms(boxes, scores, iou_thres)
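The effect of the class offset can be seen in a standalone example (illustration only, not from the original code): two heavily overlapping boxes of different classes end up in disjoint coordinate ranges, so NMS never compares them:
import torch
import torchvision

max_wh = 7680
boxes = torch.tensor([[100., 100., 200., 200.],    # class 0, score 0.9
                      [105., 105., 205., 205.]])   # class 1, score 0.8, IoU ~0.82 with the first box
scores = torch.tensor([0.9, 0.8])
classes = torch.tensor([[0.], [1.]])

print(torchvision.ops.nms(boxes, scores, 0.7))                     # tensor([0])    - second box suppressed
print(torchvision.ops.nms(boxes + classes * max_wh, scores, 0.7))  # tensor([0, 1]) - both boxes kept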
2.4 Coordinate Scale Restoration
Finally scale bounding box coordinates back to original image size:
# 9-10. Scale to original size
pred = x[i] # Keep boxes after NMS
pred[:, :4] = self.scale_boxes(img.shape[2:], pred[:, :4], orig_img.shape)
3. Core Method Analysis
3.1 dist2bbox Method
Convert predicted distances to bounding box coordinates:
def dist2bbox(self, distance, anchor_points, xywh=True, dim=-1):
"""Convert predicted distances to bounding box coordinates"""
lt, rb = torch.split(distance, 2, dim)
x1y1 = anchor_points - lt
x2y2 = anchor_points + rb
if xywh:
c_xy = (x1y1 + x2y2) / 2
wh = x2y2 - x1y1
return torch.cat((c_xy, wh), dim)
return torch.cat((x1y1, x2y2), dim)
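A quick usage example shows what the conversion produces for a single grid cell; the function below is a standalone copy of the method above so the snippet runs on its own:
import torch

def dist2bbox(distance, anchor_points, xywh=True, dim=-1):
    lt, rb = torch.split(distance, 2, dim)
    x1y1 = anchor_points - lt
    x2y2 = anchor_points + rb
    if xywh:
        return torch.cat(((x1y1 + x2y2) / 2, x2y2 - x1y1), dim)
    return torch.cat((x1y1, x2y2), dim)

anchor = torch.tensor([[10.5, 10.5]])          # one grid-cell center (feature-map units)
dist = torch.tensor([[2.0, 3.0, 4.0, 5.0]])    # distances to the left, top, right, bottom edges
print(dist2bbox(dist, anchor))                 # tensor([[11.5000, 11.5000,  6.0000,  8.0000]]) -> (cx, cy, w, h)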
3.2 scale_boxes Method
Scale detection box coordinates from model input size back to original image size:
def scale_boxes(self, img1_shape, boxes, img0_shape):
"""
img1_shape: Model input size (640, 640)
boxes: Detection box coordinates (x1,y1,x2,y2)
img0_shape: Original image size (h, w)
"""
# 1. Calculate scale ratio: gain = input size / original size
gain = min(img1_shape[0] / img0_shape[0], # height ratio
img1_shape[1] / img0_shape[1]) # width ratio
# 2. Calculate padding amount
pad = ((img1_shape[1] - img0_shape[1] * gain) / 2, # width padding
(img1_shape[0] - img0_shape[0] * gain) / 2) # height padding
# 3. Remove padding
boxes[..., [0, 2]] -= pad[0] # x coordinates minus horizontal padding
boxes[..., [1, 3]] -= pad[1] # y coordinates minus vertical padding
# 4. Scale back to original size
boxes[..., :4] /= gain
return boxes  # the caller assigns this return value back into pred[:, :4]
Example analysis:
# Suppose:
img1_shape = (640, 640) # Model input size
img0_shape = (800, 600) # Original image size
# 1. Calculate scale ratio
gain = min(640/800, 640/600) # = min(0.8, 1.067) = 0.8
# 2. Calculate padding amount
pad_w = (640 - 600 * 0.8) / 2 # = 80, width-direction padding (per side)
pad_h = (640 - 800 * 0.8) / 2 # = 0, height-direction padding (per side)
# Original image size after scaling: (800*0.8, 600*0.8) = (640, 480)
# Need to pad to 640x640, so:
# - Width padded by (640-480)/2 pixels on each side
# - Height padded by (640-640)/2 = 0 pixels on each side
Scaling example:
# Suppose:
# - Original image: 1000x800
# - Model input: 640x640
# - gain = 640/1000 = 0.64 (scale ratio)
# 1. Model predicted detection box (on the 640x640 scale; padding removal is omitted here for brevity)
gain = 640 / 1000 # = 0.64
box = torch.tensor([100., 200., 300., 400.]) # [x1, y1, x2, y2]
# 2. Scale back to original size
box /= gain # equivalent to box / 0.64
# [100/0.64, 200/0.64, 300/0.64, 400/0.64]
# = [156.25, 312.5, 468.75, 625]
4. Complete Code Reference
Complete Implementation of decode_predictions
def decode_predictions(self, predictions):
"""Decode model's raw output into detection boxes and class probabilities
Args:
predictions: Model's raw prediction results
"""
# Generate anchor points and corresponding strides
anchors, strides = (x.transpose(0, 1) for x in
self.make_anchors(predictions[1], self.stride, 0.5))
# Concatenate prediction results from three feature maps
x_cat = torch.cat([xi.view(1, self.nc + 16 * 4, -1) for xi in predictions[1]], 2)
# Separate bounding box predictions and class predictions
box, cls = x_cat.split((16 * 4, self.nc), 1) # 16*4 for DFL decoding, nc is number of classes
b, c, a = box.shape
conv = nn.Conv2d(16, 1, 1, bias=False).requires_grad_(False)
x = torch.arange(16, dtype=torch.float)
conv.weight.data[:] = nn.Parameter(x.view(1, 16, 1, 1))
dfl = conv(box.view(b, 4, 16, a).transpose(2, 1).softmax(1)).view(b, 4, a)
dbox = self.dist2bbox(dfl, anchors.unsqueeze(0), xywh=True, dim=1) * strides
return torch.cat((dbox, cls.sigmoid()), 1)
Complete Implementation of post_process
def post_process(self, pred, img, orig_img, conf_thres=0.25, iou_thres=0.7, max_wh=7680):
"""Post-processing: Execute Non-Maximum Suppression (NMS)"""
# 1. Process validation mode output
prediction = pred[0] if isinstance(pred, (list, tuple)) else pred
# 2. Filter candidate boxes based on confidence threshold
xc = prediction[:, 4:84].amax(1) > conf_thres # 4:84 assumes nc=80 classes (channels 4..83 are class scores)
# 3. Adjust prediction results shape and format
prediction = prediction.transpose(-1, -2)
prediction[..., :4] = self.xywh2xyxy(prediction[..., :4])
# 4. Get prediction results for current batch
xi = 0
x = prediction[xi]
# 5. Filter prediction boxes based on confidence
x = x[xc[xi]]
# 6. Separate bounding box and class information
box, cls = x.split((4, self.nc), 1)
# 7. Get maximum confidence and corresponding class for each box
conf, j = cls.max(1, keepdim=True)
x = torch.cat((box, conf, j.float()), 1)[conf.view(-1) > conf_thres]
# 8. Execute NMS
c = x[:, 5:6] * max_wh
boxes, scores = x[:, :4] + c, x[:, 4]
i = torchvision.ops.nms(boxes, scores, iou_thres)
# 9. Get final prediction results
pred = x[i]
# 10. Scale bounding box coordinates to original image size
pred[:, :4] = self.scale_boxes(img.shape[2:], pred[:, :4], orig_img.shape)
return pred
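Putting the two pieces together, a typical inference call chain would look roughly like the sketch below; the names detector, model and preprocess are placeholders for illustration and are not defined in this article:
# img = preprocess(orig_img)                     # letterbox to 640x640, CHW float tensor
# raw = detector.model(img)                      # raw multi-scale head outputs
# decoded = detector.decode_predictions(raw)     # (1, 4 + nc, 8400)
# dets = detector.post_process(decoded, img, orig_img,
#                              conf_thres=0.25, iou_thres=0.7)
# dets: (M, 6) rows of (x1, y1, x2, y2, confidence, class index) in original-image pixels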