Complete Analysis of YOLOv8 Decoding Process

This article provides a detailed analysis of prediction decoding and post-processing mechanisms in the YOLOv8 object detection algorithm, including key components such as DFL (Distribution Focal Loss) decoding and Non-Maximum Suppression (NMS).

1. Prediction Decoding Process (decode_predictions)

YOLOv8 adopts an anchor-free design, where the prediction decoding process converts network outputs into standard bounding box format. The entire process can be divided into several key steps:

1.1 Grid Point Generation

Generate reference point coordinates and corresponding stride values for each feature map:

# Generate anchor points and corresponding strides
anchors, strides = (x.transpose(0, 1) for x in 
                  self.make_anchors(predictions[1], self.stride, 0.5))
# anchors: grid point coordinates for all feature maps
# strides: corresponding stride values (8/16/32)

This step accomplishes the following (a sketch of make_anchors is shown after the list):

  • Generating grid points for three feature maps (P3/P4/P5)
  • Generating corresponding stride values (P3:8, P4:16, P5:32)
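
For reference, here is a minimal sketch of what the make_anchors call above does (modeled on the ultralytics-style implementation and shown as a standalone function; the actual method may differ in details):

def make_anchors(feats, strides, grid_cell_offset=0.5):
    """Generate grid-cell center points and per-point strides for each feature map."""
    anchor_points, stride_tensor = [], []
    for feat, stride in zip(feats, strides):                          # feats: [P3, P4, P5] outputs
        h, w = feat.shape[-2:]                                        # 80x80, 40x40, 20x20
        sx = torch.arange(w, dtype=torch.float) + grid_cell_offset    # x centers of cells
        sy = torch.arange(h, dtype=torch.float) + grid_cell_offset    # y centers of cells
        sy, sx = torch.meshgrid(sy, sx, indexing='ij')
        anchor_points.append(torch.stack((sx, sy), -1).view(-1, 2))   # (h*w, 2)
        stride_tensor.append(torch.full((h * w, 1), float(stride)))   # (h*w, 1)
    return torch.cat(anchor_points), torch.cat(stride_tensor)         # (8400, 2), (8400, 1)

After transpose(0, 1), anchors has shape (2, 8400) and strides has shape (1, 8400), which is what the later multiplication by strides expects.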

1.2 Feature Map Processing

Process prediction results from three scales uniformly, separating bounding box and class predictions:

# Concatenate prediction results from three feature maps
x_cat = torch.cat([xi.view(1, self.nc + 16 * 4, -1) for xi in predictions[1]], 2)
# P3: (1, nc+64, 6400)  # 80*80=6400
# P4: (1, nc+64, 1600)  # 40*40=1600
# P5: (1, nc+64, 400)   # 20*20=400

# Separate bounding box predictions and class predictions
box, cls = x_cat.split((16 * 4, self.nc), 1)  
# box: (1, 64, 8400)  # 64=16*4, each coordinate encoded with 16 values
# cls: (1, nc, 8400)  # nc is the number of classes

Dimension unpacking to prepare for DFL decoding:

b, c, a = box.shape
# b: batch size, usually 1
# c: channels, equals 64 (16*4), representing encoding dimension for each bounding box
# a: anchors, equals 8400 (6400+1600+400), representing total number of grid points from all feature maps

# For example, if box.shape == (1, 64, 8400), then:
b = 1      # batch size
c = 64     # 16*4: each of the four side distances (left, top, right, bottom) is encoded with 16 values
a = 8400   # total grid points = P3(80*80) + P4(40*40) + P5(20*20)
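
A quick shape check with dummy tensors shows where 8400 comes from (illustrative snippet only, assuming nc = 80):

import torch

nc = 80
p3 = torch.zeros(1, nc + 64, 80, 80)   # stride 8
p4 = torch.zeros(1, nc + 64, 40, 40)   # stride 16
p5 = torch.zeros(1, nc + 64, 20, 20)   # stride 32

x_cat = torch.cat([x.view(1, nc + 64, -1) for x in (p3, p4, p5)], 2)
print(x_cat.shape)            # torch.Size([1, 144, 8400])

box, cls = x_cat.split((64, nc), 1)
print(box.shape, cls.shape)   # torch.Size([1, 64, 8400]) torch.Size([1, 80, 8400])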

1.3 DFL Decoding Implementation

YOLOv8 uses DFL (Distribution Focal Loss) to convert each coordinate's discrete probability distribution into a continuous value. The core of this step is a weighted average (an expectation over 16 bins) implemented as a 1x1 convolution with fixed weights:

# Create 1x1 convolution for DFL decoding
# - Input channels: 16 (encoding dimension for each coordinate)
# - Output channels: 1 (single decoded value)
# - Kernel size: 1x1
# - No bias needed
# - No gradient needed (fixed weights)
conv = nn.Conv2d(16, 1, 1, bias=False).requires_grad_(False)

# Create array [0,1,2,...,15]
x = torch.arange(16, dtype=torch.float) # tensor([0., 1., 2., ..., 15.])

# Set convolution weights to [0,1,2,...,15]
conv.weight.data[:] = nn.Parameter(x.view(1, 16, 1, 1))
# This way convolution operation equals weighted average:
# output = 0*p0 + 1*p1 + 2*p2 + ... + 15*p15

A closer look at the implementation details:

# 1. Create array [0,1,2,...,15]
x = torch.arange(16, dtype=torch.float)
# tensor([0., 1., 2., ..., 15.])

# 2. Reshape to convolution weight shape
x = x.view(1, 16, 1, 1)
# 1: output channels
# 16: input channels
# 1,1: convolution kernel size
# shape: torch.Size([1, 16, 1, 1])

# 3. Set it as convolution layer weights
conv.weight.data[:] = nn.Parameter(x)
# conv.weight.shape = (1, 16, 1, 1)

Core principles of DFL design:

  • Use 1x1 convolution to implement weighted average
  • Weights fixed to [0-15], no learning needed
  • Convert 16 probability values into one continuous coordinate value

Example:

# If the predicted probability distribution (after softmax) is:
probs = [0.0, 0.7, 0.3, 0.0, ..., 0.0]
# then the convolution (weighted average) yields:
result = 1*0.7 + 2*0.3  # = 1.3
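
The same computation can be checked end to end with a small self-contained snippet (illustrative only, not part of the original code):

import torch
import torch.nn as nn

conv = nn.Conv2d(16, 1, 1, bias=False).requires_grad_(False)
conv.weight.data[:] = torch.arange(16, dtype=torch.float).view(1, 16, 1, 1)

# One probability distribution over the 16 bins, shaped (batch, 16 channels, 1, 1)
probs = torch.zeros(1, 16, 1, 1)
probs[0, 1] = 0.7
probs[0, 2] = 0.3

print(conv(probs).item())  # 1.3 (up to float precision) = 1*0.7 + 2*0.3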

1.4 Coordinate Transformation

Finally, apply DFL decoding and convert the result to actual bounding box coordinates:

# Reshape to (b, 4, 16, a), move the 16 bins to the channel dimension,
# softmax over them, then apply the weighted-average convolution and
# reshape back to (b, 4, a): one decoded distance per box side
dfl = conv(box.view(b, 4, 16, a).transpose(2, 1).softmax(1)).view(b, 4, a)

# Convert to actual bounding box coordinates
dbox = self.dist2bbox(dfl, anchors.unsqueeze(0), xywh=True, dim=1) * strides

# Combine final results
return torch.cat((dbox, cls.sigmoid()), 1)
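
A hand-computed example for a single grid cell may help tie this together (numbers are illustrative, not from the source; see the dist2bbox method in section 3.1):

# Assume DFL produced the distances lt = (2.0, 3.0), rb = (4.0, 5.0) for the
# grid cell whose anchor point is (10.5, 10.5) on the stride-8 feature map.
x1y1 = (10.5 - 2.0, 10.5 - 3.0)              # (8.5, 7.5)
x2y2 = (10.5 + 4.0, 10.5 + 5.0)              # (14.5, 15.5)
c_xy = ((8.5 + 14.5) / 2, (7.5 + 15.5) / 2)  # (11.5, 11.5)
wh   = (14.5 - 8.5, 15.5 - 7.5)              # (6.0, 8.0)
# Multiplying by the stride (8) maps these grid units to input-image pixels:
# center = (92.0, 92.0), size = (48.0, 64.0)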

2. Post-processing Workflow (post_process)

Post-processing mainly completes four tasks:

  • Confidence filtering: Remove detection boxes with low confidence
  • Coordinate conversion: Convert center point format to top-left bottom-right format
  • NMS processing: Remove overlapping detection boxes, keep the optimal ones
  • Scale restoration: Map coordinates back to original image size

2.1 Confidence Filtering

First process input format and perform initial confidence filtering:

# 1. Process input format
prediction = pred[0] if isinstance(pred, (list, tuple)) else pred

# 2. First round confidence filtering
xc = prediction[:, 4:4 + self.nc].amax(1) > conf_thres  # keep boxes whose highest class score exceeds the threshold (4 + nc = 84 when nc = 80)

2.2 Coordinate Format Conversion

Convert center point format to top-left bottom-right format:

# 3. Coordinate format conversion
prediction = prediction.transpose(-1, -2)
prediction[..., :4] = self.xywh2xyxy(prediction[..., :4])  # Convert center point format to top-left bottom-right format

# 4-5. Filter based on confidence
x = prediction[0]  # Take first batch
x = x[xc[0]]      # Keep high confidence boxes

Get final bounding boxes and classes:

# 6-7. Get final bounding boxes and classes
box, cls = x.split((4, self.nc), 1)  # Separate coordinates and class probabilities
conf, j = cls.max(1, keepdim=True)    # Find class with highest probability
x = torch.cat((box, conf, j.float()), 1)[conf.view(-1) > conf_thres]

Code implementation details:

# 1. First look at data shapes
box.shape  # (N, 4)       - Bounding box coordinates (x1,y1,x2,y2)
conf.shape # (N, 1)       - Maximum confidence values
j.shape    # (N, 1)       - Corresponding class indices

# 2. torch.cat concatenation operation
x = torch.cat((box, conf, j.float()), 1)
# Result: (N, 6)
# - First 4 columns are bounding box coordinates
# - 5th column is confidence
# - 6th column is class index

# 3. conf.view(-1)
conf.view(-1)  # Change shape from (N,1) to (N,)

# 4. Filter based on confidence threshold
mask = conf.view(-1) > conf_thres  # Boolean mask
x = x[mask]  # Only keep rows with confidence above threshold

# Example:
# Suppose we have 3 detection boxes
box = torch.tensor([[10,20,30,40],   # Box1
                   [50,60,70,80],   # Box2
                   [90,100,110,120]])# Box3

conf = torch.tensor([[0.9],  # Box1 confidence
                    [0.3],  # Box2 confidence
                    [0.8]]) # Box3 confidence

j = torch.tensor([[0],  # Box1 is class 0
                 [1],  # Box2 is class 1
                 [2]]) # Box3 is class 2

# Concatenate and filter (conf_thres=0.5)
x = torch.cat((box, conf, j.float()), 1)[conf.view(-1) > 0.5]
# Result only keeps Box1 and Box3 (confidence>0.5)

2.3 Non-Maximum Suppression

Execute Non-Maximum Suppression (NMS) to remove overlapping boxes:

# 8. Non-Maximum Suppression (NMS)
# 1. Calculate a per-class offset (max_wh = 7680)
c = x[:, 5:6] * max_wh
# For example:
# Class 0: 0 * 7680 = 0
# Class 1: 1 * 7680 = 7680
# Class 2: 2 * 7680 = 15360

# 2. Add the offset to the box coordinates and take the confidence scores;
# boxes of different classes are pushed into different spatial regions,
# so the class-agnostic NMS below never compares boxes across classes
boxes, scores = x[:, :4] + c, x[:, 4]

i = torchvision.ops.nms(boxes, scores, iou_thres)  # Execute NMS
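
The effect of the offset can be verified with a tiny example (illustrative snippet, assuming torchvision is installed): two perfectly overlapping boxes of different classes both survive NMS because the offset moves them into disjoint coordinate regions.

import torch
import torchvision

boxes = torch.tensor([[10., 10., 50., 50.],    # class 0
                      [10., 10., 50., 50.]])   # class 1, same location
scores = torch.tensor([0.9, 0.8])
classes = torch.tensor([[0.], [1.]])

keep = torchvision.ops.nms(boxes + classes * 7680, scores, 0.7)
print(keep)  # tensor([0, 1]) - both kept; without the offset only index 0 would remain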

2.4 Coordinate Scale Restoration

Finally scale bounding box coordinates back to original image size:

# 9-10. Scale to original size
pred = x[i]  # Keep boxes after NMS
pred[:, :4] = self.scale_boxes(img.shape[2:], pred[:, :4], orig_img.shape)

3. Core Method Analysis

3.1 dist2bbox Method

Convert predicted distances to bounding box coordinates:

def dist2bbox(self, distance, anchor_points, xywh=True, dim=-1):
    """Convert predicted distances to bounding box coordinates"""
    lt, rb = torch.split(distance, 2, dim)
    x1y1 = anchor_points - lt
    x2y2 = anchor_points + rb
    
    if xywh:
        c_xy = (x1y1 + x2y2) / 2
        wh = x2y2 - x1y1
        return torch.cat((c_xy, wh), dim)
    return torch.cat((x1y1, x2y2), dim)

3.2 scale_boxes Method

Scale detection box coordinates from model input size back to original image size:

def scale_boxes(self, img1_shape, boxes, img0_shape):
    """
    img1_shape: Model input size (640, 640)
    boxes: Detection box coordinates (x1,y1,x2,y2)
    img0_shape: Original image size (h, w)
    """
    # 1. Calculate scale ratio: gain = input size / original size
    gain = min(img1_shape[0] / img0_shape[0],    # height ratio
              img1_shape[1] / img0_shape[1])      # width ratio
    
    # 2. Calculate padding amount
    pad = ((img1_shape[1] - img0_shape[1] * gain) / 2,     # width padding
           (img1_shape[0] - img0_shape[0] * gain) / 2)      # height padding
    
    # 3. Remove padding
    boxes[..., [0, 2]] -= pad[0]  # x coordinates minus horizontal padding
    boxes[..., [1, 3]] -= pad[1]  # y coordinates minus vertical padding
    
    # 4. Scale back to original size and return the result
    boxes[..., :4] /= gain
    return boxes

Example analysis:

# Suppose:
img1_shape = (640, 640)    # Model input size
img0_shape = (800, 600)    # Original image size

# 1. Calculate scale ratio
gain = min(640/800, 640/600)  # = min(0.8, 1.067) = 0.8

# 2. Calculate padding amount
pad_w = (640 - 600 * 0.8) / 2  # Width direction padding
pad_h = (640 - 800 * 0.8) / 2  # Height direction padding

# Original image size after scaling: (800*0.8, 600*0.8) = (640, 480)
# Need to pad to 640x640, so:
# - Width padded by (640-480)/2 pixels on each side
# - Height padded by (640-640)/2 = 0 pixels on each side
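
These numbers can be reproduced directly (illustrative snippet):

img1_shape = (640, 640)   # model input (h, w)
img0_shape = (800, 600)   # original image (h, w)

gain = min(img1_shape[0] / img0_shape[0], img1_shape[1] / img0_shape[1])
pad = ((img1_shape[1] - img0_shape[1] * gain) / 2,
       (img1_shape[0] - img0_shape[0] * gain) / 2)

print(gain, pad)   # 0.8 (80.0, 0.0)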

Scaling example:

# Suppose:
# - Original image: 1000x800
# - Model input: 640x640
# - gain = 640/1000 = 0.64 (scale ratio)

# 1. Model predicted detection box (on the 640x640 letterboxed input, after
#    the padding has already been removed as in step 3 of scale_boxes)
box = torch.tensor([100., 200., 300., 400.])  # [x1, y1, x2, y2]

# 2. Scale back to original size
box /= gain  # equivalent to box / 0.64
# [100/0.64, 200/0.64, 300/0.64, 400/0.64]
# = [156.25, 312.5, 468.75, 625]

4. Complete Code Reference
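
Both snippets below are methods of a detector class (they reference self.nc, self.stride, etc.) and assume the following module-level imports, which the original does not show:

import torch
import torch.nn as nn
import torchvision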

Complete Implementation of decode_predictions

def decode_predictions(self, predictions):
    """Decode model's raw output into detection boxes and class probabilities
    Args:
        predictions: Model's raw prediction results
    """
    # Generate anchor points and corresponding strides
    anchors, strides = (x.transpose(0, 1) for x in 
                      self.make_anchors(predictions[1], self.stride, 0.5))
    
    # Concatenate prediction results from three feature maps
    x_cat = torch.cat([xi.view(1, self.nc + 16 * 4, -1) for xi in predictions[1]], 2)
    # Separate bounding box predictions and class predictions
    box, cls = x_cat.split((16 * 4, self.nc), 1)  # 16*4 for DFL decoding, nc is number of classes
    b, c, a = box.shape
    conv = nn.Conv2d(16, 1, 1, bias=False).requires_grad_(False)
    x = torch.arange(16, dtype=torch.float)
    conv.weight.data[:] = nn.Parameter(x.view(1, 16, 1, 1))
    
    dfl = conv(box.view(b, 4, 16, a).transpose(2, 1).softmax(1)).view(b, 4, a)
    dbox = self.dist2bbox(dfl, anchors.unsqueeze(0), xywh=True, dim=1) * strides
    
    return torch.cat((dbox, cls.sigmoid()), 1)

Complete Implementation of post_process

def post_process(self, pred, img, orig_img, conf_thres=0.25, iou_thres=0.7, max_wh=7680):
    """Post-processing: Execute Non-Maximum Suppression (NMS)"""
    # 1. Process validation mode output
    prediction = pred[0] if isinstance(pred, (list, tuple)) else pred

    # 2. Filter candidate boxes based on confidence threshold
    xc = prediction[:, 4:4 + self.nc].amax(1) > conf_thres  # 4 + nc = 84 when nc = 80

    # 3. Adjust prediction results shape and format
    prediction = prediction.transpose(-1, -2)
    prediction[..., :4] = self.xywh2xyxy(prediction[..., :4])

    # 4. Get prediction results for current batch
    xi = 0
    x = prediction[xi]
    
    # 5. Filter prediction boxes based on confidence
    x = x[xc[xi]]

    # 6. Separate bounding box and class information
    box, cls = x.split((4, self.nc), 1)
    
    # 7. Get maximum confidence and corresponding class for each box
    conf, j = cls.max(1, keepdim=True)
    x = torch.cat((box, conf, j.float()), 1)[conf.view(-1) > conf_thres]

    # 8. Execute NMS
    c = x[:, 5:6] * max_wh
    boxes, scores = x[:, :4] + c, x[:, 4]
    i = torchvision.ops.nms(boxes, scores, iou_thres)

    # 9. Get final prediction results
    pred = x[i]
    
    # 10. Scale bounding box coordinates to original image size
    pred[:, :4] = self.scale_boxes(img.shape[2:], pred[:, :4], orig_img.shape)

    return pred
