Evolution of CNN

📝 Update Log

  • 2024-05-26: Added latest progress in vision models for 2023-2024, included Phase 6 architecture analysis
  • 2023-11-15: Added analysis of current CNN patterns
  • 2023-09-15: Expanded Phase 5 (2020-present) architecture introduction, added ConvNeXt analysis
  • 2020-03-01: Initial article publication

Phase 1: Foundation Era (1998-2011)

LeNet-5 (1998)

Proposed by: Yann LeCun et al.

Main Architecture:

  • 7-layer structure: 3 convolutional layers, 2 pooling layers, 2 fully connected layers (sketched in code after this list)
  • Uses 5×5 convolution kernels
  • Uses sigmoid/tanh activation functions
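
A minimal PyTorch sketch of this layout (illustrative, not the original implementation; the paper used trainable subsampling layers, approximated here with average pooling):

```python
import torch
import torch.nn as nn

# Sketch of the LeNet-5 layout: 3 conv layers (C1, C3, C5), 2 pooling layers
# (S2, S4), and 2 fully connected layers (F6 + output), on 32x32 grayscale input.
class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),     # C1: 32x32 -> 28x28
            nn.AvgPool2d(2),                               # S2: 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),    # C3: 14x14 -> 10x10
            nn.AvgPool2d(2),                               # S4: 10x10 -> 5x5
            nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),  # C5: 5x5 -> 1x1
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84), nn.Tanh(),                 # F6
            nn.Linear(84, num_classes),                    # output layer
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# LeNet5()(torch.randn(1, 1, 32, 32)).shape -> torch.Size([1, 10])
```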

Breakthroughs:

  • First successful application to real problems (handwritten digit recognition)
  • Established the basic paradigm of “convolutional layer-pooling layer-fully connected layer”
  • Introduced weight sharing concept to reduce parameter count

Limitations:

  • Network relatively shallow due to computational resource constraints
  • Lacked modern training techniques such as batch normalization and the ReLU activation function

The emergence of LeNet-5 marked the official birth of CNNs, but development was slow in the following decade due to limited computational capabilities and the superior performance of other traditional machine learning methods. It wasn’t until the improvement in GPU computing power and the availability of large-scale training data that things turned around.

Phase 2: Deep Learning Explosion (2012-2014)

AlexNet (2012) - The Spark of Deep Learning Revolution

Proposed by: Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton

Main Architecture:

  • 8 layers: 5 convolutional layers, 3 fully connected layers
  • First large-scale use of ReLU activation function
  • Uses overlapping max pooling: the 3×3 pooling window is larger than the stride of 2, so adjacent output units have overlapping receptive fields, which mitigates overfitting, smooths feature transitions, and enlarges the effective receptive field (see the sketch after this list)
  • Uses Dropout to prevent overfitting
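
A one-line PyTorch illustration of overlapping pooling; the 3×3 window with stride 2 matches the values reported for AlexNet (the feature map size is illustrative):

```python
import torch
import torch.nn as nn

# Overlapping max pooling: the 3x3 window is larger than the stride of 2,
# so adjacent pooling outputs share input rows/columns.
overlap_pool = nn.MaxPool2d(kernel_size=3, stride=2)

x = torch.randn(1, 96, 55, 55)   # e.g., a feature map after AlexNet's first conv layer
print(overlap_pool(x).shape)     # torch.Size([1, 96, 27, 27])
```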

Breakthroughs:

  • 2012 ImageNet challenge winner, top-5 error rate reduced from 26.2% to 15.3%
  • Landmark event in deep learning revolution
  • Demonstrated importance of GPU for training deep networks
  • First large-scale use of data augmentation: multi-scale cropping, horizontal flipping, PCA color jittering

ZFNet (2013) - Opening the CNN Black Box

Proposed by: Matthew Zeiler and Rob Fergus

Main Architecture:

  • Improved version of AlexNet
  • Smaller first layer convolution kernels (7×7 instead of 11×11)
  • Smaller stride

Breakthroughs:

  • 2013 ImageNet challenge winner
  • First use of visualization techniques to explain CNN internal mechanisms
  • Introduced “deconvolution” (transposed convolution) to project feature activations back to input pixel space; despite the name, it is not a true inverse of convolution

Contributions:

  • Deep understanding of CNN feature learning process
  • Laid foundation for CNN interpretability research

VGGNet (2014) - Simple but Deep

Proposed by: Visual Geometry Group (VGG), University of Oxford

Main Architecture:

  • Uses uniform 3×3 small convolution kernel stacking
  • Depth varies from 11 layers (VGG11) to 19 layers (VGG19)
  • 2×2 max pooling layers
  • Three fully connected layer structure

Breakthroughs:

  • Demonstrated critical impact of “depth” on performance
  • Replaced large convolution kernels with stacks of small ones: two 3×3 conv layers cover the same 5×5 receptive field as one 5×5 layer with fewer parameters and less computation (worked through after this list), though the resulting features are more uniform in scale than in the later Inception design
  • Simple and unified network structure design philosophy
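
The parameter saving is easy to verify; a small sketch with an illustrative channel count of C = 64 (two 3×3 layers cost 2·9·C² weights versus 25·C² for one 5×5 layer):

```python
import torch.nn as nn

C = 64  # illustrative channel count

one_5x5 = nn.Conv2d(C, C, kernel_size=5, bias=False)
two_3x3 = nn.Sequential(nn.Conv2d(C, C, kernel_size=3, bias=False),
                        nn.Conv2d(C, C, kernel_size=3, bias=False))

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(one_5x5))  # 25 * C*C = 102400
print(count(two_3x3))  # 18 * C*C = 73728 (~28% fewer, same 5x5 receptive field)
```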

Impact:

  • Still used as feature extraction backbone network today
  • Simple, uniform design philosophy that remains influential
  • First systematic study of network depth impact

GoogLeNet/Inception-v1 (2014) - Multi-scale Feature Intelligence

Proposed by: Google team

Main Architecture:

  • 22-layer deep network
  • Introduced “Inception module”: parallel multiple size convolution kernels
  • Uses 1×1 convolutions for dimension reduction (see the sketch after this list)
  • Introduced auxiliary classifiers to help training
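
A simplified sketch of one Inception module; the branch channel counts below follow GoogLeNet's first Inception block (3a), but the parallel structure is the point, not the exact numbers:

```python
import torch
import torch.nn as nn

# Simplified Inception module: parallel 1x1, 3x3, 5x5 convolutions and max pooling,
# with 1x1 convolutions reducing channel depth before the expensive branches.
class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)                              # 1x1 branch
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 96, 1),               # reduce, then 3x3
                                nn.Conv2d(96, 128, 3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),               # reduce, then 5x5
                                nn.Conv2d(16, 32, 5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),  # pool, then project
                                nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        # Branch outputs share spatial size and are concatenated along channels.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# InceptionBlock(192)(torch.randn(1, 192, 28, 28)).shape -> (1, 256, 28, 28)
```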

Breakthroughs:

  • 2014 ImageNet challenge winner
  • Significantly reduced parameter count (about 5M, roughly 12× fewer than AlexNet's ~60M)
  • Introduced modular design thinking
  • Solved balance between computational efficiency and model expressiveness

Contributions:

  • Proved complex networks can be efficiently designed
  • 1×1 convolution became standard design tool
  • Initiated “network in network” design paradigm

GoogLeNet fundamentally changed the approach to CNN design: network structure was no longer a matter of simply stacking layers.

Phase 3: Architecture Innovation (2015-2017)

ResNet (2015) - Breakthrough in Ultra-Deep Networks

Proposed by: Kaiming He's team (Microsoft Research)

Main Architecture:

  • Ultra-deep network (from 34 to 152 layers, later even to 1000+ layers)
  • Core innovation: Residual Block
  • Formula: H(x) = F(x) + x, where the stacked layers learn the residual F(x) and a shortcut adds the input directly to the output (sketched after this list)
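
A minimal sketch of the basic two-layer residual block (the ResNet-18/34 variant; the 1×1 projection shortcut used when dimensions change is omitted):

```python
import torch
import torch.nn as nn

# Basic residual block: output = ReLU(F(x) + x), where F is two 3x3 conv layers.
class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)   # H(x) = F(x) + x

# ResidualBlock(64)(torch.randn(1, 64, 56, 56)).shape -> (1, 64, 56, 56)
```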

Breakthroughs:

  • 2015 ImageNet challenge winner (3.57% top-5 error, the first to surpass the human-level benchmark of roughly 5.1%)
  • Fundamentally solved degradation problem in deep networks
  • Top-5 error rate reduced from AlexNet's 15.3% to 3.57%

Historical Significance:

  • One of the most important innovations in CNN history
  • Made truly deep networks possible
  • ResNet and its variants still mainstream architecture today
  • Residual learning became standard technique in deep learning

ResNet’s emergence was a milestone in CNN history. Not only did it first break through hundred-layer depth, more importantly, it proposed an elegant solution to overcome deep network degradation. The simple yet effective design of residual connections has since become a standard component in almost all deep networks.

Inception-v2/v3 (2015) - Refined Module Design

Proposed by: Google team

Main Architecture:

  • Improved Inception module
  • Factorized large convolution kernels (e.g., 7×7 into 1×7 followed by 7×1; see the sketch after this list)
  • Introduced batch normalization
  • More effective dimension reduction strategy
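
A sketch of the asymmetric factorization with an illustrative channel count: a 1×7 convolution followed by a 7×1 convolution covers the same 7×7 receptive field with 14·C² weights instead of 49·C²:

```python
import torch.nn as nn

C = 64  # illustrative channel count

# Asymmetric factorization: 7x7 -> 1x7 followed by 7x1 (same receptive field).
full_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)
factored = nn.Sequential(nn.Conv2d(C, C, (1, 7), padding=(0, 3), bias=False),
                         nn.Conv2d(C, C, (7, 1), padding=(3, 0), bias=False))

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full_7x7))  # 49 * C*C = 200704
print(count(factored))  # 14 * C*C = 57344 (~3.5x fewer)
```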

Breakthroughs:

  • Further reduced parameter count while improving performance
  • Successfully applied convolution factorization technique
  • Proved effectiveness of asymmetric convolutions

Contributions:

  • Asymmetric convolution design influenced subsequent lightweight networks
  • Batch normalization became standard training technique

In short, this generation's innovations of batch normalization (BN) and asymmetric convolutions both became standard deep learning techniques.

Inception-v4 and Inception-ResNet (2016) - Beginning of Architecture Fusion

Proposed by: Google team

Main Architecture:

  • Combined Inception architecture with ResNet residual connections
  • More unified, simplified Inception module
  • Used residual scaling to prevent instability

Breakthroughs:

  • Proved residual connections can combine with various architectures
  • Improved training speed and model performance
  • Demonstrated power of architecture hybridization

Impact:

  • Promoted research in model fusion and architecture hybridization
  • Paved way for subsequent “hybrid” architectures

Inception-ResNet represented a new trend in CNN development: combining advantages of different architectures to create more powerful networks. This “taking the best of both worlds” approach provided a new direction for subsequent network design.

DenseNet (2016) - Ultimate Feature Reuse

Proposed by: Gao Huang et al.

Main Architecture:

  • Dense connections: each layer directly connected to all its preceding layers
  • Feature reuse: information is transmitted through concatenation rather than addition (see the sketch after this list)
  • Bottleneck layer design reduces parameter count
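
A minimal sketch of a dense block; the growth rate and layer count are illustrative, and the 1×1 bottleneck convolutions are omitted for brevity:

```python
import torch
import torch.nn as nn

# Dense block: layer i receives the concatenation of the input and all
# previous layers' outputs; each layer adds `growth` new channels.
class DenseBlock(nn.Module):
    def __init__(self, in_ch, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth), nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth, growth, 3, padding=1, bias=False))
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))  # concatenation, not addition
        return torch.cat(features, dim=1)

# DenseBlock(64)(torch.randn(1, 64, 28, 28)).shape -> (1, 64 + 4*32, 28, 28)
```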

Breakthroughs:

  • Achieved similar performance to ResNet with fewer parameters
  • Alleviated vanishing gradient problem
  • Improved feature propagation efficiency
  • Strong regularization effect reduces overfitting

Contributions:

  • Another paradigm for solving deep network training problems
  • Influenced subsequent feature reuse and connection strategy design

If ResNet solved deep network training problems through “shortcuts”, then DenseNet achieved more efficient feature propagation through “highways”. While these approaches differ in methodology, they both aimed at the same goal: making deep network training more stable and efficient. DenseNet’s dense connection mechanism brought stronger feature reuse capability and regularization effect.

Phase 4: Efficiency and Lightweight Era (2017-2019)

MobileNet Series (2017-2019) - AI Revolution for Mobile Devices

Founders: Google team

Main Architecture:

  • Depthwise separable convolution: decomposes a standard convolution into a depthwise convolution followed by a pointwise (1×1) convolution (see the sketch after this list)
  • MobileNetV2 introduced inverted residual structure
  • MobileNetV3 combined neural architecture search and SE modules
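
A sketch of the decomposition with illustrative channel counts; for a 3×3 kernel the theoretical cost ratio is 1/N + 1/9 of a standard convolution (N = output channels), which is where the 8-9× figure comes from:

```python
import torch.nn as nn

C_in, C_out = 128, 128  # illustrative channel counts

standard = nn.Conv2d(C_in, C_out, 3, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(C_in, C_in, 3, padding=1, groups=C_in, bias=False),  # depthwise: one filter per channel
    nn.Conv2d(C_in, C_out, 1, bias=False),                         # pointwise: 1x1 across channels
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))   # 3*3*128*128 = 147456
print(count(separable))  # 3*3*128 + 128*128 = 17536 (~8.4x fewer; FLOPs scale the same way)
```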

Breakthroughs:

  • Greatly reduced computational cost (roughly 8-9× fewer multiply-adds than a standard 3×3 convolution)
  • Suitable for mobile and embedded devices
  • Proposed width multiplier and resolution multiplier to adjust computational complexity

Significance:

  • Opened new direction in lightweight CNN research
  • Made deep learning practical on resource-constrained devices
  • Influenced all subsequent mobile network designs

The MobileNet series marked an important shift in CNN research: from pursuing ultimate performance to balancing computational efficiency and practicality. This shift enabled deep learning to move beyond cloud data centers into everyday devices like smartphones, greatly expanding CNN’s application scenarios.

SENet (2017) - Pioneer of Attention Mechanism

Proposed by: Jie Hu et al.

Main Architecture:

  • Introduced “Squeeze-and-Excitation” (SE) module
  • “Squeeze” features through global pooling
  • “Excite” feature channels through two fully connected layers
  • Can be inserted into any existing architecture (see the sketch after this list)
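
A minimal sketch of the SE module (reduction ratio 16 as in the paper); its output is just a set of per-channel weights that rescale the feature map, which is why it slots into any architecture:

```python
import torch
import torch.nn as nn

# Squeeze-and-Excitation: global-pool each channel ("squeeze"), pass through a
# two-layer bottleneck ("excite"), and rescale channels by the resulting weights.
class SEModule(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                         # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                    # squeeze: (B, C)
        w = self.fc(w).view(x.size(0), -1, 1, 1)  # excite: per-channel weights in (0, 1)
        return x * w                              # recalibrate channels

# SEModule(256)(torch.randn(2, 256, 14, 14)).shape -> (2, 256, 14, 14)
```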

Breakthroughs:

  • 2017 ImageNet challenge winner
  • Explicitly modeled dependencies between feature channels
  • Significant performance improvement with minimal parameter increase (~10%)

Impact:

  • Pioneered channel attention mechanism research
  • Influenced all subsequent attention mechanism designs
  • SE module became standard component widely adopted

SENet’s importance far exceeded its performance improvement; it introduced the concept of “attention” into CNN design, inspiring a series of subsequent innovations based on attention mechanisms. This lightweight yet effective design also perfectly aligned with the efficiency-seeking trend of the time.

EfficientNet (2019) - Scientific Method for Network Scaling

Proposed by: Google (Mingxing Tan, Quoc V. Le)

Main Architecture:

  • Based on MobileNetV2’s mobile inverted bottleneck structure
  • Uses SE modules
  • Core innovation: compound scaling, which jointly balances network width, depth, and input resolution (worked through after this list)
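
A numeric sketch of compound scaling: the paper fixes α=1.2, β=1.1, γ=1.15 by grid search under the constraint α·β²·γ² ≈ 2, then grows all three dimensions together with a single compound coefficient φ:

```python
# Compound scaling: depth d = alpha**phi, width w = beta**phi, resolution r = gamma**phi,
# with alpha * beta**2 * gamma**2 ~= 2, so total FLOPs grow by roughly 2**phi.
alpha, beta, gamma = 1.2, 1.1, 1.15  # coefficients reported in the EfficientNet paper

for phi in range(4):                 # phi = 0 corresponds to the EfficientNet-B0 baseline
    d, w, r = alpha ** phi, beta ** phi, gamma ** phi
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```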

Breakthroughs:

  • First systematic solution to CNN scaling problem
  • Achieved state-of-the-art performance with fewer parameters and computations
  • EfficientNet-B7: 84.4% ImageNet top-1 accuracy, matching the best prior models while being far smaller and faster

Significance:

  • Provided new paradigm for CNN design
  • Provided theoretical foundation for network scale and computational efficiency optimization
  • Became benchmark for lightweight efficient network design

EfficientNet represented an important milestone in CNN design: from art to science. Through systematic study of the network scaling problem, it provided theoretical guidance for balancing various dimensions, making model design no longer completely dependent on experience and intuition. The impact of this methodology extended far beyond the performance improvement of the network itself.

Phase 5: Paradigm Shift and Fusion (2020-2022)

Vision Transformer (ViT, 2020) - Paradigm Revolution in Vision Models

Founders: Google team

Main Architecture:

  • Not traditional CNN, but direct application of Transformer to images
  • Splits the image into a sequence of fixed-size patches (patch embedding sketched after this list)
  • Uses pure self-attention mechanism for visual tasks
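
A sketch of the patch-splitting step: a convolution whose kernel size equals its stride embeds each 16×16 patch as one token (sizes follow ViT-Base; class token and position embeddings omitted):

```python
import torch
import torch.nn as nn

# Patch embedding: a 224x224 image with 16x16 patches becomes a sequence of
# 14*14 = 196 tokens, each a learned linear projection of one patch.
patch, dim = 16, 768
embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

img = torch.randn(1, 3, 224, 224)
tokens = embed(img).flatten(2).transpose(1, 2)  # (1, 768, 14, 14) -> (1, 196, 768)
print(tokens.shape)                             # torch.Size([1, 196, 768])
```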

Breakthroughs:

  • Proved non-CNN architectures can excel at vision tasks
  • Surpassed CNN performance on large-scale datasets
  • Successful application of self-attention mechanism in vision domain

Significance:

  • Opened new paradigm for vision models
  • Promoted unification of vision and language model architectures
  • Drove subsequent research in CNN and Transformer hybridization

The emergence of ViT was a turning point in vision model development, challenging CNN’s status as the sole dominant architecture for vision tasks. Although strictly speaking ViT doesn’t belong to the CNN family, it profoundly influenced CNN development, prompting researchers to rethink vision model design principles.

Swin Transformer (2021) - Milestone in Hierarchical Vision Transformers

Proposed by: Microsoft Research (Ze Liu, Yutong Lin, et al.)

Main Architecture:

  • Hierarchical design: adopts CNN-like multi-scale feature hierarchy
  • Shifted window attention mechanism: balances computational efficiency with cross-window connections (window partitioning sketched after this list)
  • Relative position encoding: enhances spatial position awareness
  • Coarse-to-fine feature pyramid structure
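
A sketch of the window partitioning behind the linear complexity: self-attention is computed only inside fixed M×M windows (M = 7 in the paper), so the quadratic term depends on the fixed window size, not the full image resolution:

```python
import torch

# Partition a feature map into non-overlapping M x M windows; Swin computes
# self-attention within each window independently.
def window_partition(x, window_size):      # x: (B, H, W, C), channels-last
    B, H, W, C = x.shape
    m = window_size
    x = x.view(B, H // m, m, W // m, m, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, m * m, C)

x = torch.randn(2, 56, 56, 96)             # e.g., a stage-1 feature map
print(window_partition(x, 7).shape)        # torch.Size([128, 49, 96]) = (2*8*8, 7*7, 96)
```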

Breakthroughs:

  • Successfully addressed ViT’s limitations in dense prediction tasks
  • Computational complexity linear in image size (attention is restricted to fixed-size windows), versus the quadratic cost of standard global self-attention
  • Excellent performance in downstream tasks like object detection and semantic segmentation
  • Achieved SOTA results on COCO object detection and ADE20K semantic segmentation benchmarks

Significance:

  • Bridged design gap between CNN and Transformer
  • Paved way for practical applications of vision Transformers
  • Influenced numerous subsequent vision architecture designs
  • Became paradigm for hierarchical vision Transformers

ConvNeXt (2022) - Perfect Fusion of Traditional and Modern

Proposed by: Meta AI

Main Architecture:

  • “Modernized” pure CNN design
  • Reconstructed the CNN by borrowing design principles from Transformers (block design sketched after this list)
  • Retained CNN’s local inductive bias
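
A sketch of a ConvNeXt-style block showing the Transformer-inspired choices on a pure-convolution substrate: a large 7×7 depthwise convolution, LayerNorm, an inverted-bottleneck MLP with GELU, and a residual connection (layer scale and stochastic depth omitted):

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise 7x7
        self.norm = nn.LayerNorm(dim)  # normalizes over channels (channels-last)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                        # x: (B, C, H, W)
        residual = x
        x = self.dwconv(x).permute(0, 2, 3, 1)   # -> (B, H, W, C)
        x = self.mlp(self.norm(x))
        return residual + x.permute(0, 3, 1, 2)  # back to (B, C, H, W)

# ConvNeXtBlock(96)(torch.randn(1, 96, 56, 56)).shape -> (1, 96, 56, 56)
```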

Breakthroughs:

  • Proved modernized CNN can match Transformer performance
  • Combined CNN’s efficient computation with Transformer’s advanced design principles
  • Achieved excellent performance on various vision tasks

Significance:

  • Convergence of CNN and Transformer architectures
  • Demonstrated classic CNN’s continued vitality
  • Exemplar of mutual learning between two major vision model paradigms

ConvNeXt represented the latest direction in CNN development: drawing inspiration from Transformer while retaining CNN’s advantages, achieving fusion of two paradigms. This “reverse borrowing” showed that classic CNN architecture wasn’t obsolete, but gained new life under inspiration from new ideas.

Phase 6: Multiple Paradigms and Efficient Fusion (2023-2024)

InternImage (2023) - Ultimate Application of Deformable Convolution

Proposed by: Shanghai AI Laboratory and SenseTime

Main Architecture:

  • Design based on large-scale deformable convolution network (DCNv3)
  • Multi-level feature extraction and adaptive receptive field adjustment
  • Decoupled content-position modeling and feature aggregation
  • Integrated large-scale data training strategies

Breakthroughs:

  • Showed that a pure convolutional architecture can match or surpass Transformers across major vision tasks
  • Achieved adaptive spatial modeling without explicit attention mechanism
  • Achieved SOTA results in object detection, instance segmentation, etc.
  • Proposed concept of large-scale deformable vision backbone

Significance:

  • Re-established convolution architecture’s dominant position in vision domain
  • Elevated deformable convolution from auxiliary component to core building block
  • Created new paradigm for universal vision backbone
  • Bridged performance gap between CNN and Transformer

InternImage represented a reinvention of the traditional convolutional neural network, achieving performance surpassing Transformers by pushing deformable convolution to its limits. It proved that classic CNN ideas still have strong potential in competition with emerging architectures, providing new direction for vision architecture design.

Vision Mamba (2024) - Vision Innovation with State Space Models

Proposed by: Shanghai AI Lab

Main Architecture:

  • Applied selective state space model (SSM/Mamba) to vision tasks
  • Bidirectional scanning strategy captures long-range dependencies in 2D images
  • Linear computational complexity, breaking the quadratic bottleneck of the self-attention mechanism

Breakthroughs:

  • More efficient than ViT in modeling long-range spatial dependencies
  • Achieved competitive performance with ViT at similar parameter count
  • Significantly better inference speed and memory consumption than Transformer

Technical Details:

  • Uses scan-aggregate strategy to process 2D images
  • Combines CNN’s local perception ability with SSM’s long-range modeling capability
  • Achieves efficient inference through hardware-aware design

Significance:

  • Opened third technical route for vision models
  • Provided efficient alternative for computation-constrained scenarios
  • Vision and language model architectures converged again

SAM (2023) - Foundation Model for Segment Anything

Proposed by: Meta AI Research

Main Architecture:

  • Composed of image encoder, prompt encoder, and mask decoder
  • ViT-based backbone with lightweight mask prediction head
  • Supports multiple prompt inputs: points, boxes, text, or masks
  • Zero-shot segmentation capability and interactive segmentation design

Breakthroughs:

  • First universal segmentation model trained on over 1 billion masks
  • Can perform zero-shot segmentation on arbitrary objects
  • Created new paradigm for prompt-driven visual understanding
  • Created SA-1B dataset containing over 1.1 billion masks

Significance:

  • Created new direction for vision foundation models
  • Introduced interactive understanding into vision model design
  • Changed computer vision task design approach
  • Provided powerful visual understanding foundation for downstream applications

SAM represented an important turning point for vision models: from fixed-task systems toward universal foundation models.
