Evolution of CNN

📝 Update Log

  • 2024-05-26: Added latest progress in vision models for 2023-2024, included Phase 6 architecture analysis
  • 2023-11-15: Added analysis of current CNN patterns
  • 2023-09-15: Expanded Phase 5 (2020-present) architecture introduction, added ConvNeXt analysis
  • 2020-03-01: Initial article publication

Phase 1: Foundation Era (1998-2011)

LeNet-5 (1998)

Proposed by: Yann LeCun et al.

Main Architecture:

  • 7-layer structure: 3 convolutional layers, 2 pooling layers, 2 fully connected layers (sketched in code after this list)
  • Uses 5×5 convolution kernels
  • Uses sigmoid/tanh activation functions
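
A minimal PyTorch sketch of this layout (illustrative, not the original implementation; the paper used trainable subsampling layers, approximated here with average pooling):

```python
import torch
import torch.nn as nn

# Sketch of the LeNet-5 layout: 3 conv layers (C1, C3, C5), 2 pooling layers
# (S2, S4), and 2 fully connected layers (F6 + output), on 32x32 grayscale input.
class LeNet5(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 6, kernel_size=5), nn.Tanh(),     # C1: 32x32 -> 28x28
            nn.AvgPool2d(2),                               # S2: 28x28 -> 14x14
            nn.Conv2d(6, 16, kernel_size=5), nn.Tanh(),    # C3: 14x14 -> 10x10
            nn.AvgPool2d(2),                               # S4: 10x10 -> 5x5
            nn.Conv2d(16, 120, kernel_size=5), nn.Tanh(),  # C5: 5x5 -> 1x1
        )
        self.classifier = nn.Sequential(
            nn.Flatten(),
            nn.Linear(120, 84), nn.Tanh(),                 # F6
            nn.Linear(84, num_classes),                    # output layer
        )

    def forward(self, x):
        return self.classifier(self.features(x))

# LeNet5()(torch.randn(1, 1, 32, 32)).shape -> torch.Size([1, 10])
```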

Breakthroughs:

  • First successful application to real problems (handwritten digit recognition)
  • Established the basic paradigm of “convolutional layer-pooling layer-fully connected layer”
  • Introduced weight sharing concept to reduce parameter count

Limitations:

  • Network relatively shallow due to computational resource constraints
  • Lacked modern training techniques such as batch normalization and the ReLU activation function

The emergence of LeNet-5 marked the official birth of CNNs, but development was slow in the following decade due to limited computational capabilities and the superior performance of other traditional machine learning methods. It wasn’t until the improvement in GPU computing power and the availability of large-scale training data that things turned around.

Phase 2: Deep Learning Explosion (2012-2014)

AlexNet (2012) - The Spark of Deep Learning Revolution

Proposed by: Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton

Main Architecture:

  • 8 layers: 5 convolutional layers, 3 fully connected layers
  • First large-scale use of ReLU activation function
  • Uses overlapping max pooling: the 3×3 pooling window is larger than the stride of 2, so adjacent output units have overlapping receptive fields, which mitigates overfitting, smooths feature transitions, and enlarges the effective receptive field (see the sketch after this list)
  • Uses Dropout to prevent overfitting
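
A one-line PyTorch illustration of overlapping pooling; the 3×3 window with stride 2 matches the values reported for AlexNet (the feature map size is illustrative):

```python
import torch
import torch.nn as nn

# Overlapping max pooling: the 3x3 window is larger than the stride of 2,
# so adjacent pooling outputs share input rows/columns.
overlap_pool = nn.MaxPool2d(kernel_size=3, stride=2)

x = torch.randn(1, 96, 55, 55)   # e.g., a feature map after AlexNet's first conv layer
print(overlap_pool(x).shape)     # torch.Size([1, 96, 27, 27])
```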

Breakthroughs:

  • 2012 ImageNet challenge winner, top-5 error rate reduced from 26.2% to 15.3%
  • Landmark event in deep learning revolution
  • Demonstrated importance of GPU for training deep networks
  • First large-scale use of data augmentation: multi-scale cropping, horizontal flipping, PCA color jittering

ZFNet (2013) - Opening the CNN Black Box

Proposed by: Matthew Zeiler and Rob Fergus

Main Architecture:

  • Improved version of AlexNet
  • Smaller first layer convolution kernels (7×7 instead of 11×11)
  • Smaller stride

Breakthroughs:

  • 2013 ImageNet challenge winner
  • First use of visualization techniques to explain CNN internal mechanisms
  • Introduced “deconvolution” (transposed convolution) to project feature activations back to input pixel space; despite the name, it is not a true inverse of convolution

Contributions:

  • Deep understanding of CNN feature learning process
  • Laid foundation for CNN interpretability research

VGGNet (2014) - Simple but Deep

Proposed by: Visual Geometry Group (VGG), University of Oxford

Main Architecture:

  • Uses uniform 3×3 small convolution kernel stacking
  • Depth varies from 11 layers (VGG11) to 19 layers (VGG19)
  • 2×2 max pooling layers
  • Three fully connected layer structure

Breakthroughs:

  • Demonstrated critical impact of “depth” on performance
  • Replaced large convolution kernels with stacks of small ones: two 3×3 conv layers cover the same 5×5 receptive field as one 5×5 layer with fewer parameters and less computation (worked through after this list), though the resulting features are more uniform in scale than in the later Inception design
  • Simple and unified network structure design philosophy
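
The parameter saving is easy to verify; a small sketch with an illustrative channel count of C = 64 (two 3×3 layers cost 2·9·C² weights versus 25·C² for one 5×5 layer):

```python
import torch.nn as nn

C = 64  # illustrative channel count

one_5x5 = nn.Conv2d(C, C, kernel_size=5, bias=False)
two_3x3 = nn.Sequential(nn.Conv2d(C, C, kernel_size=3, bias=False),
                        nn.Conv2d(C, C, kernel_size=3, bias=False))

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(one_5x5))  # 25 * C*C = 102400
print(count(two_3x3))  # 18 * C*C = 73728 (~28% fewer, same 5x5 receptive field)
```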

Impact:

  • Still used as feature extraction backbone network today
  • Simple, uniform design philosophy that remains influential
  • First systematic study of network depth impact

GoogLeNet/Inception-v1 (2014) - Multi-scale Feature Intelligence

Proposed by: Google team

Main Architecture:

  • 22-layer deep network
  • Introduced “Inception module”: parallel multiple size convolution kernels
  • Uses 1×1 convolutions for dimension reduction (see the sketch after this list)
  • Introduced auxiliary classifiers to help training
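
A simplified sketch of one Inception module; the branch channel counts below follow GoogLeNet's first Inception block (3a), but the parallel structure is the point, not the exact numbers:

```python
import torch
import torch.nn as nn

# Simplified Inception module: parallel 1x1, 3x3, 5x5 convolutions and max pooling,
# with 1x1 convolutions reducing channel depth before the expensive branches.
class InceptionBlock(nn.Module):
    def __init__(self, in_ch):
        super().__init__()
        self.b1 = nn.Conv2d(in_ch, 64, 1)                              # 1x1 branch
        self.b2 = nn.Sequential(nn.Conv2d(in_ch, 96, 1),               # reduce, then 3x3
                                nn.Conv2d(96, 128, 3, padding=1))
        self.b3 = nn.Sequential(nn.Conv2d(in_ch, 16, 1),               # reduce, then 5x5
                                nn.Conv2d(16, 32, 5, padding=2))
        self.b4 = nn.Sequential(nn.MaxPool2d(3, stride=1, padding=1),  # pool, then project
                                nn.Conv2d(in_ch, 32, 1))

    def forward(self, x):
        # Branch outputs share spatial size and are concatenated along channels.
        return torch.cat([self.b1(x), self.b2(x), self.b3(x), self.b4(x)], dim=1)

# InceptionBlock(192)(torch.randn(1, 192, 28, 28)).shape -> (1, 256, 28, 28)
```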

Breakthroughs:

  • 2014 ImageNet challenge winner
  • Significantly reduced parameter count (about 5M, roughly 12× fewer than AlexNet's ~60M)
  • Introduced modular design thinking
  • Solved balance between computational efficiency and model expressiveness

Contributions:

  • Proved complex networks can be efficiently designed
  • 1×1 convolution became standard design tool
  • Initiated “network in network” design paradigm

GoogLeNet fundamentally changed the approach to CNN design: network structure was no longer a matter of simply stacking layers.

Phase 3: Architecture Innovation (2015-2017)

ResNet (2015) - Breakthrough in Ultra-Deep Networks

Proposed by: Kaiming He's team (Microsoft Research)

Main Architecture:

  • Ultra-deep network (from 34 to 152 layers, later even to 1000+ layers)
  • Core innovation: Residual Block
  • Formula: H(x) = F(x) + x, where the stacked layers learn the residual F(x) and a shortcut adds the input directly to the output (sketched after this list)
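
A minimal sketch of the basic two-layer residual block (the ResNet-18/34 variant; the 1×1 projection shortcut used when dimensions change is omitted):

```python
import torch
import torch.nn as nn

# Basic residual block: output = ReLU(F(x) + x), where F is two 3x3 conv layers.
class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.f = nn.Sequential(
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
            nn.Conv2d(channels, channels, 3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.f(x) + x)   # H(x) = F(x) + x

# ResidualBlock(64)(torch.randn(1, 64, 56, 56)).shape -> (1, 64, 56, 56)
```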

Breakthroughs:

  • 2015 ImageNet challenge winner (3.57% top-5 error, the first to surpass the human-level benchmark of roughly 5.1%)
  • Fundamentally solved degradation problem in deep networks
  • Top-5 error rate reduced from AlexNet's 15.3% to 3.57%

Historical Significance:

  • One of the most important innovations in CNN history
  • Made truly deep networks possible
  • ResNet and its variants still mainstream architecture today
  • Residual learning became standard technique in deep learning

ResNet’s emergence was a milestone in CNN history. Not only did it first break through hundred-layer depth, more importantly, it proposed an elegant solution to overcome deep network degradation. The simple yet effective design of residual connections has since become a standard component in almost all deep networks.

Inception-v2/v3 (2015) - Refined Module Design

Proposed by: Google team

Main Architecture:

  • Improved Inception module
  • Factorized large convolution kernels (e.g., 7×7 into 1×7 followed by 7×1; see the sketch after this list)
  • Introduced batch normalization
  • More effective dimension reduction strategy
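
A sketch of the asymmetric factorization with an illustrative channel count: a 1×7 convolution followed by a 7×1 convolution covers the same 7×7 receptive field with 14·C² weights instead of 49·C²:

```python
import torch.nn as nn

C = 64  # illustrative channel count

# Asymmetric factorization: 7x7 -> 1x7 followed by 7x1 (same receptive field).
full_7x7 = nn.Conv2d(C, C, kernel_size=7, padding=3, bias=False)
factored = nn.Sequential(nn.Conv2d(C, C, (1, 7), padding=(0, 3), bias=False),
                         nn.Conv2d(C, C, (7, 1), padding=(3, 0), bias=False))

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(full_7x7))  # 49 * C*C = 200704
print(count(factored))  # 14 * C*C = 57344 (~3.5x fewer)
```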

Breakthroughs:

  • Further reduced parameter count while improving performance
  • Successfully applied convolution factorization technique
  • Proved effectiveness of asymmetric convolutions

Contributions:

  • Asymmetric convolution design influenced subsequent lightweight networks
  • Batch normalization became standard training technique

In short, this generation's innovations of batch normalization (BN) and asymmetric convolutions both became standard deep learning techniques.

Inception-v4 and Inception-ResNet (2016) - Beginning of Architecture Fusion

Proposed by: Google team

Main Architecture:

  • Combined Inception architecture with ResNet residual connections
  • More unified, simplified Inception module
  • Used residual scaling to prevent instability

Breakthroughs:

  • Proved residual connections can combine with various architectures
  • Improved training speed and model performance
  • Demonstrated power of architecture hybridization

Impact:

  • Promoted research in model fusion and architecture hybridization
  • Paved way for subsequent “hybrid” architectures

Inception-ResNet represented a new trend in CNN development: combining advantages of different architectures to create more powerful networks. This “taking the best of both worlds” approach provided a new direction for subsequent network design.

DenseNet (2016) - Ultimate Feature Reuse

Proposed by: Gao Huang et al.

Main Architecture:

  • Dense connections: each layer directly connected to all its preceding layers
  • Feature reuse: information is transmitted through concatenation rather than addition (see the sketch after this list)
  • Bottleneck layer design reduces parameter count
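
A minimal sketch of a dense block; the growth rate and layer count are illustrative, and the 1×1 bottleneck convolutions are omitted for brevity:

```python
import torch
import torch.nn as nn

# Dense block: layer i receives the concatenation of the input and all
# previous layers' outputs; each layer adds `growth` new channels.
class DenseBlock(nn.Module):
    def __init__(self, in_ch, growth=32, num_layers=4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.Sequential(
                nn.BatchNorm2d(in_ch + i * growth), nn.ReLU(inplace=True),
                nn.Conv2d(in_ch + i * growth, growth, 3, padding=1, bias=False))
            for i in range(num_layers)
        ])

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            features.append(layer(torch.cat(features, dim=1)))  # concatenation, not addition
        return torch.cat(features, dim=1)

# DenseBlock(64)(torch.randn(1, 64, 28, 28)).shape -> (1, 64 + 4*32, 28, 28)
```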

Breakthroughs:

  • Achieved similar performance to ResNet with fewer parameters
  • Alleviated vanishing gradient problem
  • Improved feature propagation efficiency
  • Strong regularization effect reduces overfitting

Contributions:

  • Another paradigm for solving deep network training problems
  • Influenced subsequent feature reuse and connection strategy design

If ResNet solved deep network training problems through “shortcuts”, then DenseNet achieved more efficient feature propagation through “highways”. While these approaches differ in methodology, they both aimed at the same goal: making deep network training more stable and efficient. DenseNet’s dense connection mechanism brought stronger feature reuse capability and regularization effect.

Phase 4: Efficiency and Lightweight Era (2017-2019)

MobileNet Series (2017-2019) - AI Revolution for Mobile Devices

Founders: Google team

Main Architecture:

  • Depthwise separable convolution: decomposes a standard convolution into a depthwise convolution followed by a pointwise (1×1) convolution (see the sketch after this list)
  • MobileNetV2 introduced inverted residual structure
  • MobileNetV3 combined neural architecture search and SE modules
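
A sketch of the decomposition with illustrative channel counts; for a 3×3 kernel the theoretical cost ratio is 1/N + 1/9 of a standard convolution (N = output channels), which is where the 8-9× figure comes from:

```python
import torch.nn as nn

C_in, C_out = 128, 128  # illustrative channel counts

standard = nn.Conv2d(C_in, C_out, 3, padding=1, bias=False)
separable = nn.Sequential(
    nn.Conv2d(C_in, C_in, 3, padding=1, groups=C_in, bias=False),  # depthwise: one filter per channel
    nn.Conv2d(C_in, C_out, 1, bias=False),                         # pointwise: 1x1 across channels
)

count = lambda m: sum(p.numel() for p in m.parameters())
print(count(standard))   # 3*3*128*128 = 147456
print(count(separable))  # 3*3*128 + 128*128 = 17536 (~8.4x fewer; FLOPs scale the same way)
```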

Breakthroughs:

  • Greatly reduced computational cost (roughly 8-9× fewer multiply-adds than a standard 3×3 convolution)
  • Suitable for mobile and embedded devices
  • Proposed width multiplier and resolution multiplier to adjust computational complexity

Significance:

  • Opened new direction in lightweight CNN research
  • Made deep learning practical on resource-constrained devices
  • Influenced all subsequent mobile network designs

The MobileNet series marked an important shift in CNN research: from pursuing ultimate performance to balancing computational efficiency and practicality. This shift enabled deep learning to move beyond cloud data centers into everyday devices like smartphones, greatly expanding CNN’s application scenarios.

SENet (2017) - Pioneer of Attention Mechanism

Proposed by: Jie Hu et al.

Main Architecture:

  • Introduced “Squeeze-and-Excitation” (SE) module
  • “Squeeze” features through global pooling
  • “Excite” feature channels through two fully connected layers
  • Can be inserted into any existing architecture (see the sketch after this list)
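
A minimal sketch of the SE module (reduction ratio 16 as in the paper); its output is just a set of per-channel weights that rescale the feature map, which is why it slots into any architecture:

```python
import torch
import torch.nn as nn

# Squeeze-and-Excitation: global-pool each channel ("squeeze"), pass through a
# two-layer bottleneck ("excite"), and rescale channels by the resulting weights.
class SEModule(nn.Module):
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):                         # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))                    # squeeze: (B, C)
        w = self.fc(w).view(x.size(0), -1, 1, 1)  # excite: per-channel weights in (0, 1)
        return x * w                              # recalibrate channels

# SEModule(256)(torch.randn(2, 256, 14, 14)).shape -> (2, 256, 14, 14)
```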

Breakthroughs:

  • 2017 ImageNet challenge winner
  • Explicitly modeled dependencies between feature channels
  • Significant performance improvement with minimal parameter increase (~10%)

Impact:

  • Pioneered channel attention mechanism research
  • Influenced all subsequent attention mechanism designs
  • SE module became standard component widely adopted

SENet’s importance far exceeded its performance improvement; it introduced the concept of “attention” into CNN design, inspiring a series of subsequent innovations based on attention mechanisms. This lightweight yet effective design also perfectly aligned with the efficiency-seeking trend of the time.

EfficientNet (2019) - Scientific Method for Network Scaling

Proposed by: Google (Mingxing Tan, Quoc V. Le)

Main Architecture:

  • Based on MobileNetV2’s mobile inverted bottleneck structure
  • Uses SE modules
  • Core innovation: compound scaling, which jointly balances network width, depth, and input resolution (worked through after this list)
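
A numeric sketch of compound scaling: the paper fixes α=1.2, β=1.1, γ=1.15 by grid search under the constraint α·β²·γ² ≈ 2, then grows all three dimensions together with a single compound coefficient φ:

```python
# Compound scaling: depth d = alpha**phi, width w = beta**phi, resolution r = gamma**phi,
# with alpha * beta**2 * gamma**2 ~= 2, so total FLOPs grow by roughly 2**phi.
alpha, beta, gamma = 1.2, 1.1, 1.15  # coefficients reported in the EfficientNet paper

for phi in range(4):                 # phi = 0 corresponds to the EfficientNet-B0 baseline
    d, w, r = alpha ** phi, beta ** phi, gamma ** phi
    print(f"phi={phi}: depth x{d:.2f}, width x{w:.2f}, resolution x{r:.2f}")
```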

Breakthroughs:

  • First systematic solution to CNN scaling problem
  • Achieved state-of-the-art performance with fewer parameters and computations
  • EfficientNet-B7: 84.4% ImageNet top-1 accuracy, matching the best prior models while being far smaller and faster

Significance:

  • Provided new paradigm for CNN design
  • Provided theoretical foundation for network scale and computational efficiency optimization
  • Became benchmark for lightweight efficient network design

EfficientNet represented an important milestone in CNN design: from art to science. Through systematic study of the network scaling problem, it provided theoretical guidance for balancing various dimensions, making model design no longer completely dependent on experience and intuition. The impact of this methodology extended far beyond the performance improvement of the network itself.

Phase 5: Paradigm Shift and Fusion (2020-2022)

Vision Transformer (ViT, 2020) - Paradigm Revolution in Vision Models

Founders: Google team

Main Architecture:

  • Not traditional CNN, but direct application of Transformer to images
  • Splits the image into a sequence of fixed-size patches (patch embedding sketched after this list)
  • Uses pure self-attention mechanism for visual tasks
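
A sketch of the patch-splitting step: a convolution whose kernel size equals its stride embeds each 16×16 patch as one token (sizes follow ViT-Base; class token and position embeddings omitted):

```python
import torch
import torch.nn as nn

# Patch embedding: a 224x224 image with 16x16 patches becomes a sequence of
# 14*14 = 196 tokens, each a learned linear projection of one patch.
patch, dim = 16, 768
embed = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)

img = torch.randn(1, 3, 224, 224)
tokens = embed(img).flatten(2).transpose(1, 2)  # (1, 768, 14, 14) -> (1, 196, 768)
print(tokens.shape)                             # torch.Size([1, 196, 768])
```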

Breakthroughs:

  • Proved non-CNN architectures can excel at vision tasks
  • Surpassed CNN performance on large-scale datasets
  • Successful application of self-attention mechanism in vision domain

Significance:

  • Opened new paradigm for vision models
  • Promoted unification of vision and language model architectures
  • Drove subsequent research in CNN and Transformer hybridization

The emergence of ViT was a turning point in vision model development, challenging CNN’s status as the sole dominant architecture for vision tasks. Although strictly speaking ViT doesn’t belong to the CNN family, it profoundly influenced CNN development, prompting researchers to rethink vision model design principles.

Swin Transformer (2021) - Milestone in Hierarchical Vision Transformers

Proposed by: Microsoft Research (Ze Liu, Yutong Lin, et al.)

Main Architecture:

  • Hierarchical design: adopts CNN-like multi-scale feature hierarchy
  • Shifted window attention mechanism: balances computational efficiency with cross-window connections (window partitioning sketched after this list)
  • Relative position encoding: enhances spatial position awareness
  • Coarse-to-fine feature pyramid structure
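
A sketch of the window partitioning behind the linear complexity: self-attention is computed only inside fixed M×M windows (M = 7 in the paper), so the quadratic term depends on the fixed window size, not the full image resolution:

```python
import torch

# Partition a feature map into non-overlapping M x M windows; Swin computes
# self-attention within each window independently.
def window_partition(x, window_size):      # x: (B, H, W, C), channels-last
    B, H, W, C = x.shape
    m = window_size
    x = x.view(B, H // m, m, W // m, m, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, m * m, C)

x = torch.randn(2, 56, 56, 96)             # e.g., a stage-1 feature map
print(window_partition(x, 7).shape)        # torch.Size([128, 49, 96]) = (2*8*8, 7*7, 96)
```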

Breakthroughs:

  • Successfully addressed ViT’s limitations in dense prediction tasks
  • Computational complexity linear in image size (attention is restricted to fixed-size windows), versus the quadratic cost of standard global self-attention
  • Excellent performance in downstream tasks like object detection and semantic segmentation
  • Achieved SOTA results on COCO object detection and ADE20K semantic segmentation benchmarks

Significance:

  • Bridged design gap between CNN and Transformer
  • Paved way for practical applications of vision Transformers
  • Influenced numerous subsequent vision architecture designs
  • Became paradigm for hierarchical vision Transformers

ConvNeXt (2022) - Perfect Fusion of Traditional and Modern

Proposed by: Meta AI

Main Architecture:

  • “Modernized” pure CNN design
  • Reconstructed the CNN by borrowing design principles from Transformers (block design sketched after this list)
  • Retained CNN’s local inductive bias
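
A sketch of a ConvNeXt-style block showing the Transformer-inspired choices on a pure-convolution substrate: a large 7×7 depthwise convolution, LayerNorm, an inverted-bottleneck MLP with GELU, and a residual connection (layer scale and stochastic depth omitted):

```python
import torch
import torch.nn as nn

class ConvNeXtBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.dwconv = nn.Conv2d(dim, dim, kernel_size=7, padding=3, groups=dim)  # depthwise 7x7
        self.norm = nn.LayerNorm(dim)  # normalizes over channels (channels-last)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x):                        # x: (B, C, H, W)
        residual = x
        x = self.dwconv(x).permute(0, 2, 3, 1)   # -> (B, H, W, C)
        x = self.mlp(self.norm(x))
        return residual + x.permute(0, 3, 1, 2)  # back to (B, C, H, W)

# ConvNeXtBlock(96)(torch.randn(1, 96, 56, 56)).shape -> (1, 96, 56, 56)
```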

Breakthroughs:

  • Proved modernized CNN can match Transformer performance
  • Combined CNN’s efficient computation with Transformer’s advanced design principles
  • Achieved excellent performance on various vision tasks

Significance:

  • Convergence of CNN and Transformer architectures
  • Demonstrated classic CNN’s continued vitality
  • Exemplar of mutual learning between two major vision model paradigms

ConvNeXt represented the latest direction in CNN development: drawing inspiration from Transformer while retaining CNN’s advantages, achieving fusion of two paradigms. This “reverse borrowing” showed that classic CNN architecture wasn’t obsolete, but gained new life under inspiration from new ideas.

Phase 6: Multiple Paradigms and Efficient Fusion (2023-2024)

InternImage (2023) - Ultimate Application of Deformable Convolution

Proposed by: Shanghai AI Laboratory and SenseTime

Main Architecture:

  • Design based on large-scale deformable convolution network (DCNv3)
  • Multi-level feature extraction and adaptive receptive field adjustment
  • Decoupled content-position modeling and feature aggregation
  • Integrated large-scale data training strategies

Breakthroughs:

  • Showed that a pure convolutional architecture can match or surpass Transformers across major vision tasks
  • Achieved adaptive spatial modeling without explicit attention mechanism
  • Achieved SOTA results in object detection, instance segmentation, etc.
  • Proposed concept of large-scale deformable vision backbone

Significance:

  • Re-established convolution architecture’s dominant position in vision domain
  • Elevated deformable convolution from auxiliary component to core building block
  • Created new paradigm for universal vision backbone
  • Bridged performance gap between CNN and Transformer

InternImage represented a reinvention of the traditional convolutional neural network, achieving performance surpassing Transformers by pushing deformable convolution to its limits. It proved that classic CNN ideas still have strong potential in competition with emerging architectures, providing new direction for vision architecture design.

Vision Mamba (2024) - Vision Innovation with State Space Models

Proposed by: Shanghai AI Lab

Main Architecture:

  • Applied selective state space model (SSM/Mamba) to vision tasks
  • Bidirectional scanning strategy captures long-range dependencies in 2D images
  • Linear computational complexity, breaking the quadratic bottleneck of the self-attention mechanism

Breakthroughs:

  • More efficient than ViT in modeling long-range spatial dependencies
  • Achieved competitive performance with ViT at similar parameter count
  • Significantly better inference speed and memory consumption than Transformer

Technical Details:

  • Uses scan-aggregate strategy to process 2D images
  • Combines CNN’s local perception ability with SSM’s long-range modeling capability
  • Achieves efficient inference through hardware-aware design

Significance:

  • Opened third technical route for vision models
  • Provided efficient alternative for computation-constrained scenarios
  • Vision and language model architectures converged again

SAM (2023) - Foundation Model for Segment Anything

Proposed by: Meta AI Research

Main Architecture:

  • Composed of image encoder, prompt encoder, and mask decoder
  • ViT-based backbone with lightweight mask prediction head
  • Supports multiple prompt inputs: points, boxes, text, or masks
  • Zero-shot segmentation capability and interactive segmentation design

Breakthroughs:

  • First universal segmentation model trained on over 1 billion masks
  • Can perform zero-shot segmentation on arbitrary objects
  • Created new paradigm for prompt-driven visual understanding
  • Created SA-1B dataset containing over 1.1 billion masks

Significance:

  • Created new direction for vision foundation models
  • Introduced interactive understanding into vision model design
  • Changed computer vision task design approach
  • Provided powerful visual understanding foundation for downstream applications

SAM represented an important turning point for vision models: from fixed-task systems toward universal foundation models.
