YOLOv5 Simplified: A Beginner's Visual Guide to Understanding Each Step

Introduction

YOLOv5 is a popular object detection model used in a wide range of applications. However, its internals can be hard to follow, especially for beginners. In this article, we walk through the YOLOv5 architecture stage by stage and visualize each component to build a clearer picture of how the model works.

YOLOv5m Architecture

Medium-scale Model with Enhanced Feature Capacity

Input Processing

Input:
(B, 3, H, W)
Output:
(B, 3, 640, 640)

Input Specifications

Shape: (B, 3, 640, 640)
Type: float32
Range: [0, 1]
Memory: ~4.9 MB per image (3 × 640 × 640 × 4 bytes, float32)

Preprocessing

Input: (B, 3, H, W)
Resize: → (B, 3, 640, 640)
Normalize: /255.0
Format: NCHW
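The preprocessing steps above can be sketched in NumPy. This assumes the image has already been letterbox-resized to 640×640 (the real YOLOv5 pipeline pads with gray borders to preserve aspect ratio); the `preprocess` function name is just for illustration.

```python
import numpy as np

def preprocess(img_hwc_uint8):
    """Convert a 640x640 uint8 HWC image into the (1, 3, 640, 640)
    float32 NCHW tensor the model expects."""
    x = img_hwc_uint8.astype(np.float32) / 255.0   # normalize to [0, 1]
    x = x.transpose(2, 0, 1)                       # HWC -> CHW
    return x[np.newaxis, ...]                      # add batch dim -> NCHW

# Example with a dummy image:
img = np.zeros((640, 640, 3), dtype=np.uint8)
out = preprocess(img)
print(out.shape, out.dtype)  # (1, 3, 640, 640) float32
```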

Backbone (CSP-Darknet53)

Input:
(B, 3, 640, 640)
Output:
P3: (B, 192, 80, 80) P4: (B, 384, 40, 40) P5: (B, 768, 20, 20)

P3 Features

Shape: (B, 192, 80, 80)
Scale: 1/8
Elements: 1,228,800
RF: 52×52

P4 Features

Shape: (B, 384, 40, 40)
Scale: 1/16
Elements: 614,400
RF: 104×104

P5 Features

Shape: (B, 768, 20, 20)
Scale: 1/32
Elements: 307,200
RF: 208×208
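A quick sanity check of these numbers: each grid size follows from the stride, and the element counts are just channels × height × width. This is plain arithmetic using the YOLOv5m channel widths from the tables above.

```python
# Per-scale feature map sizes follow directly from the strides:
# grid = 640 // stride, elements = channels * grid * grid.
scales = {"P3": (8, 192), "P4": (16, 384), "P5": (32, 768)}

for name, (stride, ch) in scales.items():
    grid = 640 // stride
    print(f"{name}: ({ch}, {grid}, {grid}) -> {ch * grid * grid:,} elements")
```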

Neck (PANet)

Input:
P3: (B, 192, 80, 80) P4: (B, 384, 40, 40) P5: (B, 768, 20, 20)
Output:
N3: (B, 192, 80, 80) N4: (B, 384, 40, 40) N5: (B, 768, 20, 20)

Top-down Path (FPN)

P5→P4: (B, 384, 40, 40)
P4→P3: (B, 192, 80, 80)
Operation: Upsample (2×, nearest)
Fusion: Concat

Bottom-up Path (PAN)

P3→P4: (B, 384, 40, 40)
P4→P5: (B, 768, 20, 20)
Operation: Strided Conv (stride 2)
Fusion: Concat
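The shape arithmetic of one top-down fusion step can be sketched with NumPy. This is a shape-only sketch: the 1×1 channel-reducing convolution is stubbed out with a slice, zeros stand in for real feature maps, and the nearest-neighbor upsample is done with `repeat`.

```python
import numpy as np

B = 1
p4 = np.zeros((B, 384, 40, 40), dtype=np.float32)
p5 = np.zeros((B, 768, 20, 20), dtype=np.float32)

# A 1x1 conv halves P5 channels (768 -> 384) before upsampling;
# here it is stubbed out as a slice, for shape purposes only.
p5_reduced = p5[:, :384]                                 # (B, 384, 20, 20)
p5_up = p5_reduced.repeat(2, axis=2).repeat(2, axis=3)   # nearest 2x upsample
fused = np.concatenate([p5_up, p4], axis=1)              # concat along channels
print(fused.shape)  # (1, 768, 40, 40)
```

In the real network, a CSP bottleneck block then squeezes the concatenated 768 channels back down before the next fusion step.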

Detection Heads

Input:
N3: (B, 192, 80, 80) N4: (B, 384, 40, 40) N5: (B, 768, 20, 20)
Output:
(B, 25200, 85)

Small Objects

Input: (B, 192, 80, 80)
Output: (B, 255, 80, 80) (255 = 3 anchors × 85 values)
Anchors: 3
Predictions: 19,200

Medium Objects

Input: (B, 384, 40, 40)
Output: (B, 255, 40, 40)
Anchors: 3
Predictions: 4,800

Large Objects

Input: (B, 768, 20, 20)
Output: (B, 255, 20, 20)
Anchors: 3
Predictions: 1,200
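The 25,200 total predictions quoted above are just the three grids times three anchors each:

```python
# Total predictions = 3 anchors per cell, summed over the three grids.
grids = [80, 40, 20]
per_scale = [3 * g * g for g in grids]   # [19200, 4800, 1200]
total = sum(per_scale)
print(per_scale, total)                  # [19200, 4800, 1200] 25200
```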

Output Processing

Input:
(B, 25200, 85)
Output:
(B, 300, 85)

Pre-NMS

Shape: (B, 25200, 85)
Boxes: 25,200
Classes: 80
Conf: 1 objectness score (85 = 4 box + 1 obj + 80 classes)

Post-NMS

Shape: (B, 300, 85)
Boxes: ≤300
Format: XYWH
Conf: > 0.25
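A minimal greedy NMS in the spirit of this step can be written in a few lines of plain Python. This is an illustrative sketch, not YOLOv5's actual implementation (which is vectorized in PyTorch and runs per class); the thresholds mirror the defaults mentioned above, and boxes here are in (x1, y1, x2, y2) corner format.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, conf_thres=0.25, iou_thres=0.45, max_det=300):
    """Greedy NMS: drop low-confidence boxes, then suppress overlaps,
    keeping at most max_det boxes in descending score order."""
    order = sorted(
        (i for i, s in enumerate(scores) if s > conf_thres),
        key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thres for j in keep):
            keep.append(i)
        if len(keep) == max_det:
            break
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] -- box 1 overlaps box 0 too much
```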

Model Summary

Performance

  • mAP@0.5: 0.451 (COCO)
  • Inference: ~8.2ms (V100)
  • FPS: ~122 (batch=1)
  • Size: 42.5MB

Architecture

  • Parameters: 21.2M
  • GFLOPs: 49.0
  • Memory: ~240MB
  • Layers: 294

Features

  • CSP Bottlenecks
  • PANet Feature Fusion
  • Multi-scale Detection
  • Auto-learning Anchors

YOLOv5 Feature Pyramid Network - Detailed Merge Process

Backbone Features: P3 (stride 8, 128 ch), P4 (stride 16, 256 ch), P5 (stride 32, 512 ch)
FPN Features: P3 (128 ch), P4 (256 ch), P5 (512 ch)
PAN Features: P3 (128 ch), P4 (256 ch), P5 (512 ch)
Final Features: P3 (128 ch), P4 (256 ch), P5 (512 ch)

(Channel counts in this diagram are for the small model, YOLOv5s; YOLOv5m scales them to 192/384/768, as shown earlier.)
Feature Pyramid Network Details:
• Backbone extracts multi-scale features with increasing semantic information but decreasing spatial resolution (P3: 128 channels, P4: 256 channels, P5: 512 channels)

YOLOv5 Multi-Scale Fusion

YOLOv5 Multi-scale Prediction Fusion (P4 + P5)

[Figure: three anchor boxes (1-3) overlaid per grid cell at the P5 and P4 scales]

Multi-scale Fusion Process

P5 Scale (8×8 grid)

(Grid sizes in this section are illustrative; with a 640×640 input the P5 grid is actually 20×20 and P4 is 40×40.)

Base Anchors:

  • Square Anchor: width 1.2, height 1.2 (for square objects)
  • Tall Anchor: width 1, height 2 (for tall/vertical objects)
  • Wide Anchor: width 2, height 1 (for wide/horizontal objects)

P4 Scale (16×16 grid)

Base Anchors:

  • Square Anchor: width 1, height 1 (for square objects)
  • Tall Anchor: width 0.8, height 1.6 (for tall/vertical objects)
  • Wide Anchor: width 1.6, height 0.8 (for wide/horizontal objects)
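How anchors turn into boxes can be made concrete with YOLOv5's decoding equations, where each raw offset is squashed through a sigmoid so centers stay near their grid cell and width/height are bounded to 4× the anchor. The grid coordinates and anchor sizes below are made-up illustrative values; in the real model, anchors are specified in pixels per detection layer.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode(tx, ty, tw, th, cx, cy, anchor_w, anchor_h, stride):
    """YOLOv5-style box decoding for one anchor at grid cell (cx, cy).
    Returns (bx, by, bw, bh) in input-image pixels."""
    bx = (2 * sigmoid(tx) - 0.5 + cx) * stride     # center x
    by = (2 * sigmoid(ty) - 0.5 + cy) * stride     # center y
    bw = (2 * sigmoid(tw)) ** 2 * anchor_w         # width, capped at 4x anchor
    bh = (2 * sigmoid(th)) ** 2 * anchor_h         # height, capped at 4x anchor
    return bx, by, bw, bh

# Zero offsets land exactly on the cell center with the raw anchor size:
print(decode(0, 0, 0, 0, cx=4, cy=4, anchor_w=64, anchor_h=64, stride=32))
# (144.0, 144.0, 64.0, 64.0)
```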
