YOLOv5 Simplified: A Beginner's Visual Guide to Understanding Each Step

Introduction

YOLOv5 is a popular object detection model used in a wide range of applications. However, its internals can be hard to follow, especially for beginners. In this article, we walk through the YOLOv5 architecture stage by stage and visualize each component to build a clearer picture of how the model works.

YOLOv5m Architecture

Medium-scale Model with Enhanced Feature Capacity

Input Processing

Input:
(B, 3, H, W)
Output:
(B, 3, 640, 640)

Input Specifications

Shape: (B, 3, 640, 640)
Type: float32
Range: [0, 1]
Memory: ~4.9 MB per image (3 × 640 × 640 × 4 bytes, float32)

Preprocessing

Input: (B, 3, H, W)
Resize: → (B, 3, 640, 640)
Normalize: /255.0
Format: NCHW
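The preprocessing steps above can be sketched in NumPy. This assumes the image has already been letterbox-resized to 640×640 (the real YOLOv5 pipeline pads with gray borders to preserve aspect ratio); the `preprocess` function name is just for illustration.

```python
import numpy as np

def preprocess(img_hwc_uint8):
    """Convert a 640x640 uint8 HWC image into the (1, 3, 640, 640)
    float32 NCHW tensor the model expects."""
    x = img_hwc_uint8.astype(np.float32) / 255.0   # normalize to [0, 1]
    x = x.transpose(2, 0, 1)                       # HWC -> CHW
    return x[np.newaxis, ...]                      # add batch dim -> NCHW

# Example with a dummy image:
img = np.zeros((640, 640, 3), dtype=np.uint8)
out = preprocess(img)
print(out.shape, out.dtype)  # (1, 3, 640, 640) float32
```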

Backbone (CSP-Darknet53)

Input:
(B, 3, 640, 640)
Output:
P3: (B, 192, 80, 80) P4: (B, 384, 40, 40) P5: (B, 768, 20, 20)

P3 Features

Shape: (B, 192, 80, 80)
Scale: 1/8
Elements: 1,228,800
RF: 52×52

P4 Features

Shape: (B, 384, 40, 40)
Scale: 1/16
Elements: 614,400
RF: 104×104

P5 Features

Shape: (B, 768, 20, 20)
Scale: 1/32
Elements: 307,200
RF: 208×208
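A quick sanity check of these numbers: each grid size follows from the stride, and the element counts are just channels × height × width. This is plain arithmetic using the YOLOv5m channel widths from the tables above.

```python
# Per-scale feature map sizes follow directly from the strides:
# grid = 640 // stride, elements = channels * grid * grid.
scales = {"P3": (8, 192), "P4": (16, 384), "P5": (32, 768)}

for name, (stride, ch) in scales.items():
    grid = 640 // stride
    print(f"{name}: ({ch}, {grid}, {grid}) -> {ch * grid * grid:,} elements")
```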

Neck (PANet)

Input:
P3: (B, 192, 80, 80) P4: (B, 384, 40, 40) P5: (B, 768, 20, 20)
Output:
N3: (B, 192, 80, 80) N4: (B, 384, 40, 40) N5: (B, 768, 20, 20)

Top-down Path (FPN)

P5→P4: (B, 384, 40, 40)
P4→P3: (B, 192, 80, 80)
Operation: Upsample (2×, nearest)
Fusion: Concat

Bottom-up Path (PAN)

P3→P4: (B, 384, 40, 40)
P4→P5: (B, 768, 20, 20)
Operation: Strided Conv (stride 2)
Fusion: Concat
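The shape arithmetic of one top-down fusion step can be sketched with NumPy. This is a shape-only sketch: the 1×1 channel-reducing convolution is stubbed out with a slice, zeros stand in for real feature maps, and the nearest-neighbor upsample is done with `repeat`.

```python
import numpy as np

B = 1
p4 = np.zeros((B, 384, 40, 40), dtype=np.float32)
p5 = np.zeros((B, 768, 20, 20), dtype=np.float32)

# A 1x1 conv halves P5 channels (768 -> 384) before upsampling;
# here it is stubbed out as a slice, for shape purposes only.
p5_reduced = p5[:, :384]                                 # (B, 384, 20, 20)
p5_up = p5_reduced.repeat(2, axis=2).repeat(2, axis=3)   # nearest 2x upsample
fused = np.concatenate([p5_up, p4], axis=1)              # concat along channels
print(fused.shape)  # (1, 768, 40, 40)
```

In the real network, a CSP bottleneck block then squeezes the concatenated 768 channels back down before the next fusion step.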

Detection Heads

Input:
N3: (B, 192, 80, 80) N4: (B, 384, 40, 40) N5: (B, 768, 20, 20)
Output:
(B, 25200, 85)

Small Objects

Input: (B, 192, 80, 80)
Output: (B, 255, 80, 80) (255 = 3 anchors × 85 values)
Anchors: 3
Predictions: 19,200

Medium Objects

Input: (B, 384, 40, 40)
Output: (B, 255, 40, 40)
Anchors: 3
Predictions: 4,800

Large Objects

Input: (B, 768, 20, 20)
Output: (B, 255, 20, 20)
Anchors: 3
Predictions: 1,200
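The 25,200 total predictions quoted above are just the three grids times three anchors each:

```python
# Total predictions = 3 anchors per cell, summed over the three grids.
grids = [80, 40, 20]
per_scale = [3 * g * g for g in grids]   # [19200, 4800, 1200]
total = sum(per_scale)
print(per_scale, total)                  # [19200, 4800, 1200] 25200
```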

Output Processing

Input:
(B, 25200, 85)
Output:
(B, 300, 85)

Pre-NMS

Shape: (B, 25200, 85)
Boxes: 25,200
Classes: 80
Conf: 1 objectness score (85 = 4 box + 1 obj + 80 classes)

Post-NMS

Shape: (B, 300, 85)
Boxes: ≤300
Format: XYWH
Conf: > 0.25
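A minimal greedy NMS in the spirit of this step can be written in a few lines of plain Python. This is an illustrative sketch, not YOLOv5's actual implementation (which is vectorized in PyTorch and runs per class); the thresholds mirror the defaults mentioned above, and boxes here are in (x1, y1, x2, y2) corner format.

```python
def iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter + 1e-9)

def nms(boxes, scores, conf_thres=0.25, iou_thres=0.45, max_det=300):
    """Greedy NMS: drop low-confidence boxes, then suppress overlaps,
    keeping at most max_det boxes in descending score order."""
    order = sorted(
        (i for i, s in enumerate(scores) if s > conf_thres),
        key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        if all(iou(boxes[i], boxes[j]) < iou_thres for j in keep):
            keep.append(i)
        if len(keep) == max_det:
            break
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
scores = [0.9, 0.8, 0.7]
print(nms(boxes, scores))  # [0, 2] -- box 1 overlaps box 0 too much
```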

Model Summary

Performance

  • mAP@0.5: 0.451 (COCO)
  • Inference: ~8.2ms (V100)
  • FPS: ~122 (batch=1)
  • Size: 42.5MB

Architecture

  • Parameters: 21.2M
  • GFLOPs: 49.0
  • Memory: ~240MB
  • Layers: 294

Features

  • CSP Bottlenecks
  • PANet Feature Fusion
  • Multi-scale Detection
  • Auto-learning Anchors

YOLOv5 Feature Pyramid Network - Detailed Merge Process

Backbone Features: P3 (stride 8, 128 ch), P4 (stride 16, 256 ch), P5 (stride 32, 512 ch)
FPN Features: P3 (128 ch), P4 (256 ch), P5 (512 ch)
PAN Features: P3 (128 ch), P4 (256 ch), P5 (512 ch)
Final Features: P3 (128 ch), P4 (256 ch), P5 (512 ch)

(Channel counts in this diagram are for the small model, YOLOv5s; YOLOv5m scales them to 192/384/768, as shown earlier.)
Feature Pyramid Network Details:
• Backbone extracts multi-scale features with increasing semantic information but decreasing spatial resolution (P3: 128 channels, P4: 256 channels, P5: 512 channels)

YOLOv5 Multi-Scale Fusion

YOLOv5 Multi-scale Prediction Fusion (P4 + P5)

[Figure: three anchor boxes (1-3) overlaid per grid cell at the P5 and P4 scales]

Multi-scale Fusion Process

P5 Scale (8×8 grid)

(Grid sizes in this section are illustrative; with a 640×640 input the P5 grid is actually 20×20 and P4 is 40×40.)

Base Anchors:

  • Square Anchor: width 1.2, height 1.2 (for square objects)
  • Tall Anchor: width 1, height 2 (for tall/vertical objects)
  • Wide Anchor: width 2, height 1 (for wide/horizontal objects)

P4 Scale (16×16 grid)

Base Anchors:

  • Square Anchor: width 1, height 1 (for square objects)
  • Tall Anchor: width 0.8, height 1.6 (for tall/vertical objects)
  • Wide Anchor: width 1.6, height 0.8 (for wide/horizontal objects)
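How anchors turn into boxes can be made concrete with YOLOv5's decoding equations, where each raw offset is squashed through a sigmoid so centers stay near their grid cell and width/height are bounded to 4× the anchor. The grid coordinates and anchor sizes below are made-up illustrative values; in the real model, anchors are specified in pixels per detection layer.

```python
import math

def sigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def decode(tx, ty, tw, th, cx, cy, anchor_w, anchor_h, stride):
    """YOLOv5-style box decoding for one anchor at grid cell (cx, cy).
    Returns (bx, by, bw, bh) in input-image pixels."""
    bx = (2 * sigmoid(tx) - 0.5 + cx) * stride     # center x
    by = (2 * sigmoid(ty) - 0.5 + cy) * stride     # center y
    bw = (2 * sigmoid(tw)) ** 2 * anchor_w         # width, capped at 4x anchor
    bh = (2 * sigmoid(th)) ** 2 * anchor_h         # height, capped at 4x anchor
    return bx, by, bw, bh

# Zero offsets land exactly on the cell center with the raw anchor size:
print(decode(0, 0, 0, 0, cx=4, cy=4, anchor_w=64, anchor_h=64, stride=32))
# (144.0, 144.0, 64.0, 64.0)
```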
