How 3×3 Filters Changed Everything: The VGGNet Story
How this legendary paper with 145k+ citations threatened the dominance of AlexNet
When the Oxford Visual Geometry Group (VGG) released their paper in 2014, it did not come with flashy new layers, exotic activation functions, or radical ideas.
What it offered instead was a revolution in restraint.
Replace large, complicated convolution filters with stacks of simple 3×3 filters.
Go deeper, not wider.
This small idea became a giant leap for computer vision.
What was happening before VGG
In the years between AlexNet (2012) and VGG (2014), the deep learning community faced a major question: "how can we improve CNN performance without exploding compute and parameter costs?"
Back then, CNNs were like handcrafted machines:
AlexNet used a mix of 11×11, 5×5, and 3×3 filters.
Researchers played with architectures like artists with brushes - but the designs lacked structure.
Networks were shallow by today’s standards - 5 to 8 learnable layers.
The idea that going deeper and using only small 3×3 convolutions could outperform these handcrafted giants?
That was radical.
The VGG hypothesis
Karen Simonyan and Andrew Zisserman (authors of the VGG paper) asked:
Can we get better performance by stacking multiple 3×3 convolutions, instead of using bigger filters?
Turns out - you can. And it is elegant.
A quick recap of what we have covered so far in the Computer Vision course
Why was VGGNet such a big deal?
VGG was a major milestone in the evolution of deep learning for computer vision. Despite its simplicity, it had a lasting impact.
Simplicity with depth
Before VGG, architectures like AlexNet had a mix of kernel sizes (11×11, 5×5, etc.).
VGG used only 3×3 convolutions, stacked deeper.
This proved that depth alone could drastically improve accuracy.
It provided a clean, modular architecture that was easy to reason about.
Stacking small filters
VGG replaced larger kernels (like 5×5 or 7×7) with multiple 3×3 filters:
This gave the same effective receptive field.
But it added more non-linearity while using fewer parameters (see the quick check below).
This concept shaped how modern CNNs are designed (e.g., ResNet, DenseNet, etc.).
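As a quick sanity check on the parameter claim, here is a minimal Keras sketch (the channel width of 64 is an arbitrary choice for illustration, not a value from the article) comparing one 5×5 convolution with a stack of two 3×3 convolutions that covers the same 5×5 receptive field.

```python
import tensorflow as tf

C = 64  # channel width; an arbitrary choice for illustration

def count_params(*conv_layers):
    # Build a tiny model from the given conv layers and count its weights
    x = tf.keras.Input(shape=(32, 32, C))
    y = x
    for layer in conv_layers:
        y = layer(y)
    return tf.keras.Model(x, y).count_params()

one_5x5 = count_params(tf.keras.layers.Conv2D(C, 5, padding="same"))
two_3x3 = count_params(
    tf.keras.layers.Conv2D(C, 3, padding="same", activation="relu"),
    tf.keras.layers.Conv2D(C, 3, padding="same"),
)

print(f"one 5x5 conv : {one_5x5:,} parameters")   # 25*C*C + C   = 102,464
print(f"two 3x3 convs: {two_3x3:,} parameters")   # 2*(9*C*C + C) = 73,856
```

The stack needs roughly 18C² weights instead of 25C² for the single 5×5 kernel, and it squeezes an extra ReLU in between.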
Pretrained backbone
VGG16 and VGG19 became the go-to pretrained models for transfer learning.
Even today, many models use VGG as a feature extractor in tasks like object detection, segmentation, and style transfer.
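To make the feature-extractor use case concrete, here is a minimal Keras sketch that exposes an intermediate VGG16 feature map. The choice of the block3_conv3 layer and the random stand-in image are assumptions made for this example, not details from the original article.

```python
import numpy as np
import tensorflow as tf

# Pretrained VGG16 without the classification head
vgg = tf.keras.applications.VGG16(include_top=False, weights="imagenet")
vgg.trainable = False

# Expose an intermediate feature map (block3_conv3 is just one common choice)
extractor = tf.keras.Model(inputs=vgg.input,
                           outputs=vgg.get_layer("block3_conv3").output)

# A random stand-in image; replace with a real photo in practice
image = np.random.randint(0, 256, size=(1, 224, 224, 3)).astype("float32")
image = tf.keras.applications.vgg16.preprocess_input(image)

features = extractor(image)
print(features.shape)  # (1, 56, 56, 256)
```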
Benchmark performance
VGG16 placed 2nd in the 2014 ImageNet Challenge, just behind GoogLeNet (Inception).
It outperformed older models like AlexNet by a large margin.
It showed that deeper networks generalize better when trained properly.
Blueprint for modern CNNs
Many subsequent architectures (e.g., ResNet, UNet, MobileNet) drew structural ideas from VGG’s block-wise design.
It popularized "block-based thinking" in CNN design.
VGG architecture
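For reference, here is a compact from-scratch sketch of the VGG16 layout ("configuration D" in the paper: five blocks of 3×3 convolutions with 2×2 max-pooling, followed by three fully connected layers). It is a readability sketch in Keras, not the training code used in this article.

```python
import tensorflow as tf
from tensorflow.keras import layers

def build_vgg16(num_classes=1000, input_shape=(224, 224, 3)):
    model = tf.keras.Sequential([tf.keras.Input(shape=input_shape)])
    # Five conv blocks: (number of 3x3 conv layers, channel width) per block
    for n_convs, channels in [(2, 64), (2, 128), (3, 256), (3, 512), (3, 512)]:
        for _ in range(n_convs):
            model.add(layers.Conv2D(channels, 3, padding="same", activation="relu"))
        model.add(layers.MaxPooling2D(pool_size=2, strides=2))
    # Classifier head: two 4096-unit FC layers plus the output layer
    model.add(layers.Flatten())
    model.add(layers.Dense(4096, activation="relu"))
    model.add(layers.Dense(4096, activation="relu"))
    model.add(layers.Dense(num_classes, activation="softmax"))
    return model

build_vgg16().summary()  # roughly 138M parameters, as reported for VGG16
```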
Code implementation
Below are the main code snippets, with the key parts highlighted. For the full code (Google Colab), see the end of this article.
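As a rough outline of the setup described in the next section (a frozen pretrained VGG16 base with a small classifier head, trained on a flower dataset), here is a minimal Keras sketch. The image size, the head layers, and the train_ds / val_ds dataset objects are placeholders; refer to the Colab for the actual code.

```python
import tensorflow as tf

IMG_SIZE = (224, 224)
NUM_CLASSES = 5  # e.g., a 5-class flower dataset; adjust for your data

# Frozen pretrained convolutional base
base = tf.keras.applications.VGG16(include_top=False, weights="imagenet",
                                   input_shape=IMG_SIZE + (3,))
base.trainable = False

# New classifier head, the only part that gets trained
model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(NUM_CLASSES, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# train_ds / val_ds are placeholder tf.data datasets of (image, label) pairs
# history = model.fit(train_ds, validation_data=val_ds, epochs=10)
```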
Surprising results (val_accuracy > train_accuracy)
Our results are quite surprising: training accuracy is ~75%, while validation accuracy is 85%. This is a new and interesting problem in our computer vision journey.
We froze most VGG layers and only trained the classifier.
The pretrained convolutional layers already work well on natural images (like flowers).
The classifier may not have fully adapted to the training set yet, but the validation set happens to align well with the ImageNet-style features.
📌 Conclusion: Higher validation accuracy in early epochs often indicates good generalization from pretrained features - not overfitting.
YouTube Lecture
Google Colab access for full code
Link: https://colab.research.google.com/drive/1a6tulaI78ssvcdCcijd8a9iJnUn0uE-m?usp=sharing
Interested in learning AI/ML LIVE from us?
Check this out: https://vizuara.ai/live-ai-courses