Reading: Hands-On ML - Quick notes (Chapter 14 & 15)

👉 List of all notes for this book. IMPORTANT UPDATE Nov 18, 2024: I've stopped taking detailed notes from the book and now only highlight and annotate directly in the PDF files/book. With so many books to read, I don't have time to type everything. In the future, if I make notes while reading a book, they'll contain only the most notable points (for me).

Chapter 14 — Deep Computer Vision Using Convolutional Neural Networks (CNN)

Related notes: DL by DL.AI — Course 4: CNN , R-CNN & Fast R-CNN & Faster R-CNN & Mask R-CNN, TF by DL.AI — Course 2: CNN in TF

Good: [PDF] A guide to convolution arithmetic for deep learning + animated versions of its figures. ← This document explains convolution, pooling, their parameters, and the arithmetic formulas relating those parameters (padding, strides,…)

CNNs are not restricted to visual perception: they are also successful at many other tasks, such as voice recognition and natural language processing

In the visual cortex, some neurons react only to images of horizontal lines, while others react only to lines with different orientations. This observation inspired the local receptive fields used in CNNs.

Figure 14-1

An important milestone was a 1998 paper by Yann LeCun et al. that introduced the famous LeNet-5 architecture, which became widely used by banks to recognize handwritten digits on checks.

Why not simply use a deep neural network with fully connected layers for image recognition tasks? → A large image means a huge number of parameters; a CNN avoids this by using partially connected layers and weight sharing.

The most important building block of a CNN is the convolutional layer

Convolution

No padding and 1x1 strides

A 1x1 border of zero padding and 2x2 strides.

Zero padding: for a layer to have the same height and width as the previous layer, zeros are added around the input.

Figure 14-3. Connections between layers and zero padding

Stride: the horizontal or vertical step size from one receptive field to the next.

Figure 14-4. Reducing dimensionality using a stride of 2

Filters = convolution kernels = kernels.

Feature map: a layer full of neurons using the same filter outputs a feature map, which highlights the areas in an image that activate the filter the most.

A convolutional layer has multiple filters and outputs one feature map per filter.

CNN vs fully connected network: The fact that all neurons in a feature map share the same parameters dramatically reduces the number of parameters in the model. Once the CNN has learned to recognize a pattern in one location, it can recognize it in any other location. In contrast, once a fully connected neural network has learned to recognize a pattern in one location, it can only recognize it in that particular location.
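
To make the parameter savings concrete, here is a rough count for a single layer. The image size (100 x 100 RGB), the 1,000 dense units, and the 32 filters of size 3 x 3 are illustrative assumptions, not figures taken from the book, and the helper names are hypothetical.

```python
def dense_params(n_inputs, n_units):
    """Weights plus one bias per unit in a fully connected layer."""
    return n_inputs * n_units + n_units


def conv2d_params(kernel_h, kernel_w, in_channels, n_filters):
    """Weights plus one bias per filter; independent of the image size."""
    return kernel_h * kernel_w * in_channels * n_filters + n_filters


# Assumed example: a 100 x 100 RGB image fed to the first layer.
n_inputs = 100 * 100 * 3
dense_count = dense_params(n_inputs, 1000)
conv_count = conv2d_params(3, 3, 3, 32)

print(dense_count)  # 30001000
print(conv_count)   # 896
```

The convolutional count does not change if the image gets bigger, because the same small kernels are reused at every location.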

padding="valid" means no zero-padding
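
The effect of padding and stride on the output size can be sketched with a small helper. This follows the rule Keras applies for `padding="valid"` and `padding="same"` along one dimension, assuming no dilation; the function name `conv_output_size` is made up for this note.

```python
import math


def conv_output_size(n, kernel, stride=1, padding="valid"):
    """Output length along one dimension (height or width)."""
    if padding == "valid":
        # No zero padding: positions where the kernel does not fit are dropped.
        return (n - kernel) // stride + 1
    if padding == "same":
        # Zeros are added as needed so that only the stride shrinks the output.
        return math.ceil(n / stride)
    raise ValueError("padding must be 'valid' or 'same'")


print(conv_output_size(5, 3, stride=1, padding="valid"))   # 3
print(conv_output_size(5, 3, stride=1, padding="same"))    # 5
print(conv_output_size(13, 6, stride=5, padding="valid"))  # 2
print(conv_output_size(13, 6, stride=5, padding="same"))   # 3
```

With a stride of 1, "same" padding keeps the size unchanged, which matches the zero-padding note above.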

Some cases with padding and stride

Note that the height and width of the input images do not appear in the kernel’s shape

A convolutional layer performs a linear operation, so if you stacked multiple convolutional layers without any activation functions they would all be equivalent to a single convolutional layer, and they wouldn’t be able to learn anything really complex.
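
A tiny 1D check of this claim, assuming no bias and no activation: applying two kernels one after the other gives exactly the same result as applying one combined kernel. The helper names and the example values are made up for illustration.

```python
def correlate1d(x, k):
    """'Valid' 1D cross-correlation (what deep learning calls convolution)."""
    n = len(x) - len(k) + 1
    return [sum(x[i + j] * k[j] for j in range(len(k))) for i in range(n)]


def compose_kernels(k1, k2):
    """Single kernel equivalent to applying k1 and then k2."""
    out = [0] * (len(k1) + len(k2) - 1)
    for i, a in enumerate(k1):
        for j, b in enumerate(k2):
            out[i + j] += a * b
    return out


x = [1, 2, 3, 4, 5, 6]
k1 = [1, 0, -1]
k2 = [2, 1]

stacked = correlate1d(correlate1d(x, k1), k2)     # two "layers"
single = correlate1d(x, compose_kernels(k1, k2))  # one equivalent "layer"

print(stacked)            # [-6, -6, -6]
print(stacked == single)  # True
```

Putting a nonlinearity such as ReLU between the two steps breaks this equivalence, which is why the activation functions matter.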

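Convolutional layers require a huge amount of RAM, especially during training. ← If training crashes with an out-of-memory error, you can try reducing the mini-batch size.

During inference, you only need as much RAM as required by two consecutive layers, because the memory used by one layer can be released as soon as the next layer has been computed. During training, everything computed in the forward pass must be kept for the backward pass.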

Pooling layers:

Their goal is to subsample (i.e., shrink) the input image in order to reduce the computational load, the memory usage, and the number of parameters (thereby limiting the risk of overfitting).
A pooling neuron has no weights.
A pooling layer typically works on every input channel independently, so the output depth (i.e., the number of channels) is the same as the input depth.
By inserting a max pooling layer every few layers in a CNN, it is possible to get some level of translation invariance at a larger scale.
Max pooling has downsides → it is very destructive: even with a tiny 2x2 kernel and a stride of 2, it drops 75% of the input values. It preserves only the strongest features.
Average pooling layers used to be very popular, but people mostly use max pooling layers now, as they generally perform better.
Depthwise max pooling

Figure 14-11. Depthwise max pooling can help the CNN learn to be invariant (to rotation in this case)
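
A minimal sketch of ordinary (spatial) max pooling on a single channel, assuming a 2x2 pool with stride 2 and no padding. It shows where the "75% dropped" figure comes from: 16 input values go in, 4 come out. The function name and the toy image are illustrative.

```python
def max_pool_2d(image, pool=2, stride=2):
    """Max pooling over one channel given as a list of rows (no padding)."""
    h, w = len(image), len(image[0])
    return [
        [
            max(image[r + dr][c + dc] for dr in range(pool) for dc in range(pool))
            for c in range(0, w - pool + 1, stride)
        ]
        for r in range(0, h - pool + 1, stride)
    ]


image = [
    [1, 3, 2, 0],
    [4, 2, 1, 5],
    [0, 1, 7, 2],
    [2, 6, 3, 3],
]
pooled = max_pool_2d(image)
print(pooled)  # [[4, 5], [6, 7]]
```

There is nothing to learn here: the layer has no weights, it only keeps the largest value in each window.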

Global average pooling layer: often seen in modern architectures. It just outputs a single number per feature map and per instance (the mean of the whole feature map). It can be useful just before the output layer.
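
A small sketch of global average pooling for one instance, with each feature map given as a list of rows. The function name and the values are made up for illustration.

```python
def global_average_pool(feature_maps):
    """Return one mean value per feature map (channel) for a single instance."""
    return [
        sum(sum(row) for row in fmap) / (len(fmap) * len(fmap[0]))
        for fmap in feature_maps
    ]


maps = [
    [[1, 2], [3, 4]],  # feature map 0
    [[0, 0], [8, 0]],  # feature map 1
]
pooled_vector = global_average_pool(maps)
print(pooled_vector)  # [2.5, 2.0]
```

Two feature maps give a vector of length 2, whatever their height and width, which is what makes this layer convenient right before the output layer.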

Typical CNN architecture

Figure 14-12

A common mistake is to use convolution kernels that are too large.

ImageNet

About LeNet-5: today, we would use ReLU instead of tanh and softmax instead of RBF.

Some famous networks:

LeNet-5 (Yann LeCun 1998), widely known CNN, on MNIST dataset.
AlexNet: similar to LeNet-5 but larger and deeper; the first to stack convolutional layers directly on top of one another.
GoogLeNet: much deeper than previous.
VGGNet: use many 3x3 filters
ResNet (Residual Network): follows the trend of deeper and deeper networks with fewer params; uses skip connections.
Xception: merges the ideas of GoogLeNet and ResNet, but it replaces the inception modules with a special type of layer called a depthwise separable convolution layer.

Since separable convolutional layers only have one spatial filter per input channel, you should avoid using them after layers that have too few channels, such as the input layer
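
A rough weight count helps explain both the appeal of separable convolutions and the warning above. The sketch assumes one depthwise spatial filter per input channel followed by a 1x1 pointwise convolution, and it ignores biases; the channel counts are illustrative.

```python
def standard_conv_weights(k, in_channels, out_channels):
    """Every output filter has its own k x k kernel for every input channel."""
    return k * k * in_channels * out_channels


def separable_conv_weights(k, in_channels, out_channels):
    """Depthwise step (one k x k filter per input channel) + 1x1 pointwise step."""
    depthwise = k * k * in_channels
    pointwise = in_channels * out_channels
    return depthwise + pointwise


# Deep in the network (64 -> 128 channels): large savings.
print(standard_conv_weights(3, 64, 128))   # 73728
print(separable_conv_weights(3, 64, 128))  # 8768

# Right after an RGB input (3 -> 64 channels): only 3 spatial filters in total.
print(standard_conv_weights(3, 3, 64))     # 1728
print(separable_conv_weights(3, 3, 64))    # 219
```

In the second case the layer can only look for three spatial patterns, one per color channel, which is why a regular convolution is usually preferred near the input.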

SENet: This architecture extends existing architectures such as inception networks and ResNets, and boosts their performance.

The boost comes from the fact that a SENet adds a small neural network, called an SE block, to every inception module or residual unit in the original architecture

EfficientNet: The authors proposed a method to scale any CNN efficiently, by jointly increasing the depth (number of layers), width (number of filters per layer), and resolution (size of the input image) in a principled way. This is called compound scaling.

Understanding EfficientNet’s compound scaling method is helpful to gain a deeper understanding of CNNs, especially if you ever need to scale a CNN architecture.
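
A sketch of the compound scaling idea: a single coefficient phi scales depth, width, and resolution together. The base constants below (1.2, 1.1, 1.15) are the values I recall from the EfficientNet paper, so treat them as an assumption; the function names are made up.

```python
# Assumed base constants (as recalled from the EfficientNet paper).
ALPHA = 1.2   # depth
BETA = 1.1    # width
GAMMA = 1.15  # resolution


def compound_scale(phi):
    """Multipliers applied to the baseline network for a given coefficient phi."""
    return {
        "depth": ALPHA ** phi,
        "width": BETA ** phi,
        "resolution": GAMMA ** phi,
    }


def flops_multiplier(phi):
    """Approximate growth in compute: depth x width^2 x resolution^2."""
    return (ALPHA * BETA ** 2 * GAMMA ** 2) ** phi


for phi in range(4):
    scale = compound_scale(phi)
    print(phi, {name: round(value, 2) for name, value in scale.items()},
          round(flops_multiplier(phi), 2))
```

With these constants, each step of phi roughly doubles the compute, so phi works as a single "budget" knob.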

Keras Applications ← full list of networks supported by Keras

Data augmentation: Data augmentation artificially increases the size of the training set by generating many realistic variants of each training instance. This reduces overfitting, making this a regularization technique. For example, you can slightly shift, rotate, and resize every picture in the training set by various amounts and add the resulting pictures to the training set
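
A toy sketch of the idea on a tiny grayscale "image" stored as a list of rows. The helpers are hypothetical; in practice Keras ships preprocessing layers such as `RandomFlip` and `RandomRotation` for this.

```python
def flip_horizontal(image):
    """Mirror the image left to right."""
    return [row[::-1] for row in image]


def shift_right(image, pixels, fill=0):
    """Shift by `pixels` columns (0 <= pixels <= width), padding with `fill`."""
    return [[fill] * pixels + row[:len(row) - pixels] for row in image]


def augment(image):
    """Return the original image plus a few simple variants."""
    return [image, flip_horizontal(image), shift_right(image, 1)]


image = [
    [1, 2, 3],
    [4, 5, 6],
]
variants = augment(image)
for variant in variants:
    print(variant)
# [[1, 2, 3], [4, 5, 6]]
# [[3, 2, 1], [6, 5, 4]]
# [[0, 1, 2], [0, 4, 5]]
```

Each variant keeps the label of the original image, so one labeled picture turns into three training instances.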

Data augmentation is also useful when you have an unbalanced dataset: you can use it to generate more samples of the less frequent classes. This is called the synthetic minority oversampling technique, or SMOTE for short.

It is very easy to create a pretty good image classifier using a pretrained model.

Pretrained models for Transfer Learning

It’s usually a good idea to freeze the weights of the pretrained layers, at least at the beginning of training.
After training the model for a few epochs, its validation accuracy should reach a bit over 80% and then stop improving. This means that the top layers are now pretty well trained, and we are ready to unfreeze some of the base model’s top layers, then continue training.
If you tune the hyperparameters, lower the learning rate, and train for quite a bit longer, you should be able to reach 95% to 97%.

Classification and Localization:

Localizing an object in a picture can be expressed as a regression task: to predict a bounding box around the object, a common approach is to predict the horizontal and vertical coordinates of the object’s center, as well as its height and width.
A problem: the flowers dataset does not have bounding boxes around the flowers, so we need to add them ourselves. This is often one of the hardest and most costly parts of a machine learning project: getting the labels.
[1611.02145] Crowdsourcing in Computer Vision ← labeling via crowdsourcing
The bounding boxes should be normalized so that the horizontal and vertical coordinates, as well as the height and width, all range from 0 to 1.
The most common metric for this is the intersection over union (IoU): the area of overlap between the predicted bounding box and the target bounding box, divided by the area of their union

Figure 14-24. IoU metric for bounding boxes
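
A minimal IoU sketch. It assumes boxes are given as corner coordinates `(x_min, y_min, x_max, y_max)`; since the notes describe boxes by center, width, and height, a small converter is included. Both function names are made up for this note.

```python
def center_to_corners(cx, cy, width, height):
    """Convert (center x, center y, width, height) to corner coordinates."""
    return (cx - width / 2, cy - height / 2, cx + width / 2, cy + height / 2)


def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    overlap_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    overlap_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = overlap_w * overlap_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


predicted = (0, 0, 2, 2)
target = (1, 1, 3, 3)
print(iou(predicted, target))           # overlap 1, union 7, so about 0.143
print(iou(predicted, predicted))        # 1.0
print(iou(predicted, (5, 5, 6, 6)))     # 0.0
print(center_to_corners(0.5, 0.5, 0.2, 0.4))
```

The same function works for normalized coordinates in the 0 to 1 range, since IoU is a ratio of areas.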

Object detection: The task of classifying and localizing multiple objects in an image.

Instead of an objectness score, a “no-object” class was sometimes added, but in general this did not work as well: the questions “Is an object present?” and “What type of object is it?” are best answered separately.

The sliding-CNN approach

Figure 14-25. Detecting multiple objects by sliding a CNN across the image

This technique is fairly straightforward, but as you can see it will often detect the same object multiple times, at slightly different positions. Some postprocessing is needed to get rid of all the unnecessary bounding boxes. A common approach for this is called non-max suppression.
This simple approach to object detection works pretty well, but it requires running the CNN many times (15 times in this example), so it is quite slow.
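
A sketch of greedy non-max suppression: keep the highest-scoring box, drop every remaining box that overlaps it too much, and repeat. The corner format `(x_min, y_min, x_max, y_max)`, the 0.5 IoU threshold, and the example boxes are assumptions for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    overlap_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    overlap_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    intersection = overlap_w * overlap_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def non_max_suppression(boxes, scores, iou_threshold=0.5):
    """Return indices of the boxes to keep, best score first."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    for i in order:
        # Keep a box only if it does not overlap an already kept box too much.
        if all(iou(boxes[i], boxes[j]) <= iou_threshold for j in keep):
            keep.append(i)
    return keep


boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
scores = [0.9, 0.8, 0.7]
kept = non_max_suppression(boxes, scores)
print(kept)  # [0, 2]
```

The second box is dropped because it overlaps the first one heavily (IoU of about 0.68) and has a lower score; the third box is far away and survives.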

To convert a dense layer to a convolutional layer, the number of filters in the convolutional layer must be equal to the number of units in the dense layer, the filter size must be equal to the size of the input feature maps, and you must use "valid" padding. The stride may be set to 1 or more, as you will see shortly.
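
A toy check of this conversion on a single-channel 2x2 feature map: a dense layer with 2 units becomes a convolutional layer with 2 filters of size 2x2 and "valid" padding, and both give the same numbers. All weights and helper names are made up for illustration.

```python
def dense(flat_inputs, weights, biases):
    """Fully connected layer without activation: one output per unit."""
    return [
        sum(w * x for w, x in zip(unit_weights, flat_inputs)) + b
        for unit_weights, b in zip(weights, biases)
    ]


def conv2d_valid(image, filters, biases, stride=1):
    """'Valid' convolution on one channel; returns out[row][col][filter]."""
    kh, kw = len(filters[0]), len(filters[0][0])
    h, w = len(image), len(image[0])
    return [
        [
            [
                sum(image[r + i][c + j] * f[i][j]
                    for i in range(kh) for j in range(kw)) + b
                for f, b in zip(filters, biases)
            ]
            for c in range(0, w - kw + 1, stride)
        ]
        for r in range(0, h - kh + 1, stride)
    ]


feature_map = [[1, 2], [3, 4]]
dense_weights = [[1, 0, -1, 2], [2, 1, 0, -1]]  # 2 units, 4 inputs each
biases = [1, 0]

# Reshape each unit's weight vector into a 2x2 filter (row-major order).
filters = [[unit[0:2], unit[2:4]] for unit in dense_weights]

flat = [value for row in feature_map for value in row]
dense_out = dense(flat, dense_weights, biases)
conv_out = conv2d_valid(feature_map, filters, biases)
print(dense_out)  # [7, 0]
print(conv_out)   # [[[7, 0]]]

# On a larger input, the same filters produce a grid of predictions.
bigger = [[1, 2, 0], [3, 4, 0], [0, 0, 0]]
grid = conv2d_valid(bigger, filters, biases)
print(len(grid), len(grid[0]))  # 2 2
```

The last lines show the payoff: fed a larger image, the converted network outputs one prediction per position in a single pass instead of being run once per window.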

YOLO (You Only Look Once): It is so fast that it can run in real time on a video

How YOLO Object Detection Works - YouTube

mAP (mean average precision) → a very common metric used in object detection. Read page 690, the author explains the idea of mAP very well.

R-CNN: Clearly EXPLAINED! - YouTube

TensorFlow Hub Object Detection Colab

Object Tracking:

Object tracking is a challenging task: objects move, they may grow or shrink as they get closer to or further away from the camera, their appearance may change as they turn around or move to different lighting conditions or backgrounds, they may be temporarily occluded by other objects, and so on.
DeepSORT.

Semantic Segmentation: In semantic segmentation, each pixel is classified according to the class of the object it belongs to. Different objects of the same class are not distinguished.

Instance segmentation: similar to semantic segmentation but instead of merging all objects of the same class into one big lump, each object is distinguished from the others.

The field of deep computer vision is vast and fast-paced, with all sorts of architectures popping up every year. Almost all of them are based on convolutional neural networks, but since 2020 another neural net architecture has entered the computer vision space: transformers.

Chapter 15 — Processing Sequences Using RNNs and CNNs