Reading: Hands-On ML - Quick notes (Chapter 10 — 13)

Anh-Thi Dinh
👉 List of all notes for this book. IMPORTANT UPDATE Nov 18, 2024: I've stopped taking detailed notes from the book and now only highlight and annotate directly in the PDF files/book. With so many books to read, I don't have time to type everything. In the future, if I make notes while reading a book, they'll contain only the most notable points (for me).

Chapter 10. Introduction to Artificial Neural Networks with Keras

  • The Perceptron: one of the simplest ANN architectures (ANN = Artificial Neural Networks)
    • Figure 10-4. TLU (threshold logic unit): an artificial neuron that computes a weighted sum of its inputs plus a bias term b (z = wᵀx + b), then applies a step function to the result
  • The most common step function is the Heaviside step function; sometimes the sign function is used instead.
  • How is a perceptron trained? → follows Hebb’s rule. “Cells that fire together, wire together” (the connection weight between two neurons tends to increase when they fire simultaneously.)
  • Perceptrons have limits (e.g., they cannot solve the XOR problem) → use a multilayer perceptron (MLP)
  • perceptrons do not output a class probability → use logistic regression instead.
  • When an ANN contains a deep stack of hidden layers → deep neural network (DNN)
  • Back in the day, computers weren't powerful → training MLPs was a problem, even with gradient descent.
  • Backpropagation: an algorithm to minimize the cost function of MLPs.
    • Forward propagation: from X, compute the cost J
    • Backward propagation: compute the derivatives and optimize the params → update params
    • → Read this note (DL course 1).
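    • A minimal sketch of one training step with tf.GradientTape, just to make the two phases concrete (the tiny model, fake mini-batch, and learning rate below are illustrative, not from the book):
      import tensorflow as tf

      X = tf.random.normal([32, 10])   # fake mini-batch: 32 samples, 10 features
      y = tf.random.normal([32, 1])
      model = tf.keras.Sequential([tf.keras.layers.Dense(5, activation="relu"),
                                   tf.keras.layers.Dense(1)])
      optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
      loss_fn = tf.keras.losses.MeanSquaredError()

      with tf.GradientTape() as tape:
          y_pred = model(X, training=True)   # forward propagation: X → predictions
          loss = loss_fn(y, y_pred)          # compute the cost J
      grads = tape.gradient(loss, model.trainable_variables)             # backward propagation: derivatives
      optimizer.apply_gradients(zip(grads, model.trainable_variables))   # update params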
From this, I've decided to browse additional materials to deepen my understanding of Deep Learning. I found that the book has become more generalized than I expected, so I'll explore other resources before returning to finish it.
  • It’s important to initialize all the hidden layers connection weights randomly!
  • Replace the step function in MLPs with the sigmoid function, because the sigmoid has a well-defined nonzero derivative everywhere!
  • The ReLU activation (rectified linear unit) is continuous but not differentiable at 0. In practice it works very well and is fast to compute, so it has become the default.
  • Some popular activations
  • A large enough DNN with nonlinear activations can theoretically approximate any continuous function.
  • Regression MLPs → use MLPs for regression tasks, MLPRegressor
  • gradient descent does not converge very well when the features have very different scales
  • softplus activation (a smooth variant of ReLU): softplus(z) = log(1 + exp(z))
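  • A minimal NumPy sketch of the step/sigmoid/ReLU/softplus functions mentioned above, for intuition only (the sample inputs are illustrative):
    import numpy as np

    def heaviside(z): return (z >= 0).astype(float)    # step function used by TLUs
    def sigmoid(z):   return 1 / (1 + np.exp(-z))      # well-defined nonzero derivative everywhere
    def relu(z):      return np.maximum(0.0, z)        # not differentiable at 0, but fast and effective
    def softplus(z):  return np.log(1 + np.exp(z))     # smooth variant of ReLU

    z = np.linspace(-3, 3, 7)
    print(heaviside(z), sigmoid(z), relu(z), softplus(z))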
  • If you are not satisfied with the training results: try tuning the training hyperparameters (e.g., the learning rate), then fine-tune the model hyperparameters (number of layers, number of neurons per layer, …)
  • 3 ways to build Keras model: Sequential API (clean and straightforward), Functional API (multiple inputs/outputs), Subclassing API (to build dynamic models)
    • model = tf.keras.Sequential([
          tf.keras.layers.Flatten(input_shape=[28, 28]),
          tf.keras.layers.Dense(300, activation="relu"),
          tf.keras.layers.Dense(100, activation="relu"),
          tf.keras.layers.Dense(10, activation="softmax")
      ])
      Sequential API
       
      Figure 10-13. Wide & Deep neural network
      input_wide = tf.keras.layers.Input(shape=[5])  # features 0 to 4
      input_deep = tf.keras.layers.Input(shape=[6])  # features 2 to 7
      norm_layer_wide = tf.keras.layers.Normalization()
      norm_layer_deep = tf.keras.layers.Normalization()
      norm_wide = norm_layer_wide(input_wide)
      norm_deep = norm_layer_deep(input_deep)
      hidden1 = tf.keras.layers.Dense(30, activation="relu")(norm_deep)
      hidden2 = tf.keras.layers.Dense(30, activation="relu")(hidden1)
      concat = tf.keras.layers.concatenate([norm_wide, hidden2])
      output = tf.keras.layers.Dense(1)(concat)
      model = tf.keras.Model(inputs=[input_wide, input_deep], outputs=[output])
      Functional API
  • Sequential API is clean and straightforward. Need more complex topologies/multiple inputs or outputs → functional API.
  • Functional API
    • One example of a nonsequential neural network is a Wide & Deep neural network.
    • Each Dense layer is created and called on the same line; this is a common practice:
      • norm_deep = norm_layer_deep(input_deep)

        # instead of
        hidden_layer1 = tf.keras.layers.Dense(30, activation="relu")
        hidden1 = hidden_layer1(norm_deep)

        # do
        hidden1 = tf.keras.layers.Dense(30, activation="relu")(norm_deep)
    • Multiple outputs: You could train one neural network per task, but in many cases you will get better results on all tasks by training a single neural network with one output per task.
    • you may want to add an auxiliary output in a neural network architecture (see Figure 10-15) to ensure that the underlying part of the network learns something useful on its own, without relying on the rest of the network.
    • Figure 10-15. Handling multiple outputs, in this example to add an auxiliary output for regularization
    • Each output needs its own loss function:
      • model.compile(
            loss=("mse", "mse"), loss_weights=(0.9, 0.1),
            # ...
        )
  • Both the Sequential API and the Functional API are "declarative" (easy to debug), but their downside is that they are "static" (a fixed graph of layers). If we need loops, varying shapes, conditional branching, etc. → we need the Subclassing API (subclass tf.keras.Model), as sketched below.
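    • A minimal Subclassing API sketch of the same Wide & Deep idea, with the architecture defined imperatively inside call() (layer sizes are illustrative):
      import tensorflow as tf

      class WideAndDeepModel(tf.keras.Model):
          def __init__(self, units=30, activation="relu", **kwargs):
              super().__init__(**kwargs)
              self.norm_wide = tf.keras.layers.Normalization()
              self.norm_deep = tf.keras.layers.Normalization()
              self.hidden1 = tf.keras.layers.Dense(units, activation=activation)
              self.hidden2 = tf.keras.layers.Dense(units, activation=activation)
              self.main_output = tf.keras.layers.Dense(1)

          def call(self, inputs):
              input_wide, input_deep = inputs
              norm_wide = self.norm_wide(input_wide)
              norm_deep = self.norm_deep(input_deep)
              hidden2 = self.hidden2(self.hidden1(norm_deep))
              concat = tf.keras.layers.concatenate([norm_wide, hidden2])
              return self.main_output(concat)

      model = WideAndDeepModel()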
  • Using TensorBoard for Visualization
  • Fine tuning NN Hyperparameters
    • One option is to convert your Keras model to a Scikit-Learn estimator, and then use GridSearchCV or RandomizedSearchCV to fine-tune the hyperparameters, as you did in Chapter 2.
    • A better way: use the Keras Tuner library, a hyperparameter tuning library for Keras models (see the sketch below).
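      • A minimal Keras Tuner sketch (assumes keras-tuner is installed; the search space and the Fashion-MNIST-style input shape are illustrative):
        import tensorflow as tf
        import keras_tuner as kt

        def build_model(hp):
            n_hidden = hp.Int("n_hidden", min_value=1, max_value=5)
            n_neurons = hp.Int("n_neurons", min_value=16, max_value=256)
            lr = hp.Float("learning_rate", min_value=1e-4, max_value=1e-2, sampling="log")
            model = tf.keras.Sequential([tf.keras.layers.Flatten(input_shape=[28, 28])])
            for _ in range(n_hidden):
                model.add(tf.keras.layers.Dense(n_neurons, activation="relu"))
            model.add(tf.keras.layers.Dense(10, activation="softmax"))
            model.compile(loss="sparse_categorical_crossentropy",
                          optimizer=tf.keras.optimizers.Adam(learning_rate=lr),
                          metrics=["accuracy"])
            return model

        tuner = kt.RandomSearch(build_model, objective="val_accuracy", max_trials=5)
        # tuner.search(X_train, y_train, epochs=10, validation_data=(X_valid, y_valid))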
    • Number of hidden layers
    • Transfer learning
    • Number of Neurons per hidden layer
      • It used to be common to shrink the hidden layers gradually (pyramid style), but this is no longer considered necessary; using the same number of neurons in every hidden layer works just as well (and then there is only one hyperparameter to tune).
      • (A tip from someone at Google) Start with plenty of neurons in the first layers and then reduce gradually; this avoids having a layer so small that it cannot capture enough information, since later layers can never recover information that was lost, no matter how many neurons they have.
      • In general you will get more bang for your buck by increasing the number of layers instead of the number of neurons per layer.
    • The learning rate is arguably the most important hyperparameter.
    • Optimizer → chap 11
    • Batch size
      • The batch size can have a significant impact on your model’s performance and training time.
      • Use the largest batch size that can fit in GPU RAM. ← Drawback: large batch sizes often lead to training instabilities, especially at the beginning of training, and the resulting model may not generalize as well as a model trained with a small batch size.
      • using small batches (from 2 to 32) was preferable because small batches led to better models in less training time
      • One strategy is to try using a large batch size with learning rate warmup; if training is unstable or the final performance is disappointing, try a small batch size instead.
    • The optimal learning rate depends on the other hyperparameters—especially the batch size
    • [1803.09820] A disciplined approach to neural network hyper-parameters: Part 1 -- learning rate, batch size, momentum, and weight decay

Chapter 11 — Training Deep Neural Networks

The Vanishing/Exploding Gradients Problems

  • The exploding gradients problem (most common in recurrent neural networks).
  • The usual fix combines better activation functions with better initialization techniques.
  • Glorot and He Initialization
  • ReLU isn't perfect → the dying ReLUs problem: during training, some neurons effectively "die", meaning they stop outputting anything other than 0. In some cases, you may find that half of your network's neurons are dead, especially if you used a large learning rate. → use leaky ReLU
  • Leaky ReLU
    • Setting α = 0.2 (a huge leak) seemed to result in better performance than α = 0.01 (a small leak).
    • There is also the randomized leaky ReLU (RReLU), where α is picked randomly during training → helps reduce overfitting.
    • Parametric leaky ReLU (PReLU), where α is learned during training → PReLU was reported to strongly outperform ReLU on large image datasets, but on smaller datasets it runs the risk of overfitting the training set.
    • ReLU, leaky ReLU, and PReLU all suffer from the fact that they are not smooth functions: their derivatives abruptly change (at z = 0)
  • ELU (exponential linear unit), SELU (scaled ELU)
    • SELU activation function may outperform other activation functions for MLPs, especially deep ones. But it requires some conditions to happen (check page 469).
  • Self-normalize: the output of each layer will tend to preserve a mean of 0 and a standard deviation of 1 during training, which solves the vanishing/exploding gradients problem.
  • GELU, Swish, and Mish tend to outperform the other activation functions consistently on most tasks.
    • Mish overlaps almost perfectly with Swish when z is negative, and almost perfectly with GELU when z is positive.
  • Which one to use?
    • ReLU remains a good default for simple tasks
    • Swish is probably a better default for more complex tasks
    • If you care a lot about runtime latency, then you may prefer leaky ReLU.
    • For deep MLPs, give SELU a try (see the Keras sketch below).
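    • How these choices look in Keras, as a minimal sketch (layer sizes and α = 0.2 are illustrative; the He/LeCun initializers are the ones typically paired with these activations):
      import tensorflow as tf

      dense_selu = tf.keras.layers.Dense(100, activation="selu",
                                         kernel_initializer="lecun_normal")   # for self-normalizing MLPs
      dense_swish = tf.keras.layers.Dense(100, activation="swish",
                                          kernel_initializer="he_normal")
      dense_pre_leaky = tf.keras.layers.Dense(100, kernel_initializer="he_normal")
      leaky = tf.keras.layers.LeakyReLU(alpha=0.2)    # applied to dense_pre_leaky's output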
  • Batch Normalization
    • The technique consists of adding an operation in the model just before or after the activation function of each hidden layer. This operation simply zero-centers and normalizes each input, then scales and shifts the result.
    • if you add a BN layer as the very first layer of your neural network, you do not need to standardize your training set.
    • It does so by evaluating the mean and standard deviation of the input over the current mini-batch (hence the name “batch normalization”)
    • So during training, BN standardizes its inputs, then rescales and offsets them. Good! What about at test time? Well, it’s not that simple. Indeed, we may need to make predictions for individual instances rather than for batches of instances: in this case, we will have no way to compute each input’s mean and standard deviation.
    • batch normalization acts like a regularizer, reducing the need for other regularization techniques
    • Batch normalization does, however, add some complexity to the model: the neural network makes slower predictions due to the extra computations.
    • Total training time with BN is usually shorter even though each epoch is slower, because the model converges in fewer epochs.
    • The authors of the BN paper argued in favor of adding the BN layers before the activation functions, rather than after (as we just did). ← this is still somewhat debated.
    • Batch normalization has become one of the most-used layers in deep neural networks, especially deep convolutional neural networks (see the sketch below).
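    • A minimal sketch of adding BN layers to a Fashion-MNIST-style MLP (layer sizes are illustrative; here BN is placed after the activations, the debated alternative is before them):
      import tensorflow as tf

      model = tf.keras.Sequential([
          tf.keras.layers.Flatten(input_shape=[28, 28]),
          tf.keras.layers.BatchNormalization(),   # as the first layer → no manual standardization needed
          tf.keras.layers.Dense(300, activation="relu"),
          tf.keras.layers.BatchNormalization(),
          tf.keras.layers.Dense(100, activation="relu"),
          tf.keras.layers.BatchNormalization(),
          tf.keras.layers.Dense(10, activation="softmax"),
      ])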
  • Gradient Clipping
    • One technique to mitigate the exploding gradients problem is to clip the gradients during backpropagation so that they never exceed some threshold (see the sketch below).
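    • In Keras, gradient clipping is just an optimizer argument, as in this minimal sketch (the threshold 1.0 is illustrative; `model` is assumed to exist):
      import tensorflow as tf

      optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipvalue=1.0)   # clip each component to [-1, 1]
      # optimizer = tf.keras.optimizers.SGD(learning_rate=0.01, clipnorm=1.0)  # or rescale the whole gradient vector
      model.compile(loss="mse", optimizer=optimizer)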

Reusing Pretrained Layers

  • Transfer learning ← it not only speeds up training considerably, but also requires significantly less training data.
  • transfer learning will work best when the inputs have similar low-level features.
  • The more similar the tasks are, the more layers you will want to reuse (starting with the lower layers). For very similar tasks, try to keep all the hidden layers and just replace the output layer.
  • How many layers should you reuse? → try freezing all the reused layers first, then try unfreezing one or two of the top hidden layers to let backpropagation tweak them and see if performance improves.
    • The more training data you have, the more layers you can unfreeze.
    • After unfreezing the reused layers, it is usually a good idea to reduce the learning rate
    • You must always compile your model after you freeze or unfreeze layers.
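    • A minimal transfer-learning sketch of the freeze → train → unfreeze → recompile cycle (the model path, output layer, and learning rates are illustrative):
      import tensorflow as tf

      base_model = tf.keras.models.load_model("pretrained_model.keras")   # hypothetical path
      model = tf.keras.Sequential(base_model.layers[:-1])                 # reuse all but the output layer
      model.add(tf.keras.layers.Dense(1, activation="sigmoid"))           # new task-specific output

      for layer in model.layers[:-1]:
          layer.trainable = False                                         # freeze the reused layers
      model.compile(loss="binary_crossentropy",
                    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-3),
                    metrics=["accuracy"])
      # model.fit(X_train, y_train, epochs=4, validation_data=(X_valid, y_valid))

      for layer in model.layers[:-1]:
          layer.trainable = True                                          # unfreeze…
      model.compile(loss="binary_crossentropy",                           # …and always recompile after (un)freezing,
                    optimizer=tf.keras.optimizers.SGD(learning_rate=1e-4),  # with a lower learning rate
                    metrics=["accuracy"])
      # model.fit(X_train, y_train, epochs=16, validation_data=(X_valid, y_valid))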
  • When a paper just looks too positive, you should be suspicious ← so many results in science can never be reproduced.
  • It turns out that transfer learning does not work very well with small dense networks
    • small → few patterns
    • dense → very specific patterns (not useful for other tasks)
  • Unsupervised Pretraining
    • Use it when you don't have much labeled training data and you cannot find a model trained on a similar task.
    • You can try using the unlabeled data to train an unsupervised model, such as an autoencoder or a generative adversarial network (GAN).
    • Unsupervised pretraining (today typically using autoencoders or GANs rather than RBMs) is still a good option when you have a complex task to solve, no similar model you can reuse, and little labeled training data but plenty of unlabeled training data.
    • Figure 11-6. In unsupervised training, a model is trained on all data, including the unlabeled data, using an unsupervised learning technique, then it is fine-tuned for the final task on just the labeled data using a supervised learning technique; the unsupervised part may train one layer at a time as shown here, or it may train the full model directly
  • Pretraining on an Auxiliary Task
    • If you do not have much labeled training data, one last option is to train a first neural network on an auxiliary task for which you can easily obtain or generate labeled training data, then reuse the lower layers of that network for your actual task. The first neural network’s lower layers will learn feature detectors that will likely be reusable by the second neural network.
    • Self-supervised learning is when you automatically generate the labels from the data itself

Faster Optimizers

So far, to speed up training:
  1. A good initialization
  2. A good activation function
  3. Using batch normalization
  4. Reusing (parts of) a pretrained network (transfer learning)
  5. A faster optimizer
  • Momentum optimization
    • Idea: bowling ball rolling down a gentle slope on a smooth surface: it will start out slowly, but it will quickly pick up momentum until it eventually reaches terminal velocity
    • vs regular gradient descent: small steps when the slope is gentle and big steps when the slope is steep, but it will never increase speed.
    • A typical momentum value is 0.9
    • momentum optimization will roll down the valley faster and faster until it reaches the bottom (the optimum). In deep neural networks that don’t use batch normalization, the upper layers will often end up having inputs with very different scales, so using momentum optimization helps a lot.
    • one drawback of momentum optimization is that it adds yet another hyperparameter to tune.
  • Nesterov Accelerated Gradient (NAG)
    • Figure 11-7. Regular versus Nesterov momentum optimization: the former applies the gradients computed before the momentum step, while the latter applies the gradients computed after
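    • In Keras, both are arguments of the SGD optimizer, as in this minimal sketch (the learning rate is illustrative):
      import tensorflow as tf

      momentum_opt = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9)
      nag_opt = tf.keras.optimizers.SGD(learning_rate=0.001, momentum=0.9, nesterov=True)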
  • AdaGrad
    • Figure 11-8. AdaGrad versus gradient descent: the former can correct its direction earlier to point to the optimum
    • elongated bowl problem: gradient descent starts by quickly going down the steepest slope, which does not point straight toward the global optimum, then it very slowly goes down to the bottom of the valley.
    • Idea: the algorithm corrects its direction earlier so that it points a bit more toward the global optimum.
    • AdaGrad frequently performs well for simple quadratic problems
    • you should not use it to train deep neural networks (it may be efficient for simpler tasks such as linear regression, though)
  • RMSProp
    • AdaGrad runs the risk of slowing down a bit too fast and never converging to the global optimum. The RMSProp algorithm fixes this by accumulating only the gradients from the most recent iterations
    • this optimizer almost always performs much better than AdaGrad. In fact, it was the preferred optimization algorithm of many researchers until Adam optimization came around
  • Adam (adaptive moment estimation)
    • combines the ideas of momentum optimization and RMSProp: it keeps track of an exponentially decaying average of past gradients; and just like RMSProp, it keeps track of an exponentially decaying average of past squared gradients
    • The momentum decay hyperparameter β₁ is typically initialized to 0.9, while the scaling decay hyperparameter β₂ is often initialized to 0.999. The learning rate often uses the default value η = 0.001.
    • three variants of Adam: AdaMax, Nadam, AdamW
      • In practice, this can make AdaMax more stable than Adam, but it really depends on the dataset, and in general Adam performs better.
  • Adaptive optimization methods: RMSProp, Adam, AdaMax, Nadam, and AdamW optimization
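    • A minimal sketch of creating these optimizers in Keras with the usual default hyperparameters (the weight_decay value is illustrative; AdamW requires a fairly recent TF/Keras version):
      import tensorflow as tf

      rmsprop = tf.keras.optimizers.RMSprop(learning_rate=0.001, rho=0.9)
      adam    = tf.keras.optimizers.Adam(learning_rate=0.001, beta_1=0.9, beta_2=0.999)
      adamax  = tf.keras.optimizers.Adamax(learning_rate=0.001)
      nadam   = tf.keras.optimizers.Nadam(learning_rate=0.001)
      adamw   = tf.keras.optimizers.AdamW(learning_rate=0.001, weight_decay=1e-4)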
  • If you are disappointed by your model's performance when using adaptive optimization, try using NAG instead.
  • (Legend of the book's optimizer comparison table: * is bad, ** is average, and *** is good.)

Learning Rate Scheduling

  • LR too high → training may diverge; too low → training takes a very long time to converge
  • Limit budget? → interrupt training before it has converged properly
    • Learning curves for various learning rates η
  • Power scheduling: first drops quickly, then more and more slowly
  • Exponential scheduling: keeps slashing the learning rate by a factor of 10 every s steps.
  • Piecewise constant scheduling: constant learning rate for a number of epochs. This solution can work very well.
  • Performance scheduling: measure the validation error every N steps and reduce the learning rate when the error stops dropping.
  • 1cycle scheduling (not available out of the box in Keras) ← can converge very quickly.
  • To sum up, exponential decay, performance scheduling, and 1cycle can considerably speed up convergence, so give them a try!
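    • Two minimal sketches of exponential scheduling in Keras (the initial rate 0.01 and the 20-epoch decay are illustrative):
      import tensorflow as tf

      # (a) per-epoch callback
      def exponential_decay_fn(epoch):
          return 0.01 * 0.1 ** (epoch / 20)          # divide the LR by 10 every 20 epochs
      lr_scheduler = tf.keras.callbacks.LearningRateScheduler(exponential_decay_fn)
      # model.fit(..., callbacks=[lr_scheduler])

      # (b) per-step schedule passed straight to the optimizer
      schedule = tf.keras.optimizers.schedules.ExponentialDecay(
          initial_learning_rate=0.01, decay_steps=20_000, decay_rate=0.1)
      optimizer = tf.keras.optimizers.SGD(learning_rate=schedule)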

Avoiding Overfitting Through Regularization

  • Early stopping: one of the best regularization techniques (introduced in Chapter 10).
  • l1 and l2 Regularization
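    • A minimal sketch of applying ℓ2 regularization to a layer via kernel_regularizer (the factor 0.01 is illustrative; l1 and l1_l2 work the same way):
      import tensorflow as tf

      layer = tf.keras.layers.Dense(
          100, activation="relu",
          kernel_initializer="he_normal",
          kernel_regularizer=tf.keras.regularizers.l2(0.01))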
  • Dropout: Dropout is one of the most popular regularization techniques for deep neural networks.
      • many state-of-the-art neural networks use dropout, as it gives them a 1%–2% accuracy boost.
      • at every training step, every neuron (including the input neurons, but always excluding the output neurons) has a probability p of being temporarily “dropped out”
      • The dropout rate is closer to 20%–30% in recurrent neural nets, and closer to 40%–50% in convolutional neural networks.
      • Neurons trained with dropout cannot coadapt with their neighboring neurons; they have to be as useful as possible on their own.
      • In practice, you can usually apply dropout only to the neurons in the top one to three layers (excluding the output layer).
    • Warning: Since dropout is only active during training, comparing the training loss and the validation loss can be misleading. In particular, a model may be overfitting the training set and yet have similar training and validation losses. So, make sure to evaluate the training loss without dropout (e.g., after training).
    • If you observe that the model is overfitting, you can increase the dropout rate and vice versa.
    • many state-of-the-art architectures only use dropout after the last hidden layer, so you may want to try this if full dropout is too strong.
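    • A minimal sketch of dropout applied only before the output layer, matching the tip above (the rate 0.2 and layer sizes are illustrative):
      import tensorflow as tf

      model = tf.keras.Sequential([
          tf.keras.layers.Flatten(input_shape=[28, 28]),
          tf.keras.layers.Dense(300, activation="relu"),
          tf.keras.layers.Dense(100, activation="relu"),
          tf.keras.layers.Dropout(rate=0.2),      # active during training only
          tf.keras.layers.Dense(10, activation="softmax"),
      ])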
  • Monte Carlo (MC) Dropout:
    • MC dropout tends to improve the reliability of the model’s probability estimates. This means that it’s less likely to be confident but wrong, which can be dangerous: just imagine a self-driving car confidently ignoring a stop sign. It’s also useful to know exactly which other classes are most likely.
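    • A minimal MC dropout sketch: keep dropout active at inference with training=True and average many stochastic predictions (`model` and `X_test` are assumed to exist; 100 samples is illustrative):
      import numpy as np

      y_probas = np.stack([model(X_test, training=True) for _ in range(100)])
      y_proba = y_probas.mean(axis=0)   # better-calibrated class probabilities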
  • Max-Norm Regularization: can also help alleviate the unstable gradients problems (if you are not using batch normalization).
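    • A minimal sketch of max-norm regularization via a kernel constraint (the max value 1.0 is illustrative):
      import tensorflow as tf

      layer = tf.keras.layers.Dense(
          100, activation="relu",
          kernel_constraint=tf.keras.constraints.max_norm(1.0))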

Summary and Practical Guidelines

  • If the network is a simple stack of dense layers, then it can self-normalize (SELU activation with LeCun initialization).
  • Don’t forget to normalize the input features
  • Try to reuse parts of a pretrained neural network when you can.
  • Use unsupervised pretraining if you have a lot of unlabeled data
  • Use pretraining on an auxiliary task.
  • If you need a sparse model, use l1 regularization.
  • If you need a low-latency model → fast activation
  • Need risk-sensitive application → MC dropout.

Chapter 12 — Custom Models and Training with TensorFlow

I didn't read this chapter carefully. It mainly introduces TensorFlow's functions, how to use them, and how to customize them to your needs. It will be more useful to reread it once I'm actually working with TensorFlow.
Below are a few key points from the chapter:
  • 95% of the use cases you will encounter will not require anything other than Keras
  • TF → Its core is very similar to NumPy, but with GPU support.
  • Its lowest level is implemented in C++.
  • There are APIs for C++, Java, Swift, and JavaScript as well.
  • TensorFlow’s API revolves around tensors, which flow from operation to operation—hence the name TensorFlow.
  • Tensors play nice with NumPy.
  • NumPy uses 64-bit precision by default, while TensorFlow uses 32-bit. → when you create a tensor from a NumPy array, make sure to set dtype=tf.float32.
  • tf.Tensor values are immutable → use tf.Variable if you need values you can modify.
  • Other data structures (Appendix C): sparse tensors, tensor arrays, ragged tensors, string tensors, sets, queues.
  • Customizing Models and Training Algorithms:
    • Custom Loss functions
    • Saving and Loading Models That Contain Custom Components
    • Custom Activation Functions, Initializers, Regularizers, and Constraints
    • Custom Metrics
    • Custom Layers
    • Custom Models
    • Losses and Metrics Based on Model Internals
    • Computing Gradients Using Autodiff
    • Custom Training Loops
  • TensorFlow Functions and Graphs
    • AutoGraph and Tracing
    • TF Function Rules

Chapter 13 — Loading and Preprocessing Data with TensorFlow

Like Chapter 2, I only skimmed this chapter. It mainly covers using TensorFlow to load and preprocess data for deep learning models.
  • tf.data API:
    • The tf.data API is a streaming API: you can very efficiently iterate through a dataset’s items, but the API is not designed for indexing or slicing.
    • Chaining Transformations: apply all sorts of transformations.
    • Shuffling the Data: gradient descent works best when the instances in the training set are independent and identically distributed
    • Interleaving Lines from Multiple Files
    • Preprocessing the Data
    • Prefetching
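    • A minimal tf.data pipeline sketch chaining these transformations (`X_train`, `y_train`, and the constants are illustrative):
      import tensorflow as tf

      dataset = tf.data.Dataset.from_tensor_slices((X_train, y_train))
      dataset = (dataset
                 .shuffle(buffer_size=10_000, seed=42)            # i.i.d.-looking mini-batches
                 .batch(32)
                 .map(lambda X, y: (X / 255.0, y),                # example preprocessing step
                      num_parallel_calls=tf.data.AUTOTUNE)
                 .prefetch(1))                                    # prepare the next batch while training on the current one
      # model.fit(dataset, epochs=5)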
  • The TFRecord Format
    • CSV files are common, simple, and convenient, but they are not really efficient and do not support large or complex data structures (such as images or audio) very well. So, let's see how to use TFRecords instead.
    • The TFRecord format is TensorFlow’s preferred format for storing large amounts of data and reading it efficiently
    • Compressed TFRecord Files
    • Protocol Buffers:
      • TFRecord files usually contain serialized protocol buffers (also called protobufs). This is a portable, extensible, and efficient binary format developed at Google back in 2001 and made open source in 2008; protobufs are now widely used, in particular in gRPC
  • Keras Preprocessing Layers
    • Normalization layer
      • Since we included the Normalization layer inside the model, we can now deploy this model to production without having to worry about normalization again: the model will just handle it (see Figure 13-4).
      • Figure 13-4. Including preprocessing layers inside a model → slow
      • Including the preprocessing layer directly in the model is nice and straightforward, but it will slow down training → We can do better by normalizing the whole training set just once before training.
      • But now the model won’t preprocess its inputs when we deploy it to production. To fix this, we just need to create a new model that wraps both the adapted Normalization layer and the model we just trained.
      • Figure 13-5. Preprocessing the data just once before training using preprocessing layers, then deploying these layers inside the final model
      • Now we have the best of both worlds: training is fast because we only preprocess the data once before training begins, and the final model can preprocess its inputs on the fly without any risk of preprocessing mismatch.
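      • A minimal sketch of this "adapt once, wrap for deployment" pattern (`X_train`, `y_train`, and `model` are assumed to exist):
        import tensorflow as tf

        norm_layer = tf.keras.layers.Normalization()
        norm_layer.adapt(X_train)                    # compute the mean/variance once
        X_train_scaled = norm_layer(X_train)         # preprocess the whole training set once
        # model.fit(X_train_scaled, y_train, epochs=5)

        final_model = tf.keras.Sequential([norm_layer, model])   # deployed model preprocesses on the fly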
    • The Discretization Layer: The Discretization layer’s goal is to transform a numerical feature into a categorical feature by mapping value ranges (called bins) to categories.
    • The CategoryEncoding Layer: When there are only a few categories (e.g., less than a dozen or two), then one-hot encoding is often a good option
      • It's hard to know in advance whether a single multi-hot encoding or a per-feature one-hot encoding will work best: it depends on the task, and you may need to test both options.
    • The StringLookup Layer: ← Now you can encode categorical integer features using one-hot or multi-hot encoding. But what about categorical text features? For this, you can use the StringLookup layer.
    • The Hashing Layer: This idea of mapping categories pseudorandomly to buckets is called the hashing trick.
      • For each category, the Keras Hashing layer computes a hash, modulo the number of buckets (or “bins”). The mapping is entirely pseudorandom, but stable across runs and platforms (i.e., the same category will always be mapped to the same integer, as long as the number of bins is unchanged).
      • The benefit of this layer is that it does not need to be adapted at all, which may sometimes be useful, especially in an out-of-core setting (when the dataset is too large to fit in memory).
      • → it’s usually preferable to stick to the StringLookup layer.
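      • A minimal StringLookup vs. Hashing sketch (the toy vocabulary and number of bins are illustrative):
        import tensorflow as tf

        lookup = tf.keras.layers.StringLookup(output_mode="one_hot")
        lookup.adapt(tf.constant(["cat", "dog", "bird"]))       # learn the vocabulary
        print(lookup(tf.constant(["dog", "owl"])))              # unknown strings go to the OOV bucket

        hashing = tf.keras.layers.Hashing(num_bins=10)          # no adapt() needed at all
        print(hashing(tf.constant(["dog", "owl"])))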
    • Encoding Categorical Features Using Embeddings:
      • An embedding is a dense representation of some higher-dimensional data, such as a category, or a word in a vocabulary.
      • In deep learning, embeddings are usually initialized randomly.
      • training tends to make embeddings useful representations of the categories. This is called representation learning
      • Figure 13-6. Embeddings will gradually improve during training
      • Word embedding: when you are working on a natural language processing task, you are often better off reusing pretrained word embeddings than training your own.
      • Word embeddings are also organized along meaningful axes in the embedding space. For example, if you compute King – Man + Woman (adding and subtracting the embedding vectors of these words), the result will be very close to the embedding of the word Queen.
        • Word embeddings of similar words tend to be close, and some axes seem to encode meaningful concepts
      • Unfortunately, word embeddings sometimes capture our worst biases. For example, although they correctly learn that Man is to King as Woman is to Queen, they also seem to learn that Man is to Doctor as Woman is to Nurse
      • An Embedding layer is initialized randomly, so it does not make sense to use it outside of a model as a standalone preprocessing layer unless you initialize it with pretrained weights.
      • In this example we used 2D embeddings, but as a rule of thumb embeddings typically have 10 to 300 dimensions, depending on the task, the vocabulary size, and the size of your training set. You will have to tune this hyperparameter.
      • One-hot encoding followed by a Dense layer (with no activation function and no biases) is equivalent to an Embedding layer. However, the Embedding layer uses way fewer computations as it avoids many multiplications by zero
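      • A minimal sketch of a StringLookup layer feeding a trainable Embedding layer (the toy vocabulary and output_dim=2 are illustrative; real embeddings are usually 10- to 300-dimensional):
        import tensorflow as tf

        lookup = tf.keras.layers.StringLookup()
        lookup.adapt(tf.constant(["<1H OCEAN", "INLAND", "NEAR OCEAN"]))
        embedding = tf.keras.layers.Embedding(input_dim=lookup.vocabulary_size(), output_dim=2)
        print(embedding(lookup(tf.constant(["INLAND", "NEAR OCEAN"]))))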
    • Keras provides a TextVectorization layer for basic text preprocessing.
    • Image Preprocessing Layers: preprocessing layers + data augmentation.