This note serves as a reminder of the book's content, including additional research on the mentioned topics. It is not a substitute for the book. Most images are sourced from the book or referenced.
IMPORTANT UPDATE (Nov 18, 2024): I've stopped taking detailed notes from the book and now only highlight and annotate directly in the PDF files/book. With so many books to read, I don't have time to type everything. In the future, if I make notes while reading a book, they'll contain only the most notable points (for me).
- These notes are for the 3rd edition.
- Author: Aurélien Géron.
This book is organized in 2 parts: Part I (The Fundamentals of Machine Learning) and Part II (Neural Networks and Deep Learning).
Other useful resources mentioned in the book:
- Andrew Ng’s ML course on Coursera (my notes for the old version of this course).
- Scikit-Learn’s User Guide.
- Blogs listed on Quora.
Chapter 1 introduces a lot of fundamental concepts (and jargon) that every data scientist should know by heart. If you are already familiar with machine learning basics, you may want to skip directly to Chapter 2.
A computer program is said to learn from experience E with respect to some task T and some performance measure P, if its performance on T, as measured by P, improves with experience E. — Tom Mitchell, 1997.
Example: email spam filter ← give it examples of spam/non-spam emails so that it can learn to flag spam.
- Training set: the examples the system uses to learn. Each training example is called a training instance (or sample).
- Model: the part of the ML system that learns and makes predictions. Examples: Neural Networks, Random Forests,…
- T = the task of flagging spam for new emails. E = the training data. The performance measure P needs to be defined ← e.g., the ratio of correctly classified emails, called accuracy (see the sketch below).
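A minimal sketch of computing accuracy with scikit-learn (the labels below are made up for illustration):

```python
from sklearn.metrics import accuracy_score

y_true = [1, 0, 1, 1, 0, 1]  # made-up ground truth: 1 = spam, 0 = ham
y_pred = [1, 0, 0, 1, 0, 1]  # made-up model predictions
print(accuracy_score(y_true, y_pred))  # 5 correct out of 6 ≈ 0.833
```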
For example, spam emails tend to contain words like “4U” in the subject line. Compare (1) the traditional rule-based approach with (2) the ML approach:
- Using (1), we flag all of these words (all the patterns we think of) → a spammer switches to “For U” → we need to update (1) again → … → bad!
- (2) detects the frequent word patterns in the spam examples by itself and adapts to the new ones.
Data Mining = digging into large amounts of data to discover hidden patterns.
Machine learning is great for:
- Problems for which existing solutions require a lot of fine-tuning or long lists of rules (a machine learning model can often simplify code and perform better than the traditional approach)
- Complex problems for which using a traditional approach yields no good solution (the best machine learning techniques can perhaps find a solution)
- Fluctuating environments (a machine learning system can easily be retrained on new data, always keeping it up to date)
- Getting insights about complex problems and large amounts of data
- Analyzing images of products on a production line to automatically classify them ← Image Classification ← using CNNs, Transformers.
- Detecting tumors in brain scans ← Image Segmentation ← also using CNNs and Transformers.
- Automatically classifying news articles ← NLP (Natural Language Processing) ← using RNNs (Recurrent Neural Networks), Transformers.
- Automatically flagging offensive comments on discussion forums ← Text classification.
- Summarizing long documents automatically ← Text summarization.
- Creating a chatbot or a personal assistant ← NLU (Natural Language Understanding), Question-Answering modules.
- Forecasting your company’s revenue next year, based on many performance metrics ← Linear Regression, Polynomial Regression, SVM (Support Vector Machine), Random Forest, Neural Networks.
- Making your app react to voice commands ← Speech Recognition ← RNNs, CNNs, Transformers.
- Detecting credit card fraud ← Anomaly Detection ← Isolation Forests, Gaussian mixture models, Autoencoders.
- Segmenting clients based on their purchases so that you can design a different marketing strategy for each segment ← Clustering ← K-Means, DBSCAN,…
- Representing a complex, high-dimensional dataset in a clear and insightful diagram ← Data Visualization, Dimensionality Reduction.
- Recommending a product that a client may be interested in, based on past purchases ← Recommender System.
- Building an intelligent bot for a game ← Reinforcement Learning
Classify types of ML based on:
- How are they supervised during training? ← supervised, unsupervised, semi-supervised, self-supervised,…
- Can they learn incrementally on the fly? ← online learning vs batch learning.
- Comparing new data to known data? Or detecting new patterns? ← Instance-based learning vs model-based learning.
Above types can be used together.
- The training set fed to the algorithm includes the desired solutions ← labels.
- Classification: train on examples with their classes → the model classifies new instances.
- Regression: predicts a target numeric value (e.g., the price of a car) given a set of features/predictors/attributes (e.g., mileage, age, brand,…). A regression model can also be used for classification ← Logistic Regression outputs a probability of belonging to a class (see the sketch below).
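A minimal sketch of using a regression model for classification, with scikit-learn’s Logistic Regression on the built-in iris dataset (not an example from the book):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)
clf = LogisticRegression(max_iter=1000).fit(X, y)

print(clf.predict(X[:2]))        # predicted classes
print(clf.predict_proba(X[:2]))  # estimated probability of each class
```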
- Training data is unlabeled ← clustering can be used to detect groups of similar data. If you use a hierarchical clustering algorithm, it may also subdivide each group into smaller groups.
- Visualization is an example of unsupervised learning. ← These algorithms try to preserve as much structure as they can.
- Dimensionality reduction: simplify the data without losing too much information ← e.g., merge several correlated features into one (see the sketch after this list).
- Anomaly Detection: e.g., detecting unusual credit card transactions, catching manufacturing defects, or automatically removing outliers from a dataset before feeding it to another learning algorithm ← the system learns what “normal” looks like + meets a new instance → decides whether it’s anomalous or not.
- Novelty detection: similar to anomaly detection, but it looks for new instances that look different from all instances in the training set.
- Association rule learning: dig into large amounts of data → find patterns and relations between features. E.g., relations between products bought in a supermarket.
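A minimal sketch of two of these unsupervised tasks on made-up data: K-Means finds groups of similar instances, and PCA merges correlated features into fewer dimensions:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

rng = np.random.default_rng(42)
X = rng.normal(size=(100, 5))  # made-up unlabeled data, 5 features
X[:50] += 3                    # shift half the points to form a second group

clusters = KMeans(n_clusters=2, n_init=10, random_state=42).fit_predict(X)
X_2d = PCA(n_components=2).fit_transform(X)  # 5 features -> 2

print(clusters[:10])  # cluster assignment of the first 10 instances
print(X_2d.shape)     # (100, 2)
```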
These are algorithms dealing with data that is partially labeled. E.g., Google Photos labels your face in new photos, or labels all the faces in a photo.
Most semi-supervised algorithms = unsupervised + supervised. E.g., use clustering to label the unlabeled data, then apply a supervised algorithm to this new all-labeled dataset (see the sketch below).
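A minimal sketch of that clustering-then-supervised idea on made-up data (it assumes each cluster contains at least one labeled instance, which holds for this toy example):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])
y = np.array([0] * 50 + [1] * 50)  # true labels, mostly unknown to us
known = np.zeros(100, dtype=bool)
known[[0, 50]] = True              # only 2 instances are labeled

# Unsupervised step: group the instances.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Propagate each cluster's known label to all its members.
y_full = np.empty(100, dtype=int)
for c in range(2):
    y_full[clusters == c] = y[known & (clusters == c)][0]

# Supervised step on the now all-labeled data.
clf = LogisticRegression().fit(X, y_full)
```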
Generate a fully labeled dataset from a fully unlabeled one.
E.g., a large amount of unlabeled images can be used by masking part of each image and training a model to reconstruct the missing parts. The trained model can then distinguish species such as cats and dogs, although it doesn’t know their names yet; later, we can map this knowledge to the labels humans use.
Transfer learning = transferring knowledge from one task to another ← one of the most important techniques in ML.
Agent = the learning system → it can observe the environment + select and perform actions + get rewards (or penalties) ← it must find the best strategy (called a policy) to get the most reward over time.
Example: DeepMind’s AlphaGo beat Ke Jie (the world’s number one Go player) by learning from millions of games and playing against itself.
It’s trained from all the available data, done offline. ← Offline Learning.
- The model tends to decay because the world keeps changing → model rot or data drift.
- If you want a batch learning system to know about new data → retrain on the full dataset (new + old).
- This is inefficient (time/resource consumption).
- It feeds the system data sequentially (in mini-batches) ← quick and cheap, new data can be learned on the fly (see the sketch after this list).
- Can be used if the data changes fast or you have limited computing resources.
- Can be used to train on huge datasets that cannot fit in one machine’s main memory (out-of-core learning).
- Learning rate = how fast the system should adapt to changing data. Too high → adapts quickly but also quickly forgets old data; too low → learns slowly but is less sensitive to noise in the new data.
- Weakness: The system is vulnerable to bad data being fed while it is live. To address this, set up a mechanism to turn off learning if a drop in performance is detected.
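A minimal sketch of online learning with scikit-learn’s SGDClassifier, which supports incremental training on mini-batches via partial_fit (the data stream is made up):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
clf = SGDClassifier(learning_rate="constant", eta0=0.01)  # eta0 = learning rate

classes = np.array([0, 1])  # all classes must be declared for partial_fit
for _ in range(100):        # simulate a stream of mini-batches
    X_batch = rng.normal(size=(32, 4))               # made-up features
    y_batch = (X_batch.sum(axis=1) > 0).astype(int)  # made-up labels
    clf.partial_fit(X_batch, y_batch, classes=classes)

print(clf.predict(rng.normal(size=(3, 4))))  # predictions stay up to date
```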
- One way to categorize ML systems is by how they generalize.
- Goal: good performance both on the training data and when predicting on new instances.
Learn the examples by heart + use a measure of similarity to detect “look-alike” cases, e.g., spam emails similar to known ones.
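A minimal sketch of instance-based learning: k-Nearest Neighbors stores the training instances and classifies new ones by similarity (distance), here on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
knn = KNeighborsClassifier(n_neighbors=5).fit(X, y)  # "learns by heart"
print(knn.predict(X[:3]))  # classify by majority vote of the 5 nearest instances
```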
Generalize from dataset → build a model → use this model to make predictions.
You want to know if money makes people happy.
- From the dataset, you plot the data ← studying the data.
- Based on the plot, it looks like linear regression fits (life satisfaction goes up/down linearly with GDP) ← model selection.
- Plot the model
- How do we know which model is best? → measure how good it is with a utility function (or fitness function), or measure how bad it is with a cost function ← for linear regression, we usually use a cost function that measures the distance between the linear model’s predictions and the training examples ← objective: minimize the cost function!
- Predict new data ← inference (see the sketch after this list).
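A minimal sketch of this whole workflow (the GDP and life-satisfaction numbers below are made up, not the book’s actual dataset):

```python
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[20_000], [30_000], [40_000], [50_000], [60_000]])  # GDP per capita
y = np.array([5.5, 5.9, 6.3, 6.8, 7.2])                           # life satisfaction

model = LinearRegression()
model.fit(X, y)                   # training: minimizes a squared-distance cost
print(model.predict([[37_000]]))  # inference for a new country
```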
2 things can go wrong when training models → “bad model” & “bad data”.
- Insufficient Quantity of Training Data → a child can learn to recognize “an apple” from a few examples; ML models can’t, we need a lot of data for them!
In this paper, Microsoft researchers showed that, given enough data, very different models perform almost identically!
The idea that data matters more than algorithms for complex problems!
However, extra data is not always easy or cheap to get, so don’t abandon algorithms just yet!
- Nonrepresentative Training Data → to generalize well, the training data needs to be representative of the new cases!
- If the sample is too small → sampling noise (nonrepresentative data by chance). Even a large sample can be nonrepresentative if the sampling method is flawed ← sampling bias!
- Poor-Quality Data: it’s worth spending time cleaning up the training data ← most data scientists spend a significant part of their time doing just that!
- Irrelevant Features: garbage in, garbage out. A critical part of the success of a machine learning project is coming up with a good set of features to train on ← Feature engineering. 2 steps (see the sketch after this list):
- Feature selection: select the most useful features to train.
- Feature extraction: combine existing features to make a more useful one.
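A minimal sketch of both steps with scikit-learn (feature selection via SelectKBest, feature extraction via PCA), on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.feature_selection import SelectKBest, f_classif

X, y = load_iris(return_X_y=True)
X_selected = SelectKBest(f_classif, k=2).fit_transform(X, y)  # keep the 2 best features
X_extracted = PCA(n_components=2).fit_transform(X)            # combine into 2 new ones
print(X_selected.shape, X_extracted.shape)  # (150, 2) (150, 2)
```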
- Overfitting the training data: the model performs well on the training data, but it does not generalize well. It happens when the model is too complex relative to the amount (and noisiness) of the data. Possible solutions:
- Simplify the model: fewer parameters, reduce the number of attributes, constrain the model,…
- Gather more training data.
- Reduce the noise in the data (e.g., fix errors, remove outliers).
Regularization = constraining a model to make it simpler and reduce the risk of overfitting. The amount of regularization to apply during learning is controlled by hyperparameters (see the sketch below).
You want to find the right balance between fitting the training data perfectly and keeping the model simple enough to ensure that it will generalize well.
→ Tuning hyperparameters is an important part of building a machine learning system!
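A minimal sketch of regularization with scikit-learn: Ridge is linear regression with an L2 penalty, and its alpha hyperparameter controls how much the model is constrained (the data is made up):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 10))            # few samples, many features -> overfitting risk
y = X[:, 0] + 0.5 * rng.normal(size=20)  # only feature 0 really matters

unregularized = LinearRegression().fit(X, y)
regularized = Ridge(alpha=10.0).fit(X, y)  # larger alpha -> simpler model

print(np.abs(unregularized.coef_).round(2))  # spurious weights on noise features
print(np.abs(regularized.coef_).round(2))    # weights shrunk toward 0
```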
- Underfitting the training data: your model is too simple to learn the underlying structure of the data. Possible solutions:
- Select a more powerful model (more parameters)
- Better features (feature engineering)
- Reduce the constraints on the model (eg. reducing the regularization hyperparameter).
- Split the data into 2 sets: a training set (train the model on it) & a test set (test how well the model works on it). Commonly 80% training and 20% test, but it depends on the size of the dataset (with millions of instances, even 1% can be enough for testing).
- Evaluate your model on the test set → get the generalization error (out-of-sample error).
- If the training error is low but the generalization error is high → overfitting (see the sketch below).
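A minimal sketch of the split-and-evaluate workflow on the built-in iris dataset:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)  # 80% train, 20% test

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
train_error = 1 - model.score(X_train, y_train)
test_error = 1 - model.score(X_test, y_test)  # estimate of the generalization error
print(train_error, test_error)  # low train error + high test error -> overfitting
```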
Problem: you have a model → how do you choose the value of a regularization hyperparameter? → train 100 different models using 100 different values → evaluate on the test set → pick the best value → but when you apply the model to real data, it performs badly ← Why? Because you tuned the hyperparameter to the test set itself!
→ Common solution is holdout validation
Holdout validation: split training set into “new” training set + validation set (or development set or dev set).
Process: train multiple models (with various hyperparameters) on the “new” training set → select the model that performs best on the validation set → retrain the best model on the whole training set (new + validation) → final model → evaluate it on the test set.
Problems:
- Validation set too small → the selected model may be a “suboptimal” one.
- Validation set too large → the remaining training set is much smaller than the full training set → that’s bad because it’s like “selecting the fastest sprinter to participate in a marathon” ← solution: perform repeated cross-validation (use multiple validation sets and average the evaluations, see the sketch below) ← weakness: training takes longer!
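A minimal sketch of cross-validation for picking a hyperparameter without touching the test set, using scikit-learn’s cross_val_score (the candidate values are arbitrary):

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = load_iris(return_X_y=True)
for c in [0.01, 0.1, 1.0, 10.0]:  # candidate values of the hyperparameter C
    model = LogisticRegression(C=c, max_iter=1000)
    scores = cross_val_score(model, X, y, cv=5)  # 5 validation sets
    print(c, scores.mean())  # keep the value with the best average score
```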
It is easy to obtain a large amount of data, but such data may not be representative of the data that will be seen in production.
For example, when building a mobile app to detect flowers, a model trained on flower pictures downloaded from the web may not perform well on photos taken with the app ← when the model performs poorly, we can’t tell whether it’s because of overfitting or because of the mismatch between web and app data!
Remember: validation set & test set must be representative of the data you expect to use in production!
Solution: use a train-dev set ← Idea: hold out part of the training (web) pictures in a “train-dev” set → train on the rest of the training set + evaluate on the train-dev set → if performance is poor, the model is overfitting the training set. Otherwise → no overfitting → evaluate on the dev set → if performance is poor, the problem is data mismatch! → when it’s good → evaluate on the test set → when it’s good → production.
No free lunch theorem
If you make absolutely no assumption about the data, then there is no reason to prefer one model over any other. In practice you make some reasonable assumptions about the data and evaluate only a few reasonable models.
Read the book.