the100pagemlbook

1. Introduction
2. Notations and Definitions
3. Fundamental Algorithms
4. Anatomy of a Learning Algorithm
5. Basic Practice
6. Neural Networks and Deep Learning
7. Problems and Solutions
8. Advanced Practice
9. Unsupervised Learning
10. Other Forms of Learning
11. Conclusions

The parent reference sentinel for this book is rooted as The Hundred Page Machine Learning Book under the major machine learning node. I populate this node with the intention to index into other major nodes of the field and fill in some holes that are generic and require a book for an end to end coverage cause tending to them in a non-project oriented scenario isn't worth the time - this isn't a book summary.

1. Introduction

1.1. What is ML?

subdomain of Computer Science
- building algorithms that rely on a collection of instances about a phenomenon to solve tasks related to it.
- these instances can be sourced in multiple ways:
  - from nature
  - from another algorithm
  - synthesized according to human behaviour
a very basic breakdown for the process of machine learning is broken down as :
1. Gather a dataset
2. algorithmically build a statistical model based on that dataset

1.2. Types of ML

1.3. PAC learning

on learnability and understanding relationship between model error, training set size, model building time and the mathematical formulation of the model
this is an index; explored in depth in a dedicated node
- PAC learning

2. Notations and Definitions

well, I won't be focusing on the notation cause I'm proceeding with symbolic representations via s-expressions
still indexing the important definitions though

2.1. Scalars, Vectors and Sets

2.2. Math as S-expressions

I won't be using mathmatical notation in these notes cause of the reasons listed here : NIL
I find it convenient to just build up an abstraction when needed

2.3. Functions

mathematically speaking : a relation that maps elements from the domain set to the codomain set (aka range), with some constraints on the nature of this map.
manipulation of the various parameters in the above dfeinition (types of the sets, nature of the mapping, etc) gives rise to different classes of functions.

2.4. Max and Arg Max

The former refers to the maximum element in a set
- this requires the notion of comparison to be defined over that set
The latter refers to an optimal element in a set that pushes the value of a function to its optimal value (max)
Note that both are functions and have relevant counterparts as min and arg min

2.5. Derivatives and Gradient

a complete treatment will be present in Matrix Calculus

2.6. Random Variables

helps capture the notion of probability into mathematical expressions
- can be discrete and continuous
Probability is a large epistemological root in itself and will be explored independently
a few important terms relevant to a random variable that will be referred to going forward:
- probability mass distribution, probability density
- expectation
- variance, covariance
these will be conceptualized programmatically going forward

2.7. Unbiased Estimator

if the expectation of the estimator random variable is equal to the statistic being estimated, it's termed as an unbiased estimator of that statistic.
for instance, sample mean is an unbiased estimator of the mean statistic of a random variable

2.8. Bayes' Rule (aka Bayes' Theorem)

denotes equivalence of the probability of a joint event broken down into its conditionals and respective singulars

(defmacro P (event)
  ...) ; probabilty of a given event

(defmacro and (X-sample Y-sample)
  ...) ;conceptualizes the event (X-sample and Y-sample)
(defmacro given (X-sample Y-sample)
  ...) ; conceptualizes the event (X-sample, given Y-sample)

(defmacro sample (rand-var instance)
  ...) ; denotes the event that instance was sampled from rand-var

(defmacro declare-randvar (tag ...)
  ...) ; declare a random variable with associated info

;;Bayes' rule then is:

(declare-randvar X ...)
(declare-randvar Y ...)

(let ((X=x (sample X x))
      (Y=y (sample Y y)))
  (assert-mathematical-equal
   (P (and X=x
           Y=y))
   (* (P (given X=x
                Y=y))
      (P Y=y))
   (* (P (given Y=y
                X=x))
      (P X=x))))

2.9. Parameter Estimation

to be explored in depth in the main node

2.10. Classification vs Regression

2.11. Model-Based vs Instance Based Learning

Model Based : use training data to build a model with learned parameters - see Support Vector Machine
Instance Based : use the whole dataset as a model (no intermediate parameters) - see k-Nearest Neighbors

2.12. Shallow vs Deep Learning

Shallow learning : model parameters are learned directly from features of training samples
Deep Learning uses optimization techniques for a loss function and it's more of a black box than the shallow learning alternative

3. Fundamental Algorithms

3.1. Linear Regression

3.2. Logistic Regression

3.3. Decision Trees

3.4. k-Nearest Neighbors

4. Anatomy of a Learning Algorithm

Any learning algorithm is centered around certain basics:

A loss function
an optimization ..
- criterion, inspired from the loss function
- routine, that finds a solution to the optimization criterion

5. Basic Practice

5.1. Feature Engineering

5.2. Algorithm Selection

5.3. Three sets

training set
validation set
testing set

Keeping the validation and test set the same size is the common practice. A usual split for a traditional situation would be 70%:15%:15%. With big data though, it's reasonable to go for 95%:2.5%:2.5%.

briefly speaking, the individual purposes are as follows:

fitting the algorithm on the training set
checking performance intermittently on the validation set and making considerable hyperparametric choices like which algorithm to use and tuning the particular algorithm.
evaluating performance on the test set.

5.4. Underfitting and Overfitting

5.4.1. Bias (Underfitting)

Bias refers to the error introduced by approximating a real-world problem, which may be complex, by a simplified model. A model with high bias pays little attention to the training data and oversimplifies the problem. This leads to underfitting, where the model doesn't capture the underlying patterns in the data and performs poorly on both the training and unseen data. In simpler terms, a high-bias model is too simple to represent the data.

5.4.2. Variance (Overfitting)

Variance, on the other hand, refers to the error introduced due to the model's sensitivity to small fluctuations or noise in the training data. A model with high variance is overly complex and captures noise in the training data rather than the underlying patterns. This leads to overfitting, where the model performs very well on the training data but poorly on unseen data because it has essentially memorized the training data and doesn't generalize well.

5.4.3. Summary

High bias (underfitting) results from a model that is too simple and doesn't fit the data well.
High variance (overfitting) results from a model that is too complex and fits the training data too closely.

The goal in machine learning is to strike a balance between bias and variance, finding a model that is complex enough to capture the essential patterns in the data but not so complex that it overfits and fails to generalize to new, unseen data. This balance is crucial for building models that perform well in real-world scenarios. Techniques like cross-validation and regularization are often used to help find this balance.

5.5. Regularization

5.6. Model Performance assesment

already covered before in Classification Evaluation Metrics For regression, using a loss function is the most basic choice to rate performance. To compare the metric obtained, we need a ..

5.6.1. Baseline

to compare any metric, consider building a baseline model for the task at hand
- for regression this can be as simple as building a mean-model, for instance (one that predicts the average of all, for all)
- for classification, you could predict the mode or build a simple model.

5.7. Hyperparameter Tuning

if choices are low, perform a grid search over a validation set.
- for numerical hyperparameters with large ranges, perform multiple searches:
  1. first with a large span using a logarithmic scale
  2. then use linear scales when you get an idea about what ranges are worth paying attention to

This can be time consuming and other efficient techniques like random search and Bayesian Hyperparameter Optimization exist.

5.8. Cross validation

when the dataset isn't large enough to carve form a separate validation set, consider k-fold cross validation:
1. partition the dataset into k folds
2. at each fitting iteration:
  - fit on k-1 folds
  - validate on one
  - change the validation fold for the next iteration
3. for reporting the final validation metric, use an average of the k computations

Finally, once fit on all these fold combinations, the model performance is ready to be evaluated on the test set.

6. Neural Networks and Deep Learning

7. Problems and Solutions

7.1. Kernel Regression

7.2. One-class classification

7.3. Multi-Label Classification

7.4. Ensemble Algorithms

7.5. Learning to label sequences

already explored Recurrent Neural Networks for sequence labelling
looking into Conditional Random Fields now

7.6. Sequence to Sequence (seq2seq) Learning

7.7. Active Learning

7.8. Semi-Supervised Learning

7.9. Few Shot Learning

8. Advanced Practice

8.1. Handling Imbalanced Datasets

most practical datasets might under-represent some classes
see Class Imbalance for a basic coverage

8.2. Combining Models

Elaborating further upon the basics from Ensemble Algorithms.

Three basic ways to combine models:

(weighted) averaging
- naturally works for regression
- adaptable to classification when working with prediction probabilities
majority vote
- applicable to classification
- report the label with most votes
  - different tie breaker strategies possible
stacking
- building a meta model that takes output of base models as input
- have to create a new dataset from the outputs of the base models to train the meta-model
- tuning hyperparameters via cross-validation is recommended.

Every time, do verify if the ensemble model performs better using the validation set.

Do note that one benefits the most from an ensemble approach when using several "uncorrelated" models. Combining different versions of similar approaches (SVMs with several different kernel for instance or versions of decision trees for instance) may not result in a performance boost.

8.3. Training Neural Networks

check out SOTA before choosing architecture for a task
tuning the networks hyperparameters and sizes (number of layers, etc) based on its behaviour over the validation and training set (check if the network is complex enough, if it overfits and the need to regularize)
- a stepped approach towards intermittently increasing the size and regularizing to check if the model fits to the training and validation set both (model being complex enough and no overfitting respectively)

8.4. Advanced Regularization

elaborated upon the basics in Regularization

8.5. Handling Multiple Inputs

most practical problems demand working with multimodal data.
- for instance, a pair of image and possible caption being input with the output being an indicator variable verifying if the caption matches the picture
Combining Models as stated before in this node can be used to deal with the different modalities.
another common approach is to vectorize and concatenate all the modalities to form a longer tensor that can then be normally processed by a model
neural networks are more flexible:
- extract embeddings for the image and text from a CNN and RNN base respectively, concatenate them and proceed with a fully connected approach later on according to our output demands (classification or regression for starters)
averaging embeddings is also a possibility if the nature of the inputs is the same, before we proceed towards the final part of the network.

8.6. Handling Multiple Outputs

other than the multilabel classification problem, producing multimodal outputs can also be a necessity
for instance, one wants to detect if an object exists in an image along with the bounding box for it and also return a label for the same.
- the overall output for such a case will be an image + a vector of coordinates + a one-hot encoded label
- to deal with this one can employ an encoder subnetwork that encodes the image into an embedding.
- this subnetwork can have a ReLU on the last layer (good choice for detecting positive reals) to detect coordinates and this can use the mean squared error between this output and the desired coordinate vector
- the second subnetwork takes this same embedding as input and predicts probabilities for each label with a final softmax to do so
- the loss for this can be the usual negative log likelihood (aka cross entropy loss)
- the final loss function for optimization can be set to a convex combination of the two with the choice to tune the underlying hyperparameter.

8.7. Transfer Learning

8.8. Algorithmic Efficiency

use linear algebra optimized operations and avoid loops
use the laziest data structure that'll do for the problem (sets instead of lists if the order doesn't matter, for instance)
avoid writing custom methods if tested popular libraries exist
employ code profiling to find out bottlenecks in your code

9. Unsupervised Learning

10. Other Forms of Learning

10.1. Metric Learning

algorithmic choices : Euclidean vs Manhattan for distance, for instance - can affect performance
engineering metrics based on custom needs is also possible
- weighing dimension differently for instance is a basic approach
do keep in mind that distance based metrics need to satisfy certain rules though:
1. non-negativity : d(a,b) >= 0
2. triangle-inequality : d(a,b) <= d(a,c) + d(c,b)
3. symmetry : d(a,b) == d(b,a)
read more here: https://en.wikipedia.org/wiki/Similarity_learning#Metric_learning

10.2. Learning to Rank

Supervised Learning problem
application : search engine results
given a dataset of documents and rank labels, goal is to model a ranking function that produces ranks, given the document
3 basic approaches to solve this problem:
1. point wise
2. pair wise
3. list wise
read more here: https://en.wikipedia.org/wiki/Learning_to_rank

10.3. Learning to Recommend

application: building recommendation systems
- two basic approaches:
  1. content based filtering
    - based on a single users consumption
  2. collaborative filtering
    - takes into account multiple users based on common likes, dislikes when recommending something
read more here: https://en.wikipedia.org/wiki/Recommender_system

10.4. Self-Supervised Learning

check out textual feature representation
read more here : https://en.wikipedia.org/wiki/Self-supervised_learning

11. Conclusions

Nodes that weren't covered in the book and noted as worth exploring by the author later on: