the100pagemlbook

Table of Contents

The parent reference sentinel for this book is rooted as The Hundred Page Machine Learning Book under the major machine learning node. I populate this node with the intention to index into other major nodes of the field and fill in some holes that are generic and require a book for an end to end coverage cause tending to them in a non-project oriented scenario isn't worth the time - this isn't a book summary.

1. Introduction

1.1. What is ML?

  • subdomain of Computer Science
    • building algorithms that rely on a collection of instances about a phenomenon to solve tasks related to it.
    • these instances can be sourced in multiple ways:
      • from nature
      • from another algorithm
      • synthesized according to human behaviour
  • a very basic breakdown for the process of machine learning is broken down as :
    1. Gather a dataset
    2. algorithmically build a statistical model based on that dataset

1.3. PAC learning

  • on learnability and understanding relationship between model error, training set size, model building time and the mathematical formulation of the model
  • this is an index; explored in depth in a dedicated node

2. Notations and Definitions

  • well, I won't be focusing on the notation cause I'm proceeding with symbolic representations via s-expressions
  • still indexing the important definitions though

2.1. Scalars, Vectors and Sets

2.2. Math as S-expressions

  • I won't be using mathmatical notation in these notes cause of the reasons listed here : NIL
  • I find it convenient to just build up an abstraction when needed

2.3. Functions

  • mathematically speaking : a relation that maps elements from the domain set to the codomain set (aka range), with some constraints on the nature of this map.
  • manipulation of the various parameters in the above dfeinition (types of the sets, nature of the mapping, etc) gives rise to different classes of functions.

2.4. Max and Arg Max

  • The former refers to the maximum element in a set
    • this requires the notion of comparison to be defined over that set
  • The latter refers to an optimal element in a set that pushes the value of a function to its optimal value (max)
  • Note that both are functions and have relevant counterparts as min and arg min

2.5. Derivatives and Gradient

2.6. Random Variables

  • helps capture the notion of probability into mathematical expressions
    • can be discrete and continuous
  • Probability is a large epistemological root in itself and will be explored independently
  • a few important terms relevant to a random variable that will be referred to going forward:
    • probability mass distribution, probability density
    • expectation
    • variance, covariance
  • these will be conceptualized programmatically going forward

2.7. Unbiased Estimator

  • if the expectation of the estimator random variable is equal to the statistic being estimated, it's termed as an unbiased estimator of that statistic.
  • for instance, sample mean is an unbiased estimator of the mean statistic of a random variable

2.8. Bayes' Rule (aka Bayes' Theorem)

  • denotes equivalence of the probability of a joint event broken down into its conditionals and respective singulars
(defmacro P (event)
  ...) ; probabilty of a given event

(defmacro and (X-sample Y-sample)
  ...) ;conceptualizes the event (X-sample and Y-sample)
(defmacro given (X-sample Y-sample)
  ...) ; conceptualizes the event (X-sample, given Y-sample)

(defmacro sample (rand-var instance)
  ...) ; denotes the event that instance was sampled from rand-var

(defmacro declare-randvar (tag ...)
  ...) ; declare a random variable with associated info

;;Bayes' rule then is:

(declare-randvar X ...)
(declare-randvar Y ...)

(let ((X=x (sample X x))
      (Y=y (sample Y y)))
  (assert-mathematical-equal
   (P (and X=x
           Y=y))
   (* (P (given X=x
                Y=y))
      (P Y=y))
   (* (P (given Y=y
                X=x))
      (P X=x))))

2.9. Parameter Estimation

  • to be explored in depth in the main node

2.11. Model-Based vs Instance Based Learning

  • Model Based : use training data to build a model with learned parameters - see Support Vector Machine
  • Instance Based : use the whole dataset as a model (no intermediate parameters) - see k-Nearest Neighbors

2.12. Shallow vs Deep Learning

  • Shallow learning : model parameters are learned directly from features of training samples
  • Deep Learning uses optimization techniques for a loss function and it's more of a black box than the shallow learning alternative

3. Fundamental Algorithms

4. Anatomy of a Learning Algorithm

Any learning algorithm is centered around certain basics:

  • A loss function
  • an optimization ..
    • criterion, inspired from the loss function
    • routine, that finds a solution to the optimization criterion

5. Basic Practice

5.3. Three sets

  • training set
  • validation set
  • testing set

Keeping the validation and test set the same size is the common practice. A usual split for a traditional situation would be 70%:15%:15%. With big data though, it's reasonable to go for 95%:2.5%:2.5%.

briefly speaking, the individual purposes are as follows:

  • fitting the algorithm on the training set
  • checking performance intermittently on the validation set and making considerable hyperparametric choices like which algorithm to use and tuning the particular algorithm.
  • evaluating performance on the test set.

5.4. Underfitting and Overfitting

5.4.1. Bias (Underfitting)

Bias refers to the error introduced by approximating a real-world problem, which may be complex, by a simplified model. A model with high bias pays little attention to the training data and oversimplifies the problem. This leads to underfitting, where the model doesn't capture the underlying patterns in the data and performs poorly on both the training and unseen data. In simpler terms, a high-bias model is too simple to represent the data.

5.4.2. Variance (Overfitting)

Variance, on the other hand, refers to the error introduced due to the model's sensitivity to small fluctuations or noise in the training data. A model with high variance is overly complex and captures noise in the training data rather than the underlying patterns. This leads to overfitting, where the model performs very well on the training data but poorly on unseen data because it has essentially memorized the training data and doesn't generalize well.

5.4.3. Summary

  • High bias (underfitting) results from a model that is too simple and doesn't fit the data well.
  • High variance (overfitting) results from a model that is too complex and fits the training data too closely.

The goal in machine learning is to strike a balance between bias and variance, finding a model that is complex enough to capture the essential patterns in the data but not so complex that it overfits and fails to generalize to new, unseen data. This balance is crucial for building models that perform well in real-world scenarios. Techniques like cross-validation and regularization are often used to help find this balance.

5.6. Model Performance assesment

already covered before in Classification Evaluation Metrics For regression, using a loss function is the most basic choice to rate performance. To compare the metric obtained, we need a ..

5.6.1. Baseline

  • to compare any metric, consider building a baseline model for the task at hand
    • for regression this can be as simple as building a mean-model, for instance (one that predicts the average of all, for all)
    • for classification, you could predict the mode or build a simple model.

5.7. Hyperparameter Tuning

  • if choices are low, perform a grid search over a validation set.
    • for numerical hyperparameters with large ranges, perform multiple searches:
      1. first with a large span using a logarithmic scale
      2. then use linear scales when you get an idea about what ranges are worth paying attention to

This can be time consuming and other efficient techniques like random search and Bayesian Hyperparameter Optimization exist.

5.8. Cross validation

  • when the dataset isn't large enough to carve form a separate validation set, consider k-fold cross validation:
    1. partition the dataset into k folds
    2. at each fitting iteration:
      • fit on k-1 folds
      • validate on one
      • change the validation fold for the next iteration
    3. for reporting the final validation metric, use an average of the k computations

Finally, once fit on all these fold combinations, the model performance is ready to be evaluated on the test set.

7. Problems and Solutions

7.5. Learning to label sequences

7.6. Sequence to Sequence (seq2seq) Learning

8. Advanced Practice

8.1. Handling Imbalanced Datasets

  • most practical datasets might under-represent some classes
  • see Class Imbalance for a basic coverage

8.2. Combining Models

Elaborating further upon the basics from Ensemble Algorithms.

Three basic ways to combine models:

  1. (weighted) averaging
    • naturally works for regression
    • adaptable to classification when working with prediction probabilities
  2. majority vote
    • applicable to classification
    • report the label with most votes
      • different tie breaker strategies possible
  3. stacking
    • building a meta model that takes output of base models as input
    • have to create a new dataset from the outputs of the base models to train the meta-model
    • tuning hyperparameters via cross-validation is recommended.

Every time, do verify if the ensemble model performs better using the validation set.

Do note that one benefits the most from an ensemble approach when using several "uncorrelated" models. Combining different versions of similar approaches (SVMs with several different kernel for instance or versions of decision trees for instance) may not result in a performance boost.

8.3. Training Neural Networks

  • check out SOTA before choosing architecture for a task
  • tuning the networks hyperparameters and sizes (number of layers, etc) based on its behaviour over the validation and training set (check if the network is complex enough, if it overfits and the need to regularize)
    • a stepped approach towards intermittently increasing the size and regularizing to check if the model fits to the training and validation set both (model being complex enough and no overfitting respectively)

8.4. Advanced Regularization

8.5. Handling Multiple Inputs

  • most practical problems demand working with multimodal data.
    • for instance, a pair of image and possible caption being input with the output being an indicator variable verifying if the caption matches the picture
  • Combining Models as stated before in this node can be used to deal with the different modalities.
  • another common approach is to vectorize and concatenate all the modalities to form a longer tensor that can then be normally processed by a model
  • neural networks are more flexible:
    • extract embeddings for the image and text from a CNN and RNN base respectively, concatenate them and proceed with a fully connected approach later on according to our output demands (classification or regression for starters)
  • averaging embeddings is also a possibility if the nature of the inputs is the same, before we proceed towards the final part of the network.

8.6. Handling Multiple Outputs

  • other than the multilabel classification problem, producing multimodal outputs can also be a necessity
  • for instance, one wants to detect if an object exists in an image along with the bounding box for it and also return a label for the same.
    • the overall output for such a case will be an image + a vector of coordinates + a one-hot encoded label
    • to deal with this one can employ an encoder subnetwork that encodes the image into an embedding.
    • this subnetwork can have a ReLU on the last layer (good choice for detecting positive reals) to detect coordinates and this can use the mean squared error between this output and the desired coordinate vector
    • the second subnetwork takes this same embedding as input and predicts probabilities for each label with a final softmax to do so
    • the loss for this can be the usual negative log likelihood (aka cross entropy loss)
    • the final loss function for optimization can be set to a convex combination of the two with the choice to tune the underlying hyperparameter.

8.8. Algorithmic Efficiency

  • use linear algebra optimized operations and avoid loops
  • use the laziest data structure that'll do for the problem (sets instead of lists if the order doesn't matter, for instance)
  • avoid writing custom methods if tested popular libraries exist
  • employ code profiling to find out bottlenecks in your code

10. Other Forms of Learning

10.1. Metric Learning

  • algorithmic choices : Euclidean vs Manhattan for distance, for instance - can affect performance
  • engineering metrics based on custom needs is also possible
    • weighing dimension differently for instance is a basic approach
  • do keep in mind that distance based metrics need to satisfy certain rules though:
    1. non-negativity : d(a,b) >= 0
    2. triangle-inequality : d(a,b) <= d(a,c) + d(c,b)
    3. symmetry : d(a,b) == d(b,a)
  • read more here: https://en.wikipedia.org/wiki/Similarity_learning#Metric_learning

10.2. Learning to Rank

10.3. Learning to Recommend

  • application: building recommendation systems
    • two basic approaches:
      1. content based filtering
        • based on a single users consumption
      2. collaborative filtering
        • takes into account multiple users based on common likes, dislikes when recommending something
  • read more here: https://en.wikipedia.org/wiki/Recommender_system

10.4. Self-Supervised Learning

11. Conclusions

Tags::book:ml:ai: