2  Math and Python Primer


This chapter introduces the key concepts in calculus, linear algebra, probability and statistics, and Python programming that you’ll need to understand how machine learning, neural networks, and generative AI work. Previous familiarity with these concepts is helpful but not expected; the material is presented assuming you are learning them for the first time. You are strongly encouraged to paste unfamiliar terms and concepts into ChatGPT or a similar AI tool for additional, personalized tutoring.

2.1 Calculus Essentials

2.1.1 Why You Need Calculus in ML/AI

Neural networks are functions trained to minimize prediction errors. To train them, we need to compute how changes in weights affect the output — and for that, we use differentiation and the chain rule.

2.1.2 Derivatives

A measure of how a function changes as its input changes.

Notation:

  • \(\frac{dy}{dx}\): derivative of output \(y\) with respect to input \(x\)
  • \(f'(x)\): shorthand for “the derivative of function \(f\) at \(x\)”

Example:

If \(f(x) = x^2\), then \(f'(x) = 2x\)
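
If you ever want to sanity-check a derivative, you can approximate it numerically. The sketch below compares a finite-difference estimate of \(f'(3)\) with the exact value \(2 \times 3 = 6\); the helper name numerical_derivative is just an illustrative choice.

```python
# Approximate f'(x) with a central finite difference and compare to 2x
def f(x):
    return x ** 2

def numerical_derivative(f, x, h=1e-6):
    return (f(x + h) - f(x - h)) / (2 * h)

print(numerical_derivative(f, 3.0))  # ~6.0, matching f'(3) = 2 * 3
```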

2.1.3 Partial Derivatives

A derivative with respect to one variable while keeping others constant.

Notation:

\(\frac{\partial f}{\partial x}\): partial derivative of \(f\) with respect to \(x\)

Used in computing how the loss changes with respect to each model parameter.
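
As a small illustration, take \(f(x, y) = x^2 y\), whose partial derivative with respect to \(x\) is \(2xy\). The sketch below estimates it numerically while holding \(y\) fixed; the function names are illustrative.

```python
# Partial derivative: vary x only, keep y fixed
def f(x, y):
    return x ** 2 * y

def partial_x(f, x, y, h=1e-6):
    return (f(x + h, y) - f(x - h, y)) / (2 * h)

print(partial_x(f, 3.0, 2.0))  # ~12.0, matching 2 * x * y = 2 * 3 * 2
```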

2.1.4 Chain Rule

Used to compute derivatives of composed functions (e.g., layer-by-layer in a neural network).

Formula:

\[ \frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx} \]
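
A quick way to see the chain rule in action: let \(y = \sin(x)\) and \(z = y^2\), so \(\frac{dz}{dx} = 2\sin(x)\cos(x)\). The sketch below (function names are illustrative) checks that this matches a direct numerical derivative of the composed function.

```python
import math

def y_of_x(x):
    return math.sin(x)

def z_of_y(y):
    return y ** 2

x, h = 1.0, 1e-6
dz_dx_numeric = (z_of_y(y_of_x(x + h)) - z_of_y(y_of_x(x - h))) / (2 * h)
dz_dx_chain = 2 * math.sin(x) * math.cos(x)  # dz/dy * dy/dx

print(dz_dx_numeric, dz_dx_chain)  # both ~0.909
```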

2.1.5 Gradient

A vector of all partial derivatives of a function with respect to each input.

Notation:

\(\nabla L = \left[ \frac{\partial L}{\partial w_1}, \frac{\partial L}{\partial w_2}, \ldots \right]\)
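
Here is a minimal NumPy sketch that builds the gradient one partial derivative at a time for the toy loss \(L(\vec{w}) = \sum_i w_i^2\), whose exact gradient is \(2\vec{w}\). The helper name numerical_gradient is illustrative.

```python
import numpy as np

def loss(w):
    return np.sum(w ** 2)

def numerical_gradient(loss, w, h=1e-6):
    grad = np.zeros_like(w)
    for i in range(len(w)):
        w_plus, w_minus = w.copy(), w.copy()
        w_plus[i] += h
        w_minus[i] -= h
        grad[i] = (loss(w_plus) - loss(w_minus)) / (2 * h)
    return grad

w = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(loss, w))  # ~[ 2. -4.  6.], matching 2w
```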

2.1.6 Loss Function

Measures how wrong the model’s prediction is.

Examples:
- Mean Squared Error (MSE)
- Cross-Entropy Loss
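
Both losses are easy to compute directly. The NumPy sketch below uses made-up targets and predictions purely for illustration; the cross-entropy shown is the binary form.

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0])   # targets (illustrative)
y_pred = np.array([0.9, 0.2, 0.7])   # model predictions (illustrative)

mse = np.mean((y_true - y_pred) ** 2)

# Binary cross-entropy: predictions are probabilities of class 1
bce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

print(mse, bce)
```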

2.1.7 Backpropagation

An algorithm that uses the chain rule to compute gradients efficiently in neural networks.
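
In practice you rarely code backpropagation by hand; libraries such as PyTorch apply the chain rule automatically. A minimal sketch of the idea, using a single weight:

```python
import torch

w = torch.tensor(2.0, requires_grad=True)   # trainable weight
x = torch.tensor(3.0)                       # fixed input

y = w * x                # forward pass
loss = (y - 1.0) ** 2    # squared error against a target of 1.0

loss.backward()          # backpropagation: gradients via the chain rule
print(w.grad)            # dL/dw = 2 * (w*x - 1) * x = 30
```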

2.1.8 Gradient Descent

Method used to update weights based on gradients.

Update Rule:

\[ w := w - \eta \cdot \frac{\partial L}{\partial w} \]

Where \(\eta\) is the learning rate.
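
Putting the pieces together, a few lines of plain Python are enough to run gradient descent on the toy loss \(L(w) = (w - 4)^2\), whose gradient is \(2(w - 4)\):

```python
# Gradient descent on L(w) = (w - 4)^2; the minimum is at w = 4
w = 0.0
eta = 0.1  # learning rate

for step in range(50):
    grad = 2 * (w - 4)   # dL/dw
    w = w - eta * grad   # the update rule from above

print(w)  # close to 4.0
```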


2.2 Linear Algebra Essentials

2.2.1 Why You Need Linear Algebra in ML/AI

Neural networks use vectors and matrices to represent data, weights, and transformations.

2.2.2 Vectors

A 1D array of numbers.

\[ \vec{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \]

2.2.3 Matrices

A 2D array of numbers.

\[ A = \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \]

2.2.4 Matrix-Vector Multiplication

If \(A\) is \(m \times n\) and \(\vec{x}\) is \(n \times 1\), then \(A \vec{x}\) is \(m \times 1\).
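
A quick NumPy sketch of the shapes involved (here \(m = 2\), \(n = 3\)):

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])   # shape (2, 3): m x n
x = np.array([1.0, 0.0, -1.0])    # shape (3,):   n entries

y = A @ x                         # shape (2,):   m entries
print(y)  # [-2. -2.]
```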

2.2.5 Dot Product

\(\vec{a} \cdot \vec{b} = \sum_{i=1}^{n} a_i b_i\)

Used in similarity measures and basic model computations.
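
In NumPy the dot product is a one-liner:

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(np.dot(a, b))  # 1*4 + 2*5 + 3*6 = 32
```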

2.2.6 Transpose

Switches rows and columns in a matrix.

\(A^T\)

2.2.7 Identity Matrix

Square matrix with 1s on the diagonal and 0s elsewhere.
Acts like “1” for matrix multiplication.

\(AI = A\)

2.2.8 Matrix Multiplication

Combines transformations.

If \(A\) is \(m \times n\) and \(B\) is \(n \times p\), then \(C = AB\) is \(m \times p\).
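
The NumPy sketch below checks the shape rule and, along the way, illustrates the transpose and identity matrix from the previous subsections:

```python
import numpy as np

A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])        # shape (3, 2): m x n
B = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, 2.0]])   # shape (2, 3): n x p

C = A @ B
print(C.shape)                    # (3, 3): m x p

print(A.T.shape)                  # transpose: (2, 3)

I = np.eye(2)                     # 2x2 identity matrix
print(np.allclose(A @ I, A))      # True, since A I = A
```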

2.2.9 Norms

Measure the size or length of a vector.

L2 norm:

\(\|\vec{v}\|_2 = \sqrt{\sum_{i=1}^{n} v_i^2}\)
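
For example, the vector \([3, 4]\) has L2 norm 5. Computed directly from the formula and with NumPy's built-in norm:

```python
import numpy as np

v = np.array([3.0, 4.0])

print(np.sqrt(np.sum(v ** 2)))   # 5.0, straight from the formula
print(np.linalg.norm(v))         # same result with NumPy's L2 norm
```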


2.3 Probability and Statistics Essentials

2.3.1 Why You Need Probability in ML/AI

Models make predictions under uncertainty. Probability describes this uncertainty and informs how we evaluate and train models.

2.3.2 Random Variables

Represent outcomes of random processes.

Discrete: the number of heads in a series of coin flips
Continuous: a model’s confidence score

2.3.3 Probability Distributions

Discrete: \(P(X = x)\)

Continuous: \(p(x)\)

Examples:
- Bernoulli (binary outcomes)
- Categorical (multi-class)
- Gaussian/Normal (real-valued data)

2.3.4 Expectation (Mean)

Discrete:

\(\mathbb{E}[X] = \sum_x x \cdot P(X = x)\)

Continuous:

\(\mathbb{E}[X] = \int x \cdot p(x) dx\)

2.3.5 Variance and Standard Deviation

\(\mathrm{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2]\)

\(\sigma = \sqrt{\mathrm{Var}(X)}\)
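
A worked example covering the mean, variance, and standard deviation for a fair six-sided die (outcomes 1 through 6, each with probability 1/6):

```python
import numpy as np

x = np.arange(1, 7)      # outcomes 1..6
p = np.full(6, 1 / 6)    # each with probability 1/6

mean = np.sum(x * p)                # E[X] = 3.5
var = np.sum((x - mean) ** 2 * p)   # Var(X) ~ 2.92
std = np.sqrt(var)                  # sigma ~ 1.71

print(mean, var, std)
```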

2.3.6 Conditional Probability

\(P(A \mid B) = \frac{P(A \cap B)}{P(B)}\)

Used in next-token prediction:

\(P(\text{next token} \mid \text{context})\)
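
As a toy illustration of next-token prediction, the sketch below estimates \(P(\text{next token} \mid \text{context})\) from bigram counts; the words and counts are made up purely for illustration.

```python
# Hypothetical bigram counts: (context word, next word) -> count
counts = {("the", "cat"): 10, ("the", "dog"): 30, ("the", "end"): 60}

context = "the"
total = sum(c for (ctx, _), c in counts.items() if ctx == context)

for (ctx, nxt), c in counts.items():
    if ctx == context:
        print(nxt, c / total)   # e.g. P("dog" | "the") = 30 / 100 = 0.3
```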

2.3.7 Bayes’ Theorem

\(P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}\)
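
Plugging in numbers (chosen purely for illustration) makes the formula concrete:

```python
p_A = 0.01           # P(A): prior (illustrative number)
p_B_given_A = 0.9    # P(B | A)
p_B = 0.05           # P(B)

p_A_given_B = p_B_given_A * p_A / p_B
print(p_A_given_B)   # 0.18
```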

2.3.8 Entropy

\(H(X) = -\sum_x P(x) \log P(x)\)

Measures uncertainty in a distribution.
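
A uniform distribution has the highest entropy; a sharply peaked one is close to zero. A small NumPy sketch (using log base 2, so the result is in bits):

```python
import numpy as np

def entropy(p):
    return -np.sum(p * np.log2(p))   # assumes all probabilities are > 0

p_uniform = np.array([0.25, 0.25, 0.25, 0.25])
p_peaked = np.array([0.97, 0.01, 0.01, 0.01])

print(entropy(p_uniform))  # 2.0 bits: maximum uncertainty over 4 outcomes
print(entropy(p_peaked))   # ~0.24 bits: nearly certain
```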

2.3.9 Cross-Entropy Loss

\(\text{Loss} = -\sum_i y_i \log(\hat{y}_i)\)

Used in classification and language modeling.
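
For a single example with a one-hot target, the sum reduces to \(-\log\) of the probability the model assigns to the correct class:

```python
import numpy as np

y = np.array([0.0, 1.0, 0.0])       # one-hot target: true class is index 1
y_hat = np.array([0.1, 0.7, 0.2])   # predicted probabilities (illustrative)

loss = -np.sum(y * np.log(y_hat))
print(loss)  # -log(0.7) ~ 0.357
```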


2.4 Python Essentials for Building a Language Model

2.4.1 Why Python?

Python is widely used in AI due to its readability and powerful libraries like NumPy and PyTorch.

2.4.2 Variables and Types