[Machine Learning] Study Notes 1: Basic Concepts
zerc

Hung-yi Lee, Machine Learning (2021 Spring)

What Is Machine Learning

Machine Learning ≈ looking for a function

What machine learning can do:

  • Speech Recognition

  • Image Recognition

  • Playing Go (AlphaGo)

Types:

  • Regression: the function outputs a scalar

  • Classification: given options (classes), the function outputs the correct one

    e.g. AlphaGo

  • Structured Learning: create something with structure (an image, a document)

How to Find Such a Function

Example: predicting the number of views of a YouTube channel.

1. Function with Unknown Parameters

$y = b + wx_1$: the Model;

$x_1$: feature;

$w$: weight;

$b$: bias;

2. Define Loss from Training Data

$L(b, w)$: the loss is a function of the parameters

Loss: how good a set of values is.

$L=\frac{1}{N}\sum_n e_n$

MAE (mean absolute error): $e=|y-\hat y|$

MSE (mean square error): $e=(y-\hat y)^2$
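The two error measures above can be checked with a quick sketch; all numbers here are made up for illustration:

```python
# Made-up labels (ŷ) and model outputs (y), purely for illustration.
y_hat = [5.0, 4.9, 6.4]   # labels ŷ
y     = [4.8, 5.1, 6.0]   # model outputs y
N = len(y)

# L = (1/N) Σ e_n, with e_n = |y - ŷ| for MAE and e_n = (y - ŷ)² for MSE
mae = sum(abs(a - b) for a, b in zip(y, y_hat)) / N
mse = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / N
print(mae, mse)
```

MSE punishes large errors more than MAE because the error is squared before averaging.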

Cross-entropy: if $y$ and $\hat y$ are both probability distributions

Error Surface: a contour plot of the loss computed for different values of $b$ and $w$.

3. Optimization

$w^*, b^* = \arg\min_{w, b} L$ ($\arg\min$: the values of the variables that minimize the expression)

Gradient Descent:

Assume there is only one unknown parameter $w$.

Randomly pick an initial value $w_0$

Compute $\frac{\partial L}{\partial w}\big|_{w=w_0}$

  • Negative => increase $w$

  • Positive => decrease $w$

    Step size: $\eta\frac{\partial L}{\partial w}\big|_{w=w_0}$

    $\eta$: learning rate, a hyperparameter

    $w_1\leftarrow w_0-\eta\frac{\partial L}{\partial w}\big|_{w=w_0}$

Update $w$ iteratively
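The update loop can be sketched as follows, on a toy loss of my own choosing (not from the lecture):

```python
# Minimal 1-D gradient descent: L(w) = (w - 3)², so dL/dw = 2(w - 3)
# and the minimum sits at w = 3.
def dL_dw(w):
    return 2 * (w - 3)

eta = 0.1        # learning rate η (hyperparameter)
w = 0.0          # randomly picked initial value w_0
for _ in range(100):
    w = w - eta * dL_dw(w)   # w_{t+1} ← w_t − η dL/dw

print(w)  # ≈ 3.0
```

With the derivative negative (w left of the minimum) the update increases $w$, and vice versa, exactly as the two bullet cases above describe.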

When to stop?

  • After a preset number of updates/epochs (a hyperparameter)

  • Ideally: when the derivative reaches 0

Problems with gradient descent:

Local minima vs. global minima

Are local minima actually a fake problem?

Generalizing to two parameters

Randomly pick initial values $w_0$, $b_0$

Compute:

$\frac{\partial L}{\partial w}\big|_{w=w_0, b=b_0}$

$\frac{\partial L}{\partial b}\big|_{w=w_0, b=b_0}$

$w_1\leftarrow w_0-\eta\frac{\partial L}{\partial w}\big|_{w=w_0, b=b_0}$

$b_1\leftarrow b_0-\eta\frac{\partial L}{\partial b}\big|_{w=w_0, b=b_0}$

Update $w$ and $b$ iteratively
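The two-parameter updates above can be run on made-up data: the model $y=b+wx_1$ with the MSE loss, data generated from $w=2$, $b=1$ (so the optimum is known in advance):

```python
# Two-parameter gradient descent on toy data (made up; y = 2x + 1).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # labels ŷ
N = len(xs)

w, b = 0.0, 0.0             # initial values w_0, b_0
eta = 0.05                  # learning rate η
for _ in range(5000):
    # ∂L/∂w and ∂L/∂b for L = (1/N) Σ (b + w·x − ŷ)²
    dw = sum(2 * (b + w * x - y) * x for x, y in zip(xs, ys)) / N
    db = sum(2 * (b + w * x - y) for x, y in zip(xs, ys)) / N
    w, b = w - eta * dw, b - eta * db   # update both parameters together

print(w, b)  # ≈ 2.0, 1.0
```

Both partial derivatives are evaluated at the *same* current point before either parameter moves, matching the two update lines above.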

New model: $y=b+\sum_{j=1}^{7}w_jx_j$, taking the previous 7 days of data into account

Step 1. Model (Function with Unknown Parameters)

Linear models have a severe limitation (model bias), so we need a more flexible model.

All piecewise linear curves = constant + a sum of a set of ReLUs

Curves beyond piecewise linear: sample points on the curve to approximate it with a piecewise linear curve, which again = constant + a sum of a set of ReLUs

Sigmoid Function

$y=c\,\frac{1}{1+e^{-(b+wx_1)}}=c\ \operatorname{sigmoid}(b+wx_1)$

More flexible function

$y=b+\sum_i c_i\ \operatorname{sigmoid}(b_i+w_i x_1)$

$y=b+\sum_i c_i\ \operatorname{sigmoid}(b_i+\sum_j w_{ij}x_j)$

$\boldsymbol{r} = \boldsymbol{b}+W\boldsymbol{x}$

$\boldsymbol a=\sigma(\boldsymbol r)$

$y=b+\boldsymbol c^T \boldsymbol a$

$y=b+\boldsymbol c^T \sigma(\boldsymbol{b}+W\boldsymbol{x})$
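The vector form above can be sketched step by step. All numbers here are made up for illustration (3 sigmoid units, 2 input features):

```python
import math

def sigmoid(r):
    return 1 / (1 + math.exp(-r))

x = [1.0, 2.0]                         # feature vector x
W = [[0.5, -1.0],                      # W[i][j]: weight from feature j to unit i
     [1.0,  0.3],
     [-0.2, 0.8]]
b_vec = [0.1, -0.5, 0.2]               # per-unit biases b_i
c = [1.0, -1.0, 0.5]                   # output weights c_i
b = 0.3                                # final scalar bias

# r = b_vec + W x, then a = σ(r), then y = b + cᵀ a
r = [b_vec[i] + sum(W[i][j] * x[j] for j in range(len(x))) for i in range(3)]
a = [sigmoid(ri) for ri in r]
y = b + sum(c[i] * a[i] for i in range(3))
print(y)
```

Note the two different biases: the vector $\boldsymbol b$ inside the sigmoids and the scalar $b$ added at the output.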

$\boldsymbol x$: feature

$W, \boldsymbol{b}, \boldsymbol c^T, b$: unknown parameters, collected into $\boldsymbol \theta$

Step 2. Define loss from training data

$L(\boldsymbol \theta)$

Step 3. Optimization

$\boldsymbol \theta^*=\arg\min_{\boldsymbol \theta}L$

Randomly pick initial values $\boldsymbol \theta^0$

Compute gradient:

$\boldsymbol g= \begin{bmatrix} \frac{\partial L}{\partial \theta_1}\big|_{\boldsymbol \theta=\boldsymbol \theta^0}\\\\ \frac{\partial L}{\partial \theta_2}\big|_{\boldsymbol \theta=\boldsymbol \theta^0}\\\\ \vdots \end{bmatrix} =\nabla L(\boldsymbol\theta^0)$

$\boldsymbol\theta^1\leftarrow \boldsymbol\theta^0-\eta\boldsymbol g$

Update $\boldsymbol\theta$ iteratively

Mini-batch Gradient Descent

Randomly split the $N$ training examples into batches; take one batch, compute $L^1$ on it, then update $\boldsymbol\theta^1\leftarrow\boldsymbol\theta^0-\eta\boldsymbol g$, where $\boldsymbol g=\nabla L^1(\boldsymbol\theta^0)$

Epoch: one pass through all the batches

Update: each single parameter update
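The epoch/update distinction can be sketched on a toy one-parameter model $y=wx$ with made-up data; batch size, learning rate, and iteration count are my choices:

```python
import random

# Made-up data generated with w = 2, so the optimum is known.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
data = list(zip(xs, ys))

w, eta, batch_size = 0.0, 0.01, 2
updates = 0
for epoch in range(200):                 # one epoch = all batches seen once
    random.shuffle(data)                 # re-split randomly each epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # gradient of L^k = (1/|batch|) Σ (w·x − ŷ)² on this batch only
        g = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= eta * g                     # one update per batch
        updates += 1

print(w, updates)  # w ≈ 2.0; 200 epochs × 4 batches = 800 updates
```

With 8 examples and batch size 2, each epoch contains 4 updates: this is exactly the epoch-vs-update distinction above.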

Hard Sigmoid <== 2 × Rectified Linear Unit (ReLU)

ReLU: $c\max(0, b+wx_1)$

$y=b+\sum_{2i}c_{i}\ \max(0, b_i+\sum_j w_{ij}x_j)$

Activation functions: Sigmoid and ReLU
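A sketch of how two ReLUs compose one Hard Sigmoid; the breakpoints (0 and 1) and the unit slope are my choices, not from the lecture:

```python
def relu(z):
    return max(0.0, z)

# relu(z) − relu(z − 1) is flat at 0 for z ≤ 0, rises with slope 1
# on 0 ≤ z ≤ 1, and saturates at 1 for z ≥ 1: a hard sigmoid.
def hard_sigmoid(z):
    return relu(z) - relu(z - 1.0)

for z in [-1.0, 0.0, 0.5, 1.0, 2.0]:
    print(z, hard_sigmoid(z))
# -1.0 → 0.0,  0.5 → 0.5,  2.0 → 1.0
```

Subtracting the second, shifted ReLU is what cancels the first ReLU's slope and creates the flat saturated region.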

Neural Network

Deep Learning: Many layers means deep

Overfitting: better on training data, worse on unseen data

Backpropagation

To compute the gradients efficiently.

Chain Rule

Case 1:

$y=g(x),\ z=h(y)$

$\Delta x\to\Delta y\to \Delta z$

$\frac{\mathrm d z}{\mathrm d x} = \frac{\mathrm d z}{\mathrm d y}\frac{\mathrm d y}{\mathrm d x}$

Case 2:

$x=g(s),\ y=h(s),\ z=k(x,y)$

$\Delta s\to \begin{matrix} \Delta x\\\\ \Delta y \end{matrix} \to \Delta z$

$\frac{\mathrm d z}{\mathrm d s} = \frac{\partial z}{\partial x}\frac{\mathrm d x}{\mathrm d s} + \frac{\partial z}{\partial y}\frac{\mathrm d y}{\mathrm d s}$
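Case 2 can be verified numerically with a toy choice of $g$, $h$, $k$ (my own, for illustration): $x=s^2$, $y=3s$, $z=xy$, so $\frac{\mathrm dz}{\mathrm ds}=y\cdot 2s+x\cdot 3=9s^2$.

```python
s = 2.0
x, y = s**2, 3*s
# (∂z/∂x)(dx/ds) + (∂z/∂y)(dy/ds) = y·2s + x·3
analytic = y * 2*s + x * 3

# Compare against a central finite difference of z(s) = 3s³.
eps = 1e-6
def z_of(s):
    return (s**2) * (3*s)
numeric = (z_of(s + eps) - z_of(s - eps)) / (2 * eps)
print(analytic, numeric)  # both ≈ 36.0
```

Both branches ($\Delta x$ and $\Delta y$) contribute to $\Delta z$, which is why the two products are summed.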

$L(\theta)=\sum_{n=1}^{N}C^n(\theta)\ \Rightarrow\ \frac{\partial L(\theta)}{\partial w}=\sum_{n=1}^{N}\frac{\partial C^n(\theta)}{\partial w}$

Take the first neuron: $z=w_1x_1+w_2x_2+b$

$\frac{\partial C}{\partial w} = \frac{\partial z}{\partial w} \frac{\partial C}{\partial z}$

Compute $\partial z/\partial w$:

$\frac{\partial z}{\partial w_1} = x_1,\quad \frac{\partial z}{\partial w_2} = x_2$
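The derivative $\partial z/\partial w_1 = x_1$ can be checked with a finite difference; all numbers below are made up:

```python
x1, x2 = 2.0, 3.0
w1_0, w2, b = 0.5, -1.0, 0.1

# z = w1·x1 + w2·x2 + b, viewed as a function of w1 alone
def z(w1):
    return w1 * x1 + w2 * x2 + b

# Central finite difference around w1_0
eps = 1e-6
dz_dw1 = (z(w1_0 + eps) - z(w1_0 - eps)) / (2 * eps)
print(dz_dw1)  # ≈ 2.0, i.e. exactly the input x1
```

Since $z$ is linear in $w_1$, the derivative is simply the input value that $w_1$ multiplies, which is what makes this part of backpropagation cheap.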

(I couldn't follow from here on…)
