[Machine Learning] Study Notes 1: Basic Concepts
zerc

Hung-yi Lee, Machine Learning (2021 Spring)

What Is Machine Learning

Machine Learning ≈ looking for a function

What machine learning can do:

  • Speech Recognition

  • Image Recognition

  • Playing Go (AlphaGo)

Types:

  • Regression: the function outputs a scalar

  • Classification: given options (classes), the function outputs the correct one

    e.g. AlphaGo

  • Structured Learning: create something with structure (an image, a document)

How to Find Such a Function

Example: predicting the number of views of a YouTube channel.

1. Function with Unknown Parameters

$y = b + wx_1$: the Model;

$x_1$: feature;

$w$: weight;

$b$: bias;

2. Define Loss from Training Data

$L(b, w)$: the loss is a function of the parameters

Loss: how good a set of values is.

$L=\frac{1}{N}\sum_n e_n$

MAE (mean absolute error): $e=|y-\hat y|$

MSE (mean square error): $e=(y-\hat y)^2$
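The two error measures above can be checked with a quick sketch; all numbers here are made up for illustration:

```python
# Made-up labels (ŷ) and model outputs (y), purely for illustration.
y_hat = [5.0, 4.9, 6.4]   # labels ŷ
y     = [4.8, 5.1, 6.0]   # model outputs y
N = len(y)

# L = (1/N) Σ e_n, with e_n = |y - ŷ| for MAE and e_n = (y - ŷ)² for MSE
mae = sum(abs(a - b) for a, b in zip(y, y_hat)) / N
mse = sum((a - b) ** 2 for a, b in zip(y, y_hat)) / N
print(mae, mse)
```

MSE punishes large errors more than MAE because the error is squared before averaging.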

Cross-entropy: if $y$ and $\hat y$ are both probability distributions

Error Surface: a contour plot of the loss computed for different values of $b$ and $w$.

3. Optimization

$w^*, b^* = \arg\min_{w, b} L$ ($\arg\min$: the values of the variables that minimize the expression)

Gradient Descent:

Assume there is only one unknown parameter $w$.

Randomly pick an initial value $w_0$

Compute $\frac{\partial L}{\partial w}\big|_{w=w_0}$

  • Negative => increase $w$

  • Positive => decrease $w$

    Step size: $\eta\frac{\partial L}{\partial w}\big|_{w=w_0}$

    $\eta$: learning rate, a hyperparameter

    $w_1\leftarrow w_0-\eta\frac{\partial L}{\partial w}\big|_{w=w_0}$

Update $w$ iteratively
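The update loop can be sketched as follows, on a toy loss of my own choosing (not from the lecture):

```python
# Minimal 1-D gradient descent: L(w) = (w - 3)², so dL/dw = 2(w - 3)
# and the minimum sits at w = 3.
def dL_dw(w):
    return 2 * (w - 3)

eta = 0.1        # learning rate η (hyperparameter)
w = 0.0          # randomly picked initial value w_0
for _ in range(100):
    w = w - eta * dL_dw(w)   # w_{t+1} ← w_t − η dL/dw

print(w)  # ≈ 3.0
```

With the derivative negative (w left of the minimum) the update increases $w$, and vice versa, exactly as the two bullet cases above describe.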

When to stop?

  • After a preset number of updates/epochs (a hyperparameter)

  • Ideally: when the derivative reaches 0

Problems with gradient descent:

Local minima vs. global minima

Are local minima actually a fake problem?

Generalizing to two parameters

Randomly pick initial values $w_0$, $b_0$

Compute:

$\frac{\partial L}{\partial w}\big|_{w=w_0, b=b_0}$

$\frac{\partial L}{\partial b}\big|_{w=w_0, b=b_0}$

$w_1\leftarrow w_0-\eta\frac{\partial L}{\partial w}\big|_{w=w_0, b=b_0}$

$b_1\leftarrow b_0-\eta\frac{\partial L}{\partial b}\big|_{w=w_0, b=b_0}$

Update $w$ and $b$ iteratively
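The two-parameter updates above can be run on made-up data: the model $y=b+wx_1$ with the MSE loss, data generated from $w=2$, $b=1$ (so the optimum is known in advance):

```python
# Two-parameter gradient descent on toy data (made up; y = 2x + 1).
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]   # labels ŷ
N = len(xs)

w, b = 0.0, 0.0             # initial values w_0, b_0
eta = 0.05                  # learning rate η
for _ in range(5000):
    # ∂L/∂w and ∂L/∂b for L = (1/N) Σ (b + w·x − ŷ)²
    dw = sum(2 * (b + w * x - y) * x for x, y in zip(xs, ys)) / N
    db = sum(2 * (b + w * x - y) for x, y in zip(xs, ys)) / N
    w, b = w - eta * dw, b - eta * db   # update both parameters together

print(w, b)  # ≈ 2.0, 1.0
```

Both partial derivatives are evaluated at the *same* current point before either parameter moves, matching the two update lines above.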

New model: $y=b+\sum_{j=1}^{7}w_jx_j$, taking the previous 7 days of data into account

Step 1. Model (Function with Unknown Parameters)

Linear models have a severe limitation (model bias), so we need a more flexible model.

All piecewise linear curves = constant + a sum of a set of ReLUs

Curves beyond piecewise linear: sample points on the curve to approximate it with a piecewise linear curve, which again = constant + a sum of a set of ReLUs

Sigmoid Function

$y=c\,\frac{1}{1+e^{-(b+wx_1)}}=c\ \operatorname{sigmoid}(b+wx_1)$

More flexible function

$y=b+\sum_i c_i\ \operatorname{sigmoid}(b_i+w_i x_1)$

$y=b+\sum_i c_i\ \operatorname{sigmoid}(b_i+\sum_j w_{ij}x_j)$

$\boldsymbol{r} = \boldsymbol{b}+W\boldsymbol{x}$

$\boldsymbol a=\sigma(\boldsymbol r)$

$y=b+\boldsymbol c^T \boldsymbol a$

$y=b+\boldsymbol c^T \sigma(\boldsymbol{b}+W\boldsymbol{x})$
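The vector form above can be sketched step by step. All numbers here are made up for illustration (3 sigmoid units, 2 input features):

```python
import math

def sigmoid(r):
    return 1 / (1 + math.exp(-r))

x = [1.0, 2.0]                         # feature vector x
W = [[0.5, -1.0],                      # W[i][j]: weight from feature j to unit i
     [1.0,  0.3],
     [-0.2, 0.8]]
b_vec = [0.1, -0.5, 0.2]               # per-unit biases b_i
c = [1.0, -1.0, 0.5]                   # output weights c_i
b = 0.3                                # final scalar bias

# r = b_vec + W x, then a = σ(r), then y = b + cᵀ a
r = [b_vec[i] + sum(W[i][j] * x[j] for j in range(len(x))) for i in range(3)]
a = [sigmoid(ri) for ri in r]
y = b + sum(c[i] * a[i] for i in range(3))
print(y)
```

Note the two different biases: the vector $\boldsymbol b$ inside the sigmoids and the scalar $b$ added at the output.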

$\boldsymbol x$: feature

$W, \boldsymbol{b}, \boldsymbol c^T, b$: unknown parameters, collected into $\boldsymbol \theta$

Step 2. Define loss from training data

$L(\boldsymbol \theta)$

Step 3. Optimization

$\boldsymbol \theta^*=\arg\min_{\boldsymbol \theta}L$

Randomly pick initial values $\boldsymbol \theta^0$

Compute gradient:

$\boldsymbol g= \begin{bmatrix} \frac{\partial L}{\partial \theta_1}\big|_{\boldsymbol \theta=\boldsymbol \theta^0}\\\\ \frac{\partial L}{\partial \theta_2}\big|_{\boldsymbol \theta=\boldsymbol \theta^0}\\\\ \vdots \end{bmatrix} =\nabla L(\boldsymbol\theta^0)$

$\boldsymbol\theta^1\leftarrow \boldsymbol\theta^0-\eta\boldsymbol g$

Update $\boldsymbol\theta$ iteratively

Mini-batch Gradient Descent

Randomly split the $N$ training examples into batches; take one batch, compute $L^1$ on it, then update $\boldsymbol\theta^1\leftarrow\boldsymbol\theta^0-\eta\boldsymbol g$, where $\boldsymbol g=\nabla L^1(\boldsymbol\theta^0)$

Epoch: one pass through all the batches

Update: each single parameter update
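The epoch/update distinction can be sketched on a toy one-parameter model $y=wx$ with made-up data; batch size, learning rate, and iteration count are my choices:

```python
import random

# Made-up data generated with w = 2, so the optimum is known.
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.0, 6.0, 8.0, 10.0, 12.0, 14.0, 16.0]
data = list(zip(xs, ys))

w, eta, batch_size = 0.0, 0.01, 2
updates = 0
for epoch in range(200):                 # one epoch = all batches seen once
    random.shuffle(data)                 # re-split randomly each epoch
    for i in range(0, len(data), batch_size):
        batch = data[i:i + batch_size]
        # gradient of L^k = (1/|batch|) Σ (w·x − ŷ)² on this batch only
        g = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
        w -= eta * g                     # one update per batch
        updates += 1

print(w, updates)  # w ≈ 2.0; 200 epochs × 4 batches = 800 updates
```

With 8 examples and batch size 2, each epoch contains 4 updates: this is exactly the epoch-vs-update distinction above.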

Hard Sigmoid <== 2 × Rectified Linear Unit (ReLU)

ReLU: $c\max(0, b+wx_1)$

$y=b+\sum_{2i}c_{i}\ \max(0, b_i+\sum_j w_{ij}x_j)$

Activation functions: Sigmoid and ReLU
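A sketch of how two ReLUs compose one Hard Sigmoid; the breakpoints (0 and 1) and the unit slope are my choices, not from the lecture:

```python
def relu(z):
    return max(0.0, z)

# relu(z) − relu(z − 1) is flat at 0 for z ≤ 0, rises with slope 1
# on 0 ≤ z ≤ 1, and saturates at 1 for z ≥ 1: a hard sigmoid.
def hard_sigmoid(z):
    return relu(z) - relu(z - 1.0)

for z in [-1.0, 0.0, 0.5, 1.0, 2.0]:
    print(z, hard_sigmoid(z))
# -1.0 → 0.0,  0.5 → 0.5,  2.0 → 1.0
```

Subtracting the second, shifted ReLU is what cancels the first ReLU's slope and creates the flat saturated region.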

Neural Network

Deep Learning: Many layers means deep

Overfitting: better on training data, worse on unseen data

Backpropagation

To compute the gradients efficiently.

Chain Rule

Case 1:

$y=g(x),\ z=h(y)$

$\Delta x\to\Delta y\to \Delta z$

$\frac{\mathrm d z}{\mathrm d x} = \frac{\mathrm d z}{\mathrm d y}\frac{\mathrm d y}{\mathrm d x}$

Case 2:

$x=g(s),\ y=h(s),\ z=k(x,y)$

$\Delta s\to \begin{matrix} \Delta x\\\\ \Delta y \end{matrix} \to \Delta z$

$\frac{\mathrm d z}{\mathrm d s} = \frac{\partial z}{\partial x}\frac{\mathrm d x}{\mathrm d s} + \frac{\partial z}{\partial y}\frac{\mathrm d y}{\mathrm d s}$
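Case 2 can be verified numerically with a toy choice of $g$, $h$, $k$ (my own, for illustration): $x=s^2$, $y=3s$, $z=xy$, so $\frac{\mathrm dz}{\mathrm ds}=y\cdot 2s+x\cdot 3=9s^2$.

```python
s = 2.0
x, y = s**2, 3*s
# (∂z/∂x)(dx/ds) + (∂z/∂y)(dy/ds) = y·2s + x·3
analytic = y * 2*s + x * 3

# Compare against a central finite difference of z(s) = 3s³.
eps = 1e-6
def z_of(s):
    return (s**2) * (3*s)
numeric = (z_of(s + eps) - z_of(s - eps)) / (2 * eps)
print(analytic, numeric)  # both ≈ 36.0
```

Both branches ($\Delta x$ and $\Delta y$) contribute to $\Delta z$, which is why the two products are summed.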

$L(\theta)=\sum_{n=1}^{N}C^n(\theta)\ \Rightarrow\ \frac{\partial L(\theta)}{\partial w}=\sum_{n=1}^{N}\frac{\partial C^n(\theta)}{\partial w}$

Take the first neuron: $z=w_1x_1+w_2x_2+b$

$\frac{\partial C}{\partial w} = \frac{\partial z}{\partial w} \frac{\partial C}{\partial z}$

Compute $\partial z/\partial w$:

$\frac{\partial z}{\partial w_1} = x_1,\quad \frac{\partial z}{\partial w_2} = x_2$
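The derivative $\partial z/\partial w_1 = x_1$ can be checked with a finite difference; all numbers below are made up:

```python
x1, x2 = 2.0, 3.0
w1_0, w2, b = 0.5, -1.0, 0.1

# z = w1·x1 + w2·x2 + b, viewed as a function of w1 alone
def z(w1):
    return w1 * x1 + w2 * x2 + b

# Central finite difference around w1_0
eps = 1e-6
dz_dw1 = (z(w1_0 + eps) - z(w1_0 - eps)) / (2 * eps)
print(dz_dw1)  # ≈ 2.0, i.e. exactly the input x1
```

Since $z$ is linear in $w_1$, the derivative is simply the input value that $w_1$ multiplies, which is what makes this part of backpropagation cheap.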

(I couldn't follow from here on…)
