李宏毅 Machine Learning (2021 Spring)
What is Machine Learning
Machine Learning ≈ Looking for a Function
What machine learning can do:
- Speech Recognition
- Image Recognition
- Playing Go (AlphaGo)
- …
Types:
- Regression: the function outputs a scalar
- Classification: given options (classes), output the correct one
  e.g. AlphaGo
- Structured Learning: create something with structure (an image, a document)
How to find such a function
Example: predicting YouTube view counts.
1. Function with Unknown Parameters
y = b + w·x1: model
x1: feature
w: weight
b: bias
2. Define Loss from Training Data
L(b, w): the loss is a function of the parameters
Loss: how good a set of parameter values is
L = (1/N) Σₙ eₙ
MAE: mean absolute error, e = |y − ŷ|
MSE: mean squared error, e = (y − ŷ)²
Cross-entropy: used when y and ŷ are both probability distributions
Error Surface: a contour plot of the loss computed over different values of b and w
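The two losses above can be sketched in a few lines. This is a minimal illustration for the model y = b + w·x1; the data values are made up.

```python
# A minimal sketch of Step 2: computing MAE and MSE for the model
# y = b + w * x1 over a toy dataset. The data values are hypothetical.

def mae_loss(w, b, xs, ys):
    # L = (1/N) * sum of |y_hat - y| over all training examples
    return sum(abs((b + w * x) - y) for x, y in zip(xs, ys)) / len(xs)

def mse_loss(w, b, xs, ys):
    # L = (1/N) * sum of (y_hat - y)^2 over all training examples
    return sum(((b + w * x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)

xs = [1.0, 2.0, 3.0]   # hypothetical features (e.g. yesterday's views)
ys = [2.0, 4.0, 6.0]   # hypothetical labels (today's views)

print(mae_loss(2.0, 0.0, xs, ys))  # perfect fit -> 0.0
print(mse_loss(1.0, 0.0, xs, ys))
```

Evaluating either loss over a grid of (w, b) pairs is exactly what produces the error surface.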
3. Optimization
w*, b* = arg min_{w,b} L   (argmin: the parameter values at which the expression reaches its minimum)
Gradient Descent:
Assume there is only one unknown parameter w
Randomly pick an initial value w0
Compute ∂L/∂w | w=w0
- Negative => increase w
- Positive => decrease w
Step size: η · ∂L/∂w | w=w0
η: learning rate, a hyperparameter
w1 ← w0 − η · ∂L/∂w | w=w0
Update w iteratively
When do we stop?
Problems with GD:
Local minima vs. global minima
Are local minima actually a fake problem?
Generalizing to two parameters
Randomly pick initial values w0, b0
Compute:
∂L/∂w | w=w0, b=b0
∂L/∂b | w=w0, b=b0
w1 ← w0 − η · ∂L/∂w | w=w0, b=b0
b1 ← b0 − η · ∂L/∂b | w=w0, b=b0
Update w and b iteratively
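The two-parameter procedure above can be sketched directly, using the analytic MSE gradients. The data, initial values, learning rate, and step count here are all made up for illustration.

```python
# A minimal sketch of gradient descent on both parameters of y = b + w*x
# with MSE loss. Data and hyperparameters are hypothetical.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]   # toy data generated from y = 2x + 1

w, b = 0.0, 0.0   # "randomly" picked initial values (fixed here for clarity)
eta = 0.01        # learning rate (hyperparameter)

for step in range(5000):
    n = len(xs)
    # dL/dw = (2/N) * sum((b + w*x - y) * x);  dL/db = (2/N) * sum(b + w*x - y)
    grad_w = 2 / n * sum((b + w * x - y) * x for x, y in zip(xs, ys))
    grad_b = 2 / n * sum((b + w * x - y) for x, y in zip(xs, ys))
    # update w and b iteratively
    w -= eta * grad_w
    b -= eta * grad_b

print(w, b)   # should approach w = 2, b = 1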
New Model: y = b + Σⱼ₌₁..₇ wⱼ·xⱼ, taking the previous 7 days of data into account
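The 7-feature model is just a dot product over a sliding window of past days. The weights and view counts below are made up to show the shape of the computation.

```python
# A sketch of the extended linear model y = b + sum_{j=1..7} w_j * x_j,
# where x_1..x_7 are the view counts of the previous 7 days.
# All numbers here are hypothetical.

def predict(ws, b, past_7_days):
    assert len(ws) == len(past_7_days) == 7
    return b + sum(w * x for w, x in zip(ws, past_7_days))

ws = [0.1, 0.1, 0.0, 0.0, 0.1, 0.3, 0.4]     # made-up weights
views = [480, 510, 500, 495, 520, 530, 550]  # made-up daily view counts
print(predict(ws, 50, views))
```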
Step 1. Model (function with unknown parameters)
Linear models have a severe limitation (model bias), so we need a more flexible model.
All piecewise linear curves = constant + sum of a set of ReLU-shaped pieces
Curves beyond piecewise linear: sample points on the curve and connect them, giving a piecewise linear approximation, which again = constant + sum of such pieces
Sigmoid Function
y = c · 1/(1 + e^−(b + w·x1)) = c · sigmoid(b + w·x1)
More flexible functions:
y = b + Σᵢ cᵢ · sigmoid(bᵢ + wᵢ·x1)
y = b + Σᵢ cᵢ · sigmoid(bᵢ + Σⱼ wᵢⱼ·xⱼ)
In vector form:
r = b + W·x
a = σ(r)
y = b + cᵀ·a
y = b + cᵀ·σ(b + W·x)
x: feature vector
W, b (vector), cᵀ, b (scalar): unknown parameters => collected into θ
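The vector-form model above can be sketched as a plain forward pass; every parameter value here is made up.

```python
import math

# A sketch of the flexible model y = b + c^T * sigmoid(b_vec + W x):
# row i of W gives one bump c_i * sigmoid(b_i + sum_j W_ij * x_j).
# All parameter values below are hypothetical.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(W, b_vec, c, b, x):
    # r = b_vec + W x  ->  a = sigmoid(r)  ->  y = b + c . a
    r = [bi + sum(wij * xj for wij, xj in zip(row, x))
         for row, bi in zip(W, b_vec)]
    a = [sigmoid(ri) for ri in r]
    return b + sum(ci * ai for ci, ai in zip(c, a))

W = [[1.0, -1.0], [0.5, 0.5]]   # 2 sigmoids, 2 features
b_vec = [0.0, -1.0]
c = [2.0, -3.0]
b = 0.5
print(forward(W, b_vec, c, b, [1.0, 2.0]))
```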
Step 2. Define loss from training data
L(θ)
Step 3. Optimization
θ* = arg min_θ L
Randomly pick initial values θ0
Compute the gradient:
g = [∂L/∂θ1 | θ=θ0; ∂L/∂θ2 | θ=θ0; …] = ∇L(θ0)   (a column vector)
θ1 ← θ0 − η·g
Update θ iteratively
Mini-batch Gradient Descent
Randomly split the N training examples into batches; take one batch, compute its loss L1, and update θ1 ← θ0 − η·g, where g = ∇L1(θ0)
…
Epoch: one pass over all the batches
Update: each single parameter update
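The epoch/update loop can be sketched for the simple y = b + w·x model; data, batch size, learning rate, and epoch count are all made up.

```python
import random

# A minimal sketch of mini-batch gradient descent for y = b + w*x with MSE.
# One epoch = one pass over all batches; one "update" = one parameter step.
# Data and hyperparameters are hypothetical.

random.seed(0)                          # fixed seed so the run is reproducible
xs = list(range(1, 21))
ys = [2 * x + 1 for x in xs]            # toy data generated from y = 2x + 1
data = list(zip(xs, ys))

w, b, eta, batch_size = 0.0, 0.0, 0.001, 5

for epoch in range(5000):
    random.shuffle(data)                # randomly re-split into batches
    for start in range(0, len(data), batch_size):
        batch = data[start:start + batch_size]
        # gradient of this batch's loss L_k, not of the total loss L
        gw = 2 / len(batch) * sum((b + w * x - y) * x for x, y in batch)
        gb = 2 / len(batch) * sum((b + w * x - y) for x, y in batch)
        w -= eta * gw                   # one update per batch
        b -= eta * gb

print(w, b)   # should approach w = 2, b = 1
```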
Hard Sigmoid = sum of 2 Rectified Linear Units (ReLU)
ReLU: c·max(0, b + w·x1)
y = b + Σ₂ᵢ cᵢ·max(0, bᵢ + Σⱼ wᵢⱼ·xⱼ)   (twice as many ReLUs as sigmoids)
Activation Functions: sigmoid and ReLU
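The "2 ReLUs = 1 hard sigmoid" identity can be checked directly. The breakpoints (−1 and 1) and the 0-to-1 range here are arbitrary choices for illustration.

```python
# A sketch showing that one hard sigmoid is a combination of two ReLUs,
# which is why the ReLU model needs twice as many terms (the 2i above).
# The breakpoints and slope are hypothetical choices.

def relu(z):
    return max(0.0, z)

def hard_sigmoid(x):
    # flat at 0 for x <= -1, linear ramp in between, flat at 1 for x >= 1
    return 0.5 * relu(x + 1) - 0.5 * relu(x - 1)

for x in [-2.0, -1.0, 0.0, 1.0, 2.0]:
    print(x, hard_sigmoid(x))   # -> 0.0, 0.0, 0.5, 1.0, 1.0
```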
Neural Network
Deep Learning: Many layers means deep
Overfitting: better on the training data, worse on unseen data
Backpropagation
To compute the gradients efficiently.
Chain Rule
Case 1:
y = g(x), z = h(y)
Δx → Δy → Δz
dz/dx = (dz/dy)·(dy/dx)
Case 2:
x = g(s), y = h(s), z = k(x, y)
Δs → Δx, Δy → Δz
dz/ds = (∂z/∂x)·(dx/ds) + (∂z/∂y)·(dy/ds)
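Case 2 can be sanity-checked numerically with a concrete example. The functions below (x = 2s, y = s², z = x·y) are made up for illustration.

```python
# A numeric check of chain-rule Case 2: x = g(s), y = h(s), z = k(x, y).
# Here x = 2s, y = s^2, z = x*y = 2s^3, so dz/ds should equal 6*s^2.

def z_of_s(s):
    x, y = 2 * s, s ** 2
    return x * y

def dz_ds_chain(s):
    x, y = 2 * s, s ** 2
    # dz/dx = y, dx/ds = 2, dz/dy = x, dy/ds = 2s
    return y * 2 + x * (2 * s)

s, h = 1.5, 1e-6
numeric = (z_of_s(s + h) - z_of_s(s - h)) / (2 * h)   # central difference
print(dz_ds_chain(s), numeric)   # both approximately 6 * 1.5**2 = 13.5
```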
L(θ) = Σₙ₌₁..ᴺ Cⁿ(θ)  →  ∂L/∂w = Σₙ₌₁..ᴺ ∂Cⁿ/∂w
Take the first neuron: z = w1·x1 + w2·x2 + b
∂C/∂w = (∂z/∂w)·(∂C/∂z)
Compute ∂z/∂w:
∂z/∂w1 = x1, ∂z/∂w2 = x2
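The forward/backward split for this single neuron can be sketched end to end. The sigmoid activation and squared-error cost here are hypothetical choices just to make ∂C/∂z concrete; the point is that ∂C/∂w factors as (∂z/∂w)·(∂C/∂z) with ∂z/∂w1 = x1.

```python
import math

# A sketch of forward/backward for one neuron z = w1*x1 + w2*x2 + b.
# Activation (sigmoid) and cost (squared error) are made-up choices.

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def grads(w1, w2, b, x1, x2, y_hat):
    z = w1 * x1 + w2 * x2 + b               # forward pass
    a = sigmoid(z)
    C = (a - y_hat) ** 2
    dC_dz = 2 * (a - y_hat) * a * (1 - a)   # backward pass
    # dC/dw = (dz/dw) * (dC/dz), with dz/dw1 = x1 and dz/dw2 = x2
    return C, x1 * dC_dz, x2 * dC_dz

C, g1, g2 = grads(0.5, -0.5, 0.0, 1.0, 2.0, 1.0)
print(C, g1, g2)
```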
Couldn't follow from here on……