Terminology

x = “input” variable (feature)

y = “output” variable (“target” variable)

$\hat{y}$ = prediction (estimated y)

m = number of training examples

(x, y) = single training example

$(x^{(i)}, y^{(i)})$ = $i^{th}$ training example, e.g. the $1^{st}, 2^{nd}, 3^{rd}, \dots$

$w, b$: parameters, coefficients, or weights

Process

training set -> learning algorithm -> function

x -> model (function) -> $\hat{y}$

Univariate linear regression

$$f_{w,b}(x) = wx + b$$

Linear regression with one variable (a single feature x).
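
For example (with made-up numbers, purely for illustration): taking $w = 200$ and $b = 100$, an input $x = 1.5$ gives the prediction

$$\hat{y} = f_{w,b}(1.5) = 200 \cdot 1.5 + 100 = 400.$$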

Cost function (squared error cost function)

The cost function measures how well a choice of $w$ and $b$ fits the training data.

$$\hat{y}^{(i)} = f_{w,b}(x^{(i)}) = wx^{(i)} + b$$

$$J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})^2 = \frac{1}{2m}\sum_{i=1}^{m}(f_{w,b}(x^{(i)}) - y^{(i)})^2$$

  • $f_{w,b}(x^{(i)})$ is our prediction for example $i$ using parameters $w, b$.

  • $(f_{w,b}(x^{(i)}) - y^{(i)})^2$ is the squared difference between the prediction and the target value.

  • These squared differences are summed over all $m$ examples and divided by $2m$ to produce the cost, $J(w,b)$.

Goal: find $w, b$ such that $\hat{y}^{(i)}$ is close to $y^{(i)}$ for all $(x^{(i)}, y^{(i)})$.

In other words, the cost measures the difference between the model's predictions and the actual true values of y.

def compute_cost(x, y, w, b):
    """
    Computes the cost function for linear regression.

    Args:
        x (ndarray (m,)): input data, m examples
        y (ndarray (m,)): target values
        w, b (scalar)   : model parameters

    Returns:
        total_cost (float): the cost of using w and b as parameters for
                            linear regression to fit the data points in x and y
    """
    # number of training examples
    m = x.shape[0]

    cost_sum = 0
    for i in range(m):
        f_wb = w * x[i] + b           # model prediction for example i
        cost = (f_wb - y[i]) ** 2     # squared error for example i
        cost_sum = cost_sum + cost
    total_cost = (1 / (2 * m)) * cost_sum

    return total_cost
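
A minimal usage sketch (the training data and parameter values below are made up for illustration):

import numpy as np

# hypothetical training set: two examples
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

# a line passing through both points (w=200, b=100) gives zero cost
print(compute_cost(x_train, y_train, w=200, b=100))   # 0.0
print(compute_cost(x_train, y_train, w=150, b=100))   # 3125.0, a worse fit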

Gradient descent algorithm

  • Start with some initial w, b

  • Keep changing w, b to reduce J(w, b)

  • Until we settle at or near a minimum

Implementation

$$\begin{cases} w = w - \alpha \frac{\partial J(w,b)}{\partial w} \\ b = b - \alpha \frac{\partial J(w,b)}{\partial b} \end{cases}$$

$\alpha$ is the learning rate, and $\frac{\partial J(w,b)}{\partial w}$ is the partial derivative.

Near a local minimum:

  • the derivative becomes smaller

  • the update steps become smaller

Repeat until convergence, where the parameters $w, b$ are updated simultaneously.

Linear regression model: $f_{w,b}(x^{(i)}) = wx^{(i)} + b$

Cost function: $J(w,b) = \frac{1}{2m} \sum\limits_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2$

$$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum\limits_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})\,x^{(i)}$$

$$\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum\limits_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})$$

Here "simultaneously" means that you calculate the partial derivatives for all the parameters before updating any of the parameters.

compute_gradient returns $\frac{\partial J(w,b)}{\partial w}$ and $\frac{\partial J(w,b)}{\partial b}$.

def compute_gradient(x, y, w, b):
    """
    Computes the gradient for linear regression.

    Args:
        x (ndarray (m,)): input data, m examples
        y (ndarray (m,)): target values
        w, b (scalar)   : model parameters

    Returns:
        dj_dw (scalar): the gradient of the cost with respect to the parameter w
        dj_db (scalar): the gradient of the cost with respect to the parameter b
    """
    # number of training examples
    m = x.shape[0]
    dj_dw = 0
    dj_db = 0

    for i in range(m):
        f_wb = w * x[i] + b               # model prediction for example i
        dj_dw_i = (f_wb - y[i]) * x[i]    # contribution to dJ/dw
        dj_db_i = f_wb - y[i]             # contribution to dJ/db
        dj_db += dj_db_i
        dj_dw += dj_dw_i
    dj_dw = dj_dw / m
    dj_db = dj_db / m

    return dj_dw, dj_db

In gradient_descent below, you will use this function to find optimal values of $w$ and $b$ on the training data.

$$\begin{cases} w = w - \alpha \frac{\partial J(w,b)}{\partial w} \\ b = b - \alpha \frac{\partial J(w,b)}{\partial b} \end{cases}$$

def gradient_descent(x, y, w_in, b_in, alpha, num_iters, gradient_function):
    """
    Performs gradient descent to fit w and b. Updates w and b by taking
    num_iters gradient steps with learning rate alpha.

    Args:
        x (ndarray (m,))   : input data, m examples
        y (ndarray (m,))   : target values
        w_in, b_in (scalar): initial values of the model parameters
        alpha (float)      : learning rate
        num_iters (int)    : number of iterations to run gradient descent
        gradient_function  : function called to compute the gradient

    Returns:
        w (scalar): updated value of the parameter after running gradient descent
        b (scalar): updated value of the parameter after running gradient descent
    """

    b = b_in
    w = w_in

    for i in range(num_iters):
        # compute the gradient using gradient_function
        dj_dw, dj_db = gradient_function(x, y, w, b)

        # update the parameters simultaneously
        b = b - alpha * dj_db
        w = w - alpha * dj_dw

    return w, b
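
A minimal usage sketch, reusing the hypothetical x_train, y_train from above (the learning rate and iteration count are illustrative choices, not values from the original notes):

w_final, b_final = gradient_descent(x_train, y_train, w_in=0, b_in=0,
                                    alpha=1.0e-2, num_iters=10000,
                                    gradient_function=compute_gradient)
print(f"w: {w_final:.2f}, b: {b_final:.2f}")   # approaches w ≈ 200, b ≈ 100 for this data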

Multiple Linear Regression

Terminology

$x_j$ = $j^{th}$ feature

n = number of features

$\vec{x}^{(i)}$ = features of the $i^{th}$ training example

$x^{(i)}_j$ = value of feature $j$ in the $i^{th}$ training example

Model

$$f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b = w_1x_1 + w_2x_2 + w_3x_3 + \dots + w_nx_n + b$$

import numpy as np

def predict_single_loop(x, w, b):
    """
    Makes a single prediction using linear regression (element-by-element loop).

    Args:
        x (ndarray): Shape (n,) example with multiple features
        w (ndarray): Shape (n,) model parameters
        b (scalar) : model parameter

    Returns:
        p (scalar): prediction
    """
    n = x.shape[0]
    p = 0
    for i in range(n):
        p_i = x[i] * w[i]   # contribution of feature i
        p = p + p_i
    p = p + b
    return p

# We can make use of vector operations to speed up predictions.
def predict(x, w, b):
    """
    Makes a single prediction using linear regression (vectorized).

    Args:
        x (ndarray): Shape (n,) example with multiple features
        w (ndarray): Shape (n,) model parameters
        b (scalar) : model parameter

    Returns:
        p (scalar): prediction
    """
    p = np.dot(x, w) + b
    return p
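
A quick usage sketch with made-up values for a 3-feature example (both versions should agree):

x_vec = np.array([1.2, 3.0, 40.0])
w_vec = np.array([0.5, -1.1, 0.02])
b_val = 10.0

print(predict_single_loop(x_vec, w_vec, b_val))   # ≈ 8.1
print(predict(x_vec, w_vec, b_val))               # same result, vectorized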

Cost With Multiple Variables

$$J(\vec{w},b) = \frac{1}{2m} \sum\limits_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})^2$$

def compute_cost(X, y, w, b):
    """
    Computes the cost for multiple linear regression.

    Args:
        X (ndarray (m,n)): data, m examples with n features
        y (ndarray (m,)) : target values
        w (ndarray (n,)) : model parameters
        b (scalar)       : model parameter

    Returns:
        cost (scalar): the cost
    """
    m = X.shape[0]
    cost = 0.0
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b        # (n,)·(n,) = scalar (see np.dot)
        cost = cost + (f_wb_i - y[i])**2    # scalar
    cost = cost / (2 * m)                   # scalar
    return cost

Gradient descent

$$w_j = w_j - \alpha\frac{\partial J(\vec{w},b)}{\partial w_j}$$

$$b = b - \alpha\frac{\partial J(\vec{w},b)}{\partial b}$$

$$\begin{cases} w_1 = w_1 - \alpha\frac{1}{m} \sum\limits_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})\,x_1^{(i)} \\ \quad\vdots \\ w_n = w_n - \alpha\frac{1}{m} \sum\limits_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})\,x_n^{(i)} \\ b = b - \alpha\frac{1}{m} \sum\limits_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}) \end{cases}$$

where $n \ge 2$, and $w_j$ (for $j = 1, \dots, n$) and $b$ are updated simultaneously.

import copy

def compute_gradient(X, y, w, b):
    """
    Computes the gradient for multiple linear regression.

    Args:
        X (ndarray (m,n)): data, m examples with n features
        y (ndarray (m,)) : target values
        w (ndarray (n,)) : model parameters
        b (scalar)       : model parameter

    Returns:
        dj_dw (ndarray (n,)): the gradient of the cost with respect to the parameters w
        dj_db (scalar)      : the gradient of the cost with respect to the parameter b
    """
    m, n = X.shape   # (number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.

    for i in range(m):
        err = (np.dot(X[i], w) + b) - y[i]       # prediction error for example i
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err * X[i, j]  # accumulate dJ/dw_j
        dj_db = dj_db + err                      # accumulate dJ/db
    dj_dw = dj_dw / m
    dj_db = dj_db / m

    return dj_db, dj_dw


def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters):
    """
    Performs batch gradient descent to learn w and b. Updates w and b by
    taking num_iters gradient steps with learning rate alpha.

    Args:
        X (ndarray (m,n))  : data, m examples with n features
        y (ndarray (m,))   : target values
        w_in (ndarray (n,)): initial model parameters
        b_in (scalar)      : initial model parameter
        cost_function      : function to compute the cost
        gradient_function  : function to compute the gradient
        alpha (float)      : learning rate
        num_iters (int)    : number of iterations to run gradient descent

    Returns:
        w (ndarray (n,)): updated parameter values after running gradient descent
        b (scalar)      : updated parameter value after running gradient descent
    """

    w = copy.deepcopy(w_in)  # avoid modifying the global w within the function
    b = b_in

    for i in range(num_iters):

        # compute the gradient using gradient_function
        dj_db, dj_dw = gradient_function(X, y, w, b)

        # update the parameters using w, b, alpha and the gradient
        w = w - alpha * dj_dw
        b = b - alpha * dj_db

    return w, b  # return the final w, b
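
A minimal end-to-end sketch on a tiny hypothetical dataset (2 examples, 3 features; the data and hyperparameters are placeholders, not values from the original notes):

X_train = np.array([[1.0, 2.0, 3.0],
                    [2.0, 1.0, 0.5]])
y_train = np.array([10.0, 6.0])

w_init = np.zeros(X_train.shape[1])
b_init = 0.

w_final, b_final = gradient_descent(X_train, y_train, w_init, b_init,
                                    compute_cost, compute_gradient,
                                    alpha=5.0e-2, num_iters=1000)
print(predict(X_train[0], w_final, b_final))   # close to the target 10.0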

Normal equation

  • only for linear regression
  • solve for w,b without iterations

Disadvantages:

  • doesn't generalize to other learning algorithms
  • slow when the number of features is large (> 10,000)
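
A minimal sketch of the closed-form solve with NumPy (illustrative code, not from the original notes); it appends a column of ones to X so the intercept b comes out as the last coefficient, and assumes the resulting $A^TA$ is invertible:

def normal_equation(X, y):
    """Closed form: theta = (A^T A)^{-1} A^T y, with A = [X | 1]."""
    A = np.c_[X, np.ones(X.shape[0])]           # (m, n+1) design matrix
    theta = np.linalg.solve(A.T @ A, A.T @ y)   # solve the normal equations
    return theta[:-1], theta[-1]                # w (n,), b (scalar)

# hypothetical data generated from w = [2, -1], b = 3 (no noise)
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
y_demo = X_demo @ np.array([2.0, -1.0]) + 3.0
w_ne, b_ne = normal_equation(X_demo, y_demo)
print(w_ne, b_ne)   # ≈ [2. -1.] 3.0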

Feature scaling

Feature scaling essentially divides each positive feature by its maximum value, or more generally, rescales each feature using both its minimum and maximum values via (x - min)/(max - min).

Aim for roughly $-1 \le x_j \le 1$ for each feature $x_j$.
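
A minimal sketch of this rescaling, assuming X is an (m, n) NumPy array and every feature has max > min (the function name is illustrative):

def rescale_features(X):
    """Rescale each column to [0, 1] via (x - min) / (max - min)."""
    x_min = X.min(axis=0)   # per-feature minimum, shape (n,)
    x_max = X.max(axis=0)   # per-feature maximum, shape (n,)
    return (X - x_min) / (x_max - x_min)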

Mean normalization

$$x_i := \dfrac{x_i - \mu_i}{\text{max} - \text{min}}$$
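
A corresponding sketch under the same assumptions as above (the function name is illustrative):

def mean_normalize_features(X):
    """Mean normalization: (x - mu) / (max - min), per column."""
    mu = X.mean(axis=0)     # per-feature mean, shape (n,)
    return (X - mu) / (X.max(axis=0) - X.min(axis=0))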

Z-score normalization

After z-score normalization, all features will have a mean of 0 and a standard deviation of 1.

$$x^{(i)}_j = \dfrac{x^{(i)}_j - \mu_j}{\sigma_j}$$

where $j$ selects a feature or a column in the $\mathbf{X}$ matrix, $\mu_j$ is the mean of all the values for feature $j$, and $\sigma_j$ is the standard deviation of feature $j$.

Implementation Note: When normalizing the features, it is important to store the values used for normalization - the mean value and the standard deviation used for the computations.

Given a new x value, we must first normalize x using the mean and standard deviation that we had previously computed from the training set.
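
A minimal sketch that returns mu and sigma along with the normalized data, so a new x can be normalized with the training-set statistics (function and variable names are illustrative; X_train is the hypothetical data from the earlier sketch):

def zscore_normalize_features(X):
    """Z-score normalize each column of X and return the statistics used."""
    mu = np.mean(X, axis=0)      # per-feature mean,               shape (n,)
    sigma = np.std(X, axis=0)    # per-feature standard deviation, shape (n,)
    X_norm = (X - mu) / sigma
    return X_norm, mu, sigma

# normalize the training set, then reuse mu and sigma for a new example
X_norm, mu, sigma = zscore_normalize_features(X_train)
x_new_norm = (np.array([1.5, 2.5, 1.0]) - mu) / sigma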