Here "simultaneously" means that you calculate the partial derivatives for all of the parameters before updating any of the parameters.
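As a minimal sketch of what this looks like for a single step (the toy cost, learning rate, and starting point below are illustrative assumptions, not from the lab), both derivatives are evaluated at the current point before either parameter is overwritten:

```python
# One gradient-descent step on the toy cost f(w, b) = w**2 + b**2
# (cost, alpha, and starting point are illustrative assumptions).
w, b, alpha = 3.0, -2.0, 0.1

dj_dw = 2 * w          # partial derivative w.r.t. w at the current (w, b)
dj_db = 2 * b          # partial derivative w.r.t. b at the current (w, b)

# Only after both derivatives are computed are the parameters updated,
# so neither update sees a half-updated value.
w = w - alpha * dj_dw
b = b - alpha * dj_db
```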
```python
def predict_single_loop(x, w, b):
    """
    Single prediction using linear regression (loop version).

    Args:
      x (ndarray): Shape (n,) example with multiple features
      w (ndarray): Shape (n,) model parameters
      b (scalar):  model parameter

    Returns:
      p (scalar): prediction
    """
    n = x.shape[0]
    p = 0
    for i in range(n):
        p_i = x[i] * w[i]
        p = p + p_i
    p = p + b
    return p
```
```python
import numpy as np

# We can make use of vector operations to speed up predictions.
def predict(x, w, b):
    """
    Single prediction using linear regression (vectorized).

    Args:
      x (ndarray): Shape (n,) example with multiple features
      w (ndarray): Shape (n,) model parameters
      b (scalar):  model parameter

    Returns:
      p (scalar): prediction
    """
    p = np.dot(x, w) + b
    return p
```
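A quick check (the feature and parameter values below are illustrative, not results from the lab): both versions should return the same prediction.

```python
x_vec = np.array([2104, 5, 1, 45], dtype=float)   # one example with 4 features
w_init = np.array([0.39, 18.75, -53.36, -26.42])  # illustrative parameters
b_init = 785.18

print(predict_single_loop(x_vec, w_init, b_init))  # loop version
print(predict(x_vec, w_init, b_init))              # vectorized version, same value
```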
Cost With Multiple Variables
$$J(\mathbf{w},b) = \frac{1}{2m} \sum_{i=1}^{m} \left( f_{\mathbf{w},b}(\mathbf{x}^{(i)}) - y^{(i)} \right)^2$$
```python
def compute_cost(X, y, w, b):
    """
    Compute the cost.

    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter

    Returns:
      cost (scalar): cost
    """
    m = X.shape[0]
    cost = 0.0
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b        # (n,)(n,) = scalar (see np.dot)
        cost = cost + (f_wb_i - y[i])**2    # scalar
    cost = cost / (2 * m)                   # scalar
    return cost
```
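Not part of the lab code above, but as a sketch, the same cost can also be computed without the Python loop by using a single matrix-vector product (same X, y, w, b shapes as above):

```python
def compute_cost_vectorized(X, y, w, b):
    """Same cost as compute_cost, computed with one matrix-vector product."""
    m = X.shape[0]
    err = X @ w + b - y               # (m,) vector of prediction errors
    return np.dot(err, err) / (2 * m)
```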
```python
def compute_gradient(X, y, w, b):
    """
    Compute the gradient for linear regression.

    Args:
      X (ndarray (m,n)): Data, m examples with n features
      y (ndarray (m,)) : target values
      w (ndarray (n,)) : model parameters
      b (scalar)       : model parameter

    Returns:
      dj_dw (ndarray (n,)): gradient of the cost with respect to the parameters w
      dj_db (scalar):       gradient of the cost with respect to the parameter b
    """
    m, n = X.shape   # (number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.

    for i in range(m):
        err = (np.dot(X[i], w) + b) - y[i]
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err * X[i, j]
        dj_db = dj_db + err
    dj_dw = dj_dw / m
    dj_db = dj_db / m

    return dj_db, dj_dw
```
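Again as a sketch rather than the lab's implementation, the double loop can be replaced by matrix products:

```python
def compute_gradient_vectorized(X, y, w, b):
    """Same gradients as compute_gradient, using matrix products instead of loops."""
    m = X.shape[0]
    err = X @ w + b - y        # (m,) prediction errors
    dj_dw = X.T @ err / m      # (n,) gradient with respect to w
    dj_db = np.sum(err) / m    # scalar gradient with respect to b
    return dj_db, dj_dw
```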
```python
import copy

def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters):
    """
    Perform batch gradient descent to learn w and b. Updates w and b by taking
    num_iters gradient steps with learning rate alpha.

    Args:
      X (ndarray (m,n))   : Data, m examples with n features
      y (ndarray (m,))    : target values
      w_in (ndarray (n,)) : initial model parameters
      b_in (scalar)       : initial model parameter
      cost_function       : function to compute the cost
      gradient_function   : function to compute the gradient
      alpha (float)       : learning rate
      num_iters (int)     : number of iterations to run gradient descent

    Returns:
      w (ndarray (n,)) : updated parameters
      b (scalar)       : updated parameter
    """
    w = copy.deepcopy(w_in)  # avoid modifying the global w inside the function
    b = b_in

    for i in range(num_iters):
        # Calculate the gradient
        dj_db, dj_dw = gradient_function(X, y, w, b)

        # Update the parameters using w, b, alpha and the gradient
        w = w - alpha * dj_dw
        b = b - alpha * dj_db

    return w, b  # return final w, b
```
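A small usage sketch (the training data and learning rate below are illustrative assumptions; alpha is kept tiny because the features are not yet scaled):

```python
X_train = np.array([[2104, 5, 1, 45],
                    [1416, 3, 2, 40],
                    [ 852, 2, 1, 35]], dtype=float)   # 3 examples, 4 features
y_train = np.array([460., 232., 178.])

w_init = np.zeros(X_train.shape[1])
b_init = 0.

w_final, b_final = gradient_descent(X_train, y_train, w_init, b_init,
                                    compute_cost, compute_gradient,
                                    alpha=5.0e-7, num_iters=1000)
print(f"w: {w_final}, b: {b_final:0.2f}")
```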
Normal Equation
works only for linear regression
solves for w, b directly, without iterations (see the sketch after this list)
Disadvantages:
doesn't generalize to other learning algorithms
slow when the number of features is large (> 10,000)
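A minimal sketch of the closed-form solution, assuming NumPy and a prepended intercept column (normal_equation is an assumed helper name, not from the course code):

```python
import numpy as np

def normal_equation(X, y):
    """Closed-form least squares: solve for b and w directly, with no iterations."""
    m = X.shape[0]
    Xb = np.c_[np.ones(m), X]                       # prepend a column of ones for the bias b
    theta, *_ = np.linalg.lstsq(Xb, y, rcond=None)  # minimizes ||Xb @ theta - y||^2
    b, w = theta[0], theta[1:]
    return w, b
```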
Feature Scaling
Feature scaling essentially means dividing each positive feature by its maximum value, or more generally, rescaling each feature using its minimum and maximum values: (x - min) / (max - min).
Aim for about $-1 \le x_j \le 1$ for each feature $x_j$ (a sketch of the rescaling follows below).
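A minimal sketch of that (x - min) / (max - min) rescaling applied column-wise (min_max_scale is an assumed helper name):

```python
import numpy as np

def min_max_scale(X):
    """Rescale each feature (column of X) to roughly [0, 1] via (x - min) / (max - min)."""
    x_min = X.min(axis=0)   # per-feature minimum
    x_max = X.max(axis=0)   # per-feature maximum
    return (X - x_min) / (x_max - x_min)
```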
Mean Normalization
$$x_i := \frac{x_i - \mu_i}{\max - \min}$$
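A corresponding sketch (mean_normalize is an assumed helper name): center each feature on zero, then divide by its range.

```python
import numpy as np

def mean_normalize(X):
    """Mean normalization: (x - mean) / (max - min), applied per feature."""
    mu = X.mean(axis=0)   # per-feature mean
    return (X - mu) / (X.max(axis=0) - X.min(axis=0))
```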
Z-score Normalization
After z-score normalization, all features will have a mean of 0 and a standard deviation of 1.
$$x^{(i)}_j := \frac{x^{(i)}_j - \mu_j}{\sigma_j}$$

where $j$ selects a feature (a column of the $X$ matrix), $\mu_j$ is the mean of all the values for feature $j$, and $\sigma_j$ is the standard deviation of feature $j$.
Implementation Note: When normalizing the features, it is important to store the values used for normalization - the mean value and the standard deviation used for the computations.
Given a new x value, we must first normalize x using the mean and standard deviation that we had previously computed from the training set.
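A sketch of a z-score normalization helper that also returns the training-set statistics so they can be reused on new examples (the function name and the placeholder arrays are assumptions):

```python
import numpy as np

def zscore_normalize_features(X):
    """Z-score normalize each column of X; return mu and sigma for later reuse."""
    mu = np.mean(X, axis=0)      # per-feature mean, shape (n,)
    sigma = np.std(X, axis=0)    # per-feature standard deviation, shape (n,)
    X_norm = (X - mu) / sigma
    return X_norm, mu, sigma

# X_train and x_new are hypothetical placeholders; the stored mu and sigma from the
# training set are applied to any new example before prediction.
# X_norm, mu, sigma = zscore_normalize_features(X_train)
# x_new_norm = (x_new - mu) / sigma
```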