Terminology

x = “input” variable (feature)

y = “output” variable (“target” variable)

$\hat{y}$ = prediction (estimated y)

m = number of training examples

(x, y) = single training example

$(x^{(i)}, y^{(i)})$ = $i^{th}$ training example, e.g. the $1^{st}, 2^{nd}, 3^{rd}, \dots$

$w, b$: parameters, coefficients, or weights

Process

training set -> learning algorithm -> function

x -> model (function) -> $\hat{y}$

Univariate linear regression

$$f_{w,b}(x) = wx + b$$

Linear regression with one variable (a single feature x).
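
For example (with made-up numbers, purely for illustration): taking $w = 200$ and $b = 100$, an input $x = 1.5$ gives the prediction

$$\hat{y} = f_{w,b}(1.5) = 200 \cdot 1.5 + 100 = 400.$$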

Cost function (squared error cost function)

The cost function measures how well a choice of $w$ and $b$ fits the training data.

$$\hat{y}^{(i)} = f_{w,b}(x^{(i)}) = wx^{(i)} + b$$

$$J(w,b) = \frac{1}{2m}\sum_{i=1}^{m}(\hat{y}^{(i)} - y^{(i)})^2 = \frac{1}{2m}\sum_{i=1}^{m}(f_{w,b}(x^{(i)}) - y^{(i)})^2$$

  • $f_{w,b}(x^{(i)})$ is our prediction for example $i$ using parameters $w, b$.

  • $(f_{w,b}(x^{(i)}) - y^{(i)})^2$ is the squared difference between the prediction and the target value.

  • These squared differences are summed over all $m$ examples and divided by $2m$ to produce the cost, $J(w,b)$.

Goal: find $w, b$ such that $\hat{y}^{(i)}$ is close to $y^{(i)}$ for all $(x^{(i)}, y^{(i)})$.

In other words, the cost measures the difference between the model's predictions and the actual true values of y.

def compute_cost(x, y, w, b):
    """
    Computes the cost function for linear regression.

    Args:
        x (ndarray (m,)): input data, m examples
        y (ndarray (m,)): target values
        w, b (scalar)   : model parameters

    Returns:
        total_cost (float): the cost of using w and b as parameters for
                            linear regression to fit the data points in x and y
    """
    # number of training examples
    m = x.shape[0]

    cost_sum = 0
    for i in range(m):
        f_wb = w * x[i] + b           # model prediction for example i
        cost = (f_wb - y[i]) ** 2     # squared error for example i
        cost_sum = cost_sum + cost
    total_cost = (1 / (2 * m)) * cost_sum

    return total_cost
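
A minimal usage sketch (the training data and parameter values below are made up for illustration):

import numpy as np

# hypothetical training set: two examples
x_train = np.array([1.0, 2.0])
y_train = np.array([300.0, 500.0])

# a line passing through both points (w=200, b=100) gives zero cost
print(compute_cost(x_train, y_train, w=200, b=100))   # 0.0
print(compute_cost(x_train, y_train, w=150, b=100))   # 3125.0, a worse fit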

Gradient descent algorithm

  • Start with some initial w, b

  • Keep changing w, b to reduce J(w, b)

  • Until we settle at or near a minimum

Implementation

$$\begin{cases} w = w - \alpha \frac{\partial J(w,b)}{\partial w} \\ b = b - \alpha \frac{\partial J(w,b)}{\partial b} \end{cases}$$

$\alpha$ is the learning rate, and $\frac{\partial J(w,b)}{\partial w}$ is the partial derivative.

Near a local minimum:

  • the derivative becomes smaller

  • the update steps become smaller

Repeat until convergence, where the parameters $w, b$ are updated simultaneously.

Linear regression model: $f_{w,b}(x^{(i)}) = wx^{(i)} + b$

Cost function: $J(w,b) = \frac{1}{2m} \sum\limits_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})^2$

$$\frac{\partial J(w,b)}{\partial w} = \frac{1}{m} \sum\limits_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})\,x^{(i)}$$

$$\frac{\partial J(w,b)}{\partial b} = \frac{1}{m} \sum\limits_{i=0}^{m-1} (f_{w,b}(x^{(i)}) - y^{(i)})$$

Here "simultaneously" means that you calculate the partial derivatives for all the parameters before updating any of the parameters.

compute_gradient returns $\frac{\partial J(w,b)}{\partial w}$ and $\frac{\partial J(w,b)}{\partial b}$.

def compute_gradient(x, y, w, b):
    """
    Computes the gradient for linear regression.

    Args:
        x (ndarray (m,)): input data, m examples
        y (ndarray (m,)): target values
        w, b (scalar)   : model parameters

    Returns:
        dj_dw (scalar): the gradient of the cost with respect to the parameter w
        dj_db (scalar): the gradient of the cost with respect to the parameter b
    """
    # number of training examples
    m = x.shape[0]
    dj_dw = 0
    dj_db = 0

    for i in range(m):
        f_wb = w * x[i] + b               # model prediction for example i
        dj_dw_i = (f_wb - y[i]) * x[i]    # contribution to dJ/dw
        dj_db_i = f_wb - y[i]             # contribution to dJ/db
        dj_db += dj_db_i
        dj_dw += dj_dw_i
    dj_dw = dj_dw / m
    dj_db = dj_db / m

    return dj_dw, dj_db

In gradient_descent below, you will use this function to find optimal values of $w$ and $b$ on the training data.

$$\begin{cases} w = w - \alpha \frac{\partial J(w,b)}{\partial w} \\ b = b - \alpha \frac{\partial J(w,b)}{\partial b} \end{cases}$$

def gradient_descent(x, y, w_in, b_in, alpha, num_iters, gradient_function):
    """
    Performs gradient descent to fit w and b. Updates w and b by taking
    num_iters gradient steps with learning rate alpha.

    Args:
        x (ndarray (m,))   : input data, m examples
        y (ndarray (m,))   : target values
        w_in, b_in (scalar): initial values of the model parameters
        alpha (float)      : learning rate
        num_iters (int)    : number of iterations to run gradient descent
        gradient_function  : function called to compute the gradient

    Returns:
        w (scalar): updated value of the parameter after running gradient descent
        b (scalar): updated value of the parameter after running gradient descent
    """

    b = b_in
    w = w_in

    for i in range(num_iters):
        # compute the gradient using gradient_function
        dj_dw, dj_db = gradient_function(x, y, w, b)

        # update the parameters simultaneously
        b = b - alpha * dj_db
        w = w - alpha * dj_dw

    return w, b
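
A minimal usage sketch, reusing the hypothetical x_train, y_train from above (the learning rate and iteration count are illustrative choices, not values from the original notes):

w_final, b_final = gradient_descent(x_train, y_train, w_in=0, b_in=0,
                                    alpha=1.0e-2, num_iters=10000,
                                    gradient_function=compute_gradient)
print(f"w: {w_final:.2f}, b: {b_final:.2f}")   # approaches w ≈ 200, b ≈ 100 for this data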

Multiple Linear Regression

Terminology

$x_j$ = $j^{th}$ feature

n = number of features

$\vec{x}^{(i)}$ = features of the $i^{th}$ training example

$x^{(i)}_j$ = value of feature $j$ in the $i^{th}$ training example

Model

$$f_{\vec{w},b}(\vec{x}) = \vec{w} \cdot \vec{x} + b = w_1x_1 + w_2x_2 + w_3x_3 + \dots + w_nx_n + b$$

import numpy as np

def predict_single_loop(x, w, b):
    """
    Makes a single prediction using linear regression (element-by-element loop).

    Args:
        x (ndarray): Shape (n,) example with multiple features
        w (ndarray): Shape (n,) model parameters
        b (scalar) : model parameter

    Returns:
        p (scalar): prediction
    """
    n = x.shape[0]
    p = 0
    for i in range(n):
        p_i = x[i] * w[i]   # contribution of feature i
        p = p + p_i
    p = p + b
    return p

# We can make use of vector operations to speed up predictions.
def predict(x, w, b):
    """
    Makes a single prediction using linear regression (vectorized).

    Args:
        x (ndarray): Shape (n,) example with multiple features
        w (ndarray): Shape (n,) model parameters
        b (scalar) : model parameter

    Returns:
        p (scalar): prediction
    """
    p = np.dot(x, w) + b
    return p
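
A quick usage sketch with made-up values for a 3-feature example (both versions should agree):

x_vec = np.array([1.2, 3.0, 40.0])
w_vec = np.array([0.5, -1.1, 0.02])
b_val = 10.0

print(predict_single_loop(x_vec, w_vec, b_val))   # ≈ 8.1
print(predict(x_vec, w_vec, b_val))               # same result, vectorized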

Cost With Multiple Variables

$$J(\vec{w},b) = \frac{1}{2m} \sum\limits_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})^2$$

def compute_cost(X, y, w, b):
    """
    Computes the cost for multiple linear regression.

    Args:
        X (ndarray (m,n)): data, m examples with n features
        y (ndarray (m,)) : target values
        w (ndarray (n,)) : model parameters
        b (scalar)       : model parameter

    Returns:
        cost (scalar): the cost
    """
    m = X.shape[0]
    cost = 0.0
    for i in range(m):
        f_wb_i = np.dot(X[i], w) + b        # (n,)·(n,) = scalar (see np.dot)
        cost = cost + (f_wb_i - y[i])**2    # scalar
    cost = cost / (2 * m)                   # scalar
    return cost

Gradient descent

$$w_j = w_j - \alpha\frac{\partial J(\vec{w},b)}{\partial w_j}$$

$$b = b - \alpha\frac{\partial J(\vec{w},b)}{\partial b}$$

$$\begin{cases} w_1 = w_1 - \alpha\frac{1}{m} \sum\limits_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})\,x_1^{(i)} \\ \quad\vdots \\ w_n = w_n - \alpha\frac{1}{m} \sum\limits_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)})\,x_n^{(i)} \\ b = b - \alpha\frac{1}{m} \sum\limits_{i=1}^{m} (f_{\vec{w},b}(\vec{x}^{(i)}) - y^{(i)}) \end{cases}$$

where $n \ge 2$, and $w_j$ (for $j = 1, \dots, n$) and $b$ are updated simultaneously.

import copy

def compute_gradient(X, y, w, b):
    """
    Computes the gradient for multiple linear regression.

    Args:
        X (ndarray (m,n)): data, m examples with n features
        y (ndarray (m,)) : target values
        w (ndarray (n,)) : model parameters
        b (scalar)       : model parameter

    Returns:
        dj_dw (ndarray (n,)): the gradient of the cost with respect to the parameters w
        dj_db (scalar)      : the gradient of the cost with respect to the parameter b
    """
    m, n = X.shape   # (number of examples, number of features)
    dj_dw = np.zeros((n,))
    dj_db = 0.

    for i in range(m):
        err = (np.dot(X[i], w) + b) - y[i]       # prediction error for example i
        for j in range(n):
            dj_dw[j] = dj_dw[j] + err * X[i, j]  # accumulate dJ/dw_j
        dj_db = dj_db + err                      # accumulate dJ/db
    dj_dw = dj_dw / m
    dj_db = dj_db / m

    return dj_db, dj_dw


def gradient_descent(X, y, w_in, b_in, cost_function, gradient_function, alpha, num_iters):
    """
    Performs batch gradient descent to learn w and b. Updates w and b by
    taking num_iters gradient steps with learning rate alpha.

    Args:
        X (ndarray (m,n))  : data, m examples with n features
        y (ndarray (m,))   : target values
        w_in (ndarray (n,)): initial model parameters
        b_in (scalar)      : initial model parameter
        cost_function      : function to compute the cost
        gradient_function  : function to compute the gradient
        alpha (float)      : learning rate
        num_iters (int)    : number of iterations to run gradient descent

    Returns:
        w (ndarray (n,)): updated parameter values after running gradient descent
        b (scalar)      : updated parameter value after running gradient descent
    """

    w = copy.deepcopy(w_in)  # avoid modifying the global w within the function
    b = b_in

    for i in range(num_iters):

        # compute the gradient using gradient_function
        dj_db, dj_dw = gradient_function(X, y, w, b)

        # update the parameters using w, b, alpha and the gradient
        w = w - alpha * dj_dw
        b = b - alpha * dj_db

    return w, b  # return the final w, b
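
A minimal end-to-end sketch on a tiny hypothetical dataset (2 examples, 3 features; the data and hyperparameters are placeholders, not values from the original notes):

X_train = np.array([[1.0, 2.0, 3.0],
                    [2.0, 1.0, 0.5]])
y_train = np.array([10.0, 6.0])

w_init = np.zeros(X_train.shape[1])
b_init = 0.

w_final, b_final = gradient_descent(X_train, y_train, w_init, b_init,
                                    compute_cost, compute_gradient,
                                    alpha=5.0e-2, num_iters=1000)
print(predict(X_train[0], w_final, b_final))   # close to the target 10.0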

Normal equation

  • only for linear regression
  • solve for w,b without iterations

Disadvantages:

  • doesn't generalize to other learning algorithms
  • slow when the number of features is large (> 10,000)
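
A minimal sketch of the closed-form solve with NumPy (illustrative code, not from the original notes); it appends a column of ones to X so the intercept b comes out as the last coefficient, and assumes the resulting $A^TA$ is invertible:

def normal_equation(X, y):
    """Closed form: theta = (A^T A)^{-1} A^T y, with A = [X | 1]."""
    A = np.c_[X, np.ones(X.shape[0])]           # (m, n+1) design matrix
    theta = np.linalg.solve(A.T @ A, A.T @ y)   # solve the normal equations
    return theta[:-1], theta[-1]                # w (n,), b (scalar)

# hypothetical data generated from w = [2, -1], b = 3 (no noise)
X_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 3.0]])
y_demo = X_demo @ np.array([2.0, -1.0]) + 3.0
w_ne, b_ne = normal_equation(X_demo, y_demo)
print(w_ne, b_ne)   # ≈ [2. -1.] 3.0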

Feature scaling

Feature scaling essentially divides each positive feature by its maximum value, or more generally, rescales each feature using both its minimum and maximum values via (x - min)/(max - min).

Aim for roughly $-1 \le x_j \le 1$ for each feature $x_j$.
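
A minimal sketch of this rescaling, assuming X is an (m, n) NumPy array and every feature has max > min (the function name is illustrative):

def rescale_features(X):
    """Rescale each column to [0, 1] via (x - min) / (max - min)."""
    x_min = X.min(axis=0)   # per-feature minimum, shape (n,)
    x_max = X.max(axis=0)   # per-feature maximum, shape (n,)
    return (X - x_min) / (x_max - x_min)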

Mean normalization

$$x_i := \dfrac{x_i - \mu_i}{\text{max} - \text{min}}$$
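
A corresponding sketch under the same assumptions as above (the function name is illustrative):

def mean_normalize_features(X):
    """Mean normalization: (x - mu) / (max - min), per column."""
    mu = X.mean(axis=0)     # per-feature mean, shape (n,)
    return (X - mu) / (X.max(axis=0) - X.min(axis=0))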

Z-score normalization

After z-score normalization, all features will have a mean of 0 and a standard deviation of 1.

$$x^{(i)}_j = \dfrac{x^{(i)}_j - \mu_j}{\sigma_j}$$

where $j$ selects a feature or a column in the $\mathbf{X}$ matrix, $\mu_j$ is the mean of all the values for feature $j$, and $\sigma_j$ is the standard deviation of feature $j$.

Implementation Note: When normalizing the features, it is important to store the values used for normalization - the mean value and the standard deviation used for the computations.

Given a new x value, we must first normalize x using the mean and standard deviation that we had previously computed from the training set.
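
A minimal sketch that returns mu and sigma along with the normalized data, so a new x can be normalized with the training-set statistics (function and variable names are illustrative; X_train is the hypothetical data from the earlier sketch):

def zscore_normalize_features(X):
    """Z-score normalize each column of X and return the statistics used."""
    mu = np.mean(X, axis=0)      # per-feature mean,               shape (n,)
    sigma = np.std(X, axis=0)    # per-feature standard deviation, shape (n,)
    X_norm = (X - mu) / sigma
    return X_norm, mu, sigma

# normalize the training set, then reuse mu and sigma for a new example
X_norm, mu, sigma = zscore_normalize_features(X_train)
x_new_norm = (np.array([1.5, 2.5, 1.0]) - mu) / sigma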