# Linear Regression with One Variable

2016-11-21

2-1 Model Representation

Our first learning algorithm will be linear regression. In this lecture, you'll see what the model looks like and, more importantly, what the overall process of supervised learning looks like.

Let's use a motivating example of predicting housing prices. We're going to use a data set of housing prices from the city of Portland, Oregon. Here I'm going to plot my data set of a number of houses of different sizes that were sold for a range of different prices. Let's say that, given this data set, you have a friend who's trying to sell a house. Say your friend's house is 1250 square feet in size, and you want to tell them how much they might be able to sell it for. Well, one thing you could do is fit a model, maybe fit a straight line to this data. It looks something like that, and based on that, maybe you could tell your friend that he can sell the house for around \$220,000. So this is an example of a supervised learning algorithm. It's supervised learning because we're given the, quote, "right answer" for each of our examples: namely, we're told the actual price that each of the houses in our data set sold for. Moreover, this is an example of a regression problem, where the term regression refers to the fact that we are predicting a real-valued output, namely the price. And just to remind you, the other most common type of supervised learning problem is called the classification problem, where we predict discrete-valued outputs, such as when we look at cancer tumors and try to decide whether a tumor is malignant or benign. That's a zero-one valued discrete output.

More formally, in supervised learning we have a data set, and this data set is called a training set. For the housing prices example, we have a training set of different housing prices, and our job is to learn from this data how to predict the prices of houses. Let's define some notation that we'll use throughout this course. We're going to define quite a lot of symbols. It's okay if you don't remember all of them right now; as the course progresses, it will be useful to have convenient notation. I'm going to use lowercase $m$ throughout this course to denote the number of training examples. So if this data set has, say, 47 rows in its table, then I have 47 training examples and $m = 47$. Let's use lowercase $x$ to denote the input variables, often also called the features. Here, the $x$'s are the input features. And I'm going to use $y$ to denote the output variable, or target variable, which I'm going to predict; that's the second column here. I'm going to use $(x, y)$ to denote a single training example, so a single row in this table corresponds to a single training example. To refer to a specific training example, I'm going to use the notation $(x^{(i)}, y^{(i)})$, which refers to the $i$-th training example. The superscript $(i)$ here is not exponentiation. It is just an index into my training set and refers to the $i$-th row of this table. So this is not $x$ to the power of $i$ or $y$ to the power of $i$; instead, $(x^{(i)}, y^{(i)})$ just refers to the $i$-th row of this table. So here's how this supervised learning algorithm works.
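To make the notation concrete, here is a minimal Python sketch. The table values and the helper name `example` are illustrative assumptions, not data taken from the lecture; the point is only that $m$ counts rows and that the superscript $(i)$ selects a row.

```python
# Illustrative training set: (size in square feet, price in thousands of dollars).
# These rows are assumptions for the sketch, not the data set discussed above.
training_set = [
    (2104, 460),
    (1416, 232),
    (1534, 315),
    (852, 178),
]

m = len(training_set)  # m = number of training examples (rows)


def example(i):
    """Return (x^(i), y^(i)), the i-th training example, counting from 1."""
    return training_set[i - 1]


x_2, y_2 = example(2)  # the superscript (2) is a row index, not an exponent
print(m, x_2, y_2)
```

Python lists are indexed from 0, so the hypothetical helper subtracts 1 to match the lecture's convention of counting rows from 1.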

We take a training set, like our training set of housing prices, and feed it to our learning algorithm. It is the job of the learning algorithm to then output a function, which by convention is usually denoted lowercase $h$; $h$ stands for hypothesis. The hypothesis is a function that takes as input the size of a house, such as the size of the new house your friend is trying to sell. It takes in the value of $x$ and tries to output the estimated value of $y$ for the corresponding house. So $h$ is a function that maps from $x$'s to $y$'s. People often ask me why this function is called a hypothesis. Some of you may know the meaning of the term hypothesis from the dictionary or from science or whatever. It turns out that in machine learning this is a name that was used in the early days of the field, and it kind of stuck. I think the term hypothesis maybe isn't the best possible name for this sort of function, for mapping from sizes of houses to predictions, but it is the standard terminology that people use in machine learning.

How do we represent $h$? We will write it as $h_\theta(x) = \theta_0 + \theta_1 x$. As a shorthand, instead of writing $h_\theta(x)$, I'll sometimes just write $h(x)$. This hypothesis predicts that $y$ is some straight-line function of $x$: $h(x) = \theta_0 + \theta_1 x$. Sometimes we will want to fit more complicated, perhaps non-linear, functions as well. But the linear case is the simple building block, so we'll start with this example of fitting linear functions first, and we will build on it to eventually have more complex models and more complex learning algorithms. This model is called linear regression, and this example is actually linear regression with one variable, with the variable being $x$. Another name for this model is univariate linear regression, and univariate is just a fancy way of saying one variable.
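The hypothesis $h_\theta(x) = \theta_0 + \theta_1 x$ can be sketched as a short Python function. The parameter values below are hypothetical, chosen only so the arithmetic is easy to follow; they are not fitted to any data, and the units (price in thousands of dollars) are likewise an assumption.

```python
def h(x, theta0, theta1):
    """Hypothesis for univariate linear regression: a straight line in x."""
    return theta0 + theta1 * x


# Hypothetical parameters: intercept and slope (price in thousands per square foot).
theta0, theta1 = 50.0, 0.125

# Predicted price for a house of 1250 square feet under these assumed parameters.
predicted = h(1250, theta0, theta1)
print(predicted)  # 206.25
```

Different choices of `theta0` and `theta1` give different straight lines, and therefore different predictions for the same house size.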

2-2 Cost Function

In this lecture we'll define something called the cost function. This will let us figure out how to fit the best possible straight line to our data. In linear regression we have a training set like the one shown here. To introduce a little more terminology, $\theta_0$ and $\theta_1$, these $\theta_i$'s, are what I call the parameters of the model. What we're going to do in this video is talk about how to go about choosing these two parameter values, $\theta_0$ and $\theta_1$. With different choices of the parameters, we get different hypothesis functions.

What we want to do is come up with values for the parameters $\theta_0$ and $\theta_1$ so that the straight line we get out of this somehow fits the data well. The idea is that we're going to choose $\theta_0$ and $\theta_1$ so that $h(x)$, meaning the value we predict on input $x$, is at least close to the values $y$ for the examples in our training set. So what I really want is to sum, over my training set, the squared difference between the prediction and the actual value. We're going to try to minimize the average error, which means minimizing one over $2m$ times this sum. This is going to be my overall objective function for linear regression. This cost function is also called the squared error function. The squared error cost function is probably the most commonly used one for regression problems. Later in this class we'll talk about alternative cost functions as well, but this choice should be a pretty reasonable thing to try for most linear regression problems.
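Written out, the squared error cost described above is $J(\theta_0, \theta_1) = \frac{1}{2m} \sum_{i=1}^{m} \left(h_\theta(x^{(i)}) - y^{(i)}\right)^2$. The following Python sketch computes it directly. The function name `cost` and the three-point data set are assumptions made for illustration.

```python
def cost(theta0, theta1, xs, ys):
    """Squared error cost J(theta0, theta1) = 1/(2m) * sum((h(x) - y)^2)."""
    m = len(xs)
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = theta0 + theta1 * x  # h(x) for this training example
        total += (prediction - y) ** 2
    return total / (2 * m)


# Tiny hypothetical data set lying exactly on the line y = x.
xs = [1.0, 2.0, 3.0]
ys = [1.0, 2.0, 3.0]

print(cost(0.0, 1.0, xs, ys))  # the line y = x fits perfectly, so the cost is 0.0
print(cost(0.0, 0.5, xs, ys))  # a flatter line fits worse, so the cost is larger
```

Choosing the parameters then amounts to finding the pair `(theta0, theta1)` for which this function returns the smallest value.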

2-3 Cost Function - Intuition I

In the previous video, we gave the mathematical definition of the cost function. In this video, let's look at some examples to get some intuition about what the cost function is doing and why we want to use it. To recap, here's what we had last time.

2-4 Cost Function - Intuition II

In this video, let's delve deeper and get an even better intuition about what the cost function is doing. This video assumes that you're familiar with contour plots. If you are not familiar with contour plots or contour figures, some of the illustrations in this video may not make sense to you, but that is okay: if you end up skipping this video, or some of it does not quite make sense because you haven't seen contour plots before, it won't matter much.

2-5 Gradient Descent

In this video, I want to tell you about an algorithm called gradient descent for minimizing the cost function $J$. It turns out gradient descent is a more general algorithm and is used not only in linear regression.

2-6 Gradient Descent Intuition

As a reminder, this parameter, or this term, $\alpha$, is called the learning rate. It controls how big a step we take when updating my parameter $\theta_j$. And this second term here is the derivative term. In this video, what I want to do is give you better intuition about what each of these two terms is doing and why, when put together, this entire update makes sense.

This is the derivative term, right? You may be wondering why I changed the notation from the partial derivative symbols. If you don't know what the difference is between the partial derivative symbol $\partial/\partial\theta$ and $d/d\theta$, don't worry about it. Technically, in mathematics we call one a partial derivative and the other a derivative, depending on the number of parameters in the function $J$. So let's see what this equation will do. We're going to compute this derivative. Take the tangent at that point, that straight red line, and look at the slope of this red line. That's what the derivative is. The slope is just the height divided by the horizontal distance. This line has a positive slope, so it has a positive derivative. And so my update to theta is going to be: $\theta_1$ gets updated as $\theta_1$ minus $\alpha$ times some positive number. Let's look at another example, taking the same function $J$.
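The effect of the sign of the derivative can be checked numerically. The sketch below assumes a hypothetical one-parameter cost $J(\theta_1) = (\theta_1 - 1)^2$, whose minimum is at $\theta_1 = 1$; the function names `dJ` and `step` are illustrative and not part of the lecture.

```python
def dJ(theta1):
    """Derivative of the hypothetical cost J(theta1) = (theta1 - 1)**2."""
    return 2.0 * (theta1 - 1.0)


def step(theta1, alpha):
    """One gradient descent update: theta1 := theta1 - alpha * dJ/dtheta1."""
    return theta1 - alpha * dJ(theta1)


alpha = 0.25
right_of_min = step(3.0, alpha)   # slope is positive, so theta1 decreases
left_of_min = step(-1.0, alpha)   # slope is negative, so theta1 increases
print(right_of_min, left_of_min)  # 2.0 0.0
```

In both cases the update moves $\theta_1$ toward the minimum: a positive slope subtracts a positive amount, and a negative slope subtracts a negative amount.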

Let's next take a look at the learning rate term $\alpha$ and try to figure out what it's doing. So, here is my gradient descent update rule. Right, there's the equation. Let's look at what happens if $\alpha$ is either too small or too large. If $\alpha$ is too small, gradient descent needs a lot of steps to get close to the global minimum. If $\alpha$ is too large, it can fail to converge, or even diverge. The same update rule also explains why gradient descent can converge to a local minimum even when the learning rate is held fixed.

As I approach a local minimum, my derivative gets closer and closer to zero, so the update steps automatically become smaller.
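Both effects of the learning rate can be seen in a small experiment. This sketch assumes a hypothetical cost $J(\theta) = \theta^2$ with its minimum at $\theta = 0$; the function names and the specific values of $\alpha$ are illustrative choices, not values from the lecture.

```python
def run(alpha, theta=1.0, steps=20):
    """Run gradient descent on the hypothetical cost J(theta) = theta**2."""
    for _ in range(steps):
        theta = theta - alpha * 2.0 * theta  # dJ/dtheta = 2 * theta
    return theta


def step_sizes(alpha, theta=1.0, steps=5):
    """Record the size of each update while alpha stays fixed."""
    sizes = []
    for _ in range(steps):
        update = alpha * 2.0 * theta
        sizes.append(abs(update))
        theta -= update
    return sizes


small = run(alpha=0.01)  # too small: after 20 steps still far from 0
good = run(alpha=0.25)   # moderate: very close to 0
large = run(alpha=1.5)   # too large: overshoots on every step and diverges
print(small, good, large)

# With a fixed alpha, the steps shrink because the derivative shrinks.
print(step_sizes(0.25))  # [0.5, 0.25, 0.125, 0.0625, 0.03125]
```

The last line shows why there is no need to decrease $\alpha$ over time in this setting: the derivative term alone makes the steps smaller near the minimum.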

The squared error cost function combined with gradient descent gives us our first learning algorithm, the linear regression algorithm.

2-7 Gradient Descent For Linear Regression

Let's put together the gradient descent algorithm, the linear regression model, and the squared error cost function. It turns out that the cost function for linear regression is always going to be a bow-shaped function like this. It's called a convex function.

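The algorithm we just went over is sometimes called "batch" gradient descent. "Batch" refers to the fact that each step of gradient descent uses the entire training set. In the next video, I'll tell you about a generalization of the gradient descent algorithm that will make it more powerful.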
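Putting the pieces together, here is a minimal sketch of batch gradient descent for univariate linear regression. It assumes the standard partial derivatives of the squared error cost, $\frac{\partial J}{\partial \theta_0} = \frac{1}{m}\sum_i (h_\theta(x^{(i)}) - y^{(i)})$ and $\frac{\partial J}{\partial \theta_1} = \frac{1}{m}\sum_i (h_\theta(x^{(i)}) - y^{(i)})\,x^{(i)}$. The data set, the learning rate, and the iteration count are hypothetical choices for illustration.

```python
def batch_gradient_descent(xs, ys, alpha=0.1, iterations=1000):
    """Fit h(x) = theta0 + theta1 * x by batch gradient descent."""
    m = len(xs)
    theta0, theta1 = 0.0, 0.0
    for _ in range(iterations):
        # "Batch": every step uses the prediction error on all m examples.
        errors = [theta0 + theta1 * x - y for x, y in zip(xs, ys)]
        grad0 = sum(errors) / m
        grad1 = sum(e * x for e, x in zip(errors, xs)) / m
        # Update both parameters simultaneously from the same gradients.
        theta0, theta1 = theta0 - alpha * grad0, theta1 - alpha * grad1
    return theta0, theta1


# Hypothetical data lying exactly on the line y = 1 + 2x.
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

theta0, theta1 = batch_gradient_descent(xs, ys)
print(round(theta0, 4), round(theta1, 4))  # 1.0 2.0
```

Because the cost function is convex, there is a single global minimum and no other local minima for the algorithm to get stuck in, provided the learning rate is small enough for the steps to converge.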