Gradient descent is one of those “greatest hits” algorithms that can offer a new perspective for solving problems. Unfortunately, it’s rarely taught in undergraduate computer science programs. In this post I’ll give an introduction to the gradient descent algorithm, and walk through an example that demonstrates how gradient descent can be used to solve machine learning problems such as linear regression.

At a theoretical level, gradient descent is an algorithm that minimizes functions. Given a function defined by a set of parameters, gradient descent starts with an initial set of parameter values and iteratively moves toward a set of parameter values that minimize the function. This iterative minimization is achieved using calculus, taking steps in the negative direction of the function gradient.
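As a minimal sketch of that update rule, the snippet below minimizes a one-variable function, f(x) = (x - 3)^2, by repeatedly stepping against its derivative. The function, the starting point, the step size, and the name `minimize_1d` are all illustrative assumptions rather than part of the example that follows.

```python
def minimize_1d(derivative, start, step_size, num_iterations):
    """Repeatedly step against the derivative (hypothetical helper)."""
    x = start
    for _ in range(num_iterations):
        # Move in the negative direction of the slope at the current x.
        x = x - step_size * derivative(x)
    return x


def f_prime(x):
    # Derivative of the illustrative function f(x) = (x - 3)**2.
    return 2 * (x - 3)


if __name__ == "__main__":
    print(minimize_1d(f_prime, start=0.0, step_size=0.1, num_iterations=100))
```

Because this function is a simple quadratic, each step shrinks the distance to the minimum at x = 3 by a constant factor.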

It’s sometimes difficult to see how this mathematical explanation translates into a practical setting, so it’s helpful to look at an example. The canonical example when explaining gradient descent is linear regression.

*Code for this example can be found here: https://github.com/mattnedrich/GradientDescentExample*

## Linear Regression Example

Simply stated, the goal of linear regression is to fit a line to a set of points. Consider the following data.

Let’s suppose we want to model the above set of points with a line. To do this we’ll use the standard `y = mx + b` line equation, where `m` is the line’s slope and `b` is the line’s y-intercept. To find the best line for our data, we need to find the best set of slope `m` and y-intercept `b` values.

A standard approach to solving this type of problem is to define an error function (also called a cost function) that measures how “good” a given line is. This function will take in an `(m,b)` pair and return an error value based on how well the line fits our data. To compute this error for a given line, we’ll iterate through each `(x,y)` point in our data set and sum the squared distances between each point’s `y` value and the candidate line’s `y` value (computed at `mx + b`), then divide by the number of points to get an average. It’s conventional to square this distance to ensure that it is positive and to make our error function differentiable. In Python, computing the error for a given line will look like:

    # y = mx + b
    # m is slope, b is y-intercept
    def computeErrorForLineGivenPoints(b, m, points):
        totalError = 0
        for i in range(0, len(points)):
            totalError += (points[i].y - (m * points[i].x + b)) ** 2
        return totalError / float(len(points))
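To make the inputs concrete, here is a small usage sketch. It assumes `points` is a list of objects with `x` and `y` attributes; the `Point` namedtuple and the sample values are made up for illustration and are not the post's data set.

```python
from collections import namedtuple

# Hypothetical container: any object with .x and .y attributes would do.
Point = namedtuple("Point", ["x", "y"])


def computeErrorForLineGivenPoints(b, m, points):
    # Mean of the squared vertical distances between each point and the line.
    totalError = 0
    for point in points:
        totalError += (point.y - (m * point.x + b)) ** 2
    return totalError / float(len(points))


# Made-up points that lie exactly on y = 2x + 1.
sample_points = [Point(0, 1), Point(1, 3), Point(2, 5), Point(3, 7), Point(4, 9)]

if __name__ == "__main__":
    # The line the points were generated from.
    print(computeErrorForLineGivenPoints(1, 2, sample_points))
    # A poor candidate line (y = -x).
    print(computeErrorForLineGivenPoints(0, -1, sample_points))
```

The first call scores the line the points were generated from, so its error is zero; the second scores a line that fits poorly and returns a much larger value.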

Formally, this error function looks like:

$$\mathrm{Error}_{(m,b)} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i - (m x_i + b)\right)^2$$

Lines that fit our data better (where better is defined by our error function) will result in lower error values. If we minimize this function, we will get the best line for our data. Since our error function depends on two parameters (`m` and `b`), we can visualize it as a two-dimensional surface. This is what it looks like for our data set:

Each point in this two-dimensional space represents a line. The height of the function at each point is the error value for that line. You can see that some lines yield smaller error values than others (i.e., fit our data better). When we run gradient descent search, we will start from some location on this surface and move downhill to find the line with the lowest error.
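One way to get a feel for this surface without a plot is to evaluate the error on a coarse grid of candidate lines. The sketch below does that for a made-up data set; the grid ranges, the `Point` type, and the helper name `error_grid` are illustrative assumptions.

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])  # hypothetical point type


def mean_squared_error(b, m, points):
    # Mean squared vertical distance between the points and y = m*x + b.
    return sum((p.y - (m * p.x + b)) ** 2 for p in points) / float(len(points))


def error_grid(points, m_values, b_values):
    """Return {(m, b): error} for every combination of candidate values."""
    return {(m, b): mean_squared_error(b, m, points)
            for m in m_values for b in b_values}


# Made-up points on y = 2x + 1.
points = [Point(x, 2 * x + 1) for x in range(5)]
m_values = [-1, 0, 1, 2, 3]
b_values = [-1, 0, 1, 2, 3]

if __name__ == "__main__":
    grid = error_grid(points, m_values, b_values)
    best_m, best_b = min(grid, key=grid.get)
    print("lowest error on the grid at m=%s, b=%s" % (best_m, best_b))
```

A grid like this only works for a handful of parameters, which is part of why a guided search such as gradient descent is useful.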

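To run gradient descent on this error function, we first need to compute its gradient. The gradient will act like a compass: it points in the direction of steepest ascent, so stepping in the opposite direction always takes us downhill. To compute it, we will need to differentiate our error function. Since our function is defined by two parameters (`m` and `b`), we will need to compute a partial derivative for each. These derivatives work out to be:

$$\frac{\partial\,\mathrm{Error}}{\partial m} = \frac{2}{N}\sum_{i=1}^{N} -x_i\left(y_i - (m x_i + b)\right)$$

$$\frac{\partial\,\mathrm{Error}}{\partial b} = \frac{2}{N}\sum_{i=1}^{N} -\left(y_i - (m x_i + b)\right)$$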
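A quick way to sanity-check the partial derivatives of this error function is to compare them against finite-difference estimates. The sketch below does so for the mean squared error used in this post; the data, the step `h`, and the helper names are assumptions made for illustration.

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])  # hypothetical point type


def error(b, m, points):
    # Mean squared error of the line y = m*x + b.
    return sum((p.y - (m * p.x + b)) ** 2 for p in points) / float(len(points))


def analytic_gradient(b, m, points):
    """Partial derivatives of the error with respect to b and m."""
    n = float(len(points))
    db = sum(-(2 / n) * (p.y - (m * p.x + b)) for p in points)
    dm = sum(-(2 / n) * p.x * (p.y - (m * p.x + b)) for p in points)
    return db, dm


def numeric_gradient(b, m, points, h=1e-6):
    """Central finite-difference estimate of the same two derivatives."""
    db = (error(b + h, m, points) - error(b - h, m, points)) / (2 * h)
    dm = (error(b, m + h, points) - error(b, m - h, points)) / (2 * h)
    return db, dm


points = [Point(1, 2), Point(2, 3), Point(3, 5), Point(4, 4)]  # made-up data

if __name__ == "__main__":
    print(analytic_gradient(0.0, -1.0, points))
    print(numeric_gradient(0.0, -1.0, points))
```

If the two printed pairs disagree noticeably, the hand-derived formulas (or their implementation) are worth a second look.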

We now have all the tools needed to run gradient descent. We can initialize our search to start at any pair of `m` and `b` values (i.e., any line) and let the gradient descent algorithm march downhill on our error function towards the best line. Each iteration will update `m` and `b` to a line that yields slightly lower error than the previous iteration. The direction to move in for each iteration is calculated using the two partial derivatives from above and looks like this:

    def stepGradient(b_current, m_current, points, learningRate):
        b_gradient = 0
        m_gradient = 0
        N = float(len(points))
        for i in range(0, len(points)):
            b_gradient += -(2/N) * (points[i].y - ((m_current * points[i].x) + b_current))
            m_gradient += -(2/N) * points[i].x * (points[i].y - ((m_current * points[i].x) + b_current))
        new_b = b_current - (learningRate * b_gradient)
        new_m = m_current - (learningRate * m_gradient)
        return [new_b, new_m]
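Calling this step function in a loop gives the full search. The following is a minimal sketch of such a loop, not necessarily how the post's accompanying code is organised: the `Point` type, the runner name `run_gradient_descent`, the data, and the learning rate are all assumptions chosen for illustration.

```python
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])  # hypothetical point type


def stepGradient(b_current, m_current, points, learningRate):
    # One gradient descent update, following the snippet above.
    b_gradient = 0
    m_gradient = 0
    N = float(len(points))
    for p in points:
        residual = p.y - (m_current * p.x + b_current)
        b_gradient += -(2 / N) * residual
        m_gradient += -(2 / N) * p.x * residual
    new_b = b_current - (learningRate * b_gradient)
    new_m = m_current - (learningRate * m_gradient)
    return [new_b, new_m]


def run_gradient_descent(points, starting_b, starting_m, learning_rate, num_iterations):
    """Apply stepGradient repeatedly and return the final [b, m]."""
    b, m = starting_b, starting_m
    for _ in range(num_iterations):
        b, m = stepGradient(b, m, points, learning_rate)
    return [b, m]


# Made-up points on y = 2x + 1, so the best line is known in advance.
points = [Point(x, 2 * x + 1) for x in range(5)]

if __name__ == "__main__":
    b, m = run_gradient_descent(points, starting_b=0.0, starting_m=-1.0,
                                learning_rate=0.05, num_iterations=2000)
    print("b = %.4f, m = %.4f" % (b, m))
```

Using data generated from a known line makes it easy to confirm that the search ends up near the expected slope and intercept.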

The `learningRate` variable controls how large a step we take downhill during each iteration. If we take too large a step, we may step over the minimum. However, if we take small steps, it will require many iterations to arrive at the minimum.
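The effect of the step size is easy to see by running the same search with a few different learning rates. In the sketch below the data set and the three learning rates are made up for illustration; which rates count as too large or too small depends on the data, so they should not be read as general recommendations.

```python
def error(b, m, xs, ys):
    # Mean squared error of the line y = m*x + b.
    n = float(len(xs))
    return sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys)) / n


def step(b, m, xs, ys, learning_rate):
    # One gradient descent update on the mean squared error.
    n = float(len(xs))
    b_grad = sum(-(2 / n) * (y - (m * x + b)) for x, y in zip(xs, ys))
    m_grad = sum(-(2 / n) * x * (y - (m * x + b)) for x, y in zip(xs, ys))
    return b - learning_rate * b_grad, m - learning_rate * m_grad


def error_after(learning_rate, num_iterations, xs, ys, b=0.0, m=-1.0):
    """Run the search from (b, m) and report the final error."""
    for _ in range(num_iterations):
        b, m = step(b, m, xs, ys, learning_rate)
    return error(b, m, xs, ys)


# Made-up data on y = 2x + 1.
xs = [0, 1, 2, 3, 4]
ys = [1, 3, 5, 7, 9]

if __name__ == "__main__":
    for rate in (0.0001, 0.05, 0.2):
        print(rate, error_after(rate, 100, xs, ys))
```

With this data, the smallest rate makes only modest progress in 100 iterations, the middle one gets close to the minimum, and the largest overshoots so badly that the error grows.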

Below are some snapshots of gradient descent running for 2000 iterations for our example problem. We start out at point `m = -1`, `b = 0`. Each iteration, `m` and `b` are updated to values that yield slightly lower error than the previous iteration. The left plot displays the current location of the gradient descent search (blue dot) and the path taken to get there (black line). The right plot displays the corresponding line for the current search location. Eventually we end up with a pretty accurate fit.

We can also observe how the error changes as we move toward the minimum. A good way to ensure that gradient descent is working correctly is to make sure that the error decreases for each iteration. Below is a plot of error values for the first 100 iterations of the above gradient search.
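That check is straightforward to automate by recording the error after every step. The sketch below uses made-up data and an assumed learning rate; `error_history` and `is_non_increasing` are hypothetical helpers, not part of the post's code.

```python
def error(b, m, data):
    # Mean squared error of the line y = m*x + b over (x, y) pairs.
    return sum((y - (m * x + b)) ** 2 for x, y in data) / float(len(data))


def step(b, m, data, learning_rate):
    # One gradient descent update on the mean squared error.
    n = float(len(data))
    b_grad = sum(-(2 / n) * (y - (m * x + b)) for x, y in data)
    m_grad = sum(-(2 / n) * x * (y - (m * x + b)) for x, y in data)
    return b - learning_rate * b_grad, m - learning_rate * m_grad


def error_history(data, b, m, learning_rate, num_iterations):
    """Return the error before the first step and after every step."""
    history = [error(b, m, data)]
    for _ in range(num_iterations):
        b, m = step(b, m, data, learning_rate)
        history.append(error(b, m, data))
    return history


def is_non_increasing(values):
    # True when no value is larger than the one before it.
    return all(later <= earlier for earlier, later in zip(values, values[1:]))


data = [(0, 1.2), (1, 2.9), (2, 5.1), (3, 6.8), (4, 9.3)]  # made-up noisy points

if __name__ == "__main__":
    history = error_history(data, b=0.0, m=-1.0, learning_rate=0.05, num_iterations=100)
    print(history[0], history[-1], is_non_increasing(history))
```

If the check fails, a learning rate that is too large is a common culprit.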

We’ve now seen how gradient descent can be applied to solve a linear regression problem. While the model in our example was a line, the concept of minimizing a cost function to tune parameters also applies to regression problems that use higher order polynomials and other problems found around the machine learning world.

While we were able to scratch the surface of gradient descent, there are several additional concepts worth being aware of that we weren’t able to discuss. A few of these include:

**Convexity**: In our linear regression problem, there was only one minimum. Our error surface was convex. Regardless of where we started, we would eventually arrive at the absolute minimum. In general, this need not be the case. It’s possible to have a problem with local minima that a gradient search can get stuck in. There are several approaches to mitigate this (e.g., stochastic gradient search).

**Performance**: We used vanilla gradient descent with a learning rate of 0.0005 in the above example, and ran it for 2000 iterations. There are approaches, such as line search, that can reduce the number of iterations required. For the above example, line search reduces the number of iterations to arrive at a reasonable solution from several thousand to around 50.

**Convergence**: We didn’t talk about how to determine when the search finds a solution. This is typically done by looking for small changes in error iteration-to-iteration (e.g., where the gradient is near zero).
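As one illustration of the convergence point, the loop can stop once the error changes by less than some tolerance between iterations instead of running a fixed number of steps. This is a sketch under assumed values: the tolerance, the iteration cap, the data, and the name `descend_until_converged` are all hypothetical.

```python
def error(b, m, data):
    # Mean squared error of the line y = m*x + b over (x, y) pairs.
    return sum((y - (m * x + b)) ** 2 for x, y in data) / float(len(data))


def step(b, m, data, learning_rate):
    # One gradient descent update on the mean squared error.
    n = float(len(data))
    b_grad = sum(-(2 / n) * (y - (m * x + b)) for x, y in data)
    m_grad = sum(-(2 / n) * x * (y - (m * x + b)) for x, y in data)
    return b - learning_rate * b_grad, m - learning_rate * m_grad


def descend_until_converged(data, b, m, learning_rate,
                            tolerance=1e-12, max_iterations=100000):
    """Step until the error changes by less than `tolerance`.

    Returns (b, m, iterations_used). The iteration cap guards against
    searches that never settle, e.g. with a learning rate that is too large.
    """
    previous = error(b, m, data)
    for iteration in range(1, max_iterations + 1):
        b, m = step(b, m, data, learning_rate)
        current = error(b, m, data)
        if abs(previous - current) < tolerance:
            return b, m, iteration
        previous = current
    return b, m, max_iterations


data = [(x, 2 * x + 1) for x in range(5)]  # made-up points on y = 2x + 1

if __name__ == "__main__":
    print(descend_until_converged(data, b=0.0, m=-1.0, learning_rate=0.05))
```

A tighter tolerance gives a more precise answer at the cost of more iterations, so the right value depends on how accurate the fit needs to be.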

For more information about gradient descent, linear regression, and other machine learning topics, I would strongly recommend Andrew Ng’s machine learning course on Coursera.

## Example Code

Example code for the problem described above can be found here: https://github.com/mattnedrich/GradientDescentExample

**Edit**: *I chose to use a linear regression example above for simplicity. We used gradient descent to iteratively estimate m and b; however, we could have also solved for them directly. My intention was to illustrate how gradient descent can be used to iteratively estimate/tune parameters, as this is required for many different problems in machine learning.*
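For reference, the direct solution mentioned in the edit can be sketched in a few lines using the standard least-squares formulas for a single input variable: the slope is the sum of co-deviations of x and y divided by the sum of squared deviations of x, and the intercept follows from the means. The function name and the data below are illustrative assumptions.

```python
def fit_line_closed_form(xs, ys):
    """Least-squares slope and intercept for y = m*x + b, solved directly.

    Assumes at least two distinct x values, otherwise the slope is undefined.
    """
    n = float(len(xs))
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Slope: co-deviation of x and y over the squared deviation of x.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    m = sxy / sxx
    # Intercept: the fitted line passes through the point of means.
    b = mean_y - m * mean_x
    return m, b


if __name__ == "__main__":
    # Made-up points on y = 2x + 1.
    print(fit_line_closed_form([0, 1, 2, 3, 4], [1, 3, 5, 7, 9]))
```

For a two-parameter line this is simpler than iterating, which is why the post treats linear regression as a teaching example rather than a case where gradient descent is required.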

Maybe I’m missing something, but the (y-intercept, slope) points plotted in the “Gradient Search” graphs don’t seem to correspond to the blue lines being generated in the “Data and Current Line” graphs. The values for slope seem accurate but the y-intercepts seem off.

Hi Chris, thanks for the comment. The origin (0,0) doesn’t correspond to the bottom left of the plot (rather it’s one tick in on each axis) so it might be a little confusing to read. What specifically looks off?

This is very interesting. As I don’t have a comp sci background, can you explain when you would use gradient descent to solve a linear regression problem vs. using OLS? Thanks.

Hi Ji-A. I used a simple linear regression example in this post for simplicity. As you alluded to, the example in the post has a closed form solution that can be solved easily, so I wouldn’t use gradient descent to solve such a simplistic linear regression problem. However, gradient descent and the concept of parameter optimization/tuning is found all over the machine learning world, so I wanted to present it in a way that was easy to understand. In practice, my understanding is that gradient descent becomes more useful in the following scenarios:

1) As the number of parameters you need to solve for grows. In our example we had two parameters (m and b). In Andrew Ng’s Machine Learning class on Coursera, he suggests that when you have more than 10,000 parameters gradient descent may be a better solution than the normal equation closed form solution. See the video here: https://www.youtube.com/watch?v=B3vseKmgi8E&feature=youtu.be&t=11m27s

2) When your system of equations is non-linear. Logistic regression (a common machine learning classification method) is an example of this.

3) When an approximate answer is “good enough”.

Thanks for the information! I knew there were nuances I was missing.

Thanks for neatly explaining the concept. One question, however: where are you getting the x and y values to compute the totalError and the two new gradients in your code snippets? I can’t figure that out, please help me understand.

The x and y values come from the points (i.e., the data set). The points are iterated over and each point (i.e., each (x, y) pair) contributes toward the totalError and gradient values.

1) Crystal clear. I have one doubt: if the error surface has only one local minimum (the absolute minimum), then we can set the derivative equal to zero, which is nothing but solving simultaneous equations, right? The solution we get from this method will be unique, so in this case we don’t need to worry about the GD algorithm and the number of iterations.

2) But in practice we don’t know how many local minima the error surface will have (say we have m local minima; the gradient will be zero at all of them). In the iterative process (the GD algorithm), we stop when we get near any one of the local minima (and reaching any one of them may take many iterations, is that right?).

3) As you mentioned, is it always true that the total error in the current iteration should be less than in the previous iteration? (It may fluctuate; does it depend on the learning parameter?)

- your post is too good

Hi Naresh,

Question 1 – Yes, that is correct. We could solve directly for it (as we have two equations, two unknowns, etc.). I chose a simple example to explain the gradient descent idea/concept. However, you could have a problem where you can’t solve for it directly or the cost of doing so is high (see my reply above to Ji-A).

Question 2 – Yes, that is also correct. It’s possible to get stuck in local minima. Typically you can use a stochastic approach to mitigate this where you run many searches from many initial states and choose the best result amongst all of them.

Question 3 – In general, the error should always monotonically decrease (if you are truly moving downhill in the direction of the negative gradient). However, depending on your parameter selection (e.g., the learning rate) it is possible to diverge.

Hope that helps.

Also want to understand how the differentiation is always arriving at a descent.

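Vinsent, gradient descent is able to move downhill because it uses calculus to compute the slope of the error surface at each iteration and then steps in the opposite direction of that slope. As long as the step is small enough, this lowers the error.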

It’s a nice article, but I have a question about choosing the m value: what should the ideal m value be? I am working on a similar algorithm but am not able to solve it.

In my example above m was a parameter (the line’s slope) that we are trying to solve for. So we choose a random initial m value and gradient descent updates it each iteration with a slightly better value until it arrives at the best value (or gets stuck in a local minimum).

Just pasting my question here : http://stackoverflow.com/questions/26314066/intersection-of-curve-and-find-x-y-point-using-x-and-y-data-point

Thanks, I appreciate your help.

Hi, I tried to use the code in your post, but somehow it does not converge. Do you know why?

http://nbviewer.ipython.org/github/tikazyq/stuff/blob/master/grad_descent.ipynb

Try using a smaller learning rate. I ran your code with a learning rate of 0.0001 and it seemed to be converging.

Fantastic article!

Very well crafted.

Thank you.

Hi, thanks for the article.

Do you, by any chance, have the original points to test the methods?

Thanks

Very clear example! Could you tell what kind of data structure the ‘points’ variable is? I can’t figure it out. Looks like an array of Point classes, since you use the [] notation to access a point and the dot notation to access x and y of a point.

Points is a list of Point objects (e.g., a class with an x and y property).

Is there somewhere that we can see the whole code example? The snippets are helpful but not entirely sufficient.

Thanks!

Thanks for the comment. I will work to put together a more complete code example and share it.

I have put together an example here: https://github.com/mattnedrich/GradientDescentExample

Nicely explained!! Enjoyed the post. Thanks