Why Least Squares, Not Least Absolute Deviations?
In explaining regression lines to novices, I always say that to relate two variables we can plot them on a graph, and then a line through the middle of the dots shows the relation between them. We want to make “through the middle” precise, though, so we have a computer find the line that minimizes the sum of the squared distances from each point vertically up or down to the line.
There are two tough things to explain: Why “vertically”? Why square the deviation?
The reason we use a vertical distance is that we are imagining a relationship between X and Y such that Y = a + b*X + e, where e is the error, the amount we have to add on to our estimate a + b*X to get the true result. Since Y is on the vertical axis, adding something on to the estimate means moving a vertical distance on the graph.
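Here is a minimal sketch of that idea in plain Python, using the textbook closed-form formulas for a and b. The data and the helper names (`fit_least_squares`, `vertical_errors`) are made up for illustration; any statistics package would do the same job.

```python
def fit_least_squares(xs, ys):
    """Return (a, b) minimizing the sum of squared vertical errors for y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx               # slope
    a = mean_y - b * mean_x     # intercept: the line passes through the point of means
    return a, b


def vertical_errors(xs, ys, a, b):
    """The e in Y = a + b*X + e: how far each point sits above (+) or below (-) the line."""
    return [y - (a + b * x) for x, y in zip(xs, ys)]


# Hypothetical data, invented for this example.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b = fit_least_squares(xs, ys)
errors = vertical_errors(xs, ys, a, b)
sum_sq = sum(e ** 2 for e in errors)

print("intercept a =", round(a, 4), " slope b =", round(b, 4))
print("sum of squared vertical errors =", round(sum_sq, 4))
```

Nudging a or b away from the fitted values and recomputing the sum of squares should always give a larger number, which is all that "least squares" means.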
What I just realized was a way to justify squaring the deviation. To be sure, maximum likelihood estimation with normal disturbances gets you to that result, but that motivation is too complicated for a novice. And it’s no good saying that we care about outliers more than about points near the line, because we don’t always. Not all loss functions are quadratic, and in fact in regressions it often makes sense to forget about the outliers altogether.
The new way to explain it is this. The least-absolute-deviation estimator is a median estimator (well, I might not say that to the novice). Suppose we found the least-absolute-deviation best regression line. Then we take any point that lies above the line, and see what happens if we move it up to take a bigger value. What is sensible is that the regression line should move up, which it would if we were using least squares. But with least-absolute deviations, it won’t move at all. That’s because if we shift the regression line up a little, it gets closer to every point above it, including the changed one, but it gets farther by the same amount from every point below it. The changed point still counts as just one point above the line, no matter how far up it now sits, so that trade-off is exactly what it was before the move and the total distance cannot be reduced. So there is no reason to move the line. An estimator that doesn’t change when the data changes, even when the data changes a lot, is usually not satisfactory.
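A small experiment shows the difference. This is only a sketch on made-up data: the least-absolute-deviation fit is found by brute force, relying on the known fact that some best line for a simple regression passes through at least two of the data points. The function names are invented for the example.

```python
from itertools import combinations


def fit_least_squares(xs, ys):
    """Closed-form least-squares line y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    return mean_y - b * mean_x, b


def fit_least_absolute(xs, ys):
    """Brute-force LAD: try the line through every pair of points, keep the best."""
    best_line, best_total = None, float("inf")
    for i, j in combinations(range(len(xs)), 2):
        if xs[i] == xs[j]:
            continue  # a vertical line is not a candidate
        b = (ys[j] - ys[i]) / (xs[j] - xs[i])
        a = ys[i] - b * xs[i]
        total = sum(abs(y - (a + b * x)) for x, y in zip(xs, ys))
        if total < best_total:
            best_line, best_total = (a, b), total
    return best_line


# Hypothetical data: the point at x = 3 lies above both fitted lines.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys_before = [1.0, 2.0, 5.0, 4.0, 3.0]
ys_after = [1.0, 2.0, 20.0, 4.0, 3.0]   # same data, that one point moved far up

ols_before = fit_least_squares(xs, ys_before)
ols_after = fit_least_squares(xs, ys_after)
lad_before = fit_least_absolute(xs, ys_before)
lad_after = fit_least_absolute(xs, ys_after)

print("least squares (a, b):", ols_before, "->", ols_after)
print("least absolute (a, b):", lad_before, "->", lad_after)
```

On this data the least-squares intercept rises (the slope stays put only because the moved point sits at the average X), while the least-absolute-deviation line is the same before and after.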
November 4th, 2006 at 4:49 pm
I think this is the projection theorem in Hilbert space. A Hilbert space is a linear space with an inner product. The norm then comes from the inner product. You have your dependent random variable y and your explanatory random variables. Let S be the set of all linear combinations of your explanatory variables. One can then show that there exists an element in S which is closest to y in the inner product norm. This is the linear combination of explanatory variables which is closest to y. Furthermore, the “error” vector y - yhat is orthogonal to yhat.
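The orthogonality claim can be checked numerically in the finite-sample version of this setup. The sketch below assumes the vectors are just the observed data in R^n with the ordinary dot product as inner product, and S is spanned by a constant and one explanatory variable; the data are made up.

```python
def dot(u, v):
    """Ordinary inner product on R^n."""
    return sum(ui * vi for ui, vi in zip(u, v))


# Hypothetical data.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [1.0, 2.0, 5.0, 4.0, 3.0]
ones = [1.0] * len(x)

# Solve the 2x2 normal equations for projecting y onto span{ones, x}.
n, sx, sxx = dot(ones, ones), dot(ones, x), dot(x, x)
sy, sxy = dot(ones, y), dot(x, y)
det = n * sxx - sx * sx
a = (sy * sxx - sx * sxy) / det
b = (n * sxy - sx * sy) / det

y_hat = [a + b * xi for xi in x]                   # the element of S closest to y
error = [yi - fi for yi, fi in zip(y, y_hat)]      # y - yhat

# The error is orthogonal to each spanning vector, hence to everything in S.
print("<error, ones>  =", dot(error, ones))
print("<error, x>     =", dot(error, x))
print("<error, y_hat> =", dot(error, y_hat))
```

All three inner products should come out as zero up to floating-point rounding.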
October 25th, 2007 at 7:49 pm
You make a good point, but I have a different perspective, which is that the most sensible error measure for many applications is the mean absolute error, which is exactly what LAD regression is minimizing. As I like to say, “Who’s ever seen a square dollar (pound, euro, etc.)?”
I make my case here:
http://matlabdatamining.blogspot.com/2007/10/l-1-linear-regression.html