For math format like this:
An Ultra-Small Sample Example
As I was suffering from food poisoning in Florida last week, I had many hours in between vomiting bouts to while away woozily. So I thought about heteroskedasticity. Suppose you are estimating the temperature of lake water using various thermometers, and the observations are 50, 52, and 64. We will assume the thermometers are all unbiased, so if x_i is observation i and T is the true temperature, E(x_i) = T. The obvious estimator is the sample mean, (50+52+64)/3 = 55.3 (about). This is also the best estimator if the thermometers are all equally good. But what if they are not? That is, what if the disturbances are heteroskedastic?
If we know the variances of the disturbances of each thermometer, it is well known how to proceed, though I forget exactly how at the moment. What you do is to weight each observation by its accuracy. Thus, the right way might be that if the variances of the thermometers are 60, 30, and 10, and the disturbances are normally distributed, you should use the estimator (50/60 + 52/30 + 64/10) /(1/60+ 1/30+ 1/10) = 59.8 (about). But usually you won't know the variances.
So you need to estimate the variances, which you *can* do, if you have at least 3 observations. Consider observation 1, which is 50. What is a good estimate of its variance? First, we need to estimate the population mean. We can do that from observations 2 and 3, which have a sample mean of (52+64)/2 = 58. Then we use our one data point left for estimating the variance, 50. to estimate the variance to be (58-50)^2/1 = 64. We can do the same for observation 2, which uses the sample mean of (50+64)/2 = 57 so the variance is (57-52)^2/1 = 25. Finally we do it for observation 3, which uses the sample mean of (50+52)/2 = 51 so the variance is (64 - 51)^2/1 = 169. Now we estimates of the variance, so we can do a weighted average, as before.
If we want to be fancy, we could iterate, using the weights in the 2-observation mean estimates too.
There is a problem here, though. The reason that with homoskedasticity the sample mean is the best estimator is that it makes the fullest use of the available information. It weights all three observations equally, making use of all of them. Another unbiased estimator is to use just the first two observations, and always throw out the third one, whatever it may be. That wastes information. Similarly, weighting is wasting, if the data is homoskedastic. So there is a cost to weighting.
Weighting also has a benefit, because it lets us use our best information most heavily. If we know exactly how good the information is, there is no downside to weighting. But we don't. We're estimating it. So when we weight, we are hoping our estimation error is small enough that it's worth messing with equal weighting. Is it small enough? That needs proving.
The proof shouldn't be too difficult. But it is too hard to do in my head. I need to look at the proof of the standard weighting scheme when you know the variances. That, I think, comes from doing a calculus minimization of mean squared error for an unbiased estimator.
There is another problem. How good my estimator is depends on the true state of the world. Suppose the data really *is* homoskedastic. Then my estimator will just make things worse. The same is true if the data is just slightly heteroskedastic but has high variances, because in a given sample, it may well happen that one observation is an outlier and I will weight it less heavily than the other two. So the quality of the estimator depends on the amount of heteroskedasticity, which we don't know in advance.
The only way to resolve that problem, I think, is to go Bayesian. If we have a distribution of possible variances (and, more important, of variances of variances), we can find out whether the estimator does well in expectation against that distribution. For example, we might assume that the variance of each observation is drawn randomly and independently from 0 to 100. Better, probably, would be to see how well the estimator does for a variety of true states of the world and then to eyeball it and decide whether it's worth using. I expect what I will do is use python to find two numerical examples for Monte Carlo studies, one where the heteroskedasticty correction helps, and one where it hurts.
Ultra-Small Samples as Low-Fat Modelling
The ultra-small sample I just used is an example of Low-Fat Modelling (or "No-Fat", or "MIT-Style", or, confusingly, "NewJersey Style modelling in computer science, since they use "MIT-Style" to mean the exact opposite). Low-Fat Modelling is the style of economic theory made popular by Joseph Stiglitz and perfected by Jean Tirole. It models an economic idea using the simplest, most specific, model possible, as opposed to the older style of trying to use the fanciest, most general model possible. Thus, instead of assuming there are $N$ goods with a continuum of possible prices, you assume there are 2 goods with 2 prices. This will remind you of international trade's 2x2x2 models (2 countries, 2 goods, 2 inputs), since they have long used Low-Fat Modelling. Sometimes you need at least 3 countries to demonstrate an effect (e.g. the Transfer Paradox), so you do that, but you don't go to N countries.
Thus, for a theorist like me it is natural when thinking about econometric theory to try to estimate a mean rather than a regression coefficient, and to try it with 3 observations rather than N. If we can understand it for a mean with three observations, we can understand it for a regression with M coefficients for a sample with N observations. I would have liked to reduce it to 2 observations here, but then you can't estimate the variance of an observation using the other observations' data.
A Large-Sample Example
Suppose we are trying to estimate mu and we observe x_i = mu + epsilon_i, where epsilon_i is distributed normally, but with a twist. Some percentage theta of the time, epsilon ~ N(0, sigma^2_1), and some percentage of the time (1-theta), epsilon ~ N(0, sigma^2_2), with sigma^2_1 < sigma^2_2. We do not observe mu, theta, sigma^2 _1, or sigma^2_2, and we would like to estimate them all. We do, however, have an infinite amount of data.
Here's what we can do. Start with the easiest parameter, mu. For mu, our estimate will simply be the sample mean, the average of the x_i. The sample mean is consistent even under heteroskedasticity, so in the limit we have an exact estimate of mu.
Note, too, that we can compute the sample variance and get an exact estimate of theta*sigma_1^2 + (1-theta)*sigma_2^2 from it with a bit of calculation that I won't try to work through right now. Alas, that gives us just one statistic for three parameters, but it will be useful later.
The hard task is to figure out sigma_1 and sigma_2. To do that, start by calculating the mean absolute deviation Z of an observation from mu, that is: Z = the limit as N goes to infinity of Sum_i^N |x_i- mu|/N. Now divide the sample in two parts, depending on whether x_i is less than distance W away from mu-hat (which is both the population mean and, with infinite data, the sample mean). Choose W so P percent of the observations in the Far Sample are more than W greater than mu-hat.
The Far Sample will mostly be observations with variance sigma_1^2, though it will be a mix, since sometimes observations with a small variance will turn out to be much bigger than mu anyway.
Now I think maybe I have to assume that someone has taken pity on us and told us that theta = .7. We aren't free and clear, because Z is still one statistic for two unknown parameters, sigma_1^2 and sigma_2^2, but that kind person still deserves our thanks. Now that we have theta and sigma_2^2, we can use the sample variance equation to compute sigma_1^2. And we're done. A consistent estimator of the variances, and we can figure out weights and standard errors.
The Asymptotics Puzzle
This is a good setting for thinking about how we choose estimators. We economists love asymptotic theory, but of course it doesn't actually apply unless you have an infinitely large sample, which we never do. So why do we like good asymptotic properties? I think it's because if the sample is large, any estimator with good asymptotic properties is pretty good. Is that true? Or are there practical consistent estimators that are horrible until you pass some very large threshold? I can think of a silly one like that. Suppose I estimate the mean by setting it equal to 42 if N < 900 and equal to the sample mean if N > 899. That is a consistent estimator, but it is a very bad for N < 900, unless you are lucky enough that mu = 42.
Do asymptotics tell us anything about our Very Small Sample Example? Yes, I think. The idea of my estimators in both examples is the same. If an observation takes a value far from the sample mean, you can guess that it probably has a high variance and should be downweighted, and you can estimate the variance using the information at hand.
Monte Carlo Studies
I should do some Monte Carlos on these two examples. For each, I should try to pick two cases, a Good Case and a Bad Case, to show when the estimator does well and when it does poorly. I should not try to do a range of cases because that would be too confusing, and it is hopeless to try to span the parameter space. Rather, what is useful is to show the sort of situation where an estimator is useful and the sort of situation where it is not. My Python programming ability is plenty good enough to do this.
It would be nice to figure out asymptotic efficiency, but I'd have to relearn my econometric theory for that. Perhaps I could find a co-author fresh from grad school for that.
Median Estimators and Loss Function
Think hard about this. Do the math. One objeciton yu might make is that even without heteroskedasticity, the outlier of the three observations is most likely to be a big error. So why not drop it? The reason is that it is also th emost importnat IF it is correct. And that is more important with squared error loss. In fact, I bet it is for any convex loss function. For Absolute deviation loss, I bet it evens out, and it doens' matter if you drop it. That's why that give su a MEDIAN estimator. For Concave loss functions, you DO want to drop it.
But heteroskedasticity puts a thumb o th eblance, a thumb on the side of dropping the observation, or at least weighting it less
- I bet this is all related to shirnkage estimators too. Need to do some deep thinking here. A shrinkage estimator for James Stein dominance needs 3 different key parameters to be estimated (three "means", the direct effect of three variables, NOT three "nuisance parameters" like the variance and other higher moments). But there HAS to be a link.
- I need to rewrite my Shrinkage paper. VAR is bad notation for the X-variables. Maybe DATA is better.
- In James-Stein, I bet the real secret is this (and the Grand Mean estimator too). We have three variables with disturbances attached to their three different means. Suppose we have three numbers that add up to 180: 10, 20, and 150. The average is 60. We expand the little ones and shrink the big ones 50% towards the mean, to get 15, 30, and 105. In so doing we have added absolute amounts of 5 and 10 to the little numbers and subtracted 55 from the big number. There is some reason why that 55 helps us with mean squared error more than the 5 and 10 would hurt us.
- "Understanding Shrinkage Estimators: From Zero to Oracle to James-Stein." The standard estimator of the population mean is the sample mean, which is unbiased. Constructing an estimator by shrinking the sample mean results in a biased estimator, with an expected value less than the population mean. On the other hand, shrinkage always reduces the estimator's variance and can reduce its mean squared error. This paper tries to explain how that works. I start with estimating a single mean using the zero estimator and the oracle estimator, and continue with the grand-mean estimator. Thus prepared, it is easier to understand the James-Stein estimator, in its simple form with known homogeneous variance and in extensions. The James-Stein estimator combines the oracle estimate's coefficient shrinking with the grand mean estimator's cancelling out of overestimates and underestimates.
This is an example of empirical Bayes. A typical example of empirical Bayes estimation would be for estimating the batting ability of each of a number of baseball players. We would start by computing the grand sample mean of the batting averages. That would become our prior--- thus "empirical Bayes". Then we would use that prior and Bayes's rule to estimate the true mean for each player, using only his own success at batting. If he had only a few at-bats, the prior would make a big difference. If he had lots, his own data would be far more important and our estimate would be very close to his own sample mean.
Here, I am using the data to estimate the observation's variance. I am not using Bayes's Rule at all, nor a conjugate prior distribution, though. But I am using the idea of estimating an ancillary, nuisance, parameter with the data first, to use in estimating the parameter of interest.
- The unweighted mean is unbiased, and the sample variance is too. Are they consistent? Maybe not; it depends on how we define consistency. If it means to replicate the first $M$ observations an infinite numbers of times with independent draws but the same variances as in the first $M$, then yes, it is consistent. But if we were to increase the number of observations infinitely but with observations getting worse and worse in quality, with higher variances, then no, it might not be consistent. The weighted mean estimator would be consistent even then, but probably not the unweighted mean.
- The sample variance is unbiased because there are no X variables. If there were, and they were correlated with the population variances, then there would be bias. This is the classical bias: that if X_i being bigger means $\sigma^2_i$ is bigger then the sample variance underestimates the population variance.
- The White statistic doesn't work with 0 degrees of freedom-- that is, with just a constant and no X variable.
- The White heteroskedasticity consistent variance estimator might well be what I have in mind here. Kennedy's book is good on this.
- This is related to empriical Bayes. I take as my nuisance parameter estimate the sample mean from the other two obsrvations. This is why it is a shrinkage estimator. But that is still not actual empirical bayes. It is empricial bayes if I take the mean sample varaince as a prior adn then estimate each observation's variance using bayes rule and that prior. But actually, I should do "cheap bayes", a phrase I invent here. Instead of bothering with a conjugate prior adn using bayes's rule, etc., I should just average teh prior adn the data. That is honest. It is just modelling. We don't know the actual distributions, so why not just pick somthing simple? Using actual distributions is good if you are thinking about adding more data and seeing
how the estimate chagnes as data is more important relative to prior, but here we never get more than 1 data point and a prior, and we have no principled way to combine them really, because we don't know how STRONG the prior should be relative to the datum.
- My Python program should be for 3 data points, picked randomly from mu=100 and epsilon distributed N(0, sigma^2), with sigma^2 distributed U(0, 15). I should do the unweighted mean, the known-weighted mean, the diffuse-prior-weighted mean, the empirical bayes weighted mean, and the iterated empirical bayes weighted mean. Note that I am NOT doing the estimate where each observation has a different mean-- that is a different sort of empirical bayes situation, where you'd shrink towards the grand mean; here, we know a priori that all observations ahve the same mean.
- The shrinkage estimator works by guessing which observation has a big disturbance, and weighting it less. The heteroskedastic correction estimator works by figuring out which observation has a big \sigma^2_i and weighting it less. There is a subtle difference. The first-- the big disturbance-- is really our goal. We don't mind big \sigma_2_i if in the prticular realization the disturbance is small. But we don't know the population mean in advance. We make a preliminary guess. If all observations have the same sigma^2, then we are guessing at which observation has the big epsilon by using the Y and our preliminary estimate. If we know observations ahve different sigma^2, and we know what it is for each, we will weight ignoring the sample mean; we can do better than that. We could actaully COMBINE the twokinds of estimators then. We could do a weighted estimate based on data quality, and then shrink it to get individual epsilons accounted for.
- I am not quite using shrinkage, because I am using the mean of two observations to weight the third. In shrinkage, we use the grand mean to weight each of them. Probably my way is more efficient. Or is it equivalent somehow?
- I need a good name, a short one, for my estimator. Quality-weighted estimator. It really is perceived qualtyweighted estimator,but that is too long. Estimated Weighted Mean is best.
*A big question: how do I estimate the variance of an observation? Do I use its residual? Or do I take that and shrink it, or move it to the grand sample residual?