\documentclass[12pt,epsf]{article}
% pdflatex
\usepackage{graphicx}
\usepackage{amsmath}
\usepackage{amssymb}
%\reversemarginpar
%\topmargin -.5in
%\oddsidemargin 0in
% \textheight 9in
% \textwidth 6in
\pagestyle{myheadings}
\markboth{Rasmusen }{Rasmusen}
\begin{document}
\newcommand{\bbeta}{\mbox{\boldmath$\beta$}}
\newcommand{\bGamma}{\mbox{\boldmath$\Gamma$}}
\newcommand{\bvarepsilon}{\mbox{\boldmath$\varepsilon$}}
\newcommand{\bPhi}{\mbox{\boldmath$\Phi$}}
\newcommand{\bnu}{\mbox{\boldmath$\nu$}}
%\baselineskip 12pt
\parindent 24pt
\parskip 10pt
\titlepage
\vspace*{12pt}
\begin{center}
\begin{Large}
{\bf Testing Whether Data Follows a Power Law Distribution
} \\
\end{Large}
\vspace*{12pt}
24 January 2006 \\
\bigskip
Eric Rasmusen \\
\vspace*{.3 in}
THESE ARE NOTES FOR A FUTURE PAPER, NOT A REAL PAPER YET
\bigskip
{\it Abstract}
\end{center}
\begin{small}
A likelihood ratio test comparing a splined power law to a single-parameter
power law is a good way to test whether a distribution follows a power law.
\bigskip
\noindent
\hspace*{20pt}Rasmusen: Indiana University, Kelley School of
Business, BU 456, 1309 E. 10th Street, Bloomington, Indiana,
47405-1701.
Office: (812) 855-9219. Fax: 812-855-3354.
Erasmuse@indiana.edu. http://www.rasmusen.org.
For the most recent draft, see: \\
http://www.rasmusen.org/papers/netmetrics-rasmusen.pdf.
I thank Manu Raghav for helpful comments and Barick Chung for research
assistance.
\end{small}
%%-----------------------------%----------------
\newpage
\begin{raggedright}
%\baselineskip 12pt
\parindent 24pt
\noindent
{\bf 1. Introduction}

What I will do in this note is show how to test whether data follow a power
law. A naked-eye plot can suggest one. Lehmann, Lautrup, and Jackson (2003)
use a spline in their well-known paper on citation networks in physics. They
compare it with a negative exponential, I think, and cannot decide between
the two.
You could use the Newman or Goldstein maximum likelihood estimator in doing
this. Then find the likelihood of each of the two models and see whether the
log ratio flunks a chi-squared test. This will show whether you have concavity
or convexity instead of a power law. Also, a one-sided test is easy. Also,
most important, you can decide whether the distribution is close to being a
power law. You have to pick your own criterion for this; I can come up with
one, though: something based on the two models estimated, and how much they
differ--- the area between them in the graph. A sort of $R^2$, comparing
within variance to between variance.
\includegraphics[width=150mm]{net1-newman.jpg}
\begin{center} {\bf Figure 1: Distributions Similar to the Power Law}
\footnote{From Newman, Figure 5.}
\end{center}
\bigskip
\noindent
{\bf 2. Estimation: OLS Is Biased and Inconsistent}

The power law density (for a continuous distribution) or probability (for a
discrete one) is
\begin{equation} \label{e100}
p(x) = K x^{-\gamma},
\end{equation}
where $K$ is a constant chosen to make the density integrate (or the
probabilities sum) to one over the support of the distribution. The support
runs from some strictly positive value $x_{min}$ to infinity.
If the distribution is continuous (in which case it is called the Pareto
distribution) with support $[x_{min}, \infty)$, then
\begin{equation} \label{e100}
K = (\gamma -1) x_{min}^{\gamma-1}.
\end{equation}
If the distribution is discrete (in which case it is called the Zeta or Zipf
distribution) with support $\{1,2, \ldots \}$, then
\begin{equation} \label{e100}
{\displaystyle K = \frac{1}{\zeta(\gamma)}, \;\;\; {\rm where} \;\;\;
\zeta(\gamma) = \sum_{i=1}^{\infty} i^{-\gamma}. }
\end{equation}
We will work with the discrete distribution here.
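The normalization is easy to check numerically. The sketch below (in Python rather than the Matlab used later in the note; the truncation point and the value $\gamma = 2.5$ are arbitrary illustrative choices) approximates $\zeta(\gamma)$ by a finite sum plus an integral estimate of the dropped tail, so that the probabilities $K x^{-\gamma}$ sum to one:

```python
import math

def zeta(gamma, n_terms=100000):
    # Truncated zeta series plus an integral estimate of the dropped tail:
    # the integral from n_terms to infinity of x^(-gamma) dx.
    partial = sum(i ** -gamma for i in range(1, n_terms + 1))
    tail = n_terms ** (1.0 - gamma) / (gamma - 1.0)
    return partial + tail

gamma = 2.5
K = 1.0 / zeta(gamma)  # normalizing constant of the discrete power law
probs = [K * x ** -gamma for x in range(1, 2000)]  # p(x) = K x^(-gamma)
```

Summing the probabilities over a long finite support should come out just below one, the small shortfall being the tail probability beyond the last term.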
Suppose we have data consisting of observations $x_1, x_2, \ldots, x_n$. This
will generate an empirical probability function showing the frequency of a
count, $f(z)$, with support on the possible count values $\{1, 2,
\ldots, z_{max} \}$, the function that is usually graphed. One estimation
approach is to use regression analysis on this generated dataset $(z, f(z))$.
Most simply, one could use ordinary least squares, estimating $log(f) = \alpha -
\beta log(z)$, which corresponds to $f = e^{\alpha} z^{-\beta}$. This would be
wrong: the estimator is biased, for reasons I will explain.
Figure 2 shows the true and empirical probability functions. The horizontal
axis shows $x \equiv log(count)$, where the count equals $1,..., \infty$, so
$x$ goes from 0 to $\infty$. The vertical axis shows $y \equiv log
(frequency(count))$. The observed frequency of a count is 0 or greater, but
since $log(0) = -\infty$, any count with an observed frequency of 0 is dropped
from the dataset, so the remaining observed values of $y$ are 0 or greater.
These observed values are shown as stars in Figure 2. Ordinary least squares
calculates the ``OLS line'' which fits those observations as well as possible.
The expected frequency of a count need not be an integer, so it can be less than
1, which means its logarithm can be negative. The large dots show these
expected frequencies. They lie on the ``true line,'' which plots $y = \alpha -
\beta x$. None of these expected frequencies is zero, so none of the counts
would need to be dropped.
The observed frequency is the expected frequency plus a stochastic disturbance
term, $\varepsilon$, so $y = \alpha - \beta x + \varepsilon$. This disturbance
term has some distribution $g(\varepsilon)$. Figure 2 shows the realized values
of the disturbances as vertical distances, some positive, some negative.
\includegraphics[width=150mm]{net2-ols.jpg}
\begin{center} {\bf Figure 2: The Bias of Ordinary Least Squares }
\end{center}
Let us denote the vector of $T$ observed values of $y$ by ${\bf y}$, the matrix
with one column of $T$ 1's and one column of the $T$ observed values of $X$ by
${\bf X}$, the vector of $T$ unobserved values of $\varepsilon$ by
$\bvarepsilon$, and the vector $(\alpha, -\beta)$ by $\bGamma$. Then
\begin{equation} \label{e100}
{\bf y = X } \bGamma + \bvarepsilon,
\end{equation}
and the OLS estimator is
\begin{equation} \label{e100}
\hat{\bGamma} = {\bf (X'X)^{-1}X'y }
\end{equation}
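As a concrete check of this formula (a minimal sketch: the $2 \times 2$ inverse of ${\bf X'X}$ is written out by hand rather than taken from a matrix library, and the test data are an arbitrary exactly-linear example):

```python
def ols(xs, ys):
    # Gamma_hat = (X'X)^{-1} X'y for y = a + b*x, where X has a column of
    # ones and a column of the x values; the 2x2 inverse is written out.
    T = len(xs)
    sx = sum(xs)
    sy = sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = T * sxx - sx * sx            # determinant of X'X
    a_hat = (sxx * sy - sx * sxy) / det
    b_hat = (T * sxy - sx * sy) / det
    return a_hat, b_hat
```

On data lying exactly on a line, the estimator recovers the intercept and slope exactly.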
The expected value of the OLS estimator is
\begin{equation} \label{e100}
\begin{array}{ll}
E\hat{\bGamma} &= E {\bf (X'X)^{-1}X'y } \\
& \\
& = E {\bf (X'X)^{-1}X' } ({\bf X }\bGamma + \bvarepsilon ) \\
& \\
& = \bGamma + E {\bf (X'X)^{-1}X' } \bvarepsilon \\
\end{array}
\end{equation}
If $E {\bf X}' \bvarepsilon=0$, OLS is unbiased. But that is not the case here,
for two reasons. First, we have omitted all the observations for which $y = -
\infty$. These are values for which $\varepsilon <0$, so the expectation of
$\varepsilon$ taken over the remaining, included, values is positive, not zero.
Second, $\varepsilon$ cannot take large negative values when $x$ is large,
because that would result in negative values of $y$, and 0 is the smallest
observed value of $y$. Thus, even the included observations have a positive
expected value of $\varepsilon$. Both effects bias $\hat{\bGamma}$ in a
positive direction: $\hat{\alpha}$ will be too large, and $-\hat{\beta}$ will
be biased up towards zero, so $\hat{\beta}$ will be too
small. OLS is biased, and we know the sign of the bias.
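A small simulation illustrates the direction of the bias (a sketch: the sample size, seed, truncation point, and $\gamma = 2.5$ are all arbitrary choices). Counts that are never observed simply fail to appear in the frequency table, which is exactly the dropping of the $log(0)$ observations described above:

```python
import math
import random
from bisect import bisect_left

def sample_zeta(gamma, n, xmax=10000, seed=7):
    # Draw n values from a zeta (discrete power-law) distribution,
    # truncated at xmax, by inverting the cumulative distribution.
    rng = random.Random(seed)
    weights = [x ** -gamma for x in range(1, xmax + 1)]
    total = sum(weights)
    cdf, acc = [], 0.0
    for w in weights:
        acc += w / total
        cdf.append(acc)
    return [bisect_left(cdf, rng.random()) + 1 for _ in range(n)]

gamma_true = 2.5
data = sample_zeta(gamma_true, 20000)

# Empirical frequencies: counts never observed do not appear at all,
# mimicking the dropped log(0) observations.
freq = {}
for x in data:
    freq[x] = freq.get(x, 0) + 1

# OLS of log-frequency on log-count.
pts = [(math.log(z), math.log(f)) for z, f in freq.items()]
T = len(pts)
sx = sum(u for u, v in pts)
sy = sum(v for u, v in pts)
sxx = sum(u * u for u, v in pts)
sxy = sum(u * v for u, v in pts)
beta_hat = -(T * sxy - sx * sy) / (T * sxx - sx * sx)
# beta_hat typically falls below gamma_true: biased toward zero.
```

The fitted exponent comes out below the true $\gamma$, in line with the sign of the bias argued above.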
\bigskip
As an example, consider US Supreme Court citation data. Figure 2a below shows a
plot of the frequencies together with the OLS estimate, in log space and linear
space.
\includegraphics[width=150mm]{net2a-ols-sup.jpg}
\begin{center} {\bf Figure 2a: U.S. Supreme Court Data with the OLS Estimate
}
\end{center}
\bigskip
\noindent
{\bf 3. Estimation: Maximum Likelihood, Simple and Splined}

The likelihood of $\gamma$ being the true parameter if you observe value $x_i$
is
\begin{equation} \label{e100}
l( \gamma|x_i) = \frac{x_i^{-\gamma}}{\zeta(\gamma)}.
\end{equation}
Thus, the likelihood of observing the $n$-vector $x$ of independent values
$x_i$ is
\begin{equation} \label{e100}
{\displaystyle l( \gamma|x) = \prod_{i=1}^{n} \left( \frac{x_i^{-\gamma}}
{\zeta(\gamma)} \right). }
It is convenient to maximize not this but the log-likelihood (whose maximizer
over $\gamma$ is the same as the likelihood's), which is:
\begin{equation} \label{e100}
\begin{array}{ll}
L( \gamma|x) & = log [l(\gamma|x)] \\
& \\
& {\displaystyle = \sum_{i=1}^{n} \left[ -\gamma log(x_i) - log (
\zeta(\gamma)) \right] }\\
& \\
&{\displaystyle = -\gamma \sum_{i=1}^{n} log(x_i) - n \, log (
\zeta(\gamma)) }\\
\end{array}
\end{equation}
Maximizing this with respect to $\gamma$, the first-order condition is
\begin{equation} \label{e100}
{\displaystyle \frac{dL( \gamma|x) }{d\gamma} = -\sum_{i=1}^{n} log(x_i)
- n \left( \frac { \zeta'(\gamma) }{ \zeta(\gamma) } \right) } =0,
\end{equation}
so the best estimate of $\gamma$ solves
\begin{equation} \label{e100}
{\displaystyle \sum_{i=1}^{n} \frac{log(x_i)}{n}
= - \left( \frac{ \zeta'(\gamma) }{ \zeta(\gamma)} \right). }
\end{equation}
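This first-order condition can be solved numerically (a sketch: bisection over $\gamma$, with $\zeta$ and $\zeta'$ approximated by truncated sums; the bracketing interval and the small data vector are arbitrary illustrative choices). The right-hand side $-\zeta'(\gamma)/\zeta(\gamma)$ is decreasing in $\gamma$, which is what makes bisection reliable:

```python
import math

def zeta_pair(gamma, n_terms=50000):
    # Truncated zeta(gamma) and zeta'(gamma) = -sum of log(i) i^(-gamma).
    z, zprime = 0.0, 0.0
    for i in range(1, n_terms + 1):
        t = i ** -gamma
        z += t
        zprime -= math.log(i) * t
    return z, zprime

def mle_gamma(xs, lo=1.5, hi=6.0, tol=1e-4):
    # Bisect on the first-order condition:
    #   mean(log x_i) = -zeta'(gamma)/zeta(gamma),
    # whose right-hand side is decreasing in gamma.
    target = sum(math.log(x) for x in xs) / len(xs)
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        z, zprime = zeta_pair(mid)
        if -zprime / z > target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

gamma_hat = mle_gamma([1, 1, 1, 1, 2, 2, 3])  # an illustrative data vector
```

At the returned value the first-order condition holds to within the bisection tolerance.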
\bigskip
\includegraphics[width=150mm]{net3-spline.jpg}
\begin{center} {\bf Figure 3: The Spline}
\end{center}
Now go to having two $\gamma$'s. For $x \leq \theta$, we use $\gamma_1$, and
for $x > \theta$ we use $\gamma_2$. The two pieces must meet at
$\theta$.
Thus, the probability function is
\begin{equation} \label{e100}
\begin{array}{ll}
P(x) & = K_1 x^{-\gamma_1} \; {\rm if} \; x \leq \theta \\
& \\
& = K_2 x^{-\gamma_2} \; {\rm if} \; x > \theta \\
\end{array}
\end{equation}
Note that $K_1$ and $K_2$ have no necessary relation to the Zeta
function; neither half of the spline needs to be a Zeta distribution.
Without loss of generality, let us number the $x_i$ data from smallest to
biggest, so $x_1$ is smallest and $x_n$ is biggest. Thus the data consist of
an $n$-vector $x$ such as $x=(1,1,1,1,2,2,3)$, which would have an empirical
distribution of $f(z) =$ (1:4, 2:2, 3:1).

Let us define $x_{m({\theta})}$ as the last $x_i$ that is no greater than
$\theta$, so $x_{m({\theta})} \leq \theta< x_{m({\theta})+1}$. If $\theta=2$,
then in our example, $x_{m({\theta})}=2$ and $ m({\theta}) =6$.
There are 5 parameters in $P(x)$, but they are not all independent, just as $K$
and $\gamma$ were not earlier. Two of the five parameters can be pinned down
by the
following two restrictions. First, as before, the probability must sum to one.
\begin{equation} \label{first}
\begin{array}{l}
{\displaystyle \sum_{i=1}^{ \theta } K_1 i^{-\gamma_1} +
\sum_{i= \theta+1}^{\infty} K_2 i^{-\gamma_2}=1} \\
\end{array}
\end{equation}
Second, the function must be continuous, so at the cutpoint $\theta$ the two
power laws intersect:
\begin{equation} \label{second}
\begin{array}{l}
K_1 \theta^{-\gamma_1}
= K_2 \theta^{-\gamma_2}.
\end{array}
\end{equation}
These two conditions can be solved for $K_1$ and $K_2$. From (\ref{second}),
\begin{equation} \label{e100}
\begin{array}{l}
K_2 = K_1\theta^{\gamma_2-\gamma_1}
\end{array}
\end{equation}
From (\ref{first}),
\begin{equation} \label{e100}
{\displaystyle \sum_{i=1}^{ \theta } K_1 i^{-\gamma_1} +
\sum_{i= \theta +1}^{\infty} K_1\theta^{\gamma_2-\gamma_1} i^{-\gamma_2}=
1},
\end{equation}
so
\begin{equation} \label{e100}
{\displaystyle K_1 = \left( \sum_{i=1}^{ \theta } i^{-\gamma_1} +
\sum_{i= \theta +1}^{\infty} \theta^{\gamma_2-\gamma_1} i^{-\gamma_2}
\right)^{-1}}.
\end{equation}
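These two formulas are easy to verify numerically (a sketch; the parameter values $\gamma_1 = 2$, $\gamma_2 = 3$, $\theta = 10$ and the truncation point are arbitrary). The computed $K_1$ and $K_2$ should satisfy both the adding-up condition and continuity at $\theta$:

```python
def spline_constants(gamma1, gamma2, theta, n_terms=100000):
    # K1 from the adding-up condition, K2 from continuity at theta.
    head = sum(i ** -gamma1 for i in range(1, theta + 1))
    tail = sum(theta ** (gamma2 - gamma1) * i ** -gamma2
               for i in range(theta + 1, n_terms + 1))
    K1 = 1.0 / (head + tail)
    K2 = K1 * theta ** (gamma2 - gamma1)
    return K1, K2

K1, K2 = spline_constants(2.0, 3.0, theta=10)

# The probabilities over the (truncated) support should sum to one.
total = (sum(K1 * i ** -2.0 for i in range(1, 11))
         + sum(K2 * i ** -3.0 for i in range(11, 100001)))
```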
Recall that $m({\theta})$ indexes the last observation no greater than
$\theta$. The likelihood of observing the $n$-vector $x$ of independent
values $x_i$ is
\begin{equation} \label{e100}
{\displaystyle l( \gamma_1, \gamma_2, \theta|x) = \left( \prod_{i=1}
^{m({\theta})} K_1 x_i ^{-\gamma_1} \right) \left( \prod_{i=m({\theta})+1}^{n}
K_2 x_i^{-\gamma_2} \right) . }
\end{equation}
The log likelihood is
\begin{equation} \label{e100}
\begin{array}{ll}
L( \gamma_1, \gamma_2, \theta|x)& = log [l( \gamma_1, \gamma_2, \theta|x) ] \\
& \\
& {\displaystyle = \sum_{i=1}^{m({\theta})} \left[log(K_1) - \gamma_1
log(x_i) \right]
+\sum_{i=m({\theta})+1}^{n} \left[log(K_2) - \gamma_2 log(x_i) \right] } \\
\end{array}
\end{equation}
To maximize this, use Matlab. The expressions for $K_1$ and $K_2$ contain
infinite series, but these can be well approximated by finite sums.
As a starting point, try $\theta = n/2$ and
$\gamma_1=\gamma_2=\gamma$ as calculated from the maximum likelihood estimator
earlier.
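Before maximizing, one useful sanity check on the spline log-likelihood (a Python sketch of what the Matlab routine would compute; the truncation point and data vector are illustrative): when $\gamma_1 = \gamma_2$, the spline collapses to the single-$\gamma$ model, so the two log-likelihoods must agree for any $\theta$:

```python
import math

def zeta_trunc(gamma, n_terms=100000):
    # Truncated zeta series.
    return sum(i ** -gamma for i in range(1, n_terms + 1))

def spline_loglik(xs, gamma1, gamma2, theta, n_terms=100000):
    # Log-likelihood of the two-gamma spline; m(theta) is implicit in
    # the x <= theta test.
    head = sum(i ** -gamma1 for i in range(1, theta + 1))
    tail = theta ** (gamma2 - gamma1) * sum(
        i ** -gamma2 for i in range(theta + 1, n_terms + 1))
    K1 = 1.0 / (head + tail)
    K2 = K1 * theta ** (gamma2 - gamma1)
    return sum(math.log(K1) - gamma1 * math.log(x) if x <= theta
               else math.log(K2) - gamma2 * math.log(x)
               for x in xs)

xs = [1, 1, 1, 1, 2, 2, 3]
single = (sum(-2.5 * math.log(x) for x in xs)
          - len(xs) * math.log(zeta_trunc(2.5)))
splined = spline_loglik(xs, 2.5, 2.5, theta=2)
```

A grid or hill-climbing search over $(\gamma_1, \gamma_2, \theta)$ can then be built on this function, starting from the single-$\gamma$ estimate as suggested above.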
As an example, consider US Supreme Court citation data. Figure 3a below shows
a
plot of the frequencies together with the maximum likelihood estimates in log
space and linear
space, splined and single.
\includegraphics[width=150mm]{net-3a-mle-sup.jpg}
\begin{center} {\bf Figure 3a: U.S. Supreme Court Data with Maximum
Likelihood Estimates
}
\end{center}
\noindent
{\bf 4. Testing the Estimated Curve}

A first test we might do is the Kolmogorov-Smirnov test for whether it can be
rejected that the single-$\gamma$ power law generated the observed empirical
distribution function. Figure 4 shows the critical values for this, taken from
Goldstein et al.
\includegraphics[width=150mm]{net4-ks.jpg}
\begin{center} {\bf Figure 4: The Kolmogorov-Smirnov Test}\footnote{From
Goldstein et al. }
\end{center}
If we apply the Kolmogorov-Smirnov test to the U.S. Supreme Court data, the
test statistic is xxx, with a critical value of xxx at the 5 percent level.
A second test might be a
Kolmogorov-Smirnov test for whether it can be rejected that the two-$\gamma$
power law generated the observed empirical distribution function. I think that
might require a new table of critical values, though.
An alternative to the Kolmogorov-Smirnov test is a chi-squared test, sometimes
called Pearson's chi-squared test. To do this test, the observed data are
divided into bins chosen by the researcher. The number of observations in each
bin is
compared to the number predicted by the single-$\gamma$ or two-$\gamma$
distribution. This test is sensitive to how the bins are defined. Also, an
objection to it is that it would fail to detect a difference between the two
distributions that occurred only within a bin--- if, for example, the values $z
\in [10, 15]$ were in one bin, any shape of the theoretical distribution
between 10 and 15 that generates the same probability would look the same to
the test.
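The binned statistic itself is simple to compute (a sketch; the bin width and the small frequency table are illustrative). Note that the statistic sees only bin totals, which is exactly the within-bin blindness described above:

```python
def pearson_chi2(observed, expected):
    # Sum over bins of (O - E)^2 / E.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def bin_frequencies(freq, width, zmax):
    # Group per-count frequencies {z: f} into bins of `width`
    # consecutive count values.
    return [sum(freq.get(i, 0) for i in range(z, z + width))
            for z in range(1, zmax + 1, width)]
```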
Let us see what happens with three different bin sizes: 1 count per bin, 2
counts per bin, and 10 counts per bin. Figures 4aa, 4ab, and 4ac show the data
with the expected frequency, in log space. The Chi-Squared statistics are xxx,
yyy, and xxx, with critical values xxx, yyy, and xxx for the 5 percent level.
A third test is for whether the one-$\gamma$ model can be rejected in
favor of the two-$\gamma$ model. We will use a likelihood ratio test.
See if
\begin{equation}
-2 \left[ L (\gamma) - L(\gamma_1, \gamma_2, \theta) \right]
\end{equation}
has a value greater than the Chi-Squared(2) critical value; the splined model
has two extra free parameters. The Neyman-Pearson Lemma suggests that a test
based on the likelihood ratio is a powerful one. Here, the test statistic is
xxx, and the critical value at the 5 percent level is xxx.
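The mechanics of the test are minimal (a sketch; the two log-likelihood values below are hypothetical numbers, purely for illustration, and 5.991 is the standard Chi-Squared(2) critical value at the 5 percent level):

```python
CHI2_2_CRIT_05 = 5.991  # chi-squared(2) critical value, 5 percent level

def lr_statistic(loglik_single, loglik_spline):
    # -2 [ L(gamma) - L(gamma1, gamma2, theta) ]; nonnegative whenever the
    # spline is maximized over a set including gamma1 = gamma2 = gamma.
    return -2.0 * (loglik_single - loglik_spline)

# Hypothetical log-likelihood values, purely for illustration.
stat = lr_statistic(-1050.0, -1046.2)
reject = stat > CHI2_2_CRIT_05
```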
A separate thing we would like to test for is ``economic significance'', as
opposed to ``statistical significance''. With large amounts of data, it is
easy to reject a null hypothesis, even if the difference between what is
observed and the null is very small, because the measurements are so accurate.
Thus, we would like some measure of how much two distributions differ, as
opposed to how likely it is that they differ.
I suggest using this
measure, $M$:
\begin{equation} \label{e100}
{\displaystyle M \equiv 1- \left[ \left( \frac{1}{2} \right) \int_{-\infty}
^{\infty}
\bigg|f(x) - g(x) \bigg| dx \right], }
\end{equation}
where $f$ and $g$ can be either densities or probabilities.
The measure is between zero and one. If the two distributions are the same, it
equals one, since $f(x)-g(x)=0$. The smallest value is zero, since if $f(x)$ and
$g(x)$ have disjoint supports,
\begin{equation} \label{e100}
{\displaystyle 1- \left[ \left( \frac{1}{2} \right) \int_{-\infty}^{\infty}
\bigg|f(x) - g(x) \bigg| dx \right] = 1- \left[ \left( \frac{1}{2} \right)
\left( \int_{-\infty}^{\infty}
f(x)dx +\int_{-\infty}^{\infty}
g(x) dx \right) \right] = 1- \left[ \left( \frac{1}{2} \right)\left( 1 + 1
\right)
\right] = 0.}
\end{equation}
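For discrete distributions the measure reduces to a sum over the union of the two supports (a sketch; for densities the sum would be replaced by numerical integration). The two boundary cases above, identical distributions and disjoint supports, make natural checks:

```python
def overlap_measure(f, g):
    # M = 1 - (1/2) * sum over the union of supports of |f(x) - g(x)|,
    # where f and g are dicts mapping support points to probabilities.
    support = set(f) | set(g)
    return 1.0 - 0.5 * sum(abs(f.get(x, 0.0) - g.get(x, 0.0))
                           for x in support)
```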
The measure can be calculated in several different ways. Note that
\begin{equation} \label{e100}
{\displaystyle \int_{-\infty}^{\infty}
\bigg|f(x) - g(x) \bigg| dx = \int_{x:f \geq g}
\left[f(x) - g(x) \right] dx +\int_{x:f < g}
\left[g(x) - f(x) \right] dx. }
\end{equation}