Tuesday, June 8, 2010

Optimal Histogram Bin Width

Have you ever wondered how to choose the number of bins when making a histogram? I have. I often just play around with the number until the resulting histogram looks good enough.

There are two parameters to consider when making a histogram a bona fide probability density (one whose total area equals one): the number of bins $k$ and the bin width $h$.

If we know the bin width $h$, the number of bins $k$ is
\[k=\left\lceil\frac{\max(x)-\min(x)}{h}\right\rceil\]
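
The ceiling is what guarantees that the bins cover the data: $k$ is the smallest integer with $kh\ge\max(x)-\min(x)$. As a made-up illustration (these numbers are not taken from any dataset), a range of 50 with $h=7$ gives
\[k=\left\lceil\frac{50}{7}\right\rceil=\lceil 7.14\ldots\rceil=8,\qquad kh=56\ge 50.\]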

A number of methods for calculating the optimal number of bins or the bin width have been proposed. A few of them are listed below.

Sturges' formula
\[k=\left\lceil \log_2 n+1\right\rceil\]
Scott's choice
\[h=\frac{3.5\sigma}{n^{1/3}}\]
Freedman-Diaconis' choice
\[h=2\,\frac{\mathrm{IQR}(x)}{n^{1/3}}\]
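
Since Sturges' formula returns a bin count while the other two return a bin width, it can help to put all three on the same footing. The following is only a sketch: the sample `x` and the variable names are hypothetical, and `iqr` is assumed to be available (it comes from the Statistics Toolbox in older MATLAB releases).

```matlab
% Hypothetical sample, used only to illustrate the three rules.
x = [2 4 4 5 7 9 10 12 15 21];
n = numel(x);
r = max(x) - min(x);                    % data range

k_sturges = ceil(log2(n) + 1);          % Sturges: bin count directly
h_scott   = 3.5 * std(x) * n^(-1/3);    % Scott: bin width
h_fd      = 2 * iqr(x) * n^(-1/3);      % Freedman-Diaconis: bin width

% Convert between width and count with k = ceil(range / h).
k_scott   = ceil(r / h_scott);
k_fd      = ceil(r / h_fd);
h_sturges = r / k_sturges;              % equal-width bins spanning the range

fprintf('Sturges: k = %d, h = %.3f\n', k_sturges, h_sturges);
fprintf('Scott:   k = %d, h = %.3f\n', k_scott, h_scott);
fprintf('F-D:     k = %d, h = %.3f\n', k_fd, h_fd);
```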

For example, using the geyser data:
load geyser
n=length(geyser);

Scott's choice gives
>> h=3.5*std(geyser)*n^(-1/3)
h =
    7.2704

>> k=ceil((max(geyser)-min(geyser))/h)
k =
     9


Freedman-Diaconis' choice gives
>> h=2*iqr(geyser)*n^(-1/3)
h =
    7.1782

>> k=ceil((max(geyser)-min(geyser))/h)
k =
    10


These two choices yield $k=9$ and $k=10$ bins, respectively. Compare these numbers with Sturges' formula:
>> k=ceil(log2(n)+1)
k =
    10

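So it seems the number of bins should be around 10. The MATLAB code for plotting the bona fide density histogram is
[counts,centers]=hist(geyser,k);   % k equal-width bins between min and max
h=centers(2)-centers(1);           % width hist actually used, not the rule's h
fhat=counts/(sum(counts)*h);       % scale heights so the total bar area is one
bar(centers,fhat,1);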
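
To see why this scaling gives a density, write $c_j$ for the count in bin $j$ (notation introduced here for the argument), $n$ for the sample size, and $h$ for the common width of the bars as drawn. Each bar has height $\hat f_j=c_j/(nh)$ and therefore area $\hat f_j h=c_j/n$, so
\[\sum_{j=1}^{k}\hat f_j\,h=\frac{1}{n}\sum_{j=1}^{k}c_j=1,\]
provided every observation falls in some bin. The identity describes the plotted bars only if $h$ is the width of the bins actually used.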


Check that the total area of the histogram is one:
>> sum(fhat)*h
ans =
    1.0000
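
One thing to keep in mind is that `hist(geyser,k)` spreads $k$ equal bins over the range of the data, so its bin width is the range divided by $k$ rather than the $h$ returned by Scott's or Freedman-Diaconis' rule. A minimal sketch of one way to draw bins of width exactly $h$ is to pass explicit edges. It assumes a MATLAB release that has `histcounts` and `histogram` (R2014b or later), and the variable names (`h_fd`, `edges`, and so on) are introduced here for illustration only.

```matlab
% Assumes geyser has been loaded, as in the example above.
x = geyser(:);                            % force a column vector
h_fd = 2 * iqr(x) * numel(x)^(-1/3);      % Freedman-Diaconis bin width
k = ceil((max(x) - min(x)) / h_fd);       % number of bins of width h_fd
edges = min(x) + (0:k) * h_fd;            % k+1 edges, k bins of width h_fd
edges(end) = max(edges(end), max(x));     % guard against rounding at the top edge
counts = histcounts(x, edges);            % counts per bin
fhat = counts / (numel(x) * h_fd);        % density height per bin
centers = edges(1:end-1) + h_fd/2;        % bin midpoints for plotting
bar(centers, fhat, 1);

% Built-in alternative: let MATLAB apply the rule and normalize to a density.
figure;
histogram(x, 'BinMethod', 'fd', 'Normalization', 'pdf');
```

The built-in `'BinMethod'` options (`'scott'`, `'fd'`, `'sturges'`) may adjust the bin edges, so the resulting bins need not match the hand-computed ones exactly.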
