Have you ever wondered how to choose the number of bins when making a histogram? I have. I often just play around with the number until the resulting histogram looks good enough.
There are two parameters to consider when making a histogram a bona fide probability density (the total area of the histogram bars equals one): the number of bins $k$ and the bin width $h$.
If we know the bin width $h$, the number of bins $k$ is
\[k=\left\lceil\frac{\max(x)-\min(x)}{h}\right\rceil\]
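As a quick illustration of this relation, here is a minimal sketch on made-up numbers (the vector `x` and the width `h` are assumptions chosen for the example, not the geyser data). Because of the ceiling, $k$ bins of width $h$ always cover the whole range, and the equal-width bin that fits the range exactly is never wider than $h$.

```matlab
x = [2 3 5 8 13 21];      % made-up sample, for illustration only
h = 4;                    % assumed bin width
r = max(x) - min(x);      % data range: 21 - 2 = 19
k = ceil(r/h);            % number of bins: ceil(19/4) = 5
hfit = r/k;               % width when k bins span the range exactly: 3.8
```

So when a function is asked for $k$ equal bins over the data range, the width it actually uses is `hfit`, which can be slightly smaller than the $h$ you started from.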
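A number of methods for calculating the optimal number of bins or the bin width have been proposed. Some of them are listed below, where $n$ is the number of observations, $\sigma$ is the sample standard deviation, and IQR is the interquartile range.
Sturges' formula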
\[k=\left\lceil \log_2 n+1\right\rceil\]
Scott's choice
\[h=\frac{3.5\sigma}{n^{1/3}}\]
Freedman-Diaconis' choice
\[h=2\frac{\mathrm{IQR}(x)}{n^{1/3}}\]
For example, using the geyser data:
load geyser
n=length(geyser);
Scott's choice gives
>> h=3.5*std(geyser)*n^(-1/3)
h =
7.2704
>> k=ceil((max(geyser)-min(geyser))/h)
k =
9
The Freedman-Diaconis choice gives
>> h=2*iqr(geyser)*n^(-1/3)
h =
7.1782
>> k=ceil((max(geyser)-min(geyser))/h)
k =
10
These two choices yield $k=9$ and $k=10$ bins, respectively. Compare these numbers with Sturges' formula:
>> k=ceil(log2(n)+1)
k =
10
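So it seems the number of bins should be around 10. The MATLAB code for plotting the bona fide density histogram is
[counts,x]=hist(geyser,k);      % bin counts and bin centers for k equal-width bins
h=x(2)-x(1);                    % width of the bins hist actually used
fhat=counts/(sum(counts)*h);    % scale the counts so the total area is one
bar(x,fhat,1);                  % bars of full width, no gaps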
Notice that the total area of the histogram is
>> sum(fhat)*h
ans =
1.0000
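Newer MATLAB releases also have `histcounts`, which can apply these rules for you. Below is a minimal sketch, assuming a release that ships `histcounts` and using synthetic normal data as a stand-in for the geyser data (the mean, spread and seed are my own assumptions). MATLAB may adjust the bin width to get tidy bin edges, so the bin counts need not match the hand calculations above exactly.

```matlab
rng(0);                                  % fix the seed so runs are repeatable
x = 70 + 14*randn(1,299);                % synthetic stand-in, not the geyser data
rules = {'sturges','scott','fd'};        % built-in bin selection rules
for i = 1:numel(rules)
    % 'pdf' normalization returns density heights instead of raw counts
    [fhat,edges] = histcounts(x,'BinMethod',rules{i},'Normalization','pdf');
    area = sum(fhat.*diff(edges));       % total area under the histogram
    fprintf('%-8s k = %2d, area = %.4f\n', rules{i}, numel(fhat), area);
end
```

Each rule should report an area of one, since the density heights are the counts divided by the sample size times the bin width.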