Sunday, August 1, 2010

Estimating Pi with Monte Carlo Integration

There are several tutorials on how to perform this calculation; here I would like to add the technique I have been using with MATLAB recently. Basically, imagine throwing darts at random (really, truly at random) at a square board with a circle inscribed in it.

The concept can be implemented in one MATLAB command:

>> 4*mean(rand(1,5000).^2+rand(1,5000).^2 <= 1)
ans =
   3.141592653589793

or, with a slightly different approach, using Monte Carlo integration with random numbers drawn from the uniform distribution $x_i \sim Unif(0,1)$:

>> 4*mean(sqrt(1-rand(1,5000).^2))
ans =
   3.141432627301033

Notice that the second approach requires only one series of random numbers instead of the two used by the first approach, but it has to perform a more expensive square root operation.
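As a quick sanity check (not in the original post), the estimate should tighten around $\pi$ as the number of darts grows; a minimal sketch:

% hedged sketch: repeat the dart-throwing estimate for increasing sample sizes
N = [1e3 1e4 1e5 1e6];
est = zeros(size(N));
for i = 1:length(N)
    est(i) = 4*mean(rand(1,N(i)).^2 + rand(1,N(i)).^2 <= 1);
end
[N; est]   % the second row should approach pi as N grows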

Saturday, July 10, 2010

Likelihood Function in Machine Learning

The (log) probability function $p({\bf x}|\theta)$ assigns a probability (density) to any joint configuration of variables $\bf x$ given fixed parameter $\theta$.

Instead, think of $p({\bf x}|\theta)$ as a function of $\theta$ for fixed $\bf x$:
\[L(\theta;{\bf x}) = p({\bf x}|\theta)\]
\[l(\theta;{\bf x})=\log p({\bf x}|\theta)\]
This function is called the (log) "likelihood".

Learning then amounts to choosing the $\theta$ that maximizes some cost function $c(\theta)$ (a small sketch follows):

$c(\theta) = l(\theta;D)$  ==> maximum likelihood (ML)
$c(\theta) = l(\theta;D)+r(\theta)$ ==> maximum a posteriori (MAP) / penalized likelihood (BIC, AIC, ...)
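For instance, a minimal sketch (not from the original notes) of maximum likelihood as maximizing $l(\theta;D)$ over a grid of candidate values, for data assumed to come from $N(\theta,1)$:

D = randn(1,100) + 3;                                 % hypothetical data, true mean 3
thetas = 0:0.01:6;                                    % candidate parameter values
l = arrayfun(@(t) sum(log(normpdf(D,t,1))), thetas);  % log likelihood l(theta;D)
[lmax,imax] = max(l);
thetas(imax)                                          % ML estimate, close to mean(D)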

Statistical Machine Learning Problems

There are four basic problems in the machine learning community: density estimation, clustering, classification, and regression.

These problems can also be formulated in a statistical framework.

Regression: $p({\bf y}|{\bf x})=p({\bf y,x})/p({\bf x})=p({\bf y,x})/\int p({\bf y,x})d{\bf y}$
Classification: $p(c|{\bf x})=p(c,{\bf x})/p({\bf x})=p(c,{\bf x})/\sum_c p(c,{\bf x})$
Clustering: $p(c|{\bf x})=p(c,{\bf x})/p({\bf x})$,  $c$ is unobserved
Density Estimation:  $p({\bf y}|{\bf x})=p({\bf y,x})/p({\bf x})$, $\bf x$ is unobserved

Saturday, June 19, 2010

Akaike and Bayesian Information Criteria

Usually, when we have data, we want to know what (parametric) model the data came from; in other words, we want to fit a model to the data. Different models consist of different numbers of parameters, or different parameters altogether.

When there are several models to choose from, we estimate model parameters using maximum likelihood estimation. We may want to increase the likelihood by adding more parameters but that may result in overfitting.

The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) address the overfitting problem by penalizing the number of parameters in the model.

The formula for the AIC is

\[AIC = 2k - 2\ln(L)\]
and, the formula for the BIC is

\[-2\cdot \ln p(x|k)\approx BIC=-2\cdot \ln L + k\ln(n)\]
where $x$ is the observed data, $n$ the number of data points in $x$, $k$ the number of parameters, $p(x|k)$ the likelihood of the observed data given the number of parameters and $L$ the value of the likelihood function for the estimated model.

For example, let's generate some random values
x = binornd(10,0.5,1,1000);

and pretend that we do not know what distribution they came from. There are two models we want to fit to the data, $X\sim Bino(10,0.5)$ and $X\sim Poiss(5)$.

First, try plotting the data along with the two theoretical distributions.
[n,xi]=hist(x);
bar(xi,n/(sum(n)*(xi(2)-xi(1))),1)
hold on;
l=min(x):max(x);
px=binopdf(l,10,0.5);
py=poisspdf(l,5);
plot(l,px,'r.');  % theoretical binomial dist
plot(l,py,'rx');  % theoretical poisson dist
hold off;
axis tight;

 Now, try using the AIC and BIC for model selection
px=binopdf(x,10,0.5);
px(px == 0)=[];
llx=sum(log(px));
py=poisspdf(x,5);
py(py == 0)=[];
lly=sum(log(py));
[aic,bic]=aicbic([llx,lly],[2,1],[length(px),length(py)])


aic =
  1.0e+003 *
    3.7569    3.9542

bic =
  1.0e+003 *
    3.7667    3.9591


The first model $X\sim Bino(10,0.5)$ has lower AIC and BIC, so it is a better model.

Next, try using different random values.

x = poissrnd(5,1,10000);

aic =
  1.0e+004 *
    4.4925    4.4069

bic =
  1.0e+004 *
    4.4939    4.4076


This time the second model $X\sim Poiss(5)$ has lower AIC and BIC so it is a better model for the data.

Log Likelihood Function

In maximum likelihood estimation (MLE), if we let $Y_1,...,Y_n$ be $n$ independent random variables (r.v.'s) with probability density functions (pdfs) $f_i(y_i;{\bf\theta})$ depending on a vector-valued parameter ${\bf\theta}$, the likelihood is just the value of the $n$-variate joint probability density function

\[f({\bf y};{\bf\theta})=\prod\limits_{i=1}^n f_i(y_i;{\bf\theta})=L({\bf\theta};{\bf y})\]
The quantity $L({\bf\theta};{\bf y})$, viewed as a function of the unknown parameter ${\bf\theta}$ given the data ${\bf y}$, is called the likelihood function.

Analytically, maximizing a product is more difficult than maximizing a sum, so we take the natural logarithm of the likelihood function, which gives the log likelihood function
\[\log L({\bf\theta};{\bf y})=\sum\limits_{i=1}^n \log f_i(y_i;{\bf\theta})\]
For example, suppose the outcome of an experiment is observed as a series of random values; what is the (log) likelihood that this outcome came from $X \sim Bino(10,0.5)$?

First, generate some random values, but assume that we do not know the parameters that generated them, i.e., we don't know that they came from Bino(10,0.5).

X=binornd(10,0.5,1,100);

The log likelihood value of X with parameter (10,0.6) is
 >> sum(log(binopdf(X,10,0.6)))
ans =
 -200.0399

The log likelihood value of X with parameter (10,0.5) is
>> sum(log(binopdf(X,10,0.5)))
ans =
 -184.4945

And the log likelihood value of X with parameter (11,0.5) is
>> sum(log(binopdf(X,11,0.5)))
ans =
 -187.4201

Of the three values above, the log likelihood for the parameters (10,0.5) is the highest, so from an MLE standpoint we should select this model.
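Rather than comparing a handful of candidate values, the MLE of $p$ (with the number of trials fixed at 10) has a closed form; a hedged sketch, using the Statistics Toolbox function binofit():

>> phat = sum(X)/(10*numel(X))           % closed-form MLE of p for Bino(10,p)
>> phat2 = binofit(sum(X), 10*numel(X))  % same estimate via binofit

Both lines should return essentially the same value, which is what maximizes the log likelihood computed above.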

Tuesday, June 15, 2010

Model-based Clustering

Model-Based Clustering (MBC) assumes that the data come from several subpopulations, each modeled separately by $g_k({\bf x}; {\bf \theta_k})$. The overall population is therefore a finite mixture model.

\[f({\bf x};p,{\bf \theta})=\sum\limits_{k=1}^c p_k g_k({\bf x}; {\bf \theta_k})\]
The weights $p_k$ are also called the mixing proportions or mixing coefficients.

The most commonly used density is the multivariate normal (Gaussian); therefore, MBC typically assumes that the data come from a finite mixture of multivariate Gaussians.

There are three components of MBC (see the sketch after this list).
1. Agglomerative clustering is used to partition the data. Each partition then provides the initial starting values for the EM algorithm.
2. The EM algorithm is used to estimate the parameters of the finite mixture.
3. The Bayesian Information Criterion (BIC) is used to choose the best grouping and number of clusters, given the data. It is an approximation to Bayes factors.
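A minimal sketch of steps 2 and 3 (assuming the Statistics Toolbox gmdistribution class, and using random EM restarts in place of the agglomerative initialization):

X = [randn(100,2); randn(100,2)+4];                  % hypothetical two-cluster data
bic = zeros(1,4);
for k = 1:4
    obj = gmdistribution.fit(X, k, 'Replicates', 5); % EM fit with k Gaussian components
    bic(k) = obj.BIC;                                % BIC of the fitted mixture
end
[bestbic, kbest] = min(bic)                          % choose the number of clusters by BIC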

Saturday, June 12, 2010

False Alarm Probability to Likelihood Ratio Threshold

Let's start from something easier to understand. We know from the "likelihood ratio approach to classification" that

\[P({\bf x}|\omega_1)P(\omega_1) > P({\bf x}|\omega_2)P(\omega_2)\]
\[L_R({\bf x})=\frac{P({\bf x}|\omega_1)}{P({\bf x}|\omega_2)}>\frac{P(\omega_2)}{P(\omega_1)}=\tau_c\]
The $\tau_c$ provides us with an automatic decision boundary that we can use to calculate the false alarm probability, P(FA), later.

From the previous post, we know that we can set an arbitrary decision boundary by choosing P(FA). That is, we can either directly find the P(FA) quantile of the observations of the non-target class $\omega_2$,

 xq=quantile(x(ind2),pfa);

or use the inverse cumulative distribution function at P(FA) if we assume the observations come from a known probability distribution

 xi=norminv(pfa,mu2,sd2);

The question now is: can we derive the threshold from P(FA)? If so, the classification task is simply to compare the likelihood ratio of the two classes against the threshold and assign the target class when the ratio is greater, while knowing what the false alarm probability is.

It turns out that, if we do cross validation on $\omega_2$ and calculate the likelihood ratio
\[L_r=\frac{P(x_i^2|\omega_1)}{P(x_i^2|\omega_2)}\]
of each held-out $x_i$, we will get a distribution of $L_r$. From here, the quantile $\hat q_{1-P(FA)}$ of this distribution is the threshold corresponding to P(FA). (A higher quantile means a higher threshold, which makes the classifier more reluctant to declare the target class and thus implies a lower P(FA).)

The following code demonstrates the two approaches for getting the threshold from P(FA). The first approach takes the 1-P(FA) quantile of the likelihood ratios (red), and the second directly takes the P(FA) quantile of the data (blue) and then calculates the threshold from it.

% p(x|w1) ~ N(-1,1), p(w1)=0.6
% p(x|w2) ~ N(1,1), p(w2)=0.4
n=1000;
u=rand(1,n);
x=zeros(size(u));
ind1 = u<= 0.60;
ind2 = ~ind1;
n1=sum(ind1);
n2=sum(ind2);
x(ind1)=randn(1,n1)-1;
x(ind2)=randn(1,n2)+1; 


mu1=mean(x(ind1));
sd1=std(x(ind1));
xtrain = x(ind2);
lr2=zeros(1,n2);
 

% cross validation on w2
for i=1:n2
    xi=xtrain(i);
    xtrain2=xtrain;
    xtrain2(i)=[];
    mu2=mean(xtrain2);
    sd2=std(xtrain2);
    lr2(i)=normpdf(xi,mu1,sd1)/normpdf(xi,mu2,sd2);
end

pfa = 0.01:0.01:0.99;

% approach #1
threshold=quantile(lr2,1-pfa);


% approach #2
xq=quantile(x(ind2),pfa);
threshold2=normpdf(xq,mu1,sd1)./normpdf(xq,mu2,sd2);

plot(pfa,threshold,'r'); hold on;
plot(pfa,threshold2,'b'); hold off;


Both are almost identical, with only slight differences.

>> sum(abs(threshold-threshold2))
ans =
    1.7067

Receiver Operating Characteristic (ROC) Curve

The Receiver Operating Characteristic (ROC) curve is a characteristic of a classifier. It is a plot of P(FA), the "false positive rate" for the class $\omega_2$ (non-target class), versus P(CC), the "true positive rate" for the class $\omega_1$ (target class).

It can also be read as answering the question: "if we set the alpha to this value, what would P(correct) be?"

For example, the following code yields the theoretical ROC curve (black, dotted) and the empirical classification ROC curve (blue) for

\[p(x|\omega_1)\sim N(-1,1),\;\;\; p(\omega_1)=0.6\]
\[p(x|\omega_2)\sim N(1,1),\;\;\; p(\omega_2)=0.4\]
n=1000;
u=rand(1,n);
x=zeros(size(u));
ind1 = u<= 0.60;
ind2 = ~ind1;
n1=sum(ind1);
n2=sum(ind2);
x(ind1)=randn(1,n1)-1;
x(ind2)=randn(1,n2)+1;
 

% P(false alarm), alpha, False Positive Rate
pfa = 0.01:.01:0.99; 
% theoretical line
pcc=normcdf(norminv(pfa,1,1),-1,1);
plot(pfa,pcc,'k:'); hold on;

% building a classifier
% (setting false alarm value)
pfa2= zeros(size(pfa)); % false alarm : result
pcc = zeros(size(pfa)); % P(correct classification)

% find model parameters
mu2=mean(x(ind2));
sd2=std(x(ind2));
% get decision boundary for each P(fa)
xi=norminv(pfa,mu2,sd2); 
for i=1:length(xi)
    c1 = x < xi(i);  % classified as class 1 using the i-th decision boundary
    pcc(i)=sum(c1 & ind1)/n1;
    pfa2(i)=sum(c1 & ind2)/n2;
end

plot(pfa2,pcc,'b'); hold off;
xlabel('P(FA)');
ylabel('P(CC_{\omega_1})');


Estimating Likelihood function (parametric)

The likelihood function, or class-conditional probability $P({\bf x}|\omega_i)$, can be estimated directly from the data. The parametric approach is to assume that the data from each class come from a normal distribution with its own mean and covariance. This means that we need to estimate these two parameters from the data of each class and use them to evaluate the density at $\bf x$.

\[P({\bf x}|\omega_i)=P({\bf x}|\mu_i,\sigma_i^2)\]

For example, using the iris data, we first estimate the parameters of the setosa, versicolor, and virginica classes

load iris;

muset = mean(setosa);
muver=mean(versicolor);
muvir=mean(virginica);
covset=cov(setosa);
covver=cov(versicolor);
covvir=cov(virginica);

There are 50 samples from each class. Then we evaluate the three class-conditional densities over all the samples.

X=[setosa;versicolor;virginica];
pset=mvnpdf(X,muset,covset);
pver=mvnpdf(X,muver,covver);
pvir=mvnpdf(X,muvir,covvir);

plot(pset,'b'), hold on;
plot(pver,'r');
plot(pvir,'g'); hold off;

The result is shown below (setosa, versicolor, virginica).
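A hedged follow-up (assuming the rows of X keep the setosa, versicolor, virginica order, 50 samples each): assign every sample to the class with the largest likelihood and check the resubstitution accuracy.

[maxp, cls] = max([pset pver pvir], [], 2);          % class with the highest likelihood
truecls = [ones(50,1); 2*ones(50,1); 3*ones(50,1)];  % true labels by block
accuracy = mean(cls == truecls)

Since the priors are equal here (50 samples per class), picking the largest likelihood is equivalent to picking the largest posterior.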

Thursday, June 10, 2010

False Alarm in Classification

There are two types of error in hypothesis testing, the Type I error $\alpha$ and the Type II error $\beta$. In classification terms, H0 corresponds to the non-target class and H1 to the target class. Therefore, $\alpha$ is the probability of misclassifying an observation as a target when it is not.

Therefore, the Type I error $\alpha$ here means a false alarm (FA) or false positive.

\[P(FA)=\alpha\]

For example, the target class (H1) of the previous example is $\omega_1$, the non-target class (H0) is $\omega_2$, and the P(FA) at the decision boundary x=0.2 is

>> 0.4*normcdf(0.2,1,1)
ans =
    0.0847

Now for the reverse question: if we want to set the P(FA) to 0.05, what would the decision boundary be?

Notice that
\[P(FA)=\int\limits_{-\infty}^C P(x|\omega_2)P(\omega_2)dx\]
We want to find the value x=C, such that
\[\int\limits_{-\infty}^C P(x|\omega_2)dx=\frac{P(FA)}{P(\omega_2)}\]
Thus,
>> norminv(0.05/0.4,1,1)
ans =
   -0.1503
The decision boundary is where x= -0.1503

Likelihood Ratio Approach to Classification

Let's continue from the previous post.

For the two-class case, the decision rule for $\bf x$ in class $\omega_1$ is

\[P({\bf x}|\omega_1)P(\omega_1) > P({\bf x}|\omega_2)P(\omega_2)\]
\[L_R({\bf x})=\frac{P({\bf x}|\omega_1)}{P({\bf x}|\omega_2)}>\frac{P(\omega_2)}{P(\omega_1)}=\tau_c\]
That is, we assign $\bf x$ to class $\omega_1$ when the likelihood ratio of class $\omega_1$ to $\omega_2$ exceeds the ratio of the prior probability of $\omega_2$ to that of $\omega_1$, called the threshold value $\tau_c$.

For the same two-class case, let x range from -8 to 8 in steps of 0.1. The values of x that are classified as class $\omega_1$ (blue) are

xl=-8:0.1:8;
w1=find(normpdf(xl,-1,1)./normpdf(xl,1,1) > 0.4/0.6);
h=bar(xl(w1),0.6*normpdf(xl(w1),-1,1),1); hold on;
set(h,'EdgeColor',get(h,'FaceColor'));
w2=setxor(1:length(xl),w1);
h=bar(xl(w2),0.4*normpdf(xl(w2),1,1),1,'r'); hold off;
set(h,'EdgeColor',get(h,'FaceColor'));
xlabel('x');

The last value of x that belongs to class $\omega_1$ is
>> xl(w1(end))
ans =
    0.2000

which is also at the decision boundary.

Classification Error

An error is made when an observation $\bf{x}$ that actually belongs to class region $\Omega_i$ is classified into the wrong region $\Omega_i^c$, which is any region except $\Omega_i$.

The probability of error is calculated by integrating over all $\bf{x}$.

\[P(error)=\sum\limits_{i=1}^n\int\limits_{\Omega_i^c}P({\bf x}|\omega_i)P(\omega_i)d\bf x\]
Using this example, we can estimate the classification error as follows

xl=-8:.1:8;
pc1=0.6*normpdf(xl,-1,1);
pc2=0.4*normpdf(xl,1,1);
% The decision boundary is where the two curves meet.
ind1=find(pc1>=pc2);
ind2=find(pc1<pc2);
pmis1=sum(pc1(ind2))*.1;
pmis2=sum(pc2(ind1))*.1;
error=pmis1+pmis2;

>> error
error =
    0.1539
 
The error 0.1539 corresponds to the shaded area.
For the two-class case, the decision boundary is

\[P(x|\omega_1)P(\omega_1)=P(x|\omega_2)P(\omega_2)\]
We can use this script to find the decision boundary of the above problem.

>> l=find(xl>-2 & xl <2);
>> xpos=find(abs(pc1(l)-pc2(l))<0.001);
>> xl(l(xpos))

ans =
    0.2000

Thus, the decision boundary is where x=0.2.

Wednesday, June 9, 2010

Bayes Decision Rule

Given a feature vector $\bf x$, we assign it to class $\omega_j$ if

\[P(\omega_j|{\bf{x}})>P(\omega_i|{\bf{x}});\;\;\;i=1,...,n;i\ne j\]
or
\[P({\bf{x}}|\omega_j)P(\omega_j)>P({\bf{x}}|\omega_i)P(\omega_i);\;\;\;i=1,...,n;i\ne j\]
The $P(\omega)$ is the prior and $P({\bf{x}}|\omega)$ is the likelihood or class-conditional probability. Using these two quantities, we assign $\bf{x}$ to whichever class has the highest posterior probability $P(\omega|{\bf{x}})$.

The following MATLAB code demonstrates an application of the Bayes decision rule. Suppose that the class-conditionals are given by normal densities $\phi(x;\mu,\sigma^2)$:

\[P(x|\omega_1)= \phi(x;-1,1)\]
\[P(x|\omega_2)= \phi(x;1,1)\]
and the priors are
\[P(\omega_1)=0.6\]
\[P(\omega_2)=0.4\]
What class does the $x = -0.75$ belong to?
>> x=-0.75;
>> normpdf(x,-1,1)*0.6
ans =
    0.2320

>> normpdf(x,1,1)*0.4
ans =
    0.0345
Since 0.232 > 0.0345, then x belongs to class 1.
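Dividing by the evidence $P(x)$ (a small sketch, not part of the original snippet) turns these two scores into actual posterior probabilities:

>> px = normpdf(x,-1,1)*0.6 + normpdf(x,1,1)*0.4;  % evidence P(x)
>> post1 = normpdf(x,-1,1)*0.6/px                  % P(w1|x), about 0.87
>> post2 = normpdf(x,1,1)*0.4/px                   % P(w2|x), about 0.13

The normalization does not change the decision, since both scores are divided by the same P(x).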

Tuesday, June 8, 2010

Optimal Histogram Bin Width

Have you ever wondered about the choice of the number of bins when making a histogram? I have. It seems that I often play around with the number until the resulting histogram looks good enough.

There are two parameters to consider when making a histogram a bona fide probability density (the total histogram area equals one): the number of bins $k$ and the bin width $h$.

If we know the bin width $h$, the number of bins $k$ is
\[k=\left\lceil\frac{max(x)-min(x)}{h}\right\rceil\]

A number of methods to calculate the optimal number of bins or the bin width have been proposed. A few are listed below.

Sturges' formula
\[k=\left\lceil \log_2 n+1\right\rceil\]
Scott's choice
\[h=\frac{3.5\sigma}{n^{1/3}}\]
Freedman-Diaconis' choice
\[h=2\frac{IQR(x)}{n^{1/3}}\]

For example, using geyser data
load geyser
n=length(geyser);

The Scott's choice is
>> h=3.5*std(geyser)*n^(-1/3)
h =
    7.2704

>> k=ceil(diff(minmax(geyser))/h)
k =
     9


The Freedman-Diaconis' choice is
>> h=2*iqr(geyser)*n^(-1/3)
h =
    7.1782

>> k=ceil(diff(minmax(geyser))/h)
k =
     10


These choices yield $k$ = 9 and 10 bins, respectively. Compare these numbers to Sturges' formula
>> k=ceil(log2(n)+1)
k =
    10

So it seems the number of bins should be around 10. The MATLAB code for plotting the bona fide density histogram is
[n,x]=hist(geyser,k);
fhat=n/(sum(n)*h);
bar(x,fhat,1);


Notice that the sum of the histogram area is
>> sum(fhat)*h
ans =
    1.0000

Jackknife-After-Bootstrap

We are familiar with the bootstrap estimate of a statistical quantity such as the standard error. Since the bootstrap value is itself an estimate, it has its own error. How can we calculate that error? The answer is: using the jackknife.

load law
lsat=law(:,1);
gpa=law(:,2);
B=1000;

%bootstrap
[bootstat,bootsam]=bootstrp(B,'corrcoef',lsat,gpa);
n=length(gpa);
jreps=zeros(1,n);

%jackknife
for i=1:n
    [I,J]=find(bootsam==i);
    %remove column containing i-th sample,
    jacksam=setxor(J,1:B);
    bootrep=bootstat(jacksam,2);
    jreps(i)=std(bootrep);
end
varjack = (n-1)/n * sum((jreps-mean(jreps)).^2);
sejack = sqrt(varjack)
gamma=std(bootstat(:,2))

gamma =
    0.1332

sejack =
    0.0839


The result can be interpreted as follows. The standard error of the correlation coefficient for this simulation is $\hat\gamma_B=\hat{SE}_{Boot}(\hat\rho)=0.133$ and the estimated error of this value is $\hat{SE}_{Jack}(\hat\gamma_B)=0.084$.

Monday, June 7, 2010

Bias-Corrected and Accelerated Interval

The $BC_a$ interval was introduced by Efron (1987) and is said to have superior coverage performance, to be range-preserving, and to be an improvement over the percentile confidence interval.

The $BC_a$ interval adjusts the endpoints of the $(1-\alpha)100\%$ interval using two parameters, $\hat a$ and $\hat z_0$.

\[BC_a = (\hat \theta_B^{\alpha_1},\hat \theta_B^{\alpha_2})\]
where
\[\alpha_1=\Phi\left(\hat z_0 + \frac{\hat z_0+z^{(\alpha/2)}}{1-\hat a(\hat z_0+z^{(\alpha/2)})}\right)\]
\[\alpha_2=\Phi\left(\hat z_0 + \frac{\hat z_0+z^{(1-\alpha/2)}}{1-\hat a(\hat z_0+z^{(1-\alpha/2)})}\right)\]
The $\Phi$ denotes the standard normal cumulative distribution function. Therefore, for $\alpha/2=0.05$, we have $z^{\alpha/2}=z^{0.05}=-1.645$, because $\Phi(-1.645)=0.05$.

We can also see that when $\hat a=0$ and $\hat z_0=0$, this $BC_a$ is the same as the percentile interval.

\[\alpha_1=\Phi(z^{\alpha/2})=\alpha/2\]
The bias-correction term is $\hat z_0$, computed from the proportion of bootstrap replicates $\hat\theta^b$ that are less than the statistic $\hat\theta$.
\[\hat z_0 = \Phi^{-1}\left(\frac{\sharp(\hat\theta^b<\hat\theta)}{B}\right)\]
The acceleration term $\hat a$ is
\[\hat a=\frac{-SKEW(\hat\theta^{(-i)})}{6\sqrt n}\]
where $\hat\theta^{(-i)}$ is the value of the statistic computed from the i-th jackknife sample. Please see the detailed definition here.

Luckily for us, Matlab already has the bootci() function and its default method is $BC_a$ !!!

load forearm;
theta = skewness(forearm);
n=length(forearm);
B=2000;
skew = @(x)(skewness(x));
ci=bootci(B,skew,forearm) 

ci =
   -0.3904
    0.1731
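For comparison, a hedged variant of the same call with an explicit alpha and an explicit interval type (bootci also accepts 'per', 'cper', 'norm', and 'stud'):

ci_bca = bootci(B, {skew, forearm}, 'alpha', 0.05, 'type', 'bca')  % default BCa interval
ci_per = bootci(B, {skew, forearm}, 'alpha', 0.05, 'type', 'per')  % percentile interval

The two intervals should be close but not identical, since the percentile method drops the bias and acceleration corrections.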

Sunday, June 6, 2010

Jackknife

The jackknife is a data-partitioning method similar to k-fold cross validation, but its purpose is similar to that of the bootstrap: to estimate the bias and standard error of a statistic (mean, variance, correlation coefficient, etc.).

\[\hat{Bias}_{jack}(T)=(n-1)(\bar T_{jack}-T)\]
\[\hat{SE}_{jack}(T)={\left[\frac{n-1}{n}\sum\limits_{i=1}^n{(T^{(-i)}-\bar T_{jack})}^2\right]}^{1/2}\]
The following jackknife code sets aside one data item at a time and calculates the correlation coefficient $\rho$ between the test scores (lsat) and the GPA from a sample of 15 law students.

load law;
lsat=law(:,1);
gpa=law(:,2);
tmp=corrcoef(gpa,lsat);
T=tmp(1,2);

n=length(gpa);
reps=zeros(1,n);
for i=1:n
    lsatt=lsat([1:i-1,i+1:end]);
    gpat=gpa([1:i-1,i+1:end]);
    tmp=corrcoef(gpat,lsatt);
    reps(i)=tmp(1,2);
end
mureps=mean(reps);
sehat=sqrt((n-1)/n * sum((reps-mureps).^2))
biashat = (n-1)*(mureps-T)

sehat =
    0.1425

biashat =
   -0.0065

Or, we can use the MATLAB jackknife() function.

m=jackknife(@corr,gpa,lsat);
mum=mean(m);
sejhatM=sqrt((n-1)/n * sum((m-mum).^2))
biasjhatM = (n-1)*(mum-T)


sejhatM =
    0.1425

biasjhatM =
   -0.0065


On the other hand, the following code uses the bootstrap to obtain the bias and standard error for the same data.
 
bootstat=bootstrp(200,@corr,gpa,lsat);
sebhat=std(bootstat)
biasbhat=mean(bootstat)-T

sebhat =
    0.1372

biasbhat =
   -0.0041

Saturday, June 5, 2010

Cross Validation

One widely used method of data partitioning is cross validation. The purpose of cross validation is to assess the accuracy of a model, especially in prediction tasks. Naive researchers often test the accuracy of their model with the same data used to build the model, and the result yields a low error. This is because the model has already seen that data, so it may not generalize (apply to other datasets) well.

Let's try a linear regression model. We fit the steam data with the polyfit() function, and then evaluate the model over the domain of the data.

load steam;
[p,s]=polyfit(x,y,1);
xf=linspace(min(x),max(x));
yf=polyval(p,xf);
plot(x,y,'.'); hold on;
plot(xf,yf,'r'); hold off;
xlabel('Average Temperature (\circ F)');
ylabel('Steam per Month (pounds)');
axis tight;
% prediction error
yhat=polyval(p,x);
pe = mean((y-yhat).^2);

The output of the above code is the fitted line over the scatter plot, and the prediction error is
>> pe
pe =
    0.7289


However, as stated before, this is not a good way to evaluate the performance of the model. We should leave some data for testing and use the rest for training the model.

For example, if we hold out 20% of the data for testing, the cross validation looks like this:

%cross validation method 20% testing
indtest=2:5:25;
xtest=x(indtest);
ytest=y(indtest);
xtrain=x;
ytrain=y;
xtrain(indtest)=[];
ytrain(indtest)=[];
[p,s]=polyfit(xtrain,ytrain,1);
yf=polyval(p,xtest);
%prediction error
pe_cross=mean((ytest-yf).^2);

the prediction error for this cross validation method is
>> pe_cross
pe_cross =
    0.9930

The ultimate form of cross validation is called K-fold cross validation. Here, the data are partitioned into K parts. One part is set aside for testing, and the other K-1 parts are used for training (building the model). The process iteratively trains and tests the model K times so that every partition is used for testing once.

k=5;
n=length(x);

indcross=1+mod(randperm(n),k);
pe_ncross=zeros(1,k);
for i=1:k
    indtest=indcross==i;
    xtest=x(indtest);
    ytest=y(indtest);
    xtrain=x(~indtest);
    ytrain=y(~indtest);
    [p,s]=polyfit(xtrain,ytrain,1);
    yf=polyval(p,xtest);
    %prediction error
    pe_ncross(i)=mean((ytest-yf).^2);
end
penf=mean(pe_ncross)


The average prediction error for this 5-fold cross validation is
>> penf
penf =
    0.8732


Compare this to MATLAB's crossval() function:
fitf=@(xtrain,ytrain,xtest)(polyval(polyfit(xtrain,ytrain,1),xtest));
cvMse=crossval('mse',x',y','kfold',5,'predfun',fitf)

cvMse =
    0.8910

Friday, June 4, 2010

Bootstrap Percentile Confidence Interval

(Compared to the Bootstrap-t Confidence Interval, I like this one better because it is easier to understand and compute and it is faster too)

We can directly take the $\alpha/2$ and $(1-\alpha/2)$ quantiles of the bootstrap replicates.

\[(\hat{\theta_B}^{\alpha/2}, \hat{\theta_B}^{(1-\alpha/2)})\]

For example, the bootstrap percentile confidence interval of the skewness of the forearm data is

load forearm;
theta = skewness(forearm);
n=length(forearm);
B=1000;
bootstat=bootstrp(B,'skewness',forearm);
alpha=0.05;
k=B*alpha/2;
szvals=sort(bootstat);
blo = szvals(k);
bhi = szvals(B-k);
[blo,bhi]

ans =

   -0.3940    0.1794
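The same endpoints can be read off more compactly with prctile (a sketch; the values may differ slightly from the order statistics above because prctile interpolates):

ci_pct = prctile(bootstat, 100*[alpha/2, 1-alpha/2])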


Efron and Tibshirani (1993) stated that this method is more stable and has better theoretical coverage properties than the bootstrap-t.

Bootstrap-t Confidence Interval

We can calculate a Student-t style confidence interval using the bootstrap. First, B bootstrap samples are generated, and then for each sample we compute
\[z^b = \frac{{{\hat \theta }^b} - \hat \theta}{\hat{SE^b}}\]
The $\hat\theta$ is the statistic of interest and $\hat\theta^b$ is its bootstrap replicate. The denominator $\hat{SE}^b$ is often calculated by bootstrapping the replicate (a bootstrap within the bootstrap) if the formula for the standard error of $\hat\theta^b$ is not known.

Once we have the $z^b$ values, we find the $\alpha/2$ and $(1-\alpha/2)$ quantiles needed for the endpoints of the interval,
 \[\alpha /2 = \frac{{\sharp ({z^b} \le {{\hat t}^{\alpha /2}})}}{B}\]
and get the interval

\[(\hat \theta - {z^{(1 - \alpha /2)}}S{E_{\hat \theta }},\hat \theta - {z^{(\alpha /2)}}S{E_{\hat \theta }})\]
The following code demonstrates the Bootstrap-t confidence interval of the skewness of the forearm data.

load forearm;
theta = skewness(forearm);
n=length(forearm);
B=1000;
[bootstat,bootsam]=bootstrp(B,'skewness',forearm);
sehats=zeros(size(bootstat));
for i=1:B
    x = forearm(bootsam(:,i));
    sehats(i)=std(bootstrp(25,'skewness',x));
end
zvals = (bootstat - theta)./sehats;

SE=std(bootstat);
alpha=0.05;
k=B*alpha/2;
szvals=sort(zvals);
tlo = szvals(k);
thi = szvals(B-k);
blo=theta - thi*SE;
bhi=theta - tlo*SE;
[blo,bhi]

ans =


   -0.3829    0.1730 


It is recommended that the number of bootstrap replicates, B, should be 1,000 or more.
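As a cross-check (hedged, not from the original post), bootci can produce a studentized interval directly; the size of its inner bootstrap is controlled by the 'nbootstd' option:

ci_stud = bootci(B, {@skewness, forearm}, 'type', 'stud', 'nbootstd', 25)

This mirrors the manual loop above, with bootci handling the inner standard-error bootstrap.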

Thursday, June 3, 2010

Bootstrap Estimate of Standard Error and Bias

Bootstrapping is a method similar to Monte Carlo with one difference. Instead of sampling from a theoretical population distribution, we are sampling from the data (random sample) with replacement.

For example, the following code calculates the skewness of the forearm data.

load forearm
theta = skewness(forearm);


Now we want to find out the standard error (i.e., the standard deviation) of this estimate. Notice that this is not the standard deviation of the data. The bootstrap resampling process is as follows:

load forearm
theta = skewness(forearm);
n=length(forearm);
B=100; % how many bootstrap replicates
inds=unidrnd(n,n,B); % nxB
xboot=forearm(inds); % nxB
thetaB=skewness(xboot);
seB=std(thetaB);

>> seB
seB =
    0.1498

Or we can directly use MATLAB's bootstrap function:

Bmat = bootstrp(B,'skewness',forearm);
seBmat = std(Bmat);

>> seBmat
seBmat =
    0.1493


Efron and Tibshirani [1993] suggested that taking more than 200 bootstrap replicates to estimate the standard error is unnecessary.

Now, for the bias, the bootstrap estimate can easily be calculated as

meanB = mean(thetaB);
biasB = meanB - theta;

>> biasB
biasB =
   -0.0075


E&T also recommended $B\ge400$.

Wednesday, June 2, 2010

Monte Carlo Assessment of Type I and II Error

Can we assess the result of performing an inference test when the assumptions of standard methods are violated? For example, suppose we assumed that the population distribution is normal when it is not. We can use Monte Carlo simulation to estimate that result.

A Type I error occurs when we reject the null hypothesis H0 when it is true. By sampling from a (pseudo) population that represents H0 and recording which samples lead to a Type I error, the recorded proportion can then be used as an estimate of the probability of making a Type I error.

The procedure is similar for estimating the Type II error, but we must sample from the alternative hypothesis.

The following code is for the MC assessment of type I error of a test.

H0:  $\mu=45$
H1:  $\mu > 45$

Notice that this is an upper-tail test.

n=100; %sampling size
m=1000; % mc iteration
sigma=15;
alpha=0.05;
mu=45;
sigxbar=sigma/sqrt(n);
cv = norminv(1-alpha,0,1);
Im = 0;
for i=1:m
    xs = sigma*randn(1,n)+mu;
    Tm=(mean(xs)-mu)/sigxbar;
    if Tm >= cv % reject H0
        Im = Im+1;
    end
end
alphahat = Im/m;


In the end, the value of alphahat is
>> alphahat
alphahat =
    0.0490


The value 0.0490 is close to the theoretical result of 0.05.

The following code estimates the Type II error; it continues from the code above.
Im = 0;
mu1 = 47.2;
for i=1:m
    xs = sigma*randn(1,n)+mu1;
    Tm=(mean(xs)-mu)/sigxbar;
    if Tm < cv % no reject H0
        Im = Im+1;
    end
end
betahat = Im/m;


>> betahat
betahat =
    0.5740


The theoretical value for beta is:

>> normcdf(cv,(mu1-mu)/sigxbar,1)
ans =
    0.5707


or alternatively

>> normcdf(cv*sigxbar+mu,mu1,sigxbar)
ans =
    0.5707

Tuesday, June 1, 2010

Monte Carlo Hypothesis Testing (P-Value)

Instead of finding the critical value from the simulated distribution of the test statistic, we use the simulated samples to calculate the p-value.

load mcdata;
n=length(mcdata);
sigma=7.8;
M=1000; % number of Monte Carlo trials
 
%start simulation to get sample
Tm = zeros(1,M);
for i=1:M
    xs = sigma*randn(1,n)+454;
    Tm(i)=mean(xs);
end
tobs = mean(mcdata);
ind=find(Tm <= tobs);
pvalhat=length(ind)/M


>> mcp_hypotest

pvalhat =

    0.0060


If the significance level is $\alpha=0.05$, this estimated p-value is less than $\alpha$, so we would reject H0.

Monte Carlo Hypothesis Testing (Critical Value)

Instead of getting the critical value from a given distribution and a given $\alpha$, we can draw samples directly from the population and calculate their test statistics in order to find the critical value.

This is the same as saying that we find the critical value from the simulated distribution of the test statistics.

For example, suppose we are looking at the mcdata and testing the hypotheses
H0: $\mu = 454$
H1: $\mu < 454$

Therefore, this is a lower tail test.
Given that the population sigma is 7.8.
We first check whether the population comes from a normal distribution using normplot().

>> load mcdata;
>> normplot(mcdata);
 Now let's do some Monte Carlo

n=length(mcdata);
alpha=0.05;
M=1000; % number of Monte Carlo trials
sigma=7.8;
sigxbar = sigma/sqrt(n);
 

% start simulation to get a critical value
Tm = zeros(1,M);
for i=1:M
    %Generate a random sample under H0
    xs = sigma*randn(1,n)+454;
    Tm(i)=(mean(xs)-454)/sigxbar;
end
cv = quantile(Tm,alpha)
 

%get the observed test statistics
tobs = (mean(mcdata)-454)/sigxbar

>> mc_hypotest
cv =
   -1.6439

tobs =
   -2.5641

Since the observed value of test statistic is less than the critical value, we reject the H0.

Notice that the only difference between the Monte Carlo method and regular hypothesis testing is how the critical value is obtained. Usually, for Monte Carlo, we do not know what type of distribution the population has. However, in this example, we choose the normal distribution and draw samples from it.

Monday, May 31, 2010

Confidence Intervals

What does a 95% confidence interval mean? It is related to the Type I error and, therefore, to $\alpha$. If we set $\alpha$ to 0.05, the confidence level is $(1-\alpha)\times 100\%$. The 95% is the area (0.95) under the distribution curve between $z_{\alpha/2}$ and $z_{1-\alpha/2}$.

>> z2=norminv(0.025,0,1)
z2 =
   -1.9600
>> normspec([z2,-z2],0,1)
ans =
    0.9500

The shaded area is computed from
 \[P({z^{\alpha /2}} < Z < {z^{1-\alpha /2}}) = 1-\alpha \]
where
 \[Z = \frac{{\bar X - \mu }}{{\sigma /\sqrt n }}\]
therefore
 \[P({z^{\alpha /2}} < \frac{{\bar X-\mu }}{{\sigma /\sqrt n }} < {z^{1-\alpha /2}}) = 1-\alpha \]
and some rearrangements
\[P(\bar X - {z^{1 - \alpha /2}}\frac{\sigma }{{\sqrt n }} < \mu  < \bar X - {z^{\alpha /2}}\frac{\sigma }{{\sqrt n }}) = 1-\alpha \]
Thus the lower limit is \[\bar X - z^{1-\alpha/2}\frac{\sigma}{\sqrt n}\] and the upper limit is \[\bar X - z^{\alpha/2}\frac{\sigma}{\sqrt n}\]
For the travel time problem ($\bar X=47.2$,$\sigma /\sqrt n=1.5$), if we set the $\alpha$ to be 0.05, the confidence interval is:

 >> interval = 47.2 - [-z2 z2]*1.5
interval =
   44.2601   50.1399
 

That means we are 95% confident that the real average travel time is between 44.26 and 50.14 minutes, an interval which does contain the comparison value of 45 minutes.
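Equivalently (a small sketch), both endpoints can be obtained in one line with norminv:

>> interval = 47.2 + norminv([0.025 0.975],0,1)*1.5   % same [44.26, 50.14] interval as above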


P-value Approach to Hypothesis Testing

Recall that in the critical value approach to hypothesis testing, one must set the alpha in order to get the critical value. Then, if the test statistic is more extreme than this critical value, the null hypothesis is rejected. Alternatively, the p-value is the probability of observing a value of the test statistic (e.g. the mean) as extreme as the one observed. For example, the p-value for a lower-tail test is

\[p-value = {P_{{H_0}}}(T \le {t_0})\]
and for an upper-tail test it is
\[p-value = {P_{{H_0}}}(T \ge {t_0})\]
where $t_0$ is the observed value of the test statistic $T$, and $P_{H_0}$ denotes probability under the null hypothesis. The p-value is sometimes referred to as the observed significance level.

For example, the p-value of an average travel time of 47.2 against the previous average of 45, with a standard error of 1.5, is

>> mu=45;
>> sig=1.5;
>> xbar=47.2;
>> zobs = (xbar-mu)/sig;
>> pval=1-normcdf(zobs,0,1)
pval =
0.0712

The p-value is 0.0712. If we do the hypothesis test at the 0.05 significance level, we would not reject the null hypothesis.

Power of a Hypothesis Test

Remember $\beta$ (the Type II error)? One convenient way to assess the performance of a hypothesis test is to determine the probability of not making a Type II error. This is called the power of the test: $1-\beta$. In other words, it is the probability of rejecting H0 when it really is false.

Here is the illustration of the beta:
>> normspec([-inf,1.645],3,1)
ans =
    0.0877

>> l=linspace(-4,4,100);
>> n=normpdf(l,0,1);
>> hold on;
>> plot(l,n,'r')
>> title("Power = 0.0877")


And here is the power:
>> normspec([1.645,inf],3,1)
ans =
    0.9123

>> plot(l,n,'r')
>> title('Power = 1-Beta = 0.9123')

Sunday, May 30, 2010

Critical Value Approach to Hypothesis Testing

Usually, we define a test statistic by
\[z = \frac{{\bar x - {\mu_0}}}{{{\sigma_{\bar X}}}} = \frac{{\bar x - {\mu_0}}}{{{\sigma_X}/\sqrt n }}\]
Let's do a practical example. In the year 2000, the average travel time in town A was 45 minutes. A recent survey in 2010 of the same town, with n=100 commute times, found that the average travel time is 47.2 minutes with standard deviation $\sigma=15$ minutes. Has the average commute time really increased?

Here we have the statistical hypotheses:
H0: $\mu=45$ minutes
H1: $\mu> 45$ minutes

Thus, based on the above information
\[z = \frac{{47.2 - 45}}{{15/\sqrt {100} }} = \frac{{2.2}}{{1.5}} = 1.47\]

This is an upper tail test: 
\[P_{H_0}(T \le t) = 1- \alpha \]
We want to conduct this test at the significance level $\alpha=0.05$. Conceptually, the rejection region is the top 5% of the area under the standard normal curve, as shown by the following MATLAB code

>> cv = norminv(0.95,0,1)
cv =
    1.6449

>> p=normspec([cv,inf],0,1)
p =
    0.0500




The value 1.6449 is the critical value for rejecting H0 at $\alpha=0.05$. Since the test statistic 1.47 is less than the critical value, we do not reject H0. This means the average travel time in 2010 is not significantly different from that in 2000.

Error in Hypothesis Testing

There are two types of errors that can occur when we make a decision in statistical hypothesis testing. The first is a Type I error, which means we reject H0 when it is really true. The other is called a Type II error, which means we accept H0 when it is really false.

I am always confused about these two, despite many years of statistical background :(

In short,
Type I Error:  Incorrectly reject the H0
Type II Error: Incorrectly NOT reject the H0

We denote the symbols for the probability of Type I error is $\alpha$ and of Type II error is $\beta$.

The $\alpha$ is sometimes called the significance level of the test, and it represents the maximum probability of a Type I error that will be tolerated. Typical values of $\alpha$ are 0.01, 0.05, and 0.10.

I am getting a better understanding of Type I and Type II errors now that they can be linked to True Positive, False Positive, False Negative, and True Negative, as follows.

Actually, the main confusion comes from the name "positive". Here, "positive" means "rejecting H0" and "negative" means "failing to reject H0", OK?

Therefore, the 2x2 matrix of test/actual conditions is

                     H0 actually false      H0 actually true
Reject H0            True Positive (TP)     False Positive (FP)
Do not reject H0     False Negative (FN)    True Negative (TN)
Moreover, while a Type I error is a False Positive (FP), the $\alpha$ is not the same as "false positive" as some references suggest; it is the "false positive rate" FP/(FP+TN). This way, it makes more sense for $1-\alpha$ to be called the specificity = TN/(FP+TN).

On the other hand, a Type II error is a False Negative (FN), and the $\beta$ is the "false negative rate" FN/(TP+FN). The power, $1-\beta$, which is also called the sensitivity, is TP/(TP+FN).
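A small hypothetical numeric example (the counts are invented purely for illustration) ties these rates together:

TP = 40; FN = 10;          % target class cases (H0 actually false)
FP = 5;  TN = 45;          % non-target class cases (H0 actually true)
alpha = FP/(FP+TN)         % Type I error rate, false positive rate
beta  = FN/(TP+FN)         % Type II error rate, false negative rate
sensitivity = TP/(TP+FN)   % power = 1 - beta
specificity = TN/(FP+TN)   % 1 - alpha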

Statistical Hypothesis Testing

In hypothesis testing, there are two hypotheses. The first one is called the null hypothesis H0, which is usually the hypothesis we would like to test. The second one is called the alternative hypothesis H1. Rejecting H0 leads to accepting H1.

Technically, H0 should be chosen so that if we make the wrong decision, the consequences are less severe. For example, in a murder case, a suspect is put on trial. The judge could frame the hypotheses in two ways.

H0: The suspect is the murderer.
H1: The suspect is innocent.

or
H0: The suspect is innocent.
H1: The suspect is the murderer.

Can you guess which one every judge uses? Of course, the latter. By assuming that the suspect is innocent (even when he is not), the judge may let the suspect go (a wrong decision), but nobody else dies in the process. Under the first framing, however, the judge may send an innocent person to be executed.

In product inspection for a manufacturing process, H0 is the assumption that the product is good, and H1 is the opposite.

Why? Well, try to think in terms of the cost of inspection from the manufacturer's point of view.

Tuesday, May 25, 2010

Gelman and Rubin Method for MCMC Convergence

A common problem when running an MCMC chain is deciding when to stop it. It should be stopped when it has converged to the target distribution. Several ideas have been proposed, but let's investigate the Gelman and Rubin method.

The Gelman and Rubin (G&R) method is based on multiple chains running at the same time, with starting points chosen to be far apart. When a characteristic of these chains, e.g. the mean, converges to about the same value across chains, it is about time to stop.

The G&R method uses the between-sequence and within-sequence variances. The two variances are then used to calculate a factor called the "estimated potential scale reduction" $\hat R$.

The chains should continue to run while $\hat R$ is still high, until it drops below 1.1 or 1.2.
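A rough sketch of $\hat R$ for m chains of length n stored in the rows of a matrix chains (a hypothetical variable; the book's example below uses csgelrub() instead):

[m,n] = size(chains);            % m chains, n draws each
chainmeans = mean(chains,2);
B = n*var(chainmeans);           % between-sequence variance
W = mean(var(chains,0,2));       % within-sequence variance
varplus = (n-1)/n*W + B/n;       % pooled estimate of the posterior variance
Rhat = sqrt(varplus/W)           % keep sampling while Rhat > 1.1 or 1.2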

Example 14.9 of the Computational Statistics Handbook with MATLAB shows the use of the csgelrub() function for 4 MCMC (Metropolis-Hastings) chains. The monitored characteristic is the mean of each chain. After the chains have stopped, we can check the results:

>> nu(:,end)
ans =
    0.0248
   -0.0574
    0.0318
   -0.0196

>> rhat(end)
ans =
    1.0073



Monday, May 24, 2010

Gibbs Sampler of Bivariate Normal

Let's say that we want to generate samples from a bivariate normal with the following parameter.
\[\mu=\left[\begin{array}{c}\mu_1\\\mu_2\end{array}\right]=\left[\begin{array}{c}1\\2\end{array}\right]\]
\[\Sigma=\left[\begin{array}{cc}1 & \rho\\\rho & 1\end{array}\right]=\left[\begin{array}{cc}1 & 0.9\\0.9 & 1\end{array}\right]\]
The conditional distribution $f(x_1|x_2)$ is univariate normal $N(\mu_1+\rho(x_2-\mu_2), 1-\rho^2)$ and $f(x_2|x_1)$ is $N(\mu_2+\rho(x_1-\mu_1), 1-\rho^2)$.

The Matlab code for the Gibbs sampling is

n=5000;
x=zeros(n,2);
rho=0.9;
mu=[1;2];
sig=sqrt(1-rho^2);
%initial point
x(1,:)=[10,10];
for i=2:n   
    x(i,1)=normrnd(mu(1)+rho*(x(i-1,2)-mu(2)),sig);
    x(i,2)=normrnd(mu(2)+rho*(x(i,1)-mu(1)),sig);
end

and the result after 1,000 samples burn-in is

plot(x(1001:end,1),x(1001:end,2),'.');
axis square;
To check the result, calculate the sample mean and sample correlation matrix of x

>> mean(x)
ans =
    0.9942    1.9960
>> corrcoef(x)
ans =
    1.0000    0.8943
    0.8943    1.0000

This problem is, in fact, easily solved with Matlab's mvnrnd().

covm=[1 rho;rho 1];
X = mvnrnd(mu',covm,4000);  % transpose: mvnrnd expects mu as a row vector
figure;
plot(X(:,1),X(:,2),'.');
axis square;