Saturday, June 19, 2010

Akaike and Bayesian Information Criteria

Usually, when we have data, we want to know what (parametric) model the data came from. In other words, we want to fit a model to the data. Different models consist of different numbers of parameters, or of different parameters altogether.

When there are several models to choose from, we estimate the model parameters using maximum likelihood estimation. We may be tempted to increase the likelihood by adding more parameters, but that may result in overfitting.

The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) address the overfitting problem by penalizing the number of parameters in the model.

The formula for the AIC is

\[AIC = 2k - 2\ln(L)\]
and, the formula for the BIC is

\[-2\cdot \ln\;p(x|k)\approx BIC=-2\cdot \ln L + k \ln(n)\]
where $x$ is the observed data, $n$ the number of data points in $x$, $k$ the number of parameters, $p(x|k)$ the likelihood of the observed data given the number of parameters and $L$ the value of the likelihood function for the estimated model.
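Before using the aicbic() function below, it may help to see the formulas computed by hand. The log likelihood, k and n in this sketch are hypothetical numbers, just to make the arithmetic concrete.

% Computing AIC and BIC by hand (the values here are hypothetical).
logL = -1876.5;   % maximized log likelihood of a fitted model
k    = 2;         % number of estimated parameters
n    = 1000;      % number of observations

aic = 2*k - 2*logL        % Akaike Information Criterion
bic = -2*logL + k*log(n)  % Bayesian Information Criterion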

For example, let's generate some random values
x = binornd(10,0.5,1,1000);

and pretend that we do not know which distribution they came from. There are two models we want to fit to the data: $X\sim Bino(10,0.5)$ and $X\sim Poiss(5)$.

First, try plotting the data along with the two theoretical distributions.
[n,xi]=hist(x);
bar(xi,n/(sum(n)*(xi(2)-xi(1))),1)
hold on;
l=min(x):max(x);
px=binopdf(l,10,0.5);
py=poisspdf(l,5);
plot(l,px,'r.');  % theoretical binomial dist
plot(l,py,'rx');  % theoretical poisson dist
hold off;
axis tight;

 Now, try using the AIC and BIC for model selection
px=binopdf(x,10,0.5);
px(px == 0)=[];      % drop zero-probability points to avoid log(0) = -Inf
llx=sum(log(px));    % log likelihood under Bino(10,0.5)
py=poisspdf(x,5);
py(py == 0)=[];      % drop zero-probability points
lly=sum(log(py));    % log likelihood under Poiss(5)
[aic,bic]=aicbic([llx,lly],[2,1],[length(px),length(py)])


aic =
  1.0e+003 *
    3.7569    3.9542

bic =
  1.0e+003 *
    3.7667    3.9591


The first model $X\sim Bino(10,0.5)$ has lower AIC and BIC, so it is a better model.

Next, try using different random values.

x = poissrnd(5,1,10000);

aic =
  1.0e+004 *
    4.4925    4.4069

bic =
  1.0e+004 *
    4.4939    4.4076


This time the second model $X\sim Poiss(5)$ has lower AIC and BIC so it is a better model for the data.

Log Likelihood Function

In maximum likelihood estimation (MLE), let $Y_1,...,Y_n$ be $n$ independent random variables (r.v.'s) with probability density functions (pdfs) $f_i(y_i;\bf\theta)$ depending on a vector-valued parameter $\bf\theta$. The likelihood of the data is just the value of the n-variate joint probability density function

\[f({\bf y};{\bf\theta})=\prod\limits_{i=1}^n f_i(y_i;{\bf\theta})=L({\bf\theta};{\bf y})\]
The function $L({\bf\theta};{\bf y})$, viewed as a function of the unknown parameter ${\bf\theta}$ given the data ${\bf y}$, is called the likelihood function.

Analytically, maximizing a product is more difficult than maximizing a sum, so we take the natural logarithm of the likelihood function, which gives the log likelihood function
\[\log L({\bf\theta};{\bf y})=\sum\limits_{i=1}^n \log f_i(y_i;{\bf\theta})\]
For example, suppose the outcome of an experiment is a series of random values. What is the (log) likelihood that this outcome came from $X \sim Bino(10,0.5)$?

First, generate some random values, but let's assume that we do not know the parameters that generated them, i.e., we don't know they came from Bino(10,0.5).

X=binornd(10,0.5,1,100);

The log likelihood value of X with parameter (10,0.6) is
 >> sum(log(binopdf(X,10,0.6)))
ans =
 -200.0399

The log likelihood value of X with parameter (10,0.5) is
>> sum(log(binopdf(X,10,0.5)))
ans =
 -184.4945

And the log likelihood value of X with parameters (11,0.5) is
>> sum(log(binopdf(X,11,0.5)))
ans =
 -187.4201

Of the three values above, the log likelihood of the parameters (10,0.5) is the highest, so from the MLE standpoint we should select this model.

Tuesday, June 15, 2010

Model-based Clustering

Model-Based Clustering (MBC) assumes that the data come from several subpopulations, each modeled separately by $g_k({\bf x}; {\bf \theta_k})$. The overall population is therefore a finite mixture model.

\[f({\bf x};p,{\bf \theta})=\sum\limits_{k=1}^c p_k g_k({\bf x}; {\bf \theta_k})\]
The weights $p_k$ are also called the mixing proportions or mixing coefficients.

The most commonly used component density is the multivariate normal (Gaussian), so MBC can assume that the data come from a multivariate Gaussian finite mixture (a small sketch follows the list below).

There are three components of the MBC.
1. Agglomerative clustering is used to partition the data. Each partition then becomes the initial starting values for the EM algorithm.
2. The EM algorithm is used to estimate the parameters of the finite mixture.
3. The Bayesian Information Criterion (BIC) is used to choose the best grouping and number of clusters, given the data. It is an approximation to Bayes factors.
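As a rough sketch of steps 2 and 3 (skipping the agglomerative initialization in step 1), the following code fits Gaussian mixtures with different numbers of components and keeps the one with the lowest BIC. It uses the Statistics Toolbox function gmdistribution.fit() with its default initialization, and the two-dimensional data X is simulated just for illustration.

% Sketch: pick the number of mixture components by BIC.
% X is simulated here; in practice it would be the observed data.
X = [randn(100,2)-2; randn(100,2)+2];   % two hypothetical subpopulations

maxc = 4;                      % candidate numbers of clusters
bicv = zeros(1,maxc);
for c = 1:maxc
    gm = gmdistribution.fit(X,c);   % EM estimation of the Gaussian mixture
    bicv(c) = gm.BIC;               % BIC of the fitted mixture
end
[minbic,best] = min(bicv);     % the smallest BIC picks the number of clusters
best

With well-separated simulated clusters like these, the BIC should usually pick two components.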

Saturday, June 12, 2010

False Alarm Probability to Likelihood Ratio Threshold

Let's start from something easier to understand. We know from the "likelihood ratio approach to classification" that

\[P({\bf x}|\omega_1)P(\omega_1) > P({\bf x}|\omega_2)P(\omega_2)\]
\[L_R({\bf x})=\frac{P({\bf x}|\omega_1)}{P({\bf x}|\omega_2)}>\frac{P(\omega_2)}{P(\omega_1)}=\tau_c\]
The $\tau_c$ provides us with an automatic decision boundary that we can use to calculate the false alarm probability, P(FA), later.

In the previous post, we saw that we can set up an arbitrary decision boundary by setting the P(FA). That is, we can either directly find the P(FA) quantile of the observations of the non-target class $\omega_2$,

 xq=quantile(x(ind2),pfa);

or use the inverse cumulative distribution function at P(FA) if we assume the observations come from a known probability distribution

 xi=norminv(pfa,mu2,sd2);

The question now is: can we derive the threshold from the P(FA)? If so, the classification task reduces to comparing the likelihood ratio of the two classes against the threshold and assigning the target class when the ratio exceeds it, while knowing exactly what the false alarm probability is.

It turns out that, if we do cross validation on $\omega_2$ and calculate the likelihood ratio
\[L_r=\frac{P(x_i^2|\omega_1)}{P(x_i^2|\omega_2)}\]
for each test point $x_i$, we will get a distribution of $L_r$. From there, the quantile $\hat q_{1-P(FA)}$ is the threshold at P(FA). (A higher quantile means a higher threshold, which makes it harder for an observation to be assigned to the target class and thus implies a lower P(FA).)

The following code demonstrates the two approaches to getting the threshold from P(FA). The first approach (red) takes the 1-P(FA) quantile of the likelihood ratios, and the second (blue) directly takes the P(FA) quantile of the data and then calculates the threshold.

% p(x|w1) ~ N(-1,1), p(w1)=0.6
% p(x|w2) ~ N(1,1), p(w2)=0.4
n=1000;
u=rand(1,n);
x=zeros(size(u));
ind1 = u<= 0.60;
ind2 = ~ind1;
n1=sum(ind1);
n2=sum(ind2);
x(ind1)=randn(1,n1)-1;
x(ind2)=randn(1,n2)+1; 


mu1=mean(x(ind1));
sd1=std(x(ind1));
xtrain = x(ind2);
lr2=zeros(1,n2);
 

% cross validation on w2
for i=1:n2
    xi=xtrain(i);
    xtrain2=xtrain;
    xtrain2(i)=[];
    mu2=mean(xtrain2);
    sd2=std(xtrain2);
    lr2(i)=normpdf(xi,mu1,sd1)/normpdf(xi,mu2,sd2);
end

pfa = 0.01:0.01:0.99;

% approach #1
threshold=quantile(lr2,1-pfa);


% approach #2
xq=quantile(x(ind2),pfa);
threshold2=normpdf(xq,mu1,sd1)./normpdf(xq,mu2,sd2);

plot(pfa,threshold,'r'); hold on;
plot(pfa,threshold2,'b'); hold off;


Both of them are almost identical, with only slight differences.

>> sum(abs(threshold-threshold2))
ans =
    1.7067

Receiver Operating Characteristic (ROC) Curve

The Receiver Operating Characteristic (ROC) curve is a characteristic of a classifier. It is a plot of P(FA), the "false positive rate" for the non-target class $\omega_2$, versus P(CC), the "true positive rate" for the target class $\omega_1$.

It answers the question: "If we set $\alpha$ to this value, what would the probability of correct classification be?"

For example, the following code yields a theoretical ROC curve (black, dotted) and an empirical classification ROC curve (blue) for

\[p(x|\omega_1)\sim N(-1,1),\;\;\; p(\omega_1)=0.6\]
\[p(x|\omega_2)\sim N(1,1),\;\;\; p(\omega_2)=0.4\]
n=1000;
u=rand(1,n);
x=zeros(size(u));
ind1 = u<= 0.60;
ind2 = ~ind1;
n1=sum(ind1);
n2=sum(ind2);
x(ind1)=randn(1,n1)-1;
x(ind2)=randn(1,n2)+1;
 

% P(false alarm), alpha, False Positive Rate
pfa = 0.01:.01:0.99; 
% theoretical line
pcc=normcdf(norminv(pfa,1,1),-1,1);
plot(pfa,pcc,'k:'); hold on;

% building a classifier
% (setting false alarm value)
pfa2= zeros(size(pfa)); % false alarm : result
pcc = zeros(size(pfa)); % P(correct classification)

% find model parameters
mu2=mean(x(ind2));
sd2=std(x(ind2));
% get decision boundary for each P(fa)
xi=norminv(pfa,mu2,sd2); 
for i=1:length(xi)
    c1=x < xi(i);  % classified as class 1
    pcc(i)=sum(c1 & ind1)/n1;
    pfa2(i)=sum(c1 & ind2)/n2;
end

plot(pfa2,pcc,'b'); hold off;
xlabel('P(FA)');
ylabel('P(CC_{\omega_1})');


Estimating Likelihood function (parametric)

The likelihood function, or class-conditional probability $P({\bf x}|\omega_i)$, can be estimated directly from the data. The parametric approach assumes that the data from each class come from a normal distribution with its own mean and covariance. This means that we need to estimate these two parameters from the data of each class and use them to evaluate the probability at $\bf x$.

\[P({\bf x}|\omega_i)=P({\bf x}|{\bf \mu}_i,\Sigma_i)\]

For example, using iris data, we first get the parameters of the setosa, versicolor and virginica class

load iris;

muset = mean(setosa);
muver=mean(versicolor);
muvir=mean(virginica);
covset=cov(setosa);
covver=cov(versicolor);
covvir=cov(virginica);

There are 50 samples from each class. Then, we evaluate the likelihood of every iris observation under each of the three class models.

X=[setosa;versicolor;virginica];
pset=mvnpdf(X,muset,covset);
pver=mvnpdf(X,muver,covver);
pvir=mvnpdf(X,muvir,covvir);

plot(pset,'b'), hold on;
plot(pver,'r');
plot(pvir,'g'); hold off;

The result is shown below (setosa in blue, versicolor in red, virginica in green).

Thursday, June 10, 2010

False Alarm in Classification

There are two types of error in hypothesis testing: the Type I error $\alpha$ and the Type II error $\beta$. In classification terms, H0 corresponds to the non-target class and H1 to the target class. Therefore, $\alpha$ is the probability that we misclassify an observation as a target when it is not.

In other words, the Type I error $\alpha$ here means a false alarm (FA), or false positive.

\[P(FA)=\alpha\]

For example, the target class (H1) of the previous example is $\omega_1$ and the non-target class (H0) is $\omega_2$. The P(FA) at the decision boundary x=0.2 is

>> 0.4*normcdf(0.2,1,1)
ans =
    0.0847

Now for the reverse question: if we want to set the P(FA) to 0.05, what would the decision boundary be?

Notice that
\[P(FA)=\int\limits_{-\infty}^C P(x|\omega_2)P(\omega_2)dx\]
We want to find the value x=C, such that
\[\int\limits_{-\infty}^C P(x|\omega_2)dx=\frac{P(FA)}{P(\omega_2)}\]
Thus,
>> norminv(0.05/0.4,1,1)
ans =
   -0.1503
The decision boundary is where x= -0.1503

Likelihood Ratio Approach to Classification

Let's continue from the previous post.

For the two-class case, the decision rule for $\bf x$ in class $\omega_1$ is

\[P({\bf x}|\omega_1)P(\omega_1) > P({\bf x}|\omega_2)P(\omega_2)\]
\[L_R({\bf x})=\frac{P({\bf x}|\omega_1)}{P({\bf x}|\omega_2)}>\frac{P(\omega_2)}{P(\omega_1)}=\tau_c\]
That is, $\bf x$ is assigned to class $\omega_1$ when the likelihood ratio of class $\omega_1$ to $\omega_2$ exceeds the ratio of the prior probability of class $\omega_2$ to that of $\omega_1$, called the threshold value $\tau_c$.

For the same two-class case, let x be the values from -8 to 8 in steps of 0.1. The values of x that are classified as class $\omega_1$ (blue) are

xl=-8:0.1:8;
w1=find(normpdf(xl,-1,1)./normpdf(xl,1,1) > 0.4/0.6);
h=bar(xl(w1),0.6*normpdf(xl(w1),-1,1),1); hold on;
set(h,'EdgeColor',get(h,'FaceColor'));
w2=setxor(1:length(xl),w1);
h=bar(xl(w2),0.4*normpdf(xl(w2),1,1),1,'r'); hold off;
set(h,'EdgeColor',get(h,'FaceColor'));
xlabel('x');

The last value of x that belongs to class $\omega_1$ is
>> xl(w1(end))
ans =
    0.2000

which is also at the decision boundary.

Classification Error

An error is made when an observation $\bf{x}$ that actually belongs to class region $\Omega_i$ is classified into the wrong region $\Omega_i^c$, which is any region except $\Omega_i$.

The probability of error is calculated by integrating over all $\bf{x}$.

\[P(error)=\sum\limits_{i=1}^n\int\limits_{\Omega_i^c}P({\bf x}|\omega_i)P(\omega_i)d\bf x\]
Using this example, we can estimate the classification error as follows

xl=-8:.1:8;
pc1=0.6*normpdf(xl,-1,1);
pc2=0.4*normpdf(xl,1,1);
% The decision boundary is where the two curves meet.
ind1=find(pc1>=pc2);   % region classified as class 1
ind2=find(pc1<pc2);    % region classified as class 2
pmis1=sum(pc1(ind2))*.1;
pmis2=sum(pc2(ind1))*.1;
error=pmis1+pmis2;

>> error
error =
    0.1539
 
The error 0.1539 corresponds to the shaded area in the plot.
For the two-class case, the decision boundary is

\[P(x|\omega_1)P(\omega_1)=P(x|\omega_2)P(\omega_2)\]
We can use this script to find the decision boundary of the above problem.

>> l=find(xl>-2 & xl <2);
>> xpos=find(abs(pc1(l)-pc2(l))<0.001);
>> xl(l(xpos))

ans =
    0.2000

Thus, the decision boundary is where x=0.2.

Wednesday, June 9, 2010

Bayes Decision Rule

Given a feature vector $\bf x$, we assign it to class $\omega_j$ if

\[P(\omega_j|{\bf{x}})>P(\omega_i|{\bf{x}});\;\;\;i=1,...,n;i\ne j\]
or
\[P({\bf{x}}|\omega_j)P(\omega_j)>P({\bf{x}}|\omega_i)P(\omega_i);\;\;\;i=1,...,n;i\ne j\]
The $P(\omega)$ is the prior and $P({\bf{x}}|\omega)$ is the likelihood, or class-conditional probability. Using these two quantities, we assign $\bf{x}$ to whichever class has the highest posterior probability $P(\omega|{\bf{x}})$.

The following MATLAB code demonstrates an application of the Bayes decision rule. Suppose that the class-conditionals are given by $\phi(x;\mu,\sigma^2)$:

\[P(x|\omega_1)= \phi(x;-1,1)\]
\[P(x|\omega_2)= \phi(x;1,1)\]
and the priors are
\[P(\omega_1)=0.6\]
\[P(\omega_2)=0.4\]
What class does the $x = -0.75$ belong to?
>> x=-0.75;
>> normpdf(x,-1,1)*0.6
ans =
    0.2320

>> normpdf(x,1,1)*0.4
ans =
    0.0345
Since 0.232 > 0.0345, then x belongs to class 1.

Tuesday, June 8, 2010

Optimal Histogram Bin Width

Have you ever wondered about the choice of the number of bins when making a histogram? I have. It seems that I often play around with the number until the resulting histogram looks good enough.

There are two parameters to consider when making a histogram a bona fide probability density (the total histogram area equals one): the number of bins $k$ and the bin width $h$.

If we know the bin width $h$, the number of bins $k$ is
\[k=\left\lceil\frac{max(x)-min(x)}{h}\right\rceil\]

A number of methods to calculate the optimal number of bins or the bin width have been proposed. Some are listed below.

Sturges' formula
\[k=\left\lceil \log_2 n+1\right\rceil\]
Scott's choice
\[h=\frac{3.5\sigma}{n^{1/3}}\]
Freedman-Diaconis' choice
\[h=2\frac{IQR(x)}{n^{1/3}}\]

For example, using geyser data
load geyser
n=length(geyser);

The Scott's choice is
>> h=3.5*std(geyser)*n^(-1/3)
h =
    7.2704

>> k=ceil(diff(minmax(geyser))/h)
k =
     9


The Freedman-Diaconis' choice is
>> h=2*iqr(geyser)*n^(-1/3)
h =
    7.1782

>> k=ceil(diff(minmax(geyser))/h)
k =
     10


Each of these choices yields the number of bins $k$ as 9 and 10, respectively. Compare these numbers to Sturges' formula
>> k=ceil(log2(n)+1)
k =
    10

So, it seems the number of bins should be around 10. The Matlab code for plotting the bona fide density histogram is
[n,x]=hist(geyser,k);
fhat=n/(sum(n)*h);
bar(x,fhat,1);


Notice that the sum of the histogram area is
>> sum(fhat)*h
ans =
    1.0000

Jackknife-After-Bootstrap

We are familiar with bootstrap estimates of statistical parameters, such as the standard error. Since the bootstrap value is itself an estimate, it has its own error. How can we calculate the bootstrap error? The answer is: using the jackknife.

load law
lsat=law(:,1);
gpa=law(:,2);
B=1000;

%bootstrap
[bootstat,bootsam]=bootstrp(B,'corrcoef',lsat,gpa);
n=length(gpa);
jreps=zeros(1,n);

%jackknife
for i=1:n
    [I,J]=find(bootsam==i);
    %remove column containing i-th sample,
    jacksam=setxor(J,1:B);
    bootrep=bootstat(jacksam,2);
    jreps(i)=std(bootrep);
end
varjack = (n-1)/n * sum((jreps-mean(jreps)).^2);
sejack = sqrt(varjack)
gamma=std(bootstat(:,2))

gamma =
    0.1332

sejack =
    0.0839


The result can be interpreted as follows. The standard error of the correlation coefficient for this simulation is $\hat\gamma_B=\hat{SE}_{Boot}(\hat\rho)=0.133$ and the estimated error of this value is $\hat{SE}_{Jack}(\hat\gamma_B)=0.084$.

Monday, June 7, 2010

Bias-Corrected and Accelerated Interval

The $BC_a$ interval was introduced by Efron (1987) and is said to be superior in coverage performance, range preservation, etc.; it is an improvement over the percentile confidence interval.

The $BC_a$ interval adjusts the endpoints of the $(1-\alpha)100\%$ interval by two parameters, $\hat a$ and $\hat z_0$.

\[BC_a = (\hat \theta_B^{\alpha_1},\hat \theta_B^{\alpha_2})\]
where
\[\alpha_1=\Phi\left(\hat z_0 + \frac{\hat z_0+z^{\alpha/2}}{1-\hat a(\hat z_0+z^{\alpha/2})}\right)\]
\[\alpha_2=\Phi\left(\hat z_0 + \frac{\hat z_0+z^{(1-\alpha/2)}}{1-\hat a(\hat z_0+z^{(1-\alpha/2)})}\right)\]
The $\Phi$ denotes the standard normal cumulative distribution function. Therefore, for $\alpha/2=0.05$, we have $z^{\alpha/2}=z^{0.05}=-1.645$, because $\Phi(-1.645)=0.05$.

We can also see that when $\hat a=0$ and $\hat z_0=0$, this $BC_a$ is the same as the percentile interval.

\[\alpha_1=\Phi(z^{\alpha/2})=\alpha/2\]
The bias-correction term $\hat z_0$ is computed from the proportion of bootstrap replicates $\hat\theta^b$ that are less than the statistic $\hat\theta$.
\[\hat z_0 = \Phi^{-1}\left(\frac{\sharp(\hat\theta^b<\hat\theta)}{B}\right)\]
The acceleration term $\hat a$ is
\[\hat a=\frac{-SKEW(\hat\theta^{(-i)})}{6\sqrt n}\]
where $\hat\theta^{(-i)}$ is the value of the statistic computed from the i-th jackknife sample.
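For intuition, here is a rough sketch of how the two adjustment terms can be computed directly from bootstrap and jackknife replicates, using the forearm skewness example from below. It follows the formulas above and is not meant to reproduce the exact internals of bootci().

% Rough sketch of the BCa adjustments for the skewness of the forearm data
% (follows the formulas above, not necessarily the internals of bootci).
load forearm;
theta = skewness(forearm);
n = length(forearm);
B = 2000;
alpha = 0.05;

bootstat = bootstrp(B,@skewness,forearm);    % bootstrap replicates
z0hat = norminv(sum(bootstat < theta)/B);    % bias-correction term
jack  = jackknife(@skewness,forearm);        % leave-one-out replicates
ahat  = -skewness(jack)/(6*sqrt(n));         % acceleration term

zlo = norminv(alpha/2);                      % z^(alpha/2)
zhi = norminv(1-alpha/2);                    % z^(1-alpha/2)
a1 = normcdf(z0hat + (z0hat+zlo)/(1-ahat*(z0hat+zlo)));
a2 = normcdf(z0hat + (z0hat+zhi)/(1-ahat*(z0hat+zhi)));
ci = quantile(bootstat,[a1 a2])              % adjusted percentile endpoints

The resulting interval should be close to, though not exactly the same as, the bootci() answer below.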

Luckily for us, Matlab already has the bootci() function and its default method is $BC_a$ !!!

load forearm;
theta = skewness(forearm);
n=length(forearm);
B=2000;
skew = @(x)(skewness(x));
ci=bootci(B,skew,forearm) 

ci =
   -0.3904
    0.1731

Sunday, June 6, 2010

Jackknife

The jackknife is a partitioning method similar to k-fold cross validation, but its purpose is similar to that of the bootstrap: to estimate the bias and standard error of a statistic (mean, variance, correlation coefficient, etc.).

\[\hat{Bias}_{jack}(T)=(n-1)(\bar T_{jack}-T)\]
\[\hat{SE}_{jack}(T)={\left[\frac{n-1}{n}\sum\limits_{i=1}^n{(T^{(-i)}-\bar T_{jack})}^2\right]}^{1/2}\]
The following jackknife code sets aside one data item at a time and calculates the correlation coefficient $\rho$ between the test scores (lsat) and the GPA from a sample of 15 law students.

load law;
lsat=law(:,1);
gpa=law(:,2);
tmp=corrcoef(gpa,lsat);
T=tmp(1,2);

n=length(gpa);
reps=zeros(1,n);
for i=1:n
    lsatt=lsat([1:i-1,i+1:end]);
    gpat=gpa([1:i-1,i+1:end]);
    tmp=corrcoef(gpat,lsatt);
    reps(i)=tmp(1,2);
end
mureps=mean(reps);
sehat=sqrt((n-1)/n * sum((reps-mureps).^2))
biashat = (n-1)*(mureps-T)

sehat =
    0.1425

biashat =
   -0.0065

Or, we can use the MATLAB jackknife() function.

m=jackknife(@corr,gpa,lsat);
mum=mean(m);
sejhatM=sqrt((n-1)/n * sum((m-mum).^2))
biasjhatM = (n-1)*(mum-T)


sejhatM =
    0.1425

biasjhatM =
   -0.0065


On the other hand, the following code uses the bootstrap to obtain the bias and standard error of the same data.
 
bootstat=bootstrp(200,@corr,gpa,lsat);
sebhat=std(bootstat)
biasbhat=mean(bootstat)-T

sebhat =
    0.1372

biasbhat =
   -0.0041

Saturday, June 5, 2010

Cross Validation

One widely used method of data partitioning is cross validation. The purpose of cross validation is to assess the accuracy of a model, especially for prediction tasks. Naive researchers often test the accuracy of their model on the same data that was used to build the model, and the result yields a low error. This is because the model has already seen that data, so the model may not generalize (apply to other datasets) well.

Let's try a linear regression model. We fit the steam data with the polyfit() function and then evaluate the model over the domain of the data.

load steam;
[p,s]=polyfit(x,y,1);
xf=linspace(min(x),max(x));
yf=polyval(p,xf);
plot(x,y,'.'); hold on;
plot(xf,yf,'r'); hold off;
xlabel('Average Temperature (\circ F)');
ylabel('Steam per Month (pounds)');
axis tight;
% prediction error
yhat=polyval(p,x);
pe = mean((y-yhat).^2);

The output of the above code is shown in the plot, and the prediction error is
>> pe
pe =
    0.7289


However, as stated before, this is not the way to evaluate the performance of the model. We should leave some data out for testing and use the rest for training the model.

For example, if we leave out 20% of the data for testing, the cross validation method looks like this:

%cross validation method 20% testing
indtest=2:5:25;
xtest=x(indtest);
ytest=y(indtest);
xtrain=x;
ytrain=y;
xtrain(indtest)=[];
ytrain(indtest)=[];
[p,s]=polyfit(xtrain,ytrain,1);
yf=polyval(p,xtest);
%prediction error
pe_cross=mean((ytest-yf).^2);

the prediction error for this cross validation method is
>> pe_cross
pe_cross =
    0.9930

The ultimate form of the cross validation method is K-fold cross validation. Here, the data is partitioned into K parts. One part is set aside for testing, and the other (K-1) parts are used for training (building the model). The process iteratively tests and trains the model K times to cover all the partitions.

k=5;
n=length(x);

indcross=1+mod(randperm(n),k);
pe_ncross=zeros(1,k);
for i=1:k
    indtest=indcross==i;
    xtest=x(indtest);
    ytest=y(indtest);
    xtrain=x(~indtest);
    ytrain=y(~indtest);
    [p,s]=polyfit(xtrain,ytrain,1);
    yf=polyval(p,xtest);
    %prediction error
    pe_ncross(i)=mean((ytest-yf).^2);
end
penf=mean(pe_ncross)


The average prediction error for this 5-fold cross validation is
>> penf
penf =
    0.8732


Compare this to Matlab's crossval() function.
fitf=@(xtrain,ytrain,xtest)(polyval(polyfit(xtrain,ytrain,1),xtest));
cvMse=crossval('mse',x',y','kfold',5,'predfun',fitf)

cvMse =
    0.8910

Friday, June 4, 2010

Bootstrap Percentile Confidence Interval

(Compared to the Bootstrap-t Confidence Interval, I like this one better because it is easier to understand and compute, and it is faster too.)

We can directly find the $\alpha/2$-th and $(1-\alpha/2)$-th quantiles of the bootstrap replicates.

\[(\hat{\theta_B}^{\alpha/2}, \hat{\theta_B}^{(1-\alpha/2)})\]

For example, the bootstrap percentile confidence interval of the skewness of the forearm data is

load forearm;
theta = skewness(forearm);
n=length(forearm);
B=1000;
bootstat=bootstrp(B,'skewness',forearm);
alpha=0.05;
k=B*alpha/2;
szvals=sort(bootstat);
blo = szvals(k);
bhi = szvals(B-k);
[blo,bhi]

ans =

   -0.3940    0.1794


Efron and Tibshirani (1993) stated that this method is more stable and has better theoretical coverage properties than the bootstrap-t.

Bootstrap-t Confidence Interval

We can calculate a Student-t style confidence interval using the bootstrap. First, B bootstrap samples are generated, and then for each sample we compute
\[z^b = \frac{{{\hat \theta }^b} - \hat \theta}{\hat{SE^b}}\]
The $\hat\theta$ is the statistic we want to estimate and $\hat\theta^b$ is its bootstrap replicate. The denominator $\hat{SE^b}$ is often calculated using a bootstrap of the replicate (a bootstrap within the bootstrap) if a formula for the standard error of $\hat\theta^b$ is not known.

Once we have the $z^b$ values, we find the $\alpha/2$-th and $(1-\alpha/2)$-th quantiles needed for the endpoints of the interval.
 \[\alpha /2 = \frac{{\sharp ({z^b} \le {{\hat t}^{\alpha /2}})}}{B}\]
and get the interval

\[(\hat \theta - z^{(1 - \alpha /2)}SE_{\hat \theta},\;\hat \theta - z^{(\alpha /2)}SE_{\hat \theta})\]
The following code demonstrates the Bootstrap-t confidence interval of the skewness of the forearm data.

load forearm;
theta = skewness(forearm);
n=length(forearm);
B=1000;
[bootstat,bootsam]=bootstrp(B,'skewness',forearm);
sehats=zeros(size(bootstat));
for i=1:B
    x = forearm(bootsam(:,i));
    sehats(i)=std(bootstrp(25,'skewness',x));
end
zvals = (bootstat - theta)./sehats;

SE=std(bootstat);
alpha=0.05;
k=B*alpha/2;
szvals=sort(zvals);
tlo = szvals(k);
thi = szvals(B-k);
blo=theta - thi*SE;
bhi=theta - tlo*SE;
[blo,bhi]

ans =


   -0.3829    0.1730 


It is recommended that the number of bootstrap replicates, B, should be 1,000 or more.

Thursday, June 3, 2010

Bootstrap Estimate of Standard Error and Bias

Bootstrapping is a method similar to Monte Carlo simulation, with one difference: instead of sampling from a theoretical population distribution, we sample from the data (a random sample) with replacement.

For example, the following code calculates the skewness of the forearm data.

load forearm
theta = skewness(forearm);


Now we want to find the standard error (i.e., the standard deviation of this estimate) of the measurement. Notice that this is not the standard deviation of the data. The bootstrap resampling process is as follows

load forearm
theta = skewness(forearm);
n=length(forearm);
B=100; % how many bootstrap replicates
inds=unidrnd(n,n,B); % nxB
xboot=forearm(inds); % nxB
thetaB=skewness(xboot);
seB=std(thetaB);

>> seB
seB =
    0.1498

Or we can directly use MATLAB bootstrap function

Bmat = bootstrp(B,'skewness',forearm);
seBmat = std(Bmat);

>> seBmat
seBmat =
    0.1493


Efron and Tibshirani [1993] suggested that taking more than 200 bootstrap replicates to estimate the standard error is unnecessary.

Now, for the bias, the bootstrap estimate can easily be calculated as

meanB = mean(thetaB);
biasB = meanB - theta;

>> biasB
biasB =
   -0.0075


E&T also recommended $B\ge400$.

Wednesday, June 2, 2010

Monte Carlo Assessment of Type I and II Error

Can we assess the result of performing an inference test when the assumptions of standard methods are violated? For example, suppose we assumed that the population distribution is normal when it is not. We can use Monte Carlo simulation to estimate that result.

A Type I error occurs when we reject the null hypothesis H0 when it is true. By sampling from a (pseudo) population that represents H0 and recording which samples produce a Type I error, the recorded proportion can then be used as an estimate of the probability of making a Type I error.

The procedure is similar for estimating the Type II error, but we must sample from the alternative hypothesis.

The following code is for the MC assessment of type I error of a test.

H0:  $\mu=45$
H1:  $\mu > 45$

Notice that this is an upper-tail test.

n=100; %sampling size
m=1000; % mc iteration
sigma=15;
alpha=0.05;
mu=45;
sigxbar=sigma/sqrt(n);
cv = norminv(1-alpha,0,1);
Im = 0;
for i=1:m
    xs = sigma*randn(1,n)+mu;
    Tm=(mean(xs)-mu)/sigxbar;
    if Tm >= cv % reject H0
        Im = Im+1;
    end
end
alphahat = Im/m;


In the end, the value of alphahat is
>> alphahat
alphahat =
    0.0490


The value 0.0490 is close to the theoretical result of 0.05.

The following code is for estimating Type II error. The code continues from the above code.
Im = 0;
mu1 = 47.2;
for i=1:m
    xs = sigma*randn(1,n)+mu1;
    Tm=(mean(xs)-mu)/sigxbar;
    if Tm < cv % no reject H0
        Im = Im+1;
    end
end
betahat = Im/m;


>> betahat
betahat =
    0.5740


The theoretical value for beta is:

>> normcdf(cv,(mu1-mu)/sigxbar,1)
ans =
    0.5707


or alternatively

>> normcdf(cv*sigxbar+mu,mu1,sigxbar)
ans =
    0.5707

Tuesday, June 1, 2010

Monte Carlo Hypothesis Testing (P-Value)

Instead of finding the critical value from the simulated distribution of the test statistic, we use the samples to calculate the p-value.

load mcdata;
n=length(mcdata);
sigma=7.8;
M=1000; % number of Monte Carlo trials
 
%start simulation to get sample
Tm = zeros(1,M);
for i=1:M
    xs = sigma*randn(1,n)+454;
    Tm(i)=mean(xs);
end
tobs = mean(mcdata);
ind=find(Tm <= tobs);
pvalhat=length(ind)/M


>> mcp_hypotest

pvalhat =

    0.0060


If the significance level is $\alpha=0.05$, this estimated p-value is less than $\alpha$, so we would reject H0.

Monte Carlo Hypothesis Testing (Critical Value)

Instead of getting the critical value from a given distribution and a given $\alpha$, we can draw samples directly from the population and calculate their test statistics in order to find the critical value.

This is the same as saying that we find the critical value from the simulated distribution of the test statistics.

For example, suppose we are looking at the mcdata and testing the hypotheses
H0: $\mu = 454$
H1: $\mu < 454$

Therefore, this is a lower tail test.
Given that the population sigma is 7.8.
We first check whether the population comes from a normal distribution using normplot().

>> load mcdata;
>> normplot(mcdata);
 Now let's do some Monte Carlo

n=length(mcdata);
alpha=0.05;
M=1000; % number of Monte Carlo trials
sigma=7.8;
sigxbar = sigma/sqrt(n);
 

% start simulation to get a critical value
Tm = zeros(1,M);
for i=1:M
    %Generate a random sample under H0
    xs = sigma*randn(1,n)+454;
    Tm(i)=(mean(xs)-454)/sigxbar;
end
cv = quantile(Tm,alpha)
 

%get the observed test statistics
tobs = (mean(mcdata)-454)/sigxbar

>> mc_hypotest
cv =
   -1.6439

tobs =
   -2.5641

Since the observed value of test statistic is less than the critical value, we reject the H0.

Notice that the only difference between the Monte Carlo method and the regular Hypothesis testing is how to get the critical value. Usually, for Monte Carlo, we do not know what type of population distribution. However, in this example, we choose the normal distribution and try drawing samples out of it.