Saturday, June 12, 2010

Receiver Operating Characteristic (ROC) Curve

The Receiver Operating Characteristic (ROC) curve is a characteristic of a classifier. It is a plot of P(CC), the "true positive rate" for the class $\omega_1$ (target class), against P(FA), the "false positive rate" for the class $\omega_2$ (non-target class).

It can also be read as answering the question: "If we set $\alpha$ to this value, what would P(correct classification) be?"
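For the normal class-conditionals used below, where x is assigned to $\omega_1$ whenever it falls below the decision boundary $C$, both rates are left-tail probabilities of $C$. Writing $\Phi(\cdot;\mu,\sigma)$ for the normal CDF,

\[P(FA)=\Phi(C;\mu_2,\sigma_2),\;\;\; P(CC)=\Phi(C;\mu_1,\sigma_1)\]
Eliminating $C$ gives
\[P(CC)=\Phi\left(\Phi^{-1}(P(FA);\mu_2,\sigma_2);\mu_1,\sigma_1\right)\]
which is what the normcdf(norminv(...)) line in the code computes.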

For example, the following code yields the theoretical ROC curve (black, dotted) and the empirical classification ROC curve (blue) for

\[p(x|\omega_1)\sim N(-1,1),\;\;\; p(\omega_1)=0.6\]
\[p(x|\omega_2)\sim N(1,1),\;\;\; p(\omega_2)=0.4\]
n=1000;
u=rand(1,n);             % uniform draws used to pick each sample's class
x=zeros(size(u));
ind1 = u<= 0.60;         % samples from class 1 (prior 0.6)
ind2 = ~ind1;            % samples from class 2 (prior 0.4)
n1=sum(ind1);
n2=sum(ind2);
x(ind1)=randn(1,n1)-1;   % class 1 data: N(-1,1)
x(ind2)=randn(1,n2)+1;   % class 2 data: N(1,1)
 

% P(false alarm), alpha, False Positive Rate
pfa = 0.01:.01:0.99; 
% theoretical line
pcc=normcdf(norminv(pfa,1,1),-1,1);
plot(pfa,pcc,'k:'); hold on;

% building a classifier
% (setting false alarm value)
pfa2= zeros(size(pfa)); % empirical false alarm rate
pcc = zeros(size(pfa)); % empirical P(correct classification)

% find model parameters
mu2=mean(x(ind2));
sd2=std(x(ind2));
% get decision boundary for each P(fa)
xi=norminv(pfa,mu2,sd2); 
for i=1:length(xi)
    c1 = x < xi(i);  % classified as class 1
    pcc(i)=sum(c1 & ind1)/n1;
    pfa2(i)=sum(c1 & ind2)/n2;
end

plot(pfa2,pcc,'b'); hold off;
xlabel('P(FA)');
ylabel('P(CC_{\omega_1})');


Estimating Likelihood function (parametric)

The likelihood function, or class-conditional probability $P({\bf x}|\omega_i)$, can be estimated directly from the data. The parametric approach is to assume that the data from each class come from a normal distribution with its own mean and covariance. This means we need to estimate these two parameters from the data of each class and use them to evaluate the probability at $\bf x$.

\[P({\bf x}|\omega_i)=P({\bf x}|\mu_i,\sigma_i^2)\]
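In the multivariate case (as evaluated with mvnpdf below), the class-conditional is the multivariate normal density with mean vector $\mu_i$ and covariance matrix $\Sigma_i$, where $d$ is the dimension of $\bf x$:

\[P({\bf x}|\omega_i)=\frac{1}{(2\pi)^{d/2}|\Sigma_i|^{1/2}}\exp\left(-\frac{1}{2}({\bf x}-\mu_i)^T\Sigma_i^{-1}({\bf x}-\mu_i)\right)\]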

For example, using the iris data, we first estimate the parameters of the setosa, versicolor and virginica classes

load iris;

muset = mean(setosa);
muver=mean(versicolor);
muvir=mean(virginica);
covset=cov(setosa);
covver=cov(versicolor);
covvir=cov(virginica);

There are 50 samples from each class. Then we evaluate each class-conditional density over the whole data set and plot the three curves.

X=[setosa;versicolor;virginica];
pset=mvnpdf(X,muset,covset);
pver=mvnpdf(X,muver,covver);
pvir=mvnpdf(X,muvir,covvir);

plot(pset,'b'), hold on;
plot(pver,'r');
plot(pvir,'g'); hold off;

The result is shown below (setosa in blue, versicolor in red, virginica in green).

Thursday, June 10, 2010

False Alarm in Classification

There are two types of error in hypothesis testing: the Type I error $\alpha$ and the Type II error $\beta$. In classification terms, H0 corresponds to the non-target class and H1 to the target class. Therefore, $\alpha$ is the probability of misclassifying an observation as a target when it is not.

The Type I error $\alpha$ here means a false alarm (FA), or false positive.

\[P(FA)=\alpha\]

For example, in the previous example the target class (H1) is $\omega_1$ and the non-target class (H0) is $\omega_2$, so the P(FA) at the decision boundary x = 0.2 is

>> 0.4*normcdf(0.2,1,1)
ans =
    0.0847

Now for the reverse question: if we want to set P(FA) to 0.05, what should the decision boundary be?

Notice that
\[P(FA)=\int\limits_{-\infty}^C P(x|\omega_2)P(\omega_2)dx\]
We want to find the value x=C such that
\[\int\limits_{-\infty}^C P(x|\omega_2)dx=\frac{P(FA)}{P(\omega_2)}\]
Thus,
>> norminv(0.05/0.4,1,1)
ans =
   -0.1503
The decision boundary is at x = -0.1503.
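As a quick check, plugging this boundary back into the P(FA) expression should recover the desired value (up to rounding):
>> 0.4*normcdf(-0.1503,1,1)
ans =
    0.0500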

Likelihood Ratio Approach to Classification

Let's continue from the previous post.

For the two-class case, the decision rule for $\bf x$ in class $\omega_1$ is

\[P({\bf x}|\omega_1)P(\omega_1) > P({\bf x}|\omega_2)P(\omega_2)\]
\[L_R({\bf x})=\frac{P({\bf x}|\omega_1)}{P({\bf x}|\omega_2)}>\frac{P(\omega_2)}{P(\omega_1)}=\tau_c\]
That is, $\bf x$ is assigned to class $\omega_1$ when the likelihood ratio $L_R({\bf x})$ exceeds the ratio of the priors $P(\omega_2)/P(\omega_1)$, called the threshold value $\tau_c$.
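As a quick check, evaluating the likelihood ratio at the point $x=-0.75$ from the Bayes decision rule post gives a value well above the threshold $\tau_c=0.4/0.6\approx 0.67$, so that point is again assigned to class $\omega_1$:
>> normpdf(-0.75,-1,1)/normpdf(-0.75,1,1)
ans =
    4.4817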

For the same two-class case, let x take values from -8 to 8 in steps of 0.1. The values of x that are classified as class $\omega_1$ (blue) are

xl=-8:0.1:8;
w1=find(normpdf(xl,-1,1)./normpdf(xl,1,1) > 0.4/0.6);
h=bar(xl(w1),0.6*normpdf(xl(w1),-1,1),1); hold on;
set(h,'EdgeColor',get(h,'FaceColor'));
w2=setxor(1:length(xl),w1);
h=bar(xl(w2),0.4*normpdf(xl(w2),1,1),1,'r'); hold off;
set(h,'EdgeColor',get(h,'FaceColor'));
xlabel('x');

The last value of x that belongs to class $\omega_1$ is
>> xl(w1(end))
ans =
    0.2000

which is also at the decision boundary.

Classification Error

An error is made when an observation $\bf{x}$ that actually belongs to class $\omega_i$ falls in the region $\Omega_i^c$, i.e., anywhere outside the region $\Omega_i$ assigned to class $\omega_i$.

The probability of error is calculated by integrating over all $\bf{x}$.

\[P(error)=\sum\limits_{i=1}^n\int\limits_{\Omega_i^c}P({\bf x}|\omega_i)P(\omega_i)d\bf x\]
Using this example, we can estimate the classification error as follows

xl=-8:.1:8;
pc1=0.6*normpdf(xl,-1,1);
pc2=0.4*normpdf(xl,1,1);
% The decision boundary is where the two curves meet.
ind1=find(pc1>=pc2);
ind2=find(pc1<pc2);
pmis1=sum(pc1(ind2))*.1;
pmis2=sum(pc2(ind1))*.1;
error=pmis1+pmis2;

>> error
error =
    0.1539
 
The error 0.1539 is the shaded area.
For the two-class case, the decision boundary is

\[P(x|\omega_1)P(\omega_1)=P(x|\omega_2)P(\omega_2)\]
We can use this script to find the decision boundary of the above problem.

>> l=find(xl>-2 & xl <2);
>> xpos=find(abs(pc1(l)-pc2(l))<0.001);
>> xl(l(xpos))

ans =
    0.2000

Thus, the decision boundary is at x = 0.2.
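With the boundary known, the same error can be checked exactly with the normal CDFs, which agrees with the Riemann-sum estimate above:
>> 0.6*(1-normcdf(0.2,-1,1)) + 0.4*normcdf(0.2,1,1)
ans =
    0.1538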

Wednesday, June 9, 2010

Bayes Decision Rule

Given a feature vector $\bf x$, we assign it to class $\omega_j$ if

\[P(\omega_j|{\bf{x}})>P(\omega_i|{\bf{x}});\;\;\;i=1,...,n;i\ne j\]
or
\[P({\bf{x}}|\omega_j)P(\omega_j)>P({\bf{x}}|\omega_i)P(\omega_i);\;\;\;i=1,...,n;i\ne j\]
$P(\omega)$ is the prior and $P({\bf{x}}|\omega)$ is the likelihood, or class-conditional probability. Using these two quantities, we assign $\bf{x}$ to whichever class has the highest posterior probability $P(\omega|{\bf{x}})$.

The following MATLAB code demonstrates an application of the Bayes decision rule. Suppose that the class-conditionals are given by $\phi(x;\mu,\sigma^2)$:

\[P(x|\omega_1)= \phi(x;-1,1)\]
\[P(x|\omega_2)= \phi(x;1,1)\]
and the priors are
\[P(\omega_1)=0.6\]
\[P(\omega_2)=0.4\]
What class does $x = -0.75$ belong to?
>> x=-0.75;
>> normpdf(x,-1,1)*0.6
ans =
    0.2320

>> normpdf(x,1,1)*0.4
ans =
    0.0345
Since 0.2320 > 0.0345, x belongs to class $\omega_1$.
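The two products above are unnormalized posteriors; dividing by their sum (the evidence $P(x)$) gives the posterior probabilities themselves, for example:
>> p1=normpdf(x,-1,1)*0.6;
>> p2=normpdf(x,1,1)*0.4;
>> p1/(p1+p2)
ans =
    0.8705
so $P(\omega_1|x=-0.75)\approx 0.87$.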

Tuesday, June 8, 2010

Optimal Histogram Bin Width

Have you ever wondered about the choice of the number of bins when making a histogram? I have. It seems that I often play around with the number until the resulting histogram looks good enough.

There are two parameters to consider when making a histogram a bona fide probability density (the total histogram area equals one): the number of bins $k$ and the bin width $h$.

If we know the bin width $h$, the number of bins $k$ is
\[k=\left\lceil\frac{max(x)-min(x)}{h}\right\rceil\]

A number of methods to calculate the optimal number of bins or the bin width have been proposed. Some are listed below.

Sturges' formula
\[k=\left\lceil log_2n+1\right\rceil\]
Scott's choice
\[h=\frac{3.5\sigma}{n^{1/3}}\]
Freedman-Diaconis' choice
\[h=2\frac{IQR(x)}{n^{1/3}}\]

For example, using geyser data
load geyser
n=length(geyser);

Scott's choice gives
>> h=3.5*std(geyser)*n^(-1/3)
h =
    7.2704

>> k=ceil(diff(minmax(geyser))/h)
k =
     9


Freedman-Diaconis' choice gives
>> h=2*iqr(geyser)*n^(-1/3)
h =
    7.1782

>> k=ceil(diff(minmax(geyser))/h)
k =
     10


These choices yield $k$ = 9 and $k$ = 10 bins, respectively. Compare these numbers to Sturges' formula
>> k=ceil(log2(n)+1)
k =
    10

So it seems the number of bins should be around 10. The MATLAB code for plotting the bona fide density histogram is
[n,x]=hist(geyser,k);
fhat=n/(sum(n)*h);
bar(x,fhat,1);


Notice that the total histogram area is
>> sum(fhat)*h
ans =
    1.0000
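
For convenience, the three rules can be wrapped into a small helper function. This is only a sketch; binwidths is a name introduced here, and it assumes the Statistics Toolbox iqr function used above is available.

function [k,h] = binwidths(x,rule)
% BINWIDTHS  Number of bins k and bin width h for the data vector x,
% using rule = 'sturges', 'scott' or 'fd' (Freedman-Diaconis).
n = length(x);
r = max(x) - min(x);             % data range
switch lower(rule)
    case 'sturges'
        k = ceil(log2(n) + 1);
        h = r/k;
    case 'scott'
        h = 3.5*std(x)*n^(-1/3);
        k = ceil(r/h);
    case 'fd'
        h = 2*iqr(x)*n^(-1/3);
        k = ceil(r/h);
end

For the geyser data, [k,h] = binwidths(geyser,'fd') should reproduce the k = 10 and h = 7.1782 computed above.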