There are several tutorials on how to perform this calculation. I would like to add the technique I recently used in MATLAB. Basically, imagine you throw darts at random (truly uniformly at random) at a square board with a circle inscribed in it; the fraction of darts that land inside the circle approximates the ratio of the two areas, $\pi/4$.
The concept can be implemented in one MATLAB command:
>> 4*mean(rand(1,5000).^2+rand(1,5000).^2 <= 1)
ans =
    3.1416
(The estimate is random and changes on every run; with 5,000 samples, expect only about two correct decimal places rather than $\pi$ to full precision.)
Alternatively, a slightly different approach uses Monte Carlo integration with random numbers drawn from the uniform distribution $x_i \sim \mathrm{Unif}(0,1)$, exploiting the identity $\int_0^1 \sqrt{1-x^2}\,dx = \pi/4$:
>> 4*mean(sqrt(1-rand(1,5000).^2))
ans =
3.141432627301033
Notice that the second approach requires only one series of random numbers instead of two, but it has to perform a more expensive square root operation.
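A quick way to compare the two estimators is to repeat each one many times and look at the spread of the resulting estimates; a minimal sketch (the number of repetitions is an arbitrary choice):
% Repeat each estimator 1000 times with 5000 samples each and
% compare the spread (standard deviation) of the estimates.
nrep = 1000; n = 5000;
est1 = zeros(1,nrep); est2 = zeros(1,nrep);
for r = 1:nrep
    est1(r) = 4*mean(rand(1,n).^2 + rand(1,n).^2 <= 1); % dart throwing
    est2(r) = 4*mean(sqrt(1 - rand(1,n).^2));           % MC integration
end
fprintf('dart throwing : mean %.4f, std %.4f\n', mean(est1), std(est1));
fprintf('MC integration: mean %.4f, std %.4f\n', mean(est2), std(est2));
The integration estimator typically shows a noticeably smaller standard deviation, since $\sqrt{1-x^2}$ varies less over $[0,1]$ than a 0/1 indicator does.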
Sunday, August 1, 2010
Saturday, July 10, 2010
Likelihood Function in Machine Learning
The probability (density) function $p({\bf x}|\theta)$ assigns a probability (density) to any joint configuration of variables $\bf x$ given a fixed parameter $\theta$.
Instead, we can think of $p({\bf x}|\theta)$ as a function of $\theta$ for fixed $\bf x$:
\[L(\theta;{\bf x}) = p({\bf x}|\theta)\]
\[l(\theta;{\bf x})=\log p({\bf x}|\theta)\]
This function is called the (log) "likelihood".
Learning then corresponds to choosing the $\theta$ that maximizes some cost function $c(\theta)$:
$c(\theta) = l(\theta;D)$ ==> maximum likelihood (ML)
$c(\theta) = l(\theta;D)+r(\theta)$ ==> maximum a posteriori (MAP) / penalized likelihood (BIC, AIC, ...)
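As a tiny concrete example, here is a sketch of the ML case, evaluating $l(\theta;D)$ on a grid of candidate values and taking the argmax (the binomial model, data, and grid are made up for illustration):
% ML for the success probability of a Bino(10, p) model:
% evaluate the log likelihood on a grid and take the argmax.
D = binornd(10, 0.5, 1, 100);   % observed data; the true p is "unknown" to us
p = 0.01:0.01:0.99;             % candidate parameter values
ll = zeros(size(p));
for i = 1:numel(p)
    ll(i) = sum(log(binopdf(D, 10, p(i))));
end
[llmax, imax] = max(ll);
fprintf('ML estimate of p: %.2f\n', p(imax));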
Statistical Machine Learning Problems
There are four basic problems in the machine learning community: density estimation, clustering, classification, and regression.
These problems can also be formulated in a statistical framework (the classification case is illustrated with code after the list).
Regression: $p({\bf y}|{\bf x})=p({\bf y,x})/p({\bf x})=p({\bf y,x})/\int p({\bf y,x})d{\bf y}$
Classification: $p(c|{\bf x})=p(c,{\bf x})/p({\bf x})=p(c,{\bf x})/\sum_c p(c,{\bf x})$
Clustering: $p(c|{\bf x})=p(c,{\bf x})/p({\bf x})$, $c$ is unobserved
Density Estimation: $p({\bf y}|{\bf x})=p({\bf y,x})/p({\bf x})$, $\bf x$ is unobserved
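As referenced above, here is a minimal sketch of the classification formulation, evaluating $p(c|{\bf x})$ by Bayes' rule for two made-up Gaussian classes (the priors, means, and test point are illustrative assumptions):
% Posterior p(c|x) for two classes with known Gaussian class-conditionals,
% via Bayes rule: p(c|x) = p(c,x) / sum_c p(c,x).
x = 0.3;                                  % a test point
prior = [0.6 0.4];                        % p(c)
pxc = [normpdf(x,-1,1) normpdf(x,1,1)];   % p(x|c)
joint = prior .* pxc;                     % p(c,x)
posterior = joint / sum(joint)            % p(c|x); sums to one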
Saturday, June 19, 2010
Akaike and Bayesian Information Criteria
Usually, when we have data, we want to know what (parametric) model the data came from. In other words, we want to fit a model to the data. Different models consist of different numbers of parameters, or of different parameters altogether.
When there are several models to choose from, we estimate model parameters using maximum likelihood estimation. We may want to increase the likelihood by adding more parameters but that may result in overfitting.
The Akaike Information Criterion (AIC) and the Bayesian Information Criterion (BIC) address the overfitting problem by penalizing the number of parameters in the model.
The formula for the AIC is
\[AIC = 2k - 2\ln(L)\]
and, the formula for the BIC is
\[-2\cdot\ln p(x|k)\approx BIC=-2\cdot\ln L + k\ln(n)\]
where $x$ is the observed data, $n$ the number of data points in $x$, $k$ the number of parameters, $p(x|k)$ the likelihood of the observed data given the number of parameters, and $L$ the maximized value of the likelihood function for the estimated model.
For example, let's generate some random values
x = binornd(10,0.5,1,1000);
and pretend that we do not know what distribution they came from. There are two models we want to fit to the data: $X\sim Bino(10,0.5)$ and $X\sim Poiss(5)$.
First, try plotting the data along with the two theoretical probability mass functions.
[n,xi]=hist(x); % histogram counts n at bin centers xi
bar(xi,n/(sum(n)*(xi(2)-xi(1))),1) % normalize the histogram to a density
hold on;
l=min(x):max(x);
px=binopdf(l,10,0.5);
py=poisspdf(l,5);
plot(l,px,'r.'); % theoretical binomial dist
plot(l,py,'rx'); % theoretical poisson dist
hold off;
axis tight;
Now, try using the AIC and BIC for model selection:
px=binopdf(x,10,0.5); % per-point likelihoods under the binomial model
px(px == 0)=[]; % drop zero-probability points to avoid log(0) = -Inf
llx=sum(log(px)); % binomial log likelihood
py=poisspdf(x,5); % per-point likelihoods under the Poisson model
py(py == 0)=[]; % drop zero-probability points
lly=sum(log(py)); % Poisson log likelihood
[aic,bic]=aicbic([llx,lly],[2,1],[length(px),length(py)])
aic =
1.0e+003 *
3.7569 3.9542
bic =
1.0e+003 *
3.7667 3.9591
The first model $X\sim Bino(10,0.5)$ has lower AIC and BIC, so it is a better model.
Next, try using different random values.
x = poissrnd(5,1,10000);
aic =
1.0e+004 *
4.4925 4.4069
bic =
1.0e+004 *
4.4939 4.4076
This time the second model $X\sim Poiss(5)$ has lower AIC and BIC, so it is the better model for the data.
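As an aside, aicbic ships with the Econometrics Toolbox; if it is unavailable, the criteria can be computed directly from the definitions above. A minimal sketch, reusing llx, lly, px, and py from the earlier snippet:
% AIC = 2k - 2 ln L and BIC = -2 ln L + k ln(n), computed by hand.
k = [2 1];                        % parameter counts: Bino(n,p) and Poiss(lambda)
ll = [llx lly];                   % log likelihoods of the two models
nobs = [length(px) length(py)];   % observations used for each model
aic2 = 2*k - 2*ll
bic2 = -2*ll + k.*log(nobs)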
Log Likelihood Function
In maximum likelihood estimation (MLE), let $Y_1,\ldots,Y_n$ be $n$ independent random variables (r.v.'s) with probability density functions (pdfs) $f_i(y_i;{\bf\theta})$ depending on a vector-valued parameter ${\bf\theta}$. The likelihood is then the value of the $n$-variate joint density
\[f({\bf y};{\bf\theta})=\prod\limits_{i=1}^n f_i(y_i;{\bf\theta})=L({\bf\theta};{\bf y})\]
$L({\bf\theta};{\bf y})$, viewed as a function of the unknown parameter ${\bf\theta}$ given the data ${\bf y}$, is called the likelihood function.
Analytically, maximizing a product is more difficult than maximizing a sum, so we take the natural logarithm of the likelihood function, giving the log likelihood function
\[\log L({\bf\theta};{\bf y})=\sum\limits_{i=1}^n \log f_i(y_i;{\bf\theta})\]
For example, suppose the outcome of an experiment is a series of random values; what is the (log) likelihood that this outcome came from $X \sim Bino(10,0.5)$?
First, generate some random values, but let's assume we do not know the parameters that generated them, i.e., we don't know they came from $Bino(10,0.5)$.
X=binornd(10,0.5,1,100);
The log likelihood value of X with parameter (10,0.6) is
>> sum(log(binopdf(X,10,0.6)))
ans =
-200.0399
The log likelihood value of X with parameter (10,0.5) is
>> sum(log(binopdf(X,10,0.5)))
ans =
-184.4945
And the log likelihood value of X with parameter (11,0.5) is
>> sum(log(binopdf(X,11,0.5)))
ans =
-187.4201
Of the three values above, the log likelihood under the parameters (10,0.5) is the highest; from an MLE standpoint, we should select this model.
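Rather than comparing a few hand-picked parameter values, the log likelihood can also be maximized numerically. A minimal sketch using fminbnd on the negative log likelihood, holding the first parameter fixed at 10 and searching over $p$ (the search interval is an arbitrary assumption):
% Numerical MLE of p for Bino(10, p): minimize the negative log likelihood.
negll = @(p) -sum(log(binopdf(X, 10, p)));
phat = fminbnd(negll, 0.01, 0.99) % search over p in (0.01, 0.99)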
Tuesday, June 15, 2010
Model-based Clustering
Model-Based Clustering (MBC) assumes that the data come from several subpopulations, each modeled separately by $g_k({\bf x}; {\bf \theta_k})$. The overall population is therefore a finite mixture model:
\[f({\bf x};p,{\bf \theta})=\sum\limits_{k=1}^c p_k g_k({\bf x}; {\bf \theta_k})\]
The weights $p_k$ are also called the mixing proportions or mixing coefficients.
The most commonly used component density is the multivariate normal (Gaussian); therefore, MBC typically assumes that the data come from a finite mixture of multivariate Gaussians.
There are three components of the MBC (see the sketch after the list).
1. Agglomerative clustering is used to partition the data. Each partition then provides the initial values for the EM algorithm.
2. The EM algorithm is used to estimate the parameters of the finite mixture.
3. The Bayesian Information Criterion (BIC) is used to choose the best grouping and number of clusters, given the data. It is an approximation to Bayes factors.
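Here is a minimal sketch of this pipeline using the Statistics Toolbox's gmdistribution.fit, which implements step 2 (EM). Note that it uses random restarts by default rather than the agglomerative initialization of step 1, so this is only an approximation of the full procedure; the data here are made up:
% Fit Gaussian mixtures with c = 1..5 components and select the
% number of clusters with the lowest BIC.
X = [randn(100,2)-2; randn(100,2)+2]; % made-up data with two groups
bic = zeros(1,5);
for c = 1:5
    gm = gmdistribution.fit(X, c, 'Replicates', 5);
    bic(c) = gm.BIC;
end
[bestbic, cbest] = min(bic) % cbest should typically come out as 2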
Saturday, June 12, 2010
False Alarm Probability to Likelihood Ratio Threshold
Let's start from something easier to understand. We know from the likelihood ratio approach to classification that we assign class $\omega_1$ when
\[P({\bf x}|\omega_1)P(\omega_1) > P({\bf x}|\omega_2)P(\omega_2)\]
or, equivalently, when
\[L_R({\bf x})=\frac{P({\bf x}|\omega_1)}{P({\bf x}|\omega_2)}>\frac{P(\omega_2)}{P(\omega_1)}=\tau_c\]
The $\tau_c$ provides an automatic decision boundary that we can later use to calculate the false alarm probability, P(FA).
In the previous post, we saw that we can set up an arbitrary decision boundary by choosing P(FA). That is, we can either directly find the quantile P(FA) on the observations of the non-target class $\omega_2$,
xq=quantile(x(ind2),pfa);
or use the inverse cumulative distribution function at P(FA) if we assume the observations come from a known probability distribution
xi=norminv(pfa,mu2,sd2);
The question now is: can we derive the threshold from P(FA)? If so, the classification task reduces to comparing the likelihood ratio of the two classes and assigning the target class when the ratio exceeds the threshold, while knowing exactly what the false alarm probability is.
It turns out that if we do (leave-one-out) cross validation on $\omega_2$ and calculate the likelihood ratio
\[L_r=\frac{P(x_i^{(2)}|\omega_1)}{P(x_i^{(2)}|\omega_2)}\]
of each held-out $x_i^{(2)}$ (the $i$-th observation from class $\omega_2$), we get a distribution of $L_r$ values. From here, the quantile $\hat q_{1-P(FA)}$ of that distribution is the threshold at P(FA). (A higher quantile means a higher threshold, which favors the target class and thus implies a lower P(FA).)
The following code demonstrates the two approaches to getting the threshold from P(FA). The first approach (red) takes the quantile 1-P(FA) of the cross-validated likelihood ratios, and the second (blue) takes the quantile P(FA) directly from the data and then calculates the threshold.
% p(x|w1) ~ N(-1,1), p(w1)=0.6
% p(x|w2) ~ N(1,1), p(w2)=0.4
n=1000;
u=rand(1,n);
x=zeros(size(u));
ind1 = u <= 0.60; % assign ~60% of the points to class w1 (prior 0.6)
ind2 = ~ind1;
n1=sum(ind1);
n2=sum(ind2);
x(ind1)=randn(1,n1)-1; % w1 samples ~ N(-1,1)
x(ind2)=randn(1,n2)+1; % w2 samples ~ N(1,1)
mu1=mean(x(ind1));
sd1=std(x(ind1));
xtrain = x(ind2);
lr2=zeros(1,n2);
% cross validation on w2
for i=1:n2
xi=xtrain(i);
xtrain2=xtrain;
xtrain2(i)=[];
mu2=mean(xtrain2);
sd2=std(xtrain2);
lr2(i)=normpdf(xi,mu1,sd1)/normpdf(xi,mu2,sd2);
end
pfa = 0.01:0.01:0.99;
% approach #1
threshold=quantile(lr2,1-pfa);
% approach #2
xq=quantile(x(ind2),pfa);
threshold2=normpdf(xq,mu1,sd1)./normpdf(xq,mu2,sd2);
plot(pfa,threshold,'r'); hold on;
plot(pfa,threshold2,'b'); hold off;
The two are almost identical, with only slight differences.
>> sum(abs(threshold-threshold2))
ans =
1.7067
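Finally, to use the derived threshold for actual classification, here is a short sketch reusing the variables above; the choice P(FA) = 0.1 and the test points are arbitrary:
% Classify new observations at a chosen false alarm probability.
pfa0 = 0.1;                            % desired P(FA) (arbitrary choice)
tau  = quantile(lr2, 1 - pfa0);        % LR threshold from approach #1
xnew = [-2 0 2];                       % made-up test points
lr   = normpdf(xnew,mu1,sd1) ./ normpdf(xnew,mu2,sd2);
isTarget = lr > tau                    % logical 1 => assign target class w1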