Saturday, May 8, 2010

Basic Tools for Exploratory Data Analysis

In Exploratory Data Analysis (EDA), a few basic tools cover a lot of ground: the histogram, the Q-Q plot, the boxplot and the scatterplot matrix.

A histogram can be plotted using the MATLAB hist() function. If we want it to approximate a probability mass function, we can simply make a relative frequency histogram by dividing each bin count by the total number of observations, i.e.:

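data = poissrnd(5,1,100);          % 100 Poisson random numbers, lambda = 5
[n,x]=hist(data);                  % n = bin counts, x = bin centers
subplot(1,2,1);
bar(x,n,1); axis tight;            % raw counts
title('Frequency Histogram')
subplot(1,2,2);
bar(x,n/sum(n),1); axis tight;     % counts scaled to sum to 1
title('Relative Frequency Histogram');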
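
The 10 default bins of hist() do not generally line up with the integer values that Poisson data take, so the bars above only roughly resemble a probability mass function. A minimal sketch of one way around this, assuming one bin per integer is acceptable and that poisspdf() from the Statistics Toolbox is available (the variable names centers and relfreq are my own):

```matlab
% Sketch: relative frequencies on integer-centered bins vs. the Poisson pmf.
lambda = 5;                              % same rate as in the example above
data = poissrnd(lambda, 1, 100);         % 100 Poisson-distributed counts
centers = 0:max(data);                   % one bin per integer (assumed choice)
n = hist(data, centers);                 % counts per integer value
relfreq = n / sum(n);                    % relative frequencies, sum to 1
bar(centers, relfreq, 1); hold on;
plot(centers, poisspdf(centers, lambda), 'r.-');   % theoretical pmf
hold off; axis tight;
legend('Relative frequency', 'Poisson pmf');
title('Relative Frequency vs. Poisson PMF');
```

With only 100 samples the bars will scatter around the red curve; a larger sample should bring them closer.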

A Q-Q plot is used to compare the quantiles of two distributions. If the two distributions are the same, we would expect the points of the Q-Q plot to fall approximately on a straight line.

Another type of plot related to the Q-Q plot is the probability plot, which compares the data against a specified theoretical distribution, such as the normal distribution.

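data2 = randn(1,75);          % 75 standard normal random numbers
subplot(1,2,1);
qqplot(data,data2);           % data is the Poisson sample generated earlier
subplot(1,2,2);
probplot('normal',data);      % compare data against a normal distribution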
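
To make it clearer what a two-sample Q-Q plot shows, here is a rough hand-built version. It is only a sketch: the choice of 50 probability levels is arbitrary, the names qx and qy are my own, and qqplot() may pick its levels differently.

```matlab
% Sketch: build a two-sample Q-Q plot by hand from empirical quantiles.
x = poissrnd(5, 1, 100);          % first sample, same kind of data as above
y = randn(1, 75);                 % second sample
p = ((1:50) - 0.5) / 50;          % 50 evenly spaced probabilities (arbitrary)
qx = quantile(x, p);              % empirical quantiles of x at levels p
qy = quantile(y, p);              % empirical quantiles of y at levels p
plot(qx, qy, 'b+');               % one point per probability level
xlabel('X Quantiles'); ylabel('Y Quantiles');
title('Hand-built Q-Q Plot');
```

Because the two samples have different sizes, the quantiles are evaluated at a common set of probabilities rather than pairing the sorted values directly.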


A boxplot is another visualization technique for representing the distribution of the data. The lower and upper edges of the box are the 1st and 3rd quartiles (the 25th and 75th percentiles), so the height of the box is the interquartile range (IQR). The whiskers extend to the most extreme values that are not considered outliers, and the outliers are plotted individually beyond them.

load carsmall
boxplot(MPG,Origin);
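
The quantities that a boxplot draws can also be computed directly. The sketch below does so for all MPG values pooled together (not per Origin group, as in the plot above) and assumes the 1.5 x IQR rule, which is the default 'Whisker' setting of boxplot(). The variable names are my own.

```matlab
% Sketch: quartiles, IQR and outlier fences for the pooled MPG values.
load carsmall                          % provides MPG and Origin
x = MPG(~isnan(MPG));                  % drop missing values
q = quantile(x, [0.25 0.5 0.75]);      % Q1, median, Q3
iqrValue = q(3) - q(1);                % height of the box
lowerFence = q(1) - 1.5 * iqrValue;    % points below this count as outliers
upperFence = q(3) + 1.5 * iqrValue;    % points above this count as outliers
outliers = x(x < lowerFence | x > upperFence);
fprintf('Q1 = %.2f, median = %.2f, Q3 = %.2f, IQR = %.2f\n', q, iqrValue);
fprintf('%d point(s) outside the fences\n', numel(outliers));
```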

A scatterplot matrix is a way to display multi-dimensional data by looking at 2D scatterplots of all possible pairs of variables. In MATLAB, we can use the plotmatrix() function.

load iris
plotmatrix(setosa);
title('Iris Setosa');
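
The load iris call above presumably relies on a data file that provides a setosa variable. If that file is not on your path, a possible alternative is the fisheriris data set shipped with the Statistics Toolbox, which stores all three species in one matrix. This is a sketch under that assumption; isSetosa and setosaData are names I made up.

```matlab
% Sketch: the same scatterplot matrix built from the fisheriris data set.
load fisheriris                           % provides meas (150x4) and species
isSetosa = strcmp(species, 'setosa');     % logical index of the setosa rows
setosaData = meas(isSetosa, :);           % 50 observations, 4 measurements
plotmatrix(setosaData);                   % scatterplots off-diagonal, histograms on it
title('Iris Setosa');
```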
