# Advice for large dataset classification

1. ## Advice for large dataset classification

Hi everyone,

I am an undergrad student taking a grad level course, Pattern Recognition and I need some advice on my project. I am given a database of leaves, total of 3032 samples, belonging to 47 different classes with 2003 features for each sample. The features consist of the following: rectangularity, aspect ratio, mean hue, eccentricity, convexity and 999 features for the FFT magnitudes along x-axis and 999 features for the FFT magnitudes along y-axis. I am using Matlab.

First of all, I know that I need dimensionality reduction. I applied Fisher's Linear Discriminant Analysis to maximize the between cluster variance and minimize within cluster variance. I have received a warning saying that the matrix can be ill-conditioned. So I have decided to apply PCA first and then try LDA again.

Some experienced PhD students said that the dataset can be represented with at most 10 dimensions but I have tried different numbers of principal components and ran Matlab's built-in linear classifier. Here are the results:

It appears that 400 PC gives the best result, 100 is also acceptable but 10 is certainly bad. Do I really need 100 features or should I do something extra?

Later I ran the LDA, downloaded from the page http://www.mathworks.com/matlabcentr...inant-analysis (explanation is in http://matlabdatamining.blogspot.com...lysis-lda.html). I am not sure whether this code performs dimensionality reduction though. It helps me to acquire linear scores and when I do classification based on them, I get the exact same result as Matlab's built-in linear classifier. Does it mean Matlab's linear classifier perform some optimal transformation as well? Would it be better if I can somehow reduce the dimensionality while transforming the data, and how can I do it?

Finally I wanted to try Support Vector Machines since I have heard that "most of the time" it performs better than any other classifier. I have used a function named multisvm which simply loops through all the classes, takes the current class as 1 and others 0 and uses Matlab's built in svmtrain function (http://www.mathworks.com/matlabcentr...ent/multisvm.m).

The results were much worse than I expected. I also tried RBF for kernel and different sigmas but the error rates were still much more than the 10% that I obtained with linear classifier.

Again, do you think is SVM way worse in this particular case or am I doing something wrong?

I am open to new ideas as well.