- 30th May 2014, 15:51 #1
Advice for large dataset classification
I am an undergraduate student taking a graduate-level course, Pattern Recognition, and I need some advice on my project. I am given a database of leaves: 3032 samples in total, belonging to 47 different classes, with 2003 features per sample. The features are rectangularity, aspect ratio, mean hue, eccentricity, convexity, 999 FFT magnitudes along the x-axis and 999 FFT magnitudes along the y-axis. I am using Matlab.
First of all, I know that I need dimensionality reduction. I applied Fisher's Linear Discriminant Analysis (LDA) to maximize the between-class variance and minimize the within-class variance, but I received a warning saying that the matrix may be ill-conditioned. So I decided to apply PCA first and then try LDA again.
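The PCA-then-LDA pipeline described above can be sketched as follows. This is only a minimal sketch, written in Python with NumPy rather than Matlab, and run on synthetic data standing in for the leaf dataset; the function names `pca_fit` and `lda_fit` and all sizes are hypothetical. It also illustrates that Fisher LDA is itself a dimensionality reduction: the between-class scatter matrix has rank at most C - 1, so with 47 classes it can give at most 46 useful directions.

```python
import numpy as np

def pca_fit(X, n_components):
    """Return the training mean and the top principal directions (as columns)."""
    mean = X.mean(axis=0)
    # SVD of the centered data; the rows of Vt are the principal directions.
    _, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
    return mean, Vt[:n_components].T

def lda_fit(Z, y, n_components):
    """Fisher LDA directions for (PCA-reduced) data Z with class labels y."""
    overall_mean = Z.mean(axis=0)
    d = Z.shape[1]
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in np.unique(y):
        Zc = Z[y == c]
        mc = Zc.mean(axis=0)
        Sw += (Zc - mc).T @ (Zc - mc)
        diff = (mc - overall_mean)[:, None]
        Sb += len(Zc) * (diff @ diff.T)
    # Eigenvectors of Sw^{-1} Sb, sorted by decreasing eigenvalue.
    evals, evecs = np.linalg.eig(np.linalg.solve(Sw, Sb))
    order = np.argsort(evals.real)[::-1][:n_components]
    return evecs[:, order].real

# Synthetic stand-in data (assumption): 3 classes, 60 samples each, 50 features.
rng = np.random.default_rng(0)
n_classes, n_per_class, n_features = 3, 60, 50
centers = rng.normal(scale=5.0, size=(n_classes, n_features))
X = np.vstack([c + rng.normal(size=(n_per_class, n_features)) for c in centers])
y = np.repeat(np.arange(n_classes), n_per_class)

mean, P = pca_fit(X, n_components=10)          # PCA: 50 -> 10 dimensions
Z = (X - mean) @ P
W = lda_fit(Z, y, n_components=n_classes - 1)  # LDA: 10 -> C - 1 = 2 dimensions
Z_lda = Z @ W
print(Z_lda.shape)  # (180, 2)
```

Running PCA first keeps the within-class scatter matrix well-conditioned, which is a plausible (though unconfirmed) explanation for the warning when many of the FFT features are strongly correlated.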
Some experienced PhD students said that the dataset can be represented with at most 10 dimensions, so I tried different numbers of principal components and ran Matlab's built-in linear classifier on each. Here are the results:
It appears that 400 PCs give the best result; 100 is also acceptable, but 10 is certainly bad. Do I really need 100 features, or should I do something extra?
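One common way to pick the number of principal components is to look at the cumulative explained variance instead of guessing. The sketch below is in Python with NumPy, uses synthetic data, and the helper names `explained_variance_ratio` and `components_for` as well as the 95% target are assumptions for illustration only.

```python
import numpy as np

def explained_variance_ratio(X):
    """Fraction of total variance captured by each principal component."""
    Xc = X - X.mean(axis=0)
    s = np.linalg.svd(Xc, compute_uv=False)  # singular values, descending
    var = s ** 2
    return var / var.sum()

def components_for(X, target=0.95):
    """Smallest number of components whose cumulative variance reaches target."""
    cum = np.cumsum(explained_variance_ratio(X))
    k = int(np.searchsorted(cum, target)) + 1
    return min(k, len(cum))

# Synthetic stand-in data (assumption): 30 features driven by 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(200, 3))
mixing = rng.normal(size=(3, 30))
X = latent @ mixing + 0.01 * rng.normal(size=(200, 30))

k = components_for(X, target=0.95)
print("components needed for 95% of the variance:", k)
```

Note that PCA is unsupervised: the directions with the most variance are not necessarily the ones that separate the classes, so a small number of components can explain most of the variance and still classify poorly. PCA is also sensitive to feature scale, so features measured in very different units may need standardizing first.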
Later I ran the LDA code downloaded from http://www.mathworks.com/matlabcentr...inant-analysis (the explanation is at http://matlabdatamining.blogspot.com...lysis-lda.html). I am not sure whether this code performs dimensionality reduction, though. It gives me linear scores, and when I classify based on them I get exactly the same result as Matlab's built-in linear classifier. Does this mean that Matlab's linear classifier performs some optimal transformation as well? Would it be better if I could somehow reduce the dimensionality while transforming the data, and how can I do that?
Finally, I wanted to try Support Vector Machines, since I have heard that "most of the time" they perform better than any other classifier. I used a function named multisvm, which simply loops through all the classes, labels the current class as 1 and all the others as 0, and calls Matlab's built-in svmtrain function (http://www.mathworks.com/matlabcentr...ent/multisvm.m).
The results were much worse than I expected. I also tried an RBF kernel with different sigmas, but the error rates were still much higher than the 10% that I obtained with the linear classifier.
Again, do you think SVM is really that much worse in this particular case, or am I doing something wrong?
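For comparison, the one-vs-rest scheme described above can be sketched as follows. This is a minimal sketch in Python with NumPy, not a reimplementation of multisvm or svmtrain: it swaps in a plain linear SVM trained by batch subgradient descent on the regularized hinge loss, standardizes the features with training statistics, and picks the class with the largest decision value. All function names, hyperparameters and the synthetic data are assumptions.

```python
import numpy as np

def standardize_fit(X):
    """Per-feature mean and standard deviation, computed on training data only."""
    mu = X.mean(axis=0)
    sigma = X.std(axis=0)
    sigma[sigma == 0] = 1.0  # avoid division by zero for constant features
    return mu, sigma

def train_linear_svm(X, t, lam=0.01, lr=0.1, epochs=200):
    """Binary linear SVM (t in {-1, +1}) via batch subgradient descent on
    lam/2 * ||w||^2 + mean(max(0, 1 - t * (X @ w + b)))."""
    n, d = X.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        active = t * (X @ w + b) < 1  # samples violating the margin
        grad_w = lam * w - X[active].T @ t[active] / n
        grad_b = -t[active].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

def train_one_vs_rest(X, y):
    """Train one binary SVM per class: current class +1, all others -1."""
    classes = np.unique(y)
    models = [train_linear_svm(X, np.where(y == c, 1.0, -1.0)) for c in classes]
    W = np.stack([w for w, _ in models], axis=1)
    b = np.array([bias for _, bias in models])
    return classes, W, b

def predict(X, classes, W, b):
    """Choose the class whose binary SVM gives the largest decision value."""
    return classes[np.argmax(X @ W + b, axis=1)]

# Synthetic stand-in data (assumption): 3 classes, features on very different scales.
rng = np.random.default_rng(0)
centers = 6.0 * np.eye(3)
scales = np.array([1.0, 1000.0, 0.001])

def sample(n_per_class):
    X = np.vstack([c + rng.normal(size=(n_per_class, 3)) for c in centers])
    y = np.repeat(np.arange(3), n_per_class)
    return X * scales, y

X_train, y_train = sample(100)
X_test, y_test = sample(50)
mu, sigma = standardize_fit(X_train)
classes, W, b = train_one_vs_rest((X_train - mu) / sigma, y_train)
pred = predict((X_test - mu) / sigma, classes, W, b)
accuracy = np.mean(pred == y_test)
print(f"test accuracy: {accuracy:.3f}")
```

Typical things to check when a multi-class SVM underperforms a linear classifier are feature scaling, the regularization (box constraint) and kernel width, and whether the final class is chosen from decision values rather than from the binary 0/1 outputs. Whether any of these applies here cannot be told from the post alone.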
I am open to new ideas as well.
Thanks in advance.
1st August 2014, 12:54 #2
Re: Advice for large dataset classification
Can you use self organizing maps + something?