  1. #1
    Advice for large dataset classification

    Hi everyone,

    I am an undergraduate student taking a graduate-level course, Pattern Recognition, and I need some advice on my project. I have been given a database of leaves: 3032 samples in total, belonging to 47 different classes, with 2003 features per sample. The features are rectangularity, aspect ratio, mean hue, eccentricity and convexity, plus 999 FFT magnitudes along the x-axis and 999 FFT magnitudes along the y-axis. I am using Matlab.

    First of all, I know that I need dimensionality reduction. I applied Fisher's Linear Discriminant Analysis (LDA) to maximize the between-class variance and minimize the within-class variance. Matlab gave a warning that the matrix may be ill-conditioned, so I decided to apply PCA first and then try LDA again.
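
    A minimal sketch of why PCA before LDA can help with the ill-conditioning warning. It is written in Python/NumPy rather than Matlab and uses small synthetic data, not the leaf dataset. The near-duplicate columns are an assumption meant to mimic strongly correlated FFT bins, and the names pca_reduce and within_class_scatter are made up for this illustration.

```python
import numpy as np

def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # Rows of Vt are principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T
    return Xc @ W, W, mean

def within_class_scatter(X, y):
    """Sum over classes of the scatter of samples around their class mean."""
    d = X.shape[1]
    Sw = np.zeros((d, d))
    for c in np.unique(y):
        D = X[y == c] - X[y == c].mean(axis=0)
        Sw += D.T @ D
    return Sw

# Synthetic stand-in (assumption): 60 samples, 3 classes, 40 features,
# where the last 20 features are near-copies of the first 20.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 20)
base = rng.normal(size=(60, 20)) + y[:, None]
X = np.hstack([base, base + 1e-9 * rng.normal(size=(60, 20))])

# LDA has to invert the within-class scatter, so its condition number matters.
cond_raw = np.linalg.cond(within_class_scatter(X, y))
Z, W, mean = pca_reduce(X, n_components=10)
cond_pca = np.linalg.cond(within_class_scatter(Z, y))
print(f"condition number, raw features: {cond_raw:.3g}")
print(f"condition number, after PCA:    {cond_pca:.3g}")
```

    On data like this the raw within-class scatter is numerically singular, while the scatter in the PCA subspace is well-conditioned. Whether the same holds for the real features would have to be checked on the real data.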

    Some experienced PhD students said that the dataset can be represented with at most 10 dimensions, but I tried different numbers of principal components and ran Matlab's built-in linear classifier on each. Here are the results:

    [Image: pca all.jpg]

    It appears that 400 PCs give the best result; 100 is also acceptable, but 10 is clearly bad. Do I really need 100 features, or should I do something extra?

    Later I ran the LDA code downloaded from http://www.mathworks.com/matlabcentr...inant-analysis (the explanation is at http://matlabdatamining.blogspot.com...lysis-lda.html). I am not sure whether this code performs dimensionality reduction, though. It gives me linear scores, and when I classify based on them I get exactly the same result as with Matlab's built-in linear classifier. Does this mean that Matlab's linear classifier performs some optimal transformation as well? Would it be better if I could reduce the dimensionality while transforming the data, and how can I do that?
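
    For reference, Fisher LDA used as a dimensionality reduction yields at most C - 1 discriminant directions for C classes, so at most 46 for 47 classes. Below is a minimal sketch of that projection in Python/NumPy on synthetic data. It is not the downloaded Matlab code; the function name lda_projection and the small ridge term reg are assumptions added for this illustration.

```python
import numpy as np

def lda_projection(X, y, n_components=None, reg=1e-6):
    """Return a (d, k) matrix of Fisher discriminant directions, k <= C - 1."""
    classes = np.unique(y)
    d = X.shape[1]
    max_dims = min(d, len(classes) - 1)
    if n_components is None:
        n_components = max_dims
    if n_components > max_dims:
        raise ValueError(f"LDA gives at most {max_dims} components here")
    overall_mean = X.mean(axis=0)
    Sw = np.zeros((d, d))  # within-class scatter
    Sb = np.zeros((d, d))  # between-class scatter
    for c in classes:
        Xc = X[y == c]
        mu = Xc.mean(axis=0)
        Sw += (Xc - mu).T @ (Xc - mu)
        diff = (mu - overall_mean)[:, None]
        Sb += len(Xc) * (diff @ diff.T)
    # Small ridge term (an assumption) keeps Sw positive definite.
    Sw += reg * np.trace(Sw) / d * np.eye(d)
    # Whiten with the Cholesky factor so the eigenproblem is symmetric.
    L = np.linalg.cholesky(Sw)
    A = np.linalg.solve(L, Sb)        # L^-1 Sb
    M = np.linalg.solve(L, A.T)       # L^-1 Sb L^-T
    M = (M + M.T) / 2
    evals, V = np.linalg.eigh(M)      # eigenvalues in ascending order
    top = V[:, ::-1][:, :n_components]
    return np.linalg.solve(L.T, top)  # map directions back to feature space

# Synthetic check: 3 well-separated classes in 5 dimensions, projected to 2.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 30)
means = np.array([[0, 0, 0, 0, 0], [6, 0, 0, 0, 0], [0, 6, 0, 0, 0]], dtype=float)
X = means[y] + rng.normal(size=(90, 5))

W = lda_projection(X, y, n_components=2)
Z = X @ W
# Nearest-class-mean classification in the reduced space.
centroids = np.array([Z[y == c].mean(axis=0) for c in range(3)])
dists = ((Z[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
acc = (dists.argmin(axis=1) == y).mean()
print(f"projected shape: {Z.shape}, nearest-mean accuracy: {acc:.3f}")
```

    A linear discriminant classifier with a shared covariance uses the same scatter matrices, which may be why classifying on the linear scores and running a built-in linear classifier give identical results.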

    Finally, I wanted to try Support Vector Machines, since I have heard that "most of the time" they perform better than any other classifier. I used a function named multisvm, which simply loops through all the classes, labels the current class as 1 and all others as 0, and calls Matlab's built-in svmtrain function (http://www.mathworks.com/matlabcentr...ent/multisvm.m).
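
    The one-vs-rest scheme described above can be sketched as follows, in Python/NumPy on synthetic data. This is not multisvm.m and I have not verified how that function combines the binary outputs. A ridge least-squares scorer is swapped in for the binary SVM to keep the example self-contained, and all function names here are hypothetical. The sketch picks the class with the largest continuous score instead of relying on hard 0/1 decisions, which is one common way to resolve samples that several binary classifiers claim, or none.

```python
import numpy as np

def fit_binary_scorer(X, t, ridge=1e-3):
    """Stand-in for a binary SVM: ridge least squares on targets in {-1, +1}."""
    Xb = np.hstack([X, np.ones((len(X), 1))])  # append a bias column
    A = Xb.T @ Xb + ridge * np.eye(Xb.shape[1])
    return np.linalg.solve(A, Xb.T @ t)

def fit_one_vs_rest(X, y):
    """Train one binary scorer per class: current class vs. all the others."""
    classes = np.unique(y)
    weights = [fit_binary_scorer(X, np.where(y == c, 1.0, -1.0)) for c in classes]
    return classes, np.column_stack(weights)

def predict_one_vs_rest(model, X):
    """Assign each sample to the class whose scorer gives the largest score."""
    classes, Wm = model
    Xb = np.hstack([X, np.ones((len(X), 1))])
    scores = Xb @ Wm  # shape: (n_samples, n_classes)
    return classes[np.argmax(scores, axis=1)]

# Synthetic check: 3 classes arranged in a triangle in 2-D.
rng = np.random.default_rng(0)
y = np.repeat(np.arange(3), 40)
means = np.array([[0.0, 0.0], [6.0, 0.0], [3.0, 6.0]])
X = means[y] + rng.normal(size=(120, 2))

model = fit_one_vs_rest(X, y)
pred = predict_one_vs_rest(model, X)
acc = (pred == y).mean()
print(f"one-vs-rest training accuracy: {acc:.3f}")
```

    With 47 classes each binary problem has roughly 1 positive sample for every 46 negatives, so the individual classifiers are heavily imbalanced; this is worth keeping in mind when reading the error rates.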

    The results were much worse than I expected. I also tried an RBF kernel with different sigma values, but the error rates were still much higher than the 10% error I obtained with the linear classifier.

    [Image: svm sigma 1-20.jpg]

    Again, do you think SVM is simply much worse in this particular case, or am I doing something wrong?

    I am open to new ideas as well.

    Thanks in advance.

    30th May 2014, 15:51



  2. #2
    Re: Advice for large dataset classification

    Could you use self-organizing maps combined with something else?
