|Team name||Filename||Error (5 guesses)||Description|
|SuperVision||test-preds-141-146.2009-131-137-145-146.2011-145f.||0.15315||Using extra training data from ImageNet Fall 2011 release|
|SuperVision||test-preds-131-137-145-135-145f.txt||0.16422||Using only supplied training data|
|ISI||pred_FVs_wLACs_weighted.txt||0.26172||Weighted sum of scores from each classifier with SIFT+FV, LBP+FV, GIST+FV, and CSIFT+FV, respectively.|
|ISI||pred_FVs_weighted.txt||0.26602||Weighted sum of scores from classifiers using each FV.|
|ISI||pred_FVs_summed.txt||0.26646||Naive sum of scores from classifiers using each FV.|
|ISI||pred_FVs_wLACs_summed.txt||0.26952||Naive sum of scores from each classifier with SIFT+FV, LBP+FV, GIST+FV, and CSIFT+FV, respectively.|
|OXFORD_VGG||test_adhocmix_classification.txt||0.26979||Mixed selection from High-Level SVM scores and Baseline Scores, decision is performed by looking at the validation performance|
|OXFORD_VGG||test_finecls_classification.txt||0.27079||High-Level SVM over Fine Level Classification score, DPM score and Baseline Classification scores (Fisher Vectors over Dense SIFT and Color Statistics)|
|OXFORD_VGG||test_baseline_classification.txt||0.27302||Baseline: SVM trained on Fisher Vectors over Dense SIFT and Color Statistics|
|University of Amsterdam||final-UvA-lsvoc2012test.results.val||0.29576||See text above|
|LEAR-XRCE||submit_i12_d0512_mix.txt||0.34464||Trained on ILSVRC'12 - using a mixture of NCM classifiers|
|LEAR-XRCE||submit_i12_d0512_k1.txt||0.36184||Trained on ILSVRC'12 - using NCM|
|LEAR-XRCE||submit_i10_d0512_mix.txt||0.38006||Trained on ILSVRC'10 - using a mixture of NCM classifiers|
|LEAR-XRCE||submit_i10_d0512_k1.txt||0.41048||Trained on ILSVRC'10 - using NCM|
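The "Error (5 guesses)" column in the table above can be read as a top-5 error: an image counts as a miss when its true label is not among the (up to) five labels guessed for it. The sketch below illustrates that reading on toy data. The function name `top5_error` and the sample labels are hypothetical, and the official evaluation script may differ in its details.

```python
def top5_error(predictions, labels):
    """Fraction of images whose true label is not among the first five guesses.

    predictions: one list of guessed class labels per image (hypothetical format).
    labels: the true class label of each image, in the same order.
    """
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if not labels:
        raise ValueError("no images to score")
    misses = sum(
        1 for guesses, truth in zip(predictions, labels)
        if truth not in guesses[:5]
    )
    return misses / len(labels)


# Toy data (made up for illustration): three images, five guesses each.
preds = [[3, 7, 1, 9, 4], [2, 5, 8, 0, 6], [1, 2, 3, 4, 5]]
truth = [7, 4, 5]
error = top5_error(preds, truth)
print(error)
```

Only the second toy image is a miss, so the error here is one third. Lower values are better, which is why the tables are sorted in ascending order.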
|Team name||Filename||Error (5 guesses)||Description|
|SuperVision||test-rect-preds-144-cloc-141-146.2009-131-137-145-||0.335463||Using extra training data for classification from ImageNet Fall 2011 release|
|SuperVision||test-rect-preds-144-cloc-131-137-145-135-145f.txt||0.341905||Using only supplied training data|
|OXFORD_VGG||test_adhocmix_detection.txt||0.500342||Re-ranked DPM detection over Mixed selection from High-Level SVM scores and Baseline Scores, decision is performed by looking at the validation performance|
|OXFORD_VGG||test_finecls_detection_bestbbox.txt||0.50139||Re-ranked DPM detection over High-Level SVM Scores|
|OXFORD_VGG||test_finecls_detection_firstbbox.txt||0.522189||Re-ranked DPM detection over High-Level SVM Scores - First bbox selection heuristic|
|OXFORD_VGG||test_baseline_detection.txt||0.529482||DPM detection over baseline classification scores|
|ISI||result2.txt||0.536474||We use the cascade object detection with deformable part models, restricting the sizes of bounding boxes. |
|ISI||result.txt||0.536546||We use the cascade object detection with deformable part models, restricting the sizes of bounding boxes. |
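For the localization table above, a guess has to name the right class and also place its bounding box well. The sketch below assumes the commonly used criterion that a guess is correct when its label matches a ground-truth object and the two boxes overlap with an intersection-over-union of at least 0.5. The threshold, the `(xmin, ymin, xmax, ymax)` box format, and all names are assumptions made for illustration, not taken from this page.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (xmin, ymin, xmax, ymax)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def localization_hit(guesses, ground_truth, threshold=0.5):
    """True if any of the first five (label, box) guesses matches a ground-truth object."""
    return any(
        g_label == t_label and iou(g_box, t_box) >= threshold
        for g_label, g_box in guesses[:5]
        for t_label, t_box in ground_truth
    )


# Toy example (made up): one ground-truth object of class 7.
gt = [(7, (10, 10, 110, 110))]
good = [(7, (20, 20, 120, 120))]    # right class, IoU about 0.68
shifted = [(7, (60, 60, 160, 160))]  # right class, IoU about 0.14
wrong = [(3, (10, 10, 110, 110))]    # perfect box, wrong class
print(localization_hit(good, gt), localization_hit(shifted, gt), localization_hit(wrong, gt))
```

Under this assumed criterion the localization error can never be lower than the classification error for the same set of labels, which is consistent with the larger values in the second table.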
|Team name||Team members||Abstract|
|ISI||Naoyuki Gunji (the Univ. of Tokyo), Takayuki Higuchi (the Univ. of Tokyo), Koki Yasumoto (the Univ. of Tokyo), Hiroshi Muraoka (the Univ. of Tokyo), Yoshitaka Ushiku (the Univ. of Tokyo), Tatsuya Harada (the Univ. of Tokyo & JST PRESTO), Yasuo Kuniyoshi (the Univ. of Tokyo)||Task 1: Classification
We use multi-class online learning and late fusion techniques with multiple image features.
We extract conventional Fisher Vectors (FV) [Sanchez et al., CVPR 2011] and a streamlined version of Graphical Gaussian Vectors (GGV) [Harada, NIPS 2012]. For extraction, we use not only the common SIFT and CSIFT descriptors but also LBP and GIST, all densely sampled.
We train linear classifiers using the Passive-Aggressive (PA) algorithm [Crammer et al., JMLR 2006].
We then investigate two strategies for combining the scores from each feature's classifier. The first simply sums all scores. The second trains another PA classifier on the scores to learn a weight for each feature, and then sums the scores using those weights.
Task 2: Classification with localization
We extract HOG descriptors from each sliding window.
We use cascade object detection with deformable part models [Felzenszwalb et al., CVPR 2010], restricting the sizes of bounding boxes.
We also restrict the candidate objects for each input image using the predictions from Task 1.
Task 3: Fine-grained classification
We represent images using FVs computed from a variety of descriptors. Each descriptor is extracted more densely than in Task 1.
We train linear classifiers on each FV using PA. Then scores are summed up for prediction.|
|LEAR-XRCE||Thomas Mensink, LEAR - INRIA Grenoble and TVPA - Xerox Research Centre Europe
Jakob Verbeek, LEAR - INRIA Grenoble
Florent Perronnin, TVPA - Xerox Research Centre Europe
Gabriela Csurka, TVPA - Xerox Research Centre Europe||In our submission we evaluate the performance of the Nearest Class Mean (NCM) classifier in the ILSVRC 2012 Challenge. The NCM classifier assigns an image to the class with the nearest class mean. To obtain competitive performance we learn a low-rank Mahalanobis distance function, M = W' W, by maximizing the log-likelihood of correct prediction [Mensink et al., ECCV'12].
We submit two runs (a and b) where the metric has been learned and the parameters have been validated on the ILSVRC'10 training and evaluation sets. There is no training on the ILSVRC'12 dataset, except that we had to compute the class means on the ILSVRC'12 training set.
The other two runs (c and d) use a metric which has been learned on the ILSVRC'12 dataset.
Run b differs from run a (and similarly d differs from c) in that we use a non-linear extension of the NCM classifier, where each class is represented by k centroids instead of only a single mean [Mensink et al., TechReport'12]. The metric we use is learned and validated for k=1. For the final classification results we use a mixture of k = [1 5 10 15 20] centroids, where each mixture component has equal weight.
Images are represented by Fisher Vectors on SIFT and Local Color Features [Lowe, IJCV'04 and Perronnin et al., ECCV'10], which are early-fused into a 64K-dimensional feature vector. These vectors are compressed using Product Quantization [Jégou et al., PAMI’11].|
|OXFORD_VGG||Karen Simonyan, University of Oxford
Yusuf Aytar, University of Oxford
Andrea Vedaldi, University of Oxford
Andrew Zisserman, University of Oxford||In this submission, image classification was performed using a conventional pipeline based on the Fisher vector image representation and one-vs-rest linear SVM classifiers. In more detail, two types of local patch features were densely extracted over multiple scales: SIFT and colour statistics. The features were then augmented with patch spatial coordinates and aggregated into two Fisher vectors corresponding to the two feature types. Fisher vectors were computed using GMMs with 1024 Gaussians, resulting in 135K-dimensional representations. To obtain a single feature vector per image, the two Fisher vectors were then stacked. We did not use a spatial pyramid representation. To deal with the large amount of training data, product quantisation was employed to compress the image features. Finally, an ensemble of one-vs-rest linear SVMs was trained over the stacked features using a stochastic sub-gradient method (Pegasos).
Localization is performed using DPM (discriminatively trained part-based model) detectors (without parts) trained for each class individually. The DPM detectors are boosted by harvesting more bounding boxes from the training set using a semi-supervised approach. Using the validation set, the top 5000 images for each class are shortlisted by image classification score, and detection is then performed using DPMs. After that, a further bounding-box-aware classification model is trained for each class individually from the cropped images, using the max-scored bounding box for each image. In this fine-level bounding box classification, we used features similar to those used for image classification (stacked dense SIFT and colour Fisher vectors). Finally, for each class, a high-level SVM is trained over the image classification score, the DPM max-detection score and the score from fine-level bounding box classification. Using the scores from the high-level SVM, the top-5000 shortlist from the test set is re-ranked.|
|SuperVision||Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton
University of Toronto||Our model is a large, deep convolutional neural network trained on raw RGB pixel values. The neural network, which has 60 million parameters and 650,000 neurons, consists of five convolutional layers, some of which are followed by max-pooling layers, and three globally-connected layers with a final 1000-way softmax. It was trained on two NVIDIA GPUs for about a week. To make training faster, we used non-saturating neurons and a very efficient GPU implementation of convolutional nets. To reduce overfitting in the globally-connected layers we employed hidden-unit "dropout", a recently-developed regularization method that proved to be very effective. |
|Uni Jena||Christoph Göring, Computer Vision Group, Friedrich Schiller University Jena, Germany
Erik Rodner, ICSI Vision Group, UC Berkeley, California
Alexander Freytag, Uni Jena, Computer Vision Group, Friedrich Schiller University Jena, Germany
Joachim Denzler, Uni Jena, Computer Vision Group, Friedrich Schiller University Jena, Germany||Our team tackled task 3 of the ILSVRC 2012 challenge, namely fine-grained object classification. We built a final classification system relying on three key ingredients: (1) the combination of different feature types to capture different aspects of objects, namely shape, color and texture, (2) a simple yet efficient part detector together with background elimination using a segmentation approach without any user interaction, and (3) a linear classifier with an efficient kernel approximation to keep computation times within a few hours even for this large-scale dataset. Details for every step follow in the subsequent paragraphs.
To differentiate between hundreds of dog categories, many details matter, not just a few. Therefore, we represent images by a combination of different sources of information. The shape of objects is captured using a bag-of-words histogram of opponent SIFT descriptors, which are densely sampled from the image. In addition, we extract color information using color name histograms. Finally, we compute local binary patterns to capture texture information, which is helpful for differentiating between different fur structures. We add spatial information to every type of feature by extracting a pyramid-of-histograms representation rather than a single feature per image.
Following state-of-the-art approaches, we additionally extract part-based information. Since the bodies of dogs are highly deformable, the parts that can be detected most reliably are their heads. Unfortunately, no annotation for these parts is available in the data, so we cannot train a standard detector. Therefore, we use a simple head detector that applies a Hough circle transform to find eyes and noses and then searches for three circles that form a triangle. With this approach we are able to find a large fraction of the dog heads in the images. Our detection approach fails with dark fur, under bad illumination conditions, and when the head is not in the picture. Detection results are used to extract an additional SIFT bag-of-words descriptor from the head region.
Background clutter present in the images might interfere with classification. We therefore apply GrabCut to all images to consider relevant foreground regions only. For GrabCut, a background color model is trained on the pixels outside the provided bounding box, whereas a foreground color model is trained on the pixels inside the bounding box. This initial bounding box segmentation is then refined using iterated graph cuts.
Images are finally represented by a combination of all previously described features.
For classification, we use the LIBLINEAR SVM with a one-vs-all multiclass approach. Because the classifier is linear, computing classification scores is extremely fast, which makes it feasible for this large-scale dataset. However, the gain in speed comes at the cost of diminished discriminative power. We overcome this drawback by using homogeneous kernel maps to approximate a chi2 kernel. This combines the speed of a linear SVM with the discriminative power of kernel-based methods. With all the details mentioned, LIBLINEAR trains a model on the 20,500 training examples in less than 4 hours using 70 GB of RAM.|
|University of Amsterdam||Koen E. A. van de Sande
Cees G. M. Snoek||We extend the fine-grained coding approach of last year's LSVRC classification winners [1] with fine-grained color descriptors and a calibrated SVM with a cutting-plane solver. We provide the cutting-plane solver with 10% of the negative examples per class to train an exact (not stochastic) model. For each fine-grained color descriptor we train a separate SVM, and the SVMs are fused at the classifier level. This is more precise and more efficient than training on a concatenated version of the features. Last year's UvA system used Platt's sigmoid to calibrate classifier scores, which involves expensive 5-fold cross-validation. Therefore, this year we used an unsupervised calibrator based on extreme value theory [2]. It fits a Weibull distribution to the classifier scores, which is subsequently used for score normalization. Overall, the error rate on the validation set is approximately 8% lower than that of last year's UvA system.
[1] F. Perronnin and J. Sanchez, Compressed Fisher vectors for Large Scale Visual Recognition, Large Scale Visual Recognition workshop, ICCV 2011.
[2] W. Scheirer, N. Kumar, P. Belhumeur and T. Boult, Multi-Attribute Spaces: Calibration for Attribute Fusion and Similarity Search, CVPR 2012.|
|XRCE/INRIA||Florent Perronnin, XRCE
Zeynep Akata, XRCE/INRIA
Zaid Harchaoui, INRIA
Cordelia Schmid, INRIA||Our low-level descriptors are the SIFT of [Lowe, IJCV 2004] and the color features of [Perronnin et al., ECCV 2010]. They are aggregated into image-level features using the Fisher Vector (FV) of [Perronnin et al., ECCV 2010]. Because of the high dimensionality of the FVs, they are compressed with Product Quantization (PQ) as proposed in [Sanchez and Perronnin, CVPR 2011].
We train linear SVMs in a one-vs-rest manner using the good practices of [Perronnin et al., CVPR 2012]. On the validation set, this yields an improvement of approx. 2% over last year's training strategy.
For task 1 we submitted two systems:
- res_64k_svm.txt: we use 64K FVs obtained by concatenating 32K-dim SIFT and color FVs.
- res_1M_svm.txt: 0.5M-dim SIFT and color FVs are computed and classified separately. The SIFT and color results are merged with late fusion (weighted averaging).
For task 3 we report results only with the 0.5M-dim SIFT and color FVs and late fusion. We submitted two systems:
- res_1M_svm_nocrop.dat: without using bounding boxes
- res_1M_svm_crop.dat: with bounding boxes|
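Several abstracts above merge per-feature classifier scores by late fusion, for example the weighted averaging of SIFT and color results. The sketch below shows one way such a fusion could look for a single image. The weight `w_sift`, the score values, and the function names are hypothetical, and how the weight was actually chosen is not stated on this page.

```python
def late_fusion(sift_scores, color_scores, w_sift=0.5):
    """Weighted average of two per-class score lists for one image."""
    if len(sift_scores) != len(color_scores):
        raise ValueError("both classifiers must score the same classes")
    if not 0.0 <= w_sift <= 1.0:
        raise ValueError("w_sift must lie in [0, 1]")
    return [
        w_sift * s + (1.0 - w_sift) * c
        for s, c in zip(sift_scores, color_scores)
    ]


def top_k(scores, k=5):
    """Indices of the k highest-scoring classes, best first."""
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return order[:k]


# Toy scores for four classes (made up for illustration).
sift = [0.2, 1.5, -0.3, 0.9]
color = [1.0, 0.1, -0.5, 1.2]
fused = late_fusion(sift, color, w_sift=0.7)
guesses = top_k(fused, k=2)
print(fused, guesses)
```

Setting `w_sift=0.5` reduces this to a plain average, which ranks classes the same way as a naive sum of scores. A weight tuned on held-out data lets the stronger feature dominate.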