Results of ILSVRC2014


Legend:
Yellow background = winner in this task according to this metric; authors are willing to reveal the method
White background = authors are willing to reveal the method
Grey background = authors chose not to reveal the method
Italics = authors requested entry not participate in competition

Object detection

Task 1a: Object detection with provided training data

Object detection with provided training data: Ordered by number of categories won

Team name | Entry description | Number of object categories won | mean AP
NUS | Multiple Model Fusion with Context Rescoring | 106 | 0.37212
MSRA Visual Computing | A combination of multiple SPP-net-based models (no outside data) | 45 | 0.351103
UvA-Euvision | Deep learning with provided data | 21 | 0.320253
1-HKUST | run 2 | 18 | 0.288669
Southeast-CASIA | CNN-based proposal classification with proposal filtration and model combination | 4 | 0.304022
1-HKUST | run 4 | 4 | 0.285616
Southeast-CASIA | CNN-based proposal classification with proposal filtration and sample balance | 2 | 0.304783
1-HKUST | run 2 | 0 | 0.288669
CASIA_CRIPAC_2 | CNN-based proposal classification with part classification and object regression | 0 | 0.286158
1-HKUST | run 3 | 0 | 0.284595
1-HKUST | run 1 | 0 | 0.261543
MSRA Visual Computing | A single SPP-net model for detection (no outside data) | --- | 0.318403

Object detection with provided training data: Ordered by mean average precision

Team name | Entry description | mean AP | Number of object categories won
NUS | Multiple Model Fusion with Context Rescoring | 0.37212 | 106
MSRA Visual Computing | A combination of multiple SPP-net-based models (no outside data) | 0.351103 | 45
UvA-Euvision | Deep learning with provided data | 0.320253 | 21
MSRA Visual Computing | A single SPP-net model for detection (no outside data) | 0.318403 | ---
Southeast-CASIA | CNN-based proposal classification with proposal filtration and sample balance | 0.304783 | 2
Southeast-CASIA | CNN-based proposal classification with proposal filtration and model combination | 0.304022 | 4
1-HKUST | run 2 | 0.288669 | 0
1-HKUST | run 2 | 0.288669 | 18
CASIA_CRIPAC_2 | CNN-based proposal classification with part classification and object regression | 0.286158 | 0
1-HKUST | run 4 | 0.285616 | 4
1-HKUST | run 3 | 0.284595 | 0
1-HKUST | run 1 | 0.261543 | 0

Task 1b: Object detection with additional training data

Object detection with additional training data: Ordered by number of categories won

Team name | Entry description | Description of outside data used | Number of object categories won | mean AP
GoogLeNet | Ensemble of detection models. Validation is 44.5% mAP | Pretraining on ILSVRC12 classification data. | 142 | 0.439329
CUHK DeepID-Net | Combine multiple models described in the abstract without contextual modeling | ImageNet classification and localization data | 29 | 0.406659
Deep Insight | Combination of three detection models | Three CNNs from classification task are used for initialization. | 27 | 0.404517
UvA-Euvision | Deep learning with outside data | ImageNet 1000 | 1 | 0.354213
Berkeley Vision | R-CNN baseline | The CNN was pre-trained on the ILSVRC 2013 CLS dataset. | 1 | 0.345213
Trimps-Soushen | Two models combined with NMS | ILSVRC2012 classification data | 0 | 0.337485
Trimps-Soushen | Four models combination | ILSVRC2012 classification data | 0 | 0.332469
Trimps-Soushen | Combine SS regions and RP regions to train a new regressor. | ILSVRC2012 classification data | 0 | 0.317869
Trimps-Soushen | Single model trained with RP regions. | ILSVRC2012 classification data | 0 | 0.315643
MIL | RCNN + FV Rescoring | We used pretrained codebooks (trained on Imageclef) for PQ coding of Fisher vectors | 0 | 0.303669
ORANGE-BUPT | selective search, models trained in 2014 dataset, bounding box regression | Classification Training Set | 0 | 0.27703
ORANGE-BUPT | selective search, models trained in 2014 dataset, bounding box regression | Classification Training Set | 0 | 0.271499
ORANGE-BUPT | selective search, models trained in 2014 dataset | Classification Training Set | 0 | 0.269317
ORANGE-BUPT | selective search, models trained in 2013 dataset, bounding box regression | Classification Training Set | 0 | 0.265701
MPG_UT | SS, OB, TR proposals + RCNN | RCNN and Caffe pre-trained models | 0 | 0.264344
ORANGE-BUPT | selective search, models trained in 2014 dataset | Classification Training Set | 0 | 0.264307
Trimps-Soushen | A simple method which uses our localization pipeline plus NMS. | ILSVRC2012 classification data | 0 | 0.201702
MPG_UT | SS, OB, TR proposals + RCNN | RCNN and Caffe pre-trained models | 0 | 0.159337
MPG_UT | SS, OB, TR proposals + RCNN | RCNN and Caffe pre-trained models | 0 | 0.156382
CUHK DeepID-Net | Combine multiple models described in the abstract without contextual modeling. The training data includes the validation dataset 2. | ImageNet classification and localization data | --- | 0.406998
CUHK DeepID-Net2 | Combine multiple models described in the abstract without contextual modeling. The training data includes the validation dataset 2. | ImageNet classification and localization data | --- | 0.40352
CUHK DeepID-Net2 | Combine multiple models described in the abstract without contextual modeling | ImageNet classification and localization data | --- | 0.403417
Deep Insight | A single detection model. | A CNN from classification task is used for initialization. | --- | 0.401568
Deep Insight | Another single detection model. | A CNN from classification task is used for initialization. | --- | 0.396982
GoogLeNet | Single detection model. Validation is 38.75% mAP | Pretraining on ILSVRC12 classification data. | --- | 0.380277
CUHK DeepID-Net2 | Multi-stage deep CNN without contextual modeling | ImageNet classification and localization data | --- | 0.377471
CUHK DeepID-Net | A single deep CNN with deformation layers and without contextual modeling | ImageNet classification and localization data | --- | 0.349798
Virginia Tech | RCNN with finetuning | ILSVRC 2012 Classification data (Training) | --- | 0.303374
lffall | RCNN trained on val+train1k, tested on test | ILSVRC 2012 classification data | --- | 0.303068

Object detection with additional training data: Ordered by mean average precision

Team name | Entry description | Description of outside data used | mean AP | Number of object categories won
GoogLeNet | Ensemble of detection models. Validation is 44.5% mAP | Pretraining on ILSVRC12 classification data. | 0.439329 | 142
CUHK DeepID-Net | Combine multiple models described in the abstract without contextual modeling. The training data includes the validation dataset 2. | ImageNet classification and localization data | 0.406998 | ---
CUHK DeepID-Net | Combine multiple models described in the abstract without contextual modeling | ImageNet classification and localization data | 0.406659 | 29
Deep Insight | Combination of three detection models | Three CNNs from classification task are used for initialization. | 0.404517 | 27
CUHK DeepID-Net2 | Combine multiple models described in the abstract without contextual modeling. The training data includes the validation dataset 2. | ImageNet classification and localization data | 0.40352 | ---
CUHK DeepID-Net2 | Combine multiple models described in the abstract without contextual modeling | ImageNet classification and localization data | 0.403417 | ---
Deep Insight | A single detection model. | A CNN from classification task is used for initialization. | 0.401568 | ---
Deep Insight | Another single detection model. | A CNN from classification task is used for initialization. | 0.396982 | ---
GoogLeNet | Single detection model. Validation is 38.75% mAP | Pretraining on ILSVRC12 classification data. | 0.380277 | ---
CUHK DeepID-Net2 | Multi-stage deep CNN without contextual modeling | ImageNet classification and localization data | 0.377471 | ---
UvA-Euvision | Deep learning with outside data | ImageNet 1000 | 0.354213 | 1
CUHK DeepID-Net | A single deep CNN with deformation layers and without contextual modeling | ImageNet classification and localization data | 0.349798 | ---
Berkeley Vision | R-CNN baseline | The CNN was pre-trained on the ILSVRC 2013 CLS dataset. | 0.345213 | 1
Trimps-Soushen | Two models combined with NMS | ILSVRC2012 classification data | 0.337485 | 0
Trimps-Soushen | Four models combination | ILSVRC2012 classification data | 0.332469 | 0
Trimps-Soushen | Combine SS regions and RP regions to train a new regressor. | ILSVRC2012 classification data | 0.317869 | 0
Trimps-Soushen | Single model trained with RP regions. | ILSVRC2012 classification data | 0.315643 | 0
MIL | RCNN + FV Rescoring | We used pretrained codebooks (trained on Imageclef) for PQ coding of Fisher vectors | 0.303669 | 0
Virginia Tech | RCNN with finetuning | ILSVRC 2012 Classification data (Training) | 0.303374 | ---
lffall | RCNN trained on val+train1k, tested on test | ILSVRC 2012 classification data | 0.303068 | ---
ORANGE-BUPT | selective search, models trained in 2014 dataset, bounding box regression | Classification Training Set | 0.27703 | 0
ORANGE-BUPT | selective search, models trained in 2014 dataset, bounding box regression | Classification Training Set | 0.271499 | 0
ORANGE-BUPT | selective search, models trained in 2014 dataset | Classification Training Set | 0.269317 | 0
ORANGE-BUPT | selective search, models trained in 2013 dataset, bounding box regression | Classification Training Set | 0.265701 | 0
MPG_UT | SS, OB, TR proposals + RCNN | RCNN and Caffe pre-trained models | 0.264344 | 0
ORANGE-BUPT | selective search, models trained in 2014 dataset | Classification Training Set | 0.264307 | 0
Trimps-Soushen | A simple method which uses our localization pipeline plus NMS. | ILSVRC2012 classification data | 0.201702 | 0
MPG_UT | SS, OB, TR proposals + RCNN | RCNN and Caffe pre-trained models | 0.159337 | 0
MPG_UT | SS, OB, TR proposals + RCNN | RCNN and Caffe pre-trained models | 0.156382 | 0

Classification+localization

Task 2a: Classification+localization with provided training data

Classification+localization with provided training data: Ordered by localization error

Team name | Entry description | Localization error | Classification error
VGG | a combination of multiple ConvNets (by averaging) | 0.253231 | 0.07405
VGG | a combination of multiple ConvNets (fusion weights learnt on the validation set) | 0.253501 | 0.07407
VGG | a combination of multiple ConvNets, including a net trained on images of different size (fusion done by averaging); detected boxes were not updated | 0.255431 | 0.07337
VGG | a combination of multiple ConvNets, including a net trained on images of different size (fusion weights learnt on the validation set); detected boxes were not updated | 0.256167 | 0.07325
GoogLeNet | Model with localization ~26% top5 val error. | 0.264414 | 0.14828
GoogLeNet | Model with localization ~26% top5 val error, limiting number of classes. | 0.264425 | 0.12724
VGG | a single ConvNet (13 convolutional and 3 fully-connected layers) | 0.267184 | 0.08434
SYSU_Vision | We compared the class-specific localization accuracy of solution 1 and solution 2 on the validation set, then chose the better solution for each class based on that accuracy. Generally speaking, solution 2 outperformed solution 1 when there were multiple objects in the image or the objects were relatively small. | 0.31899 | 0.14446
MIL | 5 top instances predicted using FV-CNN | 0.337414 | 0.20734
MIL | 5 top instances predicted using FV-CNN + class specific window size rejection. Flipped training images are added. | 0.33843 | 0.21023
SYSU_Vision | We simply averaged the results of solution 1 and solution 2 to form our solution 4. | 0.338741 | 0.14446
MIL | 5 top instances predicted using FV-CNN + class specific window size rejection | 0.340038 | 0.20823
MSRA Visual Computing | Multiple SPP-nets further tuned on validation set (A) | 0.354769 | 0.08062
MSRA Visual Computing | Multiple SPP-nets further tuned on validation set (B) | 0.354924 | 0.0806
MSRA Visual Computing | Multiple SPP-nets (B) | 0.355568 | 0.082
MSRA Visual Computing | Multiple SPP-nets (A) | 0.3562 | 0.08307
MSRA Visual Computing | A single SPP-net | 0.36118 | 0.09079
SYSU_Vision | Our solution 2 was inspired by the R-CNN framework. To test each image, we: first, used the classification model from solution 1 to get the top 5 class predictions; second, applied Selective Search to get the candidate regions; third, fine-tuned another classification model, based on the classification model above, specifically for classifying regions, then used it to score each region; fourth, took the highest-scoring region for each of the top 5 class predictions to form the final result. | 0.363441 | 0.14446
SYSU_Vision | Our algorithm employs the classification-localization framework. For classification, we train a one-thousand-class classification model based on the Alex network published at NIPS 2012. For localization, we first train a one-thousand-class localization model based on the Alex network. However, such a localization model is inclined to localize the salient region, which does not work well for ImageNet localization. So we fine-tune one thousand class-specific models based on the pre-trained one-thousand-class localization model, one for each class. But because of the shortage of training images for each class, over-fitting is a serious problem. To reduce this problem, we design a similarity-sorted fine-tuning method. First, we choose one class to fine-tune the pre-trained one-thousand-class localization model, and get a localization model for this chosen class. Then we choose the class most similar to the previously chosen class and fine-tune a model for it starting from the previously chosen class's localization model. In this way, the training images of similar classes are shared. | 0.363483 | 0.14446
MIL | 5 top class labels predicted using FV-CNN | 0.402965 | 0.18278
MIL | 5 top class labels predicted using FV-CNN + class specific window size rejection | 0.405537 | 0.18396
ORANGE-BUPT | seven models, augmentation (flip, scale and crop), one classification has one region | 0.428277 | 0.18898
ORANGE-BUPT | seven models, augmentation (flip, scale and crop), one classification has one region | 0.443422 | 0.15158
ORANGE-BUPT | seven models, augmentation (flip and crop), one classification has one region | 0.449397 | 0.16137
Cldi-KAIST | Deep CNN framework (4 networks ensemble) + Deep CNN-based Fisher framework (4 networks ensemble) + re-weighting 1, 2 | 0.468713 | 0.13949
Cldi-KAIST | Deep CNN framework (4 networks ensemble) + Deep CNN-based Fisher framework (4 networks ensemble) + re-weighting 1 | 0.469408 | 0.14115
Cldi-KAIST | Deep CNN framework (4 networks ensemble) + Deep CNN-based Fisher framework (4 networks ensemble) + re-weighting 2 | 0.47002 | 0.14214
Cldi-KAIST | Deep CNN framework (4 networks ensemble) + Deep CNN-based Fisher framework (4 networks ensemble) | 0.470726 | 0.14372
Cldi-KAIST | Deep CNN framework (4 networks ensemble) | 0.471784 | 0.14847
TTIC_ECP - EpitomicVision | EpitomicVision4: EpitomicVision2 with fixed mapping of the best matching mosaic position to bounding box | 0.482915 | 0.10563
Brno University of Technology | weighted average over 17 CNNs with 20 transformations | 0.519949 | 0.17647
GoogLeNet | No localization. Top5 val score is 6.66% error. | 0.606257 | 0.06656
Andrew Howard | Combination of Convolutional Nets with Validation set adaptation + KNN | 0.610365 | 0.08111
Andrew Howard | Combination of Convolutional Nets with Validation set adaptation | 0.611019 | 0.08226
Andrew Howard | Combination of Convolutional Nets + KNN | 0.612305 | 0.0853
NUS-BST | Three (NIN + fc) + traditional feature + kernel regression fusion | 0.612876 | 0.09794
Andrew Howard | Baseline Combination of Convolutional Nets | 0.614307 | 0.08919
NUS-BST | Three (NIN + fc) combined | 0.614909 | 0.10267
NUS-BST | Single (NIN + fc) | 0.617015 | 0.10913
TTIC_ECP - EpitomicVision | EpitomicVision3: Weighted average of probabilities assigned by EpitomicVision1 and EpitomicVision2 | 0.617129 | 0.10222
XYZ | fusion of 4 models | 0.617191 | 0.11229
XYZ | fusion of 5 models | 0.617212 | 0.11239
XYZ | fusion of 3 models | 0.617606 | 0.11359
TTIC_ECP - EpitomicVision | EpitomicVision2 (finetuned model w. scale and position search): Image classification with a single deep epitomic neural network, including search over scale and position. No localization attempted. | 0.618685 | 0.10563
XYZ | single ZF net | 0.621123 | 0.12375
TTIC_ECP - EpitomicVision | EpitomicVision1 (fast standard model): Image classification with a single deep epitomic neural network. No localization attempted. | 0.621559 | 0.11941
XYZ | single "Network in Network" net | 0.62605 | 0.1348
Fengjun Lv | average of 3 CNNs, for classification task only | 0.636642 | 0.17352
Fengjun Lv | single CNN, for classification task only | 0.636808 | 0.17433
SCUT_GLH | Fusion of CNN network | 0.637285 | 0.18784
SCUT_GLH | CNN network and rerank by the relation of labels | 0.641051 | 0.19936
BREIL_KAIST | 1 Convnet trained on original data | 0.761188 | 0.16044
DeeperVision | Simple average ensemble and box | 0.842953 | 0.09508
DeeperVision | Weighted ensemble and box | 0.843161 | 0.09556
DeeperVision | Best single model | 0.95141 | 0.10515
UI | --- | 0.99973 | 0.99525
DeeperVision | Simple average ensemble | 1.0 | 0.09508
DeeperVision | Weighted ensemble | 1.0 | 0.09556
BDC-I2R,UPMC | Adaptive fusion of multiple CNN models with output rectification (original training data) | 1.0 | 0.11326
BDC-I2R,UPMC | Adaptive fusion of multiple CNN models (original training data) | 1.0 | 0.11403
UvA-Euvision | Multi with classification only | 1.0 | 0.12117
BDC-I2R,UPMC | A single CNN model (original training data) | 1.0 | 0.12128
UvA-Euvision | Single with classification only | 1.0 | 0.12376
libccv | 1 convnet, MattNet, 16-bit half precision parameters | 1.0 | 0.16032
PassBy | Combine two different models, using the scheme in our previous submission. | 1.0 | 0.16705
PassBy | Using just one convolutional neural network. Proposed weighted averaging scheme over several salient images obtained from the original images, combined with the standard 10 crops (4 corners plus one center). No outside training data are used. | 1.0 | 0.16894
PassBy | Using just one convolutional neural network. Averaged over several salient images obtained from the original images, combined with the standard 10 crops. No outside training data are used. (no location information included) | 1.0 | 0.17092
DeepCNet | Brief description. Deep ConvNet with 8 layers of 2x2 max-pooling; trained on supplied data. | 1.0 | 0.17481
UI | --- | 1.0 | 0.99504

Classification+localization with provided training data: Ordered by classification error

Team name | Entry description | Classification error | Localization error
GoogLeNet | No localization. Top5 val score is 6.66% error. | 0.06656 | 0.606257
VGG | a combination of multiple ConvNets, including a net trained on images of different size (fusion weights learnt on the validation set); detected boxes were not updated | 0.07325 | 0.256167
VGG | a combination of multiple ConvNets, including a net trained on images of different size (fusion done by averaging); detected boxes were not updated | 0.07337 | 0.255431
VGG | a combination of multiple ConvNets (by averaging) | 0.07405 | 0.253231
VGG | a combination of multiple ConvNets (fusion weights learnt on the validation set) | 0.07407 | 0.253501
MSRA Visual Computing | Multiple SPP-nets further tuned on validation set (B) | 0.0806 | 0.354924
MSRA Visual Computing | Multiple SPP-nets further tuned on validation set (A) | 0.08062 | 0.354769
Andrew Howard | Combination of Convolutional Nets with Validation set adaptation + KNN | 0.08111 | 0.610365
MSRA Visual Computing | Multiple SPP-nets (B) | 0.082 | 0.355568
Andrew Howard | Combination of Convolutional Nets with Validation set adaptation | 0.08226 | 0.611019
MSRA Visual Computing | Multiple SPP-nets (A) | 0.08307 | 0.3562
VGG | a single ConvNet (13 convolutional and 3 fully-connected layers) | 0.08434 | 0.267184
Andrew Howard | Combination of Convolutional Nets + KNN | 0.0853 | 0.612305
Andrew Howard | Baseline Combination of Convolutional Nets | 0.08919 | 0.614307
MSRA Visual Computing | A single SPP-net | 0.09079 | 0.36118
DeeperVision | Simple average ensemble | 0.09508 | 1.0
DeeperVision | Simple average ensemble and box | 0.09508 | 0.842953
DeeperVision | Weighted ensemble | 0.09556 | 1.0
DeeperVision | Weighted ensemble and box | 0.09556 | 0.843161
NUS-BST | Three (NIN + fc) + traditional feature + kernel regression fusion | 0.09794 | 0.612876
TTIC_ECP - EpitomicVision | EpitomicVision3: Weighted average of probabilities assigned by EpitomicVision1 and EpitomicVision2 | 0.10222 | 0.617129
NUS-BST | Three (NIN + fc) combined | 0.10267 | 0.614909
DeeperVision | Best single model | 0.10515 | 0.95141
TTIC_ECP - EpitomicVision | EpitomicVision2 (finetuned model w. scale and position search): Image classification with a single deep epitomic neural network, including search over scale and position. No localization attempted. | 0.10563 | 0.618685
TTIC_ECP - EpitomicVision | EpitomicVision4: EpitomicVision2 with fixed mapping of the best matching mosaic position to bounding box | 0.10563 | 0.482915
NUS-BST | Single (NIN + fc) | 0.10913 | 0.617015
XYZ | fusion of 4 models | 0.11229 | 0.617191
XYZ | fusion of 5 models | 0.11239 | 0.617212
BDC-I2R,UPMC | Adaptive fusion of multiple CNN models with output rectification (original training data) | 0.11326 | 1.0
XYZ | fusion of 3 models | 0.11359 | 0.617606
BDC-I2R,UPMC | Adaptive fusion of multiple CNN models (original training data) | 0.11403 | 1.0
TTIC_ECP - EpitomicVision | EpitomicVision1 (fast standard model): Image classification with a single deep epitomic neural network. No localization attempted. | 0.11941 | 0.621559
UvA-Euvision | Multi with classification only | 0.12117 | 1.0
BDC-I2R,UPMC | A single CNN model (original training data) | 0.12128 | 1.0
XYZ | single ZF net | 0.12375 | 0.621123
UvA-Euvision | Single with classification only | 0.12376 | 1.0
GoogLeNet | Model with localization ~26% top5 val error, limiting number of classes. | 0.12724 | 0.264425
XYZ | single "Network in Network" net | 0.1348 | 0.62605
Cldi-KAIST | Deep CNN framework (4 networks ensemble) + Deep CNN-based Fisher framework (4 networks ensemble) + re-weighting 1, 2 | 0.13949 | 0.468713
Cldi-KAIST | Deep CNN framework (4 networks ensemble) + Deep CNN-based Fisher framework (4 networks ensemble) + re-weighting 1 | 0.14115 | 0.469408
Cldi-KAIST | Deep CNN framework (4 networks ensemble) + Deep CNN-based Fisher framework (4 networks ensemble) + re-weighting 2 | 0.14214 | 0.47002
Cldi-KAIST | Deep CNN framework (4 networks ensemble) + Deep CNN-based Fisher framework (4 networks ensemble) | 0.14372 | 0.470726
SYSU_Vision | Our algorithm employs the classification-localization framework. For classification, we train a one-thousand-class classification model based on the Alex network published at NIPS 2012. For localization, we first train a one-thousand-class localization model based on the Alex network. However, such a localization model is inclined to localize the salient region, which does not work well for ImageNet localization. So we fine-tune one thousand class-specific models based on the pre-trained one-thousand-class localization model, one for each class. But because of the shortage of training images for each class, over-fitting is a serious problem. To reduce this problem, we design a similarity-sorted fine-tuning method. First, we choose one class to fine-tune the pre-trained one-thousand-class localization model, and get a localization model for this chosen class. Then we choose the class most similar to the previously chosen class and fine-tune a model for it starting from the previously chosen class's localization model. In this way, the training images of similar classes are shared. | 0.14446 | 0.363483
SYSU_Vision | We simply averaged the results of solution 1 and solution 2 to form our solution 4. | 0.14446 | 0.338741
SYSU_Vision | Our solution 2 was inspired by the R-CNN framework. To test each image, we: first, used the classification model from solution 1 to get the top 5 class predictions; second, applied Selective Search to get the candidate regions; third, fine-tuned another classification model, based on the classification model above, specifically for classifying regions, then used it to score each region; fourth, took the highest-scoring region for each of the top 5 class predictions to form the final result. | 0.14446 | 0.363441
SYSU_Vision | We compared the class-specific localization accuracy of solution 1 and solution 2 on the validation set, then chose the better solution for each class based on that accuracy. Generally speaking, solution 2 outperformed solution 1 when there were multiple objects in the image or the objects were relatively small. | 0.14446 | 0.31899
GoogLeNet | Model with localization ~26% top5 val error. | 0.14828 | 0.264414
Cldi-KAIST | Deep CNN framework (4 networks ensemble) | 0.14847 | 0.471784
ORANGE-BUPT | seven models, augmentation (flip, scale and crop), one classification has one region | 0.15158 | 0.443422
libccv | 1 convnet, MattNet, 16-bit half precision parameters | 0.16032 | 1.0
BREIL_KAIST | 1 Convnet trained on original data | 0.16044 | 0.761188
ORANGE-BUPT | seven models, augmentation (flip and crop), one classification has one region | 0.16137 | 0.449397
PassBy | Combine two different models, using the scheme in our previous submission. | 0.16705 | 1.0
PassBy | Using just one convolutional neural network. Proposed weighted averaging scheme over several salient images obtained from the original images, combined with the standard 10 crops (4 corners plus one center). No outside training data are used. | 0.16894 | 1.0
PassBy | Using just one convolutional neural network. Averaged over several salient images obtained from the original images, combined with the standard 10 crops. No outside training data are used. (no location information included) | 0.17092 | 1.0
Fengjun Lv | average of 3 CNNs, for classification task only | 0.17352 | 0.636642
Fengjun Lv | single CNN, for classification task only | 0.17433 | 0.636808
DeepCNet | Brief description. Deep ConvNet with 8 layers of 2x2 max-pooling; trained on supplied data. | 0.17481 | 1.0
Brno University of Technology | weighted average over 17 CNNs with 20 transformations | 0.17647 | 0.519949
MIL | 5 top class labels predicted using FV-CNN | 0.18278 | 0.402965
MIL | 5 top class labels predicted using FV-CNN + class specific window size rejection | 0.18396 | 0.405537
SCUT_GLH | Fusion of CNN network | 0.18784 | 0.637285
ORANGE-BUPT | seven models, augmentation (flip, scale and crop), one classification has one region | 0.18898 | 0.428277
SCUT_GLH | CNN network and rerank by the relation of labels | 0.19936 | 0.641051
MIL | 5 top instances predicted using FV-CNN | 0.20734 | 0.337414
MIL | 5 top instances predicted using FV-CNN + class specific window size rejection | 0.20823 | 0.340038
MIL | 5 top instances predicted using FV-CNN + class specific window size rejection. Flipped training images are added. | 0.21023 | 0.33843
UI | --- | 0.99504 | 1.0
UI | --- | 0.99525 | 0.99973

Task 2b: Classification+localization with additional training data

Classification+localization with additional training data: Ordered by localization error

Team name | Entry description | Description of outside data used | Localization error | Classification error
Adobe-UIUC | CLS+LOC try #3 | 2000 additional ImageNet classes to train the classifiers | 0.300961 | 0.13456
Adobe-UIUC | CLS+LOC try #2 | 2000 additional ImageNet classes to train the classifiers | 0.307486 | 0.13042
Adobe-UIUC | CLS+LOC try #4 | 2000 additional ImageNet classes to train the classifiers | 0.333254 | 0.11883
Adobe-UIUC | CLS+LOC try #1 | 2000 additional ImageNet classes to train the classifiers | 0.334343 | 0.11578
Trimps-Soushen | Combine three big models plus one complementary model | 396000 external images from ILSVRC2010 and ILSVRC2011 training data | 0.422208 | 0.1146
Trimps-Soushen | Combine five models plus one complementary model | 396000 external images from ILSVRC2010 and ILSVRC2011 training data | 0.422592 | 0.11469
Trimps-Soushen | Combine four models | 300000 external images from ILSVRC2010 and ILSVRC2011 training data | 0.422623 | 0.11583
ORANGE-BUPT | seven models, augmentation (flip, scale and crop), five confident regions | 50,000 images in validation set | 0.427042 | 0.18593
Trimps-Soushen | Combine nine models | 396000 external images from ILSVRC2010 and ILSVRC2011 training data | 0.42783 | 0.11616
Trimps-Soushen | Single model | 396000 external images from ILSVRC2010 and ILSVRC2011 training data | 0.430289 | 0.12088
ORANGE-BUPT | seven models, augmentation (flip, scale and crop), five confident regions | 50,000 images in validation set | 0.442198 | 0.14797
CASIA_CRIPAC_Weak_Supervision | Weakly supervised localization + convolutional networks | MCG proposals pretrained on PASCAL VOC 2012 | 0.619619 | 0.11358
Adobe-UIUC | CLS w/o LOC | 2000 additional ImageNet classes to train the classifiers | 1.0 | 0.11733

Classification+localization with additional training data: Ordered by classification error

Team name | Entry description | Description of outside data used | Classification error | Localization error
CASIA_CRIPAC_Weak_Supervision | Weakly supervised localization + convolutional networks | MCG proposals pretrained on PASCAL VOC 2012 | 0.11358 | 0.619619
Trimps-Soushen | Combine three big models plus one complementary model | 396000 external images from ILSVRC2010 and ILSVRC2011 training data | 0.1146 | 0.422208
Trimps-Soushen | Combine five models plus one complementary model | 396000 external images from ILSVRC2010 and ILSVRC2011 training data | 0.11469 | 0.422592
Adobe-UIUC | CLS+LOC try #1 | 2000 additional ImageNet classes to train the classifiers | 0.11578 | 0.334343
Trimps-Soushen | Combine four models | 300000 external images from ILSVRC2010 and ILSVRC2011 training data | 0.11583 | 0.422623
Trimps-Soushen | Combine nine models | 396000 external images from ILSVRC2010 and ILSVRC2011 training data | 0.11616 | 0.42783
Adobe-UIUC | CLS w/o LOC | 2000 additional ImageNet classes to train the classifiers | 0.11733 | 1.0
Adobe-UIUC | CLS+LOC try #4 | 2000 additional ImageNet classes to train the classifiers | 0.11883 | 0.333254
Trimps-Soushen | Single model | 396000 external images from ILSVRC2010 and ILSVRC2011 training data | 0.12088 | 0.430289
Adobe-UIUC | CLS+LOC try #2 | 2000 additional ImageNet classes to train the classifiers | 0.13042 | 0.307486
Adobe-UIUC | CLS+LOC try #3 | 2000 additional ImageNet classes to train the classifiers | 0.13456 | 0.300961
ORANGE-BUPT | seven models, augmentation (flip, scale and crop), five confident regions | 50,000 images in validation set | 0.14797 | 0.442198
ORANGE-BUPT | seven models, augmentation (flip, scale and crop), five confident regions | 50,000 images in validation set | 0.18593 | 0.427042

Team information

Team name (with project link where available) | Team members | Abstract

1-HKUST
Cewu Lu (Hong Kong University of Science and Technology)
Hei Law* (Hong Kong University of Science and Technology)
Hao Chen* (The Chinese University of Hong Kong)
Qifeng Chen* (Stanford University)
Yao Xiao* (Hong Kong University of Science and Technology)
Chi Keung Tang (Hong Kong University of Science and Technology)
(* indicates equal contribution; names listed alphabetically)


For the detection task, we first generate candidate bounding boxes, and then our system recognizes objects in these candidate proposals. We try to improve both localization and recognition. On the localization side, initial candidate proposals are generated by selective search [1], and a novel bounding-box regression method is used for better object localization. On the recognition side, to represent a candidate proposal, we adopt many features, such as RCNN features [2], IFV features [3], and DPM features [4], to name a few. Given these features, category-specific combination functions are learnt to improve object recognition. Background priors and object interaction priors are also learnt and applied in our system. In addition, our framework involves some other novel techniques. The pertinent technical details for the submission are in preparation. In the ILSVRC2014 competition, we do not use any outside training data.


[1]Uijlings J R R, van de Sande K E A, Gevers T, et al. Selective search for object recognition[J]. International journal of computer vision, 2013, 104(2): 154-171.

[2]Girshick R, Donahue J, Darrell T, et al. Rich feature hierarchies for accurate object detection and semantic segmentation[J]. arXiv preprint arXiv:1311.2524, 2013.

[3]Perronnin F, Sánchez J, Mensink T. Improving the fisher kernel for large-scale image classification[M]//Computer Vision–ECCV 2010. Springer Berlin Heidelberg, 2010: 143-156.

[4]Felzenszwalb P, McAllester D, Ramanan D. A discriminatively trained, multiscale, deformable part model[C]//Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on. IEEE, 2008: 1-8.

Adobe-UIUC
Hailin Jin (Adobe)
Zhaowen Wang (UIUC)
Jianchao Yang (Adobe)
Zhe Lin (Adobe)
Our algorithm is based on an integrated convolutional neural network framework for both classification and localization. We train several 6-layer convnets using 3000 ImageNet classes for classification and then adapt one model for bounding box regression. At test time, we use k-means to find bounding box clusters and rank the clusters according to the classification scores.

Andrew Howard
Andrew Howard - Howard Vision Technologies
Deep convolutional neural networks are very costly to train, so my submission focuses on reusing networks through retraining and by using the same network to make multiple predictions.

I started with a deeper and wider Zeiler/Fergus net (ZF) [1]. The differences from the base ZF model are that I use 7 convolutional layers, with convolutional layers 3-7 having 512 filters. It took over 6 weeks to train on a GTX Titan using cuda-convnet [2]. This base model is trained using 224x224 crops from the full 256xN image [3] with random horizontal flips [4]. Each training crop is further perturbed with color channel noise [4] and random variation in photometric properties (lighting, contrast, color) [3]. This base model is then adapted to build a high resolution [3] and a low resolution model. The high resolution model is retrained on 224x224 crops from a 448xN sized image with random variation in size (448 ± 10%) and no dropout, due to the large number of training crops available. The low resolution model embeds the entire image, resized to 150xN, into a random location in the 224x224 crop for retraining. I also retrain the base model to increase the size of the fully connected layers to a size larger than would fit in GPU memory if the model were trained together (the fully connected layer is grown from 4096x4096 to 12288x12288 and trained from scratch while the convolutional layers are held fixed). When the new fully connected layers are retrained, I use a slow form of Polyak averaging, which averages the model parameters after each epoch rather than after each iterate. Each retrained model takes roughly 1/3 of the time that training a model from scratch would.
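
The epoch-level parameter averaging mentioned above can be illustrated with a minimal sketch. It assumes a uniform running mean over end-of-epoch snapshots; the abstract does not give the exact weighting, and the function name `update_epoch_average` and the toy snapshots are hypothetical.

```python
def update_epoch_average(avg_params, new_params, epochs_seen):
    """Fold one end-of-epoch parameter snapshot into a running mean.

    avg_params:  mean of the snapshots from the first `epochs_seen` epochs
                 (ignored when epochs_seen is 0).
    new_params:  parameters at the end of the current epoch.
    Assumption: every epoch gets equal weight (uniform average).
    """
    if epochs_seen == 0:
        return list(new_params)
    k = epochs_seen + 1
    # Incremental mean: avg_k = avg_{k-1} + (x_k - avg_{k-1}) / k
    return [a + (p - a) / k for a, p in zip(avg_params, new_params)]


# Hypothetical end-of-epoch snapshots of a two-parameter model.
snapshots = [[1.0, 4.0], [3.0, 2.0], [5.0, 0.0]]
avg = []
for epoch, params in enumerate(snapshots):
    avg = update_epoch_average(avg, params, epoch)
```

Updating once per epoch rather than once per iterate means the average is touched only a handful of times during retraining, which is presumably why the abstract calls it a slow form of the technique.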

At test time, predictions are made at 6 resolutions, each one roughly 30% larger than the next smaller size. Each of the 3 models is responsible for 2 resolutions. The base resolution model acts on images scaled to 256xN and 340xN. The high resolution model acts on 448xN and 576xN, and the low resolution model acts on 150xN and 200xN. Each resolution uses locations selected on a dense spatial grid over the entire image, similar to [5]. Predictions at each spatial location are averaged into a prediction for a given resolution, and then the predictions at each resolution are combined evenly.
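The two-step averaging described above (locations first, then resolutions) might look like the following sketch. The data layout, function names and toy scores are assumptions made for illustration only.

```python
def mean_vectors(vectors):
    """Element-wise mean of equally weighted class-score vectors."""
    count = len(vectors)
    return [sum(column) / count for column in zip(*vectors)]


def combine_predictions(scores_by_resolution):
    """scores_by_resolution maps a resolution label to a list of per-location
    class-score vectors. Locations are averaged first, then resolutions."""
    per_resolution = [mean_vectors(locs) for locs in scores_by_resolution.values()]
    return mean_vectors(per_resolution)


# The six test scales from the text; each is roughly 1.3x the previous one.
scales = [150, 200, 256, 340, 448, 576]
ratios = [b / a for a, b in zip(scales, scales[1:])]

# Hypothetical two-class scores at two resolutions.
toy = {
    "256xN": [[0.8, 0.2], [0.6, 0.4]],              # two grid locations
    "340xN": [[0.5, 0.5], [0.7, 0.3], [0.9, 0.1]],  # three grid locations
}
final = combine_predictions(toy)
```

Averaging within a resolution before combining keeps a resolution with many grid locations from dominating one with few.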

I further build a KNN model on the validation set as suggested by the NUS team last year [6]. For features, I use the final 1000 dimension aggregate predictions. I use leave one out cross validation on the validation set to choose K (the number of neighbors) and the weighting between the final neural network prediction and the KNN prediction.
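A minimal sketch of choosing K and the blending weight by leave-one-out cross validation is given below. It assumes Euclidean distance on the prediction vectors, simple vote counting and top-1 accuracy as the selection criterion; none of these details are stated in the description, and all names are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def knn_scores(query, features, labels, k, num_classes, skip=None):
    """Class-vote fractions among the k nearest neighbours of `query`.
    `skip` is the index left out (used for leave-one-out)."""
    order = sorted(
        (i for i in range(len(features)) if i != skip),
        key=lambda i: sq_dist(query, features[i]),
    )
    votes = [0.0] * num_classes
    for i in order[:k]:
        votes[labels[i]] += 1.0 / k
    return votes


def loo_accuracy(features, labels, k, weight, num_classes):
    """Leave-one-out top-1 accuracy of weight*KNN + (1-weight)*network."""
    correct = 0
    for i, (net, label) in enumerate(zip(features, labels)):
        knn = knn_scores(net, features, labels, k, num_classes, skip=i)
        blend = [weight * a + (1 - weight) * b for a, b in zip(knn, net)]
        correct += blend.index(max(blend)) == label
    return correct / len(labels)


def choose_k_and_weight(features, labels, ks, weights, num_classes):
    """Grid search over K and the blending weight."""
    return max(
        ((k, w) for k in ks for w in weights),
        key=lambda kw: loo_accuracy(features, labels, kw[0], kw[1], num_classes),
    )


# Hypothetical two-class validation predictions and labels.
val_preds = [[0.9, 0.1], [0.8, 0.2], [0.45, 0.55],
             [0.2, 0.8], [0.1, 0.9], [0.55, 0.45]]
val_labels = [0, 0, 0, 1, 1, 1]
best_k, best_w = choose_k_and_weight(val_preds, val_labels,
                                     ks=[1, 3], weights=[0.0, 0.5, 1.0],
                                     num_classes=2)
```

Here the network's own prediction vector doubles as the KNN feature, matching the use of the final aggregate predictions as features.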

Finally I adapt the neural networks to the validation set distribution as suggested by the NUS team last year [6]. To do this, I hold fixed the convolutional layers and adapt the fully connected layers to the validation set. Each neural network model is adapted on a different random 80% subset of the validation set with early stopping based on the remaining 20% of the validation.

The final submission is made up of 2 sets of 3 networks plus 1 KNN prediction. The second set of networks is a smaller, earlier version and adds only a little value.

[1] M.D. Zeiler, R. Fergus, "Visualizing and Understanding Convolutional Networks." ECCV 2014.

[2] https://code.google.com/p/cuda-convnet

[3] A.G. Howard, "Some Improvements on Deep Convolutional Neural Network Based Image Classification." ICLR 2014.

[4] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks." NIPS 2012.

[5] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "Overfeat: Integrated recognition, localization and detection using convolutional networks." ICLR 2014.

[6] M. Lin, Q. Chen, J. Dong, J. Huang, W. Xia, "Adaptive Non-parametric Rectification of Shallow and Deep Experts." ILSVRC 2013.
BDC-I2R,UPMC
Big Deep Computing Team

Olivier Morère (1,2),
Hanlin Goh (1),
Antoine Veillard (2),
Vijay Chandrasekhar (1)

1: Institute for Infocomm Research, Singapore
2: Université Pierre et Marie Curie, Paris, France
We use multiple deep convolutional neural networks (CNNs) [Krizhevsky et al. 2012], each trained with a different set of parameters. The deep representations are extracted across multiple scales and positions within an image. Model fusion is adaptively performed within each CNN model, and subsequently across the different models. Class distribution priors are used to rectify the outputs of the model. The CNN features are extracted across a GPU cluster, while a CPU cluster is used to optimize parameters in a MapReduce framework.

We submit three runs for the classification-only task. No external data was used in our models.
Run 1: A single CNN model.
Run 2: Adaptive fusion of multiple CNN models.
Run 3: Adaptive fusion of multiple CNN models with output rectification.
Berkeley Vision
Ross Girshick, UC Berkeley
Jeff Donahue, UC Berkeley
Sergio Guadarrama, UC Berkeley
Trevor Darrell, UC Berkeley
Jitendra Malik, UC Berkeley
Our detection entry is a baseline for R-CNN [1] on the expanded ILSVRC 2014 detection dataset. We followed the approach for training on ILSVRC 2013 detection described in the R-CNN tech report [2], but with two small changes.

1) We used the additional training annotations for the 2014 detection dataset.

2) We used a slightly larger convolutional neural network than in [1, 2]. In this network, convolutional layers one through five have 96, 384, 512, 512, and 384 filters, respectively. The two fully connected layers (before the linear classifiers) both have 4096 output units. This network was pre-trained on the ILSVRC 2013 CLS dataset before fine-tuning for detection.

We performed control experiments to compare these changes to the results in [2]. On the val2 validation set (see [2]), the new training data added for 2014 improved results from 29.7% to 31.2% mAP, using the same CNN as in [2] in both cases. Using the slightly larger CNN improved results on val2 to 32.1%. Bounding-box regression further increased this to 33.4% (compared to 31.0% in [2]).

[1] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. CVPR 2014.

[2] Ross Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. Technical report. http://arxiv.org/abs/1311.2524v4.
BREIL_KAIST
KAIST department of EE

Jun-Cheol Park, Yunhun Jang, Hyungwon Choi, JaeYoung Jun
Our team trained a deep convolutional neural network with an architecture similar to the one introduced in [1]. The overall training details are based on [2]. We used Caffe [3] as our development environment. For localization, we computed image-specific class saliency as in [4].

[1] Chatfield, Ken, et al. "Return of the Devil in the Details: Delving Deep into Convolutional Nets." arXiv preprint arXiv:1405.3531 (2014).
[2] Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.
[3] Jia, Yangqing. "Caffe: An open source convolutional architecture for fast feature embedding." http://caffe.berkeleyvision.org (2013).
[4] Simonyan, Karen, Andrea Vedaldi, and Andrew Zisserman. "Deep Inside Convolutional Networks: Visualising Image Classification Models and Saliency Maps." arXiv preprint arXiv:1312.6034 (2013).
Brno University of Technology
Martin Kolář, Michal Hradiš, Pavel Svoboda
Our method is based on calculating the weighted average of multiple architectures of standard Convolutional Neural Networks (Krizhevsky et al. 2012) on randomly transformed images (color and geometry). Results were optimised using textual associations between synsets (Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013). We used code based on Caffe by Yangqing Jia on the IT4I computing cluster, and trained 17 CNNs on Kepler K20 GPUs.
CASIA_CRIPAC_2
Peihao Huang, Institute of Automation, Chinese Academy of Sciences
Yongzhen Huang, Institute of Automation, Chinese Academy of Sciences
Feng Liu, School of Automation, Southeast University
Zifeng Wu, Institute of Automation, Chinese Academy of Sciences
Fang Zhao, Institute of Automation, Chinese Academy of Sciences
Liang Wang, Institute of Automation, Chinese Academy of Sciences
Tieniu Tan, Institute of Automation, Chinese Academy of Sciences
Our method is mainly based on the framework of R-CNN for object detection. However, the object proposals are different from those used in R-CNN, explained as follows.
(1) We train a part classification model using a CNN to judge whether a proposal (obtained by the selective search algorithm) belongs to an object.
(2) We train an object regression model using CNN, to estimate the location and the size of the object from a part.
(3) For each image, we use the K-means algorithm for clustering over the locations and the sizes estimated in (2).
(4) We choose the proposals close to the clustering centers.
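Steps (3) and (4) can be sketched as follows. This is an illustration under assumptions: boxes are (centre_x, centre_y, width, height) tuples, k-means is seeded with the first k boxes, and "close to the clustering centers" is read as taking the single nearest estimated box per centre. The function names and toy boxes are hypothetical.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=20):
    """Plain Lloyd's k-means; the first k points seed the centres
    (a deterministic simplification for this sketch)."""
    centres = [list(p) for p in points[:k]]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centres[c]))
            groups[nearest].append(p)
        for c, group in enumerate(groups):
            if group:  # keep the old centre if a cluster empties out
                centres[c] = [sum(col) / len(group) for col in zip(*group)]
    return centres


def pick_proposals(estimated_boxes, k):
    """Return indices of the estimated boxes closest to each k-means centre."""
    centres = kmeans(estimated_boxes, k)
    return sorted({
        min(range(len(estimated_boxes)),
            key=lambda i: sq_dist(estimated_boxes[i], c))
        for c in centres
    })


# Hypothetical regression outputs for one image: two objects, several parts each.
boxes = [
    (50, 50, 40, 40), (300, 200, 80, 60), (52, 49, 42, 38),
    (48, 51, 39, 41), (305, 198, 78, 62), (301, 201, 80, 60),
]
chosen = pick_proposals(boxes, k=2)
```

Clustering in the joint location-and-size space groups estimates that agree on the same object, so one representative box per cluster can stand in for many part proposals.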

Another difference is that, to obtain the pre-trained CNN model, we use the 200-category images of dataset 1 for training rather than the 1000-category images of dataset 2.
CASIA_CRIPAC_Weak_Supervision
Weiqiang Ren, CRIPAC, CASIA
Chong Wang, CRIPAC, CASIA
Yanhua Cheng, CRIPAC, CASIA
Kaiqi Huang, CRIPAC, CASIA
Tieniu Tan, CRIPAC, CASIA
We use weakly supervised object localization from classification labels only to enhance the classification task. First, the MCG proposal method, pre-trained on PASCAL VOC 2012, is used to extract region proposals, and each region proposal is represented using pre-trained convolutional networks.
Then, a multiple instance learning strategy is adopted to learn the object detectors with weak supervision. Using the learned object detectors, we are able to learn object classifiers instead of global image classifiers using a multi-class softmax model. Finally, the detection models and classification models are fused to produce the final classification results.
Cldi-KAIST
Kyunghyun Paeng (KAIST), Donggeun Yoo (KAIST), Sunggyun Park (KAIST), Jungin Lee (Cldi Inc.), Anthony S. Paek (Cldi Inc.), In So Kweon (KAIST), Seong Dae Kim (KAIST)
Our submission is based on a combination of two methodologies: the Deep Convolutional Neural Network (DCNN) framework [1] as a global expert and the DCNN-based Fisher framework as a local expert. Simple reweighting techniques are used as well. Our localization method is a bounding box regression.

In order to train a global expert, we have used 10 networks under different settings: using various preprocessing methods, and/or different network architectures. We selected the ensemble of networks with the best accuracy on the validation dataset.

Our local expert is trained using local features composed of DCNN responses from mid-layers. We encoded the local features into Fisher vectors [2] and trained SVM classifiers. In order to prevent overfitting, we trained our network using 0.9 million of the training images, and the remaining 0.3 million were used for Fisher encoding and SVM training.

[1] Krizhevsky, Alex and Sutskever, Ilya and Hinton, Geoff, "Imagenet classification with deep convolutional neural networks." NIPS 2012.

[2] Perronnin, Florent, Jorge Sánchez, and Thomas Mensink. "Improving the fisher kernel for large-scale image classification." ECCV 2010.


CUHK DeepID-Net
Wanli Ouyang, Ping Luo, Xingyu Zeng, Shi Qiu, Yonglong Tian, Hongsheng Li, Shuo Yang, Zhe Wang, Yuanjun Xiong, Chen Qian, Zhenyao Zhu, Ruohui Wang, Chen-Change Loy, Xiaogang Wang, Xiaoou Tang

Multimedia Laboratory, The Chinese University of Hong Kong
The work uses the ImageNet classification training set (1000 classes) to pre-train features, and fine-tunes the features on the ImageNet detection training set (200 classes). This detection work is based on a deep CNN with newly proposed deformation layers, a feature pre-training strategy, sub-region pooling and model combination. The effectiveness of learning deformation models of object parts has been demonstrated in object detection by many existing non-deep-learning detectors, e.g. [a]. However, it is missing from current deep learning models. In deep CNN models, max pooling and average pooling are useful in handling deformation but cannot learn the deformation penalty and geometric model of object parts. We design the deformation layer for deep models so that the deformation penalty of objects can be learned by deep models. The deformation layer was first proposed in our recently published work [b], which showed significant improvement in pedestrian detection. In this submission, we extend it to general object detection on ImageNet. In [b], the deformation layer was only applied to a single level corresponding to body parts, while in this work the deformation layer is applied to every convolutional layer to capture geometric deformation at all levels. In [b], it was assumed that a pedestrian only has one instance of a body part, so each part filter only has one optimal response in a detection window. In this work, it is assumed that an object has multiple instances of a part (e.g. a building has many windows), so each part filter is allowed to have multiple response peaks in a detection window. This new model is more suitable for general object detection.

The whole detection pipeline is much more complex than [b]. In addition to the above improvement, we also added several new components to the pipeline, including feature pre-training on the ImageNet classification dataset (the objective function is the image classification task), feature fine-tuning on the ImageNet detection dataset (the objective function is the object detection task), a proposed new sub-region pooling step, contextual modeling (which uses the whole-image prediction scores over 1000 classes as contextual features to combine with features extracted from a detection window with the deep CNN), and SVM classification using the extracted features. We also adopted bounding box regression [c].

A new sub-region pooling strategy is proposed. It divides the detection window into sub-regions, and applies max-pooling or average pooling across feature vectors extracted from different sub-regions. It improves the performance and also increases the model diversity.
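As a rough illustration of the sub-region pooling idea, the sketch below splits a window into a grid and pools element-wise across one feature vector per sub-region. The grid layout, box format and function names are assumptions; the actual sub-region layout and feature extractor are not specified here.

```python
def split_window(box, rows, cols):
    """Split (x0, y0, x1, y1) into a rows x cols grid of sub-region boxes."""
    x0, y0, x1, y1 = box
    w, h = (x1 - x0) / cols, (y1 - y0) / rows
    return [
        (x0 + c * w, y0 + r * h, x0 + (c + 1) * w, y0 + (r + 1) * h)
        for r in range(rows) for c in range(cols)
    ]


def pool_across_subregions(feature_vectors, mode="max"):
    """Pool element-wise across one feature vector per sub-region."""
    if mode == "max":
        return [max(col) for col in zip(*feature_vectors)]
    return [sum(col) / len(col) for col in zip(*feature_vectors)]


halves = split_window((0, 0, 100, 60), rows=1, cols=2)
# Hypothetical CNN features, one vector per sub-region.
features = [[1.0, 4.0], [3.0, 2.0]]
pooled_max = pool_across_subregions(features, mode="max")
pooled_avg = pool_across_subregions(features, mode="avg")
```

Switching between max and average pooling, or changing the grid, yields different window descriptors from the same network, which is one way such a step could add model diversity.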

Unlike the state-of-the-art deep learning detection framework [c], which pre-trains the net on ImageNet classification data (1000 classes), we propose a new strategy for pre-training on the ImageNet classification data (1000 classes), such that the pre-trained features are much more effective on the detection task and have better discriminative power for object localization.

By changing the configuration of each component of the detection pipeline, multiple models with large diversity are generated. Multiple models are automatically selected and combined to generate the final detection result.
We have submitted the results of five different approaches. The first two results report the best performance achieved with a single model. They differ in whether contextual features from image classification are used. The remaining three results report the best performance achieved with model combination. They differ in whether contextual modeling is used, and in whether the validation 2 dataset from ImageNet is used as part of training.


[a] P. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan. Object detection with discriminatively trained part based models. IEEE Trans. PAMI, 32:1627–1645, 2010.

[b] Wanli Ouyang, Xiaogang Wang, "Joint Deep Learning for Pedestrian Detection ", In Proc. IEEE ICCV 2013.

[c] R. Girshick, J. Donahue, T. Darrell, J. Malik, "Rich feature hierarchies for accurate object detection and semantic segmentation", CVPR 2014.
CUHK DeepID-Net2
Wanli Ouyang, Xingyu Zeng, Shi Qiu, Ping Luo, Yonglong Tian, Hongsheng Li, Shuo Yang, Zhe Wang, Yuanjun Xiong, Chen Qian, Zhenyao Zhu, Ruohui Wang, Chen-Change Loy, Xiaogang Wang, Xiaoou Tang

Multimedia Laboratory, The Chinese University of Hong Kong
The work uses the ImageNet classification training set (1000 classes) to pre-train features, and fine-tunes the features on the ImageNet detection training set (200 classes). This detection work is based on a multi-stage deep CNN and model combination. Multi-stage classifiers have been widely used in object detection and have achieved great success. With a cascaded structure, each classifier processes a different subset of data. However, these classifiers are usually trained sequentially without joint optimization. In this submission, we propose a new deep architecture that can jointly train multiple classifiers through several stages of back-propagation. Each stage handles samples at a different difficulty level. Specifically, the first stage of the deep CNN handles easy samples, the second stage processes more difficult samples which cannot be handled in the first stage, and so on. Through a specific design of the training strategy, this deep architecture is able to simulate cascaded classifiers by mining hard samples to train the network stage by stage. The group of classifiers in the deep model chooses training samples stage by stage. The training is split into several back-propagation (BP) stages. Due to the design of our training procedure, the gradients of classifier parameters at the current stage are mainly influenced by the samples misclassified by the classifiers at the previous stages. At each BP stage, the whole deep model is initialized with a good starting point learned at the previous stage, and the additional classifiers focus on the misclassified hard samples. Direct back-propagation on the multi-stage deep CNN easily leads to overfitting. We design stage-wise supervised training to regularize the optimization problem. At each BP stage, classifiers at the previous stages work jointly with the classifier at the current stage in dealing with misclassified samples.
Existing cascaded classifiers only pass a single score to the next stage, while our deep model keeps the score map within a local region and it serves as contextual information to support the decision at the next stage. Our recent work [1] has explored the idea of multi-stage deep learning, but it was only applied to pedestrian detection. In this submission, we apply it to general object detection on ImageNet.

The detection pipeline is much more complex than [1]. It includes feature pre-training, multi-stage deep CNN fine-tuning, sub-region pooling, contextual modeling, SVM classification, and bounding box regression. The state-of-the-art deep learning object detection framework in [2] pre-trains the net on ImageNet classification data (1000 classes) and then fine-tunes it on ImageNet detection data (200 classes). We propose a new strategy for pre-training on the ImageNet classification data (1000 classes), such that the pre-trained features are much more effective on the detection task and have better discriminative power for object localization. A new sub-region pooling strategy is proposed. It divides the detection window into sub-regions, and applies max-pooling or average pooling across feature vectors extracted from different sub-regions. Context modeling uses the whole-image prediction scores over 1000 classes as contextual features to combine with features extracted from a detection window with the deep CNN.

By changing the configuration of each step, we can generate multiple deep models. For example, the features can be pre-trained with Alex’s net or Clarifai. With extracted features, bounding boxes can be classified with fully connected networks with hinge loss or SVM, including sub-region pooling or not. Therefore, different models can be generated. Top N models with the highest accuracies are combined by averaging. The work uses ImageNet classification training set (1000 classes) to pre-train features, and fine tunes features on ImageNet detection training set (200 classes). No other training data is used.

[1] Xingyu Zeng, Wanli Ouyang, Xiaogang Wang, "Multi-Stage Contextual Deep Learning for Pedestrian Detection ", In Proc. IEEE ICCV 2013.

[2] Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation", In Proc. CVPR, 2014.
Deep Insight
Junjie Yan (NLPR)
Naiyan Wang (HKUST)
Stan Z. Li (NLPR)
Dit-Yan Yeung (HKUST)

We use region proposals, CNN features and SVM classifiers for object detection (similar to the R-CNN framework). In our entry, we use selective search and structure edge to generate around 4000 object proposals for each image. The features of each object proposal are extracted from three CNNs, which are trained on the classification task and tuned on the detection task. The three CNNs differ in the depth of their convolutional layers. Deeper models consistently achieve better results on the validation set. The bounding box regression uses the output of the final layer as the input to refine the result. For the context, we train 200 binary classifiers on the detection data and use them to re-score the detections.
DeepCNet
Ben Graham - University of Warwick
We trained a deep convolutional network with the architecture

(input=768x768x3)-200C3-MP2-400C2-MP2-600C2-MP2-800C2-MP2-1000C2-MP2-1750C2-MP2-2500C2-MP2-3250C2-MP2-4000C2-(output=1000N softmax layer)

The architecture is inspired by the paper (Ciresan, et al. Multi-column deep neural networks for image classification, 2012).
Input images are scaled to have approximately 2^16 pixels, maintaining aspect ratio, and placed in the centre of the input field.
Sparsity is used to accelerate the training process (Graham, Sparse arrays of signatures for online character recognition http://arxiv.org/abs/1308.0371, 2013).
For training, affine transformations are used. For testing, each image is fed forward through the network only once.
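To make the architecture string easier to read, the sketch below traces the spatial size through each layer. It assumes that "nCk" denotes an unpadded, stride-1 k x k convolution with n filters and that "MPk" denotes k x k max-pooling with stride k; this reading of the notation is an assumption, not something stated in the description.

```python
import re

ARCH = ("200C3-MP2-400C2-MP2-600C2-MP2-800C2-MP2-1000C2-MP2-"
        "1750C2-MP2-2500C2-MP2-3250C2-MP2-4000C2")


def trace_sizes(arch, size):
    """Follow the spatial size through each layer of the spec string.

    Assumes 'nCk' is an unpadded, stride-1 k x k convolution with n filters
    and 'MPk' is k x k max-pooling with stride k, rounding down.
    """
    sizes = [size]
    for layer in arch.split("-"):
        conv = re.fullmatch(r"(\d+)C(\d+)", layer)
        pool = re.fullmatch(r"MP(\d+)", layer)
        if conv:
            size = size - int(conv.group(2)) + 1
        elif pool:
            size = size // int(pool.group(1))
        else:
            raise ValueError(f"unrecognised layer: {layer}")
        sizes.append(size)
    return sizes


sizes = trace_sizes(ARCH, 768)
```

Under these assumptions the 768x768 input field shrinks to a single 1x1 position after the last convolution, which is consistent with feeding a 1000-way softmax layer directly.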

Regarding Q3 in the FAQ "Do teams have to submit both classification and localization results in order to participate in Task 2?"
Due to lack of time, I have not attempted the localization part of the challenge, but I hope to work on that in future.

Thank you to all the organisers.
DeeperVision
DeeperVision
We use a very deep convolutional neural network with 10+ layers in the competition. To fully optimize such a deep model, we adopt a Nesterov-based optimization method, which is shown to be superior to conventional SGD. We also exploit more advanced data augmentation techniques, such as varying the resolution, lightness and contrast. For the model ensemble, we directly use discrete optimization to optimize the top-5 error rate.
Fengjun Lv
Fengjun Lv - Fengjun Lv Consulting
We followed the approach by Krizhevsky et al. in their NIPS 2012 paper but with a different pre-processing step. For non-square images, instead of using the central crop (which in many cases does not contain the object of interest at all, or contains it only partially), we apply Graph-Based Visual Saliency (by Harel et al. NIPS 2006) to the original image (both in training and testing) and use an integral image to get a square crop that maximizes the visual saliency. One of the two submissions is from a single CNN. The other combines multiple CNNs.
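The saliency-maximizing crop can be sketched with a summed-area table. The sketch assumes the saliency map is already available as a 2-D list of floats (the Graph-Based Visual Saliency computation itself is not reproduced) and that the square side equals the shorter image dimension; function names and the toy map are hypothetical.

```python
def integral_image(grid):
    """Summed-area table with an extra zero row and column."""
    h, w = len(grid), len(grid[0])
    table = [[0.0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        for x in range(w):
            table[y + 1][x + 1] = (grid[y][x] + table[y][x + 1]
                                   + table[y + 1][x] - table[y][x])
    return table


def box_sum(table, top, left, size):
    """Sum of the size x size square whose top-left corner is (top, left)."""
    bottom, right = top + size, left + size
    return (table[bottom][right] - table[top][right]
            - table[bottom][left] + table[top][left])


def best_square_crop(saliency):
    """Return (top, left, size) of the full-height or full-width square
    with the largest total saliency."""
    h, w = len(saliency), len(saliency[0])
    size = min(h, w)
    table = integral_image(saliency)
    candidates = [(top, left, size)
                  for top in range(h - size + 1)
                  for left in range(w - size + 1)]
    return max(candidates, key=lambda c: box_sum(table, c[0], c[1], size))


# Hypothetical 2 x 5 saliency map with the salient region right of centre.
saliency = [
    [0.0, 0.1, 0.9, 0.8, 0.0],
    [0.0, 0.2, 0.7, 0.9, 0.1],
]
crop = best_square_crop(saliency)
```

With the table built once, each candidate square costs four lookups, so sliding the crop along the long axis is cheap even on full-resolution saliency maps.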
GoogLeNet
Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Drago Anguelov, Dumitru Erhan, Andrew Rabinovich
We explore an improved convolutional neural network architecture which combines the multi-scale idea with intuitions gained from the Hebbian principle. Additional dimension reduction layers based on embedding learning intuition allow us to increase both the depth and the width of the network significantly without incurring significant computational overhead. Combining these ideas allows for increasing the number of parameters in convolutional layers significantly while cutting the total number of parameters and resulting in improved generalization. Various incarnations of this architecture are trained for and applied at various scales and the resulting scores are averaged for each image.
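A small arithmetic sketch shows why a dimension reduction layer can keep a wide layer affordable. It assumes the reduction is a 1x1 convolution placed before a larger convolution; the channel counts are hypothetical and chosen only for illustration.

```python
def conv_weights(kernel, in_channels, out_channels):
    """Number of weights in a kernel x kernel convolution (biases ignored)."""
    return kernel * kernel * in_channels * out_channels


# Hypothetical channel counts, chosen only for illustration.
direct = conv_weights(5, 256, 64)                              # 5x5 on all 256 channels
reduced = conv_weights(1, 256, 32) + conv_weights(5, 32, 64)   # reduce to 32 first
```

In this example the direct 5x5 layer needs 409600 weights, while the reduced variant needs 59392, roughly a sevenfold saving for the same number of output filters.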
lffall
Feng Liu, Southeast University, China
This track is just for testing some off-the-shelf algorithms to provide a baseline for our subsequent research. In particular, we want to compare the results of different algorithms that produce region proposals, and to find out which factor most influences the subsequent classification.
DET entry 1 is our reproduction of the RCNN[1] algorithm trained on val + train1k set, whose region proposals are provided by selective search[2].
[1] Girshick, Ross, et al. "Rich feature hierarchies for accurate object detection and semantic segmentation." arXiv preprint arXiv:1311.2524 (2013).
[2] Van de Sande, Koen EA, et al. "Segmentation as selective search for object recognition." Computer Vision (ICCV), 2011 IEEE International Conference on. IEEE, 2011.
libccv
Liu Liu, libccv.org
Open-source implementation of MattNet (Visualizing and Understanding Convolutional Networks, Matthew D. Zeiler and Rob Fergus) trained with 1 convnet, detailed in: http://libccv.org/doc/doc-convnet/
MIL
Senthil Purushwalkam (The Univ. of Tokyo [intern] and IIT Guwahati)
Yuichiro Tsuchiya (The Univ. of Tokyo)
Atsushi Kanehira (The Univ. of Tokyo)
Asako Kanezaki (The Univ. of Tokyo)
Tatsuya Harada (The Univ. of Tokyo)
Classification-Localisation Task

We combine two models: one based on Fisher vectors extracted from two feature descriptors, and the other using a special classifier trained on CNN features extracted using selective search boxes.
For the Fisher-based model [1], Fisher vectors were extracted using local feature descriptors. Linear classifiers were trained on these Fisher vectors using the averaged passive-aggressive algorithm.
For the CNN-based model, CNN features were extracted on selective search windows. The classifier was trained using [2], which trains a multiclass classifier by creating 'negative classes' for each class. This optimises the separation between positive and negative features while simultaneously optimising the separation between classes.


Detection Task:
We use RCNN [3] as the base detector. We train separate Fisher-based classifiers for each class using the Passive Aggressive algorithm. The scores from these classifiers for each image are collected and used for rescoring the detections.


[1] N. Gunji, T. Higuchi, K. Yasumoto, H. Muraoka, Y. Ushiku, T. Harada, and Y. Kuniyoshi. Scalable Multiclass Object Categorization with Fisher Based Features. ILSVRC2012, 2012.

[2] Asako Kanezaki, Sho Inaba, Yoshitaka Ushiku, Yuya Yamashita, Hiroshi Muraoka, Yasuo Kuniyoshi, and Tatsuya Harada. Hard Negative Classes for Multiple Object Detection. 2014 IEEE International Conference on Robotics and Automation (ICRA 2014), pp.3066-3073, 2014.

[3] Girshick, Ross, Jeff Donahue, Trevor Darrell, and Jitendra Malik. "Rich feature hierarchies for accurate object detection and semantic segmentation." arXiv preprint arXiv:1311.2524 (2013).
MPG_UT
Riku Togashi (The University of Tokyo)
Keita Iwamoto (The University of Tokyo)
Tomoaki Iwase (The University of Tokyo)
Hideki Nakayama (The University of Tokyo)
In this challenge, we focused on integrating object region proposals obtained from different methods to use as the inputs for the RCNN system [1]. Namely, we used objectness (OB) [2], selective search (SS) [3], and bounding box transfer (TR) [4]. We used the public code for RCNN, OB and SS (bundled with RCNN). To implement TR, we extracted 4096-dimensional global CNN features with Caffe [5] and retrieved the nearest training samples in terms of L2 distance.
We computed 500 to 1000 windows for each object region proposal method and then pooled them together for RCNN. Using the pre-trained CNN and SVM models provided with the RCNN software, we computed scores for each proposal and ran non-maxima suppression (without distinguishing proposal methods) to determine the final predictions. Unlike the original RCNN paper, we did not perform bounding box regression (refinement).
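Pooling proposals from several methods and suppressing duplicates without regard to their source can be sketched as greedy non-maxima suppression. The IoU threshold of 0.3, the tuple layout and the toy detections are assumptions for illustration; in practice the suppression would be run separately for each object class.

```python
def iou(a, b):
    """Intersection over union of two (x0, y0, x1, y1) boxes."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union if union > 0 else 0.0


def non_max_suppression(detections, threshold=0.3):
    """Greedy NMS over (box, score, source) tuples; `source` is kept only
    for bookkeeping and plays no part in the suppression."""
    kept = []
    for det in sorted(detections, key=lambda d: d[1], reverse=True):
        if all(iou(det[0], k[0]) <= threshold for k in kept):
            kept.append(det)
    return kept


# Hypothetical scored proposals for one class, pooled from three methods.
pooled = [
    ((10, 10, 110, 110), 0.90, "SS"),
    ((12, 8, 112, 108), 0.85, "TR"),    # near-duplicate of the first box
    ((200, 50, 260, 120), 0.60, "OB"),
]
final = non_max_suppression(pooled)
```

Because the source label is ignored during suppression, a high-scoring box from one method can remove an overlapping box proposed by another.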

We observed that combining different object proposal methods worked better than just computing more proposals with one method. In particular, the TR method greatly improved performance over the original RCNN (based on SS), probably because TR can implicitly utilize global dataset statistics and is conceptually very different from OB and SS.


[1] Ross Girshick, Jeff Donahue, Trevor Darrell, and Jitendra Malik, Rich feature hierarchies for accurate object detection and semantic segmentation, In Proc. IEEE CVPR, 2014.

[2] Bogdan Alexe, Thomas Deselaers, and Vittorio Ferrari , Measuring the objectness of image windows, IEEE Trans. PAMI, vol. 34, no. 11, pp. 2189-2202, 2012.

[3] Jasper R. R. Uijlings, Koen E. A. van de Sande, Theo Gevers, and Arnold W. M. Smeulders, Selective Search for Object Recognition, International Journal of Computer Vision, Volume 104 (2), page 154-171, 2013.

[4] Jose A. Rodriguez-Serrano and Diane Larlus, Predicting an Object Location using a Global Image Representation, In Proc. IEEE ICCV, 2013.

[5] Yangqing Jia, Caffe: An Open Source Convolutional Architecture for Fast Feature Embedding, 2013.
MSRA Visual Computing
Kaiming He (Microsoft Research)
Xiangyu Zhang (Xi'an Jiaotong University)
Shaoqing Ren (University of Science and Technology of China)
Jian Sun (Microsoft Research)
Our CLS and DET methods are both based on the SPP-net in our ECCV 2014 paper “Spatial Pyramid Pooling in Deep Convolutional Networks for Visual Recognition”. SPP (SPM) is a flexible solution for handling image scales/sizes, and is also robust to deformations. The usage of the SPP layer is independent of the CNN designs, and we show that SPP improves the classification accuracy of various CNNs, regardless of the network depth, width, strides, and other designs.

The SPP-net is also a fast and accurate solution to object detection. We compute the convolutional feature maps from the images only once, and use SPP to pool features from arbitrary proposal windows for training SVM detectors. Our method is tens of times faster than R-CNN. Our network is pre-trained only using the DET-200 data (without outside data such as CLS-1000). A few strategies are proposed to improve the pre-training, driven by the different statistical properties of the DET-200 set.
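The idea of pooling a fixed-length vector from an arbitrary window of a shared feature map can be sketched as below. It assumes a single-channel feature map stored as a 2-D list, a window given in feature-map coordinates, and a two-level pyramid; the bin boundaries use a floor/ceil rule chosen for this sketch, and the function name is hypothetical.

```python
import math


def spp_pool(feature_map, window, levels=(1, 2)):
    """Max-pool one channel of a feature map inside `window` into a
    fixed-length vector.

    feature_map -- 2-D list (rows of floats) for a single channel
    window      -- (top, left, bottom, right), bottom/right exclusive
    levels      -- pyramid levels; level n contributes n * n bins
    """
    top, left, bottom, right = window
    height, width = bottom - top, right - left
    pooled = []
    for n in levels:
        for i in range(n):
            r0 = top + math.floor(i * height / n)
            r1 = top + math.ceil((i + 1) * height / n)
            for j in range(n):
                c0 = left + math.floor(j * width / n)
                c1 = left + math.ceil((j + 1) * width / n)
                pooled.append(max(feature_map[r][c]
                                  for r in range(r0, r1)
                                  for c in range(c0, c1)))
    return pooled


fmap = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
whole = spp_pool(fmap, (0, 0, 4, 4))   # 4 x 4 window
part = spp_pool(fmap, (1, 0, 4, 3))    # 3 x 3 window, same output length
```

Both windows produce five values (1 + 4 bins) even though their sizes differ, which is what allows the convolutional feature map to be computed once and reused for every proposal.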

The algorithm details have been described in our ECCV paper. An extended technical report will be updated. The code will be released.
NUS
Jian DONG(1), Yunchao WEI(1), Min LIN(1), Qiang CHEN(2), Wei XIA(1), Shuicheng YAN(1)

(1) National University of Singapore
(2) IBM Research, Australia
There are four major components for improving detection performance:

Network In Network (NIN) [Key Contribution]:
We trained an NIN, a special modification of CNN [1], with 14 parameterized layers. NIN uses a shared multilayer perceptron as the convolution kernel to convolve the underlying input; the resulting structure is equivalent to adding cascaded cross channel parametric (CCCP) pooling on top of the convolutional layer. Adding the CCCP layer significantly improves performance compared to vanilla convolution.
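A CCCP layer can be read as the same small weight matrix applied across channels at every pixel. The sketch below assumes ReLU activations and a plain nested-list layout; the function name, weights and inputs are hypothetical.

```python
def cccp_layer(feature_maps, weights, biases):
    """Cross-channel parametric pooling: at every pixel, mix the input
    channels with one shared weight matrix, then apply ReLU.

    feature_maps -- list of C_in channels, each an H x W list of floats
    weights      -- C_out rows of C_in weights (shared by all pixels)
    biases       -- C_out floats
    """
    height, width = len(feature_maps[0]), len(feature_maps[0][0])
    out = []
    for row_w, b in zip(weights, biases):
        out.append([
            [max(0.0, b + sum(w * ch[y][x] for w, ch in zip(row_w, feature_maps)))
             for x in range(width)]
            for y in range(height)
        ])
    return out


# Two 1 x 2 input channels standing in for the output of a convolution.
conv_out = [[[1.0, 2.0]], [[3.0, -1.0]]]
hidden = cccp_layer(conv_out, weights=[[1.0, 1.0], [1.0, -1.0]], biases=[0.0, 0.0])
mixed = cccp_layer(hidden, weights=[[0.5, 0.5]], biases=[0.0])
```

Stacking two such layers on a convolution output gives a tiny multilayer perceptron that is shared by all spatial positions, which is the cascaded structure the description refers to.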

Augmented training and testing sample:
This improvement was first described by Andrew Howard [Andrew 2014]. Instead of resizing and cropping the image to 256x256, the image is proportionally resized to 256xN (or Nx256) so that the short edge is 256. Subcrops of 224x224 are then randomly extracted for training.
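The resize-then-crop arithmetic can be sketched without any image library. The function names, the rounding of the long edge and the uniform sampling of crop positions are assumptions for illustration.

```python
import random


def resize_short_edge(width, height, target=256):
    """New (width, height) with the short edge equal to `target` and the
    aspect ratio preserved (long edge rounded to the nearest pixel)."""
    scale = target / min(width, height)
    if width <= height:
        return target, round(height * scale)
    return round(width * scale), target


def random_crop_box(width, height, crop=224, rng=random):
    """Crop box (left, top, right, bottom) sampled uniformly from all
    positions that fit inside the image."""
    left = rng.randint(0, width - crop)
    top = rng.randint(0, height - crop)
    return left, top, left + crop, top + crop


new_w, new_h = resize_short_edge(500, 375)   # landscape image
box = random_crop_box(new_w, new_h, rng=random.Random(0))
```

Keeping the full long edge means random crops can cover image regions that a fixed central 256x256 crop would discard.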

Traditional framework with SVM:
Traditional classification framework can provide complementary information, such as scene-level information, to CNN network. Hence, we integrate the outputs from the traditional framework (based on our PASCAL VOC2012 winning solutions, with the new extension of high-order parametric coding in which the first and second order parameters of the adapted GMM for each instance are both considered) to further improve the performance.

Kernel regression for rescoring:
Finally, we employ a non-parametric rectification method to correct the outputs from multiple models and obtain more accurate predictions. For each sample in the training and validation sets, we have a pair consisting of the outputs from the multiple models and the ground-truth label. For a test sample, we use a regularized kernel regression method to determine the affinities between the test sample and its auto-selected training/validation samples; the affinities are then used to fuse the ground-truth labels of these selected samples into a rectified prediction.
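The rescoring idea can be illustrated with a simplified stand-in. The sketch below swaps the regularized kernel regression and automatic sample selection for plain Gaussian-kernel (Nadaraya-Watson) weighting over a fixed number of nearest neighbours, so it shows only the general shape of the computation; the function name, bandwidth and neighbour count are hypothetical.

```python
import math


def rectify(test_output, train_outputs, train_labels, num_classes,
            neighbours=3, bandwidth=1.0):
    """Nadaraya-Watson style rectification (a simplified stand-in).

    test_output   -- concatenated multi-model scores for the test sample
    train_outputs -- the same kind of vector for each held-out sample
    train_labels  -- ground-truth class index for each held-out sample
    """
    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    # Select the closest held-out samples, then weight them by affinity.
    nearest = sorted(range(len(train_outputs)),
                     key=lambda i: sq_dist(test_output, train_outputs[i]))[:neighbours]
    affinities = [math.exp(-sq_dist(test_output, train_outputs[i])
                           / (2 * bandwidth ** 2))
                  for i in nearest]
    total = sum(affinities)
    rectified = [0.0] * num_classes
    for i, a in zip(nearest, affinities):
        rectified[train_labels[i]] += a / total
    return rectified


# Hypothetical two-class example with three held-out samples.
held_out = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
labels = [0, 0, 1]
rescored = rectify([0.85, 0.15], held_out, labels, num_classes=2, neighbours=2)
```

The rectified prediction is a weighted vote over ground-truth labels, so it can correct systematic errors that all of the fused models share on similar samples.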

Detection (Task 1) ------
The basic method is based on Ross Girshick's RCNN framework. We employ Network in Network as the feature extractor to improve the model discriminative capability. Features from multiple NINs are concatenated for both model training and bounding box regression. Raw detection scores are calculated based on the features from the refined bounding boxes.
To integrate the global context information beyond the information within the target bounding box, we concatenate all the raw detection scores and then combine them with the outputs from the traditional classification framework by context refinement [2]. Finally, the refined detection results are further updated through the adaptive kernel regression.

[1] Min Lin, Qiang Chen, Shuicheng Yan. Network In Network. In ICLR 2014.
[2] Qiang Chen, Zheng Song, Jian Dong, Zhongyang Huang, Yang Hua, Shuicheng Yan. Contextualizing Object Detection and Classification. In TPAMI 2014.
NUS-BST
Min Lin(1), Jian Dong(1), Hanjiang Lai(1), Junjun Xiong(2), Shuicheng Yan(1)

(1) National University of Singapore
(2) Beijing Samsung Telecom R&D Center
This submission is based on our recent ICLR’14 work called “Network in Network”, and there are four major components for the whole solution:

Network In Network (NIN) [key contribution]:
We trained an NIN, a special modification of the CNN [Lin et al. 2014], with 14 parameterized layers. NIN uses a shared multilayer perceptron as the convolution kernel to convolve the underlying input; the resulting structure is equivalent to adding cascaded cross-channel parametric (CCCP) pooling on top of a convolutional layer. Adding the CCCP layer significantly improves performance compared to vanilla convolution.

Augmented training and testing sample:
This improvement was first described by Andrew Howard [Howard 2014]. Instead of resizing and cropping the image to 256x256, the image is resized proportionally to 256xN (or Nx256) so that the short edge is 256. Subcrops of 224x224 are then randomly extracted for training. During testing, 3 views of 256x256 are extracted, and each view goes through the 10-view testing described by [Krizhevsky et al. 2012].
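
The test-time geometry above can be sketched as follows. The sketch only computes crop boxes; it assumes the 3 views are evenly spaced along the long edge (the entry does not say where they are placed) and takes the 10 views to be the four corner crops plus the centre crop with their horizontal flips. All function names are illustrative.

```python
def resized_size(width, height, short=256):
    """Scale so the shorter edge equals `short`, keeping the aspect ratio."""
    scale = short / min(width, height)
    return round(width * scale), round(height * scale)

def square_views(width, height, n_views=3):
    """Boxes (left, top, right, bottom) of square views along the long edge.

    Assumption: views are evenly spaced from one end of the long edge to the other.
    """
    side = min(width, height)
    slack = max(width, height) - side
    boxes = []
    for i in range(n_views):
        offset = round(slack * i / (n_views - 1)) if n_views > 1 else slack // 2
        if width >= height:
            boxes.append((offset, 0, offset + side, side))
        else:
            boxes.append((0, offset, side, offset + side))
    return boxes

def ten_crops(view, crop=224):
    """Four corner crops plus the centre crop, each with a horizontal-flip flag."""
    left, top, right, bottom = view
    cx = left + (right - left - crop) // 2
    cy = top + (bottom - top - crop) // 2
    origins = [(left, top), (right - crop, top), (left, bottom - crop),
               (right - crop, bottom - crop), (cx, cy)]
    return [(x, y, x + crop, y + crop, flip)
            for x, y in origins for flip in (False, True)]

w, h = resized_size(500, 375)   # a 500x375 input image
views = square_views(w, h)      # three 256x256 views
crops = [c for v in views for c in ten_crops(v)]  # 3 views x 10 crops each
```

Under these assumptions each test image yields 30 crops, whose predictions would then be averaged.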

Traditional features with SVM:
A traditional classification framework can provide complementary information, such as scene-level information, to the NIN. Hence, we integrate the outputs from the traditional framework (based on our PASCAL VOC2012 winning solutions, with a new extension of high-order parametric coding in which both the first- and second-order parameters of the adapted GMM for each instance are considered) to further improve performance.

Kernel regression for fusion of results:
Finally, we employ a non-parametric rectification method to correct the outputs from multiple models and obtain more accurate predictions. For each sample in the training and validation sets, we have a pair consisting of the outputs from the multiple models and the ground-truth label. For a test sample, we use a regularized kernel regression method to determine the affinities between the test sample and its auto-selected training/validation samples; the affinities are then used to fuse the ground-truth labels of these selected samples into a rectified prediction.

Min Lin, Qiang Chen, and Shuicheng Yan. "Network In Network." International Conference on Learning Representations. 2014.

Howard, Andrew G. "Some Improvements on Deep Convolutional Neural Network Based Image Classification." International Conference on Learning Representations. 2014.

Krizhevsky, Alex, Ilya Sutskever, and Geoffrey E. Hinton. "Imagenet classification with deep convolutional neural networks." Advances in neural information processing systems. 2012.
ORANGE-BUPT
Hongliang BAI, Orange Labs Beijing
Yinan LIU, Orange Labs Beijing
Bo LIU, BUPT, CHINA
Yanchao FENG, BUPT, CHINA
Kun TAO, Orange Labs Beijing
Yuan DONG, Orange Labs Beijing
This is the second time we have participated in ILSVRC. This year we submitted the maximum of ten runs in the DET and LOC tasks. In DET, inspired by Ross Girshick's R-CNN method, we detect 200 classes in test images using selective search, CNN models pretrained on the training set of the LOC task, fine-tuning on the detection training set, neural-network-based classification (201 classes including background), and bounding box regression. On the validation dataset we get 0.272 mAP. LOC is carried out in three steps: (1) we train seven classification models by deep learning with different network structures and parameters, and test with data augmentation (crop, flip, and scale); (2) test images are segmented into ~2000 regions by the selective search algorithm, and the regions are classified by the above classifiers into one of 1000 classes; (3) the regions with the highest-probability classes generated by the classification model are selected as the final output. On the classification validation set, the top-1 and top-5 error rates are 0.3680 and 0.1526, compared with last year's 0.25194. For the localization task, the best performance is about 0.45 on the validation dataset.
PassBy
Lin SUN (LENOVO/HKUST)
Zhanghui Kuang(LENOVO)
Cong Zhao(LENOVO)
Kui Jia (University of Macao)
Oscar C.Au (HKUST)
Because time was limited, we did not obtain a strong CNN baseline; ours reaches about 80% on the validation dataset. However, we want to show that traditional computer vision methods can boost performance even when the tools at hand are weak. In this submission, we propose a saliency-based method to better represent images when a single CNN fails. Averaging and a novel weighted-averaging method are applied to obtain the final prediction. We believe our method would perform better given enough time to train and tune.

Reference:
1. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition, ICML, 2014
2. ImageNet Classification with Deep Convolutional Neural Networks, NIPS, 2012
SCUT_GLH
Guo Lihua (South China University of Technology)
Liao Qijun (South China University of Technology)
Ma Qianli (South China University of Technology)
Lin Junbin (South China University of Technology)
Deep neural networks are much more powerful than traditional shallow models, such as SVM and PCA, at automatically learning the complex relation between input and output. Currently, the most widely used network with better performance is the CNN. CNNs have been successfully applied to image classification, scene recognition, natural speech analysis, and other areas. Our method trains a CNN on the ImageNet training images. We calculate the average top-20 accuracy on the validation set and find that it is above 90%. Based on this, we first establish the semantic relations among all the labels. We then use the CNN to extract the top 20 candidate labels. Finally, we rerank the result based on the semantic relations among the candidate labels.
Southeast-CASIA
Feng Liu, School of Automation, Southeast University
Zifeng Wu, Institute of Automation, Chinese Academy of Sciences
Yongzhen Huang, Institute of Automation, Chinese Academy of Sciences
Our algorithm is composed of five components:
(1) Using the selective search algorithm to generate about 2400 proposals for every image.
(2) Training a two-category proposal classification model using CNN on dataset 1 to remove proposals that are more likely to come from the background. 700 proposals are preserved after this step.
(3) Training an initial 200-category image classification model using CNN on dataset 1.
(4) Fine-tuning the initial model using 700 proposals. We consider two strategies: with sample balance and without sample balance over categories in fine-tuning, and accordingly obtain two proposal representation models. The final proposal representation is the combination of these two models.
(5) Training 200 two-category proposal classification models using SVM, and using bounding box regression to obtain the final detection results.
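
Step (2) of the pipeline above can be sketched as a simple ranking filter. This assumes the 700 preserved proposals are the ones with the highest foreground score from the two-category model; the entry does not state whether a ranking or a threshold is used, and `filter_proposals` is a hypothetical name.

```python
def filter_proposals(proposals, foreground_scores, keep=700):
    """Keep the `keep` proposals with the highest foreground score.

    proposals:         list of boxes (any representation)
    foreground_scores: score of the two-category model for each proposal
    """
    ranked = sorted(zip(foreground_scores, proposals),
                    key=lambda pair: pair[0], reverse=True)
    return [box for _, box in ranked[:keep]]

# Synthetic stand-in for about 2400 selective search proposals per image.
proposals = [(i, i, i + 50, i + 50) for i in range(2400)]
scores = [((i * 37) % 101) / 100 for i in range(2400)]
kept = filter_proposals(proposals, scores)

# Tiny example that is easy to check by hand.
small = filter_proposals(["a", "b", "c"], [0.1, 0.9, 0.5], keep=2)
```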
SYSU_Vision
Liliang Zhang, Tianshui Chen, Shuye Zhang, Wanglan He, Liang Lin, Dengguang Pang, Lingbo Liu. Sun Yat-Sen University, China.
Solution 1:
Solution 1 employs a classification-localization framework. For classification, we train a one-thousand-class classification model based on the Alex network published at NIPS 2012. For localization, we first train a one-thousand-class localization model based on the Alex network. However, such a localization model is inclined to localize the salient region, which does not work well for ImageNet localization. We therefore fine-tune one thousand class-specific models, one for each class, from the pre-trained one-thousand-class localization model. Because training images for each class are scarce, over-fitting is severe. To reduce this problem, we design a similarity-sorted fine-tuning method. First, we choose one class and fine-tune the pre-trained one-thousand-class localization model on it to get a localization model for that class. Then we choose the class most similar to the previously chosen class and fine-tune on it, starting from the previously chosen class's localization model. In this way, the training images of similar classes are shared.
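
The ordering implied by similarity-sorted fine-tuning can be sketched as a greedy chain. The sketch assumes a precomputed class-similarity matrix (how similarity is measured is not stated in the entry) and only produces the order and the parent model for each class; `similarity_sorted_order` is a hypothetical helper.

```python
def similarity_sorted_order(similarity, start=0):
    """Greedy chain: after each class, visit the most similar unvisited class.

    similarity: square matrix, similarity[i][j] between classes i and j.
    Returns (class_index, parent_index) pairs; a parent of None means the
    shared one-thousand-class localization model is the starting point.
    """
    n = len(similarity)
    order = [(start, None)]
    visited = {start}
    current = start
    while len(visited) < n:
        nxt = max((j for j in range(n) if j not in visited),
                  key=lambda j: similarity[current][j])
        order.append((nxt, current))  # fine-tune class nxt from class current's model
        visited.add(nxt)
        current = nxt
    return order

# Toy similarity matrix for four classes.
sim = [
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.8],
    [0.2, 0.3, 1.0, 0.7],
    [0.1, 0.8, 0.7, 1.0],
]
chain = similarity_sorted_order(sim)
```

Each class-specific model starts from the model of its predecessor in the chain, so weights tuned on one class's images carry over to its most similar neighbour.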
Solution 2:
Solution 2 was inspired by the R-CNN framework. For each test image, we first used the classification model from solution 1 to get the top 5 class predictions. Second, we applied Selective Search to get candidate regions. Third, we fine-tuned another classification model, specific to classifying regions, from the classification model above, and used it to score each region. Fourth, we took the highest-scoring region for each of the top 5 class predictions to form the final result.
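
The last step of solution 2 can be sketched as follows, assuming the region classifier returns a per-class score dictionary for every candidate region. The data layout and the name `best_region_per_class` are illustrative only.

```python
def best_region_per_class(region_scores, top_classes):
    """Pick the highest-scoring region for each predicted class.

    region_scores: list of (box, {class_name: score}) pairs, one per region
    top_classes:   class names predicted by the image-level classifier
    """
    result = {}
    for cls in top_classes:
        # A class missing from a region's score dictionary counts as score 0.
        box, scores = max(region_scores, key=lambda r: r[1].get(cls, 0.0))
        result[cls] = (box, scores.get(cls, 0.0))
    return result

region_scores = [
    ((0, 0, 50, 50), {"cat": 0.2, "dog": 0.7}),
    ((10, 10, 90, 90), {"cat": 0.9, "dog": 0.1}),
]
picked = best_region_per_class(region_scores, ["cat", "dog"])
```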
Solution 3:
We compared the class-specific localization accuracy of solution 1 and solution 2 on the validation set, then chose the better solution for each class based on that accuracy. Generally speaking, solution 2 outperformed solution 1 when there were multiple objects in the image or the objects were relatively small.
Solution 4:
We simply averaged the results of solution 1 and solution 2 to form solution 4.
Trimps-Soushen
Jie Shao, Xiaoteng Zhang, JianYing Zhou, Jian Wang, Jian Chen, Yanfeng Shang, Wenfei Wang, Lin Mei, Chuanping Hu.
The Third Research Institute of the Ministry of Public Security, P.R. China.
Task 1: Detection
Our work is based on the R-CNN paper from CVPR 2014. We use another region selection method, called RP, from an ICCV 2013 paper; this method generates fewer regions without a significant reduction in precision. We use these new regions to train a new model with less space and time. Besides this, we try several combination methods. First, we combine the regions generated by selective search and RP on a single model. Second, we train R-CNN separately on selective search regions and on RP regions, then combine the results of the different models using NMS. In the training stage, we fine-tune the CNN model trained on ILSVRC2012 classification data with ILSVRC2014 detection data. We do not use any other outside data. We also try a simple method that uses our localization pipeline plus NMS for object detection.
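
Combining the detections of two models with NMS can be sketched as below. This is a generic greedy per-class NMS; the IoU threshold of 0.3 and the toy detections are assumptions, not values from the entry.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def nms(detections, threshold=0.3):
    """Greedy NMS over (box, score) pairs of one class; returns the kept pairs."""
    kept = []
    for box, score in sorted(detections, key=lambda d: d[1], reverse=True):
        # Keep a box only if it does not overlap a higher-scoring kept box too much.
        if all(iou(box, k[0]) <= threshold for k in kept):
            kept.append((box, score))
    return kept

# Detections of one class from the two separately trained models.
ss_dets = [((10, 10, 110, 110), 0.90), ((200, 50, 300, 150), 0.60)]
rp_dets = [((12, 8, 108, 112), 0.80), ((400, 400, 450, 450), 0.70)]
merged = nms(ss_dets + rp_dets)
```

In this example the RP detection that nearly coincides with the top selective search detection is suppressed, and the three non-overlapping detections survive.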

Task 2: Classification and localization
Our model is based on a large deep convolutional neural network. We use several methods to improve the performance.
1. Data augmentation. Some of our models are trained on the original data plus about 396000 external images from the ILSVRC2010 and ILSVRC2011 training data. All training data belong to the original 1000 object categories. Other data augmentation methods include random crops from Nx256 resized images, contrast and color jittering, and Gaussian noise. We use OpenCV to resize images with cubic interpolation, which we found very useful.
2. Model details. The biggest model we trained has about 120M parameters. To encourage model diversity, we use different normalization and pooling methods, with partly randomly selected external data. We also train two kinds of complementary models: a supervised CNN pre-train model and a varying-resolution model (normal resolution --> high resolution (fine-tuning) --> normal resolution (fine-tuning)). Both of these models have lower accuracy but play a very important role in model voting.
3. Testing. We make predictions at multiple scales, each scale with 7 cropped images and their horizontal flips.

For the localization task, we use a simple pipeline. First, we use RP to extract region proposals; regions with IOU greater than 0.8 are used as positive samples, and regions with IOU between 0.2 and 0.3 are used as background (the localization data are not fully annotated). Second, we fine-tune a classification model with these regions. Finally, for a test image, the extracted region proposals are fed to the fine-tuned model to get region confidences and the corresponding coordinates. Based on the result of the classification task, we select the top-k regions and average their coordinates as the output.
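
The sample labelling and the final coordinate averaging can be sketched as follows. The thresholds come from the description above; the single ground-truth box per region, the "ignored" label for all other overlaps, and the value k=2 are assumptions made for illustration.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def label_region(region, ground_truth, pos_thresh=0.8, bg_range=(0.2, 0.3)):
    """Label a proposal as positive, background, or ignored by its IoU."""
    overlap = iou(region, ground_truth)
    if overlap > pos_thresh:
        return "positive"
    if bg_range[0] <= overlap <= bg_range[1]:
        return "background"
    return "ignored"  # assumption: everything else is left out of training

def average_top_k(regions, k=2):
    """Average the coordinates of the k most confident (box, confidence) pairs."""
    top = sorted(regions, key=lambda r: r[1], reverse=True)[:k]
    return tuple(sum(box[i] for box, _ in top) / len(top) for i in range(4))

gt = (0, 0, 100, 100)
labels = [label_region(r, gt) for r in [(0, 0, 100, 90), (0, 0, 50, 50), (0, 0, 70, 70)]]
output_box = average_top_k([((10, 10, 100, 100), 0.9),
                            ((20, 20, 120, 110), 0.8),
                            ((0, 0, 50, 50), 0.1)])
```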


[1] Rich feature hierarchies for accurate object detection and semantic segmentation. Girshick, Ross and Donahue, Jeff and Darrell, Trevor and Malik, Jitendra. Computer Vision and Pattern Recognition 2014.
[2] Prime Object Proposals with Randomized Prim's Algorithm, Santiago Manen, Matthieu Guillaumin, Luc Van Gool, International Conference on Computer Vision (ICCV) 2013.
[3] Some Improvements on Deep Convolutional Neural Network Based Image Classification. Andrew G. Howard. http://arxiv.org/abs/1312.5402
TTIC_ECP - EpitomicVision
George Papandreou, Toyota Technological Institute at Chicago (TTIC)
Iasonas Kokkinos, Ecole Centrale Paris (ECP)
These entries showcase deep epitomic neural nets [1]. An epitomic convolution layer replaces a pair of consecutive convolution and max-pooling layers found in standard deep convolutional neural networks (CNNs). The model uses mini-epitomes [2] in place of filters and computes responses invariant to small translations by epitomic search instead of max-pooling over image positions. Epitomic search returns the maximum response of each image patch with all patches extracted from a larger epitome [3]. The model parameters (mini-epitome filters) are learned by error backpropagation in a supervised fashion, similar to standard CNNs [4, 5]. We have submitted the following entries:

EpitomicVision1 (vanilla epitomic NN):

This entry has been obtained with the EPITOMIC-NORM variant of the epitomic model described in detail in [1]. The only difference from [1] is that the current network has more hidden units in layers 1 to 6. A single large net has been used (no averaging over different nets). No attempt has been made at localization (we report the whole image as the bounding box prediction).

EpitomicVision2 (+ scale and position search):

This model also searches over scale and position for the best match. This is implemented by building a mosaic with multiple versions of the image at different scales [6, 7], running the epitomic classifier in a convolutional fashion similar to [5], and selecting the position on the mosaic that gives the maximum response. The parameters of the model were initialized from a model similar to EpitomicVision1 and were fine-tuned. A single large net has been used (no averaging over different nets). No attempt has been made at localization (we report the whole image as the bounding box prediction).

EpitomicVision3 (fusion of EpitomicVision1 + EpitomicVision2):

The class probabilities for this model are weighted averages of the EpitomicVision1 (w=0.4) and EpitomicVision2 (w=0.6) models. No attempt has been made at localization (we report the whole image as the bounding box prediction).
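
The fusion rule is a plain weighted average of the two probability vectors. A minimal sketch with the stated weights follows; the three-class probability vectors are made up for illustration.

```python
def fuse(p1, p2, w1=0.4, w2=0.6):
    """Weighted average of two class-probability vectors."""
    return [w1 * a + w2 * b for a, b in zip(p1, p2)]

p1 = [0.7, 0.2, 0.1]  # hypothetical EpitomicVision1 output
p2 = [0.2, 0.5, 0.3]  # hypothetical EpitomicVision2 output
fused = fuse(p1, p2)
top = max(range(len(fused)), key=fused.__getitem__)  # index of the predicted class
```

Since the weights sum to one, the fused vector is again a probability distribution.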

EpitomicVision4 (EpitomicVision2 with fixed mapping of the best matching mosaic position to bounding box):

This is a simple attempt to equip the EpitomicVision2 predictions with localization estimates.


All models have been trained using the supplied CLOC training set alone.

Acknowledgments:

We implemented the methods by extending the excellent Caffe software framework [8]. We gratefully acknowledge the support of NVIDIA Corporation with the donation of GPUs used for this research.

References:

[1] G. Papandreou, "Deep Epitomic Convolutional Neural Networks," arXiv:1406.2732, June 2014.

[2] G. Papandreou, L.-C. Chen, and A. Yuille, "Modeling image patches with a generic dictionary of mini-epitomes," in Proc. CVPR 2014.

[3] N. Jojic, B. Frey, and A. Kannan, "Epitomic analysis of appearance and shape", in Proc. ICCV 2003.

[4] A. Krizhevsky, I. Sutskever, and G. Hinton, "ImageNet classification with deep convolutional neural networks," in Proc. NIPS 2012.

[5] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun, "Overfeat: Integrated recognition, localization and detection using convolutional networks," in Proc. ICLR 2014.

[6] C. Dubout and F. Fleuret, "Exact acceleration of linear object detectors," in Proc. ECCV 2012.

[7] F. Iandola, M. Moskewicz, S. Karayev, R. Girshick, T. Darrell, K. Keutzer, "DenseNet: Implementing efficient ConvNet descriptor pyramids," arXiv:1404.1869, April 2014.

[8] Y. Jia, "Caffe: An open source convolutional architecture for fast feature embedding," 2013.
UI
Fatemeh Shafizadegan, MSc student of Artificial Intelligence, University of Isfahan.
Elham Shabaninia, PhD candidate of Artificial Intelligence, University of Isfahan.
Our model is based on Spatial Pyramid Matching (SPM), similar to [1]. It is an extension of SPM that uses sparse codes of SIFT features and allows a linear kernel. SIFT features are robust to rotation, scale, affine transformations, and intensity changes. This approach reduces the complexity of SVM training to O(n), while the complexity of the testing phase does not change. The approach uses max spatial pooling, which is robust to local spatial translations. The resulting image representation works well with linear SVM classifiers.
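
Max spatial pooling over a pyramid can be sketched as below. The sketch assumes descriptor positions normalised to [0, 1), pyramid levels of 1x1, 2x2, and 4x4 cells, and pooling of absolute code values; the sparse coding step itself is omitted and `spm_max_pool` is an illustrative name.

```python
def spm_max_pool(points, codes, levels=(1, 2, 4)):
    """Concatenate per-cell max-pooled sparse codes over a spatial pyramid.

    points: list of (x, y) descriptor positions, each coordinate in [0, 1)
    codes:  list of sparse code vectors, aligned with points
    """
    dim = len(codes[0])
    feature = []
    for cells in levels:
        # One pooled vector per cell of the cells x cells grid.
        pooled = [[0.0] * dim for _ in range(cells * cells)]
        for (x, y), code in zip(points, codes):
            cell = pooled[int(y * cells) * cells + int(x * cells)]
            for d, value in enumerate(code):
                cell[d] = max(cell[d], abs(value))
        for cell in pooled:
            feature.extend(cell)
    return feature

points = [(0.1, 0.1), (0.9, 0.9)]
codes = [[0.0, 0.5], [-0.8, 0.0]]
feature = spm_max_pool(points, codes)  # length (1 + 4 + 16) * 2
```

The pooled vector is what a linear SVM would be trained on; shifting a descriptor within its cell leaves the feature unchanged, which is the translation robustness mentioned above.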



[1] Linear Spatial Pyramid Matching Using Sparse Coding for Image Classification, J.Yang, K.Yu, Y.Gong, T.Huang, CVPR 2009.


UvA-Euvision
Koen van de Sande
Daniel Fontijne
Cees Snoek
Harro Stokman
Arnold Smeulders

University of Amsterdam and Euvision Technologies
Task 1 Detection
================
Our first run is based on deep learning in combination with selective search. It is trained using some additional data from ImageNet.
Our second run is based on deep learning in combination with selective search. It is trained on just the provided data.
Our third run is Fisher with FLAIR. It is the equivalent of our top entry in 2013 with improved training procedure. See Van de Sande et al., "Fisher and VLAD with FLAIR", CVPR 2014 for algorithm details. It is trained on just the provided data. This run has a speed advantage over the previous two runs.

Task 2 CLS+LOC
==============
We participate in just the classification task using deep learning. No outside data is used.
VGG
Karen Simonyan, University of Oxford
Andrew Zisserman, University of Oxford
In this submission we explore the effect of the convolutional network (ConvNet) depth on its accuracy. We have used three ConvNet architectures with the following weight layer configurations:
1) ten 3x3 convolutional layers, three 1x1 convolutional layers, and three fully-connected layers - 16 weight layers in total;
2) thirteen 3x3 convolutional layers and three fully-connected layers - 16 weight layers in total;
3) sixteen 3x3 convolutional layers and three fully-connected layers - 19 weight layers in total.
All convolutional layers have stride 1 and are followed by ReLU non-linearity. The fully-connected layers are regularised with dropout. The networks were trained on fixed-size image crops, but at test time they were applied densely over the whole uncropped images.
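
As a quick check of the layer counts quoted for the three configurations, the following sketch tallies the weight layers. The dictionary keys and configuration names are labels chosen here, not identifiers from the submission.

```python
# Layer counts per configuration, as listed in the description above.
configs = {
    "config 1": {"conv3x3": 10, "conv1x1": 3, "fc": 3},
    "config 2": {"conv3x3": 13, "conv1x1": 0, "fc": 3},
    "config 3": {"conv3x3": 16, "conv1x1": 0, "fc": 3},
}

# Total number of weight layers (convolutional plus fully-connected).
weight_layers = {name: sum(layers.values()) for name, layers in configs.items()}
```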

For localisation, we used per-class bounding box regression similar to OverFeat, but over a smaller number of scales and without multiple max-pooling offsets.

Our implementation is derived from the Caffe toolbox, but contains a number of significant modifications, including parallel training on multiple GPUs installed in a single system. Training a single ConvNet on 4 NVIDIA Titan GPUs took from 2 to 3 weeks (depending on the ConvNet configuration).
Virginia Tech
Akrit Mohapatra, Neelima Chavali

Virginia Tech
An undergraduate summer research project by Akrit Mohapatra in collaboration with Neelima Chavali based on the RCNN paper (arXiv:1311.2524v4) (Ross B. Girshick, Jeff Donahue, Trevor Darrell, Jitendra Malik: Rich feature hierarchies for accurate object detection and semantic segmentation.) The algorithm and code from the paper were used and models were created by changing various hyper-parameters.
XYZ
Zhongwen Xu and Yi Yang, The University of Queensland
These submissions are trained with modified versions of cuda-convnet [1] and Caffe [2]. The basic structures follow ZFnet [3][6], with smaller kernels in the first convolutional layer. One exception is the Network in Network [4] net proposed by Min Lin from the National University of Singapore. That network takes only 50 megabytes and achieves good performance. Results from multiple models are fused in a simple way. To enrich the transformations, we apply the multiple scales, multiple views, and multiple transformations used by Andrew Howard last year [5].

[1] https://code.google.com/p/cuda-convnet
[2] Yangqing Jia, http://caffe.berkeleyvision.org/
[3] Matthew D Zeiler, Rob Fergus, Visualizing and Understanding Convolutional Networks
[4] Min Lin, Qiang Chen, Shuicheng Yan, Network In Network
[5] Andrew G. Howard, Some Improvements on Deep Convolutional Neural Network Based Image Classification
[6] K. Chatfield, K. Simonyan, A. Vedaldi, A. Zisserman, Return of the Devil in the Details: Delving Deep into Convolutional Nets