
Legend:
Yellow background = winner in this task according to this metric; authors are willing to reveal the method
White background = authors are willing to reveal the method
Grey background = authors chose not to reveal the method
Italics = authors requested that the entry not participate in the competition

Object detection (DET)

Task 1a: Object detection with provided training data

Ordered by number of categories won

Team name Entry description Number of object categories won mean AP
BDAT submission4 85 0.731392
BDAT submission3 65 0.732227
BDAT submission2 30 0.723712
DeepView(ETRI) Ensemble_A 10 0.593084
NUS-Qihoo_DPNs (DET) Ensemble of DPN models 9 0.656932
KAISTNIA_ETRI Ensemble Model5 1 0.61022
KAISTNIA_ETRI Ensemble Model4 0 0.609402
KAISTNIA_ETRI Ensemble Model2 0 0.608299
KAISTNIA_ETRI Ensemble Model1 0 0.608278
KAISTNIA_ETRI Ensemble Model3 0 0.60631
DeepView(ETRI) Single model A using ResNet for detection 0 0.587519
QINIU-ATLAB+CAS-SARI Ensemble 3 0 0.526601
QINIU-ATLAB+CAS-SARI Ensemble 4 0 0.523595
QINIU-ATLAB+CAS-SARI Ensemble 2 0 0.518426
QINIU-ATLAB+CAS-SARI Ensemble 5 0 0.51703
QINIU-ATLAB+CAS-SARI Ensemble 1 0 0.515529
FACEALL_BUPT faster-rcnn with soft-nms 0 0.499586
PaulForDream YOLOv2 608*608 0 0.494288
Sheep based on ssd_512 0 0.459223
Sheep according to YOLO and SSD 0 0.43453
FACEALL_BUPT single rfcn with class-aware model baseline 0 0.433957
FACEALL_BUPT Ensembles of three models 0 0.416489
Sheep based on ssd_300 0 0.401165
XBENTURY-THU Ensemble three models A 0 0.392799
Handsight 3 Yes-Net model with different input size, parameter and anchor boxes. Yes-Net get 39 FPS running on titan x pascal with 416 x 416 image. 0 0.32725
FACEALL_BUPT Ensembles of two models 0 0.288323
ReadSense use val dataset to train 0 0.282997
BUPT-PRIV resnext101-fpn single model 0 0.260561
BUPT-PRIV ensemble 4 models 0 0.254045
BUPT-PRIV ensemble 4 models 0 0.25175
XBENTURY-THU Ensemble three models B 0 0.243063
glee Submission1: baseline with frcnn 0 0.221268
YYF --- 0 0.20668
NUS-Qihoo_DPNs (DET) Training with val2 data --- 0.657634
NUS-Qihoo_DPNs (DET) Baseline ensemble of DPN models --- 0.645912
NUS-Qihoo_DPNs (DET) DPN models with class-wise weights --- 0.645023
Sheep based on YOLO --- 0.405691
Sheep based on vgg_16 --- 0.369627

Ordered by mean average precision

Team name Entry description mean AP Number of object categories won
BDAT submission3 0.732227 65
BDAT submission4 0.731392 85
BDAT submission2 0.723712 30
NUS-Qihoo_DPNs (DET) Training with val2 data 0.657634 ---
NUS-Qihoo_DPNs (DET) Ensemble of DPN models 0.656932 9
NUS-Qihoo_DPNs (DET) Baseline ensemble of DPN models 0.645912 ---
NUS-Qihoo_DPNs (DET) DPN models with class-wise weights 0.645023 ---
KAISTNIA_ETRI Ensemble Model5 0.61022 1
KAISTNIA_ETRI Ensemble Model4 0.609402 0
KAISTNIA_ETRI Ensemble Model2 0.608299 0
KAISTNIA_ETRI Ensemble Model1 0.608278 0
KAISTNIA_ETRI Ensemble Model3 0.60631 0
DeepView(ETRI) Ensemble_A 0.593084 10
DeepView(ETRI) Single model A using ResNet for detection 0.587519 0
QINIU-ATLAB+CAS-SARI Ensemble 3 0.526601 0
QINIU-ATLAB+CAS-SARI Ensemble 4 0.523595 0
QINIU-ATLAB+CAS-SARI Ensemble 2 0.518426 0
QINIU-ATLAB+CAS-SARI Ensemble 5 0.51703 0
QINIU-ATLAB+CAS-SARI Ensemble 1 0.515529 0
FACEALL_BUPT faster-rcnn with soft-nms 0.499586 0
PaulForDream YOLOv2 608*608 0.494288 0
Sheep based on ssd_512 0.459223 0
Sheep according to YOLO and SSD 0.43453 0
FACEALL_BUPT single rfcn with class-aware model baseline 0.433957 0
FACEALL_BUPT Ensembles of three models 0.416489 0
Sheep based on YOLO 0.405691 ---
Sheep based on ssd_300 0.401165 0
XBENTURY-THU Ensemble three models A 0.392799 0
Sheep based on vgg_16 0.369627 ---
Handsight 3 Yes-Net model with different input size, parameter and anchor boxes. Yes-Net get 39 FPS running on titan x pascal with 416 x 416 image. 0.32725 0
FACEALL_BUPT Ensembles of two models 0.288323 0
ReadSense use val dataset to train 0.282997 0
BUPT-PRIV resnext101-fpn single model 0.260561 0
BUPT-PRIV ensemble 4 models 0.254045 0
BUPT-PRIV ensemble 4 models 0.25175 0
XBENTURY-THU Ensemble three models B 0.243063 0
glee Submission1: baseline with frcnn 0.221268 0
YYF --- 0.20668 0

Task 1b: Object detection with additional training data

Ordered by number of categories won

Team name Entry description Description of outside data used Number of object categories won mean AP
BDAT submission5 refine part of training data annotation 128 0.731613
BDAT submission1 refine part of the training data annotation 58 0.725563
NUS-Qihoo_DPNs (DET) Ensemble of DPN models with extra data Self collected images 14 0.657609

Ordered by mean average precision

Team name Entry description Description of outside data used mean AP Number of object categories won
BDAT submission5 refine part of training data annotation 0.731613 128
BDAT submission1 refine part of the training data annotation 0.725563 58
NUS-Qihoo_DPNs (DET) Ensemble of DPN models with extra data Self collected images 0.657609 14

Object localization (LOC)

Task 2a: Classification+localization with provided training data

Ordered by localization error

Team name Entry description Localization error Classification error
NUS-Qihoo_DPNs (CLS-LOC) [E3] LOC:: Dual Path Networks + Basic Ensemble 0.062263 0.03413
Trimps-Soushen Result-3 0.064991 0.02481
Trimps-Soushen Result-2 0.06525 0.02481
Trimps-Soushen Result-4 0.065261 0.02481
Trimps-Soushen Result-5 0.065302 0.02481
Trimps-Soushen Result-1 0.067698 0.02481
BDAT provide_box 0.081392 0.03158
BDAT provide_class 0.086942 0.02962
NUS-Qihoo_DPNs (CLS-LOC) [E2] CLS:: Dual Path Networks + Basic Ensemble 0.088093 0.0274
NUS-Qihoo_DPNs (CLS-LOC) [E1] CLS:: Dual Path Networks + Basic Ensemble 0.088269 0.02744
BDAT provide_box_former 0.090593 0.03461
SIIT_KAIST-SKT ensemble 2 0.128924 0.03226
SIIT_KAIST-SKT a weighted geometric ensemble 0.128996 0.03442
SIIT_KAIST-SKT ensemble 1 0.129028 0.0345
FACEALL_BUPT Ensembles of 6 models for classification , single model for localization 0.223947 0.04574
FACEALL_BUPT Ensembles of 5 models for classification , single model for localization 0.227494 0.04918
FACEALL_BUPT Ensembles of 3 models for classification , single model for localization 0.227619 0.04574
FACEALL_BUPT single model for classification , single model for localization 0.239185 0.06486
FACEALL_BUPT Ensembles of 4 models for classification , single model for localization 0.239351 0.06486
MCPRL_BUPT For classification, we merge two models, the top-5 cls-error on validation is 0.0475. For localization, we use a single Faster RCNN model with VGG-16, the top-5 cls-loc error on validation is 0.2905. 0.289404 0.04752
MCPRL_BUPT Two models for classification, localization model is fixed. The top-5 cls-only error on validation is 0.0481. The top-5 cls-loc error on validation is 0.2907. 0.29012 0.04857
MCPRL_BUPT Two models for classification, one single model for localization, in which 1.5x context information and OHEM are adopted. 0.291562 0.04752
connected Inception + ResNET + SSD with our selective box algorithm3 0.310318 0.17374
connected Inception + ResNET + SSD with our selective box algorithm2 0.314498 0.17374
connected Inception + ResNET + SSD with our selective box algorithm1 0.323928 0.17374
WMW Ensemble C [No bounding box results] 0.590987 0.02251
WMW Ensemble E [No bounding box results] 0.591018 0.02258
WMW Ensemble D [No bounding box results] 0.591039 0.0227
WMW Ensemble B [No bounding box results] 0.59106 0.0227
WMW Ensemble A [No bounding box results] 0.591153 0.0227
MIL_UT Ensemble of 9 models (classification-only) 0.596164 0.03205
MIL_UT Ensemble of 10 models (classification-only) 0.596174 0.03228
zlm ten crops, no location 0.999803 0.99631
MPG_UT Ensemble of 7 models. Ten-crop inference. Without localization. 1.0 0.04324
MPG_UT Single Model of Partial Bilinear Pooling. Ten-crop inference. Without localization. 1.0 0.04498
MPG_UT Single Model of Branched Bilinear Pooling. Multi resolutions and crops test. Without localization. 1.0 0.99675
MPG_UT Ensemble of entry 3. Without localization. 1.0 0.99702
MPG_UT Ensemble of entry 2 and entry 4. Without localization. 1.0 0.99767

Ordered by classification error

Team name Entry description Classification error Localization error
WMW Ensemble C [No bounding box results] 0.02251 0.590987
WMW Ensemble E [No bounding box results] 0.02258 0.591018
WMW Ensemble A [No bounding box results] 0.0227 0.591153
WMW Ensemble D [No bounding box results] 0.0227 0.591039
WMW Ensemble B [No bounding box results] 0.0227 0.59106
Trimps-Soushen Result-1 0.02481 0.067698
Trimps-Soushen Result-2 0.02481 0.06525
Trimps-Soushen Result-3 0.02481 0.064991
Trimps-Soushen Result-4 0.02481 0.065261
Trimps-Soushen Result-5 0.02481 0.065302
NUS-Qihoo_DPNs (CLS-LOC) [E2] CLS:: Dual Path Networks + Basic Ensemble 0.0274 0.088093
NUS-Qihoo_DPNs (CLS-LOC) [E1] CLS:: Dual Path Networks + Basic Ensemble 0.02744 0.088269
BDAT provide_class 0.02962 0.086942
BDAT provide_box 0.03158 0.081392
MIL_UT Ensemble of 9 models (classification-only) 0.03205 0.596164
SIIT_KAIST-SKT ensemble 2 0.03226 0.128924
MIL_UT Ensemble of 10 models (classification-only) 0.03228 0.596174
NUS-Qihoo_DPNs (CLS-LOC) [E3] LOC:: Dual Path Networks + Basic Ensemble 0.03413 0.062263
SIIT_KAIST-SKT a weighted geometric ensemble 0.03442 0.128996
SIIT_KAIST-SKT ensemble 1 0.0345 0.129028
BDAT provide_box_former 0.03461 0.090593
MPG_UT Ensemble of 7 models. Ten-crop inference. Without localization. 0.04324 1.0
MPG_UT Single Model of Partial Bilinear Pooling. Ten-crop inference. Without localization. 0.04498 1.0
FACEALL_BUPT Ensembles of 3 models for classification , single model for localization 0.04574 0.227619
FACEALL_BUPT Ensembles of 6 models for classification , single model for localization 0.04574 0.223947
MCPRL_BUPT For classification, we merge two models, the top-5 cls-error on validation is 0.0475. For localization, we use a single Faster RCNN model with VGG-16, the top-5 cls-loc error on validation is 0.2905. 0.04752 0.289404
MCPRL_BUPT Two models for classification, one single model for localization, in which 1.5x context information and OHEM are adopted. 0.04752 0.291562
MCPRL_BUPT Two models for classification, localization model is fixed. The top-5 cls-only error on validation is 0.0481. The top-5 cls-loc error on validation is 0.2907. 0.04857 0.29012
FACEALL_BUPT Ensembles of 5 models for classification , single model for localization 0.04918 0.227494
FACEALL_BUPT Ensembles of 4 models for classification , single model for localization 0.06486 0.239351
FACEALL_BUPT single model for classification , single model for localization 0.06486 0.239185
connected Inception + ResNET + SSD with our selective box algorithm1 0.17374 0.323928
connected Inception + ResNET + SSD with our selective box algorithm2 0.17374 0.314498
connected Inception + ResNET + SSD with our selective box algorithm3 0.17374 0.310318
zlm ten crops, no location 0.99631 0.999803
MPG_UT Single Model of Branched Bilinear Pooling. Multi resolutions and crops test. Without localization. 0.99675 1.0
MPG_UT Ensemble of entry 3. Without localization. 0.99702 1.0
MPG_UT Ensemble of entry 2 and entry 4. Without localization. 0.99767 1.0

Task 2b: Classification+localization with additional training data

Ordered by localization error

Team name Entry description Description of outside data used Localization error Classification error
NUS-Qihoo_DPNs (CLS-LOC) [E5] LOC:: Dual Path Networks + Basic Ensemble ImageNet-5k & Self collected images (about 33k) 0.061941 0.03378
BDAT extra_box box annotation from DET 0.087533 0.03085
NUS-Qihoo_DPNs (CLS-LOC) [E4] CLS:: Dual Path Networks + Basic Ensemble ImageNet-5k & Self collected images (about 33k) 0.087948 0.02713
BDAT extra_class box annotation from DET 0.091081 0.02997

Ordered by classification error

Team name Entry description Description of outside data used Classification error Localization error
NUS-Qihoo_DPNs (CLS-LOC) [E4] CLS:: Dual Path Networks + Basic Ensemble ImageNet-5k & Self collected images (about 33k) 0.02713 0.087948
BDAT extra_class box annotation from DET 0.02997 0.091081
BDAT extra_box box annotation from DET 0.03085 0.087533
NUS-Qihoo_DPNs (CLS-LOC) [E5] LOC:: Dual Path Networks + Basic Ensemble ImageNet-5k & Self collected images (about 33k) 0.03378 0.061941

Object detection from video (VID)

Task 3a: Object detection from video with provided training data

Ordered by number of categories won

Team name Entry description Number of object categories won mean AP
IC&USYD provide_submission3 15 0.817265
IC&USYD provide_submission1 6 0.808847
IC&USYD provide_submission2 4 0.818309
NUS-Qihoo-UIUC_DPNs (VID) no_extra + seq + mca + mcs 3 0.757772
NUS-Qihoo-UIUC_DPNs (VID) no_extra + seq + vcm + mcs 1 0.757853
NUS-Qihoo-UIUC_DPNs (VID) Faster RCNN + Video Context 1 0.748493
THU-CAS merge-new 0 0.730498
THU-CAS old-new 0 0.728707
THU-CAS new-new 0 0.691423
GoerVision Deformable R-FCN single model+ResNet101 0 0.669631
GoerVision Ensemble of 2 models, using ResNet101 as the fundamental classification network and deformable R-FCN to detect video frames, multi-scale testing 0 0.665693
GoerVision To train the video object detector, we use ResNet101 and Deformable R-FCN for detection. 0 0.655686
GoerVision Using R-FCN to detect video object, multi scale testing applied. 0 0.646965
FACEALL_BUPT SSD based on Resnet101 networks 0 0.195754

Ordered by mean average precision

Team name Entry description mean AP Number of object categories won
IC&USYD provide_submission2 0.818309 4
IC&USYD provide_submission3 0.817265 15
IC&USYD provide_submission1 0.808847 6
NUS-Qihoo-UIUC_DPNs (VID) no_extra + seq + vcm + mcs 0.757853 1
NUS-Qihoo-UIUC_DPNs (VID) no_extra + seq + mca + mcs 0.757772 3
NUS-Qihoo-UIUC_DPNs (VID) Faster RCNN + Video Context 0.748493 1
THU-CAS merge-new 0.730498 0
THU-CAS old-new 0.728707 0
THU-CAS new-new 0.691423 0
GoerVision Deformable R-FCN single model+ResNet101 0.669631 0
GoerVision Ensemble of 2 models, using ResNet101 as the fundamental classification network and deformable R-FCN to detect video frames, multi-scale testing 0.665693 0
GoerVision To train the video object detector, we use ResNet101 and Deformable R-FCN for detection. 0.655686 0
GoerVision Using R-FCN to detect video object, multi scale testing applied. 0.646965 0
FACEALL_BUPT SSD based on Resnet101 networks 0.195754 0

Task 3b: Object detection from video with additional training data

Ordered by number of categories won

Team name Entry description Description of outside data used Number of object categories won mean AP
IC&USYD extra_submission2 region proposal trained from DET and COCO 24 0.819339
NUS-Qihoo-UIUC_DPNs (VID) extra + seq + vcm + mcs self-collected images 3 0.760252
GoerVision pre-trained model from COCO detection dataset. coco detection dataset 2 0.68817
NUS-Qihoo-UIUC_DPNs (VID) Faster RCNN + Video Context Self collected images 1 0.751525
IC&USYD extra_submission1 region proposal trained from DET and COCO 0 0.797295

Ordered by mean average precision

Team name Entry description Description of outside data used mean AP Number of object categories won
IC&USYD extra_submission2 region proposal trained from DET and COCO 0.819339 24
IC&USYD extra_submission1 region proposal trained from DET and COCO 0.797295 0
NUS-Qihoo-UIUC_DPNs (VID) extra + seq + vcm + mcs self-collected images 0.760252 3
NUS-Qihoo-UIUC_DPNs (VID) Faster RCNN + Video Context Self collected images 0.751525 1
GoerVision pre-trained model from COCO detection dataset. coco detection dataset 0.68817 2

Task 3c: Object detection/tracking from video with provided training data

Team name Entry description mean AP
IC&USYD provide_submission2 0.641474
IC&USYD provide_submission1 0.544835
NUS-Qihoo-UIUC_DPNs (VID) Model3 0.544536
THU-CAS track-merge+new 0.511627
NUS-Qihoo-UIUC_DPNs (VID) track_no_extra + mcs 0.510381
THU-CAS track-old+new 0.476636
THU-CAS track-new+new 0.469237
FACEALL_BUPT SSD based on Resnet101 networks,ECO tracking and cluster different confidence bounding box 0.063858

Task 3d: Object detection/tracking from video with additional training data

Team name Entry description Description of outside data used mean AP
IC&USYD extra_submission2 region proposal trained from DET and COCO 0.642935
IC&USYD extra_submission1 region proposal trained from DET and COCO 0.57749
NUS-Qihoo-UIUC_DPNs (VID) Model2 Self collected images 0.550078
NUS-Qihoo-UIUC_DPNs (VID) track_extra + mcs self collected images 0.530889
NUS-Qihoo-UIUC_DPNs (VID) Model1 Self collected images 0.530137

Team information

Team name Team members Abstract
BDAT Hui Shuai(1), Zhenbo Yu(1), Qingshan Liu(1), Xiaotong Yuan(1), Kaihua Zhang(1), Yisheng Zhu(1), Guangcan Liu(1), Jing Yang(1), Yuxiang Zhou(2), Jiankang Deng(2), (1)Nanjing University of Information Science & Technology, (2)Imperial College London Adaptive attention[1] and deep combined convolutional models[2,3] are used for LOC task.

Scale[4,5,6], context[7], sampling and deep combined convolutional networks[2,3] are considered for DET task. Object density estimation is used for score re-rank.

[1] Wang F, Jiang M, Qian C, et al. Residual Attention Network for Image Classification[J]. arXiv preprint arXiv:1704.06904, 2017.
[2] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778.
[3] Szegedy C, Ioffe S, Vanhoucke V, et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning[C]//AAAI. 2017: 4278-4284.
[4] Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation[J]. arXiv preprint arXiv:1505.04597, 2015.
[5] Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[J]. arXiv preprint arXiv:1612.03144, 2016.
[6] Shrivastava A, Sukthankar R, Malik J, et al. Beyond skip connections: Top-down modulation for object detection[J]. arXiv preprint arXiv:1612.06851, 2016.
[7] Zeng X, Ouyang W, Yan J, et al. Crafting GBD-Net for Object Detection[J]. arXiv preprint arXiv:1610.02579, 2016.
BUPT-PRIV Lu Yang (BUPT)
Zhiwei Liu (UCAS)
Qing Song (BUPT)
(1) We present a method that combines the feature pyramid network (FPN) with a basic Faster R-CNN system; see the sketch after this list. FPN is a top-down architecture with lateral connections, developed for building high-level semantic feature maps at all scales.
(2) ResNeXt-101 is used for feature extraction in our object detection system; it is a simple, modularized multi-branch extension of ResNet for ImageNet classification. This pretrained model brings a non-trivial improvement on the validation set.
(3) During training and inference, we use a multi-scale strategy to handle objects with large scale variation, which further increases mAP by about 3 points on the validation set.
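
As a companion to point (1), here is a minimal sketch of an FPN-style top-down pathway with lateral connections; the channel counts, nearest-neighbour upsampling, and PyTorch phrasing are illustrative assumptions, not BUPT-PRIV's actual implementation.

```python
import torch.nn as nn
import torch.nn.functional as F

class SimpleFPN(nn.Module):
    """Top-down pathway with lateral 1x1 connections over backbone stages C2-C5."""

    def __init__(self, in_channels=(256, 512, 1024, 2048), out_channels=256):
        super().__init__()
        self.lateral = nn.ModuleList([nn.Conv2d(c, out_channels, 1) for c in in_channels])
        self.smooth = nn.ModuleList([nn.Conv2d(out_channels, out_channels, 3, padding=1)
                                     for _ in in_channels])

    def forward(self, feats):                        # feats: [C2, C3, C4, C5], fine to coarse
        laterals = [lat(f) for lat, f in zip(self.lateral, feats)]
        for i in range(len(laterals) - 2, -1, -1):   # top-down: add the upsampled coarser map
            laterals[i] = laterals[i] + F.interpolate(
                laterals[i + 1], size=laterals[i].shape[-2:], mode="nearest")
        return [sm(p) for sm, p in zip(self.smooth, laterals)]   # P2-P5 feature maps
```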

References
[1] Tsung-Yi Lin,Piotr Dollar, Ross Girshick, Kaiming He, et al. "Feature Pyramid Networks for Object Detection" in CVPR, 2017
[2] Saining Xie, Ross Girshick, Piotr Dollar, et al. "Aggregated Residual Transformations for Deep Neural Networks" in CVPR 2017
[3] Ren S, He K, Girshick R, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks" Advances in neural information processing systems. 2015
connected Hitoshi Hongo (Connected)
Tetsuya Hada (Connected)
Kazuki Fukui (Connected)
Yoshihiro Mori (Connected)
In our method, we first predict labels with an ensemble of ResNet-50 and Inception v3.
We used a Single Shot MultiBox Detector (SSD), whose weights were pre-trained on the PASCAL VOC dataset and fine-tuned for ILSVRC, to generate bounding boxes.
We applied a box-selection strategy as follows:
the joint distributions of bounding-box width and height were learned for each class using kernel density estimation, and our strategy considers both the box confidences and the likelihoods of the shapes of the candidate boxes.
In addition, boxes of similar size and location were suppressed so that variously sized objects can be found in the images.
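
A minimal sketch of the kind of KDE-based box rescoring described above, assuming per-class arrays of ground-truth (width, height) pairs are available; the way confidence and shape likelihood are combined here (a simple geometric blend) and all names are illustrative assumptions, not the team's actual code.

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_shape_kdes(gt_wh_per_class):
    """Fit a joint (width, height) KDE per class from arrays of shape (N, 2)."""
    return {c: gaussian_kde(wh.T) for c, wh in gt_wh_per_class.items()}

def rescore(boxes, scores, class_ids, kdes, alpha=0.5):
    """Blend detector confidence with the per-class likelihood of each box's shape."""
    wh = np.stack([boxes[:, 2] - boxes[:, 0], boxes[:, 3] - boxes[:, 1]], axis=1)
    out = np.empty_like(scores, dtype=float)
    for i, (size, s, c) in enumerate(zip(wh, scores, class_ids)):
        density = kdes[c](size.reshape(2, 1))[0] if c in kdes else 1.0
        out[i] = (s ** alpha) * (density ** (1.0 - alpha))
    return out
```
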
FACEALL_BUPT Jinchang XU, BUPT, CHINA
Zhiqun HE, BUPT, CHINA
Xuankun HUANG, BUPT, CHINA
Jiangqi ZHANG, BUPT, CHINA
Qiushan GUO, BUPT, CHINA
Zesang HUANG, BUPT, CHINA
Yu LEI, BUPT, CHINA
Lei LI, BUPT, CHINA
Yingruo FAN, BUPT, CHINA
Junfei ZHUANG, BUPT, CHINA
Fengye XIONG, BUPT, CHINA
Hongliang BAI, Beijing Faceall co., LTD
Wenjian FENG, Beijing Faceall co., LTD
Yuan DONG, BUPT, CHINA
Object Detection: We employ the well-known "Faster R-CNN + ResNet-101" framework [1] and the "R-FCN + ResNet-101" framework [2]. We used a ResNet-101 model pre-trained on the CLS-LOC dataset with image-level annotations. For the first method, we use Faster R-CNN with the publicly available ResNet-101. We adopt multi-scale ROIs to obtain features containing richer context information. For testing, Soft-NMS [3] is integrated with Faster R-CNN, and we use 3 scales and merge the results using the simple strategy introduced last year. No validation data is used for training, and flipped images are used in only a third of the training epochs. For the second method, we compared two settings, class-aware and class-agnostic; while the R-FCN paper used class-agnostic training, we find that the class-aware setting achieves better performance. In the first stage, we trained on the object detection dataset with OHEM for 200k iterations with a learning rate of 0.001. In the next stage, we reduced the RPN loss weight from 1 to 0.5 and trained for another 80k iterations. In the final stage, we further reduced the RPN loss weight from 0.5 to 0.2 and trained for 120k iterations.

Object Classification/Localization: For the classification and localization part, we tried multiple methods to improve the classification accuracy and the localization mean average precision. For the classification part, we use four pre-trained models fine-tuned on train-val data. For testing, multiple crops are randomly sampled from an image or its horizontal flip, and the scores are averaged at multiple scales (images are resized such that the shorter side is in {224, 256, 288, 384}); finally we ensemble the models at the inference stage. For the localization part, the models are initialized from the ImageNet classification models and then fine-tuned on the object-level annotations of the 1000 classes. We use a class-agnostic strategy to learn bounding-box regression, and the generated regions are classified by the fine-tuned model into one of 1001 classes. Finally, we also adopt multi-scale testing and ensemble the results of several models.

Object detection from video: In this work, we use a variant of SSD [5] with ResNet [1] for the detection task. The overall training of the detection network follows a procedure similar to [5]. For the visual tracking part, we use ECO [6] to track the objects from detections every 5 frames, and we also cluster the detections with different confidences.
[1] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2017, 39(6):1137-1149.
[2] Dai J, Li Y, He K, et al. R-FCN: Object Detection via Region-based Fully Convolutional Networks[J]. 2016.
[3] Bodla N, Singh B, Chellappa R, et al. Improving Object Detection With One Line of Code[J]. 2017.
[5] Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector[J]. arXiv preprint arXiv:1512.02325, 2015.
[6] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg. Eco: Efficient convolution operators for tracking. arXiv preprint arXiv:1611.09224, 2016
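
A minimal sketch of the multi-crop, multi-scale score averaging used at test time in the classification part above; the resize and crop-sampling helpers and the model interface are hypothetical stand-ins, not FACEALL_BUPT's pipeline.

```python
import numpy as np

SCALES = [224, 256, 288, 384]   # shorter-side sizes mentioned in the abstract

def averaged_prediction(image, models, resize_shorter, sample_crops, n_crops=10):
    """Average softmax scores over models, scales, horizontal flips and random crops."""
    scores = []
    for model in models:
        for s in SCALES:
            resized = resize_shorter(image, s)               # HWC image, shorter side = s
            for img in (resized, resized[:, ::-1].copy()):   # original and horizontal flip
                for crop in sample_crops(img, n_crops):
                    scores.append(model(crop))               # softmax over 1000 classes
    return np.mean(scores, axis=0)
```
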
glee gwang-gook lee Faster R-CNN is used as the detection method with ResNet-152 as a base network.
Several techniques, such as atrous convolution and box refinement, are used to improve performance.
Model ensemble is not used.
GoerVision Yejin Chen(1,2), Chunshan Bai(1,5), Zhuo Chen(1), Le Ge(4), Chengwei Li(4), Shuo Xu(5), Yuxuan Bao(6), Lu Bai(1), Xinyi Sun(1), Shun Yuan(1), Xiangdong Zhang(1,3)

(1)Goertek Inc.
(2)Nanyang Technological University
(3)Tsing Hua University
(4)University of California, Berkeley
(5)Beihang University
(6)Michigan University, Ann Arbor
Our team utilizes two image object detection architectures, namely Faster R-CNN [1] and Deformable R-FCN [2], for the object detection task. To train the video object detection model, we use ResNet101 [5] as the fundamental classification network, and we adopt Deformable R-FCN [3] and Faster R-CNN to detect objects in still images. Multi-context suppression and motion-guided propagation are used to refine the per-frame detections. The COCO detection dataset is used to pre-train the detection model [7].

For testing, we utilize multi-scale testing and model ensembling to benefit the inference stage.

We use MXNet [4] as the framework because of its flexibility and efficiency.

Due to time constraints, we were not able to use Inception-ResNet-v2 [6] or tracking in the VID task, which could have significantly improved the resulting mAP.

References:
[1] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, "Faster R-CNN:Towards real-time object detection with region proposal networks" Advances in neural information processing system. 2015.
[2] Dai, Jifeng, Yi Li, Kaiming He, Jian Sun. "R-FCN: Object Detection via Region-based Fully Convolutional Networks." arXiv preprint arXiv:1605.06409 (2016).
[3] Yuwen Xiong, Haozhi Qi, Guodong Zhang, Yi Li, Jifeng Dai, Bin Xiao, Han Hu and Yichen Wei. "Deformable Convolutional Networks", arXiv:1703.06211, Jun 2017.
[4] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems, 2015.
[5] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015).
[6] Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. "Inception-v4, inception-resnet and the impact of residual connections on learning." arXiv preprint arXiv:1602.07261 (2016).
[7] Kaiming He, Georgia Gkioxari, Piotr Dollar, Ross Girshick, "Mask R-CNN", Tech Report, arXiv:1703.06870, Mar 2017.
Handsight Liangzhuang Ma
Xin Kan
Qianjiang Xiao
Wenlong Liu
Peiqin Sun
Our submission is based on a new model named Yes-Net [1], which achieves 39 FPS on 416 x 416 images (running on a Titan X Pascal). The starting point of this work is YOLO9000 [2].
The novel contributions of Yes-Net are:
(1) It combines local information with global information by adding an RNN to the CNN base feature extractor.
(2) Instead of selecting anchor boxes with a fixed number, fixed shapes and fixed center points in each grid cell, we use k-means to cluster N anchor boxes over the whole training set (see the sketch after this list). Every anchor box has its own shape and center.
(3) We propose a novel method that uses an RNN instead of NMS to select the output boxes, which improves the generalization ability of our detector. We argue that this method can also be adopted by other detectors.
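
A minimal sketch of k-means anchor clustering over training-set box shapes, in the spirit of point (2) above; using 1 - IoU as the distance is the common YOLOv2-style choice and, like the function names, an assumption rather than the Yes-Net code.

```python
import numpy as np

def iou_wh(wh, centroids):
    """IoU between one (w, h) pair and each centroid, all boxes centred at the origin."""
    inter = np.minimum(wh[0], centroids[:, 0]) * np.minimum(wh[1], centroids[:, 1])
    return inter / (wh[0] * wh[1] + centroids[:, 0] * centroids[:, 1] - inter)

def kmeans_anchors(boxes_wh, k, n_iter=100, seed=0):
    """Cluster (w, h) pairs into k anchor shapes using 1 - IoU as the distance."""
    rng = np.random.default_rng(seed)
    centroids = boxes_wh[rng.choice(len(boxes_wh), size=k, replace=False)]
    for _ in range(n_iter):
        assign = np.array([np.argmax(iou_wh(wh, centroids)) for wh in boxes_wh])
        new_centroids = np.array([boxes_wh[assign == i].mean(axis=0) if np.any(assign == i)
                                  else centroids[i] for i in range(k)])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids    # k anchor (w, h) pairs
```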

The paper describing the details of Yes-Net is available as arXiv:1706.09180.

References:
[1] Liangzhuang Ma, Xin Kan. Yes-Net: An Effective Detector Based on Global Information. arXiv preprint arXiv:1706.09180, 2017.
[2] Redmon J, Farhadi A. YOLO9000: Better, Faster, Stronger. arXiv preprint arXiv:1612.08242v1, 2016.
IC&USYD Jiankang Deng(1), Yuxiang Zhou(1), Baosheng Yu(2), Zhe Chen(2), Stefanos Zafeiriou(1), Dacheng Tao(2), (1)Imperial College London, (2)University of Sydney Flow acceleration[1,2] is used. Final scores are adaptively chosen between the detector and tracker.

[1] Deep Feature Flow for Video Recognition
Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[2] Flow-Guided Feature Aggregation for Video Object Detection, Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Arxiv tech report, 2017.
KAISTNIA_ETRI Keun Dong Lee(ETRI) Seungjae Lee(ETRI) JongGook Ko (ETRI) Jaehyung Kim* (KAIST) Jun Hyun Nam* (KAIST) Jinwoo Shin (KAIST) (* indicates equal contribution) In this work, we consider ensembles of variants of GBD-Net [1] with ResNet [2] for the detection task. For maximizing the ensemble effect, we design variants using (a) various depth/width for multi-region networks (without GBD), (b) fusion of different layer feature maps with weighted additions (with GBD), (c) different pooling methods considering various region shapes (with GBD), and (d) new loss function incorporating network confidence (without GBD). For RPN, we trained cascade RPN [1] and also performed iterative inference procedures to improve its performance.

[1] Xingyu Zeng et al. “Crafting GBD-Net for Object Detection.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017.

[2] Kaiming He, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
MCPRL_BUPT Yuanyuan Li
Yang Bai
We participate in the classification and localization task. Our classification models are mainly based on Inception-ResNet-v2 and Inception-v4 [1], and the localization framework is Faster R-CNN [2].
We make the following improvements.
(1) For the classification part, we train Inception-ResNet-v2 and Inception-v4 networks. Model ensembling is adopted to obtain better performance.
(2) For the localization part, we use the Faster R-CNN framework with VGG-16 [3] as the backbone. To balance the samples, we sample the training images evenly across classes. We cluster the annotations to pre-set more appropriate hyper-parameters. Finally, 1.5x contextual information around RoIs and Online Hard Example Mining (OHEM) [4] are used to locate objects more precisely (see the OHEM sketch after this list).
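
A minimal PyTorch-style sketch of the OHEM step mentioned in (2): per-RoI losses are computed without reduction and only the hardest examples contribute to the gradient. The number of hard examples kept is an illustrative choice.

```python
import torch
import torch.nn.functional as F

def ohem_classification_loss(logits, labels, num_hard=128):
    """Online Hard Example Mining: keep only the RoIs with the largest loss.

    logits: (N, num_classes) scores for N RoIs; labels: (N,) ground-truth class indices.
    """
    per_roi_loss = F.cross_entropy(logits, labels, reduction="none")   # (N,)
    num_hard = min(num_hard, per_roi_loss.numel())
    hard_loss, _ = torch.topk(per_roi_loss, num_hard)                  # hardest RoIs only
    return hard_loss.mean()
```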

[1] Szegedy, Christian, et al. "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning." AAAI. 2017.
[2] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.
[3] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014).
[4] Shrivastava, Abhinav, Abhinav Gupta, and Ross Girshick. "Training region-based object detectors with online hard example mining." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.
MIL_UT Yuji Tokozume (the Univ. of Tokyo), Kosuke Arase (the Univ. of Tokyo), Yoshitaka Ushiku (the Univ. of Tokyo), Tatsuya Harada (the Univ. of Tokyo/RIKEN) Ensemble of ResNets [1, 2] and ResNeXts [3] trained with a novel learning method.

[1] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016.
[2] K. He, X. Zhang, S. Ren, and J. Sun. Identity Mappings in Deep Residual Networks. In ECCV, 2016.
[3] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated Residual Transformations for Deep Neural Networks. In CVPR, 2017.
MPG_UT Hideki Nakayama, Jiren Jin and Ikki Kishida.
The University of Tokyo
We utilized pretrained ResNeXt [1] and ResNet [2] models. Our models replace global average pooling with bilinear pooling [3] with log-Euclidean normalization.
To address the problem of applying bilinear pooling to high-dimensional inputs, we built two types of models and ensembled the results for submission.

1. The last fully connected layer and global average pooling of a pretrained ResNeXt [1] were replaced by branched bottlenecks, bilinear pooling [3] with log-Euclidean normalization, and a new fully connected layer. Due to the difficulty of end-to-end learning, we conducted stage-wise training.

2. Partial Bilinear Pooling: the idea is to move the bottleneck of the representation capacity to a later phase of the model. The outputs of the last convolutional layer are separated into several sub-groups (along the filter dimension), and bilinear pooling is applied to each sub-group. The sub-groups are combined by summation. We also add a batch normalization layer after the combination of sub-groups, which improves learning efficiency and final performance.
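
A minimal sketch of bilinear pooling followed by log-Euclidean normalization as described above, applied to a single convolutional feature map; the epsilon regularizer and the plain (non-branched, non-partial) form are simplifying assumptions.

```python
import numpy as np

def bilinear_pool_log_euclidean(feature_map, eps=1e-6):
    """Second-order pooling of a (C, H, W) feature map with a matrix-log normalization."""
    C, H, W = feature_map.shape
    X = feature_map.reshape(C, H * W)
    gram = X @ X.T / (H * W)                           # (C, C) bilinear (Gram) matrix
    # Log-Euclidean normalization: matrix logarithm of the symmetric PSD Gram matrix.
    eigval, eigvec = np.linalg.eigh(gram)
    log_gram = eigvec @ np.diag(np.log(eigval + eps)) @ eigvec.T
    return log_gram.reshape(-1)                        # flattened C*C descriptor
```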

[1] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu and Kaiming He. Aggregated Residual Transformations for Deep Neural Networks. CVPR, 2017.
[2] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. Deep Residual Learning for Image Recognition. CVPR, 2016.
[3] Tsung-Yu Lin, Aruni RoyChowdhury and Subhransu Maji. Bilinear CNN Models for Fine-grained Visual Recognition. ICCV, 2015.
NUS-Qihoo-UIUC_DPNs (VID) NUS: Yunchao Wei, Mengdan Zhang, Jianan Li, Yunpeng Chen, Jiashi Feng
Qihoo 360: Jian Dong, Shuicheng Yan
UIUC: Honghui Shi
(1)Technique Details for object detection from video:
The video object detector is trained based on Faster R-CNN [1] using Dual Path Networks (DPNs) as the backbone. Specifically, we adopt three DPN models, i.e., DPN-96, DPN-107 and DPN-131, as the trunk feature learner as well as the head classifier in the Faster R-CNN framework. The best single model achieves 79.3% mAP on the validation set. The ensemble of 4 models gives an mAP of 80.5%. In addition, we propose a selected-average-pooling strategy to infer video context information, which is used to refine the detection results. With sequential tracking and rescoring [4] and video context, the mAP can be further improved to 84.5% on the validation set.

(2)Technique Details for object detection/tracking from video:
The object trajectory generation is based on the following two complementary methods: the optical-flow [2] based tubelet generation method and the visual tracking [3] based tubelet generation method. The former method ensures the accuracy of trajectories, and the latter method provides high tracking recall. The tracking tubelets are finally selectively added to the optical-flow tubelets based on our adaptive merge strategy. The final mAP in the validation set is 70.1%.

NOTE: Here, the validation set indicates the 555 videos from ILSVRC2016.

[1] Ren, S, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." NIPS. 2015.
[2] Ilg, E, et al. “Flownet 2.0: Evolution of optical flow estimation with deep networks” arXiv preprint arXiv:1612.01925, 2016.
[3] Nam H, et al. “Learning multi-domain convolutional neural networks for visual tracking”, CVPR 2016.
[4] Han, Wei, et al. "Seq-nms for video object detection." arXiv preprint arXiv:1602.08465 2016.
NUS-Qihoo_DPNs (CLS-LOC) NUS: Yunpeng Chen, Huaxin Xiao, Jianan Li, Xuecheng Nie, Xiaojie Jin, Jianshu Li, Jiashi Feng
Qihoo 360: Jian Dong, Shuicheng Yan
We present a simple, highly efficient and modularized Dual Path Network (DPN) which introduces a novel dual path topology. The DPN model contains a residual path and a densely connected path which are able to effectively share common features while maintaining the flexibility to learn to explore new features. DPNs serve as our main network for all the tasks.
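
A highly simplified sketch of the dual path idea described above: each block's output is split so that one part is added to the residual path and the rest is concatenated onto the densely connected path. A single conv stands in for DPN's grouped bottleneck, so this illustrates the topology only and is not the actual DPN block.

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """Toy dual-path block over a residual state and a densely connected state."""

    def __init__(self, res_channels, dense_in, dense_inc):
        super().__init__()
        self.res_channels = res_channels
        self.conv = nn.Sequential(                     # stand-in for the DPN bottleneck
            nn.Conv2d(res_channels + dense_in, res_channels + dense_inc,
                      3, padding=1, bias=False),
            nn.BatchNorm2d(res_channels + dense_inc),
            nn.ReLU(inplace=True),
        )

    def forward(self, residual, dense):
        out = self.conv(torch.cat([residual, dense], dim=1))
        res_out, dense_out = torch.split(
            out, [self.res_channels, out.size(1) - self.res_channels], dim=1)
        # Residual path: element-wise addition; dense path: concatenation.
        return residual + res_out, torch.cat([dense, dense_out], dim=1)
```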

In the CLS-LOC task, we adopt the DPNs to predict the Top-5 objects and then assign the corresponding bounding boxes using DPN based Faster RCNNs [1].

On the provided training data track, a shallow DPN-98 (236MB/11.7 GFLOPs) surpasses the best ResNeXt-101 (64×4d) [2] on the image classification task with a 26% smaller model size, 25% less computational cost and 8% lower memory consumption, and a deeper DPN-131 (304MB/16.0 GFLOPs) achieves a top-1 (top-5) classification error of 18.55% (4.16%) on the validation set using a single 320x320 center crop. We combine two strong DPNs and two weaker DPNs with several existing CNNs to get the final prediction.

On the extra training data track, we pretrained a DPN-107 on ImageNet-5k dataset and then fine-tuned on the provided training set with 33k self-collected extra training images. With different fine-tuning strategies, we got two DPN-107 networks. The final prediction is a weighted combination of these two additional models and previous models.

For the bounding box prediction, we follow the Faster RCNN pipeline and train three DPNs (two DPN-92 and one DPN-131). Proposals are extracted from all models and the final scores are adjusted by the classification scores.

Technical details of the Dual Path Networks will be available on arXiv soon, and the pretrained models will also be made publicly available at the same time.


*Note: All DPNs are trained from scratch using MXNet on 10 nodes with 40 K80 graphics cards in total. Without specific code optimization, the training speed of DPN-131 reaches over 60 samples/sec per node with synchronous training.
-----
[1] S Ren, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." NIPS. 2015.
[2] S Xie, et al. "Aggregated residual transformations for deep neural networks." CVPR. 2017.
NUS-Qihoo_DPNs (DET) NUS: Jianan Li, Jianshu Li, Yunpeng Chen, Huaxin Xiao, Jiashi Feng
Qihoo 360: Jian Dong, Shuicheng Yan
We adopt the Dual Path Network (DPN), which contains a novel dual path topology, for the object detection task based on Faster R-CNN [1]. The feature-sharing scheme and the flexibility to explore new features in DPN are shown to be effective for object detection. Specifically, we adapt several DPN models, i.e., DPN-92, DPN-107, DPN-131, etc., as the trunk feature learner as well as the head classifier in the Faster R-CNN framework. We only use networks up to 131 layers, which are light to train and fit well within most common GPUs, yet yield good performance. For region proposal generation, low-level fine-grained features are exploited, which are shown to be effective in improving proposal recall. Furthermore, we incorporate beneficial context information by adopting the dilated convolution [2] from segmentation into the detection framework. During testing, we design a category-wise weighting strategy to explore expert models for different categories and apply weights to different experts accordingly for multi-model inference. In addition, we adapt a model pre-trained on the image classification task to extract global context information, which provides beneficial cues for reasoning about the detection results within the whole input image.


[1] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015.
[2] Yu, Fisher, and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions." International Conference on Learning Representations. 2016.
PaulForDream PengChong I use the end-to-end detection system YOLO, first presented in [1] and then improved in [2], to train a detection network.
I modify the YOLOv2 detection network structure defined in [2] to improve detection performance on small objects.
Besides the network structure, I also modify the loss function to make it more sensitive to small objects.
In [2], the authors used 5 anchors to predict bounding boxes, while I use 10 anchors computed from the ILSVRC2017 DET training-set annotations.
A high-resolution detection network helps improve detection performance, so the input image size is 608x608.

References
[1] Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi. You Only Look Once: Unified, Real-Time Object Detection. arXiv preprint arXiv:1506.02640
[2] Joseph Redmon, Ali Farhadi. YOLO9000: Better, Faster, Stronger. arXiv preprint arXiv:1612.08242
QINIU-ATLAB+CAS-SARI Yixin Bao (1)
Zhijian Zhao (1)
Yining Lin (1)
Bingfei Fu (2)
Hao Ye (2)
Li Wang (2)

(1) Shanghai Qiniu Information & technology Corporation
(2) Shanghai Advanced Research Institute, Chinese Academy of Sciences
Abstract
In ILSVRC2017, we focus on object detection with provided training data. Our object detection architecture is Faster R-CNN (in MXNet [1]) with different network structures: ResNet-101 [2], ResNet-152, Inception-v3 [3] and deformable R-FCN [4]. To make the most of these deep neural networks, we use evaluation-time methods, for example box voting, to improve detection accuracy (see the sketch below). During training, we also analyze missed images and fine-tune the networks. Moreover, we use multi-scale testing to catch tiny and elongated objects. Our final submissions consist of ensembles of multiple models, which improves the robustness of the whole system.
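
A minimal sketch of the box-voting step mentioned above: after NMS, each kept box is refined by a score-weighted average of the pre-NMS boxes that overlap it. The IoU threshold and the weighting scheme are common choices and assumptions here, not necessarily the team's settings.

```python
import numpy as np

def iou_matrix(a, b):
    """Pairwise IoU between boxes a (N, 4) and b (M, 4), boxes as (x1, y1, x2, y2)."""
    tl = np.maximum(a[:, None, :2], b[None, :, :2])
    br = np.minimum(a[:, None, 2:], b[None, :, 2:])
    inter = np.prod(np.clip(br - tl, 0, None), axis=2)
    area_a = np.prod(a[:, 2:] - a[:, :2], axis=1)
    area_b = np.prod(b[:, 2:] - b[:, :2], axis=1)
    return inter / (area_a[:, None] + area_b[None, :] - inter)

def box_voting(kept_boxes, all_boxes, all_scores, iou_thresh=0.5):
    """Refine each post-NMS box with the score-weighted mean of overlapping pre-NMS boxes."""
    ious = iou_matrix(kept_boxes, all_boxes)
    voted = kept_boxes.astype(float).copy()
    for i in range(len(kept_boxes)):
        support = ious[i] >= iou_thresh
        if support.any():
            w = all_scores[support]
            voted[i] = (all_boxes[support] * w[:, None]).sum(axis=0) / w.sum()
    return voted
```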

References
[1] "GitHub," 2017. [Online]. Available: https://github.com/fubingfeiya/mxnet. [Accessed 30 6 2017].
[2] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," CVPR, pp. 1-12, 10 12 2015.
[3] C. Szegedy, V. Vanhoucke, S. Ioffe and J. Shlens, "Rethinking the Inception Architecture for Computer Vision," CVPR, pp. 2818-2826, 2016.
[4] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu and Y. Wei, "Deformable Convolutional Networks," CVPR, 17 3 2017.
ReadSense Thomas Tong, ReadSense
Leon Ding, ReadSense
Jiajun Du, Shanghai Jiao Tong University
Faster R-CNN
[8] Ren S, He K, Girshick R, et al. Faster r-cnn: Towards real-time object detection with region proposal networks[C]. Advances in neural information processing systems. 2015: 91-99.
Sheep Sheng Yang, student at NWPU This is a method for detecting objects in images that integrates bounding boxes and class probabilities into a single deep neural network. First, we generate a small set of default boxes of different aspect ratios. Next, we run a single convolutional network on the image and generate class confidences for all object categories for the default boxes. Last, we use non-max suppression to choose the resulting detections by the model's confidence and regress the bounding boxes (a sketch of this step follows the list below).
This model is built on the Caffe framework and is motivated by SSD and YOLO; the method is fast and effective. It can also handle objects of various sizes and images of different resolutions.
The main advantages are as follows:
1. Incorporating the bounding boxes and class probabilities into one deep network, rather than using two networks as in Fast R-CNN.
2. Generating default boxes with different aspect ratios rather than region proposals.
3. Images of different resolutions can all be handled by the method.
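
A minimal sketch of the single-class non-max suppression step mentioned above; the IoU threshold is an illustrative value, not the team's setting.

```python
import numpy as np

def nms(boxes, scores, iou_thresh=0.45):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # indices sorted by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]
    return keep                             # indices of the retained boxes
```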

Reference:
1. Uijlings, J.R., van de Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object recognition. IJCV (2013)
2. Girshick, R.: Fast R-CNN. In: ICCV (2015)
3. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: NIPS (2015)
4. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: CVPR (2016)
5. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. In: ICLR (2015)
6. Gould, S., Gao, T., Koller, D.: Region-based segmentation and object detection. In: Advances in Neural Information Processing Systems, pages 655-663 (2009)
7. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) (2015)
8. Dean, T., Ruzon, M.A., Segal, M., Shlens, J., Vijayanarasimhan, S., Yagnik, J.: Fast, accurate detection of 100,000 object classes on a single machine. In: CVPR (2013)
9. Girshick, R.B., Felzenszwalb, P.F., McAllester, D.: Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/~rbg/latent-release5/
10. Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338 (2010)
SIIT_KAIST-SKT Dongyoon Han* (KAIST)
Jiwhan Kim* (KAIST)
Gwang-Gook Lee (SKT)
Junmo Kim (KAIST)
(*equally contributed)
We used trained models, including three 200-layer PyramidNets [1], as the classification networks. PyramidNets are designed to maximize network capability by gradually increasing the number of channel maps; as a result, PyramidNets show better performance than ResNets [2] and other state-of-the-art networks with an equal number of parameters.

We adopted the PyramidNets previously trained on the ILSVRC-2012 training set, which were uploaded to our GitHub: https://github.com/jhkim89/PyramidNet.

Our main goals are two-fold: 1) to get better performance with a smaller ensemble of networks; and 2) to apply several additional techniques during the test phase to improve the classification ability of the trained networks.

For the localization network, R-FCN [3] is used with a single PyramidNet-101 backbone, which is trained with an auxiliary loss to boost both the RPN and detection heads.


[1] D. Han*, J. Kim* and Junmo Kim. "Deep pyramidal residual networks", equally contributed by the authors*, CVPR 2017.
[2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition", CVPR 2016.
[3] J. Dai, Y. Li, K. He, and J. Sun, 'R-fcn: Object detection via region-based fully convolutional networks', NIPS 2016.
THU-CAS GuiGuang Ding (1)
ChaoQun Chu (1)
Sheng Tang (2)
Bin Wang (2)
JunBin Xiao (2)
Chen Li (1)
DaHan Gong (1)

(1) School Of Software, Tsinghua University, Beijing, China
(2) Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
For still-image detection, this year we use the Faster R-CNN [1] framework with different versions of ResNet [2] as the base convolutional network.
1. For the object detection from video (VID) sub-task, our contributions are three-fold:
(1) Co-occurrence-relationship-based multi-context suppression: by mining the co-occurrence relationships between bounding boxes of different classes in the training dataset and analyzing the multiple contexts in each video, we can identify the representative classes and their co-occurring classes, and then apply more effective and targeted suppression, which brings a large mAP improvement.
(2) True-negative object filtering: based on an analysis of the training dataset, we can obtain the non-occurrence relationships between different objects, which help us filter out true-negative objects that have low detection scores and whose categories do not co-occur with the objects that have the highest detection scores.
(3) Tubelet-based bounding-box reclassification: based on the tubelets constructed from detection results, optical flow and multi-target tracking algorithms, we can perform an effective reclassification to obtain coherent categories for the bounding boxes of each tubelet.

2. For object detection from video (VID) with tracking, we propose a novel tracking framework that integrates optical-flow [3] tracking, GOTURN tracking [4] and POI tracking [5] into a more accurate tubelet-construction algorithm, exploiting their complementary attributes. Under this framework, our contributions are four-fold:
(1) Tubelet construction based on detection results and optical flow: using optical flow as the tracker, we propose to sequentialize the detection bounding boxes of the same object to form a tubelet.
(2) Multi-target tracking with GOTURN: we first choose anchor frame, and exploit the adjacent information to determine the reliable anchor targets for efficient tracking. Then, we track each anchor target with a GOTURN tracker in parallel. Finally, we use still-image detection results to recall missing tubelets.
(3) Multi-target tracking with POI: POI tracking is an effective data association based multiple object tracking (MOT) algorithm, based on the detection result of the still-image detection, we can get some reliable tubelets, which are beneficial for the tubelets fusion afterwards.
(4) Tubelet NMS: we propose a novel effective union and concatenation method for tubelet fusion, which improves the final AP by a large margin.



References:
[1] Ren S, He K, Girshick R, Sun J. “Faster R-CNN: Towards real-time object detection with region proposal networks”, NIPS 2015: 91-99.
[2] He K, Zhang X, Ren S, Sun J. “Deep residual learning for image recognition”, CVPR 2016.
[3] Kang K, Ouyang W, Li H, Wang X. “Object Detection from Video Tubelets with Convolutional Neural Networks”, CVPR 2016.
[4] Held, David, Sebastian Thrun, and Silvio Savarese. "Learning to track at 100 fps with deep regression networks." arXiv preprint arXiv:1604.01802 (2016).
[5] Yu, Fengwei, et al. "Poi: Multiple object tracking with high performance detection and appearance feature." arXiv preprint arXiv:1610.06136 (2016).
Trimps-Soushen Xiaoteng Zhang, Zhengyan Ding, Jianying Zhou, Jie Shao, Lin Mei

The Third Research Institute of the Ministry of Public Security, P.R. China.
In the classification part, we applied several methods, such as stochastic depth, data augmentation and model optimization (removing some redundant layers), to avoid overfitting. In addition, different training strategies were adopted for different models, which boosts the final performance considerably.
Our localization system was based on the Faster R-CNN pipeline with improvements such as global context, cascaded RPN/RCNN, and multi-scale training/testing. Works such as FPN, Mask R-CNN and Deformable ConvNets were also integrated into our pipeline. Furthermore, a more efficient fusion algorithm was used in the test phase, which substantially improves localization accuracy.
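
The abstract above only mentions stochastic depth in passing; as a hedged illustration, here is a minimal PyTorch-style sketch of the standard stochastic-depth residual block (randomly drop the residual branch during training, scale it by its survival probability at test time). How Trimps-Soushen actually applied it is not described.

```python
import torch
import torch.nn as nn

class StochasticDepthBlock(nn.Module):
    """Residual block whose residual branch is skipped with probability 1 - p_survive."""

    def __init__(self, branch: nn.Module, p_survive: float = 0.8):
        super().__init__()
        self.branch = branch
        self.p_survive = p_survive

    def forward(self, x):
        if self.training:
            if torch.rand(1).item() < self.p_survive:
                return x + self.branch(x)            # branch kept for this mini-batch
            return x                                 # branch dropped: identity only
        return x + self.p_survive * self.branch(x)   # expected value at test time
```
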
DeepView(ETRI) Seung-Hwan Bae*(ETRI)
Youngjoo Jo*(ETRI)
Joongwon Hwang*(ETRI)
Youngwan Lee*(ETRI)
Young-Suk Yoon*(ETRI)
Yuseok Bae*(ETRI)
Jongyoul Park(ETRI,Supervisor)
* indicates equal contribution
++For training:
We train the three types of convolutional detectors for this challenge:
(1) SSD type: We use DSSD[1] with VGGNet and SSD[2] with WR-Inception Network [3].
(2) Faster RCNN type Ⅰ: We use the pre-trained resnet101/152/269 models [4] as CLS-Net. We then add region proposal networks to the CLS-Net. For this challenge, we fine-tune the networks on 200 detection object classes.
(3) Faster RCNN type Ⅱ: We apply a resizing method with bilinear interpolation [5] on the ResNet-152 model instead of ROI pooling. The method is also used to build a new hyper-feature layer.
In order to handle the imbalance problem, we keep the ratio of positive and negative samples equal in each mini-batch. To detect small objects, we do not limit the size of region proposals. To improve detection accuracy, we propose new techniques for both multi-scale and multi-region processing.
++For improving accuracy:
(1) Results ensemble: to ensemble the results, we combined the detection results of the models according to their mean APs on val2. After that, Soft-NMS and box refinement were performed. This improves the mAP by about 4-5%.
(2) Soft-NMS: We test the models with Soft-NMS [6] to get some improvement (see the sketch after this list).
(3) Data augmentation: We augment the given training set in 4 manners: 2D rotation, 2D translation, color histogram equalization, and stochastic noise addition. To make the balanced dataset between classes, we produce augmented images when the original image contains objects of the classes with fewer instances.
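
A minimal sketch of Soft-NMS as referenced in (2), using the linear score decay from [6]; the thresholds are illustrative, and this single-class sketch is not DeepView's actual implementation.

```python
import numpy as np

def soft_nms_linear(boxes, scores, iou_thresh=0.3, score_thresh=0.001):
    """Soft-NMS with linear decay: overlapping boxes are down-weighted instead of removed."""
    boxes, scores = boxes.astype(float).copy(), scores.astype(float).copy()
    for i in range(len(scores)):
        # Move the highest-scoring remaining box into position i.
        m = i + int(np.argmax(scores[i:]))
        boxes[[i, m]] = boxes[[m, i]]
        scores[[i, m]] = scores[[m, i]]
        # Decay the scores of later boxes in proportion to their overlap with box i.
        tl = np.maximum(boxes[i, :2], boxes[i + 1:, :2])
        br = np.minimum(boxes[i, 2:], boxes[i + 1:, 2:])
        inter = np.prod(np.clip(br - tl, 0, None), axis=1)
        area_i = np.prod(boxes[i, 2:] - boxes[i, :2])
        area_r = np.prod(boxes[i + 1:, 2:] - boxes[i + 1:, :2], axis=1)
        iou = inter / (area_i + area_r - inter)
        scores[i + 1:] *= np.where(iou > iou_thresh, 1.0 - iou, 1.0)
    keep = scores > score_thresh
    return boxes[keep], scores[keep]
```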

[Reference]
[1] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD: Deconvolutional single shot detector, arXiv preprint arXiv:1701.06659, 2017.
[2] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.- Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016.
[3] Y. Lee, H. Kim, E. Park, X. Cui, H. Kim, Wide-residual-inception Networks for real-time object detection. arXiv preprint arXiv: 1702.01243, 2017.
[4] X.u Zeng, W. Ouyang, J Yan, H. Li, T. Xiao, K. Wang, Y. Liu, Y. Zhou, B. Yang, Z. Wang, H. Zhou, X. Wang, Crafting GBD-Net for Object Detection, CoRR, 2016
[5] X. Chen & A. Gupta, An implementation of faster rcnn with study for region sampling, arXiv preprint arXiv:1702.02138., 2017
[6] N. Bodla, B. Singh, R. Chellappa, & L. S. Davis, Improving Object Detection With One Line of Code, arXiv preprint arXiv:1704.04503., 2017



WMW Jie Hu (Momenta)
Li Shen (University of Oxford)
Gang Sun (Momenta)
We design a new architectural building block, named "Squeeze-and-Excitation (SE)". Each building block embeds information from global receptive fields via a "squeeze" operation and selectively enhances responses via an "excitation" operation. The SE module is the foundation of our entries. We develop multiple versions of SENet, such as SE-ResNet, SE-ResNeXt and SE-Inception-ResNet, which clearly surpass their non-SE counterparts with a slight increase in computation and GPU memory cost. We achieved a top-5 error rate of 2.3% on the validation set.
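
A minimal PyTorch-style sketch of a Squeeze-and-Excitation block as described above: a global-average-pool "squeeze" followed by a small two-layer gating "excitation" that rescales the channels. The reduction ratio of 16 is a common default and an assumption here.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation: channel-wise recalibration from global context."""

    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):                           # x: (N, C, H, W)
        n, c = x.shape[:2]
        squeeze = x.mean(dim=(2, 3))                # squeeze: global average pooling -> (N, C)
        excite = self.fc(squeeze).view(n, c, 1, 1)  # excitation: per-channel gates in [0, 1]
        return x * excite                           # rescale the original feature maps
```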

All the models are trained on our designed distributed deep learning training system “ROCS”. We conduct significant optimization on GPU memory and message passing across GPU servers. Benefitting from that, our system trains SE-ResNet152 with a minibatch size of 2048 on 64 Nvidia Pascal Titan X GPUs in 20 hours using synchronous SGD without warm-up. We train all the models from scratch with provided training data. We submit no localization result.

More technical and experimental details will be elucidated in a report.
XBENTURY-THU Nie FangXing : XiaoBaiShiJi
Ding ZhanYang : XiaoBaiShiJi
Gao TingWei : THU
Abstract:

Our team uses Faster R-CNN to train DET models. Our models are based on a 101-layer ResNet, with some data augmentation and other tricks added. We train six different models for testing; the results are divided into two groups, and each group ensembles three models.


Reference:

Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks: Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun

CRAFT Objects from Images: Bin Yang, Junjie Yan, Zhen Lei, Stan Z. Li

Fast R-CNN: Ross Girshick

Training Region-based Object Detectors with Online Hard Example Mining: Abhinav Shrivastava, Abhinav Gupta, Ross Girshick

Gated Bi-directional CNN for Object Detection: Xingyu Zeng, Wanli Ouyang, Bin Yang, Junjie Yan, Xiaogang Wang
YYF Yufeng Yuan (personal entry) The core method is based on YOLOv2 with a ResNet-152 model.
Anchors were introduced into the YOLO framework by YOLOv2, and I also benefit from them.
The ResNet-152 pooling layer and fully connected layer are removed and replaced with 3 convolutional layers.
The first of these layers combines features from res5c, res4f and res3d.
I use four models with the same structure but different weights.
The weights are trained on the same images under different conditions, such as different sizes.
The input images are all 480 pixels.
Each of the four models produces its own predictions, and the results are pooled into one set.
The final result is selected by non-maximum suppression.

These results were produced by me alone, one person with a few Nvidia cards.
The results are not impressive because of limited experience and limited resources.
But my interest is the best driving force, and I believe there must be anchor-based detection methods other than Faster R-CNN.
I find YOLO to be a candidate with exciting localization ability and good recognition ability, but it is not as strong on 200 categories and small objects.
I have tried many methods and will keep going down the YOLO road.

References:
YOLO and YOLOv2: https://pjreddie.com/darknet/yolo/
ResNet: https://arxiv.org/abs/1512.03385
zlm Liming Zhao, Zhejiang University
Xi Li, Zhejiang University
A deep residual network, built by stacking a sequence of residual blocks, is easy to train, because identity mappings skip residual branches and thus improve information flow.
To further reduce the training difficulty, we present a simple network architecture, deep merge-and-run neural networks.
We use a modularized building block, the merge-and-run block, which assembles residual branches in parallel through a merge-and-run mapping: average the inputs of the residual branches (merge), and add the average to the output of each residual branch to form the inputs of the subsequent residual branches (run).
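
A minimal PyTorch-style sketch of the merge-and-run block described above, with two parallel residual branches; using a single conv-BN-ReLU per branch is a simplification for illustration.

```python
import torch.nn as nn

class MergeAndRunBlock(nn.Module):
    """Two parallel residual branches with a merge-and-run mapping: average the
    branch inputs (merge) and add that average to each branch's output to form
    the inputs of the subsequent block (run)."""

    def __init__(self, channels):
        super().__init__()
        def make_branch():
            return nn.Sequential(
                nn.Conv2d(channels, channels, 3, padding=1, bias=False),
                nn.BatchNorm2d(channels),
                nn.ReLU(inplace=True),
            )
        self.branch_a, self.branch_b = make_branch(), make_branch()

    def forward(self, xa, xb):
        merged = 0.5 * (xa + xb)                                         # merge
        return self.branch_a(xa) + merged, self.branch_b(xb) + merged    # run
```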