Legend:
Yellow background = winner in this task according to this metric; authors are willing to reveal the method
White background = authors are willing to reveal the method
Grey background = authors chose not to reveal the method
Italics = authors requested that the entry not participate in the competition
Object detection (DET)
Task 1a: Object detection with provided training data
Ordered by number of categories won
Team name | Entry description | Number of object categories won | mean AP |
BDAT | submission4 | 85 | 0.731392 |
BDAT | submission3 | 65 | 0.732227 |
BDAT | submission2 | 30 | 0.723712 |
DeepView(ETRI) | Ensemble_A | 10 | 0.593084 |
NUS-Qihoo_DPNs (DET) | Ensemble of DPN models | 9 | 0.656932 |
KAISTNIA_ETRI | Ensemble Model5 | 1 | 0.61022 |
KAISTNIA_ETRI | Ensemble Model4 | 0 | 0.609402 |
KAISTNIA_ETRI | Ensemble Model2 | 0 | 0.608299 |
KAISTNIA_ETRI | Ensemble Model1 | 0 | 0.608278 |
KAISTNIA_ETRI | Ensemble Model3 | 0 | 0.60631 |
DeepView(ETRI) | Single model A using ResNet for detection | 0 | 0.587519 |
QINIU-ATLAB+CAS-SARI | Ensemble 3 | 0 | 0.526601 |
QINIU-ATLAB+CAS-SARI | Ensemble 4 | 0 | 0.523595 |
QINIU-ATLAB+CAS-SARI | Ensemble 2 | 0 | 0.518426 |
QINIU-ATLAB+CAS-SARI | Ensemble 5 | 0 | 0.51703 |
QINIU-ATLAB+CAS-SARI | Ensemble 1 | 0 | 0.515529 |
FACEALL_BUPT | faster-rcnn with soft-nms | 0 | 0.499586 |
PaulForDream | YOLOv2 608*608 | 0 | 0.494288 |
Sheep | based on ssd_512 | 0 | 0.459223 |
Sheep | according to YOLO and SSD | 0 | 0.43453 |
FACEALL_BUPT | single rfcn with class-aware model baseline | 0 | 0.433957 |
FACEALL_BUPT | Ensembles of three models | 0 | 0.416489 |
Sheep | based on ssd_300 | 0 | 0.401165 |
XBENTURY-THU | Ensemble three models A | 0 | 0.392799 |
Handsight | 3 Yes-Net models with different input sizes, parameters and anchor boxes. Yes-Net runs at 39 FPS on a Titan X Pascal with 416 x 416 images. | 0 | 0.32725 |
FACEALL_BUPT | Ensembles of two models | 0 | 0.288323 |
ReadSense | use val dataset to train | 0 | 0.282997 |
BUPT-PRIV | resnext101-fpn single model | 0 | 0.260561 |
BUPT-PRIV | ensemble 4 models | 0 | 0.254045 |
BUPT-PRIV | ensemble 4 models | 0 | 0.25175 |
XBENTURY-THU | Ensemble three models B | 0 | 0.243063 |
glee | Submission1: baseline with frcnn | 0 | 0.221268 |
YYF | --- | 0 | 0.20668 |
NUS-Qihoo_DPNs (DET) | Training with val2 data | --- | 0.657634 |
NUS-Qihoo_DPNs (DET) | Baseline ensemble of DPN models | --- | 0.645912 |
NUS-Qihoo_DPNs (DET) | DPN models with class-wise weights | --- | 0.645023 |
Sheep | based on YOLO | --- | 0.405691 |
Sheep | based on vgg_16 | --- | 0.369627 |
Ordered by mean average precision
Team name | Entry description | mean AP | Number of object categories won |
BDAT | submission3 | 0.732227 | 65 |
BDAT | submission4 | 0.731392 | 85 |
BDAT | submission2 | 0.723712 | 30 |
NUS-Qihoo_DPNs (DET) | Training with val2 data | 0.657634 | --- |
NUS-Qihoo_DPNs (DET) | Ensemble of DPN models | 0.656932 | 9 |
NUS-Qihoo_DPNs (DET) | Baseline ensemble of DPN models | 0.645912 | --- |
NUS-Qihoo_DPNs (DET) | DPN models with class-wise weights | 0.645023 | --- |
KAISTNIA_ETRI | Ensemble Model5 | 0.61022 | 1 |
KAISTNIA_ETRI | Ensemble Model4 | 0.609402 | 0 |
KAISTNIA_ETRI | Ensemble Model2 | 0.608299 | 0 |
KAISTNIA_ETRI | Ensemble Model1 | 0.608278 | 0 |
KAISTNIA_ETRI | Ensemble Model3 | 0.60631 | 0 |
DeepView(ETRI) | Ensemble_A | 0.593084 | 10 |
DeepView(ETRI) | Single model A using ResNet for detection | 0.587519 | 0 |
QINIU-ATLAB+CAS-SARI | Ensemble 3 | 0.526601 | 0 |
QINIU-ATLAB+CAS-SARI | Ensemble 4 | 0.523595 | 0 |
QINIU-ATLAB+CAS-SARI | Ensemble 2 | 0.518426 | 0 |
QINIU-ATLAB+CAS-SARI | Ensemble 5 | 0.51703 | 0 |
QINIU-ATLAB+CAS-SARI | Ensemble 1 | 0.515529 | 0 |
FACEALL_BUPT | faster-rcnn with soft-nms | 0.499586 | 0 |
PaulForDream | YOLOv2 608*608 | 0.494288 | 0 |
Sheep | based on ssd_512 | 0.459223 | 0 |
Sheep | according to YOLO and SSD | 0.43453 | 0 |
FACEALL_BUPT | single rfcn with class-aware model baseline | 0.433957 | 0 |
FACEALL_BUPT | Ensembles of three models | 0.416489 | 0 |
Sheep | based on YOLO | 0.405691 | --- |
Sheep | based on ssd_300 | 0.401165 | 0 |
XBENTURY-THU | Ensemble three models A | 0.392799 | 0 |
Sheep | based on vgg_16 | 0.369627 | --- |
Handsight | 3 Yes-Net models with different input sizes, parameters and anchor boxes. Yes-Net runs at 39 FPS on a Titan X Pascal with 416 x 416 images. | 0.32725 | 0 |
FACEALL_BUPT | Ensembles of two models | 0.288323 | 0 |
ReadSense | use val dataset to train | 0.282997 | 0 |
BUPT-PRIV | resnext101-fpn single model | 0.260561 | 0 |
BUPT-PRIV | ensemble 4 models | 0.254045 | 0 |
BUPT-PRIV | ensemble 4 models | 0.25175 | 0 |
XBENTURY-THU | Ensemble three models B | 0.243063 | 0 |
glee | Submission1: baseline with frcnn | 0.221268 | 0 |
YYF | --- | 0.20668 | 0 |
Task 1b: Object detection with additional training data
Ordered by number of categories won
Team name | Entry description | Description of outside data used | Number of object categories won | mean AP |
BDAT | submission5 | refine part of training data annotation | 128 | 0.731613 |
BDAT | submission1 | refine part of the training data annotation | 58 | 0.725563 |
NUS-Qihoo_DPNs (DET) | Ensemble of DPN models with extra data | Self collected images | 14 | 0.657609 |
Ordered by mean average precision
Team name | Entry description | Description of outside data used | mean AP | Number of object categories won |
BDAT | submission5 | refine part of training data annotation | 0.731613 | 128 |
BDAT | submission1 | refine part of the training data annotation | 0.725563 | 58 |
NUS-Qihoo_DPNs (DET) | Ensemble of DPN models with extra data | Self collected images | 0.657609 | 14 |
Object localization (LOC)
Task 2a: Classification+localization with provided training data
Ordered by localization error
Team name | Entry description | Localization error | Classification error |
NUS-Qihoo_DPNs (CLS-LOC) | [E3] LOC:: Dual Path Networks + Basic Ensemble | 0.062263 | 0.03413 |
Trimps-Soushen | Result-3 | 0.064991 | 0.02481 |
Trimps-Soushen | Result-2 | 0.06525 | 0.02481 |
Trimps-Soushen | Result-4 | 0.065261 | 0.02481 |
Trimps-Soushen | Result-5 | 0.065302 | 0.02481 |
Trimps-Soushen | Result-1 | 0.067698 | 0.02481 |
BDAT | provide_box | 0.081392 | 0.03158 |
BDAT | provide_class | 0.086942 | 0.02962 |
NUS-Qihoo_DPNs (CLS-LOC) | [E2] CLS:: Dual Path Networks + Basic Ensemble | 0.088093 | 0.0274 |
NUS-Qihoo_DPNs (CLS-LOC) | [E1] CLS:: Dual Path Networks + Basic Ensemble | 0.088269 | 0.02744 |
BDAT | provide_box_former | 0.090593 | 0.03461 |
SIIT_KAIST-SKT | ensemble 2 | 0.128924 | 0.03226 |
SIIT_KAIST-SKT | a weighted geometric ensemble | 0.128996 | 0.03442 |
SIIT_KAIST-SKT | ensemble 1 | 0.129028 | 0.0345 |
FACEALL_BUPT | Ensembles of 6 models for classification, single model for localization | 0.223947 | 0.04574 |
FACEALL_BUPT | Ensembles of 5 models for classification, single model for localization | 0.227494 | 0.04918 |
FACEALL_BUPT | Ensembles of 3 models for classification, single model for localization | 0.227619 | 0.04574 |
FACEALL_BUPT | single model for classification, single model for localization | 0.239185 | 0.06486 |
FACEALL_BUPT | Ensembles of 4 models for classification, single model for localization | 0.239351 | 0.06486 |
MCPRL_BUPT | For classification, we merge two models, the top-5 cls-error on validation is 0.0475. For localization, we use a single Faster RCNN model with VGG-16, the top-5 cls-loc error on validation is 0.2905. | 0.289404 | 0.04752 |
MCPRL_BUPT | Two models for classification, localization model is fixed. The top-5 cls-only error on validation is 0.0481. The top-5 cls-loc error on validation is 0.2907. | 0.29012 | 0.04857 |
MCPRL_BUPT | Two models for classification, one single model for localization, in which 1.5x context information and OHEM are adopted. | 0.291562 | 0.04752 |
connected | Inception + ResNet + SSD with our selective box algorithm 3 | 0.310318 | 0.17374 |
connected | Inception + ResNet + SSD with our selective box algorithm 2 | 0.314498 | 0.17374 |
connected | Inception + ResNet + SSD with our selective box algorithm 1 | 0.323928 | 0.17374 |
WMW | Ensemble C [No bounding box results] | 0.590987 | 0.02251 |
WMW | Ensemble E [No bounding box results] | 0.591018 | 0.02258 |
WMW | Ensemble D [No bounding box results] | 0.591039 | 0.0227 |
WMW | Ensemble B [No bounding box results] | 0.59106 | 0.0227 |
WMW | Ensemble A [No bounding box results] | 0.591153 | 0.0227 |
MIL_UT | Ensemble of 9 models (classification-only) | 0.596164 | 0.03205 |
MIL_UT | Ensemble of 10 models (classification-only) | 0.596174 | 0.03228 |
zlm | ten crops, no location | 0.999803 | 0.99631 |
MPG_UT | Ensemble of 7 models. Ten-crop inference. Without localization. | 1.0 | 0.04324 |
MPG_UT | Single Model of Partial Bilinear Pooling. Ten-crop inference. Without localization. | 1.0 | 0.04498 |
MPG_UT | Single Model of Branched Bilinear Pooling. Multi resolutions and crops test. Without localization. | 1.0 | 0.99675 |
MPG_UT | Ensemble of entry 3. Without localization. | 1.0 | 0.99702 |
MPG_UT | Ensemble of entry 2 and entry 4. Without localization. | 1.0 | 0.99767 |
Ordered by classification error
Team name | Entry description | Classification error | Localization error |
WMW | Ensemble C [No bounding box results] | 0.02251 | 0.590987 |
WMW | Ensemble E [No bounding box results] | 0.02258 | 0.591018 |
WMW | Ensemble A [No bounding box results] | 0.0227 | 0.591153 |
WMW | Ensemble D [No bounding box results] | 0.0227 | 0.591039 |
WMW | Ensemble B [No bounding box results] | 0.0227 | 0.59106 |
Trimps-Soushen | Result-1 | 0.02481 | 0.067698 |
Trimps-Soushen | Result-2 | 0.02481 | 0.06525 |
Trimps-Soushen | Result-3 | 0.02481 | 0.064991 |
Trimps-Soushen | Result-4 | 0.02481 | 0.065261 |
Trimps-Soushen | Result-5 | 0.02481 | 0.065302 |
NUS-Qihoo_DPNs (CLS-LOC) | [E2] CLS:: Dual Path Networks + Basic Ensemble | 0.0274 | 0.088093 |
NUS-Qihoo_DPNs (CLS-LOC) | [E1] CLS:: Dual Path Networks + Basic Ensemble | 0.02744 | 0.088269 |
BDAT | provide_class | 0.02962 | 0.086942 |
BDAT | provide_box | 0.03158 | 0.081392 |
MIL_UT | Ensemble of 9 models (classification-only) | 0.03205 | 0.596164 |
SIIT_KAIST-SKT | ensemble 2 | 0.03226 | 0.128924 |
MIL_UT | Ensemble of 10 models (classification-only) | 0.03228 | 0.596174 |
NUS-Qihoo_DPNs (CLS-LOC) | [E3] LOC:: Dual Path Networks + Basic Ensemble | 0.03413 | 0.062263 |
SIIT_KAIST-SKT | a weighted geometric ensemble | 0.03442 | 0.128996 |
SIIT_KAIST-SKT | ensemble 1 | 0.0345 | 0.129028 |
BDAT | provide_box_former | 0.03461 | 0.090593 |
MPG_UT | Ensemble of 7 models. Ten-crop inference. Without localization. | 0.04324 | 1.0 |
MPG_UT | Single Model of Partial Bilinear Pooling. Ten-crop inference. Without localization. | 0.04498 | 1.0 |
FACEALL_BUPT | Ensembles of 3 models for classification, single model for localization | 0.04574 | 0.227619 |
FACEALL_BUPT | Ensembles of 6 models for classification, single model for localization | 0.04574 | 0.223947 |
MCPRL_BUPT | For classification, we merge two models, the top-5 cls-error on validation is 0.0475. For localization, we use a single Faster RCNN model with VGG-16, the top-5 cls-loc error on validation is 0.2905. | 0.04752 | 0.289404 |
MCPRL_BUPT | Two models for classification, one single model for localization, in which 1.5x context information and OHEM are adopted. | 0.04752 | 0.291562 |
MCPRL_BUPT | Two models for classification, localization model is fixed. The top-5 cls-only error on validation is 0.0481. The top-5 cls-loc error on validation is 0.2907. | 0.04857 | 0.29012 |
FACEALL_BUPT | Ensembles of 5 models for classification, single model for localization | 0.04918 | 0.227494 |
FACEALL_BUPT | Ensembles of 4 models for classification, single model for localization | 0.06486 | 0.239351 |
FACEALL_BUPT | single model for classification, single model for localization | 0.06486 | 0.239185 |
connected | Inception + ResNet + SSD with our selective box algorithm 1 | 0.17374 | 0.323928 |
connected | Inception + ResNet + SSD with our selective box algorithm 2 | 0.17374 | 0.314498 |
connected | Inception + ResNet + SSD with our selective box algorithm 3 | 0.17374 | 0.310318 |
zlm | ten crops, no location | 0.99631 | 0.999803 |
MPG_UT | Single Model of Branched Bilinear Pooling. Multi resolutions and crops test. Without localization. | 0.99675 | 1.0 |
MPG_UT | Ensemble of entry 3. Without localization. | 0.99702 | 1.0 |
MPG_UT | Ensemble of entry 2 and entry 4. Without localization. | 0.99767 | 1.0 |
Task 2b: Classification+localization with additional training data
Ordered by localization error
Team name | Entry description | Description of outside data used | Localization error | Classification error |
NUS-Qihoo_DPNs (CLS-LOC) | [E5] LOC:: Dual Path Networks + Basic Ensemble | ImageNet-5k & Self collected images (about 33k) | 0.061941 | 0.03378 |
BDAT | extra_box | box annotation from DET | 0.087533 | 0.03085 |
NUS-Qihoo_DPNs (CLS-LOC) | [E4] CLS:: Dual Path Networks + Basic Ensemble | ImageNet-5k & Self collected images (about 33k) | 0.087948 | 0.02713 |
BDAT | extra_class | box annotation from DET | 0.091081 | 0.02997 |
Ordered by classification error
Team name | Entry description | Description of outside data used | Classification error | Localization error |
NUS-Qihoo_DPNs (CLS-LOC) | [E4] CLS:: Dual Path Networks + Basic Ensemble | ImageNet-5k & Self collected images (about 33k) | 0.02713 | 0.087948 |
BDAT | extra_class | box annotation from DET | 0.02997 | 0.091081 |
BDAT | extra_box | box annotation from DET | 0.03085 | 0.087533 |
NUS-Qihoo_DPNs (CLS-LOC) | [E5] LOC:: Dual Path Networks + Basic Ensemble | ImageNet-5k & Self collected images (about 33k) | 0.03378 | 0.061941 |
Object detection from video (VID)
Task 3a: Object detection from video with provided training data
Ordered by number of categories won
Team name | Entry description | Number of object categories won | mean AP |
IC&USYD | provide_submission3 | 15 | 0.817265 |
IC&USYD | provide_submission1 | 6 | 0.808847 |
IC&USYD | provide_submission2 | 4 | 0.818309 |
NUS-Qihoo-UIUC_DPNs (VID) | no_extra + seq + mca + mcs | 3 | 0.757772 |
NUS-Qihoo-UIUC_DPNs (VID) | no_extra + seq + vcm + mcs | 1 | 0.757853 |
NUS-Qihoo-UIUC_DPNs (VID) | Faster RCNN + Video Context | 1 | 0.748493 |
THU-CAS | merge-new | 0 | 0.730498 |
THU-CAS | old-new | 0 | 0.728707 |
THU-CAS | new-new | 0 | 0.691423 |
GoerVision | Deformable R-FCN single model+ResNet101 | 0 | 0.669631 |
GoerVision | Ensemble of 2 models, using ResNet101 as the fundamental classification network and Deformable R-FCN to detect video frames, with multi-scale testing | 0 | 0.665693 |
GoerVision | To train the video object detector, we use ResNet101 and Deformable R-FCN for detection. | 0 | 0.655686 |
GoerVision | Using R-FCN to detect video objects, with multi-scale testing applied. | 0 | 0.646965 |
FACEALL_BUPT | SSD based on Resnet101 networks | 0 | 0.195754 |
Ordered by mean average precision
Team name | Entry description | mean AP | Number of object categories won |
IC&USYD | provide_submission2 | 0.818309 | 4 |
IC&USYD | provide_submission3 | 0.817265 | 15 |
IC&USYD | provide_submission1 | 0.808847 | 6 |
NUS-Qihoo-UIUC_DPNs (VID) | no_extra + seq + vcm + mcs | 0.757853 | 1 |
NUS-Qihoo-UIUC_DPNs (VID) | no_extra + seq + mca + mcs | 0.757772 | 3 |
NUS-Qihoo-UIUC_DPNs (VID) | Faster RCNN + Video Context | 0.748493 | 1 |
THU-CAS | merge-new | 0.730498 | 0 |
THU-CAS | old-new | 0.728707 | 0 |
THU-CAS | new-new | 0.691423 | 0 |
GoerVision | Deformable R-FCN single model+ResNet101 | 0.669631 | 0 |
GoerVision | Ensemble of 2 models, using ResNet101 as the fundamental classification network and Deformable R-FCN to detect video frames, with multi-scale testing | 0.665693 | 0 |
GoerVision | To train the video object detector, we use ResNet101 and Deformable R-FCN for detection. | 0.655686 | 0 |
GoerVision | Using R-FCN to detect video objects, with multi-scale testing applied. | 0.646965 | 0 |
FACEALL_BUPT | SSD based on Resnet101 networks | 0.195754 | 0 |
Task 3b: Object detection from video with additional training data
Ordered by number of categories won
Team name | Entry description | Description of outside data used | Number of object categories won | mean AP |
IC&USYD | extra_submission2 | region proposal trained from DET and COCO | 24 | 0.819339 |
NUS-Qihoo-UIUC_DPNs (VID) | extra + seq + vcm + mcs | self-collected images | 3 | 0.760252 |
GoerVision | pre-trained model from COCO detection dataset. | coco detection dataset | 2 | 0.68817 |
NUS-Qihoo-UIUC_DPNs (VID) | Faster RCNN + Video Context | Self collected images | 1 | 0.751525 |
IC&USYD | extra_submission1 | region proposal trained from DET and COCO | 0 | 0.797295 |
Ordered by mean average precision
Team name | Entry description | Description of outside data used | mean AP | Number of object categories won |
IC&USYD | extra_submission2 | region proposal trained from DET and COCO | 0.819339 | 24 |
IC&USYD | extra_submission1 | region proposal trained from DET and COCO | 0.797295 | 0 |
NUS-Qihoo-UIUC_DPNs (VID) | extra + seq + vcm + mcs | self-collected images | 0.760252 | 3 |
NUS-Qihoo-UIUC_DPNs (VID) | Faster RCNN + Video Context | Self collected images | 0.751525 | 1 |
GoerVision | pre-trained model from COCO detection dataset. | coco detection dataset | 0.68817 | 2 |
Task 3c: Object detection/tracking from video with provided training data
Team name | Entry description | mean AP |
IC&USYD | provide_submission2 | 0.641474 |
IC&USYD | provide_submission1 | 0.544835 |
NUS-Qihoo-UIUC_DPNs (VID) | Model3 | 0.544536 |
THU-CAS | track-merge+new | 0.511627 |
NUS-Qihoo-UIUC_DPNs (VID) | track_no_extra + mcs | 0.510381 |
THU-CAS | track-old+new | 0.476636 |
THU-CAS | track-new+new | 0.469237 |
FACEALL_BUPT | SSD based on ResNet101, ECO tracking, and clustering of bounding boxes with different confidences | 0.063858 |
Task 3d: Object detection/tracking from video with additional training data
Team name | Entry description | Description of outside data used | mean AP |
IC&USYD | extra_submission2 | region proposal trained from DET and COCO | 0.642935 |
IC&USYD | extra_submission1 | region proposal trained from DET and COCO | 0.57749 |
NUS-Qihoo-UIUC_DPNs (VID) | Model2 | Self collected images | 0.550078 |
NUS-Qihoo-UIUC_DPNs (VID) | track_extra + mcs | self collected images | 0.530889 |
NUS-Qihoo-UIUC_DPNs (VID) | Model1 | Self collected images | 0.530137 |
Team information
Team name | Team members | Abstract |
BDAT | Hui Shuai(1), Zhenbo Yu(1), Qingshan Liu(1), Xiaotong Yuan(1), Kaihua Zhang(1), Yisheng Zhu(1), Guangcan Liu(1), Jing Yang(1), Yuxiang Zhou(2), Jiankang Deng(2), (1)Nanjing University of Information Science & Technology, (2)Imperial College London | Adaptive attention [1] and deep combined convolutional models [2,3] are used for the LOC task.
Scale [4,5,6], context [7], sampling and deep combined convolutional networks [2,3] are considered for the DET task. Object density estimation is used for score re-ranking. [1] Wang F, Jiang M, Qian C, et al. Residual Attention Network for Image Classification[J]. arXiv preprint arXiv:1704.06904, 2017. [2] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE conference on computer vision and pattern recognition. 2016: 770-778. [3] Szegedy C, Ioffe S, Vanhoucke V, et al. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning[C]//AAAI. 2017: 4278-4284. [4] Ronneberger O, Fischer P, Brox T. U-net: Convolutional networks for biomedical image segmentation[J]. arXiv preprint arXiv:1505.04597, 2015. [5] Lin T Y, Dollár P, Girshick R, et al. Feature pyramid networks for object detection[J]. arXiv preprint arXiv:1612.03144, 2016. [6] Shrivastava A, Sukthankar R, Malik J, et al. Beyond skip connections: Top-down modulation for object detection[J]. arXiv preprint arXiv:1612.06851, 2016. [7] Zeng X, Ouyang W, Yan J, et al. Crafting GBD-Net for Object Detection[J]. arXiv preprint arXiv:1610.02579, 2016. |
BUPT-PRIV | Lu Yang (BUPT)
Zhiwei Liu (UCAS) Qing Song (BUPT) |
(1) We present a method that combines the feature pyramid network (FPN) with a basic Faster R-CNN system. FPN is a top-down architecture with lateral connections, developed for building high-level semantic feature maps at all scales.
(2) ResNeXt-101 is used for feature extraction in our object detection system; it is a simple, modularized multi-way extension of ResNet for ImageNet classification. This pretrained model brings a non-trivial improvement on the validation set. (3) During training and inference, we use a multi-scale strategy to handle objects with large scale variation, which further increases mAP by ~3 points on the validation set. References [1] Tsung-Yi Lin, Piotr Dollar, Ross Girshick, Kaiming He, et al. "Feature Pyramid Networks for Object Detection" in CVPR, 2017 [2] Saining Xie, Ross Girshick, Piotr Dollar, et al. "Aggregated Residual Transformations for Deep Neural Networks" in CVPR 2017 [3] Ren S, He K, Girshick R, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks" Advances in neural information processing systems. 2015 |
connected | Hitoshi Hongo (Connected)
Tetsuya Hada (Connected) Kazuki Fukui (Connected) Yoshihiro Mori (Connected) |
In our method, we first predict labels by an ensemble model of ResNet-50 and Inception v3.
We used a single shot multibox detector, whose weights were pre-trained on the PASCAL VOC dataset and fine-tuned for ILSVRC, to generate bounding boxes. We applied a box-selection strategy as follows: the joint distributions of bbox width and height were learned for each class using kernel density estimation, and our strategy considers both bbox confidences and the likelihoods of the shapes of the candidate bboxes. In addition, boxes of similar size and location were suppressed so that we can find objects of various sizes in the images. |
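For reference, here is a minimal sketch of the shape-based rescoring idea described in the connected entry above: a per-class kernel density estimate over training-box (width, height) pairs is blended with the detector confidence. The function names, the normalisation and the 50/50 blend are assumptions for illustration, not the team's implementation.

```python
import numpy as np
from scipy.stats import gaussian_kde

def fit_shape_kde(train_boxes_wh):
    """Fit a KDE over the (width, height) pairs of training boxes for one class."""
    # gaussian_kde expects data of shape (n_dims, n_samples)
    return gaussian_kde(np.asarray(train_boxes_wh, dtype=float).T)

def rescore(cand_boxes, cand_scores, kde, alpha=0.5):
    """Blend detector confidence with the shape likelihood (the blend weight is an assumption).
    cand_boxes: (N, 4) array of x1, y1, x2, y2; cand_scores: (N,) confidences."""
    wh = np.stack([cand_boxes[:, 2] - cand_boxes[:, 0],
                   cand_boxes[:, 3] - cand_boxes[:, 1]], axis=1)
    shape_lik = kde(wh.T)                               # likelihood of each (w, h) pair
    shape_lik = shape_lik / (shape_lik.max() + 1e-12)   # normalise to [0, 1]
    return alpha * cand_scores + (1.0 - alpha) * shape_lik
```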
FACEALL_BUPT | Jinchang XU, BUPT, CHINA
Zhiqun HE, BUPT, CHINA Xuankun HUANG, BUPT, CHINA Jiangqi ZHANG, BUPT, CHINA Qiushan GUO, BUPT, CHINA Zesang HUANG, BUPT, CHINA Yu LEI, BUPT, CHINA Lei LI, BUPT, CHINA Yingruo FAN, BUPT, CHINA Junfei ZHUANG, BUPT, CHINA Fengye XIONG, BUPT, CHINA Hongliang BAI, Beijing Faceall co., LTD Wenjian FENG, Beijing Faceall co., LTD Yuan DONG, BUPT, CHINA |
Object Detection =============== We employ the well-known “Faster-RCNN + ResNet-101” framework [1] and the “RFCN + ResNet-101” framework [2]. We use a ResNet-101 model pre-trained on the CLS-LOC dataset with image-level annotations. For the first method, we utilize Faster-RCNN with the publicly available ResNet-101. We adopt multi-scale ROIs to obtain features containing richer context information. For testing, Soft-NMS [3] is integrated with Faster-RCNN, and we additionally use 3 scales and merge the results using the simple strategy introduced last year. No validation data is used for training, and flipped images are used in only a third of the training epochs. For the second method, we tried two variants: class-aware and class-agnostic. While the RFCN paper used the class-agnostic setting to train RFCN, we find that the class-aware setting achieves better performance than the class-agnostic one. In the first stage, we trained on the object detection dataset with OHEM for 200k iterations with a learning rate of 0.001. In the next stage, we reduced the RPN loss weight from 1 to 0.5 and trained for another 80k iterations. In the final stage, we further reduced the RPN loss weight from 0.5 to 0.2 for 120k iterations. Object Classification/Localization =============== For the classification and localization part, we tried multiple methods to improve the classification accuracy and the mean average precision of localization. For the classification part, we use four pre-trained models fine-tuned on train-val data. For testing, multiple crops are randomly sampled from an image or its horizontal flip, the scores are averaged at multiple scales (images are resized such that the shorter side is in {224,256,288,384}), and finally we ensemble the models at the inference stage. For the localization part, the models are initialized by the ImageNet classification models and then fine-tuned on the object-level annotations of the 1000 classes. We utilize a class-agnostic strategy to learn bounding box regression, and the generated regions are classified by the fine-tuned model into one of 1001 classes. Finally, we also adopt multiple scales in testing and ensemble the results of several models. Object detection from video =============== In this work, we use a variant of SSD [5] with ResNet [1] for the detection task. The overall training of the detection network follows a procedure similar to [5]. For the visual tracking part, we use ECO [6] to track the objects from detections every 5 frames, and we also cluster the detections with different confidences.
[1] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2017, 39(6):1137-1149. [2] Dai J, Li Y, He K, et al. R-FCN: Object Detection via Region-based Fully Convolutional Networks[J]. 2016. [3] Bodla N, Singh B, Chellappa R, et al. Improving Object Detection With One Line of Code[J]. 2017. [5] Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector[J]. arXiv preprint arXiv:1512.02325, 2015. [6] Danelljan M, Bhat G, Khan F S, et al. ECO: Efficient Convolution Operators for Tracking[J]. arXiv preprint arXiv:1611.09224, 2016. |
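Since both FACEALL_BUPT above and DeepView(ETRI) further down report using Soft-NMS [Bodla et al.] at test time, here is a small illustrative sketch of the Gaussian variant in Python/NumPy; the sigma and score threshold are placeholder values, not the settings either team used.

```python
import numpy as np

def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: decay the scores of overlapping boxes instead of
    discarding them. boxes: (N, 4) as x1, y1, x2, y2; scores: (N,)."""
    boxes, scores = boxes.astype(float).copy(), scores.astype(float).copy()
    keep = []
    while len(scores) > 0:
        i = int(np.argmax(scores))
        keep.append((boxes[i], scores[i]))
        ref = boxes[i]
        boxes = np.delete(boxes, i, axis=0)
        scores = np.delete(scores, i)
        if len(scores) == 0:
            break
        # IoU of the remaining boxes with the selected one
        xx1 = np.maximum(ref[0], boxes[:, 0]); yy1 = np.maximum(ref[1], boxes[:, 1])
        xx2 = np.minimum(ref[2], boxes[:, 2]); yy2 = np.minimum(ref[3], boxes[:, 3])
        inter = np.clip(xx2 - xx1, 0, None) * np.clip(yy2 - yy1, 0, None)
        area_ref = (ref[2] - ref[0]) * (ref[3] - ref[1])
        area_rest = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
        iou = inter / (area_ref + area_rest - inter + 1e-12)
        scores = scores * np.exp(-(iou ** 2) / sigma)   # Gaussian score decay
        mask = scores > score_thresh                    # drop near-zero scores
        boxes, scores = boxes[mask], scores[mask]
    return keep
```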
glee | gwang-gook lee | Faster R-CNN is used as the detection method with ResNet-152 as a base network.
Several techniques, such as atrous convolution and box refinement, are used to improve performance. Model ensembling is not used. |
GoerVision | Yejin Chen(1,2), Chunshan Bai(1,5), Zhuo Chen(1), Le Ge(4), Chengwei Li(4), Shuo Xu(5), Yuxuan Bao(6), Lu Bai(1), Xinyi Sun(1), Shun Yuan(1), Xiangdong Zhang(1,3)
(1)Goertek Inc. (2)Nanyang Technological University (3)Tsing Hua University (4)University of California, Berkeley (5)Beihang University (6)Michigan University, Ann Arbor |
Our team utilizes two image object detection architectures, namely Faster R-CNN [1] and Deformable R-FCN [2], for the task of object detection. To train the video object detection model, we use ResNet101 [5] as the fundamental classification network, and we adopt Deformable R-FCN [3] and Faster R-CNN to detect objects in still images. Multi-context suppression and motion-guided propagation are used to refine the per-frame detections. The COCO detection dataset is used to pretrain the detection model [7].
For testing, we utilize multi-scale testing and model ensembling at the inference stage. We use MXNet [4] as the framework because it is flexible and efficient. Due to time constraints, we were not able to use Inception-ResNet-v2 [6] or tracking in the VID task, which could significantly improve the resulting mAP. References: [1] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, "Faster R-CNN: Towards real-time object detection with region proposal networks" Advances in neural information processing systems. 2015. [2] Dai, Jifeng, Yi Li, Kaiming He, Jian Sun. "R-FCN: Object Detection via Region-based Fully Convolutional Networks." arXiv preprint arXiv:1605.06409 (2016). [3] Yuwen Xiong, Haozhi Qi, Guodong Zhang, Yi Li, Jifeng Dai, Bin Xiao, Han Hu and Yichen Wei. "Deformable Convolutional Networks", arXiv:1703.06211, Jun 2017. [4] Tianqi Chen, Mu Li, Yutian Li, Min Lin, Naiyan Wang, Minjie Wang, Tianjun Xiao, Bing Xu, Chiyuan Zhang, and Zheng Zhang. MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems. In Neural Information Processing Systems, Workshop on Machine Learning Systems, 2015. [5] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015). [6] Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. "Inception-v4, inception-resnet and the impact of residual connections on learning." arXiv preprint arXiv:1602.07261 (2016). [7] Kaiming He, Georgia Gkioxari, Piotr Dollar, Ross Girshick, "Mask R-CNN", Tech Report, arXiv:1703.06870, Mar 2017. |
Handsight | Liangzhuang Ma
Xin Kan Qianjiang Xiao Wenlong Liu Peiqin Sun |
Our submission is based on a new model named Yes-Net [1], which reaches 39 FPS on 416 x 416 images (running on a Titan X Pascal). The starting point of the work is YOLO9000 [2].
The novel contributions of Yes-Net: (1) It combines local information with global information by adding an RNN to the CNN base feature extractor. (2) Instead of selecting a fixed number of anchor boxes with fixed shapes and fixed center points in each grid cell, we use k-means to cluster N anchor boxes over the whole training set; every anchor box has its own shape and center. (3) We propose a novel method that uses an RNN instead of NMS to select the output boxes, which improves the generalization ability of our detector. We argue that this method can also be adopted by other detectors. A paper describing the details of Yes-Net is available on arXiv (1706.09180). References: [1] Liangzhuang Ma, Xin Kan. Yes-Net: An effective Detector Based on Global Information. arXiv preprint arXiv:1706.09180, 2017.6 [2] Redmon J, Farhadi A. YOLO9000: Better, Faster, Stronger. arXiv preprint arXiv:1612.08242v1, 2016.9 |
IC&USYD | Jiankang Deng(1), Yuxiang Zhou(1), Baosheng Yu(2), Zhe Chen(2), Stefanos Zafeiriou(1), Dacheng Tao(2), (1)Imperial College London, (2)University of Sydney | Flow acceleration[1,2] is used. Final scores are adaptively chosen between the detector and tracker.
[1] Deep Feature Flow for Video Recognition Xizhou Zhu, Yuwen Xiong, Jifeng Dai, Lu Yuan, and Yichen Wei, IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. [2] Flow-Guided Feature Aggregation for Video Object Detection, Xizhou Zhu, Yujie Wang, Jifeng Dai, Lu Yuan, and Yichen Wei. Arxiv tech report, 2017. |
KAISTNIA_ETRI | Keun Dong Lee(ETRI) Seungjae Lee(ETRI) JongGook Ko (ETRI) Jaehyung Kim* (KAIST) Jun Hyun Nam* (KAIST) Jinwoo Shin (KAIST) (* indicates equal contribution) | In this work, we consider ensembles of variants of GBD-Net [1] with ResNet [2] for the detection task. For maximizing the ensemble effect, we design variants using (a) various depth/width for multi-region networks (without GBD), (b) fusion of different layer feature maps with weighted additions (with GBD), (c) different pooling methods considering various region shapes (with GBD), and (d) new loss function incorporating network confidence (without GBD). For RPN, we trained cascade RPN [1] and also performed iterative inference procedures to improve its performance.
[1] Xingyu Zeng et al. “Crafting GBD-Net for Object Detection.” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2017. [2] Kaiming He, et al. “Deep residual learning for image recognition.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. |
MCPRL_BUPT | Yuanyuan Li
Yang Bai |
We participate in the classification and localization task. Our classification models are mainly based on Inception-ResNet-v2 and Inception-v4 [1], and the localization framework is Faster RCNN [2].
We make the following improvements. (1) For the classification part, we train Inception-ResNet-v2 and Inception-v4 networks, and model ensembling is adopted to obtain better performance. (2) For the localization part, we use the Faster RCNN framework with VGG-16 [3] as the backbone. To balance the samples, we sample the training images uniformly across classes. We cluster the annotations to pre-set more appropriate hyper-parameters. Finally, 1.5x contextual information around the RoIs and Online Hard Example Mining (OHEM) [4] are used to locate the objects more precisely. [1] Szegedy, Christian, et al. "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning." AAAI. 2017. [2] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015. [3] Simonyan, Karen, and Andrew Zisserman. "Very deep convolutional networks for large-scale image recognition." arXiv preprint arXiv:1409.1556 (2014). [4] Shrivastava, Abhinav, Abhinav Gupta, and Ross Girshick. "Training region-based object detectors with online hard example mining." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016. |
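The OHEM step mentioned above (backpropagating only the hardest RoIs) can be sketched as follows; this is a simplified PyTorch-style illustration with a placeholder keep_num, not MCPRL_BUPT's actual code.

```python
import torch
import torch.nn.functional as F

def ohem_classification_loss(cls_logits, labels, keep_num=128):
    """Online hard example mining sketch: compute the per-RoI loss, keep only the
    hardest `keep_num` RoIs, and average their loss for backpropagation."""
    per_roi_loss = F.cross_entropy(cls_logits, labels, reduction="none")  # (num_rois,)
    keep_num = min(keep_num, per_roi_loss.numel())
    hard_loss, _ = torch.topk(per_roi_loss, keep_num)  # largest losses = hardest examples
    return hard_loss.mean()
```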
MIL_UT | Yuji Tokozume (the Univ. of Tokyo), Kosuke Arase (the Univ. of Tokyo), Yoshitaka Ushiku (the Univ. of Tokyo), Tatsuya Harada (the Univ. of Tokyo/RIKEN) | Ensemble of ResNets [1, 2] and ResNeXts [3] trained with a novel learning method.
[1] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016. [2] K. He, X. Zhang, S. Ren, and J. Sun. Identity Mappings in Deep Residual Networks. In ECCV, 2016. [3] S. Xie, R. Girshick, P. Dollar, Z. Tu, and K. He. Aggregated Residual Transformations for Deep Neural Networks. In CVPR, 2017. |
MPG_UT | Hideki Nakayama, Jiren Jin and Ikki Kishida.
The University of Tokyo |
We utilized pretrained ResNeXt [1] and ResNet [2] models. Our models replace global average pooling with bilinear pooling [3] with log-Euclidean normalization.
To address the problem of applying bilinear pooling to high-dimensional inputs, we built two types of models and ensembled the results for submission. 1. The last fully connected layer and global average pooling of a pretrained ResNeXt [1] were replaced by branched bottlenecks, bilinear pooling [3] with log-Euclidean normalization, and a new fully connected layer. Due to the difficulty of end-to-end learning, we conducted stage-wise training. 2. Partial Bilinear Pooling: the idea is to move the bottleneck of the representation capacity to a later phase of the model. The outputs of the last convolutional layer are separated into several sub-groups (along the filter dimension), and bilinear pooling is applied to each sub-group. The sub-groups are combined by summation. We also add a batch normalization layer after the combination of sub-groups, which improves the learning efficiency and final performance. [1] Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu and Kaiming He. Aggregated Residual Transformations for Deep Neural Networks. CVPR, 2017. [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. Deep Residual Learning for Image Recognition. CVPR, 2016. [3] Tsung-Yu Lin, Aruni RoyChowdhury and Subhransu Maji. Bilinear CNN Models for Fine-grained Visual Recognition. ICCV, 2015. |
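A rough NumPy sketch of the two ideas MPG_UT describe: second-order (bilinear) pooling with a log-Euclidean (matrix-logarithm) normalisation, and a "partial" variant that pools channel sub-groups separately and sums the results. The group count, epsilon and final L2 normalisation are assumptions, not the team's settings.

```python
import numpy as np

def bilinear_pool_log_euclidean(feat, eps=1e-6):
    """feat: (C, H, W) conv feature map -> L2-normalised vector of the
    log-Euclidean-normalised second-order (bilinear) descriptor."""
    c, h, w = feat.shape
    x = feat.reshape(c, h * w)
    gram = x @ x.T / (h * w)                              # (C, C) bilinear pooling
    vals, vecs = np.linalg.eigh(gram + eps * np.eye(c))   # SPD after regularisation
    log_gram = (vecs * np.log(vals)) @ vecs.T             # matrix logarithm via eigendecomposition
    v = log_gram.reshape(-1)
    return v / (np.linalg.norm(v) + 1e-12)

def partial_bilinear_pool(feat, n_groups=4):
    """'Partial' bilinear pooling sketch: split channels into equal groups,
    bilinear-pool each group, and sum the group descriptors."""
    c = feat.shape[0]
    assert c % n_groups == 0, "channel count must be divisible by the group count"
    groups = np.split(feat, n_groups, axis=0)
    return sum(bilinear_pool_log_euclidean(g) for g in groups)
```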
NUS-Qihoo-UIUC_DPNs (VID) | NUS: Yunchao Wei, Mengdan Zhang, Jianan Li, Yunpeng Chen, Jiashi Feng
Qihoo 360: Jian Dong, Shuicheng Yan UIUC: Honghui Shi |
(1) Technical details for object detection from video:
The video object detector is trained based on Faster R-CNN [1] using the Dual Path Network (DPN) as the backbone. Specifically, we adopt three DPN models, i.e., DPN-96, DPN-107 and DPN-131, as the trunk feature learner as well as the head classifier in the Faster R-CNN framework. The best single model achieves 79.3% (mAP) on the validation set. The ensemble of 4 models gives an mAP of 80.5%. In addition, we propose a selected-average-pooling strategy to infer video context information, which is used to refine the detection results. With sequential tracking and rescoring [4] and video context, the mAP can be further improved to 84.5% on the validation set. (2) Technical details for object detection/tracking from video: object trajectory generation is based on the following two complementary methods: the optical-flow [2] based tubelet generation method and the visual tracking [3] based tubelet generation method. The former ensures the accuracy of trajectories, and the latter provides high tracking recall. The tracking tubelets are finally selectively added to the optical-flow tubelets based on our adaptive merge strategy. The final mAP on the validation set is 70.1%. NOTE: Here, the validation set refers to the 555 videos from ILSVRC2016. [1] Ren, S, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." NIPS. 2015. [2] Ilg, E, et al. “Flownet 2.0: Evolution of optical flow estimation with deep networks” arXiv preprint arXiv:1612.01925, 2016. [3] Nam H, et al. “Learning multi-domain convolutional neural networks for visual tracking”, CVPR 2016. [4] Han, Wei, et al. "Seq-nms for video object detection." arXiv preprint arXiv:1602.08465, 2016. |
NUS-Qihoo_DPNs (CLS-LOC) | NUS: Yunpeng Chen, Huaxin Xiao, Jianan Li, Xuecheng Nie, Xiaojie Jin, Jianshu Li, Jiashi Feng
Qihoo 360: Jian Dong, Shuicheng Yan |
We present a simple, highly efficient and modularized Dual Path Network (DPN) which introduces a novel dual path topology. The DPN model contains a residual path and a densely connected path which are able to effectively share common features while maintaining the flexibility to learn to explore new features. DPNs serve as our main network for all the tasks.
In the CLS-LOC task, we adopt DPNs to predict the Top-5 objects and then assign the corresponding bounding boxes using DPN-based Faster RCNNs [1]. On the provided-training-data track, a shallow DPN-98 (236MB/11.7GFLOPs) surpasses the best ResNeXt-101 (64×4d) [2] on the image classification task with a 26% smaller model size, 25% less computational cost and 8% lower memory consumption, and a deeper DPN-131 (304MB/16.0GFLOPs) achieves a top-1 (top-5) classification error of 18.55% (4.16%) on the validation set using a single 320x320 center crop. We combine two strong DPNs and two weaker DPNs with several existing CNNs to get the final prediction. On the extra-training-data track, we pretrained a DPN-107 on the ImageNet-5k dataset and then fine-tuned it on the provided training set with 33k self-collected extra training images. With different fine-tuning strategies, we obtained two DPN-107 networks. The final prediction is a weighted combination of these two additional models and the previous models. For the bounding box prediction, we follow the Faster RCNN pipeline and train three DPNs (two DPN-92 and one DPN-131). Proposals are extracted from all models and the final scores are adjusted by the classification scores. Technical details of the Dual Path Networks will be available on arXiv soon, and the pretrained models will also be made publicly available at the same time. *Note: All DPNs are trained from scratch using MXNet on 10 nodes with 40 K80 graphics cards in total. Without specific code optimization, the training speed of DPN-131 reaches > 60 samples/sec per node with synchronous training. ----- [1] S Ren, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." NIPS. 2015. [2] S Xie, et al. "Aggregated residual transformations for deep neural networks." CVPR. 2017. |
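As a rough illustration of the dual-path idea (a residual path that shares features by addition plus a densely connected path that grows by concatenation), here is a simplified PyTorch block; the channel sizes and layer layout are placeholders and do not reproduce the actual DPN-92/98/107/131 blocks.

```python
import torch
import torch.nn as nn

class DualPathBlock(nn.Module):
    """Simplified dual-path block: the residual path shares features via addition,
    while the dense path accumulates new features via concatenation.
    Note: each successive block must be built with a larger dense_channels,
    since the dense path grows by `growth` channels per block."""
    def __init__(self, res_channels=256, dense_channels=64, growth=16):
        super().__init__()
        in_ch = res_channels + dense_channels
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, in_ch // 2, 1, bias=False),
            nn.BatchNorm2d(in_ch // 2), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch // 2, in_ch // 2, 3, padding=1, bias=False),
            nn.BatchNorm2d(in_ch // 2), nn.ReLU(inplace=True),
            nn.Conv2d(in_ch // 2, res_channels + growth, 1, bias=False),
        )
        self.res_channels = res_channels

    def forward(self, res_state, dense_state):
        out = self.body(torch.cat([res_state, dense_state], dim=1))
        res_out = out[:, :self.res_channels]       # added back to the residual path
        dense_new = out[:, self.res_channels:]     # appended to the dense path
        return res_state + res_out, torch.cat([dense_state, dense_new], dim=1)
```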
NUS-Qihoo_DPNs (DET) | NUS: Jianan Li, Jianshu Li, Yunpeng Chen, Huaxin Xiao, Jiashi Feng
Qihoo 360: Jian Dong, Shuicheng Yan |
We adopt the Dual Path Network (DPN), which contains a novel dual path topology, for the object detection task based on Faster R-CNN [1]. The feature-sharing scheme and the flexibility to explore new features in DPN are shown to be effective in object detection. Specifically, we adapt several DPN models, i.e., DPN-92, DPN-107, DPN-131, etc., as the trunk feature learner as well as the head classifier in the Faster R-CNN framework. We only use networks up to 131 layers, which are light to train and fit well within most common GPUs, yet yield good performance. For region proposal generation, low-level fine-grained features are exploited, which are shown to be effective in improving proposal recall. Furthermore, we incorporate beneficial context information by adopting the dilated convolution [2] from segmentation into the detection framework. During testing, we design a category-wise weighting strategy to explore expert models for different categories and apply weights to different experts accordingly for multi-model inference. In addition, we adapt a model pre-trained on the image classification task to extract global context information, which provides beneficial cues for reasoning about the detection results within the whole input image.
[1] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015. [2] Yu, Fisher, and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions." International Conference on Learning Representations. 2016. |
PaulForDream | PengChong | I use an end-to-end detection system, YOLO, which was first presented in [1] and then improved in [2], to train a detection network.
I modify the YOLOv2 detection network structure defined in [2] to improve the detection performance for small objects. Besides the network structure, I also modify the loss function to make it more sensitive to small objects. In [2], the authors used 5 anchors to predict bounding boxes, while I use 10 anchors computed from the ILSVRC2017 DET training annotations. A high-resolution detection network helps improve detection performance, so the input image size is 608*608. References [1] Joseph Redmon, Santosh Divvala, Ross Girshick, Ali Farhadi. You Only Look Once: Unified, Real-Time Object Detection. arXiv preprint arXiv:1506.02640 [2] Joseph Redmon, Ali Farhadi. YOLO9000: Better, Faster, Stronger. arXiv preprint arXiv:1612.08242 |
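The anchor computation PaulForDream describes (k-means over training-box dimensions, as popularised by YOLOv2) can be sketched like this; the IoU-based distance compares boxes as if they shared the same centre, and k=10 follows the entry's description.

```python
import numpy as np

def kmeans_anchors(box_wh, k=10, iters=100, seed=0):
    """YOLOv2-style anchor clustering with 1 - IoU as the distance.
    box_wh: (N, 2) array of box widths/heights from the training annotations."""
    rng = np.random.RandomState(seed)
    centroids = box_wh[rng.choice(len(box_wh), k, replace=False)]
    for _ in range(iters):
        # IoU between every box and every centroid, assuming shared centres
        inter = (np.minimum(box_wh[:, None, 0], centroids[None, :, 0]) *
                 np.minimum(box_wh[:, None, 1], centroids[None, :, 1]))
        union = (box_wh[:, 0] * box_wh[:, 1])[:, None] + \
                (centroids[:, 0] * centroids[:, 1])[None, :] - inter
        assign = np.argmax(inter / union, axis=1)   # max IoU == min (1 - IoU)
        new_centroids = np.array([
            box_wh[assign == j].mean(axis=0) if np.any(assign == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return centroids
```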
QINIU-ATLAB+CAS-SARI | Yixin Bao (1)
Zhijian Zhao (1) Yining Lin (1) Bingfei Fu (2) Hao Ye (2) Li Wang (2) (1) Shanghai Qiniu Information & technology Corporation (2) Shanghai Advanced Research Institute, Chinese Academy of Sciences |
Abstract
In ILSVRC2017, we focus on object detection with provided training data. Our object detection architecture is Faster RCNN (in MXNet [1]) with different network structures: ResNet-101 [2], ResNet-152, Inception-v3 [3] and deformable-convolution R-FCN [4]. To make full use of these deep neural networks, we use evaluation techniques such as box voting to improve the accuracy of object detection. During training, we also analyze the missed images and fine-tune the neural networks. Moreover, we use multi-scale testing to catch tiny and elongated objects. Our final submissions consist of ensembles of multiple models, which improves the robustness of the whole system. References [1] "GitHub," 2017. [Online]. Available: https://github.com/fubingfeiya/mxnet. [Accessed 30 6 2017]. [2] K. He, X. Zhang, S. Ren and J. Sun, "Deep Residual Learning for Image Recognition," CVPR, pp. 1-12, 10 12 2015. [3] C. Szegedy, V. Vanhoucke, S. Ioffe and J. Shlens, "Rethinking the Inception Architecture for Computer Vision," CVPR, pp. 2818-2826, 2016. [4] J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu and Y. Wei, "Deformable Convolutional Networks," CVPR, 17 3 2017. |
ReadSense | Thomas Tong, ReadSense
Leon Ding, ReadSense Jiajun Du, Shanghai Jiao Tong University |
Faster R-CNN
[8] Ren S, He K, Girshick R, et al. Faster r-cnn: Towards real-time object detection with region proposal networks[C]. Advances in neural information processing systems. 2015: 91-99. |
Sheep | Sheng Yang,student at NWPU | This is a method for detecting objects in images that integrates the bounding boxes and class probabilities into a deep neural network. First, we generate a small set of default boxes of different aspect ratios. Next, we run a single convolutional network on the image and generate class confidence for all object categories for default boxes. Last, we utilize non-max suppression to choose the resulting detections by model’s confidence and regress the bounding boxes.
This model is implemented in Caffe, is motivated by SSD and YOLO, and is fast and effective. It can also handle objects of various sizes and images of different resolutions. The main advantages are as follows: 1. Incorporating the bounding boxes and class probabilities into a single deep network, rather than using two networks as in Fast R-CNN. 2. Generating default boxes with different aspect ratios rather than region proposals. 3. Images of different resolutions can all be handled by the method. Reference: 1. Uijlings, J.R., van de Sande, K.E., Gevers, T., Smeulders, A.W.: Selective search for object recognition. IJCV (2013) 2. Girshick, R.: Fast R-CNN. In: ICCV (2015) 3. Ren, S., He, K., Girshick, R., Sun, J.: Faster R-CNN: Towards real-time object detection with region proposal networks. In: NIPS (2015) 4. Redmon, J., Divvala, S., Girshick, R., Farhadi, A.: You only look once: Unified, real-time object detection. In: CVPR (2016) 5. Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition. (2015) 6. Gould, S., Gao, T., Koller, D.: Region-based segmentation and object detection. In: Advances in Neural Information Processing Systems, pages 655-663, 2009 7. Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A. C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV), 2015 8. Dean, T., Ruzon, M. A., Segal, M., Shlens, J., Vijayanarasimhan, S., Yagnik, J.: Fast, accurate detection of 100,000 object classes on a single machine. In: CVPR, 2013 9. Girshick, R. B., Felzenszwalb, P. F., McAllester, D.: Discriminatively trained deformable part models, release 5. http://people.cs.uchicago.edu/rbg/latent-release5/ 10. Everingham, M., Van Gool, L., Williams, C. K., Winn, J., Zisserman, A.: The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303-338, 2010 |
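A small sketch of the default-box generation the Sheep entry describes (SSD-style boxes of several aspect ratios tiled over a feature map); the scale and aspect-ratio values here are illustrative placeholders, not the team's settings.

```python
import itertools
import numpy as np

def default_boxes(fmap_size, scale, aspect_ratios=(1.0, 2.0, 0.5, 3.0, 1.0 / 3.0)):
    """Generate SSD-style default boxes (cx, cy, w, h in [0, 1]) for one square
    feature map: one set of aspect-ratio variants per grid cell."""
    boxes = []
    for i, j in itertools.product(range(fmap_size), repeat=2):
        cx, cy = (j + 0.5) / fmap_size, (i + 0.5) / fmap_size  # cell centre
        for ar in aspect_ratios:
            boxes.append([cx, cy, scale * np.sqrt(ar), scale / np.sqrt(ar)])
    return np.clip(np.array(boxes), 0.0, 1.0)

# Example: 38x38 feature map with a 0.1 base scale (placeholder values)
priors = default_boxes(fmap_size=38, scale=0.1)
```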
SIIT_KAIST-SKT | Dongyoon Han* (KAIST)
Jiwhan Kim* (KAIST) Gwang-Gook Lee (SKT) Junmo Kim (KAIST) (*equally contributed) |
We used trained models including three 200-layer PyramidNets [1] as classification networks. PyramidNets are designed to maximize network capability by gradually increasing the number of channel maps, and as a result, PyramidNets show better performance than ResNets [2] and other state-of-the-art networks with an equal number of parameters.
We adopted PyramidNets previously trained on the ILSVRC-2012 training set, which are available in our GitHub repository: https://github.com/jhkim89/PyramidNet. Our main goal is two-fold: 1) to get better performance with a smaller ensemble of networks; 2) to apply several additional techniques during the test phase to improve the classification ability of the trained networks. For the localization network, R-FCN [3] is used with a single PyramidNet-101 as our backbone, which is trained to boost both the RPN and detection heads via an auxiliary loss. [1] D. Han*, J. Kim* and Junmo Kim. "Deep pyramidal residual networks", equally contributed by the authors*, CVPR 2017. [2] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition", CVPR 2016. [3] J. Dai, Y. Li, K. He, and J. Sun, 'R-FCN: Object detection via region-based fully convolutional networks', NIPS 2016. |
THU-CAS | GuiGuang Ding (1)
ChaoQun Chu (1) Sheng Tang (2) Bin Wang (2) JunBin Xiao (2) Chen Li (1) DaHan Gong (1) (1) School Of Software, Tsinghua University, Beijing, China (2) Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China |
For the still-image detection, this year we use the Faster R-CNN [1] framework with different versions of ResNet [2] as the base convolutional network.
1. For the object detection from video (VID) sub-task, our contributions are three-fold: (1) Co-occurrence-relationship-based multi-context suppression: by mining the co-occurrence relationships between bounding boxes of different classes from the training dataset and analyzing the multiple contexts in each video, we obtain the representative classes and the co-occurring classes, and then apply more effective and targeted suppression, which brings a large mAP improvement. (2) True-negative object filtering: based on an analysis of the training dataset, we obtain the non-co-occurrence relationships between different objects, which help us filter out true-negative objects that have lower detection scores and whose categories do not co-occur with the objects of highest detection scores. (3) Tubelet-based bounding box reclassification: based on tubelets constructed from detection results, optical flow and multi-target tracking algorithms, we apply an effective reclassification to obtain coherent categories for the bounding boxes of the tubelets. 2. For object detection from video (VID) with tracking, we propose a novel tracking framework that integrates optical-flow tracking [3], GOTURN tracking [4] and POI tracking [5] into a more accurate tubelet-construction algorithm, owing to their complementary attributes. Under this framework, our contributions are four-fold: (1) Tubelet construction based on detection results and optical flow: using optical flow as a tracker, we sequentialize the detection bounding boxes of the same object to form a tubelet. (2) Multi-target tracking with GOTURN: we first choose an anchor frame and exploit the adjacent information to determine reliable anchor targets for efficient tracking. Then, we track each anchor target with a GOTURN tracker in parallel. Finally, we use still-image detection results to recall missing tubelets. (3) Multi-target tracking with POI: POI tracking is an effective data-association-based multiple object tracking (MOT) algorithm; based on the results of the still-image detection, we can obtain reliable tubelets, which are beneficial for the subsequent tubelet fusion. (4) Tubelet NMS: we propose a novel and effective union-and-concatenation method for tubelet fusion, which improves the final AP by a large margin. References: [1] Ren S, He K, Girshick R, Sun J. “Faster R-CNN: Towards real-time object detection with region proposal networks”, NIPS 2015: 91-99. [2] He K, Zhang X, Ren S, Sun J. “Deep residual learning for image recognition”, CVPR 2016. [3] Kang K, Ouyang W, Li H, Wang X. “Object Detection from Video Tubelets with Convolutional Neural Networks”, CVPR 2016. [4] Held, David, Sebastian Thrun, and Silvio Savarese. "Learning to track at 100 fps with deep regression networks." arXiv preprint arXiv:1604.01802 (2016). [5] Yu, Fengwei, et al. "Poi: Multiple object tracking with high performance detection and appearance feature." arXiv preprint arXiv:1610.06136 (2016). |
Trimps-Soushen | Xiaoteng Zhang, Zhengyan Ding, Jianying Zhou, Jie Shao, Lin Mei
The Third Research Institute of the Ministry of Public Security, P.R. China. |
In the classification part, we employed several techniques, such as stochastic depth, data augmentation and model optimization (removing some redundant layers), to avoid overfitting. In addition, different training strategies were adopted for different models, which largely boosts the final performance.
Our localization system is based on the Faster R-CNN pipeline with improvements such as global context, cascade RPN/RCNN, and multi-scale training/testing. Recent works such as FPN, Mask R-CNN and Deformable CNN were also integrated into our pipeline. Furthermore, a more efficient fusion algorithm was used in the test phase, which yields excellent improvements in localization accuracy. |
DeepView(ETRI) | Seung-Hwan Bae*(ETRI)
Youngjoo Jo*(ETRI) Joongwon Hwang*(ETRI) Youngwan Lee*(ETRI) Young-Suk Yoon*(ETRI) Yuseok Bae*(ETRI) Jongyoul Park(ETRI,Supervisor) * indicates equal contribution |
++For training:
We train three types of convolutional detectors for this challenge: (1) SSD type: we use DSSD [1] with VGGNet and SSD [2] with the WR-Inception network [3]. (2) Faster RCNN type Ⅰ: we use the pre-trained resnet101/152/269 models [4] as the CLS-Net, then add region proposal networks to the CLS-Net and fine-tune the networks on the 200 detection object classes for this challenge. (3) Faster RCNN type Ⅱ: we apply a resizing method with bilinear interpolation [5] on the resnet152 model instead of ROI pooling; the method is also used to build a new hyper-feature layer. To handle the class imbalance problem, we make the ratio of positive and negative samples equal in each mini-batch. To detect small objects, we do not limit the size of region proposals. To improve detection accuracy, we propose new techniques for both multi-scale and multi-region processing. ++For improving accuracy: (1) Results ensemble: to ensemble the results, we combine the detection results of the models according to their mean APs on val2; after that, Soft-NMS and box refinement are performed, which improves the mAP by 4-5%. (2) Soft-NMS: we test the models with Soft-NMS [6] to obtain some improvement. (3) Data augmentation: we augment the given training set in 4 ways: 2D rotation, 2D translation, color histogram equalization, and stochastic noise addition. To balance the dataset between classes, we produce augmented images when the original image contains objects of classes with fewer instances. [Reference] [1] C.-Y. Fu, W. Liu, A. Ranga, A. Tyagi, and A. C. Berg. DSSD: Deconvolutional single shot detector, arXiv preprint arXiv:1701.06659, 2017. [2] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A. C. Berg. SSD: Single shot multibox detector. In ECCV, 2016. [3] Y. Lee, H. Kim, E. Park, X. Cui, H. Kim, Wide-residual-inception networks for real-time object detection. arXiv preprint arXiv:1702.01243, 2017. [4] X. Zeng, W. Ouyang, J. Yan, H. Li, T. Xiao, K. Wang, Y. Liu, Y. Zhou, B. Yang, Z. Wang, H. Zhou, X. Wang, Crafting GBD-Net for Object Detection, CoRR, 2016. [5] X. Chen & A. Gupta, An implementation of faster rcnn with study for region sampling, arXiv preprint arXiv:1702.02138, 2017. [6] N. Bodla, B. Singh, R. Chellappa, & L. S. Davis, Improving Object Detection With One Line of Code, arXiv preprint arXiv:1704.04503, 2017. |
WMW | Jie Hu (Momenta)
Li Shen (University of Oxford) Gang Sun (Momenta) |
We design a new architectural building block, named “Squeeze-and-Excitation (SE)”. Each building block embeds information from global receptive fields via a “squeeze” operation and selectively induces response enhancement via an “excitation” operation. The SE module is the foundation of our entries. We develop multiple versions of SENet, such as SE-ResNet, SE-ResNeXt and SE-Inception-ResNet, which clearly surpass their non-SE counterparts with only a slight increase in computation and GPU memory cost. We achieved a top-5 error rate of 2.3% on the validation set.
All the models are trained on our designed distributed deep learning training system “ROCS”. We conduct significant optimization on GPU memory and message passing across GPU servers. Benefitting from that, our system trains SE-ResNet152 with a minibatch size of 2048 on 64 Nvidia Pascal Titan X GPUs in 20 hours using synchronous SGD without warm-up. We train all the models from scratch with provided training data. We submit no localization result. More technical and experimental details will be elucidated in a report. |
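For readers unfamiliar with the block WMW describe, a minimal PyTorch sketch of a Squeeze-and-Excitation module follows; the reduction ratio of 16 is the commonly used default and may differ from the team's exact settings.

```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation block: 'squeeze' spatial information with global
    average pooling, 'excite' with a small bottleneck MLP, then rescale channels."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.squeeze = nn.AdaptiveAvgPool2d(1)          # (B, C, H, W) -> (B, C, 1, 1)
        self.excite = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),                               # per-channel weights in (0, 1)
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.excite(self.squeeze(x).view(b, c)).view(b, c, 1, 1)
        return x * w    # channel-wise recalibration of the feature map
```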
XBENTURY-THU | Nie FangXing : XiaoBaiShiJi
Ding ZhanYang : XiaoBaiShiJi Gao TingWei : THU |
Abstract:
Our team uses Faster-RCNN to train DET models. Our models are based on a 101-layer ResNet, with some data augmentation and other tricks added. We train six different models for testing; the results are divided into two groups, and each group ensembles three models. Reference: Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks: Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. CRAFT Objects from Images: Bin Yang, Junjie Yan, Zhen Lei, Stan Z. Li. Fast R-CNN: Ross Girshick. Training Region-based Object Detectors with Online Hard Example Mining: Abhinav Shrivastava, Abhinav Gupta, Ross Girshick. Gated Bi-directional CNN for Object Detection: Xingyu Zeng, Wanli Ouyang, Bin Yang, Junjie Yan, Xiaogang Wang |
YYF | YufengYuan PERSONAL | The core method is based on YOLOv2 with the ResNet-152 model.
Anchors were introduced into the YOLO framework by YOLOv2, and I also benefit from them. The ResNet-152 pooling layer and fully connected layer are removed and replaced with 3 convolutional layers; the first of these combines features from res5c, res4f and res3d. I use four models with the same structure but different weights; the weights are trained on the same images under different conditions, such as different sizes. The input images are all 480 pixels. Each of the four models produces its own predictions, the results are pooled into one set, and the final result is selected by non-maximum suppression. These results were produced by one person with limited Nvidia GPU resources, so they are not as strong as they could be, but my interest is the best driving force, and I believe there must be anchor-based detection methods other than Faster R-CNN. I find YOLO a candidate with exciting localization ability and good recognition ability, though it is not yet as good on 200 categories and small objects. I have tried many methods and will continue along the YOLO road. references: YOLO and YOLOv2: https://pjreddie.com/darknet/yolo/ ResNet: https://arxiv.org/abs/1512.03385 |
zlm | Liming Zhao, Zhejiang University
Xi Li, Zhejiang University |
A deep residual network, built by stacking a sequence of residual blocks, is easy to train, because identity mappings skip residual branches and thus improve information flow.
To further reduce the training difficulty, we present a simple network architecture, deep merge-and-run neural networks. We use a modularized building block, merge-and-run block, which assembles residual branches in parallel through a merge-and-run mapping: Average the inputs of these residual branches (Merge), and add the average to the output of each residual branch as the input of the subsequent residual branch (Run), respectively. |
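A minimal PyTorch sketch of the merge-and-run mapping zlm describe: average the inputs of two parallel residual branches (merge) and add that average to each branch output to form the next branch inputs (run). The channel counts and branch bodies are placeholders, not the team's exact architecture.

```python
import torch
import torch.nn as nn

def conv_branch(channels):
    """A plain residual branch: two 3x3 convolutions with BN/ReLU."""
    return nn.Sequential(
        nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        nn.BatchNorm2d(channels), nn.ReLU(inplace=True),
        nn.Conv2d(channels, channels, 3, padding=1, bias=False),
        nn.BatchNorm2d(channels),
    )

class MergeAndRunBlock(nn.Module):
    """Merge-and-run block: the skip connection carries the average of the two
    branch inputs rather than a single identity path."""
    def __init__(self, channels=64):
        super().__init__()
        self.branch_a = conv_branch(channels)
        self.branch_b = conv_branch(channels)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, xa, xb):
        merged = 0.5 * (xa + xb)                    # merge: average the branch inputs
        ya = self.relu(self.branch_a(xa) + merged)  # run: add the average to each output
        yb = self.relu(self.branch_b(xb) + merged)
        return ya, yb
```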