Legend:
Yellow background = winner in this task according to this metric; authors are willing to reveal the method
White background = authors are willing to reveal the method
Grey background = authors chose not to reveal the method
Italics = authors requested entry not participate in competition
Object detection (DET)
Task 1a: Object detection with provided training data
Ordered by number of categories won
Team name | Entry description | Number of object categories won | mean AP |
CUImage | Ensemble of 6 models using provided data | 109 | 0.662751 |
Hikvision | Ensemble A of 3 RPN and 6 FRCN models, mAP is 67 on val2 | 30 | 0.652704 |
Hikvision | Ensemble B of 3 RPN and 5 FRCN models, mean AP is 66.9, median AP is 69.3 on val2 | 18 | 0.652003 |
NUIST | submission_1 | 15 | 0.608752 |
NUIST | submission_2 | 9 | 0.607124 |
Trimps-Soushen | Ensemble 2 | 8 | 0.61816 |
360+MCG-ICT-CAS_DET | 9 models ensemble with validation and 2 iterations | 4 | 0.615561 |
360+MCG-ICT-CAS_DET | Baseline: Faster R-CNN with Res200 | 4 | 0.590596 |
Hikvision | Best single model, mAP is 65.1 on val2 | 2 | 0.634003 |
CIL | Ensemble of 2 Models | 1 | 0.553542 |
360+MCG-ICT-CAS_DET | 9 models ensemble | 0 | 0.613045 |
360+MCG-ICT-CAS_DET | 3 models | 0 | 0.605708 |
Trimps-Soushen | Ensemble 1 | 0 | 0.57956 |
360+MCG-ICT-CAS_DET | res200+dasc+obj+sink+impneg+seg | 0 | 0.576742 |
CIL | Single model (pre-activation ResNet + Faster R-CNN on TensorFlow; training in progress, 1/3 of total epochs finished) | 0 | 0.551189 |
KAIST-SLSP | 2 models ensemble with box rescoring | 0 | 0.535393 |
MIL_UT | ensemble of ResNet101, ResNet152 based Faster RCNN | 0 | 0.532216 |
KAIST-SLSP | 2 models ensemble | 0 | 0.515472 |
Faceall-BUPT | ensemble plan B; validation map 52.28 | 0 | 0.488839 |
Faceall-BUPT | ensemble plan A; validation map 52.24 | 0 | 0.486977 |
Faceall-BUPT | multi-scale roi; best single model; validation map 51.73 | 0 | 0.484141 |
VB | Ensemble Detection Model E3 | 0 | 0.481285 |
Hitsz_BCC | Combined 500x500 with 300x300 model | 0 | 0.479929 |
VB | Ensemble Detection Model E1 | 0 | 0.479043 |
ToConcoctPellucid | Ensemble of ResNet-101 + ResNet-50 followed by prediction pooling using box-voting | 0 | 0.477484 |
Hitsz_BCC | Self-implement SSD 500x500 model with ResNet-101 | 0 | 0.472984 |
ToConcoctPellucid | Ensemble of different topology of ResNet-101 + ResNet-50 followed by prediction pooling using box-voting | 0 | 0.470133 |
ToConcoctPellucid | ResNet-101 + Faster-RCNN single model | 0 | 0.469716 |
Faceall-BUPT | faster rcnn baseline; validation map 49.30 | 0 | 0.461085 |
Hitsz_BCC | Self-implement SSD 300x300 model with ResNet-152 | 0 | 0.451462 |
VB | Ensemble Detection Model E2 | 0 | 0.45063 |
SunNMoon | ensemble FRCN and SSD based on Resnet101 networks. | 0 | 0.434906 |
Choong | Ensemble of Deep learning model based on VGG16 & ResNet | 0 | 0.434323 |
VB | Single Detection Model S1 | 0 | 0.421331 |
hustvision | convbox-googlenet | 0 | 0.413457 |
LZDTX | A deconv-ssd network with input size 300x300. | 0 | 0.403113 |
OutOfMemory | ResNet-152+FasterRCNN | 0 | 0.393259 |
Lean-T | A single model, Faster R-CNN baseline,continuous iterations(~230K) | 0 | 0.314391 |
BUAA ERCACAT | combined model for detection | 0 | 0.269069 |
BUAA ERCACAT | A single model for detection | 0 | 0.265055 |
Lean-T | A single model, Faster R-CNN baseline,discontinuous iterations(~600K) | 0 | 0.259508 |
CUImage | Single GBD-Net model using provided data | --- | 0.633634 |
CUImage | Single Cluster-Net using provided data | --- | 0.618024 |
Trimps-Soushen | Single model | --- | 0.581434 |
DPFly | detection algorithm 1 | --- | 0.491905 |
VIST | Single model A using ResNet for detection | --- | 0.459305 |
VIST | Single model B using ResNet for detection | --- | 0.455689 |
Ordered by mean average precision
Team name | Entry description | mean AP | Number of object categories won |
CUImage | Ensemble of 6 models using provided data | 0.662751 | 109 |
Hikvision | Ensemble A of 3 RPN and 6 FRCN models, mAP is 67 on val2 | 0.652704 | 30 |
Hikvision | Ensemble B of 3 RPN and 5 FRCN models, mean AP is 66.9, median AP is 69.3 on val2 | 0.652003 | 18 |
Hikvision | Best single model, mAP is 65.1 on val2 | 0.634003 | 2 |
CUImage | Single GBD-Net model using provided data | 0.633634 | --- |
Trimps-Soushen | Ensemble 2 | 0.61816 | 8 |
CUImage | Single Cluster-Net using provided data | 0.618024 | --- |
360+MCG-ICT-CAS_DET | 9 models ensemble with validation and 2 iterations | 0.615561 | 4 |
360+MCG-ICT-CAS_DET | 9 models ensemble | 0.613045 | 0 |
NUIST | submission_1 | 0.608752 | 15 |
NUIST | submission_2 | 0.607124 | 9 |
360+MCG-ICT-CAS_DET | 3 models | 0.605708 | 0 |
360+MCG-ICT-CAS_DET | Baseline: Faster R-CNN with Res200 | 0.590596 | 4 |
Trimps-Soushen | Single model | 0.581434 | --- |
Trimps-Soushen | Ensemble 1 | 0.57956 | 0 |
360+MCG-ICT-CAS_DET | res200+dasc+obj+sink+impneg+seg | 0.576742 | 0 |
CIL | Ensemble of 2 Models | 0.553542 | 1 |
CIL | Single model (pre-activation ResNet + Faster R-CNN on TensorFlow; training in progress, 1/3 of total epochs finished) | 0.551189 | 0 |
KAIST-SLSP | 2 models ensemble with box rescoring | 0.535393 | 0 |
MIL_UT | ensemble of ResNet101, ResNet152 based Faster RCNN | 0.532216 | 0 |
KAIST-SLSP | 2 models ensemble | 0.515472 | 0 |
DPFly | detection algorithm 1 | 0.491905 | --- |
Faceall-BUPT | ensemble plan B; validation map 52.28 | 0.488839 | 0 |
Faceall-BUPT | ensemble plan A; validation map 52.24 | 0.486977 | 0 |
Faceall-BUPT | multi-scale roi; best single model; validation map 51.73 | 0.484141 | 0 |
VB | Ensemble Detection Model E3 | 0.481285 | 0 |
Hitsz_BCC | Combined 500x500 with 300x300 model | 0.479929 | 0 |
VB | Ensemble Detection Model E1 | 0.479043 | 0 |
ToConcoctPellucid | Ensemble of ResNet-101 + ResNet-50 followed by prediction pooling using box-voting | 0.477484 | 0 |
Hitsz_BCC | Self-implement SSD 500x500 model with ResNet-101 | 0.472984 | 0 |
ToConcoctPellucid | Ensemble of different topology of ResNet-101 + ResNet-50 followed by prediction pooling using box-voting | 0.470133 | 0 |
ToConcoctPellucid | ResNet-101 + Faster-RCNN single model | 0.469716 | 0 |
Faceall-BUPT | faster rcnn baseline; validation map 49.30 | 0.461085 | 0 |
VIST | Single model A using ResNet for detection | 0.459305 | --- |
VIST | Single model B using ResNet for detection | 0.455689 | --- |
Hitsz_BCC | Self-implement SSD 300x300 model with ResNet-152 | 0.451462 | 0 |
VB | Ensemble Detection Model E2 | 0.45063 | 0 |
SunNMoon | ensemble FRCN and SSD based on Resnet101 networks. | 0.434906 | 0 |
Choong | Ensemble of Deep learning model based on VGG16 & ResNet | 0.434323 | 0 |
VB | Single Detection Model S1 | 0.421331 | 0 |
hustvision | convbox-googlenet | 0.413457 | 0 |
LZDTX | A deconv-ssd network with input size 300x300. | 0.403113 | 0 |
OutOfMemory | ResNet-152+FasterRCNN | 0.393259 | 0 |
Lean-T | A single model, Faster R-CNN baseline,continuous iterations(~230K) | 0.314391 | 0 |
BUAA ERCACAT | combined model for detection | 0.269069 | 0 |
BUAA ERCACAT | A single model for detection | 0.265055 | 0 |
Lean-T | A single model, Faster R-CNN baseline,discontinuous iterations(~600K) | 0.259508 | 0 |
Task 1b: Object detection with additional training data
Ordered by number of categories won
Team name | Entry description | Description of outside data used | Number of object categories won | mean AP |
CUImage | Our model using our labeled landmarks on ImageNet Det data | We used the labeled landmarks on ImageNet Det data | 176 | 0.660081 |
Trimps-Soushen | Ensemble 3 | With extra annotations. | 22 | 0.616836 |
NUIST | submission_4 | refine the training data, add labels neglected, remove noisy labels for multi-instance images | 1 | 0.542942 |
NUIST | submission_3 | refine the training data, add labels neglected, remove noisy labels for multi-instance images | 1 | 0.540981 |
NUIST | submission_5 | refine the training data, add labels neglected, remove noisy labels for multi-instance images | 0 | 0.540619 |
DPAI Vison | multi-model ensemble, multiple classifier ensemble | add extra data for class num<1000 | 0 | 0.534943 |
DPAI Vison | multi-model ensemble, multiple context classifier ensemble | add extra data for class num<1000 | 0 | 0.534543 |
DPAI Vison | multi-model ensemble, extra classifier | add extra data for class num<1000 | 0 | 0.534203 |
DPAI Vison | multi-model ensemble, one-scale context classifier | add extra data for class num<1000 | 0 | 0.533838 |
DPAI Vison | multi-model ensemble | add extra data for class num<1000 | 0 | 0.526699 |
Ordered by mean average precision
Team name | Entry description | Description of outside data used | mean AP | Number of object categories won |
CUImage | Our model using our labeled landmarks on ImageNet Det data | We used the labeled landmarks on ImageNet Det data | 0.660081 | 176 |
Trimps-Soushen | Ensemble 3 | With extra annotations. | 0.616836 | 22 |
NUIST | submission_4 | refine the training data, add labels neglected, remove noisy labels for multi-instance images | 0.542942 | 1 |
NUIST | submission_3 | refine the training data, add labels neglected, remove noisy labels for multi-instance images | 0.540981 | 1 |
NUIST | submission_5 | refine the training data, add labels neglected, remove noisy labels for multi-instance images | 0.540619 | 0 |
DPAI Vison | multi-model ensemble, multiple classifier ensemble | add extra data for class num<1000 | 0.534943 | 0 |
DPAI Vison | multi-model ensemble, multiple context classifier ensemble | add extra data for class num<1000 | 0.534543 | 0 |
DPAI Vison | multi-model ensemble, extra classifier | add extra data for class num<1000 | 0.534203 | 0 |
DPAI Vison | multi-model ensemble, one-scale context classifier | add extra data for class num<1000 | 0.533838 | 0 |
DPAI Vison | multi-model ensemble | add extra data for class num<1000 | 0.526699 | 0 |
Object localization (LOC)
Task 2a: Classification+localization with provided training data
Ordered by localization error
Team name | Entry description | Localization error | Classification error |
Trimps-Soushen | Ensemble 3 | 0.077087 | 0.02991 |
Trimps-Soushen | Ensemble 4 | 0.077429 | 0.02991 |
Trimps-Soushen | Ensemble 2 | 0.077668 | 0.02991 |
Trimps-Soushen | Ensemble 1 | 0.079068 | 0.03144 |
Hikvision | Ensemble of 3 Faster R-CNN models for localization | 0.087377 | 0.03711 |
Hikvision | Ensemble of 4 Faster R-CNN models for localization | 0.087533 | 0.03711 |
NUIST | prefer multi box prediction with refine | 0.090593 | 0.03461 |
NUIST | prefer multi class prediction | 0.094058 | 0.03351 |
CU-DeepLink | GrandUnion + Fused-scale EnsembleNet | 0.098892 | 0.03042 |
CU-DeepLink | GrandUnion + Basic Ensemble | 0.098954 | 0.03049 |
CU-DeepLink | GrandUnion + Multi-scale EnsembleNet | 0.099006 | 0.03046 |
KAISTNIA_ETRI | Ensembles B (further tuned in class-dependent models I) | 0.099286 | 0.03352 |
CU-DeepLink | GrandUnion + Class-reweighted Ensemble with Per-instance Normalization | 0.099349 | 0.03103 |
CU-DeepLink | GrandUnion + Class-reweighted Ensemble | 0.099369 | 0.03096 |
KAISTNIA_ETRI | Ensembles A (further tuned in class-dependent model I ) | 0.100552 | 0.03352 |
KAISTNIA_ETRI | Ensembles B | 0.100676 | 0.03256 |
KAISTNIA_ETRI | Ensembles A | 0.102015 | 0.03256 |
KAISTNIA_ETRI | Ensembles C | 0.102056 | 0.03256 |
NUIST | prefer multi box prediction without refine | 0.11743 | 0.03473 |
SamExynos | 3 model only for classification | 0.236561 | 0.03171 |
SamExynos | single model only for classification | 0.238791 | 0.03614 |
Faceall-BUPT | Single localization network (II) fine-tuned with object-level annotations of training data. | 0.31649 | 0.05184 |
Faceall-BUPT | Ensemble of 5 models for classification, single model for localization. | 0.320754 | 0.04574 |
Faceall-BUPT | Ensemble of 3 models for classification, single model for localization. | 0.325235 | 0.0466 |
WQF_BTPZ | Two models for classification, localization model is fixed. The top-5 cls-only error on validation is 0.0645. The top-5 cls-loc error on validation is 0.4029. | 0.374499 | 0.06414 |
Faceall-BUPT | Single localization network (I) fine-tuned with object-level annotations of training data. | 0.415558 | 0.05184 |
DGIST-KAIST | Weighted sum #1 (five models) | 0.489969 | 0.03297 |
DGIST-KAIST | Averaging four models | 0.490373 | 0.03378 |
WQF_BTPZ | For classification, we merge two ResNet models, the top-5 cls-error on validation is 0.0639. For localization, we use a single faster RCNN model with ResNet, the top-5 cls-loc error on validation is 0.4025. | 0.524586 | 0.06407 |
ResNeXt | Ensemble C, weighted average, tuned on val. [No bounding box results] | 0.737308 | 0.03031 |
ResNeXt | Ensemble B, weighted average, tuned on val. [No bounding box results] | 0.737484 | 0.03092 |
ResNeXt | Ensemble A, simple average. [No bounding box results] | 0.737505 | 0.0315 |
ResNeXt | Ensemble C, weighted average. [No bounding box results] | 0.737526 | 0.03124 |
ResNeXt | Ensemble B, weighted average. [No bounding box results] | 0.737681 | 0.03203 |
SIIT_KAIST-TECHWIN | Ensemble B | 0.931565 | 0.03416 |
SIIT_KAIST-TECHWIN | Ensemble C | 0.931565 | 0.03458 |
SIIT_KAIST-TECHWIN | Ensemble A | 0.931596 | 0.03436 |
SIIT_KAIST-TECHWIN | Single model | 0.931596 | 0.03651 |
DEEPimagine | ImagineNet ensemble for classification only [ALL] | 0.995757 | 0.03536 |
DEEPimagine | ImagineNet ensemble for classification only [PART#2] | 0.995757 | 0.03592 |
DEEPimagine | ImagineNet ensemble for classification only [PART#1] | 0.995768 | 0.03643 |
NEU_SMILELAB | An ensemble of five models. Top-5 error 3.92% on validation set. | 0.999077 | 0.03981 |
NEU_SMILELAB | An ensemble of six models. Top-5 error 4.24% on validation set. | 0.999077 | 0.04268 |
NEU_SMILELAB | A single resnet-200 layer trained with small batch size. Top-5 error 4.57% on validation set. | 0.999077 | 0.04511 |
NEU_SMILELAB | Our single model with a partition of the 1000 classes. Top-5 error 7.62% on validation set. | 0.999097 | 0.07288 |
DeepIST | EnsembleC | 1.0 | 0.03291 |
DeepIST | EnsembleD | 1.0 | 0.03294 |
DGIST-KAIST | Weighted sum #2 (five models) | 1.0 | 0.03324 |
DGIST-KAIST | Averaging five models | 1.0 | 0.03357 |
DGIST-KAIST | Averaging six models | 1.0 | 0.03357 |
DeepIST | EnsembleB | 1.0 | 0.03446 |
DeepIST | EnsembleA | 1.0 | 0.03449 |
Ordered by classification error
Team name | Entry description | Classification error | Localization error |
Trimps-Soushen | Ensemble 2 | 0.02991 | 0.077668 |
Trimps-Soushen | Ensemble 3 | 0.02991 | 0.077087 |
Trimps-Soushen | Ensemble 4 | 0.02991 | 0.077429 |
ResNeXt | Ensemble C, weighted average, tuned on val. [No bounding box results] | 0.03031 | 0.737308 |
CU-DeepLink | GrandUnion + Fused-scale EnsembleNet | 0.03042 | 0.098892 |
CU-DeepLink | GrandUnion + Multi-scale EnsembleNet | 0.03046 | 0.099006 |
CU-DeepLink | GrandUnion + Basic Ensemble | 0.03049 | 0.098954 |
ResNeXt | Ensemble B, weighted average, tuned on val. [No bounding box results] | 0.03092 | 0.737484 |
CU-DeepLink | GrandUnion + Class-reweighted Ensemble | 0.03096 | 0.099369 |
CU-DeepLink | GrandUnion + Class-reweighted Ensemble with Per-instance Normalization | 0.03103 | 0.099349 |
ResNeXt | Ensemble C, weighted average. [No bounding box results] | 0.03124 | 0.737526 |
Trimps-Soushen | Ensemble 1 | 0.03144 | 0.079068 |
ResNeXt | Ensemble A, simple average. [No bounding box results] | 0.0315 | 0.737505 |
SamExynos | 3 model only for classification | 0.03171 | 0.236561 |
ResNeXt | Ensemble B, weighted average. [No bounding box results] | 0.03203 | 0.737681 |
KAISTNIA_ETRI | Ensembles A | 0.03256 | 0.102015 |
KAISTNIA_ETRI | Ensembles C | 0.03256 | 0.102056 |
KAISTNIA_ETRI | Ensembles B | 0.03256 | 0.100676 |
DeepIST | EnsembleC | 0.03291 | 1.0 |
DeepIST | EnsembleD | 0.03294 | 1.0 |
DGIST-KAIST | Weighted sum #1 (five models) | 0.03297 | 0.489969 |
DGIST-KAIST | Weighted sum #2 (five models) | 0.03324 | 1.0 |
NUIST | prefer multi class prediction | 0.03351 | 0.094058 |
KAISTNIA_ETRI | Ensembles A (further tuned in class-dependent model I ) | 0.03352 | 0.100552 |
KAISTNIA_ETRI | Ensembles B (further tuned in class-dependent models I) | 0.03352 | 0.099286 |
DGIST-KAIST | Averaging five models | 0.03357 | 1.0 |
DGIST-KAIST | Averaging six models | 0.03357 | 1.0 |
DGIST-KAIST | Averaging four models | 0.03378 | 0.490373 |
SIIT_KAIST-TECHWIN | Ensemble B | 0.03416 | 0.931565 |
SIIT_KAIST-TECHWIN | Ensemble A | 0.03436 | 0.931596 |
DeepIST | EnsembleB | 0.03446 | 1.0 |
DeepIST | EnsembleA | 0.03449 | 1.0 |
SIIT_KAIST-TECHWIN | Ensemble C | 0.03458 | 0.931565 |
NUIST | prefer multi box prediction with refine | 0.03461 | 0.090593 |
NUIST | prefer multi box prediction without refine | 0.03473 | 0.11743 |
DEEPimagine | ImagineNet ensemble for classification only [ALL] | 0.03536 | 0.995757 |
DEEPimagine | ImagineNet ensemble for classification only [PART#2] | 0.03592 | 0.995757 |
SamExynos | single model only for classification | 0.03614 | 0.238791 |
DEEPimagine | ImagineNet ensemble for classification only [PART#1] | 0.03643 | 0.995768 |
SIIT_KAIST-TECHWIN | Single model | 0.03651 | 0.931596 |
Hikvision | Ensemble of 3 Faster R-CNN models for localization | 0.03711 | 0.087377 |
Hikvision | Ensemble of 4 Faster R-CNN models for localization | 0.03711 | 0.087533 |
NEU_SMILELAB | An ensemble of five models. Top-5 error 3.92% on validation set. | 0.03981 | 0.999077 |
NEU_SMILELAB | An ensemble of six models. Top-5 error 4.24% on validation set. | 0.04268 | 0.999077 |
NEU_SMILELAB | A single resnet-200 layer trained with small batch size. Top-5 error 4.57% on validation set. | 0.04511 | 0.999077 |
Faceall-BUPT | Ensemble of 5 models for classification, single model for localization. | 0.04574 | 0.320754 |
Faceall-BUPT | Ensemble of 3 models for classification, single model for localization. | 0.0466 | 0.325235 |
Faceall-BUPT | Single localization network (I) fine-tuned with object-level annotations of training data. | 0.05184 | 0.415558 |
Faceall-BUPT | Single localization network (II) fine-tuned with object-level annotations of training data. | 0.05184 | 0.31649 |
WQF_BTPZ | For classification, we merge two ResNet models, the top-5 cls-error on validation is 0.0639. For localization, we use a single faster RCNN model with ResNet, the top-5 cls-loc error on validation is 0.4025. | 0.06407 | 0.524586 |
WQF_BTPZ | Two models for classification, localization model is fixed. The top-5 cls-only error on validation is 0.0645. The top-5 cls-loc error on validation is 0.4029. | 0.06414 | 0.374499 |
NEU_SMILELAB | Our single model with a partition of the 1000 classes. Top-5 error 7.62% on validation set. | 0.07288 | 0.999097 |
Task 2b: Classification+localization with additional training data
Ordered by localization error
Team name | Entry description | Description of outside data used | Localization error | Classification error |
Trimps-Soushen | Ensemble 5 | With extra annotations. | 0.077377 | 0.02991 |
NUIST | prefer multi box prediction | ensemble one model trained on CLS+Place2 (1365) | 0.094992 | 0.04093 |
NUIST | prefer multi class prediction | ensemble one model trained on CLS+Place2(1365) | 0.097782 | 0.03877 |
Ordered by classification error
Team name | Entry description | Description of outside data used | Classification error | Localization error |
Trimps-Soushen | Ensemble 5 | With extra annotations. | 0.02991 | 0.077377 |
NUIST | prefer multi class prediction | ensemble one model trained on CLS+Place2(1365) | 0.03877 | 0.097782 |
NUIST | prefer multi box prediction | ensemble one model trained on CLS+Place2 (1365) | 0.04093 | 0.094992 |
Object detection from video (VID)
Task 3a: Object detection from video with provided training data
Ordered by number of categories won
Team name | Entry description | Number of object categories won | mean AP |
NUIST | cascaded region regression + tracking | 10 | 0.808292 |
NUIST | cascaded region regression + tracking | 10 | 0.803154 |
CUVideo | 4-model ensemble with Multi-Context Suppression and Motion-Guided Propagation | 9 | 0.767981 |
Trimps-Soushen | Ensemble 2 | 1 | 0.709651 |
MCG-ICT-CAS | ResNet101+ResNet200 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification, trackInfo | 0 | 0.733116 |
MCG-ICT-CAS | ResNet101+ResNet200 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification | 0 | 0.730793 |
MCG-ICT-CAS | ResNet200 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification | 0 | 0.720318 |
MCG-ICT-CAS | ResNet101 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification, trackInfo | 0 | 0.706204 |
MCG-ICT-CAS | ResNet101 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification | 0 | 0.700729 |
Trimps-Soushen | Ensemble 3 | 0 | 0.684258 |
KAIST-SLSP | set 1 (ensemble with 2 models w/ various post-processing, including multiple object tracking w/ beta = 0.2) | 0 | 0.642787 |
NUS_VISENZE | fused ssd vgg+resnet nms | 0 | 0.64062 |
RUC_BDAI | We use the well-trained Faster R-CNN to generate bounding boxes for every frame of the video. Then we utilize the contextual information of the video to reduce noise and add missing detections. | 0 | 0.562668 |
SRA | Object detection using temporal and contextual information | 0 | 0.511638 |
SRA | object detection without contextual information | 0 | 0.492785 |
Faceall-BUPT | faster rcnn, brute force detection, only used DET data to train; map on val is 53.51 | 0 | 0.490969 |
CIGIT_Media | adopt a new method for merging the scores of R-FCN and SSD detectors | 0 | 0.483239 |
CIGIT_Media | object detection from video without tracking | 0 | 0.47782 |
F205_CV | ssd resnet 101 0.01 confidence rate | 0 | 0.472255 |
F205_CV | ssd resnet 101 0.1 confidence rate | 0 | 0.439478 |
F205_CV | ssd resnet 101 0.2 confidence rate | 0 | 0.41922 |
F205_CV | ssd with resnet101 filtered by NMS with a 0.6 overlap rate and 0.1 confidence rate | 0 | 0.357711 |
F205_CV | ssd with resnet101 filtered by NMS with a 0.6 overlap rate and 0.02 confidence rate | 0 | 0.340852 |
SIS ITMO University | SSD | 0 | 0.272236 |
ASTAR_VA | Our model takes into account spatial and temporal information from several previous frames. | 0 | 0.270755 |
MCC | --- | 0 | 0.25457 |
RUC_BDAI | We only use the well-trained Faster R-CNN to generate bounding boxes for every frame of the video. | 0 | 0.108039 |
CUVideo | 4-model ensemble without MCS & MGP | --- | 0.740812 |
CUVideo | Single GBD-Net with Multi-Context Suppression & Motion-Guided Propagation | --- | 0.732857 |
Ordered by mean average precision
Team name | Entry description | mean AP | Number of object categories won |
NUIST | cascaded region regression + tracking | 0.808292 | 10 |
NUIST | cascaded region regression + tracking | 0.803154 | 10 |
CUVideo | 4-model ensemble with Multi-Context Suppression and Motion-Guided Propagation | 0.767981 | 9 |
CUVideo | 4-model ensemble without MCS & MGP | 0.740812 | --- |
MCG-ICT-CAS | ResNet101+ResNet200 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification, trackInfo | 0.733116 | 0 |
CUVideo | Single GBD-Net with Multi-Context Suppression & Motion-Guided Propagation | 0.732857 | --- |
MCG-ICT-CAS | ResNet101+ResNet200 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification | 0.730793 | 0 |
MCG-ICT-CAS | ResNet200 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification | 0.720318 | 0 |
Trimps-Soushen | Ensemble 2 | 0.709651 | 1 |
MCG-ICT-CAS | ResNet101 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification, trackInfo | 0.706204 | 0 |
MCG-ICT-CAS | ResNet101 models for detection, Non-co-occurrence filtration, Coherent tubelet reclassification | 0.700729 | 0 |
Trimps-Soushen | Ensemble 3 | 0.684258 | 0 |
KAIST-SLSP | set 1 (ensemble with 2 models w/ various post-processing, including multiple object tracking w/ beta = 0.2) | 0.642787 | 0 |
NUS_VISENZE | fused ssd vgg+resnet nms | 0.64062 | 0 |
RUC_BDAI | We use the well-trained Faster R-CNN to generate bounding boxes for every frame of the video. Then we utilize the contextual information of the video to reduce noise and add missing detections. | 0.562668 | 0 |
SRA | Object detection using temporal and contextual information | 0.511638 | 0 |
SRA | object detection without contextual information | 0.492785 | 0 |
Faceall-BUPT | faster rcnn, brute force detection, only used DET data to train; map on val is 53.51 | 0.490969 | 0 |
CIGIT_Media | adopt a new method for merging the scores of R-FCN and SSD detectors | 0.483239 | 0 |
CIGIT_Media | object detection from video without tracking | 0.47782 | 0 |
F205_CV | ssd resnet 101 0.01 confidence rate | 0.472255 | 0 |
F205_CV | ssd resnet 101 0.1 confidence rate | 0.439478 | 0 |
F205_CV | ssd resnet 101 0.2 confidence rate | 0.41922 | 0 |
F205_CV | ssd with resnet101 filtered by NMS with a 0.6 overlap rate and 0.1 confidence rate | 0.357711 | 0 |
F205_CV | ssd with resnet101 filtered by NMS with a 0.6 overlap rate and 0.02 confidence rate | 0.340852 | 0 |
SIS ITMO University | SSD | 0.272236 | 0 |
ASTAR_VA | Our model takes into account spatial and temporal information from several previous frames. | 0.270755 | 0 |
MCC | --- | 0.25457 | 0 |
RUC_BDAI | We only use the well-trained Faster R-CNN to generate bounding boxes for every frame of the video. | 0.108039 | 0 |
Task 3b: Object detection from video with additional training data
Ordered by number of categories won
Team name | Entry description | Description of outside data used | Number of object categories won | mean AP |
NUIST | cascaded region regression + tracking | proposal network is finetuned from COCO | 17 | 0.79593 |
NUIST | cascaded region regression + tracking | proposal network is finetuned from COCO | 5 | 0.781144 |
Trimps-Soushen | Ensemble 6 | Extra data from ImageNet dataset(out of the ILSVRC2016) | 5 | 0.720704 |
ITLab-Inha | An ensemble for detection, MCMOT for tracking | pre-trained model from COCO detection, extra data collected by ourselves (100 images per class) | 3 | 0.731471 |
DPAI Vison | single model | extra data | 0 | 0.615196 |
DPAI Vison | single model and iteration regression | extra data | 0 | 0.532302 |
TEAM1 | VGG-16 + Faster R-CNN | Imagenet DET dataset | 0 | 0.217933 |
TEAM1 | Ensemble of 6 models | Imagenet DET dataset | 0 | 0.207165 |
TEAM1 | Ensemble of 7 models | Imagenet DET dataset | 0 | 0.189227 |
Ordered by mean average precision
Team name | Entry description | Description of outside data used | mean AP | Number of object categories won |
NUIST | cascaded region regression + tracking | proposal network is finetuned from COCO | 0.79593 | 17 |
NUIST | cascaded region regression + tracking | proposal network is finetuned from COCO | 0.781144 | 5 |
ITLab-Inha | An ensemble for detection, MCMOT for tracking | pre-trained model from COCO detection, extra data collected by ourselves (100 images per class) | 0.731471 | 3 |
Trimps-Soushen | Ensemble 6 | Extra data from ImageNet dataset(out of the ILSVRC2016) | 0.720704 | 5 |
DPAI Vison | single model | extra data | 0.615196 | 0 |
DPAI Vison | single model and iteration regression | extra data | 0.532302 | 0 |
TEAM1 | VGG-16 + Faster R-CNN | Imagenet DET dataset | 0.217933 | 0 |
TEAM1 | Ensemble of 6 models | Imagenet DET dataset | 0.207165 | 0 |
TEAM1 | Ensemble of 7 models | Imagenet DET dataset | 0.189227 | 0 |
Task 3c: Object detection/tracking from video with provided training data
Team name | Entry description | mean AP |
CUVideo | 4-model ensemble | 0.558557 |
NUIST | cascaded region regression + tracking | 0.548781 |
CUVideo | Single GBD-Net | 0.526137 |
MCG-ICT-CAS | ResNet101+ResNet200 models for detection, optical flow for tracking, Coherent tubelet reclassification++, MDNet tracking | 0.488632
MCG-ICT-CAS | ResNet101+ResNet200 models for detection, optical flow for tracking, Coherent tubelet reclassification, MDNet tracking | 0.484771
MCG-ICT-CAS | ResNet101 models for detection, optical flow for tracking, Coherent tubelet reclassification, MDNet tracking | 0.462013
MCG-ICT-CAS | ResNet101 models for detection, optical flow for tracking, Coherent tubelet reclassification | 0.395057
MCG-ICT-CAS | ResNet101+ResNet200 models for detection, optical flow for tracking, Coherent tubelet reclassification | 0.393705
KAIST-SLSP | set 1 (ensemble with 2 models w/ various post-processing, including multiple object tracking w/ beta = 0.2) | 0.327421 |
CIGIT_Media | object detection from video with tracking | 0.229714 |
CIGIT_Media | adopt a new method for merging the scores of R-FCN and SSD detectors | 0.221176 |
F205_CV | a simple track with ssd_resnet101 | 0.164678 |
NUS_VISENZE | 17Sept_result_final_ss_ssd_resnet_nms_fused | 0.148463 |
NUS_VISENZE | test | 0.148463 |
F205_CV | a simple track with ssd_resnet101 with 0.1 confidence | 0.139524 |
F205_CV | a simple track with ssd_resnet101 with 0.2 confidence | 0.132039 |
NUS_VISENZE | fused 3 models with tracking | 0.112528 |
NUS_VISENZE | fused 3 models with tracking max 8 classes | 0.112524 |
BSC- UPC | This is the longest run without error. | 0.002263 |
BSC- UPC | This run had some errors; I am not sure it is complete. | 0.002263
Task 3d: Object detection/tracking from video with additional training data
Team name | Entry description | Description of outside data used | mean AP |
NUIST | cascaded region regression + tracking | proposal network is finetuned from COCO | 0.583898 |
ITLab-Inha | An ensemble for detection, MCMOT for tracking | pre-trained model from COCO detection, extra data collected by ourselves (100 images per class) | 0.490863 |
Scene Classification (Scene)
Team name | Entry description | Top-5 classification error |
Hikvision | Model D | 0.0901 |
Hikvision | Model E | 0.0908 |
Hikvision | Model C | 0.0939 |
Hikvision | Model B | 0.0948 |
MW | Model ensemble 2 | 0.1019 |
MW | Model ensemble 3 | 0.1019 |
MW | Model ensemble 1 | 0.1023 |
Hikvision | Model A | 0.1026 |
Trimps-Soushen | With extra data. | 0.103 |
Trimps-Soushen | Ensemble 2 | 0.1042 |
SIAT_MMLAB | 10 models fusion | 0.1043 |
SIAT_MMLAB | 7 models fusion | 0.1044 |
SIAT_MMLAB | fusion with softmax | 0.1044 |
SIAT_MMLAB | learning weights with cnn | 0.1044 |
SIAT_MMLAB | 6 models fusion | 0.1049 |
Trimps-Soushen | Ensemble 4 | 0.1049 |
Trimps-Soushen | Ensemble 3 | 0.105 |
MW | Single model B | 0.1073 |
MW | Single model A | 0.1076 |
NTU-SC | Product of 5 ensembles (top-5) | 0.1085 |
NTU-SC | Product of 3 ensembles (top-5) | 0.1086 |
NTU-SC | Sum of 3 ensembles (top-5) | 0.1086 |
NTU-SC | Sum of 5 ensembles (top-3) | 0.1086 |
NTU-SC | Single ensemble of 5 models (top-5) | 0.1088 |
NQSCENE | Four models | 0.1093 |
NQSCENE | Three models | 0.1101 |
Samsung Research America: General Purpose Acceleration Group | Simple Ensemble, 3 Inception v3 models w/various hyper param changes, 32 multi-crop (60.11 top-1, 88.98 top-5 on val) | 0.1113 |
fusionf | Fusion with average strategy (12 models) | 0.1115 |
fusionf | Fusion with scoring strategy (14 models) | 0.1117 |
fusionf | Fusion with average strategy (13 models) | 0.1118 |
YoutuLab | weighted average1 at scale level using greedy search | 0.1125 |
YoutuLab | weighted average at model level using greedy search | 0.1127 |
YoutuLab | weighted average2 at scale level using greedy search | 0.1129 |
fusionf | Fusion with scoring strategy (13 models) | 0.113 |
fusionf | Fusion with scoring strategy (12 models) | 0.1132 |
YoutuLab | simple average using models in entry 3 | 0.1139 |
Samsung Research America: General Purpose Acceleration Group | Model A0, weakly scaled, multi-crop. (59.61 top-1, 88.64 top-5 on val) | 0.1142 |
SamExynos | 3 model | 0.1143 |
Samsung Research America: General Purpose Acceleration Group | Ensemble B, 3 Inception v3 models w/various hyper param changes + Inception v4 res2, 128 multi-crop | 0.1152 |
YoutuLab | average on base models | 0.1162 |
NQSCENE | Model B | 0.117 |
Samsung Research America: General Purpose Acceleration Group | Model A2, weakly scaled, single-crop & mirror. (58.84 top-1, 88.09 top-5 on val) | 0.1188 |
NQSCENE | Model A | 0.1192 |
Samsung Research America: General Purpose Acceleration Group | Model A1, weakly scaled, single-crop. (58.65 top-1, 88.07 top-5 on val) | 0.1193 |
Trimps-Soushen | Ensemble 1 | 0.1196 |
Rangers | ensemble model 1 | 0.1208 |
SamExynos | single model | 0.121 |
Rangers | ensemble model 2 | 0.1212 |
Everphoto | ensemble by learned weights - 1 | 0.1213 |
Everphoto | ensemble by product strategy | 0.1218 |
Everphoto | ensemble by learned weights - 2 | 0.1218 |
Everphoto | ensemble by average strategy | 0.1223 |
MIPAL_SNU | Ensemble of two ResNet-50 with balanced sampling | 0.1232 |
KPST_VB | Model II | 0.1233 |
KPST_VB | Ensemble of Model I and II | 0.1235 |
Rangers | single model result of 69 | 0.124 |
Everphoto | ensemble by product strategy (without specialist models) | 0.1242 |
KPST_VB | Model II with adjustment | 0.125 |
KPST_VB | Model I | 0.1251 |
Rangers | single model result of 66 | 0.1253 |
KPST_VB | Ensemble of Model I and II with adjustment | 0.1253 |
SJTU-ReadSense | Ensemble 5 models with learnt weights | 0.1272 |
SJTU-ReadSense | Ensemble 5 models with weighted validation accuracies | 0.1273 |
iMCB | A combination of CNN models based on researched influential factors | 0.1277 |
SJTU-ReadSense | Ensemble 6 models with learnt weights | 0.1278 |
SJTU-ReadSense | Ensemble 4 models with learnt weights | 0.1287 |
iMCB | A combination of CNN models with a strategy w.r.t.validation accuracy | 0.1299 |
Choong | Based on VGG16, features are extracted from multiple layers. An ROI proposal network is not applied; every neuron from each feature layer is the center of an ROI candidate. | 0.131
SIIT_KAIST | 101-depth single model (val.error 12.90%) | 0.131 |
DPAI Vison | An ensemble model | 0.1355 |
isia_ICT | spectral clustering on confusion matrix | 0.1355 |
isia_ICT | fusion of 4 models with average strategy | 0.1357 |
NUIST | inception+shortcut CNN | 0.137 |
isia_ICT | MP_multiCNN_multiscale | 0.1372 |
NUIST | inception+shortcut CNN | 0.1381 |
Viz Insight | Multiple Deep Metaclassifiers | 0.1386 |
iMCB | FeatureFusion_2L | 0.1396 |
iMCB | FeatureFusion_3L | 0.1404 |
DPAI Vison | Single Model | 0.1425 |
isia_ICT | 2 models with size of 288 | 0.1433 |
Faceall-BUPT | A single model with 150crops | 0.1471 |
iMCB | A Single Model | 0.1506 |
SJTU-ReadSense | A single model (based on Inception-BN) trained on the Places365-Challenge dataset | 0.1511 |
OceanVision | A result obtained by VGG-16 | 0.1635 |
OceanVision | A result obtained by alexnet | 0.1867 |
OceanVision | A result obtained by googlenet | 0.1867 |
ABTEST | GoogLeNet model trained on the LSUN dataset and fine-tuned on Places2 | 0.3245
Vladimir Iglovikov | VGG16 trained on 128x128 | 0.3552 |
Vladimir Iglovikov | VGG19 trained on 128x128 | 0.3593 |
Vladimir Iglovikov | average of VGG16 and VGG19 trained on 128x128 | 0.3712 |
Vladimir Iglovikov | Resnet 50 trained on 128x128 | 0.4577 |
scnu407 | VGG16+4D lstm | 0.8831 |
Scene Parsing
Team name | Entry description | Average of mIoU and pixel accuracy |
SenseCUSceneParsing | ensemble more models on trainval data | 0.57205 |
SenseCUSceneParsing | dense ensemble model on trainval data | 0.5711 |
SenseCUSceneParsing | ensemble model on trainval data | 0.5705 |
SenseCUSceneParsing | ensemble model on train data | 0.5674 |
Adelaide | Multiple models, multiple scales, refined with CRFs | 0.56735 |
Adelaide | Multiple models, multiple scales | 0.56615 |
Adelaide | Single model, multiple scales | 0.5641 |
Adelaide | Multiple models, single scale | 0.5617 |
360+MCG-ICT-CAS_SP | fusing 152, 101, 200 layers front models with global context aggregation, iterative boosting and high resolution training | 0.55565 |
Adelaide | Single model, single scale | 0.5539 |
SenseCUSceneParsing | best single model on train data | 0.5538 |
360+MCG-ICT-CAS_SP | fusing 152, 101, 200 layers front models with global context aggregation, iterative boosting and high resolution training, some models adding local refinement network before fusion | 0.55335 |
360+MCG-ICT-CAS_SP | fusing 152, 101, 200 layers front models with global context aggregation, iterative boosting and high resolution training, some models adding local refinement network before and after fusion | 0.55215 |
360+MCG-ICT-CAS_SP | 152 layers front model with global context aggregation, iterative boosting and high resolution training | 0.54675 |
SegModel | ensemble of 5 models, bilateral filter, 42.7 mIoU on val set | 0.5465 |
SegModel | ensemble of 5 models,guided filter, 42.5 mIoU on val set | 0.5449 |
CASIA_IVA | casia_iva_model4:DeepLab, Multi-Label | 0.5433 |
CASIA_IVA | casia_iva_model3:DeepLab, OA-Seg, Multi-Label | 0.5432 |
CASIA_IVA | casia_iva_model5:Aug_data,DeepLab, OA-Seg, Multi-Label | 0.5425 |
NTU-SP | Fusion models from two source models (Train + TrainVal) | 0.53565 |
NTU-SP | 6 ResNet initialized models (models are trained from TrainVal) | 0.5354 |
NTU-SP | 8 ResNet initialized models + 2 VGG initialized models (with different bn statistics) | 0.5346 |
SegModel | ensemble by joint categories and guided filter, 42.7 on val set | 0.53445 |
NTU-SP | 8 ResNet initialized models + 2 VGG initialized models (models are trained from Train only) | 0.53435 |
NTU-SP | 8 ResNet initialized models + 2 VGG initialized models (models are trained from TrainVal) | 0.53435 |
Hikvision | Ensemble models | 0.53355 |
SegModel | ensemble by joint categories and bilateral filter, 42.8 on val set | 0.5332 |
ACRV-Adelaide | use DenseCRF | 0.5326 |
SegModel | single model, 41.3 mIoU on valset | 0.53225 |
DPAI Vison | different denseCRF parameters of 3 models(B) | 0.53065 |
Hikvision | Single model | 0.53055 |
ACRV-Adelaide | an ensemble | 0.53035 |
360+MCG-ICT-CAS_SP | baseline,152 layers front model with iterative boosting | 0.52925 |
CASIA_IVA | casia_iva_model2:DeepLab, OA-Seg | 0.52785 |
DPAI Vison | different denseCRF parameters of 3 models(C) | 0.52645 |
DPAI Vison | average ensemble of 3 segmentation models | 0.52575 |
DPAI Vison | different denseCRF parameters of 3 models(A) | 0.52575 |
CASIA_IVA | casia_iva_model1:DeepLab | 0.5243 |
SUXL | scene parsing network 5 | 0.52355 |
SUXL | scene parsing network | 0.52325 |
SUXL | scene parsing network 3 | 0.5224 |
ACRV-Adelaide | a single model | 0.5221 |
SUXL | scene parsing network 2 | 0.5212 |
SYSU_HCP-I2_Lab | cascade nets | 0.52085 |
SYSU_HCP-I2_Lab | DCNN with skipping layers | 0.5136 |
SYSU_HCP-I2_Lab | DeepLab_CRF | 0.51205 |
SYSU_HCP-I2_Lab | Pixel normalization networks | 0.5077 |
SYSU_HCP-I2_Lab | ResNet101 | 0.50715 |
S-LAB-IIE-CAS | Multi-Scale CNN + Bbox_Refine + FixHole | 0.5066 |
S-LAB-IIE-CAS | Combined with the results of other models | 0.50625 |
S-LAB-IIE-CAS | Multi-Scale CNN + Attention | 0.50515 |
NUS_FCRN | trained with training set and val set | 0.5006 |
NUS-AIPARSE | model3 | 0.4997 |
NUS_FCRN | trained with training set only | 0.49885 |
NUS-AIPARSE | model2 | 0.49855 |
F205_CV | Model fusion of ResNet101 and DilatedNet, with data augmentation and CRF, fine-tuned from places2 scene classification/parsing 2016 pretrained models. | 0.49805 |
F205_CV | Model fusion of ResNet101 and FCN, with data augmentation and CRF, fine-tuned from places2 scene classification/parsing 2016 pretrained models. | 0.4933 |
F205_CV | Model fusion of ResNet101, FCN and DilatedNet, with data augmentation and CRF, fine-tuned from places2 scene classification/parsing 2016 pretrained models. | 0.4899 |
Faceall-BUPT | 6 models finetuned by pre-trained fcn8s and dilatedNet with 3 different image sizes. | 0.4893
NUS-AIPARSE | model1 | 0.48915 |
Faceall-BUPT | We use six models finetuned by pre-trained fcn8s and dilatedNet with 3 different image sizes. The pixel-wise accuracy is 76.94% and the mean class-wise IoU is 0.3552. | 0.48905
F205_CV | Model fusion of ResNet101, FCN and DilatedNet, with data augmentation, fine-tuned from places2 scene classification/parsing 2016 pretrained models. | 0.48425 |
S-LAB-IIE-CAS | Multi-Scale CNN + Bbox_Refine | 0.4814 |
Faceall-BUPT | 3 models finetuned by pre-trained fcn8s with 3 different image sizes. | 0.4795
Faceall-BUPT | 3 models finetuned by pre-trained dilatedNet with 3 different image sizes. | 0.4793
F205_CV | Model fusion of FCN and DilatedNet, with data augmentation and CRF, fine-tuned from places2 scene classification/parsing 2016 pretrained models. | 0.47855 |
S-LAB-IIE-CAS | Multi-Scale CNN | 0.4757 |
Multiscale-FCN-CRFRNN | Multi-scale CRF-RNN | 0.47025 |
Faceall-BUPT | one model finetuned by pre-trained dilatedNet with image size 384*384. The pixel-wise accuracy is 75.14% and the mean class-wise IoU is 0.3291. | 0.46565
Deep Cognition Labs | Modified Deeplab Vgg16 with CRF | 0.41605 |
mmap-o | FCN-8s with classification | 0.39335 |
NuistParsing | SegNet+Smoothing | 0.3608 |
XKA | SegNet trained on ADE20k +CRF | 0.3603 |
VikyNet | Fine tuned version of ParseNet | 0.0549 |
VikyNet | Fine tuned version of ParseNet | 0.0549 |
Team information
Team name | Team members | Abstract |
360+MCG-ICT-CAS_SP | Rui Zhang (1,2), Min Lin (1), Sheng Tang (2), Yu Li (1,2), YunPeng Chen (3), YongDong Zhang (2), JinTao Li (2), YuGang Han (1), ShuiCheng Yan (1,3); (1) Qihoo 360, (2) Multimedia Computing Group, Institute of Computing Technology, Chinese Academy of Sciences (MCG-ICT-CAS), Beijing, China, (3) National University of Singapore (NUS) |
Technique Details for the Scene Parsing Task:
There are two core and general contributions in our scene parsing system: 1) a local-refinement network for object boundary refinement, and 2) an iterative-boosting network for overall parsing refinement. These two networks collaboratively refine the parsing results from two perspectives; the details are as follows. 1) Local-refinement network for object boundary refinement: this network takes the original image and the K object probability maps (one per class) as inputs, and outputs m*m feature maps indicating how each of the m*m neighbors propagates its probability vector to the center point for local refinement. In spirit it works similarly to bounding-box refinement in the object detection task, but here it locally refines the object boundary instead of the object bounding box. 2) Iterative-boosting network for overall parsing refinement: this network takes the original image and the K object probability maps (one per class) as inputs, and outputs refined probability maps for all classes. It iteratively boosts the parsing results in a global way. Two other tricks are also used: 1) Global context aggregation: the scene classification information may provide global context for the decision as well as capture the co-occurrence relationship between the scene and the objects/stuff in it. We therefore add features from an independent scene classification model trained on the ILSVRC 2016 Scene Classification dataset into our scene parsing system as context. 2) Multi-scale scheme: considering the limited amount of training data and the various scales of objects in different training samples, we use multi-scale data augmentation in both the training and inference stages. High-resolution models are also trained on magnified images to capture details and small objects. |
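To make the local-refinement step above concrete, here is a minimal NumPy sketch added for clarity (names such as local_refine, probs and weights are assumptions; this is not the team's code): each pixel's probability vector is replaced by a weighted combination of the probability vectors of its m x m neighbours, with the combination weights predicted by the local-refinement network.

import numpy as np

def local_refine(probs, weights, m=3):
    # probs:   (K, H, W) per-class probability maps
    # weights: (m*m, H, W) predicted propagation weights (assumed normalised per pixel)
    K, H, W = probs.shape
    pad = m // 2
    padded = np.pad(probs, ((0, 0), (pad, pad), (pad, pad)), mode="edge")
    refined = np.zeros_like(probs)
    idx = 0
    for dy in range(m):
        for dx in range(m):
            neighbour = padded[:, dy:dy + H, dx:dx + W]   # shifted copy of the maps
            refined += weights[idx][None, :, :] * neighbour
            idx += 1
    return refined

# toy usage: 4 classes on an 8x8 map, uniform 3x3 propagation weights
probs = np.random.rand(4, 8, 8); probs /= probs.sum(0, keepdims=True)
weights = np.full((9, 8, 8), 1.0 / 9.0)
refined = local_refine(probs, weights)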
360+MCG-ICT-CAS_DET | Yu Li (1,2), Sheng Tang (2), Min Lin (1), Rui Zhang (1,2), YunPeng Chen (3), YongDong Zhang (2), JinTao Li (2), YuGang Han (1), ShuiCheng Yan (1,3); (1) Qihoo 360, (2) Multimedia Computing Group, Institute of Computing Technology, Chinese Academy of Sciences (MCG-ICT-CAS), Beijing, China, (3) National University of Singapore (NUS) |
Technique Details for the Object Detection Task:
The new contributions of this system are three-fold: 1) implicit sub-categories of the background class, 2) a sink class when necessary, and 3) new semantic segmentation features.
For training: (1) Implicit sub-categories of the background class: in Faster R-CNN [1], the "background" class is treated as ONE class, on a par with the individual object classes, but it is quite diverse and impossible to describe as a single pattern. We therefore use K output nodes, namely K patterns, to implicitly represent sub-categories of the background class, which substantially improves the identification capability for the background class. (2) Sink class when necessary: it is often the case that the ground-truth class has low probability, and the result is then incorrect since the probabilities of all classes sum to 1. To address this issue and improve the chance that a ground-truth class with low probability wins, we add a so-called "sink" class, which takes some probability mass when the ground-truth class has low probability, making the other classes have even lower probabilities than the ground-truth class so that the ground truth wins. We also propose to use the sink class in the loss function only when necessary, namely when the ground-truth class is not in the top-k list. (3) New semantic segmentation features: on one hand, motivated by [2], we generate weakly supervised segmentation features which are used to train region proposal scoring functions and let the gradient flow among all branches. On the other hand, an independent segmentation model trained on the ILSVRC Scene Parsing dataset provides features for our detection network, which is supposed to bring in both stuff and object information for the decision. (4) Dilation as context: motivated by the dilated convolutions [3] widely used in segmentation, we introduce dilated convolutional layers (initialized as identity mappings) to obtain effective context for training.
For testing: we utilize box refinement, box voting, multi-scale testing, co-occurrence refinement, and model ensembling to benefit the inference stage.
References: [1] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in Neural Information Processing Systems. 2015. [2] Gidaris, Spyros, and Nikos Komodakis. "Object detection via a multi-region and semantic segmentation-aware CNN model." Proceedings of the IEEE International Conference on Computer Vision. 2015. [3] Yu, Fisher, and Vladlen Koltun. "Multi-scale context aggregation by dilated convolutions." International Conference on Learning Representations. 2016. |
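The "sink class when necessary" idea above can be illustrated with a small sketch. The NumPy snippet below is a hedged approximation added for clarity, with assumed names (sink_softmax_loss, sink_logit) and an assumed formulation; it is not the authors' implementation. An extra sink logit joins the softmax only when the ground-truth class has dropped out of the top-k, so that it can absorb probability mass from the competing classes.

import numpy as np

def sink_softmax_loss(logits, sink_logit, gt, k=5):
    # logits: (C,) class scores, sink_logit: scalar score of the sink class,
    # gt: index of the ground-truth class. Returns a cross-entropy loss value.
    topk = np.argsort(logits)[::-1][:k]
    if gt in topk:
        z = logits                          # usual softmax over the real classes
    else:
        z = np.append(logits, sink_logit)   # let the sink class absorb some mass
    z = z - z.max()                         # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return -log_probs[gt]

loss = sink_softmax_loss(np.random.randn(200), sink_logit=1.0, gt=3, k=5)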
ABTEST | Ankan Bansal | We have used a 22-layer GoogLeNet [1] model to classify scenes. The model was trained on the LSUN [2] dataset and then fine-tuned on the Places dataset for 365 categories. We did not use any intelligent data selection techniques. The network is simply trained on all the available data without considering the data distribution across classes.
Before training on LSUN, this network was trained using the Places205 dataset. The model was trained till it saturated at around 85% (Top-1) accuracy on the validation dataset of the LSUN challenge. Then the model was fine-tuned on the 365 categories in the Places2 challenge. We did not use the trained models provided by the organisers to initialise our network. References: [1] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. [2] Yu, Fisher, et al. "Lsun: Construction of a large-scale image dataset using deep learning with humans in the loop." arXiv preprint arXiv:1506.03365 (2015). |
ACRV-Adelaide | Guosheng Lin; Chunhua Shen; Anton van den Hengel; Ian Reid; Affiliations: ACRV; University of Adelaide |
Our method is based on multi-level information fusion. We generate multi-level representation of the input image and develop a number of fusion networks with different architectures.
Our models are initialized from the pre-trained residual nets [1] with 50 and 101 layers. Part of the network design in our system is inspired by the multi-scale network with pyramid pooling described in [2] and the FCN network in [3]. Our system achieves good performance on the validation set: the IoU score on the validation set is 40.3 when using a single model, which is clearly better than the reported results of the baseline methods in [4]. Applying DenseCRF [5] slightly improves the result. We are preparing a technical report on our method and it will be available on arXiv soon. References: [1] "Deep Residual Learning for Image Recognition", Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun; CVPR 2016. [2] "Efficient Piecewise Training of Deep Structured Models for Semantic Segmentation", Guosheng Lin, Chunhua Shen, Anton van den Hengel, Ian Reid; CVPR 2016. [3] "Fully convolutional networks for semantic segmentation", J. Long, E. Shelhamer, T. Darrell; CVPR 2015. [4] "Semantic Understanding of Scenes through ADE20K Dataset", B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba; arXiv:1608.05442. [5] "Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials", Philipp Krahenbuhl, Vladlen Koltun; NIPS 2012. |
Adelaide | Zifeng Wu, University of Adelaide; Chunhua Shen, University of Adelaide; Anton van den Hengel, University of Adelaide |
We have trained networks with different newly designed structures. One of them performs as well as the Inception-Residual-v2 network in the classification task. It was further tuned for several epochs on the Places365 dataset, which finally yielded even better results on the validation set of the segmentation task. As for the FCNs, we mostly followed the settings in our previous technical reports [1, 2]. The best result was obtained by combining FCNs initialized from two pre-trained networks.
[1] High-performance Semantic Segmentation Using Very Deep Fully Convolutional Networks. https://arxiv.org/abs/1604.04339 [2] Bridging Category-level and Instance-level Semantic Image Segmentation. https://arxiv.org/abs/1605.06885 |
ASTAR_VA | Romain Vial (VA Master Intern Student), Zhu Hongyuan (VA Scientist), Su Bolan (ex-ASTAR Scientist), Shijian Lu (VA Head) |
The problem of object detection from videos is an important part of computer vision that has yet to be solved. The diversity of scenes, together with the presence of movement, makes this task very challenging.
Our system localizes and recognizes objects of various scales, positions and classes. It takes into account spatial (local and global) and temporal information from several previous frames. The model has been trained on both the training and validation sets. We achieve a final score of 76.5% mAP on the validation set. |
BSC- UPC | Andrea Ferri | This is the result of my thesis: implementing a deep learning environment on a computational server and developing an object tracking in video system with TensorFlow, suitable for the ImageNet VID challenge. |
BUAA ERCACAT | Biao Leng (Beihang University), Guanglu Song (Beihang University), Cheng Xu (Beihang University), Jiongchao Jin (Beihang University), Zhang Xiong (Beihang University) |
Our group utilizes two image object detection architectures, namely Fast R-CNN [2] and Faster R-CNN [1], for the task of object detection. The Faster R-CNN detection system can be divided into two modules: an RPN (region proposal network), a fully convolutional network that proposes regions telling the detector where to focus in an image, and a Fast R-CNN detector that takes the region proposals and classifies the objects within them.
Our training model is based on the VGG_16 model, and we utilize a combined model for higher RPN recall. [1] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun, "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks," arXiv:1506.01497. [2] Ross Girshick, "Fast R-CNN: Fast Region-based Convolutional Networks for object detection," ICCV 2015. |
CASIA_IVA | Jun Fu, Jing Liu, Xinxin Zhu, Longteng Guo, Zhenwei Shen, Zhiwei Fang, Hanqing Lu | We implement image semantic segmentation based on the fused result of three deep models: DeepLab [1], OA-Seg [2], and the official public model of this challenge. DeepLab is trained within the ResNet-101 framework and is further improved with object proposals and multi-scale prediction combination. OA-Seg is trained with VGG, in which object proposals and multi-scale supervision are considered. We augment the training data with multi-scale and mirrored variants for both of the above models. We additionally employ multi-label image annotations to refine the segmentation results.
[1] Liang-Chieh Chen et al., "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs," arXiv:1606.00915, 2016. [2] Yuhang Wang et al., "Objectness-aware Semantic Segmentation," accepted by ACM Multimedia, 2016. |
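As an illustration of the fusion-and-refinement strategy above, here is a minimal NumPy sketch added for clarity (fuse_and_refine, the threshold rule and all names are assumptions, not the team's code): per-pixel class probabilities from several models are averaged, and classes rejected by an image-level multi-label classifier are suppressed before taking the per-pixel argmax.

import numpy as np

def fuse_and_refine(prob_maps, image_label_scores, label_thresh=0.3):
    # prob_maps: list of (K, H, W) arrays, one per segmentation model
    # image_label_scores: (K,) image-level multi-label scores in [0, 1]
    fused = np.mean(np.stack(prob_maps, axis=0), axis=0)   # average the models
    keep = image_label_scores >= label_thresh               # multi-label prior
    fused[~keep] = 0.0                                      # drop rejected classes
    return fused.argmax(axis=0)                             # per-pixel label map

maps = [np.random.rand(5, 16, 16) for _ in range(3)]
labels = fuse_and_refine(maps, np.array([0.9, 0.1, 0.6, 0.05, 0.4]))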
Choong | Choong Hwan Choi (KAIST) | Ensemble of deep learning models based on VGG16 & ResNet. Based on VGG16, features are extracted from multiple layers. An ROI proposal network is not applied; every neuron from each feature layer is the center of an ROI candidate.
References: [1] Liu, Wei, et al. "SSD: Single Shot MultiBox Detector." [2] K. Simonyan, A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition." [3] Kaiming He, et al., "Deep Residual Learning for Image Recognition." |
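The ROI-candidate scheme described above (every neuron of a feature layer acting as the centre of candidate boxes, in the spirit of SSD default boxes [1]) can be sketched as follows. This NumPy snippet is an illustration under assumed parameters (default_boxes, the chosen scale and aspect ratios), not the submitted model.

import numpy as np

def default_boxes(feat_size, img_size, scale, aspect_ratios=(1.0, 2.0, 0.5)):
    # feat_size: (fh, fw) of one feature layer; img_size: input size in pixels;
    # scale: box scale in pixels for this layer.
    # Returns (fh*fw*len(aspect_ratios), 4) boxes as (cx, cy, w, h) in pixels.
    fh, fw = feat_size
    step_y, step_x = img_size / fh, img_size / fw
    boxes = []
    for i in range(fh):
        for j in range(fw):
            cx, cy = (j + 0.5) * step_x, (i + 0.5) * step_y   # neuron centre
            for ar in aspect_ratios:
                w, h = scale * np.sqrt(ar), scale / np.sqrt(ar)
                boxes.append((cx, cy, w, h))
    return np.array(boxes)

boxes = default_boxes((38, 38), img_size=300, scale=30)   # 38*38*3 candidates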
CIGIT_Media | Youji Feng, Jiangjing Lv, Xiaohu Shao, Pengcheng Liu, Cheng Cheng; Chongqing Institute of Green and Intelligent Technology, Chinese Academy of Sciences |
We present a simple method combining still image object detection and object tracking for the ImageNet VID task. Object detection is first performed on each frame of the video, and the detected targets are then tracked through the nearby frames. Each tracked target is also assigned a detection score by the object detector. According to the scores, non-maximum suppression (NMS) is applied to all the detected and tracked targets on each frame to obtain the VID results. To improve the performance, we actually employ two state-of-the-art detectors for still image object detection, i.e. the R-FCN detector and the SSD detector. We run the above steps for both detectors independently and combine the respective results into the final ones through NMS.
[1] J. Dai, Y. Li, K. He, and J. Sun. R-FCN: Object Detection via Region-based Fully Convolutional Networks. arXiv 2016. [2] W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C. Fu, and A. Berg. SSD: Single Shot MultiBox Detector. arXiv 2016. [3] K. Kang, W. Ouyang, H. Li, and X. Wang. Object Detection from Video Tubelets with Convolutional Neural Networks. CVPR 2016. |
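The merging step described above (pooling the per-frame outputs of the R-FCN and SSD detectors and applying NMS) can be sketched as follows. This is a minimal NumPy illustration with an assumed data layout (nms, merge_frame, boxes as (x1, y1, x2, y2) arrays); it is not the team's code, and the real pipeline would also carry the tracked targets and their detection scores.

import numpy as np

def nms(boxes, scores, iou_thresh=0.5):
    # boxes: (N, 4) as (x1, y1, x2, y2); scores: (N,). Returns indices to keep.
    order = scores.argsort()[::-1]
    keep = []
    while order.size:
        i = order[0]; keep.append(i)
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0, xx2 - xx1) * np.maximum(0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        area_o = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (boxes[order[1:], 3] - boxes[order[1:], 1])
        iou = inter / (area_i + area_o - inter)
        order = order[1:][iou <= iou_thresh]
    return keep

def merge_frame(dets_a, dets_b):
    # dets_*: (boxes, scores) from the two detectors for one frame and one class.
    boxes = np.vstack([dets_a[0], dets_b[0]])
    scores = np.concatenate([dets_a[1], dets_b[1]])
    kept = nms(boxes, scores)
    return boxes[kept], scores[kept]

a = (np.array([[0., 0., 10., 10.]]), np.array([0.9]))
b = (np.array([[1., 1., 11., 11.]]), np.array([0.8]))
boxes, scores = merge_frame(a, b)   # the overlapping lower-scored box is suppressed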
CIL | Seongmin Kang, Seonghoon Kim, Yusun Lim, Kibum Bae, Heungwoo Han |
Our model is based on Faster R-CNN [1].
A pre-activation residual network [2] trained on the ILSVRC 2016 dataset is modified for the detection task. Heavy data augmentation is applied. OHEM [3] and atrous convolution are also applied. Everything is implemented in TensorFlow with multi-GPU training [4]. To meet the deadline, the detection model was trained for only 1/3 of the training epochs we had planned. [1] Shaoqing Ren et al., "Faster R-CNN: Towards real-time object detection with region proposal networks," NIPS 2015. [2] Kaiming He et al., "Identity Mappings in Deep Residual Networks," ECCV 2016. [3] Abhinav Shrivastava et al., "Training Region-based Object Detectors with Online Hard Example Mining," CVPR 2016. [4] Martín Abadi et al., "TensorFlow: Large-scale machine learning on heterogeneous systems," 2015. Software available from tensorflow.org. |
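As a rough illustration of the OHEM step mentioned above, the NumPy sketch below (added for clarity, with assumed names such as select_hard_rois and an assumed batch size) keeps only the highest-loss RoIs for the backward pass; the full method of [3] also removes near-duplicate RoIs with NMS before the selection.

import numpy as np

def select_hard_rois(per_roi_loss, batch_size=128):
    # per_roi_loss: (N,) loss of every candidate RoI from a read-only forward pass.
    # Returns the indices of the batch_size hardest RoIs; only these get gradients.
    return np.argsort(per_roi_loss)[::-1][:batch_size]

losses = np.random.rand(2000)           # e.g. losses of 2000 proposals
hard_idx = select_hard_rois(losses)     # backprop only through these RoIs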
CU-DeepLink | Major team members: Xingcheng Zhang ^1, Zhizhong Li ^1, Yang Shuo ^1, Yuanjun Xiong ^1, Yubin Deng ^1, Xiaoxiao Li ^1, Kai Chen ^1, Yingrui Wang ^2, Chen Huang ^1, Tong Xiao ^1, Wanshen Feng ^2, Xinyu Pan ^1, Yunxiang Ge ^1, Hang Song ^1, Yujun Shen ^1, Boyang Deng ^1, Ruohui Wang ^1. Supervisors: Dahua Lin ^1, Chen Change Loy ^1, Wenzhi Liu ^2, Shengen Yan ^2. (1 - Multimedia Lab, The Chinese University of Hong Kong; 2 - SenseTime Inc.) |
Our efforts are divided into two relatively independent directions, namely classification and localization. Specifically, the classification framework would predict five distinct class labels for each image, while the localization framework would produce bounding boxes, one for each predicted class label.
Classification ------------------ Our classification framework is built on top of Google's Inception-ResNet-v2 (IR-v2) [1]. We combined several important techniques, which together leads to substantial performance gain. 1. We developed a novel building block, called “PolyInception”. Each PolyInception can be considered as a meta-module that integrates multiple inception modules via K-way polynomial composition. In this way, we substantially improve a module's expressive power. Also, to facilitate the propagation of gradients across a very deep network, we retain an identity path [2] for each PolyInception. 2. At the core of our framework is the Grand Models. Each grand model comprises three sections operating on different spatial resolutions. Each section is a stack of multiple PolyInception modules. To achieve optimal overall performance (within a certain computational budget), we rebalance the number of modules across the sections. 3. Most of our grand models contain over 500 layers. Whereas they demonstrate remarkable model capacity, we observed notable overfitting at later stage of the training process. To overcome this difficulty, we adopted Stochastic Depth [3] for regularization. 4. We trained 20+ Grand Models, some deeper and others wider. These models constitute a performant yet diverse ensemble. The single most powerful Grand Model reached a top-5 classification error at 4.27%(single corp) on the validation set. 5. Given each image, the class label predictions are produced in two steps. First, multiple crops at 8 scales are generated. Predictions are respectively made on these crops, which are subsequently combined via a novel scheme called selective pooling. The multi-crop predictions generated by individual models are finally integrated to reach the final prediction. In particular, we explored two different integration strategies, namely ensemble-net (a two-layer neural-network designed to integrate predictions) and class-dependent model reweighting. With these ensemble techniques, we reached a top-5 classification error below 2.8% on the validation set. Localization ----------------- Our localization framework is a pipeline comprised of Region Proposal Networks (RPN) and R-CNN models. 1. We trained two RPNs with different design parameters based on ResNet. 2. Given an image, 300 bounding box proposals are derived based on the RPNs, using multi-scale NMS pooling. 3. We also trained four R-CNN models, respectively based on ResNet-101, ResNet-269, Extended IR-v2, and one of our Grand Models. These R-CNNs are used to predict how likely a bounding box belongs to each class as well as to refine the bounding box (via bounding box regression). 4. The four RCNN models form an ensemble. Their predictions (on both class scores and refined bounding boxes) are integrated via average pooling. Given a class label, the refined bounding box with highest score corresponding to that class is used as the result. Deep Learning Framework ----------------- Both our classification and localization frameworks are implemented using Parrots, a new Deep Learning framework developed internally by ourselves (from scratch). Parrots is featured with a highly scalable distributed training scheme, a memory manager that supports dynamic memory reuse, and a parallel preprocessing pipeline. With this framework, the training time is substantially reduced. Also, with the same GPU memory capacity, much larger networks can be accommodated. 
References ----------------- [1] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning". arXiv:1602.07261. 2016. [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. "Identity Mappings in Deep Residual Networks". arXiv:1603.05027. 2016. [3] Gao Huang, Yu Sun, Zhuang Liu, Daniel Sedra, Kilian Weinberger. "Deep Networks with Stochastic Depth". arXiv:1603.09382. 2016. |
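As a rough illustration of the K-way polynomial composition behind the PolyInception block described above, here is a minimal PyTorch-style sketch of a 2nd-order variant (identity + F + F∘F). The stand-in operator and the composition order are assumptions for illustration only, not the authors' Parrots implementation.

```python
# Minimal sketch of a 2nd-order PolyInception block: y = x + F(x) + F(F(x)).
# `inception_op` is a stand-in for a real Inception module; this is an
# illustrative assumption, not the authors' actual implementation.
import torch
import torch.nn as nn

class Poly2Inception(nn.Module):
    def __init__(self, inception_op: nn.Module):
        super().__init__()
        self.f = inception_op  # shared Inception operator F

    def forward(self, x):
        fx = self.f(x)          # first-order term F(x)
        ffx = self.f(fx)        # second-order term F(F(x))
        return x + fx + ffx     # identity path retained for gradient flow

# toy usage with a 1x1 conv standing in for an Inception module
block = Poly2Inception(nn.Conv2d(64, 64, kernel_size=1))
y = block(torch.randn(2, 64, 32, 32))   # shape preserved: (2, 64, 32, 32)
```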
CUImage | Wanli Ouyang, Junjie Yan, Xingyu Zeng, Hongsheng Li, Tong Xiao, Kun Wang, Xin Zhu, Yucong Zhou, Yu Liu, Buyu Li, Zhiwei Fang, Changbao Wang, Zhe Wang, Hui Zhou, Liping Zhang, Xingcheng Zhang, Zhizhong Li, Hongyang Li, Ruohui Wang, Shengen Yan, Dahua Lin, Xiaogang Wang | Compared with CUImage submission in ILSVRC 2015, the new components are as follows.
(1) The models are pretrained for the 1000-class object detection task using the approach in [a], but adapted to Fast R-CNN for faster detection speed.
(2) The region proposals are obtained using the improved version of CRAFT in [b].
(3) A 269-layer gated bi-directional network (GBD-Net) [c] is fine-tuned on the 200 detection classes; it passes messages between features from different support regions during both feature learning and feature extraction. GBD-Net is found to bring ~3% mAP improvement over the baseline 269-layer model and ~5% mAP improvement over the batch-normalized GoogLeNet.
(4) To handle the long-tail distribution problem, the 200 classes are clustered. Unlike the original implementation in [d], which learns several models, a single model is learned in which different clusters have both shared and distinct feature representations.
(5) An ensemble of models built with the approaches above leads to the final result in the provided-data track.
(6) For the external-data track, we propose object detection with landmarks. Compared to the standard bounding-box-centric approach, our landmark-centric approach provides more structural information and can be used to improve both the localization and the classification step of object detection. Based on the landmark annotations provided in [e], we annotate 862 landmarks for the 200 categories on the training set. We then use them to train a CNN regressor to predict the landmark positions and visibility for each proposal in test images. In the classification step, we use landmark pooling on top of the fully convolutional network, where features around each landmark are mapped to a confidence score for the corresponding category. The landmark-level classification can be naturally combined with standard bounding-box-level classification to obtain the final detection result.
(7) An ensemble of models built with the approaches above leads to the final result in the external-data track.
Our work is supported by the fastest publicly available multi-GPU Caffe code [f].
[a] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C. Loy, X. Tang, "DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection," CVPR 2015.
[b] B. Yang, J. Yan, Z. Lei, S. Z. Li, "CRAFT Objects from Images," CVPR 2016.
[c] X. Zeng, W. Ouyang, B. Yang, J. Yan, X. Wang, "Gated Bi-directional CNN for Object Detection," ECCV 2016.
[d] W. Ouyang, X. Wang, C. Zhang, X. Yang, "Factors in Finetuning Deep Model for Object Detection with Long-tail Distribution," CVPR 2016.
[e] W. Ouyang, H. Li, X. Zeng, X. Wang, "Learning Deep Representation with Large-scale Attributes," ICCV 2015.
[f] https://github.com/yjxiong/caffe |
CUVideo | Hongsheng Li*, Kai Kang* (* indicates equal contribution), Wanli Ouyang, Junjie Yan, Tong Xiao, Xingyu Zeng, Kun Wang, Xihui Liu, Qi Chu, Junming Fan, Yucong Zhou, Yu Liu, Ruohui Wang, Shengen Yan, Dahua Lin, Xiaogang Wang
The Chinese University of Hong Kong, SenseTime Group Limited |
We utilize several deep neural networks with different structures for the VID task.
(1) The models are pretrained for the 200-class detection task using the approach in [a], but adapted to Fast R-CNN for faster detection speed.
(2) The region proposals are obtained by a separately trained ResNet-269 model.
(3) A 269-layer GBD network [b] is fine-tuned on the 200 detection classes of the DET task and then on the 30 classes of the VID task. It passes messages between features from different support regions during both feature learning and feature extraction. GBD-Net is found to bring ~3% mAP improvement over the baseline 269-layer model.
(4) Based on the detection boxes of individual frames, tracklet proposals are efficiently generated by trained bounding box regressors. An LSTM network is integrated into the network to learn temporal appearance variation.
(5) Multi-context suppression and motion-guided propagation from [c] are utilized to post-process the per-frame detection results; they give a ~3.5% mAP improvement on the validation set (a rough sketch of motion-guided propagation is given below).
(6) An ensemble of models built with the approaches above leads to the final result in the provided-data track.
(7) For the VID with tracking task, we modified an online multiple object tracking algorithm [d]. The tracking-by-detection algorithm utilizes our per-frame detection results and generates tracklets for different objects.
Our work is supported by the fastest publicly available multi-GPU Caffe code [e].
[a] W. Ouyang, X. Wang, X. Zeng, S. Qiu, P. Luo, Y. Tian, H. Li, S. Yang, Z. Wang, C. Loy, X. Tang, "DeepID-Net: Deformable Deep Convolutional Neural Networks for Object Detection," CVPR 2015.
[b] X. Zeng, W. Ouyang, B. Yang, J. Yan, X. Wang, "Gated Bi-directional CNN for Object Detection," ECCV 2016.
[c] K. Kang, H. Li, J. Yan, X. Zeng, B. Yang, T. Xiao, C. Zhang, Z. Wang, R. Wang, X. Wang, W. Ouyang, "T-CNN: Tubelets with Convolutional Neural Networks for Object Detection from Videos," arXiv:1604.02532.
[d] J. H. Yoon, C.-R. Lee, M.-H. Yang, K.-J. Yoon, "Online Multi-Object Tracking via Structural Constraint Event Aggregation," CVPR 2016.
[e] https://github.com/yjxiong/caffe |
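For the motion-guided propagation step in [c] above, a minimal NumPy sketch of the basic idea follows: a detection box is shifted into a neighbouring frame by the mean optical flow inside it, keeping its score, and is then merged with that frame's own detections. The flow format (H x W x 2) and the helper name are assumptions for illustration, not CUVideo's code.

```python
# Sketch of motion-guided propagation: shift a detection box to an adjacent
# frame by the mean optical flow inside the box. Assumes `flow` is an
# H x W x 2 array of per-pixel (dx, dy); illustrative only.
import numpy as np

def propagate_box(box, flow):
    # box: [x1, y1, x2, y2] in the source frame
    x1, y1, x2, y2 = [int(round(v)) for v in box]
    region = flow[max(y1, 0):y2, max(x1, 0):x2]
    if region.size == 0:               # degenerate box: leave it unchanged
        return np.asarray(box, dtype=float)
    dx = float(region[..., 0].mean())
    dy = float(region[..., 1].mean())
    return np.array([box[0] + dx, box[1] + dy, box[2] + dx, box[3] + dy])

# propagated boxes keep the score from the source frame and are merged with
# the target frame's detections (e.g. by NMS) in a post-processing pass.
```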
Deep Cognition Labs | Mandeep Kumar, Deep Cognition Labs
Krishna Kishore, Deep Cognition Labs Rajendra Singh, Deep Cognition Labs |
We present results for the scene parsing task, acquired using a modified DeepLab VGG-16 network together with a CRF. |
DEEPimagine | Sung-soo Park(DEEPimagine corp.)
Hyoung-jin Moon(DEEPimagine corp.) Contact email : sspark@deepimagine.com |
1. Model design
- Wide Residual SWAPOUT network
- Inception Residual SWAPOUT network
- We focused on model multiplicity with many shallow networks
- We adopted a SWAPOUT architecture (see the sketch below)
2. Ensemble
- Fully convolutional dense crop
- Variant-parameter model ensemble
[1] "Swapout: Learning an ensemble of deep architectures", Saurabh Singh, Derek Hoiem, David Forsyth.
[2] "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning", Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi.
[3] "Deep Residual Learning for Image Recognition", Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. |
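A minimal sketch of the Swapout unit from [1] as we understand it: each residual unit computes y = Θ1⊙x + Θ2⊙F(x) with independent Bernoulli masks, which subsumes dropout and stochastic depth. The test-time expectation shown here is one common deterministic approximation; treat the whole block as an assumption rather than the team's exact architecture.

```python
# Swapout unit sketch [1]: y = theta1 * x + theta2 * F(x), with element-wise
# Bernoulli masks sampled during training. Illustrative only.
import torch
import torch.nn as nn

class Swapout(nn.Module):
    def __init__(self, residual_branch: nn.Module, p_skip=0.5, p_res=0.5):
        super().__init__()
        self.f = residual_branch            # the usual residual transform F(x)
        self.p_skip, self.p_res = p_skip, p_res

    def forward(self, x):
        fx = self.f(x)
        if self.training:
            theta1 = torch.bernoulli(torch.full_like(x, self.p_skip))
            theta2 = torch.bernoulli(torch.full_like(fx, self.p_res))
            return theta1 * x + theta2 * fx
        # deterministic test-time approximation: use the expected masks
        return self.p_skip * x + self.p_res * fx
```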
DeepIST | Heechul Jung*(DGIST/KAIST), Youngsoo Kim*(KAIST), Byungju Kim(KAIST), Jihun Jung(DGIST), Junkwang Kim(DGIST), Junho Yim(KAIST), Min-Kook Choi(DGIST), Yeakang Lee(KAIST), Soon Kwon(DGIST), Woo Young Jung(DGIST), Junmo Kim(KAIST)
* indicates equal contribution. |
We use nine networks in total: one 200-layer ResNet, one Inception-ResNet-v2, one Inception-v3, two 212-layer ResNets, and four Branched-ResNets.
The networks are trained for 95 epochs, except Inception-ResNet-v2 and Inception-v3. Ensemble A takes an average of one 212-layer ResNet, two Branched-ResNets and one Inception-ResNet-v2. Ensemble B takes a weighted sum over one 212-layer ResNet, two Branched-ResNets and one Inception-ResNet-v2 (the averaging/weighting schemes are sketched below). Ensemble C takes an average of one 200-layer ResNet, two 212-layer ResNets, two Branched-ResNets, one Inception-v3 and one Inception-ResNet-v2; it achieves a top-5 error rate of 3.16% on 20,000 validation images. Ensemble D averages the results of all nine networks. We submit only classification results. References: [1] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015). [2] He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv preprint arXiv:1603.05027 (2016). [3] Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. "Inception-v4, inception-resnet and the impact of residual connections on learning." arXiv preprint arXiv:1602.07261 (2016). [4] Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." arXiv preprint arXiv:1512.00567 (2015). [5] Sermanet, Pierre, et al. "Overfeat: Integrated recognition, localization and detection using convolutional networks." arXiv preprint arXiv:1312.6229 (2013). [6] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015). Acknowledgement - DGIST was funded by the Ministry of Science, ICT and Future Planning. - KAIST was funded by Hanwha Techwin CO., LTD. |
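The difference between Ensemble A (plain average) and Ensemble B (weighted sum) above comes down to how the per-model class probabilities are combined; a minimal sketch is below. The weight values are placeholders, not the team's tuned weights.

```python
# Combining per-model class probabilities: plain average (Ensemble A style)
# vs. weighted sum (Ensemble B style). Weights here are placeholders.
import numpy as np

def average_ensemble(prob_list):
    # prob_list: list of (num_images, num_classes) softmax outputs
    return np.mean(prob_list, axis=0)

def weighted_ensemble(prob_list, weights):
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()          # normalize weights to sum to 1
    stacked = np.stack(prob_list, axis=0)      # (num_models, N, C)
    return np.tensordot(weights, stacked, axes=1)

# toy usage: 4 models, 4 images, 1000 classes
probs = [np.random.dirichlet(np.ones(1000), size=4) for _ in range(4)]
top5 = np.argsort(-weighted_ensemble(probs, [0.3, 0.3, 0.2, 0.2]), axis=1)[:, :5]
```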
DGIST-KAIST | Heechul Jung(DGIST/KAIST), Jihun Jung(DGIST), Junkwang Kim(DGIST), Min-Kook Choi(DGIST), Soon Kwon(DGIST), Junmo Kim(KAIST), Woo Young Jung(DGIST) | We use an ensemble of state-of-the-art architectures [1,2,3,4], as follows:
[1] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015). [2] He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv preprint arXiv:1603.05027 (2016). [3] Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. "Inception-v4, inception-resnet and the impact of residual connections on learning." arXiv preprint arXiv:1602.07261 (2016). [4] Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." arXiv preprint arXiv:1512.00567 (2015). We train five deep neural networks: two 212-layer ResNets, one 224-layer ResNet, one Inception-v3, and one Inception-ResNet-v2. The models are linearly combined by a weighted sum of class probabilities, with the weights tuned on the validation set to obtain an appropriate contribution from each model. - This work was funded by the Ministry of Science, ICT and Future Planning. |
DPAI Vison | Object detection: Chris Li, Savion Zhao, Bin Liu, Yuhang He, Lu Yang, Cena Liu
Scene classification: Lu Yang, Yuhang He, Cena Liu, Bin Liu, Bo Yu Scene parsing: Bin Liu, Lu Yang, Yuhang He, Cena Liu, Bo Yu, Chris Li, Xiongwei Xia Object detection from video: Bin Liu, Cena Liu, Savion Zhao, Yuhang He, Chris Li |
Object detection: Our method is based on Faster R-CNN plus an extra classifier. (1) Data processing: data equalization by deleting many examples from the three dominating classes (person, dog, and bird), and adding extra data for classes with fewer than 1,000 training images; (2) COCO pre-training; (3) iterative bounding box regression + multi-scale (train/test) + random image flipping (train/test); (4) multi-model ensemble: ResNet-101 and Inception-v3; (5) an extra 200-class classifier, which helps to improve recall and refine the detection scores of the final boxes.
[1] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[J]. arXiv preprint arXiv:1512.03385, 2015. [2] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards real-time object detection with region proposal networks[C]//Advances in neural information processing systems. 2015: 91-99.
Scene classification: We trained the models with Caffe [1], using an ensemble of Inception-V3 [2] and Inception-V4 [3]; four models are integrated in total. The top-1 error on validation is 0.431 and the top-5 error is 0.129. The single model is a modified Inception-V3 [2], with a top-1 error of 0.434 and a top-5 error of 0.133 on validation. [1] Jia, Yangqing and Shelhamer, Evan and Donahue, Jeff and Karayev, Sergey and Long, Jonathan and Girshick, Ross and Guadarrama, Sergio and Darrell, Trevor. Caffe: Convolutional Architecture for Fast Feature Embedding. arXiv preprint arXiv:1408.5093. 2014. [2] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna. Rethinking the inception architecture for computer vision. arXiv preprint arXiv:1512.00567, 2015. [3] C. Szegedy, S. Ioffe, V. Vanhoucke. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. arXiv preprint arXiv:1602.07261, 2016.
Scene parsing: We trained 3 models on a modified DeepLab [1] (Inception-v3, ResNet-101, ResNet-152) and only used the ADEChallengeData2016 [2] data. Multi-scale testing, image cropping, image flipping and contrast transformations are used for data augmentation, and a dense CRF is used as post-processing to refine object boundaries. Combining the 3 models on validation achieved 0.3966 mIoU and 0.7924 pixel accuracy. [1] L. Chen, G. Papandreou, I. K.; Murphy, K.; and Yuille, A. L. 2016. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs. In arXiv preprint arXiv:1606.00915. [2] B. Zhou, H. Zhao, X. P. S. F. A. B., and Torralba, A. 2016. Semantic understanding of scenes through the ade20k dataset. In arXiv preprint arXiv:1608.05442.
Object detection from video: Our method is based on Faster R-CNN plus an extra classifier. We train Faster R-CNN based on ResNet-101 with the provided training data. We also train an extra 30-class classifier, which helps to improve recall and refine the detection scores of the final boxes. |
DPFly | Savion DP.co | We fine-tune the detection models using the DET training set and the val1 set. The val2 set is used for validation.
Data balancing: some categories have many more images than others, so we process the initial data so that the number of images per category is nearly equal.
Model: ResNet-101 + Faster R-CNN. The networks are pre-trained on the 1000-class ImageNet classification set and fine-tuned on the DET data.
Box refinement: in Faster R-CNN, the final output is a regressed box that differs from its proposal box. At inference time we pool a new feature from the regressed box and obtain a new classification score and a new regressed box. We combine these 300 new predictions with the original 300 predictions, and apply non-maximum suppression (NMS) on the union set of predicted boxes using an IoU threshold of 0.3 (see the sketch below).
Multi-scale testing: we compute conv feature maps on an image pyramid, where the image's shorter side is 300, 450 or 600 pixels.
Multi-scale anchors: we add two anchor scales to the original Faster R-CNN anchor scales.
Test-time flipping: we flip each image and combine the results with those of the original image. |
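The box-refinement step above merges the 300 original and 300 regressed predictions and then applies NMS at IoU 0.3; a standard greedy NMS sketch is shown below (a generic implementation, not DPFly's code).

```python
# Greedy non-maximum suppression over the union of original and regressed
# boxes, with the IoU threshold of 0.3 described above. Generic sketch only.
import numpy as np

def nms(boxes, scores, iou_thresh=0.3):
    # boxes: (N, 4) as [x1, y1, x2, y2]; scores: (N,)
    x1, y1, x2, y2 = boxes.T
    areas = (x2 - x1) * (y2 - y1)
    order = scores.argsort()[::-1]          # process boxes by descending score
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(i)
        xx1 = np.maximum(x1[i], x1[order[1:]])
        yy1 = np.maximum(y1[i], y1[order[1:]])
        xx2 = np.minimum(x2[i], x2[order[1:]])
        yy2 = np.minimum(y2[i], y2[order[1:]])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        iou = inter / (areas[i] + areas[order[1:]] - inter)
        order = order[1:][iou <= iou_thresh]  # drop boxes overlapping the kept one
    return keep
```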
Everphoto | Yitong Wang, Zhonggan Ding, Zhengping Wei, Linfu Wen
Everphoto |
Our method is based on DCNN approaches.
We use 5 models with different input scales and different network structures as basic models. They are derived from GoogleNet, VGGNet and ResNet. We also utilize the idea of dark knowledge [1] to train several specialist models, and use these specialist models to reassign probability scores and refine the basic outputs. Our final results are based on the ensemble of refined outputs. [1] Hinton G, Vinyals O, Dean J. Distilling the knowledge in a neural network[J]. arXiv preprint arXiv:1503.02531, 2015. |
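The "dark knowledge" idea in [1], used by Everphoto to train specialist models, matches a student to softened teacher probabilities; a minimal PyTorch sketch of the temperature-scaled distillation loss is given below. The temperature and mixing weight are placeholder hyperparameters, and this is not the team's actual training code.

```python
# Temperature-scaled distillation loss [1]: the student matches the teacher's
# softened probabilities in addition to the ground-truth labels.
# T and alpha are placeholder hyperparameters.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    soft_teacher = F.softmax(teacher_logits / T, dim=1)
    log_soft_student = F.log_softmax(student_logits / T, dim=1)
    # the KL term is scaled by T^2 so its gradients match the hard-label loss
    soft_loss = F.kl_div(log_soft_student, soft_teacher, reduction="batchmean") * T * T
    hard_loss = F.cross_entropy(student_logits, labels)
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```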
F205_CV | Cheng Zhou
Li Jiancheng, Lin Zhihui, Lin Zhiguan, Yang Dali. All are from Tsinghua University, Graduate School at Shenzhen, Lab F205, China |
Our team has five student members from Tsinghua University, Graduate School at Shenzhen, Lab F205, China. We participated in two sub-tasks of the ILSVRC 2016 & COCO challenge: Scene Parsing and Object Detection from Video. This is the first time we have taken part in this competition.
Two of the members focused on Scene Parsing. They mainly applied several model-fusion algorithms to well-known and effective CNN models such as ResNet [1], FCN [2] and DilatedNet [3, 4], and used a CRF to capture more context and improve classification accuracy and mean IoU. Since the images are large, they are downsampled before being fed to the network. In addition, we used vertical mirroring for data augmentation. The Places2 scene classification 2016 pretrained model was used to fine-tune ResNet-101 and FCN, while DilatedNet was fine-tuned from the Places2 scene parsing 2016 pretrained model [5]. Late fusion and a CRF were applied at the end. For object detection from video, the biggest challenge is that there are more than 2 million high-resolution frames in total. We did not consider Fast R-CNN-like models [6], as they require much more training and testing time, so we chose SSD [7], which is an effective and efficient framework for object detection. We used ResNet-101 as the base model, although it is slower than VGGNet [8]. At test time it achieves about 10 FPS on a single GTX TITAN X GPU; however, there are more than 700 thousand frames in the test set, which cost a lot of time. For the tracking task, we have a dynamic adjustment algorithm, but it needs a ResNet-101 model to score each patch and runs at less than 1 FPS, so we could not apply it to the test set. For the submission, we used a simple method to filter noisy proposals and track the objects. References: [1] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[J]. arXiv preprint arXiv:1512.03385, 2015. [2] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015: 3431-3440. [3] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:1606.00915, 2016. [4] F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016. [5] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso and A. Torralba. arXiv:1608.05442. [6] Girshick R. Fast R-CNN[C]//Proceedings of the IEEE International Conference on Computer Vision. 2015: 1440-1448. [7] Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector[J]. arXiv preprint arXiv:1512.02325, 2015. [8] Simonyan K, Zisserman A. Very deep convolutional networks for large-scale image recognition[J]. arXiv preprint arXiv:1409.1556, 2014. |
Faceall-BUPT | Xuankun HUANG, BUPT, CHINA
Jiangqi ZHANG, BUPT, CHINA Zhiqun HE, BUPT, CHINA Junfei ZHUANG, BUPT, CHINA Zesang HUANG, BUPT, CHINA Yongqiang Yao, BUPT, CHINA Kun HU, BUPT, CHINA Fengye XIONG, BUPT, CHINA Hongliang BAI, Beijing Faceall co., LTD Wenjian FENG, Beijing Faceall co., LTD Yuan DONG, BUPT, CHINA |
# Classification/Localization
We trained ResNet-101, ResNet-152 and Inception-v3 for object classification. Multi-view testing and a model ensemble are utilized to generate the final classification results. For the localization task, we trained a Region Proposal Network to generate proposals for each image, and we fine-tuned two models with object-level annotations of the 1,000 classes; a background class is also added to the network. Test images are then segmented into 300 regions by the RPN, and these regions are classified by the fine-tuned model into one of 1,001 classes. The final bounding box is generated by merging the bounding rectangles of three regions.
# Object detection
We utilize Faster R-CNN with the publicly available ResNet-101. Beyond the baseline, we adopt multi-scale RoIs to obtain features containing richer context information. For testing, we use 3 scales and merge the results using the simple strategy introduced last year. No validation data is used for training, and flipped images are used in only a third of the training epochs.
# Object detection from video
We use Faster R-CNN with ResNet-101, as in the object detection task. One fifth of the images are tested with 2 scales. No tracking techniques are used because of some mishaps.
# Scene classification
We trained a single Inception-v3 network with multi-scale training and tested with 150 multi-view crops. On validation the top-5 error is about 14.56%.
# Scene parsing
We trained 6 models with network structures inspired by FCN-8s and DilatedNet at 3 scales (256, 384, 512). We then test with flipped images using the pre-trained FCN-8s and DilatedNet. The pixel-wise accuracy is 76.94% and the mean class-wise IoU is 0.3552. |
fusionf | Nina Narodytska (Samsung Research America)
Shiva Kasiviswanathan (Samsung Research America) Hamid Maei (Samsung Research America) |
We used several modifications of modern CNNs, including VGG[1], GoogleNet[2,4], and ResNet[3]. We used several fusion strategies,
including a standard averaging and scoring scheme. We also used different subsets of models in different submissions. Training was performed on the low-resolution dataset. We used balanced loading to account for the different number of images in each class. [1] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. [2] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, A. Rabinovich. Going Deeper with Convolutions. [3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. [4] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. |
Future Vision | Gautam Kumar Singh(independent)
Kunal Kumar Singh(independent) Priyanka Singh(independent) |
Future Vision Project (based on MatConvNet)
========================================
This is an extremely simple CNN model which may not be competitive by ILSVRC standards. Our major goal was to get a working CNN model that can later be enhanced to work efficiently at ILSVRC scale. This project ran on the following configuration: processor: Intel Core i3-4005U CPU @ 1.70GHz (4 CPUs), RAM: 4GB. As we had no advanced hardware resources such as GPUs or high-speed CPUs, cuDNN could not be used either. So we could not train on the full dataset and were forced to use this very simple shallow model. We also discarded about 90% of the data; only 10% of the data was used in this project, split equally into training and test data.
Places2 validation data: NOT USED.
Places2 training data: 90% discarded; the remaining 10% was divided equally into 'train' and 'test' splits used for training and testing in this project.
The output text file on the Places2 test images could not be produced, as we faced some technical difficulties and ran out of time.
Reference: we referenced this project: http://www.cc.gatech.edu/~hays/compvision/proj6
Future Vision Team: Gautam Kumar Singh, Kunal Kumar Singh, Priyanka Singh |
Hikvision | Qiaoyong Zhong*, Chao Li, Yingying Zhang(#), Haiming Sun*, Shicai Yang*, Di Xie, Shiliang Pu (* indicates equal contribution)
Hikvision Research Institute (#)ShanghaiTech University, work is done at HRI |
[DET]
Our work on object detection is based on Faster R-CNN. We design and validate the following improvements:
* Better network. We find that the identity-mapping variant of ResNet-101 is superior to the original version for object detection.
* Better RPN proposals. A novel cascade RPN is proposed to refine proposal scores and locations. A constrained negative/positive anchor ratio further increases proposal recall dramatically.
* Pretraining matters. We find that a pretrained global context branch increases mAP by over 3 points. Pretraining on the 1000-class LOC dataset further increases mAP by ~0.5 point.
* Training strategies. To address the class imbalance problem, we design a balanced sampling strategy over the different classes. With balanced sampling, the provided negative training data can be safely added for training. Other training strategies, such as multi-scale training and online hard example mining, are also applied.
* Testing strategies. During inference, multi-scale testing, horizontal flipping and weighted box voting are applied (a sketch of box voting is given below). The final mAP is 65.1 (single model) and 67 (ensemble of 6 models) on val2.
[CLS-LOC]
A combination of 3 Inception networks and 3 residual networks is used to make the class prediction. For localization, the same Faster R-CNN configuration described above for DET is applied. The top-5 classification error rate is 3.46%, and the localization error is 8.8% on the validation set.
[Scene]
For the scene classification task, drawing support from our newly built M40-equipped GPU clusters, we have trained more than 20 models with various architectures, such as VGG, Inception, ResNet and different variants of them, over the past two months. Fine-tuning very deep residual networks (ResNet-101/152/200) from pre-trained ImageNet models did not perform as well as we expected, while Inception-style networks achieved better performance in considerably less training time in our experiments. Based on this observation, deep Inception-style networks and not-so-deep residual networks have been used. Besides, we have made several improvements for training and testing. First, a new data augmentation technique is proposed to better utilize the information of the original images. Second, a new learning rate schedule is adopted. Third, label shuffling and label smoothing are used to tackle the class imbalance problem. Fourth, some small tricks are used to improve performance in the test phase. Finally, we achieved a very good top-5 error rate, below 9% on the validation set.
[Scene Parsing]
We utilize a fully convolutional network transferred from the VGG-16 net, with a module called the mixed context network and a refinement module appended to the end of the net. The mixed context network is constructed from a stack of dilated convolutions and skip connections. The refinement module generates predictions by making use of the output of the mixed context network and feature maps from early layers of the FCN. The predictions are then fed into a sub-network designed to simulate a message-passing process. Compared with the baseline, our first major improvement is the mixed context network, which we find provides better features for dealing with stuff, big objects and small objects all at once. The second improvement is a memory-efficient sub-network that simulates message passing. The proposed system can be trained end-to-end.
On validation set, the mean iou of our system is 0.4099 (single model) and 0.4156 (ensemble of 3 models), and the pixel accuracy is 79.80% (single model) and 80.01% (ensemble of 3 models). References [1] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015. [2] Shrivastava, Abhinav, Abhinav Gupta, and Ross Girshick. "Training region-based object detectors with online hard example mining." arXiv preprint arXiv:1604.03540 (2016). [3] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015). [4] He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv preprint arXiv:1603.05027 (2016). [5] Ioffe, Sergey, and Christian Szegedy. "Batch normalization: Accelerating deep network training by reducing internal covariate shift." arXiv preprint arXiv:1502.03167 (2015). [6] Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." arXiv preprint arXiv:1512.00567 (2015). [7] Szegedy, Christian, Sergey Ioffe, and Vincent Vanhoucke. "Inception-v4, inception-resnet and the impact of residual connections on learning." arXiv preprint arXiv:1602.07261 (2016). [8] F. Yu and V. Koltun, "Multi-scale context aggregation by dilated convolutions," in ICLR, 2016. [9] J. Long, E. Shelhamer, and T. Darrell, "Fully convolutional networks for semantic segmentation," in CVPR, 2015. [10] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. Torr, "Conditional random fields as recurrent neural networks," in ICCV, 2015. [11] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, A. Yuille, "DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs", arXiv:1606.00915, 2016. [12] P. O. Pinheiro, T. Lin, R. Collobert, P. Dollar, "Learning to Refine Object Segments", arXiv:1603.08695, 2016. |
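For the weighted box-voting step in Hikvision's DET testing strategy above, a minimal sketch of one common formulation follows: each NMS-surviving box is replaced by the score-weighted average of all candidate boxes that overlap it above an IoU threshold. The threshold and weighting are assumptions, not the team's exact settings.

```python
# Weighted box voting sketch: refine each kept box with the score-weighted
# average of overlapping candidate boxes (IoU > vote_thresh). Assumes
# kept_boxes is a subset of all_boxes; illustrative only.
import numpy as np

def box_voting(kept_boxes, all_boxes, all_scores, vote_thresh=0.5):
    ax1, ay1, ax2, ay2 = all_boxes.T
    areas = (ax2 - ax1) * (ay2 - ay1)
    voted = []
    for box in kept_boxes:
        xx1 = np.maximum(box[0], ax1); yy1 = np.maximum(box[1], ay1)
        xx2 = np.minimum(box[2], ax2); yy2 = np.minimum(box[3], ay2)
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        box_area = (box[2] - box[0]) * (box[3] - box[1])
        iou = inter / (box_area + areas - inter)
        sel = iou > vote_thresh                  # overlapping candidates vote
        w = all_scores[sel]
        voted.append((w[:, None] * all_boxes[sel]).sum(0) / w.sum())
    return np.array(voted)
```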
Hitsz_BCC | Qili Deng, Yifan Gu, Mengdie Chu, Shuai Wu, Yong Xu
Harbin Institute of Technology, Shenzhen |
We combined a residual learning framework with the Single Shot MultiBox Detector (SSD) for object detection. When using ResNet-152, we fixed all batch-normalization layers as well as conv1 and conv2_x in the ResNet. Inspired by HyperNet, we exploit multi-layer features to detect objects.
Reference: [1] He K, Zhang X, Ren S, et al. Deep residual learning for image recognition[J]. arXiv preprint arXiv:1512.03385, 2015. [2] Kong T, Yao A, Chen Y, et al. HyperNet: Towards Accurate Region Proposal Generation and Joint Object Detection[J]. arXiv preprint arXiv:1604.00600, 2016. [3] Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector[J]. arXiv preprint arXiv:1512.02325, 2015. |
hustvision | Xinggang Wang, Huazhong University of Science and Technology
Kaibin Chen, Huazhong University of Science and Technology |
We propose a very fast and accurate object detection method based on deep neural networks. The core of the method is an object detection loss layer named ConvBox, which directly regresses object bounding boxes. The ConvBox loss layer can be plugged into any deep neural network. In this competition, we choose GoogLeNet as the base network. Running on an Nvidia GTX 1080, training on ILSVRC 2016 takes about one day, and testing speed is about 60 fps. In the training images of this competition, many positive instances are not labelled. To deal with this problem, the proposed ConvBox loss tolerates hard negatives, which improves detection performance to some extent. |
iMCB | *Yucheng Xing,
*Yufeng Zhang, Zhiqin Chen, Weichen Xue, Haohua Zhao, Liqing Zhang @Shanghai Jiao Tong University (SJTU) (* indicates equal contribution) |
In this competition, we submit five entries.
The first model is a single model, which achieved a 15.24% top-5 error on the validation set. It is an Inception-V3 [1] model that is modified and trained on both the Challenge and Standard datasets [2]. At test time, images are resized to 337x337 and a 12-crop scheme is used to obtain the 299x299 inputs to the model, which contributes to the improvement in performance. The second model is a fusion-feature model (FeatureFusion_2L), which achieved a 13.74% top-5 error on the validation set. It is a two-layer fusion-feature network whose input is the combination of fully-connected-layer features extracted from several well-performing CNNs (i.e. pretrained models [3] such as ResNet, VGG and GoogLeNet); it turns out to be effective in reducing the error rate (a sketch is given below). The third model is also a fusion-feature network (FeatureFusion_3L), which achieved a 13.95% top-5 error on the validation set. Compared with the second model, it is a three-layer fusion-feature network containing two fully-connected layers. The fourth is a combination of CNN models weighted by validation accuracy, which achieved a 13% top-5 error on the validation set. It combines the probabilities produced by the softmax layers of three CNNs, where the weight of each CNN is determined by its validation accuracy. The fifth is a combination of CNN models based on tuned weighting factors, which achieved a 12.65% top-5 error on the validation set. Six CNNs are taken into consideration: four of them (Inception-V2, Inception-V3, FeatureFusion_2L and FeatureFusion_3L) are trained by us and the other two are pretrained. The weighting factors of these models are tuned through extensive experiments. [1] Szegedy, Christian, et al. "Rethinking the Inception Architecture for Computer Vision." arXiv preprint arXiv:1512.00567 (2015). [2] B. Zhou, A. Khosla, A. Lapedriza, A. Torralba and A. Oliva. "Places: An Image Database for Deep Scene Understanding." arXiv, 2016. [3] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba and A. Oliva. "Learning Deep Features for Scene Recognition using Places Database." Advances in Neural Information Processing Systems 27 (NIPS), 2014. |
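A minimal sketch of the two-layer fusion-feature idea (FeatureFusion_2L) described above: fully-connected-layer features from several pretrained CNNs are concatenated and fed to a small classifier. The feature dimensions and hidden size are placeholders, not the team's configuration.

```python
# FeatureFusion_2L-style sketch: concatenate FC features from several
# pretrained CNNs and classify with a small two-layer network.
# Dimensions are placeholders, not the team's configuration.
import torch
import torch.nn as nn

class FeatureFusion2L(nn.Module):
    def __init__(self, feat_dims=(2048, 4096, 1024), hidden=1024, num_classes=365):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(sum(feat_dims), hidden),  # fusion layer over concatenated features
            nn.ReLU(inplace=True),
            nn.Dropout(0.5),
            nn.Linear(hidden, num_classes),     # scene class scores
        )

    def forward(self, feats):
        # feats: list of per-model feature tensors, each of shape (batch, dim_i)
        return self.net(torch.cat(feats, dim=1))

model = FeatureFusion2L()
logits = model([torch.randn(8, 2048), torch.randn(8, 4096), torch.randn(8, 1024)])
```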
isia_ICT | Xinhan Song, Institute of Computing Technology
Chengpeng Chen, Institute of Computing Technology Shuqiang jiang, Institute of Computing Technology |
For convenience, we use the 4 provided models as our basic models, which are used for the subsequent fine-tuning and network adaptation. Besides, considering the non-uniform class distribution and the enormous number of images in the Challenge dataset, we only use the Standard dataset for all the following steps.
First, we fuse these models with an averaging strategy as the baseline. Then, we add an SPP layer to VGG16 and ResNet152 respectively, to enable the models to be fed with larger-scale images. After fine-tuning these models, we also fuse them with the averaging strategy, and we only submit the result for size 288. We also perform spectral clustering on the confusion matrix computed from the validation data to obtain 20 clusters, which means the 365 classes are grouped into 20 clusters mainly according to their co-relationship (a sketch is given below). To classify the classes within the same cluster more precisely, we train an extra classifier for each cluster, implemented by fine-tuning the networks with all layers fixed except the fc8 layer and finally combining them into a single network. D. Yoo, S. Park, J. Lee and I. Kweon. "Multi-scale pyramid pooling for deep convolutional representation". In CVPR Workshop 2015 |
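The class-grouping step described above can be sketched as follows: symmetrize the validation confusion matrix into an affinity matrix and run spectral clustering to obtain 20 clusters. The library choice (scikit-learn) and the symmetrization are illustrative assumptions, not the team's exact procedure.

```python
# Spectral clustering of the 365 classes from a validation confusion matrix.
# The symmetrization and scikit-learn usage are illustrative assumptions.
import numpy as np
from sklearn.cluster import SpectralClustering

def cluster_classes(confusion, n_clusters=20):
    # confusion: (C, C) counts; treat mutual confusion as class affinity
    affinity = confusion + confusion.T
    np.fill_diagonal(affinity, 0)
    affinity = affinity / (affinity.max() + 1e-12)
    sc = SpectralClustering(n_clusters=n_clusters, affinity="precomputed",
                            random_state=0)
    return sc.fit_predict(affinity)   # cluster id for each of the C classes

# toy usage with a random confusion matrix
labels = cluster_classes(np.random.randint(0, 50, size=(365, 365)))
```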
ITLab-Inha | Byungjae Lee, Inha University,
Songguo Jin, Inha University, Enkhbayar Erdenee, Inha University, Mi Young Nam, NaeulTech, Young Giu Jung, NaeulTech, Phill Kyu Rhee, Inha University. |
We propose a robust multi-class multi-object tracking (MCMOT) method formulated in a Bayesian framework [1]. Multi-object tracking for unlimited object classes is conducted by combining detection responses with a changing point detection (CPD) algorithm. The CPD model is used to detect abrupt or abnormal changes caused by drift and occlusion, based on the spatiotemporal characteristics of track states.
The object detector ensemble is based on Faster R-CNN [2], using VGG16 [3] and ResNet [4] adaptively. For parameter optimization, a POMDP-based parameter learning approach is adopted, as described in our previous work [5]. [1] “Multi-Class Multi-Object Tracking using Changing Point Detection”, Byungjae Lee, Enkhbayar Erdenee, Songguo Jin, Phill Kyu Rhee. arXiv 2016. [2] “Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks”, Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. TPAMI 2016. [3] “Very Deep Convolutional Networks for Large-Scale Image Recognition”, Karen Simonyan, Andrew Zisserman. arXiv 2015. [4] “Deep Residual Learning for Image Recognition”, Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. CVPR 2016. [5] “Adaptive Visual Tracking using the Prioritized Q-learning Algorithm: MDP-based Parameter Learning Approach”, Sarang Khim, Sungjin Hong, Yoonyoung Kim, Phill Kyu Rhee. Image and Vision Computing 2014. |
KAIST-SLSP | Sunghun Kang*(KAIST)
Jae Hyun Lim*(ETRI) Houjeung Han(KAIST) Donghoon Lee(KAIST) Junyeong Kim(KAIST) Chang D. Yoo(KAIST) (* indicates equal contribution) |
For both image and video object detection, the Faster R-CNN detection algorithm proposed by Shaoqing Ren et al. is integrated with the various other state-of-the-art key techniques described below, implemented in Torch. For both image and video, we apply post-processing techniques such as box refinement and classification rescoring via a global context feature; the rescoring combines the global context feature with the feature outputs and is conducted per prediction. To further enhance detection performance for video, classification probabilities within tracklets obtained by multiple object tracking were re-scored by combining feature responses weighted over various combinations of tracklet lengths. Our architecture is based on an ensemble of several independently trained architectures. The Faster R-CNN based on a deep residual net is implemented to be trained end-to-end, and for inference, model ensembling and box refinement are integrated into the two Faster R-CNN architectures.
For both image and video object detection, the following three key components (1-3) that include three post-processing techniques (pp1-3) are integrated in torch for end-to-end learning and inferencing: (1) Deep residual net[1] (2) Faster-R-CNN[2] with end2end training (3) post-processing (pp1) box refinement[3] (pp2) model ensemble (pp3) classification re-scoring via SVM using global context features For only video object detection, the following post-processing techniques (pp4-5) are additionally included in conjunction with the above three post-processing techniques: (3) post-processing (pp4) multiple object tracking[4] (pp5) tracklets re-scoring [1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun, "Deep Residual Learning for Image Recognition", IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016 [2] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun, "Faster {R-CNN}: Towards Real-Time Object Detection with Region Proposal Networks", Advances in Neural Information Processing Systems (NIPS), 2015 [3] Spyros Gidaris and Nikos Komodakis, "Object detection via a multi-region & semantic segmentation-aware CNN model", International Conference on Computer Vision (ICCV), 2015 [4] Hamed Pirsiavash, Deva Ramanan, and Charless C.Fowlkes, “Globally-optimal greedy algorithms for tracking a variable number of objects,” IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011 |
KAISTNIA_ETRI | Keun Dong Lee(ETRI)
Seungjae Lee(ETRI) Yunhun Jang(KAIST) Hankook Lee(KAIST) Hyung Kwan Son(ETRI) Jinwoo Shin(KAIST) |
For the localization task, we use a variant of Faster R-CNN with ResNet, where the overall training procedure is similar to that in [1]. For the classification task, we used an ensemble of ResNet and GoogLeNet [2] with various data augmentations. We then recursively obtained attention regions in the input images to adjust the localization outputs, which are further tuned by class-dependent regression models.
[1] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015. [2] Szegedy, C., Ioffe, S., Vanhoucke, V.: Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv:1602.07261, 2016. |
KPST_VB | Nguyen Hong Hanh
Seungjae Lee Junhyeok Lee |
In this work, we used a ResNet-200 pre-trained on ImageNet [1] and retrained the network on the Places365 Challenge data (256 by 256). We also estimated a scene probability using the output of the pretrained ResNet-200 and the scene vs. object (ImageNet 1000-class) distribution on the training data. For classification, we used an ensemble of two networks with multiple crops, adjusted by the scene probability.
[1] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint arXiv:1512.03385, 2015. *Our work was performed with a deep learning analysis tool (Deep SDK by KPST). |
Lean-T | Yuechao Gao, Nianhong Liu, Sen Li @ Tsinghua University | For the object detection task, our detectors are based on Faster R-CNN [1]. We used a pre-trained VGG16 [2] to initialize the net. We used Caffe [3] to train our model, and only 230K iterations were conducted. The DET images provided as negative training data were not used.
[1]Ren, S., He, K., Girshick, R., & Sun, J. (2015). Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in neural information processing systems (pp. 91-99). [2]K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556, 2014. [3]Jia, Yangqing, et al. "Caffe: Convolutional architecture for fast feature embedding." Proceedings of the 22nd ACM international conference on Multimedia. ACM, 2014. |
LZDTX | Liu Yinan (Independent Individual);
Zhou Yusong (Beijing University of Posts and Telecommunications); Deng Guanghui (Beijing University of Technology); Tuqiang (Beijing Institute of Technology); Xing Zongheng (University of Science & Technology Beijing); |
This year we focus on the Object Detection task because it is widely used in our projects and in other areas such as self-driving, robotics and image analysis. Researchers have long been trying to find a real-time object detection algorithm with relatively high accuracy. However, most proposed algorithms are proposal-based and turn the detection task into a classification task by classifying proposals selected from the image. Sliding windows are a widely used approach, but they produce too many proposals. In recent years some methods have tried to combine traditional proposal selection with deep learning, such as R-CNN, and some have tried to accelerate the feature extraction process, such as Fast R-CNN and Faster R-CNN, but these are still too slow for most real-time applications. Recently, proposal-free detection methods such as YOLO and SSD have been proposed; they are much faster than proposal-based methods. The drawback of YOLO and SSD is that they perform poorly on small objects, because both directly map a box from the image to the target object. To overcome this drawback, we add a deconvolutional structure to the SSD network. The basic idea of our network structure is to enlarge the output feature maps; we believe larger feature maps can provide more detailed prediction information and better cover small target objects. We use deconvolutional layers to enlarge the SSD feature maps: we add 3 deconvolutional layers to the basic SSD network, and these layers output predictions just like the other SSD extra layers. Experimental results show that our deconv-SSD network improves on the 300x300 SSD baseline on the ILSVRC 2016 validation set and on PASCAL VOC. We submit one model with input size 300x300. We train our model on an Nvidia Titan GPU with a batch size of 32.
[1] Uijlings J R R, Sande K E A V D, Gevers T, et al. Selective Search for Object Recognition[J]. International Journal of Computer Vision, 2013, 104(2):154-171. [2] Russakovsky O, Deng J, Su H, et al. ImageNet Large Scale Visual Recognition Challenge[J]. International Journal of Computer Vision, 2015, 115(3):211-252. [3] Simonyan K, Zisserman A. Very Deep Convolutional Networks for Large-Scale Image Recognition[J]. Computer Science, 2015. [4] Sermanet P, Eigen D, Zhang X, et al. OverFeat: Integrated Recognition, Localization and Detection using Convolutional Networks[J]. Eprint Arxiv, 2013. [5] Girshick R. Fast R-CNN[J]. Computer Science, 2015. [6] Ren S, He K, Girshick R, et al. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks.[J]. IEEE Transactions on Pattern Analysis & Machine Intelligence, 2015:1-1. [7] Redmon J, Divvala S, Girshick R, et al. You Only Look Once: Unified, Real-Time Object Detection[J]. Computer Science, 2015. [8] Liu W, Anguelov D, Erhan D, et al. SSD: Single Shot MultiBox Detector[J]. Computer Science, 2015. |
MCC | Lei You, Harbin Institute of Technology Shenzhen Graduate School
Yang Zhang, Harbin Institute of Technology Shenzhen Graduate School Lingzhi Fu, Harbin Institute of Technology Shenzhen Graduate School Tianyu Wang, Harbin Institute of Technology Shenzhen Graduate School Huamen He, Harbin Institute of Technology Shenzhen Graduate School Yuan Wang, Harbin Institute of Technology Shenzhen Graduate School |
We combined and modified ResNet and Faster R-CNN for image classification, then constructed several detection models for target localization according to the classification results, and finally integrated the two steps to obtain the final results.
Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Neural Information Processing Systems (NIPS), 2015 K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. CVPR, 2016. Russell Stewart, Mykhaylo Andriluka. End-to-end people detection in crowded scenes. CVPR, 2016. |
MCG-ICT-CAS | Sheng Tang (Corresponding email: ts@ict.ac.cn),
Bin Wang, JunBin Xiao, Yu Li, YongDong Zhang, JinTao Li Multimedia Computing Group,Institute of Computing Technology,Chinese Academy of Sciences (MCG-ICT-CAS), Beijing, China |
Technique Details for the Object Detection from Video (VID) Task:
For this year's VID task, our primary contribution is a novel tracking framework based on two complementary kinds of tubelet generation methods, which focus on precision and recall respectively, followed by a novel tubelet merging method. Under this framework, our main contributions are two-fold: (1) Tubelet generation based on detection and tracking: we propose to sequentialize the detection bounding boxes of the same object with different tracking methods to form two complementary kinds of tubelets. One is to use the detection bounding boxes to refine optical-flow-based tracking for precise tubelet generation. The other is to integrate the detection bounding boxes with multi-target tracking based on MDNet to recall missing tubelets. (2) Overlapping and successive tubelet fusion: based on the above two complementary tubelet generation methods, we propose a novel and effective union method to merge two overlapping tubelets, and a concatenation method to merge two successive tubelets, which improves the final AP by a substantial margin. Three other tricks are also used: (1) Non-co-occurrence filtering: based on the co-occurrence relationships mined from the training dataset, we filter out detections that have lower scores and whose categories do not co-occur with those of the objects with the highest detection scores. (2) Coherent reclassification: after generating the object tubelets based on detection results and optical flow, we propose a coherent reclassification method to obtain coherent categories throughout a tubelet. (3) Efficient multi-target tracking with MDNet: we first choose an anchor frame and exploit adjacent-frame information to determine reliable anchor targets for efficient tracking. We then track each anchor target with an MDNet tracker in parallel. Finally, we use still-image detection results to recall missing tubelets. In our implementation, we use Faster R-CNN [1] with ResNet [2] for still-image detection, and optical flow [3] and MDNet [4] for tracking. References: [1] Ren S, He K, Girshick R, Sun J. “Faster R-CNN: Towards real-time object detection with region proposal networks”, NIPS 2015: 91-99. [2] He K, Zhang X, Ren S, Sun J. “Deep residual learning for image recognition”, CVPR 2016. [3] Kang K, Ouyang W, Li H, Wang X. “Object Detection from Video Tubelets with Convolutional Neural Networks”, CVPR 2016. [4] Nam H, Han B. “Learning multi-domain convolutional neural networks for visual tracking”, CVPR 2016. |
MIL_UT | Kuniaki Saito
Shohei Yamamoto Masataka Yamaguchi Yoshitaka Ushiku Tatsuya Harada all of members are from University of Tokyo |
We used Faster R-CNN [1] as our basic detection system.
We implemented Faster R-CNN based on ResNet-152 and ResNet-101 [2], using ResNet models pretrained on the 1000 classes. We placed the Region Proposal Network after conv4 in both models and froze the weights before conv3 during training. We trained these models with the end-to-end training procedure and used Online Hard Example Mining [3]: we chose the top 64 proposals with the largest loss out of 128 proposals when computing the loss (a sketch is given below). Our submission is an ensemble of the Faster R-CNN on ResNet-152 and ResNet-101. To ensemble these models, we shared region proposals between the two models and merged the separately computed proposals and scores. Our result scored 54.3 mAP on the validation set. [1] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." NIPS 2015. [2] He, Kaiming, et al. "Deep residual learning for image recognition." CVPR 2016. [3] Shrivastava, Abhinav, Abhinav Gupta, and Ross Girshick. "Training region-based object detectors with online hard example mining." CVPR 2016. |
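The online hard example mining step above (keep the 64 highest-loss RoIs out of 128) can be sketched as below: per-RoI losses are computed without reduction, and only the top-k hard examples contribute to the gradient. This is a generic illustration, not the team's training code.

```python
# OHEM sketch [3]: keep only the top-k highest-loss RoIs (64 of 128 here)
# when computing the classification loss. Generic illustration only.
import torch
import torch.nn.functional as F

def ohem_classification_loss(roi_logits, roi_labels, keep=64):
    # roi_logits: (num_rois, num_classes); roi_labels: (num_rois,)
    per_roi_loss = F.cross_entropy(roi_logits, roi_labels, reduction="none")
    keep = min(keep, per_roi_loss.numel())
    hard_loss, _ = torch.topk(per_roi_loss, keep)   # hardest examples by loss
    return hard_loss.mean()

# toy usage: 128 RoIs, 200 classes + background
logits = torch.randn(128, 201, requires_grad=True)
loss = ohem_classification_loss(logits, torch.randint(0, 201, (128,)))
loss.backward()  # gradients flow only through the 64 selected RoIs
```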
MIPAL_SNU | Sungheon Park and Nojun Kwak (Graduate School of Convergence Science and Technology, Seoul National University) | We trained two ResNet-50 [1] networks. One network used 7x7 mean pooling, and the other used multiple mean poolings with various sizes and positions. We also used a balanced sampling strategy similar to [2] to deal with the imbalanced training set.
[1] He, Kaiming, et al. "Deep residual learning for image recognition." CVPR, 2016. [2] Shen, Li, et al. "Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks." arXiv, 2015. |
mmap-o | Qi Zheng, Wuhan University
Cheng Tong, Wuhan University Xiang Li, Wuhan University |
We use Fully Convolutional Networks [1] with the VGG 16-layer net to parse the scene images. The model is adopted with an 8-pixel-stride net.
The initial results contain some labels irrelevant to the scene. Some high-confidence labels are exploited to group the images into different scenes and remove the irrelevant labels. Here we use a data-driven classification strategy to refine the results. [1] Long J, Shelhamer E, Darrell T. Fully convolutional networks for semantic segmentation[C]// IEEE Conference on Computer Vision and Pattern Recognition. 2015:1337-1342. |
Multiscale-FCN-CRFRNN | Shuai Zheng, Oxford
Anurag Arnab, Oxford Philip Torr, Oxford |
This submission is trained based on Conditional Random Fields as Recurrent Neural Networks (described in Zheng et al., ICCV 2015), with a multi-scale training pipeline. Our base model is built on ResNet-101, which is pre-trained only on ImageNet. The model is then built within a Fully Convolutional Network (FCN) structure and fine-tuned only on the MIT Scene Parsing dataset, using a multi-scale training pipeline similar to Farabet et al. 2013. Finally, this FCN-ResNet101 model is combined with CRF-RNN and trained in an end-to-end pipeline. |
MW | Gang Sun (Institute of Software, Chinese Academy of Sciences)
Jie Hu (Peking University) |
We leverage the theory named CNA [1] (capacity and necessity analysis) to guide the design of CNNs. We add more layers on the larger feature maps (e.g., 56x56) to increase capacity, and remove some layers on the smaller feature maps (e.g., 14x14) to avoid ineffective architectures. We have verified the effectiveness on the models in [2], ResNet-like models [3], and Inception-ResNet-like models [4]. In addition, we also use cropped patches from the original images as training samples, selecting random areas and aspect ratios. To improve generalization, we prune the model weights periodically. Moreover, we utilize the balanced sampling strategy of [2] and label smoothing regularization [5] during training (label smoothing is sketched below), to alleviate the bias from the non-uniform sample distribution among categories and from partially incorrect training labels. We use the provided data (Places365) for training, do not use any additional data, and train all models from scratch. The algorithm and architecture details will be described in our arXiv paper (available online shortly).
[1] Xudong Cao. A practical theory for designing very deep convolutional neural networks, 2014. (unpublished) [2] Li Shen, Zhouchen Lin, Qingming Huang. Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks. In ECCV 2016. [3] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. In CVPR 2016. [4] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. ArXiv:1602.07261,2016. [5] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna. Rethinking the Inception Architecture for Computer Vision. ArXiv:1512.00567,2016. |
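The label smoothing regularization from [5] mentioned above replaces one-hot targets with a mixture of the one-hot vector and a uniform distribution; a minimal sketch follows (the smoothing factor is a placeholder value, not the team's setting).

```python
# Label smoothing regularization [5]: targets become
# (1 - eps) * one_hot + eps / num_classes. eps is a placeholder value.
import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, labels, eps=0.1):
    num_classes = logits.size(1)
    log_probs = F.log_softmax(logits, dim=1)
    one_hot = F.one_hot(labels, num_classes).float()
    smooth = (1.0 - eps) * one_hot + eps / num_classes   # softened targets
    return -(smooth * log_probs).sum(dim=1).mean()

# toy usage: batch of 16 images over the 365 scene classes
loss = label_smoothing_loss(torch.randn(16, 365, requires_grad=True),
                            torch.randint(0, 365, (16,)))
```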
NEU_SMILELAB | YUE WU, YU KONG, JUN LI, LANCE BARCELONA, RAMI SALEH, SHANGQIAN GAO, RYAN BIRKE, HONGFU LIU, JOSEPH ROBINSON, TALEB ALASHKAR, YUN FU
Northeastern University, MA, USA |
We focus on the object classification problem. The 1000 classes are split into 2 parts based on an analysis of the WordNet structure and a visualization of features from a ResNet-200 model [1]. The first part has 417 classes and is annotated as "LIVING THINGS"; the second part has the remaining 583 classes and is annotated as "ARTIFACTS and OTHERS". Two ResNet-200 models are trained, one for each part. With center-crop testing only, the model for "LIVING THINGS" has a top-5 error of 3.174% on the validation set and the model for "ARTIFACTS and OTHERS" has a top-5 error of 7.874%. However, we could not find a proper way to combine these two models into a good result for the full 1000 classes; our combination of the two models [2] gets a top-5 error of 7.62% for 1000 classes with 144-crop testing. We also train several ResNet models with different numbers of layers, and our submission is based on an ensemble of these models. Our best result achieves a top-5 error of 3.92% on the validation set. For localization, we simply take the center of the image as the object box.
[1] Identity Mappings in Deep Residual Networks, ECCV, 2016 [2] Deep Convolutional Neural Network with Independent Softmax for Large Scale Face Recognition, ACM Multimedia (MM), 2016 |
NQSCENE | Chen Yunpeng ( NUS )
Jin Xiaojie ( NUS ) Zhang Rui ( CAS ) Li Yu ( CAS ) Yan Shuicheng ( Qihoo/NUS ) |
Technique Details for the Scene Classification:
For the scene classification task, we propose the following methods to address the data imbalance issue (a.k.a. the long-tail distribution issue), which boost the final performance: 1) Category-wise Data Augmentation: we apply a category-wise data augmentation strategy that associates each category with an adaptive augmentation level, updated iteratively during training. 2) Multi-task Learning: we propose a multi-path learning architecture to jointly learn feature representations from the ImageNet-1000 dataset and the Places-365 dataset. A vanilla ResNet-200 [1] is adopted with the following standard tricks: scale and aspect-ratio augmentation, over-sampling, and multi-scale (x224, x256, x288, x320) dense testing. In total, we trained four models and fused them by averaging their scores. Training each model takes about three days using MXNet [2] on a cluster with forty NVIDIA M40 (12GB) GPUs. ------------------------------ [1] He, Kaiming, et al. "Identity mappings in deep residual networks." arXiv preprint arXiv:1603.05027 (2016). [2] Chen, Tianqi, et al. "Mxnet: A flexible and efficient machine learning library for heterogeneous distributed systems." arXiv preprint arXiv:1512.01274 (2015). |
NTU-SC | Jason Kuen, Xingxing Wang, Bing Shuai, Xiangfei Kong, Jianxiong Yin, Gang Wang*, Alex C Kot
Rapid-Rich Object Search Lab, Nanyang Technological University, Singapore. |
All of our scene classification models are built upon pre-activation ResNets [1]. For scene classification using the provided RGB images, we train from scratch a ResNet-200, as well as a relatively shallow Wide-ResNet [2]. In addition to RGB images, we make use of class activation maps [3] and (scene) semantic segmentation masks [4] as complementary cues, obtained from models pre-trained for ILSVRC image classification [5] and scene parsing [6] tasks respectively. Our final submissions consist of ensembles of multiple models.
References [1] He, K., Zhang, X., Ren, S., & Sun, J. “Identity Mappings in Deep Residual Networks”. ECCV 2016. [2] Zagoruyko, S., & Komodakis, N. “Wide Residual Networks”. BMVC 2016. [3] Zhou, B., Khosla, A., Lapedriza, A., Oliva, A., & Torralba, A. “Learning Deep Features for Discriminative Localization”. CVPR 2016. [4] Shuai, B., Zuo, Z., Wang, G., & Wang, B. "Dag-Recurrent Neural Networks for Scene Labeling". CVPR 2016. [5] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., ... & Berg, A. C. “Imagenet large scale visual recognition challenge”. International Journal of Computer Vision, 115(3), 211-252. [6] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., & Torralba, A. “Semantic Understanding of Scenes through the ADE20K Dataset”. arXiv preprint arXiv:1608.05442. |
NTU-SP | Bing Shuai (Nanyang Technological University)
Xiangfei Kong (Nanyang Technological University) Jason Kuen (Nanyang Technological University) Xingxing Wang (Nanyang Technological University) Jianxiong Yin (Nanyang Technological University) Gang Wang* (Nanyang Technological University) Alex Kot (Nanyang Technological University) |
We train our improved fully convolutional network (IFCN) for the scene parsing task. More specifically, we use a Convolutional Neural Network pre-trained on the ILSVRC CLS-LOC task as the encoder, and then add a multi-branch deep convolutional network to perform multi-scale context aggregation. Finally, a simple deconvolution network (without unpooling layers) is used as the decoder to generate high-resolution label prediction maps. IFCN subsumes these three network components. The network is trained with the class-weighted loss proposed in [Shuai et al, 2016].
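A sketch of a class-weighted segmentation loss of the kind referenced above. The inverse-log-frequency weighting and the ignore index are assumptions; this is not necessarily the exact formulation of [Shuai et al, 2016].

import torch
import torch.nn.functional as F

def class_weighted_loss(logits, target, class_pixel_counts, ignore_index=-1):
    """logits: (N, C, H, W) prediction maps, target: (N, H, W) label maps,
    class_pixel_counts: (C,) float tensor of training-set pixel counts per class."""
    freq = class_pixel_counts / class_pixel_counts.sum()
    weights = 1.0 / torch.log(1.02 + freq)  # rarer classes receive larger weight
    return F.cross_entropy(logits, target, weight=weights, ignore_index=ignore_index)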
[Shuai et al, 2016] Bing Shuai, Zhen Zuo, Bing Wang, Gang Wang. DAG-Recurrent Neural Network for Scene Labeling |
NUIST | Jing Yang, Hui Shuai, Zhengbo Yu, Rongrong Fan, Qiang Ma, Qingshan Liu, Jiankang Deng | 1. Inception v2 [1] is used in the VID task, which runs almost in real time on a GPU.
2. Cascaded region regression is used to detect and track different instances. 3. Context inference between instances within each video. 4. Online detector and tracker updates to improve recall. [1] Szegedy, Christian, et al. "Going deeper with convolutions." Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2015. [2] Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015. [3] Dai, Jifeng, et al. "R-FCN: Object Detection via Region-based Fully Convolutional Networks." arXiv preprint arXiv:1605.06409 (2016). |
NuistParsing | Feng Wang:B-DAT Lab, Nanjing University of Information Science and Technology, China
Zhi Li:B-DAT Lab, Nanjing University of Information Science and Technology, China Qingshan Liu:B-DAT Lab, Nanjing University of Information Science and Technology, China |
The scene parsing problem is extremely challenging due to the diversity of appearance and the complexity of configuration, layout, and occlusion. We mainly adopt the SegNet architecture for the scene parsing task. We first extract edge information from the ground truth and treat the edges as a new class. Then we re-compute the weights of all classes to overcome the imbalance between classes, and train the model with the new ground truth and new weights. In addition, we employ super-pixel smoothing to refine the results.
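A Python sketch of the two preprocessing steps described above: marking ground-truth boundaries as an extra "edge" class and re-balancing class weights. The one-pixel edge width and the median-frequency weighting formula are assumptions.

import numpy as np

def add_edge_class(label_map, edge_class):
    """Mark pixels whose label differs from a 4-neighbour as the new edge class."""
    edges = np.zeros_like(label_map, dtype=bool)
    edges[:-1, :] |= label_map[:-1, :] != label_map[1:, :]
    edges[:, :-1] |= label_map[:, :-1] != label_map[:, 1:]
    out = label_map.copy()
    out[edges] = edge_class
    return out

def median_frequency_weights(pixel_counts):
    """Median-frequency balancing: weight_c = median(freq) / freq_c."""
    freq = pixel_counts / pixel_counts.sum()
    return np.median(freq) / np.maximum(freq, 1e-12)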
[1] V. Badrinarayanan, A. Handa, and R. Cipolla. SegNet:a deep convolutional encoder-decoder architecture for robust semantic pixel-wise labelling. arXiv preprint arXiv:1505.07293,2015. [2]Wang F, Li Z, Liu Q. Coarse-to-fine human parsing with Fast R-CNN and over-segment retrieval[C]//2016 IEEE International Conference on Image Processing (ICIP). IEEE, 2016: 1938-1942. |
NUS-AIPARSE | XIAOJIE JIN (NUS)
YUNPENG CHEN (NUS) XIN LI (NUS) JIASHI FENG (NUS) SHUICHENG YAN (360 AI INSTITUTE, NUS) |
The submissions are based on our proposed Multi-Path Feedback recurrent neural network (MPF-RNN) [1]. MPF-RNN aims to enhance the capability of RNNs in modeling long-range context information at multiple levels and to better distinguish pixels that are easily confused in pixel-wise classification. In contrast to CNNs without feedback and RNNs with only a single feedback path, MPF-RNN propagates the contextual features learned at top layers through weighted recurrent connections to multiple bottom layers, helping them learn better features with such "hindsight". Besides, we propose a new training strategy that accumulates the loss over multiple recurrent steps, which improves the performance of MPF-RNN on parsing small objects and stabilizes the training procedure.
In this contest, ResNet-101 is used as the baseline model. Multi-scale input data augmentation as well as multi-scale testing are used. [1] Jin, Xiaojie, Yunpeng Chen, Jiashi Feng, Zequn Jie, and Shuicheng Yan. "Multi-Path Feedback Recurrent Neural Network for Scene Parsing." arXiv preprint arXiv:1608.07706 (2016). |
NUS_FCRN | Li Xin, Tsinghua University;
Jin xiaojie, National University of Singapore; Jiashi Feng, National University of Singapore. |
We trained a single fully convolutional neural network with ResNet-101 as the frontend model.
We did not use any multi-scale data augmentation in either training or testing. |
NUS_VISENZE | Kyaw Zaw Lin(dcskzl@nus.edu.sg)
Shangxuan Tian(shangxuan@visenze.com) JingYuan Chen(a0117039@u.nus.edu) |
Fusion of three models: SSD (with VGG and ResNet backbones) [2] and Faster R-CNN [4] with ResNet [3]. Context suppression is applied, and then tracking is performed according to [1]. Tracklets are greedily merged after tracking.
[1]Danelljan, Martin, et al. "Accurate scale estimation for robust visual tracking." Proceedings of the British Machine Vision Conference BMVC. 2014. [2]Liu, Wei, et al. "SSD: Single Shot MultiBox Detector." arXiv preprint arXiv:1512.02325 (2015). [3]He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015). [4]Ren, Shaoqing, et al. "Faster R-CNN: Towards real-time object detection with region proposal networks." Advances in neural information processing systems. 2015. |
OceanVision | Zhibin Yu Ocean University of China
Chao Wang Ocean University of China ZiQiang Zheng Ocean University of China Haiyong Zheng Ocean University of China |
Our homepage: http://vision.ouc.edu.cn/~zhenghaiyong/
We are interested in scene classification and aim to build a network for this problem. |
OutOfMemory | Shaohua Wan, UT Austin
Jiapeng Zhu, BIT, Beijing |
The Faster R-CNN [1] object detection framework with a ResNet-152 [2] backbone is used in our object detection algorithm. Much effort went into optimizing the network so that it consumes far less GPU memory.
[1] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. In NIPS, 2015. [2] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016. |
Rangers | Y. Q. Gao,
W. H. Luo, X. J. Deng, H. Wang, W. D. Chen, |
--- |
ResNeXt | Saining Xie, UCSD
Ross Girshick, FAIR Piotr Dollar, FAIR Kaiming He, FAIR |
We present a simple, modularized multi-way extension of ResNet for ImageNet classification. In our network, each residual block consists of multiple ways that have the same architectural shape, and the network is a simple stack of such residual blocks sharing the same template, following the design of the original ResNet. Our model is highly modularized and thus reduces the burden of exploring the design space. We carefully conducted ablation experiments showing the improvements of this architecture. More details will be available in a technical report. In the submissions we exploited multi-way ResNet-101 models. We submit no localization result. |
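A PyTorch sketch of an aggregated residual block with multiple identically shaped ways, in the spirit of the ResNeXt description above. The widths and the use of a grouped convolution to realize the parallel ways are assumptions; the authors' full details are deferred to their technical report.

import torch.nn as nn

class MultiWayResidualBlock(nn.Module):
    def __init__(self, channels, ways=32, width_per_way=4):
        super().__init__()
        inner = ways * width_per_way
        self.branch = nn.Sequential(
            nn.Conv2d(channels, inner, 1, bias=False),
            nn.BatchNorm2d(inner), nn.ReLU(inplace=True),
            # groups=ways gives `ways` parallel paths of identical shape
            nn.Conv2d(inner, inner, 3, padding=1, groups=ways, bias=False),
            nn.BatchNorm2d(inner), nn.ReLU(inplace=True),
            nn.Conv2d(inner, channels, 1, bias=False),
            nn.BatchNorm2d(channels),
        )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # Identity shortcut plus the aggregated multi-way branch.
        return self.relu(x + self.branch(x))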
RUC_BDAI | Peng Han, Renmin University of China
An Zhao, Renmin University of China Wenwu Yuan, Renmin University of China Zhiwu Lu, Renmin University of China Jirong Wen, Renmin University of China Lidan Yang, Renmin University of China Aoxue Li, Peking University |
We use a well-trained Faster R-CNN [1] to generate bounding boxes for every frame of the video, and we use only a few frames of each video to train that model. To reduce the effect of class imbalance, the number of training frames per category is kept roughly the same. We then utilize the contextual information of the video to reduce noise and recover missing detections.
[1] S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with region proposal networks. In NIPS, 2015. |
S-LAB-IIE-CAS | Ou Xinyu [1,2]
Ling Hefei [2] Liu Si [1] 1. Chinese Academy of Sciences, Institute of Information Engineering; 2. Huazhong University of Science and Technology (This work was done when the first author worked as an intern at S-Lab of CASIIE.) |
We exploit object-based contextual enhancement strategies to improve the performance of deep convolutional neural networks on the scene parsing task. Increasing the weights of objects in local proposal regions can enhance the structural characteristics of the objects and correct ambiguous areas that are wrongly judged as stuff. We have verified its effectiveness on a ResNet101-like architecture [1], which is designed with multi-scale inference, CRF, and atrous convolution [2]. We also apply various technologies (such as RPN [3], black hole padding, visual attention, and iterative training) to this ResNet101-like architecture. The algorithm and architecture details will be described in our paper (available online shortly).
In this competition, we submit five entries. The first (model A) is a Multi-Scale ResNet101-like model with a fully connected CRF and atrous convolutions, which achieved 0.3486 mIOU and 75.39% pixel-wise accuracy on the validation dataset. The second (model B) is a Multi-Scale deep CNN modified by object proposals, which achieved 0.3809 mIOU and 75.69% pixel-wise accuracy. A black hole restoration strategy is added to model B to produce model C. Model D applies attention strategies in the deep CNN model, and model E combines the results of the other four models. [1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image Recognition. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016 [2] Liang-Chieh Chen, George Papandreou, Iasonas Kokkinos, Kevin Murphy, Alan L. Yuille: DeepLab: Semantic Image Segmentation with Deep Convolutional Nets, Atrous Convolution, and Fully Connected CRFs. CoRR abs/1606.00915 (2016) [3] Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. Conference on Neural Information Processing Systems (NIPS), 2015 |
SamExynos | Qian Zhang(Beijing Samsung Telecom R&D Center)
Peng Liu(Beijing Samsung Telecom R&D Center) Jinbin Lin(Beijing Samsung Telecom R&D Center) Junjun Xiong(Beijing Samsung Telecom R&D Center) |
Object localization:
The submission is based on [1] and [2], but we modified the model, and the network has 205 layers. Due to the limits of time and GPUs, we trained only three CNN models for classification. The top-5 accuracy on the validation set with dense crops (scales: 224, 256, 288, 320, 352, 384, 448, 480) is 96.44% for the best single model, and 96.88% for the three-model ensemble. Places365 classification: The submission is based on [3] and [4]; we add 5 layers to ResNet-50 and modify the network. Due to the limits of time and GPUs, we trained only three CNN models for the scene classification task. The top-5 accuracy on the validation set with 72 crops is 87.79% for the best single model, and 88.70% with multiple crops for the three-model ensemble. [1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Identity Mappings in Deep Residual Networks. ECCV 2016. [2] Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. "Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning". arXiv preprint arXiv:1602.07261 (2016) [3] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jonathon Shlens, Zbigniew Wojna. "Rethinking the Inception Architecture for Computer Vision". arXiv preprint arXiv:1512.00567 (2015) |
Samsung Research America: General Purpose Acceleration Group | Dr. S. Eliuk (Samsung), C. Upright (Samsung), Dr. H. Vardhan (Samsung), T. Gale (Intern, Northeastern University), S. Walsh (Intern, University of Alberta). | The General Purpose Acceleration Group is focused on accelerating training via HPC and distributed computing. We present Distributed Training Done Right (DTDR), where standard open-source models are trained effectively via a multitude of techniques involving strong/weak scaling and strict distributed training modes. Several different models are used, from standard Inception v3 to Inception v4 res2, as well as ensembles of these. The training environment is unique in that we can explore extremely deep models given the model-parallel nature of our partitioning of the data.
|
scnu407 | Li Shiqi South China Normal University
Zheng Weiping South China Normal University Wu Jinhui South China Normal University |
We believe that the spatial relationships between objects in an image form a kind of sequential (time-series-like) data. Therefore, we first use VGG16 to extract image features and then add four LSTM layers on top, each LSTM layer scanning the feature map in one of four directions. |
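A PyTorch sketch of the four-direction LSTM scan over a convolutional feature map described in the scnu407 entry above. The layer sizes, the choice of scan orders, and the use of the last hidden state for classification are assumptions.

import torch
import torch.nn as nn

class FourWayLSTMScan(nn.Module):
    def __init__(self, in_channels=512, hidden=256):
        super().__init__()
        self.lstms = nn.ModuleList(nn.LSTM(in_channels, hidden, batch_first=True)
                                   for _ in range(4))

    def forward(self, fmap):                       # fmap: (N, C, H, W), e.g. VGG16 conv5
        seq = fmap.flatten(2).transpose(1, 2)      # row-major scan: (N, H*W, C)
        col = fmap.transpose(2, 3).flatten(2).transpose(1, 2)  # column-major scan
        scans = [seq, seq.flip(1), col, col.flip(1)]           # four directions
        outs = [lstm(s)[0][:, -1] for lstm, s in zip(self.lstms, scans)]
        return torch.cat(outs, dim=1)              # (N, 4*hidden), fed to a classifier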
SegModel | Falong Shen, Peking University
Rui Gan, Peking University Gang Zeng, Peking University |
Abstract
Our models are fine-tuned from ResNet-152 [1] and follow the methods introduced in [2]. References [1] K. He, X. Zhang, S. Ren, J. Sun. Deep Residual Learning for Image Recognition. [2] F. Shen, G. Zeng. Fast Semantic Image Segmentation with High Order Context and Guided Filtering. |
SenseCUSceneParsing | Hengshuang Zhao* (SenseTime, CUHK), Jianping Shi* (SenseTime), Xiaojuan Qi (CUHK), Xiaogang Wang (CUHK), Tong Xiao (CUHK), Jiaya Jia (CUHK) [* equal contribution] | We employ FCN-based semantic segmentation for scene parsing and propose a context-aware semantic segmentation framework. The additional image-level information significantly improves performance on complex scenes in the natural distribution. Moreover, we find that deeper pretrained models are better; our pretrained models include ResNet269 and ResNet101 from the ImageNet dataset, and ResNet152 from the Places2 dataset. Finally, we utilize a deeply supervised structure to assist training of the deeper models. Our best single model reaches 44.65 mIOU and 81.58% pixel accuracy on the validation set.
[1]. Long, Jonathan, Evan Shelhamer, and Trevor Darrell. "Fully convolutional networks for semantic segmentation." CVPR. 2015. [2]. He, Kaiming, et al. "Deep residual learning for image recognition." arXiv:1512.03385, 2015. [3]. Lee, Chen-Yu, et al. "Deeply-Supervised Nets." AISTATS, 2015. |
SIAT_MMLAB | Sheng Guo, Linjie Xing,
Shenzhen Institutes of Advanced Technology, CAS. Limin Wang, Computer Vision Lab, ETH Zurich. Yuanjun Xiong, Chinese University of Hong Kong. Jiaming Liu and Yu Qiao, Shenzhen Institutes of Advanced Technology, CAS. |
We propose a modular framework for large-scale scene recognition, called multi-resolution CNN (MR-CNN) [1]. This framework addresses the difficulty of characterizing scene concepts, which may depend on multi-level visual information, including local objects, spatial layout, and global context. Specifically, in this challenge submission, we utilize four resolutions (224, 299, 336, 448) as the input sizes of the MR-CNN architectures. For the coarse resolutions (224, 299), we exploit the existing powerful Inception architectures (Inception v2 [2], Inception v4 [3], and Inception-ResNet [3]), while for the fine resolutions (336, 448), we propose new Inception architectures by making the original Inception network deeper and wider. Our final submission is the prediction of the MR-CNNs obtained by fusing the outputs of the CNNs at different resolutions.
In addition, we propose several principled techniques to reduce the over-fitting risk of MR-CNNs, including class balancing and hard sample mining. These simple yet effective training techniques enable us to further improve the generalization performance of MR-CNNs on the validation dataset. Meanwhile, we use an efficient parallel version of Caffe toolbox [4] to allow for the fast training of our proposed deeper and wider Inception networks. [1] L. Wang, S. Guo, W. Huang, Y. Xiong, and Y. Qiao, Knowledge guided disambiguation for large-scale scene classification with Multi-Resolution CNNs, in arXiv, 2016. [2] S. Ioffe and C. Szegedy, Batch normalization: Accelerating deep network training by reducing internal covariate shift, in ICML, 2015. [3] C. Szegedy, S. Ioffe, and V. Vanhouche, Inception-v4, Inception-ResNet and the impact of residual connections on learning, in arXiv, 2016. [4] L. Wang, Y. Xiong, Z. Wang, Y. Qiao, D. Lin, X. Tang, and L. Van Gool, Temporal segment networks: towards good practices for deep action recognition, in ECCV, 2016. |
SIIT_KAIST | Sihyeon Seong (KAIST)
Byungju Kim (KAIST) Junmo Kim (KAIST) |
We used ResNet [1] (101 layers, 4 GPUs) as our baseline model. Starting from the model pre-trained on the ImageNet classification dataset (provided by [2]), we re-tuned the model on the Places365 dataset (the 256-resized small dataset). Then, we further fine-tuned the model based on the following ideas:
i) Analyzing correlations between labels: we calculated the correlation between each pair of predictions p(i), p(j), where i, j are classes. Highly correlated label pairs are then extracted by thresholding the correlation coefficients. ii) Additional semantic label generation: using the correlation table from i), we further generated super/subclass labels by clustering. Additionally, we generated 170 binary labels for separating confusing classes, which maximize the margins between highly correlated label pairs. iii) Boosting-like multi-loss terms: a large number of loss terms are combined for classifying the labels generated in ii). [1] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015). [2] https://github.com/facebook/fb.resnet.torch |
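A minimal Python sketch of step i) of the SIIT_KAIST entry above: measuring how strongly pairs of class predictions co-vary on a held-out set and thresholding to extract confusable pairs. The threshold value is an assumption.

import numpy as np

def confusing_pairs(probs, threshold=0.5):
    """probs: (num_images, num_classes) softmax outputs on a held-out set."""
    corr = np.corrcoef(probs, rowvar=False)        # (C, C) correlation of predictions
    np.fill_diagonal(corr, 0.0)
    i, j = np.where(np.triu(corr, k=1) > threshold)
    return list(zip(i.tolist(), j.tolist()))       # highly correlated (confusable) pairs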
SIIT_KAIST-TECHWIN | Byungju Kim (KAIST),
Youngsoo Kim (KAIST), Yeakang Lee (KAIST), Junho Yim (KAIST), Sangji Park (Techwin), Jaeho Jang (Techwin), Shimin Yin (Techwin), Soonmin Bae (Techwin), Junmo Kim (KAIST) |
Our methods for classification and localization are based on ResNet[1].
We used branched 200-layer ResNets, based on the original 200-layer ResNet and the label smoothing method [2]. The networks are trained from scratch on the ILSVRC 2016 localization dataset. For testing, the dense sliding-window method [3] was used at six scales with horizontal flips. 'Single model' is one Branched-ResNet with the label smoothing method; its validation top-5 classification error rate is 3.7240%. 'Ensemble A' consists of one 200-layer ResNet, one Branched-ResNet without label smoothing, and the 'Single model' with label smoothing. 'Ensemble B' consists of three Branched-ResNets without label smoothing and the 'Single model' with label smoothing. 'Ensemble C' consists of 'Ensemble B' and an original 200-layer ResNet. Ensembles A and B are averaged on soft targets distilled at high temperature, similar to the method in [4]; Ensemble C is averaged on softmax outputs. This work was supported by Hanwha Techwin CO., LTD. [1] He, Kaiming, et al. "Deep residual learning for image recognition." arXiv preprint arXiv:1512.03385 (2015). [2] Szegedy, Christian, et al. "Rethinking the inception architecture for computer vision." arXiv preprint arXiv:1512.00567 (2015). [3] Sermanet, Pierre, et al. "Overfeat: Integrated recognition, localization and detection using convolutional networks." arXiv preprint arXiv:1312.6229 (2013). [4] Hinton, Geoffrey, Oriol Vinyals, and Jeff Dean. "Distilling the knowledge in a neural network." arXiv preprint arXiv:1503.02531 (2015). |
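A PyTorch sketch of the label-smoothing loss of [2] used for the Branched-ResNets above; the smoothing factor 0.1 is the commonly used value, not a figure confirmed by the authors.

import torch
import torch.nn.functional as F

def label_smoothing_loss(logits, target, epsilon=0.1):
    """Cross-entropy against a target that puts (1 - epsilon) on the true class
    and spreads epsilon uniformly over all classes."""
    log_probs = F.log_softmax(logits, dim=1)
    nll = -log_probs.gather(1, target.unsqueeze(1)).squeeze(1)  # per-sample NLL
    uniform = -log_probs.mean(dim=1)                            # CE against uniform
    return ((1 - epsilon) * nll + epsilon * uniform).mean()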
SIS ITMO University | --- | Single-Shot Detector
|
SJTU-ReadSense | Qinchuan Zhang, Shanghai Jiao Tong University
Junxuan Chen, Shanghai Jiao Tong University Thomas Tong, ReadSense Leon Ding, ReadSense Hongtao Lu, Shanghai Jiao Tong University |
We train two CNN models from scratch. Model A, based on Inception-BN [1] with one auxiliary classifier, is trained on the Places365-Challenge dataset [2] and achieves 15.03% top-5 error on the validation dataset. Model B, based on ResNet [3] with a depth of 50 layers, is trained on the Places365-Standard dataset and fine-tuned for 2 epochs on the Places365-Challenge dataset due to time limits; it achieves 16.3% top-5 error on the validation dataset. We also fuse features extracted from 3 baseline models [2] on the Places365-Challenge dataset and train two fully connected layers with a softmax classifier. Moreover, we adopt the "class-aware" sampling strategy proposed by [4] for models trained on the Places365-Challenge dataset to tackle the non-uniform distribution of images over the 365 categories. We implement model A using Caffe [5] and conduct all other experiments using MXNet [6] to allow a larger batch size per GPU.
We train all models with 224x224 crops randomly sampled from a 256x256 image or its horizontal flip, with the per-pixel mean subtracted. We apply 12 crops [7] for evaluation on the validation and test datasets. We ensemble multiple models with weights (learnt on the validation dataset or given by top-5 validation accuracies), and achieve 12.79% (4 models), 12.69% (5 models), and 12.57% (6 models) top-5 error on the validation dataset. [1] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In ICML, 2015. [2] Places: An Image Database for Deep Scene Understanding. B. Zhou, A. Khosla, A. Lapedriza, A. Torralba and A. Oliva. Arxiv, 2016. [3] K. He, X. Zhang, S. Ren, and J. Sun. Deep Residual Learning for Image Recognition. In CVPR, 2016. [4] L. Shen, Z. Lin, Q. Huang. Relay backpropagation for effective learning of deep convolutional neural networks. arXiv:1512.05830, 2015. [5] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv:1408.5093, 2014. [6] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang. MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems. In NIPS, 2015. [7] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. In CVPR, 2015. |
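A Python sketch of the "class-aware" sampling strategy of [4] as used in the SJTU-ReadSense entry above: first pick a class uniformly, then pick an image from that class, so rare classes are not swamped by frequent ones. The data structures are assumptions.

import random

def class_aware_batch(images_by_class, batch_size):
    """images_by_class: dict mapping class id -> list of image paths."""
    classes = list(images_by_class)
    batch = []
    for _ in range(batch_size):
        c = random.choice(classes)                 # uniform over classes
        batch.append((random.choice(images_by_class[c]), c))
    return batch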
SRA | Hojjat Seyed Mousavi, The Pennsylvania State University, Samsung Research America
Da Zhang, University of California at Santa Barbara, Samsung Research America Nina Narodytska, Samsung Research America Hamid Maei, Samsung Research America Shiva Kasiviswanathan, Samsung Research America |
Object detection from video is a challenging task in computer vision. These challenges sometimes come from the temporal aspects of videos or from the nature of the objects present in the video: for example, detecting objects that disappear and reappear in the camera’s field of view, or detecting non-rigid objects that change appearance, are common challenges in video object detection. In this work, we specifically focus on incorporating temporal and contextual information to address some of these challenges. In our proposed method, initial candidate objects are first detected in each frame of the video sequence. Then, based on information from adjacent frames and contextual information from the whole video sequence, object detections and categories are recalculated for each video sequence. We made two submissions to this year’s competition. One corresponds to our algorithm using information from still video frames, temporal information from adjacent frames, and contextual information of the whole video sequence; the other does not use the contextual information present in the video. |
SunNMoon | Moon Hyoung-jin.
Park Sung-soo. |
We ensembled two object detectors, Faster R-CNN and the Single Shot Detector (SSD).
We used a pre-trained ResNet-101 classification model. Faster R-CNN is combined with the RPN and the SPOP-net (Scale-aware Pixel-wise Object Proposal network) algorithm to find better RoIs. We trained Faster R-CNN and SSD; the final result is the combination of multi-scale Faster R-CNN and SSD 300x300. The result of the single Faster R-CNN is 42.8% mAP, the result of SSD 300x300 is 43.7% mAP, and the ensembled result of multi-scale Faster R-CNN and SSD is 47.6% mAP. References: [1] Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", NIPS 2015 [2] Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg "SSD: Single Shot MultiBox Detector" [3] Zequn Jie, Xiaodan Liang, Jiashi Feng, Wen Feng Lu, Eng Hock Francis Tay, Shuicheng Yan "Scale-aware Pixel-wise Object Proposal Networks" |
SUXL | Xu Lin SZ UVI Technology Co., Ltd | The proposed model is a combination of several convolutional neural network frameworks for scene parsing, implemented in Caffe. We initialise ResNet-50 and ResNet-101 [1] with weights trained on the ImageNet classification dataset, then train these two networks on the Places2 scene classification 2016 data. With some modifications for the scene parsing task, we train a multiscale dilated network [2] initialised with the trained parameters of ResNet-101, and FCN-8x and FCN-16x [3] initialised with the trained parameters of ResNet-50. Considering the additional models provided by the scene parsing challenge 2016, we combine these models via a post network. The proposed model is also refined by a fully connected CRF for semantic segmentation [4].
[1]. K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In CVPR, 2016. [2].L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille. Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. arXiv:1606.00915, 2016 [3].J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc. CVPR, 2015. [4].Krahenbuhl, P. and Koltun, V. Efficient inference in fully connected crfs with gaussian edge potentials. In NIPS, 2011. |
SYSU_HCP-I2_Lab | Liang Lin (Sun Yat-sen University),
Lingbo Liu (Sun Yat-sen University), Guangrun Wang (Sun Yat-sen University), Junfan Lin (Sun Yat-sen University), Ziyang Tang (Sun Yat-sen University), Qixian Zhou (Sun Yat-sen University), Tianshui Chen (Sun Yat-sen University) |
We design our scene parsing model based on DeepLab2-CRF and improve it in two ways. First, we incorporate deep semantic information and shallow appearance information with skip layers to produce refined, detailed segmentations. Specifically, we combine the features of the 'fusion' layer (after up-sampling via bilinear interpolation), the 'res2b' layer, and the 'res2c' layer. Second, we develop cascade nets, in which the second network utilizes the output of the first network to generate a more accurate parsing map. Our ResNet-101 was pre-trained on the standard 1.2M ImageNet data and fine-tuned on the ADE20K dataset. |
TEAM1 | Sung-Bae Cho (Yonsei University)
Sangmuk Jo (Yonsei University) Seung Ha Kim (Yonsei University) Hyun-Tae Hwang (Yonsei University) Youngsu Park (LGE) Hyungseok Ohk (LGE) |
We use the Faster R-CNN framework [1] and fine-tune the network with the provided VID training data and the additional DET training data.
(We mapped the DET training data from 200 classes to the 30 VID classes.) For our baseline model, we use VGG-16 [2]. [1] REN, Shaoqing, et al. Faster R-CNN: Towards real-time object detection with region proposal networks. In: Advances in neural information processing systems. 2015. p. 91-99. [2] WANG, Limin, et al. Towards good practices for very deep two-stream convnets. arXiv preprint arXiv:1507.02159, 2015. |
ToConcoctPellucid | Digvijay Singh
MS by Research in Computer Science, IIIT Hyderabad |
This submission for the object detection task follows the algorithmic guidelines provided by Ren et al. [2], also known as the Faster-RCNN algorithm. This deep-neural-network-based method removes the dependency of detection algorithms on older object proposal techniques. The Faster-RCNN framework consists of two networks: a Region Proposal Network (RPN) for object proposals and Fast-RCNN for object detection. The authors show that the framework is independent of the classification network architecture chosen. In particular, Residual Networks (ResNet) [1] trained for the ImageNet classification task are picked because of their recent success in many domains. It has been observed that the improvement in the quality of object proposals is drastic compared to older methods like Selective Search, MCG, etc. The Faster-RCNN layers that are not borrowed from ResNet, such as the RPN layers and the final fully connected layers for RoI classification and bounding-box offset prediction, are randomly initialized from a Gaussian distribution. Further, to train/fine-tune our baseline single model we use the train + val1 set of the ImageNet DET dataset, and the other half, val2, is kept for validation purposes. This split is maintained for all three entries. Three entries have been submitted for the object detection challenge:
Entry 1: ResNet-101 + Faster-RCNN. The pretrained ResNet-101 network provided by the authors of ResNet is used to initialize most of the layers in Faster-RCNN. The extra layers belonging to the RPN network and the final fully connected layers are randomly initialized. ROI-Pooling is placed between the conv4 and conv5 blocks of ResNet-101; hence, ResNet layers conv1-conv4 act as the feature extractor, and these features are used for object proposal generation as well as the object detection task. Regions of interest are generated by the RPN network and passed to ROI pooling, which scales all RoI-bounded activations to a fixed size of 7x7 [WxH]. The layers in conv5 act as the object classifier. Entry 2: Taking hints from last year's winner's recommendations, this entry is an ensemble of two Residual Networks. In addition, the more semantically meaningful box-voting technique [3] is used. For our ensemble, the ResNet-101 and ResNet-50 networks are picked because of the availability of pretrained models. At test time, both networks generate proposals; the proposals from both networks are combined, and non-maximum suppression is applied to remove redundancy. The final set of RoIs is given to both networks and the detections from each are collected. To obtain a final set of detections, the box-voting technique [3] is used to refine the detections from both. Box voting can be seen as a weighted averaging of instances belonging to a certain niche (decided by an IoU criterion). Entry 3: In this entry, the ResNet layers are altered, removed, and modified to obtain a different topology from the networks used in Entry 2. As an example, the batch-normalization and scaling layers that are followed by Eltwise addition in the standard ResNet are removed and affixed after the Eltwise operation. The rest of the training/testing settings remain the same as Entry 2. [1] Kaiming He, Xiangyu Zhang, Shaoqing Ren and Jian Sun. Deep Residual Learning for Image Recognition. CVPR 2016 [2] Shaoqing Ren, Kaiming He, Ross Girshick and Jian Sun. Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks. NIPS 2015 [3] S. Gidaris and N. Komodakis. Object Detection via a multi-region and semantic segmentation-aware cnn model. ICCV 2015 |
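A numpy sketch of the box-voting refinement [3] used in Entries 2 and 3 above: each detection surviving NMS is replaced by a score-weighted average of all same-class boxes that overlap it. The IoU threshold and the function interfaces are assumptions.

import numpy as np

def iou(box, boxes):
    """IoU between one box and an array of boxes, all in (x1, y1, x2, y2)."""
    x1 = np.maximum(box[0], boxes[:, 0]); y1 = np.maximum(box[1], boxes[:, 1])
    x2 = np.minimum(box[2], boxes[:, 2]); y2 = np.minimum(box[3], boxes[:, 3])
    inter = np.clip(x2 - x1, 0, None) * np.clip(y2 - y1, 0, None)
    area_a = (box[2] - box[0]) * (box[3] - box[1])
    area_b = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    return inter / (area_a + area_b - inter)

def box_voting(kept_boxes, kept_scores, all_boxes, all_scores, iou_thresh=0.5):
    """kept_*: detections surviving NMS; all_*: the pre-NMS pool for the same class."""
    voted = []
    for box in kept_boxes:
        mask = iou(box, all_boxes) >= iou_thresh
        w = all_scores[mask]
        voted.append((all_boxes[mask] * w[:, None]).sum(0) / w.sum())
    return np.array(voted), kept_scores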
Trimps-Soushen | Jie Shao, Xiaoteng Zhang, Zhengyan Ding, Yixin Zhao, Yanjun Chen, Jianying Zhou, Wenfei Wang, Lin Mei, Chuanping Hu
The Third Research Institute of the Ministry of Public Security, P.R. China. |
Object detection (DET)
We use several pre-trained models, including ResNet, Inception, Inception-ResNet, etc. By taking the predicted boxes from our best model as region proposals, we average the softmax scores and the box regression outputs across all models. Other improvements include annotation refinement, box voting, and feature maxout. Object classification/localization (CLS-LOC) Based on image classification models like Inception, Inception-ResNet, ResNet and Wide Residual Network (WRN), we predict the class labels of the image. Then we follow the "Faster R-CNN" framework to predict bounding boxes given the labels. Results from multiple models are fused in different ways, using the model accuracies as weights. Scene classification (Scene) We adopt several kinds of CNN models such as ResNet, Inception and WRN. To improve the performance of features from multiple scales and models, we implement a cascaded softmax classifier after the extraction stage. Object detection from video (VID) The same methods as in the DET task were applied to each frame. Optical-flow-guided motion prediction helped to reduce false negative detections. [1] Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks, Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. NIPS 2015 [2] Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning, Christian Szegedy, Sergey Ioffe, Vincent Vanhoucke, Alex Alemi. [3] Zagoruyko S, Komodakis N. Wide Residual Networks[J]. arXiv preprint arXiv:1605.07146, 2016. |
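A minimal Python sketch of the DET fusion step described in the Trimps-Soushen entry above: the best single model's boxes serve as shared region proposals, and every model's class scores and box-regression outputs on those proposals are averaged. The callable interface for the models is an assumption for illustration only.

import numpy as np

def fuse_detections(proposals, models):
    """proposals: (R, 4) boxes from the best single model;
    models: list of callables returning (scores (R, C), box_deltas (R, C, 4))."""
    scores, deltas = zip(*(m(proposals) for m in models))
    return np.mean(scores, axis=0), np.mean(deltas, axis=0)  # averaged across models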
VB | JongGook Ko
Seungjae Lee KeunDong Lee DaUn Jeong WeonGeun Oh |
In this work, we use a variant of SSD [1] with ResNet [2] for the detection task. The overall training of the detection network follows a procedure similar to [1]. For detection, we design the detection network from ResNet and select multiple object candidates from different layers with various aspect ratios, scales, and so on. For the ensemble models, we also train Faster R-CNN [3] using ResNet and a variant of SSD with the VGG network [1].
[1]"SSD:Single Shot MultiBox Detector", Wei Liu, Dragomir Anguelov, Dumitru Erhan, Christian Szegedy, Scott Reed, Cheng-Yang Fu, Alexander C. Berg. arXiv preprint arXiv:1512.02325(2015). [2]"Deep Residual Learning for Image Recognition", Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. Tech Report 2015. [3]"Faster r-cnn: Towards real-time object detection with region proposal networks[J]". Ren S, He K, Girshick R, et al. arXiv preprint arXiv:1506.01497, 2015. |
VikyNet | K. Vikraman, Independent researcher. Graduate from IIT Roorkee | Semantic segmentation requires careful adjustment of parameters. Due to max pooling, a lot of useful information about edges is lost. Algorithms such as deconvolutional networks try to recover it, but at the cost of increased computation time.
FCNs have an advantage over them in terms of processing time, and ParseNet improves performance further. I have fine-tuned the model to perform better. References: 1) ParseNet: Looking Wider to See Better 2) Fully Convolutional Networks for Semantic Segmentation |
VIST | Jongyoul Park(ETRI)
Joongsoo Lee(ETRI) Joongwon Hwang(ETRI) Seung-Hwan Bae(ETRI) Young-Suk Yoon(ETRI) Yuseok Bae(ETRI) |
In this work, we basically use ResNet models for the classification network and adapt Faster R-CNN for the region proposal network. |
Viz Insight | Biplab Ch Das, Samsung R&D Institute Bangalore
Shreyash Pandey, Samsung R&D Institute Bangalore |
Ensembling approaches are known to outperform individual classifiers on standard classification tasks [No Free Lunch Theorem :)].
In our approach we trained state-of-the-art classifiers, including variations of: 1. ResNet 2. VGGNet 3. AlexNet 4. SqueezeNet 5. GoogLeNet. Each of these classifiers was trained on different views of the provided Places2 challenge data. Multiple deep meta-classifiers were trained on the confidences of the labels predicted by the above classifiers, successfully accomplishing a non-linear ensemble in which the weights of the neural network are set so as to maximize scene recognition accuracy. To impose further consistency between objects and scenes, a state-of-the-art classifier trained on ImageNet was adapted to Places via a zero-shot learning approach. We did not use any external data for training the classifiers. However, we balanced the data to make the classifiers give unbiased results, so some of the data remained unused. |
Vladimir Iglovikov | Vladimir Iglovikov
|
Hardware: Nvidia Titan X
Software: Keras with Theano backend. Time spent: 5 days. All models were trained on 128x128 images resized from the "small 256x256" dataset. [1] Modified VGG16 => validation top-5 error 0.36 [2] Modified VGG19 => validation top-5 error 0.36 [3] Modified ResNet-50 => validation top-5 error 0.46 [4] Average of [1] and [2] => validation top-5 error 0.35 Main changes: ReLU => ELU; optimizer => Adam; batch normalization added to VGG16 and VGG19 |
WQF_BTPZ | Weichen Sun
Yuanyuan Li Jiangfan Deng |
We participate in the classification and localization task. Our framework is mainly based on deep residual networks (ResNet) [1] and Faster R-CNN [2].
We make the following improvements. (1) For the classification part, we train our models on the 1000-category classification dataset with multi-scale inputs. We pre-train our models with 224x224 crops from images resized to 256x256, and then fine-tune the pre-trained ResNet (50-layer and 101-layer) classification models with 299x299 crops from images resized to 328x328. (2) In order to improve the optimization ability of our models, we also replace ReLU layers with PReLU or RReLU layers [3]. By introducing different activation methods, our models achieve a better classification performance. (3) For the localization part, we use the Faster R-CNN framework with a 50-layer ResNet as the classifier. Based on the pre-trained model from step (1), we train the model on the localization dataset. [1] "Deep Residual Learning for Image Recognition", Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. [2] "Faster R-CNN: Towards Real-Time Object Detection with Region Proposal Networks", Shaoqing Ren, Kaiming He, Ross Girshick, Jian Sun. [3] "Empirical Evaluation of Rectified Activations in Convolution Network", Bing Xu, Naiyan Wang, Tianqi Chen, Mu Li. |
XKA | Zengming Shen,Yifei Liu, Lengyue Chen,Honghui Shi, Thomas Huang
University of Illinois at Urbana-Champaign |
SegNet is trained only on the ADE20K dataset and post-processed with a CRF.
1.SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation Vijay Badrinarayanan, Alex Kendall and Roberto Cipolla 2.Efficient Inference in Fully Connected CRFs with Gaussian Edge Potentials, Philipp Krähenbühl and Vladlen Koltun, NIPS 2011 |
YoutuLab | Xiaowei Guo, YoutuLab
Ruixin Zhang, YoutuLab Yushi Yao, YoutuLab Pai Peng, YoutuLab Ke Li, YoutuLab |
We build a scene recognition system using deep CNN models. These CNN models are inspired by the original ResNet [1] and Inception [2] network architectures. We train these models on the challenge dataset and apply a balanced sampling strategy [3] to cope with its unbalanced class distribution. Moreover, the DSD [4] process is applied to further improve model performance.
In this competition, we submit five entries. The first and second are combinations of single-scale results using a weighted arithmetic average whose weights are found by a greedy search. The third is a combination of single-model results using the same strategy as the first entry. The fourth and fifth are combinations of single-model results using a simple average. [1] K. He, X. Zhang, S. Ren, J. Sun. Identity Mappings in Deep Residual Networks. In ECCV 2016. abs/1603.05027 [2] C. Szegedy, S. Ioffe, V. Vanhoucke, A. Alemi. Inception-v4, Inception-ResNet and the Impact of Residual Connections on Learning. In ICLR 2016. abs/1602.07261 [3] L. Shen, Z. Lin, Q. Huang. Relay Backpropagation for Effective Learning of Deep Convolutional Neural Networks. abs/1512.05830 [4] S. Han, J. Pool, S. Narang, H. Mao, S. Tang, E. Elsen, B. Catanzaro, J. Tran, W. J. Dally. DSD: Regularizing Deep Neural Networks with Dense-Sparse-Dense Training Flow. abs/1607.04381 |
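A Python sketch of a greedy ensemble-weight search of the kind mentioned in the YoutuLab entry above: weights are grown one step at a time toward whichever model most improves validation accuracy. The step count and stopping rule are assumptions.

import numpy as np

def greedy_ensemble_weights(val_probs, val_labels, steps=50):
    """val_probs: (num_models, num_images, num_classes) validation predictions,
    val_labels: (num_images,) ground-truth class indices."""
    m = val_probs.shape[0]
    counts = np.zeros(m)

    def acc(w):
        fused = np.tensordot(w, val_probs, axes=1)   # (num_images, num_classes)
        return (fused.argmax(1) == val_labels).mean()

    for _ in range(steps):
        # Try adding one unit of weight to each model; keep the best choice.
        best = max(range(m),
                   key=lambda i: acc((counts + np.eye(m)[i]) / (counts.sum() + 1)))
        counts[best] += 1
    return counts / counts.sum()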