Detecting avocados to zucchinis: what have we done, and where are we going?
Olga Russakovsky,
Jia Deng,
Zhiheng Huang,
Alexander C. Berg,
Li Fei-Fei.
International Conference on Computer Vision (ICCV), 2013.
[pdf]
[supplement]
[slides pptx | slides pdf]
[BibTeX]
- If we take the 562 most difficult synsets from ILSVRC 2012 (out of 1000 synsets), as measured by localization difficulty, ILSVRC is comparable to PASCAL VOC in terms of chance performance of localization and object scale.
- ILSVRC is similar to PASCAL VOC with respect to the average number of object instances and the amount of clutter per image.
- For the classification challenge, the SuperVision algorithm outperforms the second-place competitor ISI on nearly all categories.
- For the detection challenge, SuperVision consistently outperforms the second-place competitor Oxford VGG except on the set of 239 hardest categories with respect to localization difficulty.
We visualize the variety of the ILSVRC 2012 data using the validation set images. The images are first rescaled to the canonical size of 300x300 pixels. For each resized image we generate a 300x300 heat map in which the region occupied by the annotated object of interest has value 1 and the rest of the image has value 0. The first column shows the average heat map. The red overlaid box is the average of all the annotated ground truth bounding boxes (computed by averaging each of the 4 coordinates). The second column shows the average of all images in this category.
Click here for all ILSVRC 2012 categories
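For readers who want to reproduce these per-category visualizations, here is a minimal NumPy sketch of the heat map and average-box computation described above. It assumes the ground-truth boxes have already been rescaled to the 300x300 canonical frame and are given as (x1, y1, x2, y2) tuples; the function names are ours and not part of any released toolkit.

import numpy as np

def average_heat_map(boxes, size=300):
    # Accumulate a binary object mask per annotated box and average over the category.
    heat = np.zeros((size, size), dtype=np.float64)
    for (x1, y1, x2, y2) in boxes:
        mask = np.zeros((size, size), dtype=np.float64)
        mask[int(y1):int(y2), int(x1):int(x2)] = 1.0  # region occupied by the object
        heat += mask
    return heat / max(len(boxes), 1)

def average_box(boxes):
    # Average each of the 4 coordinates to obtain the mean ground-truth box (red overlay).
    return np.asarray(boxes, dtype=np.float64).mean(axis=0)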
We demonstrate that ILSVRC is a challenging testbed for evaluating object detection algorithms. We compare it to PASCAL VOC, which the computer vision community has accepted as the benchmark detection dataset.
[Table omitted here: for each of the 20 PASCAL VOC categories (aeroplane, bicycle, bird, boat, bottle, bus, car, cat, chair, cow, diningtable, dog, horse, motorbike, person, pottedplant, sheep, sofa, train, tvmonitor), example images are shown alongside images from the corresponding ILSVRC categories. A final row shows 5 examples of ILSVRC categories which are not present in PASCAL.]
Click here for all ILSVRC 2012 categories
We define the chance performance of localization (CPL) measure as the expected accuracy of a detector which first randomly samples an object instance of that class and then uses its bounding box directly as the proposed localization window on all other images (after rescaling the images to the same size). Concretely, let B_1, B_2, \ldots, B_N be all the bounding boxes of the object instances within a class, and define the intersection over union (IOU) measure as IOU(B_m, B_n) = |B_m \cap B_n| / |B_m \cup B_n|. A sampled box B_m is considered to correctly localize another instance B_n when IOU(B_m, B_n) \geq 0.5, so CPL = \frac{1}{N(N-1)} \sum_{m \neq n} 1[IOU(B_m, B_n) \geq 0.5].
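As a concrete illustration, the following Python sketch computes IOU and CPL from a list of ground-truth boxes. It is a minimal re-implementation of the definition above (not the evaluation code used for the figures), assuming boxes are (x1, y1, x2, y2) tuples expressed in a common rescaled frame and using the IOU >= 0.5 criterion.

def iou(box_a, box_b):
    # Intersection over union of two (x1, y1, x2, y2) boxes.
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def cpl(boxes, thresh=0.5):
    # Chance performance of localization: each instance's box is used as the
    # proposed localization window for every other instance of the class.
    n = len(boxes)
    if n < 2:
        return 0.0
    hits = sum(iou(boxes[i], boxes[j]) >= thresh
               for i in range(n) for j in range(n) if i != j)
    return hits / (n * (n - 1))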
Figure 2.1 shows the CPL distribution on the ILSVRC and PASCAL detection datasets. The 20 categories of PASCAL have an average CPL of 8.8% on the validation set; the average CPL for all 1000 categories of ILSVRC is 20.8%. If we keep only the 562 most difficult categories of ILSVRC, the average CPL matches that of PASCAL (up to 0.02% precision). We will refer to this set of 562 categories as the normalized ILSVRC dataset. Figure 2.2 compares the CPL distribution of the normalized ILSVRC versus PASCAL.
PASCAL VOC | ILSVRC
bottle CPL=0.38% | basketball CPL=0.082%
pottedplant CPL=0.94% | swimming trunks, bathing trunks CPL=0.098%
sheep CPL=1.3% | ping-pong ball CPL=0.13%
chair CPL=1.5% | rubber eraser, rubber, pencil eraser CPL=0.13%
boat CPL=1.7% | nail CPL=0.19%
PASCAL VOC | ILSVRC
bottle scale=0.07 | basketball scale=0.02
sheep scale=0.12 | ping-pong ball scale=0.02
pottedplant scale=0.12 | swimming trunks, bathing trunks scale=0.02
chair scale=0.13 | pickelhaube scale=0.03
car scale=0.14 | bearskin, busby, shako scale=0.03
PASCAL VOC | ILSVRC
sheep instance number=3.1 | strawberry instance number=5.4
person instance number=2.3 | hay instance number=5.2
chair instance number=2.2 | bubble instance number=5.2
cow instance number=2.2 | pretzel instance number=5.1
bottle instance number=2.0 | fig instance number=4.8
To capture the level of clutter in an image, we resize each image so that its largest dimension is 300 pixels, use an unsupervised class-independent selective search [1] approach to generate candidate image regions likely to contain a coherent object, and then filter out all boxes which are either entirely contained within an object instance or correctly localize one according to the IOU criterion. The remaining boxes are considered clutter, and we use their number as a measure of the clutter level. The average level of clutter per class is 129.96 on the PASCAL dataset, 106.98 on ILSVRC, and 124.67 on the normalized ILSVRC. Figures 2.7 and 2.8 demonstrate that ILSVRC is comparable to the PASCAL object detection dataset on this metric.
* We have corrected an error in an earlier version of this analysis. Previously the PASCAL average clutter level was given as 74.94 and the normalized ILSVRC as 107.15. We apologize for any inconvenience.
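A rough sketch of this clutter measure is shown below. Here selective_search is a hypothetical stand-in for the class-independent region proposal method of [1] (any proposal generator returning (x1, y1, x2, y2) boxes could be substituted), and the iou helper is the one defined in the CPL sketch above.

def contains(outer, inner):
    # True if box `inner` lies entirely within box `outer`.
    return (outer[0] <= inner[0] and outer[1] <= inner[1] and
            outer[2] >= inner[2] and outer[3] >= inner[3])

def clutter_level(image, gt_boxes, iou_thresh=0.5):
    # `image` is assumed to be pre-resized so its largest dimension is 300 pixels.
    candidates = selective_search(image)  # hypothetical stand-in for the method of [1]
    clutter = [box for box in candidates
               if not any(contains(gt, box) for gt in gt_boxes)              # inside an instance
               and not any(iou(gt, box) >= iou_thresh for gt in gt_boxes)]   # correct localization
    return len(clutter)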
PASCAL VOC | ILSVRC
pottedplant clutter=173.4 | football helmet clutter=242.0
bottle clutter=170.5 | organ, pipe organ clutter=236.5
car clutter=159.7 | steel drum clutter=230.5
person clutter=156.5 | umbrella clutter=224.0
chair clutter=156.0 | bearskin, busby, shako clutter=218.6
[1] van de Sande, K. E. A., Uijlings, J. R. R., Gevers, T., and Smeulders, A. W. M. Segmentation as Selective Search for Object Recognition. ICCV 2011.
For the ILSVRC2012 classification challenge, each object class C has a set of images associated with it. Given an image, an algorithm is allowed to predict up to 5 object classes (since additional unannotated objects may be present in these images). An image is considered correctly classified if one of these guesses correctly matches the object class C. Classification accuracy of an algorithm on class C is the fraction of correctly classified images.
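This evaluation rule can be written in a few lines. The sketch below assumes predictions maps each image id to a ranked list of guessed classes and labels maps it to the annotated class C; both containers are illustrative and not the official development kit. Varying max_guesses from 1 to 5 gives the curves discussed next.

def classification_accuracy(predictions, labels, max_guesses=5):
    # An image counts as correctly classified if its label appears among the guesses.
    correct = sum(labels[img] in preds[:max_guesses]
                  for img, preds in predictions.items())
    return correct / len(predictions)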
We compare the performance of the top 2 algorithms on the ILSVRC2012 classification challenge, namely SuperVision and ISI. Figure 3.1 shows how the average classification accuracy varies according to the number of guesses allowed. SuperVision consistently outperforms ISI.
Figure 3.2 shows the cumulative classification accuracy across the categories sorted from hardest to easiest according to CPL. The vertical line is the cutoff for the normalized ILSVRC. SuperVision consistently outperforms ISI.
SuperVision | ISI
Leonberg accuracy 1.0 | canoe accuracy 1.0
| geyser accuracy 1.0
daisy accuracy 1.0 | odometer, hodometer, mileometer, milometer accuracy 1.0
yellow lady's slipper, yellow lady-slipper, Cypripedium calceolus, Cypripedium parviflorum accuracy 1.0 | gondola accuracy 0.99
geyser accuracy 1.0 | electric locomotive accuracy 0.99
SuperVision | ISI
ladle accuracy 0.31 | spatula accuracy 0.08
muzzle accuracy 0.33 | ladle accuracy 0.11
velvet accuracy 0.35 | hatchet accuracy 0.13
hook, claw accuracy 0.36 | screwdriver accuracy 0.17
hatchet accuracy 0.36 | cleaver, meat cleaver, chopper accuracy 0.17
SuperVision Outperformed ISI | ISI Outperformed SuperVision
weasel win by 0.52 | fountain win by 0.06
Great Dane win by 0.39 | jigsaw puzzle win by 0.06
hog, pig, grunter, squealer, Sus scrofa win by 0.39 | chainlink fence win by 0.06
reel win by 0.36 | sea slug, nudibranch win by 0.05
spatula win by 0.36 | megalith, megalithic structure win by 0.05
For the ILSVRC2012 detection challenge, each object class C has a set of images associated with it, and each image is human-annotated with bounding boxes B_1, B_2, \ldots indicating the locations of all instances of this object class. Given an image, an algorithm is allowed to predict up to 5 annotations, each consisting of an object class c_i and a bounding box b_i. The object is considered correctly detected if for some proposed annotation (c_i, b_i) it is the case that c_i = C and b_i correctly localizes one of the object instances B_1, B_2, \ldots according to the standard intersection over union >= 0.5 criterion. Detection accuracy of an algorithm on class C is the fraction of images where the object is correctly detected.
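A minimal sketch of this per-class detection accuracy follows, assuming predictions maps each image id to at most 5 (class, box) pairs and ground_truth maps it to the annotated boxes of class C; the containers are illustrative, and the iou helper is the one from the CPL sketch above.

def detection_accuracy(predictions, ground_truth, target_class, iou_thresh=0.5):
    # `ground_truth`: image id -> list of boxes B_1, B_2, ... for `target_class`.
    # `predictions`: image id -> up to 5 proposed (class, box) annotations.
    detected = 0
    for img, gt_boxes in ground_truth.items():
        hit = any(cls == target_class and
                  any(iou(box, gt) >= iou_thresh for gt in gt_boxes)
                  for cls, box in predictions.get(img, [])[:5])
        detected += int(hit)
    return detected / len(ground_truth)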
We compare the performance of the top 2 algorithms on the ILSVRC2012 detection challenge, namely SuperVision and Oxford VGG. Figure 3.3 shows the difference between SuperVision and Oxford VGG detection accuracy for each category. SuperVision outperforms Oxford VGG on 825 of the 1000 categories. The categories on which Oxford VGG outperforms SuperVision tend to be the more difficult ones according to CPL, with an average CPL of 0.057.
Figure 3.4 shows the cumulative detection accuracy across the categories sorted from hardest to easiest according to CPL. The vertical line is the cutoff for the normalized ILSVRC. SuperVision consistently outperforms Oxford VGG except when considering the set of 239 hardest categories with respect to CPL.
SuperVision | Oxford VGG
convertible accuracy 0.99 | web site, website, internet site, site accuracy 0.97
Leonberg accuracy 0.98 | daisy accuracy 0.94
leopard, Panthera pardus accuracy 0.96 | Leonberg accuracy 0.93
Tibetan mastiff accuracy 0.96 | car mirror accuracy 0.92
| Model T accuracy 0.91
SuperVision | Oxford VGG
volleyball accuracy 0.01 | sea snake accuracy 0.01
basketball accuracy 0.01 | spider monkey, Ateles geoffroyi accuracy 0.02
ping-pong ball accuracy 0.02 | hook, claw accuracy 0.02
rugby ball accuracy 0.03 | wooden spoon accuracy 0.03
croquet ball accuracy 0.04 | water snake accuracy 0.04
SuperVision Outperformed Oxford VGG | Oxford VGG Outperformed SuperVision
Irish terrier win by 0.67 | bearskin, busby, shako win by 0.69
| volleyball win by 0.65
weasel win by 0.6 | rugby ball win by 0.45
otter win by 0.58 | steel drum win by 0.43
whiptail, whiptail lizard win by 0.57 | microwave, microwave oven win by 0.36