
News

History

2014, 2013, 2012, 2011, 2010

Tentative Timetable

Introduction

This challenge evaluates algorithms for object localization/detection and image/scene classification from images and videos at large scale. This year there will be two main competitions and two taster competitions:
  1. Two main competitions:
    1. Object detection for 200 fully labeled categories.
    2. Object localization for 1000 categories.
  2. Two taster competitions:
    1. Object detection from video for 30 fully labeled categories.
    2. Scene classification for 401 categories. Joint with MIT Places team.

Main competitions

I: Object detection

The training and validation data for the object detection task will remain unchanged from ILSVRC 2014. The test data will be partially refreshed with new images for this year's competition. There are 200 basic-level categories for this task, which are fully annotated on the test data, i.e. bounding boxes for all instances of these categories have been labeled in each test image. The categories were carefully chosen considering different factors such as object scale, level of image clutter, average number of object instances, and several others. Some of the test images will contain none of the 200 categories. Browse all annotated detection images here.

For each image, algorithms will produce a set of annotations $(c_i, s_i, b_i)$ of class labels $c_i$, confidence scores $s_i$ and bounding boxes $b_i$. This set is expected to contain each instance of each of the 200 object categories. Objects which are not annotated will be penalized, as will duplicate detections (two annotations for the same object instance). The winner of the detection challenge will be the team which achieves first-place accuracy on the most object categories.
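The matching rule above can be illustrated with a short sketch. This is not the official evaluation code; it is a greedy matcher, assuming boxes in `(x1, y1, x2, y2)` form, that processes detections of one category in decreasing confidence order and lets each ground-truth box be claimed only once, so a second hit on the same object counts as a false positive (the duplicate-detection penalty):

```python
# Illustrative sketch (not the official evaluation code) of matching
# detections to ground-truth boxes for one category in one image.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def match_detections(dets, gts, thresh=0.5):
    """dets: list of (score, box); gts: list of ground-truth boxes.
    Returns one True (true positive) / False (miss or duplicate) flag
    per detection, in decreasing-score order."""
    claimed = [False] * len(gts)
    flags = []
    for score, box in sorted(dets, key=lambda d: -d[0]):
        best, best_iou = -1, thresh
        for k, gt in enumerate(gts):
            o = iou(box, gt)
            if not claimed[k] and o >= best_iou:
                best, best_iou = k, o
        if best >= 0:
            claimed[best] = True   # each object may be matched only once
            flags.append(True)
        else:
            flags.append(False)    # duplicate or unmatched detection
    return flags
```

In a full evaluation these flags would feed a per-category precision/recall computation over the whole test set; the sketch only shows the per-image matching step.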

II: Object localization

The data for the classification and localization tasks will remain unchanged from ILSVRC 2012. The validation and test data will consist of 150,000 photographs, collected from Flickr and other search engines, hand-labeled with the presence or absence of 1000 object categories. The 1000 object categories contain both internal nodes and leaf nodes of ImageNet, but do not overlap with each other. A random subset of 50,000 labeled images will be released as validation data in the development kit, along with a list of the 1000 categories. The remaining images will be used for evaluation and will be released without labels at test time. The training data, the subset of ImageNet containing the 1000 categories and 1.2 million images, will be packaged for easy downloading. The validation and test data for this competition are not contained in the ImageNet training data.

In this task, given an image an algorithm will produce 5 class labels $c_i, i=1,\dots 5$ in decreasing order of confidence and 5 bounding boxes $b_i, i=1,\dots 5$, one for each class label. The quality of a localization labeling will be evaluated based on the label that best matches the ground truth label for the image and also the bounding box that overlaps with the ground truth. The idea is to allow an algorithm to identify multiple objects in an image and not be penalized if one of the objects identified was in fact present, but not included in the ground truth.

The ground truth labels for the image are $C_k, k=1,\dots n$ with $n$ class labels. For each ground truth class label $C_k$, the ground truth bounding boxes are $B_{km},m=1\dots M_k$, where $M_k$ is the number of instances of the $k^\text{th}$ object in the current image.

Let $d(c_i,C_k) = 0$ if $c_i = C_k$ and 1 otherwise. Let $f(b_i,B_{km}) = 0$ if $b_i$ and $B_{km}$ have more than $50\%$ overlap, and 1 otherwise. The error of the algorithm on an individual image will be computed using:

\[ e=\frac{1}{n} \cdot \sum_k \min_{i} \min_{m} \max \{d(c_i,C_k), f(b_i,B_{km}) \} \] The winner of the object localization challenge will be the team which achieves the minimum average error across all test images.
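The per-image error above can be sketched directly from its definition. This is an illustrative implementation, not the official scoring code; predictions are assumed to be `(label, box)` pairs and the ground truth a mapping from each class label $C_k$ to its instance boxes $B_{km}$, with boxes in `(x1, y1, x2, y2)` form:

```python
# Sketch of the per-image localization error
#   e = (1/n) * sum_k min_i min_m max{ d(c_i, C_k), f(b_i, B_km) }
# (not the official evaluation code).

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = ((a[2] - a[0]) * (a[3] - a[1])
             + (b[2] - b[0]) * (b[3] - b[1]) - inter)
    return inter / union

def localization_error(preds, gt):
    """preds: up to 5 (label, box) pairs; gt: {class_label: [boxes]}."""
    total = 0.0
    for label, boxes in gt.items():      # ground-truth classes C_k
        best = 1.0
        for c_i, b_i in preds:           # min over predictions i
            d = 0.0 if c_i == label else 1.0
            for B_km in boxes:           # min over instances m
                f = 0.0 if iou(b_i, B_km) > 0.5 else 1.0
                best = min(best, max(d, f))
        total += best
    return total / len(gt)
```

A prediction contributes zero error for a class only when both the label matches and its box overlaps some instance of that class by more than 50%, matching the $\max$ inside the formula.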

Taster competitions

I: Object detection from video

This year there is a new detection task, object detection from video, similar in style to the object detection task. There are 30 basic-level categories for this task, a subset of the 200 basic-level categories of the object detection task. The categories were carefully chosen considering different factors such as movement type, level of video clutter, average number of object instances, and several others. All classes are fully labeled in each clip. We used the annotation service from Datatang for bounding-box annotation. Browse all annotated train/val snippets here.

For each video clip, algorithms will produce a set of annotations $(f_i, c_i, s_i, b_i)$ of frame number $f_i$, class labels $c_i$, confidence scores $s_i$ and bounding boxes $b_i$. This set is expected to contain each instance of each of the 30 object categories at each frame. The evaluation metric is the same as for the object detection task: objects which are not annotated will be penalized, as will duplicate detections (two annotations for the same object instance). The winner of the detection-from-video challenge will be the team which achieves the best accuracy on the most object categories.

II: Scene classification

This taster challenge is being organized by the MIT Places team, namely Aditya Khosla, Bolei Zhou, Agata Lapedriza, Antonio Torralba and Aude Oliva. Please feel free to send any questions or comments to Aditya Khosla (khosla@mit.edu).

If you are reporting results of the taster challenge or using the Places2 dataset, please cite:

The goal of this challenge is to identify the scene category depicted in a photograph. The data for this task comes from the Places2 dataset, which contains 10+ million images belonging to 400+ unique scene categories. Specifically, the challenge data will be divided into 8.1M images for training, 20k images for validation, and 381k images for testing, drawn from 401 scene categories. Note that the number of training images per category is non-uniform, ranging from 4,000 to 30,000, mimicking a more natural frequency of occurrence of scenes.

For each image, algorithms will produce a list of at most 5 scene categories in descending order of confidence. The quality of a labeling will be evaluated based on the label that best matches the ground truth label for the image. The idea is to allow an algorithm to identify multiple scene categories in an image given that many environments have multi-labels (e.g. a bar can also be a restaurant) and that humans often describe a place using different words (e.g. forest path, forest, woods).

For each image, an algorithm will produce 5 labels \( l_j, j=1,\dots,5 \). The ground truth labels for the image are \( g_k, k=1,\dots,n \), with \( n \) labeled scene classes. The error of the algorithm for that image is

\[ e= \frac{1}{n} \cdot \sum_k \min_j d(l_j,g_k). \]

\( d(x,y)=0 \) if \( x=y \) and 1 otherwise. The overall error score for an algorithm is the average error over all test images. Note that for this version of the competition \( n=1 \), i.e. there is one ground truth label per image.
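With \( n=1 \), the metric reduces to the familiar top-5 error. A minimal sketch (not the official scoring code), assuming one predicted label list per image and one ground-truth label per image:

```python
# Sketch of the top-5 scene classification error with n=1:
# an image is correct if its ground-truth label appears anywhere
# in the algorithm's (at most 5) predicted labels.

def top5_error(predictions, truths):
    """predictions: list of up-to-5 label lists, one per image;
    truths: list of ground-truth labels. Returns the average error."""
    errors = 0
    for labels, g in zip(predictions, truths):
        errors += 0 if g in labels[:5] else 1
    return errors / len(truths)
```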

Computational resources

NVIDIA and IBM Cloud are pleased to announce they are partnering in support of this year’s ILSVRC 2015 competition by making GPU resources available using IBM Cloud’s SoftLayer infrastructure for up to 30 days for any team accepted into the competition. Interested teams that have not already been approved to access these GPU resources must apply before October 9th, 2015, 5:00 PM US/Pacific time.

Registration

Please register to obtain the download links for the data and the application form for accessing GPU resources.

FAQ

1. Are challenge participants required to reveal all details of their methods?

Entries to ILSVRC2015 can be either "open" or "closed." Teams submitting "open" entries will be expected to reveal most details of their method (special exceptions may be made for pending publications). Teams may choose to submit a "closed" entry, and are then not required to provide any details beyond an abstract. The motivation for introducing this division is to allow greater participation from industrial teams that may be unable to reveal algorithmic details, while also allocating more time at the ICCV15 ImageNet and MS COCO Visual Recognition Challenges Joint Workshop to teams that are able to give more detailed presentations. Participants are strongly encouraged to submit "open" entries if possible.

2. Can additional images or annotations be used in the competition?

Entries submitted to ILSVRC2015 will be divided into two tracks: a "provided data" track (entries using only ILSVRC2015 images and annotations from any of the aforementioned tasks, in both the main competitions and taster competitions -- different from ILSVRC 2014), and an "external data" track (entries using any outside images or annotations). Any team that is unsure which track their entry belongs to should contact the organizers ASAP. Additional clarifications will be posted here as needed.

3. Is there still an image classification task?

There will no longer be a classification task on the ILSVRC 2012 classification/localization dataset; instead, teams must submit results in the localization format (both class labels and bounding boxes). If they choose to, they can return the full image as their guess for the object bounding box. Teams interested only in image-level classification can participate in the scene classification challenge.

4. How many entries can each team submit per competition?

Participants who have investigated several algorithms may submit one result per algorithm (up to 5 algorithms). Changes in algorithm parameters do not constitute a different algorithm (following the procedure used in PASCAL VOC).

Citation

If you are reporting results of the challenge or using the dataset, please cite:

Organizers

Sponsors

Contact

Please feel free to send any questions or comments to