Semantic Boundaries Dataset and Benchmark


We created the Semantic Boundaries Dataset (henceforth abbreviated as SBD) and the associated benchmark to evaluate the task of predicting semantic contours, as opposed to semantic segmentations. While semantic segmentation aims to predict the pixels that lie inside the object, we are interested in predicting the pixels that lie on the boundary of the object, a task that is arguably harder (or alternatively, an error metric that is arguably more stringent).
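To make the distinction between the two targets concrete, here is a minimal sketch that derives boundary pixels from a binary object mask. It is only an illustration: the function name `boundary_pixels`, the 4-neighbour rule, and the toy mask are assumptions made for this example, not the procedure used to produce the SBD annotations.

```python
def boundary_pixels(mask):
    """Return the set of (row, col) object pixels that have at least one
    4-neighbour outside the object (or outside the image)."""
    rows, cols = len(mask), len(mask[0])
    boundary = set()
    for r in range(rows):
        for c in range(cols):
            if not mask[r][c]:
                continue  # background pixels are never boundary pixels here
            for dr, dc in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                nr, nc = r + dr, c + dc
                outside_image = not (0 <= nr < rows and 0 <= nc < cols)
                if outside_image or not mask[nr][nc]:
                    boundary.add((r, c))
                    break
    return boundary


# Toy 5x5 mask (assumed data): a 3x3 object in the centre of the image.
mask = [
    [0, 0, 0, 0, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0],
]
inside = {(r, c) for r, row in enumerate(mask) for c, v in enumerate(row) if v}
edge = boundary_pixels(mask)
print(len(inside), len(edge))
```

In this toy case the segmentation target covers all nine object pixels, while the contour target keeps only the ring around the centre pixel, which is why a small localization error costs more under a boundary metric.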

The dataset and benchmark can be downloaded as a single tarball here.


Please note that the train and val splits included with this dataset are different from the splits in the PASCAL VOC dataset. In particular, some "train" images might be part of VOC2012 val.

If you are interested in testing on VOC 2012 val, then use this train set, which excludes all val images. This was the set used in our ECCV 2014 paper. This train set contains 5623 images.
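If you assemble your own training list, the same exclusion can be sketched as a simple filter over image IDs. This is a hypothetical helper, not part of the download; the function name `exclude_val` and the IDs below are made up for illustration.

```python
def exclude_val(train_ids, val_ids):
    """Keep only the training IDs that do not appear in the val list,
    preserving the original order of train_ids."""
    val = set(val_ids)
    return [image_id for image_id in train_ids if image_id not in val]


# Made-up IDs in the VOC naming style (assumptions, not real split contents).
sbd_train = ["2008_000002", "2008_000003", "2008_000007", "2008_000009"]
voc_val = ["2008_000003", "2008_000009", "2008_000016"]

clean_train = exclude_val(sbd_train, voc_val)
overlap = set(clean_train) & set(voc_val)
print(clean_train, len(overlap))
```

Checking that the overlap is empty before training is a cheap way to confirm that results on VOC 2012 val are not contaminated.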

The following sections provide an overview of the dataset and benchmark. For details about how to use the benchmarking code, please look at the README inside the download. If you use this dataset and benchmark, please cite:

author = "Bharath Hariharan and Pablo Arbelaez and Lubomir Bourdev and Subhransu Maji and Jitendra Malik",
title = "Semantic Contours from Inverse Detectors",
booktitle = "International Conference on Computer Vision (ICCV)",
year = "2011",


The SBD currently contains annotations from 11355 images taken from the PASCAL VOC 2011 dataset. These images were annotated on Amazon Mechanical Turk and the conflicts between the segmentations were resolved manually. For each image, we provide both category-level and instance-level segmentations and boundaries. The segmentations and boundaries provided are for the 20 object categories in the VOC 2011 challenge.


We focus on the evaluation of category-specific boundaries. The experimental framework we propose is based heavily on the BSDS benchmark. Machine pixels are matched to pixels on the ground truth boundaries. Pixels that are farther from the ground truth than a threshold are not matched. Machine pixels that are matched form the true positives, while other machine pixels are false positives; ground truth pixels that are left unmatched count as misses. One can then compute precision-recall curves. The numbers we report in the paper are the AP (average precision) and MF (maximal F-measure).
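The evaluation described above can be sketched roughly as follows. This is a simplified illustration, not the benchmarking code in the download: the greedy nearest-pixel matching, the rectangle-rule AP, the convention that precision is 1 when nothing is predicted, and all function names and toy values are assumptions made for this example. Please refer to the README for the actual procedure.

```python
import math


def count_matches(pred, gt, max_dist):
    """Greedily match each predicted pixel to the nearest unused ground
    truth pixel within max_dist; return the number of one-to-one matches."""
    unused = set(gt)
    matched = 0
    for p in sorted(pred):
        best, best_d = None, None
        for g in sorted(unused):
            d = math.dist(p, g)
            if d <= max_dist and (best is None or d < best_d):
                best, best_d = g, d
        if best is not None:
            unused.remove(best)  # each ground truth pixel is used at most once
            matched += 1
    return matched


def precision_recall(scores, gt, threshold, max_dist):
    """Threshold the soft boundary scores, then score the surviving pixels."""
    pred = [p for p, s in scores.items() if s >= threshold]
    tp = count_matches(pred, gt, max_dist)
    precision = tp / len(pred) if pred else 1.0  # assumed convention
    recall = tp / len(gt) if gt else 1.0
    return precision, recall


def f_measure(precision, recall):
    total = precision + recall
    return 2 * precision * recall / total if total > 0 else 0.0


def summarize(scores, gt, thresholds, max_dist):
    """Return (maximal F-measure, rectangle-rule average precision)."""
    curve = [precision_recall(scores, gt, t, max_dist) for t in thresholds]
    max_f = max(f_measure(p, r) for p, r in curve)
    ap, prev_recall = 0.0, 0.0
    # Walk the curve in order of increasing recall, best precision first on ties.
    for p, r in sorted(curve, key=lambda pr: (pr[1], -pr[0])):
        ap += p * (r - prev_recall)
        prev_recall = r
    return max_f, ap


# Toy data (assumed): four ground truth pixels and four scored machine pixels,
# one of which, (5, 5), is far from any ground truth boundary.
gt = [(0, 0), (0, 1), (0, 2), (0, 3)]
scores = {(0, 0): 0.9, (0, 1): 0.8, (5, 5): 0.7, (0, 2): 0.4}
max_f, ap = summarize(scores, gt, thresholds=[0.85, 0.75, 0.5, 0.3], max_dist=1.5)
print(round(max_f, 4), round(ap, 4))
```

Sweeping the threshold over the machine scores traces out the precision-recall curve; MF is the best F-measure along that curve and AP summarizes the area under it.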


For the purpose of comparison, we also provide our own best results ("1-stage (allclasses)" from Table 1 in [1]) here. These results are slightly different from the numbers in [1], since the dataset has been cleaned up. Please use these newer results for your comparisons.