OCELOT Dataset

The dataset is available on Zenodo! Before downloading the dataset, please make sure to carefully read and agree to the Terms and Conditions.

Introduction

The OCELOT dataset is a histopathology dataset designed to facilitate the development of methods that utilize cell and tissue relationships. The dataset comprises both small and large field-of-view (FoV) patches extracted from digitally scanned whole slide images (WSIs), with overlapping regions. The small and large FoV patches are accompanied by annotations of cells and tissues, respectively. The WSIs are sourced from the publicly available TCGA database and were stained using the H&E method before being scanned with an Aperio scanner.

Each sample of the OCELOT dataset is composed of six components, $$\mathcal{D} = \{\left(x_{s}, y_s^{c}, x_l, y_l^{t}, c_x, c_y\right)_{i}\}_{i=1}^{N}$$ where $x_s, x_l$ are the small and large FoV patches extracted from the WSI, $y_s^{c}, y_l^{t}$ refer to the corresponding cell and tissue annotations, respectively, and $c_x, c_y$ are the relative coordinates of the center of $x_s$ within $x_l$. The figure below visualizes one sample.

A sample from the OCELOT dataset. Each sample consists of two input patches and the corresponding annotations. Left: the large FoV patch $x_{l}$ with tissue segmentation annotation $y_{l}^{t}$, where green denotes the cancer area. Right: the small FoV patch $x_{s}$ with cell point annotation $y_{s}^{c}$, where blue and yellow dots denote tumor and background cells, respectively. The red box indicates the size and location of $x_{s}$ with respect to $x_{l}$.

Dataset Details

We collect 304 whole slide images (WSIs) from the publicly available TCGA database [1]. The dataset is divided into three subsets (training, validation, and test) at a 6:2:2 ratio. To prevent information leakage between subsets, we split the dataset randomly per WSI, so that patches from the same WSI never appear in different subsets. We maintain consistent cancer-type ratios in each subset.
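The slide-level split described above can be illustrated with a minimal sketch. It is not the procedure used to build the released splits: the function name, its inputs, and the use of the organ as a stand-in for the cancer type are assumptions made for illustration. Slides are shuffled within each organ and divided 6:2:2, and every patch inherits the subset of its slide.

```python
import random
from collections import defaultdict


def split_by_slide(patch_to_slide, slide_to_organ, ratios=(0.6, 0.2, 0.2), seed=0):
    """Assign each patch to train/val/test so that no slide spans two subsets.

    patch_to_slide: dict mapping a patch id to its slide name (hypothetical input).
    slide_to_organ: dict mapping a slide name to its organ (hypothetical input).
    """
    rng = random.Random(seed)

    # Group slides by organ so that each organ is split with the same ratios.
    slides_per_organ = defaultdict(list)
    for slide in sorted(set(patch_to_slide.values())):
        slides_per_organ[slide_to_organ[slide]].append(slide)

    slide_to_subset = {}
    for organ in sorted(slides_per_organ):
        slides = slides_per_organ[organ]
        rng.shuffle(slides)
        n_train = round(ratios[0] * len(slides))
        n_val = round(ratios[1] * len(slides))
        for index, slide in enumerate(slides):
            if index < n_train:
                slide_to_subset[slide] = "train"
            elif index < n_train + n_val:
                slide_to_subset[slide] = "val"
            else:
                slide_to_subset[slide] = "test"

    # Every patch follows its slide, which rules out leakage across subsets.
    return {patch: slide_to_subset[slide] for patch, slide in patch_to_slide.items()}
```

Splitting at the slide level rather than the patch level is what keeps patches from the same WSI in the same subset; the per-organ grouping only approximates the balanced ratios mentioned above.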

File Structure

| Path | Description |
| --- | --- |
| images/{train,val,test}/{cell,tissue}/{uuid:03d}.jpg | Cell/tissue patch of size 1024x1024 |
| annotations/{train,val,test}/cell/{uuid:03d}.csv | Cell annotations, one x,y,label entry per line |
| annotations/{train,val,test}/tissue/{uuid:03d}.png | Tissue annotations |
| metadata.json | Metadata for every sample (independent of split) |
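As an illustration of the layout above, the following sketch builds the four file paths of one sample and parses a cell annotation file. The function names are hypothetical, and the parser assumes that the CSV has no header row and that x, y, and label are all integers; neither is stated here, so check a downloaded file first.

```python
import csv
from pathlib import Path


def sample_paths(root, subset, sample_id):
    """Return the four files that make up one sample, following the path patterns."""
    root = Path(root)
    name = f"{sample_id:03d}"  # zero-padded id, e.g. 7 -> "007"
    return {
        "cell_image": root / "images" / subset / "cell" / f"{name}.jpg",
        "tissue_image": root / "images" / subset / "tissue" / f"{name}.jpg",
        "cell_annotation": root / "annotations" / subset / "cell" / f"{name}.csv",
        "tissue_annotation": root / "annotations" / subset / "tissue" / f"{name}.png",
    }


def read_cell_annotations(csv_path):
    """Parse x,y,label rows into integer tuples (assumes no header, integer fields)."""
    with open(csv_path, newline="") as f:
        # Blank lines are skipped, so an empty file yields an empty list.
        return [tuple(int(value) for value in row) for row in csv.reader(f) if row]
```

For example, `sample_paths("ocelot", "train", 7)["cell_image"]` points to `ocelot/images/train/cell/007.jpg`, where `ocelot` is a placeholder for the download directory.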

Note: The validation and test splits will be released after August 11, 2023.

Metadata

The metadata of each sample is stored in a JSON file (metadata.json), which contains the following information:

| Field Name | Description |
| --- | --- |
| slide_name | TCGA slide name. To obtain additional information about the slide or case, you can query https://portal.gdc.cancer.gov/ with this slide name. |
| patch_x_offset | x-coordinate of the relative center location of the cell patch within the tissue patch. |
| patch_y_offset | y-coordinate of the relative center location of the cell patch within the tissue patch. |
| mpp_x | Micron-Per-Pixel (MPP) value of the slide, along the x-axis. |
| mpp_y | Micron-Per-Pixel (MPP) value of the slide, along the y-axis. |
| organ | The organ from which the sample was taken. |
| subset | Partition of the dataset, i.e. train, val, or test. |
| x_start | x-coordinate of the start of the range that specifies the cell or tissue patch within the WSI. The patch spans the pixels from x_start (inclusive) to x_end (exclusive). The top-left corner pixel of the WSI has coordinates (0,0). |
| y_start | y-coordinate of the start of the range that specifies the cell or tissue patch within the WSI. The patch spans the pixels from y_start (inclusive) to y_end (exclusive). The top-left corner pixel of the WSI has coordinates (0,0). |
| x_end | x-coordinate of the end of the range that specifies the cell or tissue patch within the WSI. The patch spans the pixels from x_start (inclusive) to x_end (exclusive). The top-left corner pixel of the WSI has coordinates (0,0). |
| y_end | y-coordinate of the end of the range that specifies the cell or tissue patch within the WSI. The patch spans the pixels from y_start (inclusive) to y_end (exclusive). The top-left corner pixel of the WSI has coordinates (0,0). |
| resized_mpp_x | Micron-Per-Pixel (MPP) value of the resized patch, along the x-axis. |
| resized_mpp_y | Micron-Per-Pixel (MPP) value of the resized patch, along the y-axis. |

We also provide a script that parses the metadata into a Python class. You can find the script in utils/metadata.py.
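To make the role of patch_x_offset and patch_y_offset concrete, here is a minimal sketch that converts them into the pixel box covered by the cell patch inside the resized 1024x1024 tissue patch. It is independent of utils/metadata.py, the function name is hypothetical, and it assumes the offsets are fractions in [0, 1] of the tissue patch width and height; verify this against metadata.json before relying on it. The factor of 4 is the ratio between the 4096-pixel and 1024-pixel fields of view.

```python
def cell_box_in_tissue_patch(patch_x_offset, patch_y_offset,
                             tissue_size=1024, fov_ratio=4):
    """Return (left, top, right, bottom) of the cell patch in resized tissue pixels.

    Assumes the offsets are fractions in [0, 1] locating the cell patch center.
    """
    box_size = tissue_size // fov_ratio  # 1024 // 4 = 256 pixels per side
    center_x = patch_x_offset * tissue_size
    center_y = patch_y_offset * tissue_size
    left = int(round(center_x - box_size / 2))
    top = int(round(center_y - box_size / 2))
    return left, top, left + box_size, top + box_size
```

Under these assumptions, a cell patch centered in the tissue patch (both offsets equal to 0.5) covers the box from (384, 384) to (640, 640).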

Statistics

Annotation

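| Labels | # Cells (Train) | # Cells (Val) | # Cells (Test) |
| --- | --- | --- | --- |
| BC (Background Cell) | 23.3K | 8.4K | 9.7K |
| TC (Tumor Cell) | 42.5K | 16.3K | 12.9K |
| Total | 65.8K | 24.7K | 22.5K |

Cell label count

| Labels | # Pixels (Train) | # Pixels (Val) | # Pixels (Test) |
| --- | --- | --- | --- |
| BG (Background) | 235.4M | 79.1M | 71.6M |
| CA (Cancer Area) | 166.6M | 57.8M | 58.6M |
| UNK (Unknown) | 17.4M | 6.8M | 6.1M |
| Total | 419.4M | 143.7M | 136.3M |

Tissue label count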

Dataset size per organ and data subset

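| Organs | # Slides (Train) | # Slides (Val) | # Slides (Test) | # Patch Pairs (Train) | # Patch Pairs (Val) | # Patch Pairs (Test) |
| --- | --- | --- | --- | --- | --- | --- |
| Bladder | 35 | 14 | 14 | 82 | 29 | 26 |
| Endometrium | 38 | 13 | 13 | 86 | 29 | 25 |
| Head-and-neck | 13 | 5 | 6 | 27 | 9 | 10 |
| Kidney | 47 | 15 | 18 | 122 | 41 | 41 |
| Prostate | 25 | 12 | 10 | 47 | 17 | 16 |
| Stomach | 15 | 6 | 5 | 36 | 12 | 12 |
| Total | 173 | 65 | 66 | 400 | 137 | 130 |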

Sources of error

The extraction and processing of specimens is a complex procedure that depends on multiple factors and can introduce various artifacts. As a result, annotating cell and tissue patches can be non-trivial for specific samples. Furthermore, pathologists show high intra- and inter-rater variability when assessing tissues and cells; i.e., there is a considerable amount of subjectivity. Both issues can lead to large annotation discrepancies across pathologists. For the cell patches in particular, we adopt a 2+1 annotation strategy to reduce annotation variability.

Data Collection

We collect 304 whole slide images (WSIs) from the publicly available TCGA database [1]. We provide the list of case IDs required to query further patient data available via the NCI GDC database [2]. The WSIs come from 6 organs: bladder, endometrium, head-and-neck, kidney, prostate, and stomach. From each WSI, two types of patches (image regions) are extracted at a resolution of approximately 0.2 microns per pixel: a small FoV patch for cell detection and a large FoV patch for tissue segmentation.

In total, we acquire 667 cell and tissue patch pairs with the corresponding annotations.

Patch Configuration

Cell detection tasks benefit from fine-grained spatial information to better capture detailed cell properties (e.g. border, shape, color, and opacity). In contrast, tissue segmentation requires a larger context to enable a better understanding of the overall structural information. Therefore, we define the FoV sizes of $x_s$ (cell detection) and $x_l$ (tissue segmentation) as 1024×1024 and 4096×4096 pixels, respectively, at a resolution of 0.2 Microns-per-Pixel (MPP). Finally, the large FoV patches and tissue annotations ($x_l$, $y_l^t$) are down-sampled by a factor of 4, resulting in a size of 1024×1024 pixels.
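The arithmetic behind this configuration can be checked with a short sketch. The constants come from the paragraph above; the function name and the returned keys are chosen for illustration only.

```python
MPP = 0.2             # microns per pixel at extraction
CELL_FOV_PX = 1024    # side length of the small FoV patch, in pixels
TISSUE_FOV_PX = 4096  # side length of the large FoV patch before down-sampling
DOWNSAMPLE = 4        # down-sampling factor applied to the large FoV patch


def patch_geometry():
    """Derive physical extents and the resolution of the down-sampled tissue patch."""
    return {
        "cell_extent_um": CELL_FOV_PX * MPP,               # 204.8 microns per side
        "tissue_extent_um": TISSUE_FOV_PX * MPP,           # 819.2 microns per side
        "tissue_size_px": TISSUE_FOV_PX // DOWNSAMPLE,     # 1024 pixels per side
        "tissue_mpp": MPP * DOWNSAMPLE,                    # 0.8 microns per pixel
        "cell_side_in_tissue_px": CELL_FOV_PX // DOWNSAMPLE,  # 256 pixels per side
    }
```

Both patches are therefore stored at 1024×1024 pixels, but the tissue patch covers four times the physical side length at a resolution four times coarser, and the cell patch occupies a 256×256 pixel region of it.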

Annotation procedure

All cell-tissue pairs of patches are annotated by board-certified pathologists. A total of 67 certified pathologists conducted the manual annotations as follows:

Cell annotations

Pathologists pinpoint the center position of each cell and assign the corresponding cell category. Annotations consider two categories of cells: Tumor Cell (TC) and Background Cell (BC). Background cells include all other cells in the patch that are not TCs.

Tissue annotations

Pathologists annotate pixel-wise tissue categories by marking cancer and non-cancer regions in the tissue patch. Finally, the patch is reviewed and submitted. Three categories are considered: Cancer Area (CA), Background (BG), and Unknown (UNK). The Unknown class marks pixels whose annotation is uncertain; these pixels can be excluded during training and evaluation.
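One way to exclude Unknown pixels during evaluation is sketched below as a masked pixel accuracy over plain nested lists. The pixel value 255 for Unknown, and the values 1 and 2 used in the usage note, are hypothetical placeholders, since the encoding of the tissue masks is not specified here; inspect a released mask to find the actual values.

```python
UNKNOWN = 255  # hypothetical pixel value for the Unknown class; check the released masks


def masked_pixel_accuracy(prediction, target, ignore_value=UNKNOWN):
    """Pixel accuracy over 2D label grids, skipping pixels labeled Unknown in the target."""
    correct = 0
    counted = 0
    for pred_row, target_row in zip(prediction, target):
        for pred, label in zip(pred_row, target_row):
            if label == ignore_value:
                continue  # uncertain annotation: contributes to neither count
            counted += 1
            correct += int(pred == label)
    # With no countable pixels the accuracy is undefined.
    return correct / counted if counted else float("nan")
```

With a target of `[[1, 2, 255], [2, 255, 1]]` and a prediction of `[[1, 2, 1], [1, 2, 1]]`, four pixels are counted and three of them match, so the function returns 0.75. The same masking idea applies to a training loss: pixels labeled Unknown simply contribute nothing.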

Strategy

Each tissue patch is annotated by a single board-certified pathologist, while for the cell patches we adopt a 2+1 annotation strategy with a final validation step. We find that cell annotations require a more rigorous procedure because of their higher intra- and inter-rater variability. First, two independent expert pathologists annotate the same patch. Both annotations are then passed to a third annotator, whose task is to merge them. Finally, a fourth annotator validates the merged annotation to ensure the quality of the annotation and the sample.

References