IntPhys V1.0


The goal of this benchmark is:

  1. to provide an evaluation set for object permanence
  2. to provide a training set for systems that would acquire object permanence in a completely unsupervised way.

Note that the training set is optional. Any system can be tested on our object permanence test set, even if it was trained on a different dataset (e.g., real videos), or even hand-designed.

The benchmark is fully described in the paper. To get started, use our GitHub repo (on the 1.0 branch).

Train set

We provide 15,000 videos of 100 frames each (approx. 25 hours of video). The uncompressed file size is 200 GB. The training samples are always physically possible and have high variability in object sizes, trajectories, textures, etc.

[Example training videos: _images/train_1.gif – _images/train_4.gif]


Each video comes with its associated depth field and object masks (each object has a unique id), along with a detailed status in JSON format.

[Example depth and mask sequences: _images/meta_1.gif – _images/meta_3.gif]

Example of a JSON status file (only 2 frames shown for illustration; real files contain 100 frames):

    {
      "camera": "196 304 153 -13 -120 -0",
      "light_1": "-2978.0249023438 -1332.8410644531 181.63397216797 -20.454223632812 98.856430053711 9.1122478806938e-07",
      "frames": [
        {
          "object_3": "456.66390991211 -861.45208740234 159.97134399414 -28.465270996094 80.850463867188 -90.793273925781",
          "object_2": "-378.92248535156 -700.74951171875 131.89306640625 34.369613647461 -91.140380859375 16.656764984131",
          "object_1": "500 -550 188.73477172852 -62.817230224609 9.9395532608032 -68.216156005859"
        },
        {
          "object_3": "413.35668945312 -872.896484375 132.60464477539 -28.465270996094 80.850463867188 -90.793273925781",
          "object_2": "-355.95150756836 -701.56634521484 144.69281005859 34.369613647461 -91.140380859375 16.656764984131",
          "object_1": "500 -550 181.11834716797 -62.817230224609 9.9395532608032 -68.216156005859"
        }
      ]
    }
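Assuming each entry in the status file is a string of six space-separated numbers, with the first three giving an entity's position and the last three its rotation (this split is our assumption, not stated above), the poses can be parsed with a small sketch like this:

```python
def parse_pose(pose_str):
    # Split a 6-number pose string into (position, rotation) triples.
    # The position/rotation interpretation is an assumption about the format.
    values = [float(v) for v in pose_str.split()]
    return values[:3], values[3:]

# One frame of a status file, copied from the example above.
frame = {
    "object_1": "500 -550 188.73477172852 -62.817230224609 9.9395532608032 -68.216156005859",
}
for name, pose in frame.items():
    position, rotation = parse_pose(pose)
    print(name, position, rotation)
```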

Dev and Test

[Example test videos: _images/test_1.gif – _images/test_4.gif]

The test samples come as quadruplets: 2 possible cases and 2 impossible ones. They are organized into 18 conditions, resulting from the combination of 2 visibility settings (visible vs. occluded), 3 numbers of objects (0 vs 1, 1 vs 2, 2 vs 3), and 3 movement types (static, dynamic 1, dynamic 2).
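The 2 × 3 × 3 combination can be enumerated programmatically; the labels below are illustrative, not the official condition or directory names:

```python
from itertools import product

visibilities = ["visible", "occluded"]
object_counts = ["0_vs_1", "1_vs_2", "2_vs_3"]
movements = ["static", "dynamic_1", "dynamic_2"]

# The Cartesian product of the three factors yields the 18 test conditions.
conditions = ["_".join(parts)
              for parts in product(visibilities, object_counts, movements)]
print(len(conditions))  # 18
```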


There are 200 movies per condition (3,600 movies in total). The dev set is much smaller (12 movies per condition, 216 in total); it is only there to allow tuning of the plausibility function, not for detailed model comparison. The dev set comes with metadata and the evaluation code; the test set comes with raw videos (RGBD) only, with no metadata or evaluation code.

To evaluate test results, participants should package their scores in a zip file and create a DOI for it in the Zenodo intphys1.0 group. The evaluation is then computed automatically and added to the benchmark leaderboard, together with the date and the descriptions of the file and group.

Evaluation metric

Given a movie x, the system under evaluation must return a plausibility score P(x). Because the test movies are structured in N matched sets of positive and negative movies S_i = \{Pos^1_i, \dots, Pos^{n_i}_i, Imp^1_i, \dots, Imp^{n_i}_i\}, we derive two different metrics. The relative error rate L_R computes a score within each set: it only requires that, within a set, the positive movies are rated as more plausible than the impossible ones.

L_{R}=\frac{1}{N}\sum_{i}{\mathbb{I}_{\sum_{j}P(Pos_{i}^{j}) < \sum_{j}P(Imp_{i}^{j})}}
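As a sketch, the relative error rate defined above can be computed like this (the scores and set sizes below are hypothetical):

```python
def relative_error_rate(sets):
    # sets: list of (pos_scores, imp_scores) pairs, one per matched set.
    # A set counts as an error when the summed plausibility of the possible
    # movies is lower than that of the impossible ones.
    errors = sum(1 for pos, imp in sets if sum(pos) < sum(imp))
    return errors / len(sets)

# Two hypothetical quadruplets (2 possible and 2 impossible movies each).
sets = [([0.9, 0.8], [0.2, 0.1]),   # correctly ordered
        ([0.3, 0.2], [0.6, 0.7])]   # an error
print(relative_error_rate(sets))  # 0.5
```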

The absolute error rate L_A requires that, globally, the scores of the positive movies are higher than those of the impossible movies. It is computed as:

L_{A}=1-AUC(\{i,j; P(Pos_{i}^{j})\}, \{i,j;  P(Imp_{i}^{j})\})

Where AUC is the Area Under the ROC Curve, which plots the true positive rate against the false positive rate at various threshold settings.
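A minimal way to compute L_A without external libraries is the rank-based (Mann-Whitney) formulation of the AUC; the scores below are hypothetical:

```python
def auc(pos_scores, imp_scores):
    # AUC as the probability that a randomly chosen possible movie
    # scores above a randomly chosen impossible one (ties count as half).
    wins = sum((p > n) + 0.5 * (p == n)
               for p in pos_scores for n in imp_scores)
    return wins / (len(pos_scores) * len(imp_scores))

def absolute_error_rate(pos_scores, imp_scores):
    return 1.0 - auc(pos_scores, imp_scores)

# Hypothetical plausibility scores pooled over all test sets.
pos = [0.9, 0.8, 0.3]
imp = [0.2, 0.7, 0.1]
print(round(absolute_error_rate(pos, imp), 3))  # 0.111
```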

Human performance

Here is the human performance on this test, as measured with 71 Amazon Mechanical Turk workers.


Baseline results

Here are the results for two baseline systems, a ResNet and a GAN, each applied to the output of a mask-reconstruction CNN and tested on a future-prediction task with spans of 5 and 35 frames respectively.


The source code for the baseline systems can be found here:
