I recently came across the paper “CLEVRER: CoLlision Events for Video REpresentation and Reasoning” by Kexin Yi, Chuang Gan, Yunzhu Li, Pushmeet Kohli, Jiajun Wu, Antonio Torralba, and Joshua B. Tenenbaum. It intrigued me, so I wanted to share some thoughts about it.

With the advancements in NN-based learning algorithms, many of us wonder whether one day we will have a model that can watch a video and truly understand what happened in it and why: not just recognize objects, but answer questions about events and their causes. This paper studies exactly that kind of video understanding.

CLEVRER is a diagnostic video dataset for temporal and causal reasoning under a fully controlled environment. It follows two guidelines: 

  1. The proposed tasks should focus on logical reasoning in the temporal and causal domains while staying simple and exhibiting minimal biases on visual scenes and language.
  2. The dataset should be fully controlled and well-annotated in order to host the complex reasoning tasks and provide effective diagnostics for models on those tasks.

The authors identified 3 key elements that are essential to this task:

  1. Recognition of the objects and events in the videos.
  2. Modeling the dynamics and causal relations between the objects and events.
  3. Understanding the symbolic logic behind the questions.

To address this, the authors study Neuro-Symbolic Dynamic Reasoning (NS-DR), a model that explicitly joins these components via a symbolic video representation, and they assess its performance and limitations.
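To make the idea of a symbolic video representation concrete, here is a minimal sketch (not the authors’ code) of what a video parser’s output might look like: a table of object attributes plus a timeline of events. The field names are my own assumptions for illustration.

```python
# Hypothetical symbolic representation of a CLEVRER-style video.
# Objects carry intrinsic attributes; events reference objects by id
# and are ordered by the frame at which they occur.
objects = [
    {"id": 0, "shape": "sphere", "color": "red", "material": "rubber"},
    {"id": 1, "shape": "cylinder", "color": "gray", "material": "metal"},
    {"id": 2, "shape": "cylinder", "color": "blue", "material": "rubber"},
]

events = [
    {"type": "enter", "object": 0, "frame": 10},
    {"type": "collision", "objects": (0, 1), "frame": 42},
    {"type": "collision", "objects": (1, 2), "frame": 65},
]

# With this structure, temporal questions reduce to sorting/filtering
# the event list, and attribute questions to lookups on the objects.
first_event = min(events, key=lambda e: e["frame"])
print(first_event["type"])  # enter
```

The point of such a representation is that once the video is parsed into discrete objects and events, reasoning becomes symbolic manipulation rather than raw pixel processing.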

Video link here.

Let’s take a look at an example of the above input video. Stop right at the first statement: “The red sphere enters the scene.” The model correctly identifies not only the object’s color and shape but also what it is doing. It then correctly identifies the collision event with the cylinder, that cylinder then hits another cylinder, and, look at that, it identifies that the cylinder is made of metal. I like that a lot, because this particular object is reflective and shows us more about the surrounding room than about the object itself.

It not only tells us what is going on in the scene; we can also ask questions and it answers them correctly. Motivated by the theory of human causal judgment, CLEVRER includes four types of questions: descriptive (e.g., “what color”), explanatory (“what’s responsible for”), predictive (“what will happen next”), and counterfactual (“what if”).
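A descriptive question like “What color is the object the red sphere collides with?” can be thought of as a small program executed over the symbolic representation. The sketch below is an illustration of that idea, not the paper’s actual program executor; all function and field names are my own assumptions.

```python
# Hypothetical symbolic scene, in the same spirit as the representation
# sketched earlier.
objects = {
    0: {"shape": "sphere", "color": "red"},
    1: {"shape": "cylinder", "color": "gray"},
}
events = [
    {"type": "collision", "objects": (0, 1), "frame": 42},
]

def filter_objects(attr, value):
    """Return ids of objects whose attribute matches the value."""
    return [i for i, o in objects.items() if o[attr] == value]

def partner_in_collision(obj_id):
    """Return the other object involved in the first collision with obj_id."""
    for e in events:
        if e["type"] == "collision" and obj_id in e["objects"]:
            a, b = e["objects"]
            return b if a == obj_id else a
    return None

def query(attr, obj_id):
    """Look up an attribute of an object."""
    return objects[obj_id][attr]

# "What color is the object the red sphere collides with?"
red_spheres = [i for i in filter_objects("color", "red")
               if objects[i]["shape"] == "sphere"]
answer = query("color", partner_in_collision(red_spheres[0]))
print(answer)  # gray
```

Chaining a few such primitives (filter, relate, query) is enough to express the descriptive questions; the predictive and counterfactual ones additionally require a learned dynamics model to simulate events that have not (or would not have) happened.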



Shubham Bindal


  1. Paper – CLEVRER: CoLlision Events for Video REpresentation and Reasoning: https://arxiv.org/pdf/1910.01442.pdf
  2. Website: http://clevrer.csail.mit.edu/
