Recent video question answering benchmarks indicate that state-of-the-art models struggle to answer compositional questions. However, it remains unclear which types of compositional reasoning cause models to mispredict. Furthermore, it is difficult to discern whether models arrive at answers using compositional reasoning or by leveraging data biases. In this paper, we develop a question decomposition engine that programmatically deconstructs a compositional question into a directed acyclic graph of sub-questions. The graph is designed such that each parent question is a composition of its children. We present AGQA-Decomp, a benchmark containing 2.3M question graphs, with an average of 11.49 sub-questions per graph, and 4.55M total new sub-questions. Using question graphs, we evaluate three state-of-the-art models with a suite of novel compositional consistency metrics. We find that models either cannot reason correctly through most compositions or are reliant on incorrect reasoning to reach answers, frequently contradicting themselves or achieving high accuracies when failing at intermediate reasoning steps.
Measuring Compositional Consistency for Video Question Answering
@inproceedings{Gandhi2022Measuring,
  title     = {Measuring Compositional Consistency for Video Question Answering},
  author    = {Mona Gandhi* and Mustafa Omer Gul* and Eva Prakash and Madeleine Grunde-McLaughlin and Ranjay Krishna and Maneesh Agrawala},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year      = {2022}
}
Decomposing AGQA Questions
Question: What is the person in front of first after eating some food?
Program: first(objects_after(objExists(person), relationExists(in front of), interactionExists(objExists(person), relationExists(eating), objExists(food))))
The decomposition engine recursively iterates over the functional program to form a directed acyclic graph (DAG) of sub-questions, in which each parent question is a composition of its children.
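To make the recursion concrete, below is a minimal sketch of such a decomposition pass. It is not the released engine: it assumes the functional program is given as nested tuples of the form (operation, arguments) and builds a tree of sub-programs, whereas the full engine also shares identical sub-questions across parents (yielding a DAG) and renders each sub-program back into a natural-language sub-question.

```python
# Minimal sketch of a recursive decomposition pass (illustrative, not the
# released engine). Assumption: each program node is a nested tuple of the
# form (operation, arg_1, ..., arg_k), with plain strings as leaf arguments.
from dataclasses import dataclass, field
from typing import List, Union

Program = Union[str, tuple]

@dataclass
class QuestionNode:
    """One sub-question; its children are the sub-programs it composes."""
    program: Program
    children: List["QuestionNode"] = field(default_factory=list)

def decompose(program: Program) -> QuestionNode:
    """Recursively walk the functional program, creating one node per
    sub-program so that every parent is a composition of its children."""
    node = QuestionNode(program=program)
    if isinstance(program, tuple):
        _, *args = program              # drop the operation name, keep arguments
        for arg in args:
            if isinstance(arg, tuple):  # only nested sub-programs become nodes
                node.children.append(decompose(arg))
    return node

# The example program from above, written in the nested-tuple form assumed here.
program = (
    "first",
    ("objects_after",
     ("objExists", "person"),
     ("relationExists", "in front of"),
     ("interactionExists",
      ("objExists", "person"),
      ("relationExists", "eating"),
      ("objExists", "food"))),
)
root = decompose(program)
print(len(root.children[0].children))  # 3: the direct sub-programs of objects_after
```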
Metrics
The sub-questions enable us to identify which compositions cause models to struggle and to test whether models are right for the right reasons. To measure model performance, we design the following novel metrics (a short computation sketch follows the definitions):
Compositional Accuracy
If the model answers all child questions correctly, does it answer the parent question correctly?
Right for the Wrong Reasons
If the model answers a child question incorrectly, does it still answer the parent question correctly?
Internal Consistency
Do the model's answers reflect a consistent understanding of visual events?
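To make these definitions concrete, the sketch below shows one way the compositional accuracy and right-for-the-wrong-reasons rates could be computed over a question graph. The class and field names are illustrative, not the benchmark's official evaluation code; internal consistency is omitted because it additionally requires composition-specific rules relating a parent's answer to its children's answers.

```python
# Illustrative sketch of two of the metrics (not the official evaluation code).
# Assumption: each parent question stores its child questions together with
# the model's prediction and the ground-truth answer.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class QA:
    answer: str                               # ground-truth answer
    prediction: str                           # model's answer
    children: List["QA"] = field(default_factory=list)

    @property
    def correct(self) -> bool:
        return self.prediction == self.answer

def compositional_accuracy(parents: List[QA]) -> Optional[float]:
    """Among parents whose children are all answered correctly, the fraction
    of parents the model also answers correctly."""
    eligible = [p for p in parents if p.children and all(c.correct for c in p.children)]
    return sum(p.correct for p in eligible) / len(eligible) if eligible else None

def right_for_wrong_reasons(parents: List[QA]) -> Optional[float]:
    """Among parents with at least one incorrectly answered child, the fraction
    of parents the model still answers correctly."""
    eligible = [p for p in parents if any(not c.correct for c in p.children)]
    return sum(p.correct for p in eligible) / len(eligible) if eligible else None

# Tiny usage example: the parent is answered correctly even though one child is
# wrong, so the right-for-the-wrong-reasons rate here is 1.0.
parent = QA(answer="food", prediction="food",
            children=[QA(answer="yes", prediction="yes"),
                      QA(answer="no", prediction="yes")])
print(compositional_accuracy([parent]), right_for_wrong_reasons([parent]))  # None 1.0
```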
Results
Evaluating three state-of-the-art video question answering models (HCRN, HME, and PSAC) with our DAGs and metrics, we find the following:
- For a majority of compositional reasoning steps, models either fail to successfully complete the step or rely on faulty reasoning mechanisms.
- Models frequently contradict themselves and achieve high accuracies even when failing at intermediate reasoning steps.
- Models struggle when asked to choose between or compare two options, such as objects or relationships.
- There is a weak negative correlation between internal consistency and accuracy across DAGs, suggesting that models are frequently inaccurate yet internally consistent, propagating their errors through the graph.
We believe that our decomposed question DAGs could enable a host of future research directions, from promoting model transparency to developing interactive model analysis tools.
AGQA-Decomp Data
- Question hierarchies with answers for the balanced train and test versions of AGQA 2.0
- Question hierarchies with answers for the unbalanced train and test versions of AGQA 2.0
- CSV files that should be used to train and test models
Steps to recreate results
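As a purely illustrative quick-start for the CSV files listed above, the snippet below shows how they could be inspected with pandas; the file name is a placeholder, not the actual name in the release.

```python
# Hypothetical quick-start: inspect one of the released CSV files with pandas.
# "balanced_train.csv" is a placeholder name; substitute the actual file
# from the downloaded archive.
import pandas as pd

df = pd.read_csv("balanced_train.csv")
print(df.columns.tolist())   # see which question/answer fields are provided
print(df.head())
```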