A Survey of Visual Question Generation

Given an image, the task is to generate natural questions about that image.

0x01. Datasets


Antol, Stanislaw, et al. “VQA: Visual question answering.” Proceedings of the IEEE International Conference on Computer Vision. 2015.

Zhang, Peng, et al. “Yin and yang: Balancing and answering binary visual questions.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016.

Goyal, Yash, et al. “Making the V in VQA matter: Elevating the role of image understanding in Visual Question Answering.” Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017.

VQA is a new dataset containing open-ended questions about images. These questions require an understanding of vision, language and commonsense knowledge to answer.

  • 265,016 images (COCO and abstract scenes)
  • At least 3 questions (5.4 questions on average) per image
  • 10 ground truth answers per question
  • 3 plausible (but likely incorrect) answers per question
  • Automatic evaluation metric
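
The automatic metric is usually described as a consensus accuracy over the 10 human answers: a predicted answer counts as fully correct if at least 3 annotators gave it. The following is a minimal sketch of that simplified form; the function name `vqa_accuracy` is hypothetical, and the official evaluation script additionally normalizes answer strings and averages over annotator subsets, which is not reproduced here.

```python
def vqa_accuracy(predicted, human_answers):
    """Simplified VQA accuracy: min(#humans who gave this answer / 3, 1).

    Assumption: answers are compared after lowercasing and stripping
    whitespace only; the official script applies more normalization.
    """
    target = predicted.strip().lower()
    matches = sum(1 for a in human_answers if a.strip().lower() == target)
    return min(matches / 3.0, 1.0)


# Ten ground-truth answers for one question (made-up example data).
answers = ["yes"] * 2 + ["no"] * 8
print(vqa_accuracy("no", answers))
print(vqa_accuracy("yes", answers))
```

An answer given by only two annotators receives partial credit, which is how the metric tolerates disagreement among the ground-truth answers.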


Mostafazadeh, Nasrin, et al. “Generating Natural Questions About an Image.” Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2016.

This dataset is described in http://aclanthology.info/papers/generating-natural-questions-about-an-image. It consists of 9 CSV files, organized first by image source (Bing, MSCOCO, or Flickr) and then by split (train, dev, and test). Each file lists the image_id, the link to the image, and up to 5 natural questions written by crowdworkers on Amazon Mechanical Turk in response to the image. Keep these files separate so that system accuracy and progress can be reported on the dev and test sets. For the Bing images, the dataset includes up to 5 captions for each image link; captions for the COCO and Flickr images are available elsewhere. In addition, each test set file includes the human ratings of the questions, which are needed to compute the deltaBLEU score (see http://aclanthology.info/papers/deltableu-a-discriminative-metric-for-generation-tasks-with-intrinsically-diverse-targets).
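
As a rough illustration of the file layout described above, the sketch below parses one split file into a mapping from image id to its questions. The column names (`image_id`, `image_url`, `questions`), the `---` question separator, and the sample rows are all assumptions made for this example; check the header of the actual CSV files before relying on them.

```python
import csv
import io

# Hypothetical sample mimicking one split file. The real column names and
# the separator between questions may differ.
SAMPLE = """image_id,image_url,questions
1,http://example.com/a.jpg,What happened here?---Who won the race?
2,http://example.com/b.jpg,Is the dog okay?
"""


def load_questions(handle, sep="---"):
    """Return {image_id: [question, ...]} for one source/split file."""
    result = {}
    for row in csv.DictReader(handle):
        questions = [q.strip() for q in row["questions"].split(sep) if q.strip()]
        result[row["image_id"]] = questions
    return result


data = load_questions(io.StringIO(SAMPLE))
print(data["1"])
```

Loading each source and split through a separate call keeps the train, dev, and test files apart, as the dataset description asks.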


IV. Visual Genome

V. VizWiz

VI. Visual Dialog

0x02. Researchers

  • Indian Institute of Technology, Badri N. Patro & Vinay P. Namboodiri: recently published 6 papers on VQG/VQA, three of which were accepted and two of which have open-source code (though with few stars)

  • Microsoft, Nan Duan, Duyu Tang, Tong Wang(Maluuba)

  • Google DeepMind, Oriol Vinyals

  • Nasrin Mostafazadeh

  • Alexander Toshev

0x03. Architecture

0. Rule-based ❌


2016 Microsoft

2. IQ

with VAE/maximizes mutual information/doesn’t need to know the expected answer


0x04. Experiments


0xFD. Metrics

  • word-overlap metrics: BLEU, METEOR, ROUGE, etc.

  • embedding-based metrics: Skip-Thought, Embedding average, Vector extrema, Greedy matching, etc.

1. Word-overlap metrics

1.1. BLEU

Widely used in the machine translation literature.


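  • Precision-oriented: recall is not measured directly, and the brevity penalty only partly compensates for that.
  • Word order is captured only locally, through matches of higher-order n-grams.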


First, compute the brevity penalty BP:

$$BP = \begin{cases} 1, & \text{if } c > r \\ e^{(1 - r/c)}, & \text{if } c \le r \end{cases}$$

Where c is the total length of the candidate translation corpus, and r is the effective reference corpus length.
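
The brevity penalty can be sketched in a few lines. The function name `brevity_penalty` is hypothetical, and returning 0 for an empty candidate is an assumption made here to avoid dividing by zero.

```python
import math


def brevity_penalty(c, r):
    """Brevity penalty for candidate length c and reference length r."""
    if c <= 0:
        # Assumption: an empty candidate gets the maximum penalty.
        return 0.0
    if c > r:
        return 1.0
    # For c == r this evaluates to exp(0) = 1, so both branches agree there.
    return math.exp(1.0 - r / c)


print(brevity_penalty(18, 20))
```

Candidates longer than the reference are never penalized here; excessive length is already punished by the n-gram precisions.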

Next, compute the geometric average of the modified n-gram precisions, $p_n$, using n-grams up to length N and positive weights $w_n$ summing to one.
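
The "modified" precision clips each candidate n-gram count by the maximum number of times that n-gram appears in any single reference, so repeating a correct word cannot inflate the score. Below is a minimal sentence-level sketch; the helper names are hypothetical, and corpus-level BLEU would sum the clipped and total counts over all sentences before dividing.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def modified_precision(candidate, references, n):
    """Clipped n-gram precision of one candidate against its references."""
    cand_counts = Counter(ngrams(candidate, n))
    if not cand_counts:
        return 0.0
    # Maximum count of each n-gram over the individual references.
    max_ref = Counter()
    for ref in references:
        for gram, count in Counter(ngrams(ref, n)).items():
            max_ref[gram] = max(max_ref[gram], count)
    clipped = sum(min(count, max_ref[gram]) for gram, count in cand_counts.items())
    return clipped / sum(cand_counts.values())


candidate = "the the the the the the the".split()
references = ["the cat is on the mat".split(),
              "there is a cat on the mat".split()]
print(modified_precision(candidate, references, 1))
```

In this degenerate candidate, "the" occurs 7 times but at most twice in any one reference, so only 2 of the 7 unigrams are credited.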

Then, BLEU is given by:

$$BLEU = BP \cdot exp\left(\sum_{n=1}^N w_n \log p_n\right)$$

Where $exp\left(\sum_{n=1}^N w_n \log p_n\right)$ is the weighted geometric mean of the modified n-gram precisions; the sum inside the exponential is the weighted sum of their logarithms.

The ranking behavior is more immediately apparent in the log domain:

$$log \ BLEU = min(1-\frac{r}{c},0) + \sum_{n=1}^N w_n \log p_n$$
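
As a worked example with made-up numbers, take $N = 2$, uniform weights $w_1 = w_2 = \frac{1}{2}$, precisions $p_1 = 0.8$ and $p_2 = 0.5$, candidate length $c = 18$ and reference length $r = 20$ (with $\log$ denoting the natural logarithm):

$$log \ BLEU = min(1-\frac{20}{18},0) + \frac{1}{2} \log 0.8 + \frac{1}{2} \log 0.5 \approx -0.1111 - 0.1116 - 0.3466 = -0.5693$$

$$BLEU = exp(-0.5693) \approx 0.566$$

The same value follows from the product form: $BP = e^{1 - 20/18} \approx 0.895$ and $\sqrt{0.8 \cdot 0.5} \approx 0.632$, whose product is again about $0.566$.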

[1] [2]

1.2. ROUGE

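ROUGE is similar to BLEU but is recall-oriented: it measures how many of the reference n-grams the candidate recovers, rather than how many of the candidate n-grams are correct.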
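
The sketch below illustrates the recall view with a ROUGE-N style score for a single reference: the denominator counts n-grams in the reference, not in the candidate. The function names are hypothetical, and other ROUGE variants (such as ROUGE-L, based on the longest common subsequence) are not covered.

```python
from collections import Counter


def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def rouge_n_recall(candidate, reference, n=1):
    """Fraction of reference n-grams that also occur in the candidate."""
    ref_counts = Counter(ngrams(reference, n))
    if not ref_counts:
        return 0.0
    cand_counts = Counter(ngrams(candidate, n))
    # Each reference n-gram is credited at most as often as the candidate has it.
    overlap = sum(min(count, cand_counts[gram]) for gram, count in ref_counts.items())
    return overlap / sum(ref_counts.values())


candidate = "the cat sat on the mat".split()
reference = "the cat is on the mat".split()
print(rouge_n_recall(candidate, reference, 1))
print(rouge_n_recall(candidate, reference, 2))
```

Here 5 of the 6 reference unigrams and 3 of the 5 reference bigrams are recovered by the candidate.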

1.3. METEOR

Matches are based not only on exact word forms but also on stems, synonyms, and paraphrases.
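
The staged matching idea can be sketched as follows: words are aligned first by exact form, then by stem, then by synonym. This is only an illustration of the matching stages, not the METEOR score itself. The suffix-stripping `toy_stem` and the `TOY_SYNONYMS` table are made-up stand-ins for the real stemmer and synonym resources, the paraphrase stage is omitted, and the greedy alignment is a simplification.

```python
def toy_stem(word):
    """Crude suffix stripper, an illustrative stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word


# Made-up synonym table, a stand-in for a real lexical resource.
TOY_SYNONYMS = {"dog": {"puppy"}, "puppy": {"dog"}}


def match_stage(cand_word, ref_word):
    """Return the first stage at which two words match, or None."""
    if cand_word == ref_word:
        return "exact"
    if toy_stem(cand_word) == toy_stem(ref_word):
        return "stem"
    if ref_word in TOY_SYNONYMS.get(cand_word, ()):
        return "synonym"
    return None


def align(candidate, reference):
    """Greedy one-to-one alignment, trying exact, then stem, then synonym."""
    matched = {}       # candidate index -> (reference index, stage)
    used_refs = set()  # reference indices already aligned
    for stage in ("exact", "stem", "synonym"):
        for i, cand_word in enumerate(candidate):
            if i in matched:
                continue
            for j, ref_word in enumerate(reference):
                if j not in used_refs and match_stage(cand_word, ref_word) == stage:
                    matched[i] = (j, stage)
                    used_refs.add(j)
                    break
    return matched


alignment = align("the puppy jumped".split(), "the dog jumps".split())
print(alignment)
```

With exact matching alone, only "the" would be aligned in this pair; the later stages also recover "jumped"/"jumps" and "puppy"/"dog".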

2. Embedding-based metrics

  • Skip-Thought
  • Embedding average
  • Vector extrema
  • Greedy matching
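
Three of these metrics can be sketched directly on top of word vectors. The version below uses a tiny made-up embedding table `EMB` in place of pretrained vectors, and all function names are hypothetical. Skip-Thought is left out because it needs a pretrained sentence encoder. Out-of-vocabulary words are simply dropped, which is an assumption of this sketch.

```python
import math

# Made-up 2-d word vectors, for illustration only.
EMB = {
    "what": [1.0, 0.0], "is": [0.0, 1.0], "this": [1.0, 1.0],
    "that": [0.9, 1.1], "not": [-2.0, 0.5],
}


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is empty or all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def vectors(tokens, emb):
    return [emb[t] for t in tokens if t in emb]


def embedding_average(tokens, emb):
    """Sentence vector = mean of the word vectors."""
    vecs = vectors(tokens, emb)
    return [sum(col) / len(vecs) for col in zip(*vecs)]


def vector_extrema(tokens, emb):
    """Sentence vector = per-dimension value with the largest magnitude."""
    return [max(col, key=abs) for col in zip(*vectors(tokens, emb))]


def greedy_matching(cand, ref, emb):
    """Average best word-to-word cosine, symmetrized over both directions."""
    def one_way(src, tgt):
        if not src or not tgt:
            return 0.0
        return sum(max(cosine(s, t) for t in tgt) for s in src) / len(src)
    c, r = vectors(cand, emb), vectors(ref, emb)
    return 0.5 * (one_way(c, r) + one_way(r, c))


cand = "what is this".split()
ref = "what is that".split()
print(cosine(embedding_average(cand, EMB), embedding_average(ref, EMB)))
print(cosine(vector_extrema(cand, EMB), vector_extrema(ref, EMB)))
print(greedy_matching(cand, ref, EMB))
```

Embedding average and vector extrema first reduce each sentence to one vector and then compare the two with cosine similarity, while greedy matching compares the sentences word by word.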


0xFE. Open Source Project


VQG: (sorted by stars)

0xFF. Papers (sorted by date)

  • Gao, J., Galley, M., & Li, L. (2019). Neural approaches to conversational ai. Foundations and Trends® in Information Retrieval, 13(2-3), 127-298.[Poster] [Paper]

  • Patro, B. N., Kurmi, V., Kumar, S., & Namboodiri, V. (2020). Deep Bayesian Network for Visual Question Generation. In The IEEE Winter Conference on Applications of Computer Vision (pp. 1566-1576).

  • Patro, B. N., Patel, S., & Namboodiri, V. P. (2019). Granular Multimodal Attention Networks for Visual Dialog. arXiv preprint arXiv:1910.05728.[Paper]

  • Patro, B. N., Lunayach, M., Patel, S., & Namboodiri, V. P. (2019). U-cam: Visual explanation using uncertainty based class activation maps. In Proceedings of the IEEE International Conference on Computer Vision (pp. 7444-7453). [Paper] [Proj] [code]

  • Patro, B. N., & Namboodiri, V. P. (2019). Deep Exemplar Networks for VQA and VQG. arXiv preprint arXiv:1912.09551.

  • Patro, B. N., & Namboodiri, V. P. (2019). Probabilistic framework for solving Visual Dialog. arXiv preprint arXiv:1909.04800.[Paper]

  • Lee, S. W., Gao, T., Yang, S., Yoo, J., & Ha, J. W. (2019). Large-Scale Answerer in Questioner’s Mind for Visual Dialog Question Generation. ICLR 2019.[Paper] [code]

  • Jedoui, K., Krishna, R., Bernstein, M., & Fei-Fei, L. (2019). Deep Bayesian Active Learning for Multiple Correct Outputs. arXiv preprint arXiv:1912.01119.

  • Krishna, R., Bernstein, M., & Fei-Fei, L. (2019). Information maximizing visual question generation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 2008-2018).[Paper] [Proj] [Code]

  • Fan, Z., Wei, Z., Wang, S., Liu, Y., & Huang, X. J. (2018, August). A reinforcement learning framework for natural question generation using bi-discriminators. In Proceedings of the 27th International Conference on Computational Linguistics (pp. 1763-1774).[Paper]

  • Patro, B. N., Kumar, S., Kurmi, V. K., & Namboodiri, V. P. (2018). Multimodal differential network for visual question generation. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (pp. 4002-4012). [Paper]. [Project Link] [code]

  • [49] Li, Y., Duan, N., Zhou, B., Chu, X., Ouyang, W., Wang, X., & Zhou, M. (2018). Visual question generation as dual task of visual question answering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6116-6124). [Paper] [code]

  • [362] Das, A., Kottur, S., Gupta, K., Singh, A., Yadav, D., Moura, J. M., … & Batra, D. (2017). Visual dialog. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 326-335).

  • Zhang, J., Wu, Q., Shen, C., Zhang, J., Lu, J., & Hengel, A. V. D. (2017). Asking the difficult questions: Goal-oriented visual question generation via intermediate rewards. arXiv preprint arXiv:1711.07614. [Paper]

  • [41] Wang, T., Yuan, X., & Trischler, A. (2017). A joint model for question answering and question generation. arXiv preprint arXiv:1706.01450.

  • [59] Tang, D., Duan, N., Qin, T., Yan, Z., & Zhou, M. (2017). Question answering and question generation as dual tasks. arXiv preprint arXiv:1706.02027.

  • [65] Jain, U., Zhang, Z., & Schwing, A. G. (2017). Creativity: Generating diverse questions using variational autoencoders. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (pp. 6485-6494). [Paper].

  • [67] Mostafazadeh, N., Brockett, C., Dolan, B., Galley, M., Gao, J., Spithourakis, G. P., & Vanderwende, L. (2017). Image-grounded conversations: Multimodal context for natural question and response generation. IJCNLP (pp. 462–472). [Paper].

  • [62] Duan, N., Tang, D., Chen, P., & Zhou, M. (2017, September). Question generation for question answering. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing (pp. 866-874).

  • [24] Zhang, S., Qu, L., You, S., Yang, Z., & Zhang, J. (2016). Automatic Generation of Grounded Visual Questions. [Paper]

  • [131] Mostafazadeh, N., Misra, I., Devlin, J., Mitchell, M., He, X., & Vanderwende, L. (2016). Generating natural questions about an image. In ACL, the Association for Computational Linguistics (pp. 1802-1813).[Paper] [code1] [code2] [code3] [code4]

  • Yang, Y., Li, Y., Fermuller, C., & Aloimonos, Y. (2015). Neural Self Talk: Image Understanding via Continuous Questioning and Answering.[Paper].

  • Carl Saldanha, Visual Question Generation

  • [3410] Vinyals, O., Toshev, A., Bengio, S., & Erhan, D. (2015). Show and tell: A neural image caption generator. In Proceedings of the IEEE conference on computer vision and pattern recognition (pp. 3156-3164).[Paper] [code1] [code2] [code3]