Cross-Domain Image Captioning with Discriminative Finetuning


  • Date: Friday, 26 May 2023
  • Time: 10:00-11:00 (Local Time)
  • Venue: Aula Seminari - Via Salaria 113, 3rd floor
  • Speaker: Roberto Dessì, Meta AI and Universitat Pompeu Fabra
  • Zoom Link:


Neural captioners are typically trained to mimic human image descriptions, without any awareness of the function such descriptions might serve, which leads to biased and vague captions. In this talk, I’ll show empirically that adding a self-supervised objective to a neural captioner helps recover a plain, visually descriptive language that is more informative about image contents. In particular, I’ll describe experiments in which we take an out-of-the-box neural captioner and finetune it with a discriminative objective: given a target image, the system must learn to produce a description that enables an out-of-the-box text-conditioned image retriever to identify the target image among a set of candidates. When the captioner is trained and tested on the same caption dataset, the captions emerging from discriminative finetuning lag slightly behind those of the non-finetuned model in similarity to ground-truth human descriptions. However, when the model is used without further tuning to caption out-of-domain datasets, the discriminatively finetuned captioner generates descriptions that resemble human references more closely than those produced by the same captioner trained only with supervised learning. I’ll further show that, on the Conceptual Captions dataset, discriminatively finetuned captions are more helpful than either vanilla or ground-truth captions to human subjects performing an image discrimination task. If time allows, I’ll conclude by drawing a connection between our work and reinforcement learning from human feedback (RLHF), a recently introduced method powering models like ChatGPT and InstructGPT.
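
As a rough illustration of the discriminative objective described in the abstract, the Python sketch below scores a caption by how well it lets a retriever pick the target image out of a set of candidates. It is a toy under stated assumptions, not the speaker's implementation: the embedding vectors, the softmax-over-cosine-similarity retriever, the `temperature` parameter, and the REINFORCE-style loss with a `baseline` are all hypothetical choices made for illustration.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [x / norm for x in v]


def retrieval_log_prob(caption_vec, image_vecs, target_idx, temperature=1.0):
    """Log-probability that a (hypothetical) retriever picks the target image.

    The retriever is assumed to rank candidates by cosine similarity between
    the caption embedding and each image embedding, turned into a
    distribution with a softmax.
    """
    c = normalize(caption_vec)
    scores = [dot(c, normalize(img)) / temperature for img in image_vecs]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(scores)
    log_z = m + math.log(sum(math.exp(s - m) for s in scores))
    return scores[target_idx] - log_z


def reinforce_loss(caption_log_prob, reward, baseline=0.0):
    """REINFORCE-style surrogate loss (assumed here, not taken from the talk).

    Minimizing it raises the captioner's log-probability of captions whose
    reward exceeds the baseline.
    """
    return -(reward - baseline) * caption_log_prob


# Made-up 2-d embeddings: the caption points roughly at candidate 0.
caption_vec = [1.0, 0.0]
candidates = [[1.0, 0.1], [0.0, 1.0], [-1.0, 0.0]]
reward = retrieval_log_prob(caption_vec, candidates, target_idx=0)
loss = reinforce_loss(caption_log_prob=-2.0, reward=reward, baseline=-1.0)
```

In this sketch the reward is highest when the caption embedding is closer to the target image than to the distractors, which is the pressure that pushes captions toward discriminative, visually descriptive content.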

Short Bio

Roberto Dessì is a final-year PhD student at Meta AI in Paris and Universitat Pompeu Fabra in Barcelona, under the supervision of Marco Baroni. His background lies at the intersection of computer engineering (B.Sc. at La Sapienza in Rome) and cognitive science (M.Sc. at the University of Trento), with a strong research focus on teaching machines to understand and generate language in more human-like ways. During his PhD, he has worked on a variety of topics, including linguistic compositionality in sequence-to-sequence models, emergent communication, reinforcement learning (RL) and language, and, most recently, large language models that can access external tools. He has organized several iterations of the emergent communication workshop at NeurIPS and ICLR, and his research has been published at top conferences such as NeurIPS, EMNLP, ACL, ICLR, and CVPR.

Adrian R. Minut
PhD Student