In this project, the Video-To-Text Pilot Task for video captioning in NIST TRECVID 2017 is tackled. We adopted the Show-and-Tell model for video captioning as our baseline architecture and made several modifications on top of it to improve its performance. The five main approaches explored are the following: (1) pre-training the model on an image dataset; (2) applying a spatial attention model; (3) feeding external semantic image tags; (4) using multi-modal fusion; and (5) adding a generative adversarial network (GAN) loss.
Experiments on these approaches are conducted on the MSRVTT and MSVD datasets. It is found that the attention model, semantic tags, and multi-modal fusion each boost system performance incrementally. Combining these three techniques leads to a 13 percent increase in the sum of the BLEU-4, METEOR, and CIDEr evaluation scores on the MSRVTT dataset compared to our baseline Show-and-Tell model. It is also shown that state-of-the-art performance is achieved on the MSVD dataset.
To make the generated captions more similar to the given ground-truth captions while retaining good sequence/sentence properties, the following Generative Adversarial Network (GAN) based method is experimented with to observe the effect of updating the network weights in a game-theoretic setup.

Experimental GAN method architecture:
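As a rough illustration of this adversarial setup (not the project's actual architecture), the minimal PyTorch-style sketch below pairs a caption decoder with a discriminator that scores (video feature, caption) pairs. It assumes the decoder already yields a cross-entropy loss and a continuous (e.g. probability-weighted) embedding of its generated caption so that gradients can flow through the discrete word choices; all class and variable names are illustrative.

```python
# Hedged sketch: adding a GAN loss on top of a caption decoder.
# CaptionDiscriminator, vid_feat, real_emb, fake_emb, etc. are assumed names.
import torch
import torch.nn as nn

class CaptionDiscriminator(nn.Module):
    """Scores a (video feature, caption embedding) pair as real or generated."""
    def __init__(self, vid_dim, emb_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(vid_dim + emb_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, vid_feat, cap_emb):
        # cap_emb: mean-pooled word embeddings of a caption, shape (batch, emb_dim)
        return self.net(torch.cat([vid_feat, cap_emb], dim=-1))

bce = nn.BCEWithLogitsLoss()

def discriminator_step(D, vid_feat, real_emb, fake_emb, d_opt):
    """One discriminator update: push real captions toward 1, generated toward 0."""
    d_opt.zero_grad()
    real_logit = D(vid_feat, real_emb)
    fake_logit = D(vid_feat, fake_emb.detach())  # do not update the generator here
    loss = bce(real_logit, torch.ones_like(real_logit)) + \
           bce(fake_logit, torch.zeros_like(fake_logit))
    loss.backward()
    d_opt.step()
    return loss.item()

def generator_step(D, vid_feat, fake_emb, xent_loss, g_opt, gan_weight=0.1):
    """One generator update: usual cross-entropy plus an adversarial term that
    rewards captions the discriminator judges as real."""
    g_opt.zero_grad()
    fake_logit = D(vid_feat, fake_emb)
    adv_loss = bce(fake_logit, torch.ones_like(fake_logit))
    loss = xent_loss + gan_weight * adv_loss
    loss.backward()
    g_opt.step()
    return loss.item()
```

In this two-player game, the discriminator learns to separate ground-truth from generated captions, while the caption decoder is additionally rewarded for fooling it; the relative weight of the adversarial term (here the assumed `gan_weight`) controls how strongly the GAN loss influences the captions compared with the standard cross-entropy objective.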