Yifan Xing

MSCV, CMU

Spatial Show Attend And Tell Model for Video Captioning

In this project, we tackle the Video-To-Text Pilot Task for video captioning in NIST TRECVID 2017. We adopted the Show-and-Tell model for video captioning as our baseline architecture and made several modifications on top of it to improve its performance. The five main approaches explored are the following: (1) pretraining the model on an image dataset; (2) applying a spatial attention model; (3) feeding external semantic image tags; (4) using multi-modal fusion; (5) adding a generative adversarial network (GAN) loss.
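To illustrate approach (2), the snippet below is a minimal sketch of soft spatial attention over CNN feature-map regions, written in PyTorch as an assumption about the framework; the class name, layer sizes, and variable names are hypothetical and not taken from the project code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class SpatialAttention(nn.Module):
    """Soft attention over spatial regions of a CNN feature map (illustrative sketch)."""
    def __init__(self, feat_dim, hidden_dim, att_dim):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, att_dim)      # project each image region
        self.hidden_proj = nn.Linear(hidden_dim, att_dim)  # project decoder LSTM hidden state
        self.score = nn.Linear(att_dim, 1)                 # scalar attention score per region

    def forward(self, feats, hidden):
        # feats: (batch, num_regions, feat_dim); hidden: (batch, hidden_dim)
        scores = self.score(torch.tanh(
            self.feat_proj(feats) + self.hidden_proj(hidden).unsqueeze(1)))  # (B, R, 1)
        alpha = F.softmax(scores, dim=1)          # attention weights over regions
        context = (alpha * feats).sum(dim=1)      # (B, feat_dim) attended context vector
        return context, alpha.squeeze(-1)

# Usage sketch: at each decoding step the attended context could be concatenated with
# the current word embedding before it is fed to the LSTM decoder, e.g.
#   context, alpha = attention(cnn_feats, lstm_hidden)
#   lstm_input = torch.cat([word_emb, context], dim=-1)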

Experiments on these approaches were conducted on the MSRVTT and MSVD datasets. We found that the attention model, semantic tags, and multi-modal fusion each boost system performance incrementally. Combining these three techniques leads to a 13 percent increase in the sum of the BLEU-4, METEOR, and CIDEr evaluation scores on the MSRVTT dataset compared to our baseline Show-and-Tell model. We also show that state-of-the-art performance is achieved on the MSVD dataset.
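For reference, the combined metric above can be computed with the standard COCO caption evaluation scorers; the snippet below is a sketch assuming the pycocoevalcap package (METEOR additionally requires Java) and uses toy captions rather than any real system output.

from pycocoevalcap.bleu.bleu import Bleu
from pycocoevalcap.meteor.meteor import Meteor
from pycocoevalcap.cider.cider import Cider

# Toy example: keys are video ids, values are lists of tokenized captions.
gts = {"video1": ["a man is playing a guitar", "someone plays guitar"]}  # references
res = {"video1": ["a man plays a guitar"]}                               # system output

bleu, _ = Bleu(4).compute_score(gts, res)    # bleu is a list [BLEU-1, ..., BLEU-4]
meteor, _ = Meteor().compute_score(gts, res)
cider, _ = Cider().compute_score(gts, res)
print("metric sum:", bleu[3] + meteor + cider)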

Classic Approach: Show-and-Tell Model without an attention model
Proposed Methodologies

To make the generated captions more similar to the given ground-truth captions while retaining good sequence/sentence properties, the following Generative Adversarial Network (GAN) based method was explored to observe the effect of updating the network weights in a game-theoretic setup.
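As a rough illustration of this adversarial idea (not the project's actual architecture), the sketch below shows a discriminator that scores (video feature, caption) pairs as real or generated; it assumes PyTorch, and all names and dimensions are hypothetical.

import torch
import torch.nn as nn

class CaptionDiscriminator(nn.Module):
    """Scores (video feature, caption) pairs as ground truth vs. generated (sketch)."""
    def __init__(self, vocab_size, emb_dim, vid_dim, hidden_dim):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.cls = nn.Sequential(
            nn.Linear(hidden_dim + vid_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1))  # logit: real vs. generated

    def forward(self, captions, vid_feats):
        # captions: (B, T) token ids; vid_feats: (B, vid_dim) pooled video feature
        _, h = self.rnn(self.embed(captions))
        return self.cls(torch.cat([h[-1], vid_feats], dim=-1)).squeeze(-1)

# Training sketch: the discriminator is updated with a binary cross-entropy loss,
# labeling ground-truth captions 1 and sampled captions 0, e.g.
#   bce = nn.BCEWithLogitsLoss()
#   d_loss = bce(D(real_caps, vid), ones) + bce(D(fake_caps, vid), zeros)
# Because sampling discrete words is non-differentiable, the captioner would need a
# policy-gradient-style update that uses the discriminator score as a reward.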

Experimental GAN method architecture:
GAN Method Qualitative Results (Baseline: classic Show-and-Tell; All: spatial attention + tag injection + MFCC audio feature fusion):
Check out the repository on GitHub!
Technical Report