Understanding videos is much more than extracting objects from images
A video contains rich information: movement, objects, sound, on-screen text, and speech. For an AI to understand a video in context, it must extract all of this information and also model the complex relations between objects and the connections between past and present.