Twelve Labs is excited to announce that we’ve raised a $5M Seed funding round led by Index Ventures.
The Twelve Labs mission is to help developers build programs that can see, listen, and understand the world as we do by giving them the world’s most powerful video understanding infrastructure.
And these are the incredible folks who share our conviction that the world deserves an intelligent video understanding infrastructure, and have joined the round to build that future with us.
Today, more than 80% of the world’s data is in video. In fact, Cisco estimated in 2020 that it would take more than 5 million years to watch the amount of video that will cross global IP networks each month. According to Nielsen, US adults spent 5 hours and 21 minutes a day watching videos in 2021. That’s one-third of our total waking hours!
This seems like a lot, but it’s not that difficult to believe when we consider all the time we spend each day watching YouTube or Netflix, doing calls on Zoom, or recording videos of our kids on our phones. Video is here to stay, and it’s only becoming more deeply ingrained in every part of our lives.
Despite the exorbitant amount of video data that we consume and create each day, content within videos is still not searchable. If you were to search for a phrase within 300 pages of a text-based document, you’d be able to find it in less than a second with a simple CTRL+F. Across videos? Not possible.
Instead, we’ve had to rely on inadequate methods that are either wildly time-consuming or ineffective. An obvious workaround is to manually watch all the videos until you find what you are looking for. Larger organizations and enterprises would have people spend hours writing up tags (metadata) to match each timecode so that scenes could be located later through text-based matches on those tags. The more tech-enabled approach of today would be to use tech giants’ object detection APIs to auto-generate those tags based on the objects detected in each frame.
Unfortunately, no finite number of tags could possibly be enough to fully describe a scene. If a scene had not been tagged properly, it wouldn’t be found through a metadata search. But most importantly, tagging can’t take any sort of context into consideration. And context matters.
Why does context matter? Humans understand the world by forming relationships between objects in a scene and making connections between past and present. The way we search is the way we perceive and remember the world. Unless the tags are complex enough to include contextual understanding, they can’t help with search.
So we built the search that we believe the world deserves. From noteworthy discussion points within an organization’s extensive Zoom recordings to urgently needed scenes within a media company’s archive all the way to that special day with your firstborn, all it takes is a search to find that exact moment you are looking for. The beauty is that you can just type in whatever comes to mind when you remember it, and you will be brought to the exact time code and file relevant to your query. It’s not a tag match, it’s a real search.
This is what our AI does: It views and understands the content of a video, including visuals such as action and movement, as well as conversation. (Situational and temporal context included, of course!) It then transforms everything about the video into a powerful intermediary data format called vectors, which are basically lists of floating-point numbers that statistically represent the content of the video. When a user types in a search query, the query is converted into a vector as well, and our AI finds the video vectors that are closest to it and automatically outputs the most relevant scene and video file name.
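To make the idea of "closest vectors" concrete, here is a minimal sketch in Python. It is not our production system: the three-number vectors, the file names, the `search` function, and the choice of cosine similarity as the closeness measure are all assumptions made up for illustration. Real vectors are produced by a model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine similarity of two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def search(query_vector, index, top_k=1):
    """Return the top_k indexed scenes whose vectors are closest to the query vector."""
    scored = [(cosine_similarity(query_vector, entry["vector"]), entry) for entry in index]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # most similar first
    return [entry for _, entry in scored[:top_k]]


# Hypothetical index: one toy vector per scene, tagged with its file and time code.
INDEX = [
    {"file": "team_meeting.mp4", "start_sec": 120.0, "vector": [0.9, 0.1, 0.0]},
    {"file": "birthday.mp4", "start_sec": 45.0, "vector": [0.1, 0.8, 0.3]},
    {"file": "archive_reel.mp4", "start_sec": 300.0, "vector": [0.0, 0.2, 0.9]},
]

# Hypothetical query vector; in practice it would come from encoding the typed query.
query = [0.2, 0.9, 0.2]
best = search(query, INDEX)[0]
print(best["file"], best["start_sec"])
```

With these toy numbers the query lands closest to the `birthday.mp4` scene. At real scale, comparing the query against every stored vector one by one becomes slow, so systems of this kind typically rely on an approximate nearest-neighbor index instead of the exhaustive loop shown here.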
And we’ve come up with intuitive interfaces for developers to access this AI. With simple Index and Search API calls, developers can integrate powerful semantic video search into their video applications without ever having to actually think about search!
And we are officially the best in the world at it. At the end of last year, we got tired of being asked questions like, “So are you better than tech giants?” That’s when we decided to participate in the 2021 ICCV VALUE Challenge for Video Retrieval (= Search) hosted by Microsoft. And we won first place!
We are proud to say that we bested tech giants and outperformed Microsoft’s previous state-of-the-art at a time when we had no venture funding and just 12 people on the team. Here is Aiden’s (CTO) account of how we were able to beat the giants of the world.
We believe that to understand video is to understand the world. A strong video understanding infrastructure that can most accurately transform videos into vectors will pave the way for even better search and other intelligent applications that power the next generation of videos. Just a few of these applications include: video-to-video search, summary generation, and content recommendation.
By building a foundation model that understands videos, we help developers build programs that can see, hear, and understand the world as we do.
Though Twelve Labs is the best in the world at video search today, we know that there is still much more science to do to improve on what we have. We are extremely grateful for the support of our partners and excited about the innovation it makes possible.