As a movie aficionado 🎬🍿 and content creator 🎨🖌️✍️, I have created my own Plex server to house my cherished film collection. Often, I like to use movie scenes as anecdotes to enhance my storytelling and create more engaging content. For example, when making a video about motivation, willpower, and overcoming odds, I might showcase relevant moments from a favorite anime, such as the exhilarating Super Saiyan transformation scenes from the Dragon Ball Super saga, or a workout and training scene from one of my favorite movies, like Never Back Down. Alternatively, directors or writers who are developing new movie scripts may want to analyze a set of similar movies to identify common themes or patterns, such as the number of comedy scenes, their duration, the occurrence of drift races, or the frequency of muscle cars being shown. Finding particular scenes within a vast array of movies or even within a single movie can be quite challenging, even for those with impressive memory skills. This is where video understanding comes to the rescue ⛑️.
Twelve Labs Search API offers a flexible video search solution that enables users to write simple natural language queries and combine them ingeniously, helping to uncover the relevant video segments. For instance, one could craft a combined query to reveal the specific drift scenes where the lead actor drives a red Mitsubishi. Alternatively, users might search for the thrilling moment when their favorite Formula One car crosses the finish line victoriously 🏁✌️.
Search result from the indexed Tokyo Drift movie for the combined query - 'drift' (search option: visual) AND 'Mitsubishi' (search option: logo) 😎
In the first part of this tutorial series, we explored how to perform searches within videos using simple search, where we only used one query at a time in our search requests. To make the most of this follow-up tutorial, I highly recommend reviewing the previous one to understand the basics of the Twelve Labs Search API. Assuming you have a good grasp of the basics, this tutorial will introduce more advanced concepts. We'll dive into the combined queries feature offered by the Twelve Labs API, which enables us to flexibly and conveniently locate specific moments of interest within indexed videos. To showcase this, I will create two separate indexes: one for Formula One races, and another for a well-known full-length movie, “Tokyo Drift” from the Fast and the Furious franchise. I'll then demonstrate how to use various operators to combine search queries, allowing us to identify the intriguing moments we're looking for. With that said, let's proceed to the tutorial overview and concretely outline what you can expect to learn throughout this guide.
💡 By the way, if you're reading this article and you're not a developer, fear not! I've included a link to a ready-made Jupyter notebook. You can easily tweak the queries and operators, then run the entire process to obtain the results 😄. Enjoy!
The previous tutorial is the only prerequisite to this one. If you hit any snags while reading this or the previous one, don't hesitate to give me a shout for help! We have super quick response times 🚅🏎️⚡️ on our Discord server. If Discord isn't your vibe, feel free to reach out to me via email. After creating a Twelve Labs account, you can access the API Dashboard and obtain your API key. For the purpose of this demo, I'll be using my existing account:
This is the initial step, where I'll create two indexes using our latest state-of-the-art video understanding engine, "Marengo 2.5," but with distinct indexing options. For the index focused on Formula One racing, enabling the text-in-video and logo options alongside visual and conversation is beneficial, because Formula One events abound with logos on vehicles, tracks, and fences, and feature a significant amount of on-screen text during the award ceremony. For the Tokyo Drift movie index, however, enabling the text-in-video option may not provide any value. This is where the flexibility of creating indexes with different options comes into play. By customizing the indexing options to suit your specific needs, you can optimize the use of compute resources and ultimately save on costs.
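To make this concrete, here's a minimal sketch of how the two indexes could be created with the `requests` library. The base URL, endpoint path, field names, and response keys are my assumptions and may differ across API versions, so please double-check them against the API reference:

```python
import requests

API_URL = "https://api.twelvelabs.io/v1.1"          # assumed base URL; verify against the docs
API_KEY = "<your API key from the dashboard>"
headers = {"x-api-key": API_KEY}

def create_index(name, options):
    """Create an index on the Marengo 2.5 engine with the given indexing options."""
    resp = requests.post(
        f"{API_URL}/indexes",
        headers=headers,
        json={
            "engine_id": "marengo2.5",
            "index_name": name,
            "index_options": options,                # e.g. visual, conversation, text_in_video, logo
        },
    )
    resp.raise_for_status()
    return resp.json()["_id"]                        # assumed response field

# Formula One benefits from text-in-video and logo detection...
f1_index_id = create_index("formula-one-races", ["visual", "conversation", "text_in_video", "logo"])
# ...while the movie index skips text-in-video to save compute.
movie_index_id = create_index("tokyo-drift", ["visual", "conversation", "logo"])
```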
Initiating video indexing tasks
I've set up the code to automatically take in all videos from a specific folder, assign each one the same name as its video file, and upload them to the platform asynchronously. Just make sure to place all the videos you want to include in an index within a single folder. The total indexing time will be approximately 40% of the longest video's duration because, even if you upload videos using a 'for' loop without spawning threads for parallel uploads, the system will index them in parallel (simultaneously). If you want to index more videos within the same index later, no problem! There's no need to create a new folder for new video files. Just add them to the existing folder, and the code will check whether there's already an indexed video with the same name, or a pending indexing task for one, before initiating indexing. This way, you can avoid duplicates – pretty cool, right? 😄
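Here's a rough sketch of that upload loop. I'm assuming a `/tasks` endpoint that accepts a multipart upload and a task listing that can be filtered by index; the folder name and response field names below are illustrative, not authoritative:

```python
import os
import requests

VIDEO_FOLDER = "videos/formula_one"   # hypothetical folder holding the videos to index

def existing_names(index_id):
    """Collect names of videos already indexed or currently being indexed."""
    names = set()
    # Videos already in the index (assumed response shape).
    videos = requests.get(f"{API_URL}/indexes/{index_id}/videos", headers=headers).json()
    names.update(v["metadata"]["filename"] for v in videos.get("data", []))
    # Pending or in-progress indexing tasks for the same index.
    tasks = requests.get(f"{API_URL}/tasks", headers=headers, params={"index_id": index_id}).json()
    names.update(t["metadata"]["filename"] for t in tasks.get("data", []))
    return names

def upload_folder(index_id, folder):
    """Create an indexing task for every video file in the folder that isn't already known."""
    already_there = existing_names(index_id)
    task_ids = []
    for filename in os.listdir(folder):
        if filename in already_there:
            print(f"Skipping {filename}: already indexed or in progress")
            continue
        with open(os.path.join(folder, filename), "rb") as f:
            resp = requests.post(
                f"{API_URL}/tasks",
                headers=headers,
                data={"index_id": index_id, "language": "en"},
                files={"video_file": (filename, f)},
            )
        resp.raise_for_status()
        task_ids.append(resp.json()["_id"])
    return task_ids

task_ids = upload_folder(f1_index_id, VIDEO_FOLDER)
```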
Monitoring the indexing process
I designed the monitoring function to display the estimated time remaining for the video currently being indexed. Once that indexing task is complete, the monitoring process moves on to the next one, which is already in progress thanks to the system's parallel indexing approach. This continues until all the videos within your folder are indexed. Finally, the total time taken to index everything is printed in seconds.
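A simple polling loop along those lines might look like the following. I'm assuming the task status response exposes `status` and `estimated_time` fields, so treat this as a sketch rather than the exact monitoring code:

```python
import time

def monitor_tasks(task_ids, poll_interval=30):
    """Poll each indexing task until it is ready, printing the time estimate as we go."""
    start = time.time()
    for task_id in task_ids:
        while True:
            task = requests.get(f"{API_URL}/tasks/{task_id}", headers=headers).json()
            status = task.get("status")
            if status == "ready":
                print(f"Task {task_id} finished")
                break
            # 'estimated_time' is assumed to be returned while indexing is still in progress.
            print(f"Task {task_id}: {status}, estimated completion: {task.get('estimated_time')}")
            time.sleep(poll_interval)
    print(f"Total indexing time: {time.time() - start:.0f} seconds")

monitor_tasks(task_ids)
```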
List all videos in the index
To ensure that all the required videos have been indexed, let's double-check by listing all the videos present within the index. Additionally, I'm creating a list containing all video IDs and their corresponding names, as this will later be used to fetch the corresponding video name for the video segment, which will be displayed along with the start and end timestamps.
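Something like the snippet below would do the job, again assuming the listing endpoint, pagination parameter, and response fields shown here:

```python
def list_videos(index_id):
    """Return (video_id, filename) pairs for every video in the index."""
    resp = requests.get(
        f"{API_URL}/indexes/{index_id}/videos",
        headers=headers,
        params={"page_limit": 50},        # assumed pagination parameter
    )
    resp.raise_for_status()
    return [(v["_id"], v["metadata"]["filename"]) for v in resp.json()["data"]]

video_id_name = dict(list_videos(f1_index_id))
for video_id, name in video_id_name.items():
    print(video_id, name)
```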
Once the system finishes indexing the videos and generating video embeddings, we will be ready to find the topmost semantically matching moments using the search API. We've already explored how to use simple queries in the previous tutorial; here, we will focus on formulating useful combined queries.
The search API enables constructing combined queries using the following operators:
That was quite a bit of theory; now let's dive into the more exciting application side by performing our first search using a combined query. It consists of two simple queries joined by the “AND” operator, with each query using different search options. The first query looks for scenes that are semantically similar to the concept of "winning a trophy" in both audio and visuals, while the second looks for scenes containing text or a logo that reads "crypto.com." By combining these queries, we can find video segments that satisfy both criteria simultaneously:
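The request could be sketched roughly as below. The exact endpoint path and payload schema for combined queries may differ from what I show here, so please treat this as illustrative and check the Search API reference:

```python
def combined_search(index_id, queries, operator="AND"):
    """Run a combined search built from several simple queries joined by one operator."""
    resp = requests.post(
        f"{API_URL}/search/combined",      # assumed endpoint name
        headers=headers,
        json={
            "index_id": index_id,
            "operator": operator,          # assumed field names
            "queries": queries,
        },
    )
    resp.raise_for_status()
    return resp.json()

results = combined_search(
    f1_index_id,
    [
        {"query": "winning a trophy", "search_options": ["visual", "conversation"]},
        {"query": "crypto.com", "search_options": ["text_in_video", "logo"]},
    ],
)
```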
Corresponding video segments:
This part exhilarates me as it highlights the presence of intelligence. The model exhibits a human-like understanding of the video content. As you can see in the above screenshot, the system nails it by pinpointing the exact moments I wanted to extract.
Let's give it another shot by combining a set of simple queries into a more specific search, this time against the second index containing the entire Tokyo Drift movie:
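Reusing the same hypothetical helper from above, the Tokyo Drift query from the earlier caption would look something like this:

```python
drift_results = combined_search(
    movie_index_id,
    [
        {"query": "drift", "search_options": ["visual"]},
        {"query": "Mitsubishi", "search_options": ["logo"]},
    ],
)
```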
Corresponding video segment:
Bingo! Once again, the system pinpointed the perfect moment spot-on. The scene features Sean (Lucas Black), the lead actor, skillfully drifting a red Mitsubishi.
Let's prepare a Python list that includes each video's ID, corresponding title, and its respective start and end timestamps. We'll pass this list to the Flask app in the next step, allowing us to display our search results on a webpage:
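Assuming the search response returns its matches under a `data` key with `video_id`, `start`, and `end` fields (as in the simple-search responses), the list can be assembled like this:

```python
# Build the id → filename mapping for the index we just searched (Tokyo Drift here).
video_id_name = dict(list_videos(movie_index_id))

search_results = []
for clip in drift_results.get("data", []):
    search_results.append({
        "video_id": clip["video_id"],
        "title": video_id_name.get(clip["video_id"], "unknown"),
        "start": clip["start"],
        "end": clip["end"],
    })
```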
To save the list for later use in a Flask app, we can serialize (pickle) the list to a file:
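Serializing the list is a one-liner with the standard library (the filename is just my placeholder):

```python
import pickle

with open("search_results.pkl", "wb") as f:
    pickle.dump(search_results, f)
```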
We've arrived at the final step, where we'll leverage the JSON responses we received to efficiently retrieve and display video segments without having to manually identify the start and end points. To achieve this, we'll host a web page that can utilize these timestamps and apply them to the videos retrieved from our local drive. As a result, we will have visually appealing video segments that match our search, all seamlessly displayed on our web page.
The directory structure will look like this:
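A conventional Flask layout for this would be along these lines (the file and folder names are my own placeholders):

```
flask_app/
├── app.py                  # Flask routes
├── search_results.pkl      # pickled list from the previous step
├── templates/
│   └── index.html          # Jinja2 template described below
└── static/
    └── videos/             # local copies of the indexed videos
```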
Flask app code
Below is the code for the "app.py" file:
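A minimal version consistent with the steps above might look like this; the file names and the single route are assumptions on my part, not the exact code from the notebook:

```python
# app.py — minimal sketch: load the pickled results and hand them to the template
import pickle
from flask import Flask, render_template

app = Flask(__name__)

with open("search_results.pkl", "rb") as f:
    search_results = pickle.load(f)

@app.route("/")
def index():
    # Each item carries the video title plus the start and end timestamps of a matching segment.
    return render_template("index.html", results=search_results)

if __name__ == "__main__":
    app.run(debug=True)
```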
Below is a sample Jinja2-based HTML template that incorporates code within the HTML file to iterate through the list we prepared earlier, fetch the required videos from the local drive, and display the results of our combined query:
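A bare-bones version of such a template could look like the following. It assumes the video files live under `static/videos/` and that each result's title matches the local filename; the `#t=start,end` media fragment tells the browser to play only that segment:

```html
<!-- templates/index.html — illustrative only -->
<!DOCTYPE html>
<html>
  <body>
    <h1>Combined query results</h1>
    {% for r in results %}
      <div>
        <h3>{{ r.title }} ({{ r.start }}s – {{ r.end }}s)</h3>
        <video width="480" controls
               src="{{ url_for('static', filename='videos/' + r.title) }}#t={{ r.start }},{{ r.end }}">
        </video>
      </div>
    {% endfor %}
  </body>
</html>
```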
Running the Flask app
Perfect! Let’s just run the last cell of our Jupyter notebook to launch our Flask app:
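If you're launching it from the notebook itself, that last cell can simply shell out to the Flask app (assuming `app.py` sits next to the notebook):

```python
# Run the Flask development server from the notebook (blocks until you interrupt the cell)
!python app.py
```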
You should see an output similar to the one below, which indicates that everything is going according to our expectations 😊:
Once you click on the URL link http://127.0.0.1:5000, you'll see the output corresponding to your combined search query, similar to the following:
Here's the Jupyter Notebook containing the complete code that we've put together throughout this tutorial - https://tinyurl.com/combinedQueries
In the upcoming post, we'll dive into the Classification API and will develop classification criteria on the fly to effectively classify a set of videos. Stay tuned for the forthcoming excitement and don't forget to join our Discord community to engage with other like-minded individuals who are passionate about multimodal foundation models.
Until next time,
Creating awesome Developer Experiences @ Twelve Labs