Video Optical Character Recognition (OCR) involves detecting and extracting text from video frames using computer vision and machine learning algorithms. With video OCR, you can easily sift through your video content, pinpointing the exact moments where certain words, phrases, or even entire sentences appear on screen. Imagine the applications: from streamlining content search and navigation to deeper content analysis, smarter advertisement placement, content summarization, stronger SEO, and compliance monitoring.
Examples of elements that can be recognized by video OCR include:
In this tutorial, we will explore how the Twelve Labs platform enables video OCR at two distinct levels. At the video level, we take on the entire video in one fell swoop, harvesting every morsel of text it holds. At the index level, we sharpen our focus, homing in on a specific keyword or cluster of keywords, which we input as natural language queries to perform a comprehensive search across a library of videos indexed on the Twelve Labs platform.
The cherry on top? With Twelve Labs API at your disposal, you can accomplish all of this without worrying about the nitty-gritty of implementing and maintaining the OCR process. We've got your back from development to infrastructure, and even ongoing support. So gear up, and let's embark on this exciting expedition into the realm of video OCR together.
The Twelve Labs platform is presently in its open beta phase, and we are offering free video indexing credits for up to 10 hours of content upon sign-up. It'll be advantageous for you to sign up and get acquainted with the foundational aspects of the Twelve Labs platform before diving into this tutorial. Understanding video indexing, indexing options, the Task API, and search options is vital to following this tutorial smoothly, and I've covered all of these extensively in my first tutorial. However, if you hit a roadblock or find yourself lost at any juncture, don't hesitate to reach out. By the way, our response times on our Discord server are lightning fast 🚅🏎️⚡️ if Discord is your preferred platform.
Following our previous discourse, we will explore video OCR from two distinct angles and levels. Accordingly, I've divided this tutorial into two pivotal sections, followed by a finale where we bring everything together in a working demo web app:
The process of extracting all recognized text from a specific video entails three steps: creating an index with only the text_in_video option enabled, uploading the video to kick off indexing, and calling the text-in-video endpoint for that video once indexing completes.
Video OCR enabled us to scrutinize an entire video and distill all instances of text. Now, the text-in-video search feature empowers us to zero in on the precise moments or video snippets where the searched text appears. This greatly diminishes the time spent perusing a sizable catalogue of videos, yielding accurate search results based on how well the search terms align with the text that becomes visible on screen during video playback.
In our initial tutorials, we delved into content search within indexed videos, using natural language queries and various search options like visual (audio-visual search), conversation (dialogue search), and text-in-video (OCR). In this tutorial, we're going to repurpose our approach, harnessing only OCR technology to search for text within videos. To optimize processing time and costs, we'll create an index using solely the text_in_video indexing option. Then, we'll fire off our search query with the text_in_video search option, enabling us to discover relevant text matches within the indexed videos.
To bring it all home, we'll take the data yielded by the API endpoints and showcase them on a webpage, spinning up a Flask-based demo app that serves up a simple HTML page. The result of the video OCR will be neatly tabulated, displaying timestamps and associated text, while the text search will show the query we used and the corresponding video segments we found in response.
For the sake of simplicity, I've uploaded just two videos to an index using a pre-existing account. Feel free to sign up; given we're currently in open beta, you'll receive complimentary credits allowing you to index up to 10 hours of video content. If your needs extend beyond that, check out our pricing page for upgrading to the Developer plan.
Here, we’re going to delve into the essential elements that we'll need to include in our Jupyter notebook. This includes the necessary imports, defining API URLs, creating the index, and uploading videos from our local file system to kick off the indexing process:
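Here's a minimal sketch of what that setup cell could look like. The v1.1 endpoint paths, the payload fields (engine_id, index_options), and the engine name are my assumptions based on the API reference at the time of writing, and the index name is hypothetical; double-check everything against the current docs. The API key is read from an environment variable:

```python
import os
import requests

# Base URL and auth header for the Twelve Labs API.
# The version segment and field names are assumptions; verify against the current API reference.
API_URL = "https://api.twelvelabs.io/v1.1"
headers = {"x-api-key": os.environ["TWELVE_LABS_API_KEY"]}

INDEXES_URL = f"{API_URL}/indexes"
TASKS_URL = f"{API_URL}/tasks"
SEARCH_URL = f"{API_URL}/search"

# Create an index with only the text_in_video option enabled,
# keeping indexing focused on OCR to save processing time and cost.
index_payload = {
    "engine_id": "marengo2.5",            # engine name used here for illustration
    "index_options": ["text_in_video"],
    "index_name": "video-ocr-tutorial",   # hypothetical index name
}
res = requests.post(INDEXES_URL, headers=headers, json=index_payload)
INDEX_ID = res.json()["_id"]
print(f"Created index: {INDEX_ID}")
```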
Next, we'll upload two videos to the index we've just created. The videos are titled "A Brief History of Film" (courtesy of Film Thought Project, available at https://www.youtube.com/watch?v=utntGgcsZWI) and "GPT - Explained!" (courtesy of CodeEmporium, available at https://www.youtube.com/watch?v=3IweGfgytgY). I have downloaded these videos from their respective YouTube channels and saved them in a folder named 'static' on my local hard drive. We'll use these local files to index the videos onto the Twelve Labs platform:
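Here's what the upload step could look like, continuing from the setup cell above. The local file names are illustrative, and the multipart fields (index_id, language, video_file) and the "ready" status value follow the Task API as I understand it, so verify them against the current reference:

```python
import time

video_files = [
    "static/a_brief_history_of_film.mp4",  # illustrative local filenames
    "static/gpt_explained.mp4",
]

task_ids = []
for path in video_files:
    with open(path, "rb") as f:
        # Kick off an indexing task for each local video file.
        res = requests.post(
            TASKS_URL,
            headers=headers,
            data={"index_id": INDEX_ID, "language": "en"},
            files={"video_file": f},
        )
    task_id = res.json()["_id"]
    task_ids.append(task_id)
    print(f"Created indexing task {task_id} for {path}")

# Poll each task until indexing finishes (assuming a "ready" status value).
for task_id in task_ids:
    while requests.get(f"{TASKS_URL}/{task_id}", headers=headers).json().get("status") != "ready":
        time.sleep(10)
print("All videos indexed.")
```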
Now let's enumerate all the videos in our index. This allows us to retain the video ID of a specific video, the goal being to extract all the text embedded within it. Furthermore, akin to our methods in prior tutorials, I'm assembling a list of video IDs and their respective titles, designed to be subsequently fed into our Flask application.
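A sketch of that listing step is below; the response fields (_id and metadata.filename) are assumptions based on the API reference, so adjust them if your response shape differs:

```python
# List the videos in the index and keep their IDs and titles for later use.
res = requests.get(
    f"{INDEXES_URL}/{INDEX_ID}/videos",
    headers=headers,
    params={"page_limit": 10},
)
videos = res.json().get("data", [])

video_list = []
for video in videos:
    video_id = video["_id"]
    title = video.get("metadata", {}).get("filename", "untitled")
    video_list.append((video_id, title))
    print(video_id, title)

# Pick the video we want to run full-text extraction on (here, simply the first one).
VIDEO_ID = video_list[0][0]
```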
Time to put our plan into action! We'll now proceed to extract all textual content from the chosen video:
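Here's a rough sketch of that call. The text-in-video endpoint path and the value/start/end response fields are my assumptions from the API reference:

```python
# Retrieve every piece of on-screen text recognized in the chosen video.
ocr_res = requests.get(
    f"{INDEXES_URL}/{INDEX_ID}/videos/{VIDEO_ID}/text-in-video",
    headers=headers,
)
ocr_data = ocr_res.json().get("data", [])

# Each item is expected to carry the recognized text plus start/end timestamps in seconds.
for item in ocr_data:
    print(f"{item['start']:>7.1f}s - {item['end']:>7.1f}s  {item['value']}")
```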
As you can see, the API extracted all the on-screen text, line by line, like a charm. You can save this text as metadata for downstream workflows such as filtering, classifying, and searching content.
Launching our search query utilizing the text_in_video search option to uncover pertinent text matches within our collection of indexed videos:
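A minimal sketch of that search call is below; the query string is hypothetical, and the payload fields and response shape follow the Search API as I understand it:

```python
# Search the whole index for moments where the query text appears on screen.
search_payload = {
    "index_id": INDEX_ID,
    "query": "history of film",           # hypothetical query
    "search_options": ["text_in_video"],
}
search_res = requests.post(SEARCH_URL, headers=headers, json=search_payload)
search_data = search_res.json().get("data", [])

for match in search_data:
    print(match["video_id"], match["start"], match["end"], match["confidence"])
```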
💡Bear in mind that the text-in-video search feature is set up to locate all occurrences within the indexed videos where the input query aligns (not necessarily word-for-word) with the text visually presented on screen as the video plays. For instance, if I enter "horse moving," the system will identify instances where the on-screen text reads "horse in motion." However, the confidence level of this match will be lower compared to when I input "horse in motion". The confidence level depends on the percentage of words matched with the natural language query we input. For example, a match on two out of three words will yield a higher confidence level than a match on only one word.
Preparing the data for the Flask application to ensure our results will be presented neatly:
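One reasonable way to shape the search results is sketched below; these field names are my own choice for the Flask template, not mandated by the API:

```python
# Shape the search results into a simple list of dicts for the Flask template.
query_text = search_payload["query"]
search_results = [
    {
        "video_id": match["video_id"],
        "start": match["start"],
        "end": match["end"],
        "confidence": match["confidence"],
    }
    for match in search_data
]
```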
Further data preparation for the video OCR results, followed by our standard procedure of pickling everything:
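A sketch of that step, with an illustrative pickle filename:

```python
import pickle

# Bundle everything the Flask app needs and pickle it to disk, so app.py can
# load it without re-calling the API. The filename is illustrative.
payload = {
    "video_list": video_list,
    "ocr_data": ocr_data,
    "query_text": query_text,
    "search_results": search_results,
}
with open("ocr_results.pkl", "wb") as f:
    pickle.dump(payload, f)
```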
We're now at the final leg of our video OCR adventure - bringing together all elements to animate our results. Besides the standard configuration we implement for fetching videos from the local folder and loading the pickled data dispatched from the Jupyter notebook, this time we have some additional requirements - a conversion of timestamps from a seconds-only format to a minutes-and-seconds format. This makes the data visualization on the webpage more intuitive. Here's the code for the app.py file:
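Here's a minimal sketch of what app.py could look like; the template name, pickle filename, and the variables handed to the template are assumptions, and the HTML template itself simply loops over them:

```python
# app.py - minimal demo app sketch; filenames and template variables are illustrative.
import pickle
from flask import Flask, render_template

app = Flask(__name__)


def to_min_sec(seconds):
    """Convert a seconds-only timestamp into an mm:ss string."""
    minutes, secs = divmod(int(seconds), 60)
    return f"{minutes:02d}:{secs:02d}"


# Load the data pickled from the Jupyter notebook.
with open("ocr_results.pkl", "rb") as f:
    data = pickle.load(f)


@app.route("/")
def index():
    # Re-format OCR and search timestamps before handing them to the template.
    ocr_rows = [
        {"start": to_min_sec(i["start"]), "end": to_min_sec(i["end"]), "text": i["value"]}
        for i in data["ocr_data"]
    ]
    search_rows = [
        {**r, "start": to_min_sec(r["start"]), "end": to_min_sec(r["end"])}
        for r in data["search_results"]
    ]
    return render_template(
        "index.html",
        video_list=data["video_list"],
        ocr_rows=ocr_rows,
        query=data["query_text"],
        search_rows=search_rows,
    )


if __name__ == "__main__":
    app.run(debug=True)
```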
Awesome! Let's just run the last cell of our Jupyter notebook to launch our Flask app:
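That last cell can be as simple as this (running python app.py from a terminal works just as well):

```python
# Last cell of the notebook: launch the Flask app via a Jupyter shell escape.
!python app.py
```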
You should see an output similar to the one below, confirming that everything went as anticipated 😊:
After clicking on the URL link http://127.0.0.1:5000, you should be greeted with the following web page:
Here's the Jupyter Notebook containing the complete code that we've put together throughout this tutorial - https://drive.google.com/drive/folders/1D97_UU2Z0lvp3y52BHV5GKkSNOQKv3Xi?usp=share_link
Anticipate more thrilling content on the horizon! If you haven't already, I warmly invite you to become part of our lively Discord community, teeming with individuals who share a fervor for multimodal AI.
See you next time,
Crafting stellar Developer Experiences @Twelve Labs