Extract clips from videos based on a prompt, using OpenAI's GPT-4o model.
Build a prompt-based video clip extractor using OpenAI's GPT-4o model. In this project, we extract all clips where Will Smith wears sunglasses in the 1997 Men In Black movie trailer.
Welcome to the 12th edition of the 100 GenAI MVPs newsletter, a journey all about shipping 100 happy and helpful GenAI solutions, and how to build them! 🚀👩💻✨ Together, we built a robot panda you can chat with (🏆 award winner), a member finder for online communities (🏆 competition winner), and many more!
Hellooo AI Alchemists! 🤖🧪✨
It is possible, right now, to build your own prompt-based video clip extractor using OpenAI's multimodal GPT-4o model, and it has blown my mind 🤯
SO HAPPY!
You can give it a video, and ask it to extract all the clips where something you care about happens in the video. I gave it the Men In Black movie trailer (1997 version), and asked it to extract all clips where Will Smith was putting on or wearing sunglasses 😎
Here’s a video of the tool I built to do this in action:
I built the tool you see in the video in Relevance AI, using a mix of the built-in tool steps (e.g. LLM vision) and custom code in the Python steps (e.g. clip extraction).
For the step-by-step breakdown below, though, I’ll assume you’re building this on your local machine in a standard Python dev environment.
Are you ready!??
Let’s get building!
— Becca 💖
If you’d like to see more frequent updates, connect with me on LinkedIn 🥰
Where this idea came from
About a year ago, I got the chance to peek behind-the-scenes of a friend making a movie trailer for the new Sonic the Hedgehog movie. They would manually scrub through hours of movie footage to extract clips containing main characters and action scenes.
That sparked the idea for this project. At the time, GPT-4o didn’t exist and I wasn’t aware of any multimodal models that could process videos. So the idea got set aside on the “when tech catches up” list.
The fact that you can do this now with a model that can’t even process videos or audio yet is utterly magical to me.
Overview
Here is a quick overview of what we’ll be doing to build this custom prompt-based video clip extractor:
Download a video.
Break the video down into image frames, and add timestamps to them.
Narrow down the frames to 2 per second, and convert to a base64 format which can be passed to GPT-4o.
Detect the presence of your subject with GPT-4o (return true/false and timestamp).
Merge the timestamps into a series of clip start and end times.
Extract the clips from the original video.
Step 1: Download a video
The first thing you need to do is download an mp4 file of the video you want to extract clips from. I chose the Men In Black (1997 version) movie trailer for this project.
from yt_dlp import YoutubeDL

def download_video_as_mp4(youtube_video_url, output_file):
    ydl_opts = {
        'quiet': False,
        'format': 'mp4',
        'outtmpl': output_file
    }
    with YoutubeDL(ydl_opts) as ydl:
        ydl.download([youtube_video_url])

youtube_video_url = "https://www.youtube.com/watch?v=UxUTTrU6PA4"
download_video_as_mp4(youtube_video_url, "video_name.mp4")
Unfortunately, YouTube has made it harder for third-party tools to download videos, so you might run into a “please sign in and confirm you’re not a bot” error when running this.
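If you hit that error, one possible workaround (an assumption on my part, not something the original build needed) is to export your browser’s YouTube cookies to a cookies.txt file and point yt-dlp at it via its cookiefile option:

from yt_dlp import YoutubeDL

# Hypothetical workaround: pass a cookies.txt file exported from a logged-in
# browser session so YouTube treats the download as coming from a signed-in user.
ydl_opts = {
    'quiet': False,
    'format': 'mp4',
    'outtmpl': 'video_name.mp4',
    'cookiefile': 'cookies.txt'  # path to your exported cookies file
}

with YoutubeDL(ydl_opts) as ydl:
    ydl.download(["https://www.youtube.com/watch?v=UxUTTrU6PA4"])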
Step 2: Turn video into timestamped image frames
Next, we need to break the video down into its individual image frames, and add a timestamp identifier to each of them.
Adding a timestamp or unique identifier to each of your images gets around a limitation of GPT-4o: it can’t tell you which image it got a piece of information from, because it has no concept of index or position tracking. With the ID (in this case, a timestamp) burned into the image itself, the model can simply read it back.
import cv2
import os
from datetime import timedelta, datetime

def create_black_bar(frame, bar_height, timestamp):
    height, width, _ = frame.shape
    frame_with_bar = cv2.copyMakeBorder(frame, bar_height, 0, 0, 0, cv2.BORDER_CONSTANT, value=(0, 0, 0))
    font = cv2.FONT_HERSHEY_SIMPLEX
    font_scale = 1
    font_thickness = 2
    timestamp_text = f"timestamp: {timestamp}"
    text_size = cv2.getTextSize(timestamp_text, font, font_scale, font_thickness)[0]
    text_x = (frame_with_bar.shape[1] - text_size[0]) // 2
    text_y = bar_height - 10
    cv2.putText(frame_with_bar, timestamp_text, (text_x, text_y), font, font_scale, (255, 255, 255), font_thickness)
    return frame_with_bar

def calculate_timestamp(frame_index, fps):
    seconds = frame_index / fps
    time_obj = timedelta(seconds=seconds)
    return (datetime.min + time_obj).strftime('%H:%M:%S')

def extract_and_stamp_frames(video_file_path, output_folder):
    cap = cv2.VideoCapture(video_file_path)
    if not cap.isOpened():
        print(f"Error: Unable to open video file {video_file_path}")
        return
    if not os.path.exists(output_folder):
        os.makedirs(output_folder)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_index = 0
    while True:
        ret, frame = cap.read()
        if not ret:
            break
        timestamp = calculate_timestamp(frame_index, fps)
        frame_with_bar = create_black_bar(frame, 50, timestamp)
        output_path = os.path.join(output_folder, f"frame_{frame_index:04d}.jpg")
        cv2.imwrite(output_path, frame_with_bar)
        frame_index += 1
    cap.release()
    print(f"Timestamps added to frames and saved to '{output_folder}'")

video_path = "mib_trailer.mp4"
output_folder = "timestamped_frames"
extract_and_stamp_frames(video_path, output_folder)
Step 3: Reduce fps and convert to base64
Most videos are made up of 24-30 frames per second (fps), which is way more than we need for GPT-4o to be able to complete this task.
The Men In Black movie trailer was made up of 3472 frames (24fps). After reducing to 2fps, there were 289 frames.
def reduce_frames_to_fps(frames, target_fps, video_fps):
    frame_interval = int(video_fps / target_fps)
    reduced_frames = [(frame, timestamp) for i, (frame, timestamp) in enumerate(frames) if i % frame_interval == 0]
    return reduced_frames
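One thing to note: extract_and_stamp_frames writes the stamped frames to disk, while reduce_frames_to_fps (and the base64 step below) expects an in-memory list of (frame, timestamp) tuples. Here’s a minimal sketch of one way to bridge the two. load_timestamped_frames is my own glue helper (not part of the original tool), and it assumes the trailer’s 24fps and the frame_XXXX.jpg filenames from step 2:

import os
import cv2

# Hypothetical helper: reloads the stamped frames from disk and recovers each
# frame's timestamp from its index, reusing calculate_timestamp() from step 2.
def load_timestamped_frames(frames_folder, fps):
    frames = []
    for filename in sorted(os.listdir(frames_folder)):
        if not filename.endswith(".jpg"):
            continue
        frame_index = int(filename.split("_")[1].split(".")[0])
        frame = cv2.imread(os.path.join(frames_folder, filename))
        frames.append((frame, calculate_timestamp(frame_index, fps)))
    return frames

frames = load_timestamped_frames("timestamped_frames", fps=24)
reduced_frames = reduce_frames_to_fps(frames, target_fps=2, video_fps=24)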
Next, we convert the images to base64 data URLs, which is the format we’ll use to send an array of images to the GPT-4o model.
import base64
from io import BytesIO

import cv2
from PIL import Image

def process_frames_as_base64(frames):
    base64_array = []
    for frame, timestamp in frames:
        # OpenCV frames are BGR, so convert to RGB before handing them to PIL
        img = Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        with BytesIO() as img_byte_arr:
            img.save(img_byte_arr, format='JPEG')
            img_byte_arr.seek(0)
            raw_bytes = img_byte_arr.read()
        base64_str = base64.b64encode(raw_bytes).decode('utf-8')
        base64_url = f"data:image/jpeg;base64,{base64_str}"
        base64_array.append((base64_url, timestamp))
    return base64_array
Each image will look like this in our GPT-4o API call:
{
    "type": "image_url",
    "image_url": {
        "url": "data:image/jpeg;base64,<VERY_LONG_IMAGE_BASE64_ENCODED_TEXT>",
        "detail": "low"
    }
}
Step 4: Process each frame with GPT-4o
Now that we’ve broken the video down into images that can be sent over an API call, we can use GPT-4o to process them.
The ultimate goal of this project is to extract clips from a video wherever a subject is present, e.g. a main character appears, a fight scene breaks out, or a specific gadget is being used.
In this case, I want to extract all clips in the Men In Black movie trailer where Agent J (Will Smith) is putting on or wearing sunglasses (which means he’s about to wipe someone’s memory with the Neuralyzer).
To do this, we need GPT-4o to look at each individual image frame and tell us whether Will Smith is putting on or wearing sunglasses, plus the timestamp so we can identify the frame later.
Here is the API call to GPT-4o for processing a single image frame:
from openai import OpenAI

client = OpenAI()  # assumes your OPENAI_API_KEY is set in the environment

subject = "Will Smith putting on or wearing sunglasses"
prompt = "Does the image contain: '" + subject + "'. Return an object with two keys: {'present': [true or false], 'timestamp': [timestamp pulled directly from the image, in hh:mm:ss format]}. The first character in your response should be '{'"

# base64_frame is one of the data URLs produced by process_frames_as_base64 in step 3
base64Frames = [base64_frame]

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": prompt},
        {"role": "user", "content": [
            {"type": "text", "text": "These are the frames from the video."},
            *map(lambda x: {"type": "image_url",
                            "image_url": {"url": x, "detail": "low"}}, base64Frames)
        ]}
    ],
    temperature=0,
)

print(response.choices[0].message.content)
The same code will work for multiple image frames; all you need to do is add more than one base64-encoded image to the base64Frames array just below the prompt.
In fact, it’s extremely inefficient to process one image frame at a time when the model can handle many of them. You’d need to experiment to see how many frames you can pass in per call while guaranteeing that none of them get skipped.
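Here’s a minimal sketch of how that batching might look. The batch size of 20 is an assumption you’d want to tune, collect_detections is a hypothetical wrapper of my own (not part of the original tool), and it reuses the client and subject from the snippet above with the prompt adjusted to ask for a JSON array:

import json

BATCH_SIZE = 20  # assumption: tune this until no frames get skipped in a single call

batch_prompt = (
    "For each image, does it contain: '" + subject + "'? "
    "Return a JSON array with one object per image, each with two keys: "
    "{'present': [true or false], 'timestamp': [timestamp pulled directly from the image, in hh:mm:ss format]}. "
    "The first character in your response should be '['"
)

def collect_detections(base64_frames_with_timestamps):
    # Hypothetical wrapper: sends the (data_url, timestamp) tuples from step 3 to
    # GPT-4o in batches and gathers one {'present': ..., 'timestamp': ...} object per frame.
    detections = []
    for start in range(0, len(base64_frames_with_timestamps), BATCH_SIZE):
        batch = base64_frames_with_timestamps[start:start + BATCH_SIZE]
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[
                {"role": "system", "content": batch_prompt},
                {"role": "user", "content": [
                    {"type": "text", "text": "These are the frames from the video."},
                    *[{"type": "image_url", "image_url": {"url": url, "detail": "low"}}
                      for url, _timestamp in batch],
                ]},
            ],
            temperature=0,
        )
        # Assumes the model returns clean JSON; in practice you may need to strip
        # markdown fences or retry on parse errors.
        detections.extend(json.loads(response.choices[0].message.content))
    return detections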
Step 5: Merge timestamps
We now have an array of objects that tells us the timestamp of each image frame and whether our subject (Will Smith putting on or wearing sunglasses) was present in that frame.
[
{ "timestamp": "00:00:55", "present": true },
{ "timestamp": "00:00:55", "present": true },
{ "timestamp": "00:00:56", "present": true },
{ "timestamp": "00:00:56", "present": true },
{ "timestamp": "00:00:57", "present": false },
{ "timestamp": "00:00:57", "present": false },
{ "timestamp": "00:00:58", "present": false },
{ "timestamp": "00:00:58", "present": false },
{ "timestamp": "00:00:58", "present": true },
{ "timestamp": "00:00:59", "present": true },
{ "timestamp": "00:00:59", "present": true }
]
Next, we need to merge the timestamps so that we end up with the start and end time of each clip where the thing we care about has happened.
[
    {
        "start_time": "00:00:55",
        "end_time": "00:00:56"
    },
    {
        "start_time": "00:00:58",
        "end_time": "00:01:02"
    }
]
Here is the code for doing that:
from datetime import datetime, timedelta

def filter_present_data(data):
    # Keep only the frames where the subject was detected
    return [item for item in data if item['present']]

def parse_timestamp(timestamp):
    return datetime.strptime(timestamp, "%H:%M:%S")

def combine_adjacent_time_ranges(data):
    # Merge timestamps that are at most 1 second apart into a single clip range
    combined = []
    start_time = None
    end_time = None
    for item in data:
        current_time = parse_timestamp(item['timestamp'])
        if start_time is None:
            start_time = current_time
            end_time = current_time
        elif current_time - end_time <= timedelta(seconds=1):
            end_time = current_time
        else:
            combined.append(create_time_range(start_time, end_time))
            start_time = current_time
            end_time = current_time
    if start_time and end_time:
        combined.append(create_time_range(start_time, end_time))
    return combined

def create_time_range(start_time, end_time):
    return {
        'start_time': start_time.strftime("%H:%M:%S"),
        'end_time': end_time.strftime("%H:%M:%S")
    }

def process_time_ranges(detections):
    filtered_data = filter_present_data(detections)
    sorted_data = sorted(filtered_data, key=lambda item: parse_timestamp(item['timestamp']))
    return combine_adjacent_time_ranges(sorted_data)

# frame_detections is the list of {'present', 'timestamp'} objects collected in step 4
combined_time_ranges = process_time_ranges(frame_detections)
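To make the merging behaviour concrete, here’s what process_time_ranges returns for a trimmed-down version of the detections sample above (frame_detections here is just illustrative data):

frame_detections = [
    {"timestamp": "00:00:55", "present": True},
    {"timestamp": "00:00:56", "present": True},
    {"timestamp": "00:00:58", "present": True},
    {"timestamp": "00:00:59", "present": True},
]

print(process_time_ranges(frame_detections))
# [{'start_time': '00:00:55', 'end_time': '00:00:56'},
#  {'start_time': '00:00:58', 'end_time': '00:00:59'}]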
Step 6: Extract clips
Now that we have an array of clip start and end times, we can use them to extract clips from the original video.
from moviepy.video.io.VideoFileClip import VideoFileClip

clips = [
    {"start_time": "00:00:55", "end_time": "00:00:56"},
    {"start_time": "00:00:58", "end_time": "00:01:02"},
]

# Open the source video once, then cut and save each clip
video = VideoFileClip("video.mp4")
for i, clip_info in enumerate(clips):
    clip = video.subclip(clip_info["start_time"], clip_info["end_time"])
    clip.write_videofile(f"clip_{i+1}.mp4", codec="libx264")
video.close()
Here are the clips!
This being possible right now blows my mind. Even the best-in-class image classification models, which are designed to detect the presence of predefined object classes, couldn’t do this (understandably).
It unlocks some pretty awesome use-case ideas, e.g. searching security footage for incidents, collecting clips of main characters to help with movie trailer production, or helping film students collect wide-angle shots, city scenes, etc.
Until next time,
Stay sparkly 💖
If you’d like to see more frequent updates, connect with me on LinkedIn 🥰
Moar GenAI Projects! 🤖🧪✨
🎁 Here are some more projects to check out if you loved this one!
#8. ChatGPT-Powered Robot Panda
Helloooo AI Alchemists! ✨🤖🧪 Late last year, my friend Stan and I won the People’s Choice Award at Fishburner’s Young Entrepreneur Pitch night, where I demo’d chatting with the panda on stage in front of 200+ founders and entrepreneurs 🐼🎤 The CEO of Fishburners said:
#7. Create a lip-syncing character that responds to your messages by talking!
Helloooo AI Alchemists!!! 🤖🧪 Ever since sharing this project live on LinkedIn, I have gotten a TON of requests to make lip-syncing chatty avatars for a wide range of use-cases. This is incredibly popular, and there is a crazy amount of potential for extending and monetising this.
#2. Analyse a journal entry for unhelpful thinking patterns with ChatGPT's function calling 🤯
Helloooo AI Alchemists! ✨🤖🧪 Last weekend, I built a program that I’ve personally wished existed for years. It’s solved a real pain point for me, and I’m so grateful it was even possible. Not only was it possible, but it was easy. Crazy easy. This is it: