Use GPT-4o to produce a multimodal summary of a painting tutorial.

GPT-4o accepts an array of hundreds of images (enough to make up a whole video) as well as other input like an audio transcript. See the results of audio-only vs visual-only vs audio + visual video summaries.

Jun 10, 2024

Welcome to the 11th edition of the 100 GenAI MVPs newsletter, a journey all about shipping 100 happy and helpful GenAI solutions, and how to build them! 🚀👩‍💻✨ Together, we built a robot panda you can chat with (🏆 award winner), a member finder for online communities (🏆 competition winner), and many more!

Hellooo AI Alchemists! 🤖🧪✨

The last month or so has been craaazy. I started a new job as an AI Engineer at Relevance AI, and I could not be happier.

In addition to this 100 GenAI MVPs series, I’ve started a 100 AI Agents series. The first two are:

Finlay the FAQ writer: Produces SEO-optimised Frequently Asked Questions based on Google’s People Also Ask section on the search results page for a given query.
Teddy the tool template webpage producer: Every day at 9am, Teddy checks for new tools that have been made publicly available on our platform. For each new tool, Teddy researches and writes tool explainer content, and uploads it to a Webflow collection ready for us to click publish on.

Besides that, I’m also working on turning one of the first 9 MVPs so far into an actual product! I have added analytics to each of them and discovered that the Thought Checker has gotten the most traction so far. That surprises me as it was the first AI project I built, when I knew the least about AI.

Here is a progress shot. I’m building it in React Native, and will be doing custom illustrations to make it as fun and beautiful to use as possible.

This newsletter will be about playing with GPT-4o to compare single-modality vs multimodal summaries for a watercolour painting video. I plan to create a front-end for you to try it out, but will likely wait until the Thought Checker app is shipped.

I’m VERY excited about GPT-4o, and the possibilities it suggests for the future (true multimodal LLMs = oh my golly).

If you’d like to see more frequent updates, connect with me on LinkedIn 🥰

Are you ready!??

Let’s get building!

— Becca 💖

I recently spent the day playing with the shiny new GPT-4o model. What sets this model apart from all previous OpenAI models is that it is truly multimodal.

You can give it multiple types of input, and get back multiple types of output:

You can point your camera at random objects and use your voice to ask ChatGPT to tell you the names of those objects in different languages.
While talking to the voice chat version of ChatGPT, you can send it a text prompt to ask it to speak faster or change its speaking voice and style entirely (tho just asking would be easier). It can even sing now.
If you’re a blind person who wants to hail a taxi, you can hold up the camera and point it towards the road, and it will tell you when a taxi is coming so that you can signal it to stop.

These are just some of the incredible use-cases demo’d by OpenAI in their GPT-4o launch post last week.

GPT-4o: API version

What about the API version of GPT-4o? Well, many people have been disappointed because it can’t accept video or audio input yet.

Except it kind of can, right now.

While it can’t accept videos in MP4 format (yet), it can accept an array containing hundreds of image frames, enough to make up the whole video. Given that a video is made up of still image frames, that means it already CAN accept video input.

However, you can’t give it audio. You can give it a transcript but that flat out isn’t the same because it’s missing a ton of sound information, like tone and pitch and speech patterns. For now though it’s good enough to give us a hint of what’s going to be possible.

The absolute best thing about this model to me right now is that you can give it multiple kinds of inputs.

To show you what I mean, I’m going to produce four different summaries of a watercolour painting video 🎨

🎧 Audio-only summary based on the video transcript.
👀 Visual-only summary based on the image frames that make up the entire video.
🎧 + 👀 Multimodal summary based on both the audio transcript and the video frames.
🎧 + 👀 + 📝 Multimodal summary based on both the audio transcript and the video frames, with a much better prompt.

This experiment is adapted from the introductory GPT-4o cookbook example produced by OpenAI. They do the same thing but for an OpenAI DevDay video clip.

I immediately wanted to try it on a painting tutorial video, because when you try to recreate something like that as an artist, you really do depend on both the visual demonstration and the spoken technique explanations to really “get” it.

I can’t wait for you to see the results.

It blew my mind, again 🤯

Use GPT-4o to summarise a painting video

Producing multimodal summaries right now is slow. You need to choose a short video to try this with. I originally tried a 20 min urban sketching watercolour video but I saw no output after 40 mins of running it.

Switching to a 60s video produced output in about 5-10 minutes.

However, on purely text-based inputs it is way, way faster than GPT-4. It produced a summary of the audio transcript almost instantly. It was done pretty much the moment I clicked run.

1. Get the raw video file and audio transcript

To do this experiment with me, you need to download a short video, as well as the audio transcript for it. I used this online YouTube video downloader to get the raw video file.

If, like me, you go for a YouTube Short video, you can get the transcript by changing the URL from this:

https://www.youtube.com/shorts/GVetUPw64bs

To this:

https://www.youtube.com/watch?v=GVetUPw64bs

Then you can get the transcript as you would for a normal YouTube video by expanding the video description and clicking on “show transcript”.
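If you want to do the URL swap in code rather than by hand, here is a minimal sketch. The function name `shorts_to_watch_url` is my own (hypothetical), and it assumes the Shorts URL has the `/shorts/<video id>` shape shown above.

```python
from urllib.parse import urlparse


def shorts_to_watch_url(url: str) -> str:
    """Turn a YouTube Shorts URL into the regular watch URL for the same video."""
    parsed = urlparse(url)
    # Path looks like "/shorts/<video id>"; drop empty segments from the split.
    parts = [p for p in parsed.path.split("/") if p]
    if len(parts) != 2 or parts[0] != "shorts":
        raise ValueError(f"Not a YouTube Shorts URL: {url}")
    video_id = parts[1]
    return f"{parsed.scheme}://{parsed.netloc}/watch?v={video_id}"


print(shorts_to_watch_url("https://www.youtube.com/shorts/GVetUPw64bs"))
# prints https://www.youtube.com/watch?v=GVetUPw64bs
```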

This is the 1-minute video I created summaries for:

2. Convert a video into base64 frames

GPT-4o accepts an array of images (video frames), rather than a video file like mp4 or mov. So the first step is to extract the frames from the video you want.

I extracted every 5th frame from the watercolour video. The video runs at 24 frames per second, so every 5th frame (roughly 5 frames per second) was more than enough to keep the movement information.
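To get a feel for how the "every Nth frame" choice affects the number of images you end up sending, here is a small sketch. The helper names `sampled_count` and `stride_for_budget` are hypothetical, and the frame totals are illustrative (an exactly 60-second clip and an exactly 20-minute video at 24 fps), not measurements from the actual videos.

```python
import math


def sampled_count(total_frames: int, every: int) -> int:
    """Number of frames kept when taking indices 0, every, 2 * every, ..."""
    return math.ceil(total_frames / every)


def stride_for_budget(total_frames: int, max_frames: int) -> int:
    """Smallest `every` value that keeps the sampled frame count <= max_frames."""
    if total_frames < 1 or max_frames < 1:
        raise ValueError("total_frames and max_frames must be positive")
    return max(1, math.ceil(total_frames / max_frames))


fps = 24
short_clip = fps * 60        # illustrative: exactly 60 seconds
long_video = fps * 60 * 20   # illustrative: exactly 20 minutes

print(sampled_count(short_clip, every=5))              # frames kept for the short clip
print(stride_for_budget(long_video, max_frames=300))   # stride needed to cap a long video
```

The second call hints at one way to approach longer videos: pick a frame budget first, then derive the stride from it, at the cost of losing fine movement detail.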

To convert a video into frames, I copied the code from this Medium post which teaches you how to extract frames from a video much faster than other solutions:

Extracting frames FAST from a video using OpenCV and Python - by Hayden Faulkner.

I pasted Hayden’s frame extraction code into ChatGPT and asked it to save the frames as an array of base64 images instead of to a directory called frames. This was the code it came up with (which worked fine and ran very fast).

The only thing you need to install to run the frame extraction code is OpenCV (imported as cv2, installed as opencv-python). I also installed the openai package for the generation steps later on.

pip install openai opencv-python

# video_to_frames.py

from concurrent.futures import ProcessPoolExecutor, as_completed
import cv2
import multiprocessing
import os
import sys
import base64
import numpy as np


def print_progress(iteration, total, prefix='', suffix='', decimals=3, bar_length=100):
    format_str = "{0:." + str(decimals) + "f}"
    percents = format_str.format(100 * (iteration / float(total)))
    filled_length = int(round(bar_length * iteration / float(total)))
    bar = '#' * filled_length + '-' * (bar_length - filled_length)
    sys.stdout.write('\r%s |%s| %s%s %s' % (prefix, bar, percents, '%', suffix))
    sys.stdout.flush()


def get_video_fps(video_path):
    capture = cv2.VideoCapture(video_path)
    fps = capture.get(cv2.CAP_PROP_FPS)
    capture.release()
    return fps


def extract_frames(video_path, start=-1, end=-1, every=1):
    video_path = os.path.normpath(video_path)
    assert os.path.exists(video_path)

    capture = cv2.VideoCapture(video_path)

    if start < 0:
        start = 0
    if end < 0:
        end = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))

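    # Seek to the first frame of this chunk (CAP_PROP_POS_FRAMES has the value 1)
    capture.set(cv2.CAP_PROP_POS_FRAMES, start)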
    frame = start
    while_safety = 0
    saved_frames = []

    while frame < end:
        _, image = capture.read()

        if while_safety > 500:
            break

        if image is None:
            while_safety += 1
            continue

        if frame % every == 0:
            while_safety = 0
            _, buffer = cv2.imencode('.jpg', image)
            jpg_as_text = base64.b64encode(buffer).decode('utf-8')
            saved_frames.append(jpg_as_text)

        frame += 1

    capture.release()
    return saved_frames


def video_to_frames(video_path, every=1, chunk_size=1000):
    video_path = os.path.normpath(video_path)
    video_dir, video_filename = os.path.split(video_path)

    capture = cv2.VideoCapture(video_path)
    total = int(capture.get(cv2.CAP_PROP_FRAME_COUNT))
    capture.release()

    if total < 1:
        print("Video has no frames. Check your OpenCV + ffmpeg installation")
        return None

    frame_chunks = [[i, i + chunk_size] for i in range(0, total, chunk_size)]
    frame_chunks[-1][-1] = min(frame_chunks[-1][-1], total - 1)

    prefix_str = "Extracting frames from {}".format(video_filename)

    base64_frames = []

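    with ProcessPoolExecutor(max_workers=multiprocessing.cpu_count()) as executor:
        futures = [executor.submit(extract_frames, video_path, f[0], f[1], every) for f in frame_chunks]

        # Collect results in submission order (not completion order) so the
        # frames stay in chronological order across chunks.
        for i, f in enumerate(futures):
            base64_frames.extend(f.result())
            # Report i + 1 of len(frame_chunks); this also avoids dividing by
            # zero when the whole video fits in a single chunk.
            print_progress(i + 1, len(frame_chunks), prefix=prefix_str, suffix='Complete')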

    return base64_frames


if __name__ == '__main__':
    video_path = 'video.mp4'
    fps = get_video_fps(video_path)
    print(f"The video {video_path} has {fps} frames per second.")

    base64_frames = video_to_frames(video_path, every=5, chunk_size=1000)
    print("Total frames extracted and converted to base64:", len(base64_frames))
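Before sending hundreds of frames to the API, it can be worth spot-checking that a frame decodes back into a real JPEG. This is a small sketch of one way to do that. The helpers `decode_frame` and `save_frame` are hypothetical names of mine, and the check only looks at the JPEG header bytes, so it is a sanity check rather than a full validation.

```python
import base64

# Every JPEG file starts with these three bytes.
JPEG_MAGIC = b"\xff\xd8\xff"


def decode_frame(b64_frame: str) -> bytes:
    """Decode one base64 frame and confirm it starts with a JPEG header."""
    raw = base64.b64decode(b64_frame, validate=True)
    if not raw.startswith(JPEG_MAGIC):
        raise ValueError("Decoded data does not start with a JPEG header")
    return raw


def save_frame(b64_frame: str, path: str) -> None:
    """Write one frame to disk so you can open it and eyeball it."""
    with open(path, "wb") as fh:
        fh.write(decode_frame(b64_frame))


# Example usage, assuming `base64_frames` is the list returned by video_to_frames:
# save_frame(base64_frames[0], "first_frame.jpg")
# save_frame(base64_frames[-1], "last_frame.jpg")
```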

Experiment: GPT-4o audio, visual or audio & visual summaries.

Once the frame extraction puzzle piece was solved, I followed the process in OpenAI’s introductory GPT-4o cookbook to experiment with single-modality vs multimodal summaries.

To really show the difference in quality, I decided to write a bare minimum prompt with zero prompt-engineering strategies for all summary attempts. The combined audio and visual input produced a much higher-quality summary even with such a basic prompt.

For this experiment, I chose a 60s watercolour YouTube Short video (longer videos took much longer to process, so they were harder to test). The video runs at 24 frames per second, and I extracted every 5th frame, which was enough to capture all of the movement without losing anything.

By the end of processing, I ended up with an array of 276 video frame images, representing the whole painting tutorial. These are what I passed to the GPT-4o model.

Audio only summary with GPT-4o

Once I extracted the frames from the watercolour painting video, I used GPT-4o to generate a summary for the video using just the audio transcript.

This is the code for creating the audio-transcript summary with a very basic prompt:

from openai import OpenAI


def generate_audio_summary(transcript):
    client = OpenAI(api_key="<YOUR_API_KEY>")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": """You are generating a transcript summary. Create a summary of the provided transcription. Respond in Markdown."""},
            {"role": "user", "content": [
                {"type": "text", "text": f"The audio transcription is: {transcript}"}
            ],
             }
        ],
        temperature=0,
    )
    return response.choices[0].message.content

This was the output.

In this audio transcription, the speaker provides a step-by-step guide on creating a simple ink and watercolor painting. They start by sketching, which serves as the foundation and roadmap for the artwork. Using a fountain pen with waterproof ink, they emphasize letting personality and character show through the lines. Next, they swatch out colors to prepare a color palette and do quick mini studies. The wet-on-wet technique is then used by covering the entire paper with water and applying the base layer of colours, which is allowed to dry. Subsequent layers add shadows and depth, working from light to dark. Finally, the speaker adds details and highlights to complete the painting. They encourage creativity and wish the listener an awesome day.

This audio summary captures some techniques, and the general process for creating any kind of watercolour painting. However, we have no idea what is being painted.

Visual-only summary with GPT-4o

Next, I created a visual-only summary of the watercolour video by passing in all 276 frames:

def generate_video_summary(base64_frames):
    client = OpenAI(api_key="<YOUR_API_KEY>")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": "You are generating a video summary. Please provide a summary of the video. Respond in Markdown."},
            {"role": "user", "content": [
                {"type": "text", "text": "These are the frames from the video."},
                *map(lambda x: {"type": "image_url",
                                "image_url": {"url": f'data:image/jpeg;base64,{x}', "detail": "low"}}, base64_frames)
            ]}
        ],
        temperature=0,
    )
    return response.choices[0].message.content

This was the summary produced:

The video showcases the process of creating a detailed watercolor and ink painting of a charming street scene. Here is a summary of the steps depicted in the frames:

1. Initial Sketching: The artist begins by sketching the outlines of the buildings and street elements with a pencil on watercolor paper.
2. Inking: The artist uses a fountain pen to go over the pencil sketch, adding details and defining the structures and foliage.
3. Watercolor Application: The artist applies watercolor paints, starting with lighter washes and gradually adding more vibrant and darker colors to build depth and contrast.
4. Detailing: The artist adds finer details with a smaller brush and pen, enhancing the textures and features of the buildings, windows, and greenery.
5. Final Touches: The artist uses colored pens to add highlights and final touches, bringing the painting to life with intricate details and vibrant colors.
6. Presentation: The completed artwork is shown, capturing the picturesque street scene with a blend of ink and watercolor techniques.
The video concludes with the artist displaying the finished painting and the tools used, emphasizing the combination of ink and watercolor in the creative process.

While the prompt was just as basic as the one used to create a summary of the transcript, this visual-only summary is far richer.

We know that the artist was painting a “charming street scene”. We can see how the painting progressed through the video, from initial sketching to inking to watercolour application to detailing. We also know what kinds of tools the artist used to produce their work, including a pencil, a fountain pen and a smaller detail brush. We know that the artist presented the finished piece at the end.

However, unlike the audio-only summary, we don’t know the names of any of the watercolour painting techniques used throughout the painting process.
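The visual-only call above sends every frame with "detail": "low", which keeps the cost per image fixed. As a rough back-of-the-envelope sketch, you can estimate the input tokens the frames use. The 85 tokens per low-detail image figure is an assumption taken from OpenAI's vision guide at the time of writing, so check the current docs before relying on it. The function name is hypothetical.

```python
# Assumption: fixed token cost of one "detail": "low" image, per OpenAI's
# vision guide at the time of writing. Verify against the current docs.
LOW_DETAIL_TOKENS_PER_IMAGE = 85


def estimate_image_tokens(num_frames: int,
                          tokens_per_image: int = LOW_DETAIL_TOKENS_PER_IMAGE) -> int:
    """Rough input-token estimate for the frames alone (text not included)."""
    return num_frames * tokens_per_image


print(estimate_image_tokens(276))  # 276 frames * 85 tokens = 23460
```

Under that assumption the 276 frames come to roughly 23k input tokens, which gives a feel for why a 20-minute video is a much heavier request than a 60-second one.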

Multimodal (audio and visual) summary with GPT-4o.

Next, I created a multimodal audio AND visual summary of the watercolour video:

def generate_audio_and_visual_summary(base64_frames, transcript):
    client = OpenAI(api_key="<YOUR_API_KEY>")

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system",
             "content": """You are generating a video summary. Create a summary of the provided video and its transcript. Respond in Markdown"""},
            {"role": "user", "content": [
                {"type": "text", "text": "These are the frames from the video."},
                *map(lambda x: {"type": "image_url",
                                "image_url": {"url": f'data:image/jpeg;base64,{x}', "detail": "low"}}, base64_frames),
                {"type": "text", "text": f"The audio transcription is: {transcript}"}
            ],
             }
        ],
        temperature=0,
    )
    return response.choices[0].message.content

This was the summary produced:

The video demonstrates a step-by-step process of creating a watercolor painting with ink outlines. Here is a summary of the process:
1. Introduction: The artist invites viewers to grab their sketchbook and watercolors to try a simple ink and watercolor technique.
2. Sketching: The artist begins with a pencil sketch, which serves as the foundation and roadmap for the painting. The focus is on the placement of elements in the composition.
3. Inking: Using a fountain pen with waterproof ink, the artist outlines the sketch. This step allows the artist's personality and character to show through the lines.
4. Color Swatching: The artist swatches out colors to prepare a color palette and does quick mini studies as a prep for the final painting.
5. Wet-on-Wet Technique: The artist wets the entire paper with a big brush, applying the wet-on-wet technique. This involves covering the paper with water before adding the first base layer of colors.
6. Layering Colors: The artist lays in all the base colors and allows them to dry. Watercolor painting involves working from light to dark, so subsequent layers add shadows and depth.
7. Adding Details: The artist goes back to add little details and highlights, enhancing the painting's depth and texture.
8. Final Touches: The finished painting is revealed, showcasing the vibrant and detailed result of the ink and watercolor technique.
9. Conclusion: The artist encourages viewers to stay creative and have an awesome day.
The video is a comprehensive guide to combining ink and watercolor, emphasizing the importance of layering and detail work in creating a dynamic and expressive painting.

In this multimodal summary, we know the stages of the painting, what the artist says to viewers and what techniques are being used, just like in the audio-only summary.

We also know what tools they are using, and what the painting looks like at different stages, like the page being brushed with water before colour is added on top of that with the wet-on-wet technique.

This multimodal summary combines the best of both the audio-only and visual-only summaries, even when using the simplest of prompts to generate the summary.

However, we no longer know what the artist is painting. This is likely because the model has to decide what to include from both the transcript and the frames. In this case, it has excluded something that was my favourite part of the visual summary.

The summary results can be greatly improved with a little prompt-engineering, which we’ll try next.

Multimodal (audio and visual) summary with GPT-4o and prompt engineering

In all of the previous single-modality and multimodal summary attempts, I used a very basic prompt. The results showed that the summary produced via a multimodal (audio and visual) approach was richer than the single-modality approaches.

However, the multimodal summary is limited in quality because of the simplicity of the prompt. So I wanted to see how good I could get the summary just by improving the prompt.

Old prompt:

You are generating a video summary. Create a summary of the provided video and its transcript. Respond in Markdown

New prompt:

You are an expert watercolour artist who teaches beginner artists how to recreate a painting end-to-end.

You produce an extremely high-quality tutorial that is step-by-step in nature. You describe exactly what is happening at each stage, including materials and tools used, what is being produced, how it is being produced and what techniques are involved.

You will receive an audio transcript, as well as a series of frames which make up a watercolour painting tutorial. You will capture the artists personality, as well as the best of both the visual (frames) and audio (transcript) information.

The guide should be a minimum of 500 words, with full paragraphs, tool and technique breakdowns and dive deep into the visuals.

Here were the results:

Step-by-Step Watercolor and Ink Tutorial
Welcome to this fun and engaging tutorial where we will explore a simple yet effective ink and watercolor technique. Grab your sketchbook, watercolors, and let's get started!
Materials Needed:
Sketchbook with watercolor paper
Waterproof ink fountain pen
Watercolor paints
Brushes (a large brush for washes and a smaller brush for details)
Water container
Paper towel
Step 1: Preparing Your Sketchbook
Begin by selecting a high-quality sketchbook with watercolor paper. The texture and weight of the paper are crucial for handling the water and pigment effectively. Open your sketchbook to a fresh page and ensure it lies flat.
Step 2: Initial Sketch
Using a pencil, lightly sketch the basic outlines of your scene. In this tutorial, we are focusing on a charming street scene with buildings and foliage. Keep your lines light and loose, as this sketch will serve as a guide for your ink work.
Step 3: Inking the Sketch
Now, grab your waterproof ink fountain pen. This is where you can let your personality shine. Begin by tracing over your pencil lines, adding details and textures as you go. Let the pen dance on the paper, creating lively and expressive lines. This step is all about capturing the essence and character of the scene.
Step 4: Swatching Colors
Before diving into the painting, it's helpful to swatch out your colors. On a separate piece of paper, test the colors you plan to use. This helps in visualizing your palette and ensures you have the right hues ready for your painting.
Step 5: Wet-on-Wet Technique
With a large brush, wet the entire surface of your paper. This is the wet-on-wet technique, which allows the colors to blend beautifully. Start applying your base colors while the paper is still wet. In the frames, you can see the artist laying down vibrant greens, blues, and reds, creating a lively base layer. Allow this layer to dry completely before moving on.
Step 6: Adding Depth and Shadows
Watercolor painting is all about building layers from light to dark. Once your base layer is dry, start adding more color to create shadows and depth. Use a smaller brush for more control and precision. Notice how the artist adds darker greens and blues to the foliage and shadows on the buildings, enhancing the three-dimensional effect.
Step 7: Detailing
After the shadows and depth are in place, it's time to add the finer details. Use your smaller brush to add highlights and intricate details. The artist in the frames goes back in to refine the windows, doors, and foliage, bringing the scene to life.
Step 8: Final Touches
For the final touches, you can use a white gel pen or a fine brush with white paint to add highlights and accents. This step adds a pop of brightness and makes certain elements stand out. The artist uses this technique to highlight the leaves and small details on the buildings.
Step 9: Review and Enjoy
Take a step back and review your painting. Make any final adjustments if needed. Once satisfied, let your painting dry completely.
Conclusion
And there you have it! A beautiful ink and watercolor painting using a simple yet effective technique. This method allows for a lot of creativity and personal expression. Remember, the key is to have fun and let your personality shine through your work. Stay creative and have an awesome day!
By following these steps, you can create your own charming street scene with a combination of ink and watercolor. Happy painting!

How absolutely wonderful is this!?!? 😭🎨👏

The artist in the video could use this as a blog post for their video.

Even though the summary is generated by AI, it’s the most human-sounding summary I’ve ever seen from an AI model 🤯
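The code for this final run isn't shown above, so here is a minimal sketch of how the new prompt could be swapped in. It reuses the same message layout as the earlier audio and visual call. The names `TUTORIAL_PROMPT`, `build_user_content` and `generate_tutorial` are hypothetical, the client is passed in rather than created inside the function, and `temperature=0` is an assumption carried over from the earlier calls. The prompt text is copied verbatim from the new prompt above.

```python
# Prompt text copied verbatim from the "New prompt" section.
TUTORIAL_PROMPT = """You are an expert watercolour artist who teaches beginner artists how to recreate a painting end-to-end.

You produce an extremely high-quality tutorial that is step-by-step in nature. You describe exactly what is happening at each stage, including materials and tools used, what is being produced, how it is being produced and what techniques are involved.

You will receive an audio transcript, as well as a series of frames which make up a watercolour painting tutorial. You will capture the artists personality, as well as the best of both the visual (frames) and audio (transcript) information.

The guide should be a minimum of 500 words, with full paragraphs, tool and technique breakdowns and dive deep into the visuals."""


def build_user_content(base64_frames, transcript):
    """Frames first, then the transcript, mirroring the earlier multimodal call."""
    content = [{"type": "text", "text": "These are the frames from the video."}]
    for frame in base64_frames:
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{frame}", "detail": "low"},
        })
    content.append({"type": "text", "text": f"The audio transcription is: {transcript}"})
    return content


def generate_tutorial(client, base64_frames, transcript, system_prompt=TUTORIAL_PROMPT):
    """`client` is an OpenAI client instance, e.g. OpenAI(api_key="<YOUR_API_KEY>")."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": build_user_content(base64_frames, transcript)},
        ],
        temperature=0,  # assumption: same setting as the earlier calls
    )
    return response.choices[0].message.content
```

Passing the prompt as a parameter makes it easy to compare the basic prompt and the improved one on the same frames and transcript.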

Until next time,

Stay sparkly 💖

p.s. Some of my own watercolour experiments:

Moar GenAI Projects! 🤖🧪✨

🎁 Here are some more projects to check out if you loved this one!

#8. ChatGPT-Powered Robot Panda

February 4, 2024

Helloooo AI Alchemists! ✨🤖🧪 Late last year, my friend Stan and I won the People’s Choice Award at Fishburner’s Young Entrepreneur Pitch night, where I demo’d chatting with the panda on stage in front of 200+ founders and entrepreneurs 🐼🎤 The CEO of Fishburners said:

#7. Create a lip-syncing character that responds to your messages by talking!

February 1, 2024

Helloooo AI Alchemists!!! 🤖🧪 Ever since sharing this project live on LinkedIn, I have gotten a TON of requests to make lip-syncing chatty avatars for a wide range of use-cases. This is incredibly popular, and there is a crazy amount of potential for extending and monetising this.

#2. Analyse a journal entry for unhelpful thinking patterns with ChatGPT's function calling 🤯

September 17, 2023

Helloooo AI Alchemists! ✨🤖🧪 Last weekend, I built a program that I’ve personally wished existed for years. It’s solved a real pain point for me, and I’m so grateful it was even possible. Not only was it possible, but it was easy. Crazy easy. This is it:
