CSharpForYou: Automating Podcast Transcription, Subtitles, and AI Summaries with Whisper and Ollama

One of the things I want to add to the CloudTalkShow production pipeline is automatic transcription. After each episode records, I want three things to exist without anyone having to think about it: a full transcript, an SRT subtitle file ready to upload to YouTube, and a summary that Raquel can use as a starting point for show notes. This post walks through how I built that, what went wrong, and how I fixed it.

The Setup

The script runs on the Windows 11 VM I have dedicated to OBS production. That VM has an NVIDIA GeForce RTX 4060 passed through from the Proxmox host, which matters a lot here — running Whisper on a GPU versus a CPU is not a minor difference. I also have Ollama installed on the same VM to handle the summarization step locally without sending episode content to an external API.

Step 1 — Installing the Dependencies

Everything installs cleanly from the command line. Start with Python and FFmpeg, which Whisper needs to process audio from video files:

winget install Python.Python.3.12
winget install Gyan.FFmpeg

Then install Whisper and the progress bar library:

pip install -U openai-whisper
pip install tqdm

Step 2 — The First Script

The starting point was straightforward. Point Whisper at every MP4 in the current directory, transcribe each one, and save the result as a text file. Skip anything that already has a transcript so reruns are safe.

import os
import glob
import whisper

model = whisper.load_model("base")

video_files = glob.glob("*.mp4")
if not video_files:
    print("No files found")
else:
    print(f"Found {len(video_files)} to transcribe")

for video_path in video_files:
    base_name = os.path.splitext(video_path)[0]
    transcript_path = f"{base_name}.transcript.txt"
    if os.path.exists(transcript_path):
        print(f"Skipping '{video_path}'")
        continue
    print(f"Processing: '{video_path}'")
    try:
        result = model.transcribe(video_path)
        with open(transcript_path, "w", encoding="utf-8") as f:
            f.write(result["text"])
        print(f"Success: saved to '{transcript_path}'.\n")
    except Exception as e:
        print(f"Error processing '{video_path}': {e}\n")

print("All files processed!")

Running this immediately produced a warning:

warnings.warn("FP16 is not supported on CPU; using FP32 instead")

Whisper was running on the CPU. The 4060 was sitting there doing nothing. The fix is to reinstall PyTorch with CUDA support — the default pip install does not include it:

pip uninstall torch torchvision torchaudio -y
pip cache purge
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

After that, verify the GPU is visible to PyTorch before touching the script:

python -c "import torch; print('GPU Available:', torch.cuda.is_available()); print('Device Name:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None')"

You want to see your GPU name come back, not None. Once that is confirmed, change the model load line to tell Whisper to use CUDA:

model = whisper.load_model("base", device="cuda")
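
If the same script also has to run on a machine without a working CUDA setup, hard-coding device="cuda" will fail at load time. A minimal sketch of a fallback, assuming a hypothetical helper named pick_device (not part of Whisper or PyTorch) that only relies on torch.cuda.is_available():

```python
def pick_device() -> str:
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        # Imported lazily so the check also works before PyTorch is installed
        import torch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Whisper will run on: {device}")

# The model load line would then become:
# model = whisper.load_model("base", device=device)
```

On the OBS VM this should print cuda once the CUDA build of PyTorch is in place. If it prints cpu, the reinstall step above did not take.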

Step 3 — Better Model, Better Output

With the GPU working, the base model felt like leaving performance on the table. The 4060 has 8GB of VRAM, which is enough to run Whisper's turbo model (roughly 6GB), an optimized version of large-v3 that runs significantly faster with only a small loss in accuracy. I also added timing and a progress bar with tqdm so it is obvious what is happening and how long it takes.

import os
import glob
import time
import whisper
from tqdm import tqdm

print("Loading Whisper 'turbo' model...")
model = whisper.load_model("turbo", device="cuda")

video_files = glob.glob("*.mp4")
if not video_files:
    print("No .mp4 files found in the current directory.")
else:
    print(f"Found {len(video_files)} video file(s). Starting queue...\n")

for video_path in tqdm(video_files, desc="Overall Progress", unit="video"):
    base_name = os.path.splitext(video_path)[0]
    transcript_path = f"{base_name}.transcript.txt"

    if os.path.exists(transcript_path):
        continue

    print(f"\n[Processing] {video_path}")
    start_time = time.time()

    try:
        result = model.transcribe(video_path)

        with open(transcript_path, "w", encoding="utf-8") as f:
            f.write(result["text"])

        elapsed_time = time.time() - start_time
        mins, secs = divmod(int(elapsed_time), 60)
        print(f"[Success] Saved transcript. Time taken: {mins}m {secs}s.")

    except Exception as e:
        print(f"[Error] Failed on '{video_path}': {e}")

print("\nAll tasks finished!")

Step 4 — Adding Subtitles and Summaries

Whisper's transcription result includes segment-level timing data, which makes generating an SRT subtitle file essentially free — just format the timestamps correctly. I added that along with a summarization step using Ollama and llama3.1 running locally.
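
For reference, an SRT file is a sequence of numbered blocks, each with a start --> end line in HH:MM:SS,mmm form followed by the text. The sketch below shows one way to produce that from a list of segments. It is an illustration, not the script's exact code: the sample segments are made up, segments_to_srt is a hypothetical helper, and the timestamp function rounds to whole milliseconds once up front instead of truncating, which avoids losing a millisecond to floating-point error.

```python
def format_srt_time(seconds: float) -> str:
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    total_ms = round(seconds * 1000)  # round once, then use integer math only
    hours, rest = divmod(total_ms, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    secs, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{millis:03d}"


def segments_to_srt(segments) -> str:
    """Build SRT text from dicts that have "start", "end", and "text" keys."""
    blocks = []
    for index, segment in enumerate(segments, start=1):
        start = format_srt_time(segment["start"])
        end = format_srt_time(segment["end"])
        blocks.append(f"{index}\n{start} --> {end}\n{segment['text'].strip()}\n")
    return "\n".join(blocks)  # blank line between blocks


# Made-up segments, shaped like the entries in result["segments"]
sample = [
    {"start": 0.0, "end": 2.5, "text": " Welcome to the show."},
    {"start": 2.5, "end": 3661.001, "text": " Let's get started."},
]
print(segments_to_srt(sample))
```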

First, install the Ollama Python library and pull the model:

pip install ollama
ollama pull llama3.1

Then the updated script that produces all three output files — transcript, subtitles, and summary — for each MP4:

import os
import glob
import time
import whisper
import ollama
from tqdm import tqdm

def format_srt_time(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    milliseconds = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{milliseconds:03d}"

model = whisper.load_model("turbo", device="cuda")

video_files = glob.glob("*.mp4")
if not video_files:
    print("No files found")
else:
    print(f"Found {len(video_files)} to transcribe")

for video_path in tqdm(video_files, desc="Overall Progress", unit="video"):
    base_name = os.path.splitext(video_path)[0]
    transcript_path = f"{base_name}.transcript.txt"
    subtitle_path = f"{base_name}.subtitles.srt"
    summary_path = f"{base_name}.summary.txt"

    if os.path.exists(transcript_path) and os.path.exists(subtitle_path) and os.path.exists(summary_path):
        print(f"Skipping '{video_path}'")
        continue

    print(f"\nProcessing: '{video_path}'")
    start_time = time.time()

    try:
        print(" -> Step 1/3: Transcribing...")
        result = model.transcribe(video_path)
        raw_text = result["text"].strip()

        with open(transcript_path, "w", encoding="utf-8") as f:
            f.write(raw_text)

        print(" -> Step 2/3: Creating Subtitles...")
        with open(subtitle_path, "w", encoding="utf-8") as srt_file:
            for index, segment in enumerate(result["segments"], start=1):
                start = format_srt_time(segment["start"])
                end = format_srt_time(segment["end"])
                text = segment["text"].strip()
                srt_file.write(f"{index}\n{start} --> {end}\n{text}\n\n")

        print(" -> Step 3/3: Summarizing...")
        prompt = (
            f"You are writing show notes for 'The Cloud Talk Show', a podcast hosted by "
            f"Larry Smithmier and Ralph Lecesse that covers cloud-native development, "
            f"self-hosted infrastructure, and hands-on technical topics.\n\n"
            f"Based on the following transcript, write engaging show notes in the style of a "
            f"knowledgeable tech blogger. Write in present tense as if describing the episode "
            f"to a potential listener. Do not use phrases like 'this transcript' or 'the host' "
            f"— use 'Larry', 'Ralph', or 'this episode' instead.\n\n"
            f"Structure the output as:\n"
            f"1. A 3-4 sentence episode overview in an engaging, direct tone\n"
            f"2. A bulleted list of key topics and takeaways\n"
            f"3. A one-sentence closing that tells the reader why they should watch\n\n"
            f"Transcript:\n{raw_text}"
        )

        ollama_response = ollama.generate(
            model="llama3.1",
            prompt=prompt
        )

        with open(summary_path, "w", encoding="utf-8") as f:
            f.write(ollama_response['response'])

        elapsed_time = time.time() - start_time
        mins, secs = divmod(int(elapsed_time), 60)
        print(f"Success: All files saved for {video_path}! (Time: {mins}m {secs}s)\n")

    except Exception as e:
        print(f"Error processing '{video_path}': {e}\n")

print("All files processed!")

Step 5 — The Summary Was Bad

The script worked. The transcript and subtitle files were exactly what I wanted. The summary was not.

The output read like someone who had never heard of the show and was hedging every sentence. It referred to things as "the host" instead of Larry or Ralph, and it ended with the phrase "this transcript provides a fascinating glimpse" — which tells you the model had no idea it was writing show notes for a podcast. It was just pattern-matching on "I was given a transcript, I will write about a transcript."

There were two problems.

Problem 1: The prompt gave the model no context. Without knowing what the show is, who the hosts are, or what tone is appropriate, the model defaults to generic academic summarization. The fix is to front-load the prompt with enough context that the model knows exactly what role it is playing before it reads a single word of the transcript. (The prompt in the Step 4 listing is the revised one, with that context already in place.)
Problem 2: The context window was too small. This one is less obvious. Here is what is actually happening under the hood.

Understanding Model Size, Context, and VRAM

When you run a model through Ollama, two separate things consume your GPU's VRAM:

Model weights are fixed. When you see llama3.1 listed at 4.9GB in Ollama, that is how much VRAM the model itself needs regardless of what you ask it to do. It loads once and stays there.
The KV cache is dynamic. Every token the model processes — both input and output — needs to be stored in what is called the key-value cache. The larger the context window, the more VRAM the cache consumes. This is why models that support 128K context cannot actually use all of it on consumer hardware.

Rules of thumb for an 8GB GPU like the 4060:

Model weights consume roughly what Ollama reports as the model size
Each 8K of context adds approximately 0.5-1GB of KV cache for a 7-8B model
32K context adds roughly 2-4GB on top of the model weights by the same rule, so llama3.1 (4.9GB) at 32K context needs about 7-9GB total. Tight but workable on a 4060.
128K context on an 8GB GPU is not realistic: by the same rule the KV cache alone would need roughly 8-16GB
Ollama's default context is much smaller than any of these (2,048 tokens in older releases, 4,096 in more recent ones, depending on version and configuration), which for a podcast transcript means the model is almost certainly reading a truncated version of what you gave it and filling in the gaps with hallucination
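
The KV cache figures in that list can be sanity-checked with a little arithmetic. The sketch below assumes the published Llama 3.1 8B shape (32 layers, 8 key-value heads, head dimension 128) and a 16-bit cache. kv_cache_bytes is a hypothetical helper, and the numbers Ollama actually allocates will differ with cache precision and runtime overhead, so treat the output as an upper-end estimate rather than a measurement.

```python
def kv_cache_bytes(num_ctx: int, n_layers: int = 32, n_kv_heads: int = 8,
                   head_dim: int = 128, bytes_per_value: int = 2) -> int:
    """Estimate KV cache size: keys and values for every layer, head, and token."""
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value  # 2 = K and V
    return per_token * num_ctx


for ctx in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(ctx) / 2**30
    print(f"num_ctx={ctx:>7}: ~{gib:.1f} GiB of KV cache at 16-bit precision")
```

Passing bytes_per_value=1 models an 8-bit cache and halves each figure, which is roughly where the lower end of the rule of thumb comes from.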

The small default context explains the summary I got. The model did not read the whole episode. It read part of it and made up the rest. The fix is to explicitly set num_ctx in the Ollama call to give it more room:

ollama_response = ollama.generate(
    model="llama3.1",
    prompt=prompt,
    options={"num_ctx": 32768}
)
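
Before sending a long transcript, it can help to check roughly whether it fits in the chosen num_ctx at all. The sketch below uses the common "about four characters per token" heuristic for English text, which is an assumption and not how the model's tokenizer actually counts. estimate_tokens and fits_in_context are hypothetical helpers, and the 1,024-token reserve for the generated summary is an arbitrary choice.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate; real tokenizers will give different counts."""
    return int(len(text) / chars_per_token) + 1


def fits_in_context(prompt: str, num_ctx: int, reserve_for_output: int = 1024) -> bool:
    """True when the estimated prompt plus room for the reply fits in num_ctx."""
    return estimate_tokens(prompt) + reserve_for_output <= num_ctx


# Stand-in for a long transcript: 20,000 short words, 100,000 characters
transcript = "word " * 20_000

for num_ctx in (4_096, 32_768):
    verdict = "fits" if fits_in_context(transcript, num_ctx) else "would be truncated"
    print(f"num_ctx={num_ctx}: {verdict}")
```

If the estimate says the prompt would be truncated even at 32K, splitting the transcript and summarizing in parts is likely a safer route than raising num_ctx further on an 8GB card.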

Step 6 — The VRAM Conflict

Running the updated script with the improved prompt and increased context window immediately hit a new error:

model requires more system memory (6.8 GiB) than is available (3.9 GiB) (status code: 500)

The problem is straightforward once you see it. The script loads Whisper at startup and keeps it in VRAM for the entire run. Whisper's turbo model uses about 6GB of the 4060's 8GB, which leaves only about 2GB free. When the script gets to the Ollama step and tries to load llama3.1 at 32K context, which needs roughly 7GB or more, there is nowhere on the GPU to put it. Note that the message refers to system memory rather than VRAM, which suggests that Ollama, finding the GPU nearly full, fell back to system RAM and did not find enough free there either.

The fix is to explicitly release Whisper from VRAM before calling Ollama, so that the full 8GB is available for the summarization step. Dropping the Python reference is not enough on its own, because PyTorch keeps freed GPU memory in its own cache for reuse. You have to hand it back yourself with three lines:

del model
gc.collect()
torch.cuda.empty_cache()

del model removes the Python reference. gc.collect() runs the garbage collector to clean up any remaining references. torch.cuda.empty_cache() is the important one: it tells PyTorch to actually release the VRAM back to the GPU rather than holding it in reserve for potential reuse.

Two other changes go along with this. First, import gc and import torch need to be added to the imports at the top of the script. Second, the Whisper model load moves from the top of the script to inside the loop, so it loads fresh for each file rather than once at startup. This means Whisper reloads on each iteration, which adds a few seconds per file, but for podcast episodes that is completely negligible.

The updated loop body now looks like this:

# --- Step 1/3: Transcribe (Whisper on GPU) ---
print(" -> Step 1/3: Transcribing...")
model = whisper.load_model("turbo", device="cuda")
result = model.transcribe(video_path)
raw_text = result["text"].strip()

with open(transcript_path, "w", encoding="utf-8") as f:
    f.write(raw_text)

# --- Step 2/3: Subtitles (no GPU needed) ---
print(" -> Step 2/3: Creating Subtitles...")
with open(subtitle_path, "w", encoding="utf-8") as srt_file:
    for index, segment in enumerate(result["segments"], start=1):
        start = format_srt_time(segment["start"])
        end = format_srt_time(segment["end"])
        text = segment["text"].strip()
        srt_file.write(f"{index}\n{start} --> {end}\n{text}\n\n")

# --- Free Whisper from VRAM before loading Ollama ---
del model
gc.collect()
torch.cuda.empty_cache()
print("    (Whisper unloaded from VRAM)")

# --- Step 3/3: Summarize (Ollama on GPU) ---
print(" -> Step 3/3: Summarizing...")
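
To confirm that the release really happened, it can be useful to print what PyTorch still holds afterwards. A minimal sketch, assuming a hypothetical helper named release_and_report: it relies on torch.cuda.memory_allocated() and torch.cuda.memory_reserved(), and returns None when PyTorch or a CUDA device is not available. The caller still has to run del model first, since a helper cannot delete the caller's reference.

```python
import gc


def release_and_report():
    """Collect garbage, empty PyTorch's CUDA cache, and return (allocated, reserved) bytes."""
    gc.collect()
    try:
        import torch
    except ImportError:
        return None  # PyTorch not installed
    if not torch.cuda.is_available():
        return None  # no GPU visible to PyTorch
    torch.cuda.empty_cache()
    return torch.cuda.memory_allocated(), torch.cuda.memory_reserved()


stats = release_and_report()
if stats is None:
    print("No CUDA device visible; nothing to release")
else:
    allocated, reserved = stats
    print(f"allocated={allocated / 2**20:.0f} MiB, reserved={reserved / 2**20:.0f} MiB")
```

Both numbers should drop to near zero after Whisper is unloaded. If they do not, something is still holding a reference to the model.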

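I also had to unload the Ollama model once the summary is written. By default Ollama keeps a model loaded for several minutes after a request, so without this it would still be holding VRAM when Whisper loads for the next file:

# After the ollama.generate call, unload the model from VRAM
ollama.generate(
    model="llama3.1",
    prompt="",
    keep_alive=0
)
print("    (Ollama unloaded from VRAM)")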

The terminal will now print (Whisper unloaded from VRAM) between the subtitle and summarization steps and (Ollama unloaded from VRAM) after, which makes it easy to confirm the sequence is working correctly when you watch a run in progress.
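
An alternative to the separate empty-prompt call is to set keep_alive=0 on the summarization request itself, so the model is released as soon as the response comes back. The sketch below only builds the keyword arguments, so it runs without a server. summary_request is a hypothetical helper, and the commented-out last line assumes the ollama package is installed and the Ollama server is running.

```python
def summary_request(prompt: str, model: str = "llama3.1", num_ctx: int = 32768) -> dict:
    """Build keyword arguments for ollama.generate that unload the model afterwards."""
    return {
        "model": model,
        "prompt": prompt,
        "options": {"num_ctx": num_ctx},
        "keep_alive": 0,  # ask Ollama to unload the model once this request finishes
    }


request = summary_request("Placeholder prompt")
print(request)

# With the ollama package installed and the server running:
# ollama_response = ollama.generate(**request)
```

The trade-off is that the model reloads for every episode, which matches what the script already does with Whisper.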

Step 7 — The Updated Script

Here is the final version, incorporating the improved prompt, the increased context window, and the VRAM handoff between Whisper and Ollama. It also carries a fix to the timing output: an earlier draft assigned the minutes to a variable named min (which shadows Python's built-in) and then referenced mins in the print statement, which would have raised a NameError on any file that actually completed successfully. The listings in this post use mins throughout.

import os
import glob
import time
import whisper
import ollama
import gc
import torch
from tqdm import tqdm

def format_srt_time(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    milliseconds = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{milliseconds:03d}"

video_files = glob.glob("*.mp4")
if not video_files:
    print("No files found")
else:
    print(f"Found {len(video_files)} to transcribe")

for video_path in tqdm(video_files, desc="Overall Progress", unit="video"):
    # At the start of each loop iteration, ensure Ollama isn't holding VRAM
    try:
        ollama.generate(model="llama3.1", prompt="", keep_alive=0)
    except Exception:
        pass  # Best effort: a failed unload should not stop the run
    base_name = os.path.splitext(video_path)[0]
    transcript_path = f"{base_name}.transcript.txt"
    subtitle_path = f"{base_name}.subtitles.srt"
    summary_path = f"{base_name}.summary.txt"

    if os.path.exists(transcript_path) and os.path.exists(subtitle_path) and os.path.exists(summary_path):
        print(f"Skipping '{video_path}'")
        continue

    print(f"\nProcessing: '{video_path}'")
    start_time = time.time()

    try:
        # --- Step 1/3: Transcribe (Whisper on GPU) ---
        print(" -> Step 1/3: Transcribing...")
        model = whisper.load_model("turbo", device="cuda")
        result = model.transcribe(video_path)
        raw_text = result["text"].strip()

        with open(transcript_path, "w", encoding="utf-8") as f:
            f.write(raw_text)

        # --- Step 2/3: Subtitles (no GPU needed) ---
        print(" -> Step 2/3: Creating Subtitles...")
        with open(subtitle_path, "w", encoding="utf-8") as srt_file:
            for index, segment in enumerate(result["segments"], start=1):
                start = format_srt_time(segment["start"])
                end = format_srt_time(segment["end"])
                text = segment["text"].strip()
                srt_file.write(f"{index}\n{start} --> {end}\n{text}\n\n")

        # --- Free Whisper from VRAM before loading Ollama ---
        del model
        gc.collect()
        torch.cuda.empty_cache()
        print("    (Whisper unloaded from VRAM)")

        # --- Step 3/3: Summarize (Ollama on GPU) ---
        print(" -> Step 3/3: Summarizing...")
        prompt = (
            f"You are writing show notes for 'The Cloud Talk Show', a podcast hosted by "
            f"Larry Smithmier and Ralph Lecesse that covers cloud-native development, "
            f"self-hosted infrastructure, and hands-on technical topics.\n\n"
            f"Based on the following transcript, write engaging show notes in the style of a "
            f"knowledgeable tech blogger. Write in present tense as if describing the episode "
            f"to a potential listener. Do not use phrases like 'this transcript' or 'the host' "
            f"— use 'Larry', 'Ralph', or 'this episode' instead.\n\n"
            f"Structure the output as:\n"
            f"1. A 3-4 sentence episode overview in an engaging, direct tone\n"
            f"2. A bulleted list of key topics and takeaways\n"
            f"3. A one-sentence closing that tells the reader why they should watch\n\n"
            f"Transcript:\n{raw_text}"
        )

        ollama_response = ollama.generate(
            model="llama3.1",
            prompt=prompt,
            options={"num_ctx": 32768}
        )

        with open(summary_path, "w", encoding="utf-8") as f:
            f.write(ollama_response['response'])

        # After the ollama.generate call, unload the model from VRAM
        ollama.generate(
            model="llama3.1",
            prompt="",
            keep_alive=0
        )
        print("    (Ollama unloaded from VRAM)")

        elapsed_time = time.time() - start_time
        mins, secs = divmod(int(elapsed_time), 60)
        print(f"Success: All files saved for {video_path}! (Time: {mins}m {secs}s)\n")

    except Exception as e:
        print(f"Error processing '{video_path}': {e}\n")

print("All files processed!")

Alternative: mistral-nemo for Better Long-Context Summarization

If the summary quality on long episodes still feels like it is missing content, there is another model worth trying for this job. mistral-nemo is a 12B model that Ollama lists at 7.1GB, so the weights alone just fit on the 4060, and it may handle long-form structured output better than llama3.1:8b. Like llama3.1, it has a native 128K context window, so the practical limit is again the num_ctx you can afford in VRAM rather than the model itself.

Pull the model:

ollama pull mistral-nemo

The only change to the script is the model name in the ollama.generate call:

ollama_response = ollama.generate(
    model="mistral-nemo",
    prompt=prompt,
    options={"num_ctx": 32768}
)

At 32K context, mistral-nemo needs roughly 9-10GB of VRAM or more, which is over the 4060's 8GB. In practice Ollama will offload some layers to system RAM automatically, so it will still run, just slower than a model that fits entirely on the GPU. Whether the quality improvement justifies that trade-off is something to test on a real episode.

What Is Next

The script is working but it is not yet wired into the production pipeline. My plan is to add a trigger so it runs automatically when a new recording lands in the OneDrive folder — either a file system watcher or a scheduled task that checks for new MP4s. Once that is in place, Raquel will have the transcript, subtitles, and summary waiting for her before she even opens Camtasia.
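
For the scheduled-task variant, the core piece is a function that lists recordings that still lack one of the three outputs. A minimal sketch, assuming the same file naming as the script above. find_pending is a hypothetical helper, and the demo runs against a temporary folder rather than the real OneDrive path.

```python
import tempfile
from pathlib import Path

OUTPUT_SUFFIXES = (".transcript.txt", ".subtitles.srt", ".summary.txt")


def find_pending(folder):
    """Return the .mp4 files in folder that are missing at least one output file."""
    pending = []
    for video in sorted(Path(folder).glob("*.mp4")):
        stem = video.with_suffix("")  # same base name the script derives with splitext
        if not all(Path(f"{stem}{suffix}").exists() for suffix in OUTPUT_SUFFIXES):
            pending.append(video)
    return pending


# Demo: one finished episode and one new recording
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "done.mp4").touch()
    for suffix in OUTPUT_SUFFIXES:
        (folder / f"done{suffix}").touch()
    (folder / "new.mp4").touch()
    print([p.name for p in find_pending(folder)])
```

A scheduled task could call this every few minutes and start the pipeline only when the list is not empty. A recording that is still syncing would need an extra check, such as waiting until the file size stops changing.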

I also want to upload the SRT file to YouTube automatically as part of the same pipeline, since that step is currently manual. More on both of those when I get there.

The final version of the script is available if you want to use it. Drop a comment or reach out if you have questions.


Tuesday, May 19, 2026
