The Setup
The script runs on the Windows 11 VM I have dedicated to OBS production. That VM has an NVIDIA GeForce RTX 4060 passed through from the Proxmox host, which matters a lot here — running Whisper on a GPU versus a CPU is not a minor difference. I also have Ollama installed on the same VM to handle the summarization step locally without sending episode content to an external API.
Step 1 — Installing the Dependencies
Everything installs cleanly from the command line. Start with Python and FFmpeg, which Whisper needs to process audio from video files:
winget install Python.Python.3.12
winget install Gyan.FFmpeg
Then install Whisper and the progress bar library:
pip install -U openai-whisper
pip install tqdm
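Before moving on, it can help to confirm that the installs actually landed where Python can see them. The following is a minimal sketch, not part of the original script: the helper name check_dependencies is hypothetical, and it only checks that ffmpeg is on PATH and that the whisper and tqdm packages are importable.

```python
import importlib.util
import shutil


def check_dependencies(modules=("whisper", "tqdm"), tools=("ffmpeg",)):
    """Return a list of human-readable problems; an empty list means all were found."""
    problems = []
    for tool in tools:
        # shutil.which returns None when the executable is not on PATH
        if shutil.which(tool) is None:
            problems.append(f"'{tool}' is not on PATH")
    for name in modules:
        # find_spec returns None when a top-level package cannot be located
        if importlib.util.find_spec(name) is None:
            problems.append(f"Python package '{name}' is not importable")
    return problems


if __name__ == "__main__":
    for problem in check_dependencies():
        print("Missing:", problem)
```

If this prints nothing, the command-line tools and packages from this step are in place. A new terminal window may be needed after the winget installs before PATH changes are picked up.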
Step 2 — The First Script
The starting point was straightforward. Point Whisper at every MP4 in the current directory, transcribe each one, and save the result as a text file. Skip anything that already has a transcript so reruns are safe.
import os
import glob
import whisper
model = whisper.load_model("base")
video_files = glob.glob("*.mp4")
if not video_files:
    print("No files found")
else:
    print(f"Found {len(video_files)} to transcribe")
    for video_path in video_files:
        base_name = os.path.splitext(video_path)[0]
        transcript_path = f"{base_name}.transcript.txt"
        if os.path.exists(transcript_path):
            print(f"Skipping '{video_path}'")
            continue
        print(f"Processing: '{video_path}'")
        try:
            result = model.transcribe(video_path)
            with open(transcript_path, "w", encoding="utf-8") as f:
                f.write(result["text"])
            print(f"Success: saved to '{transcript_path}'.\n")
        except Exception as e:
            print(f"Error processing '{video_path}': {e}\n")
    print("All files processed!")
Running this immediately produced a warning:
warnings.warn("FP16 is not supported on CPU; using FP32 instead")
Whisper was running on the CPU. The 4060 was sitting there doing nothing. The fix is to reinstall PyTorch with CUDA support — the default pip install does not include it:
pip uninstall torch torchvision torchaudio -y
pip cache purge
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124
After that, verify the GPU is visible to PyTorch before touching the script:
python -c "import torch; print('GPU Available:', torch.cuda.is_available()); print('Device Name:', torch.cuda.get_device_name(0) if torch.cuda.is_available() else 'None')"
You want to see your GPU name come back, not None. Once that is confirmed, change the model load line to tell Whisper to use CUDA:
model = whisper.load_model("base", device="cuda")
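Hard-coding device="cuda" makes the script fail outright on a machine where the CUDA build of PyTorch is missing. One possible alternative, sketched here under the assumption that a silent CPU fallback is acceptable, is to choose the device at runtime. The helper name pick_device is hypothetical and not part of the original script.

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch
    except ImportError:
        # No PyTorch at all: report CPU so the caller can decide what to do
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Whisper will run on: {device}")
# The model load line would then become:
# model = whisper.load_model("base", device=device)
```

The trade-off is that a misconfigured GPU now shows up as a slow run rather than an error, so the printed device line is worth watching.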
Step 3 — Better Model, Better Output
With the GPU working, the base model felt like leaving performance on the table. The 4060 has 8GB of VRAM, which is enough to run Whisper's turbo model, a pruned and fine-tuned version of large-v3 that runs significantly faster with only a small loss in accuracy. I also added timing and a progress bar with tqdm so it is obvious what is happening and how long it takes.
import os
import glob
import time
import whisper
from tqdm import tqdm
print("Loading Whisper 'turbo' model...")
model = whisper.load_model("turbo", device="cuda")
video_files = glob.glob("*.mp4")
if not video_files:
    print("No .mp4 files found in the current directory.")
else:
    print(f"Found {len(video_files)} video file(s). Starting queue...\n")
    for video_path in tqdm(video_files, desc="Overall Progress", unit="video"):
        base_name = os.path.splitext(video_path)[0]
        transcript_path = f"{base_name}.transcript.txt"
        if os.path.exists(transcript_path):
            continue
        print(f"\n[Processing] {video_path}")
        start_time = time.time()
        try:
            result = model.transcribe(video_path)
            with open(transcript_path, "w", encoding="utf-8") as f:
                f.write(result["text"])
            elapsed_time = time.time() - start_time
            mins, secs = divmod(int(elapsed_time), 60)
            print(f"[Success] Saved transcript. Time taken: {mins}m {secs}s.")
        except Exception as e:
            print(f"[Error] Failed on '{video_path}': {e}")
    print("\nAll tasks finished!")
Step 4 — Adding Subtitles and Summaries
Whisper's transcription result includes segment-level timing data, which makes generating an SRT subtitle file essentially free: just format the timestamps correctly. I added that along with a summarization step using Ollama and llama3.1 running locally. First, install the Ollama Python library and pull the model:
pip install ollama
ollama pull llama3.1
Then the updated script that produces all three output files (transcript, subtitles, and summary) for each MP4:
import os
import glob
import time
import whisper
import ollama
from tqdm import tqdm
def format_srt_time(seconds):
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    milliseconds = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{milliseconds:03d}"
model = whisper.load_model("turbo", device="cuda")
video_files = glob.glob("*.mp4")
if not video_files:
    print("No files found")
else:
    print(f"Found {len(video_files)} to transcribe")
    for video_path in tqdm(video_files, desc="Overall Progress", unit="video"):
        base_name = os.path.splitext(video_path)[0]
        transcript_path = f"{base_name}.transcript.txt"
        subtitle_path = f"{base_name}.subtitles.srt"
        summary_path = f"{base_name}.summary.txt"
        if os.path.exists(transcript_path) and os.path.exists(subtitle_path) and os.path.exists(summary_path):
            print(f"Skipping '{video_path}'")
            continue
        print(f"\nProcessing: '{video_path}'")
        start_time = time.time()
        try:
            print(" -> Step 1/3: Transcribing...")
            result = model.transcribe(video_path)
            raw_text = result["text"].strip()
            with open(transcript_path, "w", encoding="utf-8") as f:
                f.write(raw_text)
            print(" -> Step 2/3: Creating Subtitles...")
            with open(subtitle_path, "w", encoding="utf-8") as srt_file:
                for index, segment in enumerate(result["segments"], start=1):
                    start = format_srt_time(segment["start"])
                    end = format_srt_time(segment["end"])
                    text = segment["text"].strip()
                    srt_file.write(f"{index}\n{start} --> {end}\n{text}\n\n")
            print(" -> Step 3/3: Summarizing...")
            prompt = (
                f"You are writing show notes for 'The Cloud Talk Show', a podcast hosted by "
                f"Larry Smithmier and Ralph Lecesse that covers cloud-native development, "
                f"self-hosted infrastructure, and hands-on technical topics.\n\n"
                f"Based on the following transcript, write engaging show notes in the style of a "
                f"knowledgeable tech blogger. Write in present tense as if describing the episode "
                f"to a potential listener. Do not use phrases like 'this transcript' or 'the host' "
                f"— use 'Larry', 'Ralph', or 'this episode' instead.\n\n"
                f"Structure the output as:\n"
                f"1. A 3-4 sentence episode overview in an engaging, direct tone\n"
                f"2. A bulleted list of key topics and takeaways\n"
                f"3. A one-sentence closing that tells the reader why they should watch\n\n"
                f"Transcript:\n{raw_text}"
            )
            ollama_response = ollama.generate(
                model="llama3.1",
                prompt=prompt
            )
            with open(summary_path, "w", encoding="utf-8") as f:
                f.write(ollama_response['response'])
            elapsed_time = time.time() - start_time
            mins, secs = divmod(int(elapsed_time), 60)
            print(f"Success: All files saved for {video_path}! (Time: {mins}m {secs}s)\n")
        except Exception as e:
            print(f"Error processing '{video_path}': {e}\n")
    print("All files processed!")
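One detail in format_srt_time is worth a second look: int((seconds % 1) * 1000) truncates, so floating-point error can drop a millisecond (1.001 seconds comes out as ,000 rather than ,001). The sketch below is a possible variant that rounds to whole milliseconds first and then splits with integer arithmetic. The name format_srt_time_ms is hypothetical, and the original function is perfectly usable for subtitles where a one-millisecond drift does not matter.

```python
def format_srt_time_ms(seconds):
    """Format a time in seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    # Round once to whole milliseconds, then use integer math so that
    # values such as 59.9996 carry cleanly into the next second/minute.
    total_ms = int(round(seconds * 1000))
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    secs, milliseconds = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{milliseconds:03d}"


print(format_srt_time_ms(3661.5))
print(format_srt_time_ms(1.001))
```

It is a drop-in replacement for the two format_srt_time calls in the subtitle loop if exact millisecond values matter.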
Step 5 — The Summary Was Bad
The script worked. The transcript and subtitle files were exactly what I wanted. The summary was not.
The output read like someone who had never heard of the show and was hedging every sentence. It referred to things as "the host" instead of Larry or Ralph, and it ended with the phrase "this transcript provides a fascinating glimpse" — which tells you the model had no idea it was writing show notes for a podcast. It was just pattern-matching on "I was given a transcript, I will write about a transcript."
There were two problems.
- Problem 1: The prompt gave the model no context. Without knowing what the show is, who the hosts are, or what tone is appropriate, the model defaults to generic academic summarization. The fix is to front-load the prompt with enough context that the model knows exactly what role it is playing before it reads a single word of the transcript.
- Problem 2: The context window was too small. This one is less obvious. Here is what is actually happening under the hood.
Understanding Model Size, Context, and VRAM
When you run a model through Ollama, two separate things consume your GPU's VRAM:
- Model weights are fixed. When you see llama3.1 listed at 4.9GB in Ollama, that is how much VRAM the model itself needs regardless of what you ask it to do. It loads once and stays there.
- The KV cache is dynamic. Every token the model processes (both input and output) needs to be stored in what is called the key-value cache. The larger the context window, the more VRAM the cache consumes. This is why models that support 128K context cannot actually use all of it on consumer hardware.
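To make the KV cache growth concrete, here is a back-of-the-envelope sketch. It assumes the published Llama 3.1 8B architecture (32 layers, 8 key-value heads, head dimension 128) and a 16-bit cache. The function name kv_cache_bytes is hypothetical, and what Ollama actually allocates can differ, for example with a quantized KV cache or extra compute buffers.

```python
def kv_cache_bytes(n_ctx, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_value=2):
    """Estimate KV cache size in bytes for a given context length.

    Defaults are assumptions based on Llama 3.1 8B with a 16-bit cache.
    """
    # Factor of 2: each layer stores one key tensor and one value tensor
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_value * n_ctx


for n_ctx in (8_192, 32_768, 131_072):
    gib = kv_cache_bytes(n_ctx) / 1024**3
    print(f"{n_ctx:>7} tokens -> {gib:.1f} GiB of KV cache")
```

Under these assumptions the cache grows linearly at 128 KiB per token: 1 GiB at 8K, 4 GiB at 32K, and 16 GiB at 128K. Treat these as an upper bound for an unquantized cache rather than a measurement.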
Rules of thumb for an 8GB GPU like the 4060:
- Model weights consume roughly what Ollama reports as the model size
- Each 8K of context adds approximately 0.5-1GB of KV cache for a 7-8B model
- 32K context adds roughly 2-3GB on top of the model weights, so llama3.1 (4.9GB) at 32K context needs about 7-8GB total. Tight but workable on a 4060.
- 128K context on an 8GB GPU is not realistic: the KV cache alone would need 15-20GB
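- The Ollama default context is far smaller than what the model supports (2,048 tokens in older releases and 4,096 in more recent ones, so check the documentation for your version), which for a podcast transcript means the model is almost certainly reading a truncated version of what you gave it and filling in the gaps with hallucination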
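A quick way to see whether truncation is likely for a given episode is to estimate the prompt's token count before sending it. The sketch below assumes the common rule of thumb of about four characters per token for English text, which is only an approximation of what the model's tokenizer would produce. The helper names and the 1,024-token output reserve are hypothetical.

```python
def estimate_tokens(text, chars_per_token=4.0):
    """Very rough token estimate; assumes about 4 characters per token."""
    return int(len(text) / chars_per_token)


def fits_in_context(prompt, num_ctx, reserve_for_output=1024):
    """Return True if the prompt plus room for the reply fits in num_ctx tokens."""
    return estimate_tokens(prompt) + reserve_for_output <= num_ctx


# Stand-in for an hour of speech: roughly 9,000 short words
sample_transcript = "word " * 9000
for num_ctx in (8_192, 32_768):
    verdict = "fits" if fits_in_context(sample_transcript, num_ctx) else "would be truncated"
    print(f"num_ctx={num_ctx}: {verdict}")
```

For a real run, the argument would be the full prompt string, including the instructions that precede the transcript.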
Set num_ctx in the Ollama call to give it more room:
ollama_response = ollama.generate(
    model="llama3.1",
    prompt=prompt,
    options={"num_ctx": 32768}
)
Step 6 — The VRAM Conflict
Running the updated script with the improved prompt and increased context window immediately hit a new error:
model requires more system memory (6.8 GiB) than is available (3.9 GiB) (status code: 500)
The problem is straightforward once you see it. The script loads Whisper at startup and keeps it in VRAM for the entire run. Whisper's turbo model uses about 6GB of the 4060's 8GB, which leaves only 2GB free. When the script gets to the Ollama step and tries to load llama3.1 at 32K context, which needs roughly 7-8GB, there is simply nowhere to put it.
The fix is to explicitly release Whisper from VRAM before calling Ollama so that the full 8GB is available for the summarization step. Python's garbage collector does not do this automatically for GPU memory. You have to do it yourself with three lines:
del model
gc.collect()
torch.cuda.empty_cache()
del model removes the Python reference. gc.collect() runs the garbage collector to clean up any remaining references. torch.cuda.empty_cache() is the important one: it tells PyTorch to actually release the VRAM back to the GPU rather than holding it in reserve for potential reuse.
Two other changes go along with this. First, import gc and import torch need to be added to the imports at the top of the script. Second, the Whisper model load moves from the top of the script to inside the loop, so it loads fresh for each file rather than once at startup. This means Whisper reloads on each iteration, which adds a few seconds per file, but for podcast episodes that is completely negligible. The updated loop now looks like this:
# --- Step 1/3: Transcribe (Whisper on GPU) ---
print(" -> Step 1/3: Transcribing...")
model = whisper.load_model("turbo", device="cuda")
result = model.transcribe(video_path)
raw_text = result["text"].strip()
with open(transcript_path, "w", encoding="utf-8") as f:
    f.write(raw_text)
# --- Step 2/3: Subtitles (no GPU needed) ---
print(" -> Step 2/3: Creating Subtitles...")
with open(subtitle_path, "w", encoding="utf-8") as srt_file:
    for index, segment in enumerate(result["segments"], start=1):
        start = format_srt_time(segment["start"])
        end = format_srt_time(segment["end"])
        text = segment["text"].strip()
        srt_file.write(f"{index}\n{start} --> {end}\n{text}\n\n")
# --- Free Whisper from VRAM before loading Ollama ---
del model
gc.collect()
torch.cuda.empty_cache()
print(" (Whisper unloaded from VRAM)")
# --- Step 3/3: Summarize (Ollama on GPU) ---
print(" -> Step 3/3: Summarizing...")
I also had to add code to unload the Ollama model using the following code:
# After the ollama.generate call, unload the model from VRAM
ollama.generate(
    model="llama3.1",
    prompt="",
    keep_alive=0
)
print(" (Ollama unloaded from VRAM)")
The terminal will now print (Whisper unloaded from VRAM) between the subtitle and summarization steps and (Ollama unloaded from VRAM) after, which makes it easy to confirm the sequence is working correctly when you watch a run in progress.
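One gap in the del / gc.collect() / empty_cache() sequence is that it only runs when transcription succeeds. If model.transcribe raises, the model stays referenced until the next iteration. A possible way to tighten this, sketched here with hypothetical helper names, is to keep the model inside a small function so the cleanup runs in a finally block. The load_model argument is any zero-argument callable that returns an object with a transcribe method.

```python
import gc


def release_cuda_cache():
    """Ask PyTorch to hand cached VRAM back, if PyTorch and a GPU are present."""
    try:
        import torch
    except ImportError:
        return
    if torch.cuda.is_available():
        torch.cuda.empty_cache()


def transcribe_then_release(video_path, load_model):
    """Load a model, transcribe one file, and always drop the model afterwards."""
    model = load_model()
    try:
        return model.transcribe(video_path)
    finally:
        # Runs on success and on error, so a failed file cannot pin VRAM
        del model
        gc.collect()
        release_cuda_cache()
```

In the script this could be called as result = transcribe_then_release(video_path, lambda: whisper.load_model("turbo", device="cuda")), after which result holds only plain Python data and no GPU tensors.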
Step 7 — The Updated Script
Here is the final version, incorporating the improved prompt, the increased context window, and the VRAM handoff between Whisper and Ollama. Note the bug fix in the timing output as well: an earlier draft used min as a variable name (which shadows Python's built-in) and then referenced mins in the print statement, which would have thrown a NameError on any file that actually completed successfully.
import os
import glob
import time
import whisper
import ollama
import gc
import torch
from tqdm import tqdm
def format_srt_time(seconds):
    # Convert seconds (float) to the SRT timestamp format HH:MM:SS,mmm
    hours = int(seconds // 3600)
    minutes = int((seconds % 3600) // 60)
    secs = int(seconds % 60)
    milliseconds = int((seconds % 1) * 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d},{milliseconds:03d}"
video_files = glob.glob("*.mp4")
if not video_files:
    print("No files found")
else:
    print(f"Found {len(video_files)} to transcribe")
    for video_path in tqdm(video_files, desc="Overall Progress", unit="video"):
        # At the start of each loop iteration, ensure Ollama isn't holding VRAM
        try:
            ollama.generate(model="llama3.1", prompt="", keep_alive=0)
        except Exception:
            pass  # Model wasn't loaded, that's fine
        base_name = os.path.splitext(video_path)[0]
        transcript_path = f"{base_name}.transcript.txt"
        subtitle_path = f"{base_name}.subtitles.srt"
        summary_path = f"{base_name}.summary.txt"
        # Skip only when all three outputs already exist
        if os.path.exists(transcript_path) and os.path.exists(subtitle_path) and os.path.exists(summary_path):
            print(f"Skipping '{video_path}'")
            continue
        print(f"\nProcessing: '{video_path}'")
        start_time = time.time()
        try:
            # --- Step 1/3: Transcribe (Whisper on GPU) ---
            print(" -> Step 1/3: Transcribing...")
            model = whisper.load_model("turbo", device="cuda")
            result = model.transcribe(video_path)
            raw_text = result["text"].strip()
            with open(transcript_path, "w", encoding="utf-8") as f:
                f.write(raw_text)
            # --- Step 2/3: Subtitles (no GPU needed) ---
            print(" -> Step 2/3: Creating Subtitles...")
            with open(subtitle_path, "w", encoding="utf-8") as srt_file:
                for index, segment in enumerate(result["segments"], start=1):
                    start = format_srt_time(segment["start"])
                    end = format_srt_time(segment["end"])
                    text = segment["text"].strip()
                    srt_file.write(f"{index}\n{start} --> {end}\n{text}\n\n")
            # --- Free Whisper from VRAM before loading Ollama ---
            del model
            gc.collect()
            torch.cuda.empty_cache()
            print(" (Whisper unloaded from VRAM)")
            # --- Step 3/3: Summarize (Ollama on GPU) ---
            print(" -> Step 3/3: Summarizing...")
            prompt = (
                f"You are writing show notes for 'The Cloud Talk Show', a podcast hosted by "
                f"Larry Smithmier and Ralph Lecesse that covers cloud-native development, "
                f"self-hosted infrastructure, and hands-on technical topics.\n\n"
                f"Based on the following transcript, write engaging show notes in the style of a "
                f"knowledgeable tech blogger. Write in present tense as if describing the episode "
                f"to a potential listener. Do not use phrases like 'this transcript' or 'the host' "
                f"— use 'Larry', 'Ralph', or 'this episode' instead.\n\n"
                f"Structure the output as:\n"
                f"1. A 3-4 sentence episode overview in an engaging, direct tone\n"
                f"2. A bulleted list of key topics and takeaways\n"
                f"3. A one-sentence closing that tells the reader why they should watch\n\n"
                f"Transcript:\n{raw_text}"
            )
            ollama_response = ollama.generate(
                model="llama3.1",
                prompt=prompt,
                options={"num_ctx": 32768}
            )
            with open(summary_path, "w", encoding="utf-8") as f:
                f.write(ollama_response['response'])
            # After the ollama.generate call, unload the model from VRAM
            ollama.generate(
                model="llama3.1",
                prompt="",
                keep_alive=0
            )
            print(" (Ollama unloaded from VRAM)")
            elapsed_time = time.time() - start_time
            mins, secs = divmod(int(elapsed_time), 60)
            print(f"Success: All files saved for {video_path}! (Time: {mins}m {secs}s)\n")
        except Exception as e:
            print(f"Error processing '{video_path}': {e}\n")
    print("All files processed!")
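The skip check above is all-or-nothing: if a run dies during summarization, the next run transcribes the episode again even though the transcript and subtitles are already on disk. One possible refinement, which is not part of the script above, is to ask which outputs are missing and redo only those steps. The names OUTPUT_SUFFIXES and missing_outputs are hypothetical. The suffixes match the file names the script writes.

```python
import os

# Output files the script produces for each video, keyed by step name
OUTPUT_SUFFIXES = {
    "transcript": ".transcript.txt",
    "subtitles": ".subtitles.srt",
    "summary": ".summary.txt",
}


def missing_outputs(video_path):
    """Return the names of the outputs that do not exist yet for this video."""
    base_name = os.path.splitext(video_path)[0]
    return [
        name
        for name, suffix in OUTPUT_SUFFIXES.items()
        if not os.path.exists(base_name + suffix)
    ]
```

If only "summary" comes back, the loop could read the existing transcript file into raw_text and go straight to the Ollama step. That skips the Whisper load entirely for that episode.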
Alternative: mistral-nemo for Better Long-Context Summarization
If the summary quality on long episodes still feels like it is missing content, there is a better model for this job.
mistral-nemo is a 12B model whose weights take 7.1GB, so the weights alone fit on the 4060, and it handles long-form structured output significantly better than llama3.1:8b. It also has a native 128K context window, which means it is less likely to lose the thread on a long episode even at moderate num_ctx settings. Pull the model:
ollama pull mistral-nemo
The only change to the script is the model name in the ollama.generate calls, including the two keep_alive=0 calls that unload the model:
ollama_response = ollama.generate(
    model="mistral-nemo",
    prompt=prompt,
    options={"num_ctx": 32768}
)
At 32K context, mistral-nemo needs roughly 9-10GB of VRAM, slightly over the 4060's 8GB. In practice Ollama will offload some layers to system RAM automatically, so it will still run, just a bit slower than a model that fits entirely on the GPU. Whether the quality improvement is worth the trade-off is worth testing on a real episode.
What Is Next
The script is working but it is not yet wired into the production pipeline. My plan is to add a trigger so it runs automatically when a new recording lands in the OneDrive folder — either a file system watcher or a scheduled task that checks for new MP4s. Once that is in place, Raquel will have the transcript, subtitles, and summary waiting for her before she even opens Camtasia.
I also want to upload the SRT file to YouTube automatically as part of the same pipeline, since that step is currently manual. More on both of those when I get there.
The final version of the script is available if you want to use it. Drop a comment or reach out if you have questions.