Subtitle generation with Subtitle Edit and Whisper
Images to Text
DVB transmissions often have graphic subtitles that many video playing devices cannot decode. Subtitle Edit can convert them to text (*.srt) files with a choice of OCR engines, automatically and almost error-free. Just record the DVB stream including subtitles and let Subtitle Edit do its job, then multiplex the .srt files into your video file using e.g. MKVToolNix. Note that Subtitle Edit has many more features to discover.
Speech to Text
Subtitle Edit also offers audio-to-subtitle conversion based on Vosk, and also on Whisper.
Vosk is pretty fast (about 4x realtime on a good i5 CPU) but is not perfect; it gives good results only with clearly spoken commentary.
Whisper, developed by OpenAI, is a lot slower, but it runs on CPU and GPU. With a decent GPU, e.g. a GTX 980, it also achieves about 4-6x realtime speed, using the "small" language model. On the CPU it's about 10 times slower. For the GPU we will need CUDA. I will describe an installation that allows Whisper to be used as a stand-alone app, offering all of its options. This works as follows:
- Install Nvidia CUDA 12, even if the current torch version is for 11.6; that's no problem. Only if the CUDA 12 installer says your hardware isn't suitable, use 11.6.
- You need to sign up as a developer to download CUDA, which is easily done.
- Do a custom install of CUDA, selecting only the runtime part and the display driver (if yours is not yet up to date).
- Download CUDNN and copy its files into a new folder, C:\Program Files\NVIDIA\CUDNN\v8.x\
- zLib is a data compression software library that is needed by CUDNN.
- Download and extract the zLib package.
- Copy the zlibwapi.dll from it into C:\Program Files\NVIDIA\CUDNN\v8.x\bin\
- Add a path entry C:\Program Files\NVIDIA\CUDNN\v8.x\bin
- How to add a path entry: right-click on Computer, select Properties, Advanced Options, Environment Variables. There are two path strings you can edit, one for just your user account and one for the system. Use the system one for hardware-related things like CUDA. Click on the one intended, then "add new entry".
- We also need the ffmpeg command line version in order to extract audio from various file formats:
- Download ffmpeg essential.
- Then create a folder C:\Program Files\ffmpeg\ and copy all files into it.
- Add a path entry C:\Program Files\ffmpeg\bin\
- Download Git, 2.39 or newer, and install it with all options at their defaults, though you may want to deselect adding it to the context menu.
- Note: if you had Git installed before, in particular without CUDA, then first uninstall it, and also delete the Git folder in Program Files. You may need to do the same with Python.
- Add path entries C:\Program Files\Git\bin\git.exe and C:\Program Files\Git\cmd (the latter should already be there)
- Install Python 3.8 (or a version below 3.11), choose "add to environment variables" and "install pylauncher".
- Reboot in order to make sure the new path entries are loaded.
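After the reboot it can be worth checking that the new path entries actually took effect. The following is a minimal Python sketch, not part of the original procedure: the tool names are taken from the steps above, the function names are made up for illustration, and "whisper" will only be found once the pip commands further down have been run.

```python
import os
import shutil

# Tool names taken from the installation steps; "whisper" only exists
# after the pip install step that follows.
TOOLS = ["ffmpeg", "git", "python", "whisper"]


def missing_path_entries(path_value=None):
    """Return PATH entries that do not point to an existing directory."""
    if path_value is None:
        path_value = os.environ.get("PATH", "")
    entries = [e.strip().strip('"') for e in path_value.split(os.pathsep)]
    return [e for e in entries if e and not os.path.isdir(e)]


def find_tools(names=TOOLS):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, location in find_tools().items():
        print(f"{name:8} {location or 'NOT FOUND on PATH'}")
    for entry in missing_path_entries():
        print("PATH entry is not an existing folder:", entry)
```

Any entry reported by missing_path_entries is either mistyped or points at a file instead of a folder; Windows ignores such entries when searching for programs.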
- Open a command window in admin mode (in the Start menu under Windows, System: right-click on it and choose Run as administrator) and enter the following commands:
- pip install torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu116
This will download approx. 2.4 GB!
Note that it's crucial to get exactly this version, cu116; others won't work!
- python.exe -m pip install --upgrade pip
- pip install git+https://github.com/openai/whisper.git --use-pep517
(pep517 avoids some complaint about Wheel package not installed)
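Before running Whisper, a quick check that the installed torch build really sees the graphics card can save some puzzling. This is a small sketch under the assumption that torch was installed with the command above; the function name cuda_report is made up here, while the torch calls are standard PyTorch ones.

```python
def cuda_report():
    """Describe whether the installed torch build can use the GPU."""
    try:
        import torch  # installed by the pip command above
    except ImportError:
        return "torch is not installed in this Python environment"
    if not torch.cuda.is_available():
        return f"torch {torch.__version__} found, but CUDA is not available"
    name = torch.cuda.get_device_name(0)
    total_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    return (f"torch {torch.__version__}, CUDA {torch.version.cuda}, "
            f"{name}, {total_gb:.1f} GB")


print(cuda_report())
```

If the report says CUDA is not available, Whisper will still run with --device cpu, only much more slowly.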
Now you may test. Suppose you have a file named test.mkv and you've opened a command window in the same directory (see the hint on getting a command prompt below):
- whisper "test.mkv" --model small --language English --device cuda
This will download the small language model, then generate text and subtitle files for test.mkv or whatever file you specified.
Some texts are difficult enough to send Whisper into error loops; you will notice this when some output lines are repeated many times instead of the subsequent text being generated. In this case, the option
- --condition_on_previous_text False
will help. It may also speed up the process in these cases, and it reduces memory usage (e.g. only 2.4 GB of GPU memory instead of 3).
It is, however, not always better; it may sometimes result in phrases from one line being repeated in the next one.
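Such error loops can also be spotted after the fact in the generated .srt file. The sketch below is only an illustration: it assumes a plain SRT layout (number line, timing line, text lines, blank line), and the function names and the threshold of three repeats are arbitrary choices.

```python
import re


def subtitle_texts(srt_text):
    """Return the text of each cue in an SRT document, in order."""
    texts = []
    for block in re.split(r"\n\s*\n", srt_text.strip()):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        for i, ln in enumerate(lines):
            if "-->" in ln:  # the timing line; the cue text follows it
                body = lines[i + 1:]
                if body:
                    texts.append(" ".join(body))
                break
    return texts


def repeated_runs(srt_text, min_repeats=3):
    """Return (text, count) for each run of identical consecutive cues."""
    runs, previous, count = [], None, 0
    for text in subtitle_texts(srt_text) + [None]:  # None flushes the last run
        if text == previous:
            count += 1
            continue
        if previous is not None and count >= min_repeats:
            runs.append((previous, count))
        previous, count = text, 1
    return runs
```

A file for which repeated_runs returns anything is a candidate for a second run with --condition_on_previous_text False.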
Whisper model sizes, speed, memory usage
This compares an i5 CPU @ 4x4 GHz and a GTX 980 graphics card.
model | file size MB | RAM GB, CPU | RAM GB, CPU c.o.p.t. | RAM GB, GPU | RAM GB, GPU c.o.p.t. | speed factor, CPU | speed factor, GPU |
small | 472 | 1.8 | 1.8 | 2.5 | 3.4 | 0.45 | 5 |
medium | 1492 | 4.7 | 4.5 | 6.5? | 8.5? | 0.18 | 2? |
large | 3015 | 8.8 | 9.4 | 12? | 18? | 0.1 | 1? |
c.o.p.t. means --condition_on_previous_text True (the default setting)
Note that in addition to the above numbers, approx. 3 GB of extra CPU memory may be necessary with Windows 10, provided no other memory-intensive apps are running.
With GPU, memory is absolutely critical, the process stops if even only a bit is missing. So most currently installed graphics cards may not be able to run the large models.
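To make the table easier to apply, here is a small sketch that picks the largest model fitting a given amount of graphics memory and estimates the run time. The numbers are copied from the table (GPU column with c.o.p.t., so the entries marked "?" are just as uncertain here); the function names are made up for illustration.

```python
# GPU memory in GB with the default --condition_on_previous_text True,
# copied from the table; the medium and large values are marked "?" there.
GPU_RAM_GB = {"small": 3.4, "medium": 8.5, "large": 18.0}

# Speed factors from the table: multiples of real time.
SPEED_FACTOR = {
    "cpu": {"small": 0.45, "medium": 0.18, "large": 0.1},
    "gpu": {"small": 5.0, "medium": 2.0, "large": 1.0},
}


def largest_model_for_gpu(gpu_memory_gb):
    """Return the largest model whose GPU memory need fits, or None."""
    for model in ("large", "medium", "small"):
        if GPU_RAM_GB[model] <= gpu_memory_gb:
            return model
    return None


def processing_minutes(duration_minutes, model, device):
    """Estimate run time as the duration divided by the speed factor."""
    return duration_minutes / SPEED_FACTOR[device][model]
```

For example, a 90-minute recording with the small model comes out at 18 minutes on the GPU and roughly 200 minutes on the CPU, and a 4 GB card is limited to the small model.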
The small Whisper model already beats Vosk hands down, quality-wise, but it really needs a GPU to compete on speed.
The quality gain of the medium and large models over the small one is not always obvious, but the large model knows a lot more things, which eliminates many spelling errors and sometimes produces results so good that it's almost uncanny. Sometimes it gets weird, though, e.g. when a line saying "Copyright xxx" appears where there is not the slightest spoken text like that, or any connection to xxx whatsoever.
The larger model may well be better for rare languages; I would guess it is.
The problem with graphics cards is that you need a lot of graphics memory even for the medium model. For example, a GTX 1080 w. 11 GB costs about 3 times as much as a GTX 980 w. 4 GB but runs only about 50% faster, so the price/performance ratio clearly favours the 980, even more so as the 1080 would only allow running the medium instead of the small model, and some post-editing will be necessary in both cases anyway. For very difficult cases, normally the exception, you may still run the large model on the CPU. The alternative would be a really recent high-end graphics card, like a 4080 or 4090, a four-digit investment.
That said, most of the time I'm perfectly OK with the results of the small model on a GTX 980, which runs at five times real-time speed.
Preparing whisper cmd batches:
Whisper can be run from within Subtitle Edit, quite conveniently (see below). But even then it may also be convenient to do batch operations in the command window. This gives you more control, as you can use parameters such as --condition_on_previous_text False, and you see text lines being generated in real time.
If you are subtitling
many files at once, it may also be useful to merge subtitles into large numbers of video files automatically.
Let's first describe some tricks for batch subtitle generation:
- Open a command prompt in the folder containing your files.
- Enter dir /b (or dir /b /s if you want to include subdirectories) in the window that opens, then copy the entire listing out of the command window (select the text, then press Ctrl-C). You can then paste it into another application for further processing with Ctrl-V.
- Now Excel or a similar program comes in handy: mark a column, then paste the dir listing with Ctrl-V.
- Fill the columns before and after it with the other parts of the commands you want to have executed, as well as some " characters to allow for file names with blanks in them.
- Then mark the entire field and copy it with Ctrl-C, paste it into an editor window (the editor from Windows Accessories) with Ctrl-V, and save the text to a file. Re-open the text, mark one of the many tabs (wide spaces) in it and use replace-all to change them to nothing. Now we have all command lines ready.
- Select all of them, copy/paste them into a command window opened in the directory where your files sit, and off you go!
Why all that window copy/pasting? It avoids problems with special characters in file names.
These techniques also allow for making batch files processing through entire directory trees.
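Since Python is installed anyway, the same list of command lines can also be generated by a short script instead of a spreadsheet. This is a sketch, not the procedure described above: the set of file extensions and the default options are assumptions to adjust, and the function name is made up.

```python
import os

# Assumed list of media extensions; adjust to what you actually record.
MEDIA_EXTENSIONS = {".mkv", ".mp4", ".ts", ".wav"}


def whisper_commands(root, recurse=False,
                     options="--model small --language English --device cuda"):
    """Build one whisper command line per media file under root."""
    commands = []
    for folder, subfolders, files in os.walk(root):
        if not recurse:
            subfolders.clear()  # stops os.walk from descending further
        for name in sorted(files):
            if os.path.splitext(name)[1].lower() in MEDIA_EXTENSIONS:
                path = os.path.join(folder, name)
                # Quotes allow for file names with blanks in them.
                commands.append(f'whisper "{path}" {options}')
    return commands


if __name__ == "__main__":
    for line in whisper_commands(".", recurse=False):
        print(line)
```

The printed lines can be copied into the command window exactly as in the manual procedure; recurse=True corresponds to dir /b /s.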
Post-processing Whisper's srt files:
You should post-process srt files generated by stand-alone Whisper, for two crucial purposes: splitting long lines (Whisper makes many very long ones), and ensuring the right character encoding by saving the files as UTF-8 with BOM. This step is not necessary if we generated the srt files from within Subtitle Edit, with the right options.
Subtitle Edit's batch function serves this purpose. The following images show some options to use for better line splitting, and some options useful for the batch processing.
Note: Auto balance lines may fail quite often, so for reliability it may be better to leave it out.
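For illustration only, the two purposes named above can be sketched in a few lines of Python. This is not what Subtitle Edit does internally: it merely re-wraps over-long text lines within a cue (it does not split a cue into several cues with new timings), and the limit of 42 characters per line is an assumed value.

```python
import textwrap


def split_long_lines(srt_text, max_chars=42):
    """Re-wrap the text lines of each cue to at most max_chars characters."""
    out = []
    for line in srt_text.splitlines():
        stripped = line.strip()
        # Leave blank lines, cue numbers and timing lines untouched.
        if not stripped or stripped.isdigit() or "-->" in stripped:
            out.append(stripped)
        else:
            out.extend(textwrap.wrap(stripped, width=max_chars))
    return "\n".join(out) + "\n"


def save_utf8_bom(srt_text, path):
    """Write the subtitles as UTF-8 with a byte order mark (BOM)."""
    with open(path, "w", encoding="utf-8-sig") as f:
        f.write(srt_text)
```

The "utf-8-sig" codec is Python's name for UTF-8 with BOM; reading a file with the same codec strips the mark again.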
You may also want to have the subtitles displayed in the preview window overlaid on the video, as when playing them back on a TV. For this, just download mpv lib and then set your font size.
To run Whisper from within Subtitle Edit:
- Get the latest stable version of Subtitle Edit, currently 3.6.11.
- Select from the menu Video, Audio to Text, Whisper.
- Right-click on the window, select Python, and navigate to the path of whisper.exe, which is e.g. located at C:\Program Files\Python310\Scripts. The rest is self-explanatory.
The good thing here is that Subtitle Edit automatically corrects for Whisper problems like overly long lines and some timing errors.
The latter correction is not yet available separately in Subtitle Edit's batch processor (and it does not yet appear to do what I would wish, namely correcting text that is occasionally displayed much too early and then for far too long).
On the other hand, with the command prompt version, Whisper displays the text generated in real time, and you have more control, as with the --condition_on_previous_text False parameter. The lack of this parameter can indeed send Subtitle Edit's conversions badly off the rails.
Joining the subtitles with the videos
Finally, we want to merge the subtitles with the videos.
Manually, this is done with MKVToolNix. It's quite self-explanatory, so I won't go into more detail here.
But you may also want to use the batch processing tricks described above, and in this case you can use the mkvmerge.exe program that comes with MKVToolNix and is found in its program folder. Run it within a command window and you'll get its help text describing the parameters available.
I recommend adding a path entry for MKVToolNix's program folder so you can use the tools from anywhere in a command window.
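As a sketch of such a batch merge in Python: the function below pairs every .mkv in a folder with an .srt of the same base name and builds a mkvmerge call of the form mkvmerge -o output input subtitles. The "_subbed" suffix and the function name are made up, and it is an assumption that the subtitle file shares the video's base name; check how your Whisper version names its output.

```python
import os
import subprocess


def mkvmerge_commands(folder, suffix="_subbed"):
    """Pair each .mkv with a same-named .srt and build a mkvmerge call."""
    commands = []
    for name in sorted(os.listdir(folder)):
        stem, ext = os.path.splitext(name)
        srt = os.path.join(folder, stem + ".srt")
        if ext.lower() != ".mkv" or stem.endswith(suffix):
            continue  # not a video, or already a merged output file
        if not os.path.isfile(srt):
            continue  # no subtitles for this video
        video = os.path.join(folder, name)
        output = os.path.join(folder, stem + suffix + ".mkv")
        commands.append(["mkvmerge", "-o", output, video, srt])
    return commands


def run_all(commands):
    """Run the calls one by one; needs mkvmerge on the path."""
    for command in commands:
        # Passing a list avoids quoting problems with blanks in file names.
        subprocess.run(command)
```

Calling run_all(mkvmerge_commands(folder)) writes a new file next to each original and leaves the originals untouched.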
A genuine command prompt can also be opened from the Start Menu, ..Windows, ..System. (Right-click for options, such as Run as Administrator.) Then navigate to specific folders with cd <directory_name>. Entering cd .. gets you one level up.
A more convenient way may be to install OpenCommandPromptHere from 4dots-Software, which lets you choose whether you want the prompt in normal or admin mode, or FileMenuTools from Lopesoft, the latter coming with lots more right-click tools.
If you want to build yourself a genuine command window option for any directory in Windows 10/11, as it existed in earlier Windows versions, see here and here.
Or, in short (beware, only try this if you know what you're doing!):
- Start regedit and go to the key HKEY_CLASSES_ROOT\Directory\shell\cmd
- Open the permissions of the cmd key, choose Advanced, change the owner to your username, then confirm with OK twice.
- Then enter your username into the groups/usernames list, and allow it full access.
- Change the name of HideBasedOnVelocityId to ShowBasedOnVelocityId
- Perform the same steps with HKEY_CLASSES_ROOT\Directory\Background\shell\cmd
Copyright
(C) 2023; all
rights reserved. All
materials in these pages are presented for scientific evaluation
of video technologies only. They may not be copied from here and
used for entertainment or commercial activities of any kind.
We do not have any relation to and do not take any responsibility
for any software and links mentioned on this site. This website
does not contain any illegal software for download. If we, at
all, take up any 3rd party software here, it's with the explicit
permission of the author(s) and regarding all possible licensing
and copyright issues, as to our best knowledge. All external download
links go to the legal providers of the software concerned, as
to our best knowledge.
Any trademarks mentioned here are the property of their owners.
To our knowledge no trademark or patent infringement exists in
these documents; any such infringement would be purely unintentional.
If you have any questions or objections about materials posted
here, please e-mail us
immediately.
You may use the information presented herein at your own risk
and responsibility only. We do also not guarantee the correctness
of any information on this site or others and do not encourage
or recommend any use of it.
One further remark: These pages are covering only some aspects
of PC video and are not intended to be a complete overview or
an introduction for beginners.