video encoding&cleaning

Subtitle generation with Subtitle Edit and Whisper

Images to Text

DVB transmissions often carry graphic (bitmap) subtitles that many video players cannot decode. Subtitle Edit can convert them to text (*.srt) files automatically and almost error-free, with a choice of OCR engines. Just record the DVB stream including subtitles and let Subtitle Edit do its job, then multiplex the .srt files into your video file using e.g. MKVToolNix. Subtitle Edit has many more features that are worth exploring.

Speech to Text

Subtitle Edit also offers audio to subtitle conversion, based on either Vosk or Whisper.
Vosk is pretty fast (about 4x realtime on a good i5 CPU) but not perfect; it gives good results only with clearly spoken commentary.
Whisper, from OpenAI, is a lot slower, but it runs on both CPU and GPU. With a decent GPU, e.g. a GTX 980, it also achieves about 4-6x realtime speed using the "small" language model; on the CPU it is about 10 times slower. For the GPU we will need CUDA. I will describe an installation that allows Whisper to be used as a stand-alone app, offering all of its options. This works as follows:

Now you may test in a command window. Suppose you have a file named test.wav, and you have opened a command window and navigated to its directory (see the hint on getting a command prompt below):

This will download the small language model, then generate text and subtitle files for test.wav or whatever file you specified.
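A minimal sketch of such a test call, written in Python rather than typed directly at the prompt. It assumes the openai-whisper command-line tool is installed and reachable as "whisper"; the helper names whisper_command and run_whisper are made up for illustration. The options --model and --output_format are regular Whisper options, and "all" asks for every output type (text, srt and others).

```python
import subprocess
from pathlib import Path


def whisper_command(media_file, model="small", output_format="all"):
    """Build the argument list for the Whisper command-line tool.

    Assumption: the tool is on the PATH under the name "whisper".
    """
    return [
        "whisper",
        str(media_file),
        "--model", model,                  # small, medium, large, ...
        "--output_format", output_format,  # e.g. srt, txt, or all
    ]


def run_whisper(media_file, **options):
    """Run Whisper on one file; raises CalledProcessError if it fails."""
    subprocess.run(whisper_command(media_file, **options), check=True)


# Only show the command here; call run_whisper("test.wav") to execute it.
print(" ".join(whisper_command(Path("test.wav"))))
```

Passing the arguments as a list (instead of one long command string) means file names with spaces need no extra quoting.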

Some audio is difficult enough to send Whisper into error loops; you will notice this when some output lines are repeated many times instead of the subsequent text being generated. In this case, the option

--condition_on_previous_text False

will help. It may also speed up the process in these cases, and it reduces memory usage (e.g. only 2.4 GB of GPU memory instead of 3).
It is not always better, however; it may sometimes result in phrases from one line being repeated in the next one.
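To spot such error loops without reading the whole file, a small script can look for runs of identical consecutive subtitle texts. The following is only a sketch: it assumes a standard srt layout (index line, timing line containing "-->", then text), and the threshold of three repeats is an arbitrary choice, not something Whisper defines.

```python
import re


def parse_srt_texts(srt_content):
    """Return the text of each subtitle cue, in order."""
    texts = []
    # Cues are separated by blank lines
    for block in re.split(r"\n\s*\n", srt_content.strip()):
        lines = block.splitlines()
        for i, line in enumerate(lines):
            if "-->" in line:
                # Everything after the timing line is the cue text
                texts.append(" ".join(part.strip() for part in lines[i + 1:]))
                break
    return texts


def find_repeats(texts, min_run=3):
    """Return (start_index, run_length, text) for each run of identical
    consecutive cue texts that is at least min_run long."""
    runs = []
    start = 0
    for i in range(1, len(texts) + 1):
        if i == len(texts) or texts[i] != texts[start]:
            if i - start >= min_run:
                runs.append((start, i - start, texts[start]))
            start = i
    return runs
```

If find_repeats reports anything for a file, that file is a candidate for a second run with --condition_on_previous_text False.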

Let's now explore the specific requirements and results for CPU and GPU processing.

Whisper model sizes, speed, memory usage
This compares an i5 CPU (4 cores at 4 GHz) and a GTX 980 graphics card.

model   file size  RAM (GB)  RAM (GB)      RAM (GB)  RAM (GB)      speed factor  speed factor
        (MB)       CPU       CPU c.o.p.t.  GPU       GPU c.o.p.t.  CPU           GPU
small   472        1.8       1.8           2.5       3.4           0.45          5
medium  1492       4.7       4.5           6.5?      8.5?          0.18          2?
large   3015       8.8       9.4           12?       18?           0.1           1?

c.o.p.t. means --condition_on_previous_text True (the default setting); the speed factor is relative to real time.

Note that in addition to the above numbers, approx. 3 GB of extra CPU memory (system RAM) may be necessary with Windows 10, provided no other memory-intensive apps are running.

With a GPU, memory is absolutely critical: the process stops if even a little is missing. So most currently installed graphics cards may not be able to run the large model.
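Since the GPU run simply stops when memory is short, it can help to check beforehand which model fits. The sketch below just looks up the GPU memory figures from the table above (including the ones marked "?" there, which are uncertain); the function name and the idea of passing in the free memory by hand are assumptions for illustration.

```python
# GPU memory needs in GB, copied from the table above.
# Per model: (without c.o.p.t., with c.o.p.t.); medium/large values are uncertain.
GPU_MEMORY_GB = {
    "small": (2.5, 3.4),
    "medium": (6.5, 8.5),
    "large": (12.0, 18.0),
}


def largest_model_that_fits(free_gpu_gb, condition_on_previous_text=True):
    """Return the largest model whose listed memory need fits, or None."""
    column = 1 if condition_on_previous_text else 0
    best = None
    for model in ("small", "medium", "large"):  # ordered small to large
        if GPU_MEMORY_GB[model][column] <= free_gpu_gb:
            best = model
    return best


print(largest_model_that_fits(4))          # a 4 GB card such as the GTX 980
print(largest_model_that_fits(8, False))   # 8 GB, c.o.p.t. switched off
```

For a 4 GB card this yields the small model, which matches the experience described here.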

The small Whisper model already beats Vosk hands down, quality wise, but it really needs a GPU to compete for speed.
The quality gain of the medium and large models over the small one is not always obvious, but the large model knows a lot more, which eliminates many spelling errors and sometimes produces results so good that it's almost uncanny. Sometimes it gets weird, though, e.g. when a line saying "Copyright xxx" appears where there is not the slightest spoken text like that, nor any connection to xxx whatsoever.

The larger models may also be better for rare languages; I would guess they are.

The problem with graphics cards is that you need a lot of graphics memory even for the medium model. A GTX 1080 with 11 GB, for example, costs about 3 times as much as a GTX 980 with 4 GB but runs only about 50% faster, so the price/performance ratio is clearly on the side of the 980. Even more so as the 1080 would merely allow running the medium instead of the small model, and some post-editing will be necessary in both cases anyway. For very difficult cases, normally the exception, you may still run the large model on the CPU. The alternative would be a really recent high-end graphics card, like a 4080 or 4090, a four-digit investment.

That said, most of the time I'm perfectly OK with the results of the small model on a GTX 980, which runs at five times real-time speed.
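What a speed factor means in practice can be worked out with simple arithmetic: processing time is the video duration divided by the factor. The sketch below uses the speed factors from the table above (the GPU values for medium and large are marked uncertain there); the function name is made up for illustration.

```python
# Speed factors relative to real time, copied from the table above.
SPEED_FACTOR = {
    ("small", "cpu"): 0.45, ("small", "gpu"): 5.0,
    ("medium", "cpu"): 0.18, ("medium", "gpu"): 2.0,   # GPU value uncertain
    ("large", "cpu"): 0.1, ("large", "gpu"): 1.0,      # GPU value uncertain
}


def processing_minutes(video_minutes, model, device):
    """Estimated processing time: duration divided by the speed factor."""
    return video_minutes / SPEED_FACTOR[(model, device)]


for key in sorted(SPEED_FACTOR):
    model, device = key
    print(f"{model:6} {device}: {processing_minutes(90, model, device):6.0f} min for a 90 min video")
```

A 90 minute film thus takes about 18 minutes with the small model on the GPU, but several hours on the CPU.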

Preparing whisper cmd batches:

Whisper can be run quite conveniently from within Subtitle Edit (see below). But even then it may be convenient to do batch operations in the command window. This gives you more control, as you can use parameters such as --condition_on_previous_text False, and you see the text lines being generated in real time.
If you are subtitling many files at once, it may also be useful to merge subtitles into large numbers of video files automatically.
Let's first describe some tricks for batch subtitle generation:

Why all that window copy/pasting? It avoids problems with special characters in file names.
These techniques also allow for making batch files processing through entire directory trees.
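As an alternative to hand-made batch files, processing an entire directory tree can also be sketched in Python. This is not the batch method referred to above, just an illustration of the same idea. It assumes that the "whisper" tool is on the PATH and that Whisper names its output after the input file's base name (true for recent versions); the list of media suffixes and the function names are made up. Because each command is an argument list rather than a command string, special characters in file names need no escaping.

```python
from pathlib import Path

# Assumption: which file types count as media to be subtitled
MEDIA_SUFFIXES = {".mkv", ".mp4", ".ts", ".wav"}


def find_media_without_subtitles(root):
    """Walk the tree below root; return media files lacking a same-named .srt."""
    todo = []
    for path in sorted(Path(root).rglob("*")):
        if path.is_file() and path.suffix.lower() in MEDIA_SUFFIXES:
            if not path.with_suffix(".srt").exists():
                todo.append(path)
    return todo


def whisper_batch_commands(root, model="small"):
    """Build one Whisper command (as an argument list) per pending file."""
    commands = []
    for media in find_media_without_subtitles(root):
        commands.append([
            "whisper", str(media),
            "--model", model,
            "--output_format", "srt",
            "--output_dir", str(media.parent),  # write next to the video
            "--condition_on_previous_text", "False",
        ])
    return commands
```

Each entry can then be executed with subprocess.run(command, check=True); files that already have an .srt next to them are skipped, so an interrupted batch can simply be restarted.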

Post processing Whisper's srt files:

You may (and should) process srt files generated by Whisper alone, for two crucial purposes: splitting long lines (Whisper makes many very long ones), and ensuring the right character encoding by saving the files as UTF-8 with BOM. This step is not necessary if the srt files were generated from within Subtitle Edit with the right options.

Subtitle Edit's batch function serves this purpose. The following images show some options to use for better line splitting, and some options useful for batch processing.

srt split options

srt batch options

Note: Auto balance lines may fail quite often, so for reliability it may be better to leave it out.
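For those who prefer a script, the two purposes named above (splitting long lines and saving as UTF-8 with BOM) can be roughly sketched in Python. This is a simplification and not what Subtitle Edit does internally: it only re-wraps the text inside each cue and does not split a cue into several timed cues. The limit of 42 characters per line is an assumed value, and the function names are made up.

```python
import textwrap
from pathlib import Path


def split_long_line(text, max_chars=42):
    """Wrap one subtitle text into lines of at most max_chars characters."""
    return textwrap.wrap(" ".join(text.split()), width=max_chars)


def rewrap_srt(content, max_chars=42):
    """Re-wrap the text lines of every cue; index and timing stay unchanged."""
    blocks = content.replace("\r\n", "\n").strip().split("\n\n")
    out = []
    for block in blocks:
        lines = block.split("\n")
        # Expect: index line, timing line, then one or more text lines
        if len(lines) < 3 or "-->" not in lines[1]:
            out.append(block)  # leave anything unexpected untouched
            continue
        text = " ".join(lines[2:])
        out.append("\n".join(lines[:2] + split_long_line(text, max_chars)))
    return "\n\n".join(out) + "\n"


def save_utf8_bom(path, content):
    """Write the file as UTF-8 with a byte order mark (codec 'utf-8-sig')."""
    Path(path).write_text(content, encoding="utf-8-sig")
```

A typical use would be save_utf8_bom("film.srt", rewrap_srt(Path("film.srt").read_text(encoding="utf-8"))), where film.srt stands for any Whisper output file.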

You may also want to have the subtitles displayed in the preview window, overlaid on the video as when playing them back on a TV. For this, just download the mpv lib and then set your font size:

Showing subs overlaid in video window: download mpv lib

Whisper in Subtitle Edit

As mentioned, the Whisper installation described above can also be used from within Subtitle Edit. It can be selected by right-clicking on the conversion window. Yet, as far as my tests go, the GPU is not really used: while the GPU memory fills up as if the processing would happen there, it nevertheless takes place on the CPU only. Whisper 3.12 beta should solve this, but the same thing still happens, and in my latest test the subtitles generated were all rubbish. What works so far is (CPU):

The good thing here is that Subtitle Edit automatically corrects Whisper problems such as overly long lines and some timing errors. The latter option is not yet available separately in Subtitle Edit's batch processor (and it doesn't appear to do what I'd wish it to do anyway, namely correcting text that is occasionally displayed much too early and then for a very long time).
On the other hand, with the command prompt version, Whisper displays the generated text in real time, and you have more control, e.g. with the --condition_on_previous_text False parameter. Without this parameter, Subtitle Edit's conversions can indeed go badly off the rails.

Joining the subtitles with the videos

Finally, we want to merge the subtitles with the videos.
Manually, this is done with MKVToolNix. It is quite self-explanatory, so I won't go into it further here.
But you may also want to use the batch processing tricks described above, and in this case you can use the mkvmerge.exe program that comes with MKVToolNix and is found in its program folder. Run it within a command window and you'll get its help text describing the available parameters.
I recommend adding a PATH entry for MKVToolNix's program folder so you can use the tools from anywhere in a command window.
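A batch merge with mkvmerge can be sketched along the same lines. The sketch assumes that mkvmerge is on the PATH, that each video has an .srt of the same base name next to it, and that the subtitles are English ("eng"); the "_sub" output suffix and the function names are made up. The options used are -o for the output file and --language 0:eng, which sets the language of track 0 of the file that follows it (here the srt).

```python
from pathlib import Path


def mkvmerge_command(video, subtitles, output, language="eng"):
    """Build the mkvmerge argument list for one video plus one srt file."""
    return [
        "mkvmerge",
        "-o", str(output),
        str(video),
        "--language", f"0:{language}",  # applies to the srt listed next
        str(subtitles),
    ]


def merge_jobs(folder, out_suffix="_sub"):
    """One mkvmerge command per .mkv in folder that has a matching .srt."""
    jobs = []
    for video in sorted(Path(folder).glob("*.mkv")):
        srt = video.with_suffix(".srt")
        # Skip earlier outputs and videos without subtitles
        if video.stem.endswith(out_suffix) or not srt.exists():
            continue
        output = video.with_name(video.stem + out_suffix + ".mkv")
        jobs.append(mkvmerge_command(video, srt, output))
    return jobs
```

Each job can be run with subprocess.run(job, check=True). Writing to a new file name leaves the original video untouched until the result has been checked.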

How to get a Command prompt:

In older Windows versions, a command prompt can be obtained by right-clicking on a folder, then selecting 'open command prompt here'. In Windows 10/11, only PowerShell is offered. There, the equivalent of a command prompt can be obtained by selecting 'open PowerShell window' and entering "cmd" within the PowerShell window.

A genuine command prompt can also be opened from the Start Menu, ..Windows, ..System. (Right-click for options, such as Run as Administrator.) Then navigate to specific folders with cd <directory_name>. Entering cd.. gets you one level up.

A more convenient way may be to install OpenCommandPromptHere from 4dots-Software, letting you choose if you want the prompt in normal or admin mode, or FileMenuTools from Lopesoft, the latter one coming with lots more right-click tools.

If you want to build yourself a genuine command window option for any directory in Windows 10/11, as it existed in older Windows versions, see here and here.
Or, in short (beware, only try this if you know what you're doing!):
Now you can get the command prompt here option by right-clicking on a folder or on the background of an opened folder.

Copyright (C) 2023; all rights reserved. All materials in these pages are presented for scientific evaluation of video technologies only. They may not be copied from here and used for entertainment or commercial activities of any kind.
We do not have any relation to and do not take any responsibility for any software and links mentioned on this site. This website does not contain any illegal software for download. If we, at all, take up any 3rd party software here, it's with the explicit permission of the author(s) and regarding all possible licensing and copyright issues, as to our best knowledge. All external download links go to the legal providers of the software concerned, as to our best knowledge.
Any trademarks mentioned here are the property of their owners. To our knowledge no trademark or patent infringement exists in these documents; any such infringement would be purely unintentional.
If you have any questions or objections about materials posted here, please e-mail us immediately.
You may use the information presented herein at your own risk and responsibility only. We do also not guarantee the correctness of any information on this site or others and do not encourage or recommend any use of it.
One further remark: These pages are covering only some aspects of PC video and are not intended to be a complete overview or an introduction for beginners.