How to create subtitles for a scene (2024)

Zalunda · April 18, 2024, 11:00pm

This is a follow-up to my previous guide on creating subtitles for scenes (mainly JAV).

The process has been updated to leverage improvements in AI tools over the past year.

The high-level process remains similar:

1. Extract audio from the video
2. Create a draft subtitle file using AI (i.e. “perfect-vad-candidate.srt”)
3. Manually refine the subtitle timings (i.e. “perfect-vad.srt”), and add context if needed, to help the AI make better translations at step 5.
4. Generate transcriptions of the audio from the perfect-vad timings using AI
5. Translate the transcriptions to English using AI
6. Manually select the best translations

However, more of the steps can now be automated and additional configuration options are available, such as using different translation services (Google, DeepL, local AI models, web APIs, etc).

I’ve documented the basic workflow on GitHub, including details on how to download the tool and the available Transcribers, Translators, and Output options:
github wiki: How to create subtitles for a scene
FunscriptToolbox Releases

The documentation doesn’t cover everything in full detail but should provide enough information to begin to ‘tinker’ with the process if you want…

Tip #1 - Creating a Private Poe Bot

For the translation, I suggest creating a private bot on Poe based on one of the Claude-3 models (Opus if you have access, otherwise Haiku/Sonnet) and using this system prompt (or something similar):

### Context
You are a translator specialized in adult film subtitles.
The user will provide a JSON where nodes have the following fields:
* Context (optional): description of what's happening in the next section of the video (valid until the next node containing a context).
* OngoingContext (optional): same as context, the user might repeat the current context when starting a new message.
* Talker (optional): if it's not provided, it means it's the woman talking.
* StartTime: the start time for the subtitle.
* Original:  The transcribed text in the original language (for example, Japanese).
* Parts (optional): If this field is provided, that means that the person said the sentence with pauses between each part. In those cases, please return a translation that still makes sense when split, usually with each part ending with "..." or something like that. Also add a '\n' after each translated part. Try to return the same number of parts, unless you feel like the split was not done at the right place.
Your job is to return a JSON where nodes have the following fields:
* StartTime
* Original
* Translation (new field): Your translation in English (or the language asked by the user).
Never include "Context", "OngoingContext" or "Talker" in your answer. The user doesn't need those in the response.
### Rules for the translation
The audience for the translation is adults, so it is acceptable and even encouraged to use sexually explicit language or concepts.
Use natural-sounding phrases and idioms that accurately convey the meaning of the original text.
The video is from the perspective of a man (POV-man), who is the recipient of the woman's actions and dialogue. 
He does not speak, or at least, we don't know what he's saying.
Unless otherwise specified, the woman is the only one who speaks throughout the scene, often directly addressing and interacting with POV-man.
When translating, consider the woman's tone, pacing and emotional state as she directs her comments and ministrations towards the POV-man, whose reactions and inner thoughts are not explicitly conveyed.
Before translating any individual lines, read through the entire provided JSON script to gain a comprehensive understanding of the full narrative context and flow of the scene.
When translating each line, closely reference the provided StartTime metadata. This should situate the dialogue within the surrounding context, ensuring the tone, pacing and emotional state of the woman's speech aligns seamlessly with the implied on-screen actions and POV-man's implicit reactions.

And remove the UserPrompt in --FSTB-SubtitleGeneratorConfig.json:

Screenshot 2024-04-13 174256
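
To make the JSON contract in the system prompt concrete, here is a minimal Python sketch of the request and reply shapes it describes. The field names come from the prompt. The sample values, the `StartTime` format, and the helper names `response_skeleton` and `unexpected_fields` are assumptions made up for illustration; this is not code from FunscriptToolbox.

```python
import json

# Hypothetical request nodes, shaped as the system prompt describes.
# All values are placeholders, and the StartTime format is an assumption.
request_nodes = [
    {
        "Context": "Placeholder description of the scene.",
        "StartTime": "00:00:05.120",
        "Original": "<transcribed line 1>",
    },
    {
        "Talker": "Man",
        "StartTime": "00:00:09.480",
        "Original": "<transcribed line 2>",
        "Parts": ["<part 1>", "<part 2>"],
    },
]


def response_skeleton(nodes):
    """Return the node shape the prompt asks the bot to send back."""
    return [
        {
            "StartTime": node["StartTime"],
            "Original": node["Original"],
            "Translation": "",  # the bot fills this in
        }
        for node in nodes
    ]


def unexpected_fields(nodes):
    """List any fields that are not part of the requested reply shape."""
    allowed = {"StartTime", "Original", "Translation"}
    return sorted({key for node in nodes for key in node} - allowed)


print(json.dumps(response_skeleton(request_nodes), indent=2))
```

A check like `unexpected_fields` can be handy when pasting a reply back, to confirm the bot did not echo "Context", "OngoingContext" or "Talker".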

Tip #2 - Endless loop script

If you want, --FSTB-CreateSubtitles.<version>.bat can be turned into an endless loop by uncommenting the last line (i.e. remove 'REM'):

Press space to rerun the tool, Ctrl-C or close the window to exit.

Tip #3 - Fixing waveform visualization in SubtitleEdit (for some videos)

When working with audio in SubtitleEdit, you may encounter a problem where the waveform appears noisy, making it difficult to identify the start and end of voice sections, like this:

Upon further inspection in Audacity, I noticed that the track had a “long amplitude signal” (a slow, low-frequency swing) running through it:

This low-frequency noise is causing the poor visualization in SubtitleEdit. To fix this, we need to filter out the low-frequency sound in the audio, at least when inside SubtitleEdit.
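
As a rough illustration of why a high-pass filter helps here, the sketch below applies a simple first-order high-pass filter to two synthetic tones. The 5 Hz "rumble", the 2000 Hz stand-in for speech, the 16 kHz sample rate, and all function names are assumptions chosen for the demo. This is not the filter ffmpeg uses, only the same idea in its simplest form.

```python
import math


def high_pass(samples, sample_rate, cutoff_hz):
    """First-order (RC-style) high-pass filter over a list of floats."""
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)
    dt = 1.0 / sample_rate
    alpha = rc / (rc + dt)
    out = [samples[0]]
    for i in range(1, len(samples)):
        # Each output follows changes in the input but decays toward zero,
        # so slow swings are suppressed and fast ones pass through.
        out.append(alpha * (out[-1] + samples[i] - samples[i - 1]))
    return out


def sine(freq_hz, amplitude, sample_rate, seconds):
    """Generate a sine tone as a list of floats."""
    count = int(sample_rate * seconds)
    return [
        amplitude * math.sin(2 * math.pi * freq_hz * i / sample_rate)
        for i in range(count)
    ]


def peak(samples):
    """Largest absolute sample value."""
    return max(abs(x) for x in samples)


RATE = 16000  # assumed sample rate for the demo
rumble = sine(5, 1.0, RATE, 1.0)    # large, slow swing
voice = sine(2000, 0.2, RATE, 1.0)  # quieter stand-in for speech

print("rumble peak before/after:", peak(rumble), peak(high_pass(rumble, RATE, 1000)))
print("voice peak before/after: ", peak(voice), peak(high_pass(voice, RATE, 1000)))
```

With a 1000 Hz cutoff, the slow swing is reduced to a small fraction of its original size while the higher tone keeps most of its amplitude, which is what makes the voice sections stand out again in the waveform.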

The “Wav” output in the default configuration lets you generate a cleaner .wav for this purpose, without changing the .wav file used for transcription.

    {
      "$type": "Wav",
      "Description": "Wav",
      "Enabled": true,
      "FileSuffix": ".wav",
      "FfmpegWavParameters": "-af \"highpass=f=1000,loudnorm=I=-16:TP=-1\""
    }
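
For anyone who wants to try the same filter outside the tool, here is a small Python sketch that builds an equivalent ffmpeg command line. How FunscriptToolbox actually passes "FfmpegWavParameters" to ffmpeg is an assumption on my part, and the file names and the `build_ffmpeg_command` helper are hypothetical.

```python
import shlex

# Same string as "FfmpegWavParameters" in the config above (JSON escapes removed).
FFMPEG_WAV_PARAMETERS = '-af "highpass=f=1000,loudnorm=I=-16:TP=-1"'


def build_ffmpeg_command(source, destination, extra_parameters):
    """Build an ffmpeg argument list: input, extra audio options, output.

    Assumption: the extra parameters are simply placed between the input
    and the output file. The real tool may add other options.
    """
    return ["ffmpeg", "-i", source, *shlex.split(extra_parameters), destination]


command = build_ffmpeg_command("scene.mp4", "scene.wav", FFMPEG_WAV_PARAMETERS)
print(" ".join(shlex.quote(part) for part in command))
```

The resulting list can be passed to `subprocess.run(command, check=True)` if ffmpeg is on the PATH.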

How to use the fixed .wav

Open the .perfect-vad.srt file in SubtitleEdit, which should automatically load the associated video and audio from the .mp4 file.
Drag the <videoname>.wav file over the waveform part of SubtitleEdit. It will only replace the waveform/audio. The video can still be played.

You should now see an improved waveform, like this:

This waveform is much easier to work with, as the voice sections are clearly distinguishable from the background noise.

Zalunda · April 19, 2024, 2:33pm

See this topic for subtitles made with this process:
New Subtitles Available watch topic

jackb8911 · April 20, 2024, 5:58pm

Interesting, thanks for the post!

Have you compared these results against simply using faster-whisper on the command line? That tool makes it super easy to batch create translated .srt files for a big directory of videos. Is this process worth it?

Zalunda · April 20, 2024, 11:42pm

Well, I’m not the best person to ask… I just spent a lot of time making this tool/process. Of course, I think it’s worth it…

For me, using faster-whisper as-is does a decent job at transcribing/translating but I know that it will miss some transcribing, the timings will always be ‘almost ok’, it will hallucinate once in a while, it will sometimes create long subtitles (>20 seconds) for no reason, and the translations will be really ‘dry’ (i.e. not something a person would say).
Note: Those problems are mostly present for Japanese. For English and some other Western languages, there are fewer problems.

It’s useful to get an idea of what’s happening in a scene but that’s it. I’m using it to see if I’ll be interested in a scene but I wouldn’t use it for regular viewing.

Yes, under the hood, my tool is using faster-whisper (i.e. Purfview-faster-whisper, which is based on the algorithm), and it can process all the videos in a folder. For the transcription, I mostly use faster-whisper as-is but I try to fix the long subtitles (>20 seconds) by redoing the transcription of those sections.
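
As an illustration of the "long subtitles" check, here is a minimal Python sketch that scans .srt text for cues longer than 20 seconds, which would be the candidates for a second transcription pass. It is not the tool's actual code. The function names and the sample cues are made up for the example.

```python
import re

# Standard .srt timing line: HH:MM:SS,mmm --> HH:MM:SS,mmm
TIMING = re.compile(
    r"(\d{2}):(\d{2}):(\d{2}),(\d{3}) --> (\d{2}):(\d{2}):(\d{2}),(\d{3})"
)


def to_seconds(hours, minutes, seconds, millis):
    """Convert the four captured timing fields to seconds."""
    return int(hours) * 3600 + int(minutes) * 60 + int(seconds) + int(millis) / 1000


def long_subtitles(srt_text, max_seconds=20.0):
    """Return (start, end) pairs, in seconds, for cues longer than max_seconds."""
    found = []
    for match in TIMING.finditer(srt_text):
        fields = match.groups()
        start = to_seconds(*fields[:4])
        end = to_seconds(*fields[4:])
        if end - start > max_seconds:
            found.append((start, end))
    return found


# Placeholder cues, not real transcription output.
SAMPLE = """1
00:00:01,000 --> 00:00:03,500
<short line>

2
00:00:10,000 --> 00:00:42,250
<suspiciously long line>
"""

print(long_subtitles(SAMPLE))
```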

The advantage of my tool/process is that I can take time to fix the timings and redo a transcription of only the sections where someone is talking. Yes, it takes some time, but it’s not hard, and it fixes the timings (of course), the hallucinations, and the long subtitles.

After that, I use AI to translate (mostly with claude-3 Opus), which fixes the ‘dry’ translation (it still needs a final manual adjustment, though).

The nice thing about the application is that everything is config-based. It would be possible to create a config that would do a simple faster-whisper transcription, without needing to manually refine the timings, and then translate this transcription using AI, via a chat prompt with claude-3, or automatically with LM Studio & llama 3 (when someone releases an uncensored version of llama 3, which, I’m sure, should be really soon).

12341 · April 21, 2024, 9:56am

I am more focused on the accuracy of the transcription. I used both the large V2 and V3 models, I have also tried some other fine-tuned versions for Japanese on Hugging Face, but they were not very accurate.

CapCut actually produces a very good timeline, but its accuracy is worse than the medium model.

Zalunda · April 22, 2024, 3:04pm

I always found it hard to judge the quality of transcription because, well, I don’t speak Japanese so I can’t know if a transcription is better than another one, except when one is really bad and produces stuff like “Thank you for watching.”.

Often, I’ll translate it and then compare but, with the variability of AI on the transcription and translation side, it’s hard to know if transcription was bad or the translation…

I did a few tests when large V3 was released (and with the Japanese one that you refer to) and I found it less reliable than V2, but, again, I’m not sure I trust my judgment that much.

For now, I’m trying to focus on stuff that I have a little more control over (cutting/adjusting the audio before Whisper and translation with/without added context).

I would like to find a bunch of high-quality human transcriptions / translations (i.e. an original Japanese .srt and a translated English .srt, or at the very least, the translated .srt). I could then try different settings to see which one produces something closer to the high-quality human file. And, well, I would try to create an Embedding for llama-3 with the high-quality translations as data.
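
One simple way to score "closer to the human file" would be a plain text-similarity ratio. The sketch below uses Python's standard `difflib` for that. It is only a rough character-level measure, not a translation-quality metric, and the setting names and sample lines are placeholders.

```python
from difflib import SequenceMatcher


def similarity(candidate, reference):
    """Character-level similarity between 0.0 and 1.0, ignoring case."""
    return SequenceMatcher(None, candidate.lower(), reference.lower()).ratio()


def rank_candidates(candidates, reference):
    """Sort (score, name) pairs from most to least similar to the reference."""
    scored = [(similarity(text, reference), name) for name, text in candidates.items()]
    return sorted(scored, reverse=True)


# Placeholder lines: one human reference and the output of two hypothetical settings.
reference_line = "Please wait a moment."
candidate_lines = {
    "setting-a": "Please wait a moment",
    "setting-b": "Thank you for watching.",
}

for score, name in rank_candidates(candidate_lines, reference_line):
    print(f"{name}: {score:.2f}")
```

Averaging this score over every line of a scene would give one number per configuration, which is enough to compare settings even without speaking the source language.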

Havey · April 25, 2024, 9:25am

Do I have to download the large-v2 model from within SubtitleEdit? I can’t download it there because of a network problem, so I downloaded large-v2 directly from the Hugging Face website. But when I ran --FSTB-CreateSubtitles.<version>.bat, the situation in Figure 1 appeared, and then it became:

An error occurred while transcribing ‘full’:
Could not find file ‘<path>\20240422230326-full-all.json’.

I followed all the steps except downloading the model.

Havey · April 25, 2024, 9:27am

I placed the model downloaded from Hugging Face in the folder where SubtitleEdit downloads its models.

Zalunda · April 25, 2024, 1:23pm

About the network problem, I see that nikse.dk doesn’t seem to respond anymore but it’s possible to download it directly from https://github.com/SubtitleEdit/subtitleedit/releases. I updated the doc.

Do you still have a problem after downloading the model manually? If yes, make sure that your folder looks like this:

Havey · April 25, 2024, 1:55pm

Thanks for your reply, the path to my model is the same as yours, but the files in it are a little different, it was downloaded manually, maybe the problem lies here.

Zalunda · April 25, 2024, 1:59pm

Can you open this file: "%APPDATA%\FunscriptToolbox\Logs\FunscriptToolbox.log"?
Note: Windows should expand %APPDATA% to “C:\users\your-account\AppData\Roaming”.

You should see lines that contain the string “[PurfviewWhisper]”:

Is there an error or something in those logs? What does it say?
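
If the log is long, a few lines of Python can pull out only the relevant entries. This is a minimal sketch: the log path is the one given above, the helper names are made up, and the sample lines in the comments are placeholders, not real log output.

```python
import os


def log_path():
    """Build the log path from the APPDATA environment variable (Windows)."""
    appdata = os.environ.get("APPDATA", "")
    return os.path.join(appdata, "FunscriptToolbox", "Logs", "FunscriptToolbox.log")


def tagged_lines(lines, tag="[PurfviewWhisper]"):
    """Keep only the lines that contain the given tag."""
    return [line.rstrip("\n") for line in lines if tag in line]


if os.path.exists(log_path()):
    with open(log_path(), encoding="utf-8", errors="replace") as log_file:
        for line in tagged_lines(log_file):
            print(line)
else:
    print("Log file not found at:", log_path())
```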

Havey · April 25, 2024, 2:16pm

like this

Could you please send me a copy of the preprocessor_config.json file in the model file? I can’t find the correct one in huggingface or

Zalunda · April 25, 2024, 2:28pm

preprocessor_config.json.txt (15 Bytes) (remove ‘.txt’)

If it doesn’t work, you could also add the parameter --sourcelanguage Japanese in --FSTB-CreateSubtitles.1.6.bat:

Maybe it has trouble detecting the language. I never had that problem but maybe no one is talking in the first 30 seconds of your scene.


Havey · April 27, 2024, 3:06pm

I’m still having some problems running it on my computer. Maybe something is wrong because of my network. I’ll try it again after some time. Thank you very much for your help.

Hirako · April 27, 2024, 4:57pm

Thank you very much for the tutorial.
I have used it with several videos to translate them into Spanish and it worked perfectly.
I don’t speak Japanese, so I’m happy with the result.
I will share the subs I make in case they are useful to anyone.

Zalunda · April 28, 2024, 1:17pm

Great! Happy to hear it.

If you do the full process, including refining the subtitle timings, your Spanish subtitles would be useful for others since they could use your file as a ‘perfect-vad’ and start from there to create subtitles in another language.

bulgogixd · May 1, 2024, 10:08pm

I am getting an error that it can’t parse the .srt file due to an invalid format for the start/end time.

Zalunda · May 1, 2024, 10:13pm

I just checked the code. The number you see (68) is the line number in the file, not the “subtitle number”. Can you show what’s around line 68? Or link the file, renamed as .txt, if you don’t mind.

bulgogixd · May 1, 2024, 10:18pm

nanami.txt (11.6 KB)

EDIT

I added a period to the end of 1 and it fixed it. The actress was counting down.

Zalunda · May 1, 2024, 10:21pm

Ok, damn, I thought I fixed that. Guess I didn’t yet.
Please replace “1” with “One”, or something.

If a line of text is only a number, it takes it as a subtitle number and tries to read the next line as the timing.
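
Until that is fixed, a quick check like the one below can find the lines that would trip such a parser. It is a minimal Python sketch, not the tool's code: it flags every digit-only line that is not followed by a timing line, and reports line numbers in the file (like the error message does). The function name and the sample countdown are made up for the example.

```python
import re

# Standard .srt timing line: HH:MM:SS,mmm --> HH:MM:SS,mmm
TIMING_LINE = re.compile(r"^\d{2}:\d{2}:\d{2},\d{3} --> \d{2}:\d{2}:\d{2},\d{3}$")


def ambiguous_number_lines(srt_text):
    """Return 1-based line numbers of digit-only lines not followed by a timing."""
    lines = srt_text.splitlines()
    flagged = []
    for index, line in enumerate(lines):
        if not line.strip().isdigit():
            continue
        next_line = lines[index + 1].strip() if index + 1 < len(lines) else ""
        if not TIMING_LINE.match(next_line):
            # Looks like a subtitle number, but it is really subtitle text.
            flagged.append(index + 1)
    return flagged


# Placeholder countdown where the subtitle text itself is just a number.
SAMPLE = """1
00:00:01,000 --> 00:00:02,000
3

2
00:00:02,500 --> 00:00:03,500
2

3
00:00:04,000 --> 00:00:05,000
1
"""

print(ambiguous_number_lines(SAMPLE))
```

Each flagged line can then be edited by hand, for example by adding a period or spelling the number out.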