How to create subtitles for a scene (2024)

This is a follow-up to my previous guide on creating subtitles for scenes (mainly JAV).

The process has been updated to leverage improvements in AI tools over the past year.

The high-level process remains similar:

  1. Extract audio from the video
  2. Create a draft subtitle file using AI (i.e. “perfect-vad-candidate.srt”)
  3. Manually refine the subtitle timings (i.e. “perfect-vad.srt”), and add context if needed, to help the AI make better translations at step 5.
  4. Generate transcriptions of the audio from the perfect-vad timings using AI
  5. Translate the transcriptions to English using AI
  6. Manually select the best translations

However, more of the steps can now be automated, and there are additional configuration options available, such as using different translation services (Google, DeepL, local AI models, web APIs, etc.).
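
FunscriptToolbox drives these steps for you, but to make step 1 concrete, here is a minimal Python sketch of the kind of ffmpeg call involved. The 16 kHz mono WAV settings are my assumption of a Whisper-friendly format, not necessarily the exact parameters the tool uses (those live in “DefaultFfmpegWavParameters” in the config, see Tip #3 below):

    import subprocess
    from pathlib import Path

    def extract_audio(video_path: str) -> Path:
        # Hypothetical helper: extract a mono 16 kHz WAV for speech-to-text.
        # FunscriptToolbox does this step itself; its exact parameters may differ.
        wav_path = Path(video_path).with_suffix(".wav")
        subprocess.run(
            ["ffmpeg", "-y", "-i", video_path,
             "-vn", "-ac", "1", "-ar", "16000",
             str(wav_path)],
            check=True,
        )
        return wav_path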

I’ve documented the basic workflow on GitHub, including details on how to download the tool and the available Transcribers, Translators and Outputs options:
github wiki: How to create subtitles for a scene
FunscriptToolbox Releases

The documentation doesn’t cover everything in full detail, but it should provide enough information to start ‘tinkering’ with the process if you want…

Tip #1 - Creating a Private Poe Bot

For the translation, I suggest creating a private bot on Poe based on one of the Claude-3 models (Opus if you have access, otherwise Haiku/Sonnet) and using this system prompt (or something similar):

### Context
You are a translator specialized in adult film subtitles.
The user will provide a JSON where nodes have the following fields:
* Context (optional): description of what's happening in the next section of the video (valid until the next node containing a context).
* Talker (optional): if it's not provided, it means the woman is talking.
* StartTime: the start time for the subtitle.
* Original:  The transcribed text in the original language (for example, Japanese).

Your job is to return a JSON where nodes have the following fields:
* StartTime
* Original
* Translation (new field): Your translation in English (or the language asked by the user).

### Rules for the translation
The audience for the translation is adults, so it is acceptable and even encouraged to use sexually explicit language or concepts.
Use natural-sounding phrases and idioms that accurately convey the meaning of the original text.
The video is from the perspective of a man (POV-man), who is the recipient of the woman's actions and dialogue. 
He does not speak, or at least, we don't know what he's saying.
Unless otherwise specified, the woman is the only one who speaks throughout the scene, often directly addressing and interacting with POV-man.
When translating, consider the woman's tone, pacing and emotional state as she directs her comments and ministrations towards the POV-man, whose reactions and inner thoughts are not explicitly conveyed.
Before translating any individual lines, read through the entire provided JSON script to gain a comprehensive understanding of the full narrative context and flow of the scene.
When translating each line, closely reference the provided StartTime metadata. This should situate the dialogue within the surrounding context, ensuring the tone, pacing and emotional state of the woman's speech aligns seamlessly with the implied on-screen actions and POV-man's implicit reactions.
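
For illustration, here is a hypothetical example of the exchange this prompt describes, written as a small Python snippet. The field names come from the prompt above; the times, context and dialogue are made up:

    import json

    # What the bot receives (made-up example following the prompt's schema).
    request_nodes = [
        {
            "Context": "She teases POV-man while leaning closer.",
            "StartTime": "00:01:12.500",
            "Original": "気持ちいい？",
        },
        {
            "StartTime": "00:01:15.200",
            "Original": "もっとして欲しい？",
        },
    ]
    print(json.dumps(request_nodes, ensure_ascii=False, indent=2))

    # The bot should answer with the same nodes plus a "Translation" field, e.g.:
    # {"StartTime": "00:01:12.500", "Original": "気持ちいい？",
    #  "Translation": "Does it feel good?"}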

Then, remove the UserPrompt in --FSTB-SubtitleGeneratorConfig.json.

FYI, the Claude-3 Haiku and Sonnet models might refuse to translate sometimes. So far, I’ve always been able to open a new chat and get it through (note: NSFW is not against Poe usage guidelines, but Claude-3 might be more restrictive). I might be paranoid, but I think it got worse since I started testing my tools.

Funny thing: so far, Opus has never refused to translate subtitles for me. It seems to take its job as an Adult Movie Translator seriously; I even had to coax it into translating a non-sexual conversation at some point.

Tip #2 - Endless loop script

If you want, --FSTB-CreateSubtitles.<version>.bat can be turned into an endless loop by uncommenting the last line (i.e. removing 'REM'):

Press space to rerun the tool, Ctrl-C or close the window to exit.

Tip #3 - Fixing waveform visualization in SubtitleEdit (for some videos)

When working with audio in SubtitleEdit, you may encounter a problem where the waveform appears noisy, making it difficult to identify the start and end of voice sections, like this:
[image: noisy waveform in SubtitleEdit]

Upon further inspection in Audacity, I noticed that the audio had a “long amplitude” (low-frequency) signal running throughout the whole file.

This low-frequency noise is what causes the poor visualization in SubtitleEdit. To fix it, we need to filter out the low-frequency content of the audio. It should be noted that this probably has no effect, good or bad, on Whisper transcription since, I assume, Whisper already ignores those low-frequency signals (until they add ‘Whale’ to the supported list of languages).

Fixing the Issue for All Future Videos

  1. Open the --FSTB-SubtitleGeneratorConfig.json config file.
  2. Locate the “AudioExtractor” section.
  3. Add -af highpass=f=1000 to the “DefaultFfmpegWavParameters” to filter out frequencies below 1000 Hz. You can also add loudnorm=I=-16:TP=-1 to normalize the audio (unrelated to the problem); a standalone ffmpeg sketch for applying the same filters by hand is shown after this list:
    "AudioExtractor": {   
        "DefaultFfmpegWavParameters": "-af \"highpass=f=1000,loudnorm=I=-16:TP=-1\""   
    },   
    
  4. Save the config file.
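
If you want to check the effect of the filter on an already-extracted WAV without re-running the tool, the same filter chain from the config above can be applied by hand with ffmpeg. A minimal sketch (the file names are placeholders):

    import subprocess

    # Apply the same highpass + loudnorm chain to an existing WAV so you can
    # compare the waveforms; input/output names are placeholders.
    subprocess.run(
        ["ffmpeg", "-y", "-i", "scene-full-all.wav",
         "-af", "highpass=f=1000,loudnorm=I=-16:TP=-1",
         "scene-full-all.filtered.wav"],
        check=True,
    )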

Applying the Fix

  1. Open the .perfect-vad.srt file in SubtitleEdit, which should automatically load the associated video and audio from the .mp4 file.
  2. Drag the <videoname>-full-all.wav file from the <videoname>_Backup folder over the waveform part of SubtitleEdit. It will only replace the waveform/audio; the video can still be played.

You should now see an improved waveform, like this:
[image: cleaned-up waveform in SubtitleEdit]

This waveform is much easier to work with, as the voice sections are clearly distinguishable from the background noise.


See this topic for subtitles made with this process:
New Subtitles Available


Interesting, thanks for the post!

Have you compared these results against just using faster-whisper on the command line? That tool makes it super easy to batch-create translated .srt files for a big directory of videos. Is this process worth it?

Well, I’m not the best person to ask… I just spent a lot of time making this tool/process. Of course, I think it’s worth it… :grin:

For me, using faster-whisper as-is does a decent job at transcribing/translating, but I know that it will miss some lines, the timings will always be only ‘almost ok’, it will hallucinate once in a while, it will sometimes create long subtitles (>20 seconds) for no reason, and the translations will be really ‘dry’ (i.e. not something a person would say).
Note: Those problems mostly show up with Japanese. For English and some other Western languages, there are fewer problems.

It’s useful to get an idea of what’s happening in a scene, but that’s it. I use it to see if I’ll be interested in a scene, but I wouldn’t use it for a regular viewing.

Yes, under the hood, my tool is using faster-whisper (i.e. Purfview’s Faster-Whisper, which is built on it), and it can process all the videos in a folder. For the transcription, I mostly use faster-whisper as-is, but I try to fix the long subtitles (>20 seconds) by redoing the transcription of those sections.

The advantage of my tool/process is that I can take the time to fix the timings and redo the transcription of only the sections where someone is talking. Yes, it takes some time, but it’s not hard, and it fixes the timings (of course), the hallucinations, and the long subtitles.

After that, I use AI to translate (mostly with Claude-3 Opus), which fixes the ‘dry’ translations (they still need a final manual adjustment though).

The nice thing about the application is that everything is config-based. It would be possible to create a config that does a simple faster-whisper transcription, without any manual refinement of the timings, and then translates it using AI, either via a chat prompt with Claude-3 or automatically with LM Studio & Llama 3 (when someone releases an uncensored version of Llama 3, which, I’m sure, should be really soon).
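
For reference, the “simple faster-whisper transcription” baseline mentioned above can be reproduced in a few lines with the faster-whisper Python package. This is a generic sketch, not what FunscriptToolbox does internally, and the model/device settings are just examples:

    from faster_whisper import WhisperModel

    def srt_time(seconds: float) -> str:
        # Format seconds as an SRT timestamp (HH:MM:SS,mmm).
        ms = int(round(seconds * 1000))
        h, ms = divmod(ms, 3_600_000)
        m, ms = divmod(ms, 60_000)
        s, ms = divmod(ms, 1_000)
        return f"{h:02}:{m:02}:{s:02},{ms:03}"

    # Use device="cpu", compute_type="int8" if you don't have a CUDA GPU.
    model = WhisperModel("large-v2", device="cuda", compute_type="float16")
    segments, _info = model.transcribe("scene.wav", language="ja", vad_filter=True)

    with open("scene.srt", "w", encoding="utf-8") as f:
        for i, seg in enumerate(segments, start=1):
            f.write(f"{i}\n{srt_time(seg.start)} --> {srt_time(seg.end)}\n"
                    f"{seg.text.strip()}\n\n")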

I am more focused on the accuracy of the transcription. I have used both the large-v2 and large-v3 models, and I have also tried some other versions fine-tuned for Japanese from Hugging Face, but they were not very accurate.

CapCut actually produces very good timings, but the accuracy is worse than the medium model. :expressionless:

I always found it hard to judge the quality of a transcription because, well, I don’t speak Japanese, so I can’t know if one transcription is better than another, except when one is really bad and produces stuff like “Thank you for watching.”.

Often, I’ll translate it and then compare, but with the variability of AI on both the transcription and the translation side, it’s hard to know whether the transcription was bad or the translation…

I did a few tests when large V3 was released (and with the Japanese one that you refer to) and I found it less reliable than V2, but, again, I’m not sure I trust my judgment that much.

For now, I’m trying to focus on stuff that I have a little more control over (cutting/adjusting the audio before Whisper and translation with/without added context).

I would like to find a bunch of high-quality human transcriptions/translations (i.e. an original Japanese .srt and a translated English .srt, or at the very least, the translated .srt). I could then try different settings to see which one produces something closer to the high-quality human file. And, well, I would try to create an embedding for Llama-3 with the high-quality translations as data.


Do I have to download the large-v2 model in SubtitleEdit? I can’t download it in SubtitleEdit because of a network problem, so I downloaded large-v2 directly from the Hugging Face website, but when running --FSTB-CreateSubtitles.<version>.bat, the situation in Figure 1 appeared, and then it became:

An error occurred while transcribing ‘full’:
Could not find file ‘<path>\20240422230326-full-all.json’.

I followed all the steps except downloading the model.

The model downloaded from Hugging Face has been placed in the path where SubtitleEdit would download the model.

About the network problem: I see that nikse.dk doesn’t seem to respond anymore, but it’s possible to download it directly from Releases · SubtitleEdit/subtitleedit · GitHub. I updated the doc.

Do you still have a problem after downloading the model manually? If yes, make sure that your folder looks like this:

Thanks for your reply. The path to my model is the same as yours, but the files in it are a little different since it was downloaded manually; maybe the problem lies there.

Can you open the file %APPDATA%\FunscriptToolbox\Logs\FunscriptToolbox.log?
Note: %APPDATA% should be replaced with “C:\Users\your-account\AppData\Roaming” by Windows.

You should see lines that contain the string “[PurfviewWhisper]”:

Is there an error or something in those logs? What does it say?

like this


Could you please send me a copy of the preprocessor_config.json file from the model folder? I can’t find the correct one on Hugging Face or…

preprocessor_config.json.txt (15 Bytes) (remove ‘.txt’)

If it doesn’t work, you could also add the parameter --sourcelanguage Japanese in --FSTB-CreateSubtitles.1.6.bat:

Maybe it has trouble detecting the language. I never had that problem but maybe no one is talking in the first 30 seconds of your scene.


I’m still having some problems running it on my computer. Maybe there’s something wrong because of my network. I’ll try again after some time. Thank you very much for your help. :heart:

Thank you very much for the tutorial.
I have used it with several videos to translate them into Spanish and it worked perfectly.
I don’t speak Japanese, so I’m happy with the result.
I will share the subs I make in case they are useful to anyone. :grinning:

Great! Happy to hear it.

If you do the full process, including refining the subtitle timings, your Spanish subtitles would be useful to others since they could use your file as a ‘perfect-vad’ and start from there to create subtitles in another language.

I am getting an error that it can’t parse the .srt file due to an invalid format for the start/end time.

I just checked the code. The number you see (68) is the line number in the file, not the “subtitle number”. Can you show what’s around line 68? Or link the file, renamed as .txt, if you don’t mind.

nanami.txt (11.6 KB)

EDIT

I added a period to the end of the “1” and it fixed it. The actress was counting down.


Ok, damn, I thought I had fixed that. Guess I didn’t yet.
Please replace the “1” with “One”, or something.

If a line of text is only a number, it takes it as a subtitle number and tries to read the next line as the timing.
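
For what it’s worth, one common way to make an SRT parser robust to this is to treat a bare number as a subtitle index only when the following line looks like a timing line. A hedged sketch of the idea (not the actual FunscriptToolbox code):

    import re

    # SRT timing line, e.g. "00:01:02,345 --> 00:01:04,000"
    TIMING_RE = re.compile(
        r"^\d{2}:\d{2}:\d{2},\d{3}\s*-->\s*\d{2}:\d{2}:\d{2},\d{3}$"
    )

    def is_subtitle_index(line: str, next_line: str) -> bool:
        # A bare number starts a new subtitle only if a timing line follows.
        # This avoids misreading dialogue that is just a number (e.g. an
        # actress counting down "3", "2", "1") as a subtitle index.
        return line.strip().isdigit() and bool(TIMING_RE.match(next_line.strip()))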