This is a refresh of my previous guide on creating subtitles for scenes (mainly JAV).
This version is a MAJOR improvement over the last iteration of the tool.
Here are the new features:
- Optimized for API calling.
- Supports AI transcription that takes context into account and can transcribe ‘small segments’ of audio well (which Whisper couldn’t do). This means the tool no longer has to deal with transcriptions that overlap multiple “.manual-input timings”; the whole process works one-to-one with the “.manual-input timings”.
- AI Image Transcriber that does the OCR for text that appears on screen (i.e. add a {GrabOnScreenText} tag in the .manual-input file).
- AI Image Transcriber that can describe what’s happening while someone is talking. It can also identify characters if you provide something like {VisualTraining:Hana on the left, Ena on the right} for one of the subtitles in the .manual-input file (see the hypothetical example after this list). It will generate something like this for every subtitle in the file (ongoing metadata is generated only when something changes):
{OngoingEnvironment:School nurse’s office: desk with monitor and supplies, rolling chair, metal cabinet with books, examination bed, folding privacy screens, closed pink curtains.}
{OngoingAppearance:Nurse: white lab coat over a light-blue blouse (partly unbuttoned), black skirt, black thigh‑high stockings with visible garter straps, dark heels, long wavy hair. Man (POV): seated in dark clothing (top and pants).}
{ParticipantsPoses:Man: seated, legs apart, hands resting on his thighs, facing the woman. Nurse: standing very close in front of him, torso slightly leaned forward, facing him.}
{TranslationAnalysis:Her apology reads as a casual, friendly greeting; standing very close in front of the seated listener makes it feel personal and attentive rather than formal.}
- Can translate with multiple AI personalities (‘maverick’ and ‘naturalist’ are included in the example configuration).
- All the translations can then be analyzed by an AI Arbiter that chooses the best translation and also transforms the text to adhere to subtitle standards (merging consecutive subtitles when it makes sense, max 2 lines of 50 characters, etc.); a rough sketch of those constraints follows this list.
- Can analyse multi-part scenes as if they were one long video (you just need to create a small ‘.vseq’ file listing the filenames; see the example after this list).
- Added a UI to identify speakers in a multi-speaker scene (note: it’s still a bit buggy). Creating an AI to help with that might be an option in the future. I tried diarization, but it was hit and miss; using an image AI seems like a better bet, we’ll see…
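To make the tags above more concrete, here is a purely hypothetical sketch of what two entries in a .manual-input file could look like (the timing syntax, layout and tag placement are my assumptions for illustration only; the Wiki describes the actual format):
00:03:12.400 --> 00:03:15.100
{VisualTraining:Hana on the left, Ena on the right}

00:03:18.250 --> 00:03:20.900
{GrabOnScreenText}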
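For the subtitle standards mentioned above, here is a small illustrative Python sketch of the “2 lines of 50 characters” constraint (this is not the tool’s code, just a rough idea of the rule the arbiter is asked to enforce):

import textwrap

MAX_LINE_LEN = 50  # max characters per rendered subtitle line
MAX_LINES = 2      # max lines per subtitle cue

def format_cue(text: str) -> str:
    # Wrap the cue text; a compliant cue fits in at most 2 lines of 50 characters.
    lines = textwrap.wrap(text, width=MAX_LINE_LEN)
    if len(lines) > MAX_LINES:
        raise ValueError("Cue too long: split it or rebalance with neighbouring cues")
    return "\n".join(lines)

print(format_cue("Sorry to keep you waiting, please come sit over here."))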
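Similarly, a hypothetical ‘.vseq’ file for a two-part scene might just list the filenames, one per line (assumed layout; the filenames here are made up):
my-scene-part1.mp4
my-scene-part2.mp4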
Warning:
- The default configuration created by the tool is optimized for high-quality subtitles. For example, it passes the parameter “--reasoning_effort medium” to GPT-5, which costs about 2.5 times more tokens than no thinking (“reasoning_effort low” only costs about 20% less than medium; see the rough illustration below).
- From what I could see, doing a scene with ~600 subtitles costs:
- I’m not responsible for what you do with the tool…
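As a rough illustration of those ratios (assumed numbers, not a measurement): if a no-thinking run used about 1M completion tokens for a scene, medium reasoning would be roughly 2.5M tokens, and low reasoning roughly 0.8 × 2.5M ≈ 2M, so dropping from medium to low doesn’t save as much as you might expect.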
The new process is explained in the tool’s Wiki HERE.
The application binary can be found here: github releases.


