How to create subtitles for a scene (2025)

This is a refresh to my previous guide on creating subtitles for scenes (mainly JAV).

This version is a MAJOR improvement over the last iteration of the tool.

Here are the new features:

  • Optimized for API calling.
  • Supports AI transcription that takes context into account and can transcribe ‘small segments’ of audio well (which Whisper couldn’t do). That means the tool doesn’t have to deal with transcriptions that overlap multiple “.manual-input timings”; the whole process can work one-to-one with the “.manual-input timings”.
  • AI Image Transcriber that does OCR on the text that appears on screen (i.e. add a {GrabOnScreenText} tag in the .manual-input file; see the sketch after this list).
  • AI Image Transcriber that can describe what’s happening while someone is talking. It can also identify characters if you provide something like {VisualTraining:Hana on the left, Ena on the right} for one of the subtitles in the .manual-input. It will generate something like this for every subtitle in the file (ongoing metadata is generated only when something changes):
{OngoingEnvironment:School nurse’s office: desk with monitor and supplies, rolling chair, metal cabinet with books, examination bed, folding privacy screens, closed pink curtains.}
{OngoingAppearance:Nurse: white lab coat over a light-blue blouse (partly unbuttoned), black skirt, black thigh‑high stockings with visible garter straps, dark heels, long wavy hair. Man (POV): seated in dark clothing (top and pants).}
{ParticipantsPoses:Man: seated, legs apart, hands resting on his thighs, facing the woman. Nurse: standing very close in front of him, torso slightly leaned forward, facing him.}
{TranslationAnalysis:Her apology reads as a casual, friendly greeting; standing very close in front of the seated listener makes it feel personal and attentive rather than formal.}
  • Can translate with multiple AI personalities (maverick & naturalist included in the configuration example).
  • All the translations can then be analyzed by an AI Arbiter that chooses the best translation, and also transforms the text to adhere to subtitle standards (merge consecutive subtitles if it makes sense, max 2 lines of 50 characters, etc.).
  • Can analyse multi-part scenes as if they were one long video (i.e., you need to create a small ‘.vseq’ file with the filenames).
  • Added a UI to identify speakers in a multiple-speaker scene (Note: it’s still a bit buggy). Creating an AI to help with that might be an option in the future. I tried diarization, but it was hit and miss. Using an Image AI seems like a better bet, we’ll see…
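
For reference, here is a rough sketch of how those image-transcriber tags could be placed in a .manual-input file. It assumes SRT-style cues with the tags written directly in the cue text; the cue numbers and timings are placeholders, so treat this as illustrative only and check the Wiki for the exact format:

1
00:00:12,000 --> 00:00:14,500
{VisualTraining:Hana on the left, Ena on the right}

2
00:00:31,200 --> 00:00:33,000
{GrabOnScreenText}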

Warning:

  • The default configuration created by the tool is optimized for high-quality subtitles. For example, we pass the parameter “--reasoning_effort medium” to GPT-5, which costs about 2.5 times more tokens than no thinking (“reasoning low” costs only 20% less than medium).
    From what I could see, doing a scene with ~600 subtitles costs:
    • ~50,000 poe.com points with no thinking.
    • ~130,000 poe.com points with medium thinking.
    • Not sure of the cost for the transcription made with Gemini because it’s free with a Google Account right now.
  • I’m not responsible for what you do with the tool…

The new process is explained in the tool’s Wiki HERE.
The application binary can be found here: github releases.


Case study for CRVR-286 - Yuki Rino - First time in a hotel room:

Small .vseq file created to let the tool know that it’s a multi-part scene (without the .txt):
CRVR-286-M.vseq.txt (142 Bytes)
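
As an illustration (the actual 142-byte file isn’t reproduced here), a .vseq for this scene would presumably just list the part filenames in order; the exact naming and extensions below are assumptions:

CRVR-286-A - Yuki Rino.mp4
CRVR-286-B - Yuki Rino.mp4
CRVR-286-C - Yuki Rino.mp4
CRVR-286-D - Yuki Rino.mp4
CRVR-286-E - Yuki Rino.mp4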

Manually created .perfect-vad files, one per video:
CRVR-286-A - Yuki Rino.perfect-vad.srt (19.5 KB)
CRVR-286-B - Yuki Rino.perfect-vad.srt (4.7 KB)
CRVR-286-C - Yuki Rino.perfect-vad.srt (5.0 KB)
CRVR-286-D - Yuki Rino.perfect-vad.srt (7.9 KB)
CRVR-286-E - Yuki Rino.perfect-vad.srt (1.2 KB)

Metadata generated by all the transcribers/translators/arbiter:
CRVR-286-A - Yuki Rino.wip-metadatas.srt (175.8 KB)
CRVR-286-B - Yuki Rino.wip-metadatas.srt (46.3 KB)
CRVR-286-C - Yuki Rino.wip-metadatas.srt (52.1 KB)
CRVR-286-D - Yuki Rino.wip-metadatas.srt (77.1 KB)
CRVR-286-E - Yuki Rino.wip-metadatas.srt (11.0 KB)


How safe is it to create adult subtitles with the Google API keys? Do you use a separate Google Account in case you get banned, or is it safe to use because there is some kind of encryption? I would like to try it out and share subtitles, but I don’t want to get my account banned.

Only you can ultimately answer that question. I don’t know.

The thing that I do know is that Google Gemini and GPT-5 are doing their own ‘nuanced’ censoring.

For example, when I was doing subtitles for “VRKM-1000 - Sexy Nurse Heals”, GPT-5 told me it couldn’t help unless the story was tweaked (i.e. the nurse said something that could be translated as “you boy are…”, which triggered underage censoring). I just added “{OngoingContext:The scene takes place at a university. A 19-year-old student goes to the nurse because he’s not feeling well.}”, which is true in the scene, and it did the translation using terms like “young man” instead.

Gemini seems to be doing something similar (but I’m not using it for translating right now, only for transcription).

So, if they accept this kind of thing and handle it with adjustments, then to me that means they are fine with doing NSFW.

Personally, I’m using my own account right now because I only translate “official” scenes that comply with Japanese law (consenting adult actors, etc.). In my mind, the AI might refuse to help, but I don’t think that could be a bannable offense. But, again, I don’t know.

[small addition]
Above, I meant that, as long as we don’t lie to the AI (or jailbreak it), we should be OK.

Which makes me think that I shouldn’t have added this to every user prompt in the default config:

Everyone involved, characters & actors, in this fictitious story are 18+ years old

I will remove it in the next version and let people add it for themselves when needed.


Thank you for the amazing tool.

I’m following the workflow. The video I want to transcribe is already in the FSTB-CreateSubtitles2025 folder, but when I click the --FSTB-CreateSubtitles.2.0.bat file, the following error message appears. I need some guidance to know if I’m doing something wrong.

Thanks for the detailed explanation. Do I understand it correctly that without paying for poe you cannot really use it for this process because it requires ~50,000 points for ~600 subtitles? Or can you share a possible configuration that works well with the free tier?

When there is an error while parsing the config file, the tool creates a file “--FSTB-SubtitleGenerator.config.as.json” (i.e., it creates a file with “.as.json” at the end). This file is exactly what the tool is trying to parse. The error will be written first, at the start of that file. You need to delete those lines, then go to the line shown in the error message (220, in your case) and try to fix the error (you could ask ChatGPT).

From what I can tell (i.e. SharedObjects[5].Text, which means the 6th item in the SharedObjects section), did you adjust the “# OPTICAL INTELLIGENCE (OPTINT) MANDATE” system prompt? If so, you need to know that the following lines are important; you cannot delete them:
=_______________________________________
and
_______________________________________=

Look at the “The Hybrid-JSON Format \ 2. Multi-lines string \ Example: How a Multi-Line Prompt is Processed” section in Configuration File Overview.
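
As a rough sketch of that multi-line format, assuming the system prompt entry follows the same pattern as other multi-line strings in the config (the key name below is illustrative; see the Wiki page above for the authoritative syntax):

"SystemPrompt": =_______________________________________
# OPTICAL INTELLIGENCE (OPTINT) MANDATE
...the rest of the multi-line prompt goes here, unchanged...
_______________________________________=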

If you haven’t changed the file at all (except changing the path to Whisper), I would need to see the .as.json file because the file can be parsed without any problems on my end.

I’m working on the configuration system right now to allow multiple types of “workflow” (High-Quality, Automatic, etc.).

I’ll try to explain it when I release that version. Some features will certainly have to be deactivated (e.g. visual-analysis) because, as far as I know, they are only available on a paid tier (or you would be limited to a maximum of 37 subtitles per day).


Amazing, but I am already glad you introduced me to the subject with Edit Subtitle and local transcription. Already some really cool tooling!

I got a paid GPT subscription recently. Can I generate an API key with OpenAI and use it with this? If not, can you add that function? (Claude too if possible, please, although I cancelled that subscription recently so I don’t use it anymore.)

I just released version 2.0.1. No real changes in the Transcribers/Translators but I reworked the whole installation / configuration of the application to be more flexible.

The installation now creates this:

There is a configuration at the root with all the workers disabled (i.e. transcribers, translators, output, etc.).
Then, there is the folder “Staging” (you can change the name if you like) that enables the workers for draft subtitles (i.e. Whisper with basic Google transcription). Personally, I place all the scenes that I download in that folder (e.g. Staging\IPVR-3000\IPVR-3000-A.mp4) and run the script at the root to get preliminary subtitles.
When I want to do better subtitles, I move the scene folder to “ManualHQWorkflow” and run the script again to start the ‘real’ subtitle process.
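
A rough sketch of the resulting layout, based on the description above (the file placement and scene names are illustrative; the exact contents may differ in your installation):

FSTB-CreateSubtitles2025\
  --FSTB-SubtitleGenerator.config            (root config, all workers disabled)
  --FSTB-CreateSubtitles.2.0.bat             (script run at the root)
  Staging\
    --FSTB-SubtitleGenerator.override.config (enables the draft workers)
    IPVR-3000\
      IPVR-3000-A.mp4
  ManualHQWorkflow\
    --FSTB-SubtitleGenerator.override.config (enables the HQ workers)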

In the .override file, you can see which workers are activated, and I added comments with “options” that you might choose. This file is a lot more ‘readable’ than the root config.

(see the documentation for more details)


Yes, I use OpenAI GPT-5 also (when I don’t have points left on Poe.com).

The root configuration has an entry for GPT-5 on OpenAI (i.e. “AIEngineGPT5”), but the default configuration doesn’t use it (it uses “AIEngineGPT5ViaPoe”).

Here’s an .override file that could be placed at the root, beside the .config file (remove the .txt at the end):

--FSTB-SubtitleGenerator.override.config.txt removed <= This file is now part of the installation process.

Right now, the file contains the AIEngine used for each task, identical to the root config. You could replace all the “AIEngineGPT5ViaPoe” entries with “AIEngineGPT5” and add an “APIKeyOpenAI” entry in the .private.config file (with your OpenAI key).

In that file, you could also create other AIEngines and reference them instead of the ones I used.
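
For illustration only, the switch to the OpenAI key would look roughly like this; the per-task key name (“TranslatorAIEngine”) is a made-up placeholder, so use the entries that actually appear in your generated files:

In the .override file:
  "TranslatorAIEngine": "AIEngineGPT5"        <= was "AIEngineGPT5ViaPoe"

In the .private.config file:
  "APIKeyOpenAI": "sk-...your OpenAI key..."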


Created a new version 2.0.3:

  • Added the “finish_reason” to the console and error message when calling an API (e.g. “Finish_reason: Prohibited content”).
  • In SpeakerValidationTool, added options to set all remaining items to a single speaker or to no one (and mark the job as finished).
  • Added an .override.config file at the root that can be used to override the WhisperPath and which API to use per task.
  • Added a new “Timings refiner” step that can take a transcription and try to adjust the start and end times, and also add ‘cadence markers’ in the VoiceText (e.g. “How about..10..we go in the bedroom?”, where ..10.. means that there was a pause of 1 second in the delivery of the line). It’s not perfect, but it might be useful if you use a ‘fully automatic’ subtitle workflow. Disabled by default.
  • A lot of small fixes and speed optimizations everywhere.

To give people an idea of the cost when using OpenAI GPT-5 directly (i.e. not going through Poe), I just did the ‘visual-analysis’, ‘translations’ and ‘arbitration’ of a scene with 1,900 subtitles (a usual scene has about ~500). The cost was ~$7 (in ‘flex’ mode, which is how ‘AIEngineGPT5’ is configured).
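
At that rate, ~$7 for 1,900 subtitles works out to roughly $0.004 per subtitle, so a more typical ~500-subtitle scene should land somewhere around $2, assuming a similar mix of steps.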

I’ve tried a few times to get Whisper to work, but something always goes wrong. I’m not really technically adept when it comes to command-line stuff and all the Python-based stuff. I will give this a try again to see if it works, because I’d really like to be able to make subtitles for the scripts that don’t come with one.

Well, Python is not used anymore in this version.
Also, if you still have trouble with Whisper: it’s not required anymore when you are only using the HQ workflows (Automatic or Manual), since they do the transcription using AI. You would only need to disable the “mergedvad” step, which isn’t really needed for the process; it’s just there as a last-resort backup transcription. In the --FSTB-SubtitleGenerator.override.config in the workflow folder, change enabled to false:
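
As a rough sketch of what to look for, assuming the “mergedvad” worker entry in the override file looks something like this (the exact key names may differ, so match what is already in your file):

  "mergedvad": {
    "Enabled": false    <= was true; disables the Whisper-based backup transcription
  }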


Ooh let me give this a try then. Appreciate the clarification!

Just wanted to come by and thank you again. I was able to set it up and ran one video through the process as a test. It works, and the translation quality was amazing even with the automatic HQ option! It took almost 100k Poe points for a 30-minute video with lots of lines, but the dialogue seems to match spot-on with what is happening in the video, as far as I can tell.

Created a new version 2.0.5.

The main change is that I added a step in the Manual HQWorkflow that feeds the singlevad-ai transcription, the full-ai transcription and the audio clip back to an AI for refinement. That gives a better ‘final transcription’.

Also fixed a few bugs and made some improvements.

One thing that I noticed for people who use the Automatic HQWorkflow is that it might cost a little more in API points in some cases. Sometimes the full-ai transcription creates a bunch of “Ah” subtitles during sex, and all of those will get ‘visual-analysed’, translated, etc.

Created a new version 2.0.7 (and 2.0.6).

Main changes are:

  • Added an output to generate .asig files for the parts and for the full audio.

  • It’s now possible to reduce the size of the “context” sent in a batch request. For example, the workflow now limits the inclusion of TranslationAnalysis-Visual to the 10 most recent nodes, even when 1000 context nodes are included.

  • singlevad-ai-refined now injects ‘markers’ in VoiceText if the line delivery is ‘complex’ ([whispers] ねえ、[gasp]先生も動いて…). It also generates a ‘TranslationAnalysis-Audio’, similar to what the visual-analyst does. For example:
    TranslationAnalysis-Audio: The speaker reads the words slowly and deliberately, as if they are unfamiliar and uninteresting to her. This delivery suggests she finds the topic overly academic or boring. A translation could reflect this by being flat or slightly drawn out, e.g., 'Literary... theory.'

  • Added a new default path in the workflow that takes the ‘maverick’ translation alone and formats it for subtitles using a new ‘subtitle-finalizer’. The old path with multiple translations and an arbitrator is still available as an option.

  • Updated the Maverick prompt to take advantage of the new TranslationAnalysis. It’s also using GeminiPro by default, instead of GPT-5.

  • Better handling of responses that can’t be parsed as JSON. If the tool detects that the response is not in JSON format, it generates an “error file” (“TODO_E”) that will be ignored on the next run. If it is JSON, it generates a “TODO file”, like before, that needs to be fixed by the user.

  • AudioSyncVerbs: The file order is now forced when reading input and output files. Before, it was possible that file B would be read before file A.