This is a refresh of my previous guide on creating subtitles for scenes (mainly JAV).
This version is a MAJOR improvement over the last iteration of the tool.
Here are the new features:
- Optimized for API calling.
- Supports AI transcription that takes context into account and can transcribe ‘small segments’ of audio well (which Whisper couldn’t do). This means the tool no longer has to deal with transcriptions that overlap multiple “.manual-input timings”; the whole process works one-to-one with the “.manual-input timings”.
- AI Image Transcriber that does the OCR for text that appears on screen (i.e. add a {GrabOnScreenText} tag in the .manual-input file).
- AI Image Transcriber that can describe what’s happening while someone is talking. It can also identify characters if you provide something like {VisualTraining:Hana on the left, Ena on the right} for one of the subtitles in the .manual-input file (see the hypothetical example after this list). It will generate something like this for every subtitle in the file (ongoing metadata is generated only when something changes):
{OngoingEnvironment:School nurse’s office: desk with monitor and supplies, rolling chair, metal cabinet with books, examination bed, folding privacy screens, closed pink curtains.}
{OngoingAppearance:Nurse: white lab coat over a light-blue blouse (partly unbuttoned), black skirt, black thigh‑high stockings with visible garter straps, dark heels, long wavy hair. Man (POV): seated in dark clothing (top and pants).}
{ParticipantsPoses:Man: seated, legs apart, hands resting on his thighs, facing the woman. Nurse: standing very close in front of him, torso slightly leaned forward, facing him.}
{TranslationAnalysis:Her apology reads as a casual, friendly greeting; standing very close in front of the seated listener makes it feel personal and attentive rather than formal.}
- Can translate with multiple AI personalities (‘maverick’ and ‘naturalist’ are included in the example configuration).
- All the translations can then be analyzed by an AI Arbiter that chooses the best translation and also transforms the text to adhere to subtitle standards (merging consecutive subtitles when it makes sense, max 2 lines of 50 characters, etc.); a rough sketch of those constraints follows this list.
- Can analyse multi-part scenes as if they were one long video (you just need to create a small ‘.vseq’ file listing the filenames; see the example after this list).
- Added a UI to identify speakers in a multi-speaker scene (note: it’s still a bit buggy). Creating an AI to help with that might be an option in the future. I tried diarization, but it was hit and miss; using an image AI seems like a better bet, we’ll see…
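To make the tags above more concrete, here is a purely hypothetical sketch of what two entries in a .manual-input file could look like (the timing syntax, layout and tag placement are my assumptions for illustration only; the Wiki describes the actual format):
00:03:12.400 --> 00:03:15.100
{VisualTraining:Hana on the left, Ena on the right}

00:03:18.250 --> 00:03:20.900
{GrabOnScreenText}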
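For the subtitle standards mentioned above, here is a small illustrative Python sketch of the “2 lines of 50 characters” constraint (this is not the tool’s code, just a rough idea of the rule the arbiter is asked to enforce):

import textwrap

MAX_LINE_LEN = 50  # max characters per rendered subtitle line
MAX_LINES = 2      # max lines per subtitle cue

def format_cue(text: str) -> str:
    # Wrap the cue text; a compliant cue fits in at most 2 lines of 50 characters.
    lines = textwrap.wrap(text, width=MAX_LINE_LEN)
    if len(lines) > MAX_LINES:
        raise ValueError("Cue too long: split it or rebalance with neighbouring cues")
    return "\n".join(lines)

print(format_cue("Sorry to keep you waiting, please come sit over here."))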
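Similarly, a hypothetical ‘.vseq’ file for a two-part scene might just list the filenames, one per line (assumed layout; the filenames here are made up):
my-scene-part1.mp4
my-scene-part2.mp4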
Warning:
- The default configuration created by the tool is optimized for high-quality subtitles. For example, it passes the parameter “--reasoning_effort medium” to GPT-5, which costs about 2.5 times more tokens than no thinking (“reasoning_effort low” only costs about 20% less than medium; see the rough illustration below).
- From what I could see, doing a scene with ~600 subtitles costs:
- I’m not responsible for what you do with the tool…
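As a rough illustration of those ratios (assumed numbers, not a measurement): if a no-thinking run used about 1M completion tokens for a scene, medium reasoning would be roughly 2.5M tokens, and low reasoning roughly 0.8 × 2.5M ≈ 2M, so dropping from medium to low doesn’t save as much as you might expect.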
The new process is explained in the tool’s Wiki HERE.
The application binary can be found here: github releases.


