How to create subtitles for a scene (2025)

Hey @Zalunda - first - thanks for the guide!


Hey, potentially stupid question incoming:

Do you need an API key for both Gemini and Poe? Reason I’m asking is b/c I’m running into a “429 - too many requests” error on step 3 (part 7, when I run FSTB-CreateSubtitles.bat after making edits), which I think is coming from Gemini since I have a free account. On Poe, though, I still have points. Are they not interchangeable in that regard?

Again, very sorry if this is a dumb question. Very new to AI stuff and just wanted to translate some scenes I like.

Created a new version 2.0.11.

Main changes are:

  • SubtitleVerb: Merged TranscriberAudioSingleVADAI and TranscriberImageAI into a “multimodal” TranscriberAI that can receive audio and/or image. This allows us to add the image to singlevad-ai-refined’s requests and get a much better TranslationAnalysis that contains targeted image context.
  • SubtitleVerb: Added the possibility to ‘inject’ context-only binaries (mainly images) in the request. The idea is to send images taken from the ‘gap’ between VADs so that the AI has an idea of what happened since the last time someone spoke. For example: [Gap Summary: She shifted her hand to playfully touch his chest.] Trying to rouse herself, she initiates a lazy conversation about the day's plans. Her hand stroking his chest (visible in POV) adds a tactile layer to the casual question.
  • SubtitleVerb: Fixed a bug where my ‘json fixing’ code could add a ‘,’ inside a value, like a translated text (e.g., "Say \"Please\"" was replaced by "Say \",Please\"") because I didn’t ignore quotes inside a string value.
  • SubtitleVerb: Changed the response handling when the StartTime received from the AI didn’t match any of the nodes’ StartTime. Now, I’ll match the closest node if it’s less than 50 ms away. If not, the same exception as before is raised.
  • SubtitleVerb: Adjusted to use gemini-3-pro-preview instead of 2.5.
  • SubtitleVerb: Fixed so that an empty JSON array (i.e. []) is considered a valid JSON. It was not before.
  • SubtitleVerb: Re-added an explicit rule to break subtitles on characters like 。!? in ‘full-ai’ step.
  • SubtitleVerb: Allowed more formats when parsing TimeSpan (i.e., accept 1.200, 1.20, or 1.2). Before, it only accepted 3 numbers after the dot.
  • SubtitleVerb: Default configuration for ManualHQWorkflow is using Gemini 3.0 (preview) for all the steps, except full-ai, which still uses Gemini 2.5 because 3.0 is really bad with the timings.
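As an illustration of the more permissive TimeSpan parsing mentioned above, here is a minimal Python sketch (the tool itself is not Python; the function name, regex, and right-padding approach are my assumptions about the behavior described):

```python
import re
from datetime import timedelta

def parse_timespan(text: str) -> timedelta:
    """Parse 'H:MM:SS.fff'-style timestamps, accepting 1 to 3
    fractional-second digits (e.g. 1.2, 1.20, or 1.200)."""
    m = re.fullmatch(r"(?:(\d+):)?(?:(\d+):)?(\d+)(?:\.(\d{1,3}))?", text)
    if not m:
        raise ValueError(f"Not a valid timespan: {text!r}")
    h, mnt, sec, frac = m.groups()
    # With a single colon, the first captured group is minutes, not hours.
    if h is not None and mnt is None:
        h, mnt = None, h
    # Right-pad the fraction so '2', '20', and '200' all mean 200 ms.
    millis = int((frac or "0").ljust(3, "0"))
    return timedelta(hours=int(h or 0), minutes=int(mnt or 0),
                     seconds=int(sec), milliseconds=millis)
```

The key detail is the right-padding: a bare “1.2” must be read as 1 second 200 ms, not 1 second 2 ms.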

Overall, the main change (for manual HQ, at least) is that all steps now use Gemini. The problem with Poe is that it doesn’t support audio data. That means we always need Google Gemini for that step anyway.

I still have trouble knowing how much it costs to do an “average” scene (~500 subtitles), or what the quota is when only using a free account, because I created a paid account in August and it gave me $400 in promotional credits to be used until November. I’m still in that state, but I’ll know soon.

It’s not a dumb question and, yes, “429 - too many requests” comes from Gemini because you reached your limit for the day (rate limit documentation => “Requests per day (RPD) quotas reset at midnight Pacific time.”).

From what I understand on the page:

  • Gemini 2.5 Pro: 50 requests per day.
  • Gemini 3.0 Pro preview: not listed, but it might be in the same “group” as 2.5 Pro.

If true, it’s not too bad. For a 500-subtitle scene of 90 minutes duration, it would be:

  • full-ai transcription: 90 minutes / 5 minutes per request => 18 requests.
  • singlevad-ai: 500 / ~100 subtitles per request => 6 requests.
  • singlevad-ai-refined: 500 / ~30 subtitles per request => 17 requests (I’m using 15 to get better results => 34 requests)
  • translated-texts_maverick: 500 / ~100 subtitles per request => 6 requests.
  • finalized_maverick: 500 / ~200 subtitles per request => 3 requests.

Add a few for empty or partial responses => 50 to 60 requests. It would take 2 days of free requests to do.

Note: Of that list, only translated-texts_maverick and finalized_maverick can be redirected to Poe.
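For reference, the estimate above can be reproduced with a quick calculation (the batch sizes are the rough per-request numbers from the list, and are approximations; the poster pads some counts slightly and uses 15 subtitles per request for the refined step when aiming for quality):

```python
import math

# Rough per-step request estimate for a ~500-subtitle, 90-minute scene.
subtitles, minutes = 500, 90
steps = {
    "full-ai":                   math.ceil(minutes / 5),     # ~5 minutes of audio per request
    "singlevad-ai":              math.ceil(subtitles / 100),
    "singlevad-ai-refined":      math.ceil(subtitles / 30),  # ~30/request; 15/request => 34 requests
    "translated-texts_maverick": math.ceil(subtitles / 100),
    "finalized_maverick":        math.ceil(subtitles / 200),
}
total = sum(steps.values())
print(steps)
print(total)  # ~48 before retries; padding for empty/partial responses gives the 50-60 range
```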

Thanks for the clarification! Do I need to change anything to redirect translated-texts_maverick and finalized_maverick to Poe? Or is that handled automatically?

I’m certainly learning a lot just trying to translate these vids haha

It’s not handled automatically, but I might do something in the future for stuff like that. I’m thinking of adding an “AIEnginePriority” object where you could list a few AIEngines and it would use them in order (e.g., “AIEngine using free Gemini API key”, “AIEngine using Poe API key”, “AIEngine using paid Gemini API key”).

Right now, you would need to add this in ‘–FSTB-SubtitleGenerator.override.config’ at the root of the structure, in the SharedObjects array (anywhere in the array):

    {
      "$id": "AIEngineGeminiPro3.0ViaPoe",
      "$type": "AIEngineAPI",
      "BaseAddress": "https://api.poe.com/v1",
      "Model": "gemini-3-pro",
      "ValidateModelNameInResponse": false,
      "APIKeyName": "APIKeyPoe",
      "TimeOut": "00:15:00",
      "RequestBodyExtension": {
        "max_tokens": 65536,
        "extra_body": {
          "google": {
            "thinking_config": {
              "include_thoughts": true
            }
          }
        }
      },
      "UseStreaming": true,
      "PauseBeforeSendingRequest": false,
      "PauseBeforeSavingResponse": false
    },

And then, in the same file, modify the engine attached to translated-texts_maverick and finalized_maverick:

    {
      "Id": "translated-texts_maverick",
      "Engine": {
        "$ref": "AIEngineGeminiPro3.0ViaPoe"
      }    
    },
    {
      "Id": "finalized_maverick",
      "Engine": {
        "$ref": "AIEngineGeminiPro3.0ViaPoe"
      }
    },

Note: I haven’t done a lot of tests with that config but, in theory, it should be identical to calling Gemini directly.

Got it, I’ll give it a shot. Thanks so much for the help!

Ok, I did more testing. I don’t know when, but Poe improved their application and API. They now list “Input (audio)” for Gemini-3, and it works (for Gemini 2.5 too).

I see that video is not a lot more expensive than audio (i.e., only twice as many points per second). I might try doing tests with that in the future.

And they now support some settings in the API:

Before, I had to find a way to add the settings in the prompt (e.g., --reasoning_effort medium for GPT-5).

When using Gemini 3.0 via Poe, I was getting bad responses (hallucinations, etc.) when calling the API, and I wasn’t seeing the ‘thinking’ of the AI, but it got better when I added a “thinking budget” to the requests. It seems that, by default, thinking is disabled or set extremely low.

    {
      "$id": "AIEngineGeminiPro3.0ViaPoe",
      "$type": "AIEngineAPI",
      "BaseAddress": "https://api.poe.com/v1",
      "Model": "gemini-3-pro",
      "ValidateModelNameInResponse": false,
      "APIKeyName": "APIKeyPoe",
      "TimeOut": "00:15:00",
      "RequestBodyExtension": {
        "max_tokens": 65536,
        "extra_body": {
          "thinking_budget": 8192
        }
      },
      "UseStreaming": true,
      "PauseBeforeSendingRequest": false,
      "PauseBeforeSavingResponse": false
    },

Anyway, I’ll work on a new version with those settings, and with my “priority” AIEngine idea.

Create a new version 2.0.12.

Changes are:

  • SubtitleVerb: Fixed the format for audio data when sending requests to Poe (i.e., it’s not the same format as the OpenAI protocol).
  • SubtitleVerb: Added an AIEngineCollection class that allows setting a list of AIEngines that get used in order (until the first has used all its quotas). Note: I didn’t do a lot of “real world” tests on this.
  • SubtitleVerb: Improved prompts used in HQManualWorkflow.
  • SubtitleVerb: Created AIEngineCollections for GPT-5, Gemini 2.5, and Gemini 3.0 in the config.

With this version, you’ll be able to use Gemini on Poe. The only difference is that I had to ‘hardcode’ (well, in the config) the number of thinking tokens. When calling the Google API, you can set it to -1 and it decides how many tokens to use automatically. I haven’t found a way to do that with Poe. If I set the thinking_budget to -1, I get a Bad Request response.

Created a new version 2.0.13:

Changes are:

  • Created vendor-specific AIEngineAPI (Google, OpenAI, Poe). This allows the use of specific vendor features like ‘mediaResolution = “MEDIA_RESOLUTION_LOW”’ on Google (saves a lot of tokens). For Google, it also gives more detailed information on the number of tokens used by modality (text, audio, or images).
  • Created a new TranslatorSubtitleConformer ‘in code’, instead of using AI. It’s faster, costs nothing, and, I think, it does a better job than the AI.
  • Improved the CostReport (“.cost.txt”) a lot (tokens used by modality and “prompt section”, estimated $ based on the engine, etc.).
  • Moved all the AIEngine definitions to the .override file to make it easier for the user to adjust, if needed.
  • Removed the ‘arbitrer’ steps as an option (translated-texts_naturalist, arbitrer-choice, arbitrer-final-choice). That path was far inferior to the other process, and I was tired of updating it.
  • A lot of small changes to try to find a good balance between subtitle quality vs. the cost to create.
  • Changed the config to use Gemini 2.5 by default because I found out that a ‘free account’ cannot use 3.0 via the API. If you have a Poe or paid Google Account, it’s possible to change it back to 3.0 in the override file. Gemini 3.0 seems to be noticeably better than 2.5 only when analyzing images (i.e., singlevad-ai-refined).
  • Minor: Added the file index when converting time from global time to the time specific to a file (e.g., 1:23:14.222 => “2, 0:14:16.252”, where the 2 at the start means the second file).
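The global-to-file conversion above amounts to subtracting the durations of the preceding files. A small illustrative Python sketch (the function name is mine, and the example durations are made up to match the “2, 0:14:16.252” example; this is not the tool's actual code):

```python
from datetime import timedelta

def global_to_file_time(global_time: timedelta, durations: list[timedelta]):
    """Map a time in the concatenated scene to (file_index, local_time).
    `durations` lists each file's length, in order; the index is 1-based."""
    elapsed = timedelta(0)
    for index, duration in enumerate(durations, start=1):
        if global_time < elapsed + duration:
            return index, global_time - elapsed
        elapsed += duration
    raise ValueError("time is past the end of the last file")
```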

My free credits on Google ran out, and I was able to see how much the requests cost and what’s included in a ‘free account’ (i.e., Gemini 3.0 Pro is **not** available on a free account). Knowing that, I made some adjustments to the config to reduce the overall cost.

Using a paid account, it still costs about US$5-8 for a 90-minute, ~500-subtitle scene. It can get more expensive when you are dealing with 1500+ subtitle scenes.

That being said, if you are patient and don’t mind doing the subtitles over a few days, the 50 requests per day on the free Google account are all you need. If you plan on doing a scene in the future, then whenever you have ‘free requests’ available, I suggest creating the .vseq file and running the application so that the ‘full-ai’ step can be completed already (i.e., ~18 requests for a 90-minute scene). For smaller scenes, 50 requests should be enough to complete the rest of the process when you are ready to take the time to correct the subtitle timings, add some context, etc.

I’m now able to do a “single actress” 90-minute scene in less than 90 minutes of my time, if the API is not too finicky (i.e., returns badly formatted JSON, or is too prudish and returns a “prohibited” error). I sometimes have to switch to Grok-4 to complete some sections of a scene.

Note: My free “requests per day” didn’t seem to reset today. I will stop speculating on how the Google Free Tier works since it’s not well documented by design, it seems.

First of all, fantastic tool, and incredible progress on the ease of generating subtitles.
I used the automatic flow, and it generated pretty good results, I think, without needing much editing afterwards.
It took me a couple of hours, but most of that was understanding the process; I have no doubt it would be much faster now.

A few small details regarding the documentation that could maybe be improved:

  1. When I first used it, it asked for an OpenAI key to be able to do the visual analysis, but this isn’t mentioned in the doc.
  2. When getting a cost error, it’s not immediately clear which backend is causing the issue. It generates a cost file, but it wasn’t clear to me how to tell from it where the issue came from (the file was mostly about Poe when the issue was with Google).
  3. How to update the configuration to select the backend could be a bit more documented (or maybe I missed something?); it took me a while to understand what to change. I set up my Google account so I have plenty of free credits there, so I wanted to use that instead of Poe, but it took a bit to realize that I needed to update the Workers section to do that.

Apologies if this is obvious or if I just missed this. Where do I put the vseq file for multi-part scenes? Best regards!

Created a new version 2.0.14:

Changes are:

  • Implemented BatchMode for Google API (which costs half the price).
  • Changed the full-ai engine to AIEngineGeminiPro2.5OnGoogle because it’s the only combination that returns valid timings. Gemini 3.0 and Poe seem to ‘shorten’ the audio before giving it to the AI.
  • Added code to fix ‘unescaped quote’ inside of a string in an AI JSON response (ex. "VoiceText": "Say "Please" now" => "VoiceText": "Say \"Please\" now").
  • Added a TranscriptionToIgnorePatterns array to the full-ai step. If the text returned by the AI for a node fits exactly one of the patterns, it will not be included in the transcription (e.g., aah, hmm).
  • Added an Enabled property to AIEngine. It is used only when the engine is part of a collection. When an engine needs an APIKey and it’s not set in ‘.private.config’, the application also considers the engine ‘disabled’.
  • Added SupportAudio and SupportImage properties on AIEngine. If an engine does not support audio/image, that information will be stripped from the request.
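The ‘unescaped quote’ fix in the list above can be sketched as a small heuristic: for a `"Key": "value"` line, escape any bare quotes strictly inside the value. This is an illustrative Python sketch, not the tool’s actual code; the function name and regex are my assumptions:

```python
import json
import re

def fix_unescaped_quotes(line: str) -> str:
    """Escape bare quotes inside the value of a '"Key": "value"' line,
    e.g. '"VoiceText": "Say "Please" now"' => '"VoiceText": "Say \\"Please\\" now"'."""
    m = re.match(r'(\s*"[^"]+"\s*:\s*")(.*)("\s*,?\s*)$', line)
    if not m:
        return line
    prefix, value, suffix = m.groups()
    # Escape quotes in the value that aren't already escaped.
    fixed = re.sub(r'(?<!\\)"', r'\\"', value)
    return prefix + fixed + suffix
```

The greedy `(.*)` keeps the outermost closing quote (and an optional trailing comma) out of the value, so only the inner quotes get escaped.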

It needs to be in the same folder as the .mp4.


Yes, the doc might need some updating now that the process is more stable. It’s been a while since I updated it. The basic process is the same, but there have been ‘tweaks’ everywhere.

Note:
I really encourage you to try the ‘manual’ path. The only difference is that you will have to fix the timings by hand. With the Gemini full-ai transcription, it’s pretty reliable in that it won’t miss any talking (it might have too much, though). That means that, when watching the video to fix the timings, you can skip ahead when there is a long stretch where no one is talking (e.g., during sex). For some videos, I end up taking only ~50% of the duration to fix the timings (i.e., 30 minutes for a 60-minute scene).

The manual path has a better TranslationAnalysis, in my view, since it contains information about ‘how it was said’ and only the visual cues that are important for this specific translation.
Manual:

{TranslationAnalysis-Audio:Gap Summary: She moved from the center of the room to the sofa, kneeling on it to clear away the clutter.
She gestures towards a spot on the sofa she is clearing. Her tone is efficient and slightly bossy, directing him to the space she decided on rather than asking him where he wants to sit. 'Kocchi' (Here/This way) serves as a command disguised as guidance.

Option 1 (Directive): "Ah, right. Over here."
Option 2 (Casual Command): "Okay, yeah. This way."}

The automatic path, with the ‘old’ visual-analysis, gives a lot of visual information, but most of it is redundant or unnecessary. And it doesn’t include any ‘voice delivery’ analysis.

{OngoingEnvironment:School nurse’s office: desk with monitor and supplies, rolling chair, metal cabinet with books, examination bed, folding privacy screens, closed pink curtains.}
{OngoingAppearance:Nurse: white lab coat over a light-blue blouse (partly unbuttoned), black skirt, black thigh‑high stockings with visible garter straps, dark heels, long wavy hair. Man (POV): seated in dark clothing (top and pants).}
{ParticipantsPoses:Man: seated, legs apart, hands resting on his thighs, facing the woman. Nurse: standing very close in front of him, torso slightly leaned forward, facing him.}
{TranslationAnalysis:Her apology reads as a casual, friendly greeting; standing very close in front of the seated listener makes it feel personal and attentive rather than formal.}

Note #2:
I might take some time to update the automatic path to use steps similar to the manual one, maybe with “padding” on the audio since I can’t trust the timings 100%.

It might be a documentation problem, but the cost file should not be used to diagnose problems. It’s only there to give you a summary of all the calls/costs that were made so far.

When an error occurs, you should look at the <videoname>.TODO_<stepname>_<number>.txt file or a TODO_E file in the folder.
TODO_E will contain an error that cannot be fixed by the user. If you rerun the application, the file will be moved to the backup folder, and the request will be sent again to the AI.
TODO will contain an error that occurred, usually a JSON parsing error. It will also contain the response received. It expects the user to fix the error, save the file, and rerun. If the JSON is fixed, the application will parse the nodes in the fixed file and continue with the next nodes. If you don’t know how to fix the error, you can also simply delete the file and rerun the request. Fixing it by hand mainly saves you the cost of sending the request again.
Note: Most of the JSON errors are auto-fixed by the application, but there are still a few types of errors that I don’t auto-fix.

Yes, this is a part that needs documenting. I’ve been making a lot of changes in the config file and overrides on each version, so it might be hard to follow.

To change the engine used (e.g., Gemini-Pro instead of Poe), you can change the worker:

  "Workers": [
    {
      "TranscriptionId": "full-ai",
      "Engine": {
        "$ref": "AIEngineGeminiPro2.5OnGoogle"
      }
    },

But, now that I’ve defined collections for most of the models, you can also change the collection in the SharedObjects section:

  "SharedObjects": [
    {
      "$type": "AIEngineCollection",
      "$id": "AIEngineGeminiPro2.5",
      "Engines": [
        // [Option] Change the order if you want to use your Poe points first.
        { "$ref": "AIEngineGeminiPro2.5OnGoogle" },
        { "$ref": "AIEngineGeminiPro2.5OnPoe" }
      ],
      "SkipOnQuotasUsed": true,
      "SkipOnServiceUnavailable": true
    },
  ...

And, when it’s a collection, there is yet another behavior: only the engines that have a key will be considered. If you don’t have an APIKey defined for Gemini, for example, Poe will be used even if it’s the second item in the list.


Ok, good to know, I’ll try that next time then.
The scene I did was pretty simple, so I only had a few timings to fix at the end in the generated file.

I thought I was doing everything to the letter, but I get an error.

Do you happen to know an easy fix? I couldn’t find anything in the troubleshooter. Or I didn’t understand it :P. That could also be a major problem :smiley:

Do you get the same error if you try to rerun the application?

This looks like a local error when I call ‘ffmpeg’ to try to generate a screenshot. ffmpeg might return without an error even though it didn’t generate the file.

Can you try running this in a command prompt in the folder you are working in? (with the right video name)

    "%APPDATA%\FunscriptToolbox\ffmpeg\ffmpeg.exe" -ss 2:00 -i "KAVR-440-4-8K.mp4" -frames:v 1 test.jpg

And see if it generates a test.jpg file.


Thanks for the help, this is what happens

Try it in a ‘normal’ command prompt. I start a prompt with “cmd”, not a PowerShell prompt. It should do the same thing.

Or run this in PowerShell:

    & "$env:APPDATA\FunscriptToolbox\ffmpeg\ffmpeg.exe" -ss 2:00 -i "KAVR-440-4-8K.mp4" -frames:v 1 test.jpg