How to create subtitles for a scene (even if you don't understand the language)

Has anyone tried Claude-3 and Gemini-Pro on Poe? They have fewer restrictions than the versions on the official websites. I translated Japanese into Traditional Chinese and the results were great, especially with Claude - its output can be used directly.

GUI subtrans is also a good choice (you need a GPT-3.5 API key), but the results are not as good as Claude-3 and Gemini-Pro.
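
If you'd rather script the translation than paste into Poe's UI, here's a minimal sketch using the official anthropic Python SDK instead (so not Poe itself; the model name, prompt wording and sample lines are all placeholder assumptions, not a tested setup):

```python
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# a small batch of Japanese subtitle lines, keyed by line number (toy data)
lines = {"1": "おはようございます。", "2": "今日はいい天気ですね。"}

prompt = (
    "Translate the values of this JSON from Japanese to Traditional Chinese. "
    "Return the same JSON structure with translated values only:\n"
    + json.dumps(lines, ensure_ascii=False)
)

message = client.messages.create(
    model="claude-3-opus-20240229",  # any Claude-3 tier should work here
    max_tokens=1024,
    messages=[{"role": "user", "content": prompt}],
)
print(message.content[0].text)
```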

Awesome, can’t wait to see it in deployment (if you decide on fully publishing it!) Your software is great.

I’m not too knowledgeable in these things, but could you point me to the ‘draft’ version of Silero VAD? Is that one of its models?
Also, I just tried Mistral, and I have to say it’s pretty concise… not sure if that’s down to me refining my prompt or just the nature of the model. Thanks for the share!

This gave me an idea, maybe you could use something like this:

  1. screencap/movie thumbnailer (e.g. one built with ffmpeg)
  2. feed the images into an img2txt stable diffusion prompt (prompt text prediction from a generated image), or something like this, or this?
  3. feed result from 2 into your JSON prompt
  4. profit?

Just an idea… you have probably already thought of something like this to help automate the process; there’s a rough sketch of steps 1-3 below. Let me know your thoughts, otherwise I’m back to experimenting :wink:
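
To make steps 1-3 concrete, something like this (the frame rate, file names and the BLIP captioning model are placeholders I haven't tuned, not recommendations):

```python
import json
import subprocess
from pathlib import Path

from PIL import Image
from transformers import BlipForConditionalGeneration, BlipProcessor

# 1. grab one frame every 30 seconds with ffmpeg
Path("frames").mkdir(exist_ok=True)
subprocess.run(
    ["ffmpeg", "-i", "scene.mp4", "-vf", "fps=1/30", "frames/%04d.png"],
    check=True,
)

# 2. caption each frame with an img2txt model (BLIP here, as one option)
processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

captions = []
for frame in sorted(Path("frames").glob("*.png")):
    inputs = processor(Image.open(frame), return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=30)
    captions.append(processor.decode(out[0], skip_special_tokens=True))

# 3. attach the captions to the translation prompt as scene context
context = json.dumps({"scene_descriptions": captions}, ensure_ascii=False)
```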

Yes, I did. Great suggestion. I like that Poe allows NSFW (even if the underlying model from the other companies might refuse to answer).
All three Claude-3 models (Haiku, Sonnet and Opus) are great. They give good translations, but Opus is a lot more cooperative than Haiku and Sonnet.
Gemini-Pro, and even GPT-4, were not as good.

This was an option in my old process. Now I simply use Whisper to do the draft; it does a better job than Silero VAD.
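
The draft step is just vanilla Whisper, roughly like this (the model size, file names and the SRT writer are illustrative, not my exact settings):

```python
import whisper

model = whisper.load_model("medium")  # larger models give better Japanese timestamps
result = model.transcribe("scene.mp4", language="ja")

def srt_time(t: float) -> str:
    """Format seconds as an SRT timestamp (HH:MM:SS,mmm)."""
    h, rem = divmod(t, 3600)
    m, s = divmod(rem, 60)
    return f"{int(h):02d}:{int(m):02d}:{int(s):02d},{int((s % 1) * 1000):03d}"

# write each recognized segment as a numbered SRT cue
with open("draft.srt", "w", encoding="utf-8") as f:
    for i, seg in enumerate(result["segments"], start=1):
        f.write(f"{i}\n")
        f.write(f"{srt_time(seg['start'])} --> {srt_time(seg['end'])}\n")
        f.write(f"{seg['text'].strip()}\n\n")
```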

Mistral Next was too concise: it didn’t translate the whole JSON that I gave it. Mistral Large translated everything. I don’t know about other use cases, though.
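
An easy way to catch that kind of truncation, assuming the batch is a flat JSON object keyed by line id (that layout is my assumption):

```python
import json

def missing_lines(source_json: str, translated_json: str) -> list[str]:
    """Return the line ids the model dropped, to catch a short reply before merging."""
    src = json.loads(source_json)
    out = json.loads(translated_json)
    return [k for k in src if k not in out]

# toy example: the model only returned one of the two lines
sent = '{"1": "おはよう", "2": "さようなら"}'
reply = '{"1": "早安"}'

dropped = missing_lines(sent, reply)
if dropped:
    print(f"model skipped {len(dropped)} line(s), e.g. {dropped[:3]}; retry or split the batch")
```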

My new process lets me ‘fake’ having vision, either by telling the AI who’s talking (which could be done in the future with a good speaker-diarization AI) or by giving a general description of what will happen on screen in the next portion of the video, to see if it improves the translation.
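
The ‘fake vision’ part just means inlining that context into the prompt; a sketch of the idea (the data layout, tag format and wording are illustrative, not my exact prompt):

```python
import json

def build_prompt(lines: dict[str, str], speakers: dict[str, str], scene_note: str) -> str:
    """Prefix each line with its speaker and prepend a scene description,
    so a text-only model gets some of what vision would provide."""
    tagged = {i: f"[{speakers.get(i, '???')}] {text}" for i, text in lines.items()}
    return (
        f"Scene context: {scene_note}\n"
        "Translate the JSON values from Japanese to Traditional Chinese, "
        "keeping the keys and the [speaker] tags unchanged:\n"
        + json.dumps(tagged, ensure_ascii=False, indent=2)
    )

print(build_prompt(
    {"1": "行くぞ。", "2": "待って！"},
    {"1": "Kenji", "2": "Yuki"},  # could come from a diarization pass later
    "Two characters argue in a parking lot at night.",
))
```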


Ok, this process is now obsolete.

I created a new topic with an updated tool/process: here