# funscript-ai — open-source audio → funscript that holds still during dialogue

TL;DR — A free, open-source tool that generates a .funscript from an audio track. Unlike beat/energy-based generators, it uses pretrained neural networks to understand what kind of sound it’s hearing, so it stays still while a character is just talking and only strokes during the actual action. Built specifically for spoken-voice / ASMR audio works (e.g. Japanese doujin voice dramas), not music/PMV.

GitHub: HZXXXC/funscript-ai (MIT license)


Hi everyone, I’m HZXXXC. I’m not a professional programmer — this started as a personal itch and I built it over a few intense nights with a lot of help from AI coding tools. I’m sharing it because I think the approach might be interesting to this community, and I’d genuinely love feedback.

The problem I was trying to solve

Most audio-driven generators map “loud = move / beat = stroke”. That works great for music and PMVs, but it falls apart on voice-driven audio dramas: the toy twitches during the intro monologue, moves while the character is just talking, and can’t tell moaning apart from impact sounds. The single thing I cared about most was: don’t move when nobody is doing anything — only when there’s actual action.

What makes it different

Existing audio tools (FunscriptDancer, F.A.P.S, etc.) are excellent and I learned a lot from them — but they’re mostly rhythm/energy-driven or rely on speech-to-text keyword matching. funscript-ai is content/semantics-driven and language-independent (it never transcribes anything, so Japanese/Chinese/etc. all work the same). It combines three signal sources and lets each do what it’s best at:

  1. PANNs CNN14 (pretrained on Google AudioSet, 527 sound classes) — recognizes Speech / Narration / Whispering / Moan / Pant / Breathing with high confidence. This is the semantic layer.
  2. Silero VAD — a dedicated voice-activity neural net that gives precise “a person is speaking here” timestamps. Speech regions are hard-gated to HOLD (no motion).
  3. Multi-band acoustic physics (low/voice/high STFT bands + an absolute RMS-dB silence gate) — detects the rhythmic impact sounds that the neural net is actually weak at (AudioSet’s “slap” class is mostly face-slaps, a different spectral signature).
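To make the gating idea concrete, here is a minimal sketch of an absolute RMS-dB silence gate combined with a hard speech gate. It is not the project's actual code: the function names, the -50 dB floor, and the frame layout are assumptions for illustration, and the speech regions are assumed to arrive as `(start_sec, end_sec)` pairs from a VAD such as Silero. The band splitting is left out to keep it short.

```python
import math

def frame_rms_db(samples, frame_len):
    """RMS level in dBFS for each non-overlapping frame of float samples in [-1, 1]."""
    levels = []
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        # Floor the RMS so digital silence does not hit log10(0).
        levels.append(20.0 * math.log10(max(rms, 1e-10)))
    return levels

def motion_allowed(levels_db, frame_sec, speech_regions, silence_db=-50.0):
    """True for a frame only if it is above the silence floor AND outside all speech regions."""
    allowed = []
    for i, db in enumerate(levels_db):
        mid = (i + 0.5) * frame_sec  # frame centre, in seconds
        in_speech = any(start <= mid < end for start, end in speech_regions)
        allowed.append(db > silence_db and not in_speech)
    return allowed

# Three frames: digital silence, a loud frame, a loud frame inside a speech region.
samples = [0.0] * 100 + [0.5] * 200
levels = frame_rms_db(samples, frame_len=100)
gate = motion_allowed(levels, frame_sec=1.0, speech_regions=[(2.0, 3.0)])
```

Because the floor is absolute rather than relative to the track's own loudness, a quiet intro stays gated even if the whole recording is quiet.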

A joint decision tree then labels every segment as holding / gentle / intense / climax, and a motion generator produces stroke points with real device physics applied during generation (speed limit, min interval) — not as a lossy afterthought.
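A rough sketch of what "decision tree plus physics during generation" can look like is below. The four labels come from the description above; everything else (the function names, the thresholds, the inputs `impact_rate_hz` and `moan_score`, and the speed and interval limits) is hypothetical and only there to show the shape of the idea.

```python
def label_segment(is_silent, is_speech, impact_rate_hz, moan_score):
    """Toy decision tree; all thresholds are illustrative assumptions."""
    if is_silent or is_speech:
        return "holding"
    if moan_score > 0.8 and impact_rate_hz > 3.0:
        return "climax"
    if impact_rate_hz > 1.5:
        return "intense"
    return "gentle"

def generate_strokes(start_ms, end_ms, period_ms, low, high,
                     max_speed=400.0, min_interval_ms=60):
    """Alternate between high and low, clamping each move while generating it.

    max_speed is in position units per second; min_interval_ms is the
    shortest allowed gap between two points.
    """
    step = max(period_ms // 2, min_interval_ms)
    max_delta = max_speed * step / 1000.0  # largest move the device can make per step
    points = [{"at": start_ms, "pos": high}]
    target = low
    t = start_ms + step
    while t <= end_ms:
        prev = points[-1]["pos"]
        delta = max(-max_delta, min(max_delta, target - prev))
        points.append({"at": t, "pos": round(prev + delta)})
        target = high if target == low else low
        t += step
    return points

label = label_segment(False, False, impact_rate_hz=2.0, moan_score=0.3)
strokes = generate_strokes(0, 1000, period_ms=500, low=0, high=100, max_speed=200.0)
```

With a 200 units/s limit and 250 ms steps, each move is capped at 50 units, so the stroke range shrinks to fit the device instead of being truncated by a post-processing pass.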

One design choice worth mentioning: during silence/dialogue it holds at full depth (pos=100), not at a mid-point. I reverse-engineered some professional human-made scripts and this “held, not idling” behavior is what makes pauses feel right. The details are in the design notes (docs/DESIGN.md).
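As a small illustration of the hold behavior, the sketch below writes a hold as two points at the same position, so a player interpolates a flat line between them. The helper names are hypothetical and this is not the project's code; the `actions` / `at` / `pos` layout is the usual funscript JSON shape (time in milliseconds, position from 0 to 100), and other metadata fields are omitted here.

```python
import json

HOLD_POS = 100  # hold value taken from the post

def hold_points(start_ms, end_ms, pos=HOLD_POS):
    """Two points at the same position pin the device for the whole span."""
    return [{"at": start_ms, "pos": pos}, {"at": end_ms, "pos": pos}]

def to_funscript(actions):
    """Serialize actions, sorted by time, as a minimal funscript JSON string."""
    ordered = sorted(actions, key=lambda a: a["at"])
    return json.dumps({"actions": ordered})

# A 5-second dialogue hold followed by the first stroke of an action segment.
actions = hold_points(0, 5000) + [{"at": 5400, "pos": 20}]
script_text = to_funscript(actions)
```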

Status: experimental, learning project

This is early and experimental. It’s a heuristic pipeline, not magic — output should always be reviewed before use, and the thresholds are tuned from a small number of works I tested on. It runs on CPU (~15s for a 12-min track), has a one-click GUI (Gradio), a CLI, and a Python API. Model weights download automatically on first run.

I wrote up the full v1 → v9 evolution (including all the dead-ends) in docs/DESIGN.md because I found the failures more instructive than the final design.

What I’m hoping for

I’m sharing this in a learning spirit — I’m not claiming it beats anything, and I’m very aware of its limits. I’d love to hear:

  • New ideas / approaches — especially anyone who has tried training a small model on paired audio↔funscript data.
  • Your experience if you try it — what works, what feels wrong, on what kind of content.
  • Critiques of the design — I’m sure there are better ways to do the voice-vs-impact separation.

Thanks for reading, and thanks to this community + the authors of PANNs, Silero VAD, FunscriptDancer and F.A.P.S for the inspiration. Open to all feedback. :folded_hands:
