Sure, I can share my very basic understanding of those concepts, but it was mostly the extensive Ultralytics documentation, along with DeepSeek / ChatGPT, that got me to the little I do understand.
Anyway, to retrain the model you are using while keeping the existing classes and adding new classes for nails and hands, my understanding is that you would want to reuse the initial dataset, augmented with nails and hands images / boxes, and retrain on that combined set so the model does not lose its ability to identify the original classes with high confidence.
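A rough sketch of what I mean, using the Ultralytics API; the paths, the merged data.yaml and the class list are placeholders for your actual setup, so treat it as an idea rather than a recipe:

```python
# Sketch only: fine-tune the existing model on a merged dataset whose data.yaml
# lists ALL classes (the original ones plus the new "hand" and "nail" ones).
# Paths and names below are placeholders.
from ultralytics import YOLO

# start from the weights you already trained (or a stock yolo11 checkpoint)
model = YOLO("runs/detect/train/weights/best.pt")

# merged.yaml points to the original images/labels plus the new hand/nail images,
# with the label files re-indexed so every class keeps a unique, stable id
model.train(
    data="merged.yaml",   # names: [original classes..., "hand", "nail"]
    epochs=100,
    imgsz=640,
    freeze=10,            # optionally freeze early layers to limit forgetting
)
```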
I trained a transformer model yesterday and hand-coded this editor, with suggestions / predictions based on the last dozen scripted positions, which you can accept with Enter or simply override, etc.
The positions on the left are the ones I input; to the right of the red bar are the suggestions. Disregard how irrelevant the positions are, it is just to showcase the prediction based on the previous positions.
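For the curious, the idea boils down to something like this (a bare-bones sketch, not the actual editor code; the sizes and names are arbitrary): the last dozen positions go in as a sequence, and the model regresses the next one.

```python
# Bare-bones sketch of the predictor idea, not the actual editor code.
# Input: the last 12 scripted positions (normalized 0-1); output: the next one.
import torch
import torch.nn as nn

SEQ_LEN = 12

class NextPosPredictor(nn.Module):
    def __init__(self, d_model=64, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Linear(1, d_model)                       # scalar position -> vector
        self.pos = nn.Parameter(torch.zeros(SEQ_LEN, d_model))   # learned positional encoding
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)
        self.head = nn.Linear(d_model, 1)                        # regress the next position

    def forward(self, x):                                        # x: (batch, SEQ_LEN, 1)
        h = self.encoder(self.embed(x) + self.pos)
        return self.head(h[:, -1])                               # read off the last token

# usage: feed the last dozen positions and show the output as the suggestion
model = NextPosPredictor()
suggestion = model(torch.rand(1, SEQ_LEN, 1))
```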
I am actually planning to include this type of prediction in my unsupervised approach, for when both the YOLO tracking and the Kalman predictors fail.
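Roughly where it would plug in, per frame (all the names below are placeholders, not the actual pipeline code):

```python
# Fallback order per frame, as a rough idea; every function name is a placeholder.
def estimate_position(frame, tracker, kalman, seq_model, history):
    box = tracker.update(frame)                    # 1) YOLO tracking
    if box is not None:
        return position_from_box(box)
    if kalman.is_confident():                      # 2) Kalman prediction when tracking drops
        return kalman.predict()
    return seq_model.predict_next(history[-12:])   # 3) transformer guess from recent history
```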
Definitely, if it gets any traction / interest at some point.
Not sure how or when; the few help offers I received so far did not really move the original project forward, so I actually started over from scratch with the YOLO training approach.
Not necessarily, I meant a full editor similar to the OpenFunscripter project (where you can not only script manually but also extend the editor's functionality via plugins) that would also contain a layer of AI features on top. And if it sounds like I basically want OFS but better, it's because I want OFS but better lol. It was very disappointing to see that project get canned when it had so much potential, but on the flip side it was not easy to extend because of the author's choice of underlying framework / language. But yeah, I feel like this community would benefit greatly from something that takes the best of all existing funscripter environments, puts them in one package and offers AI features to boot. Probably just a pipe dream though.
Further refining of the algorithm; footage below of unsupervised funscript generation (green in the graph) compared to the human-made funscript (blue in the graph) for 4 randomly picked sections of the video:
Still tuning; mouth and hand movements are now handled better during blowjob / handjob. Still need to test footjob and adjust amplitude globally. Working on it. Here is the result of a generated funscript vs a human-made script that the scripter considers "exaggerated, as per his taste": it got most of the strokes right and in sync, some human-made strokes seem out of context when analyzed closely, and there are some erroneous interpretations from the algorithm; the amplitude still needs to be tuned… Fully unsupervised.
So is this a fine-tuning of a vision transformer, or did you generate embeddings for each frame from some sort of convnet and treat those embeddings as the sequence?
Thanks for your offer. If you have a VR video for which you'd like to generate a funscript, feel free to suggest one. I would be glad to get some feedback and assess what needs tuning and what needs a full overhaul…
In the meantime, I just picked a video from the Script Request section and launched the pipeline to see what comes out, and whether the user is fine with sharing feedback to help adjust and tune the algorithm.
Out of curiosity, have you tried comparing different runs on the same downsized video, at lower and lower FPS until it breaks? Just to have a metric on how resilient it is.
I actually once downsized a video to 640p (the image size YOLO11 was trained on), but I had the impression the detections were not as good as on the 1080p version of that same video.
But you are absolutely right, that would be a nice benchmark to run; I need to plan for it.
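Something like this could serve as a first pass; the clip names, the weights file and the crude detections-per-frame metric are just placeholders for whatever makes sense:

```python
# Rough resilience benchmark sketch: re-encode the same clip at decreasing FPS,
# run the same detection pass, and log a crude detections-per-frame metric.
import subprocess
from ultralytics import YOLO

model = YOLO("best.pt")  # placeholder weights path

for fps in (60, 30, 15, 10, 5):
    clip = f"clip_{fps}fps.mp4"
    subprocess.run(["ffmpeg", "-y", "-i", "clip.mp4", "-r", str(fps), clip], check=True)
    results = model.predict(clip, stream=True, imgsz=640, verbose=False)
    counts = [len(r.boxes) for r in results]
    print(fps, "fps ->", sum(counts) / max(len(counts), 1), "detections per frame")
```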
It looks like some sections (flat lines), beyond the pussy close-up which was identified properly, gave it some difficulty. Need to look into it.
Actually, after a deeper dive, it looks like a mess compared to the single, isolated sections I was parsing before.
Need to understand what broke in the scaling-up-to-full-video process.
Getting tired of this.
Looks like I finally made some progress out of the swamp I was stuck in.
Took time to rest and think about other things before fully overhauling the tracking algorithm yesterday, while still relying on the YOLO model for detection (the first stage of the pipeline).
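For context, that first stage is essentially just this (a simplified sketch; the overhauled tracking logic sits on top of these per-frame boxes, and the weights path is a placeholder):

```python
# Simplified view of the first stage: YOLO runs over the video and hands
# per-frame boxes to the tracking stage. Names are placeholders.
from ultralytics import YOLO

model = YOLO("best.pt")

def detection_stage(video_path):
    # one list of (class id, confidence, xyxy box) per frame
    for result in model.predict(video_path, stream=True, verbose=False):
        yield [(int(b.cls), float(b.conf), b.xyxy[0].tolist()) for b in result.boxes]
```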