VR Funscript generation helper - Python & now CV AI

Oh boy, this is so much more interesting than my actual job…

And sorry, guys, I am very into Yumi Sin, she is so beautiful to me… :heart_eyes:


After the tracking comes the difficult part. I have not yet found a reasonable way to merge the individual tracking results into a coherent output. I am curious to see what progress you will make here.

Just a note: you could continue training from the YOLO model that herpaderpapotato provided, fine-tuning it on your own dataset to improve it further. (You should probably use the same labels as he did. It would certainly help to set a standard for the labels so that everyone doesn't do their own thing, which doesn't lead to any real progress.)

Thank you so much for pointing out that post, awesome piece of work. I will check if there is anything I can leverage in my use case.

Anyway, after tagging and labelling 3000+ random frames from my own VR video collection and training a YOLO v11 s model, I get so much less flickering in class detections that I'm blown away. It clearly outpaces my previously trained 11n model.

I will now blend my previous approach with this one. The previous one consisted of cutting the video into sections, applying logic based on the type of detected position, then performing a seaming pass to join the results.


At the same time I put up the last post with the pose gif, I also added a pose model to Hugging Face, and I think that's a better reference point than the detection models I did. As far as I understand, with detection models YOLO learns the contents of the bounding boxes and learns that everything else is 'background'. It feels like, with the images in a typical scene, what distinguishes the body parts we want to detect from the body parts we're not interested in just isn't enough, which makes it harder for the models to reach the performance levels they should when detecting body parts alone.

Pose was a lot more of a pain in the neck to label though. It also uses a bounding box, but it's around the person, which in the context of the previous paragraph tells YOLO to "detect this, and pay attention to all the stuff in this box"; the keypoints within the box are then labelled, so the model learns where they should be in the context of what's in the bounding box. Keypoints have also been good because the model estimates the position of body parts it can't see but which may still be in frame. I've been thinking that the next step for the pose models is to go back over the dataset and label the male and his keypoints as another class, so that the models can be more universal.
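If anyone wants to poke at a pose model, reading the boxes and keypoints back out with Ultralytics only takes a few lines. A rough sketch with a stock yolo11n-pose checkpoint and placeholder paths (not my training code):

```python
from ultralytics import YOLO

# Placeholder checkpoint and image; any Ultralytics pose model works the same way.
model = YOLO("yolo11n-pose.pt")
results = model("frame.jpg")

for result in results:                 # one Results object per image
    boxes = result.boxes.xyxy          # person bounding boxes
    kpts_xy = result.keypoints.xy      # (num_persons, num_keypoints, 2) pixel coords
    kpts_conf = result.keypoints.conf  # per-keypoint confidence, incl. occluded/estimated points
    print(boxes.shape, kpts_xy.shape)
```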

One of the things that had me look back at keypoints was some other examples I'd seen where they detect bolts in a tray and use keypoints for the head and the tip of each bolt to find its orientation, as well as detecting whether books on a shelf have their spines the right way up. In those examples it was important to have the whole context of the object to determine orientation via the keypoints.

Of course this is all anecdotal, ymmv.


How could I get this program? Maybe we can train the AI together.

If you’re not into Yumi Sin, then what are you even doing IMO? Ditto for Agatha Vega.

I’m working on a resnet that operates entirely on MPEG macroblock motion vectors, mostly to script outliers I’m into like Jade Kush and Pussykat. I might pivot to some sort of vision transformer down the road, but this is plenty enough research for now. Nothing to show yet.
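The rough shape of it is a 2-channel "image" of per-macroblock (dx, dy) vectors going into a small ResNet that regresses a stroke position. A throwaway sketch of that model shape (a torchvision ResNet-18 with a 2-channel stem; just the idea, nothing like final code):

```python
import torch
import torch.nn as nn
from torchvision.models import resnet18

# Sketch only: ResNet-18 whose stem takes a 2-channel macroblock motion-vector
# grid (dx, dy per 16x16 block) and regresses a single 0..1 position value.
model = resnet18(weights=None)
model.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, 1)

# A 1920x1080 frame gives roughly a 120x68 grid of 16x16 macroblocks.
mv_grid = torch.randn(1, 2, 68, 120)
position = torch.sigmoid(model(mv_grid))  # (1, 1) predicted stroke position
print(position.item())
```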


Hahaha, you are right :slight_smile:

Made some progress: got rid of all the OpenCV tracking and now fully rely on a model I trained plus an algorithm that decides what in the frame is relevant to use.

The footage below is fully unsupervised and picks 4 random sections of 240 frames from a video. I still need to fix the blowjob algorithm; I got a glitch in the matrix while fixing another part of the algorithm, but I'm almost there:

Still tweaking the BJ/HJ algo, but it now disregards scenes that it considers irrelevant.
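The random sampling itself is nothing fancy, something along these lines (a simplified stand-in, not the actual code):

```python
import random

def pick_sections(total_frames, n_sections=4, section_len=240, seed=None):
    """Pick up to n non-overlapping sections of section_len frames (simplified sketch)."""
    rng = random.Random(seed)
    starts = []
    attempts = 0
    while len(starts) < n_sections and attempts < 1000:
        attempts += 1
        start = rng.randrange(0, total_frames - section_len)
        if all(abs(start - s) >= section_len for s in starts):  # keep cuts disjoint
            starts.append(start)
    return [(s, s + section_len) for s in sorted(starts)]

print(pick_sections(total_frames=18000))  # e.g. a 5-minute 60 fps video
```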

Model trained on 4500+ hand picked and tagged images :hot_face:

edit: I was curious, so I counted… 30149 class instances tagged in those images to train the model :melting_face:

Introduced Kalman filtering (again… :face_holding_back_tears:) and fallback tracking (booooobies :heart_eyes_cat:).
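For those who haven't played with it, the Kalman part is just a small constant-velocity filter smoothing the raw per-frame position before it gets turned into script points. A bare-bones sketch of the idea (not my actual implementation):

```python
import numpy as np

def kalman_smooth(raw_positions, process_var=1e-2, meas_var=4.0):
    """Constant-velocity 1-D Kalman filter over a noisy position signal (sketch)."""
    F = np.array([[1.0, 1.0], [0.0, 1.0]])     # state transition: pos += vel
    H = np.array([[1.0, 0.0]])                 # we only measure position
    Q = process_var * np.eye(2)                # process noise
    R = np.array([[meas_var]])                 # measurement noise
    x = np.array([[raw_positions[0]], [0.0]])  # initial state [pos, vel]
    P = np.eye(2)
    smoothed = []
    for z in raw_positions:
        # predict
        x = F @ x
        P = F @ P @ F.T + Q
        # update with the new measurement
        y = np.array([[z]]) - H @ x
        S = H @ P @ H.T + R
        K = P @ H.T @ np.linalg.inv(S)
        x = x + K @ y
        P = (np.eye(2) - K @ H) @ P
        smoothed.append(float(x[0, 0]))
    return smoothed

print(kalman_smooth([50, 52, 70, 68, 90, 40, 42]))
```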

Below are 10 randomly picked, unsupervised, processed cuts of a video:


Another shot of the algo's capabilities and shortcomings at this stage, in better resolution.

Still tweaking…

Though for reference, the base video is a re-encoded 60 fps VR video at 1920 x 1080, and the whole inference process runs at 90 fps with monitoring/display/logging enabled on my Mac mini, whereas I was barely getting 8 to 20 fps with the previous tracker-based solution.

Thank you very much for your comment and sharing your results. Very enlightening, awesome work.

Thank you very much for your interest :slight_smile:

Sorry, the code is still a mess as of now, even though it's down to 500-ish lines from 2k+ in the previous iteration with OpenCV trackers.

Though I'm definitely taking comments, in case you want to submit a video file (or a section) for scripting and feedback so I can tweak the algorithm.

Might the current status be of any interest to you now? :smile:

Anyway, thank you for shaking the coconut tree. I had no clue what had or hadn't been done until now; I was just playing around :upside_down_face:

That's alright, I'm glad you're progressing on your endeavor. Though I will say this is not really what I would be interested in contributing to. In my original post I described a wish to see a replacement tool for OFS or similar that would incorporate the innards of MTFG and deep-learning motion tracking: something that would actually be an all-in-one funscripting copilot environment. To my knowledge nobody's yet brave or foolish enough to attempt that.

In parallel I have my own plans and my own interests, which probably do not match yours, haha. I'm more interested in non-VR furry animations and scripting for rotary or linear-drive fuck machines. All I can say personally is that I'm seeing good progress here given your area of interest, so my mantra, as always, is: keep up the good work, and always open-source your efforts so that others can contribute.

Hi, first of all, congratulations on your impressive work with the nsfw-identification-yolo10m-1024 model on Hugging Face! I recently downloaded and successfully used the model, and it performed excellently. However, at the moment, my test script runs only on a CPU, and I’m looking to adapt it to leverage the GPU for better performance.

Could you let me know if you run this model on Windows or Linux, and if you have any references or guidelines for configuring it to run with CUDA?

Thanks again for your great work, and I appreciate any advice you can share!

Hi there,

Thank you very much for your kind words.

However, I did not use the model you mentioned.

I started with a yolo11n base model that I fine-tuned with 9 new classes on 4500 frames randomly extracted from my VR library.

To do so, I hand-tagged and boxed classes in a first couple hundred frames, trained the model, then used it to suggest boxes in new frames, adjusted the boxes and classes, retrained, and so on, until I got this result.
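The "suggest boxes" step is basically running the latest checkpoint over a fresh batch of frames and saving its predictions as YOLO-format label files to correct by hand, along these lines (paths are placeholders):

```python
from ultralytics import YOLO

# Placeholder paths: the latest trained checkpoint and a folder of new frames.
model = YOLO("runs/detect/train/weights/best.pt")
model.predict(
    source="new_frames/",  # frames freshly extracted from the VR library
    conf=0.4,              # only keep reasonably confident boxes as suggestions
    save_txt=True,         # writes YOLO-format .txt labels alongside the predictions
    save=False,            # no need to save annotated images
)
# The generated labels then get reviewed/corrected in the annotation tool before retraining.
```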

My editing logic looks like this:

After gathering 4.5k+ images and 30k+ annotations, I ran training on yolo11n for 200 epochs or so. I tried yolo11s and yolo11m too, but the gain in accuracy was not impressive on my dataset, and inference is faster with yolo11n, so I stuck with it.
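The training itself is the standard Ultralytics call, roughly like this (dataset YAML name and settings are illustrative, not my exact config):

```python
from ultralytics import YOLO

model = YOLO("yolo11n.pt")      # pretrained 11n as the starting point
model.train(
    data="vr_dataset.yaml",     # placeholder: lists the image folders and the 9 custom classes
    epochs=200,
    imgsz=640,
    device="mps",               # Apple-silicon GPU; use 0 for CUDA or "cpu"
)
metrics = model.val()           # quick mAP check on the validation split
```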

All this runs on my Mac using MPS (Metal Performance Shaders on Mac with ARM chip) instead of CUDA or CPU, so I can’t really help on that side, sorry :confused:

Check here: inference can be done on the GPU by defining the device argument:
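For example, something like this should push inference onto a CUDA GPU (untested on my side since I'm on MPS; paths are placeholders):

```python
from ultralytics import YOLO

model = YOLO("best.pt")                        # your trained weights
# device=0 targets the first CUDA GPU; "cpu" and "mps" are also accepted
results = model.predict("clip.mp4", device=0)
```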

Thank you for sharing your workflow. I understand the process you’ve outlined, and I apologize for not having experience with CUDA. However, I will continue on my own to explore whether it’s feasible to implement it.

Regarding the model I mentioned earlier (nsfw-identification-yolo10m-1024), I'm able to detect the trained objects as expected. The next step is to retrain the model specifically to identify hands and nails, in order to generate funscripts for handjob videos.

Since I’m new to this technology (I’ve only been using it for a few hours), would you be willing to explain how to perform model training and how to generate new models? Or if you have any resources or references where I can deepen my understanding, I would really appreciate it!

Thanks again for your input, and I’ll keep you updated on my progress!

Sure, I can share my very basic understanding of those concepts, but the extensive Ultralytics documentation, along with DeepSeek/ChatGPT, is what got me to the very little I know.

Anyway, to retrain the model you are using while keeping the existing classes and adding new ones for nails and hands, my understanding is that you would want to take the initial dataset, augment it with nail and hand images/boxes, and retrain on that combined set, so the model doesn't lose its ability to identify the initial classes with high confidence.
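Concretely, I believe that boils down to extending the dataset's class list with the new ids and fine-tuning from the existing weights, roughly like this (file names are assumed, and I haven't run that particular model myself):

```python
from ultralytics import YOLO

# Assumed file names: the downloaded checkpoint, and a data.yaml whose `names:` list
# keeps the original classes in their original order with "hand" and "nail" appended.
model = YOLO("nsfw-identification-yolo10m-1024.pt")
model.train(
    data="handjob_dataset.yaml",  # original images plus the new hand/nail annotations
    epochs=100,
    imgsz=1024,                   # match the resolution the base model was trained at
    device=0,                     # CUDA GPU
)
```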

Thanks for your work. Will you eventually release a version? Maybe it would go faster if more people were involved?