Great effort, how could I get the app?
Been working a bit more in my free time with YOLO. I now have automatic detection of body parts and scene-type estimation, based on a model I trained on my video library along with cue points from timestamp.trade, so the algorithm can adapt accordingly.
If anyone with proper experience in CV is interested in contributing, please message me, I'm feeling kinda lonely
I sometimes feel like I should just script instead…
The raw footage below is unsupervised and relies only on models I trained; it will be used to provide starting key points for actual OpenCV (or other) tracking:
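Roughly, the hand-off would look like this: take a YOLO box from the first frame and use it to seed an OpenCV tracker (model name and tracker choice below are illustrative, not my actual code):

```python
# Sketch of the hand-off idea: use a YOLO detection to seed an OpenCV tracker.
# Model path, box choice and tracker type are illustrative.
import cv2
from ultralytics import YOLO

model = YOLO("my_bodyparts_model.pt")  # hypothetical custom model

cap = cv2.VideoCapture("clip.mp4")
ok, frame = cap.read()

# Run detection on the first frame and take the highest-confidence box
results = model(frame)[0]
if len(results.boxes):
    x1, y1, x2, y2 = results.boxes.xyxy[0].tolist()
    bbox = (int(x1), int(y1), int(x2 - x1), int(y2 - y1))  # (x, y, w, h) for OpenCV

    # CSRT lives under cv2.legacy in recent opencv-contrib builds
    tracker = cv2.legacy.TrackerCSRT_create()
    tracker.init(frame, bbox)

    while True:
        ok, frame = cap.read()
        if not ok:
            break
        ok, bbox = tracker.update(frame)
        # ...feed bbox into the scripting logic here...
cap.release()
```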
Pretty sure we can do something.
Made progress on the model today, getting way better and faster
Everything in this gif, including labelling, is unsupervised.
Training takes time, but hand-picking zones of interest and matching them to a class for learning takes even longer. Made my own software to run through my VR video library and draw boxes over ‘classes’ for the later (now showing) YOLO learning… Went through 4000+ snapshots and hand-drew the different boxes matching the desired classes. Might need surgery on my mouse-handling wrist now.
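For reference, what that labelling pass spits out is just the standard YOLO text format, roughly like this (paths and class ids below are made up for the example):

```python
# One .txt label file per snapshot, YOLO format: class id, then box centre/size
# normalised to the image dimensions.
from pathlib import Path

def save_yolo_labels(label_path: Path, boxes, img_w: int, img_h: int) -> None:
    """boxes: list of (class_id, x1, y1, x2, y2) in pixel coordinates."""
    lines = []
    for cls, x1, y1, x2, y2 in boxes:
        xc = (x1 + x2) / 2 / img_w
        yc = (y1 + y2) / 2 / img_h
        w = (x2 - x1) / img_w
        h = (y2 - y1) / img_h
        lines.append(f"{cls} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    label_path.write_text("\n".join(lines))

# e.g. a hand-drawn box for class 3 on a 1920x1080 snapshot
save_yolo_labels(Path("frame_000123.txt"), [(3, 600, 400, 900, 750)], 1920, 1080)
```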
Anyway, I initially thought this could serve to anchor the CV trackers and then start the proper tracking, but this is actually way faster and more reliable now, and does not need as much supervision as the CV tracker management required. Quite impressed by the result, but I am a know-nothing in that field, so please, experts, pardon my insulting ignorance (readying the whip).
Still need to add smoothing, but Kalman filters and thresholding parameters should help; this is again very raw at this stage.
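The smoothing I have in mind is nothing fancy: roughly a constant-velocity Kalman filter over the raw 1-D signal, something like this sketch (noise values are placeholders I still need to tune):

```python
import numpy as np
import cv2

# Constant-velocity Kalman filter over a 1-D signal (e.g. the vertical centre of
# the detected box). State is [position, velocity]; measurement is [position].
kf = cv2.KalmanFilter(2, 1)
kf.transitionMatrix = np.array([[1, 1], [0, 1]], dtype=np.float32)
kf.measurementMatrix = np.array([[1, 0]], dtype=np.float32)
kf.processNoiseCov = np.eye(2, dtype=np.float32) * 1e-3   # placeholder tuning
kf.measurementNoiseCov = np.array([[1e-1]], dtype=np.float32)
kf.errorCovPost = np.eye(2, dtype=np.float32)

def smooth(raw_positions):
    """Return a smoothed copy of the raw per-frame positions."""
    out = []
    for y in raw_positions:
        kf.predict()
        est = kf.correct(np.array([[np.float32(y)]]))
        out.append(float(est[0, 0]))
    return out
```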
Oh boy, this is so much more interesting than my actual job…
And sorry, guys, I am very into Yumi Sin, she is so beautiful to me…
After the tracking comes the difficult part. I have not yet found a good way to merge the individual tracking results into a sensible output. I am curious to see what progress you will make here.
Just a note: you could continue training from the YOLO model that herpaderpapotato provided and train it further on your dataset to improve it. (You should probably use the same labels as he did. It would certainly be helpful to set a standard for the labels so that everyone doesn't do their own thing, which doesn't lead to any real progress.)
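In ultralytics terms that would be something like this (file names are just placeholders):

```python
# Hypothetical sketch: start from an existing community checkpoint and keep
# training on your own dataset with the same label set.
from ultralytics import YOLO

model = YOLO("herpaderpapotato_detect.pt")  # downloaded checkpoint (example name)
model.train(
    data="my_vr_dataset.yaml",  # should use the same class names/order as the checkpoint
    epochs=100,
    imgsz=640,
    device="mps",               # or 0 for CUDA, "cpu" otherwise
)
```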
Thank you so much for pointing out that post, awesome piece of work. I will check if there is anything I can leverage for my use case.
Anyway, after tagging and labelling 3000+ random frames from my own VR video collection and training a YOLO v11 s model, I get so much less flickering on class detection that I feel overwhelmed. It outpaces my previously trained 11n model.
Will now blend my previous approach with this one; the previous one consisted of cutting the video into sections, applying logic based on the type of detected position, and then performing some seaming.
At the same time as I put up the last post with the pose gif, I also added a pose model to Hugging Face, and I think that's a better reference point than the detection models I did. As far as I understand, with detection models YOLO learns the contents of the bounding boxes and learns that everything else is 'background'. It feels like, with the images in a typical scene, what distinguishes the body parts we want to detect from the body parts we're not interested in isn't enough, which makes it harder for the models to reach the performance levels they should with detection of body parts alone.
Pose was a lot more of a pain in the neck to label, though. It also uses a bounding box, but it's around the person, which in the context of the previous paragraph tells YOLO to "detect this, and pay attention to all the stuff in this box"; the keypoints within the box are then labelled, so the model learns where they should be in the context of what's in the bounding box. Keypoints have also been good because the model estimates the position of body parts it can't see but that may still be in frame. I've been thinking that the next step for the pose models is to go back over the dataset and label the male and his keypoints as another class, so that the models can be more universal.
One of the things that had me look back at keypoints was some other examples I'd seen where they detect bolts in a tray and use keypoints for the head and the tip of the bolt to find their orientation, as well as detecting whether books on a shelf have their spine the right way up. In those examples it was important to have the whole context of the object to determine orientation via the keypoints.
Of course this is all anecdotal, ymmv.
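If anyone wants to poke at the pose model, pulling keypoints out with ultralytics looks roughly like this (model path is a placeholder, and attribute details may vary by version):

```python
# Rough example of reading keypoints from a YOLO pose model with ultralytics.
from ultralytics import YOLO

model = YOLO("pose_model.pt")          # placeholder for the Hugging Face checkpoint
results = model("frame.jpg")[0]

if results.keypoints is not None:
    xy = results.keypoints.xy          # (num_persons, num_keypoints, 2) pixel coords
    conf = results.keypoints.conf      # per-keypoint confidence, low for occluded parts
    for person_xy, person_conf in zip(xy, conf):
        for (x, y), c in zip(person_xy, person_conf):
            print(f"keypoint at ({x:.0f}, {y:.0f}) conf={c:.2f}")
```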
How could I get this program? Maybe we can train the AI together.
If you’re not into Yumi Sin, then what are you even doing IMO? Ditto for Agatha Vega.
I’m working on a ResNet that operates entirely on MPEG macroblock motion vectors, mostly to script outliers I’m into like Jade Kush and Pussykat. I might pivot to some sort of vision transformer down the road, but this is plenty of research for now. Nothing to show yet.
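If anyone's curious, the rough shape is just a stock ResNet with the stem swapped for a 2-channel (dx, dy) motion-vector grid; this is a simplified sketch, not the actual network:

```python
# Simplified sketch: ResNet-18 adapted to take a 2-channel grid of macroblock
# motion vectors (dx, dy) instead of an RGB image.
import torch
import torch.nn as nn
from torchvision.models import resnet18

class MVResNet(nn.Module):
    def __init__(self, num_outputs: int = 1):
        super().__init__()
        self.backbone = resnet18(weights=None)
        # Replace the 3-channel RGB stem with a 2-channel one for (dx, dy)
        self.backbone.conv1 = nn.Conv2d(2, 64, kernel_size=7, stride=2, padding=3, bias=False)
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, num_outputs)

    def forward(self, mv_grid: torch.Tensor) -> torch.Tensor:
        # mv_grid: (batch, 2, H_blocks, W_blocks), e.g. ~68x120 blocks for 1080p @ 16 px
        return self.backbone(mv_grid)

model = MVResNet()
dummy = torch.randn(1, 2, 68, 120)
print(model(dummy).shape)  # torch.Size([1, 1])
```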
Hahaha, you are right
Made some progress: got rid of all the OpenCV tracking, now fully relying on a model I trained and an algorithm to decide what in the frame is relevant to use.
The footage below is fully unsupervised and picks 4 random sections of 240 frames from a video. I still need to fix the blowjob algorithm, got a glitch in the matrix when fixing another section of the algorithm, but almost there:
Still tweaking the BJ/HJ algo, but it now disregards scenes that it considers irrelevant.
Model trained on 4500+ hand-picked and tagged images.
edit: I was curious, so I counted… 30149 classes tagged in those images to train the model
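For the curious, the random 240-frame section picking is basically this (simplified sketch, names and parameters illustrative):

```python
# Pick a few random fixed-length sections out of a video for unsupervised testing.
import random
import cv2

def pick_random_sections(video_path: str, n_sections: int = 4, section_len: int = 240):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    starts = sorted(random.sample(range(0, max(1, total - section_len)), n_sections))
    sections = []
    for start in starts:
        cap.set(cv2.CAP_PROP_POS_FRAMES, start)
        frames = []
        for _ in range(section_len):
            ok, frame = cap.read()
            if not ok:
                break
            frames.append(frame)
        sections.append((start, frames))
    cap.release()
    return sections
```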
Introduced Kalman filtering (again…) and fallback tracking (booooobies).
Below are 10 randomly picked, unsupervised processed cuts of a video:
Another shot of the algo's capabilities and limitations at this stage, in better resolution.
Still tweaking…
For reference, the base video is a re-encoded 60 fps VR video at 1920 x 1080, and the whole inference process runs at 90 fps with monitoring/displaying/logging on my Mac mini, whereas I was barely getting 8 to 20 fps with the previous tracker-based solution.
Thank you very much for your comment and sharing your results. Very enlightening, awesome work.
Thank you very much for your interest.
Sorry, the code is still a mess as of now, even if it went down to 500-ish lines from 2k+ in the previous iteration with OpenCV trackers.
Though I'm definitely taking comments, in case you want to submit a video file (section) for scripting and feedback so I can tweak the algorithm.
Might the current status be of any interest to you now?
Anyway, thank you for shaking the coconut tree. I had no clue what had already been done until now; I was just playing around.
That's alright, I'm glad you're progressing on your endeavor. Though I will say this is not really what I would be interested in contributing to. In my original post I described a wish to see a replacement tool for OFS or similar that would incorporate the innards of MTFG and deep-learning motion tracking, something that would actually be an all-in-one (AIO) funscripting copilot environment. To my knowledge nobody's yet brave or foolish enough to attempt that.
In parallel I have my own plans and my own interests, which probably do not match yours haha. I'm more interested in non-VR furry animations and scripting for rotary or linear-drive fuck machines. All I can say personally is that I'm seeing good progress here given your area of interest, so my mantra as always is: keep up the good work and always open-source your efforts so that others can contribute.
Hi, first of all, congratulations on your impressive work with the nsfw-identification-yolo10m-1024 model on Hugging Face! I recently downloaded and successfully used the model, and it performed excellently. However, at the moment, my test script runs only on a CPU, and I’m looking to adapt it to leverage the GPU for better performance.
Could you let me know if you run this model on Windows or Linux, and if you have any references or guidelines for configuring it to run with CUDA?
Thanks again for your great work, and I appreciate any advice you can share!
Hi there,
Thank you very much for your kind words.
However, I did not use the model you mentioned.
I started with a yolo11n base model that I tuned with 9 new classes and 4500 frames randomly extracted from my VR library.
To do so, I hand-tagged and boxed classes in a first couple hundred frames, trained the model, then used it to suggest boxes in later frames, adjusted the boxes and classes, retrained, and so on, until this result.
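The suggest-and-correct loop looks roughly like this (paths and confidence threshold are illustrative):

```python
# Sketch of the suggest-then-correct loop: run the current model over new frames
# and write draft YOLO labels that then get fixed by hand before retraining.
from pathlib import Path
from ultralytics import YOLO

model = YOLO("runs/detect/train/weights/best.pt")  # latest trained weights (example path)

for img_path in Path("new_frames").glob("*.jpg"):
    result = model(str(img_path), conf=0.4)[0]
    lines = []
    for box in result.boxes:
        cls = int(box.cls)
        xc, yc, w, h = box.xywhn[0].tolist()  # already normalised to image size
        lines.append(f"{cls} {xc:.6f} {yc:.6f} {w:.6f} {h:.6f}")
    img_path.with_suffix(".txt").write_text("\n".join(lines))
```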
My editing logic looks like this:
After gathering 4.5k+ images and 30k+ annotations, I ran training on yolo11n for 200 epochs or so. Tried yolo11s and yolo11m too, but the gain in accuracy was not impressive on my dataset, and inference time is shorter on yolo11n, so I stuck with it.
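In ultralytics terms the run is roughly this (imgsz and file names are placeholders):

```python
# The training run itself is the stock ultralytics call.
from ultralytics import YOLO

model = YOLO("yolo11n.pt")      # also tried yolo11s/yolo11m, kept 11n for speed
model.train(
    data="vr_dataset.yaml",     # ~4.5k images, 30k+ annotations, 9 custom classes
    epochs=200,
    imgsz=640,                  # placeholder, tune to taste
    device="mps",               # Apple Silicon; use 0 for CUDA or "cpu"
)
```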
All this runs on my Mac using MPS (Metal Performance Shaders, on Macs with ARM chips) instead of CUDA or CPU, so I can't really help on that side, sorry.
Check here: inference can be done on the GPU by defining the device argument:
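A minimal example (model and image paths are placeholders):

```python
# Selecting the inference device with ultralytics.
from ultralytics import YOLO

model = YOLO("your_model.pt")

results = model("frame.jpg", device=0)        # first CUDA GPU
# results = model("frame.jpg", device="cpu")  # CPU fallback
# results = model("frame.jpg", device="mps")  # Apple Silicon, what I use
```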