GIFs of pose estimation on VR video

I’ll start small and see how I go :).

I’ve seen the original of that image in gif form. I’d embed it but it’s over 100 MB, so here’s a frame of it as an image instead.

It’s clearer in the gif, but it implies a detection model fine-tuned on a custom dataset, and then a workflow that either relies on a user selecting the relevant detections, or a separate automated process determining which category of detections to track. There’s probably also a first stage where detection is used to crop the image.

The second link Falafel added is pretty cool because it’s almost exactly what they’ve done. I’ve seen a few different variations on that model and inference with Python. The classes it uses are:
“FEMALE_GENITALIA_COVERED”,
“FACE_FEMALE”,
“BUTTOCKS_EXPOSED”,
“FEMALE_BREAST_EXPOSED”,
“FEMALE_GENITALIA_EXPOSED”,
“MALE_BREAST_EXPOSED”,
“ANUS_EXPOSED”,
“FEET_EXPOSED”,
“BELLY_COVERED”,
“FEET_COVERED”,
“ARMPITS_COVERED”,
“ARMPITS_EXPOSED”,
“FACE_MALE”,
“BELLY_EXPOSED”,
“MALE_GENITALIA_EXPOSED”,
“ANUS_COVERED”,
“FEMALE_BREAST_COVERED”,
“BUTTOCKS_COVERED”
The first link runs prediction at the server, so assume they’re harvesting any images you test :wink:
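
For anyone who wants to poke at it locally instead, here's a minimal sketch of running NudeNet from Python. This assumes the `nudenet` pip package, and the exact result keys may differ between versions:

```python
# A minimal sketch of local NudeNet inference (assumes the `nudenet` pip package).
from nudenet import NudeDetector

detector = NudeDetector()  # downloads/loads the default ONNX weights

# Detect on a single frame; returns a list of dicts roughly like
# {"class": "FEMALE_BREAST_EXPOSED", "score": 0.87, "box": [x, y, w, h]}
detections = detector.detect("frame.jpg")

# Keep only the classes relevant to tracking, above a confidence threshold
wanted = {"FEMALE_GENITALIA_EXPOSED", "MALE_GENITALIA_EXPOSED", "BUTTOCKS_EXPOSED"}
relevant = [d for d in detections if d["class"] in wanted and d["score"] > 0.4]
print(relevant)
```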

I’ll pick on that image for a bit though. This is what NudeNet in the browser (notai.tech) thinks:
image

This is what the base YOLOv8 web service does to the image
image

I’ve played around with YOLOv8 a bit before and did a quick prediction against that scene with the script I have
video
It’s a bit jittery, but YOLOv8 would attract similar comments to OpenPose in that it was trained on COCO128, which is not a nudity-heavy dataset, and the base categories the classification head is trained on are only about 15 different objects.

And if I lower the detection threshold even further for lulz
videoall
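
For reference, the sort of quick prediction script I’m talking about looks roughly like this (a sketch using the ultralytics package; the file names and the conf value are placeholders):

```python
# Quick-and-dirty YOLOv8 prediction over a clip, with a deliberately low
# confidence threshold (sketch; paths are placeholders).
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # base COCO-pretrained detection weights

# Run prediction on a video file; conf=0.1 keeps the low-confidence "for lulz" boxes
results = model.predict(source="scene.mp4", conf=0.1, stream=True)

for r in results:
    # r.boxes.cls holds class ids, r.boxes.conf the scores, r.boxes.xyxy the boxes
    print(r.boxes.cls.tolist(), r.boxes.conf.tolist())
```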

If someone wanted to copy the process hinted at in that first image, the starting point is labelled data, and it probably needs to be a bit more specific to this purpose than the classes used by NudeNet. The next step would be to fine-tune something like YOLOv8, which is relatively well documented (a rough sketch below). From there it could be shoehorned into something like the motion tracking funscript generator extension for OFS, instead of the OpenCV point tracking I assume it currently uses.
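
The fine-tuning step itself is fairly compact with the ultralytics package; a rough sketch, assuming a dataset.yaml that points at the labelled images (all names and hyperparameters here are placeholders):

```python
# Rough sketch of fine-tuning YOLOv8 on a custom labelled dataset.
from ultralytics import YOLO

model = YOLO("yolov8n.pt")          # start from pretrained weights
model.train(data="dataset.yaml",    # points at train/val image folders and class names
            epochs=100,
            imgsz=640)
metrics = model.val()               # mAP etc. on the validation split
model.predict(source="scene.mp4", conf=0.25, save=True)  # annotated output video
```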

1 Like

I took the methodology from my previous models that I put on github/huggingface, and after less than 24 hours of training on one (sometimes 2) GPUs it does this; here’s an example prediction generated by the ML:


If for some reason that embed doesn’t work, see Riding Nude Cowgirl Porn GIF by herpaderpapotato or motion.mp4 ~ pixeldrain.

Animating a 3D object manually on a 2D plane is hard, so I drew the axes on the image with OpenCV (a rough sketch of the drawing step follows the list):

  • tilt of the blue line is roll
  • tilt of the light blue line is twist
  • position of the red line is the vertical (up/down) axis; length of the red line is surge
  • position of the green line is sway; length of the green line is pitch
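
The drawing itself is just a handful of cv2.line calls; a rough sketch, where the mapping from predicted axis values to pixel coordinates is made up for illustration:

```python
# Sketch of drawing axis indicators on a frame with OpenCV.
# The scaling from predicted axis values (assumed 0..1) to angles/lengths is illustrative only.
import cv2
import numpy as np

def draw_axes(frame, roll, pos, surge, sway, pitch):
    h, w = frame.shape[:2]
    cx, cy = w // 2, h // 2

    # blue line: tilt = roll (light blue twist line omitted for brevity)
    ang = np.deg2rad(roll * 45)
    dx, dy = int(100 * np.cos(ang)), int(100 * np.sin(ang))
    cv2.line(frame, (cx - dx, cy - dy), (cx + dx, cy + dy), (255, 0, 0), 3)

    # red line: vertical position = pos, length = surge
    length = int(50 + 100 * surge)
    y = int(h * (1 - pos))
    cv2.line(frame, (cx, y), (cx, y - length), (0, 0, 255), 3)

    # green line: horizontal position = sway, length = pitch
    length = int(50 + 100 * pitch)
    x = int(w * sway)
    cv2.line(frame, (x, cy), (x + length, cy), (0, 255, 0), 3)
    return frame
```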

I went through flowertrample’s free multiaxis scripts, grabbed the corresponding videos I have access to, and ran them through the following process.
Stage 1:

  • I put the videos and funscripts in a folder, matched the names, and then put the timestamps of the cowgirl scenes in a file named .txt. I only really did this for 3 videos, and trained on 2. I should do more, but the next steps take time.
  • Used Python to extract each left-half frame to a file named like frame.jpg at the original resolution
  • For each frame, read the funscripts, interpolated the position for the frame, and stored the axis values in frame.txt (a rough sketch of this stage follows the list)
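
A rough sketch of that stage 1 extraction and interpolation; file names and layout are placeholders, the funscript format is the usual actions list of at/pos pairs, and the other axis funscripts would be handled the same way:

```python
# Sketch of stage 1: dump left-eye frames and an interpolated axis value per frame.
import cv2
import json
import os
import numpy as np

os.makedirs("frames", exist_ok=True)
cap = cv2.VideoCapture("scene.mp4")
fps = cap.get(cv2.CAP_PROP_FPS)

with open("scene.funscript") as f:
    actions = json.load(f)["actions"]
times = np.array([a["at"] for a in actions])        # milliseconds
positions = np.array([a["pos"] for a in actions])   # 0-100

i = 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    left = frame[:, : frame.shape[1] // 2]           # left half of the SBS frame
    cv2.imwrite(f"frames/{i:06d}.jpg", left)

    t_ms = i / fps * 1000
    pos = np.interp(t_ms, times, positions)          # linear interpolation per frame
    with open(f"frames/{i:06d}.txt", "w") as f:
        f.write(str(pos))
    i += 1
```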

Stage 2:

  • I wanted to speed up training, so instead of making an LSTM that used a pretrained model to generate features every epoch, I just made a model which takes 60 x the tensor shape of the features (an array of 12x12x1280), which saves training time. I used efficientnetv2-s-21k-ft1k to pre-generate features for all the images, saved as numpy files (a rough sketch follows the list).
  • I wrote a dataloader that reads all the numpy file names, works out which ones form sequences of 60 or more, and returns however many sets of sequence data I request. It’s a really bad dataloader because it loads that many sequences into memory, but I couldn’t be bothered writing a smarter data generator.
    At this point, across the 3 videos I prepped, there are 51,830 jpgs, 51,830 txt files and 51,830 npy files, and approximately 51,500 overlapping 60-frame sequences the dataloader can grab from.
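
A rough sketch of the stage 2 feature pre-generation, using tf.keras.applications.EfficientNetV2S as a stand-in for efficientnetv2-s-21k-ft1k (with a 384x384 input it also produces 12x12x1280 feature maps):

```python
# Sketch of stage 2: pre-generate per-frame feature maps and save them as .npy.
import glob
import numpy as np
import tensorflow as tf

backbone = tf.keras.applications.EfficientNetV2S(include_top=False, weights="imagenet")

for path in sorted(glob.glob("frames/*.jpg")):
    img = tf.keras.utils.load_img(path, target_size=(384, 384))
    x = tf.keras.utils.img_to_array(img)[None, ...]            # (1, 384, 384, 3)
    x = tf.keras.applications.efficientnet_v2.preprocess_input(x)
    features = backbone(x, training=False).numpy()[0]           # (12, 12, 1280)
    np.save(path.replace(".jpg", ".npy"), features.astype(np.float16))
```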

You can visualize the data in the npy files and they look a bit like:

In an image classification model, that is what the final classification layer would ‘see’ and make decisions on, but I smash 60 of them together and tell the model to do LSTM stuff with them before pushing the result through a few dense layers.
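
In Keras terms, the shape of that model is roughly the following; the layer sizes, the pooling step and the number of output axes are assumptions rather than the exact architecture:

```python
# Sketch of the sequence model: 60 pre-extracted feature maps in,
# one multi-axis position per frame out.
import tensorflow as tf

SEQ_LEN, N_AXES = 60, 6   # e.g. up/down, surge, sway, roll, pitch, twist

inputs = tf.keras.Input(shape=(SEQ_LEN, 12, 12, 1280))
# collapse each 12x12x1280 feature map to a 1280-dim vector per frame
x = tf.keras.layers.TimeDistributed(tf.keras.layers.GlobalAveragePooling2D())(inputs)
# "do LSTM stuff" across the 60 frames
x = tf.keras.layers.LSTM(512, return_sequences=True)(x)
# a few dense layers, then one prediction per frame and axis
x = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(256, activation="relu"))(x)
outputs = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(N_AXES))(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
model.summary()
```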

Stage 3:

  • Because my dataloader is bad/dumb, I set the training to get 128 sequences at random, train for 100 epochs on them, and then loop that continuously.
  • I watched the loss numbers go down. There’s no right answer on this: the loss went down much quicker than it does when I train on basic funscripts, because the main stroke axis has a lot more variation than the other axes, so the model learns to predict within a smaller range quickly. That brings the loss down but doesn’t mean it’s accurate. e.g. part of the later training after adding more sequences:

Stage 4:

  • I grabbed a video for one of flowertrample’s paid scripts and used it to test, because it’s not in my source dataset and could be used for comparison without fear of contamination
  • I loaded the timestamps of cowgirl in the video, pushed the frames through efficientnetv2-s-21k-ft1k, and then had the model do predictions 60 frames at a time, which is the window size I configured the model for.
  • I used opencv to draw on the frames to indicate the predictions.
  • also checked the up down axis output on a plot
    image

Caveats/notes

  • It jitters a little every 60 frames, because each 60 frames is a separate prediction. One day I’ll be bothered to add an input to the model that takes the position of the previous frame as a starting point. Doing overlapping predictions seems wasteful but is an option.
  • The model has never seen this scene before. In training it has only seen 2 other scenes, so it’s not very diverse, but it still matches the tempo reasonably well.
  • It hasn’t been trained very long
  • the base model(efficientnetv2-s-21k-ft1k) hasn’t been fine-tuned at all
  • I don’t have explicit permission to train and share a model based on flowertrample’s scripts. Although I can share code etc., the model weights can’t be shared, and I’d end up deleting them to avoid accidentally using them elsewhere.
  • I used linear interpolation to create the frame-matched pos values, so the values I generated aren’t the best, but writing something sinusoidal was more effort/time
  • although there’s plenty of multiaxis anime and pmv, I think they’d probably contradict the model’s learnings and so I can’t use them
  • I realized I broke one of my rules: the dense layers after the LSTM each have fewer neurons than the final layer, and I try to avoid doing that, so I probably should redo it with fresh weights.
  • although I did speed up training by pre-generating the pretrained model outputs, it meant I couldn’t use any data augmentation in training (although maybe I could add some noise to the features).
  • I could write some python to output funscripts based off the data (I’ve made some for my previous models, but not multi-axis), but it’s too soon to bother, because in all likelihood the model will freak out the minute someone transitions from cowgirl leaning back to cowgirl leaning forward with arms on chest.
  • I didn’t do any pre-cropping on the images it’s training on or predicting on, so it could be better if there were fewer distractions, which is what I saw with other models.
  • it’s only learnt from cowgirl scenes, and so is probably crap at everything else

As a proof of concept though, I don’t mind it. I mostly still want to find a way to simplify images prior to prediction to improve the eventual speed and accuracy, and/or to leave some breadcrumbs for others to improve upon.

3 Likes

Cool…
Of course I will send you the script in a private message, but the thing is that for this script I once again changed the scripting math (increased the angles; it’s Falafel’s fault with his video), and started paying more attention to the woman’s anatomy (the location of the vagina/rectum and the oral cavity). Write to me about which videos you have access to, or, since I still have all the projects, I can just give you the flat converted 1024x1024 videos and the multi-axis scripts for them.

The “conqueror” one? IIRC I limited the sway range during that video… The SR6 can sway a bit wider but it will be a little shaky.

1 Like

This one.
I thought these were the maximum angles…

Sorry, I might have confused things; I’m not sure which script is being referred to.

I alluded to it in a previous post, but I’ve put a previous model and its code up, with the code at herpaderpapotato/silver-lamp: ml to assist funscript creation (github.com). I’m not really trying to drive traffic to it though, so I’ve been avoiding posting the URL. It uses weights I put on huggingface at herpaderpapotato (herpader papotato) (huggingface.co), which were trained on datasets composed by me.

I’ll put the necessary code I used to do all the above up as well, after I’ve had a chance to tidy it up. Whether the weights are shared will be up to flowertrample, because it would be based on their intellectual property.
I’ve only trained spectacularly bad models though, so it’s not that I’m expecting it to come close to human funscripts. Sharing model weights would really just be to evidence that I’m not making all this up :slight_smile:

1 Like

I kept the training running with a sequence from the third scene added to the dataset, and it got way worse at predictions. I eventually took a look at the sequence and realized it was a much different and more subtle style of cowgirl; it’s more like a forward thrusting with no obvious vertical movement.
I’m going to limit this to just obvious vertical cowgirl movement for now, so I took it out of the dataset and continued training.

It kind of got better eventually, but I figured I wanted to fix the smaller middle layers and so started a fresh model from scratch.

The fresh model could predict the same sequence shown previously after about 3 hours of training, but then I realized I still had some middle layers thinner than the final layer, so I started again with another fresh model.

I shifted the code to a fresh repo and dumped it in herpaderpapotato/glowing-octo-giggle (github.com). The image of the plot shown is for the most recent training. There’s virtually no polish on any of it, but ¯\_(ツ)_/¯

1 Like

I’m still questing for a better feature extractor and model architecture and I ended up dropping the LSTM from the model and trying DepthwiseConv2d layers.

e.g. I predict against 60 frames and use efficientnet to extract the features, so each sequence is an array of (60, 12, 12, 1280).

e.g. 60 of something like this

image

I have an animated version of that somewhere to show how it changes over the course of a scene, but basically the brighter blobby cells tended to stay brighter and blobby, and all the blobs in the cells tend to move around and change within their cells based on the input video.

With the LSTM and initial Dense layers, the model architecture was basically saying “take some relevant stuff from all that data, and each of the other frame’s data, and predict an output”.

But that’s hard for a model to learn, because the data in each cell is most immediately relevant to the data in the same cell on the other frames, not to all the other cells in all the other frames. Thinking about it, what would be better is an LSTM for each 12 x 12 feature map.
image
Image blown up 10x

There are 1280 features like that per frame, and I want to do predictions against 60 frames at a time for smoothness. Initializing a model with 1280 LSTMs takes 10+ minutes to build and 2 minutes to predict, so it’s not practical to do it that way.

Instead I’m trying a DepthwiseConv2d layer, which is:
“Depthwise convolution is a type of convolution in which each input channel is convolved with a different kernel (called a depthwise kernel).”
Which is good, because treating each of the 1280 feature channels separately and transforming it with weights relevant to that channel makes sense, but I’m still not 100% sure I’m using it right, or whether there’s a better way of doing it. I can think of ways of combining the DepthwiseConv2d with the LSTM that make more sense, but I’ll give my current training run more time to see if I think of anything else before wiping the slate clean and starting fresh.
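
For illustration, one way to slot a DepthwiseConv2D into that pipeline looks like this; the exact arrangement isn’t spelled out above, so treat it as a sketch of the layer rather than the architecture actually being trained:

```python
# Sketch: per-channel spatial mixing with DepthwiseConv2D, then temporal mixing.
# Layer sizes and the Conv1D temporal step are illustrative assumptions.
import tensorflow as tf

SEQ_LEN, N_AXES = 60, 6

inputs = tf.keras.Input(shape=(SEQ_LEN, 12, 12, 1280))
# each of the 1280 feature channels gets its own 3x3 kernel, applied per frame
x = tf.keras.layers.TimeDistributed(
    tf.keras.layers.DepthwiseConv2D(kernel_size=3, padding="same", activation="relu")
)(inputs)
x = tf.keras.layers.TimeDistributed(tf.keras.layers.GlobalAveragePooling2D())(x)
# mix information across the 60 frames with a 1D temporal convolution
x = tf.keras.layers.Conv1D(256, kernel_size=5, padding="same", activation="relu")(x)
outputs = tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(N_AXES))(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
```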

This approach has had better results than the LSTM approach, but I end up restarting training from scratch a fair bit when testing. Lots of trial and error.

On a different approach, I did see that MMPose documents implementing your own model. I haven’t wrapped my head around it, but implementing a custom model head on the MMPose base is something along the lines of what was previously described: Implement New Models — MMPose 1.3.1 documentation.

1 Like

A little off-topic, but I’ll share a thought nonetheless, since I can’t do it myself yet.
Have you looked into creating a 3D mesh from the 180° VR video and then attaching a rig to it? In that case all you would need are the interconnections between the pelvic bone and the dick for sex, the head bone and the dick for blowjobs, and the hand bone and the dick for wanking.
We already have calibrated video for VR, which can be imported into software (Blender/Maya, etc.) as a mesh, and then we could try to add pose recognition there. With your approach for VR video we lose the depth data for the neural network (it’s recognizing poses in 2D pictures).

My understanding is that something like a MiDaS- or NormalBAE-generated depth map is much like a point cloud, although it is missing any obscured surfaces, e.g. 512h x 512w x depth.

I’ve never tried rigging them though. I have done some visualizations with them, but found that what works well for one scene doesn’t work very well for another. I had a gif above where I superimposed lineart output on MiDaS output as an example of what they ‘see’ over time.

output1
e.g. aside from the flickering (depth changing) on the ceiling, this one was fairly consistent over time.

MiDaS uses a trained ResNet backbone for extracting features in its encoder, and then the decoder is largely doing upscaling.
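
For reference, getting a MiDaS depth map for a single frame via torch.hub looks roughly like this (the documented usage; the MiDaS_small variant is an arbitrary choice here):

```python
# Sketch of single-frame MiDaS depth estimation via torch.hub.
import cv2
import torch

midas = torch.hub.load("intel-isl/MiDaS", "MiDaS_small")
midas.eval()
transforms = torch.hub.load("intel-isl/MiDaS", "transforms")
transform = transforms.small_transform

img = cv2.cvtColor(cv2.imread("frame.jpg"), cv2.COLOR_BGR2RGB)
with torch.no_grad():
    prediction = midas(transform(img))             # inverse relative depth
    depth = torch.nn.functional.interpolate(
        prediction.unsqueeze(1), size=img.shape[:2],
        mode="bicubic", align_corners=False,
    ).squeeze().numpy()
print(depth.shape)                                  # (H, W), a depth value per pixel
```
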
I’ve seen a relatively recent depth estimator called Marigold; it VAE-encodes the image, then pushes it through a U-Net (basically another encoder/decoder model) and it does quite well, although from what I read previously it’s not particularly temporally consistent. e.g. The State of the Art of Depth Estimation from Single Images | by PatricioGonzalezVivo | Medium

Essentially the encoder stage distils information out of the images into feature maps and then the decoders work backwards to create depth estimation images that are a lot like point clouds. I look at that and my take-away is that all the information needed is in the feature map, and instead of needing to add the overhead of decoding that image and then doing it over time and encoding to something usable, it should just be working with the feature maps.

The significant difference is that the feature extractors in a MiDaS model are trained/finetuned on their task, whereas the efficientnet backbone I used was trained to extract features that make it an accurate categorization model. It might be okay for the task I gave it; visualizing the feature map outputs over video shows they do hold positional data relevant to the image, but it could always be better. For any model I’ve done above, fine-tuning the backbone after initial training helps with that.

And then when it comes back to skeleton/pose detection or human rigging, I revert to looking at that last link I posted about a custom head for MMPose. I was originally thinking of it all being something like this: Human Pose Classification with MoveNet and TensorFlow Lite, but that’s literally just using the 17-keypoint output from MoveNet, and as per my original post in this thread, those are very hit and miss in their pretrained state.

I’m not sure what the calibrated video for VR which imports into software as a mesh is, but that reminds me: if anyone has lens distortion maps for commonly used VR lenses, it would be good to have them. The tests I did with OpenCV: Depth Map from Stereo Images really didn’t seem to like the wider FOV and lens distortion, and I’ve seen articles like Correcting lens distortion using FFMpeg | Daniel Playfair Cal’s Blog which offer more accurate ways of correcting it.

1 Like

I come back to this every month or so and try a new or different idea, but nothing worth mentioning in the realm of custom models.

I did revisit yolov8. There’s a guide to creating datasets and tuning yolov8 at notebooks/notebooks/train-yolov8-object-detection-on-custom-dataset.ipynb at main · roboflow/notebooks (github.com).

It’s possible to follow the instructions and label your own dataset with whatever you like. There’s other tools for dataset labelling than roboflow, but their example was fairly straightforward.

With a relatively small and sparsely labelled dataset it is possible to define more detailed body part classes and have a model start to track them reasonably well.

That’s from a couple of hours training YOLOv8 on a dataset of ~450 images with about 2,300 annotations. My choice of classes is a bit crap, and my annotations weren’t necessarily consistent or balanced, but I could see that sort of tracking tying into an OFS workflow or plugin.

I have also experimented a bit with this approach to generate funscripts. I used the pretrained NudeNet model which, as far as I know, is also based on YOLOv8. Unfortunately, I was not able to process the tracking results properly, since the individual elements were not reliably recognized in every image, and merging several trackers with missing detections was/is not so easy, at least for me. My attempt can be found in this commit (the newer code no longer contains the function). I hope you find something that gives us better results.

2 Likes

Thanks. There are some previous mentions of base YOLO models in this thread, as well as the NudeNet that someone shared earlier, and I didn’t find them stable or reliable enough, even on 2D video. They’d be great if I could use them as a feature extractor, and I recently saw an article describing a way to use YOLOv8 that way, but I haven’t gotten around to testing it.

I checked your link and also saw your newer commits, and noted your YOLO cock tracking ONNX model. I tried it out side by side with a YOLO model from a longer training run. I don’t know what your experience has been with it, but if you’re not happy with the detections it’s making, then from the looks of it, more data in the training dataset might be needed.



The same goes for the model I tuned: watching the previous animation as well as that side-by-side, I can see many points where it’s not detecting objects that it should be, and that’s likely due to the small dataset I used. For transparency, there are plenty of times in those videos where the model I trained utterly fails! I have set OpenCV to listen for keypresses, so when I see a frame that’s not being detected correctly, I can hit S and it gets dumped to a folder for later annotation. It’s not something I’m going to put a lot of time into; I’m still much more interested in approaches that do it all within the model.

I also saw your repo code using cv2.calcOpticalFlowFarneback for tracking, and I’d been meaning to take a look at what it would take to plug that into a model. I was surprised how easy it was to output a simple up-or-down vector and plot it into a detectable waveform (a quick sketch below), and it’s given me a few ideas around feeding the HSV data into a custom model instead of images, as it seemed to be visually informative even at low resolutions. Although I can see that if I was programmatically trying to infer the right waveforms, the correct polarity is largely dependent on who is moving!
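
A quick sketch of what I mean by turning Farneback flow into an up/down signal (the mean-flow reduction and the parameter values are arbitrary choices for illustration):

```python
# Sketch: per-frame vertical optical flow as a crude up/down motion signal.
import cv2
import numpy as np

cap = cv2.VideoCapture("scene.mp4")
ok, prev = cap.read()
prev_gray = cv2.cvtColor(prev, cv2.COLOR_BGR2GRAY)

signal = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    # flow[..., 1] is the vertical component; its mean gives a rough up/down vector
    signal.append(float(np.mean(flow[..., 1])))
    prev_gray = gray

# cumulative sum of per-frame motion gives a waveform-ish position estimate
waveform = np.cumsum(signal)
```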

And I also saw how you’re doing quadratic interpolation in your code, and it seems wayyy easier than the way I’ve been doing it, so that was really useful as well! I have been manually writing functions to calculate the time between points and then doing sinusoidal calculations! I see interp1d is marked as legacy though, but make_interp_spline with uniform parametrization looks like a good alternative I’ll try (sketch below). A few weeks ago I played around with making an autoencoder with an LSTM, with interpolated data choked down to 1/20th the size of the original data, so having good, easy-to-generate interpolated data is really useful. I was thinking the resultant decoder model might be good to use as a replacement head on some of the other sequence models I’ve tinkered with, and potentially output smoother, more consistent predictions.
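
A minimal sketch of that make_interp_spline alternative, with made-up funscript-style points:

```python
# Sketch of interpolating funscript actions onto per-frame timestamps with scipy.
import numpy as np
from scipy.interpolate import make_interp_spline

# example funscript-style points: times in ms, positions 0-100
times = np.array([0, 400, 800, 1200, 1600], dtype=float)
positions = np.array([10, 90, 15, 85, 20], dtype=float)

# k=2 gives a quadratic spline, similar to the quadratic interp1d usage mentioned
spline = make_interp_spline(times, positions, k=2)

frame_times = np.arange(0, 1600, 1000 / 60)   # one timestamp per frame at 60 fps
frame_positions = np.clip(spline(frame_times), 0, 100)
print(frame_positions[:10])
```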

I also tried the LSTM approach with my ‘manual’ interpolation to create an ‘axis expander’ model: going from 1 channel to roll+pitch channels based off the longer-term input (60 s in, 60 s out). Objectively, it wasn’t bad, but it wasn’t necessarily better than just setting MultiFunPlayer to random for those channels.

Back to your tracker model though: have you thought about doing a segmentation model instead of a detection model? There are some smart segmentation tools that make that sort of well-defined object easier to annotate than you’d think. Potentially it’s easier and more reliable to calculate the visible area of the segmentation mask?

The existing optical flow and other approaches are quite good and consistently outperform anything model-based I’ve tinkered with. I’m not sure if/where the pain points are for users of them at the moment, but I could see some sort of detection model that helps a program decide where to focus an optical flow’s attention (such as static cropping guidance), or that provides smoothing/postprocessing of the tracking output data.

I wouldn’t hold out any hope for anything useful from me though :smiley: . My area of interest in this is mostly focused on something fully image-model based, which I don’t think is achievable with my limited knowledge or my free time. Potentially something I’ve posted will save some time for someone smarter than me who comes along and sees it :wink:

2 Likes

herpaderpapotato/nsfw-identification-yolo10x at main (huggingface.co)

Link to the yolo10x detect model for reference. It can be used with ultralytics python libraries etc Quickstart - Ultralytics YOLO Docs

1 Like

herpaderpapotato/nsfw-identification-yolo10x at main (huggingface.co)

Link to the yolo10x detect model for reference. It can be used with ultralytics python libraries etc Quickstart - Ultralytics YOLO Docs

Oh, cool, thanks for sharing your model weights with us. :+1: I adjusted my code to be able to load your model.

I checked your link and also saw your newer commits, and noted your YOLO cock tracking ONNX model. I tried it out side by side with a YOLO model from a longer training run. I don’t know what your experience has been with it, but if you’re not happy with the detections it’s making, then from the looks of it, more data in the training dataset might be needed.

Yes, the dataset was very small. In addition, I only trained on the cowgirl position, and I used the projected output of the VR for training and not the raw video frame. The main problem is that creating a large enough dataset is just too time-consuming and boring for me…

I will definitely follow your progress and look forward to your next results :grinning:

1 Like

I had an unexpectedly quiet Saturday, so I labelled some more images and chucked them at yolov10n for training. It finished fairly quickly, so I kicked off a yolov10m training as well. I added them both to huggingface.

I think I’ve still only got less than 10 images in the validation dataset but 600 base images in training, the categories are imbalanced, and I’ve made all the other dataset labelling sins you can make. The one thing I did learn is that every instance of a class in an image should be labelled, or YOLO begins to learn them as “background”. I did augment the dataset with cropped versions of the original images, which brought the total number of images trained on to ~1200.

I’d share the dataset publicly, but frames from copyright video seem like they’d attract a dmca request. If anyone is interested in it though, drop me a message.

yolov10n


yolov10m

That’s a visualization of a prediction, then a crop based on the detections, and then a prediction on the crop.

Even though the v10m is still only ~800 epochs into training, it does a bit better than the v10n. It’s detecting the mouth on certain frames where the v10n isn’t. I’ll see if it stays that way at the end of training.

Different sort of scene with the 10m again:

I’d still recommend other tools for newcomers, but I had a decent labelling workflow going with OpenCV and some Python, so labelling wasn’t too bad (a rough sketch follows the list). Basically it was a case of:

  • load all videos into a list (glob *)
  • choose a random video
  • choose a random frame and display it
  • press a key to get some predictions from a previously trained yolo version and overlay them on the frame, or a different key to pick a new frame at random, or a different key to pick a new video at random.
  • if they’re all correct, press a key to write the frame and predictions to the dataset
  • if they’re not, then some opencv mousecallback magic to let me delete or add new ones and choose classes etc and then write/discard the frame etc
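
That labelling loop, sketched out (class handling, key bindings and the dataset-writing step are placeholders, and the mouse-callback editing is omitted):

```python
# Sketch of a keyboard-driven labelling loop using a previously trained model.
import glob
import random
import cv2
from ultralytics import YOLO

videos = glob.glob("videos/*.mp4")
model = YOLO("runs/detect/train/weights/best.pt")   # previously trained weights

while True:
    path = random.choice(videos)
    cap = cv2.VideoCapture(path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT)) or 1
    frame_idx = random.randint(0, total - 1)
    cap.set(cv2.CAP_PROP_POS_FRAMES, frame_idx)
    ok, frame = cap.read()
    if not ok:
        continue

    results = model.predict(frame, conf=0.3, verbose=False)[0]
    cv2.imshow("label", results.plot())              # frame with predicted boxes drawn

    key = cv2.waitKey(0) & 0xFF
    if key == ord("w"):                               # accept: write frame + labels
        cv2.imwrite(f"dataset/images/{frame_idx}_{random.random():.4f}.jpg", frame)
        # ...write results.boxes out as a YOLO-format .txt alongside it...
    elif key == ord("q"):
        break
    # any other key: skip to another random frame/video
```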

Once a model is performing reasonably, the process speeds up a lot because it will be doing most of the work with the annotations. It’s also possible to programmatically find edge cases where the dataset/model is lacking by running predictions on random samplings of videos and working out whether certain classes are disappearing and reappearing from the predictions, and then stashing the problem frames somewhere for labelling later.

2 Likes

yolov10m seems to be the best balance of size/performance/training time from what I’ve seen. That one is at herpaderpapotato/nsfw-identification-yolo10m · Hugging Face and gets a score of ~50 on the validation set compared to the previous best of ~40.

I wouldn’t generate waveforms or peaks like that if I was doing it for real (live, with no forward+backward smoothing), and for the motion in this clip the waveform is inverted, but this is a quick and dirty example with mouse callbacks on the labels at the top, where I can click them on and off and then reset the normalization to visualize the stability of the tracking over time.

If I wanted to actually work with that as a process, I’d probably want to run prediction over a whole video ahead of time and dump the tracking data to a file. torchvision will decode with CUDA straight into YOLO on the same GPU, with acceleration for H.265 and H.264, and runs much faster than what I get from OpenCV.
And then it’d be a case of working with that information in a GUI to mix/combine/offset the waveforms into something usable. GUIs are hard though, and supporting them is a nightmare. I just like playing with the tech :rofl:.

I started a new dataset with human pose keypoint labelling and started fine tuning yolo pose models. The results were good, but on the test scenes the keypoints don’t necessarily convey body movement enough.

Instead I added another 4 keypoints to my keypoint labels: base of pelvis, umbilicus, left of sternum and right of sternum. Fine tuning and adding keypoints throws out a lot more of the model’s previous knowledge than just a normal pose finetune so it’s a bit trickier.
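
A rough sketch of what that fine-tune looks like with ultralytics; the dataset yaml (called pose21.yaml here, a made-up name) would need kpt_shape: [21, 3] and the re-labelled keypoint data:

```python
# Sketch of fine-tuning a YOLO pose model with extra keypoints.
from ultralytics import YOLO

model = YOLO("yolov8m-pose.pt")        # start from a COCO 17-keypoint pose model
model.train(
    data="pose21.yaml",                # dataset config with kpt_shape: [21, 3]
    epochs=300,
    imgsz=640,
)

# prediction returns keypoints per detected person
results = model.predict("scene.mp4", stream=True)
for r in results:
    if r.keypoints is not None:
        print(r.keypoints.xy.shape)    # e.g. (1, 21, 2) once fine-tuned
```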

Left is finetuned 21 keypoint visualization, right is base yolo pose model visualization with no fine tuning.
The visibility on the arms and legs is low in the scene, so some jitter is expected, but the fine tune still does much better on their positions. The scene used is intentionally kept out of the dataset.

Generating predictions faster than realtime is a goal at the moment, but it’s a bit of a challenge, and a lot of it relates to Python threading.

  • CUDA OpenCV slows down when converting GpuMat to tensors, even when done without reprocessing on the CPU
  • torchvision similarly slows down when converting the tensor from HSV (YUV?) to BGR
  • ffmpeg hooks, from memory, seemed to decode via CPU, and Python accesses the images on the CPU, which slows predictions.
    There are other options like preconverting, other libraries, multiprocessing, or even using a different language like C++/C#.

In any case, it’s probably the best results I’ve had from any of these to date. Both based on the absolute position of the points, as well as the relative distance between certain points over time.

3 Likes

So you mean something like this, right?

We need lots of heroes to make all of this useful.

Also, lots of great work in this thread. Thanks for all the hard work and for open-sourcing it.

1 Like

Not really, but it’s a very interesting project. It seems to me that recognizing the skeleton will need quite a lot of resources, but maybe it’s worth it.