Gifs of pose estimation on vr video

I trained a few models in the above, but they’re not useful for various reasons.

I was looking at some image preprocessing methods the other day, and I think the lineart model (the one used as an image annotator for image generation) might be a better feature extractor than the efficientnets I’ve been using.

output

From top to bottom, left to right:
original, canny, hed, lineart
lineart anime, midas, normal bae, oneformer
openpose (too close for it to work anything out), pidinet, zoe, blank

I’ve tried simplifying the image being fed to the model to lighten the workload of training and inference (although I still have dataset challenges). Lineart seems to be really consistent and fast, at about 30ms per frame unbatched, so I want to try using it as a feature extractor.

If you’re looking for something usable, I’d say check out Motion Tracking Funscript Generator v0.5.x - Software - EroScripts or How to use FunscriptToolBox MotionVectors Plugin in OpenFunscripter - howto - EroScripts, or pretty much any other posts under the software category. At some point I’d be curious to see how a custom model would go if it could utilize the motion vectors generated by the latter, but there wouldn’t be any pretrained feature extractors available and it’d be a slower initial training phase as well as require a fair bit of reprocessing on my current dataset.


Hello!

Honestly love some of the work you’re doing here; I won’t pretend to be able to understand it :joy:

But it looks like training data is the blocker for you. I wonder if you’ve considered using VAM to synthesize the training data? (Perhaps by writing a VAM script that outputs renders + labels; you may be able to get accurate buffer data like depth maps, edges and all that from VAM too.)

I’m not sure if it’d work though if the rendering for these scenes are not photorealistic :upside_down_face:

EDIT: I’m considering using VAM to synthesize views for 4D Gaussian so i thought this might be useful for you :slight_smile:

I would also be really interested to see if we could train these pose estimation models on VAM renders and produce really accurate inference results? :thinking:

obligatory wall of text because I’m not good at simple explanations
The pose stuff: movenet, openpose, mmpose, mediapipe etc. These work pretty well on normal 2d videos for this sort of thing, and I’ve seen some topics from other users leveraging them as part of generating funscripts.
This is a gif of openpose run on a 2d scene I had lying around.
output
The openpose model outputs a series of coordinates which correspond to bones/connections, and you could probably create a really good 1D starter script if you took the coordinates of the hip points and head points and did some processing on them. At the least it should provide a relatively decent representation of peaks and troughs, which could be edited into something better with less effort than starting from scratch.
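As a rough illustration of the hip/head idea above, here is a minimal numpy sketch. It assumes the per-frame hip and head heights (in pixels) have already been pulled out of the pose model output. The function names, the averaging of the two points and the smoothing window are assumptions for illustration, not anything openpose itself provides.

```python
import numpy as np

def keypoints_to_positions(hip_y, head_y, smooth=5):
    """Turn per-frame hip and head heights (pixels) into 0-100 positions.

    smooth is the moving-average window in frames and should be odd.
    """
    hip_y = np.asarray(hip_y, dtype=float)
    head_y = np.asarray(head_y, dtype=float)
    # Average the two points so one bad detection hurts a little less.
    signal = (hip_y + head_y) / 2.0
    # Moving average with edge padding to damp frame-to-frame jitter.
    padded = np.pad(signal, smooth // 2, mode="edge")
    signal = np.convolve(padded, np.ones(smooth) / smooth, mode="valid")
    lo, hi = signal.min(), signal.max()
    if hi == lo:
        return np.full(signal.shape, 50.0)  # no movement detected
    # Image y grows downwards, so flip it: higher on screen -> higher pos.
    return 100.0 * (hi - signal) / (hi - lo)

def turning_points(pos):
    """Frame indices where the motion changes direction (peaks and troughs)."""
    direction = np.sign(np.diff(pos))
    return np.where(direction[1:] * direction[:-1] < 0)[0] + 1
```

The turning points are the frames a scripter would most likely keep as actions; flat plateaus are ignored by this simple sign test, so real footage would probably need something more forgiving.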

I think you’re right that something like VAM could be used to generate a labelled pose estimation dataset for both vr and 2d videos to improve accuracy, and it’s similar to the comments flowerstrample made initially. My understanding is that there’s no porn in the CMU Panoptic Studio dataset, which I believe openpose was trained on, or in any other pose dataset, so even for 2d there’s probably plenty of room for improvement. I don’t think it would take a large amount of retraining of one of those models to achieve that, given that these things hold most of the knowledge in the base model and often just need fine tuning of the later layers. I’d guess it would need a variety of scene types, and input images as realistic as possible, to match the type of images it would be used against. The generated images could be pushed through something like stable diffusion img2img with the openpose controlnet to produce an even larger and more varied set of data.

I don’t really know much about VAM though, and loading enough unique scenes and poses (and exporting them in a VR view) could potentially be more time intensive than manually labelling poses. VAM could definitely produce a large number of similar images in a sequence much faster than manual labelling, but my guess is that variety in the dataset would be most beneficial to the model.

VAM or other 3d computer graphics could be valuable for generating a large dataset to pretrain a model which tracks 6 axis motion. A smaller dataset of manually annotated filmed scenes could then be used to fine tune it and make it usable for normal videos, but I still wouldn’t expect any approach I’ve taken so far to be good and accurate enough for real time, script-less use. Instead I’d still predict the best I might manage is something that can generate a starter funscript to save time for human scripters, similar to the other posts I linked previously.

It could even be the case that a VAM or computer generated dataset could be used to train pose estimation on multiple frames to try to achieve temporally consistent results, but that’s a lot to bite off. This would be similar to how movinet does it, but movinet only does an overall classification of the action for the whole sequence, and the inner workings of things like that are beyond me. MoViNet for streaming action recognition  |  TensorFlow Hub

Stepping back from the pose models: I’ve looked at using yolov8, which offers bounding box detection, but it’s complex to get the feature extractor layers with the right classification info into a time sequence model. I did put together something that extracts every prediction from a frame, which might be useful for the LSTM style model I made in previous posts, but I haven’t pursued it. It’s not very temporally consistent, or very fast, but mostly it just didn’t feel like a lead I wanted to follow.

That’s why the lineart model caught my attention. Compared to line detection with canny, it was very smooth and temporally consistent. It also simplifies the image a lot, which means there’s less data/noise to process, and it passed the test of “if all I had was this output, could I write a semi accurate funscript”. The base model for it is a pytorch model though, and I’m more of a tensorflow user, so it’d take some effort to convert. It also still outputs a 512x512x1 array, which would need layers to reduce it to a smaller size before I could use it in a model that processes a sequence of frames.

Including this gif because I thought it was cool. It was the product of choosing the highest value pixels from midas and lineart output, and then subtracting the lineart output values. I’d still try lineart alone because I think it’d be simpler for a model to learn.
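The combination described above can be sketched in a couple of lines of numpy. This assumes both outputs have already been resized to the same shape and scaled to 0-255 greyscale; the function name is made up for illustration.

```python
import numpy as np

def combine_depth_and_lineart(depth, lineart):
    """Per pixel: take the brighter of the two maps, then subtract the lineart."""
    depth = np.asarray(depth, dtype=np.float32)
    lineart = np.asarray(lineart, dtype=np.float32)
    brightest = np.maximum(depth, lineart)
    # Where the lineart is at least as bright as the depth map the result is 0,
    # so the lines end up drawn in black over the depth image.
    return np.clip(brightest - lineart, 0, 255).astype(np.uint8)
```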
output1


What if we don’t look at pose recognition? For example:

  1. Manually mark the time from and to and indicate which position (blowjob, blowjob with hands, wanking, from behind, horsewoman, etc.).
  2. I have quite a lot of 6-axis scripts. If they’re marked up as described above, they could be used as a training dataset for a neural network.
  3. Let the trained neural network rotate and move the cylinder at the same time?
    Blender_C__Users_CyberYou_Documents_eroscripts_RavenLane_POVR_Originals_Ass_for_Cash_Raven_Lane.blend2024-03-0117-35-13-ezgif.com-video-to-gif-converter

Why I thought so: there simply isn’t a dataset built specifically for this purpose. For that you’d need about 50 different short clips for each pose, where the skeletal bones match the picture exactly.
Even for other special tasks, models are trained specially: for a dataset, the difference between boxing/yoga/running/playing ball is significant.


Great explanation! :slight_smile:

I was just about to suggest img2img processing on VAM renders to make it closer to photorealistic! :smiley:

Although I should say that with VAM you should be able to script up a rendering pipeline that incorporates automatic scene changes, outfit changes, and appearance preset changes, so you’d get massive variety in the dataset

But still, i agree that using VAM renders to fine-tune models for multi-axis scripts does seem like a better use case compared to simple 2-dimensional scripts… for which there seem to be simpler and less intensive solutions!

ooh that’s an idea that could actually work with VAM!

So VAM could potentially output multi-axis funscripts (not sure about it since I haven’t tried it, but I’m sure there are ways to extract that data). If we generated tons of these marked clips using VAM, together with their multi-axis funscripts, and trained the neural network on that dataset, it might potentially work without requiring pose estimation?


You could try it. The penis is almost always in the same place and everything revolves around it :wink: At least it can work for videos with one girl and no complicated angles.


I agree with your suggestion of letting the model manage it and with that approach I was able to get good outputs on a lot of the more obvious cowgirl and similar scenes which is what I put on github and huggingface.
With the mentions of getting better outputs from pose recognition for videos, I was mostly just providing feedback on possible options and blewClue215’s query and looping back around on some other thoughts.

If anyone else is looking into this space, I’d say don’t take anything I’ve posted as correct, because I’m a hobbyist, not a data scientist, and I’m just sharing as I go. A lot of my understanding is based on examples like this.

Overall I look at it as:

  1. There are model architectures that are often used for extracting ‘features’ from images: resnets, vggnet, efficientnet, inception etc. They’re not necessarily only good for one thing; a detection model might use the same backbone architecture as a classification model.
  2. They can be trained from scratch, or there are pretrained weights that you can use.
  3. You can use the pretrained weights without the last layer as ‘backbones’ for your own models, which can reduce overall training time, but you’d still need to train your new layers, and for best results fine tune the original layers.

I haven’t looked back into the pose recognition models in a while, but they do have an appeal for me because their pretrained layers and architecture are proven suitable for tracking the position of a person in an image, which I would think could then be used by:

  1. stripping off the current head/output layers
  2. training a new output layer for the values we’d use
  3. and then fine tuning all the layers with a low learning rate to further improve accuracy

but some of the pose models are a bit of a black box, and they tend to be more compute intensive (slower) and memory intensive, so they don’t play well when I try to use them as a backbone for a model that creates predictions on multiple frames to ensure that the predictions are temporally consistent.

Which is why I’ve been thinking more about what a good feature extractor would be.

I’m not an expert in neural networks either, more of a budding amateur, and I haven’t done anything with them in a long time.

  1. You need a network for object recognition, something like in SLR.
  2. Depending on the current pose, that data is then passed on for conversion into angles/vectors.

It works out as follows.
Recognition of body parts → since we know which position is in use (from the preliminary markup of the video by type of sex), we can create a single-axis script with MTFG → add angles and vectors for a multi-axis script.
I think I saw someone here on the forum who already did body part recognition with further processing in MTFG or something similar. In the future we could also try to recognize the type of sex with a neural network, but only for the time markup.

I don’t really want to get into neural networks now. But I can help you for example with import/export of multi-axis scripts and markup by sex types.


https://pury.fi/


I’ll start small and see how I go :).

I’ve seen the original of that image in gif form. I’d embed it but it’s over 100mb so here’s an image for a frame of it instead.

It’s clearer in the gif, but it implies a detection model fine-tuned on a custom dataset, and then a workflow that would either rely on selection of the relevant detections by a user, or a separate automated process determining which category of detections to track. There’s probably a first stage though where detection is used to crop the image too.

The second link Falafel added is pretty cool because it’s almost exactly what they’ve done. I’ve seen a few different variations on that model and inference with python. The classes it uses are:
“FEMALE_GENITALIA_COVERED”,
“FACE_FEMALE”,
“BUTTOCKS_EXPOSED”,
“FEMALE_BREAST_EXPOSED”,
“FEMALE_GENITALIA_EXPOSED”,
“MALE_BREAST_EXPOSED”,
“ANUS_EXPOSED”,
“FEET_EXPOSED”,
“BELLY_COVERED”,
“FEET_COVERED”,
“ARMPITS_COVERED”,
“ARMPITS_EXPOSED”,
“FACE_MALE”,
“BELLY_EXPOSED”,
“MALE_GENITALIA_EXPOSED”,
“ANUS_COVERED”,
“FEMALE_BREAST_COVERED”,
“BUTTOCKS_COVERED”
The first link runs prediction at the server, so assume they’re harvesting any images you test :wink:

I’ll pick on that image for a bit though. This is what nudeNet in browser (notai.tech) thinks
image

This is what the base YOLOv8 web service does to the image
image

I’ve played around with YOLOv8 a bit before and did a quick prediction against that scene with the script I have
video
It’s a bit jittery, but YOLOv8 is in the same position as openpose here: the pretrained detection weights were trained on COCO, which is not a nudity-heavy dataset, and the base detection head only knows COCO’s 80 everyday object classes.

And if I lower the detection threshold even further for lulz
videoall

If someone wanted to copy the process hinted at in that first image, the starting point is labelled data, which probably needs to be a bit more specific to this purpose than the classes used by nudenet. The next step would be to fine tune something like YOLOv8, which is relatively well documented. From there it could be shoehorned into something like the motion tracking funscript generator extension for ofs, in place of the opencv point tracking I assume it uses.


I took the methodology from my previous models that I put on github/huggingface, and this is an example prediction generated by the ml after less than 24 hours of training on one (sometimes 2) gpu:


If for some reason that embed doesn’t work, Riding Nude Cowgirl Porn GIF by herpaderpapotato or motion.mp4 ~ pixeldrain.

Animating a 3d object manually with a 2d plane is hard, so I drew the axis with opencv on the image
tilt of the blue line is roll
tilt of the light blue line is twist
vertical position of the red line is the up/down position, length of the red line is surge
green line is sway, length of green line is pitch

I went through flowertrample’s free multiaxis scripts and grabbed the corresponding videos I have access to and ran through the following process.
Stage 1:

  • I put the videos and funscripts in a folder, matched the names and then put the timestamps of cowgirl scenes in a file named .txt. I only really did this for 3, and trained on 2. I should do more but the next steps take time.
  • Used python to extract the left half of each frame to a file named like frame.jpg in the original resolution
  • For each frame, read the funscripts and interpolated the position for the frame and stored the axis values in frame.txt
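The interpolation step in Stage 1 could look roughly like the sketch below. It assumes the usual funscript JSON layout (an "actions" list of entries with "at" in milliseconds and "pos" from 0 to 100) and a constant frame rate; the function name is hypothetical and it would be run once per axis script.

```python
import json
import numpy as np

def positions_per_frame(funscript_text, n_frames, fps):
    """Linearly interpolate funscript actions onto video frame timestamps."""
    actions = json.loads(funscript_text)["actions"]
    actions = sorted(actions, key=lambda a: a["at"])
    at_ms = np.array([a["at"] for a in actions], dtype=float)
    pos = np.array([a["pos"] for a in actions], dtype=float)
    # Timestamp of each frame in milliseconds, assuming a constant frame rate.
    frame_ms = np.arange(n_frames) * 1000.0 / fps
    # np.interp holds the first/last value for frames outside the scripted range.
    return np.interp(frame_ms, at_ms, pos)
```

Each returned value is what would be written to the matching frame.txt for that axis.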

Stage 2:

  • I wanted to speed up training, so instead of making an LSTM that used a pretrained model to generate features every epoch, I made a model which takes a sequence of 60 pregenerated feature tensors as input (each one is an array of 12x12x1280). I used efficientnetv2-s-21k-ft1k to convert all the images into features saved as numpy files.
  • I wrote a dataloader that reads all the numpy file names, works out which ones form sequential runs of 60 or more, and returns however many sets of sequence data I request. It’s a really bad dataloader because it loads that many sequences into memory, but I couldn’t be bothered writing a smarter data generator.
    At this point, across the 3 videos I prepped, there are 51830 jpgs, 51830 txt files and 51830 npy files, and approx 51,500 overlapping 60 frame sequences for the dataloader to grab from.
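The "which files form a run of 60 or more" part of the dataloader can be sketched with the standard library alone. The file naming used here (`<video>_<frame number>.npy`) is an assumption for illustration, since the real naming isn't shown, and only the window start positions are returned rather than loading anything into memory.

```python
import re
from collections import defaultdict

def find_sequences(names, length=60):
    """Return (video, first_frame) for every run of `length` consecutive frames."""
    frames = defaultdict(set)
    for name in names:
        match = re.fullmatch(r"(.+)_(\d+)\.npy", name)  # hypothetical naming
        if match:
            frames[match.group(1)].add(int(match.group(2)))
    starts = []
    for video, numbers in frames.items():
        run = 0  # how many consecutive frames end at the current one
        for n in sorted(numbers):
            run = run + 1 if (n - 1) in numbers else 1
            if run >= length:
                starts.append((video, n - length + 1))
    return starts
```

A contiguous run of n frames yields n - 59 overlapping windows, which is why the sequence count above is only a little lower than the frame count.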

You can visualize the data in the npy files and they look a bit like:

In an image classification model, that is what the final classification layer would ‘see’ and make decisions on, but I smash 60 together and tell it to do LSTM stuff with it before pushing it through a few dense layers.

Stage 3:

  • Because my dataloader is bad/dumb, I set the training to get 128 sequences at random, train for 100 epochs on them, and then loop that continuously.
  • I watched the loss numbers go down. There’s no right answer on this. The loss went down much quicker than it does when I train on a basic funscript, because the basic up/down axis has a lot more variation than the other axes; the model quickly learns to predict those within a small range, which brings the loss down but doesn’t mean it’s accurate. e.g. part of the later training after adding more sequences

Stage 4:

  • I grabbed a video for one of flowertrample’s paid scripts and used it to test, because it’s not in my source dataset and could be used for comparison without fear of contamination
  • I loaded the timestamps of cowgirl in the video, pushed the frames through efficientnetv2-s-21k-ft1k, and then made the model do predictions of 60 frames at a time, which is the number of frames I configured the model to do.
  • I used opencv to draw on the frames to indicate the predictions.
  • also checked the up down axis output on a plot
    image

Caveats/notes

  • It jitters a little every 60 frames, because each 60 frames is a separate prediction. One day I’ll be bothered to add an input to the model that takes the position of the previous frame as a starting point. Doing overlapping predictions seems wasteful but is an option.
  • The model has never seen this scene before. In training it has only seen 2 other scenes so it’s not very diverse, but it still matches that tempo reasonably well.
  • It hasn’t been trained very long
  • the base model(efficientnetv2-s-21k-ft1k) hasn’t been fine-tuned at all
  • I don’t have explicit permission to train and share a model based on flowerstrample’s scripts. Although I can share code etc, the model weights can’t be shared and I’d end up deleting them to avoid accidentally using them elsewhere.
  • I used linear interpolation to create the frame matching pos values, and so the values I generated aren’t the best, but writing something sinusoidal was more effort/time
  • although there’s plenty of multiaxis anime and pmv, I think they’d probably contradict the model’s learnings and so I can’t use them
  • I realized I broke one of my rules and the dense layers after the LSTM each have fewer neurons than the final layer, and I try to avoid doing that so I probably should redo it with fresh weights.
  • although I did speed up training by pre-generating the pretrained model outputs, it meant I couldn’t use any data augmentation in training (although maybe I could add some noise to the features).
  • I could write some python to output funscripts based off the data (I’ve made some for my previous models, but not multi-axis), but it’s too soon to bother because in all likelihood the model will freak out the minute someone transitions from cowgirl leaning back to cowgirl leaning forward with arms on chest.
  • I didn’t do any pre-cropping on the images it’s training on or predicted on, so it could be better if there were fewer distractions, which is what I saw on other models.
  • it’s only learnt from cowgirl scenes, and so is probably crap at everything else
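On the jitter every 60 frames: the overlapping-predictions option mentioned in the caveats could be sketched like this. It is not what the model above does. `predict_fn` is a hypothetical stand-in for whatever maps one window of features to an array of shape (window, axes), and the tent-shaped weighting is just one assumed way of favouring the middle of each window.

```python
import numpy as np

def predict_overlapping(features, predict_fn, window=60, stride=30):
    """Blend overlapping window predictions so window seams are smoothed out."""
    features = np.asarray(features)
    n = len(features)
    if n < window:
        raise ValueError("need at least one full window of frames")
    starts = list(range(0, n - window + 1, stride))
    if starts[-1] != n - window:
        starts.append(n - window)  # make sure the final frames are covered
    # Tent weights: frames in the middle of a window count more than its edges.
    ramp = np.arange(1, window + 1, dtype=float)
    weights = np.minimum(ramp, ramp[::-1])
    total, weight_sum = None, np.zeros(n)
    for s in starts:
        pred = np.asarray(predict_fn(features[s:s + window]), dtype=float)
        if total is None:
            total = np.zeros((n, pred.shape[1]))
        total[s:s + window] += pred * weights[:, None]
        weight_sum[s:s + window] += weights
    return total / weight_sum[:, None]
```

With a stride of half the window this roughly doubles the prediction cost, which is the wastefulness trade-off mentioned in the caveats.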

As a proof of concept though, I don’t mind it. I mostly still want to find a way to simplify images prior to prediction to improve the eventual speed and accuracy, and/or to leave some breadcrumbs for others to improve upon.
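For the funscript output mentioned in the caveats, a single-axis export could be as small as the sketch below. It assumes the same "actions" / "at" / "pos" funscript layout and a constant frame rate, keeps only the frames where the motion changes direction, and uses a made-up function name; file naming for the extra axes is left out.

```python
import json

def to_funscript(positions, fps):
    """Convert per-frame positions (0-100) into funscript JSON text."""
    pos = [int(round(min(100.0, max(0.0, p)))) for p in positions]
    keep = []
    for i, p in enumerate(pos):
        is_inner = 0 < i < len(pos) - 1
        if is_inner and (p - pos[i - 1]) * (pos[i + 1] - p) > 0:
            continue  # still moving the same way, so not a turning point
        keep.append(i)
    actions = [{"at": int(round(i * 1000 / fps)), "pos": pos[i]} for i in keep]
    return json.dumps({"actions": actions})
```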


Cool…
Of course I will send you the script in a private message, but the thing is that for this script I once again changed the math of how I write it (increased the angles, which is Falafel’s fault with his video) and began to pay more attention to female anatomy (location of the vagina / rectum + oral cavity). Let me know which videos you have access to; or, since I still have all the projects, I can give you the flat converted 1024x1024 videos and the multi-axis scripts for them straight away.

The “conqueror” one? IIRC I limited the sway range during that video… The SR6 can sway a bit wider but it will be a little shaky.


This one.
I thought these were the maximum angles…

Sorry, I might have confused things; I’m not sure which script is being referred to.

I alluded to it in a previous post, but I’ve put a previous model and code up, with the code at herpaderpapotato/silver-lamp: ml to assist funscript creation (github.com). I’m not really trying to drive traffic to it though so I’ve been avoiding posting the url. It uses weights I put on huggingface herpaderpapotato (herpader papotato) (huggingface.co) which was trained on datasets composed by me.

I’ll put up the code I used to do all the above as well, after I’ve had a chance to tidy it up. Whether the weights are shared will be up to flowerstrample, because they would be based on their intellectual property.
I’ve only trained spectacularly bad models though, so it’s not that I’m expecting it to come close to human funscripts. Sharing model weights would really just be to evidence that I’m not making all this up :slight_smile:


I kept the training running with a sequence from a third scene added to the dataset and it got way worse at predictions. I eventually took a look at the sequence and realized it was a much different and more subtle style of cowgirl; it’s like a forward thrusting with no obvious vertical movement.
I’m going to limit this to just obvious vertical cowgirl movement for now, so I took it out of the dataset and continued training.

It kind of got better eventually, but I figured I wanted to fix the smaller middle layers and so started a fresh model from scratch.

The fresh model could predict the same sequence shown previously after about 3 hours of training, but then I realized I still had some middle layers thinner than the final layer, so I started again with another fresh model.

I shifted the code to a fresh repo and dumped it in herpaderpapotato/glowing-octo-giggle (github.com). The image of the plot shown is for the most recent training. There’s virtually no polish on any of it, but ¯\_(ツ)_/¯


I’m still questing for a better feature extractor and model architecture, and I ended up dropping the LSTM from the model and trying DepthwiseConv2D layers.

e.g. I predict against 60 frames and use efficientnet to extract the features, so each sequence is an array of (60, 12, 12, 1280).

e.g. 60 of something like this

image

I have an animated version of that somewhere to show how it changes over a scene, but basically the brighter blobby cells tended to stay bright and blobby, and the blobs in the cells move around and change within their cells based on the input video.

With the LSTM and initial Dense layers, the model architecture was basically saying “take some relevant stuff from all that data, and each of the other frame’s data, and predict an output”.

But that’s hard for a model to learn, because the data in each cell is most immediately relevant to the data in the same cell on the other frames, not to all the other cells in all the other frames. Thinking about it, what would be better is if there was an LSTM for each 12 x 12 feature.
image
Image blown up 10x

There are 1280 features like that per frame, and I want to do predictions against 60 frames at a time for smoothness. A model of 1280 LSTMs takes 10+ minutes to build and 2 minutes to predict, so it’s not practical to do it that way.

Instead I’m trying a DepthwiseConv2D layer, which is described as:
“Depthwise convolution is a type of convolution in which each input channel is convolved with a different kernel (called a depthwise kernel).”
Which is good, because treating each of the 1280 features as its own channel and transforming it with weights relevant to that feature makes sense, but I’m still not 100% sure I’m using it right, or whether there’s a better way of doing it. I can think of ways of using the DepthwiseConv2D plus the LSTM which make more sense, but I’ll give my current training run more time to see if I think of anything else before wiping the slate clean and starting fresh.
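To make the quoted definition concrete, here is a small numpy sketch of what a depthwise convolution computes on one frame's feature map. It is only an illustration of the operation (valid padding, stride 1, one kernel per channel), not the Keras layer and not the model described in this post, and it needs numpy 1.20+ for `sliding_window_view`.

```python
import numpy as np

def depthwise_conv2d(x, kernels):
    """Depthwise convolution of x (H, W, C) with kernels (kh, kw, C).

    Channel c of the output depends only on channel c of the input,
    which is the property that makes it 'depthwise'.
    """
    kh, kw, _ = kernels.shape
    # windows has shape (H - kh + 1, W - kw + 1, C, kh, kw)
    windows = np.lib.stride_tricks.sliding_window_view(x, (kh, kw), axis=(0, 1))
    # Multiply each window by its own channel's kernel and sum over the window.
    return np.einsum("ijckl,klc->ijc", windows, kernels)
```

For a 12x12x1280 feature map and 3x3 kernels this gives a 10x10x1280 output using only 3*3*1280 weights, because no weights mix information between channels.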

This approach has had better results than the LSTM approach, but I end up restarting training from scratch a fair bit when testing. Lots of trial and error.

On a different approach, I did see that MMPose documents implementing your own model. I haven’t wrapped my head around it yet, but implementing your own custom model head on the MMPose base is along the lines of what I described previously: Implement New Models — MMPose 1.3.1 documentation.


A little off-topic, but I’ll share a thought nonetheless, since I can’t do it myself yet.
Have you looked into creating a 3-dimensional mesh from 180 VR video and then fitting a rig to it? In that case all you’d need is the relationship between the pelvic bone and the dick for sex, the head bone and the dick for blowjobs, and the hand bone and the dick for wanking.
We already have calibrated VR video, which can be imported into software (Blender/Maya, etc.) as a mesh, and pose recognition could then be attempted there. With your approach of feeding VR video to a neural network we lose the depth data (it’s recognizing poses from 2D pictures).

My understanding is that a depth map generated by something like midas or normalbae is much like a point cloud, although it’s missing any obscured surfaces. e.g. 512h x 512w x depth

I’ve never tried rigging them though. I have done some visualizations with them, but found that what works well for one scene doesn’t work very well for another. I had a gif above with lineart output superimposed on midas as an example of what they ‘see’ over time.

output1
e.g. aside from the flickering(depth changing) on the ceiling, this one was fairly consistent over time.

midas uses a trained resnet backbone to extract features in its encoder, and then the decoder is largely doing upscaling.
I’ve seen that a relatively recent depth estimator called Marigold was released. It vae encodes the image, then pushes it through a unet (basically another encoder/decoder model), and it does quite well, although from what I read previously it’s not particularly temporally consistent. e.g. The State of the Art of Depth Estimation from Single Images | by PatricioGonzalezVivo | Medium

Essentially the encoder stage distils information out of the images into feature maps and then the decoders work backwards to create depth estimation images that are a lot like point clouds. I look at that and my take-away is that all the information needed is in the feature map, and instead of needing to add the overhead of decoding that image and then doing it over time and encoding to something usable, it should just be working with the feature maps.

The significant difference is that the feature extractors in a midas model are trained/finetuned on their task, whereas the efficientnet backbone I used was trained to extract features that make it an accurate categorization model. It might be okay for the task I gave it: visualizing the feature map outputs in video shows they do hold positional data relevant to the image, but it could always be better. For any model I’ve done above, fine tuning the backbone after initial training helps with that.

And then when it comes back to skeleton/pose detection or human rigging, I revert to looking at that last link I posted about a custom head for mmpose. I was originally thinking of it all being something like Human Pose Classification with MoveNet and TensorFlow Lite, but that’s literally just using the 17 keypoints output from movenet, and as per my original post in this thread, they’re very hit and miss in their pretrained state.

I’m not sure what the calibrated video for VR which imports into software as mesh is, but that reminds me: if anyone has lens distortion maps for commonly used VR lenses, they’d be good to have. The tests I did with OpenCV: Depth Map from Stereo Images really didn’t seem to like the wider fov and lens distortion, and I’ve seen articles like Correcting lens distortion using FFMpeg | Daniel Playfair Cal’s Blog which offer more accurate ways of correcting it.
