Gifs of pose estimation on vr video

A smaller update. 10 hours of further training later, training time on the current dataset is starting to hit diminishing returns, which is expected. Rather than keeping that going for longer, I’m planning on enabling training on the base model and seeing how far that goes.

The current state of the model on an ‘easy’ scene looks like this.


I think the close-up crop (poi) learns movement the easiest, but by training on too many different levels of crop it might be making the full frame predictions dumber.

in the literal sense, it looks like this, Brunette Doggystyle POV Porn GIF by herpaderpapotato

Rather than stashing more code on the mega, I put some demo code up for testing on github and a copy of the trained models on huggingface. I’m not trying to drive traffic there, just sharing the work, so instead of putting a link here: if you search for my username on github, it’ll be the repo silver-lamp (edit to fix name). Standard disclaimer: it needs some experience with python etc. to work with. Again, this isn’t a “go here and try this”, it’s more a sharing of information for those looking at similar things.

edit: to avoid spamming, small, small update: these are the direct outputs generated using the onnx model with wsl2 from the github. The point of interest (poi) script is the least bad. It’s still failing terribly on some scenes, but progress compared to previous.
milfvr-gobble-squabble-180_180x180_3dh_LR.mp4.poi.20231125-012514.funscript (4.1 MB)
milfvr-gobble-squabble-180_180x180_3dh_LR.mp4.crop.20231125-012514.funscript (4.1 MB)
milfvr-gobble-squabble-180_180x180_3dh_LR.mp4.20231125-012514.funscript (4.1 MB)
Disclaimer: anyone may open them and review them in OFS, but I don’t recommend playing them back on hardware. They’re also large because the model predicts every frame.

Love this, learning a lot here. Thanks, and please keep these posts rolling.

The 3 funscripts from my previous post make some nice waveforms in OFS (on types of scenes the model has learnt), but that’s about it.

Eventually a model might do well enough with obvious stuff that its raw output is usable as a base for scripting from. If accuracy increases and tooling adapts to make things like range-stretching the output easier, then post processing might not be necessary. But for now, we can brute force find peaks using python, e.g. top is just a “set peaks at 100, troughs at 0”.

I’ve added the python notebook used to transform a full frame prediction script into just a peak to peak prediction script to github, and the resulting funscript files are only ~240 - 550 KB.
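A minimal sketch of the “set peaks at 100, troughs at 0” idea, for anyone following along. It is not the notebook from the repo; it only assumes the usual funscript layout of an `actions` list holding `at` (milliseconds) and `pos` (0 to 100) entries, and the function name and demo values are made up for illustration.

```python
def peaks_and_troughs(actions):
    """Keep only turning points, with peaks set to 100 and troughs to 0.

    `actions` is a time-sorted list of {"at": ms, "pos": 0-100} dicts,
    i.e. the "actions" array of a funscript.
    """
    kept = []
    direction = 0  # +1 while rising, -1 while falling, 0 before any movement
    for prev, cur in zip(actions, actions[1:]):
        step = cur["pos"] - prev["pos"]
        if step == 0:
            continue  # flat stretch: wait for the next real movement
        new_direction = 1 if step > 0 else -1
        if direction == 1 and new_direction == -1:
            kept.append({"at": prev["at"], "pos": 100})  # rise then fall: peak
        elif direction == -1 and new_direction == 1:
            kept.append({"at": prev["at"], "pos": 0})    # fall then rise: trough
        direction = new_direction
    return kept


# Hypothetical per-frame predictions, one action every 33 ms
demo = [{"at": i * 33, "pos": p}
        for i, p in enumerate([10, 40, 80, 60, 20, 30, 70, 50])]
result = peaks_and_troughs(demo)
```

On real predictions every small wobble is a turning point, so some smoothing or a minimum stroke size would probably be needed before this step.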

milfvr-gobble-squabble-180_180x180_3dh_LR.mp4.poi.20231125-012514_postprocessed.funscript (552.0 KB)
milfvr-gobble-squabble-180_180x180_3dh_LR.mp4.20231125-012514_postprocessed.funscript (242.1 KB)
milfvr-gobble-squabble-180_180x180_3dh_LR.mp4.crop.20231125-012514_postprocessed.funscript (381.2 KB)

They’re still very bad at some positions, and constant 0 - 100 is also bad, but this is the tip (peak?) of the iceberg for postprocessing. e.g.

  • Instead of the POS value being a flat 100 or 0, a peak could be ((actual POS+100)/2) and a trough just (actual POS/2). This would maintain a higher range while tying the values to the prediction.
  • Or, intermediate frames could be added with their value relative to the peak.
  • Or wave finding functions could be used to generate smoothed intermediate values etc.
  • Or, the values of the intermediate frames could be preserved and just normalized between the peaks.
  • And anomaly detection could find missed strokes and allow manual, or single button press to fix/accept a recommendation.
  • And scenes that it’s very bad at could be added to the dataset for further tuning of the model, reducing the need for post processing.
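Two of the options in the list above can be sketched in a few lines: the halfway blend for peaks and troughs, and keeping the intermediate frames but normalizing them between the peaks. The function names are hypothetical, and the linear rescale is only one possible reading of “normalized”.

```python
def blend_extremum(pos, is_peak):
    """Pull a predicted extremum halfway towards the full-range value."""
    return (pos + 100) / 2 if is_peak else pos / 2


def rescale_segment(positions, new_start, new_end):
    """Linearly rescale one run of predicted positions between two extrema.

    The first and last values land on new_start and new_end, and the
    intermediate frames keep their relative shape.
    """
    old_start, old_end = positions[0], positions[-1]
    span = old_end - old_start
    if span == 0:
        return [float(new_start)] * len(positions)
    return [new_start + (p - old_start) / span * (new_end - new_start)
            for p in positions]


peak = blend_extremum(80, is_peak=True)      # predicted peak of 80
trough = blend_extremum(30, is_peak=False)   # predicted trough of 30
stretched = rescale_segment([70, 60, 40, 30], 100, 0)
```

The blend keeps the script tied to what the model predicted, while the rescale throws away the predicted amplitude and only keeps the shape of the stroke.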

I added an onnx model to huggingface and a prediction script to github (with its own requirements.txt). Onnx can be used for full CUDA acceleration in Windows if the right CUDA libraries are installed; however, I only tested it with WSL. Onnx seemed faster than tensorflow and took less time to perform the prediction than it did to gather the frames for each prediction, but it also needed about 8GB of vram, so I may look at quantization on it in the future.

I was inspired by falafel’s video although with a different goal of making the worst funscripts known to man with AI. Here’s my version, how I make funscripts that will simultaneously give you blue balls and rip your junk off.

This example does not use any close up (poi) crops which are normally much more accurate but a bit of a crutch, and the latest model is improving with full frame predictions.

I’ve added some observations in github around the model’s hyperparameters and why I was having so little success with anything larger than the small efficientnetv2 model. My thinking is that the bigger models’ feature extraction layers are subtler, and the number of dense neurons after the efficientnetv2S was previously only just enough to barely function as intended. In newer models I’ve beefed up the dense layer in the middle of the model and that became the thicc variant. I took that even further and made a doublethicc variant. The onnx version of the doublethicc variant was used for the above video and significantly improved the output of full frame predictions. At the moment it’s held back by the lack of variety in the training data, so the next focus is probably there.

The funscripts generated in the video are in gobble ~ pixeldrain if anyone wants to review them. They’re relatively big funscripts so I thought they’d be better off on filehosting. Everything else used in the video is on github or huggingface.

Have you trained the model?

I trained a few models in the above, but they’re not useful for various reasons.

I was looking at some image preprocessing methods the other day and I think the lineart model used as an image annotator with image generation might be a better model to use instead of the efficientnets I’ve been using.

output

From top to bottom, left to right:
original, canny, hed, lineart
lineart anime, midas, normal bae, oneformer
openpose (too close for it to work anything out), pidinet, zoe, blank

I’ve tried simplifying the image being fed to the model to lighten the workload of training and inference (although I still have dataset challenges), and lineart seems to be really consistent and fast at about 30ms per frame unbatched, and I want to try to utilize that as a feature extractor.

If you’re looking for something usable, I’d say check out Motion Tracking Funscript Generator v0.5.x - Software - EroScripts or How to use FunscriptToolBox MotionVectors Plugin in OpenFunscripter - howto - EroScripts, or pretty much any other posts under the software category. At some point I’d be curious to see how a custom model would go if it could utilize the motion vectors generated by the latter, but there wouldn’t be any pretrained feature extractors available and it’d be a slower initial training phase as well as require a fair bit of reprocessing on my current dataset.

Hello!

Honestly love some of the work you’re doing here; I won’t pretend to be able to understand it :joy:

But it looks like training data is the blocker for you, i wonder if you’ve considered using VAM to synthesize these training data? ( perhaps writing a vam script that outputs renders + labels, you may be able to get accurate buffer data like depth maps; edges and all that from VAM too)

I’m not sure if it’d work though if the rendering for these scenes are not photorealistic :upside_down_face:

EDIT: I’m considering using VAM to synthesize views for 4D Gaussian so i thought this might be useful for you :slight_smile:

I would also be really interested to see if we could train these pose estimation models on VAM renders and produce really accurate inference results? :thinking:

obligatory wall of text because I’m not good at simple explanations
The pose stuff: movenet, openpose, mmpose, mediapipe etc. These work pretty well on normal 2d videos for this sort of thing, and I’ve seen some topics from other users leveraging them as a part of generating funscripts.
This is a gif of openpose based on a 2d scene I had lying around.
output
The open pose model outputs a series of coordinates which correspond to bones/connections, and you could probably create a really good 1D starter script if you took the coordinates of the hip points and head points and did some processing on them. At the least it should provide a relatively decent representation of peaks and troughs which could be edited into something better with less effort than starting from scratch.
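As a rough sketch of that processing step, assuming the hip keypoint has already been pulled out of the pose model output as one y pixel coordinate per frame (the input list and function name here are hypothetical, not an openpose API):

```python
def hips_to_positions(hip_y):
    """Min-max scale per-frame hip y-coordinates to the funscript 0-100 range.

    Image y grows downwards, so the scale is inverted: the highest point
    on screen becomes 100 and the lowest becomes 0.
    """
    lowest, highest = max(hip_y), min(hip_y)
    if lowest == highest:
        return [50] * len(hip_y)  # no movement detected at all
    return [round(100 * (lowest - y) / (lowest - highest)) for y in hip_y]


# Hypothetical hip y-coordinates (pixels) over five frames
positions = hips_to_positions([300, 280, 260, 280, 300])
```

Measuring the hip relative to the head point instead of the raw pixel row might cancel out some camera movement, but that is a guess and would need testing per scene.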

I think you’re right that something like VAM could be used to generate a labelled dataset of pose estimation for both vr and 2d videos to improve accuracy, and it’s similar to the initial comments flowerstrample made. My understanding is that there’s no porn in the CMU Panoptic Studio dataset, which is what openpose was trained on, or in any other pose dataset, so even for 2d there’s probably plenty of room for improvement. I don’t think it would take a large amount of retraining of one of those models to achieve that, given that these things hold most of the knowledge in the base model and often just need fine tuning of the later layers.

I’d guess it would need a variety of scene types, and input images as realistic as possible to match the type of images it would be used against. The dataset of images generated could be pushed through something like stable diffusion with img2img generation and an openpose controlnet to generate an even larger and more varied set of data. I don’t really know much about VAM though, and potentially loading enough unique scenes and poses (and exporting them in a VR view) could be more time intensive than manually labelling poses. VAM definitely could produce a large amount of similar images in a sequence much faster than manual labelling, but my guess is that variety in the dataset would be most beneficial to the model.

VAM or other 3d computer graphics could be valuable for generating a large dataset for a pretrained model which tracked 6 axis motion. A smaller dataset of manually annotated filmed scenes could be used to fine tune it and make it usable for normal videos, but I still wouldn’t expect any approach I’ve taken so far to be good and accurate enough for real time, script-less function. Instead I’d still predict the best I might manage is something that can generate a starter funscript to save time for human scripters, similar to the other posts I linked previously.

It could even be the case that a VAM or computer generated dataset could be used to train pose estimation using multiple frames to try to achieve temporally consistent results, but that’s a lot to bite off. This would be similar to how movinet does it, but movinet only does an overall classification of action for the whole sequence, and the inner workings of things like that are beyond me. MoViNet for streaming action recognition | TensorFlow Hub

Stepping back from the pose models: I’ve looked at using yolov8, which offers bounding box detection, but it’s complex to get the feature extractor layers with the right classification info into a time sequence model. I did put together something that would extract every prediction from a frame, which might be useful for the LSTM style model that I’ve made in previous posts, but I haven’t pursued it. It’s not very temporally consistent, or very fast, but mostly it just didn’t feel like a lead I wanted to follow.

That’s why the lineart model caught my attention. Compared to line detection with canny, it was very smooth and temporally consistent. It also simplifies the image a lot, which means there’s less data/noise to process, and it passed the test of “if all I had was this output, could I write a semi accurate funscript”. The base model for it is a pytorch model though, and I’m more of a tensorflow user so it’d take some effort to convert, and it still outputs a 512x512x1 array per frame, which would still need layers to simplify it down to a smaller size before I could use it in a model that processed a sequence of frames.

Including this gif because I thought it was cool. It was the product of choosing the highest value pixels from midas and lineart output, and then subtracting the lineart output values. I’d still try lineart alone because I think it’d be simpler for a model to learn.
output1
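The combination described for that gif can be written in a couple of numpy calls. This is a sketch under the assumption that both outputs are same-sized uint8 grayscale arrays; the function name and the tiny arrays are invented for the example.

```python
import numpy as np


def depth_minus_lines(midas, lineart):
    """Take the per-pixel maximum of both maps, then subtract the lineart.

    Where the line is at least as bright as the depth value the result is 0,
    elsewhere it is the depth value minus the line value. The maximum is never
    smaller than the lineart, so the uint8 subtraction cannot wrap around.
    """
    combined = np.maximum(midas, lineart)
    return combined - lineart


# Tiny hypothetical 2x2 "images"
midas = np.array([[10, 200], [50, 0]], dtype=np.uint8)
lineart = np.array([[0, 255], [50, 30]], dtype=np.uint8)
result = depth_minus_lines(midas, lineart)
```

In effect the lines get carved out of the depth map as dark strokes.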

What if we don’t look at pose recognition? For example:

  1. Manually mark the time ranges (from and to) and indicate which position is shown (blowjob, blowjob with hands, wanking, from behind, cowgirl, etc.).
  2. I have quite a lot of 6-axis scripts; if they are marked up as described above, they could serve as a training dataset for the neural network.
  3. Let the trained neural network rotate and move the cylinder all at once?
    Blender_C__Users_CyberYou_Documents_eroscripts_RavenLane_POVR_Originals_Ass_for_Cash_Raven_Lane.blend2024-03-0117-35-13-ezgif.com-video-to-gif-converter

Why I thought so: there simply isn’t a dataset made specifically for this purpose. You would need about 50 different short clips for each pose, where the skeletal bones match the picture exactly.
Even for particular tasks, models are trained specially: for a dataset, the difference between boxing/yoga/running/playing ball is significant.

Great explanation! :slight_smile:

I was just about to suggest img2img processing on VAM renders to make it closer to photorealistic! :smiley:

Although i should say that with VAM you should be able to script up a rendering pipeline that incorporate automatic scene changes, outfit changes, and appearance preset changes so you’d get massive variety in the dataset

But still, i agree that using VAM renders to fine-tune models for multi-axis scripts does seem like a better use case compared to simple 2-dimensional scripts… for which there seem to be simpler and less intensive solutions!

ooh that’s an idea that could actually work with VAM!

So VAM could potentially output multi-axis funscripts ( not sure about it since i havent tried it, but i’m sure there are ways to extract that data ) but if we generate tons of these marked clips using VAM and their multi-axis funscripts and trained the neural network on that dataset; it might potentially work without requiring pose-estimation?

You could try it. The penis is almost always in the same place and everything revolves around it :wink: At least it can work for videos with one girl and no complicated angles.

I agree with your suggestion of letting the model manage it and with that approach I was able to get good outputs on a lot of the more obvious cowgirl and similar scenes which is what I put on github and huggingface.
With the mentions of getting better outputs from pose recognition for videos, I was mostly just providing feedback on possible options and blewClue215’s query and looping back around on some other thoughts.

If anyone else is looking into this space I’d say don’t take anything I’ve posted as correct because I’m a hobbyist, not a data scientist, I’m just sharing as I go. A lot of my understanding is based on examples like this.

Overall I look at it as:

  1. There are model architectures that are often used for extracting ‘features’ from images: resnets, vggnet, efficientnet, inception etc. They’re not necessarily only good for one thing; a detection model might use the same backbone architecture as a classification model.
  2. They can be trained from scratch, or there are pretrained weights that you can use.
  3. You can use the pretrained weights without the last layer as ‘backbones’ for your own models, which can reduce overall training time, but you’d still need to train your new layers, and for best results fine tune the original layers.

I haven’t looked back into the pose recognition models in a while, but they do have an appeal for me because their pretrained layers and architecture are proven suitable for tracking the position of a person in an image, which I would think could then be used by:

  1. stripping off the current head/output layers
  2. training a new output layer for the values we’d use
  3. and then fine tuning all the layers with a low learning rate to further improve accuracy

but some of the pose models are a bit of a black box, they tend to be more compute intensive (slower), and they can be a bit memory intensive, so they don’t play well when I try to use them as a backbone for a model that creates predictions on multiple frames to ensure that the predictions are temporally consistent.

Which is why I’ve been thinking more about what a good feature extractor would be.

I’m not an expert in neural networks either, more of a budding amateur, and I haven’t done anything with them in a long time.

  1. It is necessary to have a network for object recognition, something like in SLR
  2. Depending on the present pose, this data is already transmitted for conversion into angles/vectors.

It turns out as follows.
Recognition of body parts → since we know which position is in use (from the preliminary markup of the video by type of sex), we can create a single-axis script with MTFG → then add angles and vectors for a multi-axis script.
I think I saw someone here on the forum who already did body part recognition with further processing in MTFG or something similar. In the future we could also try to recognize the type of sex with a neural network, to do the marking by time.

I don’t really want to get into neural networks now. But I can help you for example with import/export of multi-axis scripts and markup by sex types.

https://pury.fi/

I’ll start small and see how I go :).

I’ve seen the original of that image in gif form. I’d embed it but it’s over 100mb so here’s an image for a frame of it instead.

It’s clearer in the gif, but it implies a detection model fine-tuned on a custom dataset, and then a workflow that would either rely on selection of the relevant detections by a user, or a separate automated process determining which category of detections to track. There’s probably a first stage though where detection is used to crop the image too.

The second link Falafel added is pretty cool because it’s almost exactly what they’ve done. I’ve seen a few different variations on that model and inference with python. The classes it uses are:
“FEMALE_GENITALIA_COVERED”,
“FACE_FEMALE”,
“BUTTOCKS_EXPOSED”,
“FEMALE_BREAST_EXPOSED”,
“FEMALE_GENITALIA_EXPOSED”,
“MALE_BREAST_EXPOSED”,
“ANUS_EXPOSED”,
“FEET_EXPOSED”,
“BELLY_COVERED”,
“FEET_COVERED”,
“ARMPITS_COVERED”,
“ARMPITS_EXPOSED”,
“FACE_MALE”,
“BELLY_EXPOSED”,
“MALE_GENITALIA_EXPOSED”,
“ANUS_COVERED”,
“FEMALE_BREAST_COVERED”,
“BUTTOCKS_COVERED”
The first link runs prediction at the server, so assume they’re harvesting any images you test :wink:

I’ll pick on that image for a bit though. This is what nudeNet in browser (notai.tech) thinks
image

This is what the base YOLOv8 web service does to the image
image

I’ve played around with YOLOv8 a bit before and did a quick prediction against that scene with the script I have
video
It’s a bit jittery, but YOLOv8 would be similar to the comments about openpose in that the base weights were trained on COCO, which is not a nudity heavy dataset, and the base categories the detection head is trained on are only 80 everyday object classes.

And if I lower the detection threshold even further for lulz
videoall

If someone wanted to copy the process hinted at in that first image then the starting point is labelled data, and it probably needs to be a bit more specific to this purpose than the classes used by nudenet, and the next step would be to fine tune something like YOLOv8 which is relatively well documented. From there it could be shoehorned into something like the motion tracking funscript generator extension for ofs instead of the opencv point tracking I assume it uses.

I took the methodology from my previous models that I put on github/huggingface, and this is an example prediction generated by the ml. After less than 24 hours of training on one (sometimes 2) gpu, it does this:


If for some reason that embed doesn’t work, Riding Nude Cowgirl Porn GIF by herpaderpapotato or motion.mp4 ~ pixeldrain.

Animating a 3d object manually with a 2d plane is hard, so I drew the axes with opencv on the image:
tilt of the blue line is roll
tilt of the light blue line is twist
red line is vertical position, length of red line is surge
green line is sway, length of green line is pitch

I went through flowertrample’s free multiaxis scripts and grabbed the corresponding videos I have access to and ran through the following process.
Stage 1:

  • I put the videos and funscripts in a folder, matched the names and then put the timestamps of cowgirl scenes in a file named .txt. I only really did this for 3, and trained on 2. I should do more but the next steps take time.
  • Used python to extract each left half frame to a file named like frame.jpg in the original resolution
  • For each frame, read the funscripts and interpolated the position for the frame and stored the axis values in frame.txt
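The interpolation step in Stage 1 could look roughly like this. It is a sketch rather than the script I used: it assumes a constant frame rate, so that frame `f` sits at `1000 * f / fps` milliseconds, and the function names are made up.

```python
from bisect import bisect_right


def pos_at(actions, t_ms):
    """Linearly interpolate the position at t_ms from time-sorted funscript actions."""
    times = [a["at"] for a in actions]
    i = bisect_right(times, t_ms)
    if i == 0:                     # before the first action: hold the first value
        return float(actions[0]["pos"])
    if i == len(actions):          # at or after the last action: hold the last value
        return float(actions[-1]["pos"])
    a, b = actions[i - 1], actions[i]
    frac = (t_ms - a["at"]) / (b["at"] - a["at"])
    return a["pos"] + frac * (b["pos"] - a["pos"])


def frame_positions(actions, n_frames, fps):
    """One interpolated value per video frame (constant frame rate assumed)."""
    return [pos_at(actions, 1000.0 * f / fps) for f in range(n_frames)]


# Hypothetical one-stroke script sampled at 4 fps
script = [{"at": 0, "pos": 0}, {"at": 1000, "pos": 100}]
per_frame = frame_positions(script, n_frames=5, fps=4)
```

For a multi-axis set the same function would be run once per axis file, with the results written side by side into the frame's txt file.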

Stage 2:

  • I wanted to speed up training, so instead of making an LSTM that used a pretrained model to generate features every epoch, I used efficientnetv2-s-21k-ft1k to pregenerate features for all the images, saved as numpy files. The model then just takes a sequence of 60 of those feature tensors (each an array of 12x12x1280) as input, which saves training time.
  • I wrote a dataloader that reads all the numpy file names, works out which ones form sequential runs of 60 or more, and returns however many sets of sequence data I request. It’s a really bad dataloader because it loads that many sequences into memory, but I couldn’t be bothered writing a smarter data generator.
    At this point, across the 3 videos I prepped there are 51830 jpgs, 51830 txt files and 51830 npy files, and approx 51,500 overlapping 60 frame sequences for the dataloader to grab from.
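The “which frames are sequential” part of the dataloader can be sketched without touching the feature files. This assumes the frame numbers have already been parsed out of the npy file names; the function name is hypothetical.

```python
def sequence_starts(frame_numbers, length=60):
    """Return every frame number that starts `length` consecutive frames."""
    present = sorted(set(frame_numbers))
    starts = []
    run_start = 0  # index in `present` where the current unbroken run began
    for i in range(len(present)):
        if i > 0 and present[i] != present[i - 1] + 1:
            run_start = i  # gap in the numbering: a new run begins here
        if i - run_start + 1 >= length:
            starts.append(present[i - length + 1])
    return starts


# Hypothetical frame numbers with a gap between 4 and 10
frames = list(range(0, 5)) + list(range(10, 13))
starts = sequence_starts(frames, length=3)
```

An unbroken run of N frames gives N - 59 overlapping 60 frame windows, which is why the sequence count ends up only slightly below the frame count. Picking a random training batch is then something like `random.sample(starts, 128)`.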

You can visualize the data in the npy files and they look a bit like:

In an image classification model, that is what the final classification layer would ‘see’ and make decisions on, but I smash 60 together and tell it to do LSTM stuff with it before pushing it through a few dense layers.

Stage 3:

  • Because my dataloader is bad/dumb, I set the training to get 128 sequences at random, train for 100 epochs on them, and then loop that continuously.
  • I watched the loss numbers go down. There’s no right answer on this: the loss went down much quicker than it did when I trained on basic funscripts, because the basic stroke axis has a lot more variation than the other axes, so the model quickly learns to predict within their smaller range, which brings the loss down but doesn’t mean it’s accurate. e.g. part of the later training after adding more sequences

Stage 4:

  • I grabbed a video for one of flowertrample’s paid scripts and used it to test, because it’s not in my source dataset and could be used for comparison without fear of contamination
  • I loaded the timestamps of cowgirl in the video, pushed the frames through efficientnetv2-s-21k-ft1k, and then made the model do predictions of 60 frames at a time, which is the number of frames I configured the model to do.
  • I used opencv to draw on the frames to indicate the predictions.
  • I also checked the up/down axis output on a plot
    image

Caveats/notes

  • It jitters a little every 60 frames, because each 60 frames is a separate prediction. One day I’ll be bothered to add an input to the model that takes the position of the previous frame as a starting point. Doing overlapping predictions seems wasteful but is an option.
  • The model has never seen this scene before. In training it has only seen 2 other scenes, so it’s not very diverse, but it’s still matching the tempo reasonably well.
  • It hasn’t been trained very long
  • the base model (efficientnetv2-s-21k-ft1k) hasn’t been fine-tuned at all
  • I don’t have explicit permission to train and share a model based on flowerstrample’s scripts. Although I can share code etc, the model weights can’t be shared and I’d end up deleting them to avoid accidentally using them elsewhere.
  • I used linear interpolation to create the frame matching pos values, and so the values I generated aren’t the best, but writing something sinusoidal was more effort/time
  • although there’s plenty of multiaxis anime and pmv, I think they’d probably contradict the model’s learnings and so I can’t use them
  • I realized I broke one of my rules: the dense layers after the LSTM each have fewer neurons than the final layer. I try to avoid doing that, so I probably should redo it with fresh weights.
  • although I did speed up training by pre-generating the pretrained model outputs, it meant I couldn’t use any data augmentation in training (although maybe I could add some noise to the features).
  • I could write some python to output funscripts based off the data (I’ve made some for my previous models, but not multi-axis), but it’s too soon to bother because in all likelihood the model will freak out the minute someone transitions from cowgirl leaning back to cowgirl leaning forward with arms on chest.
  • I didn’t do any pre-cropping on the images it’s training on or predicted on, so it could be better if there are fewer distractions, which is what I saw on other models.
  • it’s only learnt from cowgirl scenes, and so is probably crap at everything else
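On the jitter every 60 frames: the overlapping predictions option mentioned in the first caveat could be blended like this. It is only a sketch of that idea, not something I have run against the model. It assumes each window is a plain list of per-frame values for one axis, that windows start `stride` frames apart, and that `stride` is no larger than the window length so every frame is covered.

```python
def blend_windows(window_preds, stride):
    """Average overlapping fixed-length predictions into one per-frame series.

    window_preds[w][j] is the prediction for frame w * stride + j.
    Requires stride <= window length, otherwise some frames get no prediction.
    """
    length = len(window_preds[0])
    total = stride * (len(window_preds) - 1) + length
    sums = [0.0] * total
    counts = [0] * total
    for w, preds in enumerate(window_preds):
        for j, value in enumerate(preds):
            sums[w * stride + j] += value
            counts[w * stride + j] += 1
    return [s / c for s, c in zip(sums, counts)]


# Two hypothetical 4 frame windows that overlap by 2 frames
blended = blend_windows([[0, 0, 10, 10], [20, 20, 30, 30]], stride=2)
```

With 60 frame windows and a stride of 30, every frame away from the very start and end would be predicted twice, so inference cost roughly doubles, which is the wasteful part.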

As a proof of concept though, I don’t mind it. I mostly still want to find a way to simplify images prior to prediction to improve the eventual speed and accuracy, and/or to leave some breadcrumbs for others to improve upon.

Cool…
Of course I will send you the script in a private message, but on this script I changed the math of the writing once again (increased the angles; it’s Falafel’s fault with his video) and began to pay more attention to female anatomy (location of the vagina/rectum and oral cavity). Let me know which videos you have access to. Alternatively, I still have all the projects, so I can give you the flat converted 1024x1024 videos together with the multi-axis scripts for them.

The “conqueror” one? IIRC I limited the sway range during that video… The SR6 can sway a bit wider but it will be a little shaky.

This one.
I thought these were the maximum angles…