This probably isn't anything that hasn't already been discussed or thought about by plenty of people, but I wanted to see some samples of how these behaved on VR scenes and couldn't find any.
This community seems pretty sharing, so I figured I'd share what I generated in case anyone else had the same curiosity. It's all from the same video; maybe someone can guess which video.
Full mp4 outputs and keypoint outputs are in a 945.41 MB folder on MEGA
movenet
I started with movenet; it was the easiest to work with and had good speed, but felt the least accurate.
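For anyone curious what "easiest to work with" means in practice, the loop is roughly the sketch below. It assumes the TF Hub singlepose lightning variant and a hypothetical preprocessed file name; the exact model variant and settings used for the shared outputs may differ.

```python
# Minimal sketch of running MoveNet (singlepose lightning from TF Hub) over video frames.
# The hub URL, 192x192 input size, and file names are assumptions, not the exact setup used.
import cv2
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

model = hub.load("https://tfhub.dev/google/movenet/singlepose/lightning/4")
movenet = model.signatures["serving_default"]

cap = cv2.VideoCapture("cropped_scaled.mp4")  # hypothetical preprocessed video
keypoints_per_frame = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # The lightning variant expects a 192x192 int32 RGB image with a batch dimension.
    img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    img = tf.image.resize_with_pad(tf.expand_dims(img, axis=0), 192, 192)
    img = tf.cast(img, dtype=tf.int32)
    out = movenet(img)
    # Output shape is [1, 1, 17, 3]: (y, x, confidence) per keypoint, normalized 0-1.
    keypoints_per_frame.append(out["output_0"].numpy()[0, 0])
cap.release()

np.save("movenet_keypoints.npy", np.array(keypoints_per_frame))
```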
mediapipe
mediapipe was nice and not resource intensive, but the default pipeline is a bit slow when processing frames sequentially, and they seem to be in the middle of a bit of a code change, so it's not straightforward to tell what it can and can't do at the moment. I wanted to use their native landmark smoothing, so I didn't parallelize it or anything.
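For reference, the older "solutions"-style Pose API with its built-in smoothing looks roughly like this (the newer Tasks API differs, which is part of the code-change confusion; the parameters and file names here are assumptions, not necessarily what was used):

```python
# Minimal sketch of mediapipe's legacy solutions Pose API with native landmark smoothing.
# Smoothing relies on frames arriving in order, which is why this runs sequentially.
import cv2
import numpy as np
import mediapipe as mp

pose = mp.solutions.pose.Pose(
    static_image_mode=False,  # treat input as a video stream
    smooth_landmarks=True,    # native temporal smoothing across frames
    model_complexity=1,
)

cap = cv2.VideoCapture("cropped_scaled.mp4")  # hypothetical preprocessed video
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks:
        # 33 landmarks, each with normalized x, y, z and a visibility score.
        frames.append(np.array([(lm.x, lm.y, lm.z, lm.visibility)
                                for lm in results.pose_landmarks.landmark]))
    else:
        frames.append(np.full((33, 4), np.nan))  # no detection on this frame
cap.release()
pose.close()

np.save("mediapipe_keypoints.npy", np.stack(frames))
```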
mmpose
The timing is off in the gif, but the full video was processed and the keypoints stored as npy with frame ids, so I do have replayable data. It was the slowest at ~30 frames every 5 seconds, and it used the GPU. They also have a 3d pose estimation model, but I didn't see much more accuracy in initial testing, so I didn't run the full inference on it.
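A rough sketch of the high-level inferencer path in mmpose 1.x, storing keypoints keyed by frame id so the data stays replayable even if a rendered gif drifts out of sync. The "human" alias, output indexing, and file names are assumptions, not necessarily the exact config used.

```python
# Minimal sketch using mmpose's MMPoseInferencer (mmpose 1.x) on a video file.
import numpy as np
from mmpose.apis import MMPoseInferencer

inferencer = MMPoseInferencer("human")  # default 2d human pose alias

keypoints_by_frame = {}
# Assumes one frame per yielded result; each result carries per-instance predictions.
for frame_id, result in enumerate(inferencer("cropped_scaled.mp4", show=False)):
    instances = result["predictions"][0]
    if instances:
        person = instances[0]  # keep the top-ranked detection
        scores = np.array(person["keypoint_scores"])[:, None]
        # Store (x, y, confidence) per keypoint for this frame id.
        keypoints_by_frame[frame_id] = np.column_stack([person["keypoints"], scores])

# Saving keyed by frame id keeps the playback timing reconstructible later.
np.save("mmpose_keypoints.npy", keypoints_by_frame, allow_pickle=True)
```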
As the file names suggest, I fed the original video into ffmpeg and cropped it, then ran it through again and scaled it. I picked this video because it was one of the shorter, smaller ones I had. I didn't use any of ffmpeg's flattening filters; they always seemed to crop too much and leave distortion that could confuse the models. In lieu of a perfect solution, something like movenet fine-tuned on VR poses might yield better results. I wanted to start each model on an even playing field, so I didn't try to speed things up by preprocessing the videos to each model's native input size.
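The preprocessing amounts to something like the following two ffmpeg passes, shown here via subprocess. The crop (one eye of an assumed side-by-side stereo layout) and scale values are placeholders, not the exact numbers used for the shared files.

```python
# Rough sketch of the crop-then-scale preprocessing with ffmpeg.
import subprocess

src = "original_vr.mp4"          # hypothetical source file
cropped = "cropped.mp4"
scaled = "cropped_scaled.mp4"

# Pass 1: crop the left half of an assumed side-by-side stereo frame.
subprocess.run(
    ["ffmpeg", "-y", "-i", src, "-vf", "crop=iw/2:ih:0:0", cropped],
    check=True,
)

# Pass 2: scale the cropped video down to a smaller working resolution.
subprocess.run(
    ["ffmpeg", "-y", "-i", cropped, "-vf", "scale=960:-2", scaled],
    check=True,
)
```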
This is just something I've looked at out of curiosity; I don't have any expertise in this space or do anything more than dabble with python.
As far as applicability and how it could be used, it might help as a starting point when composing, but I think the tools and extensions that already exist far surpass the output of these. If I wanted to use them, the lowest-hanging fruit would probably be to convert the keypoint Y data into funscript and use them as individual sources to copy+paste out of (a rough sketch of this follows the examples below). Most/all keypoint data includes a confidence value as well.
e.g.
The average position of the left and right hips could give approximate rhythm indications: (left_hip_y + right_hip_y) / 2
The y position of a hand or the head at certain points in the video could inform motion
The overall joint y position, filtered to keypoints where confidence is greater than some threshold, could inform motion.
and so on.
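As a rough sketch of the first example: average the two hip Y values per frame, drop low-confidence frames, and write funscript-style actions. The COCO keypoint indices, confidence threshold, fps, and field names here are assumptions based on common conventions, not something I've validated against the shared outputs.

```python
# Turn the mean of left/right hip Y into funscript-style actions.
# Keypoint indices follow the 17-point COCO layout used by movenet/mmpose
# (11 = left hip, 12 = right hip); threshold, fps, and file names are placeholders.
import json
import numpy as np

keypoints = np.load("movenet_keypoints.npy")  # shape (frames, 17, 3): y, x, confidence
fps = 30.0
conf_threshold = 0.3
L_HIP, R_HIP = 11, 12

actions = []
for frame_id, kp in enumerate(keypoints):
    if kp[L_HIP, 2] < conf_threshold or kp[R_HIP, 2] < conf_threshold:
        continue  # skip frames where either hip is low confidence
    hip_y = (kp[L_HIP, 0] + kp[R_HIP, 0]) / 2  # normalized 0-1, 0 = top of frame
    actions.append({
        "at": int(frame_id / fps * 1000),        # timestamp in milliseconds
        "pos": int(round((1.0 - hip_y) * 100)),  # invert so higher in frame = higher pos
    })

with open("hips.funscript", "w") as f:
    json.dump({"version": "1.0", "range": 100, "actions": actions}, f)
```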