obligatory wall of text because I’m not good at simple explanations
The pose stuff: MoveNet, OpenPose, MMPose, MediaPipe, etc. These work pretty well on normal 2D videos for this sort of thing, and I've seen topics from other users leveraging them as part of generating funscripts.
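For anyone wanting to poke at this, MediaPipe is probably the easiest to try since it's just a pip install. A rough sketch of pulling hip/head y-coordinates out of a 2D video (standard mp.solutions.pose usage, nothing tuned):

```python
import cv2
import mediapipe as mp

mp_pose = mp.solutions.pose
cap = cv2.VideoCapture("scene.mp4")  # placeholder filename
hip_y, head_y = [], []

with mp_pose.Pose(static_image_mode=False) as pose:
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        # MediaPipe expects RGB, OpenCV gives BGR
        results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
        if results.pose_landmarks:
            lm = results.pose_landmarks.landmark
            # normalised (0-1) y-coordinates for a hip point and the head
            hip_y.append(lm[mp_pose.PoseLandmark.LEFT_HIP].y)
            head_y.append(lm[mp_pose.PoseLandmark.NOSE].y)
cap.release()
```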
This is a gif of OpenPose run on a 2D scene I had lying around.
The OpenPose model outputs a series of coordinates which correspond to bones/connections, and you could probably create a really good 1D starter script if you took the coordinates of the hip points and head points and did some processing on them. At the very least it should provide a relatively decent representation of peaks and troughs, which could be edited into something better with less effort than starting from scratch.
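As a sketch of the kind of processing I mean (the smoothing window and peak spacing are arbitrary numbers, not anything I've tuned), turning per-frame hip/head y-coordinates into rough peak/trough script points could look like:

```python
import numpy as np
from scipy.signal import savgol_filter, find_peaks

def rough_script_points(hip_y, head_y, fps=30.0):
    """Turn per-frame hip/head y-coordinates into candidate funscript points."""
    hip_y = np.asarray(hip_y, dtype=float)
    head_y = np.asarray(head_y, dtype=float)

    # use hip-to-head distance so absolute camera position matters less
    signal = hip_y - head_y

    # smooth out per-frame jitter before looking for peaks/troughs
    smoothed = savgol_filter(signal, window_length=9, polyorder=2)

    # normalise to the 0-100 range a funscript position uses
    norm = (smoothed - smoothed.min()) / (smoothed.max() - smoothed.min() + 1e-8) * 100

    # peaks and troughs become candidate actions for a human to clean up
    peaks, _ = find_peaks(norm, distance=int(fps * 0.15))
    troughs, _ = find_peaks(-norm, distance=int(fps * 0.15))
    idx = np.sort(np.concatenate([peaks, troughs]))
    return [{"at": int(i / fps * 1000), "pos": int(norm[i])} for i in idx]
```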
I think you're right that something like VAM could be used to generate a labelled pose estimation dataset for both VR and 2D videos to improve accuracy, and it's similar to the comments flowerstrample made initially. My understanding is that there's no porn in the CMU Panoptic Studio dataset that OpenPose was trained on, or in any other pose dataset, so even for 2D there's probably plenty of room for improvement. I don't think it would take a large amount of retraining of one of those models to achieve that, given that these things hold most of the knowledge in the base model and often just need fine tuning of the later layers. I'd guess it would need a variety of scene types, with input images as realistic as possible to match the kind of images it would be used against. The generated images could also be pushed through something like Stable Diffusion img2img with an OpenPose ControlNet to produce an even larger and more varied dataset.

I don't really know much about VAM though, and potentially loading enough unique scenes and poses (and exporting them in a VR view) could be more time intensive than manually labelling poses. VAM definitely could produce a large number of similar images in a sequence much faster than manual labelling, but my guess is that variety in the dataset would benefit the model most.
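To illustrate the "fine tuning of the later layers" part in TensorFlow terms (the backbone, layer counts and keypoint head here are placeholders, not a validated recipe), the usual transfer-learning pattern is to freeze most of the pretrained model and only train the head plus the last few layers:

```python
import tensorflow as tf

# placeholder backbone; a real pose model would be the thing being fine-tuned
backbone = tf.keras.applications.MobileNetV2(
    input_shape=(256, 256, 3), include_top=False, weights="imagenet")

# freeze everything except the last few layers so the bulk of the
# pretrained knowledge stays intact and only the later layers adapt
for layer in backbone.layers[:-20]:
    layer.trainable = False

# simple regression head: predict (x, y) for 17 keypoints
head = tf.keras.Sequential([
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(17 * 2),
])

model = tf.keras.Sequential([backbone, head])
model.compile(optimizer=tf.keras.optimizers.Adam(1e-4), loss="mse")
# model.fit(vam_images, vam_keypoints, ...)  # synthetic VAM-rendered dataset
```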
VAM or other 3D computer graphics could be valuable for generating a large dataset to pretrain a model that tracks 6-axis motion. A smaller dataset of manually annotated filmed scenes could then be used to fine tune it and make it usable for normal videos, but I still wouldn't expect any approach I've taken so far to be good and accurate enough for real-time, script-less use. Instead I'd still predict the best I might manage is something that can generate a starter funscript to save time for human scripters, similar to the other posts I linked previously.
It could even be the case that a VAM or computer-generated dataset could be used to train pose estimation on multiple frames to try to achieve temporally consistent results, but that's a lot to bite off. This would be similar to how MoViNet does it, except it only does an overall classification of action for the whole sequence, and the inner workings of things like that are beyond me. MoViNet for streaming action recognition | TensorFlow Hub
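I haven't built anything like this, but the rough shape of what I'm imagining is a sequence model sitting on top of per-frame keypoints, outputting a refined pose that's informed by neighbouring frames instead of each frame being independent (all sizes made up):

```python
import tensorflow as tf

FRAMES = 16               # window of frames fed to the model
KEYPOINTS = 17            # keypoints per frame
FEATURES = KEYPOINTS * 2  # (x, y) per keypoint

# input: noisy per-frame keypoints from a single-frame pose model
inputs = tf.keras.Input(shape=(FRAMES, FEATURES))

# the temporal part: recurrent layers look across frames, so the refined
# pose for one frame is informed by its neighbours
x = tf.keras.layers.Bidirectional(
    tf.keras.layers.LSTM(128, return_sequences=True))(inputs)
x = tf.keras.layers.Bidirectional(tf.keras.layers.LSTM(64))(x)

# output: a smoothed/refined pose for one frame of the window
outputs = tf.keras.layers.Dense(FEATURES)(x)

model = tf.keras.Model(inputs, outputs)
model.compile(optimizer="adam", loss="mse")
# would be trained against ground-truth poses exported from VAM/CG renders
```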
Stepping back from the pose models: I've looked at using yolov8, which offers bounding box detection, but it's complex to get the feature extractor layers with the right classification info into a time sequence model. I did put together something that extracts every prediction from a frame, which might be useful for the LSTM-style model I've shown in previous posts, but I haven't pursued it. It's not very temporally consistent, or very fast, but mostly it just didn't feel like a lead I wanted to follow.
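Something along these lines with the ultralytics package would do the per-frame extraction (the padding into a fixed-length vector is just one way to give an LSTM a constant input shape, not necessarily what I'd keep):

```python
import numpy as np
from ultralytics import YOLO

model = YOLO("yolov8n.pt")  # pretrained detection model

MAX_DETECTIONS = 8  # fixed size so every frame gives the same-length vector

def frame_features(frame):
    """Flatten all yolov8 detections in one frame into a fixed-length vector."""
    result = model(frame, verbose=False)[0]
    boxes = result.boxes
    # normalised xyxy box, class id and confidence for each detection
    det = np.concatenate([
        boxes.xyxyn.cpu().numpy(),
        boxes.cls.cpu().numpy().reshape(-1, 1),
        boxes.conf.cpu().numpy().reshape(-1, 1),
    ], axis=1) if len(boxes) else np.zeros((0, 6))

    # pad/trim to MAX_DETECTIONS so a sequence model gets a constant shape
    out = np.zeros((MAX_DETECTIONS, 6), dtype=np.float32)
    out[: min(len(det), MAX_DETECTIONS)] = det[:MAX_DETECTIONS]
    return out.flatten()
```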
That's why the lineart model caught my attention. Compared to line detection with Canny, it was very smooth and temporally consistent. It also simplifies the image a lot, which means there's less data/noise to process, and it passed the test of "if all I had was this output, could I write a semi-accurate funscript". The base model for it is a PyTorch model though, and I'm more of a TensorFlow user, so it'd take some effort to convert. It also still outputs 512x512x1 data, which would need extra layers to reduce it down to a smaller size before I could use it in a model that processes a sequence of frames.
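The kind of "layers to reduce it down" I mean would be a small conv encoder sitting in front of the sequence model, something like this (filter counts and output size are arbitrary):

```python
import tensorflow as tf

# small encoder that squeezes a 512x512x1 lineart frame down to a compact
# feature vector before it goes into a sequence model
encoder = tf.keras.Sequential([
    tf.keras.Input(shape=(512, 512, 1)),
    tf.keras.layers.Conv2D(16, 3, strides=2, padding="same", activation="relu"),  # 256x256
    tf.keras.layers.Conv2D(32, 3, strides=2, padding="same", activation="relu"),  # 128x128
    tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"),  # 64x64
    tf.keras.layers.Conv2D(64, 3, strides=2, padding="same", activation="relu"),  # 32x32
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(128, activation="relu"),  # 128 values per frame
])

# wrap with TimeDistributed to run the same encoder over a window of frames
sequence_input = tf.keras.Input(shape=(16, 512, 512, 1))
per_frame_features = tf.keras.layers.TimeDistributed(encoder)(sequence_input)
```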
Including this gif because I thought it was cool. It was produced by taking the highest-value pixels from the MiDaS and lineart outputs, and then subtracting the lineart output values. I'd still try lineart alone because I think it'd be simpler for a model to learn.
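In NumPy terms the combination was basically just a per-pixel max followed by the subtraction (assuming both outputs are already scaled to the same 0-1 range):

```python
import numpy as np

def combine(midas_depth, lineart):
    """Blend MiDaS depth and lineart outputs as described above."""
    # keep whichever source has the higher value at each pixel,
    # then subtract the lineart values back out
    highest = np.maximum(midas_depth, lineart)
    return highest - lineart
```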