GIFs of pose estimation on VR video

This probably isn’t anything that hasn’t already been discussed or thought about by plenty of people, but I wanted to see some samples of how these models behave on VR scenes and couldn’t find any.

This community seems pretty sharing, so I figured I’d share what I generated in case anyone else had the same curiosity. It’s all from the same video; maybe someone can guess which one.
Full mp4 outputs and keypoint outputs are in a 945.41 MB folder on MEGA.

I started with MoveNet. It was the easiest to work with and had good speed, but it felt the least accurate.

MediaPipe was nice and not resource intensive, but the default pipeline is a bit slow when processing sequentially, and they seem to be in the middle of a code change, so it’s not straightforward to tell what it can and can’t do at the moment. I wanted to use their native landmark smoothing, so I didn’t parallelize it or anything.

MMPose: the timing is off in the GIF, although the full video was processed and the results were stored as npy with a frame id, so I do have replayable data. It was the slowest at ~30 frames every 5 seconds (about 6 fps), and it used the GPU. They also have a 3D pose estimation model, but I didn’t see much difference or better accuracy in initial testing, so I didn’t run the full inference on it.

As the file names suggest, I fed the original video into ffmpeg once to crop it, and then again to scale it. I picked this video because it was one of the shorter and smaller ones I had. I didn’t use any ffmpeg flattening filters; they always seemed to crop too much and leave some distortion that could confuse the models. In lieu of a perfect solution, something like MoveNet fine-tuned on VR poses might yield better results. I wanted to start each model on an even playing field, so I didn’t try to speed things up by preprocessing the videos to each model’s native size.

This is just something I’ve looked at out of curiosity. I don’t have any expertise in this space, and I don’t do anything more than dabble with Python.

As for applicability, this could help at the start of composing a script, but I think the tools and extensions that already exist far surpass these outputs. If I wanted to use them, the lowest-hanging fruit would probably be to convert the keypoint Y data into funscript and use each keypoint as an individual source to copy and paste from. Most, if not all, of the keypoint data includes a confidence value as well.
  • The average position of the left and right hips could give approximate rhythm indications: (left_hip_y + right_hip_y) / 2
  • The y position of the hand or head at certain points in the video could inform motion.
  • The overall joint y position, where confidence is greater than some percentage, could inform motion.
  • And so on.
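
A minimal sketch of the hip-average idea above, not a tested conversion tool. It assumes keypoints shaped (frames, 17, 3) in MoveNet's (y, x, score) order with the hips at indices 11 and 12, and it assumes the usual funscript layout of actions with "at" in milliseconds and "pos" from 0 to 100. The function names and the 0.3 confidence cut-off are made up for illustration.

```python
import numpy as np

# Assumed MoveNet/COCO keypoint indices; other estimators may order joints differently.
LEFT_HIP, RIGHT_HIP = 11, 12

def hip_positions(keypoints, min_conf=0.3):
    """Turn (frames, joints, 3) keypoints of (y, x, score) into 0-100 positions."""
    kp = np.asarray(keypoints, dtype=float)
    left, right = kp[:, LEFT_HIP], kp[:, RIGHT_HIP]
    hip_y = (left[:, 0] + right[:, 0]) / 2.0          # (left_hip_y + right_hip_y) / 2
    trusted = (left[:, 2] >= min_conf) & (right[:, 2] >= min_conf)
    if not trusted.any():
        raise ValueError("no frame has confident hip keypoints")
    frames = np.arange(len(kp))
    # Fill low-confidence frames by interpolating between trusted neighbours.
    hip_y = np.interp(frames, frames[trusted], hip_y[trusted])
    span = hip_y.max() - hip_y.min()
    if span == 0:
        return np.full(len(kp), 50, dtype=int)
    # Image y grows downward, so invert it: higher in the frame means a higher position.
    return np.rint(100 * (hip_y.max() - hip_y) / span).astype(int)

def to_actions(positions, fps):
    """Build funscript-style actions (assumed layout: 'at' in ms, 'pos' in 0-100)."""
    return [{"at": int(round(i * 1000 / fps)), "pos": int(p)}
            for i, p in enumerate(positions)]
```

The raw output would still be noisy, so in practice it would probably need smoothing and point reduction before being useful as a copy and paste source.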


We played with it a bit at the very beginning of our AI scripting. The estimated human poses seemed very noisy and inaccurate. It might be a good idea to use pre-trained backbones from pose estimators as starting weights for training.

I’ve also experimented with recognizing poses from videos; IMHO it’s pointless.

  1. All the neural networks that are available are trained on 2D sports videos, where a person is shown at full or almost full height.
  2. Because of the perspective of VR porn, even after normalizing the 2D picture for the eyes, the proportions are distorted and the neural networks do not work.

I was thinking of the following algorithm for myself, but it is on pause for now.

  1. Take a 3D video and create a point cloud from it.
  2. Build voxel-based 3D models from that point cloud.
  3. Train the model to recognize poses from voxels.
  4. For 3D scripts (like I do in Blender), you just need to get the base and head of the penis and the dual quaternion (or vector + quaternion) of objects interacting with it.

Your interest is commendable - you tried and succeeded, maybe you will think of where to apply the results of your research.

Thanks, that makes sense. It’s a bit of a rabbit hole of what sort of model to use. I’d assume it has to be a prediction over time. And then there’s the challenge of correctly labelled data.

I distilled a scenario down into basic shapes and trained untrained (not pre-trained) MoViNet-style models on randomly generated scenario data. Using about 1000 generated, labelled datasets, each with 10 to 30 sliding windows of 10 frames to pick from, I tried a classifier with the classes “movingup”, “movingdown”, “peak”, and “trough”, and trained it with the default sparse categorical loss.
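
For anyone who wants to try something similar, here is a rough sketch of generating labelled sliding windows. It is not the generator used for the results in this post: it works on a 1-D sine position instead of rendered frames, and the labelling rule (looking at the last two steps of each window), the period range, and all names are assumptions made up for illustration.

```python
import numpy as np

LABELS = ("movingup", "movingdown", "peak", "trough")

def label_window(window):
    """Label a window from the direction of its last two steps (hypothetical rule)."""
    before = window[-2] - window[-3]
    after = window[-1] - window[-2]
    if before > 0 and after <= 0:
        return "peak"
    if before < 0 and after >= 0:
        return "trough"
    return "movingup" if after > 0 else "movingdown"

def make_windows(n_sequences=1000, window=10, seed=0):
    """Generate sine-based sequences and cut each into 10 to 30 sliding windows."""
    rng = np.random.default_rng(seed)
    xs, ys = [], []
    for _ in range(n_sequences):
        n_windows = int(rng.integers(10, 31))      # 10 to 30 windows per sequence
        period = rng.uniform(8, 40)                # frames per full cycle (assumed range)
        phase = rng.uniform(0, 2 * np.pi)
        t = np.arange(window + n_windows - 1)
        position = np.sin(2 * np.pi * t / period + phase)
        for start in range(n_windows):
            w = position[start:start + window]
            xs.append(w)
            ys.append(LABELS.index(label_window(w)))
    return np.stack(xs), np.array(ys)
```

The labels come out as integer class ids, which is the form a sparse categorical loss expects. Note that with a rule like this the "peak" and "trough" classes will be much rarer than the two moving classes.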

e.g. these sort of animations

The categorical results look a lot like:
Whoops, forgot to change the color mode. Also, the animated shapes are just circles; I didn’t intentionally give any animated body parts specific sizes.

With a similar but also randomly generated dataset, labelled with integer values of how much the “direction indicator” was showing, I bolted a linear head onto a MoViNet-style model and trained it with MSE. It had a harder time, but was starting to get the hang of it.
e.g. position result

I can see how the right model, or pipeline of models, could definitely ease the workflow of first classifying the data and later generating tracking predictions.

e.g. I extracted random frames from random videos to create a library of images. MoveNet (pose) was then able to estimate the position of the person in the scene, and if the overall confidence was high enough, the frame could go into a folder called “person on back” because the left-side coordinates were less than the right-side coordinates. Or, if the distance between the knees was 1.5x the distance between the hips, it could go into a folder called “person on back with knees apart”.
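
The sorting rules above could look something like this sketch. It assumes a single frame of MoveNet-style keypoints shaped (17, 3) in (y, x, score) order, and it assumes that "left less than right" means comparing the hip x coordinates; the 0.4 confidence cut-off, the "unsorted" and "other" folder names, and the function name are all made up for illustration.

```python
import numpy as np

# Assumed MoveNet/COCO keypoint indices.
LEFT_HIP, RIGHT_HIP, LEFT_KNEE, RIGHT_KNEE = 11, 12, 13, 14

def sort_frame(keypoints, min_conf=0.4, knee_ratio=1.5):
    """Pick a folder name for one frame of (17, 3) keypoints as (y, x, score)."""
    kp = np.asarray(keypoints, dtype=float)
    if kp[:, 2].mean() < min_conf:
        return "unsorted"                      # overall confidence too low to trust
    left_hip, right_hip = kp[LEFT_HIP], kp[RIGHT_HIP]
    # Assumption: the left/right test is done on the hip x coordinates.
    if not left_hip[1] < right_hip[1]:
        return "other"
    hip_gap = np.linalg.norm(left_hip[:2] - right_hip[:2])
    knee_gap = np.linalg.norm(kp[LEFT_KNEE, :2] - kp[RIGHT_KNEE, :2])
    if knee_gap >= knee_ratio * hip_gap:
        return "person on back with knees apart"
    return "person on back"
```

The returned string can then be used as the destination folder when copying the frame, so the rule output is easy to review by eye.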

  • If it’s stored as “frame #xxxxx video #yyyyy”, you can then use that data to build your training dataset, because you can quickly review whether it is the position predicted, and the workflow could show or animate the previous 9 (interval?) frames to give you context for the class “movingup”, “movingdown”, “peak”, or “trough”. With a bit of Python and OpenCV you can classify a frame every few seconds.
  • And/or you click on where the “action” is happening and the workflow stores the coordinates for that frame, which can later be used for an “actionfinder” model that you put in the pipeline to auto-crop, so that the main model has less data to process, because more pixels = slower and harder to train.
  • And/or you classify the pose in the frame, because you eventually find that you need specific models for different positions since the model doesn’t generalize well, and that becomes a part of your workflow pipeline.
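
A small sketch of the bookkeeping for the review step in the first bullet: parsing a “frame #xxxxx video #yyyyy” name and working out which earlier frames to show for context. The helper names are hypothetical, and actually loading and displaying the frames (for example with OpenCV) is left out.

```python
import re

def parse_sample_name(name):
    """Extract (frame_id, video_id) from a name like 'frame #00120 video #00007'."""
    match = re.search(r"frame #(\d+) video #(\d+)", name)
    if match is None:
        raise ValueError(f"unrecognised sample name: {name!r}")
    return int(match.group(1)), int(match.group(2))

def context_frames(frame_id, n_context=9, interval=1):
    """Frame numbers to show a reviewer: up to n_context earlier frames, then frame_id."""
    start = frame_id - n_context * interval
    # Frames before the start of the video are simply dropped.
    return [f for f in range(start, frame_id + 1, interval) if f >= 0]
```

For example, `context_frames(120)` gives frames 111 through 120, which matches the 10-frame windows mentioned earlier; a larger `interval` trades temporal detail for a longer look back.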

For my testing, I think a small step up from the simplistic 2D images I’ve fed the MoViNet-style model might be to stick to a specific studio, a specific position, and even a specific cast, to see how well the models ‘take’.

For now I’ll stick to the basic little animations and see if there are any other model architectures that make sense but are still temporally aware, like LSTMs and GRUs. I don’t know that much about them; this is just a problem that caught my interest.

The MoViNet-style model I used for those:
Video classification with a 3D convolutional neural network | TensorFlow Core
