This probably isn't anything that hasn't already been discussed or thought about by plenty of people, but I wanted to see some samples of how these behaved on VR scenes and couldn't find any.
This community seems pretty sharing, so I figured I'd share what I generated in case anyone else had the same curiosity. It's all from the same video; maybe someone can guess which video.
Full mp4 outputs and keypoint outputs are in a 945.41 MB folder on MEGA
movenet
I started with movenet; it was the easiest to work with and had good speed, but felt the least accurate.
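For anyone curious what "easiest to work with" means in practice, the loop is roughly the sketch below. It assumes the TF Hub singlepose lightning variant and a hypothetical preprocessed file name; the exact model variant and settings used for the shared outputs may differ.

```python
# Minimal sketch of running MoveNet (singlepose lightning from TF Hub) over video frames.
# The hub URL, 192x192 input size, and file names are assumptions, not the exact setup used.
import cv2
import numpy as np
import tensorflow as tf
import tensorflow_hub as hub

model = hub.load("https://tfhub.dev/google/movenet/singlepose/lightning/4")
movenet = model.signatures["serving_default"]

cap = cv2.VideoCapture("cropped_scaled.mp4")  # hypothetical preprocessed video
keypoints_per_frame = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    # The lightning variant expects a 192x192 int32 RGB image with a batch dimension.
    img = cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)
    img = tf.image.resize_with_pad(tf.expand_dims(img, axis=0), 192, 192)
    img = tf.cast(img, dtype=tf.int32)
    out = movenet(img)
    # Output shape is [1, 1, 17, 3]: (y, x, confidence) per keypoint, normalized 0-1.
    keypoints_per_frame.append(out["output_0"].numpy()[0, 0])
cap.release()

np.save("movenet_keypoints.npy", np.array(keypoints_per_frame))
```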
mediapipe
mediapipe was nice and not resource intensive, but the default pipeline is a bit slow when processing frames sequentially, and they seem to be in the middle of a bit of a code change, so it's not straightforward to tell what it can and can't do at the moment. I wanted to use their native landmark smoothing, so I didn't parallelize it or anything.
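For reference, the older "solutions"-style Pose API with its built-in smoothing looks roughly like this (the newer Tasks API differs, which is part of the code-change confusion; the parameters and file names here are assumptions, not necessarily what was used):

```python
# Minimal sketch of mediapipe's legacy solutions Pose API with native landmark smoothing.
# Smoothing relies on frames arriving in order, which is why this runs sequentially.
import cv2
import numpy as np
import mediapipe as mp

pose = mp.solutions.pose.Pose(
    static_image_mode=False,  # treat input as a video stream
    smooth_landmarks=True,    # native temporal smoothing across frames
    model_complexity=1,
)

cap = cv2.VideoCapture("cropped_scaled.mp4")  # hypothetical preprocessed video
frames = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    results = pose.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
    if results.pose_landmarks:
        # 33 landmarks, each with normalized x, y, z and a visibility score.
        frames.append(np.array([(lm.x, lm.y, lm.z, lm.visibility)
                                for lm in results.pose_landmarks.landmark]))
    else:
        frames.append(np.full((33, 4), np.nan))  # no detection on this frame
cap.release()
pose.close()

np.save("mediapipe_keypoints.npy", np.stack(frames))
```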
mmpose
The timing is off in the gif, but the full video was processed and the keypoints stored as npy with frame ids, so I do have replayable data. It was the slowest at ~30 frames every 5 seconds, and it used the GPU. They also have a 3d pose estimation model, but I didn't see much more accuracy in initial testing, so I didn't run the full inference on it.
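A rough sketch of the high-level inferencer path in mmpose 1.x, storing keypoints keyed by frame id so the data stays replayable even if a rendered gif drifts out of sync. The "human" alias, output indexing, and file names are assumptions, not necessarily the exact config used.

```python
# Minimal sketch using mmpose's MMPoseInferencer (mmpose 1.x) on a video file.
import numpy as np
from mmpose.apis import MMPoseInferencer

inferencer = MMPoseInferencer("human")  # default 2d human pose alias

keypoints_by_frame = {}
# Assumes one frame per yielded result; each result carries per-instance predictions.
for frame_id, result in enumerate(inferencer("cropped_scaled.mp4", show=False)):
    instances = result["predictions"][0]
    if instances:
        person = instances[0]  # keep the top-ranked detection
        scores = np.array(person["keypoint_scores"])[:, None]
        # Store (x, y, confidence) per keypoint for this frame id.
        keypoints_by_frame[frame_id] = np.column_stack([person["keypoints"], scores])

# Saving keyed by frame id keeps the playback timing reconstructible later.
np.save("mmpose_keypoints.npy", keypoints_by_frame, allow_pickle=True)
```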
As the file names suggest, I fed the original video into ffmpeg and cropped it, then ran it through again and scaled it. I picked this video because it was one of the shorter, smaller ones I had. I didn't use any of ffmpeg's flattening filters; they always seemed to crop too much and leave distortion that could confuse the models. In lieu of a perfect solution, something like movenet fine-tuned on VR poses might yield better results. I wanted to start each model on an even playing field, so I didn't try to speed things up by preprocessing the videos to each model's native input size.
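The preprocessing amounts to something like the following two ffmpeg passes, shown here via subprocess. The crop (one eye of an assumed side-by-side stereo layout) and scale values are placeholders, not the exact numbers used for the shared files.

```python
# Rough sketch of the crop-then-scale preprocessing with ffmpeg.
import subprocess

src = "original_vr.mp4"          # hypothetical source file
cropped = "cropped.mp4"
scaled = "cropped_scaled.mp4"

# Pass 1: crop the left half of an assumed side-by-side stereo frame.
subprocess.run(
    ["ffmpeg", "-y", "-i", src, "-vf", "crop=iw/2:ih:0:0", cropped],
    check=True,
)

# Pass 2: scale the cropped video down to a smaller working resolution.
subprocess.run(
    ["ffmpeg", "-y", "-i", cropped, "-vf", "scale=960:-2", scaled],
    check=True,
)
```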
This is just something I've looked at out of curiosity; I don't have any expertise in this space or do anything more than dabble with python.
As far as applicability and how it could be used, it might help as a starting point when composing, but I think the tools and extensions that already exist far surpass the output of these. If I wanted to use them, the lowest-hanging fruit would probably be to convert the keypoint Y data into funscript and use them as individual sources to copy+paste out of (a rough sketch of this follows the examples below). Most/all keypoint data includes a confidence value as well.
e.g.
The average position of the left and right hips could give approximate rhythm indications: (left_hip_y + right_hip_y) / 2
The y position of a hand or the head at certain points in the video could inform motion
The overall joint y position, filtered to keypoints where confidence is greater than some threshold, could inform motion.
and so on.
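As a rough sketch of the first example: average the two hip Y values per frame, drop low-confidence frames, and write funscript-style actions. The COCO keypoint indices, confidence threshold, fps, and field names here are assumptions based on common conventions, not something I've validated against the shared outputs.

```python
# Turn the mean of left/right hip Y into funscript-style actions.
# Keypoint indices follow the 17-point COCO layout used by movenet/mmpose
# (11 = left hip, 12 = right hip); threshold, fps, and file names are placeholders.
import json
import numpy as np

keypoints = np.load("movenet_keypoints.npy")  # shape (frames, 17, 3): y, x, confidence
fps = 30.0
conf_threshold = 0.3
L_HIP, R_HIP = 11, 12

actions = []
for frame_id, kp in enumerate(keypoints):
    if kp[L_HIP, 2] < conf_threshold or kp[R_HIP, 2] < conf_threshold:
        continue  # skip frames where either hip is low confidence
    hip_y = (kp[L_HIP, 0] + kp[R_HIP, 0]) / 2  # normalized 0-1, 0 = top of frame
    actions.append({
        "at": int(frame_id / fps * 1000),        # timestamp in milliseconds
        "pos": int(round((1.0 - hip_y) * 100)),  # invert so higher in frame = higher pos
    })

with open("hips.funscript", "w") as f:
    json.dump({"version": "1.0", "range": 100, "actions": actions}, f)
```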