GIFs of pose estimation on VR video

I started a new dataset with human pose keypoint labels and began fine-tuning YOLO pose models. The results were good, but on the test scenes the keypoints don't convey body movement well enough.

To address that, I added another four keypoints to my labels: base of pelvis, umbilicus, left of sternum, and right of sternum. Fine-tuning while adding keypoints discards much more of the model's prior knowledge than a standard pose fine-tune, so it's a bit trickier.
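For an Ultralytics-style YOLO pose fine-tune, the extra keypoints are declared in the dataset YAML via `kpt_shape`. A minimal sketch, assuming the four new points sit at indices 17–20 in the order listed above (the paths and ordering are illustrative, not the actual dataset):

```yaml
# Hypothetical dataset config for a 21-keypoint pose fine-tune.
path: ../datasets/pose21        # illustrative path
train: images/train
val: images/val

# 17 COCO keypoints + 4 added: pelvis base, umbilicus, sternum left/right.
kpt_shape: [21, 3]              # (num_keypoints, x/y/visibility)

# Horizontal-flip mapping: the COCO left/right pairs swap as usual; pelvis
# and umbilicus map to themselves, and the two sternum points swap.
flip_idx: [0, 2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11, 14, 13, 16, 15, 17, 18, 20, 19]

names:
  0: person
```

Training then starts from a pretrained pose checkpoint with a reshaped head, which is part of why so much of the model's prior keypoint knowledge gets overwritten.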

Left is the fine-tuned 21-keypoint visualization; right is the base YOLO pose model with no fine-tuning.
Visibility of the arms and legs is low in this scene, so some jitter is expected, but the fine-tune still does much better on their positions. The scene was intentionally held out of the dataset.

Generating predictions faster than real time is the current goal, but it's proving a challenge, much of it related to Python threading.

  • CUDA OpenCV is slow at converting a GpuMat to tensors, even when done without reprocessing on the CPU.
  • torchvision similarly slows down when converting the tensor from HSV (YUV?) to BGR.
  • The ffmpeg hooks, from memory, seemed to decode via the CPU, and Python also accesses the images on the CPU, which slows predictions.

There are other options, like preconverting, other libraries, multiprocessing, or even a different language like C++/C#.
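As one example of the "preconverting" route, the color conversion can be pulled out of the hot path and done as a single vectorized NumPy pass per frame before the tensor hand-off. A sketch assuming BT.601 full-range YUV input (the post is unsure whether the frames are HSV or YUV, so this is illustrative, not the actual pipeline):

```python
import numpy as np

def yuv_to_bgr(yuv: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) uint8 YUV image to BGR (BT.601, full range)."""
    y = yuv[..., 0].astype(np.float32)
    u = yuv[..., 1].astype(np.float32) - 128.0
    v = yuv[..., 2].astype(np.float32) - 128.0
    # Standard BT.601 full-range coefficients.
    r = y + 1.402 * v
    g = y - 0.344136 * u - 0.714136 * v
    b = y + 1.772 * u
    bgr = np.stack([b, g, r], axis=-1)       # OpenCV-style channel order
    return np.clip(bgr, 0, 255).astype(np.uint8)
```

Batching frames and converting them in one vectorized call keeps the per-frame Python overhead down, which is the main cost the list above is fighting.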

In any case, these are probably the best results I've had from any of these experiments to date, both in the absolute positions of the points and in the relative distances between certain points over time.
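That "relative distance between certain points over time" reduces to a per-frame Euclidean norm over the prediction array. A small sketch; the keypoint indices are hypothetical, since the post doesn't give the dataset's index order:

```python
import numpy as np

# Hypothetical indices for two of the added keypoints in the 21-point layout.
PELVIS_BASE, UMBILICUS = 17, 18

def keypoint_distance_over_time(kpts: np.ndarray, i: int, j: int) -> np.ndarray:
    """kpts: (num_frames, num_keypoints, 2) array of predicted (x, y) positions.

    Returns a (num_frames,) array with the Euclidean distance between
    keypoints i and j in each frame.
    """
    return np.linalg.norm(kpts[:, i, :] - kpts[:, j, :], axis=-1)
```

Plotting that series per scene gives a quick check that the fine-tune is tracking torso motion rather than jittering.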


So you mean something like this, right?

We need lots of heroes to make all of this useful.

Also, there's a lot of great work in this post. Thanks for all the hard work and for open-sourcing it.


Not really, but it's a very interesting project. It seems to me that recognizing the skeleton will take quite a lot of resources, but maybe it's worth it.