GIFs of pose estimation on VR video

I started a new dataset with human pose keypoint labels and began fine-tuning YOLO pose models. The results were good, but on the test scenes the keypoints don't convey body movement well enough.

To address that, I added another four keypoints to my labels: base of pelvis, umbilicus, left of sternum, and right of sternum. Fine-tuning while adding keypoints discards much more of the model's prior knowledge than a standard pose fine-tune, so it's a bit trickier.
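For an Ultralytics-style YOLO pose fine-tune, the extra keypoints are declared in the dataset YAML via `kpt_shape`. A minimal sketch, assuming the four new points sit at indices 17–20 in the order listed above (the paths and ordering are illustrative, not the actual dataset):

```yaml
# Hypothetical dataset config for a 21-keypoint pose fine-tune.
path: ../datasets/pose21        # illustrative path
train: images/train
val: images/val

# 17 COCO keypoints + 4 added: pelvis base, umbilicus, sternum left/right.
kpt_shape: [21, 3]              # (num_keypoints, x/y/visibility)

# Horizontal-flip mapping: the COCO left/right pairs swap as usual; pelvis
# and umbilicus map to themselves, and the two sternum points swap.
flip_idx: [0, 2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11, 14, 13, 16, 15, 17, 18, 20, 19]

names:
  0: person
```

Training then starts from a pretrained pose checkpoint with a reshaped head, which is part of why so much of the model's prior keypoint knowledge gets overwritten.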

Left is the fine-tuned 21-keypoint visualization; right is the base YOLO pose model with no fine-tuning.
Visibility of the arms and legs is low in this scene, so some jitter is expected, but the fine-tune still does much better on their positions. The scene was intentionally held out of the dataset.

Generating predictions faster than real time is the current goal, but it's proving a challenge, much of it related to Python threading.

  • CUDA OpenCV is slow at converting a GpuMat to tensors, even when done without reprocessing on the CPU.
  • torchvision similarly slows down when converting the tensor from HSV (YUV?) to BGR.
  • The ffmpeg hooks, from memory, seemed to decode via the CPU, and Python also accesses the images on the CPU, which slows predictions.

There are other options, like preconverting, other libraries, multiprocessing, or even a different language like C++/C#.
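As one example of the "preconverting" route, the color conversion can be pulled out of the hot path and done as a single vectorized NumPy pass per frame before the tensor hand-off. A sketch assuming BT.601 full-range YUV input (the post is unsure whether the frames are HSV or YUV, so this is illustrative, not the actual pipeline):

```python
import numpy as np

def yuv_to_bgr(yuv: np.ndarray) -> np.ndarray:
    """Convert an (H, W, 3) uint8 YUV image to BGR (BT.601, full range)."""
    y = yuv[..., 0].astype(np.float32)
    u = yuv[..., 1].astype(np.float32) - 128.0
    v = yuv[..., 2].astype(np.float32) - 128.0
    # Standard BT.601 full-range coefficients.
    r = y + 1.402 * v
    g = y - 0.344136 * u - 0.714136 * v
    b = y + 1.772 * u
    bgr = np.stack([b, g, r], axis=-1)       # OpenCV-style channel order
    return np.clip(bgr, 0, 255).astype(np.uint8)
```

Batching frames and converting them in one vectorized call keeps the per-frame Python overhead down, which is the main cost the list above is fighting.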

In any case, these are probably the best results I've had from any of these experiments to date, both in the absolute positions of the points and in the relative distances between certain points over time.
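That "relative distance between certain points over time" reduces to a per-frame Euclidean norm over the prediction array. A small sketch; the keypoint indices are hypothetical, since the post doesn't give the dataset's index order:

```python
import numpy as np

# Hypothetical indices for two of the added keypoints in the 21-point layout.
PELVIS_BASE, UMBILICUS = 17, 18

def keypoint_distance_over_time(kpts: np.ndarray, i: int, j: int) -> np.ndarray:
    """kpts: (num_frames, num_keypoints, 2) array of predicted (x, y) positions.

    Returns a (num_frames,) array with the Euclidean distance between
    keypoints i and j in each frame.
    """
    return np.linalg.norm(kpts[:, i, :] - kpts[:, j, :], axis=-1)
```

Plotting that series per scene gives a quick check that the fine-tune is tracking torso motion rather than jittering.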


So you mean something like this, right?

We need lots of heroes to make all of this useful.

Also, there's a lot of great work in this post. Thanks for all the hard work and for open-sourcing it.


Not really, but it's a very interesting project. It seems to me that recognizing the skeleton will take quite a lot of resources, but maybe it's worth it.