Gifs of pose estimation on vr video

I kept the training running with a sequence from the third scene added to the dataset, and it got way worse at predictions. I eventually took a look at the sequence and realized it was a much different and subtler style of cowgirl, more of a forward thrust with no obvious vertical movement.
I’m going to limit this to just obvious vertical cowgirl movement for now, so I took it out of the dataset and continued training.

It kind of got better eventually, but I figured I wanted to fix the smaller middle layers and so started a fresh model from scratch.

The fresh model could predict the same sequence shown previously after about 3 hours of training, but then I realized I still had some middle layers thinner than the final layer, so I started again with another fresh model.

I shifted the code to a fresh repo and dumped it in herpaderpapotato/glowing-octo-giggle (github.com). The image of the plot shown is for the most recent training. There’s virtually no polish on any of it, but ¯\_(ツ)_/¯


I’m still questing for a better feature extractor and model architecture. I ended up dropping the LSTM from the model and trying DepthwiseConv2D layers instead.

e.g. I predict against 60 frames and use efficientnet to extract the features, so each sequence is an array of (60, 12, 12, 1280).

e.g. 60 of something like this

image

I have an animated version of that somewhere to show how it changes based on the scene, but basically the brighter blobby cells tended to stay bright and blobby, and the blobs tend to move around and change within their cells based on the input video.

With the LSTM and initial Dense layers, the model architecture was basically saying “take some relevant stuff from all that data, and each of the other frame’s data, and predict an output”.

But that’s hard for a model to learn because the data in each cell is most immediately relevant to the data in the same cell on the other frames, not to all the other cells in all the other frames. Thinking about it, what would be better is an LSTM for each 12 x 12 feature.
image
Image blown up 10x

There are 1280 features like that per frame, and I want to do predictions against 60 frames at a time for smoothness. Initializing a model of 1280 LSTMs takes 10+ minutes to build and 2 minutes to predict, so it’s not practical to do it that way.

Instead I’m trying a DepthwiseConv2D layer, which is described as:
“Depthwise convolution is a type of convolution in which each input channel is convolved with a different kernel (called a depthwise kernel).”
Which is good, because treating each of the 1280 features separately and transforming it with its own weights makes sense, but I’m still not 100% sure I’m using it right, or whether there’s a better way of doing it. I can think of ways of combining the DepthwiseConv2D with the LSTM which make more sense, but I’ll give my current training run more time to see if I think of anything else before wiping the slate clean and starting fresh.
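
To make the quoted definition concrete, here is a rough NumPy sketch of what a depthwise convolution does to a single frame’s feature map. It is only an illustration of the operation, not the model in the repo: the 3x3 kernel, the 'valid' padding and the tiny channel count are all assumptions picked to keep it readable.

```python
import numpy as np

def depthwise_conv2d(features, kernels):
    """Filter each channel with its own kernel (no mixing across channels).

    features: (H, W, C) feature map, e.g. (12, 12, 1280) from the backbone
    kernels:  (kh, kw, C), one small kernel per channel
    returns:  (H - kh + 1, W - kw + 1, C), i.e. 'valid' padding, stride 1
    """
    h, w, c = features.shape
    kh, kw, kc = kernels.shape
    assert c == kc, "need exactly one kernel per channel"
    out = np.zeros((h - kh + 1, w - kw + 1, c), dtype=features.dtype)
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            patch = features[i:i + kh, j:j + kw, :]  # (kh, kw, C)
            # Sum over the spatial window only; channels stay separate.
            out[i, j, :] = (patch * kernels).sum(axis=(0, 1))
    return out

# Toy example: 4 channels instead of 1280 so it is easy to inspect.
rng = np.random.default_rng(0)
feature_map = rng.random((12, 12, 4))
kernels = rng.random((3, 3, 4))
result = depthwise_conv2d(feature_map, kernels)
print(result.shape)  # (10, 10, 4)
```

In Keras the same per-channel behaviour comes from the `DepthwiseConv2D` layer. Wrapping it in `TimeDistributed` would be one way to run it over each of the 60 frames, but that is an untested suggestion rather than what the repo does.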

This approach has had better results than the LSTM approach, but I end up restarting training from scratch a fair bit when testing. Lots of trial and error.

On a different approach, I did see that MMPose documents implementing your own model. I haven’t wrapped my head around it yet, but implementing a custom model head on the MMPose base is along the lines of what I described previously: Implement New Models (MMPose 1.3.1 documentation).


A little off-topic, but I’ll share a thought nonetheless, since I can’t do it myself yet.
Have you looked into creating a 3-dimensional mesh from the 180 VR video and then fitting a rig onto it? In that case all you’d need is the relationship between the pelvic bone and dick for sex, head bone and dick for blowjobs, and hand bone and dick for wanking.
We already have calibrated video for VR, which can be imported into software (Blender/Maya, etc.) as a mesh, and pose recognition could then be applied there. With your approach of feeding VR video to a neural network we lose the depth data (it’s recognizing poses from 2D pictures).

My understanding is that a depth map generated by something like midas or normalbae is much like a point cloud, although it’s missing any obscured surfaces, e.g. 512h x 512w x depth.

I’ve never tried rigging them though. I have done some visualizations with them, but found that what works well for one scene doesn’t work very well for another. I had a gif above with lineart output superimposed on midas as an example of what they ‘see’ over time.

output1
e.g. aside from the flickering(depth changing) on the ceiling, this one was fairly consistent over time.

midas uses a trained resnet backbone for feature extraction in its encoder, and then the decoder is largely doing upscaling.
I’ve seen that a relatively recent depth estimator called Marigold was released. It VAE-encodes the image, then pushes it through a unet (basically another encoder/decoder model), and it does quite well, although from what I read previously it’s not particularly temporally consistent. e.g. The State of the Art of Depth Estimation from Single Images | by PatricioGonzalezVivo | Medium

Essentially the encoder stage distils information out of the images into feature maps and then the decoders work backwards to create depth estimation images that are a lot like point clouds. I look at that and my take-away is that all the information needed is in the feature map, and instead of needing to add the overhead of decoding that image and then doing it over time and encoding to something usable, it should just be working with the feature maps.

The significant difference is that the feature extractors in a midas model are trained/finetuned on their task, whereas the efficientnet backbone I used was trained to extract features that make it an accurate categorization model. It might be okay for the task I gave it, visualizing the feature map outputs in video shows they do hold positional data relevant to the image, but it could always be better. For any model I’ve done above, fine tuning the backbone after initial training helps with that.

And then when it comes back to skeleton/pose detection or human rigging, I revert to looking at that last link I posted about a custom head for mmpose. I was originally thinking of it all being something like this Human Pose Classification with MoveNet and TensorFlow Lite, but that’s literally just using the 17 keypoints output from movenet and, as per my original post in this thread, they’re very hit and miss in their pretrained state.

I’m not sure what the calibrated video for VR which imports into software as a mesh is, but that reminds me: if anyone has lens distortion maps for commonly used VR lenses, they’d be good to have. The tests I did with OpenCV: Depth Map from Stereo Images really didn’t seem to like the wider fov and lens distortion, and I’ve seen articles like Correcting lens distortion using FFMpeg | Daniel Playfair Cal’s Blog which offer more accurate ways of correcting it.


I come back to this every month or so and try a new or different idea, but nothing worth mentioning in the realm of custom models.

I did revisit yolov8. There’s a guide to creating datasets and tuning yolov8 at notebooks/notebooks/train-yolov8-object-detection-on-custom-dataset.ipynb at main · roboflow/notebooks (github.com).

It’s possible to follow the instructions and label your own dataset with whatever you like. There’s other tools for dataset labelling than roboflow, but their example was fairly straightforward.

With a relatively small and sparsely labelled dataset it is possible to do more detailed body part tracking and have a model start to be able to track them better.

That’s from a couple of hours training yolov8 on a dataset of ~450 images with about 2,300 annotations. My choice of classes is a bit crap, and my annotations weren’t necessarily consistent or balanced, but I could see that sort of tracking tying into an OFS workflow or plugin.

I have also experimented a bit with this kind of approach to generate funscripts. I used the pretrained model NudeNet which, as far as I know, is also based on yolov8. Unfortunately, I wasn’t able to process the tracking results properly, since the individual elements were not reliably recognized in every image and merging several trackers with missing features was/is not so easy, at least for me. My attempt can be found in this commit (newer code no longer contains the function). I hope you find something that gives us better results.


Thanks, I’ve got some previous mentions of base yolo models as well as nudenet that someone shared previously in this thread and I didn’t find them stable or reliable enough, even on 2d video. They’d be great if I could use them as a feature extractor and I recently saw an article describing a way to use yolov8 that way but I haven’t gotten around to testing it.

I checked your link and also saw your newer commits, and noted your yolo cock tracking onnx model. I tried it out side by side with a yolo model from a longer training run. I don’t know what your experience has been with it, but from the looks of it, if you’re not happy with the detections it’s making then more data in the training dataset might be needed.



The same goes for the model I tuned. When watching the previous animation as well as that side by side, I can see many points where it’s not detecting objects but it should be, and that’s likely due to the small dataset I used. For transparency, there are plenty of times in those videos where the model I trained utterly fails! I have set opencv to listen for keypresses, so when I see a frame that’s not being detected correctly I can hit S and it gets dumped to a folder for later annotation. It’s not something I’m going to put a lot of time into; I’m still much more interested in approaches that do it all within the model.

I also saw your repo code using cv2.calcOpticalFlowFarneback for tracking, and I’d been meaning to take a look at what it would take to plug that into a model. I was surprised how easy it was to output a simple up or down vector and plot it into a detectable wave form, and it’s given me a few ideas around feeding the hsv data into a custom model instead of images, as it seemed to be visually informative even at low resolutions. Although I can see that if I was programmatically trying to infer the right waveforms, the correct polarity is largely dependent on who is moving!
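
As a rough idea of the "simple up or down vector" step, the sketch below collapses dense flow fields into one value per frame pair and integrates that into a waveform. It assumes flow arrays in the (H, W, 2) layout that cv2.calcOpticalFlowFarneback returns (dx in channel 0, dy in channel 1) and uses synthetic flow in place of real video, so the function name and test data are made up for illustration.

```python
import numpy as np

def vertical_motion_signal(flows):
    """Collapse dense optical flow fields into one up/down value per frame pair.

    flows: iterable of (H, W, 2) arrays where [..., 0] is dx and [..., 1] is dy
           in pixels. Image y grows downwards, so the sign is flipped to make
           "up" positive.
    returns: (velocity, position) as 1-D arrays.
    """
    velocity = np.array([-np.mean(f[..., 1]) for f in flows])
    position = np.cumsum(velocity)  # integrate velocity into a waveform
    return velocity, position

# Synthetic stand-in for real flow: the whole frame moves up 2 px per frame
# for 5 frames, then down 2 px per frame for 5 frames.
fake_flows = [np.full((8, 8, 2), -2.0)] * 5 + [np.full((8, 8, 2), 2.0)] * 5
velocity, position = vertical_motion_signal(fake_flows)
print(position)
```

Averaging over the whole frame is the crudest possible choice. As noted above, the polarity depends on who is moving, so the sign may need flipping per scene.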

And I also saw how you’re doing quadratic interpolation in your code and it seems wayyy easier than the way I’ve been doing it so that was really useful as well! I have been manually writing functions to calculate the time between points and then doing sinusoidal calculations! I see interp1d is marked as legacy though, but make_interp_spline with uniform parametrizations looks like a good alternative I’ll try. A few weeks ago I played around making an autoencoder with lstm, with interpolated data choking down to 1/20th the size of the original data so having good, easy to generate interpolated data is really useful. I was thinking the resultant decoder model might be good to use as a replacement head on some of the other sequence models I’ve tinkered with, and potentially output smoother, more consistent predictions.
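
For comparison, the ‘manual’ approach described above (work out the time between points, then ease between them with a sinusoid) could look something like this. It is a sketch under assumptions: the function name is made up, the actions use the usual funscript "at" (milliseconds) and "pos" (0-100) keys, and timestamps are assumed to be strictly increasing.

```python
import math

def cosine_interpolate(actions, fps=60.0):
    """Expand sparse funscript actions into one position per video frame.

    actions: list of {"at": milliseconds, "pos": 0-100}, sorted by "at"
    returns: list of floats, one per frame from the first to the last action
    """
    start_ms, end_ms = actions[0]["at"], actions[-1]["at"]
    n_frames = int((end_ms - start_ms) * fps / 1000.0) + 1
    out, seg = [], 0
    for frame in range(n_frames):
        t = start_ms + frame * 1000.0 / fps
        # Advance to the segment [a, b] that contains time t.
        while seg < len(actions) - 2 and t >= actions[seg + 1]["at"]:
            seg += 1
        a, b = actions[seg], actions[seg + 1]
        frac = (t - a["at"]) / (b["at"] - a["at"])
        ease = (1.0 - math.cos(math.pi * frac)) / 2.0  # 0 -> 1, flat at both ends
        out.append(a["pos"] + (b["pos"] - a["pos"]) * ease)
    return out

actions = [{"at": 0, "pos": 0}, {"at": 500, "pos": 100}, {"at": 1000, "pos": 0}]
per_frame = cosine_interpolate(actions, fps=60.0)
print(len(per_frame))  # 61
```

Unlike a quadratic spline, the cosine ease never overshoots the 0-100 range, at the cost of always flattening out at every action point.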

I also tried the lstm approach with my ‘manual’ interpolation to create an ‘axis expander’ model, which goes from 1 channel to roll+pitch channels based off the longer term input (60s in, 60s out). Objectively it wasn’t bad, but it wasn’t necessarily better than just setting multifunplayer to random for those channels.

Back to your tracker model though, have you thought about doing a segmentation model instead of detection model? There’s some smart segmentation tools that make segmentation easier to annotate than you’d think for that sort of defined object. Potentially it’s easier and more reliable to calculate the visible area of the segmentation mask?
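
If a segmentation model were used, the "visible area" idea could be as simple as counting mask pixels per frame. A minimal sketch, assuming the masks are already available as a (frames, H, W) boolean array (the function name and the 0-100 scaling are illustrative choices, not anyone’s existing code):

```python
import numpy as np

def mask_area_signal(masks):
    """Turn per-frame binary masks into a 0-100 signal from visible area.

    masks: array of shape (frames, H, W), truthy where the object is visible
    """
    masks = np.asarray(masks, dtype=bool)
    area = masks.reshape(len(masks), -1).sum(axis=1).astype(float)
    lo, hi = area.min(), area.max()
    if hi == lo:  # no change in area: return a flat mid-line
        return np.full(len(area), 50.0)
    return (area - lo) / (hi - lo) * 100.0

demo_masks = np.zeros((3, 4, 4), dtype=bool)
demo_masks[0, :2] = True  # 8 pixels visible
demo_masks[1, :1] = True  # 4 pixels visible
signal = mask_area_signal(demo_masks)
print(signal)  # [100.  50.   0.]
```

Whether a large visible area should map to the top or the bottom of the stroke is a separate decision and would probably depend on the scene.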

The existing optical flow and other approaches are quite good and consistently outperform anything model based I’ve tinkered with. I’m not sure if/where the pain points are for users of it at the moment, but I could see some sort of detection model that helps a program make a decision about where to focus an optical flow’s attention (such as static cropping guidance), or to provide smoothing/postprocessing of the tracking output data.

I wouldn’t hold out any hope on anything useful from me though :smiley: . My area of interest on this is mostly focused on something fully image model based, which I don’t think is achievable with my limited knowledge, or my free time. Potentially something I’ve posted will save some time of someone smarter than me who comes along and sees it :wink:


herpaderpapotato/nsfw-identification-yolo10x at main (huggingface.co)

Link to the yolo10x detect model for reference. It can be used with the ultralytics python libraries etc. (Quickstart - Ultralytics YOLO Docs)


Oh, cool, thanks for sharing your model weight with us. :+1: I adjusted my code to be able to load your model.

I checked your link and also saw your newer commits, and noted your yolo cock tracking onnx model. I tried it out side by side with a yolo model from a longer training run. I don’t know what your experience has been with it, but from the looks of it, if you’re not happy with the detections it’s making then more data in the training dataset might be needed.

Yes, the dataset was very small. In addition, I have only trained on the cowgirl position, and I used the projected output of the VR for training and not the video frame. The main problem is that creating a large enough dataset is just too time consuming and boring for me…

I will definitely follow your progress and look forward to your next results :grinning:


I had an unexpectedly quiet Saturday so I labelled some more images and chucked them at yolov10n for training. It finished fairly quickly so I kicked off an m (yolov10m) training as well. I added them both to huggingface.

I think I’ve still only got less than 10 images in the validation dataset but 600 base images in the training, and categories are imbalanced, and all the other dataset labelling sins that you can make. The one thing I did learn is that every instance of a class in an image should be labelled, or yolo begins to learn them as “background”. I did augment the dataset with cropped versions of the original images which brought the total amount of images trained on to ~1200.

I’d share the dataset publicly, but frames from copyrighted video seem like they’d attract a dmca request. If anyone is interested in it though, drop me a message.

yolov10n


yolov10m

That’s a visualization of a prediction, then a crop based on the detections, and then a prediction on the crop.

Even though the 10m is still only ~800 epochs through its training, it does a bit better than the v10n. It’s detecting more of the mouth on certain frames where the v10n isn’t. I’ll see if it stays that way at the end of training.

Different sort of scene with the 10m again:

I’d still recommend other tools for newcomers, but I had a decent labelling workflow going with opencv and some python so labelling wasn’t too bad. Basically it was a case of:

  • load all videos into a list (glob *)
  • choose a random video
  • choose a random frame and display it
  • press a key to get some predictions from a previously trained yolo version and overlay them on the frame, or a different key to pick a new frame at random, or a different key to pick a new video at random.
  • if they’re all correct, press a key to write the frame and predictions to the dataset
  • if they’re not, then some opencv mousecallback magic to let me delete or add new ones and choose classes etc and then write/discard the frame etc

Once a model is performing reasonably, the process speeds up a lot because it will be doing most of the work with the annotations. It’s also possible to programmatically find edge cases where the dataset/model is lacking by running predictions on random samplings of videos and working out if certain classes are disappearing and reappearing from the predictions, and then by stashing the problem frames somewhere for labelling later.
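
The "disappearing and reappearing" check can be sketched without any model at all, given the set of predicted class names for each sampled frame. The function name, the `max_gap` threshold and the class names below are made up for illustration:

```python
def find_flicker_frames(frame_classes, max_gap=3):
    """Return frame indices where a class vanishes briefly and then reappears.

    frame_classes: list of sets, the class names predicted on each sampled frame
    max_gap: longest run of missing frames that still counts as a flicker
    """
    flagged = set()
    last_seen = {}  # class name -> index of the last frame it was detected in
    for idx, classes in enumerate(frame_classes):
        for name in classes:
            gap = idx - last_seen.get(name, idx) - 1
            if 0 < gap <= max_gap:
                # The frames in between are the likely missed detections.
                flagged.update(range(last_seen[name] + 1, idx))
            last_seen[name] = idx
    return sorted(flagged)

predictions = [{"hand", "mouth"}, {"hand"}, {"hand"}, {"hand", "mouth"}, {"hand"}]
print(find_flicker_frames(predictions))  # [1, 2]
```

The flagged indices are the frames worth dumping to a folder for labelling. A class that leaves the scene for good is not flagged, because it never reappears.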


yolov10m seems to be the best balance of size/performance/training time from what I’ve seen. It’s at herpaderpapotato/nsfw-identification-yolo10m · Hugging Face and scores ~50 on the validation set compared to the previous best of ~40.

I wouldn’t generate wave forms or peaks like that if I was doing it for real (live, no forward+backward smoothing), and for the motion in this clip the wave form is inverted, but this is a quick and dirty example with mousecallbacks on the labels at the top where I can click them on and off and then reset the normalization to visualize the stability of the tracking over time.

If I wanted to actually work with that as a process, I’d probably want to run prediction over a whole video ahead of time and dump the tracking data to file. torchvision will decode with cuda straight into yolo on the same gpu, with acceleration for h265 and h264, and runs much faster than what I get from opencv.
And then it’d be a case of working with that information in a gui to mix/combine/offset the waveforms into something usable. GUIs are hard though, and supporting them is a nightmare. I just like playing with the tech :rofl:.

I started a new dataset with human pose keypoint labelling and started fine tuning yolo pose models. The results were good, but on the test scenes the keypoints don’t necessarily convey body movement enough.

Instead I added another 4 keypoints to my keypoint labels: base of pelvis, umbilicus, left of sternum and right of sternum. Fine tuning and adding keypoints throws out a lot more of the model’s previous knowledge than just a normal pose finetune so it’s a bit trickier.

Left is finetuned 21 keypoint visualization, right is base yolo pose model visualization with no fine tuning.
The visibility on the arms and legs is low in the scene, so some jitter is expected, but the fine tune still does much better on their positions. The scene used is intentionally kept out of the dataset.

Generating predictions faster than realtime is a goal at the moment but is a bit of a challenge, a lot of it related to python threading:

  • cuda opencv slows down converting GPUMat to tensors even when done without reprocessing on CPU
  • torchvision similarly slows down converting the tensor from HSV(YUV?) to BGR
  • ffmpeg hooks, from memory, seemed to decode via CPU but python accesses the images with CPU, which slows predictions.

There are other options like preconverting, other libraries, multiprocessing or even using a different language like c++/c#.

In any case, it’s probably the best results I’ve had from any of these to date. Both based on the absolute position of the points, as well as the relative distance between certain points over time.


So you mean something like this, right?

We need lots of heroes to make all these useful.

Also, lots of great work in this post. Thanks for all the hard work and open source them.


Not really, but it’s a very interesting project. It seems to me that quite a lot of resources will be needed to recognize the skeleton, but maybe it’s worth it.

bit of an info dump.

Using FFMPEG I convert a 3d SBS video to 640x640 in about 30 minutes and get a left or right eye video. I’m using cuda, and it’s quicker if I start from a smaller resolution source, but I prefer to start from larger. yolo11’s native resolution is 640; you can go bigger, but most of the time 640 is decent. Reencoding has been simpler than some of the more niche python+ffmpeg, opencv+gpu, or torchvision decoding things I’ve tried, which ultimately end up taking about the same time because of the format switching needed for yolo to process the frames. Even when it all stays on the gpu I haven’t gotten much efficiency.

Once the video is at a good size, I can run every frame through a yolo11m-pose previously finetuned. Using batches of 120 frames it completes in about 90% of the video runtime. I run the predictions without tracking, minimum confidence 10% and an iou of 99%. I trim/pad the data for each frame out to 8 detections.
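
The trim/pad step is simple enough to sketch in NumPy. This assumes each detection has already been flattened to 67 values (4 box coordinates plus 21 keypoints of x, y, confidence) and sorted most-confident first; the function and constant names are illustrative.

```python
import numpy as np

MAX_DETECTIONS = 8   # detections kept per frame
VALUES_PER_DET = 67  # 4 box coordinates + 21 keypoints * (x, y, confidence)

def pad_or_trim(detections):
    """Force one frame's detections to a fixed (8, 67) array.

    detections: (n, 67) array-like, assumed sorted most-confident first.
    Extra rows are dropped; missing rows are filled with zeros ("blank").
    """
    detections = np.asarray(detections, dtype=np.float32).reshape(-1, VALUES_PER_DET)
    out = np.zeros((MAX_DETECTIONS, VALUES_PER_DET), dtype=np.float32)
    n = min(len(detections), MAX_DETECTIONS)
    out[:n] = detections[:n]
    return out

print(pad_or_trim(np.ones((3, 67))).shape)  # (8, 67)
```

A fixed shape per frame means a whole video can be stacked into one (frames, 8, 67) array and saved to disk in one go.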

It’s then 60 predictions per second, for 60 seconds by 60 minutes. Each frame has 8 detections, some are blank. Each detection has a bounding box consisting of 2 xy coordinates, and 21 keypoints which have x, y coords and a confidence score.
That’s up to 60x60x60x8x67 points of data for an hour long video. That’s 115,776,000 data points.
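
Spelling that arithmetic out (the float32 storage size at the end is an extra assumption, not something stated above):

```python
FPS = 60
SECONDS_PER_MINUTE = 60
MINUTES = 60
DETECTIONS_PER_FRAME = 8
BOX_VALUES = 2 * 2        # two (x, y) corners
KEYPOINT_VALUES = 21 * 3  # x, y and confidence for each keypoint

values_per_detection = BOX_VALUES + KEYPOINT_VALUES
frames = FPS * SECONDS_PER_MINUTE * MINUTES
total = frames * DETECTIONS_PER_FRAME * values_per_detection

print(values_per_detection, frames, total)  # 67 216000 115776000
print(total * 4 / 1e6)  # size in MB if every value is stored as float32
```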

If I just choose the most confident predictions from each frame, I can animate them into full video length animated gifs similar to the last one I posted. Everything goes out the window with more than one performer in a scene though.

With one performer I have multiple detections because of the low confidence threshold and high iou I use, which was intentional so I could try things like finding the most similar pose in each frame based on total differences of all frames throughout the video, as well as other things like merging certain points or working with the difference in points.

Ultimately it was all much the same outcome, and the maths hurt my head so instead I stuck to the most confident prediction. 21 lines correspond to 21 keypoints.
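
Picking the most confident prediction per frame and pulling out one line per keypoint might look like the sketch below. The 67-value layout has no separate box confidence, so ranking by mean keypoint confidence is an assumption here, as are the function name and the demo values.

```python
import numpy as np

def best_keypoint_tracks(frames):
    """Pick one detection per frame and return its keypoint y values.

    frames: (n_frames, n_det, 67) array; each detection is 4 box values
            followed by 21 * (x, y, confidence). Blank rows are all zeros.
    returns: (n_frames, 21) array of y coordinates, one column per keypoint.
    """
    frames = np.asarray(frames, dtype=np.float32)
    n_frames, n_det, _ = frames.shape
    keypoints = frames[:, :, 4:].reshape(n_frames, n_det, 21, 3)
    # Assumption: rank detections by their mean keypoint confidence.
    score = keypoints[..., 2].mean(axis=2)         # (n_frames, n_det)
    best = score.argmax(axis=1)                    # (n_frames,)
    chosen = keypoints[np.arange(n_frames), best]  # (n_frames, 21, 3)
    return chosen[..., 1]

demo = np.zeros((2, 8, 67), dtype=np.float32)
demo[0, 1, 4:] = np.tile([1.0, 7.0, 0.2], 21)  # weak detection, y = 7
demo[0, 3, 4:] = np.tile([1.0, 5.0, 0.9], 21)  # strong detection, y = 5
demo[1, 0, 4:] = np.tile([2.0, 9.0, 0.5], 21)
tracks = best_keypoint_tracks(demo)
print(tracks.shape)  # (2, 21)
```

Each column of `tracks` is one of the 21 lines in the plot.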

And then smoothed that with a savgol (Savitzky-Golay) filter

And then some more maths that the internet told me to use to extract peaks and troughs. The X axis looks trash but that’s expected.
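
One common way to do the smoothing and peak/trough step is with SciPy’s `savgol_filter` and `find_peaks`. This is a sketch of that route rather than the exact maths used above; the window length, polynomial order and minimum peak distance are guesses, and the input is a synthetic noisy sine standing in for one keypoint’s track.

```python
import numpy as np
from scipy.signal import find_peaks, savgol_filter

def peaks_and_troughs(y, window=15, polyorder=3, min_distance=10):
    """Smooth a per-frame signal, then locate its local maxima and minima.

    window, polyorder: Savitzky-Golay settings (window must be odd here)
    min_distance: minimum number of frames between two peaks (or troughs)
    """
    smooth = savgol_filter(y, window_length=window, polyorder=polyorder)
    peaks, _ = find_peaks(smooth, distance=min_distance)
    troughs, _ = find_peaks(-smooth, distance=min_distance)  # minima = peaks of -y
    return smooth, peaks, troughs

# Synthetic track: a 1 Hz sine sampled at 60 fps for 3 seconds, plus noise.
t = np.arange(180) / 60.0
rng = np.random.default_rng(0)
noisy = np.sin(2 * np.pi * t) + rng.normal(0.0, 0.02, t.size)
smooth, peaks, troughs = peaks_and_troughs(noisy)
print(peaks, troughs)
```

The peak and trough frame indices, converted to milliseconds, are essentially the "at" values of a draft funscript.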

I didn’t love the approach, but it can be a case of

  1. put video in folder
  2. press button.
  3. come back an hour later to an ugly graph and also a funscript file of that graph.
  4. Use a tool that doesn’t exist (OFS is great, but of course it doesn’t behave when I load in 21 funscripts) to scrub through the video and mark which keypoints to use for given time frames, which eventually assembles to make a draft funscript that could help someone make an actual good funscript.

And I feel like where I’d get with that isn’t much better than what other folks have already done in simpler, less complicated ways.

So instead I’d rather take my 115,776,000 data points and run them through a model to get a draft funscript. I also figured I’d do it with pytorch instead of the tensorflow I was used to, which was an extra bit of learning, on top of the fact that it can be hard to even know whether you’ve designed a model’s layers correctly or stuffed up your code until the model actually learns something. I started with a model that can generate 120 predictions for a 120 frame sequence at once, because that seems to result in smoother outputs. The input data was yolo pose data I pulled from a few different videos that I had scripts for. I interpolated the funscripts with quadratic interpolation to create a data point for each frame.

I wasn’t training a model with the intent of distributing the end result, as it was done with funscripts I don’t have permission to train on. But if it had proved the concept sufficiently I’d start training from scratch with permissible data.

This is an example of a prediction on unseen data.
Green is actual funscript.
Orange is the value the model predicted.
Blue is the Orange line normalized.

image

It was about a month ago when I took a break from it. Work got busy. If I’m charitable, I’d say it was learning sequences from the pose data; there is some harmonic correlation visible between the green and orange lines. But I also went back through the data and found I was training on some sequences where it was quite hard, even as a human, to work out what they should be, which probably wasn’t helping.

All that when I still don’t love this approach either :D. I’d prefer to try working with the yolo model’s later layers as a feature extractor instead of just the pose data, but the layers are quite large and I’d have to compromise on sequence length for predictions or work out a good way to reduce their size first.

Either way, this is still a problem I come back to and bang my head against every so often.

Another visualization of a prediction on 180frames:
image

The lines to the corner at 0 are missing detections.

Interpolating and then smoothing the data looks like:
image
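
Filling in the missing detections before smoothing can be done with plain linear interpolation. A minimal sketch, assuming a missing detection is stored as exactly 0 (so a genuine coordinate of 0 would also be treated as missing) and using a made-up function name:

```python
import numpy as np

def fill_missing(values, missing=0.0):
    """Linearly interpolate over frames where a keypoint was not detected.

    values: 1-D sequence of one keypoint coordinate per frame, where
            `missing` marks a frame with no detection.
    """
    values = np.asarray(values, dtype=float)
    frames = np.arange(len(values))
    valid = values != missing
    if not valid.any():
        return values.copy()  # nothing to interpolate from
    # np.interp holds the first/last valid value at the edges.
    return np.interp(frames, frames[valid], values[valid])

print(fill_missing([0, 10, 0, 0, 40, 0]))  # [10. 10. 20. 30. 40. 40.]
```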

But one of the things that’s bugging me is missing info about the body of the male. I haven’t expanded the original dataset or trained other yolo models for a couple months, but to add male keypoints I’ve redone my labelling workflow and started updating the dataset with an extra class.

This is what the labelling process looks like on the existing dataset.


That’s sped up 8x from realtime. When I press “i”, it inserts keypoints using a model trained on the ~60 images I’d labelled so far. It’s quicker than fully doing it manually but still needs a lot of correcting.

At that rate it takes about 80 seconds to label each image with the male. There’s nothing special to the labelling process, just some python and writing txt files.
It should speed up as I get more images and train on bigger datasets and the inserted “i” data is more accurate.
I’m also using less than optimal key bindings spread all over the keyboard.
And also correcting some of the old bounding boxes on the performer labels.
If I can get the labelling down to ~40 seconds per image it’ll only take another 6 hours to label the remaining 500 images. :grimacing:



Off track for a moment (not specifically pose related :scream:) LightGlue on github can be persuaded to work a lot like the opencv point tracking or encoder vector methods by masking out areas of interest in a reference frame. I don’t have them up and running to compare, but for what it’s worth I captured the above example.

edit: back on the topic of pose and trying not to post spam, this is viz of outputs from partially trained yolo11n skimming through video which was not in training set. Dataset is still at ~450 images.


As already mentioned in another topic, that’s awesome work, congratulations!

I know the type of headache that trying to tackle this type of topic can induce, and how much energy a simple “thank you for what you’re doing” can bring, so: thank you for what you’re doing!

Two quick remarks from my side:

  • I have been experimenting with YOLO detection in another thread, where @Zalunda and @jambavant amongst others came up with very valuable suggestions about “undistorting” the VR image before processing it. You retrieve a lot of detail in the lower part of the picture for instance, where the action often is (and where the pose model sometimes struggles…)
  • I tried to add a center-hip point from the YOLO-pose model to the process, but was disappointed by the unstable results from yolo11n-pose, and by how computationally expensive the yolo11x-pose model was (however with WAY better results and stability).

As an AI / YOLO newb, I am still convinced a nice approach would be the combination of both fine tuned image detection and fine tuned pose detection. Happy to share and join forces if you ever feel like it :slight_smile:

Anyway, please keep up the good work, thanks again :+1:


Pose - Ultralytics YOLO Docs

with yolo, I find m to be the sweet spot. It’s only ~3.5x the inference time of n, but it can make up ~75% of the gap between n and x. Whereas x is ~9.5x the inference time of n.

Using torch to process the video with gpu acceleration, and passing directly to yolo as a tensor in 120 frame batches, for me it’s ~25 minutes to perform predictions for a 50 minute 4096p video, compared to ~37 minutes with the m. That’s without preprocessing the video to a smaller size.

I looked at undistorting the video and the work others had done/documented, and when it comes to pose, my thinking is that the model will learn what it’s taught from the data, they just haven’t been taught on this content.
I do occasionally work with crops but when it comes to distortion, I’d take it the other way. Grab a normal coco keypoint dataset, programmatically build a new dataset that’s zoomed in on subjects so they’re close/positioned similar to your use case and then distort them to be even more like your use case, and then train a model on those for a pose model that’s much closer to functional on your use case.
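
A very rough sketch of the label side of that idea: zoom keypoints in towards the image centre and then warp them with a one-parameter radial model. This is not a real VR lens model; the function name, `zoom` and `k` are hypothetical, and the same mapping would also have to be applied to the image pixels so that labels and pixels stay aligned.

```python
import numpy as np

def zoom_and_distort(points, zoom=1.5, k=0.3):
    """Apply a zoom, then a simple radial (barrel-style) warp, to keypoints.

    points: (n, 2) array of (x, y) in normalized image coordinates, 0..1
    zoom:   scale factor about the image centre (hypothetical default)
    k:      strength of the one-parameter radial model (hypothetical default)
    """
    centred = (np.asarray(points, dtype=float) - 0.5) * 2.0 * zoom
    r2 = (centred ** 2).sum(axis=1, keepdims=True)
    warped = centred / (1.0 + k * r2)  # points far from the centre are pulled inwards
    return warped / 2.0 + 0.5

print(zoom_and_distort([[0.5, 0.5], [1.0, 0.5]], zoom=1.0, k=0.3))
```

Keypoints that land outside 0..1 after the zoom would need their visibility set to 0 in the new labels.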

I see the pose models as significantly more powerful than the flat out detection models.

  • My anecdotal understanding is that when you label pose you mark the points that are in frame and visible as visibility 2. The model learns the features around the point, and the relation of the point to other keypoints.
  • When a keypoint is in the frame and not visible, you mark the point with visibility 1. The model learns where to expect the keypoint to be, given the other keypoints, the bounding box of the object and other factors.
  • And when a keypoint isn’t in the frame it’s marked with a 0 and the model learns that for that keypoint everything else in the frame is “background” but that information still informs the positions of the other keypoints.
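
For reference, the three visibility values end up as the third number of each keypoint in a YOLO pose label line (class, normalized box, then x y visibility per keypoint). The helper below is a hypothetical formatter written for illustration, with made-up coordinates:

```python
def pose_label_line(class_id, box, keypoints):
    """Format one object as a YOLO pose label line (keypoint dimension 3).

    box:       (x_center, y_center, width, height), normalized to 0..1
    keypoints: list of (x, y, visibility) with visibility
               2 = in frame and visible, 1 = in frame but not visible,
               0 = not in frame (coordinates are written as 0 0)
    """
    parts = [str(class_id)] + [f"{v:.6f}" for v in box]
    for x, y, vis in keypoints:
        if vis == 0:
            x, y = 0.0, 0.0
        parts += [f"{x:.6f}", f"{y:.6f}", str(int(vis))]
    return " ".join(parts)

line = pose_label_line(
    0,
    (0.5, 0.5, 0.4, 0.8),
    [(0.45, 0.30, 2), (0.55, 0.62, 1), (0.9, 0.9, 0)],
)
print(line)
```

A real label line would carry all of the model’s keypoints (21 in the fine-tune above) rather than the three shown here.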

All that means that if I’m tracking a wrist, the keypoint prediction for it is weighted by everything else about that object detection, and not just a set of blurry pixels approximately in the middle of the screen etc like a straight object detection.
It also means that if the model can’t see something, it’s probably got a pretty good idea of where that thing is.

With the previous female-only pose model, the dataset was relatively small at only ~600 images (when it comes to ML datasets, that’s tiny), and I’d find that it’d sometimes have problems on a video with a performer/background/poses that the model hadn’t seen before. If I went through and found ~20 of the worst frame predictions (automatically, based on sequences where predictions went from okay to bad), labelled those and then trained the model further, it would be much better on that video. I never really hit a limit where this stopped applying; it’s just that for my goal the pose tracking didn’t provide much more value to scripters than what they already have, or than using something like LightGlue etc.

It’s probably buried somewhere above, but if I’m working towards anything it’s something to assist scripters, so that they can spend less time stepping through frames for timings on peaks and troughs and instead focus more on other aspects of the scripts, such as depth of peaks or when to add some jitter to a motion etc.

Possibly I’m also trying to invoke Cunningham’s Law :rofl:.

Everything that I can share, I’ve generally tried to share though.

  • my crap python code goes onto github, at least where it’s more than just one-off scratchings/testing in python notebooks which lead nowhere, or a copy of someone else’s existing code.
  • yolo models go onto huggingface
  • datasets I’ve labelled are available on request. I’m trying to avoid widely sharing copyrighted images so it’s not something I’ve dropped public links to.
  • and any info/testing/thoughts/approaches etc I generally put in this thread.

You’re free to check out any of that, I’m pretty sure I’ve got github/huggingface links in here somewhere. If not then it’s just under my name on those. Or if there’s something specific you need, post it and I’ll see what I can do.

At the moment I’ve got another multiclass yolo11m-pose model training. I’ve reduced the “beholder” keypoints down to the 7 most applicable, and changed some other config so it’ll be interesting to see how that comes out when it’s done training. Work has been quiet at this time of year which has meant I’ve had enough time to revisit the pose stuff over the past couple of weeks.

@herpaderpapotato thank you so much for your detailed answer, once again, amazing work…

I will definitely look into it, I think I already played a lil bit with one of your yolo v10 trained model on suggestion of @mchyxnaaiorxfwrivv (if I am not mistaken)

I just made my github repo public via this message, in case you want to have a look :slight_smile:; your batch processing approach might be super useful to speed up the pipeline!

In the meantime, you just made me discover what Cunningham’s Law is lol, thank you!