I took the methodology from my previous models that I put on GitHub/Hugging Face. This is an example prediction generated by the model after less than 24 hours of training on one (sometimes two) GPUs:
If for some reason that embed doesn’t work, https://www.redgifs.com/watch/imaginaryvillainousjunco or motion.mp4 ~ pixeldrain.
Manually animating a 3D object on a 2D plane is hard, so I drew the axes on the image with OpenCV (a rough sketch of the overlay code is below this list):
- tilt of the blue line is roll
- tilt of the light blue line is twist
- the red line shows the vertical (up/down) position; the length of the red line is surge
- the green line shows sway; the length of the green line is pitch
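A minimal sketch of that kind of overlay, assuming per-frame axis values normalised to 0..1; the exact geometry (which direction each line moves or stretches) is my interpretation here, not the code I actually used:

```python
import cv2
import numpy as np

def draw_overlay(frame, roll, twist, up_down, surge, sway, pitch):
    """Scribble the six axis values onto a frame; all inputs assumed to be 0..1."""
    h, w = frame.shape[:2]
    cx, cy = w // 2, h // 2

    def tilted_line(center, length, value, color):
        # Map 0..1 to roughly -45..+45 degrees of tilt around a fixed point.
        ang = (value - 0.5) * np.pi / 2
        dx, dy = int(np.cos(ang) * length), int(np.sin(ang) * length)
        cv2.line(frame, (center[0] - dx, center[1] - dy),
                 (center[0] + dx, center[1] + dy), color, 3)

    tilted_line((cx, cy), 80, roll, (255, 0, 0))            # blue: tilt = roll
    tilted_line((cx, cy - 120), 60, twist, (255, 255, 0))   # light blue: tilt = twist

    y = int(h * (1 - up_down))                              # red: height = up/down, length = surge
    half = int(40 + 80 * surge)
    cv2.line(frame, (cx - half, y), (cx + half, y), (0, 0, 255), 3)

    x = int(w * sway)                                       # green: horizontal position = sway, length = pitch
    half = int(40 + 80 * pitch)
    cv2.line(frame, (x, cy - half), (x, cy + half), (0, 255, 0), 3)
    return frame
```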
I went through flowertrample's free multi-axis scripts, grabbed the corresponding videos I have access to, and ran them through the following process.
Stage 1:
- I put the videos and funscripts in a folder, matched the names, and then put the timestamps of the cowgirl scenes in a matching .txt file. I only really did this for 3 videos, and trained on 2. I should do more, but the next steps take time.
- Used Python to extract the left half of each frame to a file named like frame.jpg, keeping the original resolution.
- For each frame, I read the funscripts, interpolated each axis position at that frame's timestamp, and stored the axis values in a matching frame.txt (a rough sketch of both steps follows this list).
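Roughly what those two steps look like, assuming side-by-side VR video and one .funscript file per axis; the paths, names, and axis set here are illustrative, not my exact layout:

```python
import cv2, json
import numpy as np

def extract_and_label(video_path, funscript_paths, out_dir, start_s, end_s):
    """Dump left-half frames plus per-axis interpolated positions for one scene.
    funscript_paths is e.g. {"stroke": "a.funscript", "surge": "a.surge.funscript", ...};
    the axis order used here has to match whatever the model/overlay expects later."""
    scripts = {}
    for axis, path in funscript_paths.items():
        actions = json.load(open(path))["actions"]
        scripts[axis] = (np.array([a["at"] for a in actions], dtype=float),   # timestamps in ms
                         np.array([a["pos"] for a in actions], dtype=float))  # positions 0..100

    cap = cv2.VideoCapture(video_path)
    fps = cap.get(cv2.CAP_PROP_FPS)
    frame_idx = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        t_ms = frame_idx / fps * 1000
        if start_s * 1000 <= t_ms <= end_s * 1000:
            left = frame[:, : frame.shape[1] // 2]            # left eye of the SBS frame
            cv2.imwrite(f"{out_dir}/{frame_idx:06d}.jpg", left)
            # Linear interpolation of every axis at this frame's timestamp.
            values = [np.interp(t_ms, ts, pos) for ts, pos in scripts.values()]
            with open(f"{out_dir}/{frame_idx:06d}.txt", "w") as f:
                f.write(",".join(f"{v:.2f}" for v in values))
        frame_idx += 1
    cap.release()
```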
Stage 2:
- I wanted to speed up training, so instead of making an LSTM that runs a pretrained model to generate features every epoch, I made a model whose input is 60 x the tensor shape of the features (each a 12x12x1280 array), which saves training time. I used efficientnetv2-s-21k-ft1k to pre-generate features for all the images, saved as numpy files (see the sketch after this list).
- I wrote a dataloader that reads all the numpy file names, works out which ones form sequential runs of 60 or more frames, and returns however many sequences I request (the window-grouping logic is sketched a little further down). It's a really bad dataloader because it loads all of those sequences into memory, but I couldn't be bothered writing a smarter data generator.
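Roughly how the feature pre-generation can look. The 21k-ft1k checkpoint isn't the one bundled with keras.applications, so the stock ImageNet weights below are just a stand-in with the same 12x12x1280 output shape:

```python
import numpy as np
import tensorflow as tf

# Frozen backbone; include_top=False at 384x384 input gives the 12x12x1280 feature maps.
backbone = tf.keras.applications.EfficientNetV2S(
    include_top=False, weights="imagenet", input_shape=(384, 384, 3))

def cache_features(jpg_paths, out_dir, batch_size=32):
    """Run every extracted frame through the backbone once and save the result."""
    for i in range(0, len(jpg_paths), batch_size):
        paths = jpg_paths[i:i + batch_size]
        imgs = np.stack([
            tf.keras.utils.img_to_array(tf.keras.utils.load_img(p, target_size=(384, 384)))
            for p in paths])                              # raw 0..255 pixels; the model preprocesses internally
        feats = backbone.predict(imgs, verbose=0)         # (batch, 12, 12, 1280)
        for p, feat in zip(paths, feats):
            name = p.rsplit("/", 1)[-1].rsplit(".", 1)[0] # frame number from the filename
            np.save(f"{out_dir}/{name}.npy", feat)
```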
At this point, across the 3 videos I prepped, there are 51,830 jpgs, 51,830 txt files and 51,830 npy files, and approximately 51,500 overlapping 60-frame sequences the dataloader can grab from.
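A rough sketch of how that window grouping and the "bad" loading can work, assuming the frame-numbered filenames from the earlier sketches:

```python
import os, random
import numpy as np

SEQ_LEN = 60

def find_windows(npy_dir):
    """Group consecutive frame numbers into runs, then list every overlapping
    SEQ_LEN-frame window inside each run (filenames assumed to be <frame>.npy)."""
    frames = sorted(int(f[:-4]) for f in os.listdir(npy_dir) if f.endswith(".npy"))
    windows, run_start = [], 0
    for i in range(1, len(frames) + 1):
        if i == len(frames) or frames[i] != frames[i - 1] + 1:
            run = frames[run_start:i]
            windows += [run[j:j + SEQ_LEN] for j in range(len(run) - SEQ_LEN + 1)]
            run_start = i
    return windows

def load_batch(npy_dir, txt_dir, windows, n):
    """The 'bad' part: pull n random windows entirely into memory."""
    picked = random.sample(windows, n)
    x = np.stack([[np.load(f"{npy_dir}/{f:06d}.npy") for f in w] for w in picked])
    y = np.stack([[np.loadtxt(f"{txt_dir}/{f:06d}.txt", delimiter=",") for f in w] for w in picked])
    return x, y   # x: (n, 60, 12, 12, 1280), y: (n, 60, number-of-axes)
```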
You can visualize the data in the npy files and they look a bit like:

In an image classification model, that's what the final classification layer would 'see' and make decisions on; here I smash 60 of them together and tell an LSTM to do its thing with them before pushing the result through a few dense layers.
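For reference, a model of that shape in Keras looks roughly like the sketch below; the pooling choice, layer widths and output head are illustrative guesses, not my exact architecture:

```python
import tensorflow as tf
from tensorflow.keras import layers

SEQ_LEN = 60
N_AXES = 6   # roll, twist, up/down, surge, sway, pitch -- match whatever order your labels use

def build_model(feat_shape=(12, 12, 1280)):
    inp = layers.Input(shape=(SEQ_LEN, *feat_shape))                   # 60 pre-computed feature maps
    x = layers.TimeDistributed(layers.GlobalAveragePooling2D())(inp)   # (60, 1280) per sequence
    x = layers.LSTM(256)(x)                                            # temporal summary of the window
    x = layers.Dense(512, activation="relu")(x)                        # "a few dense layers"
    x = layers.Dense(512, activation="relu")(x)
    out = layers.Reshape((SEQ_LEN, N_AXES))(
        layers.Dense(SEQ_LEN * N_AXES, activation="sigmoid")(x))       # one 0..1 value per frame per axis
    model = tf.keras.Model(inp, out)
    model.compile(optimizer="adam", loss="mse")
    return model
```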
Stage 3:
- Because my dataloader is bad/dumb, I set the training loop to grab 128 sequences at random, train for 100 epochs on them, and then repeat that continuously (the outer loop is sketched at the end of this stage).
- I watched the loss numbers go down. There's no right answer here: the loss dropped much quicker than it does when I train a basic single-axis funscript model, because the main stroke axis has a lot more variation than the other axes, so this model quickly learns to predict within a smaller range, which brings the loss down but doesn't mean it's accurate (the screenshot is from part of the later training, after adding more sequences).
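A minimal sketch of that outer loop, reusing the hypothetical build_model / find_windows / load_batch helpers from the earlier sketches:

```python
# Grab 128 random windows, fit for 100 epochs on just those, then repeat
# forever with a fresh random batch (labels are funscript pos 0..100 -> scale to 0..1).
model = build_model()
windows = find_windows("features")
while True:
    x, y = load_batch("features", "labels", windows, n=128)
    model.fit(x, y / 100.0, epochs=100, batch_size=8, verbose=1)
    model.save("multiaxis_lstm.keras")   # checkpoint between rounds
```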

Stage 4:
- I grabbed the video for one of flowertrample's paid scripts and used it as a test, because it's not in my training dataset and so could be used for comparison without fear of contamination.
- I loaded the timestamps of the cowgirl scenes in the video, pushed the frames through efficientnetv2-s-21k-ft1k, and then had the model predict 60 frames at a time, which is the sequence length I configured the model for (a rough sketch of this pass follows the list).
- I used OpenCV to draw on the frames to indicate the predictions.
- I also checked the up/down axis output on a plot.
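Roughly what that prediction pass looks like, reusing the backbone, model and draw_overlay sketches from earlier; the chunks are non-overlapping, which is where the every-60-frames jitter mentioned below comes from:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("test_scene.mp4")                    # hypothetical test video path
frames, imgs = [], []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    left = frame[:, : frame.shape[1] // 2]
    frames.append(left)
    imgs.append(cv2.resize(left, (384, 384))[:, :, ::-1])   # BGR -> RGB for the backbone
cap.release()

feats = backbone.predict(np.array(imgs, dtype=np.float32), verbose=0)   # (n, 12, 12, 1280)

for start in range(0, len(feats) - 60 + 1, 60):             # independent, non-overlapping 60-frame chunks
    preds = model.predict(feats[None, start:start + 60], verbose=0)[0]  # (60, N_AXES), values 0..1
    for i, axes in enumerate(preds):
        # Axis order assumed to match draw_overlay(frame, roll, twist, up_down, surge, sway, pitch).
        draw_overlay(frames[start + i], *axes)
```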

Caveats/notes
- It jitters a little every 60 frames, because each 60-frame block is a separate prediction. One day I'll get around to adding an input to the model that takes the position of the previous frame as a starting point. Doing overlapping predictions seems wasteful but is an option.
- The model has never seen this scene before. In training it has only seen 2 other scenes, so it's not very diverse, but it still matches the tempo reasonably well.
- It hasn’t been trained very long
- the base model (efficientnetv2-s-21k-ft1k) hasn't been fine-tuned at all
- I don't have explicit permission to train and share a model based on flowertrample's scripts. Although I can share code etc., the model weights can't be shared, and I'll end up deleting them to avoid accidentally using them elsewhere.
- I used linear interpolation to create the per-frame pos values, so the values I generated aren't the best, but writing something sinusoidal was more effort/time than I wanted to spend
- although there's plenty of multi-axis anime and PMV content, I think it would probably contradict what the model is learning, so I can't use it
- I realized I broke one of my own rules: the dense layers after the LSTM each have fewer neurons than the final layer. I try to avoid doing that, so I should probably redo it with fresh weights.
- although pre-generating the pretrained model's outputs did speed up training, it means I can't use any image-level data augmentation during training (although maybe I could add some noise to the features).
- I could write some Python to output funscripts from the predictions (I've done that for my previous models, but not multi-axis; a rough sketch of that export step is at the end of this list), but it's too soon to bother because in all likelihood the model will freak out the minute someone transitions from cowgirl leaning back to cowgirl leaning forward with arms on chest.
- I didn't do any pre-cropping on the images it trains or predicts on, so it could probably do better with fewer distractions in frame, which is what I saw with other models.
- it’s only learnt from cowgirl scenes, and so is probably crap at everything else
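For when it does get that far, a minimal sketch of the export step I have in mind, writing one .funscript per axis from the per-frame predictions (the one-file-per-axis naming is an assumption about how multi-axis players pick the scripts up):

```python
import json

def write_funscripts(preds, fps, axis_names, out_prefix, start_frame=0):
    """preds: (n_frames, n_axes) model output in 0..1; one action per frame per axis."""
    for i, axis in enumerate(axis_names):
        actions = [{"at": int((start_frame + f) / fps * 1000),     # milliseconds
                    "pos": int(round(float(p[i]) * 100))}          # funscript positions are 0..100
                   for f, p in enumerate(preds)]
        with open(f"{out_prefix}.{axis}.funscript", "w") as fh:
            json.dump({"version": "1.0", "actions": actions}, fh)
```

A real exporter would also need to simplify/peak-pick rather than dumping one action per frame, but that's a separate problem.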
As a proof of concept though, I don’t mind it. I mostly still want to find a way to simplify images prior to prediction to improve the eventual speed and accuracy, and/or to leave some breadcrumbs for others to improve upon.