I agree with your suggestion of letting the model manage it. With that approach I was able to get good outputs on a lot of the more obvious cowgirl and similar scenes, which is what I put on GitHub and Hugging Face.
When I mentioned getting better outputs from pose recognition for videos, I was mostly just providing feedback on possible options and on blewClue215’s query, and looping back around on some other thoughts.
If anyone else is looking into this space, I’d say don’t take anything I’ve posted as correct: I’m a hobbyist, not a data scientist, and I’m just sharing as I go. A lot of my understanding is based on examples like this.
Overall I look at it as:
- There are model architectures that are often used for extracting ‘features’ from images: ResNet, VGG, EfficientNet, Inception, etc. They’re not necessarily only good for one thing; a detection model might use the same backbone architecture as a classification model.
- They can be trained from scratch, or there are pretrained weights that you can use.
- You can use the pretrained weights without the last layer as ‘backbones’ for your own models, which can reduce overall training time, but you’d still need to train your new layers and, for best results, fine-tune the original layers.
I haven’t looked back into the pose recognition models in a while, but they do have an appeal for me because their pretrained layers and architecture are proven suitable for tracking the position of a person in an image, which I would think could then be used by:
- stripping off the current head/output layers
- training a new output layer for the values we’d use
- and then fine-tuning all the layers with a low learning rate to further improve accuracy
That said, some of the pose models are a bit of a black box, and they tend to be more compute intensive (slower) and a bit memory intensive. So they don’t play well when I try to use them as a backbone for a model that makes predictions across multiple frames to keep the predictions temporally consistent.
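As a rough sketch of why a heavy backbone hurts in the multi-frame case, here is one possible way to wire it up in PyTorch. Everything here is an assumption for illustration: the `FrameSequenceModel` name, the GRU as the temporal layer, and the tiny stand-in backbone (a real pose or ResNet backbone would go in its place).

```python
import torch
import torch.nn as nn

class FrameSequenceModel(nn.Module):
    """Hypothetical wrapper: per-frame backbone -> GRU over time -> one value per frame."""

    def __init__(self, backbone: nn.Module, feature_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.backbone = backbone
        self.temporal = nn.GRU(feature_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = clips.shape
        # Every frame goes through the backbone, so its compute and
        # activation memory are multiplied by the number of frames t.
        feats = self.backbone(clips.reshape(b * t, c, h, w))
        feats = feats.reshape(b, t, -1)
        # The GRU lets each frame's prediction depend on earlier frames.
        seq, _ = self.temporal(feats)        # (b, t, hidden_dim)
        return self.head(seq).squeeze(-1)    # (b, t)

# Tiny stand-in backbone so the example runs quickly (not a real pose model).
tiny_backbone = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, stride=2, padding=1),
    nn.ReLU(),
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),                            # -> (N, 16)
)

model = FrameSequenceModel(tiny_backbone, feature_dim=16)
clips = torch.randn(2, 8, 3, 64, 64)         # 2 clips of 8 frames each
with torch.no_grad():
    preds = model(clips)
print(preds.shape)
```

With 8 frames per clip the backbone effectively sees a batch 8 times larger, which is where the slower and more memory-hungry backbones start to become a problem.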
Which is why I’ve been thinking more about what a good feature extractor would be.