yolov10m seems to be the best balance of size/performance/training time from what I’ve seen. That one is at herpaderpapotato/nsfw-identification-yolo10m on Hugging Face and scores ~50 on the validation set, compared to the previous best of ~40.
I wouldn’t generate waveforms or peaks like that if I were doing it for real (it’s live, with no forward+backward smoothing), and for the motion in this clip the waveform is inverted. This is just a quick and dirty example: mouse callbacks on the labels at the top let me click them on and off and then reset the normalization, to visualize the stability of the tracking over time.
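To show why live output differs from offline smoothing, here’s a minimal sketch (my own toy example, not from the actual tool): a causal exponential moving average can run live but lags the signal, while running the same filter forward and then backward gives zero phase lag but needs the whole recording up front.

```python
import numpy as np

def ema(x, alpha=0.2):
    """Causal exponential moving average -- usable live, but lags the signal."""
    out = np.empty(len(x), dtype=float)
    acc = x[0]
    for i, v in enumerate(x):
        acc = alpha * v + (1 - alpha) * acc
        out[i] = acc
    return out

def zero_phase(x, alpha=0.2):
    """Forward+backward EMA -- no phase lag, but needs the whole signal."""
    return ema(ema(x, alpha)[::-1], alpha)[::-1]

# Synthetic "tracking" signal: a slow sine with noise.
t = np.linspace(0, 2 * np.pi, 200)
sig = np.sin(t) + np.random.default_rng(0).normal(0, 0.1, t.size)

live = ema(sig)            # peaks arrive a few frames late
offline = zero_phase(sig)  # peaks line up with the input
```

The same idea is what `scipy.signal.filtfilt` does with proper IIR filters; the point is just that it can’t be done in real time.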
If I wanted to actually work with this as a process, I’d probably run prediction over the whole video ahead of time and dump the tracking data to file. torchvision can decode with CUDA straight into YOLO on the same GPU, with hardware acceleration for H.265 and H.264, and runs much faster than what I get from OpenCV.
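A rough sketch of the dump step (my own layout, and `detect` here is a stand-in for the real per-frame model call, so the snippet runs without a model or video): one JSON line per frame is easy to write incrementally and stream back later.

```python
import json
import os
import tempfile

def detect(frame_idx):
    """Stand-in for the per-frame model call (e.g. a YOLO predict).
    Returns fake detections so the sketch is self-contained."""
    return [{"label": "example", "conf": 0.9,
             "box": [0.1, 0.2, 0.5, 0.6]}]

def dump_tracking(n_frames, path):
    """Write one JSON object per frame (JSONL) with that frame's detections."""
    with open(path, "w") as f:
        for i in range(n_frames):
            f.write(json.dumps({"frame": i, "dets": detect(i)}) + "\n")

fd, path = tempfile.mkstemp(suffix=".jsonl")
os.close(fd)
dump_tracking(5, path)
```

In the real pipeline the frame loop would come from the decoder (e.g. iterating a `torchvision.io.VideoReader`) and `detect` would be the model, but the file format is the part that matters for reworking the data later.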
And then it’d be a case of working with that information in a GUI to mix/combine/offset the waveforms into something usable. GUIs are hard though, and supporting them is a nightmare. I just like playing with the tech.
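The mixing itself is simple; it’s the interaction that’s hard. As a sketch (names and shapes are my own assumptions, not from any actual tool), the mix/combine/offset step boils down to a weighted sum of per-label waveforms with integer frame offsets, renormalized at the end, with the GUI just exposing the weights and offsets as sliders:

```python
import numpy as np

def combine(waves, weights, offsets):
    """Weighted sum of per-label waveforms, each shifted by an integer
    frame offset, then renormalized to 0..1."""
    out = np.zeros(len(waves[0]))
    for w, wt, off in zip(waves, weights, offsets):
        out += wt * np.roll(w, off)
    lo, hi = out.min(), out.max()
    return (out - lo) / (hi - lo) if hi > lo else out

# Two toy per-label waveforms, mixed 70/30 with the second shifted 5 frames.
t = np.linspace(0, 2 * np.pi, 100)
a, b = np.sin(t), np.cos(t)
mixed = combine([a, b], weights=[0.7, 0.3], offsets=[0, 5])
```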