SLR: Can we do 3D object Segmentation for Passthrough?

@Realcumber @doublevr Hi, hope you are doing well. Passthrough is an amazing feature… could I offer a small suggestion?

[1] The current algorithm adjusts global chroma settings, which removes the girl's dark hair and other details too…

Right now, chroma keying is needed so that only the girl remains in the video, while the rest of the background gets blended into the user's real background. BUT this chroma setting is different for every user, depending on the type of background they have.

Right now what we are doing is simple GLOBAL thresholding, like the older Otsu's algorithm. Deep learning far outperforms such simple techniques in computer vision.

What I suggest: remove chroma altogether.
This removes the hassle of separate production costs, i.e. shooting with a green screen every time. ALSO, you could run this on potentially every video on SLR.
[1] Take the 3D VR video SLR already shoots.
[2] Pass it through a state-of-the-art 3D segmentation model. Those are freely available.

[3] Get the girl's body separated from the background (a segmentation mask), as well as a depth mask.
[4] Do this for each frame, and you have a segmentation mask for the whole video. It contains 0 where the background is and 1 where the girl is present.
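To illustrate steps [1] to [4]: a minimal sketch of per-frame mask extraction, where `predict_person_prob` is a hypothetical stand-in for whatever segmentation model is used (the model, frame format, and 0.5 threshold are all assumptions here):

```python
import numpy as np

def masks_for_video(frames, predict_person_prob):
    """Run a (hypothetical) segmentation model frame by frame and
    binarize its per-pixel person probability into a 0/1 mask."""
    masks = []
    for frame in frames:
        prob = predict_person_prob(frame)            # H x W floats in [0, 1]
        masks.append((prob > 0.5).astype(np.uint8))  # 1 = performer, 0 = background
    return np.stack(masks)                           # T x H x W

# toy stand-in for a real model: "person" occupies the right half of each frame
fake_model = lambda f: np.where(np.arange(f.shape[1]) >= f.shape[1] // 2,
                                0.9, 0.1) * np.ones(f.shape[:2])
video = np.zeros((3, 4, 6, 3))                       # 3 frames, 4x6 RGB
m = masks_for_video(video, fake_model)
```

The output is exactly the 0/1 array described above, one plane per frame.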

On the SLR app backend:
[1] Whenever someone enables the passthrough feature, the video is combined with the segmentation mask, so the masked-out background is replaced by the user's real surroundings:

This should be fast, since it is a simple per-pixel blend. So, no latency.
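For illustration, that client-side combine is a per-pixel blend of the video frame and the passthrough feed, gated by the mask; a minimal numpy sketch (the actual SLR player pipeline is of course different):

```python
import numpy as np

def composite(frame, mask, passthrough):
    """Blend the video frame over the headset's passthrough feed:
    keep the frame where mask == 1, show passthrough where mask == 0."""
    a = mask[..., None].astype(frame.dtype)   # H x W -> H x W x 1 alpha
    return a * frame + (1.0 - a) * passthrough

frame = np.full((2, 2, 3), 0.8)               # toy video frame
bg = np.zeros((2, 2, 3))                      # toy passthrough feed
mask = np.array([[1, 0], [0, 1]])             # performer on the diagonal
out = composite(frame, mask, bg)
```

In a real renderer this is one multiply-add per pixel, typically done on the GPU.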

Hope this helps,
thank you for listening to me,
p.s. I know it’s not my place to speak, you guys are already doing amazing work. Just thought to put my thoughts out there…



Great post @bashturtle - passed this on to SLR team - looks like they may explore this concept further :crossed_fingers:


Wow, that's a nice follow-up.

Looking into it right now

Also I would be really happy to see you joining passthrough discussions at

I would expect some great fine tuning coming


Right now we are working on segmentation using deep learning. There were some problems with alpha channeling on the Unity side, and some concerns about the alpha channel in the backend, since the work of re-encoding tons of videos for each resolution is immense. Some work has been done using language-driven segmentation, and it seems it may be usable under supervision, with better quality than off-the-shelf background removers (like CapCut), which fail on some adult scenes since they are not trained on naked people.

If you process the max resolution once, can't you downsample it to fit the other, lower resolutions? That way you only need to encode once per video.
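As a sketch of that idea, a binary mask computed once at full resolution can be resized for each lower resolution with nearest-neighbour sampling (sizes here are made up, not SLR's actual encode ladder):

```python
import numpy as np

def downsample_mask(mask, out_h, out_w):
    """Nearest-neighbour downsample of a binary mask so one full-resolution
    segmentation pass can serve every lower video resolution."""
    h, w = mask.shape
    rows = np.arange(out_h) * h // out_h   # source row for each output row
    cols = np.arange(out_w) * w // out_w   # source column for each output column
    return mask[np.ix_(rows, cols)]

full = np.zeros((8, 8), dtype=np.uint8)
full[:, 4:] = 1                            # performer on the right half
small = downsample_mask(full, 4, 4)        # one quarter of the resolution
```

Nearest-neighbour is the natural choice for 0/1 masks, since averaging filters would produce fractional values at the borders.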

I was thinking of preprocessing everything down to 960x270 and ignoring 200-degree videos for now. But I'm just doing this for fun. As such, I'm going to play with using an attention model to process sequences of frames to predict missing data. But again, just for fun, no results promised. Image segmentation plus control-point data would likely simplify this as well:

Just to further elaborate on discussion from our channel

So to further explain: ever noticed the real-time background blurring / background substitution some of us use during group meetings? It uses a segmentation model (MobileNetV3 / U-Net) followed by some unspecified secret-sauce mask border refinement algorithm. Alex can train a similar model for green-screening the actresses, but we don't have the bandwidth to hand-label the thousands of frames it would take to build such a model. What Alex is looking into is whether he can use his language-driven segmentation approach, with some border refining, to create the dataset needed to train MobileNetV3.
Just want to emphasize, this is in an exploratory phase.

The border refinement is always the hardest part, isn’t it?

CV Dev
Yep, that's the secret to making it look good.

Unity Dev
I even dropped the "color spill" part (also border-related) from the chroma key algorithm to avoid an even greater FPS drop…

CV Dev
Well, it's unknown where this will go if it proves promising. Could it run on current or future headsets, or just be used offline to speed up chroma-keying / alpha-channeling videos? No idea.

CV Dev 2
If it ends up working on the headset (we're far from that for now), it will probably be at low resolution. Maybe it would be worth blurring the transparency channel at the edges to mask the poor resolution?
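One simple way to do that edge blurring, sketched here with a plain box blur over the mask/alpha channel (the kernel size and the box-blur choice are just an illustration, not the team's actual plan):

```python
import numpy as np

def feather_alpha(mask, k=3):
    """Soften hard mask edges with a k x k box blur so a low-resolution
    alpha channel fades out instead of showing blocky borders."""
    pad = k // 2
    padded = np.pad(mask.astype(float), pad, mode="edge")
    h, w = mask.shape
    out = np.zeros((h, w), dtype=float)
    for dy in range(k):                 # accumulate the k*k neighbourhood
        for dx in range(k):
            out += padded[dy:dy + h, dx:dx + w]
    return out / (k * k)

hard = np.zeros((5, 5))
hard[:, 3:] = 1.0                       # hard vertical edge
soft = feather_alpha(hard)              # edge now ramps 0 -> 1/3 -> 2/3 -> 1
```

On a real device this would be a separable blur or a mipmap trick on the GPU, but the effect is the same: the transparency ramps smoothly across the border.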

Would be great to see how it worked for you

@doublevr, you don't need efficient models like MobileNet. Also, segmenting the girls doesn't require green-screen annotation.

Just use a standard transformer for instance segmentation… I would recommend Mask2Former from Facebook.

The whole idea of segmentation is that you don't need any border refinement: the segmentation masks you get after the attention layers in a transformer already take care of this.



Checking it out :+1:

Just give me a reference to one video on SLR you are facing the most problems with.


A sequence of frames would provide information on whether the borders flicker over time; I agree he should point you to a video.

Hi! I'm a CV researcher at SLR. We've tried different approaches, including pre-trained transformers (I don't remember if we tried SegFormer, but we did try the classic ViT), and almost all of them failed on non-standard poses. Here is an example of a video where almost all pretrained models fail: on the last part (doggy position). The failure is either flickering, completely wrong segmentation, or wrong class labelling. That is caused mainly by training on COCO or similar datasets, which do not contain naked persons in doggy-style positions. Language-driven segmentation shows great results, since its training data is much more flexible and vast, but since it is based on a pure ViT, the resolution is limited. For now, good results are achieved by language-driven segmentation plus a ResNet-like refiner. A CLIP-based prompt encoder lets us control the segmentation results by adjusting the prompts and their corresponding thresholds. We are thinking about releasing such a tool, with a corresponding GUI, as open source (not decided yet).
We are also working on knowledge distillation from this heavyweight but effective model: the in-domain dataset it generates is used to fine-tune a really tiny model, like a MobileNet-based U-Net, hoping it can run on a headset directly, close to real time (we don't think it will be a problem if the mask updates at a slightly lower FPS than the video itself).
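As an illustration of that distillation setup (the loss used here, a per-pixel cross-entropy against the teacher's soft masks, is my assumption, not necessarily what SLR actually uses):

```python
import numpy as np

def distill_loss(student_prob, teacher_prob, eps=1e-7):
    """Per-pixel binary cross-entropy of the tiny student's mask
    probabilities against the heavyweight teacher's soft masks."""
    s = np.clip(student_prob, eps, 1 - eps)   # avoid log(0)
    return float(-np.mean(teacher_prob * np.log(s)
                          + (1 - teacher_prob) * np.log(1 - s)))

teacher = np.array([[0.95, 0.05], [0.90, 0.10]])   # teacher pseudo-labels
good = distill_loss(np.array([[0.90, 0.10], [0.85, 0.15]]), teacher)
bad = distill_loss(np.array([[0.20, 0.80], [0.30, 0.70]]), teacher)
```

A student that tracks the teacher's soft masks gets a much lower loss than one that contradicts them, which is exactly the training signal distillation relies on.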
Thanks for offering to help with the research! We'll be glad if you find a pretrained model that runs well on every video; feel free to contact us.


Hi Alex,
Wow, that's technically thorough. Let me break this down, based on what I believe.
ViT is pretty basic, last I remember. It was an ICLR paper which proved that transformers transfer from NLP to images and that simple tokenization works. It's not the best architecture for segmentation.
This is not a per-frame image segmentation problem, as one might assume.
The reason is that once you have detected an actor, you need to track them in subsequent frames. That gives you so-called temporal consistency of the labels across future frames.
There is an entire field called Video Instance Segmentation. USE THAT.
I recommend: "In Defense of Online Models for Video Instance Segmentation" (search for VNext on GitHub and run that). That's the SOTA on YouTube-VIS.
The reason porn videos are difficult is because:
a) the male and female bodies touch, so instance separation is difficult;
b) only some parts of the female are visible sometimes.
The model I mentioned (IDOL) works on those cases.

Also, this is an online model, i.e. it only sees one frame at a time, so decoding is not expensive. You don't need to process the video as smaller clips/tubes. All it contains is a simple 2D DETR-based object detector and a memory bank for temporal association.
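A toy sketch of that memory-bank idea: associate each frame's instance masks to existing tracks, here by greedy mask IoU (IDOL itself uses learned contrastive embeddings for the association, so this is a deliberate simplification):

```python
import numpy as np

def iou(a, b):
    """Intersection-over-union of two boolean masks."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union else 0.0

def associate(memory, new_masks, thresh=0.3):
    """Greedily match this frame's instance masks against the memory bank
    of last-seen masks; unmatched masks start new track ids."""
    ids = []
    for m in new_masks:
        scores = [iou(m, prev) for prev in memory.values()]
        if scores and max(scores) >= thresh:
            best = list(memory)[int(np.argmax(scores))]
        else:
            best = len(memory)            # new track id
        memory[best] = m                  # update the bank with the latest mask
        ids.append(best)
    return ids

# two frames: the same blob shifts one pixel right and keeps its track id
f1 = np.zeros((6, 6), bool); f1[1:4, 1:4] = True
f2 = np.zeros((6, 6), bool); f2[1:4, 2:5] = True
bank = {}
ids1 = associate(bank, [f1])
ids2 = associate(bank, [f2])
```

This is what gives temporal consistency: the performer keeps the same id across frames, even when the detector's per-frame output jitters.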

The ICLR-paper model is good, but here is why it won't work: the video you shared fails in the doggy position because the network thinks the actor is a 'dog' and not a 'person'. Therefore you would need a 'dog' label as input to segment it. But it's a person, so track it as a person over time.
I recommend a video instance segmentation model for this.

My advice would be to never put a model on the client end, since on-device models are pretty light and bad at segmentation. I think that is why you mentioned Hinton's knowledge distillation: you are trying to improve the performance of that smaller model. My suggestion is to move away from that:

Instead, generate the segmentation masks on the server. A mask is just an array of 0s and 1s. The issue is that it is pretty high resolution (around 6K), and you shouldn't stream both the mask stream and the video stream from SLR servers, since that would need double the bandwidth. I suggest this:

[1] Take your mask.
[2] Run COCO's standard Run-Length Encoding (RLE) on it. This will compress the masks of a whole 1-hour video into around 1-2 MB, in my experience.
[3] Decode on the client (check the COCO Python API).
[4] Do alpha matting on the client.
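A minimal pure-numpy sketch of the RLE idea from step [2] (the real COCO format differs in its details; in practice you would use `pycocotools.mask.encode`/`decode`):

```python
import numpy as np

def rle_encode(mask):
    """Run-length encode a flattened binary mask: store the lengths of
    alternating runs, starting with a run of zeros (COCO convention)."""
    flat = mask.flatten(order="F")               # column-major, like COCO
    change = np.flatnonzero(np.diff(flat)) + 1   # positions where the value flips
    bounds = np.concatenate(([0], change, [flat.size]))
    runs = np.diff(bounds).tolist()
    if flat[0] == 1:                             # must start with a zero-run
        runs = [0] + runs
    return runs

def rle_decode(runs, shape):
    """Rebuild the binary mask from its run lengths."""
    flat = np.zeros(int(np.sum(runs)), dtype=np.uint8)
    pos, val = 0, 0
    for r in runs:
        flat[pos:pos + r] = val
        pos += r
        val ^= 1                                 # alternate 0 / 1
    return flat.reshape(shape, order="F")

m = np.array([[0, 1], [0, 1]], dtype=np.uint8)
runs = rle_encode(m)
restored = rle_decode(runs, m.shape)
```

Because real masks are mostly long runs of background, the run lengths compress to a tiny fraction of the raw per-pixel array, which is the whole bandwidth argument above.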
Also, check this :

Networks can segment humans. 'Nude' humans are just humans without clothes. I have done a lot of gait-analysis work for 'the big guys', and nude people are easily detected.

So I guess it should work well out of the box; otherwise all of the governments' security research would have failed. A robber could just get rid of their clothes, never get detected, and rob you :slight_smile: Of course, if you think it's worth collecting the data, that's good, but the lack of it should not stop future upgrades, in my opinion.

I don't know if I have what it takes to stick to a single problem for months at a time. All I know are some small things that should work, which I suggest to you. So, I suggest keeping it short and simple. :–)
Thanks so much for giving this post due consideration,


Hi, I’m another CV researcher at SLR.

The biggest underlying 900 lb gorilla of a problem is datasets and covering all the cases. So yes, you can grab a pretrained model for segmenting humans and immediately start segmenting frames, BUT the Googles, Metas, and universities do not have sexual acts in their publicly available or internal training sets. What happens is that the models work great for frames that resemble Zoom chats but start falling apart once bodies start mashing into each other. A lot of effort right now goes into building training sets that cover those corner cases, using a combination of human labelers, synthetic CG scenes, and semi-automatic segmentation.

As far as which particular variant of segmentation to use, as my academic advisor always said, “Algorithms come and go, the money’s in the data.”


As far as mask updates go, we're pretty aware of it, but there are limitations with any of our choices:

  1. Running on the device: would solve the cost/time of preprocessing the ungodly huge content library, plus that's what the cool kids like Google Meet and Zoom do in real time. Nope, too slow: not enough GPU left over after rendering VR180 scenes.
  2. Precomputed Alpha Channel: the devs tell us video codecs don't handle alpha channels well.
  3. Chroma Green Background: works to a point; bleeding colors, user-unfriendly knobs and dials.
  4. Secondary Encoding of Masks: RLE or LZW encoding of the masks per frame. We've brought this up with the devs, but there's the management of these files (or one big file), syncing them to frames, and the different video resolutions. This might be the solution, but we're still at the "just get segmentation working correctly" stage.

[2] Precomputed Alpha Channel: Maybe use a .TS stream instead of .MP4. That way you won't have to decode several frames around the keyframe you wish to modify…

Thanks a lot for the comments! Will check your proposals.
I totally agree that approaching the task as video, rather than segmenting frames alone, can provide much better results. But experimentally, the language-based models plus refinement show the best results for now. The image with a dog instead of a girl was produced by a COCO-trained transformer, not the language-based model.
In the case of the language-based ones, if the 'dog' prompt is simply not included, the probability of 'girl' still remains high enough to classify the region correctly, so the class-mislabelling problem vanishes.
It probably makes sense to just remove most of the classes from the segmentation head in other models too; we will play with that.
The further refinement by the ResNet surprisingly outputs temporally stable results. As for humans: for several architectures it makes little difference whether the person is big or small, but standing versus bending over, with no face and the usually visible body parts hidden, is another matter. There is no public dataset containing persons standing in a doggy-style pose, so to the model it does not look like a human at all. Moreover, pose changes in videos are usually done via cut scenes, so the advantages of continuity in a video-based model are limited (in general I agree that a video-based model would be superior). And as the other CV engineer said, we are looking at our chances of making this work acceptably without video labelling. If that fails, we will label videos, and in that case many more possibilities will open up, including fine-tuning video instance segmentation models. Thanks a lot for mentioning RLE.
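A toy numpy sketch of that prompt-exclusion effect: with a softmax over per-prompt scores, dropping the confusing 'dog' prompt redistributes its probability mass so 'girl' clears the threshold (the scores and threshold here are made up for illustration, not real CLIP outputs):

```python
import numpy as np

def classify_region(scores, prompts, threshold=0.4):
    """Pick the prompt whose softmax probability clears the threshold.
    Dropping a confusing prompt redistributes its probability mass."""
    logits = np.array([scores[p] for p in prompts])
    probs = np.exp(logits - logits.max())      # numerically stable softmax
    probs /= probs.sum()
    best = int(np.argmax(probs))
    return prompts[best] if probs[best] >= threshold else None

# hypothetical per-region similarity scores from a CLIP-style encoder
scores = {"girl": 2.0, "dog": 2.3, "background": 0.5}
with_dog = classify_region(scores, ["girl", "dog", "background"])
without_dog = classify_region(scores, ["girl", "background"])
```

With 'dog' in the prompt list the region is mislabelled; remove the prompt and 'girl' wins with a comfortable margin, which is the behaviour described above.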


Hi alex, thanks so much for the reply,
[1] Hmmm, it is surprising that language-based segmentation gives you better results in spite of solving a multi-modal grounding task. That teaches me something new, thank you. Of course, you know far more, since you are on the practical side of things.
[2] Nice. It seems to me that trying things without video labelling should work well. If you guys ever need to discuss such things, I'd be down :slight_smile:
[3] Regarding sudden pose changes, that should not be a problem. As soon as a sudden cut is made in the video (say the girl moves from front cowgirl to doggy), a different DETR slot will start tracking the girl in doggy. The temporal-association part takes care of that implicitly. I also see this as a re-ID problem.


@bashturtle @serendipitis here we go. Or is there anything you are looking for?

With the current model there are no troubles for now. The problems that occurred with the pre-trained models were most visible in the last part of that 20s video. Also posted here: SLR: Can we do 3D object Segmentation for Passthrough? - #4 by doublevr. A specific frame is hard to point to: different models fail on different frames in that short video.
