This demonstrates that neural networks benefit from breaking down a complex objective. We provide some visual comparisons in Fig. 9. Both of these training stages are carried out on the MODNet architecture. Finally, MODNet has better generalization ability thanks to our SOC strategy. In this section, we elaborate on the architecture of MODNet and the constraints used to optimize it. For example, background matting [BM] replaces the trimap with a separate background image, but it needs two powerful models if you would like to achieve somewhat accurate results. Fortunately for us, this new technique can process human matting from a single input image, without the need for a green screen or a trimap, in real time at up to 63 frames per second! Zhang \etal[LFM] applied a fusion network to combine the predicted foreground and background. The training process is robust to these hyper-parameters. As shown in Fig. 9, when a moving object suddenly appears in the background, the result of BM is affected, but MODNet is robust to such disturbances. Toldo \etal[udamss] presented a consistency-based domain adaptation strategy for semantic segmentation. The GrabCut algorithm basically estimates the color distributions of the foreground item and of the background using a Gaussian mixture model (a minimal OpenCV sketch follows this paragraph). For example, (1) whether the whole human body is included; (2) whether the image background is blurred; and (3) whether the person holds additional objects. We start by reducing the size of the segmented object to leave a bit of space for the unknown region: we erode it, iteratively removing some pixels at the contour of the object. Although the SPS pre-training is optional for MODNet, it plays a vital role in other trimap-free methods. Unfortunately, this technique needs two inputs: an image and its trimap. This network architecture is much faster because it first computes the semantic estimation itself, using a basic decoder inside the low-resolution branch. We first pick the portrait foregrounds from AMD. To facilitate real-time interaction, we adopt the MobileNetV2 [net_mobilenetv2] architecture, an ingenious model developed for mobile devices, as our low-resolution branch S. When analysing the feature maps in S(I), we notice that some channels have more accurate semantics than others. Second, MODNet achieves state-of-the-art results, benefiting from (1) objective decomposition and concurrent optimization, and (2) specific supervisions for each of the sub-objectives. In computer vision, we can divide these mechanisms into spatial-based or channel-based according to their operating dimension. Lutz \etal[AlphaGAN] demonstrated the effectiveness of generative adversarial networks [GAN] in matting. First, MODNet is much faster. Second, professional photography is often carried out under controlled conditions, like special lighting that is usually different from those observed in our daily life. Third, MODNet can be easily optimized end-to-end since it is a single well-designed model instead of a complex pipeline. We compare MODNet with FDMPA [FDMPA], LFM [LFM], SHM [SHM], BSHM [BSHM], and HAtt [HAtt]. Table 3 shows the quantitative results on the aforementioned benchmark. As exhibited in Fig. 4(b)(c)(d), the samples in PHM-100 have more natural backgrounds and richer postures.
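As a point of comparison for the GrabCut baseline described above, here is a minimal OpenCV sketch. It assumes an input photo of a person and a rough bounding rectangle around the subject; the file name, rectangle, and iteration count are illustrative placeholders rather than the blog's exact settings.

```python
import cv2
import numpy as np

# Illustrative input: any photo of a person and a rough box around them.
image = cv2.imread("portrait.jpg")
mask = np.zeros(image.shape[:2], np.uint8)          # GrabCut writes its labels here
rect = (50, 50, image.shape[1] - 100, image.shape[0] - 100)

# Internal Gaussian-mixture-model state required by cv2.grabCut.
bgd_model = np.zeros((1, 65), np.float64)
fgd_model = np.zeros((1, 65), np.float64)

# Fit foreground/background colour GMMs and iteratively relabel pixels.
cv2.grabCut(image, mask, rect, bgd_model, fgd_model, 5, cv2.GC_INIT_WITH_RECT)

# Definite or probable foreground becomes 1, everything else 0.
binary = np.where((mask == cv2.GC_FGD) | (mask == cv2.GC_PR_FGD), 1, 0).astype("uint8")
cut_out = image * binary[:, :, None]                # background pixels are zeroed
```

The resulting binary mask is exactly the kind of coarse segmentation that the erosion/dilation step discussed next turns into a trimap.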
However, this scheme will identify all objects in front of the human, i.e., objects closer to the camera, as the foreground, leading to an erroneous trimap for matte prediction in some scenarios. We set λs = λα = 1 and λd = 10. We supervise sp with a thumbnail of the ground truth matte αg. MODNet versus BM under Fixed Camera Position. The inference time of MODNet is 15.8 ms (63 fps), which is about twice the fps of the previously fastest method, FDMPA (31 fps). We follow the original papers to reproduce the methods that have no publicly available code. We denote the outputs of D as D(I, S(I)), which implies the dependency between sub-objectives: the high-level human semantics S(I) is a priori for detail prediction. For example, the predicted foreground probability of a pixel that actually belongs to the background may be wrong in the predicted alpha matte αp but correct in the predicted coarse semantic mask sp. Moreover, MODNet suffers less from the domain shift problem in practice thanks to the proposed SOC and OFD. [DIM] suggested using background replacement as a data augmentation to enlarge the training set, and it has become a typical setting in image matting. Since the fine boundaries are preserved in ~dp output by M, we append an extra constraint Ldd to maintain the details in M. We generalize MODNet to the target domain by optimizing Lcons and Ldd simultaneously. Specifically, the pixel values in a depth map indicate the distance from the 3D locations to the camera, and the locations closer to the camera have smaller pixel values. Nonetheless, feeding RGB images into a single neural network still yields unsatisfactory alpha mattes. OFD uses the information from the preceding frame and the following frame to fix the unknown pixels hesitating between foreground and background. Applying Ls and Ld to constrain human semantics and boundary details brings considerable improvement. We believe that our method challenges the necessity of using a green screen for real-time human matting. The background replacement of [DIM] is applied to extend our training set. Currently, trimap-free methods always focus on a specific type of foreground object, such as humans. After that, we add the third section, the unknown region, by dilating the object, adding pixels around the contour (see the sketch after this paragraph). SOC (Fig. 1(b)) is applied to improve the performance of MODNet in the new domain. Consistency is one of the most important assumptions behind many semi-/self-supervised [semi_un_survey] and domain adaptation [udda_survey] algorithms. In contrast, our MODNet imposes consistency among various sub-objectives within a single model. MODNet is a light-weight matting objective decomposition network that can process portrait matting from a single input image in real time. MODNet is trained end-to-end through the sum of Ls, Ld, and Lα: L = λs·Ls + λd·Ld + λα·Lα, where λs, λd, and λα are hyper-parameters balancing the three losses. Second, the high-level representation S(I) is helpful for subsequent branches and joint optimization. Intuitively, semantic estimation outputs a coarse foreground mask while detail prediction produces fine foreground boundaries, and semantic-detail fusion aims to blend the features from the first two sub-objectives.
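A minimal sketch of the erode/dilate trimap construction described above, assuming the binary person segmentation (1 = person, 0 = background) is already available as a uint8 array; the kernel size and iteration counts are illustrative values, not the original implementation's.

```python
import cv2
import numpy as np

def make_trimap(person_mask: np.ndarray, erode_iter: int = 5, dilate_iter: int = 5) -> np.ndarray:
    """Turn a binary person mask (uint8, 1 = person, 0 = background) into a trimap
    with 0 = background, 128 = unknown, 255 = foreground."""
    kernel = np.ones((3, 3), np.uint8)

    # Shrink the person region so its contour falls inside the unknown band.
    sure_fg = cv2.erode(person_mask, kernel, iterations=erode_iter)

    # Grow the person region outwards; the extra ring is also uncertain.
    maybe_fg = cv2.dilate(person_mask, kernel, iterations=dilate_iter)

    trimap = np.zeros_like(person_mask, dtype=np.uint8)
    trimap[maybe_fg == 1] = 128   # unknown ring around the contour
    trimap[sure_fg == 1] = 255    # confident foreground core
    return trimap
```

Everything covered by the dilated mask but not by the eroded mask becomes the unknown band in which a matting model is asked to predict fine alpha values.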
": https://arxiv.org/pdf/2011.11961.pdf, Implement GrabCut yourself: https://github.com/louisfb01/iterative-grabcut, MODNet GitHub code: https://github.com/ZHKKKe/MODNet, Deep Image Matting - Adobe Research: https://sites.google.com/view/deepimagematting, CNNs explanation video: https://youtu.be/YUyec4eCEiY. It removes the fine structures (such as hair) that are not essential to human semantics. The GitHub repo (linked in comments) has been edited with code and commercial solution for anyone interested! 10 provides more visual comparisons of MODNet and the existing trimap-free methods on PHM-100. Then, we produce a segmentation where the pixels equivalent to the person are set to 1, and the rest of the image is set to 0. Therefore, we append a SE-Block [net_senet] after S to reweight the channels of S(I). For human matting without the green screen111Also known as the blue screen technology., existing works either require auxiliary inputs that are costly to obtain or use multiple models that are computationally expensive. (2020), https://github.com/ZHKKKe/MODNet[3] Xu, N. et al., Deep Image MattingAdobe Research (2017), https://sites.google.com/view/deepimagematting[4] GrabCut algorithm by OpenCV, https://docs.opencv.org/3.4/d8/d83/tutorial_py_grabcut.html. Visual Comparisons of Trimap-free Methods on PHM-100. Looking like this. The supervised way takes an input, and learns to remove the background based on a corresponding ground-truth, just like usual networks. Applying image processing algorithms independently to each video frame often leads to temporal inconsistency in the outputs. For unlabeled images from a new domain, the three sub-objectives in MODNet may have inconsistent outputs. However, its implementation is a more complicated approach compared to MODNet. With a batch size of 16, the initial learning rate is 0.01 and is multiplied by 0.1 after every 10 epochs. Human matting is an extremely interesting task where the goal is to find any human in a picture and remove the background from it. Modern deep learning and the power of our GPUs made it possible to create much more powerful applications that are yet not perfect. It takes one RGB image as input and uses a single model to process human matting in real time with better performance. In fact, the pixels with md=1 are the ones in the unknown area of the trimap. Sengupta \etal[BM] proposed to capture a less expensive background image as a pseudo green screen to alleviate this issue. - Real-Time High-Resolution Background Matting, keras-onnx We also compare MODNet against the background matting (BM) proposed by [BM]. - Core ML tools contain supporting tools for Core ML model conversion, editing, and validation. Real-world data can be divided into multiple domains according to different device types or diverse imaging methods. With the tremendous progress of deep learning, many methods based on convolutional neural networks (CNN) have been proposed, and they improve matting results significantly. One possible future work is to address video matting under motion blurs through additional sub-objectives, e.g., optical flow estimation. However, these methods consist of multiple models and constrain the consistency among their predictions. It basically takes what the first network learned, and understands the consistency between the object in each frame to correctly remove the background. However, the trimap is costly for humans to annotate, or suffer from low precision if captured via a depth camera. 
These drawbacks make all the aforementioned matting methods unsuitable for real-time applications, such as preview in a camera. Moreover, we introduce two techniques, SOC and OFD, to generalize MODNet to new data domains and to smooth the matting results on videos. Although dp may contain inaccurate values for the pixels with md=0, it has high precision for the pixels with md=1. The best example here is Deep Image Matting, made by Adobe Research in 2017. In contrast, MODNet avoids such a problem by decoupling from the trimap input. In practice, we set the threshold used to measure the similarity of pixel values to 0.1. For each foreground, we generate 5 samples by random cropping and 10 samples by compositing the backgrounds from the OpenImage dataset [openimage] (a compositing sketch follows this paragraph). Liu \etal[BSHM] concatenated three networks to utilize coarsely labeled data in matting. Compared with them, our MODNet is light-weight in terms of both input and pipeline complexity. Other works designed pipelines that contain multiple models. One possible future work is to address video matting under motion blur through additional sub-objectives, e.g., optical flow estimation. More Visual Comparisons of Trimap-free Methods on PHM-100. MODNet is shown to have good performance on the carefully designed PHM-100 benchmark and a variety of real-world data. Specifically, MODNet has a low-resolution branch (supervised by the thumbnail of the ground truth matte) to estimate human semantics. Popular CNN architectures [net_resnet, net_mobilenet, net_densenet, net_vggnet, net_insnet] generally contain an encoder, i.e., a low-resolution branch, to reduce the resolution of the input. By assuming that the images captured by the same kind of device (such as smartphones) belong to the same domain, we capture several video clips as the unlabeled data for self-supervised SOC domain adaptation. The downsampling and the use of fewer convolutional layers in the high-resolution branch are done to reduce the computation time. Since the decomposed sub-objectives are correlated and help strengthen each other, we can optimize MODNet end-to-end. Consistency Constraint. Besides, limited by the insufficient amount of labeled training data, trimap-free methods often suffer from domain shift [DomainShift] in practice, i.e., the models cannot generalize well to real-world data, which has also been discussed in [BM].
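The background replacement augmentation mentioned above boils down to alpha-compositing a labeled foreground onto new backgrounds. Below is a minimal sketch under the assumption that the foreground, its alpha matte, and a candidate background are float arrays in [0, 1]; the load_image helper and the path list in the usage comment are hypothetical placeholders.

```python
import numpy as np

def composite(fg: np.ndarray, alpha: np.ndarray, bg: np.ndarray) -> np.ndarray:
    """Alpha compositing: I = alpha * F + (1 - alpha) * B.

    fg, bg: (H, W, 3) float arrays in [0, 1]
    alpha:  (H, W)    float array in [0, 1]
    """
    a = alpha[:, :, None]                 # broadcast the matte over the RGB channels
    return a * fg + (1.0 - a) * bg

# Example use: pair one labeled foreground with several random backgrounds
# (the load_image helper and the path list are hypothetical placeholders).
# backgrounds = [load_image(p) for p in random_background_paths]
# samples = [composite(fg, alpha, bg) for bg in backgrounds]
```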
Attention [attention_survey] for deep neural networks has been widely explored and proven to boost performance notably. It outperforms the trimap-based DIM, which reveals the superiority of our network architecture. Next, we use basic computer vision transformations to create a trimap from this segmentation. To guarantee sample diversity, we define several classifying rules to balance the sample types in PHM-100. In summary, we present a novel network architecture, named MODNet, for trimap-free human matting in real time. MODNet is easy to train in an end-to-end manner. We prove this standpoint with the matting results on the Adobe Matting Dataset (refer to Appendix B for the results on portrait images with synthetic backgrounds from the Adobe Matting Dataset). In addition, OFD further removes flickers on the boundaries. As shown in Fig. 3, M has three outputs for an unlabeled image ~I: ~sp, ~dp, and ~αp. We force the semantics in ~αp to be consistent with ~sp and the details in ~αp to be consistent with ~dp through the consistency loss Lcons, where ~md indicates the transition region in ~αp and G has the same meaning as in the semantic supervision (a sketch of this consistency objective follows this paragraph). Table 1 shows the results on PHM-100; MODNet surpasses other trimap-free methods in both MSE and MAD. It measures the absolute difference between the input image I and the composited image obtained from αp, the ground truth foreground, and the ground truth background. BM relies on a static background image, which implicitly assumes that all pixels whose values change in the input image sequence belong to the foreground. Although our results are not able to surpass those of the trimap-based methods on the human matting benchmarks with trimaps, our experiments show that MODNet is more stable in practical applications due to the removal of the trimap input. For example, Ke \etal[GCT] designed a consistency-based framework that could be used for semi-supervised matting. In contrast, we present a light-weight matting objective decomposition network (MODNet), which can process human matting from a single input image in real time. Xu \etal[DIM] proposed an auto-encoder architecture to predict an alpha matte from an RGB image and a trimap. In the application of video matting, one-frame delay (OFD) is applied. To obtain better results, some matting models [GCA, IndexMatter] combined spatial-based attention mechanisms that are time-consuming. It is designed for real-time applications, running at 63 frames per second (fps) on an Nvidia GTX 1080Ti GPU with an input size of 512×512. You can just imagine the time it would need to process a whole video. At the end of MODNet, a fusion branch (supervised by the whole ground truth matte) is added to predict the final alpha matte. We regard a pixel as flickering if it satisfies the following conditions C. Therefore, existing trimap-free models always tend to overfit the training set and perform poorly on real-world data. There are two insights behind MODNet. MODNet achieves remarkable results in daily photos and videos.
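To make the consistency idea concrete, here is a hedged sketch of such an objective over the three outputs for an unlabeled image: the predicted matte is downsampled as a stand-in for G and pulled toward the semantic output, while inside the transition region ~md it is pulled toward the detail output. The 1/16 semantic resolution, the choice of norms, and the equal weighting are assumptions; the paper's exact Lcons may differ.

```python
import torch
import torch.nn.functional as F

def soc_consistency(alpha_p: torch.Tensor, s_p: torch.Tensor,
                    d_p: torch.Tensor, m_d: torch.Tensor,
                    scale: int = 16) -> torch.Tensor:
    """Self-supervised consistency between matte, semantics, and details.

    alpha_p: predicted matte            (B, 1, H, W)
    s_p:     coarse semantic output     (B, 1, H/scale, W/scale)  -- assumed resolution
    d_p:     boundary detail output     (B, 1, H, W)
    m_d:     transition-region mask     (B, 1, H, W), 1 inside the unknown band
    """
    # Stand-in for G: bring the matte down to the semantic resolution.
    alpha_small = F.interpolate(alpha_p, scale_factor=1.0 / scale,
                                mode="bilinear", align_corners=False)
    semantic_term = F.mse_loss(alpha_small, s_p)

    # Agreement with the detail branch, restricted to the transition region.
    detail_term = torch.abs(m_d * (alpha_p - d_p)).mean()

    return semantic_term + detail_term
```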
C indicates that if the values of a pixel in frames t−1 and t+1 are close to each other, while its value in frame t is very different from both, a flicker appears in frame t (a sketch of this rule follows below). Results of SOC and OFD on a Real-World Video. Now, do you really need a green screen for real-time human matting? Others [SHM, BSHM, DAPM] apply multiple models to first generate a pseudo trimap or semantic mask, which then serves as the prior for alpha matte prediction.
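Finally, the flicker rule described by condition C can be turned into a simple post-processing pass over consecutive predicted mattes. The sketch below follows that description, reusing the 0.1 similarity threshold quoted earlier; the array shapes and the choice to replace flickering pixels with the neighbour average are assumptions.

```python
import numpy as np

def ofd_smooth(prev_a: np.ndarray, curr_a: np.ndarray, next_a: np.ndarray,
               threshold: float = 0.1) -> np.ndarray:
    """Fix flickering pixels in frame t using frames t-1 and t+1.

    prev_a, curr_a, next_a: predicted alpha mattes for frames t-1, t, t+1,
    stored as float arrays in [0, 1] with identical shapes.
    """
    # Condition C: the neighbouring frames agree with each other...
    neighbours_close = np.abs(prev_a - next_a) <= threshold
    # ...but frame t disagrees with both of them.
    differs_from_prev = np.abs(curr_a - prev_a) > threshold
    differs_from_next = np.abs(curr_a - next_a) > threshold

    flicker = neighbours_close & differs_from_prev & differs_from_next

    # Assumed fix: replace flickering pixels by the neighbour average.
    fixed = curr_a.copy()
    fixed[flicker] = 0.5 * (prev_a[flicker] + next_a[flicker])
    return fixed
```

Because the fix for frame t needs frame t+1, applying it to a live stream introduces exactly the one-frame delay that gives the trick its name.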