Image and video editing are two of the most popular applications for computer users. With the advent of machine learning (ML) and deep learning (DL), image and video editing have been progressively addressed by a range of neural network architectures. Until recently, most DL models for image and video editing were supervised: the training data had to contain pairs of input and output images from which the model learns the details of the required transformation. More recently, end-to-end learning frameworks have been proposed that require only a single image as input to learn the mapping to the desired edited output.
Video matting is a specific task related to video editing. The term “matting” dates back to the 19th century, when glass plates with matte paintings were set in front of the camera during filming to create the illusion of an environment not present at the filming location. The composition of many digital images follows a similar process today: a compositing equation expresses the observed intensity of each pixel as a linear combination of a foreground and a background component, weighted by an alpha matte.
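Per pixel, this compositing equation is C = αF + (1 − α)B, where α is the alpha matte. A minimal sketch in NumPy (the array shapes and value ranges are illustrative assumptions, not details from the paper):

```python
import numpy as np

def composite(foreground, background, alpha):
    """Alpha-composite a foreground over a background.

    foreground, background: float arrays of shape (H, W, 3), values in [0, 1]
    alpha: float array of shape (H, W, 1), values in [0, 1]
    """
    return alpha * foreground + (1.0 - alpha) * background

# Toy 1x1 "images": a red foreground over a blue background at 50% opacity.
fg = np.array([[[1.0, 0.0, 0.0]]])
bg = np.array([[[0.0, 0.0, 1.0]]])
a = np.array([[[0.5]]])
print(composite(fg, bg, a))  # [[[0.5 0.  0.5]]]
```

Matting is the inverse problem: given only C, recover F, B, and α, which is underconstrained and hence relies on priors.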
Although powerful, this process has limitations. It requires a fuzzy factorization of the image into foreground and background layers, which are then treated independently. In video matting, where the input is a sequence of temporally and spatially dependent frames, decomposing these layers becomes a complex task.
The goals of this paper are to increase the clarity and accuracy of this decomposition. The authors propose factor matting, a variant of the matting problem that factors a video into more independent components for downstream editing operations. To address it, they introduce FactorMatte, an easy-to-use framework that combines classical matting priors with conditional priors based on the distortions expected in a scene. For example, the classical Bayesian formulation, which estimates the layers by maximum a posteriori (MAP) inference, is extended to remove the limiting assumption that foreground and background are independent. Most prior approaches also assume that background layers remain static over time, which is severely limiting for many video scenes.
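For context, the classical Bayesian matting formulation (e.g., Chuang et al.) estimates the layers by maximizing the posterior over foreground, background, and alpha; under the independence assumption mentioned above, the log-posterior splits into separate per-component prior terms:

```latex
\arg\max_{F,B,\alpha} P(F, B, \alpha \mid C)
  = \arg\max_{F,B,\alpha} \; L(C \mid F, B, \alpha) + L(F) + L(B) + L(\alpha)
```

Here L(·) denotes a log-likelihood, and the additive terms L(F), L(B), L(α) encode exactly the independence assumption that FactorMatte relaxes by conditioning the priors on scene interactions.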
To overcome these limitations, FactorMatte relies on two modules: a decomposition network that factors the input video into one or more layers per component, and a set of patch-based discriminators that represent conditional priors on each component. The architectural pipeline is depicted below.
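To illustrate what “patch-based” means here: rather than judging an entire frame, such a discriminator scores overlapping local patches, so the learned prior constrains local appearance. A hedged sketch of just the patch-extraction step (patch size and stride are arbitrary choices for illustration, not values from the paper):

```python
import numpy as np

def extract_patches(image, size, stride):
    """Slice an image into overlapping square patches.

    A patch-based discriminator scores each such patch as real/fake,
    producing a map of local scores instead of one global decision.
    """
    h, w = image.shape[:2]
    patches = []
    for y in range(0, h - size + 1, stride):
        for x in range(0, w - size + 1, stride):
            patches.append(image[y:y + size, x:x + size])
    return np.stack(patches)

patches = extract_patches(np.zeros((64, 64, 3)), size=16, stride=8)
print(patches.shape)  # (49, 16, 16, 3)
```

In practice this sliding-window behavior is implemented implicitly by the receptive field of a fully convolutional discriminator, but the effect on the prior is the same: it is marginal over local patches.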
The input to the decomposition network is a video together with a rough segmentation mask for the object of interest, frame by frame (left, yellow box). From this information, the network produces color and alpha layers (middle, green and blue boxes) trained with a reconstruction loss. The foreground layer contains the foreground component (right, green box), while the environment layer and the residual layer together model the background component (right, blue box). The environment layer represents the static aspects of the background, while the residual layer captures more irregular changes in the background caused by interactions with foreground objects (the pillow deformation in the figure). For each of these layers, a discriminator is trained to learn the corresponding marginal priors.
Matting results for selected samples are presented in the figure below.
Although FactorMatte is not perfect, its results are clearly more accurate than those of the baseline approach (OmniMatte). In all of the samples shown, the background and foreground layers are cleanly separated from each other, which cannot be said for comparable solutions. Furthermore, further studies have been conducted to verify the effectiveness of the proposed solution.
This was a summary of FactorMatte, a novel framework addressing the video matting problem. If you are interested, you can find more information in the links below.
Check out the paper, code, and project page. All credit for this research goes to the researchers on this project. Also, don’t forget to join our Reddit page and Discord channel, where we share the latest AI research news, cool AI projects, and more.
Daniel Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padua, Italy. He is a Ph.D. candidate at the Information Technology Institute (ITEC) at Alpen-Adria-Universität (AAU) Klagenfurt. He is currently working at the Christian Doppler Laboratory Athena, and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE evaluation.