Abstract

This paper addresses the video rescaling task, which arises from the need to adapt video spatial resolution to suit individual viewing devices. We aim to jointly optimize video downscaling and upscaling as a combined task. Most recent studies focus on image-based solutions, which do not consider temporal information. We present two joint optimization approaches based on invertible neural networks with coupling layers. Our Long Short-Term Memory Video Rescaling Network (LSTM-VRN) leverages temporal information in the low-resolution video to form an explicit prediction of the missing high-frequency information for upscaling. Our Multi-input Multi-output Video Rescaling Network (MIMO-VRN) proposes a new strategy for downscaling and upscaling a group of video frames simultaneously. Not only do they outperform the image-based invertible model in quantitative and qualitative results, but they also show much improved upscaling quality over video rescaling methods without joint optimization. To the best of our knowledge, this work is the first attempt at jointly optimizing video downscaling and upscaling.

Introduction Video

CVPR 2021 Paper

            @InProceedings{Huang_2021_CVPR,
                author    = {Huang, Yan-Cheng and Chen, Yi-Hsin and Lu, Cheng-You and Wang, Hui-Po and Peng, Wen-Hsiao and Huang, Ching-Chun},
                title     = {Video Rescaling Networks With Joint Optimization Strategies for Downscaling and Upscaling},
                booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
                month     = {June},
                year      = {2021},
                pages     = {3527-3536}
            }
            

Method


Overview of the proposed LSTM-VRN and MIMO-VRN for video rescaling. Both schemes involve an invertible network with coupling layers for video downscaling and upscaling. In part (a), LSTM-VRN downscales every video frame $x_t$ independently and forms a prediction $\hat{z}_t$ of the high-frequency component $z_t$ from the LR video frames $\{\hat{y}_i\}_{i=t-L}^{t+L}$ using a bi-directional LSTM that operates in a sliding-window manner. In part (b), MIMO-VRN downscales a group of HR video frames $\{x_i\}_{i=t}^{t+g}$ into the LR video frames $\{\hat{y}_i\}_{i=t}^{t+g}$ simultaneously. Upscaling is likewise done on a group-by-group basis, with the high-frequency components $\{z_i\}_{i=t}^{t+g}$ estimated from $\{\hat{y}_i\}_{i=t}^{t+g}$ by a predictive module.
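Since both schemes build on invertible coupling layers, a minimal PyTorch sketch of one affine coupling step may help make the two-way computation concrete. This is an illustration under our own assumptions (the sub-networks phi, rho, eta and the tanh clamping are placeholders, not the paper's exact architecture), and the bi-directional LSTM / predictive modules are omitted:

    import torch
    import torch.nn as nn

    class CouplingLayer(nn.Module):
        # One invertible coupling step between the low-frequency branch y
        # and the high-frequency branch z. The sub-networks phi/rho/eta
        # are illustrative placeholders, not the paper's exact design.
        def __init__(self, c_low, c_high, hidden=32):
            super().__init__()
            def subnet(c_in, c_out):
                return nn.Sequential(
                    nn.Conv2d(c_in, hidden, 3, padding=1),
                    nn.ReLU(inplace=True),
                    nn.Conv2d(hidden, c_out, 3, padding=1))
            self.phi = subnet(c_high, c_low)   # additive update of y from z
            self.rho = subnet(c_low, c_high)   # log-scale of z from y
            self.eta = subnet(c_low, c_high)   # shift of z from y

        def forward(self, y, z):
            # Downscaling direction.
            y = y + self.phi(z)
            s = torch.tanh(self.rho(y))        # bounded log-scale for stability
            z = z * torch.exp(s) + self.eta(y)
            return y, z

        def inverse(self, y, z):
            # Upscaling direction: exact algebraic inverse of forward().
            s = torch.tanh(self.rho(y))
            z = (z - self.eta(y)) * torch.exp(-s)
            y = y - self.phi(z)
            return y, z

Roughly speaking, downscaling runs forward() through a stack of such layers after splitting each frame into low- and high-frequency components, and upscaling runs inverse() with the predicted $\hat{z}_t$ substituted for the discarded $z_t$.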

Experimental Results (Vid4)


Video Results


HR Reconstruction on Vid4


Calendar

City

Walk

Foliage


HR Reconstruction on Vimeo-90K-T



LR Reconstruction on Vid4



We can still see unpleasant artifacts in some of the downscaled LR results, especially those produced by SISO downscaling methods such as IRN and LSTM-VRN. MIMO-VRN, however, reduces such artifacts considerably because it utilizes temporal information during downscaling. Another way to reduce these artifacts is to increase the lambda factor in the loss function, which pulls the LR video much closer to the bicubic-downscaled one at the cost of a slight decrease in HR reconstruction quality. If the LR quality matters to you, we suggest training the model with a higher lambda factor.
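For reference, here is a hypothetical sketch of how such a lambda-weighted objective could look; the exact loss terms and their weighting in the paper may differ, and lam here simply plays the role of the lambda factor mentioned above:

    import torch.nn.functional as F

    def joint_loss(hr_pred, hr_gt, lr_pred, lr_bicubic, lam=1.0):
        # HR reconstruction term: fidelity of the upscaled video.
        loss_hr = F.l1_loss(hr_pred, hr_gt)
        # LR guidance term: keeps the learned LR frames close to the
        # bicubic-downscaled reference; a larger lam trades a little
        # HR quality for cleaner-looking LR frames.
        loss_lr = F.mse_loss(lr_pred, lr_bicubic)
        return loss_hr + lam * loss_lr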