Beyond slow-motion: Enhancing live replays with AI-powered special effects
EVS AI White Paper
  • 13 September 2023
  • EVS.com

Artificial Intelligence (AI) is having a profound impact on numerous industries, and the world of live production is no exception.

As broadcasters strive to optimize workflows, reduce costs, and captivate audiences, AI-driven technologies have emerged as powerful tools to achieve these objectives and more. By automating time-consuming tasks and enabling new creative capabilities, AI-powered solutions bring immense value to live productions and have found multiple applications throughout different stages of the production chain.

"EVS has been at the forefront of embracing AI's potential, and one area that has particularly intrigued us is the enhancement of live replays. Recognizing the significance of replays in enriching storytelling to captivate audiences and create emotion, we set out to explore the integration of AI to elevate this aspect of live production."

Olivier Barnich
Head of Innovation & Architecture, EVS

The challenge of replay footage
Replay technology has evolved significantly over the years. The industry has progressed from SDI to IP, from standard definition to high definition, and ultimately to ultra-high definition. Yet, the way camera images are processed in order to generate slow motion has remained mostly unchanged.

The process involves playing back recorded images at a reduced speed while maintaining the native frame rate of the production through frame duplication. Images are played back without any alterations, as all the necessary image processing occurs prior to their recording.
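For illustration, the sketch below shows how playout-side slow motion reduces to simple frame repetition; the function name and the list-of-frames representation are purely illustrative.

```python
# Minimal sketch of traditional slow motion at playout: each recorded frame is
# simply repeated so that the output keeps the production's native frame rate.
# "frames" is assumed to be a list of decoded images; no image processing occurs.

def slow_motion_by_duplication(frames, slowdown_factor):
    """Return a frame sequence played back `slowdown_factor` times slower."""
    output = []
    for frame in frames:
        # Repeating each frame N times stretches playback duration by N
        # while the playout frame rate stays unchanged.
        output.extend([frame] * slowdown_factor)
    return output

# Example: a 50 fps recording replayed at one third speed still plays out at
# 50 fps, but every camera frame appears three times.
# slow = slow_motion_by_duplication(recorded_frames, slowdown_factor=3)
```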

Unfortunately, this approach means that during the event, the creative opportunities available to production teams are limited by compromises made when selecting cameras and lenses during the production planning phase, which ultimately affects the quality and impact of the replays. During this phase, the teams determine the camera plan, which dictates the live and slow-motion replay streams accessible to the director during the event. The camera plan includes decisions about the placement of Super Slow Motion (SSM) cameras for smooth slow-motion replays and the choice between wide-angle and narrower-angle lenses. It also defines which cameras should have a shallow depth of field to capture emotionally engaging sequences.

However, each decision regarding cameras and lens choice involves inherent compromises. For instance, using SSM cameras with shorter exposure times and wider apertures results in a shallower depth of field, making it challenging to maintain focus on fast-moving objects. Determining the ideal amount of motion blur depends on the content being captured and the intended playback speed. Choosing between a wide-angle view for comprehensive game understanding or a narrower angle to highlight specific actions becomes a critical decision point. Additionally, while a shallower depth of field creates emotionally compelling sequences, it may not be suitable for capturing intense gameplay moments.

These compromises made in advance during the planning phase leave little room for real-time adjustments. However, with the emergence of generative AI, this is set to change. 

Technology overview

Generative AI for real-time special effects in live replays

The advent of generative AI presents an opportunity to transform live replays and elevate storytelling. Today, there are solutions that enable real-time adjustments and introduce live special effects, ensuring that replays capture the full excitement and emotional impact of the game, without compromising visual quality or creative choices.

This technology introduces a new layer of freedom to productions, offering access to novel creative possibilities during live events. Powered by deep neural networks, modern image processing techniques can virtually extend the imaging system's capabilities in terms of frame rate, sharpness, resolution, and depth of field during playout.

#1
Simulating higher frame rate cameras using temporal frame interpolation

The usual way to delight viewers with smooth slow-motion replays is to use high frame rate cameras, also known as Super Slow Motion (SSM) cameras. These cameras capture a higher number of frames per second, enabling a replay server to create slow-motion video without resorting to frame duplication. However, practical and budget constraints make it unrealistic to have high frame rate cameras at every desired camera position. American football offers a concrete example: installing four high frame rate cameras in each pylon is a complex undertaking, especially given the risk of an unintentional collision.

AI makes it possible to deliver smooth slow-motion replays even from standard frame-rate cameras. The key is temporal frame interpolation, a technique that transforms a standard video stream into a high frame-rate one. The computational process involves generating intermediate frames between two frames captured by the camera. Often referred to as "hallucinated frames," these can be incorporated into the input camera stream, as illustrated in Figure 1, to produce a higher frame-rate stream that is ideal for smoother slow-motion replays.

Figure 1 - Multiplying the frame rate of a soccer video by a factor of three using temporal frame interpolation. The technique involves inserting two hallucinated images between each pair of images of the video stream provided by the camera.
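As a minimal illustration of the mechanism in Figure 1, the sketch below assumes a pretrained interpolation network is available as a function `interpolate(a, b, t)` that returns the hallucinated frame at normalized time t between two camera frames. It is a simplified sketch under that assumption, not EVS's production implementation.

```python
def triple_frame_rate(frames, interpolate):
    """Insert two hallucinated frames between each pair of camera frames,
    as illustrated in Figure 1 (3x frame-rate multiplication).

    `frames` is a list of camera images; `interpolate(a, b, t)` is assumed to be
    a pretrained frame-interpolation network returning the hallucinated frame at
    normalized time t in (0, 1) between frames a and b.
    """
    out = []
    for a, b in zip(frames[:-1], frames[1:]):
        out.append(a)
        out.append(interpolate(a, b, 1.0 / 3.0))  # first hallucinated frame
        out.append(interpolate(a, b, 2.0 / 3.0))  # second hallucinated frame
    out.append(frames[-1])
    return out
```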

Deep learning methods, whether optical flow-based or end-to-end, have proven highly successful at frame interpolation, as they can be trained in a self-supervised manner, requiring only high frame rate data to train a deep neural network.

Optical flow-based methods [1, 2] offer a significant advantage in their ability to generate images at an arbitrary timestamp between the two original frames, which in turn makes it possible to generate arbitrary frame rates. However, they rely heavily on the quality of the estimated optical flow, which remains an active research area [3, 4].
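To make the flow-based idea concrete, the PyTorch sketch below backward-warps the two original frames toward an arbitrary intermediate time t using a simple linear-motion approximation of the flows, then blends the results. The flow estimator itself is assumed to be available as a black box (for example, a pretrained network in the spirit of [3, 4]); real interpolation methods add occlusion handling and refinement steps that are omitted here.

```python
import torch
import torch.nn.functional as F

def backward_warp(image, flow):
    """Backward-warp `image` (N,C,H,W) using `flow` (N,2,H,W), where flow[:,0]
    is the horizontal and flow[:,1] the vertical displacement in pixels."""
    n, _, h, w = image.shape
    # Base sampling grid of pixel coordinates.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    base = torch.stack((xs, ys), dim=0).float().to(image.device)   # (2,H,W)
    coords = base.unsqueeze(0) + flow                               # (N,2,H,W)
    # Normalize coordinates to [-1, 1] as expected by grid_sample.
    coords_x = 2.0 * coords[:, 0] / (w - 1) - 1.0
    coords_y = 2.0 * coords[:, 1] / (h - 1) - 1.0
    grid = torch.stack((coords_x, coords_y), dim=-1)                # (N,H,W,2)
    return F.grid_sample(image, grid, align_corners=True)

def interpolate_at(frame0, frame1, flow_01, flow_10, t):
    """Hallucinate the frame at normalized time t in (0,1) between frame0 and
    frame1, assuming roughly linear motion between the two camera frames."""
    flow_t0 = -t * flow_01            # approximate flow from time t back to frame0
    flow_t1 = -(1.0 - t) * flow_10    # approximate flow from time t back to frame1
    warped0 = backward_warp(frame0, flow_t0)
    warped1 = backward_warp(frame1, flow_t1)
    # Blend, weighting the temporally closer frame more heavily.
    return (1.0 - t) * warped0 + t * warped1
```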

On the other hand, end-to-end methods [5, 6, 7, 8] aim to solve the problem globally, without relying on a separate, pretrained motion estimator: motion is estimated, either explicitly or implicitly, within the network itself as it reconstructs the frames.

While these methods are not able to generate intermediate frames at arbitrary timestamps, they depend neither on the quality of a pretrained optical flow estimator nor on hand-crafted priors to generate the warped frames: the interpolation function can be learned entirely from data. In the context of live replays, learning the interpolation function directly from data is advantageous, as the complete function benefits from training on diverse video sequences, which helps the method deliver satisfactory results across a wide range of situations.

#2

Increasing image sharpness with deblurring

While temporal frame interpolation allows for smooth replays from any camera, there remains a noticeable disparity in image sharpness compared to the results generated from SSM cameras. This discrepancy can be attributed to the shorter exposure time of SSM cameras, effectively minimizing motion blur in the captured images.

One possible solution is to lower the exposure time on all cameras, but this presents a challenge: motion blur is desirable in some cases, as it enhances the perceived fluidity of the video footage by mimicking the persistence of vision of the human eye and brain. Yet in other situations, motion blur is detrimental. Choosing an exposure time that is optimal in every scenario is an impossible task.
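As a back-of-the-envelope illustration of this trade-off, the streak length left by a moving object scales with the fraction of the frame interval during which the shutter is open. The numbers below are assumed for illustration only.

```python
def motion_blur_extent_px(speed_px_per_frame, exposure_s, frame_rate_hz):
    """Approximate streak length (in pixels) left by an object during exposure.

    An object moving `speed_px_per_frame` pixels between consecutive frames is
    exposed for `exposure_s` seconds out of a 1/frame_rate_hz frame interval,
    so it smears across that same fraction of its per-frame displacement.
    """
    return speed_px_per_frame * exposure_s * frame_rate_hz

# Illustrative (assumed) numbers: a ball crossing 40 px per frame at 50 fps
# smears ~20 px with a 1/100 s shutter, but only ~4 px at 1/500 s, which looks
# sharper frame-by-frame yet can appear choppy at normal playback speed.
print(motion_blur_extent_px(40, 1 / 100, 50))  # 20.0
print(motion_blur_extent_px(40, 1 / 500, 50))  # 4.0
```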

Fortunately, with the aid of generative AI powered by deep learning, it has become possible to eliminate unwanted motion blur from video content through a process known as “deblurring”. Interestingly, this task can be self-supervised:  a training dataset can be built by averaging adjacent frames of high frame rate videos [9]. 
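A minimal sketch of this dataset construction is shown below, assuming the high frame rate clip is available as a NumPy array; published recipes such as [9] typically include additional refinements that are omitted here.

```python
import numpy as np

def make_blur_sharp_pairs(hfr_frames, window=7):
    """Build (blurry, sharp) training pairs from a high frame rate clip by
    averaging adjacent frames to synthesize motion blur.

    `hfr_frames` is assumed to be an array of shape (T, H, W, C) in float32.
    The central frame of each window serves as the sharp ground-truth target.
    """
    pairs = []
    half = window // 2
    for center in range(half, len(hfr_frames) - half):
        clip = hfr_frames[center - half : center + half + 1]
        blurry = clip.mean(axis=0)    # temporal average simulates a long exposure
        sharp = hfr_frames[center]    # the central frame is the sharp target
        pairs.append((blurry, sharp))
    return pairs
```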

Two primary approaches exist: image-based and video-based methods, the former being much faster while the latter leverages temporal information. Video deblurring methods include convolutional [10], recurrent [11], and transformer-based [12] deep neural networks. Convolutional methods typically cannot model long-range dependencies, while recurrent methods are harder to parallelize and still struggle with them. Transformers aim for the best of both worlds, although they require substantial computational resources.

Figure 2 showcases an example of a result obtained using a recent video-based method [12], demonstrating its promising performance for deblurring. 

Figure 2 - Increasing image sharpness by deblurring images with a deep neural network. On the left-hand side: images provided by the camera. On the right-hand side: results of the application of the deblurring algorithm.

#3
Creating a shallow depth of field using virtual lenses

The utilization of a shallow depth of field in cinematography, also referred to as the “bokeh” effect, has the power to establish a deeper connection between the viewer and the subject, offering significant added value, especially in capturing emotionally charged moments. 

Achieving this effect traditionally requires the use of specialized lenses with wide apertures. However, modern computer vision techniques now enable the replication of this effect through the application of video filters based on deep neural networks. There are three types of methods that can provide this special effect: bottom-up, end-to-end, and hybrid methods.

Bottom-up methods [13] aim to simulate the physics of the lens by employing a neural network to estimate the depth of each pixel in the image. While these methods can struggle at object boundaries, they usually demonstrate more striking bokeh effects compared to end-to-end methods [14] which directly attempt to imitate the effect from real or synthetic bokeh. Finally, hybrid methods [15] aim to combine the strengths of both approaches, with encouraging results.

Figure 3 - Bottom-up approach to simulate a shallow depth of field lens. From the input image on the left-hand side, a first deep neural network is used to estimate the depth (distance from camera) of each pixel location. A second deep neural network is used to segment the area to put in focus. The two pieces of information are combined by a bokeh kernel that simulates the physics of a wide aperture lens to produce an image with a shallow depth of field.

A typical bottom-up approach follows the pipeline depicted in Figure 3 (a simplified code sketch follows the list below):

  • First, the depth of each pixel is estimated using a deep learning method [16, 17].
  • Then, the point of focus in the image is determined using heuristics based on object detection and tracking.
  • Finally, the depth information and the point of focus are used to simulate the physics of a wide aperture lens to generate images with a shallow depth of field, delighting the viewer with an artistic out-of-focus blur.
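In the sketch below, the depth map and the focus mask are assumed to come from separately trained networks (hypothetical models standing in for the depth estimators of [16, 17] and a segmentation network), and the bokeh kernel is approximated by a single depth-weighted Gaussian blur rather than a physically accurate aperture model.

```python
import cv2
import numpy as np

def simulated_bokeh(image, depth, focus_mask, max_sigma=9.0):
    """Crude sketch of the bottom-up pipeline of Figure 3.

    `image` is an (H, W, 3) frame, `depth` a per-pixel depth map from a
    monocular depth network, and `focus_mask` a binary mask of the subject to
    keep in focus, from a segmentation network. The "bokeh kernel" is
    approximated by blending a heavily blurred copy of the image according to
    the depth difference from the subject.
    """
    # Depth of the in-focus subject: median depth under the segmentation mask.
    focus_depth = np.median(depth[focus_mask > 0])

    # Per-pixel blur strength grows with distance (in depth) from the subject.
    blur_amount = np.abs(depth - focus_depth)
    blur_amount = blur_amount / (blur_amount.max() + 1e-6)  # normalize to [0, 1]
    blur_amount[focus_mask > 0] = 0.0                       # keep the subject sharp

    # Single-level approximation of a depth-dependent aperture kernel.
    blurred = cv2.GaussianBlur(image, (0, 0), sigmaX=max_sigma)
    weight = blur_amount[..., None].astype(np.float32)
    return (1.0 - weight) * image.astype(np.float32) + weight * blurred.astype(np.float32)
```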

Figure 4 shows two examples of simulation of a wide aperture lens on soccer images.

Figure 4 - Simulation of a wide aperture lens to produce images with a shallow depth of field on soccer content. On the left-hand side: original images. On the right-hand side: results.

#4
Intelligent digital zoom with super resolution algorithms

As mentioned previously, before production begins, a camera plan is established to determine the positioning of cameras and camera operators for capturing the event. The director then assigns specific objectives for each camera operator, such as focusing on particular players or capturing specific types of shots. During the event, the director maintains frequent communication with camera operators and updates their instructions to ensure that no crucial moments are missed. Nevertheless, when an unexpected event occurs, the team’s reaction may not always be quick enough to properly frame the area of interest.

Fortunately, recent advances in computer vision and image processing can be advantageously combined to accurately frame any event, with the sole requirement that the event is visible within a wide angle shot. 

By combining saliency detection [18], object detection [19] and tracking [20], it is indeed possible to semi-automatically define a virtual camera trajectory that selects an area of interest in the wide-angle footage. The area of interest is then extracted from the wide-angle view and brought back to the native resolution of the production using a super resolution algorithm.
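A minimal sketch of this virtual close-up step is shown below, assuming the detection and tracking stage has already produced a bounding box for the area of interest and that a pretrained super resolution network is available; when no such network is supplied, the sketch falls back to plain bicubic upsampling so it remains runnable.

```python
import torch
import torch.nn.functional as F

def virtual_close_up(wide_frame, box, out_size=(1080, 1920), sr_model=None):
    """Crop the tracked area of interest from the wide-angle frame and bring it
    back to the native production resolution.

    `wide_frame` is a (1, 3, H, W) tensor; `box` = (x0, y0, x1, y1) is assumed to
    come from the saliency/object detection and tracking stage; `sr_model` is a
    pretrained super resolution network (hypothetical here).
    """
    x0, y0, x1, y1 = box
    crop = wide_frame[:, :, y0:y1, x0:x1]       # low-resolution area of interest
    if sr_model is not None:
        crop = sr_model(crop)                   # learned upscaling
    # Resize (or finish resizing) to the native production resolution.
    return F.interpolate(crop, size=out_size, mode="bicubic", align_corners=False)
```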

Super resolution, the process of generating high-resolution images from low-resolution inputs, has long posed a challenge in the field of computer vision. However, recent advancements in deep learning techniques have revolutionized the state-of-the-art in video super resolution, offering a wide range of solutions to enhance image quality in broadcast productions.

Approaches such as convolutional neural networks (CNNs) [21], transformer-based methods, generative adversarial networks (GANs) [22], and diffusion-based models [23] are commonly employed to learn the mapping from low-resolution to high-resolution images.

Although super resolution is an ill-posed problem, as multiple high-resolution images could correspond to a single low-resolution input, these approaches have demonstrated considerable success and produce remarkable results.
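For readers curious about what a CNN-based super resolution model looks like in practice, the toy network below uses the sub-pixel convolution (PixelShuffle) layout common in this family of methods; its layer sizes are illustrative assumptions, not those of any published or EVS model.

```python
import torch
import torch.nn as nn

class SmallSRNet(nn.Module):
    """Toy CNN super resolution model: convolutional feature extraction
    followed by a PixelShuffle layer that rearranges channels into a
    higher-resolution image. Layer sizes are illustrative only."""

    def __init__(self, scale=2, channels=3, features=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, features, kernel_size=5, padding=2),
            nn.ReLU(inplace=True),
            nn.Conv2d(features, features, kernel_size=3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(features, channels * scale ** 2, kernel_size=3, padding=1),
        )
        self.upsample = nn.PixelShuffle(scale)

    def forward(self, x):
        return self.upsample(self.body(x))

# Usage: a (1, 3, 270, 480) crop becomes a (1, 3, 540, 960) image.
# sr = SmallSRNet(scale=2); hi_res = sr(low_res_crop)
```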

By making it possible to create high-quality close-up video streams from wide-angle views, the combination of super resolution with saliency detection, object detection, and tracking presents an exciting opportunity to empower operators to tell even more engaging stories with minimal additional effort or equipment costs.

Figure 5 illustrates this application, exemplifying its potential impact.

Figure 5 - Leveraging object (person) detection and tracking to cut out an area of interest out of a wide-angle view (left-hand side) and bringing it back to the native resolution of the production using a super resolution algorithm (right-hand side).

Conclusion

The integration of AI in replays represents a significant leap forward for the broadcast industry. By harnessing the capabilities of AI, broadcasters can overcome technical limitations of traditional broadcasting and enhance the quality of their productions for viewers.

As demonstrated in this paper, creative opportunities in live productions are no longer constrained by decisions made prior to the event regarding camera types and configurations. Parameters such as video frame rate, exposure time, aperture, and focal length can now be adjusted in real time during the event. Having these effects as always-available live video streams will offer operators a newfound flexibility that supports their artistic vision.

We believe that the next step will involve granting even greater freedom over camera positions and orientations. Recent advances in 3D reconstruction algorithms [24] make it conceivable to move a virtual camera to any position, enabling smooth camera transitions or never-before-seen viewpoints, such as a player's first-person perspective.

As technology continues to progress, the potential of AI-powered replays is boundless, promising a future where the art of storytelling in live productions reaches unprecedented heights.

EVS Innovation Lab
Pioneering live broadcast solutions

At the heart of EVS' dedication to delivering cutting-edge solutions for live broadcast applications lies its Innovation Lab, a team of dedicated engineers and technology experts. For nearly a decade, this Innovation Lab has served as the breeding ground for the inception, research, and testing of innovative technology concepts, with a strong focus on artificial intelligence and machine learning, as well as areas such as IP, software-defined servers, cloud, and SaaS technologies.