Scaling WebRTC with Mediasoup V3

Scalability is a decisive factor when it comes to video conferencing or broadcasting. It determines whether audio and video quality stays high for connected clients as the number of video call participants rises.

As more businesses attempt to introduce real-time voice and video communication in apps and browsers with a high number of clients, they quickly encounter issues with server-side processing and bandwidth.

That makes such solutions challenging to maintain and scale, and prohibitively expensive.

Besides being the technology behind many of the leading click-to-call applications, including Facebook Messenger and Google Hangouts, WebRTC-based communication offers advantages that address these issues of scalability, cost and infrastructure.

WebRTC and Scalability in a nutshell

When building a WebRTC-based video conferencing application, you need to know how many people it is going to host in a conference: one-to-one, one-to-many, or many-to-many.

Most WebRTC applications support much more than the original one-to-one peer-to-peer use case, and rely on at least one media server to do so.

Most WebRTC software, open source or otherwise, and many video-conferencing products (e.g., FaceTime, Skype) started out peer-to-peer, because P2P appears simple to use and low-cost.

Here you only need a TURN/STUN server and a signaling server. Libraries exist for all the most popular frameworks and runtimes, so you can use the backend you are most comfortable with.

But what happens when the number of participants increases?

Let’s walk through what happens when scalability becomes an issue.

In a mesh configuration with n participants, every user uploads their video to the (n − 1) other participants and, in a many-to-many call, downloads (n − 1) streams in return. For a five-person conference where each video stream takes 500 kbit/s, every user must upload 2 Mbit/s and download the same.

That may not sound like much, but many internet providers throttle upload speed, and the cost grows quickly as the call size increases.
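The arithmetic above can be sketched as follows (the function name is illustrative):

```javascript
// Back-of-the-envelope mesh (P2P) bandwidth math: in an n-party full-mesh
// call, each participant uploads its stream to (n - 1) peers and downloads
// (n - 1) streams in return.
function meshBandwidthKbps(participants, streamKbps) {
  const perDirection = (participants - 1) * streamKbps;
  return { uploadKbps: perDirection, downloadKbps: perDirection };
}

// Five-person conference at 500 kbit/s per video stream:
console.log(meshBandwidthKbps(5, 500)); // { uploadKbps: 2000, downloadKbps: 2000 }
```

Note how the cost grows linearly with the number of participants, on every participant's uplink.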

More clearly:

  • They overload the processing capacity (CPU) of the client endpoint by encoding and decoding all streams simultaneously.
  • Each participant saturates the available bandwidth by sending audio and video streams to every other participant.

As a result, the quality of the video-conference is degraded (sound cut-offs, frozen video, etc.).

This is where media servers come into the picture

For such cases a media server is required. A WebRTC media server is a multimedia middleware through which media traffic passes when moving from source to destination. Media servers process incoming media streams and offer different capabilities, such as group communication (acting as an SFU or MCU). Let’s talk about the SFU (Selective Forwarding Unit).

To use a SFU topology, there are many open source servers in this area:

  • mediasoup (SFU/Node/C++)
  • Jitsi (SFU/Java) (comes with a client)
  • Janus Gateway (SFU/C)

With an SFU like mediasoup, clients select which streams to send, and then receive one or more streams from the other participants. Each client can choose to receive one or more low-bitrate streams and one high-bitrate (highest quality) stream.

Thus, an SFU responds to clients’ demands and resolves P2P’s scalability issues as well.
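That layer selection can be sketched as follows (illustrative logic only, not mediasoup’s actual API):

```javascript
// Illustrative sketch: an SFU-style server picking the highest simulcast
// layer that fits a viewer's estimated downlink bandwidth.
function pickLayer(layerBitratesKbps, estimatedKbps) {
  // layerBitratesKbps is sorted ascending, e.g. [150, 500, 1500].
  let chosen = 0; // fall back to the lowest layer if nothing fits
  for (let i = 0; i < layerBitratesKbps.length; i++) {
    if (layerBitratesKbps[i] <= estimatedKbps) chosen = i;
  }
  return chosen;
}

console.log(pickLayer([150, 500, 1500], 600)); // 1 -> the 500 kbit/s layer
```

Because the SFU only forwards (never re-encodes), switching a viewer to a different layer is cheap for the server.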

A media server, however, brings its own challenges, scalability among them: with most of these projects you have to make your media server horizontally scalable yourself, and that is not an easy task.

How Mediasoup V3 addresses scalability

Mediasoup is a Node.js library that exposes a JavaScript API to manage workers, routers, transports, producers and consumers. Its companion library, mediasoup-client, provides a unified API across browsers, which helps in building multi-party video conferencing and real-time streaming apps.
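As a sketch of that API surface (the require sits inside the function so the sketch loads even where the `mediasoup` v3 package is not installed; codec and transport settings are illustrative):

```javascript
// Sketch of the core mediasoup v3 object chain: worker -> router -> transport.
async function setup() {
  const mediasoup = require('mediasoup'); // assumes mediasoup v3 is installed

  const worker = await mediasoup.createWorker();
  const router = await worker.createRouter({
    mediaCodecs: [
      { kind: 'audio', mimeType: 'audio/opus', clockRate: 48000, channels: 2 },
      { kind: 'video', mimeType: 'video/VP8', clockRate: 90000 },
    ],
  });
  const transport = await router.createWebRtcTransport({
    listenIps: [{ ip: '0.0.0.0', announcedIp: null }],
  });

  // Producers and consumers are then created on transports:
  //   const producer = await transport.produce({ kind, rtpParameters });
  //   const consumer = await transport.consume({ producerId, rtpCapabilities });
  return { worker, router, transport };
}
```

On the browser side, mediasoup-client mirrors the transport/producer/consumer concepts, which is what keeps the API unified across endpoints.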

Let’s look into how Mediasoup V3 makes many to many broadcasting scenarios possible.

Horizontal Scaling:

Mediasoup runs in a single Node.js process and launches N C++ subprocesses (typically one per CPU core): the mediasoup workers, which handle the media layer. Mediasoup v3 makes broadcasting scenarios possible by allowing a room (a router) to use several of these media workers.

A room in mediasoup can now even extend across N separate hosts running different instances of mediasoup.

In addition, the new pipe transports in v3 allow two mediasoup routers, running on the same host or on different hosts, to be interconnected at the media level. This increases broadcasting capacity by enabling the use of multiple CPU cores, even across machines.
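Within a single Node.js process, `router.pipeToRouter()` wraps this interconnection (across hosts, the pipe transports have to be created and connected manually). A sketch, assuming `routerA`, `routerB` and `producer` are existing mediasoup v3 objects:

```javascript
// Sketch: pipe a producer from routerA (one worker/core) into routerB
// (another worker/core), so consumers can be created on routerB for a
// producer that actually lives on routerA.
async function pipeProducerAcrossRouters(routerA, routerB, producer) {
  // pipeToRouter() creates (or reuses) a PipeTransport pair between the two
  // routers and pipes the given producer through it.
  const { pipeConsumer, pipeProducer } = await routerA.pipeToRouter({
    producerId: producer.id,
    router: routerB,
  });
  return { pipeConsumer, pipeProducer };
}
```

Each additional router reached this way brings another CPU core's worth of consumers to the same broadcast.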

However, each room in mediasoup initially uses a single worker, which means a single CPU core and a single thread. That supports some hundreds of participants well, but it does not scale for broadcasting scenarios (a few media producers and a great many consumers).

Such SFUs are not only less CPU-intensive on the server, but also allow for advanced bandwidth adaptation with Scalable Video Coding (SVC) and simulcast (multiple encodings). The latter provides even better resilience against network quality problems such as packet loss.

Server-side BWE:

The bandwidth estimation (BWE) module helps decide how much video traffic you can send without congesting the network and degrading video quality.

Mediasoup v3 implements sender-side bandwidth estimation: it evaluates the available bandwidth and then applies automated congestion-control algorithms to modulate the bitrate without changing the spatial or temporal resolution.

This can reduce bandwidth usage without impacting the subjective quality, and the adaptation happens at the speed of the network, within a couple of milliseconds.
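Mediasoup’s actual congestion-control algorithms are internal; as an illustration only, this kind of bitrate modulation typically combines a multiplicative back-off with additive probing:

```javascript
// Illustrative congestion-control step (not mediasoup internals): nudge the
// target bitrate toward the estimated available bandwidth, AIMD-style.
function nextBitrateKbps(currentKbps, estimatedKbps) {
  if (currentKbps > estimatedKbps) {
    // Congestion detected: back off multiplicatively below the estimate.
    return Math.max(100, Math.floor(estimatedKbps * 0.85));
  }
  // Headroom available: probe upward additively, capped by the estimate.
  return Math.min(estimatedKbps, currentKbps + 50);
}

console.log(nextBitrateKbps(1000, 800)); // 680 -> back off under the estimate
console.log(nextBitrateKbps(500, 800));  // 550 -> probe gently upward
```

Running a step like this every estimation interval is what lets the sender track the network within milliseconds.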

When broadcasting a video stream to many viewers (hundreds or thousands of consumers), it’s also important to be aware of how video RTP transmission typically works:

A viewer may connect or reconnect, change its preferred spatial layer, or simply lose too many packets. Any of those circumstances triggers a video key frame request, via an RTCP FIR or PLI, that reaches the broadcaster endpoint.
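For instance (illustrative only, not mediasoup code), an SFU that forwards PLI/FIR upstream usually rate-limits these requests so a surge of joining viewers cannot stampede the broadcaster’s encoder:

```javascript
// Illustrative guard: forward at most one key frame request (PLI/FIR)
// per minIntervalMs, dropping the rest.
function makeKeyFrameGate(minIntervalMs) {
  let lastForwardedAt = -Infinity;
  return function requestKeyFrame(nowMs) {
    if (nowMs - lastForwardedAt >= minIntervalMs) {
      lastForwardedAt = nowMs;
      return true;  // forward the PLI/FIR to the producer
    }
    return false;   // drop it; a key frame was requested recently
  };
}

const gate = makeKeyFrameGate(1000);
console.log(gate(0));    // true  -> first request goes through
console.log(gate(500));  // false -> too soon, dropped
console.log(gate(1500)); // true  -> interval elapsed, forwarded
```

With thousands of consumers, even this is not enough by itself, which is why a server-side re-encoder is introduced below.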

A server-side endpoint, a ‘re-encoder’, is required there. It consumes the streams of the broadcaster endpoint, re-encodes them, and re-produces them into a set of mediasoup routers serving hundreds or thousands of consumers in total. Since such a ‘re-encoder’ typically runs in the backend network, it is not limited by available bandwidth.

Mediasoup comes with libmediasoupclient which can be used as a re-encoder.

The Takeaway

If you plan to have multiple participants in your WebRTC video conferencing or broadcasting solution, you will probably end up using a Selective Forwarding Unit (SFU) or a Multipoint Control Unit (MCU).

Capacity planning for SFUs and making your media server horizontally scalable by hand can be difficult: you have to estimate where the servers should be placed, which WebRTC library to use, what kind of signaling and media servers you need, and how much bandwidth they will consume.