In this post, I will explain the different components that make up a successful implementation of real-time communication in a WebRTC-enabled system.
What is WebRTC?
WebRTC is a suite of components built on a couple of innovations from companies that Google bought back in 2010. WebRTC enables a developer to set up real-time media and data channels between two browsers (or mobile devices, if you compile it for them). It contains a couple of key components, all of which ship in Chrome, Firefox, and Opera, and a version of it exists in Microsoft’s new browser, Edge (ORTC).
Setting up Data Streams and Hardware
WebRTC helps you set up both the physical environment (such as cameras, speakers, and microphones) and other hardware (such as echo cancellation and background noise cancellation hardware, which you will find mostly in mobiles). Together with STUN, it also helps figure out the network.
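As a rough illustration of the device setup step, here is a minimal TypeScript sketch that builds the constraints object a browser expects when you ask for a microphone and camera. The `echoCancellation`, `noiseSuppression`, and `autoGainControl` flags are standard media track constraints; the function name, its parameters, and the local interface names are made up for this example. Whether the processing ends up running in hardware or software is decided by the browser and the device, not by this code.

```typescript
// Audio processing flags. These three are standard media track constraints;
// the interface names are local to this sketch.
interface AudioCaptureOptions {
  echoCancellation: boolean;
  noiseSuppression: boolean;
  autoGainControl: boolean;
}

interface CaptureConstraints {
  audio: AudioCaptureOptions;
  video: boolean;
}

// Build the constraints for a call. `wantVideo` and `useProcessing` are
// hypothetical parameters chosen for this example.
function buildCaptureConstraints(wantVideo: boolean, useProcessing: boolean): CaptureConstraints {
  return {
    audio: {
      echoCancellation: useProcessing,
      noiseSuppression: useProcessing,
      autoGainControl: useProcessing,
    },
    video: wantVideo,
  };
}

// Audio-only call with processing switched on.
const voiceCall = buildCaptureConstraints(false, true);

// In a browser you would then request the devices with:
//   navigator.mediaDevices.getUserMedia(voiceCall).then((stream) => { /* attach stream */ });
```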
Audio Codecs and Video Codecs
One of the main benefits of using WebRTC over other software when dealing with real-time audio and video is the open source, royalty-free codecs that Google is kind enough to ship.
- G.711, used in regular phone networks
- iLBC, an older narrowband codec, also used in phone networks
- Opus, a high-quality, variable-bitrate codec (with support for adaptive encoding) and the newest audio codec used in WebRTC
More audio codecs ship with WebRTC, but these are the main and most widely used ones. For video, you get:
- VP8 and soon VP9, Google’s royalty-free alternatives to H.264 and H.265
- H.264 (added in 2015 as an agreement for ORTC)
Audio codecs do a lot of the work for you, taking care of packet loss, encoding and decoding of audio, error correction, noise cancellation, echo cancellation, volume leveling, and more. The fact that WebRTC ships with these codecs also makes it hugely popular on mobile devices and desktops.
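To see which of these codecs a given session actually offers, you can read the `a=rtpmap:` lines of the SDP. The sketch below is a small, self-contained parser; the function name and the sample SDP are invented for illustration (the sample is hand-written, not captured from a real call). Note that G.711 shows up in SDP under the names PCMU and PCMA.

```typescript
interface CodecEntry {
  payloadType: number;
  codec: string;
  clockRate: number;
}

// Extract codec entries from the "a=rtpmap:" lines of an SDP blob.
function listCodecs(sdp: string): CodecEntry[] {
  const result: CodecEntry[] = [];
  const lines = sdp.split(/\r?\n/);
  for (const line of lines) {
    // Layout: a=rtpmap:<payload type> <encoding name>/<clock rate>[/<channels>]
    const match = /^a=rtpmap:(\d+) ([^\/\s]+)\/(\d+)/.exec(line);
    if (match) {
      result.push({
        payloadType: parseInt(match[1], 10),
        codec: match[2],
        clockRate: parseInt(match[3], 10),
      });
    }
  }
  return result;
}

// Hand-written sample for this example only.
const sampleSdp = [
  "m=audio 9 UDP/TLS/RTP/SAVPF 111 0 8",
  "a=rtpmap:111 opus/48000/2",
  "a=rtpmap:0 PCMU/8000",
  "a=rtpmap:8 PCMA/8000",
  "m=video 9 UDP/TLS/RTP/SAVPF 96",
  "a=rtpmap:96 VP8/90000",
].join("\r\n");

// Codec names in the order they appear: opus, PCMU, PCMA, VP8
const codecNames = listCodecs(sampleSdp).map((entry) => entry.codec);
```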
Directory of Who is Available to Call (or Peer Discovery)
In order to call someone, you need to know their address, and unlike regular phone numbers, addresses on the Internet are mostly dynamic IP addresses. To solve this, you have to keep a record of where everyone is. This can be done in a number of ways using XMPP, SIP, custom protocols, etc., but it all boils down to this: anyone ready to receive a call checks in with a server one way or another and lets the server know how to contact that peer (for later delivery of the Offer/Invite/SDP, etc.).
Think of it as a totally dynamic white pages. The check-in is usually repeated at timed intervals, which keeps firewalls or similar open so that the signaling server can notify the client if someone wants to communicate with them. So, this is the first piece you need to build on top of WebRTC.
Next you probably want to keep track of all devices for a particular user and notify them on all devices if there is a call. Using Sinch, we take care of this part for you.
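The check-in idea can be sketched as a small in-memory directory. This is only a toy model under stated assumptions: the class name, method names, the 60-second expiry, and the addresses are all invented for the example, and a real signaling server would sit on top of XMPP, SIP, or a custom protocol and persist this state somewhere more durable.

```typescript
interface DeviceRecord {
  deviceId: string;
  address: string;    // how the signaling server can reach the device
  lastSeenMs: number; // time of the most recent check-in
}

type DeviceMap = { [deviceId: string]: DeviceRecord };

class PresenceDirectory {
  // userId -> deviceId -> record. Object.create(null) gives a dictionary
  // with no inherited keys.
  private users: { [userId: string]: DeviceMap } = Object.create(null);
  private ttlMs: number;

  constructor(ttlMs: number) {
    this.ttlMs = ttlMs;
  }

  // Called on every check-in; repeated calls refresh lastSeenMs.
  checkIn(userId: string, deviceId: string, address: string, nowMs: number): void {
    let devices = this.users[userId];
    if (!devices) {
      devices = Object.create(null) as DeviceMap;
      this.users[userId] = devices;
    }
    devices[deviceId] = { deviceId: deviceId, address: address, lastSeenMs: nowMs };
  }

  // All devices of a user whose last check-in is still within the expiry window.
  reachableDevices(userId: string, nowMs: number): DeviceRecord[] {
    const devices = this.users[userId];
    if (!devices) {
      return [];
    }
    const live: DeviceRecord[] = [];
    for (const id of Object.keys(devices)) {
      if (nowMs - devices[id].lastSeenMs <= this.ttlMs) {
        live.push(devices[id]);
      }
    }
    return live;
  }
}

const presence = new PresenceDirectory(60000); // 60 s expiry, an arbitrary choice
presence.checkIn("alice", "phone", "203.0.113.10:40000", 0);
presence.checkIn("alice", "laptop", "198.51.100.7:51000", 50000);

// At t = 70 s the phone's check-in has expired, the laptop's has not,
// so only the laptop would be notified of an incoming call.
const ringTargets = presence.reachableDevices("alice", 70000).map((d) => d.deviceId);
```

Timestamps are passed in explicitly rather than read from the clock so the expiry behaviour is easy to follow and to test.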
After your signaling server has located a device and sent an offer, you need a STUN server. The STUN server helps determine your external IP address, as well as whether the two (or more) devices can talk to each other directly. Sinch will take care of this for you too.
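What STUN gives you becomes visible in the ICE candidates: the address the STUN server saw is reported as a server-reflexive (`srflx`) candidate. The following sketch parses candidate lines and picks out those external addresses. The function names and sample lines are invented for this example, and the parser only handles the common layout shown in the comment.

```typescript
interface ParsedCandidate {
  transport: string;
  address: string;
  port: number;
  type: string; // "host", "srflx", "prflx" or "relay"
}

// Parse one ICE candidate line. Returns null if the line does not follow the layout
//   candidate:<foundation> <component> <transport> <priority> <address> <port> typ <type> ...
function parseIceCandidate(line: string): ParsedCandidate | null {
  const parts = line.replace(/^a=/, "").trim().split(/\s+/);
  if (parts.length < 8 || parts[0].indexOf("candidate:") !== 0 || parts[6] !== "typ") {
    return null;
  }
  return {
    transport: parts[2].toLowerCase(),
    address: parts[4],
    port: parseInt(parts[5], 10),
    type: parts[7],
  };
}

// Server-reflexive candidates carry the address the STUN server observed.
function externalAddresses(lines: string[]): string[] {
  const found: string[] = [];
  for (const line of lines) {
    const parsed = parseIceCandidate(line);
    if (parsed !== null && parsed.type === "srflx" && found.indexOf(parsed.address) === -1) {
      found.push(parsed.address);
    }
  }
  return found;
}

// Hand-written sample lines using documentation-only IP ranges.
const sampleCandidates = [
  "candidate:1 1 udp 2122260223 192.168.1.20 54321 typ host",
  "candidate:2 1 udp 1686052607 203.0.113.45 61000 typ srflx raddr 192.168.1.20 rport 54321",
];

// Contains the single external address 203.0.113.45
const publicAddresses = externalAddresses(sampleCandidates);
```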
Media Relay Server (TURN Server)
If a peer-to-peer session is not possible (our own data suggests this accounts for around 25% of sessions), you will need a TURN server. The TURN server will basically relay the bits for you through open holes in the firewall between the two clients. Why does this happen? The most common cause is asymmetric firewalls, together with holes being punched on different ports in the firewalls.
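In practice, you hand both the STUN and the TURN server to the peer connection up front and let ICE fall back to the relay when a direct path fails. Here is a minimal sketch of that configuration. The `urls`, `username`, and `credential` fields match what browsers accept in the `iceServers` list; the host names, credentials, and function names are placeholders for this example, not real services.

```typescript
interface IceServerEntry {
  urls: string[];
  username?: string;
  credential?: string;
}

interface IceConfig {
  iceServers: IceServerEntry[];
}

// Build a configuration with one STUN and one TURN server.
// All arguments are placeholders; use your own deployment's values.
function buildIceConfig(stunUrl: string, turnUrl: string, username: string, credential: string): IceConfig {
  return {
    iceServers: [
      { urls: [stunUrl] },
      { urls: [turnUrl], username: username, credential: credential },
    ],
  };
}

// A session is relayed when either end of the selected candidate pair
// is a "relay" candidate, i.e. an address allocated on the TURN server.
function isRelayed(localCandidateType: string, remoteCandidateType: string): boolean {
  return localCandidateType === "relay" || remoteCandidateType === "relay";
}

const iceConfig = buildIceConfig(
  "stun:stun.example.org:3478",
  "turn:turn.example.org:3478?transport=udp",
  "demo-user",
  "demo-secret"
);

// In a browser you would pass the object to the peer connection:
//   const pc = new RTCPeerConnection(iceConfig);
```

Checking `isRelayed` against the candidate types of the selected pair (available from the connection statistics) is one way to measure how many of your own sessions end up on the relay.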
Why Don’t I Set This Up Myself?
Well, you could. It might be a little overkill, though, and it requires one more competency in your operations team. Your TURN and STUN servers will probably be heavily underutilized and expensive. This is where economies of scale come in. Since Sinch is doing over a billion minutes per year, our pricing for data transfer is cheaper than most companies can get.
You probably want to have a distributed network as well. If, for instance, you have your TURN server in the U.S. and calls are going on between clients in Europe, you add latency just because all traffic needs to cross the ocean. A good rule of thumb is that around 250ms of latency is noticeable in a conversation (more quality of service info here). So, even before adding any network latency on the client side and the processing time to encode the data, you are basically guaranteed to have too much latency between clients.
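To make the latency argument concrete, here is a small budget calculation. Every number in it is an illustrative placeholder rather than a measurement, the 250ms threshold is treated as a one-way figure (an assumption), and the field and function names are invented for the example. The point is only to show how quickly a relay on the wrong continent eats into the budget.

```typescript
// One-way delays in milliseconds. All values used below are made-up
// placeholders, not measurements.
interface PathDelays {
  captureAndEncodeMs: number;
  senderAccessMs: number;          // sender to its local network edge
  relayDetourMs: number;           // extra distance caused by the relay's location
  receiverAccessMs: number;
  jitterBufferAndDecodeMs: number;
}

const NOTICEABLE_MS = 250; // rule of thumb from the text, assumed one-way

function totalDelayMs(p: PathDelays): number {
  return (
    p.captureAndEncodeMs +
    p.senderAccessMs +
    p.relayDetourMs +
    p.receiverAccessMs +
    p.jitterBufferAndDecodeMs
  );
}

function remainingBudgetMs(p: PathDelays): number {
  return NOTICEABLE_MS - totalDelayMs(p);
}

// Two European clients with a relay close to both of them.
const relayNearby: PathDelays = {
  captureAndEncodeMs: 30,
  senderAccessMs: 20,
  relayDetourMs: 10,
  receiverAccessMs: 20,
  jitterBufferAndDecodeMs: 60,
};

// Same clients, relay across the ocean: the media crosses twice.
// 45 ms per crossing is an assumed figure.
const relayAcrossOcean: PathDelays = {
  captureAndEncodeMs: 30,
  senderAccessMs: 20,
  relayDetourMs: 2 * 45,
  receiverAccessMs: 20,
  jitterBufferAndDecodeMs: 60,
};

const headroomNearby = remainingBudgetMs(relayNearby);
const headroomAcrossOcean = remainingBudgetMs(relayAcrossOcean);
```

With these placeholder inputs the nearby relay leaves 110ms of headroom and the distant one only 30ms, so any extra congestion or buffering pushes the call past the threshold. Plug in your own measurements to see where your deployment lands.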
Is It Only About Backend?
It’s not only about backend. At Sinch, we have vast experience with real-time communications, and we customize and configure WebRTC to work its best on all devices and across different network conditions. One example is our implementation of adaptive Opus, which adjusts the recording quality based on quality metrics from our traffic. We also know which codecs to use in specific circumstances, and which to select to minimize transcoding and latency worldwide.
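To give a feel for what adapting Opus to link quality can look like, here is a deliberately simple sketch. It is not Sinch's algorithm: the thresholds, step sizes, and names are all assumptions picked for illustration. The only fixed facts are the bounds, since Opus supports bitrates from 6 kbit/s to 510 kbit/s.

```typescript
// Opus supports bitrates from 6 kbit/s up to 510 kbit/s.
const OPUS_MIN_BPS = 6000;
const OPUS_MAX_BPS = 510000;

interface LinkMetrics {
  packetLossFraction: number; // 0.0 to 1.0
  roundTripMs: number;
}

// Pick the next target bitrate from the current one and recent link metrics.
// Thresholds and step sizes are hypothetical values for this sketch.
function nextOpusBitrate(currentBps: number, metrics: LinkMetrics): number {
  let next = currentBps;
  if (metrics.packetLossFraction > 0.1 || metrics.roundTripMs > 400) {
    next = currentBps * 0.7; // link is struggling: back off quickly
  } else if (metrics.packetLossFraction < 0.02) {
    next = currentBps + 2000; // link looks clean: probe upward slowly
  }
  // Otherwise hold the current bitrate. Always stay within what Opus supports.
  return Math.round(Math.min(OPUS_MAX_BPS, Math.max(OPUS_MIN_BPS, next)));
}

const afterLoss = nextOpusBitrate(32000, { packetLossFraction: 0.2, roundTripMs: 100 });
const afterCleanLink = nextOpusBitrate(32000, { packetLossFraction: 0.0, roundTripMs: 50 });

// In a browser, one way to apply the result is through
// RTCRtpSender.setParameters(), setting encodings[0].maxBitrate.
```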
Other examples include dynamic configuration of Android devices. One wonderful thing about mobiles is that at least the high-end ones have dedicated hardware support for noise and echo cancellation. Unfortunately, not all handsets have that, which means that when you set up WebRTC, you need to know whether to use the hardware or not. At Sinch, we test and optimize for a lot of different handsets (and Android versions).
So these are just a few reasons why you should consider a provider like Sinch. If you have any more questions, please feel free to contact us here.