Zhang Yuhang is a front-end engineer on the WeDoctor Cloud service team, and a self-described Virgo programmer with no artistic talent.

Preface

The sudden outbreak of COVID-19 in early 2020 all but cut off offline medical channels. Against this background, WeDoctor, a leader in the digital health industry, quickly met the urgent needs of a large number of people seeking medical care through online consultation and similar services. Doctor-patient video consultation, an important part of WeDoctor's Web-based online consultation, is precisely an application of the WebRTC technology described below.

What is WebRTC?

WebRTC (Web Real-Time Communication) grew out of the GIPS engine that Google obtained by acquiring the VoIP software vendor Global IP Solutions for $68.2 million in 2010. Google renamed the technology "WebRTC" and open-sourced it in 2011 to build a platform for real-time audio, video, and data communication between browsers.

So what can WebRTC do? Besides the medical scenarios mentioned above (online consultation, remote outpatient service, remote consultation), there are the popular interactive live-streaming solutions in e-commerce and solutions for the education industry. In addition, with the rapid rollout of 5G, WebRTC also provides solid technical support for cloud gaming.

WebRTC architecture

Below is the overall architecture diagram of WebRTC, taken from the official WebRTC website:

It is not difficult to see from the figure that the entire WebRTC architecture design can be roughly divided into the following three parts:

  1. The purple part is the Web API exposed to front-end developers
  2. The solid blue boxes are the native APIs offered to the major browser vendors
  3. The dashed blue box contains three parts that vendors can customize: the audio engine, the video engine, and network transport

Principles of WebRTC point-to-point communication

What difficulties need to be solved to implement real-time audio and video communication between two clients (possibly different Web browsers or mobile apps, each equipped with a microphone and camera) sitting in two different network environments?

  1. How do the two parties learn about each other and find each other?
  2. How do they communicate their audio and video codec capabilities to each other?
  3. How is the audio and video data actually transmitted? How do you let the other party see you?

For question 1: although WebRTC supports end-to-end communication, this does not mean that WebRTC needs no server at all. During point-to-point communication, both parties have to exchange some metadata, such as media information and network information; this process is usually called signaling, and the corresponding server is a signaling server. It is also commonly called a room server, because besides relaying media and network information it also manages room information, for example notifying each peer who has joined or left the room, and telling peers whether the room is full. To avoid redundancy and maximize compatibility with existing technologies, the WebRTC standard does not specify signaling methods or protocols. The practice section of this article uses Koa and Socket.IO to implement a signaling server.
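To make the idea concrete, here is a minimal sketch of what a Socket.IO-based signaling relay can look like. The event names (joinRoom, signal, peerJoined) are illustrative only, not part of any standard; the practice section later builds a fuller version:

// A minimal signaling relay sketch; event names are illustrative
const http = require('http');
const socket = require('socket.io');

const httpServer = http.createServer().listen(3000);
socket(httpServer).on('connection', (sock) => {
    // A peer asks to join a named room
    sock.on('joinRoom', (room) => {
        sock.join(room);
        sock.to(room).emit('peerJoined', sock.id); // notify the existing peer
    });
    // Blindly forward any signaling payload (SDP or candidate) to the room
    sock.on('signal', ({ room, payload }) => {
        sock.to(room).emit('signal', payload);
    });
});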

For question 2: the first thing to know is that different browsers have different audio and video codec capabilities. For example, peer A may support the H264 and VP8 encoding formats while peer B supports H264 and VP9. To ensure that both sides can encode and decode correctly, the simplest approach is to take the intersection of the formats both support: H264. In WebRTC, a dedicated protocol called the Session Description Protocol (SDP) is used to describe this kind of information. Both parties in an audio/video call therefore have to exchange SDP to learn which media formats the other party supports. This process of exchanging SDP is usually called media negotiation.
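To give a feel for what SDP looks like, below is an abbreviated, hand-trimmed fragment of the kind of video media description a browser might generate (payload type numbers and the codec list vary by browser and version):

m=video 9 UDP/TLS/RTP/SAVPF 96 102
a=rtpmap:96 VP8/90000
a=rtpmap:102 H264/90000
a=fmtp:102 level-asymmetry-allowed=1;packetization-mode=1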

For question 3: this is essentially a process of network negotiation. Both parties in a real-time audio/video call need to understand each other's network conditions so that a communication link can be found. Ideally every computer would have its own public IP address, and point-to-point connections could be made directly. In reality, because of network security concerns and the shortage of IPv4 addresses, our computers sit inside local area networks, large or small, behind NAT (Network Address Translation). WebRTC uses the ICE mechanism to establish network connections. So what is ICE?

ICE (Interactive Connectivity Establishment) is not a protocol but a framework that integrates the STUN and TURN protocols. STUN (Session Traversal Utilities for NAT) lets a client behind a NAT (or behind multiple NATs) discover its own public IP address and port, a process colloquially known as "hole punching". However, if the NAT is symmetric, the hole cannot be punched. This is where TURN comes in: TURN (Traversal Using Relays around NAT) is an extension of STUN (RFC 5389) that adds relay functionality. In short, it solves the problem that symmetric NAT cannot be traversed: when STUN fails to yield a usable public address, the client can request a public IP address from the TURN server to serve as a relay address.
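In browser code, the STUN and TURN servers are handed to WebRTC through the RTCPeerConnection configuration. A minimal sketch, with placeholder URLs and credentials rather than real servers:

const peer = new RTCPeerConnection({
    iceServers: [
        { urls: 'stun:stun.example.com:3478' },   // hole punching
        {
            urls: 'turn:turn.example.com:3478',   // relay fallback
            username: 'xxx',                      // matches the user line in the coturn config
            credential: 'xxx'
        }
    ]
});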

There are three types of ICE candidates in WebRTC. They are:

  • Host candidate
  • Server reflexive candidate
  • Relay candidate

A host candidate represents an IP address and port on the local area network. It has the highest priority of the three, meaning that at the bottom layer WebRTC first tries to establish a connection over the LAN.

A server reflexive candidate represents the host's external IP address and port as seen from outside the NAT. Its priority is lower than the host candidate's: when WebRTC fails to connect over the LAN, it tries to connect using the IP address and port obtained through this candidate.

A relay candidate represents the IP address and port of a relay server, through which media data is forwarded. When the two WebRTC clients cannot traverse NAT to communicate peer-to-peer, relaying through the server is the only way to guarantee normal communication and service quality.
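For illustration, candidates surface in the browser as strings like the following (made-up addresses; the typ field marks them as host, srflx, or relay respectively):

candidate:1 1 udp 2122260223 192.168.1.12 56143 typ host
candidate:2 1 udp 1686052607 203.0.113.7 56143 typ srflx raddr 192.168.1.12 rport 56143
candidate:3 1 udp 41885439 198.51.100.4 3478 typ relay raddr 203.0.113.7 rport 56143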

As can be seen from the figure above, outside the LAN case, WebRTC obtains its external IP address and port through the STUN server, then exchanges this network information with the remote WebRTC client through the signaling server. The two parties can then attempt to establish a P2P connection. If NAT traversal fails, the connection falls back to relaying through the TURN server.

It is worth mentioning that in WebRTC, network information is usually described as a candidate, and the STUN server and relay (TURN) server in the figure above can be the same machine. The practice section at the end of this article uses coturn, an open-source project that integrates both STUN (hole punching) and TURN (relay) functionality.

To sum up, we can use the following figure to illustrate the basic principle of WebRTC point-to-point communication:

In short, each end uses the APIs provided by WebRTC to obtain its media information (SDP) and network information (candidates), exchanges them through the signaling server, and then establishes a connection channel between the two ends to complete a real-time video and voice call.

Several important WebRTC APIs

Audio and video capture API

MediaDevices.getUserMedia()

const constraints = {
    video: true,
    audio: true
};
// In non-secure contexts (neither HTTPS nor localhost), navigator.mediaDevices is undefined
try {
    const stream = await navigator.mediaDevices.getUserMedia(constraints);
    document.querySelector('video').srcObject = stream;
} catch (error) {
    console.error(error);
}

Get the list of audio and video input and output devices

MediaDevices.enumerateDevices()

try {
    const devices = await navigator.mediaDevices.enumerateDevices();
    this.videoinputs = devices.filter(device => device.kind === 'videoinput');
    this.audiooutputs = devices.filter(device => device.kind === 'audiooutput');
    this.audioinputs = devices.filter(device => device.kind === 'audioinput');
} catch (error) {
    console.error(error);
}
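The two APIs above work together: a deviceId returned by enumerateDevices can be fed back into getUserMedia to open one specific camera. A small sketch, assuming this.videoinputs was filled as above:

// Request a specific camera by deviceId; `exact` makes the constraint mandatory
const preferred = this.videoinputs[0];
const stream = await navigator.mediaDevices.getUserMedia({
    video: { deviceId: { exact: preferred.deviceId } },
    audio: true
});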

RTCPeerConnection

RTCPeerConnection, as the API for creating a point-to-point connection, is the key to real-time audio and video communication. (See the MDN documentation.)

The practice section of this article mainly uses the following RTCPeerConnection methods:

Media negotiation methods

  • createOffer
  • createAnswer
  • setLocalDescription
  • setRemoteDescription

Important events

  • onicecandidate
  • onaddstream
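Note that addStream and onaddstream, which the practice code below uses, are older APIs; current WebRTC specifications favor their track-based equivalents. A hedged modern sketch, assuming a localStream and peer like those in the practice code:

// Track-based equivalents of addStream/onaddstream
localStream.getTracks().forEach(track => peer.addTrack(track, localStream));
peer.ontrack = (event) => {
    // event.streams[0] is the remote MediaStream the incoming track belongs to
    document.querySelector('#remote-video').srcObject = event.streams[0];
};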

As described in the previous section, the most important part of P2P communication is the exchange of media information:

It is not difficult to see from the figure above that the whole media negotiation process can be simplified into three steps, corresponding to the four media negotiation methods listed above:

  • The caller, Amy, creates an Offer and sends the offer message (which contains her SDP) to the receiver, Bob, through the signaling server; at the same time she calls setLocalDescription to save her own SDP
  • After receiving the offer from the peer, Bob calls setRemoteDescription to save the peer's SDP, creates an Answer with createAnswer and calls setLocalDescription to save his own SDP, then sends the answer message (which contains his SDP) to Amy through the signaling server
  • After receiving the answer from the peer, Amy calls setRemoteDescription to save the peer's SDP

After these three steps, the media negotiation part of the P2P communication process is complete. In fact, when the calling end and the receiving end invoke setLocalDescription, each also starts collecting its own network information (candidates). Each end then gathers its candidates through the onicecandidate event and sends them to the peer through the signaling server, opening up the network channel for P2P communication; finally, each side obtains the other's video stream by listening for the onaddstream event, completing the whole video call flow.
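As a self-contained illustration of these steps, the following sketch negotiates between two RTCPeerConnection objects in the same page; since both ends are local, plain function calls stand in for the signaling server (variable names are illustrative, and the code assumes an async context):

const amy = new RTCPeerConnection();
const bob = new RTCPeerConnection();

// In a loopback test, exchanging candidates is a direct call instead of signaling
amy.onicecandidate = (e) => e.candidate && bob.addIceCandidate(e.candidate);
bob.onicecandidate = (e) => e.candidate && amy.addIceCandidate(e.candidate);

const stream = await navigator.mediaDevices.getUserMedia({ video: true, audio: true });
stream.getTracks().forEach(track => amy.addTrack(track, stream));

const offer = await amy.createOffer();
await amy.setLocalDescription(offer);    // Amy saves her own SDP
await bob.setRemoteDescription(offer);   // Bob saves Amy's SDP
const answer = await bob.createAnswer();
await bob.setLocalDescription(answer);   // Bob saves his own SDP
await amy.setRemoteDescription(answer);  // Amy saves Bob's SDP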

WebRTC practice

Setup of coturn server

Note: you do not need to set up a coturn server for LAN-only testing. For access over the public Internet, you need to buy a cloud host and bind a domain name that supports HTTPS before setting up coturn. Here is a WebRTC test site built by the author: webrtc-Demo

The coturn server mainly solves the problem of NAT traversal failure, and its installation is relatively simple:

1. git clone https://github.com/coturn/coturn.git
2. cd coturn/
3. ./configure --prefix=/usr/local/coturn
4. make
5. make install
6. openssl req -x509 -newkey rsa:2048 -keyout /etc/turn_server_pkey.pem -out /etc/turn_server_cert.pem -days 99999 -nodes

Coturn service configuration

vim /usr/local/coturn/etc/turnserver.conf

listening-port=3478
external-ip=xxx.xxx.xxx.xxx   # your host's public IP
user=xxx:xxx                  # username:password
realm=xxx.com                 # your domain name

Start the Coturn service


1. cd /usr/local/coturn/bin/
2. ./turnserver -c ..
# Note: TCP and UDP port 3478 must be opened on the cloud host
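To check that the TURN server actually hands out relay candidates, a quick test can be run in the browser console; this is a hedged sketch with placeholder address and credentials:

// If any logged line contains "typ relay", the TURN server is reachable
const pc = new RTCPeerConnection({
    iceServers: [{ urls: 'turn:xxx.com:3478', username: 'xxx', credential: 'xxx' }]
});
pc.onicecandidate = (e) => e.candidate && console.log(e.candidate.candidate);
pc.createDataChannel('probe');  // give ICE something to gather candidates for
pc.createOffer().then(offer => pc.setLocalDescription(offer));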

Practice code

Before coding, combining the basic principles of WebRTC point-to-point communication from the section above yields the following flow chart:

It is not difficult to see from the figure that, assuming PeerA is the caller and PeerB is the receiver, a signaling server is necessary to manage room information and forward network and media information in order to realize WebRTC's real-time peer-to-peer audio and video communication. This article builds the signaling server with Koa and Socket.IO:

// server-side: server.js
const Koa = require('koa');
const socket = require('socket.io');
const http = require('http');
const app = new Koa();
const httpServer = http.createServer(app.callback()).listen(3000, () => {});
socket(httpServer).on('connection', (sock) => {
    // ...
});

// client-side: socket.js
import io from 'socket.io-client';
const socket = io.connect(window.location.origin);
export default socket;

After the signaling server is set up, the following steps are taken according to the flowchart:

  1. PeerA and PeerB each connect to the signaling server, which records the room information (a fuller sketch of the elided handlers follows the snippet below)
// server-side: server.js
socket(httpServer).on('connection', (sock) => {
    // The user leaves the room
    sock.on('userLeave', () => {
        // ...
    });
    // Check whether the room can be joined
    sock.on('checkRoom', () => {
        // ...
    });
    // ...
});
// client-side: Room.vue
import socket from '../utils/socket.js';

// The server tells the user whether the room can be joined
socket.on('checkRoomSuccess', () => {
    // ...
});
// The server informs the user that the room has been joined successfully
socket.on('joinRoomSuccess', () => {
    // ...
});
// ...
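The handlers above are elided because they depend on the application. As a hedged sketch, server-side bookkeeping inside the connection handler for a two-person video room might look like this; the rooms map, the roomId payload, and the two-peer cap are assumptions for illustration:

// Hypothetical room bookkeeping: at most two peers per video room
const rooms = new Map(); // roomId -> Set of socket ids

sock.on('checkRoom', ({ roomId }) => {
    const members = rooms.get(roomId) || new Set();
    if (members.size >= 2) {
        sock.emit('checkRoomFail'); // room is full
        return;
    }
    members.add(sock.id);
    rooms.set(roomId, members);
    sock.join(roomId);
    sock.emit('checkRoomSuccess');
    sock.to(roomId).emit('joinRoomSuccess', sock.id);
});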
  2. A sends a video invitation to B; after B accepts, both parties create a local RTCPeerConnection and add the local video stream to it. The sender creates an offer, sets it as its local SDP description, and sends its SDP to the peer through the signaling server
socket.on('answerVideo', async (user) => {
    VIDEO_VIEW.showInvideoModal();
    // Create the local video stream
    const localStream = await this.createLocalVideoStream();
    this.localStream = localStream;
    document.querySelector('#echat-local').srcObject = this.localStream;
    this.peer = new RTCPeerConnection();
    this.initPeerListen();
    this.peer.addStream(this.localStream);
    if (user.sockId === this.sockId) {
        // receiver
    } else {
        // The sender creates the offer
        const offer = await this.peer.createOffer(this.offerOption);
        await this.peer.setLocalDescription(offer);
        socket.emit('receiveOffer', { user: this.user, offer });
    }
});
  3. Calling setLocalDescription also starts gathering the local network information (candidates); if the host is not on a directly reachable network, or hole punching fails, it will also request "relay candidates" from the STUN/TURN server. So when creating the RTCPeerConnection we also need to listen for ICE candidate events:
initPeerListen () {
    // Collect your own network information and send it to the peer
    this.peer.onicecandidate = (event) => {
        if (event.candidate) {
            socket.emit('addIceCandidate', { candidate: event.candidate, user: this.user });
        }
    };
    // ...
}
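Beyond onicecandidate, it can help during debugging to watch the connection's state transitions. A small optional addition to the same initPeerListen method (not part of the original SDK code):

// Optional debugging listeners
this.peer.oniceconnectionstatechange = () => {
    // "checking" -> "connected" on success; "failed" usually means NAT traversal failed
    console.log('ICE state:', this.peer.iceConnectionState);
};
this.peer.onicegatheringstatechange = () => {
    console.log('gathering state:', this.peer.iceGatheringState);
};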
  4. When receiver B obtains the offer containing sender A's SDP through the signaling server, it calls setRemoteDescription to store the peer's SDP, creates an answer and sets it as the local SDP, then sends the answer containing its local SDP back through the signaling server
socket.on('receiveOffer', async (offer) => {
    await this.peer.setRemoteDescription(offer);
    const answer = await this.peer.createAnswer();
    await this.peer.setLocalDescription(answer);
    socket.emit('receiveAnsewer', { answer, user: this.user });
});
  5. When caller A receives receiver B's answer through the signaling server, it likewise calls setRemoteDescription; at this point the two sides have completed the exchange of SDP information
socket.on('receiveAnsewer', (answer) => {
    this.peer.setRemoteDescription(answer);
});
  6. Once the SDP exchange is complete and the candidates gathered via onicecandidate have been exchanged through the signaling server, each side obtains the other's video stream:
socket.on('addIceCandidate', async (candidate) => {
    await this.peer.addIceCandidate(candidate);
});
this.peer.onaddstream = (event) => {
    // Get the other side's video stream
    document.querySelector('#remote-video').srcObject = event.stream;
};
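One thing the snippets above leave out is tearing the call down. A hedged sketch of a hangup handler; the hangup event name is illustrative, not from the original code:

// Hypothetical cleanup when either side hangs up
socket.on('hangup', () => {
    this.localStream.getTracks().forEach(track => track.stop()); // release camera and mic
    this.peer.close();
    this.peer = null;
});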

Conclusion

The code for this article can be downloaded from Learn-WebrTC. It is worth mentioning that VIDEO_VIEW in the code is a JS SDK focused on the video UI layer: it provides the modal for initiating a video call, the modal for receiving one, and the in-call modal, and it is extracted from the JS SDK used by WeDoctor's online Web video consultation service. This article only briefly introduces the basic principles of WebRTC P2P communication; the SDK used in production supports not only point-to-point communication but also multi-party video calls, screen sharing, and other features built on WebRTC.

References

  • WebRTC in the real world: STUN, TURN and signaling
  • WebRTC signaling control and STUN/TURN server setup