This article is shared by Zhang Yuhang, a front-end engineer on a WebRTC technology team. The original, titled "Build a WebRTC Application from 0 to 1", has been revised and edited.

1. Introduction

At the beginning of last year, the sudden outbreak of COVID-19 all but cut off offline medical channels. Against this backdrop, online consultation quickly met the urgent needs of large numbers of patients seeking treatment. As a key part of online consultation, doctor-patient video calls are built on real-time audio and video technology.

As is well known, real-time audio and video chat technology has a very high barrier to entry, and it is hard for most companies to close that technical gap from scratch in a short time. The open-source audio and video project WebRTC, however, offers a shortcut (the author's product is also built on WebRTC).

Drawing on the author's hands-on experience applying WebRTC in an online consultation product, this article describes how to develop a real-time audio and video chat feature from scratch. It starts with WebRTC fundamentals and technical principles, then demonstrates, using open-source technology, how to build a working WebRTC real-time audio and video chat feature.

2. Source code for this article

gitee.com/instant_me…

```shell
cd webrtc-server
yarn
npm start

cd webrtc-static
yarn
npm start
```

3. Knowledge preparation

3.1 Theoretical basis of audio and video

Before diving into WebRTC, if you are not yet familiar with the basic theory of audio and video technology, it is recommended to start with the following introductory articles.

  1. “Zero Basics: An Introduction to the Most Popular Video Coding Techniques ever” (* Required reading)
  2. Introduction to Real-time Audio and Video Technology for Xiaobai
  3. Basics of Real-time Audio and Video Technology
  4. “Real-time Audio and Video Visual Essentials: Quick Mastery of 11 Basic Concepts related to Video Technology” (* Required reading)
  5. Iqiyi Technology Sharing: Light and humorous, explaining the past, present and future of video codec technology

3.2 What is WebRTC

▲ The picture is quoted from “The Great WebRTC: Improving the Ecology, or upgrading real-time audio and video technology”

WebRTC (Web Real-Time Communication) originated from the GIPS engine of Global IP Solutions, a VoIP software developer that Google acquired for $68.2 million in 2010. Google renamed it "WebRTC" and open-sourced it in 2011, with the goal of building a platform for real-time audio, video and data communication between Internet browsers. For more background on WebRTC, see "The Great WebRTC: Improving the Ecosystem, or Revolutionizing Real-time Audio and Video Technology".

So what can WebRTC do?

Besides the real-time audio and video calls in everyday IM apps such as WeChat, DingTalk and QQ, WebRTC powers the online consultation, remote outpatient and remote consultation scenarios in the author's medical products, as well as popular interactive live streaming and online education. Moreover, with the rapid rollout of 5G, WebRTC also provides solid technical support for cloud gaming.

3.3 WebRTC learning resources

WebRTC official resources:

  1. WebRTC Open Source Project official website
  2. WebRTC Open Source Project Source Hosting Address
  3. WebRTC Standard API Online Documentation

Other WebRTC learning resources:

  1. Open Source Real-time Audio and video Technology WebRTC under Windows concise compilation tutorial
  2. Introduction to the Overall Architecture of WebRTC Real-time Audio and Video Technology
  3. Conscience Sharing: WebRTC Zero-Base Developer Tutorial (Chinese)

4. Composition of WebRTC technology

Overall technical composition diagram from WebRTC official website:

The whole WebRTC can be roughly divided into the following three parts:

  • 1) The purple part is the API exposed to Web front-end developers;
  • 2) The solid blue line marks the API used by the major browser vendors;
  • 3) The dotted blue line covers three customizable components: the audio engine, the video engine, and network transport.

Due to limited space, this section will not go into further depth. If you are interested, read "Introduction to the Overall Architecture of WebRTC Real-time Audio and Video Technology".

5. P2P communication principle of WebRTC

5.1 Technical difficulties of P2P communication

P2P communication is point-to-point communication.

What difficulties need to be solved for two clients (possibly different Web browsers or mobile apps, each with a microphone and camera) in different network environments to communicate with each other in real time with audio and video?

To sum up, there are mainly the following three problems:

  • 1) How do the two parties discover each other's existence?
  • 2) How do they negotiate their audio and video codec capabilities?
  • 3) How is the audio and video data transmitted, so that each side can see and hear the other?

We’ll take a look at each of these questions one by one.

5.2 How do the two parties know of each other's existence (i.e., how do they discover each other)?

Regarding question 1: WebRTC supports end-to-end communication, but that does not mean WebRTC no longer needs a server.

In P2P communication, the two parties need to exchange some metadata, such as media information and network information; this process is usually called "signaling".

The corresponding server is the signaling server, also known as a "room server", because besides exchanging media and network information it also manages room information.

For example:

  • 1) notifying each party who has joined the room;
  • 2) notifying who has left the room;
  • 3) telling a third client whether the room is full and whether it can join.
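The room bookkeeping above boils down to a capacity check on the signaling server. Here is a minimal sketch; the `ROOM_SIZE` constant, the `canJoin` helper, and the room shape are illustrative assumptions, not part of the article's SDK:

```javascript
// A two-party video call: the signaling ("room") server must tell a third
// client that the room is full. ROOM_SIZE and the room object are assumptions.
const ROOM_SIZE = 2;

function canJoin(room) {
  return room.members.length < ROOM_SIZE;
}

const room = { members: ['amy'] };
console.log(canJoin(room)); // true: a second caller may still join
room.members.push('bob');
console.log(canJoin(room)); // false: a third client must be rejected
```

On a real signaling server, a check like this would run inside the room-check handler before letting a socket join.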

To avoid redundancy and maximize compatibility with existing technologies, the WebRTC standard does not specify signaling methods or protocols. The practice section later in this article implements a signaling server with Koa and Socket.IO.

5.3 How do the two parties negotiate audio and video codec capabilities?

Regarding question 2: the first thing to know is that different browsers have different audio and video codec capabilities.

For example, suppose peer-A supports H264 and VP8, while peer-B supports H264 and VP9. To ensure that both parties can encode and decode correctly, the simplest way is to take the intersection of the formats supported by both: H264.
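Taking the "intersection" is literally a set operation over the two codec lists. A tiny sketch, using the hypothetical peer-A/peer-B capabilities from above:

```javascript
// peer-A and peer-B each advertise the video codecs they support;
// the set usable by both is the intersection of the two lists.
const peerA = ['H264', 'VP8'];
const peerB = ['H264', 'VP9'];

const usable = peerA.filter(codec => peerB.includes(codec));
console.log(usable); // [ 'H264' ]
```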

In WebRTC, there is a dedicated protocol called the Session Description Protocol (SDP) for describing exactly this kind of information.

Therefore, both parties involved in audio and video communication must exchange SDP information if they want to know the media formats supported by each other. The process of exchanging SDP is usually called media negotiation.
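For a feel of what is being exchanged, the media section of an SDP lists the codecs a peer offers. The fragment below is illustrative only; the payload type numbers and the exact attribute set vary by browser:

```
m=video 9 UDP/TLS/RTP/SAVPF 96 98
a=rtpmap:96 VP8/90000
a=rtpmap:98 H264/90000
```

Each `a=rtpmap` line maps a payload type number to a codec name and clock rate; media negotiation is essentially both sides comparing these lists.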

5.4 How is audio and video data transmitted so that the other party can see it?

Regarding question 3: this is essentially a process of network negotiation. Both parties in a real-time audio and video call need to learn each other's network situation so that a link for communication can be found.

Ideally, each browser's computer would have its own public IP address, which would allow a direct point-to-point connection.

But in reality, for security reasons and because of the shortage of IPv4 addresses, our computers sit inside a local area network (LAN) behind NAT (Network Address Translation). In WebRTC, the ICE mechanism is used to establish network connections.

So what is ICE?

ICE (Interactive Connectivity Establishment) is not a protocol but a framework that integrates STUN and TURN.

STUN (Session Traversal Utilities for NAT) allows a client behind one or more NATs to discover its own public IP address and port.

However, if the NAT type is symmetric, hole punching will fail. This is where TURN comes in handy: TURN (Traversal Using Relays around NAT) is an extension of STUN (RFC 5389) that adds relay functionality.

Simply put, its purpose is to solve the problem that symmetric NAT cannot be traversed: when STUN fails to provide a usable public address, a TURN server can allocate a public IP address and port as a relay address.
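In code, the STUN and TURN servers are handed to WebRTC as the `iceServers` configuration of the RTCPeerConnection. A sketch with placeholder hostnames and credentials (none of these values come from the article):

```javascript
// STUN for discovering our public address, TURN as the relay fallback.
// Hostnames, username and credential below are placeholders.
const config = {
  iceServers: [
    { urls: 'stun:stun.example.com:3478' },
    {
      urls: 'turn:turn.example.com:3478',
      username: 'user',
      credential: 'secret'
    }
  ]
};
// In the browser: const peer = new RTCPeerConnection(config);
```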

There are three types of ICE candidates in WebRTC, which are:

  • 1) Host candidate: the IP address and port on the local LAN. It has the highest priority of the three; that is, at the bottom layer WebRTC first tries to establish a connection within the local LAN;
  • 2) Reflexive candidate: the host's external IP address and port as seen through the NAT. Its priority is lower than the host candidate's; when a LAN connection fails, WebRTC tries to connect using the IP address and port obtained through this candidate;
  • 3) Relay candidate: the IP address and port of a relay server, i.e., media data is relayed through the server. When NAT traversal between the two WebRTC clients fails, only the relay server can guarantee that the two parties communicate normally, at the cost of some quality of service.

As the figure above shows, a WebRTC client outside the local LAN obtains its external IP and port through the STUN server, then exchanges network information with the remote WebRTC client through the signaling server, after which both parties can try to establish a P2P connection. If NAT traversal fails, the relay server (TURN) relays the media.

It is worth mentioning that network information in WebRTC is usually described as a candidate, and the STUN server and relay server in the figure above can be the same server. The practice section at the end of this article uses the open-source project coturn, which integrates both STUN (hole punching) and TURN (relay) functionality.

To sum up, we can use the following figure to illustrate the basic principle of WebRTC point-to-point communication.

In short: the API provided by WebRTC is used to obtain the media information (SDP) and network information (candidates), which are exchanged through the signaling server; the two ends then establish a connection channel and complete the real-time video and voice call.

PS: for a deeper study of P2P, see the following articles:

  1. P2P technical details (A) : NAT details – detailed principle, P2P introduction
  2. Peer-to-peer (P2P) NAT traversal (hole) solution (Basic Principles)
  3. Peer-to-peer NAT traversal (hole) solution (advanced analysis)
  4. P2P technology details (4) : P2P technology STUN, TURN, ICE details
  5. Easy to Understand: A Quick understanding of NAT Penetration in P2P Technology

6. Several important APIs of WebRTC

6.1 Audio and video collection API

The audio and video capture API is MediaDevices.getUserMedia().

Sample code:

```javascript
const constraints = { video: true, audio: true };

// In a non-secure context (not HTTPS / localhost), navigator.mediaDevices is undefined
try {
  const stream = await navigator.mediaDevices.getUserMedia(constraints);
  document.querySelector('video').srcObject = stream;
} catch (error) {
  console.error(error);
}
```

6.2 Obtaining the list of audio and video input/output devices

The API for enumerating audio and video input/output devices is MediaDevices.enumerateDevices().

Sample code:

```javascript
try {
  const devices = await navigator.mediaDevices.enumerateDevices();
  this.videoinputs = devices.filter(device => device.kind === 'videoinput');
  this.audiooutputs = devices.filter(device => device.kind === 'audiooutput');
  this.audioinputs = devices.filter(device => device.kind === 'audioinput');
} catch (error) {
  console.error(error);
}
```
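Once devices are enumerated, a specific camera can be requested by feeding its `deviceId` back into getUserMedia as a constraint. A small sketch; the `constraintsFor` helper and the sample id are made up for illustration:

```javascript
// Build MediaStreamConstraints that pin video capture to one specific camera.
// The helper name and the sample deviceId are illustrative.
function constraintsFor(videoDeviceId) {
  return {
    audio: true,
    video: { deviceId: { exact: videoDeviceId } }
  };
}

const constraints = constraintsFor('a1b2c3');
// In the browser: navigator.mediaDevices.getUserMedia(constraints)
```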

6.3 RTCPeerConnection

RTCPeerConnection, as an API for creating point-to-point connections, is the key for us to achieve real-time audio and video communication.

The practice section of this article mainly uses the following methods.

Media negotiation methods:

  1. createOffer
  2. createAnswer
  3. setLocalDescription
  4. setRemoteDescription

Important Events:

  1. onicecandidate
  2. onaddstream

From the previous chapter, we know that the most important step in P2P communication is the exchange of media information.

Principles of media negotiation:

It is easy to see from the figure above that the whole media negotiation process can be simplified into three steps, involving the four media negotiation methods listed above.

Specifically:

  • 1) Amy, the caller, creates an offer (createOffer) and sends the offer message (containing Amy's SDP) to Bob through the signaling server, while calling setLocalDescription to save her own local SDP;
  • 2) After Bob, the callee, receives the peer's offer, he calls setRemoteDescription to save the offer containing the peer's SDP, then creates an answer (createAnswer), calls setLocalDescription to save his own SDP, and sends the answer message (containing Bob's SDP) back to the caller Amy through the signaling server;
  • 3) After Amy receives the peer's answer, she calls setRemoteDescription to save the answer containing the peer's SDP.

After the above three steps, the media negotiation part of P2P communication process is completed.
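The three steps can be traced with plain objects standing in for the real RTCPeerConnection calls. The SDP strings here are dummies; real descriptions are generated by createOffer/createAnswer:

```javascript
// Each side keeps a local and a remote description, mirroring
// setLocalDescription / setRemoteDescription. All values are dummies.
const amy = { local: null, remote: null }; // caller
const bob = { local: null, remote: null }; // callee

// 1) Amy: createOffer + setLocalDescription; offer goes via the signaling server
const offer = { type: 'offer', sdp: 'amy-sdp' };
amy.local = offer;

// 2) Bob: setRemoteDescription(offer), then createAnswer + setLocalDescription
bob.remote = offer;
const answer = { type: 'answer', sdp: 'bob-sdp' };
bob.local = answer;

// 3) Amy: setRemoteDescription(answer) completes media negotiation
amy.remote = answer;

console.log(amy.remote.sdp, bob.remote.sdp); // bob-sdp amy-sdp
```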

In fact, calling setLocalDescription on the caller and callee also kicks off the collection of each end's network information (candidates). Each end then picks up its candidates via the onicecandidate event and sends them to the peer through the signaling server. Once the P2P network channel is established, each side obtains the other's video stream by listening for the onaddstream event, completing the whole video call flow.

7. Hands-on coding practice

**Note:** please download the complete source code for this section from the attachment in section 2, "Source code for this article".

7.1 Establishment of coturn server

Note: setting up the coturn server (github.com/coturn) is optional for a LAN-only test. If you need the demo to work across the public Internet, you need to purchase a cloud host and bind a domain name that supports HTTPS access before setting up the coturn server. What follows is the author's own coturn setup process; interested readers can follow along.

The coturn server was built to solve the problem of NAT traversal.

Its installation is also relatively simple:

```shell
git clone https://github.com/coturn/coturn.git
cd coturn/
./configure --prefix=/usr/local/coturn
make -j 4
make install

# generate a key
openssl req -x509 -newkey rsa:2048 -keyout /etc/turn_server_pkey.pem -out /etc/turn_server_cert.pem -days 99999 -nodes
```

7.2 Coturn Service Configuration

My configuration is as follows:

```
# vim /usr/local/coturn/etc/turnserver.conf
listening-port=3478
external-ip=xxx.xxx   # your host's public IP
user=xxx:xxx          # account:password
realm=xxx.com         # your domain name
```

7.3 Starting the Coturn Service

My startup process:

```shell
cd /usr/local/coturn/bin/
./turnserver -c ../etc/turnserver.conf
# Note: open both TCP and UDP on port 3478 on the cloud host
```

7.4 Practice Code

Before writing the code, the following flow chart can be obtained based on the basic principle of WebRTC point-to-point communication in the above section.

As can be seen from the figure, assuming PeerA is the caller and PeerB is the callee, a signaling server is necessary for WebRTC point-to-point real-time audio and video communication: it manages room information and forwards network information and media information.

In this article, the signaling server is built with Koa and Socket.IO:

```javascript
// server.js
const Koa = require('koa');
const socket = require('socket.io');
const http = require('http');

const app = new Koa();
const httpServer = http.createServer(app.callback()).listen(3000, () => {});

socket(httpServer).on('connection', (sock) => {
  // ....
});
```

```javascript
// client: socket.js
import io from 'socket.io-client';

const socket = io.connect(window.location.origin);
export default socket;
```

After setting up a signaling server, perform the following steps based on the flowchart.

Step 1: PeerA and PeerB are connected to the signaling server respectively, and the signaling server records room information:

```javascript
// server.js (server side)
socket(httpServer).on('connection', (sock) => {
  // the user leaves the room
  sock.on('userLeave', () => {
    // ...
  });
  // check whether the room can be joined
  sock.on('checkRoom', () => {
    // ...
  });
  // ...
});
```

```javascript
// room.vue (client side)
import socket from '../utils/socket.js';

// the server reports that the room check succeeded
socket.on('checkRoomSuccess', () => {
  // ...
});
// the server notifies the user that they joined the room successfully
socket.on('joinRoomSuccess', () => {
  // ...
});
// ...
```

Step 2: End A, as the caller, sends a video invitation to end B. After B accepts the video request, both parties create a local RTCPeerConnection and add the local video stream. The caller then creates an offer, sets it as the local SDP description, and sends its SDP to the peer through the signaling server:

```javascript
socket.on('answerVideo', async (user) => {
  VIDEO_VIEW.showInvideoModal();
  // create the local video stream
  const localStream = await this.createLocalVideoStream();
  this.localStream = localStream;
  document.querySelector('#echat-local').srcObject = this.localStream;
  this.peer = new RTCPeerConnection();
  this.initPeerListen();
  this.peer.addStream(this.localStream);
  if (user.sockid === this.sockid) {
    // callee
  } else {
    // caller creates the offer
    const offer = await this.peer.createOffer(this.offerOption);
    await this.peer.setLocalDescription(offer);
    socket.emit('receiveOffer', { user: this.user, offer });
  }
});
```

Step 3: Calling setLocalDescription also starts collecting the client's own network information (candidates). If the client is not on a LAN, or "hole punching" fails, it will also try to obtain relay candidates from the STUN/TURN server. So when creating the RTCPeerConnection we also need to listen for ICE candidate events:

```javascript
initPeerListen() {
  // forward locally gathered ICE candidates to the peer via the signaling server
  this.peer.onicecandidate = (event) => {
    if (event.candidate) {
      socket.emit('addIceCandidate', { candidate: event.candidate, user: this.user });
    }
  };
  // ...
}
```

Step 4: When callee B receives caller A's offer (containing A's SDP) through the signaling server, it calls setRemoteDescription to store the peer's SDP, then creates and sets its local SDP (the answer), and sends the answer containing its local SDP back through the signaling server:

```javascript
socket.on('receiveOffer', async (offer) => {
  await this.peer.setRemoteDescription(offer);
  const answer = await this.peer.createAnswer();
  await this.peer.setLocalDescription(answer);
  socket.emit('receiveAnsewer', { answer, user: this.user });
});
```

Step 5: When caller A receives callee B's answer through the signaling server, it also calls setRemoteDescription, so that the two parties complete the exchange of SDP information:

```javascript
socket.on('receiveAnsewer', (answer) => {
  this.peer.setRemoteDescription(answer);
});
```

Step 6: Once both parties have exchanged SDP and the onicecandidate listener has collected network candidates and exchanged them through the signaling server, each side receives the other's video stream:

```javascript
socket.on('addIceCandidate', async (candidate) => {
  await this.peer.addIceCandidate(candidate);
});

this.peer.onaddstream = (event) => {
  document.querySelector('#remote-video').srcObject = event.stream;
};
```

7.5 Operating Effect

8. Summary of this paper

Through the six steps in the previous section, a complete WebRTC-based P2P video call feature can be implemented (the complete source code for this section can be downloaded from the attachment in section 2, "Source code for this article").

It is worth mentioning that the VIDEO_VIEW in the code is a JS SDK focused on the video UI layer, covering the outgoing-call modal, incoming-call modal and in-call modal; it is extracted from the JS SDK used by the author's online Web video consultation product.

This article has only briefly introduced the basic principles of WebRTC P2P communication along with a simple code practice. In fact, the SDK we use in production supports not only one-to-one calls but also multi-party video calls, screen sharing and other features, all built on WebRTC.

9. Reference materials

[1] WebRTC Standard API online documentation

[2] WebRTC in the real world: STUN, TURN and signaling

[3] WebRTC signaling control and STUN/TURN server construction

[4] The great WebRTC: The ecosystem is becoming more and more perfect, or the real-time audio and video technology will be revolutionized

[5] WebRTC is an open source real-time audio and video technology

[6] WebRTC Real-time audio and video technology architecture introduction

[7] Share your Conscience: WebRTC Zero-Base Developer Tutorial

[8] Peer-to-peer NAT Traversal (NAT traversal)

[9] A peer-to-peer (P2P) technology with STUN, TURN, ICE

[10] NAT penetration in P2P technology

(This article is simultaneously published at: www.52im.net/thread-3680…)