CREATING A LIVE STREAMING PROTOTYPE

It is always a challenge to try to understand a technology and create a prototype to prove how it can be used, for instance, in a website. All frameworks and current state-of-the-art development tools were once prototypes, experiments, and proofs of concept; this is why I decided to create a prototype, even though this technology is widely available and multiple services already exist, in order to demonstrate how to transfer live audio and video. I kept in mind concepts such as adaptive transfer based on network restrictions, as well as encoding media from the sender and transferring and decoding media at the receiver.
With these concepts in mind, let's use minimal frameworks to make a streaming website work. First of all, a server framework is needed; in this case, Node.js with Express.js powers the website. The client side uses HTML, CSS, and JavaScript. Let's also use the Web Audio API, since it can decode audio; the MediaRecorder API, since it can capture chunks of media from the sender; and WebSockets, since messages carrying media will be sent and received with the server acting as a distributor.
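As a rough sketch of this setup (the ws package, the port number, and the simple broadcast logic are my own assumptions rather than details of the prototype), an Express server with a WebSocket distributor could look like this:

// server.js: Express serves the static client files, and a WebSocket server
// relays messages between senders and receivers.
const express = require('express');
const http = require('http');
const { WebSocketServer, WebSocket } = require('ws');

const app = express();
app.use(express.static('public')); // serves the HTML, CSS and client JavaScript

const server = http.createServer(app);
const wss = new WebSocketServer({ server });

wss.on('connection', (socket) => {
  socket.on('message', (data) => {
    // Relay each incoming message (audio chunk, frame or metadata) to the other
    // connected clients; the real prototype would filter by channel and role.
    for (const client of wss.clients) {
      if (client !== socket && client.readyState === WebSocket.OPEN) {
        client.send(data);
      }
    }
  });
});

server.listen(3000, () => console.log('Streaming prototype listening on port 3000'));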
Sender logic
My first concern after gathering these frameworks and the idea was creating media to transfer. Different media sources are available on the sender's side: video or audio files stored on the device, live devices such as a camera and microphone, or a combination of both for more complex media creation. In general, this media can be accessed through MediaRecorder and captured in chunks at regular intervals. For instance, one approach is to access the sender's media and create audio chunks at 1-second intervals so this information can be sent to the receiver.
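A minimal sketch of this capture step, assuming a microphone-only stream and a WebM container (both assumptions for illustration):

// Capture the sender's microphone and emit roughly one-second audio chunks.
async function startAudioCapture(onChunk) {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm' });
  recorder.ondataavailable = (event) => {
    if (event.data.size > 0) {
      onChunk(event.data); // a Blob holding about one second of recorded audio
    }
  };
  recorder.start(1000); // timeslice in milliseconds: fire ondataavailable every second
  return recorder;
}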
However, in the case of video, more complex tools would probably be needed to create video chunks; so, to simplify the proof of concept, I decided to capture snapshots of the sender's media whenever video is available. This way, the receiver gets not only audio chunks but also video frames that can be played together once they arrive. Because the audio chunks are decoded and played on the receiver side, the audio might produce a 'tick' if there are silent gaps between one chunk and the next. To avoid audible gaps, the sender records audio chunks with overlaps of no more than 250 milliseconds and applies fades while the next chunk starts recording; this way, the listener hears no gaps during playback.
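For the snapshots, something along these lines could work (the resolution and JPEG quality are illustrative assumptions):

// Draw the current video frame onto a canvas and return it as a string,
// so it can travel over the WebSocket in the same way as the audio chunks.
function captureFrame(videoElement) {
  const canvas = document.createElement('canvas');
  canvas.width = 320;
  canvas.height = 180;
  canvas.getContext('2d').drawImage(videoElement, 0, 0, canvas.width, canvas.height);
  return canvas.toDataURL('image/jpeg', 0.7);
}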
Use of a Worker
Some devices have more computing power, such as desktops or laptops, while others, such as smartphones or tablets, have less. A Worker is therefore a good way to use background computing when available, allowing lower-powered devices to also run the website with media capabilities. Operations such as managing temporary arrays of frames and audio to be sent, formatting and sending media content from the sender to the server, and calculating chunk timing, the number of chunks temporarily stored and sent, and their creation times, among others, can be handled there, as sketched below. Only operations that modify the page's HTML cannot run from the Worker; everything else should run inside it to make the website more efficient.
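A sketch of this split between the page and the Worker, assuming a hypothetical sender-worker.js file and a placeholder server URL:

// main thread: hand each recorded chunk to the Worker so formatting and
// network traffic stay off the UI thread.
const senderWorker = new Worker('sender-worker.js');

function forwardChunk(blob) {
  blob.arrayBuffer().then((buffer) => {
    // Transfer the buffer to the Worker instead of copying it.
    senderWorker.postMessage({ type: 'audio-chunk', buffer, createdAt: Date.now() }, [buffer]);
  });
}

// sender-worker.js: the Worker keeps its own WebSocket, queues incoming chunks
// and sends them to the server with their creation time.
const socket = new WebSocket('wss://example.com/stream'); // placeholder URL
const queue = [];

self.onmessage = ({ data }) => {
  queue.push(data);
  if (socket.readyState !== WebSocket.OPEN) return;
  while (queue.length > 0) {
    const item = queue.shift();
    // Encode the binary payload as a string for transport, as the prototype does.
    // Fine for second-long chunks; larger buffers would need chunked conversion.
    const payload = btoa(String.fromCharCode(...new Uint8Array(item.buffer)));
    socket.send(JSON.stringify({ type: item.type, createdAt: item.createdAt, payload }));
  }
};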
Server logic for reception and distribution to media consumers
Client-side Workers on the sender send data to the server, which not only receives the audio chunks and frames but must also store the metadata related to the live media, as well as the media data itself, for later consumption. In this sense, the computing power of the server, the number of instances, and the temporary storage (RAM) are the key constraints on the number of senders and receivers that can coexist at once, as well as on the window of availability of live media. The audio chunks, frames, and metadata are preferably stored in a database, in this case MongoDB, to make them available for later consumption. However, in this model we avoid having each receiver query the database for audio chunks and frames, as this would be inefficient. Instead, if one or more receivers require live media, a single instance is activated to query all available audio chunks and frames; each receiver then gets only the chunks and frames corresponding to its connected channel and the relevant timing. Finally, audio chunks and frames older than a certain threshold are automatically deleted from both the database and the server cache to avoid overloading the server.
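A server-side sketch of this storage and fan-out model (the database and collection names, the polling and retention intervals, and the field layout are assumptions):

const { MongoClient } = require('mongodb');

const client = new MongoClient('mongodb://localhost:27017');
const ready = client.connect(); // connect once and reuse the promise
const chunks = client.db('streaming').collection('chunks');

// Store an incoming audio chunk or frame together with its metadata.
async function storeChunk(message) {
  await ready;
  await chunks.insertOne({
    channel: message.channel,
    kind: message.type,       // 'audio-chunk' or 'frame'
    payload: message.payload, // string-encoded media data
    createdAt: new Date(message.createdAt),
  });
}

// A single loop queries recent media once and fans it out per channel,
// instead of letting every receiver query the database itself.
async function distribute(receiversByChannel) {
  await ready;
  const recent = await chunks.find({ createdAt: { $gt: new Date(Date.now() - 5000) } }).toArray();
  for (const doc of recent) {
    for (const receiverSocket of receiversByChannel.get(doc.channel) || []) {
      receiverSocket.send(JSON.stringify(doc));
    }
  }
}

// Periodically delete media older than the retention window to free the database and cache.
setInterval(() => {
  ready
    .then(() => chunks.deleteMany({ createdAt: { $lt: new Date(Date.now() - 30000) } }))
    .catch(console.error);
}, 10000);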
Receiver logic
Once the receiver gets media content in the form of audio chunks and frames, the idea is that a few seconds are guaranteed for playback (buffering); then playback is scheduled. Although there are different methods of decoding and playing media on the receiver device, the lightest must be chosen to avoid consuming too many resources. In this case, the array buffers from audio chunks and frames are converted into strings to be sent through the server and database; they reach the receiver as strings, which need to be converted back into array buffers. These are stored temporarily until a few seconds of playback are guaranteed and are then handed over for playing. The audio in array-buffer form is decoded using the Web Audio API, so the audio buffers are fully available for playback planning. Once a start time is chosen for the first available audio chunk, each subsequent one is played after the duration of the previous chunk, which on this website is 1 second. The image frames are scheduled along with the audio, as each is related to an audio chunk through the metadata. The frames are then drawn on a canvas that updates in step with the audio, while the audio itself is played using the Web Audio API and managed by the receiver.
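A sketch of this decoding and scheduling logic with the Web Audio API (the base64 transport encoding, the small start offset, and the canvas handling are assumptions):

const audioContext = new AudioContext();
let nextStartTime = 0;

// Decode a string-encoded audio chunk and schedule it right after the previous one.
async function scheduleChunk(base64Audio) {
  const bytes = Uint8Array.from(atob(base64Audio), (c) => c.charCodeAt(0));
  const audioBuffer = await audioContext.decodeAudioData(bytes.buffer);

  const source = audioContext.createBufferSource();
  source.buffer = audioBuffer;
  source.connect(audioContext.destination);

  // The first chunk starts slightly in the future; each following chunk starts
  // exactly when the previous one ends, so there are no audible gaps.
  if (nextStartTime < audioContext.currentTime) {
    nextStartTime = audioContext.currentTime + 0.1;
  }
  source.start(nextStartTime);
  const startTime = nextStartTime;
  nextStartTime += audioBuffer.duration;
  return startTime; // used to line up the image frames with this chunk
}

// Draw the frame linked to a chunk (via metadata) when that chunk starts playing.
function scheduleFrame(frameDataUrl, startTime, canvas) {
  const delayMs = Math.max(0, (startTime - audioContext.currentTime) * 1000);
  setTimeout(() => {
    const img = new Image();
    img.onload = () => canvas.getContext('2d').drawImage(img, 0, 0, canvas.width, canvas.height);
    img.src = frameDataUrl;
  }, delayMs);
}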
Network issues and solutions
Problems can arise when playing live media content on the receiver. If the receiver has communication issues with the server, or if the audio chunks and frames arrive older than the maximum age allowed for playback, the system declares the reception 'lower quality'. It then sends a message to the server stating that the receiver needs to receive a smaller amount of information. It starts by requesting fewer image frames per second and, ultimately, may stop asking for image frames altogether and receive only audio chunks. This allows the reception to recover and progressively increase the number of images per second until it is aligned again with the content created by the sender. However, if this is not enough and, despite less data being transferred, there is no improvement in the time from creation to reception, the receiver logic resets the buffering and waits until a few seconds of playback are guaranteed before resuming.
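A sketch of this degradation logic on the receiver side (the thresholds, frame rates, and message format are assumptions):

const MAX_AGE_MS = 3000;      // how late media may arrive before quality is lowered
const TARGET_FPS = 4;         // frames per second produced by the sender
let requestedFps = TARGET_FPS;

function onMediaArrived(socket, createdAt) {
  const age = Date.now() - createdAt;
  if (age > MAX_AGE_MS) {
    if (requestedFps > 0) {
      requestedFps -= 1; // ask for fewer frames first, down to audio only
      socket.send(JSON.stringify({ type: 'quality', framesPerSecond: requestedFps }));
    } else {
      resetBuffering(); // nothing left to drop: re-buffer a few seconds of media
    }
  } else if (requestedFps < TARGET_FPS) {
    requestedFps += 1; // conditions improved: progressively request more frames
    socket.send(JSON.stringify({ type: 'quality', framesPerSecond: requestedFps }));
  }
}

function resetBuffering() {
  // Clear queued chunks and wait again until a few seconds of playback are guaranteed.
}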
From the sender's point of view, if the server is not receiving audio chunks and image frames within a timing threshold, it may mean the data is not arriving correctly. In this case, the sender starts by sending fewer image frames and, eventually, none at all, so that audio chunks are always prioritised; this, in turn, affects the content the receiver gets.
Innovation possibilities
The custom approach to sending live streaming content, including audio and video, poses a number of possibilities to enrich the experience for both the sender and receiver.
For instance, video or audio files can be combined with devices such as cameras and microphones and sent through the server as a single stream. One consideration is that there may be a delay when combining audio or video from files with that from devices. For this reason, it is important to provide a selector so a delay can be defined to correct any latency when combining the two; this can be done with Web Audio API delay nodes whose timing can be adjusted for synchronisation.
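A sketch of such a correction using a Web Audio DelayNode (which input is delayed, the maximum delay, and the function name are assumptions):

// Mix a file playing in an <audio>/<video> element with the live microphone,
// delaying the microphone by an adjustable amount so both line up.
async function mixFileWithMicrophone(audioContext, fileElement, delaySeconds) {
  const micStream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const micSource = audioContext.createMediaStreamSource(micStream);
  const fileSource = audioContext.createMediaElementSource(fileElement);

  const delayNode = audioContext.createDelay(5.0); // allow up to 5 seconds of correction
  delayNode.delayTime.value = delaySeconds;        // value chosen with the selector in the UI

  const mixDestination = audioContext.createMediaStreamDestination();
  micSource.connect(delayNode).connect(mixDestination);
  fileSource.connect(mixDestination);

  // The combined stream can then be handed to MediaRecorder and sent through the server.
  return mixDestination.stream;
}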
Another possibility for this website service is combining streams from two senders. The first sender sends a stream of audio and video to a second streamer, who receives it and mixes it with their own stream; the result can then be sent to the receiver as a single stream, taking into account the potential delay that occurs when combining two different streams. This way, two streamers can be perceived as one, creating the illusion that they are in the same place producing a joint stream, even though they may be geographically distant.
Other enriching elements for the streaming experience are definitely a chat system to allow real-time communication between streamers and receivers, as well as feedback from receivers to senders. Finally, video and audio elements in the streaming can be located either on the streamer's local device or via a URL available on the internet with the correct permissions. That content can then be extracted and distributed using the website server.
Concerns and future improvements
The current model seems to work fine. The network adaptation and decoding approach works well: image frames are aligned with audio chunks and, in case of issues, the number of frames is reduced so the stream automatically adapts to network conditions. However, it remains a challenge to try another approach, such as decreasing the quality of frames rather than their number, to reduce the amount of data handled; this needs further analysis so both approaches can be combined to make streaming more robust. On the other hand, as audio chunks and image frames need to be stored in a MongoDB database before being distributed to receivers, there is a delay between sending and reception. This delay prevents any instant communication with this technology, and the model only supports one streamer with one or more receivers.
Regarding the Web Audio API, it cannot start playing audio content as soon as it is available; instead, the website waits for a receiver gesture, in this case a click on the play button, so the autoplay policies of the Web Audio API are not violated (a small sketch of this appears at the end of the section). Another concern is definitely copyrighted content; currently, any type of content can be played if it is technically compatible, so receiver flags and services to detect copyrighted material need to be a priority in order to defend the rights of content creators. Finally, as the solution relies on client-side JavaScript for audio decoding, this work should ideally be managed by the server to avoid stressing and overloading the devices. This type of service is more suitable for native applications, which would also solve issues with inactivity periods or Bluetooth delays. Even though this is an in-house solution, third-party services could be researched to enhance the experience for both senders and receivers.
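A small sketch of this gesture requirement, reusing the receiver's AudioContext from the earlier sketch (the element id and the startPlayback function are hypothetical):

const playButton = document.getElementById('play');
playButton.addEventListener('click', async () => {
  // Resuming inside a click handler satisfies the browser's autoplay policy.
  if (audioContext.state === 'suspended') {
    await audioContext.resume();
  }
  startPlayback(); // begin scheduling the buffered audio chunks and frames
});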
Joe Esteves