CREATING A LIVE STREAMING PROTOTYPE
Background:
Understanding a technology and validating potential use cases through prototypes is one of the greatest challenges in software development, as it requires a balance between theoretical vision and technical viability. It is essential to remember that all the frameworks and tools that currently define the state of the art in software development were, in their origin, experiments and proofs of concept designed to solve specific problems. This process of experimentation is what allows an abstract idea to materialise into a robust architecture, establishing the foundations upon which the technical solutions we use daily are built.
In my case, I have always been drawn to live streaming applications, particularly the challenge of transmitting audio and video from sender to receiver under real-time conditions. For a call to be viable, latency, defined as the time it takes for an analogue signal such as image or sound to be digitised, encoded, transferred across the network infrastructure, and finally decoded at the destination, must be kept under control. Even across considerable geographical distances, programs like Microsoft Teams, Google Meet or WhatsApp deliver latency low enough for people to feel they are interacting in real time. In practice, these applications typically operate at around 500 milliseconds, a figure that is adequate for a conversation but insufficient if one attempts to sing a song or play an instrument in unison, as that offset prevents the synchronisation required for joint artistic activities.
For this reason, I decided to develop my own prototype, which eventually gave rise to Plaudere; it began, however, as an exercise to understand in depth how live audio and video transfer works between senders and receivers, exploring its limitations and experimenting with solutions. Although this type of software is already available in various services, in-house development makes it possible to analyse critical concepts such as adaptive transfer under network constraints. The design of this application integrates media encoding at the sender end and decoding logic at the receiver, ensuring that the data flow remains resilient in the face of bandwidth fluctuations and processing times.
To achieve optimal performance with minimal dependency on external libraries, the development relies on a simplified yet powerful stack that allows total control over the data flow. On the server side, we use Node.js and Express.js to manage the base architecture and packet routing, while the client is built exclusively with pure web technologies: HTML, CSS, and JavaScript. The pillars of this multimedia implementation are the Web Audio API for its advanced decoding and buffer management capabilities, the MediaRecorder API for the precise capture of media fragments or chunks at the sender, and WebSockets to ensure instantaneous bidirectional distribution of multimedia messages through the server.
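As a minimal sketch of that skeleton, a Node.js server with Express and the ws library could serve the static client and relay multimedia messages between connected sockets. The port, folder name, and relay-to-everyone policy are illustrative assumptions, not the exact Plaudere implementation.

const express = require('express');
const http = require('http');
const { WebSocket, WebSocketServer } = require('ws');

const app = express();
app.use(express.static('public')); // serves the plain HTML/CSS/JS client

const server = http.createServer(app);
const wss = new WebSocketServer({ server });

wss.on('connection', (socket) => {
  socket.on('message', (data) => {
    // Relay every incoming multimedia message to all other connected peers.
    for (const client of wss.clients) {
      if (client !== socket && client.readyState === WebSocket.OPEN) {
        client.send(data);
      }
    }
  });
});

server.listen(3000, () => console.log('Streaming prototype listening on :3000'));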
Sender logic:
The first critical challenge after defining the stack was the capture and generation of media for transfer. The sender has the capacity to integrate various sources, ranging from locally stored video and audio files to live devices such as cameras and microphones, even allowing for a combination of both for more complex multimedia creations. To manage this data input, we use the MediaRecorder interface, which allows the capture of the stream and its segmentation into audio chunks at a regular interval of one second.
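In its simplest form, before the overlap refinement described next, the capture step could look like the following sketch; the mimeType and the sendChunk helper are assumptions for illustration.

async function startAudioCapture() {
  const stream = await navigator.mediaDevices.getUserMedia({ audio: true });
  const recorder = new MediaRecorder(stream, { mimeType: 'audio/webm;codecs=opus' });

  recorder.ondataavailable = (event) => {
    if (event.data.size > 0) {
      sendChunk(event.data); // hypothetical upload helper, see the worker section
    }
  };

  recorder.start(1000); // emit one Blob of audio roughly every second
}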
However, handling video requires significantly more complex tools to generate continuous stream fragments, so to simplify this proof of concept, snapshots of the video source are captured in synchrony with the audio. In this way, the sender transmits both the audio chunks and the corresponding image frames to the receiver so they can be reconstructed and played in unison. A fundamental engineering detail in this process is the management of audio to avoid the tick phenomenon, audible gaps between fragments, a common issue when millisecond gaps exist between the decoding of one chunk and the next. To ensure a continuous listening experience, the sender generates recordings of 1000 milliseconds with a 330 millisecond overlap, which allows for smooth transitions or fades on the receiver side as the playback of the next fragment begins.
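One way to realise the overlap, assuming alternating MediaRecorder instances and hypothetical enqueue helpers, is to start each new recording 330 ms before the previous one stops, taking a canvas snapshot of the video element alongside each chunk:

function startOverlappingCapture(stream, videoEl, canvas) {
  const ctx2d = canvas.getContext('2d');

  const recordOnce = () => {
    const rec = new MediaRecorder(stream);
    rec.ondataavailable = (e) => enqueueAudioChunk(e.data); // hypothetical
    rec.start();

    // Snapshot the video source in synchrony with this audio chunk.
    ctx2d.drawImage(videoEl, 0, 0, canvas.width, canvas.height);
    canvas.toBlob((blob) => enqueueFrame(blob, Date.now())); // hypothetical

    setTimeout(() => rec.stop(), 1000); // each recording lasts 1000 ms
    setTimeout(recordOnce, 670);        // the next one starts 330 ms early
  };
  recordOnce();
}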
Use of Workers:
Given the disparity in computing power between devices, where smartphones and tablets face greater constraints than PCs and laptops, the use of Web Workers is essential to guarantee the viability of the web application. All intensive operations are delegated to a background processing thread, allowing even lower-powered devices to run the multimedia capabilities of the site without compromising the fluidity of the user interface. In this isolated environment, the application manages temporary arrays of frames and audio, formats the data for transmission to the server, and computes precise metadata, such as timing markers and creation timestamps for each chunk. Operating under the premise that only functions directly modifying the DOM must remain on the main thread, the use of Workers allows synchronisation logic and stream transmission to be performed efficiently, avoiding navigation freezes and maximising the overall performance of the application.
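A sketch of that division of labour, with illustrative file names and an assumed string-encoding helper, might hand each chunk to a Web Worker as a transferable buffer and let the worker attach metadata and ship it over the WebSocket:

// main.js — only DOM-facing code remains on the main thread
const worker = new Worker('sender-worker.js');
recorder.ondataavailable = async (event) => {
  const buffer = await event.data.arrayBuffer();
  // Transfer ownership of the buffer to the worker instead of copying it.
  worker.postMessage({ buffer, createdAt: Date.now() }, [buffer]);
};

// sender-worker.js — formatting, metadata, and transmission off the main thread
const socket = new WebSocket('ws://localhost:3000');
self.onmessage = ({ data }) => {
  socket.send(JSON.stringify({
    type: 'audio-chunk',
    createdAt: data.createdAt,
    payload: encodeBufferAsString(data.buffer), // hypothetical, see the receiver section
  }));
};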
Server logic for reception and distribution to media consumers:
The server acts as the core for reception and distribution, processing data sent from the sender-side Web Worker. Its function goes beyond reception, as it must fully manage both synchronisation metadata and the multimedia content itself for later consumption. In this architecture, the computing power of the server, the number of active instances, and the available RAM become the critical factors that dictate the limit for user concurrency and the availability window of live media. For persistent storage, MongoDB was chosen, hosting the audio chunks, image frames, and their associated metadata.
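Persisting one incoming message could look like this sketch with the official Node.js driver; the database, collection, and field names are assumptions:

const { MongoClient } = require('mongodb');
const client = new MongoClient('mongodb://localhost:27017');

async function storeChunk(message) {
  await client.connect(); // a no-op after the first successful connection
  await client.db('streaming').collection('chunks').insertOne({
    streamId: message.streamId,             // which live channel this belongs to
    kind: message.type,                     // 'audio-chunk' or 'image-frame'
    createdAt: new Date(message.createdAt), // drives both ordering and cleanup
    payload: message.payload,               // string-encoded binary content
  });
}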
However, to maximise efficiency and avoid bottlenecks, an optimised query model was implemented that prevents each receiver from making independent requests to the database. Instead, the application activates a single instance responsible for querying the available fragments and distributing them en masse to the corresponding channels according to the timing required by each stream. This centralisation of the query drastically reduces the load on the database and improves overall response times. Finally, to ensure long-term application health, an automatic cleanup process was implemented to delete data exceeding a predefined age threshold, thus avoiding server cache overflow and the exhaustion of physical storage.
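The sketch below shows one possible shape for this: a single polling loop fans fresh documents out to every receiver channel, and a MongoDB TTL index stands in for the age-based cleanup. The interval, expiry window, and fan-out helper are assumptions.

async function startDistribution(chunks) {
  // One way to implement the cleanup: a TTL index lets MongoDB delete
  // documents itself once they exceed the availability window.
  await chunks.createIndex({ createdAt: 1 }, { expireAfterSeconds: 60 });

  let lastPoll = new Date();
  setInterval(async () => {
    // A single query on behalf of all receivers, instead of one per receiver.
    const fresh = await chunks.find({ createdAt: { $gt: lastPoll } })
                              .sort({ createdAt: 1 })
                              .toArray();
    if (fresh.length > 0) lastPoll = fresh[fresh.length - 1].createdAt;
    for (const doc of fresh) {
      broadcastToChannel(doc.streamId, doc); // hypothetical WebSocket fan-out
    }
  }, 250);
}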
Receiver logic:
Once the receiver obtains media content in the form of audio chunks and frames, the application guarantees a safety margin of a few seconds for buffering before starting the scheduled playback. For this prototype, we chose a lightweight transfer method to minimise resource consumption: the array buffers from audio chunks and frames are converted into strings for transmission through the server and database, and then converted back into their original binary format upon reaching the receiver. This data is stored temporarily until the necessary playback window is secured to prevent interruptions. The audio is decoded using the Web Audio API, allowing for extremely precise scheduling of the output timeline. Once the timing for the first available audio chunk is established, subsequent blocks are chained mathematically based on their nominal one-second duration. Simultaneously, the video frames are synchronised using the associated metadata, employing a canvas that updates the visuals in coordination with the audio stream managed by the receiver.
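The scheduling core could be sketched as follows; the stringToArrayBuffer helper is the assumed inverse of the sender's string encoding, and the two-second margin is an example value:

const audioCtx = new AudioContext();
let nextStartTime = 0;

async function scheduleChunk(encodedChunk) {
  const bytes = stringToArrayBuffer(encodedChunk); // hypothetical decoding helper
  const audioBuffer = await audioCtx.decodeAudioData(bytes);

  const source = audioCtx.createBufferSource();
  source.buffer = audioBuffer;
  source.connect(audioCtx.destination);

  if (nextStartTime === 0) {
    // Safety margin: start a few seconds behind real time to absorb jitter.
    nextStartTime = audioCtx.currentTime + 2;
  }
  source.start(nextStartTime);
  nextStartTime += 1; // chain blocks by their nominal one-second duration
}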
Network issues and solutions:
Problems inevitably arise when playing live media content on the receiver side. If the receiver experiences communication issues with the server, or if the audio chunks and frames arrive older than the maximum age allowed for playback, the application declares the reception as lower quality. It then triggers a protocol to reduce the data load, starting by requesting fewer image frames per second and, ultimately, stopping the video feed altogether to prioritise audio chunks. This mechanism allows the reception to stabilise and progressively increase the frame rate until it is aligned with the original content created by the sender. However, if these measures are insufficient and the timing from creation to reception does not improve despite the reduced data transfer, the receiver logic resets the buffering process and waits until a few seconds of playback are once again guaranteed.
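A sketch of that degradation protocol, with assumed thresholds and message names, could adjust the requested frame rate on every arrival:

const MAX_AGE_MS = 3000; // oldest acceptable chunk age (assumed threshold)
const SOURCE_FPS = 10;   // frame rate of the original content (assumed)
let requestedFps = SOURCE_FPS;

function assessReception(meta) {
  const age = Date.now() - meta.createdAt;
  if (age > MAX_AGE_MS) {
    // Degrade: request fewer frames, and eventually audio only (fps = 0).
    requestedFps = Math.max(0, requestedFps - 2);
  } else if (requestedFps < SOURCE_FPS) {
    // Conditions improved: ramp back up towards the original frame rate.
    requestedFps += 1;
  }
  socket.send(JSON.stringify({ type: 'set-fps', fps: requestedFps }));
}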
From the sender's perspective, the application also implements proactive resilience. If the server does not receive audio chunks and image frames within a specific timing threshold, it may indicate that the upstream data is not being transmitted correctly. In such cases, the sender begins to transmit fewer image frames or stops sending them entirely to always prioritise the audio stream, which in turn dictates the quality of the content the receiver gets. This dual approach ensures that even under adverse network conditions, the audio component, which is the backbone of the communication, remains the application's absolute priority.
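On the sender side, the equivalent fallback could watch for server acknowledgements; the thresholds, the 'ack' message, and the frameSender helper are illustrative assumptions:

let lastAckAt = Date.now();
socket.addEventListener('message', (e) => {
  if (JSON.parse(e.data).type === 'ack') lastAckAt = Date.now();
});

setInterval(() => {
  const silence = Date.now() - lastAckAt;
  if (silence > 4000)      frameSender.stop();     // audio only
  else if (silence > 2000) frameSender.setFps(2);  // degrade video first
  else                     frameSender.setFps(10); // normal operation
}, 1000);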
Innovation possibilities:
The custom approach adopted for this prototype opens a range of advanced functions that could transform the experience for both users and creators. One of the most promising possibilities is complex stream mixing, allowing the combination of local multimedia files with live device signals such as cameras and microphones to be sent as a single unified stream. In this scenario, latency management is critical, which is why manual selectors have been implemented to allow delay definitions through specific Web Audio API nodes, aligning sources with different response times with millisecond precision. Furthermore, this architecture facilitates collaborative streaming, where a first sender can transmit their signal to a second person who integrates both into a joint production. This model creates the powerful illusion of physical presence within the same creative space, enabling artistic collaborations between geographically distant individuals. To further enrich interaction, the prototype features the integration of a real-time chat, even allowing multimedia content to be fetched from external URLs with the appropriate permissions to be distributed through the server.
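Aligning two sources with a DelayNode, one of the Web Audio API nodes alluded to above, might look like this; micStream, audioTag, and the 0.25 s figure are example assumptions:

const ctx = new AudioContext();
const live = ctx.createMediaStreamSource(micStream); // live microphone signal
const file = ctx.createMediaElementSource(audioTag); // local file playback

const delay = ctx.createDelay(5); // up to five seconds of delay available
delay.delayTime.value = 0.25;     // value chosen via the manual selector

const mix = ctx.createGain();
live.connect(delay).connect(mix); // delay the faster source...
file.connect(mix);                // ...so both reach the mix in step
mix.connect(ctx.destination);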
Concerns and future improvements:
Although the current model successfully validates the viability of the technology, technical challenges persist that define my roadmap for future improvements. While network adaptation through frame reduction works correctly, a priority area for improvement is evaluating the reduction of individual image quality instead of merely decreasing the FPS rate, which would provide greater robustness against unstable connections. On the other hand, the intermediate data step through MongoDB introduces an inevitable delay that prevents strictly instantaneous communication, limiting the current model to a one-to-many relationship with a slight lag. Likewise, the application natively complies with modern browser playback policies by managing the requirement for explicit user interaction to activate the audio context.
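That compliance amounts to resuming the audio context only inside a user gesture, as in this brief sketch; playButton and startPlayback are illustrative names:

playButton.addEventListener('click', async () => {
  if (audioCtx.state === 'suspended') {
    await audioCtx.resume(); // permitted because it runs inside a user gesture
  }
  startPlayback(); // hypothetical entry point into the receiver logic
});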
On the horizon of Plaudere's development, the protection of intellectual property stands as an absolute priority, requiring the future implementation of copyright detection systems to safeguard creators' rights over transmitted material. From a hardware performance standpoint, the goal is to migrate much of the decoding processing from the client to the server to alleviate the computational stress on end-user devices. Although this solution has been developed entirely in-house, I recognise that exploring third-party services and a possible transition towards native applications would allow for the resolution of persistent issues such as application inactivity periods or the latencies inherent to Bluetooth protocols. This prototype is not the end of the road, but rather proof that a resilient architecture, conscious of its limitations, can reclaim the magic of joint creation in the digital age.