We've worked on two big changes over the last month:
We've placed Nginx in front of our Node ingestion servers
Nginx will allow us to scale the ingestion capacity in every region across several servers. It also provides better monitoring of resources, errors and live connections, better logging, throttling and easier management of SSL/TLS certificates.
Another big benefit is a significant reduction in duplicate recordings in our database, which in turn eliminates the recordings that failed to copy because they were one of the duplicates. The reduction comes from Nginx's built-in support for sticky sessions: if a device gets disconnected and tries to reconnect, Nginx can ensure it always reaches the same backend server or process.
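To make the sticky-session idea concrete, here is a minimal sketch of hash-based routing. It is not our actual Nginx configuration: the backend addresses, the client key and the `pick_backend` helper are all hypothetical, and the sketch only illustrates the general principle that the same key always maps to the same backend.

```python
import hashlib
from typing import Sequence

# Hypothetical backend addresses, for illustration only.
BACKENDS = ["10.0.0.1:8080", "10.0.0.2:8080", "10.0.0.3:8080"]


def pick_backend(client_key: str, backends: Sequence[str]) -> str:
    """Deterministically map a client key (e.g. an IP address) to one backend."""
    digest = hashlib.sha256(client_key.encode("utf-8")).digest()
    # Use the first 8 bytes of the digest as an integer and wrap it
    # around the number of available backends.
    index = int.from_bytes(digest[:8], "big") % len(backends)
    return backends[index]


first = pick_backend("203.0.113.7", BACKENDS)
again = pick_backend("203.0.113.7", BACKENDS)  # the same device reconnecting
print(first == again)
```

Plain modulo hashing like this remaps many clients whenever the backend list changes. Nginx's upstream module offers `ip_hash` and `hash ... consistent` for this kind of routing, the latter limiting how many keys move when servers are added or removed.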
New recovery mechanism produces no more duplicates
Desktop HTML5 recorder: if you're in the middle of a recording (or you're waiting for the stream to finish uploading over a slower connection) and your connection dies completely (after 30 re-connection attempts), a recovery mechanism kicks in to save and process the data that reached our servers.
This mechanism was the source of most of the duplicate recordings we saw. We've taken a deep look at it and decided to separate it from the ingestion logic. Since then, it has stopped producing duplicates.
On top of that, the feature now works independently of, and across, media server restarts. We can also trigger it internally on demand, and it retries the recovery if it doesn't go through for some reason.
This new recovery mechanism is currently being rolled out.
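As a rough illustration of how a recovery step can retry without creating duplicates, here is a minimal sketch. It is not our implementation: `recover`, `process`, `done` and `max_attempts` are hypothetical names, and the in-memory set stands in for whatever persistent store would be needed to survive media server restarts.

```python
from typing import Callable, Set


def recover(
    stream_id: str,
    process: Callable[[str], None],
    done: Set[str],
    max_attempts: int = 3,
) -> bool:
    """Recover a stream at most once, retrying if processing fails.

    `done` is an assumed record of stream IDs that were already recovered.
    """
    if stream_id in done:
        return True  # already recovered, so do not produce a duplicate
    for _ in range(max_attempts):
        try:
            process(stream_id)
        except Exception:
            continue  # processing failed, try again
        done.add(stream_id)
        return True
    return False  # every attempt failed; the caller can trigger it again later
```

Keying the work on the stream ID is what makes an on-demand re-run safe: calling `recover` again for a stream that has already been handled returns immediately instead of processing it a second time.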
Correct time to recovery (but also longer)
We've also learned that the 30 re-connection attempts in the HTML5 desktop recording client can take up to 750 seconds (30 attempts * (5 seconds max delay + 20 seconds max timeout) = 750 seconds = 12.5 minutes), so we're now recovering recordings 14 to 15 minutes after the user was last connected.
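The worst-case figure can be checked with a few lines of arithmetic. The constant and function names below are made up for illustration; only the numbers come from the client's settings described above.

```python
MAX_ATTEMPTS = 30    # re-connection attempts before the client gives up
MAX_DELAY_S = 5      # maximum delay before each attempt, in seconds
MAX_TIMEOUT_S = 20   # maximum time each attempt may take before timing out


def worst_case_reconnect_seconds(attempts: int, max_delay_s: int, max_timeout_s: int) -> int:
    """Upper bound on the time spent re-connecting, assuming every attempt
    waits the full delay and then runs into the full timeout."""
    return attempts * (max_delay_s + max_timeout_s)


total = worst_case_reconnect_seconds(MAX_ATTEMPTS, MAX_DELAY_S, MAX_TIMEOUT_S)
print(total, total / 60)  # 750 12.5
```

Starting recovery at 14 to 15 minutes leaves a margin of roughly two minutes on top of that 12.5 minute upper bound.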
We're updating all our documentation and public info to reflect the new times and mechanics.