At the beginning of May, we had to process a significant volume of small recordings in our us1 region within a short time frame.
This event consumed much of our engineering time for the entire month.
We are glad the Platform processed all the recordings correctly, without any intervention on our side. This behavior is a significant improvement over the early days of the 2020 pandemic, when many recordings that were not immediately processed were unfortunately deleted.

However, a subset of recordings from this recent wave was processed with a significant delay. Understanding and reducing this delay was an ongoing objective for the remainder of the month. The large volume of recordings also gave us the opportunity to investigate how our processing infrastructure, in its complex entirety, behaves under load and what can be done to speed it up in such scenarios.
Some of the fixes/improvements were obvious and immediate:
- We removed a legacy implementation of queue priorities
- We made small improvements to our internal retry mechanism
- We aligned the data sent by different actors to the processing workers
These were implemented right away.
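The post does not detail how the retry mechanism works, so here is a minimal, purely illustrative sketch of one common approach to retrying transient processing failures: exponential backoff with jitter. The function name and parameters are assumptions for illustration, not Pipe's actual implementation.

```python
import random
import time

def retry(op, attempts=5, base_delay=0.5, max_delay=30.0):
    """Run op(), retrying transient failures with exponential backoff.

    The delay doubles on each failed attempt (capped at max_delay) and is
    randomized ("jittered") so many failing workers don't retry in lockstep.
    Illustrative only; not Pipe's internal mechanism.
    """
    for attempt in range(attempts):
        try:
            return op()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error to the caller
            delay = min(max_delay, base_delay * 2 ** attempt)
            time.sleep(delay * random.uniform(0.5, 1.0))
```

A caller would wrap a single processing step, e.g. `retry(lambda: push_to_storage(recording))`, so one flaky network call does not fail the whole job.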
Others, however, need a significant time investment in research & implementation:
- We want to dynamically scale the number of recordings we process at the same time, up and down, while keeping resource usage (CPU, RAM, network, etc.) within comfortable limits.
- We want to implement per-account processing queues so that one client can't take over all available processing workers.
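One simple way to realize the per-account fairness goal above is to keep a separate queue per account and dispatch jobs round-robin across accounts, so one client with thousands of pending recordings still yields a worker slot to everyone else. This is a minimal sketch under that assumption; the class and method names are hypothetical, not Pipe's design.

```python
from collections import OrderedDict, deque

class FairDispatcher:
    """Round-robin dispatch across per-account job queues.

    A busy account's queue is rotated to the back after each pick, so no
    single account can monopolize the processing workers. Illustrative only.
    """

    def __init__(self):
        # account_id -> deque of pending jobs, in round-robin order
        self.queues = OrderedDict()

    def enqueue(self, account_id, job):
        self.queues.setdefault(account_id, deque()).append(job)

    def next_job(self):
        """Return the next job, alternating fairly between accounts."""
        if not self.queues:
            return None
        # Pop the account that has waited longest for a turn
        account_id, queue = self.queues.popitem(last=False)
        job = queue.popleft()
        if queue:
            # Re-append at the back so other accounts go first next time
            self.queues[account_id] = queue
        return job
```

With three jobs queued for one account and one for another, dispatch interleaves them instead of draining the large queue first. The dynamic-scaling goal would sit on top of this: a controller adjusts how many `next_job()` consumers run concurrently based on observed CPU/RAM/network usage.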
With this event, we also broke our internal records for:
- most recordings processed in a region per hour: 17,247 on us1 on the 5th of May 2022
- most recordings processed in a region in any 24h interval: 219,383 on
Amongst all this work, the Pipe Platform also very nearly processed an 18GB video. A happy surprise, as we only officially support & test recording files up to 5GB in size. It failed only because our push to AWS S3 feature is technically limited to 5GB.
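That 5GB ceiling matches S3's documented cap for a single PUT request; S3's multipart upload raises the per-object limit to 5TB by splitting the file into parts. As a rough illustration of the arithmetic involved (the function names and the 256MB part size are assumptions, not Pipe's code):

```python
GB = 1024 ** 3
SINGLE_PUT_LIMIT = 5 * GB  # S3's documented cap for one PUT request

def needs_multipart(size_bytes):
    """A single S3 PUT tops out at 5GB; anything larger must be multipart."""
    return size_bytes > SINGLE_PUT_LIMIT

def part_count(size_bytes, part_size=256 * 1024 ** 2):
    """How many fixed-size parts a multipart upload of this file would need."""
    return -(-size_bytes // part_size)  # ceiling division

# An 18GB recording exceeds the single-PUT limit and would need
# a multipart upload of 72 parts at 256MB each:
print(needs_multipart(18 * GB), part_count(18 * GB))  # True 72
```

In practice, AWS SDKs such as boto3 switch to multipart automatically above a configurable size threshold, so supporting files beyond 5GB is mostly a matter of adopting that upload path.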
Later edit: the Qualtrics tutorial was also seriously brought up to date.