Simon Eskildsen on scaling Shopify, building turbopuffer, and the future of databases
Summary
Shopify's infrastructure team handled million-person flash sales by building homegrown connection pooling and load testing tools before cloud proxies existed. The experience showed that random 100x traffic spikes from celebrity product drops required different preparation strategies than predictable peak shopping days.
Key Takeaways
- Uncover hidden system killers through kernel tracing: a PHP cron job running lsof hourly on MySQL instances stalled the cluster primaries for about 30 seconds at a time. Kernel tracing tied the stall to a soft lockup, and PHP turned out to have been pulled in as a dependency of the Percona utilities, so finding the cause required deep inspection across dependency chains.
- Connection pooling was a critical bottleneck at scale: Shopify managed 30,000-40,000 concurrent MySQL connections from Ruby/Python processes (10-100 QPS each), spending enormous CPU time polling those connections with epoll before modern open-source proxies existed.
- Flash sales require different preparation than predictable peaks: random 1,000 to 100,000 RPS spikes from celebrity Instagram posts (Kylie Jenner, Kanye West) created unpredictable SEVs, driving massive inventory lock contention on single MySQL rows.
- Load testing with simple Ruby scripts on multiple servers was the primary preparation tool: teams mimicked user inventory contention patterns and reservation logic to identify bottlenecks before live traffic hit.
- Inventory row lock contention becomes the critical constraint at extreme scale: hundreds of thousands of users fighting for 10,000 SKUs simultaneously requires rethinking transaction isolation and reservation patterns.
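The load-testing and lock-contention points above can be illustrated with a small simulation. This is a minimal sketch, not Shopify's tooling: the episode describes simple Ruby scripts run from multiple servers against real MySQL, while this version uses Python threads and an in-process lock as a stand-in for a single inventory row lock. The names `InventoryRow` and `run_flash_sale` are hypothetical.

```python
import threading
from concurrent.futures import ThreadPoolExecutor


class InventoryRow:
    """Hypothetical stand-in for one inventory row guarded by a row lock."""

    def __init__(self, stock: int) -> None:
        self.stock = stock
        self._lock = threading.Lock()  # plays the role of the database row lock

    def reserve(self) -> bool:
        # Every buyer serializes on this one lock, which is the contention
        # pattern a flash sale creates on a single hot row.
        with self._lock:
            if self.stock > 0:
                self.stock -= 1
                return True
            return False


def run_flash_sale(stock: int, buyers: int, workers: int = 32) -> dict:
    """Fire `buyers` concurrent reservation attempts at a single row."""
    row = InventoryRow(stock)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda _: row.reserve(), range(buyers)))
    sold = sum(results)
    return {"sold": sold, "rejected": buyers - sold, "remaining": row.stock}


if __name__ == "__main__":
    print(run_flash_sale(stock=100, buyers=5_000))
```

Because every reservation passes through the same lock, the run never oversells, but all buyers queue behind one another. A real load test would replace the in-process lock with transactions against a test database and measure how latency grows as the number of concurrent buyers rises.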
Transcript Excerpt
So you scaled a lot from 2010 to... -Yeah. -...2020. -Yeah. -What were the great SEVs in Shopify history?
One of the funniest ones was we had this problem where about every hour, the primary or the writer of the MySQL clusters would stall for about 30 seconds, and we couldn't figure out why. We were debugging this endlessly, could not figure out what was going on, and someone figured out when this was going on, there was an lsof running on these machines. I was like, "Okay, why is this lsof running?" And someone was tracing the kernel, figuring out, "Okay, this is causing a soft lockup in the kernel. Where is this lsof coming from?" It turns out that some of the Percona utilities, which are some of the Perl scripts you use to manage MySQL, drew in PHP as a dependency, and PHP as a dependency…