Donald Res| Thu Jan 29 2015 CET| Cleeng Nuts & Bolts
Unfortunately, our industry had to learn the complexity of ‘scalability’ the hard way. Take for example this or this organisation. I can proudly say that Cleeng is an exception. We’ve mastered the art of proper growth! Scalability is our ever-evolving “key feature”, continuously nurtured by the best people in our organisation.
What’s the main problem other companies seem to overlook? They don’t know what their (next) bottleneck is. And yet all they need to do is notice (and imitate) a real user’s behaviour. Simple, right? Monitoring real traffic patterns and analysing them helps you understand what is causing trouble.
However, from experience we know such iterations are difficult and take time. I see more and more web agencies that get involved in pay-per-view but lack the crucial insight into hard data. This results in a wonky DIY setup… and in obvious PPV disasters.
With our experience in PPV events and a pretty busy API (on an average day the API is hit about 10 million times), we ‘luckily’ deal with enough data and events each week to scale our platform the right way, free from mistakes.
In the very first year after launching Cleeng, we learned that scalability (and availability) is crucial to our business. We took large servers and optimized the application to achieve what we thought was ‘high scalability’.
However, after our first ‘serious’ PPV event (~10k tickets), which took place three years ago, we realized that any really big event (100k+ tickets) would need far more capacity. This is mainly due to the nature of PPV events.
They show huge spikes in visitors, especially in the last ten minutes before the broadcast starts. But it’s not only about the number of people who show up to purchase a ticket and our being ready to handle them. At the same time we have to deal with dependencies on external systems (payments, emails, etc.).
How did we tackle that challenge? We introduced Cleeng LiveHD, a dedicated module capable of smoothly handling huge traffic peaks. On the flip side, in order to scale up well we had to strip out many cool features we were developing at that time for our main platform (coupons, social discounts, local payments, instant reporting, etc.).
LiveHD was built on DynamoDB and had a dedicated payment gateway, and its emails were specifically configured to avoid being flagged as spam because of a sudden spike in volume. It was a very neat application that needed only 4 database queries for one visitor to sign up, pay and have their access controlled. The essential elements to organize an online event were in place, but we had to make a tough choice between guaranteed scalability and fully fledged functionality.
In retrospect I can say we made the right choice. LiveHD has been used by many large pay-per-view event organizers, for events such as boxing fights and the FIFA World Cup. The extra benefit: we became confident that we can handle the biggest PPV events in the world, and we gained a tremendous amount of experience in how to scale.
With our past experience and efforts throughout the years, we went through most of the common scalability stages. We use CDNs (both Cloudflare and CloudFront) to offload the origin, and we have load balancers in place to direct traffic to our different application servers, which scale up when needed. We also reworked our applications to support multiple databases, and we tuned most of our database queries. A lot has been written on the web about those topics, so I won’t repeat it here. But we still have some unique observations to share:
Bringing this statement to the physical world: consider a highway on which you want to bring as many cars as possible from A to B in a given time. You have two options: 1. creating new lanes… or 2. getting fast cars (and letting them drive fast). Obviously both help total throughput significantly. The lanes equal extra servers, the fast cars equal fast-loading pages. Our ambition is to load event and payment pages within 2 seconds around the world. And although the page load time your audience experiences is not the same thing as the number of buyers and viewers we handle in parallel, it heavily influences the number of users a platform can handle in a given time.
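The highway analogy can be put into numbers with a deliberately simple model. This is only a sketch under stated assumptions: every server runs a fixed number of workers (`workers_per_server` is a hypothetical parameter, not a Cleeng figure) and each worker serves one page at a time.

```python
def pages_per_minute(servers: int, workers_per_server: int,
                     seconds_per_page: float) -> float:
    """Rough upper bound on pages served per minute.

    Assumes each worker handles one page at a time and that nothing else
    (database, network, external services) is the limiting factor.
    """
    return servers * workers_per_server * 60 / seconds_per_page


# Illustrative numbers only.
baseline = pages_per_minute(servers=4, workers_per_server=10, seconds_per_page=2.0)
more_lanes = pages_per_minute(servers=8, workers_per_server=10, seconds_per_page=2.0)
faster_cars = pages_per_minute(servers=4, workers_per_server=10, seconds_per_page=1.0)

print(baseline, more_lanes, faster_cars)
```

In this model, doubling the servers and halving the page time give the same gain in throughput, which is why page speed counts as a scalability measure and not only as a user-experience one.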
As with any project, it’s always important to take the end user’s perspective into account. What are they doing? What do most people click on when they want to watch an event? What is the most common path to get there? Any optimisation project should start with mapping out the main path. With code and queries, things can always be made faster. Shaving 30% off a query that is ‘experienced’ by only 5% of your users has a totally different overall impact than shaving 10% off a query that is encountered by 90% of your users. It’s a no-brainer, but I have seen so many professionals forget to focus on the right priorities. Choose smartly: your time is limited.
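To make the 30%-on-5% versus 10%-on-90% comparison concrete, here is a minimal sketch. It assumes both queries cost about the same before optimisation, so the overall effect is simply the share of users multiplied by the saving; `overall_saving` is an illustrative helper, not part of any Cleeng code.

```python
def overall_saving(share_of_users: float, saving: float) -> float:
    """Average saving per user, as a fraction of the query's original cost.

    Assumes the query costs the same for every user who runs it.
    """
    return share_of_users * saving


rare_but_big = overall_saving(share_of_users=0.05, saving=0.30)
common_but_small = overall_saving(share_of_users=0.90, saving=0.10)

print(f"30% saved for 5% of users:  {rare_but_big:.3f}")
print(f"10% saved for 90% of users: {common_but_small:.3f}")
print(f"ratio: {common_but_small / rare_but_big:.1f}x")
```

Under that assumption the smaller optimisation on the common path is worth about six times more across the whole audience.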
At Cleeng we cooperate with companies that are specialized in performing load tests. When they hit us with thousands of concurrent users, we measure the time it takes for each individual to complete the main path (see point 2): to load the event page, register, pay and watch the event. When we scale up concurrent users, at a certain point we reach a bottleneck – and this becomes visible because the time it takes per user increases. These graphs are our dashboard for ‘the next phase’ – as we can break it down into all the separate steps that happen. By looking at these timings we know which function to optimise next.
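One way to read such load-test timings is to compare, per step, how much slower the step gets between the lowest and the highest load level. The sketch below assumes the results are available as lists of per-user timings in seconds; the step names and all numbers are invented for illustration.

```python
from statistics import median

# Hypothetical load-test results: seconds per step, keyed by concurrent users.
SAMPLE_TIMINGS = {
    1_000: {
        "load_event_page": [0.8, 0.9, 1.0],
        "register": [0.5, 0.6, 0.7],
        "pay": [1.0, 1.1, 1.2],
        "watch": [0.4, 0.5, 0.6],
    },
    10_000: {
        "load_event_page": [1.0, 1.1, 1.2],
        "register": [0.7, 0.8, 0.9],
        "pay": [3.9, 4.4, 5.0],
        "watch": [0.5, 0.6, 0.7],
    },
}


def step_slowdown(timings: dict) -> dict:
    """Median time at the highest load divided by median time at the lowest."""
    low, high = min(timings), max(timings)
    return {
        step: median(timings[high][step]) / median(timings[low][step])
        for step in timings[low]
    }


def next_bottleneck(timings: dict) -> str:
    """Name of the step that degrades the most as load goes up."""
    slowdown = step_slowdown(timings)
    return max(slowdown, key=slowdown.get)


print(next_bottleneck(SAMPLE_TIMINGS))
```

With this sample data the payment step slows down far more than the others, so it would be the next function to optimise.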
Something that companies like Facebook excel at: they hardly need to query the database to deliver personalized information to your browser. Instead, they smartly put those details into dedicated caching layers (like memcached or ElastiCache). We adopted the same approach at Cleeng, which leaves only a few database writes to ensure no data gets lost, no matter what happens. Key takeaway: we now only need to optimise those few database queries.
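The caching idea can be sketched as a small read-through cache. This is an assumption-laden illustration, not Cleeng's implementation: a plain dictionary stands in for memcached or ElastiCache, and `load_profile` is a hypothetical stand-in for a real database query.

```python
class ReadThroughCache:
    """Serve reads from memory; hit the database only on a cache miss."""

    def __init__(self, load_from_db):
        self._load_from_db = load_from_db
        self._store = {}  # stand-in for a memcached/ElastiCache client

    def get(self, key):
        if key not in self._store:
            # Cache miss: one database read, then keep the result.
            self._store[key] = self._load_from_db(key)
        return self._store[key]

    def invalidate(self, key):
        """Drop a cached entry, e.g. after the underlying row was written."""
        self._store.pop(key, None)


stats = {"db_reads": 0}


def load_profile(user_id):
    """Hypothetical database lookup; counts how often it is called."""
    stats["db_reads"] += 1
    return {"user_id": user_id, "has_access": True}


cache = ReadThroughCache(load_profile)
for _ in range(1000):
    cache.get("viewer-42")

print(stats["db_reads"])  # 1: the other 999 reads never reached the database
```

A production cache also needs expiry and a policy for what happens when the cache itself is unavailable; both are left out here to keep the pattern visible.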
When running load tests, we cover the user’s full path. Many applications nowadays rely on external services. This can be a Facebook API, an email gateway or any other external API. Often these APIs limit how often you can call them, and they also tend to add delay to the user path. When you run a test, ideally you test these dependencies as well, as our experience shows that many bottlenecks actually come from those external services.
In line with my previous point, many applications depend on external apps… and therefore on their owners, your partners. You need to ensure they can handle your traffic well and will not suddenly restrict you. That is why Cleeng works with the most modern and scalable email (Mandrill) and payment (Adyen) gateways available. These companies also support Booking.com, Groupon and other high-traffic sites. We run load tests together with our partners to ensure the full user path is covered.
This lesson is more technical, but by applying it smartly we gained a lot of scale. We assigned priorities to the different tasks in our code. Some tasks don’t need to be executed immediately, while others must be done instantly. For example, an email confirming a payment can be delayed a little, while a “forgot password” email should be delivered as fast as possible. By smartly organising such tasks we have greatly optimized for scale.
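Task prioritisation of this kind can be sketched with a priority queue. The example below uses Python's standard `heapq` module; the `TaskQueue` class, the two priority levels and the task names are assumptions made for illustration, not a description of Cleeng's code.

```python
import heapq
from itertools import count

URGENT, NORMAL = 0, 1  # lower number = handled first


class TaskQueue:
    """Pop urgent tasks first; keep arrival order within a priority level."""

    def __init__(self):
        self._heap = []
        self._arrival = count()  # tie-breaker so equal priorities stay FIFO

    def put(self, priority, task):
        heapq.heappush(self._heap, (priority, next(self._arrival), task))

    def pop(self):
        return heapq.heappop(self._heap)[2]

    def __len__(self):
        return len(self._heap)


queue = TaskQueue()
queue.put(NORMAL, "payment confirmation email #1")
queue.put(NORMAL, "payment confirmation email #2")
queue.put(URGENT, "forgot-password email")

order = [queue.pop() for _ in range(len(queue))]
print(order)
# ['forgot-password email', 'payment confirmation email #1',
#  'payment confirmation email #2']
```

During a traffic spike the workers keep draining urgent tasks first, and the deferrable ones are simply processed a little later.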
I am happy to announce that we’ve recently reached another key milestone in our journey to unlimited scalability. There is no longer a trade-off between scalability and fully fledged functionality. Cleeng offers both, out of the box, to any client. Since this year our full infrastructure is cloud-based and scalable, and we are confident we can deal with events that sell hundreds of thousands of tickets. But we won’t stop here. Our ambitions go further.
Let us know if these tips are of any help to you, share your thoughts on scalability, or reach out via the form below to start a discussion about this exciting topic.