This is the English version of our statement regarding the server outage – you can find the German version here.
Russia could not have wished for a better World Cup opening match… but it could hardly have been worse for us! From 4:48 pm (German time) to 7:32 pm, our service was almost completely unavailable, despite the massive server resources we dedicated and our best efforts. However, at around 8:54 pm, our developers discovered that the problems were not caused by serving too many users – they were caused by a targeted cyber attack on Tippstube! So, in this post, we’ll explain (1) what exactly happened, (2) what this means for you, and (3) what implications this will have for us and the remainder of the tournament.
By the way: it’d be awesome if you could share this article with your Tippstube friends, so that we’re all on the same page!
Cyber attack on Tippstube: The “too long, didn’t read” version
Just to be clear up front: nobody had access to our systems, let alone your data. We fell victim to a DDoS attack: in such an attack, enormous amounts of data are sent through our network to our servers, which try to process these “packets”. As a result, our network is flooded and traffic gets stuck – we basically have a traffic jam caused by manipulated packets.
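The “traffic jam” can be pictured with a toy queue simulation (purely illustrative – these numbers have nothing to do with our real infrastructure): junk packets crowd the queue, so the few legitimate requests behind them wait longer and longer.

```python
from collections import deque

# Toy illustration (not real networking): one FIFO queue, and the
# server can drain a fixed number of packets per tick. Flood traffic
# piles up in front of legitimate requests, so their wait explodes.
QUEUE_CAPACITY_PER_TICK = 10

def simulate(junk_per_tick, legit_per_tick, ticks):
    """Return the average wait (in ticks) of legitimate packets that
    actually got processed within the simulated time window."""
    queue = deque()
    waits = []
    for t in range(ticks):
        # attackers' junk arrives first each tick and fills the queue
        for _ in range(junk_per_tick):
            queue.append(("junk", t))
        for _ in range(legit_per_tick):
            queue.append(("legit", t))
        # server drains a fixed number of packets per tick
        for _ in range(min(QUEUE_CAPACITY_PER_TICK, len(queue))):
            kind, arrived = queue.popleft()
            if kind == "legit":
                waits.append(t - arrived)
    return sum(waits) / len(waits) if waits else float("inf")

print(simulate(junk_per_tick=0, legit_per_tick=5, ticks=50))   # quiet day: no waiting
print(simulate(junk_per_tick=50, legit_per_tick=5, ticks=50))  # flood: waits grow every tick
```

With no junk traffic, every request is handled in the tick it arrives; during the flood, most legitimate requests queue behind tens of junk packets – and some never get answered at all within the window, which is exactly what our users experienced.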
So, here’s what exactly happened!
We want to be 100% transparent with you, and for the sake of simplicity, we will walk you through the events of the opening match chronologically. Please note that the timestamps refer to German time (CEST).
- 3:34 pm: User #100,000 signs up. Wow 🙂
- 3:46 pm: Our developers preventively start monitoring all of our services closely in order to cope with the tens of thousands of users who are using the app at the same time.
- 3:48 pm: The response times of our systems are stable and in the low millisecond range. The CPU usage of our primary database is at a decent 60 percent. We’re good!
- 4:31 pm: Due to the heavy app usage, the response times of the database are slightly elevated, but everything is still working smoothly. Database utilization is high (> 90%), but again, no signs of system overload.
- 4:48 pm: The first manipulated packets hit the secondary load balancer. Unfortunately, we did not notice this at the time and were only able to identify it in retrospect.
- 4:49 pm: Within seconds, database utilization drops rapidly to 37%, which is surprising, but apparently still fine. In hindsight, this is an indicator that our network was clogged and very few requests were reaching our database.
- 4:50 pm: Requests fail to be answered: the availability of Tippstube is severely limited and we start receiving lots of complaints and negative reviews (rightly so). Meanwhile, manipulated packets keep pouring in and hitting the load balancers.
- 5:00 pm: Official kick-off time – and Tippstube is down. We’re fully alerted and sweating!
- 5:01 pm: The match starts. Ouch.
- 5:08 pm: Assuming that the database cannot cope with the heavy load and that our database usage metric is incorrect, we scale the master database to an instance four times(!) the size of the current one.
- 5:09 pm: Requests are now being handled by the database replica – a clone of the database – so that we can configure the new, larger database.
- 5:13 pm: The new master database is up and running, but we can still only process the requests we actually receive – and apparently, we’re missing a ton of them. Russia scores the first goal of the 2018 World Cup.
- 5:19 pm: We try to manually re-route traffic from the workers (the systems on our servers that process your requests) to the new, larger master database (our assumption at the time: we’re experiencing network issues with our cloud service provider).
- 5:26 pm: Despite our efforts, almost no requests reach the master database – although it is completely underutilized and almost starving for requests.
- 5:26 pm: We launch an additional server to process requests. Nevertheless, requests still cannot be processed and do not even make it to the database.
- 5:36 pm: Yet another server is launched to eat up the requests. Again, no effect: we’re puzzled.
- 5:46 pm: Almost no requests can be processed anymore, although the load on the workers (i.e. the servers) is surprisingly low.
- 5:48 pm: Halftime for Russia and Saudi Arabia, but our app is still dead. We’re doomed.
- 6:25 pm: We consult an AWS cloud expert – but after quickly skimming through the issues, it’s a mystery to him as well. We keep investigating.
- 6:44 pm: Two heads are better than one, right?! We consult another cloud expert and get him involved in the analysis.
- 6:55 pm: Game over – at least in Moscow: 5:0; what an opener! We’re still unable to receive requests.
- 7:01 pm: We re-configure the database connection settings, still assuming that the database is the root cause of our problems.
- 7:31 pm: We dive into a detailed analysis with AWS cloud expert #1.
- 7:32 pm: The last manipulated packet is received by the secondary load balancers.
- 7:33 pm: Within seconds, database-utilization jumps up to 55%. We’re again receiving requests smoothly, and the database is processing them as expected.
- 7:34 pm: All of a sudden, every request is processed “as usual” within a few milliseconds. Our developers start questioning their existence – what’s going on here? As if a switch had been flipped…
- 7:40 pm: We start analysing log files, metrics and various other records.
- 8:10 pm: The Tippstube team is still working non-stop to answer more than 1,000 emails and 100 disappointed reviews. The last email will be sent at 1:45 am CEST.
- 8:54 pm: Using AWS CloudWatch metrics, we discover unnatural patterns in the workers’ network traffic. During the downtime, the workers received very few requests, but with extremely high data volumes.
- 9:12 pm: Conspicuous log entries are discovered on the secondary load balancers. They reveal that manipulated packets were sent from 4:48 pm to 7:32 pm, and that the connection of the requesting HTTP client was intentionally terminated each time. It’s basically like calling someone, speaking extremely slowly and then hanging up mid-sentence – rinse and repeat.
- 9:48 pm: We consult a software security professional from Columbia, Maryland, who confirms our suspicion: we fell victim to a targeted cyber attack. He explains that we’ll have to expect further, possibly even harsher, attacks in the coming days!
- 11:22 pm: The entire Tippstube team gets together on a video chat to discuss the findings. Lena reports countless disappointed users – and we search for possible solutions.
- 1:24 am: The findings on the attack are collected in the draft of this post-mortem.
- 1:40 am: The vodka reserves of our Tippstube techie are running low.
- 1:46 am: The last email of the night leaves Tippstube. #InboxZero, done for today.
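The 8:54 pm finding boils down to a simple ratio check: inbound bytes per request. A hypothetical sketch of that check (the numbers below are made up; in practice they would come from monitoring data such as the CloudWatch metrics mentioned above):

```python
# Flag intervals where inbound traffic volume is wildly out of
# proportion to the number of requests served - the "very few
# requests, extremely high data volumes" pattern from the timeline.

def flag_anomalies(samples, threshold=100_000):
    """samples: list of (interval_label, inbound_bytes, request_count).
    Returns the labels of intervals whose bytes-per-request ratio
    exceeds the threshold."""
    flagged = []
    for label, inbound_bytes, requests in samples:
        bytes_per_request = inbound_bytes / max(requests, 1)
        if bytes_per_request > threshold:
            flagged.append(label)
    return flagged

# Invented sample data for illustration only:
metrics = [
    ("4:30 pm", 40_000_000, 9_500),   # normal: roughly 4 KB per request
    ("5:00 pm", 900_000_000, 1_200),  # flood: roughly 750 KB per "request"
    ("7:45 pm", 35_000_000, 8_800),   # back to normal
]
print(flag_anomalies(metrics))  # only the 5:00 pm interval stands out
```

A check like this is cheap to run continuously – which is exactly why the pattern was obvious in retrospect, even though it went unnoticed in the heat of the moment.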
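The “speak extremely slowly, then hang up” tactic from the 9:12 pm entry resembles what is commonly called a Slowloris-style attack (our label – the logs don’t name it). A toy model of why it hurts: each connection occupies one worker slot until it finishes, so a handful of deliberately slow clients can tie up the whole pool.

```python
# Toy model: a fixed pool of connection slots served greedily over
# discrete time steps. Slow clients grab slots at t=0 and hold them,
# starving the fast legitimate requests that arrive afterwards.

def served_legit_requests(slots, slow_clients, slow_duration, legit_arrivals):
    """Return how many legitimate requests (1 time step each) get a
    free slot at their arrival time."""
    # slot_free_at[i] = time step at which slot i becomes free again
    slot_free_at = [0] * slots
    for i in range(min(slow_clients, slots)):
        slot_free_at[i] = slow_duration  # slow client occupies the slot
    served = 0
    for t in legit_arrivals:
        for i in range(slots):
            if slot_free_at[i] <= t:
                slot_free_at[i] = t + 1
                served += 1
                break
    return served

legit = list(range(20))  # one legitimate request per step for 20 steps
print(served_legit_requests(slots=4, slow_clients=0, slow_duration=0, legit_arrivals=legit))
print(served_legit_requests(slots=4, slow_clients=4, slow_duration=100, legit_arrivals=legit))
```

With no slow clients, all 20 legitimate requests are served; with just four slow clients holding all four slots, none are – which matches what we saw: underutilized workers, yet no requests getting through.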
What does that mean for me?
“Are my predictions affected by this?” – No. The database is untouched, secure and encrypted. Attacks of this kind serve only to cause chaos and take down our servers. Your predictions and data are not affected.
However, we do have to expect further attacks – which means for you: to be safe, submit your predictions early. Avoid doing so shortly before or during a game, as these are “prime times” for the attackers!
What does this mean for Tippstube and the remainder of the World Cup?
I’m sure that some questions popped up in your head whilst reading our “protocol”, such as…
Why didn’t you simply log the IP addresses? For data protection reasons, we refrain from logging IP addresses – not even anonymized ones.
Why did you notice the attack so late? To be frank, we did not expect anyone to target a free app like ours in such a professional manner. Unlike our counterparts, we do not pursue commercial goals with this app; we created Tippstube for the simple sake of enhancing your (and our) World Cup experience.
Who is responsible for the attacks? If only we knew… You are welcome to send an email to lena (at) tippstube (dot) de, if you have any ideas or guesses.
However, the most important question is: what can we do about it? And that’s a tough one: defending against DDoS attacks is – depending on the nature and professionalism of the attack – anything but easy! We will now take some precautions to prevent these scenarios, but you can never be sure whether they will actually solve the problem. In addition, one of our friends, an IT security expert by training, has agreed to support us – and even took days off work to help us in his spare time(!). But again, he also explicitly stated that there’s not much we can do preventively, and even reactive measures are really difficult.
Preventing professional attacks is almost impossible unless you buy one of the enterprise-level IT services, e.g. those provided by Amazon. They come, however, with a hefty price tag: 36,000 euros. With our current infrastructure setup, we’re already at several thousand euros per month, which doesn’t pay off even with advertising revenue in mind. However, even if others play against us with unfair means – a.k.a. Ramos-style, taking down the whole Liverpool team – we’ll stick to our standards and, just like the proud players and fans of Liverpool, stand up and do our very best!
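To give one concrete example of the kind of precaution we mean (a general technique, not a description of Tippstube’s actual setup): a per-client token bucket that drops excess requests before they reach the workers. Real DDoS mitigation happens further out at the network edge, but the principle is the same.

```python
import time

class TokenBucket:
    """Simple token-bucket rate limiter: a client may burst up to
    `capacity` requests, then is throttled to `rate` requests/second."""

    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.last = time.monotonic()

    def allow(self):
        # refill proportionally to the time elapsed since the last call
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

bucket = TokenBucket(rate=5, capacity=10)        # 5 req/s, bursts of 10
results = [bucket.allow() for _ in range(100)]   # a sudden burst of 100 requests
print(sum(results))  # only roughly the burst capacity gets through
```

In practice you would keep one bucket per client (e.g. per source address) – which, as noted above, clashes with our decision not to log IP addresses, and is exactly why such trade-offs are hard for us.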
Lena is the good fairy of the Tippstube team and takes care of the website, new articles and supporting our users. She is a passionate football fan and looks forward to predicting with you!