January 5 downtime explanation
Author Message
Jake

Toodledo Founder
Posted: Jan 05, 2015
Score: 0 Reference
Today between 9:15am and 11:45am PST we experienced about 2.5 hours of intermittent downtime for all of Toodledo. This is definitely NOT how I wanted to spend our first week of 2015, which is traditionally Toodledo's most heavily used week of the year. We deeply regret any inconvenience that this may have caused to anyone.

Throughout the event we posted status updates on Twitter and Facebook. Here is a complete timeline of events:

At 9:00 we noticed elevated error rates when connecting to one of our external resources. We immediately started investigating. Soon after, Toodledo became intermittently unavailable across our entire site.

It initially appeared that our inability to connect to the external resource was causing connections to our website to run slower than normal and pile up. Once enough of them had piled up, our servers ran out of connections/memory and stopped functioning properly. However, upon further investigation, we found that we could connect to this resource from other locations, so the problem wasn't with the resource; it was with our server's network connection.
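To illustrate why slow outbound connections can exhaust a server, here is a minimal sketch based on Little's Law (average requests in flight = arrival rate x time per request). The request rate, latencies, and connection limit below are hypothetical numbers chosen for illustration, not Toodledo's actual figures.

```python
def connections_in_flight(requests_per_second: float, seconds_per_request: float) -> float:
    """Little's Law: average in-flight requests = arrival rate * time each request takes."""
    return requests_per_second * seconds_per_request


# Hypothetical values, for illustration only.
MAX_CONNECTIONS = 500        # assumed server connection limit
REQUESTS_PER_SECOND = 200    # assumed steady incoming traffic

# Normal case: the external resource answers quickly.
normal = connections_in_flight(REQUESTS_PER_SECOND, 0.25)

# Degraded case: every request waits on a connection that hangs.
degraded = connections_in_flight(REQUESTS_PER_SECOND, 10.0)

print("normal:", normal, "within limit:", normal <= MAX_CONNECTIONS)
print("degraded:", degraded, "within limit:", degraded <= MAX_CONNECTIONS)
```

With the same incoming traffic, a jump in per-request time alone is enough to push the number of open connections past the limit, which matches the pile-up described above.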

Upon realizing this, we immediately got on the phone with Rackspace (our datacenter provider) to help us diagnose the issue. It took about 30 minutes, but the smart people there eventually determined the source of the problem. It was a DDoS attack against another customer. Because of the way datacenters are set up, multiple customers share the same upstream network equipment. The attack on some other website was flooding and overloading that networking equipment. The result was that anyone downstream of that point was experiencing trouble, so it affected several websites in addition to Toodledo.

At this point, it was out of our control. We had to wait for Rackspace to mitigate the attack and resolve the problem. At around 11:45am it was fixed and our website came back online.

We spent a few minutes running tests and checking everything, and we quickly noticed that our API was still offline, which unfortunately meant that third-party apps were unable to sync with Toodledo. We investigated this issue and ultimately had to call Rackspace again, who helped us resolve this secondary problem. It seems that as part of their mitigation attempt Rackspace noticed a lot of traffic going to our API, so they blocked that traffic as a safeguard. Once it was understood that Toodledo was very popular :) and that all this traffic was legitimate, they restored the connections and our API started working again. So, the API came back online about an hour after the main site did.

At this time it does not appear that Toodledo was to blame for the attack or was its target. We were collateral damage. No user data was compromised.

What have we learned?

We have learned that events outside of our control can affect Toodledo. It's difficult to plan for something like this. Even if we replicated our entire infrastructure in a different datacenter, it would take time to switch everything over if another event like this happened. These types of events are very rare. The last time we were affected by something like this was 5 or 6 years ago.

One thing that we are working towards across all of Toodledo is the ability to work offline. So, even if our servers went offline, you would still be able to access the website and make changes. Those changes would sync up once the site came back online. We have plans to do that in 2015 and we will likely accelerate these plans. If you access Toodledo through an app that syncs with Toodledo then you would already have benefited from this offline functionality.
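As a rough illustration of the offline idea described above, the sketch below queues changes locally and delivers them once the server is reachable again. The `OfflineQueue` class, the `fake_send` function, and the shape of the change records are all hypothetical and are not Toodledo's actual design or API.

```python
from collections import deque
from typing import Callable, Deque, Dict


class OfflineQueue:
    """Holds changes made while offline and replays them in order later."""

    def __init__(self, send: Callable[[Dict], None]) -> None:
        self._send = send                      # function that pushes one change to the server
        self._pending: Deque[Dict] = deque()   # changes not yet confirmed as sent

    def record(self, change: Dict) -> None:
        """Store a change locally; this never touches the network."""
        self._pending.append(change)

    def flush(self) -> int:
        """Try to send pending changes in order; stop at the first failure."""
        sent = 0
        while self._pending:
            try:
                self._send(self._pending[0])
            except ConnectionError:
                break                          # still offline: keep the rest for later
            self._pending.popleft()            # drop a change only after it was sent
            sent += 1
        return sent

    def __len__(self) -> int:
        return len(self._pending)


# Simulated server, for demonstration only.
state = {"online": False}
server_log = []


def fake_send(change: Dict) -> None:
    if not state["online"]:
        raise ConnectionError("server unreachable")
    server_log.append(change)


queue = OfflineQueue(fake_send)
queue.record({"task": "Buy milk", "completed": True})
queue.record({"task": "Call Bob", "completed": False})

sent_while_down = queue.flush()   # server is down: nothing is sent, nothing is lost
state["online"] = True
sent_after_recovery = queue.flush()   # server is back: changes go out in order
```

A change is removed from the queue only after it has been sent successfully, so an outage in the middle of a sync leaves the remaining changes in place for the next attempt. A real implementation would also need to handle conflicting edits, which this sketch ignores.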

Another thing we plan to do is isolate our services further, so that parts of Toodledo are more likely to stay online when other parts have trouble.

We will also keep looking for more ways to make Toodledo more robust in the future.

Those of you with Platinum accounts will be getting subscription extensions as part of the uptime guarantee that we have for that subscription level.

Again, we are very sorry for any inconvenience that this might have caused to anyone. We understand that Toodledo is a key productivity tool for lots of people and that you depend on us to be available all the time. We have had a good record in the last few years and we will redouble our efforts to improve in the future.

-Jake


This message was edited Jan 05, 2015.
Salgud

Posted: Jan 05, 2015
Score: 0 Reference
Thanks for letting us know, Jake.
Mr. J.

Posted: Jan 05, 2015
Score: 0 Reference
It happens. Thank you for being up front about it and for the status tweets.
allendr59

Posted: Jan 05, 2015
Score: 0 Reference
Thanks for the update Jake! One request though: could you add the status updates to Google+ as well? I don't use either Twitter or Facebook, so wasn't aware of the efforts being made.

Thanks!
dpbaril_1333763958

Posted: Jan 05, 2015
Score: 1 Reference
Adding my voice to those who have praised the handling of this problem. Proves the old customer service adage that it isn't the problem, it's how you respond to the problem that matters.
Ann M

Posted: Jan 05, 2015
Score: 1 Reference
And your response is why I continue to be a happy customer. Thankfully your tweets kept us updated, and your transparency and quick response are much appreciated. Ann
Eddie S

Posted: Jan 05, 2015
Score: 0 Reference
I appreciate the candid update. A few options your IT team could consider:

- Go to a private cloud. This would protect you from being collateral damage. You could still be directly targeted of course, but at least you don't have to worry that your neighbor in the cloud (which may be running on the same physical host) is being attacked.
- Employ an active/active high availability cluster solution based on load distribution technology such as HAProxy. The failover time is almost instantaneous and there are potential load balancing benefits.

Cheers!
rclark

Posted: Jan 05, 2015
Score: 1 Reference
I'd also like to throw my thanks into the thread for the transparency and real-time communications with your user base.

I have to say that I'm a bit surprised that Rackspace didn't already know about the problem... If this attack was large enough to affect multiple customers by congesting upstream network gear, I would have expected better monitoring and alerting in place to detect the traffic spike and over-utilization of the pipe or the equipment, well before the time you'd be calling them after doing your own troubleshooting. You'd be within your rights as their customer to ask about their monitoring and detection for that sort of thing in the future; if one of their other nearby customers is getting attacked, it's likely that they'll be attacked again, and that puts you at risk.
dlehoven

Posted: Jan 06, 2015
Score: 0 Reference
I remember only one other time when Toodledo was off-line - a perfect storm involving a lightning strike, I believe? - and I remember the updates as you worked tirelessly to get it up and running again. It was very impressive. That's why, when the site seemed to disappear this morning, I was alarmed, but not terrified - I knew you'd get it fixed, and I trusted that I'd be seeing my data again soon. It hadn't occurred to me to look for information on Facebook during the outage, but now I know - plus now I'll get updates about all the new features without having to remember to look for them in the forum. Thanks for the clear explanation of what happened, and thanks for the thoughtful and consistent development of a great product.
susan_1396340637

Posted: Jan 06, 2015
Score: 0 Reference
Great work. Rough start to the year but I'm really impressed with how you've handled this. Thank you!
junk_1356549382

Posted: Jan 07, 2015
Score: 0 Reference
By the time I noticed the issue, it was fixed within 20 minutes. I had my data on my phone, so it wasn't a huge show stopper. And yes, the service has been very, very good (I can think of many more times in the last 6 years when email, Yahoo, Amazon, and others have been down), and you guys did a great job with Twitter and Facebook.

Setting up a status page failover is another way to keep users informed if the downtime is going to be longer. But as you point out, you are at the mercy of Rackspace, etc., and they don't always tell you.

Certainly, they can provide additional alerting to you - in light of this event.

Keep up the good work!


This message was edited Jan 07, 2015.
Christoph Dollis

Posted: Jan 18, 2015
Score: 0 Reference
Jake,

Excellent explanation, and I'm glad that offline capability is in the works. I'd used Toodledo for a few years and then spent a year or two off of it trying out other things (none of which were quite as good, but some of which offered offline capability) before coming back to Toodledo.

With Habits, with the improving UI, and with offline capabilities, and especially Toodledo's outstanding sorting capabilities, it is the solution for me. Keep up the good work, building on top of your earlier good work.