Our Response to Tuesday’s AWS Outage

Earlier this year I described our customer commitment for 2021 in an article on our Study Hall resources page. I mentioned then what I still firmly believe: you as our customers are the health of our organization…and the reason why we are here. You have placed in us a trust that we take very seriously. 

I also shared in that post our commitment to improving your technology-driven experience. So it pains me when situations arise like the one this week, on December 7th, when services at our hosting partner AWS went down. The outage meant that many of you were unable to access solutions from our Instructure Learning Platform, such as Canvas or MasteryConnect. It turned into a rough day for all of us.

Yes, this was a large event that impacted websites in nearly every industry, but when it comes to education, we know that technologies like Canvas now sit at the heart of educational workflows and that you’ve come to rely on them. We are discussing internally how we address these types of widespread outages going forward and evaluating the measures needed to respond more effectively when they happen.

Based on conversations we’ve had with so many of you over the past couple of days, there are a few things I want to share about what happened and the steps we’re taking to address it going forward:

What happened? An outage at AWS’s web hosting service disrupted most Instructure solutions for our customers across North and South America. Some customers experienced only slowness, while others received error messages.

How long did it last? For affected customers, the impact lasted the course of the school day, and we weren’t back online until late afternoon or early evening in the US. The timing was particularly tough for schools because so many of you are nearing the end of semesters or terms, and educators and students had tests and quizzes disrupted.

What was Instructure doing during this downtime? When situations like this occur, we implement a priority 1 incident response protocol that brings together teams from across the company to address the issue, including our engineering teams, our customer success operations team, our support team, our executive team, and our communications team. We also immediately update our status page at status.instructure.com. On Tuesday, we followed a similar approach and spent the day in dedicated response rooms working with AWS to understand what was happening and how our customers were impacted, communicating with those customers, and figuring out how to address the issue as quickly as possible.

If this was a priority 1 incident, why couldn’t the recovery time be quicker with a multi-region fail-over to working servers? While we have data center redundancy within regions, we don’t have multi-region fail-over capabilities to handle the failures we were experiencing. In other words, we can survive failures within local data center zones, but not failures across an entire region, which is what we were experiencing.

Why didn’t we see more status page updates on status.instructure.com? We regularly update status.instructure.com to keep you informed of the latest details during technical incidents such as this. We worked hard to update you through the morning while we were gathering details from our AWS executive partners. That said, there was a gap during which we did not update the page as frequently as we had earlier in the morning, because we were so focused on finding solutions. We will make sure that does not happen again, even in situations where we have no new updates.

Is it normal to still be experiencing slowness in our systems? All systems are back to operating at normal levels, so if you are experiencing slowness, please contact our support team for help.

If you’re continuing to experience any challenges associated with Tuesday’s events, let’s talk. Please reach out to your CSM so that we can get things right. Let’s keep the dialogue open about this. We want to hear from you and want to keep your trust. Thank you for the work you do every day.

I’m here and committed to your success. If you want to reach out to me directly, you can reach me at melissa@instructure.com.

Keep Learning,
Melissa

 

 
