A few months ago, Lance posted about our continuing plans to improve the robustness and availability of our architecture.
Step number one in that process was hiring. I (Drew Blas) have been very pleased to join Chargify and have definitely hit the ground running. Together, we’ve identified a very aggressive roadmap to turn our service into the most bulletproof site you use.
We’ve already implemented a slew of small internal changes that will help us in many regards. These are the ‘low-hanging fruit’: quick, easy actions we could take before beginning work on our number one priority. We’ve added a lot of monitoring, logging, testing, and analytics to help ensure that as we execute major changes we don’t cause any disruption in ongoing operations. We also hope to bring some of these internal insights to you directly in the form of improved automated status checks & uptime history that will help you to see how well we’re doing.
Our next big hurdle is to actually begin using a second datacenter. We’ve talked to a variety of datacenters and providers with a wide range of skillsets and specialities. We’ve tested, demoed, and evaluated their offerings from many different perspectives, including performance, support, & security.
Ultimately, the approach that resonates most with us is the simple philosophy that “manual operations are most prone to cause problems or to fail to respond quickly in an emergency”. Instead, a properly built & tested architecture that can automatically monitor, scale, and heal itself is the ultimate paradigm that ensures our site can keep running no matter what. This approach demands a highly dynamic and flexible environment with significant automated infrastructure behind it.
Of course, the best-known provider of such an environment is Amazon Web Services. AWS is also a PCI Level 1 Service Provider, meaning they have already completed an audit of the internal & physical security controls we need in order to be confident that our partnership will allow us to maintain our own PCI Level 1 compliance.
We have not been exploring these options in a bubble. We’ve taken input and advice from several experts in the field who have first-hand experience with many different providers. One in particular that I’d like to point out is Tom Mornini, Co-founder and CTO of Engine Yard. Several years ago, Engine Yard switched from internally managed hardware in a colo datacenter to hosting their customers on AWS, where they now run many thousands of systems. In doing so, they saw the reliability and robustness of their customers’ sites skyrocket. These types of anecdotes have been repeated time and again by many others.
AWS is not without issues. They have had several high-profile outages (indeed, when you host as much of the internet as AWS, any issue is going to be high-profile). You can be sure that we have studied each one in detail, as well as how others have successfully and unsuccessfully planned for these failures. At the top of our list of safeguards is running simultaneously in as many different Availability Zones and Regions as possible. I, personally, have run several sizable operations on AWS and have found it can be extremely robust when executed properly. Also, a review of their documentation, technical papers, and architecture descriptions gives us cause to be confident that a multi-region failure is no more likely than with any other pair of datacenters we might choose. We’ll also be using every technology available (both inside and outside AWS) to make Chargify as redundant and highly available as possible.
Our decision on a new provider is not yet final: we want to get your feedback. We want to learn how you feel about this process and what we can do to assure you that your data is secure and that the availability and performance of Chargify will only improve. Please contact us and let us know what you think! It’s important to be clear that our choice of AWS is NOT for ease-of-use, nor is it for price. Our exploration has focused on security & reliability, and in these two areas AWS is well ahead of its competitors.
– Drew Blas – “Keeper of the Uptime”