Back in early December 2011, our system architect, Michael Klett, blogged about capacity limits we were reaching due to growth in the number of merchants running on Chargify and the number of customers they all support.
RECAP OF LATE 2011 & EARLY JANUARY 2012
* Over the second half of 2011, we occasionally noticed slower-than-acceptable system responsiveness, but it was only occasional, and most merchants were happy. There are many new features that merchants really want, so we stayed focused on features. By the fall, that had become a mistake!
* We let ourselves lose sight of how fast our merchants were growing, and sometime around November, resource utilization hockey-sticked. System performance got pretty bad, I’m sorry to say.
* Our tech team shifted focus entirely to system architecture and hardware improvements. We worked with our managed data center staff and outside consultants to solve the capacity problems as quickly as possible without taking the system down. One merchant told me he figured it was like “working on an engine while the car is moving.”
* Over the course of December and early January, we did things like:
– Renewed our PCI audit.
– Stopped processing test subscriptions older than 6 months.
– Replaced our JSON processor with a newer, faster version.
– Upgraded our current database server.
– Added another utility server for processing background jobs.
– Our data center staff upgraded firewalls to increase security.
– Replaced delayed_job with Resque, which uses Redis instead of MySQL.
* We did crack a few eggs while doing these things. For instance, when we first implemented Resque, a few problems resulted in some charges being processed but not logged in your merchant activity stream and, in some cases, webhooks not being generated. In another case, the firewall upgrades caused some SSL connections to time out. Both problems were stressful, and we apologize for them.
THINGS TODAY AND FOR THE NEXT FEW WEEKS
Things are definitely running better now, and we no longer hear from merchants regarding slow app & API performance.
However, we want even more headroom beyond current needs, so here’s what’s still coming between now and mid-February:
* A larger database infrastructure. It will be more fault-tolerant and have a lot more capacity for growth. This is the single largest thing left, and may require some middle-of-the-night downtime. We’ll let you know as we get closer.
* We’re increasing the number of front-end servers and utility servers again, just for good measure.
I’m sorry we didn’t jump on this sooner. We just didn’t see the problem growing as quickly as it was, and we were too focused on new feature development. We won’t make that mistake again.
If you have any questions, feel free to contact me via email or cell phone:
Lance Walley, co-founder/CEO