Friday, September 29, 2006

Scalability - How to keep data flowing as registration volume increases

I spent two drizzly days in Seattle last week visiting some of our best clients, including Weyerhaeuser and Safeco. With our phenomenal growth over the past two years (we've processed over 2.7 million registrations in 2006, already twice as many as last year), we continually have to upgrade our servers and network equipment to handle the increased load. So what can we do to make sure that registration forms respond quickly when 10,000 people try to log onto the system at the same time?

The first thing to realize is that event registration is a very demanding web-based process. A single page of an online registration form will require 10-100 times as much computer processing power as a single page of a content-driven web site such as Yahoo! This is because registration systems involve a great deal of data storage and retrieval, while news sites simply deliver static web pages to viewers.

So to keep the registrations flowing, we have to eliminate bottlenecks at every step of the process:
  1. Registrant requests a page from their browser
  2. Certain's servers receive the request and pass it to one of several web servers
  3. The web servers handle the request, retrieving data from and storing data in the database
  4. The database server processes individual queries to read or write information
  5. Backup systems make copies of the data to disk and magnetic tape
  6. Security systems monitor the process to prevent unauthorized intrusion

All of this processing must occur within about a half-second, or the registrant will "feel" like the system is sluggish.

So to go from a million registrations per year to two million, to four million, to ten million, we continue to make large investments in our network operations.

Network

The first requirement is to make sure that data can get in and out of our servers fast enough to handle all incoming page requests. Internet "bandwidth" is a measurement of how much data can be transmitted in a given period of time. Internet Service Providers (ISPs) measure bandwidth by both average usage and peak usage. For example, in September 2006 we increased our bandwidth from "1/10" to "2/100". This means that we pay for a monthly average usage of 2 Mbps (megabits per second), but we are allowed to use up to 100 Mbps during peak periods.

For comparison, an average web page has about 100 KB (kilobytes) of text and images. That means that the web server has to deliver 100,000 bytes, or about 800,000 bits, since there are 8 bits in a byte. (Note that these calculations aren't exact, due to various small discrepancies between the measuring systems of disk, computer, and network engineers.) With a million bits in a megabit, you would need 0.8 Mbps to receive the page in a second. High-definition digital cable requires about 20 Mbps to deliver HD movies to your monitor.
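The arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes the same rounded decimal units used in the text (1 KB = 1,000 bytes, 1 Mbps = 1,000,000 bits per second), and the function names are made up for illustration.

```python
# Assumed rounded units, matching the text: 1 KB = 1,000 bytes, 1 Mbps = 1,000,000 bits/s.
BITS_PER_BYTE = 8
BYTES_PER_KB = 1_000
BITS_PER_MEGABIT = 1_000_000


def page_bits(page_kb: float) -> float:
    """Size of a page in bits, given its size in kilobytes."""
    return page_kb * BYTES_PER_KB * BITS_PER_BYTE


def required_mbps(page_kb: float, seconds: float = 1.0) -> float:
    """Bandwidth (Mbps) needed to deliver one page within `seconds`."""
    return page_bits(page_kb) / seconds / BITS_PER_MEGABIT


def transfer_seconds(page_kb: float, link_mbps: float) -> float:
    """Time to deliver one page over a link of `link_mbps`."""
    return page_bits(page_kb) / (link_mbps * BITS_PER_MEGABIT)


if __name__ == "__main__":
    print(page_bits(100))             # bits in a 100 KB page
    print(required_mbps(100))         # Mbps needed to deliver it in one second
    print(transfer_seconds(100, 20))  # seconds for the same page over a 20 Mbps link
```

The same 100 KB page that needs 0.8 Mbps to arrive in one second would take about four hundredths of a second on a 20 Mbps link.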

Our current daily average bandwidth usage is just over 1 Mbps, and our peak usage is around 15 Mbps, so we have plenty of room for future growth and spike loads.
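To put "plenty of room" into numbers, here is a rough headroom calculation. It is a sketch only: "just over 1 Mbps" is approximated as 1.0, the limits are the "2/100" plan described above, and the helper names are hypothetical.

```python
def utilization(used_mbps: float, limit_mbps: float) -> float:
    """Fraction of the available bandwidth currently in use."""
    return used_mbps / limit_mbps


def growth_factor(used_mbps: float, limit_mbps: float) -> float:
    """How many times current traffic could grow before hitting the limit."""
    return limit_mbps / used_mbps


# Figures from the text; the average is approximated ("just over 1 Mbps").
AVG_USED, AVG_LIMIT = 1.0, 2.0
PEAK_USED, PEAK_LIMIT = 15.0, 100.0

if __name__ == "__main__":
    print(f"average: {utilization(AVG_USED, AVG_LIMIT):.0%} of plan")
    print(f"peak:    {utilization(PEAK_USED, PEAK_LIMIT):.0%} of burst limit")
    print(f"peak traffic could grow about {growth_factor(PEAK_USED, PEAK_LIMIT):.1f}x")
```

Under these assumptions, peak traffic uses roughly 15% of the burst limit and could grow more than sixfold before reaching it.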

Software Foundation

Another way to improve performance is to upgrade the underlying software that the web and database servers use to deliver the page requests. We are currently upgrading our database servers from 32-bit Microsoft SQL Server 2000 to 64-bit Microsoft SQL Server 2005. This allows the database to process requests for information much faster. In addition, SQL Server 2005 allows for simple configuration of clusters of multiple servers. Instead of using one database server for processing and another as a backup in case of failure, we can use the processing power of multiple servers, which improves performance during normal operation and provides redundancy during periods of failure.

Data Storage

With a data-intensive application such as online registration, overall performance can be limited by how fast the servers can read and write data to disk. We are implementing high-speed Storage Area Networks (SANs), which speed up disk reads and writes and database transaction log backups by orders of magnitude.

Application Architecture

Once the network, hardware, and software foundations are ready for high performance, it's up to the application to deliver.

  • First, we are modifying Register123 to “cache” information, such as event and form configuration, that does not frequently change. Data stored in the cache is located in the ultra-fast memory of the web servers, so when a page request comes in, the information can be delivered immediately instead of requiring a round-trip request to the (relatively) slower database servers. The specifics of caching are complex, but basically if the information has changed, then the web servers will automatically update the information in the cache from that in the database (this is called "refreshing the cache"). Caching will reduce database utilization by up to 35% per page without any sacrifice to the users' experience.
  • Second, we are going to implement a queue system for incoming requests. The vast majority of the time, registrations will be handled immediately when our servers receive the request. But in rare periods of exceptionally high demand (when database utilization exceeds 75%), incoming requests will be placed in a queue and handled on a “first-come, first-served” basis. While in the queue, registrants will see a “System Processing” message, much like the ones on online travel sites that so many Internet users have seen before. Their total wait time will be a few seconds (depending on system load), which is vastly superior to not being able to access the system at all.
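The two ideas above, caching and queueing, can be sketched in a few lines of Python. This is a simplified illustration and not the actual Register123 code: the class names, the `load_from_db` callback, and the way database utilization is passed in are all assumptions made for the example. Only the 75% threshold, the first-come, first-served ordering, and the "System Processing" message come from the description above.

```python
from collections import deque


class ConfigCache:
    """Keeps rarely changing data in web-server memory (hypothetical sketch)."""

    def __init__(self, load_from_db):
        self._load = load_from_db  # assumed callback that queries the database
        self._items = {}

    def get(self, key):
        # Serve from memory; go to the database only on the first request.
        if key not in self._items:
            self._items[key] = self._load(key)
        return self._items[key]

    def refresh(self, key):
        # Called when the data has changed: reload it from the database.
        self._items[key] = self._load(key)


class RequestGate:
    """Handles requests immediately unless the database is too busy."""

    def __init__(self, threshold=0.75):
        self.threshold = threshold  # 75% utilization, as described in the text
        self.waiting = deque()      # first-come, first-served queue

    def submit(self, request, db_utilization):
        # Queue if the database is overloaded, or if others are already
        # waiting (so that new arrivals cannot jump ahead of them).
        if db_utilization > self.threshold or self.waiting:
            self.waiting.append(request)
            return "System Processing"
        return "handled"

    def release(self, db_utilization, max_count=1):
        # Hand back up to max_count queued requests, oldest first,
        # but only once utilization is back at or below the threshold.
        released = []
        while (self.waiting and len(released) < max_count
               and db_utilization <= self.threshold):
            released.append(self.waiting.popleft())
        return released
```

In this sketch, asking the cache for the same event configuration twice reaches the database only once, and a request that arrives while others are still queued waits its turn even if the load has already dropped. A production system would also need to handle concurrency, cache expiry, and several web servers sharing one queue, none of which is shown here.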

Performance improvement is a never-ending battle with increasing registration volume. It is one that Certain Software is committed to winning.
