I met Monsur in April of 2005. I'd flown to New York on a Thursday night for an interview with Xanga. He and Jon met me at the hotel and took me to Times Square Brewery for a late dinner. After forty-eight hours of coding, eating, drinking, and socializing with the team I received a job offer and that was the beginning of my association with Monsur (as well as the rest of the great folks at Xanga).

Over the course of the next year I got to work closely with Monsur on various projects, be there when he got married, see Ted Leo and the Pharmacists, go on recruiting trips, attend PDC, and move over a hundred racked servers in the back of a U-Haul (see the pics). I've no doubt that life would have continued in much the same vein except that I met an amazing girl in California and decided to move across the country and marry her, which led to the close of the Xanga chapter in my life.

I've been blessed thus far in my life that I've never left a job because I didn't like the job - IBM was a great place to work, I have very fond memories (and a groomsman) from my time at Xanga, and Latham & Watkins was very good to me. In addition I can honestly say that I'd welcome the opportunity to work again with many of my past co-workers, which is why I'm really excited that Monsur accepted a job at Google. Monsur was one of the first programmers (perhaps the first) at Xanga and I never expected him to leave, so when I heard he was interviewing with Google I was shocked. However, I can attest to the fact that sometimes you have to leave something good behind because a new opportunity has arisen, and I'm glad that this new opportunity for Monsur is here at Google.

Congrats Monsur and welcome to Google! It's unlikely that we'll get called upon to move racks of servers here :)

I nominated Jon for Sysadmin of the Year 2006:
Not many sysadmins are responsible for the hundreds of servers and terabytes of data that power a web-site with millions of users, but Jon is. In addition to that he's also responsible for all the computers and test servers in the office, which means he's in charge of keeping us developers happy (and everyone else, too, but they're easier to please).

To keep him on his toes, as it were, most of the equipment Jon is responsible for lives at a co-lo in Jersey, which, as any Manhattanite will tell you, is somewhere West of us, though no one really knows where and no one really cares to find out, because there's no good reason to go to Jersey - hell, there aren't even any bad reasons to go to Jersey. What this really means, though, is that at any hour of the day or night Jon's smartphone might start beeping, and if it can't be fixed remotely, he has to hop in a ZipCar and get his tired ass over to Jersey. Thankfully, with EVDO and a Lenovo tablet (which he always carries), he can usually fix things wherever he happens to be - which just means that he has a four pound tether tying him to his work no matter where he goes (and I've got pictures of him standing outside a Starbucks at one in the morning trying to fix something after we left a movie theater).

Anyway, despite everything Jon goes through, he's still one of the chillest guys I've ever met and could kick my tail at CS:S, BF2, or Burnout Revenge any day of the week. I think his effort and responsibility deserve to be rewarded; not many 26-year-olds have the responsibility that he has and handle it as well.
Sorry, it's a little PG-13, but it was funnier that way.

When we started work on the new profile manager, specifically the parts that involve managing your friends and invites, we wanted to use AJAX to provide a fast and effective user interface. At the start I spent a lot of time beating my head against a wall trying to figure out how to structure the JS code so it wasn't one big hack. While a lot of the JS code was for display purposes, some of it was also business logic, which now straddled both the client and the server.

We decided early on to use Microsoft's Atlas framework and one of the benefits it provides is the ability to use namespaces and create classes in JavaScript. From a programming perspective this at least allows you to organize your code (though it doesn't do anything to ensure you have a good architecture). To achieve a simple yet solid architecture I wanted to create an MVC-based system. This system would have simple objects, retrieved from the server via Atlas ASMX proxies (another big benefit of Atlas); business objects that wrapped these simple objects and provided some simple functionality; views that were HTML representations of these business objects; a single page class that managed all the resources for the current page; and various util classes like sorters (for sorting objects), pagers (for paging through arrays of objects), and controls (a JS/XHTML/CSS combo that provides UI functionality).

The image below is a simplified view of this architecture:

As you can see, one object may have multiple views associated with it, rendered in different contexts of the page.
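To make the layering concrete, here is a minimal sketch of the simple object / business object / view split. It is written in Python rather than the JavaScript the project used, and every class, field, and method name in it is made up for illustration; nothing here is taken from the actual Xanga code.

```python
from dataclasses import dataclass, field


@dataclass
class FriendData:
    """Simple object, as it might arrive from the server (fields are assumed)."""
    user_id: int
    name: str
    group_ids: list = field(default_factory=list)


class Friend:
    """Business object: wraps the simple object and adds small behaviors."""

    def __init__(self, data):
        self.data = data

    def in_group(self, group_id):
        return group_id in self.data.group_ids

    def add_to_group(self, group_id):
        if not self.in_group(group_id):
            self.data.group_ids.append(group_id)


class DefaultFriendView:
    """Card-like HTML representation of a Friend."""

    def __init__(self, friend):
        self.friend = friend

    def render(self):
        return f'<div class="card">{self.friend.data.name}</div>'


class EditFriendView:
    """Edit representation of the same Friend, listing its groups."""

    def __init__(self, friend):
        self.friend = friend

    def render(self):
        groups = ",".join(str(g) for g in self.friend.data.group_ids)
        return f'<form data-groups="{groups}">{self.friend.data.name}</form>'


# One business object, two views of it.
friend = Friend(FriendData(user_id=1, name="alice"))
card, editor = DefaultFriendView(friend), EditFriendView(friend)
friend.add_to_group(7)
```

Because both views hold a reference to the same Friend, a change made through one of them shows up the next time the other is rendered.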

Here's an example of how this looks in practice:

The items highlighted in red are views of business objects - specifically a Friend object and a Group object. The Group object has two views - one on the drop-down menu (which is a control, highlighted in green) and one on the Edit view of the Friend object. The Friend object also has two views - an Edit view (the layer that is hovering, which is actually a view built on a control, though it's not highlighted in green) and the Default view, which is the card-like representation of a Friend.

The separation between the view and the model is very clean; however, the controller and the view aren't separated as cleanly as I would prefer. This is because the view frequently provides actions for the user to perform. For example, if I were to click the checkbox next to 'strangers' and then click the 'Save' button, the Friend object would be updated and then the view would make a web-service call to the server to update the Friend object in the DB. The view needs to know whether or not the call succeeded (and it's asynchronous) and then act accordingly.

One of the benefits of this system is that views are only rendered for objects that are currently visible. Doing this greatly reduces the complexity of the DOM. For instance, Dan has over 1300 friends, and though the page takes two or three seconds to load (over broadband), once the data is loaded navigating through it is very fast. Also, because all the data is loaded on the client we can provide very fast sorting and filtering (without a roundtrip). The burden of this sorting and filtering isn't borne by our servers, either. In fact, none of the data is sorted by the DB; all it has to do is SELECT WITH(NOLOCK) WHERE - which is very quick.
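A rough sketch of the idea, again in Python instead of JavaScript: all the data sits in memory, filtering and sorting happen locally, and markup is only built for the one page that is visible. The Pager class, the visible_cards function, the page size of 20, and the 'online' field are all assumptions made for the example.

```python
class Pager:
    """Pages through an in-memory list of objects."""

    def __init__(self, items, page_size):
        self.items = list(items)
        self.page_size = page_size

    def page_count(self):
        return (len(self.items) + self.page_size - 1) // self.page_size

    def page(self, index):
        start = index * self.page_size
        return self.items[start:start + self.page_size]


def visible_cards(friends, page_index, page_size, sort_key, predicate=None):
    """Filter and sort locally, then render views for the visible page only."""
    selected = [fr for fr in friends if predicate is None or predicate(fr)]
    selected.sort(key=sort_key)
    pager = Pager(selected, page_size)
    return ["<div class='card'>" + fr["name"] + "</div>"
            for fr in pager.page(page_index)]


# 1300 friends loaded once; no further server roundtrips are needed.
friends = [{"name": f"friend{i:04d}", "online": i % 2 == 0} for i in range(1300)]
by_name = lambda fr: fr["name"]

first_page = visible_cards(friends, 0, 20, sort_key=by_name)
online_page = visible_cards(friends, 1, 20, sort_key=by_name,
                            predicate=lambda fr: fr["online"])
```

Only 20 cards exist at any time, no matter how long the friends list is, which is what keeps the DOM small.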

As mentioned in previous posts, the profile.xanga.com site is hosted at a different co-lo than the main www.xanga.com site (I'm going to refer to the co-los by the subdomain that lives at each one from here on out). A relatively small amount of data is shared between the two sites and the read/update ratio for that data is pretty high, so we were able to optimize for that scenario.

One set of data that takes this scenario to an extreme is the Metros data (stored at the www co-lo). This data is like a tree with great breadth and very shallow depth. It's persisted as a table in a DB and the most frequent operation is a look-up on a leaf node to map an ID to a name. Updates happen very infrequently and consist of a leaf node being updated or created; the tree structure itself doesn't change.

Because we need to look up information from this table on nearly every page request to the profile site we decided to cache all the data in memory on each of the web-servers and periodically refresh it. When IIS started it would make a call to the metros DB at the other co-lo and load all the data into memory. This provided very fast lookups and reduced the load on the metros DB at the expense of some memory on each of the webservers (and memory is cheap).
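The caching approach looks roughly like the sketch below. It is in Python, not the ASP.Net code that actually ran, and it differs in one detail: it loads lazily on the first lookup instead of at IIS startup. RefreshingCache, load_metros, the two-hour maximum age, and the sample metro names are all hypothetical.

```python
import time


class RefreshingCache:
    """Holds the whole ID -> name table in memory and reloads it when stale."""

    def __init__(self, loader, max_age_seconds, clock=time.monotonic):
        self._loader = loader          # callable that fetches every row
        self._max_age = max_age_seconds
        self._clock = clock            # injectable so the sketch is testable
        self._data = {}
        self._loaded_at = None

    def _refresh_if_stale(self):
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._max_age:
            self._data = dict(self._loader())
            self._loaded_at = now

    def name_for(self, metro_id):
        self._refresh_if_stale()
        return self._data.get(metro_id)


calls = []


def load_metros():
    """Stand-in for the remote metros DB query (assumed data)."""
    calls.append(1)
    return {1: "New York", 2: "San Francisco"}


fake_now = [0.0]
cache = RefreshingCache(load_metros, max_age_seconds=7200,
                        clock=lambda: fake_now[0])
cache.name_for(1)      # first lookup triggers the load
cache.name_for(2)      # served from memory
fake_now[0] = 7200.0   # two hours later
cache.name_for(1)      # stale, so the table is reloaded
```

Every lookup between refreshes is a plain dictionary read, which is where the speed comes from.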

This worked great, with one small problem - whenever we redeployed code to the webservers they would all restart and try to repopulate the local caches. This swamped the DB server and resulted in only some of the webservers getting the data; the rest timed out and threw errors.

To get around this I ended up writing a little app that makes a request to the DB to get all the data and then writes it to an XML file. The XML file is then robocopied to each of the webservers. The app is set up as a scheduled task on one of the servers and runs every couple hours.

Although this solution is really simple it has a bunch of benefits. First, compared to the other approach, only a fraction of the data gets sent between the two co-los (we're requesting it once instead of N times). Second, it's also much less DB intensive (for the same reason). Third, it provides a level of redundancy - if the DB happens to be down IIS will just read the old copy of the XML file and at least we'll have some of the data. And finally, I'm pretty sure that reading a local XML file is faster than requesting all the data from a remote SQL Server, so the cache gets populated faster.
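Here is a small sketch of the export-and-read-locally pattern, in Python rather than the original .NET app. The XML layout, the function names, and the file name are invented for the example, and the robocopy step that distributes the file to each webserver is left out. Writing to a temporary file and then swapping it in is an extra precaution added in this sketch, not something the post describes.

```python
import os
import tempfile
import xml.etree.ElementTree as ET


def write_snapshot(metros, path):
    """Write the whole table to an XML file (element names are assumed)."""
    root = ET.Element("metros")
    for metro_id, name in sorted(metros.items()):
        ET.SubElement(root, "metro", id=str(metro_id), name=name)
    # Write to a temp file, then swap, so readers never see a partial file.
    tmp_path = path + ".tmp"
    ET.ElementTree(root).write(tmp_path, encoding="utf-8", xml_declaration=True)
    os.replace(tmp_path, path)


def load_snapshot(path):
    """What each webserver would do at startup: read the local copy."""
    root = ET.parse(path).getroot()
    return {int(m.get("id")): m.get("name") for m in root.findall("metro")}


def export_job(fetch_all, path):
    """Scheduled job: one DB request, one file. On DB failure keep the old file."""
    try:
        metros = fetch_all()
    except ConnectionError:
        return False
    write_snapshot(metros, path)
    return True


snapshot = os.path.join(tempfile.mkdtemp(), "metros.xml")
export_job(lambda: {1: "New York", 2: "Boston"}, snapshot)


def db_is_down():
    raise ConnectionError("metros DB unreachable")


export_job(db_is_down, snapshot)   # fails, old snapshot stays in place
metros = load_snapshot(snapshot)
```

The second export fails but the earlier file is untouched, so a webserver restarting at that moment still gets the (slightly old) data.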

I'm sure it's nothing new or novel, but I found it interesting.

Sometimes resolving a performance problem is a lot like unravelling a ball of string. You pick an end and follow it back as far as you can. If that doesn't work you grab the other end and start unravelling from there.

At this point it seems like there were actually two problems that struck at the same time. First, there was some non-production photo code sitting on a test server that was deployed by accident to all the photo servers. When this was deployed it effectively broke photo uploads (and we didn't find out about it until later because of the misconfigured error logging - ouch!). When this was fixed photo uploading started working again but it was still slow.

The second problem was with the main switch at the co-lo. We use Windows Load Balancing for routing traffic between servers in a cluster and apparently this results in a lot of chatter, which overwhelmed the main switch. To alleviate this we switched to round-robin routing, which reduced the load on the switch considerably. However, we had to disable the AJAXy progress bar on the upload page because it relies on server affinity, which it no longer had.
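A toy illustration of why the progress bar depends on server affinity. It assumes the upload component keeps its byte counts in the memory of whichever server receives the upload, which is a guess about how the third-party component works; the server names and the modulo-based affinity rule are also made up.

```python
import itertools


class UploadServer:
    """Tracks upload progress in local memory only (an assumption)."""

    def __init__(self, name):
        self.name = name
        self.bytes_received = {}

    def receive(self, upload_id, n):
        self.bytes_received[upload_id] = self.bytes_received.get(upload_id, 0) + n

    def progress(self, upload_id):
        return self.bytes_received.get(upload_id)


servers = [UploadServer("photo1"), UploadServer("photo2")]

# Round-robin: the upload and the progress poll land on different servers.
rotation = itertools.cycle(servers)
next(rotation).receive("upload-1", 4096)
round_robin_answer = next(rotation).progress("upload-1")


# Affinity: every request from one client goes to the same server.
def server_for(client_id):
    return servers[client_id % len(servers)]


server_for(42).receive("upload-2", 4096)
affinity_answer = server_for(42).progress("upload-2")
```

With round-robin the poll hits a server that has never heard of the upload, so it has nothing to report; with affinity the poll reaches the server that holds the count.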

Last week Dan ordered a new switch and today it arrived. After Jon programs and installs it everything should be back to normal. Here's a picture of Dave posing with the switch:

Disclaimer: I know very little about networking, so if any of the information in this post is erroneous please correct me!

One of the problems with uploading files via a web-page is that you have no idea how the upload is progressing until it finishes. However, the capabilities provided by AJAX make it possible to check on the progress of an upload by making periodic calls to the server to find out how much of the file has been transferred. You can even go a step further and make your AJAX component/control smart enough to restart the upload if no bytes have been transferred after a few seconds. It's a great way to improve the user experience, and we use a third-party component at Xanga that does just that.

However, when there is an error in the server-side application that is supposed to receive the uploaded image, this type of smart behavior by the client can actually come back to bite you - which is what happened to us. Because the upload app was failing with an error the image transfer was never being started. The AJAX component would see that the upload failed and it would try again, about a second-and-a-half later. This cycle repeated endlessly until the user tired of waiting and closed their browser window/tab. With several thousand users trying to upload images this effectively DDOSed everything at that co-lo. The routers were at 100% CPU utilization trying to keep up with all the requests coming in, which slowed down service to everything - profile.xanga.com included, simply because it sat behind the same routers.
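Some back-of-the-envelope numbers for that retry loop, as a Python sketch. The 1.5 second interval comes from the description above; the 3000 users and the 60 seconds of user patience are assumed figures. The capped, backed-off variant is a common mitigation shown only for comparison, not something the component or the fix is known to have used.

```python
def retry_times(interval, patience, max_attempts=None, backoff=1.0):
    """Times (in seconds) at which one client sends a request before giving up."""
    times, t, delay = [], 0.0, interval
    while t <= patience and (max_attempts is None or len(times) < max_attempts):
        times.append(t)
        t += delay
        delay *= backoff   # backoff=1.0 means a fixed interval
    return times


# Endless fixed-interval retries until the user closes the tab after a minute.
uncapped = retry_times(interval=1.5, patience=60)

# Comparison: at most five attempts, doubling the wait each time.
capped = retry_times(interval=1.5, patience=60, max_attempts=5, backoff=2.0)

# Aggregate load while everyone is stuck in the loop (user count is assumed).
assumed_users = 3000
requests_per_second = assumed_users / 1.5
```

Under these assumptions each stuck client sends 41 requests in a minute instead of one, and the co-lo sees about 2000 failed requests per second.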

The problem was exacerbated by the fact that the tool we use to aggregate and analyze errors from all the servers was misconfigured on the upload servers, so we had no idea that the application was broken. I'm not sure who figured out that the upload app was broken (Bob?), or how, but after that was known the rest of the story fell into place quickly enough. Needless to say, the problem has been fixed and the error aggregation tool has been properly configured.

After spending a couple hours looking at Perfmon counters on several different webservers I was pretty sure we weren't having an application problem. However, I still wasn't positive. Later in the day Bob suggested I just try a simple ping and see what that looked like. Sure enough, pings to anything at the profile.xanga.com co-lo were an order of magnitude slower than pings to the xanga.com co-lo. Doing a traceroute revealed that the last hop was the slowest, which indicates that the problem is most likely with the network at the profile.xanga.com co-lo.

Apparently John recommended checking ping times and traceroutes two nights ago, but I missed that somehow. Moral of the story: start with the simplest tools available and work your way up. As a programmer I'm guilty of starting with the code or appserver as soon as something goes wrong.

A little over a week ago we officially launched the beta of the 'Upgraded Profiles' feature for Xanga. The Upgraded Profiles allow you to enter a lot more information about yourself, as well as receive various types of feedback from your friends and other users. My favorite is the 'memories' page, which I hope will promote the generation of substantive/meaningful content for our users. We also provide a chatboard (not as serious as leaving a memory), nudges (to get someone's attention), and the ability to connect with your friends. All in all, I think it's a good addition to the Xanga I know and love.

Where the rest of Xanga has evolved over time to handle the growth in our user base, we were able to design the profiles project from scratch. Because we have millions of users and serve so many pages a big consideration was scalability. Toward this end we distributed our databases among several servers, based on our estimates of which tables would be hit most frequently and grow the largest. One of the implications, though, was that we couldn't rely on SQL joins. Instead we had to do two DB calls (to separate DBs) and then do an 'in-memory join' on the web-server. Doing the join in SQL would definitely be faster (at the expense of some extra DB CPU cycles). As I was thinking about this I realized that designing for scalability is like finding a global maximum on a curve, which is a lot different than finding a local maximum. In most cases the SQL-join version of the code will be faster than the scalable approach, until you reach the point at which the DB throughput is maxed out. At this point, of course, the scalable version will be faster simply because it still works (the DB hasn't rolled over and died). You could make the SQL join version work a little longer by buying a bigger server, and perhaps that's a valid solution, but then you've got downtime while you migrate data, and that's no good.
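A minimal sketch of an 'in-memory join', in Python rather than the ASP.Net code. The two module-level collections stand in for tables on two separate database servers, and the table contents, column names, and function names are all invented for the example.

```python
# Stand-ins for tables living on two different DB servers (assumed schema).
MEMORIES_DB = [
    {"profile_id": 10, "author_id": 1, "text": "PDC road trip"},
    {"profile_id": 10, "author_id": 2, "text": "Moving the racks"},
    {"profile_id": 11, "author_id": 1, "text": "Unrelated"},
]
USERS_DB = {1: {"name": "alice"}, 2: {"name": "bob"}}


def fetch_memories(profile_id):
    """DB call 1: rows for one profile."""
    return [row for row in MEMORIES_DB if row["profile_id"] == profile_id]


def fetch_users(user_ids):
    """DB call 2: only the users that call 1 referenced."""
    return {uid: USERS_DB[uid] for uid in user_ids if uid in USERS_DB}


def memories_with_authors(profile_id):
    """Join the two result sets on the web-server instead of in SQL."""
    memories = fetch_memories(profile_id)
    author_ids = {m["author_id"] for m in memories}
    users = fetch_users(author_ids)
    # Inner-join semantics: drop rows whose author could not be found.
    return [{**m, "author_name": users[m["author_id"]]["name"]}
            for m in memories if m["author_id"] in users]


rows = memories_with_authors(10)
```

The join work (collecting IDs, building the lookup, merging rows) moves onto the web-server's CPU, and web-servers are easier to add than DB capacity.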

Anyway, with all the work we did to make the new profiles fast it was very disconcerting when things started to slow down this past weekend. I spent some time yesterday logging into the DB servers and checking memory usage, CPU usage, etc, and none of them were even breaking a sweat. Before we launched Dan spent some time looking at the queries and making sure we had the right indices/pks defined, so that's not likely to be the problem. Next I logged in to a couple web-servers, which were also barely exerting themselves. There are a couple things that make me think this isn't a code issue. First, profiles are hosted at a different co-lo than regular Xanga, and both the projects we have running at this co-lo seem to be responding very slowly to requests (profiles and photos are both slow). Secondly, even our error page, which does no data processing and is a very small page, also takes a long time to load. The fact that two unrelated code-bases are slowing down (and they're on separate servers) makes me think it must be something they have in common - they're both on ASP.Net 2.0 and they're both at the same co-lo. And it's not likely that it's ASP.Net 2.0. But, before I rule out an app level problem I need to spend some more time looking at the web-servers.

If I were using WebSphere it would be easy for me to confirm that this isn't an application problem: I'd just log in to the admin console, pull up TPV, and then look at the thread pools, the JSP/Servlet response times, JVM stats, and connection pools. In about six page views I would have a good idea if there was an app-server level bottleneck or not. Unfortunately, I have no idea what the analogous tools are for IIS. Perfmon seems to have some helpful counters, but I haven't found anything that will tell me average response times on a page-by-page basis. Today I'll be spending some time trying to find that out.