When we started work on the new profile manager, specifically the parts that involve managing your friends and invites, we wanted to use AJAX to provide a fast and effective user interface. At the start I spent a lot of time beating my head against a wall trying to figure out how to structure the JS code so it wasn't one big hack. While a lot of the JS code was for display purposes, some of it was also business logic, which now straddled both the client and the server.

We decided early on to use Microsoft's Atlas framework, and one of the benefits it provides is the ability to use namespaces and create classes in JavaScript. From a programming perspective this at least allows you to organize your code (though it doesn't do anything to ensure you have a good architecture). To achieve a simple yet solid architecture I wanted to create an MVC-based system. This system would have simple objects, retrieved from the server via Atlas ASMX proxies (another big benefit of Atlas); business objects that wrapped these simple objects and provided some simple functionality; views that were HTML representations of these business objects; a single page class that managed all the resources for the current page; and various util classes like sorters (for sorting objects), pagers (for paging through arrays of objects), and controls (a JS/XHTML/CSS combo that provides UI functionality).

The image below is a simplified view of this architecture:

As you can see, one object may have multiple views associated with it, rendered in different contexts of the page.
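
To make the layering concrete, here is a minimal sketch of a simple object, a business object that wraps it, and two views of the same object. It's written in Python rather than the Atlas-style JavaScript we actually used, and every name in it (FriendData, Friend, DefaultView, EditView, attach, add_group) is made up for illustration. The notify-the-views-on-change wiring is also an assumption, not a description of the real code.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class FriendData:
    """Simple object: plain data, as a server proxy might return it."""
    username: str
    groups: List[str] = field(default_factory=list)


class Friend:
    """Business object: wraps the simple object and adds a little behaviour."""

    def __init__(self, data: FriendData) -> None:
        self._data = data
        self._views = []  # every view currently showing this object

    @property
    def username(self) -> str:
        return self._data.username

    @property
    def groups(self) -> List[str]:
        return list(self._data.groups)

    def attach(self, view) -> None:
        """Register a view and render it once with the current state."""
        self._views.append(view)
        view.render(self)

    def add_group(self, group: str) -> None:
        """Change the model, then re-render every attached view."""
        if group not in self._data.groups:
            self._data.groups.append(group)
        for view in self._views:
            view.render(self)


class DefaultView:
    """Card-like representation of a Friend."""

    def __init__(self) -> None:
        self.html = ""

    def render(self, friend: Friend) -> None:
        self.html = f'<div class="card">{friend.username}</div>'


class EditView:
    """Edit representation of the same Friend, listing its groups."""

    def __init__(self) -> None:
        self.html = ""

    def render(self, friend: Friend) -> None:
        items = "".join(f"<li>{g}</li>" for g in friend.groups)
        self.html = f"<form><h3>{friend.username}</h3><ul>{items}</ul></form>"
```

The point of the sketch is only that the Friend object knows nothing about HTML, and each view knows nothing about where the data came from.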

Here's an example of how this looks in practice:

The items highlighted in red are views of business objects - specifically a Friend object and a Group object. The Group object has two views - one on the drop-down menu (which is a control, highlighted in green) and one on the Edit view of the Friend object. The Friend object also has two views - an Edit view (the layer that is hovering, which is actually a view built on a control, though it's not highlighted in green) and the Default view, which is the card-like representation of a Friend.

The separation between the view and the model is very clean; however, the controller and the view aren't separated as cleanly as I would prefer. This is because the view frequently provides actions for the user to perform. For example, if I were to click the checkbox next to 'strangers' and then click the 'Save' button, the Friend object is updated and then the view makes a web-service call to the server to update the Friend object in the DB. The call is asynchronous, so the view needs to find out whether or not it succeeded and then act accordingly.

One of the benefits of this system is that views are only rendered for objects that are currently visible. Doing this greatly reduces the complexity of the DOM. For instance, Dan has over 1300 friends, and though the page takes two or three seconds to load (over broadband), once the data is loaded navigating through it is very fast. Also, because all the data is loaded on the client we can provide very fast sorting and filtering (without a roundtrip). The burden of this sorting and filtering isn't borne by our servers, either. In fact, none of the data is sorted by the DB; all it has to do is SELECT WITH(NOLOCK) WHERE - which is very quick.
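
Here's a rough sketch of the sorter and pager idea: keep the whole list in memory, sort it on the client, and only build markup for the page that is currently visible. It's in Python for brevity, and the names (sort_by, Pager, render_page) and the page size of 50 are assumptions for illustration, not our actual classes or settings.

```python
from typing import Callable, List, Sequence


def sort_by(items: Sequence[str], key: Callable[[str], object],
            descending: bool = False) -> List[str]:
    """Sorter: return a new sorted list without touching the server."""
    return sorted(items, key=key, reverse=descending)


class Pager:
    """Pager: slices an in-memory list into fixed-size, 1-based pages."""

    def __init__(self, items: Sequence[str], page_size: int) -> None:
        if page_size < 1:
            raise ValueError("page_size must be at least 1")
        self.items = list(items)
        self.page_size = page_size

    @property
    def page_count(self) -> int:
        # Ceiling division; an empty list still has one (empty) page.
        return max(1, -(-len(self.items) // self.page_size))

    def page(self, number: int) -> List[str]:
        if not 1 <= number <= self.page_count:
            raise IndexError("page number out of range")
        start = (number - 1) * self.page_size
        return self.items[start:start + self.page_size]


def render_page(pager: Pager, number: int) -> List[str]:
    """Build markup only for the objects on the visible page."""
    return [f'<div class="card">{name}</div>' for name in pager.page(number)]


# 1300 hypothetical friends, sorted client-side, 50 cards rendered at a time.
friends = [f"friend{i:04d}" for i in range(1300)]
pager = Pager(sort_by(friends, key=str.lower, descending=True), page_size=50)
visible_cards = render_page(pager, 1)
```

However long the list gets, the DOM only ever holds one page of cards.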

Sometimes resolving a performance problem is a lot like unravelling a ball of string. You pick an end and follow it back as far as you can. If that doesn't work you grab the other end and start unravelling from there.

At this point it seems like there were actually two problems that struck at the same time. First, there was some non-production photo code sitting on a test server that was deployed by accident to all the photo servers. When this was deployed it effectively broke photo uploads (and we didn't find out about it until later because of the misconfigured error logging - ouch!). When this was fixed photo uploading started working again but it was still slow.

The second problem was with the main switch at the co-lo. We use Windows Load Balancing for routing traffic between servers in a cluster and apparently this results in a lot of chatter, which overwhelmed the main switch. To alleviate this we switched to round-robin routing, which reduced the load on the switch considerably. However, we had to disable the AJAXy progress bar on the upload page because it relies on server affinity, which it no longer had.
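
To show why a progress bar can depend on server affinity, here's a toy model in Python. It assumes (and this is my assumption, I don't know the internals of the component) that each upload server tracks progress in its own memory. All class and method names are hypothetical.

```python
from itertools import cycle
from typing import Dict, List, Optional


class UploadServer:
    """Toy server that tracks upload progress only in its own memory."""

    def __init__(self, name: str) -> None:
        self.name = name
        self._progress: Dict[str, int] = {}

    def receive_chunk(self, upload_id: str, num_bytes: int) -> None:
        self._progress[upload_id] = self._progress.get(upload_id, 0) + num_bytes

    def bytes_received(self, upload_id: str) -> Optional[int]:
        # None means "this server has never heard of that upload".
        return self._progress.get(upload_id)


class RoundRobinRouter:
    """Sends each request to the next server in turn, ignoring the client."""

    def __init__(self, servers: List[UploadServer]) -> None:
        self._next = cycle(servers)

    def route(self, client_id: str) -> UploadServer:
        return next(self._next)


class StickyRouter:
    """Sends every request from one client to the same server (affinity)."""

    def __init__(self, servers: List[UploadServer]) -> None:
        self._servers = servers
        self._assigned: Dict[str, UploadServer] = {}

    def route(self, client_id: str) -> UploadServer:
        if client_id not in self._assigned:
            index = len(self._assigned) % len(self._servers)
            self._assigned[client_id] = self._servers[index]
        return self._assigned[client_id]
```

With the round-robin router the upload lands on one server and the progress poll lands on the next, which has nothing to report. With the sticky router both requests hit the same server.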

Last week Dan ordered a new switch and today it arrived. After Jon programs and installs it everything should be back to normal. Here's a picture of Dave posing with the switch:

Disclaimer: I know very little about networking. If any of the information in this post is erroneous, please correct me!

One of the problems with uploading files via a web-page is that you have no idea how the upload is progressing until it finishes. However, the capabilities provided by AJAX make it possible to check on the progress of an upload by making periodic calls to the server to find out how much of the file has been transferred. You can even go a step further and make your AJAX component/control smart enough to restart the upload if no bytes have been transferred after a few seconds. It's a great way to improve the user experience, and we use a third-party component at Xanga that does just that.

However, when there is an error in the server-side application that is supposed to receive the uploaded image, this type of smart behavior by the client can actually come back to bite you - which is what happened to us. Because the upload app was failing with an error, the image transfer was never being started. The AJAX component would see that the upload failed and it would try again, about a second-and-a-half later. This cycle repeated endlessly until the user tired of waiting and closed their browser window/tab. With several thousand users trying to upload images this effectively DDoSed everything at that co-lo. The routers were at 100% CPU utilization trying to keep up with all the requests coming in, which slowed down service to everything - profile.xanga.com included, simply because it sat behind the same routers.
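
For contrast with the endless fixed-interval retry described above, here's a sketch of a retry loop that gives up after a few attempts and waits longer each time. To be clear, this is not how the third-party component works and not necessarily how we fixed things; it's just one common way to keep a client from hammering a broken server. The function name, the cap of five attempts, and the doubling factor are all assumptions (only the 1.5 second starting delay echoes the behavior we saw).

```python
import time
from typing import Callable


def upload_with_backoff(attempt_upload: Callable[[], bool],
                        max_attempts: int = 5,
                        base_delay: float = 1.5,
                        factor: float = 2.0,
                        sleep: Callable[[float], None] = time.sleep) -> bool:
    """Try an upload a bounded number of times, waiting longer after each failure.

    attempt_upload is a hypothetical callable that returns True on success.
    sleep is injectable so the delays can be observed without really waiting.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        if attempt_upload():
            return True
        if attempt < max_attempts:
            sleep(delay)     # back off before the next try
            delay *= factor  # and wait even longer next time
    return False  # give up and let the caller show an error to the user
```

Against a server that always fails, this makes five requests and stops, instead of one request every second and a half for as long as the tab stays open.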

The problem was exacerbated by the fact that the tool we use to aggregate and analyze errors from all the servers was misconfigured on the upload servers, so we had no idea that the application was broken. I'm not sure who figured out that the upload app was broken (Bob?), or how, but after that was known the rest of the story fell into place quickly enough. Needless to say, the problem has been fixed and the error aggregation tool has been properly configured.

After spending a couple hours looking at Perfmon counters on several different webservers I was pretty sure we weren't having an application problem. However, I still wasn't positive. Later in the day Bob suggested I just try a simple ping and see what that looked like. Sure enough, pings to anything at the profile.xanga.com co-lo were an order of magnitude slower than pings to the xanga.com co-lo. Doing a traceroute revealed that the last hop was the slowest, which indicates that the problem is most likely with the network at the profile.xanga.com co-lo.

Apparently John recommended checking ping times and traceroutes two nights ago, but I missed that somehow. Moral of the story: start with the simplest tools available and work your way up. As a programmer I'm guilty of starting with the code or appserver as soon as something goes wrong.

A little over a week ago we officially launched the beta of the 'Upgraded Profiles' feature for Xanga. The Upgraded Profiles allow you to enter a lot more information about yourself, as well as receive various types of feedback from your friends and other users. My favorite is the 'memories' page, which I hope will promote the generation of substantive/meaningful content for our users. We also provide a chatboard (not as serious as leaving a memory), nudges (to get someone's attention), and the ability to connect with your friends. All in all, I think it's a good addition to the Xanga I know and love.

Where the rest of Xanga has evolved over time to handle the growth in our user base, we were able to design the profiles project from scratch. Because we have millions of users and serve so many pages, a big consideration was scalability. Toward this end we distributed our databases among several servers, based on our estimates of which tables would be hit most frequently and grow the largest. One of the implications, though, was that we couldn't rely on SQL joins. Instead we had to do two DB calls (to separate DBs) and then do an 'in-memory join' on the web-server. Doing the join in SQL would definitely be faster (at the expense of some extra DB CPU cycles). As I was thinking about this I realized that designing for scalability is like finding a global maximum on a curve, which is a lot different than finding a local maximum. In most cases the SQL-join version of the code will be faster than the scalable approach, until you reach the point at which the DB throughput is maxed out. At this point, of course, the scalable version will be faster simply because it still works (the DB hasn't rolled over and died). You could make the SQL join version work a little longer by buying a bigger server, and perhaps that's a valid solution, but then you've got downtime while you migrate data, and that's no good.
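
For anyone wondering what an 'in-memory join' looks like, here's a minimal sketch in Python (our real code is ASP.NET, and this isn't it). The two lists stand in for the results of the two DB calls; the function name, column names, and rows are all invented for illustration.

```python
from typing import Dict, Hashable, Iterable, List


def in_memory_join(left_rows: Iterable[dict], right_rows: Iterable[dict],
                   key: str) -> List[dict]:
    """Inner-join two result sets on a shared column using a hash index."""
    # Index the right-hand rows by key so each lookup is a dictionary hit.
    index: Dict[Hashable, List[dict]] = {}
    for row in right_rows:
        index.setdefault(row[key], []).append(row)

    joined = []
    for left in left_rows:
        for right in index.get(left[key], []):
            joined.append({**left, **right})  # merge the two rows into one
    return joined


# Hypothetical results of two separate DB calls to two separate databases.
friend_rows = [
    {"user_id": 1, "friend_since": 2004},
    {"user_id": 2, "friend_since": 2005},
    {"user_id": 3, "friend_since": 2006},
]
profile_rows = [
    {"user_id": 1, "display_name": "alpha"},
    {"user_id": 2, "display_name": "beta"},
]

friends_with_profiles = in_memory_join(friend_rows, profile_rows, key="user_id")
```

The web-server does work the DB would otherwise do, but web-servers are a lot easier to add than DB capacity.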

Anyway, with all the work we did to make the new profiles fast it was very disconcerting when things started to slow down this past weekend. I spent some time yesterday logging into the DB servers and checking memory usage, CPU usage, etc., and none of them were even breaking a sweat. Before we launched Dan spent some time looking at the queries and making sure we had the right indices/PKs defined, so that's not likely to be the problem. Next I logged in to a couple web-servers, which were also barely exerting themselves. There are a couple things that make me think this isn't a code issue. First, profiles are hosted at a different co-lo than regular Xanga, and both the projects we have running at this co-lo seem to be responding very slowly to requests (profiles and photos are both slow). Secondly, even our error page, which does no data processing and is a very small page, also takes a long time to load. The fact that two unrelated code-bases are slowing down (and they're on separate servers) makes me think it must be something they have in common - they're both on ASP.NET 2.0 and they're both at the same co-lo. And it's not likely that it's ASP.NET 2.0. But, before I rule out an app level problem I need to spend some more time looking at the web-servers.

If I were using WebSphere it would be easy for me to confirm that this isn't an application problem: I'd just log in to the admin console, pull up TPV, and then look at the thread pools, the JSP/Servlet response times, JVM stats, and connection pools. In about six page views I would have a good idea if there was an app-server level bottleneck or not. Unfortunately, I have no idea what the analogous tools are for IIS. Perfmon seems to have some helpful counters, but I haven't found anything that will tell me average response times on a page-by-page basis. Today I'll be spending some time trying to find that out.