- by Joel
- 03/28/2006
- Performance, Xanga
A little over a week ago we officially launched the beta of the 'Upgraded Profiles' feature for Xanga. The Upgraded Profiles allow you to enter a lot more information about yourself, as well as receive various types of feedback from your friends and other users. My favorite is the 'memories' page, which I hope will promote the generation of substantive/meaningful content for our users. We also provide a chatboard (not as serious as leaving a memory), nudges (to get someone's attention), and the ability to connect with your friends. All in all, I think it's a good addition to the Xanga I know and love.
Where the rest of Xanga has evolved over time to handle the growth in our user base, we were able to design the profiles project from scratch. Because we have millions of users and serve so many pages, a big consideration was scalability. Toward this end we distributed our databases among several servers, based on our estimates of which tables would be hit most frequently and grow the largest. One of the implications, though, was that we couldn't rely on SQL joins between tables that live on different servers. Instead we had to do two DB calls (to separate DBs) and then do an 'in-memory join' on the web-server. Doing the join in SQL would definitely be faster (at the expense of some extra DB CPU cycles). As I was thinking about this I realized that designing for scalability is like finding a global maximum on a curve, which is a lot different from finding a local maximum. In most cases the SQL-join version of the code will be faster than the scalable approach, until you reach the point at which the DB throughput is maxed out. At this point, of course, the scalable version will be faster simply because it still works (the DB hasn't rolled over and died). You could make the SQL-join version work a little longer by buying a bigger server, and perhaps that's a valid solution, but then you've got downtime while you migrate data, and that's no good.
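To make the 'in-memory join' idea concrete, here is a minimal sketch in Python (not the ASP.Net code we actually run). The two lists stand in for the rows returned by the two separate DB calls; the function name `in_memory_join` and the field names `user_id`, `author_id`, `name`, and `text` are made up for illustration and are not our real schema.

```python
def in_memory_join(profiles, memories):
    """Hash join: index one result set by key, then probe it with the other."""
    # Build a lookup table from the (hypothetical) profiles result set.
    by_user = {row["user_id"]: row for row in profiles}

    joined = []
    for mem in memories:
        profile = by_user.get(mem["author_id"])
        if profile is not None:  # inner-join semantics: skip unmatched rows
            joined.append({**mem, "author_name": profile["name"]})
    return joined


# Stand-ins for the rows returned by the two separate DB calls.
profiles = [{"user_id": 1, "name": "joel"}, {"user_id": 2, "name": "dan"}]
memories = [
    {"author_id": 2, "text": "Remember the launch?"},
    {"author_id": 1, "text": "Great weekend."},
    {"author_id": 3, "text": "No matching profile."},
]

for row in in_memory_join(profiles, memories):
    print(row["author_name"], "-", row["text"])
```

The web-server pays for the dictionary build and the loop, which is the CPU that a SQL join would otherwise have spent on the DB box. That is the trade described above: a bit slower per request, but neither database has to know about the other.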
Anyway, with all the work we did to make the new profiles fast, it was very disconcerting when things started to slow down this past weekend. I spent some time yesterday logging into the DB servers and checking memory usage, CPU usage, etc., and none of them were even breaking a sweat. Before we launched, Dan spent some time looking at the queries and making sure we had the right indices/PKs defined, so that's not likely to be the problem. Next I logged in to a couple of web-servers, which were also barely exerting themselves. There are a couple of things that make me think this isn't a code issue. First, profiles are hosted at a different co-lo than regular Xanga, and both the projects we have running at this co-lo seem to be responding very slowly to requests (profiles and photos are both slow). Second, even our error page, which does no data processing and is a very small page, takes a long time to load. The fact that two unrelated code-bases are slowing down (and they're on separate servers) makes me think it must be something they have in common: they're both on ASP.Net 2.0 and they're both at the same co-lo. And it's not likely that it's ASP.Net 2.0. But before I rule out an app-level problem I need to spend some more time looking at the web-servers.
If I were using WebSphere it would be easy for me to confirm that this isn't an application problem. I'd just log in to the admin console, pull up TPV (the Tivoli Performance Viewer), and then look at the thread pools, the JSP/Servlet response times, JVM stats, and connection pools. In about six page views I would have a good idea of whether there was an app-server level bottleneck or not. Unfortunately, I have no idea what the analogous tools are for IIS. Perfmon seems to have some helpful counters, but I haven't found anything that will tell me average response times on a page-by-page basis. Today I'll be spending some time trying to find that out.
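One possible way to get page-by-page averages, sketched here under assumptions rather than offered as the answer: IIS can write its request logs in the W3C extended format, and if the `cs-uri-stem` and `time-taken` fields are enabled, the averages can be computed from the log files themselves. The Python below assumes both fields are being logged, that `time-taken` is in milliseconds, and that fields are space-separated. The function name `average_time_taken`, the page names, and the sample lines are all made up for illustration.

```python
from collections import defaultdict


def average_time_taken(log_lines):
    """Return {uri_stem: mean time-taken} from W3C extended log lines."""
    fields = []
    totals = defaultdict(lambda: [0, 0])  # uri -> [sum of time-taken, hits]

    for line in log_lines:
        line = line.strip()
        if line.startswith("#Fields:"):
            # The #Fields directive names the columns of the lines that follow.
            fields = line[len("#Fields:"):].split()
            continue
        if not line or line.startswith("#") or not fields:
            continue

        parts = line.split()
        if len(parts) != len(fields):
            continue  # skip malformed lines
        row = dict(zip(fields, parts))
        if "cs-uri-stem" not in row or "time-taken" not in row:
            continue  # assumption violated: the needed fields are not logged
        try:
            elapsed = int(row["time-taken"])
        except ValueError:
            continue

        entry = totals[row["cs-uri-stem"].lower()]
        entry[0] += elapsed
        entry[1] += 1

    return {uri: total / hits for uri, (total, hits) in totals.items()}


# Made-up sample lines, not real log data.
sample = [
    "#Fields: date time cs-method cs-uri-stem sc-status time-taken",
    "2006-03-27 14:00:01 GET /profile.aspx 200 1200",
    "2006-03-27 14:00:02 GET /error.aspx 500 900",
    "2006-03-27 14:00:03 GET /profile.aspx 200 800",
]

for uri, avg in sorted(average_time_taken(sample).items()):
    print(uri, avg)
```

This only gives after-the-fact numbers, not the live view that TPV provides. Still, if a tiny error page shows the same large averages as the heavy pages, that would point away from the application code.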