Sometimes resolving a performance problem is a lot like unravelling a ball of string. You pick an end and follow it back as far as you can. If that doesn't work, you grab the other end and start unravelling from there.

At this point it seems like there were actually two problems that struck at the same time. First, some non-production photo code sitting on a test server was deployed by accident to all the photo servers. That deployment effectively broke photo uploads (and we didn't find out about it until later because of the misconfigured error logging - ouch!). Once this was fixed, photo uploading started working again, but it was still slow.

The second problem was with the main switch at the co-lo. We use Windows Load Balancing to route traffic between servers in a cluster, and apparently this generates a lot of chatter, which overwhelmed the main switch. To alleviate this we switched to round-robin routing, which reduced the load on the switch considerably. However, we had to disable the AJAXy progress bar on the upload page because it relies on server affinity, which round-robin routing doesn't provide.
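
For anyone wondering why the progress bar cares about server affinity, here is a rough Python sketch of the idea. It is not our actual code: the server names, the `Server` class and both routing functions are made up for illustration, and it assumes that upload progress lives only in the memory of whichever server is receiving the file.

```python
from itertools import cycle
import zlib


class Server:
    """Hypothetical photo server that tracks upload progress in its own memory."""

    def __init__(self, name):
        self.name = name
        self.progress = {}  # upload_id -> percent complete, visible to this server only

    def handle_chunk(self, upload_id, percent):
        self.progress[upload_id] = percent

    def get_progress(self, upload_id):
        # Returns None if this server never saw the upload.
        return self.progress.get(upload_id)


def round_robin(servers):
    """Each new request goes to the next server in turn, whoever the client is."""
    rotation = cycle(servers)
    return lambda client_id: next(rotation)


def sticky(servers):
    """Affinity: a given client always lands on the same server."""
    return lambda client_id: servers[zlib.crc32(client_id.encode()) % len(servers)]


def simulate(make_router):
    servers = [Server("photo1"), Server("photo2"), Server("photo3")]
    route = make_router(servers)

    # The upload itself is one long request, so it stays on a single server.
    upload_server = route("client-a")

    results = []
    for percent in (25, 50, 75):
        upload_server.handle_chunk("upload-1", percent)
        # Each progress poll is a separate request and gets routed again.
        results.append(route("client-a").get_progress("upload-1"))
    return results


if __name__ == "__main__":
    print("round-robin:", simulate(round_robin))
    print("sticky:     ", simulate(sticky))
```

With round-robin, most of the polls land on a server that has never heard of the upload, so the sketch prints `[None, None, 75]`. With affinity, every poll reaches the server holding the state, giving `[25, 50, 75]`.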

Last week Dan ordered a new switch and today it arrived. After Jon programs and installs it everything should be back to normal. Here's a picture of Dave posing with the switch:

Disclaimer: I know very little about networking. If any of the information in this post is erroneous, please correct me!
