July 4th is one of my favorite holidays… hitting the beach… barbecuing… cold beers… and fireworks! BUT working in IT brings with it the possibility of having your holiday plans interrupted by server/network outages. I can’t remember a 4th of July where I didn’t get a call that something was up with one of my servers, and 2010 would be no different.

It started at 6:30AM… The main website for our Canadian office was unreachable. So I booted my laptop and checked the site. Hmmm… It came up fine for me. Perhaps the server’s admin got to it before me. I called corporate and fixed myself a cup of coffee. 45 minutes later the phone rang again. “The site is down again!” I walked back to the computer, coffee in hand, and indeed my browser timed out. Hmmm… OK, something’s not right. I VPN’ed into the box and pointed the server’s browser to the website, and the site loaded BUT not as fast as I would have expected. The load on the server looked a bit high to me, but this wasn’t my box and I didn’t know what the normal numbers for the box were! I started to pore through the Apache server error logs looking for answers. Nothing there! Back to the browser… The page loaded fine, the speed having returned. I turned to my wife’s computer (to make sure I took my own machine out of the equation) and pointed her browser to the site. This time I got a strange error message in the browser window…

“Can’t connect to the database too many connections open.”

Hmmm. Strange? Let me refresh my browser… the site pops back up. OK… let’s jump on the box and have a look at what’s going on… CPU utilization looks normal… MySQL looks OK… Refresh the browser… the site is still up. OK, let’s have a look at the MySQL logs… Still nothing. So I called the developer to confirm that no moves were made into Production on Friday. I rebooted the box and everything seemed to return to normal. 2 hours later the phone rings once more… “The site is down again, Bill.” Man, this isn’t even my server… This is really going to be bad if I have to reboot the server every 2 hours this weekend. I opened a browser window and got the database connection error message again. OK, let’s take a look at the system logs… WOW, that’s funny, the kernel is error’ing out and throttling back the network stack. OK… Let’s see what netstat turns up… ouch! There were hundreds of connections in a FIN_WAIT or SYN_RECEIVED state. What’s going on? Did someone patch the OS on this box? Nope… Let’s check the throughput of this box… 75,000 requests per second… Now one could dream, but I’d think this was a pretty rare occasion for this domain! OK… Let’s see if I could get at the firewall logs… Sure enough, there were thousands of connections open. WOW, we were in the middle of a DDoS (Distributed Denial of Service) attack. I couldn’t believe it.
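For anyone who wants to repeat the netstat check without eyeballing hundreds of lines, here is a minimal sketch in Python that tallies TCP connections by state. It is not what I ran that morning. It assumes you feed it the text of `netstat -an` and that the state is the last column on each TCP line; the function names `count_tcp_states` and `suspect_total` are made up for this example. State spellings differ between operating systems (SYN_RECV on Linux, SYN_RECEIVED elsewhere), so it matches by prefix.

```python
from collections import Counter

# States that tend to pile up when a box is being flooded with
# half-open or half-closed connections. Spellings vary by OS
# (SYN_RECV vs SYN_RECEIVED, FIN_WAIT1 vs FIN_WAIT_1), so match by prefix.
SUSPECT_PREFIXES = ("SYN_REC", "FIN_WAIT")


def count_tcp_states(netstat_output: str) -> Counter:
    """Count TCP connections per state from the text of `netstat -an`.

    Assumption: the connection state is the last column of each TCP line.
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Only look at TCP lines (tcp, tcp4, tcp6, or TCP on Windows).
        if len(fields) < 2 or not fields[0].lower().startswith("tcp"):
            continue
        state = fields[-1]
        # Skip lines whose last column is an address rather than a state name.
        if state.isupper() and state.replace("_", "").isalnum():
            counts[state] += 1
    return counts


def suspect_total(counts: Counter) -> int:
    """Total connections in SYN-received or FIN-wait style states."""
    return sum(n for state, n in counts.items()
               if state.startswith(SUSPECT_PREFIXES))
```

Save the output with something like `netstat -an > conns.txt`, read the file into a string, and pass it to `count_tcp_states`. What counts as "too many" depends on the box, which is exactly the problem I had with a server that wasn't mine: a few lingering connections are normal, hundreds are worth a closer look.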

The point of this is that it doesn’t take fancy network tools to figure out what’s going wrong with a machine. I didn’t use a network sniffer. The box was not one of mine, so I didn’t know the normal state of the server. I used what was on the machine and started by eliminating variables. But the really big lesson learned is that it doesn’t matter how small you think your site is; it could always be the target of something like a DDoS.