A Day in the Life of a Network Administrator THE CALM AFTER THE STORM… How did a day of crisis prompt my company to rethink the way we managed our network? Pretty easily. We knew we needed to change, and we’d already had our fair share of problems, but it took a real crisis to spur us into action. With our new network management solution in place, we are shocked at what had been happening on our network for who knows how long! Now we know exactly what’s working and what’s not – which is a great position to be in. As a network administrator or manager, I’m sure you’ve lived through a few crises yourself. If you haven’t already learned from them, perhaps our story will drive you to action too.
HOW IT ALL STARTED It started with an unforeseen network outage on a weekday, right in the middle of our peak sales season. Sure, our network had gone down before, even for several hours, but never at such a critical time. We basically lost access to our customers on one of our busiest sales days of the year. We do nearly 25% of our annual business during our peak season, so any kind of network availability issue costs the company big dollars. In hindsight, we know our crisis that day was complex. Multiple things were going wrong, but we had no clear way of seeing whether and how they were related. We could only guess at the reasons and underlying causes, because we had no unified way of troubleshooting or visualizing the network dependencies. It turned out to be a very educational day for us: we learned from the crisis and changed the way we manage the network. As a result, we’re much better equipped to minimize downtime and outages and, in some cases, even prevent them.
FROM NETWORK SLOWDOWN TO A CRISIS The first hint of a problem showed up early that morning. One of our company sales reps called to say that her Webex session was slow. These slowdowns were usually temporary and resulted from occasional spikes in traffic. We were all set to add a new T1 line the following month, but that wouldn’t help us today. In the next ten minutes we got two more calls – one from a sales rep saying that his (VoIP) phone conversations had become hard to follow, and the other from our Webmaster, who had noticed that our order entry application web pages were taking a long time to load. When it rains, it pours. Next, we got a call from our telesales manager telling us that all sales reps were experiencing noticeable problems on the phone. With this news, we knew it was not an isolated user problem at all – it was now a network-wide issue. We had to act fast.
ABOUT ME My name is Mark Brown and I’m a Network Administrator. I have a degree in Information Technology and have been in my job for almost four years.
MY COMPANY I work for a medical device and technology reseller. My boss (Director of IT) and I are responsible for supporting 80 people at our office. We do nearly half of our business online and the rest via our telesales team. For a relatively small company, we have a pretty sophisticated infrastructure and key business apps which need to be available 24x7.
TECHNOLOGY ENVIRONMENT Our web site and app servers are located in a datacenter upstate, but our email servers, file servers, VoIP servers and demo machines are in-house. Our sales team uses Webex regularly, and we migrated to a VoIP system about two years ago. Altogether, we have approximately 20 servers, 90 workstations and phones, and 40 network devices.
BEFORE AND AFTER For the last six months, we’ve been using a network and systems management solution called WhatsUp Gold. It basically keeps an eye on our network infrastructure, so I can focus on what I need to get done. I used to be forever behind schedule, even coming in on weekends. Now, all that has changed, and it’s a great feeling personally and professionally to be ahead of what’s going on rather than behind it.
TROUBLESHOOTING THE OLD FASHIONED WAY We checked the stats on the VoIP systems management portal – and sure enough, latency was high and call quality was down. It looked like network congestion. We’d used a free tool called Wireshark to troubleshoot network traffic issues before, so we set it up to monitor the current problem. Yet when we looked at the results from Wireshark, the traffic seemed within range. We began to think the problem could be our external link. Next step – call the service provider. We spent 30 minutes on the phone troubleshooting our gateway router and external link. Our service provider told us both tested fine, so it wasn’t a link (or internet connectivity) problem after all. During the call they gave us some interesting news: they were seeing occasional bursts of traffic on our external link. Perhaps it was a congestion problem after all. We checked the traffic again via Wireshark and there it was! We had missed it before, because the traffic was fluctuating wildly. We were most certainly congested, and the traffic was coming from inside our network. There was an extraordinary amount of HTTP and RTP header packets, and it looked like a lot of unnecessary traffic. Everyone knew how critical the peak season was to our revenues, and they were especially careful not to load the network. So, what would explain this burst of traffic?
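For readers wondering how traffic can look "within range" and still be congested, here is a minimal Python sketch of the idea. It is not what we ran that day, and it is not a Wireshark feature. It assumes you have exported packet timestamps and sizes from a capture as simple pairs; the function names, the sample data and the factor of 5 are all made up for illustration. The point is that an average over the whole capture hides short bursts, while per-second buckets expose them.

```python
from collections import defaultdict
from statistics import mean, median

def per_second_bytes(packets):
    """Sum packet sizes into one-second buckets.

    `packets` is an iterable of (timestamp_in_seconds, size_in_bytes) pairs.
    Seconds with no packets are reported as 0.
    """
    buckets = defaultdict(int)
    for ts, size in packets:
        buckets[int(ts)] += size
    if not buckets:
        return []
    first, last = min(buckets), max(buckets)
    return [buckets[s] for s in range(first, last + 1)]

def find_bursts(series, factor=5.0):
    """Return the indexes of seconds carrying more than `factor` times the median second."""
    if not series:
        return []
    baseline = max(median(series), 1)  # avoid a zero baseline on idle links
    return [i for i, volume in enumerate(series) if volume > factor * baseline]

# Hypothetical capture: a steady 1,000 bytes per second for 10 seconds,
# plus one 50,000-byte burst during second 4.
packets = [(t + 0.5, 1000) for t in range(10)] + [(4.2, 50_000)]
series = per_second_bytes(packets)

average = mean(series)        # what a quick glance at totals would suggest
bursts = find_bursts(series)  # the seconds that actually hurt call quality
print(average, bursts)
```

Comparing each second against the median rather than the mean is a deliberate choice in this sketch: the burst itself inflates the mean, so a mean-based threshold can end up hiding the very spike you are looking for.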
DIGGING DEEPER We checked the stats on the VoIP server. Sure enough, retransmission of failed packets was overloading network I/O, and CPU utilization was high. In fact, some of the calls were being routed to our backup VoIP server, which shared the same system as our order entry app server. We checked the backup server and it was overloaded too. Now we knew why the order entry app was so slow: it faced a double whammy of network congestion and server performance issues. The crisis was now full-blown. We needed to fix the situation fast. We had eliminated any issue from the service provider and from our own end-users. The network was congested off and on, and the bursts were being seen on the external link. Clearly, this meant that one or more of our internal devices were communicating with an external site. And then it dawned on us – it was a virus. That would explain the traffic and the connection to the external site. Now, it was not just an issue of performance – it could possibly be a security breach as well.
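The reasoning above (internal devices talking to an external site) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it assumes flow records are available as (source, destination, bytes) tuples, and that the internal network is 10.0.0.0/8. The addresses and the function name are hypothetical, not taken from our network. It ranks internal hosts by how much they send to addresses outside the internal range.

```python
import ipaddress
from collections import Counter

# Assumed internal address range; substitute your own.
INTERNAL = ipaddress.ip_network("10.0.0.0/8")

def external_talkers(flows, internal=INTERNAL):
    """Rank internal hosts by bytes sent to destinations outside `internal`.

    `flows` is an iterable of (src_ip, dst_ip, byte_count) tuples.
    Returns a list of (src_ip, total_bytes), largest first.
    """
    totals = Counter()
    for src, dst, nbytes in flows:
        src_is_internal = ipaddress.ip_address(src) in internal
        dst_is_internal = ipaddress.ip_address(dst) in internal
        if src_is_internal and not dst_is_internal:
            totals[src] += nbytes
    return totals.most_common()

# Hypothetical flow records (documentation address ranges used for the external sites).
flows = [
    ("10.0.1.15", "203.0.113.9", 900_000),
    ("10.0.1.15", "203.0.113.9", 600_000),
    ("10.0.2.7", "10.0.0.5", 2_000_000),   # internal to internal, ignored
    ("10.0.3.22", "198.51.100.4", 40_000),
]

ranking = external_talkers(flows)
for host, sent in ranking:
    print(host, sent)
```

A host near the top of such a list is not proof of infection, only a place to start looking.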
INOCULATING THE NETWORK Eliminating the virus by shutting down the external link was not an option. The network was too critical to the business in peak season. The next few hours were a mad rush to find the infected machines and quarantine them. We didn’t have a topology diagram, and we didn’t know what was on each subnet, which made it harder. In the end, we found four infected workstations and one server. After we shut the last one off, traffic returned to normal. We rebooted the primary VoIP server too and after nearly six hectic hours, we were back to business as usual.
PUTTING A LONG TERM SOLUTION IN PLACE This was a wake-up call for my boss, the CEO, and for me. We had discussed purchasing a network management solution before, but we were never able to carve out the budget for it. Now we knew that if we had had a solution in place, we could have saved the valuable hours we spent trying to find and solve the problem. The right performance monitoring would have alerted us to network traffic congestion and persistent high utilization on the servers. We would have known the network topology and been able to visualize the affected subnets and nodes. And active monitoring for instances of failure, like the high number of dropped packets, would have alerted us to impending faults well before they impacted our end users and our business. It didn’t take us long to make the decision to put WhatsUp Gold in charge of our network management. Thankfully, life has been quieter ever since.
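As a rough illustration of what "persistent high utilization" means in practice, here is a small Python sketch. It does not describe how WhatsUp Gold or any other product works internally; the function name, the 90% threshold and the three-sample minimum are assumptions chosen for the example. The idea is to ignore one-off spikes and flag only runs of consecutive samples above a threshold.

```python
def sustained_breaches(samples, threshold=90.0, min_consecutive=3):
    """Find runs of at least `min_consecutive` samples above `threshold`.

    `samples` is a list of utilization readings in percent, taken at a
    regular polling interval. Returns (start_index, end_index) pairs.
    """
    runs = []
    start = None
    for i, value in enumerate(samples):
        if value > threshold:
            if start is None:
                start = i  # a new run begins here
        else:
            if start is not None and i - start >= min_consecutive:
                runs.append((start, i - 1))
            start = None
    # Close a run that is still open at the end of the data.
    if start is not None and len(samples) - start >= min_consecutive:
        runs.append((start, len(samples) - 1))
    return runs

# Hypothetical CPU readings: one isolated spike, one sustained run, one short run.
cpu = [40, 95, 50, 92, 96, 97, 99, 60, 91, 93]
alerts = sustained_breaches(cpu)
print(alerts)
```

Requiring several consecutive samples trades a little detection delay for far fewer false alarms, which matters when a spike or two in traffic is normal, as it was on our network.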