Thursday, October 25, 2007

Techno zombies!

It's been non-stop action for me recently. When we left off, gentle reader, I was just coming back from my Chicago trip. I had managed to trade my first three days of oncall to go to training, but now it was time to return to the office and a mountain of work, not least of which involved tending to a pager. Thursday was a blur of non-stop go go go from the time I got settled in to the time I went to bed. There is a kind of rhythm to IT calamity; an organic element to technological failure. I'll put it like so, avoiding technical jargon: if one thing breaks, chances are good at least five people will tell you to fix it. Also, because each thing depends on another thing, when one thing breaks it generally causes a chain reaction of breakage. This creates a notification/work multiplier. The good news is troubleshooting problems is much like slaying zombies: kill the head zombie and they all die. So if you weed out and correct the problem, the rest of the world can start moving again. The only issue is that the more things depended on the core thing that broke, the more people notice the breakage, the more people tell you to fix it, and the more "paperwork" there is after the fact. It's a nasty cycle. I'm grateful that I actually had a chance to get settled in and check my mail over my coffee and a Pop-Tart prior to the floodgates opening.

I say paperwork, but there's no such thing as paper-based communication for me anymore. All notifications and updates are done electronically through a web-based ticketing system. Things like web-based trouble ticketing systems are touted as helpful tools that management magnanimously offers to its technical staff as a generous aid in the never-ending struggle of zombie slaying commonly known as troubleshooting. In practice this all looks like a different animal. Something breaks: the oncall engineer receives an electronic notification which needs to be acknowledged. The oncall engineer then starts furiously typing an e-mail in a desperate attempt at playing beat the panic. Before the second sentence can be typed out, three people run by the oncall engineer’s cubicle explaining something has happened. These people are in turn acknowledged. Then the oncall engineer receives at least one instant message requesting that they join a troubleshooting chat room to give updates on the issue. It’s around this time that the oncall engineer’s boss comes into play, asking what has happened and when it will be fixed. A page goes off requesting that the oncall engineer join a telephone bridge call to give an update.

Remember how troubleshooting is like slaying zombies? Well, in a way, when a thing breaks it almost instantly starts eating the brains of the people who support the things that depend on it, which effectively increases the zombie horde. Eventually, people start asking when the service will be restored so their particular thing will start working again. At this point the oncall engineer manages to ignore everything else and start troubleshooting the issue. A good boss will morph into an offensive lineman and start blocking the newly created zombies, and eventually functionality is restored and all questions are answered.

We have a weekend event called "Fall Release". I have no clue what it is we're releasing (I think it's software), but I know that it takes a huge amount of personnel and other resources to accomplish. In any case, this past weekend there was a lot of movement among the developers and the server teams, which resulted in a lot of work in the form of tying up loose ends on the network side. My side.

It’s Saturday afternoon when my weekend becomes cataclysmic. I perform a change for someone and it won’t work. Between the two of us we fiddle with all kinds of knobs and nothing seems to make the light turn green. For an hour and a half we work, and then out of frustration and desperation and sheer idiocy I try a command that isn’t even supposed to help the situation. One minute later I lose connectivity to the machine that received the command. It’s at this time that the guy I’m working with loses connectivity to his server. I try to re-establish connectivity. Nothing doing. I try to establish connectivity to the hot standby device (an understudy that can leap out on stage the moment the lead collapses). I can’t get to it. It won’t even respond to the most basic call: the simple ping. This is a crucial thing. This is a thing with many, many dependencies.

To cut to the chase: I removed my company’s web presence from the Internet for over an hour.

I was working from home, so I flew to the office, receiving an electronic onslaught of notifications all the way. When I got into the data center a manager (not mine) was there waiting, and then standing behind me offering assistance. My manager did hit the scene pretty quickly, though, and we worked the issue along with another engineer. Service was restored and now it was time to face the music. I had to sit there and say I broke it.

My boss took it really well. I imagine if our roles were reversed I would be angry in a way that would hearken back to the animal kingdom. I’d be baboon-screaming-poo-flinging-jumping-up-and-down-on-my-desk mad. He didn’t even raise his voice. Perhaps he could see how sick I felt about the whole thing. Perhaps he was terrified of making me cry, a fear that would at one time have offended me very deeply, but which I now completely endorse if it means I don’t get yelled at. Sincerely though, I appreciate not being cut to shreds.

Regardless, I had a terrible Saturday. Remembering it puts my stomach in knots. I tried to surround myself with friendly faces at a gathering after that terrible outage, but work and technology just wouldn’t cooperate, and I trudged home armed with exceptional chili. My brief time out was restorative. The worst and best part was I had to go home and start working on that problem where I shit the bed to begin with. It felt terrible getting back on that horse, but it was well worth it. I eventually found the problem and fixed it, and nothing blew up. That felt pretty good.

I ended up working much of Saturday night and most of Sunday, and the world kept turning. I did find an hour to go for a lovely motorcycle ride around Eagle Creek. There's a narrow road that feels like a country lane lined with trees, and on Sunday the sun shone down through the golden, orange, fiery red leaves as they lazily drifted over me onto the street like warm, paper snow. It was so warm. The last warm day, I believe, and it felt so good. It was centering in a way that I desperately needed.

The next installment: From wind-stealing lows to dizzying highs. My first Monday Night Football game.

2 comments:

nickabouttown said...

Wait? You troubleshoot? I've made an IT career out of just restarting things. Wow...who knew :-)

Btw...HR is going to be out on Saturday and is going to meet up with the Bag Lady bus at some point. Get with him if you want to laugh at Blanche and Romeo in drag

Scrawler said...

I'm so bummed, I have to miss it. Work is pretty demanding right now. I'm pulling an all-nighter in the data center.