Systems Administration
Feb. 8th, 2006 03:23 pm
An email from a colleague-to-be got me thinking about systems administration. Administration of large and complex systems, to be precise. Learning how to do that is quite hard, since it's difficult to get access to large and complex systems. However much I detest Telenor, I sure learned a lot while working with major-ISP-class systems there. A lot of it was fairly obscure technical detail, of course. But some things were more along the lines of principles and ways of thinking. And some of those might be possible to put down in fairly simple sentences.
If I think of more I'll add it later.
- First, always plan for change. It doesn't matter how simple it is, or how short-lived it's supposed to be, 999 times out of a thousand you are going to have to change it. If you take half an hour to think about and put in stuff that makes it easier to modify the system, you can save yourself an enormous amount of work later on.
- Second, it will break. It doesn't matter how well you plan and build it, at some point it's going to get fucked up. A CPU suddenly dies. Someone trips over a cable. A coworker restores an HP-UX system backup to your Solaris machine. Builders on the floor above drill too far into the floor and fill your router with coolant and concrete dust. So while you should of course make an effort to make your systems as reliable as possible, it's even more important to plan for what to do when the Wintel hits the fan. Since it usually does so at 3AM on a Sunday, the difference between rushing out to the machine room and trying to fix things in a panic and just saying "Switch over to the hot standby and I'll look at it on Monday" is not to be sneered at.
- You can have many stops, but you can't have long stops. Almost certainly, the vast majority of your customers will be running Windows. Which means that your system doesn't really have to be any more reliable than Windows. Since your users are used to their machines crapping out on them every now and then, you can do that too. What you can't do is be down for very long. Ideally, your system should never be down for longer than it takes Windows to reboot, since that's likely to be how long it is before the customer tries again. If it works then, he's not going to blame you.
- Some things are more important than other things. You need to know what your customers care about. Some systems can be down for a few hours without problems, others can't be down for even a minute.
- Have a plan B. Preferably, have plans C, D, E and F as well. See the two top items up there? Sometimes they hit you at the same time. You're changing something, and halfway through for whatever reason it becomes impossible to go through with the change. At that point you do not want to be in the position that your only way is forward, because then you're screwed. At the very least, you want to be able to back out and restore things as they were before.
- It takes longer. You know the old saying that to get a not too bad estimate on how long something's going to take, you estimate how long you think it should take, multiply by three and convert to the next larger unit? That saying is pessimistic. Also, if you say it's going to take an hour to get the system back up and it takes two, people will be pissed off. If you said that it's going to take four and it took two, they'll be ecstatic. So don't be afraid to pad your time estimates shamelessly.
- Sometimes, the right thing to do is to get a scapegoat. Your company has this project. It's taken a year, and it's cost a small fortune -- and it's about to totally and utterly fail. Somebody is going to have to take the blame for the failure, for such is human nature. If the failure isn't really the fault of any of the people involved (maybe the market crashed, or the idea was crap to begin with but the previous CEO ordered it to be done, or something), this may lead to the loss of valuable staff and/or a lot of bad feelings between coworkers. Neither of which is good. In this case, hiring a consultant to do the impossible ("We want you to do 3000 man-years of work in a week") and then blaming him for the failure can leave the company better off than it would've been otherwise.
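For the "have a plan B" item, the bare minimum of being able to back out might look something like the sketch below. It is only an illustration under some assumptions of mine: the change is a rewrite of a single file, and the names `change_with_backout`, `check` and the `.bak` suffix are made up for the example, not taken from any real tool.

```python
import shutil
from pathlib import Path


def change_with_backout(target: Path, new_text: str, check) -> bool:
    """Rewrite `target`, restoring the previous contents if anything goes wrong.

    `check` is a hypothetical caller-supplied function that takes the path
    and returns True if the system still looks healthy after the change.
    """
    backup = target.with_name(target.name + ".bak")
    shutil.copy2(target, backup)  # keep the old state before touching anything
    try:
        target.write_text(new_text)
        if not check(target):
            raise RuntimeError("post-change check failed")
    except Exception:
        shutil.copy2(backup, target)  # back out: put things as they were before
        return False
    return True
```

Calling `change_with_backout(Path("svc.conf"), new_config, my_check)` returns False and leaves the old file in place if the check fails. Real changes are rarely a single file, of course, but the shape is the same: save the old state first, and make going back a step you have already written down.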
Date: 2006-02-09 12:26 pm (UTC)
6b: It will impact more customers than you thought it would. When they designed the system it was just used for sending status emails once a day, but unknown to you it now runs the central SAP instance that controls the refinery.