North American Network Operators Group
Re: Why do we use facilities with EPO's?
On Wed, 25 Jul 2007 12:43:17 PDT, Roy said:
> > Funny story about that and the EPO we have here...
> > ...
> Story #1
> Story #2

Story #3

So about 4-5 years ago, we were in the middle of a major renovation of our server room. Moving machines all over the place, trying to clear about 6K contiguous square feet of floor space to drop a top-5 supercomputer in. Upgrading the power, bringing in another 1.5MW feed, cooling to get the resulting BTUs *out*, etc. And we decide it's time to put in a new 600kW diesel backup generator to replace the old one that was way too small, for all the non-supercomputer systems in the room.

So we take a multi-hour outage one Saturday for a full powerdown so we can wire all the new UPS gear in. And one of our scarier moments is rebooting the Sun E10K, because it was a bit long in the tooth, and had 400 disk drives, and hadn't been powered off in so long we weren't sure if it *would* power up again without field engineering assistance. And it *had* to come back up, because it had all the Oracle databases that had all our business records, HR, student records, everything.

There's a few tense moments - we lose about a dozen drives, but fortunately they're all in RAID sets and no more than one drive per set died. We also notice that we dodged a bullet - the main boot drive was supposed to be mirrored, but due to a config error, wasn't.

Tuesday, that boot drive is moved, and it's now mirrored on 2 drives.

Friday, some construction guys come in to move the main entrance door into the room - it has to move about 20 feet to the right so you can go *around* the supercomputer, rather than walk straight into it. And as per plan, one of them starts moving the kind of odd light switch junction box next to the door, to its new location next to the new door. Unfortunately, as *not* per plan, he fails to double-check with our Facilities team that it's been disarmed first...

5 seconds later, it's very quiet and foggy in the room, as the Halon has dumped and the interlock with the EPO has killed the power.

Several hours later, we finally get to start powering up the Sun E10K.

The good news: We only lost 2 drives out of 400 this time, rather than a dozen.

The bad news: Guess which 2 failed.....