SCN: The difficulties of Feb 24-26.

J. Johnson jj at scn.org
Tue Feb 27 00:40:05 PST 2001


(The following account of the difficulties SCN encountered on Feb. 24-26
may be disseminated generally.)

On the evening of Feb 24th our 'scn4' machine, which hosts the mail
server, the web server, and the SCN Web pages, went into a perpetual
reboot loop because of a memory failure.

On Sunday afternoon we attempted to restore scn4 onto the chassis (box)
used for scn3 (which had been pulled for development of SCN2).  However,
diagnostics reported "memory" problems on that box too, and there was also
a network problem, so everything was put back onto the original chassis.
Unfortunately, in the crowded conditions of our equipment rack a power
cable was disturbed, and we lost our network connection to the library and
to the Internet.  This was not discovered until that evening, so we were
unable to correct that until Monday morning.

On Monday morning there was also a failure of a key networking service on
our main 'scn' machine (probably because of repeated failures to connect
to scn4).  This effectively killed incoming dialup connections, though at
that point that was less a problem than a mercy.

On Monday extensive testing of scn4's memory commenced.  The Sun memory
diagnostic is not reliable, so the system was shut down and rebooted in
various memory configurations until clear results were obtained.  This was
done with various combinations of memory modules and memory slots, as the
latter are also susceptible to failure.  This was very tedious, but in the
end both memory and the motherboard for scn4 were replaced, including one
memory DIMM that had been supplied as a replacement.  scn3 will be getting
its motherboard replaced on Tuesday.

At one point there was also a problem with a disk drive, again because of
the cramped conditions, and about two hours were required to restore key
system files and e-mail.  About two dozen messages were lost.

The system was restored at 6:30 PM.  There is a very large backlog of mail
which will take all night to deliver.

This was a very difficult episode, made worse by its timing over a
weekend.  Hopefully the replacement hardware will give us good service,
and the experience gained will help us if the hardware fails again.
Operations is also looking at ways of making the system more robust.

=== JJ =============================================================

* * * * * * * * * * * * * *  From the Listowner  * * * * * * * * * * * *
.	To unsubscribe from this list, send a message to:
majordomo at scn.org		In the body of the message, type:
unsubscribe scn
==== Messages posted on this list are also available on the web at: ====
* * * * * * *     http://www.scn.org/volunteers/scn-l/     * * * * * * *


