From: Paul Sladen (vserver_at_paul.sladen.org)
Date: Thu 07 Nov 2002 - 13:13:03 GMT
On Thu, 7 Nov 2002, Nuno Silva wrote:
> My advice is: "upgrade" to 2.4.18 :)
> My current record with 2.4.19 with moderate load but high I/O is 12 days
Both my production vserver boxes suffer from the `hangs' Cathy has
described.
PIV 1.6GHz, 1GB [no highmem], ext3, LVM, IDE Soft-"RAID"
2.4.19-pre7-ctx10-ide20020510 is my ``more reliable'' box so far:
75 days, 09:25:41 | Linux 2.4.19-pre7-ctx1 Sun Jul 14
0 days, 09:21:45 | Linux 2.4.19-pre7-ctx1 Sat Sep 28
38 days, 22:20:20 | Linux 2.4.19-pre7-ctx1 Sun Sep 29
On one occasion I found ``out of file handles'' in the terminal-server
scrollback, so now I monitor that. As Cathy points out, monitoring/logwriting
(and presumably the processes trying to do it) completely stops when the
machine ends up in this state.
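
For anyone wanting to watch the same thing, here is a minimal sketch of a
file-handle check. It assumes the kernel exposes /proc/sys/fs/file-nr as
three whitespace-separated numbers, with the allocated handle count first and
the system-wide maximum last; the meaning of the middle field has differed
between kernel versions, so it is parsed but ignored. The function names and
the 90% threshold are illustrative, not what is actually running on these
boxes.

```python
def parse_file_nr(text):
    """Parse the contents of /proc/sys/fs/file-nr into three integers."""
    allocated, middle, maximum = (int(field) for field in text.split())
    return allocated, middle, maximum


def handle_usage(text):
    """Return allocated file handles as a fraction of the system-wide maximum."""
    allocated, _middle, maximum = parse_file_nr(text)
    return allocated / maximum


def check(path="/proc/sys/fs/file-nr", threshold=0.9):
    """Read the counter file and warn when usage crosses the (assumed) threshold."""
    with open(path) as f:
        usage = handle_usage(f.read())
    if usage >= threshold:
        print("WARNING: file handle usage at %.0f%%" % (usage * 100))
    return usage
```

Calling check() from cron every few minutes would be one way to use it, with
the caveat already noted: once the box is in the hung state nothing in
userspace gets to run, so this only helps catch the build-up beforehand.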
PIII 700MHz, 192MB [no highmem ;-)], ext2, md+SCSI, IDE
2.4.18ctx-10 is my ``less reliable'' box, viz:
9 days, 22:42:04 | Linux 2.4.18ctx-10 Tue May 14
56 days, 22:51:47 | Linux 2.4.18ctx-10 Fri May 24
32 days, 21:26:16 | Linux 2.4.18ctx-10 Sat Jul 20
13 days, 13:29:46 | Linux 2.4.18ctx-10 Thu Aug 22
10 days, 05:49:08 | Linux 2.4.18ctx-10 Tue Sep 17
3 days, 06:01:25 | Linux 2.4.18ctx-10 Thu Sep 5
5 days, 20:31:13 | Linux 2.4.18ctx-10 Sun Sep 8
4 days, 05:47:06 | Linux 2.4.18ctx-10 Mon Sep 30
22 days, 17:39:13 | Linux 2.4.18ctx-10 Sat Oct 5
9 days, 21:37:55 | Linux 2.4.18ctx-10 Mon Oct 28
This last one was a genuine Oops that spewed (and was not rebootable with
[break] on the serial console); most [all?] of the rest (sadly too many to
count...) have been hangs where the box still answers ICMP echo requests and
half-opens TCP connections, and can be rebooted with SysRq from the serial
console.
The softdog (kernel/userspace watchdog) cannot be persuaded to reboot the
machines when they end up in this state, although the kernel, not having
received an update from userspace, should reboot! And that is with
*everything* turned on (completely paranoid state) in the watchdog program.
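
To make the userspace half of that arrangement concrete, here is a minimal
sketch of a watchdog feeder. It is not the watchdog program used on these
boxes. It assumes the usual Linux convention that any write to the watchdog
device (normally /dev/watchdog) resets the kernel's countdown, and that on
drivers supporting "magic close" a final 'V' written before closing disarms
the timer. The function names, the 10 second interval, and the beats and
sleep parameters (there only so the loop can be exercised without real
hardware) are all illustrative.

```python
import time


def feed_watchdog(device, interval=10.0, beats=None, sleep=time.sleep):
    """Write one byte to the watchdog device every `interval` seconds.

    Runs forever when `beats` is None, otherwise stops after `beats` writes.
    Returns the number of writes performed.
    """
    count = 0
    while beats is None or count < beats:
        device.write(b"\0")   # any write counts as a keepalive
        device.flush()        # make sure it reaches the driver now
        count += 1
        sleep(interval)
    return count


def disarm(device):
    """Write the magic-close character; only honoured by drivers that support it."""
    device.write(b"V")
    device.flush()
```

On a real machine the device would be opened unbuffered, for example
open("/dev/watchdog", "wb", buffering=0), and the interval kept well under
the driver's timeout. If this loop stops running, the kernel side is what is
supposed to notice and reboot.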
-Paul
-- Nottingham, GB