From: Sam Vilain (sam_at_vilain.net)
Date: Mon 10 Feb 2003 - 19:07:14 GMT
On Thu, 06 Feb 2003 23:03, you wrote:
> On Wed, Feb 05, 2003 at 07:55:19PM +0100, Herbert Poetzl wrote:
> > On Wed, Feb 05, 2003 at 04:46:21PM +0000, Paul Sladen wrote:
> > > On Mon, 3 Feb 2003, John Goerzen wrote:
> >
> > Justin M Kuntz reported a kernel oops in
> > sched.c 570 on a 2.4.20 ctx16 with reiserfs
> > on january 01 2003, so this seems to be
> > the same race ...
>
> Hmm, i'm also using reiserfs on the server which crashed, it might be
> related.
>
> John, are you using reiserfs ?
Hmm, funny you should suspect reiserfs so quickly. You have good reason.
As I've recently become painfully aware, reiserfs can easily break under
not so unusual circumstances. Though I used to swear by it, in the 2
years or so I have been using it I have had five unexplained data
corruption incidents running so-called `stable' versions since early 2.4
days, which is five more than with all other UNIX filesystems I've used
combined. Three of these followed a system crash, when reiserfs's
journalling failed. One of these resulted in a complete loss of the
filesystem structure, due to the inadequacy of the `reiserfsck' tool.

In addition to data corruption, it's not all that hard to create a
directory structure that even root cannot read; I've just managed to
create one, and all I was doing was duplicating ~25% of the directory
structure using an analogue of `cp -al'. Reiserfs really cracks under
pressure, and that's the last thing you want a filesystem to do!

With these problems under high load, it's hard to think of a truly useful
application for reiserfs. It really is still experimental as hell; the
version in 2.4.20 seems particularly bad. Best to stick with ext3/ext2
(with the directory hashing patch if you need it). Or try your luck with
xfs/jfs if you really need the speed.

Check out this e-mail seen on the reiserfs list:
----
[... talking about a crash ...]

And now I can reliably reproduce it. It has nothing to do with MD, linear,
raid, SMP, or unclean shutdowns.

I can reproduce this bug on a plain IDE disk partition in about three hours
on Linux 2.4.20 (compiled for SMP but running on UP, full .config and
system details available on request). My test system has about 4 gigs
under /etc, /usr, and /var, /dev/hdc2 is 25GB, and there is 1G of swap.
BEGIN cut-and-paste-into-a-root-shell

# Create an empty filesystem:
mkreiserfs -f -f /dev/hdc2
mount /dev/hdc2 /test
cd /test

# Script used to control the load average. Note that as written the loops
# below will keep spawning new processes, so we need some way to throttle
# them. Change the '-lt 10' to another number to change the number
# of processes.
cat <<'LC' > loadcheck && chmod 755 loadcheck
#!/bin/sh
read av1 av5 av15 rest < /proc/loadavg
echo -n "Load Average: $av1 ... "
av1=${av1%.*}
if [ $av1 -lt 10 ]; then
    echo OK
    exit 0
else
    echo "Whoa, Nellie!"
    exit 1
fi
LC

# Create directories used by test
mkdir foo bar

# Start up some rsyncs. I use /etc, /usr, and /var because there's a
# good mixture of files with some hardlinks between them, and on a normal
# Linux system some of them change from time to time.
while sleep 1m; do
    ./loadcheck || continue
    for x in usr etc var; do
        rsync -avxHS --delete /$x/. foo/$x/. &
    done
done &

# Start up some cp -al's and rm -rf's. Note there are two concurrent
# sets of 'cp's and two concurrent sets of 'rm's, and each of those
# has different instances of 'cp' and 'rm' running at different times.
for x in 1 2; do
    while sleep 1m; do
        ./loadcheck || continue
        cp -al foo bar/`date +%s` &
    done &
    while sleep 1m; do
        ./loadcheck || continue
        for x in bar/*; do
            rm -rf $x
            sleep 1m
        done &
    done &
done &

END cut-and-paste-into-a-root-shell
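
(Editorial aside, not part of the quoted report.) The throttle in the loadcheck script works by chopping the fractional part off the 1-minute load average with a shell parameter expansion and then comparing the integer against a limit. A minimal sketch of just that step follows, written as a function so it can be tried without touching /proc/loadavg. The name load_ok and the optional second argument are assumptions made here for illustration; the original script hard-codes the limit of 10.

```shell
#!/bin/sh
# load_ok LOAD [LIMIT]
# Hypothetical helper, not from the original report.
# Succeeds (exit status 0) when the integer part of LOAD is below LIMIT.
# LIMIT defaults to 10, the value used in the loadcheck script.
load_ok() {
    # ${1%.*} removes the shortest suffix matching ".*",
    # i.e. the fractional part: "9.87" becomes "9".
    av1=${1%.*}
    [ "$av1" -lt "${2:-10}" ]
}
```

Called as `load_ok "$(cut -d' ' -f1 /proc/loadavg)"`, this should behave like the original check minus the progress messages. Note that the comparison truncates rather than rounds, so a load of 9.99 still counts as below 10.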

rm and occasionally cp will frequently complain about "No such file or
directory". This is normal. After about 3 hours, the following non-normal
messages appear:

readlink lib/R/library/base/help/contrasts: Permission denied
readlink lib/R/library/base/html/hsv.html: Permission denied
rm: cannot remove `bar/1042550428/usr/src/kernel-source-2.4.20-zb-586-smp/drivers/net/appletalk/ltpc.o': Permission denied
rm: cannot remove `bar/1042550428/usr/src/kernel-source-2.4.20-zb-586-smp/drivers/net/aironet4500_proc.c': Permission denied
cp: cannot stat `foo/usr/src/kernel-source-2.4.20-zb-586-smp/drivers/net/e1000/.e1000_ethtool.o.flags': Permission denied
cp: cannot stat `foo/usr/src/kernel-source-2.4.20-zb-586-smp/drivers/net/.eepro.o.flags': Permission denied

This needs a 'reiserfsck --fix-fixable' to fix.

It looks to me like there may be some sort of locking bug triggered by
concurrent link/unlink/rename calls, but I'm not even a filesystem expert,
much less a reiserfs expert. ;-)
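
(Editorial aside, not part of the quoted report.) The symptom described above is "Permission denied" on stat and readlink calls made from a root shell. Anyone rerunning the test could look for it without watching the console for three hours. The following is only a sketch under stated assumptions: it relies on a find that supports the -ls action (GNU and BSD find do; it is not in POSIX), and the function name scan_unreadable is invented here for illustration.

```shell
#!/bin/sh
# scan_unreadable DIR
# Hypothetical helper, not from the original report.
# Walks DIR on a single filesystem and prints every "Permission denied"
# error that find emits. Returns 1 if any were seen, 0 if the tree is clean.
scan_unreadable() {
    # -ls makes find stat every entry and read every symlink target.
    # "2>&1 >/dev/null" sends stderr into the pipe and discards stdout.
    # LC_ALL=C keeps find's error messages in English so grep can match.
    if LC_ALL=C find "$1" -xdev -ls 2>&1 >/dev/null | grep 'Permission denied'; then
        return 1
    fi
    return 0
}
```

Run as root, `scan_unreadable /test` should print nothing on a healthy filesystem, since root is normally not refused access on a local filesystem. Any output is therefore a hint that the tree has reached the state reported above. It proves nothing about the cause.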
--
Sam Vilain, sam_at_vilain.net

To be sure of hitting the target, shoot first, and call whatever you hit
the target.
ASHLEIGH BRILLIANT