From: Jörn Engel (joern_at_wohnheim.fh-wedel.de)
Date: Fri 06 Aug 2004 - 13:23:20 BST
On Fri, 6 August 2004 12:53:43 +1200, Sam Vilain wrote:
>
> The chances of bits on your hard drive platter randomly losing their
> magnetism or capacitors in your RAM losing charge and changing are
> probably higher than two different files having an SHA1 collision :-).
I used to have the same opinion. Then I read this:
http://www.usenix.org/events/hotos03/tech/full_papers/henson/henson_html/hash.html
> Hashing only the first block of the file as an optimisation is a
> sensible idea.
Yes.
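
A minimal sketch of the first-block optimisation, in Python for illustration only. The 4096-byte block size and the function names are assumptions; nothing in this thread specifies them. A matching first block only means the files may be identical, so a full comparison is still needed afterwards.

```python
import hashlib

BLOCK_SIZE = 4096  # assumed block size, not taken from the script under discussion


def first_block_sha1(path, block_size=BLOCK_SIZE):
    """Return the SHA-1 hex digest of the first block_size bytes of path."""
    with open(path, "rb") as f:
        return hashlib.sha1(f.read(block_size)).hexdigest()


def may_be_identical(path_a, path_b):
    """Cheap pre-filter: False means the first blocks differ, so the files differ.

    True only means the first blocks match; the rest of the files is unchecked.
    """
    return first_block_sha1(path_a) == first_block_sha1(path_b)
```

The point of the pre-filter is that most non-identical files already differ within the first block, so they can be rejected after reading a few kilobytes.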
> The script could be easily modified to do this as a separate step,
> however bear in mind that it will only even consider checking the file's
> contents if the files already have the same owner/group/permissions,
> relative path and file size. My assumption was that if these all match,
> the files are probably going to be the same anyway.
In that case, you can skip the hashes altogether. Do a direct
comparison; nothing is lost.
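
Roughly what I mean, sketched in Python for illustration. The function names are hypothetical and this is not how unify-dirs is written; it only shows that once owner, group, permissions and size match, a byte-by-byte comparison answers the question without any hash.

```python
import filecmp
import os
import stat


def same_metadata(path_a, path_b):
    """True if both are regular files with equal owner, group, mode and size."""
    a, b = os.lstat(path_a), os.lstat(path_b)
    if not (stat.S_ISREG(a.st_mode) and stat.S_ISREG(b.st_mode)):
        return False
    return (a.st_uid, a.st_gid, a.st_mode, a.st_size) == (
        b.st_uid, b.st_gid, b.st_mode, b.st_size)


def can_unify(path_a, path_b):
    """Metadata check first, then a direct content comparison, no hashing."""
    if not same_metadata(path_a, path_b):
        return False
    # shallow=False makes filecmp read and compare the file contents
    # instead of trusting the stat() signature alone.
    return filecmp.cmp(path_a, path_b, shallow=False)
```

The comparison stops reading at the first differing chunk, whereas hashing both files always reads them to the end.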
> Nice idea, but I think on UNIX that's pretty much a can of worms with no
> easy answer. You'd need something in the kernel that notifies userland
> when any inode on a filesystem changes. Have a look at the intermezzo
> module if you want to go down that path. If you can provide the kernel
> half, I'll be more than happy to extend unify-dirs to work with it :).
Yes, I know. Quite a few people have tried it already, and Al Viro
didn't like any of it.
> Failing active monitoring, as a simple compromise there's no reason that
> unify-dirs couldn't optionally store its internal inode/stat/SHA1 hash
> cache in a Berkeley database, and run the script every hour or so via
> cron. It would certainly prevent the copious stat()'ing that the script
> does, at the expense of not noticing unlikely unification situations
> until the DB cache entries expire.
>
> Of course, it would still absolutely hammer the VFS every time it runs
> with readdir() calls and find all those glorious reiserfs corner case
> bugs, but in my experience with a "handful" (say, 30) of vservers that
> are already mostly unified the script completes in under a minute when
> unifying just the OS (e.g., /usr, /lib, /sbin and /bin).
>
> Who knows, maybe there are other optimizations possible - like only
> stat()'ing the leaf directories in the hierarchy, to see if any files
> have been added or removed before actually using readdir() to read them.
> Again this will not catch some unlikely unification situations until
> full stat()'ing happens.
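
A rough sketch of the persistent cache idea quoted above, in Python for illustration. It uses sqlite3 as a stand-in for the Berkeley database mentioned, and the table layout and function names are assumptions. Note that this variant still calls stat() on every file and only avoids re-reading unchanged files; the expiry-based cache described above would skip the stat() as well.

```python
import hashlib
import os
import sqlite3


def open_cache(db_path):
    """Open (or create) the on-disk cache; the schema is hypothetical."""
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS cache ("
        "key TEXT PRIMARY KEY, size INTEGER, mtime_ns INTEGER, sha1 TEXT)"
    )
    return conn


def sha1_of(path):
    """SHA-1 of the whole file, read in 1 MiB chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def cached_sha1(conn, path):
    """Return the cached hash if size and mtime are unchanged, else rehash."""
    st = os.stat(path)
    key = f"{st.st_dev}:{st.st_ino}"  # device and inode identify the file
    row = conn.execute(
        "SELECT size, mtime_ns, sha1 FROM cache WHERE key = ?", (key,)
    ).fetchone()
    if row and row[0] == st.st_size and row[1] == st.st_mtime_ns:
        return row[2]
    digest = sha1_of(path)
    conn.execute(
        "INSERT OR REPLACE INTO cache VALUES (?, ?, ?, ?)",
        (key, st.st_size, st.st_mtime_ns, digest),
    )
    conn.commit()
    return digest
```

Keying on device and inode rather than path means hard-linked (already unified) files share one cache entry.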
Your problem is simpler than the one I want to solve. Also,
with final cowlinks, it's perfectly sane to combine two files with
different owners, permissions, [amc]times, etc. Both will have
separate inodes; only the data is shared.
Jörn
-- 
Invincibility is in oneself, vulnerability is in the opponent.
-- Sun Tzu
_______________________________________________
Vserver mailing list
Vserver_at_list.linux-vserver.org
http://list.linux-vserver.org/mailman/listinfo/vserver