From: Christian (chth_at_gmx.net)
Date: Wed 13 Nov 2002 - 09:07:30 GMT
On Mon, 11 Nov 2002 16:37:14 +0100
Herbert Poetzl <herbert_at_13thfloor.at> wrote:
> hmm, so a configuration file, or a log file which would
> not be used for some time, will become a candidate for
> unification? what if the file then gets used in a way
> not suited for the IMMUTABLE-UNLINK approach?
That's the task of 'clever' selection options, e.g. --exclude '.*/etc/.*'
--exclude '.*\.conf'. But you are right, it probably needs better selection
options, e.g. '--exclude --clrmod 111' to exclude files where the execute
bits are not set, and so on. That's why I asked here for ideas ... thanks.
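A rough sketch of how such a selection predicate could look. It is in Python only to keep it short (the real tool is planned in C++). The function name, the default pattern list, and the reading of '--clrmod 111' as "drop files with none of the execute bits set" are assumptions for illustration, not existing options.

```python
import re
import stat

# Hypothetical defaults mirroring the --exclude examples above.
EXCLUDES = [re.compile(r'.*/etc/.*'), re.compile(r'.*\.conf')]

def is_candidate(path, mode, excludes=EXCLUDES, require_exec=True):
    """Return True if a file should be considered for unification."""
    # Only regular files qualify; special files and dirs never do.
    if not stat.S_ISREG(mode):
        return False
    # A full match against any exclude pattern drops the file.
    if any(rx.fullmatch(path) for rx in excludes):
        return False
    # Assumed meaning of '--exclude --clrmod 111': exclude files that
    # have all three execute bits cleared.
    if require_exec and (mode & 0o111) == 0:
        return False
    return True
```

With these defaults a binary like /usr/bin/ls stays a candidate, while anything under /etc, any *.conf, and any non-executable log file is skipped.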
> how do you plan to match (compare) the files?
> - by path/contents
> - by hash values (md5,etc)
- file selection/size/contents
Calculating a hash would involve a scan through the entire file anyway, plus
some calculations ... so I plan to do the following:
a) stat all files; any special files are excluded, dirs are matched
against the include/exclude regex (anyone want --includedir --excludedir
instead?), files are matched against all selection options
b) the stat data of files which became candidates by file selection is
kept in a map/set {filename,stat,attr} (I will use C++ for the implementation,
like the other tools too); the file size will be used as the ordering attribute.
c) mmap reasonably many files of the same size and matching
uid/gid/stat/attr... into memory and compare them (and redo this if not
all files can be mapped; take special care with huge files which cannot be
mmap'ed, ...)
Note: files don't need to be on the same path, only content matters
b1) if the big dictionary in memory becomes a problem I could use a
temporary db3 or so.
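Steps a) to c) could be sketched roughly like this. Again Python just for brevity, not the planned C++ code; the names same_content and find_duplicates, the chunk size, and the exact bucket key are assumptions. It maps only two files at a time and does nothing special for huge files or for a dictionary that outgrows memory.

```python
import mmap
import os
import stat
from collections import defaultdict

CHUNK = 1 << 20  # compare 1 MiB at a time (arbitrary choice)

def same_content(path_a, path_b):
    """Byte-wise comparison of two files through read-only mmaps."""
    size = os.path.getsize(path_a)
    if size != os.path.getsize(path_b):
        return False
    if size == 0:
        return True  # empty files cannot be mmap'ed, but they are equal
    with open(path_a, 'rb') as fa, open(path_b, 'rb') as fb:
        with mmap.mmap(fa.fileno(), 0, access=mmap.ACCESS_READ) as ma, \
             mmap.mmap(fb.fileno(), 0, access=mmap.ACCESS_READ) as mb:
            for off in range(0, size, CHUNK):
                if ma[off:off + CHUNK] != mb[off:off + CHUNK]:
                    return False
    return True

def find_duplicates(paths):
    """Return lists of paths whose stat key and content are identical."""
    # a) + b): stat every file, skip special files, bucket by stat data.
    buckets = defaultdict(list)
    for path in paths:
        st = os.lstat(path)
        if not stat.S_ISREG(st.st_mode):
            continue
        key = (st.st_size, st.st_uid, st.st_gid, st.st_mode)
        buckets[key].append(path)

    # c): inside each bucket, compare against one representative per
    # already known content class, so the path itself never matters.
    groups = []
    for key in sorted(buckets):  # size is the leading ordering attribute
        classes = []
        for path in buckets[key]:
            for cls in classes:
                if same_content(cls[0], path):
                    cls.append(path)
                    break
            else:
                classes.append([path])
        groups.extend(cls for cls in classes if len(cls) > 1)
    return groups
```

Files with a unique size are never opened at all; content is only read for files that share size, uid, gid and mode with at least one other candidate.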
> because you might run in an O(n^2) issue ...
I don't really care; this is not a performance-critical task. You might run
it once a month and it can take many hours, no problem. The above
algorithm might be somewhere in O(n), maybe a little worse.
> a linear approach could be generating a list of hash 
> values (sum, md5sum, cksum, fsum) for each virtual 
> server (including a reference) and then only comparing 
> a to-be-unified server (list) with the reference ... 
> should give O(n)
So I thought ... but keeping stat structs instead of hashes (I first thought
about hashes, but that would be slower and less precise).
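For comparison, the hash-list idea from the quote could be sketched like this (Python for brevity; file_md5, build_reference and match_against_reference are made-up names). Every candidate is read completely once to get its digest, and a digest hit is only as exact as the hash.

```python
import hashlib

def file_md5(path, chunk=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks."""
    h = hashlib.md5()
    with open(path, 'rb') as f:
        for block in iter(lambda: f.read(chunk), b''):
            h.update(block)
    return h.hexdigest()

def build_reference(paths):
    """Map digest -> first reference path seen with that digest."""
    ref = {}
    for path in paths:
        ref.setdefault(file_md5(path), path)
    return ref

def match_against_reference(ref, paths):
    """Return (path, reference_path) pairs for every digest hit."""
    matches = []
    for path in paths:
        digest = file_md5(path)
        if digest in ref:
            matches.append((path, ref[digest]))
    return matches
```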
cya Christian