From: Herbert Poetzl (herbert_at_13thfloor.at)
Date: Fri 15 Nov 2002 - 02:47:11 GMT
first, you do not need to cc me, because fortunately
I'm currently on the list (except for the rare case
where you would like two answers instead of one ;)
second, do not let yourself be stopped by anything, or anybody!
(the vserver list seems good at stopping people from
doing useful things ...)
and finally I've come to the conclusion that we are both
(or at least I am) bad at explaining things, so I will
try to shed light from different angles ...
How I see the process:
a) selecting files on an appropriate basis
(including find syntax, file lists, patterns, etc)
b) comparing/sorting the files by size
(this might be done with your bucket algorithm;
see the sketch after this list)
c) comparing files of equal size against each other
(you can't assume that files of equal size are
the same, so you'll have to compare them)
d) unifying all files found to be equal
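just to illustrate b), a minimal sketch in C (reading file names
from stdin, as find in a) would produce them, and grouping them by
st_size via a sort; only an illustration, not the final tool):

    /* sketch for step b): read file names from stdin (one per line,
     * e.g. produced by find in step a), stat them and sort by size,
     * so that files of equal size end up next to each other */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/types.h>
    #include <sys/stat.h>

    struct entry { char *name; off_t size; };

    static int by_size(const void *a, const void *b)
    {
        off_t sa = ((const struct entry *)a)->size;
        off_t sb = ((const struct entry *)b)->size;
        return (sa > sb) - (sa < sb);
    }

    int main(void)
    {
        struct entry *e = NULL;
        size_t n = 0, max = 0;
        char line[4096];

        while (fgets(line, sizeof(line), stdin)) {
            struct stat st;
            line[strcspn(line, "\n")] = '\0';
            if (lstat(line, &st) || !S_ISREG(st.st_mode))
                continue;            /* skip links, directories, errors */
            if (n == max) {
                max = max ? max * 2 : 1024;
                e = realloc(e, max * sizeof(*e));
                if (!e)
                    return 1;
            }
            e[n].name = strdup(line);
            e[n].size = st.st_size;
            n++;
        }
        qsort(e, n, sizeof(*e), by_size);

        /* after sorting, files of equal size are adjacent:
         * these are the candidates for step c) */
        for (size_t i = 0; i < n; i++)
            printf("%lld %s\n", (long long)e[i].size, e[i].name);
        return 0;
    }
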
Some ideas I had regarding this process:
- why a brand new selection/pattern syntax if
find probably already does what you want?
- what about external knowledge, in the form of
include/exclude lists?
- why not generate some hash value for the files
(in step c), so the hashes could be compared
instead of the file contents (see the sketch below) ...
- maybe one can store the hashes of once unified
files (together with the file name, location,
creation time, etc) and reuse this information
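to make the hash idea a bit more concrete, a minimal sketch
(a simple FNV-1a hash as a stand-in for cksum/md5 output; equal
hashes would of course still have to be verified by a real
comparison before unifying):

    /* sketch for the hash idea in c): compute a cheap hash per file,
     * so candidate files are read only once; equal hashes still have
     * to be verified byte-wise before actually unifying */
    #include <stdio.h>

    static unsigned long file_hash(const char *name)
    {
        unsigned long h = 2166136261UL;      /* FNV-1a, 32 bit */
        FILE *f = fopen(name, "rb");
        int c;

        if (!f)
            return 0;
        while ((c = getc(f)) != EOF)
            h = (h ^ (unsigned char)c) * 16777619UL;
        fclose(f);
        return h;
    }

    int main(int argc, char **argv)
    {
        for (int i = 1; i < argc; i++)
            printf("%08lx  %s\n", file_hash(argv[i]), argv[i]);
        return 0;
    }

the hash values (together with name, location, mtime) are also what
could be stored and reused between runs, as suggested above.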
Some (might be) useful information:
- be careful not to cross filesystem boundaries (-xdev; see the nftw() sketch below)
- avoid/block recursive/broken links
- do not modify/touch the files (timestamps)
- do not assume (virtual) memory is unlimited
- if you want it fast, code in C
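for the -xdev and symlink points, nftw() already covers both;
a minimal sketch (the callback just prints size and name, the real
selection logic from a) would go there):

    /* sketch: walk one filesystem only (like find -xdev) and never
     * follow symlinks, using nftw() with FTW_MOUNT | FTW_PHYS */
    #define _XOPEN_SOURCE 500
    #include <ftw.h>
    #include <stdio.h>
    #include <sys/stat.h>

    static int visit(const char *path, const struct stat *st,
                     int type, struct FTW *ftw)
    {
        (void)ftw;
        if (type == FTW_F && S_ISREG(st->st_mode))
            printf("%lld %s\n", (long long)st->st_size, path);
        return 0;                    /* keep walking */
    }

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s <directory>\n", argv[0]);
            return 1;
        }
        /* FTW_MOUNT: stay on one filesystem,
         * FTW_PHYS: do not follow symbolic links */
        return nftw(argv[1], visit, 32, FTW_MOUNT | FTW_PHYS) ? 1 : 0;
    }
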
regarding Heuristics:
(# of files of equal size, and resulting # of
comparisons between them, on a production system)
# find /vservers/???? -xdev -printf "%s\n" \
| gawk '{N[$1]++} END {for (a in N) if (N[a]>1) {
C+=N[a]; S+=N[a]/2*(N[a]+1) }; print C,S }'
793162 3.16015e+09 (with a total of 883185 files)
best,
Herbert
On Fri, Nov 15, 2002 at 12:38:41AM +0100, Christian wrote:
> On Thu, 14 Nov 2002 10:32:06 +0100
> Herbert Poetzl <herbert_at_13thfloor.at> wrote:
>
> > On Thu, Nov 14, 2002 at 01:19:45AM +0000, ragnar_at_this.is wrote:
> > > Hi,
> > >
> > > > > because you might run in an O(n^2) issue ...
> > > >
> > > > I don't really care; this is not a performance-critical task. You
> > > > might run it once a month and it can take many hours, no problem,
> > > > and the above algorithm might be somewhere in O(n), maybe a little
> > > > worse.
> > >
> > > I do not see a performance problem.
> >
> > somebody will care if a memory-mapping process
> > goes wild on thousands of files, making millions
> > of comparisons for several hours *G*
>
> huh? there are not many files which become unify candidates by matching
> all filters and having the exact same size.
> also I said 'mmap() reasonably many' files; I was thinking of "map as many
> files as comfortably fit in the free address space", which might not be
> much more than 1 GB on normal x86's ... see below
>
> > how to compare 5 files?
> > - compare 1st with number 2-5
> > - compare 2nd with number 3-5
> > - compare 3rd with number 4-5
> > - and compare the last two
> > makes a total of (4+3+2+1)=10 comparisons
> Heuristics!!
> algorithm (just from scratch, hopefully nobody has patented it):
> we put all same-size candidate files together in one bucket and
> start an iterator running from 1 to size_of_files. at each position we
> compare the byte at the iterator position of the first file in the
> bucket with the byte at the same position in all other files.
> each time we find a value different from that of the first file in the
> bucket, we check whether there is another bucket we 'just created' with
> the same value at that position; if so, we move the file there, else we
> create a new bucket, move the file there, and remember this bucket as
> 'just created'. The 'just created' list is reset each time
> the iterator is incremented.
> So we need to scan each file only once, which I would call O(N), and the
> comparison cost is almost nothing compared to disk access; it is even
> faster than calculating any checksum. There are very few cases where not
> all files map at once, and there are a few optimizations (buckets with
> only 1 file don't need to be compared with anything else ...). mmap was
> my intention for easy file handling, which is more of an implementation
> detail; reading char- or block-wise, or mmap'ing blocks instead of whole
> files, will do just fine. (well, I must agree that accessing many files
> in parallel will be worse than accessing many files in serial order, if
> the files are not fragmented)
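a minimal sketch of the bucket splitting described above (buffered
reads instead of mmap(), all candidate files of one equal-size group
kept open at once; names and structure are illustrative only):

    /* sketch: all files (assumed equal size) start in one bucket;
     * at every byte position a bucket is split by the byte value
     * found there, so each file is read exactly once; files that
     * still share a bucket at the end are byte-wise identical */
    #include <stdio.h>
    #include <stdlib.h>

    int main(int argc, char **argv)
    {
        int n = argc - 1;
        FILE **f    = malloc(n * sizeof(*f));
        int *bucket = malloc(n * sizeof(*bucket)); /* current bucket */
        int *byte   = malloc(n * sizeof(*byte));   /* byte at position */
        int *next   = malloc(n * sizeof(*next));   /* bucket after split */

        if (n < 1 || !f || !bucket || !byte || !next)
            return 1;

        for (int i = 0; i < n; i++) {
            f[i] = fopen(argv[i + 1], "rb");
            if (!f[i]) { perror(argv[i + 1]); return 1; }
            bucket[i] = 0;           /* all files start in one bucket */
        }

        for (;;) {
            int done = 1, nbuckets = 0;

            for (int i = 0; i < n; i++) {
                byte[i] = getc(f[i]);   /* each file read exactly once */
                if (byte[i] != EOF)
                    done = 0;
                next[i] = -1;
            }
            if (done)
                break;

            /* files stay together only if old bucket and byte match */
            for (int i = 0; i < n; i++) {
                if (next[i] >= 0)
                    continue;
                next[i] = nbuckets;
                for (int j = i + 1; j < n; j++)
                    if (next[j] < 0 && bucket[j] == bucket[i]
                                    && byte[j] == byte[i])
                        next[j] = nbuckets;
                nbuckets++;
            }
            for (int i = 0; i < n; i++)
                bucket[i] = next[i];
        }

        /* files sharing a bucket id are identical -> unify candidates */
        for (int i = 0; i < n; i++)
            printf("bucket %d: %s\n", bucket[i], argv[i + 1]);
        return 0;
    }

with the small groups of equal-size files from step b), the inner split
loop stays cheap, and every file is still read only once.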
>
> > cksum (for example) would map each file only once,
> > and produce a hash value, which could be compared
> > in the same way as planned, but much much faster,
> > equally fast like the size sort/comparison ...
> >
> > please correct me if I'm wrong ...
>
> please correct me if I'm wrong ... make it better. I want a) a solution
> for unifying vservers and b) a good one. That doesn't imply that mine
> will be the best ... but someone has to start ... well, there is the Perl
> version, I'll check it later, but it looks much more limited than my
> proposal since it has no file selection at all (and can run into symlink
> race conditions).
>
> I'll try to start coding on it within the next few days, let's see then.
>
> cya Christian