From: Herbert Poetzl (herbert_at_13thfloor.at)
Date: Thu 14 Nov 2002 - 09:32:06 GMT
On Thu, Nov 14, 2002 at 01:19:45AM +0000, ragnar_at_this.is wrote:
> Hi,
>
> > > because you might run into an O(n^2) issue ...
> >
> > I don't really care; this is not a performance-critical task. You might
> > run it once a month and it can take many hours, no problem, and the above
> > algorithm might be somewhere around O(n), maybe a little worse.
>
> I do not see a performance problem.
somebody will care if a memory-mapping process
goes wild on thousands of files, making millions
of comparisons for several hours *G*
read below ...
>> b) the stat-data of files which became candidates by file selection is
>> kept in a map/set {filename,stat,attr} (I will use C++ for the implementation,
>> like the other tools too); the filesize will be used as the ordering attribute.
>> c) mmap reasonably many files of the same size and matching
>> uid/gid/stat/attr... into memory and compare them (and redo this if not
>> all files can be mapped; take special care for huge files which cannot be
>> mmap'ed, ...)
> One of the basic assumptions is that the files should
> be of the same size. So working from a size-sorted
> list of files looks linear.
>
> For an interesting look at the files of the same size:
>
> bash# find /TheServers -type f -printf \\n%s\\t%p|sort -rn|less
how to compare 5 files?
- compare 1st with number 2-5
- compare 2nd with number 3-5
- compare 3rd with number 4-5
- and compare the last two
makes a total of (4+3+2+1)=10 comparisons
for 10 files this would be
(9+8+7+6+5+4+3+2+1)=45 comparisons
for N files it is (N/2)*(N-1) = (N^2-N)/2
(I think a smart algorithm might be able to
reduce this to O(N*sqrt(N)), but not to O(N))
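a quick sanity check of that closed form against the small cases above
(just a throwaway gawk one-liner, not part of any tool):
# for n in 5 10; do gawk -v n=$n \
 'BEGIN {printf "%d files: %d comparisons\n", n, n*(n-1)/2}'; done
which prints 10 and 45, matching the counts above.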
consider 10 vservers,
9000 files each,
8000 files the same (normally unified)
that means a list of 90000 files, 10000 different,
10 blocks of 8000 files for possible unification (80000 files)
so what would the highest theoretical number of files of equal size
to compare be? how many comparisons would be needed
for 10000 files? how many for 80000?
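just to put numbers on it (using the formula above, and assuming the
theoretical worst case where all those files share a single size):
10000 files would need 10000*9999/2 = 49995000 comparisons, and
80000 files 80000*79999/2 = 3199960000 ... in practice the numbers
will be smaller, because real files spread over many different sizes.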
you can make some tests/calcs on your virtual server with
awk (to get some idea of what will usually happen)
# find /vservers/IS01 -xdev -printf "%s\n" \
| gawk '{N[$1]++} END {for (a in N) S+=N[a]*(N[a]-1)/2; print S}'
will give you the number of comparisons, and
# find /vservers/IS01 -xdev -printf "%s\n" \
| gawk '{N[$1]++} END {for (a in N) S+=N[a]*(N[a]-1)/2*a; print S}'
will give you roughly the amount of memory mapping done if only 2 files
are mapped at a time to compare them ... or, to be more realistic,
the memory that actually has to be compared ...
cksum (for example) would map each file only once,
and produce a hash value, which could be compared
in the same way as planned, but much, much faster,
about as fast as the size sort/comparison ...
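a rough sketch of that idea with stock tools (using md5sum here instead
of cksum, only because its fixed-width hash makes grouping with uniq
easy; the path and options are just an example):
# find /vservers/IS01 -xdev -type f -print0 \
 | xargs -0 md5sum \
 | sort | uniq -w32 --all-repeated=separate
each file gets read exactly once, and the sort on the hash values
replaces the pairwise mmap comparisons; files that land in the same
group would still get one final byte-wise compare before unification.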
please correct me if I'm wrong ...
best,
Herbert
>
> best
> ragnar