From: Sam Vilain (sam_at_vilain.net)
Date: Wed 24 Nov 2004 - 04:01:47 GMT
Jörn Engel wrote:
>>...and the big challenge is - how do you apply this to memory usage?
> Oh, you could. But the general concept behind my unquoted list is a
> renewing resource. Network throughput is renewing. Network bandwidth
> usually isn't. With swapping, you can turn memory into cache and
> locality to the cpu is a renewable resource.
Yep, that was my thought too. Memory seems like a static resource, so
consider RSS used per second the renewable resource. Then you could
charge tokens as normal.
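For instance, a toy model might look like this (C, all names and
constants invented - this is not the existing token-bucket code):

  /* Toy model of charging "fast memory" tokens for RSS held over
   * time; the renewable resource is RSS-seconds, so the bucket
   * refills a little each tick. Everything here is made up. */
  #include <stdio.h>

  struct ctx_mem_acct {
      long tokens;      /* remaining memory tokens */
      long max_tokens;  /* bucket capacity */
      long refill;      /* tokens renewed per tick */
      long rss_pages;   /* current resident set size, in pages */
  };

  static void charge_rss_tick(struct ctx_mem_acct *c)
  {
      c->tokens += c->refill;       /* renewal */
      c->tokens -= c->rss_pages;    /* pay for pages held this tick */
      if (c->tokens > c->max_tokens)
          c->tokens = c->max_tokens;
      if (c->tokens < 0)
          c->tokens = 0;            /* broke: a candidate for penalties */
  }

  int main(void)
  {
      struct ctx_mem_acct c = { 500, 1000, 50, 80 };
      for (int i = 0; i < 5; i++) {
          charge_rss_tick(&c);
          printf("tick %d: %ld tokens\n", i, c.tokens);
      }
      return 0;
  }

A context holding more RSS than its refill rate covers slowly runs
dry, which is where the charging would bite.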
However, there are some tricky questions:
1) who do you charge shared memory (binaries etc) to?
2) do you count mmap()'d regions in the buffer cache?
3) if a process is sitting idle and there is no VM contention, then
   it is "using" that memory more, so maybe it is burning more
   "fast memory" tokens - but it might not really be occupying that
   memory, because it is not active.
Maybe the thing with memory is that what matters is not how much is
used per second, but how much active memory you are *displacing* per
second into other places.
We can find out from the VM subsystem how much RAM is displaced into
swap by a context / process. It might also be possible to use the
CPU's performance counters to estimate how much L2/L3 cache is
displaced during a given slice. I have a
hunch that the best solution to the memory usage problem will have to
take into account the multi-tiered nature of memory. So, I think it
would be excellent to be able to penalise contexts that thrash the L3
cache. Systems with megabytes of L3 cache were designed to keep the
most essential parts of most of the run queue hot - programs that thwart
this by being bulky and excessively using pointers waste that cache.
And then, it all needs to be done in no more than a few hundred
cycles every reschedule. Hmm.
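The first part could be as cheap as bumping a pair of per-context
counters in the swap-out path. A sketch, with invented names - this
is not a real kernel interface:

  /* Per-context displacement counters; a real implementation would
   * presumably hang these off the vserver context struct. */
  struct ctx_displace {
      unsigned long swapped_out;  /* this context's pages pushed to swap */
      unsigned long displaced;    /* pages it forced out of others */
  };

  /* Hypothetical hook: reclaim running on behalf of `displacer`
   * has just evicted a page belonging to `victim`. */
  static void note_displacement(struct ctx_displace *displacer,
                                struct ctx_displace *victim)
  {
      displacer->displaced++;
      victim->swapped_out++;
  }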
Here's a thought about an algorithm that might work. This is all
speculation without much regard to the existing implementations out
there, of course. Season with grains of salt to taste.
Each context is assigned a target RSS and VM size. Usage is counted a
la disklimits (Herbert - is this already done?), but all the complex
recalculation happens when something tries to swap something else out.
As well as memory totals, each context also has a score that tracks
how well or badly it has been behaving with memory. Let's call that
the "Jabba" value.
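The per-context state might look something like this (invented names
again):

  /* Sketch of the per-context bookkeeping described above. */
  struct ctx_vm_state {
      unsigned long rss_target;  /* target resident pages */
      unsigned long vm_target;   /* target total mapped pages */
      unsigned long rss_used;    /* counted a la disklimits */
      unsigned long vm_used;
      long jabba;                /* >0 fat (displacing others),
                                  * <0 thin (being displaced) */
  };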
When swap displacement occurs, it is first taken from disproportionately
fat jabbas that are running on nearby CPUs (for NUMA). Displacing
others' memory makes your context a fatter jabba too, but taking from
jabbas that are already fat is not as bad as taking it from a hungry
jabba. When someone takes your memory, that makes you a thinner jabba.
This is not the same as simply a ratio of your context's memory usage to
the allocated amount. Depending on the functions used to alter the
jabba value, it should hopefully end up measuring something more akin to
the amount of system memory turnover a context is inducing. It might
also need a damper of some kind, pulling a context's jabba back
towards zero during lulls in VM activity.
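Putting those rules into code might look like this - invented names,
and the constants are pure guesswork:

  #include <stdio.h>

  struct ctx { long jabba; };

  #define FAT 100  /* arbitrary threshold above which a jabba is "fat" */

  /* Victim selection: prefer the fattest jabba among candidates;
   * assume the caller lists contexts from nearby CPUs first. */
  static struct ctx *pick_victim(struct ctx **cand, int n)
  {
      struct ctx *fattest = cand[0];
      for (int i = 1; i < n; i++)
          if (cand[i]->jabba > fattest->jabba)
              fattest = cand[i];
      return fattest;
  }

  /* Account `pages` pages of displacement. Taking from a fat jabba
   * costs the displacer less than taking from a hungry one; the
   * victim gets thinner either way. */
  static void account_displacement(struct ctx *d, struct ctx *v, long pages)
  {
      d->jabba += (v->jabba > FAT) ? pages / 2 : pages;
      v->jabba -= pages;
  }

  /* Damper: pull jabba towards zero during lulls in VM activity.
   * (Integer truncation means small values never quite reach zero;
   * good enough for a sketch.) */
  static void jabba_decay(struct ctx *c)
  {
      c->jabba -= c->jabba / 8;
  }

  int main(void)
  {
      struct ctx a = { 150 }, b = { -20 }, me = { 0 };
      struct ctx *cands[] = { &a, &b };
      struct ctx *victim = pick_victim(cands, 2);  /* picks `a` */
      account_displacement(&me, victim, 10);
      jabba_decay(&a);
      printf("me=%ld victim=%ld\n", me.jabba, a.jabba);
      return 0;
  }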
Then, if you are a fat jabba, you might end up getting rescheduled
instead of getting more memory whenever you want it!
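That last part could hook in wherever allocations block - reusing the
invented struct from above, something like:

  struct ctx { long jabba; };
  #define FAT 100

  /* Hypothetical check in the page-allocation slow path: a fat
   * jabba gets told to reschedule and retry instead of being given
   * more memory right away. */
  static int ctx_should_throttle(const struct ctx *c)
  {
      return c->jabba > FAT;
  }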
--
Sam Vilain, sam /\T vilain |><>T net, PGP key ID: 0x05B52F13
(include my PGP key ID in personal replies to avoid spam filtering)

_______________________________________________
Vserver mailing list
Vserver_at_list.linux-vserver.org
http://list.linux-vserver.org/mailman/listinfo/vserver