Bottleneck identification and scheduling in multithreaded applications There are bottlenecks in multithreaded applications. Define a bottleneck to be anything which can cause thread waiting. Importance of bottlenecks change over time, as the program runs. Idea: run bottleneck on faster core. Requires you to identify bottlenecks. Bottleneck identification and scheduling. Software annotaties bottleneck code and implements waiting for bottlenecks using a special instruction. Allows hardware to determine which bottlenecks actually matter, and then to give extra resources to the bottleneck core. Implementation is fairly obvious: have a bottleneck table which tracks how much time is spent waiting on each bottleneck, and if the counter is above some threshold run the bottleneck on a faster core. Cache result on slower cores. Faster core seems to have a simple queue of pending things to run? Large core pulls stuff from scheduling buffer in order of how long people are expected to wait for the bottleneck. Can lead to false serialisation if a bottleneck gets stuck in the queue -> send it back to the small core. Another one: a bottleneck could become critical while running on a small core. Migrate at that point. Important for barrier-type synchronisation. If you have multiple large cores, a given bottleneck will be attached to a particular large core. They claim the silicon cost is tolerable; not sure I know enough about hardware to say how meaningful that is. Perf is what you'd expect they'd say: it helps against their chosen baselines. Q: Very large number of cores? A: Our approach is not buggy. Q: How do you tell how many threads are waiting? A: Count them. ------------------------------ Optimal task assignment in multithreaded processors: A statistical approach Peta Radojkovic. Barcelon Supercomputing Centre Cannot expect a universal scheduler. They care about scheduling for network servers. Bad scheduling can apparently cost up to 35%. Ah ha: they're talking about resource contention in processor and interconnect etc. Don't know what an optimal schedule is for this kind of thing. Don't want best configuration, just one in top 1%. For their hardware and workload, randomly sampling 1000 is apparently an adequate approximation. Use some kind of statistical technique to estimate the population maximum given a sample. Extreme value theory is apparently the magic phrase. Apparently has a standard theorem saying that msot of the time you need a generalised Pareto distribution. They do some kind of goodness-of-fit test to check that it's the right sort of distribution. The results look plausible, but I don't see any actual ground-truth validation. Result of all of this is that, for their workload, if you do 5000 samples then you'll be within 2-3% of optimal in 95% of cases. Q: Assuming some properties of distribution in order for extreme value theory to work? A: Don't need *very* strong assumptions, and can validate the assumptions they do need. Q: Scale down to small number of cores? Might be useful for validation. A: Q: How many samples do you actually need in practice? A: Going below 1000 samples doesn't give you a good estimator. Problem is that you're interested in the tail of the distribution, which is quite hard to probe with random sampling. --------------------------- CRUISE: Cache replacement and utlity-aware scheduling Aamer Jaleel Intel Shared last-level caches common. Number of contending applications tends to increase over time, so managing contention important. LRU: if app1 has a small cache footprint and app2 has a large one, and you run them at the same time, app2 gets most of the cache -> you sometimes have both fall out of the cache, even though app1 could happily run inside it and app2 could never fit. Hardware solution: allocate based on utility rather than demand OS solution: map application pages to different cache sets. Scheduler: try not to co-schedule them. Tricky when you're heavily loaded. Tried with a bunch of different apps -> on the order of 9% win for using optimal scheduling rather than worst, on average. Question: If you have intelligent cache replacement, do you need intelligent scheduling? Compare LRU cache to DRRIP cache. Find some apps which only benefit from intelligent scheduling under DRRIP, but not under LRU. Consider apps: povray (fits in first-level cache), bzip2 (fits easily in llc), bwaves (thrashes LLC), sphinx3 (fits precisely into LLC) povrays: put them on different LLCs. bwaves: under LRU: put them together; they don't benefit from LLC anyway, so can't hurt each other. under DRRIP: split them apart, since if you put something which needs an LLC on the same core, DRRIP will evict the thrasher. sphinx3: schedule with pvray, ideally, or bwaves under DRRIP. bzip2: schedule together. Running the LRU version of CRUISE on a DRRIP cache is bad, as is the DRRIP cache on an LRU cache. Obvious question: how do you classify your application? Profiling: not good when app behaviour changes. HW perf counters: assumes isolated behaviour same as contended behaviour Their answer: RICE -- runtime isolated cache estimator -- hardware to estimate how it'd behave if it had a core to itself. Essentially, reserve some cache sets for one core or the other, and then extrapolate from results on those sets. Q: How did you find the optimal answer? A: Exhaustively try every possible schedule. Q: Did you try multiple randoms? i.e. take sample max rather than pop max A: Random was the average of a bunch of combinations. Q: What happens if you have shared memory, in addition to shared cache? A: It'll work. Doesn't consider it explicitly, so room for other optimisations. ------------------------------------ Execution migration in a heterogeneous-ISA chip multiprocessor Ashish Venkat UCSD He's talking about fixing up the in-memory image for different pointer formats? Assuming shared memory. Plan seems to involve mulitple-compilation to produce program in different ISAs and then use page table tricks to swap in right one. Consistency: pointers point at same objects, including data, code, and stack. If consistency difficult they use a runtime translator (e.g. stack frame layout). Define equivalence points at which they can safely perform migration at function call boundary; use binary translation to get to an equivalence point when necessary. Basic consistency: assume same endianness and basic data types (e.g. bitness of words). Require all functions to be same size in each ISA by adding padding to small ones. Return addresses fixed up at translate time. Stack: must grow in same direction, frames must be same size, objects at same offset from stack pointer. Basically, the memory layout ends up being almost lowest-common-denominator of all supported ISAs. Transforming return stacks is the hard part. Need to translate local variables, spilled registers, return addresses. Stack translation usually costs on the order of hundreds of microseconds. Looks like a fairly standard DBT engine for the binary translation bit. Can reduce number of instructions to next call site by introducing dummy calls, but they don't do that. Performance: seems to be sane to migrate on the order of every hundred milliseconds. Q: Compare to previous work? A: Most previous work assumed memory not shared, which changes the performance tradeoffs. Q: Paper says OS runs on a single core/image? A: Assume it runs on every core, and that the OS author has made that work. Q: Do you model cache migration impact in results? A: Numbers do include those effects, but don't record them separately. Q: Didn't look at x86. What would you do there? A: Endianness: use little-endian everywhere. Q: Have you considered adding instructions to make this easier? e.g. hack up return address to make translation easier. A: Haven't considered it, but might be helpful to look at in future. Q: How does this interact with virtual machines? A: Not considered it, but would be straightforward. Q: If you're migrating from ARM to a crypto processor, why not use a function call? A: Not really what this work is looking at. Q: Have you thought about implications on memory model i.e. strong memory ordering vs. weak memory ordering? A: Not looked at that.