Second session.

Resource management for task-based parallel programs over a multi-kernel
Georgios Varisteas, Mats Brorsson, et al.

Barrelfish is barrelfish.

Aim of this project is to introduce a programming model which takes advantage of shared memory if it's present, while retaining scalability and portability. Initial design for system-wide scheduling and resource management.

Motivation: current models are tested mainly in isolation, with minimal OS support, and can be wasteful in a multiprogramming environment. Real-life apps have variable parallelism throughout their life, or just limited parallelism. Therefore need adaptive scheduling and resource partitioning.

Scheduling: two levels, one in the system and one in userspace. Diagram indicates that they want to do sensible cross-job scheduling.

System scheduler: accepts feedback on process efficiency: effectively requests for resources. System is segmented for scheduling, but processes can span segments.

User-level scheduler: integrated with work-stealing runtime. Provides feedback on efficiency: number of cycles wasted waiting for work to arrive. Capture average and worst efficiency, and then do a threshold thing to figure out whether current assignment is ``efficient'' or ``inefficient''. Also track whether we're being given the resources we ask for (satisfied vs. deprived). Results:
-- Inefficient -> ask for less stuff
-- Efficient+satisfied -> ask for more stuff
-- Efficient+deprived -> don't change demands

System scheduler rearranges cores according to user-level demands. Some coordination between scheduler instances.

Selecting a worker to suspend: try to get the least efficient local core, then if that fails ask neighbouring schedulers for an available core, and if that fails use timesharing.

Malleability: load balance by modifying the domain of each process. No details given in talk.

Evaluation: claim low cost, high benefit. Graphs say otherwise; seem to show a hit in most benchmarks when running in isolation.
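(Note-taker's sketch, not from the talk: the three demand rules from the user-level scheduler notes above, written out in Python. The talk only said "a threshold thing", so the efficiency formula (1 - wasted/total), the threshold values, the one-core step size, and the names `WorkerSample`, `efficiency` and `next_demand` are all assumptions made for illustration.)

```python
from dataclasses import dataclass


@dataclass
class WorkerSample:
    wasted_cycles: int  # cycles a worker spent waiting for work to arrive
    total_cycles: int   # cycles in the measurement window


def efficiency(sample):
    """Assumed definition: fraction of the window not wasted."""
    if sample.total_cycles == 0:
        return 1.0
    return 1.0 - sample.wasted_cycles / sample.total_cycles


def next_demand(samples, demanded, granted,
                avg_threshold=0.8, worst_threshold=0.5):
    """Return the core count to request next (thresholds are assumed values)."""
    if not samples:
        return demanded
    effs = [efficiency(s) for s in samples]
    avg, worst = sum(effs) / len(effs), min(effs)
    efficient = avg >= avg_threshold and worst >= worst_threshold
    satisfied = granted >= demanded

    if not efficient:
        return max(1, demanded - 1)  # Inefficient -> ask for less
    if satisfied:
        return demanded + 1          # Efficient+satisfied -> ask for more
    return demanded                  # Efficient+deprived -> don't change
```

Usage would be one call per feedback interval, with the result passed to the system scheduler as the new request. How the real system combines average and worst efficiency was not stated.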
Contended case is much better, although he didn't say what the contended load was.

Future work: proper eval, locality awareness, heterogeneity awareness, non-shared memory awareness.

Q: How do you do the estimation bit? (i.e. efficiency)
A: Using one basic metric, number of wasted cycles. Has obvious meaning, and is measured in obvious way.
Q: How does this relate to resource-conscious scheduling e.g. EuroSys 2010?
Chair: Take that offline.
Q: Is this assuming apps are cooperative? e.g. if the hints you give back are bad?
A: Yes, but no.
Q: Letting an application give you hints about what resources it uses is a bad idea, because I say so.
A:
Q: Mesos related work?
A: Yes, there are lots of other schedulers out there.

-----------------------------------------------------------

Optimising power-performance trade-off for parallel applications through dynamic core and frequency scaling.
Satoshi Imamura, Hiroshi Sasaki, et al.

Multi-core mainstream, core count rising. Also want low power consumption. Assert there are two knobs: CPU freq + nr of cores.

Program characterisation tricky. Some apps like lots of cores, some like a fast single core. Also varies in phases within an app.

Propose dynamic core and frequency scaling (DCFS) to solve this dynamically and pick optimal balance of parallelism against single-threaded throughput.

Two phases: training and execution. Training amounts to changing the configuration and measuring IPS in each one. Execution phase uses data so collected to select a good configuration. Execution phase can detect that the profile is no longer good and restart training.

Training: start at max core count, then work down to first local maximum. Repeat for each available frequency.

Eval: waste of time for embarrassingly parallel programs, but pretty useful for middling-parallel ones. Using PARSEC benchmarks.

Future work: better way of detecting behaviour changes.

Q: Is the search technique a simple hill-climb?
A: Yes.
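(Note-taker's sketch, not from the talk: the training search as described above, i.e. start at the maximum core count, step down until IPS stops improving, repeat per frequency. This is not the authors' code. `train`, `measure_ips` and the synthetic model `toy_ips` are hypothetical names; a real `measure_ips` would reconfigure the machine and sample instructions per second.)

```python
def train(frequencies, max_cores, measure_ips):
    """Hill-climb over core count for each frequency.

    Returns {frequency: (cores, ips)} for the first local maximum found
    when walking down from max_cores.
    """
    profile = {}
    for freq in frequencies:
        best_cores = max_cores
        best_ips = measure_ips(best_cores, freq)
        for cores in range(max_cores - 1, 0, -1):
            ips = measure_ips(cores, freq)
            if ips <= best_ips:
                break  # no improvement: treat previous point as the local maximum
            best_cores, best_ips = cores, ips
        profile[freq] = (best_cores, best_ips)
    return profile


def toy_ips(cores, freq_ghz):
    """Synthetic stand-in for a measurement: scales to 4 cores, then degrades."""
    return freq_ghz * (min(cores, 4) - 0.25 * max(0, cores - 4))


profile = train([1.0, 2.0], max_cores=8, measure_ips=toy_ips)
```

With the toy model the search settles on 4 cores at both frequencies. For a program whose IPS keeps rising with core count the loop stops after one extra measurement per frequency, which matches the "waste of time for embarrassingly parallel programs" remark. How the execution phase picks among the per-frequency results was not given in the talk, so it is left out here.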
Q: Is that making assumptions about monotonicity? e.g. noise?
A: Assume convexity.
Q: Have you ever seen that to be false?
A: No.
Q: This is dynamic reconfiguration, but your baseline is a fixed allocation. Also, same baseline configuration for every benchmark. Is that fair? (e.g. maybe dedup would behave better with only one core, rather than 32)
A: Actually comparing best fixed configuration, despite what the slides say.
Q: What happens with the cores which you don't use?
A: We assume that they don't consume any static power i.e. off completely.
Q: Have you considered fixing performance and then minimising power?
A: No.
Q: Have you considered heterogeneous core frequencies e.g. scale one down and leave others running?
A: No, we haven't considered that.
Moderator: dynamic frequency/voltage scaling probably won't work on future processors, so it's all a waste of time anyway.

-----------------------------------------------------------------

Third talk -- light weighted virtualisation layer for multicore processor-based embedded systems.
Hitoshi Mitake.

Background: modern embedded systems complex. Real time constraints, but also a fancy UI. Sometimes solved by running multiple OSes on a VMM. OKL4, RedBend VLX apparently do this.

Their answer: vlk -- thin abstraction layer for CPUs, interrupt controllers, timers, IPIs. Abstraction, but no policy management.

Why new hypervisor? Using SuperH architecture. Lock-holder preemption == preemption of threads which are holding spinlocks. Solved using a fixed priority scheduling algorithm.

Motivation: #include "microkernels.tex". Want multiple resource management policies. Aim is ``temporal and code isolation between contexts of guest OSes''. Doing everything in one OS means that you pay the price of having RT even when you're not using it. Using multi-OS personalities means that ``shared resources between OSes can be managed easily''.

Future work: support spatial isolation, so that single-factor evaluation is easier.
Q: This has already been done by Jason Li. How does your stuff compare?
A: I've never heard of it.
Q: Is the SuperH special in any sense?
A: It's RISC and superscalar. Software-managed TLB. No virtualisation support.
Q: How does this compare to paravirtualisation? i.e. do you actually gain anything by specialising the hypervisor to the OS?
A:
Q: There are lots of commercial things which do the same thing e.g. RTEMS? Why is this better than that?
Q: What are you actually trying to do here?
A: Putting a non-realtime hypervisor/OS on the bottom won't work.

Panel session
-------------

Q: As we have large numbers of cores, to what extent will virtualisation be about partitioning resources rather than scheduling them?
A: In static partitioning, even if there are no RT activities,
A: It depends on what you want to run on these systems. We don't have too many embarrassingly parallel applications, and at that point if you have an enormous number of cores you probably can't fill them all, so scheduling is less important.
Q: We're moving towards the cloud, which makes scheduling necessary.
Q: If you look at phones then they often have a dedicated unit for doing the phone processing independent of the main processor.
Q: Partly for regulatory reasons, though.
Ross: Heterogeneous architecture e.g. if you have a DSP you might be able to do software radio?
Q: That makes it a lot easier, yes. Other constraint is power, though, and that argues for a carefully-optimised ASIC rather than a high-power GPU/whatever. Also, many phones are quite locked-down.
Rob: How do you get programs to tell you their security policy (or whatever)?
Anil: TCP interacts badly with radio power management, so you need cross-layer optimisation. Changing hardware ends up rippling all the way through to the web layer.
Q: Resource sharing usually means sharing chip area, but perhaps sharing battery charge is more useful?
Simon: Most wireless base stations are FPGA-based rather than ASIC-based, just because they need the flexibility.