- all the current language-level models are badly broken
    Java: not sound wrt implementations
    C/C++11: thin-air disaster for programs using relaxed
      ...and for programs mixing nonatomics and atomics?

- two kinds of fix:

  1) prevent all load->store reordering (nice and simple)

     C/C++11 performance: arguably not too bad, as it only restricts local
     ordering between relaxed accesses.  That would affect ARM and Nvidia
     implementations (and, architecturally but not for current
     implementations, Power).

     Carl/Scott scheme: benchmark the Linux kernel with that restriction on
     ACCESS_ONCE and rcu_dereference to get some realish numbers.

     To really measure, need to make a compiler that tries hard to optimise
     these away, and see how well it works.  Worth enquiring (at Google?).

     Java performance: guess much worse, as the restriction applies to all
     accesses, but optimisation *might* save one enough.  How to quantify
     the cost?  No-one is currently doing that.
       D: more than 20% would be too much?
       H, A: guess much less.  A could get some approximate figure?

     Naive h/w load->store reordering inhibition cost?  Microarchitecturally,
     on Nvidia and ARM, why do we see LB?

     In any case this is accepting a performance cost due to our lack of
     imagination in inventing a decent PL semantics.

  2) invent a semantics that is not per-candidate-execution (our example
     shows that that is necessary)

     We have experiments in this direction; all problematic:

     a) ticky semantics
        - doesn't cope with non-thread-local analysis, and needs semantic
          equivalence for irrelevant-read elimination
        - uses cross-execution existence-of-event knowledge (mild)
     b) concatenation of Jaro-style abstract optimisations
        - doesn't deal with non-thread-local...
     c) Viktor's concatenation of sb-reduction and a base model
     d) Alan's event-structure semantics
        - does it admit h/w ARM/Power executions?
        - how does it compare with C/C++11?
        - how does the order that justification must be embedded in relate
          to anything that compilers+h/w are doing?
        - it uses justifying events from quite different (nonreachable)
          executions - suspicious?
These are pushing one towards a semantic setup involving a global fixed
point...

The global (non-per-thread) analysis information that real global
optimisations depend on seems to be of a particular form: non-relational
(properties of the sets of values that single variables take on in
isolation) and sound w.r.t. an extremely weak model, as the analysis is
completely context-insensitive, or at most call-stack sensitive?

- can try to identify the very egregious thin-air examples, per Ali.
  Also via some hyper-symbolic cycle detection.

- the definition of undefined behaviour is an opportunity for research:
  both the interaction with thin-air, and also the examples not involving
  thin-air (but involving any thread-local reordering, maybe even between
  non-memory-accessing statements).  Some people should write a note.

- there are some easy model fixes: Viktor's #2 and #3.  (Are these enough
  to fix the sc-atomic/nonatomic thin-air cases?  No, that needs #0 (or
  #1?).  Maybe that's worth pushing to the committee after all.)

    #0 changing rf to nonatomics to be both at reader and writer side
    #1 rf and nonatomics
    #2 sc read restriction
    #3 release sequences
    #4 same-thread consume
    #5 apply mo to nonatomics

---------------------

Key question: what information about other executions does the semantics
need to look at?

In ticky, we see that it needs:
- existence of actions across all branches, for ...
- semantic identity of subtrees, for ...

While in Alan's, it needs:
- existence of justifying actions in other (perhaps unreachable?)
  executions.  That seems too much.

And in Jaro's trace-set semantics...