- all the current language-level models are badly broken
    Java: not sound wrt implementations
    C/C++11: thin-air disaster for programs using relaxed
      ...and for programs mixing nonatomics and atomics?

- two kinds of fix:

  1) prevent all load->store reordering (nice and simple)

     C/C++11 performance: arguably not too bad, as it only restricts local
     ordering between relaxed accesses.  That would affect ARM and Nvidia
     implementations (and, architecturally but not for current
     implementations, Power).

     Carl/Scott scheme: benchmark the Linux kernel with that restriction on
     ACCESS_ONCE and rcu_dereference to get some realish numbers.

     To really measure, need to make a compiler that tries hard to optimise
     these away, and see how well it works.  Worth enquiring (at Google?).

     Java performance: guess much worse, as the restriction applies to all
     accesses, but optimisation *might* save one enough.  How to quantify
     the cost?  No-one is currently doing that.
       D: more than 20% would be too much?
       H, A: guess much less.  A could get some approximate figure?

     Naive h/w load->store reordering inhibition cost?  Microarchitecturally,
     on Nvidia and ARM, why do we see LB?

     In any case this is accepting a performance cost due to our lack of
     imagination in inventing a decent PL semantics.

  2) invent a semantics that is not per-candidate-execution (our example
     shows that that is necessary)

     We have experiments in this direction; all problematic:

     a) ticky semantics
        - doesn't cope with non-thread-local analysis, and needs semantic
          equivalence for irrelevant-read elimination
        - uses cross-execution existence-of-event knowledge (mild)
     b) concatenation of Jaro-style abstract optimisations
        - doesn't deal with non-thread-local...
     c) Viktor's concatenation of sb-reduction and a base model
     d) Alan's event-structure semantics
        - does it admit h/w ARM/Power executions?
        - how does it compare with C/C++11?
        - how does the order that justification must be embedded in relate
          to anything that compilers+h/w are doing?
        - it uses justifying events from quite different (nonreachable)
          executions - suspicious?
These are pushing one towards a semantic setup involving a global fixed
point...

The global (non-per-thread) analysis information that real global
optimisations depend on seems to be of a particular form: non-relational
(properties of the sets of values that single variables take on in
isolation) and sound w.r.t. an extremely weak model, as the analysis is
completely context-insensitive, or at most call-stack sensitive?

- can try to identify the very egregious thin-air examples, per Ali.
  Also via some hyper-symbolic cycle detection.

- the definition of undefined behaviour is an opportunity for research:
  both the interaction with thin-air, and also the examples not involving
  thin-air (but involving any thread-local reordering, maybe even between
  non-memory-accessing statements).  Some people should write a note.

- there are some easy model fixes: Viktor's #2 and #3.  (Are these enough
  to fix the sc-atomic/nonatomic thin-air cases?  No, that needs #0 (or
  #1?).  Maybe that's worth pushing to the committee after all.)

    #0 changing rf to nonatomics to be both at reader and writer side
    #1 rf and nonatomics
    #2 sc read restriction
    #3 release sequences
    #4 same-thread consume
    #5 apply mo to nonatomics

---------------------

Key question: what information about other executions does the semantics
need to look at?

In ticky, we see that it needs:
- existence of actions across all branches, for ...
- semantic identity of subtrees, for ...

While in Alan's, it needs:
- existence of justifying actions in other (perhaps unreachable?)
  executions.  That seems too much.

And in Jaro's trace-set semantics...