==========================================================

Related: N2219: Clarifying Pointer Provenance (Q1-Q20) v3 N2223: Clarifying the C Memory Object Model: Introduction to N2219 - N2222, N2090, Section 4 of N2012, Questions 3/15, 4/15, and 5/15 of our survey, Section 2.1-2.9 (Q1-20) of our N2013, and DR260

1 Summary

This is a revision of N2219, which was based on N2090, which was based on N2012 (Section 4), adding a concrete Technical Corrigendum proposal for discussion, revising the text, and adding concrete examples (from N2013).

1.1 Problem

C pointer values could traditionally be considered to be concrete numeric values (our survey, Questions 3, 4, and 5, indicates many still do). The DR260 Committee Response suggests otherwise, hinting at a notion of provenance associated to values that keeps track of their "origins":

"Implementations are permitted to track the origins of a bit-pattern and treat those representing an indeterminate value as distinct from those representing a determined value. They may also treat pointers based on different origins as distinct even though they are bitwise identical."

Current compilers exploit this, using it to justify alias analysis based on provenance distinctions. However, the DR260 CR has not been incorporated in the standard text, and it also leaves many specific questions unclear: it's ambiguous whether some idioms are allowed or not, and unclear what alias analysis and optimisation is allowed to do.

In this note we go through a sequence of specific questions, supported by concrete examples, and make a candidate proposal for discussion.

1.2 Status and summary of changes since N2219

We're reasonably confident in the basic proposal, with provenances (single or empty) carried with pointer and integer values. The main open questions are:

This has been updated following discussions with Martin Sebor, Jens Gustedt, Martin Uecker, Richard Smith, Chris Lattner, Kostya Serebryany, Nuno Lopes, Juneyoung Lee.

1.3 Outline of proposal

The basic idea is to associate a provenance with every pointer value, essentially identifying the original allocation the pointer is derived from. This is for the "C abstract machine" as defined in the standard: compilers might rely on provenance in their alias analysis and optimisation, but one would not expect normal implementations to record or manipulate provenance at runtime (though dynamic or static analysis tools might). Accordingly, provenances do not have any representation.

Then there are many specific choices of how provenance is affected by arithmetic operations and suchlike. We first discuss the questions and then summarise our proposal.

Note that provenance-based alias analysis should not be confused with type-based alias analysis, and it is (contrary to the expectations of some expert C users and OS developers) still observable in current mainstream implementations even with -fno-strict-aliasing, opening the door to subtle bugs. Compilers should probably provide an option to turn it off, and any proposal for "safe C" should mandate and describe that option. We define the behaviour with and without a hypothetical -fno-provenance flag here.

Exactly how that option is expressed might be left implementation-defined (in some implementations it might not be a command-line flag). There should be a corresponding feature-test macro to expose whether it has been enabled. What granularity it should have (e.g. per-function / per-compilation-unit / per-compilation / per-link-job) needs thought.

2 Questions and examples

2.1 Basic provenance

2.1.1 Q1. Must the pointer used for a memory access have the right provenance, i.e. be derived from the pointer to the original allocation (with undefined behaviour otherwise)? (This lets compilers do provenance-based alias analysis) Yes

Here DR260CR clearly says yes. Our experimental data shows cases where recent versions of GCC and ICC do assume non-aliasing of pointers with identical representation values but distinct provenance. This is incompatible with a concrete semantics of pointers (where they are fully characterised by their representation values). Tracking of provenance in the "abstract machine" is therefore clearly necessary to make these compilers sound with respect to the standard.

For example, consider the following pathological code (adapted from DR260).

Example: provenance_basic_global_yx.c
#include <stdio.h>
#include <string.h> 
int  y=2, x=1;
int main() {
  int *p = &x + 1;
  int *q = &y;
  printf("Addresses: p=%p q=%p\n",(void*)p,(void*)q);
  if (memcmp(&p, &q, sizeof(p)) == 0) {
    *p = 11;  // does this have undefined behaviour?
    printf("x=%d y=%d *p=%d *q=%d\n",x,y,*p,*q);
  }
  return 0;
}

Depending on the implementation, x and y might happen to be allocated in adjacent memory, in which case &x+1 and &y will have bitwise-identical representation values, the memcmp will succeed, and p (derived from a pointer to x) will have the same representation value as a pointer to a different object, y, at the point of the update *p=11. This can occur in practice, e.g. with GCC -O2. Its output of

x=1 y=2 *p=11 *q=2

suggests that the compiler is reasoning that *p does not alias with y or *q, and hence that the initial value of y=2 can be propagated to the final printf.

This outcome would not be correct with respect to a naive concrete semantics, and so to make the compiler sound it is necessary for this program to be deemed to have undefined behaviour. Note that this example does not involve type-based alias analysis, and the outcome is not affected by GCC's -fno-strict-aliasing flag. One might ask whether the mere formation of the pointer x+1 is legal. This case is explicitly permitted by the ISO standard.

2.1.2 Q2. Can equality testing on pointers be affected by pointer provenance information? Yes, nondetermistically

Example: provenance_equality_global_yx.c
#include <stdio.h>
#include <string.h> 
int  y=2, x=1;
int main() {
  int *p = &x + 1;
  int *q = &y;
  printf("Addresses: p=%p q=%p\n",(void*)p,(void*)q);
  _Bool b = (p==q);
  // can this be false even with identical addresses?
  printf("(p==q) = %s\n", b?"true":"false");
  return 0;
}

This is also allowed according to DR260CR. We have observed GCC regarding two pointers with different provenance as nonequal (with ==) even though they have the same representation value. This happens in some circumstances but not others, so we suggest that whether pointer equality takes provenance into account or not should be made indeterminate in the standard (again to make the observed compiler behaviour sound with respect to the standard). Note that requiring equality to always take provenance into account would require implementations to track provenance at runtime.

The ISO C11 standard text is too strong here: 6.5.9p6 says "Two pointers compare equal if and only if both are [...] or one is a pointer to one past the end of one array object and the other is a pointer to the start of a different array object that happens to immediately follow the first array object in the address space", which requires such pointers to compare equal (reasonable pre-DR260CR, but not after it). We don't expect programmers to rely on that behaviour and GCC does not satisfy it, so, to be consistent with DR260CR and with the indeterminate behaviour we suggest, it should permit them to compare either equal or non-equal.

2.2 Pointer provenance via integer types

2.2.1 Q3 Can one make a usable pointer via casts to intptr_t and back? Yes

ISO C11 optionally allows implementations to provide the type intptr_t (along with an unsigned variant) with guaranteed round-trip properties for pointer/integer casts. The following should be allowed, and that means that the C abstract machine should track provenance via such casts to and from integer values.

Example: provenance_roundtrip_via_intptr_t.c
#include <stdio.h>
#include <inttypes.h>
int  x=1;
int main() {
  int *p = &x;
  intptr_t i = (intptr_t)p;
  int *q = (int *)i;
  *q = 11; // is this free of undefined behaviour?
  printf("*p=%d  *q=%d\n",*p,*q);  
}

2.2.2 Q4 Can one make a usable pointer via casts to unsigned long and back? impl-def

It also seems to be common practice (e.g. in Linux) to extend these roundtrip properties to unsigned long, as in the example below, when its implementation is large enough. We suggest that it be implementation-defined which integer types support this - typically this will be all the types with wide-enough implementation representations.

Example: provenance_roundtrip_via_unsigned_long.c
#include <stdio.h>
int  x=1;
int main() {
  int *p = &x;
  unsigned long i = (unsigned long)p;
  int *q = (int *)i;
  *q = 11; // is this free of undefined behaviour?
  printf("*p=%d  *q=%d\n",*p,*q);  
}

2.2.3 Q5 Must provenance information be tracked via casts to integer types and integer arithmetic? Yes

Example: provenance_basic_using_intptr_t_global_yx.c
#include <stdio.h>
#include <string.h> 
#include <stdint.h>
#include <inttypes.h>
int  y = 2, x = 1;
int main() {
  intptr_t ux = (intptr_t)&x;
  intptr_t uy = (intptr_t)&y;
  intptr_t offset = 4;
  int *p = (int *)(ux + offset);
  int *q = &y;
  printf("Addresses: &x=%"PRIiPTR" p=%p &y=%"PRIiPTR\
         "\n",ux,(void*)p,uy);
  if (memcmp(&p, &q, sizeof(p)) == 0) {
    *p = 11; // does this have undefined behaviour?
    printf("x=%d  y=%d  *p=%d  *q=%d\n",x,y,*p,*q); 
  }
}

Given the type intptr_t, this question asks whether one can return to a concrete view of pointers as simple numbers, by casting to intptr_t followed by integer arithmetic and casting back to a pointer type. Here again, we observe GCC behaving the same as with Q1, reasoning that pointers obtained in this way cannot alias even if they have the same numerical values. This observation is reinforced by the GCC documentation, which mentions an "original pointer" associated to integer values cast to pointer type, so the answer seems to be "yes". This leads to many more questions regarding the specifics of how provenance information affect the semantics of each integer operator. Some of these are discussed in the next subsection and the remainder are given a complete treatment in the summary of our proposal at the end.

2.2.4 Q6 Can one use bit manipulation and integer casts to store information in unused bits of pointers? Yes

Now we extend the example of Q3, that cast a pointer to intptr_t and back, to use bitwise logical operations on the integer value to store some tag bits. The following code exhibits a strong form of this, storing the address and tag bit combination as a pointer (which thereby creates a misaligned pointer value, though one not used for accesses); a weaker form would store the combined value only as an integer.

Example: provenance_tag_bits_via_uintptr_t_1.c
#include <assert.h>
#include <stdio.h>
#include <stdint.h>
int x=1;
int main() {
  int *p = &x;
  // cast &x to an integer 
  uintptr_t i = (uintptr_t) p;
  // check the bottom two bits of an int* are not used
  assert(_Alignof(int) >= 4);
  assert((i & 3u) == 0u);
  // construct an integer like &x with low-order bit set
  i = i | 1u;  
  // cast back to a pointer
  int *q = (int *) i; // defined behaviour?
  // cast to integer and mask out the low-order two bits
  uintptr_t j = ((uintptr_t)q) & ~((uintptr_t)3u);  
  // cast back to a pointer
  int *r = (int *) j; 
  // are r and p now equivalent?  
  *r = 11;           //  defined behaviour? 
  _Bool b = (r==p); 
  printf("x=%i *r=%i (r==p)=%s\n",x,*r,b?"true":"false");  
}

The standard leaves conversions between integer and pointer types implementation-defined (6.3.2.3p{5,6}), but it is common practice to use unused pointer bits (either low-order bits from alignment requirements or high-order bits beyond the maximum address range). We suggest that the set of unused bits for pointer types of each alignment should be made implementation-defined, to make this practice legal.

Moreover, where the standard does give a guarantee about round-trip casts, e.g. for round-trips through intptr_t (7.20.1.4p1), it says only that the result "will compare equal". In a provenance-aware semantics, that may not be enough to make the result usable to reference memory; the standard text should be strengthened here to guarantee that by giving the result a usable provenance - the same as that of the source. And similarly for casts to void * and back, and adding and removing qualifiers (c.f. 6.3.2.3p{1,2}).

2.2.5 Q7 Can equality testing on integers that are derived from pointer values be affected by their provenance? No

Example: provenance_equality_uintptr_t_global_yx.c
#include <stdio.h>
#include <inttypes.h> 
int y=2, x=1;
int main() {
  uintptr_t p = (uintptr_t)(&x + 1);
  uintptr_t q = (uintptr_t)&y;
  printf("Addresses: p=%" PRIxPTR " q=%" PRIxPTR "\n",
         p,q);
  _Bool b = (p==q);
  // can this be false even with identical numeric addresses?
  printf("(p==q) = %s\n", b?"true":"false");
  return 0;
}

GCC did at one point print false for this, but it was regarded as a bug and fixed. We have observed in Clang for a similar example, but believe it is also a bug there. DR 260CR does not address the question. We believe that integer equality testing should not be affected by provenance, i.e. "no".

This example is inspired by one from Krebbers' PhD thesis.

2.3 Pointers involving multiple provenances

2.3.1 Q8 Should intra-object pointer subtraction give provenance-free integer results? Yes

DR260CR does not address this, but it is uncontroversially "yes": an intra-object pointer subtraction, say between the addresses of two elements of an array, should give a provenance-free integer offset that can then be used for indexing into this or other arrays.

Example: provenance_multiple_2_global.c
#include <stdio.h>
int y[2], x[2];
int main() {
  int *p = &(x[0]) + (&(y[1])-&(y[0]));
  *p = 11;  // is this free of undefined behaviour?
  printf("x[1]=%d *p=%d\n",x[1],*p);
  return 0;
}

We say a provenance-free integer offset because the fact that an offset is calculated from pointers to y should not, when added to a pointer to a distinct (maybe adjacent) object x, license its use to access y. The example below should not be allowed to access y[0], and we observe GCC optimising based on that assumption.

Example: provenance_multiple_4_global_yx.c
#include <stdio.h>
#include <string.h> 
int  y[2], x[2];
int main() {
  int *p = &x[1] + (&y[1]-&y[0]) + 0;
  int *q = &y[0];
  printf("Addresses: p=%p q=%p\n",(void*)p,(void*)q);
  if (memcmp(&p, &q, sizeof(p)) == 0) {
    *p = 11;  // does this have undefined behaviour?
    printf("y[0]=%d *p=%d *q=%d\n",y[0],*p,*q);
  }
  return 0;
}

2.3.2 Q9 Can one make a usable offset between two separately allocated objects by inter-object subtraction (using either pointer or integer arithmetic), to make a usable pointer to the second by adding the offset to the first? No

This is asking about pointers that have multiple provenances, which is not addressed in DR260CR or current GCC or Clang compiler documentation - they refer to "the origin" of a pointer as if there were necessarily only one.

The example below is a variant of the Q5 provenance_basic_using_intptr_t_global_yx.c in which the constant offset is replaced by a subtraction (here after casting from pointer to integer type).

Example: pointer_offset_from_subtraction_1_global.c
#include <stdio.h>
#include <string.h> 
#include <stdint.h>
#include <inttypes.h>
int  y = 2, x=1;
int main() {
  intptr_t ux = (intptr_t)&x;
  intptr_t uy = (intptr_t)&y;
  intptr_t offset = uy - ux;
  printf("Addresses: &x=%"PRIiPTR" &y=%"PRIiPTR\
         " offset=%"PRIiPTR" \n",ux,uy,offset);
  int *p = (int *)(ux + offset);
  int *q = &y;
  if (memcmp(&p, &q, sizeof(p)) == 0) {
    *p = 11; // is this free of undefined behaviour?
    printf("x=%d  y=%d  *p=%d  *q=%d\n",x,y,*p,*q); 
  }
}

Our experiments and our survey responses both suggest that compilers do not in general support this, and we imagine it is uncommon in practice. For normal code, we suggest "no".

However, there do seem to be specific important use cases. These include the Linux and FreeBSD per-CPU-variable implementations - though it is unclear whether these are between multiple allocations in the C sense. We suspect there are other cases, e.g.~within fancy allocators. What kind of escape hatch is needed to support this code needs further discussion. The basic question seems to be how allocator implementations should communicate with compiler alias analysis.

2.3.3 Q11 Is the XOR linked list idiom supported? No

The classic XOR linked list algorithm (implementing a doubly linked list with only one pointer per node, by storing the XOR of two pointers) also makes essential use of multiple-provenance pointers. In this example we XOR the integer values from two pointers and XOR the result again with one of them.

Example: pointer_offset_xor_global.c
#include <stdio.h>
#include <inttypes.h>
int x=1;
int y=2;  
int main() {
  int *p = &x;
  int *q = &y;
  uintptr_t i = (uintptr_t) p;
  uintptr_t j = (uintptr_t) q;
  uintptr_t k = i ^ j;
  uintptr_t l = k ^ i;
  int *r = (int *)l;
  // are r and q now equivalent?  
  *r = 11;     // does this have defined behaviour?             
  _Bool b = (r==q); 
  printf("x=%i y=%i *r=%i (r==p)=%s\n",x,y,*r,
         b?"true":"false");  
}

It is unclear whether this algorithm is important in modern practice. One respondent remarks that the XOR list implementation interacts badly with modern pipelines and the space saving is not a big win; another doesn't know of modern usages, though suspects that it is probably still used in places. We don't know whether current compiler alias analysis permits it or not. Our suggested semantics would not allow it.

If one did want to allow it in specific code, it's not clear what kind of escape hatch would be necessary. At the point of construction of the final pointer, that code doesn't have any other way to refer to the provenance of the original allocation, so an attribute to add a specific provenance to a pointer (as suggested by Jens Gustedt) would not suffice for this case. A general wildcard would, but that's problematic w.r.t. alias analysis.

2.4 Pointer provenance via pointer representation copying

2.4.1 Q13 Can one make a usable copy of a pointer by copying its representation bytes using the library memcpy? Yes

Example: pointer_copy_memcpy.c
#include <stdio.h>
#include <string.h>
int x=1;
int main() {
  int *p = &x;
  int *q;
  memcpy (&q, &p, sizeof p); 
  *q = 11; // is this free of undefined behaviour?
  printf("*p=%d  *q=%d\n",*p,*q);  
}

The ISO C11 text does not explicitly address this. In a pre-provenance semantics, before DR260, it did not need to, but now (as it surely should be allowed) the standard needs to guarantee that the result has the appropriate provenance to make it usable. One could allow it by special-casing memcpy() to preserve provenance, but the following questions suggest a less ad hoc approach.

2.4.2 Q14 Can one make a usable copy of a pointer by copying its representation bytes (unchanged) in user code? Yes

Example: pointer_copy_user_dataflow_direct_bytewise.c
#include <stdio.h>
#include <string.h>
int  x=1;
void user_memcpy(unsigned char* dest, 
                 unsigned char *src, size_t n) {
  while (n > 0)  {      
    *dest = *src;
    src += 1;
    dest += 1;
    n -= 1;
  }
}
int main() {
  int *p = &x;
  int *q;
  user_memcpy((unsigned char*)&q, (unsigned char*)&p, 
              sizeof(p));
  *q = 11; // is this free of undefined behaviour?
  printf("*p=%d  *q=%d\n",*p,*q);
}

ISO C11 and DR260CR again do not mention this explicitly (though the 6.5p6 effective type text weakly implies it is allowed). We believe it is widely relied on.
(The only exceptions we are aware of are capability machines such as IBM system 38 and descendents, and CHERI. In CHERI you have to copy pointers at pointer types for it to work properly, but capability loads and stores can operate generically, because the capability registers have tag bits. There is also some new tagged memory support for Oracle Sparc, to find invalid pointers.)

Our proposed semantics makes it legal by regarding each representation byte (as an integer value) as having the provenance of the original pointer, and the result pointer, being composed of representation bytes with that provenance, as having the same. We could either insist:

The former is our preferred option. There may not be much reasonable code that would be sensitive to the distinctions between these - perhaps manipulations of pointers where one knows the high-order bytes are the same, as in the survey response mentioning encoding 64-bit pointers in 48 bits. Our semantics will permit that.

2.4.3 Q15 Can one make a usable copy of a pointer by copying its representation bytes by user code that indirectly computes the identity function on those bytes? Yes if bytewise

For example, suppose one reads the bytes of a pointer representation pointing to some object, encrypts them, decrypts them, store them as the representation of another pointer value, and tries to access the object. The following code is a simplified version of this, just using a XOR twice; one should imagine a more complex transform, with the transform and its inverse separated in the code and in time so that the compiler cannot analyse them.

Example: pointer_copy_user_dataflow_indirect_bytewise.c
#include <stdio.h>
#include <string.h>
int  x=1;
void user_memcpy2(unsigned char* dest, 
                  unsigned char *src, size_t n) {
  while (n > 0)  {      
    *dest = ((*src) ^ 1) ^ 1;
    src += 1;
    dest += 1;
    n -= 1;
  }
}
int main() {
  int *p = &x;
  int *q;
  user_memcpy2((unsigned char*)&q, (unsigned char*)&p, 
              sizeof(p));
  *q = 11; // is this free of undefined behaviour?
  printf("*p=%d  *q=%d\n",*p,*q);
}

Whether this should be supported is unclear, and DR260 CR does not speak to it
(it calls out the library memcpy and memmove as special cases: "Note that using assignment or bitwise copying via memcpy or memmove of a determinate value makes the destination acquire the same determinate value.").

Our proposal would allow it only in the case where bytes of different pointer values are not mixed during the computation.

Note that this does not permit code that compresses or encrypts multiple pointers together, as such a computation would (somewhere) be combining values of different provenances, with an empty-provenance result. That would need an escape hatch.

Richard Smith argues that pointer construction via 'intptr_t' and back via any value-dependent identity function should be required to work. That would admit the encryption/compressing of multiple pointers, which our proposal forbids. But defining that notion of "value-dependent" is exactly the thing that is hard in the concurrency thin-air problem, and we don't believe compilers can be made to respect it in general.

2.4.4 Q16 Can one carry provenance through dataflow alone or also through control flow? Dataflow only

Our provenance examples so far have all only involved dataflow; we also have to ask if a usable pointer can be constructed via non-dataflow control-flow paths, e.g. if testing equality of an unprovenanced integer value against a valid pointer permits the integer to be used as if it had the same provenance as the pointer. We don't expect that this is relied on in practice, and our proposed semantics does not permit it - we track provenance only through dataflow. This needs to be discussed with respect to current compiler analysis behaviour.

For example, consider a version of the previous indirect memcpy example with a control-flow choice on the value of the bytes:

Example: pointer_copy_user_ctrlflow_bytewise_abbrev.c
#include <stdio.h>
#include <string.h>
#include <assert.h>
#include <limits.h>
int  x=1;
unsigned char control_flow_copy(unsigned char c) {
  assert(UCHAR_MAX==255);
  switch (c) {
  case 0: return(0);
  case 1: return(1);
  case 2: return(2);
  ...
  case 255: return(255);
  }
}
void user_memcpy2(unsigned char* dest, 
                  unsigned char *src, size_t n) {
  while (n > 0)  {      
    *dest = control_flow_copy(*src);
    src += 1;
    dest += 1;
    n -= 1;
  }
}
int main() {
  int *p = &x;
  int *q;
  user_memcpy2((unsigned char*)&q, (unsigned char*)&p, 
              sizeof(p));
  *q = 11; // is this free of undefined behaviour?
  printf("*p=%d  *q=%d\n",*p,*q);
}

Similarly, one can imagine copying a pointer via uintptr_t bit-by-bit via a control-flow choice for each bit.

Finally, contrasting with the first two examples above, that recover all the concrete value information of the original pointer, we can consider a variant of the Q5 provenance_basic_using_intptr_t_global_yx.c example in which there is a control-flow choice based on partial information of the intended target pointer (here just whether q is null) and the concrete value information is obtained otherwise:

Example: provenance_basic_mixed_global_offset+4.c
#include <stdio.h>
#include <string.h> 
#include <stdint.h>
#include <inttypes.h>
int  y = 2, x=1;
int main() {
  intptr_t ux = (intptr_t)&x;
  intptr_t uy = (intptr_t)&y;
  intptr_t offset = 4;
  printf("Addresses: &x=%"PRIiPTR" &y=%"PRIiPTR\
         "\n",ux,uy);
  int *q = &y;
  if (q != NULL) {
    int *p = (int *)(ux + offset);
    if (memcmp(&p, &q, sizeof(p)) == 0) {
      *p = 11; // is this free of undefined behaviour?
      printf("x=%d  y=%d  *p=%d  *q=%d\n",x,y,*p,*q); 
    }
  }
}

A semantics that tracks provenance only through dataflow dependency seems to be the simplest option and probably compatible with programming practice; we imagine that none of these idioms occur in normal practice. It would forbid the above examples, while permitting the dataflow bitwise copy example.
This is our preferred option.

Allowing provenance to be propagated via any control-flow dependency would allow all these examples, but it seems clear that the last example should be forbidden, in ISO or de facto semantics, and indeed GCC is again doing an optimisation that would not be sound if it were. In real code we imagine that many pointer accesses are in some way control-flow dependent on others, given the many null-pointer checks required in C, so tracking that would neither be feasible nor useful.

2.5 Pointer provenance and union type punning

2.5.1 Q17 Is type punning between integer and pointer values allowed? Yes, if representation conversion the identity

The following example (analogous to the provenance_roundtrip_via_intptr_t.c of Q3) constructs a pointer by casting a pointer to uintptr_t, storing that in a member of a union of that type, and then reading from a member of the union of pointer type.

Example: provenance_union_punning_1_global.c
#include <stdio.h>
#include <string.h> 
#include <inttypes.h>
int x=1;
typedef union { uintptr_t ui; int *p; } un;
int main() {
  un u; 
  int *px = &x;
  uintptr_t i = (uintptr_t)px;
  u.ui = i;
  int *p = u.p;
  printf("Addresses: p=%p &x=%p\n",(void*)p,(void*)&x);
  *p = 11;  // is this free of undefined behaviour?
  printf("x=%d *p=%d\n",x,*p);
  return 0;
}

The ISO standard says "the appropriate part of the object representation of the value is reinterpreted as an object representation in the new type", but says little about that reinterpretation, and DR260 CR does not speak to the provenance of the result.

For any particular implementation, pointers to normal object types might or might not have the same representation as uintptr_t values. That might well not hold for some implementations, but it does for our usual ones.
Our semantics has to have an implementation-defined map for the conversions between pointer representations and uintptr_t in any case, so we can say that this example is allowed, preserving the original provenance, iff that is the identity.

That choice relies on an assumption that compiler alias analysis and optimisation are not assuming that this example is undefined behaviour. At present we have no data either way about that.

2.6 Pointer provenance via IO

2.6.1 Q19 Can one make a usable pointer via IO? Yes

We now consider the extreme example of pointer provenance flowing via IO, if one writes the address of an object to a file and reads it back in. We have three versions: one using fprintf/fscanf and the %p format, one using fwrite/fread on the pointer representation bytes, and one converting the pointer to and from uintptr\_t and using fprintf/fscanf on that value with the PRIuPTR/SCNuPTR formats.
The first gives a syntactic indication of a potentially escaping pointer value, while the others (after preprocessing) do not. Giving just the first in full:

Example: provenance_via_io_percentp_global.c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <inttypes.h>
int x=1;
int main() {
  int *p = &x;
  FILE *f = fopen(
    "provenance_via_io_percentp_global.tmp","w+b");
  printf("Addresses: p=%p\n",(void*)p);
  // print pointer address to a file
  fprintf(f,"%p\n",(void*)p);
  rewind(f);
  void *rv;
  int n = fscanf(f,"%p\n",&rv);
  int *r = (int *)rv;
  if (n != 1) exit(EXIT_FAILURE);
  printf("Addresses: r=%p\n",(void*)r);
  // are r and p now equivalent?  
  *r=12; // is this free of undefined behaviour?                                                           
  _Bool b1 = (r==p); // do they compare equal?                      
  _Bool b2 = (0==memcmp(&r,&p,sizeof(r)));//same reps?
  printf("x=%i *r=%i b1=%s b2=%s\n",x,*r,
         b1?"true":"false",b2?"true":"false");
}

This is used in practice: in graphics code for marshalling/unmarshalling, using %p, in xlib, using SCNuPTR, and in debuggers.

In the ISO standard, the standard text for fprintf and scanf for %p say that this should work: "If the input item is a value converted earlier during the same program execution, the pointer that results shall compare equal to that value; otherwise the behavior of the %p conversion is undefined" (modulo the usual remarks about "compare equal"), and the text for uintptr_t and the presence of SCNuPTR in inttypes.h implies the same there.

To make the standard allow it in a provenance-aware C abstract machine, we suggest that either

After a successful usage of a pointer with wildcard provenance, one could conceivably side-effect the pointer value to collapse its provenance down to the one used. We have no idea whether compilers could or do depend on that; absent a compelling reason otherwise, we do not plan to build that in to our semantics.

2.6.2 Q20 Can one make a usable pointer from a concrete address (of device memory)? Yes

C programs should normally not form pointers from particular concrete addresses. For example, the following should normally be considered to have undefined behaviour, as address 0xABC might not be mapped or, if it is, might alias with other data used by the runtime. By the ISO standard it does have undefined behaviour, consistent with an abstract view of pointers.

Example: pointer_from_concrete_address_1.c
int main() { 
  // on systems where 0xABC is not a legal non-stack/heap 
  // address, does this have undefined behaviour?
  *((int *)0xABC) = 123;
}

But in some circumstances, especially for embedded devices, it is idiomatic to use concrete addresses in C to access memory-mapped devices, e.g.

Example: pointer_from_concrete_address_2.c
#define PORTBASE 0x40000000
unsigned int volatile * const port = 
  (unsigned int *) PORTBASE;
int main() {
  unsigned int value = 0;
  // on systems where PORTBASE is a legal non-stack/heap 
  // address, does this have defined behaviour?
  *port = value; /* write to port */
  value = *port; /* read from port */
}

Our suggestion is to introduce an implementation-defined set of "device memory" addresses (which may depend on linking), and which is guaranteed to be disjoint from normal C-accessible stack, heap, and program memory, for which the creation of such pointers be allowed.

2.7 Pointer provenance for subobjects

2.7.1 Q?. Are provenance checks only on a per-allocation granularity, or per-subobject? Debatable

Our previous proposal (N2219 and before) had a simple per-allocation notion of provenance, allowing access via a pointer anywhere within the memory footprint of the original allocation (though note that without '-fno-strict-aliasing' there would be some additional restrictions).

In some cases this is needed, e.g.:

But in other cases per-allocation provenance allows intra-allocation examples which we don't think should be supported, e.g. as below, and which will unnecessarily impede alias analysis. Existing compilers may well do this (according to Nuno Lopes, GCC and ICC already support some subobject analysis, and people are working on it for LLVM. We don't know whether or not these analyses are turned on even with '-fno-strict-aliasing'.)

Example: ?
#include <stdio.h>
#include <string.h> 
typedef struct { int x; int y; } st;
int main() {
  st s;
  int *p = &s.x + 1;
  int *q = &s.y;
  if (memcmp(&p, &q, sizeof(p)) == 0) {
    *p = 11;  // does this have undefined behaviour?
  }
}

A possible reconcilation of all this is to have a per-subobject provenance restriction by default, but relax this (to per-allocation provenance) for pointers that have been formed by an explicit cast. Perhaps only for casts to 'void *', 'unsigned char *', 'intptr_t', or 'uintptr_t', or perhaps (for simplicity) for all casts. This semantics was suggested by the RV-match documents.

We agree, especially given the status of compiler subobject alias analysis mentioned above, that something like that is needed. But is it already covered by effective types, or do we need separate subobject provenance enforcement even with '-fno-strict-aliasing'? That isn't clear.

For the moment the detailed proposal we provide does not do subobject provenance, so it allows the above example.

To enforce subobject provenance, we'd adapt the proposal to keep, with every pointer and integer value, both a provenance ID and a subrange of the original allocation footprint. When constructing a pointer to a struct member, by '&(x.p)', we'd take the subrange of that member. When constructing a pointer to a union member, we'd take the size of that member. For a pointer constructed as the address of a specific array element, it's unclear whether one should take just that element or the whole array. Then on any cast, we'd expand the subrange to that of the original allocation. Similarly on any representation-byte write, except that we'd like a user-memcopy to reconstruct the original limited pointer value.

2.8 Pointer provenance in C++, and provenance within allocated regions.

[This is from discussion with Richard Smith, C++ editor, at EuroLLVM 2018]

Ignoring malloc'd regions and wildcard-provenance pointers, our proposal seems broadly consistent with the current C++ draft for the "C fragment" of C++ (whatever that is exactly).

Note that this relies on the "implicit forwarding rule", and that in C++ they split allocation lifetime and object lifetime.

For plain-old-data (POD) object creation within malloc'd regions, which in practice will happen incrementally as struct/union members or array elements are written, they have an exotic style for angelically specifying ghost "object creation" in whatever way is necessary to make the program defined (if such exists). It'd probably be better to accumulate constraints explicitly.

In C++, if you create two distinct objects in the same allocated region from operator-new (a possibly user-defined allocator), then implementations will assume you can't get from one to the other by pointer arithmetic.

If you really malloc a region and create two POD objects, do the same rules apply? In principle, when you create an object, the pointer you get can only be used to access that object. In practice, RS doesn't know of implementations that take advantage of that (except maybe XL C/C++, which is rather aggressive for this), and there probably is user code that takes advantage of that. C++ text currently forbids it. Probably it should be allowed? Otherwise how could one use incrementally initialised arrays in an allocated region?

XL C/C++ might have an optimisation that assumes if you pass a function a pointer to a member, it assumes it doesn't modify the rest of the struct. That's arguably over-aggressive.

Also relevant here is our N2013 Q75. Can an unsigned character array with static or automatic storage duration be used (in the same way as a malloc’d region) to hold values of other types?. As we say there, a literal reading of the ISO effective type rules for C forbids this, but we believe it needs to be allowed. C++ explicitly allows it.

2.9 The problem with lost escapes

[From discussion with Richard Smith]

Our provenance proposal allows computations that erase the numeric value (and hence a concrete view of the "semantic dependencies") of a pointer but retain provenance. This makes examples like that below allowed in our proposal. Note that we don't believe it is especially desirable to allow this particular example - this is just a consequence of choosing a straightforward provenance semantics that allows the bytewise copying and bitwise manipulation of pointers of Q14 and Q15.

Example: ?
#include <stdio.h>
#include <string.h> 
#include <stdint.h> 
int x=1;
int main() {
  int *p = &x;                             // assume allocated at 0x5000
  uintptr_t i1 = (intptr_t)p;              // value 0x5000, provenance x
  uintptr_t i2 = i1 & 0x00000000FFFFFFFF;  // value 0x5000, provenance x
  uintptr_t i3 = i2 & 0xFFFFFFFF00000000;  // value 0x0000, provenance x
  uintptr_t i4 = i3 + 0x5000;              // value 0x5000, provenance x
  int *q = (int *)i4;
  printf("Addresses: p=%p\n",(void*)p);
  if (memcmp(&i1, &i4, sizeof(i1)) == 0) {
    *q = 11;  // does this have defined behaviour?
    printf("x=%d *p=%d *q=%d\n",x,*p,*q);
  }
  return 0;
}

However, in implementations some optimisation may be done before alias analysis - in particular, in LLVM, algebraic optimisations are - and those optimisations might erase the escape of &x, replacing it and all the calculation of i3 by 0x0 (a similar example would have i3 = i1-i1).
But then alias analysis might be unable to see that *q could access x, and so report that it couldn't, and hence enable optimisations that are unsound for this case. The basic point is that escapes are not preserved by optimisation.

A possible solution, which would need some implementation work but (we guess) not very much, would be to require those initial optimisation passes to record the escapes involved in computations they erase, so that that could be passed in explicitly to alias analysis.

This would also be necessary for a more refined treatment of pointers from I/O, in which the source-language semantics permits them to alias only with source-language I/O-escaped pointers - more on that below.

2.10 Provenance escape hatches, for IO and inter-object arithmetic - and the problem with wildcards

Our previous proposal suggested that pointers from I/O, and pointers from inter-object arithmetic explicitly flagged as such, could be supported with a "wildcard" provenance, allowing them to be used to access any current allocation with the same concrete address and type. But (as pointed out by several people, including Martin Uecker), this is problematic: given the existence of such pointers, unless they are somehow kept distinct (e.g. by having distinct types for potentially-wildcard pointers), which seems challenging, a compiler would have to assume that any pointer-typed function argument could be such a wildcard, and then in the semantics it might legitimately alias (and be used to access) function local variables, including those that are only created later in the function.

The LLVM internal-language semantics proposal by Lopes, Lee, Hur, et al. uses timestamps to limit what their (somewhat different) wildcard pointers can alias to.

For the IO case, one could keep track in the semantics of the provenances which have escaped via IO (in all forms, including '%p' and integer IO of 'intptr_t'-cast pointers and pointer representation bytes). Then regard a read pointer as potentially aliasing any of those (of the same type, without '-fno-strict-aliasing').

For pointers constructed via escape-hatch-tagged inter-object pointer arithmetic, it's less clear what would be sensible.
What syntactic form should such an escape hatch take? One could define a source-language-semantics notion of all the provenances which have escaped by the combination of casts to integer type, access to their representations, and I/O, and regard wildcard provenances as allowed to access any of those. That's rather crude - is it too liberal for implemented alias analysis?

At the other extreme, one could require the programmer to explicitly annotate allocations that might later be aliased by pointers constructed by wild casts. Syntactically, these might just be function calls, so we'd have void mark_as_escaping(void *p) and void * wild_cast(intptr_t i) (probably these should have better names, so as not to be confused with normal escape analysis). The former would just record the provenance/allocation of p in the abstract machine; the latter would check the integer value of i is within one of those allocations, and, if it is, give the result the corresponding provenance - with UB otherwise.

A proper solution to this would also handle allocators and their abstraction boundaries. In general an allocator (malloc or other OS or user allocators) starts with some region of memory which is essentially private to it, obtained from linking or from some other allocator, and hands out pointers to parts thereof. As far as the rest of the code is concerned, this should have fresh provenance - roughly as captured by the GCC malloc function attribute: "This tells the compiler that a function is malloc-like, i.e., that the pointer P returned by the function cannot alias any other pointer valid when the function returns, and moreover no pointers to valid objects occur in any storage addressed by P. Using this attribute can improve optimization. Functions like malloc and calloc have this property because they return a pointer to uninitialized or zeroed-out storage. However, functions like realloc do not have this property, as they can return a pointer to storage containing pointers.". When such pointers are passed back into the allocator, e.g. in a call to free, they can alias with the whole region, though presumably the allocator doesn't write into the issued memory until it reallocates it. It might write into metadata via a computation based on the pointer passed into free - which could be handled via an explicit re-provenancing operation, taking that pointer and giving it the whole-region provenance. Proper treatment of this needs annotations to manually annotate lifetime too? Do existing alias analysis implementations get this right? With link-time optimisation (LTO) it presumably really matters.

3 Proposed semantics, in detail

TODO: This is currently the N2219 proposal. The wildcard aspects need to be replaced/updated, and it needs to be adapted to support subobject provenance.

4 Proposed Technical Corrigendum

TODO: This is currently the N2219 proposal. The wildcard aspects need to be replaced/updated, and it needs to be adapted to support subobject provenance.

TODO: There are errors in the lvalue-conversion and equality text, identified by Martin Sebor, that need to be fixed.

provenance
abstract information associated to each concrete value of pointer or integer type. It can either be the empty provenance, a single provenance for a given object or region, or the wildcard provenance.

Without the -fno-provenance option, when the lvalue conversion is performed on an lvalue with pointer type, if it has the empty provenance and the associated memory location is not within the implementation-defined device memory, the behavior is undefined. If it has a single provenance and the object designated by the lvalue is not the same as or within the object from the single provenance, the behavior is undefined.

NOTE: for the wildcard provenance there is no additional check.

With the -fno-provenance option, when the lvalue conversion is performed on an lvalue with pointer type, if the associated memory location is not within a current allocation, the behaviour is undefined.

The provenance of a value resulting from an lvalue conversion is has follows: if all the bytes of the designated object have the empty provenance, then so does the resulting value; if some bytes have a single provenance and all the other bytes have the empty provenance, then the resulting value has the same single provenance; if any two bytes have different single provenances, then the resulting value has the empty provenance; if any byte has the wildcard provenance, then so does the resulting value.

There is an implementation-defined set of integer values for which the resulting pointer is within a device region of storage.

A null pointer has the empty provenance

Some expression operators evaluate to an integer value when both operands have integer type. For these the provenance of the resulting integer is as follows: if the value of both operands have empty provenance, so does the result; if the value of one operand has a single provenance and the other has empty provenance, or both values have the same single provenance, the result has that single provenance; if the values of the two operands have different single provenances, the result has the empty provenance; if the value of one operand has the wildcard provenance, the result has the wildcard provenance.

When the operand designates an object, the result has the single provenance of the outermost object containing that object.

For all these operators, when the operand has integer type, the result has the empty provenance.

The result has the empty provenance.

When both operands have integer type, the resulting integer has provenance as described in X.

The result has the provenance of the first operand.

, and has the empty provenance.

The result has provenance as described in X.

The result has provenance as described in X.

The result has provenance as described in X.

, and has the empty provenance.

, and has the empty provenance.

The value may be copied into an object of type unsigned char [n] (e.g., by memcpy)

   to read:

The value may be copied into an object of type unsigned char [n] (e.g., by memcpy), if the value is an integer or a pointer (and therefore has a provenance), the elements of the resulting array all have that provenance.

If the input item is a value converted earlier during the same program execution, the pointer that results shall compare equal to that value

   to read:

If the input item is a value converted earlier during the same program execution, the pointer that results shall compare equal to the most recent such value and have the same provenance

5 Additional notes that haven't been integrated with the above