What is C in practice? (Cerberus survey v2): Analysis of Responses (n2014)

Kayvan Memarian and Peter Sewell

University of Cambridge

2015-06-21 (updated 2016-02-05)


In April-September 2015 we distributed a web survey to investigate what C is, in current mainstream practice: the behaviour that programmers assume they can rely on, the behaviour provided by mainstream compilers, and the idioms used in existing code, especially systems code. We were not asking what the ISO C standard permits, which is often more restrictive, or about obsolete or obscure hardware or compilers. We focussed on the behaviour of memory and pointers. This is a step towards an unambiguous and mathematically precise definition of the de facto standards: the C dialects that are actually used by systems programmers and implemented by mainstream compilers.

This document analyses the results. For many questions the outcome seems clear, but for some, especially 1, 2, 9, 10, and 11, major open questions about current compiler behaviour remain; we'd greatly appreciate comments from the relevant compiler developers or other experts.

Acknowledgements

We would like to thank all those who responded to the survey, those who distributed it, and especially those who helped us tune earlier versions, including members of the Cambridge Systems Research Group. This work is funded by the EPSRC REMS (Rigorous Engineering for Mainstream Systems) Programme Grant, EP/K008528/1.

Responses

Aiming for a modest-scale but technically expert audience, we distributed the survey at the University of Cambridge systems research group, at EuroLLVM 2015, via John Regehr's blog, and via various mailing lists: gcc, llvmdev, cfe-dev, libc-alpha, xorg, a FreeBSD list, xen-devel, a Google C users list, and a Google C compilers list. It was then distributed second-hand via Facebook, Twitter, and Reddit. It was also sent to some Linux and MSVC people, but not widely advertised there.

In all there were 323 responses, between 2015/04/10 and 2015/09/29. Of those, 223 included a name and/or an email address while 100 were anonymous. The responses included a few duplicate submissions from non-anonymous people; the earlier submissions are not included in these numbers. There may also be a small number of duplicates from anonymous people. It's hard to be certain about exactly which of those are duplicates, though, so we left them in the data; the small number means they shouldn't significantly affect the results. There were also a few responses directly to the mailing lists, which we include in the text discussion but not in the numbers below.

The responses include around 100 printed pages of textual comments, which are often more meaningful than the numerical survey results. Below we include just a few representative examples for each, not all the comments.

Expertise

C applications programming : 255
C systems programming : 230
Linux developer : 160
Other OS developer : 111
C embedded systems programming : 135
C standard : 70
C or C++ standards committee member : 8
Compiler internals : 64
GCC developer : 15
Clang developer : 26
Other C compiler developer : 22
Program analysis tools : 44
Formal semantics : 18
no response : 6
other : 18

Most have expertise in C systems programming and significant numbers report expertise in compiler internals and in the C standard.

Comparing the numbers for the set of all responses against those restricted to respondents reporting either of those two kinds of expertise, the distributions seem broadly similar. On the whole, the "C standard" expertise people are a little more pessimistic.

MAIN QUESTION RESPONSES

[1/15] How predictable are reads from padding bytes?

If you zero all bytes of a struct and then write some of its members, do reads of the padding return zero? (e.g. for a bytewise CAS or hash of the struct, or to know that no security-relevant data has leaked into them.)
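The pattern at issue can be sketched as follows. This is an illustrative example, not from the survey itself: `struct S` and `padding_still_zero` are hypothetical names, and on most ABIs there is padding between `c` and `i`. Whether the function returns 1 is exactly the open question; under semantics (b)–(d) below, even its return value may vary.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical struct with (on most ABIs) padding between c and i. */
struct S { char c; int i; };

/* Zero every byte, then write the members; return 1 if all bytes that
   are not part of a named member still read as zero afterwards. */
int padding_still_zero(void) {
    struct S s;
    memset(&s, 0, sizeof s);          /* zeroes the padding too */
    s.c = 'x';
    s.i = 42;                         /* member writes may or may not touch padding */
    const unsigned char *p = (const unsigned char *)&s;
    for (size_t k = 0; k < sizeof s; k++) {
        int in_c = (k == offsetof(struct S, c));
        int in_i = (k >= offsetof(struct S, i) &&
                    k <  offsetof(struct S, i) + sizeof s.i);
        if (!in_c && !in_i && p[k] != 0)
            return 0;                 /* a padding byte changed */
    }
    return 1;
}
```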

Will that work in normal C compilers?

yes : 116 (36%)
only sometimes : 95 (29%)
no : 21 ( 6%)
don't know : 82 (25%)
I don't know what the question is asking : 3 ( 1%)
no response : 6

Do you know of real code that relies on it?

yes : 46 (14%)
yes, but it shouldn't : 31 ( 9%)
no, but there might well be : 158 (49%)
no, that would be crazy : 58 (18%)
don't know : 25 ( 7%)
no response : 5

If it won't always work, is that because [check all that apply]:

you've observed compilers write junk into padding bytes : 31
you think compilers will assume that padding bytes contain unspecified values and optimise away those reads : 120
no response : 150
other : 80

It remains unclear what behaviour compilers currently provide (or should provide) for this. On one side, arguing for a relatively tight semantics:

On the other side, looking at the optimisations that compilers actually do (which may force a relatively loose semantics):

The above suggests four possible semantics, listed with the strongest first (in order of decreasing predictability for the programmer and increasing looseness, and hence increasing permissiveness, for optimisers):

a) Structure copies might copy padding, but structure member writes never touch padding.

b) Structure member writes might write zeros over subsequent padding.

c) Structure member writes might write arbitrary values over subsequent padding, with reads seeing stable results.

d) Padding bytes are regarded as always holding unspecified values, irrespective of any byte writes to them, and so reads of them might return arbitrary and unstable values. (But note that for structs stored to malloc'd regions, this is at odds with the idea that malloc'd regions can be reused, so perhaps we could only really have this semantics for the other storage-duration kinds.)

For each compiler (GCC, Clang, MSVC, ICC, ...), the question is which of these it provides on mainstream platforms. If the answer is not (a), to what extent is it feasible to provide compiler flags that force the behaviour to be stronger?

[2/15] Uninitialised values

Is reading an uninitialised variable or struct member (with a current mainstream compiler):

(This might either be due to a bug or be intentional, e.g. when copying a partially initialised struct, or to output, hash, or set some bits of a value that may have been partially initialised.)
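The copying use case mentioned above can be sketched like this (a hypothetical example, assuming only that reading the *initialised* member of the copy is supported, as the discussion below suggests it must be; the names are ours):

```c
/* A struct returned with one member deliberately left uninitialised. */
struct pair { int x; int y; };

struct pair make_pair(void) {
    struct pair p;
    p.x = 1;          /* p.y is deliberately left uninitialised */
    return p;         /* copying out a partially initialised struct */
}

int copied_x(void) {
    struct pair q = make_pair();  /* struct copy, indeterminate y included */
    return q.x;                   /* reading the initialised member */
}
```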

a) undefined behaviour (meaning that the compiler is free to arbitrarily miscompile the program, with or without a warning) : 139 (43%)

b) ( * ) going to make the result of any expression involving that value unpredictable : 42 (13%)

c) ( * ) going to give an arbitrary and unstable value (maybe with a different value if you read again) : 21 ( 6%)

d) ( * ) going to give an arbitrary but stable value (with the same value if you read again) : 112 (35%)

e) don't know : 3 ( 0%)

f) I don't know what the question is asking : 2 ( 0%)

g) no response : 4

If you clicked any of the starred options, do you know of real code that relies on it (as opposed to the looser options above the one you clicked)?

yes : 27 (11%)
yes, but it shouldn't : 52 (22%)
no, but there might well be : 63 (27%)
no, that would be crazy : 80 (34%)
don't know : 10 ( 4%)
no response : 91

Here also it remains unclear what compilers currently provide and what they should provide. The survey responses are dominated by the "undefined behaviour" and "arbitrary but stable" options.

It's not clear whether people are actually depending on the latter, beyond the case of copying a partially initialised struct, which it seems must be supported, and comparing against a partially initialised struct, which it seems is done sometimes. Many respondents mention historical uses to attempt to get entropy, but that seems now widely regarded as a mistake. There is a legitimate general argument that the more determinacy that can be provided the better, for debugging.

But it seems clear that GCC, Clang, and MSVC do not at present exploit the licence the ISO standard gives (in defining this to be undefined behaviour) to arbitrarily miscompile code. Clang seems to be the most aggressive, propagating undef in many cases, though one respondent said "LLVM is moving towards treating this as UB in the cases where the standards allow it to do so (Richard Smith)". But there are special cases where LLVM is a bit stronger (cf the undef docs); it's unclear why they think those are useful. For GCC, Joseph Myers said

"Going to give arbitrary, unstable values (that is, the variable assigned from the uninitialised variable itself acts as uninitialised and having no consistent value). (Quite possibly subsequent transformations will have the effect of undefined behavior.) Inconsistency of observed values is an inevitable consequence of transformations PHI (undefined, X) -> X (useful in practice for programs that don't actually use uninitialised variables, but where the compiler can't see that)."

For MSVC, one respondent said:

"I am aware of a significant divergence between the LLVM community and MSVC here; in general LLVM uses "undefined behaviour" to mean "we can miscompile the program and get better benchmarks", whereas MSVC regards "undefined behaviour" as "we might have a security vulnerability so this is a compile error / build break". First, there is reading an uninitialized variable (i.e. something which does not necessarily have a memory location); that should always be a compile error. Period. Second, there is reading a partially initialised struct (i.e. reading some memory whose contents are only partly defined). That should give a compile error/warning or static analysis warning if detectable. If not detectable it should give the actual contents of the memory (be stable). I am strongly with the MSVC folks on this one - if the compiler can tell at compile time that anything is undefined then it should error out. Security problems are a real problem for the whole industry and should not be included deliberately by compilers."

For each compiler we ask which of these four semantics it provides (weakest first, as in the question):

a) undefined behaviour (meaning that the compiler is free to arbitrarily miscompile the program, with or without a warning).

b) the result of any expression involving that value unpredictable.

c) an arbitrary and unstable value (maybe with a different value if you read again).

d) an arbitrary but stable value (with the same value if you read again).

It looks as if several compiler writers are saying (b), while a significant number of programmers are relying on (d) (which may also be what MSVC supports).

[3/15] Can one use pointer arithmetic between separately allocated C objects?

If you calculate an offset between two separately allocated C memory objects (e.g. malloc'd regions or global or local variables) by pointer subtraction, can you make a usable pointer to the second by adding the offset to the address of the first?
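A minimal sketch of the idiom (our hypothetical example, not from the survey). ISO C makes the subtraction undefined for pointers into different objects; the question is what happens in practice. Note the `volatile`-qualified pointer objects: they hide the pointers' provenance from the compiler's points-to analysis, which is one reason such code tends to survive optimisation in practice.

```c
#include <stddef.h>
#include <stdlib.h>

int offset_roundtrip(void) {
    char *volatile ap = malloc(16);
    char *volatile bp = malloc(16);
    char *a = ap, *b = bp;      /* volatile reads hide provenance */
    if (!a || !b) { free(a); free(b); return -1; }
    ptrdiff_t off = b - a;      /* UB per ISO: different allocations */
    char *b2 = a + off;         /* numerically recreates b's address */
    *b2 = 'x';                  /* usable in practice on flat address spaces */
    int ok = (*b == 'x');
    free(a);
    free(b);
    return ok;
}
```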

Will that work in normal C compilers?

a) yes : 154 (48%)

b) only sometimes : 83 (26%)

c) no : 42 (13%)

d) don't know : 36 (11%)

e) I don't know what the question is asking : 3 ( 0%)

f) no response : 5

Do you know of real code that relies on it?

yes : 61 (19%)
yes, but it shouldn't : 53 (16%)
no, but there might well be : 99 (31%)
no, that would be crazy : 73 (23%)
don't know : 27 ( 8%)
no response : 10

If it won't always work, is that because [check all that apply]:

you know compilers that optimise based on the assumption that that is undefined behaviour : 51
no response : 228
other : 51

Most respondents expect this to work, and a significant number know of real code that relies on it. For example:

Historically, the main reason to disallow it seems to have been segmented architectures, especially 8086. There are still some embedded architectures with distinct address spaces, and people mention the following, but it's not clear that "mainstream" C should be concerned with this, and those cases could be identified as a language dialect or implementation-defined choice.

Semantically, it's straightforward to identify language dialects in which this is or is not allowed.

Then there is the possibility of exotic implementations in which pointers are represented by hash-map entries (Nick Lewycky), but again that seems outwith "mainstream" C.

On the other hand, current compilers sometimes do optimise based on an assumption (in a points-to analysis) that this doesn't occur (c.f. comments from Joseph Myers and Dan Gohman). How could these be reconciled?

[4/15] Is pointer equality sensitive to their original allocation sites?

For two pointers derived from the addresses of two separate allocations, will equality testing (with ==) of them just compare their runtime values, or might it take their original allocations into account and assume that they do not alias, even if they happen to have the same runtime value? (for current mainstream compilers)
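The contentious case can be sketched as follows (a hypothetical example; the function name is ours). A one-past-the-end pointer to `x` may happen to have the same runtime value as a pointer to `y`, and the comparison result may then depend on whether the compiler takes provenance into account, which is the point of option (c) below.

```c
int compare_one_past(void) {
    int x, y;
    int *p = &x + 1;   /* a valid one-past-the-end pointer */
    int *q = &y;
    /* May be 0 or 1: it depends on stack layout and on whether the
       compiler assumes pointers to different allocations don't alias. */
    return p == q;
}
```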

a) it will just compare the runtime values : 141 (44%)

b) pointers will compare nonequal if formed from pointers to different allocations : 20 ( 6%)

c) either of the above is possible : 101 (31%)

d) don't know : 40 (12%)

e) I don't know what the question is asking : 16 ( 5%)

f) no response : 5

If you clicked either of the first two answers, do you know of real code that relies on it?

yes : 60 (26%)
yes, but it shouldn't : 16 ( 7%)
no, but there might well be : 68 (29%)
no, that would be crazy : 46 (20%)
don't know : 37 (16%)
no response : 96

The responses are roughly bimodal: many believe "it will just compare the runtime values", while a similar number believe that the comparison might take the allocation provenance into account. Of the former, 41 "know of real code that relies on it".

In practice we see that GCC does sometimes take allocation provenance into account, with the result of a comparison (in an n+1 case, comparing &p+1 and &q) sometimes varying depending on whether the compiler can see the provenance, e.g. on whether it's done in the same compilation unit as the allocation. We don't see any reason to forbid that, especially as this n+1 case seems unlikely to arise in practice, though it does complicate the semantics, effectively requiring a nondeterministic choice at each comparison of whether to take provenance into account. But for comparisons between pointers formed by more radical pointer arithmetic from pointers originally from different allocations, as in [3/15], it's not so clear.

Conclusion:

The best "mainstream C" semantics here seems to be to make a nondeterministic choice at each comparison of whether to take provenance into account or just compare the runtime pointer value, option (c). In the vast majority of cases the two will coincide.

[5/15] Can pointer values be copied indirectly?

Can you make a usable copy of a pointer by copying its representation bytes with code that indirectly computes the identity function on them, e.g. writing the pointer value to a file and then reading it back, and using compression or encryption on the way?
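A small sketch of the data-flow case (our hypothetical example): the pointer's representation bytes are "encrypted" and "decrypted" with XOR, so the computed copy is connected to the original by an indirect but total data-flow path.

```c
#include <stddef.h>
#include <string.h>

int indirect_copy_works(void) {
    int x = 7;
    int *p = &x;
    unsigned char buf[sizeof p];
    memcpy(buf, &p, sizeof p);                              /* representation bytes */
    for (size_t i = 0; i < sizeof p; i++) buf[i] ^= 0x5a;   /* "encrypt" */
    for (size_t i = 0; i < sizeof p; i++) buf[i] ^= 0x5a;   /* "decrypt" */
    int *q;
    memcpy(&q, buf, sizeof q);   /* indirectly computed copy of p */
    return *q;                   /* in practice q is usable and points to x */
}
```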

Will that work in normal C compilers?

a) yes : 216 (68%)

b) only sometimes : 50 (15%)

c) no : 18 ( 5%)

d) don't know : 24 ( 7%)

e) I don't know what the question is asking : 9 ( 2%)

f) no response : 6

Do you know of real code that relies on it?

yes : 101 (33%)
yes, but it shouldn't : 24 ( 7%)
no, but there might well be : 100 (33%)
no, that would be crazy : 54 (17%)
don't know : 23 ( 7%)
no response : 21

The responses are overwhelmingly positive, with many specific use cases in the comments, e.g.:

The responses about current compiler behaviour are clear that in simple cases, with direct data-flow from original to computed pointer, both GCC and Clang support this. But for computation via control-flow, it's not so clear:

Conclusion:

It looks as if a reasonable "mainstream C" semantics should allow indirect pointer copying at least whenever there's a data-flow provenance path, though perhaps not when there's only a control-flow provenance path. It should allow pointers to be marshalled out and read back in, and the simplest way of doing that is to allow any pointer value to be read in, with the compiler making no aliasing/provenance assumptions about it, and with the semantics checking that the numeric pointer value points to a suitable live object only when and if it is dereferenced.

[6/15] Pointer comparison at different types

Can one do == comparison between pointers to objects of different types (e.g. pointers to int, float, and different struct types)?
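With both sides cast to void* (the case the discussion below says should clearly be allowed), the comparison can be sketched like this (a hypothetical example of ours; two distinct live objects are guaranteed distinct addresses):

```c
int distinct_objects_nonequal(void) {
    int i = 0;
    float f = 0.0f;
    /* Casting both sides to void * sidesteps the type mismatch;
       two distinct live objects must compare nonequal. */
    return (void *)&i != (void *)&f;
}
```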

Will that work in normal C compilers?

a) yes : 175 (55%)

b) only sometimes : 67 (21%)

c) no : 44 (13%)

d) don't know : 29 ( 9%)

e) I don't know what the question is asking : 2 ( 0%)

f) no response : 6

Do you know of real code that relies on it?

yes : 111 (35%)
yes, but it shouldn't : 47 (15%)
no, but there might well be : 107 (34%)
no, that would be crazy : 27 ( 8%)
don't know : 17 ( 5%)
no response : 14

The question should have been clearer about whether the pointers are first cast to void* or char*. With those casts, the responses seem clear that it should be allowed, modulo now-unusual architectures with segmented memory or where the pointer representations are different.

Then there's a question, which we would hope applies only in the case without those casts, about whether -fstrict-aliasing will treat comparisons with type mismatches as nonequal, e.g.

[7/15] Pointer comparison across different allocations

Can one do < comparison between pointers to separately allocated objects?
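The lock-ordering use mentioned below can be sketched as follows (a hypothetical example; the names are ours). Relational comparison of pointers into different objects is undefined per ISO; casting to uintptr_t first gives a comparison that is at least well-defined, with the same result on flat-address-space platforms.

```c
#include <stdint.h>

/* Acquire two locks in a globally consistent address order to avoid
   deadlock, as widely done in practice. */
void lock_in_order(void *a, void *b, void (*lock)(void *)) {
    if ((uintptr_t)a < (uintptr_t)b) { lock(a); lock(b); }
    else                             { lock(b); lock(a); }
}

/* Distinct live objects have distinct addresses, so exactly one of the
   two orderings holds. */
int order_consistent(void) {
    int x, y;
    uintptr_t a = (uintptr_t)&x, b = (uintptr_t)&y;
    return (a < b) != (b < a);
}
```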

Will that work in normal C compilers?

a) yes : 191 (60%)

b) only sometimes : 52 (16%)

c) no : 31 ( 9%)

d) don't know : 38 (12%)

e) I don't know what the question is asking : 3 ( 0%)

f) no response : 8

Do you know of real code that relies on it?

yes : 101 (33%)
yes, but it shouldn't : 37 (12%)
no, but there might well be : 89 (29%)
no, that would be crazy : 50 (16%)
don't know : 27 ( 8%)
no response : 19

This seems to be widely used for lock ordering and collections.

As for Q3, there's a potential issue for segmented memory systems (where the implementation might only compare the offset) which seems not to be relevant for current "mainstream" C.

Apart from that, there doesn't seem to be any reason from compiler implementation to forbid it.

[8/15] Pointer values after lifetime end

Can you inspect (e.g. by comparing with ==) the value of a pointer to an object after the object itself has been free'd or its scope has ended?
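One classic use case is the realloc fix-up idiom, sketched here with hypothetical names of ours: after realloc, code compares the old pointer value with the new one to decide whether interior pointers need rebasing. Per ISO the old value is indeterminate once freed; in practice the comparison just compares the stored bits.

```c
#include <stdlib.h>

int realloc_moved(void) {
    char *old = malloc(8);
    if (!old) return -1;
    char *fresh = realloc(old, 8);
    if (!fresh) { free(old); return -1; }
    int moved = (fresh != old);   /* inspects a freed pointer value if it moved */
    free(fresh);
    return moved;                 /* 0 or 1, depending on the allocator */
}
```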

Will that work in normal C compilers?

a) yes : 209 (66%)

b) only sometimes : 52 (16%)

c) no : 30 ( 9%)

d) don't know : 23 ( 7%)

e) I don't know what the question is asking : 1 ( 0%)

f) no response : 8

Do you know of real code that relies on it?

yes : 43 (14%)
yes, but it shouldn't : 55 (18%)
no, but there might well be : 102 (33%)
no, that would be crazy : 86 (28%)
don't know : 18 ( 5%)
no response : 19

The responses mostly say that this will work (the ISO standard notwithstanding), and include various use cases:

There are debugging environments that will warn of it, however, e.g.:

And for GCC, one respondent writes:

but doesn't say what might go wrong.

Can we either establish that current mainstream compilers will support this or identify more specifically where and how they will fail to do so?

[9/15] Pointer arithmetic

Can you (transiently) construct an out-of-bounds pointer value (e.g. before the beginning of an array, or more than one-past its end) by pointer arithmetic, so long as later arithmetic makes it in-bounds before it is used to access memory?
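A minimal sketch (our hypothetical example). Merely computing `a - 1` is undefined per ISO, even though it is never dereferenced and later arithmetic brings the pointer back in bounds:

```c
int transient_oob(void) {
    int a[4] = { 10, 20, 30, 40 };
    int *p = a - 1;   /* out of bounds: UB per ISO, computed anyway */
    p = p + 2;        /* back in bounds: now points at a[1] */
    return *p;        /* in practice reads a[1] */
}
```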

Will that work in normal C compilers?

a) yes : 230 (73%)

b) only sometimes : 43 (13%)

c) no : 13 ( 4%)

d) don't know : 27 ( 8%)

e) I don't know what the question is asking : 2 ( 0%)

f) no response : 8

Do you know of real code that relies on it?

yes : 101 (33%)
yes, but it shouldn't : 50 (16%)
no, but there might well be : 123 (40%)
no, that would be crazy : 18 ( 5%)
don't know : 14 ( 4%)
no response : 17

It seems clear that this is often assumed to work, e.g.:

though we also see:

But on the other hand, compilers may sometimes assume otherwise:

Here the prevalence of transiently out-of-bounds pointer values in real code suggests it's worth seriously asking the cost of disabling whatever compiler optimisation is done based on this, to provide a simple predictable semantics.

[10/15] Pointer casts

Given two structure types that have the same initial members, can you use a pointer of one type to access the initial members of a value of the other?
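The "base struct" idiom at issue can be sketched like this (a hypothetical example of ours, in the style of event hierarchies in windowing and kernel code):

```c
/* Two structs sharing the same initial member. */
struct event     { int type; };
struct key_event { int type; int keycode; };

int read_common_prefix(void) {
    struct key_event k = { 1, 42 };
    struct event *e = (struct event *)&k;  /* view via the "base" type */
    return e->type;                        /* reads k.type through the other type */
}
```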

Will that work in normal C compilers?

a) yes : 219 (69%)

b) only sometimes : 54 (17%)

c) no : 17 ( 5%)

d) don't know : 22 ( 6%)

e) I don't know what the question is asking : 4 ( 1%)

f) no response : 7

Do you know of real code that relies on it?

yes : 157 (50%)
yes, but it shouldn't : 54 (17%)
no, but there might well be : 59 (19%)
no, that would be crazy : 22 ( 7%)
don't know : 18 ( 5%)
no response : 13

It's clear that this is used very commonly:

On the other hand, with strict aliasing:

and w.r.t. GCC:

though the latter doesn't say why, or whether that's specific to strict-aliasing.

At least with -fno-strict-aliasing, it seems this should be guaranteed to work.

[11/15] Using unsigned char arrays

Can an unsigned character array be used (in the same way as a malloc'd region) to hold values of other types?
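The idiom can be sketched as follows (a hypothetical example of ours, using C11 alignas to handle the alignment concern mentioned below). Per ISO's effective-type rules this is disallowed for a declared character array, unlike for a malloc'd region:

```c
#include <stdalign.h>

/* A statically allocated character array used as backing store for a
   double, with alignment handled explicitly. */
static alignas(double) unsigned char storage[sizeof(double)];

double store_and_load(void) {
    double *d = (double *)storage;  /* effective-type violation per ISO */
    *d = 1.5;                       /* 1.5 is exactly representable */
    return *d;
}
```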

Will that work in normal C compilers?

a) yes : 243 (76%)

b) only sometimes : 49 (15%)

c) no : 7 ( 2%)

d) don't know : 15 ( 4%)

e) I don't know what the question is asking : 2 ( 0%)

f) no response : 7

Do you know of real code that relies on it?

yes : 201 (65%)
yes, but it shouldn't : 30 ( 9%)
no, but there might well be : 55 (17%)
no, that would be crazy : 6 ( 1%)
don't know : 16 ( 5%)
no response : 15

Here again it's clear that it's very often relied on for statically allocated (non-malloc'd) character arrays, and it should work, with due care about alignment. For example:

But the ISO standard disallows it, and we also see:

though the latter doesn't say why. It is a violation of the strict-aliasing text of the ISO standard.

With -fno-strict-aliasing it seems clear that it should be allowed.

[12/15] Null pointers from non-constant expressions

Can you make a null pointer by casting from an expression that isn't a constant but that evaluates to 0?
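A sketch of the case in question (our hypothetical example). Because the operand is not a constant expression, it is not a "null pointer constant" per ISO, and the conversion is only implementation-defined; on mainstream platforms it nonetheless yields a null pointer.

```c
#include <stdint.h>

int zero(void) { return 0; }   /* not a constant expression */

int gives_null(void) {
    int *p = (int *)(intptr_t)zero();  /* non-constant 0 cast to pointer */
    return p == (int *)0;              /* 1 on mainstream platforms */
}
```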

Will that work in normal C compilers?

a) yes : 178 (56%)

b) only sometimes : 38 (12%)

c) no : 22 ( 6%)

d) don't know : 67 (21%)

e) I don't know what the question is asking : 11 ( 3%)

f) no response : 7

Do you know of real code that relies on it?

yes : 56 (18%)
yes, but it shouldn't : 21 ( 6%)
no, but there might well be : 113 (37%)
no, that would be crazy : 63 (20%)
don't know : 50 (16%)
no response : 20

This is very often assumed to work. The only exception seems to be some (unidentified) embedded systems.

WRT GCC:

A "mainstream C" semantics should permit it.

[13/15] Null pointer representations

Can null pointers be assumed to be represented with 0?
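The assumption is typically relied on as in this sketch (a hypothetical example of ours): zeroing a pointer's representation bytes, as calloc or a bulk memset does for structs containing pointers, and expecting the result to compare equal to NULL.

```c
#include <string.h>

int zero_bytes_are_null(void) {
    int *p;
    memset(&p, 0, sizeof p);  /* all representation bytes zero */
    return p == 0;            /* 1 on mainstream platforms */
}
```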

Will that work in normal C compilers?

a) yes : 201 (63%)

b) only sometimes : 50 (15%)

c) no : 54 (17%)

d) don't know : 7 ( 2%)

e) I don't know what the question is asking : 4 ( 1%)

f) no response : 7

Do you know of real code that relies on it?

yes : 187 (60%)
yes, but it shouldn't : 61 (19%)
no, but there might well be : 42 (13%)
no, that would be crazy : 7 ( 2%)
don't know : 12 ( 3%)
no response : 14

Basically an unequivocal "yes" for mainstream systems. For example:

Again a potential exception for segmented memory, but not relevant for "mainstream" current practice:

[14/15] Overlarge representation reads

Can one read the byte representation of a struct as aligned words without regard for the fact that its extent might not include all of the last word?
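The word-at-a-time idiom can be sketched as follows. This is our hypothetical example, deliberately kept well-defined by rounding the allocation itself up to a whole number of words (so the final read overhangs the useful extent but never the object), with calloc zeroing the tail so overhang bytes contribute nothing to the sum. Real code, such as word-wise strlen, instead reads past the object's extent and relies on page-alignment arguments.

```c
#include <stdlib.h>
#include <string.h>

/* Sum the bytes of an n-byte buffer by whole-word reads. */
unsigned sum_bytes_by_words(const unsigned char *src, size_t n) {
    size_t w = sizeof(unsigned long);
    size_t nwords = (n + w - 1) / w;           /* round up to whole words */
    unsigned long *buf = calloc(nwords, w);    /* zeroed tail beyond n */
    if (!buf) return 0;
    memcpy(buf, src, n);
    unsigned total = 0;
    for (size_t i = 0; i < nwords; i++)
        for (size_t b = 0; b < w; b++)         /* sum the word's base-256 digits;   */
            total += (unsigned)((buf[i] >> (8 * b)) & 0xff); /* endian-independent */
    free(buf);
    return total;
}
```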

Will that work in normal C compilers?

a) yes : 107 (33%)

b) only sometimes : 81 (25%)

c) no : 44 (13%)

d) don't know : 47 (14%)

e) I don't know what the question is asking : 36 (10%)

f) no response : 8

Do you know of real code that relies on it?

yes : 40 (13%)
yes, but it shouldn't : 39 (13%)
no, but there might well be : 103 (35%)
no, that would be crazy : 42 (14%)
don't know : 67 (23%)
no response : 32

This is sometimes used in practice and believed to work, modulo alignment, page-boundary alignment, and valgrind/MSan/etc.

A "mainstream C" semantics could either forbid this entirely (slightly limiting the scope of the semantics) or could allow it, for sufficiently aligned cases, if some switch is set.

[15/15] Union type punning

When is type punning - writing one union member and then reading it as a different member, thereby reinterpreting its representation bytes - guaranteed to work (without confusing the compiler analysis and optimisation passes)?
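The canonical form, sketched here with hypothetical names of ours, accesses both members through lvalues that manifestly involve the union type. The expected bit pattern assumes float is IEEE 754 binary32, as on mainstream platforms:

```c
#include <stdint.h>

union pun { float f; uint32_t u; };

uint32_t bits_of(float x) {
    union pun p;
    p.f = x;       /* write one member ...            */
    return p.u;    /* ... read another: type punning  */
}
```

For IEEE 754 binary32, bits_of(1.0f) is 0x3f800000.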

There's widespread doubt, disagreement and confusion here, e.g.:

Here the minimal thing it seems one should support is, broadly following the GCC documentation, type punning via a union whose definition is in scope and which is accessed via l-values that manifestly involve the union type.

Or, in the -fno-strict-aliasing case, one could allow it everywhere. For "mainstream C", it's not yet clear.

POSTAMBLE RESPONSES

Other differences

If you know of other areas where the C used in practice differs from that of the ISO standard, or where compiler optimisation limits the behaviour you can rely on, please list them.

There were many comments here which are hard to summarise. Many mention integer overflow and the behaviour of shift operators.