==========================================================
Related:
N2223: Clarifying the C Memory Object Model: Introduction to N2219 - N2222, N2221, N2220, N2089, Section 2 of N2012, Sections 3.1 and 3.2 (Q48-59) of N2013, DR338, DR451, N1793, N1818.
This is a revision of N2220 and N2221. The latter was a revision of N2089. N2089 was based on N2012 (Section 2), adding a concrete Technical Corrigendum proposal for discussion, revising the text, and adding concrete examples from N2013.
Our previous proposal basically fleshed out a version of "Option (b)" below, making the result of any uninitialised read a symbolic unspecified value, with that being propagated through most operations, rather than undefined behaviour. Now:
The case of incremental bit initialisation raises doubts about whether that would be acceptable for the existing codebase.
We think the current ISO C notion of undefined behaviour is an unduly heavy hammer for this, and something more nuanced is needed to properly capture a reasonable intent.
Accordingly, we now first summarise the issues from each of the three main perspectives: existing code, implementations, and the standard. We then describe the "Option (b)" proposal, but beware that it does not handle all the issues and will surely need adaptation.
Reading uninitialised values has many conceivable semantics in C, and there are conflicting demands and expectations. Summarising the main aspects of this in current practice:
In many cases, reading an uninitialised value is a programmer error, and it is/would be desirable for implementations to report this promptly, at compile-time or run-time, wherever they reasonably can.
There are some significant exceptions, where reading uninitialised values is either desirable or is endemic in practice:
It should be possible to copy a partially initialised struct, either explicitly by a struct assignment, or implicitly for function-call arguments, e.g. if the struct is incrementally initialised in multiple functions.
There's a real use-case of debug printing partially initialised structs. Making that UB could be very confusing if it gets exploited by compilers.
It seems to be fairly common for sets of flags (stored as bits in integer-typed variables) to be initialised incrementally, e.g. by reading a possibly-uninitialised value, doing some arithmetic or bitwise-logical operations on it, and storing the result back. Kostya Serebryany reports that this has to be allowed in their sanitisers, as there are too many instances to require them to be fixed.
That said, we guess (without evidence) that not much memory is used in this way, and hence that the runtime/code-size cost of requiring zero-initialisation would be small. Although that might reduce error-detection opportunities, e.g. if this is allowed by special-casing some set-bit operations.
Some polymorphic bytewise operations on structs should be supported e.g. serialisation, encryption, and (library or user) implementations of memcpy. These necessarily will read and write any padding bytes in the struct layout. Such accesses might or might not be treated in the same way as other uninitialised variables.
For struct copies and for those polymorphic bytewise operations, we think there should be some way (either always on by default, or enabled by some compiler option or annotation) for the user to ensure they have results that are (a) deterministic and (b) that cannot leak potentially-security-relevant information.
There's a conceivable case of reads just beyond the end of an allocation, e.g. reading a char[3] array at a four-byte integer type. In practice this will almost always be ok, but we guess it will be very rare, and Kostya reports that it was feasible to get address sanitiser to flag this as an error. We don't think this needs to be supported.
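As an illustration only: one way a programmer can already obtain deterministic, non-leaking bytes for serialisation, without relying on any new guarantee, is to encode member by member into a pre-zeroed buffer instead of copying the object representation. This is a minimal sketch; the struct `msg`, the function `serialise_msg` and the wire layout are hypothetical names chosen for the example.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical record; on most ABIs there is padding after `tag`. */
struct msg {
    char tag;
    uint32_t len;
};

enum { MSG_WIRE_SIZE = 1 + sizeof(uint32_t) };

/* Writes a padding-free encoding of *m into out[MSG_WIRE_SIZE].
   Only named members are read, so neither padding bytes nor other
   uninitialised storage can reach the output. */
void serialise_msg(const struct msg *m, unsigned char *out)
{
    memset(out, 0, MSG_WIRE_SIZE);            /* every output byte is defined */
    out[0] = (unsigned char)m->tag;
    memcpy(out + 1, &m->len, sizeof m->len);  /* host byte order, for brevity */
}
```

Two objects with equal members then serialise to equal bytes regardless of what their padding holds. The cost is that the code is no longer polymorphic over struct types, which is exactly the limitation the bytewise use-cases above run into.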
Our 2015 survey of C experts gave basically bimodal results between options (a) and (d) below. We asked (survey question 2/15): Is reading an uninitialised variable or struct member (with a current mainstream compiler):
(a) 139 (43%) undefined behaviour (meaning that the compiler is free to arbitrarily miscompile the program, with or without a warning)
(b) 42 (13%) going to make the result of any expression involving that value unpredictable
(c) 21 (6%) going to give an arbitrary and unstable value (maybe with a different value if you read again)
(d) 112 (35%) going to give an arbitrary but stable value (with the same value if you read again)
A straw poll at EuroLLVM 2018 gave roughly the same distribution. The survey comments suggest that some (but perhaps not much) real code depends on one of the stronger semantics.
It seems impractical to always require initialisation of all variables and allocated regions, e.g. to zero values, though this might be desirable for many usages of C. We suggest this be provided with compiler options, separately for static/thread/allocated storage-duration variables and for allocated regions. Someone observed that non-null pointers would need extra care.
Compilers do optimise based on inferences about uninitialised reads, e.g. with the LLVM undef behaviour and SSA-based compilation. We see implementations produce unstable values for repeated reads of an uninitialised value. The LLVM treatment of undef and poison seems to be in flux, but some (not all) operations are deemed to give undef results if one or more of their arguments are undef. This is roughly (b) above. We're not aware of implementation behaviour that would require all uninitialised reads to be deemed UB.
The current LLVM behaviour seems to be (from conversation with Nuno Lopes and others) roughly option (b) from the survey, but with a per-bit notion of uninitialised value (their undef), and with UB on control-flow choices on undef, and slightly different strictness and daemonicity. They seem to have a consensus to replace undef by poison, which is currently per-value not per-bit and currently mostly occurs only for booleans, e.g. from comparisons with signed-arithmetic overflows. They may be moving to a per-bit poison.
Apparently VS and ICC also implement some per-bit notion of uninitialisation status tracking, e.g. via bitwise undef. ICC ensures consistent reads from uninitialised loads.
For some types there can exist representation values that do not correspond to any abstract value, for which implementations might assume that, after initialisation, such representation values are (either) not read, or not operated on.
For several current common implementations, for most integer types there are no unused representation values - this follows from the sizeof, min, and max values. The only exception we are aware of is _Bool. Some current implementations do assume that no non-{0,1} values are read - e.g. one can see executions in which, after a read of an uninitialised _Bool, or of one in which a non-{0,1} value has been explicitly written, conditionals on that value and on its negation might both fire. More exotic implementations clearly could have unused representation values for other integer types.
For floating-point types, there is the case of Signalling NaNs. Our understanding (but we are not floating-point experts) is that some implementations can be switched to cause these to raise a floating-point exception when used (not when read). LLVM currently can't do that but there is a move to add such a switch.
For pointer types, there is the possibility of pointers in segmented architectures in which merely reading a pointer value does some dynamic check. Derek Jones reported that the original 68000 did this. It would be interesting to know whether there are current implementations that do this.
There has been much discussion over the years of the Itanium NaT flag, but that is not a memory-value-representable entity. Our impression (from email with Hans Boehm) is that a sound treatment of Itanium NaT, in the presence of function inlining, might need all reads of uninitialised values (perhaps except padding) to nondeterministically give either a symbolic unspecified value or trap.
The last Itanium processor was released in 2017, and it's suggested that HPE will support it only until 2025, so the significance of Itanium for C2x is debatable.
For padding bytes, implementations may write over them (when writing a struct member or array element) out of necessity, e.g. if they don't have machine instructions of the correct size, or for performance, e.g. if a wider machine instruction is faster, or if they can combine multiple member/element writes into one. SRA optimisations (scalar replacement of aggregates) may introduce this.
In general the values written might be arbitrary. We don't know what compilers currently do. It would be desirable if we could - perhaps optionally - restrict them to either (a) always use zero for padding, or (b) nondeterministically use either zero or padding from a source struct, e.g. for a copy. Either of those would let the programmer prevent information leakage by initialising the padding once.
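To make "initialising the padding" concrete, the sketch below locates the padding bytes of one struct with `offsetof` and clears exactly those bytes. It is an illustration only: `struct sample`, `mark_padding` and `clear_padding` are hypothetical names, the amount of padding is ABI-dependent, and under the current C11 text a later member store may again leave the padding with unspecified values, so the clearing is only durable under a restriction such as (a) or (b) above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical struct; the amount of padding is ABI-dependent. */
struct sample {
    char tag;
    int32_t n;
    char end;
};

/* Sets map[i] to 1 if byte i of struct sample is padding, else 0. */
void mark_padding(unsigned char map[sizeof(struct sample)])
{
    memset(map, 1, sizeof(struct sample));  /* start by assuming all padding */
    memset(map + offsetof(struct sample, tag), 0, sizeof(char));
    memset(map + offsetof(struct sample, n),   0, sizeof(int32_t));
    memset(map + offsetof(struct sample, end), 0, sizeof(char));
}

/* Zeroes exactly the padding bytes of *s, leaving members untouched. */
void clear_padding(struct sample *s)
{
    unsigned char map[sizeof(struct sample)];
    unsigned char *bytes = (unsigned char *)s;
    mark_padding(map);
    for (size_t i = 0; i < sizeof *s; i++)
        if (map[i])
            bytes[i] = 0;
}
```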
The C11 standard text seems to be confused on these points.
ISO C11 (following C99) defines the concept of trap representations, as particular object representations that do not represent values of the object type, for which merely reading a trap representation (except by an lvalue of character type), is undefined behaviour. See 3.19.4, 6.2.6.1p5, 6.2.6.2p2, DR338. An "indeterminate value" is either a trap representation or an unspecified value.
The standard text is quite clear that trap representations are "certain object representations", in which case, for implementations that do not have any unused object representations at a particular type, the above does not apply for reads at that type. But WG14 mailing-list discussion (Martin Sebor, Larry Jones) suggests the previous committee intent was to make reads of uninitialised variables at non-struct/union/character type be undefined behaviour irrespective of that - that would essentially be introducing ghost state recording the initialisation status of values.
Then ISO C11 also makes uninitialised reads undefined behaviour if the last sentence of 6.3.2.1p2 applies, irrespective of trap representations (cf. the DR338 CR): "If the lvalue designates an object of automatic storage duration that could have been declared with the register storage class (never had its address taken), and that object is uninitialized (not declared with an initializer and no assignment to it has been performed prior to use), the behavior is undefined.". This "never had its address taken" wording seems to have been an attempt to model the Itanium NaT ("not a thing") behaviour, but it's not clear whether it's sufficient for that (cf. mail with Hans Boehm, who IIRC pointed out that inlining and optimisation might keep values in registers, potentially with a NaT flag, even if in the source its address is taken).
The standard doesn't currently require the trap representations (for each type) to be implementation-defined.
TODO
It seems clear that in most cases reads of non-struct/union uninitialised values are programmer errors, and implementations should report this promptly, at compile-time or run-time, wherever they reasonably can.
But the current use of UB (undefined behaviour) in the standard text for such cases is too crude a mechanism for expressing this: it conflates programmer errors that should be compile-time detected where possible; cases where the behaviour of a conventional implementation can't be constrained, e.g. wild writes; and cases where optimisations want to assume something that is not always sound (in some implementations or for some code), but where we actually could bound the potential bad behaviour.
Ideally, we'd like to make it legal (and encouraged) for implementations to report uninitialised-read errors where they can, but prohibit optimisations from (for example) inferring from an uninitialised read that that code is unreachable, and can be removed.
We think that can be achieved in the standard by bounding the potential implementation behaviour of uninitialised reads, either:
using a symbolic-unspecified-value semantics as in N2221 (broadly akin to the "wobbly values" of the DR451 CR), or
with an option that requires implementations to actually read from memory, or
with an option for implicit zero-initialisation, of local variables and (as a further option) all allocations.
It seems fairly uncontroversial that this should be allowed. For uniformity and semantic simplicity, we'd also allow reads of fully uninitialised structs/unions.
Then what about the case of incrementally initialised flags? There are several possible choices, but none of them is a clear winner:
The abstract machine could track initialisation status on a per-bit level, e.g. allowing an uninitialised uint32_t to be read, OR'd with 0x00000001, and then regarded as having its low-order bit initialised to 1 and all the others still uninitialised. This could either:
build in particular facts about some arithmetic and bitwise-logical operators (which is unpleasantly ad hoc), or
incorporate the entire theory of those operators over bitvectors (which would be difficult to work with).
We could prohibit this idiom, requiring programmers to initialise. But there seem to be many instances of this in real code, and this would also prevent implementations giving more refined diagnostics when they can detect that a single bit read was not specifically initialised.
We could rely on implicit initialisation of all variables. But this seems (while desirable as an option) impractical for the C2x default. It likewise prevents that more refined error reporting.
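The first choice above (per-bit tracking with particular facts about some operators built in) can be sketched as a small model. This is an illustration under stated assumptions, not a proposed semantics: the type `tracked32` and the functions below are hypothetical, and only the rules for bitwise OR and AND are given.

```c
#include <stdint.h>

/* Hypothetical abstract-machine word: bit i of `init` is 1 iff bit i
   of `val` is initialised. Uninitialised bits of `val` are kept at 0. */
typedef struct {
    uint32_t val;
    uint32_t init;
} tracked32;

tracked32 tracked_uninit(void)      { return (tracked32){ 0u, 0u }; }
tracked32 tracked_const(uint32_t v) { return (tracked32){ v, UINT32_MAX }; }

/* OR: a result bit is known if both inputs are known, or if either
   input is a known 1 (which forces the result bit to 1). */
tracked32 tracked_or(tracked32 a, tracked32 b)
{
    uint32_t known1 = (a.val & a.init) | (b.val & b.init);
    uint32_t init = (a.init & b.init) | known1;
    return (tracked32){ (a.val | b.val) & init, init };
}

/* AND: a result bit is known if both inputs are known, or if either
   input is a known 0 (which forces the result bit to 0). */
tracked32 tracked_and(tracked32 a, tracked32 b)
{
    uint32_t known0 = (~a.val & a.init) | (~b.val & b.init);
    uint32_t init = (a.init & b.init) | known0;
    return (tracked32){ (a.val & b.val) & init, init };
}
```

With these rules, `tracked_or(tracked_uninit(), tracked_const(1))` has only its low-order bit initialised (to 1). The ad hoc character noted above shows up as soon as one asks for the corresponding rules for addition, shifts or multiplication.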
We have to somehow accommodate the non {0,1} _Bool and signalling NaN cases, but apart from that it would be nice to get rid of the concept of trap representations altogether, along with the concept of indeterminate value. Is that feasible?
As in N2220 option (a), if we have a symbolic-unspecified-value semantics, we think this could be done for _Bool simply by making operations on non {0,1} values have unspecified-value results.
Such values are converted (by the integer promotion rules) to other integer types before they are operated on, so the unspecified value can be introduced just at the conversion point (and then propagated as in our N2221 proposal by the operations).
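A minimal sketch of introducing the unspecified value at the conversion point, assuming a one-byte _Bool whose valid representations are 0 and 1 (an assumption about the implementation, not something the standard fixes). The type `int_value` and the function `promote_bool_rep` are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical abstract value: either a concrete int or the
   unspecified value token. */
typedef struct {
    bool unspecified;
    int  concrete;      /* meaningful only when !unspecified */
} int_value;

/* Models the integer promotion of a _Bool object whose stored
   representation byte is `rep`. */
int_value promote_bool_rep(unsigned char rep)
{
    if (rep > 1)
        return (int_value){ .unspecified = true, .concrete = 0 };
    return (int_value){ .unspecified = false, .concrete = rep };
}
```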
Is it necessary to make unchecked _Bool computed branch tables a sound implementation technique? If one were also making control-flow choices based on unspecified values be undefined behaviour (a separate semantic choice, Q50 of this and of N2221), that would do that.
Alternatively, at the very least we should make the set of trap representations an implementation-defined set, as in N2220 option (b).
In any case we suggest removing the 6.3.2.1p2 address-taken clause.
unsigned char
pointers)TODO
The rest of this note fleshes out the survey option (b) proposal in more detail - but this doesn't handle all the things above.
We do so by a modest change to the C abstract machine: for any scalar type, we extend the set of values of that type with a symbolic "unspecified value" token, then we can give rules defining how that is propagated, e.g. if one adds an unspecified value and a concrete integer. The unspecified value token does not have a bit-level representation. We detail a possible technical corrigendum for all this below, after the examples.
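The extension of a scalar type with the token, and a propagation rule for one operator, can be sketched as follows. This is only a model of the abstract machine, written in C for concreteness; the names `uval`, `uval_concrete`, `uval_unspecified` and `uval_add` are hypothetical, and unsigned arithmetic is used so that the concrete case cannot overflow.

```c
#include <stdbool.h>

/* Hypothetical abstract value of type unsigned int: a concrete value
   or the symbolic unspecified value token. */
typedef struct {
    bool     unspecified;
    unsigned concrete;   /* meaningful only when !unspecified */
} uval;

uval uval_concrete(unsigned v) { return (uval){ false, v }; }
uval uval_unspecified(void)    { return (uval){ true, 0u }; }

/* Addition is strict: the result is the token if either operand is. */
uval uval_add(uval a, uval b)
{
    if (a.unspecified || b.unspecified)
        return uval_unspecified();
    return uval_concrete(a.concrete + b.concrete);  /* unsigned: wraps */
}
```

Note that the token carries no bit pattern: `concrete` is simply ignored when `unspecified` is set.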
As always, the "as if" rule applies: the fact that the abstract machine manipulates an explicit "unspecified value" token doesn't mean that implementations have to. Normal implementations may at compile-time but will not at runtime: at runtime they will typically have some arbitrary bit pattern where the abstract machine has the unspecified value token, and the looseness of the rules for operations on unspecified values licenses compiler optimisations. Our unspecified value token is (roughly) a language-level analogue of the LLVM undef.
We believe that the same machinery can be used to handle structure padding and the padding after the current member of a union, as in Section 3.3 of our revised N2013.
This semantics seems to be a reasonable and coherent choice, with several benefits:
it makes bytewise copying of uninitialised objects (and partially uninitialised objects) legal, e.g. by user-code analogues of memcpy;
it makes bytewise (un)serialisation of such objects legal, though with nondeterministic results for the values of uninitialised bytes as viewed on disc;
it permits SSA optimisations for unspecified values;
it permits the "wide writes" accesses and optimisations that might touch padding that we are aware of; and
it respects any explicit clearing of padding bytes by the programmer, e.g. to ensure that no confidential information has leaked into them.
However, there are some things it does not support:
it does not support bytewise hashing or comparison of such objects, or (un)serialisation that involves compression, as the unspecified values will infect any computation more complex than a copy;
it does not support copying or (un)serialisation at larger than byte granularities, even with -fno-strict-aliasing, for the same reason; and
the semantics for printf and other library calls effectively presumes that their arguments are "frozen", which isn't really coherent with the fact that they will be compiled by the same compiler.
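The contrast between a bytewise copy (supported) and a bytewise checksum (not supported) can be seen in a small model of abstract bytes. This is a sketch only; `abyte`, `abyte_copy` and `abyte_xor_checksum` are hypothetical names, and XOR stands in for any computation more complex than a copy.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical abstract byte: concrete, or the unspecified token. */
typedef struct {
    bool          unspecified;
    unsigned char concrete;
} abyte;

/* A bytewise copy moves tokens unchanged, so it is always permitted. */
void abyte_copy(abyte *dst, const abyte *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}

/* A checksum combines bytes arithmetically, so a single unspecified
   byte makes the whole result unspecified. */
abyte abyte_xor_checksum(const abyte *src, size_t n)
{
    abyte acc = { false, 0 };
    for (size_t i = 0; i < n; i++) {
        if (src[i].unspecified)
            return (abyte){ true, 0 };
        acc.concrete ^= src[i].concrete;
    }
    return acc;
}
```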
So all this should be discussed.
Probably there should be a language-level "freeze" of some kind, to support code that knowingly manipulates potentially uninitialised values. We do not here attempt to specify that.
If we are keeping the concept of trap representations, with _Bool as a type (possibly the only one in normal implementations) that has them, then copying a partly uninitialised struct member-by-member will give rise to UB. How about copying a partially uninitialised struct as a whole? Is this an argument for making _Bool have no trap representations, instead making it UB to use non-canonical values in boolean operations or control-flow choices?
Do we need to change the Q50 answer to make control-flow choices based on unspecified values UB rather than (our previous proposal) runtime-nondeterministic? LLVM optimisations hoisting expressions past control-flow need this UB (see Taming Undefined Behavior in LLVM) to be sound. Without changing our source semantics, the mapping to LLVM will need freeze() instructions to be placed around the code of every C controlling expression. It is unclear how costly this is. Changing our source semantics to have control-flow choices based on unspecified values be UB would remove the need for these freeze() instructions, but further limit what one can do with uninitialised values.
A "safe C" might also want to provide (d) as an option (e.g. with a -f(no)-concrete-unspecified-values flag), especially given that a third of our survey respondents believe they can rely on that.
Example trap_representation_1.c
int main() {
  int i;
  int *p = &i;
  int j=i;   // should this have undefined behaviour?
  // note that i is read but the value is not used
}
Example trap_representation_2.c
int main() {
  int i;
  int j=i;   // should this have undefined behaviour?
  // note that i is read but the value is not used
}
In C11 the first has defined behaviour and the second has undefined behaviour. In our proposal, both have defined behaviour, with the read of i giving the unspecified value token, and that being written to j. Note that these are minimal test cases: we're not saying that these are in themselves desirable code - in the cases that arise in practice, the creation of a (perhaps partially) uninitialised value and its use are separated. Note also that on many implementations int has no unused representation bit patterns, so on those there can be no trap representations here. In the Itanium case, the NaT bit is per-register data, not a memory-storable bit pattern, and the "address taken" clause of 6.3.2.1p2 seems to have been intended to ensure that i is allocated in memory - but an optimising compiler might easily keep a value in registers down some call hierarchy.
See N2220 for further discussion of trap representations. The Proposed TC here is written assuming Option (b) from there; if one were removing trap representations altogether, it could be simplified somewhat.
We start with this so that printf can be used in later examples.
Example unspecified_value_library_call_argument.c
#include <stdio.h>
int main()
{
  unsigned char c;
  unsigned char *p = &c;
  printf("char 0x%x\n",(unsigned int)c);
  // should this have defined behaviour?
}
ISO C11 is unclear. The DR451 CR says "library functions will exhibit undefined behavior when used on indeterminate values" but here we are more specifically looking at unspecified values. We see no benefit from making this undefined behaviour, and we are not aware that compilers assume so. It prevents (e.g.) serialising or debug printing of partially uninitialised structs, or (if padding bytes are treated the same as other uninitialised values) byte-by-byte serialising of structs containing padding. Accordingly, we suggest that library functions such as printf, when called with an unspecified value, are executed by first making an unspecified (nondeterministic) choice at call-time of a concrete value. This permits the instability of uninitialised values that we see in practice.
The DR451 CR also says "The committee agrees that this area would benefit from a new definition of something akin to a 'wobbly' value and that this should be considered in any subsequent revision of this standard. The committee also notes that padding bytes within structures are possibly a distinct form of 'wobbly' representation.". As far as we can see, our proposal subsumes the need for a distinct notion of 'wobbly' representation.
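The suggested call-time choice for library functions can be modelled as below. This is a sketch of the abstract-machine rule only: `uval`, `uval_freeze` and `model_print` are hypothetical names, and the `choice` parameter stands in for the nondeterministic pick, which a real implementation would not represent explicitly.

```c
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical abstract value: concrete, or the unspecified token. */
typedef struct {
    bool     unspecified;
    unsigned concrete;
} uval;

/* Resolves a token to a concrete value. `choice` stands in for the
   abstract machine's nondeterministic pick; each call may be given a
   different one, which is what permits unstable output. */
unsigned uval_freeze(uval v, unsigned choice)
{
    return v.unspecified ? choice : v.concrete;
}

/* Models printf("char 0x%x\n", c) in the abstract machine: the
   argument is resolved at call time, then formatted as usual. */
int model_print(char *buf, size_t n, uval c, unsigned choice)
{
    return snprintf(buf, n, "char 0x%x\n", uval_freeze(c, choice));
}
```

Because the choice is made per call, two calls on the same unspecified value may produce different text, while a concrete value always prints as itself.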
Example unspecified_value_control_flow_choice.c
#include <stdio.h>
int main()
{
  unsigned char c;
  unsigned char *p = &c;
  if (c == 'a')
    printf("equal\n");
  else
    printf("nonequal\n");
  // should this have defined behaviour?
}
ISO C11 is unclear (it does not discuss this). We suggest "yes": permitting a runtime unspecified (nondeterministic) choice at any control-flow choice between specified alternatives based on an unspecified value. The only potential reason otherwise that we are aware of is (as noted by Joseph Myers) jump tables indexed by such a value, if implementations don't do a range check, but that seems likely to lead to security weaknesses. More conservatively, one could conceivably treat any switch whose controlling expression has an unspecified value as having undefined behaviour. Computed gotos (if they were allowed in the standard) on unspecified values should give undefined behaviour.
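The suggested rule for a two-way choice can be sketched as a function from the abstract value of the controlling expression to the set of branches the abstract machine may take. This is an illustrative model only; `uval`, `uval_eq`, `if_outcomes` and the `TAKE_*` constants are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical abstract value: concrete, or the unspecified token. */
typedef struct {
    bool unspecified;
    int  concrete;
} uval;

/* Set of branches the abstract machine may take, as a bitmask. */
enum { TAKE_THEN = 1, TAKE_ELSE = 2 };

/* Strict equality: comparing with the token yields the token. */
uval uval_eq(uval a, uval b)
{
    if (a.unspecified || b.unspecified)
        return (uval){ true, 0 };
    return (uval){ false, a.concrete == b.concrete };
}

/* Allowed outcomes of `if (cond) ... else ...`. */
int if_outcomes(uval cond)
{
    if (cond.unspecified)
        return TAKE_THEN | TAKE_ELSE;   /* unspecified choice between both */
    return cond.concrete != 0 ? TAKE_THEN : TAKE_ELSE;
}
```

Under the more conservative alternative, the first case of `if_outcomes` would instead be undefined behaviour.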
Example unspecified_value_stability.c
#include <stdio.h>
int main() {
  // assume here that int has no trap representations and
  // that printing an unspecified value is not itself
  // undefined behaviour
  int i;
  int *p = &i;
  // can the following print different values?
  printf("i=0x%x\n",i);
  printf("i=0x%x\n",i);
  printf("i=0x%x\n",i);
  printf("i=0x%x\n",i);
}
Clang sometimes prints distinct values here (this is consistent with the Clang internal documentation: ). Accordingly, we think the answer has to be "yes".
Example unspecified_value_strictness_and_1.c
#include <stdio.h>
int main() {
  unsigned char c;
  unsigned char *p=&c;
  unsigned char c2 = (c | 1);
  unsigned char c3 = (c2 & 1);
  // does c3 hold an unspecified value (not 1)?
  printf("c=%i  c2=%i  c3=%i\n",(int)c,(int)c2,(int)c3);
}
An LLVM developer remarks that different parts of LLVM assume that undef is propagated aggressively or that it represents an unknown particular number.
In our proposal, the read of c will give the unspecified value token; the result of the binary operation |, if given at least one unspecified-value argument, will also be the unspecified value token, which will be written to c2. Likewise, the binary & will be strict in unspecified-value-ness, and c3 will end up as the unspecified value. The printf will then make nondeterministic choices for each of these, allowing arbitrary character-valued integers to be printed by implementations.
Our unspecified value token is a per-scalar-type-object entity, not a per-bit entity (except for single-bit bitfields, where the two coincide).
(Note this would make the N1793 Fig.4 printhexdigit not useful when applied to an uninitialised structure member.)
Example unspecified_value_daemonic_1.c
int main() {
  int i;
  int *p = &i;
  int j = i;
  int k = 1/j; // should this have undefined behaviour?
}
The division operation has undefined behaviour for certain concrete argument values, i.e. 0, to accommodate implementation behaviour. If there is an abstract-machine execution in which the second argument is an unspecified value, then a corresponding execution of an actual implementation might divide by zero, so in the abstract machine division should be daemonic: division by an unspecified value should be just as "bad" as division by zero. The same holds for other partial operations and library calls.
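The daemonic treatment of division can be sketched as a model that classifies the outcome of `num / den` on abstract values. This is one reading of the rule, written for illustration; `uval`, `div_result` and `model_div` are hypothetical names, and the `INT_MIN / -1` case is included because it is the other concrete operand combination for which signed division is undefined.

```c
#include <limits.h>
#include <stdbool.h>

/* Hypothetical abstract value: concrete, or the unspecified token. */
typedef struct {
    bool unspecified;
    int  concrete;
} uval;

typedef enum { DIV_VALUE, DIV_UNSPECIFIED, DIV_UNDEFINED } div_kind;

typedef struct {
    div_kind kind;
    int      value;   /* meaningful only when kind == DIV_VALUE */
} div_result;

div_result model_div(uval num, uval den)
{
    /* An unspecified divisor might be 0 at run time. */
    if (den.unspecified || den.concrete == 0)
        return (div_result){ DIV_UNDEFINED, 0 };
    /* INT_MIN / -1 overflows, and an unspecified dividend might be INT_MIN. */
    if (den.concrete == -1 && (num.unspecified || num.concrete == INT_MIN))
        return (div_result){ DIV_UNDEFINED, 0 };
    /* Otherwise division is strict in the dividend. */
    if (num.unspecified)
        return (div_result){ DIV_UNSPECIFIED, 0 };
    return (div_result){ DIV_VALUE, num.concrete / den.concrete };
}
```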
This seems to be relied on in practice, and consistent with the "unspecified value token" semantics we have so far, so we suggest "yes" (except in the (Itanium) implementation-defined case above where all reads of uninitialised values give UB). The copy will have an unspecified value for the same member.
Consistent with this, forming a structure value should not be strict in unspecified-value-ness: in the following example, the read of the structure value from s1 and write to s2 should both be permitted, and should copy the value of i1=1. The read of the uninitialised member should not give rise to undefined behaviour (is this contrary to the last sentence of 6.3.2.1p2, or could the structure not "have been declared with the register storage class" in any case?). What s2.i2 holds after the structure copy depends on the rest of the unspecified-value semantics; in our proposal, it holds the unspecified value token.
Example unspecified_value_struct_copy.c
#include <stdio.h>
typedef struct { int i1; int i2; } st;
int main() {
  st s1;
  s1.i1 = 1;
  st s2;
  s2 = s1; // should this have defined behaviour?
  printf("s2.i1=%i\n",s2.i1);
}
This and the following questions investigate whether the property of being an unspecified value is associated with arbitrary (possibly aggregate) C values, or with "leaf" (scalar-type) values, or with individual bitfields, or with individual representation bytes of values, or with individual representation bits of values.
In principle there is a similar question for unions: can a union value as a whole be an unspecified value? There might be a real semantic difference between an unspecified value as a whole and a union that contains a specific member which itself is an unspecified value. But it's unclear whether there is a test in ISO C that distinguishes the two.
Example besson_blazy_wilke_bitfields_1u.c
#include <stdio.h>
struct f {
  unsigned int a0 : 1; unsigned int a1 : 1;
} bf ;
int main() {
  unsigned int a;
  bf.a1 = 1;
  a = bf.a1;
  printf("a=%u\n",a);
}
This example is from Besson, Blazy, and Wilke 2015.
For consistency with the rest of our per-leaf-value proposal, we suggest "yes".
Example unspecified_value_representation_bytes_1.c
#include <stdio.h>
int main() {
  // assume here that the implementation-defined
  // representation of int has no trap representations
  int i;
  unsigned char c = * ((unsigned char*)(&i));
  // does c now hold an unspecified value?
  printf("i=0x%x  c=0x%x\n",i,(int)c);
  printf("i=0x%x  c=0x%x\n",i,(int)c);
}
The best answer to this is unclear from all points of view: ISO C11 doesn't address the question; we don't know whether existing compilers assume these are unspecified values, and we don't know whether existing code relies on them not being unspecified values.
For stylistic consistency one might take the answer to be "yes", but then (given the suggested answers above) a bytewise hash or checksum computation involving them would produce an unspecified value. In a more concrete semantics, it could produce different results in different invocations, even if the value is not mutated in the meantime.
We don't have sufficient grounds for a strong conclusion at present. We tentatively suggest "yes".
Example unspecified_value_representation_bytes_4.c
#include <stdio.h>
int main() {
  // assume here that the implementation-defined
  // representation of int has no trap representations
  int i;
  printf("i=0x%x\n",i);
  printf("i=0x%x\n",i);
  unsigned char *cp = (unsigned char*)(&i);
  *(cp+1) = 0x22;
  // does *cp now hold an unspecified value?
  printf("*cp=0x%x\n",*cp);
  printf("*cp=0x%x\n",*cp);
}
This too is unclear. One could take the first such access as "freezing" the unspecified value and its representation bytes, but we don't know whether that would be sound with respect to current compiler behaviour. The simplest choice is "yes".
Example unspecified_value_representation_bytes_2.c
#include <stdio.h>
int main() {
  // assume here that the implementation-defined
  // representation of int has no trap representations
  int i;
  printf("i=0x%x\n",i);
  printf("i=0x%x\n",i);
  * (((unsigned char*)(&i))+1) = 0x22;
  // does i now hold an unspecified value?
  printf("i=0x%x\n",i);
  printf("i=0x%x\n",i);
}
Again "yes" is the simplest choice, but one could argue instead that a read of the whole should give any nondeterministically chosen value consistent with the concretely written bytes.
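The alternative mentioned here (a whole-object read may give any value consistent with the concretely written bytes) can be made precise with a small model. This is a sketch, not the proposed semantics: `partial32` and its functions are hypothetical, the object is taken to be 32 bits wide, and byte k is assumed to occupy bits 8k to 8k+7 (a little-endian layout).

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical abstract 32-bit object: `known` has 0xFF in each byte
   position that has been concretely written; `bits` holds those bytes
   (and 0 elsewhere). Little-endian byte numbering is assumed. */
typedef struct {
    uint32_t bits;
    uint32_t known;
} partial32;

partial32 partial_uninit(void) { return (partial32){ 0u, 0u }; }

/* Records a concrete write of byte b at byte index k (0..3). */
partial32 partial_write_byte(partial32 p, unsigned k, unsigned char b)
{
    uint32_t mask = (uint32_t)0xFFu << (8u * k);
    p.bits  = (p.bits & ~mask) | ((uint32_t)b << (8u * k));
    p.known |= mask;
    return p;
}

/* True iff a whole-object read may return `candidate`. */
bool partial_read_allows(partial32 p, uint32_t candidate)
{
    return (candidate & p.known) == p.bits;
}
```

For the example above, after writing 0x22 to byte 1, a read of `i` could give 0x00002200 or 0xDEAD22EF but not 0x00002300, whereas under the simpler "yes" answer the read just gives the unspecified value token.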
Values, unspecified values, indeterminate values, and trap representations are currently defined as follows (3.19):
3.19
1 value
precise meaning of the contents of an object when interpreted as having a specific type
3.19.2
1 indeterminate value
either an unspecified value or a trap representation
3.19.3
1 unspecified value
valid value of the relevant type where this International Standard imposes no requirements on which value is chosen in any instance
2 NOTE An unspecified value cannot be a trap representation.
3.19.4
1 trap representation
an object representation that need not represent a value of the object type
There "unspecified value" is used to speak of a particular but unknown concrete value (one might be misled by the language of 3.19.3 into thinking that an uninitialised variable that cannot be a trap representation must hold some particular concrete value at runtime, and hence that it will be stable if read multiple times).
Instead, in our proposal there is a "symbolic" unspecified value token at each type, and the abstract machine operates over the disjoint union of the normal concrete semantic values and this token. The notions of value, indeterminate value, and trap representation change accordingly, but in a way that basically preserves the way these terms are currently used:
3.19
1 concrete value
a concrete semantic value (the meaning of the non-trap-representation contents of an object when interpreted as having a specific type)
1 value
either a concrete value or the abstract unspecified value token for a specific type
3.19.2
1 indeterminate value
either a concrete value, or the unspecified value token for a specific type, or a trap representation for a specific type
3.19.3
1 unspecified value
an abstract token, distinct from all concrete values.
2 NOTE unspecified values typically will not have any runtime representation; they are used in the C abstract machine to define what optimisations are allowed, and implementations may use them for compile-time analysis. In the C abstract machine there is a single unspecified value for each type.
3.19.4
1 trap representation
a concrete object representation that does not represent a value of the object type
In C11 unspecified values arise in two main ways:
These are unchanged in our proposal, but we have to remove the following, to make reading uninitialised values for types without trap representations be defined behaviour:
Following the general principle that an unspecified value in the C abstract machine should permit the implementation behaviour when given an arbitrary concrete value, control-flow choices between bounded alternatives should be an unspecified choice between them, while control-flow choices to an unbounded choice of locations should give undefined behaviour. Accordingly, we suggest the following changes to the standard text:
modify (§6.5.15#4, for the Conditional Operator) from:
The first operand is evaluated; there is a sequence point between its evaluation and the evaluation of the second or third operand (whichever is evaluated). The second operand is evaluated only if the first compares unequal to 0; the third operand is evaluated only if the first compares equal to 0; the result is the value of the second or third operand (whichever is evaluated), converted to the type described below.
to the following:
The first operand is evaluated; there is a sequence point between its evaluation and the evaluation of the second or third operand (whichever is evaluated). If the first operand evaluates to an unspecified value, then it is unspecified which of the second or third operand is evaluated. Otherwise, the second operand is evaluated only if the first compares unequal to 0; the third operand is evaluated only if the first compares equal to 0; the result is the value of the second or third operand (whichever is evaluated), converted to the type described below.
add to (§6.8.4.1#2, for the if statement):
If the controlling expression evaluates to an unspecified value, it is unspecified which substatement is executed.
For switch statements, we could either follow the above or make it unspecified behaviour.
add to (§6.8.5#4, for iteration statements)
If the controlling expression evaluates to an unspecified value, it is unspecified whether the repetition occurs.
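The bounded-choice rule for the if statement can be sketched as a function from the controlling expression's abstract value to the set of substatements that an implementation is allowed to execute. This is a hypothetical model for illustration only; the type av_int and the names c_conc, c_unspec and if_outcomes are our assumptions, not proposed text.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical abstract value of a controlling expression of type int. */
typedef struct { bool is_unspecified; int concrete; } av_int;

av_int c_conc(int v)  { return (av_int){ false, v }; }
av_int c_unspec(void) { return (av_int){ true, 0 }; }

/* Bit set of the substatements that may be executed. */
enum { MAY_TAKE_THEN = 1, MAY_TAKE_ELSE = 2 };

unsigned if_outcomes(av_int cond) {
    if (cond.is_unspecified)
        return MAY_TAKE_THEN | MAY_TAKE_ELSE;   /* unspecified choice */
    return cond.concrete != 0 ? MAY_TAKE_THEN : MAY_TAKE_ELSE;
}
```

An unspecified controlling expression yields both alternatives but never anything outside them, which is what distinguishes this case from undefined behaviour.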
The semantics of expression operators should be "daemonic" when given an operand that evaluates to an unspecified value. That is, for any operand of an expression operator, if there is a given concrete value for which the behaviour of that operator is undefined, then if that operand is an unspecified value, then the behaviour of the operator is undefined (because an exceptional condition may occur).
For example if the value of the second operand of the / operator is zero, it gives undefined behaviour, and hence if the second operand is an unspecified value, it should likewise give undefined behaviour. Similarly if the value of any operand of a signed + operator is unspecified, the operation is undefined.
Accordingly, we suggest the following change to the text of the standard:
adding as a new clause to §6.5:
During the evaluation of an expression operator, if an unspecified value is used as an argument such that there exists a specified value for which the behaviour is undefined, then the behaviour is undefined.
To give an implementation intuition: daemonicity means that implementations do not need to take care not to introduce undefined behaviours when choosing a concrete value for something that in the C abstract machine is an unspecified value.
We suggest that the semantics of expression operators should be made strict on unspecified values. That is, except in the cases of daemonic undefined behaviour above, if the value of any operand of an expression operator is an unspecified value, then the result is an unspecified value.
For example if the value of the first operand of the / operator is unspecified, then the result has unspecified value.
Accordingly, we suggest the following changes to the text of the standard:
adding as a new clause to §6.5:
During the evaluation of an expression operator (other than the relational operators, equality operators, logical OR and AND operators and conditional operators), if an operand has an unspecified value and this does not cause undefined behavior, then the value of the operator is unspecified.
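The interaction of the daemonic and strict rules can be sketched with a small evaluator over abstract operands. This is an illustrative model under our own assumptions, not an implementation of any compiler: the types op_uint, op_int and outcome and the functions eval_udiv, eval_sadd and eval_uand are hypothetical names. Unsigned division is used for the strict case because, for a signed common real type, the detailed changes below make any unspecified operand undefined.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Hypothetical model types, not part of the proposed text. */
typedef enum { OUT_CONCRETE, OUT_UNSPECIFIED, OUT_UNDEFINED } outcome_kind;
typedef struct { outcome_kind kind; long long v; } outcome; /* v: OUT_CONCRETE only */
typedef struct { bool unspec; unsigned v; } op_uint;
typedef struct { bool unspec; int v; } op_int;

op_uint u_conc(unsigned v) { return (op_uint){ false, v }; }
op_uint u_unspec(void)     { return (op_uint){ true, 0 }; }
op_int  i_conc(int v)      { return (op_int){ false, v }; }
op_int  i_unspec(void)     { return (op_int){ true, 0 }; }

static outcome concrete(long long v) { return (outcome){ OUT_CONCRETE, v }; }
static outcome unspecified(void)     { return (outcome){ OUT_UNSPECIFIED, 0 }; }
static outcome undefined(void)       { return (outcome){ OUT_UNDEFINED, 0 }; }

/* Unsigned '/': daemonic in the divisor (it could be 0),
   strict in the dividend (no concrete dividend is undefined). */
outcome eval_udiv(op_uint a, op_uint b) {
    if (b.unspec || b.v == 0)
        return undefined();
    if (a.unspec)
        return unspecified();
    return concrete(a.v / b.v);
}

/* Signed '+': daemonic in both operands, since for any concrete operand
   other than 0 some concrete choice of the other operand overflows. */
outcome eval_sadd(op_int a, op_int b) {
    if (a.unspec || b.unspec)
        return undefined();
    if ((b.v > 0 && a.v > INT_MAX - b.v) || (b.v < 0 && a.v < INT_MIN - b.v))
        return undefined();                  /* ordinary signed overflow */
    return concrete((long long)a.v + b.v);
}

/* Unsigned '&': never undefined, so purely strict. */
outcome eval_uand(op_uint a, op_uint b) {
    if (a.unspec || b.unspec)
        return unspecified();
    return concrete(a.v & b.v);
}
```

Note that under strictness eval_uand(u_conc(0), u_unspec()) is unspecified rather than 0, even though every concrete choice for the second operand would give 0.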
Fleshing out the details of strict and daemonic treatment of unspecified values, we propose the following changes (one could just rely on the diffs from the two sections above, but that seems likely to lead to confusion - it seems worth adding the more verbose but detailed diff below too).
Function call
add the following clause to (§6.5.2.2 Semantics):
If the value of the expression that denotes the called function is unspecified, the behavior is undefined.
Structure and union members
append the following to the end of (§6.5.2.3#3 and #4):
If the value of the first expression is unspecified, the behavior is undefined.
Postfix increment and decrement operators
add the following clause at the beginning of (§6.5.2.4 Semantics):
If the value of the operand of the postfix increment or decrement operator is unspecified, the behavior is undefined.
Address operator
add the following sentence to (§6.5.3.2#3):
If the value of its operand is unspecified, the behavior is undefined.
Indirection operator
add to (§6.5.3.2#4) before its current last sentence:
If the value of the operand is unspecified, the behavior is undefined.
NOTE: this also takes care of the array subscripting operator given (§6.5.2.1#2)
Unary ~ operator
add the following sentence to (§6.5.3.3#4):
If the value of the operand is unspecified, the result is the unspecified value of the promoted type.
sizeof and _Alignof operators
add the following clause to (§6.5.3.4 Semantics):
If the value of the operand of the sizeof or _Alignof operator is unspecified, then the result is the unspecified value of type size_t.
Multiplicative operators
in (§6.5.5#5), replace:
In both operations, if the value of the second operand is zero, the behavior is undefined.
by
In both operations, if the value of the second operand is zero or is unspecified, the behavior is undefined.
and add the following clause to (§6.5.5 Semantics):
If the common real type is signed and the value of either operand is unspecified, the behavior is undefined.
Additive operators
add the following clause to (§6.5.6):
If both operands have arithmetic type and their common real type is signed, and the value of either operand is unspecified, the behavior is undefined.
and the following clause to (§6.5.6):
If one of the operands has a pointer type and the value of either operand is unspecified, the behavior is undefined.
Bitwise shift operator
replace the current last sentence of (§6.5.7#3):
If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.
with
If the value of the right operand is negative, is greater than or equal to the width of the promoted left operand or is unspecified, the behavior is undefined.
and append to the end of (§6.5.7#4):
In particular if E1 has signed type and its value is unspecified, the behavior is undefined.
Relational operator
append to the end of (§6.5.8#5):
In particular, if the value of either operand is unspecified, the behaviour is undefined.
and add the following clause to the end of (§6.5.8):
When two objects of real types are compared, and the value of either operand is unspecified, it is unspecified whether the operator yields 1 or 0.
Equality operator
in clause (§6.5.9#3) replace the current final sentence:
For any pair of operands, exactly one of the relations is true.
by
For any pair of operands, exactly one of the relations is true (except when the value of either operand is unspecified).
and append to the end of (§6.5.9#4):
If the value of either operand is unspecified, it is unspecified whether they are equal.
and add the following clause to (§6.5.9):
If at least one operand is a pointer and the value of either operand is unspecified, the behavior is undefined.
Bitwise AND operator
append to the end of (§6.5.10#4):
If the value of either operand is unspecified, the result is the unspecified value of the common real type of the operands.
Bitwise exclusive OR operator
append to the end of (§6.5.11#4):
If the value of either operand is unspecified, the result is the unspecified value of the common real type of the operands.
Bitwise inclusive OR operator
append to the end of (§6.5.12#4):
If the value of either operand is unspecified, the result is the unspecified value of the common real type of the operands.
Logical AND operator
add the following sentence to (§6.5.13#3):
If the value of either operand is unspecified, it is unspecified whether the operator yields 1 or 0.
and append to the end of (§6.5.13#4):
If the value of the first operand is unspecified, it is unspecified whether the second operand is evaluated.
Logical OR operator
add the following sentence to (§6.5.14#3):
If the value of either operand is unspecified, it is unspecified whether the operator yields 1 or 0.
and append to the end of (§6.5.14#4):
If the value of the first operand is unspecified, it is unspecified whether the second operand is evaluated.
Conditional operator
modify (§6.5.15#4) from the current:
The first operand is evaluated; there is a sequence point between its evaluation and the evaluation of the second or third operand (whichever is evaluated). The second operand is evaluated only if the first compares unequal to 0; the third operand is evaluated only if the first compares equal to 0; the result is the value of the second or third operand (whichever is evaluated), converted to the type described below.
to the following:
The first operand is evaluated; there is a sequence point between its evaluation and the evaluation of the second or third operand (whichever is evaluated). If the first operand evaluates to an unspecified value, then it is unspecified which of the second or third operand is evaluated. Otherwise, the second operand is evaluated only if the first compares unequal to 0; the third operand is evaluated only if the first compares equal to 0; the result is the value of the second or third operand (whichever is evaluated), converted to the type described below.
Assignment operators
add the following clause to (§6.5.16):
If the value of the left operand is unspecified, the behavior is undefined.
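For the comparison operators on real types, the result is a bounded choice rather than an unspecified value or undefined behaviour. A rough sketch of that reading, as a function returning the set of results an implementation may produce, follows. The names op_uint, eval_eq and eval_lt are hypothetical, and the model covers only unsigned arithmetic operands, not pointers.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical abstract operand of type unsigned int. */
typedef struct { bool unspec; unsigned v; } op_uint;

op_uint u_conc(unsigned v) { return (op_uint){ false, v }; }
op_uint u_unspec(void)     { return (op_uint){ true, 0 }; }

/* Bit set of the results the operator may yield. */
enum { MAY_YIELD_0 = 1, MAY_YIELD_1 = 2 };

/* '==' on arithmetic operands: either result is allowed as soon as
   one operand is unspecified. */
unsigned eval_eq(op_uint a, op_uint b) {
    if (a.unspec || b.unspec)
        return MAY_YIELD_0 | MAY_YIELD_1;
    return a.v == b.v ? MAY_YIELD_1 : MAY_YIELD_0;
}

/* '<' on real operands, following the same pattern. */
unsigned eval_lt(op_uint a, op_uint b) {
    if (a.unspec || b.unspec)
        return MAY_YIELD_0 | MAY_YIELD_1;
    return a.v < b.v ? MAY_YIELD_1 : MAY_YIELD_0;
}
```

One consequence worth noticing is that comparing an unspecified value with itself may yield 0, which is why the "exactly one of the relations is true" sentence needs its exception.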
To permit (for example) bytewise marshalling of structs that may contain unspecified value members or unspecified value padding, we have to allow invocation of standard library functions that make system calls with unspecified value arguments. The following sentence should therefore be added to clause (§7.1.4#1):
If an argument to a standard library function has an unspecified value, it is replaced by a nondeterministic concrete value in an unspecified fashion.
NOTE: This also makes standard library functions that have undefined behaviour for specific arguments daemonic, in the same way as operators.
NOTE: It might be preferable to restrict these conversions to library functions performing I/O (or more generally having to do with system calls).
NOTE: A return from main() with an unspecified value should be similar, making a nondeterministic choice of a concrete value in an unspecified fashion.
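As an illustration of the kind of code this is meant to permit, here is a minimal sketch of bytewise marshalling of a struct whose padding is never written. The struct record and the functions marshal_record and unmarshal_record are hypothetical; memcpy stands in for any standard library function that consumes the bytes, and the presence of padding after tag is an assumption about the ABI.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical record; most ABIs place padding between tag and count. */
struct record {
    char tag;
    int  count;
};

/* Copies the whole object representation, padding included, into buf.
   Returns the number of bytes written, or 0 if buf is too small. */
size_t marshal_record(const struct record *r, unsigned char *buf, size_t cap) {
    if (cap < sizeof *r)
        return 0;
    memcpy(buf, r, sizeof *r);
    return sizeof *r;
}

/* Rebuilds a record from bytes produced by marshal_record. */
bool unmarshal_record(const unsigned char *buf, size_t len, struct record *out) {
    if (len < sizeof *out)
        return false;
    memcpy(out, buf, sizeof *out);
    return true;
}
```

A caller that initialises only tag and count and then marshals the struct passes padding bytes with unspecified values to the library; the named members still round-trip, while the padding bytes in the buffer may hold any concrete values.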
§6.2.6 Representations of types has to be adapted to account for the fact that values can be either concrete values (which have representations) or unspecified value tokens (which do not), and to specify that when one reads a byte from an unspecified value, one gets an unspecified value at type unsigned char.
6.2.6 Representations of types
modify (§6.2.6.1#3) from the current
Values stored in unsigned bit-fields and objects of type unsigned char shall be represented using a pure binary notation.49)
to:
Concrete values stored in unsigned bit-fields and objects of type unsigned char shall be represented using a pure binary notation.49)
modify (§6.2.6.1#4) from the current
Values stored in non-bit-field objects of any other object type consist of n × CHAR_BIT bits, where n is the size of an object of that type, in bytes. The value may be copied into an object of type unsigned char [n] (e.g., by memcpy); the resulting set of bytes is called the object representation of the value. Values stored in bit-fields consist of m bits, where m is the size specified for the bit-field. The object representation is the set of m bits the bit-field comprises in the addressable storage unit holding it. Two values (other than NaNs) with the same object representation compare equal, but values that compare equal may have different object representations.
to
Concrete values stored in non-bit-field objects of any other object type shall be represented with n × CHAR_BIT bits, where n is the size of an object of that type, in bytes. A concrete value may be copied into an object of type unsigned char [n] (e.g., by memcpy); the resulting set of bytes is called the object representation of the value. An unspecified value stored in a non-bit-field object may also be copied into such an array, the elements of which then hold the unspecified value for type unsigned char. Concrete values stored in bit-fields consist of m bits, where m is the size specified for the bit-field. The object representation is the set of m bits the bit-field comprises in the addressable storage unit holding it. Two concrete values (other than NaNs) with the same object representation compare equal, but concrete values that compare equal may have different object representations.
Consider either (1) an object which has not been initialised, and therefore has an unspecified value, which then has one or more (but not all) of its representation bytes written to with a concrete value, or (2) an object that has been initialised but which has one or more (but not all) of its representation bytes overwritten from some other uninitialised object. In either case we have a mix of unspecified-value and concrete representation bytes. When reading from the whole object, the read value should be the unspecified value of the appropriate type. In particular if the read value is then stored back to the object, the representation bytes that were made concrete disappear.
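The behaviour described here can be sketched with a small model of abstract memory in which every representation byte is either concrete or unspecified. This is a hypothetical model for illustration, not proposed text: abs_byte, abs_object, abs_u32 and the obj_ functions are names we introduce, and the object is assumed to be a uint32_t.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OBJ_SIZE sizeof(uint32_t)

/* Hypothetical abstract memory for one uint32_t object. */
typedef struct { bool unspec; unsigned char v; } abs_byte;
typedef struct { abs_byte bytes[OBJ_SIZE]; } abs_object;
typedef struct { bool unspec; uint32_t v; } abs_u32;

/* A freshly created object: every representation byte is unspecified. */
abs_object obj_uninitialised(void) {
    abs_object o;
    for (size_t i = 0; i < OBJ_SIZE; i++)
        o.bytes[i] = (abs_byte){ true, 0 };
    return o;
}

/* Write one representation byte, as through an unsigned char lvalue. */
void obj_store_byte(abs_object *o, size_t i, unsigned char v) {
    o->bytes[i] = (abs_byte){ false, v };
}

/* Whole-object read: any unspecified byte makes the value unspecified. */
abs_u32 obj_load(const abs_object *o) {
    unsigned char raw[OBJ_SIZE];
    for (size_t i = 0; i < OBJ_SIZE; i++) {
        if (o->bytes[i].unspec)
            return (abs_u32){ true, 0 };
        raw[i] = o->bytes[i].v;
    }
    abs_u32 r = { false, 0 };
    memcpy(&r.v, raw, OBJ_SIZE);
    return r;
}

/* Whole-object write: storing the unspecified value makes every byte
   unspecified, discarding any bytes that had been made concrete. */
void obj_store(abs_object *o, abs_u32 x) {
    if (x.unspec) {
        for (size_t i = 0; i < OBJ_SIZE; i++)
            o->bytes[i] = (abs_byte){ true, 0 };
    } else {
        unsigned char raw[OBJ_SIZE];
        memcpy(raw, &x.v, OBJ_SIZE);
        for (size_t i = 0; i < OBJ_SIZE; i++)
            o->bytes[i] = (abs_byte){ false, raw[i] };
    }
}
```

Writing one byte, loading the whole object and storing the result back leaves every byte unspecified again, which is exactly the loss of concrete bytes noted above.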
§6.3.2.1 Lvalues, arrays, and function designators. Replace
Except when it is the operand of the sizeof operator, the unary & operator, the ++ operator, the -- operator, or the left operand of the . operator or an assignment operator, an lvalue that does not have array type is converted to the value stored in the designated object (and is no longer an lvalue); this is called lvalue conversion.
by
Except when it is the operand of the sizeof operator, the unary & operator, the ++ operator, the -- operator, or the left operand of the . operator or an assignment operator, an lvalue that does not have array type is converted to the value stored in the designated object (and is no longer an lvalue); this is called lvalue conversion. If any byte of any scalar type subobject of the designated object is an unspecified value, this conversion for that subobject gives the unspecified value for that type.