Standardized units for use in information technology
----------------------------------------------------

or "What is a Megabyte ...?"

Markus Kuhn -- 1996-12-29 [updated 1999-07-19]

The precise meaning and correct abbreviation of units like 'kilobyte', 'megabyte', and 'gigabit' are frequently discussed topics. A formal standard that defines units for information capacity and related quantities does not exist and is urgently needed. This paper presents a possible standard and critiques existing practice.

http://www.cl.cam.ac.uk/~mgk25/information-units.txt

1  Introduction

In information technology, many units for memory capacity, data transmission rate, etc. are used today without formal standardization. The current application of units in information processing is ill-defined, unsystematic, and inconsistent.

International Standard ISO 31 [1,2], based on the International System of Units (SI), defines quantities and units for various fields of science and technology. ISO 31 is well-accepted around the world, but it currently excludes units for 'information content', and no other comprehensive formal standard specifying units for information technology exists.

In order to eliminate the current uncertainty in engineering and trade involved with the use of units for information capacity and related quantities, I propose the extension of ISO 31 by a new part "Information Technology" that defines a set of practical units for this field. The next section gives an idea of what such a definition might look like.

2  Proposal

The unit of information capacity shall be '1 bit'. The name 'bit' is derived from 'binary digit' and shall not be abbreviated further. The quantity information capacity is dimensionless, because it refers to a number of binary symbols.

  Unit: 1 bit

One 'bit' is the information capacity equivalent to one binary digit. It represents the ability to handle the knowledge about which one of two possible complementary events has happened.
Unless explicitly noted, the information capacity described in the unit 'bit' or multiples of the unit 'bit' shall never be measured or calculated based on any low entropy assumptions of the information. This means for the information capacity '1 bit' that the a priori probability of both binary values can be 0.5 and both values need not be correlated in any way to any other available knowledge.

Note: This definition of the unit 'bit' makes it clear that high entropy random bit sequences must be used to measure the information capacity of devices that use internal data compression. The unit 'bit' shall not be used in order to measure decision content of symbols or entropy of sources in information theory or in order to measure entropy in thermodynamics. (See also the unit shannon (Sh) mentioned below).

The following widely used multiple of the unit 'bit' is defined as an alternative unit of information capacity based on the definition of the unit 'bit':

  Alternative unit: 1 byte = 1 b = 8 bit

Note: The abbreviation 'B' is already used for the unit 'bel' (usually used as dB = decibel) defined in ISO 31-2. In addition, capital letter abbreviations are used in the SI only for units named after a person, therefore the 'B' is not available for 'byte'. Information capacity units are very frequently used together with 'bel' in texts dealing with communication systems. There is some discussion going on to define neper (Np) as a new SI base unit and based on it to include bel (B) as a new derived SI unit.

Note: The abbreviation 'b' is also already used for the unit 'barn' (where 1 b = 1e-28 m^2), which is used to specify cross-sections in nuclear physics. Although 'barn' is mentioned (not defined!) in an informal remark in ISO 31-10 and listed in some national laws about legal units of measurement (e.g., [6]), it is not an SI unit, and 1 barn = 1 b can also easily be written using the SI unit femtometer as 100 fm².
It is unlikely that 'barn' and 'byte' will be used in the same context frequently. However, whether the official abbreviation for 'byte' shall be 'b' or 'by' is a matter that might need further discussion.

Note: Historically, the term 'byte' has also been used in the computer industry to refer to the information capacity required to represent one text character or to refer to the smallest fraction of a machine word which can be addressed separately [4]. However, today the meaning of 1 byte = 8 bit [5] clearly dominates any other meaning, therefore standardization on the modern widely known meaning of 'byte' is adequate. Several international standards use the definition 1 byte = 8 bit (e.g., ISO 9660 and ISO 11544). In the French language, the term 'octet' is often used instead of 'byte'; however, the abbreviation should in any case remain 'b', because 'o' can easily be confused with the digit 0. (For this reason, SI avoids 'O' for the ohm and uses the capital Greek letter omega).

Note: It has been suggested to define additional multiples of 1 bit as alternative units for information capacity. Some interesting suggestions have been for instance

  1 nibble = 1 ni =  4 bit
  1 rune   = 1 r  = 16 bit
  1 quad   = 1 q  = 32 bit

However, these are not widely used today and therefore standardization seems premature. The terms 'word', 'halfword' and 'doubleword' are specified in an IBM standard, but they commonly also refer to different machine word sizes and are consequently not good candidates for international standardization.

The units defined here can be used together with other SI units and SI prefixes. As in the SI, the prefixes denote powers of ten.

Examples:

  1 kbit/s   = 1 000 bit/second    (data transmission rate)
  1 kbyte/cm = 800 000 bit/meter   (tape storage density)
  1 Mbyte    = 8 000 000 bit       (file length)

Note: It is possible and recommended by this proposal to use the unabbreviated unit name 'byte' with abbreviated prefixes as shown above.
The abbreviation 'b' for 'byte' is mainly intended for situations where a compact notation is desirable (e.g., in narrow tables).

As powers of two play a significant role in digital information technology, the following SI prefixes can also be used to denote the next higher power of two instead of a power of ten. In this case, a subscripted digit 2 shall be appended to the abbreviation of the SI prefix and the syllable 'di' is added before the prefix name. If printing a lowered digit 2 in a smaller font is not possible (as in this ISO 8859-1 file), a normal digit 2 directly after the prefix abbreviation is also acceptable.

Binary prefixes:

  dikilo  = k2 = 2^10 = 1024
  dimega  = M2 = 2^20 = 1048576
  digiga  = G2 = 2^30 = 1073741824
  ditera  = T2 = 2^40 = 1099511627776
  dipeta  = P2 = 2^50 = 1125899906842624
  diexa   = E2 = 2^60 = 1152921504606846976
  dizetta = Z2 = 2^70 = 1180591620717411303424
  diyotta = Y2 = 2^80 = 1208925819614629174706176

Examples:

  - Maximum application memory size of a historical PC operating system:

      640 k2byte = 5 242 880 bit = 5 M2bit

  - Formatted capacity of a 90 mm (3.5 in) high density PC floppy disk:

      2 * 80 * 18 * 512 byte = 1440 k2b = 11 796 480 bit = 11.25 M2bit

Note: The modern Unicode and ISO 10646 character sets include a character SUBSCRIPT TWO as code hexadecimal 2082, which allows correct display of the binary prefix abbreviations on computers. Unfortunately, many very widely used old character sets (especially ISO 8859 and IBM CP437) contain no such character.

Note: In some situations, binary prefixes might also be useful when applied to other units. For example, the common digital wrist watch crystal frequency is 32 768 Hz = 32 k2Hz or the number of pixels on a 1024x768 raster graphics screen could be denoted 0.75 dimegapixels. Negative powers of two can also sometimes be useful as prefixes. For example, a timer with a frequency of 1024 Hz = 1 k2Hz has a timing resolution of 1/1024 s = 0.97656... ms = 1 m2s = 1 dimillisecond.
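
Note: The arithmetic in the examples of this section can be checked mechanically. The following Python sketch is only an illustration and not part of the proposal. The names PREFIXES, UNITS, to_bits and from_bits are hypothetical; only the prefix values and the relation 1 byte = 8 bit are taken from the text above.

```python
from fractions import Fraction

# Decimal SI prefixes and the proposed binary ('di') prefixes.
# The table names and layout are assumptions made for this sketch.
PREFIXES = {
    "": 1,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
    "k2": 2**10, "M2": 2**20, "G2": 2**30, "T2": 2**40,
    "P2": 2**50, "E2": 2**60, "Z2": 2**70, "Y2": 2**80,
}

# Size of each unit in bit, using the proposal's 1 byte = 8 bit.
UNITS = {"bit": 1, "byte": 8}


def to_bits(value, prefix, unit):
    """Convert a prefixed quantity such as 640 k2byte into bit."""
    return value * PREFIXES[prefix] * UNITS[unit]


def from_bits(bits, prefix, unit):
    """Express a number of bit in a prefixed unit, as an exact fraction."""
    return Fraction(bits, PREFIXES[prefix] * UNITS[unit])


# Maximum application memory of a historical PC operating system.
pc_memory = to_bits(640, "k2", "byte")
print(pc_memory, from_bits(pc_memory, "M2", "bit"))      # 5242880 5

# Formatted capacity of a 90 mm high density floppy disk.
floppy = to_bits(2 * 80 * 18 * 512, "", "byte")
print(floppy, float(from_bits(floppy, "M2", "bit")))     # 11796480 11.25
print(from_bits(floppy, "k2", "byte"))                   # 1440

# Decimal prefix example: 1 Mbyte as a file length.
print(to_bits(1, "M", "byte"))                           # 8000000
```

Exact fractions are used instead of floating point so that results such as 11.25 M2bit are not affected by rounding.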
Note: In spoken language, the 'di' syllable can be omitted when it is either clear or irrelevant from the context whether a power of two or ten is described (like the "one megabit chip", which is obviously a "one dimegabit chip"). In written texts however, the distinction shall always be made clear by adding the (subscript) digit '2' to the abbreviated prefix. The syllable 'di' has been selected because it is short, easy to pronounce in many languages, and offers some consistency with the chemical notation (e.g., carbon dioxide = CO2).

3  Existing practice

The 10^3 versus 2^10 notation problem was discussed early on in the computer science literature. Suggestions included using the small Greek letter kappa or the symbol 'bK' and its powers for powers of 1024 [7,8,9]. These proposals have never gained any significant acceptance and are not aligned well with the SI and ISO 31 standards.

There seems to exist some consensus in the technical world that the prefix 'kilo' used with the units 'bit' and 'byte' denotes a factor of 2^10 = 1024. Unfortunately, this is highly inconsistent with the official meaning of 'kilo' as specified in the SI standard, which is a factor of 10^3 = 1000 and which is widely used this way in all areas of science, technology, and trade.

An often mentioned proposal to solve this contradiction is to use the capital letter 'K' for 1024 and the normal SI abbreviation for 'kilo', the small letter 'k', for 1000. This solution, elegant at first glance, already fails with 'mega', which has the capital letter 'M' as the normal SI abbreviation and where the small letter 'm' already denotes milli = 10^-3 = 0.001.

However, the SI prefixes are not used consistently today as powers of two in the context of 'byte'.
For example, a 90 mm floppy disk (yes, it is really exactly 90 mm wide, not 3 1/2 inch!), which has a formatted storage capacity of 18 * 2 * 80 * 512 byte = 1 474 560 byte = 1.44 * 1000 * 1024 byte, is commonly referred to as a "1.44 megabyte floppy disk". Here, 'mega' is quite commonly used as a prefix denoting a factor of 1 024 000 instead of 2^20. In general, in the context of magnetic and optical storage systems, both the definitions 1 megabyte = 1000 * 1000 byte and 1 megabyte = 1000 * 1024 byte seem to be much more popular than 1 megabyte = 1024 * 1024 byte.

In the context of data transmission rates, the units 'kbit/s', 'Mbit/s', and 'Gbit/s' are today used consistently with the prefixes referring to powers of ten.

In mid-1996, a new international standard proposal for power-of-two prefixes was published by the IEC [11]. This IEC proposal is as follows:

  Prefixes for binary multiples for use in information technology

  Factor  Name  Symbol  Origin                 Derived from
  2^10    kibi  Ki      kilobinary: (2^10)^1   kilo: (10^3)^1
  2^20    mebi  Mi      megabinary: (2^10)^2   mega: (10^3)^2
  2^30    gibi  Gi      gigabinary: (2^10)^3   giga: (10^3)^3
  2^40    tebi  Ti      terabinary: (2^10)^4   tera: (10^3)^4

  Examples:

    one kibibit:  1 Kibit = 2^10 bit
    one kilobit:  1 kbit  = 10^3 bit
    one mebibit:  1 Mibit = 2^20 bit
    one megabit:  1 Mbit  = 10^6 bit

  Note: Suggested pronunciation in English: The first syllable should be pronounced in the same way as the first syllable of the corresponding SI prefix. The second syllable should be pronounced "bee".

UPDATE: This proposal was adopted as IEC 60027-2 in January 1999 [12,13]. It has been extended by the prefixes pebi and exbi for (2^10)^5 and (2^10)^6. I obviously would have preferred the proposal from section 2.

It is sometimes suggested to abbreviate 'bit' as 'b' and 'byte' as 'B' and one existing national standard [10] gives this definition. Existing practice however seems to be to abbreviate 'byte' both as 'b' and 'B'.
Also 'bit' is sometimes abbreviated as both 'b' and 'B', but 'bit' is usually not abbreviated at all.

In [3], in addition to 'bit' and 'byte', also the units 'baud' (Bd) (for modem data transmission symbol rate), and 'shannon' (Sh) (for decision content and entropy) have been defined. These and other quantities, units and abbreviations for units should probably be included here, too.

Theoretically, it might be possible to specify the unit 'bit' in terms of existing SI base units for thermodynamic entropy. However, this has very little practical application in both thermodynamics and information technology. A quantity measured in 'bit' usually denotes only a dimensionless number of binary symbols and therefore this thermodynamic definition of the 'bit' is not suggested here.

References:

[1] International Standard ISO 31-0, Quantities and units -- Part 0: General principles, International Organization for Standardization, Geneva, 1992.

[2] Quantities and units, ISO Standards Handbook, International Organization for Standardization, third edition, 345 p., Geneva, 1993, ISBN 92-67-10185-4, 161 CHF. This book contains the full text of ISO 31 and ISO 1000. Check for ordering details.

[3] Terms and abbreviations for information quantities in telecommunication, CCITT Recommendation B.14, CCITT Blue Book, Fascicle I.3, International Telecommunication Union, Geneva, 1989.

[4] International Standard ISO/IEC 2382-1, Information technology - Vocabulary - Part 1: Fundamental terms, International Organization for Standardization, Geneva, 1993.

[5] International Standard ISO/IEC 2382-4, Information processing systems - Vocabulary - Part 04: Organization of data, International Organization for Standardization, Geneva, 1987.

[6] German law about legal units for measurement: Ausführungsverordnung zum Gesetz über Einheiten im Meßwesen (Einheitenverordnung - EinhV), published 1985-12-13 in BGBl. I S. 2272, last modification 1991-03-22 in BGBl. I S. 836.

[7] Donald R.
Morrison, Abbreviations for Computer and Memory Sizes, Communications of the ACM, Vol. 11, No. 3, March 1968, p. 150.

[8] Wallace Givens, Proposed Abbreviation for 1024: bK, Communications of the ACM, Vol. 11, No. 6, June 1968, p. 391.

[9] Bruce A. Martin, On Binary Notation, Communications of the ACM, Vol. 11, No. 10, October 1968, p. 658.

[10] ANSI/IEEE 260.1-1993: Letter Symbols for Units of Measurement.

[11] International Electrotechnical Commission (IEC), new work item proposal 25/180/NP, Amendment of IEC 27-2: Letter symbols to be used in electrical technology, Part 2: Telecommunications and electronics - Introduction of prefixes for binary multiples, IEC/TC 25/WG 1, June 1996.

[12] Anders J. Thor: IEC standardizes prefixes for binary multiples, IEC TC newsletter, p. 4, February 1999.

[13] International System of Units (SI) - Prefixes for binary multiples. NIST Web page.

I wish to thank all people who have contributed information or helped me to improve this proposal, including

  Lawrence Crowl
  Rainer Seitel
  Prof. Karl Kleine
  Olle Järnefors
  Bruce Barrow
  Volker Seibt

Markus Kuhn

--
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain