Standardized units for use in information technology
----------------------------------------------------

            or "What is a Megabyte ...?"


Markus Kuhn -- 1996-12-29 [updated 1999-07-19]

The precise meaning and correct abbreviation of units like 'kilobyte',
'megabyte', and 'gigabit' are frequently discussed topics. A formal
standard that defines units for information capacity and related
quantities does not exist and is urgently needed. This paper presents
a possible standard and critiques existing practice.
 
http://www.cl.cam.ac.uk/~mgk25/information-units.txt


1  Introduction

In information technology, many units for memory capacity, data
transmission rate, etc. are used today without formal standardization.
The current application of units in information processing is
ill-defined, unsystematic, and inconsistent.

International Standard ISO 31 [1,2], based on the International System
of Units (SI), defines quantities and units for various fields of
science and technology. ISO 31 is well-accepted around the world, but
it currently excludes units for 'information content', and no other
comprehensive formal standard specifying units for information
technology exists.

In order to eliminate the current uncertainty in engineering and trade
involved with the use of units for information capacity and related
quantities, I propose the extension of ISO 31 by a new part
"Information Technology" that defines a set of practical units for
this field. The next section gives an idea of what such a definition
might look like.


2  Proposal

The unit of information capacity shall be '1 bit'. The name 'bit' is
derived from 'binary digit' and shall not be abbreviated further. The
quantity information capacity is dimensionless, because it refers to a
number of binary symbols.

Unit: 1 bit

One 'bit' is the information capacity equivalent to one binary digit.
It represents the ability to handle the knowledge about which one of
two possible complementary events has happened.

Unless explicitly noted, the information capacity described in the
unit 'bit' or multiples of the unit 'bit' shall never be measured or
calculated based on any low entropy assumptions of the information.
This means for the information capacity '1 bit' that the a priori
probability of each of the two binary values can be 0.5 and that the
value need not be correlated in any way with any other available
knowledge.

Note: This definition of the unit 'bit' makes it clear that high
entropy random bit sequences must be used to measure the information
capacity of devices that use internal data compression.
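
Illustration (not part of the proposal): the following Python sketch
shows why the choice of test data matters. The zlib library is used
here merely as a stand-in for a device's internal data compression,
and the buffer size of 65 536 byte is an arbitrary assumption.

```python
import os
import zlib

N = 65536  # test size in byte (arbitrary choice for this sketch)

low_entropy = bytes(N)        # N zero bytes: highly predictable data
high_entropy = os.urandom(N)  # N bytes from the OS random source

# zlib stands in for a hypothetical device's internal compression.
low_stored = len(zlib.compress(low_entropy))
high_stored = len(zlib.compress(high_entropy))

print("low entropy :", N, "byte in,", low_stored, "byte stored")
print("high entropy:", N, "byte in,", high_stored, "byte stored")
```

The zero-filled buffer shrinks to a small fraction of its size, while
the random buffer does not shrink at all. Only the second measurement
says something about the capacity in the sense defined above.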

The unit 'bit' shall not be used in order to measure decision content
of symbols or entropy of sources in information theory or in order to
measure entropy in thermodynamics. (See also the unit shannon (Sh)
mentioned below.)

The following widely used multiple of the unit 'bit' is defined as an
alternative unit of information capacity based on the definition of
the unit 'bit':

Alternative unit: 1 byte = 1 b = 8 bit

Note: The abbreviation 'B' is already used for the unit 'bel' (usually
used as dB = decibel) defined in ISO 31-2. In addition, capital letter
abbreviations are used in the SI only for units named after a person,
therefore the 'B' is not available for 'byte'. Information capacity
units are very frequently used together with 'bel' in texts dealing
with communication systems. There is an ongoing discussion about
defining the neper (Np) as a new SI base unit and, based on it,
including the bel (B) as a new derived SI unit.

Note: The abbreviation 'b' is also already used for the unit 'barn'
(where 1 b = 1e-28 m^2), which is used to specify cross-sections in
nuclear physics. Although 'barn' is mentioned (not defined!) in an
informal remark in ISO 31-10 and listed in some national laws about
legal units of measurement (e.g., [6]), it is not an SI unit, and 1
barn = 1 b can also easily be written using the SI unit femtometer as
100 fm². It is unlikely that 'barn' and 'byte' will be used in the
same context frequently. However, whether the official abbreviation
for 'byte' shall be 'b' or 'by' is a matter that might need further
discussion.

Note: Historically, the term 'byte' has also been used in the computer
industry to refer to the information capacity required to represent
one text character or to refer to the smallest fraction of a machine
word which can be addressed separately [4]. However, today the meaning
of 1 byte = 8 bit [5] clearly dominates any other meaning, so
standardizing on the modern, widely known meaning of 'byte' is
appropriate. Several international standards use the definition
1 byte = 8 bit (e.g., ISO 9660 and ISO 11544). In the French language,
the term 'octet' is often used instead of 'byte'; the abbreviation
should nevertheless remain 'b', because 'o' can easily be confused
with the digit 0. (For this reason, SI avoids 'O' for the ohm and uses
the capital Greek letter omega.)

Note: It has been suggested to define additional multiples of 1 bit as
alternative units for information capacity. Some interesting
suggestions have been for instance

  1 nibble = 1 ni =  4 bit
  1 rune   = 1 r  = 16 bit
  1 quad   = 1 q  = 32 bit

However, these are not widely used today, and standardization
therefore seems premature. The terms 'word', 'halfword' and
'doubleword' are specified in an IBM standard, but they also commonly
refer to different machine word sizes and are consequently not good
candidates for international standardization.

The units defined here can be used together with other SI units and SI
prefixes. As in the SI, the prefixes denote powers of ten.

Examples:

  1 kbit/s   =     1 000 bit/second        (data transmission rate)
  1 kbyte/cm =   800 000 bit/meter         (tape storage density)
  1 Mbyte    = 8 000 000 bit               (file length)

Note: It is possible and recommended by this proposal to use the
unabbreviated unit name 'byte' with abbreviated prefixes as shown
above. The abbreviation 'b' for 'byte' is mainly intended for
situations where a compact notation is desirable (e.g., in narrow
tables).
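
As a cross-check of the three examples above, the conversions can be
sketched in Python. The function name to_bit and the two lookup tables
are hypothetical helpers introduced only for this illustration.

```python
# SI prefixes as powers of ten (subset sufficient for the examples).
SI_PREFIX = {"": 1, "k": 10**3, "M": 10**6, "G": 10**9}

# Size of each unit in bit, following 1 byte = 8 bit.
UNIT_BITS = {"bit": 1, "byte": 8}


def to_bit(value, prefix="", unit="bit"):
    """Convert a prefixed quantity given in 'bit' or 'byte' to plain bit."""
    return value * SI_PREFIX[prefix] * UNIT_BITS[unit]


kbit_per_s = to_bit(1, "k")                  # 1 kbit/s in bit/s
kbyte_per_cm = to_bit(1, "k", "byte") * 100  # per cm -> per m (1 m = 100 cm)
mbyte = to_bit(1, "M", "byte")               # 1 Mbyte in bit

print(kbit_per_s, kbyte_per_cm, mbyte)  # 1000 800000 8000000
```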

As powers of two play a significant role in digital information
technology, the following SI prefixes can also be used to denote the
next higher power of two instead of a power of ten. In this case, a
subscripted digit 2 shall be appended to the abbreviation of the SI
prefix and the syllable 'di' shall be added before the prefix name. If
printing a lowered digit 2 in a smaller font is not possible (as in
this ISO 8859-1 file), a normal digit 2 directly after the prefix
abbreviation is also acceptable.

Binary prefixes:

  dikilo  = k2 = 2^10 = 1024
  dimega  = M2 = 2^20 = 1048576
  digiga  = G2 = 2^30 = 1073741824
  ditera  = T2 = 2^40 = 1099511627776
  dipeta  = P2 = 2^50 = 1125899906842624
  diexa   = E2 = 2^60 = 1152921504606846976
  dizetta = Z2 = 2^70 = 1180591620717411303424
  diyotta = Y2 = 2^80 = 1208925819614629174706176
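
The table follows a simple rule: the n-th SI prefix, which stands for
10^(3n), is reinterpreted as 2^(10n). A minimal Python sketch of that
rule is shown below; the names SI_PREFIXES and binary_prefixes are
assumptions made for this illustration.

```python
# SI prefix names and symbols in ascending order; entry n (n = 1..8)
# stands for 10^(3n) in the SI and for 2^(10n) as a binary prefix.
SI_PREFIXES = [("kilo", "k"), ("mega", "M"), ("giga", "G"), ("tera", "T"),
               ("peta", "P"), ("exa", "E"), ("zetta", "Z"), ("yotta", "Y")]


def binary_prefixes():
    """Return (name, symbol, exponent, factor) for each binary prefix."""
    rows = []
    for n, (name, symbol) in enumerate(SI_PREFIXES, start=1):
        exponent = 10 * n
        # 'di' goes before the name, the digit 2 after the symbol.
        rows.append(("di" + name, symbol + "2", exponent, 2**exponent))
    return rows


for name, symbol, exponent, factor in binary_prefixes():
    print(f"{name:<7} = {symbol} = 2^{exponent} = {factor}")
```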

Examples:

 - Maximum application memory size of a historical PC operating system:

     640 k2byte = 5 242 880 bit = 5 M2bit

 - Formatted capacity of a 90 mm (3.5 in) high density PC floppy disk:

     2 * 80 * 18 * 512 byte = 1440 k2b = 11 796 480 bit = 11.25 M2bit
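
The arithmetic in both examples can be verified with a few lines of
Python. The constants K2 and M2 are names chosen for this sketch, and
the floppy factors are read here as sides, tracks, sectors per track,
and byte per sector, which is an interpretation (the text gives only
the product).

```python
K2 = 2**10  # dikilo
M2 = 2**20  # dimega

# 640 k2byte expressed in bit and in M2bit
pc_bits = 640 * K2 * 8
pc_m2bit = pc_bits / M2

# 2 sides * 80 tracks * 18 sectors * 512 byte (assumed interpretation)
floppy_bytes = 2 * 80 * 18 * 512
floppy_k2b = floppy_bytes / K2
floppy_bits = floppy_bytes * 8
floppy_m2bit = floppy_bits / M2

print(pc_bits, pc_m2bit)                      # 5242880 5.0
print(floppy_k2b, floppy_bits, floppy_m2bit)  # 1440.0 11796480 11.25
```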

Note: The modern Unicode and ISO 10646 character sets include a
character SUBSCRIPT TWO as code hexadecimal 2082, which allows correct
display of the binary prefix abbreviations on computers.
Unfortunately, many very widely used old character sets (especially
ISO 8859 and IBM CP437) contain no such character.

Note: In some situations, binary prefixes might also be useful when
applied to other units. For example, the common digital wrist watch
crystal frequency is 32 768 Hz = 32 k2Hz or the number of pixels on a
1024x768 raster graphics screen could be denoted 0.75 dimegapixels.
Negative powers of two can also sometimes be useful as prefixes. For
example, a timer with a frequency of 1024 Hz = 1 k2Hz has a timing
resolution of 1/1024 s = 0.97656... ms = 1 m2s = 1 dimillisecond.
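
The figures in this note can be reproduced exactly with rational
arithmetic. In the Python sketch below, the variable names are
assumptions of the illustration, and the lowercase m2 mirrors the
abbreviation for dimilli = 2^-10 used above.

```python
from fractions import Fraction

K2 = 2**10               # dikilo  = 2^10
M2 = 2**20               # dimega  = 2^20
m2 = Fraction(1, 2**10)  # dimilli = 2^-10

crystal_k2hz = Fraction(32768, K2)    # watch crystal frequency in k2Hz
pixels_m2 = Fraction(1024 * 768, M2)  # screen size in dimegapixels
tick_s = Fraction(1, 1024)            # period of a 1024 Hz timer in s
tick_ms = float(tick_s * 1000)        # the same period in decimal ms

print(crystal_k2hz, pixels_m2, tick_s == 1 * m2, tick_ms)
# 32 3/4 True 0.9765625
```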

Note: In spoken language, the 'di' syllable can be omitted when it is
either clear or irrelevant from the context whether a power of two or
ten is described (like the "one megabit chip", which is obviously a
"one dimegabit chip"). In written texts however, the distinction shall
always be made clear by adding the (subscript) digit '2' to the
abbreviated prefix. The syllable 'di' has been selected, because it is
short, easy to pronounce in many languages, and offers some
consistency with the chemical notation (e.g., carbon dioxide = CO2).


3  Existing practice

The 10^3 versus 2^10 notation problem was discussed early in the
computer science literature. Suggestions included using the small
Greek letter kappa or the symbol 'bK' and its powers for powers of
1024 [7,8,9]. These proposals never gained significant acceptance and
are not well aligned with the SI and ISO 31 standards.

There seems to be some consensus in the technical world that the
prefix 'kilo' used with the units 'bit' and 'byte' denotes a factor of
2^10 = 1024. Unfortunately, this is highly inconsistent with the
official meaning of 'kilo' as specified in the SI standard, which is a
factor of 10^3 = 1000 and which is widely used this way in all areas
of science, technology, and trade. An often mentioned proposal to
solve this contradiction is to use the capital letter 'K' for 1024 and
the normal SI abbreviation for 'kilo', the small letter 'k', for 1000.
This solution, elegant at first glance, already fails with 'mega',
whose normal SI abbreviation is the capital letter 'M' and where the
small letter 'm' already denotes milli = 10^-3 = 0.001.

However, the SI prefixes are not used consistently today as powers of
two in the context of 'byte'. For example, a 90 mm floppy disk (yes,
it is really exactly 90 mm wide, not 3 1/2 inch!), which has a
formatted storage capacity of 18 * 2 * 80 * 512 byte = 1 474 560 byte
= 1.44 * 1000 * 1024 byte, is commonly referred to as a "1.44 megabyte
floppy disk". Here, 'mega' is quite commonly used as a prefix denoting
a factor of 1 024 000 instead of 2^20. In general, in the context of
magnetic and optical storage systems, both the definitions 1 megabyte
= 1000 * 1000 byte and 1 megabyte = 1000 * 1024 byte seem to be much
more popular than 1 megabyte = 1024 * 1024 byte.
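
To make the ambiguity concrete, the following Python sketch expresses
the same floppy disk capacity under each of the three definitions of
'megabyte' mentioned above. The dictionary layout and variable names
are assumptions of this illustration.

```python
floppy_bytes = 18 * 2 * 80 * 512  # formatted capacity in byte

# The three competing definitions of 1 megabyte, in byte.
megabyte_definitions = {
    "1000 * 1000": 1000 * 1000,
    "1000 * 1024": 1000 * 1024,
    "1024 * 1024": 1024 * 1024,
}

capacity = {label: floppy_bytes / size
            for label, size in megabyte_definitions.items()}

for label, value in capacity.items():
    print(f"1 megabyte = {label} byte -> {value} megabyte")
```

The same disk is thus a 1.47456, a 1.44, or a 1.40625 megabyte disk,
depending on the definition chosen.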

In the context of data transmission rates, the units 'kbit/s',
'Mbit/s', and 'Gbit/s' are today used consistently with the prefixes
referring to powers of ten.

In mid-1996, a new international standard proposal for power-of-two
prefixes was published by the IEC [11]. The IEC proposal is as follows:

  Prefixes for binary multiples for use in information technology

  Factor   Name   Symbol   Origin                 Derived from

   2^10    kibi     Ki     kilobinary: (2^10)^1   kilo: (10^3)^1
   2^20    mebi     Mi     megabinary: (2^10)^2   mega: (10^3)^2
   2^30    gibi     Gi     gigabinary: (2^10)^3   giga: (10^3)^3
   2^40    tebi     Ti     terabinary: (2^10)^4   tera: (10^3)^4

   Examples:  one kibibit:   1 Kibit = 2^10 bit
              one kilobit:   1 kbit  = 10^3 bit
              one mebibit:   1 Mibit = 2^20 bit
              one megabit:   1 Mbit  = 10^6 bit

   Note: Suggested pronunciation in English: The first syllable should
   be pronounced in the same way as in the first syllable of the
   corresponding SI prefix. The second syllable should be pronounced
   "bee".

UPDATE: This proposal was adopted as IEC 60027-2 in January 1999 [12,13].
It has been extended by the prefixes pebi and exbi for (2^10)^5 and
(2^10)^6. I obviously would have preferred the proposal from section 2.
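
For comparison, a quantity can be formatted with the IEC prefixes as
in the following Python sketch. The function name format_iec and its
rounding behaviour are assumptions of this illustration and are not
taken from the IEC text.

```python
# Symbols of the IEC binary prefixes kibi .. exbi, in ascending order.
IEC_PREFIXES = ["Ki", "Mi", "Gi", "Ti", "Pi", "Ei"]


def format_iec(count, unit="byte"):
    """Format a non-negative integer count using the largest IEC
    prefix whose factor does not exceed it (6 significant digits)."""
    if count < 0:
        raise ValueError("count must be non-negative")
    prefix, value = "", count
    for n, symbol in enumerate(IEC_PREFIXES, start=1):
        factor = 2**(10 * n)
        if count < factor:
            break
        prefix, value = symbol, count / factor
    return f"{value:g} {prefix}{unit}"


print(format_iec(1024, "bit"))  # 1 Kibit
print(format_iec(1000, "bit"))  # 1000 bit
print(format_iec(1474560))      # 1.40625 Mibyte
```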

It is sometimes suggested to abbreviate 'bit' as 'b' and 'byte' as 'B',
and one existing national standard [10] gives this definition.
Existing practice, however, seems to be to abbreviate 'byte' as either
'b' or 'B'. The unit 'bit' is also sometimes abbreviated as 'b' or 'B',
but it is usually not abbreviated at all.

In addition to 'bit' and 'byte', [3] also defines the units 'baud' (Bd)
(for modem data transmission symbol rate) and 'shannon' (Sh) (for
decision content and entropy). These and other quantities, units, and
abbreviations for units should probably be included here, too.

Theoretically, it might be possible to specify the unit 'bit' in terms
of existing SI base units for thermodynamic entropy. However, this has
very little practical application in either thermodynamics or
information technology. A quantity measured in 'bit' usually denotes
only a dimensionless number of binary symbols and therefore this
thermodynamic definition of the 'bit' is not suggested here.


References:

 [1] International Standard ISO 31-0, Quantities and units -- Part 0:
     General principles, International Organization for Standardization,
     Geneva, 1992.
 
 [2] Quantities and units, ISO Standards Handbook, International
     Organization for Standardization, third edition, 345 p., Geneva,
     1993, ISBN 92-67-10185-4, 161 CHF.

     This book contains the full text of ISO 31 and ISO 1000. Check
     <http://www.iso.ch/> for ordering details.

 [3] Terms and abbreviations for information quantities in
     telecommunication, CCITT Recommendation B.14, CCITT Blue Book,
     Fascicle I.3, International Telecommunication Union, Geneva, 1989.

 [4] International Standard ISO/IEC 2382-1, Information technology -
     Vocabulary - Part 1: Fundamental terms, International Organization
     for Standardization, Geneva, 1993.

 [5] International Standard ISO/IEC 2382-4, Information processing
     systems - Vocabulary - Part 04: Organization of data, International
     Organization for Standardization, Geneva, 1987.

 [6] German law about legal units for measurement: Ausführungsverordnung
     zum Gesetz über Einheiten im Meßwesen (Einheitenverordnung - EinhV),
     published 1985-12-13 in BGBl. I S. 2272, last modification
     1991-03-22 in BGBl. I S. 836.

 [7] Donald R. Morrison, Abbreviations for Computer and Memory Sizes,
     Communications of the ACM, Vol. 11, No. 3, March 1968, p. 150.

 [8] Wallace Givens, Proposed Abbreviation for 1024: bK, Communications
     of the ACM, Vol. 11, No. 6, June 1968, p. 391.

 [9] Bruce A. Martin, On Binary Notation, Communications of the ACM,
     Vol. 11, No. 10, October 1968, p. 658.

[10] ANSI/IEEE 260.1-1993: Letter Symbols for Units of Measurement.

[11] International Electrotechnical Commission (IEC), new work item
     proposal 25/180/NP, Amendment of IEC 27-2: Letter symbols to be
     used in electrical technology, Part 2: Telecommunications and
     electronics - Introduction of prefixes for binary multiples,
     IEC/TC 25/WG 1, June 1996.

[12] Anders J. Thor <athor@mech.kth.se>: IEC standardizes prefixes
     for binary multiples, IEC TC newsletter, p. 4, February 1999,
     <http://www.iec.ch/tclet6.pdf>.

[13] International System of Units (SI) - Prefixes for binary multiples.
     NIST Web page <http://physics.nist.gov/cuu/Units/binary.html>.

I wish to thank all people who have contributed information or helped
me to improve this proposal, including

   Lawrence Crowl      <Lawrence.Crowl@Eng.Sun.COM>
   Rainer Seitel       <ujo7@rzstud1.rz.uni-karlsruhe.de>
   Prof. Karl Kleine   <kleine@fh-jena.de>
   Olle Järnefors      <ojarnef@admin.kth.se>
   Bruce Barrow        <barrowb@ncr.disa.mil>
   Volker Seibt        <Volker.Seibt@lrz-muenchen.de>

Markus Kuhn

-- 
Markus Kuhn, Computer Laboratory, University of Cambridge
http://www.cl.cam.ac.uk/~mgk25/ || CB3 0FD, Great Britain