There is also a study of some well-known codes and how many of each there
can be before the coding scheme runs out. It was published in
"American Scientist" magazine, Vol. 93, Jan-Feb 2005. It covers the
following coding schemes:
Decimal
From the Arabs we have learned the "decimal" notation. We encode the digits as 0,1,2,3,4,5,6,7,8,9. Then we can represent any whole number by
stringing them together. For example, 2005 means 2*1000 + 0*100 + 0*10 + 5.
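As a minimal sketch of this positional idea (the function name `decimal_value` is made up for illustration), each new digit shifts the running value one decimal place to the left:

```python
DIGITS = "0123456789"

def decimal_value(digits):
    """Return the whole number encoded by a string of decimal digits."""
    value = 0
    for ch in digits:
        # Multiply by ten to shift left one place, then add the new digit.
        value = value * 10 + DIGITS.index(ch)
    return value

print(decimal_value("2005"))  # 2005
```

The same loop works for any base if the digit string and the multiplier are changed together.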
In about 1945 Shannon defined
Octal and Hex
This has resulted in many computer people being able to recite the powers
of two up to the highest address on their favorite machine.
However, calling out or typing 20 binary digits is a little inefficient,
and decimal notation disguises the binary format (which is often
significant) and so computer people tend to use base 8 (octal) and
base 16 (hexadecimal) notation:
Nibbles (groups of 4 bits) are written and spoken using the hexadecimal digits 0 (=0000), 1 (=0001), 2, 3, 4, 5, 6, 7, 8, 9, A, B, C, D, E, F (=1111).
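The following sketch shows one 16-bit value in binary, octal, and hexadecimal using Python's built-in `format`. The sample value and the helper name `hex_to_bits` are assumptions chosen for illustration.

```python
n = 0b1111_0000_1010_0101  # 16 binary digits, grouped into nibbles

print(format(n, "b"))  # 1111000010100101
print(format(n, "o"))  # 170245  (each octal digit stands for 3 bits)
print(format(n, "X"))  # F0A5    (each hex digit stands for one nibble)

def hex_to_bits(hex_digits):
    """Expand each hexadecimal digit into its 4-bit nibble."""
    return " ".join(format(int(d, 16), "04b") for d in hex_digits)

print(hex_to_bits("F0A5"))  # 1111 0000 1010 0101
```

Four hexadecimal digits replace sixteen binary ones, yet the bit pattern can still be read off digit by digit.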
Signed Integers
In scientific computations integers are encoded in binary, typically
with 8, 16, 32, ... bits. One of these bits indicates whether the number is
negative or positive.
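The text does not say which signed encoding is meant. The sketch below assumes two's complement, which is what most current hardware uses; the function names are made up for illustration. The leftmost bit acts as the sign indicator.

```python
def to_twos_complement(value, bits=8):
    """Return the two's complement bit pattern of value as a string."""
    lowest = -(1 << (bits - 1))
    highest = (1 << (bits - 1)) - 1
    if not lowest <= value <= highest:
        raise ValueError("value does not fit in {} bits".format(bits))
    # Masking maps negative values onto the upper half of the bit patterns.
    return format(value & ((1 << bits) - 1), "0{}b".format(bits))

def from_twos_complement(pattern):
    """Decode a two's complement bit pattern back into an integer."""
    raw = int(pattern, 2)
    if pattern[0] == "1":  # leading 1 means the number is negative
        raw -= 1 << len(pattern)
    return raw

print(to_twos_complement(5))             # 00000101
print(to_twos_complement(-5))            # 11111011
print(from_twos_complement("11111011"))  # -5
```

With 8 bits this covers -128 through 127, so one bit of range is spent on the sign.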
In commercial systems it was common to find numbers encoded using binary coded decimal (BCD).
Here a number like 987 was encoded by three decimal digits, each represented in four binary digits: 1001 1000 0111.
This wastes some bits but is very convenient for important things like
dollars and cents.
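A small sketch of digit-by-digit encoding, assuming four bits per decimal digit (the helper names `to_bcd` and `from_bcd` are hypothetical):

```python
def to_bcd(number):
    """Encode a non-negative integer one decimal digit per 4-bit group."""
    if number < 0:
        raise ValueError("this sketch handles non-negative numbers only")
    return " ".join(format(int(d), "04b") for d in str(number))

def from_bcd(groups):
    """Decode space-separated 4-bit groups back into an integer."""
    return int("".join(str(int(g, 2)) for g in groups.split()))

print(to_bcd(987))                 # 1001 1000 0111
print(from_bcd("1001 1000 0111"))  # 987
print(len(format(987, "b")))       # 10 bits in plain binary, against 12 here
```

The two extra bits are the waste mentioned above. In exchange, every decimal digit, including the cents, is stored exactly.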
Real Numbers
Real numbers (measurements) are encoded
using "floating point", where a number has two parts called the
mantissa and the exponent, both encoded in binary. The value is then
mantissa * 2^exponent.
Floating point works well when we need a wide range of values and can put up with larger errors on the larger numbers.
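The sketch below uses Python's standard `math.frexp` and `math.ldexp` to split a float into a mantissa and a power-of-two exponent and to put it back together. It assumes Python floats are IEEE 754 double precision, which holds on common platforms.

```python
import math

mantissa, exponent = math.frexp(6.5)
print(mantissa, exponent)             # 0.8125 3, since 0.8125 * 2**3 == 6.5
print(math.ldexp(mantissa, exponent)) # 6.5

# The gap between neighbouring floats grows with the size of the number.
small = 1.0
large = 1e16
print(small + 1.0 == small)  # False: adding 1 to a small number is exact
print(large + 1.0 == large)  # True: near 1e16 the added 1 is lost in rounding
```

The last line illustrates the larger errors on larger numbers: near 1e16 neighbouring doubles are 2 apart.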
In the 1960s the American standards people proposed what has become the standard coding for characters -- ASCII. It is a 7-bit code, usually stored one character per 8-bit byte.
ASCII covers all the characters needed for American use, but it has become the de facto standard on the Internet and wherever data needs to be shared. The International Organization for Standardization (ISO) treats ASCII as a specialized code for use in America. In the UK variant, the American "#" becomes the symbol for the British pound. Each European country has its own special symbols.
IBM tried to create its own standard -- the Extended Binary Coded Decimal Interchange Code, named EBCDIC. This will disappear with the last mainframe.
Recently, a new standard -- Unicode -- has been created that covers just about every character in every alphabet in the world. It began as a 16-bit code and has since grown to code points up to U+10FFFF, usually stored as UTF-8 or UTF-16. ASCII and the ISO codes appear within it.
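To make the relation between ASCII and Unicode concrete, this sketch uses Python's built-in `ord` and `str.encode`. The helper name `describe` and the choice of sample characters are assumptions for illustration.

```python
def describe(ch):
    """Return the code point label and the UTF-8 byte count of one character."""
    label = "U+{:04X}".format(ord(ch))
    return label, len(ch.encode("utf-8"))

for ch in ["A", "£", "Σ", "€"]:
    label, size = describe(ch)
    print(label, size, ch.encode("utf-8"))

# ASCII characters keep their ASCII numbers and need a single byte.
print(ord("A"), "A".encode("ascii") == "A".encode("utf-8"))  # 65 True
```

Characters outside ASCII take two or more bytes in UTF-8, which is why a character and a byte are no longer the same thing.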
The Web uses HTML, and HTML has introduced a number of special "entities" for showing non-ASCII characters like Σ and α. These are given numbers and encoded in HTML like this: &#931; for Σ and &#945; for α.
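The numbers in such entities are the characters' Unicode code points. As a hedged sketch, Python's standard `html` module can decode them; the helper `numeric_entity` is made up here to show the reverse direction.

```python
import html

def numeric_entity(ch):
    """Build the numeric HTML entity for a single character."""
    return "&#{};".format(ord(ch))

print(numeric_entity("Σ"))              # &#931;
print(numeric_entity("α"))              # &#945;

# html.unescape accepts both the numeric and the named forms.
print(html.unescape("&#931;") == "Σ")   # True
print(html.unescape("&alpha;") == "α")  # True
```

Because the entity text itself is plain ASCII, a page can carry these characters through systems that only handle ASCII safely.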
There are four classic ways of encoding these. We also have the
new markup notations:
<name><first>Richard</first><initial>J</initial><family>Botting</family></name>
is a piece of text with added "tags" that indicate the meaning of the parts. In a [ Record Structure ] (above) the "tags" are not needed because their sequence is known and the lengths are fixed (or at least predictable). Thus we get an encoding that is guaranteed not to be ambiguous, is to some extent easy to read, and is somewhat inefficient.
XML uses <tags> and </end tags> to delimit data. Tags can also have attributes:
<certificate type="participation">Unix Training</certificate>.
XML also allows some tags to be unpaired and these are shown like this:
<endless tag attributes... />
XML documents can be parsed fairly easily.
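As one illustration of how little code parsing takes, the sketch below feeds the two sample fragments from this section to Python's standard `xml.etree.ElementTree`. It is only one of many XML parsers, and the variable names are arbitrary.

```python
import xml.etree.ElementTree as ET

record = ("<name><first>Richard</first><initial>J</initial>"
          "<family>Botting</family></name>")
root = ET.fromstring(record)

# Each child element carries its own tag, so the order need not be known.
fields = {child.tag: child.text for child in root}
print(fields["family"])  # Botting

cert = ET.fromstring(
    '<certificate type="participation">Unix Training</certificate>')
print(cert.get("type"), "/", cert.text)  # participation / Unix Training

# A tag that is never closed is rejected rather than guessed at.
try:
    ET.fromstring("<name><first>Richard</name>")
    well_formed = True
except ET.ParseError:
    well_formed = False
print(well_formed)  # False
```

The parser checks only that tags nest properly. Whether the right tags appear in the right places is a separate validity question.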
Each application that uses XML must have a published DTD -- Document Type Definition -- that defines the structure of the data: what tags can appear inside others. Defining a DTD takes a significant amount of work. But once defined you can use tools to check validity, ...
. . . . . . . . . ( end of section Markup Languages)
Complex Syntax
Complex syntax gets us into natural and artificial languages. It is
rare that we need to express natural data groupings using complex
syntax. When we do, we can use an extension of the syntactic meta-languages like
Backus-Naur Form (BNF).
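As a hedged illustration of how a BNF definition maps onto code, here is a toy grammar (invented for this sketch, not taken from the notes) with one function per rule:

```python
# <expr>  ::= <term> | <term> "+" <expr>
# <term>  ::= <digit> | "(" <expr> ")"
# <digit> ::= "0" | "1" | ... | "9"

def parse_expr(text, pos=0):
    """Return the position just past one <expr>, or raise ValueError."""
    pos = parse_term(text, pos)
    if pos < len(text) and text[pos] == "+":
        return parse_expr(text, pos + 1)
    return pos

def parse_term(text, pos):
    """Return the position just past one <term>, or raise ValueError."""
    if pos < len(text) and text[pos] in "0123456789":
        return pos + 1
    if pos < len(text) and text[pos] == "(":
        pos = parse_expr(text, pos + 1)
        if pos < len(text) and text[pos] == ")":
            return pos + 1
        raise ValueError("expected ')' at position {}".format(pos))
    raise ValueError("expected digit or '(' at position {}".format(pos))

def matches(text):
    """True when the whole string is one <expr> of the toy grammar."""
    try:
        return parse_expr(text) == len(text)
    except ValueError:
        return False

print(matches("1+(2+3)"))  # True
print(matches("(1+2"))     # False
```

The recursion in the grammar rules becomes recursion between the functions, which is what lets nested groupings of any depth be recognized.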
In computer science most of our knowledge about linguistic design has been put into designing programming languages. Programming languages are the most complicated schemes for encoding a domain in existence. There are hundreds of them. For more, take a CSCI Programming Language class like our CSCI320 [ ../cs320/ ] (Advert).
. . . . . . . . . ( end of section Encoding Compound data)
Experience -- coding data in the ICI Infra-Red Spectrum Analysis Program.
. . . . . . . . . ( end of section Special Encodings)
Guidelines for encoding data
Reference and Online Resources
Universal Product Code
I don't expect you to understand UPC, but if you are interested in these
ubiquitous bar-codes see [ Universal_Product_Code ] for the details and history.
Samples of Syntax Definitions
My samples [ http://www.csci.csusb.edu/dick/samples/ ]
define a large number of sophisticated coding schemes including
programming languages and meta-languages.
ASCII
For reference purposes see [ comp.text.ASCII.html ].
Markup Languages
For reference purposes see [ Mark Up Languages in index ] (my notes).
HTML
[ ../samples/comp.html.syntax.html ]
XML
[ ../samples/xml.html ]
XML reference
. . . . . . . . . ( end of section XML)
<section><department>CSCI</department><cnumber>201</cnumber><sectionnumber>02</sectionnumber></section>
Also see [ glossary.html ] for more special abbreviations and phrases.