.Open Data Element Design
. Story -- Naming Names Handout
To understand how arbitrary coding can be, and how confusing the
results are, I will be handing out a piece of correspondence on the CSUSB
electronic bulletin board discussing why this campus's zip-code
(92407) is listed on the Internet as "Arrowhead Farms", and some sites
on the WWW list the phone numbers (909-537-nnnn) as in Riverside.
There is also a study of some well known code and how many there can be
before the coding scheme runs out. This was published in the
"American Scientist" magazine, Vol 93, Jan-Feb 2005. It covers the
following coding schemes:
.List
Internet account IDs
NY Stock Exchange Ticker Symbols -- up to 3 Uppercase letters -- 26+26**2+26*3 possible
Universal Product Codes -- bar codes -- (
.Key UPC
)-- 12-digit until January 2005, and then 13-digits.
Global Traded Item Numbers (GTIN)
European Article Numbers (EAN)
Biological Species -- Two Latin-like words
The Chemical Elements -- One uppercase letter + option lower case letter.
(I had to program these while working for ICI in the 1960's)
Organic Chemicals -- Complex coding.
Internet Assigned Number Authority
.Key IANA
names for countries -- two ASCII letters
Radio Call Signs -- 3..4 Capital letters -- KVCR, KFROG, ...
Telephone numbers -- 10 digits (in the USA).
Social Security Numbers -- 9 digits in the USA
Airport codes (IATA) -- Three letters.
Names of Horses -- 2..18 characters ( letters plus space, period, and apostrophe).
.Close.List
. Story -- Changing Student IDs
Once upon a time, it took this campus a year to change all its records from
using the Social Security Number to a campus assigned number. Every record
in a dozen different data bases had to be changed. Why did
CSUSB spend so much time and money changing one
element. Because it is illegal
to use the SSN for non-SSN type purposes. We also wanted
to avoid identity theft.
.Open Data Options
We look for the best technology to represent our data.
Each data flow in a DFD has many ways to be implemented. It could
be a phone call, a memo, a signal between devices, or a the sending
of a record of data between parts of the system.
Each flow from an external entity into your system needs an input device.
Each store needs a storage device. Each flow between processes can be
internal to a program or use a network. And each flow out to an
external entity will use an output device.
Choosing a good device and technology can make a system fly rather
than crash. So you need to know about the options. The hardware options
were covered in
.See ./a2.html
earlier.
When the data is digital we still need to choose how to encode different
things.
Encoding data is a return to
.Key Physical Design
from
.Key Logical Design.
On the other hand drawing up a logical model of data is all about ignoring
the actual encoding of the data.
.Open Reminder -- use UML Classes to model logical data records
Draw a class box with two compartments (suppress the
third compartment if using a tool).
Put the record name in the center of the top compartment -- the class name.
List the attributes inside the second compartment -- these are the elements.
Example:
.Image ./12DataUML.gif [Example of two records described in UML]
.Close
.Close
.Open Types of code
.Open Special Encodings
.Open Encoding Data Elements
There are a number of well known ways to encode data elements.
.Open Common Numerical Codes
. Roman
Humans tend to use a notation for numbers
based on the number of fingers on a hand. In
fact the word `digit` comes from the word for finger. In Roman,
times the 'V' stood for a hand (5= one thumb + four fingers) and the 'X'
was two hands (one up one down) or 10 fingers.
. Decimal
From the Arabs we have leaned the "decimal" notation. We encode the digits as 0,1,2,3,4,5,6,7,8,9. Then we can represent any whole number by
string them together:
digit::=0..9.
number::= one or more digits.
987 = 100*9 + 10*8 + 1*7.
For number n, digit d, value( n d ) = 10*value(n) +d.
. Binary
Computers use binary (you can find examples of this in Ancient
China and Leibniz). This has two digits: 0, 1. So:
110011 = 2**5 + 2*4 + 2**1 + 2**0 = 32+16+3 = 61.
In about 1945 Shannon defined
bit::="Binary digIT",
is either 0 or 1.
binary_number::= one or more bits.
For binary_number n, bit b, value( n b ) = 2*value(n) +b.
. Octal and Hex
This has resulted in many computer people being able to recite the powers
of two up to the highest address on their favorite machine.
However, calling out or typing 20 binary digits is a little inefficient,
and decimal notation disguises the binary format (which is often
significant) and so computer people tend to use base 8 (octal) and
base 16 (hexadecimal) notation:
nibble::= 4 bits.
byte::=2 nibbles.
Nibbles are written and spoken using the hexadecimal digits,
0(=0000),1=(0001),2,3,4,5,6,7,8,9,A,B,C,D,E,F (=1111).
. Signed Integers
In scientific computations integers are encoded using binary typically
with 8, 16, 32, ... bits. One extra bit indicates whether the number is
negative or positive.
In commercial systems it was common to find numbers encoded using
BCD::="binary coded decimal".
Here a number like 987 was encoded by three decimal digits each
represented in binary:
1001 1000 0111
This wastes some bits but is very convenient for important things like
dollars and cents.
. Real Numbers
Real numbers (measurements) are encoded
using "floating point" where a number has two parts called the
mantissa and the exponent, both encoded in binary. The value is then
mantissa * 2**exponent
Floating point works well when we need a wide range of values and can put
up with larger errors on the larger numbers.
.Close
.Open Standard Character Codes
At one time each manufacturer had its own binary code for characters. This
continued up to the 1980's
In the 1960's the American standards people proposed what has become the
standard 8 bit coding for characters -- ASCII
ASCII::="American Standard Code for Information Interchange".
ASCII covers all the characters needed for American needs, but has become the
`de facto` standard on the Internet, and whenever data needs to be shared. The
International Standards Organization treats ASCII as a specialized code
for use in America. In the UK, the American "#" becomes the symbol for
the British pound. Each European country has its own special symbols.
IBM tried to create its own standard -- an Extended Binary
Coded Decimal code named EBCDIC. This will disappear with the last mainframe.
Recently, a new standard -- Unicode -- has been created that covers just about
every character in every alphabet in the world. This is a 16-bit code. ASCII
and the ISO codes appear within it.
The Web uses HTML and HTML has introduced a number of special "entities" for
showing non-ASCII characters like \Sigma and \alpha. These are given
numbers and encode in HTML like this:
"&" digit digit digit ";"
.Close
. Keys
A key is an item of data that is used to uniquely define a single element
in a file, database, or table. These are typically a fixed number of
characters.
. Sequence numbers
Example: Number of a student on roster.
Numbers assigned in a specific order -- typically when created.
. Block sequences
Example: CSUSB Course call Number
Blocks of numbers given to different parts of the organization to allocate (in sequence) to data records.
. Alphabetic Abbreviations and Mnemonics
Example: States(CA, AL,...), IANA and DNS countries(uk, tv, ru, ...),
IATA (LAX, ONT, LHR, MSP, ...)
Abbreviation for a department teaching a course -- CSCI PHYS MATH ENG.
Abbreviations for buildings on campus.
A string of letters is chosen to identify an entity or a group of entities
of a given type.
A few letters stand in for a word or phrase
. Arbitrary Systematic Codings
Example: Library of congress subject classification, Dewey Decimal system for books.
The Assoc for Computing CCS system for computer science.
Linnaeus's technique for species.
.Open Significant subgroups
In many cases a data element is actually made up of smaller elements
each with their own coding.
. Digit Groups
Example: Zip codes(92407-1133), phone numbers((909)-537-5257),..., SSN,
URLs.
Different groups of digits/characters in the data are themselves coded data.
For example in a 9-digit ZIP-code the first digit determines a geographical
area, the next three the town, and the 5th digit a post-office. The next 4
digits identify a delivery point.
. Derived Codes
Example: My UK Driving license number, CSUSB Library call numbers, Subscriber
codes for magazines, Rooms on campus.
Mixes different coded data into one element
. Ciphers
Example: Spoof at ICI (a number added to the paint sales). DES. PGP. ...
Numbers are disguised for security or mnemonic purposes
Passwords should be encrypted as soon as they are entered and never stored
without salting and hashing!
. Actions
Example: A=Add, D=Delete, ..., The 50+ actions that the 'vi'
editor has built in to it, Mnemonic codes in assembler.
Codes that represent actions. Transaction codes.
. Self-checking Elements
Example: 9s remainder and 11s remainder check digits.
Uses an added digit or character calculated from the rest.
.Close
.Close
.Open Encoding Compound data
Elements usually occur together in logical grouping.
The problem to be solved is to design an efficient way to combine
them so that the data can be extracted in only one way. Extracting data
elements from compound data is called
`parsing`.
When there is more than one way of
.Key parsing
a compound data group
then the coding is said to be `ambiguous`.
.Key Ambiguous Encodings
are bad encodings.
There are four classic ways of encoding these. We also have the
new markup notations:
.List
Fixed Length fields: Phone number.
Delimiters and counters: tab delimitted spread sheet.
Complex Syntax: C++ programs.
Marked Up Text: HTML, XML, ...
.Close.List
. Record Structures
In a record structure a number of `fields` are defined, each with a given
length (bits or bytes). Each field has a name and its own encoding.
A very popular technique in RAM and in Data Bases.
Record structures rely on each element being a fixed length.
And this in turn gave us one of the most successful bugs (so far)
in the history of computing: the Y2K Bug. Just about every system was
at risk of breaking down when a 2 digit field could no longer
handle years greater than 1999. And just about all of these
problems were found and fixed before the date arrived. Business as usual,
computer people had to work hard and late into the night to make
computers do what they were supposed to do.
. Delimiters
A common kind of data is the string. It can have any length
When the data does not have a fixed length the end can be marked by a special
sentinel character -- a `delimiter`. In programming languages for example
we indicate the start and end of a string by using a quotation mark.
. Count Fields
Another technique is to add a field indicate the length of the following data
element -- an encoded counter or integer. Humans do not find this as easy
as delimitted data
. Simple Syntax
A very simple way to place several variable length fields into a single
string is to make sure that the first character of each field can not be
a character in the previous field. A simple example would be to choose
alternating digits and letter encoding.
.Open Markup Languages
In recent years there have been a lot of markup languages defined for
different purposes. The idea is to have explicit identification of the
parts of the data. Typically something like
.As_is RichardJBotting
is a piece of text with added "tags" that indicate the meaning of the parts.
In a
.See Record Structure
(above)
the "tags" are not needed because their sequence is known and the lengths
are fixed (or at least predictable). Thus we get an encoding that
is guaranteed not to be ambiguous, is to some extent easy to read,
and is somewhat inefficient.
. RTF -- Rich Text Format
Here the tags indicate the appearance of the text. Probably developed
with early word processor software. Still a format used with MS Word.
. SGML -- The Standard Generalized Markup Language
IBM defined SGML so that you could created and define tags for any purpose. It
is an amazingly difficult encoding to use.
. HTML -- The HyperText Markup Language
HTML is a specific application of SGML defined to describe and link
pages on the world-Wide Web. It has gone through half-a-dozen versions.
A new one stays within the XML (next) conventions.
. XML -- the eXtendable Markup Language
HTML describes the appearance of pages. XML tries to describe the
meaning of data not its appearance. It has a very simple basis using
.As_is
.As_is
to delimit data. Tags can also have attributes:
.As_is Unix Training.
XML also allows some tags to be unpaired and these are shown like this:
.As_is
XML documents can be parsed fairly easily.
For each application that uses XML must have a DTD -- Document Type Definition
published that defines the structure of the data -- what tags can
appear inside others. Defining a DTD takes a significant amount of work.
But once defined you can use tools to check validity, ...
.Close Markup Languages
. Complex Syntax
Complex syntax gets us into natural and artificial languages. It is
rare that we need to express natural data groupings using complex
syntax. When we do we can use a extension of the syntactic meta-languages like
Backus-Naur Form (BNF).
In computer science most of our knowledge about linguistic design has been
put into designing programming languages.
Programming languages are the most complicated schemes for encoding
a domain in existence. There are hundreds of them. For more
take a CSCI Programming Language class like our CSCI320
.See ../cs320/
(Advert).
.Close Encoding Compound data
. Experience -- coding data in the ICI Infra-Red Spectrum Analysis Program.
.Close Special Encodings
.Open Guidelines for encoding data
Make your codes -
.List
Concise
Expandable
Stable
Unique
Sortable
Meaningful
Single Purpose
Consistent
Clear -- Don't allow digits and letters in one field. They look too alike:
(0=O, 1=l=|, 2=Z, 5=S, 6=b, 8=B)
.Close.List
Use standard codes and technologies whenever possible!
Collect input automatically if at all possible.
Get the input where it is available.
All data storage needs back up and archiving.... but for how long?
All output needs data control, security, and destruction.
Who needs dead trees? Paper data is dead data.
All stored or transmitted data implies a security problem.
All data needs some kind of security -- how much and at what cost?
Design I/O with care and with the help of the users.
.Close
. Information theory and combinatorics
An elegant theory of coding data and signals. It shows that
you can measure the information needed for a given purpose and calculate
the maximum rate at which data can be sent through a given channel
using the same measure: bits.
Do the math. MATH272+MATH262.
. Reference and Online Resources
. Universal Product Code
I don't expect you to understand UPC but if you are interested in these
ubiquitous
.Key bar-codes
see
.See http://en.wikipedia.org/wiki/Universal_Product_Code
for the details and history.
. Samples of Syntax Definitions
My
.See http://www.csci.csusb.edu/dick/samples/
define a large number of sophisticated coding schemes including
programming languages and meta-languages.
. ASCII
For reference purposes see
.See http://csci.csusb.edu/dick/samples/comp.text.ASCII.html
. Markup Languages
For reference purposes see
.See http://www.csci.csusb.edu/dick/samples/index.html#Mark Up Languages
(my notes).
. HTML
.See ../samples/comp.html.syntax.html
. XML
.See ../samples/xml.html
.Open XML reference
Robert Eckstein
The XML Pocket Reference
O'Reilly ISBN 3692082709 1999
.Close
Up to date reference
.See http://proquest.safaribooksonline.com/9780596100506
.Close XML
.Open Review Questions
When did changing the coding of one element become important in CSUSB and
the recoding take more than 6 months to complete?
What is IANA is and what does it do?
Define: ASCII, Unicode, BCD, EBCDIC.
Parse the following data strings -- label the parts:
.List
CSCI201-02
http://csci.csusb.edu/dick/cs372/d2.html#Hints for encoding Data
rbotting@csusb.edu
telnet:139.182.151.222
LOL
.Close.List
Take the above parsings and express it using a plausible XML syntax.
For example the first might be
.As_is CSCI20102
How many ways can you encode a date?
How many ways can you encode a time on a given day?
How about encoding a time on any day....
Research: How does the California DMV assign number plates? Can you
have an obscene vanity plate? How can California enforce this rule?
Compare doing the same calculation in Arabic, Roman, and Binary:
.List
32 * 16 = ?
XXXII * XVI = ?
100000 * 10000 = ?
.Close.List
Count from 0000 to 1111 in Binary, and in Hexadecimal.
Visit my office and figure out how I've encoded the day of the month in it.
.Close
.Close