.Open Data Element Design . Story -- Naming Names Handout To understand how arbitrary coding can be, and how confusing the results are, I will be handing out a piece of correspondence on the CSUSB electronic bulletin board discussing why this campus's zip-code (92407) is listed on the Internet as "Arrowhead Farms", and some sites on the WWW list the phone numbers (909-537-nnnn) as in Riverside. There is also a study of some well known code and how many there can be before the coding scheme runs out. This was published in the "American Scientist" magazine, Vol 93, Jan-Feb 2005. It covers the following coding schemes: .List Internet account IDs NY Stock Exchange Ticker Symbols -- up to 3 Uppercase letters -- 26+26**2+26*3 possible Universal Product Codes -- bar codes -- ( .Key UPC )-- 12-digit until January 2005, and then 13-digits. Global Traded Item Numbers (GTIN) European Article Numbers (EAN) Biological Species -- Two Latin-like words The Chemical Elements -- One uppercase letter + option lower case letter. (I had to program these while working for ICI in the 1960's) Organic Chemicals -- Complex coding. Internet Assigned Number Authority .Key IANA names for countries -- two ASCII letters Radio Call Signs -- 3..4 Capital letters -- KVCR, KFROG, ... Telephone numbers -- 10 digits (in the USA). Social Security Numbers -- 9 digits in the USA Airport codes (IATA) -- Three letters. Names of Horses -- 2..18 characters ( letters plus space, period, and apostrophe). .Close.List . Story -- Changing Student IDs Once upon a time, it took this campus a year to change all its records from using the Social Security Number to a campus assigned number. Every record in a dozen different data bases had to be changed. Why did CSUSB spend so much time and money changing one element. Because it is illegal to use the SSN for non-SSN type purposes. We also wanted to avoid identity theft. .Open Data Options We look for the best technology to represent our data. Each data flow in a DFD has many ways to be implemented. It could be a phone call, a memo, a signal between devices, or a the sending of a record of data between parts of the system. Each flow from an external entity into your system needs an input device. Each store needs a storage device. Each flow between processes can be internal to a program or use a network. And each flow out to an external entity will use an output device. Choosing a good device and technology can make a system fly rather than crash. So you need to know about the options. The hardware options were covered in .See ./a2.html earlier. When the data is digital we still need to choose how to encode different things. Encoding data is a return to .Key Physical Design from .Key Logical Design. On the other hand drawing up a logical model of data is all about ignoring the actual encoding of the data. .Open Reminder -- use UML Classes to model logical data records Draw a class box with two compartments (suppress the third compartment if using a tool). Put the record name in the center of the top compartment -- the class name. List the attributes inside the second compartment -- these are the elements. Example: .Image ./12DataUML.gif [Example of two records described in UML] .Close .Close .Open Types of code .Open Special Encodings .Open Encoding Data Elements There are a number of well known ways to encode data elements. .Open Common Numerical Codes . Roman Humans tend to use a notation for numbers based on the number of fingers on a hand. In fact the word `digit` comes from the word for finger. In Roman, times the 'V' stood for a hand (5= one thumb + four fingers) and the 'X' was two hands (one up one down) or 10 fingers. . Decimal From the Arabs we have leaned the "decimal" notation. We encode the digits as 0,1,2,3,4,5,6,7,8,9. Then we can represent any whole number by string them together: digit::=0..9. number::= one or more digits. 987 = 100*9 + 10*8 + 1*7. For number n, digit d, value( n d ) = 10*value(n) +d. . Binary Computers use binary (you can find examples of this in Ancient China and Leibniz). This has two digits: 0, 1. So: 110011 = 2**5 + 2*4 + 2**1 + 2**0 = 32+16+3 = 61. In about 1945 Shannon defined bit::="Binary digIT", is either 0 or 1. binary_number::= one or more bits. For binary_number n, bit b, value( n b ) = 2*value(n) +b. . Octal and Hex This has resulted in many computer people being able to recite the powers of two up to the highest address on their favorite machine. However, calling out or typing 20 binary digits is a little inefficient, and decimal notation disguises the binary format (which is often significant) and so computer people tend to use base 8 (octal) and base 16 (hexadecimal) notation: nibble::= 4 bits. byte::=2 nibbles. Nibbles are written and spoken using the hexadecimal digits, 0(=0000),1=(0001),2,3,4,5,6,7,8,9,A,B,C,D,E,F (=1111). . Signed Integers In scientific computations integers are encoded using binary typically with 8, 16, 32, ... bits. One extra bit indicates whether the number is negative or positive. In commercial systems it was common to find numbers encoded using BCD::="binary coded decimal". Here a number like 987 was encoded by three decimal digits each represented in binary: 1001 1000 0111 This wastes some bits but is very convenient for important things like dollars and cents. . Real Numbers Real numbers (measurements) are encoded using "floating point" where a number has two parts called the mantissa and the exponent, both encoded in binary. The value is then mantissa * 2**exponent Floating point works well when we need a wide range of values and can put up with larger errors on the larger numbers. .Close .Open Standard Character Codes At one time each manufacturer had its own binary code for characters. This continued up to the 1980's In the 1960's the American standards people proposed what has become the standard 8 bit coding for characters -- ASCII ASCII::="American Standard Code for Information Interchange". ASCII covers all the characters needed for American needs, but has become the `de facto` standard on the Internet, and whenever data needs to be shared. The International Standards Organization treats ASCII as a specialized code for use in America. In the UK, the American "#" becomes the symbol for the British pound. Each European country has its own special symbols. IBM tried to create its own standard -- an Extended Binary Coded Decimal code named EBCDIC. This will disappear with the last mainframe. Recently, a new standard -- Unicode -- has been created that covers just about every character in every alphabet in the world. This is a 16-bit code. ASCII and the ISO codes appear within it. The Web uses HTML and HTML has introduced a number of special "entities" for showing non-ASCII characters like \Sigma and \alpha. These are given numbers and encode in HTML like this: "&" digit digit digit ";" .Close . Keys A key is an item of data that is used to uniquely define a single element in a file, database, or table. These are typically a fixed number of characters. . Sequence numbers Example: Number of a student on roster. Numbers assigned in a specific order -- typically when created. . Block sequences Example: CSUSB Course call Number Blocks of numbers given to different parts of the organization to allocate (in sequence) to data records. . Alphabetic Abbreviations and Mnemonics Example: States(CA, AL,...), IANA and DNS countries(uk, tv, ru, ...), IATA (LAX, ONT, LHR, MSP, ...) Abbreviation for a department teaching a course -- CSCI PHYS MATH ENG. Abbreviations for buildings on campus. A string of letters is chosen to identify an entity or a group of entities of a given type. A few letters stand in for a word or phrase . Arbitrary Systematic Codings Example: Library of congress subject classification, Dewey Decimal system for books. The Assoc for Computing CCS system for computer science. Linnaeus's technique for species. .Open Significant subgroups In many cases a data element is actually made up of smaller elements each with their own coding. . Digit Groups Example: Zip codes(92407-1133), phone numbers((909)-537-5257),..., SSN, URLs. Different groups of digits/characters in the data are themselves coded data. For example in a 9-digit ZIP-code the first digit determines a geographical area, the next three the town, and the 5th digit a post-office. The next 4 digits identify a delivery point. . Derived Codes Example: My UK Driving license number, CSUSB Library call numbers, Subscriber codes for magazines, Rooms on campus. Mixes different coded data into one element . Ciphers Example: Spoof at ICI (a number added to the paint sales). DES. PGP. ... Numbers are disguised for security or mnemonic purposes Passwords should be encrypted as soon as they are entered and never stored without salting and hashing! . Actions Example: A=Add, D=Delete, ..., The 50+ actions that the 'vi' editor has built in to it, Mnemonic codes in assembler. Codes that represent actions. Transaction codes. . Self-checking Elements Example: 9s remainder and 11s remainder check digits. Uses an added digit or character calculated from the rest. .Close .Close .Open Encoding Compound data Elements usually occur together in logical grouping. The problem to be solved is to design an efficient way to combine them so that the data can be extracted in only one way. Extracting data elements from compound data is called `parsing`. When there is more than one way of .Key parsing a compound data group then the coding is said to be `ambiguous`. .Key Ambiguous Encodings are bad encodings. There are four classic ways of encoding these. We also have the new markup notations: .List Fixed Length fields: Phone number. Delimiters and counters: tab delimitted spread sheet. Complex Syntax: C++ programs. Marked Up Text: HTML, XML, ... .Close.List . Record Structures In a record structure a number of `fields` are defined, each with a given length (bits or bytes). Each field has a name and its own encoding. A very popular technique in RAM and in Data Bases. Record structures rely on each element being a fixed length. And this in turn gave us one of the most successful bugs (so far) in the history of computing: the Y2K Bug. Just about every system was at risk of breaking down when a 2 digit field could no longer handle years greater than 1999. And just about all of these problems were found and fixed before the date arrived. Business as usual, computer people had to work hard and late into the night to make computers do what they were supposed to do. . Delimiters A common kind of data is the string. It can have any length When the data does not have a fixed length the end can be marked by a special sentinel character -- a `delimiter`. In programming languages for example we indicate the start and end of a string by using a quotation mark. . Count Fields Another technique is to add a field indicate the length of the following data element -- an encoded counter or integer. Humans do not find this as easy as delimitted data . Simple Syntax A very simple way to place several variable length fields into a single string is to make sure that the first character of each field can not be a character in the previous field. A simple example would be to choose alternating digits and letter encoding. .Open Markup Languages In recent years there have been a lot of markup languages defined for different purposes. The idea is to have explicit identification of the parts of the data. Typically something like .As_is RichardJBotting is a piece of text with added "tags" that indicate the meaning of the parts. In a .See Record Structure (above) the "tags" are not needed because their sequence is known and the lengths are fixed (or at least predictable). Thus we get an encoding that is guaranteed not to be ambiguous, is to some extent easy to read, and is somewhat inefficient. . RTF -- Rich Text Format Here the tags indicate the appearance of the text. Probably developed with early word processor software. Still a format used with MS Word. . SGML -- The Standard Generalized Markup Language IBM defined SGML so that you could created and define tags for any purpose. It is an amazingly difficult encoding to use. . HTML -- The HyperText Markup Language HTML is a specific application of SGML defined to describe and link pages on the world-Wide Web. It has gone through half-a-dozen versions. A new one stays within the XML (next) conventions. . XML -- the eXtendable Markup Language HTML describes the appearance of pages. XML tries to describe the meaning of data not its appearance. It has a very simple basis using .As_is .As_is to delimit data. Tags can also have attributes: .As_is Unix Training. XML also allows some tags to be unpaired and these are shown like this: .As_is XML documents can be parsed fairly easily. For each application that uses XML must have a DTD -- Document Type Definition published that defines the structure of the data -- what tags can appear inside others. Defining a DTD takes a significant amount of work. But once defined you can use tools to check validity, ... .Close Markup Languages . Complex Syntax Complex syntax gets us into natural and artificial languages. It is rare that we need to express natural data groupings using complex syntax. When we do we can use a extension of the syntactic meta-languages like Backus-Naur Form (BNF). In computer science most of our knowledge about linguistic design has been put into designing programming languages. Programming languages are the most complicated schemes for encoding a domain in existence. There are hundreds of them. For more take a CSCI Programming Language class like our CSCI320 .See ../cs320/ (Advert). .Close Encoding Compound data . Experience -- coding data in the ICI Infra-Red Spectrum Analysis Program. .Close Special Encodings .Open Guidelines for encoding data Make your codes - .List Concise Expandable Stable Unique Sortable Meaningful Single Purpose Consistent Clear -- Don't allow digits and letters in one field. They look too alike: (0=O, 1=l=|, 2=Z, 5=S, 6=b, 8=B) .Close.List Use standard codes and technologies whenever possible! Collect input automatically if at all possible. Get the input where it is available. All data storage needs back up and archiving.... but for how long? All output needs data control, security, and destruction. Who needs dead trees? Paper data is dead data. All stored or transmitted data implies a security problem. All data needs some kind of security -- how much and at what cost? Design I/O with care and with the help of the users. .Close . Information theory and combinatorics An elegant theory of coding data and signals. It shows that you can measure the information needed for a given purpose and calculate the maximum rate at which data can be sent through a given channel using the same measure: bits. Do the math. MATH272+MATH262. . Reference and Online Resources . Universal Product Code I don't expect you to understand UPC but if you are interested in these ubiquitous .Key bar-codes see .See http://en.wikipedia.org/wiki/Universal_Product_Code for the details and history. . Samples of Syntax Definitions My .See http://www.csci.csusb.edu/dick/samples/ define a large number of sophisticated coding schemes including programming languages and meta-languages. . ASCII For reference purposes see .See http://csci.csusb.edu/dick/samples/comp.text.ASCII.html . Markup Languages For reference purposes see .See http://www.csci.csusb.edu/dick/samples/index.html#Mark Up Languages (my notes). . HTML .See ../samples/comp.html.syntax.html . XML .See ../samples/xml.html .Open XML reference Robert Eckstein The XML Pocket Reference O'Reilly ISBN 3692082709 1999 .Close Up to date reference .See http://proquest.safaribooksonline.com/9780596100506 .Close XML .Open Review Questions When did changing the coding of one element become important in CSUSB and the recoding take more than 6 months to complete? What is IANA is and what does it do? Define: ASCII, Unicode, BCD, EBCDIC. Parse the following data strings -- label the parts: .List CSCI201-02 http://csci.csusb.edu/dick/cs372/d2.html#Hints for encoding Data rbotting@csusb.edu telnet:139.182.151.222 LOL .Close.List Take the above parsings and express it using a plausible XML syntax. For example the first might be .As_is
CSCI20102
How many ways can you encode a date? How many ways can you encode a time on a given day? How about encoding a time on any day.... Research: How does the California DMV assign number plates? Can you have an obscene vanity plate? How can California enforce this rule? Compare doing the same calculation in Arabic, Roman, and Binary: .List 32 * 16 = ? XXXII * XVI = ? 100000 * 10000 = ? .Close.List Count from 0000 to 1111 in Binary, and in Hexadecimal. Visit my office and figure out how I've encoded the day of the month in it. .Close .Close