.Open Detailed Data Design Part of system analysis and design is abstracting the underlying .Key Logical Data from the given (or planned) .Key Physical Data. Computer Science has discovered a powerful procedure that helps. It is called `normalization`. It is an important process that becomes, with experience, an intuitive part of good design. The theoretical breakthrough was to realise that all data can be expressed in terms of tables with all the structure of the data being handled by storing key values in tables. In other words, a .Key Relational Data Base (Tables + Keys) can encode any data structure. In any real project you should make both Conceptual Models and Normalized Models and express them using data dictionaries and ERDs. As a general rule, the conceptual model give you a simpler picture of the data needed in a project. Normalization provides a complete and detail design for a data base. You can do informal and mental normalization as you do conceptual modelling. Your conceptual models will proved names for tables that you later derive during normalization. And sometimes you will uncover a gigantic mismatch between what people think (their conceptual model) and what the data actually is. In this case you have an interesting problem to solve -- political, psychological, and sociological. Programming will probbly make the situation worse if you don't resolve the conflict first. In this page I will tell a story of data retrieval, review the words and processes we use, Give some hints, give rules for mapping tables to ERDs, and a UML profile for modelling data. . Story -- what do you mean by data item I was once (1970s) analyzing a Social Work Department in a part of London. One of the workers, Edna, had an effective manual system for retrieving information on clients. She used 3><5 cards with data written on them and a large automated storage divice for them. A typical scenario: .List Phone call asking for information. Edna asks a number to call back. The number is an extension and so secure -- Edna turns to the machine. Edna dials the write drawer, machine rotates it to the front and opens it. Edna finds the right card (a quick binary search). Edna phones back the information. .Close.List I noted that it took 25 seconds to do this. I noted the subtle but effective security. I also noted that some cards were red or green. I asked about this and it turned out that the red cards indicated threats to social workers... So in this case one data item was the color of the media. By the way... I recommended not computerizing this process. .Open Data Tables and Normalization .Open Keys and Identifiers When an enterprise needs to keep track of a type of entity it nearly always assigns a different identifier to each individual instance. As an example all the project reports and theses in the department office have just had a number assigned them -- and stuck on the back. Now we can refer to the document just by that number. The number identifies the report or thesis. We say that `the thesis or report depends on the number`. We say that `the number is a key identifying the thesis`. Absolutely typically, the reports and theses are stored so that the key increases. This makes it easy to find one given the value of its key. However, it is harder to find a report if you only know the name of the student who wrote it. The name is not the `prime key` for the reports and theses. Pay attention to identifiers and how they are assigned. A system can work or fail depending on how it handles them. For example, a university library (a long way from here) saved thousands of dollars by assigning identifiers to books that were meaningless serial numbers, and letting the computer store the relation between this identifier and the traditional Library of Congress card number. In computerized system, identifiers are mirrored by `keys`. All data is stored as a set of records of the same type in a file or table. Each record is a set of data fields -- attributes of some real entity. In these records, there should be special field called a prime key. A .Key prime key uniquely identifies one record in the table (= file). The value of the prime key is unique in the table. The rest of attributes in the record all depend on that key. Keys act as surrogates to records. We can then store the value of the key in another record were it links back to the record. Here it is a foreign key because it identifies a record in a foreign file/table. So a .Key foreign key is an field in a table/file that is a .See prime key in a different table/file. . Candidate Key A candidate key for a table can be any set of fields in the table that might be used to uniquely identify the rows in the table. . Prime Key A prime key is a set of fields in a table that uniquely identifies a row in the table. . Foreign Key A foreign key is a field in a table that is a prime key in a different table. A foreign key defines a natural many-to-one relationship between one table and another. There is one row identified by the prime key. But the same value can appear in many records as a foreign key. In a diagram you can connect the foreign key to its prime key. You will also hear about OIDs: OID::="Object Identifier", an invisible internal attribute associated with each object in an object-oriented program that uniquely identifies it. OIDs are based on pointers in C++. You will also find `Secondary Keys` in many data bases. A .Key secondary key identifies a set of rows in a table. A database system is told that about these so that it can attempt to optimize performance by creating indexes. . Notation for keys In word processed texts and writing people underline In the UML keys are attributes. Prime Keys are marked <> and foreign keys <>. prime keys and put asterisks infront of foreign keys. In ASCII the simplest technique is to indicate a foreign key by putting an asterisk in fron of it, and a double asterisk in front of the prime key. .As_is Report(**number, author, date, title, ...) .Close Keys .Open Extracting data structure from given data The first step in generating a logical data base from a sample of data is called `0NF` -- "Zero Normal form". All you do is: Mark up the sample. Highlight identifiers and other data. Give each piece of data a name. List names of fields in parentheses. Add asterisks for keys. Show repeated groups in parentheses: .As_is ( **Identifier, data, (repeating group ) , *foreign_key ) Example of online documents and analyis below .Close .Open Normalization . Motivation ** Normalization is most reliable way to design a data base to support your system. ** Normalization forces you to get answers to important questions. It reduces guess work and corrects errors. ** Normalization doesn't give you the most efficient data base. However, it gives you a collection of the simplest possible data structures. ** Normalization makes certain anomalies impossible. Example anomalies: (1) You can't insert an instance of an entity unless another entity is already in the data base. (2) You daren't delete the last entity of a given type because it destroys all record of a parent type. (3) You can't update certain records without being forced to change a completely different one. Normalization is a way of going from existing data to the logical data: A collection of simple tabular files linked solely by keys. If you start with a rough design for your data, normalization improves it. Normalization is a quality control process. If you design normalized data, you won't have to normalize it. It is possible to map normalized data onto just about any data storage system. Normalized data is easy to store in text files, traditional fixed length record files, spreadsheets, and even XML! Indeed -- even object oriented data should be normalized whenever possible. The attributes of object should depend only that objects identity. Normalized data is not fastest or smallest design for the data, but it is simple and obvious, and the basis for optimized designs when they are needed. Design with normalized data and then do the math on the volume of the data vs your resources, and the timing of performance critical processes. Then you can always re-factor the design to work faster or use less disk space; and sometimes both. Normalized data gives you the freedom to code data elements any way you wish. Normalization, in class, is not like real normalization. In a real project normalization raises questions that you must get answered. You have to go out in the field and ask questions. In fact, many developers have found that normalization generates precisely the right questions that you have to ask stakeholders. .Open Functional Dependencies One data item .Key depends on other data if its value can be deduced from the other data values. For example, you look at some samples of "Customer Order" which describes some items that a customer wishes to get from us. When we look at our sample we notice that whenever a customer orders an item with product number "WI007B" it is described by "Widget size 7 type B". In fact the product number is all we need to know to predict the product description. We say to our selves: "Product description depends on Product number". This is an example of a `dependency`. Looking further we notice that the price seems to depend on the product type in some of our sample, but not in others. Sometimes a "WI007B" costs $5.00 and sometimes $4.95. So we say that "Price does not depend on the Product Type". If we go and interview the clerk who handles Customer Orders. We discover that different customers get different discounts, and that the price varies with time. So you can say "The price depends on the time, the customer, and the product type". You'll find that experts tend to use the phrase "Functional Dependency". Here is an explanation why. .Net In mathematics (set theory) a function models this kind of dependency. If a mathematician writes sin: Real >-> Real then they say that "sin is a function". They mean that each real number `x` has a real number called `sin(x)`. Similarly, in C/Java/C++ when we declare .As_is double distance(Point x, Point y); we are saying that the distances `depends` on two Points. The rule is that if `x R y` is a function from x to y then for all x there is one y such that (x R y). So, the relation between Product Number and Product Description (above) is a function! Product_number >-> Product_description. .Close.Net .Close Functional Dependencies . Normalization Process -- The key, the whole key, and nothing but the key Memorize the mnemonic, and practice the following steps. Normalization has numbered levels: Zero normalized form, 1st normal form, 2nd normal form, 3rd normal form, ... We abbreviate these: 0NF, 1NF, 2NF, 3NF, ... normalize::procedure= .Net (0NF): List tables and their fields. (field, . . .). Indicate candidate keys. (1NF): The Key: no repeating groups or options. Generates two tables from one. (2NF): The Whole Key: No part key dependencies. Moves attributes between tables. (3NF): Nothing but the Key: No non-key dependencies. Generates extra tables. We won't cover 4NF and 5NF in this course. .Close.Net . Tool kit for normalization Simple editors/word processors are a good tool for normalizing data. I've used chalk boards, flipcharts, paper and pencil, PDAs, palmtops, laptops, and desktops! Use a dumb editor in preference to a word processor or graphic tool. Copy/paste each successive stage into the next (DEMO in class if time). Show keys by typing an asterisk "*" in front. Use "**" for Prime Keys and "*" for Foreign ones. Show repeated groups in parentheses: .As_is Name( **prime key, *foreign key, attribute, (repeating group ) ) .Open Suggested Abbreviations for Normalizing data on a simple editor (....) = Repeating group. ** = Primary Key. Note: usually underlined. * = Foreign Key. # = "number". ? = "Yes/No"|"True|False". N = Name. C = Code. .Close .Open Examples of Normalizing Samples .Open CSUSB SIS+ Screen 105 Course Section Scan The following is a step by step normalization of a sample of data on the old CSUSB Student Information System (SIS+). First the sample of data: .As_is W-Z0137 - SECURITY - YOU MAY NOT UPDATE ON THIS SCREEN .As_is 105 Course Section Scan .As_is S .As_is Screen: ___ SID: _________ Course: CSCI Term: 071 G S .As_is tS SEATS INSTRUCTORS/ PRIM R T CALL .As_is SECeS GT AVAIL DAYS TIMES SUBTITLE MAX ENR UNITS ACT P A NO. .As_is CSCI-121 - COMPTR TECH&PEOPLE .As_is 01 76 T 1200-0150PM WATSON P 80 4 2.00 LEC O 10649 .As_is CSCI-122 - BIOINFORMATICS .As_is 01 113 M 0200-0350PM DEVLIN K 120 7 2.00 LEC O 10650 .As_is 02 118 T 1200-0150PM DEVLIN K 120 2 2.00 LEC O 10652 .As_is CSCI-123 - USING COMPTR SOFTWARE .As_is 01 30 M 0900-0950AM PERLEPES T 30 2.00 LEC A O 10653 .As_is 02 NG 30 M 1000-1150AM PERLEPES T 30 0.00 ACT A O 10654 .As_is Activity .As_is CSCI-124 - EXPLORING INFO SUPERHIGHWAY .As_is 01 59 R 0200-0250PM WATSON P 60 1 2.00 LEC A O 10655 .As_is 02 NG 29 R 0300-0450PM WATSON P 30 1 0.00 ACT A O 10656 .As_is Activity .As_is 03 NG 30 R 1200-0150PM WATSON P 30 0.00 ACT A O 10657 .As_is Activity .As_is .... .As_is CSCI-365 - SYSTEMS ADMINISTRATION .As_is 01 30 TR 0600-0715PM HAN K 30 4.00 LEC O 10681 .As_is 02 NG 30 T 0730-0920PM HAN K 30 0.00 ACT O 10682 .As_is Activity .As_is CSCI-375 - REQUIREMENTS ANALYSIS AND DESIGN .As_is 01 30 TR 0400-0600PM BOTTING R 30 4.00 LEC O 10683 .As_is ... .Open 0NF Screen 105 .Key Describe the data Give names to fields, mark keys (*), guess at prime keys (**), note repeating groups(....). Screen( **dept, ** term , ( *course, title ( *sect#, available, days, times, teachers, max, enrolled, units, mode, group, status, *call#, subtitle))) .Close .Open 1NF Screen 105 .Key The key -- Remove repeating groups and add new tables. Each removed group takes all the prime keys of the outer groups, and its own key. It takes just the fields that repeat in the group. This may leave some key only tables. Don't worry. In this example there are two (nested) groups -- data determined by the class and those depending on the section. DepartmentTerm (**dept, **term ) Course( **dept, **term, ** course, title ) Section (**dept, **term, ** course, **sect#, days, times, #available, #enrolled, maxEnrollment, mode, group, status, * call #, subtitle) .Close .Open 2NF Screen 105 .Key The whole key -- Now look for part key dependencies. Parts of prime keys that have data that is determined by them alone, indpendent of the rest of the prime key. In this example, we notice that the `course` determines it's title. And extract Course from Offering (inveneted name). DepartmentTerm (**dept, **term ) Course( ** course, title ) Offering( **dept, **term, ** course ) Section (**dept, **term, ** course, **sect#, days, times, #available, #enrolled, maxEnrollment, mode, group, status, * call #, subtitle) .Close .Open 3NF Screen 105 .Key Nothing but the key -- now we look for dependencies between data items that are not in the prime keys. In this case the "mode" and the "call#" determine other data. This gives us two new table. In the Section table we have: the dept, term, course, and section depend on the call# and so the prime keys are demoted to foreign keys. DepartmentTerm (**dept, **term ) Course( ** course, title ) Offering( **dept, **term, ** course ) Section ( ** call# , *dept, *term, * course, *sect#, days, times, #available, #enrolled, maxEnrollment, mode, group, status) ModeSubtitle(**mode, subtitle) .Close .Close Note -- we .Key normalize many sample documents -- one at a time. Each sample tends to add fields to existing table, and add new tables to the data base. The question to ask is: where does this bit of data on the new document fit into the already discovered data base. What data does it depend on? What determines its value? Here is an example from the old SIS+ system .Open Screen 105 + CSUSB CSCI Dept Winter Schedule 2007 Separates the Course identifier into two parts: dept and number. Adds information on the building and room. .Open 3NF Screen 105 + Winter Schedule Department(**dept) DepartmentTerm (**dept, **term ) Course( **dept, ** course#, title ) Offering( **dept, **term, ** course# ) Section ( ** call# , *dept, *term, *course#, *sect#, days, times, *building, *room#, #available, #enrolled, maxEnrollment, mode, group, status) ModeSubtitle(**mode, subtitle) Room(**building, **room# ) Building( **building ) .Close .Open Optimized For all courses #available = maxEnrollment - #enrolled. More: when I asked the head of department about how the maximum enrolment was chosen he explained it was a property of the room. Section ( ** call , *dept, *term, *course#, *sect#, days, times, *building, *room#, #enrolled, mode, group, status) Room(**building, **room#, maxEnrollment ) .Close .Close .Close .Close Examples of Normalizing Sample data . Hints Don't use upper case letters unless you write badly. (Agile Normalization): remove data that is positively not a key and replace by "...". Use abbreviations: #=number and ?=Boolean for example. When you have more experience you may be able to often combine several steps going directly from 0NF to 3NF quite quickly. But this is a matter of experience and skill. Don't do this, yet. Do normalization step by step, and keep a record of each normal form. This lets you review what you have done. Mistakes can be made. So get your work checked by colleagues and stakeholders! I often start out with an incomplete set of 3NF tables with missing columns. As I uncover each data element I work out which table/entity it belongs in, and where it is a key, and whether I need more tables to record relationships and data. Most books don't mention .Key Optimized 3NF. After getting 3NF from different samples, you may have many tables with the same prime keys.... try combining them. .Open Mapping 3NF back to ERD It is easy to change a 3NF schema back into an ERD. Mark all the key fields: prime(**) and foreign(*). Start at the top of the page/screen with tables that have no foreign keys and the simplest prime keys. Put each new entity with combined prime keys below these and connect them to entities that have a part of the prime key. Connect foreign keys to prime key. Connect parts of a composite key to one prime key. The prime key is the "one" end, the other key is the "many" end. Validate against process and, if needed add extra entities/tables (views and indexes) to provide starting points into the database. .Open Computer Science Brochure I took the 2004 CSci brochure and normalized it. This is what I found: .Open 3NF Department(**department name, address, phone, fax, email, blurb) AccreditingBody(**body name, url) DepartmentAccredition(**department name, *body name) Program(**department name, **type, **subject, description) FacultyDepartment(**department name, **faculty name) Faculty(**faculty name, faculty title, rank, highest degree, date of highest degree, email, research description) FacultyResearch(**faculty name, **research topic) Prospect(**prospect name, prospect title, address, phone, email, international?, Financial assistance?) ProspectResponse(**prospect name, **department name, **sequence#, date) ProspectRequestsQualification(**prospect name, **department name, **sequence#, **type, **subject) .Close Here is the conceptual ERD derived from the 3NF. .Image CSci2004.gif Diagram summarizing above tables. .Close Mapping 3NF back to ERD .Close .Open Using the UML to model data . Customizing the UML A .Key UML profile is a set of stereotypes, tags, and constraints that extends the UML for a particular purpose. .Key Stereotypes restrict the meaning of things in UML diagrams. They look like this .As_is <> These are often associated with a particular graphic notation. .Key Tags provide extra information. They take the form: {tag_name=tag_value} .Key Constraints describe additional rules that should hold in a model. They are statements that must be true. There is a specialized language (the OCL) for writing them if needed. I've given you links to some data base profiles below that I think are too complex for this course. .Open UML & Data bases My advice: KEEP IT SIMPLE. (proposed profile for data bases): My proposed profile for describing data bases .Table stereotype a special kind of meaning .Row conceptual model class diagram picture of the world outside the computer in terms of entity types and relationships between entities. .Row logical data base class diagram A picture of an ideal data base independent of a particular technology or technological requirements. .Row physical data base class diagram A detailed solution, for a particular technology, to a set of data base requirements. .Row table class database table .Row PK attribute Prime key .Row FK attribute Foreign key .Row view class view of database(for a function) .Row index class index into a database(for a function) .Close.Table .Table Constraints .Row In conceptual models do not show operations, keys, tables, views, indexes, composition, aggregation, etc. .Row In logical and physical data bases unmarked classes are tables. .Row Use <
> only if you have a diagram that has both persistent (data base) classes and run time objects in the same diagram. .Close.Table .Close UML & Data bases .Close .Close .Open Online Resources and Examples . Definition of Normalization Here are a couple of excellent definitions: .See http://en.wikipedia.org/wiki/Database_normalization In this class we won't go further than third normal form. .Open Examples A Flyer for a meeting at CSUSB .See ./372doc4.pdf Flyer(OrgN, Chapter, (* SpeakerN, Description, title, abstract, motivation, date, time, building, room#), (TypeOfMember, Price), (Reason), URL ) Example .See ./372doc1.gif Schedule( DeptCode, DeptName, (Day), (Person#, PersonName), (Day, Person# (State) ) ) Example .See ./372doc5.pdf Container(*name, description, (property), (operation), (picture) ) Example -- Tally Roll from a drive thru with my analysis: .See ./372doc3.pdf Also see the example at .See http://www.troubleshooters.com/littstip/ltnorm.html .Open ISSA Flyer Albert Garcia brought this .See ./372doc4.pdf (in 2006) to class. .Open 0NF -- ISSA Flyer Flyer(OrgN, Chapter, (SpeakerN -> Description, title, abstract, motivation, date, time, building, room#), (TypeOfMember, Price), (Reason), URL ) .Close .Open 1NF -- ISSA Flyer Organization(**OrgN, chapter, URL) Reason(**OrgN, **reason) Speaker(**SpeakerN, Description) Talk(**OrgN, **Speaker, *Date, *Time, title, abstract, motivation, building, room#) Price(**OrgN, **TypeOfMember, price) .Close .Open 2NF -- ISSA Flyer = 1NF .Close .Open 3NF -- ISSA Flyer = 1NF .Close .Close ISSA Flyer .Open Info Service Schedule Aaron produced the following print out from a SQL data base .See ./372doc2.pdf .Open 0NF - Info Service Schedule Schedule( **DeptCode, DeptName, (*Day), (*Person#, *PersonName), (*Day, *Person# (State) ) ) .Close .Open 1NF - Info Service Schedule Department(**DeptCode, DeptName) Day(**day#, day) Person(**Person#, PersonName) Slot( **DeptCode, **day#, **Person#) SlotState( **DeptCode, **day#, **Person#, **StateNumber, StateCode) .Close .Open 2NF - Info Service Schedule .Close .Open 3NF - Info Service Schedule .Close .Close Info Service Schedule .Close Old CSci Brochure .See ./CSBrochure2004.pdf (Large sideways PDF) .Close . UML Profiles (Ambler's Proposed UML profile for Data Base design): Visit Ambler's UML DB Profile .See http://www.agiledata.org/essays/umlDataModelingProfile.html (Rational's proposed UML profile for mapping objects to data models): PDF .See http://www3.software.ibm.com/ibmdl/pub/software/rational/web/whitepapers/2003/tp185.pdf .Close Online Resources and Examples .Open Review Questions and Exercises on Normalized Data Bring any piece of data to class -- we will normalize it. Terminology: Can you define these terms from .See ./d3.html ? Table, Row, Tuple, Field, Attribute, Key. (some terms refer to the same idea). Define: table, key, row, normalize, 0NF, 1NF, 2NF, 3NF. What is normalized data good for? What is normalized data bad for? Draw a rough DFD of normalization: what data does it start with and what does it output? What processes? What metadata is stored? Print out any CMS screen -- normalize it. Do you discover the same data as implicit in SIS+? Find any tally roll or printout that you have (and don't mind sharing) -- normalize it. Use the $Template at the end of this page. Do you have any sample data from a field trip? -- Normalize it. You can apply normalization to software development artifacts. Do you have any source code to normalize? How about a data dictionary? .Close .Open Template .Open 0NF .Close .Open 1NF The Key... .Close .Open 2NF The Whole Key... .Close .Open 3NF And Nothing but the Key... .Close .Open Optimized 3NF .Close .Close Template .Close Detailed Data Design