3. The Migration Work

Keeping in mind that the migration of the bibliographic data should facilitate the future administration work and allow the processing of "intelligent", structured queries, some additional steps had to be carried out before building the final database. First of all, the complexity of the topic required the developing of appropriate data models for describing the representation of information in terms of data. In other words, we had to determine the types of information that needed capturing. This had to be done under structural and operational aspects and produced a so-called "conceptual data model" (sometimes called "logical data model" as well, but these terminological questions can be ignored here). It contained all knowledge necessary to model the miniworld of bibliographies irrespective of the concrete software to be used for the later implementation. The realization of this conceptual data model was done using entity-relationship (E-R) diagrams. They are based on the distinction between "entity types" that represent the logical "things" our miniworld is composed of and "relationship types" among those entity types. Below, you will find the E-R diagram for the Bibliography of German Grammar, which was created with the powerful CASE tool Oracle Designer/2000 (see [5]). Entity types are displayed as rounded rectangles, with their names and their attributes listed inside. Relationship types are presented according to the following conventions: Single lines represent "one-to-one"-relationships, crow's feet represent "one-to-many"- or "many-to-many"-relationships; lines stand for "must"-relationships, broken lines stand for "may"-relationships. For example, the following statements are coded in the figure below: "Each book (buch) must be published by a publishing house (verlag)." and "Each publishing house (verlag) may publish one ore more book(s) (buch).". The relationship between publishing houses and periodicals is an example for a "many-to-many" (n:m)-relationship: "Different publishing houses can publish the same periodical. Different periodicals can be published by the same publishing house.".

Conceptual data model for the Bibliography of German Grammar:

So what is the advantage of mapping bibliographic constructs into an E-R representation? One important aspect is that it seems to be a good way to understand the logical structures that organize our information. Based on this understanding and with the help of relational algebra, the later development of database applications can take these structures into account and offer solutions to certain problems. For example, let's think of the relationship between books and authors. The non-relational way to store this information would be similar to the index card solution: For each book, we had to enter the names of the authors concerned again and again. Given the fact that 80 out of 2000 authors included in our bibliography are responsible for about 25% of all book titles, this is obviously not very effective. Besides the multiple work, it preprogrammes the storing of different spellings of author names. The introduction of separate entities for the book and the author opens up the possibility to store the information about the authors at a different location and to use this information when entering a new book entry (readers familiar with database design will notice that this already gets us to the physical data and application modelling, which is the next step that comes after the conceptual modelling). Of course it is sometimes difficult to determine the "real" name of an author. Especially transliterations from other alphabets raise problems, as well as titles of nobility ("von", "de", "van" etc.) or the fact that people sometimes completely change their surname after marriage. To model the last-mentioned case, we could add an additional recursive relationship "is identical with" to the entity "person" in the E-R diagram above.

When designing a database based on the former analysis of functional dependencies, attention has to be payed to the normalization of the defined relations. The primary goal here is to reduce an arbitrary relation to an equivalent collection of 3NF (third normal form) relations. Despite of some special cases described below, the relations defined within the database model for the Bibliography of German Grammar fulfill the demands of E.F. Codd's original three normal forms (for more information on normal forms, see [2]).

When examining our conceptual database design, relational purists would possibly argue that the conception of the two entities "person" and "verlag" ("publishing house") does not exhaust the full potential of E-R modelling. It is obvious: Since people can have one ore more first names, it would be accurate to split the entity "person", to create a new entity called "first name" and to connect these two entities with a "one-to-many"-relationship. The same is true for the "verlag" entity: Publishing houses can be based at one or more different cities, so under ideal circumstances the modelling should reproduce this. We intentionally broke the rules and restricted the maximal number of first names to four and the maximal number of cities to three. This was done after analysing the complete SGML data to be sure that none of the existing entries needs more than that. The advantages we derived from avoiding additional entities are a substantial ease of administration work in the resulting database application as well as less complex and therefore faster retrieval queries.

As always, the conceptual model finally used for the later creation of our database tables is a compromise and far from being the only possible view on the mini-world it describes. Maybe changing demands will make it necessary to re-model some parts of it. For example, we assumed that each book or article can only be written in one language and that it may cover only one contrastive language. When detecting that some of the original SGML entries contain a contrastive language attribute called "Verschiedene" (mixed), we decided not to extend our model because of administrative reasons. If some future application will make it desirable to explicitly distinguish between different contrastive languages, it should be manageable to change the way we looked on the data and to add the changes to the bibliograhic database.

Another aspect that gives reasons for the use of conceptual data models has to do with system independency; those models can be used as a basis when migrating to a different DBMS. As already mentioned, the translation of conceptual data models into something a database system can work with is called "physical data modelling". It can be defined as a low-level description of data, as a description of how to store information physically in database tables. In contrast to the abstract modelling we did before, the physical data model shown in the figure below precisely describes the internal structure of the database tables and the foreign keys used to join two tables together.

Physical data model for the Bibliography of German Grammar:

For a better understanding of the role of the database tables, a few words should be said about their content and its origin. Most parts of the information come from the SGML files described in the previous chapter, but besides structural reorganization and addition of formerly implicitly stored information we also added some new content. The descriptions below start with the formal characterization of each table column, followed by a short explanation. The tables are listed in alphabetical order.

Description of table TB_BDG_AUFSATZ:

CO_ID                   NOT NULL VARCHAR2(10)
CO_TITEL                NOT NULL VARCHAR2(400)
CO_JAHR                          NUMBER(4)
CO_JAHRZUSATZ                    VARCHAR2(2)
CO_SPRACHE                       VARCHAR2(50)
CO_KONTRASTIV                    VARCHAR2(50)
CO_TYP                           VARCHAR2(50)
CO_RELEVANT                      VARCHAR2(10)
CO_BD1                           VARCHAR2(10)
CO_BD3                           VARCHAR2(5)
CO_BAND                          VARCHAR2(15)
CO_HEFT                          VARCHAR2(5)
CO_VON                           VARCHAR2(5)
CO_BIS                           VARCHAR2(5)
CO_VERWEISTYP                    VARCHAR2(1)
CO_VERWEIS_ID                    VARCHAR2(10)
CO_NETZADR                       VARCHAR2(200)
CO_DATUM                         DATE
This table stores information about scientific articles that were published in journals, anthologies etc. The column CO_ID contains an unambiguous identifier for each article. CO_TITEL stores its title, CO_JAHR stores the year of publication and CO_JAHRZUSATZ an addition like "a", "b", "c" etc. CO_SPRACHE stores the language the article is written in, CO_KONTRASTIV stores the language the article is about (both correspond to the column CO_SPRACHE of table TB_BDG_SPRACHE). CO_TYP stores the article type and corresponds to the column CO_TYP of table TB_BDG_AUFSATZTYP. CO_RELEVANT stores information about the relevance of the entry, CO_BD1 and CO_BD3 say something about whether the entry was published in the first and third print edition of the biography or not (these columns are remains of the original SGML data records and will not be maintained in future). The column CO_VERWEISTYP contains information about the type of publication the article was published in (book or journal); CO_VERWEIS_ID is a link to the publication concerned and corresponds to an entry in TB_BDG_BOOK or TB_BDG_PERIODIKUM. The columns CO_BAND and CO_HEFT contain - if available - information about volume and issue. CO_VON and CO_BIS store the page numbers. A new type of information is collected in column CO_NETZADR: To keep up with current developments in scientific publication, it is now possible to enter the internet address of electronically published articles. Finally, CO_DATUM stores the last modification date for each record set.

Description of table TB_BDG_AUFSATZTYP:

 CO_TYP                  NOT NULL VARCHAR2(50)
 CO_KURZFORM                      VARCHAR2(5)
This table holds the information about possible article types. The column CO_TYP contains the different types ("journal article", "review", "summary", "foreword" etc.), CO_KURZFORM stores the corresponding abbreviations.

Description of table TB_BDG_AUTOR:

 CO_ID                   NOT NULL VARCHAR2(10)
 CO_FUNKTION                      VARCHAR2(20)
 CO_POSITION             NOT NULL NUMBER(2)
 CO_PERSON_ID            NOT NULL NUMBER(5)
 CO_DATUM                         DATE
This table stores information about the persons responsible for a certain publication. Column CO_ID holds the unique identifier of the publication (a book or an article), CO_PERSON_ID refers to a personal name stored in TB_BDG_PERSON. CO_FUNKTION says something about the function of this person ("editor", "author", "co-author" etc.). Since the names on a book cover are not always arranged in alphabetical order, CO_POSITION explicitly stores the position of each name. CO_DATUM stores the last modification date for each entry.

Description of table TB_BDG_BUCH:

 CO_ID                   NOT NULL VARCHAR2(10)
 CO_TITEL                NOT NULL VARCHAR2(400)
 CO_JAHR                          NUMBER(4)
 CO_JAHRZUSATZ                    VARCHAR2(2)
 CO_SPRACHE                       VARCHAR2(50)
 CO_KONTRASTIV                    VARCHAR2(50)
 CO_TYP                           VARCHAR2(50)
 CO_AUFLAGE                       VARCHAR2(50)
 CO_RELEVANT                      VARCHAR2(10)
 CO_BD1                           VARCHAR2(10)
 CO_BD3                           VARCHAR2(5)
 CO_REIHE_BAND                    VARCHAR2(10)
 CO_REIHE_ID                      VARCHAR2(10)
 CO_VERLAG_ID                     NUMBER(5)
 CO_DATUM                         DATE
This table stores information about books relevant for grammar science. The column CO_ID contains an unambiguous identifier for each book. CO_TITEL stores its title, CO_JAHR stores the year of publication and CO_JAHRZUSATZ an addition like "a", "b", "c" etc. CO_SPRACHE stores the language the book is written in, CO_KONTRASTIV stores the language the book is about (both correspond to the column CO_SPRACHE of table TB_BDG_SPRACHE). CO_TYP stores the book type and corresponds to the column CO_TYP of table TB_BDG_BUCHTYP. CO_AUFLAGE holds information about the book's print edition. CO_RELEVANT stores information about the relevance of the entry, CO_BD1 and CO_BD3 say something about whether the entry was published in the first and third print edition of the bibliography or not (as for TB_BDG_ARTICLE, these columns are remains of the original SGML data records and will not be maintained in future). CO_REIHE_ID refers to a series a book belongs to (stored in table TB_BDG_PERIODIKUM), CO_REIHE_BAND says under which number the book appears in this series. Column CO_VERLAG_ID stores a link to the publishing house in table TB_BDG_VERLAG. As always, CO_DATUM stores the last modification date for each entry.

Description of table TB_BDG_BUCHTYP:

 CO_TYP                  NOT NULL VARCHAR2(50)
 CO_KURZFORM                      VARCHAR2(5)
This table holds the information about possible book types. The column CO_TYP contains the different types ("monograph", "anthology", "ph.d. thesis" etc.), CO_KURZFORM stores the corresponding abbreviations.

Description of table TB_BDG_OBJEKTWORT:

 CO_ID                   NOT NULL VARCHAR2(10)
 CO_WORT                 NOT NULL VARCHAR2(200)
 CO_DATUM                         DATE
This table connects books and articles with so-called object words (German: "Objektworte") that are relevant in the publication's context. Object words can be full words as well as prefixes, suffixes etc. and are stored in column CO_WORT. The column CO_ID refers to the book's or article's unique identifier, CO_DATUM stores the last modification date of the record set.

Description of table TB_BDG_OBJEKTWORTLISTE:

 CO_WORT                 NOT NULL VARCHAR2(200)
This is a look-up table which contains the list of possible object words used in table TB_BDG_OBJEKTWORT.

Description of table TB_BDG_PERIODIKUM:

 CO_ID                   NOT NULL VARCHAR2(10)
 CO_TITEL                NOT NULL VARCHAR2(400)
 CO_UTITEL                        VARCHAR2(400)
 CO_SIGLE                         VARCHAR2(50)
 CO_TYP                           VARCHAR2(50)
 CO_NETZADR                       VARCHAR2(200)
 CO_DATUM                         DATE
This table stores information about periodicals like journals or book series. The column CO_ID contains an unambiguous identifier for each periodical, CO_TITEL stores its title and CO_UTITEL holds the subtitle. In CO_SIGLE we find the sigle - a kind of abbreviation - for the periodical and CO_TYP stores its type. Like in table TB_BDG_ARTICLE, the column CO_NETZADR refers to an internet address (e.g., for e-journals). CO_DATUM stores the last modification date for each record set.

Description of table TB_BDG_PERIODIKUMTYP:

 CO_TYP                  NOT NULL VARCHAR2(50)
 CO_KURZFORM                      VARCHAR2(5)
This is again a look-up table, this time with a list of possible types of periodicals used in table TB_BDG_PERIODIKUM. The types are found in column CO_TYP, whereas CO_KURZFORM stores an abbreviation.

Description of table TB_BDG_PERIODIKUM_VERLAG:

 CO_ID                            VARCHAR2(10)
 CO_VERLAG_ID                     NUMBER(5)
 CO_JAHR                          NUMBER(4)
This table was created for modelling the relationship between periodicals and publishing houses. Since the publishing house responsible for a certain periodical can change with the time, this table holds the data necessary for connecting one periodical with different publishing houses. Column CO_ID stores the unique identifier of a periodical, CO_VERLAG_ID refers to a publishing house stored in table TB_BDG_VERLAG. In column CO_JAHR it is possible to enter the year a periodical moved to this publishing house.

Description of table TB_BDG_PERSON:

 CO_ID                   NOT NULL NUMBER(5)
 CO_VORNAME1                      VARCHAR2(200)
 CO_VORNAME2                      VARCHAR2(200)
 CO_VORNAME3                      VARCHAR2(200)
 CO_VORNAME4                      VARCHAR2(200)
 CO_ZUSATZ                        VARCHAR2(20)
 CO_NACHNAME             NOT NULL VARCHAR2(200)
 CO_DATUM                         DATE
As already mentioned, table TB_BDG_PERSON stores the names of the persons responsible for an article or a book. The column CO_ID contains an unambiguous identifier for each entry. CO_VORNAME1 to CO_VORNAME4 hold the first names of a person, CO_NACHNAME stores the surname and CO_ZUSATZ other parts like titles of nobility ("von", "de", "van" etc.). CO_DATUM stores the entry's last modification date.

Description of table TB_BDG_SCHLAGWORT:

 CO_ID                   NOT NULL VARCHAR2(10)
 CO_WORT                 NOT NULL VARCHAR2(200)
 CO_DATUM                         DATE
This table connects books and articles with relevant headwords. Column CO_ID refers to the book's or article's unique identifier, in CO_WORT we find the headword, and CO_DATUM stores the last modification date.

Description of table TB_BDG_SCHLAGWORTLISTE:

 CO_WORT                 NOT NULL VARCHAR2(200)
This is a look-up table which contains the list of headwords used in table TB_BDG_SCHLAGWORT.

Description of table TB_BDG_SPRACHE:

 CO_SPRACHE              NOT NULL VARCHAR2(50)
 CO_KURZFORM                      VARCHAR2(5)
 CO_DATUM                         DATE
This table holds the different languages used in TB_BDG_ARTICLE and TB_BDG_BOOK. The column CO_SPRACHE contains the fully written languages (from "Afrikaans" to "Zurich-German"), CO_KURZFORM stores corresponding abbreviations. CO_DATUM stores the last modification date for each entry.

Description of table TB_BDG_VERLAG:

 CO_ID                   NOT NULL NUMBER(5)
 CO_NAME                          VARCHAR2(200)
 CO_ORT1                          VARCHAR2(200)
 CO_ORT2                          VARCHAR2(200)
 CO_ORT3                          VARCHAR2(200)
 CO_DATUM                         DATE
Finally, this table stores information about publishing houses. The column CO_ID contains an unambiguous identifier for each entry. CO_NAME stores the name of the publishing house, CO_ORT1 to CO_ORT3 stores up to three the city names and in CO_DATUM we keep the last modification date.

[Previous]   [Next]