Facilitating in a Human S. B. Davidson, Dept. Genome and Information University of Pennsylvania Email: susan@cis. Department Science Dept. author: B. Davidson, Phone (215) interesting lution, Project database complex challenges: data entry the need to iutegrate tems which While ical range these are unusual rapid multiple data are not and make automat these Genome Center new approach problems database developed to a solution and software of models intensity and ed solutions The 22, and describe by means of a acid) beads on the of interesting data and common to realm and perative. This either paper rapid entry range These exist illustrates over make are techniques to problems sources variety of within aid in their in this in this paper, in the of the major schema existing problems evolution applications. at context perimental data forcing evolution schema and to Furthermore being po- discussed for Chromosome and the in HGP resulting 22, Children’s and better laboratory since plan and data integrity guide sion dependencies). plex, non-standard ongoing enforced This gives in order for the rise that data the ex- database process investigators is mod- changing, notebook This is crucial. constraints and is constantly applications. to experimental developed modeled rapidly, databases need must consult octhe experimentation. However, is very complex, hierarchically organized, an unusually large number of links among 423 or- along notebook for being of the related extremely database and the kuown database laboratory New are constantly Permission to copy without fee all or part of this material is granted provided that the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association of Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. 
CIKM ’94 - 11/94 Gaithersburg MD USA © 1994 ACM 0-89791-674-3/94/0011...$3.50 en- ax a less az markers faced and techniques cur the involves markers Center se- Consequently of Pennsylvania domain. *This research was supported in part by the following grants: NSF IR19004137, ARO DAAH0493G0, NIH P50-HG-00425, NSF BIR9402292 and ARO DAAL03-89-CO031 PRIME. on the directly of Philadelphia. One ify G’s chromosomes The diflike 400 bases), time. Mapping is the Genome University for fragments a discover- and for sequencing anchoring Chr22DB, of four arranged means T’s, at one as landmarks. Philadelphia at the DNA the DNA and C, T), C’s, goal. the of (deoxyribonupairs G, practical bases) is to is composed (approximately mapping chromosome, rapid solu- set (HGP) comprising techniques strings of identifiable Hospital im- are are not serve (A, of A’s, there DNA has to combination, solutions to prob- transforma- Project of DNA Sequencing intermediate located the broader bases (3 billion sitions the particularly the automated or are inadequate these a wide the man- data their HGP dering as within where the Genome chromosome sequence methods genome schema constraint data of complementary or short tire ambitious and up Although a confluence databases, as well databases, Furthermore, and multiple challenges notebook Project complexity do not challenges: to integrate laboratory of biological intensity present which formats. Genome databases data need systems and Human complex and the software models tion database evolution, agement, Project step of these chromosomes molecule a string. exact current Introduction made nucleotides ing transformations Human Each ferent string. Genome Genome a first core to express 24 distinct double-stranded cleic a of the the quencing Human to be the in which genome. long, We and constraints. 1 Philadelphia describes constraints. goal human complexity of the Philadelphia database perceive a language and sequence to biolog- imperative. 
for these problems, at the 22, and sys- Chromosome for expressing 89S-0587 Chromosome what we tions and formats. unique for lems: and en. upenn. edu Fax (215) solving evo- in the context for Human language and data of management, necessarily combination, a confluence schema sources variety their illustrate present and constraint over a wide challenges databases, deductive databases PA 19104-6145 eckman~cbil.humg S98-3490, Center Science of Pennsylvania Philadelphia, Entail: of the Genome and Information University edu Susan of Genetics of Computer rspenn. edu, cis.upenrs. * B. Eckman Abstract Hmnan Database PA 19104-6389 kosky@saul. Contact Project A. S. Kosky of Computer Philadelphia, Transformations to a number need the data and contains tables (incluof com- to be specified to be correct. Another major heterogeneous is frequently notebook as the protein ical bibliographic genome laboratory tems perform ject and This different haps persist. ple queries, it is often models only views increase in of the to as a whole; organize dent structuring of the The GenBank family is “standard” flat-file variants, National veloped at ASN. the 22 [7], data cept from Each advantages that recent been vasive o A recent report of [9] listed that were current data sources, among was issues and that query and (b) important data users, a query The problem ible tools research. conand dis- query others. that they of we are per- Energy In- of simple to with answer the sources databases, there how we been to analyze a to and to as follows: and a data trans- encountered. language sample current have and than desirable is organized transformation the used large evolution. techniques and shows We fail encountered, Section and problem. 
2 A Sample Data to how conclude address discussing Transformation The data and notebook and difficult know little to in for start modeled the the the future in Chr22DB archivaJ HGP understand, to nothing therefore ing schemas databsses about what some complex for those biology. a bit of the laboratory highly molecular off by explaining and and are especially about terms who We will what used is be- mean. programs into and information fact of some technique for of their system faced: that rapidly HGP’s View intermediate (fragments and (a) the the the attempt listed lack DBMS of they underlying two of the Biological goal of DNA) locating or we will consider probes and an were them schema The is the that an or data re-mapping problem of trans- is is understandable applications evolution databases. by program. calls Back- into the We need string. 424 interest is a linear ordering Sequence cut the to be able of DNA neighboring overlap using cloned randomly into are then representing from A variety manipulable pieces To discover pieces chromo- positions. mapping These come ordering human (STS’S). bases). of two pieces Sites rniUion it is crucial sequence flex- physicai experimentally string. the markers to specific lothe sake of simplicity, of DNA fragments, for Tag of pieces mapping along at known one: chromosome (50,000-1 original only Sequence overlapping bled this form language, The markers some evoking. part A Databaser’s ground the are dis- is no effective 2.1 of techniques are used to anchor cations on the chromosome. For they the of schema for because language constantly forming de- them. 
a genome adequate using; onr is easy is of as a rather of Chr22DB hss this forms is highly paper that to capture arguing that a part problem it right for is rather database of rapid of this 2 illustrates problems a number et al [10] in an appraisal create major impossible various files, combining Goodman to in a form reason be thought output – a whole accep- are the computational whose in light remainder it is used The Furthermore, query universal they can the (datalog transforms problems that and complex about The by in which simple the reason gained we believe transformation relation. have yet and languages Chro- among the indicate single Los version, to a biological a Department “summit” structured An and for base advantages portability, is one a declar- from of representation, underscore to, queries for ● and it very formation the 1 version which are on transformations transformations. a data ushas databases: formatics tributed [8], view issues papers HGP group while Section syn- developed Center that, to data it that is based query not approach iUustrate transformations declarative have our and approach languages, (as in sources). for specifying extensions) as query 3 describes is at the ASN. its While approach data trivial one knowledge entry include alluding to the hese has its own expressiveness, Two have oft there developed Phdadelphia our the information. point: version sim- acceptable indepen- numerous [6], at least a sequence view. language with by the within similar in a relational and developed or version Laboratory N CBI, 1 version mosome also version a relational Alamos the same and Our TSL, fam- models is to describe of data (as models Genbank data multiple model data as in the transformations, DB. language, structurally per- beyond application to achieve Thus, we find numerous a case and data complexity necessary the increases, partial, in Chr22 ative query, within to optimize a specific system performance. 
tactic (Ob- paper problem tance dat abases complexity capture diilerent tasks string flat-relational from of th~ ing a sample and trans- data different multiple of data data a single between screens, to specifying transformations: in or across purpose constraints. databases. and As data may significantly and entry integration arisen which databases in schemas databases analysis search include personal-computer-bwed to of computa- complex-relationti heterogeneity is likely Staden, such object-oriented GemStone), human data evolution), data The assoftwaresys- as pattern-matching, (Sybase), (ASN.1), HGP involving biomed- to schemas of dat abases), the Genbank, the schema ily number and with approach between (ss with queries the agroting analysis the databases, and as well databases formations answer [1], [4], These Store, [2]; a principled packages of database, PIR [3], FASTA data comparison. databsses to Medline, GDB multiple, contents archival base, databases; as BLAST problems and base, base, complex the sequence data to software include data notebook such tional acid data access and augment These sequence map that databases nucleic the as to by researchers. such is databases needed laboratory posed problem remote their order relative between in the ordering to ascertain overlaps, size reassem- when that sites in the two pieces is, of the when original of DNA can be detected the a probe. The them, desired linear and related and interest, and visible tion lines larger, quence contains of a tiny 1. become used to size range and time cloned be the se- 2.2 to which at the denote markers the (probes). of pat- The func- of granularity. Horizontal lines the sequence data - ‘--- to are: human in J-. I .. . . physical Cloned of the In cloning, probes and In are what and and STS’S carrier these STS It consists When of the the host human Sequence Tag Sites (STS ‘s). 
cells DNA DNA merely introduced lations rather tributes for attribute defined cal reaction pai~ used as primers amplification stages, each An amplification primer sequences the demonstrates items a primer amplification). several perature. less the within called (PCR reaction prises by intervals about test sequence; sequence an STS The proceeding reaction are found, therefore, its name, entire multiple 2. sub- Since be one row relation rows in the screen must be Chr22DB schemal in of linkages a precise given primary rele- 3; this between is re- semantics of the tables and at- relational are the Figure below. Uppercase keys. lMIE, pnblicxmne ) erwilid, date-picked, strand) pr2-primer_id, PCR-prod-size_hi) (STSID, .t ime, AHPLHACEIIIE, denat -t smp, AIiIIEALIEHP, denat 4 tie, . . .) chain reaction comtem- In will not occur unproperly spaced, with this database nested lational reaction Important including give pickmethod, ions Aenat Primers the LAB-CODE, PCR-prod~izelo, init the pickedfiromnaint pri-primer-id, three (relational) relevant denote Figure Location). is shown schema ing-temp, PCRxxndit a chemi- at a different a successful this used will (EER) to a form in with data-entry convey of the names STS (ID, of sequenced to start by the polymerase containment. are: a pair to names (SATSRIAIJD, primer (ID, pname, is an interval the may map relations. underlying than in melt STS at of the must there in and conceptusJ Some produced, rows on of database is shown ions, pair, view the (STS) Location the are captured example STS, of Chr22DB schema. experiments. An A portion of are cultured, are entered database. vant two and to An involved, application relation primer ions and data is generator by hand. data the PCR.condit trans- applications of the data view a single STS relation, Data application schemas. 
illu- of structural since and case a good entry the notebook enters the provides a specialized The Though is a special data provides (Primers, screen transformed we or interval vector two hand. entry to be done of a complex PCR.condit in- the from Importing transformations, by complexity applications, lab data the complexity that center. largely and has spreadnotebook as directly at the data in structure, enter the done database. denoting 2) maintained or entry between actual are follows, a fragment into cells. replicas acid and of sources: preexisting as well consuming handle to a variety involves been of the form relations probes information in future of DNA of the derived; laboratory out sources modification widely each mapping 1) Cloned to be used nucleic were GDB, centers, carried time underlying of a Chromosome in” freezers, is inserted or yeast exact (PCR name site. from as Rewriting can not the differ .. comes other of some In data by Chr22DB. Probes. bacterial temperature Transformation transformation, enormously a database. some DNA melt- expected process the primers of the glamorous, formations. -. -. in (STS’S). stored stored Cloned many used Sites probes the GDB; object-oriented date a data a screen Mapping Chr22DB reagents about the and the of the to such particularly L -. . . of probes describe stage Database these has and — -. -. 1: Physical formation which being from and in briefly primers; product; location from experiments m ,.. ,’ Tag from sequence, the each in Chr22DB databases se- is shown. . ——— physical for databases, tools Sequence name, of amplified databases sheet whose Below, data archival 1,<+ in the a cross-reference probe stration types the A Sample of Two of required chromosomal not represented it; of each this banding fragments sequence. 
of DNA DNA top themselves level DNA marker to the which coarsest overlapping substring Figure named temperature conditions); the sequence thought At with a microscope, as landmarks Vertical a lin- then relationship Figure is depicted under denote yields is contained as regions its in a chromosome terns be ratory ing called disease. mapping is illustrated figure, pieces sequence may contain fragment, probes such to inheritable Physical quence The landmarks of special whose versa. sequences shorter on the probes vice their much ordering on the map areas that of a third, ear ordering in by showing sequence lated transformation subrelations schema tables. with The a complex is flattened value-baxed atomic into re- linking re- pointers attributes of relation a standard the top-level data the 1 The labo- [13]. 425 schemas in this paper were all drawn using ERDRAW JIJn 28 CHROMMOM2 1993 STS name GDB 10cU, used here Dex,ved BELL lab K1-189 D22S119 DNA Seqnent Y Tech PCR product s,,. , Lab 254 low (bp) 22 GENOWi CENTER STS DATA from single clone KI-189 COPY Probe DUMANSK1 lab K1-189 Y&C screen BUDARF 254 hqh Polymoqhlc PHP.GE vector type stat”, Probe N IN type PROGR2SS ANONYMOUS Ccmwent , PRIMERS Name K1–189. K1–189. PcR Imt Cbr 22 ,al. t95 t~ 120 CHRO!.K3SOMAL Denature. Anneal. t=w 94 ;~ :y= relation target naint erval, composed and into mat er ial, the End Q1l posltmn two by lab, STS. The pos,txm rows in erval, the two target Data to accomplish insert normalized data name fields in are mapped names table, which of the object being maintain relies be generated. on internal to accomplish data dency the also hold. least among re- data the tardepen- name one the more example, GDB at least complex each (i.e., material names. non-GDB constraints lab-code name (i.e, have imply certain get database. = “GDB” names. A Language Constraints for Database The proposed believe for expressing reasons ify and data tational ming reason express types, a deductive transformations. 
transformations about; the the though it does and of finally, is the best choice There are several should language structural expressiveness language; approach the of the code We will start then giving should be manipulation not need a general the of to have the purpose language able to compu- program- should unify 426 level. generator that rule for by generated previous and how programs. the The logical stages: each for rule database normalised appropriate inferences rather than easy are many the are to adaptation of database of the section. they genera- in two target the and for lan- it is straightforward a variety syntax two. languages. the allowing explaining the the this code form, level, Further of constraints been forms, code means at the for al- transforma- work database. into of the program, in the transformation complex data core examples normal be easy to mod- once at the have source approach only described the converted in using will entry and between a normal tar- transforma- programming generators and logic database nonrecursive a complete re-use model, and is Not trans- source be expressed to from then the interactions converted times eral data for this: and easily that code This the there two. a transformation Horn-clause about of database are how are on on the determining be implemented a variety rules that We can programs for performed ) and unambiguous tion in but transformations but tors a part reasoning since between constraints constraints can constraints databases, is baeed formal only DBMS. lab~ode Transformations language for is generated at play may rules integrity of interaction constraints specifying may must and level between First “GDB’). 3 do guage, system- links COmC.Sn’c, formations Not The transformed For one and integrity, but Buff= 1.5 Mgc12 Notebook Our ap- to the integrity constraints of Preeminent are key and inclusion constraints, strand RV w Screen lows tables. To . .. . 
:% Entry only is de- transformations, must schema identifiers must conform get database. # the statements target generated F,nal ;p LOCatlOn BUDARE tions, propriate lated c!@e, 30 a significant inserted. In order Date pinked 11/03/92 12/15/92 transformations tables interval, GDBJ.ocus) identifier Ver, fLed S0 BLOT interval, subrelation The and internal :7 2: STS 6 relational na-int sequence. the t~ 12 Un,t, BANDS material, Primers tables: (STSmame separate are linked over names, 5 target screen Ymethcd LANDER LANDER LOCATION start Q1l distributed primer, entry to are schema: Extend. :7 Figure screen telq CONDITIONS PCR Machine PcR-9600 in the Melting 56 55 Sequence (5? t. 3, ) CACCATC2?ATGGTGCAG GGGGAGACGTGATFIGAATTAA GCCC FB R2 systems. underlying language transformation data entry Finally, used in data with sev- clauses application we describe implementing Figure 3.1 Data The language model is based allows formations is similar nesting of set relational resent the rently the a wide to that and gaining (not-null data data [12]. can be used referencing a row relations. The values can not also data The for allows The Our language scribe than us to rep relation. various based that are cur- ILOG, the model allows functions to generate functions to create can applied an entirely new as an object in some type identifier, particular system generated object be variables value, from for the language by two distinct els established cies, support certain straints, some while there occur in of these language categories. provides of constraints, object important other of constraints inclusion databases and Rather including but than to express but not that language to of Datalog with database non-recursive our Datalog consider atflat or ILOG that though for many the with in non-recursive concept ([15]), of so that be recursive could Datalog. model, a very general limited the any data of existing nested for query gree of rule sically above. 
used a database concerned 427 is used and and to describe necessary primarily of this with the work and the a query. manipulation are in implement involve manipulations satisfy of IQL established may opttilzation, the to In they it ([1 2]) to manipulations. to express though rewriting for example of I LOG data constraints. languages, novel of as an or as a restriction structural contributions language some be thought languages: model, with significant transformations tive to those deductive relational the incorporate basically to be an extension dealing most way does it could be considered The fall language evolution ([11]) which do not the features, the How- making in our only, than we do not syntactic could support dependencies primitive a means may con- identity. databases Though mod- dependen- and programs of dealing concerned and to the of Datalog Iess than a a practice with limited that strictly are only; power than When is weaker which When in logic- as predicates, relations expressive we rather tuple as Datrdog used values when stratified-negation. de- established such to base but or clauses various data functional object-oriented and classes incorporate relational other of inheritance many any models dependencies remain such mentioned or biological into family keys existence concept ever our data as primitives: the be expressed from are considered. negation, clauses of constraints may case, recursion be confused. Many are being transformation that functions not kinds bound a relation of an entire names awkward transformations other ensures Skolem are in indi- variables of a transformation, languages, relation the and Individual it differs be seen to be greater with of respect to the tuples construction query becomes without which or a-s a way relation can a group part the database relational identities to values, of t uples). 
conceptual In this which access or tuple, (sets in which and or optional independent of a relation to simple describing in the required one extension found In addition, allows relations models to be either Language be bound entire a repre- STS’S. components can arbitrary for 3.2 tributes Skolem in order the and Database vidual null). of values then for is a natural but data trans- models. allows structures popularity. We use Skolem as in of data and It model, of records or relational implement constructors, object-oriented attributes and range identities. complex and a nested of [14], tuple of object of the around us to describe between model semantic of Target Model which sent ation 3: Schema deducsome rules of data Here d- are bafrom we of the are rules themselves, which and the specified with in a clear where the efficient rules can them and and database 3.2.1 converting transformations meaningful then from constraints a form manner, be easily programming in and are logically into translated @ a form into P=Q — I P#Q – inequality I P6Q — set-inclusion I I — Undef(a) arithmetic predicates utaa;$~;~eoptionai 1 False some language. P2Q I P?Q – Types Types in stract our language are given by the following ab syntax: Here t ::r & . . . . an :* t~) I irrt (l$i~g I... — set type — record — base the tuples type in {t},represents A set t ype, t.A tuple type (al or records with required tl, ..., either :, for attribute, a required Base atomic types type is a type In a flat-relation base types. as a whole variabIes type to have types. for and attributes the type language source a unique of ..., so a . . . . an t~ type we consider not in any An attribute term. a tuple all relations types will and so on, type. is strongly typed, target databases can be inferred for whose such be type each type of which in that, term in tuple # are built of a tuple. The and Atomic example main ranged syntactic over by elements of our P, Q, . . . . 
and language atoms, are ranged an in the and atom X reiation further by the ::= Src [ Tgt following abstract and by P.a I f(R, They I P(+I,.. : target — constant — — variable P,f ., #k) ) X term Id same field value as = (equal- Undef the of an optional used Fa Ise repre- in checking ~ STS would mean the that X use a compound on X: = PI, = P2) is attribute pr2_primerLd 3.2.3 c Tgt.STS a tuple 1, in the target prl.primerid attribute relation attribute P1 P2. Clauses has the form database database $+ 41,...,477 The attribute atom Not — Skolem frsnction – compourad all A clause term type # is called form 41 ,...,42 — projection . . . . attribute while syntax: source ~: I represents and STS. We could restrictions that id A clause P X predicate is = I, prl-primer-id means STS with terms, over represent values in a database, 4,*, . . . . Terms atoms are the basic building blocks of formulae. are defined the as com- Formulae which The so & the predicates nullary pr2-primerid Terms some the for the definedness situation, X(id to P: compound E (set inclusion), check error to put and has the atoms is evaluated represents the binary and y of a transformation. term a trans- using validlt For has 41, ..., w’ithin of the Id is the the of variable equivalently, an is a t uple it relative term) & P.a . . . . #n) one STS, term sents types as well is defined). with occur if the atom the y) which attribute as type, X. Id. (ineqndit program. 3.2.2 then P(q$l, relation an or, term predicates with in X, tuples (if it carries a, must example @k), then of the Atoms of a transformation, each of the it y), would given form compound target occurs .,., For schema as STS, primer, a tuple Id target of base and tuple P but term, For regarded a tuple, of the term smaller the X(&, that in the relations. database relation and or classes of those relations erval, databases in and are always to sets of the term (but source type. representing as the P.s. 
pound :* are type is a tuple are are to be interpreted 41, . . . . ijn which any attribute term, a, occurring in represent :* tl, tl, value the which be bound a attribute A compound an optional tuples, can of base of the same one represent Constants P is a term value of types so on {(al types for the 3, with na-int STS, primer formation form the of the to the be of an appropriate Our a set be A database each going in Figure sequence :“P, for to relations, a database is shown or and of the as individual and example . . . . an, :* representing string If ype tuples term attributes database int, is considered t~)} As well al, symbol oft represents values. relation relation sets of values attributes each simple with finite . . . . an :* tn) tn respectively: attribute. A :* tl, Tgt a transformation, of relations. while types contradiction Src and databases as to values for equaiity ::= T,,. the syntactically is said to and target typed with respect the types of terms when 428 we take the the head bodg of the correct be clauses weli-formed database to Tar. occurring term of the Src and in to clause, while clause. are for type meaningful. source TtGt if database it is well- Ttgt,m~aning the have clause the that au make sense type T,,. 
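The type syntax of Section 3.2.1 (set types, record types with required or optional attributes, and base types such as int and string) can be sketched as a small data structure with a conformance check. This is only an illustration under stated assumptions: the class names `Base`, `SetOf`, `Record` and the function `conforms` are hypothetical, sets of records are held in Python lists because dictionaries are not hashable, and which attributes are marked optional in the example is a guess rather than something taken from the Chr22DB schema.

```python
from dataclasses import dataclass
from typing import Any, Tuple, Union


@dataclass(frozen=True)
class Base:
    name: str  # "int" or "string"


@dataclass(frozen=True)
class SetOf:
    elem: "Type"  # the type of every member of the set


@dataclass(frozen=True)
class Record:
    # Each field is (attribute name, attribute type, required?).
    fields: Tuple[Tuple[str, "Type", bool], ...]


Type = Union[Base, SetOf, Record]
INT, STRING = Base("int"), Base("string")


def conforms(value: Any, t: Type) -> bool:
    """Return True when value is a legal instance of type t."""
    if isinstance(t, Base):
        if t.name == "int":
            return isinstance(value, int) and not isinstance(value, bool)
        return isinstance(value, str)
    if isinstance(t, SetOf):
        return isinstance(value, list) and all(conforms(v, t.elem) for v in value)
    if isinstance(t, Record):
        if not isinstance(value, dict):
            return False
        if set(value) - {name for name, _, _ in t.fields}:
            return False  # attribute not declared in the record type
        for name, field_type, required in t.fields:
            if value.get(name) is None:
                if required:
                    return False  # required attribute is undefined
                continue  # an optional attribute may be left undefined
            if not conforms(value[name], field_type):
                return False
        return True
    return False


# Hypothetical fragment of the nested data-entry screen type.
PRIMER = Record((("pname", STRING, True), ("melting_temp", INT, False)))
STS_SCREEN = Record((("name", STRING, True), ("Primers", SetOf(PRIMER), True)))

row = {"name": "KI-189", "Primers": [{"pname": "KI-189.F", "melting_temp": 56}]}
print(conforms(row, STS_SCREEN))
```

A tuple that omits an optional attribute still conforms, which mirrors the role the text gives to the Undef predicate for optional attributes.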
and Tgt to The concept ([16]), have and stricted the can in this paper each over some will of instantiation such of a clause target of database denote for the on the which values are tence dependencies for In determining a to term v, Src to + which which get vari- For generating part in Figure 3 from the head of the source of if it is true example, and (id the = Y +- X(id the A transformation target terms any following of the says that, for any two STS, if X and Y have id is a key attributes then tribute a constraint rather The the terms which target constraint Ttgt re- when ~ and tion Tgt = P12, PCR.prod-izeJo to is carried values in determining of a transformation will now for = SIf) ((pname inclusion there atof may = PN2) E primers, There databaae. only tion A only source Primers validity play some more in that for body The as examples Figure every is a corresponding 3. of priner entry id in the in for ~ Tgt.primer this the descrip This and of each al- attribute in the is bethe tuple use of the pname two in the in target data- the primer_id’s. turn generated by primer = PN, pickmethod = DP, strand= = PM, ST) G Tgt.primer ((pname = PN, prnethod = P) one a difierent are = MT, date-picked + Y(prl-primer-id f -STS Also relation. to lookup relation = f -primer(PN), + = P) makes in order melting-temp table: X(id clause deserve clause: (id an of only atoms valued presence primer tuples con- Firstly has separate set that function STS relation. STS~creen is of this base relation be counted clause Skolem sub-relation. The of the a significant in two the the the relation attribute asserts the for STS=creen in this that ids it occurs the about notice generate of a tuple atoms points Firstly another shown 6 Tgt.primer, < P12 are several cause target program. 
at = P12) prirnersj a transforma- the and terms target E Src.STS_ecreen, id the and = SH) = PN2, though as source = SL, (pname PI1 to also 2: 6 Tgt.STS E Primers, G Tgt.primer, is used to ensure shown in Figure = PI1) of databases. after tested shown id a pair target screen = PN1, comment. containing tar- clause schema (pnsme one database contains database dependency, STS table the of the = PN1) PCR..prod..sizhihi only transformations, look the the be in order part words is an example constraint Constraints part We in a clause may out id database, head, for = SL, PCR-prodsize-hi relation their can be clsmified source the 1, on other concerns the a source transformation. In clause between in Constraints straints equal. is then while Y in value, its P12), pr2-prime~id = 1) E Tgt.ST’S and STS. This connection denote X same which values terms terms. are in a clause denote terms, the for a clause than which they t uples clause in is a transformation entry = P~I, (pneme Y(id two of clauses U ndef atoms STS relation the data = f-STS(P1l, clause = 1) c Tgt.STS, class only PCR-prodsize.lo X between in a special contain exis- model. clauses. prl.-primerid + example relational not and truth A pair T8T~ and not could functiomd terms. v. For does constraints transformation contains ia an the value and the transformation one there evaluated. the called is of these traditional we are interested of the a clause denote lan- @ of types satisfy databases, types. Clearly values a considered in it is being p and said with clause, true. two the The relevant variables @ is also is dependent spectively, the welI-formed remaining that databases we take a for last using is re- of the clauses the clause together the that be expressed Datalog values. if, for some instantiation @l ,..., q$~ are true, then of the clause of semantics All be well-formed meaning set of the in [17]. from in the finite Note range-restricted. restrictions, presentation be found is variable of these 41,. 
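The two transformation clauses discussed in this section, one generating Tgt.primer tuples from the nested Primers sub-relation and one generating Tgt.STS tuples that link a pair of primers, can be read operationally as follows. The sketch is an interpretation, not the authors' implementation: the Skolem functions `f_primer` and `f_STS` are named in the text, but `make_skolem`, `transform`, the use of integer identifiers, the reading of the ordering test as `PI1 < PI2`, and the sample values are all assumptions made for illustration.

```python
from itertools import permutations


def make_skolem():
    """Build a Skolem function: equal arguments always give the same id,
    and unseen arguments give an id that has not been issued before."""
    table = {}

    def skolem(*args):
        return table.setdefault(args, len(table) + 1)

    return skolem


f_primer = make_skolem()
f_STS = make_skolem()


def transform(src_sts_screen):
    """Flatten nested STS_screen rows into target primer and STS tables."""
    tgt = {"primer": [], "STS": []}

    # Clause for Tgt.primer: one tuple per member of a nested Primers set,
    # with the identifier produced by f_primer(pname).
    for row in src_sts_screen:
        for p in row["Primers"]:
            new = dict(p, id=f_primer(p["pname"]))
            if new not in tgt["primer"]:
                tgt["primer"].append(new)

    # Clause for Tgt.STS: look up both primer ids in the target primer
    # table and keep the ordering with the smaller id first.
    primer_id = {p["pname"]: p["id"] for p in tgt["primer"]}
    for row in src_sts_screen:
        names = [p["pname"] for p in row["Primers"]]
        for n1, n2 in permutations(names, 2):
            pi1, pi2 = primer_id[n1], primer_id[n2]
            if pi1 < pi2:
                new = {
                    "id": f_STS(pi1, pi2),
                    "pr1_primer_id": pi1,
                    "pr2_primer_id": pi2,
                    "PCR_prod_size_lo": row["PCR_prod_size_lo"],
                    "PCR_prod_size_hi": row["PCR_prod_size_hi"],
                }
                if new not in tgt["STS"]:
                    tgt["STS"].append(new)
    return tgt


# Illustrative screen row; the values are made up for the example.
screen = [{
    "PCR_prod_size_lo": 254,
    "PCR_prod_size_hi": 254,
    "Primers": [
        {"pname": "KI-189.F", "strand": "FW"},
        {"pname": "KI-189.R", "strand": "RV"},
    ],
}]
tgt = transform(screen)
print(tgt["STS"])
```

Because the Skolem functions memoize their arguments, running the transformation again over the same screen data yields the same identifiers, which is what lets the STS clause refer to primers generated by the other clause.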
-., d~, is that ables in the body, the it is taken that detailed guage, The and range definitions more Tt9t, means to formal type of range-restriction strand E Tgt.STS melting.temp = PM, = ST) = MT, date-picked= DP, G Primers) E Src.STSscreen Next that each material has exactly one GDB name: We will X.y+ clauses X(staterialid = M, lab.code = “GDB” ) = J4, lab.code = “GDB” ) to source E Tgt.nemes, Y(materialid see in Section like tions this, relations in its in one-pass 3.3 that in order head. a clause body and in its Clauses without it is necessary to get of this referring only form to the to unfold that can target to refers target only rela- be processed database. E Tgt.names And, finally, False + that a public (publicneme name = cannot “Yes”, be a GDB lab.code = 3.2.4 name: “GDB” Transformation A transformation ) database E Tgt .nemes clauses 429 type that Programs program, Ttgt, are well consists formed from database type Ts,. of a set A of transformation for T STC and Tt9t. to If A is such database type iff, a transformation value Ttgt, for of type then each Ts,, v is said clause to u and program iff, if a A-transformation there exists smallest such transformation data source gram by imply get database but being in the program what certain does not smallest there formation be carried out values into done “one in the the database, target transformations in describe which which the target database is then used target database. The problem tional for program model the flat be found more relational in delicate model and than ([18, 19]) a “selectdatabase form or clauses can other suitable some recursive a tuple be 2) to by combining in some our of the this only relation, clause the STS (Figure would process it of the programs will termi- description follows that the complete. 
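The constraint clauses quoted in this section (id is a key for Tgt.STS, two GDB name tuples for the same material must be equal, and a tuple in Tgt.names may not be both public and a GDB name) can be checked mechanically against a candidate target database. The following is a minimal sketch under assumptions: the function `violations`, the dictionary layout of the tables, and the `name` attribute are hypothetical. Note that the equality clause by itself enforces at most one GDB name per material; requiring that one exists would need a separate existence constraint.

```python
def violations(tgt):
    """Return a list of (constraint, offending value) pairs for a target database."""
    found = []

    # X = Y <= X(id = I) in Tgt.STS, Y(id = I) in Tgt.STS
    by_id = {}
    for row in tgt.get("STS", []):
        first = by_id.setdefault(row["id"], row)
        if first != row:
            found.append(("STS key", row["id"]))

    gdb_name = {}
    for row in tgt.get("names", []):
        if row["lab_code"] != "GDB":
            continue
        # X = Y <= X(material_id = M, lab_code = "GDB") in Tgt.names,
        #          Y(material_id = M, lab_code = "GDB") in Tgt.names
        first = gdb_name.setdefault(row["material_id"], row)
        if first != row:
            found.append(("one GDB name per material", row["material_id"]))
        # False <= (public_name = "Yes", lab_code = "GDB") in Tgt.names
        if row.get("public_name") == "Yes":
            found.append(("GDB name marked public", row.get("name")))
    return found


# Illustrative data; the names and ids are made up.
good = {
    "STS": [{"id": 1, "pr1_primer_id": 1, "pr2_primer_id": 2}],
    "names": [
        {"name": "D22S119", "material_id": 7, "lab_code": "GDB", "public_name": "No"},
        {"name": "KI-189", "material_id": 7, "lab_code": "BELL", "public_name": "Yes"},
    ],
}
bad = {
    "STS": [{"id": 1, "pr1_primer_id": 1, "pr2_primer_id": 2},
            {"id": 1, "pr1_primer_id": 3, "pr2_primer_id": 4}],
    "names": [
        {"name": "D22S119", "material_id": 7, "lab_code": "GDB", "public_name": "Yes"},
        {"name": "D22S120", "material_id": 7, "lab_code": "GDB", "public_name": "No"},
    ],
}
print(violations(good))
print(violations(bad))
```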
A transformation program Δ is said to be complete iff, for any database value μ of type T_src, if there exists a Δ-transformation of μ then there is a unique smallest such transformation. Completeness is important because the smallest Δ-transformation is the one we wish to compute: it represents exactly the data generated from the source database by the clauses of the program, and excludes additional data that the program does not imply. If a program is not complete there is ambiguity about what the target database should be.

We are particularly interested in non-recursive transformations, which can be carried out in "one pass": that is, by reading values in the source database and inserting them directly into the target database, as opposed to transformations in which the target database is then used to compute further parts of the target database. The problem of testing whether a transformation program is recursive is a little more delicate for a nested data model than for the flat relational model and Datalog; details can be found in [17]. If a program is not recursive, it follows that the process of unfolding its clauses will terminate.

A clause need not give a complete description of a tuple in the target database: it may give only a partial description, in which case a complete description of the tuple is built by combining elements from several clauses. A normal-form clause, by contrast, provides a complete description of a tuple.
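The distinction between one-pass and recursive programs can be illustrated with the usual dependency-graph test from flat Datalog. This is a sketch under stated assumptions, not the test of [17], which handles the more delicate nested case: a clause is represented here only by the name of its head relation and the names of the relations in its body, and the clause for Tgt.primer is assumed to read only from Src.STS_screen.

```python
def is_one_pass(clauses):
    """True iff no clause body mentions a target relation."""
    return all(not rel.startswith("Tgt.")
               for _head, body in clauses for rel in body)


def is_recursive(clauses):
    """True iff some target relation depends, directly or indirectly, on itself."""
    graph = {}
    for head, body in clauses:
        deps = graph.setdefault(head, set())
        deps.update(rel for rel in body if rel.startswith("Tgt."))

    WHITE, GREY, BLACK = 0, 1, 2
    colour = {}

    def visit(node):
        # Depth-first search; meeting a GREY node again closes a cycle.
        colour[node] = GREY
        for nxt in graph.get(node, ()):
            state = colour.get(nxt, WHITE)
            if state == GREY or (state == WHITE and visit(nxt)):
                return True
        colour[node] = BLACK
        return False

    return any(colour.get(n, WHITE) == WHITE and visit(n) for n in list(graph))


# The STS clause of section 3.2.3 calls Tgt.primer in its body.
sts_program = [
    ("Tgt.STS", ["Tgt.primer", "Src.STS_screen"]),
    ("Tgt.primer", ["Src.STS_screen"]),   # assumed shape of the primer clause
]
```

Under these assumptions `sts_program` is not one-pass, because Tgt.STS is defined through Tgt.primer, but it is also not recursive, so unfolding the call to Tgt.primer terminates and yields clauses whose bodies mention only the source.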
3.3 Normal Forms

We now limit our attention to the special case in which the target database is flat relational. In this case a transformation program which is in normal form can easily be converted into a program in some query language.

Suppose the target database contains a relation R. A transformation clause is said to be in normal form if it has the form

X(a1 = P1, ..., ak = Pk, b1 = Q1, ..., bl = Ql) ∈ R ⇐ φ1, ..., φn

where a1, ..., ak are the required attributes of the relation R, and b1, ..., bl are a subset of the optional attributes of R; the atoms φ1, ..., φn contain only source terms and variables; and the terms P1, ..., Pk, Q1, ..., Ql are built using variables, constant symbols and function symbols. A transformation program is said to be in normal form if all its clauses are in normal form.

For example, the normal-form clause for inserting a tuple into the STS table (Figure 2), formed from the data-entry screen (Figure 3), would be:

(id = f_STS(PI1, PI2), pr1_primer_id = PI1, pr2_primer_id = PI2, PCR_prod_size_lo = SL, PCR_prod_size_hi = SH) ∈ Tgt.STS
⇐ (pname = PN1) ∈ Primers, (pname = PN2) ∈ Primers, (..., PCR_prod_size_lo = SL, PCR_prod_size_hi = SH) ∈ Src.STS_screen, PI1 = f_primer(PN1), PI2 = f_primer(PN2), PI1 < PI2

This gives a complete description of a tuple in the STS relation, and does not call on any other clause. In particular, the calls to the primer relation in the body of the clause in section 3.2.3 have been replaced by applications of the Skolem function f_primer.

If the clauses of a program are in normal form then they can be translated into a join-and-project expression of the relational calculus, into a "select-from-where" expression in SQL, or into some other suitable database language such as CPL ([18, 19]).

In [17] an algorithm is given which will convert a complete, non-recursive transformation program for a flat relational target database into an equivalent program in normal form. If the program is not complete, or is recursive, the algorithm will fail, returning an error.
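The one-pass reading of the normal-form STS clause can be sketched as follows. This is an illustration under stated assumptions rather than the code produced by the generators described in this paper: Src.STS_screen is assumed to be a list of dictionaries, each carrying a nested list under a hypothetical `primers` key that plays the role of the variable Primers, and each Skolem function is modelled as a memoised map from its arguments to a fresh integer.

```python
class Skolem:
    """Memoised Skolem function: equal arguments always give the same new id."""

    def __init__(self):
        self._ids = {}

    def __call__(self, *args):
        if args not in self._ids:
            self._ids[args] = len(self._ids) + 1
        return self._ids[args]


f_primer = Skolem()
f_STS = Skolem()


def sts_tuples(sts_screen):
    """Evaluate the normal-form clause for Tgt.STS in a single pass over the source."""
    result = []
    for rec in sts_screen:                      # (...) in Src.STS_screen
        for p1 in rec["primers"]:               # (pname = PN1) in Primers
            for p2 in rec["primers"]:           # (pname = PN2) in Primers
                pi1 = f_primer(p1["pname"])     # PI1 = f_primer(PN1)
                pi2 = f_primer(p2["pname"])     # PI2 = f_primer(PN2)
                if not pi1 < pi2:               # PI1 < PI2
                    continue
                row = {
                    "id": f_STS(pi1, pi2),
                    "pr1_primer_id": pi1,
                    "pr2_primer_id": pi2,
                    "PCR_prod_size_lo": rec["PCR_prod_size_lo"],
                    "PCR_prod_size_hi": rec["PCR_prod_size_hi"],
                }
                if row not in result:           # Tgt.STS is a set of tuples
                    result.append(row)
    return result
```

Because the body mentions only the source and the Skolem functions, the target is never read while it is being built; the same loop could equally be emitted as a select-from-where query, with the memoised functions replaced by whatever key-generation mechanism the target system provides.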
3.4 Transformation Tools

The algorithm for converting programs to normal form is a central part of our tools, and its implementation is an immediate requirement. Initial implementation efforts are concentrated on this convert-to-normal-form algorithm and on code generators, which take normal-form programs and generate code in a variety of database programming languages, such as SYBASE (the implementation system of Chr22DB) and CPL ([18]), which allows data to be loaded from several other types of biological data-sources. In addition, interfaces to various schema-design tools, such as ER-draw ([13]), are being developed, so that the meta-data they produce can be read and automatically converted into TSL database types and constraints. Ultimately we would like to build graphical schema-manipulation tools which automatically generate the relevant transformation clauses for a schema evolution, and so take a substantial part of the work of specifying a transformation off the users.

4 Conclusions

The complexity of the data structures involved in the Human Genome Project, the frequency of schema evolutions, and the number of heterogeneous databases with incompatible schemas with which data must be exchanged necessitate the development of tools for specifying and performing database transformations. Our experience with Chr22DB is first-hand: although the schema of Chr22DB has evolved, and data is imported from archival genomic databases such as GDB, the transformations involved were previously only partially specified and were laboriously written in SYBASE code. In the approach described here a transformation is completely specified by clauses relating the source and target data structures, together with the constraints on the underlying databases. Knowing the relationships between source and target in this form is extremely useful, and it is essential to have such a formal specification if transformations are to be developed in a more systematic manner.

Schema evolution has been the subject of much research (see [20]), and a variety of approaches and methodologies have been proposed for merging and integrating the schemas of heterogeneous databases ([22, 23, 24, 25, 26]). However, much of this work does not address the problem of transforming the underlying data, or does so only in a limited manner. A related approach is proposed in [21]. To our knowledge, no principled, systematic means of specifying and implementing such transformations has previously been established.

As already indicated, some of the future work concerns the implementation of the normal-form algorithm and of code generators for a variety of languages. Another issue is that of composing transformations. Some transformations will be applied only once, while others, such as the programs which import data from other databases such as GDB, will be applied repeatedly, since most archival databases are updated routinely. We do not want to rewrite these programs every time there is a minor evolution of the schema of the Chromosome 22 database; we therefore need algorithms for composing transformation programs, so that import programs can be composed with the transformations that reflect schema evolution.

Acknowledgements: We are indebted to David Searls for the postscript diagram of Figure 1, and to Peter Buneman, Chris Overton and David Searls for their advice and help in presenting these ideas.

References

[1] P. Pearson, "The genome data base (GDB), a human gene mapping repository," Nucleic Acids Research, vol. 19, pp. 2237–2239, 1991.

[2] W. Barker, D. George, L. Hunt, and J. Garavelli, "The PIR protein sequence database," Nucleic Acids Research, vol. 19, pp. 2231–2236, 1991.

[3] S. F. Altschul, W. Gish, W. Miller, E. W. Myers, and D. J. Lipman, "Basic local alignment search tool," Journal of Molecular Biology, vol. 215, pp. 403–410, 1990.

[4] W. R. Pearson, "Rapid and sensitive sequence comparison with FASTP and FASTA," Proc. Natl. Acad. Sci. U.S.A., vol. 85, pp. 2444–2448.

[5] National Center for Biotechnology Information, National Library of Medicine, Bethesda, MD, ENTREZ: Sequences Users' Guide, 1992. Release 1.0.

[6] M. J. Cinkosky, J. Fickett, D. Nelson, and T. G. Marr, "... restructuring of GenBank," ...

[7] K. Hart, D. Searls, and G. C. Overton, "SORTEZ: A relational translator for NCBI's ASN.1 database," Computer Applications in the Biosciences, vol. 10, no. 3, 1994. To appear. See also UPenn Technical Report CBIL-9203.

[8] J. Aaronson, J. Haas, and G. C. Overton, "QGB: A system for querying sequence database fields and features," Computational Biology, 1994. To appear.

[9] Department of Energy, DOE Informatics Summit Meeting Report, April 1993. Available via gopher at gopher.gdb.org.

[10] N. Goodman, S. Rozen, and L. Stein, "Requirements for a deductive query language in the MapBase genome-mapping database," in Proceedings of Workshop on Programming with Logic Databases, Vancouver, BC, October 1993.
[11] S. Abiteboul and P. Kanellakis, "Object identity as a query language primitive," in Proceedings of ACM SIGMOD Conference on Management of Data, (Portland, Oregon), pp. 159–173, 1989.

[12] R. Hull and M. Yoshikawa, "ILOG: Declarative creation and manipulation of object identifiers," in Proceedings of 16th International Conference on Very Large Data Bases, pp. 455–468, 1990.

[13] E. Szeto and V. M. Markowitz, "Erdraw 4.0: A graphical editor for extended entity-relationship schemas. Reference manual," Tech. Rep. LBL-PUB-3084, Lawrence Berkeley Laboratory, Berkeley, California, 1993.

[14] S. Abiteboul and C. Beeri, "On the power of languages for the manipulation of complex objects," in Proceedings of International Workshop on Theory and Applications of Nested Relations and Complex Objects, (Darmstadt), 1988. Also available as INRIA Technical Report 846.

[15] J. D. Ullman, Principles of Database and Knowledgebase Systems I. Rockville, MD 20850: Computer Science Press, 1989.

[16] J. D. Ullman, Principles of Database and Knowledgebase Systems II: The New Technologies. Rockville, MD 20850: Computer Science Press, 1989.

[17] A. S. Kosky, "A language for database transformations and constraints." Manuscript available from kosky@saul.cis.upenn.edu, 1993.

[18] L. Wong, "Querying nested collections: A dissertation proposal." Manuscript available from limsoon@saul.cis.upenn.edu, August 1993.

[19] V. Breazu-Tannen, P. Buneman, and L. Wong, "Naturally embedded query languages," in LNCS 646: Proceedings of 4th International Conference on Database Theory, Berlin, Germany, October 1992 (J. Biskup and R. Hull, eds.), pp. 140–154, Springer-Verlag, 1992. Available as UPenn Technical Report MS-CIS-92-47.

[20] J. F. Roddick, "Schema evolution in database systems: An annotated bibliography," SIGMOD Record, vol. 21, pp. 35–40, December 1992.

[21] S. Widjojo, R. Hull, and D. S. Wile, "Worldbase: A new approach to sharing distributed information," tech. rep., USC/Information Sciences Institute, February 1990.

[22] A. Motro, "Superviews: Virtual integration of multiple databases," IEEE Transactions on Software Engineering, vol. SE-13, pp. 785–798, July 1987.

[23] A. Sheth, J. Larson, J. Cornellio, and S. Navathe, "A tool for integrating conceptual schemas and user views," in Proceedings of 4th International Conference on Data Engineering, pp. 176–183, 1988.

[24] S. Navathe, R. Elmasri, and J. Larson, "Integrating user views in database design," IEEE Computer, vol. 19, pp. 50–62, January 1986.

[25] C. Batini, M. Lenzerini, and S. Navathe, "A comparative analysis of methodologies for database schema integration," ACM Computing Surveys, vol. 18, pp. 323–364, December 1986.

[26] A. Sheth and J. Larson, "Federated database systems for managing distributed heterogeneous and autonomous databases," ACM Computing Surveys, vol. 22, pp. 183–236, September 1990.