Legal Data Markup Software
CS501 Design Presentation
November 9th, 2000

Project Team
  Sponsors
    Professor William Arms
    Professor Thomas Bruce
  Reviewer
    Charles Shagong
  Developers
    Ju Joh
    Sylvia Kwakye
    Jason Lee
    Nidhi Loyalka
    Omar Mehmood
    Amy Siu
    Brian Williams

Introduction
  Objective: convert the US Code (ASCII) into well-formed, valid XML output.
  The XML output is used as input to other applications.
  Goal of end use: making law available for general public use.

Overview
  Development Environment
  Execution Environment
  Software Design
  DTD Design
  Packaging

Development Environment
  Hardware
    Server
      233 MHz Intel PII processor
      128 MB memory
      28 GB hard disk
    Notebook Computers
      400 MHz Intel Celeron processor
      96 MB memory
      4.7 GB hard disk
  Software
    Red Hat Linux 6.2
    Perl 5.6
    SSH Secure Shell 2.3
    CVS 1.10.7
    Emacs 20.5.1
    VIM 5.6

Execution Environment
  Caveat
    Client upgrades execution hardware and software environment at own risk.
    LDMS is not guaranteed to work under new conditions.

Execution Environment Naming Standards
  General Rule
    Filename Naming Convention
      Must start with a word in lower case.
      First letter of additional words in upper case.
      Example: thePerlFile.pl
    File Name Length
      Maximum of 20 characters.
  Function Names
    Must begin with a verb.
    Example: initializeModule
  Variable Names
    Must begin with qualifiers.
    Example: $error_LastErrorMessage
  Filehandle Names
    Must be all capital letters.
  XML Output File Names
    Same as input file name with ".xml" extension.
  DTD Element Names
    Element names in capital letters.
    Nested element names start with DIV.

Execution Environment Coding Standards
  A function shall not exceed 100 lines.
  A function shall have preceding comments on its purpose, preconditions, and postconditions.
  A variable shall have a purpose comment.
  Each loop shall have begin and end comments.
  A 3-space indentation shall be used for each block of code.
  Perl contractions shall not be used.
  Each file shall have a modification history log.
  Each file shall include a copyright and license notice.
  Version numbers shall correspond to major and minor revisions of the software.

Software Design
  System Architectural Components
  Modules and their descriptions
  Design Constraints
  Error Handling
  Application Environment
  User Interfaces

System Architecture
  [Figure 1: Top-level diagram of major architectural components. Boxes: Program; Read and Parse File; Language Parsing; Output.]

UML Component Diagram
  [Diagram components: LDMS Main; File Parser (IH, SM); Natural Language (WsPM, WPM); Output (EMH, FC, StM, XMH).]

File Parser Component
  File Parser
    Input Handler
    State Machine

Natural Language Component
  Natural Language
    Whitespace Pattern Matching
    Word Pattern

Output Component
  Output
    Error Message
    File Creator
    Status Message
    XML Output Handler

[Figure 2: UML class diagram for LDMS. Classes: WhiteSpacePatternMatching, StateMachine, WordPatternMatching, Input, StoreAndOutputErrors, StoreAndOutputFile, CreateFile, Status.]

Design Constraints
  8-bit ASCII input files.
  Non-uniform title structure.
  Unattended operation.

Title Variation Example
  -CITE
      11 USC Sec. 506                                   01/23/00
  -EXPCITE
      TITLE 11 - BANKRUPTCY
      CHAPTER 5 - CREDITORS, THE DEBTOR, AND THE ESTATE
      SUBCHAPTER I - CREDITORS AND CLAIMS
  -HEAD
      Sec. 506. Determination of secured status

Title Variation (cont’d)
  -CITE
      46 USC Sec. 13102
  -EXPCITE
      TITLE 46 - SHIPPING
      Subtitle II - Vessels and Seamen
      Part I - State Boating Safety Programs
      CHAPTER 131 - RECREATIONAL BOATING SAFETY
  -HEAD
      Sec. 13102. Program acceptance
  01/05/99

Error Handling
  Handled at topmost level.
  Processed by StoreAndOutputErrors module.
  Standard report format:
    <date> <time> <input filename> <user id> <line number> <error message>
  Four main categories of errors.

Error Categories
  Error                          Resolution
  Improper command.              Print brief usage help, exit.
  Output file already exists.    Exit and log error message unless overwrite flag is set.
  Linux system error.            Log to standard error, exit.
  Non-critical data error.       Tag region as unprocessed, continue.
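
The standard report format and the four error categories can be sketched as follows. This is only an illustration under stated assumptions: LDMS itself is written in Perl, Python is used here for brevity, and the function names, category labels, space separator, and date/time formats are all hypothetical, since the slides specify only the order of the fields.

```python
from datetime import datetime


def format_error_report(input_filename, user_id, line_number, message, when=None):
    """Join the six report fields in the order given on the slide.

    The MM/DD/YY date and HH:MM:SS time formats are assumptions.
    """
    when = when or datetime.now()
    fields = [
        when.strftime("%m/%d/%y"),   # <date>
        when.strftime("%H:%M:%S"),   # <time>
        input_filename,              # <input filename>
        user_id,                     # <user id>
        str(line_number),            # <line number>
        message,                     # <error message>
    ]
    return " ".join(fields)


# Hypothetical labels for the four categories in the table above.
CATEGORIES = ("improper_command", "output_exists", "system_error", "non_critical_data")


def is_fatal(category, overwrite=False):
    """Return True when the listed resolution ends with the program exiting."""
    if category not in CATEGORIES:
        raise ValueError("unknown error category: " + category)
    if category == "output_exists":
        # Exits unless the overwrite flag is set.
        return not overwrite
    # A non-critical data error tags the region as unprocessed and continues.
    return category != "non_critical_data"


if __name__ == "__main__":
    print(format_error_report("title11.txt", "jdoe", 42, "Unrecognized field tag"))
```

Keeping the formatting in one function matches the slide's point that errors are handled at the topmost level by a single module.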
Application Environment
  Preconditions
    Input files must exist in a known path.
    Required hardware and software must be available.
    Sufficient system resources must be free.
  Postconditions
    A valid, well-formed XML document conforming to our DTD will be produced.

User Interface Design
  Very little runtime interactivity required.
  Command-line operation.
  Allows batch processing.

Command-Line Arguments
  Parameter        Effect
  -O <filename>    Output XML to <filename>.
  -F               Force overwriting of existing file.
  -V               Verbose error and status messages.
  -L#              Status messages every # lines processed.
  -?               Display help message.

Status Reporting
  Frequency of status reports controlled by -L parameter.
  Default is no status reporting.

Module Diagrams
  Diagrams can be divided into two categories:
    Structural diagrams.
      Flow diagram.
    Behavioral diagrams.
      Culture diagram.
      Context diagram.

Flow Diagram
  [Diagram. Actors: House, Cornell LII, LDMS, Public. Flow labels: U.S. Code (ASCII), U.S. Code (ASCII), U.S. Code (XML), U.S. Code.]

Culture Diagram
  [Diagram. Actors: House, Cornell LII, LDMS, Public. Remarks shown: "Format of code is not negotiable."; “Why does publishing take so long?”; "Seriously faulty input must be manually resolved."; "XML should be double-checked."]

Context Diagram
  [Diagram. Actors: House of Representatives, Legal Data Markup System, Cornell Legal Information Institute. Objects: U.S. Code, XML. Edge labels: Produces, Uses as Input, Produces, Executes, Downloads, Publishes.]

DTD Schema
  [Diagram of the element hierarchy. Elements: STRUCTDIV, TITLEDATA, NAVGROUP, CITE, HEAD, EXPCITE, DIVEXPCITE, SOURCE, DIVSOURCE, STATUTE, (FIELD TAGS), STATAMEND, DATATEXT, DATATEXTNAME, XREF.]

The <STRUCTDIV> Tag
  Generic tag to define structural divisions.
  May contain <TITLEDATA>, parsed character data (#PCDATA), or another <STRUCTDIV>.
    NAME - Label of division.
    VLEVEL - Depth of division.
    HLEVEL - Sequential order of division.
    EID - Globally unique identifier.

The <TITLEDATA> Tag
  A container for sequences of fields (dashline-tagged text).
  May contain <NAVGROUP>, <STATUTE>, #PCDATA, or any of the field tags (MISC1-MISC8, REFTEXT, COD, CHANGE, TRANS, EXEC, CROSS, SECREF).
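
The nesting rules for <STRUCTDIV> and <TITLEDATA> can be illustrated with a short sketch. It is written in Python for brevity (LDMS itself is written in Perl) and rests on several assumptions: the helper names are hypothetical, the EID is drawn from a simple counter because the slides only require uniqueness, and the attribute names are written in lower case to match the sample output on the "LDMS Tags in Action" slide.

```python
import itertools
import xml.etree.ElementTree as ET

# Hypothetical EID source: a plain counter. The slides only require that
# the EID be globally unique, not how it is generated.
_next_eid = itertools.count(1)


def add_structdiv(parent, name, vlevel, hlevel):
    """Create a STRUCTDIV element, nested under parent when one is given."""
    attributes = {
        "name": name,                 # label of the division
        "vlevel": str(vlevel),        # depth of the division
        "hlevel": str(hlevel),        # sequential order of the division
        "eid": str(next(_next_eid)),  # unique identifier (assumed scheme)
    }
    if parent is None:
        return ET.Element("STRUCTDIV", attributes)
    return ET.SubElement(parent, "STRUCTDIV", attributes)


def build_example():
    """Build TITLE > CHAPTER > Sec., with TITLEDATA inside the section."""
    title = add_structdiv(None, "TITLE", vlevel=1, hlevel=1)
    chapter = add_structdiv(title, "CHAPTER", vlevel=2, hlevel=1)
    section = add_structdiv(chapter, "Sec.", vlevel=3, hlevel=1)
    # TITLEDATA may hold #PCDATA directly, so plain text is allowed here.
    titledata = ET.SubElement(section, "TITLEDATA")
    titledata.text = "Sec. 1. Words denoting number, gender, and so forth"
    return title


if __name__ == "__main__":
    print(ET.tostring(build_example(), encoding="unicode"))
```

Because one generic element carries the division label in an attribute, the same code handles titles that use Subtitle and Part levels as well as those that use only Chapter and Subchapter.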
Navigational Tags
  <NAVGROUP> - Container for navigational information, such as <CITE>, <HEAD>, and <EXPCITE>.
  <CITE> - Label, section number, and title.
  <EXPCITE> - Hierarchy of catchlines.
  <DIVEXPCITE> - Individual catchline.
  <HEAD> - Name of current TOC section.

Content Tags
  <STATUTE> - Container for actual legal data.
  <SOURCE> - List of relevant sources.
  <DIVSOURCE> - Individual sources within a <SOURCE> tag.
  <STATAMEND> - Amendments to a statute.

Data Tags
  <DATATEXT> - Text that consists of a centered header, followed by content.
  <DATATEXTNAME> - Header of the current data.
  <XREF> - Cross-reference: a link to another area of the USC.

LDMS Tags in Action
  -CITE
      1 USC Sec. 1
  -EXPCITE
      TITLE 1 - GENERAL PROVISIONS
      CHAPTER 1 - RULES OF CONSTRUCTION
  -HEAD
      Sec. 1. Words denoting number, gender, and so forth
  …
  01/23/00

LDMS Tags in Action
  <STRUCTDIV name="Sec." vlevel="3" hlevel="1" eid="112358">
    <TITLEDATA>
      <NAVGROUP>
        <CITE titlenumber="1">
          1 USC Sec. 1
        </CITE>
        <EXPCITE level="3">
          TITLE 1 - GENERAL PROVISIONS
          CHAPTER 1 - RULES OF CONSTRUCTION
        </EXPCITE>
        <HEAD>
          Sec. 1. Words denoting number, gender, and so forth
        </HEAD>
        …
  01/23/00

Packaging
  Release package will include:
    Documentation
    Source Code
    Executable Files
    Data Files

Documentation
  Source-level documentation.
  Program design document.
  DTD design document.

Source-Level Documentation
  Required for inclusion in each build.
  Source code comments.
  Separate text files.

Program Design Document
  Intended as developer/maintainer resource.
  High-level view of processing engine.
  Individual processing components.
  Component interfaces.
  Updated as development progresses.

DTD Design Document
  Resource for DTD developers and maintainers.
  List of all elements and use.
  List of all attributes and use.
  Modified as development progresses.

Source Code
  Source code for prototypes will not be considered deliverables.
  Testing harnesses will not be considered deliverables.
  All source code for release version will be provided.

Executables and Data Files
  One executable script file.
  No other executables will be included.
  DTD will be considered a deliverable.

Installation
  No installation script is planned.
  Path to Perl binary must be specified at head of executable script.
  Project directory must be copied in its entirety to desired location.
  Relative paths within directory must remain unchanged.
  User must have write permission