Lab II – READ Product Prototype Specification
Andrew Moss
CS411
Janet Brunelle
April 10, 2013
Version 1

Table of Contents
1 Introduction
1.1 Purpose
1.2 Scope
1.3 Definitions, Acronyms, and Abbreviations
1.4 References
1.5 Overview
2 General Description
2.1 Prototype Architecture Description
2.2 Prototype Functional Description
2.3 External Interfaces
2.3.1 Hardware Interfaces
2.3.2 Software Interfaces
2.3.3 User Interface
2.3.4 Communication Protocols and Interfaces
3 Specific Requirements
3.1 Functional Requirements

List of Figures
Figure 1 – Major Functional Component Diagram
Figure 2 – User Process Flow
Figure 3 – Scraper Process Flow
Figure 4 – READ Site Map

List of Tables
Table 1 – Side-by-side Comparison of Real World Product and Prototype

1 Introduction
Publications are the primary method of distributing the results that come from conducting research. There are approximately 4,600 universities (NCES, 2011) that “account for more than half of the basic research conducted in the United States” (McRobbie, 2012). Unfortunately, many of these institutions lack an efficient online resource for organizing and displaying both the publications resulting from their research and information about the grants that helped finance it.
Such a system would provide research universities and their departments, as well as the students and professors performing the research, with increased recognition and awareness of their work.

One example of a university in need of an improved publication system is Old Dominion University (ODU) in Norfolk, Virginia. Its Computer Science Department (ODUCS), in particular, would benefit a great deal from a well-maintained online system for publications and grants, as it lacks one entirely. The department’s professors are burdened with manually updating their own web pages to publicize their recent publications. In the past, a single web page for the entire department was maintained by an individual member of the Systems Group. However, this page was last updated in 2008, likely because the process was slow, tedious, and manual.

1.1 Purpose
The team behind READ, a Repository for Electronic Aggregation of Documents, intends to address the lack of quality online resources for displaying publications and grants. The READ system will use a scraper to provide researchers with a method of organizing their publications and grants in a format that allows for easy searching, sorting, filtering, and browsing. Additionally, content authors will be able to verify that the listed publications are actually their own work in the event that READ mistakenly shows something written by another researcher with the same name.

1.2 Scope
A prototype READ system will be developed for ODUCS as a proof of concept to demonstrate its most basic capabilities. A prototype is necessary due to time constraints placed on development.
This prototype will provide public and private user interfaces to publication and grant databases, user controls for publication verification, and most importantly, a scraper that will gather links to publications automatically at set intervals to minimize manual effort.

The READ prototype will use data from real ODUCS authors, and the database will be populated with their publications by Schaefer’s Scraper. It will offer nearly the same functionality as the Real World Product (RWP). Due to limited development time, the prototype will not feature graphical representation of data about publications and grants, nor will it implement a learning algorithm to automatically decide whether a publication likely belongs to a specific author. This is shown in Table 1.

Table 1 – Side-by-side Comparison of Real World Product and Prototype

| Feature | Real World Product | Prototype |
| Browsing Capabilities | Ability to browse all grants and publications. | Ability to browse all grants and publications. |
| Publication Filtering Capabilities | Filtered by title, publisher, authors, publication date, date added, and keywords. | Filtered by title, publisher, authors, publication date, date added, and keywords. |
| Grant Filtering Capabilities | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. |
| Add, edit, and delete publications and grants | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. |
| Faculty page | Lists faculty and provides a link to each person’s profile page. | Not included. |
| Login interface | Linked to Old Dominion University Computer Science accounts. | Linked to Old Dominion University Computer Science accounts. |
| Profile Page | Displays the author’s profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Displays graphs. | Displays the author’s profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Graphs not included. |
| Scraper | Will update the system with new publications and grants and alert users when one is added to the system under their name. | Will update the system with publications only and alert users when one is added to the system under their name. |
| Prediction algorithm | Predicts if the consumer has enough space to use the READ system. | Not included. |
| Administrative Privileges | Administrators are able to edit, add, or remove anything in the system. | Administrators are able to edit, add, or remove anything in the system. |

1.3 Definitions, Acronyms, and Abbreviations
Administrator/Administrative User: A user with increased privileges for editing database content.
Author: A person who is able to add publications and grants to the system under their name and to edit them.
BibTeX: A plain-text file format for bibliographic reference information. It will be used to automatically fill in key information when uploading or editing publications and grants.
Computer Science (CS): An academic discipline based on advancing computing theory and algorithm development that sometimes includes theory about software engineering methods.
Client application: In a client/server architecture, the module that takes input, creates queries to be processed by a server, and receives the results from the server.
Client/Server Architecture: A software engineering paradigm that separates functionality into a “client” application and a “server” application that interact.
CSS: A style sheet language used to specify the presentation of HTML pages.
Data Mining: The act of going through a source of input to find specific information.
Database Schema: A description of the structure of a database.
Funding Agency: The source of funds for research grants. These organizations usually have a limited amount of money to award to principal investigators who submit an accepted application for research funds.
Git: A software system for controlling and organizing software versions.
Google Scholar: A service that provides a simple way to broadly search for scholarly literature.
Graphical User Interface (GUI): A computer interface composed of icons, text fields, menus, etc., through which a user interacts with a software application via a mouse and keyboard. The term differentiates it from a “command-line interface”, in which a user interacts with a software application solely through a text terminal.
jQuery Sparklines: A development library for the visualization of data.
ODU: Old Dominion University.
Microsoft Academic: Microsoft Academic Search is a free service developed by Microsoft Research to help scholars, scientists, students, and practitioners quickly and easily find academic content, researchers, institutions, and activities.
MySQL: A relational database management system.
Parse: A technical term usually used to describe the processing of a statement written in a programming language. May be used generally to describe the processing of any statement for specific meaning.
Perl: A programming language widely used on the server side of web applications.
PHP: A programming language widely used on the server side of web applications.
Principal Investigator (PI): The primary researcher to whom a research grant is awarded, responsible for documenting the work and publishing research results.
Publication or Academic Publication: A document created by a faculty member to share research. Publications usually appear in academic journals, technical reports, and records of conference proceedings.
Query: A statement sent to the database to either change its contents or retrieve results.
READ: Repository for Electronic Aggregation of Documents.
RSS: A system for subscribing to and distributing news.
Scraper: An automated application designed to scan a source of input such as a document or a website for pertinent information.
Server application: In a client/server architecture, the module that takes queries or requests from a client module, processes them, and returns the result to the client.
Software Compatibility: A description of whether different software packages, or versions of software, can communicate and interact.
SQL: A widely used language for querying and modifying relational databases.
SQL injection: An attack in which malicious input is used to perform unauthorized queries on a database.
User Authentication: The process of verifying the access credentials of a user of an automated system, usually accomplished by requesting a username and password combination.
Viewer: In the scope of this project, an outside person who wishes to query the information contained in the READ database.
Version Control: A method for organizing and recording different versions of documents that have been created over time.
Virtual Server: A software version of a hardware server.
Webserver: A group of applications run on a computer or virtual private server (VPS) to serve webpages and provide server-side computation for browser-based client applications. A web server is a constantly “on” resource whose sole or main job is to respond to HTTP requests from browsers.
XML: Extensible Markup Language.
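To make the Query and SQL injection entries concrete, the following is a minimal sketch of a parameterized query. It uses Python's built-in sqlite3 module as a stand-in for MySQL, and the table name, column name, sample titles, and the function find_papers_by_title are hypothetical names chosen for illustration only.

```python
import sqlite3


def find_papers_by_title(conn, title_fragment):
    """Return titles containing title_fragment, using a bound parameter."""
    # The "?" placeholder keeps user input out of the SQL text itself,
    # so input such as "' OR '1'='1" is treated as a literal search string.
    cursor = conn.execute(
        "SELECT title FROM papers WHERE title LIKE ? ORDER BY title",
        (f"%{title_fragment}%",),
    )
    return [row[0] for row in cursor.fetchall()]


# Hypothetical in-memory data, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE papers (title TEXT)")
conn.executemany(
    "INSERT INTO papers VALUES (?)",
    [("Web Archiving",), ("Digital Libraries",)],
)
```

Because the search text is passed separately from the statement, a malicious fragment matches no rows instead of altering the query.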
1.4 References
McRobbie, Michael A. (2012, December 19). The Multibillion-Dollar Threat to Research Universities. From The Chronicle of Higher Education: http://chronicle.com/article/The-Multibillion-Dollar-Threat/136363/
Moss, Andrew. (2013). Lab I – READ Product Description. Norfolk, VA: Author.
National Center for Education Statistics. (2011). Degree-granting institutions and branches, by controls and level of institution and state or jurisdiction, 2010-11. From the Digest of Education Statistics: http://nces.ed.gov/programs/digest/d11/tables/dt11_280.asp

1.5 Overview
This product specification details the features, components, and capabilities of the READ prototype, as well as all necessary hardware and software. The following sections offer further information to that effect.

2 General Description
READ is an automated system that uses a database to store links to articles, the publications themselves, and information about the associated grants. It will allow anyone with Internet access to browse the lists of publications and filter them by author, date, keywords, and publication type. It will minimize manual effort on the part of authors by automatically finding their publications, making it easier to manage the work they have already done.

2.1 Prototype Architecture Description
The major functional components of the READ solution prototype are shown in Figure 1. The scraper will comb through a pre-defined list of specific web sites, searching for new publications by the author names given as input. It will then parse the results and export them to a MySQL database. The database will store links to publications, information about the publications, and in some cases, the publications themselves. It will also contain information about grants associated with the aforementioned publications.
Additionally, unique strings that identify the authors at the external web sites will be stored in the database.

The web interface contains both public and private sections. The latter will be accessible only to document authors and administrative staff. Access to this section will be strictly protected by requiring user authentication before it can be viewed. The web interfaces will be written in a combination of jQuery/JavaScript and PHP.

Figure 1 – Major Functional Component Diagram

2.2 Prototype Functional Description
READ will allow anyone with Internet access to view publications, grants, and author profiles. To access more features, the user will have to log in with valid ODUCS Linux/Unix credentials. If invalid credentials are entered, the user will remain a viewer. Upon successful authentication, the user will be identified as either an author or an administrator. If the user is determined to be an author, she will be able to edit her own publication and grant information, add missing publications and grants, and edit the information displayed on her public profile. Alternatively, if the user is an administrator, she will be able to edit or remove publication and grant information and edit anyone’s profile information. This process flow is visualized in Figure 2.

Figure 2 – User Process Flow

The scraper starts by searching external websites for publications by authors from the Computer Science Department. For each publication it finds, it checks whether the publication is already referenced in the database. If the publication is already in the database, the scraper will check whether the author for whom it was searching is listed as an owner/author of the paper.
If the author is not already associated with the work in the database, the association is made but set to an unapproved status. If the author is already associated with it, the scraper simply resumes searching for publications. In the event that a scraped publication is not already in the database, it is added to the database and the author is added as one of its owners. However, this publication will not be made viewable yet, as it will be in an unapproved status. A cron job will run periodically to send e-mail notifications to authors that a publication has been attributed to them. If an author denies ownership of a paper, the database will be queried to determine whether the publication is owned by any other author in the database. Should the query return true, the author for whom the scraper was originally searching is removed from the list of the publication’s owners. Otherwise, the paper will be removed from the database entirely. In the event that the author confirms that she wrote the work in question, the database is queried to determine whether there are any other authors who should also be added to the list of owners. This chain of events is illustrated in Figure 3.

Figure 3 – Scraper Process Flow

2.3 External Interfaces

2.3.1 Hardware Interfaces
READ will not require any custom-built hardware. Any device with Internet connectivity and a web browser can be used to test its functionality.

2.3.2 Software Interfaces
A physical server running the Microsoft Hyper-V hypervisor will host the virtual machine where the READ solution is being developed. The READ database will be hosted on a MySQL server. The MySQL command-line client will be used to connect to the server instance. The READ web site uses Joomla, an open source content management system, and is written with a combination of PHP and jQuery/JavaScript. Python was used to write and modify the scraper.
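The scraper flow described in Section 2.2 (deduplicate, attach the author in an unapproved state, then handle a denial or confirmation) can be sketched in Python as follows. This is a minimal in-memory sketch, not the actual scraper: the class and method names are hypothetical, the dictionary stands in for the MySQL database, approval is tracked per author association as a simplifying assumption, and the e-mail notification and the co-author lookup on confirmation are left out.

```python
from dataclasses import dataclass, field

# Status flags; the letters echo the approved/unapproved values in Section 3.4.4.
APPROVED, UNAPPROVED = "A", "U"


@dataclass
class Paper:
    title: str
    link: str
    owners: dict = field(default_factory=dict)  # author name -> status flag


class Repository:
    """Hypothetical in-memory stand-in for the READ database."""

    def __init__(self):
        self.papers = {}  # keyed by publication link

    def record_scraped(self, author, title, link):
        """Store a scraped result; return True if a new association was made."""
        paper = self.papers.get(link)
        if paper is None:
            # Publication not yet in the database: add it.
            paper = Paper(title=title, link=link)
            self.papers[link] = paper
        if author in paper.owners:
            return False  # already associated, nothing to do
        paper.owners[author] = UNAPPROVED  # hidden until the author responds
        return True

    def deny(self, author, link):
        """Author denies ownership: drop the author, or the paper if orphaned."""
        paper = self.papers[link]
        paper.owners.pop(author, None)
        if not paper.owners:
            del self.papers[link]

    def confirm(self, author, link):
        """Author confirms ownership: mark the association approved."""
        self.papers[link].owners[author] = APPROVED
```

A return value of True from record_scraped marks the cases in which the periodic notification job would have something to report to the author.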
2.3.3 User Interface
A site map showing the user interfaces can be seen in Figure 4.

Figure 4 – READ Site Map

2.3.4 Communication Protocols and Interfaces
READ will only make use of TCP/IP.

3 Specific Requirements

3.1 Functional Requirements
UI Requirements (Jacob Phillmon and Marcus Zehr)
The UI is what a person using the READ system will actually see. It governs all the functions of the READ display and allows people to interact with the system. The UI will be used by many types of people, including viewers, authors, and administrators, and extra interface functionality will be provided for each. The UI must meet the following requirements:

3.1.1 Publications Query Page
This page is used to browse through all publications in the system. Filters can be chosen to narrow the results to publications of interest. Publications consist of many forms of academic media, including but not limited to articles in conference proceedings, journal articles, tech reports, and abstracts. They are usually based on research done by specific individuals. The Publication Query Page must serve the following functional requirements:
1. The page must initially display publications with those that were most recently published at the top of the page.
2. The page must allow the following filters for the publications displayed:
   a. Title
   b. Authors
   c. Date published
   d. Date added
   e. Keywords
   f. Publisher
3. The page must display the following information for publications:
   a. Title
   b. Authors
   c. Date published
   d. Date added to system
   e. Conference name or journal name or TR number
   f. Volume number
   g. Number of pages
   h. Page numbers
   i. Abstract (if available)
   j. A clickable link to where the publication is located
   k. Thumbnail image

3.1.2 Grants Query Page
This page is used to browse through all grants within the system.
Filters can be chosen to narrow the results to grants of interest. Grants are lump sums of money awarded to Old Dominion University faculty for research. The Grant Query Page must serve the following functional requirements:
1. The page must initially display grants with those that were most recently granted at the top of the page.
2. The page must allow grants to be filtered by the following:
   a. Title
   b. Funding agency
   c. Principal or co-principal investigator
   d. Start date
   e. End date
   f. Active state
3. The page must display the following information for grants:
   a. Title
   b. Funding agency
   c. Award amount
   d. Principal and co-principal investigators
   e. Start date
   f. End date
   g. Division
   h. Award number
   i. Abstract
   j. A clickable link to where the grant is located

3.1.3 Main Page
The main page is the home page of the READ application. It is the first page that will be visited by anyone browsing the system. The most recent documents in the system can be found here, along with navigation to other parts of the user interface. The Main Page must serve the following functional requirements:
1. The main page must display the most recent publications and grants that have been added to the system.
2. The number of publications and/or grants displayed must match the number currently set by the system administrator; if none is set, the page defaults to listing items published in the last 3 months.
3. It must allow navigation to the following pages:
   a. Grants page
   b. Publications page
   c. Login page
   d. Profile page

3.1.4 Login Page
The login page will allow registered users to log into the system for authentication purposes. Logged-in users will be able to edit publications or grants they own. The Login Page must serve the following functional requirements:
1. Provide an interface for a user to enter his or her login information.
2. All login information must be linked to Old Dominion University CS accounts.

3.1.5 Profile Page
The profile page will list information on a specific user. This page is used to view any publications and grants associated with that user. Logged-in users can use their profile page to edit their own profile as well as any publications and grants they own. The Profile Page must serve the following functional requirements:
1. The profile page must display all grants and publications the user currently has in the system.
2. Provide the user the ability to select an option to edit information in grants and publications the user owns after said user has logged into the system.
3. The profile page must display the following:
   a. Profile picture
   b. Job title
   c. Email addresses
   d. Personal webpage link
4. Provide the user the ability to select an option to edit the information displayed.
5. Provide the user the ability to select an option to add publications or grants manually into the system. This function must be blocked if the user has not logged in or is not the owner of the specified profile page.
6. Provide the user the ability to edit the profile page information, as well as grant and publication data. This function must be blocked if the user has not logged in or is not the owner of the specified profile page.

3.1.6 Publication Add Page
The Publication Add Page will allow the user to submit publications manually into the system. The page will allow logged-in users to add publications to the system if they do not wish to wait for the scraper to add them. The Publication Add Page must serve the following functional requirements:
1. Provide the user with the ability to enter publication fields manually.
2. Provide the user with the ability to submit a BibTeX document to automatically fill in various fields.
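The BibTeX auto-fill requirement above could be approached along the following lines. This is a minimal sketch in Python (the language the scraper is written in), not the prototype's actual implementation: the function names, the form field names, and the sample entry are hypothetical, and the regular expressions handle only a single entry with at most one level of nested braces.

```python
import re

# Matches the entry header, e.g. "@article{key,".
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
# Matches "name = {value}", "name = "value"", or "name = 123".
FIELD_RE = re.compile(
    r'(\w+)\s*=\s*(?:\{((?:[^{}]|\{[^{}]*\})*)\}|"([^"]*)"|(\d+))'
)


def parse_bibtex_entry(text):
    """Parse the first BibTeX entry in text into a dict of lowercase fields."""
    header = ENTRY_RE.search(text)
    if header is None:
        raise ValueError("no BibTeX entry found")
    fields = {"entry_type": header.group(1).lower(), "cite_key": header.group(2)}
    for match in FIELD_RE.finditer(text, header.end()):
        # Exactly one of the three value groups is set for each match.
        value = next(g for g in match.groups()[1:] if g is not None)
        fields[match.group(1).lower()] = " ".join(value.split())
    return fields


def prefill_publication_form(bibtex_text):
    """Map parsed BibTeX fields onto hypothetical publication form fields."""
    f = parse_bibtex_entry(bibtex_text)
    return {
        "title": f.get("title", "").replace("{", "").replace("}", ""),
        "authors": [a.strip() for a in f.get("author", "").split(" and ") if a.strip()],
        "year_published": f.get("year", ""),
        "venue": f.get("journal") or f.get("booktitle") or "",
        "publisher": f.get("publisher", ""),
    }


# Hypothetical sample entry, for illustration only.
SAMPLE = """@article{sample2013,
  title = {An {Example} Title},
  author = "Doe, Jane and Roe, Richard",
  journal = {Example Journal},
  year = 2013
}"""
```

Fields missing from the BibTeX document come back empty, so the user can still complete them manually as the first requirement allows.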
3.1.7 Grant Add Page
The Grant Add Page will allow the user to submit grants manually into the system. The page will allow logged-in users to add grants to the system if they do not wish to wait for the scraper to add them. The Grant Add Page must serve the following functional requirements:
1. Provide the user with the ability to enter grant fields manually.

3.1.8 Profile Edit Page
The profile editing page will allow the user to submit changes to their profile page. A logged-in user would use this page to edit displayed information such as their job title, additional email addresses, personal website link, or profile picture. The Profile Edit Page must serve the following functional requirements:
1. Provide the user the ability to alter existing profile information including:
   a. Job title
   b. Email addresses
   c. Personal webpage link
2. Provide the user the ability to submit a profile picture.

3.1.9 Publication Edit Page
The publication editing page will allow the user to alter their own publication information. This page would be used to edit publications that are already stored in the system. The main reason for this would be to fix any mistakes that may have been introduced during the scraping process. The Publication Edit Page must serve the following functional requirements:
1. Provide the user the ability to alter information within the publication data.
2. Provide the user the ability to submit a BibTeX document to automatically fill in various fields.
3. Provide the user the ability to remove the publication from the system if it is not their own.

3.1.10 Grant Edit Page
The grant editing page will allow the user to alter their own grant information. This page would be used to edit grants that are already stored in the system. The main reason for this would be to fix any mistakes that may have been introduced during the scraping process. The Grant Edit Page must serve the following functional requirements:
1. Allow the user to review existing grant information.
2. Allow the user to alter information within the grant data.
3. Allow the user to remove a grant from the system if it is not their own.

3.1.11 Administration Page
The administration page houses all abilities that are restricted solely to system administrators. Administrators would use this page to edit system settings such as the number of grants and publications displayed per query page. The Administration Page must serve the following functional requirements:
1. Provide an administrator the ability to set the default number of publications displayed on the publications query page.
2. Provide an administrator the ability to set the default number of grants shown on the grants query page.
3. Provide an administrator the ability to set the default number of grants or publications displayed in the RSS feed on the main page.

3.2 System User Requirements (Jacob Phillmon)
The system users are people who access the READ system through the UI and interact with it in many ways. There are three different types of users, each of which has unique privileges. The system users are made up of the following types and requirements:

3.2.1 Viewer Requirements
Viewers are able to view the system but are unable to edit any information within it. They are people who may wish to use the system to view publications and grants that are already in the system, but not add anything to it. Viewers must have the following capabilities:
1. Viewers must have access to the following pages:
   a. Main Page
   b. Publications Query Page
   c. Grants Query Page
   d. Profile Page
   e. Login Page
2. Viewers are able to view grants and publications stored within the system.
3. Viewers are able to view personal profile pages of registered users.

3.2.2 Author Requirements
Authors are able to both view the system and edit the information to which they have access.
They are the people who actually add publications and grants to the system under their own names. They must have the following capabilities:
1. They must have access to the following pages:
   a. Main Page
   b. Publications Query Page
   c. Grants Query Page
   d. Profile Page
   e. Login Page
2. Authors are able to view grants and publications stored within the system.
3. Authors are able to add grants and publications to the system manually.
4. Authors are able to edit grants and publications they own.
5. Authors are able to edit their own profile page.

3.2.3 Administrator Requirements
Administrators are able to view the system as well as edit any information displayed on the system. Administrators differ from authors in that they do not actually own any publications or grants in the system. They are, however, able to make adjustments to anything within the system. Administrators must have the following capabilities:
1. They must have access to the following pages:
   a. Main Page
   b. Publications Query Page
   c. Grants Query Page
   d. Profile Page
   e. Login Page
   f. Administration Page
2. Administrators are able to view grants and publications stored within the system.
3. Administrators are able to add grants and publications to the system manually.
4. Administrators are able to edit any grant and publication stored within the system.
5. Administrators are able to edit any profile page.
6. Administrators are able to set the default number of publications and grants displayed per page.

3.3 Backend User Interface (Jim Lawrence Calderon)

3.3.1 Publications Page
The backend of this page will be responsible for querying the database for the information pertaining to publications that will be displayed to the viewer.
1. By default, publications are queried to show the most recent publications first.
2. Alter results shown based on the following filters:
   a. Title
   b. Author
   c. Date published
   d. Keywords
   e. Publisher

3.3.2 Grants Page
The backend of this page will be responsible for querying the database for the information pertaining to grants that will be displayed to the viewer.
1. By default, grants are queried to show the most recent grants first.
2. Alter results shown based on the following filters:
   a. Funding agency
   b. Award amount
   c. Investigator type
      i. Principal
      ii. Co-principal
   d. Start date
   e. End date
   f. Current activity status

3.3.3 Login
1. The following must be verified:
   a. Username is alphanumeric
   b. Credentials entered by user exist in the database
   c. Password is correct

3.3.4 Editing
The backend of this page will allow privileged users to update information on publications and grants using a form with an input for each updateable field.
1. Update the following information within the database based on user input for:
   a. Publications
      i. Title
      ii. Author
      iii. Date published
      iv. Keywords
      v. Publisher
   b. Grants
      i. Funding agency
      ii. Award amount
      iii. Investigator type
         1. Principal
         2. Co-principal
      iv. Start date
      v. End date
      vi. Current activity status

3.3.5 Profile Page
The backend of this page will be responsible for querying the database for the information belonging to the user who owns the particular profile.
1. Query the following information associated with the viewed profile page:
   a. Grants
   b. Publications
   c. Profile picture
   d. Job title
   e. Email address
   f. Personal webpage link
2. Display the queried information.
3. Update information within the database based on user input.

3.3.6 Profile Editing
1. Update the following information within the database based on input:
   a. Profile Picture
   b. Job Title
   c. Email Address
   d.
Personal Webpage Link

3.4 Database Requirements (Andrew Sprague and Andrew Moss)
The READ database will store all of the information that the READ system needs to display results and to run the Schaefer Scraper. The following functional requirements must be met:

3.4.1 Database must be made with MySQL
The READ database must be created using MySQL. Both the creation of tables and the interfacing with tables will be done through a MySQL account. MySQL was chosen because it is widely used and widely available.

3.4.2 Database must be normalized
The READ database must be normalized. This will be done in order to keep the database as efficient as possible. Keeping the database efficient should allow it to grow without taking up a large amount of space.

3.4.3 Auths Table
The READ database must include a table to store the authors' information. The table exists so that the authors can be accessed by the Schaefer Scraper, to help associate the authors with their documents, and so that their information may be displayed to viewers. The following functional requirements must be met:
1. Authors must have an AID int as a primary key
2. Authors must have a String variable to hold Degree
3. Authors must have a String variable fname to hold the first name of the author
4. Authors must have a String variable lname to hold the last name of the author
5. Authors must have a UserName String variable
6. Authors must have a Password String variable
7. Authors must have a String variable Email to hold the email address
8. Authors must have a String variable Link to hold their personal webpage link
9. Authors must have a String variable Pic to hold the location of the profile picture
10. Authors must have a String variable Pos to hold their position with the department
11. Authors must have a Bit variable CurrentFaculty to record whether the author is a current faculty member
12.
Authors must have a Bit variable Admin to determine whether the author has administrator privileges

3.4.4 Papers Table
The READ database must include a table for storing the information on publications. This table must exist so that the Schaefer Scraper can store information in it and so that viewers can access information on the papers. The following functional requirements must be met:
1. Papers must have a PID int as a primary key
2. Papers must have a Date variable Paper_Date to hold the date the publication was added
3. Papers must include a String variable Title to hold the title of the paper
4. Papers must include a TID foreign key to Tags
5. Papers must include a String variable pData to hold information about the paper
6. Papers must include a String variable link to hold the URL the paper is held at
7. Papers must include a VarChar variable Clevel to show the clearance level of the paper. An 'A' will be stored for approved papers and a 'U' will be stored for unapproved papers
8. Papers must include a String variable Abstract to hold the abstract of the paper
9. Papers must include a String variable PubType to hold information on what kind of publication the record is
10. Papers must include an Int variable Year_Published to hold the year in which the paper was published
11. Papers must include a String variable Date_Published to hold the month and day the paper was published
12. Papers must include a String variable ConName to hold the name of the convention or journal that a paper was published in
13. Papers must include an int variable Volume to hold the number of the convention or journal that the paper was published in
14. Papers must include an int variable NumPages to record the number of pages in a publication
15. Papers must include a String variable thumbnail to hold the address of a thumbnail uploaded to the system
16.
Papers must include a String variable AuthString to list the authors in the format that was given in the publication, and to simplify the process of citing authors not at the university
17. Papers must include an int DOI to identify the paper's unique DOI number
18. Every Paper must be associated with an author

3.4.5 Grants Table
The READ database must include a grants table to store information on grants. The table must not be filled out by any scraper; it must instead be filled out by authors. The table must also be accessible to viewers who wish to see the faculty's grants. The following functional requirements must be met:
1. Grants must have an int GID as a primary key
2. Grants must include a TID foreign key to Tags
3. Grants must include an Int variable StartYear to hold the start year of the grant
4. Grants must include an Int variable EndYear to hold the end year of the grant
5. Grants must include a String variable StartDate to hold the month and day a grant started
6. Grants must include a String variable EndDate to hold the month and day a grant ended
7. Grants must include a String variable OrgAttrib to hold the name of the organization receiving the grant
8. Grants must include a String variable FundAgency to hold the name of the agency giving the funds
9. Grants must include a String variable FundDirect to hold the name of the directorate providing the funds
10. Grants must include an Int variable AwardNum to hold the ID number that the funding agency placed on the grant
11. Grants must include an Int variable Amount to hold the amount the grant was for
12. Grants must include a String variable GName to hold the name of the grant
13. Grants must have a foreign key PI to Authors AID
14. Grants must have a foreign key to CO_PI called COPIs
15.
The same grant cannot appear multiple times

3.4.6 Tags Table
The READ database must include a tags table in order to store tag information on papers and grants. The tags table must be filled out by the author for grants and by the Schaefer Scraper for papers. The following functional requirements must be met:
1. Tags must have a TID to identify the grant or paper it belongs to
2. Tags must have a String Keyword to identify what the tag is
3. Every Tag record must be associated with either a grant or a paper
4. When a paper is deleted there must be a cascading deletion of tags
5. When a grant is deleted there must be a cascading deletion of tags

3.4.7 Owns Table
The READ database must include an Owns table to associate authors with papers. This table must be organized in a way that allows multiple authors to be associated with one paper if necessary. The following functional requirements must be met:
1. Owns must have a foreign key to Authors
2. Owns must have a foreign key to Papers
3. Owns must have an int Priority to determine which author has priority in edits
4. Owns may not have two instances where Authors and Papers are the same

3.4.8 CO_PI Table
The READ database must include a CO_PI table to associate authors with grants. This table must be organized in a way that allows multiple authors to be associated with one grant if necessary. The following functional requirements must be met:
1. CO_PI must have a PI_Num
2. CO_PI must have a foreign key to Authors

3.4.9 SearchStrings Table
The READ database must include a SearchStrings table to store information on how to search for the authors on different websites. The table must hold information that tells the system what to search for and how to search for it for each author. The following functional requirements must be met:
1. SearchStrings must have an AID foreign key to Authors
2. SearchStrings must have a VarChar string naming the website they pertain to
3.
SearchStrings must have a VarChar string to specify the author's site code

3.4.10 Important fields cannot be null
Some of the fields in the database cannot allow null values, because these values are required for the database to run. The following fields must not be null:
1. Paper Title
2. Grant Title
3. OrgAttrib
4. Funding Agency
5. Funding Directorate
6. Agency Division
7. Award Number
8. Amount
9. PI
10. Keyword of Tags
11. Grant Start Date
12. Grant End Date
13. Any primary key

3.4.11 Some field values must be unique
1. User name
2. Primary keys

3.4.12 Dates must be stored as YYYY-MM-DD
Dates must be stored this way because it is the ISO standard (ISO 8601) for writing dates.

3.4.13 The database must be accessed in the system through a MySQL account that has limited privileges

Figure 5 - Database Schema

3.5 Microsoft Academic Research Scraper and Results Processing (Troy Connor and Philip McDonald)

3.5.1 Microsoft Academic Research Scraper
1. Cron job set to run once a month per user
2. Text file to split users into groups
3. Either Python or PHP to execute script
4. List of indexes from scraped sites per user in database table
5. Text file parser to read results from the text file where results were saved
6. PHP script that checks the database for existing publications/grants so that no duplicates will be added
7. Regular expression text parser to compare title results (some titles are not labeled the same)

3.5.2 BibTex Results Parser (Philip McDonald)
1. Parser shall be triggered by initiation of scraper and input of search results.
2. The parser shall check for a valid BibTex file.
3. The parser shall fail if the file format is not valid BibTex.
4.
If the parser fails due to non-valid BibTex, this failure shall be written to a logfile specific to the parser component.
5. If the parser finds valid BibTex, this success shall be written to a logfile specific to the parser component.
6. The parser shall check the file for content.
7. If the parser fails to find any content in the BibTex file, this failure shall be written to a logfile specific to the parser component.
8. If the parser finds content in the BibTex file, all entries (a BibTex 'type' entry) shall be processed.
9. Each entry shall contain at least two fields:
   a. author
   b. title
10. If an entry contains both fields, the entry is 'valid' and the data shall be retained for use in the 'update database' component.
11. If an entry does not contain both fields described in requirement 9, then the entry is 'invalid' and shall not be retained for use in the 'update database' component.
12. If an entry is valid, it shall be checked for the following types: // use Bibtex names, refer to this
   a. 'article'
   b. 'inproceedings'
   c. 'book'
13. If an entry is of one of the types referenced in requirement 12, it will be checked for the following fields:
   a. 'year'
   b. 'volume'
   c. 'pages'
14. Certain fields are only included with specific types. The following requirements are per type.
   a. 'article' types shall be checked for the following fields:
      i. 'journal'
   b. 'inproceedings' types shall be checked for the following fields:
      i. 'booktitle'
   c. 'book' types shall be checked for the following fields:
      i. ???
   (In general, the results provided by MAS do not contain all fields required by the BibTex standard.)
15. If entry data is found after checking the fields referenced in requirements 13 and 14.a-c, it shall be associated with the specific entry and retained for use in the "update database" component.
16.
Fields shall be formatted according to the format specified in the database schema before being used in the "update database" component. //reference schema requirements

3.5.3 Database Updater (Philip McDonald)
1. The updater component shall be triggered by the parser component.
2. The updater shall perform the update process for all entries ("set of entries") supplied by the parser.
3. If the set of entries is empty, the updater will not perform the updating process.
4. Each set of entries supplied by the parser shall be associated with a maximum of one author, the "current author".
5. The following requirements shall be met for each entry:
   a. The updater shall check for a duplicate publication using the entry's 'title' data.
   b. If a duplicate publication is found, the updater shall check the author of the paper.
   c. If a duplicate publication is found, and the current author is not an owner of the duplicate publication, the current author shall be made an author of the publication.
   d. If a duplicate publication is found, and the current author is an owner of the duplicate publication, the entry shall be discarded.
   e. If the publication is not a duplicate, the following requirements shall be met:
      i. The 'Authors' table shall be updated to reflect ownership of the new publication.
      ii. The 'Paper' table shall be updated with a new row. The new row shall contain the data from the entry in the following format:
         1. The 'Title' attribute shall be written with the data from the entry's 'title' field.
         2. The 'Paper_Date' shall be written with the data from the entry's 'date' field.
         3. The 'Clevel' shall be set to the value corresponding to an "unapproved" publication.
         4. If a certain data item is not included in search results for a grant or publication, the database entry shall be left null.

3.5.4
Email Notifier (Troy Connor)
The email notifier is a tool that will alert an author when the scraper finds one of their publications. The email notifier will allow the user to decide whether the publication that was found is approved or disapproved. If approved, the publication will remain in the database and be viewable. If disapproved, the publication will be deleted and will not be viewable in the system. The email notifier will also alert authors about publications/grants that they have uploaded. The email notifier must meet the following requirements:
1. An automatic email sent when a publication is found for an Author
2. Link in email to activate publication awaiting approval
3. Link in email to delete publication awaiting disapproval
4. Email notification to tell users to READ once a month
5. PHP to execute script to alert users
6. Cron job set to run email notification at intervals when required
7. Email sent to verify uploaded submission (grant or publication)
8. Email notification to alert user that profile has been changed
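
To illustrate requirements 1 through 3 of the email notifier, the following is a minimal sketch of how a notification with approve and delete links might be assembled. It is written in Python for illustration only; the requirements above call for PHP. The base URL, the sender address, the query parameter names (pid, action, token), and both function names are hypothetical and do not come from this specification.

```python
from email.message import EmailMessage
from urllib.parse import urlencode

# Hypothetical endpoint; the specification does not name the real URL.
BASE_URL = "https://read.example.edu/review.php"
# Hypothetical sender address.
SENDER = "read-noreply@example.edu"


def build_review_link(pid: int, action: str, token: str) -> str:
    """Return a link that approves or deletes the paper with the given PID."""
    if action not in ("approve", "delete"):
        raise ValueError(f"unsupported action: {action}")
    query = urlencode({"pid": pid, "action": action, "token": token})
    return f"{BASE_URL}?{query}"


def build_notification(author_email: str, title: str, pid: int, token: str) -> EmailMessage:
    """Build (but do not send) the email telling an author a paper was found."""
    msg = EmailMessage()
    msg["From"] = SENDER
    msg["To"] = author_email
    msg["Subject"] = f"READ: new publication found: {title}"
    msg.set_content(
        "The scraper found a publication that may be yours:\n"
        f"  {title}\n\n"
        "Approve it so that it stays in READ:\n"
        f"  {build_review_link(pid, 'approve', token)}\n\n"
        "Delete it if it is not yours:\n"
        f"  {build_review_link(pid, 'delete', token)}\n"
    )
    return msg
```

A cron-driven script could call build_notification once per unapproved paper and hand the result to a mail transport such as smtplib. The token parameter stands in for whatever mechanism the system uses to make sure that only the notified author can follow the links; the specification does not define one.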