Running Head Lab II – READ Product Prototype Specification

Lab II – READ Product Prototype Specification
Andrew Moss
CS411
Janet Brunelle
April 10, 2013
Version 1
Table of Contents
1 Introduction
1.1 Purpose
1.2 Scope
1.3 Definitions, Acronyms, and Abbreviations
1.4 References
1.5 Overview
2 General Description
2.1 Prototype Architecture Description
2.2 Prototype Functional Description
2.3 External Interfaces
2.3.1 Hardware Interfaces
2.3.2 Software Interfaces
2.3.3 User Interface
2.3.4 – Communication Protocols and Interfaces
3 – Specific Requirements
3.1 – Functional Requirements
List of Figures
Figure 1 – Major Functional Component Diagram
Figure 2 - User Process Flow
Figure 3 - Scraper Process Flow
Figure 4 - READ Site Map
List of Tables
Table 1 – Side-by-side Comparison of Real World Product and Prototype
1 Introduction
Publications are the primary method of distributing the results that come from conducting
research. There are approximately 4,600 universities (NCES, 2011) that “account for more than
half of the basic research conducted in the United States” (McRobbie, 2012). Unfortunately,
many of these institutions lack an efficient online resource for organizing and displaying both the
publications resulting from their research and information about the grants that helped finance it.
Such a system would provide research universities and the departments therein, as well as the
students and professors performing the research, with increased recognition and awareness of
their work.
One example of a university in need of an improved publication system is Old Dominion
University (ODU) in Norfolk, Virginia. Its Computer Science Department (ODUCS), in
particular, would benefit a great deal from a well-maintained online system for
publications and grants, as it lacks one entirely. This department’s professors are burdened with
manually updating their own web pages to provide awareness of their recent publications. In the
past there was a single web page for the entire department that was maintained by an individual
member of its Systems Group. However, this page was last updated in 2008, likely a result of
the slow, tedious, and manual nature of the process.
1.1 Purpose
The team behind READ, a Repository for Electronic Aggregation of Documents, intends
to alleviate the lack of quality online resources for displaying publications and grants. The
READ system will use a scraper to provide researchers with a method of organizing their
publications and grants in a format that allows for easy searching, sorting, filtering, and
browsing. Additionally, content authors will be able to verify that the listed publications are
actually their own work in the event that READ mistakenly shows something written by another
researcher with the same name.
1.2 Scope
There will be a prototype READ system developed for ODUCS as a proof of concept to
display its most basic capabilities. A prototype is necessary due to time constraints placed on
development. This prototype will provide public and private user interfaces to publication and
grant databases, user controls for publication verification, and most importantly, a scraper that
will gather links to publications automatically at set intervals to minimize manual effort.
The READ prototype will use data from real authors from ODUCS and the database will
be populated with their publications by Schaefer’s Scraper. It will offer nearly the same
functionality as the Real World Product (RWP). Due to limited time for development, the
prototype will not feature graphical representation of data about publications and grants, nor will
it implement a learning algorithm to automatically decide whether a publication does or does not
likely belong to a specific author. This is shown in Table 1.
Table 1 – Side-by-side Comparison of Real World Product and Prototype
| Features | Real World Product | Prototype |
|---|---|---|
| Browsing Capabilities | Ability to browse all grants and publications | Ability to browse all grants and publications |
| Publication Filtering Capabilities | Filtered by title, publisher, authors, publication date, date added, and keywords. | Filtered by title, publisher, authors, publication date, date added, and keywords. |
| Grant Filtering Capabilities | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. |
| Add, edit, and delete publications and grants | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. |
| Faculty page | Lists faculty and provides a link to each person’s profile page | Not included. |
| Login interface | Linked to Old Dominion University Computer Science accounts | Linked to Old Dominion University Computer Science accounts |
| Profile Page | Displays authors’ profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Displays graphs. | Displays authors’ profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Graphs not included. |
| Scraper | Will update the system with new publications and grants and alert users when one is added to the system under their name. | Will update the system with publications only and alert users when one is added to the system under their name. |
| Prediction algorithm | Predicts if the consumer has enough space to use the READ system. | Not included. |
| Administrative Privileges | Administrators are able to edit, add, or remove anything in the system. | Administrators are able to edit, add, or remove anything in the system. |
1.3 Definitions, Acronyms, and Abbreviations
Administrator/Administrative User: A user with increased privileges for editing database content.
Author: A person who is able to add publications and grants to the system under their
name and to edit them.
BibTeX: A plain-text file format for bibliographic reference information. It will be used to
automatically fill in key information when uploading or editing publications and grants.
Computer Science (CS): An academic discipline based on advancing computing theory and
algorithm development, that sometimes includes theory about software engineering
methods.
Client application: In a client/server architecture, the module that takes input and creates queries
to be processed by a server, and receives the results from the server.
Client/Server Architecture: A software engineering paradigm that separates functionality into a
“client” application and a “server” application that interact.
CSS: Cascading Style Sheets, a style sheet language used to specify the presentation of HTML pages.
Data Mining: The act of going through a source of input to find specific information.
Database Schema: A description of the structure of a database.
Funding Agency: The source of funds for research grants. These organizations usually have a
limited amount of money to award to principal investigators who submit an accepted
application for research funds.
Git: A software system for controlling and organizing software versioning.
GoogleScholar: Google Scholar provides a simple way to broadly search for scholarly literature.
Graphical User Interface (GUI): A computer interface composed of icons, text fields, menus, etc.,
through which a user interacts with a software application via a mouse and keyboard.
Used to differentiate from a “command-line interface”, in which
a user interacts with a software application solely through a text terminal.
JQuery Sparklines: A development library for the visualization of data.
ODU: Old Dominion University.
MicrosoftAcademic: Microsoft Academic Search is a free service developed by Microsoft
Research to help scholars, scientists, students, and practitioners quickly and easily find
academic content, researchers, institutions, and activities.
MySQL: A relational database management system
Parse: A technical term usually used to describe the processing of a statement written in a
programming language. May be used generally to describe the processing of any
statement for specific meaning.
Perl: A widely-used programming language on the server-side of web applications.
PHP: A widely-used programming language on the server-side of web applications.
Principal Investigator (PI): The primary researcher to whom a research grant is awarded,
responsible for documenting the work and publishing research results.
Publication or Academic Publication: A document created by a faculty member to share
research. Publications usually appear in academic journals, technical reports, and
records of conference proceedings.
Query: A statement sent to the database to either change the database or get back results.
READ: Repository for Electronic Aggregation of Documents
RSS: A system for subscribing to and distributing news.
Scraper: An automated application designed to scan a source of input such as a document or a
website for pertinent information.
Server application: In a client/server architecture, the module that takes queries or requests from
a client module, processes them, and returns the result to the client.
Software Compatibility: A description of whether different software packages, or versions of a
package, can communicate or interact.
SQL: Structured Query Language, a widely used language for querying and modifying relational databases.
SQL injection: An attack in which crafted input causes unauthorized queries to be performed on a
database for malicious purposes.
User Authentication: The process of verifying the access credentials of a user of an automated
system, usually accomplished by requesting a username and password combination.
Viewer: In the scope of our project an outside person who wishes to query the information
contained in the READ database.
Version Control: A method for organizing and recording different versions of documents that
have been created over time.
Virtual Server: A software version of a hardware server.
Webserver: A group of applications run on a computer or VPS in order to serve webpages and provide
server-side computation for browser-based client applications. A web server is a
constantly “on” resource whose sole or main job is to respond to HTTP requests from
browsers.
XML: Extensible Markup Language.
1.4 References
McRobbie, Michael A. (2012, December 19). The Multibillion-Dollar Threat to Research
Universities. From The Chronicle of Higher Education:
http://chronicle.com/article/The-Multibillion-Dollar-Threat/136363/
Moss, Andrew. (2013). LAB I – READ Product Description. Norfolk, VA: Author.
National Center for Education Statistics. (2011). Degree-granting institutions and branches, by controls
and level of institution and state or jurisdiction, 2010-11. From the Digest of Education
Statistics: http://nces.ed.gov/programs/digest/d11/tables/dt11_280.asp
1.5 Overview
This product specification details the features, components, and capabilities of the READ
prototype, as well as all necessary hardware and software. The following sections offer further
information to that effect.
2 General Description
READ is an automated system that uses a database to store links to articles, the publications
themselves, and information about the associated grants. It will allow anyone with Internet access to
browse the lists of publications and filter them by author, date, keywords, and publication type. It
will minimize the need for manual effort on the part of authors by automatically finding their
publications, making it easier to manage the work they have already done.
2.1 Prototype Architecture Description
The major functional components of the READ solution prototype are shown in Figure 1.
The scraper will comb through a pre-defined list of specific web sites, searching for new
publications by the author names given as input. It will then parse the results and export them to
a MySQL database.
The database will store links to publications, information about the publications, and in
some cases, publications themselves. It will also contain information about grants associated
with the aforementioned publications. Additionally, unique strings that identify the authors at the
external web sites will be stored in the database.
The web interface contains both public and private sections. The latter will be accessible
only to document authors and administrative staff. Access to this section will be strictly
protected by requiring user authentication before it can be viewed. The web interfaces will be
written in a combination of jQuery/JavaScript and PHP.
Figure 1 – Major Functional Component Diagram
2.2 Prototype Functional Description
READ will allow anyone with Internet access to view publications, grants, and author
profiles. To access more features, the user will have to log in with valid ODUCS Linux/Unix
credentials. If invalid credentials are entered, the user will still be considered only a viewer.
Upon successful authentication, the user will be identified as either an author or an administrator. If
the user is determined to be an author, she will be able to edit her own publication and grant
information, add missing publications and grants, and edit the information displayed on her
public profile. Alternatively, if the user is an administrator, she will be able to edit or remove
publication and grant information and edit anyone’s profile information. This process flow is
visualized in Figure 2.
Figure 2 - User Process Flow
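The role rules described above can be sketched in a few lines. This is only an illustration under stated assumptions: the role names, the User class, and the functions resolve_role and can_edit_publication are hypothetical and are not taken from the READ code base, and Python is used here for brevity even though the web interface itself is specified in PHP.

```python
from dataclasses import dataclass

# Hypothetical role labels; the specification names the roles but not their encoding.
VIEWER, AUTHOR, ADMIN = "viewer", "author", "administrator"


@dataclass
class User:
    username: str
    role: str = VIEWER  # anyone who has not authenticated stays a viewer


def resolve_role(username, password_ok, admins, authors):
    """Map a login attempt to a role; a failed login leaves the user a viewer."""
    if not password_ok:
        return VIEWER
    if username in admins:
        return ADMIN
    if username in authors:
        return AUTHOR
    return VIEWER


def can_edit_publication(user, owners):
    """Administrators may edit anything; authors may edit only what they own."""
    if user.role == ADMIN:
        return True
    return user.role == AUTHOR and user.username in owners
```

For example, an author listed among a publication's owners passes can_edit_publication, while another author who is not an owner does not.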
The scraper starts by searching for publications at external websites for authors from the
Computer Science Department. For each publication it finds, it checks to see if the publication is
already referenced in the database. If the publication is already in the database, the scraper will
check to see if the author for whom it was searching is listed as an owner/author of the paper. If
the author is not already associated with the work in the database, the association is made, but set
to an unapproved status. If the author is already associated with it, the scraper simply resumes
searching for publications.
In the event that a scraped publication is not already in the database, it is added to the
database and the author for whom the scraper was searching is added as an author. However, this
publication will not be made viewable yet as it will be in an unapproved status. There will be a
cron job that runs periodically which
sends out e-mail notifications to authors that a publication has been attributed to them.
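The scraper's bookkeeping for newly found publications can be sketched as follows. This is a minimal in-memory illustration, not the scraper's actual code: the function name ingest is hypothetical, a plain dictionary stands in for the MySQL database, and publications are keyed by title only to keep the example short.

```python
def ingest(db, author, scraped_titles):
    """Record scraped publications for one author.

    db maps a publication title to {author_name: approved_flag}.
    Returns the titles for which the author should later be notified.
    """
    to_notify = []
    for title in scraped_titles:
        owners = db.setdefault(title, {})  # add the publication if it is new
        if author not in owners:
            owners[author] = False         # a new association starts unapproved
            to_notify.append(title)
    return to_notify
```

The returned list is what a periodic notification job could use to e-mail the author; sending the e-mail itself is left out of the sketch.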
If an author denies ownership of a paper, the database will be queried to determine if the
publication is owned by any other author in the database. Should the query return true, the author
for whom the scraper was originally searching is removed from the list of the publication’s
owners. Otherwise, the paper will be removed from the database entirely. In the event that the
author confirms that she wrote the work in question, the database is queried to determine if there
are any other authors who should also be added to the list of owners. This chain of events is
illustrated in Figure 3.
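The deny and confirm branches can be sketched in the same spirit. The names deny_ownership and confirm_ownership are hypothetical, a dictionary mapping each title to {author_name: approved_flag} again stands in for the database, and marking the association approved on confirmation is an assumption inferred from the unapproved status described above. The lookup of additional co-authors is not shown.

```python
def deny_ownership(db, author, title):
    """The author says the paper is not theirs."""
    owners = db[title]
    owners.pop(author, None)   # drop this author from the list of owners
    if not owners:             # no other author owns it, so remove the paper
        del db[title]


def confirm_ownership(db, author, title):
    """The author confirms the paper (assumed to mark the association approved)."""
    db[title][author] = True
```

A paper that still has another owner survives a denial; a paper whose only owner denies it disappears from the store.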
Figure 3 - Scraper Process Flow
2.3 External Interfaces
2.3.1 Hardware Interfaces
READ will not require any custom-built hardware. Any device with internet connectivity
and a web browser can be used to test its functionality.
2.3.2 Software Interfaces
A physical server running the Microsoft Hyper-V hypervisor will host the virtual
machine where the READ solution is being developed. The READ database will be hosted on a
MySQL server, and the MySQL command-line client will be used to connect to the server
instance. The READ web site uses Joomla, an open source content management system, and is
written with a combination of PHP and jQuery/JavaScript. Python was used to write and modify
the scraper.
2.3.3 User Interface
A site map showing the user interfaces can be seen in Figure 4.
Figure 4 - READ Site Map
2.3.4 – Communication Protocols and Interfaces
READ will only make use of TCP/IP.
3 – Specific Requirements
3.1 – Functional Requirements
UI Requirements (Jacob Phillmon and Marcus Zehr)
The UI is what a person using the READ system will actually see. It governs all the
functions of the READ display and allows people to interact with the system. The UI will be
used by many types of people, including viewers, authors, and administrators, and extra interface
functionality will be provided for each. The UI must meet the following requirements:
3.1.1 Publications Query Page
This page is used to browse through all publications in the system. Filters can be chosen to
narrow the list to publications of interest. Publications consist of many forms of
academic media, including but not limited to articles in conference proceedings, journal articles,
tech reports, and abstracts. They usually are based on research done by specific individuals. The
Publications Query Page must serve the following functional requirements:
1. The page must initially display publications with those that were most recently published
at the top of the page.
2. The page must allow the following filters for publications displayed:
   a. Title
   b. Authors
   c. Date published
   d. Date added
   e. Keywords
   f. Publisher
3. The page must display the following information for publications:
   a. Title
   b. Authors
   c. Date published
   d. Date added to system
   e. Conference name or journal name or TR number
   f. Volume number
   g. Number of pages
   h. Page numbers
   i. Abstract (if available)
   j. A clickable link to where the publication is located
   k. Thumbnail image
3.1.2 Grants Query Page
This page is used to browse through all grants within the system. Filters can be chosen to
narrow the list to grants of interest. Grants are lump sums of money awarded for
research to Old Dominion University faculty. The Grants Query Page must serve the following functional
requirements:
1. The page must initially display grants with those that were most recently granted at the
top of the page.
2. The page must allow grants to be filtered in all of the following ways:
   a. Title
   b. Funding agency
   c. Principal or co-principal investigator
   d. Start date
   e. End date
   f. Active state
3. The page must display the following information for grants:
   a. Title
   b. Funding agency
   c. Award amount
   d. Principal and co-principal investigators
   e. Start date
   f. End date
   g. Division
   h. Award number
   i. Abstract
   j. A clickable link to where the grant is located
3.1.3 Main Page
The main page is the home page of the READ application. It is the first page that will be
visited by anyone browsing the system. It shows the most recent documents in the system and
provides navigation to other parts of the user interface. The Main Page must
serve the following functional requirements:
1. The main page must display the most recent publications and grants that have been added
to the system.
2. The number of publications and/or grants displayed must match the current amount set by
the system administrator; if no amount is set, the page defaults to a list of items published in
the last 3 months.
3. It must allow navigation to the following pages:
   a. Grants page
   b. Publications page
   c. Login page
   d. Profile page
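The selection rule in requirement 2 can be sketched as below. This is an illustration only: the function name recent_items, the "date" key, and the approximation of 3 months as 90 days are all assumptions, and Python is used for brevity even though the site itself is specified in PHP.

```python
from datetime import date, timedelta


def recent_items(items, today, limit=None, window_days=90):
    """Return newest-first items for the main page.

    If the administrator has set `limit`, return that many items;
    otherwise fall back to everything from roughly the last 3 months.
    """
    ordered = sorted(items, key=lambda item: item["date"], reverse=True)
    if limit is not None:
        return ordered[:limit]
    cutoff = today - timedelta(days=window_days)
    return [item for item in ordered if item["date"] >= cutoff]
```

Passing today explicitly, rather than calling date.today() inside the function, keeps the rule easy to test.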
3.1.4 Login Page
The login page will allow registered users to log into the system for authentication
purposes. Logged in users will be able to edit publications or grants they have ownership of. The
Login Page must serve the following functional requirements:
1. Provide an interface for a user to enter his or her login information.
2. All login information must be linked to Old Dominion University CS accounts.
3.1.5 Profile Page
The profile page will list information on the specific user currently logged in. This page
is used to view any publications and grants associated with a specific user. Logged in users can
use the profile page to choose to edit their own profile as well as any publications and grants they
have ownership of. The Profile Page must serve the following functional requirements:
1. The profile page must display all grants and publications the user currently has on the
system.
2. Provide the user the ability to select an option to edit information in grants and
publications the user owns after said user has logged into the system.
3. The profile page must display the following:
   a. Profile picture
   b. Job title
   c. Email addresses
   d. Personal webpage link
4. Provide the user the ability to select an option to edit information displayed.
5. Provide the user the ability to select an option to add publications or grants manually into
the system. This function must be blocked if the user has not logged in or is not the owner
of the specified profile page.
6. Provide the user the ability to edit the profile page information, as well as grant and
publication data; must be blocked if the user has not logged in or is not the owner of the
specified profile page.
3.1.6 Publication Add Page
The Publication Add Page will allow the user to submit publications manually into the
system. The page will allow logged-in users to add publications to the system if they do not wish
to wait for the scraper to add them. The Publication Add Page must serve the following functional
requirements:
1. Provide the user with the ability to enter publication fields manually.
2. Provide the user with the ability to submit a BibTeX document to automatically fill in
various fields.
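As an illustration of how a BibTeX document could pre-fill form fields, the sketch below extracts the fields of a single entry with regular expressions. It is a deliberately simplified assumption, not the prototype's parser: the function name parse_bibtex_entry is hypothetical, and nested braces, string macros, and multiple entries are not handled.

```python
import re


def parse_bibtex_entry(text):
    """Extract the entry type, citation key, and fields from one BibTeX entry.

    Handles only flat `name = {value}` or `name = "value"` fields.
    """
    header = re.search(r"@(\w+)\s*\{\s*([^,\s]+)\s*,", text)
    if header is None:
        raise ValueError("not a BibTeX entry")
    fields = {"type": header.group(1).lower(), "key": header.group(2)}
    # Each match yields (field name, braced value, quoted value); one value is empty.
    for name, braced, quoted in re.findall(
        r'(\w+)\s*=\s*(?:\{([^{}]*)\}|"([^"]*)")', text
    ):
        fields[name.lower()] = (braced or quoted).strip()
    return fields
```

The resulting dictionary could then be mapped onto the publication form, with any field the entry lacks left for the user to type in.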
3.1.7 Grant Add Page
The Grant Add Page will allow the user to submit grants manually into the system. The
page will allow logged-in users to add grants to the system if they do not wish to wait for the
scraper to add them. The Grant Add Page must serve the following functional requirements:
1. Provide the user with the ability to enter grant fields manually.
3.1.8 Profile Edit Page
The profile editing page will allow the user to submit changes to their profile page. A
logged in user would use this page to edit information displayed such as their job title, additional
email addresses, personal website link, or profile picture. The Profile Edit Page must serve the
following functional requirements:
1. Provide the user the ability to alter existing profile information including:
   a. Job title
   b. Email addresses
   c. Personal webpage link
2. Provide the user the ability to submit a profile picture.
3.1.9 Publication Edit Page
The publication editing page will allow the user to alter their own publication
information. This page would be used to edit publications that were already stored in the system.
The main reason for this would be to fix any mistakes that may have been created during the
scraping process. The Publication Edit Page must serve the following functional requirements:
1. Provide the user the ability to alter information within the publication data.
2. Provide the user the ability to submit a BibTeX document to automatically fill in various
fields.
3. Provide the user the ability to remove the publication from the system if it is not their
own.
3.1.10 Grant Edit Page
The grant editing page will allow the user to alter their own grant information. This page
would be used to edit grants that were already stored in the system. The main reason for this
would be to fix any mistakes that may have been created during the scraping process. The Grant
Edit Page must serve the following functional requirements:
1. Allow the user to review existing grant information.
2. Allow the user to alter information within the grant data.
3. Allow the user to remove a grant from the system if it is not their own.
3.1.11 Administration Page
The administration page houses all abilities that are restricted solely to system
administrators. Administrators would use this page to edit system settings such as the number of
grants and publications displayed per query page. The Administration Page must serve the
following functional requirements:
1. Provide an administrator the ability to set the default number of publications displayed on
the publications query page.
2. Provide an administrator the ability to set the default number of grants shown on the
grants query page.
3. Provide an administrator the ability to set the default number of grants or publications
displayed in the RSS feed on the main page.
3.2 System User Requirements: (Jacob Phillmon)
The system users are people who access the READ system through the UI and interact with it in
many ways. There are three different types of users, each of which has unique privileges. The
system users are made up of the following types and requirements:
3.2.1 Viewer Requirements
Viewers are able to view the system but are unable to edit any information within it. They
are people that may wish to use the system to view publications and grants that are already in the
system, but not add anything to it. Viewers must have the following capabilities:
1. Viewers must have access to the following pages:
   a. Main Page
   b. Publications Query Page
   c. Grants Query Page
   d. Profile Page
   e. Login Page
2. Viewers are able to view grants and publications stored within the system.
3. Viewers are able to view personal profile pages of registered users.
3.2.2 Author Requirements
Authors are able both to view the system and to edit information to which they have access.
They are people who actually add publications and grants to the system under their own name.
They must have the following capabilities:
1. They must have access to the following pages:
   a. Main Page
   b. Publications Query Page
   c. Grants Query Page
   d. Profile Page
   e. Login Page
2. Authors are able to view grants and publications stored within the system.
3. Authors are able to add grants and publications to the system manually.
4. Authors are able to edit grants and publications they have ownership of.
5. Authors are able to edit their own profile page.
3.2.3 Administrator Requirements
Administrators are able to view the system as well as edit any information displayed on
the system. Administrators differ from Authors in that they do not actually own any
publications or grants in the system. They are, however, able to make adjustments to anything
within the system. Administrators must have the following capabilities:
1. They must have access to the following pages:
   a. Main Page
   b. Publications Query Page
   c. Grants Query Page
   d. Profile Page
   e. Login Page
   f. Administration Page
2. Administrators are able to view grants and publications stored within the system.
3. Administrators are able to add grants and publications to the system manually.
4. Administrators are able to edit any grant and publication stored within the system.
5. Administrators are able to edit any profile page.
6. Administrators are able to set the default number of publications and grants displayed per
page.
3.3 Backend User Interface (Jim Lawrence Calderon)
3.3.1 Publications Page:
The backend of this page will be responsible for querying the database for the
information pertaining to publications that will be displayed to the viewer.
1. By default, publications are queried to show the most recent publications first.
2. Alter results shown based on the following filters:
   a. Title
   b. Author
   c. Date published
   d. Keywords
   e. Publisher
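One way to turn the selected filters into a query is to build the WHERE clause from a fixed whitelist and pass the user's values as parameters, which also guards against SQL injection. The sketch below is an assumption-laden illustration: the table name publications, the column names, and the function build_publication_query are hypothetical, the flat author column ignores any normalization of the real schema, and Python is used for brevity although the backend is specified in PHP. The %s placeholder follows the parameter style used by common Python MySQL drivers.

```python
# Hypothetical column names; each filter maps to a fixed SQL fragment.
ALLOWED_FILTERS = {
    "title": "title LIKE %s",
    "author": "author LIKE %s",
    "keywords": "keywords LIKE %s",
    "publisher": "publisher LIKE %s",
    "date_published": "date_published = %s",
}


def build_publication_query(filters):
    """Return (sql, params) for the requested filters, newest first."""
    clauses, params = [], []
    for name, value in filters.items():
        if name not in ALLOWED_FILTERS:
            raise ValueError(f"unsupported filter: {name}")
        fragment = ALLOWED_FILTERS[name]
        clauses.append(fragment)
        # Text filters match substrings; the date filter matches exactly.
        params.append(f"%{value}%" if "LIKE" in fragment else value)
    sql = "SELECT * FROM publications"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY date_published DESC"
    return sql, params
```

Because user input only ever reaches the params list, never the SQL text, the database driver is responsible for escaping it.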
3.3.2 Grants Page:
The backend of this page will be responsible for querying the database for the
information pertaining to grants that will be displayed to the viewer.
1. By default, grants are queried to show the most recent grants first.
2. Alter results shown based on the following filters:
   a. Funding agency
   b. Award amount
   c. Investigator type
      i. Principal
      ii. Co-principal
   d. Start date
   e. End date
   f. Current activity status
3.3.3 Login:
1. The following must be verified:
   a. Username is alphanumeric
   b. Credentials entered by user exist in the database
   c. Password is correct
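The three checks can be sketched in order as follows. This is an illustration under assumptions: how READ actually verifies ODUCS credentials is not described here, so an in-memory dictionary of salted password hashes stands in for the account store, and the names hash_password and verify_login are hypothetical.

```python
import hashlib
import hmac


def hash_password(password, salt):
    """Derive a password hash with PBKDF2-HMAC-SHA256."""
    return hashlib.pbkdf2_hmac("sha256", password.encode(), salt, 100_000)


def verify_login(username, password, accounts):
    """Apply the three checks in order; accounts maps username -> (salt, hash)."""
    if not username.isalnum():
        return False                 # a. username must be alphanumeric
    record = accounts.get(username)
    if record is None:
        return False                 # b. credentials must exist
    salt, stored = record
    # c. password must be correct; compare_digest avoids timing leaks
    return hmac.compare_digest(hash_password(password, salt), stored)
```

Rejecting non-alphanumeric usernames before any lookup means input such as quotes or semicolons never reaches the account store.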
3.3.4 Editing:
The backend of this page will allow for privileged users to update information on
publications and grants using a form with fields for each updateable field.
1. Update the following information within the database based on user input for:
a. Publications
i. Title
ii. Author
iii. Date published
iv. Keywords
v. Publisher
b. Grants
i. Funding Agency
ii. Award Amount
iii. Investigator Type
1. Principal
2. Co-Principal
iv. Start Date
v. End Date
vi. Current Activity Status
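One way to apply form input to the database is to accept only an explicit list of editable
columns, as in the Python sketch below. The function name build_update and the column subsets
are assumptions; the column names are borrowed from the Papers and Grants tables in section 3.4,
and the real form may expose a different set.

```python
# Hypothetical subset of editable columns per table; primary keys are deliberately excluded.
EDITABLE_COLUMNS = {
    "Papers": {"Title", "AuthString", "Year_Published", "ConName"},
    "Grants": {"FundAgency", "Amount", "StartDate", "EndDate"},
}
KEY_COLUMNS = {"Papers": "PID", "Grants": "GID"}


def build_update(table, record_id, changes):
    """Return (sql, params) for a parameterized UPDATE of one record."""
    if table not in EDITABLE_COLUMNS:
        raise ValueError(f"unknown table: {table}")
    if not changes:
        raise ValueError("no changes supplied")
    unknown = set(changes) - EDITABLE_COLUMNS[table]
    if unknown:
        raise ValueError(f"fields are not editable: {sorted(unknown)}")
    assignments = ", ".join(f"{column} = %s" for column in changes)
    sql = f"UPDATE {table} SET {assignments} WHERE {KEY_COLUMNS[table]} = %s"
    return sql, list(changes.values()) + [record_id]
```

Because table and column names come only from the fixed lists, user input reaches the statement
solely as bound parameters.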
3.3.5 Profile page:
The backend of this page will be responsible for querying the database for the
information belonging to the user who owns the particular profile.
1. Query the following information associated with the viewed profile page:
a. Grants
b. Publications
c. Profile picture
d. Job title
e. Email address
f. Personal webpage link
2. Display the queried information.
3. Update information within the database based on user input.
3.3.6 Profile Editing:
1. Update the following information within the database based on input:
a. Profile Picture
b. Job Title
c. Email Address
d. Personal Webpage Link
3.4 Database Requirements (Andrew Sprague and Andrew Moss)
The READ database will store all of the information that the READ system displays and that the
Schaefer Scraper needs in order to run. The following functional requirements must be met:
3.4.1 Database must be made with MySQL
The READ database must be created using MySQL. Both the creation of tables and the
interfacing with tables will be done through a MySQL account. MySQL was chosen because it is
widely used and readily accessible.
3.4.2 Database must be normalized
The READ database must be normalized in order to keep it as efficient as possible. Keeping the
database efficient should allow it to grow without taking up a large amount of space.
3.4.3 Auths Table
The READ database must include a table to store the authors' information. The table exists so
that the Schaefer Scraper can use it to associate authors with their documents and so that the
authors' information may be displayed to viewers. The following functional requirements must be
met:
1. Authors must have an AID int as a primary key
2. Authors must have a String Variable to hold Degree
3. Authors must have a String Variable fname to hold the first name of the author
4. Authors must have a String Variable lname to hold the last name of the author
5. Authors must have a UserName String Variable
6. Authors must have a Password String Variable
7. Authors must have a String Variable Email to hold the Email Address
8. Authors must have a String Variable Link to hold their personal webpage link
9. Authors must have a String Variable Pic to hold the location of the profile picture
10. Authors must have a String Variable Pos to hold their position with the department
11. Authors must have a Bit Variable CurrentFaculty to indicate whether the author is a
current faculty member
12. Authors must have a Bit Variable Admin to indicate whether the author has
administrator privileges
3.4.4 Papers Table
The READ database must include a table for storing the information on publications. This
table must exist so that the Schaefer Scraper can store information in it and so that viewers
can access information on the papers. The following functional requirements must be met:
1. Papers must have a PID int as a primary key
2. Papers must have a Date variable Paper_Date to hold the date the publication was added
3. Papers must include a String variable Title to hold the title of the paper
4. Papers must include a TID foreign key to Tags
5. Papers must include a String variable pData to hold information about the paper
6. Papers must include a String variable link to hold the URL the paper is held at
7. Papers must include a VarChar variable Clevel to show the clearance level of the paper.
An ‘A’ will be stored for approved papers and a ‘U’ will be stored for unapproved
papers.
8. Papers must include a String variable Abstract to hold the abstract of the paper.
9. Papers must include a String variable PubType to hold information on what kind of
publication the record is.
10. Papers must include an Int variable Year_Published to keep the year when the paper was
published.
11. Papers must include a String variable Date_Published to hold the month and day the
paper was published
12. Papers must include a String variable ConName to hold the name of the convention or
journal that a paper was published in.
13. Papers must include an int variable Volume to hold the number of the convention or
journal that the paper was published in.
14. Papers must include an int variable NumPages to record the number of pages in a
publication.
15. Papers must include a String variable thumbnail to hold information on the address of a
thumbnail uploaded to the system.
16. Papers must include a String variable AuthString to list the authors in the format that was
given in the publication, and to simplify the process of citing authors not at the
university.
17. Papers must include an int DOI to identify the paper's unique DOI number
18. Every Paper must be associated with an author
3.4.5 Grants Table
The READ database must include a grants table to store information on grants. The table must
not be filled out by any scraper; it must be filled out by authors. The table must also be
accessible to viewers who wish to see the faculty's grants. The following functional requirements
must be met:
1. Grants must have an int GID as a primary key
2. Grants must include a TID foreign key to Tags
3. Grants must include an Int variable StartYear to hold the start year of the grant
4. Grants must include an Int variable EndYear to hold the end year of the grant
5. Grants must include a String Variable StartDate to hold the month and day a grant started
6. Grants must include a String variable EndDate to hold the month and day a grant ended
7. Grants must include a String variable OrgAttrib to hold the name of the organization
receiving the grant
8. Grants must include a String variable FundAgency to hold the name of the Agency giving
the funds
9. Grants must include a String variable FundDirect to hold the name of the Directorate
providing the funds
10. Grants must include an Int variable AwardNum to hold the ID number that the funding
agency placed on the grant
11. Grants must include an Int variable Amount to hold the amount the grant was for
12. Grants must include a String variable GName to hold the name of the grant
13. Grants must have a foreign key PI to Authors AID
14. Grants must have a foreign key to CO_PI called COPIs
15. The same grant cannot appear more than once
3.4.6 Tags Table
The READ database must include a tags table in order to store tag information on papers and
grants. The tags table must be filled out by the author for grants and by the Schaefer Scraper for
papers. The following functional requirements must be met:
1. Tags must have a TID to identify the grant or paper it belongs to
2. Tags must have a String Keyword to identify what the tag is
3. Every Tag record must be associated with either a grant or a paper
4. When a paper is deleted there must be a cascading deletion of tags
5. When a grant is deleted there must be a cascading deletion of tags
3.4.7 Owns Table
The READ database must include an Owns table to associate authors with papers. This table must
be organized in a way that allows multiple authors to be associated with one paper if
necessary. The following functional requirements must be met:
1. Owns must have a Foreign key to Authors
2. Owns must have a Foreign key to Papers
3. Owns must have an int Priority to determine which author has priority in edits
4. Owns may not have two instances where Authors and Papers are the same
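The uniqueness requirement on Owns can be met with a composite primary key over the two foreign
keys. The Python sketch below illustrates the idea with the standard-library sqlite3 module as a
stand-in for MySQL, so the column types are SQLite's, the tables are reduced to a few columns,
and the helper names create_schema and add_owner are hypothetical.

```python
import sqlite3


def create_schema():
    """Create reduced Authors, Papers, and Owns tables in an in-memory database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces foreign keys only when asked
    conn.executescript("""
        CREATE TABLE Authors (AID INTEGER PRIMARY KEY, lname TEXT);
        CREATE TABLE Papers (PID INTEGER PRIMARY KEY, Title TEXT NOT NULL);
        CREATE TABLE Owns (
            AID INTEGER NOT NULL REFERENCES Authors(AID),
            PID INTEGER NOT NULL REFERENCES Papers(PID),
            Priority INTEGER NOT NULL,
            PRIMARY KEY (AID, PID)  -- one row per author/paper pair
        );
    """)
    return conn


def add_owner(conn, aid, pid, priority):
    """Insert an ownership row; return False if a constraint rejects it."""
    try:
        conn.execute(
            "INSERT INTO Owns (AID, PID, Priority) VALUES (?, ?, ?)",
            (aid, pid, priority),
        )
        return True
    except sqlite3.IntegrityError:
        return False


conn = create_schema()
conn.execute("INSERT INTO Authors (AID, lname) VALUES (1, 'Author A'), (2, 'Author B')")
conn.execute("INSERT INTO Papers (PID, Title) VALUES (10, 'Example Paper')")
first = add_owner(conn, 1, 10, 1)      # first author of paper 10
second = add_owner(conn, 2, 10, 2)     # a second author of the same paper is allowed
duplicate = add_owner(conn, 1, 10, 3)  # the same author/paper pair again is rejected
```

The composite key still allows several authors per paper, since only the exact pair of AID and
PID has to be unique.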
3.4.8 CO_PI Table
The READ database must include a CO_PI table to associate authors with grants. This table must
be organized in a way that allows multiple authors to be associated with one grant if
necessary. The following functional requirements must be met:
1. CO_PI must have a PI_Num
2. CO_PI must have a Foreign key to Authors
3.4.9 SearchStrings Table
The READ database must include a SearchStrings table to store information on how to search
for the authors on different websites. The table must have information that will tell the system
what to search and how to search it for each author. The following functional requirements must
be met:
1. SearchStrings must have an AID foreign key to Authors
2. SearchStrings must have a VarChar string naming the website they pertain to
3. SearchStrings must have a VarChar string to specify the author's site code
3.4.10 Important fields cannot be null
Some of the fields in the database must not allow null values because these values are required
for the database to run. The following fields must not be null:
1. Paper Title
2. Grant Title
3. OrgAttrib
4. Funding Agency
5. Funding Directorate
6. Agency Division
7. Award Number
8. Amount
9. PI
10. Keyword of Tags
11. Grant Start Date
12. Grant End Date
13. Any Primary Key
3.4.11 Some field values must be unique
1. User name
2. Primary keys
3.4.12 Dates must be stored as YYYY-MM-DD
Dates must be stored this way because it is the ISO 8601 standard format for writing dates.
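As a brief illustration, Python's standard datetime module produces this format directly. The
helper names to_iso_date and is_iso_date below are hypothetical.

```python
from datetime import date, datetime


def to_iso_date(year, month, day):
    """Format a calendar date as YYYY-MM-DD."""
    return date(year, month, day).isoformat()


def is_iso_date(text):
    """Return True only for a real calendar date written as zero-padded YYYY-MM-DD."""
    try:
        parsed = datetime.strptime(text, "%Y-%m-%d").date()
    except ValueError:
        return False
    # strptime tolerates missing zero padding, so compare against the canonical form.
    return parsed.isoformat() == text
```

A check like is_iso_date could run before a value is written, so that dates such as 04/10/2013
are rejected instead of being stored in a second format.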
3.4.13 The database must be accessed in the system through a MySQL account that has limited
privileges
Figure 5 - Database Schema
3.5 Microsoft Academic Research Scraper and Results Processing (Troy Connor and Philip McDonald)
3.5.1 Microsoft Academic Research Scraper
1. Cron job set to run once a month per user
2. Text file to split users into groups
3. Either Python or PHP to execute the script
4. List of indexes from scraped sites, stored per user in a database table
5. Text file parser to read results from the text file where results were saved
6. PHP script that checks the database for existing publications and grants so that no duplicates
are added
7. Regular expression text parser to compare title results (some titles are not labeled the same)
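The regular expression comparison in item 7 could work by normalizing both titles before
comparing them, as in the Python sketch below. The function names and the normalization rules
(lowercasing and treating every run of characters other than ASCII letters and digits as one
space) are assumptions; the specification does not say how titles differ between sources.

```python
import re


def normalize_title(title):
    """Lowercase a title and collapse punctuation and whitespace into single spaces."""
    lowered = title.lower()
    # Any run of characters other than a-z and 0-9 becomes one space.
    return re.sub(r"[^a-z0-9]+", " ", lowered).strip()


def same_title(first, second):
    """Treat two titles as equal when their normalized forms match."""
    return normalize_title(first) == normalize_title(second)
```

With these rules, "A Study of Web-Scraping." and "a study of web scraping" compare as equal. One
limitation of this sketch is that accented letters are treated as separators.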
3.5.2 BibTex Results Parser (Philip McDonald)
1. Parser shall be triggered by initiation of scraper and input of search results.
2. The parser shall check for valid BibTex file.
3. The parser shall fail if the file format is not valid BibTex.
4. If the parser fails due to non-valid BibTex, this failure shall be written to a logfile
specific to the parser component.
5. If the parser finds valid BibTex, this success shall be written to a logfile specific to the
parser component.
6. The parser shall check the file for content.
7. If the parser fails to find any content in the BibTex file, this failure shall be written to a
logfile specific to the parser component.
8. If the parser finds content in the BibTex file, all entries (a BibTex 'type' entry) shall be
processed.
9. Each entry shall contain at least two fields:
a. author
b. title
10. If an entry contains both fields, the entry is 'valid' and the data shall be retained for use in
the 'update database' component.
11. If an entry does not contain both fields described in requirement 9, then the entry is
'invalid' and shall not be retained for use in the 'update database' component.
12. If an entry is valid, it shall be checked for the following types: // use Bibtex names, refer
to this
a. 'article'
b. 'inproceedings'
c. 'book'
13. If an entry is of one of the types referenced in requirement 12, it will be checked for the
following fields:
a. 'year'
b. 'volume'
c. 'pages'
14. Certain fields are only included with specific types. The following requirements are per
type.
a. 'article' types shall be checked for the following fields:
i. 'journal'
b. 'inproceedings' types shall be checked for the following fields:
i. 'booktitle'
c. 'book' types shall be checked for the following fields:
i. ???
(In general, the results provided by MAS do not contain all fields as required by the
BibTex standard.)
15. If entry data is found after checking the fields referenced in requirements 13 and 14.1-3,
it shall be associated with the specific entry and retained for use in the "update database"
component.
16. Fields shall be formatted according to the format specified in the database schema before
being used in the "update database" component. //reference schema requirements
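The validity and field checks in requirements 9 through 15 could be sketched in Python as
follows. The sketch assumes each BibTex entry has already been parsed into a type string and a
dictionary of fields; the function name process_entry is hypothetical, and no extra fields are
listed for 'book' because the specification leaves them open.

```python
REQUIRED_FIELDS = ("author", "title")        # requirement 9
COMMON_FIELDS = ("year", "volume", "pages")  # requirement 13
TYPE_FIELDS = {                              # requirement 14
    "article": ("journal",),
    "inproceedings": ("booktitle",),
    "book": (),  # left unspecified in the requirements
}


def process_entry(entry_type, fields):
    """Return the data to retain for a valid entry, or None for an invalid one."""
    if any(not fields.get(name) for name in REQUIRED_FIELDS):
        return None  # requirement 11: invalid entries are not retained
    retained = {name: fields[name] for name in REQUIRED_FIELDS}
    entry_type = entry_type.lower()
    if entry_type in TYPE_FIELDS:
        for name in COMMON_FIELDS + TYPE_FIELDS[entry_type]:
            if fields.get(name):  # requirement 15: keep only data that was found
                retained[name] = fields[name]
    return retained
```

Fields that do not apply to an entry's type, such as 'booktitle' on an 'article', are dropped
and never reach the "update database" component.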
3.5.3 Database Updater (Philip McDonald)
1. The updater component shall be triggered by the parser component.
2. The updater shall perform the update process for all entries ("set of entries") supplied by
the parser.
3. If the set of entries is empty, the updater will not perform the updating process.
4. Each set of entries supplied by the parser shall be associated with a maximum of one
author, the "current author".
5. The following requirements shall be met for each entry:
a. The updater shall check for a duplicate publication using the 'title' data.
b. If a duplicate publication is found, the updater shall check the author of the paper.
c. If a duplicate publication is found, and the current author is not an owner of the
publication, the current author shall be made an author of the publication.
d. If a duplicate publication is found, and the current author is an owner of the publication,
the entry shall be discarded.
e. If the publication is not a duplicate, the following requirements shall be met:
i. The 'Authors' table shall be updated to reflect ownership of the new publication.
ii. The 'Paper' table shall be updated with a new row. The new row shall contain the data
from the entry in the following format:
1. The 'Title' attribute shall be written with the data from the entry's 'title' field.
2. The 'Paper_Date' attribute shall be written with the data from the entry's 'date' field.
3. The 'Clevel' attribute shall be set to the value corresponding to an "unapproved"
publication.
4. If a certain data item is not included in search results for a grant or publication, the
database entry shall be left null.
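The per-entry decisions above could be sketched in Python as follows. To stay self-contained,
the sketch replaces the database with in-memory structures, which is an assumption: papers maps
a title to its row, and owns is a set of (author, title) pairs modeled on the Owns table in
section 3.4.7. The function name update_database and the returned action labels are
hypothetical.

```python
def update_database(entries, current_author, papers, owns):
    """Apply the per-entry rules and return the action taken for each entry."""
    actions = []
    for entry in entries:
        title = entry["title"]
        if title in papers:  # a duplicate publication was found
            if (current_author, title) in owns:
                actions.append("discarded")  # author already owns it
            else:
                owns.add((current_author, title))
                actions.append("linked")     # author added to the existing paper
            continue
        papers[title] = {
            "Title": title,
            "Paper_Date": entry.get("date"),  # None (null) when the field is absent
            "Clevel": "U",                    # 'U' marks an unapproved paper
        }
        owns.add((current_author, title))
        actions.append("added")
    return actions
```

An empty set of entries simply produces no actions, matching the requirement that the updater
does nothing in that case.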
3.5.4 Email Notifier (Troy Connor)
The email notifier is a tool that will alert an author when the scraper finds one of their
publications. The email notifier will allow the user to decide whether the publication that was
found is approved or disapproved. If approved, the publication will remain in the database and
be viewable. If disapproved, the publication will be deleted and will not be viewable in the
system. The email notifier will also alert the author of publications/grants that they have
uploaded. The email notifier must meet the following requirements:
1. An automatic email sent when a publication is found for an author
2. Link in the email to activate a publication awaiting approval
3. Link in the email to delete a publication that the author disapproves
4. Email notification, sent once a month, reminding users to use READ
5. PHP script to alert users
6. Cron job set to run email notifications at the required intervals
7. Email sent to verify an uploaded submission (grant or publication)
8. Email notification to alert the user that their profile has been changed