Running Head: Lab 2 – READ Prototype Product Specification

CS 411W Lab II
Prototype Product Specification
For
READ
Prepared by: Jim Lawrence Calderon, Black Team
Date: 04/08/13
Table of Contents
1 INTRODUCTION
1.1 Purpose
1.2 Scope
1.3 Definitions, Acronyms, and Abbreviations
1.4 References
1.5 Overview
2 General Description
2.1 Prototype Architecture Description
2.2 Prototype Functional Description
2.3 External Interfaces
2.3.1 Hardware Interfaces
2.3.2 Software Interfaces
2.3.3 User Interfaces
2.3.4 Communication Protocols and Interfaces
Table of Figures
Figure 1: Major Functional Component Diagram
Figure 2: READ Prototype MFCD
Figure 3: Scraper Flowchart
Figure 4: READ Site Map
Table of Tables
Table 1: RWP vs. Prototype features
Table 2: Functionality for Publications, Grants, and User Profile pages
1 INTRODUCTION
Grants account for up to 20% of a public research university’s revenue (Delta Cost Project). Universities win these grants through ongoing research that is documented in technical reports and publications. An online database of these publications is one way to attract potential students who are interested in the research under way at the university. However, keeping such a system up to date by hand is tedious.

Old Dominion University’s Computer Science department illustrates the problem. Its current static webpage, which has not been updated since 2008, is a list of publications maintained manually by an appointed staff member. To have a publication listed, an author must email the information to this person and wait for the page to be updated. ODU is not the only institution with this problem; other organizations may experience similar issues.

The Repository for Electronic Aggregation of Documents (READ) is a system created by the CS411 Black Group in response to the ODU Computer Science department’s lack of an online publication database. READ is not built strictly for ODU and may be integrated into the systems of any interested party. Its scraping component, the Schaefer Scraper, is designed to automatically collect publications owned by CS faculty members from multiple scholastic websites. Viewers can then organize these publications with a number of filters as they browse. In addition to browsing, authors and administrators will be able to update information on existing publications or manually add publications when necessary. With these features, READ aims to relieve researchers of managing their publications by hand beyond the initial upload to a single site.
1.1 Purpose
READ is an online database that will house grants, articles, statistics pertaining to them, and links to research. Viewers will be able to browse these items using a number of filters, such as author name, publication date, and keywords. The goal is to minimize the need for authors to manage their publications by hand. READ will also strive to advertise ongoing research more efficiently and to show available grants to whoever requires them. In short, the goal is for authors to work less in order to READ more.

A major part of READ is software named the Schaefer Scraper. It searches a list of specific academic websites, defined by an administrator, for publications that match registered authors’ credentials. The pertinent information is then uploaded to the database. At this point the author receives an email notification so that they can authorize the publication and correct any mistaken information. Over time, the system will learn when not to notify an author, based on patterns in the publications they have previously denied. This gathers all of an author’s publications into a single location automatically, with little work on their part.

The system will allow viewers to browse the database using filters for grants, multiple types of article publications, and their authors. A user profile will show personal information, a graphical representation of the number of publications the author has produced and the funding they have received, and a list of associated publications. This statistical information will show viewers a specific author’s area of expertise and level of activity.
READ consists of three major software functional components: the web interface, the database, and the scraper (Figure 1). The web interface contains pages that are viewable only by administrators or registered authors and pages that are accessible to the public. Private content requires an authorized login and mainly deals with updating publication or grant data. The public section consists of the pages that any viewer can interact with, such as the article publications, grant proposals, graduate student list, and faculty list. The viewer interacts with the web interface by selecting filters, which are used to query the database. The query results are then used to build the pages the viewer browses.
Figure 1: Major Functional Component Diagram
The second major component is a MySQL database that will contain six tables: Authors, Paper, Grants, Owns, Tags, and CO_PI. These tables will be populated either automatically with information obtained by the Schaefer Scraper or manually by registered authors. The database’s primary function is to store links to external publications, grant proposals, and all information associated with that material. Accuracy will be verified through a confirmation email sent to the authors, which will allow them to correct any wrong data.
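As a rough illustration of how the six tables might relate to one another, the sketch below creates them in an in-memory SQLite database using Python's standard library. SQLite stands in for MySQL only so that the example is self-contained, and every column name is a hypothetical assumption: this specification names the tables but not their fields.

```python
import sqlite3

# Hypothetical columns throughout: the specification lists only the table names.
SCHEMA = """
CREATE TABLE Authors (
    author_id INTEGER PRIMARY KEY,
    name      TEXT NOT NULL,
    email     TEXT NOT NULL
);
CREATE TABLE Paper (
    paper_id  INTEGER PRIMARY KEY,
    title     TEXT NOT NULL,
    publisher TEXT,
    pub_date  TEXT,
    link      TEXT
);
CREATE TABLE Grants (
    grant_id       INTEGER PRIMARY KEY,
    title          TEXT NOT NULL,
    funding_agency TEXT,
    start_date     TEXT,
    end_date       TEXT,
    pi_id          INTEGER REFERENCES Authors(author_id)
);
CREATE TABLE Owns (
    author_id INTEGER REFERENCES Authors(author_id),
    paper_id  INTEGER REFERENCES Paper(paper_id),
    PRIMARY KEY (author_id, paper_id)
);
CREATE TABLE Tags (
    paper_id INTEGER REFERENCES Paper(paper_id),
    keyword  TEXT NOT NULL
);
CREATE TABLE CO_PI (
    grant_id  INTEGER REFERENCES Grants(grant_id),
    author_id INTEGER REFERENCES Authors(author_id),
    PRIMARY KEY (grant_id, author_id)
);
"""

def create_schema(conn):
    """Create the six (assumed) READ tables on the given connection."""
    conn.executescript(SCHEMA)

def table_names(conn):
    """Return the names of all tables in the database, sorted."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
create_schema(conn)
```

In this reading, Owns links authors to papers and CO_PI links co-principal investigators to grants, which is one plausible interpretation of the table names rather than the team's confirmed design.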
The third major component is the Schaefer Scraper, an automated tool that checks predefined websites for new publications. It locates an author’s profile on a specified website using a unique string identifier and extracts the BibTeX information. That information is then parsed for the required data, which is stored in the database, where it can be queried and viewed through the web interface.
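The parsing step can be sketched as follows. This is a minimal, simplified BibTeX reader written for illustration only, not the Schaefer Scraper's actual parser: it handles a single entry, field values in braces (with at most one level of nested braces), double quotes, or bare digits, and the sample entry is invented.

```python
import re

# "@type{key," at the start of an entry.
ENTRY_RE = re.compile(r'@(\w+)\s*\{\s*([^,\s]+)\s*,')
# name = {value} | name = "value" | name = 1234
FIELD_RE = re.compile(
    r'(\w+)\s*=\s*(?:\{((?:[^{}]|\{[^{}]*\})*)\}|"([^"]*)"|(\d+))'
)

def parse_bibtex_entry(text):
    """Parse one BibTeX entry into a dict of lower-cased field names."""
    head = ENTRY_RE.search(text)
    if head is None:
        raise ValueError("no BibTeX entry found")
    record = {"type": head.group(1).lower(), "key": head.group(2)}
    for match in FIELD_RE.finditer(text, head.end()):
        name = match.group(1).lower()
        # Exactly one of the three value groups is set for each match.
        value = next(g for g in match.groups()[1:] if g is not None)
        # Drop protective braces and collapse whitespace.
        record[name] = " ".join(value.replace("{", "").replace("}", "").split())
    return record

sample = """@article{doe2012,
  author  = {Doe, Jane and Roe, Richard},
  title   = {A Study of {Web} Archives},
  journal = "Journal of Examples",
  year    = 2012
}"""

record = parse_bibtex_entry(sample)
```

Here `record["title"]` holds the title with its protective braces removed, and the remaining fields map directly onto the kind of publication data the database stores.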
1.2 Scope
The prototype for READ will be essentially identical to the Real World Product (RWP). It will be modeled using actual publications from the ODU Computer Science department and hosted on a virtual machine running the Debian operating system. Its main objective is to demonstrate the essential features of the RWP. As illustrated in Table 1, it will not include the Sparkline graph integration in the user profiles, the learning algorithm, or the faculty list. In the RWP, the graphs will be visual representations of the publications and grants that an author has been associated with over the last few years, and the learning algorithm will allow the system to recognize an author’s likely publications based on their pattern of approving or denying past entries.
1.3 Definitions, Acronyms, and Abbreviations
Administrator/Administrative User: A user with increased privileges for editing database content.
Author: A person who is able to add and edit publications and grants in the system under their name.
BibTeX: A plain-text file format for bibliographic reference information. It will be used to automatically fill in key information when uploading or editing publications and grants.
Client application: The module that takes input, creates queries to be processed by a server, and receives the results from the server.
Client/Server Architecture: A software engineering paradigm that separates functionality into a “client” and a “server” application that interact.
CSS: A style sheet language used to specify the presentation of HTML pages.
Data Mining: The act of going through a source of input to find specific information.
Database Schema: A description of the structure of a database.
Funding Agency: The source of funds for research grants.
Git: A software system for controlling and organizing software versioning.
Graphical User Interface (GUI): A computer interface composed of icons, text fields, menus, etc., through which a user interacts with a software application using a mouse and keyboard. Used to differentiate from a “command-line interface”, in which a user interacts with a software application solely through a text terminal.
Joomla!: A content management system.
jQuery Sparklines: A development library for the visualization of data.
MySQL: An open-source relational database management system that uses SQL.
ODU: Old Dominion University.
Parse: To analyze text according to a defined format in order to extract its component information.
Perl: A widely-used programming language on the server-side of web applications.
PHP: A widely-used programming language on the server-side of web applications.
Principal Investigator (PI): The primary researcher on whom a research grant is bestowed, responsible for documenting the work and publishing research results.
Publication or Academic Publication: A document published in an academic venue, such as an academic journal, a technical report, or a record of conference proceedings.
Query: A statement sent to the database either to change the database or to get back results.
READ: Repository for Electronic Aggregation of Documents.
RSS: A system for subscribing to and distributing news.
Scraper: An automated application designed to scan a source of input, such as a document or a website, for pertinent information.
Server application: The module that takes queries or requests from a client module, processes them, and returns the result to the client.
Software Compatibility: A description of whether different software packages, or versions of software, can communicate or interact.
SQL: A widely-used language for querying databases.
SQL injection: An attack in which malicious SQL is inserted into an application’s input in order to perform unauthorized queries on a database.
User Authentication: The process of verifying the access credentials of a user of an automated system, usually accomplished by requesting a username and password combination.
Version Control: A method for organizing and recording different versions of documents that have been created over time.
Viewer: In the scope of this project, an outside person who wishes to query the information contained in the READ database.
Virtual Private Server (VPS): A software version of a hardware server. It is used to create independent servers (programs that manage access to a centralized resource or service in a network) on a single piece of hardware.
Webserver: A constantly “on” resource whose sole or main job is to respond to HTTP requests from browsers.
XML: Extensible Markup Language.
1.4 References
"Delta Cost Project Data." The Delta Project on Postsecondary Education Costs, Productivity,
and Accountability. The Delta Project, n.d. Web. 9 Feb. 2013.
Calderon, Jim. (2013). Lab 1 – READ Product Description.
1.5 Overview
This product specification describes the hardware and software configuration, external interfaces, capabilities, and features of the READ prototype. The remaining sections provide a detailed description of the software and external interface architecture of the prototype; its key features; the parameters used to control, manage, or establish each feature; and the performance characteristics of each feature in terms of outputs, displays, and user interaction.
2 General Description
The prototype of READ will focus on properly gathering and displaying publications owned by the ODU Computer Science faculty, obtained from MicrosoftAcademic. Secondary priorities include user profiles and administrative privileges. Due to time constraints, certain features of the RWP will be omitted.
2.1 Prototype Architecture Description
READ’s prototype will be very similar to the RWP, with a few features withheld (Table 1). The web interface will be written in PHP and hosted on an ODU web server, with Google Chrome as the browser of choice. The publications and grants sections will be fully functional, while the profile pages will have limited functionality compared to the RWP and the faculty list will be omitted. A desktop or laptop with Internet access is required to access the interface.
| Feature | Real World Product (RWP) | Prototype |
| Browsing Capabilities | Ability to browse all grants and publications | Ability to browse all grants and publications |
| Publication Filtering Capabilities | Filtered by title, publisher, authors, publication date, date added, and keywords. | Filtered by title, publisher, authors, publication date, date added, and keywords. |
| Grant Filtering Capabilities | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. |
| Add, edit, and delete publications and grants | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. |
| Faculty page | Lists faculty and provides a link to each person’s profile page | Not included. |
| Login interface | Linked to Old Dominion University Computer Science accounts | Linked to Old Dominion University Computer Science accounts |
| Profile Page | Displays authors’ profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Displays graphs. | Displays authors’ profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Graphs not included. |
| Scraper | Will update the system with new publications and grants and alert authors when one is added to the system under their name. | Will update the system with publications only and alert authors when one is added to the system under their name. |
| Administrative Privileges | Administrators are able to edit, add, or remove anything in the system. | Administrators are able to edit, add, or remove anything in the system. |
Table 1: RWP vs. Prototype features
The Schaefer Scraper software, which will be responsible for obtaining publications, is already coded in PHP as well and only needs to be integrated with the web interface once the interface is ready. Only MicrosoftAcademic will be used as a source for the prototype. Actual data will be scraped from this website using the software. The data will then be parsed for pertinent information and stored in a MySQL database, which will be integrated with the web interface.
The prototype has three major functions: obtain publication information from scholastic sites using the Schaefer Scraper, store the data obtained in the database, and display the information for viewers to browse in the web interface. The web interface will be divided into publications, grants, user profile, and administrative function pages. Each page has filters tailored to its content. By default, both the publications page and the grants page will display their material with the most recent uploads first. The profile page will allow authors to edit their personal, grant, publication, or citation information. Publications and grants may be added either by entering the information into the fields or by supplying a BibTeX bibliography file, from which the information is extracted in a manner similar to the Schaefer Scraper.
2.2 Prototype Functional Description
Figure 2 illustrates the major functional components of the prototype. READ user accounts will be linked to ODU Computer Science department accounts, so no further registration is necessary. On their first use of the system, each user will be asked to provide the unique string identifier for their profile page on MicrosoftAcademic. The Schaefer Scraper will then scrape the website monthly for each user’s BibTeX file and parse it for publication information. Each publication will be compared with the publications already in the database and discarded if it is found to be a duplicate. Otherwise, it is uploaded into the database, and an email is sent asking the user to verify that the publication is theirs and that the gathered information is correct. Appropriate changes are then made based on the user’s response.
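The duplicate check in this flow can be sketched as below. The specification does not say how duplicates are detected, so comparing normalized titles is an assumption made purely for illustration, as are the function names and the record shape (a dict with a "title" key).

```python
import re

def normalize_title(title):
    """Lower-case a title and collapse punctuation and whitespace."""
    return " ".join(re.sub(r"[^a-z0-9]+", " ", title.lower()).split())

def merge_scraped(existing_titles, scraped):
    """Split scraped records into new ones and duplicates.

    existing_titles: titles already stored in the database.
    scraped: list of dicts, each with at least a "title" key.
    Returns (to_verify, duplicates).
    """
    seen = {normalize_title(t) for t in existing_titles}
    to_verify, duplicates = [], []
    for record in scraped:
        key = normalize_title(record["title"])
        if key in seen:
            duplicates.append(record)   # discarded as a duplicate
        else:
            seen.add(key)               # also catches repeats within one scrape
            to_verify.append(record)    # stored, then confirmed by email
    return to_verify, duplicates

existing = ["A Study of Web Archives"]
scraped = [
    {"title": "A study of web archives."},
    {"title": "Crawling at Scale"},
    {"title": "Crawling  at scale"},
]
to_verify, duplicates = merge_scraped(existing, scraped)
```

Only the records in `to_verify` would go on to be uploaded and trigger a verification email; a production system would probably also compare authors and publication year before treating two entries as the same work.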
Figure 2: READ Prototype MFCD
Figure 3: Scraper Flowchart
Viewers then browse the publications and grants by interacting with the READ website through the publication, grants, and profile pages. Each page will have its own filters to narrow down the results. Table 2 lists the options by which the viewer may filter entries on each page (including the user profile page) and the information displayed for each entry, including what viewers may see on a user profile page.
| | Publications | Grants | Profile Page |
| Filterable by: | Keywords; Start-End Date; Full Text Availability; Items per page; Authors | Principal, Co-principal investigator; Organization; Funding Agency; Award Range; Start-End Date; Active Status; Items per page | Start-End Date; Keywords; Full Text Availability; Items per page |
| Information shown: | Title; Authors; Reference Information; Publisher; Publication Date; Abstract | Title; Principal, Co-principal Investigators; Organization; Active State; Start-End Date; Funding Agency, Agency Directorate, Agency Division; Award Amount, Award Number | Name; Job Title; Organization Affiliation; Homepage Link; Grants/Publications associated with the faculty member |
Table 2: Functionality for Publications, Grants, and User Profile pages
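To illustrate how a viewer's filter selections on the publications page could be turned into a database query, the sketch below builds a parameterized SQL statement. It is a hedged example rather than the READ implementation: the column names (`pub_date`, `date_added`, `keyword`, `name`), the filter keys, and the default of 25 items per page are all assumptions, and `%s` is the placeholder style used by common Python MySQL drivers. Passing user input as parameters rather than pasting it into the SQL text is what guards against SQL injection.

```python
def build_publication_query(filters, page=1):
    """Return (sql, params) for a filtered, paginated publication listing."""
    clauses, params = [], []
    if filters.get("keyword"):
        clauses.append("p.paper_id IN (SELECT paper_id FROM Tags WHERE keyword = %s)")
        params.append(filters["keyword"])
    if filters.get("start_date"):
        clauses.append("p.pub_date >= %s")
        params.append(filters["start_date"])
    if filters.get("end_date"):
        clauses.append("p.pub_date <= %s")
        params.append(filters["end_date"])
    if filters.get("author"):
        clauses.append(
            "p.paper_id IN (SELECT o.paper_id FROM Owns o "
            "JOIN Authors a ON a.author_id = o.author_id WHERE a.name = %s)"
        )
        params.append(filters["author"])
    # Clamp the page size so a crafted request cannot ask for unlimited rows.
    per_page = min(max(int(filters.get("items_per_page", 25)), 1), 100)
    where = (" WHERE " + " AND ".join(clauses)) if clauses else ""
    # Most recent uploads first, matching the default ordering of the page.
    sql = ("SELECT p.* FROM Paper p" + where +
           " ORDER BY p.date_added DESC LIMIT %s OFFSET %s")
    params.extend([per_page, (max(page, 1) - 1) * per_page])
    return sql, params

sql, params = build_publication_query(
    {"keyword": "archives", "start_date": "2010-01-01", "items_per_page": 10},
    page=3,
)
```

The returned pair would be handed to a database cursor's `execute(sql, params)` call; the grants and profile pages could follow the same pattern with their own filter lists from Table 2.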
2.3 External Interfaces
The external interfaces required are limited to standard computer hardware and software.
The READ website where the information is browsed is the only custom interface.
2.3.1 Hardware Interfaces
No special hardware is created or needed for the prototype. Personal laptops and ODU
class computers will be used for testing the READ website, and monitoring and controlling the
database. All components will interact using the ODU network.
2.3.2 Software Interfaces
The system will be created and tested on a virtual machine running the Debian operating system. MySQL databases will also be maintained and controlled through this machine. The READ website will be developed in PHP with the assistance of Joomla!. The Schaefer Scraper will be written in Python, and a parser will be developed to format the information obtained from the scrape.
2.3.3 User Interfaces
This system will utilize two user interfaces, both accessible from any browser on a device with an Internet connection. The first interface is an email service, such as Gmail or Yahoo! Mail, and is required only by users; it is used to verify the validity of uploaded publications. The second interface is the READ website. The site will allow viewers to browse the publications, grants, and users in the database. Figure 4 illustrates the READ site map.
[Site map nodes: READ Homepage; Publication; Grant; Administration; User Profile; Add Publications; Add Grants; Edit Publications; Edit Grants]
Figure 4: READ Site Map
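The email interface described in this section could compose its verification messages along the following lines. This is a minimal sketch using Python's standard `email` package; the sender address, subject line, wording, and review URL are all hypothetical, and the message is only built here, not sent.

```python
from email.message import EmailMessage

def build_verification_email(author_name, author_email, title, review_url,
                             sender="read-noreply@example.org"):
    """Compose (but do not send) a publication verification message."""
    msg = EmailMessage()
    msg["From"] = sender            # hypothetical sender address
    msg["To"] = author_email
    msg["Subject"] = "READ: please verify a newly found publication"
    msg.set_content(
        f"Dear {author_name},\n\n"
        f"READ found a publication that may be yours:\n\n"
        f"    {title}\n\n"
        f"Please confirm, correct, or reject it here:\n"
        f"{review_url}\n"
    )
    return msg

message = build_verification_email(
    "Jane Doe", "jdoe@example.edu",
    "A Study of Web Archives", "https://read.example.org/verify/123",
)
```

Delivery could then be handled by `smtplib.SMTP(...).send_message(message)` against whatever mail server the deployment provides.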
2.3.4 Communication Protocols and Interfaces
READ will use two communication protocols: Hypertext Transfer Protocol Secure (HTTPS) and Transmission Control Protocol/Internet Protocol (TCP/IP).