Running Head: Lab 1 – READ Description

Lab I – READ Product Specification
Jim Lawrence Calderon, Team Black
CS411W
Janet Brunelle
February 18, 2013
Version 3.0
Table of Contents
1 INTRODUCTION
2 PRODUCT DESCRIPTION
2.1 Key Product Features and Capabilities
2.2 Major Components (Hardware/Software)
3 IDENTIFICATION OF CASE STUDY
4 PRODUCT PROTOTYPE DESCRIPTION
4.1 Prototype Architecture (Hardware/Software)
4.2 Prototype Features and Capabilities
4.3 Prototype Development Challenges
GLOSSARY
REFERENCES
Table of Figures
Figure 1: Major Functional Component Diagram
Table of Tables
Table 1: RWP vs. Prototype features
1 INTRODUCTION
Grants account for up to 20% of a public research university’s revenue (Delta Cost Project).
Universities are awarded these grants on the strength of ongoing research that is documented in technical reports and publications. An online database of these publications is a way to attract
potential students who are interested in the research under way at the university. However, keeping
such a system updated manually is tedious.
Old Dominion University’s Computer Science department is a good case study to illustrate
this problem. The current static webpage, which has not been updated since 2008, consists of a
list of publications manually updated by an appointed staff member. To have a
publication listed, an author must send the information to this person via email and wait for
the page to be updated. ODU is not the only institution with this problem; other
organizations may experience similar issues.
The Repository for Electronic Aggregation of Documents (READ) is a system created by
the CS411 Black Group in response to the ODU Computer Science Department’s lack of an online
publication database. Because the problem is not unique to ODU, READ is not created strictly for ODU
and may be integrated into the systems of any interested party. Its Schaefer Scraper is designed to
automatically collect publications owned by CS faculty members from
multiple scholastic websites. Viewers can then organize these publications
with numerous filters as they browse the material. While viewers are able
to browse, authors and administrators will additionally be able to update information
on existing publications or manually add publications when necessary. With these features,
READ aims to relieve researchers of managing their publications manually
beyond the initial upload to a single site.
2 PRODUCT DESCRIPTION
READ is an online database that will house grants, articles, statistics
pertaining to them, and links to research. Viewers will be able to browse these items using
a number of filters, such as author name, publication date, and keywords. The goal
is to minimize the need for authors to manage their publications by hand.
READ will also strive to advertise ongoing research more
efficiently and to show available grants to whoever requires them. The goal is for the authors to
work less in order to READ more.
2.1 Key Product Features and Capabilities
A major part of READ is software named the Schaefer Scraper. It searches a list of
specific academic websites, defined by an administrator, for publications that match registered
authors’ credentials. The pertinent information is then uploaded to the database. At this point the author
receives an email notification and can authorize the publication or edit any mistaken
information. The system will learn when not to notify authors based on patterns in the publications they
denied in the past. This gathers all of an author’s publications into a single
location automatically, with little work on the author’s part.
The system will allow viewers to browse the database using filters for grants, multiple
types of article publications, and their authors. A user profile will show personal
information, a graphical representation of the number of publications the author has created, the funding
the author has received, and a list of associated publications. This statistical information will show viewers
a specific author’s area of expertise and level of activity.
2.2 Major Components (Hardware/Software)
READ consists of three major functional software components: web interface, database,
and scraper (Figure 1). The web interface contains pages that are viewable only by
administrators or registered authors and pages that are accessible to the public. Private content
will be accessible only with an authorized login and will mainly deal with updating
publication or grant data. The public section consists of the pages that any viewer will
be able to interact with, such as the article publications, grant proposals, student list, and faculty
list. The viewer interacts with the web interface by selecting filters that are used to query the
database. The query results then build the pages that the viewer browses.
Figure 1: Major Functional Component Diagram
The second major component is a MySQL database that will contain six tables: Authors,
Paper, Grants, Owns, Tags, and CO_PI. These will be populated either automatically with
information obtained by the Schaefer Scraper or manually by registered authors. The database’s primary
function is to store links to external publications and grant proposals, along with the information
associated with them. Information accuracy will be verified by an email confirmation sent to the
authors, who will be able to edit any incorrect data.
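The six tables named above could be laid out roughly as follows. This is only a sketch: the table names come from this document, but every column, key, and relationship is an assumption made for illustration (for example, that Owns links authors to papers and that CO_PI links co-principal investigators to grants). SQLite from the Python standard library stands in for MySQL so that the example runs without a server.

```python
import sqlite3

# Table names are from the document; all columns and keys are hypothetical.
SCHEMA = """
CREATE TABLE Authors (author_id INTEGER PRIMARY KEY, name TEXT NOT NULL, email TEXT);
CREATE TABLE Paper   (paper_id INTEGER PRIMARY KEY, title TEXT NOT NULL, url TEXT,
                      published TEXT, date_added TEXT);
CREATE TABLE Grants  (grant_id INTEGER PRIMARY KEY, title TEXT NOT NULL,
                      funding_agency TEXT,
                      pi_id INTEGER REFERENCES Authors(author_id),
                      start_date TEXT, end_date TEXT);
-- Assumed link table: which author owns which paper.
CREATE TABLE Owns    (author_id INTEGER REFERENCES Authors(author_id),
                      paper_id INTEGER REFERENCES Paper(paper_id),
                      PRIMARY KEY (author_id, paper_id));
-- Assumed keyword table used for filtering.
CREATE TABLE Tags    (paper_id INTEGER REFERENCES Paper(paper_id), keyword TEXT,
                      PRIMARY KEY (paper_id, keyword));
-- Assumed link table: co-principal investigators on a grant.
CREATE TABLE CO_PI   (grant_id INTEGER REFERENCES Grants(grant_id),
                      author_id INTEGER REFERENCES Authors(author_id),
                      PRIMARY KEY (grant_id, author_id));
"""

def create_schema(conn):
    """Create the six (hypothetically structured) tables on an open connection."""
    conn.executescript(SCHEMA)

def table_names(conn):
    """Return the names of all tables in the database, sorted alphabetically."""
    rows = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [row[0] for row in rows]

conn = sqlite3.connect(":memory:")
create_schema(conn)
print(table_names(conn))
```

Keeping ownership and co-investigator relationships in their own link tables would let one publication or grant belong to several authors, which is one plausible reason for the Owns and CO_PI tables.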
The third major component is the Schaefer Scraper, an automated tool that goes through
predefined websites looking for new publications. It locates an author’s profile on
a specified website using a unique string identifier and extracts the BibTeX information. That information is then
parsed for the required data, which is stored in the database, where it can be queried and
viewed through the web interface.
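The parsing step can be illustrated with a small sketch. It is not the Schaefer Scraper’s code, which this document does not show; it is a simplified parser written for illustration, and the function name, the sample entry, and the people named in it are all made up. It handles only flat fields (values in braces, in quotes, or bare numbers) and skips fields whose values contain nested braces.

```python
import re

# "@type{key," at the start of an entry.
ENTRY_HEAD = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,")
# "name = {value}", 'name = "value"', or "name = 123" (no nested braces).
FIELD = re.compile(r'(\w+)\s*=\s*(?:\{([^{}]*)\}|"([^"]*)"|(\d+))')

def parse_bibtex_entry(text):
    """Parse the first BibTeX entry in text into a dictionary of its fields."""
    head = ENTRY_HEAD.search(text)
    if head is None:
        raise ValueError("no BibTeX entry found")
    record = {"type": head.group(1).lower(), "key": head.group(2)}
    for match in FIELD.finditer(text, head.end()):
        name = match.group(1).lower()
        # Exactly one of the three value groups is set for each match.
        value = next(g for g in match.groups()[1:] if g is not None)
        record[name] = " ".join(value.split())  # collapse internal whitespace
    if "author" in record:
        # BibTeX separates multiple authors with the word "and".
        record["authors"] = [
            a.strip() for a in re.split(r"\s+and\s+", record["author"])
        ]
    return record

sample = """@article{doe2012read,
  author = {Doe, Jane and Smith, John},
  title = "An Example Title",
  year = 2012
}"""
entry = parse_bibtex_entry(sample)
print(entry["title"], entry["authors"])
```

A production parser would also need to handle nested braces, string concatenation, and several entries per file, so a dedicated BibTeX library may be the safer choice.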
3 IDENTIFICATION OF CASE STUDY
This product is being built for the Old Dominion University Computer Science Department
to replace an antiquated and unused publication page. The previous system was a list of
publications sorted chronologically from latest upload, manually updated by an assigned staff
member. It lacked any sort of filtering, did not list received grants, and was woefully out of date. Due
to the tedious work required, the page was abandoned as of 2008. As a result, publications
owned by the faculty are not readily available in a centralized location for interested parties.
4 PRODUCT PROTOTYPE DESCRIPTION
The prototype for READ will be essentially identical to the Real World Product (RWP). It will be
modeled using actual publications from the ODU Computer Science department and hosted on a
virtual machine running the Debian operating system. It is required in order to demonstrate the
essential features of the RWP. As illustrated in Table 1, it will not include the Sparkline graph
integration in the user profiles, the learning algorithm, or the faculty list. The graphs will be
visual representations of the publications and grants that an author has been associated with within
the last few years. The learning algorithm will allow the system to learn to recognize an author’s likely
publications based on the author’s pattern of approving or denying past entries.
4.1 Prototype Architecture (Hardware/Software)
READ’s prototype will be very similar to the RWP, with a few features withheld. The web
interface will be written in PHP and hosted on an ODU web
server, with Google Chrome as the browser of choice. The
publications and grants sections will be fully functional, the profile pages will have
limited functionality compared to the RWP, and the faculty list will be omitted (Table 1). A desktop or laptop with Internet access is required
to access the interface.
Features | Real World Product (RWP) | Prototype
--- | --- | ---
Browsing Capabilities | Ability to browse all grants and publications | Ability to browse all grants and publications
Publication Filtering Capabilities | Filtered by title, publisher, authors, publication date, date added, and keywords. | Filtered by title, publisher, authors, publication date, date added, and keywords.
Grant Filtering Capabilities | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state.
Add, edit, and delete publications and grants | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document.
Faculty page | Lists faculty and provides a link to each person’s profile page | Not included.
Login interface | Linked to Old Dominion University Computer Science accounts | Linked to Old Dominion University Computer Science accounts
Profile Page | Displays authors’ profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Displays graphs | Displays authors’ profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Graphs not included.
Scraper | Will update the system with new publications and grants and alert authors when one is added to the system under their name. | Will update the system with publications only and alert authors when one is added to the system under their name.
Administrative Privileges | Administrators are able to edit, add, or remove anything in the system. | Administrators are able to edit, add, or remove anything in the system.
Table 1: RWP vs. Prototype features
The Schaefer Scraper software, which will be responsible for obtaining publications, is
already coded in PHP and only needs to be integrated with the web interface once the interface is ready.
The websites that will be used include Google Scholar, Scopus, and Microsoft Academic. Actual
data will be scraped from these websites using the software, parsed for pertinent
information, and stored in a MySQL database, which will be integrated with the web interface.
4.2 Prototype Features and Capabilities
The prototype has three major functions: obtain publication
information from multiple scholastic sites using the Schaefer Scraper, store the data
obtained in the database, and display the information for viewers to browse
in the web interface. The interface will be divided into the publications page, grants page, user
profile pages, and administrative function pages. Each page has filters tailored
to its content. By default, both the publications page and the grants page will display their
material by latest upload. The profile page will allow an author to edit personal
information, grants, publications, or citation information. Publications and grants may be added either by entering
the information into the fields or by supplying a BibTeX bibliography file from which the
information can be extracted in a manner similar to the Schaefer Scraper.
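The way selected filters turn into a database query, with the latest upload shown first by default, might look like the sketch below. The filter names, the column names, and the function name are assumptions for illustration, and SQLite stands in for MySQL. Filter values are passed as bound parameters rather than pasted into the SQL text, which is one common way to guard against SQL injection.

```python
import sqlite3

# Hypothetical mapping from filter names to conditions on an assumed Paper table.
FILTERS = {
    "title": "title LIKE ?",
    "publisher": "publisher = ?",
    "published_after": "published >= ?",
}

def build_paper_query(filters):
    """Return (sql, params) for the chosen filters, newest upload first."""
    clauses, params = [], []
    for name, value in filters.items():
        if name not in FILTERS:
            raise ValueError(f"unknown filter: {name}")
        clauses.append(FILTERS[name])
        # Title is a substring match; the other filters compare directly.
        params.append(f"%{value}%" if name == "title" else value)
    sql = "SELECT title FROM Paper"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    sql += " ORDER BY date_added DESC"  # default ordering: latest upload first
    return sql, params

# Made-up rows, used only to exercise the query builder.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE Paper (title TEXT, publisher TEXT, published TEXT, date_added TEXT)"
)
conn.executemany(
    "INSERT INTO Paper VALUES (?, ?, ?, ?)",
    [
        ("Sample Paper A", "Publisher X", "2010-05-01", "2013-01-10"),
        ("Sample Paper B", "Publisher Y", "2012-03-15", "2013-02-01"),
    ],
)
sql, params = build_paper_query({"published_after": "2011-01-01"})
print([row[0] for row in conn.execute(sql, params)])
```

Restricting filters to a fixed set of known names means that user input never chooses a column name or SQL fragment; it only supplies values.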
4.3 Prototype Development Challenges
The predominant development challenge for the READ prototype is understanding the Schaefer Scraper and
integrating it with the web interface. As it currently stands, the software is
a black box, and the group has little knowledge of how it actually works. The data
will have to be translated before it can be used in database queries, so the manner in which
the Scraper exports the data from the websites is especially important to understand. Another
issue is the format of the information obtained through the Schaefer Scraper. Scraped
information does not necessarily follow a single format, and parsing it for the appropriate data to
insert into the database may be troublesome.
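One possible way to cope with inconsistent formats is to normalize every scraped record into a single shape before it reaches the database. The sketch below assumes that records arrive as dictionaries whose key names and value formats vary by site. The alias lists and the function name are hypothetical, since the Scraper’s actual output format is not yet known.

```python
import re

# Hypothetical aliases; the field names each site really produces are unknown.
ALIASES = {
    "title": ("title", "Title", "paper_title"),
    "year": ("year", "Year", "pub_year", "date"),
    "authors": ("authors", "author", "Authors"),
}

def normalize_record(raw):
    """Map a scraped record with varying keys onto title, year, and authors."""
    record = {}
    for field, names in ALIASES.items():
        # Take the first alias that is present with a non-empty value.
        record[field] = next((raw[n] for n in names if raw.get(n)), None)

    # Authors may arrive as a list or as one string split by ";" or " and ".
    authors = record["authors"]
    if authors is None:
        record["authors"] = []
    elif isinstance(authors, str):
        parts = re.split(r";|\s+and\s+", authors)
        record["authors"] = [p.strip() for p in parts if p.strip()]
    else:
        record["authors"] = list(authors)

    # Reduce any date-like value to a four-digit year, or None if absent.
    year = record["year"]
    found = re.search(r"\b(\d{4})\b", str(year)) if year is not None else None
    record["year"] = int(found.group(1)) if found else None
    return record

print(normalize_record(
    {"Title": "Sample", "date": "2012-03-15", "author": "Doe, Jane; Smith, John"}
))
```

Records that still lack a title or year after normalization could be held for manual review instead of being inserted automatically.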
GLOSSARY
Administrator/Administrative User: A user with increased privileges for editing database content.
Author: A person who is able to add publications and grants to the system under their
name and to edit them.
BibTeX: A plain-text file format for bibliographic reference information. It will be used to automatically
fill in key information when uploading or editing publications and grants.
Client application: The module that takes input and creates queries to be processed by a server,
and receives the results from the server.
Client/Server Architecture: A software engineering paradigm that separates functionality into a
“client” and a “server” application that interact.
CSS: Cascading Style Sheets, a style sheet language used to specify the presentation of HTML pages.
Data Mining: The act of going through a source of input to find specific information.
Database Schema: A description of the structure of a database.
Funding Agency: The source of funds for research grants.
Git: A software system for controlling and organizing software versioning.
Graphical User Interface (GUI): A computer interface composed of icons, text fields, menus, etc.
that can be interacted with via a mouse and keyboard, through which a user interacts with a
software application. Used to differentiate from a “command-line interface”, in which a user
interacts with a software application solely through a text terminal.
Joomla!: A content management system.
jQuery Sparklines: A development library for the visualization of data.
ODU: Old Dominion University.
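MySQL: An open-source relational database management system that is queried using SQL.
Parse: To analyze text, such as a statement or a document, in order to extract its structure or specific data.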
Perl: A widely-used programming language on the server-side of web applications.
PHP: A widely-used programming language on the server-side of web applications.
Principal Investigator (PI): The primary researcher that a research grant is bestowed upon,
responsible for documenting the work and publishing research results.
Publication or Academic Publication: A document published in an academic journal,
technical report, or record of conference proceedings.
Query: A request sent to the database either to change its contents or to get back results.
READ: Repository for Electronic Aggregation of Documents
RSS: A web feed format for subscribing to and distributing news.
Scraper: An automated application designed to scan a source of input such as a document or a
website for pertinent information.
Server application: The module that takes queries or requests from a client module, processes them,
and returns the results to the client.
Software Compatibility: A description of whether different software products, or versions of software,
can communicate/interact.
SQL: A widely-used programming language used to query databases.
SQL injection: An attack that inserts malicious SQL into an application’s input in order to perform unauthorized queries on a database.
User Authentication: The process of verifying the access credentials of a user of an automated
system, usually accomplished by requesting a username and password combination.
Viewer: In the scope of our project, an outside person who wishes to query the information
contained in the READ database.
Version Control: A method for organizing and recording different versions of documents that
have been created over time.
Virtual Private Server (VPS): A software version of a hardware server. Used to create
independent servers, each managing access to a centralized resource or service in a network, on a
single piece of hardware.
Webserver: A constantly “on” resource whose sole or main job is to respond to HTTP requests
from browsers.
XML: Extensible Markup Language, a text-based format for structured data.
REFERENCES
"Delta Cost Project Data." The Delta Project on Postsecondary Education Costs, Productivity,
and Accountability. The Delta Project, n.d. Web. 9 Feb. 2013.