Lab 1 – READ Description
Running head: LAB 1 – READ DESCRIPTION
Lab 1 – READ Product Description
Andrew Sprague
CS411
Janet Brunelle
February 15, 2013
Version #4
Table of Contents
1 INTRODUCTION
2 READ PRODUCT DESCRIPTION
2.1 KEY PRODUCT FEATURES AND CAPABILITIES
2.2 MAJOR COMPONENTS
3 IDENTIFICATION OF CASE STUDY
4 READ PRODUCT PROTOTYPE DESCRIPTION
4.1 MAJOR FUNCTIONAL COMPONENTS
4.2 PROTOTYPE FEATURES AND CAPABILITIES
4.3 PROTOTYPE CHALLENGES
GLOSSARY
REFERENCES
List of Tables
Table 1. Prototype Features
List of Figures
Figure 1. Major Functional Components
Figure 2. Prototype Major Functional Components
1. INTRODUCTION
According to the Digest of Education Statistics there are 4,706 research institutions in the
United States (Digest of Education Statistics). The primary way these institutions attract both
clients and new talent is to disseminate information on what they and their employees have
accomplished. This dissemination is usually done by employees publishing papers. Universities,
one of the largest groups of research institutions, make twenty percent of their annual income
from federal contracts and grants (freeby50.com). These universities do not have a good online
tool for sharing the work that they have done with prospective students or faculty. Currently the
systems that many institutions use to share publications are slow and tedious.
As a result, much of the work that universities and faculty accomplish goes without
proper recognition. Students who wish to attend a university whose professors
specialize in a particular research area can have trouble distinguishing between universities. The
work universities have done is not as well known as it could be, and the faculty may not always
have their work recognized.
The Repository for the Electronic Aggregation of Documents (READ) system is an
online program that consists of a database and web scraper designed to automate the process of
gathering and sharing faculty publications. READ will allow faculty to organize all of their
publications and make any required corrections before sharing them with the public. The
public may then access and browse the publications using READ.
The case study of READ will be the Computer Science Department of Old Dominion
University (ODU). ODU was chosen because its current system is particularly deficient.
The department's method for storing papers is no longer in use. It is not currently updated,
it has never offered any way to search the content, and its update process
failed to encourage faculty to participate.
2 READ PRODUCT DESCRIPTION
READ is an online system that will collect and store information on the publications and
the grants obtained by authors. READ will access the publications from online sources and
obtain the information about the publication including a link to where the actual publication is
stored. The system will also allow for the authors to add their publications into READ manually.
READ will also allow people from outside of the system to access it to see the
publication information that has been stored. People will be able to narrow down publications and
grants through a number of filters such as author, publication date, and keywords. Viewers
will also be able to sort the filtered results by relevance.
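The filtering and sorting described above can be sketched in Python. This is only an illustration under stated assumptions, not the READ implementation: the `Publication` record, the `filter_publications` function, and the keyword-overlap relevance score are all hypothetical names and choices made for this example.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class Publication:
    # Hypothetical record; the field names are assumptions, not the READ schema.
    title: str
    authors: list
    published: date
    keywords: set = field(default_factory=set)


def filter_publications(pubs, author=None, since=None, keywords=None):
    """Keep the publications that match every filter that was supplied."""
    wanted = {k.lower() for k in (keywords or [])}
    results = []
    for pub in pubs:
        if author is not None and author not in pub.authors:
            continue
        if since is not None and pub.published < since:
            continue
        if wanted and not (wanted & {k.lower() for k in pub.keywords}):
            continue
        results.append(pub)
    # Assumed relevance measure: number of requested keywords matched,
    # with newer publications first among ties.
    results.sort(
        key=lambda p: (len(wanted & {k.lower() for k in p.keywords}), p.published),
        reverse=True,
    )
    return results
```

A real deployment would most likely push these filters into the database query rather than filter in memory; the in-memory version is used here only to keep the example self-contained.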
2.1 Key Product Features and Capabilities
READ will include the Schaefer Scraper, an online publication data scraper. The
Schaefer Scraper will search online publication databases for the names of all of the authors that
reside in READ. It will then take the BibTeX information it finds for those publications and store
it in the database.
After the Schaefer Scraper gathers the information, READ will send an email to the
author to whom the publication appears to belong, alerting them that a publication has
been found. The author will log onto the system and either approve the information or correct it.
After the information is corrected, READ will save the changes. In the completed version there
will also be a learning algorithm that will keep known inaccurate results from reappearing
in the database. For instance, if an author has indicated that a paper the Schaefer Scraper found
was not theirs, then the learning algorithm will prevent that result from being gathered again.
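The simplest form of this behavior can be sketched as a lookup of previously rejected results. The sketch below is an assumption for illustration, not the planned learning algorithm: `RejectionMemory` and `normalize_title` are hypothetical names, and matching on a normalized title is a stand-in for whatever matching the finished product would use.

```python
def normalize_title(title):
    """Lower-case a title and collapse punctuation and whitespace for comparison."""
    cleaned = "".join(ch if ch.isalnum() else " " for ch in title.lower())
    return " ".join(cleaned.split())


class RejectionMemory:
    """Remembers (author, title) pairs that an author has marked as not theirs."""

    def __init__(self):
        self._rejected = set()

    def reject(self, author_id, title):
        self._rejected.add((author_id, normalize_title(title)))

    def filter_new(self, author_id, scraped_titles):
        """Return only the scraped titles this author has not rejected before."""
        return [
            t for t in scraped_titles
            if (author_id, normalize_title(t)) not in self._rejected
        ]
```

Keying the memory on the author as well as the title means a paper rejected by one author can still be offered to a different author with a similar name.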
In addition to sending an email to authors, READ will also implement an RSS feed for
confirming publications. The RSS feed will serve two purposes: allowing outside users to see new
publications as they are added, and informing authors of publications found by the scraper if
they subscribe to the RSS feeds. The author will have a choice between being notified by email
or by RSS.
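A feed of newly found publications could be generated along the following lines. This is a minimal sketch using Python's standard `xml.etree.ElementTree` module; the `build_rss` function and the feed description text are assumptions made for the example, and only the basic RSS 2.0 channel and item elements are produced.

```python
import xml.etree.ElementTree as ET


def build_rss(feed_title, feed_link, items):
    """Build an RSS 2.0 document; `items` is a list of (title, link) pairs."""
    rss = ET.Element("rss", version="2.0")
    channel = ET.SubElement(rss, "channel")
    # RSS 2.0 requires a title, link, and description on the channel.
    ET.SubElement(channel, "title").text = feed_title
    ET.SubElement(channel, "link").text = feed_link
    ET.SubElement(channel, "description").text = "New publications found by the scraper"
    for title, link in items:
        item = ET.SubElement(channel, "item")
        ET.SubElement(item, "title").text = title
        ET.SubElement(item, "link").text = link
    return ET.tostring(rss, encoding="unicode")
```

A per-author feed would call the same function with only that author's unconfirmed publications.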
The author will also be able to change the information that is presented on the profile
that is provided to them. They will be able to add a picture, a link to their homepage, contact
information, and personal information. Authors will also be able to upload thumbnails for any
of the publications that belong to them.
Viewers are users who wish to look up publication and grant information in READ, but
are not authors who contribute. Viewers will have limited access to the READ system. However,
viewers will be able to view the content of the system. The viewers will be able to sort and filter
the publications that they queried as well. For each author, there will be a profile page that the
viewers should have access to, featuring a graph to display the publication statistics of the
author.
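The data behind such a statistics graph could be as simple as a count of publications per year. The sketch below assumes this particular statistic; the function name `publications_per_year` is hypothetical, and years with no publications are filled with zero so that a chart has no gaps.

```python
from collections import Counter


def publications_per_year(years, first_year=None, last_year=None):
    """Count publications per year, including zero-count years in the range."""
    counts = Counter(years)
    if not counts:
        return []
    first = first_year if first_year is not None else min(counts)
    last = last_year if last_year is not None else max(counts)
    return [(year, counts.get(year, 0)) for year in range(first, last + 1)]
```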
2.2 Major Components (Hardware/Software)
The READ system will have three major components as shown in Figure 1. The READ
system will have a web-based interface. The system will include a database to hold all of the
information about publications, grants, and authors. The READ system will include the Schaefer
Scraper to gather the relevant publication information.
No user, whether author or viewer, will access the system directly; all users will
instead access it through a web-based graphical user interface (GUI). The interface will be
separated into two sections. The public section will be used by people who wish to view the
information about the publications, and the private section will be used by authors and
administrators who need to change parts of the content. The only parts accessible in the public
section will be the page for searching and filtering the database for grants and publications and
the profile views of the authors.
The publicly viewable profile page will allow the viewer to see any information that the
author has added, along with the author's jQuery Sparklines statistics, but will not allow any
editing without first logging in through the login page. Once logged into the website, authors and administrators
will be able to edit the grant and publication information. The system administrators will be able
to add and remove profiles as authors join or leave the institution. The administrators will also
be able to manually remove any publications or grants that are not supposed to be in the
database for legal reasons. Finally, the web-based interface will implement an RSS feed that will
send any system updates that are important to the user to an RSS reader. This will
allow viewers to see whether any new publications have been found. There will also be RSS feeds
for each author so that a link to any new publications can be sent to them through RSS instead of
just email.
The major parts of the web-based interface will include the grants search, publication
search, login, profile, and authors interfaces. The publication search page and grants search page
will allow for the searching and filtering of publications and grants respectively. The two pages
were separated at the request of the customer. The login page will allow authors and
administrators to log in to the system using a username and password in order to edit publication,
grant, and profile information. The profile page is where authors will edit their profiles and
viewers will be able to see statistics on the authors. Finally, the authors’ page will be a list of the
various authors in the system so that their profiles will be accessible to the public.
The database is the part of the system that will hold all of the information on
publications and grants. It will use MySQL and will be accessed through the web-based interface. The only
time anyone should access the database directly is when it is installed. Part of the database will
also be dedicated to holding information on the authors and the principal investigators (PIs) of the
grants. The database will not necessarily hold the actual content of the papers, only the
information about them. The purpose is not to create a new place to access the research, but to provide a
place where viewers can see what kind of research has been done by the authors at a particular
institution.
Inside the database will be tables that will store data on the authors, the publications, and
the grants. Queries to the tables will be constructed in the web-based interface, using PHP to
create the SQL. Figure 1 shows the schema of the database.
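The shape of such tables and a query against them can be sketched as follows. This is an assumption-laden illustration rather than the READ schema: the table and column names are hypothetical, the grants table is omitted for brevity, Python's built-in `sqlite3` module stands in for MySQL, and Python stands in for the PHP layer. The point of the example is the parameterized query, which keeps user input out of the SQL text.

```python
import sqlite3

# Hypothetical schema; the real READ tables and columns may differ.
SCHEMA = """
CREATE TABLE authors (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    email TEXT
);
CREATE TABLE publications (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    year INTEGER,
    url TEXT
);
CREATE TABLE authorship (
    author_id INTEGER REFERENCES authors(id),
    publication_id INTEGER REFERENCES publications(id),
    PRIMARY KEY (author_id, publication_id)
);
"""


def open_database():
    """Create an in-memory database with the sketch schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn


def publications_by_author(conn, author_name):
    """Return (title, year) rows for one author, newest first."""
    # The ? placeholder passes the name as data, not as SQL text,
    # which guards against SQL injection.
    rows = conn.execute(
        """
        SELECT p.title, p.year
        FROM publications AS p
        JOIN authorship AS ap ON ap.publication_id = p.id
        JOIN authors AS a ON a.id = ap.author_id
        WHERE a.name = ?
        ORDER BY p.year DESC
        """,
        (author_name,),
    )
    return rows.fetchall()
```

A separate authorship table is used because one publication can have several authors and one author can have many publications.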
The Schaefer Scraper will periodically gather information on the publications of the
authors from the Web. It will access specific academic sites that keep publications and find
information pertinent to the authors at the institution that uses it. After this, it will store all of the
information in the database and email the authors so that they can correct, confirm, or remove it. In the
finished product, the Schaefer Scraper will be implemented with a learning algorithm in order to
prevent authors from receiving multiple messages about publications that they do not own.
The Schaefer Scraper will attempt to gather the information in BibTeX format for ease
of conversion. BibTeX is a plain-text format made up of typed entries with named fields, which allows it to be easily parsed and
converted into data for a MySQL database. Another advantage is that many of the sites
that the Schaefer Scraper will gather publication data from already allow
publication data to be exported in BibTeX format.
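To illustrate why BibTeX is convenient to convert, the following sketch parses one simple entry into a dictionary whose keys could map onto database columns. It is not the Schaefer Scraper's parser: `parse_bibtex_entry` is a hypothetical helper, and it deliberately handles only flat entries (no nested braces, string macros, or concatenation).

```python
import re

# "@type{key, ...fields... }" for a single entry.
ENTRY_RE = re.compile(r"@(\w+)\s*\{\s*([^,\s]+)\s*,(.*)\}\s*$", re.DOTALL)
# name = {value} | name = "value" | name = 1234
FIELD_RE = re.compile(r'(\w+)\s*=\s*(?:\{([^{}]*)\}|"([^"]*)"|(\d+))')


def parse_bibtex_entry(text):
    """Parse one simple BibTeX entry into a dict (nested braces unsupported)."""
    match = ENTRY_RE.match(text.strip())
    if match is None:
        raise ValueError("not a BibTeX entry")
    entry_type, cite_key, body = match.groups()
    record = {"type": entry_type.lower(), "key": cite_key}
    for name, braced, quoted, number in FIELD_RE.findall(body):
        value = braced or quoted or number
        # Collapse line breaks and repeated spaces inside a value.
        record[name.lower()] = " ".join(value.split())
    return record
```

For anything beyond a sketch, an established BibTeX parsing library would be a safer choice than hand-written regular expressions.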
The Schaefer Scraper will be run at regular intervals in order to gather this data for the
system. In the finished product, the automatic running of the Schaefer Scraper will be managed
by the customer. The customer will be able to turn off or alter the frequency of runs of the
Schaefer Scraper in order to suit their individual needs.
Figure 1. Major Functional Component Diagram
3 IDENTIFICATION OF CASE STUDY
The case study for READ will be the ODU Computer Science (CS) department.
Currently, the department does not have a working system to distribute publication information
about its faculty. The purpose of the case study will be to remedy this situation.
One important thing to note about the case study is that the department did once have a
system to document faculty publications, but this system fell into disuse. The previous
system had authors submit their publications directly to an administrator, who then manually
entered each paper’s information into the system. Because the process was entirely manual, many
people were either reluctant or too busy to submit information about their publications to the
administrator. Eventually, authors stopped submitting their papers and the system was
abandoned. The system has not been updated since 2008. This is important because it shows that
a new system must minimize the effort required of its users or it will fail.
4 READ PRODUCT PROTOTYPE DESCRIPTION
The READ prototype will be implemented in the Computer Science Department of Old
Dominion University. A prototype is needed because the scope of this project is larger than the
timeframe allotted to create it. Some of the functionality of the READ system must be left out of
the prototype.
The user types specified will be implemented in the READ prototype. The viewer,
author, and administrator will all be included in the prototype. The functions of each of the user
types will remain unchanged. After the prototype has been implemented, an administrator will be
chosen from the faculty or the systems group.
4.1 Prototype Architecture (Hardware/Software)
The prototype will consist of the same three main components as the
finished product, as shown in Figure 2. The prototype will include a basic user interface. It will
also contain an implementation of the database. The Schaefer Scraper will be included to data-mine websites.
The prototype will be implemented on the Computer Science Department’s servers. The
database will be implemented using MySQL. The web-based interface will be implemented in
the prototype using Joomla!. Using this content management system will make author login
easier to implement, because one of the team members working on READ
already has a login method for Computer Science faculty implemented in another project using
Joomla!. All queries to the database will be made through PHP scripts that interface between
MySQL and the web-based interface. An interface between the Schaefer Scraper and the author
information in the database will be written in Python.
Figure 2. Prototype Major Functional Component Diagram
4.2 Prototype Features and Capabilities
The prototype will include many of the features that are planned for the final product, as
defined in Table 1. The prototype will allow viewers to search and filter the database through the
website. It will also allow for minimal user-profile control. An RSS feed and email system will
be implemented so that people can stay informed of what is contained within the database.
Access control will be a priority to prevent unauthorized users from updating author papers. The
Schaefer Scraper will automate much of the process of updating the publication lists. In the
prototype, the Schaefer Scraper will search online for publications by one fourth of the authors
every week.
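One way to cover one fourth of the authors each week is a simple rotation, sketched below. The function `authors_for_week` and the round-robin assignment are assumptions made for illustration; the source does not specify how the prototype chooses which authors to scrape in a given week.

```python
def authors_for_week(author_ids, week_number, groups=4):
    """Return the authors to scrape in the given week.

    Authors are sorted and dealt round-robin into `groups` buckets, so
    every author is covered once every `groups` consecutive weeks.
    """
    ordered = sorted(author_ids)
    bucket = week_number % groups
    return [a for i, a in enumerate(ordered) if i % groups == bucket]
```

Here `week_number` is assumed to be a counter kept by whatever schedules the scraper runs. Changing `groups` would alter how often each author is revisited.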
The prototype will not implement every feature of the finished product. The prototype
will not include a learning algorithm that will make sure an incorrect paper is not resubmitted.
The prototype will also not include any visual representations of the data, such as graphs and
jQuery Sparklines, to display author statistics.
| Features | Real World Project | Prototype |
|---|---|---|
| Browsing Capabilities | Ability to browse all grants and publications | Ability to browse all grants and publications |
| Publication Filtering Capabilities | Filtered by title, publisher, authors, publication date, date added, and keywords. | Filtered by title, publisher, authors, publication date, date added, and keywords. |
| Grant Filtering Capabilities | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. | Filtered by title, funding agency, principal or co-principal investigator, start date, end date, and active state. |
| Add, edit, and delete publications and grants | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. | Included. A thumbnail image and files may be associated with the document. Fields can be automatically filled in using a BibTeX document. |
| Faculty page | Lists faculty and provides a link to each person’s profile page | Not included. |
| Login interface | Linked to Old Dominion University Computer Science accounts | Linked to Old Dominion University Computer Science accounts |
| Profile Page | Displays authors’ profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Displays graphs. | Displays authors’ profile picture, job title, email address, personal webpage link, and the author’s publications and grants. Graphs not included. |
| Scraper | Will update the system with new publications and grants and alert users when one is added to the system under their name. | Will update the system with publications only and alert users when one is added to the system under their name. |
| Prediction algorithm | Predicts if the consumer has enough space to use the READ system. | Not included. |
| Administrative Privileges | Administrators are able to edit, add, or remove anything in the system. | Administrators are able to edit, add, or remove anything in the system. |

Table 1. Key Prototype Features
4.3 Prototype Development Challenges
The primary challenge in the development of the prototype is the implementation of the
Schaefer Scraper. Much of the code is uncommented and difficult to understand. One way of
mitigating this challenge is to contact Andrew Schaefer and ask him to explain the
architecture of his code. Currently, the Black Group does not completely understand the code.
Another challenge in the development of the READ prototype will be publication types
that differ from one another. Some publications may be technical papers, while others may be
academic publications. In BibTeX format, some of the information may be arranged in a
slightly different way depending on what kind of document the publication is.
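As an illustration of this difference, standard BibTeX entry types name the publication venue with different fields: an `article` uses `journal`, an `inproceedings` entry uses `booktitle`, and a `techreport` uses `institution`. The sketch below maps parsed entries onto one common set of columns. The function `to_publication_row`, the output column names, and the choice of a single `venue` column are assumptions for this example, not the READ design.

```python
# Which BibTeX field names the publication venue, per standard entry type.
VENUE_FIELD = {
    "article": "journal",
    "inproceedings": "booktitle",
    "incollection": "booktitle",
    "techreport": "institution",
    "phdthesis": "school",
    "mastersthesis": "school",
    "book": "publisher",
}


def to_publication_row(entry):
    """Map a parsed BibTeX entry (a dict of strings) onto common columns."""
    entry_type = entry.get("type", "misc").lower()
    venue_field = VENUE_FIELD.get(entry_type)
    year_text = entry.get("year", "")
    return {
        "kind": entry_type,
        "title": entry.get("title", ""),
        # BibTeX separates multiple authors with the word "and".
        "authors": [a.strip() for a in entry.get("author", "").split(" and ") if a.strip()],
        "year": int(year_text) if year_text.isdigit() else None,
        "venue": entry.get(venue_field, "") if venue_field else "",
    }
```

Entry types that are not in the mapping still produce a row, with an empty venue, so that unfamiliar document kinds are stored rather than dropped.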
Glossary
Administrator/Administrative User: A user with increased privileges for editing database content.
Author: A person that is able to add and edit publications and grants to the system under their
name.
BibTeX: A plain-text file format for bibliographic reference information, made up of typed entries with named fields.
Computer Science (CS): An academic discipline based on advancing computing theory and
algorithm development, that sometimes includes theory about software engineering
methods.
Client application: In a client/server architecture, the module that takes input and creates queries
to be processed by a server, and receives the results from the server.
Client/Server Architecture: A software engineering paradigm that separates functionality into a
“client” application and a “server” application that interact.
CSS: A style sheet language used to specify the presentation of HTML pages.
Data Mining: The act of going through a source of input to find specific information.
Database Schema: A description of the structure of a database.
Funding Agency: The source of funds for research grants. These organizations usually have a
limited amount of money to award to principal investigators who submit an accepted
application for research funds.
Git: A software system for controlling and organizing software versioning.
GoogleScholar (http://scholar.google.com): A website that stores academic publications.
Graphical User Interface (GUI): A computer interface composed of icons, text fields, menus, etc.
that can be interacted with via a mouse and keyboard, through which a user interacts with
a software application.
Internet scraper: A program that is designed to sort through data that is stored online.
Joomla!: A content management system for designing web interfaces.
jQuery Sparklines: A development library for the visualization of data.
ODU: Old Dominion University.
MicrosoftAcademic (http://academic.research.microsoft.com/): A website that stores information
on academic publications.
MySQL: An open-source relational database management system that uses SQL.
Parse: A technical term usually used to describe the processing of a statement written in a
programming language.
Perl: A widely used programming language on the server-side of web applications.
PHP: A widely used programming language on the server-side of web applications.
Principal Investigator (PI): The primary researcher that a research grant is bestowed upon,
responsible for documenting the work and publishing research results.
Publication or Academic Publication: A document created by a faculty member to share
research. Publications usually appear in academic journals, technical reports, and
records of conference proceedings.
Query: A request sent to the database to either change the database or get back results.
READ: Repository for the Electronic Aggregation of Documents.
RSS: A specification for subscribing to and distributing news.
Scraper: An automated application designed to scan a source of input such as a document or a
website for pertinent information.
Server application: In a client/server architecture, the module that takes queries or requests from
a client module, processes them, and returns the result to the client.
Software Compatibility: A description of whether different software, or versions of software, can
communicate/interact.
SQL: A widely used language for querying and manipulating relational databases.
SQL injection: An attack that inserts malicious SQL into user input in order to perform unauthorized queries on a database.
User Authentication: The process of verifying the access credentials of a user of an automated
system, usually accomplished by requesting a username and password combination.
Viewer: An outside person who wishes to query the information contained in the READ database.
Version Control: A method for organizing and recording different versions of documents that
have been created over time.
Virtual Private Server (VPS): A software version of a hardware server, used to create
independent servers on a single piece of hardware.
Webserver: A group of applications run on a computer or VPS in order to serve webpages and provide
server-side computation for browser-based client applications.
XML: Extensible Markup Language.
REFERENCES
Digest of Education Statistics. 2011. National Center for Education Statistics. Web. 19 Nov
2012.
<http://nces.ed.gov/programs/digest/d11/tables/dt11_001.asp?referrer=report>.
"Where Do Universities Get their Money From?." Free By 50. N.p., 13 2011. Web. 19 Nov
2012.
<http://www.freeby50.com/2011/11/where-do-universities-get-their-money.html>.