SERIALS REVIEW
, VOL. , NOS. –, –
https://doi.org/./..
Using EZproxy and Google Analytics to Evaluate Electronic Serials Usage
H. Jamane Yeager
Carol Grotnes Belk Library, Elon University, Elon, North Carolina, USA
ABSTRACT
The purpose of the study is to evaluate electronic usage statistics for Belk Library at Elon University.
The Electronic Resources Office (ERO) collects usage data for all serials (both print and electronic),
e-books, and electronic databases. This study used ezPAARSE and Google Analytics to analyze the data
from our website, and we will compare these data with our vendor statistics to get a more complete
picture of how the library’s electronic resources are being used.
KEYWORDS
electronic usage statistics; ezPAARSE; Google Analytics; key performance indicators; web analytics
Introduction
The Electronic Resources Office (ERO) at the Belk Library
at Elon University has been collecting statistical data
from vendors for many years. Because the data came
from different vendors, it arrived in many different
formats. Before Counting Online Usage of Networked
Electronic Resources (COUNTER) became an industry standard for electronic resource usage, Belk Library’s
statistics could be difficult to interpret, especially when
deciding whether to keep a database or e-package.
COUNTER has brought some continuity to the
statistics we collect and has also helped us get most of
the statistical reports in a single format now that they
are COUNTER compliant. The library, and especially the
ERO, uses COUNTER statistics as much as possible and
Cost-Per-Use (CPU) data to help us determine whether
or not to renew databases and e-resources. The Electronic
Resources Office spends over half of the library’s budget on electronic resources including electronic serials,
e-books and other electronic databases. Keeping in mind
that the library is trying to acquire resources for all of
Elon University’s curricula and stay within the library’s
projected budgets for electronic resources, the ERO has
to show a return on investment (ROI) and be able to
demonstrate value that the dean and university librarian
can share with campus administrators.
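As a small illustration of the cost-per-use calculation mentioned above, the sketch below divides a resource's annual cost by its COUNTER usage count. The function name and the figures in the usage lines are hypothetical and are not taken from Belk Library's data.

```python
def cost_per_use(annual_cost, uses):
    """Return cost per use, or None when there was no recorded use.

    annual_cost: subscription cost for the period (hypothetical input)
    uses: COUNTER-reported uses for the same period (hypothetical input)
    """
    if uses <= 0:
        # Avoid dividing by zero; an unused resource has no meaningful CPU.
        return None
    return annual_cost / uses


# Hypothetical example values, for illustration only.
example_cpu = cost_per_use(5000.0, 2500)
unused_cpu = cost_per_use(1200.0, 0)
```

A lower cost per use generally supports renewal, but the threshold for what counts as acceptable is a local decision.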
It wasn’t long before the ERO realized that
the data collected so far wasn’t answering some very
important questions. Most Belk Library librarians
begin instruction using the library’s home page and
research guides created using Springshare’s LibGuides
product. The librarians have created library research
guides (LibGuides) for each of their respective liaison
departments and sometimes for specific classes being
taught by individual librarians. The librarians can see
from the LibGuides statistics that their research guides
are being used, but there were still unanswered
questions about our users. The ERO wanted to
know what paths patrons were using to get to research
guides, books and journals, and other library databases
and resources.
Coughlin, Campbell, and Jansen (2016) suggest that
“it is important for libraries to track data internally and
to compare these data with data from publishers not only
to ensure the accuracy of data that publishers provide
but also because any major differences could signal a
problem in a service the libraries provide” (as cited in
Duy & Vaughan, 2006, pp. 512–517). As a result, it was
obvious to the ERO that collecting search statistics from
vendors wasn’t enough anymore. There was more information and more data to be collected and analyzed. This
new data would be useful for the ERO to report to our
librarians and to campus administration. This study
asked several questions:
1. How are users accessing our electronic resources?
2. How do users access journal articles in our
collection?
3. When and where do students access books in the
WorldCat Discovery Service (OCLC)?
4. Are the users located in the library or outside of
the library?
5. Are our users accessing the library portal from
mobile devices, computers, or through searching
on Google Scholar?
CONTACT H. Jamane Yeager
jyeager@elon.edu
Belk Library,  Campus Box, Elon, NC .
Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/usrv.
Published with license by Taylor & Francis
© H. Jamane Yeager.
Figure . EZproxy logs with DownThemAll.
It was time to find answers to these questions.
The ERO has recently begun collecting EZproxy logs
and Google Analytics data to answer questions that
database vendors’ statistics could not. At the same
time the ERO was investigating statistics and data from
WorldCat Discovery (OCLC).
This study began with the electronic access librarian
and the systems librarian collecting the EZproxy data
logs and Google Analytics data from August 1, 2016, to
March 11, 2017. We were interested in analyzing the data
by semesters, so we broke the data down by the following dates: September 1, 2016, to December 31, 2016 (fall
semester); January 1, 2017, to January 31, 2017 (winter
term); and February 1, 2017, to March 17, 2017 (spring
semester). Because the EZproxy server stopped recording data between March 11, 2017, and March 27, 2017,
we decided to use data only through March 11, 2017.
Data collection began on August 1, 2016, because that
was the date that Belk Library switched
over to WorldCat Discovery Service (OCLC).
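The semester breakdown described above can be sketched as a simple date lookup. The term boundaries and the March 11, 2017, cutoff come from the text; the names `TERMS`, `DATA_CUTOFF`, and `term_for` are illustrative assumptions, as is the choice to return `None` for dates outside any term (for example, August 2016).

```python
from datetime import date

# Term boundaries as stated in the article.
TERMS = [
    ("fall semester", date(2016, 9, 1), date(2016, 12, 31)),
    ("winter term", date(2017, 1, 1), date(2017, 1, 31)),
    ("spring semester", date(2017, 2, 1), date(2017, 3, 17)),
]

# The EZproxy server stopped recording after this date, so later data is ignored.
DATA_CUTOFF = date(2017, 3, 11)


def term_for(day):
    """Return the term name for a date, or None if it falls outside the study data."""
    if day > DATA_CUTOFF:
        return None
    for name, start, end in TERMS:
        if start <= day <= end:
            return name
    return None
```

Grouping log entries or Google Analytics exports by the value returned from `term_for` gives one bucket per term for comparison.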
Tools used
The systems librarian used a variety of tools to make the
data smaller and more manageable. All of the following tools
were used or tried for this project.
• Bash scripting (http://ryanstutorials.net/bash-scripting-tutorial/bash-script.php): A Bash script is
a plain text file that contains a series of commands.
• MacVim (https://www.macupdate.com/app/mac/25988/macvim): MacVim is a highly configurable
text editor built to enable efficient text editing.
It is a graphical user interface (GUI) for
Vim made for Macs.
• Regex or regular expressions (http://www.regular-expressions.info/): A regular expression (regex or
regexp for short) is a special text string for describing
a search pattern. You can think of regular expressions
as wildcards on steroids.
• ezPAARSE (https://ezPAARSE.readthedocs.io/en/master/): ezPAARSE is able to mine, analyze, and
enrich the locally collected logs generated by reverse
proxies (EZproxy, Biblio PAM, Squid, Apache),
which record access to academic and scientific
publishers’ platforms.
• OpenRefine (formerly Google Refine) (http://openrefine.org/): OpenRefine is a free, open source
power tool for working with messy data.
• Tableau (https://www.tableau.com/): Tableau produces interactive data visualizations.
Our first action was to get access to the EZproxy
logs. The systems librarian had many conversations
with the Information Technology (IT) department to
find out how to access these logs, since the library
does not administer the EZproxy server. We did discover that the server is hosted locally. Because we
needed access to the full EZproxy logs for the given
dates, IT gave us a web page with links to each of
the past eight months’ logs. Once the systems librarian had this information, the logs were downloaded
in a single operation using DownThemAll (DTA),
a download manager/accelerator extension for
the Mozilla Firefox web browser. Figure 1 shows an
example of the downloaded EZproxy logs viewed in
DownThemAll.
In the process of downloading the logs, we discovered
they were Apache Web Server logs. We also discovered
just how large these log files were: 1 GB for logs spanning August 2016 to March 2017. The systems librarian
had to determine what was in the logs that made them so
large and then how to make them smaller and more manageable. He learned that the proxy server was capturing
everything: every style sheet, every JPEG, every WOFF
(web font), and other file formats. Once he determined
what was being captured through the server, the EZproxy
settings were reconfigured to filter out some of the unnecessary file formats in the future. This was noise that we
didn’t need. See Figure 2.
Figure 2. File formats to exclude from EZproxy.
We also observed that every time someone accessed a
resource without authenticating, the server collected this
information as quotation marks. These were easy to identify and remove. This left us with Elon user names and
affiliations with the university: student, faculty, staff, and
community users. Because we did not want to identify
individual users, the systems librarian used a hash/salt
generator to remove the identifying user names and replace them with randomly created passwords.
As was mentioned earlier, several tools were used by
the systems librarian to make the data smaller and more
manageable. The systems librarian attempted to combine
the log files into one file, hoping this would reduce redundancy. The combined file was too large to handle as plain text, so another tool
was used. MacVim proved the most versatile tool for dealing with such large amounts of data. The 1 GB log of data
loaded faster in MacVim than in a bash script. Regex was
used to clear the false positives in the data. These were
the quotation marks that EZproxy had inserted in the logs
when users tried to access resources without authenticating. See Figure 3.
Figure 3. Regex script.
One last task was getting rid of the excess file formats
that were making the log data so large. These file formats
included the JPG, GIF, CSS, and other formats on the web
pages. They were eliminated by using Regex. Originally
the script was written to include all of the file formats that
needed to be removed. Figure 4 shows this script.
Figure 4. Regex script to remove file formats.
This script removed some of the formats, but not all at
once. The systems librarian had to redo the script for the
individual formats that were left in the log until they were
all gone. Luckily this process worked quickly.
ezPAARSE
Once the data were cleaned up, it was finally time to
mine it using ezPAARSE. The cleaned log files were
imported into ezPAARSE (https://github.com/ezpaarse-project/). Once the log format syntax was determined, the logs
were parsed. There are a few caveats to using ezPAARSE.
The first is that a lot of the documentation is in French.
The second is that ezPAARSE did not recognize one
of our heavily used vendors, ProQuest. It is worth noting
that the ezPAARSE team responded to questions quickly via email.
As was mentioned earlier, all user identification has been removed from the logs. Figure 5 shows what the logs
look like now that identifications have been disguised.
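One way to disguise user names while keeping each user's activity grouped together is a keyed (salted) hash. The sketch below is an assumption about how such a step could look, not the hash/salt generator the systems librarian actually used. The names `make_pseudonymizer` and `pseudonymize` are hypothetical, and truncating the digest to 16 characters is an arbitrary choice.

```python
import hashlib
import hmac
import secrets


def make_pseudonymizer(salt):
    """Build a function that maps a user name to a stable, non-identifying token."""

    def pseudonymize(username):
        # HMAC-SHA256 keyed with the secret salt; the same name always yields the same token.
        digest = hmac.new(salt, username.encode("utf-8"), hashlib.sha256).hexdigest()
        return digest[:16]

    return pseudonymize


# The salt must stay secret and be reused for the whole data set,
# otherwise one user's sessions would map to different tokens.
salt = secrets.token_bytes(16)
pseudonymize = make_pseudonymizer(salt)
token = pseudonymize("jdoe")  # "jdoe" is a made-up user name
```

Because the salt is random and not stored with the logs, the tokens cannot easily be reversed, yet counts per user or per affiliation remain possible.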
ezPAARSE mines and extracts several fields of information from the logs and visualizes this information in
an Excel workbook both graphically and numerically.
Figure 6 shows a sampling of journal titles extracted from
the EZproxy logs showing the number of times certain
journals were accessed.
There were 2,412 journal titles accessed during this
study’s time period. Vendor platform was another category that ezPAARSE extracted from the logs. As mentioned earlier, ProQuest databases are not on the list of
platforms filtered by ezPAARSE, although we know from
our database vendor statistics that ProQuest should be
either number two or number three in ranking, since
Elon undergraduates use ProQuest databases heavily.
See Figure 7.
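The tallies by journal title and by platform can also be reproduced outside Excel. The sketch below assumes the ezPAARSE results were exported as semicolon-delimited CSV with columns named `publication_title`, `print_identifier`, and `platform`; these names and the sample rows are assumptions and should be checked against the actual export. The helper `tally` is hypothetical.

```python
import csv
import io
from collections import Counter

# Made-up rows standing in for an ezPAARSE result file.
SAMPLE = """publication_title;print_identifier;platform
Journal A;1234-5678;platform-x
Journal B;2345-6789;platform-y
Journal A;1234-5678;platform-x
"""


def tally(csv_text, field, delimiter=";"):
    """Count how many rows carry each value of the given column."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=delimiter)
    return Counter(row[field] for row in reader if row.get(field))


title_counts = tally(SAMPLE, "publication_title")
platform_counts = tally(SAMPLE, "platform")
```

Comparing `platform_counts` with vendor-supplied statistics is one way to spot a platform, such as ProQuest in this study, that the parser did not recognize.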
There are several more categories that ezPAARSE
extracts, such as consultations, domain consultations,
date of consultation, type of consultation, and platform. The
data are much easier to read once they have been filtered,
categorized, and displayed in Excel spreadsheets.
Figure 5. User authentication removed.
Figure 6. Journals accessed with ISSN.
Figure 7. Platform accessed by users.
Google Analytics
The next part of the analysis was looking at the library
website using Google Analytics (GA). GA was not
originally created for libraries but was instead created
for the commercial sector. However, there are ratios and
counts from the commerce industry that libraries can use
to their benefit. Key Performance Indicators (KPIs) are
used by the commerce sector to determine how effective
their virtual spaces are in achieving business goals. The
library profession should use KPIs just as the commerce
industry does. According to
Fagan (2014), libraries should learn which of the KPIs
might best be adapted to the library environment. We
needed to explore web analytics to determine our KPIs.
Figure 8. Google Analytics audience overview.
To use Google Analytics, we had to learn some web
analytics terms and definitions. Farney and McHale
(2014) suggest that “You may find some discrepancies
between the terminology and definitions that we use and
the terminology that your tool uses. Yet most analytics
tools include a help section that defines the terms they
use” (p. 15). See definitions in the appendix.
The key performance indicators that Fagan (2014) suggests libraries look at are bounce rates, visits, page views,
duration of visit, and pages per visit. Fagan states that “A
library subject guide should have a very high average visit
the week or two after an instruction class is taught in that
subject” (p. 27). Fagan suggests that the most useful KPIs
for the web team to monitor regularly for the research
databases portal seem to be visits, selections per page view,
average time on page, and visit depth. Using Fagan’s
suggested summary of KPIs, we used Google Analytics
to analyze the library website. Google Analytics allowed
us to visualize the data and make the data more readable.
The KPIs covered in the Google Analytics audience overview
are sessions, page views, average session duration, and
bounce rate. When we first looked at the bounce rate, our
concern was that it was high. After reading Fagan (2014),
we discovered a possible reason for the high bounce rate.
See Figure 8.
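To make these KPIs concrete, the sketch below computes them from a list of sessions. It assumes the usual Google Analytics convention that a bounce is a session with a single page view, so the bounce rate is single-page sessions divided by all sessions. The `Session` class, the `summarize` function, and the sample values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Session:
    page_views: int
    duration_seconds: float


def summarize(sessions):
    """Compute the overview KPIs for a list of sessions."""
    total = len(sessions)
    if total == 0:
        raise ValueError("no sessions to summarize")
    page_views = sum(s.page_views for s in sessions)
    # A bounce is counted here as a session with exactly one page view.
    bounces = sum(1 for s in sessions if s.page_views == 1)
    return {
        "sessions": total,
        "page_views": page_views,
        "pages_per_session": page_views / total,
        "avg_session_duration": sum(s.duration_seconds for s in sessions) / total,
        "bounce_rate": bounces / total,
    }


# Made-up sessions: two bounces and two multi-page visits.
sample = [Session(1, 0.0), Session(4, 180.0), Session(1, 0.0), Session(2, 60.0)]
kpis = summarize(sample)
```

On a portal page whose main purpose is to send users elsewhere, many successful visits are single-page sessions, which is why a high value for `bounce_rate` is not necessarily a sign of failure.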
Fagan’s (2014) study found the following:
The bounce rate for a database in an academic library
may not be useful because there may be many resources
listed directly on the portal home page. In such a case, the
bounce rate may be high even if the portal is successful.
If a page required users to click a subject or alphabetical
link before seeing individual databases, the bounce rate
for the portal’s home page should be lower (p. 31).
Another explanation of the high bounce rates for our website came from Vecchione, Brown, Allen, and Baschnagel
(2016). They suggest that:
Bounce rate and drop-offs both indicate the same behavior. Bounce rates can be different from drop-offs as a
user can drop-off after looking at several pages, but a
bounce rate is primarily measured after the user looks at
one page and subsequently drops off. Google Analytics
automatically measures the bounce rate of any web site,
but for libraries this is a difficult metric to understand in
context because library web sites also serve as portals to
other sites. (p. 164)
Figure 9. Google Analytics page views.
Figure 10. Google Analytics fall and spring break comparison.
This is what the electronic access librarian and the
systems librarian believe is happening to our website.
As an example of understanding the pathways that users
navigate to the Belk Library website, Vecchione et al.
(2016) suggest that if users leave the main library website to search for a book, they utilize the discovery layer,
WorldCat Discovery Service (OCLC). This discovery
layer is not tracked by the library’s instance of Google
Analytics. This search appears as a drop-off, which indicates “where users left the flow” (Google, 2017). We think
our drop-offs are occurring when a student moves from
the library website to the research guides or to the
database-by-title page. We are not yet tracking that event
with Google Analytics after users leave the library’s home page. Another KPI worth noting is sessions. The
sessions chart shows which days
of the month had the most and the fewest sessions; the busiest days typically fall around midterms and finals. The lowest number of
sessions is in January, during Elon’s three-week-long winter term. See Figure 8, which also shows sessions.
Figure 9 shows page views in more detail. This KPI can
be very useful because it lets one see what pages are being
used and how much time is spent on the pages.
Figure 10 is the comparison between fall break 2016
and spring break 2017. The number of users, page views,
page/session, and average page duration are similar for
both spring and fall break. These are all benchmarks that
the electronic access librarian will use in the future to supplement our existing database vendor statistics.
Conclusions
ezPAARSE and Google Analytics are both good tools
for helping us better understand the statistical data
we gather every month. ezPAARSE made interpreting the
large volume of server logs less challenging. Working
through 1 GB of data looks like a daunting task,
but these tools helped us break the
data down into manageable bits of information. Google
Analytics showed us users’ paths, which is what
we were looking for, but also gave us much more. Google
Analytics displayed graphically the information from the
same date ranges that we used in ezPAARSE. The metrics
that Google Analytics counts are definitely important to
the Electronic Resources Office. Using both ezPAARSE
and Google Analytics along with our database statistics
will help the ERO analyze electronic resources better.
The electronic access librarian and the systems librarian
learned from using all these tools that no tool will do
everything. Through this study we realized that web
analytics tools or methods of tracking data change frequently. Cleaning out excess file formats is necessary to
keep the size of the data as small as possible. We also
discovered that Google Analytics is not tracking our
WorldCat Discovery Service at this time. It
is also not tracking the library’s research guides, which
creates a high bounce rate.
In the future the electronic access librarian, with help
from the systems librarian, needs to set internal benchmarks for the data to be collected. We need to compare
different ranges of data, such as day, week, month, or year.
The electronic access librarian, along with the systems
librarian, also needs to set up event tracking on the
library website to track users going to research guides and
to other areas of the website. The Electronic Resources
Office will continue to compare the EZproxy and Google
Analytics data to the vendor statistics to get a more complete picture of how the library’s resources are being used.
Acknowledgments
A special thanks to Jerry Waller, Systems Librarian at Elon, for
all the support and assistance with this project.
References
Coughlin, D. M., Campbell, M. C., & Jansen, B. J. (2016).
A web analytics approach for appraising electronic
resources in academic libraries. Journal of the Association
for Information Science and Technology, 67(3), 518–534.
https://doi.org/10.1002/asi.23407
Duy, J., & Vaughan, L. (2006). Can electronic journal usage data
replace citation data as a measure of journal use? An empirical examination. The Journal of Academic Librarianship, 32(5),
512–517.
Fagan, J. C. (2014). The suitability of web analytics key
performance indicators in the academic library environment. The Journal of Academic Librarianship, 40(1), 25–34.
https://doi.org/10.1016/j.acalib.2013.06.005
Farney, T., & McHale, N. (2014). Library technology reports:
Web analytics strategies for information professionals
(4). New York, US: ALA Tech Source. Retrieved from
https://www.ebrary.com
Google. (2017). Google Analytics help center. Retrieved from
https://support.google.com/analytics
Vecchione, A., Brown, D., Allen, E., & Baschnagel, A. (2016).
Tracking user behavior with Google Analytics events on
an academic library web site. Journal of Web Librarianship, 10(3), 161–175.
https://doi.org/10.1080/19322909.2016.1175330
Appendix: Google Analytics definitions
Visits/Sessions—A visit is technically any web-accessible
device that interacted with the website during a particular
time frame.
Unique Visitors—Any visit from a specific IP address
would increment the Unique Visitors count only once,
while going out of the website and going back in again
would increment the Visits count as many times as the
user goes back and in on that day.
Page Views—An instance of a page being loaded
(or reloaded) in a browser. Total number of pages viewed.
Page/Session (Average Page Depth)—The average
number of pages viewed during a session.
Bounce Rate—The number of bounces divided by the
number of visits to a website, for 5 seconds or less.
Content—Areas of library websites that include
instructional web pages. A common conceptual goal is to
increase learning, which is demonstrated in web analytics
by visitors who stay on the page longer.
Pages Per Visit—Page depth, averages the number of web pages visitors view before exiting the
site.
Duration of visit—Referred to as time on site. Records
the amount of time a visitor is on a website or an individual web page.
Sources: Compiled from Farney and McHale (2014):
Library Technology Reports: Web Analytics Strategies for
Information Professionals and Fagan (2014): “The Suitability of Web Analytics Key Performance Indicators in
the Academic Library Environment.”