SERIALS REVIEW , VOL. , NOS. –, – https://doi.org/./..

Using EZproxy and Google Analytics to Evaluate Electronic Serials Usage

H. Jamane Yeager
Carol Grotnes Belk Library, Elon University, Elon, North Carolina, USA

ABSTRACT
The purpose of the study is to evaluate electronic usage statistics for Belk Library at Elon University. The Electronic Resources Office (ERO) collects usage data for all serials (both print and electronic), ebooks, and electronic databases. This study used ezPAARSE and Google Analytics to analyze data from our EZproxy logs and website, and we will compare these data with our vendor statistics to get a more complete picture of how the library’s electronic resources are being used.

KEYWORDS
electronic usage statistics; ezPAARSE; Google Analytics; key performance indicators; web analytics

Introduction

The Electronic Resources Office (ERO) at the Belk Library at Elon University has been collecting statistical data from vendors for many years. Because the data came from different vendors, they arrived in many different formats. Before Counting Online Usage of Networked Electronic Resources (COUNTER) became an industry standard for electronic resource usage, Belk Library’s statistics could be difficult to interpret, especially when deciding whether to keep a database or e-package. COUNTER has brought some continuity to the statistics we collect, and now that most statistical reports are COUNTER compliant, we receive them in a single format. The library, and especially the ERO, uses COUNTER statistics as much as possible, together with cost-per-use (CPU) data, to help determine whether to renew databases and e-resources. The ERO spends over half of the library’s budget on electronic resources, including electronic serials, e-books, and other electronic databases.
Because the library is trying to acquire resources for all of Elon University’s curricula while staying within its projected budgets for electronic resources, the ERO has to show a return on investment (ROI) and demonstrate value that the dean and university librarian can share with campus administrators. It wasn’t long before the ERO realized that the data collected so far weren’t answering some very important questions.

Most Belk Library librarians begin instruction using the library’s home page and research guides created with Springshare’s LibGuides product. The librarians have created research guides (LibGuides) for each of their respective liaison departments and sometimes for specific classes taught by individual librarians. The librarians can see from the LibGuides statistics that their research guides are being used, but questions about our users remained unanswered. The ERO wanted to know what paths patrons were using to get to research guides, books and journals, and other library databases and resources.

Coughlin, Campbell, and Jansen (2016) suggest that “it is important for libraries to track data internally and to compare these data with data from publishers not only to ensure the accuracy of data that publishers provide but also because any major differences could signal a problem in a service the libraries provide” (as cited in Duy & Vaughan, 2006, pp. 512–517). As a result, it was obvious to the ERO that collecting search statistics from vendors wasn’t enough anymore. There were more data to be collected and analyzed, and these new data would be useful for the ERO to report to our librarians and to campus administration.

This study asked several questions:

1. How are users accessing our electronic resources?
2. How do users access journal articles in our collection?
3. When and where do students access books in the WorldCat Discovery Service (OCLC)?
4. Are the users located in the library or outside of the library?
5. Are our users accessing the library portal from mobile devices, computers, or through searching on Google Scholar?

CONTACT H. Jamane Yeager [email protected] Belk Library, Campus Box, Elon, NC .
Color versions of one or more of the figures in the article can be found online at www.tandfonline.com/usrv.
Published with license by Taylor & Francis © H. Jamane Yeager.

It was time to find answers to these questions. The ERO has recently begun collecting EZproxy logs and Google Analytics data to answer questions that database vendors’ statistics could not. At the same time the ERO was investigating statistics and data from WorldCat Discovery (OCLC).

This study began with the electronic access librarian and the systems librarian collecting the EZproxy data logs and Google Analytics data from August 1, 2016, to March 11, 2017. We were interested in analyzing the data by semesters, so we broke the data down by the following dates: September 1, 2016, to December 31, 2016 (fall semester); January 1, 2017, to January 31, 2017 (winter term); and February 1, 2017, to March 17, 2017 (spring semester). Because the EZproxy server stopped recording data between March 11, 2017, and March 27, 2017, we decided to use data only through March 11, 2017. We began data collection on August 1, 2016, because that was the date that Belk Library switched over to WorldCat Discovery Service (OCLC).

Tools used

The systems librarian used a variety of tools to make the data smaller and more manageable. All of the following tools were used or tried for this project.

• Bash scripting (http://ryanstutorials.net/bashscripting-tutorial/bash-script.php): A Bash script is a plain text file that contains a series of commands.
• MacVim (https://www.macupdate.com/app/mac/25988/macvim): MacVim is a highly configurable text editor built to enable efficient text editing. It is a graphical user interface (GUI) for Vim made for Macs.
• Regex or regular expressions (http://www.regularexpressions.info/): A regular expression (regex or regexp for short) is a special text string for describing a search pattern. You can think of regular expressions as wildcards on steroids.
• ezPAARSE (https://ezPAARSE.readthedocs.io/en/master/): ezPAARSE is able to mine, analyze, and enrich the locally collected logs generated by reverse proxies (EZproxy, Biblio PAM, Squid, Apache), which record access to academic and scientific publishers’ platforms.
• OpenRefine (formerly Google Refine) (http://openrefine.org/): OpenRefine is a free, open source power tool for working with messy data.
• Tableau (https://www.tableau.com/): Tableau produces interactive data visualizations.

Our first action was to get access to the EZproxy logs. The systems librarian had many conversations with the Information Technology (IT) department to find out how to access these logs, since the library does not administer the EZproxy server. We did discover that the server is hosted locally. Because we needed access to the full EZproxy logs for the given dates, IT gave us a web page with links to each of the past eight months’ logs. Once the systems librarian had this information, the logs were downloaded in a one-click operation with DownThemAll (DTA), a download manager/accelerator extension for the Mozilla Firefox web browser. Figure 1 shows an example of the downloaded EZproxy logs viewed in DownThemAll. In the process of downloading the logs, we discovered they were Apache Web Server logs. We also discovered just how large these log files were: 1GB for logs spanning August 2016 to March 2017.

[Figure: EZproxy logs with DownThemAll.]
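Because the downloaded files were Apache-style web server logs, each line can be split into named fields with a regular expression before any further analysis. The following is a minimal sketch in Python, assuming the NCSA Common Log Format (host, identity, user, timestamp, request, status, bytes). The study does not state the actual log format configured on Elon's EZproxy server, so the field layout, the sample line, and the function name `parse_line` are hypothetical.

```python
import re

# Assumed layout: NCSA Common Log Format. The real EZproxy LogFormat used
# in the study is not given in the article, so this pattern is illustrative.
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<url>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\d+|-)'
)


def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Invented sample line, for illustration only.
sample = ('10.0.0.5 - jdoe [01/Sep/2016:10:15:32 -0400] '
          '"GET http://example.org/article.pdf HTTP/1.1" 200 5120')
record = parse_line(sample)
print(record["user"], record["url"], record["status"])
```

Lines that do not fit the assumed pattern return `None`, which makes it easy to count and inspect anything the pattern fails to cover before trusting the parsed results.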
The systems librarian had to determine what was in the logs that made them so large and then how to make them smaller and more manageable. He learned that the proxy server was capturing everything: every style sheet, every JPEG, every WOFF (web font), and other file formats. Once he determined what was being captured through the server, the EZproxy settings were reconfigured to filter out some of the unnecessary file formats in the future. This was noise that we didn’t need. See Figure 2.

[Figure: File formats to exclude from EZproxy.]

We also observed that every time someone accessed a resource without authenticating, the server logged the attempt as quotation marks. These entries were easy to identify and remove. This left us with Elon user names and affiliations with the university: student, faculty, staff, and community users. Because we did not want to identify individual users, the systems librarian used a hash/salt generator to strip the identifying information and replace it with randomly created passwords.

As mentioned earlier, the systems librarian used several tools to make the data smaller and more manageable. He attempted to combine the log files into one file, hoping this would reduce redundancy. The combined file was too large to work with as plain text, so another tool was used. MacVim proved the most versatile tool for dealing with such large amounts of data. The 1GB log of data loaded faster in MacVim than in a bash script. Regex was used to clear the false positives in the data. These were the quotation marks that EZproxy had inserted in the logs when users tried to access resources without authenticating. See Figure 3.

[Figure: Regex script.]

One last task was getting rid of the excess file formats that were making the log data so large. These file formats included the JPG, GIF, CSS, and other formats on the web pages. They were eliminated by using regex. Originally the script was written to include all of the file formats that needed to be removed. Figure 4 shows this script. This script removed some of the formats, but not all at once. The systems librarian had to redo the script for the individual formats that were left in the log until they were all gone. Luckily this process worked quickly.

[Figure: Regex script to remove file formats.]

ezPAARSE

Once the data were cleaned up, it was finally time to mine them using ezPAARSE. The cleaned log files were imported into ezPAARSE (https://github.com/expaarseproject/). The syntax was determined, and the logs were parsed. There are a few caveats to using ezPAARSE. The first is that a lot of the documentation is in French. The second is that ezPAARSE did not recognize one of our heavily used vendors, ProQuest. It is worth noting that the ezPAARSE team responded to questions quickly via email.

As mentioned earlier, all user identification has been removed from the logs. Figure 5 shows what the logs look like now that identifications have been disguised. ezPAARSE mines and extracts several fields of information from the logs and visualizes this information in an Excel workbook both graphically and numerically. Figure 6 shows a sampling of journal titles extracted from the EZproxy logs, with the number of times each journal was accessed. There were 2,412 journal titles accessed during this study’s time period.

Vendor platform was another category that ezPAARSE extracted from the logs. As mentioned earlier, ProQuest databases are not on the list of platforms filtered by ezPAARSE, although we know from our database vendor statistics that ProQuest should be either number two or number three in ranking, since Elon undergraduates use ProQuest databases heavily. See Figure 7. There are several more categories that ezPAARSE extracts, such as consultations, domain consultations, date consultation, type of consultation, and platform.
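The log-cleaning steps described above (dropping requests for style sheets, images, and fonts, discarding unauthenticated entries, and disguising user names) can be sketched in Python. This is an illustration under stated assumptions, not the regex script shown in the article's figures. The extension list, the assumption that the user name is the third whitespace-separated field and the URL the seventh, the markers used for unauthenticated requests, and the function names are all hypothetical. A salted SHA-256 digest stands in here for the hash/salt generator the study mentions.

```python
import hashlib
import re
import secrets

# Assumed static-asset extensions; the study names CSS, JPEG/JPG, GIF, and WOFF.
STATIC_ASSET = re.compile(
    r'\.(?:css|jpe?g|gif|png|woff2?|ico|js)(?:\?\S*)?$', re.IGNORECASE
)

# Assumed markers for a request made without authenticating.
UNAUTHENTICATED = {'-', '""', ''}


def pseudonymize(username, salt):
    """Replace a user name with a truncated, salted SHA-256 digest."""
    digest = hashlib.sha256(salt + username.encode('utf-8')).hexdigest()
    return digest[:16]


def clean_lines(lines, salt):
    """Yield log lines with static assets and unauthenticated hits removed
    and the user field (assumed to be the third field) pseudonymized."""
    for line in lines:
        fields = line.split()
        if len(fields) < 7:
            continue  # too few fields to be a request line
        user, url = fields[2], fields[6]
        if user in UNAUTHENTICATED or STATIC_ASSET.search(url):
            continue
        fields[2] = pseudonymize(user, salt)
        yield ' '.join(fields)


# Invented sample lines, for illustration only.
salt = secrets.token_bytes(16)  # random salt, kept apart from the cleaned logs
sample = [
    '10.0.0.5 - jdoe [01/Sep/2016:10:15:32 -0400] "GET http://example.org/article.pdf HTTP/1.1" 200 5120',
    '10.0.0.5 - jdoe [01/Sep/2016:10:15:33 -0400] "GET http://example.org/style.css HTTP/1.1" 200 312',
    '10.0.0.9 - - [01/Sep/2016:10:16:01 -0400] "GET http://example.org/article.pdf HTTP/1.1" 302 0',
]
cleaned = list(clean_lines(sample, salt))
print(len(cleaned))
```

Using one salt for a whole run keeps repeated visits by the same user linked under a single pseudonym, so per-user counts survive the cleaning. Handling all extensions in a single pattern is one way to avoid rerunning a script once per leftover format.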
The data are much easier to read once they have been filtered, categorized, and displayed in Excel spreadsheets.

[Figure: Users authentication removed.]
[Figure: Journals accessed with ISSN.]
[Figure: Platform accessed by users.]

Google Analytics

The next part of the analysis was looking at the library website using Google Analytics (GA). GA was not originally created for libraries but was instead created for the commercial sector. However, libraries can use some of the ratios and counts from the commerce industry to their benefit. Key performance indicators (KPIs) are used by the commerce sector to determine how effective their virtual spaces are in achieving business goals, and the library profession should use them as well. According to Fagan (2014), libraries should learn which of the KPIs might best be adapted to the library environment. We needed to explore web analytics to determine our KPIs.

[Figure: Google Analytics audience overview.]

To use Google Analytics, we had to learn some web analytics terms and definitions. Farney and McHale (2014) suggest that “You may find some discrepancies between the terminology and definitions that we use and the terminology that your tool uses. Yet most analytics tools include a help section that defines the terms they use” (p. 15). See definitions in the appendix.

The key performance indicators that Fagan (2014) suggests libraries look at are bounce rates, visits, page views, duration of visit, and pages per visit. Fagan states that “A library subject guide should have a very high average visit the week or two after an instruction class is taught in that subject” (p. 27). Fagan suggests that the most useful KPIs for the web team to monitor regularly for the research databases portal seem to be visits, selections per page view, average time on page, and visit depth.
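The KPIs named above are simple ratios and counts, and a small worked example may make the definitions concrete. The sketch below is a minimal illustration in Python, not how Google Analytics computes its reports. It assumes each session is a list of (page, seconds on page) pairs and treats a bounce as a session with exactly one page view. The session data and the function name `kpis` are invented for illustration.

```python
def kpis(sessions):
    """Compute basic web-analytics KPIs from a list of sessions.

    Each session is a list of (page_path, seconds_on_page) tuples.
    A bounce is assumed to be a session with exactly one page view.
    """
    total_sessions = len(sessions)
    if total_sessions == 0:
        raise ValueError("at least one session is required")
    page_views = sum(len(s) for s in sessions)
    bounces = sum(1 for s in sessions if len(s) == 1)
    total_seconds = sum(seconds for s in sessions for _, seconds in s)
    return {
        "sessions": total_sessions,
        "page_views": page_views,
        "pages_per_session": page_views / total_sessions,
        "bounce_rate": bounces / total_sessions,
        "avg_session_duration": total_seconds / total_sessions,
    }


# Hypothetical sessions, invented for illustration only.
sessions = [
    [("/", 10)],
    [("/", 20), ("/guides", 40)],
    [("/", 5), ("/databases", 30), ("/guides", 25)],
    [("/", 10)],
]
result = kpis(sessions)
print(result)
```

In this invented sample, two of the four sessions view only the home page, so the bounce rate is 0.5. A visitor who clicks straight from the home page to a resource on another domain would be counted the same way, even though the visit succeeded.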
Using Fagan’s suggested summary of KPIs, we used Google Analytics to analyze the library website. Google Analytics allowed us to visualize the data and make them more readable. The KPIs covered in the Google Analytics audience overview are sessions, page views, average session duration, and bounce rate (see Figure 8). When we first looked at the bounce rate, our concern was that it was high. After reading Fagan (2014), we discovered a possible reason for the high bounce rate. Fagan’s (2014) study found the following:

“The bounce rate for a database in an academic library may not be useful because there may be many resources listed directly on the portal home page. In such a case, the bounce rate may be high even if the portal is successful. If a page required users to click a subject or alphabetical link before seeing individual databases, the bounce rate for the portal’s home page should be lower” (p. 31).

Another explanation of high bounce rates for our website came from Vecchione, Brown, Allen, and Baschnagel (2016). They suggest that:

“Bounce rate and drop-offs both indicate the same behavior. Bounce rates can be different from drop-offs as a user can drop-off after looking at several pages, but a bounce rate is primarily measured after the user looks at one page and subsequently drops off. Google Analytics automatically measures the bounce rate of any web site, but for libraries this is a difficult metric to understand in context because library web sites also serve as portals to other sites” (p. 164).

This is what the electronic access librarian and the systems librarian believe is happening to our website. As an example of understanding the pathways that users navigate on the Belk Library website, Vecchione et al. (2016) suggest that if users leave the main library website to search for a book, they utilize the discovery layer, WorldCat Discovery Service (OCLC). This discovery layer is not tracked by the library’s instance of Google Analytics. The search therefore appears as a drop-off, which indicates “where users left the flow” (Google, 2017). We think our drop-offs occur when a student moves from the library website to the research guides or to the databases-by-title page. We are not yet tracking with Google Analytics what users do after they leave the library’s home page.

Another KPI worth noting is sessions. The KPI sessions chart shows which days of the month had the most and the fewest sessions; the busiest days typically fall around midterms and finals. The lowest number of sessions is in January, during Elon’s three-week-long winter term. Figure 8 also shows sessions.

Figure 9 shows page views in more detail. This KPI can be very useful because it shows which pages are being used and how much time is spent on them.

[Figure: Google Analytics page views.]

Figure 10 is the comparison between fall break 2016 and spring break 2017. The number of users, page views, page/session, and average page duration are similar for both spring and fall break. These are all benchmarks that the electronic access librarian will use in the future to supplement our existing database vendor statistics.

[Figure: Google Analytics fall and spring break comparison.]

Conclusions

ezPAARSE and Google Analytics are both good tools to help us better understand more of the statistical data we gather every month. ezPAARSE made interpreting the large amount of server logs less challenging. Looking at 1GB of data seems like a daunting task, but working with all of these tools helped us break the data down into manageable bits of information. Google Analytics showed us the user’s path, which is what we were looking for, but also gave us much more.
Google Analytics displayed graphically the information from the same date ranges that we used in ezPAARSE. The metrics that Google Analytics counts are definitely important to the Electronic Resources Office. Using both ezPAARSE and Google Analytics along with our database statistics will help the ERO analyze electronic resources better.

The electronic access librarian and the systems librarian learned from using all these tools that no tool will do everything. Through this study we realized that web analytics tools and methods of tracking data change frequently. Cleaning out excess file formats is necessary to keep the size of the data as small as possible. We also discovered that Google Analytics is not tracking our WorldCat Discovery Service at this time. It is also not tracking the library’s research guides, which creates a high bounce rate.

In the future the electronic access librarian, with help from the systems librarian, needs to set internal benchmarks for the data to be collected. We need to compare different ranges of data, such as day, week, month, or year. The electronic access librarian, along with the systems librarian, also needs to set up event tracking on the library website to track users going to research guides and to other areas of the website. The Electronic Resources Office will continue to compare the EZproxy and Google Analytics data to the vendor statistics to get a more complete picture of how the library’s resources are being used.

Acknowledgments

A special thanks to Jerry Waller, Systems Librarian at Elon, for all the support and assistance with this project.

References

Coughlin, D. M., Campbell, M. C., & Jansen, B. J. (2016). A web analytics approach for appraising electronic resources in academic libraries. Journal of the Association for Information Science and Technology, 67(3), 518–534. https://doi.org/10.1002/asi.23407

Duy, J., & Vaughan, L. (2006). Can electronic journal usage data replace citation data as a measure of journal use? An empirical examination. Journal of Academic Librarianship, 32(5), 512–517.

Fagan, J. C. (2014). The suitability of web analytics key performance indicators in the academic library environment. The Journal of Academic Librarianship, 40(1), 25–34. https://doi.org/10.1016/j.acalib.2013.06.005

Farney, T., & McHale, N. (2014). Library technology reports: Web analytics strategies for information professionals (4). New York, US: ALA Tech Source. Retrieved from https://www.ebrary.com

Google. (2017). Google Analytics help center. Retrieved from https://support.google.com/analytics

Vecchione, A., Brown, D., Allen, E., & Baschnagel, A. (2016). Tracking user behavior with Google Analytics events on an academic library web site. Journal of Web Librarianship, 10(3), 161–175. https://doi.org/10.1080/19322909.2016.1175330

Appendix: Google Analytics definitions

Visits/Sessions: A visit is technically any web-accessible device that interacted with the website during a particular time frame.

Unique Visitors: Any visit from a specific IP address would increment the Unique Visitors count only once, while leaving the website and coming back again would increment the Visits count as many times as the user leaves and returns on that day.

Page Views: An instance of a page being loaded (or reloaded) in a browser. Total number of pages viewed.

Page/Session (Average Page Depth): The average number of pages viewed during a session.

Bounce Rate: The number of bounces divided by the number of visits to a website, for 5 seconds or less.

Content: Areas of library websites that include instructional web pages. A common conceptual goal is to increase learning, which is demonstrated in web analytics by visitors who stay on the page longer.

Pages Per Visit: Page depth; averages the number of web pages visitors view before exiting the site.

Duration of visit: Referred to as time on site. Records the amount of time a visitor is on a website or an individual web page.

Sources: Compiled from Farney and McHale (2014): Library Technology Reports: Web Analytics Strategies for Information Professionals and Fagan (2014): “The Suitability of Web Analytics Key Performance Indicators in the Academic Library Environment.”

Copyright of Serials Review is the property of Routledge and its content may not be copied or emailed to multiple sites or posted to a listserv without the copyright holder's express written permission. However, users may print, download, or email articles for individual use.