Selective Memory: Priorities and Realities

Michael Seadle
Berlin School of Library and Information Science
Humboldt-Universität zu Berlin

National Library, Cape Town, South Africa

Digital Preservation

The technology of digital preservation has been well established since the late 1990s. The technology itself is not at issue, though features vary between systems. LOCKSS (Lots of Copies Keep Stuff Safe, from Stanford) and Portico (part of Ithaka) are the main providers internationally. The basic strategy is to keep multiple copies, to ensure their integrity, and to enable migration or emulation when necessary.

LOCKSS

LOCKSS is open-source software that can be implemented in several forms. The Global LOCKSS Alliance is well known, and Private LOCKSS Networks (PLNs) are growing rapidly. CLOCKSS (Controlled LOCKSS) is a PLN variant. The LOCKSS software is widely used internationally. Key features of LOCKSS are: 7 copies, active integrity checking, bitstream preservation, migration on the fly, and a high degree of automation to control costs.

Portico

Portico is a preservation service that grew out of JSTOR. It uses a mix of open-source and proprietary software and keeps three copies: one in Princeton, New Jersey, one in Ann Arbor, Michigan, and an offline copy in the Netherlands. Like LOCKSS, Portico is widely used outside the US. Unlike LOCKSS, the only legal venue for its data is the US. Key features of Portico include immediate migration to an XML format and an emphasis on big publishers, though that emphasis is slowly changing.

Problems

The remainder of this lecture looks at two problems. One is selection: what to include. The other is rejection: what is fake and should not be included without clear labeling.

Selection

The first priority for many institutions in the 1990s was to guarantee that very expensive digital content would not be lost. This meant an emphasis on large publishers like Elsevier and Springer.
It also meant an emphasis on conventional scholarly publishing. Open access journals, institutional repositories, journals from developing countries, non-English content, and gray literature or World Wide Web content tended to be left out.

Risk

Risk needs to be one of the factors in preservation planning. In general, large organizations are less at risk of data loss through negligence than smaller ones, simply because they tend to have more staff, do more backups, and recognize the risk of data loss better. This means that very large publishers like Elsevier and Springer are less at risk of data loss than very small publishers. Takeovers are a risk factor, and the risk is greater for small institutions than for large ones. Neglect is also a risk factor. Web pages are rarely saved intentionally. Even repositories may be poorly maintained.

Importance

Basing importance on the cost of content entails risks. Valuations change over time. Works that sold well and were once heavily in demand may come to seem unimportant or even trivial. Not every scientific publication from the 1920s is key to contemporary thinking. At the same time, historians today often take an interest in material that previous generations regarded as not worth saving. Gray literature has become a major theme, and gray literature is almost by definition content that libraries have not kept.

Empirical evidence

Judging importance by some form of empirical evidence other than cost is also possible. Download statistics are one form of evidence. Citation rates are another. This form of evidence may not distinguish what will be seen as important in a century from what will be ignored, but it does balance a ranking based purely on cost. A more important piece of evidence may be long-term citation. A work that is cited and then forgotten within a few years is different from one that is cited continually. Historically, what gets used gets saved. If storage capacity were a problem today, this would matter more.
Nonetheless it plays a role in access.

Research Integrity

If the principle of long-term preservation is to save the scholarly (i.e. the “scientific”) record, then it matters whether the content is reliable. It may be important to reject some content. The blog “Retraction Watch” has documented the following reasons for retractions since August 2010:

Reason                 Count   Share
Faked Data               270     13%
Image Manipulation       360     17%
Plagiarism               325     16%
Authorship issues        111      5%
Other retractions        999     49%

Faked Data

Faked data is the most destructive integrity breach, because it undermines any future research that uses the publication. Examples:

• Jim Hunton: Hunton Report & Retraction Watch
• Diederik Stapel: Stapel Investigation & Retraction Watch

Both cases involved other researchers who used the faked data. Faked data is hard to detect, though a new paper by Markowitz and Hancock (2015), “Linguistic Obfuscation in Fraudulent Science,” Journal of Language and Social Psychology, 1–11, claims to be able to detect it.

Image Manipulation

Beautiful pictures help to sell manuscripts. Modern digital cameras interpret a data stream based on what the CCD or CMOS chips pass on. The cameras have software that makes decisions. With Photoshop, GIMP, or other tools it is easy to erase and add elements. Many scholars are unclear about where the boundaries of acceptable image manipulation lie. When people used chemical film, they also had to make decisions, such as which film manufacturer to use, how to develop the film, and how to crop it. The temperature of the film and the exposure time mattered. There was no exact true picture even then, but the potential for manipulation seems greater today, and discovery is hard.

Plagiarism

Plagiarism is mainly detected via brute-force comparisons using tools like iThenticate. Style changes may be an indicator, but they are by no means reliable. Reviewer judgments about what constitutes plagiarism vary considerably.
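As a rough illustration of what a brute-force comparison involves, the sketch below breaks two texts into overlapping three-word sequences and reports the share of the suspect text's sequences that also appear in the source. It is a toy example under stated assumptions, not a description of how iThenticate works: the function names `shingles` and `overlap`, the three-word window, and the sample sentences are all hypothetical choices made here for illustration.

```python
import re


def shingles(text, n=3):
    """Return the set of overlapping word n-grams in the text (lowercased)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(suspect, source, n=3):
    """Fraction of the suspect text's n-grams that also occur in the source."""
    suspect_grams = shingles(suspect, n)
    source_grams = shingles(source, n)
    if not suspect_grams:
        return 0.0  # text shorter than n words: nothing to compare
    return len(suspect_grams & source_grams) / len(suspect_grams)


# Hypothetical sample texts, invented for this illustration.
source = ("The technology of digital preservation has been "
          "well established since the late 1990s.")
copied = ("As is known, the technology of digital preservation has been "
          "well established for years.")
unrelated = "Beautiful pictures help to sell manuscripts to journal editors."

print("copied vs source:   ", round(overlap(copied, source), 2))
print("unrelated vs source:", round(overlap(unrelated, source), 2))
```

A high score only flags a passage for a human reader. Whether the overlap constitutes plagiarism remains a matter of judgment.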
In a recent case before a commission, reviewers from the natural sciences and reviewers from the humanities reached different judgments about whether any plagiarism had occurred. While plagiarism is a serious problem, it may do less damage to the scholarly record than data falsification or image manipulation.

Conclusion

This lecture has discussed two problems in long-term preservation. One is the problem of setting priorities so that selection is based not primarily on the current cost of content, but on empirical measures of likely future use. The other is to preserve real scholarship and to reject (unlabeled) faked results. An unknown quantity of what is being saved today may be of no scholarly value because of research integrity problems. There are no easy solutions, but recognizing the problems is a first step.