Selective Memory: Priorities and Realities

Michael Seadle
Berlin School of Library and Information Science
Humboldt-Universität zu Berlin
National Library, Cape Town, South Africa
Digital Preservation
The technology of digital preservation has been
well established since the late 1990s. The
technology is not at issue, though features vary.
LOCKSS (Lots of Copies Keep Stuff Safe, from
Stanford) and Portico (part of Ithaka) are the
main providers internationally.
The basic strategy is to have multiple copies, to
ensure their integrity, and to enable migration or
emulation when necessary.
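The first two parts of this strategy, multiple copies and integrity checks, can be sketched in a few lines of Python. This is only an illustration under simple assumptions: the copies are byte strings held in memory, the site names are made up, and a strict majority vote over SHA-256 digests decides which content is intact. It is not the protocol of LOCKSS, Portico, or any other system.

```python
from __future__ import annotations

import hashlib
from collections import Counter


def digest(data: bytes) -> str:
    """Return the SHA-256 hex digest of one stored copy."""
    return hashlib.sha256(data).hexdigest()


def audit_copies(copies: dict[str, bytes]) -> tuple[str | None, list[str]]:
    """Compare the digests of all copies.

    Returns the digest shared by a strict majority of copies (or None if
    there is no majority) and the sorted names of the sites that disagree.
    """
    if not copies:
        raise ValueError("no copies to audit")
    digests = {site: digest(data) for site, data in copies.items()}
    winner, votes = Counter(digests.values()).most_common(1)[0]
    if votes * 2 <= len(copies):
        # No strict majority: every copy is suspect.
        return None, sorted(digests)
    damaged = sorted(site for site, d in digests.items() if d != winner)
    return winner, damaged


def repair(copies: dict[str, bytes]) -> dict[str, bytes]:
    """Return a new mapping in which damaged copies hold the majority content."""
    winner, damaged = audit_copies(copies)
    if winner is None:
        raise ValueError("no majority; cannot decide which copy is intact")
    good = next(data for data in copies.values() if digest(data) == winner)
    return {site: (good if site in damaged else data)
            for site, data in copies.items()}
```

With three copies of which one has a flipped byte, `audit_copies` names the damaged site and `repair` restores it from the other two. With only two disagreeing copies there is no majority, which is one reason to keep more than two.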
Rooftop garden in Singapore
LOCKSS
LOCKSS is open-source software that can be
implemented in several forms. The Global LOCKSS
Alliance is well known, and Private LOCKSS
Networks (PLNs) are growing rapidly. CLOCKSS
(Controlled LOCKSS) is a PLN variant. The LOCKSS
software is widely used internationally.
Key features of LOCKSS are seven copies, active
integrity checking, bitstream preservation,
migration on the fly, and a high degree of
automation for cost control.
Portico
Portico is a preservation service that grew out of
JSTOR. Portico uses a mix of open source and
proprietary software, and keeps three copies, one
in Princeton, NJ, one in Ann Arbor, Michigan, and
an offline copy in the Netherlands. Like LOCKSS,
Portico is widely used outside the US. Unlike
LOCKSS, the only legal venue for the data is the US.
Key features of Portico include immediate
migration to an XML format and an emphasis on
big publishers, though that is slowly changing.
Problems
The remainder of this lecture looks at two
problems.
One is selection – that is, what to include.
The other is rejection – that is, what is fake and
should not be included without clear labeling.
Selection
The first priority for many institutions in the
1990s was to guarantee that very expensive
digital content would not be lost. This meant an
emphasis on large publishers like Elsevier and
Springer. It also meant an emphasis on
conventional scholarly publishing.
Open access journals, institutional repositories,
journals from developing countries, non-English
content, and gray literature / the World Wide
Web tended to be left out.
Risk
Risk needs to be one of the factors in preservation
planning. In general, large organizations are less at risk
for data loss through negligence than smaller ones,
simply because they tend to have more staff, do more
backups, and recognize the risk of data loss better.
This means that very large publishers like Elsevier and
Springer are less at risk of data loss than very small
publishers. Takeovers are a risk factor that is greater
for small institutions than for large ones. Neglect is also a risk
factor. Webpages are rarely intentionally saved. Even
repositories may be ill-maintained.
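One way to bring these risk factors into preservation planning is to score each content source and rank the results. The sketch below is only an illustration: the `Source` fields restate the factors named above, while the weights and the example sources are hypothetical and would have to be set by each institution.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class Source:
    name: str
    small_organization: bool  # fewer staff, fewer backups
    takeover_exposed: bool    # plausible target of a takeover
    unmaintained: bool        # nobody is actively saving the content


# Hypothetical weights, chosen only for illustration.
WEIGHTS = {"small_organization": 2, "takeover_exposed": 1, "unmaintained": 3}


def risk_score(source: Source) -> int:
    """Sum the weights of every risk factor that applies to the source."""
    return sum(w for factor, w in WEIGHTS.items() if getattr(source, factor))


def rank_by_risk(sources: list[Source]) -> list[Source]:
    """Order sources from most to least at risk."""
    return sorted(sources, key=risk_score, reverse=True)


sources = [
    Source("large publisher", False, False, False),
    Source("small society journal", True, True, False),
    Source("project website", True, False, True),
]
for s in rank_by_risk(sources):
    print(risk_score(s), s.name)
```

Under these assumed weights the unmaintained project website ranks first and the large publisher last, which is the reverse of a ranking by the cost of the content.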
Importance
Basing importance on the cost of content entails
risks. Over time valuations change. Works that sold
well and were once heavily in demand may come
to seem unimportant or even trivial. Not every
scientific publication from the 1920s is key to
contemporary thinking.
At the same time, historians today often take an
interest in material that previous generations
regarded as not worth saving. Gray literature has
become a major theme, and gray literature is
almost by definition content that libraries have not
kept.
Empirical evidence
Judging importance by some form of empirical
evidence other than cost is also possible. Download
statistics are one form of evidence. Citation rates are
another. This form of evidence may not distinguish
what will be seen as important in a century from what
is ignored, but it does balance a ranking based purely
on costs.
A more important piece of evidence may be long-term
citation. A work that is cited and then forgotten within a
few years is different from one that is cited continually.
Historically, what gets used gets saved. If storage
capacity were a problem today, this would matter
more. Nonetheless it plays a role in access.
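The difference between a work that is cited briefly and one that is cited continually can be made concrete with a simple measure. The sketch below assumes yearly citation counts are available as a dictionary; the function name `persistence` and the example figures are invented for illustration and are not drawn from any citation database.

```python
from __future__ import annotations


def persistence(citations_by_year: dict[int, int],
                published: int, until: int) -> float:
    """Fraction of years from `published` through `until` (inclusive)
    in which the work received at least one citation."""
    years = range(published, until + 1)
    if len(years) == 0:
        return 0.0
    active = sum(1 for y in years if citations_by_year.get(y, 0) > 0)
    return active / len(years)


# Two invented works with the same total number of citations (80 each).
flash = {2001: 40, 2002: 35, 2003: 5}
steady = {year: 8 for year in range(2001, 2011)}

print(persistence(flash, 2001, 2010))
print(persistence(steady, 2001, 2010))
```

A raw citation count treats the two works as equal. The persistence measure separates them: the first is cited in three of ten years, the second in all ten.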
Research Integrity
If the principle of long-term preservation is to save the
scholarly (i.e., the “scientific”) record, then it matters
whether the content is reliable. It may be important to
reject some content.
The blog “Retraction Watch” has documented the
following reasons for retractions since August 2010:
Reason                Count   Share
Faked Data              270     13%
Image Manipulation      360     17%
Plagiarism              325     16%
Authorship issues       111      5%
Other retractions       999     49%
Faked Data
Faked data is the most destructive integrity breach,
because it undermines any future research that uses
the publication.
Examples:
• Jim Hunton: Hunton Report & Retraction Watch
• Diederik Stapel: Stapel Investigation & Retraction
Watch
In both cases, other researchers used the faked data.
Faked data is hard to detect, though a new paper by
Markowitz and Hancock (2015), “Linguistic Obfuscation
in Fraudulent Science,” Journal of Language and Social
Psychology, 1–11, claims to be able to detect it.
Image Manipulation
Beautiful pictures help to sell manuscripts.
Modern digital cameras interpret a data stream based
on what the CCD or CMOS chips pass on. The cameras
have software that makes decisions. With Photoshop,
GIMP, or other tools, it is easy to erase and add elements.
Many scholars are unclear about where the boundaries of
acceptable image manipulation lie.
When people used chemical film, they also had to make
decisions, such as about which film manufacturer to
use, how to develop the film and how to crop it. The
temperature of the film and the exposure time
mattered. There was no exact true picture even then,
but the potential for manipulation seems greater today,
and discovery is hard.
Plagiarism
Plagiarism is mainly detected via brute-force
comparisons using tools like iThenticate. Style
changes may be an indicator, but they are by no
means reliable.
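A brute-force comparison of this kind can be illustrated with overlapping word sequences ("shingles"). The sketch below is a generic textbook approach, not a description of how iThenticate works internally; the function names, the three-word window, and the 0.5 threshold are all assumptions made for the example.

```python
from __future__ import annotations

import re


def shingles(text: str, n: int = 3) -> set[tuple[str, ...]]:
    """Return the set of n-word sequences in the text,
    ignoring case and punctuation."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap(candidate: str, source: str, n: int = 3) -> float:
    """Fraction of the candidate's shingles that also occur in the source."""
    cand = shingles(candidate, n)
    if not cand:
        return 0.0
    return len(cand & shingles(source, n)) / len(cand)


def screen(candidate: str, corpus: dict[str, str],
           threshold: float = 0.5) -> list[tuple[str, float]]:
    """Compare the candidate against every document in the corpus
    and return those at or above the threshold, highest overlap first."""
    hits = [(name, overlap(candidate, text)) for name, text in corpus.items()]
    return sorted((h for h in hits if h[1] >= threshold),
                  key=lambda h: h[1], reverse=True)
```

A high overlap only flags a passage for human review. It cannot tell a properly quoted and cited passage from a copied one, which is part of why reviewer judgments still differ.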
Reviewer judgments about what constitutes
plagiarism vary considerably. In a recent case before
a commission, reviewers from the natural sciences
and reviewers from the humanities reached
different judgments about whether any plagiarism
had occurred. While plagiarism is a serious problem, it
may do less damage to the scholarly record than
data falsification or image manipulation.
Conclusion
This lecture has discussed two problems in long-term
preservation.
One is the problem of setting priorities so that
selection is based not primarily on the current costs of
content, but on empirical measures of likely future use.
The other is to preserve real scholarship, and to reject
(unlabeled) faked results. An unknown quantity of what
is being saved today may be of no scholarly value,
because of research integrity problems.
There are no easy solutions, but recognizing the
problems is a first step.