Operation War Diary: How we mobilised an army of volunteers

Laura Cowdrey & Steven Hirschorn
28 November 2014
The unit war diaries
The challenge
1.5 million pages of digitised unit war diaries, containing a wealth of information
about the British Army on the Western Front
The solution
The task
A two-step process:
1. Classification of pages: only 50% are diary pages
2. Tagging the data: structured with some flexibility for interesting finds
The volunteers
Largely fall into three groups:
1. Zooniverse: experienced crowdsourcers with no interest/background in history
2. TNA/IWM users: knowledgeable researchers with no experience of crowdsourcing
3. Brand new users
Extracting and processing the data
Example tag:
type: place
location: 16,29 (X and Y co-ordinates)
name: Ronssoy
Clustering example
Multiple keying
Each page is tagged by at least 5 users
Advantages of increasing the number of volunteers per page:
• More corroborating evidence for each tag
• More likely that typos can be resolved automatically
• Less likely that information on the page is missed out
Disadvantages:
• Slower progress
Grouping tags by different users
Clustering
Use the X/Y co-ordinates to identify tags from different users that are likely to refer to
the same entity
Check that the tag type is the same
Check that the value is similar
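The first two checks above can be sketched in Python as follows. This is a minimal illustration, not the project's actual code: the `Tag` class, the `cluster_tags` function, the greedy grouping strategy and the `max_distance` threshold are all assumptions, and every co-ordinate other than 16,29 is made up. The value comparison is left out here.

```python
from dataclasses import dataclass
from math import hypot


@dataclass
class Tag:
    # Hypothetical record for one volunteer's tag; field names are assumptions.
    user: str
    type: str
    x: float
    y: float
    value: str


def cluster_tags(tags, max_distance=5.0):
    """Greedily group tags of the same type that sit close together on a page.

    max_distance is an assumed threshold in the same units as the co-ordinates.
    """
    clusters = []
    for tag in tags:
        for cluster in clusters:
            anchor = cluster[0]  # compare against the first tag in the cluster
            close = hypot(tag.x - anchor.x, tag.y - anchor.y) <= max_distance
            if close and tag.type == anchor.type:
                cluster.append(tag)
                break
        else:
            # No existing cluster matched, so this tag starts a new one.
            clusters.append([tag])
    return clusters


tags = [
    Tag("user_a", "place", 16, 29, "Ronssoy"),
    Tag("user_b", "place", 17, 30, "Ronssoy"),   # near user_a, same type
    Tag("user_c", "person", 16, 29, "Uniacke"),  # same spot, different type
    Tag("user_d", "place", 60, 80, "Ronssoy"),   # same type, far away
]
clusters = cluster_tags(tags)
```

With this sample input, only the tags from user_a and user_b end up in the same cluster; the other two each form their own.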
Define “Similar”
Easy for those tag types that had discrete values (drop-down selections)
For fields that allowed free-text entry, use an algorithm that compares two sequences of
text for approximate similarity.
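One possible way to test free-text values for approximate similarity is shown below. The slides do not name the algorithm used, so `difflib.SequenceMatcher` from the Python standard library is a stand-in here, and the `is_similar` helper and its 0.8 threshold are assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def is_similar(a, b, threshold=0.8):
    """Return True if two free-text values are approximately the same.

    Case and surrounding whitespace are ignored; threshold is an assumed cut-off
    on SequenceMatcher's similarity ratio (0.0 to 1.0).
    """
    ratio = SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio()
    return ratio >= threshold


print(is_similar("A T Barstow", "A T Barston"))  # True: one letter differs
print(is_similar("Uniacke", "A T Barstow"))      # False: different names
```

A lower threshold merges more spelling variants but also risks grouping genuinely different names, so the cut-off would need tuning against real tags.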
Establishing consensus
For each cluster of tags, select the majority opinion, if there is one.
“A T Bartstow”
“Amuiacke”
“Uniacke”
“Ziniacke”
“A T Barston”
“W A T Barstow”
“W Barstow”
“A T Bastow”
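Selecting the majority opinion for a cluster might look like the sketch below. It assumes "majority" means more than half of the tags in the cluster, which the slides do not specify; the `consensus` function is hypothetical, and the first sample list is invented to show a case where a majority exists.

```python
from collections import Counter


def consensus(values):
    """Return the value given by more than half of the taggers, else None."""
    if not values:
        return None
    value, count = Counter(values).most_common(1)[0]
    return value if count > len(values) / 2 else None


# Hypothetical cluster where three of five volunteers agree.
print(consensus(["Uniacke", "Uniacke", "Uniacke", "Amuiacke", "Ziniacke"]))  # Uniacke

# Three different readings of the same name: no majority, so no consensus.
print(consensus(["Amuiacke", "Uniacke", "Ziniacke"]))  # None
```

Exact-match counting like this only works once near-identical spellings have been reconciled; otherwise each variant counts as a separate opinion.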
Other challenges
The data
Around 64,000 individuals have been identified so far
Sample of a single person’s record
Number of person mentions by rank
Data visualisation
Any questions?