Operation War Diary
How we mobilised an army of volunteers
Laura Cowdrey & Steven Hirschorn
28 November 2014

The unit war diaries

The challenge
1.5 million pages of digitised unit war diaries, containing a wealth of information about the British Army on the Western Front

The solution

The task
A two-step process:
1. Classification of pages: only 50% are diary pages
2. Tagging the data: structured with some flexibility for interesting finds

The volunteers
Largely fall into three groups:
1. Zooniverse: experienced crowdsourcers with no interest/background in history
2. TNA/IWM users: knowledgeable researchers with no experience of crowdsourcing
3. Brand new users

Extracting and processing the data
[Figure: a tag plotted by its X and Y co-ordinates on the page]
type: place
location: 16,29
name: Ronssoy

Clustering example

Multiple keying
Each page is tagged by at least 5 users
Advantages of increasing the number of volunteers per page:
• More corroborating evidence for each tag
• More likely that typos can be resolved automatically
• Less likely that information on the page is missed out
Disadvantages:
• Slower progress

Grouping tags by different users
Clustering
• Use X/Y co-ordinates to identify tags from different users that are likely to refer to the same entity
• Check that the tag type is the same
• Check that the value is similar

Define “Similar”
• Easy for those tag types that had discrete values (drop-down selections)
• For fields that allowed free-text entry, use an algorithm that compares two sequences of text for approximate similarity

Establishing consensus
For each cluster of tags, select the majority opinion, if there is one.
“A T Bartstow” “Amuiacke” “Uniacke” “Ziniacke” “A T Barston” “W A T Barstow” “W Barstow” “A T Bastow”

Other challenges

The data
Around 64,000 individuals have been identified so far

Sample of a single person’s record

Number of person mentions by rank

Data visualisation

Any questions?
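
Appendix: a sketch of clustering and consensus
The clustering and consensus steps described in the slides could look roughly like the Python below. This is a minimal sketch and not the project's actual code: the tag fields (x, y, type, value), both thresholds, the greedy grouping strategy and the use of Python's difflib for text similarity are all assumptions made for illustration. The slides do not name the similarity algorithm that was used.

```python
from collections import Counter
from difflib import SequenceMatcher
from math import hypot

# Hypothetical thresholds, chosen only for illustration.
MAX_DISTANCE = 5.0     # how close two tags must be on the page
MIN_SIMILARITY = 0.6   # minimum difflib ratio for free-text values


def similar(a, b):
    """Approximate, case-insensitive text similarity via difflib."""
    ratio = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return ratio >= MIN_SIMILARITY


def same_entity(t1, t2):
    """Apply the three checks: position, tag type, similar value."""
    close = hypot(t1["x"] - t2["x"], t1["y"] - t2["y"]) <= MAX_DISTANCE
    return close and t1["type"] == t2["type"] and similar(t1["value"], t2["value"])


def cluster_tags(tags):
    """Greedy grouping: each tag joins the first cluster whose first
    member it matches, otherwise it starts a new cluster."""
    clusters = []
    for tag in tags:
        for cluster in clusters:
            if same_entity(cluster[0], tag):
                cluster.append(tag)
                break
        else:
            clusters.append([tag])
    return clusters


def consensus(cluster):
    """Return the value chosen by a strict majority, or None if there is none."""
    counts = Counter(tag["value"] for tag in cluster)
    value, votes = counts.most_common(1)[0]
    return value if votes > len(cluster) / 2 else None


if __name__ == "__main__":
    # Invented example data: five users tag the same place, one with a typo.
    tags = [
        {"x": 16, "y": 29, "type": "place", "value": "Ronssoy"},
        {"x": 17, "y": 29, "type": "place", "value": "Ronssoy"},
        {"x": 16, "y": 30, "type": "place", "value": "Ronsoy"},
        {"x": 15, "y": 28, "type": "place", "value": "Ronssoy"},
        {"x": 16, "y": 29, "type": "place", "value": "Ronssoy"},
    ]
    for cluster in cluster_tags(tags):
        print(len(cluster), "tags, consensus:", consensus(cluster))
```

With five users per page, a single typo is outvoted by the four matching entries. A cluster with no strict majority returns None and would need some other form of resolution.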