The Trouble with House Elves
Experiments in Computational Folkloristics
Timothy R. Tangherlini

A story…
It was the old counselor from Skårupgård who came riding with four headless horses to Todbjærg church. He always drove out of the northern gate, and there by the gate was a stall, they could never keep that stall door closed. They had a farmhand who closed it once after it had sprung open. But one night, after he'd gone to bed, something came after the farmhand and it lifted his bed straight up to the rafters and crushed him quite hard. Then the farmhand shouted and asked them to stop lifting him up there. "No, you've tormented us, but now you'll die..." I heard that's how two farmhands were crushed to death. He wanted to close the door and then they never tried to close it again.

• Some standard questions:
  – Role of ghosts in late 19th century Denmark?
  – Origins of the story?
  – Structure of the story?
  – Who, what, where of this story?
• Is there a need for a computational folkloristics?

Folklore
• Early history of the discipline:
  – Philological
  – National Romanticism
    • Johann Gottfried Herder (1744-1803)
  – Wilhelm (1786-1859) and Jacob (1785-1863) Grimm
  – Search for original forms

Romantic Nationalism in the Nordic lands
• Asbjørnsen and Moe (Norway)
  – Development of the Norwegian language
• Linnaeus and the rush to categorize (Sweden)
• The ballad, archaeology and Svend Grundtvig (Denmark)
• The Kalevala and folklore as a science (Finland)

Mapping Folklore
• Historic-geographic method
  – Kaarle and Julius Krohn (1906-1924)
    • Focused work on the Finnish epic, Kalevala
    • Led to the type index of folk literature (Antti Aarne)
  – Ripples on a pond theory of folklore diffusion

Maps in the study of culture
• "Geography is not an inert container, is not a box where cultural history 'happens,' but an active force, that pervades the literary field and shapes it in depth. Making the connection between geography and literature explicit ... will allow us to see some significant relationships that have so far escaped us" (Moretti 1998, 3).

A New Historic-Geographic Method
• Folklore as a process:
  – in time and space
  – emerges from the dialectic between individuals and tradition
• Maps can help model relationships between:
  – People
  – Environment
  – Folk Repertoires

Study Corpus
• Evald Tang Kristensen (1843-1929)
  – Actively collected from 1865 to 1923
  – 219 collecting trips
• 6500+ named informants
• 24,000 manuscript pages
• 250,000 published expressions

A multi-level folklore browser
• People
• Places
• Stories

Experiments in mapping
1. Mapping collecting routes
  • Challenge Question (CQ): Did Tang Kristensen's published statements about his collecting accurately reflect his collecting work?
2. Mapping individual repertoire distribution
  • CQ: Does individual mobility influence the range of places mentioned in stories?
  • CQ: Do other informant features, such as gender, influence the range of places mentioned?
3. Mapping by story features against individual repertoire
  • CQ: Are there patterns, à la Moretti, that become apparent in the visualization of stories by repertoire, genre and/or story topic?

Experiment 1: Mapping Collecting Routes
  – ETK presents himself as a West Jutlander
  – Political motivations
    • Aftermath of Napoleonic wars and Danish bankruptcy (1814)
    • Loss of Schleswig to Bismarck (1865)
    • Urbanization
  – Search for "authentic" Danish culture
  – What do the collecting routes reveal?

Experiments 2 & 3: Mapping Repertoire
• Theory: Individual biography influences repertoire and its features
• Hypothesis: Classes of individuals have different degrees of physical mobility, and this is reflected in their storytelling
• Hope: Maps reveal interesting patterns of places mentioned
  – A Caveat: My main interest, and the vast majority of the collection, are based on legends: stories that refract the lived environments and social organization of the tradition participants

Experiment 2: Place Name Distribution and Mobility
  – Target: repertoires of 5 storytellers
  – Limit: only stories that mention places
  – Method
    • Plot place names mentioned by storyteller
    • Calculate Standard Deviation Ellipse distribution patterns for places mentioned in storyteller repertoires
    • Look for patterns in the underlying place name distribution

Experiment 3: Can unsupervised learning on text help in pattern discovery?

Experiment 3: Unsupervised learning and Repertoire clusters
  – Target: repertoires of 5 storytellers
  – Limit: only stories that mention places
  – Method
    • Convert stories to TF-IDF vector representations
    • Force dimensionality reduction using SVD
    • Cluster: ECM by storyteller, eliminating small clusters
    • Project results into GIS
    • Calculate distribution ellipses for each cluster in each person's repertoire

A Crisis…
• Maps were informative since new patterns in the geographic distribution of stories were discovered…
  – why hadn't I known about these patterns before?
• What other types of patterns, some very small, some very large, are lurking in the data?
• How can I be sure that my selection of examples is representative or even accurate?

A Classic Folklore Problem
• Classification in folklore
  – 1 text = 1 classifier
• What happens when the classifier was designed for a different research problem?
• Are we missing patterns that are not solely related to single topic classifiers?
• Are we missing stories in our searches because of these single topic classifiers?
• Does this limit our ability to work with a large archive?

Current folklore classifiers are very expensive

A lost story…
It was the old counselor from Skårupgård who came riding with four headless horses to Todbjærg church. He always drove out of the northern gate, and there by the gate was a stall, they could never keep that stall door closed. They had a farmhand who closed it once after it had sprung open. But one night, after he'd gone to bed, something came after the farmhand and it lifted his bed straight up to the rafters and crushed him quite hard. Then the farmhand shouted and asked them to stop lifting him up there. "No, you've tormented us, but now you'll die..." I heard that's how two farmhands were crushed to death. He wanted to close the door and then they never tried to close it again.

Networks to the rescue?
• Folklore as traditional communication across social networks
  – Folklore networks
    • Social networks of tradition participants
    • Networks of scholars and collectors
    • Networks of stories
  – External networks
    • Communications networks
    • Transportation networks
    • Affiliation networks
  – Internal networks
    • Linguistic networks

Connecting the dots…
[Figure: network diagram linking nodes labeled P1-P8, S1-S4 and I1-I3]

Storyteller networks
• Local networks
  – Connect all storytellers in a given parish
  – Connect all storytellers in a family
• Fieldtrip networks
  – Connect all storytellers on a given fieldtrip
• Collector-Storyteller networks
  – Connect all storytellers to all collectors with whom they worked
• Inferred / Affiliation networks
  – Connect storytellers by work groups (e.g. millers, fiddlers, etc.)
  – Connect storytellers by other affiliations (e.g. gender, age, education)

Story networks
• Connect stories to:
  – People:
    • storytellers
    • people mentioned
  – Places
    • places collected
    • places mentioned
  – Each other
    • By shared indexing
    • By shared keyword (keyword extraction)
    • By shared topic (topic modeling using LDA)
    • By shallow ontology (tango index)

An initial graph of the ETK study corpus

Lost in a thicket of stories, keywords, etc.

Folklore Spaghetti

Graph clustering
• Use a tuned version of MCL clustering for graphs
  – iteratively generates stochastic matrices, also known as Markov matrices (van Dongen 2000)
  – 2973 nodes / 52663 edges

Structure emerges and the graph becomes useful

Remember our ghost story?
• DS IV 650
• Classified as a story about manor lords, not ghosts!
• Impossible to find in the archive
• Can I use networks to find this story?
• Will it help me find other stories of interest?

Almost all the surrounding stories are cataloged as ghost stories!

DS II B 147 is a story of interest: not a ghost story, but strongly connected to DS IV 650…

DS II B 147
• A story about a house elf at a farm in Egå...
• Ends as follows:
  – When they got home, the farmhand was happy because now he'd gotten something to use for feed, and afterward nis could go and feed just as much as he wanted to. Then they got another farmhand, and he didn't want to let him go on like that. But he got lifted up in his bed and all the way up to the rafters, so he lay there dead when people got up the next morning.

The trouble with house elves…
• You can't always find them…
• They act in unpredictable ways…
• The things they do turn out to be pretty mean and nasty

New research question
• What is the relationship between ghosts and house elves in 19th century Denmark and why might there be such a relationship?

Some tentative conclusions

Directions for future work
• Labeling
  – Can we automatically label nodes given a sparsely labeled graph? (LDA-G, Homophily algorithms)
• Anomaly detection / Community detection
  – Can we automatically find "stories of interest" on our graph?
• Multimodal networks
  – Integrate network information from several networks
• Dynamic networks
  – Understanding how a network changes over time
• Geographic visualization of network models

A Very Special Thanks to
• IPAM
  – Peter Jones
  – Mark Green
  – Russ Caflisch
• Colleagues and friends from Search Engines 2007

Additional thanks to
• Peter Broadwell, UCLA
• James Abello, DIMACS
• Tina Eliassi-Rad, LLNL/Rutgers
• Nischal Devanur, Rutgers
• UCLA's Center for Digital Humanities

Funded by…
  – Nordic Council of Ministers
  – The American Council of Learned Societies
  – NSF EAGER Grant IIS-0970179
    • With Lancaster (ECAI), Buckland (ECAI), Eliassi-Rad (Rutgers) and Faloutsos (CMU)
  – Google Books Humanities grant
  – Many ideas derived from
    • NEH Institute for Advanced Topics in Digital Humanities, "Networks and Network Analysis for the Humanities" (NEH HT5001609)
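
Supplementary sketch: graph clustering with MCL
The graph clustering step mentioned earlier in the deck (MCL, van Dongen 2000) alternates two operations on a column-stochastic matrix: expansion (squaring the matrix, i.e. taking two-step random walks) and inflation (raising entries to a power and renormalizing, which strengthens strong flows and starves weak ones). The following is a minimal pure-Python sketch of the basic algorithm, not the tuned version used in the talk. The function names, the parameter defaults and the seven-edge toy graph are all assumptions made for illustration; the toy graph stands in for a story network and is not drawn from the ETK corpus.

```python
"""Basic Markov Cluster (MCL) sketch in pure Python; illustrative only."""


def normalize_columns(matrix):
    """Scale every column so it sums to 1 (column-stochastic matrix)."""
    n = len(matrix)
    out = [row[:] for row in matrix]
    for j in range(n):
        total = sum(matrix[i][j] for i in range(n))
        if total > 0:
            for i in range(n):
                out[i][j] = matrix[i][j] / total
    return out


def expand(matrix):
    """Expansion step: square the matrix (two-step random walks)."""
    n = len(matrix)
    return [[sum(matrix[i][k] * matrix[k][j] for k in range(n))
             for j in range(n)] for i in range(n)]


def mcl(edges, n_nodes, inflation=2.0, max_iter=100, tol=1e-9):
    """Cluster an undirected graph; returns sorted lists of node ids."""
    m = [[0.0] * n_nodes for _ in range(n_nodes)]
    for u, v in edges:
        m[u][v] = m[v][u] = 1.0
    for i in range(n_nodes):
        m[i][i] = 1.0  # self-loops, a common choice in MCL implementations
    m = normalize_columns(m)

    for _ in range(max_iter):
        # Inflation step: entrywise power of the expanded matrix,
        # followed by column renormalization.
        inflated = normalize_columns(
            [[x ** inflation for x in row] for row in expand(m)])
        delta = max(abs(a - b)
                    for row_new, row_old in zip(inflated, m)
                    for a, b in zip(row_new, row_old))
        m = inflated
        if delta < tol:
            break

    # Rows that keep non-negligible mass are "attractors"; the columns
    # they attract form one cluster each.
    clusters = {frozenset(j for j, x in enumerate(row) if x > 1e-6)
                for row in m}
    return sorted(sorted(c) for c in clusters if c)


# Hypothetical toy graph: two triangles of stories joined by one weak link.
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 4), (3, 5), (4, 5), (2, 3)]
print(mcl(toy_edges, 6))
```

On this toy graph the single bridging edge (2, 3) carries too little flow to survive inflation, so the two triangles separate. Raising the `inflation` parameter generally yields smaller, tighter clusters; a graph the size of the one in the talk (2973 nodes / 52663 edges) would call for a sparse-matrix implementation rather than nested lists.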