Supply and Demand Analysis in NDLTD Based on Patron Specialty and Contents Statistics
The 9th International Symposium on Electronic Theses and Dissertations
Quebec City, Quebec, Canada, June 7-10, 2006
Seonho Kim, Seungwon Yang, Edward A. Fox
Digital Library Research Laboratory, Virginia Tech
Blacksburg, VA 24061 USA

Overview
• Purpose of Study
• Data Set (ETDs, patrons, queries)
• Our Approach
• Data Analysis
• Conclusions and Future Work

Purpose of Study
• Analysis of ETD Subjects (supply)
• Analysis of Users, Queries (demand)
• Comparison of the Above Two
• Users' Years of Experience in Their Field
• Distribution of Date Stamp of ETDs

Data Set - ETDs
• Up-to-date Union archive harvested from Online Computer Library Center (OCLC)
• Using OAI/ODL Harvester [2] by Hussein Suleman
• Total of 242,688 records

Example – ETD Metadata
<dc oai_dc="" dc="" xsi="" schemaLocation="">
  <title>Composer-Centered Computer-Aided Soundtrack Composition</title>
  <creator>Vane, Roland Edwin</creator>
  <subject>Computer Science</subject>
  <subject>human computer interaction</subject>
  <subject>music composition</subject>
  <subject>soundtracks</subject>
  <subject>creativity</subject>
  <description>For as long as computers have been around, people have looked for ways to involve them in music….</description>
  <publisher>University of Waterloo</publisher>
  <date>2006</date>
  <type>Electronic Thesis or Dissertation</type>
  <format>application/pdf</format>
  <identifier>http://etd.u</identifier>
  <language>en</language>
  <rights>Copyright: 2006

Patrons, Queries
• User Profile Data (Oct. 2005 – May 2006)
• Online User Survey [3] as part of User Modeling study
• Data from a total of 1,100 users, including:
  – Majors, specialties, years of experience, and demographic information
  – Queries and detailed research interests

User Profile Form

Example – User Profile Data
<user>
  <userID>shk</userID>
  <email></email>
  <name><first>Sh</first> <last>King</last></name>
  <major>CS</major>
  <broadresearch>Digital Library
    <specific>User interface</specific>
    <experience>8,2</experience>
  </broadresearch>
  <group />
  <query>
    <item freq="79">digital library</item>
    <item freq="33">computer science</item>
    <item freq="25">virginia tech</item>
    <item freq="9">artificial intelligence</item>
    <item freq="5">digital library.</item>
  </query>
  <selected>
    <item freq="15">Digital Library</item>
    <item freq="6">Electronic Theses and Dissertations</item>
  </selected>
  <proposed>
    <item freq="80">Digital Library</item>
    <item freq="65">Data</item>
  </proposed>
</user>

Categorization of Academic Subjects
• Built Our Own Classification Categories
• Based on Colleges / Faculties in
  - Virginia Tech, University of Virginia, George Mason University, VCU, and Virginia State University
• Identified
  - 7 categories and 77 subcategories
  - Word patterns for each subcategory

Categorization of Academic Subjects
• 7 categories and selected 77 subcategories

   Category                                      Selected subcategories
1  Architecture and Design                       ArchitectureConstruction, LandscapeArchitecture,…
2  Law                                           Law
3  Medicine, Nursing and Veterinary Medicine     Dentistry, Medicine, Pharmacy, Nursing,…
4  Arts and Science                              Agriculture, AnimalPoultry, Biology,…
5  Engineering and Applied Science               ComputerScience, Material, Electronics,…
6  Business and Commerce                         Business, Economics, Management,…
7  Education                                     Education
8  Others (unclassifiable)

Categorization of Academic Subjects
• Each subcategory has a set of word patterns
  - Matching table developed
• Process of word matching table development
  1. Run our subject-matching classifier program.
  2. Count each unclassified subject and sort the subjects by count.
  3. If a subject's count is greater than 10, add the subject to the matching table.
  4. Repeat steps 1 – 3 until the count is below 10 for all unclassified subjects.
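One round of the table-development loop above might look like the following minimal sketch. Only the threshold of 10 comes from the slide; the function names, the substring test, and the sample subjects are hypothetical. Deciding which subcategory a frequent subject belongs to is assumed to be a manual step, so the sketch only surfaces the candidates.

```python
from collections import Counter

THRESHOLD = 10  # from the slide: subjects counted more than 10 times are added


def frequent_unclassified(subjects, is_classified, threshold=THRESHOLD):
    """Steps 1-3: run the classifier, count the unclassified subjects,
    sort them by count, and keep those occurring more than `threshold` times."""
    counts = Counter(
        s.strip().lower() for s in subjects if not is_classified(s)
    )
    return [(s, n) for s, n in counts.most_common() if n > threshold]


# Hypothetical stand-in for the real classifier and its matching table.
known_patterns = ["educa", "geolog"]


def is_classified(subject):
    return any(p in subject.lower() for p in known_patterns)


# Hypothetical subject values, for illustration only.
subjects = ["Education"] * 5 + ["Geoscience"] * 12 + ["pulsars"] * 2

candidates = frequent_unclassified(subjects, is_classified)
print(candidates)  # subjects a person would then assign to a subcategory
```

Step 4 corresponds to adding a pattern for each candidate (here, something like "geoscience") and calling `frequent_unclassified` again until it returns an empty list.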

Categorization of Academic Subjects
• Matching Table

77 categories      Word Patterns
Education          /bildung/, /pedagog/, /fakul/, /educa/, /teaching/,…
Geology            /geolog/, /geoscience/,…
LibraryScience     /librari/, /library/, /informatik/,…
…                  …
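A matching table of this shape could be applied with ordinary regular expressions, as in the sketch below. The three rows and their patterns are copied from the slide; treating each pattern as a case-insensitive substring match, and returning the first row that matches, are assumptions rather than a description of the authors' program.

```python
import re

# Rows copied from the slide; the full table has 77 rows.
MATCHING_TABLE = [
    ("Education", ["bildung", "pedagog", "fakul", "educa", "teaching"]),
    ("Geology", ["geolog", "geoscience"]),
    ("LibraryScience", ["librari", "library", "informatik"]),
]

# One compiled alternation per subcategory (assumed: case-insensitive).
COMPILED = [
    (name, re.compile("|".join(re.escape(p) for p in patterns), re.IGNORECASE))
    for name, patterns in MATCHING_TABLE
]


def match_subcategory(subject):
    """Return the first subcategory with a pattern found in `subject`, else None."""
    for name, regex in COMPILED:
        if regex.search(subject):
            return name
    return None


for subject in ["Geosciences", "Erziehung und Bildung", "pulsars"]:
    print(subject, "->", match_subcategory(subject))
```

Stem-like patterns such as "educa" and "geolog" let one row cover several word forms and languages, which is why the table stores fragments rather than whole subject names.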

Categorization of Academic Subjects
Unclassified ETD Subjects:
• Approx. 85% are unique
• Approx. 10% have only two occurrences

Measuring Supply – Demand
• ETD Supply (Num. of Resources)
  - 242,688 ETDs classified into 7 categories and counted
• Patron's Demand (Num. of Queries)
  - 4,519 queries (from 1,100 user profiles) classified into 7 categories
  - "Sum of all queries" in each category calculated as the Demand of a Category:
    Demand(category) = sum, over all users in the category, of the user's number of queries
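The demand measure can be sketched as follows, assuming each user has already been assigned one category and a query count. The function names and the sample numbers are hypothetical and are not the study's data.

```python
from collections import defaultdict


def demand_by_category(users):
    """Sum the number of queries over all users in each category.

    `users` is an iterable of (category, number_of_queries) pairs.
    """
    totals = defaultdict(int)
    for category, n_queries in users:
        totals[category] += n_queries
    return dict(totals)


def shares(counts):
    """Convert raw counts to fractions of the total, for comparing
    supply (ETD counts) with demand (query counts) on one scale."""
    total = sum(counts.values())
    return {k: v / total for k, v in counts.items()} if total else {}


# Hypothetical users, for illustration only.
users = [
    ("Engineering and Applied Science", 151),
    ("Education", 40),
    ("Engineering and Applied Science", 9),
]

demand = demand_by_category(users)
print(demand)
print(shares(demand))
```

Applying `shares` to the per-category ETD counts as well puts supply and demand on the same percentage scale.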

ETD Classification
• Based on the "first" subject field

<dc oai_dc="" dc="" xsi="" schemaLocation="">
  <title>Composer-Centered Computer-Aided Soundtrack Composition</title>
  <creator>Vane, Roland Edwin</creator>
  <subject>Computer Science</subject>
  <subject>human computer interaction</subject>
  <subject>music composition</subject>
  <subject>soundtracks</subject>
  <subject>creativity</subject>
  <description>For as long as computers have been around, people have looked for ways to involve them in music….</description>
  <publisher>University of Waterloo</publisher>
  <date>2006</date>
  <type>Electronic Thesis or Dissertation</type>
  <format>application/pdf</format>
  <identifier>http://etd.u</identifier>
  <language>en</language>
  <rights>Copyright: 2006
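Picking out the first subject field could be done with Python's standard XML parser, as in this sketch. The record is a shortened, well-formed version of the example above; the function name and the handling of an empty first field are assumptions.

```python
import xml.etree.ElementTree as ET

# Shortened, well-formed version of the example record (illustrative).
RECORD = """<dc>
  <title>Composer-Centered Computer-Aided Soundtrack Composition</title>
  <subject>Computer Science</subject>
  <subject>human computer interaction</subject>
  <subject>music composition</subject>
</dc>"""


def first_subject(record_xml):
    """Return the text of the first subject element, or None if the record
    has no subject element or the first one is empty."""
    root = ET.fromstring(record_xml)
    for elem in root.iter():
        # Drop any "{namespace}" prefix so namespaced subject tags match too.
        if elem.tag.rsplit("}", 1)[-1] == "subject":
            text = (elem.text or "").strip()
            return text or None
    return None


print(first_subject(RECORD))
```

A record without a usable first subject yields `None`, which corresponds to the "null entry" case among the unclassifiable ETDs.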

User Classification
• Based on the "major", "broadresearch", and "specific" fields in each user profile

<user>
  <userID>shk</userID>
  <email></email>
  <name><first>Sh</first> <last>King</last></name>
  <major>CS</major>
  <broadresearch>Digital Library
    <specific>User interface</specific>
    <experience>8,2</experience>
  </broadresearch>
  <group />
  <query>
    <item freq="79">digital library</item>
    <item freq="33">computer science</item>
    <item freq="25">virginia tech</item>
    <item freq="9">artificial intelligence</item>
    <item freq="5">digital library.</item>
  </query>
  <selected>
    <item freq="15">Digital Library</item>
    <item freq="6">Electronic Theses and Dissertations</item>
  </selected>
  <proposed>
    <item freq="80">Digital Library</item>
    <item freq="65">Data</item>
  </proposed>
</user>
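The three fields used for user classification can be read from a profile as sketched below. The profile is a shortened version of the example above; the function name and the order in which the values are returned are assumptions.

```python
import xml.etree.ElementTree as ET

# Shortened version of the example profile (illustrative).
PROFILE = """<user>
  <userID>shk</userID>
  <major>CS</major>
  <broadresearch>Digital Library
    <specific>User interface</specific>
    <experience>8,2</experience>
  </broadresearch>
</user>"""


def classification_fields(profile_xml):
    """Collect the non-empty major, broadresearch, and specific values."""
    user = ET.fromstring(profile_xml)
    fields = []
    major = user.findtext("major", default="").strip()
    if major:
        fields.append(major)
    for broad in user.findall("broadresearch"):
        # The broad topic is the text before the nested child elements.
        topic = (broad.text or "").strip()
        if topic:
            fields.append(topic)
        for specific in broad.findall("specific"):
            detail = (specific.text or "").strip()
            if detail:
                fields.append(detail)
    return fields


print(classification_fields(PROFILE))
```

Each returned string can then be run through the same subject matching table that is used for ETD subjects.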

Challenges
• Variety in how research subjects are described
  Solution: we built a subject matching table

77 categories      Decision patterns
Education          /bildung/, /pedagog/, /fakul/, /educa/, /teaching/,…
Geology            /geolog/, /geoscience/,…
LibraryScience     /librari/, /library/, /informatik/,…
Arts               /music/, …

Challenges
• Interdisciplinary ETDs
  – e.g., "Music Education"
  – Solution: adjust matching order
• Unclassifiable ETDs
  – Null entries (no subject field data)
  – Erroneous entries (e.g., "Ph.D", "Georgia", "")
  – Typos (e.g., "edcuation", "poluition")
  – Too much detail (e.g., "pulsars", "muon", "cytochrome")
  – Abbreviations (e.g., "MOCVD", "OFDM")
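The effect of matching order on an interdisciplinary subject can be seen in a small sketch. It assumes a first-match-wins classifier; the slides do not say which order the authors chose for "Music Education", so both orders are shown.

```python
def classify(subject, ordered_table):
    """Return the first subcategory, in table order, with a matching pattern."""
    text = subject.lower()
    for subcategory, patterns in ordered_table:
        if any(pattern in text for pattern in patterns):
            return subcategory
    return None  # unclassifiable


# Patterns taken from the matching table shown earlier.
EDUCATION = ("Education", ["educa", "teaching"])
ARTS = ("Arts", ["music"])

education_first = classify("Music Education", [EDUCATION, ARTS])
arts_first = classify("Music Education", [ARTS, EDUCATION])

print(education_first)  # Education
print(arts_first)       # Arts
print(classify("MOCVD", [EDUCATION, ARTS]))  # None
```

Because the first match wins, moving a row up or down the table is enough to settle where a two-discipline subject lands.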

Resource Distribution
[Figure: "Resource Distribution in NDLTD", share of ETDs in each category. Legend: 1 Architecture and Design; 2 Law; 3 Medicine, Nursing and Veterinary Medicine; 4 Arts and Science; 5 Engineering and Applied Science; 6 Business and Commerce; 7 Education; 8 Others (unclassifiable)]

User Distribution
[Figure: "User Distribution in NDLTD", share of users in each category. Legend: 1 Architecture and Design; 2 Law; 3 Medicine, Nursing and Veterinary Medicine; 4 Arts and Science; 5 Engineering and Applied Science; 6 Business and Commerce; 7 Education; 8 Others (unclassifiable)]

Query Distribution
[Figure: "Query Distribution in NDLTD", share of queries in each category. Legend: 1 Architecture and Design; 2 Law; 3 Medicine, Nursing and Veterinary Medicine; 4 Arts and Science; 5 Engineering and Applied Science; 6 Business and Commerce; 7 Education; 8 Others (unclassifiable)]

Supply-Demand Comparison
[Figure: "ETD Resources and User Demands (Number of Queries) in NDLTD". Series: ETDs, Demands. X-axis: Academic Categories 1 to 8; y-axis: percentage, 0% to 50%. Categories: 1 Architecture and Design; 2 Law; 3 Medicine, Nursing and Veterinary Medicine; 4 Arts and Science; 5 Engineering and Applied Science; 6 Business and Commerce; 7 Education; 8 Others (unclassifiable)]

Supply-Demand of 77 Subcategories (1/2)
[Figure: "Supply/Demand 77 Subcategories (1/2)". Series: ETD Supply, User Demand. Y-axis: percentage, 0% to 12%. X-axis subcategories: AccountingFinance, Aerospace, Agriculture, AnimalPoultry, Anthropology, ApparelHousing, Archaeology, ArchitectureConstruction, Art, Astronomy, Biochemistry, BiologicalEngineering, Biology, Botany, Business, Chemical, Chemistry, Communication, ComputerScience, CropSoilEnvSciences, DairyScience, Dentistry, Ecology, Economics, Education, Electronics, EngineeringScience, English, Entomology, Environment, Family, ForeignLanguagesLiteratures, Food, Forestry, Geography, Geology, GovernmentInternationalAffair]

Supply-Demand of 77 Subcategories (2/2)
[Figure: "Supply/Demand 77 Subcategories (2/2)". Series: ETD Supply, User Demand. Y-axis: percentage, 0% to 12%. X-axis subcategories: History, Horticulture, HospitalityTourism, HumanDevelopment, HumanNutritionExercise, Industrial, Informatics, Interdisciplinary, LandscapeArchitecture, Law, LibraryScience, Linguistics, Literature, Management, Materials, Mechanics, Medicine, Meteorology, Mathematics, MiningMineral, Music, Naval, Nuclear, Nursing, OceanEngineering, Pharmacy, Philosophy, Physics, Plant, Politics, Psychology, PublicAdministrationPolicy, PublicAffair, Sociology, Statistics, UrbanPlanning, Veterinary, Wildlife, Wood, Zoology]

User Expertise Years
[Figure: "Users' Expertise in Years". X-axis: Years, 0 to 50; y-axis: Users, 0 to 200]

Expertise Years and Demand
[Figure: "Expertise Years and Demand". Series: Users, Demand. X-axis: Years, 0 to 50, plus an "error" bin; y-axis: percentage, 0% to 25%]

Date Stamp of ETD
[Figure: "Date Stamp of ETD". X-axis: Year, with bins 17xx, 18xx, 190x to 198x by decade, 1990 to 2006 by single year, and "error"; y-axis: number of ETDs, 0 to 60,000]

Date Stamp of ETD
• ETDs from the 1700s?
  - Some are scanned copies from European universities
  - The oldest ETDs are from British universities
  - Some of the older dates are typos; each record would have to be checked to know for sure
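The bins on the date-stamp chart (17xx, 18xx, decades up to 198x, single years from 1990 to 2006, and "error") can be reproduced with a small helper. This is a sketch under assumptions: the slides do not say how date strings were parsed or what exactly counted as an error, so here any value without a leading four-digit year, or with a year outside 1700 to 2006, goes to "error".

```python
import re
from collections import Counter


def date_bucket(date_stamp):
    """Map a date string to one of the chart's bins."""
    match = re.match(r"\s*(\d{4})", date_stamp or "")
    if not match:
        return "error"
    year = int(match.group(1))
    if 1700 <= year <= 1799:
        return "17xx"
    if 1800 <= year <= 1899:
        return "18xx"
    if 1900 <= year <= 1989:
        return f"{year // 10}x"  # e.g. 1985 -> "198x"
    if 1990 <= year <= 2006:
        return str(year)
    return "error"


# Hypothetical date stamps, for illustration only.
stamps = ["2006", "1999-05-01", "1785", "1985", "n.d.", ""]
histogram = Counter(date_bucket(s) for s in stamps)
print(histogram)
```

A helper like this only groups the stamps; it cannot tell a genuine scanned 18th-century thesis from a mistyped year, which still needs the record-by-record check noted above.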

Conclusions
• Analysis of ETD Subjects (supply)
• Analysis of User Queries (demand)
• Comparison of the Above Two
• Users' years of experience in their field
• Date Stamp of ETDs
Learned the future directions…

Future Work
• Use of a widely used classification system
  - e.g., Dewey Decimal Classification 22
• More detailed classification of ETDs
  - Include title, abstract, and other subject field data
  - Utilize "discipline" in ETD_MS format records (but only approx. 7,000 such records)
• Use of user behavior data
  - e.g., clicking of query results in NDLTD

References
[1] NDLTD, Networked Digital Library of Theses and Dissertations, 2006.
[2] Hussein Suleman, "OAI/ODL Harvester".
[3] Seonho Kim, Uma Murthy, Kapil Ahuja, Sandi Vasile, Edward A. Fox, "Effectiveness of Implicit Rating Data on Characterizing Users in Complex Information Systems", 9th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2005), Springer-Verlag LNCS 3652, 2005, pp. 186-194.

Thank You

Questions or Comments?

User Data - Fields
• <query> : entered by the user
• <proposed> : ETD results clustered and displayed
• <selected> : cluster labels clicked by the user

ETD 2006, Quebec, Canada