The Vector Space Model - University of Wolverhampton

Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence

Mike Thelwall
Professor of Information Science
University of Wolverhampton

Contents
- Introduction to Scientific Web Intelligence
- Introduction to the Vector Space Model
- Vocabulary Spectral Analysis
- Low frequency words

Part 1: Scientific Web Intelligence

Scientific Web Intelligence
- Applying web mining and web intelligence techniques to collections of academic/scientific web sites
- Uses links and text
- Objective: to identify patterns and visualize relationships between web sites and subsites
- Objective: to report to users causal information about relationships and patterns

Academic Web Mining
- Step 1: Cluster domains by subject content, using text and links
- Step 2: Identify patterns and create visualizations for relationships
- Step 3: Incorporate user feedback and reason reporting into visualization
This presentation deals with Step 1, deriving subject-based clusters of academic webs from text analysis.

Part 2: Introduction to the Vector Space Model

Overview
- The Vector Space Model (VSM) is a way of representing documents through the words that they contain
- It is a standard technique in Information Retrieval
- The VSM allows decisions to be made about which documents are similar to each other and to keyword queries

How it works: Overview
- Each document is broken down into a word frequency table
- The tables are called vectors and can be stored as arrays
- A vocabulary is built from all the words in all documents in the system
- Each document is represented as a vector of word frequencies over the vocabulary

Example
- Document A: “A dog and a cat.”

    Word       a    dog  and  cat
    Frequency  2    1    1    1

- Document B: “A frog.”

    Word       a    frog
    Frequency  1    1

Example, continued
- The vocabulary contains all words used: a, dog, and, cat, frog
- The vocabulary needs to be sorted: a, and, cat, dog, frog

Example, continued
- Document A: “A dog and a cat.” Vector: (2,1,1,1,0)

    Word       a    and  cat  dog  frog
    Frequency  2    1    1    1    0

- Document B: “A frog.” Vector: (1,0,0,0,1)

    Word       a    and  cat  dog  frog
    Frequency  1    0    0    0    1

Measuring inter-document similarity
- For two vectors d and d', the cosine similarity between d and d' is given by:

  $$\cos(d, d') = \frac{d \cdot d'}{\lVert d \rVert \, \lVert d' \rVert}$$

- Here d · d' is the dot product of d and d', calculated by multiplying corresponding frequencies together and summing the results, and ‖d‖ is the Euclidean length of d
- The cosine measure gives the cosine of the angle between the vectors in a high-dimensional virtual space

Stopword lists
- Commonly occurring words are unlikely to give useful information and may be removed from the vocabulary to speed processing
  - E.g. “in”, “a”, “the”

Normalised term frequency (tf)
- A normalised measure of the importance of a word to a document is its frequency divided by the maximum frequency of any term in the document
- This is known as the tf factor
- Document A: raw frequency vector: (2,1,1,1,0), tf vector: (1, 0.5, 0.5, 0.5, 0)

Inverse document frequency (idf)
- A calculation designed to make rare words more important than common words
- The idf of word i is given by

  $$\mathrm{idf}_i = \log \frac{N}{n_i}$$

- Where N is the total number of documents and n_i is the number that contain word i

tf-idf
- The tf-idf weighting scheme multiplies the tf and idf factors for each word
- Words are important for a document if they are frequent relative to other words in the document and rare in other documents

Part 3: Vocabulary Spectral Analysis

Subject-clustering academic webs through text similarity (1)
1. Create a collection of virtual documents, each consisting of all web pages sharing a common domain name in a university.
   - Doc. 1 = cs.auckland.ac.uk (14,521 pages)
   - Doc. 2 = www.auckland.ac.nz (3,463 pages)
   - …
   - Doc. 760 = www.vuw.ac.nz (4,125 pages)

Subject-clustering academic webs through text similarity (2)
2. Convert each virtual document into a tf-idf word vector
3. Identify clusters using k-means and VSM cosine measures
4. Rank words for importance in each ‘natural’ cluster
   - Cluster Membership Indicator
5. Manually filter out high-ranking words in undesired clusters
   - Destroys the natural clustering of the data to uncover weaker subject clustering

Cluster Membership Indicator
- For a cluster C of documents and tf-idf weights w_ij (the weight of word i in document j):

  $$\mathrm{cmi}(C, i) = \frac{\sum_{j \in C} w_{ij}}{|C|} - \frac{\sum_{j \notin C} w_{ij}}{n - |C|}$$

- Where |C| is the number of documents in C and n is the total number of documents
- The table below shows the top CMI weights for an undesired non-subject cluster

    Word         Frequency   Domains   CMI
    massey           32991       364   0.30587
    palmerston        9023       305   0.09137
    and            1883534       674   0.0794
    the            3605107       689   0.0746
    of             2263812       683   0.06782
    in             1317941       655   0.06556
    north            21348       414   0.06431
    students        127178       550   0.05753
    research        186161       546   0.05687
    a              1254004       659   0.05616

Eliminating low frequency words
- Can test whether removing low frequency words increases or decreases subject clustering tendency
  - E.g. do spelling mistakes help or hinder clustering?
- Need partially correct subject clusters
  - Compare similarity of documents within cluster to similarity with documents outside cluster

Eliminating low frequency words
[Figure: intra-subject average correlation minus inter-subject average correlation (y-axis, about -0.3 to 0.5) plotted against the minimum number of domains containing a word (x-axis, 2 to 640), with one line per subject: Law, Architecture, Sport, Maths, Planning, Social studies, Engineering, Languages, Physics, Chemistry, Business, Psychology, Education, Medicine, Env. Sci., Food, Computing, Biology, General, Arts.]

Summary
- For text-based academic subject web site clustering:
  - need to select vocabularies to break natural clustering and allow subject clustering
  - consider ignoring low frequency words because they do not have high clustering power
  - need to automate the manual element as far as possible
- The results can then form the basis of a visualization that can give feedback to the user on inter-subject connections
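
The Part 2 example (Document A and Document B) and the Part 3 weighting steps can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code behind the presentation: the function names and the simple tokenizer are hypothetical, the logarithm is taken as natural because the slides do not give a base, and the cluster membership indicator is read as the mean weight of a word inside the cluster minus its mean weight outside it.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and keep alphabetic runs (an assumed, simplistic tokenizer)."""
    return re.findall(r"[a-z]+", text.lower())


def build_vocabulary(docs):
    """Sorted list of every distinct word used in any document."""
    return sorted({word for doc in docs for word in tokenize(doc)})


def count_vector(doc, vocab):
    """Raw word-frequency vector for one document over the vocabulary."""
    counts = Counter(tokenize(doc))
    return [counts[word] for word in vocab]


def cosine(u, v):
    """Dot product divided by the product of the Euclidean lengths."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def tf(vec):
    """Frequency divided by the maximum frequency in the document."""
    top = max(vec)
    return [x / top if top else 0.0 for x in vec]


def idf(vectors):
    """log(N / n_i) per word; natural log is an assumption."""
    n_docs = len(vectors)
    result = []
    for i in range(len(vectors[0])):
        n_i = sum(1 for v in vectors if v[i] > 0)
        result.append(math.log(n_docs / n_i) if n_i else 0.0)
    return result


def tf_idf(vectors):
    """Multiply the tf and idf factors for each word in each document."""
    weights = idf(vectors)
    return [[t * w for t, w in zip(tf(v), weights)] for v in vectors]


def cmi(weights, cluster, i):
    """Assumed reading of the CMI: mean weight of word i over documents in
    the cluster minus its mean weight over documents outside the cluster.
    Both groups must be non-empty."""
    inside = [weights[j][i] for j in cluster]
    outside = [weights[j][i] for j in range(len(weights)) if j not in cluster]
    return sum(inside) / len(inside) - sum(outside) / len(outside)


docs = ["A dog and a cat.", "A frog."]
vocab = build_vocabulary(docs)                  # ['a', 'and', 'cat', 'dog', 'frog']
vectors = [count_vector(d, vocab) for d in docs]  # [[2, 1, 1, 1, 0], [1, 0, 0, 0, 1]]

print(cosine(vectors[0], vectors[1]))
print(tf(vectors[0]))
print(tf_idf(vectors))
```

With only two documents the word “a” appears in both, so its idf is log(2/2) = 0 and it drops out of the tf-idf vectors, which is the intended effect of making common words less important. Real collections would also apply a stopword list and a clustering step such as k-means, neither of which is shown here.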
