Vocabulary Spectral Analysis as an Exploratory Tool for Scientific Web Intelligence
Mike Thelwall
Professor of Information Science
University of Wolverhampton
Contents

• Introduction to Scientific Web Intelligence
• Introduction to the Vector Space Model
• Vocabulary Spectral Analysis
• Low frequency words
Part 1: Scientific Web Intelligence
Scientific Web Intelligence
• Applying web mining and web intelligence techniques to collections of academic/scientific web sites
• Uses links and text
• Objective: to identify patterns and visualize relationships between web sites and subsites
• Objective: to report causal information about relationships and patterns back to users
Academic Web Mining

• Step 1: Cluster domains by subject content, using text and links
• Step 2: Identify patterns and create visualizations of the relationships
• Step 3: Incorporate user feedback and reason reporting into the visualizations

This presentation deals with Step 1: deriving subject-based clusters of academic webs from text analysis.
Part 2: Introduction to the Vector Space Model
Overview

• The Vector Space Model (VSM) is a way of representing documents through the words that they contain
• It is a standard technique in Information Retrieval
• The VSM allows decisions to be made about which documents are similar to each other and to keyword queries
How it works: Overview

• Each document is broken down into a word frequency table
• The tables are called vectors and can be stored as arrays
• A vocabulary is built from all the words in all documents in the system
• Each document is represented as a vector against the vocabulary
Example

Document A: “A dog and a cat.”

  a    2
  dog  1
  and  1
  cat  1

Document B: “A frog.”

  a    1
  frog 1
Example, continued

• The vocabulary contains all words used:
  – a, dog, and, cat, frog
• The vocabulary needs to be sorted:
  – a, and, cat, dog, frog
Example, continued

Document A: “A dog and a cat.”

  a  and  cat  dog  frog
  2   1    1    1    0

  Vector: (2, 1, 1, 1, 0)

Document B: “A frog.”

  a  and  cat  dog  frog
  1   0    0    0    1

  Vector: (1, 0, 0, 0, 1)
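As a minimal sketch of the steps above, the example vectors can be reproduced in Python (the lower-casing and punctuation-stripping tokenisation rules are assumptions; the slides do not specify a tokeniser):

```python
# Build word frequency vectors for the two example documents.
from collections import Counter

docs = {
    "A": "A dog and a cat.",
    "B": "A frog.",
}

# Step 1: break each document into a word frequency table.
counts = {name: Counter(text.lower().replace(".", "").split())
          for name, text in docs.items()}

# Step 2: build the sorted vocabulary from all words in all documents.
vocabulary = sorted(set().union(*counts.values()))
print(vocabulary)        # ['a', 'and', 'cat', 'dog', 'frog']

# Step 3: represent each document as a vector against the vocabulary.
vectors = {name: [c[word] for word in vocabulary]
           for name, c in counts.items()}
print(vectors["A"])      # [2, 1, 1, 1, 0]
print(vectors["B"])      # [1, 0, 0, 0, 1]
```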
Measuring inter-document similarity

• For two vectors d and d' the cosine similarity between d and d' is given by:

  \cos(d, d') = \frac{d \cdot d'}{\|d\|\,\|d'\|}

• Here d · d' is the dot product of d and d', calculated by multiplying corresponding frequencies together and summing; ‖d‖ and ‖d'‖ are the lengths of the vectors
• The cosine measure calculates the angle between the vectors in a high-dimensional virtual space
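A direct translation of this definition into code, as a sketch (the helper name is mine, not from the presentation):

```python
# Cosine similarity: dot product divided by the product of vector lengths.
import math

def cosine_similarity(d, d_prime):
    dot = sum(x * y for x, y in zip(d, d_prime))
    length = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in d_prime))
    return dot / length if length else 0.0

# Documents A and B from the example share only the word 'a'.
print(cosine_similarity([2, 1, 1, 1, 0], [1, 0, 0, 0, 1]))  # ~0.53
```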
Stopword lists

• Commonly occurring words are unlikely to give useful information and may be removed from the vocabulary to speed processing
  – E.g. “in”, “a”, “the”
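A one-step sketch of the filtering; the stopword list here is illustrative, not the one used in the study:

```python
# Remove stopwords from the example vocabulary before building vectors.
STOPWORDS = {"in", "a", "the", "and", "of"}

vocabulary = ["a", "and", "cat", "dog", "frog"]
filtered = [word for word in vocabulary if word not in STOPWORDS]
print(filtered)  # ['cat', 'dog', 'frog']
```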
Normalised term frequency (tf)

• A normalised measure of the importance of a word to a document is its frequency divided by the maximum frequency of any term in the document
• This is known as the tf factor
• Document A: raw frequency vector (2, 1, 1, 1, 0), tf vector (1, 0.5, 0.5, 0.5, 0)
Inverse document frequency (idf)

• A calculation designed to make rare words more important than common words
• The idf of word i is given by

  \mathrm{idf}_i = \log\frac{N}{n_i}

• where N is the total number of documents and n_i is the number that contain word i
tf-idf

• The tf-idf weighting scheme multiplies the tf and idf factors for each word
• Words are important for a document if they are frequent relative to other words in the document and rare in other documents
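A sketch combining the tf and idf definitions above into tf-idf weights; the slides do not fix the logarithm base, so the natural log is an assumption here:

```python
# tf-idf weights for the example documents (vocabulary: a, and, cat, dog, frog).
import math

def tf_vector(raw):
    # Normalise by the maximum frequency of any term in the document.
    max_f = max(raw)
    return [f / max_f for f in raw]

def idf_values(raw_vectors):
    # idf_i = log(N / n_i): N = number of documents, n_i = documents with word i.
    n = len(raw_vectors)
    return [math.log(n / sum(1 for v in raw_vectors if v[i] > 0))
            for i in range(len(raw_vectors[0]))]

raw = [[2, 1, 1, 1, 0],   # Document A
       [1, 0, 0, 0, 1]]   # Document B
idf = idf_values(raw)
tfidf_A = [t * i for t, i in zip(tf_vector(raw[0]), idf)]
print([round(w, 3) for w in tfidf_A])  # [0.0, 0.347, 0.347, 0.347, 0.0]
# 'a' scores 0 because it occurs in every document.
```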
Part 3: Vocabulary Spectral Analysis
Subject-clustering academic webs through text similarity (1)

1. Create a collection of virtual documents, each consisting of all web pages sharing a common domain name in a university:
   – Doc. 1 = cs.auckland.ac.nz (14,521 pages)
   – Doc. 2 = www.auckland.ac.nz (3,463 pages)
   – …
   – Doc. 760 = www.vuw.ac.nz (4,125 pages)
Subject-clustering academic webs through text similarity (2)

2. Convert each virtual document into a tf-idf word vector
3. Identify clusters using k-means and VSM cosine measures (a sketch follows this list)
4. Rank words by their importance in each ‘natural’ cluster, using the Cluster Membership Indicator (next slide)
5. Manually filter out high-ranking words in undesired clusters
   – This destroys the natural clustering of the data to uncover the weaker subject clustering
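A sketch of steps 2 and 3 under assumptions the slides leave open: scikit-learn's TfidfVectorizer and KMeans stand in for the study's own tf-idf and clustering code, and L2-normalising the vectors first makes Euclidean k-means behave approximately like clustering on the cosine measure. The document texts and cluster count are placeholders:

```python
# Cluster virtual documents (one per domain) by tf-idf text similarity.
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import normalize

# Placeholder texts; in the study each string would be the concatenated
# text of every page under one academic domain.
virtual_docs = [
    "computer science programming algorithms ...",
    "enrolment campus news student services ...",
    "law courses legal studies statutes ...",
]

tfidf = TfidfVectorizer(stop_words="english").fit_transform(virtual_docs)
unit = normalize(tfidf)                  # unit length, so distances track cosine
labels = KMeans(n_clusters=2, n_init=10).fit_predict(unit)
print(labels)                            # cluster index per virtual document
```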
Cluster Membership Indicator

For a cluster C drawn from n documents with tf-idf weights w_ij, the cluster membership indicator of word i is its average weight inside the cluster minus its average weight outside it:

  \mathrm{cmi}(C, i) = \frac{\sum_{j \in C} w_{ij}}{|C|} - \frac{\sum_{j \notin C} w_{ij}}{n - |C|}

The next slide shows the top CMI weights for an undesired non-subject cluster.
Word         Frequency   Domains   CMI
massey           32991       364   0.30587
palmerston        9023       305   0.09137
and            1883534       674   0.0794
the            3605107       689   0.0746
of             2263812       683   0.06782
in             1317941       655   0.06556
north            21348       414   0.06431
students        127178       550   0.05753
research        186161       546   0.05687
a              1254004       659   0.05616
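A sketch of the indicator, assuming w is a documents × words matrix of tf-idf weights and reading the formula as the difference of mean weights inside and outside the cluster (the data below is invented for illustration):

```python
# Cluster membership indicator: mean weight of a word inside the cluster
# minus its mean weight over the documents outside the cluster.
import numpy as np

def cmi(weights, cluster, word):
    inside = weights[sorted(cluster), word]
    outside_rows = [j for j in range(weights.shape[0]) if j not in cluster]
    outside = weights[outside_rows, word]
    return inside.mean() - outside.mean()

# Hypothetical 4-document, 3-word tf-idf matrix.
w = np.array([[0.9, 0.1, 0.0],
              [0.8, 0.2, 0.1],
              [0.1, 0.7, 0.3],
              [0.0, 0.6, 0.2]])
print(cmi(w, cluster={0, 1}, word=0))  # 0.8: word 0 characterises the cluster
```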
Eliminating low frequency words

• Can test whether removing low frequency words increases or decreases the subject clustering tendency
  – E.g. are they spelling mistakes?
• Need partially correct subject clusters to test against
• Compare the similarity of documents within a cluster to their similarity with documents outside the cluster (see the sketch below)
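A sketch of that comparison, assuming partially correct subject labels are available: a trial vocabulary is scored by the average cosine similarity of document pairs in the same subject minus the average across different subjects (the vectors and labels below are invented):

```python
# Score a vocabulary by intra-cluster minus inter-cluster average similarity.
import itertools
import math

def cosine(d, e):
    dot = sum(x * y for x, y in zip(d, e))
    length = math.sqrt(sum(x * x for x in d)) * math.sqrt(sum(y * y for y in e))
    return dot / length if length else 0.0

def clustering_tendency(vectors, labels):
    intra, inter = [], []
    for (i, d), (j, e) in itertools.combinations(enumerate(vectors), 2):
        (intra if labels[i] == labels[j] else inter).append(cosine(d, e))
    return sum(intra) / len(intra) - sum(inter) / len(inter)

# Invented document vectors under two trial subject labels.
vecs = [[1, 0, 1.0], [1, 0, 0.8], [0, 1, 0.0], [0, 1, 0.1]]
print(clustering_tendency(vecs, ["law", "law", "maths", "maths"]))  # ~0.96
```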
Eliminating low frequency words

[Chart: intra-subject average correlation minus inter-subject average correlation (y-axis, approx. −0.3 to 0.5) against the minimum number of domains containing a word (x-axis, 2 to 640), with one line per subject: Law, Architecture, Sport, Maths, Planning, Social studies, Engineering, Languages, Physics, Chemistry, Business, Psychology, Education, Medicine, Env. Sci., Food, Computing, Biology, General, Arts]
Summary

• For text-based clustering of academic subject web sites:
  – vocabularies need to be selected so as to break the natural clustering and allow subject clustering
  – consider ignoring low frequency words, because they do not have high clustering power
  – the manual element needs to be automated as far as possible
• The results can then form the basis of a visualization that gives the user feedback on inter-subject connections