Measuring Scholarly Impact

Field profiling covers productivity (top journals, top researchers) and impact through citation (top journals, top researchers, rising stars).

Example field: the Semantic Web. Data: 44,157 papers with 651,673 citations from Scopus (1975-2009), and 22,951 papers with 571,911 citations from Web of Science (1960-2009).

Rising Stars
In WOS, M. A. Harris (Gene Ontology-related research), T. Harris (design and implementation of programming languages), and L. Ding (Swoogle, the Semantic Web search engine) are ranked as the top three authors with the highest increase in citations. In Scopus, D. Roman (Semantic Web services), J. de Bruijn (logic programming), and L. Ding (Swoogle) are ranked as the top three for the most significant increase in number of citations.

Ding, Y. (2010). Semantic Web: Who is who in the field. Journal of Information Science, 36(3): 335-356.

Section 1 DATA COLLECTION

Step 1: Data collection
Collect papers either by journals or by keywords.
Example keyword set for information retrieval: INFORMATION RETRIEVAL, INFORMATION STORAGE AND RETRIEVAL, QUERY PROCESSING, DOCUMENT RETRIEVAL, DATA RETRIEVAL, IMAGE RETRIEVAL, TEXT RETRIEVAL, CONTENT BASED RETRIEVAL, CONTENT-BASED RETRIEVAL, DATABASE QUERY, DATABASE QUERIES, QUERY LANGUAGE, QUERY LANGUAGES, and RELEVANCE FEEDBACK.

Web of Science
Go to the IU Web of Science portal: http://libraries.iub.edu/resources/wos
For example, select the Core Collection and search "information retrieval" as a topic, for all years.
Web of Science output: export the matching records as tab-delimited text files.

Python
Download Python: https://www.python.org/downloads/
To run Python smoothly on Windows, you may have to change certain environment settings. In short, the path is: My Computer > Properties > Advanced > Environment Variables. In this dialog you can add or modify User and System variables; changing System variables requires unrestricted access to your machine (i.e., Administrator rights).
User variable: C:\Program Files (x86)\Python27\Lib;
Alternatively, inspect the settings on the command line with "set" and "echo %path%".

Python script for conversion (conwos1.py)
The script below converts Web of Science export files into two tab-separated tables, paper.tsv and reference.tsv.

#!/usr/bin/env python
# encoding: utf-8
"""
conwos.py

Convert Web of Science export files into tab-separated
paper and reference tables.
"""

import os
import re

paper = 'paper.tsv'
reference = 'reference.tsv'
defsource = 'source'


def main():
    source = raw_input('What is the name of the source folder?\n')
    if len(source) < 1:
        source = defsource
    files = os.listdir(source)
    fpaper = open(paper, 'w')
    fref = open(reference, 'w')
    uid = 0
    for name in files:
        if name[-3:] != 'txt':
            continue
        fil = open(os.path.join(source, name))
        print '%s is processing...' % name
        first = True
        for line in fil:
            line = line[:-1]
            if first:
                first = False  # skip the header line of each file
            else:
                uid += 1
                record = str(uid) + '\t'
                elements = line.split('\t')
                for i in range(len(elements)):
                    element = elements[i]
                    if i == 1:
                        # split the author field into five author columns
                        authors = element.split('; ')
                        for j in range(5):
                            if j < len(authors):
                                record += authors[j] + '\t'
                            else:
                                record += '\t'
                    elif i == 29:
                        # the cited-reference field: parse it and write
                        # one row per cited reference to reference.tsv
                        for ref in getRefs(element):
                            fref.write(str(uid) + '\t' + ref + '\n')
                        continue
                    record += element + '\t'
                fpaper.write(record[:-1] + '\n')
        fil.close()
    fpaper.close()
    fref.close()


def getRefs(refs):
    # Parse a cited-reference field into author, year, source,
    # volume, and page columns.
    refz = []
    for ref in refs.split('; '):
        record = ''
        segs = ref.split(', ')
        author = ''
        ind = -1
        if len(segs) == 0:
            continue
        for seg in segs:
            ind += 1
            if isYear(seg):
                record += author[:-2] + '\t' + seg + '\t'
                break
            else:
                author += seg + ', '
        ind += 1
        if ind < len(segs):
            if not isVol(segs[ind]) and not isPage(segs[ind]):
                record += segs[ind] + '\t'  # source title
                ind += 1
            else:
                record += '\t'
        else:
            record += '\t'
        if ind < len(segs):
            if isVol(segs[ind]):
                record += segs[ind][1:] + '\t'  # strip the leading 'V'
                ind += 1
            else:
                record += '\t'
        else:
            record += '\t'
        if ind < len(segs):
            if isPage(segs[ind]):
                record += segs[ind][1:] + '\t'  # strip the leading 'P'
                ind += 1
            else:
                record += '\t'
        else:
            record += '\t'
        if record[0] != '\t':
            refz.append(record[:-1])
    return refz


def isYear(episode):
    return re.search(r'^\d{4}$', episode) is not None


def isVol(episode):
    return re.search(r'^V\d+$', episode) is not None


def isPage(episode):
    return re.search(r'^P\d+$', episode) is not None


if __name__ == '__main__':
    main()
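To see what getRefs parses, here is a small usage sketch. The reference strings below are invented for illustration, not taken from the dataset; they follow the usual WOS cited-reference form "AUTHOR, YEAR, SOURCE, Vvolume, Ppage". Run it inside conwos1.py (or after importing the script); like the script itself, it is Python 2.

# Illustrative input only: two made-up cited references in WOS form.
refs = 'SALTON G, 1990, J AM SOC INFORM SCI, V41, P288; ' \
       'VANRIJSBERGEN CJ, 1979, INFORMATION RETRIEVA'
for parsed in getRefs(refs):
    print parsed.replace('\t', ' | ')
# Expected output (author | year | source | volume | page):
# SALTON G | 1990 | J AM SOC INFORM SCI | 41 | 288
# VANRIJSBERGEN CJ | 1979 | INFORMATION RETRIEVA | |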
Convert output to database
Run the Python script conwos1.py. Output: paper.tsv and reference.tsv.
Load the two files into Access: import them via External Data.
Access tables: the Paper table and the Citation table.

Section 2 PRODUCTIVITY & IMPACT

Productivity: top authors, top journals, top organizations.
Impact: highly cited authors, highly cited journals, highly cited articles.
In Access, each of these counts can be computed with the Find duplicate records query template.

Other indicators
What other indicators could measure productivity and impact? Time, journal impact factor, journal category, keyword, and so on. Think about this in depth: what new indicators can you propose?
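The same counts can also be computed outside Access. Below is a minimal Python sketch, assuming the column layout produced by conwos1.py above (uid in column 0 and five author columns in columns 1-5 of paper.tsv; the cited author in column 1 of reference.tsv).

# Productivity (papers per author) and impact (citations per cited author).
from collections import Counter

produced = Counter()
with open('paper.tsv') as f:
    for line in f:
        cols = line.rstrip('\n').split('\t')
        for author in cols[1:6]:              # the five author columns
            if author:
                produced[author] += 1

cited = Counter()
with open('reference.tsv') as f:
    for line in f:
        cols = line.rstrip('\n').split('\t')
        if len(cols) > 1 and cols[1]:
            cited[cols[1]] += 1               # cited (first) author

print('Top 10 productive authors: %s' % produced.most_common(10))
print('Top 10 cited authors: %s' % cited.most_common(10))

Note that, as with the Access queries, author name disambiguation is left to the analyst.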
Section 3 AUTHOR-COCITATION NETWORK

Top 100 highly cited authors
First select the set of authors for which you want to build the matrix: here, the top 100 highly cited authors.

Author cocitation network
Build the author cocitation matrix (how often each pair of authors is cited together) and load the network into SPSS.

Section 4 CLUSTERING

Clustering Analysis
Aim: create clusters of items that are similar to others in the same cluster and different from those outside the cluster; in other words, maximize similarity within clusters and difference between clusters.
Items are called cases in SPSS.
There are no dependent variables in cluster analysis.

Clustering Analysis
The degree of similarity or dissimilarity is measured by the distance between cases. Euclidean distance measures the length of a straight line between two cases.
The numeric values entering the distance should be on the same measurement scale. If they are on different scales, either transform them to a common scale or create a distance matrix first.

Clustering
Hierarchical clustering does not require deciding the number of clusters first and is good for a small set of cases. K-means does require the number of clusters first and is good for a large set of cases.

Hierarchical Clustering: Data
Data: the variables can be quantitative, binary, or count data. Scaling of variables is an important issue: differences in scaling may affect your cluster solution(s). If your variables have large differences in scaling (for example, one variable is measured in dollars and the other in years), you should consider standardizing them (this can be done automatically by the Hierarchical Cluster Analysis procedure).

Hierarchical Clustering: Data
Case order: the cluster solution may depend on the order of cases in the file. You may want to obtain several different solutions with cases sorted in different random orders to verify the stability of a given solution.

Hierarchical Clustering: Data
Assumptions: the distance or similarity measure used should be appropriate for the data analyzed, and you should include all relevant variables in your analysis; omission of influential variables can result in a misleading solution. Because hierarchical cluster analysis is an exploratory method, results should be treated as tentative until they are confirmed with an independent sample.

Hierarchical Clustering: Method
Nearest neighbor (single linkage): the dissimilarity between clusters A and B is the minimum of all possible distances between cases in A and cases in B.
Furthest neighbor (complete linkage): the dissimilarity between clusters A and B is the maximum of all possible distances between cases in A and cases in B.
Between-groups (average) linkage: the dissimilarity between clusters A and B is the average of all possible distances between cases in A and cases in B.
Within-groups linkage: the dissimilarity between clusters A and B is the average of all possible distances between the cases within the single new cluster formed by combining A and B.

Hierarchical Clustering: Method
Centroid clustering: the dissimilarity between clusters A and B is the distance between the centroid of the cases in cluster A and the centroid of the cases in cluster B.
Median clustering: the dissimilarity between clusters A and B is the distance between the SPSS-determined median of the cases in cluster A and the median of the cases in cluster B.
Ward's method: the dissimilarity between clusters A and B is the "loss of information" from joining the two clusters, measured by the increase in the error sum of squares.
All three of these methods should use squared Euclidean distance rather than Euclidean distance. A minimal Python sketch of these linkage choices follows this slide.
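As a sketch only (not the SPSS procedure itself), the same linkage methods are available in scipy. The random matrix below is a stand-in for the author cocitation matrix built in Section 3; numpy and scipy are assumed to be installed.

# Hierarchical clustering of a (stand-in) author co-citation matrix.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
cocite = rng.integers(0, 50, size=(10, 10)).astype(float)  # stand-in data

d = pdist(cocite, metric='euclidean')   # pairwise distances between cases
# method='single' | 'complete' | 'average' | 'centroid' | 'median' | 'ward'
# mirrors the SPSS choices; centroid, median, and ward assume Euclidean input.
Z = linkage(d, method='ward')
print(fcluster(Z, t=3, criterion='maxclust'))  # membership for 3 clusters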
Measure for Interval
Euclidean distance: the square root of the sum of the squared differences between values for the items. This is the default for interval data.
Squared Euclidean distance: the sum of the squared differences between the values for the items.
Pearson correlation: the product-moment correlation between two vectors of values.
Cosine: the cosine of the angle between two vectors of values.
Chebychev: the maximum absolute difference between the values for the items.
Block: the sum of the absolute differences between the values for the items. Also known as Manhattan distance.
Minkowski: the pth root of the sum of the absolute differences to the pth power between the values for the items.
Customized: the rth root of the sum of the absolute differences to the pth power between the values for the items.

Transform values
Z scores: values are standardized to z scores, with a mean of 0 and a standard deviation of 1.
Range -1 to 1: each value for the item being standardized is divided by the range of the values.
Range 0 to 1: the procedure subtracts the minimum value from each item being standardized and then divides by the range.
Maximum magnitude of 1: the procedure divides each value for the item being standardized by the maximum of the values.
Mean of 1: the procedure divides each value for the item being standardized by the mean of the values.
Standard deviation of 1: the procedure divides each value for the variable or case being standardized by the standard deviation of the values.

Hierarchical Clustering
Identify relatively homogeneous groups of cases (or variables) based on selected characteristics, using an algorithm that starts with each case (or variable) in a separate cluster and combines clusters until only one is left. Distance or similarity measures are generated by the Proximities procedure.

Hierarchical Clustering: Statistics
Agglomeration schedule: displays the cases or clusters combined at each stage, the distances between the cases or clusters being combined, and the last cluster level at which a case (or variable) joined the cluster.
Proximity matrix: gives the distances or similarities between items.
Cluster membership: displays the cluster to which each case is assigned at one or more stages in the combination of clusters. Available options are single solution and range of solutions.

Hierarchical Clustering: Plot
Dendrograms: can be used to assess the cohesiveness of the clusters formed and can provide information about the appropriate number of clusters to keep.
Icicle plots: display information about how cases are combined into clusters at each iteration of the analysis (the user can specify a range of clusters to be displayed). Orientation: a vertical or horizontal plot.

Hierarchical Clustering: Result
Dendrogram using average linkage (between groups); dendrogram using Ward linkage.

K-Means Clustering
K-means can handle a large number of cases, but it requires users to specify the number of clusters first.

K-Means Clustering: Method
Iterate and classify: set the number of iterations and the convergence criterion.
Classify only: no iteration.

K-Means Clustering: Method
Cluster centers: in Read initial from, specify a file containing the initial cluster centers; in Write final as, specify a file to receive the final cluster centers.

K-Means Clustering: Method
Iterate: by default, 10 iterations and a convergence criterion of 0 are used. Maximum iterations: no more than 999.
Use running means. Yes: cluster centers change after the addition of each object. No: cluster centers are calculated after all objects have been allocated to a given cluster.
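Outside SPSS, the same procedure can be sketched in a few lines of Python. This is a minimal illustration assuming scikit-learn is installed, with random stand-in data; max_iter=10 mirrors the SPSS default of 10 iterations.

# K-means on standardized stand-in data; n_clusters must be chosen up front.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

X = np.random.rand(200, 4)               # stand-in cases-by-variables data
Xs = StandardScaler().fit_transform(X)   # put variables on the same scale

km = KMeans(n_clusters=5, max_iter=10, n_init=10, random_state=0).fit(Xs)
print(km.cluster_centers_)               # final cluster centers
print(km.labels_[:10])                   # cluster membership, first 10 cases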
K-Means Clustering: Statistics
The output shows the initial cluster centers, an ANOVA table, and each case's distance from its cluster center.

K-Means: Result
Initial cluster centers: vectors whose values are based on the clustering variables.

WEEK 12 (Courtesy: Angelina Anastasova and Natalia Jaworska, University of Ottawa)

MULTIDIMENSIONAL SCALING

Multidimensional Scaling (MDS): What Is It?
Generally regarded as exploratory data analysis, MDS reduces large amounts of data into easy-to-visualize structures. It attempts to find structure (a visual representation) in a set of distance measures, e.g. dis/similarities between objects/cases, and shows how variables/objects are related perceptually.
How? By assigning cases to specific locations in space, such that distances between points in space match the dis/similarities as closely as possible: similar objects are represented by close points, dissimilar objects by points far apart.

MDS Example: City Distances
The distance matrix is symmetric. Clusters of nearby cities appear on the spatial map, and two dimensions emerge: 1) North/South, 2) East/West.

The Process of MDS: The Data
The data for MDS are similarities, dissimilarities, distances, or proximities, reflecting the amount of dis/similarity or distance between pairs of objects. The distinction between similarity and dissimilarity data depends on the type of scale used:
Dissimilarity scale: low = high similarity, high = high dissimilarity.
Similarity scale: the opposite of a dissimilarity scale.
E.g. "On a scale of 1-9 (1 being the same and 9 completely different), how similar are chocolate bars A and B?" is a dissimilarity scale.
SPSS requires dissimilarity scales.

Data Collection for MDS (1)
Direct/raw data: proximity values directly obtained from empirical, subjective scaling, e.g. rating or ranking dis/similarities (Likert scales).
Indirect/derived data: computed from other measurements, such as correlations or confusion data (based on mistakes) (Davidson, 1983).
Data collection methods: pairwise comparison, grouping/sorting tasks, direct ranking, and objective methods (e.g. city distances).
Pairwise comparisons: all object pairs are presented in random order; # of pairs = n(n-1)/2, where n = # of objects/cases. This can be a tedious and inefficient process.

Types of MDS Models (1)
MDS models are classified according to:
1) Type of proximities. Metric/quantitative: quantitative information/interval data about objects' proximities, e.g. city distances. Non-metric/qualitative: qualitative information/nominal data about proximities, e.g. rank order.
2) Number of proximity matrices (distance or dis/similarity matrices). The proximity matrix is the input for MDS.
These criteria yield: 1) Classical MDS: one proximity matrix (metric or non-metric). 2) Replicated MDS: several matrices. 3) Weighted MDS/Individual Difference Scaling: aggregates proximities and individual differences in a common MDS space.

Types of MDS (2)
More typical in the social sciences is a classification of MDS based on the nature of the responses:
1) Decompositional MDS: subjects rate objects on an overall basis, an "impression," without reference to objective attributes. Produces a spatial configuration for each individual and a composite map for the group.
2) Compositional MDS: subjects rate objects on a variety of specific, pre-specified attributes (e.g. size). No maps for individuals, only composite maps.
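A minimal Python sketch of classical metric MDS on a precomputed distance matrix, assuming scikit-learn is installed; the four "cities" and their distances below are invented for illustration.

# MDS on a symmetric dissimilarity matrix (stand-in city distances).
import numpy as np
from sklearn.manifold import MDS

D = np.array([[0, 3, 5, 9],
              [3, 0, 4, 7],
              [5, 4, 0, 6],
              [9, 7, 6, 0]], dtype=float)

mds = MDS(n_components=2, dissimilarity='precomputed', random_state=0)
coords = mds.fit_transform(D)   # one (x, y) map location per city
print(coords)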
The MDS Model
Classical MDS uses Euclidean principles to model data proximities in geometrical space, where the distance d_ij between points i and j is defined as:
d_ij = sqrt( Σ_a (x_ia - x_ja)² )
where x_ia and x_ja are the coordinates of points i and j on dimension a, respectively. The modeled Euclidean distances are related to the observed proximities δ_ij by some transformation/function f. Most MDS models assume that the data have the form:
δ_ij = f(d_ij)
All MDS algorithms are variations of the above (Davidson, 1983).

Output of MDS
The MDS map (perceptual map, spatial representation) shows:
1) Clusters: groupings in the MDS spatial representation. These may represent a domain/subdomain.
2) Dimensions: hidden structures in the data; ordered groupings that explain the similarity between items. The axes are meaningless and the orientation is arbitrary. In theory there is no limit to the number of dimensions; in practice, the number of dimensions that can be perceived and interpreted is limited.

Diagnostics of MDS (1)
MDS attempts to find a spatial configuration X such that f(δ_ij) ≈ d_ij(X).
Stress (Kruskal's) function: measures the degree of correspondence between the distances among points on the MDS map and the matrix input; it is the proportion of variance of the disparities not accounted for by the model:
Stress-1 = sqrt( Σ_ij (d_ij - f(δ_ij))² / Σ_ij d_ij² )
Range 0-1: smaller stress = better representation. Non-zero stress means that some or all distances in the map are distortions of the input data. Rule of thumb: ≤ 0.1 is excellent; ≥ 0.15 is not tolerable.

Diagnostics of MDS (2)
R² (RSQ): the proportion of variance of the disparities accounted for by the MDS procedure. R² ≥ 0.6 is an acceptable fit.
Weirdness Index: the correspondence between a subject's map and the aggregate map, used for outlier identification. Range 0-1: 0 indicates that the subject's weights are proportional to the average subject's weights; as the subject's weights become more extreme, the index approaches 1.
Shepard Diagram: a scatterplot of input proximities (X-axis) against output distances (Y-axis) for every pair of items, with a step-line produced. If the map distances fall on the step-line, the input proximities are perfectly reproduced by the MDS model (dimensional solution).

Interpretation of Dimensions
Squeezing data into 2-D enables "readability" but may not be appropriate: it can give a poor, distorted representation of the data (high stress). A scree plot (stress vs. number of dimensions, e.g. for the city distances) helps here. The primary objective in dimension interpretation: obtain the best fit with the smallest possible number of dimensions.
How does one assign "meaning" to dimensions?

Meaning of Dimensions
Subjective procedures: labelling the dimensions by visual inspection, subjective interpretation, and information from respondents; "experts" evaluate and identify the dimensions.

Validating MDS Results
Split-sample comparison: the original sample is divided and a correlation between the variables is conducted.
Multi-sample comparison: a new sample is collected and a correlation is conducted between the old and new data.
Comparisons are done visually or with a simple correlation of coordinates or variables, assessing whether the MDS solution (dimensionality extraction) changes in a substantial way.

MDS Caveats
Respondents may attach different levels of importance to a dimension, and the importance of a dimension may change over time. Interpretation of dimensions is subjective. Generally, more than four times as many objects as dimensions should be compared for the MDS model to be stable.

"Advantages" of MDS
A dimensionality "solution" can be obtained from individuals, which gives insight into how individuals differ from aggregate data.
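The stress formula above can be checked directly against the map returned by the earlier MDS sketch. A hedged sketch for the metric case (where f is taken as the identity, so disparities equal the input proximities), assuming numpy and scipy:

# Kruskal's Stress-1 from an input dissimilarity matrix D and map coordinates.
import numpy as np
from scipy.spatial.distance import pdist, squareform

def stress1(D, coords):
    dhat = squareform(D)   # input proximities in condensed (pairwise) form
    d = pdist(coords)      # corresponding distances on the MDS map
    return np.sqrt(((dhat - d) ** 2).sum() / (d ** 2).sum())

# e.g. stress1(D, coords) with D and coords from the city-distance sketch;
# <= 0.1 is excellent, >= 0.15 is not tolerable (rule of thumb above).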
MDS reveals dimensions without the need for defined attributes, and the dimensions that emerge from MDS can be incorporated into regression analysis to assess their relationship with other variables.

"Disadvantages" of MDS
MDS provides a global measure of dis/similarity but does not provide much insight into subtleties (Street et al., 2001). Increased dimensionality is difficult to represent and decreases intuitive understanding of the data; the model of the data then becomes as complicated as the data itself. Determining the meanings of dimensions is subjective.

"SPSSing" MDS
• In the SPSS Data Editor window, click Analyze > Scale > Multidimensional Scaling.
• Select four or more variables that you want to test.
• You may select a single variable for the Individual Matrices for window (depending on the distances option selected).
• If the Data are distances option is selected (e.g. city distances), click the Shape button to define the characteristics of the dissimilarity/proximity matrices.
• If Create distance from data is selected, click the Measure button to control the computation of dissimilarities, to transform values, and to compute distances.
• In the Multidimensional Scaling dialog box, click the Model button to control the level of measurement, conditionality, dimensions, and the scaling model.
• Click the Options button to control the display options, iteration criteria, and treatment of missing values.

MDS: A Psychological Example
"Multidimensional scaling modelling approach to latent profile analysis in psychological research" (Ding, 2006).
Basic premise: utilize MDS to investigate types or profiles of people. "Profile" comes from applied psychology, where test batteries are used to extract and construct distinctive features/characteristics of people.
The MDS method was used to: derive profiles (dimensions) that could provide information regarding psychosocial adjustment patterns in adolescents, and assess whether individuals could follow different profile patterns than those extracted from group data, i.e. deviations from the derived normative profiles.

Study Details: Methodology
Participants: college students (mean age = 23 years, n = 208).
Instrument: Self-Image Questionnaire for Young Adolescents (SIQYA).
Variables: Body Image (BI), Peer Relationships (PR), Family Relationships (FR), Mastery & Coping (MC), Vocational-Educational Goals (VE), and Superior Adjustment (SA).
Three mental health measures of well-being: Kandel Depression Scale, UCLA Loneliness Scale, Life Satisfaction Scale.

Data for MDS
Scored data for MDS profile analysis; sample data for 14 individuals. BI = body image, PR = peer relations, FR = family relations, MC = mastery & coping, VE = vocational & educational goals, SA = superior adjustment, PMI-1 = profile match index for Profile 1, PMI-2 = profile match index for Profile 2, LS = life satisfaction, Dep = depression, PL = psychological loneliness.

The Analysis: Step by Step
Step 1: Estimate the number of profiles (dimensions) from the latent variables.
MDS map (Euclidean distance model): Kruskal's stress = 0.00478 (an excellent stress value), RSQ = 0.9998; configuration derived in 2 dimensions.
[Figure: scale values of the two MDS profiles (dimensions) in psychosocial adjustment, plotting the six variables (pr, ve, mc, sa, bi, fr) against Profile 1 and Profile 2.]
[Figure: normative profiles of psychosocial adjustment in young adults; each profile represents a prototypical individual.]
References
Davidson, M. L. (1983). Multidimensional scaling. New York: J. Wiley and Sons.
Ding, C. S. (2006). Multidimensional scaling modelling approach to latent profile analysis in psychological research. International Journal of Psychology, 41(3), 226-238.
Kruskal, J. B., & Wish, M. (1978). Multidimensional scaling. Sage.
Street, H., Sheeran, P., & Orbell, S. (2001). Exploring the relationship between different psychosocial determinants of depression: A multidimensional scaling analysis. Journal of Affective Disorders, 64, 53-67.
Takane, Y., Young, F. W., & de Leeuw, J. (1977). Nonmetric individual differences multidimensional scaling: An alternating least squares method with optimal scaling features. Psychometrika, 42(1), 7-67.
Young, F. W., Takane, Y., & Lewyckyj, R. (1978). Three notes on ALSCAL. Psychometrika, 43(3), 433-435.
http://www.analytictech.com/borgatti/profit.htm
http://www2.chass.ncsu.edu/garson/pa765/mds.htm
http://www.terry.uga.edu/~pholmes/MARK9650/Classnotes4.pdf

MDS in SPSS (a walkthrough of the SPSS dialogs, shown as screenshots)

A field map
Combine MDS with clustering methods: draw clusters on the MDS plot (e.g. using MS Paint) and identify cluster labels.

Mapping the field of IR
Author co-citation map in the field of Information Retrieval (1992-1997). Data: 1,466 IR-related papers were selected from 367 journals, with 44,836 citations.
Examples:
McCain, K. (1990). Mapping authors in intellectual space: A technical overview. Journal of the American Society for Information Science, 41(6), 433-443.
Ding, Y., Chowdhury, G., & Foo, S. (1999). Mapping the intellectual structure of information retrieval: An author cocitation analysis, 1987-1997. Journal of Information Science, 25(1), 67-78.

FACTOR ANALYSIS

Factor Analysis
The aim is to identify underlying variables, or factors, that explain the pattern of correlations within a set of observed variables. It is a data-reduction technique: it identifies a small number of factors that explain most of the variance observed in a larger number of variables.
Assumption: variables or cases should be independent (we can use correlations to check whether some variables are dependent).

Descriptive
The Coefficients option produces the R-matrix, and the Significance levels option produces a matrix of the significance value of each correlation in the R-matrix. You can also ask for the Determinant of this matrix; this option is vital for testing for multicollinearity or singularity.

Extraction
The scree plot, described earlier, is a useful way of establishing how many factors should be retained in an analysis. The unrotated factor solution is useful in assessing the improvement of interpretation due to rotation: if the rotated solution is little better than the unrotated solution, an inappropriate (or less than optimal) rotation method may have been used.

Rotation
The interpretability of factors can be improved through rotation. Rotation maximizes the loading of each variable on one of the extracted factors while minimizing the loading on all other factors. Rotation works by changing the absolute values of the variables while keeping their differential values constant. If you expect the factors to be independent, you should choose one of the orthogonal rotations (e.g. varimax).
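For readers outside SPSS, here is a minimal factor-analysis sketch in Python, assuming scikit-learn 0.24 or later (the version that added rotation='varimax'); the data array is a random stand-in for your own cases-by-variables matrix.

# Factor analysis with varimax rotation; interpret loadings as in the
# Result slide below (|loading| > 0.7 to name a factor, > 0.4 for members).
import numpy as np
from sklearn.decomposition import FactorAnalysis

X = np.random.rand(300, 8)     # stand-in cases-by-variables data
fa = FactorAnalysis(n_components=3, rotation='varimax').fit(X)

loadings = fa.components_.T    # variables (rows) by factors (columns)
for j in range(loadings.shape[1]):
    members = np.where(np.abs(loadings[:, j]) > 0.4)[0]
    print('Factor %d members (variable indices): %s' % (j + 1, list(members)))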
Score
This option allows you to save factor scores for each subject in the data editor. SPSS creates a new column for each factor extracted and places the factor score for each subject within that column. These scores can then be used for further analysis, or simply to identify groups of subjects who score highly on particular factors.

Options
SPSS lists variables in the order in which they were entered into the data editor. Although this format is often convenient, when interpreting factors it can be useful to list variables by size: by selecting Sorted by size, SPSS orders the variables by their factor loadings. There is also the option to Suppress absolute values less than a specified value (by default 0.1), which ensures that factor loadings within ±0.1 are not displayed in the output. This option is useful for assisting interpretation.

Extraction
It should be clear that the first few factors explain relatively large amounts of variance (especially factor 1), whereas subsequent factors explain only small amounts. SPSS then extracts all factors with eigenvalues greater than 1 (23 factors in this example). The eigenvalues associated with these factors are displayed in the columns labelled Extraction Sums of Squared Loadings.

Result
A factor and its members: use loadings with absolute value > 0.7 to interpret (name) the factor, and report variables with loadings > 0.4 as the members of the factor.

Scree Plot

Mapping the field of IR
Ding, Y., Chowdhury, G., & Foo, S. (1999). Mapping the intellectual structure of information retrieval: An author cocitation analysis, 1987-1997. Journal of Information Science, 25(1), 67-78.

Broader Thinking
Try other networks: a co-author network or a journal co-citation network. Try factor analysis on your own survey data. A small sketch for building a co-author network follows.
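As a starting point for the broader-thinking exercise, the sketch below builds a weighted co-author network from paper.tsv; it assumes the networkx package is installed and uses the column layout produced by conwos1.py (five author columns in columns 1-5).

# Build a weighted co-author network from paper.tsv.
import itertools
import networkx as nx

G = nx.Graph()
with open('paper.tsv') as f:
    for line in f:
        authors = [a for a in line.rstrip('\n').split('\t')[1:6] if a]
        for a, b in itertools.combinations(authors, 2):
            if G.has_edge(a, b):
                G[a][b]['weight'] += 1      # one more co-authored paper
            else:
                G.add_edge(a, b, weight=1)

print('%d authors, %d co-author ties' %
      (G.number_of_nodes(), G.number_of_edges()))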