BRIEF OVERVIEW OF WEB ANALYSIS SOFTWARE

Sun-Ki Chai
Dept. of Sociology
University of Hawai`i at Manoa
sunki@hawaii.edu

This software contains a flexible, social network-oriented crawler that locates and analyzes a virtual community surrounding a particular interest area on the World Wide Web, and generates a large dataset containing a wide range of qualitative and quantitative information about each site in the community.

The program begins its crawl with a set of "seed" URLs and/or keywords provided by the user. As it proceeds, the program expands the crawl, using a combination of link relatedness criteria based on sociological network concepts and content filtering to incrementally identify and probe additional sites most likely to belong to the virtual community defined by the user. The exact criteria for expanding the crawl, as well as the halting criteria, are configurable by the user. (A minimal sketch of this expansion loop appears below, following the application examples.)

Once it halts, the software generates from its crawl a quantitative and qualitative database containing a wide variety of information about the sites in the virtual community, suitable for input into statistical and content analysis software. Among the types of variables included in the database are the following:

Network Analysis
- list and number of sites linking to each site
- list and number of sites linked to from each site

Content Analysis
- frequency list for all words appearing on each site
- frequency of user-specified words or phrases for each site
- title words on site
- description words on site
- size of site (number of words, pages, bytes)
- list and number of internal graphics and multimedia links
- character encoding (non-English sites)
- average age of site pages (where the server returns a date)
- date of site establishment*
- site category*

Popularity Analysis
- visit count for each site*
- survey-reported user ranking*

Geographical Analysis
- registered domain
- registering agent's address*

* Compiled from publicly accessible web databases

As mentioned, the software is meant to be extremely versatile and can be used to address a variety of topics across different disciplines. Here are some examples of possible applications for the data generated by the program:

(1) Examining how a site's popularity is influenced by its pattern of links to and from other sites within a virtual community. (This issue is examined in the demonstration paper within a set of sites on Korean-American affairs.)

(2) Using regression analysis to test which types of site content characteristics (subject matter, size, graphic intensity) are most significant in determining a site's popularity.

(3) Determining which types of other sites are most likely to link to a given site, and what types of interest groups these sites represent.

(4) Determining how closely sites representing two different interest groups are linked to one another (crawling over communities on two different topics, then examining the overlap between the final results; a sketch of such an overlap computation appears at the end of this overview).

The software is motivated by the existing and growing demand for a product that can provide a simple, quick, and inexpensive way for researchers to collect information on the nature of virtual communities and the sites that make them up. Research about the Internet, and particularly the World Wide Web, is perhaps the fastest-growing empirical field of study in the social sciences, and is gaining great prominence in the government policy and business arenas as well.
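To make the crawl-expansion logic described above concrete, the following is a minimal sketch in Python. All names here (fetch_page, extract_links, SEEDS, KEYWORDS, the relatedness threshold, and the MAX_SITES halting criterion) are hypothetical illustrations rather than the tool's actual interface, and the keyword-only relatedness score is a simple stand-in for the software's combination of link-based network criteria and content filtering.

    # Hypothetical sketch of a seed-based focused crawl; not the tool's actual API.
    from collections import deque

    SEEDS = ["http://example.org/site1"]       # user-supplied seed URLs (illustrative)
    KEYWORDS = {"korean", "american", "community"}  # user-supplied keywords (illustrative)
    RELATEDNESS_THRESHOLD = 0.2                # configurable expansion criterion
    MAX_SITES = 500                            # configurable halting criterion

    def relatedness(text):
        """Crude content filter: fraction of the keywords appearing in the page text."""
        words = set(text.lower().split())
        return len(KEYWORDS & words) / len(KEYWORDS)

    def crawl(fetch_page, extract_links):
        """Expand outward from the seeds, probing only pages that pass the filter.

        fetch_page(url) -> page text and extract_links(text) -> list of URLs
        are hypothetical helpers supplied by the caller.
        """
        frontier = deque(SEEDS)
        community = set()
        while frontier and len(community) < MAX_SITES:  # halting criteria
            url = frontier.popleft()
            if url in community:
                continue
            text = fetch_page(url)
            if relatedness(text) < RELATEDNESS_THRESHOLD:
                continue                        # judged outside the virtual community
            community.add(url)
            frontier.extend(extract_links(text))
        return community

The loop structure (frontier of candidate sites, admission test, incremental expansion, explicit halting condition) is the point of the sketch; the real software's admission test also weighs how a candidate site is linked to sites already admitted.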
In particular, researchers are eager to know more about the characteristics of a community of sites relating to a particular topic, the relationships between sites in such a community, and the type of audience to which such sites appeal. However, a notable feature of existing studies of the social structure of the Web is that few make any use of systematic quantitative or qualitative data, particularly data drawn from the Web itself. This shortage is not due to any lack of interest in systematic analysis within the academic, business, and government communities. Indeed, statistics, content analysis, and other systematic techniques are pervasive throughout each of the aforementioned professions. Nor is the problem the inaccessibility of information from which such data could be generated, since one of the notable characteristics of the Web is that much of the information on it is publicly accessible to anyone with an Internet connection. The real problem is the lack of a readily available means for extracting systematic data on virtual communities from available information and putting it into a usable form. The current software project seeks to provide a general-purpose solution to this problem.
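To illustrate the kind of analysis the generated dataset is meant to support, the following sketch computes two of the network variables listed above (the number of sites linking to and from each site) and the community-overlap measure from application (4). The edge-list input format and the site names are illustrative assumptions, not the software's actual output schema.

    # Illustrative analysis of crawl output; the input conventions are assumed.
    from collections import Counter

    def degree_counts(edges):
        """edges: iterable of (source_site, target_site) link pairs.

        Returns (in_degree, out_degree) counters, i.e., the number of sites
        linking to each site and linked to from each site.
        """
        in_deg, out_deg = Counter(), Counter()
        for src, dst in edges:
            out_deg[src] += 1
            in_deg[dst] += 1
        return in_deg, out_deg

    def community_overlap(sites_a, sites_b):
        """Jaccard overlap between two crawled communities (application 4)."""
        a, b = set(sites_a), set(sites_b)
        return len(a & b) / len(a | b) if (a | b) else 0.0

    # Example with made-up sites:
    edges = [("a.org", "b.org"), ("c.org", "b.org"), ("b.org", "a.org")]
    in_deg, out_deg = degree_counts(edges)
    print(in_deg["b.org"])                                            # 2 sites link to b.org
    print(community_overlap({"a.org", "b.org"}, {"b.org", "c.org"}))  # 0.333...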