ORGANIZING INFORMATION ASSETS

Understanding Taxonomies and More

As organizations purchase and create more and more digital content, employees find it increasingly difficult to find and reuse specific pieces of information in the growing quantities of unstructured information held in shared drives, intranets and portal sites. Embedded “search” functionality is expected of almost any software product, and many search engine vendors promise automatic, dynamic classification of content. Some go so far as to suggest that using a predefined taxonomy as a framework for tagging documents is unnecessary, and that content can be automatically classified “on the fly.”

What is the reality? What value does a taxonomic structure provide in search efficiency and retrieval precision?

The reality is that until computers can consistently and accurately recognize concepts, in addition to terms or character strings, using a taxonomy as a framework for categorizing documents will aid navigation and retrieval. Natural language searching and keyword searching return large numbers of results but can miss essential content that does not contain the specific terms being searched, such as articles and documents about concepts that are described in various ways. A taxonomy can be counted on to improve search precision, facilitate discovery when drilling down into a subject hierarchy and provide a window into the knowledge domain of the organization.
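To make the contrast between keyword matching and concept-based retrieval concrete, here is a minimal sketch in Python. The documents, the concept labels and the function names are all hypothetical, invented for illustration; a real deployment would use a much richer vocabulary and a proper indexing tool.

```python
# Hypothetical documents; document 2 is about a merger but never uses that word.
DOCS = {
    1: "Acme announces merger with Globex",
    2: "Initech to acquire rival in all-stock takeover",
    3: "Quarterly earnings rise on strong sales",
}

# Hypothetical controlled vocabulary: each concept lists the terms that signal it.
CONCEPT_TERMS = {
    "Mergers & Acquisitions": {"merger", "acquire", "acquisition", "takeover"},
    "Financial Results": {"earnings", "revenue", "profit"},
}


def tokens(text):
    """Split text into lowercase words with trailing punctuation removed."""
    return {word.strip(".,").lower() for word in text.split()}


def keyword_search(term):
    """Return ids of documents that literally contain the search term."""
    return sorted(i for i, text in DOCS.items() if term.lower() in tokens(text))


def tag(text):
    """Assign every concept whose vocabulary terms appear in the text."""
    words = tokens(text)
    return {concept for concept, terms in CONCEPT_TERMS.items() if words & terms}


def concept_search(concept):
    """Return ids of documents tagged with the given concept."""
    return sorted(i for i, text in DOCS.items() if concept in tag(text))


print(keyword_search("merger"))                  # only the literal match
print(concept_search("Mergers & Acquisitions"))  # also finds the "takeover" story
```

In this toy data set the keyword search finds only the document that contains the word "merger," while the concept search also returns the document that describes the same idea in different words.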
In this unit of the Information Professional Resource Center, the following resources will help you become better acquainted with the importance of taxonomies:
- Taxonomy FAQ
- Taxonomy options
- Importance of a taxonomy

Additional reading: http://www.sla.org/content/resources/infoportals/taxonomies.cfm

See also: White papers available at www.Factiva.com/collateral:
- How to Utilize Enterprise Information Architecture to Enable Enterprise Information Integration
- Making Solid Business Decisions through Intelligent Indexing Taxonomies

Information professionals bring unique skills to the area of information management and can play a pivotal role in designing an information architecture that will help knowledge workers quickly find the information they need for their work and leverage the intellectual assets of the organization.

Unit 4 – Organizing Information Assets
Factiva – A Dow Jones & Reuters Company

Taxonomy FAQ

How is taxonomy defined?
A taxonomy is a hierarchical structure for organizing information; the term also refers to the science of classification according to a predetermined system. It is commonly used in biology and other natural sciences to refer to a means for classifying a living organism in relation to other similar organisms. In a biological taxonomy, an organism occupies one specific place, while in a lexical system a concept term can be placed in more than one category if appropriate.

What is a controlled vocabulary?
A controlled vocabulary is a closed list of terms (or phrases) that is used for consistently indexing or labeling items admitted to content repositories. The controlled vocabulary terms are also used to increase relevance in the search process. An alphabetic list of terms, a hierarchical list of terms and a list of related terms are examples of controlled vocabularies. A taxonomy is a type of controlled vocabulary, generally a hierarchical view of the controlled vocabulary.
The controlled vocabulary can be thought of as the glue that binds together related content objects produced in various departments or business units across the organization.

What is the difference between a taxonomy and a thesaurus?
As noted above, a taxonomy is a set of terms arranged hierarchically, while a thesaurus typically shows relationships between terms, including broader terms, narrower terms, related terms, preferred terms and “use for” terms.

What is metadata?
Metadata is broadly defined as “data about data.” It includes subject indexing terms and other properties, such as author, language and date of creation, that make it easy to find a particular record or informational artifact.

How is a taxonomy developed?
Developing a taxonomy is a complex effort that involves, at a high level, defining the types of content housed in enterprise systems, identifying the vocabulary used to retrieve content and understanding the hierarchies and relationships between terms. A manual approach to developing a taxonomy typically involves subject matter experts:
- Examining a representative set of exemplary documents
- Extracting, classifying and organizing significant terms and concepts
- Mapping synonyms to terms and concepts
- Evaluating other term lists and selecting appropriate terms to be added to the taxonomy
- Working with content creators and users to agree on definitions
- Reaching agreement on terms to be included in the controlled vocabulary for indexing and for classification

Categorization software can speed up the process considerably by automating document analysis and term extraction. Exemplary documents are used as the basis for training the system and automatically generating rules that determine the categories into which content will be placed.

What is a faceted classification scheme?
Facets are various views or attributes of a topic or object.
A faceted classification scheme attempts to present all aspects of a subject in a way that enables the user to easily focus on the aspect(s) of interest, for example by clicking on a folder containing the sub-topic(s) of interest. When searching on a company name, a large number of results might be conveniently grouped into categories such as products, geography, competitors and intellectual property, depending on the body of content being searched.

What are the benefits of automated classification?
Automated classification is accomplished with software designed to rapidly scan document sets (or other content objects) and assign the objects to categories, sometimes according to an underlying taxonomy, although a taxonomy is not always a part of the software. The benefits are consistency in assigning categories, speed of processing and lower costs.

What are the benefits of manual classification?
Manual classification, done by individuals who are subject matter experts, tends to be accurate because human judgment typically overcomes the ambiguity of language. However, manual classification is usually a slow, laborious process and thus a costly one. It has also been found to be surprisingly subjective.

Hybrid systems take advantage of the speed of automated classification but use human expertise for creating and fine-tuning rules and for spot-checking the accuracy of the automated systems to provide optimal results.

What are success criteria? How do you know if a taxonomy project has been successful?
A good taxonomy allows users to 1) search the way they think and 2) quickly retrieve the information they are seeking, because the content has been analyzed, categorized and labeled according to a lexical scheme that is meaningful for a particular business environment. A taxonomy that reflects the language of the business should favorably impact knowledge worker productivity.
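As an illustration of the faceted classification scheme described earlier in this FAQ, the following Python sketch groups the results of a company-name search by facet. The result titles, the facet assignments and the function name are hypothetical; in practice the facet values would come from the metadata applied when the content was indexed.

```python
from collections import defaultdict

# Hypothetical search results for a company name, each tagged with facet values.
RESULTS = [
    {"title": "New widget line launched", "facets": {"products"}},
    {"title": "Expansion into Brazil", "facets": {"geography"}},
    {"title": "Patent granted for widget coating",
     "facets": {"intellectual property", "products"}},
    {"title": "Rival cuts prices", "facets": {"competitors"}},
]


def group_by_facet(results):
    """Map each facet to the titles filed under it.

    A result that carries several facets appears in several groups,
    which is what lets users approach the same item from different angles.
    """
    groups = defaultdict(list)
    for result in results:
        for facet in sorted(result["facets"]):
            groups[facet].append(result["title"])
    return dict(groups)


for facet, titles in sorted(group_by_facet(RESULTS).items()):
    print(f"{facet}: {titles}")
```

Note that the patent story is listed under both "products" and "intellectual property," mirroring the point made above that a concept in a lexical system can sit in more than one category.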
Developing a Taxonomy

Taxonomies can be custom built by information scientists and business consultants who specialize in this area. There are also taxonomies for many knowledge domains that can be licensed or purchased. Whether the taxonomy will be used as a search tool, as a Web site navigation aid or for tagging content in the content management system, it is crucial to devote time and effort to creating a robust structure that reflects the language of your business.

Think about the following options for developing a taxonomy as you plan the strategy for managing information in your organization:
- Build your own taxonomy, with in-house staff or with consulting assistance
- Build your own taxonomy using categorization software, with fine-tuning by subject matter experts
- License a taxonomy from an organization that has already created taxonomies closely related to your business and your industry, such as a content management software company, publisher, trade association or content aggregator (The taxonomy structure will need to be modified or built out to match your business content.)
- Implement a taxonomy that is in the public domain after adapting it to match your business content

Most organizations will not have to “start from scratch” to build a taxonomy. It is important to learn about efforts to organize information that are already underway across the enterprise, perhaps at the departmental or functional group level. For example, purchasing departments may have a standardized list of raw materials purchased, and information professionals likely have subject lists used for identifying external content in their collections. Classification hierarchies and other metadata generated for content management systems are valuable, as are database query logs.
Collecting and reusing keyword lists like these will provide the foundation for an organizational taxonomy and will shorten the development process.

Predefined taxonomies are available for many disciplines at varying levels of granularity or depth. These taxonomies result from extensive research as well as familiarity and long experience with content pertaining to an industry or discipline. The taxonomy builders typically employ best practices for categorizing content. A potential drawback is that these predefined taxonomies may have been built for managing collections of books or journals and thus lack vocabulary relating to business processes. This drawback can be overcome by selecting the portion(s) of the predefined taxonomy applicable to your environment; you can then revise sections of the taxonomy that do not exactly meet your needs and incorporate additional terms and concepts specific to your organization.

Software tools for creating taxonomies learn from representative “training” documents and suggest categories based on the content. These tools are becoming more sophisticated and more accurate. The underlying programs crawl designated content repositories and rely on analysis of word patterns, the occurrence of terms and complex business rules to group similar documents or exclude documents from a category. New training documents can be fed into the categorization engine; the software learns from the training documents and makes better decisions going forward. Ideally, the programs enable the taxonomy structure to work in tandem with search and retrieval capabilities and integrate seamlessly with other IT applications. If there are ECM (enterprise content management) systems in use in your organization, they may have taxonomy creation capabilities, or such modules may be easily added.
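The idea of software that learns categories from training documents can be illustrated with a deliberately simple Python sketch. It is an assumption-laden toy, not a description of how any commercial categorization engine works: it builds a term-frequency profile for each category from a few hypothetical labeled documents and suggests the category whose profile overlaps most with a new document. The category names, sample texts and function names are all invented for illustration.

```python
from collections import Counter


def tokenize(text):
    """Lowercase the text and strip simple punctuation from each word."""
    return [word.strip(".,;:").lower() for word in text.split()]


def train(training_docs, profiles=None):
    """Build (or extend) a term-frequency profile per category.

    training_docs is a list of (category, text) pairs. Passing existing
    profiles lets new training documents refine earlier results.
    """
    profiles = {} if profiles is None else profiles
    for category, text in training_docs:
        profiles.setdefault(category, Counter()).update(tokenize(text))
    return profiles


def suggest_category(profiles, text):
    """Suggest the category sharing the most term occurrences with the text.

    Returns None when no category shares any term with the text.
    """
    if not profiles:
        return None
    words = Counter(tokenize(text))
    scores = {
        category: sum(min(count, profile[word]) for word, count in words.items())
        for category, profile in profiles.items()
    }
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical training set.
TRAINING = [
    ("Purchasing", "supplier invoice for raw materials order"),
    ("Purchasing", "purchase order raw materials delivery"),
    ("Legal", "patent filing and trademark license agreement"),
]
profiles = train(TRAINING)

print(suggest_category(profiles, "new supplier for raw materials"))
print(suggest_category(profiles, "trademark license renewal"))
```

Feeding additional labeled documents through `train(more_docs, profiles)` updates the profiles, which loosely mirrors the way new training documents improve later decisions. Real tools add far more, such as business rules, exclusions and statistical weighting.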
Business users should be involved in the taxonomy development efforts, whether a manual effort or a software solution, to build awareness of the information organization efforts, to take advantage of their knowledge of specific areas of the business and, most important, to make sure the system developed will help them efficiently find the information they need. Some questions that should be explored in preliminary stages include:
- What types of items are stored in enterprise repositories? Can audio, video and image files be classified with content management software?
- Who will use this content? How do different user groups name content types?
- How quickly are content repositories growing? Can the search and classification software under consideration scale to handle rapidly growing volumes of information?
- Is the content in languages in addition to English? Can the software being considered handle non-English materials?

Companies with experience in developing and deploying a taxonomy find that not all parts of the taxonomy are relevant to all business units. However, a central taxonomy repository allows for the most efficient updating and maintenance.

Since a taxonomy is a living entity, developing it is only the beginning of the process. There must be a commitment to testing the taxonomy in varied applications and refining it as content evolves and as business conditions change. Regularly scheduled reviews to add or change sections of the taxonomy will keep it fresh and in sync with content changes.

Value of a Taxonomy

Why do organizations care about a taxonomy? There might not be significant interest in a taxonomy per se, but there is no doubt that organizations care about being able to efficiently find, retrieve and repurpose content from enterprise systems.
There is a frequently quoted statistic that more than 80% of information housed in corporate repositories is unstructured data. Even for small and mid-size companies, managing this amount of textual information is challenging. It is more challenging still for large global organizations. For companies embracing a market-driven strategy, as opposed to a product strategy, reducing time to market is as important as innovation and quality.

Systematically organizing information, especially unstructured information, is crucial for effectively managing the volume of content accumulating in all enterprises. A sound underlying structure is what enables the knowledge worker to quickly retrieve relevant information from the desktop environment. What may appear to the user as serendipitous discovery of useful information is actually made possible by careful preparation and processing of the content. A taxonomy for labeling content is a cornerstone of that infrastructure.

Microsoft recently reported a 40% improvement in hit rates and a doubling of satisfaction metrics using even a relatively primitive taxonomic system. Users spent significantly less time trying to find a given document. (http://www.it-director.com/article_pf.php?articleid=3757)

The lack of a common vocabulary and a robust taxonomy structure used across business units can undermine employee productivity and, ultimately, have a negative impact on the quality of customer service.
Benefits of an enterprise taxonomy include:
- The ability of one business group to leverage the product and industry expertise of other groups
- Reduced “reinventing the wheel” syndrome and duplication of effort in creating intellectual capital
- Surfacing hidden assets when the taxonomy is employed in navigation schemes as well as in search and retrieval programs
- A more unified way of serving clients
- A more satisfactory user experience

The same principles of design that are used for internal content repositories should be used for public Web sites, to ensure that clients and potential clients have a satisfactory experience and a positive impression of an organization based on the clarity, usability and accurate retrieval at the Web site.

Information professionals can make a huge contribution to taxonomy discussions by keeping the information-seeking behavior and information needs of users at the forefront of the discussions.