Automated Generation of Metadata for Studying and Teaching about Africa from an Africancentric Perspective: Opportunities, Barriers and Signs of Hope

Abdul Karim Bangura

Abstract

As ironical as it may seem, most of what I know today about African history I learned in the West, and the opportunities availed to me to travel back and forth to conduct research in Africa were made possible by living in the United States. Yet, as I have demonstrated in a relatively recent work (Bangura, 2005), after almost three centuries of employing Western educational approaches, many African societies are still characterized by low Western literacy rates, civil conflicts and underdevelopment. It is obvious that these Western educational paradigms, which are not indigenous to Africans, have done relatively little good for Africans. Thus, I argued in that work that the salvation for Africans hinges upon employing indigenous African educational paradigms which can be subsumed under the rubric of ubuntugogy, which I defined as the art and science of teaching and learning undergirded by humanity towards others. Therefore, ubuntugogy transcends pedagogy (the art and science of teaching), andragogy (the art and science of helping adults learn), ergonagy (the art and science of helping people learn to work), and heutagogy (the study of self-determined learning). As I also noted, many great African minds, realizing the debilitating effects of the Western educational systems that have been forced upon Africans, have called for different approaches. One of the biggest challenges for studying and teaching about Africa in Africa at the higher education level, however, is the paucity of published material. Automated generation of metadata is one way of mining massive datasets to compensate for this shortcoming. Thus, this essay raises and addresses the following three major research questions: (1) What is automated generation of metadata and how can the technique be employed from an Africancentric perspective? (2) What are the barriers for employing this approach? (3) What signs are on the horizon that point to possibilities of overcoming these barriers? After addressing these questions, conclusions and recommendations are offered.

Keywords: Metadata, Data Mining, Information and Communication Technology (ICT), Africa, Africancentrism, Ubuntugogy

Introduction

While a great deal of attention has been paid to the “digital divide” within developed countries and between those countries and the developing ones, most Africans do not even have such luxury as access to books, periodicals, radio and television channels, which is precisely why information and communication technology (ICT) is so important to Africa. ICT has the potential to have a positive impact on Africa’s development. So, how can Africans transform that potential into reality? And how can Africans access that technology? Without access, that technology cannot do much for Africans—thus, the essence of digital technology. The digital often refers to the newest ICT, particularly the Internet. There are, of course, other more widely available forms of ICT, such as radio and telephones. But there are many problems concerning the generally abysmal state of networks of every kind on the continent that make it difficult to fully utilize the development potential of even this technology. Africa’s electrical grid is grossly inadequate, resulting in irregular or nonexistent electrical supplies.
The biggest problem is that in many countries, significant power distribution networks are non-existent in rural areas. Africa’s phone systems are spotty and often rely on antiquated equipment, and progress is hamstrung by bureaucracy and, in most instances, state-owned monopolies. But African governments have the power to alter these circumstances and, gradually, some are doing so. The signs of progress are unbelievable. A few years ago, only a couple of countries had Internet access. Today, all 54 countries and territories in Africa have permanent connections, and there is also rapidly growing public access provided by phone shops, schools, police stations, clinics, and hotels. Although Africa is becoming increasingly connected, access to the Internet is progressing at a limited pace. Of the 770 million people in Africa, only one in every 150, or approximately 5.5 million people in total, now uses the Internet. There is roughly one Internet user for every 200 people, compared to a world average of one user for every 15 people, and a North American and European average of about one in every two people. An Internet or E-mail connection in Africa usually supports a range of three to five users. The number of dial-up Internet subscribers now stands at over 1.3 million, up from around one million at the end of 2000. Of these, North Africa accounts for about 280,000 subscribers and South Africa accounts for 750,000 (Lusaka Information Dispatch, 2003). Kenya now has more than 100,000 subscribers and some 250 cyber cafes across the country (BBC, 2002). The widespread penetration of the Internet in Africa is still largely confined to the major cities, where only a minority of the total population lives. Most of the continent’s capital cities now have more than one Internet service provider (ISP); and in early 2001, there were about 575 public ISPs across the continent. Usage of the Internet in Africa is still considered a privilege for a few individuals, and most people have never used it (Lusaka Information Dispatch, 2003). In Zambia, for example, there are now about five ISPs, which include Zamnet, Microlink, Coppernet, Uunet, and the Zambia Telecommunication Service (ZAMTEL), which is government owned. Most people in Lusaka go to Internet cafes to check their E-mail rather than to surf the Internet to conduct research (Lusaka Information Dispatch, 2003). Indeed, ICT can play a substantial role in improving access to all forms of education (formal schooling, adult literacy, and vocational educational training) and in strengthening the economic and democratic institutions in African countries. It can also help to address the major issue of this essay: i.e. one of the biggest challenges for studying and teaching about Africa in Africa at the higher education level being the paucity of published material. I suggest in this paper that automated generation of metadata is one way of mining massive datasets to compensate for this shortcoming. Thus, this essay raises and addresses the following three major research questions: (1) What is automated generation of metadata and how can the technique be employed from an Africancentric perspective? (2) What are the barriers for employing this approach? (3) What signs are on the horizon that point to possibilities of overcoming these barriers? After addressing these questions, conclusions and recommendations are offered.

Automated Generation of Metadata

The capabilities of generating and collecting data, observed Alshameri (2006), have been increasing rapidly.
The computerization of many business and government transactions with the attendant advances in data collection tools, he added, has provided huge amounts of data. Millions of databases have been employed in business management, government administration, scientific and engineering management, and many other applications. This explosive growth in data and databases has generated an urgent need for new techniques and tools that can intelligently and automatically transform the processed data into useful information and knowledge (Chen et al., 1996). This essay explores the nature of data mining and how it can be used in doing research on African issues. Data mining is the task of discovering interesting patterns from large amounts of data, where the data can be stored in databases, data warehouses, or other information repositories. It is a young interdisciplinary field, drawing upon such areas as database systems, data warehousing, statistics, machine learning, data visualization, information retrieval, and high-performance computing. Other contributing areas include neural networks, pattern recognition, spatial data analysis, image databases, signal processing, and many application fields, such as business, economics, and bioinformatics. Data mining denotes a process of nontrivial extraction of implicit, previously unknown and potentially useful information (such as knowledge rules, constraints, regularities) from data in databases. The information and knowledge gained can be used for applications ranging from business management, production control, and market analysis to engineering design and scientific exploration. There are also many other concepts, appearing in some literature, carrying similar or slightly different definitions, such as knowledge mining from databases, knowledge extraction, data archaeology, data dredging, data analysis, etc. By knowledge discovery in databases, interesting knowledge, regularities, or high-level information can be extracted from the relevant sets of data in databases and be investigated from different angles, thereby serving as rich and reliable sources for knowledge generation and verification. Mining information and knowledge from large databases has been recognized by many researchers as a key research topic in database systems and machine learning, and by many industrial companies as an important area with an opportunity for major revenue generation. The discovered knowledge can be applied to information management, query processing, decision making, process control, and many other applications. Researchers in many different fields, including database systems, knowledge-based systems, artificial intelligence, machine learning, knowledge acquisition, statistics, spatial databases, and data visualization, have shown great interest in data mining. Furthermore, several emerging applications in information providing services, such as on-line services and the World Wide Web, also call for various data mining techniques to better understand user behavior in order to ameliorate the service provided and to increase business opportunities.

Mining Massive Data Sets

Recent years have witnessed an explosion in the amount of digitally-stored data, the rate at which data is being generated, and the diversity of disciplines relying upon the availability of stored data. Massive data sets are increasingly important in a wide range of applications, including observational sciences, product marketing, and the monitoring and operations of large systems.
Massive datasets are collected routinely in a variety of settings in astrophysics, particle physics, genetic sequencing, geographical information systems, weather prediction, medical applications, telecommunications, sensors, government databases, and credit card transactions. The nature of these data is not limited to a few esoteric fields, but, arguably, to the entire gamut of human intellectual pursuits, ranging from images on Web pages to exabytes (~10^18 bytes) of astronomical data from sky surveys (Hambrusch et al., 2003). There is a wide range of problems and application domains in science and engineering that can benefit from data mining. In several of these fields, techniques similar to data mining have been used for many years, albeit under different names (Kamath, 2001). For example, in the area of remote sensing, rivers and boundaries of cities have been identified using image understanding methods. Much of the use of data mining techniques in the past has been for data obtained from observations of experiments, as one-dimensional signals or two-dimensional images. These techniques, however, are increasingly attracting the attention of scientists involved in simulating complex phenomena on massively parallel computers. They realize that, among other benefits, the semi-automated approach of data mining can complement visualization in the analysis of massive datasets produced by the simulations. There are different areas which provide for the use of data mining. The following are some examples: (a) Astronomy and Astrophysics have long used data mining techniques such as statistics that aid in the careful interpretation of observations that are an integral part of astronomy. The data being collected from astronomical surveys are now being measured in terabytes (~10^12 bytes), because of the new technology of the telescopes and detectors. These datasets can be easily stored and analyzed by high-performance computers. Astronomy data present several unique challenges. For example, there is frequently noise in the data due to the sensors used for collecting data: atmospheric disturbances, etc. The data may also be corrupted by missing values or invalid measurements. In the case of images, identifying an object within an image may be non-trivial, as natural objects are frequently complex and image processing techniques based on the identification of edges or lines are inapplicable. Furthermore, the raw data, which are in high-dimensional space, must be transformed into a lower-dimensional feature space, resulting in a high pre-processing cost. The volumes of data are also large, further exacerbating the problem. All these characteristics, in addition to the lack of ground truth, make astronomy a challenging field for the practice of data mining (Grossman et al., 2001). (b) Biology, Chemistry, and Medicine—bioinformatics, chemical informatics and medical informatics are all areas where data mining techniques have been used for a while and are increasingly gaining acceptance. In bioinformatics, which is a bridge between biology and information technology, the focus is on the computational analysis of gene sequences (Cannataro et al., 2004). Here, the data can be gene sequences, expressions, or protein information. Expressions mean information on how the different parts of a sequence are activated, whereas protein data represent the biochemical and biophysical structures of the molecules.
Research in bioinformatics related to sequencing of the human genome evolved from analyzing the effects of individual genes to a more integrated view that examines whole ensembles of genes as they interact to form a living human being. In medicine, image mining is used in the analysis of images from mammograms, MRI scans, ultrasounds, DNA micro-arrays and X-rays for tasks such as identifying tumors, retrieving images with similar characteristics, detecting changes, and genomics. In addition to these tasks, data mining can be employed in the analysis of medical records. In the chemical sciences, the information overload problem is becoming staggering as well, with the Chemical Abstracts Service adding about 700,000 new compounds to its database each year. Chemistry data are usually obtained either by experimentation or by computer simulation. The need for effective and efficient data analysis techniques is also being driven by the relatively new field of combinatorial chemistry, which essentially involves reacting a set of starting chemicals in all possible combinations, thereby producing large datasets. Data mining is being used to analyze chemical datasets for molecular patterns and to identify systematic relationships between various chemical compounds. (c) Earth Sciences, Climate Modeling, and Remote Sensing are replete with data mining opportunities. They cover a broad range of topics, including climate modeling and analysis, atmospheric sciences, geographical information systems, and remote sensing. As in the case of astronomy, this is another area in which the vast volumes of data have resulted in the use of semi-automated techniques for data analysis. Earth science data can be particularly challenging from a practical viewpoint, and they come in many different formats, scales and resolutions. Extensive work is required to pre-process the data, including image processing, feature extraction, and feature selection. Suffice it to say that the volumes of earth sciences data are typically enormous, with the NASA Earth Observing System expected to generate over 11,000 terabytes of data upon completion. Much of these data are stored in flat files, not databases. (d) Computer Vision and Robotics are characterized by a substantial overlap. There are several ways in which the two fields can benefit each other. For example, computer vision applications can benefit from the accurate machine learning algorithms developed in data mining, while the extensive work done in image analysis and fuzzy logic for computer vision and robotics can be used in data mining as well, especially for applications involving images (Kamath, 2001). The applications of data mining methodologies in computer vision and robotics are quite diverse. They include automated inspection in industries for tasks such as detecting errors in semiconductor masks and identifying faulty widgets in assembly line productions; face recognition and tracking of eyes, gestures, and lip movements for problems such as lip-reading; automated television studios, video conferencing and surveillance, medical imaging during surgery as well as for diagnostic purposes, and vision for robot motion control. One of the key characteristics of the problems in computer vision and robotics is that they must be done in real time (Kamath 2001). In addition, the data collection and analysis can be tailored to the task being performed as the objects of interest are likely to be similar to one another.
(e) Engineering—with sensors and computers becoming ubiquitous and powerful, and engineering problems becoming more complex, there is a greater focus on gaining a better understanding of these problems through experiments and simulations. As a result, large amounts of data are being generated, providing an ideal opportunity for the use of data mining techniques in areas such as structural mechanics, computational fluid dynamics, material science, and the semi-conductor industry. Data from sensors are being used to address a variety of problems, including detection of land mines, identification of damage in aerodynamic systems (e.g., helicopters) or physical structures (e.g., bridges), and nondestructive evaluation in manufacturing quality control, to name just a few. In computer simulation, which is increasingly seen as the third mode of science, complementing theory and experiment, the techniques from data mining are yet to gain widespread acceptance (Marusic et al., 2001). Data mining techniques are also employed in studying the identification of coherent structures in turbulence. (f) Financial Data Analysis—most banks and other financial institutions offer a wide variety of banking services such as checking, savings, and business and individual customer transactions. Added to that are credit services like business mortgages and investment services such as mutual funds. Some also offer insurance and stock investment services. Financial data collected in the banking and financial industries are often relatively complete, reliable and of high quality, which facilitates systematic data analysis and data mining. Classification and clustering methods can be used for customer group identification and targeted marketing. For example, customers with similar behaviors regarding banking and loan payments may be grouped together by multidimensional clustering techniques (Han et al., 2001); a minimal illustrative sketch of such clustering appears at the end of this subsection. Effective clustering and collaborative filtering methods such as decision trees and nearest neighbor classification can help to identify customer groups, associate new customers with an appropriate customer group, and facilitate targeted marketing. Data mining can also be used to detect money laundering and other financial crimes by integrating information from multiple databases, as long as they are potentially related to the study. (g) Security and Surveillance comprise another active area for data mining methodologies. They include applications such as fingerprint and retinal identification, human face recognition, and character recognition in order to identify people and their signatures for access, law enforcement or surveillance purposes. Data mining techniques can also be used in tasks such as automated target recognition. The preceding areas have benefited from the scientific and engineering advances in data mining. Added to these are various technological areas that produce enormous amounts of data, such as high energy physics data from particle physics experiments that are likely to exceed a petabyte (~10^15 bytes) per year and data from the instrumentation of computer programs run on massively parallel machines that are too voluminous to be analyzed manually. What is becoming clear, however, is that the data analysis problems in science and engineering are getting more complex and more pervasive, giving rise to a great opportunity for the application of data mining methodologies. Some of these opportunities are discussed in the following subsection.
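As a minimal illustration of the multidimensional clustering mentioned under financial data analysis above, the following sketch (in Python, with invented customer figures and feature names; it illustrates the general technique, not any of the systems cited above) groups customers into behavioral clusters with a small k-means routine:

```python
# A minimal k-means sketch for grouping customers by behavior.
# The customer figures and features are invented for illustration only.
import random

def kmeans(points, k, iterations=20, seed=1):
    """Return k centroids and a cluster label for each point."""
    random.seed(seed)
    centroids = random.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach each point to its nearest centroid.
        for i, p in enumerate(points):
            labels[i] = min(range(k),
                            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])))
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for i, p in enumerate(points) if labels[i] == c]
            if members:
                centroids[c] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return centroids, labels

# Hypothetical customers described by (average monthly balance, share of loan payments made on time).
customers = [(120.0, 0.95), (150.0, 0.90), (2000.0, 0.40), (1800.0, 0.35), (130.0, 0.99)]
centroids, labels = kmeans(customers, k=2)
print(labels)   # two behavioral groups, e.g. [1, 1, 0, 0, 1]
```

Targeted marketing would then treat each cluster as a customer group, and a new customer could be associated with the group whose centroid is nearest.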
Requirements and Challenges of Mining Massive Data Sets

In order to conduct effective data mining, one needs to first examine what kind of features an applied knowledge discovery system is expected to have and what kind of challenges one may face in the development of data mining techniques. The following are some of the challenges: (a) Handling of different types of high-dimensional data. Since there are many kinds of data and databases used in different applications, one may expect that a knowledge discovery system should be able to perform effective data mining on different kinds of data. Massive databases contain complex data types, such as structured data and complex data objects, hypertext and multimedia data, spatial and temporal data, remote sensing data, transaction data, legacy data, etc. These data are typically high-dimensional, with attributes numbering from a few hundred to the thousands. There is an urgent demand for new techniques for data retrieval and representation, new probabilistic and statistical models for high-dimensional indexing, and database querying methods. A powerful system should be able to perform effective data mining on such complex types of data as well. (b) Efficiency and scalability of data mining algorithms. With the increasing size of data, there is a growing appreciation for algorithms that are scalable. To effectively extract information from a huge amount of data in databases, the knowledge discovery algorithms must be efficient and scalable to large databases. Scalability refers to the ability to use additional resources such as the central processing unit (CPU) and memory in an efficient manner to solve increasingly larger problems. It describes how the computational requirements of an algorithm grow with problem size. (c) Usefulness, certainty and expressiveness of data mining results. Scientific data, especially data from observations and experiments, are noisy. Removing the noise from data, without affecting the signal, is a challenging problem in massive datasets. Noise, missing or invalid data, and exceptional data should be handled elegantly in data mining systems. The discovered knowledge should accurately portray the contents of the database and be useful for certain applications. (d) Building reliable and accurate models and expressing the results. Different kinds of knowledge can be discovered from a large amount of data. These discovered kinds of knowledge can be examined from different views and presented in different forms. This requires the researcher to build a model that reflects the characteristics of the observed data and to express both the data mining requests and the discovered knowledge in high-level languages or graphical user interfaces, so that the discovered knowledge can be understandable and directly usable. (e) Mining distributed data. The widely available local and wide-area computer networks, including the Internet, connect many sources of data and form huge distributed heterogeneous databases, such as the text data that are distributed across various Web servers or astronomy data that are distributed as part of a virtual observatory. On the one hand, mining knowledge from different sources of formatted or unformatted data with diverse data semantics poses new challenges to data mining. On the other hand, data mining may help disclose the high-level data regularities in heterogeneous databases which can hardly be discovered by simple query systems.
Moreover, the huge size of the database, the wide distribution of data, and the computational complexity of some data mining methods motivate the development of parallel and distributed data mining algorithms. (f) Protection of privacy and data security. When data can be viewed from many different angles and at different abstraction levels, this can threaten the goal of ensuring data security and guarding against the invasion of privacy (Chen et al., 1996). It is important to study when knowledge discovered may lead to an invasion of privacy and what security measures can be developed to prevent the disclosure of sensitive information. (g) Size and type of data. Science datasets range from moderate to massive, with the largest being measured in terabytes. As more complex simulations are performed and observations over long periods at higher resolution are conducted, the data will grow to the petabyte range. Data mining infrastructure should support the rapidly increasing data volume and the variety of data formats that are used in the scientific domain. (h) Data visualization. The complexity and noise of massive data affect data visualization. Scientific data are collected from various sources by using different sensors. Data visualization is needed to use all available data to enhance an analysis. Unfortunately, a difficult problem may emerge when data are collected at different resolutions, using different wavelengths, under different conditions, with different sensors (Kamath, 2001). Collaborations between computer scientists and statisticians are yielding statistical concepts and modeling strategies to facilitate data exploration and visualization. For example, recent work in multivariate data analysis involves ranking multidimensional observations based on their relative importance for information extraction and modeling, thereby contributing to the visualization of high-dimensional objects such as cell gene expression profiles and images.

Mining Spatial Databases

The study and development of data mining algorithms for spatial databases are motivated by the large amount of data collected through remote sensing, medical equipment, and other instruments. Managing and analyzing spatial data have become an important issue due to the growth of the applications that deal with geo-referenced data. A spatial database stores a large amount of space-related data, such as maps, pre-processed remote sensing or medical imaging data. Spatial databases have many features distinguishing them from relational databases. They carry topological and/or distance information, usually organized by sophisticated, multidimensional spatial indexing structures that are accessed by spatial data access methods and often require spatial reasoning, geometric computation, and spatial knowledge representation techniques. Another difference is the query language that is employed to access spatial data. The complexity of the spatial data type is another important feature (Palacio et al., 2003). The explosive growth in data and databases used in business management, government administration and scientific data analysis has created the need for tools that can automatically transform the processed data into useful information and knowledge. Spatial data mining is a subfield of data mining that deals with the extraction of implicit knowledge, spatial relationships, or other interesting patterns not explicitly stored in spatial databases (Koperski et al., 1995).
Such mining demands an integration of data mining with spatial database technologies. It can be used for understanding spatial data, discovering spatial relationships and relationships between spatial and non-spatial data, constructing spatial knowledge databases, reorganizing spatial databases, and optimizing spatial queries. It is expected to have wide applications in geographic imaging, navigation, traffic control, environmental studies, and many other areas where spatial data are employed (Han et al., 2001). A crucial challenge to spatial data mining is the exploration of efficient spatial data mining techniques due to the huge amount of spatial data and the complexity of spatial data types and spatial access methods. Challenges in spatial data mining arise from different issues. First, classical data mining is designed to process numbers and categories, whereas spatial data are more complex and include extended objects such as points, lines, and polygons. Second, while classical data mining works with explicit inputs, spatial predicates and attributes are often implicit. Third, classical data mining treats each input independently of other inputs, while spatial patterns often exhibit continuity and high autocorrelation among nearby features (Shekhar et al., 2002).

Related Work

Statistical spatial data analysis has been a popular approach used to analyze spatial data. This approach handles numerical data well and usually suggests realistic models of spatial phenomena. Different methods, algorithms and applications for knowledge discovery in spatial data mining have been created. Classification of spatial data has been analyzed by some researchers. A method for classification of spatial objects has been proposed by Ester et al. (1997). Their proposed algorithm is based on the ID3 algorithm, and it uses the concept of neighborhood graphs. It considers not only non-spatial properties of the classified objects, but also non-spatial properties of neighboring objects: objects are treated as neighbors if they satisfy the neighborhood relations. Ester et al. (2000) also define topological relations as those which are invariant under topological transformations. If both objects are rotated, translated, or scaled simultaneously, the relations are preserved. These scholars present a definition of topological relations derived from the nine-intersection model: i.e. the topological relations between two objects are (1) disjoint, (2) meet, (3) overlap, (4) equal, (5) cover, (6) covered-by, (7) contain, and (8) inside. The second type of relations comprises distance relations, which compare the distance between two objects with a given constant using arithmetic operators like <, >, and =. The distance between objects is defined as the minimum distance between them. The third type of relations they define are the direction relations. They define a direction relation A R B of two spatial objects using one representative point of the object A and all points of the destination object B. It is possible to define several possibilities of direction relations depending on the points that are considered in the source and the destination objects. The representative point of a source object may be the center of the object or a point on its boundary. The representative point is used as the origin of a virtual coordinate system, and its quadrants define the directions.
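These distance and direction relations can be made concrete with a minimal sketch (in Python, using point objects and invented coordinates; real spatial objects such as lines and polygons would require geometric libraries and minimum-distance computations):

```python
# A minimal sketch of distance and direction relations between two point objects.
# Coordinates and the distance threshold are invented for illustration only.
import math

def distance_relation(a, b, constant, op):
    """Compare the distance between two point objects with a given constant."""
    d = math.dist(a, b)
    return {"<": d < constant, ">": d > constant, "=": d == constant}[op]

def direction_relation(source, destination):
    """Use the source's representative point as the origin of a virtual
    coordinate system; the quadrant of the destination gives the direction."""
    dx = destination[0] - source[0]
    dy = destination[1] - source[1]
    ns = "north" if dy > 0 else "south" if dy < 0 else ""
    ew = "east" if dx > 0 else "west" if dx < 0 else ""
    return (ns + ew) or "same_location"

# Two hypothetical geo-referenced sites (x, y in kilometres).
site_a, site_b = (0.0, 0.0), (3.0, 4.0)
print(distance_relation(site_a, site_b, 10.0, "<"))   # True: the sites lie within 10 km
print(direction_relation(site_a, site_b))             # "northeast": site_b lies north-east of site_a
```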
Fayyad et al. (1996) used decision tree methods to classify images of stellar objects to detect stars and galaxies. About three terabytes of sky images were analyzed. Similar to the mining of association rules in transactional and relational databases, spatial association rules can be mined in spatial databases. Spatial association describes the spatial and non-spatial properties which are typical for the target objects but not for the whole database (Ester et al., 2000). Koperski et al. (1995) introduced spatial association rules that describe associations between objects based on spatial neighborhood relations. An example can be the following:

is_a(X, “African_countries”) ^ receiving(X, “Western_aid”) → highly(X, “corrupt”) [0.5%, 90%]

This rule states that 90% of African countries receiving Western aid are also highly corrupt, and 0.5% of the data belongs to such a case. Spatial clustering identifies clusters or densely populated regions according to some distance measurement in a large, multidimensional dataset. There are different methods for spatial clustering, such as the k-medoid clustering algorithms (Ng et al., 1994) and the Generalized Density Based Spatial Clustering of Applications with Noise (GDBSCAN) algorithm that relies on a density-based notion of clusters (Sander et al., 1998). Visualizing large spatial datasets has become an important issue due to the rapidly growing volume of spatial datasets, which makes it difficult for a human to browse such datasets. Shekhar et al. (2002) have constructed a Web-based visualization software package for observing the summarization of spatial patterns and temporal trends. The visualization software will help users gain insight and enhance their understanding of large datasets.

Mining Text Databases

Text databases consist of large collections of documents from various sources, such as news articles, research papers, books, digital libraries, E-mail messages, and Web pages. Text databases are rapidly growing due to the increasing amount of information available in electronic forms, such as electronic publications, E-mail, CD-ROMs, and the World Wide Web (which can also be considered as a huge interconnected dynamic text and multimedia database). Data stored in most text databases are semi-structured data in that they are neither completely unstructured nor completely structured. For example, a document may contain a few structured fields, such as a title, author’s name(s), publication date, length, category, etc., and also contain some largely unstructured text components, such as an abstract and contents. Traditional information retrieval techniques have become inadequate for the increasingly vast amounts of text data (Han et al., 2001). Typically, only a small fraction of the many available documents will be relevant to a given individual user. Without knowing what could be in the documents, it is difficult to formulate effective queries for extracting and analyzing useful information from the data. Users need tools to compare different documents, rank the importance and relevance of the documents, or find patterns and trends across multiple documents. Thus, text mining has become an increasingly popular and essential theme in data mining. Text mining has also emerged as a new research area of text processing. It focuses on the discovery of new facts and knowledge from large collections of texts that do not explicitly contain the knowledge to be discovered (Gomez et al., 2001). The goals of text mining are similar to those of data mining, since it attempts to find clusters, uncover trends, discover associations, and detect deviations in a large set of texts.
Text mining has also adopted techniques and methods of data mining, such as statistical techniques and machine learning approaches. Text mining helps one to dig out the hidden gold from textual information, and it leaps from old-fashioned information retrieval to information and knowledge discovery (Dorre et al., 1999).

Basic Measures for Text Retrieval

Information retrieval is a field that has been developing in parallel with database systems for many years. Unlike the field of database systems, however, which has focused on query and transaction processing of structured data, information retrieval is concerned with the organization and retrieval of information from a large number of text-based documents. A typical information retrieval problem is to locate relevant documents based on user input, such as keywords or example documents. This type of information retrieval system includes online library catalog systems and online document management systems (Berry et al., 1999). It is vital to know how accurate or correct a text retrieval system is in retrieving documents based on a query. The set of documents relevant to a query can be called “{Relevant},” whereas the set of documents retrieved is denoted as “{Retrieved}.” The set of documents that are both relevant and retrieved is denoted as “{Relevant} ∩ {Retrieved}.” There are two basic measures for assessing the quality of a retrieval system: (1) precision and (2) recall (Berry et al., 1999). The precision of a system is the ratio of the number of relevant documents retrieved to the total number of documents retrieved. In other words, it is the percentage of retrieved documents that are in fact relevant to the query—i.e. the correct response. Precision can be represented as follows:

Precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|

The recall of a system is the ratio of the number of relevant documents retrieved to the total number of relevant documents in the collection. Stated differently, it is the percentage of documents that are relevant to the query and were retrieved. Recall can be represented the following way:

Recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
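These two measures can be illustrated with a minimal sketch (in Python, over hypothetical document identifiers):

```python
# A minimal sketch of the precision and recall measures defined above,
# computed over hypothetical document identifiers.
relevant = {"doc1", "doc2", "doc3", "doc4"}    # documents relevant to the query
retrieved = {"doc2", "doc3", "doc5"}           # documents the system returned

hits = relevant & retrieved                    # {Relevant} intersected with {Retrieved}
precision = len(hits) / len(retrieved)         # share of retrieved documents that are relevant
recall = len(hits) / len(relevant)             # share of relevant documents that were retrieved

print(precision)   # 0.666...: 2 of the 3 retrieved documents are relevant
print(recall)      # 0.5: 2 of the 4 relevant documents were retrieved
```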
Word Similarity

Information retrieval systems support keyword-based and/or similarity-based retrieval. In keyword-based information retrieval, a document is represented by a string, which can be identified by a set of keywords. A good information retrieval system should consider synonyms when answering the queries. For example, synonyms such as “automobile” and “vehicle” should be considered when searching the keyword “car.” There are two major difficulties with a keyword-based system: (1) synonymy and (2) polysemy. In a synonymy problem, keywords such as “software product” may not appear anywhere in a document, even though the document is closely related to a software product. In a polysemy problem, a keyword such as “regression” may mean different things in different contexts. The similarity-based retrieval system finds similar documents based on a set of common keywords. The output of such retrieval should be based on the degree of relevance, where relevance is measured in terms of the closeness and relative frequency of the keywords. A text retrieval system often associates a stop list with a set of documents. A stop list is a set of words that are deemed “irrelevant.” For instance, “a,” “the,” “of,” “for,” “with,” etc. are stop words, even though they may appear frequently. The stop list depends on the document set itself: for example, “artificial intelligence” could be an important keyword in a newspaper; it may, however, be considered a stop word in research papers presented at an artificial intelligence conference. A group of different words may share the same word stem. A text retrieval system needs to identify groups of words in which the words in a group are small syntactic variants of one another, and collect only the common word stem per group. For example, the group of words “drug,” “drugged,” and “drugs” share a common word stem, “drug,” and can be viewed as different occurrences of the same word. Pantel et al. (2002) computed the similarity among a set of documents or between two words, wi and wj, using the cosine coefficient of their mutual information vectors:

sim(wi, wj) = Σc [mi(wi, c) × mi(wj, c)] / √[Σc mi(wi, c)² × Σc mi(wj, c)²]

where mi(w, c) is the positive mutual information between a context c and the word w, and Fc(w) is the frequency count of the word w occurring in context c:

mi(w, c) = (Fc(w) / N) / [(Σi Fi(w) / N) × (Σj Fc(j) / N)]

where N = Σi Σj Fi(j) is the total frequency count of all words and their contexts.

Related Work

Text mining and applications of data mining to structured data derived from texts have been the subjects of many research projects in recent years. Most text mining has used natural language processing to extract key terms and phrases directly from documents. Pantel et al. (2002) have proposed a clustering algorithm, Clustering By Committee (CBC), in which the centroid of a cluster is constructed by averaging the feature vectors of a subset of the cluster members. The subset is viewed as a committee that determines which other elements belong to the cluster. Pantel and his partners divided the algorithm into three phases. In the first phase, they found top similar elements. To compute the top similar words of a word w, they sorted w’s features according to their mutual information with w. They computed only the pairwise similarities between w and the words that share high mutual information features with w. In the second phase, they found committees—a set of recursively tight clusters in the similarity space—and other elements that are not covered by any committee. An element is said to be covered by a committee if that element’s similarity to the centroid of the committee exceeds some high similarity threshold. Assigning elements to clusters is the last phase of the CBC algorithm. In this phase, every element is assigned to the cluster containing the committee to which it is most similar. Wong et al. (1999) designed a text association system based on ideas from information retrieval and syntactic analysis. In their system, the corpus of narrative text is fed into a text engine for topic extractions, and then the mining engine reads the topics from the text engine and generates topic association rules which are sent to the visualization system for further analysis. There are two text engines developed on this system. The first one is word-based and results in a list of content-bearing words for the corpus. The second one is concept-based and results in concepts based on the corpus. Dhillon et al. (2001) designed a vector space model to obtain a highly efficient process for clustering very large collections exceeding 100,000 documents in a reasonable amount of time on a single processor.
They used efficient and scalable data structures such as local and global hash tables. In addition, a highly efficient and effective spherical k-means algorithm was used, since both the document and concept vectors lie on the surface of a high-dimensional sphere. The basic idea of the vector space model is to represent each document as a vector of certain weighted word frequencies. Each vector component reflects the importance of a particular term in representing the semantics or meaning of that document. The vectors for all documents in a database are stored as the columns of a single matrix (Berry et al., 1999). A database containing a total of d documents described by t terms is represented as a t x d term-by-document matrix A. The d vectors representing the d documents form the columns of the matrix. Thus, the matrix element aij is the weighted frequency at which the term i occurs in document j. In the vector space model, the columns of A are the document vectors, whereas the rows of A are the term vectors. To create the vector space model, several parsing and extraction steps are needed: extract all unique words from the entire set of documents; eliminate all “stop words” such as “a,” “and,” “the,” etc.; for each document, count the number of occurrences of each word; eliminate “high-frequency” and “low-frequency” non-content-bearing words by using heuristic or information-theoretic criteria; and, finally, assign a unique identifier between one and w to each remaining word and a unique identifier between one and d to each document. The geometric relationships between document vectors, and also between term vectors, can help to identify similarities and differences in document content and in term usage. In the vector space model, in order to find the relevant documents, the user queries the database by using the vector space representation of those documents. Query matching consists of finding the documents most similar to the query in use, based on the weighted terms. In the vector space model, the documents selected are those geometrically closest to the query according to some measure.

Mining Remote Sensing Data

The data volumes of remote sensing are rapidly growing. The National Aeronautics and Space Administration’s (NASA) Earth Observing System (EOS) program alone produces massive data products with total rates of more than 1.5 terabytes per day (King et al., 1999). Applications and products of Earth observing and remote sensing technologies have been shown to be crucial to global social, economic and environmental well-being (Yang et al., 2001). In order to help scientists search massive remotely sensed databases, find data of interest to them, and then order the selected datasets or subsets, several information systems have been developed for data ordering purposes to face the challenges of the rapidly growing volumes of data, since the traditional method, in which a user downloads data and uses local tools to study the data residing on a local storage system, is no longer workable. To find interesting data, scientists need an effective and efficient way to search through the data. Metadata are provided in a database to support data searching by commonly used criteria such as spatial coverage, temporal coverage, spatial resolution, and temporal resolution. Since metadata search itself may still result in large amounts of data, some textual restrictions, such as keyword searches, could be employed for interdisciplinary researchers to select data of interest to them.
The usual data selection procedure is to specify a spatial/temporal range and to see what datasets are available under those conditions. Yang and his colleagues (1998) developed a distributed information system, the Seasonal International Earth Science Information Partner (SIESIP), which is a federated system that provides services for data searching, browsing, analyzing and ordering. The system will provide not only data, but also data visualization, analysis and user support capabilities. The SIESIP system is a multi-tiered, client-server architecture, with three physical sites or nodes, distributing tasks in the areas of user service, access to data and information products, archiving as needed, ingest and interoperability options, and other aspects. This architecture can serve as a model for many distributed Earth system science data systems. There are three phases of user interaction with the data and information system; each phase can be followed by other phases or it can be conducted independently. Phase 1, Metadata Access: using the metadata and browse images provided by the SIESIP system, the user browses the data holding. Metadata knowledge is incorporated into the system, and a user can issue queries to explore this knowledge. Phase 2, Data Discovery/Online Data Analysis: the user gets a quick estimate of the type and quality of data found in Phase 1. Analytical tools are then utilized as needed, such as statistical functions and visualization algorithms. Phase 3, Data Order: after the user locates the dataset of interest, s/he is now ready to order datasets. If the data are available through SIESIP, the information system will handle the data order; otherwise, an order will be issued, on behalf of the user, to the appropriate data provider, such as the Earth Science Distributed Active Archive Center (GES DAAC), or the necessary information will be forwarded to the user for further action. The database management system is used to handle catalogue metadata and statistical summary data. The database system supports two major kinds of queries. The first query is used to find the right data files for analysis and ordering based on catalogue metadata only. The second one queries data contents, which are supported by the statistical summary data. Data mining techniques help scientific data users not only in finding rules or relations among different data but also in finding the right datasets. With the rapid growth of massive data volumes, scientists need a fast way to search for the data in which they are interested. In this case, scientists need to search data based not only on metadata but also on actual data values. The main goal of the data mining techniques in the SIESIP system is to find spatial regions and/or temporal ranges over which parameter values fall into certain ranges. The main challenge is the speed and the accuracy, because they affect each other inversely. Different data mining techniques can be applied to remote sensing data. Classification of remotely sensed data is used to assign class labels corresponding to groups with homogeneous characteristics, with the aim of discriminating multiple objects from one another within the image. Several methods exist for remote sensing image classification; they include both traditional statistical or supervised approaches and unsupervised approaches that usually employ artificial neural networks.
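As a minimal illustration of a supervised approach, the following sketch (in Python, with invented band reflectances and land-cover class names; it is a generic nearest-centroid rule, not any of the systems cited above) assigns each pixel to the class whose training mean it is closest to:

```python
# A minimal sketch of supervised classification of remotely sensed pixels:
# each pixel is a vector of band reflectances, and a nearest-centroid rule
# assigns it to the land-cover class whose training mean it is closest to.
# Band values and class names are invented for illustration only.

def centroid(samples):
    """Mean vector of a list of equally sized tuples."""
    return tuple(sum(band) / len(samples) for band in zip(*samples))

def classify(pixel, class_centroids):
    """Return the class whose centroid is nearest to the pixel (squared distance)."""
    return min(class_centroids,
               key=lambda cls: sum((p - c) ** 2 for p, c in zip(pixel, class_centroids[cls])))

# Hypothetical training pixels: (red, near-infrared) reflectances per class.
training = {
    "water":      [(0.02, 0.01), (0.03, 0.02)],
    "vegetation": [(0.05, 0.45), (0.06, 0.50)],
    "bare_soil":  [(0.25, 0.30), (0.30, 0.35)],
}
class_centroids = {cls: centroid(samples) for cls, samples in training.items()}

print(classify((0.04, 0.48), class_centroids))   # "vegetation"
print(classify((0.28, 0.33), class_centroids))   # "bare_soil"
```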
Mining Astronomical Data

Astronomy has become an immensely data-rich field, with numerous digital sky surveys across a range of wavelengths and many terabytes of pixels and billions of detected sources, often with tens of measured parameters for each object. The problem with astronomical databases is not only their very large size; the variable quality of the data and the nature of astronomical objects, with their very wide dynamic range in apparent luminosity and size, present additional challenges. The great changes in astronomical data enable scientists to map the universe systematically, and in a panchromatic manner. Scientists can study the galaxy and the large-scale structure in the universe statistically and discover unusual or new types of astronomical objects and phenomena (Brunner et al., 2002). Astronomical data and their attendant analyses can be classified into the following five domains: (1) Imaging data are the fundamental constituents of astronomical observations, capturing a two-dimensional spatial picture of the universe within a narrow wavelength region at a particular epoch or instance of time. (2) Catalogs are generated by processing the imaging data. Each detected source can have a large number of measured parameters, including coordinates, various flux quantities, morphological information, and areal extent. (3) Spectroscopy, polarization, and other follow-up measurements provide detailed physical quantification of the target systems, including distance information, chemical composition, and measurements of the physical fields present at the source. (4) Studying the time domain provides important insights into the nature of the universe by identifying moving objects, variable sources, or transient objects. (5) Numerical simulations are theoretical tools which can be compared with observational data. Handling and exploring these vast new data volumes, and actually making real scientific discoveries, pose considerable technical challenges. Many of the necessary techniques and software packages, including artificial intelligence techniques like neural networks and decision trees, have already been successfully applied to astronomical problems such as pattern recognition and object classification. Clustering and data association algorithms have also been developed. As early as 1936, Edwin Hubble established a system to classify galaxies into three fundamental types. First, elliptical galaxies have an elliptical shape with no other discernible structure. Second, spiral galaxies have an elliptical nucleus surrounded by a flattened disk of stars and dust containing a spiral pattern of brighter stars. And third, irregular galaxies have irregular shapes and do not fit into the other two categories. Humphrey and his partners (2001) created an automated classification system for astronomical data. They visually classified 1,500 galaxy images obtained from the Automated Plate Scanner (APS) database in the region of the north galactic pole. Taking the size and brightness of the galaxy images into consideration, images that were difficult to classify were removed from this sample. Grossman and his colleagues (2001) developed an application which simultaneously works with two geographically distributed astronomical source catalogs: (1) the Two Micron All Sky Survey (2MASS) and (2) the Digital Palomar Observatory Sky Survey (DPOSS). The 2MASS data are in the infrared range, whereas the DPOSS data are in the optical wavelengths.
These scientists created a virtual observatory supporting the statistical analysis of many millions of stars and galaxies with data coming from both surveys. By using the data space transfer protocol (DSTP), a platform-independent way to share data over a network, they built a query for finding all pairs from the DPOSS and 2MASS catalogs. The query is visualized by coloring the data red if they appear in 2MASS, blue if they appear in DPOSS, or magenta if they appear in both surveys. The client DSTP application formulates the fuzzy join and sends the resulting stars and galaxies back to the client application. The DSTP protocol enables an application in one location to locate, access, and analyze data from several other locations.

Mining Bioinformatics Data

Bioinformatics is described by Cannataro and his colleagues (2004) as a bridge between the life sciences and computer science. It has also been described by Barker and his associates (2004) as a cross-disciplinary field in which biologists, computer scientists, chemists, and mathematicians work together, each bringing a unique point of view. The term bioinformatics has a range of interpretations, but the core activities of bioinformatics are widely acknowledged: storage, organization, retrieval and analysis of biological data obtained by experiments or by querying databases. The increasing volume of biological data collected in recent years has prompted increasing demand for bioinformatics tools for genomic and proteomic data analysis (the proteome being the set of proteins encoded by the genome, and proteomics defining models for representing and analyzing the structure of the proteins contained in each cell). Bioinformatics applications’ design should represent the biological data and databases efficiently; contain services for data transformation and manipulation such as searching in protein databases, protein structure prediction, and biological data mining; describe the goals and requirements of applications and expected results; and support querying and computing optimization to deal with large datasets (Wong et al., 2001). Bioinformatics applications are naturally distributed, due to the high number of datasets involved. They require higher computing power, due to the large size of datasets and the complexity of basic computations; they may access heterogeneous data; and they require a secure software infrastructure because they could access private data owned by different organizations. Cannataro et al. (2004) describe technologies that can fulfill these bioinformatics requirements. The following are some of these technologies: (a) Ontologies to describe the semantics of data sources, software components and bioinformatics tasks. An ontology is a shared understanding of well-defined domains of interest, which is realized as a set of classes or concepts, properties, functions and instances. (b) Workflow Management Systems to specify in an abstract way complex (distributed) applications, integrating and composing individual simple services. A workflow is a partial or total automation of a business process in which a collection of activities must be executed by some entities (humans or machines) according to certain procedural rules. (c) Grid Infrastructure to provide security, distribution, service orientation, and computational power. (d) Problem Solving Environment to define and execute complex applications, hiding software development details.
These researchers developed two ontology components: (1) OnBrowser, for browsing and querying ontologies, and (2) DAMON, for the data mining domain, describing resources and processes of knowledge discovery in databases. The latter is used to describe data mining experimentations on bioinformatics data. Cannataro et al. (2004) designed PROTEUS, a Grid-based Problem Solving Environment (GPSE), for composing and running bioinformatics applications on the Grid. They used ontologies for modeling bioinformatics processes and Grid resources, and workflow techniques for designing and scheduling bioinformatics applications. PROTEUS assists users in formulating bioinformatics solutions by choosing among different available bioinformatics applications or by composing a new one as collections on the Grid. It is used to present and analyze results and then compare them with past results to form the PROTEUS knowledge base. PROTEUS combines existing open-source bioinformatics software and publicly available biological databases by adding metadata to software, modeling applications through ontology and workflows, and offering pre-packaged Grid-aware bioinformatics applications. Web-based bioinformatics application platforms are popular tools for biological data analysis within the bioscience community. Wong et al. (2001) developed a prototype based on integrating different biological databanks into a unified XML framework. The prototype simplifies the software development process of bioinformatics application platforms. The XML-based wrapper of the prototype demonstrated a way to convert data from different databanks into XML format and store them in XML database management systems. DNA data analysis is an important topic in biomedical research. Recent research in the area has led to the discovery of genetic causes for many diseases and disabilities, as well as the discovery of new medicines and approaches for disease diagnosis, prevention, and treatment. Data mining has become a powerful tool and contributes substantially to DNA analysis in the following ways, according to Han et al. (2001): (a) Semantic integration of heterogeneous, distributed genome databases: due to the highly distributed, uncontrolled generation and use of a wide variety of DNA data, the semantic integration of such heterogeneous and widely distributed genome databases becomes a pivotal task for systematic and coordinated analysis of DNA databases. (b) Similarity search and comparison among DNA sequences: one of the most important search problems in genetic analysis is similarity search and comparison among DNA sequences. Gene sequences isolated from diseased and healthy species can be compared to identify critical differences between the two classes of genes. This can be done by first retrieving the gene sequences from the two tissue classes and then finding and comparing the frequently occurring patterns of each class (a minimal illustrative sketch follows at the end of this section). (c) Association analysis and identification of co-occurring gene sequences: association analysis methods can be used to help determine the kinds of genes that are likely to co-occur in target samples. Such analysis would facilitate the discovery of groups of genes and the study of interactions and relationships between them. (d) Visualization tools and genetic data analysis: complex structures and sequencing patterns of genes are most effectively presented in graphs, trees, cuboids, and chains by various kinds of visualization tools. Visualization, thus, plays an important role in biomedical data mining.
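As a minimal illustration of point (b), the following sketch (in Python, with invented toy sequences; real analyses would work over far longer sequences drawn from curated databases) counts frequently occurring patterns (k-mers) in two classes of sequences and reports the patterns that distinguish them:

```python
# A minimal sketch of comparing frequently occurring patterns (k-mers) between
# two classes of gene sequences. The sequences below are invented toy examples.
from collections import Counter

def kmer_counts(sequences, k=3):
    """Count every length-k substring across a list of DNA sequences."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

healthy = ["ATGGCATTGA", "ATGGCCTTGA"]
diseased = ["ATGACATTGA", "ATGACCTTGA"]

healthy_kmers = kmer_counts(healthy)
diseased_kmers = kmer_counts(diseased)

# Patterns frequent in the diseased class but absent from the healthy class.
candidates = {kmer: n for kmer, n in diseased_kmers.items()
              if n >= 2 and kmer not in healthy_kmers}
print(candidates)   # {'GAC': 2}: a motif that separates the two classes in this toy example
```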
Research Methodology

The following is a discussion of the proposed approach for mining massive datasets for studying Africa from an Africancentric perspective. It depends on the METANET concept: a heterogeneous collection of scientific databases envisioned as a national and international digital data library which would be available via the Internet. I consider a heterogeneous collection of massive databases such as remote sensing data and text data. The discussion is divided into two separate, but interrelated, sections: (1) the automated generation of metadata and (2) the query and search of the metadata.

Automated Generation of Metadata

Metadata simply means data about data. They can be characterized as any information required to make other data useful in information systems. Metadata are a general notion that captures all kinds of information necessary to support the management, query, consistent use and understanding of data. Metadata help users to discover, understand and evaluate data, and help data administrators to manage data and control their access and use. Metadata describe how, when and by whom a particular set of data was collected, and how the data are formatted. Metadata are essential for understanding information stored in data warehouses. In general, there exist metadata to describe file and variable types and organization, but they carry minimal information about scientific content. In raw form, a dataset and its metadata have minimal usability. For example, not all of the image datasets produced in the same file format by a satellite-based remote sensing platform are important to scientists. In fact, only the image datasets that contain certain patterns will be of interest to a scientist. Scientists need metadata about the content of the image datasets to enable them to narrow their search time, taking into consideration the size of the datasets: i.e., terabyte-scale datasets (Wegman, 1997). Creating a digital object and linking it to the dataset will make the data usable, and at the same time the search operation for a particular structure in a dataset will be a simple indexing operation on the digital objects linked to the data. The objective of this process is to link digital objects with scientific meaning to the dataset at hand and make the digital objects part of the searchable metadata associated with the dataset. Digital objects will help scientists to narrow the scope of the datasets that they must consider. In fact, digital objects reflect the scientific content of the data, but they do not replace the judgment of the scientists. The digital objects will essentially be named for patterns to be found in the datasets. The goal is to have a background process, launched either by the database owner or, more likely, via an applet created by a virtual data center, examining databases available on the data-Web and searching within datasets for recognizable patterns. Once a pattern is found in a particular dataset, the digital object corresponding to that pattern is made part of the metadata associated with that set. If the same pattern is contained in other distributed databases, pointers would be added to that metadata pointing to metadata associated with the distributed databases. The distributed databases will then be linked through the metadata in the virtual data center. At least one of the following three different methods is to be used to generate the patterns for which a researcher can search.
The first method is to delineate empirical or statistical patterns that have been observed over a long period of time and may be thought to have some underlying statistical structure. An example of an empirical or statistical pattern is a certain pattern in DNA sequencing. The second method is to generate model-based patterns. This method is predictive if verified on real data. The third method is to tease out patterns found by clustering algorithms. With this method, patterns are delineated by purely automated techniques that may or may not have scientific significance (Wegman, 1997).

Query and Search

The notion of the automated creation of metadata is to develop metadata that reflect the scientific content of the datasets within the database rather than just data structure information. The locus of the metadata is the virtual data center where it is reproduced. The general desideratum for the scientist is to have a comparatively vague question that can be sharpened as s/he interacts with the system. Scientists can create a query, and this query may be sharpened to another vague one, but the data may be accessible from several distributed databases for the latter query. The main issues of the retrieval process are the browser mechanism for requesting data when the user has a precise query and an expert system query capability that would help the scientist reformulate a vague question into a form that may be submitted more precisely. Query and search would comprise four major elements: (1) a client browser, (2) an expert system for query refinement, (3) a search engine, and (4) a reporting mechanism. These four elements are described in the following subsections.

Client Browser

The client browser would be a piece of software running on a scientist's client machine, which is likely to be a personal computer (PC) or a workstation. The main idea is to have a graphical user interface (GUI) that would allow the user to interact with a more powerful server in the virtual data center. The client software is essentially analogous to the myriad of browsers available for the World Wide Web (WWW).

Expert Systems for Query Refinement

A scientist interacts with a server via two different scenarios. In the first scenario, the scientist knows precisely the location and type of data s/he desires. In the second scenario, the scientist knows generally the type of questions s/he would like to ask, but has little information about the nature of the databases with which s/he hopes to interact. The first scenario is relatively straightforward, but the expert system would still be employed to keep a record of the nature of the query. The idea is to use the query as a tool in the refinement of the search process. The second scenario is more complex. The approach is to match a vague query formulated by the scientist to one or more of the digital objects discovered in the automated generation of metadata phase. Domain experts give rules to the expert system to perform this match. The expert system would attempt to match the query to one or more digital objects. The scientist has the opportunity to confirm the match when s/he is satisfied with the proposed match or to refine the query. The expert system would then engage the search engine in order to synthesize the appropriate datasets. The expert system would also take advantage of the interaction to form a new rule for matching the original query to the digital objects developed in the refinement process.
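The following minimal Python sketch illustrates the kind of matching just described. The digital-object names and their keyword descriptions are hypothetical stand-ins for metadata assumed to have been produced by the automated generation phase, and the keyword-overlap scoring is only a crude surrogate for the rules a real expert system would apply; it is a sketch of the idea, not a prescribed implementation.

```python
# Hypothetical digital objects assumed to have been attached to datasets
# during the automated metadata-generation phase described above.
DIGITAL_OBJECTS = {
    "ndvi_drought_pattern": "vegetation index decline drought Sahel rainfall deficit",
    "urban_growth_pattern": "settlement expansion urban growth night lights population",
    "river_flooding_pattern": "river flood inundation rainfall runoff delta",
}

def match_query(vague_query, objects, top_n=2):
    """Score each digital object by keyword overlap with a vague query."""
    query_terms = {t.strip(".,?!").lower() for t in vague_query.split()}
    scores = []
    for name, description in objects.items():
        overlap = query_terms & set(description.lower().split())
        scores.append((len(overlap), name, sorted(overlap)))
    scores.sort(reverse=True)
    return [(name, shared) for count, name, shared in scores[:top_n] if count > 0]

# A vague question the scientist might pose; the proposed matches would be
# shown back to him/her for confirmation or further refinement.
print(match_query("Where has drought reduced vegetation near the Sahel?", DIGITAL_OBJECTS))
```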
Thus, two aspects emerge: (1) the refinement of the precision of an individual search and (2) the refinement of the search process. Both aspects share tactical and strategic goals. The refinement would be greatly aided by the active involvement of the scientist. S/he would be informed about how his/her particular query was resolved, allowing him/her to reformulate the query efficiently. The log files of these iterative queries would be processed automatically to inspect the query trees and, possibly, improve their structure. Also, two other considerations of interest emerge. First, other experts not necessarily associated with the data repository itself may have examined certain datasets and have commentary in either informal annotations or in the refereed scientific literature. These commentaries should form part of the metadata associated with the dataset. Part of the expert system should provide an annotation mechanism that would allow users to attach commentary or library references (particularly digital library references) as metadata. Obviously, such annotations may be self-serving and potentially unreliable. Nonetheless, the idea is to alert the scientist to information that may be useful. User-derived metadata would be considered secondary metadata. The other consideration is to provide a mechanism for indicating data reliability. This would be attached to a dataset as metadata, but it may in fact be derived from the original metadata. For example, a particular data collection instrument may be known to have a high variability. Thus, any set of data that is collected by this instrument, no matter where in the data it occurred, should have as part of the attached metadata an appropriate caveat. Hence, an automated metadata collection technique should be capable of not only examining the basic data for patterns, but also examining the metadata themselves; and, based on collateral information such as just mentioned, it should be able to generate additional metadata.

Search Engine

As noted earlier, large-scale scientific information systems will likely be distributed in nature and contain not only the basic data, but also structured metadata (for example, sensor type, sensor number, and measurement data) and unstructured metadata, such as a text-based description of the data. These systems will typically have multiple main repository sites that together will house a major portion of the data, as well as some smaller sites and virtual data centers containing the remainder of the data. Clearly, given the volume of the data, particularly within the main servers, high-performance engines that integrate the processing of the structured and unstructured data are necessary to support desired response rates for user requests. Both database management systems (DBMS) and information retrieval systems provide some functionality to maintain data. DBMS allow users to store unstructured data as binary large objects (BLOB), and information retrieval systems allow users to enter structured data in zoned fields. DBMS, however, offer only a limited query language for values that occur in BLOB attributes. Similarly, information retrieval systems lack robust functionality for zoned fields. Additionally, information retrieval systems traditionally lack efficient parallel algorithms. Using a relational database approach for information retrieval allows for parallel processing, since almost all commercially available parallel engines support some relational DBMS. An inverted index may be modeled as a relation.
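As a minimal sketch of the idea that an inverted index may be modeled as a relation, the following Python example uses SQLite from the standard library. The table layouts, field names and sample records are invented for illustration; the point is simply that a single SQL statement can combine a text predicate (the posting relation) with predicates over structured metadata.

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()

# Structured metadata about each dataset (illustrative fields and values).
cur.execute("CREATE TABLE dataset (id INTEGER PRIMARY KEY, sensor TEXT, year INTEGER)")
# The inverted index modeled as a relation: one row per (term, dataset) posting.
cur.execute("CREATE TABLE posting (term TEXT, dataset_id INTEGER, freq INTEGER)")

cur.executemany("INSERT INTO dataset VALUES (?, ?, ?)",
                [(1, "AVHRR", 1998), (2, "MODIS", 2003), (3, "MODIS", 2004)])
cur.executemany("INSERT INTO posting VALUES (?, ?, ?)",
                [("drought", 1, 4), ("drought", 2, 7), ("flood", 3, 2), ("sahel", 2, 5)])

# One SQL statement mixes a text predicate (the inverted index) with
# structured predicates (sensor and year), as described in the surrounding text.
rows = cur.execute("""
    SELECT d.id, d.sensor, d.year, p.freq
    FROM posting AS p JOIN dataset AS d ON d.id = p.dataset_id
    WHERE p.term = 'drought' AND d.sensor = 'MODIS' AND d.year >= 2000
    ORDER BY p.freq DESC
""").fetchall()
print(rows)  # -> [(2, 'MODIS', 2003, 7)]
```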
This treats information retrieval as an application of a DBMS. Using this approach, it is possible to implement a variety of information retrieval functionality and achieve good run-time performance. Users can issue complex queries including both structured data and text. The key hypothesis is that the use of a relational DBMS to model an inverted index will (a) permit users to query both structured data and text via standard Structured Query Language (SQL)—in this regard, users may use any relational DBMS that supports standard SQL; (b) permit the implementation of traditional information retrieval functionality such as Boolean retrieval, proximity searches, and relevance ranking, as well as non-traditional approaches based on data fusion and machine learning techniques; and (c) take advantage of current parallel DBMS implementations, so that acceptable run-time performance can be obtained by increasing the number of processors applied to the problem.

Reporting Mechanism

The most important issue for a reporting mechanism is not only to retrieve datasets appropriate to the needs of the scientist, but also to scale down the potentially large databases the scientist must consider. Put differently, the scientist would consider megabytes (~10^6 bytes) instead of terabytes (~10^12 bytes) of data. The search and retrieval process may still result in a massive amount of data. The reporting mechanism would, therefore, initially report the nature and magnitude of the datasets to be retrieved. If the scientist agrees that the scale is appropriate for his/her needs, then the data will be delivered by a file transfer protocol (FTP) or a similar mechanism to his/her local client machine or to another server where s/he wants the synthesized data to be stored.

Implementation

To help scientists search massive databases and find data of interest to them, a good information system should be developed for data ordering purposes. The system should perform well based on the descriptive information about the scientific datasets, or metadata, such as the main purpose of the datasets, the spatial and temporal coverage, the production time, the quality of the datasets, and the main features of the datasets. Scientists want to have an idea of what the data look like before ordering them, since metadata searching alone cannot meet all scientific queries. Thus, content-based searching or browsing and preliminary analysis of data based on their actual values will be inevitable in these application contexts. One of the most common content-based queries is to find large enough spatial regions over which the geophysical parameter values fall into certain intervals given a specific observation time. The query result could be used for ordering data as well as for defining features associated with scientific concepts. For researchers of African topics to be able to maximize the utility of this content-based query technique, there must exist a Web-based prototype through which they can demonstrate the idea of interest. The prototype must deal with different types of massive databases, with special attention being given to the following and other aspects that are unique to Africa:

(a) African languages with words encompassing diacritical marks (dead and alive)

(b) Western colonial languages (dead and alive)

(c) Other languages such as Arabic, Russian, Hebrew, Chinese, etc.
(d) Use of desktop software such as Microsoft Word or Corel WordPerfect to type words with diacritical marks and then copy and paste them into Internet search lines

(e) Copying text into online translation sites and translating it into the target language

The underlying approach must be pluridisciplinary, which involves the use of open and resource-based techniques available in the actual situation. It has, therefore, to draw upon the indigenous knowledge materials available in the locality and make maximum use of them. Indigenous languages are, therefore, at the center of the effective use of this methodology. What all this suggests is that the researcher must revisit the indigenous techniques that take into consideration the epistemological, cosmological and methodological challenges. Hence, the researcher must be culture-specific and knowledge-source-specific in his/her orientation. Thus, the process of redefining the boundaries between the different disciplines in our thought process is the same as that of reclaiming, reordering and, in some cases, reconnecting those ways of knowing, which were submerged, subverted, hidden or driven underground by colonialism and slavery. The research should, therefore, reflect the daily dealings of society and the challenges of the daily lives of the people. Towards this end, at least the following six questions should guide pluridisciplinary research:

(1) How can the research increase indigenous knowledge in the general body of global human development?

(2) How can the research create linkages between the sources of indigenous knowledge and the centers of learning on the continent and the Diaspora?

(3) How can centers of research in the communities ensure that these communities become "research societies"?

(4) How can the research be linked to the production needs of the communities?

(5) How can the research help to ensure that science and technology are generated in relevant ways to address problems of the rural communities where the majority of the people live, and that this is done in indigenous languages?

(6) How can the research help to reduce the gap between the elite and the communities from which they come by ensuring that the research results are available to everyone and that such knowledge is drawn from the communities? (For more on this approach, see Bangura, 2005.)

In the collection of remote sensing and text databases, one must implement a prototype system that contains at least a four-terabyte storage capability with high-performance computing. Remote sensing data are available through NASA JPL, NASA Goddard, and NASA Langley Research Center. The prototype system will allow scientists to make queries against disparate types of databases. For instance, queries on remote sensing data can focus on the features observed in images. Those features may be environmental or artificial features, which consist of points, lines, or areas. Recognizing features is the key to interpretation and information extraction. Images differ in their features, such as tone, shape, size, pattern, texture, shadow, association, etc. Tone refers to the relative brightness or color of objects in the image. It is the fundamental element for distinguishing between different targets or features. Shape refers to the general form, structure, or outline of an object. Shape can be a very distinctive clue for interpretation. Size of objects in an image is a function of scale.
It is important to assess the size of a target relative to other objects in a scene, as well as the absolute size, to aid in the interpretation of that target. Pattern refers to the spatial arrangement of visibly discernible objects. Texture refers to the arrangement and frequency of tonal variation in a particular area of an image. Shadow will help in the interpretation by providing an idea of the profile and relative height of a target or targets, which may make identification easier. Association takes into account the relationship among recognizable objects or features in proximity to the target of interest. Other features of the images that also should be taken into consideration include percentage of water, green land, cloud forms, snow, and so on. The prototype system will help scientists to retrieve images that contain different features; the system should be able to handle complex queries. This calls for some knowledge of African fractals, which I have defined as a self-similar pattern—i.e. a pattern that repeats itself on an ever-diminishing scale (Bangura, 2000:7). As Ron Eglash (1999) has demonstrated, first, traditional African settlements typically show repetition of similar patterns at ever-diminishing scales: circles of circles of circular dwellings, rectangular walls enclosing ever-smaller rectangles, and streets in which broad avenues branch down to tiny footpaths with striking geometric repetition. He easily identified the fractal structure when he compared aerial views of African villages and cities with corresponding fractal graphics simulations. To estimate the fractal dimension of a spatial pattern, Eglash used several different approaches. In the case of Mokoulek, for instance, which is a black-and-white architectural diagram, a two-dimensional version of the ruler-size-versus-length plot was employed. For the aerial photo of Labbazanga, however, an image in shades of gray, a Fourier transform was used. Nonetheless, according to Eglash, we cannot just assume that African fractals show an understanding of fractal geometry, nor can we dismiss that possibility. Thus, he insisted that we listen to what the designers and users of these structures have to say about it. This is because what may appear to be an unconscious or accidental pattern might actually have an intentional mathematical component. Second, as Eglash examined African designs and knowledge systems, five essential components (recursion, scaling, self-similarity, infinity, and fractional dimension) kept him on track of what does or does not match fractal geometry. Since scaling and self-similarity are descriptive characteristics, his first step was to look for these properties in African designs. Once he established that theme, he then asked whether or not these concepts had been intentionally applied, and started to look for the other three essential components. He found the clearest illustrations of indigenous self-similar designs in African architecture. The examples of scaling designs Eglash provided vary greatly in purpose, pattern, and method. As he explained, while it is not difficult to invent explanations based on unconscious social forces—for example, the flexibility in conforming designs to material surfaces as expressions of social flexibility—he did not believe that any such explanation can account for their diversity.
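A standard way to estimate the fractal dimension of a black-and-white pattern such as an architectural diagram is box counting, a close cousin of the ruler-size-versus-length approach mentioned above. The Python sketch below applies it to a synthetic Sierpinski-like carpet; the test pattern and the box sizes are invented for illustration and are not drawn from Eglash's material.

```python
import numpy as np

def box_counting_dimension(image, box_sizes=(1, 2, 4, 8, 16)):
    """Estimate the fractal dimension of a binary image by box counting:
    count occupied boxes N(s) at several box sizes s and fit log N(s) ~ -D * log s."""
    counts = []
    for s in box_sizes:
        occupied = 0
        for i in range(0, image.shape[0], s):
            for j in range(0, image.shape[1], s):
                if image[i:i + s, j:j + s].any():
                    occupied += 1
        counts.append(occupied)
    slope, _ = np.polyfit(np.log(box_sizes), np.log(counts), 1)
    return -slope

# Synthetic self-similar test pattern: a Sierpinski-like carpet on a 27x27 grid,
# standing in for a digitized black-and-white architectural diagram.
def carpet(n):
    if n == 1:
        return np.ones((1, 1), dtype=bool)
    block = carpet(n // 3)
    hole = np.zeros_like(block)
    return np.block([[block, block, block],
                     [block, hole,  block],
                     [block, block, block]])

print(round(box_counting_dimension(carpet(27), box_sizes=(1, 3, 9, 27)), 2))  # close to 1.89
```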
He found that from optimization engineering, to modeling organic life, to mapping between different spatial structures, African artisans had developed a wide range of tools, techniques, and design practices based on the conscious application of scaling geometry. Thus, for example, instead of using the Koch curve to generate the branching fractals used to model the lungs and acacia tree, Eglash used passive lines that are just carried through the iterations without change, in addition to active lines that create a growing tip by the usual recursive replacement.

For the text database, the prototype system must consider polysemy and synonymy problems in the queries. Polysemy means words having multiple meanings: e.g., "order," "loyalty," and "ally." Synonymy means multiple words having the same meaning: e.g., "jungle" and "forest," "tribe" and "ethnic-group," "language" and "dialect," "tradition" and "primitive," "corruption" and "lobbying." The collected documents will be placed into categories depending on the documents' subjects. Scientists can search those documents and retrieve only the ones related to queries of interest. Scientists can search via words or terms, and then retrieve documents in the same category or from different categories, as long as they are related to the words or terms in which the scientists are interested.

Barriers

Many barriers to gaining access to the Internet can hamper the automated generation of metadata for studying and teaching about Africa in the continent. These barriers include bandwidth, copyright laws, costs, politics and bureaucracy, insufficient training and personnel, and unreliability or system glitches. These obstacles are discussed sequentially in the following subsections.

Bandwidth

Bandwidth is broadly defined as the rate of data transfer: i.e. the capacity of the Internet connection being used. The greater the bandwidth, the greater the capacity for faster downloads from the Internet. It is among the biggest problems African universities face in accessing the Internet. The universities have been forced to buy bandwidth from much more expensive satellite companies. What is worse is that the purchases have been made through middlemen, increasing the costs even more. Nonetheless, the evolving technology and steadily falling satellite prices could help to ameliorate this barrier. In fact, various organizations in Africa are getting together to buy bandwidth as a "bridging" strategy to obtain Internet access via satellite until terrestrial fiber-optic cable becomes available. With new technology, along with the continued growth of satellite companies, this could be a leapfrog technology that will enable Africa to avoid laying much of the terrestrial cable that was essential in the developed world (Walker, 2005). Universities involved in the Partnership for Higher Education in Africa (PHEA)—a collaborative effort involving Carnegie Corporation of New York, along with the MacArthur, Ford, and Rockefeller foundations, which have pledged $100 million over a five-year period to help strengthen African universities—tend to be among the leaders in adapting ICT. But even among them, there are considerable differences, as the pace of change can be dizzying. The University of Jos in Nigeria, for example, has blazed a trail in ICT among West African universities, generally regarded as the region that has the most problems. At Jos, one could get a broadband connection from several labs on campus and download large research files.
About 12 years ago, a group of individuals at the university who were dedicated to making ICT flourish took on the challenge of bringing the institution to a high level of connectivity. They did not have much money, but they had strong institutional support, spanning the tenure of three vice chancellors. In 1979, Jos had a student and staff population of 15,000, but it had no ICT staff. There were fewer than 10 computers and fewer than 10 people who had any computer skills. But today, Jos has over 3,000 E-mail users and over 400 networked computers, all three of the campuses are linked by 15 local area networks (LANs) utilizing fiber optics and Cisco switches, and there is an established tradition of training (Walker, 2005).

At Makerere University in Uganda, there exists a notable program of ICT advances. Nonetheless, the university is still coping with a bandwidth problem. The fundamental challenge is that satellite access is quite expensive. For every $500 an American institution pays for access per month, Makerere pays $28,000 (Walker, 2005). Most African universities do not have enough bandwidth to access the millions of digitized images of the pages of the most prestigious journals. Organizations that offer such resources, such as JSTOR, have explored putting much of their material on local servers for African universities, but publishers object to the suggestion, arguing that the kind of security controls and abuse monitoring they have on the Internet would not be maintained on local servers (Walker, 2005).

Copyright Laws

A strenuous challenge that has emerged in managing and accessing the Internet in Africa has to do with copyright laws. African scientists and scholars are being denied access to World Wide Web (WWW) resources because of increasingly contentious intellectual property rights debates. Among the groups involved in the debates is the PHEA, whose goal on this issue is to improve African universities' capacity to utilize technology. Currently, as I mentioned earlier, the PHEA is in the midst of an effort to help the universities gain control of the cost and training issues surrounding online access. Its initial focus has been to facilitate the formation of a coalition of African universities that will be better positioned to negotiate lower bandwidth prices (Walker, 2005). The development of limited "virtual libraries" highlights what has become the largest obstacle to African scholars' access to WWW resources. In the developed countries, there is a tradition of each student having his/her own textbooks, copies of journals, etc. The tradition just does not exist in Africa. The reality is that people in the continent copy books for 3,000 students who cannot afford to buy them. This is certainly what publishers do not want, since photocopying books and journals violates their copyright. The situation might be different if these publishers were to provide their products to African students, teachers and scholars at lower costs. Publishers must ask themselves just how much potential income they lose in poor African countries. Copyrighted Western business and computer science books, for example, are exorbitantly expensive for African institutions. The publishers are charging $800 for the books compared to the $1,000 the universities charge for tuition. The idea of textbooks costing as much as tuition is staggering for many people.
Part of the solution may hinge upon a collective bargaining process, similar to the universities' bandwidth consortium, that would negotiate lower fees from publishers (Walker, 2005). Publishers and database vendors are requiring African institutions to guarantee that they will know exactly who is accessing the information at all times at all the universities so that appropriate fees can be charged. African institutions are asking why vendors should care about how many students will have access to the information. They rhetorically ask how vendors can be sure that, if they bought a hard-copy book, half the village would not just copy it. African institutions also perceive copyright-related problems as a true barrier to development. They believe that people in developing countries cannot afford to get caught up in the trap of the global intellectual property regime, which strongly favors the West, where most of the laws are made and enforced. They believe that for Africa, ICT has become a tool for survival. Anything that stands in the way is unacceptable, especially when African states are confronted with so many other challenges. They also believe that the worst thing that could happen to African nations is to become the total consumers of information, as the property rights issues are becoming barriers that prevent Africans from placing their knowledge in the global stream. Africans have unique cultures, languages, histories, environments, fauna, flora, archaeology and increasingly valuable information in the hard sciences, and they must figure out a way to barter African intellectual property for access to others. Meanwhile, many African scholars are voicing growing concern about their inability to access Internet resources because of copyright issues. Professor I. S. Diso, vice chancellor of Nigeria's Kano University of Technology, is heading a French-funded investigation into the alleged harmful restrictions copyrights are placing on African scholars, scientists, and researchers (Walker, 2005). In Africa, concern about copyright as a barrier is reflected in the growing realization of the role it can play in preserving unique African knowledge and protecting its ownership. Out of fear that the works of African scientists and scholars might be stolen in developed countries, university officials across the continent are reluctant to have those works placed on the WWW (Walker, 2005).

Costs

In Africa, the average total cost of using a local dial-up Internet account for 20 hours a month is about $68 per month. This includes usage fees and local telephone time, but not telephone line rental. ISP subscription rates vary between $10 and $100 per month. The prevailing high level of poverty in most of Africa is among the factors hindering people from using the Internet for sustainable development. Most Africans prefer to spend their hard-earned money on buying food to fill their stomachs as opposed to surfing the Internet. The cost of computers is another factor that is still plaguing access to the Internet in Africa because governments have not reduced duty on computers (Lusaka Information Dispatch, 2003). When they gained independence, many African countries embarked upon serious efforts to build telephone lines, but the attempts have been stymied by inefficient state-owned monopolies and high rates of theft of the copper wire used in telephone installations.
Consequently, universities interested in gaining access to the Internet almost all had to do so by buying bandwidth from satellite companies, often at more than 100 times the cost in developed countries. The bandwidth price has started to come down, with the average African university paying $4 less per kilobit per second. And now that African universities have formed a "bandwidth club," they are expecting the prices to come down even more. In the new tender they are floating, they are looking at a target price of no more than $2.50 per kilobit per second. This will help some universities to get up to six times more bandwidth for what they are currently paying (Walker, 2005). Access to the Internet in Africa is hampered by unfair costs. The continent is being ripped off to the tune of some $500 million a year for hooking up to the WWW. This extra cost is partly to blame for slowing the spread of the Internet in Africa and helping to sustain the "digital divide." The continent is being forced by Western companies to pay the full cost of connecting to worldwide networks, leading to the exploitation of the continent's young Internet industry. The problem is that International Telecommunications Union regulations, which should ensure the costs of telephone calls between Africa and the West are split 50:50, are not being enforced with regard to the Internet. Neither British Telecom nor America Online pays a single cent to send E-mail to Africa. The total cost of any E-mail sent to or received by an Internet user in Africa is paid entirely by African ISPs. Consequently, the current and latent demand for bandwidth in Africa costs about $1 billion a year (BBC, 2002). No matter how well it is managed, however, many critics argue that the amount of money universities are spending on bandwidth is inappropriate. They say that universities are worshipping at the altar of the Internet when the money could be better spent to remedy grossly overcrowded classes, dilapidated infrastructure, and water and sewer systems that sometimes do not work. Supporters of spending on bandwidth counter by stating that with proper bandwidth, an instructor might be able to teach 10,000 more students through distance learning, as opposed to hiring 24 more professors. They add that bandwidth allows, for example, all the courses at the Massachusetts Institute of Technology (MIT) and other Internet learning resources to be online and adapted to meet the needs of universities in Africa. It should be mentioned, however, that in addition to its monetary cost, Internet use has another major cost: i.e. someone in Europe or in the United States can download 1,000 abstracts in a brief period of time; in Africa, it will take the person two days. That slows down the process of research and discourages users from relying on the system (Walker, 2005).

Politics and Bureaucracy

There exist principled objections and bureaucratic pathologies towards the ICT focus within African educational and political institutions. Some professors just do not believe that distance learning is an effective educational strategy. Many people do not believe that those who push the ICT focus make a lot of sense, and others are scared that the technology might take away their jobs (Walker, 2005). Another major reason why African universities have lagged so far behind in accessing the Internet has to do with the history of authoritarian and repressive governments on the continent.
Many in the last generation of African leaders viewed mass communications as a security risk. Some current leaders hold this same view as well, believing that it could be dangerous to allow many people access to a communications tool like the WWW that is not easily monitored or controlled. They see it as being like the private radio stations which opposition parties have used as tools to overthrow governments. As a consequence, many African governments have been very slow to provide the kind of regulatory and funding assistance taken for granted by universities in developed countries (Walker, 2005). Endless red tape, lack of clear policy, unreliable power supplies and monopoly by banks hold back the IT sector. Adding to this is inter-country rivalry over the best contender for an African Internet hub: it is argued that Nigeria has too much corruption, Senegal is francophone, and Ghana is not ready yet (Hale, 2003). Also, wars and military incursions sabotage African universities, which are supposed to be the places of open inquiry, as they become completely muzzled. In Sierra Leone, for example, the educational institutions became targets for destruction during its 11-year civil war (Kargbo, 2002). In Nigeria, the universities came to have the same kind of ethnic and religious intolerance and corruption that the military bred into the larger society. The result has been people not trusting one another and being afraid to cooperate on projects or even ask questions. Even with the restoration of democracy in Nigeria, there has been no concerted and coordinated approach to provide universities with the necessary technological infrastructure needed for advancement. While in South Africa, for instance, there are special rates on all kinds of things for educational institutions, in Nigeria, universities are charged the full commercial rate for telephone and electric services, on which the government has monopolies. Nigeria has two national telecommunications carriers owned by the government. While both of them have a fiber-optic backbone, neither has connected any university or even been asked to. Another example of how outdated and uncoordinated Nigerian government regulations continue to hamper ICT development has to do with bureaucratic pathologies. For example, when computers and other ICT equipment are sent by overseas donors, they will sit at customs for ten or more months while the tax on them accumulates with time. Once all that is added together, the donations tend to make no sense, for they become just as expensive as buying new ones. Even vendors of Voice Over Internet Protocol (VOIP) services who have been quite successful in gaining numerous customers are worried that eventually the Nigerian government will try to stifle VOIP because it means the end of the state-owned telecommunications service (Walker, 2005).

In Zambia, where the government has no policy on the use of ICT, the opinion leaders see no need to entice and educate people on the need to use the Internet and to make full use of computers. Some people who have computers keep them merely for prestige rather than using them to improve their economic situation or business. Most politicians are concentrating on improving agriculture, fighting corruption and diversifying the economy, but forgetting that ICT usage could help to double the effort of achieving the economic development that most people require.
A challenge Zambians are facing is that most politicians who speak on their behalf are ignorant about Internet usage; if they are aware of it, they just know it by virtue of being elected as members of parliament (Lusaka Information Dispatch, 2003). Despite the political and bureaucratic malaise, progress is being made. For example, in Uganda, after much lobbying by universities and other members of the ICT community, the government has agreed, in principle, to lay fiber-optic cables whenever it builds new roads. Building a road costs $3 million per kilometer, and adding fiber optics would only be an additional $100,000 per kilometer (Walker, 2005).

Insufficient Training and Personnel

Most people in Africa say they have heard about the Internet, while others say they are ignorant about its usage (Lusaka Information Dispatch, 2003). Training a new generation in managing and accessing the Internet is a major issue in Africa. Managing bandwidth involves training and cultural transformations at nearly all universities. Much of the bandwidth universities are buying is being wasted, as it is difficult to prevent students from doing selfish things like downloading videos and music. Building firewalls and monitoring use on a per-student basis are also required. Universities must also establish local area networks (LANs) that can be used instead of the more expensive Internet connections for many activities. For instance, a great deal of the research and materials which individuals need can be placed on a LAN and shared through it. Better management of LANs can substantially reduce Internet costs (Walker, 2005).

Unreliability or System Glitches

Many of the ICT systems in African universities are unreliable, whether because of power outages, a broken modem, or an unpaid bill. Hardware systems are "pinged" at African universities all the time, and many of them run for only four hours a day. Researchers have been known to wait two days to download a large file and only find out later that the file was corrupted because online access was interrupted. When this happens several times, some researchers simply give up (Walker, 2005). Irregular or non-existent electricity supply, high-intensity lightning strikes during the rainy season, and creaking infrastructure are other major barriers to Internet usage, especially outside the big cities and towns (Lusaka Information Dispatch, 2003; Kargbo, 2002; Hale, 2003).

Hopeful Signs on the Horizon

Despite all of the preceding barriers to gaining access to the Internet in Africa, there are hopeful signs on the horizon. To begin with, there is the "leap-frog" potential of evolving technology, as ICT is becoming more reliable and easier to use every day. Africa can be one case where the last can become the first. The technology is changing so rapidly that having had it for a long time is no longer a significant advantage. One money-saving potential for African users is VOIP, which could greatly reduce many universities' telephone bills (Walker, 2005). Universities are also exploring alternatives such as placing as much information as possible on virtual libraries. These media have opened university libraries to large populations, making libraries renewed places in the universities' lives (Kargbo, 2002; Walker, 2005). Virtual libraries offer one way of allowing the dissemination of knowledge while still maintaining some measure of control over its distribution.
One virtual library already being used by some African universities is a project run out of the University of Iowa in the United States called eGranary, which is described as an "Internet substitute." Project participants use very large hard drives to store nearly two million documents that publishers and authors are willing to share. Once the hard drives are installed locally, users can access the material much faster than trying to use the Internet. These media have everything from a virtual hospital with thousands of pieces of patient literature to full textbooks. There are nearly 50 eGranaries installed in sub-Saharan Africa (Walker, 2005). JSTOR, which, as I stated earlier, was originally developed and funded by the Andrew W. Mellon Foundation and now receives support from Carnegie Corporation, is another organization working to provide copyrighted scholarly material to African universities. JSTOR started as a way to electronically store back issues of many scholarly journals so that American university libraries could free up some shelf space. Eventually, it realized that it had a valuable resource for developing countries. It has digitized 17 million images of the pages of 400 of the most prestigious journals in 45 different disciplines. Scholars around the world use JSTOR for research. When it comes to African universities, most of them just do not have enough bandwidth to effectively access JSTOR's resources (Walker, 2005). The debate about how to make copyrighted Internet resources available to African scholars has become prominent in the international controversy over access to information. The Open Access Movement, a growing collection of intellectuals, academics, personnel at nongovernmental organizations (NGOs) and government officials who believe that knowledge should be free, is advancing Africa's case. The movement is calling for owners of copyrights to be more generous in making information available to African institutions. Universities in developed and developing countries are battling academic and scientific journal publishers over their regular, substantial subscription price increases and their use of bundling, which forces libraries to subscribe to journals they do not want in order to get the ones they do. Committees in both the British and American legislatures have passed resolutions demanding that government-sponsored research be available for free. Responding to the growing pressure, Reed Elsevier, the world's largest publisher of scientific journals, for example, announced that authors publishing in its journals would be allowed to post articles in institutional repositories. Another development favoring the Open Access Movement is the announcement by Google, the world's most popular Internet search engine, that it has reached an agreement with some of America's leading university research libraries to begin converting their holdings into digital files that would be freely searchable over the Internet. In addition, Google's competitor Yahoo, as well as others such as Amazon.com, is in a mad rush to get a share of the $12 billion scholarly journal business. Google has developed a separate site called Google Scholar for academic researchers. These rapid developments may ultimately work to the advantage of developing nations like those in Africa (Walker, 2005). An initiative called the Halfway Proposition urges African countries to develop national exchanges and then interconnect regional ones, as has been done in other parts of the developing world.
This would at least mean that revenues generated from intra-African E-mail will stay in the continent as opposed to going to Western nations. While no one really knows just how much intra-African traffic exists, it will certainly grow and become significant. And if even only five percent of the traffic is intra-regional, it would add up to a sizeable amount (BBC, 2002). The most hopeful sign on the ICT horizon in Africa is the exuberant optimism. Every study or report that has been done on the continent shows that everyone who works on ICT issues is optimistic. African universities feel that they have been shut off for so long from the global knowledge community that they are now hungry and thirsty, and are going full speed ahead (Walker, 2005). Despite the many woes of ICT on the continent, there is the steadfast belief in the potential of Africa to become a Silicon Valley-esque hi-tech hub. The continent wants to take a slice of the outsourcing that has been won by India's Bangalore and win the foreign investors that have shunned Africa for so long. And there is certainty that technology must be the way to improved economic prosperity. When computer fairs are held in Africa, long queues form at the stands. Would-be students wait patiently to fill out registration forms for computer courses in anything from basic word processing to diplomas in programming. Some potential students admit afterwards that they have no idea how or whether they will be able to pay the fees to attend such courses, but they, like Africa, are determined to find a way of using technology to enter the arena of the global economy (Hale, 2003).

Conclusions and Recommendations

Data mining techniques and visualization must play a pivotal role in retrieving substantive electronic data to study and teach about African phenomena in order to discover unexpected correlations and causal relationships, and to understand structures and patterns in massive data. Data mining is a process for extracting implicit, nontrivial, previously unknown and potentially useful information, such as knowledge rules, constraints, and regularities, from data in massive databases. The goals of data mining are (a) explanatory—to analyze some observed events, (b) confirmatory—to confirm a hypothesis, and (c) exploratory—to analyze data for new or unexpected relationships. Typical tasks for which data mining techniques are often used include clustering, classification, generalization, and prediction. The most popular methods include decision trees, value prediction, and association rules, which are often used for classification. Artificial Neural Networks are particularly useful for exploratory analysis as non-linear clustering and classification techniques. The algorithms used in data mining are often integrated into Knowledge Discovery in Databases (KDD)—a larger framework that aims at finding new knowledge from large databases. While data mining deals with transforming data into information or facts, KDD is a higher-level process that uses information derived from a data mining process to turn it into knowledge or integrate it into prior knowledge. In general, KDD stands for discovering and visualizing the regularities, structures and rules from data, discovering useful knowledge from data, and finding new knowledge. Visualization is a key process in Visual Data Mining (VDM). Visualization techniques can provide a clearer and more detailed view of different aspects of the data, as well as of the results of automated mining algorithms.
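As a minimal illustration of the mining step whose results such visualization would display, the following Python sketch clusters a few invented document snippets hierarchically using TF-IDF vectors; the linkage structure it produces is the kind of object a visual data mining tool would render as a dendrogram. The snippets are fabricated, and the choice of scikit-learn and SciPy is an assumption of convenience, not something prescribed in this essay.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from scipy.cluster.hierarchy import linkage, fcluster

# Invented document snippets standing in for a small text collection on Africa.
docs = [
    "indigenous education and ubuntu in community schooling",
    "ubuntu pedagogy and indigenous knowledge for teaching",
    "satellite remote sensing of drought in the Sahel",
    "rainfall anomalies and drought detected from satellite imagery",
]

# Represent documents as TF-IDF vectors, then cluster them hierarchically;
# the resulting linkage matrix Z describes the merge tree (dendrogram).
X = TfidfVectorizer().fit_transform(docs).toarray()
Z = linkage(X, method="average", metric="cosine")
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)  # the education and remote-sensing snippets should separate into two groups
```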
The exploration of relationships between several information objects, which represent a selection of the information content, is an important task in VDM. Such relations can either be given explicitly, when they are specified in the data, or they can be given implicitly, when the relationships are the result of an automated mining process: for example, when relationships are based on the similarity of information objects derived by hierarchical clustering. Understanding and trust are two major aspects of data visualization. Understanding is undoubtedly the most fundamental motivation behind visualizing massive data. If scientists understand what has been discovered from data, then they can trust the data. To help scientists understand and trust the implicit data and useful knowledge discovered from massive datasets concerning Africa, it is imperative to present the data in various forms, such as boxplots, scatter plots, 3-D cubes and data distribution charts, as well as decision trees, association rules, clusters, outliers, generalized rules, etc. The software called Crystal Vision is a good tool for visualizing data. It is an easy-to-use, self-contained Windows application designed as a platform for multivariate data visualization and exploration. It is intended to be robust and intuitive. Its features include scatter plot matrix views, parallel coordinate views, rotating 3-D scatter plot views, density plots, multidimensional grand tours implemented in all views, stereoscopic capability, saturation brushing, and data editing tools. It has been used successfully with datasets of as many as 20 dimensions and with as many as 500,000 observations (Wegman, 2003). Crystal Vision is available at the following Internet site: <ftp://www.galaxy.gmu.edu/pub/software/CrystalVisionDemo.exe>

In light of all these possibilities, it is imperative that there be on-the-ground commitment on the part of implementers, as well as university and government authorities, in order to achieve sustainable ICT in Africa. Only through their participation will the Internet transform the classroom, change the nature of learning and teaching, and change information seeking, organizing and using behavior. Finally, as I have suggested elsewhere (Bangura, 2005), the provision of education in Africa must employ ubuntugogy (which I define as the art and science of teaching and learning undergirded by humanity toward others) to serve as both a given and a task or desideratum for educating students. Ubuntugogy is undoubtedly part and parcel of the cultural heritage of Africans. Nonetheless, it clearly needs to be revitalized in the hearts and minds of some Africans. Although compassion, warmth, understanding, caring, sharing, humanness, etc. are underscored by all the major world orientations, ubuntu serves as a distinctly African rationale for these ways of relating to others. The concept of ubuntu gives a distinctly African meaning to, and a reason and motivation for, a positive attitude towards the other. In light of the calls for an African Renaissance, ubuntugogy urges Africans to be true to their promotion of peaceful relations and conflict resolution, and to their educational and other developmental aspirations. We ought never to falsify the cultural reality (life, art, literature) which is the goal of the student's study.
Thus, we would have to oppose all sorts of simplified or supposedly simplified approaches and stress instead the methods which will achieve the best possible access to real life, language and philosophy.

References

Alshameri, F. J. 2006. Automated generation of metadata for mining image and text data. Doctoral dissertation, George Mason University, Fairfax, Virginia.

Bangura, A. K. 2005. Ubuntugogy: An African educational paradigm that transcends pedagogy, andragogy, ergonagy and heutagogy. Journal of Third World Studies xxii, 2:13-54.

Bangura, A. K. 2000. Chaos Theory and African Fractals. Washington, DC: The African Institution Publications.

Bangura, A. K. 2000. Book Review of Ron Eglash's African Fractals: Modern Computing and Indigenous Design. Nexus Network Journal 2, 4.

Barker, J. and J. Thornton. 2004. Software engineering challenges in bioinformatics. Proceedings of the 26th International Conference on Software Engineering (ICSE '04).

BBC. April 15, 2002. The great African internet robbery. BBC News Online. Retrieved on April 18, 2002 from http://news.bbc.co.uk/2/hi/africa/1931120.stm

Berry, M., Z. Drmac and E. Jessup. 1999. Matrices, vector spaces, and information retrieval. Society for Industrial and Applied Mathematics (SIAM) Review 41, 2:335-362.

Brunner, R., S. Djorgovsky, T. Prince and A. Szalay. 2002. Massive datasets in astronomy. In J. Abello et al., eds. Handbook of Massive Datasets. Norwell, MA: Kluwer Academic Publishers.

Cannataro, M., C. Comito, A. Guzzo and P. Veltri. 2004. Integrating ontology and workflow in PROTEUS, a grid-based problem solving environment for bioinformatics. Proceedings of the International Conference on Information Technology: Coding and Computing (ITCC '04).

Chen, M., J. Han and P. Yu. 1996. Data mining: An overview from a database perspective. IEEE Transactions on Knowledge and Data Engineering 8, 6:866-883.

Dhillon, I., J. Han and Y. Guan. 2001. Efficient clustering of very large document collections. In R. Grossman et al., eds. Data Mining for Scientific and Engineering Applications. Norwell, MA: Kluwer Academic Publishers.

Dorre, J., P. Gerstl and R. Seiffert. 1999. Text mining: Finding nuggets in mountains of textual data. KDD-99:398-401. San Diego, CA.

Eglash, Ron. 1999. African Fractals: Modern Computing and Indigenous Design. New Brunswick, NJ: Rutgers University Press.

Ester, M., H. Kriegel and J. Sander. 2001. Algorithms and applications for spatial data mining. Geographic Data Mining and Knowledge Discovery, Research Monographs in GIS. Taylor and Francis, 167-187.

Ester, M., H. Kriegel and J. Sander. 1997. Spatial data mining: A database approach. Proceedings of the International Symposium on Large Spatial Databases. SSD '97:47-66. Berlin, Germany.

Ester, M., A. Frommelt, H. Kriegel and J. Sander. 2000. Spatial data mining: Database primitives, algorithms and efficient DBMS support. Data Mining and Knowledge Discovery 4, 2/3:193-216.

Fayyad, U. M., G. Piatetsky-Shapiro and P. Smyth. 1996. From data mining to knowledge discovery: An overview. In U. M. Fayyad et al., eds. Advances in Knowledge Discovery and Data Mining. Menlo Park, CA: AAAI Press.

Gomez, M., A. Gelbukh, A. Lopez and R. Yates. 2001. Text mining with conceptual graphs. IEEE 893-903.

Grossman, R., E. Creel, M. Mazzucco and R. Williams. 2001. A dataspace infrastructure for astronomical data. In R. Grossman et al., eds. Data Mining for Scientific and Engineering Applications. Norwell, MA: Kluwer Academic Publishers.

Hale, Briony. May 16, 2003. Africa's tech pioneers play catch up. BBC News Online. Retrieved on May 17, 2003 from http://news.bbc.co.uk/2/hi/business/3033185.stm
Hambrusch, S., C. Hoffman, M. Bock, S. King and D. Miller. 2003. Massive data: Management, analysis, visualization, and security. A School of Science Focus Area at Purdue University Report.

Han, J. and M. Kamber. 2001. Data Mining: Concepts and Techniques. San Francisco, CA: Morgan Kaufmann Publishers.

Humphreys, R., J. Cabanela and J. Kriessler. 2001. Mining astronomical databases. In R. Grossman et al., eds. Data Mining for Scientific and Engineering Applications. Norwell, MA: Kluwer Academic Publishers.

Kafatos, M., R. Yang, X. Wang, Z. Li and D. Ziskin. 1998. Information technology implementation for a distributed data system serving earth scientists: Seasonal to International ESIP. Proceedings of the 10th International Conference on Scientific and Statistical Database Management 210-215.

Kamath, C. 2001. On mining scientific datasets. In R. Grossman et al., eds. Data Mining for Scientific and Engineering Applications. Norwell, MA: Kluwer Academic Publishers.

Kargbo, John Abdul. March 2002. The internet in schools and colleges in Sierra Leone. First Monday 7, 3. Retrieved on March 18, 2002 from http://firstmonday.org/issues/issue7_3/kargbo/index.html

King, M. and R. Greenstone. 1999. EOS Reference Handbook. Washington, DC: NASA Publications.

Koperski, K. and J. Han. 1995. Discovery of spatial association rules in geographic information databases. Proceedings of the 4th International Symposium on Advances in Spatial Databases. 47-66. Portland, ME.

Koperski, K., J. Han and N. Stefanovic. 1998. An efficient two-step method for classification of spatial data. Proceedings of the Symposium on Spatial Data Handling. 45-54. Vancouver, Canada.

Lusaka Information Dispatch. January 07, 2003. Internet access still a nightmare in Africa. Retrieved on February 13, 2003 from http://www.dispatch.co.zm/modules.php?name=News&file=article&sid=180

Marusic, I., G. Candler, V. Interrante, P. Subbareddy and A. Moss. 2001. Real time feature extraction for the analysis of turbulent flows. In R. Grossman et al., eds. Data Mining for Scientific and Engineering Applications. Norwell, MA: Kluwer Academic Publishers.

Ng, R. and J. Han. 1994. Efficient and effective clustering methods for spatial data mining. Proceedings of the 20th International Conference on Very Large Databases. 144-155. Santiago, Chile.

Palacio, M., D. Sol and J. Gonzalez. 2003. Graph-based knowledge representation for GIS data. Proceedings of the Fourth Mexican International Conference on Computer Science (ENC '03).

Pantel, P. and D. Lin. 2002. Discovering word senses from text. In Proceedings of SIGKDD-01. San Francisco, CA.

Sander, J., M. Ester and H. Kriegel. 1998. Density-based clustering in spatial databases: A new algorithm and its applications. Data Mining and Knowledge Discovery 2, 2:169-194.

Shekhar, S., C. Lu, P. Zhang and R. Liu. 2002. Data mining and selective visualization of large spatial datasets. Proceedings of the 14th IEEE International Conference on Tools with Artificial Intelligence (ICTAI '02).

Walker, Kenneth. Spring 2005. Bandwidth and copyright: Barriers to knowledge in Africa? Carnegie Reporter 3, 2:1-5.

Wegman, E. 2003. Visual data mining. Statistics in Medicine 22:1383-1397 plus 10 color plates.

Wegman, E. 1997. A Guide to Statistical Software. Available at http://www.galaxy.gmu.edu/papers/astr1.html

Wong, P., P. Whitney and J. Thomas. 1999. Visualizing association rules for text mining. In G. Wills and D. Keim, eds. Proceedings of IEEE Information Visualization '99. Los Alamitos, CA: IEEE CS Press.
Wong, R. and W. Shui. 2001. Utilizing multiple bioinformatics information sources: An XML database approach. Proceedings of the Second IEEE International Symposium on Bioinformatics and Bioengineering.

Yang, R., X. Deng, M. Kafatos, C. Wang and X. Wang. 2001. An XML-based distributed metadata server (DIMES) supporting earth science metadata. Proceedings of the 13th International Conference on Scientific and Statistical Database Management 251-256.

Acknowledgment

This essay benefitted greatly from the insights of my colleague, Professor Faleh J. Alshameri, although all shortcomings herein lie with me.

About the Author

Abdul Karim Bangura is Professor of Research Methodology and Political Science at Howard University in Washington, DC, USA. He holds a PhD in Political Science, a PhD in Development Economics, a PhD in Linguistics, and a PhD in Computer Science. He is the author and editor/contributor of 57 books and more than 450 scholarly articles. He has served as President, United Nations Ambassador, and member of many scholarly organizations. He is the winner of numerous teaching and other scholarly and community awards. He also is fluent in about a dozen African and six European languages and is studying to strengthen his proficiency in Arabic and Hebrew.