Data collection and document generation system for data-oriented approaches

Yuichi Mori1, Yoshiro Yamamoto2 and Hiroshi Yadohisa3

1 Department of Socio-Information, Okayama University of Science, 1-1 Ridai-cho, Okayama 700-0005, Japan, mori@soci.ous.ac.jp
2 Department of Mathematics, Tokai University, 1117 Kita-Kaname, Hiratsuka 259-1292, Japan, yamamoto@sm.u-tokai.ac.jp
3 Department of Culture and Information Science, Doshisha University, 1-3 Tatara Miyakodani, Kyoto 610-0394, Japan, hyadohis@mail.doshisha.ac.jp

Summary. DoSS@d (Data-oriented Statistical System, http://mo161.soci.ous.ac.jp/@d/) is a web-based statistics system developed mainly for educational use. It archives a large number of data sets together with the corresponding analysis stories (reports of the actual analysis processes) and provides an online analysis system. We have now added two subsystems to DoSS@d: an online data collection system with a real-time analysis function, and an online document generation system with a database registration function. The data collection system has been developed not only for desktop computers but also for mobile information devices, including portable phones that can access the Internet. The document generation system provides a user interface for generating an XML document describing the collected data and for registering it in the DoSS@d database. These subsystems strengthen the data collection function, so that users can easily collect data sets and perform simple analyses of the temporary data anywhere and at any time, and they make DoSS@d more accessible and useful.

Key words: web-based application, databank of data sets, mobile statistics, online analysis, XML document

1 Introduction

Many teaching strategies have been proposed based on educational trials, and many data sets suitable for educational use have been published on the Internet.
However, it remains difficult for users to find data sets suited to their purpose and to obtain documents that describe how to analyze the data (we call this kind of documentation an “analysis story”). Compiling good examples is therefore considered important for learning the processes and procedures of past analyses. Archiving example data sets and the corresponding analysis stories is also expected to facilitate learning statistical packages in a computational environment, where users can follow the steps of an example computation as an instructional tool.

Recognizing the potential of archiving such examples, the authors began developing a kind of databank on the Internet. This databank is an online database of data sets and analysis stories, and it also incorporates an online analysis system that performs automatic analysis based on the analysis story (i.e., using the same parameters as in the original analysis). This environment has been named the “Data-oriented Statistical System”, or DoSS@d [MYY03, HMYY04, MHYY05], where “@d” emphasizes that the system is used for real data. Currently, more than 200 examples, classified by research subject and statistical method, are stored in the database.

When the system is used for statistical education, teaching scenarios can be developed easily, giving students the chance to learn various statistical techniques using real data sets and to master statistical software using the online analysis function. Students can also run their own analyses through simple operations to confirm the results of an analysis story, and can easily examine the effect of using different parameters. Furthermore, statisticians or researchers who wish to evaluate their analyses can use the system to find suitable data sets for their evaluation.
Given the main purpose of the system, a sufficiently large number of good data sets and analysis stories is clearly essential, so we should gather as many of them as possible. We have now added two subsystems to DoSS@d: an online data collection system with a real-time analysis function, and an online document generation system with a database registration function. The data collection system has been developed not only for desktop computers but also for mobile information devices, including portable phones that can connect to the Internet via wireless communication systems, so that we can collect data and perform simple analyses anywhere and at any time. The document generation system provides a user interface for creating an XML document that describes the collected data in the DoSS@d format and for registering it in the DoSS@d database without difficulty. These new systems make DoSS@d more useful and effective and allow users to collect and upload data rapidly and easily.

This paper first presents an outline of DoSS@d and then describes the prototype of the online data collection and document generation systems implemented in DoSS@d.

2 DoSS@d

The overall structure of DoSS@d is shown in Fig. 1. The system is located at http://mo161.soci.ous.ac.jp/@d/ and originally consisted of three subsystems, DoDStat@d (data and analysis story database), DoAStat@d (online analysis system) and DoLStat@d (learning courses), which are shaded dark gray in Fig. 1.

2.1 DoDStat@d (Data-oriented Database of Statistics)

DoDStat@d is the database system of DoSS@d and the core of the whole system. Each stored data set consists of a data description and a data body. The former is written in XML and describes attributes of the data such as data name, description, source/reference, research subject, statistical method, cases, variables and executable analysis, as displayed in Fig. 3.
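As a rough illustration of such a data description, the following Python sketch builds an XML document carrying the attributes listed above. The element names (`data_description`, `data_name` and so on) and the sample values are hypothetical, chosen only for this example; the actual DoSS@d schema is not reproduced here.

```python
import xml.etree.ElementTree as ET

# Attribute names follow the list in the text; the tag spellings are assumed.
ATTRIBUTES = (
    "data_name", "description", "source", "research_subject",
    "statistical_method", "cases", "variables", "executable_analysis",
)


def build_description(values):
    """Return an XML element holding one data description."""
    root = ET.Element("data_description")
    for name in ATTRIBUTES:
        child = ET.SubElement(root, name)
        value = values.get(name, "")
        if isinstance(value, (list, tuple)):
            # Multi-valued attributes (e.g. variables) get one <item> each.
            for item in value:
                ET.SubElement(child, "item").text = str(item)
        else:
            child.text = str(value)
    return root


# Made-up values, for illustration only.
example = build_description({
    "data_name": "Sample survey",
    "research_subject": "Economics",
    "statistical_method": "Regression",
    "cases": 3,
    "variables": ["age", "income"],
})
xml_text = ET.tostring(example, encoding="unicode")
```

Keeping the retrieval keys (research subject, statistical method) as separate elements is one way a search by key could be supported; whether DoDStat@d stores them this way is an assumption.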
The data body is provided as tab-, comma- and space-delimited values, and can be displayed and downloaded through an ordinary browser by clicking a link in the “Data set” area of the data description page (see Fig. 3). DoDStat@d also stores analysis stories written in XML. The user can select an interesting or appropriate data set using retrieval keys such as the research subjects and statistical methods listed in Fig. 2, as well as plain-text keywords (Fig. 3 shows one of the 41 data sets found by searching with the key “Economics”).

Fig. 1. Structure of DoSS@d

Fig. 2. Retrieval keys in DoDStat@d

Category             Key
Research subjects    Agriculture, Economics, Education, Engineering, Government, Medical, Miscellaneous, Nutrition, Psychology, Science, Social science, Sports
Statistical methods  Test, ANOVA, Graph, Regression, Multivariate, Time series, Descriptive

Fig. 3. An example of data description page

2.2 DoAStat@d (Data-oriented Analysis System of Statistics)

DoAStat@d is a web-based application for analyzing any data set stored in DoDStat@d or on the local computer. In addition to allowing users to analyze data sets, DoAStat@d provides a function that lets users easily obtain the same results as described in the analysis story of the data by automatically importing the parameters stored in the XML document of the analysis story. Two versions of DoAStat@d have been implemented: a CGI version using R Server (DoA R) and a Java version using XploRe Quantlet Server (DoA X). See [HMYY04] for details.

2.3 DoLStat@d (Data-oriented Learning System of Statistics)

DoLStat@d is a learning system in which a variety of learning courses, such as “Statistics introductory course”, “Economics course” and “Visualization course”, are provided.
Each course consists of four to seven analysis stories, which are selected from DoDStat@d and ordered pedagogically according to the study target of the course. See [HMYY04] and [MHYY05] for details.

3 Data collection and document generation systems

In addition to the three subsystems described in Sect. 2, we have added two more subsystems to DoSS@d: an online data collection system with a real-time analysis function (middle gray part in Fig. 1) and an online document generation system with a database registration function (light gray part in Fig. 1). Until now, DoSS@d had no online system for data collection and registration. The data sets and analysis stories currently stored in DoSS@d were collected by the authors and their collaborators, converted into the appropriate format and registered in the database by hand. The new subsystems therefore greatly help us to collect new data promptly and to register it in the database without difficulty.

Moreover, when collecting and analyzing data, we often need to collect data at different places within a short period and to observe analyses of the temporary data during collection, especially in research such as public-opinion polls, exit polls and traffic volume surveys. In such situations, a rapid and mobile data collection system with a real-time analysis function is desirable. Such an environment has recently become feasible because wireless network services and portable telephone services that provide Internet access are becoming widely available. We have therefore given the new subsystems greater mobility: we have implemented data handling functions for mobile information devices, including mobile computers and portable phones that can connect to the Internet through wireless communication systems.
3.1 Data Collection System

The Data Collection System consists of three modules:

• Data Collection Manager
• Data Collection Control
• Basic Analysis Control

and a link to the Document Generation System.

First, using Data Collection Manager (Fig. 4), which administers all functions of the Data Collection System, an administrator of the system registers a project in which data collection is conducted, together with the staff (survey conductors) who handle data collection in the project. A survey conductor creates a data entry form for the project on the web using Data Collection Control (Fig. 5). More than one data entry form can be registered to one project. Data Collection Control provides a variety of support tools and templates for creating an entry item easily according to the item’s type, such as single-choice, multiple-choice, numeric open-end and text open-end. A finished data entry form is added to the list in Data Collection Control in two formats: HTML for ordinary Internet browsers (Fig. 6) and mobile HTML for portable phones (Fig. 7). Data Collection Control allows a conductor to set the open and expiry dates of a form, edit items, and modify or delete a form in the list.

When the open date of a form arrives, the form automatically becomes available at a particular URL on the Internet and remains open until its expiry date. Data collectors (or respondents in a questionnaire survey) then enter data into the data entry form one record at a time. Each submitted record is appended to the temporary data body on the server. A conductor and collectors can obtain a summary of simple statistics of the temporary data at any time by using Basic Analysis Control (Fig. 8), which is linked from Data Collection Control. Fig. 9 shows simple statistics and graphs as displayed on a portable phone screen. These functions allow them, for example, to observe how the data change during collection. If they want to analyze the data in detail, DoAStat@d can be used.
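The collection cycle described above (a form that is open between two dates, records appended one at a time, and simple statistics available at any moment) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the implementation used in DoSS@d: the class `EntryForm`, its methods, the item names and the dates are all hypothetical, and treating the expiry date as inclusive is an assumption.

```python
import statistics
from collections import Counter
from datetime import date


class EntryForm:
    """A hypothetical data entry form that accepts records only while open."""

    def __init__(self, open_date, expire_date):
        self.open_date = open_date
        self.expire_date = expire_date
        self.records = []  # stands in for the temporary data body on the server

    def is_open(self, today):
        # Assumption: the form is open on both the open and the expiry date.
        return self.open_date <= today <= self.expire_date

    def submit(self, record, today):
        """Append one record; reject it when the form is not open."""
        if not self.is_open(today):
            return False
        self.records.append(record)
        return True

    def summary(self, numeric_item, choice_item):
        """Simple statistics of the data collected so far."""
        values = [r[numeric_item] for r in self.records]
        return {
            "n": len(values),
            "mean": statistics.mean(values) if values else None,
            "sd": statistics.stdev(values) if len(values) > 1 else None,
            "counts": Counter(r[choice_item] for r in self.records),
        }


# Made-up dates and records, for illustration only.
form = EntryForm(date(2006, 4, 1), date(2006, 4, 30))
form.submit({"age": 20, "answer": "yes"}, date(2006, 4, 2))
form.submit({"age": 30, "answer": "no"}, date(2006, 4, 3))
form.submit({"age": 40, "answer": "yes"}, date(2006, 5, 1))  # after expiry: rejected
result = form.summary("age", "answer")
```

Because `summary` reads whatever has been stored so far, it can be called at any point during collection, which mirrors the role the text gives to Basic Analysis Control.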
A conductor and collectors can also download the data set collected so far in CSV format through an ordinary browser (the mobile HTML version does not provide the download function).

Fig. 4. Data Collection Manager

Fig. 5. Data Collection Control

Fig. 6. An example of data collection form

Fig. 7. Screenshots of data collection forms for portable phone

Fig. 8. Simple statistics of temporary data

Fig. 9. Screenshot of simple statistics for portable phone

3.2 Document Generation System

When a sufficient volume of data has been collected, a conductor checks the data body and confirms whether it is suitable for DoSS@d. If it is suitable, the conductor creates the data description file using Document Generator (Fig. 10), which helps to make an XML document of the data attributes in the DoSS@d format (an XML file is displayed as in Fig. 3 on an ordinary browser) and automatically generates three data bodies (in HTML format and as tab- and space-delimited values) from the CSV file of the data. As shown in Fig. 10, most attributes are imported from the Data Collection System if the data were collected using that system. The completed XML file and the four data bodies, including the CSV file, are automatically registered (uploaded) to DoDStat@d.

Fig. 10. Data description XML generator

4 Concluding remarks

The Data Collection System can also be used on its own as an online data collection and analysis tool. Since the system can be used anywhere and at any time through wired and wireless networks (especially through portable phone services), it extends the potential of mobile data collection and analysis environments. Furthermore, since this system and the Document Generation System serve as user interfaces between data collection and DoSS@d, we can increase the number of data sets in DoDStat@d with the aim of covering all users’ interests. Thus the new subsystems not only strengthen the data collection function, so that users can easily collect data sets and perform simple analyses of the temporary data, but also make DoSS@d more accessible and useful.

References

[MYY03] Mori, Y., Yamamoto, Y., Yadohisa, H.: Data-oriented Learning System of Statistics based on Analysis Scenario/Story (DoLStat). Bulletin of the International Statistical Institute, 54th Session, Proceedings Volume LX, Two Books, Book 2, 74–77 (2003)
[HMYY04] Honda, K., Mori, Y., Yamamoto, Y., Yadohisa, H.: Web-Based Analysis System in Data-oriented Statistical System “DoSS@d”. In: Antoch, J. (ed) COMPSTAT2004 Proceedings in Computational Statistics, 1209–1216, Physica-Verlag (2004)
[MHYY05] Mori, Y., Honda, K., Yadohisa, H., Yamamoto, Y.: Interactive analysis system on the web for data-oriented approaches. The 55th Session of the International Statistical Institute, Abstract Book 306–307 and CD-ROM (2005)