NTU/Intel M2M Project: Wireless Sensor Networks Content Analysis and Management Special Interest Group
Data Analysis Team Monthly Report

1. Team Organization
Principal Investigator: Shou-De Lin
Co-Principal Investigator: Mi-Yen Yeh
Team Members: Chih-Hung Hsieh (postdoc), Yi-Chen Lo (PhD student), Perng-Hwa Kung (graduate student), Ruei-Bin Wang (graduate student), Yu-Chen Lu (undergraduate student), Kuan-Ting Chou (undergraduate student), Chin-En Wang (graduate student)

2. Discussion with Champions
a. Number of meetings with the champion in the current month: 2 (face-to-face)
b. Major comments/conclusions from the discussion: how to use the newly collected data

3. Progress between last month and this month
a. Topic 1: Clustering streams using MSWave.
1) To test performance on large data, we generated synthetic data with the same random walk data model used in the references (http://www.cs.ucr.edu/~eamonn/SIGKDD_trillion.pdf, http://www.cs.ucr.edu/~eamonn/UCRsuite.html). Each stream is a random walk whose step sizes are normally distributed random numbers with mean 0 and standard deviation 1. With this model we generated 12,500 streams of length 12,500 (a minimal sketch of the generator is given after this topic).
2) From Fig. (a), both MSWave-L and MSWave-S still save more transmission cost than CP, PRP, and LEEWAVE-M even as the scale of the data set increases. Furthermore, the differences between the methods are more pronounced than on the temperature data, which is clear when the results are plotted on a semi-log graph: the larger the scale of the data, the better MSWave-L and MSWave-S work. Moreover, the gap between MSWave-L and MSWave-S also widens as |Q| grows, consistent with our discussion in the previous report.
3) Fig. (b) shows the pruning performance of MSWave-L. Although the scale of the data increased, pruning still performed well when the difference between k (here 30) and M (here 500) was large. Thanks to this pruning, the reduction in transmission cost was much more significant than on the temperature data.
(Fig. (a): transmission cost comparison; Fig. (b): pruning performance of MSWave-L. Images not reproduced here.)
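For reference, below is a minimal sketch of the synthetic-data generator described in 1). The function name, fixed seed, and use of NumPy are our own illustrative choices, not the code used in the experiments. Note that the full 12,500 x 12,500 array occupies roughly 1.2 GB as float64, so in practice the streams can also be generated one at a time.

    import numpy as np

    def generate_streams(n_streams=12500, length=12500, seed=0):
        # Each stream is a random walk: the cumulative sum of i.i.d.
        # step sizes drawn from a normal distribution with mean 0 and
        # standard deviation 1.
        rng = np.random.default_rng(seed)
        steps = rng.normal(loc=0.0, scale=1.0, size=(n_streams, length))
        return np.cumsum(steps, axis=1)

    streams = generate_streams()
    print(streams.shape)  # (12500, 12500)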
b. Topic 2: Exploiting Correlation among Sensors
1) Trials:
1. Use closest similarity first to determine the order.
- Program fix: sampling slightly more data (25% -> 27~29%) makes some sensors send more frequently (but does this give better results than random sampling at the same rate?)
2. Change the way sensors are grouped into clusters by the mod operation.
- Example: with cluster size = 12 and mod 5, the grouping (1, 6, 11) (2, 7, 12) (3, 8) ... was changed to the consecutive grouping (1, 2, 3) (4, 5, 6) (7, 8) ...
- No improvement (worse).
3. Handle the sensor with the closest similarity first, without clustering.
- No improvement (worse).
c. Topic 3: Distributed Nearest Neighbor Search of Time Series Using Dynamic Time Warping
1) FTW-based method: FTW with Coarse C and Coarse Q
- The original proposal of the FTW paper
- Both the candidates and the query use the reduced-length setting
- Each segment has only one node
- New segment size = old segment size / 2 at each step
- Reuses the min/max of the old segments
2) FTW-based method: FTW with Accurate C and Coarse Q
- Sites have all nodes of the candidates (sensors); the candidates keep their original streams
- The query uses the reduced-length setting, and each segment has only one node
- Difference:
  - Reduced length: the query uses the reduced-length setting, as in the original FTW
  - Original length: the query uses the original-length setting by duplicating the min/max data of the segments, so each segment has multiple nodes
3) The experimental results show that the FTW-based method performs better than the AP-based method. The robustness and stability of the FTW-based methods will be evaluated in the coming weeks.
d. Topic 4: Learning a sparse model in an on-line and semi-supervised manner.
1) According to the current experimental results:
- The support vector machine model learned with the ramp loss is sparser than the one learned with the hinge loss. However, using the hinge loss in the initial phase of on-line learning provides a well-performing initial decision boundary, and switching to the ramp loss in the subsequent learning alleviates overfitting to outliers while keeping the learned model as sparse as when only the ramp loss is used (a small sketch contrasting the two losses is appended at the end of this report).
- Adding unlabeled data to the training set in the semi-supervised manner did not improve performance compared with using only the labeled training data.
- In the following weeks, we will try to fix the problems of the current semi-supervised learning framework, either by using an improved variant of the ramp loss or by adopting a new semi-supervised learning framework.
4. Brief plan for the next month
a. Continue the paper survey and refine our proposed approaches.
b. Implement our proposed approaches and evaluate their performance.
5. Research Byproducts
a. Paper: N/A
b. Served on the Editorial Board of International Journals: N/A
c. Invited Lectures: N/A
d. Significant Honors / Awards: N/A
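Appendix: loss-function sketch for Topic 4. The sketch below illustrates why the ramp loss yields sparser models than the hinge loss: the ramp loss is the hinge loss clipped at 1 - s, so strongly misclassified outliers contribute a constant loss (and zero gradient) and stop acting as support vectors. The cap parameter s = -1 is our own choice for illustration (a common setting in the ramp-loss literature); this is not our training code.

    import numpy as np

    def hinge_loss(z):
        # Hinge loss: grows without bound as the margin z = y * f(x)
        # decreases, so badly misclassified outliers keep pulling the model.
        return np.maximum(0.0, 1.0 - z)

    def ramp_loss(z, s=-1.0):
        # Ramp loss: the hinge loss clipped at 1 - s; points with margin
        # below s incur a constant loss and no longer become support
        # vectors, which keeps the learned model sparse.
        return np.minimum(1.0 - s, np.maximum(0.0, 1.0 - z))

    margins = np.array([-5.0, -1.0, 0.5, 2.0])
    print(hinge_loss(margins))  # [6.   2.   0.5  0. ]
    print(ramp_loss(margins))   # [2.   2.   0.5  0. ]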