NTU/Intel M2M Project: Wireless Sensor Networks Content Analysis and Management Special Interest Group
Data Analysis Team Monthly Report

1. Team Organization
   Principal Investigator: Shou-De Lin
   Co-Principal Investigator: Mi-Yen Yeh
   Team Members: Chih-Hung Hsieh (postdoc), Yi-Chen Lo (PhD student), Perng-Hwa Kung (Graduate student), Ruei-Bin Wang (Graduate student), Yu-Chen Lu (Undergraduate student), Kuan-Ting Chou (Undergraduate student), Chin-en Wang (Graduate student)

2. Discussion with Champions
   a. Number of meetings with champion in current month: we met the champion once, face to face (F2F), on 7/26.
   b. Major comments/conclusion from the discussion: we discussed the proposal for next year.

3. Progress between last month and this month
   a. Topic 1: Clustering streams using MSWave.
      1) Based on the experiment results so far, the bounds from LEEWAVE do not seem to work well:
         a) Data set: pamap
         b) Dimension: 105
         c) Sample: 3514 instances out of 3465000
      2) Discussion:
         - The Haar wavelet transform is usually applied to time series.
         - However, the data used in the paper are not time series but vectors composed of features with different meanings. (They are not smooth when viewed as a time series.)
         - This may hurt the pruning performance of the Haar wavelet approach.
         - If the experiment results stay bad, we may be able to overcome the problem with some data preprocessing techniques.
         - Most current work uses hashing functions to find the similarity between vectors. (The closer two vectors are, the more similar their hash results are.)
         - After our discussion with En-Hsu, he said that our approach of finding the inner product through wavelets seems better, since we can get true bounds. (To some extent, the Haar wavelet transform is also a kind of hashing function.)
         - We could compare the two methods in different scenarios.
   b. Topic 2: Exploiting Correlation among Sensors
      1) Trials: use closest similarity first to determine the order.
         - Fixing the program: it samples too much data (25% -> 27~29%), which causes some sensors to be sent more frequently (but might this be the cause of the good results compared with random sampling at the same rate?)
      2) Summary
         - The determined order gives a lower MAE than random sampling.
         - Problems
           i. Over-sampled
              1. Some sensors are sent more often.
           ii. Need to compare with random sampling at the same rate.
   c. Topic 3: Distributed Nearest Neighbor Search of Time Series Using Dynamic Time Warping
      1) Progress:
         - Rewriting the testing code of both frameworks
         - New theoretical discovery
           i. FTW-based lower / upper bounds must be increasing / decreasing from one level to the next.
           ii. We can save the signals and space used to keep the lower / upper bounds at the previous level.
         - Discussion on Framework 2
           i. Threshold sent to site
              1. The site compares against the threshold locally.
              2. The site returns a signal to indicate whether the server should continue to send the rest of the query.
              3. If the whole query has been sent and the exact DTWs < threshold, the site returns the exact DTWs to the server to update the threshold.
           ii. Iteration order
              1. Sites are visited from smaller to larger lower bounds.
      2) Pseudocode of Framework 1 and Framework 2:
         - Framework 1
         - Framework 2
   d. Topic 4: Intelligent Transportation System (ITS) Machine Learning Group.
      1) Video, audio, and sensor data (accelerometer, magnetic, gyro, and GPS) were collected with a smartphone while riding a scooter.
      2) Work: predict whether the driver will stop at an intersection, using only sensor data.
         - 117 driving cases at the intersection (25.024819, 121.543399) between Fuxing South Road and Hoping East Road:
           i. 66/117 stop cases (56.4%)
           ii. 51/117 non-stop cases (43.6%)
         - Features used
           i. GPS (longitude, latitude, altitude, GPS_speed, GPS_accuracy, GPS_bearing)
           ii. ACCELEROMETER_x, ACCELEROMETER_y, ACCELEROMETER_z
           iii. ORIENTATION_posx, ORIENTATION_oriw, ORIENTATION_posy, ORIENTATION_posz, ORIENTATION_orix, ORIENTATION_oriy, ORIENTATION_oriz
         - Results
           i. LibLinear
              1. Accuracy = 71.7949% (84/117)
              2. Cross Validation Accuracy = 61.5385%
           ii. LibSVM with RBF kernel
              1. Training Accuracy = 88.8889% (104/117)
              2. Best c=512.0, g=0.03125, CV rate=67.5214%
         - We now focus on:
           i. Collecting more data described by appropriate features.
           ii. Trying to cluster drivers into "Aggressive", "Conservative", and "Neutral" clusters, and doing two-stage prediction.
           iii. Finding some interesting applications of Trajectory Pattern Mining.

4. Brief plan for the next month
   a. We will continue the paper survey and refine our proposed approaches.
   b. We will implement our proposed approaches and evaluate their performance.

5. Research Byproducts
   a. Paper: N/A
   b. Served on the Editorial Board of International Journals: N/A
   c. Invited Lectures: N/A
   d. Significant Honors / Awards: N/A
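
Supplementary sketch for Topic 1 (item 3.a). The discussion there notes that finding the inner product through wavelets gives true bounds. The block below is a minimal illustration of that idea and is not necessarily the bound LEEWAVE or MSWave uses. It relies on two facts: the orthonormal Haar transform preserves inner products, and the Cauchy-Schwarz inequality bounds the contribution of the coefficients that have not been seen yet. The function names `haar_transform` and `inner_product_bounds`, and the assumption that each side knows the residual energy of its untransmitted coefficients, are ours for illustration.

```python
import math


def haar_transform(x):
    """Orthonormal Haar transform; coefficients are ordered coarse to fine.

    Assumes len(x) is a power of two (pad the vector otherwise).
    """
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    approx = [float(v) for v in x]
    details = []
    while len(approx) > 1:
        pairs = range(0, len(approx), 2)
        sums = [(approx[i] + approx[i + 1]) / math.sqrt(2) for i in pairs]
        diffs = [(approx[i] - approx[i + 1]) / math.sqrt(2) for i in pairs]
        details = diffs + details  # coarser details end up in front
        approx = sums
    return approx + details


def inner_product_bounds(cx, cy, k):
    """Lower and upper bound on <x, y> from the first k Haar coefficients.

    The unseen coefficients contribute at most ||rest_x|| * ||rest_y||
    in absolute value (Cauchy-Schwarz), so only two scalars are needed
    besides the k transmitted coefficients.
    """
    partial = sum(a * b for a, b in zip(cx[:k], cy[:k]))
    rest_x = math.sqrt(sum(a * a for a in cx[k:]))
    rest_y = math.sqrt(sum(b * b for b in cy[k:]))
    slack = rest_x * rest_y
    return partial - slack, partial + slack


if __name__ == "__main__":
    # Hypothetical feature vectors, for illustration only.
    x = [1.0, 4.0, -2.0, 3.0, 0.5, 2.5, -1.0, 6.0]
    y = [2.0, -1.0, 0.0, 5.0, 3.0, 1.0, 4.0, -2.0]
    cx, cy = haar_transform(x), haar_transform(y)
    for k in (1, 2, 4, 8):
        print(k, inner_product_bounds(cx, cy, k))
```

For non-smooth feature vectors, much of the energy sits in the fine-scale coefficients, so the slack term stays large until most coefficients have been sent. This is consistent with the concern about pruning performance raised in the discussion.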
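
Supplementary sketch for Topic 3 (item 3.c). The Framework 2 discussion describes sites comparing lower bounds with a threshold locally, returning exact DTWs that update the threshold, and visiting sites from smaller to larger lower bounds. The block below simulates that loop in a single process as a rough illustration, not as the project's pseudocode. Several simplifications are our assumptions: a single cheap endpoint lower bound stands in for the multi-level FTW bounds, the query is not transmitted progressively, and the names `dtw`, `endpoint_lower_bound`, and `nearest_neighbor` are hypothetical.

```python
import math


def dtw(a, b):
    """Exact DTW distance with absolute-difference cost."""
    n, m = len(a), len(b)
    prev = [math.inf] * (m + 1)
    prev[0] = 0.0
    for i in range(1, n + 1):
        cur = [math.inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            cur[j] = cost + min(prev[j], cur[j - 1], prev[j - 1])
        prev = cur
    return prev[m]


def endpoint_lower_bound(a, b):
    """Every warping path contains the first and the last cell,
    so their costs can never exceed the DTW distance."""
    first = abs(a[0] - b[0])
    if len(a) == 1 and len(b) == 1:
        return first  # first and last cell coincide
    return first + abs(a[-1] - b[-1])


def nearest_neighbor(query, sites):
    """sites maps a site id to its list of series.

    Returns (best distance, (site id, series index), exact DTW count).
    """
    bounds = {
        sid: [endpoint_lower_bound(query, s) for s in series]
        for sid, series in sites.items()
    }
    # Iteration order: sites from smaller to larger lower bounds.
    order = sorted(sites, key=lambda sid: min(bounds[sid], default=math.inf))
    threshold, best, exact_count = math.inf, None, 0
    for sid in order:
        if min(bounds[sid], default=math.inf) >= threshold:
            break  # every remaining site has a lower bound at least this large
        for idx, series in enumerate(sites[sid]):
            if bounds[sid][idx] >= threshold:
                continue  # pruned by the local comparison with the threshold
            dist = dtw(query, series)
            exact_count += 1
            if dist < threshold:
                # The site reports the exact DTW; the server updates the threshold.
                threshold, best = dist, (sid, idx)
    return threshold, best, exact_count


if __name__ == "__main__":
    # Hypothetical series, for illustration only.
    query = [0.0, 1.0, 2.0, 1.0, 0.0]
    sites = {
        "A": [[5.0, 6.0, 7.0, 6.0, 5.0], [0.0, 1.0, 1.0, 2.0, 1.0, 0.0]],
        "B": [[0.0, 2.0, 4.0, 2.0, 0.0], [9.0, 9.0, 9.0]],
    }
    print(nearest_neighbor(query, sites))
```

Because the threshold only shrinks, visiting sites with small lower bounds first tends to tighten it early, which lets later sites be skipped without computing any exact DTW.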