T2 Data Processing Parser
Haofu Liao, Co-op Student
EMC Corporation
Haofu.Liao@emc.com

I. Design and Algorithm

The parser is the most important part of the T2 data processing tool. A parser should be able to parse one or several .csv-format T2 files and save the results to the database.

Concerns

When designing a parser, there are several concerns about T2 files:

§ Missing Metrics – T2 files don't contain the metrics we are interested in
§ Missing Timestamps – T2 files aren't recorded strictly every 5 minutes
§ Missing Files – we lost the T2 files from certain days
§ "-1" – some data values are -1, which means the value is invalid
§ Large Values – some data values are extremely large

For missing metrics, we put "-1" as the value in the database. For missing timestamps, we simply ignore them and reduce the total time of that day. Missing files can be taken as a special case of missing timestamps: if a file is missing, then all the timestamps in that file are missing. Handling "-1" is tricky; the rules are explained in the Appendix. Large values are not handled in the processing phase; it is the analyzer's job to remove outliers in post-processing.

Structure

.csv Pool → Parser → Cache → Database

As the diagram shows, the parser works in the following way: it parses the .csv files from the same system one by one, saving and updating the results in the cache. After all the files from the same system have been parsed, the results are flushed back to the database.

Algorithm

The algorithm described here parses a single T2 file. It is exactly how the "Parser.pm" module works. For more detail about "Parser.pm", see the Implementation section.

1. Index all fields in the .csv file and store each field in a 2-D array by its index. For example, the field located at the second row and the third column will be stored in the 2-D array "data" at location (1, 2).
So we can later query that field with "data[1][2]" (in Perl syntax, indices start from 0).

2. Scan the 2-D array "data" and record useful information and its indices. If Step 1 can be thought of as adding page numbers to a book, then Step 2 is adding the table of contents. For example, in later processing we need to know the indices of a certain metric, so that we can get the metric's data by its name and its category. Below are the items we scan from "data":

a. Date (more precisely, the starting date) – Parse the field that contains the date information and record "Date" directly.

b. Interval (window size) – Parse the field that contains the interval information and record "Interval" directly.

c. Basic Time Info (includes Start Time, End Time, Number of Timestamps, Start Time Index, and End Time Index) – The processing here is tricky. Sometimes data is recorded across midnight, so we need to split the data into two days. We detect midnight by scanning the timestamps one by one and recording where a timestamp decreases. When midnight is crossed, we record two sets of Basic Time Info. Each set of Basic Time Info is stored in a hash table with "Date" as its key.

d. System ID – Parse the field that contains the system ID information and record "System ID" directly.

e. Category and Metric – Get the index of a certain metric of a certain category, and store it in a 2-D hash table keyed by "Category" and "Metric". Since some of the metrics from Directors BE and Directors FE may come with two forms of director name ("05E" and "FA-05E"), we do not record the index of the director name without the uppercase-letter prefix.

3. For each date (possibly two days in a file) and each metric, read its data from "data" by its indices, using the information gathered in Step 2 to query the indices. After we get the data, store it in a matrix data structure (provided by a linear algebra package).
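Step 2c above can be illustrated with a short sketch. The parser itself is written in Perl; the sketch below uses Python purely for illustration, and the function name is hypothetical. It shows only the core idea: a timestamp smaller than its predecessor means the file crossed midnight, so the data is split into a second day.

```python
def split_days(timestamps):
    """Split a list of zero-padded "HH:MM" timestamps into per-day runs.

    A timestamp that is smaller than its predecessor means the file
    crossed midnight, so a new day starts there. Illustrative sketch
    only; the real Parser.pm records a full set of Basic Time Info
    (start/end time, indices, counts) for each detected day.
    """
    days = []
    current = []
    prev = None
    for ts in timestamps:
        if prev is not None and ts < prev:  # decreasing => midnight crossed
            days.append(current)
            current = []
        current.append(ts)
        prev = ts
    if current:
        days.append(current)
    return days

# "23:50" and "23:55" belong to the first day; "00:00" and "00:05" to the next.
print(split_days(["23:50", "23:55", "00:00", "00:05"]))
# → [['23:50', '23:55'], ['00:00', '00:05']]
```

Note that lexicographic comparison of zero-padded "HH:MM" strings matches chronological order, so no time parsing is needed for the detection itself.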
4. Calculate the "Max", "Avg", and "Total Count" with the rules described in the Appendix.

II. Implementation

Script and Class List

Name            | Type         | Description
Parser          | Class (.pm)  | The core algorithm of the parser. Collects data from a .csv file and stores it in the cache
Cache           | Class (.pm)  | A multi-key hash table for storing the data from the parser; after all data from one system has been stored, it flushes all its data back to the database
IOStream        | Class (.pm)  | An interface class for writing and reading data from file, database, cache, or memory
Document        | Class (.pm)  | The data structure for each item that should be stored in the cache or database
Database        | Class (.pm)  | Used for communication between the parser and MongoDB
Chrono          | Class (.pm)  | Used for handling the date and time operations in the .csv file
Algorithms      | Class (.pm)  | Algorithms used in data aggregation
Miscellaneous   | Class (.pm)  | Other functions needed by the parser
analysis.pl     | Script (.pl) | Generates a report about T2 data and saves it to MongoDB
export_database | Script (.pl) | Exports data from MongoDB to .csv files
export_report   | Script (.pl) | Exports a report about T2 data to a .csv file
installation    | Script (.pl) | Installs the dependent libraries used by the parser: Carp, Data::Dumper, DateTime, DateTime::Format::Natural, ExtUtils::F77, MongoDB, Moose, OpenGL, PDL, Text::CSV
parser          | Script (.pl) | Executable parser program
query           | Script (.pl) | Displays MongoDB content
Setting         | Class (.pm)  | Settings for the parser

The list above contains all the files we have for the parser. Among them, the "Document.pm", "Cache.pm", "Database.pm", and "IOStream.pm" modules write and update data in the different data structures. "Parser.pm" is the core module; it does the actual parsing work. Its "parse" method can only parse one .csv file at a time. "parser.pl" is the main program of the parser. It parses all the .csv files in a folder and inserts the results into the database.
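The parse-then-flush flow that Cache.pm implements (described in the Structure section) can be sketched as follows. This is Python for illustration only, and all names (Cache, FakeDB, update, flush) are hypothetical stand-ins: the real Cache.pm is a Perl multi-key hash table that flushes to MongoDB through Database.pm.

```python
class Cache:
    """Minimal sketch of the cache described above: accumulate parsed
    results per system, then flush everything to the database at once."""

    def __init__(self, db):
        self.db = db     # stand-in for the Database module
        self.store = {}  # (system_id, date, category, metric) -> document

    def update(self, key, doc):
        # Parsing several files from one system may update the same key
        # repeatedly; only the latest merged document is kept in memory.
        self.store[key] = doc

    def flush(self):
        # After all files from one system are parsed, push everything out.
        for key, doc in self.store.items():
            self.db.insert(key, doc)
        self.store.clear()

class FakeDB:
    """Trivial in-memory stand-in for the MongoDB wrapper."""
    def __init__(self):
        self.rows = []
    def insert(self, key, doc):
        self.rows.append((key, doc))

db = FakeDB()
cache = Cache(db)
cache.update(("sys1", "2013-07-01", "System", "ios per sec"), {"Avg": 120.0})
cache.flush()
print(len(db.rows))  # → 1 (one document flushed to the fake database)
```

Batching the writes this way means the database sees one burst of inserts per system instead of one insert per metric per file, which is the motivation for having a cache layer at all.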
How to Use

To use this program, first install all the modules from the installation list (installation.pl). Then run parser.pl by typing:

perl.exe -I [Parser Module Path] [parser.pl File Path] [CSV Folder Path] [MongoDB Server]

where the "MongoDB Server" option should be in the following format: mongodb://server:port.

Appendix – Averaging Methodology

Notation: v_t is the value recorded at timestamp t; for director metrics, v_dt is the value for director d at timestamp t. The examples below use 4 timestamps and 3 directors.

System

Case: All data (v_1, v_2, v_3, v_4)
  Avg: (v_1 + v_2 + v_3 + v_4) / 4
  Max: max(v_1, v_2, v_3, v_4)
  Total Count: 4
  Update Avg: average weighted by total time (equal to total count times interval)

Case: "-1" for certain timestamps (e.g. v_1, -1, v_3, -1)
  Avg: (v_1 + v_3) / 2
  Max: max(v_1, v_3)
  Total Count: number of non-"-1" values (here 2)
  Update Avg: average weighted by total time (equal to total count times interval)

Case: All "-1" (-1, -1, -1, -1)
  Avg: -1
  Max: -1
  Total Count: -1
  Update Avg: "-1" if no previous avg exists; otherwise keep the previous avg

Directors FE

Max rule: max(v_11 … v_34) (the maximum of all elements) does not make sense for the FE metrics we are tracking. Instead, we use the average of the whole file as the first max; if another file comes along with a higher average, that average becomes the max. In other words, we track the maximum of averages over T2 files. This applies to all FE director cases below.

Case: All data
  Avg: average horizontally (within each director), then average vertically (across directors)
  Total Count: 4
  Update Avg: average weighted by total time (equal to total count times interval)

Case: All "-1"
  Avg: -1
  Max: -1
  Update Avg: "-1" if no previous avg exists; otherwise keep the previous avg

Case: "-1" for certain timestamps (only)
  Avg: as in the all-data case, but during horizontal aggregation "-1" values are ignored from the total time, and the denominator of the averaging function is the number of non-"-1" values
  Max: maximum of all non-"-1" values
  Total Count: number of non-"-1" values (here 2)
  Update Avg: average weighted by total time (equal to total count times interval)

Case: Metric not present or "-1" for the whole 4-hour window (only)
  Avg: during vertical aggregation, if the data is not present or is all "-1" for a particular FE director, that director is ignored and the denominator of the averaging function is the number of FE directors for which data is present
  Max: maximum of all non-"-1" values
  Total Count: 4
  Update Avg: average weighted by total time (equal to total count times interval)

Case: "-1" for both certain timestamps and the whole 4-hour window
  Avg: e.g. (avg(v_12, v_32) + avg(v_14, v_34)) / 2 = (v_12 + v_32 + v_14 + v_34) / 4
  Max: max(v_12, v_32, v_14, v_34)
  Total Count: number of non-"-1" values (here 2)
  Update Avg: average weighted by total time (equal to total count times interval)

Case: "-1" appears randomly
  This is tricky, and no such situation exists in the sample database.
  Avg / Max / Total Count / Update Avg: N/A

Directors BE

Case: All data
  Avg: average horizontally (within each director), then add vertically (across directors)
  Max: with column sums s_1 = v_11 + v_21 + v_31, s_2 = v_12 + v_22 + v_32, s_3 = v_13 + v_23 + v_33, s_4 = v_14 + v_24 + v_34: max(s_1, s_2, s_3, s_4), i.e. the maximum of the sum of each column
  Total Count: 4
  Update Avg: average weighted by total time (equal to total count times interval)

Case: All "-1"
  Avg: -1
  Max: -1
  Update Avg: "-1" if no previous avg exists; otherwise keep the previous avg

Case: "-1" for certain timestamps (only)
  Avg: as in the all-data case, but during horizontal aggregation "-1" values are ignored from the total time, and the denominator of the averaging function is the number of non-"-1" values
  Max: maximum of the sum of each column, where columns that contain "-1" are ignored (e.g. max(s_1, s_3))
  Total Count: number of non-"-1" values (here 2)
  Update Avg: average weighted by total time (equal to total count times interval)

Case: Metric not present or "-1" for the whole 4-hour window (only)
  Avg: during vertical aggregation, if the data is not present or is all "-1" for a particular BE director, ignore that director, add the existing data together, and then scale by the total number of rows, e.g. (avg(v_12, v_32) + avg(v_14, v_34)) / 2 × 3 = (v_12 + v_32 + v_14 + v_34) × 3 / 4
  Max: -1 (no column is fully non-"-1")
  Total Count: 4
  Update Avg: average weighted by total time (equal to total count times interval)

Case: "-1" for both certain timestamps and the whole 4-hour window
  This is tricky, and no such situation exists in the sample database. (Even if it did, the method stated in the case above could still be used.)
  Avg / Max / Total Count / Update Avg: N/A
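The core "-1"-handling rules from the Appendix can be sketched compactly. The parser itself is Perl; this sketch is Python for illustration, all function names are hypothetical, and it covers only the Avg rules: horizontal averages ignore "-1" entries, fully invalid directors are dropped from the vertical denominator, and BE (where directors are added vertically) compensates for a missing director by scaling to the full number of rows, as in the Appendix example.

```python
INVALID = -1

def avg_ignore_invalid(values):
    """Average one director's row, ignoring "-1" entries.
    The denominator is the number of valid (non-"-1") values."""
    valid = [v for v in values if v != INVALID]
    return sum(valid) / len(valid) if valid else INVALID

def fe_avg(matrix):
    """Directors FE: average horizontally, then average vertically.
    A director whose whole window is "-1" is dropped from the
    vertical denominator."""
    row_avgs = [a for a in (avg_ignore_invalid(r) for r in matrix) if a != INVALID]
    return sum(row_avgs) / len(row_avgs) if row_avgs else INVALID

def be_avg(matrix):
    """Directors BE: average horizontally, then add vertically.
    A missing director is compensated by scaling the average of the
    present rows up to the full row count (the "× 3" in the Appendix)."""
    rows = len(matrix)
    row_avgs = [a for a in (avg_ignore_invalid(r) for r in matrix) if a != INVALID]
    if not row_avgs:
        return INVALID
    return sum(row_avgs) / len(row_avgs) * rows

m = [[2.0, -1, 4.0, -1],   # director 1: "-1" at some timestamps
     [-1, -1, -1, -1],     # director 2: whole 4-hour window invalid
     [6.0, -1, 8.0, -1]]   # director 3: "-1" at some timestamps
print(fe_avg(m))  # → 5.0   ((avg(2,4) + avg(6,8)) / 2)
print(be_avg(m))  # → 15.0  (5.0 scaled by 3 rows)
```

Note that 15.0 matches the Appendix's BE formula (v_12 + v_32 + v_14 + v_34) × 3 / 4 = (2 + 4 + 6 + 8) × 3 / 4 for this example.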