T2 Data Processing Architecture

Haofu Liao
Co-op Student, EMC Corporation
Haofu.Liao@emc.com

I. Introduction

What Is T2 Data?

T2 data is the data we collect from a VMAX or other storage system. It records the performance information of the system. The information is divided into five categories: System, RDFAStats Ports FE, Directors RDF, Directors FE, and Directors BE. For each category, several metrics are recorded.

Below is a sample of a T2 file, taken from a small part of the System-level data. As it shows, a T2 file typically records the read and write information of a system, and a sample is usually recorded every 5 minutes.

    System     Metric          19:05     19:10     19:15
    192604029  ios per sec     26329.15  24498.35  22864.79
    192604029  reads per sec   16842.43  13754.35  11611.58
    192604029  writes per sec  9486.72   10744     11253.21

Main Task

The goal of this processing is to extract statistical information from the T2 data. But since the T2 data is huge and the structure of T2 files is complex, processing and analyzing it is a challenge. Our task starts from a T2 data warehouse. It contains the T2 data of thousands of systems, and for each system there are thousands of T2 files. Our main job is to process these T2 files one by one and collect the information we want from them.

There are four requirements for the processed data:

1. The data should be reusable. Each time we want to do another analysis, or have a new requirement for the data, we should still be able to use our current results.
2. Any analysis based on the processed data must be doable in a short time; we should not have to process the T2 data again to get the results we need.
3. The size of the processed data should not be large. Since the storage capacity we can use is limited, we have to keep the processed data under 1 TB.
4. The processing time should not be too long.

II. Design

Key Features

According to the requirements above, I came up with three key features that our processing tool should have: a database, aggregation, and parallel computing.

Database – A database is needed in our design because with a database we can query the processed T2 data easily and retrieve it quickly. Hence, further analysis based on the processed T2 data will be very efficient and can be done in a short time.

Aggregation – As mentioned before, T2 data is usually recorded every 5 minutes. Since in most cases our analysis does not need that level of detail, we can aggregate the 5-minute data into daily data and thus greatly reduce the size of the processed data. In this way, we only need to keep the daily peak, the daily average, and some timing information. Below is a sample of the aggregated data (a sketch of the aggregation step follows the table):

    Date: 3/23/2013
    Category  Metric              Avg  Max   Total Time  Total Count
    System    writes per sec      500  1578  85800       286
    System    read hits per sec   273  1680  85800       286
    System    write hits per sec  497  1573  85800       286
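To make the aggregation step concrete, here is a minimal sketch in C# (the language the tool is written in), assuming the 5-minute samples for a single metric have already been parsed out of a .csv file. The type and member names here are illustrative, not the tool's actual implementation.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // One aggregated record, matching the columns of the sample above.
    class DailyStat
    {
        public DateTime Date;
        public double Avg;       // daily average
        public double Max;       // daily peak
        public int TotalTime;    // seconds covered by the samples
        public int TotalCount;   // number of 5-minute samples
    }

    static class Aggregator
    {
        const int IntervalSeconds = 300;  // T2 data is recorded every 5 minutes

        // Collapse per-5-minute samples into one record per day.
        public static List<DailyStat> ToDaily(
            IEnumerable<(DateTime Time, double Value)> samples)
        {
            return samples
                .GroupBy(s => s.Time.Date)
                .Select(g => new DailyStat
                {
                    Date = g.Key,
                    Avg = g.Average(s => s.Value),
                    Max = g.Max(s => s.Value),
                    TotalCount = g.Count(),
                    TotalTime = g.Count() * IntervalSeconds
                })
                .OrderBy(d => d.Date)
                .ToList();
        }
    }

With 286 samples in a day, TotalTime works out to 286 × 300 = 85,800 seconds, which matches the Total Time column in the sample above.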
Parallel Computing – Considering the size of the T2 data we are going to process, a parallel computing mechanism is desired.

A Simple Work Flow

Before introducing the structure of the processing tool, which is designed for parallel computing, let us first look at how we can accomplish our goal serially. Below is the structure of a simple work flow:

    T2 data from one system -> Extractor -> .csv pool -> Parser -> Database

In this structure, we process the T2 data from one system at a time. All of the processing is system based: we only process T2 data from a single system at once, and we do not pick up data from another system until all the data from the current system has been processed. The original T2 data in the data warehouse is compressed, in the .ttp format, so we use an extractor to extract the .csv files we need from the compressed T2 data. After we have all the .csv files for a system, we parse and aggregate them one by one, then insert the results into the database.

Architecture

Below is the architecture of my processing tool. Basically, it does the same thing a single work flow does; the difference is that each worker (downloader, extractor, and parser) works independently, and you can launch as many workers as you want.

    T2 Data Warehouse
      |
      +-- Workflow 1: Downloader 1 -> .zip Pool 1 -> Extractor 1 -> .csv Pool 1 -> Parser 1 --+
      +-- Workflow 2: Downloader 2 -> .zip Pool 2 -> Extractor 2 -> .csv Pool 2 -> Parser 2 --+-> Database
      +-- ...                                                                                 |
      +-- Workflow n: Downloader n -> .zip Pool n -> Extractor n -> .csv Pool n -> Parser n --+

    (All workers report to, and take commands from, the Schedule, which is managed by the Scheduler.)

1. Workflow

There are n workflows in the structure (the value of n is determined by the performance of the server). Each workflow consists of a downloader, a .zip pool, an extractor, a .csv pool, and a parser.

2. Downloader (system based)

A downloader downloads source files from the remote server automatically. We need a downloader because downloading is slow compared with the speed of processing.

Notice: A downloader is useful horizontally (within each workflow): it separates the work of downloading from extracting, so the extractor can work continuously without waiting for a file to be downloaded. But since network bandwidth is limited, downloaders are unnecessary vertically (across workflows). I still keep n downloaders to separate the work logically; this keeps the design of the schedule (discussed later) simple and ensures that no two extractors work on the same file. A sleeping mechanism is designed to prevent the downloaders from racing for network bandwidth.

Features:
a. Sleep if the .zip pool is full, and update its status
b. Look up the path of an un-downloaded system in the schedule
c. Download the system's .zip folder from the remote server to the .zip pool
d. Update the current system's status in the schedule
e. Check the current scheduler command (worker status) and change working mode accordingly
f. Reset invalid system statuses in the schedule (downloading related) and remove invalid downloaded .zip files

3. .zip Pool

A .zip pool stores all the .zip files downloaded from the source. Each .zip pool consists of several system .zip folders.

4. Extractor (system based)

An extractor extracts .csv files from .zip files. (A sketch of the main loop shared by the workers follows section 5.)

Features:
a. Sleep if the .zip pool is empty or the .csv pool is full, and update its status
b. Look up a downloaded system in the schedule
c. Extract .csv files from the .zip files and put them into the corresponding system .csv folder
d. Update the current system's status in the schedule
e. Check the current scheduler command (worker status) and change working mode accordingly
f. Delete the system .zip folder after all the .zip files in the folder have been extracted
g. Reset invalid system statuses in the schedule (extracting related) and remove invalid extracted .csv files

5. .csv Pool

A .csv pool stores all the .csv files extracted from the .zip pool. Each .csv pool consists of several system .csv folders.
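Here is a minimal sketch of the main loop the workers share, using the downloader as the example (the extractor and parser follow the same pattern with the pools swapped). All of the types below are illustrative stand-ins for the tool's actual schedule and pool classes, not its real code.

    using System;
    using System.Threading;

    enum WorkerCommand { Start, Sleep, Stop }

    // Illustrative stand-ins for the tool's schedule and pool classes.
    interface ISchedule
    {
        WorkerCommand GetCommand(string workerType, int workflowId);
        void ReportStatus(string workerType, int workflowId, string status);
        string ClaimUndownloadedSystem(int workflowId);  // system id, or null
        void MarkDownloaded(string systemId, int workflowId);
    }

    interface IZipPool
    {
        bool IsFull { get; }
        void Download(string systemId);  // fetch the system's .zip folder
    }

    static class DownloaderLoop
    {
        public static void Run(ISchedule schedule, IZipPool zipPool, int workflowId)
        {
            while (true)
            {
                // Feature e: obey the scheduler's current command.
                WorkerCommand command = schedule.GetCommand("Downloader", workflowId);
                if (command == WorkerCommand.Stop)
                    break;

                // Feature a: sleep if the .zip pool is full, and report the status.
                if (command == WorkerCommand.Sleep || zipPool.IsFull)
                {
                    schedule.ReportStatus("Downloader", workflowId, "Sleep");
                    Thread.Sleep(TimeSpan.FromSeconds(30));
                    continue;
                }

                // Feature b: look up an un-downloaded system in the schedule.
                string systemId = schedule.ClaimUndownloadedSystem(workflowId);
                if (systemId == null)
                {
                    Thread.Sleep(TimeSpan.FromSeconds(30));  // nothing left to download
                    continue;
                }

                // Features c and d: download the system's .zip folder into the
                // pool, then record the progress in the schedule.
                zipPool.Download(systemId);
                schedule.MarkDownloaded(systemId, workflowId);
            }
        }
    }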
6. Parser (system based)

A parser parses the .csv files and sends the results to the database.

Features:
a. Sleep if the .csv pool is empty, and update its status
b. Pick a system .csv folder for parsing
c. Parse the .csv files of the entire system; the results are stored temporarily in a cache (a multi-key hash)
d. Update the current system's status in the schedule
e. Check the current scheduler command (worker status) and change working mode accordingly
f. Delete the system .csv folder after all the .csv files in the folder have been parsed, then flush all the results in the cache back to the database
g. Reset invalid system statuses in the schedule (parsing related) and remove invalid extracted .csv files

7. Schedule

A schedule is the intermediary between the workers (downloaders, extractors, and parsers) and the scheduler. The scheduler sends commands to the workers through the schedule, and the workers report their working status to the scheduler through the schedule.

Two types of records are stored in the schedule:

a. System status

    System ID | Path | Is Downloading | Is Downloaded | Is Extracting | Is Extracted | Is Parsing | Is Parsed

"System ID" is the id of the current system, and "Path" is the folder path of the current system on the remote server. "Is Downloading" and "Is Downloaded" record the working status of the downloader on the current system; "Is Extracting" and "Is Extracted" record the working status of the extractor; and "Is Parsing" and "Is Parsed" record the working status of the parser.

Seven working statuses can appear in the schedule:

              Is Downloading  Is Downloaded  Is Extracting  Is Extracted  Is Parsing   Is Parsed
    Status 1  0               0              0              0             0            0
    Status 2  Workflow ID     0              0              0             0            0
    Status 3  0               Workflow ID    0              0             0            0
    Status 4  0               Workflow ID    Workflow ID    0             0            0
    Status 5  0               Workflow ID    0              Workflow ID   0            0
    Status 6  0               Workflow ID    0              Workflow ID   Workflow ID  0
    Status 7  0               Workflow ID    0              Workflow ID   0            Workflow ID

b. Worker status

    Worker Type | Workflow ID | Worker Status

"Worker Type" is one of Downloader, Extractor, or Parser; "Workflow ID" is 1 … n; and "Worker Status" is one of Start, Sleep, or Stop.

8. Scheduler

The scheduler monitors and controls worker status.

Features:
a. Initialize the schedule
b. Create workers
c. Look up system status
d. Modify worker status
e. Display current configurations
f. Rebuild the schedule from the database
g. Update the schedule
h. Reset the schedule

III. Implementation

Interface, Programs and Source Files

Based on the design above, I developed a processing tool. Its source files are listed below; the console interface and its commands are described in the next part.

Configuration.cs – Contains all the parameters this tool needs. This class is designed to hold the input JSON content.
Connection.cs – Records MongoDB server information.
Downloader.cs – The downloader class. Offers all the methods related to a downloader.
Extractor.cs – The extractor class. Offers all the methods related to an extractor.
LogWindow.cs – A log window where all the logs are dumped, so the log information of each worker can be looked up.
Metrics.cs – Connects to the "metrics" collection of the remote MongoDB database and offers several methods for querying this collection.
Parser.cs – The parser class. Offers all the methods related to a parser.
Settings.cs – Originally intended to hold all the settings for the workers and the scheduler; in the latest version of the tool it stores the input configurations and parameters.
Schedule.cs – Connects to the "schedule" collection of the remote MongoDB database and offers several methods for querying this collection.
Scheduler.cs – The scheduler class. Offers all the methods related to a scheduler.
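As an illustration of how Schedule.cs might talk to the "schedule" collection, here is a minimal sketch assuming the official MongoDB C# driver. The collection and field names follow the schedule records described in section 7; the database name and the method shapes are assumptions, not the tool's actual code.

    using MongoDB.Bson;
    using MongoDB.Driver;

    class Schedule
    {
        private readonly IMongoCollection<BsonDocument> _systems;

        public Schedule(string server)  // e.g. "mongodb://server:port"
        {
            var client = new MongoClient(server);
            var db = client.GetDatabase("t2");  // database name assumed
            _systems = db.GetCollection<BsonDocument>("schedule");
        }

        // Find one system that no downloader has touched yet (Status 1).
        public BsonDocument FindUndownloadedSystem()
        {
            var filter = Builders<BsonDocument>.Filter.Eq("Is Downloading", 0)
                       & Builders<BsonDocument>.Filter.Eq("Is Downloaded", 0);
            return _systems.Find(filter).FirstOrDefault();
        }

        // Move a system from Status 2 (downloading) to Status 3 (downloaded).
        public void MarkDownloaded(string systemId, int workflowId)
        {
            var filter = Builders<BsonDocument>.Filter.Eq("System ID", systemId);
            var update = Builders<BsonDocument>.Update
                .Set("Is Downloading", 0)
                .Set("Is Downloaded", workflowId);
            _systems.UpdateOne(filter, update);
        }
    }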
This tool consists of four binaries: Downloader.exe, Extractor.exe, Parser.exe, and Scheduler.exe. For details about how to run the workers, type "[Worker].exe --help" on the command line. Even though the worker binaries can run independently, it is better to launch them from Scheduler.exe.

How to Use Scheduler.exe

Before launching Scheduler.exe, you first need to create a configuration file for it. The configuration file is a JSON file with the following fields (a sample configuration file appears at the end of this section):

SourcePath – The path of the T2 data warehouse
BenchmarkPath – The path of the benchmark folder, which contains the .zip pools and .csv pools for the workflows
Server – The MongoDB daemon's (server's) IP and port, in the format mongodb://server:port
SystemTypes – Restricts processing to certain types of systems by serial number, in the format "[Type1]|[Type2]|[Type3]|..."
ZipPoolName – The name of the .zip pool
ZipPoolVolume – The maximum number of folders a .zip pool can contain
CsvPoolName – The name of the .csv pool
CsvPoolVolume – The maximum number of folders a .csv pool can contain
DownloaderPath – The path of Downloader.exe
ExtractorPath – The path of Extractor.exe
ParserPath – The path of Parser.exe
WorkFlowIds – The ids of the workflows, in the format [Id1, Id2, Id3, ...]
Ttp2BtpProgram – The path of the StpTtpCnv.exe program from Stptools
Btp2CsvProgram – The path of the StpRpt.exe program from Stptools
MetricFilter – The path of the metric filter for the StpRpt.exe program
PerlProgram – The path of the perl.exe program from the Perl programming language
PerlModulePath – The path of the Perl modules used by parser.pl
PerlScriptPath – The path of the parser.pl script

After the configuration file has been created, there are two ways to use it to configure Scheduler.exe. The first is to put the file in the same folder as Scheduler.exe and run Scheduler.exe directly. The second is to run "Scheduler.exe --config [Configuration file path]" on the command line.

Once Scheduler.exe is running, use the following commands to control the processing:

Environment – Display current configurations
Launch [Downloaders/Extractors/Parsers] – Launch workers
Stop [Downloaders/Extractors/Parsers] – Stop workers
Clean Up – Reset the schedule
Rebuild – Rebuild the schedule from the database
Update – Update the system path field of the schedule
WorkerMonitor [Downloader/Extractor/Parser] [ID] – Open the log window of a worker
Worker Status – Look up worker status
Exit – Exit this program
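For reference, a configuration file with the fields above might look like the following. All paths, the server address, the pool sizes, and the workflow ids are illustrative values only, and the SystemTypes value keeps the placeholder format described above.

    {
      "SourcePath": "\\\\fileserver\\T2Warehouse",
      "BenchmarkPath": "C:\\T2\\Benchmark",
      "Server": "mongodb://localhost:27017",
      "SystemTypes": "[Type1]|[Type2]",
      "ZipPoolName": "ZipPool",
      "ZipPoolVolume": 5,
      "CsvPoolName": "CsvPool",
      "CsvPoolVolume": 5,
      "DownloaderPath": "C:\\T2\\Downloader.exe",
      "ExtractorPath": "C:\\T2\\Extractor.exe",
      "ParserPath": "C:\\T2\\Parser.exe",
      "WorkFlowIds": [1, 2, 3],
      "Ttp2BtpProgram": "C:\\Stptools\\StpTtpCnv.exe",
      "Btp2CsvProgram": "C:\\Stptools\\StpRpt.exe",
      "MetricFilter": "C:\\T2\\MetricFilter.txt",
      "PerlProgram": "C:\\Perl\\bin\\perl.exe",
      "PerlModulePath": "C:\\Perl\\lib",
      "PerlScriptPath": "C:\\T2\\parser.pl"
    }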