White-Mihoff False Filtering Tool

(White, E., Mihoff, M., Jones, B., Bajona, L., Halfyard, E. 2014. White-Mihoff False Filtering Tool)

Introduction

OTN has developed a tool to assist with filtering false detections. The first level of filtering involves identifying isolated detections. The original concept came from work done by Easton White, who was kind enough to share his research database with OTN. We did some preliminary research and developed a proposal for a filtering tool based on what Easton had done. This proof of concept was presented to Steve Kessel and Eddie Halfyard in December 2013, and a decision was made to develop a tool for general use.

This tool provides the following functions:

Suspect Detections: The first part of the tool identifies obvious false detections. The user inputs a file of detections and a time period in minutes. Any detection of a tag that occurs more than that time after the previous detection AND more than that time before the next detection of the same tag is flagged. For example, if the input time is 60 minutes, any detection that is more than 60 minutes after the previous detection and more than 60 minutes before the next detection of that tag is flagged. The station and receiver at which the detections occur are not considered. A file is output containing all the suspect detections that meet the criteria. This file can be examined and edited by the user.

Distance Matrix: The tool can optionally create a distance matrix. The distance matrix is an output file containing station pairs and the distance between them in metres. This is a ‘crow flies’ distance. Only station pairs which occur in sequence are present. If animals go from station1 to station2 and then to station3, but no animal goes directly from station1 to station3, only station pairs 1-2 and 2-3 will be in the output file. The file has an additional column named ‘real_distance’, which is for use in the distance matrix merge tool.
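As a rough illustration of the isolated-detection rule described under Suspect Detections, the following R sketch flags detections that are more than a given number of minutes away from both neighbouring detections of the same tag. It is not the tool's own code: the function name flag_isolated and the sample data are hypothetical, and the sketch assumes that a detection with no previous (or no next) detection of the same tag counts as isolated on that side.

```r
# Hypothetical helper (not part of the OTN tool): return the unqdetecid values
# of detections isolated by more than `time_interval` minutes on both sides.
flag_isolated <- function(detections, time_interval = 60) {
  # Timestamps are assumed to use the "YYYY-MM-DD HH:MI:SS" form.
  times <- as.POSIXct(detections$datecollected,
                      format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

  # Sort by tag, then by time, so adjacent rows are consecutive detections.
  ord <- order(detections$catalognumber, times)
  d <- detections[ord, , drop = FALSE]
  minutes <- as.numeric(times[ord]) / 60
  tag <- as.character(d$catalognumber)
  n <- nrow(d)

  # Assumption: a missing neighbour counts as an infinitely long gap.
  gap_prev <- rep(Inf, n)
  gap_next <- rep(Inf, n)

  # idx holds every row i whose following row i + 1 has the same tag.
  gaps <- diff(minutes)
  idx <- which(tag[-1] == tag[-n])
  gap_next[idx] <- gaps[idx]
  gap_prev[idx + 1] <- gaps[idx]

  # Station and receiver are deliberately ignored, as in the description.
  d$unqdetecid[gap_prev > time_interval & gap_next > time_interval]
}

# Made-up sample data using the required column names.
detections <- data.frame(
  unqdetecid    = c("d1", "d2", "d3", "d4", "d5"),
  catalognumber = c("A", "A", "A", "A", "B"),
  datecollected = c("2014-01-01 00:00:00", "2014-01-01 00:10:00",
                    "2014-01-01 05:00:00", "2014-01-01 12:00:00",
                    "2014-01-01 00:05:00"),
  stringsAsFactors = FALSE
)

print(flag_isolated(detections, time_interval = 60))
# [1] "d3" "d4" "d5"
```

Here d1 and d2 are ten minutes apart and so are kept, while d3, d4 and the single detection of tag B are flagged. Whether the real tool flags the first or last detection of a tag in this way is not stated in this document.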
Filtered Detection file: Once the user has examined the file of suspect detections and decided it is acceptable, these detections can be filtered from the input file, creating a new output file of detections. A new distance matrix can also be requested. This may be desirable because eliminating some detections can change the station pair list. It is also possible to provide your own file of suspect detections, overriding the file the tool expects.

The environment used to process the data is persistent. Files you load and process will still be there the next day, the next week and the next year, so the environment may start to fill up. We have developed a cleanup function which clears all the background objects but does not touch the data folders. It is up to the user to manage the data folders.

Minimum Requirements for Input Detection File

To use the OTN False Filtering tool, your detection file must meet minimum requirements. The tool expects specific columns with specific names to be present. Some of those columns have expected formats and specific constraints. If any of the following conditions is not met, the file will be rejected.

File type expected: CSV in UTF-8 encoding, with commas between values. Files can be converted with Notepad++ or by using file_conversion_driver.r in the sandbox folder. See “Convert Encoding Instructions.doc” in the Documents folder.

Column: unqdetecid must be present.
- Must contain unique values. If the count of unique values does not match the count of records, the file will be rejected.
- Can be a simple sequence number or any other combination of characters you choose.

Column: catalognumber must be present.
- This can be an animal id or a transmitter id, whichever you want to use to group the detections together.

Column: datecollected must be present.
- Must be in the format YYYY-MM-DD HH:MI:SS or YYYY-MM-DDTHH:MI:SS.
- All digits must be present. If your seconds are missing you will have to add them.

Column: station must be present.
- May be empty unless a distance matrix is requested.

Column: latitude and longitude
- Only required if a distance matrix is requested.
- Must be numeric, in decimal degrees.

Notes:
• Your detection file may have any other additional columns you wish.
• OTN detection extract files satisfy all of the required conditions. These files can be found in the Detection Extracts folder of your project repository, where xxx is your OTN project code: http://members.oceantrack.org/data/repository/xxx/detectionextracts
• VUE export files may satisfy the requirements with a few small changes: renaming some columns and adding the column unqdetecid.

Troubleshooting:
• If you encounter a hard error, see Troubleshooting Guide.doc in the Documents folder. We have given solutions to some errors we found in testing.
• If this does not solve your problem, contact marta.mihoff@dal.ca and we will find a solution.

Cautions:
• Opening your detection file in XLS or ODT will reformat the dates. If you do this, do not save the file. If you accidentally save it, you will need to create another input file.
• Uploading a large CSV file to the tool may cause your browser to crash. You can avoid this by uploading a zipped file.
• If you are working with very large files you may have to increase the RAM available to the OTNSandbox. You will know this is needed if the application runs very slowly. See the Appendix in “Install OTN Sandbox.doc” for instructions on how to do this.

Usage

When using this tool for your research, please use the citation: (White, E., Mihoff, M., Jones, B., Bajona, L., Halfyard, E., 2014. White-Mihoff False Filtering Tool)

Detailed usage instructions are in subsequent sections. The overall sequence of steps is:
- Open url http://192.168.56.101:8787/ in your favourite browser and bookmark it.
- Sign in with user sandbox, password otn123.
- Navigate to folder Rstudio.
- Upload your input detection file into folder "data".
- Open file sandbox/filter_driver.r, which is the driver for the filtering tool. It is a simple R script. There are several switches and variables to set which control features of the tool, such as input overrides and output file versioning.

Input/Output file versioning
• On the first run, the file name is put into a variable and the version number is set to ‘00’.
• For subsequent executions, to save time, the initial file is not reloaded if there has already been a load with the same file name. If your initial file has changed between loads, you need to rename it to get it to reload.
• Output files are put into the folder data.
• Output files have the version number incremented by one.
• Output files are never overwritten. If the output file(s) already exist, the program will halt and you will be asked to rename or delete them.

Function loadDetections()
• Uses switches SuspectDetections and DistanceMatrix. Both should be set to TRUE if you want both files output.
• If you want only a suspect file, set switch DistanceMatrix to FALSE.
• If you want only a distance matrix, set switch SuspectDetections to FALSE.
• Also uses switch ReloadInputFile. Large files take a long time to load. We have provided this switch to bypass the reload if you are working from the same input file.
• The output file of suspect detections is intended for you to examine.
• You may edit the file by deleting records which represent detections you think are OK, and you may add records. You may want to edit a copy of the output file rather than the original.

Function filterDetections()
• Uses switches DistanceMatrix and overrideSuspectDetectionFile.
• If you have a suspect file that you have created or edited, set overrideSuspectDetectionFile to TRUE. You will need to provide the name of the override file as well.
• If you want a new distance matrix to be created, set DistanceMatrix to TRUE. The station pairs may change when detections are deleted, so you may always want this to be TRUE.
• Also uses switch ReloadInputFile. Large files take a long time to load. We have provided this switch to bypass the reload if you are working from the same input file.
• An output detection file is created which is identical in structure to the input but is missing those detections which match the input suspect detection file.
• If the value in column suspect_detection in the suspect detection file matches a value in column unqdetecid in the input detection file, that detection will not appear in the output detection file.

Distance Matrix output File
• This file represents only the station pairs that occurred in sequence. If an animal never went from station 4 to station 5, then pair 4-5 will not be in the output file.
• The distance is calculated as the straight line distance between the pairs in metres using a PostGIS function.
• If more than one receiver was deployed at a station with different lats and longs, then the average lat and long is used for the station position. These values appear in the output file. This is one good reason to execute these functions using detections from only one year. If you reuse stations but know the position may be quite different from the previous year, the results will be skewed.
• At this point column real_distance will be null. It is used if you want to override the “crow flies” distance. See document Distance Matrix Merge Instructions.doc in the Documents folder.

Loading Detections into Sandbox
1. In your file explorer, open your OTNsandbox folder.
2. Copy your detection files into the "data" folder.

Filtering Data
1. In RStudio, open the filter_driver.r file from RStudio’s file manager, under the “sandbox” folder. The full file path is “/home/RStudio/sandbox/”.
2. Change the line of code (line 22) detection_file <- 'detections.csv' to use the name of the detection file you have placed into the data folder. The code is case sensitive, so make sure you type the file name correctly.
3. The input_version_id variable is appended onto the generated detection tables to identify the detection loading version. Change the input_version_id if needed. Valid input_version_id values do not include any special characters, spaces or capitalized characters.
4. Adjust the “time_interval” variable (line 24) if you wish to use another time interval for evaluation of suspect detections.
5. Adjust the “detection_radius” variable (line 26) to the average distance from receivers at which tags can be detected. This value can be left blank and changed later using the matrix merge script.
6. Place your text cursor at the beginning of the filter_driver.r file (line 1) and click on the Run button until you reach the line containing the “loadDetections()” statement (line 37). If your detection extract is formatted properly, you should receive a message in RStudio’s console outlining how many detections were loaded. The script will print out ‘Loading complete’.

If the detection file contains errors, this process will report what is wrong with the provided input.

Filtering Detections
1. Move your text cursor to the line containing the “filterDetections()” statement (line 47) and press the Run button to execute the detection filtering process.
2. The results of the processing will appear in RStudio’s console window. This process uses the input file you provided for the loadDetections() step and the list of suspect detections that the loadDetections() step created. Both a new version of the input file and a new station distance matrix will be written to the output folder.
3. The script will print out ‘Filtering complete’ when the process is finished.
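The filtering step, and the reason a new distance matrix may be worth requesting afterwards, can be sketched in R as follows. This is an illustration only and not the tool's implementation: the function names drop_suspects and sequential_pairs and the sample data are hypothetical. The sketch also assumes that station pairs are directional and that consecutive detections at the same station do not form a pair, neither of which is stated in this document. The distance calculation itself, which the tool performs with a PostGIS function, is not reproduced.

```r
# Hypothetical helper: remove every detection whose unqdetecid appears in
# column suspect_detection of the suspect detection file.
drop_suspects <- function(detections, suspects) {
  keep <- !(detections$unqdetecid %in% suspects$suspect_detection)
  detections[keep, , drop = FALSE]
}

# Hypothetical helper: list the station pairs visited one after the other by
# the same tag. Assumes directional pairs and skips same-station repeats.
sequential_pairs <- function(detections) {
  times <- as.POSIXct(detections$datecollected,
                      format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  d <- detections[order(detections$catalognumber, times), , drop = FALSE]
  n <- nrow(d)
  if (n < 2) return(character(0))

  from <- d$station[-n]
  to <- d$station[-1]
  same_tag <- d$catalognumber[-n] == d$catalognumber[-1]
  keep <- which(same_tag & from != to)
  unique(paste(from[keep], to[keep], sep = "-"))
}

# Made-up sample data using the required column names.
detections <- data.frame(
  unqdetecid    = c("d1", "d2", "d3", "d4"),
  catalognumber = c("A", "A", "A", "A"),
  datecollected = c("2014-01-01 00:00:00", "2014-01-01 05:00:00",
                    "2014-01-01 10:00:00", "2014-01-01 10:05:00"),
  station       = c("station1", "station3", "station2", "station2"),
  stringsAsFactors = FALSE
)

# A suspect file that lists d2 as a false detection.
suspects <- data.frame(suspect_detection = "d2", stringsAsFactors = FALSE)

filtered <- drop_suspects(detections, suspects)

print(sequential_pairs(detections))
# [1] "station1-station3" "station3-station2"
print(sequential_pairs(filtered))
# [1] "station1-station2"
```

Removing the single suspect detection at station3 replaces two station pairs with a different one, which is why the station pair list can change after filtering.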