White-Mihoff False Filtering Tool

(White, E., Mihoff, M., Jones, B., Bajona, L., Halfyard, E. 2014. White-Mihoff False Filtering
Tool)
Introduction
OTN has developed a tool which will assist with filtering false detections. The first level of
filtering involves identifying isolated detections. The original concept came from work
done by Easton White. He was kind enough to share his research database with OTN. We
did some preliminary research and developed a proposal for a filtering tool based on what
Easton had done. This proof of concept was presented to Steve Kessel and Eddie Halfyard
in December 2013 and a decision was made to develop a tool for general use. This tool will
provide the following functions:
Suspect Detections: The first part of the tool identifies obvious false detections. The
user inputs a file of detections and a time period in minutes. Any detection of a tag which
occurs more than that time after the previous detection AND more than that time before the
next detection of the same tag is flagged. For example, if the input time is 60 minutes, any
detection which occurs more than 60 minutes after the previous detection and more than 60
minutes before the next detection of that tag will be flagged. The station and receiver at
which the detections occur are not considered. All suspect detections which meet the
criteria are written to an output file. This file can be examined and edited by the user.
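The isolation rule described above can be sketched in base R. This is only an illustration under stated assumptions, not the tool's own code: the function name flag_suspect and the sample data are hypothetical, and the sketch assumes that a missing previous or next detection counts as being further away than any time period.

```r
# Hypothetical helper: return the unqdetecid of detections that are more than
# `minutes` away from both the previous and the next detection of the same tag.
# Station and receiver are deliberately ignored.
flag_suspect <- function(det, minutes) {
  det <- det[order(det$catalognumber, det$datecollected), ]
  t <- as.numeric(det$datecollected)  # seconds since the epoch
  # Gap to the previous / next detection of the same tag, in minutes.
  # Assumption: the first (last) detection of a tag has an infinite gap
  # to its missing previous (next) neighbour.
  gap_prev <- ave(t, det$catalognumber, FUN = function(x) c(Inf, diff(x))) / 60
  gap_next <- ave(t, det$catalognumber, FUN = function(x) c(diff(x), Inf)) / 60
  det$unqdetecid[gap_prev > minutes & gap_next > minutes]
}

# Made-up sample data using the required column names.
det <- data.frame(
  unqdetecid = c("d1", "d2", "d3", "d4", "d5"),
  catalognumber = c("A", "A", "A", "A", "B"),
  datecollected = as.POSIXct(
    c("2014-01-01 00:00:00", "2014-01-01 00:10:00",
      "2014-01-01 03:00:00", "2014-01-01 06:00:00",
      "2014-01-01 00:05:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
suspects <- flag_suspect(det, 60)
```

With a 60 minute period, d1 and d2 are kept because they are 10 minutes apart. Under the assumption above, a tag's only detection (d5) and the last detection of a tag (d4) can also be flagged; the tool itself may treat these edge cases differently.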
Distance Matrix: There is an option of having the tool create a distance matrix. The
distance matrix is an output file containing station pairs and the distance between them in
metres. This is a ‘crow flies’ distance. Only station pairs which occur in sequence will be
present. If animals go from station1 to station2 then to station3, but no animals go from
station1 to station3, only station pairs 1-2 and 2-3 will be in the output file. There is an
additional column named ‘real_distance’ in this file. This column is for use in the distance
matrix merge tool.
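As a rough sketch of what "station pairs which occur in sequence" means, the following base R function lists the pairs visited consecutively by any tag. The function name sequential_pairs is hypothetical, and the sketch assumes that pairs are directed (from, to) and that consecutive detections at the same station do not form a pair; the tool may handle both points differently.

```r
# Hypothetical helper: station pairs visited one after the other by the same tag.
sequential_pairs <- function(det) {
  det <- det[order(det$catalognumber, det$datecollected), ]
  n <- nrow(det)
  if (n < 2) {
    return(data.frame(from = character(0), to = character(0),
                      stringsAsFactors = FALSE))
  }
  from <- det$station[-n]   # station of each detection except the last
  to <- det$station[-1]     # station of the following detection
  same_tag <- det$catalognumber[-n] == det$catalognumber[-1]
  moved <- from != to       # assumption: same-station repeats are not pairs
  keep <- same_tag & moved
  unique(data.frame(from = from[keep], to = to[keep], stringsAsFactors = FALSE))
}

# Tag A goes station1 -> station1 -> station2 -> station3.
det <- data.frame(
  catalognumber = c("A", "A", "A", "A"),
  station = c("station1", "station1", "station2", "station3"),
  datecollected = as.POSIXct(
    c("2014-01-01 00:00:00", "2014-01-01 00:05:00",
      "2014-01-01 01:00:00", "2014-01-01 02:00:00"), tz = "UTC"),
  stringsAsFactors = FALSE
)
pairs <- sequential_pairs(det)
```

Here pairs contains station1-station2 and station2-station3 but not station1-station3, matching the example in the text.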
Filtered Detection file: Once the user has examined the file of suspect detections and
decided it is acceptable then these detections can be filtered from the input file creating a
new output file of detections. A new distance matrix can also be requested. This may be
desirable as eliminating some detections could change the station pair list. It is possible to
provide your own file of suspect detections and override the expected file.
The environment used to process the data is persistent. That means files you load and
process will be there the next day, the next week, the next year. It may start to fill up. We
have developed a cleanup function which will clear all the background objects but will not
touch the data folders. It is up to the user to manage the data folders.
Minimum Requirements for Input Detection File
To use the OTN False Filtering tool your detection file must meet minimum requirements.
The tool expects specific columns with specific names to be present. Some of those columns
have expected formats and specific constraints. If any of the following conditions are not
met the file will be rejected.
File Type expected: CSV in UTF-8 encoding, with commas between values. Files can be
converted with Notepad++ or by using file_conversion_driver.r in the sandbox folder.
See “Convert Encoding Instructions.doc” in the Documents folder.
Column: unqdetecid must be present.
- Must contain unique values. If the count of unique values does not match the count of
records, the file will be rejected.
- Can be a simple sequence number or any other combination of characters you choose.
Column: catalognumber must be present.
- This can be an animal id or a transmitter id, whatever you want to use to group the
detections together.
Column: datecollected must be present.
- Must be in format YYYY-MM-DD HH:MI:SS or YYYY-MM-DDTHH:MI:SS.
- All digits must be present. If your seconds are missing, you will have to add them.
Column: station must be present.
- May be empty unless a distance matrix is requested.
Columns: latitude and longitude.
- Only required if a distance matrix is requested.
- Must be in numeric format, decimal degrees.
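Before uploading, a file can be pre-checked against some of these requirements. The base R sketch below is not part of the tool: the function name check_detections is hypothetical, and it covers only column presence, uniqueness of unqdetecid and the datecollected format.

```r
# Hypothetical pre-check; returns a character vector of problems (empty if none).
check_detections <- function(det, distance_matrix = FALSE) {
  problems <- character(0)
  required <- c("unqdetecid", "catalognumber", "datecollected", "station")
  if (distance_matrix) required <- c(required, "latitude", "longitude")
  missing_cols <- setdiff(required, names(det))
  if (length(missing_cols) > 0) {
    problems <- c(problems, paste("missing column:", missing_cols))
  }
  if ("unqdetecid" %in% names(det) && anyDuplicated(det$unqdetecid) > 0) {
    problems <- c(problems, "unqdetecid values are not unique")
  }
  if ("datecollected" %in% names(det)) {
    # YYYY-MM-DD HH:MI:SS or YYYY-MM-DDTHH:MI:SS, all digits present.
    pattern <- "^[0-9]{4}-[0-9]{2}-[0-9]{2}[ T][0-9]{2}:[0-9]{2}:[0-9]{2}$"
    if (!all(grepl(pattern, det$datecollected))) {
      problems <- c(problems, "datecollected has an unexpected format")
    }
  }
  problems
}

# Made-up data with a duplicated id and a date that is missing its seconds.
det <- data.frame(
  unqdetecid = c(1, 2, 2),
  catalognumber = c("A", "A", "B"),
  datecollected = c("2014-01-01 00:00:00", "2014-01-01T00:10:00",
                    "2014-01-01 00:20"),
  station = c("s1", "s1", "s2"),
  stringsAsFactors = FALSE
)
problems <- check_detections(det)
```

For this sample, problems reports the duplicated unqdetecid and the date with missing seconds. Calling check_detections(det, distance_matrix = TRUE) additionally reports the missing latitude and longitude columns.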
Notes:
•
Your detection file may have any other additional columns you wish.
•
OTN detection extract files satisfy all of the required conditions. These files can be
found in the Detection Extracts folder of your project repository, where xxx is your
OTN project code: http://members.oceantrack.org/data/repository/xxx/detectionextracts
•
VUE export files may satisfy the requirements with a few small changes: renaming
some columns and adding column unqdetecid.
Troubleshooting:
•
If you encounter a hard error see Troubleshooting Guide.doc in the Documents folder.
We have given solutions to some errors we found in testing.
•
If this does not solve your problem contact marta.mihoff@dal.ca and we will find a
solution.
Cautions:
•
Opening your detection file in XLS or ODT will reformat the dates. If you do this, do not
save the file. If you accidentally save it, you will need to create another input file.
•
Uploading a large CSV file to the tool may cause your browser to crash. You can avoid
this by uploading a zipped file.
•
If you are working with very large files you may have to increase RAM available to the
OTNSandbox. You will be able to tell if you need to do this as the application will run
very slowly. See Appendix in “Install OTN Sandbox.doc” for instructions on how to do
this.
Usage
When using this tool for your research please use citation: (White, E., Mihoff, M., Jones, B.,
Bajona, L., Halfyard, E., 2014. White-Mihoff False Filtering Tool)
Detailed usage instructions are in subsequent sections. This is the itinerary:
- Open url http://192.168.56.101:8787/ in your favourite browser and bookmark it.
- Sign in with user sandbox, pw otn123.
- Navigate to folder Rstudio.
- Upload your input detection file into folder "data".
- Open file sandbox/filter_driver.r, which is the driver for the filtering tool. It is a
simple R script. There are several switches and variables to set which control features of
the tool, such as input overrides and output file versioning.
Input/Output file versioning
•
The first time a file is loaded, the file name is put into a variable and the version number
is set to ‘00’.
•
For subsequent executions, to save time, the initial file will not be reloaded if there has
been a load with the same file name. If your initial file has changed between loads you
need to rename it to get it to reload.
•
Output files will be put into the folder data.
•
Output files will have the version number incremented by one.
•
Output files will never be overwritten. If the output file(s) exist the program will halt
and you will be asked to rename or delete the output files.
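The versioning rules above can be illustrated with two small base R helpers. Both function names are hypothetical, the version is assumed to be a two-digit string as suggested by ‘00’, and the tool's real output file naming scheme is not reproduced here.

```r
# Hypothetical helper: increment a two-digit version string, e.g. "00" -> "01".
next_version <- function(version) sprintf("%02d", as.integer(version) + 1L)

# Hypothetical helper: write a CSV but never overwrite an existing file,
# mirroring the documented behaviour of halting when the output file exists.
safe_write <- function(df, path) {
  if (file.exists(path)) {
    stop("Output file exists; rename or delete it first: ", path)
  }
  write.csv(df, path, row.names = FALSE)
  invisible(path)
}

# Example: the first write succeeds, a second write to the same path would stop.
example_path <- tempfile(fileext = ".csv")
safe_write(data.frame(unqdetecid = 1:3), example_path)
```

Stopping instead of overwriting means an earlier result is never lost by accident, at the cost of some manual housekeeping in the data folder.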
Function loadDetections()
•
Uses switches SuspectDetections and DistanceMatrix. Both should be set to TRUE if
you want both files output.
•
If you want only a suspect file, set switch DistanceMatrix to FALSE.
•
If you want only a distance matrix, set switch SuspectDetections to FALSE.
•
Also uses switch ReloadInputFile. Large files take a long time to load. We have
provided this switch to bypass the reload if you are working from the same input file.
•
The output file of suspect detections is intended for you to examine.
•
You may edit the file by deleting records which represent detections you think are OK.
And you may add records. You may want to edit a copy of the output file.
Function filterDetections()
•
Uses switches DistanceMatrix and overrideSuspectDetectionFile.
•
If you have created or edited a suspect file, set overrideSuspectDetectionFile to TRUE.
You will also need to provide the name of the override file.
•
If you want a new DistanceMatrix to be created then this switch should be TRUE. The
station pairs may change when detections are deleted so you may always want this to
be TRUE.
•
Also uses switch ReloadInputFile. Large files take a long time to load. We have
provided this switch to bypass the reload if you are working from the same input file.
•
An output detection file is created, identical in structure to the input but missing those
detections which match the input suspect detection file. If the value in column
suspect_detection in the suspect detection file matches a value in column unqdetecid
in the input detection file, that detection will not appear in the output detection file.
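The matching rule can be sketched in base R as follows. The function name remove_suspects and the sample data are hypothetical; only the column names unqdetecid and suspect_detection come from the description above.

```r
# Hypothetical helper: drop every detection whose unqdetecid appears in the
# suspect_detection column of the suspect file. All other columns are kept as-is.
remove_suspects <- function(det, suspects) {
  det[!(det$unqdetecid %in% suspects$suspect_detection), , drop = FALSE]
}

# Made-up input detections and suspect list.
det <- data.frame(
  unqdetecid = c("d1", "d2", "d3", "d4"),
  catalognumber = c("A", "A", "B", "B"),
  stringsAsFactors = FALSE
)
suspects <- data.frame(suspect_detection = c("d2", "d4"),
                       stringsAsFactors = FALSE)
kept <- remove_suspects(det, suspects)
```

Here kept holds detections d1 and d3 with the same columns as the input. Deleting a row from the suspect file keeps that detection in the output, and adding a row removes one.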
Distance Matrix Output File
•
This file represents only the station pairs that occurred in sequence. If an animal never
went from station 4 to station 5 then pair 4-5 will not be in the output file.
•
The distance is calculated as the straight-line distance between the pairs in metres
using a PostGIS function.
•
If more than one receiver was deployed at a station with different latitudes and longitudes
then the average latitude and longitude is used for the station position. These values
appear in the output file.
•
This is one good reason to execute these functions using detections from only one
year. If you reuse stations but know the position may be quite different from the
previous year then the results will be skewed.
•
At this point column real_distance will be null. This will be used if you want to override
the “crow flies” distance. See document Distance Matrix Merge Instructions.doc in the
Documents folder.
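The averaging of station positions and the straight-line distance can be approximated in base R. This sketch is an assumption-laden stand-in: the tool uses a PostGIS function, whereas the hypothetical haversine_m below uses the haversine formula on a sphere of radius 6371 km, so its results may differ slightly from the tool's output. The function name station_positions is also hypothetical.

```r
# Hypothetical helper: mean latitude and longitude per station name.
station_positions <- function(det) {
  aggregate(cbind(latitude, longitude) ~ station, data = det, FUN = mean)
}

# Great-circle distance in metres between two points given in decimal degrees.
# Assumption: spherical Earth, radius 6371000 m (not the PostGIS calculation).
haversine_m <- function(lat1, lon1, lat2, lon2, radius = 6371000) {
  to_rad <- pi / 180
  dlat <- (lat2 - lat1) * to_rad
  dlon <- (lon2 - lon1) * to_rad
  a <- sin(dlat / 2)^2 +
    cos(lat1 * to_rad) * cos(lat2 * to_rad) * sin(dlon / 2)^2
  2 * radius * asin(pmin(1, sqrt(a)))
}

# Station s1 has two receivers at slightly different positions.
det <- data.frame(
  station = c("s1", "s1", "s2"),
  latitude = c(44.0, 44.2, 45.0),
  longitude = c(-63.0, -63.2, -64.0),
  stringsAsFactors = FALSE
)
pos <- station_positions(det)
d <- haversine_m(pos$latitude[1], pos$longitude[1],
                 pos$latitude[2], pos$longitude[2])
```

Station s1 is placed at the mean of its two receiver positions before the distance to s2 is computed, which is why mixing years with very different deployments of the same station name can skew the result.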
Loading Detections into Sandbox
1.
In your file explorer, open your OTNsandbox folder.
2.
Copy your detection files into the "data" folder.
Filtering Data
1.
In RStudio, open the filter_driver.r file from RStudio’s file manager, under the
“sandbox” folder. The full file path is “/home/RStudio/sandbox/”.
2.
Change the line of code (line 22) detection_file <- 'detections.csv' to use the name of
the detection file you have placed into the data folder. The code is case sensitive, so
make sure you type the file name correctly.
3.
The input_version_id variable is appended onto the generated detection tables to
identify the detection loading version. Change the input_version_id if needed. Valid
input_version_id values do not include any special characters, spaces or capitalized
characters.
4.
Adjust the “time_interval” variable (line 24) if you wish to use another time interval for
evaluation of suspect detections.
5.
Adjust the “detection_radius” variable (line 26) to the average distance from receivers
at which tags can be detected. This value can be left blank and changed later using the
matrix merge script.
6.
Place your text cursor at the beginning of the filter_driver.r file (line 1) and click on the
Run button until you reach the line containing the “loadDetections()” statement (line 37).
If your detection extract is formatted properly you should receive a message in
RStudio’s console outlining how many detections were loaded. The script will print out
‘Loading complete’. If the detection file contains errors, this process will report what is
wrong with the provided input.
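For orientation, the settings edited in the steps above might look roughly like the fragment below. The values are illustrative only and the actual filter_driver.r may differ. The variable names are the ones described in the steps; the function is_valid_version_id is hypothetical and assumes that "no special characters" means lowercase letters and digits only.

```r
# Illustrative values only; the real driver file may contain different defaults.
detection_file <- "detections.csv"  # must match the file name in "data" exactly
input_version_id <- "00"            # assumed form; no special characters, spaces or capitals
time_interval <- 60                 # minutes, used to evaluate suspect detections

# Hypothetical check for input_version_id.
# Assumption: only lowercase letters and digits are allowed.
is_valid_version_id <- function(id) grepl("^[a-z0-9]+$", id)

version_ok <- is_valid_version_id(input_version_id)
```

A quick check like this before running the driver can catch an id such as "V1" or "run 1", which the documented rules do not allow.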
Filtering Detections
1.
Move your text cursor to the line containing the “filterDetections()” statement (line 47) and
click on the Run button to execute the detection filtering process.
2.
The results of the processing will appear in RStudio’s console window. This process
uses the input file you provided for the loadDetections() step and the list of suspect
detections that the loadDetections() step created.
3.
Both a new version of the input file and a new station distance matrix will be written
to the output folder. The script will print out ‘Filtering complete’ when the process is
finished.