Minutes of the second COST expert meeting

Sukarrieta, 24 – 26 June 2008
Participants
Lucia Zarauz (lzarauz@suk.azti.es)
Marcel Machiels (Marcel.Machiels@wur.nl)
David Hirst (david.hirst@nr.no)
Leire Ibaibarriaga (libaibarriaga@suk.azti.es)
David Maxwell (david.maxwell@cefas.co.uk)
Mathieu Merzéréaud (mathieu.merzereaud@ifremer.fr)
Alastair Pout (a.pout@marlab.ac.uk)
Lisa Readdy (lisa.readdy@cefas.co.uk)
Bruno Reale (reale@cibm.it)
Paz Sampedro (paz.sampedro@co.ieo.es)
Mario Sbrana (msbrana@cibm.it)
Dorleta Garcia (dgarcia@suk.azti.es)
Joël Vigneau, Chairman (joel.vigneau@ifremer.fr)
Availability of datasets
During the meeting, an Italian dataset was made available to the developers. The coverage of
sampling strategies and case studies was discussed, and a set of three tables (below) was
circulated to summarise exactly what each of the datasets covers. On the basis of these tables
it will be decided which datasets should be made available most rapidly, although the limited
time remaining in the project will make it difficult to extend the case studies much further.
Volume of discards/retained fraction (table to be updated on the COST webpage)
Sampling strategy / Sampling source | Observer-at-sea: Discards fraction | Observer-at-sea: Retained fraction
Trips – Unsorted                    | Sole (FRS_obs_trips)               | Sole (FRS_obs_trips)
Trips – Sorted by categories        |                                    |
Length/age structure (table to be updated on the COST webpage)
Sampling sources: Observer-at-sea (Discards fraction) and Auction/Market/Harbour
(Retained/Landed fraction). Sampling strategies: Trips – Unsorted (Length + ALK),
Trips – Commercial categories, and Commercial categories – Direct age. Datasets covered:
Sole (FRS_obs_trips) for the discards fraction; Sole (Italy), Sole (FRS_obs_trips),
Cod/whi/had/sai 2004 (Cefas) and Sole (Commercial categories) for the retained/landed fraction.
Biological parameters (table to be updated on the COST webpage)
Sampling sources: Observer-at-sea, Auction/Market/Harbour (purchase) and Scientific survey.
Parameters: Age, Weight, Maturity, Sex-ratio and Fecundity. Sole is covered for Age, Weight,
Maturity and Sex-ratio (two sampling sources each); Fecundity is not covered by any dataset.
Modification of the Data Exchange Format
The Data Exchange Format is meant to be very stable, as any modification leads to corrections
of potentially all the functions developed in COST. From the beginning of the project it was
anticipated that the creation of datasets covering the different case studies would put the
specifications of the DEF under pressure for modification. In order to strengthen the DEF
while keeping it stable, it is important to evaluate each request and to ask whether the issue
raised could be circumvented by a function, or by a modification of the dataset itself, before
accepting the change. It is also important to note that the accepted changes listed below are
considered to be the last modifications of the DEF for this project.
Considering the time remaining before the end of the project, the development of the
functions should not be halted while waiting for the creation of new datasets. The
modifications were discussed by correspondence during and after the meeting, and the
following were accepted:
Modifications of the comments
Page 4: “COST accepts the DATRAS format for uploading of CA records” changed to:
“DATRAS data may be uploaded into COST for estimating biological parameters and
building Age-Length Keys. In order to estimate the parameters at age, the information for
estimating the length structure of the population will be required.”
HH Comment 16 below table:
Text appended: “(Condition subject to check during upload).”
CA Comment 3 below table:
Text changed from: “Only applicable for Herring and Whitefish.” to: “Only
applicable for Herring (Clupea harengus), Salmon (Salmo salar) and Common
whitefish (Coregonus lavaretus).”
CA New note below table:
“If ‘Station no’ is missing, then the first HH record for the same trip matching on
VesselFlagCountry, LandingCountry, Year, Quarter, Month, Area and
StatisticalRectangle is assumed to be representative for the CA record (used to
provide the FAC if needed). All CA records with a sampling type different from ‘V’
(vendor) should match at least one HH record on VesselFlagCountry,
LandingCountry, Year, Quarter, Month, Area and StatisticalRectangle, except for
CA records from survey data (condition subject to check during upload).”
Modifications of the fields
TR + HH + SL + HL + CA.TripNumber:
Range changed from “1 to 9999” to “1 to 999,999”
TR.VesselType:
Mandatory field changed to Optional.
Reference to footnote 6 should be changed to footnote 1.
TR + CE + CL.Harbour (new fields):
Optional, not key fields.
HH + CA + CL + CE: Area and Rectangle:
Change text to: “Area = level 3 (level 4 for the Baltic and Mediterranean) and
Rectangle = level 5 (NA for the Mediterranean). Levels refer to the new DCR (199/2008).”
Addition of “GSA is used in the Mediterranean”, provided that
a. There is a clear geographical boundary between the area where Statistical
rectangles are used and where GSAs are used.
b. There is no place where Statistical rectangles are used at some times (some years,
some species, some fisheries or some countries) and GSAs at other times.
HH + CA + CL + CE: SubStatisticalRectangle (new field):
This field splits the statistical rectangle into multiple polygons. The polygons are
nationally defined, but it is encouraged that this is coordinated internationally (in the
RCMs), as is the coding. The field is part of the natural key. It is optional and is a
string of up to 10 characters. When a value is given in SubStatisticalRectangle,
StatisticalRectangle should also be given (condition subject to check during upload).
HH.Fishing activity category National:
Description changed to: “National coding system, bound to the DCR matrix (Com.
Reg. XXX/2008) level 6 as children, i.e. a national stratification of métier.”
HH + CL + CE: FAC fields:
New comment added: “‘Fishing activity category European lvl 6’ is mandatory for
data from 2009 onwards (condition subject to check during upload).”
New comment added: “Either ‘Fishing activity category European lvl 5’ or ‘Fishing
activity category European lvl 6’ should be provided, not both. Preferably lvl 6, since
this includes the lvl 5 information (condition subject to check during upload).”
Modification of the FAC-code system: for gears that have no mesh size or selection
device, use a “-”. For example, a longline fishery for demersal fish would be:
“LLS_DEF_-_-_-”.
Clarification of the FAC-code system: for fisheries where there is no regulation on
the mesh size, or on the mesh size of the selection device, use a “0”. For example, a
gillnet fishery for demersal fish in the Mediterranean would be: “GNS_DEM_0_-_0”.
HH.Gear:
Field to be removed, since it is no longer needed (included in the mandatory
FAC lvl 5 field).
SL.Sex (new field):
This field is needed to obtain overall mean weights for each sex separately for
Nephrops and megrim. Agreement on the addition of this field with the codes Male
and Female. The field will be an optional key field.
SL.Taxon (Addition of the field refused):
See the modification of CL.Species to CL.Taxon below.
SL.CommercialSizeCategoryScale:
A new code for Nephrops sorted into whole and tails.
SL.CommercialSizeCategory:
“Whole” or “Tails” in the case of Nephrops (encoded as 0 and 1), referencing the new
scale above.
SL.Weight:
Text changed from: “Whole weight in gram. Decimals not allowed. For sea sampling:
Weight of the corresponding stratum. For market sampling: Catch weight is per
definition equal to Sample weight.”
To: “Whole weight in gram. Decimals not allowed. Weight of the corresponding
stratum (Species – Catch category – Size category – Sex).”
HL.Sex:
No change. When SL.Sex is implemented, a new comment is needed: “HL.Sex should
match SL.Sex if the latter is set to a specific value (condition subject to check during
upload).”
HL.LengthClass:
Text: “Lower bound of length class”.
CA.Age:
Text should be “Estimated age”, and the field changed to optional.
CA.AgingMethod (new field):
The description field should contain the aging medium (“otoliths”, “scales”, ...)
together with the method used for reading (“in toto”, “break & burn”, “slides with
transmitted light”, ...).
CA.MicroscopicMaturityStage (Addition of the field refused):
No agreement to include this field. It was felt that only one maturity stage should be
provided per individual; data for the comparison of different methods are to be kept
in national and ad hoc research databases. However, the group agreed that the
maturity staging method should be reported together with the value (just as for
aging), so the following are suggested:
CA.MaturityScale:
New code “Crustacean scale”.
CA.MaturityStage:
New codes “Berried” and “Not berried” (integers), to be included in the new maturity
scale for crustaceans (“Crustacean scale”).
CA.MaturityStagingMethod (new field):
This field would also hold codes for histological methods. Codes so far: “Visual”,
“Histological”.
CL.Species: renamed to CL.Taxon:
Rename to “Taxon” and extend the valid code list to the list of species plus the list of
higher-level taxa. The code lists in the systems (FishFrame and COST) should hold
the information about the taxonomic relations needed for the higher-level taxa.
For example, the raising procedure needs to know that Lophius sp. consists of
L. piscatorius and L. budegassa.
In addition, a working document (Annex A) was made available to the group, specifying how
to encapsulate the COST R functions in FishFrame. This working document will be included
in the final report under the section ‘Relation with other platforms’. No specific requirements
are placed on the COST developers, apart from underlining the complexity of the
encapsulation, which argues for minimising the number of functions to develop (the S4
methods philosophy).
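As a toy illustration of that philosophy (invented class and generic names, not COST code), a
single S4 generic dispatching on the class of the data object replaces several separately named
functions, so only one entry point per operation needs encapsulating:

library(methods)

## Sketch of the S4 methods philosophy: one generic, with methods
## dispatched on the class of the data object (all names here invented).
setClass("csDataToy", representation(hh = "data.frame"))
setGeneric("nbHauls", function(object, ...) standardGeneric("nbHauls"))
setMethod("nbHauls", "csDataToy", function(object, ...) nrow(object@hh))

x <- new("csDataToy", hh = data.frame(trpCode = c("T1", "T1", "T2")))
nbHauls(x)   # 3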
COSTcore – Presentation and discussion
Version 1.0 of the COSTcore package has been available since April 2008. Recent
improvements have been made since, so the version used at the meeting was 1.2-1. During
the discussion, some issues were raised which will need to be implemented very rapidly:
 A subset function able to subset on any CS table, e.g. area in HH, species in SL.
 The check functions should be reviewed to focus better on the major errors, and should
be included in the importation methods.
 Inclusion of the field “Date” in the consolidated table HH. This field is important for
the model-based estimates, which need to assess the real age of the individuals
throughout the year.
 Addition of the year information to the consolidated field ‘time’ in all tables, whatever
time stratification is used, e.g. “2006 – 1” for quarter (or month) 1 of 2006. This keeps
the year information in the consolidated tables and allows multi-year analyses by the
COST functions if necessary (sketched below).
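A minimal sketch of such a year-qualified time coding (an assumption about the
implementation, not COSTcore code):

## Sketch: build the consolidated 'time' field so that the year is kept
## whatever the time stratification (quarter, month, ...).
timeStrata <- function(year, period) paste(year, period, sep = " - ")

timeStrata(2006, 1)            # "2006 - 1", i.e. quarter or month 1 of 2006
timeStrata(c(2006, 2007), 4)   # vectorised: "2006 - 4" "2007 - 4"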
Tasks related to COSTcore
Task | Attributed to | Deadline
Develop subset function | Mathieu & Ernesto | July 2008
Review the check functions | Alastair | July 2008
Addition/modification of fields in consolidated tables | Mathieu | July 2008
COSTeda – Presentation and discussion
COSTeda was presented and no major modification is planned. It is proposed that a large
beta test be carried out in the coming months in order to permit the release of version 1.0.
Tasks related to COSTeda
Task | Attributed to | Deadline
Test all exploratory analysis functions with any dataset present in COSTdata | All developers | August 2008
Include the GSA geographical limits in the mapping functions | Alastair | July 2008
COSTdbe/COSTmbe – Generic functions
A number of functions are to be developed very quickly in order to address issues common
to both the design-based and the model-based estimates. The functions and objects to
develop are the following:
 Zero values: the COST datasets do not contain any value for species not caught. The
inclusion of 0-values in the estimation must follow strict rules, e.g. the fishing operation
was sampled and all species were considered. This issue has already been addressed for
the analytical methods, so the code needs to be extracted and made into a generic
function that all methods should call before calculating the estimates and the associated
variances.
 Defining the sampling strategy: there is a need for a function scrutinising the whole
dataset in order (i) to determine which sampling strategy was used and (ii) to check that
all the related information is present. The end user should either validate the sampling
strategy determined by the function or specify it in an argument. The decision rule for
determining the sampling strategy from the dataset information is given in Annex B.
 Managing gaps in ALKs: gaps in Age-Length Keys are considered a major issue,
especially for bootstrapping (see the following sections). A function is needed to check
the match between the length structure of the parameter to estimate and the associated
length structure of the ALK. In case of a mismatch, it was considered preferable to offer
the users suggestions of automatic solutions. Experience shows that gaps in ALKs are
usually solved by “expert filling”, i.e. manual tabulation of empty cells; it was found
preferable to propose appropriate grouping of length classes instead. The different cases
encountered are the following (a sketch of the decision logic follows this item):
o Case 1: the proportion of missing length classes is too high (to be
defined/parameterised). The ALK is refused and the function stops.
o Case 2: the gaps are spread all over the ALK (to be defined/parameterised).
The proposed solution is to increase the step of the length classes, limited to
2 and 3 cm steps. After each step, it should be checked whether the new ALK
remains in case 2 or has evolved into case 3 or 4.
o Case 3: a few small gaps (to be defined/parameterised) are encountered in the
middle of the ALK. The proposed solution is to sum the numbers-at-age of the
length classes immediately above and below and insert this into the missing
length class. The modified ALK should be checked to see whether it has
evolved into case 4.
o Case 4: the gaps are at the extrema. The proposed grouping should be done
only at these extrema, considering the first filled length class and a number (to
be defined/parameterised) of filled large length classes.
o In general: the user should keep control of the solutions adopted (to be
discussed in the case of bootstrapping!). In all cases, the recoding of the length
classes should be done on the consolidated CA table and on a duplicate of the
outcome of the length structure estimates (not on the consolidated HL table,
and without changing the original length structure estimates).
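A hedged sketch of this decision logic; the function name and the thresholds are placeholders
for the parameters still to be defined:

## Sketch: classify an ALK into the four gap cases above. 'alk' is a
## numbers-at-age matrix with one row per length class.
alkGapCase <- function(alk, maxPropEmpty = 0.5, maxInnerGaps = 2) {
  filled <- rowSums(alk) > 0
  if (mean(!filled) > maxPropEmpty)
    return("case 1: too many empty length classes - ALK refused")
  rng   <- range(which(filled))
  inner <- which(!filled)
  inner <- inner[inner > rng[1] & inner < rng[2]]
  if (length(inner) > maxInnerGaps)
    return("case 2: gaps spread over the ALK - regroup length classes")
  if (length(inner) > 0)
    return("case 3: few inner gaps - fill from adjacent length classes")
  "case 4: gaps at the extrema only - group extreme length classes"
}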
 Outcome object: setting identical objects (at least with identical headers/containers) as
the outcome of the functions contributes to the genericity of COST. This object should
have a common structure: a core of identical headers/containers, plus specific
headers/containers related to the function used. The core headers should be:
o $species: recall of SL$spp (+ SL$taxon + SL$sex)
o $catchCat: recall of the catch category (Discards/Landings)
o $param: recall of the parameter estimated (N, W, maturity, sex-ratio, ...)
o $strataDes: time, space and technical stratification considered
o $methodDesc: recall of the method (analytical, bootstrap, Bayesian)
o $nSamp: number of samples
o $nMes: number of individuals measured
o $lenStruc: estimates of the length structure (param-at-length)
o $lenVar: estimates of the variance of $lenStruc
o $ageStruc: estimates of the age structure (param-at-age)
o $ageVar: estimates of the variance of $ageStruc
o $totalN: estimates of the total of the parameter
o $totalNvar: estimates of the variance of $totalN
o $totalW: estimates of the total weight of the parameter
o $totalWvar: estimates of the variance of $totalW
It was agreed that the outcome object should match the demands of the end user in terms
of aggregation. For this purpose, the functions should all contain an argument
specifying the aggregation level, e.g. one métier - one area - yearly estimates, one area
- all métiers - quarterly estimates, etc. (sketched below).
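A sketch of the common container under the core headers listed above, written here as a
plain list; the actual COST outcome object may well be an S4 class:

## Sketch: an empty outcome object carrying the agreed core headers.
emptyOutcome <- function(species = NA, catchCat = NA, param = NA,
                         strataDes = NA, methodDesc = NA) {
  list(species = species, catchCat = catchCat, param = param,
       strataDes = strataDes, methodDesc = methodDesc,
       nSamp    = NA, nMes      = NA,
       lenStruc = NA, lenVar    = NA,
       ageStruc = NA, ageVar    = NA,
       totalN   = NA, totalNvar = NA,
       totalW   = NA, totalWvar = NA)
}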
 Outcome graphs: the outcome graphs based on the common headers of the outcome
object should be implemented once for all functions. Graphs for information specific
to the function used (e.g. the distribution of replicates) should be developed in the
appropriate package.
The precision indicators to implement have to be in accordance with the new DCR. The
specifications required by the regulation are:
 Chapter II Section B.4: Where reference is made to precision/confidence level, the
following distinction shall apply:
 Level 1/2/3: a level making it possible to estimate a parameter with a precision of
plus or minus 40% for a 95% confidence level, or a coefficient of variation (CV)
of 20% used as an approximation; +/- 25% or CV = 12.5%, and +/- 5% or
CV = 2.5%, for levels 2 and 3 respectively.
 Chapter III Section B1.4.2.a: Data related to quarterly estimates of the discards length
and age composition for Group 1 and Group 2 species must lead to a precision of
level 1.
 Chapter III Section B2.4:
(1) For stocks of species that can be aged, average weights and lengths for each age
shall be estimated at precision level 3, up to such an age that accumulated landings
for the corresponding ages account for at least 90% of the national landings for the
relevant stock.
(2) For stocks for which age reading is not possible, but for which a growth curve can
be estimated, average weights and lengths for each pseudo-age (e.g. derived from
the growth curves) shall be estimated at precision level 2, up to such an age that
accumulated landings for the corresponding ages account for at least 90% of the
national landings for the relevant stock.
(3) For maturity, fecundity and sex ratios, a choice may be made between reference to
age or length, provided that the Member States which have to conduct the
corresponding biological sampling have agreed the following:
(a) For maturity and fecundity, calculated as the proportion of mature fish,
precision level 3 must be achieved within the age and/or length range whose
limits correspond to 20% and 90% of mature fish;
(b) For sex ratio, calculated as the proportion of females, precision level 3 must be
achieved, up to such an age or length that cumulated landings for the
corresponding ages or lengths account for at least 90% of the national landings
for this stock.
It was found that the CV does not fit proportion estimates, and it was proposed to measure
the half-width of the confidence interval instead. For all vector estimates
(parameter-at-length and parameter-at-age), the standard mean of the lengths/ages
corresponding to the DCR specifications should be proposed.
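The mapping from an estimated CV to the DCR precision levels quoted above can be
sketched as follows (the function names are invented):

## Sketch: DCR precision level reached by an estimate, from its CV.
## Level 1: CV <= 20%; level 2: CV <= 12.5%; level 3: CV <= 2.5%.
dcrLevel <- function(cv) {
  if      (cv <= 0.025) "level 3"
  else if (cv <= 0.125) "level 2"
  else if (cv <= 0.20)  "level 1"
  else                  "below level 1"
}

## For proportions, the half-width of the confidence interval is used instead:
ciHalfWidth <- function(se, conf = 0.95) qnorm(1 - (1 - conf) / 2) * se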
Tasks related to general issues
Task | Attributed to | Deadline
0-values function | Mathieu | July
Sampling strategy function | Mathieu | July
Gaps in ALK function | Mathieu | July
Outcome object | Mathieu | July
Outcome graph | Mathieu | End September
COSTdbe/COSTmbe – Estimates of volume of discards
The analytical estimation of the volume of discards (number and weight) is ongoing and
almost ready for testing. The estimation follows the outcomes of the ICES Workshop on
Discard Raising Procedures (WKDRP [1]) and the working document used in support
(Vigneau, 2007 [2]).
[1] http://www.ices.dk/reports/ACOM/2007/WKDRP/WKDRP07.pdf
[2] http://www.ifremer.fr/docelec/doc/2006/acte-2699.pdf
The sampling units for the bootstrap estimates are clearly the fishing trips (primary sampling
units) and the fishing operations (secondary sampling units). A nested bootstrap, i.e.
bootstrapping the fishing operations within each replicate of fishing trips, is likely to be
problematic, both in terms of run time and of the number of secondary sampling units to
resample. Moreover, it is known that the overall variance of discards estimates is driven by
the between-trip variance. The solution could come from a hybrid bootstrap, i.e.
bootstrapping the fishing trips and taking the within-trip variance into account at each step
of the resampling. The literature needs to be investigated to validate whichever method is
going to be used.
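A minimal sketch of the trip-level bootstrap (primary sampling units only; the hybrid step
injecting the within-trip variance at each replicate is deliberately left out, pending the
literature review):

## Sketch: nonparametric bootstrap over trips. 'dTrip' holds one discard
## estimate per sampled trip; between-trip variance dominates the total.
bootTrips <- function(dTrip, B = 1000) {
  reps <- replicate(B, mean(sample(dTrip, replace = TRUE)))
  c(mean = mean(reps), var = var(reps))
}

set.seed(1)
bootTrips(c(12, 0, 3, 45, 7, 0, 19))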
The Bayesian model is an integrated model that estimates all the parameters together.
Because the model will generate the samples from a virtual population, the deadline is
brought forward to the end of August.
Tasks related to the estimation of volume of discards
Task | Attributed to | Deadline
Create a generic function to estimate total volume (weight or number) based on: a) multistage sampling (raised by trip); b) ratio-to-size (raised by fishing operations [2 stages]); c) ratio-to-size (raised by fishing days [3 stages]); d) ratio to an auxiliary variable | Mathieu | End September
Deliverables | Mathieu | December
Investigation of a hybrid bootstrap | David M. & Joël | End July
Bootstrap estimates | David M. | End September
Deliverables | David M. | December
Bayesian estimates | David H. | End August
Deliverables | David H. | December
COSTdbe/COSTmbe – Estimates of length and age structure
Analytical estimates: the analytical estimation of the length and age structure is ongoing.
The problems raised by the matching of the length structure and the ALK should be solved
with the gap-filling function.
Bootstrap estimates: the big issue for the bootstrap is the high probability of generating
incomplete ALKs when resampling the age samples. Although a function will be
implemented to fill the gaps in the ALK (see the section above), there is a risk that the ALK
is rejected because of too many gaps. The alternative could be to resample the individual
ages grouped by length class. This approach, which respects the field protocol, is also likely
to encounter length classes (at the extrema) with very few individuals, causing problems in
a resampling process. In the absence of a risk-free procedure for bootstrapping the ALK, it
was decided not to resample the individuals over the whole ALK, as often seen in the
literature, but to implement both the bootstrap of the age samples and the bootstrap
stratified by length classes (a minimal sketch of the latter follows). The simulation package
will further evaluate the goodness of fit of the procedures used and validate the approach
or not.
It was also decided to implement only the simple bootstrap and to avoid all the bias-reducing
variants found in the R bootstrap package. The optimisation of the bootstrap estimates
should be done in a second stage or in a continuation of the COST project.
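A sketch of the length-stratified variant, assuming the aged individuals sit in a CA-like table
with lenCls and age columns:

## Sketch: one bootstrap replicate of an ALK, resampling individual ages
## within each length class (respecting the field protocol).
bootALK <- function(ca) {
  idx <- unlist(lapply(split(seq_len(nrow(ca)), ca$lenCls),
                       function(i) i[sample.int(length(i), replace = TRUE)]))
  table(lenCls = ca$lenCls[idx], age = ca$age[idx])
}

ca <- data.frame(lenCls = rep(c(20, 21, 22), each = 4),
                 age    = c(1, 1, 2, 2, 1, 2, 2, 3, 2, 3, 3, 4))
bootALK(ca)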
Tasks related to the estimation of length and age structure
Task | Attributed to | Deadline
Analytical estimates | Marcel | End September
Deliverables | Marcel | December
Bootstrap estimates | David M. | End September
Deliverables | David M. | December
Bayesian estimates | David H. | End August
Deliverables | David H. | December
COSTdbe/COSTmbe – Estimates of biological parameters
The analytical functions are beta versions. Fitting a curve to the data should not be the
default option, but should be authorised through an argument to the function. It was noted
that the parameter estimates reflect the parameters in the catches, not in the population.
The functions should all have arguments specifying the sampling type and the time window,
to focus only on a precise subset of the data (the international requisite for maturity concerns
a time window, and the subset should be done on the month field here).
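A minimal sketch of such sampling-type and time-window arguments, assuming a validated
CA-like table with sampType and month fields:

## Sketch: restrict an estimation to a sampling type and a time window
## (for the maturity requisite, the subset is on the month field).
subsetWindow <- function(ca, sampType = NULL, months = NULL) {
  if (!is.null(sampType)) ca <- ca[ca$sampType %in% sampType, ]
  if (!is.null(months))   ca <- ca[ca$month %in% months, ]
  ca
}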
The bootstrap of parameters-at-age will face the same problems as described above for the
age structure when an ALK is used.
Tasks related to the estimation of biological parameters
Task | Attributed to | Deadline
Analytical estimates of empirical weight-at-length, maturity-at-length, sex-ratio-at-length and variances [beta versions to be improved] | Mathieu | End September
Bootstrap estimates of empirical weight-at-length, maturity-at-length, sex-ratio-at-length and variances [beta versions to be improved] | Paz | End July
Estimates from models: a) weight-at-length (length-weight relationship); b) weight-at-age (von Bertalanffy); c) maturity-at-length (logistic); d) maturity-at-age (logistic); e) sex-ratio-at-length (binomial model); f) sex-ratio-at-age (binomial model) | Paz (in the COSTmbe package) | End September
Bayesian estimates of empirical weight-at-length, maturity-at-length, sex-ratio-at-length and variances [beta versions to be improved] | David H. | End September
Deliverables | All developers | December
COSTsim – Description of work for the package
The work to be done in this package can be split into two categories: (i) the comparison of
model-based, design-based and analytical methods on ‘true’ simulated populations, and (ii)
simple optimisation of sampling. Given the large potential scope of the work and the limited
time remaining, the discussion focused on reducing the scope to the essential needs. For
example, the optimisation of sampling strategies will be limited to the comparison of a small
number of possibilities.
The starting point was to consider the method for generating a true population from the
available datasets, or rather for generating COST-format samples under a limited number of
sampling strategies. The function will be made available from the Bayesian model by David
Hirst by the end of August.
The different situations to cover with the methods will emulate the sampling strategies
available in the datasets and will generate different sampling efforts in the strata, including
poorly sampled strata.
The performance statistics that will be used to compare the methods are the following
(a minimal sketch follows the list):
 Bias: the difference between the expected value of the estimator (the mean of the
estimates over all possible samples that can be taken from the population) and the true
population value.
 Coverage of the CI: the proportion of times the CI contains the true value.
 Precision: the difference between a sample estimate and the mean of the estimates over
all possible samples of the same size that can be taken from the population. The
variance is a possible quantitative measure of precision.
 Accuracy: the difference between a sample estimate and the true population value; it
combines both bias and precision. The mean squared error between the estimates and
the true value is a common measure of accuracy.
 Other statistics to be investigated in the literature (Walther and Moore 2005).
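A sketch of these statistics for one simulated scenario, where 'est' holds the estimates over
replicates, 'lo' and 'hi' the CI bounds per replicate, and 'truth' the true population value:

## Sketch: performance statistics comparing estimates with a known truth.
perfStats <- function(est, truth, lo, hi) {
  c(bias     = mean(est) - truth,                # expected estimate minus truth
    coverage = mean(lo <= truth & truth <= hi),  # CI coverage
    variance = var(est),                         # precision
    mse      = mean((est - truth)^2))            # accuracy (bias + precision)
}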
The optimisation methods will not consider the optimisation of the sampling allocation
among strata in relation to their size and heterogeneity (Neyman allocation), but only the
response of the precision and variance to different sample sizes, both in length and in age.
The optimisation based on the size of the strata can be done with the function FFFF of the
Exploratory Data Analysis package.
WP6 diagram (flow):
Simulate a real haul population based on real sampling data → Population (the whole
population of hauls) → Sample the real population, replicating different sampling
procedures → Samples → Calculate point and variance estimates using design-based and
model-based methods from WP5 and WP6 (WP5 functions) → Estimates → Compare real
and estimated parameters (performance statistics) → Which method performs best in each
situation; sampling size assessment.
Tasks related to the simulation package
Task | Attributed to | Deadline
Generate a function that generates virtual samples based on a real dataset | David Hirst | End of August
Generate virtual samples under different conditions | Dorleta | End September
Compare the methods | Dorleta | End October
Agreement on the conclusions | Leaders of the package | Mid December
Sample size | Dorleta | End November
Deliverables | Dorleta | December
References
Walther, B. A. and Moore, J. L. 2005. The concepts of bias, precision and accuracy, and their
use in testing the performance of species richness estimators, with a literature review of
estimator performance. Ecography 28: 815-829.
Agenda
Beginning of September:
 Availability of the functions by end of September
 Using the S4 classes
 Encapsulation of usual R scripts
To be done at the final stage of the project:
 Final report of the project
 Help (man) pages for each of the packages
 User manuals (based on the EDA user manual), plus a presentation section and an importation section
ANNEX A – Working document for the COST Expert Meeting, June 2008
FishFrame/COST Integration Overview
Introduction
During the development of FishFrame the development team has created a module to interact with
R, RConnector. The intent of this module was to provide the interface between FishFrame and COST.
RConnector can be configured to interact with individual R scripts. This configuration is provided
through specifically structured XML files, one per R script. Through this interface it is possible to
define that COSTcore data objects, data frames (provided as the results of a stored procedure) and
simple scalar values can be passed as parameters to the R script. The expected output of an R
script can also be defined to include data frames, lists and images. An example of an R script and its
associated XML configuration file is given in the section “Example”, while a listing of the XSD schema
used to validate these configuration files is given in Appendix A.
FishFrame/COST Integration
All COST functions to be integrated into FishFrame will need to have their XML configuration defined,
as well as an entry point within the FishFrame user interface, probably within the data processing or
reporting sections. The FishFrame application will also be responsible for providing all required data
to the COST function, in the form of the previously mentioned data types.
COST functions will be integrated into FishFrame by including a copy of the R scripts on the
FishFrame web server. The initial phase of integrating a function will include the creation of the
configuration file and the user interface entry point. Upon future releases of COST functions, any
changes in the interface of a function will need to be mirrored in the XML configuration. New COST
functions will also be able to be included.
COST functions will not have write access to the FishFrame database.
Example
Interface to R Function
Create an R function wrapper; this function can accept three types of arguments:
1. COSTcore data objects
2. Data frames resulting from stored procedure calls
3. Simple values (double, boolean, string, integer)
There is a restriction on the order of the arguments (see the schema in Appendix A): the simple
values must come after the COSTcore data objects or data frames. Any graphics output will be
written to a .png file if specified in the XML definition of the function wrapper.
All output values (data frames or simple values) must be returned in a list with named elements,
whose names must correspond to the XML definition.
Example (myScript.R):
myFunction <- function(foo, maxDepth)
{
    ## do some calculations and make a plot...
    aDataFrame <- . . .
    aPvalue <- . . .
    return(list(outData = aDataFrame, pval = aPvalue))
}
Creating XML configuration
The next step is to create an XML document that matches the function wrapper. This document
defines how to call the function wrapper and what kind of output it generates, and it must be an
instance of (i.e. it can be validated against) the schema file “RScriptSchema.xsd” (an online schema
validator: http://tools.decisionsoft.com/schemaValidate/).
Example:
<?xml version="1.0"?>
<RScriptDefinition>
<name>My Script</name>
<functionName>myFunction</functionName>
<source sourceType="script">myScript.R</source>
<inputData>
<dataType>CS</dataType>
<name>foo</name>
</inputData>
<parameter>
<name>maxDepth</name>
<type>integer</type>
<description>Some description of the parameter maxDepth</description>
</parameter>
<output>
<graph>true</graph>
<dataFrame>outData</dataFrame>
<value>
<name>p-value</name>
<type>double</type>
<variableName>pval</variableName>
<description>This value explains p-value</description>
</value>
</output>
</RScriptDefinition>
In this example the input parameter “foo” is a COSTcore CS data object; other valid data types would
be the CE and CL COSTcore data objects, or StoredProcedure, which would relate to a data frame
resulting from the execution of the defined stored procedure. The parameter “maxDepth” is a simple
value of type integer. How to pass these data values into the R function is defined in the following
section.
This example defines that the R function will output a graph, a data frame named “outData” and a
simple value “pval”. How to retrieve the returned data is described in the section “Retrieving return
data”.
Running an R Function through RConnector
The R function can now be executed through the RConnector module:
RScriptResult result = rcon.RunScript("C:/RScriptSchema.xsd",
"C:/myScript.xml",
"C:/myGraph.png",
someCSdata, 10 );
Where "C:/myGraph.png" is the output file for the graph produced by the R function, and
“someCSdata” and “10” are the input data to the R function.
The return object “result”, of class RScriptResult, will contain the output from the R function:
the data frame “outData” and the simple value “pval”.
Appendix A – RscriptSchema.xsd
<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="unqualified"
attributeFormDefault="unqualified">
<!-- Definitions of simple elements-->
<xs:element name="description" type="xs:string"/>
<!-- Datatype is used for describing the type of raw data-->
<xs:simpleType name="dataTypeEnum">
<xs:restriction base="xs:string">
<xs:enumeration value="CS"/>
<xs:enumeration value="CL"/>
<xs:enumeration value="CE"/>
<xs:enumeration value="StoredProcedure"/>
</xs:restriction>
</xs:simpleType>
<!-- sourceTypeEnum describes the type of R input function-->
<xs:simpleType name="sourceTypeEnum">
<xs:restriction base="xs:string">
<xs:enumeration value="script"/>
<xs:enumeration value="lib"/>
</xs:restriction>
</xs:simpleType>
<!-- basic R types -->
<xs:simpleType name="RType">
<xs:restriction base="xs:string">
<xs:enumeration value="double"/>
<xs:enumeration value="integer"/>
<xs:enumeration value="boolean"/>
<xs:enumeration value="string"/>
</xs:restriction>
</xs:simpleType>
<!-- Enumeration, only strings and characters-->
<xs:element name="enum">
<xs:simpleType>
<xs:list itemType="xs:string"/>
</xs:simpleType>
</xs:element>
<xs:simpleType name="variableName">
<xs:restriction base="xs:string">
<xs:pattern value="([a-z]|[A-Z])[a-zA-Z0-9_.]*"/>
<!-- regular expression for a valid variable name in R -->
</xs:restriction>
</xs:simpleType>
<!-- Definition of attributes-->
<!-- Definition of complex elements-->
<xs:element name="source">
<xs:complexType>
<xs:simpleContent>
<xs:extension base="xs:string">
<xs:attribute name="sourceType" type="sourceTypeEnum" use="required"/>
</xs:extension>
</xs:simpleContent>
</xs:complexType>
</xs:element>
<!-- Input data (raw)-->
<xs:element name="inputData">
<xs:complexType>
<xs:sequence>
<xs:element name="dataType" type="dataTypeEnum"/>
<xs:element name="name" type="xs:string"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<!-- Parameters to the R-script (flags etc. beyond raw data)-->
<xs:element name="parameter">
<xs:complexType>
<xs:sequence>
<!-- parameter name (in R)-->
<xs:element name="name" type="variableName"/>
<!-- type of parameter-->
<xs:choice>
<xs:element name="type" type="RType"/>
<xs:element ref="enum"/>
</xs:choice>
<!-- optional default value-->
<xs:element name="default" type="xs:string" minOccurs="0"/>
<!-- optional description-->
<xs:element ref="description" minOccurs="0"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<!-- A "simple" value that the R-script returns -->
<xs:complexType name="value">
<xs:sequence>
<xs:element name="name" type="xs:string"/>
<xs:element name="type" type="RType"/>
<xs:element name="variableName" type="variableName"/>
<xs:element ref="description" minOccurs="0"/>
</xs:sequence>
</xs:complexType>
<xs:element name="output">
<xs:complexType>
<xs:sequence>
<!-- does the script/function generate graphics output -->
<xs:element name="graph" type="xs:boolean"/>
<xs:element name="dataFrame" type="variableName" minOccurs="0"
maxOccurs="unbounded"/>
<xs:element name="value" type="value" maxOccurs="unbounded"/>
</xs:sequence>
</xs:complexType>
</xs:element>
<!--
//////////////////////////////
Script definition begins
//////////////////////////////
-->
<xs:element name="RScriptDefinition">
<xs:complexType>
<xs:sequence>
<!-- Unlike most complexTypes above, new stuff can be added here without
breaking the library parser-->
<!-- The name of the script-->
<xs:element name="name" type="xs:string"/>
<!-- optional description-->
<xs:element ref="description" minOccurs="0"/>
<!-- The name of the function to call in R-->
<xs:element name="functionName" type="xs:string"/>
<xs:element ref="source"/>
<xs:element ref="inputData" maxOccurs="unbounded"/>
<xs:element ref="parameter" maxOccurs="unbounded"/>
<xs:element ref="output"/>
</xs:sequence>
</xs:complexType>
</xs:element>
</xs:schema>
ANNEX B – Decision rule for the function investigating the sampling strategies
Related to each HH$catchCat & SL$spp:
Check that HH$catchReg = All & HH$sppReg = All.
Check data integrity, conditional on the function (discards W & N, LS, AS, biological
parameters).
1- Sampling for length in fishing trips - Unsorted catch
HL (+/-)
HH@fishing activities (+)
SL@commCat (-)
CL@fishing activities (=HH)
CL@commCat (--)
2 - Sampling for length in fishing trips - Commercial Categories
HL (+/-)
HH@fishing activities (+)
SL@commCat (+)
CL@fishing activities (=HH)
CL@commCat (--)
3 - Sampling for length in Commercial categories
HL (+/-)
HH@fishing activities (-)
SL@commCat (+)
CL@fishing activities (--)
CL@commCat (=SL)
4 - Sampling for age in fishing trips - Unsorted catch
HL (--)
HH@fishing activities (+)
SL@commCat (-)
CA$trpCode (+)
CA$staNum (+)
CL@fishing activities (=HH)
CL@commCat (--)
5 - Sampling for age in fishing trips - Commercial categories
HL (--)
HH@fishing activities (+)
SL@commCat (+)
CA$trpCode (+)
CA$staNum (+)
CL@fishing activities (=HH)
CL@commCat (--)
6 - Sampling for age in commercial categories
HL (--)
HH@fishing activities (-)
SL@commCat (+)
CA$trpCode (+)
CA$staNum (+)
CL@fishing activities (--)
CL@commCat (=SL)
(+) = all cells filled
(-) = at least one cell not filled
(--) = entirely empty
Within one stratum, only mixtures of 1 & 2 or 4 & 5 are authorised.
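A small sketch of the cell-filling codes used in this annex, assuming that unfilled cells are NA:

## Sketch: the (+) / (-) / (--) codes above, for a vector of cells.
fillCode <- function(x) {
  if (all(is.na(x))) "--"          # entirely empty
  else if (any(is.na(x))) "-"      # at least one cell not filled
  else "+"                         # all cells filled
}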