Minutes of the second COST expert meeting
Sukarrieta, 24 – 26 June 2008

Participants
Lucia Zarauz – lzarauz@suk.azti.es
Marcel Machiels – Marcel.Machiels@wur.nl
David Hirst – david.hirst@nr.no
Leire Ibaibarriaga – libaibarriaga@suk.azti.es
David Maxwell – david.maxwell@cefas.co.uk
Mathieu Merzéréaud – mathieu.merzereaud@ifremer.fr
Alastair Pout – a.pout@marlab.ac.uk
Lisa Readdy – lisa.readdy@cefas.co.uk
Bruno Reale – reale@cibm.it
Paz Sampedro – paz.sampedro@co.ieo.es
Mario Sbrana – msbrana@cibm.it
Dorleta Garcia – dgarcia@suk.azti.es
Joël Vigneau (Chairman) – joel.vigneau@ifremer.fr

Availability of datasets
During the meeting, an Italian dataset was made available to the developers. The coverage of sampling strategies and case studies was discussed, and a set of three tables (below) was circulated to summarise exactly what each dataset covers. From these tables it will be decided which datasets should be made available very rapidly, although the limited time remaining in the project will make it difficult to extend the case studies much further.

Volume of discards/retained fraction (table to be updated on the COST webpage)
Sampling strategy (observer-at-sea source) – discards fraction – retained fraction
Trips, unsorted – Sole (FRS_obs_trips) – Sole (FRS_obs_trips)
Trips, sorted by categories – no dataset yet

Length/age structure (table to be updated on the COST webpage)
Sampling sources: observer-at-sea (discards fraction, retained fraction), auction/market and harbour (landed fraction).
Trips, unsorted, length + ALK – discards: Sole (FRS_obs_trips); retained: Sole (FRS_obs_trips); landed: Italy, Cod/whi/had/sai 2004 (Cefas)
Trips, commercial categories – Sole
Commercial categories – no dataset yet
Trips, unsorted, direct age – no dataset yet

Biological parameters (table to be updated on the COST webpage)
Parameter (sampling sources: observer-at-sea, auction/market purchase, harbour, scientific survey)
Age – Sole
Weight – Sole, Sole
Maturity – Sole, Sole
Sex-ratio – Sole, Sole
Fecundity – no dataset yet

Modification of the Data Exchange Format
The Data Exchange Format (DEF) is meant to be very stable, as any modification leads to the correction of potentially all the functions developed in COST. From the beginning of the project, it was anticipated that the creation of datasets covering the different case studies would put the specifications of the DEF under pressure for modification. In order to strengthen the DEF while keeping it stable, it is important to evaluate each request and to consider whether the issue raised could be circumvented by a function, or by a modification of the dataset itself, before accepting the change. It is also important to note that the accepted changes below are considered to be the last modifications of the DEF for this project. Considering the time remaining before the end of the project, the development of the functions should not be stopped while waiting for the creation of new datasets. The modifications were discussed by correspondence during and after the meeting, and the following were accepted.

Modifications of the comments
Page 4: “COST accepts the DATRAS format for uploading of CA records” changed to “DATRAS data may be uploaded into COST for estimating biological parameters and building age-length keys. In order to estimate the parameters at age, the information for estimating the length structure of the population will be required.”
HH Comment 16 below table: text appended: “(Condition subject to check during upload).”
CA Comment 3 below table: text changed from “Only applicable for Herring and Whitefish.” to “Only applicable for Herring (Clupea harengus), Salmon (Salmo salar) and Common whitefish (Coregonus lavaretus).”
CA new note below table: “If ‘Station no’ is missing, then the first HH record for the same trip matching on VesselFlagCountry, LandingCountry, Year, Quarter, Month, Area and StatisticalRectangle is assumed to be representative for the CA record (used to provide the FAC if needed). All CA records with a sampling type other than ‘V’ (vendor) should match at least one HH record on VesselFlagCountry, LandingCountry, Year, Quarter, Month, Area and StatisticalRectangle (condition subject to check during upload), except for CA records from survey data.”

Modifications of the fields
TR + HH + SL + HL + CA.TripNumber: range changed from “1 to 9999” to “1 to 999,999”.
TR.VesselType: mandatory field changed to optional. The reference to footnote 6 should be changed to footnote 1.
TR + CE + CL.Harbour (new fields): optional, not key fields.
HH + CA + CL + CE, Area and Rectangle: text changed to “Area = level 3 (level 4 for the Baltic and the Mediterranean) and Rectangle = level 5 (NA for the Mediterranean). Levels refer to the new DCR (199/2008).” Addition of “GSA is used in the Mediterranean”, provided that:
a. there is a clear geographical boundary between the area where statistical rectangles are used and the area where GSAs are used;
b. there is no place where statistical rectangles are used at some times (some years, some species, some fisheries or some countries) and GSAs at other times.
HH + CA + CL + CE.SubStatisticalRectangle (new field): this field splits the statistical rectangle into multiple polygons. The polygons are nationally defined, but it is encouraged that their definition, like their coding, be coordinated internationally (in the RCMs). The field is part of the natural key. It is optional, and is a string of up to 10 characters. When a value is given in SubStatisticalRectangle, StatisticalRectangle should also be given (condition subject to check during upload).
HH.Fishing activity category National: description changed to “National coding system. Bound to the DCR matrix (Com Reg. XXX/2008) level 6 as children, i.e. a national stratification of metier.”
HH + CL + CE, FAC fields: new comment added: “‘Fishing activity category European lvl 6’ is mandatory for data from 2009 and onwards (condition subject to check during upload).” New comment added: “Either ‘Fishing activity category European lvl 5’ or ‘Fishing activity category European lvl 6’ should be provided, not both; preferably lvl 6, since this includes the lvl 5 information (condition subject to check during upload).”
Modification of the FAC code system: for gears that have neither a mesh size nor a selection device, use “-”. For example, a longline fishery for demersal fish would be coded “LLS_DEF_-_-_-”.
Clarification of the FAC code system: for fisheries where there is no regulation on mesh size, or on the mesh size in the selection device, use “0” (a format-check sketch is given after this list). For example, a gillnet fishery for demersal fish in the Mediterranean would be coded “GNS_DEM_0_-_0”.
HH.Gear: field to be removed, since it is no longer needed (the information is included in the mandatory FAC lvl 5 field).
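The level-6 FAC codes above follow a fixed five-component pattern (gear, target assemblage, mesh size, selection device, selection device mesh size), with “-” for components that are not relevant and “0” where no regulation applies. A minimal upload-time format check could look like the following sketch; checkFACLevel6() is a hypothetical helper name, not an existing COST function, and the exact component alphabet is an assumption.

checkFACLevel6 <- function(codes) {
  # gear and target codes (2-4 capital letters), then mesh size, selection
  # device and device mesh size: each a number or range, "0" (unregulated)
  # or "-" (not relevant)
  pattern <- "^[A-Z]{2,4}_[A-Z]{2,4}(_(-|0|>?=?[0-9]+(-[0-9]+)?)){3}$"
  ok <- grepl(pattern, codes)
  if (any(!ok))
    warning("malformed FAC lvl 6 codes: ", paste(codes[!ok], collapse = ", "))
  ok
}

# checkFACLevel6(c("LLS_DEF_-_-_-", "GNS_DEM_0_-_0", "OTB_DEF"))
# returns TRUE TRUE FALSE, with a warning for the truncated code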
SL.Sex (new field): this field is needed to obtain the overall mean weight for each sex separately for Nephrops and megrim. Agreement to add this field, with the codes Male and Female. The field will be an optional key field.
SL.Taxon (addition of the field refused): see the modification of CL.Species to CL.Taxon below.
SL.CommercialSizeCategoryScale: a new code for Nephrops sorted into whole and tails.
SL.CommercialSizeCategory: “Whole” or “Tails” in the case of Nephrops (encoded as 0 and 1), referencing the new scale above.
SL.Weight: text changed from “Whole weight in gram. Decimals not allowed. For sea sampling: weight of the corresponding stratum. For market sampling: catch weight is per definition equal to sample weight.” to “Whole weight in gram. Decimals not allowed. Weight of the corresponding stratum (Species – Catch category – Size category – Sex).”
HL.Sex: no change. When SL.Sex is implemented, a new comment is needed: “HL.Sex should match SL.Sex if this is set to a specific value (condition subject to check during upload).”
HL.LengthClass: text: “Lower bound of length class”.
CA.Age: text should be “Estimated age”, and the field changed to optional.
CA.AgingMethod (new field): the description field should contain the aging medium (“otoliths”, “scales”, …) together with the method used for reading (“in toto”, “break & burn”, “slides with transmitted light”, …).
CA.MicroscopicMaturityStage (addition of the field refused): no agreement to include this field. It was found that only one maturity stage should be provided per individual; data for comparing different methods are to be kept in national and ad hoc research databases. It was agreed, however, that the maturity staging method should be reported together with the value (just as for aging), so the following are proposed:
CA.MaturityScale: new code “Crustacean scale”.
CA.MaturityStage: new codes “Berried” and “Not berried” (integers), to be included in the maturity scale for crustaceans (“Crustacean scale”).
CA.MaturityStagingMethod (new field): this field would also carry codes for histological methods. Codes so far: “Visual”, “Histological”.
CL.Species: renamed to Taxon. The valid code list is extended to the list of species plus the list of higher-level taxa. The code lists in the systems (FishFrame and COST) should hold the information about the taxonomic relations needed for the higher-level taxa; for example, the raising procedure needs to know that Lophius sp. consists of L. piscatorius and L. budegassa.

In addition, a working document (Annex A) was made available to the group, specifying the way to encapsulate the COST R functions in FishFrame. This working document will be included in the final report, under the section ‘Relation with other platforms’. No specific requirements are placed on the COST developers, apart from underlining the complexity of the encapsulation, which argues for minimising the number of functions to develop (the S4 methods philosophy).

COSTcore – Presentation and discussion
The COSTcore package has been at version 1.0 since April 2008; recent improvements have since brought the version used to 1.2-1. During the discussion some issues were raised which will need to be implemented very rapidly:
o a subset function able to subset on any CS table, e.g. area in HH, species in SL (see the sketch below);
o the check functions should be reviewed to focus better on the major errors, and be included in the importation methods;
o inclusion of the field “Date” in the consolidated HH table; this field is important for the model-based estimates, which need to assess the real age of the individuals throughout the year;
o addition of the year information to the consolidated field ‘time’ in all tables, whatever time stratification is used, e.g. “2006 – 1” for quarter or month 1 in 2006. This keeps the year information in the consolidated tables and allows multi-year analyses by the COST functions if necessary.
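A minimal sketch of what such a subset method could look like, assuming the CS object stores the linked TR, HH, SL, HL and CA tables as data frames in a list and that the trip code (trpCode) is the key propagated between them; subsetCS() and the slot names are illustrative, not the actual COSTcore signatures.

subsetCS <- function(cs, table = "hh", subset) {
  expr <- substitute(subset)                  # capture the condition unevaluated
  tab  <- cs[[table]]
  keep <- eval(expr, tab, parent.frame())     # evaluate it inside the chosen table
  cs[[table]] <- tab[keep, , drop = FALSE]
  trips <- unique(cs[[table]]$trpCode)        # propagate the selection by trip key
  for (nm in c("tr", "hh", "sl", "hl", "ca"))
    if (!is.null(cs[[nm]]))
      cs[[nm]] <- cs[[nm]][cs[[nm]]$trpCode %in% trips, , drop = FALSE]
  cs
}

# e.g. keep only hauls in area "VIId", or only sole in the SL table:
# cs <- subsetCS(cs, "hh", area == "VIId")
# cs <- subsetCS(cs, "sl", spp == "Solea solea")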
Tasks related to COSTcore (task – attributed to – deadline)
Develop subset function – Mathieu & Ernesto – July 2008
Review the check functions – Alastair – July 2008
Addition/modification of fields in consolidated tables – Mathieu – July 2008

COSTeda – Presentation and discussion
COSTeda was presented and no major modification is planned. It is proposed that a large beta-testing exercise be run in the coming months, in order to permit the release of version 1.0.

Tasks related to COSTeda (task – attributed to – deadline)
Test all exploratory analysis functions with the datasets present in COSTdata – All developers – August 2008
Include the GSA geographical limits in the mapping functions – Alastair – July 2008

COSTdbe/COSTmbe – Generic functions
A number of functions are to be developed very quickly in order to address issues common to both the design-based and the model-based estimates. The functions and objects to develop are the following.

Zero values: the COST datasets do not contain any value for species that were not caught. The inclusion of 0-values in the estimation must follow strict rules, e.g. the fishing operation must have been sampled and all species considered. This issue has already been addressed for the analytical methods, so there is a need to extract that code and make it generic, in a function that all methods should call prior to calculating the estimates and associated variances (see the sketch below).

Defining the sampling strategy: there is a need for a function that scrutinises the whole dataset in order (i) to determine which sampling strategy was used, and (ii) to check that all the related information is present. The end user should either validate the sampling strategy identified by the function or specify it in an argument. The decision rule for identifying the sampling strategy from the dataset information is given in Annex B.
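A minimal sketch of the zero-values rule described above, assuming the sampled catches are held in a data frame with one row per fishing operation and species actually recorded; fillZeroValues() and the column names are illustrative, not COST function names.

fillZeroValues <- function(catch, species) {
  # all fishing operations that were sampled, whether or not they caught `species`
  ops  <- unique(catch[, c("trpCode", "staNum")])
  grid <- merge(ops, data.frame(spp = species))  # every sampled operation x species
  out  <- merge(grid, catch, all.x = TRUE)       # keeps operations with no catch of spp
  out$wt[is.na(out$wt)]       <- 0               # explicit zero catch weight
  out$subWt[is.na(out$subWt)] <- 0               # explicit zero sample weight
  out
}

# catch <- data.frame(trpCode, staNum, spp, wt, subWt)   # built from the SL table
# sole  <- fillZeroValues(catch, "Solea solea")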
Managing gaps in the ALK: gaps in age-length keys are considered a major issue, especially for bootstrapping (see the following sections). A function is needed to check the match between the length structure of the parameter to estimate and the length structure of the associated ALK. In case of a mismatch, it was seen as preferable to offer the user suggestions of automatic solutions. Experience shows that gaps in ALKs are usually solved by “expert filling”, which can be summarised as manual tabulation of the empty cells; here it was found preferable to propose appropriate groupings of length classes instead (a sketch is given after the list). The different cases encountered are:
o Case 1: the proportion of missing length classes is too high (threshold to be defined/parameterised). The ALK is rejected and the function stops.
o Case 2: the gaps are spread all over the ALK (to be defined/parameterised). The proposed solution is to widen the length classes, limited to 2 cm and 3 cm steps. At each step, it should be checked whether the new ALK remains in case 2 or has evolved into case 3 or case 4.
o Case 3: a few small gaps (to be defined/parameterised) occur in the middle of the ALK. The proposed solution is to sum the numbers-at-age of the length classes immediately above and below, and to introduce this sum into the missing length class. The modified ALK should then be checked to see whether it has evolved into case 4.
o Case 4: the gaps are at the extrema. The grouping should be done only at these extrema, considering the first filled length class and a number (to be defined/parameterised) of filled large length classes.
o In general: the user should keep control over the solutions adopted (to be discussed in the case of bootstrapping!).
In all cases, the recoding of the length classes should be done on the consolidated CA table and on a duplicate of the outcome of the length-structure estimates (not on the consolidated HL table, and without changing the original length-structure estimates).
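A minimal sketch of the case classification and of the case 3 fix, for an ALK stored as a numbers-at-age matrix with one row per length class (in increasing order); alkDiagnose() and alkFillMiddle() are illustrative names, and the threshold is one of the parameters left open above.

alkDiagnose <- function(alk, maxPropEmpty = 0.5) {
  empty <- rowSums(alk) == 0
  if (mean(empty) > maxPropEmpty) return("case 1: reject the ALK")
  if (!any(empty)) return("no gaps")
  inner <- empty & seq_along(empty) > min(which(!empty)) &
                   seq_along(empty) < max(which(!empty))
  if (any(inner)) "case 2/3: internal gaps" else "case 4: gaps at the extrema"
}

# Case 3 fix: fill an internal empty length class with the sum of the
# numbers-at-age of the classes immediately below and above it.
alkFillMiddle <- function(alk) {
  empty <- which(rowSums(alk) == 0)
  inner <- empty[empty > 1 & empty < nrow(alk)]
  for (i in inner) alk[i, ] <- alk[i - 1, ] + alk[i + 1, ]
  alk
}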
Outcome object: producing identical objects (or at least identical headers/containers) as the outcome of the functions contributes to the genericity of COST. This object should have a common structure: a core of identical headers/containers, plus specific headers/containers related to the function used. The core headers should be:
o $species – recall of SL$spp (+ SL$taxon + SL$sex)
o $catchCat – recall of the catch category (discards / landings)
o $param – recall of the parameter estimated (N, W, maturity, sex-ratio, …)
o $strataDes – time, space and technical stratification considered
o $methodDesc – recall of the method (analytical, bootstrap, Bayesian)
o $nSamp – number of samples
o $nMes – number of individuals measured
o $lenStruc – estimates of the length structure (param-at-length)
o $lenVar – estimates of the variance of $lenStruc
o $ageStruc – estimates of the age structure (param-at-age)
o $ageVar – estimates of the variance of $ageStruc
o $totalN – estimate of the total number for the parameter
o $totalNvar – estimate of the variance of $totalN
o $totalW – estimate of the total weight for the parameter
o $totalWvar – estimate of the variance of $totalW
It was agreed that the outcome object should match the demands of the end user in terms of aggregation. For this purpose, the functions should all take an argument specifying the aggregation level, e.g. one metier – one area – yearly estimates, one area – all metiers – quarterly estimates, etc.

Outcome graphs: the outcome graphs based on the common headers of the outcome object should be implemented once, for all functions. Graphs for information specific to the function used (e.g. the distribution of replicates) should be developed in the appropriate package.

The precision indicators to implement have to be in accordance with the new DCR. The specifications required by the regulation are:
Chapter II Section B.4: Where reference is made to precision/confidence levels, the following distinction shall apply. Level 1: a level making it possible to estimate a parameter with a precision of plus or minus 40% for a 95% confidence level, or a coefficient of variation (CV) of 20% used as an approximation; levels 2 and 3: +/- 25% (CV = 12.5%) and +/- 5% (CV = 2.5%) respectively.
Chapter III Section B1.4.2.a: Data related to quarterly estimates of the discards length and age composition for Group 1 and Group 2 species must reach precision level 1.
Chapter III Section B2.4:
(1) For stocks of species that can be aged, average weights and lengths for each age shall be estimated at precision level 3, up to such an age that accumulated landings for the corresponding ages account for at least 90% of the national landings for the relevant stock.
(2) For stocks for which age reading is not possible, but for which a growth curve can be estimated, average weights and lengths for each pseudo-age (e.g. derived from the growth curve) shall be estimated at precision level 2, up to such an age that accumulated landings for the corresponding ages account for at least 90% of the national landings for the relevant stock.
(3) For maturity, fecundity and sex ratios, a choice may be made between reference to age or to length, provided that the Member States which have to conduct the corresponding biological sampling have agreed the following:
(a) for maturity and fecundity, calculated as the proportion of mature fish, precision level 3 must be achieved within the age and/or length range whose limits correspond to 20% and 90% of mature fish;
(b) for sex ratio, calculated as the proportion of females, precision level 3 must be achieved, up to such an age or length that cumulated landings for the corresponding ages or lengths account for at least 90% of the national landings for this stock.
It was found that the CV does not suit proportion estimates, and it was proposed to measure the half-width of the confidence interval instead. For all vector estimates (parameter-at-length and parameter-at-age), the standard mean over the length/age range corresponding to the DCR specifications should be proposed.

Tasks related to general issues (task – attributed to – deadline)
0-values function – Mathieu – July
Sampling strategy function – Mathieu – July
Gaps-in-ALK function – Mathieu – July
Outcome object – Mathieu – July
Outcome graph – Mathieu – End September

COSTdbe/COSTmbe – Estimates of volume of discards
The analytical estimation of the volume of discards (number and weight) is ongoing and almost ready for testing. The estimation follows the outcomes of the ICES Workshop on Discard Raising Procedures (WKDRP, http://www.ices.dk/reports/ACOM/2007/WKDRP/WKDRP07.pdf) and the working document used in its support (Vigneau, 2007, http://www.ifremer.fr/docelec/doc/2006/acte-2699.pdf).
The sampling units for the bootstrap estimates are clearly the fishing trips (primary sampling units) and the fishing operations (secondary sampling units). A nested bootstrap, i.e. a bootstrap on fishing operations within each replicate of fishing trips, is likely to be problematic, both in terms of run time and in terms of the number of secondary sampling units to resample. Moreover, it is known that the overall variance of discards estimates is driven by the between-trip variance. The solution could come from a hybrid bootstrap, i.e. bootstrapping the fishing trips and taking the within-trip variance into account at each step of the resampling (see the sketch below). The literature needs to be investigated in order to validate whichever method is eventually used.
The Bayesian model is an integrated model that estimates all the parameters together. Because the model will be generating the samples from a virtual population, the deadline is shortened to the end of August.
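A minimal sketch of the trip-level (primary sampling unit) part of such a bootstrap, for a ratio-to-an-auxiliary-variable raising of total discards; the data layout and the function name are assumptions, and the within-trip variance component discussed above is deliberately left out.

bootDiscards <- function(trips, totalAux, B = 1000) {
  # trips: one row per sampled trip, with observed discards (disc) and an
  # auxiliary variable (aux, e.g. landings); totalAux: fleet total of aux
  est <- replicate(B, {
    idx <- sample(nrow(trips), replace = TRUE)              # resample primary units
    sum(trips$disc[idx]) / sum(trips$aux[idx]) * totalAux   # raised ratio estimate
  })
  c(total = mean(est), var = var(est),
    ci = quantile(est, c(0.025, 0.975)))                    # simple percentile CI
}

# trips <- data.frame(disc = ..., aux = ...)
# bootDiscards(trips, totalAux = 50000)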
Tasks related to the estimation of the volume of discards (task – attributed to – deadline)
Create a generic function to estimate the total volume (weight or number) based on: a) multistage sampling (raised by trip); b) ratio-to-size (raised by fishing operations [2 stages]); c) ratio-to-size (raised by fishing days [3 stages]); d) ratio to an auxiliary variable – Mathieu – End September
Deliverables – Mathieu – December
Investigation of a hybrid bootstrap – David M. & Joël – End July
Bootstrap estimates – David M. – End September
Deliverables – David M. – December
Bayesian estimates – David H. – End August
Deliverables – David H. – December

COSTdbe/COSTmbe – Estimates of length and age structure
Analytical estimates: the analytical estimation of the length and age structure is ongoing. The problems raised by the matching of the length structure and the ALK should be solved by the gap-filling function.
Bootstrap estimates: the big issue for the bootstrap is the high probability of generating incomplete ALKs when resampling the age samples. Although a function will be implemented to fill the gaps in the ALK (see the section above), there is a risk of the ALK being rejected because of too many gaps. The alternative could be to resample the individual ages grouped by length class. This approach, which respects the field protocol, is also likely to encounter length classes (at the extrema) with very few individuals, causing problems in a resampling process. In the absence of a risk-free procedure for bootstrapping the ALK, it was decided not to resample the individuals over the whole ALK, as often seen in the literature, but to implement both the bootstrap of the age samples and the bootstrap stratified by length class. The simulation package will further evaluate the goodness of fit of the procedures used, and will validate the approach or not. It was also decided to implement only the simple bootstrap and to avoid all the bias-reducing variants found in the R bootstrap package. The optimisation of the bootstrap estimates should be done in a second stage or in a continuation of the COST project.

Tasks related to the estimation of length and age structure (task – attributed to – deadline)
Analytical estimates – Marcel – End September
Deliverables – Marcel – December
Bootstrap estimates – David M. – End September
Deliverables – David M. – December
Bayesian estimates – David H. – End August
Deliverables – David H. – December

COSTdbe/COSTmbe – Estimates of biological parameters
The analytical functions are beta versions. Fitting a curve to the data should not be the default option, but should be authorised via an argument to the function (a sketch of one such fit is given below). It was noted that the parameter estimates reflect the parameters in the catches, not in the population. The functions should all have arguments specifying the sampling type and the time window, so as to focus on a precise subset of the data (there is an international requisite that maturity be estimated on a time window, and the corresponding subsetting should be done on the month field here). The bootstrap parameters-at-age will face the same problems as those described above for the age structure, when an ALK is used.
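As an example of the model-based option, a minimal sketch of a logistic maturity-at-length fit (one of the models listed in the tasks below), assuming one row per individual with a 0/1 maturity flag; the column names are illustrative.

fitMaturityAtLength <- function(ca) {
  # ca: one row per individual, with length class (lenCls) and maturity (0/1)
  glm(mature ~ lenCls, family = binomial, data = ca)
}

# m   <- fitMaturityAtLength(ca)
# L50 <- -coef(m)[1] / coef(m)[2]   # length at 50% maturity from the logistic fit
# predict(m, newdata = data.frame(lenCls = 10:60),
#         type = "response", se.fit = TRUE)   # maturity ogive with standard errors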
Tasks related to the estimation of biological parameters (task – attributed to – deadline)
Analytical estimates of empirical weight-at-length, maturity-at-length, sex-ratio-at-length and variances [beta versions to be improved] – Mathieu – End September
Bootstrap estimates of empirical weight-at-length, maturity-at-length, sex-ratio-at-length and variances [beta versions to be improved] – Paz – End July
Estimates from models: a) weight-at-length (length-weight relationship); b) weight-at-age (von Bertalanffy); c) maturity-at-length (logistic); d) maturity-at-age (logistic); e) sex-ratio-at-length (binomial model); f) sex-ratio-at-age (binomial model) – Paz (in the COSTmbe package) – End September
Bayesian estimates of empirical weight-at-length, maturity-at-length, sex-ratio-at-length and variances [beta versions to be improved] – David H. – End September
Deliverables – All developers – December

COSTsim – Description of work for the package
The work to be done in this package can be split into two categories: (i) the comparison of model-based, design-based and analytical methods on ‘true’ simulated populations, and (ii) simple optimisation of sampling. Given the large potential scope of the work and the limited time remaining, the discussion focused on reducing the scope to the essential needs. For example, the optimisation of sampling strategies will be limited to the comparison of a small number of possibilities.
The starting point was to consider the method for generating a true population from the available datasets, or rather for generating COST-format samples under a limited number of sampling strategies. The function will be made available from the Bayesian model by David Hirst by the end of August. The different situations covered by the methods will emulate the sampling strategies available in the datasets, and will generate different sampling efforts in the strata, including poorly sampled strata.
The performance statistics that will be used to compare the methods are the following (a sketch is given after the list):
Bias: the difference between the expected value of the estimator (the mean of the estimates over all possible samples that can be taken from the population) and the true population value.
Coverage of the CI: the proportion of times the CI contains the true value.
Precision: the difference between a sample estimate and the mean of the estimates over all possible samples of the same size that can be taken from the population. The variance is a possible quantitative measure of precision.
Accuracy: the difference between a sample estimate and the true population value; it is a combination of both bias and precision. The mean squared error between the estimates and the true value is a common measure of accuracy.
Other statistics from the literature will also be investigated (Walther and Moore, 2005).
The optimisation methods will not consider the optimisation of the sampling allocation among strata in relation to their size and heterogeneity (Neyman optimisation), but only the response of the precision and variance to different sample sizes, both in length and in age. The optimisation based on the size of the strata can be done with the function FFFF of the exploratory data analysis package.
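A minimal sketch of these performance statistics, computed over a set of replicate estimates against a known true value; the function name and the expected inputs are assumptions.

performanceStats <- function(est, lo, hi, truth) {
  # est: one estimate per simulated sample; lo, hi: the matching CI bounds
  c(bias      = mean(est) - truth,                # expected value vs the truth
    coverage  = mean(lo <= truth & truth <= hi),  # proportion of CIs hitting the truth
    precision = var(est),                         # spread around the mean estimate
    accuracy  = mean((est - truth)^2))            # MSE, i.e. bias^2 + variance
}

# performanceStats(est, lo, hi, truth = trueTotal)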
WP6 diagram (schematic): simulate a real haul population (the whole population of hauls) from real sampling data; sample this population, replicating different sampling procedures; calculate point and variance estimates from the samples using the design-based and model-based methods from WP5 and WP6; compare the true and estimated parameters (performance statistics) to determine which method performs best in each situation, and to assess sample size.

Tasks related to the simulation package (task – attributed to – deadline)
Generate a function that generates virtual samples based on a real dataset – David Hirst – End of August
Generate virtual samples under different conditions – Dorleta – End September
Compare the methods – Dorleta – End October
Agreement on the conclusions – Leaders of the package – Mid December
Sample size – Dorleta – End November
Deliverables – Dorleta – December

References
Walther, B. A. and Moore, J. L. 2005. The concepts of bias, precision and accuracy, and their use in testing the performance of species richness estimators, with a literature review of estimator performance. Ecography 28: 815–829.

Agenda
Beginning of September:
o availability of the functions by the end of September
o using the S4 classes
o encapsulation of usual R scripts
To be done at the final stage of the project:
o final report of the project
o help (man) pages for each of the packages
o user manuals (based on the EDA user manual) + presentation section + importation section

ANNEX A – Working document for the COST Expert Meeting, June 2008
FishFrame/COST Integration Overview

Introduction
During the development of FishFrame, the development team created a module to interact with R, RConnector. The intent of this module is to provide the interface between FishFrame and COST. RConnector can be configured to interact with individual R scripts. This configuration is provided through specifically structured XML files, one per R script. Through this interface it is possible to define that COSTcore data objects, data frames (provided as the results of a stored procedure) and simple scalar values can be passed as parameters to the R script. The expected output of an R script can also be defined, to include data frames, lists and images. An example of an R script and its associated XML configuration file is given in the section “Example”, while a listing of the XSD schema used to validate these configuration files is given in Appendix A.

FishFrame/COST Integration
All COST functions to be integrated into FishFrame will need to have their XML configuration defined, as well as an entry point within the FishFrame user interface, probably within the data processing or reporting sections. The FishFrame application will also be responsible for providing all the required data to the COST function, in the form of the previously mentioned data types. COST functions will be integrated into FishFrame by including a copy of the R scripts on the FishFrame web server. The initial phase of integrating a function will include the creation of the configuration file and of the user-interface entry point. Upon future releases of the COST functions, any change to a function’s interface will need to be mirrored in the XML configuration. New COST functions will also be able to be included. COST functions will not have write access to the FishFrame database.

Example
Interface to R Function
Create an R function wrapper; this function can accept three types of arguments:
1. COSTcore data objects
2. data frames resulting from stored procedure calls
3. simple values (double, Boolean, string, integer)
There is a restriction on the order of the arguments (see the schema in Appendix A): the simple values must come after the COSTcore data objects or data frames. Any graphics output will be written to a .png file if specified in the XML definition of the function wrapper. All output values (data frames or simple values) must be returned in a list with named elements, whose names must correspond to the XML definition.

Example (myScript.R):

myFunction <- function(foo, maxDepth) {
  ## do some calculations and make a plot ...
  aDataFrame <- ...
  aPvalue <- ...
  return(list(outData = aDataFrame, pval = aPvalue))
}

Creating the XML configuration
The next step is to create an XML document that matches the function wrapper. This document defines how to call the function wrapper and what kind of output it generates, and it must be an instance of (i.e. it can be validated against) the schema file “RScriptSchema.xsd” (online schema validator: http://tools.decisionsoft.com/schemaValidate/).

Example:

<?xml version="1.0"?>
<RScriptDefinition>
  <name>My Script</name>
  <functionName>myFunction</functionName>
  <source sourceType="script">myScript.R</source>
  <inputData>
    <dataType>CS</dataType>
    <name>foo</name>
  </inputData>
  <parameter>
    <name>maxDepth</name>
    <type>integer</type>
    <description>Some description of the parameter maxDepth</description>
  </parameter>
  <output>
    <graph>true</graph>
    <dataFrame>outData</dataFrame>
    <value>
      <name>p-value</name>
      <type>double</type>
      <variableName>pval</variableName>
      <description>This value explains p-value</description>
    </value>
  </output>
</RScriptDefinition>

In this example the input parameter “foo” is a COSTcore CS data object; the other valid data types are the CE and CL COSTcore data objects, and StoredProcedure, which relates to a data frame resulting from the execution of the defined stored procedure. The parameter “maxDepth” is a simple value of type integer. How these data values are passed into the R function is defined in the following section. The example also declares that the R function will output a graph, a data frame named “outData” and a simple value “pval”. How the returned data are retrieved is described in the section “Retrieving return data”.

Running an R Function through RConnector
The R function can now be executed through the RConnector module:

RScriptResult result = rcon.RunScript("C:/RScriptSchema.xsd", "C:/myScript.xml", "C:/myGraph.png", someCSdata, 10);

where "C:/myGraph.png" is the output file for the graph produced by the R function, and “someCSdata” and “10” are the input data for the R function. The return object, “result”, of class RScriptResult, will contain the output from the R function: the data frame “outData” and the simple value “pval”.
Appendix A – RScriptSchema.xsd

<?xml version="1.0" encoding="utf-8"?>
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema" elementFormDefault="unqualified" attributeFormDefault="unqualified">

  <!-- Definitions of simple elements -->
  <xs:element name="description" type="xs:string"/>

  <!-- dataTypeEnum is used for describing the type of raw data -->
  <xs:simpleType name="dataTypeEnum">
    <xs:restriction base="xs:string">
      <xs:enumeration value="CS"/>
      <xs:enumeration value="CL"/>
      <xs:enumeration value="CE"/>
      <xs:enumeration value="StoredProcedure"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- sourceTypeEnum describes the type of R input function -->
  <xs:simpleType name="sourceTypeEnum">
    <xs:restriction base="xs:string">
      <xs:enumeration value="script"/>
      <xs:enumeration value="lib"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- basic R types -->
  <xs:simpleType name="RType">
    <xs:restriction base="xs:string">
      <xs:enumeration value="double"/>
      <xs:enumeration value="integer"/>
      <xs:enumeration value="boolean"/>
      <xs:enumeration value="string"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Enumeration, only strings and characters -->
  <xs:element name="enum">
    <xs:simpleType>
      <xs:list itemType="xs:string"/>
    </xs:simpleType>
  </xs:element>

  <xs:simpleType name="variableName">
    <xs:restriction base="xs:string">
      <!-- regular expression for a valid variable name in R -->
      <xs:pattern value="([a-z]|[A-Z])[a-zA-Z0-9_.]*"/>
    </xs:restriction>
  </xs:simpleType>

  <!-- Definition of attributes -->
  <!-- Definition of complex elements -->
  <xs:element name="source">
    <xs:complexType>
      <xs:simpleContent>
        <xs:extension base="xs:string">
          <xs:attribute name="sourceType" type="sourceTypeEnum" use="required"/>
        </xs:extension>
      </xs:simpleContent>
    </xs:complexType>
  </xs:element>

  <!-- Input data (raw) -->
  <xs:element name="inputData">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="dataType" type="dataTypeEnum"/>
        <xs:element name="name" type="xs:string"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- Parameters to the R-script (flags etc. beyond raw data) -->
  <xs:element name="parameter">
    <xs:complexType>
      <xs:sequence>
        <!-- parameter name (in R) -->
        <xs:element name="name" type="variableName"/>
        <!-- type of parameter -->
        <xs:choice>
          <xs:element name="type" type="RType"/>
          <xs:element ref="enum"/>
        </xs:choice>
        <!-- optional default value -->
        <xs:element name="default" type="xs:string" minOccurs="0"/>
        <!-- optional description -->
        <xs:element ref="description" minOccurs="0"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- A "simple" value that the R-script returns -->
  <xs:complexType name="value">
    <xs:sequence>
      <xs:element name="name" type="xs:string"/>
      <xs:element name="type" type="RType"/>
      <xs:element name="variableName" type="variableName"/>
      <xs:element ref="description" minOccurs="0"/>
    </xs:sequence>
  </xs:complexType>

  <xs:element name="output">
    <xs:complexType>
      <xs:sequence>
        <!-- does the script/function generate graphics output -->
        <xs:element name="graph" type="xs:boolean"/>
        <xs:element name="dataFrame" type="variableName" minOccurs="0" maxOccurs="unbounded"/>
        <xs:element name="value" type="value" maxOccurs="unbounded"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- ////////////// Script definition begins ////////////// -->
  <xs:element name="RScriptDefinition">
    <xs:complexType>
      <xs:sequence>
        <!-- Unlike most complexTypes above, new content can be added here without breaking the library parser -->
        <!-- The name of the script -->
        <xs:element name="name" type="xs:string"/>
        <!-- optional description -->
        <xs:element ref="description" minOccurs="0"/>
        <!-- The name of the function to call in R -->
        <xs:element name="functionName" type="xs:string"/>
        <xs:element ref="source"/>
        <xs:element ref="inputData" maxOccurs="unbounded"/>
        <xs:element ref="parameter" maxOccurs="unbounded"/>
        <xs:element ref="output"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>

ANNEX B – Decision rule for the function investigating the sampling strategies

For each HH$catchCat & SL$spp:
o check that HH$catchReg = All & HH$sppReg = All;
o check data integrity, conditional on the function (discards W & N, length structure, age structure, biological parameters).

Key: (+) = all cells filled; (-) = at least one cell not filled; (--) = entirely empty.

1 – Sampling for length in fishing trips, unsorted catch:
HL (+/-), HH@fishing activities (+), SL@commCat (-), CL@fishing activities (=HH), CL@commCat (--)
2 – Sampling for length in fishing trips, commercial categories:
HL (+/-), HH@fishing activities (+), SL@commCat (+), CL@fishing activities (=HH), CL@commCat (--)
3 – Sampling for length in commercial categories:
HL (+/-), HH@fishing activities (-), SL@commCat (+), CL@fishing activities (--), CL@commCat (=SL)
4 – Sampling for age in fishing trips, unsorted catch:
HL (--), HH@fishing activities (+), SL@commCat (-), CA$trpCode (+), CA$staNum (+), CL@fishing activities (=HH), CL@commCat (--)
5 – Sampling for age in fishing trips, commercial categories:
HL (--), HH@fishing activities (+), SL@commCat (+), CA$trpCode (+), CA$staNum (+), CL@fishing activities (=HH), CL@commCat (--)
6 – Sampling for age in commercial categories:
HL (--), HH@fishing activities (-), SL@commCat (+), CA$trpCode (+), CA$staNum (+), CL@fishing activities (--), CL@commCat (=SL)

Within one stratum, only mixtures of 1 & 2 or of 4 & 5 are authorised.
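A minimal sketch of how the sampling-strategy function from the ‘Generic functions’ section could apply this decision rule, assuming helper predicates reporting the fill state of each block of cells; fillState() and detectStrategy() are illustrative names, not COST functions.

fillState <- function(x) {
  # fill state of a block of cells, using the key above
  if (all(is.na(x))) "--"          # entirely empty
  else if (any(is.na(x))) "-"      # at least one cell not filled
  else "+"                         # all cells filled
}

detectStrategy <- function(hl, hhFAC, slCommCat, caKeys) {
  # age sampling: HL entirely empty and the CA trip/station keys filled
  age       <- fillState(hl) == "--" && fillState(caKeys) == "+"
  byCommCat <- fillState(slCommCat) == "+"
  inTrips   <- fillState(hhFAC) == "+"
  if (!age && inTrips)  return(if (byCommCat) 2 else 1)
  if (!age && !inTrips) return(3)
  if (age  && inTrips)  return(if (byCommCat) 5 else 4)
  6                                # age sampling in commercial categories
}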