Trawl Catch Statistics

Surname:                    Student No.

Introduction

Many fisheries agencies keep detailed statistics on fish stocks, sampling specifically for that purpose from research vessels. We have at hand trawl catch statistics for a coastal estuary for the years 1999 and 2000. The data are in the form:

    SMALLMOUTH_FLOUNDER  1999  JUN   84
    SPOT                 1999  JUL  180
    BLUE_CRAB            1999  MAY   27
    BAY_ANCHOVY          1999  AUG   43
    ATLANTIC_CROAKER     1999  FEB  253

where the first column is the fish species, the second column is the year, the third column is the month and the fourth column is fish length in mm. There are data for 53,856 fish in the dataset.

In this exercise, you are asked to interrogate the dataset to answer some questions of specific interest.

Part 1: Data Entry

Input the data to a workfile called TRAWL. You will need to notify SAS that the species variable is a character variable of up to 21 characters, otherwise your species names will be truncated to 8 characters. Do this with a LENGTH SPECIES $ 21; statement in the DATA step immediately before the INPUT statement. Transform the length measurements from mm to cm with an assignment statement. Add a label to the length variable reading "TOTAL LENGTH IN CM".

Paste your program code here.

    DATA TRAWL;
      INFILE "C:\AAAAA\TRAWL.DAT";
      LENGTH SPECIES $ 21;                /* allow species names of up to 21 characters */
      INPUT SPECIES $ YEAR MONTH $ LENGTH;
      LENGTH=LENGTH/10;                   /* convert mm to cm */
      LABEL LENGTH="TOTAL LENGTH IN CM";
    RUN;

Copyright Arthur Georges 2002

Confirm that the data have been correctly input. Outline here the measures you took to confirm this.

1. Referred to the LOG Window to confirm that 53,856 data lines were read.
2. Perused the data in the EXPLORER Window to confirm contents.

Part 2: Summary by Species

Perform an appropriate analysis to yield summary statistics for each species, including only sample size, minimum, maximum and mean fish size. Your programming solution to this question should include only a single PROC step, and should make use of the BY statement. Do not forget to sort your data first.
Paste your program code here.

    PROC SORT DATA=TRAWL;                 /* BY-group processing needs sorted data */
      BY SPECIES;
    RUN;
    PROC MEANS DATA=TRAWL N MEAN MIN MAX;
      VAR LENGTH;
      BY SPECIES;
    RUN;

Paste an extract of the tabular output from your program here.

    ------------------ SPECIES=American_eel ------------------

    The MEANS Procedure

    Analysis Variable : LENGTH

          Mean         Minimum         Maximum
    --------------------------------------------
    26.0024510      13.3000000      64.4000000
    --------------------------------------------

    ---------------- SPECIES=Atlantic_croaker ----------------

    Analysis Variable : LENGTH

          Mean         Minimum         Maximum
    --------------------------------------------
    12.3799480       0.4000000      40.3000000
    --------------------------------------------

    ---------------- SPECIES=Atlantic_menhaden ---------------

    Analysis Variable : LENGTH

          Mean         Minimum         Maximum
    --------------------------------------------
     9.4760417       1.7000000      32.1000000
    --------------------------------------------

Part 3: Graphic Summary by Species

Perform an appropriate analysis to yield a barchart showing the relative abundance of the different species in the trawl dataset. Your analysis should yield a high quality barchart. Be sure to add a title to your graph.

Paste your program code here.

    GOPTIONS RESET=ALL;
    TITLE "BARCHART OF FISH SPECIES COUNTS";
    PROC GCHART DATA=TRAWL;
      HBAR SPECIES / TYPE=PCT;            /* bar length = percentage of all fish */
    RUN;

Paste graphic output from your program here.

Part 4: Size Distribution

Perform an appropriate analysis to yield a histogram showing the size distribution for the most abundant fish species in the dataset. Use a WHERE statement to select only data for that fish species. Your analysis should yield a high quality histogram. Be sure to add a title to your graph.

Paste your program code here.

    GOPTIONS RESET=ALL;
    PROC GCHART DATA=TRAWL;
      TITLE "ATLANTIC CROAKER";
      VBAR LENGTH / TYPE=PCT SPACE=0;     /* SPACE=0 joins the bars into a histogram */
      WHERE SPECIES="ATLANTIC_CROAKER";
    RUN;

Paste graphic output from your program here.

Describe in words what you see.

1. The size distribution for Atlantic Croaker is bimodal, and certainly not normal.
2. The bimodality could have arisen from a recruitment event, or may represent sexual size dimorphism. More background on the species is needed for a reasonable interpretation.

Calculate a full set of summary statistics for length of the above species.

Paste your program code here.

    PROC UNIVARIATE DATA=TRAWL;
      VAR LENGTH;
      WHERE SPECIES="ATLANTIC_CROAKER";
    RUN;

Paste the tabular output from your program here.

    The UNIVARIATE Procedure
    Variable: LENGTH

                            Moments
    N                      9236    Sum Weights             9236
    Mean              12.379948    Sum Observations    114341.2
    Std Deviation    8.10755145    Variance          65.7323905
    Skewness         0.26513208    Kurtosis          -1.2480235
    Uncorrected SS   2022576.74    Corrected SS      607038.626
    Coeff Variation  65.4893819    Std Error Mean    0.08436217

                  Basic Statistical Measures
        Location                    Variability
    Mean     12.37995     Std Deviation            8.10755
    Median   12.10000     Variance                65.73239
    Mode      3.00000     Range                   39.90000
                          Interquartile Range     15.70000

               Tests for Location: Mu0=0
    Test           -Statistic-      -----p Value------
    Student's t    t   146.7476     Pr > |t|     <.0001
    Sign           M       4618     Pr >= |M|    <.0001
    Signed Rank    S   21328233     Pr >= |S|    <.0001

    Quantiles (Definition 5)
    Quantile      Estimate
    100% Max          40.3
    99%               29.7
    95%               24.4
    90%               22.5
    75% Q3            20.0
    50% Median        12.1
    25% Q1             4.3
    10%                3.0
    5%                 2.5
    1%                 1.6
    0% Min             0.4

           Extreme Observations
    ----Lowest----      ----Highest---
    Value      Obs      Value      Obs
      0.4     5925       38.0     1331
      0.6     5926       38.0     2299
      0.6     4481       38.4     1421
      0.6     3733       38.7     2860
      0.6     3732       40.3     1422

Prepare a complete statistical summary for fish length of the above species. Make sure that your summary conforms to the standard outlined in the worked examples.

Paste your summary here.

1. Present the smallest and largest observed values.
2. Present the mean, standard error and sample size.
3. As the data are not normal, give also the median and mode(s).
4. The interquartile range is a useful measure of spread for non-normal data.
5. Define an extreme event, or an exceptionally large fish, in terms of percentiles (95th or 99th).
6. Do not forget to give the units of measurement.

Part 5: Transformations

Perform an appropriate analysis to yield a histogram showing the size distribution for Spotted Hake. Your analysis should yield a high quality histogram. Be sure to add a title to your graph.

Paste your program code here.

    GOPTIONS RESET=ALL;
    PROC GCHART DATA=TRAWL;
      TITLE "SPOTTED HAKE";
      VBAR LENGTH / TYPE=PCT SPACE=0;
      WHERE SPECIES="SPOTTED_HAKE";
    RUN;

Paste graphic output from your program here.

Describe in words what you see.

1. Non-normal, unimodal distribution strongly skewed to the right.

Clearly fish length for Spotted Hake is not normally distributed, but it is unimodal. Repeat the analysis on this variable following a standard square root transformation and a log transformation. The transformations are:

    Y' = LOG10(Y + 1)    and    Y' = SQRT(Y + ½)

Paste your program code here.

    DATA NEW;
      SET TRAWL;
      LG=LOG10(LENGTH+1);                 /* log transformation */
      SQ=SQRT(LENGTH+0.5);                /* square root transformation */
      LABEL LG="LOG10(LENGTH)" SQ="SQUARE ROOT (LENGTH)";
    RUN;
    GOPTIONS RESET=ALL;
    PROC GCHART DATA=NEW;
      TITLE "SPOTTED HAKE";
      VBAR LENGTH SQ LG / TYPE=PCT SPACE=0;   /* one histogram per variable */
      WHERE SPECIES="SPOTTED_HAKE";
    RUN;

Paste the graphic output from your program here.

[Figures: histograms of fish length after the SQUARE ROOT and LOG BASE 10 transformations]

What was the effect of the transformation in each case?

1. Square root transformation reduced the skewness, but was not strong enough to remove it altogether.
2. Log transformation converted the distribution to a bell-shaped curve. This suggests that the size distribution of Spotted Hake may be normalized by a log transformation.
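The visual comparison of the raw, square root and log histograms can be backed with numbers. The following is a minimal sketch, assuming the NEW dataset and the SQ and LG variables created in the Part 5 DATA step. It asks PROC MEANS for the skewness and kurtosis of the three variables so they can be compared side by side; values near zero for both are what a normal distribution would give. The WHERE value is assumed to match the case of the species names as stored in the dataset, and the title text is only a suggestion.

```sas
/* Compare the shape of raw, square root and log10 lengths for Spotted Hake. */
/* Assumes dataset NEW with variables LENGTH, SQ and LG already exists.      */
PROC MEANS DATA=NEW N MEAN SKEWNESS KURTOSIS;
  VAR LENGTH SQ LG;
  /* Character comparisons are case-sensitive: the value must match the stored name. */
  WHERE SPECIES="SPOTTED_HAKE";
  TITLE "SPOTTED HAKE: SKEWNESS BEFORE AND AFTER TRANSFORMATION";
RUN;
```

The transformation whose skewness is closest to zero is the natural candidate for the fuller PROC UNIVARIATE summary.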
Calculate a full set of summary statistics for length of the above species after applying the transformation that was most effective in normalizing the data.

Paste your tabular output here.

    SPOTTED HAKE                      21:30 Tuesday, July 24, 2001

    The UNIVARIATE Procedure
    Variable: LG (LOG10(LENGTH))

                            Moments
    N                       890    Sum Weights              890
    Mean              1.0953837    Sum Observations  974.891489
    Std Deviation    0.10780875    Variance          0.01162273
    Skewness         0.21127011    Kurtosis          0.03109041
    Uncorrected SS   1078.21285    Corrected SS      10.3326041
    Coeff Variation  9.84209931    Std Error Mean    0.00361376

                  Basic Statistical Measures
        Location                    Variability
    Mean     1.095384     Std Deviation            0.10781
    Median   1.093422     Variance                 0.01162
    Mode     1.093422     Range                    0.65642
                          Interquartile Range      0.14732

               Tests for Location: Mu0=0
    Test           -Statistic-      -----p Value------
    Student's t    t   303.1149     Pr > |t|     <.0001
    Sign           M        445     Pr >= |M|    <.0001
    Signed Rank    S   198247.5     Pr >= |S|    <.0001

                     Tests for Normality
    Test                  --Statistic---     -----p Value------
    Shapiro-Wilk          W       0.996005   Pr < W       0.0219
    Kolmogorov-Smirnov    D       0.031558   Pr > D       0.0304
    Cramer-von Mises      W-Sq    0.138289   Pr > W-Sq    0.0359
    Anderson-Darling      A-Sq    0.94168    Pr > A-Sq    0.0186

    Quantiles (Definition 5)
    Quantile      Estimate
    100% Max      1.434569
    99%           1.369216
    95%           1.278754
    90%           1.240549
    75% Q3        1.164353
    50% Median    1.093422
    25% Q1        1.017033
    10%           0.963788
    5%            0.929419
    1%            0.857332
    0% Min        0.778151

              Extreme Observations
    ------Lowest------      -----Highest-----
       Value       Obs        Value       Obs
    0.778151       890      1.38382       886
    0.778151       210      1.41330       146
    0.792392       286      1.42651       108
    0.832509       889      1.43297       132
    0.832509       367      1.43457       133

                       Histogram                          #    Boxplot
     1.425+*                                              4       0
          .**                                             8       |
          .*****                                         20       |
          .***********                                   44       |
          .******************                            71       |
          .****************************                 109    +-----+
          .*******************************************  171    |     |
          .***********************************          139    *--+--*
          .*****************************************    162    +-----+
          .***********************                       90       |
          .************                                  47       |
          .*****                                         17       |
          .**                                             5       |
     0.775+*                                              3       0
           ----+----+----+----+----+----+----+----+--
           * may represent up to 4 counts

    [Normal Probability Plot of LG: vertical axis 0.775 to 1.425, horizontal axis normal quantiles -2 to +2]

Succinctly summarise what you conclude about the Normality of fish length for the above species following transformation. Include reference to supporting evidence.

Paste your summary here.

1. The log transformation brought the size distribution of Spotted Hake close to normal. The histogram is bell shaped, the probability plot is close to a straight line, skewness (0.21) and kurtosis (0.03) are near zero, and the Shapiro-Wilk statistic is close to 1 (W = 0.996). With n = 890, the formal tests still detect a small departure from normality (Shapiro-Wilk p = 0.0219).

Part 6: More complex graphics

Use the GROUP option on the VBAR statement to compare the size distributions of the two most common species in the dataset.

Paste your program code here.

    GOPTIONS RESET=ALL;
    PROC GCHART DATA=TRAWL;
      TITLE "ATLANTIC CROAKER VERSUS HOGCHOKER";
      VBAR LENGTH / TYPE=PCT SPACE=0 GROUP=SPECIES;   /* one panel per species */
      WHERE SPECIES="HOGCHOKER" OR SPECIES="ATLANTIC_CROAKER";
    RUN;

Paste graphic output from your program here.

Describe in words what you see.

1. Clearly the size distributions of these species are very different. You should have commented on the bimodality versus the unimodality, perhaps on the differing maximum sizes of the two species, and on the possible reasons for the differences.

Use the GROUP option on the VBAR statement to compare the size distributions of the most common species in 1999 and 2000.

Paste your program code here.

    GOPTIONS RESET=ALL;
    PROC GCHART DATA=TRAWL;
      TITLE "ATLANTIC CROAKER: 1999 VERSUS 2000";
      VBAR LENGTH / TYPE=PCT SPACE=0 GROUP=YEAR;      /* one panel per year */
      WHERE SPECIES="ATLANTIC_CROAKER";
    RUN;

Paste graphic output from your program here.

Describe in words what you see.

1. The size distributions in the two consecutive years are very similar, though the total catch in 2000 was somewhat less than in 1999.

Source

The length frequency data were kindly provided by the Virginia Institute of Marine Science, Juvenile Fish and Blue Crab Trawl Survey. The web-based data retrieval system is available online at http://www.fisheries.vims.edu/vimstrawldata/