Stephen Mansour, PhD
University of Scranton and The Carlisle Group
Dyalog ’14 Conference, Eastbourne, UK

There are many statistical software packages out there: Minitab, R, Excel, SPSS.

Excel has about 87 statistical functions. Six of them involve the t distribution alone:
    T.DIST  T.DIST.RT  T.DIST.2T  T.INV  T.INV.2T  T.TEST

R has four related functions for each of 20 distributions, resulting in a total of 80 distribution functions alone.

Defined Operators!
How can we exploit operators to reduce the explosive number of statistical functions? Let’s look at an example . . .

Typical attendance is about 100 delegates with a standard deviation of 20. Assume next year’s conference centre can support up to 130 delegates. What are the chances that next year’s attendance will exceed capacity?

In Excel:
    =1-NORM.DIST(130,100,20,TRUE)

Now let’s use R-Connect in APL:
    +#.∆r.x 'pnorm(⍵,⍵,⍵,⍵)' 130 100 20 0

Wouldn’t it be nice to enter:
    100 20 normal probability > 130
    100 20 (normal probability >) 130
    normal probability < 1.64
    100 20 normal probability between 110 130
    5 0.5 binomial probability = 2
    7 tDist criticalValue < 0.05
    5 chiSquare randomVariable 13
    mean confidenceInterval X
    (SEX='F') proportion hypothesis ≥ 0.5
    GROUPA mean hypothesis = GROUPB
    variance theoretical binomial 5 0.2

Summary Functions
◦ Descriptive Statistics
Probability Distributions
◦ Theoretical Models
Relations

Summary functions are of the form:
    $y = f(x_1, x_2, \ldots, x_n)$
They produce a single value from a vector. Structurally they are equivalent to g/, where g is a scalar function and the right argument is a simple numeric vector.
A statistic is a summary function of a sample; a parameter is a summary function of a population.
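The "g/ applied to a simple numeric vector" structure described above can be sketched in a few lines of APL. This is only an illustration: the definitions below are assumptions written for this sketch, not the code from the talk's workspace, and the sample vector D is just a convenient ten-value example.

```apl
⍝ Illustrative definitions only; not the talk's workspace code.
sum←+/                               ⍝ g/ with g←+
max←⌈/                               ⍝ g/ with g←⌈
mean←{(+/⍵)÷≢⍵}                      ⍝ sum divided by count
variance←{(+/(⍵-mean ⍵)*2)÷¯1+≢⍵}    ⍝ sample variance (n-1 divisor)

D←2 0 3 4 3 1 0 2 0 4                ⍝ a ten-value sample
sum D        ⍝ 19
max D        ⍝ 4
mean D       ⍝ 1.9
variance D   ⍝ 2.5444 to four decimal places
```

Each function takes a vector and returns a single number, which is what lets an operator treat them all as one class of operand.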
Examples
◦ Measures of central tendency: mean, median, mode
◦ Measures of spread: variance, standard deviation, range, IQR
◦ Measures of position: min, max, quartiles, percentiles
◦ Measures of shape: skewness, kurtosis

Probability distributions are functions defined in a natural way when they are called without an operator:
◦ Discrete: probability mass function
◦ Continuous: density function
The left argument is the parameter list.
The right argument can be any value taken on by the distribution.
Probability distributions are scalar with respect to the right argument.

Discrete Distributions              Parameter List
uniform                             a - lower bound (default 1), b - upper bound
binomial                            n - sample size, p - probability of success
poisson                             λ - average number of arrivals per time period
negativeBinomial                    n - number of successes, p - probability of success
hyperGeometric                      m - number of successes, n - sample size, N - population size
multinomial                         V - list of values (default 1 through n), P - list of probabilities totaling 1

Continuous Distributions            Parameter List
normal                              μ - theoretical mean (default 0), σ - standard deviation (default 1)
exponential                         λ - mean time to fail
rectangular (continuous uniform)    a - lower bound (default 0), b - upper bound (default 1)
triangular                          a - lower bound, m - most common value, b - upper bound
chiSquare                           df - degrees of freedom
tDist (Student)                     df - degrees of freedom
fDist                               df1 - degrees of freedom for numerator, df2 - degrees of freedom for denominator

Relational functions are dyadic functions whose range is {0,1}: 1 if the relation is satisfied, 0 otherwise.
Examples: < ≤ = ≥ > ≠ ∊
    between←{¯1=×/×⍺∘.-⍵}

By limiting the domain of an operator to one of the previously defined functional classifications, we can create an operator to perform statistical analysis. For a dyadic operator, each operand can be limited to a particular (but not necessarily the same) functional classification.
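To make the idea of a limited-domain dyadic operator concrete, here is a minimal sketch of how a probability operator might be written as a dfn for a discrete distribution. It is not the talk's implementation. It assumes ⎕IO←1, and it assumes the left operand is a binomial-style mass function whose support is 0 through n, with n the first item of the left argument. The bodies of binomial and probability are hypothetical; only between is taken from the talk.

```apl
⍝ Sketch only: assumes ⎕IO←1 and a distribution with support 0..n.

⍝ Distribution: binomial mass function; ⍺ is (n p), scalar in ⍵
binomial←{n p←⍺ ⋄ (⍵!n)×(p*⍵)×(1-p)*n-⍵}

⍝ Relation from the talk: 1 where ⍺ lies strictly between the two bounds in ⍵
between←{¯1=×/×⍺∘.-⍵}

⍝ Operator: ⍺⍺ is a distribution, ⍵⍵ is a relation.
⍝ Sum the mass function over the support values x for which (x ⍵⍵ ⍵) holds.
probability←{x←0,⍳⊃⍺ ⋄ +/(x ⍵⍵ ⍵)×⍺ ⍺⍺ x}

5 0.5 binomial probability = 2           ⍝ P(X=2)   = 10÷32 = 0.3125
5 0.5 binomial probability < 2           ⍝ P(X<2)   =  6÷32 = 0.1875
5 0.5 binomial probability between 1 3   ⍝ P(1<X<3) = 10÷32 = 0.3125
```

Because the relation is an operand rather than part of a function name, one operator covers every tail and interval that would otherwise need its own function. A continuous distribution such as normal would need a cumulative distribution function instead of a sum over the support, which this sketch does not attempt.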
Operator              Left Operand    Right Operand
probability           Distribution    Relation
criticalValue         Distribution    Relation
confidenceInterval    Summary         N/A
hypothesis            Summary         Relation
goodnessOfFit         Distribution    N/A
randomVariable        Distribution    N/A
theoretical           Summary         Distribution
running               Summary         N/A

◦ Most functions and operators can easily be written in APL.
◦ Internals are not important to the user.
◦ The R interface can be used if necessary for statistical distributions.
◦ Correct nomenclature and ease of use are critical.

A sample can be represented by raw data, a frequency distribution, or sample statistics. The following items are interchangeable as arguments to the limited-domain operators above:
◦ Raw data: Vector
◦ Frequency Distribution: Matrix
◦ Summary Statistics: PropertySpace

    D
2 0 3 4 3 1 0 2 0 4

Matrix: Frequency Distribution
    ⎕←FT←frequency D
0 3
1 1
2 2
3 2
4 2

    mean D
1.9
    variance D
2.5444

Namespace: Sample Statistics
    PS←⎕NS ''
    PS.count←10
    PS.mean←1.9
    PS.variance←2.544

)LOAD TamingStatistics
◦ All-APL version
)LOAD TamingStatisticsR
◦ Third party: must install R (free)

◦ There are many statistical packages out there; some, like R, can be used with APL.
◦ Operator syntax is unique to APL.
◦ R can be called directly from APL using RCONNECT, but APL operator syntax is easier to understand.