Sample Selection
Eurostat
Presented by
• Desislava Nedyalkova
• Swiss Federal Statistical Office

The Sample Selection topic
• The topic covers two main subjects, which correspond to two complementary phases in the process of designing and conducting business surveys:
  – Sample design and selection
  – Sample coordination

Overview of the topic (I)
The sample selection part consists of:
• Main theme module, which covers the most used sampling designs in business surveys
• Two method modules:
  – Balanced sampling for multi-way stratification
  – Subsampling for preliminary estimation

Overview of the topic (II)
The sample coordination part consists of:
• Main theme module on sample coordination
• Three method modules:
  – Sample co-ordination using simple random sampling (SRS) with permanent random numbers (PRNs)
  – Sample co-ordination using Poisson sampling with permanent random numbers (PRNs)
  – Assigning random numbers when co-ordination of surveys based on different unit types is considered

Sample selection
• Designing a sample in business statistics is a challenging task (Sigman and Monsour, 1995):
  – The population is often skewed.
  – Dynamic membership:
    • Creation of new businesses
    • Change in structure of businesses
    • Closed-down businesses
    • Changes in type or level of activity
  – Inter-business relationships.

Stratified sampling I
• Advantages:
  – The population can be divided into distinct, independent subpopulations called strata.
  – Leads to more efficient statistical estimates.
  – Different sampling techniques, e.g. simple random sampling, can be used for different subpopulations.
• Disadvantages:
  – Requires the selection of relevant stratification variables.
  – It is not useful when there are no homogeneous subgroups.

Stratified sampling II
• Questions:
  – How should strata be constructed?
  – How should sample size be allocated to strata?
• Optimal conditions for stratification:
  – Elements within a stratum are more similar to each other than to elements in other strata (homogeneous strata).
  – Large variability between strata, good size variable.
  – The stratification variables are strongly correlated with the variables of interest.

Probability proportional to size (pps) sampling
• Alternative to stratification
• Main characteristics:
  – The probability of inclusion of a unit in the sample is proportional to some numeric size variable (e.g. turnover, number of employees).
  – PPS designs of fixed (sequential Poisson sampling) or random (Poisson sampling) sample size.
  – Easy implementation (e.g., Hartley and Rao, 1962).
  – Preferred usage: small samples.
  – In business statistics: Price Index Surveys.

Other sampling schemes
• Cut-off sampling (Knaub, 2008)
  – Non-probability sampling design where some elements of the population have no chance of selection.
  – Use: in very skewed populations (very many small businesses and a few large ones).
• Systematic sampling (Cochran, 1977)
• Balanced sampling (Deville & Tillé, 2004)
  – The Horvitz-Thompson estimate of the total of the auxiliary variable is equal to the population total of the auxiliary variable (design-based approach).

One-way stratification
• Stratified sampling (one-way stratified design):
  – Can be used when the objective of the survey is to produce estimates for subpopulations.
  – Planned sample size for each domain.
  – May have some drawbacks, especially in structural business surveys (large-scale surveys):
    • Overall sample size could be too large for the survey's economic constraints.
    • Sample allocation may be far from the theoretically desired one.
    • Strata with only a few units can lead to higher response burden.
• An alternative: multi-way stratification (see e.g., Falorsi and Righi, 2008).
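The pps idea described above (inclusion probabilities proportional to a size variable, Poisson sampling with random sample size, Horvitz-Thompson estimation of a total) can be sketched as follows. This is a minimal illustration, not a method taken from the modules: the function names, the toy size values and the rule that units whose probability would exceed 1 are taken with certainty are all assumptions made for the example, and sizes are assumed to be strictly positive.

```python
import random


def pps_inclusion_probabilities(sizes, n):
    """Inclusion probabilities proportional to size, with expected sample size n.

    Assumption for this sketch: units whose raw probability would reach 1 are
    taken with certainty, and the remaining sample size is shared among the
    other units in proportion to their size.
    """
    pi = [0.0] * len(sizes)
    remaining = set(range(len(sizes)))
    n_left = n
    while True:
        total = sum(sizes[i] for i in remaining)
        certain = [i for i in remaining if n_left * sizes[i] / total >= 1.0]
        if not certain:
            break
        for i in certain:
            pi[i] = 1.0
            remaining.discard(i)
        n_left -= len(certain)
    for i in remaining:
        pi[i] = n_left * sizes[i] / total
    return pi


def poisson_sample(pi, rng):
    """Poisson sampling: each unit is selected independently with probability pi."""
    return [i for i, p in enumerate(pi) if rng.random() < p]


def horvitz_thompson_total(values, pi, sample):
    """Horvitz-Thompson estimate of a total: sampled values weighted by 1/pi."""
    return sum(values[i] / pi[i] for i in sample)


# Hypothetical, very skewed population: one large business and three small ones.
sizes = [1000, 50, 30, 20]
pi = pps_inclusion_probabilities(sizes, n=2)
print(pi)  # [1.0, 0.5, 0.3, 0.2]

sample = poisson_sample(pi, random.Random(1))
print(sample, horvitz_thompson_total(sizes, pi, sample))
```

In this toy population the large unit is always selected, while the number of small units in the sample varies from draw to draw, which is the random sample size mentioned for Poisson sampling.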
Multi-way stratification
• Multi-way stratified designs:
  – Controlled selection methods, including methods based on the controlled rounding problem via linear programming
  – Methods based on sample coordination
• Theoretical and operative problems for large-scale surveys can arise with some of these methods.
• Balanced sampling by the cube method can overcome these drawbacks.

Subsampling for preliminary estimates (I)
• In short-term statistics, preliminary estimates are demanded from the NSIs (EU Regulation).
• A common approach for dealing with them:
  – Efficient estimators based on auxiliary information.
  – No explicit definition of a sampling design for preliminary estimates.
  – Usually drawn by a non-probabilistic sample design.
• An alternative overall strategy involving sample design and estimator definition can be found in the module on preliminary estimates.

Subsampling for preliminary estimates (II)
• Given a sample survey, a preliminary estimate is defined on the basis of a sample of quick respondents.
• Main strategy:
  – A planned subsample for preliminary estimates: a preliminary theoretical sample (PTS) is drawn.
  – Aim: a Planned Preliminary Observed Sample (PPOS) as close as possible to the PTS.
  – Intensive follow-up of the PTS.
• Design-based or model-based approaches for defining the PTS.

Sample co-ordination (I)
• Sample overlap between surveys: number of common units at two different sampling occasions.
• Independent selection: sample overlaps are not controlled.
• Negative coordination: aims at spreading the response burden; sample overlap is minimized.
• Positive coordination: for repeated surveys; sample overlap is maximized.

Sample co-ordination (II)
• Three main dimensions:
  – Sample coordination between surveys.
  – Sample coordination over time for the same survey.
  – Sample coordination of surveys based on different unit types.
• Two main types of methods:
  – Methods based on PRNs (used by most NSIs).
  – Methods based on linear programming (non-PRN methods): optimal solution, but computationally intensive.

Co-ordination between surveys
• Positive coordination:
  – Can facilitate comparisons between variables of interest at the micro level.
  – Can facilitate the production of comparable and coherent statistics required by the National Accounts for compiling the GDP using results from different economic surveys.
• Negative coordination:
  – Depends on the size of the sampling fractions in the different surveys.
  – Very effective mainly for small businesses.

Co-ordination over time
• Panel: a sample measured repeatedly in time (a period could be a week, a month, a quarter or a year).
• Positive coordination over time:
  – Used to obtain high precision in estimates of change.
  – The size of the overlap is random.
  – It depends mainly on the sampling design and changes in the business population.
• Sample rotation: a tool for spreading the response burden.

Co-ordination of surveys based on different unit types (I)
• This kind of coordination is used in Australia, France and Sweden (PRN methods).
• The business register (BR) generally consists of different unit types.
• Each business survey uses a unit type in accordance with the statistics to be produced.
• PRNs should be assigned to each unit type.

Co-ordination of surveys based on different unit types (II)
Methods for assigning the PRNs:
  – PRNs are assigned to each unit type separately.
    • Advantage: a simple method; samples are independent of each other.
    • Disadvantage: does not admit co-ordination between surveys using different unit types.
  – PRNs are assigned so that co-ordination of unit types through their PRNs is possible.
    • Works well for single-location and single-activity businesses, where each unit in a business receives the same PRN.
    • For multiple-location and/or multiple-activity businesses: less efficient.
    • Top-down or bottom-up approach to assign the PRNs (see Lindblom, 2003).
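The second assignment option above, where every unit of a business shares one PRN, can be illustrated with a small sketch. It is a deliberately simplified assumption and not the procedure of Lindblom (2003): the enterprise draws the PRN, its local units inherit it, and a unit is selected when its PRN falls below a sampling fraction. The frame, the identifiers and the function names are hypothetical.

```python
import random


def assign_shared_prns(enterprises, rng):
    """Give each enterprise one uniform PRN and copy it to its local units.

    `enterprises` maps an enterprise id to the list of its local unit ids.
    Returns two dicts: PRN per enterprise and PRN per local unit.
    """
    enterprise_prn = {}
    local_unit_prn = {}
    for enterprise_id, local_units in enterprises.items():
        u = rng.random()  # uniform on [0, 1)
        enterprise_prn[enterprise_id] = u
        for local_unit_id in local_units:
            local_unit_prn[local_unit_id] = u  # inherited, not redrawn
    return enterprise_prn, local_unit_prn


def select_below(prns, fraction):
    """Illustrative selection rule: keep the units whose PRN is below `fraction`."""
    return {unit for unit, u in prns.items() if u < fraction}


# Hypothetical frame: E2 is a multiple-location business.
frame = {"E1": ["E1-L1"], "E2": ["E2-L1", "E2-L2"], "E3": ["E3-L1"]}
enterprise_prn, local_unit_prn = assign_shared_prns(frame, random.Random(2024))

sampled_enterprises = select_below(enterprise_prn, 0.5)
sampled_local_units = select_below(local_unit_prn, 0.5)
print(sampled_enterprises, sampled_local_units)
```

With the same fraction in both surveys, a local unit is selected exactly when its enterprise is selected, so the two samples are positively coordinated. In this sketch all local units of a multiple-location business enter or leave the sample together.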
Method: Sample co-ordination using SRS with PRNs (I)
• The Swedish system for co-ordination of business samples (SAMU) is based on sequential simple random sampling without replacement (SRSWOR).
• Sequential SRSWOR:
  – Consider a population U of size N (may be a stratum).
  – Each unit is assigned a PRN uniformly distributed over the interval [0,1].
  – Units are sorted in ascending order of their PRNs.
  – The first n units in the sorted list are selected in the sample.

Method: Sample co-ordination using SRS with PRNs (II)
• Due to the symmetry of the uniform distribution:
  – the selection of the last n units in the sorted list also gives a sequential SRSWOR,
  – the selection of the first n units to the left or to the right of a given point a in [0,1] also yields an SRSWOR (wrapping around if there are not enough units).
• Dynamic population:
  – New businesses in the frame (births) receive a new PRN.
  – Closed-down businesses (deaths) are withdrawn from the frame.

Method: Sample co-ordination using SRS with PRNs (III)
• Positive co-ordination:
  – Over time: on each occasion a new sequential SRSWOR is drawn from the updated frame (same starting point).
  – Of two surveys: the same starting point and direction are used for both surveys.
• Negative co-ordination:
  – For two surveys: the starting points and directions must be chosen appropriately, e.g. different starting points and the same direction.

Method: Sample co-ordination using SRS with PRNs (IV)
• SAMU allows for positive or negative coordination when different stratifications are used.
• SAMU has implemented a system of rotation of samples:
  – Each unit in the frame is randomly assigned to one of five rotation groups.
  – Random numbers are shifted in only one rotation group each year (RRC method of Ohlsson, 1992).

Method: Sample co-ordination using SRS with PRNs (V)
• A somewhat different method is used in France (Cotton & Hesse, 1992):
  – Each unit in the frame receives a uniform random number in [0,1].
  – Units are sorted in ascending order of their RNs.
  – A sequential SRSWOR of size n is drawn from the ordered list.
  – Negative co-ordination is obtained by permuting the random numbers so that selected units receive the largest RNs and non-selected units receive the smallest. The ranking of the RNs should be respected.

Method: Sample co-ordination using SRS with PRNs (VI)
• The Cotton & Hesse method:
  – Can be used only for negative co-ordination.
  – Is based on permutation of the RNs.
  – Allows the use of different stratifications when coordinating stratified samples.
  – A minimum of the expected overlap between two successive stratified samples is guaranteed.
  – Can be used to co-ordinate sampling units of different types, e.g. enterprises and establishments.

Method: Sample co-ordination using Poisson sampling with PRNs (I)
• Implemented at SFSO (Qualité, 2009).
• Extension of the method of Brewer et al. (1972).
• Algorithm:
  – For each survey, one defines for each unit a zone of selection (which can be a union of disjoint intervals).
  – The total length of the zone of selection corresponds to the inclusion probability for that unit.
  – A unit is selected if its PRN falls within its zone of selection.

Method: Sample co-ordination using Poisson sampling with PRNs (II)
• Advantages:
  – Theoretically simple and easy to implement.
  – Dynamic populations are easily handled.
• Disadvantages:
  – The random sample size.
  – Previously, stratified sampling was used at SFSO.
  – Optimal allocation procedures do not need to be modified, except for small sampling strata, because of the risk of selecting an empty sample.

Example of co-ordination (I)
• We consider the selection of a unit in 6 samples (PRN equal to 0.42). We have:
  – The inclusion probability (pi).
  – The desired types of coordination: negative (N) or positive (P).
  – Two panels: samples 1, 3 and 6 are three waves of panel 1, and samples 2 and 5 are two waves of panel 2.
  – Sample 4 is for a survey conducted only once.
  – Positive coordination in a panel has a higher priority than negative coordination with the other samples.

Example of co-ordination (II)
Inclusion probabilities and types of coordination

Sample   pi     Panel   Wave   Coordination with sample
                               1    2    3    4    5
1        0.30   1       1
2        0.20   2       1      N
3        0.40   1       2      P    N
4        0.20   -       -      N    N    N
5        0.30   2       2      N    P    N    N
6        0.45   1       3      P    N    P    N    N

[Figure: Selection zones. Vertical axis: sample (1 to 6); horizontal axis: zone of selection, from 0.0 to 1.0.]

Discussion
• Sample design and selection:
  – The sample design determines a survey's characteristics such as cost, variance and respondent burden.
• Sample co-ordination:
  – An important tool for spreading the response burden.
  – Higher precision in estimates over time.
  – A co-ordination system provides a common sampling frame for all surveys.
• Sample rotation:
  – Reducing response burden in periodic surveys.

References (I)
• Brewer, K., Early, L., and Joyce, S. (1972). Selecting several samples from a single population, Australian Journal of Statistics, 3:231--239.
• Cochran, W.G. (1977). Sampling Techniques, Wiley, New York.
• Cotton, F. and Hesse, C. (1992b). Tirages coordonnés d'échantillons, Technical report, INSEE, Paris.
• Deville, J.-C. and Tillé, Y. (2004). Efficient balanced sampling: the cube method, Biometrika, 91:893--912.
• Falorsi, P. D. and Righi, P. (2008). A Balanced Sampling Approach for Multi-way Stratification Designs for Small Area Estimation, Survey Methodology, 34:223--234.
• Hartley, H. and Rao, J. (1962). Sampling with unequal probabilities and without replacement, Annals of Mathematical Statistics, 33:350--374.
• Hesse, C. (1999). Sampling co-ordination: A review by country. Technical Report E9908, Direction des Statistique d'Entreprises, INSEE, Paris.

References (II)
• Knaub, J.R., Jr. (2008). Cutoff Sampling, In Encyclopedia of Survey Research Methods (ed. P.J. Lavrakas), Sage, London.
• Lindblom, A. (2003).
SAMU - The system for coordination of frame populations and samples from the Business Register at Statistics Sweden, Background Facts on Economic Statistics 2003:3, Statistics Sweden.
• Ohlsson, E. (1992). The system for co-ordination of samples from the business register at Statistics Sweden. R&D report 1992:18, Statistics Sweden.
• Qualité, L. (2009). Unequal probability sampling and repeated surveys. Ph.D. thesis, University of Neuchâtel, Switzerland (http://doc.rero.ch/record/12284).
• Sigman, R. S. and Monsour, N. J. (1995). Selecting Samples from List Frames of Businesses, In Cox, B. G. et al., editors, Business Survey Methods, chapter 8, pages 133--152, Wiley, Inc., New York, USA.