Julius P. Fronda
BSMT-2A

1. Types of Data

A. Data with reference to time factor
a) Time-independent data – Data that can be measured repeatedly, e.g., data in geosciences and astronomy such as geological structures, rocks, and fixed stars.
b) Time-dependent data – Data that can be measured only once, e.g., certain geophysical or cosmological phenomena such as volcanic eruptions and solar flares. Likewise, data pertaining to rare fossils are time-dependent data.

B. Data with reference to location factors
a) Location-independent data – Data that do not depend on the location of the objects measured, e.g., data in pure physics and chemistry.
b) Location-dependent data – Data that depend on the location of the objects measured. Data in earth sciences and astronomy normally belong to this category. Data on rocks are also location-dependent.

C. Data with reference to mode of generation
a) Primary data – Data are primary when obtained by an experiment or observation designed for the measurement.
b) Derived (reformatted) data – Data derived by combining several primary data with the aid of a theoretical model.
c) Theoretical (predicted) data – Data derived by theoretical calculations. Basic data such as fundamental constants are used in these calculations.

D. Data with reference to nature of quantitative values
a) Determinable data – Data on a quantity that can be assumed to take a definite value under a given condition. Time-dependent data are usually determinable data if the given condition is understood to include the specification of time.
b) Stochastic data – Data on a quantity that takes fluctuating values from one sample to another, or from one measurement to another, under a given condition. In geosciences, most data are stochastic.

E. Data with reference to terms of expression
a) Quantitative data – Measures of quantities expressed in terms of well-defined units, converting the magnitude of a quantity into a numerical value. Most data in physical sciences are quantitative data.
b) Semi-quantitative data – Data consisting of affirmative or negative answers to posed questions concerning different characteristics of the objects involved.
c) Qualitative data – Data expressed in terms of definitive statements concerning scientific objects. Qualitative data in this sense are almost equivalent to established knowledge.

F. Data with reference to mode of presentation
a) Numerical data – Data presented as numerical values. Most quantitative data fall in this category.
b) Graphic data – Data presented in graphic form or as models. In some cases, graphs are constructed to help users grasp a mass of data by visual perception. Charts and maps also belong to this category.
c) Symbolic data – Data presented in symbolic form, e.g., the symbolic presentation of weather data.

G. Data with reference to scale of measurement

Nominal
Nominal scales are used for labeling variables, without any quantitative value. “Nominal” scales could simply be called “labels.” Here are some examples, below. Notice that all of these scales are mutually exclusive (no overlap) and none of them has any numerical significance. A good way to remember all of this is that “nominal” sounds a lot like “name,” and nominal scales are kind of like “names” or labels.
Figure: Examples of Nominal Scales
Note: A sub-type of nominal scale with only two categories (e.g., male/female) is called “dichotomous.” If you are a student, you can use that to impress your teacher.
Note #2: Some sources describe further sub-types, “nominal with order” (like “cold, warm, hot, very hot”) and “nominal without order” (like “male/female”). Ordered categories of this kind are more commonly treated as ordinal data.
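As a rough illustration of the idea that nominal values are labels rather than numbers, the following Python sketch summarizes a nominal variable using only counts and the most frequent label. The sample answers and the helper name describe_nominal are assumptions made up for this example and do not come from the sources.

```python
from collections import Counter

# Hypothetical survey answers on a nominal scale (labels only, no order).
housing = ["own", "rent", "rent", "own", "rent"]

def describe_nominal(values):
    """Summarize nominal data using only operations that make sense for labels."""
    counts = Counter(values)
    mode_label, _ = counts.most_common(1)[0]   # most frequent label
    return {
        "categories": sorted(counts),          # sorted only for stable display
        "counts": dict(counts),
        "mode": mode_label,
        "dichotomous": len(counts) == 2,       # exactly two categories observed
    }

summary = describe_nominal(housing)
print(summary)
```

Counting and taking the mode are safe here because neither assumes an order or a distance between labels; averaging the labels would have no meaning.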

Ordinal
With ordinal scales, the order of the values is what is important and significant, but the differences between them are not really known. Take a look at the example below. In each case, we know that a #4 is better than a #3 or a #2, but we do not know, and cannot quantify, how much better it is. For example, is the difference between “OK” and “Unhappy” the same as the difference between “Very Happy” and “Happy”? We can’t say. Ordinal scales are typically measures of non-numeric concepts like satisfaction, happiness, and discomfort. “Ordinal” is easy to remember because it sounds like “order,” and that is the key to remember with ordinal scales: it is the order that matters, but that is all you really get from them.
Note: The best way to determine central tendency on a set of ordinal data is to use the mode or the median; a purist will tell you that the mean cannot be defined for an ordinal set.
Figure: Example of Ordinal Scales

Interval
Interval scales are numeric scales in which we know both the order and the exact differences between the values. The classic example of an interval scale is Celsius temperature, because the difference between each value is the same. For example, the difference between 60 and 50 degrees is a measurable 10 degrees, as is the difference between 80 and 70 degrees. Interval scales are nice because the realm of statistical analysis on these data sets opens up. For example, central tendency can be measured by mode, median, or mean; standard deviation can also be calculated.

Ratio
Ratio scales are the ultimate nirvana when it comes to data measurement scales because they tell us about the order, they tell us the exact value between units, and they also have an absolute zero, which allows a wide range of both descriptive and inferential statistics to be applied. At the risk of repetition, everything above about interval data applies to ratio scales, plus ratio scales have a clear definition of zero.
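The practical effect of an absolute zero can be sketched in a few lines of Python. The readings below are made-up values, and the kelvin conversion is brought in here purely for illustration (kelvin is not discussed in the sources): differences agree on both scales, but only the ratio computed on the scale with a true zero is physically meaningful.

```python
def celsius_to_kelvin(c):
    """Convert Celsius (interval scale) to kelvin (ratio scale with absolute zero)."""
    return c + 273.15

a_c, b_c = 10.0, 20.0                      # two illustrative Celsius readings

# Differences are meaningful on an interval scale and survive the conversion.
diff_c = b_c - a_c
diff_k = celsius_to_kelvin(b_c) - celsius_to_kelvin(a_c)

# Ratios are meaningful only when zero means "none of the quantity".
ratio_c = b_c / a_c                        # 2.0, yet 20 °C is not "twice as hot" as 10 °C
ratio_k = celsius_to_kelvin(b_c) / celsius_to_kelvin(a_c)

print("difference:", diff_c, round(diff_k, 6))
print("ratio:", ratio_c, round(ratio_k, 4))
```

The Celsius ratio changes if the zero point is moved, which is why multiplication and division are reserved for ratio scales.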
Good examples of ratio variables include height, weight, and duration. Ratio scales provide a wealth of possibilities when it comes to statistical analysis. These variables can be meaningfully added, subtracted, multiplied, and divided (ratios). Central tendency can be measured by mode, median, or mean; measures of dispersion, such as standard deviation and coefficient of variation, can also be calculated from ratio scales.

Summary
In summary, nominal variables are used to “name,” or label, a series of values. Ordinal scales provide good information about the order of choices, such as in a customer satisfaction survey. Interval scales give us the order of values plus the ability to quantify the difference between each one. Finally, ratio scales give us the ultimate: order, interval values, plus the ability to calculate ratios, since a “true zero” can be defined.
Figure: Summary of data types and scale measures

H. Data with reference to characteristic
a) Quantitative data – When the characteristic of observation is quantified, we get quantitative data. Quantitative data result from measuring the magnitude of the characteristic used.
b) Qualitative data – When the characteristic of observation is a quality or attribute, we get qualitative data.

2. Nature of Data

Data is the plural of datum, so it is traditionally treated as plural. We can find data in all the situations of the world around us, in structured or unstructured form, in continuous or discrete conditions: in weather records, stock market logs, photo albums, music playlists, or our Twitter accounts. In fact, data can be seen as the essential raw material of any kind of human activity.

Data
Data are a set of values of qualitative or quantitative variables; restated, pieces of data are individual pieces of information. Data are measured, collected, reported, and analyzed, whereupon they can be visualized using graphs or images.
Data as a general concept refers to the fact that some existing information or knowledge is represented or coded in a form suitable for better usage. According to the Oxford English Dictionary, data are known facts or things used as a basis for inference or reckoning. As shown in the following figure, we can see data in two distinct ways, categorical and numerical:

Categorical data are values or observations that can be sorted into groups or categories. There are two types of categorical values, nominal and ordinal. A nominal variable has no intrinsic ordering to its categories. For example, housing is a categorical variable with two categories (own and rent). An ordinal variable has an established ordering. For example, age can be a variable with three ordered categories (young, adult, and elder).

Numerical data are values or observations that can be measured. There are two kinds of numerical values, discrete and continuous. Discrete data are values or observations that can be counted and are distinct and separate, for example the number of lines of code in a program. Continuous data are values or observations that may take on any value within a finite or infinite interval, for example an economic time series such as historic gold prices.

3. Different Sampling Methods

Probability sampling uses randomization to select sample members.
You know the probability of each potential member’s inclusion in the sample, for example 1/100. However, the odds do not have to be equal: some members might have a 1/100 chance of being chosen, others 1/50.

Non-probability sampling uses non-random techniques (i.e., the judgment of the researcher). You can’t calculate the odds of any particular item, person, or thing being included in your sample.

Common Types
The most common techniques you’ll likely meet in elementary statistics or AP statistics include taking a sample with and without replacement. Specific techniques include:
Bernoulli sampling: uses independent Bernoulli trials on population elements. The trials decide whether each element becomes part of the sample. All population elements have an equal chance of being included in each choice of a single sample. The sample sizes in Bernoulli samples follow a binomial distribution.
Poisson sampling (less common): an independent Bernoulli trial decides whether each population element makes it into the sample. Unlike Bernoulli sampling, the inclusion probability may differ from element to element.
Cluster sampling: divide the population into groups (clusters), then choose a random sample of clusters. It is used when researchers don’t know the individuals in a population but do know the population subsets or groups.
Systematic sampling: select sample elements from an ordered sampling frame. A sampling frame is just a list of participants that you want to draw a sample from. For example, in the equal-probability method, choose a starting element from the list and then choose every kth element, where k = N/n. Small “n” denotes the sample size and capital “N” the size of the population.
Simple random sampling (SRS): select items completely at random, so that each element has the same probability of being chosen as any other element. Each subset of k elements has the same probability of being chosen as any other subset of k elements.
Stratified sampling: sample each subpopulation independently.
First, divide the population into homogeneous (very similar) subgroups before drawing the sample. Each population member belongs to only one group. Then apply simple random sampling or a systematic method within each group to choose the sample.
Stratified randomization: a sub-type of stratified sampling used in clinical trials. First, divide patients into strata, then randomize with permuted block randomization.

Less Common Types
Acceptance-rejection sampling: a way to sample from a distribution that is difficult to sample directly, using a similar, more convenient distribution.
Accidental sampling (also known as grab, convenience, or opportunity sampling): draw a sample from a convenient, readily available population. It doesn’t give a representative sample of the population but can be useful for pilot testing.
Adaptive sampling (also called response-adaptive designs): adapt your selection criteria as the experiment progresses, based on preliminary results as they come in.
Bootstrap sample: select a smaller sample from a larger sample with bootstrapping. Bootstrapping is a type of resampling where you draw large numbers of smaller samples of the same size, with replacement, from a single original sample.
Demon algorithm (physics): samples members of a microcanonical ensemble (used to represent the possible states of a mechanical system that has an exactly specified total energy) with a given energy. The “demon” represents a degree of freedom in the system that stores and provides energy.
Critical case samples: you carefully choose cases to maximize the information you can get from a handful of samples.
Discrepant case sampling: you choose cases that appear to contradict your findings.
Distance sampling: a widely used technique that estimates the density or abundance of animal populations.
The experience sampling method samples experiences (rather than individuals or members).
In this method, study participants stop at certain times and make notes of their experiences as they experience them.
Haphazard sampling: a researcher chooses items haphazardly, trying to simulate randomness. However, the result may not be random at all and may be tainted by selection bias.

Additional Uncommon Types
Inverse sampling: based on negative binomial sampling. Take samples until a specified number of successes has occurred.
Importance sampling: a method to model rare events.
Kish grid: a way to select members of a household for interviews, using a table of random numbers for the selections.
Latin hypercube: used to construct computer experiments. It generates samples of plausible collections of values for parameters in a multidimensional distribution.
Line-intercept sampling: a method where you include an element from a particular region in the sample if a chosen line segment intersects the element.
Maximum variation sampling: use when you want to include extremes (like rich/poor or young/old). A related technique is extreme case sampling.
Multistage sampling: one of a variety of cluster sampling techniques where you choose random elements from a cluster (instead of every member of the cluster).
Quota sampling: a way to select survey participants. It is similar to stratified sampling, but researchers choose members of a group based on judgment. For example, people closest to the researcher might be chosen for ease of access.
Respondent-driven sampling: a chain-referral sampling method where participants recommend other people they know.
Sequential sampling: the sample doesn’t have a set size; take items one (or a few) at a time until you have enough for your research. It is commonly used in ecology.
Snowball sampling: existing study participants recruit future study participants from people they know.
Square root biased sampling: a way to choose people for additional screenings at airports. It is a combination of SRS and profiling.

References:
Afzalm, M., & Rizwi, F. (2013). Biostatistics and Data Types. Journal of Islamabad Medical & Dental College, 2(2), 103.
Guy, M. R. (2019, September 3). Types of Data & Measurement Scales: Nominal, Ordinal, Interval and Ratio. Retrieved from https://www.mymarketresearchmethods.com/types-of-data-nominal-ordinal-interval-ratio/
Cuesta, H. (2013, October). Practical Data Analysis. Retrieved from https://subscription.packtpub.com/book/big_data_and_business_intelligence/9781783280995/1/ch01lvl1sec15/the-nature-of-data
Horse, T. (2019). Sampling in Statistics: Different Sampling Methods, Types & Error. Retrieved from https://www.statisticshowto.datasciencecentral.com/probability-and-statistics/sampling-in-statistics/