Dalai Nguyen

Chapter 1

Section 1.2

Summary: This section gives an overview of the process of conducting a statistical study: prepare, analyze, and conclude, applying statistical and critical thinking throughout. To prepare, consider the context, the source of the data, and the sampling method. To analyze, construct suitable graphs, explore the data, and carry out the computations. To conclude, determine whether the results have statistical significance and practical significance.

Definition:
- Data: collections of observations, such as measurements, genders, or survey responses.
- Statistics: the science of planning studies and experiments; obtaining data; and then organizing, summarizing, presenting, analyzing, and interpreting the data and drawing conclusions based on them.
- Population: the complete collection of all measurements or data being considered.
- Census: the collection of data from every member of the population.
- Sample: a subcollection of members selected from a population.
- Voluntary response sample (or self-selected sample): a sample in which the respondents themselves decide whether to be included.

Journal question: What is the difference between statistical and practical significance?
- Statistical significance means that a result is very unlikely to occur by chance (with a large enough sample, even a small effect can be statistically significant). Practical significance is a common-sense judgment about whether the effect is large enough to matter.

Example: In a test of the Atkins weight loss program, 40 subjects using that program had a mean weight loss of 2.1 kg (or 4.6 pounds) after one year (based on data from "Comparison of the Atkins, Ornish, Weight Watchers, and Zone Diets for Weight Loss and Heart Disease Risk Reduction," by Dansinger et al., Journal of the American Medical Association, Vol. 293, No. 1). Using formal methods of statistical analysis, we can conclude that the mean weight loss of 2.1 kg is statistically significant. That is, based on statistical criteria, the diet appears to be effective.
However, using common sense, it does not seem very worthwhile to pursue a weight loss program resulting in such relatively insignificant results. Someone starting a weight loss program would probably want to lose considerably more than 2.1 kg. Although the mean weight loss of 2.1 kg is statistically significant, it does not have practical significance. The statistical analysis suggests that the weight loss program is effective, but practical considerations suggest that the program is basically ineffective.

Section 1.3

Summary: After this section, we should know and understand the meaning of the terms statistic and parameter, as defined below. These terms distinguish between data for a sample and data for an entire population. We also need to know the difference between quantitative data and categorical data.

Definition:
- Parameter: a numerical measurement describing some characteristic of a population.
- Statistic: a numerical measurement describing some characteristic of a sample.
- Quantitative data: data consisting of numbers that represent counts or measurements.
- Categorical data: data consisting of names or labels.
- Discrete data: quantitative data whose number of possible values is finite or countable.
- Continuous data: quantitative data with infinitely many possible values that are not countable.
- Nominal level of measurement: data consist of names, labels, or categories, without any ordering.
- Ordinal level of measurement: data have a clear order, but differences either cannot be determined or are meaningless.
- Interval level of measurement: data have a clear order, and differences can be found and are meaningful. The data do not have a natural zero.
- Ratio level of measurement: data have a clear order, and differences can be found and are meaningful. The data have a natural zero.

Journal question: What is the difference between a sample statistic and a population parameter?
- A population parameter is a numerical measurement describing a characteristic of an entire population, while a sample statistic is a numerical measurement describing a characteristic of a sample, such as the people actually surveyed.

Example: In a Harris Poll, 2320 adults in the United States were surveyed about body piercing, and 5% of the respondents said that they had a body piercing, but not on the face. Based on the latest available data at the time of this writing, there are 241,472,385 adults in the United States. The results from the survey are a sample drawn from the population of all adults.
- Parameter: the population size of 241,472,385 is a parameter, because it is based on the entire population of all adults in the United States.
- Statistic: the sample size of 2320 surveyed adults is a statistic, because it is based on a sample, not the entire population of all adults in the United States. The value of 5% is another statistic, because it is also based on the sample, not on the entire population.

Section 1.4

Summary: This section introduces the basics of data collection and describes some common ways in which observational studies and experiments are conducted. Of particular importance is the method of using a simple random sample.

Definition:
- Observational study: observing and measuring specific characteristics without attempting to modify the subjects being studied.
- Experiment: applying some treatment and then observing its effects on the subjects.
- Experimental units: the subjects in an experiment.
- Simple random sample: a sample of n subjects selected in such a way that every possible sample of the same size n has the same chance of being chosen.
- Systematic sampling: select some starting point and then select every kth element in the population.
- Convenience sampling: simply use results that are easy to get.
- Stratified sampling: divide the subjects into different groups (strata) and take a random sample from each group.
- Cluster sampling: divide the population into groups (clusters), randomly select some of those clusters, and then choose all the members of the selected clusters.
- Cross-sectional study: data are observed, measured, and collected at one point in time.
- Retrospective study: go back in time to collect data over some past period.
- Prospective (longitudinal) study: go forward in time and observe groups sharing common factors.
- Confounding: occurs when investigators are not able to distinguish among the effects of different factors.
- Sampling error: the sample has been selected with a random method, but there is a discrepancy between the sample result and the true population result.
- Nonsampling error: the result of human error, including such factors as wrong data entries, computing errors, biased wording and conclusions, false data provided, and statistical methods that are inappropriate for the circumstances.
- Nonrandom sampling error: the result of using a sampling method that is not random, such as a convenience sample or a voluntary response sample.

Journal question: What is a simple random sample?
- Simple random sample: a sample of n subjects is selected in such a way that every possible sample of the same size n has the same chance of being chosen.

Example:
- Observational study: the typical survey is a good example of an observational study. For example, the Pew Research Center surveyed 2252 adults in the United States and found that 59% of them go online wirelessly. The respondents were asked questions, but they were not given any treatment, so this is an example of an observational study.
- Experiment: in the largest public health experiment ever conducted, 200,745 children were given a treatment consisting of the Salk vaccine, while 201,229 other children were given a placebo. The Salk vaccine injections constitute a treatment that modified the subjects, so this is an example of an experiment.

Chapter 2

Section 2.2

Summary: When working with large data sets, a frequency distribution (or frequency table) is often helpful in organizing and summarizing the data. A frequency distribution helps us understand the nature of the distribution of a data set.
Definition:
- Frequency distribution: shows how data are partitioned among several categories (or classes) by listing the categories along with the number (frequency) of data values in each of them.
- Lower class limits: the smallest numbers that can belong to the different classes.
- Upper class limits: the largest numbers that can belong to the different classes.
- Class boundaries: the numbers used to separate the classes, but without the gaps created by class limits.
- Class midpoints: the values in the middle of the classes. Each is computed by adding the lower class limit to the upper class limit and dividing the sum by 2.
- Class width: the difference between two consecutive lower class limits (or two consecutive lower class boundaries) in a frequency distribution.

Journal question: Why do we use frequency distributions?
- We use frequency distributions because they are helpful in organizing and summarizing data, which helps us understand the nature of the distribution of a data set.

Example: Table 2-3 summarizes the race/ethnicity classifications recorded on traffic tickets issued by Connecticut's East Haven Police Department during a recent nine-month period. Here is an interesting and revealing fact about the data: Table 2-3 shows that 18 of those given tickets were classified by police as being Hispanic, but in fact, 209 of those given tickets had Hispanic names!

Table 2-3
Race                     Frequency
White                    329
Black                    15
Asian                    0
Hispanic                 18
White/Hispanic           4
Blank (no indication)    5

Section 2.3

Summary: This section discusses histograms. A histogram is basically a graph of a frequency distribution, and it is easier to interpret than a table of numbers.

Definition:
- Histogram: a graph consisting of bars of equal width drawn adjacent to each other (unless there are gaps in the data). The horizontal scale represents classes of quantitative data values and the vertical scale represents frequencies. The heights of the bars correspond to the frequency values.
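The class definitions from Section 2.2 (midpoint, width, boundaries) come down to simple arithmetic on the class limits, and the boundaries are where the edges of the histogram bars would fall. A minimal sketch in Python, assuming three made-up classes (50-69, 70-89, 90-109) that are hypothetical and not taken from the textbook:

```python
# Hypothetical class limits (lower, upper), made up for illustration only.
class_limits = [(50, 69), (70, 89), (90, 109)]


def class_midpoints(limits):
    """Midpoint = (lower class limit + upper class limit) / 2."""
    return [(lower + upper) / 2 for lower, upper in limits]


def class_width(limits):
    """Width = difference between two consecutive lower class limits."""
    return limits[1][0] - limits[0][0]


def class_boundaries(limits):
    """Boundaries split the gap between one class's upper limit
    and the next class's lower limit in half."""
    half_gap = (limits[1][0] - limits[0][1]) / 2
    boundaries = [limits[0][0] - half_gap]
    boundaries += [upper + half_gap for _, upper in limits]
    return boundaries


midpoints = class_midpoints(class_limits)
width = class_width(class_limits)
boundaries = class_boundaries(class_limits)

print("Midpoints: ", midpoints)
print("Width:     ", width)
print("Boundaries:", boundaries)
```

Note that the class width is 20, not 19: it is measured between consecutive lower limits (70 - 50), not between the lower and upper limit of a single class (69 - 50).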
Journal question: Why do we use histograms?
- We use histograms because a histogram is basically a graph of a frequency distribution, and it is easier to interpret than a table of numbers.

Example: NASA provides these duration times (in minutes) of all flights of the space shuttle Challenger: 7224, 8784, 8709, 11476, 10060, 11844, 10089, 11445, 10125, 1. Why does it not make sense to construct a histogram for this data set? What is notable about this data set?
- The data set has an outlier of 1 minute. Also, when a data set is this small, the true nature of the distribution cannot be seen with a histogram.

Section 2.4

Summary: Graphs are excellent tools for describing, exploring, and comparing data.
- Describing data: in a histogram, for example, consider the distribution, center, variation, and outliers (values that are very far away from almost all of the other data values).
- Exploring data: look for features of the graph that reveal some useful or interesting characteristic of the data set.
- Comparing data: construct similar graphs to compare data sets.

Definition: There are no definitions in this section.

Journal question: List 3 graphs that enlighten and 2 graphs that deceive.
3 graphs that enlighten:
- Scatterplots
- Time-series graphs
- Histograms
2 graphs that deceive:
- Graphs with a nonzero axis
- Pictographs

Example: Listed below are SAT scores from a sample of students. Why is it that a graph of these data will not be very effective in helping us understand the data?
2400 2200 2150 2040 2230 1890 2100 2090
- The data contain only one variable, SAT scores, so there is no second variable to relate the scores to and give them meaning. The data set is also very small (eight values), and as with the Challenger example in Section 2.3, a graph of so few values cannot show the true nature of the distribution.
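One way to see how little a graph of these eight SAT scores would show is to build the frequency distribution that a histogram would be drawn from. A minimal sketch in Python; the class width of 100 and the first lower class limit of 1800 are my own assumptions, not values from the textbook:

```python
# SAT scores from the example above.
sat_scores = [2400, 2200, 2150, 2040, 2230, 1890, 2100, 2090]

# Assumed class settings (chosen for illustration, not given in the text).
FIRST_LOWER_LIMIT = 1800
CLASS_WIDTH = 100


def frequency_distribution(values, first_lower, width):
    """Return a list of (lower limit, upper limit, frequency) per class."""
    n_classes = (max(values) - first_lower) // width + 1
    frequencies = [0] * n_classes
    for value in values:
        frequencies[(value - first_lower) // width] += 1
    return [
        (first_lower + i * width, first_lower + (i + 1) * width - 1, freq)
        for i, freq in enumerate(frequencies)
    ]


table = frequency_distribution(sat_scores, FIRST_LOWER_LIMIT, CLASS_WIDTH)

# Print a text version of the histogram: one star per data value.
for lower, upper, freq in table:
    print(f"{lower}-{upper}: {'*' * freq}")
```

With these assumed classes, no class holds more than two scores and two classes are empty, so the bars would be nearly flat and would say very little about the shape of the distribution.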