Introduction to Statistics
By: Ratih Rahmahwati

Topics
- Definition of statistics
- Classification of statistics and classification of data
- Main material: descriptive statistics, inferential statistics, multivariate statistics

Definition of statistics
- Definition from The Random House College Dictionary: "The science that deals with the collection, analysis, and interpretation of numerical facts or data."
- Statistics is the science of data.

The role of statistics
- Problem: incomplete information, which creates a degree of uncertainty (level of uncertainty).
- Statistics is unique in its ability to quantify uncertainty.
- The main contribution of statistics is that it lets us make inferences (estimates and decisions about population parameters) together with a measure of uncertainty for evaluating those inferences, based on sample data.
- Experimental research in engineering uses experimental data, that is, a sample, to infer the behavior of a population.

Classification of statistics
- Statistical methods: descriptive statistics and inferential statistics
- By the number of variables analyzed:
  - Univariate data analysis: one variable
  - Bivariate data analysis: two variables
  - Multivariate data analysis: more than two variables

Classification of data
- Qualitative (categorical): possesses no quantitative interpretation; values fall into defined categories. Examples: marital status, political party, eye color.
- Quantitative (numerical): represents the quantity or amount of something, measured on a numerical scale.
  - Discrete (counted items). Examples: number of children, defects per hour.
  - Continuous (measured characteristics). Examples: weight, voltage.

Types of data
By the value of the data or variable:
- Scale: data values are numeric values on an interval or ratio scale (e.g., age, income). Scale variables must be numeric.
- Ordinal: data values represent categories with some intrinsic order (e.g., low, medium, high; strongly agree, agree, disagree, strongly disagree). Ordinal variables can be either string (alphanumeric) or numeric values that represent distinct categories (e.g., 1 = low, 2 = medium, 3 = high). In general, it is more reliable to use numeric codes to represent ordinal data. Note: for ordinal string variables, the alphabetic order of the string values is assumed to reflect the true order of the categories.
- Nominal: data values represent categories with no intrinsic order (e.g., job category or company division). Nominal variables can be either string (alphanumeric) or numeric values that represent distinct categories (e.g., 1 = Male, 2 = Female).

Types of data (continued)
By source:
- Primary data: data collected directly by the researcher.
- Secondary data: data obtained from what other parties have already collected.
By nature:
- Quantitative data: expresses the quantity or amount of something, measured on a numerical scale. Example: waiting time (in minutes) before a computer process starts.
- Qualitative (categorical) data: has no quantitative interpretation and can only be classified. Example: the field or department in which university graduates work.
By how the values are obtained:
- Discrete data: takes specific, separate values and is obtained by counting (countable). Example: number of customers per hour.
- Continuous data: can take any value within a range and is obtained by measuring (measurable). Example: time needed to serve a customer.

Sampling techniques
- Probability samples: simple random, systematic, stratified, cluster
- Non-probability samples: quota, purposive, accidental, saturated, snowball
Statistical sampling
- Items of the sample are chosen based on known or calculable probabilities.
- Probability samples: simple random, stratified, systematic, cluster.
- A sampling technique that gives every member of the population an equal chance of being selected as a member of the sample.

Simple random samples
- Every individual or item from the population has an equal chance of being selected.
- Selection may be with replacement or without replacement.
- Samples can be obtained from a table of random numbers or from computer random number generators.
(Figure: population and sample.)

Stratified samples
- The population is divided into subgroups (called strata) according to some common characteristic.
- A simple random sample is selected from each subgroup.
- The samples from the subgroups are combined into one.
(Figure: a population divided into 4 strata and the resulting sample. A stratified population can yield a proportionate or a disproportionate sample.)

Systematic samples
- Decide on the sample size: n.
- Divide the frame of N individuals into groups of k individuals: $k = N/n$.
- Randomly select one individual from the first group.
- Select every k-th individual thereafter.
- Example: N = 64 and n = 8, so k = 8 and the first group holds the first 8 individuals.

Cluster samples
- The population is divided into several "clusters," each representative of the population.
- A simple random sample of clusters is selected.
- All items in the selected clusters can be used, or items can be chosen from a cluster using another probability sampling technique.
(Figure: a population divided into 16 clusters, with randomly selected clusters forming the sample.)

Nonprobability sampling
- Non-probability samples: quota, purposive, accidental, saturated, snowball.
- A sampling technique that does not give every member of the population an equal chance of being selected as a member of the sample.
- Based on judgment and convenience.

Quota sampling: drawing a sample from a population that has particular characteristics until the desired number (quota) is reached. Example: a group of 5 researchers studies grade II employees.
The sample size is set at 100, so each researcher may freely select 20 people who have the required characteristic (grade II).

Accidental sampling: drawing a sample by chance, that is, anyone who happens to meet the researcher can be used as a sample member if that person is judged suitable as a data source.

Purposive sampling: drawing a sample based on specific considerations. Example: in a study of employee discipline, the sample consists only of people with expertise in personnel matters.

Saturated sampling: taking every member of the population as the sample (used when the population is relatively small, fewer than 30; equivalent to a census).

Snowball sampling: drawing a sample that starts small; the initial sample members are then asked to nominate acquaintances to be sampled, and so on, so that the sample keeps growing.

DESCRIPTIVE STATISTICS

Descriptive statistics
1. Involves
   - Collecting data
   - Presenting data
   - Characterizing data
2. Purpose
   - Describe data
   - Present data in useful ways
   - Know the data pattern
   - Summarize the basic shape of the data
(Figure: bar chart of quarterly values in $, Q1 to Q4, annotated with $\bar{X} = 30.5$ and $S^2 = 113$.)

Descriptive statistics: methods
A statistical description can be produced in two ways:
- Graphical methods
- Numerical methods
How a data set is described depends heavily on whether the data are quantitative or qualitative.

Bar chart, pie chart, and dot chart
(Figures: a bar chart, a pie chart, and a dot chart of market share (%) by manufacturer. Pie chart values: Microsoft 60%, Lotus 15%, Others 15%, Wordperf. 10%.)

Stem-and-leaf display
1. Divide each observation into a stem value and a leaf value. The stem value defines the class; the leaf values define the frequency (count). For example, 26 has stem 2 and leaf 6.
2. Data: 21, 24, 24, 26, 27, 27, 30, 32, 38, 41

   2 | 1 4 4 6 7 7
   3 | 0 2 8
   4 | 1
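The stem-and-leaf construction above can be sketched in a few lines of Python. This is a minimal illustration that assumes two-digit whole numbers, so the tens digit serves as the stem and the units digit as the leaf; the function name `stem_and_leaf` is made up for this example.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group whole numbers by tens digit (stem) and collect units digits (leaves)."""
    rows = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 26 -> stem 2, leaf 6
        rows[stem].append(leaf)
    return dict(rows)


data = [21, 24, 24, 26, 27, 27, 30, 32, 38, 41]
display = stem_and_leaf(data)

# One line per stem; the number of leaves on a line is the class frequency.
for stem, leaves in display.items():
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")
```

Because the values are sorted before grouping, both the stems and the leaves within each stem come out in ascending order.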
Numerical data properties
- Central tendency (location)
- Variation (dispersion)
- Shape

Numerical data properties and measures
- Central tendency: mean, median, mode
- Variation: range, interquartile range, variance, standard deviation
- Shape: skew

Central Tendency (Industrial Statistics)

Mean
1. Measure of central tendency
2. Most common measure
3. Acts as the "balance point"
4. Affected by extreme values ("outliers")
5. Formula (sample mean): $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_n}{n}$

Mean example
Raw data: 10.3, 4.9, 8.9, 11.7, 6.3, 7.7
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + X_3 + X_4 + X_5 + X_6}{6} = \frac{10.3 + 4.9 + 8.9 + 11.7 + 6.3 + 7.7}{6} = 8.30$

Median
1. Measure of central tendency
2. Middle value in the ordered sequence
   - If n is odd, the middle value of the sequence
   - If n is even, the average of the 2 middle values
3. Position of the median in the sequence: positioning point $= \frac{n+1}{2}$
4. Not affected by extreme values

Median example: odd-sized sample
Raw data: 24.1, 22.6, 21.5, 23.7, 22.6
Ordered: 21.5, 22.6, 22.6, 23.7, 24.1
Position: 1, 2, 3, 4, 5
Positioning point $= \frac{n+1}{2} = \frac{5+1}{2} = 3.0$
Median $= 22.6$

Median example: even-sized sample
Raw data: 10.3, 4.9, 8.9, 11.7, 6.3, 7.7
Ordered: 4.9, 6.3, 7.7, 8.9, 10.3, 11.7
Position: 1, 2, 3, 4, 5, 6
Positioning point $= \frac{n+1}{2} = \frac{6+1}{2} = 3.5$
Median $= \frac{7.7 + 8.9}{2} = 8.30$

Mode
1. Measure of central tendency
2. Value that occurs most often
3. Not affected by extreme values
4. There may be no mode or several modes
5. May be used for numerical and categorical data
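As a rough check on the three measures of central tendency, the sketch below computes them with plain Python. The function names are illustrative only. Treating a data set in which every value occurs exactly once as having "no mode" is an assumption made here to match the slide wording.

```python
from collections import Counter


def mean(values):
    """Sample mean: sum of the observations divided by n."""
    return sum(values) / len(values)


def median_by_position(values):
    """Median located through the positioning point (n + 1) / 2."""
    ordered = sorted(values)
    position = (len(ordered) + 1) / 2      # 1-based position in the ordered data
    lower = ordered[int(position) - 1]
    if position.is_integer():              # odd n: a single middle value
        return lower
    upper = ordered[int(position)]         # even n: average the two middle values
    return (lower + upper) / 2


def modes(values):
    """All most frequent values; an empty list if no value repeats (assumption)."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []
    return sorted(v for v, c in counts.items() if c == top)


even_sample = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]
odd_sample = [24.1, 22.6, 21.5, 23.7, 22.6]

print(round(mean(even_sample), 2), round(median_by_position(even_sample), 2))
print(median_by_position(odd_sample), modes(odd_sample))
```

Adding one very large value to `even_sample` would shift the mean noticeably while leaving the median almost unchanged, which is the sensitivity to outliers listed for the mean.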
Mode example
- No mode. Raw data: 10.3, 4.9, 8.9, 11.7, 6.3, 7.7
- One mode. Raw data: 6.3, 4.9, 8.9, 6.3, 4.9, 4.9 (mode = 4.9)
- More than one mode. Raw data: 21, 28, 28, 41, 43, 43 (modes = 28 and 43)

Thinking challenge
You're a financial analyst for Prudential-Bache Securities. You have collected the following closing stock prices of new stock issues: 17, 16, 21, 18, 13, 16, 12, 11. Describe the stock prices in terms of central tendency.

Central tendency solution: mean
$\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = \frac{X_1 + X_2 + \cdots + X_8}{8} = \frac{17 + 16 + 21 + 18 + 13 + 16 + 12 + 11}{8} = 15.5$

Central tendency solution: median
Raw data: 17, 16, 21, 18, 13, 16, 12, 11
Ordered: 11, 12, 13, 16, 16, 17, 18, 21
Position: 1, 2, 3, 4, 5, 6, 7, 8
Positioning point $= \frac{n+1}{2} = \frac{8+1}{2} = 4.5$
Median $= \frac{16 + 16}{2} = 16$

Central tendency solution: mode
Raw data: 17, 16, 21, 18, 13, 16, 12, 11
Ordered: 11, 12, 13, 16, 16, 17, 18, 21
Mode $= 16$

Summary of central tendency measures
- Mean: $\sum X_i / n$; the balance point
- Median: the value at position $(n+1)/2$; the middle value when ordered
- Mode: no equation; the most frequent value

Variation

Range
1. Measure of dispersion
2. Difference between the largest and smallest observations: $\text{Range} = X_{\text{largest}} - X_{\text{smallest}}$

Disadvantages of the range
- Ignores the way in which the data are distributed.
  (Figure: two differently distributed data sets on the same 7 to 12 axis, each with Range = 12 - 7 = 5.)
- Sensitive to outliers.
  1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,5: Range = 5 - 1 = 4
  1,1,1,1,1,1,1,1,1,1,1,2,2,2,2,2,2,2,2,3,3,3,3,4,120: Range = 120 - 1 = 119

Variance and standard deviation
1. Measures of dispersion
2. Most common measures
3. Consider how the data are distributed
4. Show variation about the mean ($\bar{X}$ or $\mu$)
(Figure: dot plot on an axis from 4 to 12 with $\bar{X} = 8.3$ marked.)
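The contrast between the range and the standard deviation can be illustrated with a short sketch. The four data sets below are hypothetical and chosen only for this illustration. `statistics.stdev` from the Python standard library computes the sample standard deviation with n - 1 in the denominator.

```python
from statistics import stdev


def value_range(values):
    """Range: largest observation minus smallest observation."""
    return max(values) - min(values)


# Hypothetical data: same smallest and largest value, different spread in between.
spread_out = [7, 8, 9, 10, 11, 12]
clustered = [7, 10, 10, 10, 10, 12]

# Hypothetical data: identical except for one extreme value.
typical = [1, 1, 2, 2, 3, 3, 4, 5]
with_outlier = [1, 1, 2, 2, 3, 3, 4, 120]

for name, values in [("spread_out", spread_out), ("clustered", clustered),
                     ("typical", typical), ("with_outlier", with_outlier)]:
    print(f"{name:12s} range={value_range(values):4d}  s={stdev(values):.2f}")
```

The first two data sets share a range of 5 even though their values are distributed differently, and only the standard deviation separates them. In the last pair a single observation moves the range from 4 to 119.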
Sample variance formula
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1} = \frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n-1}$
Note the $n - 1$ in the denominator. (Use $N$ for the population variance.)

Sample standard deviation formula
$S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}} = \sqrt{\frac{(X_1 - \bar{X})^2 + (X_2 - \bar{X})^2 + \cdots + (X_n - \bar{X})^2}{n-1}}$

Variance example
Raw data: 10.3, 4.9, 8.9, 11.7, 6.3, 7.7
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$, where $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = 8.3$
$S^2 = \frac{(10.3 - 8.3)^2 + (4.9 - 8.3)^2 + \cdots + (7.7 - 8.3)^2}{6-1} = 6.368$

Thinking challenge
You're a financial analyst for Prudential-Bache Securities. You have collected the following closing stock prices of new stock issues: 17, 16, 21, 18, 13, 16, 12, 11. What are the variance and standard deviation of the stock prices?

Variation solution: sample variance
Raw data: 17, 16, 21, 18, 13, 16, 12, 11
$S^2 = \frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}$, where $\bar{X} = \frac{\sum_{i=1}^{n} X_i}{n} = 15.5$
$S^2 = \frac{(17 - 15.5)^2 + (16 - 15.5)^2 + \cdots + (11 - 15.5)^2}{8-1} = 11.14$

Variation solution: sample standard deviation
$S = \sqrt{S^2} = \sqrt{\frac{\sum_{i=1}^{n} (X_i - \bar{X})^2}{n-1}} = \sqrt{11.14} = 3.34$

Comparing standard deviations
(Figure: dot plots of three data sets on an axis from 11 to 21, all with mean 15.5.)
- Data A: mean = 15.5, s = 3.338
- Data B: mean = 15.5, s = 0.9258
- Data C: mean = 15.5, s = 4.57

Summary of variation measures
- Range: $X_{\text{largest}} - X_{\text{smallest}}$; total spread
- Interquartile range: $Q_3 - Q_1$; spread of the middle 50%
- Standard deviation (sample): $\sqrt{\frac{\sum (X_i - \bar{X})^2}{n-1}}$; dispersion about the sample mean
- Standard deviation (population): $\sqrt{\frac{\sum (X_i - \mu)^2}{N}}$; dispersion about the population mean
- Variance (sample): $\frac{\sum (X_i - \bar{X})^2}{n-1}$; squared dispersion about the sample mean

Numerical data properties for grouped data

Grouped data
The raw data are unavailable, but they have been grouped into a frequency distribution.

Mean (grouped data)
$\bar{X} = \frac{1}{n} \sum_{i=1}^{k} f_i m_i$, where $n = \sum_{i=1}^{k} f_i$
- $k$ = number of intervals
- $f_i$ = the frequency in the i-th interval
- $m_i$ = the midpoint of the i-th interval

Variance and standard deviation (grouped data)
Variance: $s^2 = \frac{\sum_{i=1}^{k} f_i m_i^2 - n\bar{X}^2}{n-1}$
Standard deviation: $s = \sqrt{s^2}$
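The variance calculations above can be reproduced with a short sketch. The raw-data function follows the sample variance formula directly. The grouped-data function assumes the usual shortcut form $s^2 = (\sum f_i m_i^2 - n\bar{X}^2)/(n-1)$ with class midpoints $m_i$. The frequency table is hypothetical, since the slides give no grouped example for the mean or the variance.

```python
from math import sqrt


def sample_variance(values):
    """Sum of squared deviations from the mean, divided by n - 1."""
    n = len(values)
    mean = sum(values) / n
    return sum((x - mean) ** 2 for x in values) / (n - 1)


def grouped_mean_and_variance(midpoints, frequencies):
    """Approximate mean and sample variance from a frequency distribution."""
    n = sum(frequencies)
    mean = sum(f * m for f, m in zip(frequencies, midpoints)) / n
    sum_of_squares = sum(f * m * m for f, m in zip(frequencies, midpoints))
    variance = (sum_of_squares - n * mean ** 2) / (n - 1)
    return mean, variance


raw = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]
prices = [17, 16, 21, 18, 13, 16, 12, 11]
print(round(sample_variance(raw), 3))
print(round(sample_variance(prices), 2), round(sqrt(sample_variance(prices)), 2))

# Hypothetical frequency table: intervals 0-10, 10-20, 20-30 with their midpoints.
midpoints = [5, 15, 25]
frequencies = [2, 5, 3]
grouped_mean, grouped_variance = grouped_mean_and_variance(midpoints, frequencies)
print(grouped_mean, round(grouped_variance, 2), round(sqrt(grouped_variance), 2))
```

The grouped figures are approximations: every observation in an interval is treated as if it sat at the midpoint, so they generally differ a little from the values the raw data would give.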
Percentiles (grouped data)
$Q_d = X_k + \left[(n+1)d - k\right] \frac{X_{k+1} - X_k}{h - k}$
Here $k$ is the greatest cumulative frequency not exceeding $(n+1)d$, $h$ is the cumulative frequency of the interval that contains $Q_d$, and $X_k$ and $X_{k+1}$ are the lower and upper boundaries of that interval.
Example: $(73 + 1)(0.25) = 18.5$. The greatest cumulative frequency not exceeding 18.5 is 7, for the first interval, so $k = 7$. $Q_{.25}$ falls in the second interval, whose cumulative frequency is 30.
$Q_{.25} = 100.0 + (18.5 - 7) \cdot \frac{105.0 - 100.0}{30 - 7} = 102.5$

The empirical rule
If the data distribution is bell-shaped, then the interval:
- $\mu \pm 1\sigma$ contains about 68% of the values in the population or the sample
- $\mu \pm 2\sigma$ contains about 95% of the values in the population or the sample
- $\mu \pm 3\sigma$ contains about 99.7% of the values in the population or the sample

Shape
1. Describes how the data are distributed
2. Measure of shape: skew (symmetry)
   - Left-skewed: Mean < Median < Mode
   - Symmetric: Mean = Median = Mode
   - Right-skewed: Mode < Median < Mean

Quartiles and Box Plots

Quartiles
1. Measure of noncentral tendency
2. Split the ordered data into 4 quarters of 25% each, separated by Q1, Q2, and Q3
3. Position of the i-th quartile: positioning point of $Q_i = \frac{i(n+1)}{4}$

Quartile (Q1) example
Raw data: 10.3, 4.9, 8.9, 11.7, 6.3, 7.7
Ordered: 4.9, 6.3, 7.7, 8.9, 10.3, 11.7
Position: 1, 2, 3, 4, 5, 6
$Q_1$ position $= \frac{1(n+1)}{4} = \frac{1(6+1)}{4} = 1.75 \approx 2$
$Q_1 = 6.3$

Quartile (Q2) example
Raw data: 10.3, 4.9, 8.9, 11.7, 6.3, 7.7
Ordered: 4.9, 6.3, 7.7, 8.9, 10.3, 11.7
Position: 1, 2, 3, 4, 5, 6
$Q_2$ position $= \frac{2(n+1)}{4} = \frac{2(6+1)}{4} = 3.5$
$Q_2 = \frac{7.7 + 8.9}{2} = 8.3$

Quartile (Q3) example
Raw data: 10.3, 4.9, 8.9, 11.7, 6.3, 7.7
Ordered: 4.9, 6.3, 7.7, 8.9, 10.3, 11.7
Position: 1, 2, 3, 4, 5, 6
$Q_3$ position $= \frac{3(n+1)}{4} = \frac{3(6+1)}{4} = 5.25 \approx 5$
$Q_3 = 10.3$
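The quartile examples above follow a positioning rule that can be sketched as below. The rounding convention is inferred from the worked examples and is an assumption: a whole-number position picks that value, a position ending in .5 averages the two neighbouring values, and any other position is rounded to the nearest whole position. The function names are illustrative.

```python
def value_at_position(ordered, position):
    """Look up a 1-based, possibly fractional position in ordered data."""
    whole = int(position)
    fraction = position - whole
    if fraction == 0:
        return ordered[whole - 1]
    if fraction == 0.5:                      # halfway: average the two neighbours
        return (ordered[whole - 1] + ordered[whole]) / 2
    nearest = whole + 1 if fraction > 0.5 else whole
    return ordered[nearest - 1]              # otherwise: nearest whole position


def quartile(values, i):
    """i-th quartile (i = 1, 2, 3) using the positioning point i(n + 1) / 4."""
    if len(values) < 2:
        raise ValueError("need at least two observations")
    ordered = sorted(values)
    position = i * (len(ordered) + 1) / 4
    return value_at_position(ordered, position)


data = [10.3, 4.9, 8.9, 11.7, 6.3, 7.7]
for i in (1, 2, 3):
    print(f"Q{i} =", round(quartile(data, i), 2))
```

Statistical software often interpolates between neighbouring values instead of rounding the position, so library results for Q1 and Q3 can differ slightly from this hand rule.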
Percentiles
The p-th percentile in an ordered array of n values is the value in the i-th position, where
$i = \frac{p}{100}(n+1)$
Example: the 60th percentile in an ordered array of 19 values is the value in the 12th position:
$i = \frac{p}{100}(n+1) = \frac{60}{100}(19+1) = 12$

Interquartile range
1. Measure of dispersion
2. Also called the midspread
3. Difference between the third and first quartiles: $\text{Interquartile range} = Q_3 - Q_1$
4. Spread of the middle 50%
5. Not affected by extreme values

Interquartile range example
(Figure: box plot with 25% of the data in each of its four segments.)
X minimum = 12, Q1 = 30, Median (Q2) = 45, Q3 = 57, X maximum = 70
Interquartile range = 57 - 30 = 27

Thinking challenge
You're a financial analyst for Prudential-Bache Securities. You have collected the following closing stock prices of new stock issues: 17, 16, 21, 18, 13, 16, 12, 11. What are the quartiles, Q1 and Q3, and the interquartile range?

Quartile solution: Q1
Raw data: 17, 16, 21, 18, 13, 16, 12, 11
Ordered: 11, 12, 13, 16, 16, 17, 18, 21
Position: 1, 2, 3, 4, 5, 6, 7, 8
$Q_1$ position $= \frac{1(n+1)}{4} = \frac{1(8+1)}{4} = 2.25 \approx 2$
$Q_1 = 12$

Quartile solution: Q3
Raw data: 17, 16, 21, 18, 13, 16, 12, 11
Ordered: 11, 12, 13, 16, 16, 17, 18, 21
Position: 1, 2, 3, 4, 5, 6, 7, 8
$Q_3$ position $= \frac{3(n+1)}{4} = \frac{3(8+1)}{4} = 6.75 \approx 7$
$Q_3 = 18$

Interquartile range solution
Raw data: 17, 16, 21, 18, 13, 16, 12, 11
Ordered: 11, 12, 13, 16, 16, 17, 18, 21
Position: 1, 2, 3, 4, 5, 6, 7, 8
Interquartile range $= Q_3 - Q_1 = 18 - 12 = 6$

Box plot
1. Graphical display of data using the 5-number summary: $X_{\text{smallest}}$, $Q_1$, Median, $Q_3$, $X_{\text{largest}}$
(Figure: box plot on an axis from 4 to 12.)

Shape of box-and-whisker plots
- The box and central line are centered between the endpoints if the data are symmetric around the median.
- A box-and-whisker plot can be shown in either vertical or horizontal format.

Distribution shape and box-and-whisker plot
(Figure: three box plots showing the positions of Q1, Q2, and Q3 for left-skewed, symmetric, and right-skewed distributions.)
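A five-number summary of the kind a box plot displays can also be obtained from the Python standard library. This sketch uses `statistics.quantiles` with `method="exclusive"`, which also works from (n + 1)-based positions but interpolates linearly between neighbouring values instead of rounding the position. Its quartiles can therefore differ somewhat from hand calculations that round. The function name `five_number_summary` is illustrative.

```python
from statistics import quantiles


def five_number_summary(values):
    """Smallest value, Q1, median, Q3, largest value."""
    q1, q2, q3 = quantiles(values, n=4, method="exclusive")
    return min(values), q1, q2, q3, max(values)


prices = [17, 16, 21, 18, 13, 16, 12, 11]
smallest, q1, q2, q3, largest = five_number_summary(prices)
iqr = q3 - q1

print("five-number summary:", (smallest, q1, q2, q3, largest))
print("interquartile range:", iqr)

# Rough symmetry check: in symmetric data the median sits midway in the box.
print("Q2 - Q1 =", q2 - q1, " Q3 - Q2 =", q3 - q2)
```

The last line only compares the two halves of the box. It is a quick visual-style check, not a formal measure of skewness.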
Box-and-whisker plot example
Below is a box-and-whisker plot for the following data:
0, 2, 2, 2, 3, 3, 4, 5, 5, 10, 27
Min = 0, Q1 = 2, Q2 = 3, Q3 = 5, Max = 27
(Figure: box-and-whisker plot with marks at 0, 2, 3, 5, and 27.)
This data is very right skewed, as the plot shows.

Methods for detecting outliers
Outlier: an observation y that is unusually large or small relative to the other values in a data set.
Outliers are typically attributable to one of the following causes:
- The measurement was observed, recorded, or entered into the computer incorrectly.
- The measurement comes from a different population.
- The measurement is correct but represents a rare event.

Rules of thumb for detecting outliers
- Z-scores: observations with z-scores greater than 3 in absolute value, where $z = (y - \bar{y}) / s$.
- Box plot: observations falling between the inner and outer fences are deemed suspect outliers; observations falling beyond the outer fences are deemed highly suspect outliers.

Distorting the Truth with Descriptive Techniques

Errors in presenting data
1. Using "chart junk"
2. No relative basis in comparing data batches
3. Compressing the vertical axis
4. No zero point on the vertical axis

"Chart junk"
- Bad presentation: a decorated graphic listing the minimum wage as 1960: $1.00, 1970: $1.60, 1980: $3.10, 1990: $3.80.
- Good presentation: a plain chart of the minimum wage ($, axis from 0 to 4) against the years 1960, 1970, 1980, and 1990.

No relative basis
- Bad presentation: A's by class shown as frequencies (axis from 0 to 300) for FR, SO, JR, SR.
- Good presentation: A's by class shown as percentages (axis from 0% to 30%) for FR, SO, JR, SR.

Compressing the vertical axis
- Bad presentation: quarterly sales (Q1 to Q4) plotted on a $ axis from 0 to 200.
- Good presentation: quarterly sales (Q1 to Q4) plotted on a $ axis from 0 to 50.

No zero point on the vertical axis
- Bad presentation: monthly sales (J, M, M, J, S, N) plotted on a $ axis from 36 to 45.
- Good presentation: monthly sales (J, M, M, J, S, N) plotted on a $ axis from 0 to 60.

INFERENTIAL STATISTICS
Inferential statistics
- Making statements about a population by examining sample results.
- Sample statistics are known; population parameters are unknown, but can be estimated from sample evidence. Inference runs from the sample to the population.
- A numerical descriptive measure computed from a sample is called a statistic (a characteristic of the sample).
- A numerical descriptive measure of a population is called a parameter (a characteristic of the population).

Inferential statistics (continued)
Drawing conclusions and/or making decisions concerning a population based on sample results.
- Estimation, e.g., estimate the population mean weight using the sample mean weight.
- Hypothesis testing, e.g., use sample evidence to test the claim that the population mean weight is 120 pounds.

MULTIVARIATE STATISTICS