DCCS208(02), Korea University, Fall 2019
Introduction to Big Data
Chapters 1 & 2 (Week 1): Course overview & introduction
Asst. Prof. Minseok Seo (mins@korea.ac.kr)

Contents
1. Course Overview
   - Brief introduction of professor & course
   - Objectives & aims of the course
   - Assignments & quizzes
   - Evaluation
2. Introduction to Big Data
   - Definition of Big Data
   - Key techniques in Data Science
   - Core technology of Informatics

Course Overview

Course information
Introduction to Big Data, DCCS208(02), Fall 2019.
   - Lecture time: Wed. (6, 7) and Thu. (6)
   - Location: Wed. (7-310) and Thu. (7-315)
   - Completion division: Major elective subject
   - Level: Junior / Senior

copyrightⓒ 2018 All rights reserved by Korea University

Definition of Big Data (Cont.)
[Figure: elephant vs. rat]
Which is bigger, an elephant or a rat?

Definition of Big Data (Cont.)
What is Data?
Rows are objects (samples, individuals); columns are attributes (dimensions, features, variables).

   ID          Height   Weight   Age
   Student 1   189 cm   81 kg    24
   Student 2   210 cm   90 kg    26
   Student 3   191 cm   92 kg    27
   …           …        …        …
   Student N   162 cm   71 kg    21

Definition of Big Data (Cont.)
   - In a narrow sense, Big Data refers only to a large sample size.
   - In a broad sense, Big Data refers to both a large sample size and high dimensionality.

Definition of Big Data (Cont.)
3V’s (Volume, Velocity, and Variety)

Definition of Big Data (Cont.)
5V’s (Volume, Velocity, Variety, Veracity, and Value)
   - Volume: data size
   - Velocity: speed of data production
   - Variety: data originating from many different sources
   - Veracity: data accuracy (trustworthiness)
   - Value: the value of the data

Relationship between Big Data & Data Science
The amount of data and information is not directly correlated with knowledge generation, but the demand for data scientists will keep growing.

Job market of Big Data
[Figure source: Furht B., Villanustre F. (2016) Introduction to Big Data. In: Big Data Technologies and Applications. Springer, Cham]
It is time to prepare an academic course that trains data analysts in numbers commensurate with demand.

Objectives & aims of the course
Students who take this course, Introduction to Big Data, can expect to learn:
   - Concept of Big Data
   - Computational approaches for Big Data
   - Statistical approaches for Big Data
   - Basic skills in Data Science
   - R programming
   - Visualization for Big Data

Course schedule (before mid-term exam)

   Week   Period          Study contents
   1      09.02 - 09.08   Introduction to Big Data & Data Science
   2      09.09 - 09.15   Overall workflow, computer software issues, and applications in the Big Data era
   3      09.16 - 09.22   Introduction to R programming
   4      09.23 - 09.29   Descriptive & fundamental statistics
   5      09.30 - 10.06   Understanding data structures (types of random variable)
   6      10.07 - 10.13   Data visualization
   7      10.14 - 10.20   Preprocessing of Big Data (quality control and prescreening)
   8      10.21 - 10.27   Mid-term exam

Course schedule (after mid-term exam)

   Week   Period          Study contents
   9      10.28 - 11.03   Parallel and distributed processing for Big Data
   10     11.04 - 11.10   Statistical estimation & modeling
   11     11.11 - 11.17   Computational approach for statistical modeling with robustness
   12     11.18 - 11.24   Clustering analysis (unsupervised learning methods)
   13     11.25 - 12.01   Classification analysis (supervised learning methods)
   14     12.02 - 12.08   Algorithms of dimensionality reduction for Big Data
   15     12.09 - 12.15   Trends in various academic & industrial fields for application of Big Data
   16     12.16 - 12.22   Final exam

Two types of lectures per week
   - Wed.: 2 hrs, theory lecture
   - Thu.: 1 hr, hands-on lecture
The methods learned in the theory class will be practiced in the computer lab on Thursday.
There are two representative programming languages for Big Data analysis, R and Python. R will be used in this class. No prior knowledge of R is required, because I plan to provide example code for students to practice with.
https://cran.r-project.org/

Exam, Quiz, and Homework
   - Midterm and final exams: There will be two exams. They will test your understanding of the basic computational and statistical algorithms.
   - Quiz: There will be two simple in-class quizzes to check students' learning progress (one before and one after the midterm).
   - Homework: There will be four assignments. Each is a report on the theory and practice of data analysis learned in class.

Evaluation plan
[Chart: weights of Midterm, Final, Quiz, Assignment, and Attendance; values shown: 10%, 30%, 20%, 10%, 30%]
Absolute grading system:
   - Score ≥ 95: A+
   - Score ≥ 90: A
   - Score ≥ 85: B+
   - and...
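The absolute grading rule above can be sketched in a few lines of code. This is a minimal illustration in Python (the hands-on sessions themselves use R), not the instructor's actual grading procedure. The function names are hypothetical, and the pairing of weights to components in `example_weights` is an assumption made only so the example runs, since the extracted chart does not show which weight belongs to which component. Grade bands below B+ are not given on the slide, so the sketch returns `None` for them.

```python
# Thresholds stated on the slide; lower bands are elided there ("and...").
GRADE_THRESHOLDS = [(95, "A+"), (90, "A"), (85, "B+")]


def weighted_total(scores, weights):
    """Combine per-component scores (0-100) using fractional weights that sum to 1."""
    if abs(sum(weights.values()) - 1.0) > 1e-9:
        raise ValueError("weights must sum to 1")
    return sum(scores[name] * w for name, w in weights.items())


def letter_grade(total):
    """Return the letter grade for a total score, or None below the stated bands."""
    for cutoff, grade in GRADE_THRESHOLDS:
        if total >= cutoff:
            return grade
    return None  # bands below B+ are not given in the source


# ASSUMPTION: this pairing of weights to components is illustrative only;
# the slide lists 10%, 30%, 20%, 10%, 30% without a recoverable mapping.
example_weights = {"midterm": 0.30, "final": 0.30, "quiz": 0.10,
                   "assignment": 0.20, "attendance": 0.10}
# Hypothetical student scores, invented for the example.
example_scores = {"midterm": 92, "final": 88, "quiz": 100,
                  "assignment": 95, "attendance": 100}

total = weighted_total(example_scores, example_weights)
print(round(total, 1), letter_grade(total))
```

Keeping the thresholds in a list sorted from highest to lowest means the first matching cutoff is always the best grade the score qualifies for.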
Textbook
No textbook.
   - This course will proceed based on the presentation slides.
   - I will upload the presentation slides to Blackboard and my homepage.
   - Homepage: https://scholar.harvard.edu/msseo (Teaching >> Introduction to Big Data >> Related Materials)
References:
   - Reference 1 (Kor. version): R for Practical Data Analysis (free online textbook). http://r4pda.co.kr/pdf/r4pda_2014_03_02.pdf
   - Reference 2 (Eng. version): Introduction to Data Science by Rafael A. Irizarry, 2019 (free online textbook). https://rafalab.github.io/dsbook/
   - Reference 3 (Eng. version): R for Data Science by Hadley Wickham and Garrett Grolemund (free online textbook). https://r4ds.had.co.nz/

Contact information
   - Prof. Minseok Seo. Location: 7-203. Tel: 044-860-1379. Email: mins@korea.ac.kr
   - TA Heechan Chae. Location: 7-328. Email: chay219@korea.ac.kr
If you have any questions about the course, please email me and I will reply as soon as I see it. If you need to meet in person, please make an appointment by email first.
I am available Mon: 12:00 - 17:00 | Wed: 10:00 - 13:00 | Thu: 10:00 - 13:00.

End of Orientation

Contents
1. Course Overview
   - Brief introduction of professor & course
   - Objectives & aims of the course
   - Assignments & quizzes
   - Evaluation
2. Introduction to Big Data
   - Concept of Big Data
   - Key techniques in Data Science for Big Data
   - Characteristics of Big Data

Reminder: concept of Big Data
5V’s (Volume, Velocity, Variety, Veracity, and Value)
   - Volume: data size
   - Velocity: speed of data production
   - Variety: data originating from many different sources
   - Veracity: data accuracy (trustworthiness)
   - Value: the value of the data

Petabyte era
   - 1 PB = 1,000,000,000,000,000 B = 10^15 bytes = 1,000 terabytes (TB)
   - 1,000 PB = 1 exabyte (EB)
   - [Company logo] transferred about 197 PB of data through its network each day (2018)
   - [Company logo] processed about 24 PB daily (2009)
In fact, we can say that we have already entered the exabyte era.

Characteristics of Big Data
How do you recognize whether it is Big Data or not? The computer scientist's view:
   - "My computer is too low on memory to handle this data!!" That is Big Data.
   - "No!!!! This data is over 2 TB. Where do I store it?????" That is Big Data.
In short, if processing the data on your computer leaves you at your wits' end, you are probably dealing with Big Data.

Characteristics of Big Data
How do you recognize whether it is Big Data or not? The statistician's view:
   - "When will this calculation end? I have only been waiting for 10 years..."
   - "The dimensionality is too high!!!! I can't build a statistical model with this data!!!" That is Big Data.
In short, if analyzing the data on your computer leaves you at your wits' end, you are probably dealing with Big Data.

Core technologies of the Big Data era
IT technologies that resolve issues arising from Big Data:
   - Software: prescreening techniques, data visualization, feature selection
   - Hardware: parallel processing, cloud computing, distributed processing
Difficulties arise in both hardware and software, but the software difficulties are the ones students can tackle.
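The computer scientist's rule of thumb above (data is "big" when it no longer fits on your machine) can be combined with the unit definitions from the petabyte slide in a back-of-the-envelope estimate. The following is a minimal Python sketch (the hands-on sessions themselves use R). It assumes a dense numeric table that stores one 8-byte value per sample and attribute, and the function names, the example dataset, and the 16 GB machine are all hypothetical.

```python
# Decimal (SI) units, matching the slide's definition 1 PB = 10**15 bytes.
UNITS = {"KB": 10**3, "MB": 10**6, "GB": 10**9,
         "TB": 10**12, "PB": 10**15, "EB": 10**18}


def table_bytes(n_samples, n_attributes, bytes_per_value=8):
    """Size of a dense numeric table: one value per (sample, attribute) cell.

    ASSUMPTION: 8 bytes per value (a double-precision number), no overhead.
    """
    return n_samples * n_attributes * bytes_per_value


def human_readable(n_bytes):
    """Express a byte count in the largest unit that keeps the value >= 1."""
    for unit in ("EB", "PB", "TB", "GB", "MB", "KB"):
        if n_bytes >= UNITS[unit]:
            return f"{n_bytes / UNITS[unit]:.2f} {unit}"
    return f"{n_bytes} B"


def fits_in_memory(n_bytes, ram_bytes):
    """True if the table could be held entirely in the given amount of RAM."""
    return n_bytes <= ram_bytes


# Hypothetical dataset: 1 million samples with 20,000 attributes each.
size = table_bytes(1_000_000, 20_000)
print(human_readable(size))                      # 160.00 GB
print(fits_in_memory(size, 16 * UNITS["GB"]))    # False on a 16 GB machine
```

Note that both dimensions matter here: doubling either the sample size or the number of attributes doubles the footprint, which matches the broad-sense definition of Big Data given earlier.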
Computational language for Big Data: R and Python
   - Wed.: 2 hrs, theory lecture
   - Thu.: 1 hr, hands-on lecture
There are two representative programming languages for Big Data analysis, R and Python. The hands-on lectures use R, which is free and relatively easy to learn.
Let's visit the R homepage: https://cran.r-project.org/

Install R
   - Step 1: Download the R installer.
   - Step 2: Download RStudio from https://www.rstudio.com/products/rstudio/download/
   - Step 3: Install R and RStudio.

What is R
   - R is an interpreted computer language.
   - Procedures written in C, C++, and other languages can be interfaced for efficiency.
   - System commands can be called from within R.
   - R is used for data manipulation, statistics, and graphics.

R, S, and S-plus (History of R)
S: an interactive environment for data analysis, developed at Bell Laboratories since 1976.
   - 1988 - S2: RA Becker, JM Chambers, A Wilks
   - 1992 - S3: JM Chambers, TJ Hastie
   - 1998 - S4: JM Chambers
   - Exclusively licensed by AT&T/Lucent to Insightful Corporation, Seattle, WA. Product name: “S-plus”.
   - Implementation languages: C, Fortran.
R: initially written by Ross Ihaka and Robert Gentleman at the Department of Statistics, University of Auckland, New Zealand, during the 1990s.
   - Since 1997: an international “R-core” team of ca. 15 people with access to a common CVS archive.
What R does and does not do
Possible:
   (1) Data handling and storage: numeric, textual
   (2) Matrix algebra
   (3) Hash tables and regular expressions
   (4) High-level data analytic and statistical functions
   (5) OOP (classes)
   (6) Graphics
   (7) Programming language: loops, branching, subroutines, etc.
Impossible:
   (1) R is not a database, but it connects to DBMSs.
   (2) R has no GUI, but it connects to Java and Tcl/Tk.
   (3) R is fundamentally slow, but it can call your own C/C++ code.
   (4) R has no spreadsheet view of data, but it connects to Excel/MS Office.
   (5) R has no professional or commercial support.
However, every R user in the world is a potential developer (the power of collective intelligence). If you build a useful package, you can publish it almost immediately. As a result, the latest algorithms often become available in R sooner than in other programming languages.

End of Slide