Fundamentals of Data Science
Mr. Gai Alier John
Department of Mathematics, Physics and Computing
Moi University
May 15, 2025

Table of contents

The following are the major units to be covered in this course:
1. The concepts and principles of data science
2. Different types and sources of data
3. Overview of the data science workflow
4. Fundamental concepts of machine learning

Concepts and principles

Definitions
Data science is defined as "a field of study that uses scientific methods, processes, and systems to extract knowledge and insights from data" (according to the US Census Bureau).

Other definitions are:
- Data science is concerned with analyzing data and extracting useful knowledge from it. Building predictive models is usually the most important activity for a data scientist.
- Data science is concerned with analyzing Big Data to extract correlations with estimates of likelihood and error (Brodie, 2015a).
- Data science is an emerging discipline that draws upon knowledge in statistical methodology and computer science to create impactful predictions and insights for a wide range of traditional scholarly fields.

The journey of DS

Data science as a field of study, together with its definition and purposes, has had a revolutionary journey that started in the 1990s.
- Jeff Wu suggested in 1997 that statistics should be renamed "data science" and statisticians should be known as "data scientists".
- William S.
Cleveland suggested in 2001 that it would be appropriate to alter the statistics field to data science and "to enlarge the major areas of technical work of the field of statistics" by looking to computing and partnering with computer scientists.
- Leo Breiman suggested in 2001 that it was necessary to "move away from exclusive dependence on data models (in statistics) and adopt a more diverse set of tools" such as algorithmic modeling.
- In 2015, a number of ASA leaders released a statement about the role of statistics in data science [van Dyk et al. 2015], saying that "statistics and machine learning play a central role in data science."

Data Science workflow

The main scope of data and processes is summarised in this diagram.

Skills for Data Scientist

- Programming skills: knowledge of statistical programming languages like R and Python, and database query languages like SQL.
- Statistics: good applied statistical skills, including knowledge of statistical tests, distributions, regression, maximum likelihood estimators, etc.
- Machine learning: good knowledge of machine learning methods like k-Nearest Neighbors, Naive Bayes, SVM, and Decision Forests.
- Strong math skills (multivariable calculus and linear algebra): understanding the fundamentals of multivariable calculus and linear algebra is important because they form the basis of predictive algorithm optimization techniques.
- Experience with data visualization tools like matplotlib, ggplot and Tableau.
- Excellent communication skills: it is incredibly important to describe findings to both technical and non-technical audiences.
- Strong software engineering background
- Hands-on experience with data science tools
- Problem-solving aptitude
- Analytical mind and great business sense

What does a data scientist do?

Data scientist roles and responsibilities include:
1. Data mining, or extracting usable data from valuable data sources
2. Using machine learning tools to select features and to create and optimize classifiers
3. Carrying out the preprocessing of structured and unstructured data
4. Enhancing data collection procedures to include all relevant information for developing analytic systems
5. Processing, cleansing, and validating the integrity of data to be used for analysis
6. Analyzing large amounts of information to find patterns and solutions
7. Developing prediction systems and machine learning algorithms
8. Presenting results in a clear manner
9. Proposing solutions and strategies to tackle business challenges

What is data?

Definition
The observations gathered on the characteristics of interest are the data. The data for a particular subject would consist of observations such as (opinion = do not favor legalization, political party = Republican, religiosity = attend services once a week, education = 12 years, annual income = in the interval 40-60 thousand dollars, marital status = married, race = White, gender = male), etc. For the characteristics we measure in a study, variability occurs naturally among subjects in a sample or population. Such characteristics are called variables.

Variable
A variable is a characteristic that can vary in value among subjects in a sample or population.
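
As a minimal sketch of the two definitions above, the observations for each subject can be held as one record, and a variable is then one field read across all records. The first record uses values from the example in the text; the second subject, the field names, and the helper `values_of` are hypothetical, added only so that the variable visibly varies.

```python
# Each subject contributes one record of observations (the data).
subjects = [
    {   # values taken from the example in the text
        "opinion": "do not favor legalization",
        "political_party": "Republican",
        "education_years": 12,
        "marital_status": "married",
    },
    {   # hypothetical second subject, for illustration only
        "opinion": "favor legalization",
        "political_party": "Democrat",
        "education_years": 16,
        "marital_status": "single",
    },
]


def values_of(variable, records):
    """Return the value of one variable for every subject."""
    return [record[variable] for record in records]


education = values_of("education_years", subjects)
print(education)                 # [12, 16]
print(len(set(education)) > 1)   # True: the value differs between subjects
```

Reading one field across all subjects is what makes "education" a variable here: its value is not fixed but changes from subject to subject.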

Major Data Types

Data are broadly categorized into structured data and unstructured data.

Structured data are organized and stored in a fixed format within databases, making them easily searchable and analyzable.

Characteristics of structured data:
- Highly organized and stored in tabular format
- Follows a strict schema (e.g., database tables)
- Easily searchable using query languages like SQL
- Typically stored in relational databases (MySQL, PostgreSQL, Oracle, etc.)

Examples
- Customer information in an e-commerce database (name, email, order history)
- Financial transactions stored in banking systems
- Inventory management data in a retail database
- Medical data: temperature, blood pressure, health status (sick, not sick), etc.

Data Types

Unstructured data lacks a predefined format and is not stored in traditional database structures. It constitutes the majority of the data generated today, including multimedia files, social media posts, and IoT sensor output.

Characteristics of unstructured data:
- Does not have a predefined model
- Difficult to store and manage in relational databases
- Requires specialized tools like NoSQL databases, data lakes, or AI-driven search mechanisms

Examples
- Text, images, videos, and audio files
- Social media posts (tweets, Facebook updates, Instagram stories)
- Emails and customer feedback surveys
- Audio recordings from customer service calls
- Video content from security cameras or YouTube
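
The contrast between the two data types can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the table name `customers`, its columns, the rows, and the feedback sentence are all made up, and the standard-library `sqlite3` module stands in for the relational databases named above.

```python
import sqlite3

# Structured data: a fixed schema in a relational table, searchable with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customers (name TEXT, email TEXT, orders INTEGER)")
conn.executemany(
    "INSERT INTO customers VALUES (?, ?, ?)",
    [
        ("Amina", "amina@example.com", 3),   # made-up rows
        ("Brian", "brian@example.com", 0),
    ],
)
active = conn.execute(
    "SELECT name FROM customers WHERE orders > 0 ORDER BY name"
).fetchall()
conn.close()
print(active)  # [('Amina',)]

# Unstructured data: free text has no schema, so there is no column to query.
# Here a plain substring search is the simplest possible stand-in for the
# specialized tools that real unstructured data usually needs.
feedback = "Delivery was late but the support team was very helpful."
mentions_delivery = "delivery" in feedback.lower()
print(mentions_delivery)  # True
```

The structured query relies on the schema (a numeric `orders` column), while the text search has to inspect the raw content itself.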

Datafication

The novelty of data science is based on the fundamental change in our society that has been brought about by the growth of technology. The act of collecting information on aspects of the world that have never been quantified before is referred to as "datafication" (also written "datification"). In other words:
- Datafication refers to the process by which subjects, objects, and practices are transformed into digital data.
- Data mining is the process of extracting insights, meaning, and hidden patterns from collected data that are useful for making business decisions, for example to decrease expenditure and increase revenue.
- Big Data is a term for huge amounts of complex, variously formatted data generated at high speed that cannot be handled or processed by traditional systems; meaningful information is extracted by analyzing it.

Characteristics of Big Data

Big data is best described by the following characteristics, which contextualize its exponential growth (the 3 Vs or 5 Vs):
- Volume: Big data exists in huge amounts. This voluminous nature is what makes it different from regular data.
- Velocity: Velocity means how fast data is created and moves. Big data arrives rapidly from sources like phones, social media, networks, and servers.
- Variety: Big data comes from many different places and in many different forms.
- Veracity: Veracity refers to how accurate the Big Data is (data without errors).
- Value: Value refers to how useful the data is in terms of extracting informative results that can be used in decision making.

Sources of Big Data

The following are some sources of Big Data in the current world:
- Government institutions (e.g., health data, migration data, environmental data, etc.)
- Big data companies, e.g., Google, IBM, Entrans, Microsoft, etc.
- Sensor data: With the advancement of Internet of Things (IoT) devices, the sensors of these devices collect data that can be used for sensor data analytics to track the performance and usage of products.
- Satellite data: Satellites collect terabytes of images and other data on a daily basis through surveillance cameras, which can be used to gather useful information.
- Web traffic: Thanks to fast and cheap internet access, data in many formats uploaded by users on different platforms can be collected, with their permission, for data analysis. Search engines also provide data through the keywords and queries searched most often.

The End