
Data Science Fundamentals: Concepts, Workflow, and Big Data

Fundamentals of Data Science
Mr. Gai Alier John
Department of Mathematics, Physics and Computing
Moi University
May 15, 2025
Table of contents
The following are the major units to be covered under this course:
1. The concepts and principles of data science
2. Different types and sources of data
3. Overview of the data science workflow
4. Fundamental concepts of machine learning
Concepts and principles
Definitions
Data science is defined as "a field of study that uses scientific methods, processes, and systems
to extract knowledge and insights from data" (according to the US Census Bureau). Other
definitions are:
Data Science is concerned with analyzing data and extracting useful knowledge from it.
Building predictive models is usually the most important activity for a Data Scientist.
Data Science is concerned with analyzing Big Data to extract correlations with estimates of
likelihood and error. (Brodie, 2015a).
Data science is an emerging discipline that draws upon knowledge in statistical methodology
and computer science to create impactful predictions and insights for a wide range of
traditional scholarly fields.
The journey of DS
Data science as a field of study, along with its definition and purposes, has had a revolutionary
journey that started in the 1990s.
Jeff Wu suggested in 1997 that statistics should be renamed "data science" and
statisticians should be known as "data scientists".
William S. Cleveland suggested in 2001 that it would be appropriate to alter the statistics
field to data science and "to enlarge the major areas of technical work of the field of
statistics" by looking to computing and partnering with computer scientists.
Leo Breiman suggested in 2001 that it was necessary to "move away from exclusive
dependence on data models (in statistics) and adopt a more diverse set of tools" such as
algorithmic modeling.
In 2015, a statement about the role of statistics in data science was released by a number
of ASA leaders [van Dyk et al. 2015], saying that "statistics and machine learning play a
central role in data science."
Data Science workflow
The main scope of the data and processes is summarised in this diagram.
[Diagram: data science workflow]
Skills for a Data Scientist
Programming Skills – knowledge of statistical programming languages like R and Python, and
database query languages like SQL.
Statistics – good applied statistical skills, including knowledge of statistical tests,
distributions, regression, maximum likelihood estimators, etc.
Machine Learning – good knowledge of machine learning methods like k-Nearest
Neighbors, Naive Bayes, SVM, and Decision Forests.
Strong Math Skills (Multivariable Calculus and Linear Algebra) – understanding the
fundamentals of Multivariable Calculus and Linear Algebra is important, as they form the
basis of predictive algorithm optimization techniques.
Experience with Data Visualization Tools like matplotlib, ggplot, and Tableau.
Excellent Communication Skills – it is incredibly important to be able to describe findings to
both technical and non-technical audiences.
Strong Software Engineering Background
Hands-on experience with data science tools
Problem-solving aptitude
Analytical mind and great business sense
What does a data scientist do?
Data scientist roles and responsibilities include:
1. Data mining, or extracting usable data from valuable data sources
2. Using machine learning tools to select features and to create and optimize classifiers
3. Carrying out the preprocessing of structured and unstructured data
4. Enhancing data collection procedures to include all relevant information for developing
analytic systems
5. Processing, cleansing, and validating the integrity of data to be used for analysis
6. Analyzing large amounts of information to find patterns and solutions
7. Developing prediction systems and machine learning algorithms
8. Presenting results in a clear manner
9. Proposing solutions and strategies to tackle business challenges
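As an illustration of the cleansing and validation responsibility listed above, here is a minimal Python sketch. The records, the field names (name, age, income), and the plausibility rules (age between 0 and 120, non-negative income) are all assumptions made up for this example; real projects define their own rules.

```python
# Hypothetical raw records; field names and validity rules are illustrative assumptions.
raw_records = [
    {"name": " Alice ", "age": "34", "income": "52000"},
    {"name": "Bob", "age": "", "income": "48000"},      # missing age
    {"name": "Carol", "age": "-5", "income": "61000"},  # impossible age
    {"name": "Dan", "age": "41", "income": "n/a"},      # non-numeric income
]


def clean_record(record):
    """Return a cleaned copy of the record, or None if it fails validation."""
    name = record["name"].strip()  # remove stray whitespace
    try:
        age = int(record["age"])
        income = float(record["income"])
    except ValueError:
        return None  # a numeric field is missing or malformed
    if not name or not (0 <= age <= 120) or income < 0:
        return None  # value outside the assumed plausible range
    return {"name": name, "age": age, "income": income}


# Keep only the records that pass validation and count the rest.
cleaned = [c for c in map(clean_record, raw_records) if c is not None]
rejected = len(raw_records) - len(cleaned)

print(cleaned)
print("rejected:", rejected)
```

In practice, rejected records are usually logged and inspected rather than silently dropped, since they often reveal problems in the data collection procedure.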
What is data?
Definition
The observations gathered on the characteristics of interest are the data.
The data for a particular subject would consist of observations such as (opinion = do not favor
legalization, political party = Republican, religiosity = attend services once a week, education = 12
years, annual income in the interval 40-60 thousand dollars, marital status = married, race =
White, gender = male), etc.
For the characteristics we measure in a study, variability occurs naturally among subjects in a
sample or population. Such characteristics are called variables.
Variable
A variable is a characteristic that can vary in value among subjects in a sample or population.
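The relationship between subjects, observations, and variables can be sketched in Python as follows. The three subjects and the variable names (political_party, education_years, marital_status) are hypothetical and chosen only to mirror the example above.

```python
from collections import Counter

# Each dict holds one subject's observations; the keys are the variables.
# The sample itself is hypothetical.
subjects = [
    {"political_party": "Republican", "education_years": 12, "marital_status": "married"},
    {"political_party": "Democrat", "education_years": 16, "marital_status": "single"},
    {"political_party": "Republican", "education_years": 14, "marital_status": "married"},
]


def values_of(variable, data):
    """Collect the values that one variable takes across all subjects."""
    return [subject[variable] for subject in data]


# A categorical variable is summarised by counting its values.
party_counts = Counter(values_of("political_party", subjects))

# A numeric variable can be summarised by its mean.
education = values_of("education_years", subjects)
mean_education = sum(education) / len(education)

print(party_counts)
print(mean_education)
```

The fact that values_of returns different values for different subjects is exactly the variability that makes a characteristic a variable.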
Major Data Types
Data types are categorized broadly into three types: structured, semi-structured, and unstructured
data. Structured and unstructured data are described here.
Structured data are organized and stored in a fixed format within databases, making them easily
searchable and analyzable.
Characteristics of Structured Data:
Highly organized and stored in tabular format
Follows a strict schema (e.g., database tables)
Easily searchable using query languages like SQL
Typically stored in relational databases (MySQL, PostgreSQL, Oracle, etc.)
Examples
Customer information in an e-commerce database (Name, Email, Order History)
Financial transactions stored in banking systems
Inventory management data in a retail database
Medical data: temperature, blood pressure, health status (sick, not sick), etc.
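A small sketch of structured data, using Python's built-in sqlite3 module with an in-memory database. The customers table, its columns, and the sample rows are assumptions invented for illustration; they are not taken from any real system.

```python
import sqlite3

# In-memory relational database; nothing is written to disk.
conn = sqlite3.connect(":memory:")

# A strict schema: every row must supply these typed columns.
conn.execute(
    "CREATE TABLE customers ("
    "id INTEGER PRIMARY KEY, "
    "name TEXT NOT NULL, "
    "email TEXT NOT NULL, "
    "orders INTEGER NOT NULL)"
)

# Hypothetical customer rows.
conn.executemany(
    "INSERT INTO customers (name, email, orders) VALUES (?, ?, ?)",
    [
        ("Amina", "amina@example.com", 5),
        ("Brian", "brian@example.com", 1),
        ("Chen", "chen@example.com", 3),
    ],
)

# Because the data follow a schema, they are easy to search with SQL.
frequent_buyers = conn.execute(
    "SELECT name, orders FROM customers WHERE orders >= ? ORDER BY orders DESC",
    (3,),
).fetchall()

print(frequent_buyers)
conn.close()
```

The same query pattern would apply to MySQL or PostgreSQL, although the connection code differs.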
Data Types
Unstructured Data lacks a predefined format and is not stored in traditional database
structures.
It constitutes the majority of the data generated today, including multimedia files, social media
posts, and IoT sensor output.
Characteristics of Unstructured Data:
Does not have a predefined model
Difficult to store and manage in relational databases
Requires specialized tools like NoSQL databases, data lakes, or AI-driven search mechanisms
Examples
Text, images, videos, and audio files
Social media posts (tweets, Facebook updates, Instagram stories)
Emails and customer feedback surveys
Audio recordings from customer service calls
Video content from security cameras or YouTube
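Because unstructured data have no predefined model, some structure usually has to be extracted before analysis. The sketch below pulls hashtags out of free-text posts and counts them; the posts are made up, and treating a hashtag as "# followed by word characters" is a simplifying assumption.

```python
import re
from collections import Counter

# Hypothetical social media posts: free text with no fixed schema.
posts = [
    "Loving the new phone! #tech #gadgets",
    "Customer service kept me waiting for an hour... #fail",
    "Great talk on big data today #tech #bigdata",
]


def extract_hashtags(text):
    """Return the lowercase hashtags found in a piece of text."""
    return [tag.lower() for tag in re.findall(r"#(\w+)", text)]


# Turn unstructured text into a structured summary: tag -> frequency.
hashtag_counts = Counter(tag for post in posts for tag in extract_hashtags(post))

print(hashtag_counts.most_common(1))
```

Images, audio, and video need far heavier processing than this, but the idea is the same: derive structured features from raw content.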
Datafication
The novelty of data science stems from a fundamental change in our society that has been
brought about by the growth of technology.
The act of collecting information on aspects of the world that have never been quantified
before is referred to as "datafication."
In other words, datafication refers to the process by which subjects, objects,
and practices are transformed into digital data.
Data Mining is the process of extracting insights, meaning, and hidden patterns from collected
data that are useful for making business decisions, for example to decrease expenditure and
increase revenue.
Big Data is a term for huge amounts of complex, variously formatted data, generated at high
speed, that cannot be handled or processed by traditional systems and must be analyzed with
other tools to extract meaningful information.
Characteristics of Big Data
Big data is best described by the following characteristics, which put its exponential
growth in context (the 3 Vs or 5 Vs).
The characteristics of Big Data are:
Volume: Big data exists in huge amounts. This voluminous nature is what makes
it different from regular data: the size and amount of information are so large that
it is considered big data.
Velocity: Velocity means how fast data is created and moves. It describes how quickly big
data arrives from sources like phones, social media, networks, and servers.
Variety: Big data comes from many different places and in many different forms. This is
what we call Variety in Big Data.
Veracity and Value: Veracity refers to how accurate the Big Data is (data without
errors), while Value refers to how useful the data is in terms of extracting informative
results that can be used in decision making.
Sources of Big Data
The following are some sources of Big Data in the current world:
Government institutions (e.g., health data, migration data, environmental data, etc.)
Big data companies, e.g., Google, IBM, Entrans, Microsoft, etc.
Sensor data: With the advancement of Internet of Things (IoT) devices, the sensors of
these devices collect data which can be used for sensor data analytics to track the
performance and usage of products.
Satellite data: Satellites collect terabytes of images and data on a daily basis
through surveillance cameras, which can be used to extract useful information.
Web traffic: Due to fast and cheap internet facilities, data in many formats uploaded
by users on different platforms can be collected with their permission and used for
analysis and prediction. Search engines also provide data through the keywords
and queries that are searched most.
The End