Uploaded by نادرة الوحيدي

Introduction to Data Science Slides Chapter 1

Introduction to Data Science
Chapter 1: What is Data Science
Introduction
• In today's world, most people are acquainted with the term "data." People with
smartphones and data plans are aware of their data usage, such as for checking email or
posting on social media.
• The concept of data sharing extends beyond personal usage. Individuals may encounter
data sharing in various contexts, such as when considering family mobile plans.
• The term "data" can refer to a wide range of information, from personal data to
institutional records. Data sizes can also vary significantly, from a few kilobytes to several
petabytes.
What is Data Science
• Data science involves uncovering insights through data exploration, hypothesis testing,
and data-driven conclusions.
• It is a field of study and practice that involves the collection, storage, and processing of
data in order to derive important insights into a problem or a phenomenon.
• Such data may be generated by humans (surveys, logs, etc.) or machines (weather data,
road vision, etc.), and could be in different formats (text, audio, video, augmented or
virtual reality, etc.). Datasets are often distributed as CSV files.
• Examples:
• Billionaires Statistics Dataset:
• Fruits and vegetables dataset:
• Data science is growing very fast, and we continue to generate huge amounts of data
at massive speed.
What is Data Science
• The “3V model” attempts to describe this huge growth of data in a simple
way:
1. Velocity: The speed at which data is accumulated.
2. Volume: The size and scope of the data.
3. Variety: The massive array of data and types (structured and unstructured).
Where Do We See Data Science?
• Today, data science appears almost everywhere and in many fields, such as:
 Finance:
 Data scientists help the finance industry obtain the information necessary to make accurate
predictions through capturing and analyzing new sources of data, building predictive models
and running real-time simulations of market events
 Financial data scientists may also partake in fraud detection and risk reduction.
 Data science initiatives even help bankers analyze a customer’s purchasing power to more
effectively try to sell additional banking products.
Where Do We See Data Science?
Public Policy:
Public policy is the application of policies, regulations, and laws to the problems of
society through the actions of government and agencies for the good of a citizenry.
Data science helps governments and agencies gain insights into citizen behaviors that
affect the quality of public life, including traffic, public transportation, social welfare,
community wellbeing, etc.
Where Do We See Data Science?
Politics:
Politics encompasses the processes involved in electing officials who govern a state, as
well as the enactment of policies by these officials. Government funding primarily comes
from taxes.
The real-time application of data science to politics has gained significant momentum in
recent times.
Example: Obama's 2008 Campaign: Data scientists analyzed former US President
Obama's 2008 presidential campaign, focusing on Internet-based campaign efforts. Some
experts believe that without the Internet, Obama might not have become president.
Accurate Voter Targeting: Data scientists have successfully developed highly accurate
voter targeting models and strategies to increase voter participation.
Where Do We See Data Science?
 Politics
 Example: Trump's 2016 Campaign: The 2016 campaign to elect Donald Trump is a notable
instance of data science's role in politics, particularly through social media. Data science was used
to tailor individual messages to specific voters.
 Twitter has emerged as a significant digital public relations tool for politics. Studies analyzed the
content of tweets from both Donald Trump and Hillary Clinton's Twitter handles, revealing
differences in the emphasis on traits, issues, main content, sources of retweets, multimedia use,
and civility. These analyses fall within the realm of data science.
 The Cambridge Analytica data scandal, which came to light in March 2018, involved the data
analytics firm obtaining data on approximately 87 million Facebook users from an academic
researcher. This data was used to target political ads during the 2016 US presidential campaign,
highlighting privacy concerns and ethical issues associated with data use.
Where Do We See Data Science?
Healthcare:
Healthcare is another area in which data scientists keep changing their research
approach and practices.
The healthcare industry is inundated with unprecedented amounts of data, including
biological data like gene expression, DNA sequences, proteomics, and metabolomics.
Data scientists work with vast datasets, combining clinical trial data with physician
observations to focus on patient-centered medical questions and treatment
effectiveness.
Where Do We See Data Science?
Healthcare:
Personal health management is revolutionized by wearable devices like Fitbit, enabling
the collection of various health data.
Researchers use wearables to track physical activity adherence, offering insights into
health behaviors.
Apple and other companies partner with healthcare providers to collect and analyze data
from wearable devices, aiding in health monitoring, diagnosis, and treatment.
Where Do We See Data Science?
Urban Planning
Urban planning is undergoing a significant transformation due to the application of data
science.
New data-driven initiatives in urban planning, often referred to as "informatics," are
focused on acquiring, integrating, and analyzing data to enhance urban systems and
overall quality of life.
Examples of data-driven urban initiatives include platforms like chicagoshovels.org,
which tracks snowplow locations and facilitates community-based efforts to clear
sidewalks during winter, and apps developed by cities like Boston to improve public
services and address infrastructure needs.
Where Do We See Data Science?
Education:
Technology's role in education is evolving beyond merely placing computers in front of
students, with a growing emphasis on data-driven and personalized approaches.
Educators and technology advocates are recognizing the potential for technology to
enhance learning through data analytics.
Big data is seen as a valuable resource for educational improvement, offering insights
into student performance and learning methods.
Data analytics can enable instructors to understand what students know and which
teaching techniques are most effective, leading to more nuanced assessment of learning.
Online tools make it possible to evaluate a wide range of student actions, including
reading habits, resource usage, and the speed of mastering key concepts.
Where Do We See Data Science?
Libraries:
Data science is increasingly applied to library science, bridging the roles of data
professionals and librarians.
The evolving role of data science in libraries enables librarians to automate literature
reviews, extract key ideas and results from thousands of articles, and apply data science
techniques such as network analysis to visualize research trends.
This transformation makes the work of researchers more efficient and accessible than
traditional, manual literature reviews.
How Does Data Science Relate to Other Fields?
Data Science and Statistics
Some consider data science as a subset of statistics.
The distinction between the two fields is primarily due to advancements in modern
computing.
Data science deals with twenty-first-century data problems, such as handling large
databases, data manipulation, and visualization.
Basic statistical knowledge, machine learning, and data visualization are essential skills
for data scientists.
How Does Data Science Relate to Other Fields?
Data Science and Computer Science:
Data science and computer science overlap and support each other.
Computer scientists have developed tools like database systems, visualization
techniques, and algorithms used in data science.
Machine learning is a crucial part of data science, and both fields share many algorithms
and techniques.
How Does Data Science Relate to Other Fields?
Data Science and Engineering:
Engineering in various fields demands data scientists for data-driven solutions.
Technology has transformed industries like construction with "smart" building
techniques.
Data science is essential in processing data generated by advanced technologies.
The role of data science will continue to expand in various engineering applications.
How Does Data Science Relate to Other Fields?
Data Science and Business Analytics
Business analytics (BA) refers to the skills, technologies, and practices for
continuous iterative exploration and investigation of past and current
business performance to gain insight and be strategic. BA focuses on
developing new perspectives and making sense of performance based on
data and statistics. This is where data science comes in: to fulfill the
requirements of BA, data scientists are needed for statistical analysis,
including explanatory and predictive modeling and fact-based management,
to help drive successful decision-making.
 Business analytics focuses on data-driven decision-making.
 Data scientists are needed for statistical analysis, predictive modeling, and fact-based
management in business analytics.
 There are four types of analytics: decision, descriptive, predictive, and prescriptive.
How Does Data Science Relate to Other Fields?
Data Science, Social Science, and Computational Social Science
Computational social science has bridged the gap between various social science
disciplines by using data science techniques.
It helps connect theories and research findings from different social science branches.
Computational social science raises questions about the ethics and politics of data
science research, especially in sociopolitical contexts.
The Relationship between Data Science and
Information Science
Information vs. Data:
Data is often seen as something raw and meaningless: an object that, when
analyzed or converted to a useful form, becomes information.
Information is data that is endowed with meaning and purpose.
The traditional view suggests that data becomes information when it is analyzed or given
context.
For example, the number “480,000” is a data point. But when we add an explanation
that it represents the number of deaths per year in the USA from cigarette smoking, it
becomes information.
Computational Thinking
• Computational thinking is considered an essential skill for everyone, not just computer
scientists. It involves thinking like a computer scientist and is defined as using
abstraction and decomposition when tackling complex tasks or designing large systems.
• It is an iterative process based on the following three stages:
1. Problem formulation (abstraction)
2. Solution expression (automation)
3. Solution execution and evaluation (analyses)
Computational Thinking
• Hands-On Example 1.1: Computational Thinking:
• Let us consider an example. We are given the following numbers and are tasked with
finding the largest of them: 7, 24, 62, 11, 4, 39, 42, 5, 97, 54. Perhaps you can do it just by
looking at it. But let us try doing it “systematically.”
• Rather than looking at all the numbers at the same time, let us look at two at a time. So,
the first two numbers are 7 and 24. Pick the larger of them, which is 24. Now we take that
and look at the next number. It is 62. Is it larger than 24? Yes, which means, as of now, 62
is our largest number. The next number is 11. Is it larger than the largest number we know
so far, that is, 62? No. So we move on. If you continue this process until you have seen all
the remaining numbers, you will end up with 97 as the largest. And that is our answer.
• More than that, we derived a process that could be applied to not
just 10 numbers (which is not that complex), but to 100 numbers,
1000 numbers, or a billion numbers!
• This is called abstraction and generalization.
• Here, abstraction refers to treating the actual object of interest (10
numbers) as a series of numbers, and generalization refers to being
able to devise a process that is applicable to the abstracted quantity
(a series of numbers) and not just the specific objects (the given 10
numbers).
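The pairwise scan from Hands-On Example 1.1 can be sketched in Python. This is a minimal illustration rather than code from the slides; the function name `find_largest` is a hypothetical choice, and the sketch assumes the list contains at least one number.

```python
def find_largest(numbers):
    """Return the largest value by comparing one number at a time
    against the largest value seen so far."""
    largest = numbers[0]          # start with the first number
    for n in numbers[1:]:         # look at each remaining number in turn
        if n > largest:           # is it larger than the largest so far?
            largest = n           # yes: it becomes the new largest
    return largest


numbers = [7, 24, 62, 11, 4, 39, 42, 5, 97, 54]
print(find_largest(numbers))  # 97
```

The same function works unchanged on 100, 1000, or a billion numbers, which is the generalization the example describes. In practice, Python's built-in `max` does the same job.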
Finding the 3rd largest.
• So, let us step back and try to think of a systematic approach.
• A natural way to sort the books on a shelf (say, alphabetically by author)
would be just to scan the shelf and look for out-of-order pairs, for instance
Rowling, J. K., followed by Lee, Stan, and flip them around. Flip out-of-order
pairs, then continue your scan of the rest of the shelf, and start again at the
beginning of the shelf each time you reach the end, until you make a
complete pass without finding a single out-of-order pair on the entire
shelf.
• But depending on the size of your collection and how unordered the
books are at the beginning of the process, it will take a lot of time. It
is not a very efficient tactic.
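The shelf-scanning tactic described above is commonly known as bubble sort. Below is a minimal Python sketch of it; the function name `shelf_sort` and the sample author list are hypothetical, chosen only for illustration.

```python
def shelf_sort(books):
    """Sort by repeatedly flipping adjacent out-of-order pairs
    until a complete pass finds nothing to flip (bubble sort)."""
    shelf = list(books)           # work on a copy of the shelf
    passes = 0
    while True:
        swapped = False
        for i in range(len(shelf) - 1):
            if shelf[i] > shelf[i + 1]:                      # out-of-order pair
                shelf[i], shelf[i + 1] = shelf[i + 1], shelf[i]
                swapped = True
        passes += 1
        if not swapped:           # a full pass with no flips: the shelf is sorted
            return shelf, passes


# Hypothetical shelf, listed as "Last name, First name"
books = ["Rowling, J. K.", "Lee, Stan", "Tolkien, J. R. R.", "Austen, Jane"]
ordered, passes = shelf_sort(books)
print(ordered)
print("passes over the shelf:", passes)
```

Even this four-book shelf needs several passes. In the worst case the number of comparisons grows roughly with the square of the number of books, which is why the tactic becomes slow for large collections.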
Skills For Data Science
• One Twitter quip about data scientists captures their skill set particularly well:
“Data Scientist (n.): Person who is better at statistics than any software
engineer and better at software engineering than any statistician.”
• To become a data scientist, you should possess the following skills and attributes:
1. Willing to Experiment: Data scientists should be curious and creative problem solvers, capable of
defining intelligent hypotheses and experimenting with data to discover insights. Employers are seeking
applicants who can ask questions to define intelligent hypotheses and to explore the data utilizing basic
statistical methods and models. Example question: How many golf balls would fit in a school bus?
2. Proficiency in Mathematical Reasoning: While you don't need an advanced degree in mathematics or
statistics, a solid understanding of basic statistical methods and mathematical reasoning is essential.
3. Data Literacy: Data literacy involves the ability to extract meaningful information from datasets, making
data scientists invaluable in assessing data relevance and suitability for interpretation. They should be
able to perform data analysis and create effective data visualizations that convey valuable insights.
Skills For Data Science
Different types of data science jobs and their associated skills include:
• Data Analysts: Entry-level roles that focus on using pre-existing tools and applications for data
retrieval, wrangling, and visualization. Basic skills are needed to work with tools like MySQL
databases and Excel, and may involve performing analyses using tools like Google Analytics or
Tableau.
• Data Engineers: These roles are responsible for developing data management systems and
infrastructure to house and access large datasets. While data scientists with software engineering
backgrounds may excel in this position, it's more about data infrastructure than advanced
statistical or machine learning expertise.
Skills For Data Science
• Data-Driven Product Companies: In these companies, data or data analysis platforms are
the products. Ideal candidates often have formal backgrounds in mathematics, statistics,
or physics, as the focus is on producing data-driven products.
• Reasonably Sized Non-Data Companies: Many modern businesses are data-driven but
not entirely focused on data. Data scientists in these roles work as part of established
teams, performing data analysis, touching production code, and visualizing data. Skills in
handling big data tools (e.g., Hive or Pig) and working with real-world datasets are
essential.
Skills for Data Science
• Hands-On Example 1.2: Analyzing Data:
 For this example, we will use the dataset of average heights and weights for American women
available from OA 1.1.
 This file is in comma-separated values (CSV) format – something that we will revisit in the next
chapter. For now, go ahead and download it. Once downloaded, you can open this file in a
spreadsheet program such as Microsoft Excel or Google Sheets.
 For your reference, this data is also provided in Table 1.1. As you can see, the dataset contains a
sample of 15 observations. Let us consider what is present in the dataset. At first glance, it is
clear that the data is already sorted – both the height and weight numbers range from small to
large. That makes it easier to see the boundaries of this dataset – height ranges from 58 to 72
inches, and weight ranges from 115 to 164 pounds.
Skills for Data Science
• Hands-On Example 1.2: Analyzing Data:
 Next, let us consider averages. We can easily compute average height by adding up the numbers
in the “Height” column and dividing by 15 (because that is how many observations we have). That
yields a value of 65. In other words, we can conclude that the average height of an American
woman is 65 inches, at least according to these 15 observations.
 Similarly, we can compute the average weight – 136 pounds in this case. The dataset also reveals
that weight tends to increase as height increases. This may be clearer using a
visualization.
 If you know any kind of spreadsheet program (e.g., Microsoft Excel, Google Sheets), you can easily
generate a plot of values. Figure 1.4 provides an example. Look at the curve. As we move from left
to right (Height), the line increases in value (Weight).
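The averages can also be computed with a few lines of Python instead of a spreadsheet. Table 1.1 is not reproduced in these slides, so the values below are an assumption: they are taken to match R's built-in `women` dataset, which has the same 15 observations and the same ranges described above. The variable names are illustrative.

```python
# Assumed contents of Table 1.1 (matching R's built-in `women` dataset)
heights = list(range(58, 73))   # 58, 59, ..., 72 inches
weights = [115, 117, 120, 123, 126, 129, 132, 135,
           139, 142, 146, 150, 154, 159, 164]   # pounds

mean_height = sum(heights) / len(heights)
mean_weight = sum(weights) / len(weights)

print(mean_height)               # 65.0
print(round(mean_weight, 1))     # 136.7

# Check that weight rises with every one-inch step in height
always_increasing = all(b > a for a, b in zip(weights, weights[1:]))
print(always_increasing)         # True
```

Under this assumption the mean weight falls between 136 and 137 pounds, which the text reports as 136.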
Skills for Data Science
Hands-On Example 1.2: Analyzing Data:
Now, let us ask a question: On
average, how much increase can
we expect in weight with an
increase of one inch in height?
Skills for Data Science
• Hands-On Example 1.2: Analyzing Data:
A simple method is to compute the differences in height (72 − 58 = 14 inches) and weight
(164 − 115 = 49 pounds), then divide the weight difference by the height difference:
49/14 = 3.5. In other words, we see that, on average, one inch of height
difference corresponds to a difference of 3.5 pounds in weight.
 If you want to dig deeper, you may discover that the weight change with respect to the
height change is not that uniform.
On average, an increase of an inch in height results in an increase of less than 3 pounds
in weight for heights between 58 and 65 inches (remember that 65 inches is the average).
For values of height greater than 65 inches, weight increases more rapidly (mostly by 4
pounds per inch until 70 inches, and by 5 pounds per inch above 70 inches).
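The endpoint calculation can be written out as a short Python sketch. It uses only the four boundary values quoted in the text; the variable names are hypothetical.

```python
# Boundary values of the dataset, as stated in the text
min_height, max_height = 58, 72      # inches
min_weight, max_weight = 115, 164    # pounds

height_diff = max_height - min_height    # 14 inches
weight_diff = max_weight - min_weight    # 49 pounds

pounds_per_inch = weight_diff / height_diff
print(pounds_per_inch)  # 3.5
```

Because this estimate relies on just the two extreme observations, it hides the non-uniform change across the height range.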
Skills for Data Science
• Hands-On Example 1.2: Analyzing Data:
What would you expect the weight to be of an American woman who is 57 inches tall?
To answer this, we will have to extrapolate the data we have. We know from the previous
paragraph that in the lower range of height (less than the average of 65 inches), with
each inch of height change, weight changes by about 3 pounds.
 Given that we know for someone who is 58 inches in height, the corresponding weight is
115 pounds; if we deduct an inch from the height, we should deduct 3 pounds from the
weight. This gives us the answer (or at least our guess), 112 pounds.
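The extrapolation step can be captured as a small reusable function. This is a sketch under the text's own assumption of about 3 pounds per inch in the lower height range; the function name `extrapolate_weight` and its parameters are hypothetical.

```python
def extrapolate_weight(known_height, known_weight, pounds_per_inch, target_height):
    """Estimate weight at target_height by moving from a known
    (height, weight) point at a fixed rate of pounds per inch."""
    return known_weight + pounds_per_inch * (target_height - known_height)


# 58 inches -> 115 pounds is a known observation; assume 3 pounds per inch
print(extrapolate_weight(58, 115, 3, 57))  # 112
```

The same function can be reused for other heights by starting from the nearest known observation and choosing the rate that applies in that part of the range.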
• What would you expect the weight of someone who is 73 inches tall
to be?
• More than the answer, what is important is the process. Can you
explain that to someone? Can you document it? Can you repeat it for
the same problem but with different values, or for similar problems,
in the future?
• If the answer to these questions is “yes,” then you just practiced
some science. Yes, it is important for us not only to solve data-driven
problems, but to be able to explain, verify, and repeat that process.
Tools for Data Science
• Data scientists typically use programming and data processing tools to perform their
tasks. In this book, Python, R, and SQL are introduced as fundamental tools for data
science.
• There are no specific tools designed exclusively for data science. However, some tools
are better suited for data science tasks. If you already know other programming
languages or scientific data processing environments, you can use them for data science,
but Python and R are popular choices due to their simplicity and powerful capabilities for
data analysis and visualization.
Tools For Data Science
Here is an example of how Python and R can be much simpler than
other languages:
Suppose that we want to print the sentence “Hello World” using Java:
• Step 1: Write the code and save as HelloWorld.java.
public class HelloWorld {
    public static void main(String[] args) {
        System.out.println("Hello, World");
    }
}
• Step 2: Compile the code.
% javac HelloWorld.java
• Step 3: Run the program.
% java HelloWorld
Tools For Data Science
• In contrast, here is how you do the same in Python:
• Step 1: Write the code and save as hello.py
print("Hello, World")
• Step 2: Run the program.
% python hello.py
Python is a scripting language. This means that programs written in Python do not need to be
compiled as a whole, as you would do with a program in C or Java; instead, a Python
program runs line by line. The language (its syntax and structure) also
offers a gentle learning curve for beginners while giving very powerful tools to
advanced programmers.
Tools For Data Science
• If you want to accomplish the same in R, you type the same
print("Hello, World")
in the R console.
• Both Python and R offer a very easy introduction to programming, and even if you have
never done any programming before, it is possible to start solving data problems from
day 1 of using either of these.
• Both of them also offer plenty of packages that you can import or call to
accomplish more complex tasks, such as machine learning.
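As a small illustration of importing a package, here is a sketch that uses Python's standard-library `statistics` module. The choice of module is an assumption made for illustration, and the numbers are simply reused from Hands-On Example 1.1.

```python
import statistics  # part of Python's standard library, no installation needed

numbers = [7, 24, 62, 11, 4, 39, 42, 5, 97, 54]

# Ready-made functions replace hand-written loops
print(statistics.mean(numbers))    # 34.5
print(statistics.median(numbers))  # 31.5
```

Larger tasks such as machine learning follow the same pattern: import a package, then call the functions it provides.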
Issues of Ethics, Bias, and Privacy in Data
Science
• Data Science Limitations: Data science is not a panacea, and it cannot solve all
societal and global issues.
• Ethical and Privacy Concerns: Data science and data analysis using statistical-computational techniques come with serious issues related to ethics and privacy.
• Data Collection and Origin: Many ethical and privacy concerns are related to the
origin of the data. Understanding how, where, and why data was collected, who
collected it, and the intended use is crucial. Questions arise about whether data
subjects (e.g., individuals) were aware of data collection and its purposes.
Issues of Ethics, Bias, and Privacy in Data
Science
• Data Misuse and Profit: Data misuse can occur when data collectors assume that data
availability grants them the right to use it without consent. For instance, the Cambridge
Analytica case revealed that data analytics firms could obtain Facebook users' data
without their knowledge or consent, using it for political campaigning.
• Lack of Awareness: Often, people are unaware of data collection practices by tech
companies, as services appear "free." The saying "if you're not paying for it, you are the
product" underscores this issue. Major tech companies place significant value on each
user, highlighting the profit potential from user data.
Issues of Ethics, Bias, and Privacy in Data
Science
• Data Exposure and Harm: Data about users have been exposed or shared
intentionally or unintentionally in ways that can harm users.
• Inherent Bias in Data: Even ethically collected data can be biased. Data scientists
must be cautious about inherent bias in data, which can affect the analysis and
insights without clear notice.
• Efforts to Address Issues: Many data and technology companies are striving to
address these ethical, privacy, and bias issues, although success in fully resolving
them has been limited.
• Continuous Efforts: While complete elimination of biases and prejudices might
not be feasible, continuous efforts to mitigate these concerns are essential. As
you progress in data science, remain mindful of these ethical, privacy, and bias
issues.
References
• A Hands-On Introduction to Data Science, By Chirag Shah.