Uploaded by blancojenn

SQL

Structured Query Language (SQL)
Visualizing distributions
Three ways of getting insights [can be used together]
1. Calculating summary statistics (mean, median, standard deviation)
2. Running models (linear and logistic regression)
3. Drawing plots (scatter, bar, histogram)
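As a small illustration of the first approach (summary statistics), here is a sketch using Python's built-in statistics module. The heights list is hypothetical data made up for this example; it does not come from the notes.

```python
import statistics

# Hypothetical sample of heights in centimetres (illustrative only).
heights = [150, 160, 165, 170, 175, 180, 190]

mean_height = statistics.mean(heights)      # arithmetic average
median_height = statistics.median(heights)  # middle value once sorted
std_height = statistics.stdev(heights)      # sample standard deviation

print(mean_height, median_height, round(std_height, 2))
```

The mean and median agree here because the made-up sample is symmetric; on skewed data they would differ.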
Continuous and categorical variables
- Continuous: usually numbers (heights, temperatures, revenues)
- Categorical: usually text (eye colors, countries, industries)
- Can be either (age is continuous, but age group is categorical)
When should you use a histogram?
1. If you have a single continuous variable
2. You want to ask questions about the shape of its distribution.
Modality: how many peaks?
Skewness: is it symmetric?
Kurtosis: how many extreme values?
Box Plots
1. When you have a continuous variable, split by a categorical variable
2. When you want to compare the distributions of the continuous variable for each
category.
Visualizing two variables
When should you use a scatter plot?
1. You have two continuous variables.
2. You want to answer questions about the relationship between the two variables.
Correlation is a measure of how well you can draw a straight line through the points.
Adding a straight line to a scatter plot is a good way to see whether the relationship between
X and Y is linear. Sometimes a straight line is a poor fit, in which case a curve can be more
suitable and informative.
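To make the point about straight lines concrete, here is a minimal sketch that computes the Pearson correlation by hand for two hypothetical datasets: one that lies exactly on a line and one that follows a U-shaped curve. The function name pearson_correlation and the data are assumptions made for this illustration.

```python
import math

def pearson_correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

xs = [1, 2, 3, 4, 5]
linear = [2 * x + 1 for x in xs]      # points exactly on a straight line
curved = [(x - 3) ** 2 for x in xs]   # symmetric U-shape: 4, 1, 0, 1, 4

r_linear = pearson_correlation(xs, linear)
r_curved = pearson_correlation(xs, curved)
print(r_linear, r_curved)
```

The curved data has a clear relationship between x and y, yet its correlation is zero, which is why a curve can be more informative than a straight line.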
When should you use a line plot?
1. You have two continuous variables.
2. You want to answer questions about their relationship.
3. Consecutive observations are connected somehow.
Usually, but not always, the X-axis is dates or times. It is often useful to compare line plots
to trend lines generated from a linear regression.
A time X-axis doesn’t always imply a line plot. Line plots need consecutive data values to be
conceptually connected.
Interpreting line plots
Line plots are excellent for comparing two continuous variables, where consecutive
observations are connected somehow. A common type of line plot is to have dates or times on
the x-axis, and a numeric quantity on the y-axis. In this case, “consecutive observations”
means values on successive dates, like today and tomorrow. By drawing multiple lines on the
same plot, you can compare values.
Logarithmic scales for line plots
If you have a dataset where the values span several orders of magnitude, it can be easier to
view them on a logarithmic scale.
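A rough sketch of why a logarithmic scale helps, using hypothetical values that span six orders of magnitude. On a linear axis the small values would be squashed together near zero; after taking log base 10 they are evenly spaced.

```python
import math

# Hypothetical values spanning several orders of magnitude.
values = [1, 100, 10_000, 1_000_000]

# Position of each value on a base-10 logarithmic axis.
log_values = [math.log10(v) for v in values]

print(log_values)
```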
Line plots without dates on the x-axis
Although dates and times are the most common type of variable for the x-axis in a line plot,
other types of variables are possible.
Bar plots (close relative to box plots)
When should you use a bar plot? Most common cases:
1. You have a categorical variable.
2. You want counts or percentages for each category.
Occasionally:
1. You want another numeric score for each category and need to include zero in the plot.
Interpreting bar plots
Bar plots are a great way to see counts of each category in a categorical variable.
Interpreting stacked bar plots
If you care about percentages rather than counts, then stacked bar plots are often a good
choice of plot.
Dot Plots
When should you use a dot plot?
1. You have a categorical variable.
2. You want to display a numeric score for each category on a log scale, or
3. You want to display multiple numeric scores for each category.
Interpreting dot plots
Dot plots are similar to bar plots in that they show a numeric metric for each category of a
categorical variable. They have two advantages over bar plots: you can use a log scale for the
metric, and you can display more than one metric per category.
Higher dimensions
X and Y are not the only dimensions: Color, size, transparency, and shape.
Other dimensions for line plots: Color, thickness, transparency, line type (solid, dashes, dots)
Another dimension for scatter plots
If you have a scatter plot but want to distinguish the points in some way based on another
variable, then you have a few options. There are other options to stick to X-Y axes and still
visualize the third dimension. You can change the color, size, transparency, or shape of the
points, or split the plot into multiple panels.
Using color
Color spaces: Red-Green-Blue (RGB): how colors are usually specified in programming.
Color spaces: Cyan-Magenta-Yellow-Black (CMYK): how graphic designers usually specify colors.
Type         Purpose
qualitative  Distinguish unordered categories
sequential   Show ordering
diverging    Show above or below a midpoint
Plotting many variables at once
When should you use a pair plot?
- You have up to ten variables (either continuous, categorical or a mix).
- You want to see the distribution for each variable.
- You want to see the relationship between each pair of variables.
When should you use a correlation heatmap?
- You have lots of continuous variables.
- You want a simple overview of how each pair of variables is related.
When should you use a parallel coordinates plot?
- You have lots of continuous variables.
- You want to find patterns across these variables, or
- You want to visualize clusters of observations.
Interpreting correlation heatmaps
If you want to find the relationship between many pairs of numeric variables, you can use a
close relative of the pair plot, namely the correlation heatmap. It takes the correlation scores
you saw from the pair plot, but rather than giving you lots of numbers to look at, it displays
them using colors.
Polar coordinates
When should you use polar coordinates?
- Almost never.
- Only if you have a variable that is naturally circular (time of day, compass direction).
Pie plots
Pie plots (sometimes called pie charts) are extremely popular, but often difficult to
interpret. They are just bar plots converted into polar coordinates, and humans are
generally worse at perceiving angles accurately compared to lengths.
Rose plots
One good use case for polar coordinates is when the data is naturally circular, for
example, when it is a compass direction. If you plot a histogram with polar coordinates,
you get a rose plot.
An example is a rose plot of wind direction data from a meteorological mast, with
measurements taken at 10-minute intervals over an eight-month period. Knowing the
predominant wind direction is important for weather modeling and for determining where
to site wind turbines.
Axes of evil
Bar plot axes
When we look at a bar plot, we use the relative lengths of each bar to help interpret
what is happening. If you don't include zero on the axis used for bar lengths, then the
relative lengths of bars are distorted, and it is easy to be misled.
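The distortion can be put into numbers. This sketch compares the ratio of drawn bar lengths when the axis starts at zero with the ratio when it starts at 90. The function name bar_length_ratio and the two sales figures are hypothetical.

```python
def bar_length_ratio(a, b, axis_start=0.0):
    """Ratio of the drawn lengths of bars b and a when the axis starts at axis_start."""
    return (b - axis_start) / (a - axis_start)

# Hypothetical values for two categories.
sales_a, sales_b = 95, 100

honest = bar_length_ratio(sales_a, sales_b)         # axis includes zero
truncated = bar_length_ratio(sales_a, sales_b, 90)  # axis starts at 90

print(round(honest, 3), truncated)
```

With zero included, the second bar is about 5% longer, matching the data. With the axis cut at 90, it is drawn twice as long.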
Dual axes
One popular but terrible idea is to draw a scatter plot or line plot with two different
y-axes. This typically happens when you have two metrics with different units, and
different scales that you want to plot against a common x-axis. The problem is that by
changing the relationship between the two axes, you can tell almost any story that you
want with the data.
Sensory Overload
Measures of a good visualization
- How many interesting insights can your reader get from the plot?
- How quickly can they get those insights?
Chart junk
Any element of the plot that distracts from the reader getting insight.
- Pictures
- Skeuomorphism: reflections, shadows, etc.
- Extra dimensions
- Ostentatious colors or lines
Chart junk
Chart junk is anything in a plot that distracts from getting insight. That is, removing it
would make the plot easier to understand.
Multiple plots
Sometimes a dataset is so complex that it takes several plots to explore properly.
Rather than trying to find a single, perfect plot that captures all the insight, you can
combine several plots into a report or – if you want to have fun – a dashboard.
For complex datasets, it is often best to draw lots of simpler plots that each answer a
couple of questions, rather than trying to draw a single plot that answers everything.
Summary
1. Histograms: show a distribution. In Chapter 1 you learned that histograms are excellent
for showing the distribution of a continuous variable.
2. Box plots: show lots of distributions. Box plots can compactly show the distributions of
lots of continuous variables.
3. Scatter plots: compare two numeric variables. In Chapter 2 you saw that scatter plots can
show the relationship between two continuous variables.
4. Line plots: show trends over time. Line plots are great for showing trends over time.
5. Bar plots: show counts by category. Bar plots show counts or proportions split by
categories.
6. Dot plots: show log scale metrics by category. Dot plots do the same as bar plots, but
allow for logarithmic scales and showing multiple metrics at once.
7. Extra dimensions. In Chapter 3 you saw that using colors or multiple panels is often the
best way to add a third dimension to your plot, since 3D plots are hard to interpret.
8. Three types of color scale: qualitative, sequential, and diverging.
9. Pair plots: compare many variables. For cases where you need to analyze many variables
at once, you saw three types of plots. Pair plots show relationships between each pair of
variables.
10. Correlation heatmaps: show related variables.
11. Parallel coordinates plots: find patterns across many variables.
12. Rose plots: show a cyclical distribution. In Chapter 4 you learned that plots with polar
coordinates are usually a bad idea, but they have niche uses when data is cyclical, like a
time of day.
13. Dual axes are bad. Using dual axes is almost always misleading.
14. Eliminate chart junk. Minimalism is a good idea: eliminate anything from the plot that
distracts from interpretation.
Introduction to Statistics
What is statistics?
Statistics is the practice and study of collecting and analyzing data.
Two main branches of statistics:
- Descriptive/summary statistics: describing or summarizing our data.
- Inferential statistics: collecting a sample of data and applying the results to the
population that the sample represents.
Statistics is everywhere!
-Sports statistics, personal finances
What can statistics do?
- It allows us to answer practical questions: What is the average salary in the USA?
How many customer inquiries is a company likely to receive per week?
- It has applications across society: developing safer products such as cars or
airplanes, and helping governments understand the needs of a population.
- It validates scientific breakthroughs, such as Covid-19 vaccines.
Limitations of statistics
-Statistics requires specific, measurable questions: Is rock music more popular than
jazz? On average, do women live longer than men?
- We can’t use statistics to find out why relationships exist.
Types of data: numeric
- Continuous data: stock prices
- Interval/count data: how many cups of coffee people drink per day
Visualizing numeric data
Types of data: categorical
- Nominal data: eye color
- Ordinal data: how strongly do you agree that basketball is the best sport?
Visualizing categorical data
Descriptive / Summary statistics
- Describe or summarize data
Inferential statistics
- Use a sample to draw conclusions about a population
- Example question: How many people purchase clothing following social media advertising?
Selecting columns
SQL is a language for interacting with data stored in something called a relational database.
A relational database is a collection of tables. A table is a set of rows and columns, like a
spreadsheet, which represents exactly one type of entity. For example, a table might represent
employees in a company or purchases made, but not both.
Each row, or record, of a table contains information about a single entity. For example, in a
table representing employees, each row represents a single person. Each column, or field, of a
table contains a single attribute for all rows in the table. For example, in a table representing
employees, we might have a column containing first and last names for all employees.
id  name     age  nationality
1   Jessica  22   Ireland
2   Gabriel  48   France
3   Laura    36   USA
Selecting single columns
SQL can be used to create and modify databases; the focus of this course will be querying
databases. A query is a request for data from a database table (or combination of tables).
Querying is an essential skill for a data scientist since the data you need for your analyses will
often live in databases.
In SQL, you can select data from a table using a SELECT statement. For example, the following
query selects the name column from the people table: SELECT name FROM people;
In this query, SELECT and FROM are called keywords. In SQL, keywords are not case-sensitive,
which means you can write the same query as: select name from people;
That said, it’s good practice to make SQL keywords uppercase to distinguish them from other
parts of your query, like column and table names.
It’s also good practice to include a semicolon at the end of your query. This tells SQL where the
end of your query is!
Selecting multiple columns
- To select multiple columns from a table, simply separate the column names with
commas!
For example, this query selects two columns, name and birthdate, from the people table:
SELECT name, birthdate FROM people;
- To select all columns from a table there’s a handy shortcut: SELECT * FROM people;
If you want to return a certain number of results, you can use the LIMIT keyword to limit
the number of rows returned: SELECT * FROM people LIMIT 10;
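The SELECT variants above can be tried end to end. This is a sketch only: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the PostgreSQL database the notes assume, and it loads the three example rows from the table shown earlier.

```python
import sqlite3

# In-memory SQLite database standing in for the course database (an assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE people (id INTEGER, name TEXT, age INTEGER, nationality TEXT)"
)
conn.executemany(
    "INSERT INTO people VALUES (?, ?, ?, ?)",
    [(1, "Jessica", 22, "Ireland"), (2, "Gabriel", 48, "France"), (3, "Laura", 36, "USA")],
)

names = conn.execute("SELECT name FROM people;").fetchall()            # one column
name_age = conn.execute("SELECT name, age FROM people;").fetchall()    # two columns
first_two = conn.execute("SELECT * FROM people LIMIT 2;").fetchall()   # all columns, two rows

print(names)
print(name_age)
print(first_two)
```

Each result row comes back as a tuple with one element per selected column.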
Select Distinct
Often your results will include many duplicate values. If you want to select all the unique values
from a column, you can use the DISTINCT keyword.
This might be useful if, for example, you’re interested in knowing which languages are
represented in the films table: SELECT DISTINCT language FROM films;
Learning to COUNT
What if you want to count the number of employees in your employees table? The COUNT()
function lets you do this by returning the number of rows in one or more columns.
For example, this code gives the number of rows in the people table: SELECT COUNT(*) FROM
people;
If you want to count the number of non-missing values in a particular column, you can call
COUNT() on just that column. For example, to count the number of birth dates present in the
people table: SELECT COUNT(birthdate) FROM people;
It’s also common to combine COUNT() with DISTINCT to count the number of distinct values in a
column. For example, this query counts the number of distinct birth dates contained in the
people table: SELECT COUNT(DISTINCT birthdate) FROM people;
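The three forms of COUNT() give different answers as soon as a column has missing or repeated values. The sketch below runs them with Python's sqlite3 module (a stand-in for PostgreSQL) on four hypothetical people, one of whom has no birth date.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, birthdate TEXT)")
# Hypothetical rows: two people share a birth date and one birth date is missing.
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ann", "1980-01-01"), ("Bob", "1980-01-01"), ("Cy", "1975-06-30"), ("Di", None)],
)

total_rows = conn.execute("SELECT COUNT(*) FROM people;").fetchone()[0]
non_missing = conn.execute("SELECT COUNT(birthdate) FROM people;").fetchone()[0]
distinct_dates = conn.execute(
    "SELECT COUNT(DISTINCT birthdate) FROM people;"
).fetchone()[0]

print(total_rows, non_missing, distinct_dates)
```

COUNT(*) counts every row, COUNT(birthdate) skips the missing value, and COUNT(DISTINCT birthdate) also collapses the repeated date.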
Filtering rows
In SQL, the WHERE keyword allows you to filter based on both text and numeric values in a
table. There are a few different comparison operators you can use:
= equal
<> not equal
< less than
> greater than
<= less than or equal to
>= greater than or equal to
For example, you can filter text records such as title. The following code returns all films with
the title 'Metropolis': SELECT title FROM films WHERE title = 'Metropolis';
*Notice that the WHERE clause always comes after the FROM clause!
Simple filtering of numeric values
The WHERE clause can also be used to filter numeric records, such as years or ages. For
example, the following query selects all details for films with a budget over ten thousand
dollars: SELECT * FROM films WHERE budget > 10000;
Simple filtering of text
The WHERE clause can also be used to filter text results, such as names or countries. For
example, this query gets the titles of all films which were filmed in China: SELECT title FROM
films WHERE country = 'China';
Important: In PostgreSQL (the version of SQL we’re using), you must use single quotes around
text values in WHERE.
WHERE AND
Often, you’ll want to select data based on multiple conditions. You can build up your WHERE
queries by combining multiple conditions with the AND keyword. For example, SELECT title
FROM films WHERE release_year > 1994 AND release_year < 2000;
gives you the titles of films released after 1994 and before 2000 (that is, 1995 to 1999).
*Note that you need to specify the column name separately for every AND condition. You can
add as many AND conditions as you need!
WHERE AND OR
What if you want to select rows based on multiple conditions where some but not all of the
conditions need to be met? For this, SQL has the OR operator.
For example, the following returns all films released in either 1994 or 2000:
SELECT title FROM films WHERE release_year = 1994 OR release_year = 2000;
Note that you need to specify the column for every OR condition, so the following is invalid:
SELECT title FROM films WHERE release_year = 1994 OR 2000;
When combining AND and OR, be sure to enclose the individual clauses in parentheses, like so:
SELECT title FROM films WHERE (release_year = 1994 OR release_year = 1995) AND
(certification = 'PG' OR certification = 'R');
Otherwise, due to SQL’s precedence rules you may not get the results you’re expecting!
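To see what the precedence rules do in practice, here is a sketch that runs the query with and without parentheses. It uses Python's sqlite3 module as a stand-in for PostgreSQL (both evaluate AND before OR), and the five films are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, release_year INTEGER, certification TEXT)")
# Hypothetical films chosen so the two queries disagree.
conn.executemany(
    "INSERT INTO films VALUES (?, ?, ?)",
    [("A", 1994, "PG"), ("B", 1994, "G"), ("C", 1995, "R"),
     ("D", 1995, "G"), ("E", 1996, "PG")],
)

with_parens = conn.execute(
    "SELECT title FROM films "
    "WHERE (release_year = 1994 OR release_year = 1995) "
    "AND (certification = 'PG' OR certification = 'R');"
).fetchall()

# Without parentheses AND binds tighter than OR, so this is read as:
# year = 1994 OR (year = 1995 AND cert = 'PG') OR cert = 'R'
without_parens = conn.execute(
    "SELECT title FROM films "
    "WHERE release_year = 1994 OR release_year = 1995 "
    "AND certification = 'PG' OR certification = 'R';"
).fetchall()

titles_with = sorted(row[0] for row in with_parens)
titles_without = sorted(row[0] for row in without_parens)
print(titles_with, titles_without)
```

Film B (1994, certified G) slips into the unparenthesized result even though it has neither a PG nor an R certification.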
BETWEEN
As you’ve learned, you can use the following query to get titles of all films released in and
between 1994 and 2000:
SELECT title FROM films WHERE release_year >= 1994 AND release_year <= 2000;
Checking for ranges like this is very common, so in SQL the BETWEEN keyword provides a useful
shorthand for filtering values within a specified range. This query is equivalent to the one
above:
SELECT title FROM films WHERE release_year BETWEEN 1994 AND 2000;
It’s important to remember that BETWEEN is inclusive, meaning the beginning and end values
are included in the results!
BETWEEN (2)
Similar to the WHERE clause, the BETWEEN clause can be used with multiple AND and OR
operators, so you can build up your queries and make them even more powerful!
For example, suppose we have a table called kids. We can get the names of all kids between the
ages of 2 and 12 from the United States:
SELECT name FROM kids WHERE age BETWEEN 2 AND 12 AND nationality = 'USA';
Have a go at using BETWEEN with AND on the films data: get the title and release year of all
Spanish language films released between 1990 and 2000 (inclusive) with budgets over $100
million.
WHERE IN
As you’ve seen, WHERE is very useful for filtering results. However, if you want to filter based
on many conditions, WHERE can get unwieldy. For example:
SELECT name FROM kids WHERE age = 2 OR age = 4 OR age = 6 OR age = 8 OR age = 10;
Enter the IN operator! The IN operator allows you to specify multiple values in a WHERE clause,
making it easier and quicker to specify multiple OR conditions! Neat, right?
So, the above example would become simply:
SELECT name FROM kids WHERE age IN (2, 4, 6, 8, 10);
Try using the IN operator yourself!
Get the title and release year of all films released in 1990 or 2000 that were longer than two
hours. Remember, duration is in minutes!
SELECT title, release_year FROM films WHERE (release_year IN (1990, 2000)) AND (duration > 120);
Get the title and language of all films which were in English, Spanish, or French.
SELECT title, language FROM films WHERE language IN ('English', 'Spanish', 'French');
Get the title and certification of all films with an NC-17 or R certification.
SELECT title, certification FROM films WHERE certification IN ('NC-17', 'R');
Introduction to NULL and IS NULL
In SQL, NULL represents a missing or unknown value. You can check for NULL values using the
expression IS NULL. For example, to count the number of missing birth dates in the people
table: SELECT COUNT(*) FROM people WHERE birthdate IS NULL;
As you can see, IS NULL is useful when combined with WHERE to figure out what data you’re
missing.
Sometimes, you’ll want to filter out missing values so you only get results which are not NULL.
To do this, you can use the IS NOT NULL operator.
For example, this query gives the names of all people whose birth dates are not missing in the
people table.
SELECT name FROM people WHERE birthdate IS NOT NULL;
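One reason IS NULL exists as its own expression is that an ordinary comparison with NULL never matches. The sketch below shows this with Python's sqlite3 module (a stand-in for PostgreSQL, which behaves the same way here) on two hypothetical people.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (name TEXT, deathdate TEXT)")
# Hypothetical rows: Ann has no death date recorded.
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("Ann", None), ("Bob", "2001-05-05")],
)

# Comparing with = NULL yields NULL (unknown) for every row, so nothing is returned.
equals_null = conn.execute("SELECT name FROM people WHERE deathdate = NULL;").fetchall()

# IS NULL and IS NOT NULL test for missing values directly.
is_null = conn.execute("SELECT name FROM people WHERE deathdate IS NULL;").fetchall()
is_not_null = conn.execute(
    "SELECT name FROM people WHERE deathdate IS NOT NULL;"
).fetchall()

print(equals_null, is_null, is_not_null)
```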
NULL and IS NULL
Get the names of people who are still alive, i.e. whose death date is missing.
SELECT name FROM people WHERE deathdate IS NULL;
Get the title of every film which doesn't have a budget associated with it.
SELECT title FROM films WHERE budget IS NULL;
Get the number of films which don't have a language associated with them.
SELECT COUNT(*) FROM films WHERE language IS NULL;
LIKE and NOT LIKE
As you’ve seen, the WHERE clause can be used to filter text data. However, so far, you’ve only
been able to filter by specifying the exact text you’re interested in. In the real world, often
you’ll want to search for a pattern rather than a specific text string.
In SQL, the LIKE operator can be used in a WHERE clause to search for a pattern in a column. To
accomplish this, you use something called a wildcard as a placeholder for some other values.
There are two wildcards you can use with LIKE:
The % wildcard will match zero, one, or many characters in text. For example, the following
query matches companies like 'Data', 'DataC', 'DataCamp', 'DataMind', and so on:
SELECT name FROM companies WHERE name LIKE 'Data%';
The _ wildcard will match a single character. For example, the following query matches
companies like 'DataCamp', 'DataComp', and so on:
SELECT name FROM companies WHERE name LIKE 'DataC_mp';
You can also use the NOT LIKE operator to find records that don’t match the pattern you
specify.
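The two wildcards and NOT LIKE can be checked against a small table. This is a sketch using Python's sqlite3 module rather than PostgreSQL; note that SQLite's LIKE ignores ASCII case by default while PostgreSQL's is case-sensitive, which does not matter for the hypothetical company names used here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE companies (name TEXT)")
# Hypothetical company names based on the examples in the notes.
conn.executemany(
    "INSERT INTO companies VALUES (?)",
    [("Data",), ("DataC",), ("DataCamp",), ("DataComp",), ("DataMind",), ("Other",)],
)

def names_matching(where_clause):
    """Return the sorted company names selected by the given WHERE clause."""
    rows = conn.execute(f"SELECT name FROM companies WHERE {where_clause};").fetchall()
    return sorted(row[0] for row in rows)

starts_with_data = names_matching("name LIKE 'Data%'")    # % = zero or more characters
one_char_gap = names_matching("name LIKE 'DataC_mp'")     # _ = exactly one character
not_data = names_matching("name NOT LIKE 'Data%'")

print(starts_with_data, one_char_gap, not_data)
```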
Get the names of all people whose names begin with 'B'. The pattern you need is 'B%'.
SELECT name FROM people WHERE name LIKE 'B%';
Get the names of people whose names have 'r' as the second letter. The pattern you
need is '_r%'.
SELECT name FROM people WHERE name LIKE '_r%';
Get the names of people whose names don't start with A. The pattern you need is 'A%'.
SELECT name FROM people WHERE name NOT LIKE 'A%';
Aggregate functions
Often, you will want to perform some calculation on the data in a database. SQL provides a few
functions, called aggregate functions, to help you out with this. For example,
SELECT AVG(budget) FROM films;
gives you the average value from the budget column of the films table. Similarly, the MAX()
function returns the highest budget:
SELECT MAX(budget) FROM films;
The SUM() function returns the result of adding up the numeric values in a column:
SELECT SUM(budget) FROM films;
You can probably guess what the MIN() function does!
Combining aggregate functions with WHERE
Aggregate functions can be combined with the WHERE clause to gain further insights from your
data. For example, to get the total budget of movies made in the year 2010 or later:
SELECT SUM(budget) FROM films WHERE release_year >= 2010;
A note on arithmetic
In addition to using aggregate functions, you can perform basic arithmetic with symbols like +, -,
* and /. So, for example, this gives a result of 12: SELECT (4 * 3);
However, the following gives a result of 1: SELECT (4 / 3);
SQL assumes that if you divide an integer by an integer, you want to get an integer back. So be
careful when dividing! If you want more precision when dividing, you can add decimal places to
your numbers. For example, SELECT (4.0 / 3.0) AS result;
Gives you the result you would expect: 1.333.
It’s AS simple AS aliasing
You may have noticed in the first exercises of this chapter that the column name of your result
was just the name of the functions you used. For example, SELECT MAX(budget) FROM films;
Gives you a result with one column, named max. But what if you use two functions like this?
SELECT MAX(budget), MAX(duration) FROM films;
Well, then you’d have two columns named max, which isn’t very useful!
To avoid situations like this, SQL allows you to do something called aliasing. Aliasing simply
means you assign a temporary name to something. To alias, you use the AS keyword, which
you’ve already seen earlier in this course. For example, in the above example we could use
aliases to make the result clearer:
SELECT MAX(budget) AS max_budget, MAX(duration) AS max_duration FROM films;
Aliases are helpful for making results more readable!
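Here is a sketch of how aliases show up in a result, using Python's sqlite3 module as a stand-in for PostgreSQL. The two films and their values are hypothetical. The cursor's description attribute exposes the column names of the result.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (budget INTEGER, duration INTEGER)")
# Hypothetical budgets and durations.
conn.executemany("INSERT INTO films VALUES (?, ?)", [(100, 90), (250, 120)])

cursor = conn.execute(
    "SELECT MAX(budget) AS max_budget, MAX(duration) AS max_duration FROM films;"
)
column_names = [column[0] for column in cursor.description]  # names given by AS
row = cursor.fetchone()

print(column_names, row)
```

With the aliases in place, each column of the result has a distinct, descriptive name.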
Get the title and net profit (the amount a film grossed, minus its budget) for all films.
Alias the net profit as net_profit.
SELECT title, (gross - budget) AS net_profit FROM films;
Get the title and duration in hours for all films. The duration is in minutes, so you'll need
to divide by 60.0 to get the duration in hours. Alias the duration in hours
as duration_hours.
SELECT title, (duration/60.0) AS duration_hours FROM films;
Get the average duration in hours for all films, aliased as avg_duration_hours.
SELECT AVG(duration) / 60.0 AS avg_duration_hours FROM films;
Even more aliasing
SQL assumes that if you divide an integer by an integer, you want to get an integer back.
This means that the following will erroneously result in 400.0 :
SELECT 45 / 10 * 100.0;
This is because 45 / 10 evaluates to an integer (4), and not a decimal number like we would
expect.
So, when you’re dividing make sure at least one of your numbers has a decimal place:
SELECT 45 * 100.0 / 10;
The above now gives the correct answer of 450.0 since the numerator (45 * 100.0) of the
division is now a decimal!
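The integer-division behaviour described above can be reproduced directly. This sketch uses Python's sqlite3 module, which, like PostgreSQL, returns an integer when both operands of / are integers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

def evaluate(expression):
    """Run SELECT <expression>; and return the single resulting value."""
    return conn.execute(f"SELECT {expression};").fetchone()[0]

int_division = evaluate("4 / 3")              # integer / integer is truncated
decimal_division = evaluate("4.0 / 3.0")      # decimals keep the fraction
wrong_percent = evaluate("45 / 10 * 100.0")   # 45 / 10 becomes 4 first
right_percent = evaluate("45 * 100.0 / 10")   # the numerator is already a decimal

print(int_division, decimal_division, wrong_percent, right_percent)
```

Moving the decimal operand into the numerator before dividing is what prevents the truncation.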
Get the percentage of people who are no longer alive. Alias the result
as percentage_dead. Remember to use 100.0 and not 100!
SELECT COUNT(deathdate) * 100.0 / COUNT(*) AS percentage_dead FROM people;
Get the number of years between the newest film and oldest film. Alias the result
as difference.
SELECT MAX(release_year) - MIN(release_year) AS difference FROM films;
Get the number of decades the films table covers. Alias the result as number_of_decades.
The top half of your fraction should be enclosed in parentheses.
SELECT (MAX(release_year) - MIN(release_year)) / 10.0 AS number_of_decades FROM films;
ORDER BY
In this chapter you’ll learn how to sort and group your results to gain further insight. Let’s go!
In SQL, the ORDER BY keyword is used to sort results in ascending or descending order
according to the values of one or more columns.
By default ORDER BY will sort in ascending order. If you want to sort the results in descending
order, you can use the DESC keyword. For example,
SELECT title FROM films ORDER BY release_year DESC;
Gives you the titles of films sorted by release year, from newest to oldest.
Note: An aggregate function that appears in SELECT can also be used in ORDER BY.
Sorting single columns
Now that you understand how ORDER BY works, give these exercises a go!
Get the names of people from the people table, sorted alphabetically.
SELECT name FROM people ORDER BY name;
Sorting single columns (2)
Get the title of films released in 2000 or 2012, in the order they were released.
SELECT title FROM films WHERE (release_year = 2000 OR release_year = 2012) ORDER BY release_year;
Get all details for all films except those released in 2015 and order them by duration.
SELECT * FROM films WHERE release_year <> 2015 ORDER BY duration;
Get the title and gross earnings for movies which begin with the letter 'M' and order the
results alphabetically.
SELECT title, gross FROM films WHERE title LIKE 'M%' ORDER BY title;
Sorting single columns (DESC)
To order results in descending order, you can put the keyword DESC after the column in ORDER BY. For
example, to get all the names in the people table, in reverse alphabetical order:
SELECT name FROM people ORDER BY name DESC;
Sorting multiple columns
ORDER BY can also be used to sort on multiple columns. It will sort by the first column specified,
then sort by the next, then the next, and so on. For example,
SELECT birthdate, name FROM people ORDER BY birthdate, name;
Sorts on birth dates first (oldest to newest) and then sorts on the names in alphabetical order.
The order of columns is important!
Try using ORDER BY to sort multiple columns! Remember, to specify multiple columns you
separate the column names with a comma.
Get the birth date and name of people in the people table, in order of when they were
born and alphabetically by name.
SELECT birthdate, name FROM people ORDER BY birthdate, name;
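To illustrate why the order of columns matters, this sketch sorts the same three hypothetical people in two ways. It uses Python's sqlite3 module as a stand-in for PostgreSQL.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (birthdate TEXT, name TEXT)")
# Hypothetical rows: Zoe and Amy share a birth date.
conn.executemany(
    "INSERT INTO people VALUES (?, ?)",
    [("1990-05-01", "Zoe"), ("1985-03-12", "Max"), ("1990-05-01", "Amy")],
)

# Birth date first; name only breaks ties between equal birth dates.
by_date_then_name = conn.execute(
    "SELECT birthdate, name FROM people ORDER BY birthdate, name;"
).fetchall()

# Name first; birth date would only break ties between equal names.
by_name_then_date = conn.execute(
    "SELECT birthdate, name FROM people ORDER BY name, birthdate;"
).fetchall()

print(by_date_then_name)
print(by_name_then_date)
```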
GROUP BY
Now you know how to sort results! Often, you’ll need to aggregate results. For example, you
might want to count the number of male and female employees in your company. Here, what
you want is to group all the males together and count them, and group all the females together
and count them. In SQL, GROUP BY allows you to group a result by one or more columns, like
so:
SELECT sex, COUNT(*) FROM employees GROUP BY sex;
This might give, for example:
sex     count
male    15
female  19
Commonly, GROUP BY is used with aggregate functions like COUNT() or MAX(). Note that
GROUP BY always goes after the FROM clause!
GROUP BY practice
As you’ve just seen, combining aggregate functions with GROUP BY can yield some powerful
results!
A word of warning: SQL will return an error if you try to SELECT a field that is not in your
GROUP BY clause without using it to calculate some kind of value about the entire group.
Note that you can combine GROUP BY with ORDER BY to group your results, calculate
something about them, and then order your results. For example,
SELECT sex, COUNT(*) FROM employees GROUP BY sex ORDER BY count DESC;
might return something like
sex     count
female  19
male    15
because there are more females at our company than males. Note also that ORDER BY always
goes after GROUP BY.
*Aggregate functions are not allowed in GROUP BY.
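The employee example can be reproduced with a sketch in Python's sqlite3 module, standing in for PostgreSQL. One difference is an assumption of this sketch: SQLite does not name an unaliased COUNT(*) column count, so the query aliases it as n and sorts by that alias.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE employees (sex TEXT)")
# 15 male and 19 female employees, matching the counts in the notes.
conn.executemany(
    "INSERT INTO employees VALUES (?)",
    [("male",)] * 15 + [("female",)] * 19,
)

# Group rows by sex, count each group, then sort the groups by size.
counts = conn.execute(
    "SELECT sex, COUNT(*) AS n FROM employees GROUP BY sex ORDER BY n DESC;"
).fetchall()

print(counts)
```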
Get the release year and average duration of all films, grouped by release year.
SELECT release_year, AVG(duration)
FROM films
GROUP BY release_year;
Get the release year and largest budget for all films, grouped by release year.
SELECT release_year, MAX(budget)
FROM films
GROUP BY release_year;
Get the IMDB score and count of film reviews grouped by IMDB score in the reviews table.
SELECT imdb_score, COUNT(*)
FROM reviews
GROUP BY imdb_score;
GROUP BY practice (2)
Now practice your new skills by combining GROUP BY and ORDER BY with some more aggregate
functions!
Make sure to always put the ORDER BY clause at the end of your query. You can’t sort values
that you haven’t calculated yet!
Get the release year and lowest gross earnings per release year.
SELECT release_year, MIN(gross)
FROM films
GROUP BY release_year;
Get the release year, country, and highest budget spent making a film for each year, for
each country. Sort your results by release year and country.
SELECT release_year, country, MAX(budget)
FROM films
GROUP BY release_year, country
ORDER BY release_year,country;
Get the country, release year, and lowest amount grossed per release year per country.
Order your results by country and release year.
SELECT country,release_year, MIN(gross)
FROM films
GROUP BY release_year, country
ORDER BY country, release_year;
HAVING a great time
In SQL, aggregate functions can’t be used in WHERE clauses. For example, the following query is
invalid:
SELECT release_year FROM films GROUP BY release_year WHERE COUNT(title) > 10;
This means that if you want to filter based on the result of an aggregate function, you need
another way! That's where the HAVING clause comes in. For example,
SELECT release_year FROM films GROUP BY release_year HAVING COUNT(title) > 10;
shows only those years in which more than 10 films were released.
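A sketch of the difference between WHERE and HAVING, run with Python's sqlite3 module as a stand-in for PostgreSQL. The four films are hypothetical and the threshold is lowered to 2 to keep the data small.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE films (title TEXT, release_year INTEGER)")
# Hypothetical films: three released in 2000 and one in 2001.
conn.executemany(
    "INSERT INTO films VALUES (?, ?)",
    [("A", 2000), ("B", 2000), ("C", 2000), ("D", 2001)],
)

# An aggregate function inside WHERE is rejected by the database.
try:
    conn.execute(
        "SELECT release_year FROM films WHERE COUNT(title) > 2 GROUP BY release_year;"
    )
    where_failed = False
except sqlite3.Error:
    where_failed = True

# HAVING filters the groups after they have been aggregated.
busy_years = conn.execute(
    "SELECT release_year FROM films GROUP BY release_year HAVING COUNT(title) > 2;"
).fetchall()

print(where_failed, busy_years)
```

WHERE filters individual rows before grouping, so it cannot see group-level values such as COUNT(title).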
In how many different years were more than 200 movies released? 13
SELECT release_year
FROM films
GROUP BY release_year
HAVING COUNT(release_year) > 200;
All together now
Time to practice using ORDER BY, GROUP BY and HAVING together.
Now you're going to write a query that returns the average budget and average gross earnings
for films in each year after 1990, if the average budget is greater than $60 million.
SELECT release_year, AVG(budget) AS avg_budget, AVG(gross) AS avg_gross
FROM films
WHERE release_year > 1990
GROUP BY release_year
HAVING AVG(budget) > 60000000
ORDER BY AVG(gross) DESC;
All together now (2)
Get the country, average budget, and average gross take of countries that have made more
than 10 films. Order the result by country name, and limit the number of results displayed to 5.
You should alias the averages as avg_budget and avg_gross respectively.
SELECT country, AVG(budget) AS avg_budget, AVG(gross) AS avg_gross
FROM films
GROUP BY country
HAVING COUNT(title) > 10
ORDER BY country
LIMIT 5;
Order of clauses in a query:
SELECT
FROM
JOIN
WHERE
GROUP BY
HAVING
ORDER BY