Meaningful Statistics in Ten Clicks or Less


Andrew Barker


Agenda

Quick Statistics

• Creating Histograms

– Building ordinary and cumulative histograms

• Regression Analysis

– Building models and reading ANOVA tables

Creating Box Plots and Control Charts

• Box Plots

– Learn about Tableau’s enhanced box plots, computing standard scores, and reading Tableau’s summary cards

• Control Charts

– Create an Np attribute control chart to detect when your manufacturing went awry

Using R: A Programming Language for Statistical Analysis

• T-tests

– Learn how to conduct basic hypothesis testing

• Correlation coefficients

– Learn how to find trends with your data

Challenge Activity: Be a Detective, Find Fraud!

• Benford’s Law

– Learn a basic application of Benford’s law to help find fraud in your data

Challenge Activity: Translate Math Formulas into Tableau Calculations

• Z-scores, Correlation Coefficients, and Derivatives

– Translate these calculations from their formulas to Tableau calculations


Quick Statistics

Histograms

Histograms show us the distribution of data.

For example, the following basic histogram shows us the distribution of SAT math scores across a sample of 173 colleges:

We see that most colleges have incoming students with SAT math scores somewhere between 450 and 550. We also see that some of the colleges in this sample have SAT scores above 600. In fact, in the following cumulative histogram, we see that roughly 5% of colleges have SAT math scores above 600:

Here are some sample data for the above charts:

College    SAT Math Scores
1          490
2          499
3          459
4          575
5          575

How to Create a Basic Histogram

Step 1

Open up the dataset, “Math SAT Scores by College”.

Step 2

From your Measures pane, select the field SAT Math Scores:

Step 3

Go to Show Me and choose a histogram:

(Optional) Step 4

Right now, we see the count (i.e., number) of schools that fall within each SAT score bin. What if we want to see the percent of schools that fall within each bin?

To show the percent of schools that fall within each bin, go to the Rows shelf, and on the CNT(SAT Math Scores) pill, select the dropdown menu Quick Table Calculations>Percent of Total:


To show each bar's specific value, we want to copy the CNT(SAT Math Scores) pill and place it on the Label shelf. To do this, simply hold down the Ctrl key, go to your Rows shelf, drag and drop CNT(SAT Math Scores) onto the Label shelf, then release the Ctrl key:

(Optional) Step 5

To create a cumulative histogram, simply duplicate the CNT(SAT Math Scores) pill (using the Ctrl-click method outlined in Step 4) and place the copy on the Rows shelf:

On the second pill's dropdown menu, choose Quick Table Calculation>Running Total:


If you want this to be a cumulative percentage, select the dropdown menu again, and select Edit Table Calculation…:

Choose to Perform a secondary calculation on the result, and from the dropdown menu, choose Percent of Total:

Select OK.


To have the text for the cumulative percentage appear, go to the Marks card, and choose the second CNT(SAT Math Scores):

Note: Multiple Mark Types allows us to have different types of marks (e.g., lines and bar charts) on one chart. Effectively, it's a combination chart.

Then duplicate the second CNT(SAT Math Scores) (holding down the Ctrl key), and place it on the Label shelf:


(Optional) Step 6

Sometimes, we might want both graphs to be on top of each other, similar to a Pareto chart:

Right-click the top graph's y-axis, and under Mark Type, choose to make the top graph a bar. Next, make the bottom graph a line chart:


Right-click on the bottom graph's y-axis and select Dual Axis:

(Optional) Step 7

Last, we might want to change the bin sizes. Right now, we have bin sizes of 50 (e.g., 350, 400, 450, 500, etc.). This looks reasonably good for our distribution of SAT scores. But what if we want to change the bin sizes to 100, such that our SAT Math Scores are bucketed into groups of 300, 400, 500, etc.?

To do this, simply go to your Dimensions pane, right-click on your bins, and choose Edit:


Change the Size of bins to 100, and select OK:

(Optional) Step 8

What if we want to dynamically change the size of our bins? Instead of choosing specific bin sizes, we can create a parameter.

To do this, simply go to your Dimensions pane, right-click on your bins, and choose Edit:

Select Create a new parameter…:


Choose a range such as 1 through 100, and select OK:

Now, use the parameter on the right to toggle between different bin sizes.
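Under the hood, Tableau's bins simply round each value down to the lower edge of its bin. If you ever want to build them by hand, here is a minimal sketch of an equivalent calculated field (the parameter name Bin Size is an assumption; substitute whatever you named yours):

// Hypothetical manual bin: rounds each score down to the
// nearest multiple of the [Bin Size] parameter
FLOOR([SAT Math Scores] / [Bin Size]) * [Bin Size]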


ACTIVITY:

Open the dataset, “Tryon’s Rat Experiment”.

This dataset is modeled after Robert Tryon’s famed rat experiment. In this experiment, Tryon tested hereditary rat intelligence. He did this by running 142 rats through a maze. Each rat ran the maze 19 times.

To determine intelligence, Tryon counted the number of times a rat ran into his maze's dead ends. The number of times a rat messed up (i.e., hit a dead end) became the rat's intelligence score. Rats that made lots of errors had higher (not good) scores. Rats that made fewer errors had lower (better) scores.

Tryon then categorized the rats as being either “maze-bright” (rats that completed the maze with few errors) or “maze-dull” (rats that completed the maze with more errors). He then bred the maze-bright rats with each other, and the maze-dull rats with each other.

After seven generations of breeding and dividing the rats, Tryon arrived at a visualization similar to what’s below. We see that there’s a big difference between the “maze-bright” and “maze-dull” rat strains after seven generations of breeding.

Can you recreate the following chart looking at maze-bright and maze-dull rats? (Filter out the first generation.)

Do you think the data indicate that some mental abilities are hereditary?

Would you say that maze-bright rats are superior in every way compared to maze-dull rats?


CHALLENGE ACTIVITY

Open up the dataset, "Housing Values". Let's create a histogram of housing values. To do this, use the field Sale Price.

Using the tools above, try to create an analysis similar to the below visual, where bins are sized to ~$25,000:


Quick Statistics

Trend Lines

Tableau trend lines allow us to model movement we see in our data. We can use them to understand the past, and even develop models to forecast what might happen in the future.

Below, we see median home sale prices in the greater Seattle area. We may notice that in 2006, home values were going up at a rate of roughly $4,057.08 per month. This translates to over $48k per year!

With this model, it becomes evident that home prices were increasing at an unsustainable rate.

Indeed, by 2007, the housing bubble peaked, and unfortunately, we see a steady decline in home values thereafter.


How to Create a Trend Line

Step 1

Open up the dataset, “Housing Values”.

Step 2

Drag and drop Sale Price onto the Rows shelf, use the dropdown menu, and select Measure>Median. Then drag and drop SaleDate onto the Columns shelf, expanding it first to year, then month, as shown below:

Step 3

Right-click on the graph, and select Trend Lines.


ACTIVITY

Connect to the dataset, "Windmill", and create a trend line. We want to see if an increase in wind speed results in an increase in power output. Therefore, we want to place Wind Speed on the x-axis, and Power Output on the y-axis. We have hundreds of windmills, and we want to see the power output by windmill. Right now, we only see the aggregated power output for all the windmills. To break the view down by windmill, place Windmill on the Detail shelf:

Create a trend line, similar to what’s below:


How to Access the Trend Line Model Table

You might wonder, "How well does a trend line fit the data?" "How is this trend line modeled?" This information is available in Tableau's Trend Model. Follow the steps below to access the trend model.

Step 1

Continuing with the Windmill example, right-click on the visual, and choose Describe Trend Model:

We are now brought to an ANOVA table:


For a statistician, many of these values are relatively straightforward. However, for those of us who are not statisticians, this table can be intimidating. We'll go over some of the basic parts below:

Let's start with the Model Formula. In the above case, this formula is in the format of a linear equation, y = mx + b, where y = power output and x = wind speed:

power output = (m)(wind speed) + b

The value for m (the coefficient) is in the lower right-hand corner of the ANOVA table: 2.24418. So now we see that for each 1 m/s increase in wind speed, our power output goes up by ~2.2 MW. Therefore, our equation is now:

power output = 2.24418(wind speed) + b

We see that the intercept is labeled in the bottom right-hand corner as −6.45329. That means our final model is:

power output = 2.24418(wind speed) − 6.45329
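For example, at a wind speed of 10 m/s, the model predicts (2.24418)(10) − 6.45329 ≈ 16 MW of power output.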

R-Squared is one of the most often cited values. Effectively, it tells us how well our model fits the data. It ranges from 0 to 1, and higher values are considered better. We see that in our example the R-Squared is very high at 0.956448. This means our model fits the data very well. An R-Squared of 1 indicates the regression fits the data perfectly.*

*Be wary that if your R-squared is incredibly high (e.g., 0.999), your model may be misleading. A common indicator of an artificially high R-squared is low degrees of freedom (e.g., too many model terms relative to the number of observations), among a slew of other possible contributors/indicators.

P-values are often spoken of in terms of the 5% level; it's generally desired to have a p-value below 5%. The p-value is the probability of obtaining data at least as extreme as what's observed, given the null hypothesis. One can think of it as asking, "Is there significant evidence to reject the null hypothesis?" The larger the p-value, the less evidence we have against the null hypothesis, and the shakier the ground our model stands on. (Keep in mind that p-values on intercepts do not need to be significant at the 5% level.)


How to Create Residuals

A residual is the difference between an observed value and an estimated value of a model.

When plotting residuals, we take each and every point, and find its distance from the line.

If our trend line is exceptionally good at explaining the movement in our data, that means the observed points should be (relatively) equally distributed above and below the trend line. When we plot these points’ residuals, we expect them to be plotted in a scattered, random pattern:

Great residuals:


If we detect a wave or a strong pattern in our residuals, this means our model might not best fit the data. We might say that the model is exhibiting significant “curvature”. Below are residuals that exhibit significant curvature:

Poor residuals:

We see that in our linear model, the points are weaving above and below the trend line. When we plot the residuals, we quickly see that these points follow a strong curvy motion. The boxed points in the lower left-hand corner of the linear model are bunched above the trend line. Those same points are illustrated (boxed) in the residuals plot.


To get these residuals:

Step 1

On the worksheet with your linear model, go to Worksheet>Export>Data…

Step 2

Name your file, then select OK:

Step 3

Choose to Connect after export:


Step 4

Drag the residuals from the Measures pane to the Rows shelf. We see that our residuals add up to roughly zero, which is generally expected:

Step 5

Next, let's drag our Predictions out to the Columns shelf, and drag Windmill to the Level of Detail:

We quickly see that there is indeed curvature to our residuals, and that we need to fix this.

So how do we fix our model so that the residuals are more evenly distributed? We need to change the trend line from being linear to one that accounts for some of the curvature we see. Let's go back to our original linear model, right-click on the view, and select Edit Trend Lines…


Let's try Logarithmic:

Now, let’s try exporting the residuals for the logarithmic model as well. Do we still see curvature?

CHALLENGE ACTIVITY

Find a model which best fits the Windmill data (e.g., linear, logarithmic, exponential, polynomial). Does the R-squared go up when compared to the linear model we developed earlier? Are the residuals more evenly distributed when compared to the linear model?


Box Plots and Control Charts

Box Plots

Throughout much of my life, my father has complained that his favorite coffee shop (whose name shall not be divulged) always cheated him out of a full cup of coffee. He would come home with a sour look on his face, cup in hand, grab a pen, and mark the inside lining of the coffee cup where the coffee was filled to. Some cups had well over an inch of room at the top. “I always tell them, ‘20 oz. coffee please, and NO ROOM!’ Then they give me 16 oz. of coffee!” He always pouts when he says this. He then hastily stuffs his cup into one of the cupboards. This happens several times a week. And it has been happening for years.

Now, you might imagine that our cupboards must be getting mighty full, or that we have an exceptional amount of cupboard space. Surprisingly, neither the former nor the latter is true, as my mother quietly throws these cups away.

Now, you might ask, “How does this fit in with box plots?” Well, as a birthday present for my dad, I went out and purchased 14 cups of every size of coffee from his favorite coffee shop down the street from his house. On different days. At different times. I specified that they give me the cups with “no room”, then I weighed the amount of coffee inside each cup, and subtracted out the weight of the cup:

Each point illustrates a cup of coffee that I weighed. These points rest in the 8 oz., 12 oz., 16 oz., and 20 oz. bins.

We see that for an 8 oz. cup, the median weight was 8.05 oz. of coffee. This means in this particular sample, they typically gave me more coffee than was specified. However, we see that for 12 oz., 16 oz., and 20 oz. cups, I typically received less than the advertised amount of coffee. I then gifted these results to my father for his birthday. He was thrilled.


How to Create a Box Plot

Step 1

Connect to the dataset, “Coffee Data”.

Step 2

Drag Weight to the Rows shelf, Coffee Size to the Columns shelf, and Row ID to the Level of Detail:

Step 3

Go to the Marks card, and change it to Circle.

You may choose to reduce the size of these circles using the slider bar:

Step 4

Now, let’s add some quartiles. Right-click on the y-axis, and select Add Reference Line…


Select Box Plot.

Select OK.

Now you have a wonderful box plot that is much improved over the traditional box plot:

Traditional Box Plots


Calculating the Standard Scores

Standard scores tell us how many standard deviations a datum is from the mean.

The equation for the standard score is as follows:

z = (x − x̄) / s

That looks scary! How do we get THAT into Tableau? Well, it's not really as scary as we may think. Let's break it down:

x means each point of data. In this case, each point is a SUM(Weight).

x̄ means the average of all our points in the window. In this case, WINDOW_AVG(SUM(Weight)).

s means the standard deviation of all our points in the window. In this case, WINDOW_STDEV(SUM(Weight)).

In summary, our Tableau equation will be:

( SUM(Weight) - WINDOW_AVG(SUM(Weight)) ) / WINDOW_STDEV(SUM(Weight))
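For example (with made-up numbers), if a cup weighs 11.5 oz. while the window's average is 11.8 oz. and its standard deviation is 0.2 oz., that cup's standard score is (11.5 − 11.8) / 0.2 = −1.5: it sits 1.5 standard deviations below the mean.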

Simply go to Analysis>Create Calculated Field and write the equation:

Then drop your new standard score calculation onto the Color shelf. Right-click on it, and select Compute Using>Row ID. This is because we want it to compute this score for each individual point, and each point is a specific Row ID (meaning a specific purchase of coffee).

Now, when we hover over a point, we will see how many standard deviations it is from the mean.

Showing the Summary Card

Tableau's summary card provides descriptive statistics of your selected values. For instance, it shows you the sum, average, minimum, maximum, and other descriptors regarding any values you've selected on your graph. It can be accessed by selecting Worksheet>Show Summary:

From its dropdown menu in the upper right-hand corner, you can choose to show other descriptive values such as Skewness and Excess Kurtosis.

What do Skewness and Excess Kurtosis mean? Skewness illustrates whether the data has a leftward or rightward tail. Negative skew illustrates that there is a leftward tail. Positive skew illustrates that the data has a rightward tail. Notice below that home values have a rightward (positive) skew: some people have homes worth millions of dollars, but the bulk of us have homes valued in the low six figures:

Excess kurtosis, on the other hand, indicates how wide or fat the tails of your distribution are. Positive values indicate that there is a higher concentration of values around the mean, whereas negative values indicate that there are fewer data around the mean. A way to remember this is that negative values are small, and thus make smaller, stouter mountains; positive numbers are large, and make taller mountains:

Positive Excess Kurtosis on top, Negative Excess Kurtosis on bottom


ACTIVITY

Open up the dataset, “Resume Names”.

Create box plots of the names and their callback rates, breaking them down by firstname, race, and gender.

About this data set:

This is a sample of the dataset created by Marianne Bertrand and Sendhil Mullainathan of MIT and the University of Chicago. The title of their research paper is "Are Emily and Greg More Employable than Lakisha and Jamal?" This well-known and highly regarded study took a sample of resumes, randomly placed African American-sounding and White-sounding names on top of the resumes, and then sent the resumes to possible employers. They found that resumes with White-sounding names got 50% more callbacks than resumes with African American-sounding names, with other variables (e.g., resume quality) held constant. As per their abstract:

We study race in the labor market by sending fictitious resumes to help-wanted ads in Boston and Chicago newspapers. To manipulate perceived race, resumes are randomly assigned African American or White sounding names. White names receive 50 percent more callbacks for interviews. Callbacks are also more responsive to resume quality for White names than for African American ones. The racial gap is uniform across occupation, industry, and employer size. We also find little evidence that employers are inferring social class from the names. Differential treatment by race still appears to be prominent in the U.S. labor market.


Box Plots and Control Charts

Control Charts

Say you own a pompom factory. You've noticed lately that a lot of the pompoms are defective and look more like bowling balls than fluffy pompoms. You noticed this rise a couple of days ago, but aren't quite sure if you're just imagining things, or if this is a significant change and problem.

Or say you run a hospital, and you've noticed lately that patient wait times at noon have spiked considerably in the last couple of weeks. You wonder how much they've spiked, and when this issue started. Can this be correlated with any other data? Can we explain and monitor this spike?

Control charts allow you to monitor things such as defects over time, and see if things fall outside the statistical processes and norms:

Control charts have a concept of upper control limits (UCLs) and lower control limits (LCLs), which illustrate the extreme levels of variation within your data. When you begin to notice that your data are falling outside of the control limits, you may consider taking steps to remedy any issues that may have appeared.

In the upcoming example, we’re going to discuss how to create an np-chart. Np-charts look at the number of defective units in a given sample in a production chain. We can think of these units as being pompoms.


Creating Np-Charts

Step 1

Open up the dataset, “Pompoms”.

Step 2

Drag Day to the Columns shelf, and from its dropdown menu, change it to Exact Date. Drag Defectives to the Rows shelf:

Step 3

Right-click on the y-axis, and choose to add +1/-1 Standard Deviations:


Select OK. You now have a control chart:

However, this isn't an np-chart; rather, it's a basic variant.

Let's create a proper np-chart (remove the standard deviations from above, please):

Step 1

Understand that the control limits in an np-chart are defined by the following equation:

np-bar ± 3 × SQRT( np-bar × (1 − p-bar) )

Where n is the sample size and p-bar is the window's total number of defective units divided by the window's total number of units.

Step 2

Write an equation to determine p-bar. p-bar is calculated as (Number of Defectives)/(Total Sample Size).

In Tableau, this calculation is:

WINDOW_SUM(SUM([Defectives])) / WINDOW_SUM(SUM([Sample Size]))


We now might see that np-bar must be the number of observations multiplied by p-bar. That is:

np-bar = n × p-bar

We now have all the elements we need to complete the equations for the upper and lower control limits:

LCL = np-bar − 3 × SQRT( np-bar × (1 − p-bar) )

UCL = np-bar + 3 × SQRT( np-bar × (1 − p-bar) )
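As a rough sketch, these can be written as the following Tableau calculated fields. The names [p-bar] and [np-bar] are assumptions; substitute whatever you name your own calculations:

// p-bar: total defectives divided by total units in the window
WINDOW_SUM(SUM([Defectives])) / WINDOW_SUM(SUM([Sample Size]))

// np-bar: the sample size multiplied by p-bar
// (assumes a roughly constant sample size per day)
SUM([Sample Size]) * [p-bar]

// LCL: np-bar minus three standard deviations
[np-bar] - 3 * SQRT([np-bar] * (1 - [p-bar]))

// UCL: np-bar plus three standard deviations
[np-bar] + 3 * SQRT([np-bar] * (1 - [p-bar]))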


Step 3

We can now place these three calculations on the Level of Detail shelf:

Step 4

To visualize them in the graph, simply right-click on the y-axis and select Add Reference Line.

Let's first just graph the np-bar:

Select OK.

Repeat Step 4 for both the LCL and UCL.


Step 5

To color the points that lie outside of the upper and lower control limits, simply use the following equation:
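A minimal sketch of such a boolean calculation, assuming your control-limit fields are named [LCL] and [UCL]:

// In Control?: TRUE when the day's defectives fall within the control limits
SUM([Defectives]) >= [LCL] AND SUM([Defectives]) <= [UCL]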

Then place this equation on the Color shelf. Keep in mind, you may consider creating a dual-axis chart using the techniques we learned when making the histogram, and making multiple mark types, such that the In Control? field is placed on the circle mark type:


Using R: A Programming Language for Statistical Analysis

Background & Getting Started

R is a powerful object-based programming language (like C++, Java, Python, and many other modern programming languages). It is considered to be one of the leading tools for conducting advanced statistical analyses.

With Tableau’s connection to R, you can leverage Tableau’s rich data discovery, visualization and dashboard capabilities in conjunction with R’s advanced statistical computations.

In this class, we’ll go over how to connect Tableau to R, as well as some of the most basic statistical functions in R.

Connecting Tableau to R

1) Download R at: http://www.r-project.org/

2) Open up R

3) Type the following script (keep in mind that R is case sensitive):

install.packages("Rserve")   # when prompted, choose a mirror to install this library from
library(Rserve)
Rserve()                     # run Rserve()

Your window should look similar to this:

4) Open Tableau

5) Select Help>Settings and Performance>Manage R Connection…


6) Specify the following and select OK (typically, the server is localhost and the port is 6311, Rserve's default):

Conducting a t-test

You might think back to our coffee data. Sure, it looked like some of the actual values were less than their advertised amounts. But, were they significantly less? How can we tell?

A simple t-test will help us get to the root of the matter.

But, how do we compute a t-test in R?

In R, many statistics are only one simple function away. Much like how Tableau has functions such as DATEADD(), DATEDIFF(), and WINDOW_AVG(), R has a wide range of native functions that you can easily call and pass values into.

Let's compute a t-test in R. The function for a t-test is t.test()


But, you might wonder, how do I use a t-test in R? Are there any example scripts?

It’s actually very similar to how Tableau’s functions have descriptions and examples:

In R, you can easily access this information by writing a question mark before any function and hitting return:

Other helpful tips in R:

?t.test                  # Returns help on the t.test function
help(t.test)             # Returns help on the t.test function
apropos("t.test")        # Lists commands containing the string "t.test"
help.search("t.test")    # Finds all commands with "t.test" in them or in their description


By typing ?t.test into the R console, you'll see the following:

t.test {stats}                                        R Documentation

Student's t-Test

Description

Performs one and two sample t-tests on vectors of data.

Usage

t.test(x, ...)

## Default S3 method:
t.test(x, y = NULL,
       alternative = c("two.sided", "less", "greater"),
       mu = 0, paired = FALSE, var.equal = FALSE,
       conf.level = 0.95, ...)

## S3 method for class 'formula':
t.test(formula, data, subset, na.action, ...)

Arguments

x            a (non-empty) numeric vector of data values.

y            an optional (non-empty) numeric vector of data values.

alternative  a character string specifying the alternative hypothesis, must be one of "two.sided" (default), "greater" or "less". You can specify just the initial letter.

mu           a number indicating the true value of the mean (or difference in means if you are performing a two sample test).

paired       a logical indicating whether you want a paired t-test.

var.equal    a logical variable indicating whether to treat the two variances as being equal. If TRUE then the pooled variance is used to estimate the variance, otherwise the Welch (or Satterthwaite) approximation to the degrees of freedom is used.

conf.level   confidence level of the interval.

formula      a formula of the form lhs ~ rhs where lhs is a numeric variable giving the data values and rhs a factor with two levels giving the corresponding groups.

data         an optional matrix or data frame (or similar: see model.frame) containing the variables in the formula formula. By default the variables are taken from environment(formula).


Well, that was scary. Let’s break it down:

First, our question is, “Is the coffee store selling us the advertised amount of coffee in a 12 oz. cup?”

From the jumble of words on the previous page, the full script we'll be using in R is:

t.test(Weight, mu = 12, alternative = 'less')

That's a lot less painful. Let's look at what elements we're using from the previous page. We see that:

• t.test() is the function we're calling
• Weight (the x argument) is the field we're conducting the t-test on
• alternative = 'less' is the type of t-test we're conducting
• mu = 12 indicates the true value of the mean (12 ounces)


When we run this code in R, we get the following:

If we were to write the code in Tableau, it would look like this:
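Based on the description that follows, a sketch of that calculated field (assuming the measure is Weight; your exact field name may differ):

// Passes SUM(Weight) to R as .arg1 and returns the t-test's p-value
SCRIPT_REAL("t.test(.arg1, mu = 12, alternative = 'less')$p.value", SUM([Weight]))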

You’ll notice several differences.

First, SCRIPT_REAL is a Tableau function which allows you to communicate with R. Tableau has the following four functions to communicate with R:

SCRIPT_REAL()

SCRIPT_BOOL()

SCRIPT_STR()

SCRIPT_INT()

These functions require that R return a specific value of type REAL/INT/BOOL/STR, which Tableau will then display. In this case, we want R to return just the p-value (a real number) from the t-test's results. Hence, we use SCRIPT_REAL() to capture that return type. In the R code, we use $p.value to indicate which part of the t-test output we want returned.

Second, you’ll likely notice that .arg1, .arg2, … .arg[N] are used as placeholders for the different Tableau variables (such as weight in this case).

Third, you’ll likely notice that there are quotes surrounding the R function.


Activity

Open the "12 ounce coffee" dataset. Create a box plot. Conduct a t-test as described above, placing your final calculation on Label. Keep in mind that SCRIPT functions in Tableau are Table Calculations, and that once you place them in the view, you will need to right-click on their pills and change their "Compute Using" option.

Challenge:

The above example only works with 12-ounce cups and not with the other sizes. Can you go back to your original box plot and write one script that will compute p-values for all four different coffee sizes at the same time?

Hint: .arg2[1] will be useful here.


Challenge Activity: Be a Detective, Find Fraud!

Benford’s Law

Let's start out by looking at the leftmost digit of our sales values:

Sales      Leftmost digit
$4,692     4
$13,070    1
$18,057    1
$2,729     2
$9,396     9
$3,668     3
$5,973     5

Benford's law states that in nature, numbers' leftmost digits aren't uniformly distributed. In the real world, numbers have a 1 as their leftmost digit ~30% of the time, a 2 as their leftmost digit ~18% of the time, a 3 as their leftmost digit ~12% of the time, and so on.

This may come as a surprise, but the pattern is actually quite common. So common, in fact, that Benford's law is regularly used to detect accounting fraud, among various other forgeries.

Let's check Benford's law against Tableau's own famed SuperStore dataset. After all, this dataset is computer generated, and it's doubtful that the people who made the data modeled the sales after Benford's law:

Or did they?


Finding Fraud with Benford’s Law

Step 1

Connect to the dataset, “SuperStore”.

Step 2

We want to grab the leftmost number of every sales value. To do this, we need to use a LEFT() function. Because LEFT() takes string values, we need to convert our sales values to strings with the following equation:
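A minimal sketch of that calculation:

// Converts each sales value to a string and grabs its leftmost character
LEFT(STR([Sales]), 1)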

Step 3

Drag the LEFT(Sales,1) field to the Columns shelf, and the Number of Records to the Rows shelf:


Step 4

Right-click the Number of Records pill, and change its calculation to Percent of Total:

Step 5

Copy it to the Text / Label shelf, and you’ve just used Benford’s law.

(Optional) Step 6

To create the 80, 100, and 120% colored marks, you will need Benford's distribution equation (this works with log bases 2 and above; we're assuming log base ten in this case):

P(d) = log(d + 1) − log(d)

Where d is the leftmost digit. To translate this into Tableau terms:
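A sketch of that calculated field, assuming the field created in Step 2 is named LEFT(Sales,1) (Tableau's LOG() is base ten by default):

// Benford's expected share for each leading digit d
LOG(INT([LEFT(Sales,1)]) + 1) - LOG(INT([LEFT(Sales,1)]))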

Next, select OK. Now place this equation in the Level of Detail shelf, and set it to MIN:


Right-click on the y-axis to add a reference line, and select the following:

Then select OK.

Your finished product should look something like this:


Challenge Activity: Advanced Calculations

Derivatives

Derivatives can be thought of as a rate of change. For instance, say I'm driving at 60 miles per hour, and over a period of ten seconds, I slow down to 30 miles per hour. What was my rate of change?
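In this example, my speed changed by 30 − 60 = −30 miles per hour over ten seconds, a rate of change of −3 mph per second.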

Fortunately, Tableau makes this very easy to do. Simply click on your pill, choose Quick Table Calculation, and select Percent Difference.

But what's a real-world application of derivatives, you might ask?

That's a great question. Let's take a look at the Purchasing Managers' Index (PMI). Feel free to open the GDP and PMI dataset. The PMI is an indicator that surveys 400 purchasing managers in the manufacturing sector regarding whether production level, new orders from customers, speed of supplier deliveries, inventories, and employment level are (1) better than last month, (2) the same as last month, or (3) worse than last month.

In general, a PMI over 50 indicates that the manufacturing sector is expanding, while a PMI below 50 means that the manufacturing sector is contracting. The range of possible PMI scores is from zero to 100.

According to Investopedia, "The PMI is calculated by taking the percentage of responders that reported better conditions than the previous month and adding half of the respondents that reported no change in conditions. For example, a PMI reading of 50 would indicate an equal number of respondents reporting 'better conditions' and 'worse conditions'."

Many economists wish they could use the PMI as an indicator of how our GDP is doing. Because PMI reports are issued monthly, and GDP is issued quarterly, PMI can indicate how well our economy is doing before the GDP reports come in (essentially, it is a leading indicator). But when we graph them both together, we get something like this (orange is PMI, blue is GDP):


The above is pretty useless. Well, that makes sense. PMI is effectively a rate of change of how our economy is doing. GDP is just a total. We can think of GDP as the speed of our economy, and PMI as the acceleration. Let’s take the derivative of GDP, and see what happens.

To do this, simply go to the Rows shelf, right-click on GDP, and choose Quick Table Calculation>Percent Difference:
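For reference, Tableau writes this table calculation for you; under the hood, a Percent Difference on GDP generates a formula roughly like this sketch:

// Percent change from the prior period's GDP
(ZN(SUM([GDP])) - LOOKUP(ZN(SUM([GDP])), -1)) / ABS(LOOKUP(ZN(SUM([GDP])), -1))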

And voila, we now have an excellent indicator of how well our GDP will do. Since PMI is released in a timely, monthly manner, it comes out several months before the next GDP report, and therefore can be used as a good indicator of how GDP will change.

Calculating Correlation Coefficients

Correlation coefficients (which range in value from −1 to 1) tell us how strongly two fields are linearly related. If there's a strong relationship (i.e., the coefficient is very close to either −1 or 1), then knowing one variable will help a lot in predicting the other.

The correlation coefficient is made up of Standard Scores (which were discussed in the section regarding Box Plots). Because of that, the challenge of this exercise is to find the correlation coefficient with minimal input from this text. The equation is listed below, as well as some additional information; however, step-by-step instructions will not be provided:


r = ( Σ ( z_x × z_y ) ) / ( n − 1 )

Where:

r = correlation coefficient
n = sample size, modeled by the equation SIZE() in Tableau
x̄ = sample mean
s = sample standard deviation
z = standard score (as seen in the Box Plot exercises), e.g., z_x = (x − x̄) / s_x
Σ = WINDOW_SUM()

For this example, use the SuperStore dataset, and find the correlation coefficient between Profit and Sales (i.e., let Sales = x, and Profit = y). Place the correlation coefficient on Color. What is the correlation coefficient broken down by different regions? Shipping containers?

What’s the correlation coefficient between PMI and GDP?

