Uploaded by Дар'я Ганноченко

Wage Analysis: Statistics & Correlation

advertisement
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from scipy.stats.contingency import expected_freq
from statsmodels.graphics.mosaicplot import mosaic
1.
Variable Classification
Variable
Data Scale
Classification
Explanation
age (in years)
ratio
quantitative
The age is
measured in
years. If the
age is zero, it
means no
age in year
(small baby).
This variable
has zero
measuremen
t, indicating
a lack of the
characteristi
c. It is
quantitative
because its
realizations
reflect both
differences
and
magnitudes.
case (identifier)
nominal
qualitative
The case
identifier is a
label that
does not
allow for
meaningful
comparison
beyond
equality or
inequality.
Case
(identifier)
shows which
category a
person
belongs to,
consequentl
y, it's
Variable
Data Scale
Classification
Explanation
nominal
data. It is
qualitative,
as it
describes an
attribute —
identifier.
sleep (mins sleep at night, per week)
ratio
quantitative
This variable
has a true
zero,
meaning
that if sleep
time is zero,
the person
did not sleep
at all. It is
quantitative
since its
realizations
reflect both
differences
and
magnitudes.
educ (years of schooling)
ratio
quantitative
This variable
has a true
zero,
representing
a complete
lack of
education
(e.g., a
toddler with
no
schooling). It
is
quantitative
since its
realizations
reflect both
differences
and
magnitudes.
exper (age - educ - 6)
ratio
quantitative
This is a
derived
quantitative
variable
Variable
Data Scale
Classification
Explanation
representing
experience.
It is ratio
because it
has a true
zero when
the
individual
has no work
experience.
yngkid (=1 if children < 3 present)
nominal
qualitative
A categorical
variable
indicating
whether a
young child
is present. It
does not
allow for
meaningful
numerical
comparisons
beyond
equality or
inequality. It
is qualitative,
as it
describes a
personal
attribute.
yrsmarr (years married)
ratio
quantitative
This variable
has a true
zero,
meaning
that if the
value is zero,
the person
has never
been
married. It is
quantitative
because its
realizations
reflect both
differences
and
magnitudes.
Variable
Data Scale
Classification
Explanation
male (=1 if male)
nominal
qualitative
A categorical
variable
indicating
whether the
person is
male. It does
not allow for
meaningful
numerical
comparisons
beyond
equality or
inequality. It
is qualitative,
as it
describes a
personal
attribute —
sex.
hrwage (hourly wage)
ratio
quantitative
This variable
has a true
zero,
meaning
that if the
value is zero,
the person
has no
hourly wage
(i.e., not
employed or
unpaid
labor). It is
quantitative
since its
realizations
reflect both
differences
and
magnitudes.
marr (=1 if married)
nominal
qualitative
A categorical
variable
indicating
marital
status
(married or
not). It
doesn't
Variable
Data Scale
Classification
Explanation
allow for
meaningful
numerical
comparisons
beyond
equality or
inequality. It
is qualitative,
as it
describes a
personal
attribute —
marital
status.
data_set = pd.read_csv('HW1_dataset.csv')
df = data_set['hrwage']
1.
Computing of common location measures: mean, median, upper and lower quartiles,
the upper and lower 5%-quantiles. Economic interpretation for every location
measure.
From this data we can see that average hourly wage is 5.16. But we cannot focus
on this indicator, because mean is sensitive to outliers. (For example if we have 3
person with wages [1, 2, 99] and their mean wage is 34; and we can see that it is not
representative.)
Median is 4.61, that says us that 50% of our respondents has hour wage less than
4.61.
Lower quartile means that 25% of people have hour wage less than 3.03.
Upper quartile means that 75% of people from dataset have hour wage less than
6.24.
Lower quantile means that 5% of people have hourly wage less than 1.44.
Upper quantile means that 95% of people from dataset have hourly wage less than
10.39.
location_measures= {"Mean": round(float(df.mean()), 2),
"Median": round(float(df.median()), 2),
"Lower quartile": round(float(df.quantile(0.25)),
2),
"Upper quartile": round(float(df.quantile(0.75)),
2)
}
upper_and_lower_5_quanlile = {"Lower quantile (5%)":
round(float(df.quantile(0.05)), 2),
"Upper quantile (5%)":
round(float(df.quantile(0.95)), 2)}
print(location_measures)
print(upper_and_lower_5_quanlile)
{'Mean': 5.16, 'Median': 4.61, 'Lower quartile': 3.03, 'Upper
quartile': 6.24}
{'Lower quantile (5%)': 1.44, 'Upper quantile (5%)': 10.39}
1.
Computing measures of variation for hrwage: range, interquartile range, variance.
Range (the difference between lowest and highest hour wage) is 35.16. It can
indicates about inequality of population or outliers.
Interquartile range (IQR) is the difference between the 75-quantile and the 25quantile is 3.21. IQR shows us how typical the hourly wages are in the middle half of
the group. IQR helps to focus on the middle of the data, but not accounting very big
or very low hourly wage. In our case it tells us that the half of group in the middle
(from the 25th to the 75th percentile) have wages that range from 3.03 to
6.24(lower quartile + IQR) per hour.
Variance 13.47 indicates that in average hour wage deviates by 13.47 squared from
the mean. In meant that wages are not all located closely around the mean. Some
workers prorably earn higher or lower than average hour wage.
measures_of_variation = {"range": round(float(df.max() - df.min()),
2),
"Interquartile range":
round(float(location_measures["Upper quartile"] location_measures["Lower quartile"]), 2),
"Variance": round(float(df.var()), 2)}
measures_of_variation
{'range': 35.16, 'Interquartile range': 3.21, 'Variance': 13.47}
1.
Ploting the histogram of hrwage and the Box-plot. Computing skewness of hrwage
and make the conclusion whether the distribution of this variable is symmetric.
Skewness 3.39 is positive, which indicates that we have some values which are
located at the right side of the mean. In case of hourly wages it means that there are
a few people who has higher hourly wage, than majority.
Based on this information we can say that our distribution isn't symmetric, but it's
right-skewed.
df.plot.box()
skewness = round(float(df.skew()), 2)
f"Skewness of {skewness}"
'Skewness of 3.39'
1.
Histogram for logarithmizated(hrwage). Comparison histogram of logarithmized data
with histogram of original data. Calculation of skewness of log(hrwage) conclusions.
log_data = np.log(df)
plt.hist(log_data, bins=12, edgecolor='pink')
plt.title("Histogram of logarithm hrwage")
plt.xlabel("logarithm of data")
plt.ylabel("frequency")
plt.show()
We can see that log data reduced extreme right skew, which we could see in original data
(skewness = -0.37). But there is still slight left skew. It means that there are few low wages, but
their effect isn't significant. Logarithmization makes the distribution more symmetric compared
to the original.
skewness_log = round(float(log_data.skew()), 2)
f"Skewness of log(hrwage): {skewness_log}"
'Skewness of log(hrwage): -0.37'
1.
Plot the scatter plots of hrwage vs. educ, yrsmarr, age, sleep. Compute the
corresponding correlation (Pearson and Spearmann) coefficients and interpret the
results.
fig, axs = plt.subplots(2, 2, figsize=(12, 10))
axs[0, 0].scatter(data_set['educ'], data_set['hrwage'], color='blue',
alpha=0.7)
axs[0, 0].set_title('hrwage vs educ')
axs[0, 0].set_xlabel('Years of education')
axs[0, 0].set_ylabel('Hourly wage')
axs[0, 1].scatter(data_set['yrsmarr'], data_set['hrwage'],
color='green', alpha=0.7)
axs[0, 1].set_title('hrwage vs yrsmarr')
axs[0, 1].set_xlabel('Years married')
axs[0, 1].set_ylabel('Hourly wage')
axs[1, 0].scatter(data_set['age'], data_set['hrwage'], color='red',
alpha=0.7)
axs[1, 0].set_title('hrwage vs age')
axs[1, 0].set_xlabel('Age')
axs[1, 0].set_ylabel('Hourly wage')
axs[1, 1].scatter(data_set['sleep'], data_set['hrwage'],
color='purple', alpha=0.7)
axs[1, 1].set_title('hrwage vs. sleep')
axs[1, 1].set_xlabel('Sleep (hours)')
axs[1, 1].set_ylabel('Hourly wage')
plt.tight_layout()
plt.show()
pearson_corr = data_set[['hrwage', 'educ', 'yrsmarr', 'age',
'sleep']].corr(method='pearson').round(2).iloc[:1]
spearman_corr = data_set[['hrwage', 'educ', 'yrsmarr', 'age',
'sleep']].corr(method='spearman').round(2).iloc[:1]
print("Pearson correlation coefficients:\n", pearson_corr)
print("\nSpearman correlation coefficients:\n", spearman_corr)
Pearson correlation coefficients:
hrwage educ yrsmarr
age sleep
hrwage
1.0 0.26
0.13 0.12 -0.05
Spearman correlation coefficients:
hrwage educ yrsmarr age sleep
hrwage
1.0 0.25
0.14 0.1 -0.08
Pearson correlation coefficients range from 1 to -1. In our case this coefficient range from -0.05
to 0.26. It means that correlation between hrwage and education, years of marrige, age and
sleep is weak or very weak. The same situation with Spearman correlation coefficients, from
table data we can see that correlation is weak or very weak.
1.
Two subsamples of observation: male (male = 1) and female (male = 0). Ploting separate
histograms and boxplots. Compare the results. Compute the corresponding location and
dispertion measures. What conclusions can be made?
female_sample = data_set[data_set['male'] == 0]
male_sample = data_set[data_set['male'] == 1]
fig, axs = plt.subplots(1, 2, figsize=(10, 8), sharey=True)
axs[0].hist(female_sample['hrwage'], bins=8, color='blue',
edgecolor='black', alpha=0.7)
axs[0].set_title('Female hourly wage')
axs[0].set_xlabel('Hourly wage')
axs[0].set_ylabel('Frequency')
axs[1].hist(male_sample['hrwage'], bins=8, color='red',
edgecolor='black', alpha=0.7)
axs[1].set_title('Male hourly wage')
axs[1].set_xlabel('Hourly wage')
plt.show()
fig, ax = plt.subplots(1, 1, figsize=(8, 6))
ax.boxplot([female_sample['hrwage'], male_sample['hrwage']],
labels=['Female', 'Male'])
ax.set_ylabel('Hourly wage')
ax.set_title('Boxplot of wourly wages by sex')
plt.show()
/var/folders/lr/g27h7d295svcp5fxx4r1hn080000gn/T/
ipykernel_98489/756544564.py:2: MatplotlibDeprecationWarning: The
'labels' parameter of boxplot() has been renamed 'tick_labels' since
Matplotlib 3.9; support for the old name will be dropped in 3.11.
ax.boxplot([female_sample['hrwage'], male_sample['hrwage']],
labels=['Female', 'Male'])
Histograms and boxplots indicates, that female hourly wages median is lower than male hourly
wages median. Also boxplots indicate that the maximum hourly wage for women doesn't exceed
15 units per hour, while the male wage reaches up to 35 units per hour in some cases. Mean
hourly wage for female is lower than male. And in general variance of female hourly wage (hw)
less than male hw variance. It means that women earn approximately the same low amount per
hour. (In my opinion, this can be related to poorer access to quality education and consequently
to the choice of high-paying jobs.)
female_location_measures= {"Mean":
round(float(female_sample['hrwage'].mean()), 2),
"Median":
round(float(female_sample['hrwage'].median()), 2),
"Lower quartile":
round(float(female_sample['hrwage'].quantile(0.25)), 2),
"Upper quartile":
round(float(female_sample['hrwage'].quantile(0.75)), 2)
}
female_measures_of_variation = {"range":
round(float(female_sample['hrwage'].max() female_sample['hrwage'].min()), 2),
"Interquartile range":
round(float(female_location_measures["Upper quartile"] female_location_measures["Lower quartile"]), 2),
"Variance":
round(float(female_sample['hrwage'].var()), 2)}
# female_location_measures, female_measures_of_variation
male_location_measures= {"Mean":
round(float(male_sample['hrwage'].mean()), 2),
"Median":
round(float(male_sample['hrwage'].median()), 2),
"Lower quartile":
round(float(male_sample['hrwage'].quantile(0.25)), 2),
"Upper quartile":
round(float(male_sample['hrwage'].quantile(0.75)), 2)
}
male_measures_of_variation = {"range":
round(float(male_sample['hrwage'].max() male_sample['hrwage'].min()), 2),
"Interquartile range":
round(float(male_location_measures["Upper quartile"] male_location_measures["Lower quartile"]), 2),
"Variance":
round(float(male_sample['hrwage'].var()), 2)}
# male_location_measures, male_measures_of_variation
In the table below can see difference between average hourly wage in female and male. The
difference is significant. Of course, I remember, that mean is sensetive to outliers. But median
hourly wage of men is more than 70% higher than the median hourly wage of women. This is an
indication of discrimination. Also measures of variation indicates that women in average rarely
have high-paying jobs. Interquartile range (IQR) indicates how spread out the middle half of the
data is, ignoring the very low and very high values. IQR shows us how typical the hourly wages
are in the middle half of the group. Also range for feamle 10.87 indicates that female earn
approximately the same low amount per hour compared to male, where range is 34.79.
Сomparative table of location and variation measures
Measure
Female
Male
Mean
3.56
6.29
Median
3.2
5.54
Lower Quartile
2.31
3.96
Upper Quartile
4.61
7.35
Range
10.87
34.79
Interquartile Range
2.3
3.39
Location Measures
Measures of Variation
Measure
Female
Male
Variance
3.65
17.35
1.
In this task I consider the grouping of data (from original dataset HW1 dataset.csv)
for
• Group1 (low wage) with hrwage ≤ 3
• Group2 (medium wage) with 3 < hrwage ≤ 6
• Group3 (heigh wage) with hrwage > 6
Below is the contingency table with absolute and relative frequencies. We can see that more
than 70% of Group 1 consists of women. However, as the hourly wage increases (in Groups 2
and 3), the proportion of women decreases. This indicates inequality in hourly wages between
females and males.
I have used this guideline to calculate contigency coefficient of Pearson(CCP).
I have got CCP equals to 0.55, which indicate strong relation between sex and hourly wage.
Accorging to this data set, there is correlation between being female and earning less.
We can not apply Pearson or Spearmann correlation coefficients to make some conclusions
about the relation of interest, because scale of measurement should be interval or ratio, but in
our case this scale is nominal (male and group of hourly wage).
Group #
Female
Total in
group
Male
Count
%
Count
%
Count
%
Group 1
(low
wage)
90
73%
34
27%
124
100%
Group 2
(medium
wage)
97
41%
141
59%
238
100%
Group 3
(heigh
wage)
20
14%
118
86%
138
100%
If you don't like numbers, look at this table! You can see the same, but in more pleasant way for
you 👇
contigency_table = np.array([[90, 34], [97, 141], [20, 118]])
obs = pd.DataFrame(
contigency_table,
columns=["female", "male",],
index=["Group 1", "Group 2", "Group 3"])
mosaic(obs.stack(), title="Observations")
plt.show()
obs
Group 1
Group 2
Group 3
female
90
97
20
male
34
141
118
group_1 = data_set[data_set['hrwage'] <= 3]
group_2 = data_set[(data_set['hrwage'] > 3) & (data_set['hrwage'] <=
6)]
group_3 = data_set[data_set['hrwage'] > 6]
print(f"Dataset size: {df.size}\nGroup 1 size: {group_1.shape[0]}\
nGroup 2 size: {group_2.shape[0]}\nGroup 3 size: {group_3.shape[0]}")
Dataset size: 500
Group 1 size: 124
Group 2 size: 238
Group 3 size: 138
female_group_1 = group_1[group_1['male'] == 0]
male_group_1 = group_1[group_1['male'] == 1]
female_number_group_1 = female_group_1.shape[0]
female_percent_group_1 =
round(female_number_group_1/group_1.shape[0]*100, 2)
print("Group 1")
print(f"Fem num: {female_number_group_1}; fem percent:
{female_percent_group_1}\nMale num:: {group_1.shape[0] female_number_group_1}; Male percent: {100-female_percent_group_1}")
Group 1
Fem num: 90; fem percent: 72.58
Male num:: 34; Male percent: 27.42
female_group_2 = group_2[group_2['male'] == 0]
male_group_2 = group_2[group_2['male'] == 1]
female_number_group_2 = female_group_2.shape[0]
female_percent_group_2 =
round(female_number_group_2/group_2.shape[0]*100, 2)
print("Group 2")
print(f"Fem num: {female_number_group_2}; fem percent:
{female_percent_group_2}\nMale num:: {group_2.shape[0] female_number_group_2}; Male percent: {100-female_percent_group_2}")
Group 2
Fem num: 97; fem percent: 40.76
Male num:: 141; Male percent: 59.24
female_group_3 = group_3[group_3['male'] == 0]
male_group_3 = group_3[group_3['male'] == 1]
female_number_group_3 = female_group_3.shape[0]
female_percent_group_3 =
round(female_number_group_3/group_3.shape[0]*100, 2)
print("Group 3")
print(f"Fem num: {female_number_group_3}; fem percent:
{female_percent_group_3}\nMale num:: {group_3.shape[0] female_number_group_3}; Male percent: {100-female_percent_group_3}")
Group 3
Fem num: 20; fem percent: 14.49
Male num:: 118; Male percent: 85.51
Below I calculate contigency coefficient of Pearson
chisqVal = np.sum((contigency_table - expected_freq(obs)) ** 2 /
expected_freq(obs))
chisqVal
np.float64(90.91665766095004)
C_star = np.sqrt(chisqVal / (np.sum(contigency_table) + chisqVal))
C_star
np.float64(0.3922460821192232)
count_row = obs.shape[0]
count_col = obs.shape[1]
r, c = obs.shape
k = min(r, c)
C_star_max = np.sqrt((k - 1) / k)
contigency_coeff_of_Pearson = C_star / C_star_max
contigency_coeff_of_Pearson
np.float64(0.5547197291207162)
Download