WP.31

Effects of Rounding on Data Quality

Jay J. Kim, Lawrence H. Cox, Myron Katzoff, Joe Fred Gonzalez, Jr.

U.S. National Center for Health Statistics

I. Introduction

Reasons for rounding

• Rounding noninteger values to integer values for statistical purposes;

• To enhance readability of the data;

• To protect confidentiality of records in the file;

• To keep the important digits only.

Purpose:

Evaluate the effects of four rounding methods on data quality and utility in two ways:

• (1) bias and variance;

• (2) effects on the underlying distribution of the data, as determined by a distance measure.

B: base

$x = q_x B + r_x$

$q_x$: quotient, $r_x$: remainder

$R(x) = q_x B + R(r_x)$

Types of rounding:

• Unbiased rounding: E[R(r)|r] = r

• Sum-unbiased rounding: E[R(r)] = E(r)


II. Four rounding rules

1. Conventional rounding

Suppose r = 0, 1, 2, ..., 9 (= B − 1).

If r ≥ 5 (= B/2), round r up to 10 (= B); else round r down to zero (0).

If B is odd, round r up when $r \ge (B+1)/2$.

r is assumed to follow a discrete uniform distribution

2. Modified Conventional rounding

Same as conventional rounding, except rounding 5 (B/2) up or down with probability ½.

3. Zero-restricted 50/50 rounding

Except zero (0), round r up or down with probability ½.

4. Unbiased rounding rule

Round r up with probability r/B

Round r down with probability 1 - r/B
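The four rules can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the names `round_remainder`, `round_value`, and the rule labels are hypothetical, and it assumes nonnegative integer data, so that the remainder r is an integer in 0..B−1.

```python
import random

def round_remainder(r, B, rule, rng=random):
    """Return R(r): 0 (round down) or B (round up), for an integer 0 <= r < B."""
    if not 0 <= r < B:
        raise ValueError("r must satisfy 0 <= r < B")
    if rule == "conventional":
        # Round up when r >= B/2; for odd B this is the same as r >= (B+1)/2.
        up = 2 * r >= B
    elif rule == "modified":
        # As conventional, except the midpoint B/2 goes up or down with probability 1/2.
        up = rng.random() < 0.5 if 2 * r == B else 2 * r > B
    elif rule == "fifty_fifty":
        # Zero stays at zero; any other remainder goes up with probability 1/2.
        up = r != 0 and rng.random() < 0.5
    elif rule == "unbiased":
        # Round up with probability r/B, down with probability 1 - r/B.
        up = rng.random() < r / B
    else:
        raise ValueError(f"unknown rule: {rule}")
    return B if up else 0

def round_value(x, B, rule, rng=random):
    """Round a nonnegative integer x to a multiple of B via R(x) = q*B + R(r)."""
    q, r = divmod(x, B)
    return q * B + round_remainder(r, B, rule, rng)
```

For example, `round_value(37, 10, "conventional")` returns 40 and `round_value(34, 10, "conventional")` returns 30, while the three randomized rules return either neighbouring multiple of B with the stated probabilities.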

III. Mean and variance

III.1 Mean and variance of unrounded number r = 0, 1, 2, 3, ..., B−1

$$E(r) = \frac{B-1}{2}, \qquad V(r) = \frac{B^2-1}{12}$$

In general,

$$V(x) = V[E(x \mid q)] + E[V(x \mid q)]$$

$$V(x) = B^2\,V(q) + \frac{B^2-1}{12}$$

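III.2 Conventional rounding when B is even

$E[R(r)] = \frac{B}{2}$, versus $\frac{B-1}{2}$ for the unrounded number.

$V[R(r)] = \frac{B^2}{4}$, versus $\frac{B^2-1}{12}$ for the unrounded number.

With bias $E[R(r)] - E(r) = \frac{1}{2}$, we have $V[R(r)] = \frac{B^2}{4}$ and $MSE[R(r)] = \frac{B^2+1}{4}$.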

III.3 Conventional rounding when B is odd

$E[R(r)] = \frac{B-1}{2}$, $V[R(r)] = \frac{B^2-1}{4}$, versus $\frac{B^2-1}{12}$ for the unrounded number.

$V[R(r)] = 3\,V(r)$

Modified conventional rounding, 50/50 rounding, and unbiased rounding have the same mean, variance, and MSE as conventional rounding with odd B.
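These moments can be checked by exact enumeration. The sketch below assumes r is uniform on 0..B−1 and takes MSE to mean variance plus squared bias relative to E(r) = (B−1)/2; the function names are hypothetical and the code is only a check on the stated results.

```python
from fractions import Fraction

def prob_up(r, B, rule):
    """P(r is rounded up to B) under each rule, as an exact fraction."""
    if rule == "conventional":
        return Fraction(1 if 2 * r >= B else 0)
    if rule == "modified":
        if 2 * r == B:
            return Fraction(1, 2)
        return Fraction(1 if 2 * r > B else 0)
    if rule == "fifty_fifty":
        return Fraction(0) if r == 0 else Fraction(1, 2)
    if rule == "unbiased":
        return Fraction(r, B)
    raise ValueError(rule)

def moments(B, rule):
    """Mean, variance and MSE of R(r) when r is uniform on 0..B-1."""
    # R(r) is B with probability p and 0 otherwise, so it is B times a Bernoulli(p).
    p = sum(prob_up(r, B, rule) for r in range(B)) / B
    mean = B * p
    var = B * B * p * (1 - p)
    bias = mean - Fraction(B - 1, 2)
    return mean, var, var + bias * bias
```

With B = 10, `moments(10, "conventional")` gives mean 5, variance 25, and MSE 101/4, while the other three rules all give mean 9/2 and variance and MSE 99/4, which matches (B²−1)/4.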


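IV. Distance measure

$$U = \sum_x \frac{[R(x) - x]^2}{x} = \sum_x \frac{[R(r_x) - r_x]^2}{x}$$

Define

$$\delta_x = \begin{cases} 1, & \text{if } r_x \text{ is rounded up} \\ 0, & \text{otherwise.} \end{cases}$$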
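As a small illustration of the distance measure, the sketch below evaluates U for a list of nonnegative integers, using the indicator that equals 1 when the remainder is rounded up. The names `distance_u` and `conventional_up` are hypothetical, and skipping x = 0 (where the term is undefined) is an assumption of this sketch, not something the slides state.

```python
def conventional_up(r, B):
    """Indicator for conventional rounding: round up when r >= B/2."""
    return 2 * r >= B

def distance_u(values, B, round_up):
    """U = sum over x of (R(x) - x)^2 / x, skipping x == 0."""
    total = 0.0
    for x in values:
        if x == 0:
            continue  # assumption: the term is undefined at x = 0, so it is dropped
        q, r = divmod(x, B)
        delta = 1 if round_up(r, B) else 0   # 1 if r is rounded up, else 0
        rounded = q * B + B * delta          # R(x) = q*B + B*delta
        total += (rounded - x) ** 2 / x
    return total
```

For the values 3, 7, 12, 25 with B = 10, the rounded values are 0, 10, 10, 30, so U = 9/3 + 9/7 + 4/12 + 25/25.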

Reexpressing the numerator of U, we have

$$(B\delta_x - r_x)^2$$

With conventional rounding with B = 10, $\delta_x = 1$ with $r_x = 5, 6, \ldots, 9$.

Then we have

$$E_\delta\left[(B\delta_x - r_x)^2 \mid r_x\right] = \sum_{\delta_x = 0}^{1} (B\delta_x - r_x)^2\, P(\delta_x \mid r_x).$$

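Expected value of U

$$E(U) = E_q E_r E_\delta(U) = \sum_{q_x} \sum_{r_x} \sum_{\delta_x} \left[\frac{(B\delta_x - r_x)^2}{x}\right] P(\delta_x \mid r_x)\, P(r_x)\, P(q_x)$$

We define

$$U_1 = E_r\left[E_\delta(U \mid r) \mid q\right] = \sum_{r_x} \sum_{\delta_x = 0}^{1} \left[\frac{(B\delta_x - r_x)^2}{q_x B + r_x}\right] P(\delta_x \mid r_x)\, P(r_x)$$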

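IV.1 Conventional rounding with even B

$$U_1 = \frac{1}{B}\left[\sum_{r_x=1}^{B/2-1} \frac{r_x^2}{q_x B + r_x} + \sum_{r_x=B/2}^{B-1} \frac{(B - r_x)^2}{q_x B + r_x}\right],$$

which can be reexpressed as

$$U_1 = \frac{1}{B}\left[\sum_{r_x=1}^{B/2-1} r_x + \sum_{r_x=B/2}^{B-1} \frac{(B - r_x)^2}{r_x}\right] + \frac{1}{B}\left[\sum_{r_x=1}^{B/2-1} \frac{r_x^2}{q_x B + r_x} + \sum_{r_x=B/2}^{B-1} \frac{(B - r_x)^2}{q_x B + r_x}\right],$$

where the first bracket is the $q_x = 0$ contribution (first term) and the second bracket applies for $q_x \ge 1$ (second term). Since $q_x B + r_x \ge q_x B$,

$$U_1 \le \frac{1}{B}\left[\sum_{r_x=1}^{B/2-1} r_x + \sum_{r_x=B/2}^{B-1} \frac{(B - r_x)^2}{r_x}\right] + \frac{1}{q_x B^2}\left[\sum_{r_x=1}^{B/2-1} r_x^2 + \sum_{r_x=B/2}^{B-1} (B - r_x)^2\right].$$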

The upper and lower bounds for the harmonic series $H_n = 1 + \frac{1}{2} + \frac{1}{3} + \cdots + \frac{1}{n}$ are

$$\ln(n+1) < H_n \le 1 + \ln(n).$$

The upper bound for the first term of $U_1$:

$$B \ln\left[\frac{2(B-1)}{B-2}\right] - \frac{B+1}{2}$$
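The bound can be compared numerically with the sum it approximates. The sketch below reads the first term of $U_1$ as the $q_x = 0$ contribution, $\frac{1}{B}\left[\sum_{r=1}^{B/2-1} r + \sum_{r=B/2}^{B-1} (B-r)^2/r\right]$, and the bound as $B\ln[2(B-1)/(B-2)] - (B+1)/2$; both readings are assumptions drawn from the surrounding slides, and the function names are hypothetical.

```python
import math

def harmonic(n):
    """H_n = 1 + 1/2 + ... + 1/n."""
    return sum(1.0 / k for k in range(1, n + 1))

def first_term_exact(B):
    """q = 0 contribution to U_1 for conventional rounding with even B."""
    down = sum(r for r in range(1, B // 2))               # r^2 / r for r rounded down
    up = sum((B - r) ** 2 / r for r in range(B // 2, B))  # (B - r)^2 / r for r rounded up
    return (down + up) / B

def first_term_bound(B):
    """Upper bound B ln[2(B-1)/(B-2)] - (B+1)/2, defined for even B > 2."""
    return B * math.log(2 * (B - 1) / (B - 2)) - (B + 1) / 2
```

For B = 10 the exact sum is about 1.96 and the bound is about 2.61; for B = 1,000 the bound is about 193.65.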

Note the second term of E(U) is

$$E\left(\frac{1}{q_x}\right) \frac{1}{B^2}\left[\sum_{r_x=1}^{B/2-1} r_x^2 + \sum_{r_x=B/2}^{B-1} (B - r_x)^2\right] = \frac{B^2+2}{12B}\, E\left(\frac{1}{q_x}\right).$$

IV.2 Modified conventional rounding with even B

Has the same E(U) as the conventional rounding.


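IV.3 50/50 rounding

The first term of $U_1$ is bounded above by

$$\frac{B}{2}\ln(B-1) + \frac{1}{2}$$

The second term:

$$\frac{2B^2 - 3B + 1}{6B}\, E\left(\frac{1}{q_x}\right)$$

IV.4 Unbiased rounding

The first term of $U_1$:

$$\frac{B-1}{2}$$

The second term:

$$\frac{B^2-1}{6B}\, E\left(\frac{1}{q_x}\right)$$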

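IV.5 Comparisons of three rounding rules

| | Conv | 50/50 | Unbiased |
|---|---|---|---|
| 1st term | $B\ln\left[\frac{2(B-1)}{B-2}\right] - \frac{B+1}{2}$ | $\frac{B}{2}\ln(B-1) + \frac{1}{2}$ | $\frac{B-1}{2}$ |
| 2nd term | $\frac{B^2+2}{12B}\,E\left(\frac{1}{q_x}\right)$ | $\frac{2B^2-3B+1}{6B}\,E\left(\frac{1}{q_x}\right)$ | $\frac{B^2-1}{6B}\,E\left(\frac{1}{q_x}\right)$ |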

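Comparisons of three rounding rules

B = 10

| | Conv | 50/50 | Unbiased |
|---|---|---|---|
| 1st term | 2.61 | 11.49 (4.4) | 4.5 (1.7) |
| 2nd term | .85 $E(1/q_x)$ | 2.85 $E(1/q_x)$ | 1.65 $E(1/q_x)$ |

Values in parentheses are ratios to the conventional-rounding value.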

Comparisons of three rounding rules

B = 1,000

| | Conv | 50/50 | Unbiased |
|---|---|---|---|
| 1st term | 193.65 | 3,453.88 (18) | 499.5 (2.6) |
| 2nd term | 83.33 $E(1/q_x)$ | 332.83 $E(1/q_x)$ | 166.67 $E(1/q_x)$ |
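The tabulated values can be recomputed from the closed forms. The sketch below uses the expressions as read from the comparison in IV.5, namely first terms $B\ln[2(B-1)/(B-2)] - (B+1)/2$, $(B/2)\ln(B-1) + 1/2$, and $(B-1)/2$, and second-term coefficients $(B^2+2)/(12B)$, $(2B^2-3B+1)/(6B)$, and $(B^2-1)/(6B)$. That reading and the function names are assumptions of this sketch.

```python
import math

def first_terms(B):
    """First-term values for (conventional, 50/50, unbiased); even B > 2."""
    conv = B * math.log(2 * (B - 1) / (B - 2)) - (B + 1) / 2
    fifty = B / 2 * math.log(B - 1) + 0.5
    unbiased = (B - 1) / 2
    return conv, fifty, unbiased

def second_term_coefficients(B):
    """Coefficients that multiply E(1/q_x) for (conventional, 50/50, unbiased)."""
    conv = (B * B + 2) / (12 * B)
    fifty = (2 * B * B - 3 * B + 1) / (6 * B)
    unbiased = (B * B - 1) / (6 * B)
    return conv, fifty, unbiased

def ratios_to_conventional(B):
    """Ratios of the 50/50 and unbiased first terms to the conventional one."""
    conv, fifty, unbiased = first_terms(B)
    return fifty / conv, unbiased / conv
```

For B = 10, `first_terms(10)` is roughly (2.61, 11.49, 4.5) and `ratios_to_conventional(10)` is roughly (4.4, 1.7). The gap between the rules widens as B grows, because the 50/50 first term grows like B ln B while the other two grow linearly in B.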

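IV.6 E(1/q) for log-normal distribution

Let $y = \ln x$, with $y \sim N(\mu, \sigma^2)$.

Then, x has a lognormal distribution, i.e.,

$$f(x \mid \mu, \sigma^2) = \frac{1}{x\,\sigma\sqrt{2\pi}}\; e^{-\frac{1}{2}\left(\frac{\ln x - \mu}{\sigma}\right)^2}, \qquad x > 0.$$

Let

$$c = \int_1^\infty f(x \mid \mu, \sigma^2)\, dx.$$

Then, for q following this lognormal distribution,

$$E\left(\frac{1}{q} \,\Big|\, q \ge 1, \mu, \sigma^2\right) = \frac{1}{c}\; e^{\frac{1}{2}\sigma^2 - \mu}\left[1 - \Phi\left(\frac{\sigma^2 - \mu}{\sigma}\right)\right].$$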
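The closed form can be compared with direct numerical integration. The sketch below reads the result as $E(1/q \mid q \ge 1) = \frac{1}{c}\, e^{\sigma^2/2 - \mu}\,[1 - \Phi((\sigma^2 - \mu)/\sigma)]$ with $c = P(q \ge 1)$, which is an assumption about the garbled slide. It uses only the standard library, and the function names are hypothetical.

```python
import math

def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def e_inv_q_closed(mu, sigma):
    """E(1/q | q >= 1) from the closed form, with c = P(q >= 1) = P(ln q >= 0)."""
    c = 1.0 - phi(-mu / sigma)
    tail = 1.0 - phi((sigma ** 2 - mu) / sigma)
    return math.exp(0.5 * sigma ** 2 - mu) * tail / c

def e_inv_q_numeric(mu, sigma, steps=200_000):
    """Same quantity by midpoint integration over y = ln q on [0, max(mu, 0) + 12 sigma]."""
    upper = max(mu, 0.0) + 12.0 * sigma
    h = upper / steps
    num = den = 0.0
    for i in range(steps):
        y = (i + 0.5) * h
        w = math.exp(-0.5 * ((y - mu) / sigma) ** 2)  # unnormalized normal density
        num += math.exp(-y) * w                       # 1/q = exp(-y)
        den += w
    return num / den  # the normalizing constants and step size cancel in the ratio
```

The two functions agree to several decimal places for moderate parameters such as μ = 2, σ = 1, which supports this reading of the formula.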

IV.6 E(1/q) for Pareto distribution of 2nd kind

The Pareto distribution of the second kind is

$$f(q) = a\,k^a\,q^{-(a+1)}, \qquad a > 0, \; q \ge k > 0.$$

$$E\left(\frac{1}{q}\right) = \frac{1}{c}\,\frac{a}{(a+1)\,k},$$

where k is the minimum value of q and c is the cumulative probability from 1.

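IV.7 Upper limit for E(1/q) for multinomial distribution

The multinomial distribution has the form

$$f(q_1, q_2, \ldots, q_k \mid p_1, p_2, \ldots, p_k) = \frac{n!}{\prod_{i=1}^{k} q_i!} \prod_{i=1}^{k} p_i^{q_i}, \qquad q_i = 0, 1, 2, \ldots$$

Note,

$$E\left(\frac{1}{q_i}\right) \le E\left(\frac{1}{q_i + 1}\right) + E\left[\frac{3}{(q_i+1)(q_i+2)}\right] \quad \text{for all } i.$$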
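The bound in the Note rests on a pointwise inequality, read here as $1/q \le 1/(q+1) + 3/((q+1)(q+2))$ for integer $q \ge 1$. The relation sign is an assumption, since it did not survive in the slide. The sketch below checks it exactly with rational arithmetic, using hypothetical function names.

```python
from fractions import Fraction

def upper_bound_term(q):
    """Right-hand side 1/(q+1) + 3/((q+1)(q+2)) for a positive integer q."""
    return Fraction(1, q + 1) + Fraction(3, (q + 1) * (q + 2))

def gap(q):
    """Bound minus 1/q; algebraically this equals 2(q-1) / (q(q+1)(q+2))."""
    return upper_bound_term(q) - Fraction(1, q)
```

The gap is zero at q = 1 and positive for every larger q, so taking expectations over q ≥ 1 preserves the inequality.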

Let be the size of the category i and i n

  n

 i j

1

1 n i

1

E ( q q

2

. .

q k

)

 i k

1

[

5( n

1)( n

2) p r i

1  j

1 n j

2( n

1)( n

2) p i

2

2( n

2) p i

6

(1

 r i i

1  j

1 n j

)

]

V. Concluding comments

Various methods of rounding and in some applications various choices for rounding base B are available.

The question becomes: which method and/or base is expected to perform best in terms of data quality and preserving distributional properties of original data and, quantitatively, what is the expected distortion due to rounding?

This paper provides a preliminary analysis toward answering these questions.

References

Grab, E.L. & Savage, I.R. (1954). Tables of the Expected Value of 1/X for Positive Bernoulli and Poisson Variables. Journal of the American Statistical Association 49, 169-177.

Johnson, N.L. & Kotz, S. (1969). Distributions in Statistics: Discrete Distributions. Boston: Houghton Mifflin Company.

Johnson, N.L. & Kotz, S. (1970). Distributions in Statistics: Continuous Univariate Distributions-1. New York: John Wiley and Sons, Inc.

Kim, Jay J., Cox, L.H., Gonzalez, J.F. & Katzoff, M.J. (2004). Effects of Rounding Continuous Data Using Rounding Rules. Proceedings of the American Statistical Association, Survey Research Methods Section, Alexandria, VA, 3803-3807 (available on CD).

Chvatal, Vasek. Harmonic Numbers, Natural Logarithm and the Euler-Mascheroni Constant. See www.cs.rutgers.edu/~chvatal/notes/harmonic.html
