Gsort - Stata

Title
gsort — Ascending and descending sort
Syntax
gsort [+|-] varname [[+|-] varname ...] [, generate(newvar) mfirst]
Menu
Data > Sort
Description
gsort arranges observations to be in ascending or descending order of the specified variables and
so differs from sort in that sort produces ascending-order arrangements only; see [D] sort.
Each varname can be numeric or a string.
The observations are placed in ascending order of varname if + or nothing is typed in front of the
name and are placed in descending order if - is typed.
Options
generate(newvar) creates newvar containing 1, 2, 3, ... for each group denoted by the ordered
data. This is useful when using the ordering in a subsequent by operation; see [U] 11.5 by varlist:
construct and the examples below.
mfirst specifies that missing values be placed first in descending orderings rather than last.
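As a brief illustration of both options, the following sketch uses the auto dataset that appears in the
examples below. It assumes that rep78 contains some missing values; grp and n_in_grp are hypothetical
variable names chosen for this illustration.
. use http://www.stata-press.com/data/r13/auto, clear
. * descending order of rep78, missing values first; grp numbers the groups 1, 2, 3, ...
. gsort -rep78, mfirst generate(grp)
. list make rep78 grp in 1/10
. * grp is in ascending order, so it can be used in a by operation
. by grp: generate n_in_grp = _N
Without mfirst, the observations with missing rep78 would fall at the end of the listing instead.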
Remarks and examples
gsort is almost a plug-compatible replacement for sort, except that you cannot specify a general
varlist with gsort. For instance, sort alpha-gamma means to sort the data in ascending order of
alpha; within equal values of alpha, to sort on the next variable in the dataset (presumably beta);
within equal values of alpha and beta, to sort on the variable after that; and so on through gamma.
gsort alpha-gamma, by contrast, would be interpreted as gsort alpha -gamma, meaning to sort the
data in ascending order of alpha and, within equal values of alpha, in descending order of gamma.
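The difference can be seen with the auto dataset used in the examples below, in which price, mpg,
and rep78 are stored consecutively. The following is a sketch for illustration only.
. use http://www.stata-press.com/data/r13/auto, clear
. * sort expands the varlist: ascending price, then mpg, then rep78
. sort price-rep78
. * gsort reads the hyphen as a minus sign: ascending price, descending rep78
. gsort price-rep78
. * to sort on the same keys that sort used, name each variable
. gsort price mpg rep78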
Example 1
The difference in varlist interpretation aside, gsort can be used in place of sort. To list the 10
lowest-priced cars in the data, we might type
. use http://www.stata-press.com/data/r13/auto
. gsort price
. list make price in 1/10
or, if we prefer,
. gsort +price
. list make price in 1/10
To list the 10 highest-priced cars in the data, we could type
. gsort -price
. list make price in 1/10
gsort can also be used with string variables. To list all the makes in reverse alphabetical order,
we might type
. gsort -make
. list make
Example 2
gsort can be used with multiple variables. Given a dataset on hospital patients with multiple
observations per patient, typing
. use http://www.stata-press.com/data/r13/bp3
. gsort id time
. list id time bp
lists each patient’s blood pressures in the order the measurements were taken. If we typed
. gsort id -time
. list id time bp
then each patient’s blood pressures would be listed in reverse time order.
Technical note
Say that we wished to attach to each patient’s records the lowest and highest blood pressures
observed during the hospital stay. The easier way to achieve this result is with egen’s min() and
max() functions:
. egen lo_bp = min(bp), by(id)
. egen hi_bp = max(bp), by(id)
See [D] egen. Here is how we could do it with gsort:
. use http://www.stata-press.com/data/r13/bp3, clear
. gsort id bp
. by id: gen lo_bp = bp[1]
. gsort id -bp
. by id: gen hi_bp = bp[1]
. list, sepby(id)
This works, even in the presence of missing values of bp, because such missing values are placed
last within arrangements, regardless of the direction of the sort.
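One way to confirm that the two approaches agree is to compute the egen versions under different
names and compare them with assert. This is a sketch that continues from the commands above;
lo_chk and hi_chk are hypothetical names introduced here.
. egen lo_chk = min(bp), by(id)
. egen hi_chk = max(bp), by(id)
. * assert stops with an error if any observation fails the comparison
. assert lo_bp == lo_chk
. assert hi_bp == hi_chk
Note that the gsort approach relies on the default placement of missing values: had mfirst been
specified with gsort id -bp, bp[1] would be missing for any patient with a missing measurement.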
Technical note
Assume that we have a dataset containing x for which we wish to obtain the forward and reverse
cumulatives. The forward cumulative is defined as F(X) = the fraction of observations such that
x ≤ X. Again let’s ignore the easier way to obtain the forward cumulative, which would be to use
Stata’s cumul command,
. set obs 100
. generate x = rnormal()
. cumul x, gen(cum)
(see [R] cumul). Eschewing cumul, we could type
. sort x
. by x: gen cum = _N if _n==1
. replace cum = sum(cum)
. replace cum = cum/cum[_N]
That is, we first place the data in ascending order of x; we used sort but could have used gsort.
Next, for each observed value of x, we generated cum containing the number of observations that
take on that value (you can think of this as the discrete density). We summed the density, obtaining
the distribution, and finally normalized it to sum to 1.
The reverse cumulative G(X) is defined as the fraction of data such that x ≥ X. To obtain this,
we could try simply reversing the sort:
. gsort -x
. by x: gen rcum = _N if _n==1
. replace rcum = sum(rcum)
. replace rcum = rcum/rcum[_N]
This would work, except for one detail: when it reaches the second line, Stata will complain that the
data are not sorted. Stata complains because it does not understand descending sorts (gsort is an
ado-file). To remedy this problem, gsort’s generate() option will create a new grouping variable
that is in ascending order (thus satisfying Stata’s narrow definition) and that is, in terms of the
groups it defines, identical to that of the true sort variables:
. gsort -x, gen(revx)
. by revx: gen rcum = _N if _n==1
. replace rcum = sum(rcum)
. replace rcum = rcum/rcum[_N]
Also see
[D] sort — Sort data