|
|
|
||||
|
|
|||||
Now, let's read in an example dataset, hsbs, a subset of the High School and Beyond data file with all of the variables as string variables.
As you see from the describe command below, the variables are all defined as string variables (e.g., science is str2, a string of length 2).use http://www.ats.ucla.edu/stat/stata/faq/hsbs, clear
Even though the variable science is defined as str2, you can see below that it contains just numeric values. Even so, because the variable is defined as str2, Stata cannot perform any kind of numerical analysis of the variable science. The same is true for the variable read.describe Contains data from hsbs.dta obs: 20 vars: 6 25 Nov 1999 19:09 size: 400 (99.6% of memory free) ------------------------------------------------------------------------------- 1. id str3 %3s 2. gender str1 %5s 3. race str5 %9s 4. schtyp str3 %5s 5. read str2 %5s 6. science str2 %5s ------------------------------------------------------------------------------- Sorted by:
list
id gender race schtyp read science
1. 70 m 1 pub 45 47
2. 121 f 1 pub 68 63
3. 86 m 1 pub 44 58
4. 141 m 1 pub 63 53
5. 172 m 1 pub 47 53
6. 113 m 1 pub 44 63
7. 50 m 3 pub 50 53
8. 11 m 2 pub 34 39
9. 84 m 1 pub 63 .
10. 48 m 3 pub 57 50
11. 75 m 1 pub 60 53
12. 60 m X pub 57 63
13. 95 m 1 pub 73 61
14. 104 m 1 pub 54 55
15. 38 m 3 pub 45 31
16. 115 m 1 pub 42 50
17. 76 m 1 pub 47 50
18. 195 m 1 pri 57
19. 114 m 1 pub 68 55
20. 85 m 1 pub 55 53
Let's try using the destring command
and see how it works.As you can see from the describe command below, the destring command converted all of the variables to numeric, except for race, gender and schtyp. Since these variables had characters in them, the destring command left such variables alone.destring , replace id has all characters numeric; replaced as int gender contains non-numeric characters; no replace race contains non-numeric characters; no replace schtyp contains non-numeric characters; no replace read has all characters numeric; replaced as byte science has all characters numeric; replaced as byte (2 missing values generated)
describe
Contains data from http://www.ats.ucla.edu/stat/stata/faq/hsbs.dta
obs: 20
vars: 6 25 Nov 1999 19:09
size: 340 (98.5% of memory free)
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
id int %10.0g
gender str1 %5s
race str5 %9s
schtyp str3 %5s
read byte %10.0g
science byte %10.0g
-------------------------------------------------------------------------------
Sorted by:
Note: dataset has changed since last saved
How do we convert gender and schtyp
into numeric values? We can use the encode command as shown
below. These commands create gender2 and schtyp2.Notice in the describe command below that gender2 and schtyp2 are numeric variables and they have labels associated with them (called gender2 and schtyp2).encode gender, generate(gender2) encode schtyp, generate(schtyp2)
describe
Contains data from http://www.ats.ucla.edu/stat/stata/faq/hsbs.dta
obs: 20
vars: 8 25 Nov 1999 19:09
size: 560 (99.8% of memory free)
-------------------------------------------------------------------------------
storage display value
variable name type format label variable label
-------------------------------------------------------------------------------
id str3 %3s
gender str1 %5s
race str5 %9s
schtyp str3 %5s
read str2 %5s
science str2 %5s
gender2 long %8.0g gender2
schtyp2 long %8.0g schtyp2
-------------------------------------------------------------------------------
Sorted by:
Note: dataset has changed since last saved
If we list out the data, it appears that gender2
and schtyp2 are identical to gender and schtyp, however
they are really numeric and what you are seeing are the value labels
associated with the variables.
list
id gen~r race sch~p read sci~e gender2 schtyp2
1. 70 m 1 pub 45 47 m pub
2. 121 f 1 pub 68 63 f pub
3. 86 m 1 pub 44 58 m pub
4. 141 m 1 pub 63 53 m pub
5. 172 m 1 pub 47 53 m pub
6. 113 m 1 pub 44 63 m pub
7. 50 m 3 pub 50 53 m pub
8. 11 m 2 pub 34 39 m pub
9. 84 m 1 pub 63 . m pub
10. 48 m 3 pub 57 50 m pub
11. 75 m 1 pub 60 53 m pub
12. 60 m X pub 57 63 m pub
13. 95 m 1 pub 73 61 m pub
14. 104 m 1 pub 54 55 m pub
15. 38 m 3 pub 45 31 m pub
16. 115 m 1 pub 42 50 m pub
17. 76 m 1 pub 47 50 m pub
18. 195 m 1 pri 57 m pri
19. 114 m 1 pub 68 55 m pub
20. 85 m 1 pub 55 53 m pub
Below we use the nolabel option and you see
that gender2 and schtyp2 are really numeric.
list, nolabel
id gen~r race sch~p read sci~e gender2 schtyp2
1. 70 m 1 pub 45 47 2 2
2. 121 f 1 pub 68 63 1 2
3. 86 m 1 pub 44 58 2 2
4. 141 m 1 pub 63 53 2 2
5. 172 m 1 pub 47 53 2 2
6. 113 m 1 pub 44 63 2 2
7. 50 m 3 pub 50 53 2 2
8. 11 m 2 pub 34 39 2 2
9. 84 m 1 pub 63 . 2 2
10. 48 m 3 pub 57 50 2 2
11. 75 m 1 pub 60 53 2 2
12. 60 m X pub 57 63 2 2
13. 95 m 1 pub 73 61 2 2
14. 104 m 1 pub 54 55 2 2
15. 38 m 3 pub 45 31 2 2
16. 115 m 1 pub 42 50 2 2
17. 76 m 1 pub 47 50 2 2
18. 195 m 1 pri 57 2 1
19. 114 m 1 pub 68 55 2 2
20. 85 m 1 pub 55 53 2 2
What about the variable race?
It is still a character variable because our prior destring command saw
the X in the data and did not attempt to convert it because it had
non-numeric values. Below we can convert it to numeric by include the ignore(X)
option that tells destring to convert the variable to numeric and when
it encounters X to convert that to a missing value. You can see
the results in the list command below.
destring race, replace ignore(X)
race: characters X removed; replaced as byte
(1 missing value generated)
list
id gen~r race sch~p read sci~e gender2 schtyp2
1. 70 m 1 pub 45 47 m pub
2. 121 f 1 pub 68 63 f pub
3. 86 m 1 pub 44 58 m pub
4. 141 m 1 pub 63 53 m pub
5. 172 m 1 pub 47 53 m pub
6. 113 m 1 pub 44 63 m pub
7. 50 m 3 pub 50 53 m pub
8. 11 m 2 pub 34 39 m pub
9. 84 m 1 pub 63 . m pub
10. 48 m 3 pub 57 50 m pub
11. 75 m 1 pub 60 53 m pub
12. 60 m . pub 57 63 m pub
13. 95 m 1 pub 73 61 m pub
14. 104 m 1 pub 54 55 m pub
15. 38 m 3 pub 45 31 m pub
16. 115 m 1 pub 42 50 m pub
17. 76 m 1 pub 47 50 m pub
18. 195 m 1 pri 57 m pri
19. 114 m 1 pub 68 55 m pub
20. 85 m 1 pub 55 53 m pub
As you have seen, we can use destring
to convert string variables that contain numbers into numeric variables, and
it can handle situations where some values are stored as a character (like the
X we saw with race). If you have a character variable that is
stored as all characters, you can use encode to convert the character
variable to numeric and it will create value labels that have the values that
were stored with the character variable.Fore more information, see the help or reference manual about the destring and encode commands.
UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services