Stata can handle dates from Jan 1, 100 to December 31, 9999 (although one should be cautious in dealing with dates before October 15, 1582 when the Gregorian calendar was put into effect). However, as these examples will illustrate, your data and Stata programs may need some mending to be prepared for years 2000 and beyond. This page contains example programs focusing on two main problems, data files which use 2 digits to specify the year (e.g., 12/25/98) and displaying dates using only 2 digits for the year (e.g., December 25, 98).
Imagine that it is the summer of 2001 and you would like to create a very simple Stata data file containing the names and birthdays of your friends. The names and birthdates of your friends are...
Noel was born December 25, 1903 (Christmas Day)
Hank was born February 29, 1956 (A leap year)
Mary was born December 31, 1999 (New Year's Eve, before the year 2000)
Eric was born January 1, 2000 (New Year's Day, year 2000)
Jane was born July 4, 2001 (Born on the 4th of July)
Ok, so it is a short list, and a couple of your friends (Mary, Eric and Jane) are a little bit young, but this list will help demonstrate problems which can arise when dealing with dates in the Year 2000 and beyond. We start with an example where everything is fine, where the data uses 4 digits to indicate the year of birth, and the date of birth is displayed using 4 digits for the year. If your data files and programs are like this example, your data and Stata programs may be fully ready for the Year 2000.
Data Dictionary, friendsa.dct
infile dictionary {
str4 name
str10 bday
}
Noel 12/25/1903
Hank 02/29/1956
Mary 12/31/1999
Eric 01/01/2000
Jane 07/04/2001
Stata Program
infile using friendsa.dct
gen bdate = date(bday,"mdy")
format bdate %dM-D-CY
list
Output from Stata Program
name bday bdate
1. Noel 12/25/1903 December-25-1903
2. Hank 02/29/1956 February-29-1956
3. Mary 12/31/1999 December-31-1999
4. Eric 01/01/2000 January-01-2000
5. Jane 07/04/2001 July-04-2001
In Example 1 above, Stata is used to read the names and birthdates of your friends, and then the list command is used to display their names and birthdates. This example has two nice features.
By using 4 digit years to store the birthdays, and by displaying the birthdates using 4 digit years, this program is ready for the Year 2000. In fact, this program is ready for the year 2100, 2200, all the way up to the year 9999. However, if a different Stata format is used for displaying the birth dates, you may not be sure when some of your friends were born, as shown in Example 2 below.
Data Dictionary, friendsa.dct
infile dictionary {
str4 name
str10 bday
}
Noel 12/25/1903
Hank 02/29/1956
Mary 12/31/1999
Eric 01/01/2000
Jane 07/04/2001
Stata Program
infile using friendsa.dct
gen bdate = date(bday,"mdy")
format bdate %dM-D-Y
list
Output from Stata Program
name bday bdate
1. Noel 12/25/1903 December-25-03
2. Hank 02/29/1956 February-29-56
3. Mary 12/31/1999 December-31-99
4. Eric 01/01/2000 January-01-00
5. Jane 07/04/2001 July-04-01
Example 2 is just like Example 1, except that the %dM-D-Y format is used to display the birthdays. As you can see, this format only displays the last 2 digits of the year of birth, leaving you to wonder in what century some of your friends were born. When dealing with dates which are in the year 2000 and beyond, it is important to choose a display format which will display dates using 4 digits for the year (e.g., %dM-D-CY). However, there is a greater problem if the data only includes 2 digits for the year of birth, as shown in Example 3 below.
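The point that a display format changes only what is shown, not what is stored, can be checked directly. The following is a minimal sketch, not part of the original examples: it types in a hypothetical two-row version of the friends data with made-up variable names (mm, dd, yyyy, byear) and builds the date with mdy() rather than reading it from a dictionary file.

```stata
* Hypothetical two-row version of the friends data, typed in directly.
clear
input str4 name mm dd yyyy
Noel 12 25 1903
Eric  1  1 2000
end

* Build a Stata date from numeric month, day, and 4 digit year.
gen bdate = mdy(mm, dd, yyyy)

* A 2 digit display format hides the century on screen...
format bdate %dM-D-Y
list name bdate

* ...but the stored value still carries the full year.
gen byear = year(bdate)
list name bdate byear
```

If the sketch behaves as intended, the first list shows years 03 and 00, while byear in the second list shows 1903 and 2000. Switching the format back to %dM-D-CY is then enough to display the full year.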
Data Dictionary, friendsb.dct
infile dictionary {
str4 name
str8 bday
}
Noel 12/25/03
Hank 02/29/56
Mary 12/31/99
Eric 01/01/00
Jane 07/04/01
Stata Program
infile using friendsb.dct
gen bdate = date(bday,"md19y")
format bdate %dM-D-CY
list
Output from Stata Program
name bday bdate
1. Noel 12/25/03 December-25-1903
2. Hank 02/29/56 February-29-1956
3. Mary 12/31/99 December-31-1999
4. Eric 01/01/00 January-01-1900
5. Jane 07/04/01 July-04-1901
Example 3 demonstrates a problem with inputting dates using only a 2 digit year. For example, Eric was born on Jan 1, 2000 but his birthday is input as 01/01/00. With a 2 digit year, the date(bday,"md19y") expression ASSUMES that the century portion is 19. (Note that the date(bday,"md20y") expression would assume the century portion is 20.) As you can see in the output, Eric is incorrectly assigned a birthday of Jan 1, 1900, and Jane a birthday of July 4, 1901. In this simple example we could enter the data all over again using 4 digits for the year of birth. However, you may have data files with thousands or millions of records using dates with 2 digit years. Example 4, shown below, illustrates a possible solution to this problem by telling Stata when to treat a birthday as coming from the 1900s and when to treat a birthday as coming from the 2000s.
Data Dictionary, friendsb.dct
infile dictionary {
str4 name
str8 bday
}
Noel 12/25/03
Hank 02/29/56
Mary 12/31/99
Eric 01/01/00
Jane 07/04/01
Stata Program
infile using friendsb.dct
gen bdate = date(bday,"md19y")
gen bdate_y = year(bdate)
replace bdate = date(bday,"md20y") if bdate_y <= 1902
format bdate %dM-D-CY
list name bdate
Output from Stata Program
name bdate
1. Noel December-25-1903
2. Hank February-29-1956
3. Mary December-31-1999
4. Eric January-01-2000
5. Jane July-04-2001
Example 4 demonstrates using a replace ... if statement to deal with dates which have 2 digit years. The replace ... if statement instructs Stata to replace the birthdate with one where the century portion of the date is 20 IF the year was first read as 1902 or earlier (otherwise, no change is made to the birthdate). This strategy draws a line at a certain year (in this case 1902). Years above that line are treated as being from the 1900s (e.g., 03 to 99 are treated as 1903-1999), but years at or below it (1900-1902) are treated as coming from the 2000s (2000-2002). As you can see in this output, this seems to have mended our problem with the birthdays using 2 digit years. Eric and Jane are now properly understood to have birthdays in the years 2000 and 2001 respectively. However, Example 5 below shows a major weakness in this strategy, when a 2 digit year could mean 19xx or 20xx.
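The same cutoff idea can also be written so that the dividing line is visible as a single number. The following is a sketch of an alternative, not the method used in Example 4: it splits the mm/dd/yy string with substr() and real(), applies the cutoff with cond(), and rebuilds the date with mdy(). The local macro cutoff and the variables mm, dd, yy, and yyyy are hypothetical names introduced for illustration, and the data are typed in with input instead of being read from friendsb.dct.

```stata
* The Example 4 friends, typed in directly for a self-contained sketch.
clear
input str4 name str8 bday
Noel "12/25/03"
Hank "02/29/56"
Mary "12/31/99"
Eric "01/01/00"
Jane "07/04/01"
end

* Hypothetical cutoff: 2 digit years at or below this value are read as 20xx.
local cutoff 2

* Pull the numeric pieces out of the mm/dd/yy string.
gen mm = real(substr(bday, 1, 2))
gen dd = real(substr(bday, 4, 2))
gen yy = real(substr(bday, 7, 2))

* Apply the window: 00-02 become 2000-2002, 03-99 become 1903-1999.
gen yyyy = cond(yy <= `cutoff', 2000 + yy, 1900 + yy)

* Rebuild the date from its parts and show it with a 4 digit year.
gen bdate = mdy(mm, dd, yyyy)
format bdate %dM-D-CY
list name bday bdate
```

Keeping the cutoff in one place makes it easier to see, and to change, which 2 digit years are sent to which century. It does not remove the underlying ambiguity: any single cutoff still assigns every 03 to the same century.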
Data Dictionary, friendsc.dct
infile dictionary {
str4 name
str8 bday
}
Noel 12/25/03
Hank 02/29/56
Mary 12/31/99
Eric 01/01/00
Jane 07/04/01
Will 10/31/03
Stata Program
infile using friendsc.dct
gen bdate = date(bday,"md19y")
gen bdate_y = year(bdate)
replace bdate = date(bday,"md20y") if bdate_y <= 1902
format bdate %dM-D-CY
list name bdate
Output from Stata Program
name bdate
1. Noel December-25-1903
2. Hank February-29-1956
3. Mary December-31-1999
4. Eric January-01-2000
5. Jane July-04-2001
6. Will October-31-1903
Example 5 demonstrates the major weakness of using this replace ... if strategy for solving the problems with 2 digit years. It is now Winter 2003 and you have a new friend, Will, born on October 31, 2003 (Halloween 2003). As you can see, Noel, born in 1903, and Will, born in 2003, both have a 2 digit birth year of 03. The IF condition cannot differentiate between Noel and Will, and in this case both are treated as being born in the 1900s.
Using this replace ... if strategy is only useful when you can clearly specify a cutoff year which divides years which should be treated as 19xx from years which should be treated as 20xx. However, when this line becomes blurred, this solution fails. You can permanently solve your problem by revising your data file to use 4 digit years (e.g., as shown in Example 1), but this could be very costly and time consuming, requiring you to entirely restructure your data files and shift column locations for all other variables. Example 6 shows a compromise solution by using a new variable to indicate the century portion of the date.
Data Dictionary, friendsd.dct
infile dictionary {
str4 name
str10 bday
int bday_yy
}
Noel 12/25/03
Hank 02/29/56
Mary 12/31/99
Eric 01/01/00 20
Jane 07/04/01 20
Will 10/31/03 20
Stata Program
infile using friendsd.dct
gen bdate = date(bday,"md19y")
gen bdate_y = year(bdate)
replace bdate = date(bday,"md20y") if (bday_yy == 20)
format bdate %dM-D-CY
list name bdate
Output from Stata Program
name bdate
1. Noel December-25-1903
2. Hank February-29-1956
3. Mary December-31-1999
4. Eric January-01-2000
5. Jane July-04-2001
6. Will October-31-2003
Example 6 solves the problem of the dates with 2 digit years by creating a separate variable indicating the century portion of the date. As you can see in the output, everyone is correctly assigned the proper birthdate because the data EXPLICITLY indicates which birthdates should have a 20 prefixed to the year (using the bday_yy variable).
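Once the century is stored in its own variable, the 4 digit year can also be computed arithmetically. The following is a sketch under stated assumptions, not the program from Example 6: the data are typed in with input, a missing bday_yy is recorded explicitly as "." and is assumed to mean century 19, and the variables cc and yyyy are hypothetical names introduced for illustration.

```stata
* The Example 6 friends, typed in directly; "." marks a missing century.
clear
input str4 name str8 bday bday_yy
Noel "12/25/03" .
Hank "02/29/56" .
Mary "12/31/99" .
Eric "01/01/00" 20
Jane "07/04/01" 20
Will "10/31/03" 20
end

* Assumption: a missing century means the 1900s.
gen cc = cond(bday_yy == ., 19, bday_yy)

* Combine the century with the 2 digit year taken from the string.
gen yyyy = 100*cc + real(substr(bday, 7, 2))

* Rebuild the date from month, day, and the 4 digit year.
gen bdate = mdy(real(substr(bday, 1, 2)), real(substr(bday, 4, 2)), yyyy)
format bdate %dM-D-CY
list name bday bday_yy bdate
```

Because the century comes from the data and not from a cutoff rule, Noel (03 with no century recorded) and Will (03 with century 20) end up in different centuries.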
Conclusion
These examples illustrate some of the problems which will arise when using Stata to process dates for the Year 2000 and beyond. For more information, please see the links on our Statistical Computing and the Year 2000 page. For assistance solving Year 2000 problems in Statistical Computing, feel free to use the Statistical Consulting Services provided by the UCLA Academic Technology Services.
These pages are Copyrighted (c) by UCLA Academic Technology Services