UCLA Academic Technology Services

Stata FAQ
How do I check that the same data input by two people are consistently entered?

When two people enter the same data (double data entry), the goal is to find out whether the two datasets disagree and, if so, where. We start by reading in the two datasets, one entered by person1 and the other by person2. After reading in each dataset, we sort it by the identifier variable, id, and save it.

clear
input id str8 name  age ht wt income
11 john    23 68 145 23000
12 charlie 25 72 178 45000
13 sally   21 64 135 12000
4  mike    34 70 156  5600
43 paul    30 73 189 15600
end

sort id
save person1, replace

clear
input id str8 name age ht wt income
11 john    23.5 68 145 23000
12 charles   25 52 178 45000
13 sally     21 64  .  12000
4  michael   34 70 156  5600
43 Paul      30 73 189  5600
end

sort id
save person2, replace
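Everything that follows relies on id identifying exactly one record in each file. As a minimal sketch, assuming the person1 and person2 files saved above, the isid command can confirm this before any comparison is made. The loop variable f is only an illustrative name.

```stata
* Check that id uniquely identifies the observations in each saved file.
* isid stops with an error if id has duplicate (or missing) values.
foreach f in person1 person2 {
    use `f', clear
    isid id
}
```

If isid completes silently for both files, id is safe to use as the matching variable.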

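We compare the two datasets with the cf command to see whether any discrepancies exist. Because cf compares the files observation by observation, both must be in the same sort order, which is why we sorted by id before saving.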

use person1, clear
cf _all using person2, verbose

              id:  match
            name:  3 mismatches
             age:  1 mismatches
              ht:  1 mismatches
              wt:  1 mismatches
          income:  1 mismatches
r(9);
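As the r(9) above shows, cf exits with return code 9 when it finds differences, which would halt a do-file at that point. The following is a sketch of one way to keep the script running, assuming the same person1 and person2 files; the message text is illustrative only.

```stata
use person1, clear

* capture traps the error so the do-file continues;
* noisily still displays the report from cf.
capture noisily cf _all using person2

if _rc == 9 {
    display "Discrepancies found between person1 and person2"
}
else if _rc != 0 {
    * Any other nonzero return code is re-raised as an error.
    error _rc
}
```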

The cf command revealed that differences do exist. However, it did not tell us in which observations the mismatches occurred, which is our main objective. To find out where the errors occurred, we create one large dataset that combines the two. In the combined dataset we must be able to tell the values entered by person1 from those entered by person2. We therefore rename all variables from person1, except id (which is needed for matching), by adding the suffix "_person1" with the rename command. We use the foreach command to make the renaming more efficient. Once the variables are renamed, person2 is merged with person1 by the identifier variable, id, and the merged dataset is listed.

use person1, clear

foreach var of varlist name-income{
  rename `var' `var'_person1
}

merge id using person2
list

     +---------------------------------------------------------------------------------------------------------+
     | id   name_p~1   age_pe~1   ht_per~1   wt_per~1   income~1      name    age   ht    wt   income   _merge |
     |---------------------------------------------------------------------------------------------------------|
  1. |  4       mike         34         70        156       5600   michael     34   70   156     5600        3 |
  2. | 11       john         23         68        145      23000      john   23.5   68   145    23000        3 |
  3. | 12    charlie         25         72        178      45000   charles     25   52   178    45000        3 |
  4. | 13      sally         21         64        135      12000     sally     21   64     .    12000        3 |
  5. | 43       paul         30         73        189      15600      Paul     30   73   189     5600        3 |
     +---------------------------------------------------------------------------------------------------------+
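The merge command above uses the syntax of older Stata releases. In Stata 11 and later the type of match is stated explicitly, so the same step might be written as in the sketch below. It assumes the same person1 and person2 files. The final assert is an added check that is not part of the original example.

```stata
use person1, clear

foreach var of varlist name-income {
    rename `var' `var'_person1
}

* 1:1 requires id to be unique in both files (Stata 11 or later).
merge 1:1 id using person2

* Stop if any id appears in only one of the two files.
assert _merge == 3
```

Here every id appears in both files, so _merge equals 3 for all five observations, as in the listing above.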

We can display the discrepancies either by variable or by observation. We start with the listing by variable. The foreach loop runs over the variables from person2, name-income, which have no suffix. Inside the loop, the if condition `var' != `var'_person1 restricts the list command to observations where the value entered by person2 (`var') differs from the value entered by person1 (`var'_person1). For each such observation we list id, the value entered by person2, and the value entered by person1. A missing value counts as different from a nonmissing value, so the missing wt for id 13 is reported as a discrepancy.

Note that when we list the variables, the variables with no suffix correspond to the entries made by person2.

*Discrepancies listed by variables.

foreach var of varlist name-income{
  list id `var' `var'_person1 if `var' != `var'_person1, abbreviate(15)
}

     +-----------------------------+
     | id      name   name_person1 |
     |-----------------------------|
  1. |  4   michael           mike |
  3. | 12   charles        charlie |
  5. | 43      Paul           paul |
     +-----------------------------+

     +-------------------------+
     | id    age   age_person1 |
     |-------------------------|
  2. | 11   23.5            23 |
     +-------------------------+

     +----------------------+
     | id   ht   ht_person1 |
     |----------------------|
  3. | 12   52           72 |
     +----------------------+

     +----------------------+
     | id   wt   wt_person1 |
     |----------------------|
  4. | 13    .          135 |
     +----------------------+

     +------------------------------+
     | id   income   income_person1 |
     |------------------------------|
  5. | 43     5600            15600 |
     +------------------------------+
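When the data have many observations, a count of discrepancies per variable can be easier to scan than the full listings. This is a minimal sketch, assuming the merged dataset from above is in memory; the local macro total is an illustrative name.

```stata
* Count, rather than list, the discrepancies for each variable.
local total = 0
foreach var of varlist name-income {
    quietly count if `var' != `var'_person1
    display "`var': " r(N) " discrepancies"
    local total = `total' + r(N)
}
display "Total discrepancies: `total'"
```

For the example data the counts should agree with the cf output shown earlier: 3 for name and 1 for each of the other variables. String comparisons are case sensitive, which is why Paul and paul count as a mismatch.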

To list discrepancies by observation, we modify the previous program so that it works through the data one case at a time. For observation 1, we check every variable named in the foreach loop and list any discrepancies, then move on to observation 2, and so on until the last observation has been checked. First, we find the number of observations with the count command and use that value as the upper limit of the forvalues loop, which steps through the observations one by one. We also add _n == `i' to the if condition of the list command, so that on each pass only observation `i' is checked.

*Discrepancies listed by id variable.

count
    5

forvalues i = 1/5 {
   foreach var of varlist name-income {
      list id `var' `var'_person1 if (`var' != `var'_person1) & _n == `i', abbreviate(15)
   }
}

     +-----------------------------+
     | id      name   name_person1 |
     |-----------------------------|
  1. |  4   michael           mike |
     +-----------------------------+

     +-------------------------+
     | id    age   age_person1 |
     |-------------------------|
  2. | 11   23.5            23 |
     +-------------------------+

     +-----------------------------+
     | id      name   name_person1 |
     |-----------------------------|
  3. | 12   charles        charlie |
     +-----------------------------+

     +----------------------+
     | id   ht   ht_person1 |
     |----------------------|
  3. | 12   52           72 |
     +----------------------+

     +----------------------+
     | id   wt   wt_person1 |
     |----------------------|
  4. | 13    .          135 |
     +----------------------+

     +--------------------------+
     | id   name   name_person1 |
     |--------------------------|
  5. | 43   Paul           paul |
     +--------------------------+

     +------------------------------+
     | id   income   income_person1 |
     |------------------------------|
  5. | 43     5600            15600 |
     +------------------------------+
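The loop above hard-codes the number of observations reported by count. As a sketch of an alternative, assuming the same merged dataset is in memory, the built-in _N can supply the upper limit so the code does not need editing when the number of observations changes. The local macro n is an illustrative name.

```stata
* Use _N (the number of observations in memory) as the loop limit.
local n = _N

forvalues i = 1/`n' {
    foreach var of varlist name-income {
        list id `var' `var'_person1 if (`var' != `var'_person1) & _n == `i', abbreviate(15)
    }
}
```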


These pages are Copyrighted (c) by UCLA Academic Technology Services

