UCLA Academic Technology Services HomeServicesClassesContactJobs
Stat Computing > Data File Manipulation

What kinds of problems can I solve with Unix-like Commands?

This page will show a number of examples using Unix-like commands to perform ASCII data file management tasks. The ASCII data file that these examples are based on can be downloaded here.

Example 1: Counting number of rows in a data file
Example 2: Displaying the length of the longest record in a data file
Example 3: Translating a word to another
Example 4: Printing the first few lines of a file
Example 5: Stacking files
Example 6: Splitting a big file
Example 7: Searching lines with specific characters


Example 1: Counting number of rows in an ASCII (text) data file

Let's say you have a huge file on your machine and you are going to input it into a statistical software. You may want to know before hand the number of records that the file has so that you can double check once the data file is read in. The example below shows how to use the wc (for word count) command to show the number of records.

D:\temp>ls -l test.csv
-rw-rw-rw-   1 user     group       11599 Sep 24  2002 test.csv
D:\temp>wc -l test.csv
    407 test.csv
D:\temp>more test.csv
MPG,ENGINE,HORSE,WEIGHT,ACCEL,YEAR,ORIGIN,CYLINDER,FILTER_$
18,307,130,3504,12,70,1,8,0
15,350,165,3693,12,70,1,8,0
18,318,150,3436,11,70,1,8,0
16,304,150,3433,12,70,1,8,0
17,302,140,3449,11,70,1,8,0
15,429,198,4341,10,70,1,8,0
14,454,220,4354,9,70,1,8,0
14,440,215,4312,9,70,1,8,0
14,455,225,4425,10,70,1,8,0
15,390,190,3850,9,70,1,8,0
#NULL!,133,115,3090,18,70,2,4,1
#NULL!,350,165,4142,12,70,1,8,0
#NULL!,351,153,4034,11,70,1,8,0
#NULL!,383,175,4166,11,70,1,8,0
...............................

ls -l: displaying the file attributes, such as access privilege,  file size and the date the file was created;

wc -l: displaying the number of records of the file;

more: displaying the file one page at a time until  you hit key q to quit.

Now we know that file test.csv has 407 rows with the first row being variable names.


Example 2: Displaying the length of the longest record in a data file

Very often we need to know the length of the longest record in order to input the data file correctly. Again we can use wc (for word count) to get the information.

D:\temp>wc -L test.csv
     59 test.csv

Now we know that the length of the longest record of test.csv is 59 characters.


Example 3: Translating a word to another

We have seen from the example above that our data file test.csv has some characters in it. For example, we have "#NULL!" for missing values instead of "." as most commonly used. We want to change "#NULL!" to simply a dot "." in the data file. We can use tr command to perform the translation. Command tr takes a few parameters and has a few options.

D:\temp>tr -s '#NULL!' . <test.csv > test_dot.csv
D:\temp>more test_dot.csv
MPG,E.GI.E,HORSE,WEIGHT,ACCE.,YEAR,ORIGI.,CY.I.DER,FI.TER_$
18,307,130,3504,12,70,1,8,0
15,350,165,3693,12,70,1,8,0
18,318,150,3436,11,70,1,8,0
16,304,150,3433,12,70,1,8,0
...........................
.,133,115,3090,18,70,2,4,1
.,350,165,4142,12,70,1,8,0
.,351,153,4034,11,70,1,8,0
.,383,175,4166,11,70,1,8,0
.,360,175,3850,11,70,1,8,0
15,383,170,3563,10,70,1,8,0
14,340,160,3609,8,70,1,8,0
.,302,140,3353,8,70,1,8,0
..........................

tr:

more: displaying the file one page at a time until  you hit key q to quit.

From the output of command more, we can see that the character string "#NULL!" has been replaced by "." in the output data file test_dot.csv.


Example 4: Printing the first few lines of a file

Sometimes, it is handy to work with a small subset of the original file. It can save time and it is easy to debug problems with a small data file. The command that we are going to use is head.

D:\temp>head test.csv
MPG,ENGINE,HORSE,WEIGHT,ACCEL,YEAR,ORIGIN,CYLINDER,FILTER_$
18,307,130,3504,12,70,1,8,0
15,350,165,3693,12,70,1,8,0
18,318,150,3436,11,70,1,8,0
16,304,150,3433,12,70,1,8,0
17,302,140,3449,11,70,1,8,0
15,429,198,4341,10,70,1,8,0
14,454,220,4354,9,70,1,8,0
14,440,215,4312,9,70,1,8,0
14,455,225,4425,10,70,1,8,0

head:  will print out the first 10 lines.

D:\temp>head -20 test.csv > test20.csv
D:\temp>wc -l test20.csv
     20 test20.csv

head: with option "-20" will print out the first 20 lines, but we now redirect the printout to a file called test20.csv.

wc: "-l" is an option to print out the total number of lines in the file.

So we see that in the file test20.csv created by command head has 20 lines.


Example 5: Stacking files

Stacking files on top of each other is a very common task in data management. You want to make sure that the files that are going to be stacked have same number of variables and the variables are positioned the same in each of the files.

In this example, we have two data files, test.csv and test2.csv. We use command cat (for concatenate) to combine the two files. One thing that we need to pay attention is that data file test.csv has a header row (variable names), while test2.csv only has data rows. Therefore, when the two are stacked together, test.csv has to be on top.

D:\temp>head test.csv
MPG,ENGINE,HORSE,WEIGHT,ACCEL,YEAR,ORIGIN,CYLINDER,FILTER_$
18,307,130,3504,12,70,1,8,0
15,350,165,3693,12,70,1,8,0
18,318,150,3436,11,70,1,8,0
16,304,150,3433,12,70,1,8,0
17,302,140,3449,11,70,1,8,0
15,429,198,4341,10,70,1,8,0
14,454,220,4354,9,70,1,8,0
14,440,215,4312,9,70,1,8,0
14,455,225,4425,10,70,1,8,0
D:\temp>head test2.csv
22,250,105,3353,15,76,1,6,1
24,200,81,3012,18,76,1,6,1
23,232,90,3085,18,76,1,6,1
29,85,52,2035,22,76,1,4,1
25,98,60,2164,22,76,1,4,1
29,90,70,1937,14,76,2,4,1
33,91,53,1795,17,76,3,4,1
20,225,100,3651,18,76,1,6,1
18,250,78,3574,21,76,1,6,1
19,250,110,3645,16,76,1,6,1
D:\temp>cat test.csv test2.csv > testall.csv
D:\temp>wc -l testall.csv
    607 testall.csv
D:\temp>wc -l test.csv
    407 test.csv
D:\temp>wc -l test2.csv
    200 test2.csv

head: displaying the first ten lines of a file. We can see that test2.csv does not have a header row but test.csv has.

cat: combining test.csv and test2.csv together vertically and output it to a file called testall.csv.

wc: counting the number of lines in each file to see if the stacking makes sense.


Example 6: Splitting a big file

Sometimes, a file is too big to be manipulated, for example to be transferred from one machine to another. One thing we can do is to split the file into several smaller files and re-assemble them back together later. Command split can be used for this task. Let's say we need to split our test.csv into two files. Since test.csv has total of 407 lines, the two smaller files can have size about 210 each.

D:\temp>split -l210 test.csv sml_file
D:\temp>ls -l sml_file*
-rw-rw-rw-   1 user     group        6031 Dec 18 12:19 sml_fileaa
-rw-rw-rw-   1 user     group        5568 Dec 18 12:19 sml_fileab
D:\temp>wc -l sml_fileaa
    210 sml_fileaa
D:\temp>wc -l sml_fileab
    197 sml_fileab

wc: counting the number of lines in each file

split:


Example 7: Searching lines with specific characters

Data entry errors are very common with a large data file. Sometimes a numeric variable can contain characters and has to be read in as a character variable. It may be useful to know the number of lines of data that has some specific characters, so we can get a good idea on how severe the data entry error problem is. For example, the data set below has some data entry errors, assuming that everything should be numeric.

123456
23?445
334566
456x77
534345
676767
3435?3
123245
3435k4
4668a3
243354
575784

Let's say this data set is called problem.txt.

D:\temp>more problem.txt
123456
23?445
334566
456x77
534345
676767
3435?3
123245
3435k4
4668a3
243354
575784
D:\temp>grep "[a-z?]" problem.txt >pline.txt
D:\temp>more pline.txt
23?445
456x77
3435?3
3435k4
4668a3
D:\temp>wc -l pline.txt
      5 pline.txt

How to cite this page

Report an error on this page

UCLA Researchers are invited to our Statistical Consulting Services
We recommend others to our list of Other Resources for Statistical Computing Help
These pages are Copyrighted (c) by UCLA Academic Technology Services


The content of this web site should not be construed as an endorsement of any particular web site, book, or software product by the University of California.