Using AWK To Process Huge CSV Files

Problem

I had 38 separate .csv files, each consisting of at least 40K lines of data; in total there were ≥ 2.6 million rows of information. The problem was that the data was not in an ideal form.

Each file had 5 columns. One of them, birthday, used a conventional (at least for most of the world) format: DD/MM/YYYY. However, I was going to put this data in a DB (more on that later!) and I needed three separate properties (i.e. { day: DD, month: MM, year: YYYY }). There was also a useless column I had to get rid of. In addition, I wanted to reduce the 38 files down to one.

Approach

I know some Python, and a couple of data science courses had introduced me to Pandas, a well-known data analysis library built on top of Python. It has a straightforward way of reading, manipulating, and writing .csv files. I was going to use it, but I didn't really want to install a whole library just to use one of its features once and never again. So I chose something that ships with nearly all modern Unix-like systems. AWK is designed for text processing, and much like other programs in the Unix world it does one thing well and produces fascinating results, in terms of simplicity and performance, when composed with other Unix tools.

Solution

Now we need to make AWK process .csv files. AWK uses a field separator to split each input record into fields; it may be either a single character or a regular expression. So all we need to do to turn AWK into a .csv processor is to set FS (the field separator) to ",". (Note that this simple approach assumes no field contains an embedded comma, e.g. inside quotes.)

We can achieve this using AWK's Startup and Cleanup Actions.

# clean_csv.awk

BEGIN { # startup action: runs once, before any input is read
  FS = ","
}
{       # main rule: runs for every input line
  print $1
}
END {   # cleanup action: runs once after all input (nothing needed here)
}

Then, if we pass an input such as one, two, three to the script, it will print one:

$ echo "one, two, three" | awk -f clean_csv.awk

one
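The same thing works without a script file: awk's -F flag sets FS directly from the command line, which is handy for quick one-offs:

```shell
# -F, is equivalent to setting FS = "," in a BEGIN block
echo "one,two,three" | awk -F, '{ print $1 }'
# prints: one
```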

Now that we have this, we can finally solve our problems: splitting the DD/MM/YYYY column into three separate columns and removing the useless column. AWK has a split function: split(string, array, separator) breaks the string apart on the separator, stores the pieces in the array, and returns how many pieces there were.
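To see split in isolation (the date value here is made up for illustration):

```shell
# split fills the dob array and returns the number of pieces
echo "12/03/1994" | awk '{ n = split($0, dob, "/"); print n, dob[1], dob[2], dob[3] }'
# prints: 3 12 03 1994
```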

The birthday column was #4 and the useless column was #3, so to create the separate fields we need to make the following modifications:

# clean_csv.awk

BEGIN {
  FS = ","
}
{
  # split DD/MM/YYYY into an array of values
  split($4, dob, "/")

  # join these values into three comma-separated columns
  total = dob[1] "," dob[2] "," dob[3]

  # rebuild the row, omitting the useless #3 column
  row = $1 "," $2 "," total "," $5

  print row
}

This solves most of our problems, except for the first row of the input: .csv files carry their headers as a set of labels on the first row, so we need to skip the first line of every .csv file.

In AWK, NR and FNR both represent the record number. The latter holds the record number within the current file, while the former counts records across all files. Since we are going to use this script on more than one file, we'll stick with FNR.
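The difference is easy to see by running awk over two throwaway files (the file names here are just for the demo):

```shell
printf 'a\nb\n' > f1.txt
printf 'c\nd\n' > f2.txt

# NR keeps counting across files; FNR resets to 1 at each new file
awk '{ print FILENAME, NR, FNR }' f1.txt f2.txt
# f1.txt 1 1
# f1.txt 2 2
# f2.txt 3 1
# f2.txt 4 2

rm f1.txt f2.txt
```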

So the final script looks like so:

# clean_csv.awk

BEGIN {
  FS = ","

  # print a new header with the corrected labels
  print "last_name,first_name,birth_day,birth_month,birth_year,country"
}
{
  if (FNR > 1) { # skip each file's header line
    # split DD/MM/YYYY into an array of values
    split($4, dob, "/")

    # join these values into three comma-separated columns
    total = dob[1] "," dob[2] "," dob[3]

    # rebuild the row, omitting the useless #3 column
    row = $1 "," $2 "," total "," $5

    print row
  }
}

At last, to execute this on our 38 .csv files we need to run this command:

awk -f clean_csv.awk *.csv > clean.csv

(One caveat: clean.csv itself matches *.csv, so a second run would pick the output file up as input; writing the output under a different name or directory avoids that.)
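As a sanity check, here is the whole pipeline run against a tiny made-up sample; the original files are assumed to be laid out as last_name,first_name,&lt;useless&gt;,birthday,country, and the script body is the one from above, slightly condensed:

```shell
# write the script and a one-row sample file (made-up data)
cat > clean_csv.awk <<'AWK'
BEGIN {
  FS = ","
  print "last_name,first_name,birth_day,birth_month,birth_year,country"
}
{
  if (FNR > 1) {
    split($4, dob, "/")
    print $1 "," $2 "," dob[1] "," dob[2] "," dob[3] "," $5
  }
}
AWK

cat > sample.csv <<'CSV'
last_name,first_name,notes,birthday,country
Doe,Jane,x,12/03/1994,NL
CSV

awk -f clean_csv.awk sample.csv
# last_name,first_name,birth_day,birth_month,birth_year,country
# Doe,Jane,12,03,1994,NL

rm clean_csv.awk sample.csv
```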

Conclusion

We read and manipulated 38 .csv files, with more than 2.6 million rows of data combined, and reduced them to a single clean file without installing any software. So when you're faced with a problem, you may not need to install bloated software to solve it: the solution may already come pre-installed with your system.