Analytics

Pro Tip #1 – The Joy of TABs

sea_and_rocksExporting data into flat files is a really common task for a data scientist yet over the last few years I’ve seen almost everyone fall down a particular rabbit hole when it comes to data export – using comma delimited flat files.

For any number of reasons, you’ll often have to jump straight in at the SQL layer exporting result sets to file. In other cases you’re working in (say) a Jupyter notebook and you want to persist the data set for later processing (not necessarily by the same code or even tech stack!). Or the results of your Spark job are written from HDFS out to an S3 bucket.

Whatever you’re doing, you have to choose which character to use as the field delimiter in your records – and the pro tip of the day is to choose the humble TAB character over the comma.

I think the reason that comma is used by default so often may be because it’s the default separator in R and pandas as well as in many third party libraries (such as Java’s OpenCSV).

However most of the time, “CSV” is really just a catch-all term meaning “it’s a flat file, one record per line and the fields are delimited with the character X” for some given value of X.

8 Reasons To Make The Change

  1. avoid use of quotes in strings
  2. avoid complex escape sequences for the same reason
  3. if you’re working with web usage logs, urls will often contain commas
  4. any free text field from any source system will almost certainly have commas (Salesforce anyone?). Yes they can have tabs too – but it’s easy to replace these with spaces without losing text semantics
  5. It is easy to write NULL string fields (cos the writer doesn’t insist on using quotes)
  6. TAB is the default delimiter for text formatted data in common databases such as Postgres
  7. Many older C-based tools use fixed field size tab delimited I/O
  8. Most importantly – it is far, far easier to for people to read

And Then There Were None…

The only hurdle is that Excel won’t open a TSV by default – you have to open Excel first and then import the data, which is where you get to specify the delimiter. In my experience this only ever comes up at the end of your data mangling process when you’re sending the results to whichever person in Marketing/Sales/Finance asked for it.

The only other time I’ve heard anyone complain about TSV files was from a guy who still (as of 2017) codes C in vi. His issue was that it was too hard to read in an 80 character terminal window.

So he refused to produce TSVs and the system fell over when his urls all contained commas, we lost 3 weeks in consulting budget and in the end they had to re-write it in python to use tabs at twice the cost because they shouldn’t have been using C to produce web content in the first place.

I don’t think he’s still there now though…

 

 

Leave a Reply