It often happens that you need to provide a dataset to non-tech personnel in your company.
The most common use case for this that I’ve encountered is when a business unit wants a report that can’t be easily produced and they need to massage the data further.
This can happen for many reasons: the ETL pipeline is out of date, part of the data doesn’t exist anywhere other than in a spreadsheet on a shared drive, the data is actually a static JSON file living in an S3 bucket, or the BI tool only reads from the corporate data warehouse and the warehouse doesn’t have all the data needed for the report.
Whatever the reason, you can pretty much rest assured that they’ll want it as an Excel file, and it’s up to the data scientist to make this as painless as possible. So here are some tips for dataset delivery. These may seem self-evident, but hardly anyone I’ve worked with follows them:
- Resize all the columns so that the data can be read
- Make the column headings bold and centre aligned
- Freeze the top row(s) so that the columns don’t disappear on vertical scrolling
- Freeze the left column(s) so that the row identifiers don’t disappear on horizontal scrolling
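
If you build the spreadsheet in Python, all four tips can be applied in a few lines. What follows is a minimal sketch, assuming the `openpyxl` library is installed. The function name `build_formatted_workbook`, the `frozen_columns` parameter and the sample values are made up for illustration, and the width calculation is a rough character count rather than an exact fit.

```python
from openpyxl import Workbook
from openpyxl.styles import Alignment, Font
from openpyxl.utils import get_column_letter


def build_formatted_workbook(header, rows, frozen_columns=1):
    """Return a workbook whose single sheet is formatted for easy reading.

    Hypothetical helper: `header` is a list of column names, `rows` is a
    list of row lists, `frozen_columns` is how many left columns to freeze.
    """
    wb = Workbook()
    ws = wb.active
    ws.append(header)
    for row in rows:
        ws.append(row)

    # Tip 2: bold, centre-aligned column headings.
    for cell in ws[1]:
        cell.font = Font(bold=True)
        cell.alignment = Alignment(horizontal="center")

    # Tip 1: widen each column to its longest value plus a little padding.
    for index, column_cells in enumerate(ws.iter_cols(), start=1):
        longest = max(
            (len(str(c.value)) for c in column_cells if c.value is not None),
            default=0,
        )
        ws.column_dimensions[get_column_letter(index)].width = longest + 2

    # Tips 3 and 4: freeze everything above and to the left of this cell,
    # i.e. the header row and the first `frozen_columns` columns.
    ws.freeze_panes = ws.cell(row=2, column=frozen_columns + 1)
    return wb


# Placeholder values only, not taken from any real dataset.
example = build_formatted_workbook(
    ["date", "Appliances", "Windspeed"],
    [["2016-05-09 11:30:00", 0, 0.0]],
)
```

Calling `example.save("dataset.xlsx")` (the filename is up to you) writes the file that you then hand over.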
Check out the Appliances and Energy Prediction dataset, which contains 19,735 instances with 29 attributes. How long would it take to determine the windspeed at 11:30 am on 9 May 2016 if the spreadsheet didn’t look like this?
Sound unnecessarily pedantic? If you’re thinking “yes” then perhaps another career would suit you better – because professional data scientists are here to add value and that means helping people access, understand and leverage data as easily as possible.
Sometimes it’s the simple tricks that make all the difference.