Data Science Trends

5 Things Data Scientists Need To Stop Saying…

“R is Better Than Python / Python is Better Than R”

This is all over the Internet exhibiting all the standard tribal behavioural excesses. I’m not going to get into this one because in my experience there isn’t an answer.

The truth is it doesn’t actually matter. What matters is that you learn both well enough to be proficient and most importantly be able to decide when you should use one or the other in a given scenario. They are tools not life partners.

If you’re a complete newbie in the space and just starting out you’ll probably find R a little harder than python (hey – it was designed by statisticians for statisticians) but both have their strengths and weaknesses.

I always find it strange that people are so attached to their toolkits. Most of the time you achieve the same thing in both so it actually depends less on your predilections and more on the skill set of the team you’re in and/or customer you’re working for.

For various reasons, you’ll often find that one tech stack seems to be “better” than another in a particular scenario for some given value of “better”. That’s just how it rolls.

And that’s ok.

serene“Super”

If you’ve been listening to pretty much any data science podcast and especially any of the Stanford Computer Vision courses you’ve probably noticed this one.

For some reason, over the last year or so, the data science community has thrown out its collective Thesaurus and decided to simply replace pretty much any synonym of “very” with the word “super”.

Consequently, we no longer use GPUs/TPUs so that we can reduce the train/validate/test cycle for deep networks on Tensorflow (say). Instead, GPUs let us train super-fast.

Things are no longer complex – they are super hard. Other things are no longer easy – they are super easy. The industry is moving super fast and choosing a bad learning rate will make your deep learning training convergence super slow.

Hosts of various podcasts are no longer merely excited about recent optimisations in Apache Spark or last weeks’ update to Kafka (say). They are super excited.

And guess how the guests are feeling these days?

I’m super sick of this.

If we’re going to engage in this kind of mindless herding behaviour then why don’t we go for broke and use last century’s literature as our guide and just adopt “doubleplusgood” for everything?

“The data are…”

I don’t care whether you rhyme it with later or martyr it’s a collective singular noun. Yes, we’re referring to more than one data point – I get it – but when you walk into a room and say ‘the data are being pre-processed’, well – I’m sorry but you sound like a complete prat.

Yes – you can argue about the correctness all day – but the reality is you just lost half your audience because most of them are thinking ‘how much are we paying this OCD-grammar-Nazi?’ (I’m aware of the irony here…).

If I can live with the plural of fish suddenly being ‘fishes’ instead of ‘fish’ (even in official journals these days) then we can all get past this – so please just stop it ok?

“Python is Better Than (select * from Java,Go,Scala…)”

I’m really worried about this. The tribal evangelism of the python crowd has all the hallmarks of the Agile/SCRUM religious zealotry that’s spread through the software industry like malignant lymphoma.

Look – python is bloody awesome for heaps of things – it’s also awesome for lots of stuff it probably wasn’t even designed for – and that’s a sign of a great tech stack. Scientists use it daily at scale and for ad-hoc analysis you can’t beat a Jupyter notebook.

The plethora of third party open source libraries makes python the tool of choice for data scientists everywhere – with OpenCV, pandas, numpy/scipy/dask, sckkit-learn etc it is probably the easiest and best data science stack around to start your journey with.

And yet…and yet.

When it comes to build software as opposed to doing stuff I’m just not convinced. python lacks almost all of the features needed to build industrial strength hardened, robust, type safe, isolated and distributed software systems – yet this is touted as if it was feature rather than a complete failure in design.

To make it worse, python has its own version of the SCRUM crowd’s Agile Manifesto in the form of the PEP-8 standard encoded on the tablets brought down from Mt Sinai by the prophet Guido et al.

For it has been decided that thou shalt not write lines longer 120 characters long. Thou shalt not use camel case. Thou shalt indent as directed – no more and no less. Thou must litter your codebase with references to ‘self’ even within thine own class and furthermore thou shalt refer to static class variables using the full class name even though you are a doing so from a method in that very same class.

Thou mayst not override constructors nor methods in thy classes – rather thou must either pollute signatures with defaults or pass a typeless dictionary of **kwargs that are hopefully defined (hard-coded) somewhere else in the codebase (yes really).

And thou mayst not isolate any part of thine system behind interfaces to facilitate system testing/mocking/stubbing. All classes, methods and all properties of all classes must be globally visible through out the entire codebase at all times.

Some of these are failures in the standard and others in the language. But in either case, other programming languages have these constructs in place for a reason – to help us because we are human and developing software is hard.

python literally let’s you do whatever you want – despite over 50 years of hard won knowledge by some of the smartest people on the planet proving that this is a really really bad idea…..

“Here are my codes”

spidersNo.

When Trump decides to blow up the planet, he’ll ask for The Codes. In the war movies, whenever the last outpost is being over-run the commanders will burn the code book to deny the enemy access to “the codes”. There was a band that used to come and play at the uni bar in the 90s called The Codes (and they were rubbish).

The fact is that whether you’re referring to code you are working on, code you are posting on Stack overflow, code you have stolen from your previous employer or simply some javascript code that you are busy pasting directly from Stack overflow into your production codebase, the correct term is “code” singular.

What you’re really telling the world is that your communication skills need work – and for a Data Scientist that’s bad.

 

Categories: Data Science Trends

Tagged as: , ,

Leave a Reply