Why would a data scientist bother learning multiple programming languages? Going through most job postings, one could conclude that Python is enough.
In this article I am going to argue that being exposed to multiple technology stacks can greatly benefit anyone coding for a living, and especially data scientists, even if they never actually ship code written in those stacks.
This is the water cooler discussion that inspired this post:
Colleague: Hey steremma, I was thinking of becoming a better programmer and you are the nerdiest person I happen to know. I already know Python, so which language would you recommend?
Me: Oh you should definitely try Scala. In fact there is a great Coursera Specialization taught by the very creator of the language. It’s great because <3 minute long ramble>.
Colleague: Oh sounds great. So, I take it you have been using Scala in most of your recent projects?
Me: Well…
In fact, I have to this day never used Scala in a work-related project. There are practical reasons for this, chiefly that I mostly work in teams where everyone knows and likes Python or R. Even when working alone, I don't want to impose my inevitable learning curve on my problem owner, so I just stick with what I know best. Does this mean that Scala is in fact useless to me? Did I spend 5 months studying it for nothing? Is the meaning of life really equal to 42?
My main counter-argument is that studying a language entails much more than learning its set of commands. In fact each language reflects a unique programming model. When learning it, one should focus on that model rather than on trivial syntax, since the latter can usually be googled or stack-overflowed. As Alan Perlis so elegantly put it: "A language that doesn't affect the way you think about programming, is not worth knowing." And Scala does indeed change one's understanding of programming as a whole, by bringing the functional paradigm into perspective.
Indeed, I presently cannot recall much about Scala's syntax. Was it Integer x, or maybe x: Integer? I don't know! What I do know, however, is that a Collection<Cat> is not necessarily a Collection<Animal>, even though a Cat is an Animal.
The above concepts have something in common: none of them are in any way tied to Scala! In fact all of those - except TRE (tail recursion elimination) - can be used to improve one's Python code.
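To make the variance point concrete in Python terms, here is a minimal sketch (the Animal, Cat, and Dog classes are hypothetical, and the annotations use Python 3.9+ syntax) that a type checker such as mypy can verify: mutable collections like list are invariant, while read-only views like Sequence are covariant.

from typing import Sequence

class Animal: ...
class Cat(Animal): ...
class Dog(Animal): ...

def add_dog(animals: list[Animal]) -> None:
    animals.append(Dog())       # legal for a list of Animals

def count(animals: Sequence[Animal]) -> int:
    return len(animals)         # read-only access is always safe

cats: list[Cat] = [Cat(), Cat()]
# add_dog(cats)  # rejected by mypy: it would sneak a Dog into a list of Cats
count(cats)      # accepted: Sequence[Animal] is covariant in its element type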
A very important property of the functional paradigm is how rigorously it can model mathematical concepts. This stands in contrast to imperative programs, but it is a point not always taken into consideration when discussing the advantages of functional languages. In a field like Data Science however, which lies on the border between CS and Mathematics, I consider this expressive power really important. Let's break down this argument with some examples:
Every data scientist has used the word function in at least two contexts: a programming function and a mathematical function. What we do not always realize, however, is how different those concepts are in the imperative paradigm. You might expect that a programming function is a model of a mathematical one. But you would be wrong in the general case, and here is why:
Mathematical functions are stateless, in the sense that they only care about their arguments and not about the world's state. f(x, y) = x + 2*y will always return the same value when called with the same pair of x and y, regardless of whether it is written on paper, on a whiteboard, or in an IDE. It will never be directly influenced by any external state z (unless x or y depend on z, but we do not consider this case here).
On the other hand, consider this perfectly legitimate Python function:
counter = 0

def f(x):
    global counter
    counter += x          # mutates state that lives outside the function
    return counter > 5

f(3)  # returns False (counter is now 3)
f(3)  # returns True  (counter is now 6)
Our function f depends on external state that has nothing to do with its argument. As a result, calling it with the same argument does not guarantee the same result.
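For contrast, here is a minimal sketch (my rewrite, not part of the original snippet) of the functional alternative: instead of touching a global, the function takes the current counter as an argument and returns the new counter alongside the result, so the same inputs always produce the same outputs.

def f(counter, x):
    new_counter = counter + x
    return new_counter, new_counter > 5

state, result = f(0, 3)      # result is False, state is 3
state, result = f(state, 3)  # result is True,  state is 6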
Moreover, mathematical functions will never mutate a value; they may only generate a new value from a given input. But programming functions with side effects might actually mutate their arguments, or even the surrounding world.
Consider another perfectly valid snippet:
def g(x):
    for i, v in enumerate(x):
        x[i] = v ** 2       # overwrites the caller's list in place

d = [1, 2, 3]
g(d)
d  # prints [1, 4, 9]
The function mutated its argument. In other words, it performed a side effect. This is also incompatible with any valid mathematical function. You could rewrite the above snippet as:
def g(x):
    return [i ** 2 for i in x]   # builds a new list, leaves x untouched

d = [1, 2, 3]
d_squared = g(d)
d  # prints [1, 2, 3]
The function defined above does indeed model a valid mathematical function.
A very similar difference exists between the programming and mathematical concepts of a value. Consider solving a mathematical problem on paper. At some point you might write X = 3. Would it ever make sense if, some lines after that, you wrote X = 4? No, it would not! The reason is simple: re-assignment and mutation are completely unsupported concepts in all mathematical formulations I am aware of. What one might do instead is define a new value, say Y = X + 1. However the snippet:
X = 3
do_stuff(X)
X = 4
is also perfectly legal Python code, even if it (again) fails to model a mathematically valid sequence of steps. Even worse, it greatly limits our ability to reason about the code, since we now need to keep track of each variable's state: to which object is the name X currently bound? In a functional program, values are final: once they are set, they can never change. As long as I remember seeing X = 3 in the function's body, I never have to wonder what its current value is.
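Python will not enforce this discipline for you, but it can express the intent. A minimal sketch, assuming you run a static type checker such as mypy: typing.Final marks a name as assign-once, and a frozen dataclass yields values that cannot be mutated after construction.

from dataclasses import dataclass
from typing import Final

X: Final = 3            # mypy flags any later re-assignment of X

@dataclass(frozen=True)
class Point:
    x: float
    y: float

p = Point(1.0, 2.0)
# p.x = 5.0             # raises dataclasses.FrozenInstanceError at runtime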
These issues become exceedingly important for data scientists, especially when trying to reproduce a mathematically derived model, say from a paper. In this case, a scientist needs to translate mathematical formulas into code. And as shown above, avoiding side effects and enforcing immutability (or, to be precise, referential transparency) can greatly benefit one's ability to reason about the code and to map its modules directly to the paper's sentences or formulas.
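As a toy illustration (the formula is mine, not taken from any particular paper), suppose a paper defines the standard score z = (x - mu) / sigma. A referentially transparent translation reads almost exactly like the notation, which makes checking the code against the paper trivial:

def z_score(x, mu, sigma):
    # z = (x - mu) / sigma: no globals, no mutation, same input -> same output
    return (x - mu) / sigma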
Generally, what Scala, Haskell, or even F# can teach us is the functional paradigm: a programming model that expresses mathematical concepts in a much more natural way than alternative imperative models. Is it necessary to employ this knowledge in one's day-to-day work? In my opinion, no. Mutability is a powerful tool, one that I cannot always afford to give up. One should however be conscious of the trade-off silently accepted when sacrificing referential transparency and allowing functions to perform side effects.
For very much the same reasons, I believe anyone making a living by writing code should go through C (and this advice comes from a terrible C programmer). Similarly, anyone using distributed systems should take a look at MPI, and anyone programming a GPU, even on the higher abstraction layers, should write a matrix multiplication program in CUDA. Everything in the syntax of these technologies is out of fashion: imperative to their core, using error codes in an era when exceptions are lightweight, manual memory management… They don't even properly support any of the modern programming paradigms. However, they do reveal what happens under the protective hood of our preferred language's programming model: what actually happens when I hit 2 + 2 in my Python interpreter.
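Even without leaving Python, you can peek one layer below the surface. A minimal sketch using the standard library's dis module (the exact bytecode differs between CPython versions, and the constant folder may collapse the addition before it ever runs):

import dis

# Show the stack-machine instructions the interpreter executes
# for a simple addition.
dis.dis("2 + 2")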