2016 marked a zenith in the data science renaissance. In the wake of a series of articles and editorials decrying the shortage of data analysts, the internet responded in force, exploding with blog posts, tutorials, and listicles aimed at launching the beginner into the world of data science. And yet, in spite of all the claims that this language or that library makes up the essential know-how of a "real" data scientist, if 2016 has taught us anything it's that the only essential skill is a willingness to keep learning.
U.S. Chief Data Scientist DJ Patil famously referred to data science as “a team sport” — and within an organization, data science does work best when practiced collaboratively. But the emerging field of data science is more organic and mutable than it is systematic and coordinated. Data scientists must continue learning new domains, languages, techniques, and applications so as to move forward as the field continues to evolve. In the age of the data product, a more apt analogy for data science is the amoeba — an organism continually in motion, altering its shape, spreading and changing. For this reason, it's probably more difficult to stay a data scientist than it is to become one.
For those of us who spent 2016 reading all those articles ("50 essential things every data scientist MUST know" and "How to spot a FAKE data scientist"), taking Coursera and Codecademy courses, following Analytics Vidhya tutorials, competing in Kaggle competitions, and trolling Kirk Borne on Twitter, now is a good time to think about what comes next.
So now you're a data scientist, congrats; where do you go from here?
Okay, you're a data scientist, now what?
In the spirit of New Year's resolutions, here's a list of 10 (technical) things for the intermediate data scientist to try in 2017: things you can do to push yourself forward, keep your edge, set yourself apart, and be a better data scientist by 2018.
1. Adopt repeatable, systematic processes
Let's assume you're an old hand at data analytics, but how systematic is your process? If the answer is "not very," you might want to consider establishing a more structured path to discovery. Not only can a more systematic approach make you more efficient, but it can also ensure that you consistently consider a broader range of techniques for each problem, including things like graph analytics and time dynamics.
2. Explore a new data type
Love CSV? So do we! CSVs are great — simple, compact, distributable, and gotta love those header rows. But getting overly comfortable with wrangling one particular type of data can be limiting. This coming year, why not expand your analytical range by trying out a new serialization format like JSON or XML? Work mainly with categorical data? Try experimenting with a time series analysis. Mostly use relational data? Try your hand at unstructured text or geospatial data like rasters.
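As a small first step away from CSV, here is a minimal sketch that flattens nested JSON into CSV-like rows using only Python's standard library. The record layout (the `station`, `readings`, `hour`, and `temp` fields) is invented for illustration; real data will have its own shape.

```python
import json

# Hypothetical nested records; the field names are made up for this example.
RAW = """
[
  {"station": "A", "readings": [{"hour": 0, "temp": 11.5}, {"hour": 1, "temp": 11.1}]},
  {"station": "B", "readings": [{"hour": 0, "temp": 9.8}]}
]
"""


def flatten(records):
    """Turn nested station records into flat rows, one per reading."""
    rows = []
    for record in records:
        # A station with no "readings" key simply contributes no rows.
        for reading in record.get("readings", []):
            rows.append({
                "station": record["station"],
                "hour": reading["hour"],
                "temp": reading["temp"],
            })
    return rows


rows = flatten(json.loads(RAW))
for row in rows:
    print(row)
```

The nesting is the part a CSV can't express directly: each station owns a variable-length list of readings, and it's up to you to decide how to unroll it.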
3. Break out of your machine learning rut
Have you fallen into an algorithmic comfort zone? People like to argue about which machine learning model is the best, and everyone seems to have their favorite! Sure, picking a good model is important, but it's debatable whether a model can actually be 'good' devoid of the context of the domain, the hypothesis, the shape of the data, and the intended application. Fortunately, high-level Python libraries like Scikit-learn (also TensorFlow, Theano, NLTK, Gensim, and spaCy) provide APIs that make it easy to test and compare a host of models without additional data wrangling. In 2017, build breadth by exploring new models; honestly, it has become lazy not to!
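To give a sense of how little code a comparison takes, here is a rough sketch that cross-validates four Scikit-learn classifiers on the same data. The choice of the bundled iris dataset, the particular models, and the 5-fold setting are all assumptions made for illustration, not recommendations.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

# A small bundled dataset, used here only so the example is self-contained.
X, y = load_iris(return_X_y=True)

# Every estimator exposes the same fit/predict interface, so swapping is trivial.
models = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "k-nearest neighbors": KNeighborsClassifier(),
    "support vector machine": SVC(),
    "random forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

scores = {}
for name, model in models.items():
    # Mean accuracy across 5 cross-validation folds.
    scores[name] = cross_val_score(model, X, y, cv=5).mean()
    print(f"{name}: {scores[name]:.3f}")
```

Accuracy on a toy dataset says very little on its own; the point is that the loop stays the same no matter which estimators you put in the dictionary.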
4. Learn how your favorite models actually work
We're living in an age where any data scientist with a little Python know-how can use a library like Scikit-Learn to predict the future, but few can describe what's actually happening under the hood. Guess what? Clients and customers are becoming more discerning and demanding more interpretability. For the many self-taught machine learning practitioners out there, now's the time to learn how that algorithm you love so much actually works. For Scikit-Learn users, check out the documentation to find a link to the paper used in the implementation for each algorithm. You can also check out some of our previous posts to learn how things like PCA, distributed representations, skipgram, and parameter tuning work in theory as well as practice.
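One way to get under the hood is to rebuild a familiar technique by hand. Below is a minimal sketch of PCA via the singular value decomposition in NumPy. It is a teaching version on synthetic data, not Scikit-Learn's implementation, and the function name `pca` and the data are assumptions for the example.

```python
import numpy as np


def pca(X, n_components):
    """Project X onto its top principal components using the SVD."""
    # PCA is defined on mean-centered data.
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal directions, ordered by singular value.
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    components = Vt[:n_components]
    # Squared singular values, scaled by n - 1, are the variances along each direction.
    explained_variance = (S ** 2) / (X.shape[0] - 1)
    projected = X_centered @ components.T
    return projected, components, explained_variance[:n_components]


# Synthetic data with very different spreads along each axis.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3)) * np.array([5.0, 2.0, 0.5])

projected, components, variance = pca(X, n_components=2)
print(projected.shape)
print(variance)
```

Checking that the variance of each projected column matches the reported explained variance is a nice sanity test that the pieces fit together.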
5. Start using pipelines
The machine learning process often chains a series of transformers over raw data, transforming the data set at each step until it is passed to the fit method of a final estimator. A pipeline is a mechanism for sanely combining these steps into a single object. Pipelines can be constructed using a named declarative syntax so that they're easy to modify and develop. If you're just getting started with pipelines, check out Zac Stewart's excellent post on the topic.
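Here is a minimal sketch of what that looks like with Scikit-learn's `Pipeline`. The step names (`scale`, `reduce`, `classify`), the specific transformers, and the iris dataset are choices made for this example only.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Each step is a (name, object) pair; all but the last must be transformers.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("reduce", PCA(n_components=2)),
    ("classify", LogisticRegression(max_iter=1000)),
])

# fit() runs fit_transform on each transformer in order, then fits the estimator.
pipeline.fit(X_train, y_train)
accuracy = pipeline.score(X_test, y_test)
print(f"held-out accuracy: {accuracy:.3f}")

# Named steps let you change one piece without touching the rest.
pipeline.set_params(reduce__n_components=3)
```

Because the scaler and PCA are fit only on the training split inside the pipeline, you also avoid accidentally leaking information from the test data into your preprocessing.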
6. Build up your software engineering chops
Data scientists tend to leave a massive amount of technical debt in their wake. But what do you think will happen to those data scientists when all the good software engineers figure out how to do logistic regressions?
In 2017, boost your software engineering skills by pushing yourself to develop higher quality code, to build object-oriented, reusable methods, and to practice good habits like writing documentation and using exception handling to facilitate better communication with the team (and with future you!).
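As a small illustration of those habits, here is a sketch of a reusable metric function with a docstring and explicit exception handling. The function name and error messages are our own invention; the idea is simply that bad input fails loudly and descriptively instead of producing a silently wrong number.

```python
def mean_absolute_error(actual, predicted):
    """Return the mean absolute error between two equal-length sequences.

    Raises:
        ValueError: if the sequences are empty or differ in length.
    """
    if len(actual) != len(predicted):
        raise ValueError(
            f"length mismatch: {len(actual)} actual vs {len(predicted)} predicted"
        )
    if not actual:
        raise ValueError("cannot score empty sequences")
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


# The caller decides how to react; here we just report the problem.
try:
    mean_absolute_error([1.0, 2.0], [1.5])
except ValueError as err:
    print(f"bad input: {err}")
```

Without the length check, `zip` would quietly truncate to the shorter sequence and return a plausible-looking score, which is exactly the kind of bug future you will not thank you for.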
7. Learn a new programming language
Move over, imposter syndrome, and meet your know-it-all sibling, contempt culture! Now that you're a data scientist and have accumulated enough confidence to override the natural impulse toward self-doubt, there's a tendency to get a bit cocky. Don't! Stay humble by pushing yourself to learn a new programming language. Know Python? Try teaching yourself JavaScript or CSS. Know R? Branch out to learn Julia or master SQL.
8. Consider data security
What's the biggest security risk for a modern business? Hiring a data scientist! Why? Because data scientists often unknowingly expose their companies to massive security vulnerabilities. Attackers are interested in all kinds of data, and as Will Voorhees says in Eat Your Vegetables, as data scientists we often mistakenly think we can rely on the magic information security elves to protect our precious data. In 2017, make an effort to learn about encryption, account separation, and temporary credentials, and for the love of Hilary Mason, stop committing your access tokens to GitHub.
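One simple habit that keeps tokens out of your repository is reading them from the environment at runtime. The sketch below assumes a hypothetical variable name, `EXAMPLE_API_TOKEN`, and the helper names `load_token` and `mask` are ours; it is one possible approach, not a complete security strategy.

```python
import os


def load_token(name="EXAMPLE_API_TOKEN", environ=None):
    """Read a secret from the environment instead of hard-coding it."""
    # The environ argument exists so the function can be exercised without
    # touching the real process environment.
    environ = os.environ if environ is None else environ
    token = environ.get(name)
    if not token:
        raise RuntimeError(f"{name} is not set; export it in your shell instead")
    return token


def mask(token, visible=4):
    """Hide all but the last few characters so a token is safer to log."""
    if len(token) <= visible:
        return "*" * len(token)
    return "*" * (len(token) - visible) + token[-visible:]


# Demo with a fake environment and an obviously fake token.
token = load_token(environ={"EXAMPLE_API_TOKEN": "not-a-real-token-1234"})
print(mask(token))
```

In a real project you would set the variable in your shell or a deployment secret store, and make sure any local file that holds it is listed in `.gitignore`.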
9. Make your code go faster
Still have scripts running all night in the hopes you'll wake up to results to enjoy with your morning coffee? You're out of excuses; it's time to get on the MapReduce bandwagon and teach yourself Hadoop and Spark. Know what else will speed things up? More efficient code! One low-hanging fruit is mutable data structures. Sure, Pandas data frames are great, but did you ever wonder what makes those lookups, joins, and aggregations so easy? It's holding a bunch of data in memory all at the same time. In 2017, try switching to NumPy arrays and see what you think.
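If you want to try the NumPy suggestion, here is a rough sketch that computes per-group means two ways: with a plain Python loop and with vectorized NumPy calls. The data is synthetic and the function names are made up for the example; how much faster the NumPy version runs will depend on your data and machine, so time it yourself.

```python
import numpy as np

# Synthetic data: 10,000 values, each assigned to one of three groups.
rng = np.random.default_rng(0)
values = rng.normal(size=10_000)
groups = rng.integers(0, 3, size=10_000)


def group_means_loop(values, groups, n_groups):
    """Per-group mean with an explicit Python loop."""
    totals = [0.0] * n_groups
    counts = [0] * n_groups
    for v, g in zip(values, groups):
        totals[g] += v
        counts[g] += 1
    return [t / c for t, c in zip(totals, counts)]


def group_means_numpy(values, groups, n_groups):
    """Per-group mean using vectorized NumPy operations."""
    # bincount with weights sums the values falling in each group.
    totals = np.bincount(groups, weights=values, minlength=n_groups)
    counts = np.bincount(groups, minlength=n_groups)
    return totals / counts


loop_means = group_means_loop(values, groups, 3)
numpy_means = group_means_numpy(values, groups, 3)
print(loop_means)
print(numpy_means)
```

Both functions assume every group has at least one member; an empty group would cause a division by zero in either version.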
10. Contribute to an open source project
You may not realize it, but as a data scientist, you are already significantly involved in the open source community. Nearly every single one of the tools we use (Linux, Git, Python, R, Julia, Java, D3, Hadoop, Spark, PostgreSQL, MongoDB) is open source. We look to Stack Overflow and Stack Exchange to find answers to our programming questions, grab code from blog posts, and pip install like there's no tomorrow. In 2017, consider giving back by making your own contribution to an open source project. It's not just for data science karma; it's also a way to build up your GitHub cred! A lot of the senior data scientists I know don't even look at candidates' resumes anymore; someone's GitHub portfolio and commit history often speak volumes more.
Conclusion
In 2016, the world of analytics, machine learning, multiprocessing, and programming got a lot bigger. The result of the data science eruption has been a broader and more diverse community of colleagues, people who will meaningfully augment not only the quantity but the quality of the next generation of data products.
And yet, this expansion has also meant that the field of data science began to lose some of its mystique and cachet. As new practitioners flood the market, data scientist salaries have started to drop off, from highs in the $200K-range to ones topping out closer to $150K; as Barb Darrow signaled in her 2015 Fortune article, "Supply, meet demand. And bye-bye perks."
So how can you distinguish yourself in a landscape which may once have felt impenetrable, but has now started to feel routine? Whether you use ours or set your own, pick ten things you can do over the next year to keep your mind sharp and your skills current, and remember — when it comes to data science, nothing endures but change!
District Data Labs provides data science consulting and corporate training services. We work with companies and teams of all sizes, helping them make their operations more data-driven and enhancing the analytical abilities of their employees. Interested in working with us? Let us know!