Nate Harada | Machine Learning in Real Life

Tensorflow, I Love You, But You're Bringing Me Down

Tensorflow’s meteoric rise to the top of the deep learning world is, while unsurprising, pretty damn impressive. With almost 60k stars on Github (the only reasonable measure of software popularity), Tensorflow is far out in front of its nearest competitor Caffe and its paltry 18k. The framework has a lot going for it: Python, great tools like Tensorboard, Python, Google’s knowledge of distributed systems, Python, and popularity that all but guarantees future relevance.

But while Tensorflow is a wonderful framework, the decisions (or lack thereof) being made by the Tensorflow product team are making the framework increasingly difficult for external developers to adopt. In my eyes, Tensorflow’s public face has grown without proper direction, and is threatening to alienate developers and allow competing frameworks to take over.

Fragmented high level library support

My main gripe strikes me as a weird and totally avoidable issue: there are too damn many Google-supported libraries for Tensorflow. Good software engineers know that reinventing the wheel is a bad thing, so when the prospect of writing yet another training and batching loop rears its ugly head, we look to high-level libraries to ease the pain. Apparently, Google employees saw this coming, and in a mad scramble to curry organizational favor managed to release no fewer than five(!) Google-developed high-level libraries. There’s tf.learn (which is of course different from the third-party tool TFLearn), tf.slim, DeepMind’s Sonnet, something called prettytensor, and Keras, who, if this were a high school drama, would be rapidly trying to distance herself from her less cool friend Theano.

I appreciate the work that has gone into these tools, and certainly it’s a benefit to have options. However, these are first-party, Google-supported tools. There’s no clear preferred library, even internally, and while the champions of each library claim they are nothing alike, it’s difficult for an external developer or researcher to pick one. When “new” == “risky” for most companies, developers want a toolkit they can commit to deploying internally that will still be considered “best practice” in a few months. By offering a whole slew of somewhat-supported options, Google is hindering adoption of the Tensorflow framework in general. Avoiding boilerplate code for each new experiment is a must-have for most devs, but having to learn a new “hot” framework because the previous ones are no longer feature-competitive severely limits research output, and it’s an unreasonable problem to have when all of these libraries are controlled by the same company.
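To see how little these options share, here’s a minimal sketch of the same fully connected layer written three ways against the TF 1.x-era APIs (the layer size and placeholder shape are illustrative):

```python
import tensorflow as tf
from keras.layers import Dense  # standalone Keras with the Tensorflow backend

x = tf.placeholder(tf.float32, shape=[None, 784])

# Raw Tensorflow: explicit but verbose
y_raw = tf.layers.dense(x, 256, activation=tf.nn.relu)

# tf.slim: the same layer with a different name and argument spelling
slim = tf.contrib.slim
y_slim = slim.fully_connected(x, 256, activation_fn=tf.nn.relu)

# Keras: object-oriented layers, activations named by string
y_keras = Dense(256, activation='relu')(x)
```

All three produce equivalent graph nodes, but the idioms, argument names, and scoping conventions differ enough that code written against one doesn’t transfer to another.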

Build-mania

One of the best things a software product can have is a strong community. For most of us, learning a new library means reading examples on blogs and Github, and consulting forums or documentation for help on specifics. Unfortunately for the average developer, Google’s drive to ship features and exciting new pieces of the ecosystem has left those resources in the dust. Every week, it seems, a new Tensorflow product is announced: XLA, TFDBG, a graph operation to turn on your toaster, etc. No doubt these features are beneficial, but it also means that any resource about Tensorflow is almost immediately out of date. Documentation tends to be the most current, but often provides no context or example usage. Example code is often stale, sometimes presenting old functions or workflows that aren’t used anymore. Stack Overflow answers tend to be only half-useful, since at least some of each answer is probably outdated.
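As a concrete taste of the churn: several core functions were renamed around the 1.0 release, so sample code copy-pasted from a 2016 tutorial often fails outright. A small before-and-after sketch:

```python
import tensorflow as tf

# Pre-1.0 idioms, still common in old blog posts and answers:
#   sess.run(tf.initialize_all_variables())  # deprecated, later removed
#   z = tf.mul(a, b)                         # removed in 1.0
#   c = tf.concat(0, [t1, t2])               # argument order flipped in 1.0

# The same code against TF 1.x:
a, b = tf.constant(2.0), tf.constant(3.0)
t1, t2 = tf.constant([1.0]), tf.constant([2.0])

z = tf.multiply(a, b)       # tf.mul became tf.multiply
c = tf.concat([t1, t2], 0)  # axis moved to the second argument

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())  # replaces initialize_all_variables
    print(sess.run([z, c]))
```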

This problem should fade as time stabilizes the APIs and features, but it seems to me that this should have been planned for ahead of time. Tensorflow has been out for almost two years now (an eternity in deep learning time), but the Python API didn’t stabilize until March 2017. The other language bindings are still not stable. For a framework touting its production-ready capabilities, you’d expect the C++ API not to be shifting under your feet.

Everything is a tensor

This one is hard to complain about, because I totally understand why the architecture was built this way. In fact, in his Tensorflow Dev Summit talk, Derek Murray explicitly states that Google considers this a feature, not a bug. Hear me out anyway, though: making everything a Tensor invalidates a ton of knowledge about how to work with data in Python and negates many of the great tools that the Python ecosystem has built.
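A minimal illustration of the problem: a Tensor is a symbolic graph node, not data, so the ordinary Python tools for inspecting values tell you nothing useful.

```python
import tensorflow as tf

x = tf.constant([1.0, 2.0, 3.0])
y = x * 2  # builds a graph node; nothing is computed yet

# print() and pdb see only graph metadata, never values:
print(y)  # Tensor("mul:0", shape=(3,), dtype=float32)

# Values exist only once the graph runs inside a session:
with tf.Session() as sess:
    print(sess.run(y))  # [2. 4. 6.]
```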

In Tensorflow, more and more of the tools built around the project operate as graph operations or nodes themselves. This means that the whole pipeline, from data loading to transformation to training, is one giant GraphDef. This is highly efficient for Google: by making everything a graph operation, Google can optimize each little piece for whatever hardware the operation will run on (including exotic architectures like TPUs). However, it steepens the learning curve significantly. In this brave new tensor-fied world, I need to learn not only how the deep learning operations work (which are mostly math and therefore language-agnostic), but also how the data loading operations work, and the checkpointing operations, and the distributed training operations, and so on. Many of the tools that Python developers rely on, such as the debugger, are no longer useful, and IDEs designed to visualize Python data have no clue how to interact with these strange new language constructs. Developers outside of Google don’t want to learn what is essentially a new language just to use Tensorflow, and Google lock-in throws up a serious hurdle for organizations looking to de-risk new technology integration.
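To see how deep this goes, here’s a sketch of a typical TF 1.x queue-based input pipeline (the file name and feature spec are made up for illustration). Every stage, including reading files off disk, is a graph operation rather than ordinary Python I/O:

```python
import tensorflow as tf

# Every stage below is a graph node, not Python code you can step through:
filename_queue = tf.train.string_input_producer(["data.tfrecords"])  # file I/O as an op
reader = tf.TFRecordReader()
_, serialized = reader.read(filename_queue)                          # reading as an op

features = tf.parse_single_example(                                  # parsing as an op
    serialized,
    features={"label": tf.FixedLenFeature([], tf.int64),
              "image": tf.FixedLenFeature([784], tf.float32)})

image_batch, label_batch = tf.train.shuffle_batch(                   # batching as an op
    [features["image"], features["label"]],
    batch_size=32, capacity=1000, min_after_dequeue=100)
```

None of this executes until a session starts the queue runners (via tf.train.start_queue_runners), and none of it can be stepped through with pdb or inspected in a standard IDE.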

A cry for help

Tensorflow is trying to be everything to everyone, but it does not present a developer-friendly product to the greater deep learning community. Google is known for creating complex but effective internal tools, and taking those tools public is great for developers at large. However, when you’re on a team at a company with minimal deep learning experience, trying to build out production-level systems, it’s almost impossible to learn how to do things correctly. Unlike the Google employees who use the framework day to day, most of us have nobody to chat with when we have questions. To the Tensorflow team: we want to use your product, but at the end of the day it comes down to whatever lets us ship most effectively. Please don’t make us go back to writing Lua.
