Disclaimer: This article is how I advise personal friends to prepare for ML interviews. There is nothing in here specific to my current company, and you shouldn’t assume any advice here is endorsed or supported by them.
A Practical Approach to Solving an ML Design Problem
Design problems can be scary parts of an interview, especially for early career engineers who are familiar with coding questions but may not have experienced such vague, open-ended questions. I advise anyone tackling an open-ended ML problem to take a structured approach, and this means both in their jobs and during a design interview. The approach here is a bit biased towards my particular domain of ML, which is deep learning heavy and often runs on robots or custom hardware, but most of this applies no matter what you’re working on.
Design a system
The most important thing to remember when being asked to design a system is that you do not have enough information. If your boss asked you to design Facebook’s news recommendation system you wouldn’t go off and build it without clearing up a lot of details. Similarly, a design interview should be a conversation with the interviewer.
A structured approach
Prompt: Design a system to detect if a person is frowning.
Stage 1: Understand the problem
What is the use case?
How will this be used in the wild?
What formulation do we want to put this problem in?
Is this an object detection problem? Classification?
What details matter?
Do we need to deal with multiple people?
Does it need to be very fast?
What hardware is this running on? GPU? Microcontroller?
What accuracy/precision/recall tradeoffs matter?
What about lighting conditions? Occlusion?
What are the pros/cons of the formulation you’ve picked?
Are there any existing implementations?
Stage 2: Data & Metrics
Data
What does our data look like?
How should the data be structured?
How do we label our data if this is a supervised task?
What kind of variety do we need?
How do we check label quality?
How can we improve the labeling process?
How do we split our data? (see the split sketch after this list)
What’s the right order to feed the data?
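To make the splitting question concrete for the frowning prompt: the same person should never appear in both train and test, or the model can memorize faces rather than expressions. A sketch with scikit-learn, using placeholder data:

```python
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

# X: image features, y: frown labels, person_ids: who appears in each image.
# All three are placeholders -- load your own data here.
X = np.random.rand(1000, 128)
y = np.random.randint(0, 2, size=1000)
person_ids = np.random.randint(0, 50, size=1000)

# Group-aware split: no person appears in both train and test.
splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(splitter.split(X, y, groups=person_ids))
X_train, X_test = X[train_idx], X[test_idx]
y_train, y_test = y[train_idx], y[test_idx]
```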
Metrics
What is success?
What are the traditional metrics for this task? (see the metrics sketch after this list)
How many metrics might we need? Do we need a metrics hierarchy?
What metrics might lie to us?
What modifications might we need to existing metrics?
How do we baseline our performance?
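To make these questions concrete for the frowning prompt: if frowns are rare, accuracy will happily lie to you, which is why precision and recall (and the confusion matrix) belong in the conversation. A small illustration with scikit-learn:

```python
from sklearn.metrics import (accuracy_score, precision_score,
                             recall_score, confusion_matrix)

# Placeholder labels and predictions: frowns (1) are the rare class.
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))    # 0.8 -- looks fine...
print(precision_score(y_true, y_pred))   # 0.5 -- half our frown alarms are false
print(recall_score(y_true, y_pred))      # 0.5 -- and we miss half the frowns
print(confusion_matrix(y_true, y_pred))
```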
Stage 3: The Model
Features (especially if not deep learning)
As a human, how would I solve the problem? Can those be translated to features?
Can I use existing feature extraction methods (e.g. Word2Vec) or do I need to create custom ones?
What things would make this problem trivial if I knew them?
How expensive are these features to create?
Can I share features with other models?
Model Architecture
What kind of input representations might make sense?
What is our target output and loss function?
What architectures make sense? (see the sketch after the Training and Tuning questions)
Training and Tuning
How do we do the training? Which optimizer?
Do we need to do any transfer learning?
Do we need to employ tricks like negative sampling?
What kind of hyperparameters actually matter?
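To ground the architecture and training questions, here is one minimal PyTorch sketch for the frowning prompt: treat it as binary classification over a face crop and fine-tune a small pretrained backbone. The backbone, input size, loss, optimizer, and learning rate are all assumptions to negotiate with the interviewer, not the answer:

```python
import torch
import torch.nn as nn
from torchvision import models

# Transfer learning: start from an ImageNet-pretrained backbone and
# replace the classifier head with a single frown/no-frown logit.
model = models.resnet18(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, 1)

criterion = nn.BCEWithLogitsLoss()          # binary target, logit output
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(images, labels):
    """One step on a batch of face crops (N, 3, 224, 224) and labels (N,)."""
    optimizer.zero_grad()
    logits = model(images).squeeze(1)
    loss = criterion(logits, labels.float())
    loss.backward()
    optimizer.step()
    return loss.item()
```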
Model Debugging
How can we break down the model to simple parts to make sure those work? (see the sanity-check sketch after this list)
What kind of visualizations do we want?
How can we know if we’re overfitting or underfitting?
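One concrete version of the “simple parts” check: before training at scale, verify the model can overfit a single batch. If the loss won’t drive toward zero, something upstream (labels, loss wiring, learning rate, data pipeline) is broken. A sketch, reusing the hypothetical train_step from the training sketch above; train_loader is assumed to exist:

```python
# Sanity check: a healthy model should memorize one small batch easily.
images, labels = next(iter(train_loader))
for step in range(200):
    loss = train_step(images, labels)
    if step % 50 == 0:
        print(f"step {step}: loss {loss:.4f}")
# Expect the loss to approach 0. If it plateaus, debug before scaling up.
```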
Stage 4: Operations
Deployment
What will the deployment look like?
What are the deployment constraints – latency? memory? compute?
Does this model need specialized hardware (e.g. GPU)?
What happens if we need to scale up more than expected?
What things can I cache to improve performance? (see the caching sketch after this list)
Do we need to refresh the data? How often?
What conditions might require a data refresh sooner than expected?
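On caching: if many requests share an expensive intermediate result (face detection on near-duplicate frames, features shared across models), memoizing it is often the cheapest win. A toy sketch with functools; the extractor function and its keying scheme are hypothetical, and a real deployment would more likely cache in an external store:

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def face_embedding(image_id: str) -> tuple:
    # Placeholder for an expensive feature-extraction call, keyed by a
    # stable image id so repeated requests skip recomputation.
    return run_expensive_extractor(image_id)  # hypothetical function
```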
Monitoring
How can we know our model is acting correctly once deployed?
Is there a way to collect metrics during deployment?
Should we set alarms for issues that may crop up?
Stage 5: Making Improvements
Iteration
How can we improve the model once it’s deployed?
What kind of error analyses can we do?
Are there any representation issues that mean the model cannot learn certain things?
Can we use the model to improve itself (active learning)? (see the sketch after this list)
How can we make improvements without causing regressions?
Is this model like any others we have already? Can we merge them to improve generalization?
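On active learning, one common recipe is uncertainty sampling: route the examples the deployed model is least sure about to your labelers. A minimal sketch, assuming the model outputs frown probabilities for a pool of unlabeled images:

```python
import numpy as np

def pick_for_labeling(probs: np.ndarray, k: int = 100) -> np.ndarray:
    """Return indices of the k examples closest to the decision boundary.

    probs: predicted frown probabilities for a pool of unlabeled images.
    """
    uncertainty = -np.abs(probs - 0.5)   # highest when prob is near 0.5
    return np.argsort(uncertainty)[-k:]  # the k most uncertain examples
```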
During the interview
The list above is huge and still only a subset of the things that matter when designing an ML system. During an interview, you can skip entire pieces of this if you think they don’t matter, are implied by the question, or just to save time. Again, remember: the important thing is to gather information and use it to ultimately build a system with the right trade-offs and qualities.
Tensorflow’s meteoric rise to the top of the deep learning world is, while unsurprising, pretty damn impressive. With almost 60k stars on Github (the only reasonable measure of software popularity), Tensorflow is far out in front of its nearest competitor, Caffe, with its paltry 18k. The framework has a lot going for it: Python, great tools like Tensorboard, Python, Google’s knowledge of distributed systems, Python, and popularity that all but guarantees future relevance.
But while Tensorflow is a wonderful framework, the decisions (or lack thereof) being made by the Tensorflow product team are making it increasingly difficult for external developers to adopt. In my eyes, Tensorflow’s public face has grown without proper direction, and is threatening to alienate developers and allow competing frameworks to take over.
Fragmented high level library support
My main gripe strikes me as a weird and totally avoidable issue: there are too damn many Google-supported libraries for Tensorflow. Good software engineers know that reinventing the wheel is a bad thing, and so when the prospect of writing yet another training and batching loop rears its ugly head, we look to high level libraries to ease the pain. Apparently, Google employees were aware this would happen, and in a mad scramble to curry organizational favor managed to release no fewer than five(!) Google-developed high level libraries. There’s tf.learn (which is of course different from the 3rd party tool TFLearn), tf.slim, DeepMind’s Sonnet, something called prettytensor, and Keras, who, if this were a high school drama, would be rapidly trying to distance herself from her less cool friend Theano.
I appreciate the work that has gone into these tools, and it’s certainly a benefit to have options. However, these are all first party, Google-supported tools. There’s no clear preferred library, even internally, and while the champions of each library claim they are nothing alike, it’s difficult for an external developer or researcher to pick one. When “new” == “risky” for most companies, developers want a toolkit they can commit to deploying internally that will still be considered best practice in a few months. By offering a whole slew of somewhat-supported options, Google is hindering adoption of the Tensorflow framework in general. Avoiding boilerplate code for each new experiment is a must-have for most devs, but having to learn a new “hot” library because the previous ones are no longer feature competitive severely limits research output, and is an unreasonable problem to have when all of them are controlled by the same company.
Build-mania
One of the best things a software product can have is a strong community. For most of us, learning a new library means reading examples on blogs and Github, and consulting forums or documentation for help on specifics. Unfortunately for the average developer, Google’s desire to build features and exciting new pieces of the ecosystem has left those resources in the dust. Every week, it seems, a new Tensorflow product is announced: XLA, TFDBG, a graph operation to turn on your toaster, etc. No doubt these features are beneficial, but it also means that any resource about Tensorflow goes out of date almost immediately. Documentation tends to be the most up to date, but often provides no context or example usage. Example code is often stale, sometimes presenting old functions or workflows that aren’t used anymore. Stack Overflow questions tend to be only half-useful, since at least part of the answer is probably outdated.
This problem should fade as time stabilizes the APIs and features, but to me it seems this should have been planned for ahead of time. Tensorflow has been out for almost 2 years now (an eternity in deep learning time), but the Python API didn’t stabilize until March 2017. The other language bindings are still not stable. For a framework touting its production-ready capabilities, you’d expect the C++ API to not be shifting under your feet.
Everything is a tensor
This one is hard to complain about, because I totally understand why the architecture was built this way. In fact, Derek Murray explicitly states in his Tensorflow dev summit talk that Google considers this a feature and not a bug. Hear me out anyway, though: making everything a tensor invalidates a ton of knowledge about how to work with data in Python and negates many of the great tools the Python ecosystem has built.
In Tensorflow, more and more of the tools built around the project operate as graph operations or nodes themselves. This means that the whole pipeline, from data loading to transformation to training, is one giant GraphDef. This is highly efficient for Google: by making everything a graph operation, Google can optimize each little piece for whatever hardware the operation will run on (including exotic architectures like TPUs). However, it steepens the learning curve significantly. In this brave new tensor-fied world, I need to learn not only how the deep learning operations work (which are mostly math and therefore language agnostic), but also how the data loading operations work, and the checkpointing operations, and the distributed training operations, and so on. Many of the tools Python developers rely on, such as the debugger, are no longer useful, and IDEs designed to visualize Python data have no clue how to interact with these strange new language constructs. Developers outside of Google don’t want to have to learn what is essentially a new language to use Tensorflow, and Google lock-in throws up a serious hurdle for organizations looking to de-risk new technology integration.
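To see what I mean, here’s a sketch in the TF 1.x style of the time (the file name is a placeholder): even reading and decoding an image happens inside the graph, so none of your usual Python tooling sees real data until a session runs:

```python
import tensorflow as tf

filename = tf.constant("face.jpg")                   # a tensor, not a string
raw_bytes = tf.read_file(filename)                   # file I/O is a graph op
image = tf.image.decode_jpeg(raw_bytes)              # decoding is a graph op
resized = tf.image.resize_images(image, [224, 224])  # so is preprocessing

# Up to here nothing has executed and `resized` holds no data --
# pdb and your IDE can show you graph nodes, but not pixels.
with tf.Session() as sess:
    pixels = sess.run(resized)                       # now the graph actually runs
```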
A cry for help
Tensorflow is trying to be everything to everyone, but does not present a developer-friendly product to the greater deep learning community. Google is known for creating complex but effective internal tools, and taking these tools public is great for developers at large. However, when you’re on a team at a company with minimal deep learning experience trying to build out production level systems, it’s almost impossible to learn how to do things correctly. Unlike the Google employees who use the framework day to day, most of us have nobody to chat with when we have questions. To the Tensorflow team: we want to use your product, but at the end of the day it comes down to whatever lets us ship products most effectively. Please don’t make us go back to writing Lua.
It’s been a while, but yesterday I attended a great lecture by UMass’ Ben Marlin. Ben works on very similar problems to my own research, and his paper on conditional random fields for morphological analysis of wireless ECG signals is a great example of how advances in machine learning can work to improve long standing problems in healthcare. The notes aren’t perfect, but I’ve tried to fix them up from their raw form. I am unable to find slides, unfortunately.
Segmenting and Labeling On-Body Sensor Data Streams with CRFs and Factor Graphs
Two big spaces in this research
Clinical data analytics (ICU EHRs)
mHealth Data Analytics — what we’ll talk about today. This is a broad space, including the app and device space like fitness wearables and iPhone apps, wireless sensors, etc. The interesting thing here is that the signals coming in are the same ones you’ll find in an ICU.
With wearables, we want sensors that are accurate, real-time, energy efficient, and non-intrusive. We work on addiction in our lab, for example smoking or cocaine use. We also look at eating detection, etc. That may seem silly, but these things tie into ICU monitoring, e.g. pulmonary edema recovery.
For mobile health, we start with detection and move to prediction, understanding, and finally intervention.
Current problem framework
Let’s look at the pipeline for these tasks.
At the raw data level: we are looking at quasi-periodic time series data.
[Slide: respiration data, one channel]
Then comes segmentation of some sort. This should be unsupervised and adaptive.
[Slide: segmentations overlaid on raw data, segments are heterogeneous]
Next is labeling these segments. This is basically where the state of the art is today, especially making independent predictions for each segmented datum.
[Slide: each segment has a color corresponding to a class]
From these segments we want to be able to come up with activity segments where a segment represents one action like eating a sandwich or smoking a cigarette.
[Slide: higher level colored segments, bigger than the individual segments]
Challenges in Mobile Health
We need these things to be low: cost, power usage, noise, drift, dropout. Obviously we can’t have all of these at once.
Labeled data is very high cost. Not only that, but it has limited ecological validity.
Self reporting results in a lack of temporal precision and low accuracy.
The “n=me” problem. Big data doesn’t really solve problems in this space because people are so different. With low data volumes, everyone looks different, and we end up with covariate shift or transfer learning problems.
Black boxes are a problem: in medicine we need to draw meaningful conclusions from model results, which is difficult. Something like deep learning needs model distillation, because doctors and patients don’t trust a black box.
All of this needs to be real time! Model compression is coming back for something like this.
Case Study — CRFs for Labeling and Segmenting ECG
Motivating factor is detecting cocaine usage. For cocaine users, there are morphological changes in the ECG beyond just a rate increase: for example, the QRS complex and QT interval are prolonged. These morphological changes are specific to the drug, which lets us filter out false positives that we’d get from something like heart rate alone. Detecting each part of the heartbeat is very important but difficult.
The basic idea behind this technique is to use a CRF to label the segments, given a window of features around each potential peak. Sparse coding is used for feature extraction (and feature learning), and each window’s sparse coding coefficients are its feature representation.
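This isn’t Ben’s implementation, just a rough sketch of the idea with off-the-shelf pieces: scikit-learn’s dictionary learning stands in for the sparse coder, and sklearn-crfsuite for the CRF. The windows and labels here are random placeholders:

```python
import numpy as np
from sklearn.decomposition import DictionaryLearning
import sklearn_crfsuite

# windows: (n_peaks, window_len) ECG slices around candidate peaks (placeholder)
windows = np.random.randn(500, 64)
labels = ["P", "QRS", "T", "other"] * 125  # per-peak labels (placeholder)

# Learn a sparse coding dictionary; the per-window codes are the features.
coder = DictionaryLearning(n_components=32, transform_algorithm="lasso_lars")
codes = coder.fit_transform(windows)

# One sequence = one ECG record; crfsuite wants a feature dict per position.
X = [[{f"c{i}": float(v) for i, v in enumerate(code)} for code in codes]]
y = [labels]

crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, y)
predicted = crf.predict(X)
```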
[Slides: many results slides. accuracy is high, amount of train data required is low, CRF does not have differential recall]
Running out of time, but quick bit about hierarchical segmentation where we jointly label and segment.
I spoke to Ben after the lecture about transfer learning from various datasets of ECGs. He claims that the sparse coding dictionaries are fairly stable and consistent, and doesn’t believe that training sparse coding on more complete or noise-free datasets would yield a large benefit. We also talked about trying to use the sparse coding coefficients as sequence learning inputs for far-off targets such as disease or outcome prediction. This is something I am considering applying in my own work. He has not tried this, but admits it is an idea worth pursuing.
While the SciPy project already maintains a complete list of the differences between NumPy and Matlab, that list is big and random and this list is small and somewhat ordered. My research is written in both Matlab and Python and, like the musician who yells the wrong city at their show, these are the mistakes I make most commonly when switching back and forth.
Matlab indexes beginning with 1, Python with 0. This is well known, but can still trip you up when you frequently switch back and forth. This applies to all indexed values, such as the axis to apply a function to (the first axis is 1 in Matlab, 0 in Python).
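For example, with the Matlab equivalents as comments:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
col_sums = A.sum(axis=0)   # array([4, 6]); Matlab: sum(A, 1)
first = A[0, 0]            # element 1;     Matlab: A(1, 1)
```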
Numpy arrays are by default element-wise for multiplication and division. To perform traditional matrix multiplication you will need to use np.dot, because both * and np.multiply are element-wise. Python 3.5 will be introducing the @ symbol for infix matrix multiplication, which will hopefully resolve some of the confusion. Similarly, Numpy offers matrix as an alternative to ndarray, but if you value your sanity you should stick with arrays. The matrix class makes traditional matrix multiplication the default for the * operator, at the expense of adding restrictions and caveats to literally everything else.
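A quick illustration, with the Matlab equivalents as comments:

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
B = np.array([[5, 6], [7, 8]])

elementwise = A * B    # Matlab: A .* B
matmul = np.dot(A, B)  # Matlab: A * B
# Python >= 3.5 with recent NumPy: matmul = A @ B
```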
In Numpy, many functions require a tuple as an argument. This happens in functions like concatenate and reshape.
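A short example; note the extra parentheses (Matlab equivalents in the comments):

```python
import numpy as np

a = np.ones((2, 3))
b = np.zeros((2, 3))

stacked = np.concatenate((a, b), axis=0)  # Matlab: cat(1, a, b)
reshaped = a.reshape((3, 2))              # Matlab: reshape(a, 3, 2)
# Forgetting the tuple, e.g. np.concatenate(a, b), raises a TypeError.
```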
In Numpy, arrays are not inherently multidimensional. Creating an array can produce a 1d array, which does not even have a second dimension. Compare this to Matlab, where vectors are 1xN or Nx1 2d arrays. This small difference is a common source of pain, especially because it isn’t caught by static checkers and will inevitably end up crashing at the very end of your long script, right after you finish training a huge model and right before you display the results.
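A quick illustration of the difference:

```python
import numpy as np

v = np.array([1, 2, 3])
print(v.shape)      # (3,)  -- one dimension, no second axis at all
print(v.T.shape)    # (3,)  -- transposing a 1d array is a no-op

col = v.reshape((3, 1))
print(col.shape)    # (3, 1) -- an explicit column, like Matlab's [1;2;3]
```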
I haven’t really had time to write new blog posts with both class and research in full swing, but I did have some leftover code for scraping music festival data, so I decided to do something with it. The festival season is starting soon, with the insanely early SXSW already wrapped up and the massive juggernaut that is Coachella starting this weekend.
Since the hipsters among us know that going to a popular music festival (especially to see only headliners) is akin to renouncing “On Avery Island” or writing music reviews for People Magazine, I’ve put together a handy chart to help you choose which festival to go to based on how mainstream the bands are on average. Using the ever-handy Echonest API, I averaged out the familiarity and “hotttnesss” of the bands at each festival. The most mainstream festivals are towards the top right:
In this case, familiarity can be viewed as long-term brand recognition, and “hotttnesss” can be viewed as hype at this moment. So a band like The Rolling Stones has strong name-recognition but might not be very hyped, while The Weeknd may be very hot but probably isn’t familiar to most people over 40. It’s rare for a band to be popular and not familiar, so most of the festivals lie along the same line. However, there are clear deviations. For example, Lollapalooza seems to have less brand recognition than Summerfest for the same amount of popularity. This makes sense given Summerfest’s more family friendly demographic.
Many trends can be explained by the proportion of headliners to smaller acts. South By Southwest is quite hipster in this interpretation, while Boston Calling is very mainstream. This corresponds to SXSW’s relatively large, small-timer lineup and Boston Calling’s compact, headliner-heavy weekend. Hipsters will note that Pitchfork has stayed true to form and is deep in “you’ve probably never heard of them” territory.
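For the curious, the scoring above boils down to something like the sketch below. I’m writing the Echonest calls from memory, so treat the endpoint details, response parsing, and placeholder API key as assumptions:

```python
import requests
import numpy as np

PROFILE_URL = "http://developer.echonest.com/api/v4/artist/profile"

def artist_scores(name, api_key="YOUR_KEY"):
    """Fetch (familiarity, hotttnesss) for one artist from Echonest."""
    params = {"api_key": api_key, "name": name, "format": "json",
              "bucket": ["familiarity", "hotttnesss"]}
    artist = requests.get(PROFILE_URL, params=params).json()["response"]["artist"]
    return artist["familiarity"], artist["hotttnesss"]

def festival_score(lineup):
    """Average the two scores over a festival's lineup (a list of band names)."""
    scores = np.array([artist_scores(band) for band in lineup])
    return scores.mean(axis=0)  # (mean familiarity, mean hotttnesss)
```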
A few points of criticism to address preemptively:
This graph relies on Echonest’s ranking system, which is hush-hush. They claim the numbers are based on activity over crawled webpages but who knows how accurate they actually are.
There are no axis values because both parameters are normalized dimensionless numbers (i.e. values in [0, 1]), so only relative values matter.
I don’t work for or represent Echonest, even though I know I’ve used them twice now in my blog posts. I do really appreciate their product though.