
As the number of data-driven applications increases, data is becoming an important part of the code base. The trend is so clear that some people have even rushed to announce the demise of code. "Code is a commodity", claims Henry Verdier, and "Data is the new code". While this seems to be an exaggeration, our increasing dependence on data has consequences. In fact, as Sculley and colleagues argue in their recently published paper "Machine Learning: The High-Interest Credit Card of Technical Debt" (D. Sculley, Gary Holt, Daniel Golovin, Eugene Davydov, Todd Phillips, Dietmar Ebner, Vinay Chaudhary, Michael Young), in projects that rely heavily on data-driven (aka machine learning) approaches, the cost of data dependencies outweighs the cost of code dependencies. Forget about the never-ending "functional vs. object-oriented" debate: let us first get straight on the issue of data dependencies.

Unfortunately, this is not easy to do. Sculley and colleagues argue that in traditional software engineering, the number of interdependencies can be greatly reduced via encapsulation and modular design. This is possible because we write modules and functions to satisfy strict requirements. We know that certain logical invariants hold and can check functionality via unit and integration tests. For some applications, we can even formally verify correctness. However, as the authors note, "... it is difficult to enforce strict abstraction boundaries for machine learning systems by requiring these systems to adhere to specific intended behavior. Indeed, arguably the most important reason for using a machine learning system is precisely that the desired behavior cannot be effectively implemented in software logic without dependency on external data."
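
To see the contrast, here is a contrived example of a conventional function whose contract a unit test can verify without any external data (the function and the test are my own illustration, not from the paper):

    def merge_sorted(a, b):
        """Merge two sorted lists; the output invariant is fully specified."""
        out, i, j = [], 0, 0
        while i < len(a) and j < len(b):
            if a[i] <= b[j]:
                out.append(a[i])
                i += 1
            else:
                out.append(b[j])
                j += 1
        return out + a[i:] + b[j:]

    # The invariant "the output is the sorted union of the inputs" holds
    # for every input, so a unit test can simply assert it:
    assert merge_sorted([1, 4, 9], [2, 3, 10]) == [1, 2, 3, 4, 9, 10]

No analogous assertion can be written for, say, a learned spam classifier: its "specification" is inseparable from the training data, which is exactly the authors' point.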

At the same time, it is not easy to decompose the data, because there are no truly independent features. It is hard to isolate an improvement (which features contributed most?) and it is hard to debug problems. Overall, machine learning systems are much more fragile, because small local changes (e.g., in regularization parameters or convergence thresholds) can and often do have ripple effects. Sculley and colleagues call this phenomenon the CACE principle: Changing Anything Changes Everything.
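
To make CACE concrete, here is a toy sketch (synthetic data; scikit-learn is assumed to be available): nudging a single regularization parameter shifts every learned coefficient at once, so no feature's contribution is insulated from the change.

    import numpy as np
    from sklearn.linear_model import Ridge

    rng = np.random.default_rng(0)
    X = rng.normal(size=(200, 5))
    y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(scale=0.5, size=200)

    for alpha in (1.0, 10.0):  # a "small local change" in regularization
        model = Ridge(alpha=alpha).fit(X, y)
        print(alpha, np.round(model.coef_, 3))
    # Every coefficient moves, not just one:
    # Changing Anything Changes Everything.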

Clearly, as data-driven applications become even more common, an established set of best practices and design patterns tailored specifically to the management of data will emerge. Some of the emerging patterns are already discussed by Sculley and colleagues. Among other things, they recommend reducing the amount of glue code and removing low-impact features and experimental code paths.

There are a number of tools (in many languages) to identify code dependencies. Sculley and colleagues argue that data dependencies can be analyzed as well, in an automatic or semi-automatic manner. At the very least, one can catalog all the features used in a company. Different learning modules can report their feature usage to a central repository. When a version of a feature changes, or the feature becomes deprecated, it is possible to find all the relevant consumers quickly. Such a feature-management tool greatly reduces the risk of having a stealthy consumer, e.g., one that reads features from log files, whose behavior is adversely affected by the deprecation or change of certain input signals.
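
Here is a minimal sketch of what such a feature-management tool could look like; all the names (FeatureRegistry, report_usage, and so on) are hypothetical, since the paper describes the idea rather than a concrete API:

    from collections import defaultdict

    class FeatureRegistry:
        def __init__(self):
            # feature name -> set of consuming modules
            self._consumers = defaultdict(set)

        def report_usage(self, feature, consumer):
            """Called by each learning module for every feature it reads."""
            self._consumers[feature].add(consumer)

        def consumers_of(self, feature):
            """Who must be notified if this feature changes or is deprecated?"""
            return sorted(self._consumers[feature])

    registry = FeatureRegistry()
    registry.report_usage("user_age_days", "churn_model")
    registry.report_usage("user_age_days", "ranking_model_v2")

    # Before deprecating the feature, find every consumer, including ones
    # that would otherwise be discovered only when their quality degrades.
    print(registry.consumers_of("user_age_days"))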

Machine learning is a powerful tool that allows us to quickly build complex systems based on previously observed data patterns instead of laboriously handcrafting these patterns manually. Yet, its performance hinges on the assumption that the previously observed statistical properties of the data remain unchanged in the future. A situation where this assumption is violated is called concept drift. As a result of concept drift, the performance of a predictive model deteriorates over time. The more sophisticated the model, the more likely it is to suffer from this drift. In particular, the error rate of a simple linear model may become equivalent to that of a much more sophisticated one!
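
The effect is easy to reproduce on synthetic data. In the sketch below (my own toy setup, again assuming scikit-learn), a classifier is trained while the labels depend on one weight vector and is then evaluated after the relationship has shifted:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(0)

    def sample(n, w):
        """Labels depend on a weight vector w; drift = w changes over time."""
        X = rng.normal(size=(n, 3))
        y = (X @ w + rng.normal(scale=0.3, size=n) > 0).astype(int)
        return X, y

    w_old, w_new = np.array([2.0, -1.0, 0.5]), np.array([-1.0, 2.0, 0.5])
    X_train, y_train = sample(2000, w_old)
    model = LogisticRegression().fit(X_train, y_train)

    X_same, y_same = sample(1000, w_old)    # same distribution as training
    X_drift, y_drift = sample(1000, w_new)  # the concept has drifted
    print("accuracy before drift:", model.score(X_same, y_same))
    print("accuracy after drift: ", model.score(X_drift, y_drift))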

Unfortunately, in the real world the main machine learning assumption does not hold. A domain where concept drift is especially stark is spam detection. Current anti-spam software is good, but it would not stay good without constant retraining and the introduction of new features. Again, Sculley and colleagues discuss this problem and propose a couple of mitigation strategies.
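
The paper's strategies center on monitoring. One common scheme (a generic sketch of mine, not the authors' exact proposal) is to track a sliding-window error rate on incoming labeled traffic and trigger retraining when it crosses a threshold; the window size and threshold below are illustrative choices:

    from collections import deque

    class DriftMonitor:
        def __init__(self, window=500, threshold=0.10):
            self._errors = deque(maxlen=window)
            self._threshold = threshold

        def record(self, predicted, actual):
            self._errors.append(int(predicted != actual))

        def needs_retraining(self):
            full = len(self._errors) == self._errors.maxlen
            return full and sum(self._errors) / len(self._errors) > self._threshold

    monitor = DriftMonitor()
    # In production, for each example with a (possibly delayed) true label:
    #   monitor.record(model.predict(x), true_label)
    #   if monitor.needs_retraining(): schedule retraining on fresh data.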

To conclude, I emphasize again that data-driven applications are different from classic software projects. New best practices and design patterns can be expected to evolve and mature to deal with problems like data dependencies and the ever-changing statistical properties of the external world. The paper "Machine Learning: The High-Interest Credit Card of Technical Debt" gives an overview of some of the design practices already used successfully by the folks at Google. I would recommend reading this paper, and following some of its references, to everyone interested in building large interconnected machine learning systems.