Data Engineering | Posted on Apr 24, 2017
Agile Data Science (Part 1)
The following is a guest post by Verdi March.
Modern businesses are increasingly adopting data science (i.e. data-driven practices) to improve their products, services and operations. Typical applications range from reporting to forecasting and actionable insights. A central theme in data science is modelling.
This post will review common strategies to shorten the time required for delivering models to production. The upcoming second part will zoom into a technical example.
What is a Model?
For simplicity’s sake, let’s describe a model using layman’s terms (for technical definitions, refer to this).
A model can be viewed simply as a black box. We can ask this black box questions, such as “if it rains for one hour, what is the expected volume of umbrella sales?”
Then, what is in the black box? To explain this, we shall introduce two terms: techniques and parameters. Examples of modelling techniques are aplenty:
- linear regression
- logistic regression
- score cards
- neural networks
Each technique has a set of parameters that is calibrated for a specific purpose (e.g. the umbrella forecasting). This calibration is commonly known by its technical term, training.
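To make the distinction concrete, here is a minimal sketch of the umbrella example. It assumes the technique is a one-variable linear regression and uses made-up observations; the function names `train` and `predict` are illustrative only and do not come from any particular library.

```python
# Technique: simple linear regression, sales = intercept + slope * rain_hours.
# Parameters: the intercept and the slope, calibrated ("trained") on past data.

def train(rain_hours, umbrella_sales):
    """Calibrate the two parameters with ordinary least squares."""
    n = len(rain_hours)
    mean_x = sum(rain_hours) / n
    mean_y = sum(umbrella_sales) / n
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in zip(rain_hours, umbrella_sales))
    sxx = sum((x - mean_x) ** 2 for x in rain_hours)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return {"intercept": intercept, "slope": slope}

def predict(params, hours):
    """Ask the black box: expected umbrella sales after `hours` of rain."""
    return params["intercept"] + params["slope"] * hours

# Hypothetical history: hours of rain and umbrellas sold on four past days.
history_hours = [0.0, 1.0, 2.0, 3.0]
history_sales = [5.0, 15.0, 25.0, 35.0]

params = train(history_hours, history_sales)
print(params)                 # the trained parameters
print(predict(params, 1.0))   # expected sales after one hour of rain
```

The technique stays fixed while the parameters change with the training data, which is why the same technique can be reused for an entirely different forecasting problem.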
Model Life Cycle
A model typically undergoes a continuous cycle from development to production, with a feedback loop from production back to development. This continuous life cycle involves plenty of diverse factors in terms of stakeholders, capabilities and technology stacks.
There has been huge fanfare around model development recently. Model development mainly entails training and feature engineering based on (preferably very large) data sets, and it is perceived as the shiniest, most glamorous aspect of data science. It’s the pinnacle of data science, so to speak.
However, for maximum impact to stakeholders and, ultimately, maximum value to customers, a holistic data science practice cannot be lopsided towards development alone. Deployment and operations are equally important. After all, only models that are fully utilised in production can deliver results to stakeholders and customers.
In the real world, the development and deployment stages are often done by separate teams. Such a structure reflects and acknowledges the need for different skill sets at each stage, and for a clear demarcation in responsibilities and accountability.
Nowadays, the model development team is commonly referred to as the data science team, which sometimes leads to confusion with the broader term: data science. The deployment team has more variations in its name, such as the data engineering team, the big data team, or in some cases, simply the software or IT folks!
It should be obvious by now that there is a challenge in simplifying the transition from one stage (or team) to another, bearing in mind that each stage can adopt different technology stacks and different best practices. A model may be developed using one commercial statistical package, yet deployed in an incompatible big data environment.
Such complexity presents risks in the form of slow time-to-deliver (and to a certain extent, slowed innovation), vendor lock-in, and potential loss of institutional knowledge. As a result, businesses cannot respond in a timely manner to a fast-paced, ever-changing market.
There are three strategies to manage this data science life cycle:
1. Enforce the same technology stacks across all stages.
Often this is not practical, simply because the right tools for model prototyping may not be the most efficient for production. The new model must interact with various parts of the production environment.
Think of the model as a centrepiece surrounded by various features essential in production. Some tools trade off runtime performance for development time: it may be fast to prototype a new model, but the runtime performance (e.g. scalability) may be below industry standards.
Another source of friction is the differing philosophies between the stages. Model development should be nimble. As such, it is heavily biased towards an exploratory or research style, often with a healthy dose of ignorance with regard to production environments.
On the other hand, putting a new piece of software into production needs a relatively more disciplined approach, particularly in software engineering and change management. This means the two teams may have remarkably different skill sets and preferences with regard to technology stacks.
2. Port or re-implement the model to production environments.
Contrary to strategy #1, this strategy aims to adopt the right tool for the right job. However, re-implementing each model is a software development effort in its own right, and developing software is complicated. Hence, time-to-production may be compromised.
3. Describe (i.e. store) models using a portable and open format.
Similar to strategy #2, the aim is to adopt the right tool for the right job. However, the difference is in the execution. Rather than relying on manual labour for each new model, models are described using an open format such as PMML or PFA so that they can be deployed to a production environment in a timely manner. Of course, the production environment must be capable of understanding this open format, which fortunately is usually the case nowadays.
Model metadata expressed in an open standard such as PMML or PFA serves as the standard representation that is agreed upon and understood by all elements of the life cycle.
The standardised metadata ensures interoperability between the development and production environments. It also protects against vendor lock-in, thereby preserving institutional knowledge, and increases innovation by allowing new elements to be on-boarded quickly into the model life cycle. Ultimately, it enables businesses to respond to their customers in a timely manner.
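As a rough illustration of strategy #3, the sketch below writes a trained linear regression to a simplified, PMML-style XML document on the development side, then rebuilds a scoring function from that document alone on the production side. It is a sketch under assumptions: the element names follow PMML's regression model vocabulary, but the namespace and header are omitted and the output is not validated against the PMML schema. The function names `export_model` and `load_scorer`, the field names and the parameter values are all hypothetical. A real deployment would more likely rely on a PMML-compliant exporter and scoring engine.

```python
import xml.etree.ElementTree as ET

def export_model(intercept, slope, feature="rain_hours", target="umbrella_sales"):
    """Development side: describe the trained model as PMML-style XML."""
    pmml = ET.Element("PMML", version="4.3")
    dictionary = ET.SubElement(pmml, "DataDictionary")
    for name in (feature, target):
        ET.SubElement(dictionary, "DataField", name=name,
                      optype="continuous", dataType="double")
    model = ET.SubElement(pmml, "RegressionModel", functionName="regression")
    schema = ET.SubElement(model, "MiningSchema")
    ET.SubElement(schema, "MiningField", name=feature)
    ET.SubElement(schema, "MiningField", name=target, usageType="target")
    table = ET.SubElement(model, "RegressionTable", intercept=repr(intercept))
    ET.SubElement(table, "NumericPredictor", name=feature,
                  coefficient=repr(slope))
    return ET.tostring(pmml, encoding="unicode")

def load_scorer(document):
    """Production side: rebuild a scoring function from the document alone."""
    table = ET.fromstring(document).find("./RegressionModel/RegressionTable")
    intercept = float(table.get("intercept"))
    coefficients = {p.get("name"): float(p.get("coefficient"))
                    for p in table.findall("NumericPredictor")}

    def score(row):
        return intercept + sum(c * row[name] for name, c in coefficients.items())

    return score

# Hypothetical parameters, as if produced by a training step elsewhere.
document = export_model(intercept=5.0, slope=10.0)
score = load_scorer(document)
print(score({"rain_hours": 1.0}))
```

The production code never imports anything from the development code. The document is the only contract between the two sides, which is what allows each side to choose its own technology stack.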
Which strategy to adopt depends on the context of specific use cases, as well as the readiness and maturity of each individual organisation. We’ve reached the end of this overview post. The next one will zoom in specifically on strategy #3 with a technical example. Stay tuned!
Read Part 2: Agile Data Science (Part 2): PMML