The big picture of MLOps – why does it matter, and where do you need it?
In recent years, we’ve seen a stream of academic breakthroughs in artificial intelligence and tremendous development on the industrial side. Companies have been embedding AI into their business models, but many struggle to make these models work: the speed of advancing technology and the uncertainties of the marketplace make it a challenging endeavor.
AI lives embedded within software development, so it has adopted many of software’s practices with varying degrees of success. Many AI developers shudder at hearing ‘AI’ and ‘agile’ in the same sentence, yet the two must coexist.
One term that has gained momentum is Machine Learning Ops (MLOps). Let’s dive into an introduction of MLOps, why it has appeared, and our approach to harnessing it.
What does MLOps look like in practice?
The development of AI has retained a strongly experimental, ‘lab’ culture. Many highly skilled practitioners come from academia, so it is common for data scientists to focus more on science and research than on delivery. Many organizations were not ready for large-scale integration, which led to piles of proofs of concept and failed scale-up attempts. A shortage of skilled people, among other factors, hampered AI’s success in their businesses, especially compared to technically advanced companies like Meta, Apple, Microsoft, Amazon, and Google’s parent company, Alphabet.
What we have learned from complex software and data-heavy developments is that there are some critical project elements:
Governance mechanisms become key because these models expose us to several kinds of risk; someone must be accountable and responsible, not least for the potential cost of heavy compute infrastructure running unchecked. Governance is about who has access to which resources: data, assets, models, and deployments.
Repeatability is critical to understanding the quality of any AI model. We can consider a model ‘under control’ only if we can regenerate it from the same conditions. Repeatability enables auditing and improvement of models and ensures that whatever runs in production got there through a traceable process, not a spur-of-the-moment decision. In the context of AI models, repeatability implies tracking data sets, experiment configurations, analyses, the code used, and so on.
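The ingredients of repeatability named above can be sketched in a few lines. This is a minimal, hypothetical illustration (the names `run_manifest` and `dataset_fingerprint` are not from any specific library): a repeatable run records a content hash of the data, the experiment configuration, the code version, and the random seed.

```python
import hashlib
import random

# Hypothetical sketch: capture everything needed to reproduce a training run.

def dataset_fingerprint(path: str) -> str:
    """Content hash of the training data, so the exact input can be verified later."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            h.update(chunk)
    return h.hexdigest()

def run_manifest(data_path: str, config: dict, code_version: str, seed: int) -> dict:
    """Bundle the four things repeatability requires: data, config, code, randomness."""
    random.seed(seed)  # fix the random state before any training happens
    return {
        "data_sha256": dataset_fingerprint(data_path),
        "config": config,
        "code_version": code_version,  # e.g. a git commit hash
        "seed": seed,
    }
```

Two runs that produce identical manifests started from identical conditions, which is exactly the auditability property described above. Real experiment trackers add storage and UI on top, but the core record is this small.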
Continuous process is one of the biggest lessons from the software world. The impact that CI/CD has had on software development, with its quality assurance, short iterations, versioning, and support for distributed teams, is invaluable for something as complex as AI models.
A confluence of technologies now powers the platform side of the MLOps transformation, in the same way that many CI/CD principles solidified into platforms and practices such as git flow, build managers, and code pipelines.
MLOps as a platform – enabling seamlessness at scale
In business, AI dwells within the data space – whether you call it BI, analytics, data platform, or big data. Several generations of data technologies have been developed to overcome the limitations of their predecessors. Still, it is not entirely accurate to describe these ‘generations’ as newer versions superseding older ones. It is better to describe them as speciations, akin to biological processes: a technological approach evolves to be superior within a bounded domain. It may get overextended temporarily, but it will eventually be pushed back to its niche environment.
When the map-reduce paradigm, known and developed mainly within the Apache Hadoop ecosystem, overcame the storage and processing limitations of SQL-based systems, it enabled batch processing of vast amounts of complex data, spurring the moniker and buzz around ‘big data.’ But batch processing was not a one-size-fits-all solution. Other approaches were developed to handle data streams, from messaging queues to micro-batches, most notably the Apache Spark ecosystem, the current king of the hill. Yet Spark is not suited to every use case either, and for AI platforms there is currently an explosion of approaches tackling their specific needs.
Most of the platforms developed for AI are rooted in at least one core concern of MLOps. There are a host of specific necessities within these platforms, and they aim to tackle the following:
- Data versioning and data lineage for source data, processing code, trained models, deployment code, and usage logs
- Storage in different modalities for source data used in training vs. data used in production
- Compute provisioning: permissions, costs, configurations, and environments
- Approval processes, particularly when triggering deployment to production of models, but also for some data processes
- Decoupling of different process streams: ingesting data and processing it into a reusable form, training models, evaluating them, tracking performance in production, and managing configurations across the board
- An experimentation sandbox that avoids constraining innovation while protecting critical systems
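The decoupling and lineage concerns in the list above share one underlying mechanism: stages communicate only through versioned artifacts. The sketch below is a hypothetical, in-memory illustration (the `ArtifactStore` class and stage layout are assumptions, not any product’s API) of how content-addressing gives lineage almost for free.

```python
import hashlib
import json

class ArtifactStore:
    """Minimal content-addressed store: each stage writes an artifact and
    downstream stages reference it by hash, so lineage is recorded by design."""
    def __init__(self):
        self._blobs = {}

    def put(self, obj) -> str:
        payload = json.dumps(obj, sort_keys=True).encode()
        key = hashlib.sha256(payload).hexdigest()
        self._blobs[key] = obj
        return key

    def get(self, key: str):
        return self._blobs[key]

store = ArtifactStore()

# Stage 1: ingestion produces a reusable dataset artifact.
raw_key = store.put({"rows": [[1.0, 2.0], [3.0, 4.0]]})

# Stage 2: training consumes the dataset by key and records its lineage.
dataset = store.get(raw_key)
model = {"weights": [sum(r) for r in dataset["rows"]], "trained_on": raw_key}
model_key = store.put(model)

# Stage 3: evaluation can trace any model back to its exact input data.
assert store.get(model_key)["trained_on"] == raw_key
```

Because stages exchange only keys, ingestion, training, and evaluation can run on different schedules, owned by different teams, without stepping on each other, which is the decoupling the bullet list describes.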
One salient feature of the MLOps world is the rise of a structure called the ‘feature store,’ which revives the data-warehouse idea of a single source of processed, valid data consumed downstream. In this case, the data does not need to be structured, but it does need to be an agreed-upon, reusable representation that can power any model built on the source data and that can be fed and enriched while in production.
Not all MLOps platforms include a feature store, but it has become a familiar pattern and, in many cases, a good practice.
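To make the single-source-of-truth idea concrete, here is a minimal sketch of the pattern. The `FeatureStore` class and the feature names are hypothetical, illustrating only the core contract: features are written once by a processing job and read identically by training and serving.

```python
class FeatureStore:
    """Hypothetical minimal feature store: one agreed-upon place where
    processed features live, shared by training and production serving."""
    def __init__(self):
        self._features = {}  # (entity_id, feature_name) -> value

    def write(self, entity_id: str, features: dict) -> None:
        """Register processed features for an entity (e.g. a customer)."""
        for name, value in features.items():
            self._features[(entity_id, name)] = value

    def read(self, entity_id: str, names: list) -> list:
        """Fetch the same feature vector for training or online serving."""
        return [self._features[(entity_id, n)] for n in names]

store = FeatureStore()
# A batch job writes enriched features once...
store.write("customer_42", {"avg_order_value": 37.5, "orders_90d": 4})
# ...and both the training pipeline and the live model read the same values.
vector = store.read("customer_42", ["avg_order_value", "orders_90d"])
```

Production systems such as Feast add time-travel, online/offline stores, and schemas on top, but the value proposition is this shared read path: training and serving can no longer drift apart on how a feature is computed.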
The benefit of a good MLOps platform is that it lets companies deploy specialized talent across the platform’s different dimensions. It removes the need for a ‘unicorn’ hire who could single-handedly cover data engineering, cloud engineering, data science, data architecture, DevOps, and many other disciplines. A well-rounded profile is a good thing, better still with T-shaped knowledge, but imagine how difficult and expensive it would be to find one expert in everything. It is more than most companies can handle.
We can summarize the MLOps approach as shifting requirements from people to platforms: enabling fast iteration cycles by decoupled teams of experts on a modern data platform suited to developing and serving evolving AI applications in a governable manner. The field is undergoing a Cambrian explosion of alternatives, and its final form is still taking shape, but these platforms represent a base level of maturity for any company serious about leveraging AI in the real world.