While the lifecycle models discussed so far in this wiki are important and helpful for practitioners, at the end of the day "no plan survives contact with the enemy." To make sure this guide truly reflects a practitioner's perspective on lifecycle management, we asked the partners of the appliedAI initiative to share their views on the challenges, both technical and non-technical, of ML lifecycle management. We have distilled the insights we received from our partners, enriched them with our own experience collected over the last years, and organized them in a way that preserves the qualitative nature of the feedback. We have also mapped the suggested best practices to these challenges so that other companies can learn from those who have already solved key issues.
Best practices for ML lifecycle management from the industry
Here we provide a list of recommendations and best practices for managing the ML lifecycle, collected from technology partners of appliedAI. The list is not ordered by importance and will be refined over time:
- Have a viable business case sponsored by a business unit that is willing to take responsibility for human oversight for many years after the solution has gone to production.
- Do not develop every PoC: say "no" to cases where, after a short but high-quality investigation, no viable business case can be made. Experiment, but quickly. Usually, if after four weeks there is no model embedded in a PoC application that has been successfully tested as part of a real business process, it is not worthwhile to continue. On the other hand, if a PoC is viable (and you will know soon enough), move it to production-level code and improve it iteratively. Let the benefits of, e.g., CI/CD work for you (see the CI test sketch after this list).
- Do not use AI when it is not needed, just because it sounds fancy.
- The process does not end with deploying to production; in fact, this is where the real and interesting work starts: monitoring, oversight, and automation of AI are key focus areas and must already be considered during the feasibility study (see the monitoring sketch after this list).
- Customer relationships: Get customers to commit to projects before putting too much work into them.
- Expectation management: In AI and ML projects, break the problem down into subtasks in advance. Setting overly ambitious goals is not helpful; instead, manage expectations actively to counteract exaggerated hopes of rapid benefits from AI projects.
- Data landscape/strategy and governance: Productive AI needs a solid data foundation, especially for products in heavily regulated environments, in multi-tenant scenarios, and when you want to incorporate additional aspects of your data into your AI applications.
- Own your data: Any productive AI that is to be continuously maintained and updated must rely on well-maintained and governed data.
- Centralized data platform: Have a centralized platform for all enterprise-wide structured and unstructured data. The latter requires supporting technologies (e.g., Search).
- Centralized use case dashboard: Combine multiple use cases in one platform that provides the AI services out of the box, without custom development in Python and R.
- Scaling and success: These do not depend on the technical aspects alone; non-technical process changes, user acceptance, and business considerations matter a lot and must be considered early in the lifecycle.
- Transparency: Keep the flow of information high so that all stakeholders are consistently informed, especially during the preparation and modeling phases, which can be quite opaque to many stakeholders.
- Trustworthiness: Apply trustworthy AI principles at all lifecycle stages.
- Perfect reproducibility is almost impossible to ensure: results of model training and even of inference can vary, depending in particular on the hardware, despite fixing random seeds and choosing deterministic operations (see the seeding sketch after this list).
- Support culture shifts: In classical semiconductor projects, the cost of developing prototypes is prohibitive, so an "iterative" approach only works when a single iteration suffices. This is obviously different for ML projects, and the lifecycle has to change accordingly.
- The lifecycle can differ by domain: it is best to have a good general format that each team can then adapt slightly to fit its needs (e.g., improving processes in a manufacturing environment is quite different from improving processes in a sales and marketing environment; the same applies to online vs. offline models, among other things).
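
The CI test sketch referenced above: one kind of automated gate that CI/CD can run on every change once a PoC graduates to production-level code. This is a minimal, self-contained illustration, not a prescription: the synthetic dataset stands in for a project's real holdout set, and the 0.85 accuracy floor is a hypothetical value agreed with the sponsoring business unit.

```python
# Minimal CI regression gate: fail the pipeline if a change degrades model
# quality below an agreed floor. The synthetic data and the 0.85 threshold
# are placeholders; a real pipeline would load the project's actual model
# and holdout set instead.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

MIN_ACCURACY = 0.85  # hypothetical floor agreed with the business unit


def test_model_meets_accuracy_floor():
    # Synthetic stand-in for the real training and holdout data.
    X, y = make_classification(n_samples=2_000, class_sep=2.0, random_state=0)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=0
    )
    model = LogisticRegression(max_iter=1_000).fit(X_train, y_train)
    accuracy = model.score(X_test, y_test)
    assert accuracy >= MIN_ACCURACY, (
        f"Accuracy {accuracy:.3f} fell below the agreed floor {MIN_ACCURACY}"
    )
```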
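
The monitoring sketch referenced above covers one narrow slice of the work that starts after deployment: detecting input drift for a single numeric feature by comparing production values against a training-time baseline with a two-sample Kolmogorov-Smirnov test. The function name, the p-value threshold, and the simulated drift are assumptions for illustration; real setups typically feed such checks into a scheduler, dashboard, and alerting system.

```python
# A minimal sketch of input-drift detection for one numeric feature.
# The function name, threshold, and alerting hook are hypothetical.
import numpy as np
from scipy.stats import ks_2samp


def check_feature_drift(baseline: np.ndarray,
                        production: np.ndarray,
                        p_threshold: float = 0.01) -> bool:
    """Return True if the production distribution of a single numeric
    feature differs significantly from the training baseline."""
    # Two-sample Kolmogorov-Smirnov test: a small p-value indicates the
    # two samples are unlikely to come from the same distribution.
    statistic, p_value = ks_2samp(baseline, production)
    return p_value < p_threshold


if __name__ == "__main__":
    rng = np.random.default_rng(seed=0)
    baseline = rng.normal(loc=0.0, scale=1.0, size=10_000)  # training data
    shifted = rng.normal(loc=0.5, scale=1.0, size=10_000)   # drifted inputs

    if check_feature_drift(baseline, shifted):
        # In a real system this would raise an alert or open a ticket
        # for the business unit that owns the model.
        print("Drift detected: trigger review / retraining pipeline")
```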
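
The seeding sketch referenced above, assuming PyTorch as the framework: these are the standard measures for pinning down randomness. Even with all of them in place, results can still differ across GPU models, driver versions, and library builds, which is exactly the caveat made in the bullet.

```python
# The usual seeding and determinism setup in PyTorch. Even with all of
# this, results can still vary across hardware and library versions.
import os
import random

import numpy as np
import torch


def make_deterministic(seed: int = 42) -> None:
    random.seed(seed)        # Python's built-in RNG
    np.random.seed(seed)     # NumPy RNG
    torch.manual_seed(seed)  # PyTorch CPU and CUDA RNGs
    # Force deterministic implementations where available; ops without a
    # deterministic variant will raise an error instead of silently varying.
    torch.use_deterministic_algorithms(True)
    # cuDNN: disable auto-tuning of convolution algorithms and request
    # deterministic ones.
    torch.backends.cudnn.benchmark = False
    torch.backends.cudnn.deterministic = True
    # Required by some CUDA ops when deterministic algorithms are enforced;
    # ideally set before any CUDA work happens.
    os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"


make_deterministic()
```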